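I need to identify all loci in a set of short reads where the number of microsatellite repeats (i.e., the copy number of "AA", "GTC", etc.) differs from the reference genome, as well as the loci where the reference genome and the short reads have the same number of repeats.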

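I used the Burrows-Wheeler Aligner (BWA mem) to map high-coverage short reads (obtained from NCBI's short read archive) to the reference genome. The output is in .sam format. I also used a separate program to determine where microsatellites occur in the reference genome.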
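
For illustration only, a reference scan like the one mentioned above can be sketched with regular expressions. This is not the program actually used; the function names `find_microsatellites` and `is_primitive`, the 1-6 bp motif range, and the three-copy minimum are all assumptions chosen for the example.

```python
import re


def is_primitive(unit):
    """True if the motif is not itself a repeat of a shorter motif (e.g. 'ATAT' is not)."""
    n = len(unit)
    return not any(n % k == 0 and unit[:k] * (n // k) == unit for k in range(1, n))


def find_microsatellites(seq, max_unit=6, min_copies=3):
    """Return sorted (start, motif, copies) tuples for perfect tandem repeats.

    start is 0-based. Only exact repeats are reported; interrupted or
    imperfect repeats are ignored in this sketch.
    """
    seq = seq.upper()
    hits = []
    for k in range(1, max_unit + 1):
        # A k-mer followed by at least (min_copies - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            if is_primitive(unit):  # skip e.g. 'AA' when 'A' already covers it
                hits.append((m.start(), unit, (m.end() - m.start()) // k))
    return sorted(hits)


if __name__ == "__main__":
    reference = "AAGTCGTCGTCGTCTTAGAGAGAGC"  # made-up toy sequence
    for start, unit, copies in find_microsatellites(reference):
        print(start, unit, copies)
```

The scan matches left to right, so a repeat is reported under whichever rotation of the motif appears first in the sequence.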

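I want to identify all loci where the microsatellite length in the short reads differs from the reference genome at that locus. Does anyone know of a tool or package that can read the .sam/.bam files of short reads mapped to a reference genome and identify the specific loci where the short reads differ from the reference? I am using RStudio and have access to my university's supercomputer cluster.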




I haven't done any whole-genome STR analysis from NGS data myself, but I am aware of others who have used lobSTR for this. There's also a recent paper [here] that compares a few different STR analysis packages (RepeatSeq, lobSTR, HipSTR, and GangSTR). Here's the concluding paragraph:

In conclusion, all these tools are built to genotype STRs but have different strengths and weaknesses. Based on our analysis there is no clear overall winner. RepeatSeq and HipSTR are the best when considering genotyping error rate even with low coverage. On the other hand, GangSTR has an advantage because it is the only tool among them that can call alleles longer than the read length but shows higher error rate unless looking at only the enclosed class of reads, which in turn would lose the GangSTR's advantage of picking up long genotypes. In addition, GangSTR is the newest tool and so comes with reference files for different reference builds that are periodically updated according to the tool's webpage. The correct choice of a tool and the subsequent filtering depends on the aim of the analysis, and might be influenced by available hardware resources and time limit for running tools.

In other words, without more information about the specific problem you're trying to solve, it's not clear which tool is best (this is a common issue with bioinformatics problems).
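
To make the underlying comparison concrete, here is a minimal sketch of the idea of counting repeat copies in reads that fully span a locus (the "enclosed" reads mentioned in the quote) and comparing the count with the reference. It is not the algorithm of any of the tools named above. The function names, the flank sequences, and the toy reads are all hypothetical, and it assumes the read sequences have already been extracted from the alignment. It also picks only the single most common count, so heterozygous loci, sequencing errors, and PCR stutter are ignored.

```python
import re
from collections import Counter


def count_repeats_in_read(read, left_flank, right_flank, unit):
    """Copies of `unit` between the two flanks, or None if the read
    does not contain both flanks around an uninterrupted repeat."""
    pattern = re.compile(
        re.escape(left_flank) + "((?:" + re.escape(unit) + ")*)" + re.escape(right_flank)
    )
    m = pattern.search(read.upper())
    if m is None:
        return None
    return len(m.group(1)) // len(unit)


def compare_locus(reads, left_flank, right_flank, unit, ref_copies):
    """Summarise the most common repeat count among spanning reads."""
    counts = Counter()
    for read in reads:
        n = count_repeats_in_read(read, left_flank, right_flank, unit)
        if n is not None:
            counts[n] += 1
    if not counts:
        return None  # no read spans the whole repeat
    observed, support = counts.most_common(1)[0]
    return {"observed": observed, "support": support, "differs": observed != ref_copies}


if __name__ == "__main__":
    # Toy reads; the reference is assumed to carry 4 copies of GTC here.
    reads = [
        "TTACG" + "GTC" * 5 + "AATGC",
        "ACG" + "GTC" * 5 + "AATG",
        "ACG" + "GTC" * 4 + "AATG",
        "GTCGTCGTC",  # lies inside the repeat, so it is skipped
    ]
    print(compare_locus(reads, "ACG", "AAT", "GTC", ref_copies=4))
```

Requiring both flanks in the same read is what limits this approach to alleles shorter than the read length, which is the trade-off the quoted paragraph describes.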