Comparing gene abundance between metagenomes



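My workflow so far:

Find fragments of marker genes in unassembled metagenomes > download and assemble the metagenomes > recover the gene neighborhoods/genomes of interest

Now, from the "depth" reported on the assembled contigs (I use MEGAHIT for assembly), I have a rough estimate of how abundant these genes are. I would like to know whether there is a more thorough/correct way to do this. I want to compare the abundance of specific genes between a) samples from the same study and b) samples from different studies. I assume the size of each individual metagenome should be taken into account in both cases, but point b) may add further difficulties, such as different sequencing technologies. Your insights would be much appreciated.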

1

I would avoid using assemblies to answer this question, as there's no guarantee that you will be able to assemble your genes of interest; you can however estimate their abundance even if they are relatively rare.

I understand your question as being one of estimating the abundance of either some specific genes (e.g. butyrate metabolism genes) or all genes in a microbial community across multiple samples for comparative purposes. In other words, it is not 16S or marker gene analysis for the purpose of estimating organismal abundance, which is a rather different problem (though in that case I would still not use an assembly).

A more standard workflow is:

  1. Align metagenomic reads against an existing database of appropriately annotated genes.
  2. Count the reads aligning to each gene or ortholog group (using, for example, KEGG Orthology or a similar scheme).
  3. Use the counts from (2) as input to some statistical procedure, possibly summarizing across functional categories.
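To make steps 2 and 3 a bit more concrete, here is a minimal Python sketch. It assumes you already have per-gene read counts from some aligner. The function names, the gene names, and the `KO_1`/`KO_2` labels are hypothetical placeholders. The counts-per-million scaling is my own assumption, just one simple way to account for differing metagenome sizes; it is not a recommendation from any particular tool or paper, and it does nothing about differences in sequencing technology between studies.

```python
from collections import defaultdict


def summarize_counts(read_hits, gene_to_ortholog):
    """Sum per-gene read counts into per-ortholog counts (step 2).

    read_hits: dict mapping gene ID -> number of aligned reads.
    gene_to_ortholog: dict mapping gene ID -> ortholog label.
    Genes without an ortholog annotation are skipped.
    """
    totals = defaultdict(int)
    for gene, n_reads in read_hits.items():
        ortholog = gene_to_ortholog.get(gene)
        if ortholog is not None:
            totals[ortholog] += n_reads
    return dict(totals)


def counts_per_million(ortholog_counts, total_reads):
    """Scale counts by the sample's total read count (assumed normalization)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return {k: v * 1_000_000 / total_reads for k, v in ortholog_counts.items()}


# Hypothetical example data for one sample.
hits = {"geneA": 30, "geneB": 20, "geneC": 5, "geneD": 7}
annotation = {"geneA": "KO_1", "geneB": "KO_1", "geneC": "KO_2"}  # geneD unannotated

ko_counts = summarize_counts(hits, annotation)
cpm = counts_per_million(ko_counts, total_reads=1_000_000)
print(ko_counts)
print(cpm)
```

For a formal comparison you would normally pass the raw per-ortholog counts (not the scaled values) to whatever statistical procedure you choose in step 3, since many count-based methods do their own normalization.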

Some examples of how this has been done are here, here, here. I am sure that there are more recent/relevant references but I haven't been following the field closely in the last few years.