How do I create a list of differentially expressed (DE) genes after RUVSeq normalization?


1
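I am using edgeR to run a differential expression (DE) analysis on a set of RNA-seq samples (2 controls; 8 treated). To correct for batch effects, I am using RUVSeq.

I can get a list of DE genes without normalization: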

x <- as.factor(rep(c("Ctl","Inf"),c(2,8)))
set <- newSeqExpressionSet(as.matrix(counttable),phenoData=data.frame(x,row.names=colnames(counttable)))
design <- model.matrix(~x, data=pData(set))
y <- DGEList(counts=counts(set), group=x)
y <- calcNormFactors(y, method="upperquartile")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=2)
top <- topTags(lrt, n=nrow(set))$table
write.table(top, paste(OUT, "DE_genelist.txt", sep=""))

Then, right after creating the "top" object, I normalize with RUVg:

# [...]
top <- topTags(lrt, n=nrow(set))$table
empirical <- rownames(set)[which(!(rownames(set) %in% rownames(top)[1:5000]))]
ruvg <- RUVg(set, empirical, k=1)
write.table(ruvg, paste(OUT, "DE_RUVg_genelist.txt", sep=""))

I get this error:

Error in as.data.frame.default(x[[i]], optional = TRUE) : 
  cannot coerce class ‘structure("SeqExpressionSet", package = "EDASeq")’ to a data.frame

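I am not sure how to print the list of normalized results the way I do for the non-normalized data. Ideally, I would get a file in the same format as the edgeR output (as a .csv or .txt file):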

"logFC" "logCPM" "LR" "PValue" "FDR"
"COBLL1" -2.150 4.427061248733 75.0739519350016 4.53408921348828e-18 9.51203608115384e-15
"UBE2D1" -2.178 3.577168782408 74.9346752854903 4.86549160161322e-18 9.51203608115384e-15
"NEK7" -2.404 4.020072739285 72.6539117671717 1.54500340443843e-17 2.71843349010941e-14
"SMC6" -2.300 5.674738981329 61.8130019860261 3.7767230643666e-15 3.4974443325016e-12

How do I get a gene list as output after normalizing with RUVSeq?

0

I have not used this package, but from your code it seems that ruvg is not a table. It is an R object, which means you cannot pass it straight to write.table. I think the results you want are stored inside the object. S4 objects like this one keep their data in "slots", which can be accessed with @. If I were you, I would check which data slots the object contains and pull out the one I need.
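As a rough illustration of what inspecting slots looks like, here is a minimal sketch with a made-up S4 class. `ToyResult` and its slots are hypothetical stand-ins, not part of RUVSeq or EDASeq; only `setClass`, `new`, `slotNames` and `@` are standard R.

```r
library(methods)

# Hypothetical S4 class standing in for a package-defined object.
setClass("ToyResult", representation(counts = "matrix", note = "character"))

obj <- new("ToyResult",
           counts = matrix(1:4, nrow = 2,
                           dimnames = list(c("geneA", "geneB"), c("s1", "s2"))),
           note = "toy data")

# List the slots the object contains.
slots <- slotNames(obj)
print(slots)

# Pull one slot out with @; a matrix slot can be turned into a data frame
# and then written with write.table like any other table.
counts_df <- as.data.frame(obj@counts)
write.table(counts_df, file = tempfile(fileext = ".txt"),
            sep = "\t", quote = FALSE)
```

When a package offers accessor functions for its objects, those are usually a safer choice than reaching into slots with @ directly.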


1

You need to do the normalization before running edgeR. RUVg is short for "Remove Unwanted Variation Using Control Genes". In your code, you ran edgeR first and then normalized the data with RUVg, which only returns the normalized counts, not a table of DE results.

Using the example dataset from the vignette:

library(RUVSeq)
library(zebrafishRNASeq)
data(zfGenes)
filter <- apply(zfGenes, 1, function(x) length(x[x>5])>=2)
filtered <- zfGenes[filter,]
genes <- rownames(filtered)[grep("^ENS", rownames(filtered))]
spikes <- rownames(filtered)[grep("^ERCC", rownames(filtered))]

x <- as.factor(rep(c("Ctl", "Trt"), each=3))
set <- newSeqExpressionSet(as.matrix(filtered),
                           phenoData = data.frame(x, row.names=colnames(filtered)))
set <- betweenLaneNormalization(set, which="upper")

set1 <- RUVg(set, spikes, k=1)

If you look at it, you will see it is an expression set with counts etc., not DE results:

set1
SeqExpressionSet (storageMode: lockedEnvironment)
assayData: 20865 features, 6 samples 
  element names: counts, normalizedCounts, offset 
protocolData: none
phenoData
  sampleNames: Ctl1 Ctl3 ... Trt13 (6 total)
  varLabels: x W_1
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation:  

Now you run edgeR on the output of RUVg, adding the estimated factor W_1 to the design:

design <- model.matrix(~x + W_1, data=pData(set1))
y <- DGEList(counts=counts(set1), group=x)
y <- calcNormFactors(y, method="upperquartile")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=2)
topTags(lrt)
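To end up with a file like the one in the question, the full table can be written out the same way as before. This is a minimal sketch: a small hand-made data frame (values copied from the example output in the question) stands in for `topTags(lrt, n = nrow(set1))$table`, and `write_de_table` is a hypothetical helper name, not an edgeR function.

```r
# Hypothetical helper: write a DE table (genes as row names) to a
# tab-separated text file.
write_de_table <- function(tab, path) {
  # col.names = NA puts an empty header cell above the row names,
  # so the header lines up with the data columns.
  write.table(tab, file = path, sep = "\t", quote = FALSE, col.names = NA)
  invisible(path)
}

# Stand-in for topTags(lrt, n = nrow(set1))$table
top <- data.frame(
  logFC  = c(-2.150, -2.178),
  logCPM = c(4.427061248733, 3.577168782408),
  LR     = c(75.0739519350016, 74.9346752854903),
  PValue = c(4.53408921348828e-18, 4.86549160161322e-18),
  FDR    = c(9.51203608115384e-15, 9.51203608115384e-15),
  row.names = c("COBLL1", "UBE2D1")
)

out_file <- write_de_table(top, tempfile(fileext = ".txt"))

# Read the file back to check that gene names and columns survive.
back <- read.delim(out_file, row.names = 1, check.names = FALSE)
```

With the real analysis, you would pass the table from `topTags(lrt, n = nrow(set1))$table` and a path of your choice instead of the toy data and the temporary file.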