在R中,當glms具有負二項式分佈時,方差分析(glm1,glm2,test =" Chisq")告訴我的p值是多少?


2

已經在R中構造了兩個嵌套的negbin glms,一個具有下降因子,我們可以用test = "Chisq"進行anova來查看模型之間是否存在顯著差異。但是Pvalue實際告訴我什麼?

anova(mnegbin1, mnegbin2, test = "Chisq")
Likelihood ratio tests of Negative Binomial Models

Response: count
                        Model    theta Resid. df    2 x log-lik.   Test    df      LR stat. Pr(Chi)
1 origin + substrate + sample 18.46502      1232       -2459.886                                   
2          substrate + sample 18.46502      1232       -2459.886 1 vs 2     0 -1.500666e-11       1
2

When you fit a model, you need to check whether it's overly determined, meaning some terms cannot be determined. We can look at one of your samples, for example only for RB5 and Nocardia_nova_SH22a_RB20_RB56:

df = structure(list(X = c(226L, 227L, 228L, 229L, 230L, 231L, 232L, 
233L, 234L, 235L, 236L, 237L, 238L, 239L, 240L, 241L, 242L, 243L, 
244L, 245L, 246L, 247L, 248L, 249L, 250L, 251L, 252L, 253L, 254L, 
255L, 256L, 257L, 258L, 259L, 260L, 261L, 262L, 263L, 264L, 265L, 
266L, 267L, 268L, 269L, 270L, 1126L, 1127L, 1128L, 1129L, 1130L, 
1131L, 1132L, 1133L, 1134L, 1135L, 1136L, 1137L, 1138L, 1139L, 
1140L, 1141L, 1142L, 1143L, 1144L, 1145L, 1146L, 1147L, 1148L, 
1149L, 1150L, 1151L, 1152L, 1153L, 1154L, 1155L, 1156L, 1157L, 
1158L, 1159L, 1160L, 1161L, 1162L, 1163L, 1164L, 1165L, 1166L, 
1167L, 1168L, 1169L, 1170L), sample = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Nocardia_nova_SH22a_RB20_RB56", 
"RB56"), class = "factor"), substrate = structure(c(10L, 37L, 
29L, 11L, 24L, 9L, 39L, 30L, 42L, 14L, 21L, 40L, 38L, 45L, 28L, 
5L, 22L, 3L, 41L, 16L, 32L, 27L, 18L, 26L, 20L, 34L, 7L, 12L, 
23L, 17L, 44L, 35L, 4L, 19L, 25L, 13L, 1L, 33L, 31L, 43L, 36L, 
6L, 2L, 8L, 15L, 10L, 37L, 29L, 11L, 24L, 9L, 39L, 30L, 42L, 
14L, 21L, 40L, 38L, 45L, 28L, 5L, 22L, 3L, 41L, 16L, 32L, 27L, 
18L, 26L, 20L, 34L, 7L, 12L, 23L, 17L, 44L, 35L, 4L, 19L, 25L, 
13L, 1L, 33L, 31L, 43L, 36L, 6L, 2L, 8L, 15L), .Label = c("acylsphingosine", 
"agarose", "alcohol", "alginate", "alpha-glucan", "amylaceous polysaccharides", 
"arabinogalactans", "aribinose", "beta-glucan", "cellulose", 
"chitin", "cutin", "cyclomaltodextrin", "D-galactose", "D-glucan", 
"D-glucose", "dermatan sulfate", "dextran", "fructans", "fructose", 
"galactan", "galactose", "galacturonide", "glucan", "glucose", 
"glucuronoside", "glycan", "glycans", "glycogen", "hemicellulose", 
"hyaluronate", "L-rhamnose", "lacto-N-tetraose", "levan", "lignin", 
"maltose", "mannan", "mannose", "pectin", "peptidoglycan", "pullulan", 
"trehalose", "Vanillyl-alcohol", "xanthan", "xylose"), class = "factor"), 
    count = c(2L, 0L, 2L, 2L, 0L, 1L, 2L, 2L, 3L, 0L, 0L, 1L, 
    0L, 0L, 0L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 3L, 1L, 3L, 3L, 0L, 3L, 3L, 5L, 2L, 0L, 0L, 4L, 
    0L, 0L, 1L, 2L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 
    0L, 0L, 0L), origin = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("free", "isolate"
    ), class = "factor")), row.names = c(226L, 227L, 228L, 229L, 
230L, 231L, 232L, 233L, 234L, 235L, 236L, 237L, 238L, 239L, 240L, 
241L, 242L, 243L, 244L, 245L, 246L, 247L, 248L, 249L, 250L, 251L, 
252L, 253L, 254L, 255L, 256L, 257L, 258L, 259L, 260L, 261L, 262L, 
263L, 264L, 265L, 266L, 267L, 268L, 269L, 270L, 1126L, 1127L, 
1128L, 1129L, 1130L, 1131L, 1132L, 1133L, 1134L, 1135L, 1136L, 
1137L, 1138L, 1139L, 1140L, 1141L, 1142L, 1143L, 1144L, 1145L, 
1146L, 1147L, 1148L, 1149L, 1150L, 1151L, 1152L, 1153L, 1154L, 
1155L, 1156L, 1157L, 1158L, 1159L, 1160L, 1161L, 1162L, 1163L, 
1164L, 1165L, 1166L, 1167L, 1168L, 1169L, 1170L), class = "data.frame")

Now we look at whether how it is distributed:

df$sample = droplevels(df$sample)
table(df$origin,df$sample)
         
          Nocardia_nova_SH22a_RB20_RB56 RB56
  free                               45    0
  isolate                             0   45

You can see, the samples are either free or isolate, and never both. So you cannot discern the effect of origin from samples. This means your model is over-determined and hence you see the error above.

To know the effect of origin, you have to fit, i used a poisson for this example because if i use only the subset above, the counts are not overdispersed hence glm.nb throws some error:

fit = glm(count~ origin+substrate,data=df,family=poisson)
summary(fit)

 head(coefficients(summary(fit)))
                           Estimate   Std. Error       z value   Pr(>|z|)
(Intercept)           -2.266968e+01 2.930556e+04 -7.735624e-04 0.99938279
originisolate          6.931472e-01 2.970443e-01  2.333481e+00 0.01962291
substrateagarose       2.282092e-06 4.144430e+04  5.506408e-11 1.00000000
substratealcohol       2.280854e-06 4.144430e+04  5.503420e-11 1.00000000
substratealginate      2.278865e-06 4.144430e+04  5.498621e-11 1.00000000
substratealpha-glucan  2.295736e+01 2.930556e+04  7.833790e-04 0.99937495

You can see that in this model, free is the reference, and isolate is compared to free in the "originisolate". The coefficient 6.931472e-01 means isolate samples are higher than free samples, by exp(6.931472e-01) times.