Statistical methods for data where only a minimum/maximum value is known


30

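Is there a branch of statistics that deals with data for which the exact values are not known, but for each individual we know either a maximum or a minimum bound on the value?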

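I suspect that my problem largely stems from the fact that I am struggling to articulate it in statistical terms, but hopefully an example will help to clarify: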

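Say there are two connected populations $A$ and $B$ such that, at some point, members of $A$ may "transition" into $B$, but the reverse is not possible. The timing of the transition is variable, but not random. For example, $A$ could be "individuals with no offspring" and $B$ could be "individuals with at least one offspring". I am interested in the age at which this progression occurs, but I only have cross-sectional data. For any given individual, I can find out whether they belong to $A$ or $B$. I also know the age of these individuals. For each individual in population $A$, I know that the age at transition will be greater than their current age. Likewise, for members of $B$, I know that the age at transition was less than their current age. But I do not know the exact values.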

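Say I have some other factor that I want to compare with the age at transition. For example, I want to know whether an individual's subspecies or body size affects the age at first offspring. I definitely have some useful information that should inform those questions: on average, among the individuals in $A$, older individuals will have a later transition. But the information is imperfect, particularly for younger individuals. The reverse holds for population $B$.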

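Are there established methods for dealing with this sort of data? I do not necessarily need a full method for carrying out such an analysis, just some search terms or useful resources to start me off in the right place!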

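Caveats: I am making the simplifying assumption that the transition from $A$ to $B$ is instantaneous. I am also prepared to assume that most individuals will at some point progress to $B$, provided they live long enough. And I realise that longitudinal data would be very helpful, but assume that it is not available in this case.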

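Apologies if this is a duplicate; as I said, part of my problem is that I do not know what I should be searching for. For the same reason, please add other tags if appropriate.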

Sample dataset: ssp indicates one of two subspecies, $X$ or $Y$. offsp indicates either no offspring ($A$) or at least one offspring ($B$).

 age ssp offsp
  21   Y     A
  20   Y     B
  26   X     B
  33   X     B
  33   X     A
  24   X     B
  34   Y     B
  22   Y     B
  10   Y     B
  20   Y     A
  44   X     B
  18   Y     A
  11   Y     B
  27   X     A
  31   X     B
  14   Y     B
  41   X     B
  15   Y     A
  33   X     B
  24   X     B
  11   Y     A
  28   X     A
  22   X     B
  16   Y     A
  16   Y     B
  24   Y     B
  20   Y     B
  18   X     B
  21   Y     B
  16   Y     B
  24   Y     A
  39   X     B
  13   Y     A
  10   Y     B
  18   Y     A
  16   Y     A
  21   X     A
  26   X     B
  11   Y     A
  40   X     B
   8   Y     A
  41   X     B
  29   X     B
  53   X     B
  34   X     B
  34   X     B
  15   Y     A
  40   X     B
  30   X     A
  40   X     B

EDIT: the example dataset was changed, as it was not very representative.

4

This is a case of censoring/coarse data. Assume you think that your data arises from a distribution with nicely behaved continuous (etc.) pdf $f(x)$ and cdf $F(x)$. The standard solution for time to event data when the exact time $x_i$ of an event for subject $i$ is known is that the likelihood contribution is $f(x_i)$. If we only know that the time was greater than $y_i$ (right-censoring), then the likelihood contribution is $1-F(y_i)$ under the assumption of independent censoring. If we know that the time is less than $z_i$ (left-censoring), then the likelihood contribution is $F(z_i)$. Finally, if the time falls into some interval $(y_i, z_i]$, then the likelihood contribution would be $F(z_i)-F(y_i)$.
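
As a rough sketch of how these contributions combine in the cross-sectional setting of the question, where every observation is either left- or right-censored at the observed age, the log-likelihood can be written and maximized directly in base R. The Weibull distribution, the simulated data, and the function name `loglik_current_status` are assumptions made purely for illustration; nothing here is specific to the question's data.

```r
# Log-likelihood for data where each subject is either left-censored
# (event already happened by the observed age) or right-censored
# (event not yet happened). A Weibull event-time distribution is ASSUMED.
loglik_current_status <- function(par, age, event_happened) {
  shape <- exp(par[1])  # parameters are optimized on the log scale
  scale <- exp(par[2])
  # Left-censored contribution: log F(age)
  ll_left <- pweibull(age, shape, scale, log.p = TRUE)
  # Right-censored contribution: log(1 - F(age))
  ll_right <- pweibull(age, shape, scale, lower.tail = FALSE, log.p = TRUE)
  sum(ifelse(event_happened, ll_left, ll_right))
}

# Simulated illustration (hypothetical data, not from the question)
set.seed(1)
true_time <- rweibull(200, shape = 3, scale = 25)  # unobserved event times
age <- runif(200, 5, 50)                           # observed ages
event_happened <- true_time <= age                 # all that is observed

# Maximize the log-likelihood by minimizing its negative
fit <- optim(c(0, log(20)),
             function(p) -loglik_current_status(p, age, event_happened))
exp(fit$par)  # estimated shape and scale
```

Exactly observed or interval-censored subjects would add $\log f(x_i)$ or $\log\{F(z_i) - F(y_i)\}$ terms to the same sum.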


4

This problem seems like it might be handled well by logistic regression.

You have two states, A and B, and want to examine the probability of whether a particular individual has switched irreversibly from state A to state B. One fundamental predictor variable would be age at the time of observation. The other factor or factors of interest would be additional predictor variables.

Your logistic model would then use the actual observations of A/B state, age, and other factors to estimate the probability of being in state B as a function of those predictors. The age at which that probability passes 0.5 could be used as the estimate of the transition time, and you would then examine the influences of the other factor(s) on that predicted transition time.
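
A minimal sketch of that idea in R, using the sample dataset posted in the question. Entering age untransformed and subspecies as a simple additive term are assumptions made for illustration, and the variable names `in_B`, `age50_X` and `age50_Y` are hypothetical.

```r
# Sample data copied from the question (age, subspecies, offspring state).
d <- read.table(header = TRUE, text = "
age ssp offsp
21 Y A
20 Y B
26 X B
33 X B
33 X A
24 X B
34 Y B
22 Y B
10 Y B
20 Y A
44 X B
18 Y A
11 Y B
27 X A
31 X B
14 Y B
41 X B
15 Y A
33 X B
24 X B
11 Y A
28 X A
22 X B
16 Y A
16 Y B
24 Y B
20 Y B
18 X B
21 Y B
16 Y B
24 Y A
39 X B
13 Y A
10 Y B
18 Y A
16 Y A
21 X A
26 X B
11 Y A
40 X B
8 Y A
41 X B
29 X B
53 X B
34 X B
34 X B
15 Y A
40 X B
30 X A
40 X B
")

# Response: 1 if the individual has already transitioned to state B.
d$in_B <- as.integer(d$offsp == "B")

# Logistic regression on untransformed age (an assumption) plus subspecies.
fit <- glm(in_B ~ age + ssp, data = d, family = binomial)
b <- coef(fit)

# Age at which the fitted P(state B) crosses 0.5, per subspecies:
# solve b0 + b_age * age (+ b_sspY for subspecies Y) = 0 for age.
age50_X <- -b[["(Intercept)"]] / b[["age"]]
age50_Y <- -(b[["(Intercept)"]] + b[["sspY"]]) / b[["age"]]
c(X = age50_X, Y = age50_Y)
```

The difference between the two crossing ages is then driven entirely by the subspecies coefficient, which is what one would inspect to judge the influence of that factor.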

Added in response to discussion:

As with any linear model, you need to ensure that your predictors are transformed in a way that they bear a linear relation to the outcome variable, in this case the log-odds of the probability of having moved to state B. That is not necessarily a trivial problem. The answer by @CliffAB shows how a log transformation of the age variable might be used.


27

This is referred to as current status data. You get one cross sectional view of the data, and regarding the response, all you know is that at the observed age of each subject, the event (in your case: transitioning from A to B) has happened or not. This is a special case of interval censoring.

To formally define it, let $T_i$ be the (unobserved) true event time for subject $i$. Let $C_i$ be the inspection time for subject $i$ (in your case: age at inspection). If $C_i < T_i$, the data are right censored. Otherwise, the data are left censored. We are interested in modeling the distribution of $T$. For regression models, we are interested in modeling how that distribution changes with a set of covariates $X$.

To analyze this using interval censoring methods, you want to put your data into the general interval censoring format. That is, for each subject, we have the interval $(l_i, r_i)$, which represents the interval in which we know $T_i$ to be contained. So if subject $i$ is right censored at inspection time $c_i$, we would write $(c_i, \infty)$. If it is left censored at $c_i$, we would represent it as $(0, c_i)$.
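
The conversion described above can be sketched in a few lines of base R. The helper name `to_interval` and the column names `l` and `u` are hypothetical choices for illustration (they mirror the lower/upper naming convention, but any names would do).

```r
# Convert current status observations into interval-censored format.
# age:              inspection time c_i for each subject
# has_transitioned: TRUE if the event had already happened at inspection
to_interval <- function(age, has_transitioned) {
  data.frame(
    # Left-censored subjects get lower bound 0; right-censored get c_i
    l = ifelse(has_transitioned, 0, age),
    # Left-censored subjects get upper bound c_i; right-censored get Inf
    u = ifelse(has_transitioned, age, Inf)
  )
}

# Example: a 21-year-old still in state A and a 20-year-old already in B
to_interval(age = c(21, 20), has_transitioned = c(FALSE, TRUE))
```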

Shameless plug: if you want to use regression models to analyze your data, this can be done in R using icenReg (I'm the author). In fact, in a similar question about current status data, the OP put up a nice demo of using icenReg. He starts by showing that ignoring the censoring part and using logistic regression leads to bias (important note: he is referring to using logistic regression without adjusting for age. More on this later.)

Another great package is interval, which contains log-rank statistic tests, among other tools.

EDIT:

@EdM suggested using logistic regression to answer the problem. I was unfairly dismissive of this, saying that you would have to worry about the functional form of time. While I stand behind the statement that you should worry about the functional form of time, I realized that there was a very reasonable transformation that leads to a reasonable parametric estimator.

In particular, if we use log(time) as a covariate in our model with logistic regression, we end up with a proportional odds model with a log-logistic baseline.

To see this, first consider that the proportional odds regression model is defined as

$\text{Odds}(t|X, \beta) = e^{X^T \beta} \text{Odds}_o(t)$

where $\text{Odds}_o(t)$ is the baseline odds of survival at time $t$. Note that the regression effects are the same as with logistic regression. So all we need to do now is show that the baseline distribution is log-logistic.

Now consider a logistic regression with log(Time) as a covariate. We then have

$P(Y = 1 | T = t) = \frac{\exp(\beta_0 + \beta_1 \log(t))}{1 + \exp(\beta_0 + \beta_1\log(t))}$

With a little work, you can see this as the CDF of a log-logistic model (with a non-linear transformation of the parameters).
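
One way to fill in that "little work", taking $Y = 1$ to mean that the event has occurred by time $t$ and assuming $\beta_1 > 0$. The symbols $\alpha$ (scale) and $\beta$ (shape) are one common log-logistic parameterization, chosen here for illustration:

```latex
\begin{aligned}
P(Y = 1 \mid T = t)
  &= \frac{\exp(\beta_0 + \beta_1 \log t)}{1 + \exp(\beta_0 + \beta_1 \log t)}
   = \frac{e^{\beta_0}\, t^{\beta_1}}{1 + e^{\beta_0}\, t^{\beta_1}} \\
  &= \frac{(t/\alpha)^{\beta}}{1 + (t/\alpha)^{\beta}}
   = \frac{1}{1 + (t/\alpha)^{-\beta}},
  \qquad \beta = \beta_1, \quad \alpha = \exp(-\beta_0/\beta_1).
\end{aligned}
```

The last expression is the log-logistic CDF with scale $\alpha$ and shape $\beta$. If the response is instead coded as "event has not yet occurred", the same argument applies to the survival function and the signs of $\beta_0$ and $\beta_1$ flip.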

R demonstration that the fits are equivalent:

> library(icenReg)
> data(miceData)
> 
> ## miceData contains current status data about presence 
> ## of tumors at sacrifice in two groups
> ## in interval censored format: 
> ## l = lower end of interval, u = upper end
> ## first three mice all left censored
> 
> head(miceData, 3)
  l   u grp
1 0 381  ce
2 0 477  ce
3 0 485  ce
> 
> ## To fit this with logistic regression, 
> ## we need to extract age at sacrifice
> ## if the observation is left censored, 
> ## this is the upper end of the interval
> ## if right censored, is the lower end of interval
> 
> age <- numeric()
> isLeftCensored <- miceData$l == 0
> age[isLeftCensored] <- miceData$u[isLeftCensored]
> age[!isLeftCensored] <- miceData$l[!isLeftCensored]
> 
> log_age <- log(age)
> resp <- !isLeftCensored
> 
> 
> ## Fitting logistic regression model
> logReg_fit <- glm(resp ~ log_age + grp, 
+                     data = miceData, family = binomial)
> 
> ## Fitting proportional odds regression model with log-logistic baseline
> ## interval censored model
> ic_fit <- ic_par(cbind(l,u) ~ grp, 
+            model = 'po', dist = 'loglogistic', data = miceData)
> 
> summary(logReg_fit)

Call:
glm(formula = resp ~ log_age + grp, family = binomial, data = miceData)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1413  -0.8052   0.5712   0.8778   1.8767  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept)  18.3526     6.7149   2.733  0.00627 **
log_age      -2.7203     1.0414  -2.612  0.00900 **
grpge        -1.1721     0.4713  -2.487  0.01288 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 196.84  on 143  degrees of freedom
Residual deviance: 160.61  on 141  degrees of freedom
AIC: 166.61

Number of Fisher Scoring iterations: 5

> summary(ic_fit)

Model:  Proportional Odds
Baseline:  loglogistic 
Call: ic_par(formula = cbind(l, u) ~ grp, data = miceData, model = "po", 
    dist = "loglogistic")

          Estimate Exp(Est) Std.Error z-value        p
log_alpha    6.603 737.2000   0.07747  85.240 0.000000
log_beta     1.001   2.7200   0.38280   2.614 0.008943
grpge       -1.172   0.3097   0.47130  -2.487 0.012880

final llk =  -80.30575 
Iterations =  10 
> 
> ## Comparing loglikelihoods
> logReg_fit$deviance/(-2) - ic_fit$llk
[1] 2.643219e-12

Note that the effect of grp is the same in each model, and the final log-likelihood differs only by numeric error. The baseline parameters (i.e. intercept and log_age for logistic regression, alpha and beta for the interval censored model) are different parameterizations so they are not equal.

So there you have it: using logistic regression with log(time) as a covariate is equivalent to fitting the proportional odds model with a log-logistic baseline distribution. If you're okay with fitting this parametric model, logistic regression is quite reasonable. I do caution that with interval censored data, semi-parametric models are typically favored due to the difficulty of assessing model fit, but if I truly thought there was no place for fully-parametric models I would not have included them in icenReg.