How do the number of imputations and the maximum number of iterations affect the accuracy of multiple imputation?


12

The help page for mice defines the function as:

mice(data, m = 5, method = vector("character", length = ncol(data)),
  predictorMatrix = (1 - diag(1, ncol(data))),
  visitSequence = (1:ncol(data))[apply(is.na(data), 2, any)],
  form = vector("character", length = ncol(data)),
  post = vector("character", length = ncol(data)), defaultMethod = c("pmm",
  "logreg", "polyreg", "polr"), maxit = 5, diagnostics = TRUE,
  printFlag = TRUE, seed = NA, imputationMethod = NULL,
  defaultImputationMethod = NULL, data.init = NULL, ...)

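That is a lot of parameters. How do you decide which ones to specify and which to leave at their defaults?

I am particularly interested in the number of imputations m and the maximum number of iterations maxit. How do these parameters affect accuracy?

In other words, when (and how?), while using these parameters, can I really say that some kind of convergence has been reached?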

20

Let's just go through the parameters one by one:

  • data doesn't require explanation
  • m is the number of imputations. Generally speaking, the more the better. Originally (following Rubin, 1987), 5 was considered enough, hence the default, so from the point of view of the accuracy of the point estimates 5 may be sufficient. However, this was based on an efficiency argument only. To obtain better estimates of the standard errors, more imputations are needed. These days a rule of thumb is to use about as many imputations as the average percentage of missing data, so if 30% of the data in a dataset are missing on average, use 30 imputations (White et al. state the rule in terms of the percentage of incomplete cases). See Bodner (2008) and White et al. (2011) for further details.
  • method specifies which imputation method is to be used. This is only necessary when the default method is to be overridden. For example, continuous data are imputed by predictive mean matching by default, and this usually works very well, but Bayesian linear regression and several others, including a multilevel model for nested/clustered data, may be specified instead. Hence, expert/clinical/statistical knowledge may be of use in specifying alternatives to the default method(s).
  • predictorMatrix is a matrix that tells the algorithm which variables are used as predictors when imputing each other variable. As the signature in the question shows, the default uses every other variable as a predictor; a sparser matrix based on the correlations between variables and the proportion of usable cases can be built with quickpred(). Expert/clinical knowledge may be very useful in specifying the predictor matrix, so the default should be used with care.
  • visitSequence specifies the order in which variables are imputed. It is not usually needed.
  • form is used primarily to aid the specification of interaction terms to be used in imputation, and isn't normally needed.
  • post is for post-imputation processing, for example to ensure that positive values are imputed. This isn't normally needed.
  • defaultMethod changes the default imputation methods, and is not normally needed.
  • maxit is the number of iterations for each imputation. mice uses an iterative algorithm, and the imputations for all variables need to reach convergence, otherwise they will be inaccurate. Convergence can be assessed visually by inspecting the trace plots generated by plot(). Far fewer iterations are needed than with other Gibbs sampling methods: as a rule of thumb, 20-30 or fewer. Convergence has been achieved when the trace lines settle around a value and fluctuate only slightly, with the lines from the different imputations intermingling and showing no trend. The following example, taken from here, shows healthy convergence:

[Figure: trace plots of the imputed values against iteration number, one panel per imputed variable]

Here, 3 variables are being imputed with 5 imputations (the coloured lines) for 20 iterations (the x-axis of each plot); the y-axis of each plot shows the imputed values for each imputation.
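A convergence check like the one described above can be sketched in R as follows. This is a minimal illustration rather than a recommended workflow: it assumes the nhanes example data set that ships with mice, and the values of m, maxit and seed are arbitrary choices made for the example.

```r
library(mice)

# nhanes is a small example data set shipped with mice (used here only
# for illustration; substitute your own data frame).
imp <- mice(nhanes, m = 5, maxit = 20, seed = 123, printFlag = FALSE)

# Trace plots: one coloured line per imputation, iterations on the x-axis.
# Look for lines that intermingle and show no upward or downward trend.
plot(imp)

# The per-iteration means behind the trace plots are stored in the object,
# with the iterations and the imputations as the last two dimensions.
dim(imp$chainMean)

# If the traces still show a trend, the same chains can be continued
# for further iterations instead of starting again from scratch.
imp_longer <- mice.mids(imp, maxit = 10, printFlag = FALSE)
imp_longer$iteration
```

Continuing with mice.mids() rather than rerunning mice() with a larger maxit keeps the iterations already computed, which is convenient when each iteration is slow.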

  • diagnostics produces useful diagnostic information by default.

  • printFlag prints the algorithm's progress by default, which is useful because the estimated time to completion can easily be ascertained.

  • seed is a random seed parameter which is useful for reproducibility.

  • imputationMethod and defaultImputationMethod are for backwards compatibility only.
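The efficiency argument behind the default m = 5, mentioned under m above, can be made concrete with Rubin's (1987) approximation for the relative efficiency of an estimate based on m imputations compared with one based on infinitely many. The symbol γ (the fraction of missing information) is introduced here for illustration, and the value 0.3 is an assumed example, not a figure from the question:

$$
\mathrm{RE} = \left(1 + \frac{\gamma}{m}\right)^{-1}
$$

With γ = 0.3, m = 5 gives RE = 1/1.06 ≈ 0.943 and m = 30 gives RE = 1/1.01 ≈ 0.990. The gain in efficiency of the point estimate from extra imputations is therefore small, which is why the case for larger m rests mainly on the stability of the standard errors rather than on this formula.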

Bodner, Todd E. (2008) “What improves with increased missing data imputations?” Structural Equation Modeling: A Multidisciplinary Journal 15: 651-675. https://dx.doi.org/10.1080/10705510802339072

Rubin, Donald B. (1987) Multiple Imputation for Nonresponse in Surveys. New York: Wiley.

White, Ian R., Patrick Royston and Angela M. Wood (2011) “Multiple imputation using chained equations: Issues and guidance for practice.” Statistics in Medicine 30: 377-399. https://dx.doi.org/10.1002/sim.4067