
Missing data

The main idea...

Missing data is a concern in virtually all empirical research, from psychology (Graham, 2009) to phylogenomics (Roure et al., 2013) and ecology and evolution (Nakagawa and Freckleton, 2008). Data may be absent for a wide variety of reasons, such as a phenomenon that cannot be observed at all times or technical failures. If missing values can be considered randomly distributed in a data set and are handled appropriately, missing data need not prevent quality analysis. However, if missing data is ignored, systematically generated, or badly handled, it can distort the interpretations and conclusions of most analyses.

Some types of missing data

The type of missing data you may encounter in an analysis depends, of course, on the data itself and the underlying phenomena generating that data. Below, the three major categories of missing data are described. Efforts should be made to identify what kind of 'missingness' is in effect as this will help determine what actions can be taken.

Missing completely at random (MCAR) For data to be MCAR, a value's missingness must be unrelated to any other value in the data set, whether observed or missing. Little's MCAR test (Little, 1988) may be used to evaluate whether data is MCAR.

Missing at random (MAR) MAR data is random in the sense that missing values have no relation to the value that 'should' be there; however, the fact that a value is missing does depend on other variables in the data set. For example, if a sensor that measures a given variable, y, only operates above a certain temperature, tmin, and all values of y are missing below tmin, the missing data can be said to be MAR. Thus, other variables must be taken into account for MAR data to be considered random (i.e. missing data is "conditioned by" other data in the data set). 

Missing not at random (MNAR) MNAR data is absent in some systematic way which depends on the value of the variable of interest itself. Referring to the example above, this would mean that the value of y is directly related to the missingness of a given data point. This is the most 'dangerous' kind of missing data, as it biases analytical results. If the missingness of a variable (coded as a dummy variable) can be shown to be associated with that variable's values (e.g. by regression), then there is a good chance that it is MNAR. Failure to find a clear association should not be considered 'proof' that the missingness is not MNAR, but it suggests that the data could be MAR or MCAR. The process responsible for MNAR data must be identified and corrected for if possible, or the experiment re-run in a manner that addresses the measurement bias in y.
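
The three categories can be illustrated by simulation. The sketch below (in Python with NumPy; the variables, thresholds, and rates are invented for illustration) generates a variable y that depends on temperature, then deletes values under MCAR, MAR, and MNAR mechanisms and compares the mean of the remaining data — only the MNAR mechanism clearly biases the observed mean of y:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
temp = rng.normal(20, 5, n)                   # temperature covariate
y = 2.0 + 0.1 * temp + rng.normal(0, 1, n)    # variable of interest

# MCAR: missingness is independent of everything
mcar = rng.random(n) < 0.3
# MAR: the sensor fails below tmin = 15 degrees (depends on temp, not on y itself)
mar = temp < 15
# MNAR: large values of y themselves tend to go unrecorded
mnar = y > np.quantile(y, 0.7)

for label, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(label, round(y[~mask].mean(), 2))   # mean of the observed values
```

Note that the MAR mechanism does shift the observed mean of y slightly (warmer conditions are over-represented), but this can be corrected by conditioning on temperature, whereas the MNAR shift cannot be corrected from the observed data alone.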

Dealing with missing data

Multiple methods to deal with missing data exist. Some resort to simple deletion of objects or variables with missing values. Others attempt to "impute" (i.e. "fill in") reasonable substitute values based on either other values in the data set or a suitable distribution that models the variable of interest. While deletion is straightforward, imputation methods conserve sample size and hence prevent a loss of power. However, most imputation methods are subject to several assumptions and require careful handling. Below, a few common approaches to handling missing data are briefly described.

Listwise deletion Listwise or case deletion is the complete removal from an analysis of any object that has a missing value in one or more of its variables. This procedure is straightforward and effective; however, it reduces sample size and thus the power of an analysis. If the data is MCAR or MAR (and the conditioning data has been taken into account), this method should not introduce any bias into the analysis.
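
As a minimal sketch (in Python with pandas; the toy data frame is invented), listwise deletion simply keeps the complete cases:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0],
    "y": [10.0, np.nan, 30.0, 40.0],
})

# Listwise (case) deletion: drop every row containing any missing value
complete_cases = df.dropna()
print(complete_cases)   # only the first and last rows survive
```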

Pairwise deletion Pairwise deletion removes an object from an analysis only where that object has a missing value in a variable directly used in that analysis. Pairwise deletion may create inconsistencies across analyses, such as correlations greater than "1", as the number of objects may change depending on which variables are being compared. If pairwise deletion is used on more than one matrix (e.g. a matrix of response variables and one of explanatory variables, as used in redundancy analysis), it may result in matrices with differing objects, which many multivariate approaches cannot tolerate.
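
This can be sketched in Python with pandas (toy data): `DataFrame.corr()` excludes missing values pairwise, so each entry of the correlation matrix may rest on a different subset, and number, of objects:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, np.nan, 5.0],
    "b": [2.0, np.nan, 6.0, 8.0, 10.0],
    "c": [1.0, 4.0, np.nan, 16.0, 25.0],
})

# corr() deletes missing values pairwise: each correlation is computed
# from the rows where *both* variables are present
print(df.corr())

# the number of complete pairs behind each entry of the matrix
n_pairs = df.notna().astype(int).T @ df.notna().astype(int)
print(n_pairs)
```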

Similar-value imputation This single-imputation technique prescribes the replacement of missing values in a variable describing an object with the value of that variable realised by a closely related object. The researcher must be confident that this is a reasonable measure and explain the rationale clearly in any reporting.
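
For instance (Python with pandas; the trait table and species are invented for illustration), a missing trait value might be borrowed from a closely related object — here the observed value within the same genus:

```python
import numpy as np
import pandas as pd

# Hypothetical trait table: leaf size is missing for one species
traits = pd.DataFrame({
    "species": ["Quercus robur", "Quercus petraea", "Fagus sylvatica"],
    "genus":   ["Quercus", "Quercus", "Fagus"],
    "leaf_cm": [10.0, np.nan, 7.0],
})

# Similar-value imputation: fill the gap from congeneric species
# (here via the genus mean of the observed values)
traits["leaf_cm"] = traits.groupby("genus")["leaf_cm"].transform(
    lambda s: s.fillna(s.mean())
)
print(traits)
```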

Mean imputation Mean imputation, also known as "mean substitution", is the practice of replacing a variable's missing values with the mean of that variable's observed values. Variables subject to mean imputation should be distributed in such a manner that their mean adequately represents the centre of the distribution (e.g. a normal distribution). This single-imputation technique preserves the variable's mean, but it biases other parameters, such as the variance (which is underestimated), and distorts correlations between variables with missing data. This latter feature often renders mean imputation unsuitable for multivariate analyses.
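
The variance deflation is easy to demonstrate (Python with NumPy; simulated data with an invented missingness rate):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 1000)
x_miss = x.copy()
x_miss[rng.random(1000) < 0.4] = np.nan    # 40% MCAR missingness

observed = x_miss[~np.isnan(x_miss)]
imputed = np.where(np.isnan(x_miss), observed.mean(), x_miss)

# the mean is preserved exactly, but the variance is deflated
print(round(observed.std(), 1), round(imputed.std(), 1))
```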

Regression-based single imputation Regression-based single imputation uses variables present in a data set to predict the value of the missing data point using regression approaches, such as multiple linear regression. A fitted value from the regression model is substituted for the missing value. This method is suitable when your variables can be expected to predict one another (usually in a linear fashion) and meet the assumptions of whatever regression procedure you employ; however, like other single-imputation techniques, this method does not provide any measure of uncertainty associated with its estimate.
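
A minimal sketch (Python with NumPy; simulated data) of regression-based single imputation, using a simple linear fit on the complete cases:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(0, 1, n)
y = 3.0 + 2.0 * x + rng.normal(0, 0.5, n)
y[rng.random(n) < 0.2] = np.nan            # some y values go missing

obs = ~np.isnan(y)

# fit y ~ x on the complete cases, then substitute fitted values
slope, intercept = np.polyfit(x[obs], y[obs], 1)
y_filled = np.where(obs, y, intercept + slope * x)
```

Every imputed value lies exactly on the fitted line, which is precisely why this approach carries no measure of uncertainty.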

Multiple imputation Developed by Rubin (1978), multiple imputation (MI) has become a popular method for handling missing data in many settings (Rubin, 1996). MI methods use simulations to replace missing values multiple times, generating a collection of data sets which are identical except for their imputed values. Imputed values may be drawn from a distribution that approximates those of the variables with missing data or (for more complex data) be the result of Markov chain Monte Carlo procedures. Analysis may then be performed on the collection of MI data sets and the results compared, taking into account the impact of estimation uncertainty.

The number of imputed data sets needed to arrive at good estimates is controlled by the rate of missing information (γ; a function of the increase in variance due to missing data). Generally, a small number (< 5) of imputed data sets is sufficient for low rates (γ < 0.3). See Schafer (1999) for a primer on MI.

Expectation-maximisation methods Expectation-maximisation (EM) methods (Dempster et al., 1977) generally approach missing data substitution by data augmentation and maximum likelihood estimation. EM methods are relatively advanced and require careful handling. See Do and Batzoglou (2008) for a primer on the EM algorithm. EM methods are generally suited to scenarios where the joint distribution of variables is exponential or multivariate normal. 

Roughly speaking, the objective is to substitute missing values with values drawn from a distribution or probability model that describes a variable of interest. A two-step algorithm is used to iteratively tune a parameter of this distribution or model with reference to the observed data, in an attempt to arrive at more likely substitutions. After an initial 'guess' at the parameter describing a variable's distribution, the algorithm performs the following:
  1. In an expectation, or "E", step, estimates of missing values are generated and the probabilities of these being 'true' are assigned based on the observed data and the current parameters. The expected values of the missing data are then calculated from the estimates and their probabilities.
  2. A maximisation, or "M", step, revises the parameters by evaluating the expected values of the substitutions generated in the E step. The new parameter estimates are then used in the next E step.
Eventually, the parameter values used in the E and M steps converge and the algorithm ends. Many implementations will deliver a data set with missing data replaced. 
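
As a concrete toy example (Python with NumPy; simulated bivariate data, and a fixed 50 iterations in place of a proper convergence check), the classic EM scheme for a bivariate normal with missing values in one variable alternates between predicting the missing y from x (E step) and re-estimating the mean and (co)variances from the completed sufficient statistics (M step):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)     # true E[y] = 1, Var(y) = 5
miss = x > 0.5                              # y is MAR given x
y_obs = np.where(miss, np.nan, y)

# initial 'guesses' for the parameters of the bivariate normal
mu_x, s_xx = x.mean(), x.var()
mu_y, s_yy, s_xy = np.nanmean(y_obs), np.nanvar(y_obs), 0.0

for _ in range(50):
    # E step: expected values (and second moments) of the missing y,
    # conditional on x and the current parameter estimates
    beta = s_xy / s_xx
    cond_var = s_yy - beta * s_xy
    ey = np.where(miss, mu_y + beta * (x - mu_x), y_obs)
    ey2 = np.where(miss, ey**2 + cond_var, y_obs**2)

    # M step: update the parameters from the completed statistics
    mu_y = ey.mean()
    s_yy = ey2.mean() - mu_y**2
    s_xy = np.mean(ey * x) - mu_y * mu_x

# naive complete-case mean vs the EM estimate of the mean of y
print(round(np.nanmean(y_obs), 2), round(mu_y, 2))
```

Here the naive complete-case mean is badly biased (large-x objects are missing), while the EM estimates approach the true mean and variance of y.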

  • MNAR data requires special handling. It is likely to introduce bias into many analyses.
  • Single-imputation methods are convenient; however, they are often seen as "overconfident", as they lead to underestimates of uncertainty by re-using existing data.
  • When applicable (e.g. in MI), it is important that the model used for imputing data is compatible with the analysis that the imputed data will be subject to. For example, if the analysis will use four variables, then all four variables should be included in the imputation model. 
  • MI and EM methods generally assume multivariate normality and require that the missing data is random. The choice of distribution or model used is pivotal and must reflect the nature of the variables in question. Note that one can transform variables to conform to a distribution, impute values, and back-transform the data.
  • The EM algorithm can be used with maximum a posteriori (MAP) estimators rather than maximum likelihood to avoid overfitting.
  • There is a risk that the EM algorithm will get stuck in a local optimum.
  • R
    • Functions such as is.na(), na.omit(), na.exclude(), na.pass(), and na.fail() are useful in handling missing data. 
    • The aregImpute() function from the Hmisc package allows predictive mean matching, regression-based imputation, and weighted sampling from related objects using bootstrapping to assess uncertainty.
    • The Amelia II package offers a range of imputation techniques including MI and bootstrapped EM and can support time-series and longitudinal data. This may be used with Zelig package to combine imputation results.
    • The mitools package offers a set of tools to create and combine multiple imputation data sets.
    • The robCompositions package offers, among other robust methods, robust imputation of missing values for compositional data (i.e. where the data are relative and describe parts of some whole), which does not require multivariate normality.
    • The function LittleMCAR() in the BaylorEdPsych package can run Little's MCAR test on a data set with no more than 50 variables.
    • The package mice includes several advanced imputation functions.
  • The SPSS MVA module offers imputation and augmentation techniques and related tests (e.g. Little's MCAR test).