Multiple linear regression

The main idea...

Multiple linear regression (MLR) aims to quantify the degree of linear association between one response variable and several explanatory variables (Equation 1Figure 1). Unlike correlation, regression asserts that there is a directional, causal relationship between response and explanatory variables. Thus, changes in the explanatory variables can be used to predict changes in the response variable, but not vice versa. Attempting to perform reverse regression is likely to be problematic (see Greene, 1984 for illustration).

Alternatively, MLR may be used to answer general questions of the kind:
    "Is there a significant, linear relationship between my response variable and one or more of my explanatory variables?"

If you wish to perform MLR on (dis)similarity matrices, consider multiple regression on (dis)similarity matrices (MRM).

Equation 1: The general MLR equation. A response variable (y; known as the regressand) is predicted by a number of explanatory variables (x1, x2 ... xn; the regressors). The strength of each regressor's effect on the response variable is determined by the regression coefficients β1 ... βn. Together, the linear combinations of the regression coefficients and explanatory variables are able to predict the values of y with some error, ε. The error term can be understood as the value of y when all βixi, for i = {1 ... n}, are equal to zero.

Results and interpretation

Most MLR implementations will deliver:
  • An estimated value for each regression coefficient, β. Coefficients may correspond to raw values of each explanatory variable (and thus inherit the scale of that variable) or may be standardised and scale free (coefficients may be directly compared between variables with different scales). Many implementations will report both standardised and unstandardised coefficients. 
  • A standard error value for each coefficient, estimating the amount of uncertainty associated with that coefficient estimate. 
  • A value for the t statistic (the coefficient value divided by its standard error) for each coefficient. This will be compared to a Student's t distribution appropriate to the sample size of the MLR and will be used to evaluate the significance of coefficient estimates (i.e. against the null hypothesis that the coefficient is zero). Note that this is influenced by the presence of other explanatory variables and is not comparable between different models.
  • A p-value indicating how much the observed t value differs from what could be expected by chance alone, given the appropriate Student's t distribution. A p-value of 1% indicates that 99% of the Student's t distribution was closer to that distribution's mean value relative to your t value. Hence, the null hypothesis that the t value is due to random chance can be rejected at an alpha level of 0.01. In some implementations, p-values are based on permutations rather than a given distribution.
  • Confidence intervals for each regression coefficient, indicating within what range you can expect the coefficient's value to fall with a given degree of confidence (often 95%).
  • The overall squared multiple correlation coefficient (R2) of the model. This value reports how much of the variability in the response variable can be accounted for (fitted) by the explanatory variables.
  • An adjusted R2, which corrects for the complexity of an MLR model. This is needed as adding more explanatory variables to an MLR will increase the overall R2 regardless of whether those variables have any real explanatory power. Many explanatory variables relative to objects (i.e. observations) can lead to a case of overdetermination.
  • The overall p-value for the entire regression, usually determined through an F test on the variance fitted by the explanatory variables and the non-fitted (or residual) variance.
  • Information about the residuals, or the "left over" variation in your response variable that could not be fit by linear combinations of your explanatory variables.These should be normally distributed. If they're not, this may indicate that there are outliers or skew in the data which can affect interpretation. In this case, your data may require more preparation and screening.
Values of immediate interest are typically the adjusted R2, the p-value of the F statistic, and information about the distribution of residuals. If a significant linear fit has been detected and the residuals are normally distributed and centred on zero, one may then examine the individual coefficients associated with each explanatory variable along with their p-values. Assuming there are significant explanatory variables in the MLR model, values of the response variable may be estimated by multiplying this set of explanatory variables values by their coefficients and taking their sum along with the model's intercept value (as in Equation 1). Individual coefficients suggest the magnitude of change in the response variable per unit change of the explanatory variable. Standardised coefficients express this relationship in terms of standard deviations.



Figure 1: General concepts of MLR. a) MLR regresses a response variable (y) on multiple explanatory variables (x1 ... xn). That is, values of x1...xn will be used to predict values of y through a linear function. b) An schematic representation of an MLR involving one response variable (y) and two explanatory variable (x1, x2). As y (the vertical axis) is being regressed on x1 and x2, the vertical distances (dashed lines) between objects (red filled circles; diameter is inversely proportional to distance from the viewer) and a plane-of-best-fit are the residuals of the regression depicted here. "Best fit" is determined by a loss function, usually attempting to minimise the sum of squared residuals. In cases where more explanatory variables are analysed, a hyperplane of best fit is needed.    

Interactions between explanatory variables

In addition to simple relationships between the response and explanatory variables (Equation 1), the response variable's linear relation to interactions between explanatory variables may also be assessed. An interaction term between explanatory variables xand x2for example, would appear as "βx1x2" in Equation 1 and is referred to as a second-order interaction term as the sum of the interacting variable's exponents is two. Specifying interactions may be useful if there is reason to believe that a set of explanatory variables influence one another and hence the fit to the response variable. For example, if the abundance of a zooplankton taxon (y) is thought to be influenced by the abundance of a phytoplankton taxon that is its primary prey (x1), the concentration of nitrate in the water column (x2), and the response of the phytoplankton to the nitrate concentration (x× x2), then the MLR model would be: y =  βx+ βx2 + βx1x2.

If an interaction term is found to be significant, is deemed substantively meaningful, and you wish to retain it in a model, include all lower-order permutations of that term even if they are not significant. For example, including a third-order term x1 × x2 × x3 would require you to retain the second-order terms x× x2, x2 × x3, and x1 × x3, and the constituent first-order terms x1, x2, and x3.

Variable selection in MLR

Once an MLR model has been generated and the results examined, explanatory variables may be added or removed to fine tune the model for better fit. The addition or removal of variables should be guided by domain knowledge rather than the results of statistical tests for a range of reasons described in the references below. Users are encouraged to read Whittingham et al., (2006) for an approachable summary. Selection based on the results of statistical model evaluation should be handled with great care: even if a model produces more significant coefficients and higher R2 values, there's no guarantee that it's the best model (or even a realistic model) for your system. The resulting model may simply be very well tuned to your data set, with no generalisability to other, related ecological systems.

Domain knowledge must guide decisions to add or remove variables. However, when there are many variables available or no a priori reasons to believe the inclusion or removal of one explanatory variable is more sensible than that of another (as in data-mining exercises), automated approaches to assess which combination of variables offers the best fit (which is not necessarily the best explanation of the response data) are available. Below, forward, backward, and all-possible-regression methods are described. 

Warning:  As discussed and demonstrated by Derksen & Keselman (1992), Harrel (2001), Whittingham et al., (2006) and many others, stepwise procedures have multiple weaknesses, including biased parameter estimates, over-fitting, and inherent multiple testing. Further, forward selection approaches for redundancy analysis (RDA) are described by Blanchet et al. (2008).  Please consult these references before considering stepwise methods.

The following automated procedures involve either adding or removing variables from a model based on the effect those variables have on the overall performance of the model. Performance may be evaluated in several ways, including F-tests, the Akaike or Bayesian information criteria (AIC, BIC, resp.), or by calculating Mallows's Cp. Each of these criteria have different properties, thus this choice should be well-informed.

Forward selection
The forward selection algorithm begins with an empty model and then adds explanatory variables based on the values the chosen criterion would have if they were included in the model. The variable with the 'best' impact on the criterion value is added to the model and the process repeated. Note that each time a variable is added to the model, the potential criterion values of the other variables change. The process terminates when the addition of any more variables would not lead to a sufficiently 'better' criterion value. The sufficiency of improvement is defined by a threshold which may be set by the user in many implementations. 

Backward selection
The backward selection algorithm begins with a model built with all the explanatory variables available. It then removes explanatory variables if this would sufficiently improve the value of the criterion chosen. The sufficiency of improvement is defined by a threshold which may be set by the user in many implementations. The process terminates when further removal would not sufficiently improve the selected criterion.

Bidirectional selection
This approach is a combination of forward and backward selection and will either add or remove variables based on the relative improvement such an action would have on the evaluation criterion. 

Once again, methods that attempt to use automatic selection procedures to add or remove variables from an MLR model should be treated with caution and always earmarked for validation on new data sets. 

Key assumptions
  • Response and explanatory variables are causally linked.
  • Response and explanatory variables are linearly related. Linearising transformations may be used if this is not the case.
  • Explanatory variables are (almost) error free and can be treated as fixed values. If there is a large amount of error in your explanatory variables, consider exploring measurement errors models.
  • Variables exhibit homoscedasticity or homogeneity of variance. If this cannot be met, approaches exist to allow MLR with heteroscedastic variables. These include weighted least squares and generalised least squares. 
  • The response variable is assumed to be dependent on the explanatory variables. All explanatory variables and individual observations are assumed to be independent. 
  • Residuals are normally distributed with a mean of zero. Mild to moderate departures from a Gaussian distribution can be tolerated by MLR, however, normalising transformations should be considered.
  • Explanatory variables should not be multicollinear (highly inter-correlated).
  • Regression analyses are very susceptible to outliers. Residuals should be carefully inspected to check for gross outliers. Outliers should be appropriately handled.
  • If explanatory variables are multicollinear, their contributions to the regression model may be reported as minimal or insignificant. This may be due to their redundancy rather than a lack of causal importance.
  • The "significance" of explanatory variables should be interpreted with caution, especially if there is 1) no a priori reason to believe that a given explanatory variable has any predictive power and 2) a large number of explanatory variables in the analysis. Following from the logic of multiple testing, there is always a chance that the significance is due to chance alone. Validation of the predictive power of significant explanatory variables should be performed using new data sets. Attempting to assess validity using the same data set the model was built from is an example of data dredging.
  • There are many advanced forms of MLR available; however, their application is not always advisable. Simply generating a better fit to the observed data does not mean the model itself is a better representation of reality.
  • If variables have been transformed prior to analysis, they must be reverse-transformed (or back-transformed) to arrive at values comparable to those of the original variables.
  • The real-world "importance" of the explanatory variables in predicting the response variable goes beyond p-values and effect sizes. In predicting the change of a particular bacterial taxon's abundance, for example, prior knowledge may suggest that some explanatory variables are far more likely to change than others. Thus, even if these variables are not the most prominent or significant predictors of the response, they may be the most important.
  • R
    • The lm() function builds a linear model and can be used for MLR when more than one explanatory variable or a matrix of explanatory variables is used as input. The anova() function may be used to test for differences between models generated by lm().