14  The Multiple Linear Regression Model (MLR)

In the previous chapters covering the simple linear regression (SLR) model, we studied how Y_i depends on a single variable X_i. However, Y_i may depend on multiple variables X_{i1}, X_{i2}, \dots, X_{ik}. For example, sales (Y_i) could depend on online advertising (X_{i1}) and offline advertising (X_{i2}), and each type of advertising could have a different impact on Y_i. We can allow for this with the multiple linear regression (MLR) model.

We model Y_i as a linear function of X_{i1}, X_{i2}, \dots, X_{ik} and an error term: Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_k X_{ik} + \varepsilon_i We are going to see that much of what we learned for the simple linear regression model carries over directly to this model: for example, how to obtain confidence intervals and how to perform hypothesis tests on a particular parameter \beta_j.
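To make the model concrete, here is a minimal sketch that simulates data from a model with k=2 explanatory variables. The coefficient values (1, 2 and -0.5) and the sample size are made up purely for illustration:

# Simulate n = 100 observations from Y = 1 + 2*X1 - 0.5*X2 + error
# (coefficient values are made up for illustration only)
set.seed(123)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)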

14.1 Interpretation of the Parameters

14.1.1 Slope Terms

In the simple linear regression model the regression slope was the average increase in the dependent variable from a unit increase in the independent variable. In the multiple linear regression model this interpretation changes slightly. The coefficient on the first variable, \beta_1, is now how much the expected value of Y_i increases when x_{i1} increases by 1 unit while all other variables remain unchanged. This condition that all other variables remain unchanged was not needed before, because in the simple linear regression model there were no other variables: there was just the one variable X_i.

The expected value of Y_i given each of the x_{i1}, \dots, x_{ik} is: \mathbb{E}\left[Y_i|x_{i1},x_{i2},\dots,x_{ik} \right]=\beta_0 + \beta_1 x_{i1}+\beta_2x_{i2}+\dots +\beta_k x_{ik} If we increase x_{i1} by one unit this becomes:

\begin{split} \mathbb{E}\left[Y_i|x_{i1}+1,x_{i2},\dots,x_{ik} \right]&=\beta_0 + \beta_1\left( x_{i1}+1 \right)+\beta_2x_{i2}+\dots +\beta_k x_{ik} \\ \mathbb{E}\left[Y_i|x_{i1}+1,x_{i2},\dots,x_{ik} \right]&=\beta_1 + \beta_0 + \beta_1 x_{i1}+\beta_2x_{i2}+\dots +\beta_k x_{ik} \\ \end{split} If we subtract these we see that everything except \beta_1 cancels: \mathbb{E}\left[Y_i|x_{i1}+1,x_{i2},\dots,x_{ik} \right]-\mathbb{E}\left[Y_i|x_{i1},x_{i2},\dots,x_{ik} \right]=\beta_1 So the interpretation of \beta_1 is the left-hand side of this equation: the expected change in Y_i from a unit increase in x_{i1}, keeping all other variables x_{i2}, x_{i3}, \dots, x_{ik} fixed.

Sometimes to say “keeping all other variables fixed” we say “all else equal” or ceteris paribus, which is Latin for “other things equal”.

We can use the same logic to interpret the coefficients in front of the other variables. For example, \beta_2 is the expected change in Y_i from a unit increase in x_{i2}, keeping all other variables x_{i1}, x_{i3}, x_{i4}, \dots, x_{ik} fixed.
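To see this with concrete numbers (made up purely for illustration), suppose a model with k=2 variables had \mathbb{E}\left[Y_i|x_{i1},x_{i2} \right]=2 + 3x_{i1} + 5x_{i2}. Moving from \left(x_{i1},x_{i2}\right)=(1,4) to (2,4) changes the expected value from 2 + 3\times 1 + 5\times 4 = 25 to 2 + 3\times 2 + 5\times 4 = 28: an increase of exactly \beta_1=3, because x_{i2} was held fixed at 4.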

14.1.2 Intercept

To interpret the intercept term \beta_0 we note that when all variables are exactly equal to zero, x_{i1}=x_{i2}=\dots=x_{ik}=0, we get: \begin{split} \mathbb{E}\left[Y_i|x_{i1}=0,x_{i2}=0,\dots,x_{ik}=0 \right]&=\beta_0 + \beta_1 \times 0+\beta_2\times 0 +\dots +\beta_k \times 0 \\ &=\beta_0 \\ \end{split} So \beta_0 is the expected value of the dependent variable when all explanatory variables take on a value of zero.

With many explanatory variables (large k), having situations where all explanatory variables equal zero simultaneously becomes increasingly rare. Thus usually the estimate of the intercept \beta_0 will not make much sense and we won’t pay too much attention to it. But we will see some situations where it will.

14.2 Estimation of the Parameters

The parameters \beta_0, \beta_1, \dots, \beta_k are estimated by minimizing the sum of squared errors, as in the simple linear regression model.

The estimates b_0, b_1, b_2, \dots, b_k that we get are the ones that make the term below as small as possible: \sum_{i=1}^n \left(y_i - b_0 - b_1 x_{i1} - b_2 x_{i2} - \dots - b_k x_{ik}\right)^2 The mathematical formulas for b_0, b_1, b_2, \dots, b_k involve using matrix algebra so we will not show the formulas for the estimator here. Instead we will use R to estimate the model as in the example in the next subsection.
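Although we skip the matrix formulas, we can illustrate what “minimizing the sum of squared errors” means numerically. The sketch below uses made-up simulated data, writes the sum of squared errors as a function of candidate coefficients, and minimizes it with optim(); the result should match (up to numerical error) the estimates from lm(), whose multiple-variable syntax we introduce in the next subsection:

# Sketch: least squares as a numerical minimization problem (made-up data)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

# Sum of squared errors for candidate coefficients b = (b0, b1, b2)
sse <- function(b) sum((y - b[1] - b[2] * x1 - b[3] * x2)^2)

# Minimize the sum of squared errors numerically
optim(c(0, 0, 0), sse)$par

# Compare with the least squares estimates from lm()
coef(lm(y ~ x1 + x2))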

Just like in the simple linear regression model, after estimation we obtain the sample regression line: \hat{y}_i = b_0 + b_1 x_{i1}+\dots+ b_k x_{ik} where \hat{y}_i are the predicted values, and the residuals are e_i = y_i - \hat{y}_i.

14.3 Example in R

We will now show an example in R. We will move away from the sales and advertising example dataset because that only has one explanatory variable (advertising). We will instead use the dataset wages1.csv which contains data on the hourly wage in dollars, years of education, and years of work experience for n=526 people. The data are from the National Longitudinal Survey in the US. We will estimate a model explaining wage (Y) with education (X_1) and experience (X_2).

Estimating the model is almost the same as with the simple linear regression model. The only thing that changes is that we add more explanatory variables to the formula in the lm() function using the plus symbol +.

df <- read.csv("wages1.csv")
m <- lm(wage ~ educ + exper, data = df)
summary(m)

Call:
lm(formula = wage ~ educ + exper, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5532 -1.9801 -0.7071  1.2030 15.8370 

Coefficients:
            Estimate Std. Error t value       Pr(>|t|)    
(Intercept) -3.39054    0.76657  -4.423 0.000011846645 ***
educ         0.64427    0.05381  11.974        < 2e-16 ***
exper        0.07010    0.01098   6.385 0.000000000378 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.257 on 523 degrees of freedom
Multiple R-squared:  0.2252,    Adjusted R-squared:  0.2222 
F-statistic: 75.99 on 2 and 523 DF,  p-value: < 2.2e-16

If we had more variables, we would just add these variables separating each by the plus symbol. For example: y ~ x1 + x2 + x3 + x4. We will see examples of this in the upcoming chapters.

The sample regression line in our example is: \hat{y}_i=-3.39 + 0.64 x_{i1} + 0.07 x_{i2} Let’s interpret each of these numbers.

The model predicts that an individual with zero years of education and zero years of experience will have an hourly wage of -$3.39. This doesn’t make much sense: who would work for a negative wage? If we check the data, we can see why this estimate should not be trusted:

summary(df)
      wage             educ           exper      
 Min.   : 0.530   Min.   : 0.00   Min.   : 1.00  
 1st Qu.: 3.330   1st Qu.:12.00   1st Qu.: 5.00  
 Median : 4.650   Median :12.00   Median :13.50  
 Mean   : 5.896   Mean   :12.56   Mean   :17.02  
 3rd Qu.: 6.880   3rd Qu.:14.00   3rd Qu.:26.00  
 Max.   :24.980   Max.   :18.00   Max.   :51.00  

The smallest value of exper is 1, so no observation has exper equal to zero, let alone both educ and exper equal to zero at the same time. Because we need (several) observations with values x_{i1}=x_{i2}=0 for b_0 to be reliable, we cannot trust this estimate here.
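As a quick sanity check (a sketch using the df object loaded above), we can count how many observations have both explanatory variables equal to zero:

# Number of observations with educ = 0 and exper = 0 at the same time
sum(df$educ == 0 & df$exper == 0)  # 0 here: exper is never zero (its minimum is 1)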

We now move on to interpreting the coefficients in front of the explanatory variables. Holding experience fixed, increasing an individual’s education by 1 year increases the wage by $0.64 on average. Holding education fixed, increasing an individual’s experience by 1 year increases the wage by $0.07 on average.
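We can check the “holding experience fixed” interpretation directly with the estimated model m. The sketch below predicts the wage for two hypothetical individuals (the values 12, 13 and 10 are made up) who differ by one year of education but have the same experience; the difference in predictions equals the educ coefficient:

# Two hypothetical individuals: same experience, education differing by one year
new_data <- data.frame(educ = c(12, 13), exper = c(10, 10))
pred <- predict(m, newdata = new_data)
pred
diff(pred)  # equals the educ coefficient, about 0.64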

14.4 Adding and Removing Variables

Suppose now we used the same dataset as above to estimate a model explaining wage with education only, leaving experience out of the model. We use the approach we used with the simple linear regression model:

lm(wage ~ educ, data = df)

Call:
lm(formula = wage ~ educ, data = df)

Coefficients:
(Intercept)         educ  
    -0.9049       0.5414  

Now let’s compare the two sample regression equations, the model with experience included and with experience excluded: \begin{split} \hat{y}_i&=-3.39 + 0.64 x_{i1} + 0.07 x_{i2} \\ \hat{y}_i&=-0.90 + 0.54 x_{i1} \end{split} In the first model, increasing education by 1 year on average increased wages by $0.64 holding experience fixed. In the second model, increasing education by 1 year on average increased wages by $0.54 (without holding experience fixed).

The effect of education on wages is smaller in the model without experience. Increasing education by 1 year now only increases wages by $0.54 on average. Wages depend on experience, so in the simple model experience is included in \varepsilon_i. But education and experience are negatively correlated:

cor(df$educ, df$exper)
[1] -0.2995418

Someone with higher education has typically spent more time in school or college and has therefore accumulated less work experience. So when we increase education without holding experience fixed, the effect on wages is smaller, because more education usually goes together with less experience. Thus in the simpler model we have a violation of the \mathbb{E}\left[ \varepsilon_i|X_i \right]=0 assumption: the error term, which includes experience, is negatively correlated with the education variable. This negative correlation biases the estimate of \beta_1 downward. This kind of bias is called omitted variable bias.
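We can make this precise with the standard omitted variable algebra: for least squares estimates, the educ coefficient in the short model equals the educ coefficient in the long model plus the exper coefficient times the slope from regressing exper on educ. Because that slope is negative here, the short-model coefficient is pulled downward. A sketch of this check, reusing the model m estimated above:

# Short model (educ only) and the slope from regressing the omitted variable on educ
m_short <- lm(wage ~ educ, data = df)
delta   <- coef(lm(exper ~ educ, data = df))["educ"]

# Omitted variable decomposition: long-model coefficient plus exper coefficient times delta
coef(m)["educ"] + coef(m)["exper"] * delta  # implied short-model coefficient
coef(m_short)["educ"]                       # actual short-model coefficient, about 0.54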

For this reason we prefer models that also include the variables that affect Y and are correlated with our X variables of interest.