16  MLR Inference on a Single Variable

In Chapter 15 we discussed the model assumptions for the multiple linear regression model. When all of these assumptions hold we are able to perform inference. In this chapter we will discuss inference on a single variable: how to obtain confidence intervals and how to perform hypothesis tests on one variable. We will see that this is very similar to the SLR model.

16.1 Model Variance

To obtain standard errors for the regression coefficients, we first require an estimate of the model variance \sigma_\varepsilon^2.

The sum of squared errors SSE is the same as before:

SSE=\sum_{i=1}^n \left( y_i-\hat{y}_i \right)^2=\sum_{i=1}^n e_i^2

To obtain the sample variance of the estimated model, s_\varepsilon^2, we use the formula:

s_\varepsilon^2 =\frac{SSE}{n-k-1}

Notice that we now divide by n-k-1 instead of the n-2 we used in the simple linear regression model. Because we are now estimating k+1 parameters (the k coefficients on the variables plus the intercept) we have only n-(k+1)=n-k-1 degrees of freedom.

The simple linear regression model is a special case of the multiple linear regression model with k=1. In that case n-k-1=n-1-1=n-2, which matches the degrees of freedom we used in the simple linear regression model.

The standard error of the estimated model is then s_\varepsilon=\sqrt{s_\varepsilon^2}.
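As a quick check, here is a minimal sketch of how these quantities can be computed by hand in R, using the wages, education and experience data that appears in the examples later in this chapter. The built-in sigma() function returns R's own estimate of s_\varepsilon.

df <- read.csv("wages1.csv")
m <- lm(wage ~ educ + exper, data = df)
SSE <- sum(residuals(m)^2)      # sum of squared residuals
s2 <- SSE / m$df.residual       # SSE / (n - k - 1)
s <- sqrt(s2)                   # standard error of the estimated model
s
sigma(m)                        # R's built-in estimate; should match s

Both numbers should also agree with the "Residual standard error" reported by summary(m).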

16.2 Confidence Intervals

Obtaining a confidence interval for the multiple linear regression model is very similar to obtaining one for the simple linear regression model.

The formula for the confidence interval for b_j is:

b_j \pm t_{1-\frac{\alpha}{2},n-k-1} \times s_{b_j}

where j is one of the variables 1,\dots,k. We will not write the formula for s_{b_j} here, but will calculate it in R. The only difference from the simple linear regression model is that we use the Student’s t distribution with n-k-1 degrees of freedom instead of n-2.

Obtaining the confidence interval is also the same in R and we interpret it the same way. If we have a 95% confidence interval \left[ L_j,U_j \right] around b_j we say “we are 95% confident that the population \beta_j is between L_j and U_j.”

Let’s show a quick example in R using the wages, education and experience data. If we want to get a 95% confidence interval around b_1, we do:

df <- read.csv("wages1.csv")
m <- lm(wage ~ educ + exper, data = df)
confint(m, "educ", level = 0.95)
         2.5 %    97.5 %
educ 0.5385695 0.7499747

We are 95% confident that the average impact on wages of one additional year of education is between $0.54 and $0.75 holding experience fixed.
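To connect this output with the confidence interval formula, here is a short sketch that rebuilds the same interval by hand from the estimate, its standard error, and the t quantile, using the model m estimated above:

est <- coef(summary(m))["educ", "Estimate"]
se <- coef(summary(m))["educ", "Std. Error"]
t_crit <- qt(1 - 0.05 / 2, m$df.residual)   # t quantile with n-k-1 degrees of freedom
c(est - t_crit * se, est + t_crit * se)     # should match confint(m, "educ") above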

16.3 Hypothesis Testing

Hypothesis tests for individual parameters are also done the same way as with the simple linear regression model. The only difference again is that we use n-k-1 degrees of freedom instead of n-2 when finding the quantiles of the t distribution and finding p-values.

We will do an example with the wages, education and experience data. Suppose you want to test the claim that increasing your experience by 1 year increases your wage on average by more than $0.05, holding education fixed. You will use \alpha=0.05.

The model is:

\mathbb{E}[Y_i|x_{i1},x_{i2}]=\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}

with x_{i1} being education and x_{i2} being experience. The claim is equivalent to testing whether \beta_2 > 0.05.

The null and alternative hypotheses are then:

\begin{split} H_0: & \beta_2 \leq 0.05 \\ H_1: & \beta_2 > 0.05 \end{split}

Recall that the claim is the alternative hypothesis and the null hypothesis is the opposite of the alternative. It is therefore usually helpful to write down the alternative hypothesis first.

Under the null hypothesis the test statistic T=\frac{B_2 - 0.05}{S_{B_2}} follows a t distribution with n-k-1 degrees of freedom.

We can use R to calculate the value of the test statistic:

df <- read.csv("wages1.csv")
m <- lm(wage ~ educ + exper, data = df)
b_2 <- coef(summary(m))["exper", "Estimate"]
s_b_2 <- coef(summary(m))["exper", "Std. Error"]
t <- (b_2 - 0.05) / s_b_2
t
[1] 1.830576

If we are using the critical value approach we check if t\geq t_{1-\alpha,n-k-1}. Let’s calculate this in R:

qt(0.95, m$df.residual)
[1] 1.647772

Notice that I used m$df.residual to get the degrees of freedom. This value stored in the model output always equals n-k-1. Let’s check that it matches what we would get if we computed n-k-1 manually: the number of observations (the number of rows in our dataset) minus the number of regressors (2) minus 1:

nrow(df) - 2 - 1
[1] 523
m$df.residual
[1] 523

Both give 523. However, using m$df.residual is more reliable because if the dataset contains missing observations then the number of rows of df does not equal the number of observations n used in the regression.
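Here is an illustrative sketch of what can go wrong. The object df_na is a hypothetical copy of the data with one value set to missing, so the manual count based on nrow() no longer matches the degrees of freedom that lm() actually used:

df_na <- df
df_na$educ[1] <- NA                           # artificially make one observation incomplete
m_na <- lm(wage ~ educ + exper, data = df_na)
nrow(df_na) - 2 - 1                           # still counts the incomplete row
m_na$df.residual                              # one smaller: lm() dropped the row with the NA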

Because the test statistic 1.831 is greater than the critical value 1.648 we reject the null hypothesis.

We can also do this using the p-value method. To get the p-value in R we do:

1 - pt(t, m$df.residual)
[1] 0.03386635

The p-value is less than the significance level \alpha=0.05 so we reject the null hypothesis. This is the same result as with the critical value approach.

To conclude: because we reject the null hypothesis, there is sufficient evidence for the claim that increasing your experience by one year, holding education fixed, increases your wage by more than $0.05 on average.

Here is a summary of the R functions we use for hypothesis testing in the MLR model; a small helper function that wraps these rules is sketched after the lists below. First define the following:

  • The size of the test, \alpha, is alpha.
  • The regression is stored as m so that m$df.residual equals n-k-1.
  • The value of the test statistic, t, is t.

Critical Values:

  • If upper-tailed test: qt(1-alpha, m$df.residual).
  • If lower-tailed test: qt(alpha, m$df.residual).
  • If two-sided test: qt(1-alpha/2, m$df.residual).

p-values:

  • If upper-tailed test: 1-pt(t, m$df.residual).
  • If lower-tailed test: pt(t, m$df.residual).
  • If two-sided test: 2*(1-pt(abs(t), m$df.residual)).
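To tie these rules together, here is a small helper function that returns both the critical value and the p-value for a given tail. The function name and interface are our own sketch, not a built-in R function:

t_test_mlr <- function(t, m, alpha = 0.05, tail = c("upper", "lower", "two-sided")) {
  tail <- match.arg(tail)
  dfres <- m$df.residual                     # n - k - 1
  if (tail == "upper") {
    crit <- qt(1 - alpha, dfres)
    pval <- 1 - pt(t, dfres)
  } else if (tail == "lower") {
    crit <- qt(alpha, dfres)
    pval <- pt(t, dfres)
  } else {
    crit <- qt(1 - alpha / 2, dfres)
    pval <- 2 * (1 - pt(abs(t), dfres))
  }
  c(critical.value = crit, p.value = pval)
}
t_test_mlr(1.830576, m, tail = "upper")      # reproduces the upper-tailed test above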

16.4 Statistical Significance

Finally, just like with the simple linear regression model, the most common hypothesis test for the multiple linear regression model is the test for statistical significance:

H_0:\beta_j = 0 \text{ vs } H_1 :\beta_j \neq 0

Under H_0, the variable j is useless within the model:

\mathbb{E}\left[ Y_i|x_{i1},\dots,x_{ik} \right]=\beta_0 + \beta_1 x_{i1} +\dots + 0 \cdot x_{ij} + \dots +\beta_k x_{ik}

That is, under H_0, variable j does not contribute to explaining Y_i.

In contrast, under H_1, the variable j is useful within the model. If we reject H_0, \beta_j is statistically different from zero. We say variable j is individually statistically significant.

The interpretation is therefore slightly different from the SLR model. There we said that the model was useful (because there was only one variable), whereas here we say that an individual variable is useful within the model. Later in Chapter 19 we will learn how to test for model usefulness more generally in the MLR model.

To check for statistical significance in R we can use the summary() function and quickly check the p-values of the included regressors:

df <- read.csv("wages1.csv")
m <- lm(wage ~ educ + exper, data = df)
summary(m)

Call:
lm(formula = wage ~ educ + exper, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5532 -1.9801 -0.7071  1.2030 15.8370 

Coefficients:
            Estimate Std. Error t value       Pr(>|t|)    
(Intercept) -3.39054    0.76657  -4.423 0.000011846645 ***
educ         0.64427    0.05381  11.974        < 2e-16 ***
exper        0.07010    0.01098   6.385 0.000000000378 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.257 on 523 degrees of freedom
Multiple R-squared:  0.2252,    Adjusted R-squared:  0.2222 
F-statistic: 75.99 on 2 and 523 DF,  p-value: < 2.2e-16

If a variable’s p-value is below our desired significance level (such as 0.05) we say that variable is individually statistically significant. In this model both variables are individually significant. We can also see this from the *** next to the p-values.
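Instead of reading the p-values off the printed table, we can also pull them out of the coefficient matrix directly; a short sketch:

pvals <- coef(summary(m))[, "Pr(>|t|)"]    # p-values for each coefficient
pvals < 0.05                               # TRUE for every individually significant variable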