26 Testing and Correcting for Heteroskedasticity

In the final three chapters we will revisit some of the model assumptions and introduce formal tests and corrections for two of these. This chapter will discuss testing and correcting for heteroskedasticity.

Under the homoskedasticity assumption, \text{Var}\left( \varepsilon_i|x_{i1},\dots,x_{ik} \right)=\sigma_\varepsilon^2 for all x_{i1},\dots,x_{ik}. That is, the error variance is the same constant whatever values the explanatory variables take.

Heteroskedasticity is when the variance of the errors varies with the values of the explanatory variables. It occurs frequently in practice, and when it is present the usual standard errors may not be reliable. We will now learn:

How to formally test for the presence of heteroskedasticity.
How to adjust the model’s standard errors for heteroskedasticity.
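To make the definition concrete, here is a small simulated sketch. None of it comes from the chapter's data: the sample size, the range of x and the error standard deviations are made-up values chosen only for illustration. It draws one set of errors with a constant standard deviation and another whose standard deviation grows with x, then compares the error variance for small and large values of x.

```r
# Simulated illustration: all names and numbers are assumptions for this sketch
set.seed(1)
n <- 1000
x <- runif(n, min = 1, max = 10)

# Homoskedastic errors: the same standard deviation for every x
e_homo <- rnorm(n, mean = 0, sd = 2)
# Heteroskedastic errors: the standard deviation grows with x
e_hetero <- rnorm(n, mean = 0, sd = 0.5 * x)

# Compare the error variance for small and large values of x
low <- x < 5.5
var_table <- rbind(
  homoskedastic   = c(low_x = var(e_homo[low]),   high_x = var(e_homo[!low])),
  heteroskedastic = c(low_x = var(e_hetero[low]), high_x = var(e_hetero[!low]))
)
print(round(var_table, 2))
```

In the homoskedastic row the two variances should be similar, while in the heteroskedastic row the variance for large x should be clearly bigger.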

26.1 Formal Test for Heteroskedasticity

We can formally test for heteroskedasticity as follows.

1. Estimate the original model \mathbb{E}\left[ Y_i|x_{i1},\dots,x_{ik} \right]=\beta_0+\beta_1 x_{i1}+\dots+\beta_k x_{ik} and save the residuals, e_i.
2. Estimate the auxiliary model, which uses e_i^2 as the dependent variable: \mathbb{E}\left[ e_i^2| x_{i1},\dots,x_{ik} \right]=\gamma_0+\gamma_1 x_{i1}+\dots+\gamma_k x_{ik}
3. Apply the F-test for the usefulness (overall significance) of this auxiliary model.

Under H_0, \gamma_1=\dots=\gamma_k=0 and we have homoskedasticity (the dispersion of the residuals does not vary with the independent variables). Under H_1, at least one \gamma_j\neq 0 and we have heteroskedasticity (the dispersion of the residuals does vary with the independent variables).

The logic of the test is that if the independent variables are useful at explaining e_i^2, then the variance of the residuals does depend on the values of the independent variables, violating homoskedasticity.
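The three steps can be wrapped in a small helper, sketched here under some assumptions. The function name het_f_test() is hypothetical (it is not part of any package), it assumes the fitted model contains an intercept, and the data are simulated so that the error standard deviation grows with x1.

```r
# Hypothetical helper: runs the three steps for a fitted lm() object
# (assumes the model includes an intercept)
het_f_test <- function(model) {
  # Step 1: squared residuals from the original model
  e2 <- resid(model)^2
  # Step 2: auxiliary regression of e^2 on the same explanatory variables
  X <- model.matrix(model)[, -1, drop = FALSE]  # drop the intercept column
  aux <- lm(e2 ~ X)
  # Step 3: F-test for the usefulness of the auxiliary model
  f <- summary(aux)$fstatistic
  p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
  c(F = f[["value"]], df1 = f[["numdf"]], df2 = f[["dendf"]], p_value = p)
}

# Simulated data (made-up numbers): the error sd is proportional to x1
set.seed(42)
n <- 500
x1 <- runif(n, 1, 10)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - x2 + rnorm(n, sd = 0.5 * x1)
sim <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = sim)
result <- het_f_test(fit)
print(result)
```

Because the simulated errors are heteroskedastic by construction, the p-value should be very small and the test should reject H_0.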

Let’s try this out with a regression model:

# Step 1: Estimate original model and save the residuals
df <- read.csv("wages2.csv")
m <- lm(wage ~ educ + female * married, data = df)
df$e <- m$residuals
# Step 2: Estimate the auxiliary model with the square of residuals
aux <- lm(e^2 ~ educ + female * married, data = df)
# Step 3: Apply the F-test:
summary(aux)$fstatistic

    value     numdf     dendf 
 10.72187   4.00000 521.00000

qf(0.95, 4, 521)

[1] 2.389045

Critical value approach: The F statistic (10.722) is larger than the critical value (2.389). Therefore we reject the null hypothesis. There is evidence of heteroskedasticity.
p-value approach: The F-test p-value (less than 0.001) is smaller than the significance level (0.05). Therefore we reject the null hypothesis. There is evidence of heteroskedasticity.
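The p-value used in the second approach is not printed with the F statistic, but it can be computed from the statistic and its degrees of freedom with pf(). The sketch below simply re-enters the numbers reported above; the variable names are chosen for illustration.

```r
# Values copied from the F-test output above
f_stat <- 10.72187
df1 <- 4
df2 <- 521

# p-value: probability of an F statistic at least this large under H_0
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
print(p_value)

# Critical value at the 5% significance level, and the resulting decision
crit <- qf(0.95, df1, df2)
reject <- f_stat > crit
print(reject)
```

Both approaches always lead to the same decision: the F statistic exceeds the critical value exactly when the p-value is below the significance level.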

26.2 Correcting Standard Errors for Heteroskedasticity in R

The standard formula for the standard errors of the regression coefficients assumes homoskedasticity. In the presence of heteroskedasticity there is another formula that accounts for and corrects this. We won’t go into the details of this formula, but we will learn how to get R to use these corrected standard errors.

To do this we use the function vcovHC() from the sandwich package. The function name stands for Variance Covariance Heteroskedasticity Consistent. The package is called sandwich because the mathematical formula for the standard errors has a “bread” component and a “meat” component with the form \text{bread}\times \text{meat} \times \text{bread}. Again, we won’t go into the details of this.
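For readers who want a rough picture of the sandwich anyway, the optional sketch below builds the simplest version of the estimator (often called HC0) by hand in base R. It uses the built-in mtcars data as a stand-in, and the model is chosen only for illustration. Note that vcovHC() uses a different variant by default (type = "HC3", which also adjusts each residual for its leverage), so its numbers will not match this simplified version exactly.

```r
# Illustration on built-in data; the model is an arbitrary example
fit <- lm(mpg ~ wt + hp, data = mtcars)

X <- model.matrix(fit)  # design matrix, including the intercept column
e <- resid(fit)         # residuals

bread <- solve(crossprod(X))  # (X'X)^{-1}
meat <- crossprod(X * e)      # X' diag(e^2) X: row i of X is scaled by e_i

# Simplest heteroskedasticity-consistent covariance matrix (HC0)
vcov_hc0 <- bread %*% meat %*% bread

# Standard errors are the square roots of the diagonal elements
robust_se <- sqrt(diag(vcov_hc0))
default_se <- sqrt(diag(vcov(fit)))
print(cbind(default_se, robust_se))
```

The coefficient estimates play no role here: only the way the residuals enter the variance formula changes.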

The function vcovHC() by itself doesn’t give us the corrected regression table. We will use the coeftest() function from the package lmtest to do this.

In practice, many people use these standard errors by default without even doing a formal test for heteroskedasticity. This is because heteroskedasticity is so common that the safe approach is to use heteroskedasticity-robust standard errors all the time. However, in the exam you should only use these standard errors if specifically instructed to use them. In normal cases you should use the default standard errors from the summary() function.

Let’s get the regression table with the corrected standard errors in R:

library(lmtest)
library(sandwich)
coeftest(m, vcov = vcovHC(m))


t test of coefficients:

                Estimate Std. Error t value  Pr(>|t|)    
(Intercept)    -1.024421   0.787960 -1.3001    0.1941    
educ            0.493559   0.059092  8.3524 6.088e-16 ***
female         -0.368964   0.374822 -0.9844    0.3254    
married         2.641066   0.404064  6.5363 1.505e-10 ***
female:married -2.828826   0.501106 -5.6452 2.714e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that the coefficient estimates are the same as before, but the standard errors are slightly different. Because the test statistics for individual significance and associated p-values depend on the standard errors, these also change.
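To see the difference directly, the default and corrected standard errors can be placed side by side. The sketch below uses the built-in mtcars data as a stand-in (an assumption, so that it runs without wages2.csv), and the model is chosen only for illustration. The standard errors are the square roots of the diagonal of the covariance matrix returned by vcov() or vcovHC().

```r
library(sandwich)

# Illustration on built-in data; the model is an arbitrary example
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Default standard errors (assume homoskedasticity)
default_se <- sqrt(diag(vcov(fit)))
# Heteroskedasticity-robust standard errors (vcovHC() uses type "HC3" by default)
robust_se <- sqrt(diag(vcovHC(fit)))
# Another variant, selected with the type argument
robust_se_hc1 <- sqrt(diag(vcovHC(fit, type = "HC1")))

comparison <- cbind(estimate = coef(fit), default_se, robust_se, robust_se_hc1)
print(round(comparison, 4))
```

The estimate column is identical whichever standard errors are used; only the standard error columns differ.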