27  Testing and Correcting for Serial Correlation

27.1 Introduction

With time-series data, serial correlation in the error terms is very common: if the error e_t is positive, the error in the following period, e_{t+1}, tends to be positive as well. This is called first-order autocorrelation. When it is present, the default standard errors are no longer reliable.

Sometimes changing the regression specification helps remove the problem. For example:

  • Using differences x_t-x_{t-1} instead of levels x_t.
  • Using growth rates \frac{x_t-x_{t-1}}{x_{t-1}} instead of levels x_t.
  • Adding a time trend term to the model.
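These specification changes are easy to construct in R. The sketch below uses a short simulated series x (hypothetical data, not the chapter's dataset) to show how differences and growth rates are built, and how a trend term would enter a model:

```r
# Hypothetical simulated series, only to illustrate the transformations:
set.seed(1)
x <- cumsum(rnorm(40, mean = 1))  # a trending series in levels
t <- seq_along(x)                 # a time index, usable as a trend term

d_x <- c(NA, diff(x))                  # differences: x_t - x_{t-1}
g_x <- c(NA, diff(x) / x[-length(x)])  # growth rates: (x_t - x_{t-1}) / x_{t-1}

# A time trend is added by including t as an extra regressor, e.g.:
# m <- lm(y ~ x + t)
```

The first element of d_x and g_x is NA because the lagged value does not exist for the first observation.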

In this chapter we will learn how to formally test for first-order autocorrelation and how to correct the standard errors for it.

27.2 Formal Test for First-Order Autocorrelation

We can formally test for first-order autocorrelation as follows.

  1. Estimate the original model: \mathbb{E}\left[ Y_t|x_{t1},\dots,x_{tk} \right]=\beta_0+\beta_1 x_{t1}+\dots+\beta_k x_{tk} and save the residuals, e_t.
  2. Create a new variable which is the lag of the residuals, e_{t-1}.
  3. Estimate the auxiliary model: e_t = \gamma_0 + \gamma_1e_{t-1}+ \nu_t
  4. Apply the t-test (significance test) on \gamma_1. Under H_0: \gamma_1 = 0 there is no first-order autocorrelation; under H_1: \gamma_1 \neq 0 there is first-order autocorrelation.

In this auxiliary regression, \gamma_1 is approximately the first-order autocorrelation coefficient of the residuals, i.e. the correlation between e_t and e_{t-1}. The logic behind the test is that if the previous period’s residual helps predict the current period’s, then the residuals are not independent across time.

27.3 Testing for First-Order Autocorrelation in R

Let’s see how to do these steps in R. We will use the Dutch GDP and exports data we encountered in Chapter 8.

# Step 1: Estimate the original model and save the residuals:
df <- read.csv("nl-exports-gdp.csv")
m <- lm(gdp ~ exports, data = df)
df$e <- m$residuals
# Step 2: Create a new variable which is the lag of the residuals:
df$lag_e <- c(NA, df$e[1:(nrow(df)-1)])
# Step 3: Estimate the auxiliary model:
aux <- lm(e ~ lag_e, data = df)
# Step 4: Apply an individual significance test on the lagged residual term:
summary(aux)

Call:
lm(formula = e ~ lag_e, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-24.806  -3.847   1.886   5.140  12.424 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.91581    1.19398   0.767    0.447    
lag_e        0.94605    0.02968  31.878   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.773 on 52 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.9513,    Adjusted R-squared:  0.9504 
F-statistic:  1016 on 1 and 52 DF,  p-value: < 2.2e-16

The t-test for the individual significance of the lagged residual has a p-value close to zero, which is very strong evidence of first-order serial correlation.
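An alternative way to check the same hypothesis is the Durbin-Watson test, available through dwtest() in the lmtest package (which we will load again in Section 27.5). The sketch below applies it to a simulated regression with AR(1) errors, since simulated data makes the example self-contained; it is not the chapter's dataset. A Durbin-Watson statistic near 2 indicates no first-order autocorrelation, while values well below 2 indicate positive autocorrelation.

```r
library(lmtest)

# Hypothetical simulated data with autocorrelated errors:
set.seed(42)
n <- 100
x <- rnorm(n)
e <- as.numeric(arima.sim(list(ar = 0.8), n = n))  # AR(1) errors
y <- 1 + 2 * x + e
m_sim <- lm(y ~ x)

dwtest(m_sim)  # small p-value: evidence of first-order autocorrelation
```

With strongly autocorrelated errors like these, the Durbin-Watson statistic falls well below 2 and the test rejects the null of no first-order autocorrelation.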

27.4 Taking Growth Rates

Before learning how to correct the standard errors for serial correlation, let’s first try taking growth rates of both GDP and exports to see if the first-order serial correlation problem goes away. Note that by taking growth rates we lose the first observation, because the lagged value is unknown for the first period in the data. This is why we need to use the na.omit() function to drop the observation with missing values.

df <- read.csv("nl-exports-gdp.csv")
df$lag_gdp <- c(NA, df$gdp[1:(nrow(df)-1)])
df$lag_exports <- c(NA, df$exports[1:(nrow(df)-1)])
df$gdp_growth <- (df$gdp - df$lag_gdp) / df$lag_gdp
df$exports_growth <- (df$exports - df$lag_exports) / df$lag_exports
df <- na.omit(df)
m <- lm(gdp_growth ~ exports_growth, data = df)
summary(m)

Call:
lm(formula = gdp_growth ~ exports_growth, data = df)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0283831 -0.0084130  0.0006188  0.0099133  0.0268511 

Coefficients:
               Estimate Std. Error t value         Pr(>|t|)    
(Intercept)    0.004750   0.002726   1.742           0.0873 .  
exports_growth 0.380987   0.042394   8.987 0.00000000000364 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01317 on 52 degrees of freedom
Multiple R-squared:  0.6083,    Adjusted R-squared:  0.6008 
F-statistic: 80.76 on 1 and 52 DF,  p-value: 0.000000000003638

We now repeat the formal test for first-order autocorrelation to see whether the problem remains:

df$e <- m$residuals
df$lag_e <- c(NA, df$e[1:(nrow(df)-1)])
aux <- lm(e ~ lag_e, data = df)
summary(aux)

Call:
lm(formula = e ~ lag_e, data = df)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.028405 -0.007534 -0.002025  0.008030  0.032830 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0002411  0.0017667  -0.136    0.892
lag_e        0.2116660  0.1354665   1.562    0.124

Residual standard error: 0.01286 on 51 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.04568,   Adjusted R-squared:  0.02697 
F-statistic: 2.441 on 1 and 51 DF,  p-value: 0.1244

Now the lagged residual has a p-value greater than 0.05. There is no longer evidence of first-order serial correlation.

27.5 Correcting for First-Order Autocorrelation in R

If taking growth rates or differences, or adding a trend term, does not remove the problem, you can correct the standard errors for serial correlation in a similar way to how we corrected for heteroskedasticity. To do this we use the function vcovHAC() from the sandwich package, which corrects for both heteroskedasticity and autocorrelation.

We will now show how to do this in R. Let’s suppose for the moment that our model with growth rates still suffered from serial correlation and we wanted to correct for it.

library(lmtest)
library(sandwich)
coeftest(m, vcov = vcovHAC(m))

t test of coefficients:

                Estimate Std. Error t value         Pr(>|t|)    
(Intercept)    0.0047496  0.0030884  1.5379           0.1301    
exports_growth 0.3809872  0.0437884  8.7006 0.00000000001012 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that the coefficient estimates are the same as before but the standard errors are slightly different (e.g. 0.0437884 instead of 0.042394 for the slope).
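The sandwich package also provides the Newey-West estimator via NeweyWest(), a closely related HAC estimator that lets you set the maximum lag explicitly. The sketch below applies it to a simulated regression with AR(1) errors (hypothetical data, not the chapter's dataset; the choice of lag = 4 is an illustrative assumption):

```r
library(lmtest)
library(sandwich)

# Hypothetical simulated data with autocorrelated errors:
set.seed(1)
n <- 80
x <- rnorm(n)
e <- as.numeric(arima.sim(list(ar = 0.7), n = n))  # AR(1) errors
y <- 0.5 + 1.5 * x + e
m_sim <- lm(y ~ x)

# Newey-West standard errors with up to 4 lags of autocorrelation:
coeftest(m_sim, vcov = NeweyWest(m_sim, lag = 4))
```

As with vcovHAC(), only the standard errors change; the coefficient estimates are identical to those from summary().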