In Chapter 20 we learned how to test if a subset of variables were useful in a model. We showed an example with the clothing expenditure data and determined that the “household composition” variables (number of children and the household size) were jointly useful in the model. Here are the steps again:
df <- read.csv("clothing-exp.csv")
m1 <- lm(clothing_exp ~ hh_inc, data = df)
m2 <- lm(clothing_exp ~ hh_inc + num_kids + hh_size, data = df)
anova(m1, m2)
Analysis of Variance Table
Model 1: clothing_exp ~ hh_inc
Model 2: clothing_exp ~ hh_inc + num_kids + hh_size
Res.Df RSS Df Sum of Sq F Pr(>F)
1 298 3.3809
2 296 3.1442 2 0.23671 11.142 0.00002161 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value of the partial F-test is close to zero (0.00002161), so we reject the null hypothesis that the coefficients on both variables are zero and conclude that the household composition variables are jointly useful in the model.
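As a check, we can reproduce the F statistic in the ANOVA table by hand from the residual sums of squares (this computation is not part of the original output, but uses only numbers shown in the table above):

((3.3809 - 3.1442) / 2) / (3.1442 / 296)   # drop in RSS per restriction over residual variance, approx. 11.14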
Let’s take a look at the individual significance of each variable:
summary(m2)
Call:
lm(formula = clothing_exp ~ hh_inc + num_kids + hh_size, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.27225 -0.05878 -0.00765 0.05767 0.43981
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0125930 0.0232879 -0.541 0.589
hh_inc 0.0822021 0.0004423 185.861 <2e-16 ***
num_kids 0.0108057 0.0137232 0.787 0.432
hh_size 0.0119808 0.0116495 1.028 0.305
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1031 on 296 degrees of freedom
Multiple R-squared: 0.9921, Adjusted R-squared: 0.9921
F-statistic: 1.246e+04 on 3 and 296 DF, p-value: < 2.2e-16
Here only household income is individually statistically significant at the 5% level. The p-values for number of children and household size are both greater than 0.05 (0.432 and 0.305 respectively), so these two variables are individually insignificant.
How can it be that neither of these two variables is individually significant, yet together they are jointly significant? We will see that this can happen when we face the problem of collinearity.
21.2 Collinearity versus Strict Collinearity
Variables can be jointly significant but individually insignificant when they are strongly (but not perfectly) correlated with each other. Let’s check the correlation between the two variables:
cor(df$num_kids, df$hh_size)
[1] 0.9270981
A correlation of 0.927 indicates a very strong positive linear relationship between the two variables. This makes sense, because more children in a household usually means there are more people in the household in total!
When there is a strong correlation (close to +1 or -1) between the independent variables, we encounter a problem known as collinearity. This problem is related to, but distinct from, the no strict collinearity assumption we encountered in Chapter 15.
Strict collinearity is when one of the independent variables is an exact linear combination of one or more of the other independent variables. This occurs, for example, when two variables have a perfect linear relationship (a correlation of +1 or -1). In this case R will not estimate a regression coefficient for one of the two perfectly correlated variables and will return NA for that variable.
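To see what this looks like, here is a hypothetical illustration (the variable hh_size_twice is invented purely for this example and is not part of the dataset): because it is an exact multiple of hh_size, R cannot estimate both coefficients and reports NA for the redundant one.

df$hh_size_twice <- 2 * df$hh_size   # exact linear combination of hh_size
lm(clothing_exp ~ hh_inc + hh_size + hh_size_twice, data = df)   # coefficient on hh_size_twice is NA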
Collinearity, on the other hand, is when one of the independent variables is strongly related to another variable (or a linear combination of other variables) but not perfectly so. The correlation of 0.927 in our example above is a case of two variables that are strongly but not perfectly related. In the presence of collinearity R will estimate the model, but two problems can occur:
The interpretation of the parameter estimates can become difficult. It is unclear whether the number of children, the household size, or both are increasing the clothing expenditure.
The standard errors on the estimated parameters can increase. This results in wider confidence intervals and larger p-values in individual significance tests, as the simulation sketch below illustrates.
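The following small simulation (unrelated to the clothing data; all variable names here are made up) shows the second problem: when two predictors are strongly correlated, their individual standard errors are much larger than when only one of them is included.

set.seed(1)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # strongly, but not perfectly, correlated with x1
y  <- 1 + x1 + x2 + rnorm(n)
summary(lm(y ~ x1 + x2))        # large standard errors on x1 and x2
summary(lm(y ~ x1))             # much smaller standard error on x1 when x2 is dropped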
21.3 Possible Remedies for Collinearity
When you face a collinearity problem there are a number of different possible remedies.
One solution is to remove the offending variable. If two variables are highly correlated, then including both offers little additional information over including just one. In the clothing expenditure example, we might decide to drop the num_kids variable: once we know the household size, knowing how many children are in the household adds little extra information, since large households usually contain many children. Let’s try this out:
summary(lm(clothing_exp ~ hh_inc + hh_size, data = df))
Call:
lm(formula = clothing_exp ~ hh_inc + hh_size, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.27475 -0.05785 -0.00393 0.05942 0.43730
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0252035 0.0168961 -1.492 0.137
hh_inc 0.0821609 0.0004389 187.202 < 2e-16 ***
hh_size 0.0204746 0.0043961 4.657 0.00000484 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.103 on 297 degrees of freedom
Multiple R-squared: 0.9921, Adjusted R-squared: 0.9921
F-statistic: 1.871e+04 on 2 and 297 DF, p-value: < 2.2e-16
We can see that the household size variable is now individually statistically significant.
Another solution that is sometimes available is to create a new variable from the two problematic variables. For example, we could create a variable num_adults from the hh_size and num_kids variables and then change the model to use num_adults and num_kids instead of the household size variable. Unlike the previous solution, which throws away the information about the household composition, this approach allows us to see the effects of adults and children separately.
Let’s create the variable and check its correlation with num_kids:
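A sketch of this step (the original code and output are not reproduced here), assuming the household size is the number of adults plus the number of children:

df$num_adults <- df$hh_size - df$num_kids   # adults = household size minus children
cor(df$num_adults, df$num_kids)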
This correlation, although sizeable, is much smaller than before and not large enough to create a collinearity problem in the regression. To better understand this correlation, let’s cross-tabulate the two variables:
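A cross-tabulation of the kind described below could be produced with table() (the original output is not reproduced here):

table(df$num_kids, df$num_adults)   # children in rows (top to bottom), adults in columns (left to right)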
Here the number of adults is shown left to right (1 to 3) and the number of children is shown top to bottom (0 to 5). The numbers in the table show the number of observations with that number of adults and number of children combination. For example, the 54 indicates that there are 54 observations (out of 300) with 1 adult and 0 children in the household. The 101 indicates that there are 101 observations with 2 adults and 0 children.
Looking at the relationship between the number of adults and the number of children, we see there are no houses without adults (every house has 1, 2 or 3 adults). Houses with children generally have at least 2 adults; only 11 houses have 1-2 children and a single adult. The positive correlation therefore comes from children mostly living in houses with 2-3 adults, while most single-adult houses have no children.
Let’s now run the regression with these two variables:
summary(lm(clothing_exp ~ hh_inc + num_adults + num_kids, data = df))
Call:
lm(formula = clothing_exp ~ hh_inc + num_adults + num_kids, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.27225 -0.05878 -0.00765 0.05767 0.43981
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0125930 0.0232879 -0.541 0.589
hh_inc 0.0822021 0.0004423 185.861 < 2e-16 ***
num_adults 0.0119808 0.0116495 1.028 0.305
num_kids 0.0227865 0.0052888 4.308 0.0000224 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1031 on 296 degrees of freedom
Multiple R-squared: 0.9921, Adjusted R-squared: 0.9921
F-statistic: 1.246e+04 on 3 and 296 DF, p-value: < 2.2e-16
We now see that num_kids is significant, while num_adults is not. The coefficient on num_kids is similar in size to the coefficient on hh_size in the previous regression. The previous regression told us that more people in the household increase clothing expenditure, but not whether the children or the adults were driving this. This regression makes it clear: adding a child to a household increases clothing expenditure by more, on average, than adding an adult (holding all else constant).