df1 <- read.csv("cpb-house-prices.csv", sep = ";", dec = ",")
names(df1) <- c("municipality", "house_price_2022", "house_price_2021")
df2 <- read.csv("municipality-province.csv")
names(df2) <- c("municipality", "province")
df <- merge(df1, df2, by = "municipality")
25 Qualitative Variables with Multiple Levels
25.1 Introduction
In Chapter 24 we learned how to use qualitative variables with two values - such as gender - in a regression model. By including a dummy variable for one of the gender values, we were able to cover all the possible values that the gender could take: =1 for females and =0 for males.
But often we have data with qualitative variables that can take on more than two values. For example we could have variables like:
- Educational attainment: High school, Bachelor, Master, PhD.
- Industry: Primary sector (e.g.~agriculture), Manufacturing, Services.
- Region: US States, Provinces of the Netherlands.
What we will learn in this chapter is how to include this kind of information in a linear regression model.
25.2 Theory
Suppose we are interested in the impact of industry sector on wages. We have a sample of wages Y_i for n individuals and what sector they work in (sector_i): primary, manufacturing or services.
25.2.1 The Incorrect Approach
Suppose for the moment we decided to follow the logic in Chapter 24 and created a numeric variable x_{i1} with the sector information as follows:
- x_{i1}=0 if sector_i=\texttt{primary}.
- x_{i1}=1 if sector_i=\texttt{manufacturing}.
- x_{i1}=2 if sector_i=\texttt{services}.
Suppose also we used this variable to estimate the regression model: \mathbb{E}[Y_i | x_{i1}] = \beta_0 + \beta_1 x_{i1} We will see now that this approach is incorrect.
For individuals in the primary sector we have: \mathbb{E}[Y_i | x_{i1}=0] = \beta_0 Therefore \beta_0 is the average wage of people in the primary sector.
For individuals in the manufacturing sector we have: \mathbb{E}[Y_i | x_{i1}=1] = \beta_0 + \beta_1 This means that \beta_1 is the average difference in wages between people in the manufacturing sector and the primary sector.
But then for individuals in the services sector we have: \mathbb{E}[Y_i | x_{i1}=2] = \beta_0 + 2\beta_1 This means that \beta_1 is also the average difference in wages between people in the services sector and the manufacturing sector! It also means that the difference in wages between services and the primary sector is 2\beta_1.
Using a variable like this means that going from one sector to the next leads to an increase in wage of \beta_1 on average for all sectors. But there is no reason to think that going from primary to manufacturing and manufacturing to services will lead to the same average increase in wage. This is a very restrictive way to use this variable. We want a more flexible way to use the information about the sector in the model.
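We can see this restriction in action with a small sketch in R. The data here are simulated (not from the chapter), and the names `x1` and `m_num` are illustrative. The true group means are 10, 12 and 20, so the sector-to-sector steps are not equal, but the numeric coding forces the fitted means onto a straight line:

```r
# Simulated wages: true group means 10, 12, 20 (not equally spaced)
set.seed(1)
n <- 300
x1 <- rep(c(0, 1, 2), each = n / 3)        # numeric sector code: 0, 1, 2
y  <- c(10, 12, 20)[x1 + 1] + rnorm(n)     # wages scattered around group means
m_num <- lm(y ~ x1)

# The numeric coding forces the three fitted means to be equally spaced:
predict(m_num, newdata = data.frame(x1 = c(0, 1, 2)))

# The actual sample group means, by contrast, are not equally spaced:
tapply(y, x1, mean)
```

The fitted means lie exactly on a line with slope b_1, so the model cannot match all three sample group means at once.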
25.3 The Correct Approach
Instead of creating one numeric variable with the information from the qualitative variable what we should do is create a dummy variable for each value of the categorical variable. For the sector example we create 3 variables:
- D_{i1}=1 if primary sector and D_{i1}=0 otherwise.
- D_{i2}=1 if manufacturing sector and D_{i2}=0 otherwise.
- D_{i3}=1 if services sector and D_{i3}=0 otherwise.
We then estimate a regression model using these dummy variables. We cannot include all 3 dummy variables because otherwise we run into the dummy variable trap we encountered in Chapter 24. This is because D_{i1}=1-D_{i2}-D_{i3} always:
- If D_{i1}=0 then one of D_{i2} or D_{i3} equals 1.
- If D_{i1}=1 then D_{i2}=D_{i3}=0.
We need to choose one category to be the base category. Let’s let this be the primary sector. The model we would estimate is then:
\mathbb{E}[Y_i|sector_i]=\beta_0+\beta_1 D_{i2} + \beta_2 D_{i3} For the primary sector we have: \mathbb{E}[Y_i|sector_i=\texttt{primary}]=\beta_0 For the manufacturing sector we have: \mathbb{E}[Y_i|sector_i=\texttt{manufacturing}]=\beta_0+\beta_1 For the services sector we have: \mathbb{E}[Y_i|sector_i=\texttt{services}]=\beta_0+\beta_2 So:
- \beta_1 is the average difference between the manufacturing and primary sectors.
- \beta_2 is the average difference between the services and primary sectors.
- \beta_2-\beta_1 is the average difference between the services and manufacturing sectors.
Now the model is much more flexible.
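A quick sketch in R (again on simulated data; `D2`, `D3` and `m_dum` are illustrative names) confirms that with one dummy per non-base category the coefficients recover the group means exactly:

```r
# Simulated sectors and wages; true group means 10, 12, 20
set.seed(2)
sector <- sample(c("primary", "manufacturing", "services"), 300, replace = TRUE)
y <- c(primary = 10, manufacturing = 12, services = 20)[sector] + rnorm(300)

# Dummies for the non-base categories (primary is the base)
D2 <- as.numeric(sector == "manufacturing")
D3 <- as.numeric(sector == "services")
m_dum <- lm(y ~ D2 + D3)

coef(m_dum)            # (Intercept) = primary mean; D2, D3 = differences
tapply(y, sector, mean)
```

The intercept equals the sample mean for the primary sector, and each dummy coefficient equals the difference between that sector's sample mean and the primary sector's: three parameters, three unrestricted group means.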
25.4 Qualitative Variables in R
To show how to do this in R we will use a dataset on the average house prices Y_i by municipality (gemeente) in the Netherlands in 2022 and the province each municipality is in, prov_i. We will use this dataset to see how much location (province) impacts house prices.
To do this we create 12 dummy variables, one for each province:
- D_{i1}=1 if prov_i=\texttt{Drenthe} and zero otherwise.
- D_{i2}=1 if prov_i=\texttt{Flevoland} and zero otherwise.
- \vdots
- D_{i12}=1 if prov_i=\texttt{Zuid-Holland} and zero otherwise.
Because D_{i1}=1-D_{i2}-D_{i3}-\dots-D_{i12} for all i, we need to exclude one province to avoid the dummy variable trap. Let’s choose Drenthe (D_{i1}) to be the base level.
The model is then: \mathbb{E}\left[ Y_i|prov_i \right]= \beta_0 + \beta_1 D_{i2} + \beta_2 D_{i3} +\dots+ \beta_{11} D_{i12} To get the data ready we merge the two datasets by municipality, as in the code at the start of this chapter. We need to be careful that the house prices data uses ; for separators and commas for decimal points, hence the sep = ";" and dec = "," arguments to read.csv().
Next, what we could do is spend a lot of time creating 11 dummy variables, one for each province, and then typing all 11 into the formula in the lm() function. The good news is that there is no need to do this in R. If we provide a character vector to lm(), R will interpret it as a factor variable (a qualitative variable) and automatically create these dummies. R will also automatically choose one level to be the base level. Unless the variable is already a factor, R will always choose the first value alphabetically (here Drenthe) to be the base level.
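If you want to see the dummies R creates behind the scenes, model.matrix() shows the design matrix. A small sketch on a toy character vector (`prov` here is illustrative, not the chapter's data):

```r
# A toy character vector with three provinces
prov <- c("Drenthe", "Flevoland", "Utrecht", "Drenthe")

# model.matrix() builds the design matrix R would use in lm():
# an intercept plus one dummy column per non-base level
model.matrix(~ prov)
```

Drenthe, the first level alphabetically, gets no column of its own: it is the base level, absorbed into the intercept.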
So estimating this model is as simple as:
m <- lm(house_price_2022 ~ province, data = df)
summary(m)
Call:
lm(formula = house_price_2022 ~ province, data = df)
Residuals:
Min 1Q Median 3Q Max
-270.55 -51.18 -8.15 33.68 577.65
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 381.308 27.768 13.732 < 2e-16 ***
provinceFlevoland 7.242 48.095 0.151 0.8804
provinceFryslân -10.419 35.848 -0.291 0.7715
provinceGelderland 53.907 30.862 1.747 0.0816 .
provinceGroningen -84.728 41.186 -2.057 0.0405 *
provinceLimburg -38.889 32.704 -1.189 0.2352
provinceNoord-Brabant 61.476 30.599 2.009 0.0453 *
provinceNoord-Holland 159.944 31.326 5.106 0.000000559 ***
provinceOverijssel -4.944 33.781 -0.146 0.8837
provinceUtrecht 137.172 33.570 4.086 0.000055142 ***
provinceZeeland -43.208 38.507 -1.122 0.2626
provinceZuid-Holland 62.294 30.982 2.011 0.0452 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 96.19 on 329 degrees of freedom
Multiple R-squared: 0.3251, Adjusted R-squared: 0.3026
F-statistic: 14.41 on 11 and 329 DF, p-value: < 2.2e-16
Let’s interpret these estimates. We first note that: \begin{split} \mathbb{E}\left[ Y_i|prov_i=\texttt{Drenthe} \right]=& \beta_0+\beta_1 \times 0 +\beta_2 \times 0+ \dots + \beta_{10}\times 0+\beta_{11} \times 0 = \beta_0\\ \mathbb{E}\left[ Y_i|prov_i=\texttt{Flevoland} \right]=& \beta_0+\beta_1 \times 1 + \beta_2 \times 0+\dots + \beta_{10}\times 0+\beta_{11} \times 0 = \beta_0+\beta_1\\ &\vdots \\ \mathbb{E}\left[ Y_i|prov_i=\texttt{Zuid-Holland} \right]=& \beta_0+\beta_1 \times 0 + \beta_2 \times 0+\dots + \beta_{10}\times 0+\beta_{11} \times 1 = \beta_0+\beta_{11} \\ \end{split} So our estimate of \beta_0 is the average house price in Drenthe in our sample. Because house prices are in thousands of euros, the average house price in Drenthe in our sample is €381,308.33. Our estimate of \beta_0+\beta_1 is the average house price in Flevoland in our sample. This is €381,308.33+€7,241.67=€388,550.00. Our estimate of \beta_1 is therefore the difference in average house price between Flevoland and Drenthe in our sample.
We will now do some example questions with this output.
One example is: “are there any differences in average house prices across provinces (at the 5% level)?”
To do this, let \mu_j be the population average house price in province j=1,2,\dots,12. This question is essentially asking to test: \begin{split} H_0: & \mu_1=\mu_2=\dots=\mu_{12}\\ H_1: & \text{ at least one } \mu_j\neq\mu_k \text{ for } j,k=1,\dots,12\\ \end{split} Using our model with 11 dummy variables (with Drenthe as the base category), this is the same as: \begin{split} H_0 &: \beta_1=\beta_2=\dots=\beta_{11}=0 \\ H_1&: \text{ at least one } \beta_j\neq 0 \text{ for } j=1,2,\dots,11 \end{split} This is just an F-test for testing the model’s usefulness!
Let’s do the F-test as a recap. Under H_0, F\sim F_{k,n-k-1}. We can get the value of the test statistic from the model summary:
summary(m)$fstat
value numdf dendf
14.40891 11.00000 329.00000
The critical value can be found with qf(), using numdf and dendf from above as the numerator and denominator degrees of freedom:
qf(0.95, 11, 329)
[1] 1.817809
Because the test statistic (14.409) is larger than the critical value (1.818) we reject the null hypothesis. There is sufficient evidence to suggest that the average house prices are different across provinces.
We can also use the p-value approach. The p-value for the F-test is already shown in the summary output, but we could also obtain it manually using:
f_stat <- summary(m)$fstat[1]
(p_val <- 1 - pf(f_stat, 11, 329))
value
0
The p-value (essentially zero; it is so small that it is displayed as 0) is smaller than the significance level (0.05), so we also reject H_0 with this approach.
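The same F-test can also be run as an explicit model comparison with anova(): compare the intercept-only model against the model with the province dummies. A sketch on simulated data (toy province labels and the names `m0`, `m1` are illustrative, not the chapter's dataset):

```r
# Simulated prices for four toy provinces, with province "B" more expensive
set.seed(3)
province <- sample(c("A", "B", "C", "D"), 200, replace = TRUE)
price <- 300 + 50 * (province == "B") + rnorm(200, sd = 30)

m0 <- lm(price ~ 1)          # restricted model: one common mean
m1 <- lm(price ~ province)   # unrestricted model: one mean per province

# F statistic and p-value here match the F test in summary(m1)
anova(m0, m1)
```

This makes explicit what the summary's F-test does: it tests the dummy model against a model where every province has the same mean.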
25.5 Specifying the Base Level
The coefficient estimates b_1,\dots,b_{11} are always interpreted as the differences with respect to the base level. Because of this, we may want to specify the base level to help us interpret the results. By default, R chooses Drenthe as the base level. But we may want to use Noord-Brabant or another province as the base level. How can we do this?
To do this we first convert the variable to a factor and then “relevel” the factor variable using the relevel() function, specifying the base level. Let’s do this making Noord-Brabant the base level:
df$province <- factor(df$province)
df$province <- relevel(df$province, ref = "Noord-Brabant")
summary(lm(house_price_2022 ~ province, data = df))
Call:
lm(formula = house_price_2022 ~ province, data = df)
Residuals:
Min 1Q Median 3Q Max
-270.55 -51.18 -8.15 33.68 577.65
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 442.7839 12.8540 34.447 < 2e-16 ***
provinceDrenthe -61.4756 30.5986 -2.009 0.045343 *
provinceFlevoland -54.2339 41.3198 -1.313 0.190252
provinceFryslân -71.8950 26.0626 -2.759 0.006130 **
provinceGelderland -7.5682 18.6185 -0.406 0.684647
provinceGroningen -146.2039 33.0225 -4.427 0.000012994 ***
provinceLimburg -100.3646 21.5336 -4.661 0.000004582 ***
provinceNoord-Holland 98.4683 19.3781 5.081 0.000000629 ***
provinceOverijssel -66.4199 23.1372 -2.871 0.004361 **
provinceUtrecht 75.6968 22.8275 3.316 0.001015 **
provinceZeeland -104.6839 29.6136 -3.535 0.000466 ***
provinceZuid-Holland 0.8181 18.8163 0.043 0.965346
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 96.19 on 329 degrees of freedom
Multiple R-squared: 0.3251, Adjusted R-squared: 0.3026
F-statistic: 14.41 on 11 and 329 DF, p-value: < 2.2e-16
Now the intercept is the average house price in Noord-Brabant and all the coefficient estimates are differences relative to Noord-Brabant. For example, houses in Drenthe are on average €61,475.60 cheaper than in Noord-Brabant, while houses in Noord-Holland are on average €98,468.30 more expensive.
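Note that releveling only changes how the coefficients are parameterised; the fitted model itself is unchanged. A sketch on simulated data (the names `g`, `m_a`, `m_b` are illustrative) verifying this:

```r
# Simulated data with three groups and group means 1, 2, 3
set.seed(4)
g <- factor(sample(c("A", "B", "C"), 150, replace = TRUE))
y <- c(A = 1, B = 2, C = 3)[as.character(g)] + rnorm(150)

m_a <- lm(y ~ g)                      # base level "A" (alphabetical default)
m_b <- lm(y ~ relevel(g, ref = "B"))  # base level "B"

all.equal(fitted(m_a), fitted(m_b))                # same fitted values
c(summary(m_a)$r.squared, summary(m_b)$r.squared)  # same R-squared
```

This is why the residuals, R-squared and F-statistic in the two summary outputs above are identical even though the coefficients differ.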
25.6 Interaction Terms with Dummy Variables
We can also combine dummy variables with interaction terms. Consider the following model with the wages2.csv data, where Y_i is the hourly wage: \mathbb{E}\left[ Y_i|educ_i,female_i,married_i \right]= \beta_0+ \beta_1 educ_i+ \beta_2 female_i+ \beta_3 married_i+ \beta_4 female_i\times married_i Holding educ_i fixed, there are 4 possible combinations for the female and married dummies: \begin{split} \text{Unmarried men:\qquad} \mathbb{E}\left[ Y_i|educ_i,female_i=0,married_i=0 \right]&= \beta_0+ \beta_1 educ_i \\ \text{Unmarried women:\qquad} \mathbb{E}\left[ Y_i|educ_i,female_i=1,married_i=0 \right]&= \beta_0+ \beta_1 educ_i+\beta_2\\ \text{Married men:\qquad} \mathbb{E}\left[ Y_i|educ_i,female_i=0,married_i=1 \right]&= \beta_0+ \beta_1 educ_i+\beta_3\\ \text{Married women:\qquad} \mathbb{E}\left[ Y_i|educ_i,female_i=1,married_i=1 \right]&= \beta_0+ \beta_1 educ_i+\beta_2+\beta_3+\beta_4\\ \end{split} Holding education fixed:
- \beta_2 is the average difference in wage between unmarried women and unmarried men.
- \beta_3 is the average difference in wage between married men and unmarried men.
- \beta_2+\beta_4 is the average difference in wage between married women and married men.
- \beta_3+\beta_4 is the average difference in wage between married women and unmarried women.
So:
- \beta_2 is the wage gap for unmarried women.
- \beta_2+\beta_4 is the wage gap for married women.
- \beta_4 can therefore be interpreted as the difference in wage gap between married and unmarried women.
Let’s estimate it in R:
df <- read.csv("wages2.csv")
m <- lm(wage ~ educ + female * married, data = df)
summary(m)
Call:
lm(formula = wage ~ educ + female * married, data = df)
Residuals:
Min 1Q Median 3Q Max
-6.5907 -1.6293 -0.7337 1.1014 14.6606
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.02442 0.69311 -1.478 0.140
educ 0.49356 0.04856 10.164 < 2e-16 ***
female -0.36896 0.43341 -0.851 0.395
married 2.64107 0.39936 6.613 0.0000000000933 ***
female:married -2.82883 0.55556 -5.092 0.0000004962244 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.065 on 521 degrees of freedom
Multiple R-squared: 0.3165, Adjusted R-squared: 0.3113
F-statistic: 60.32 on 4 and 521 DF, p-value: < 2.2e-16
We now interpret the estimates:
According to the model, b_0=-1.02442 means the expected wage for an unmarried man with zero years of education is -$1.02.
Let’s see how many observations have educ_i=female_i=married_i=0:
nrow(df[df$educ == 0 & df$female == 0 & df$married == 0, ])
[1] 0
No observations satisfy this. Therefore we should not trust this estimate.
Interpreting b_1 is done as normal. Holding gender and marital status fixed, increasing education by one year on average increases the wage by 49 cents.
To interpret b_2 we need to be careful because female also appears in the interaction. When the married dummy equals zero, the interaction term drops out and we can interpret the variable as normal. So b_2 is the average difference in wage between unmarried women and unmarried men, holding education fixed: unmarried women on average earn 37 cents less than unmarried men. The wage gap is therefore 37 cents for unmarried women.
To interpret b_3 we need to be careful because married also appears in the interaction. When the female dummy equals zero, the interaction term drops out and we can interpret the variable as normal. So b_3 is the average difference in wage between married men and unmarried men, holding education fixed. Holding education fixed, married men on average earn $2.64 more than unmarried men.
Finally, for b_4 we recall that above we showed that \beta_4 can be interpreted as the difference in wage gap between married and unmarried women. So holding education fixed, the gender wage gap is $2.83 larger for married women compared to unmarried women.
Let’s consider an example question from this output.
Holding education fixed, do unmarried women earn less than unmarried men (at the 5% level)? Use a p-value approach.
This question is asking if \beta_2< 0, so the null and alternative hypotheses are H_0: \beta_2\geq 0 and H_1:\beta_2<0. Under H_0, the test statistic T=B_2/S_{B_2}\sim t_{n-k-1}. Because the hinge is zero, we can read the test statistic directly from the table: t=-0.8513. However, the p-value in the table is for a two-sided test. We can get the p-value with:
(t <- coef(summary(m))["female", "t value"])
[1] -0.8513
pt(t, m$df.residual)
[1] 0.1974969
The p-value (0.1975) is greater than the significance level (0.05), so we do not reject the null hypothesis. There is not enough evidence to show that unmarried women earn less than unmarried men with the same education level.
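The relationship between the one-sided and two-sided p-values can be checked with a small sketch on simulated data (the variables `x`, `z`, `fit` are illustrative): with a hinge of zero and a negative t statistic, the one-sided p-value for H_1:\beta<0 is half the two-sided p-value in the summary table.

```r
# Simulated regression with a (mildly) negative true slope
set.seed(6)
x <- rnorm(100)
z <- 1 - 0.2 * x + rnorm(100)
fit <- lm(z ~ x)

t_x   <- coef(summary(fit))["x", "t value"]     # t statistic for the slope
p_two <- coef(summary(fit))["x", "Pr(>|t|)"]    # two-sided p-value from the table

# One-sided p-value for H1: beta < 0 (lower tail of the t distribution)
(p_one <- pt(t_x, fit$df.residual))
```

When t_x is negative this equals p_two / 2, which is the shortcut often used when reading one-sided p-values off regression output.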