25  Qualitative Variables with Multiple Levels

25.1 Introduction

In Chapter 24 we learned how to include qualitative variables with two values - such as gender - in a regression model. A single dummy variable was enough to cover all the possible values the variable could take: =1 for females and =0 for males.

But often we have data with qualitative variables that can take on more than two values. For example we could have variables like:

  • Educational attainment: High school, Bachelor, Master, PhD.
  • Industry: Primary sector (e.g. agriculture), Manufacturing, Services.
  • Region: US States, Provinces of the Netherlands.

What we will learn in this chapter is how to include this kind of information in a linear regression model.

25.2 Theory

Suppose we are interested in the impact of industry sector on wages. We have a sample of wages Y_i for n individuals and what sector they work in (sector_i): primary, manufacturing or services.

25.2.1 The Incorrect Approach

Suppose for the moment we decided to follow the logic in Chapter 24 and created a numeric variable x_{i1} with the sector information as follows:

  • =0 if sector_i=\texttt{primary}.
  • =1 if sector_i=\texttt{manufacturing}.
  • =2 if sector_i=\texttt{services}.

Suppose also we used this variable to estimate the regression model: \mathbb{E}[Y_i | x_{i1}] = \beta_0 + \beta_1 x_{i1}. We will now see that this approach is incorrect.

For individuals in the primary sector we have: \mathbb{E}[Y_i | x_{i1}=0] = \beta_0 Therefore \beta_0 is the average wage of people in the primary sector.

For individuals in the manufacturing sector we have: \mathbb{E}[Y_i | x_{i1}=1] = \beta_0 + \beta_1 This means that \beta_1 is the average difference in wages between people in the manufacturing sector and the primary sector.

But then for individuals in the services sector we have: \mathbb{E}[Y_i | x_{i1}=2] = \beta_0 + 2\beta_1 This means that \beta_1 is also the average difference in wages between people in the services sector and the manufacturing sector! It also means that the difference in wages between services and the primary sector is 2\beta_1.

Using a variable like this means that going from one sector to the next leads to an increase in wage of \beta_1 on average for all sectors. But there is no reason to think that going from primary to manufacturing and manufacturing to services will lead to the same average increase in wage. This is a very restrictive way to use this variable. We want a more flexible way to use the information about the sector in the model.
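We can see this restriction with a small simulation. The numbers below are made up for illustration (they are not from the chapter's data): the true group means step up by 5 and then by 15, but the single-slope model can only estimate one common step.

```r
# Simulated (hypothetical) wages showing why 0/1/2 coding is restrictive:
# the true steps between adjacent sectors are +5 and +15, but the model
# forces both steps to equal the one slope estimate.
set.seed(1)
sector_num <- rep(c(0, 1, 2), each = 100)   # 0 = primary, 1 = manufacturing, 2 = services
true_means <- c(20, 25, 40)                 # unequal steps: +5 and +15
wage <- true_means[sector_num + 1] + rnorm(300)
m_num <- lm(wage ~ sector_num)
coef(m_num)["sector_num"]  # one slope near 10: a compromise between 5 and 15
```

The estimated slope lands near 10, roughly the average of the two true steps, so it misrepresents both differences.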

25.3 The Correct Approach

Instead of creating one numeric variable from the qualitative variable, what we should do is create a dummy variable for each value of the categorical variable. For the sector example we create 3 variables:

  1. D_{i1}=1 if primary sector and D_{i1}=0 otherwise.
  2. D_{i2}=1 if manufacturing sector and D_{i2}=0 otherwise.
  3. D_{i3}=1 if services sector and D_{i3}=0 otherwise.

We then estimate a regression model using these dummy variables. We cannot include all 3 dummy variables because otherwise we run into the dummy variable trap we encountered in Chapter 24. This is because D_{i1}=1-D_{i2}-D_{i3} always:

  • If D_{i1}=0 then one of D_{i2} or D_{i3} equals 1.
  • If D_{i1}=1 then D_{i2}=D_{i3}=0.
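A toy example (made-up wages, not the chapter's data) shows what happens if we try to include all three dummies alongside an intercept: one column is perfectly collinear and lm() reports NA for it.

```r
# Sketch of the dummy variable trap with toy data: since D3 = 1 - D1 - D2,
# the last dummy is perfectly collinear and lm() drops it (coefficient NA).
sector <- c("primary", "manufacturing", "services",
            "primary", "services", "manufacturing")
y <- c(10, 12, 15, 11, 14, 13)              # made-up wages
D1 <- as.integer(sector == "primary")
D2 <- as.integer(sector == "manufacturing")
D3 <- as.integer(sector == "services")
coef(lm(y ~ D1 + D2 + D3))                  # D3 comes back NA
```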

We need to choose one category to be the base category. Let’s let this be the primary sector. The model we would estimate is then:

\mathbb{E}[Y_i|sector_i]=\beta_0+\beta_1 D_{i2} + \beta_2 D_{i3} For the primary sector we have: \mathbb{E}[Y_i|sector_i=\texttt{primary}]=\beta_0 For the manufacturing sector we have: \mathbb{E}[Y_i|sector_i=\texttt{manufacturing}]=\beta_0+\beta_1 For the services sector we have: \mathbb{E}[Y_i|sector_i=\texttt{services}]=\beta_0+\beta_2 So:

  • \beta_1 is the average difference between the manufacturing and primary sectors.
  • \beta_2 is the average difference between the services and primary sectors.
  • \beta_2-\beta_1 is the average difference between the services and manufacturing sectors.

Now the model is much more flexible.
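A quick toy check (made-up wages, hypothetical numbers) confirms the flexibility: with separate dummies for manufacturing and services, the fitted coefficients reproduce each sector's sample mean exactly, so the steps between sectors are no longer forced to be equal.

```r
# Toy sketch: dummy coding recovers each group's sample mean exactly.
sector <- rep(c("primary", "manufacturing", "services"), each = 4)
y <- c(10, 11, 9, 10,  13, 12, 14, 13,  20, 19, 21, 20)  # made-up wages
D2 <- as.integer(sector == "manufacturing")
D3 <- as.integer(sector == "services")
m_dummies <- lm(y ~ D2 + D3)
coef(m_dummies)
# intercept = mean of primary (10); intercept + b1 = mean of manufacturing (13);
# intercept + b2 = mean of services (20): steps of +3 and +10 are both allowed
```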

25.4 Qualitative Variables in R

To show how to do this in R we will use a dataset on the average house prices Y_i by municipality (gemeente) in the Netherlands in 2022 and the province each municipality is in, prov_i. We will use this dataset to see how much location (province) impacts house prices.

To do this we create 12 dummy variables, one for each province:

  • D_{i1}=1 if prov_i=\texttt{Drenthe} and zero otherwise.
  • D_{i2}=1 if prov_i=\texttt{Flevoland} and zero otherwise.
  • \vdots
  • D_{i12}=1 if prov_i=\texttt{Zuid-Holland} and zero otherwise.

Because D_{i1}=1-D_{i2}-D_{i3}-\dots-D_{i12} for all i, we need to exclude one province to avoid the dummy variable trap. Let’s choose Drenthe (D_{i1}) to be the base level.

The model is then: \mathbb{E}\left[ Y_i|prov_i \right]= \beta_0 + \beta_1 D_{i2} + \beta_2 D_{i3} +\dots+ \beta_{11} D_{i12} To get the data ready we merge two datasets by municipality: one with the house prices and one mapping each municipality to its province.

We need to be careful that the house prices data uses ; for separators and commas for decimal points:

df1 <- read.csv("cpb-house-prices.csv", sep = ";", dec = ",")
names(df1) <- c("municipality", "house_price_2022", "house_price_2021")
df2 <- read.csv("municipality-province.csv")
names(df2) <- c("municipality", "province")
df <- merge(df1, df2, by = "municipality")

Next, we could spend a lot of time creating 11 dummy variables, one for each province, and then typing all 11 into the formula in the lm() function. The good news is that there is no need to do this in R. If we provide a character vector to the lm() function, R will interpret it as a factor variable (a qualitative variable) and automatically create these dummies. R will also choose one level to be the base level automatically. Unless the variable is already a factor, R will choose the first value alphabetically (here Drenthe) to be the base level.
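We can peek at the dummies R builds internally with model.matrix(). The sketch below uses a tiny toy data frame (not the house-price file) to keep the output readable:

```r
# Sketch of the design matrix R constructs from a character variable:
# the first alphabetical level (Drenthe) is absorbed into the intercept
# and every other level gets its own dummy column.
toy <- data.frame(province = c("Drenthe", "Flevoland", "Drenthe", "Utrecht"))
model.matrix(~ province, data = toy)
```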

So estimating this model is as simple as:

m <- lm(house_price_2022 ~ province, data = df)
summary(m)

Call:
lm(formula = house_price_2022 ~ province, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-270.55  -51.18   -8.15   33.68  577.65 

Coefficients:
                      Estimate Std. Error t value    Pr(>|t|)    
(Intercept)            381.308     27.768  13.732     < 2e-16 ***
provinceFlevoland        7.242     48.095   0.151      0.8804    
provinceFryslân        -10.419     35.848  -0.291      0.7715    
provinceGelderland      53.907     30.862   1.747      0.0816 .  
provinceGroningen      -84.728     41.186  -2.057      0.0405 *  
provinceLimburg        -38.889     32.704  -1.189      0.2352    
provinceNoord-Brabant   61.476     30.599   2.009      0.0453 *  
provinceNoord-Holland  159.944     31.326   5.106 0.000000559 ***
provinceOverijssel      -4.944     33.781  -0.146      0.8837    
provinceUtrecht        137.172     33.570   4.086 0.000055142 ***
provinceZeeland        -43.208     38.507  -1.122      0.2626    
provinceZuid-Holland    62.294     30.982   2.011      0.0452 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 96.19 on 329 degrees of freedom
Multiple R-squared:  0.3251,    Adjusted R-squared:  0.3026 
F-statistic: 14.41 on 11 and 329 DF,  p-value: < 2.2e-16

Let’s interpret these estimates. We first note that: \begin{split} \mathbb{E}\left[ Y_i|prov_i=\texttt{Drenthe} \right]=& \beta_0+\beta_1 \times 0 +\beta_2 \times 0+ \dots + \beta_{10}\times 0+\beta_{11} \times 0 = \beta_0\\ \mathbb{E}\left[ Y_i|prov_i=\texttt{Flevoland} \right]=& \beta_0+\beta_1 \times 1 + \beta_2 \times 0+\dots + \beta_{10}\times 0+\beta_{11} \times 0 = \beta_0+\beta_1\\ &\vdots \\ \mathbb{E}\left[ Y_i|prov_i=\texttt{Zuid-Holland} \right]=& \beta_0+\beta_1 \times 0 + \beta_2 \times 0+\dots + \beta_{10}\times 0+\beta_{11} \times 1 = \beta_0+\beta_{11} \\ \end{split} So our estimate of \beta_0 is the average house price in Drenthe in our sample. Because house prices are in thousands of euros, the average house price in Drenthe in our sample is €381,308.33. Our estimate of \beta_0+\beta_1 is the average house price in Flevoland in our sample. This is €381,308.33+€7,241.67=€388,550.00. Our estimate of \beta_1 is therefore the difference in average house price between Flevoland and Drenthe in our sample.

We will now do some example questions with this output.

One example is: “are there any differences in average house prices across provinces (at the 5% level)?”

To do this, let \mu_j be the population average house price in province j=1,2,\dots,12. This question is essentially asking to test: \begin{split} H_0: & \mu_1=\mu_2=\dots=\mu_{12}\\ H_1: & \text{ at least one } \mu_j\neq\mu_k \text{ for } j,k=1,\dots,12\\ \end{split} Using our model with 11 dummy variables (with Drenthe as the base category), this is the same as: \begin{split} H_0 &: \beta_1=\beta_2=\dots=\beta_{11}=0 \\ H_1&: \text{ at least one } \beta_j\neq 0 \text{ for } j=1,2,\dots,11 \end{split} This is just an F-test for testing the model’s usefulness!

Let’s do the F-test as a recap. Under H_0, F\sim F_{k,n-k-1}. We can get the value of the test statistic from the model summary:

summary(m)$fstat
    value     numdf     dendf 
 14.40891  11.00000 329.00000 

The critical value can be found with (using numdf and dendf from above to get the numerator and denominator degrees of freedom):

qf(0.95, 11, 329)
[1] 1.817809

Because the test statistic (14.409) is larger than the critical value (1.818) we reject the null hypothesis. There is sufficient evidence to suggest that the average house prices are different across provinces.

We can also use the p-value approach. The p-value for the F-test is already shown in the summary output, but we could also obtain it manually using:

f_stat <- summary(m)$fstat[1]
(p_val <- 1 - pf(f_stat, 11, 329))
value 
    0 

The p-value (0) is smaller than the significance level (0.05), so we also reject H_0 with this approach.
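As an aside, the 0 printed above is round-off in 1 - pf(): the true p-value is tiny but positive. Computing the upper tail directly avoids the subtraction (the test statistic and degrees of freedom are taken from the output above):

```r
# Upper-tail probability computed directly, avoiding the 1 - pf() round-off
# that prints an exact 0 above.
pf(14.40891, 11, 329, lower.tail = FALSE)
```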

25.5 Specifying the Base Level

The coefficient estimates b_1,\dots,b_{11} are always interpreted as the differences with respect to the base level. Because of this, we may want to specify the base level to help us interpret the results. By default, R chooses Drenthe as the base level. But we may want to use Noord-Brabant or another province as the base level. How can we do this?

To do this we first convert the variable to a factor and then “relevel” the factor variable using the relevel() function specifying the base level. Let’s do this making Noord-Brabant the base level:

df$province <- factor(df$province)
df$province <- relevel(df$province, ref = "Noord-Brabant")
summary(lm(house_price_2022 ~ province, data = df))

Call:
lm(formula = house_price_2022 ~ province, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-270.55  -51.18   -8.15   33.68  577.65 

Coefficients:
                       Estimate Std. Error t value    Pr(>|t|)    
(Intercept)            442.7839    12.8540  34.447     < 2e-16 ***
provinceDrenthe        -61.4756    30.5986  -2.009    0.045343 *  
provinceFlevoland      -54.2339    41.3198  -1.313    0.190252    
provinceFryslân        -71.8950    26.0626  -2.759    0.006130 ** 
provinceGelderland      -7.5682    18.6185  -0.406    0.684647    
provinceGroningen     -146.2039    33.0225  -4.427 0.000012994 ***
provinceLimburg       -100.3646    21.5336  -4.661 0.000004582 ***
provinceNoord-Holland   98.4683    19.3781   5.081 0.000000629 ***
provinceOverijssel     -66.4199    23.1372  -2.871    0.004361 ** 
provinceUtrecht         75.6968    22.8275   3.316    0.001015 ** 
provinceZeeland       -104.6839    29.6136  -3.535    0.000466 ***
provinceZuid-Holland     0.8181    18.8163   0.043    0.965346    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 96.19 on 329 degrees of freedom
Multiple R-squared:  0.3251,    Adjusted R-squared:  0.3026 
F-statistic: 14.41 on 11 and 329 DF,  p-value: < 2.2e-16

Now the intercept is the average house price in Noord-Brabant and all the coefficient estimates are differences relative to Noord-Brabant. For example, houses in Drenthe are on average €61,475.6 cheaper than in Noord-Brabant, while houses in Noord-Holland are on average €98,468.3 more expensive.
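An equivalent sketch of the same idea: the level order set when creating the factor also fixes the base level, because the first level listed becomes the reference category.

```r
# Toy example: with levels = c("b", "a", "c"), "b" becomes the first level
# and therefore the base level in any regression using this factor.
x <- factor(c("a", "b", "c", "b"), levels = c("b", "a", "c"))
levels(x)[1]
```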

25.6 Interaction Terms with Dummy Variables

We can also combine dummy variables with interaction terms. Consider the following model with the wages2.csv data, where Y_i is the hourly wage: \mathbb{E}\left[ Y_i|educ_i,female_i,married_i \right]= \beta_0+ \beta_1 educ_i+ \beta_2 female_i+ \beta_3 married_i+ \beta_4 female_i\times married_i Holding educ_i fixed, there are 4 possible combinations for the female and married dummies: \begin{split} \text{Unmarried men:\qquad} \mathbb{E}\left[ Y_i|educ_i,female_i=0,married_i=0 \right]&= \beta_0+ \beta_1 educ_i \\ \text{Unmarried women:\qquad} \mathbb{E}\left[ Y_i|educ_i,female_i=1,married_i=0 \right]&= \beta_0+ \beta_1 educ_i+\beta_2\\ \text{Married men:\qquad} \mathbb{E}\left[ Y_i|educ_i,female_i=0,married_i=1 \right]&= \beta_0+ \beta_1 educ_i+\beta_3\\ \text{Married women:\qquad} \mathbb{E}\left[ Y_i|educ_i,female_i=1,married_i=1 \right]&= \beta_0+ \beta_1 educ_i+\beta_2+\beta_3+\beta_4\\ \end{split} Holding education fixed:

  • \beta_2 is the average difference in wage between unmarried women and unmarried men.
  • \beta_3 is the average difference in wage between married men and unmarried men.
  • \beta_2+\beta_4 is the average difference in wage between married women and married men.
  • \beta_3+\beta_4 is the average difference in wage between married women and unmarried women.

So:

  • \beta_2 is the wage gap for unmarried women.
  • \beta_2+\beta_4 is the wage gap for married women.
  • \beta_4 can therefore be interpreted as the difference in wage gap between married and unmarried women.

Let’s estimate it in R:

df <- read.csv("wages2.csv")
m <- lm(wage ~ educ + female * married, data = df)
summary(m)

Call:
lm(formula = wage ~ educ + female * married, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.5907 -1.6293 -0.7337  1.1014 14.6606 

Coefficients:
               Estimate Std. Error t value        Pr(>|t|)    
(Intercept)    -1.02442    0.69311  -1.478           0.140    
educ            0.49356    0.04856  10.164         < 2e-16 ***
female         -0.36896    0.43341  -0.851           0.395    
married         2.64107    0.39936   6.613 0.0000000000933 ***
female:married -2.82883    0.55556  -5.092 0.0000004962244 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.065 on 521 degrees of freedom
Multiple R-squared:  0.3165,    Adjusted R-squared:  0.3113 
F-statistic: 60.32 on 4 and 521 DF,  p-value: < 2.2e-16

We now interpret the estimates:

According to the model, b_0=-1.02442 means the expected wage for an unmarried man with zero years of education is -$1.02.

Let’s see how many observations have educ_i=female_i=married_i=0:

nrow(df[df$educ == 0 & df$female == 0 & df$married == 0, ])
[1] 0

No observations satisfy this. Therefore we should not trust this estimate.

Interpreting b_1 is done as normal. Holding gender and marital status fixed, increasing education by one year on average increases the wage by 49 cents.

To interpret b_2 we need to be careful because female also appears in the interaction. When the variable married equals zero, the interaction term drops out and we can interpret the variable as normal. So b_2 is the average difference in wage between unmarried women and unmarried men, holding education fixed: unmarried women on average earn 37 cents less than unmarried men. The wage gap is therefore 37 cents for unmarried women.

To interpret b_3 we need to be careful because married also appears in the interaction. When the female dummy equals zero, then this term drops out and we can interpret the variable as normal. So b_3 is the average difference in wage between married men and unmarried men, holding education fixed. Holding education fixed, married men on average earn $2.64 more than unmarried men.

Finally, for b_4 we recall that above we showed that \beta_4 can be interpreted as the difference in wage gap between married and unmarried women. So holding education fixed, the gender wage gap is $2.83 larger for married women compared to unmarried women.
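The two wage gaps can be recomputed directly from the estimates printed in the summary output above (the values below are copied from that table):

```r
# Wage gaps implied by the fitted coefficients (values from the summary above).
b2 <- -0.36896   # coefficient on female
b4 <- -2.82883   # coefficient on female:married
b2               # gap for unmarried women: about -$0.37
b2 + b4          # gap for married women: about -$3.20
```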

Let’s consider an example question from this output.

Holding education fixed, do unmarried women earn less than unmarried men (at the 5% level)? Use a p-value approach.

This question is asking if \beta_2< 0, so the null and alternative hypotheses are H_0: \beta_2\geq 0 and H_1:\beta_2<0. Under H_0, the test statistic T=B_2/S_{B_2}\sim t_{n-k-1}. Because the hinge is zero, we can read the test statistic directly from the table: t=-0.8513. However, the p-value in the table is for a two-sided test. We can get the p-value with:

(t <- coef(summary(m))["female", "t value"])
[1] -0.8513
pt(t, m$df.residual)
[1] 0.1974969

The p-value (0.1975) is greater than the significance level (0.05), so we do not reject the null hypothesis. There is not enough evidence to show that unmarried women earn less than unmarried men with equal education levels.
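Equivalently, for a negative t statistic the one-sided p-value is simply half the two-sided Pr(>|t|) printed in the summary table. A sketch using the test statistic and residual degrees of freedom from the output above:

```r
# One-sided vs two-sided p-values for the same negative t statistic
# (t and degrees of freedom taken from the summary output above).
t_stat <- -0.8513
df_resid <- 521
2 * pt(t_stat, df_resid)   # two-sided, matches the table's 0.395
pt(t_stat, df_resid)       # one-sided, about 0.1975
```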