24  Dummy Variables

24.1 Introduction

Very often we have categorical variables that can take on only two values. Examples include:

  • Yes/no questions in a survey.
  • Gender (at birth).
  • Whether you have a college degree or not.

Because these are categorical variables and not numeric variables, we cannot include them in our regression model directly. However, we can create a numeric variable that encodes the information from the categorical variable. Such a variable is called a dummy variable.

A dummy variable is a variable that =1 if something is true and =0 if it is false:

  • For the yes/no questions, we can create a variable that =1 for “yes” responses and =0 for “no” responses.
  • For the gender variable, we can create a variable that =1 if the observation is female and =0 if male. Such a variable is called a “female dummy”.
    • We could alternatively create a “male dummy” that =1 for male and =0 for female.
  • For the college degree variable, we can create a variable that =1 if the observation has a college degree and =0 if not. Such a variable is called a “college degree dummy”.
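
In R we can create such dummies with ifelse() or a logical comparison. Here is a minimal sketch with made-up data (the data frame survey and its columns are hypothetical):

survey <- data.frame(answer = c("yes", "no", "yes"),
                     gender = c("female", "male", "female"))
survey$yes    <- ifelse(survey$answer == "yes", 1, 0)   # yes/no dummy
survey$female <- as.integer(survey$gender == "female")  # female dummy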

24.2 Theory

Consider a simple linear regression model with a dummy variable:

\mathbb{E}\left[ Y_i|x_i \right]=\beta_0 + \beta_1 x_i

  • When the dummy variable equals 0, the expected value of Y_i is \mathbb{E}\left[ Y_i|x_i =0\right]=\beta_0. Call this \mu_0.
  • When the dummy variable equals 1, the expected value of Y_i is \mathbb{E}\left[ Y_i|x_i =1\right]=\beta_0+\beta_1. Call this \mu_1.

The difference in means between the two groups is then \mu_1-\mu_0=\left( \beta_0+\beta_1 \right)-\beta_0=\beta_1. Therefore we can estimate this regression model to estimate the difference in means, and hypothesis tests on \beta_1 are equivalent to hypothesis tests for the difference in means.
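
We can check this equivalence with a quick simulation on made-up data (the coefficient values below are arbitrary):

set.seed(42)
x <- rbinom(1000, 1, 0.5)          # a dummy variable
y <- 3 + 1.5 * x + rnorm(1000)
coef(lm(y ~ x))["x"]               # estimated slope
mean(y[x == 1]) - mean(y[x == 0])  # identical to the slope estimate

With a single dummy regressor the two numbers are not just close but identical: the fitted values of the regression are exactly the two group means.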

24.3 Dummy Variable Trap

Suppose we created two variables:

  1. x_{i1} is a female dummy that =1 for females and =0 for males.
  2. x_{i2} is a male dummy that =1 for males and =0 for females.

We could use either one of these to estimate the model above and get the difference in means. But what we cannot do is estimate a model with both variables. This is because x_{i1}=1-x_{i2} for every observation (when x_{i1}=0, x_{i2}=1 and vice versa). If we include both variables we run into the problem of perfect collinearity, and R will drop one of the variables, as the sketch below shows. This problem is called the dummy variable trap. When we have a qualitative variable with two values we choose one value to code as zero (what we call the base level or base category) and the other as one, and include only that single dummy.
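
A minimal sketch with made-up data; R detects the perfect collinearity and reports NA for the redundant dummy:

set.seed(1)
female <- rbinom(20, 1, 0.5)  # female dummy
male   <- 1 - female          # male dummy: perfectly collinear with female
y      <- 5 - 2 * female + rnorm(20)
coef(lm(y ~ female + male))   # the coefficient on male comes back NA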

24.4 Dummy Variables in R

The example datasets we have worked with so far do not contain categorical variables, so we will use a new dataset to illustrate how to estimate and interpret a model with a dummy variable.

The dataset wages2.csv contains wage data for n=526 people from the 1976 Current Population Survey in the US.

The variables are:

  • wage: Average hourly earnings (in USD).
  • educ: Years of education.
  • female: Female dummy.
  • married: Married dummy.

We will use these data to test (with \alpha=0.05) if the average hourly wage of men is more than $2.00 larger than the mean hourly wage of women.

Mathematically, we want to test if \mu_0-2>\mu_1. In words: the population mean hourly wage for men minus 2 is greater than the population mean hourly wage for women. This will be our H_1. Rewriting this as \mu_1-\mu_0<-2 and recalling that \beta_1=\mu_1-\mu_0, we can use a simple linear regression model with a female dummy to test whether \beta_1 < -2.

Let’s estimate the regression model in R:

df <- read.csv("wages2.csv")
m <- lm(wage ~ female, data = df)
summary(m)

Call:
lm(formula = wage ~ female, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5995 -1.8495 -0.9877  1.4260 17.8805 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.0995     0.2100  33.806  < 2e-16 ***
female       -2.5118     0.3034  -8.279 1.04e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.476 on 524 degrees of freedom
Multiple R-squared:  0.1157,    Adjusted R-squared:  0.114 
F-statistic: 68.54 on 1 and 524 DF,  p-value: 1.042e-15

Let’s interpret the coefficient estimates before running the test. The intercept is the estimate of \mathbb{E}\left[Y_i|x_i=0\right]=\beta_0: the average wage of men in the data is $7.10. The slope estimate is the estimated difference between the mean hourly wage of women and the mean hourly wage of men: women on average earn $2.51 less than men in the data.

We can also get these numbers by calculating the means by group directly:

mean(df$wage[df$female == 0])
[1] 7.099489
mean(df$wage[df$female == 1])
[1] 4.587659

The difference in means is then:

mean(df$wage[df$female == 1]) - mean(df$wage[df$female == 0])
[1] -2.51183

which corresponds to the estimate of the slope.

We could also get the means by group using the aggregate() function:

aggregate(wage ~ female, data = df, FUN = mean)
  female     wage
1      0 7.099489
2      1 4.587659
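
A third option is tapply(), which applies a function to subsets of a vector defined by a grouping variable:

tapply(df$wage, df$female, mean)  # same group means as above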

We are now ready to perform the hypothesis test. We set up the null and alternative hypothesis:

\begin{split}
H_0&: \beta_1 \geq -2 \\
H_1&: \beta_1 < -2 \\
\end{split}

Under H_0, the test statistic

T=\frac{B_1-\left( -2 \right)}{S_{B_1}}

follows a t distribution with n-2=524 degrees of freedom.

Let’s calculate the value of the test statistic in R:

b_1 <- coef(summary(m))["female", "Estimate"]
s_b_1 <- coef(summary(m))["female", "Std. Error"]
(t <- (b_1 + 2) / s_b_1)
[1] -1.686931

This is a lower tail test. If we are using the critical value method, we reject H_0 if t\leq t_{\alpha,n-2}, where t_{\alpha,n-2} is the \alpha quantile of the t distribution with n-2 degrees of freedom. We can calculate the critical value in R with:

(cv <- qt(0.05, m$df.residual))
[1] -1.647767
t < cv
[1] TRUE

The test statistic is smaller than the critical value (lies in the rejection region) so we reject the null hypothesis.

If we are using the p-value method we can calculate the p-value with:

(pval <- pt(t, m$df.residual))
[1] 0.04610592
pval < 0.05
[1] TRUE

The p-value (0.0461) is smaller than the significance level (0.05) so we reject the null hypothesis.

In both cases we reject the null hypothesis. Thus, at the 5% significance level, there is sufficient evidence for the claim that the mean hourly wage of men is more than $2.00 higher than the mean hourly wage of women.
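
As a cross-check, the same test can be run directly as a two-sample t-test with t.test(). Setting var.equal = TRUE matches the pooled variance the regression uses; note that t.test() computes the group-0 mean minus the group-1 mean, so the hypothesis is stated with the opposite sign:

t.test(wage ~ female, data = df, mu = 2, alternative = "greater", var.equal = TRUE)

This should produce a test statistic of +1.687 (the sign of ours flipped) and the same p-value.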

24.5 Multiple Linear Regression with Dummy Variables

We can also use dummy variables in a multiple linear regression model. Using the same data, let’s see if these differences in wages can be explained by different levels of educational attainment. To do this we want to compare the average wages of women and men with the same level of education.

Let x_{i1} be years of education and x_{i2} be the female dummy. The expected wage for women given the education level is:

\mathbb{E}\left[ Y_i|x_{i1},x_{i2}=1 \right]=\beta_0 +\beta_1x_{i1} + \beta_2 \times 1 =\beta_0 +\beta_1x_{i1} + \beta_2

The expected wage for men given the education level is:

\mathbb{E}\left[ Y_i|x_{i1},x_{i2}=0 \right]=\beta_0 +\beta_1x_{i1} + \beta_2 \times 0=\beta_0 +\beta_1x_{i1}

Taking the difference (women minus men) yields \beta_2: the difference in mean wages between women and men holding education fixed.

Let’s estimate the model in R:

df <- read.csv("wages2.csv")
m <- lm(wage ~ educ + female, data = df)
summary(m)

Call:
lm(formula = wage ~ educ + female, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.9890 -1.8702 -0.6651  1.0447 15.4998 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.62282    0.67253   0.926    0.355    
educ         0.50645    0.05039  10.051  < 2e-16 ***
female      -2.27336    0.27904  -8.147 2.76e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.186 on 523 degrees of freedom
Multiple R-squared:  0.2588,    Adjusted R-squared:  0.256 
F-statistic: 91.32 on 2 and 523 DF,  p-value: < 2.2e-16

The estimated coefficient on the female dummy is now -2.27, compared to -2.51 before. This means that women in this sample on average earned $2.27 less than men of the same education level. Differences in educational attainment therefore explain only part of the raw wage gap.
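
One way to visualize the role of the two coefficients is to plot the parallel fitted lines the model implies. A sketch using base R graphics:

# Scatter plot of wage against education, colored by the female dummy
plot(df$educ, df$wage, col = ifelse(df$female == 1, "red", "blue"),
     xlab = "Years of education", ylab = "Hourly wage (USD)")
b <- coef(m)
abline(a = b["(Intercept)"], b = b["educ"], col = "blue")               # men
abline(a = b["(Intercept)"] + b["female"], b = b["educ"], col = "red")  # women

The two lines share the slope on educ; the vertical distance between them is the coefficient on female.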