15  MLR Model Assumptions

We now discuss the model assumptions that we require to perform inference, which we will cover for the single-variable case in the next chapter (Chapter 16). These assumptions are almost the same as those of the simple linear regression model, except for assumption 3. For completeness, we go through each individual assumption again here.

15.1 Assumption 1: Linear in Parameters

Assumption 1: Linear in Parameters

In the population model, the dependent variable Y_i is related to the independent variables X_{i1}, \dots, X_{ik} and the error \varepsilon_i according to: Y_i=\beta_0 + \beta_1 X_{i1} + \dots + \beta_k X_{ik} + \varepsilon_i

Again, this assumption means that the process that generates the data in our sample follows this model. That is, Y_i is linear in X_{i1}, X_{i2}, \dots, X_{ik} and the values Y_i are generated according to the model.

15.2 Assumption 2: Random Sampling

Assumption 2: Random Sampling

We have a random sample of size n, \left(\left( x_{11},\dots,x_{1k},y_1 \right),\dots,\left( x_{n1},\dots,x_{nk},y_n \right)\right), following the population model in Assumption 1.

This assumption means that the sample of data we observe was generated according to the model Y_i=\beta_0+\beta_1 X_{i1}+\dots+\beta_k X_{ik} +\varepsilon_i. The values y_i that we observe are related to the unknown population parameters, the observed x_{i1}, \dots, x_{ik} and the unobserved error \varepsilon_i according to y_i=\beta_0+\beta_1x_{i1}+\dots+\beta_kx_{ik} +\varepsilon_i, where the \varepsilon_i are independent across observations.
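To make this concrete, here is a minimal simulation sketch with k = 2 and made-up parameter values, showing a random sample generated according to the population model:

set.seed(123)                               # for reproducibility
n <- 300                                    # sample size (made up)
beta0 <- 1; beta1 <- 0.5; beta2 <- -0.2     # hypothetical population parameters
x1 <- runif(n, 10, 80)                      # explanatory variables
x2 <- rpois(n, 2)
eps <- rnorm(n, mean = 0, sd = 0.5)         # errors, independent across i
y <- beta0 + beta1 * x1 + beta2 * x2 + eps  # y generated by the model
coef(lm(y ~ x1 + x2))                       # estimates close to the true betas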

15.3 Assumption 3: No Perfect Collinearity

This assumption is now different from the SLR model:

Assumption 3: No Perfect Collinearity

In the sample, none of the independent variables are constant and there are no exact linear relationships among the independent variables.

The first part of this assumption is the same as before, but it must now hold for each individual x variable: every variable in the regression needs a standard deviation greater than zero in the sample.
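A quick way to verify this in R is to compute the standard deviation of each explanatory variable and confirm that none is zero. A minimal sketch, assuming the data frame df from the clothing-expenditure example below has already been loaded:

# standard deviation of each explanatory variable; none may be zero
sapply(df[c("hh_inc", "num_kids", "hh_size")], sd)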

The second part means that we should not be able to write one of the variables as a linear function of one (or more) of the other variables, holding exactly for every observation.

We will explain this second part using an example dataset. We will use the dataset clothing-exp.csv, which contains data on a random sample of households with the following variables:

  • clothing_exp: Annual clothing expenditure of the household (in €000).
  • hh_inc: Annual household income (in €000).
  • num_kids: Number of children in the household.
  • hh_size: Total number of people in the household.

Let’s estimate a regression model trying to explain clothing expenditure with household income, the number of children and the total number of people in the household:

df <- read.csv("clothing-exp.csv")
m <- lm(clothing_exp ~ hh_inc + num_kids + hh_size, data = df)
summary(m)

Call:
lm(formula = clothing_exp ~ hh_inc + num_kids + hh_size, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.27225 -0.05878 -0.00765  0.05767  0.43981 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.0125930  0.0232879  -0.541    0.589    
hh_inc       0.0822021  0.0004423 185.861   <2e-16 ***
num_kids     0.0108057  0.0137232   0.787    0.432    
hh_size      0.0119808  0.0116495   1.028    0.305    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1031 on 296 degrees of freedom
Multiple R-squared:  0.9921,    Adjusted R-squared:  0.9921 
F-statistic: 1.246e+04 on 3 and 296 DF,  p-value: < 2.2e-16

For practice, let’s interpret the coefficients from the sample regression equation: \widehat{y}_i=-0.01259 + 0.08220 x_{i1} + 0.01081 x_{i2} + 0.01198 x_{i3}, where x_{i1} is household income, x_{i2} the number of children and x_{i3} the household size. The estimate of the intercept (b_0) says that a household with zero income and nobody in it spends on average -€12.59 per year on clothing. Let’s check the summary statistics of the explanatory variables:

summary(df)
  clothing_exp       hh_inc         num_kids         hh_size     
 Min.   :0.910   Min.   :10.90   Min.   :0.0000   Min.   :1.000  
 1st Qu.:1.677   1st Qu.:20.50   1st Qu.:0.0000   1st Qu.:2.000  
 Median :2.270   Median :27.33   Median :0.0000   Median :2.000  
 Mean   :2.531   Mean   :30.43   Mean   :0.8733   Mean   :2.743  
 3rd Qu.:3.085   3rd Qu.:36.82   3rd Qu.:2.0000   3rd Qu.:4.000  
 Max.   :6.690   Max.   :83.38   Max.   :5.0000   Max.   :7.000  

Household income and household size are never zero in the data. Because we never observe x_{i1}=x_{i2}=x_{i3}=0, the intercept estimate is an extrapolation beyond the data and is not reliable. It also doesn’t make much sense: an unoccupied house does not have anyone in it to buy clothes (especially if they have no income!).

For b_1, increasing household income by €1,000, holding family composition fixed, increases clothing expenditure by €82.20 on average. For b_2, increasing the number of children by 1, holding income and the total household size fixed (i.e. replacing an adult with a child), increases clothing expenditure by €10.81 on average. For b_3, increasing the household size by 1, holding income and the number of children fixed (i.e. adding an adult), increases clothing expenditure by €11.98 on average.
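We can sanity-check the last of these interpretations with predict(). In this sketch (the household profile is made up), we add one adult to a household while holding income and the number of children fixed; the difference in fitted values equals b_3:

base  <- data.frame(hh_inc = 30, num_kids = 1, hh_size = 3)  # hypothetical household
plus1 <- data.frame(hh_inc = 30, num_kids = 1, hh_size = 4)  # one extra adult
predict(m, newdata = plus1) - predict(m, newdata = base)     # b_3 = 0.0120, i.e. about €12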

Suppose now we wanted to create a new variable to add to this model: the number of adults. We can create this variable in R by subtracting the number of children from the total household size. Let’s try this:

df$num_adults <- df$hh_size - df$num_kids
m <- lm(clothing_exp ~ hh_inc + num_kids + hh_size + num_adults, data = df)
summary(m)

Call:
lm(formula = clothing_exp ~ hh_inc + num_kids + hh_size + num_adults, 
    data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.27225 -0.05878 -0.00765  0.05767  0.43981 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.0125930  0.0232879  -0.541    0.589    
hh_inc       0.0822021  0.0004423 185.861   <2e-16 ***
num_kids     0.0108057  0.0137232   0.787    0.432    
hh_size      0.0119808  0.0116495   1.028    0.305    
num_adults          NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1031 on 296 degrees of freedom
Multiple R-squared:  0.9921,    Adjusted R-squared:  0.9921 
F-statistic: 1.246e+04 on 3 and 296 DF,  p-value: < 2.2e-16

Notice that we don’t get an estimate for num_adults. This is because of perfect collinearity. It’s possible to write x_{i4} = x_{i3} - x_{i2} for all i, which means there is an exact linear relationship among some of the independent variables.
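Two quick ways to see this in R, sketched with the model m we just fit: alias() reports exact linear dependencies among the regressors, and the rank of the model matrix falls short of its number of columns:

alias(m)              # reports num_adults as hh_size - num_kids
X <- model.matrix(m)  # design matrix including the intercept
ncol(X)               # 5 columns ...
qr(X)$rank            # ... but rank 4: one exact linear dependency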

To satisfy assumption 3, we should not be able to write one variable as a linear function of the other explanatory variables, with the relationship holding exactly for every observation in the dataset.

15.4 Assumption 4: Zero Conditional Mean

Assumption 4: Zero Conditional Mean

The error \varepsilon_i has an expected value of zero given any value of the explanatory variables, i.e. \mathbb{E}\left[ \varepsilon_i|X_{i1},\dots,X_{ik} \right]=0 for all X_{i1},\dots,X_{ik}.

This assumption, like before, implies that the error term cannot be correlated with any of the explanatory variables. It also rules out any nonlinear relationships between the error and the explanatory variables.
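We cannot check this assumption directly because \varepsilon_i is unobserved, but a residuals-vs-fitted plot can reveal the kind of neglected nonlinearity it rules out. A quick sketch using the fitted model m from the clothing expenditure example:

# residuals vs fitted values: a curved pattern suggests a missed nonlinearity
plot(m, which = 1)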

15.5 Assumption 5: Homoskedasticity

Assumption 5: Homoskedasticity

The error \varepsilon_i has the same variance given any value of the explanatory variables. In other words: \text{Var}\left( \varepsilon_i| x_{i1},\dots,x_{ik} \right)=\sigma_\varepsilon^2

Just like in the SLR model, this means that the dispersion of the error terms should not vary with any of the explanatory variables.
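An informal graphical check (not a formal test) is the scale-location plot, which shows whether the spread of the residuals varies with the fitted values:

# scale-location plot: a roughly flat trend is consistent with homoskedasticity
plot(m, which = 3)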

15.6 Assumption 6: Normality

Assumption 6: Normality

The distribution of \varepsilon_i conditional on x_{i1},\dots,x_{ik} is normal.

This assumption, combined with assumptions 4 and 5, implies: \varepsilon_i | x_{i1},\dots,x_{ik} \sim \mathcal{N}\left( 0,\sigma_\varepsilon^2 \right). In words: \varepsilon_i conditional on x_{i1},\dots,x_{ik} follows a normal distribution with mean zero and variance \sigma_\varepsilon^2.
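An informal check is a normal Q-Q plot of the residuals: if the points lie close to the reference line, a normal distribution is a reasonable description of the errors:

# normal Q-Q plot of the standardized residuals
plot(m, which = 2)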