We now discuss the model assumptions that we require to perform inference, which we will cover for the single-variable case in the next chapter (Chapter 16). These assumptions are almost the same as those of the simple linear regression model, except for Assumption 3. For completeness we go through each individual assumption again here.
15.1 Assumption 1: Linear in Parameters
Assumption 1: Linear in Parameters
In the population model, the dependent variable Y_i is related to the independent variables X_{i1}, \dots, X_{ik} and the error \varepsilon_i according to: Y_i=\beta_0 + \beta_1 X_{i1} + \dots + \beta_k X_{ik} + \varepsilon_i
Again, this assumption means that the process that generates the data in our sample follows this model. That is, Y_i is linear in X_{i1}, X_{i2}, \dots, X_{ik} and the values Y_i are generated according to the model.
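To make this concrete, here is a small simulation sketch (not the textbook data; the variable names and coefficient values are made up for illustration) in which the values Y_i are generated exactly according to a model of this form with k = 2:

# Simulate n = 100 observations from a model with k = 2 regressors:
# Y_i = 1 + 2 X_i1 - 0.5 X_i2 + e_i (coefficients chosen arbitrarily)
set.seed(123)
n   <- 100
x1  <- rnorm(n, mean = 5, sd = 2)
x2  <- rnorm(n)
eps <- rnorm(n)                      # the unobserved error term
y   <- 1 + 2 * x1 - 0.5 * x2 + eps  # y is linear in the parameters
head(data.frame(y, x1, x2))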
15.2 Assumption 2: Random Sampling
Assumption 2: Random Sampling
We have a random sample of size n, \left( \left( x_{11},\dots,x_{1k},y_1 \right),\dots,\left( x_{n1},\dots,x_{nk},y_n \right) \right), following the population model in Assumption 1.
This assumption means that the sample of data we observe was generated according to the model Y_i=\beta_0+\beta_1 X_{i1}+\dots+\beta_k X_{ik}+\varepsilon_i. The values y_i that we observe are related to the unknown population parameters, the observed x_{i1},\dots,x_{ik}, and the unobserved error \varepsilon_i according to y_i=\beta_0+\beta_1 x_{i1}+\dots+\beta_k x_{ik}+\varepsilon_i, where \varepsilon_i is independent across observations i.
15.3 Assumption 3: No Perfect Collinearity
This assumption is now different from the SLR model:
Assumption 3: No Perfect Collinearity
In the sample, none of the independent variables are constant and there are no exact linear relationships among the independent variables.
The first part of this assumption is the same as before, now applied to each individual x variable: each variable in the regression must have a standard deviation greater than zero.
The second part means that we should not be able to write one of the variables as a linear function of one (or more) of the other variables, holding exactly for every observation.
We will explain this second part using an example. The dataset clothing-exp.csv contains data on a random sample of households with the following variables:
clothing_exp: Annual clothing expenditure of the household (in €000).
hh_inc: Annual household income (in €000).
num_kids: Number of children in the household.
hh_size: Total number of people in the household.
Let’s estimate a regression model trying to explain clothing expenditure with household income, the number of children, and the total number of people in the household:
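The output below can be reproduced with code along these lines (a sketch: the object names df and model are our choices, and we assume clothing-exp.csv is in the working directory):

# Read in the data and fit the regression
df <- read.csv("clothing-exp.csv")
model <- lm(clothing_exp ~ hh_inc + num_kids + hh_size, data = df)
summary(model)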
Call:
lm(formula = clothing_exp ~ hh_inc + num_kids + hh_size, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.27225 -0.05878 -0.00765 0.05767 0.43981
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0125930 0.0232879 -0.541 0.589
hh_inc 0.0822021 0.0004423 185.861 <2e-16 ***
num_kids 0.0108057 0.0137232 0.787 0.432
hh_size 0.0119808 0.0116495 1.028 0.305
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1031 on 296 degrees of freedom
Multiple R-squared: 0.9921, Adjusted R-squared: 0.9921
F-statistic: 1.246e+04 on 3 and 296 DF, p-value: < 2.2e-16
For practice, let’s interpret the coefficients from the sample regression equation:
\widehat{y}_i = -0.01259 + 0.08220\, x_{i1} + 0.01081\, x_{i2} + 0.01198\, x_{i3}
The estimate of the intercept (b_0) says that a household with zero income and nobody in it spends on average -€12.59 per year on clothing. Let’s check the summary statistics of the explanatory variables:
summary(df)
clothing_exp hh_inc num_kids hh_size
Min. :0.910 Min. :10.90 Min. :0.0000 Min. :1.000
1st Qu.:1.677 1st Qu.:20.50 1st Qu.:0.0000 1st Qu.:2.000
Median :2.270 Median :27.33 Median :0.0000 Median :2.000
Mean :2.531 Mean :30.43 Mean :0.8733 Mean :2.743
3rd Qu.:3.085 3rd Qu.:36.82 3rd Qu.:2.0000 3rd Qu.:4.000
Max. :6.690 Max. :83.38 Max. :5.0000 Max. :7.000
Household income and household size are never zero in the data. Because we never observe x_{i1}=x_{i2}=x_{i3}=0, this estimate is an extrapolation far outside the data and is not reliable. It doesn’t make much sense either, because an unoccupied house does not have anyone in it to buy clothes (especially if they have no income!).
For b_1, increasing household income by €1,000, holding family composition fixed, increases clothing expenditure by €82.20 on average. For b_2, increasing the number of children by 1, holding income and the total household size fixed (i.e. replacing an adult with a child), increases clothing expenditure by €10.81 on average. For b_3, increasing the household size by 1, holding income and the number of children fixed (i.e. adding an adult), increases clothing expenditure by €11.98 on average.
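As a quick sanity check on the interpretation of b_1, we can compare the model’s predictions for two otherwise identical households whose incomes differ by €1,000 (a sketch, using the model object assumed earlier; the example household values are made up):

# Two households identical except income differs by 1 (i.e. €1,000)
new_hh <- data.frame(hh_inc = c(30, 31), num_kids = c(1, 1), hh_size = c(3, 3))
diff(predict(model, newdata = new_hh))  # approx. 0.0822 (€000), i.e. €82.20

The predicted difference equals b_1 exactly, because the model is linear in the parameters.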
Suppose now we wanted to create a new variable to add to this model: the number of adults. We can create this variable in R by subtracting the number of children from the total household size. Let’s try this:
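In R this might look as follows (a sketch; the object names num_adults and model2 are our choices):

# Number of adults = household size minus number of children
df$num_adults <- df$hh_size - df$num_kids
model2 <- lm(clothing_exp ~ hh_inc + num_kids + hh_size + num_adults, data = df)
summary(model2)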
Call:
lm(formula = clothing_exp ~ hh_inc + num_kids + hh_size + num_adults,
data = df)
Residuals:
Min 1Q Median 3Q Max
-0.27225 -0.05878 -0.00765 0.05767 0.43981
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0125930 0.0232879 -0.541 0.589
hh_inc 0.0822021 0.0004423 185.861 <2e-16 ***
num_kids 0.0108057 0.0137232 0.787 0.432
hh_size 0.0119808 0.0116495 1.028 0.305
num_adults NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1031 on 296 degrees of freedom
Multiple R-squared: 0.9921, Adjusted R-squared: 0.9921
F-statistic: 1.246e+04 on 3 and 296 DF, p-value: < 2.2e-16
Notice that we don’t get an estimate for num_adults. This is because of perfect collinearity. It’s possible to write: x_{i4} = x_{i3} - x_{i2} for all i which means there is an exact linear relationship between some of the independent variables.
To satisfy Assumption 3 we should not be able to write one variable as a linear function of the other explanatory variables, with the relationship holding exactly for every observation in the dataset.
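If you are unsure whether an exact linear relationship is lurking among your regressors, R can help detect it. A sketch, using the df and model2 objects assumed above:

# First part of the assumption: every variable varies (all sds > 0)
sapply(df[, c("hh_inc", "num_kids", "hh_size", "num_adults")], sd)
# Second part: compare the rank of the model matrix to its column count
X <- model.matrix(~ hh_inc + num_kids + hh_size + num_adults, data = df)
ncol(X)       # 5 columns (including the intercept)
qr(X)$rank    # rank 4 < 5, so the model matrix is rank deficient
alias(model2) # reveals the dependency: num_adults = hh_size - num_kids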
15.4 Assumption 4: Zero Conditional Mean
Assumption 4: Zero Conditional Mean
The error \varepsilon_i has an expected value of zero given any value of the explanatory variables, i.e. \mathbb{E}\left[ \varepsilon_i|X_{i1},\dots,X_{ik} \right]=0 for all X_{i1},\dots,X_{ik}.
This assumption, like before, implies that the error term cannot be correlated with any of the explanatory variables. Because it restricts the conditional mean of the error, it also rules out nonlinear relationships between the error and the explanatory variables.
15.5 Assumption 5: Homoskedasticity
Assumption 5: Homoskedasticity
The error \varepsilon_i has the same variance given any value of the explanatory variables. In other words:
\text{Var}\left( \varepsilon_i| x_{i1},\dots,x_{ik} \right)=\sigma_\varepsilon^2
Just like in the SLR model, this means that the dispersion of the error terms should not vary with any of the explanatory variables.
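Although the errors themselves are unobserved, a common informal check uses the residuals as stand-ins: plot them against the fitted values and look for a fan or funnel shape (a sketch, using the model object assumed earlier):

# Residuals vs fitted values; a fan/funnel shape would suggest
# heteroskedasticity
plot(fitted(model), resid(model),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)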
15.6 Assumption 6: Normality
Assumption 6: Normality
Conditional on x_{i1},\dots,x_{ik}, the error \varepsilon_i is normally distributed.
This assumption, combined with Assumptions 4 and 5, implies:
\varepsilon_i | x_{i1},\dots,x_{ik} \sim \mathcal{N}\left( 0,\sigma_\varepsilon^2 \right)
In words: \varepsilon_i conditional on x_{i1},\dots,x_{ik} follows a normal distribution with a zero mean and variance \sigma_\varepsilon^2.
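As with homoskedasticity, this assumption can be informally assessed using the residuals, for example with a normal Q-Q plot (a sketch, using the model object assumed earlier):

# Normal Q-Q plot of the residuals; points close to the reference line
# are consistent with normally distributed errors
qqnorm(resid(model))
qqline(resid(model))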