17  MLR Quantifying Model Usefulness

The formulas for SSE, SSR, SST and R^2 are exactly the same as in the SLR model. We show here how to calculate them in R.

17.1 SSE, SSR, SST

\begin{split}
SSE &= \sum_{i=1}^n \left( y_i-\hat{y}_i \right)^2 \\
SSR &= \sum_{i=1}^n \left( \hat{y}_i-\bar{y} \right)^2 \\
SST &= \sum_{i=1}^n \left( y_i-\bar{y} \right)^2 \\
R^2 &= SSR/SST = 1 - SSE/SST
\end{split}

We can calculate them in R using the approaches we saw in Chapter 12. Let’s show how to do this with the wages, education and experience model.

We first estimate a model using only an intercept and call it m1. We then estimate our full model and call it m2. Finally, we use the anova() function to compare the two models:

df <- read.csv("wages1.csv")
m1 <- lm(wage ~ 1, data = df)
m2 <- lm(wage ~ educ + exper, data = df)
anova(m1, m2)
Analysis of Variance Table

Model 1: wage ~ 1
Model 2: wage ~ educ + exper
  Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
1    525 7160.4                                 
2    523 5548.2  2    1612.2 75.99 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The SST is 7160.4, the SSE is 5548.2 and the SSR is 1612.2.

More generally, if Model 1 regresses our dependent variable on only a constant and Model 2 regresses it on all k of the independent variables, then the SST, SSE and SSR appear in the anova() output in the following positions:

m1 <- lm(y ~ 1, data = df)
m2 <- lm(y ~ x1 + x2, data = df)
anova(m1, m2)
Analysis of Variance Table

Model 1: y ~ 1
Model 2: y ~ x1 + x2
  Res.Df    RSS Df Sum of Sq
1    n-1    SST                                  
2  n-k-1    SSE  k       SSR 
---
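
If we want the numbers themselves rather than just the printed table, one option (a small sketch, reusing the m1 and m2 objects estimated above) is to store the anova() result and pick out the relevant entries of its columns:

a <- anova(m1, m2)
sst <- a$RSS[1]          # residual SS of the intercept-only model = SST
sse <- a$RSS[2]          # residual SS of the full model = SSE
ssr <- a$`Sum of Sq`[2]  # drop in residual SS between the two models = SSR
c(SST = sst, SSE = sse, SSR = ssr)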

We can also use the other approaches we saw in Chapter 12:

m <- lm(wage ~ educ + exper, data = df)

For the SST we can do either of the following (recall that var() divides by n-1, so multiplying it by n-1 recovers the sum of squared deviations):

(nrow(df) - 1) * var(df$wage)
[1] 7160.414
sum((df$wage - mean(df$wage))^2)
[1] 7160.414

For the SSE we can do either:

deviance(m)
[1] 5548.16
sum(m$residuals^2)
[1] 5548.16

For the SSR we can use the above results to get:

sst <- (nrow(df) - 1) * var(df$wage)
sse <- deviance(m)
ssr <- sst - sse
ssr
[1] 1612.255

In each case we get the same numbers as the anova() function.
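
As a further check, we can also compute the three sums of squares directly from the definitions at the start of this chapter, using the fitted values from the model (a short sketch, reusing the df and m objects from above; note that SSE + SSR reproduces the SST):

y.hat <- fitted(m)                 # fitted values from the full model
y.bar <- mean(df$wage)             # sample mean of the dependent variable
sum((df$wage - y.hat)^2)           # SSE
sum((y.hat - y.bar)^2)             # SSR
sum((df$wage - y.bar)^2)           # SST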

17.2 R^2

The R^2 is also shown in the default summary() output:

df <- read.csv("wages1.csv")
m <- lm(wage ~ educ + exper, data = df)
summary(m)

Call:
lm(formula = wage ~ educ + exper, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5532 -1.9801 -0.7071  1.2030 15.8370 

Coefficients:
            Estimate Std. Error t value       Pr(>|t|)    
(Intercept) -3.39054    0.76657  -4.423 0.000011846645 ***
educ         0.64427    0.05381  11.974        < 2e-16 ***
exper        0.07010    0.01098   6.385 0.000000000378 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.257 on 523 degrees of freedom
Multiple R-squared:  0.2252,    Adjusted R-squared:  0.2222 
F-statistic: 75.99 on 2 and 523 DF,  p-value: < 2.2e-16

The R^2 is 0.2252: in our model, education and experience explain 22.52% of the variation in the wages data, and the remaining 77.48% is unexplained.

We can also extract this number from the output with:

summary(m)$r.squared
[1] 0.2251622
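
We can also reproduce this number from the sums of squares computed in the previous section (a quick check, reusing the sse, ssr and sst objects defined above):

1 - sse / sst   # 1 - SSE/SST
ssr / sst       # SSR/SST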

One thing to note is that in the multiple regression the R^2 is no longer the square of the sample correlation coefficient between the dependent and the independent variable; that is only true for the simple linear regression.
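
A quick sketch illustrating this with our data: squaring the sample correlation between wage and educ reproduces the R^2 of the simple regression of wage on educ, but not the R^2 of our multiple regression:

cor(df$wage, df$educ)^2                        # squared sample correlation
summary(lm(wage ~ educ, data = df))$r.squared  # same number: simple-regression R^2
summary(m)$r.squared                           # multiple-regression R^2 differs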

17.3 Adjusted R^2

Recall that the formula for the R^2 is R^2=1-\frac{SSE}{SST} and that it measures the proportion of the variation in the y-data that is explained by the independent variables. If we add more and more variables to our model, the sum of squared errors never increases with each variable added, so the R^2 never falls and typically rises. This could lead us to add too many variables to our model (a problem called “overfitting”).

You may have noticed that the summary() output also gives another number called the Adjusted R-squared. In our example it is 0.2222. This adjusted R^2 is one tool that helps us build models that avoid this overfitting problem.1

The formula for the adjusted R^2 is:

R_{adj}^2 = 1 - \frac{SSE/\left( n-k-1 \right)}{SST/\left( n-1 \right)}

Unlike the ordinary R^2, the adjusted R^2 can decrease when we add a new variable: it falls whenever the new variable does not explain enough of the variation in the y-data to offset the loss of a degree of freedom.

If we want to extract the adjusted R^2 from the R output we can use the command:

summary(m)$adj.r.squared
[1] 0.2221991
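
We can verify that this matches the formula by computing it by hand (a small sketch, reusing the sse and sst objects defined above):

n <- nrow(df)   # number of observations
k <- 2          # number of independent variables (educ and exper)
1 - (sse / (n - k - 1)) / (sst / (n - 1))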

The adjusted R^2 is always smaller than the ordinary R^2 and can even be negative if the explanatory power of the model is very poor.
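
To see the overfitting issue in action, here is a small sketch (purely an illustration; the noise variable is made up and unrelated to wages by construction) that adds a column of random noise to the model. The ordinary R^2 can only rise when the extra variable is added, while the adjusted R^2 will typically fall:

set.seed(123)                      # make the illustration reproducible
df$noise <- rnorm(nrow(df))        # a regressor of pure random noise
m3 <- lm(wage ~ educ + exper + noise, data = df)
summary(m3)$r.squared              # never lower than for m
summary(m3)$adj.r.squared          # typically lower than for m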


  1. If you are concerned about overfitting, there are much better techniques (such as cross-validation and bootstrap aggregating) than the adjusted R^2, which is only a very simple tool.