The formulas for SSE, SSR, SST and R^2 are exactly the same as in the SLR model. We show here how to calculate them in R.
17.1 SSE, SSR, SST
\begin{split}
SSE &=\sum_{i=1}^n \left( {y}_i-\hat{y}_i \right)^2 \\
SSR &= \sum_{i=1}^n \left( \hat{y}_i-\bar{y} \right)^2\\
SST &=\sum_{i=1}^n \left( {y}_i-\bar{y} \right)^2 \\
R^2 &= SSR/SST = 1 - SSE/SST
\end{split}
We can calculate them in R using the approaches we saw in Chapter 12. Let's show how to do this with the wages, education and experience model.
We first estimate a model using only an intercept and call it m1. We then estimate our full model and call it m2. We then use the anova() function to compare the two models:
df <- read.csv("wages1.csv")
m1 <- lm(wage ~ 1, data = df)
m2 <- lm(wage ~ educ + exper, data = df)
anova(m1, m2)
Analysis of Variance Table
Model 1: wage ~ 1
Model 2: wage ~ educ + exper
  Res.Df    RSS Df Sum of Sq     F    Pr(>F)
1    525 7160.4
2    523 5548.2  2    1612.2 75.99 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The SST is 7160.4, the SSE is 5548.2 and the SSR is 1612.2.
More generally, if Model 1 is a model with our dependent variable and only a constant and Model 2 is the model with our dependent variable and all independent variables, the SST, SSE and SSR in the anova() output are in the following parts of the table:
  Res.Df  RSS  Df  Sum of Sq
1  n-1    SST
2  n-k-1  SSE  k   SSR
The general structure of the anova() output, where model 1 regresses y on only the intercept and model 2 regresses y on all the independent variables, is:
m1 <- lm(y ~ 1, data = df)
m2 <- lm(y ~ x1 + x2, data = df)
anova(m1, m2)
Analysis of Variance Table
Model 1: y ~ 1
Model 2: y ~ x1 + x2
  Res.Df  RSS  Df  Sum of Sq
1  n-1    SST
2  n-k-1  SSE  k   SSR
---
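If we want to pull these numbers out of R directly rather than reading them off the printed table, we can index the object returned by anova(), which is a data frame with the columns shown above. A small sketch (the name a is just an illustrative choice):

a <- anova(m1, m2)     # store the table instead of just printing it
a$RSS[1]               # SST: residual sum of squares of the intercept-only model
a$RSS[2]               # SSE: residual sum of squares of the full model
a[["Sum of Sq"]][2]    # SSR: the drop in the residual sum of squares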
We can also use the other approaches we saw in Chapter 12:
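For example, we can compute the sums of squares directly from the residuals and fitted values of m2. This is a sketch of one such approach; Chapter 12 may use slightly different code:

SSE <- sum(residuals(m2)^2)                  # sum of squared residuals
SSR <- sum((fitted(m2) - mean(df$wage))^2)   # explained sum of squares
SST <- sum((df$wage - mean(df$wage))^2)      # total sum of squares
c(SSE = SSE, SSR = SSR, SST = SST)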
In each case we get the same numbers as the anova() function.
17.2 R^2
The R^2 is also shown in the default summary() output:
df <- read.csv("wages1.csv")
m <- lm(wage ~ educ + exper, data = df)
summary(m)
Call:
lm(formula = wage ~ educ + exper, data = df)
Residuals:
    Min      1Q  Median      3Q     Max
-5.5532 -1.9801 -0.7071  1.2030 15.8370

Coefficients:
             Estimate Std. Error t value       Pr(>|t|)
(Intercept)  -3.39054    0.76657  -4.423 0.000011846645 ***
educ          0.64427    0.05381  11.974        < 2e-16 ***
exper         0.07010    0.01098   6.385 0.000000000378 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.257 on 523 degrees of freedom
Multiple R-squared: 0.2252, Adjusted R-squared: 0.2222
F-statistic: 75.99 on 2 and 523 DF, p-value: < 2.2e-16
The R^2 is 0.2252. In our model, education and experience explain 22.52% of the variation in wages. The remaining 77.48% is unexplained.
We can also extract this number from the output with:
summary(m)$r.squared
[1] 0.2251622
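We can also check this number against the sums of squares we found with anova(); up to rounding, both calculations below give 0.2252:

1612.2 / 7160.4        # SSR / SST
1 - 5548.2 / 7160.4    # 1 - SSE / SST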
One thing to note is that the R^2 is no longer the square of the sample correlation coefficient between y and x. That is only true in the simple linear regression model.
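For example, squaring the sample correlation between wage and educ will not reproduce the model's R^2 (a quick check with our data; the exact values are not shown here):

cor(df$wage, df$educ)^2    # squared correlation with a single regressor
summary(m)$r.squared       # model R^2 -- not the same number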
17.3 Adjusted R^2
Recall that the formula for the R^2 is R^2=1-\frac{SSE}{SST} and it measures the percentage of the variation in the y-data that is explained by the independent variables. If we add more and more variables to our model, the sum of squared errors never increases with each variable added, and so the R^2 never decreases. This could lead us to add too many variables to our model (a problem called "overfitting").
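To see this, we can add a variable of pure noise to the model and check that the R^2 does not fall. This is a small demonstration; noise is a made-up column created only for this purpose:

set.seed(1)                    # make the random numbers reproducible
df$noise <- rnorm(nrow(df))    # a variable with no real relation to wage

m_noise <- lm(wage ~ educ + exper + noise, data = df)
summary(m_noise)$r.squared     # will be at least as large as 0.2252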
You may have noticed that the summary() output also gives another number called the Adjusted R-squared. In our example it is 0.2222. The adjusted R^2 is one way to help us build models that avoid this overfitting problem.1
The formula for the adjusted R^2 is:
R_{adj}^2 =1 - \frac{SSE/\left( n-k-1 \right)}{SST/\left( n-1 \right)}
The adjusted R^2 penalizes the loss of degrees of freedom from adding a variable, so it will decrease if a new variable does not explain enough of the variation in the y-data to offset that penalty.
If we want to extract the adjusted R^2 from the R output we can use the command:
summary(m)$adj.r.squared
[1] 0.2221991
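We can again check this against the sums of squares from the anova() output, with n = 526 observations and k = 2 independent variables (so n-k-1 = 523 and n-1 = 525, the residual degrees of freedom shown above):

1 - (5548.2 / 523) / (7160.4 / 525)    # SSE/(n-k-1) divided by SST/(n-1)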
The adjusted R^2 is always smaller than the ordinary R squared and can be negative if the explanatory power of the model is very poor.
If you are concerned about overfitting, there are much better techniques (such as cross-validation and bootstrap aggregating) than the adjusted R^2, which is only a very simple tool.↩︎