We will now discuss a hypothesis test for the regression slope that is so common it has its own name: the test for statistical significance. This is a two-sided test for the regression slope with a zero hinge (a hypothesized value of zero):
H_0: \beta_1 =0 \qquad H_1:\beta_1 \neq 0
Recall that the model is:
\mathbb{E}[Y_i|x_i]=\beta_0 + \beta_1 x_i
Under the null hypothesis, the model is simply \mathbb{E}[Y_i|x_i]=\beta_0. The expected value of Y_i does not depend on x_i, so a model that tries to predict Y_i using x_i is completely useless. Under the alternative hypothesis, \mathbb{E}[Y_i|x_i]=\beta_0 + \beta_1 x_i with \beta_1\neq0, so Y_i varies with x_i and the model is useful (at least to some degree).
Therefore this test is a test of model usefulness. If we reject H_0 at the 5% level, we say the model is useful at the 5% level.
If H_0 is rejected, we say the variable X is significant and b_1 is significantly different from zero.
If H_0 is not rejected, we say the variable X is insignificant and b_1 is not significantly different from zero.
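To see what an insignificant slope looks like in practice, here is a small illustrative simulation (the seed, sample size, and coefficient values are arbitrary choices, not from any dataset in this book). We generate Y with no dependence on x, so the true \beta_1 is zero:
set.seed(1)
x <- runif(100)
y <- 5 + rnorm(100)  # y is generated with beta_1 = 0
summary(lm(y ~ x))$coefficients
Because the data were generated under H_0, the p-value for x will typically be well above 0.05, so we fail to reject H_0 and conclude that x is insignificant.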
Because this test is so common, most statistical software packages (including R) that estimate the simple linear regression model provide the test statistic and p-value for this test by default. We will see this in the next example.
11.2 Example in R
Let’s test for model usefulness using the advertising and sales data. We will see that the summary() command provides the test statistic and p-value for this test by default:
df <- read.csv("advertising-sales.csv")
m <- lm(sales ~ advertising, data = df)
summary(m)
Call:
lm(formula = sales ~ advertising, data = df)
Residuals:
Min 1Q Median 3Q Max
-8.0546 -1.3071 0.1173 1.5961 7.1895
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.243028 0.438525 9.676 <2e-16 ***
advertising 0.048688 0.001982 24.564 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.6 on 198 degrees of freedom
Multiple R-squared: 0.7529, Adjusted R-squared: 0.7517
F-statistic: 603.4 on 1 and 198 DF, p-value: < 2.2e-16
If we were to calculate the value of the test statistic from our sample manually, we would calculate it from b_1 and s_{b_1} using:
t=\frac{b_1-0}{s_{b_1}}=\frac{0.048688-0}{0.001982}
Let’s calculate this in R:
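0.048688 / 0.001982
[1] 24.56509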
Looking back at the summary() output, we see this same number in the t value column of the advertising row. It appears there as 24.564 rather than 24.565 because summary() computes it from the unrounded estimate and standard error, whereas our manual calculation used the rounded values. The summary() command for a regression model always shows the test statistic for a two-sided test with a zero hinge (the test for statistical significance).
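If we want to reproduce summary()'s value exactly, we can avoid the rounding error by pulling the unrounded estimate and standard error out of the fitted model. One way, using base R's coef() on the summary object (which returns the coefficients table as a matrix):
cf <- coef(summary(m))
cf["advertising", "Estimate"] / cf["advertising", "Std. Error"]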
Let’s compare this to the critical value for \alpha=0.05:
qt(0.975, 198)
[1] 1.972017
The absolute value of the test statistic, 24.564, is greater than the critical value 1.972, so we reject H_0: advertising is statistically significant at the 5% level.
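Equivalently, we can have R perform the comparison directly:
abs(24.564) > qt(0.975, 198)
[1] TRUE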
The summary() table also shows the corresponding p-value for this test in the 4th column. The notation 2e-16 is scientific notation for 2 \times 10^{-16}, i.e., the number 2 divided by a 1 followed by 16 zeros. The < in <2e-16 means that the p-value is smaller than this number. Thus the p-value is essentially zero, so advertising is statistically significant at the 5% level (p<0.05).
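We can confirm this by computing the two-sided p-value manually with pt(). Because the test is two-sided, the p-value is twice the tail area beyond the test statistic:
2 * pt(-abs(24.565), df = 198)  # far smaller than 2e-16, effectively zero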
11.3 Significance Stars
The summary() command also shows *** after each p-value, and below the coefficients table it prints a legend labeled Signif. codes. Three stars mean the p-value is less than 0.001. Here is what each of the codes means:
3 stars (***): p-value is between 0 and 0.001.
2 stars (**): p-value is between 0.001 and 0.01.
1 star (*): p-value is between 0.01 and 0.05.
1 dot (.): p-value is between 0.05 and 0.1.
No star/dot: p-value is between 0.1 and 1.
In the example above, both the intercept and the slope have 3 stars because the p-value for testing whether each coefficient is zero is close to zero in both cases.
The purpose of these stars is to let you quickly see which estimates are significantly different from zero.
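As an aside, these codes can be reproduced from the p-values using the base-R symnum() function with the cutpoints from the legend. A minimal sketch, reusing the fitted model m from above:
pvals <- coef(summary(m))[, "Pr(>|t|)"]
symnum(pvals, corr = FALSE, na = FALSE,
       cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
       symbols = c("***", "**", "*", ".", " "))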