Previously, we learned how to obtain the value of Y that the model predicts for each value of X in the data. This was the predicted value: \hat{y}_i=b_0 + b_1 x_i. But we can also use the model to predict a value of Y for any value of X, not only the values of X in our data.
Suppose we wanted to predict what value Y would be if the independent variable was equal to x_p, some value that we choose (and know). Call this value Y_p.
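As a minimal sketch of this idea, the prediction at a chosen x_p can be computed by hand from the estimated coefficients and checked against R's predict() function. The data below are simulated purely for illustration: the variable names, the seed, and the coefficient values are assumptions and have nothing to do with the advertising dataset.

```r
# Simulated data for illustration only (not the advertising dataset)
set.seed(1)
x <- runif(50, min = 0, max = 200)
y <- 4 + 0.05 * x + rnorm(50, sd = 2)
sim <- data.frame(x = x, y = y)

# Fit the simple linear regression
fit <- lm(y ~ x, data = sim)

# Choose a value x_p (it does not need to appear in the data)
x_p <- 100

# Prediction by hand: b_0 + b_1 * x_p
b <- coef(fit)
y_hat_manual <- unname(b[1] + b[2] * x_p)

# The same prediction using predict()
y_hat_predict <- unname(predict(fit, newdata = data.frame(x = x_p)))

# Check that the two calculations agree
all.equal(y_hat_manual, y_hat_predict)
```

Both calculations apply the same estimated line, so they agree for any x_p we choose.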
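The population model says that: Y_p = \beta_0 + \beta_1 x_p + \varepsilon_p. There are two different objects we may be interested in from this model: the conditional mean \mathbb{E}\left[ Y_p|x_p \right] = \beta_0 + \beta_1 x_p, and the actual value Y_p = \beta_0 + \beta_1 x_p + \varepsilon_p.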
In our sales and advertising example, the first object could be the average amount of sales if advertising was equal to x_p (not in any particular location; just the average), whereas the second object is the actual value of sales in one location if advertising was set at x_p.
Now, it turns out that the sample statistic \hat{Y}_p=B_0+B_1 x_p is both the point estimator of \mathbb{E}\left[ Y_p|x_p \right] (the first object) and the point predictor of Y_p (the second object).
However, the standard errors for these two estimators will be different:
The first object \mathbb{E}[Y_p|x_p]=\beta_0+\beta_1 x_p does not contain \varepsilon_p, whereas Y_p= \beta_0 + \beta_1 x_p + \varepsilon_p does. So the prediction interval for {Y}_p (which includes the variability in \varepsilon_p) should be much wider than the confidence interval for \mathbb{E}\left[ {Y}_p|x_p \right].
We won’t discuss the different formulas for these confidence/prediction intervals because we will use R to calculate them. However, it is important to be aware of why one is wider than the other.
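To see the difference in widths without the formulas, here is a small simulated sketch. Everything in it is an assumption chosen for illustration (the true coefficients, the error standard deviation of 2, the sample sizes, and the helper function interval_widths()); it is not the advertising data. It fits the same kind of model on samples of increasing size and records the width of each 95% interval at x_p = 100.

```r
# Simulated data for illustration only (all values below are assumptions)
set.seed(42)

# Hypothetical helper: widths of the two 95% intervals at x_p = 100
interval_widths <- function(n) {
  d <- data.frame(x = runif(n, min = 0, max = 200))
  d$y <- 4 + 0.05 * d$x + rnorm(n, sd = 2)  # epsilon has standard deviation 2
  fit <- lm(y ~ x, data = d)
  new <- data.frame(x = 100)
  conf <- predict(fit, new, interval = "confidence", level = 0.95)
  pred <- predict(fit, new, interval = "prediction", level = 0.95)
  c(n = n,
    confidence = unname(conf[, "upr"] - conf[, "lwr"]),
    prediction = unname(pred[, "upr"] - pred[, "lwr"]))
}

# One row per sample size
widths <- t(sapply(c(20, 200, 2000), interval_widths))
widths
```

In a setup like this, the confidence interval narrows as the sample grows, because the regression line is estimated more precisely. The prediction interval does not shrink toward zero, because the variability in \varepsilon_p remains no matter how much data we have.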
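Let’s go back to our advertising and sales dataset to show an example of this. Suppose we want to predict sales if €100,000 was spent on advertising. We also want to obtain a 95% confidence interval for \mathbb{E}\left[ Y_p|x_p \right] and a 95% prediction interval for Y_p.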
If all we were interested in was to get the expectation \mathbb{E}\left[ Y_p|x_p \right] or the predicted value \widehat{Y}_p, we do the following. We need to make a small data.frame with one observation with the appropriate value for x. We then use the predict() function in R with our estimated regression model m. Let’s try it out:
df <- read.csv("advertising-sales.csv")
m <- lm(sales ~ advertising, data = df)  # estimate the regression model
df_p <- data.frame(advertising = 100)    # one observation with x_p = 100 (€100,000)
predict(m, df_p)
1
9.111816
As we said above, the expectation \mathbb{E}\left[Y_p|x_p\right] and the prediction of Y_p are estimated the same way, so both have the same value. Here, the average value of sales conditional on €100,000 spent on advertising is €9.11m (our estimate of \mathbb{E}\left[Y_p|x_p=100\right]) and our prediction for what sales would be in one market when €100,000 is spent on advertising is also €9.11m (our prediction \hat{Y}_p).
Now, suppose we wanted to get a 95% confidence interval for \mathbb{E}\left[Y_p|x_p\right]. We can get this by specifying "confidence" in the interval option in the predict() function. We can set the level using the level option:
df <- read.csv("advertising-sales.csv")
m <- lm(sales ~ advertising, data = df)  # estimate the regression model
df_p <- data.frame(advertising = 100)    # one observation with x_p = 100 (€100,000)
predict(m, df_p, interval = "confidence", level = 0.95)
fit lwr upr
1 9.111816 8.57622 9.647413
This also gives the estimate of \mathbb{E}\left[Y_p|x_p\right], which is 9.111816 (€9.11m). The interpretation of this interval is as follows: We are 95% confident that in the population of markets where €100,000 is spent on advertising, the mean value of sales is between €8.576m and €9.647m.
Now let’s get a 95% prediction interval for Y_p. The steps to do this are almost the same as above. All we need to change is to replace "confidence" with "prediction" in the interval argument:
df <- read.csv("advertising-sales.csv")
m <- lm(sales ~ advertising, data = df)  # estimate the regression model
df_p <- data.frame(advertising = 100)    # one observation with x_p = 100 (€100,000)
predict(m, df_p, interval = "prediction", level = 0.95)
fit lwr upr
1 9.111816 3.956741 14.26689
The interpretation of this interval is as follows: We are 95% confident that if we spend €100,000 on advertising in one market, the actual value of sales in that market will be between €3.957m and €14.267m.
Notice how this interval is much wider than the previous interval for \mathbb{E}\left[Y_p|x_p\right]. This is because it also includes the variability in \varepsilon_p, which is not included in the interval for \mathbb{E}\left[Y_p|x_p\right].
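To put a number on "much wider", the short sketch below computes the width of each interval from the endpoints reported in the predict() output above. The endpoints are copied from that output; the object names are only illustrative.

```r
# Interval endpoints copied from the predict() output above (in € millions)
conf_int <- c(lwr = 8.57622, upr = 9.647413)   # 95% confidence interval
pred_int <- c(lwr = 3.956741, upr = 14.26689)  # 95% prediction interval

# Width of each interval
conf_width <- unname(conf_int["upr"] - conf_int["lwr"])
pred_width <- unname(pred_int["upr"] - pred_int["lwr"])

conf_width
pred_width

# How many times wider the prediction interval is
pred_width / conf_width
```

The confidence interval is about €1.07m wide, whereas the prediction interval is about €10.31m wide, which is more than nine times as wide.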