13  SLR Prediction Intervals

13.1 Theory

Previously, we learned how to obtain the value of Y that the model predicts for each value of X in the data. This was the predicted value: \hat{y}_i=b_0 + b_1 x_i. But we can also use the model to predict a value of Y for any value of X, not only the values of X in our data.

Suppose we wanted to predict what value Y would take if the independent variable were equal to x_p, some value that we choose (and know). Call the corresponding value of the dependent variable Y_p.

The population model says that: Y_p = \beta_0 + \beta_1 x_p + \varepsilon_p. There are two different objects we may be interested in from this model:

  1. An estimate of \mathbb{E}\left[ Y_p|x_p \right], the expected value of the dependent variable when the independent variable is equal to x_p.
  2. A prediction of Y_p, our best prediction of the value of the dependent variable for one observation when the independent variable is equal to x_p.

In our sales and advertising example, the first object could be the average amount of sales if advertising was equal to x_p (not in any particular location; just the average), whereas the second object is the actual value of sales in one location if advertising was set at x_p.

Now, it turns out that the sample statistic \hat{Y}_p=B_0+B_1 x_p is both the point estimator of \mathbb{E}\left[ Y_p|x_p \right] (the first object) and the point predictor of Y_p (the second object).

However, the standard errors for these two uses of \hat{Y}_p are different, and so are the intervals built from them:

  1. The 95% confidence interval for \mathbb{E}\left[{Y}_p|x_p \right] should contain the expected value of {Y}_p given x_p with 95% probability.
  2. The 95% prediction interval for {Y_p} should contain the (still unknown) actual realization of {Y}_p with 95% probability.

The first object \mathbb{E}[Y_p|x_p]=\beta_0+\beta_1 x_p does not contain \varepsilon_p, whereas Y_p= \beta_0 + \beta_1 x_p + \varepsilon_p does. So the prediction interval for {Y}_p (which includes the variability in \varepsilon_p) is always wider, and typically much wider, than the confidence interval for \mathbb{E}\left[ {Y}_p|x_p \right].

We won’t discuss the different formulas for these confidence/prediction intervals because we will use R to calculate them. However, it is important to be aware of why one is wider than the other.
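To make the two interpretations concrete, here is a minimal simulation sketch. It does not use the advertising data: the "true" values beta0, beta1 and sigma, the sample size, and the number of repetitions are all illustrative assumptions. In each repetition we draw a new sample, fit the model, and check whether the confidence interval contains \mathbb{E}[Y_p|x_p] and whether the prediction interval contains one newly drawn Y_p.

```r
# Coverage check by simulation. All numbers here are illustrative assumptions.
set.seed(123)
beta0 <- 4; beta1 <- 0.05; sigma <- 2   # assumed "true" population values
x_p  <- 100                             # value of x we predict at
n    <- 50                              # sample size in each repetition
reps <- 2000                            # number of simulated samples

true_mean <- beta0 + beta1 * x_p        # E[Y_p | x_p] in the assumed population

covers_mean <- logical(reps)  # does the confidence interval contain E[Y_p | x_p]?
covers_new  <- logical(reps)  # does the prediction interval contain a new Y_p?
width_ci    <- numeric(reps)
width_pi    <- numeric(reps)

for (r in seq_len(reps)) {
  x   <- runif(n, min = 0, max = 200)
  y   <- beta0 + beta1 * x + rnorm(n, sd = sigma)
  fit <- lm(y ~ x)
  new <- data.frame(x = x_p)

  ci   <- predict(fit, new, interval = "confidence", level = 0.95)
  pint <- predict(fit, new, interval = "prediction", level = 0.95)

  y_new <- true_mean + rnorm(1, sd = sigma)  # one new realization of Y_p

  covers_mean[r] <- ci[, "lwr"] <= true_mean && true_mean <= ci[, "upr"]
  covers_new[r]  <- pint[, "lwr"] <= y_new && y_new <= pint[, "upr"]
  width_ci[r]    <- ci[, "upr"] - ci[, "lwr"]
  width_pi[r]    <- pint[, "upr"] - pint[, "lwr"]
}

mean(covers_mean)  # share of confidence intervals containing E[Y_p | x_p]
mean(covers_new)   # share of prediction intervals containing the new Y_p
mean(width_ci)     # average width of the confidence interval
mean(width_pi)     # average width of the prediction interval
```

Under these assumptions both shares should come out close to 0.95, while the prediction interval is wider than the confidence interval in every repetition, because it also has to allow for \varepsilon_p.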

13.2 Example in R

Let’s go back to our advertising and sales dataset to show an example of this. Suppose we want to predict sales if €100,000 was spent on advertising. We also want to obtain:

  1. A 95% confidence interval for the expected value of sales given this level of advertising.
  2. A 95% prediction interval for the value of sales if we advertised at this level in one market.

If all we were interested in was to get the expectation \mathbb{E}\left[ Y_p|x_p \right] or the predicted value \widehat{Y}_p, we do the following. We need to make a small data.frame with one observation with the appropriate value for x. We then use the predict() function in R with our estimated regression model m. Let’s try it out:

df <- read.csv("advertising-sales.csv")
m <- lm(sales ~ advertising, data = df)
df_p <- data.frame(advertising = 100)
predict(m, df_p)
       1 
9.111816 

As we said above, the expectation \mathbb{E}\left[Y_p|x_p\right] and the prediction of Y_p are estimated the same way, so both have the same value. Here, the average value of sales conditional on €100,000 spent on advertising is €9.11m (our estimate of \mathbb{E}[Y_p|x_p=100]) and our prediction for what sales would be in one market when €100,000 is spent on advertising is also €9.11m (our prediction \hat{Y}_p).
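As a small check of where this number comes from, the sketch below computes b_0 + b_1 x_p by hand and compares it with the output of predict(). It uses simulated data rather than the advertising dataset, so the coefficients 4 and 0.05, the noise level, and the names sim and m_sim are illustrative assumptions.

```r
# Simulated data: an illustrative assumption, not the advertising-sales dataset
set.seed(1)
x   <- runif(50, min = 0, max = 200)
y   <- 4 + 0.05 * x + rnorm(50, sd = 2)
sim <- data.frame(x = x, y = y)

m_sim <- lm(y ~ x, data = sim)

# Point prediction computed by hand from the estimated coefficients
x_p <- 100
b   <- coef(m_sim)                       # b[1] is the intercept, b[2] the slope
y_hat_manual <- unname(b[1] + b[2] * x_p)

# The same point prediction from predict()
y_hat_predict <- unname(predict(m_sim, data.frame(x = x_p)))

# Compare the two (up to floating-point tolerance)
all.equal(y_hat_manual, y_hat_predict)
```

The comparison should return TRUE: predict() without the interval option simply evaluates the estimated regression line at the value of x we supply.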

Now, suppose we wanted to get a 95% confidence interval for \mathbb{E}\left[Y_p|x_p\right]. We can get this by specifying "confidence" in the interval option in the predict() function. We can set the level using the level option:

df <- read.csv("advertising-sales.csv")
m <- lm(sales ~ advertising, data = df)
df_p <- data.frame(advertising = 100)
predict(m, df_p, interval = "confidence", level = 0.95)
       fit     lwr      upr
1 9.111816 8.57622 9.647413

This also gives the estimate of \mathbb{E}\left[Y_p|x_p\right], which is 9.111816 (€9.11m). The interpretation of this interval is as follows: We are 95% confident that in the population of markets where €100,000 is spent on advertising, the mean value of sales is between €8.576m and €9.647m.

Now let’s get a 95% prediction interval for Y_p. The steps to do this are almost the same as above. All we need to do is replace "confidence" with "prediction" in the interval argument:

df <- read.csv("advertising-sales.csv")
m <- lm(sales ~ advertising, data = df)
df_p <- data.frame(advertising = 100)
predict(m, df_p, interval = "prediction", level = 0.95)
       fit      lwr      upr
1 9.111816 3.956741 14.26689

The interpretation of this interval is as follows: We are 95% confident that if we spend €100,000 on advertising in one market, the actual value of sales in that market will be between €3.957m and €14.267m.

Notice how this interval is much wider than the previous interval for \mathbb{E}\left[Y_p|x_p\right]: it is about €10.31m wide, compared with about €1.07m for the confidence interval. This is because it also includes the variability in \varepsilon_p, which is not included in the interval for \mathbb{E}\left[Y_p|x_p\right].
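The same comparison can be made for several values of x_p at once by putting them in the rows of one data.frame. The following is a minimal sketch on simulated data (the data-generating numbers and the names sim, m_sim and new_x are illustrative assumptions, not the advertising dataset), which puts the widths of the two intervals side by side.

```r
# Simulated data: an illustrative assumption, not the advertising-sales dataset
set.seed(42)
sim <- data.frame(advertising = runif(60, min = 0, max = 200))
sim$sales <- 4 + 0.05 * sim$advertising + rnorm(60, sd = 2.5)

m_sim <- lm(sales ~ advertising, data = sim)

# Several values of x_p can be supplied as rows of one data.frame
new_x <- data.frame(advertising = c(50, 100, 150))

ci   <- predict(m_sim, new_x, interval = "confidence", level = 0.95)
pint <- predict(m_sim, new_x, interval = "prediction", level = 0.95)

# Width of each interval, side by side
widths <- data.frame(
  advertising      = new_x$advertising,
  width_confidence = ci[, "upr"] - ci[, "lwr"],
  width_prediction = pint[, "upr"] - pint[, "lwr"]
)
widths
```

In each row the fit column of the two outputs is identical, and the prediction interval is wider than the confidence interval.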