Just as in Chapter 13, once we choose values x_{p1}, \dots, x_{pk} for each of the independent variables, we can use our regression model both to estimate the expected value of the dependent variable at those values, \mathbb{E}\left[Y_p | x_{p1}, \dots, x_{pk}\right], and to predict the realized value of Y_p. We can also obtain a confidence interval for \mathbb{E}\left[Y_p | x_{p1}, \dots, x_{pk}\right] and a prediction interval for Y_p. Doing this in R is very similar to what we did for the simple linear regression model in Chapter 13. We will show examples using the wages, education and experience data.
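For reference, the two intervals can be written side by side. This is a sketch under the usual assumptions of the linear model with normal errors; the symbols n (sample size), s (residual standard error), \text{se}(\hat{Y}_p) (standard error of the estimated mean) and t_{\alpha/2,\,n-k-1} (critical value of the t distribution) are notation assumed here and may differ from the notation used elsewhere in the book.

```latex
\begin{aligned}
\text{Confidence interval for } \mathbb{E}\left[Y_p | x_{p1}, \dots, x_{pk}\right]\text{:}\quad
  & \hat{Y}_p \pm t_{\alpha/2,\,n-k-1}\, \text{se}(\hat{Y}_p) \\
\text{Prediction interval for } Y_p\text{:}\quad
  & \hat{Y}_p \pm t_{\alpha/2,\,n-k-1}\, \sqrt{\text{se}(\hat{Y}_p)^2 + s^2}
\end{aligned}
```

Both intervals are centered at the same point \hat{Y}_p. The prediction interval is always the wider of the two because of the extra s^2 term under the square root.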
You want to estimate the mean wage of people with 12 years of education and 13 years of experience and also obtain a 95% confidence interval for this mean.
Just like in Chapter 13, we perform the following steps:
1. Create a data.frame with one row containing the values for each of the independent variables.
2. Use the predict() function with the estimated model and this one-row data.frame, specifying that we want a confidence interval for the mean (using interval = "confidence").
Here are the steps for our example:
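# Load the data and estimate the model
df <- read.csv("wages1.csv")
m <- lm(wage ~ educ + exper, data = df)
# One-row data.frame with the chosen values of the independent variables
df_p <- data.frame(educ = 12, exper = 13)
predict(m, df_p, interval = "confidence", level = 0.95)
       fit      lwr      upr
1 5.251966 4.948709 5.555222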
The model estimates that the average wage of people with 12 years of education and 13 years of experience is $5.25.
To interpret the confidence interval we say that we are 95% confident that the population mean wage of people with 12 years of education and 13 years of experience is between $4.95 and $5.56.
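The interval reported by predict() can also be reproduced by hand, which shows where the numbers come from. The following is a minimal sketch on simulated data: the data frame sim, its coefficients and the names m_sim and new_obs are made up for illustration and are not the wages data. It asks predict() for the standard error of the estimated mean with se.fit = TRUE and combines it with the matching t quantile.

```r
# Simulated stand-in for the wages data (hypothetical values, illustration only)
set.seed(1)
n <- 200
sim <- data.frame(educ = sample(8:18, n, replace = TRUE),
                  exper = sample(0:40, n, replace = TRUE))
sim$wage <- -3 + 0.6 * sim$educ + 0.07 * sim$exper + rnorm(n, sd = 3)

m_sim <- lm(wage ~ educ + exper, data = sim)
new_obs <- data.frame(educ = 12, exper = 13)

# Point estimate, its standard error, and the residual degrees of freedom
p <- predict(m_sim, new_obs, se.fit = TRUE)

# Critical value of the t distribution for a 95% interval
t_crit <- qt(0.975, df = p$df)

# Confidence interval for the mean: estimate +/- t * standard error
ci_manual <- c(fit = unname(p$fit),
               lwr = unname(p$fit - t_crit * p$se.fit),
               upr = unname(p$fit + t_crit * p$se.fit))

# Compare with the interval computed directly by predict()
ci_builtin <- predict(m_sim, new_obs, interval = "confidence", level = 0.95)
all.equal(ci_manual, ci_builtin[1, ])
```

If the final comparison returns TRUE, the manual calculation and predict() agree.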
Suppose now that you want to predict the wage of one individual with 12 years of education and 13 years of experience and obtain a 95% prediction interval for it. That is, you want an interval that contains this individual's actual wage with 95% probability.
We follow almost the same steps as before, but now we use the "prediction" option for interval in the predict() function instead of "confidence":
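# Load the data and estimate the model
df <- read.csv("wages1.csv")
m <- lm(wage ~ educ + exper, data = df)
# One-row data.frame with the chosen values of the independent variables
df_p <- data.frame(educ = 12, exper = 13)
predict(m, df_p, interval = "prediction", level = 0.95)
       fit       lwr      upr
1 5.251966 -1.153713 11.65764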
The model predicts that an individual with 12 years of education and 13 years of experience has a wage of $5.25. We are 95% confident that this individual's wage will be between -$1.15 and $11.66.
Notice that the prediction is the same as the estimate of \mathbb{E}\left[Y_p | x_{p1}=12, x_{p2}=13\right], but the prediction interval is much wider than the confidence interval. This is because we are more uncertain about the wage of one individual (which includes the variability of the error \varepsilon_p) than about the average wage (where the errors average out across individuals). The lower bound of this prediction interval is even negative! The upper bound is also very high relative to the distribution of wages:
summary(df$wage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.530   3.330   4.650   5.896   6.880  24.980
We can check what share of the observed wages falls below the upper bound:
mean(df$wage < 11.65764)
[1] 0.9220532
This means that our prediction interval is extremely wide. Since a wage cannot be negative, the lower bound is effectively $0, which is below the lowest observed wage in the data ($0.53). To be 95% confident, then, we can only say that the wage of this individual will be between $0 and $11.66, an upper bound larger than 92.2% of the observed wages in the data!
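The reason the prediction interval stays so wide can be made concrete. Its standard error combines the standard error of the estimated mean with the residual standard error, and the second piece does not shrink as the sample grows. The sketch below checks this decomposition on simulated data. The objects sim, m_sim and new_obs are hypothetical and are not the wages data.

```r
# Simulated stand-in for the wages data (hypothetical values, illustration only)
set.seed(1)
n <- 200
sim <- data.frame(educ = sample(8:18, n, replace = TRUE),
                  exper = sample(0:40, n, replace = TRUE))
sim$wage <- -3 + 0.6 * sim$educ + 0.07 * sim$exper + rnorm(n, sd = 3)

m_sim <- lm(wage ~ educ + exper, data = sim)
new_obs <- data.frame(educ = 12, exper = 13)

p <- predict(m_sim, new_obs, se.fit = TRUE)
t_crit <- qt(0.975, df = p$df)

# Standard error of the estimated mean vs. of a single new observation
se_mean <- unname(p$se.fit)
se_pred <- sqrt(se_mean^2 + p$residual.scale^2)

# Half-widths of the 95% confidence and prediction intervals
half_width_ci <- t_crit * se_mean
half_width_pi <- t_crit * se_pred

# The manual half-width matches the one implied by predict()
pi_builtin <- predict(m_sim, new_obs, interval = "prediction", level = 0.95)
all.equal(unname(pi_builtin[1, "upr"] - pi_builtin[1, "fit"]), half_width_pi)

c(ci = half_width_ci, pi = half_width_pi)
```

In this simulated example the prediction half-width is dominated by the residual standard error, which mirrors what happens in the wages example: collecting more data would narrow the confidence interval for the mean but would barely change the prediction interval for one individual.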