For all of the applicable problems, plots must be generated using the functions within the ggplot2 package.

Problem 1. Before answering the following questions, take a few minutes to look through the components of the built-in R data set “cars,” which gives the speed (mph) and the stopping distance (ft) of cars from the 1920s. Then, against a significance level of α = 0.05, complete the following.

a) Using “ggplot()”, generate a scatterplot to look at the linear relationship between the variables “speed” and “dist.” Be sure to be creative in your color and shape choices, in addition to using appropriate labels on the axes and a title for your graph. What preliminary observations can you make about the association between speed and stopping distance?
library(ggplot2)
ggplot(data=cars, aes(x=speed, y=dist)) + 
  geom_point(colour="steelblue",shape=12,size=2) +
  labs(title="Scatterplot of Car Stopping Distances and Their Speeds",
  y="Stopping Distance (ft)",
  x="Speed (mph)")

The scatterplot shows a strong positive association between speed and stopping distance. As a car’s speed increases, the distance required to stop tends to increase as well. The pattern appears roughly linear, with higher speeds corresponding to larger stopping distances. There are no obvious clusters or outliers that drastically deviate from the overall positive trend.

b) Generate and interpret Pearson’s correlation coefficient between “speed” and “dist.” What does this tell you about the strength of the association between speed and stopping distance?
with(cars, cor(speed,dist))
## [1] 0.8068949

The Pearson correlation coefficient between speed and stopping distance is r ≈ 0.807, which indicates a strong positive linear association. Together with the roughly linear pattern in the scatterplot from part (a), this means that higher speeds tend to go with longer stopping distances. Since r² ≈ 0.651, a straight-line relationship with speed accounts for about 65% of the variation in stopping distance, so speed is a fairly good, though not complete, predictor of stopping distance.
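As a rough check of this interpretation against α = 0.05, the correlation can also be tested formally and squared to get the share of variance explained. The following is a minimal sketch using base R's `cor.test()`; the names `ct` and `r_squared` are introduced here for illustration only.

```r
# Test H0: rho = 0 for speed and stopping distance (two-sided Pearson test)
ct <- cor.test(cars$speed, cars$dist, method = "pearson", conf.level = 0.95)

ct$estimate   # sample correlation r
ct$p.value    # compare against alpha = 0.05
ct$conf.int   # 95% confidence interval for the population correlation

# Proportion of the variance in dist explained by a linear function of speed
r_squared <- unname(ct$estimate)^2
r_squared
```

A p-value below 0.05 would support the claim that the linear association is statistically significant at the stated level.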

c) Generate a fitted simple linear regression model where “dist” is a function of “speed.” What are your beta coefficients? Discuss limitations this model would have if it were to be applied to today’s cars.
car_mod<-lm(dist~speed, data=cars)
car_mod
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932

The estimated beta coefficients from the simple linear regression model are β₀ = –17.579 and β₁ = 3.932. The slope means the model predicts that stopping distance increases by about 3.93 feet for each 1 mph increase in speed. The negative intercept β₀ has no practical meaning, since a stationary car cannot have a negative stopping distance. A speed of 0 mph also lies below the slowest observed speed (4 mph), so the intercept only serves to position the fitted line.

This model has several limitations if applied to modern cars. The data were collected in the 1920s, when braking systems, tire technology, and vehicle design were very different, so the relationship would not reflect today’s stopping distances. In addition, the model assumes a strictly linear relationship and may not capture the more complex braking behavior of modern vehicles, making predictions outside the historical context unreliable.
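One hedged way to probe the linearity assumption mentioned above is to compare the straight-line fit with a model that adds a squared speed term. This is only a sketch: the quadratic form is an assumption chosen for illustration, not part of the assignment, and `car_mod_lin` and `car_mod_quad` are names introduced here.

```r
# Straight-line model from part (c)
car_mod_lin <- lm(dist ~ speed, data = cars)

# Illustrative alternative with an added squared term (an assumption, not the assigned model)
car_mod_quad <- lm(dist ~ speed + I(speed^2), data = cars)

# F-test for whether the squared term improves the fit; compare the p-value to alpha = 0.05
anova(car_mod_lin, car_mod_quad)
```

If the p-value for the added term exceeds 0.05, these data alone do not give strong evidence against the straight-line model, although that says nothing about how well either model transfers to modern vehicles.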

d) Add a fitted line to your scatterplot in part (a). Does it appear to be a good fit? Why or why not?
cars_update <- data.frame(
  speed = seq(min(cars$speed, na.rm = TRUE),
              max(cars$speed, na.rm = TRUE),
              length.out = 100)
)

pred <- predict(car_mod, newdata = cars_update, interval = "prediction", level = 0.95)

cars_pred <- cbind(cars_update, as.data.frame(pred))

ggplot(data=cars, aes(x=speed, y=dist)) + 
  geom_point(colour="steelblue",shape=12,size=2) +
  labs(title="Scatterplot of Car Stopping Distances and Their Speeds",
  y="Stopping Distance (ft)",
  x="Speed (mph)") +
  geom_line(data = cars_pred, aes(x = speed, y = fit),
              colour = "grey25", linewidth = 1)

The fitted line closely follows the overall upward trend in the scatterplot, with an approximately equal number of points above and below the line, suggesting a reasonably good linear fit. However, the points become more spread out at higher speeds, indicating greater variability in stopping distances and that the model does not capture all sources of variation.
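The widening spread at higher speeds is often easier to judge in a residual plot than in the scatterplot itself. The sketch below plots residuals against fitted values with ggplot2; the data frame `car_resid` and the plot object `p_resid` are names introduced here for illustration.

```r
library(ggplot2)

car_mod <- lm(dist ~ speed, data = cars)

# Collect fitted values and residuals from the model in one data frame
car_resid <- data.frame(
  fitted = fitted(car_mod),
  resid  = resid(car_mod)
)

# Residuals should scatter evenly around zero if the linear model fits well
p_resid <- ggplot(car_resid, aes(x = fitted, y = resid)) +
  geom_point(colour = "steelblue", shape = 12, size = 2) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey25") +
  labs(
    title = "Residuals vs. Fitted Stopping Distance",
    x = "Fitted Stopping Distance (ft)",
    y = "Residual (ft)"
  )
p_resid
```

A fan shape that opens to the right in this plot would correspond to the increasing variability described above.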

e) Generate and interpret the prediction interval, relative to the level of significance, for a car’s stopping distance if it suddenly stops while driving 30 mph.
new_speed <- data.frame(speed = 30)

predict(car_mod, newdata = new_speed, interval = "prediction", level = 0.95)
##        fit      lwr     upr
## 1 100.3932 66.86529 133.921

The model predicts that a car that suddenly stops while driving at 30 mph needs approximately 100.39 ft to stop. With α = 0.05, the corresponding 95% prediction interval runs from 66.87 ft to 133.92 ft, so we can be 95% confident that an individual car’s stopping distance at this speed falls within this interval. Note that 30 mph exceeds the fastest observed speed (25 mph), so this prediction is an extrapolation.
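To see where these bounds come from, the interval can be rebuilt by hand from the standard simple linear regression formula, fit ± t × s × sqrt(1 + 1/n + (x₀ − x̄)² / Sxx). The sketch below is for illustration only; all variable names other than `car_mod` are introduced here.

```r
car_mod <- lm(dist ~ speed, data = cars)

alpha <- 0.05
x0    <- 30                               # speed at which to predict
n     <- nrow(cars)
s     <- sigma(car_mod)                   # residual standard error
x_bar <- mean(cars$speed)
sxx   <- sum((cars$speed - x_bar)^2)      # sum of squared deviations of speed

# Point prediction from the fitted coefficients
fit <- unname(coef(car_mod)[1] + coef(car_mod)[2] * x0)

# Standard error for predicting a single new observation at x0
se_pred <- s * sqrt(1 + 1 / n + (x0 - x_bar)^2 / sxx)

# Two-sided critical value with n - 2 residual degrees of freedom
t_crit <- qt(1 - alpha / 2, df = n - 2)

manual_pi <- c(fit = fit,
               lwr = fit - t_crit * se_pred,
               upr = fit + t_crit * se_pred)
manual_pi
```

The `(x0 - x_bar)^2 / sxx` term is what makes the interval wider as the prediction point moves away from the mean observed speed.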

f) Add a shaded prediction interval to your scatterplot in part (d). Does it visually align with your prediction interval in part (e) at the 30 mph point?
pred <- predict(car_mod, newdata = cars_update, interval = "prediction", level = 0.95)

cars_pred <- cbind(cars_update, as.data.frame(pred))

ggplot(data = cars, aes(x = speed, y = dist)) +
  geom_point(colour = "steelblue", shape = 12, size = 2) +
  geom_ribbon(data = cars_pred, aes(x = speed, ymin = lwr, ymax = upr, y = NULL),
              alpha = 0.2, fill = "grey50") +
  geom_line(data = cars_pred, aes(x = speed, y = fit), colour = "grey25", linewidth = 1) +
  labs(
    title = "Scatterplot of Speed vs. Stopping Distance with Prediction Interval",
    x = "Speed (mph)",
    y = "Stopping Distance (ft)"
  )

The shaded grey area represents the 95% prediction interval for stopping distance. Because the prediction grid covers only the observed speeds, the band ends at 25 mph and cannot be compared directly with part (e) at 30 mph. Continuing the band’s upward linear trend to 30 mph, however, the interval of 66.87 ft to 133.92 ft from part (e) appears consistent with the plot. Since this is an extrapolation beyond the collected data, it carries additional uncertainty, and actual stopping distances at 30 mph could fall outside the predicted range.
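For a direct visual comparison at 30 mph, the prediction grid can be extended past the observed speeds. The sketch below does this under the assumption that extrapolating the line is acceptable for illustration; `speed_grid`, `grid_pred`, and `p_ext` are names introduced here.

```r
library(ggplot2)

car_mod <- lm(dist ~ speed, data = cars)

# Grid that runs from the slowest observed speed up to 30 mph (beyond the data)
speed_grid <- data.frame(speed = seq(min(cars$speed), 30, length.out = 100))
grid_pred  <- cbind(
  speed_grid,
  as.data.frame(predict(car_mod, newdata = speed_grid,
                        interval = "prediction", level = 0.95))
)

p_ext <- ggplot(cars, aes(x = speed, y = dist)) +
  geom_point(colour = "steelblue", shape = 12, size = 2) +
  geom_ribbon(data = grid_pred, aes(x = speed, ymin = lwr, ymax = upr, y = NULL),
              alpha = 0.2, fill = "grey50") +
  geom_line(data = grid_pred, aes(x = speed, y = fit),
            colour = "grey25", linewidth = 1) +
  geom_vline(xintercept = 30, linetype = "dotted") +   # marks the speed used in part (e)
  labs(
    title = "Prediction Interval Extended to 30 mph",
    x = "Speed (mph)",
    y = "Stopping Distance (ft)"
  )
p_ext
```

The band’s edges at the dotted line can then be read against the bounds reported in part (e).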

Problem 2. Before answering the following questions, take a few minutes to look through the components of the built-in R data set “faithful,” which gives the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. Then, against a significance level of α = 0.01, complete the following.

a) Using “ggplot()”, generate a scatterplot to look at the linear relationship between the variables “eruptions” and “waiting.” Be sure to be creative in your color and shape choices, in addition to using appropriate labels on the axes and a title for your graph. What preliminary observations can you make about the association between the duration of eruptions and the waiting time between eruptions?
ggplot(data = faithful, aes(x = eruptions, y = waiting)) +
  geom_point(color = "lightgreen", shape = 15, size = 2) +
  labs(
    title = "Scatterplot of Old Faithful Eruption Duration vs. Waiting Time",
    x = "Eruption Duration (minutes)",
    y = "Waiting Time to Next Eruption (minutes)"
  )

The scatterplot shows a positive association between eruption duration and the subsequent waiting time. Longer eruptions generally lead to longer waiting times, though the points appear to cluster in two distinct groups, suggesting the relationship may not be perfectly linear across all observations.

b) Calculate and interpret Pearson’s correlation coefficient between “eruptions” and “waiting.” What does this tell you about the strength of the association between waiting time and eruption duration?
with(faithful, cor(eruptions, waiting))
## [1] 0.9008112

The Pearson correlation coefficient between eruption duration and waiting time is r ≈ 0.901, indicating a strong positive linear association. Consistent with the upward trend in the scatterplot, longer eruptions tend to be followed by longer waits until the next eruption. The relationship is strong but not perfect, so other factors may also influence the waiting time. Overall, eruption duration appears to be a good predictor of the subsequent waiting time.

c) Generate an estimated simple linear regression model where “waiting” is a function of “eruptions.” What are your beta coefficients? Discuss any limitations this model would have if it were to be applied to future eruptions of Old Faithful.
faithful_mod <- lm(waiting ~ eruptions, data = faithful)
faithful_mod
## 
## Call:
## lm(formula = waiting ~ eruptions, data = faithful)
## 
## Coefficients:
## (Intercept)    eruptions  
##       33.47        10.73

The estimated beta coefficients from the simple linear regression model are β₀ = 33.47 and β₁ = 10.73. The slope means the model predicts that the waiting time until the next eruption increases by about 10.73 minutes for each additional minute of eruption duration. The intercept β₀ = 33.47 is the predicted waiting time after an eruption lasting 0 minutes. It is not a minimum waiting time, and because the observed durations range from 1.6 to 5.1 minutes, it is an extrapolation with little practical meaning.
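The reading of the intercept can be checked against the data directly. This short sketch compares the observed ranges with the model’s prediction at a duration of zero; `zero_pred` is a name introduced here for illustration.

```r
faithful_mod <- lm(waiting ~ eruptions, data = faithful)

# Observed ranges: a duration of 0 minutes lies outside the eruption data
range(faithful$eruptions)
range(faithful$waiting)

# The intercept is simply the model's prediction at eruptions = 0
zero_pred <- predict(faithful_mod, newdata = data.frame(eruptions = 0))
zero_pred
coef(faithful_mod)[["(Intercept)"]]
```

Comparing `zero_pred` with the shortest observed waiting time shows why the intercept should not be read as a minimum wait.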

This model has several limitations if applied to future eruptions of Old Faithful. The data show two distinct clusters of waiting times, so the assumption of a strictly linear relationship may not hold across all eruption durations. Predictions outside the observed range could be inaccurate, and the model does not account for natural variability or environmental factors that may influence waiting times. Therefore, while useful for general trends, the model has limited precision for predicting individual future eruptions.

d) Add a fitted line to your scatterplot in part (a). Does it appear to be a good fit? Why or why not?
faithful_update <- data.frame(
  eruptions = seq(min(faithful$eruptions), max(faithful$eruptions), length.out = 100)
)
pred <- predict(faithful_mod, newdata = faithful_update)
faithful_pred <- cbind(faithful_update, fit = pred)

ggplot(faithful, aes(x = eruptions, y = waiting)) +
  geom_point(color = "lightgreen", shape = 15, size = 2) +
  geom_line(data = faithful_pred, aes(x = eruptions, y = fit), color = "grey25", linewidth = 1) +
  labs(
    title = "Scatterplot of Old Faithful with Fitted Regression Line",
    x = "Eruption Duration (minutes)",
    y = "Waiting Time (minutes)"
  )

The fitted line follows the overall positive trend, but the scatter of points and the two clusters indicate that the linear model does not perfectly capture all variation. It is a reasonable first approximation but may oversimplify the relationship.
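As an alternative to building a prediction grid by hand, ggplot2 can fit and draw the same least-squares line itself. The sketch below uses `geom_smooth(method = "lm")` with the confidence band switched off; `p_smooth` is a name introduced here, and this is offered as an equivalent shortcut rather than the required approach.

```r
library(ggplot2)

faithful_mod <- lm(waiting ~ eruptions, data = faithful)

p_smooth <- ggplot(faithful, aes(x = eruptions, y = waiting)) +
  geom_point(color = "lightgreen", shape = 15, size = 2) +
  # Fits waiting ~ eruptions by least squares and draws the line only (no band)
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
              color = "grey25", linewidth = 1) +
  labs(
    title = "Old Faithful with Fitted Line via geom_smooth()",
    x = "Eruption Duration (minutes)",
    y = "Waiting Time (minutes)"
  )
p_smooth
```

The manual approach remains useful when the line or band must extend beyond the observed range, since `geom_smooth()` draws only across the data by default.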

e) Generate and interpret the prediction interval for how long a person is expected to wait for the next eruption, if the prior eruption lasted 6 minutes. Does the waiting time predicted seem plausible? Why or why not?
new_eruption <- data.frame(eruptions = 6)

predict(faithful_mod, newdata = new_eruption, interval = "prediction", level = 0.99)
##        fit     lwr      upr
## 1 97.85225 82.3459 113.3586

The predicted waiting time for the next eruption after a 6-minute eruption is approximately 97.85 minutes, with a 99% prediction interval ranging from 82.35 to 113.36 minutes. This interval means we can be 99% confident that the actual waiting time for an individual eruption following a 6-minute duration will fall within this range.

The prediction seems plausible given the general trend in the data, since longer eruptions tend to be followed by longer waiting times. However, we must be cautious of extrapolation: a 6-minute eruption is longer than any observed eruption (the longest lasted 5.1 minutes), and the predicted wait of 97.85 minutes exceeds the longest observed wait (96 minutes). Predictions outside the range of the collected data carry more uncertainty, and the actual waiting time could differ from the model’s estimate.
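Whether a prediction point is an extrapolation can be checked programmatically before trusting the interval. A minimal sketch, with `x0`, `obs_range`, `is_extrapolation`, and `pred_6` introduced here for illustration:

```r
faithful_mod <- lm(waiting ~ eruptions, data = faithful)

x0        <- 6                              # eruption duration to predict for
obs_range <- range(faithful$eruptions)      # shortest and longest observed eruptions

# TRUE when x0 falls outside the observed durations
is_extrapolation <- x0 < obs_range[1] || x0 > obs_range[2]
is_extrapolation

# Compare the predicted wait with the longest wait actually observed
pred_6 <- predict(faithful_mod, newdata = data.frame(eruptions = x0))
unname(pred_6) > max(faithful$waiting)
```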

f) Add a shaded prediction interval to your scatterplot in part (d). Does it visually align with your prediction interval in part (e) at the 6-minute point?
pred <- predict(faithful_mod, newdata = faithful_update, interval = "prediction", level = 0.99)
faithful_pred <- cbind(faithful_update, as.data.frame(pred))

ggplot(faithful, aes(x = eruptions, y = waiting)) +
  geom_point(color = "lightgreen", shape = 15, size = 2) +
  geom_ribbon(data = faithful_pred, aes(x = eruptions, ymin = lwr, ymax = upr, y = NULL),
              alpha = 0.2, fill = "grey50") +
  geom_line(data = faithful_pred, aes(x = eruptions, y = fit), color = "grey25", linewidth = 1) +
  labs(
    title = "Old Faithful Eruption Duration vs. Waiting Time with Prediction Interval",
    x = "Eruption Duration (minutes)",
    y = "Waiting Time (minutes)"
  )

The shaded area represents the 99% prediction interval. Because the prediction grid covers only the observed durations, the band ends at about 5.1 minutes and does not reach the 6-minute point. Continuing its upward trend to 6 minutes, the band is consistent with the interval from part (e), roughly 82.35 to 113.36 minutes. The band also widens slightly toward the extremes, reflecting greater uncertainty for eruptions far from the center of the observed data.
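The widening of the band is hard to see by eye, so it can help to compute the interval width at a few durations. The sketch below compares the shortest, mean, and longest observed durations with the 6-minute point; `check_pts`, `pi_99`, and `widths` are names introduced here for illustration.

```r
faithful_mod <- lm(waiting ~ eruptions, data = faithful)

# Durations at which to compare interval widths; 6 lies beyond the observed data
check_pts <- data.frame(
  eruptions = c(min(faithful$eruptions),
                mean(faithful$eruptions),
                max(faithful$eruptions),
                6)
)

pi_99 <- predict(faithful_mod, newdata = check_pts,
                 interval = "prediction", level = 0.99)

# Width of the 99% prediction interval at each duration
widths <- cbind(check_pts, width = pi_99[, "upr"] - pi_99[, "lwr"])
widths
```

The width is smallest at the mean duration and grows with distance from it, which is why the interval at 6 minutes is the widest of the four.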