A study of air quality in the South Coast Air Basin of California from 1976 to 1991 produced the data set presented below. A meteorologist commissioned an analysis of the data to assess how strongly Days (Y) depends on Index (X). Days is the number of days in each year on which the ozone level exceeded 0.22 ppm, and Index is the seasonal average 850-millibar temperature.
| Year | Days | Index |
|---|---|---|
| 1976 | 91 | 16.7 |
| 1977 | 105 | 17.1 |
| 1978 | 106 | 18.2 |
| 1979 | 108 | 18.1 |
| 1980 | 88 | 17.2 |
| 1981 | 91 | 18.2 |
| 1982 | 58 | 16.0 |
| 1983 | 82 | 17.2 |
| 1984 | 81 | 18.0 |
| 1985 | 65 | 17.2 |
| 1986 | 61 | 16.9 |
| 1987 | 48 | 17.1 |
| 1988 | 61 | 18.2 |
| 1989 | 43 | 17.3 |
| 1990 | 33 | 17.5 |
| 1991 | 36 | 16.6 |
x <- c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
y <- c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36)
plot(x,y,main = "Days (y) vs Index (x)",xlab = "Index (x)",ylab = "Days (y)",col='Green', pch=16)
yx_Out <- lm(y~x)
summary(yx_Out)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -41.70 -21.54   2.12  18.56  36.42
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -192.984    163.503  -1.180    0.258
## x             15.296      9.421   1.624    0.127
##
## Residual standard error: 23.79 on 14 degrees of freedom
## Multiple R-squared: 0.1585, Adjusted R-squared: 0.09835
## F-statistic: 2.636 on 1 and 14 DF, p-value: 0.1267
#Note: Produces b_0, b_1, R^2, and p-value
new_x <- seq(min(x),max(x),by=0.02)
#Calculate confidence/prediction lines
conf_int <- predict(yx_Out, data.frame(x = new_x), interval = "confidence", level = 0.95)
pred_int <- predict(yx_Out, data.frame(x = new_x), interval = "prediction", level = 0.95)
#Output did not show prediction lines, need to specify y limits
y_min <- min(y,pred_int[,"lwr"])
y_max <- max(y,pred_int[,"upr"])
#Plot
plot(x,y,main="SLR with 95% Confidence & Prediction Bands",xlab ="Index (x)",ylab = "Days (y)",pch=19,col="blue",ylim=c(y_min,y_max))
abline(yx_Out,col="red")
#Confidence upper and lower
lines(new_x, conf_int[,"lwr"],col="green")
lines(new_x, conf_int[,"upr"],col="green")
#Prediction upper and lower
lines(new_x, pred_int[, "lwr"], col = "purple")
lines(new_x, pred_int[, "upr"], col = "purple")
legend("bottomright",legend = c("Fitted line", "95% Confidence band", "95% Prediction band"),col=c("red", "green", "purple"),lty=c(1,1,1),cex=0.65)
#Other plots
par(mfrow=c(2,2)) #Cleaner in html output
plot(yx_Out)
The scatterplot of ozone exceedance days versus the Index shows a positive trend on inspection: higher Index values tend to correspond to more exceedance days. The data points are widely spread, indicating year-to-year variability that the Index does not fully explain. This suggests that the Index is not a strong stand-alone predictor and that other variables could improve the model.
The fitted simple linear regression model yields an intercept of -192.98 and a slope of 15.30. For each one-unit increase in the Index, the model predicts about 15 additional ozone exceedance days. The intercept is not physically meaningful because an Index value of zero lies far outside the observed range of 16.0 to 18.2. The positive slope aligns with the expectation that warmer conditions favor ozone formation.
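As a quick check of the fitted equation, the sketch below refits the model and evaluates it at an Index of 17.5. That value is chosen here purely for illustration (it lies inside the observed range), and the names `fit`, `b`, and `x_new` are introduced for this example rather than taken from the original analysis.

```r
# Data from the table above
x <- c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
y <- c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36)

fit <- lm(y ~ x)
b <- coef(fit)                      # b[1] = intercept, b[2] = slope

# Hypothetical Index value, chosen only for illustration
x_new <- 17.5

# Prediction from the fitted equation: y-hat = b0 + b1 * x
y_hat_manual <- unname(b[1] + b[2] * x_new)

# Same prediction through predict()
y_hat_predict <- unname(predict(fit, data.frame(x = x_new)))

print(round(b, 3))
print(c(manual = y_hat_manual, predict = y_hat_predict))
```

Both approaches give the same point prediction, roughly 75 exceedance days at an Index of 17.5. The wide prediction band in the plot is a reminder that a single year can fall far from this value.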
The model adequacy diagnostics (the residuals versus fitted plot, normal Q-Q plot, scale-location plot, and residuals versus leverage plot) show deviations from the ideal SLR assumptions. The residuals exhibit mild curvature and possible changes in spread, suggesting nonlinearity and nonconstant variance. The Q-Q plot shows deviations from normality. Taken together, the diagnostics indicate that the model may not capture the structure of the data and therefore may not be reliable.
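The four diagnostic plots can be complemented with the numbers behind them. The following is a minimal sketch, not part of the original analysis: it tabulates standardized residuals, leverages, and Cook's distances using base R, and flags high-leverage points with the common 2p/n rule of thumb, which is an assumed convention rather than a formal test. The names `diag_tbl`, `cutoff`, and `high_leverage` are illustrative.

```r
# Data from the table above
x <- c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
y <- c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36)

fit <- lm(y ~ x)

# One row per observation, all columns numeric
diag_tbl <- data.frame(
  index     = x,
  days      = y,
  fitted    = fitted(fit),
  std_resid = rstandard(fit),       # standardized residuals
  leverage  = hatvalues(fit),       # diagonal of the hat matrix
  cooks_d   = cooks.distance(fit)   # influence of each point on the fit
)

# Rule-of-thumb leverage cutoff 2p/n (assumed convention), p = number of coefficients
p <- length(coef(fit))
n <- nrow(diag_tbl)
cutoff <- 2 * p / n
high_leverage <- diag_tbl$leverage > cutoff

print(round(diag_tbl, 3))
print(which(high_leverage))         # row numbers of flagged observations
```

With these data the rule flags only the 1982 observation (Index 16.0), the lowest Index in the sample, so it is worth checking how much the fitted slope depends on that single year.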
The model's R-squared value of about 0.16 means that only 16% of the variability in ozone exceedance days is explained by the meteorological Index, with the remaining 84% unexplained. The Index alone is therefore not sufficient for predicting exceedance days, and other meteorological or emissions-related variables not included in the model likely matter.
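To make the "16% explained" statement concrete, the sketch below recomputes R-squared from its definition and compares it with the value reported by `summary()`. It also uses the fact that in simple linear regression R-squared equals the squared sample correlation between x and y. The variable names (`sst`, `sse`, `r2_manual`, and so on) are introduced here for illustration.

```r
# Data from the table above
x <- c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
y <- c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36)

fit <- lm(y ~ x)

sst <- sum((y - mean(y))^2)         # total sum of squares
sse <- sum(resid(fit)^2)            # residual (unexplained) sum of squares

r2_manual <- 1 - sse / sst          # share of variability explained by the model
r2_cor    <- cor(x, y)^2            # squared correlation, equal to R^2 in SLR
r2_lm     <- summary(fit)$r.squared # value reported in the summary output

print(round(c(manual = r2_manual, cor_sq = r2_cor, lm = r2_lm), 4))
```

All three values agree with the Multiple R-squared of 0.1585 in the summary output.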
The hypothesis test for the slope produces a t statistic of 1.624 on 14 degrees of freedom and a p-value of 0.127. Because the p-value is larger than the 0.05 threshold, we fail to reject the null hypothesis that the true slope is zero. Although the fitted slope is positive, the sample data do not provide statistically strong evidence of a linear relationship between the Index and exceedance days.
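The slope test can be reproduced by hand from the estimate and its standard error. The sketch below does so and also computes a 95% confidence interval for the slope with `confint()`. The interval is an addition for illustration and was not part of the original output, and the names `est`, `t_stat`, `p_val`, and `ci` are hypothetical.

```r
# Data from the table above
x <- c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
y <- c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36)

fit <- lm(y ~ x)

# Row of the coefficient table for the slope
est <- coef(summary(fit))["x", ]
df  <- df.residual(fit)             # n - 2 = 14

# t statistic for H0: slope = 0, and its two-sided p-value
t_stat <- unname(est["Estimate"] / est["Std. Error"])
p_val  <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval for the slope
ci <- confint(fit, "x", level = 0.95)

print(c(t = t_stat, df = df, p = p_val))
print(ci)
```

The interval contains zero, which is the same conclusion as a p-value above 0.05: the data are compatible with no linear effect of the Index as well as with a fairly large positive one.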