Days On Index Simple Linear Regression

Problem Statement

A study of air quality in the South Coast Air Basin of California from 1976 to 1991 produced the data set presented below. A meteorologist commissioned an analysis of the data to assess the relationship between Days (Y) and Index (X). Days is the number of days in a year on which ozone levels exceeded 0.22 ppm, and Index is the seasonal average 850-millibar temperature.

Year Days Index
1976 91 16.7
1977 105 17.1
1978 106 18.2
1979 108 18.1
1980 88 17.2
1981 91 18.2
1982 58 16.0
1983 82 17.2
1984 81 18.0
1985 65 17.2
1986 61 16.9
1987 48 17.1
1988 61 18.2
1989 43 17.3
1990 33 17.5
1991 36 16.6

Computation

Step 1: Enter the Index (x) and Days (y) columns as vectors:

x <- c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)

y <- c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36)

Step 2: Produce a scatter plot of Days (y) versus Index (x):

plot(x,y,main = "Days (y) vs Index (x)",xlab = "Index (x)",ylab = "Days (y)",col='Green', pch=16)

Step 3: Fit the simple linear regression (SLR) of y on x and print the summary statistics:

yx_Out <- lm(y~x)
summary(yx_Out)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41.70 -21.54   2.12  18.56  36.42 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -192.984    163.503  -1.180    0.258
## x             15.296      9.421   1.624    0.127
## 
## Residual standard error: 23.79 on 14 degrees of freedom
## Multiple R-squared:  0.1585, Adjusted R-squared:  0.09835 
## F-statistic: 2.636 on 1 and 14 DF,  p-value: 0.1267
#Note: The summary reports b_0, b_1, R^2, and the p-value for the slope
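
The quantities in the printed summary can also be extracted programmatically, which is convenient for reporting. The sketch below refits the same model. The names fit, fit_sum, b, r2, p_slope, and ci_slope are illustrative, and the confint() call is an addition that is not part of the original analysis.

```r
# Data from the table above: Index (x) and Days (y)
x <- c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
y <- c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36)

fit <- lm(y ~ x)          # same model as yx_Out
fit_sum <- summary(fit)

b <- coef(fit)                                     # b["(Intercept)"] is b_0, b["x"] is b_1
r2 <- fit_sum$r.squared                            # Multiple R-squared
p_slope <- fit_sum$coefficients["x", "Pr(>|t|)"]   # p-value for H0: slope = 0
ci_slope <- confint(fit, "x", level = 0.95)        # 95% CI for the slope (added for illustration)

print(round(b, 3))
print(round(c(R2 = r2, p_slope = p_slope), 4))
print(ci_slope)
```

Because the slope p-value exceeds 0.05, the 95% confidence interval for the slope includes zero.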

Step 4: Produce final visual with confidence and prediction intervals:

new_x <- seq(min(x),max(x),by=0.02)

#Calculate confidence/prediction lines
conf_int <- predict(yx_Out, data.frame(x = new_x), interval = "confidence", level = 0.95)
pred_int <- predict(yx_Out, data.frame(x = new_x), interval = "prediction", level = 0.95)

#With default axes the prediction lines were cut off, so set y limits that cover them
y_min <- min(y,pred_int[,"lwr"])
y_max <- max(y,pred_int[,"upr"])

#Plot
plot(x,y,main="SLR with 95% Confidence & Prediction Bands",xlab ="Index (x)",ylab = "Days (y)",pch=19,col="blue",ylim=c(y_min,y_max))

abline(yx_Out,col="red")

#Confidence upper and lower
lines(new_x, conf_int[,"lwr"],col="green")
lines(new_x, conf_int[,"upr"],col="green")

#Prediction upper and lower
lines(new_x, pred_int[, "lwr"], col = "purple")
lines(new_x, pred_int[, "upr"], col = "purple")

legend("bottomright",legend = c("Fitted line", "95% Confidence band", "95% Prediction band"),col=c("red", "green", "purple"),lty=c(1,1,1),cex=0.65)
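
To see what the two bands mean at a single point, the same predict() calls can be evaluated at one Index value. This is a minimal sketch: the value 17.0 is a hypothetical choice inside the observed range, and the names x0, conf_pt, and pred_pt are illustrative.

```r
# Data from the table above: Index (x) and Days (y)
x <- c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
y <- c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36)

fit <- lm(y ~ x)

# Hypothetical Index value at which to compare the two intervals
x0 <- data.frame(x = 17.0)

# Confidence interval: uncertainty in the mean number of days at this Index
conf_pt <- predict(fit, x0, interval = "confidence", level = 0.95)
# Prediction interval: uncertainty for a single new year at this Index
pred_pt <- predict(fit, x0, interval = "prediction", level = 0.95)

conf_width <- unname(conf_pt[1, "upr"] - conf_pt[1, "lwr"])
pred_width <- unname(pred_pt[1, "upr"] - pred_pt[1, "lwr"])

print(conf_pt)
print(pred_pt)
print(c(conf_width = conf_width, pred_width = pred_width))
```

Both intervals share the same fitted value. The prediction interval is wider because it also accounts for the scatter of individual years around the regression line.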

Step 5: Diagnostic plots for model adequacy:

#Diagnostic plots
par(mfrow=c(2,2)) #Arrange the four plots in a 2x2 grid (cleaner in html output)
plot(yx_Out)

Interpretation of Results

The scatterplot of ozone exceedance days versus the Index shows a positive trend upon inspection: higher Index values tend to correspond to more exceedance days. The data points are widely spread, indicating year-to-year variability that the Index does not fully explain. This suggests that the Index is not a strong stand-alone predictor and that other variables could affect the number of exceedance days.

The fitted simple linear regression model yields an intercept of -192.98 and a slope of 15.30. For each one-unit increase in the Index, the model predicts about 15 additional ozone exceedance days. The intercept is not physically meaningful because an Index value of zero is far outside the observed meteorological conditions (the observed Index ranges from 16.0 to 18.2). The positive slope aligns with the expectation that warmer conditions favor ozone formation.
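
The slope interpretation can be checked directly from the fitted model. This is a small sketch in which the Index values 17 and 18 are hypothetical inputs one unit apart and the names fit, b, pred, and step_up are illustrative.

```r
# Data from the table above: Index (x) and Days (y)
x <- c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
y <- c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36)

fit <- lm(y ~ x)
b <- coef(fit)

# Predicted days at two hypothetical Index values one unit apart
pred <- predict(fit, data.frame(x = c(17, 18)))

# The difference between the two predictions equals the slope b_1
step_up <- unname(pred[2] - pred[1])

print(round(b, 2))
print(round(step_up, 2))

# Observed Index range: x = 0 lies far outside it, so b_0 is an extrapolation
print(range(x))
```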

The model adequacy diagnostics (the residuals versus fitted plot, normal Q-Q plot, scale-location plot, and residuals versus leverage plot) show deviations from the ideal SLR assumptions. The residuals exhibit mild curvature and possible changes in spread, which point to nonlinearity and non-constant variance. The Q-Q plot shows deviations from normality. Taken together, the diagnostics suggest that the model may not fully capture the structure of the data, so its results should be treated with caution.
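
The visual diagnostics can be supplemented with a few numeric summaries. The sketch below is an addition to the original analysis, not a replacement for the plots: it computes standardized residuals and leverages and runs a Shapiro-Wilk normality test on the residuals. With only 16 observations such a test has limited power, so its result should be read alongside the Q-Q plot. The names res, std_res, lev, and sw are illustrative.

```r
# Data from the table above: Index (x) and Days (y)
x <- c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
y <- c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36)

fit <- lm(y ~ x)

res <- residuals(fit)       # raw residuals (sum to zero for a model with an intercept)
std_res <- rstandard(fit)   # standardized residuals, as used in the Q-Q and scale-location plots
lev <- hatvalues(fit)       # leverage of each observation

# Shapiro-Wilk test: H0 is that the residuals come from a normal distribution
sw <- shapiro.test(res)

print(round(range(std_res), 2))
print(round(max(lev), 3))
print(sw)
```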

The model's R-squared value of about 0.16 means that only 16% of the variability in ozone exceedance days is explained by the meteorological Index, with the remaining 84% unexplained. The Index alone is therefore not sufficient for predicting exceedance days, and other meteorological or emissions-related variables not included in the model are likely to matter.
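
The R-squared value can be reproduced from its definition as one minus the ratio of the residual sum of squares to the total sum of squares. The following sketch uses illustrative names (sse, sst, r2_manual) that are not part of the original code.

```r
# Data from the table above: Index (x) and Days (y)
x <- c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
y <- c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36)

fit <- lm(y ~ x)

sse <- sum(residuals(fit)^2)   # variation left unexplained by the model
sst <- sum((y - mean(y))^2)    # total variation in Days

r2_manual <- 1 - sse / sst

# In simple linear regression, R^2 also equals the squared correlation of x and y
r2_cor <- cor(x, y)^2

print(round(c(manual = r2_manual, cor_sq = r2_cor,
              lm = summary(fit)$r.squared), 4))
```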

The hypothesis test for the slope produces a p-value of 0.127, which is larger than the 0.05 threshold, so we fail to reject the null hypothesis that the true slope is zero. Although the fitted slope is positive, the sample data do not provide statistically significant evidence of a linear relationship between the Index and exceedance days.
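
The slope test can be reproduced by hand from the estimate and its standard error. This is a minimal sketch of the two-sided t-test with n - 2 = 14 degrees of freedom. The names b1, se_b1, t_stat, and p_val are illustrative.

```r
# Data from the table above: Index (x) and Days (y)
x <- c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
y <- c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36)

fit <- lm(y ~ x)
coefs <- summary(fit)$coefficients

b1 <- coefs["x", "Estimate"]
se_b1 <- coefs["x", "Std. Error"]
df <- df.residual(fit)    # n - 2 = 14

# Test statistic for H0: slope = 0
t_stat <- b1 / se_b1

# Two-sided p-value from the t distribution
p_val <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

print(round(c(t = t_stat, p = p_val), 4))

# With a single predictor, the overall F-statistic equals t^2
print(round(t_stat^2, 3))
```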