1 Introduction

This report analyzes the relationship between the number of days per year that ozone levels exceeded 0.20 ppm (response \(Y\)) and a seasonal meteorological index (predictor \(X\)), defined as the seasonal average 850-millibar temperature.

We fit the simple linear regression model: \[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \] where \(\varepsilon_i\) represents random error. The analysis includes exploratory visualization, estimation and interpretation of regression coefficients, hypothesis testing on the slope, and diagnostic checks of model assumptions.

2 Data

The data consist of annual observations from 1976 through 1991 (n = 16). The response variable, Days, is the number of days in each year on which ozone concentrations exceeded 0.20 ppm; the predictor variable, Index, is the seasonal average 850-millibar temperature, which summarizes prevailing seasonal meteorological conditions. The observations are treated as independent, although they form a consecutive annual series, so this assumption cannot be taken for granted. The complete dataset is listed below for transparency and reproducibility.

ozone <- data.frame(
  Year  = c(1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991),
  Days  = c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36),
  Index = c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
)
ozone
##    Year Days Index
## 1  1976   91  16.7
## 2  1977  105  17.1
## 3  1978  106  18.2
## 4  1979  108  18.1
## 5  1980   88  17.2
## 6  1981   91  18.2
## 7  1982   58  16.0
## 8  1983   82  17.2
## 9  1984   81  18.0
## 10 1985   65  17.2
## 11 1986   61  16.9
## 12 1987   48  17.1
## 13 1988   61  18.2
## 14 1989   43  17.3
## 15 1990   33  17.5
## 16 1991   36  16.6

3 Exploratory Analysis

An exploratory scatterplot of ozone exceedance days versus the seasonal meteorological index suggests at most a weak positive association: higher index values tend to accompany more exceedance days, but the scatter around any trend is large. No extreme outliers are evident, although 1982 has the lowest index value (16.0), somewhat below the rest of the data. There is no strong visual indication of nonlinearity, so a simple linear regression model is a reasonable initial approach.

3.1 Scatterplot of Days vs Index

plot(
  ozone$Index, ozone$Days,
  pch = 19,
  xlab = "Seasonal Meteorological Index (X)",
  ylab = "Days with Ozone > 0.20 ppm (Y)",
  main = "Scatterplot of Ozone Exceedance Days vs Meteorological Index"
)
grid()

4 Regression Model

A simple linear regression model was fitted to assess the relationship between ozone exceedance days and the seasonal meteorological index and to quantify the strength and uncertainty of this association.

fit <- lm(Days ~ Index, data = ozone)
summary(fit)
## 
## Call:
## lm(formula = Days ~ Index, data = ozone)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41.70 -21.54   2.12  18.56  36.42 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -192.984    163.503  -1.180    0.258
## Index         15.296      9.421   1.624    0.127
## 
## Residual standard error: 23.79 on 14 degrees of freedom
## Multiple R-squared:  0.1585, Adjusted R-squared:  0.09835 
## F-statistic: 2.636 on 1 and 14 DF,  p-value: 0.1267
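The coefficient estimates reported above can be cross-checked against the closed-form least-squares formulas, \(b_1 = S_{xy}/S_{xx}\) and \(b_0 = \bar{y} - b_1\bar{x}\). The following is a minimal sketch; the names `sxx`, `sxy`, `b0`, and `b1` are introduced here for illustration and do not appear elsewhere in the report.

```r
# Data as listed in Section 2 (repeated so the block runs on its own)
x <- c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
y <- c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36)

# Corrected sums of squares and cross-products
sxx <- sum((x - mean(x))^2)
sxy <- sum((x - mean(x)) * (y - mean(y)))

# Closed-form least-squares estimates
b1 <- sxy / sxx               # slope
b0 <- mean(y) - b1 * mean(x)  # intercept

c(intercept = b0, slope = b1)
```

The two values should agree with `coef(fit)` up to floating-point rounding.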

4.1 Regression Coefficients (with 95% CI)

# Coefficient estimates
coef(fit)
## (Intercept)       Index 
##  -192.98383    15.29637
# 95% confidence intervals for coefficients
confint(fit, level = 0.95)
##                   2.5 %    97.5 %
## (Intercept) -543.663500 157.69583
## Index         -4.909616  35.50235

The estimated slope represents the expected change in the mean number of ozone exceedance days for a one-unit increase in the meteorological index. The slope estimate is positive (about 15.3 days per unit), but its 95% confidence interval, (-4.91, 35.50), includes zero. At the 5% level the data are therefore consistent with no linear relationship, and the estimate is too imprecise to establish the direction of the effect.
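The slope interval returned by `confint()` is the usual t-based interval, estimate plus or minus the critical value times the standard error. As a sketch, it can be rebuilt by hand as follows; `ci_by_hand` and `ci_contains_zero` are illustrative names introduced here.

```r
# Data as listed in Section 2 (repeated so the block runs on its own)
ozone <- data.frame(
  Days  = c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36),
  Index = c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
)
fit <- lm(Days ~ Index, data = ozone)

# Slope estimate and its standard error from the coefficient table
est <- coef(summary(fit))["Index", "Estimate"]
se  <- coef(summary(fit))["Index", "Std. Error"]

# Two-sided 95% critical value on n - 2 = 14 degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))

ci_by_hand <- c(lower = est - t_crit * se, upper = est + t_crit * se)
ci_by_hand

# TRUE when zero lies inside the interval
ci_contains_zero <- ci_by_hand[["lower"]] < 0 && ci_by_hand[["upper"]] > 0
ci_contains_zero
```

Checking whether zero lies inside this interval is equivalent to the two-sided t-test on the slope at the 5% level.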

4.2 R-Squared

r2 <- summary(fit)$r.squared
r2
## [1] 0.1584636

The coefficient of determination, \(R^2\), is the proportion of variability in ozone exceedance days explained by the meteorological index. The observed value of about 0.16 means that the index accounts for only about 16% of the variation; the remaining 84% reflects other factors and random year-to-year differences in exceedance days.
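To make the definition concrete, \(R^2\) can be computed directly as one minus the ratio of the residual sum of squares to the total sum of squares. In simple linear regression it also equals the squared sample correlation between the predictor and the response. A minimal sketch, with `sse`, `sst`, `r2_by_hand`, and `r_xy` as illustrative names:

```r
# Data as listed in Section 2 (repeated so the block runs on its own)
ozone <- data.frame(
  Days  = c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36),
  Index = c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
)
fit <- lm(Days ~ Index, data = ozone)

sse <- sum(residuals(fit)^2)                   # residual sum of squares
sst <- sum((ozone$Days - mean(ozone$Days))^2)  # total sum of squares
r2_by_hand <- 1 - sse / sst

# Sample correlation; its square matches R^2 in simple linear regression
r_xy <- cor(ozone$Index, ozone$Days)

c(r2_by_hand = r2_by_hand, r_squared = r_xy^2)
```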

4.3 Hypothesis Test on Slope (p-value)

p_value_slope <- summary(fit)$coefficients["Index", "Pr(>|t|)"]
p_value_slope
## [1] 0.1267446

A hypothesis test of \(H_0:\beta_1 = 0 \quad \text{vs.} \quad H_1:\beta_1 \neq 0\) was conducted to assess whether the meteorological index is linearly associated with ozone exceedance days. The resulting p-value, 0.127, exceeds the conventional 0.05 significance level, so we fail to reject \(H_0\): the data do not provide sufficient evidence of a linear association between the index and exceedance days.
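The p-value comes from comparing the t-statistic (slope estimate divided by its standard error) with a t distribution on \(n - 2 = 14\) degrees of freedom. The sketch below reproduces it by hand; `t_stat` and `p_by_hand` are illustrative names.

```r
# Data as listed in Section 2 (repeated so the block runs on its own)
ozone <- data.frame(
  Days  = c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36),
  Index = c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
)
fit <- lm(Days ~ Index, data = ozone)

est <- coef(summary(fit))["Index", "Estimate"]
se  <- coef(summary(fit))["Index", "Std. Error"]

# Test statistic for H0: beta1 = 0
t_stat <- est / se

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_by_hand <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

c(t_stat = t_stat, p_value = p_by_hand)
```

With a single predictor, the square of this t-statistic equals the F-statistic in the model summary, so the two tests give the same p-value.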

5 Model Adequacy Checks

par(mfrow = c(2, 2))
plot(fit)

Model adequacy was evaluated using standard regression diagnostic plots to assess the assumptions of linearity, constant variance, normality of errors, and the presence of influential observations. The residuals versus fitted values plot shows no strong systematic pattern, supporting the assumption of linearity, though a slight increase in residual spread at higher fitted values suggests some departure from perfectly constant variance. The normal probability plot of the residuals is approximately linear, indicating that the normality assumption is reasonably satisfied. The residuals versus leverage plot does not reveal any observations with undue influence on the fitted model. Overall, the diagnostics reveal no serious violations of these assumptions, although with only 16 observations they have limited power to detect departures, and they do not assess the independence of the annual observations.
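The visual checks can be supplemented with a few numerical summaries. The sketch below is one possible set, not part of the original analysis. It uses a Shapiro-Wilk test for normality, hat values and Cook's distances for leverage and influence, and the lag-1 autocorrelation of the residuals in year order as a rough look at independence. The leverage cutoff \(2p/n\) is a common rule of thumb and is an assumption here, as are the names `lev_cut`, `high_lev`, and `lag1`.

```r
# Data as listed in Section 2 (repeated so the block runs on its own)
ozone <- data.frame(
  Year  = 1976:1991,
  Days  = c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36),
  Index = c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
)
fit <- lm(Days ~ Index, data = ozone)
res <- residuals(fit)

# Normality: Shapiro-Wilk test on the residuals
sw <- shapiro.test(res)

# Leverage: hat values compared with the rule-of-thumb cutoff 2p/n
h        <- hatvalues(fit)
lev_cut  <- 2 * length(coef(fit)) / nrow(ozone)
high_lev <- ozone$Year[h > lev_cut]

# Influence: Cook's distance for each year
cd <- cooks.distance(fit)

# Independence: lag-1 autocorrelation of residuals in time order
lag1 <- acf(res, lag.max = 1, plot = FALSE)$acf[2]

list(shapiro_p = sw$p.value, high_leverage_years = high_lev,
     max_cooks_d = max(cd), lag1_autocorr = lag1)
```

A lag-1 autocorrelation far from zero would suggest that the residuals carry a time trend the model does not capture, which would weaken the inference reported above.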

6 Fitted Model with 95% Confidence and Prediction Intervals

x_grid <- seq(min(ozone$Index), max(ozone$Index), length.out = 200)
newdat <- data.frame(Index = x_grid)
conf <- predict(fit, newdata = newdat, interval = "confidence", level = 0.95)
pred <- predict(fit, newdata = newdat, interval = "prediction", level = 0.95)

plot(
  ozone$Index, ozone$Days,
  pch = 19,
  xlab = "Seasonal Meteorological Index (X)",
  ylab = "Days with Ozone > 0.20 ppm (Y)",
  main = "Regression Fit with 95% Confidence and Prediction Intervals"
)
grid()

lines(x_grid, pred[, "lwr"], lty = 2)
lines(x_grid, pred[, "upr"], lty = 2)
lines(x_grid, conf[, "lwr"], lty = 3)
lines(x_grid, conf[, "upr"], lty = 3)
lines(x_grid, conf[, "fit"], lwd = 2)
legend(
  "topleft",
  legend = c("Fitted line", "95% Confidence interval", "95% Prediction interval"),
  lty = c(1, 3, 2),
  lwd = c(2, 1, 1),
  bty = "n"
)

The fitted regression line, along with 95% confidence and prediction intervals, is presented to illustrate both the estimated relationship between the meteorological index and ozone exceedance days and the associated uncertainty. The confidence interval describes uncertainty in the mean number of exceedance days at a given index value, while the prediction interval accounts for additional year-to-year variability and therefore provides a wider range for individual future observations. As expected, uncertainty increases toward the edges of the observed index range. These intervals provide useful context for interpreting the reliability of both average trends and individual-year predictions derived from the model.
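For reference, the two bands follow the standard formulas for simple linear regression. The notation is introduced here: \(x_0\) is the index value of interest, \(\hat{Y}_0 = \hat\beta_0 + \hat\beta_1 x_0\) is the fitted value, \(s\) is the residual standard error, \(n = 16\), and \(S_{xx} = \sum_i (X_i - \bar{X})^2\). The 95% confidence interval for the mean response at \(x_0\) is \[ \hat{Y}_0 \pm t_{0.975,\,n-2}\; s\,\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{X})^2}{S_{xx}}}, \] and the 95% prediction interval for a single new observation at \(x_0\) is \[ \hat{Y}_0 \pm t_{0.975,\,n-2}\; s\,\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{X})^2}{S_{xx}}}. \] The \((x_0 - \bar{X})^2\) term is why both bands widen away from the mean index value, and the extra 1 under the square root is why the prediction band is always wider than the confidence band.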

7 Conclusion and Recommendations

The analysis does not provide statistically significant evidence of a linear association between the seasonal meteorological index and the number of days on which ozone concentrations exceed 0.20 ppm. The estimated slope is positive (about 15.3 additional exceedance days per one-unit increase in the index), but its 95% confidence interval, (-4.91, 35.50), includes zero, and the p-value for the slope is 0.127. The fitted model explains only about 16% of the variability in exceedance days, so most of the year-to-year variation remains unexplained and other factors evidently influence ozone exceedance behavior. The diagnostic plots show no serious violations of the model assumptions, but they cannot compensate for the weak signal in a sample of 16 years. It is recommended that this model not be used as a standalone predictor and that it serve at most as a supporting tool for seasonal assessment and planning. Future analyses should consider incorporating additional meteorological or emissions-related variables to improve predictive performance.