Simple Linear Regression Analysis of Ozone Exceedance Days
Section 1: Introduction
Ground-level ozone is a major component of air pollution and poses serious risks to both public health and environmental quality. Understanding the meteorological conditions associated with elevated ozone levels is therefore an important component of air quality management and environmental policy. Statistical modeling provides a framework for quantifying these relationships and assessing how strongly atmospheric conditions influence pollution outcomes.
This report investigates the relationship between the annual number of days in which ozone levels exceeded 0.20 ppm and a seasonal meteorological index defined as the average 850-millibar temperature. The ozone exceedance count serves as the response variable, while the meteorological index acts as the explanatory variable. The data were collected in the South Coast Air Basin of California and span the years 1976 through 1991.
The primary purpose of this analysis is to determine whether a statistically significant linear relationship exists between the meteorological index and the number of high-ozone days. In addition, the study seeks to quantify the strength and direction of the association, evaluate whether a simple linear regression model provides an adequate description of the data, and assess the usefulness of the model for explanation and prediction. All data, code, and graphical output are included to ensure full reproducibility of results.
Section 2: Data Description
The dataset consists of sixteen annual observations, with each row corresponding to a single calendar year. For each year, the number of days in which ozone concentrations exceeded 0.20 ppm was recorded, along with the value of a seasonal meteorological index based on upper-air temperature measurements.
#enter the data as a data frame, one row per year
ozone_dt <- data.frame(
  Year  = 1976:1991,
  Days  = c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36),
  Index = c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
)
ozone_dt
##    Year Days Index
## 1  1976   91  16.7
## 2  1977  105  17.1
## 3  1978  106  18.2
## 4  1979  108  18.1
## 5  1980   88  17.2
## 6  1981   91  18.2
## 7  1982   58  16.0
## 8  1983   82  17.2
## 9  1984   81  18.0
## 10 1985   65  17.2
## 11 1986   61  16.9
## 12 1987   48  17.1
## 13 1988   61  18.2
## 14 1989   43  17.3
## 15 1990   33  17.5
## 16 1991   36  16.6
The variable Days is quantitative and represents the response of interest, while Index is also quantitative and represents large-scale atmospheric conditions that may influence ozone formation. Because ozone chemistry is sensitive to temperature and atmospheric stability, this index is expected to be physically relevant to ozone exceedance frequency.
Section 3: Exploratory Data Analysis
Prior to formal modeling, it is important to visually assess the relationship between the two quantitative variables. A scatterplot of ozone exceedance days versus the meteorological index provides an initial indication of the form, direction, and strength of the association.
#create a scatterplot
plot(ozone_dt$Index, ozone_dt$Days,
     xlab = "Seasonal Meteorological Index (850 mb Temperature)",
     ylab = "Number of Days Ozone > 0.20 ppm",
     main = "Scatterplot of Ozone Exceedance Days vs. Meteorological Index",
     pch = 10)
The scatterplot suggests a slight positive association between the meteorological index and the number of ozone exceedance days. In general, years with higher index values tend to show more exceedance days, although the points exhibit substantial variability around any apparent trend. The relationship does not appear strongly linear, and the spread of observations indicates that additional factors likely influence ozone exceedance frequency. Nevertheless, a simple linear regression model is considered a reasonable initial approach for quantifying the observed tendency.
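The visual impression can be backed by a single number. The following is a minimal sketch, assuming the Pearson correlation coefficient is an acceptable summary of linear association for these data; the data frame is re-entered from Section 2 only so that the block runs on its own, and the name r_xy is introduced here for illustration.

```r
# Data re-entered from Section 2 so this block is self-contained
ozone_dt <- data.frame(
  Year  = 1976:1991,
  Days  = c(91, 105, 106, 108, 88, 91, 58, 82, 81, 65, 61, 48, 61, 43, 33, 36),
  Index = c(16.7, 17.1, 18.2, 18.1, 17.2, 18.2, 16.0, 17.2,
            18.0, 17.2, 16.9, 17.1, 18.2, 17.3, 17.5, 16.6)
)

# Pearson correlation between the meteorological index and exceedance days
r_xy <- cor(ozone_dt$Index, ozone_dt$Days)
r_xy
```

A value near 0.4 is consistent with the weak-to-moderate positive pattern in the scatterplot. In a simple linear regression, the square of this correlation equals the coefficient of determination.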
Section 4: Simple Linear Regression Model
The simple linear regression model is
\(\text{Days} = \beta_0 + \beta_1 (\text{Index}) + \varepsilon\)
where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\varepsilon\) is a random error term with mean 0 and constant variance.
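For reference, the least-squares estimates of the two coefficients have closed forms. The summary quantities \(\bar{x}\), \(\bar{y}\), \(S_{xx}\), and \(S_{xy}\) are notation introduced here rather than part of the original analysis, with \(x\) denoting Index and \(y\) denoting Days; the numerical values were computed from the sixteen observations in Section 2 and are rounded.

\[
\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{16}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{16}(x_i - \bar{x})^2} \approx \frac{97.58}{6.379} \approx 15.30
\]

\[
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \approx 72.3125 - 15.296 \times 17.344 \approx -192.98
\]

These are the quantities that lm() computes when the model is fitted.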
#fit a simple linear regression model with Days as the response and Index as the predictor
fit <- lm(Days ~ Index, data = ozone_dt)
summary(fit)
##
## Call:
## lm(formula = Days ~ Index, data = ozone_dt)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -41.70 -21.54   2.12  18.56  36.42
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -192.984    163.503  -1.180    0.258
## Index         15.296      9.421   1.624    0.127
##
## Residual standard error: 23.79 on 14 degrees of freedom
## Multiple R-squared: 0.1585, Adjusted R-squared: 0.09835
## F-statistic: 2.636 on 1 and 14 DF, p-value: 0.1267
The estimated slope of approximately 15.3 is positive: according to the fitted model, a one-unit increase in the seasonal 850-mb temperature index is associated with an average increase of about 15 ozone exceedance days per year. From an environmental perspective, a change of that size would be substantial. The intercept of about -193 has no physical interpretation, because an index value of 0 lies far outside the observed range of 16.0 to 18.2. The p-value for the slope is 0.127, which exceeds the conventional significance level of 0.05, so the estimated effect should be interpreted cautiously and viewed as suggestive rather than conclusive evidence of a linear relationship.
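The uncertainty in the slope can also be expressed as a confidence interval. The sketch below rebuilds a 95% interval from the estimate and standard error printed in the regression output; the variable names are introduced here for illustration, and the inputs are the rounded values from the table, so the result is approximate.

```r
# Values copied from the coefficient table in the summary(fit) output
slope_est <- 15.296   # estimated slope for Index
slope_se  <- 9.421    # standard error of the slope
resid_df  <- 14       # residual degrees of freedom (n - 2)

# 95% confidence interval: estimate +/- t quantile * standard error
t_crit   <- qt(0.975, df = resid_df)
slope_ci <- slope_est + c(-1, 1) * t_crit * slope_se
round(slope_ci, 2)
```

The interval runs from roughly -4.9 to 35.5 days per unit of index. Because it contains zero, it tells the same story as the p-value of 0.127, while also showing how wide the range of plausible slopes is. With the fitted object available, confint(fit, "Index") should return the same interval up to rounding.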
Section 5: Statistical Inference for the Slope
We formally test whether the meteorological index is linearly associated with ozone exceedance days.
H₀: \(\beta_1\) = 0 (no linear relationship)
H₁: \(\beta_1\) ≠ 0 (linear relationship exists)
The slope row of the regression output corresponds to this hypothesis test: the test statistic is t = 1.624 on 14 degrees of freedom, with a two-sided p-value of 0.127. Because the p-value is greater than 0.05, we fail to reject the null hypothesis. This means there is not enough statistical evidence to conclude that a linear relationship exists between the meteorological index and the number of high-ozone days.
Although the estimated slope is positive and indicates an upward trend, the relationship is not statistically significant in this dataset. It is important to note that failing to reject the null hypothesis does not prove that no relationship exists between the meteorological index and ozone exceedance days. Rather, it indicates that this dataset does not provide strong enough statistical evidence to confirm a linear association. The lack of significance may be influenced by the relatively small sample size (16 years) or by the omission of other important variables that also affect ozone formation.
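The test statistic and p-value can be reproduced by hand from the coefficient table in Section 4. This is a minimal sketch using the rounded values printed there; the variable names are hypothetical and not part of the original code.

```r
# Values copied from the coefficient table in Section 4
slope_est <- 15.296   # estimated slope for Index
slope_se  <- 9.421    # standard error of the slope
resid_df  <- 14       # n - 2 = 16 - 2

# Test statistic for H0: beta_1 = 0
t_stat <- slope_est / slope_se

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = resid_df, lower.tail = FALSE)

round(c(t = t_stat, p = p_value), 3)
```

Because the model has a single predictor, the square of this t statistic matches the F-statistic in the regression output (2.636) up to rounding, and the two tests share the same p-value.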
Section 6: Coefficient of Determination
#extract the R-squared value from the model summary
summary(fit)$r.squared
## [1] 0.1584636
The coefficient of determination, \(R^2\), measures the proportion of variability in ozone exceedance days explained by the meteorological index. Here \(R^2 \approx 0.158\), so only about 16% of the year-to-year variation in exceedance days is explained by this single predictor. The remaining 84% or so is likely due to other environmental and atmospheric factors not included in the model.
The relatively low \(R^2\) value is not unexpected in an environmental setting. Ground-level ozone formation depends on a complex combination of factors, including precursor emissions, sunlight intensity, atmospheric circulation patterns, and local weather conditions. Therefore, it is reasonable that a single large-scale temperature index explains only a portion of the year-to-year variability in exceedance days.
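The definition of \(R^2\) as a proportion of explained variation can be checked directly. The sketch below recomputes it from the total and residual sums of squares; the data and model are re-entered only so the block runs on its own, and sst, sse, and r2_manual are names introduced here.

```r
# Data and model re-entered from Sections 2 and 4 so this block is self-contained
ozone_dt <- data.frame(
  Year  = 1976:1991,
  Days  = c(91, 105, 106, 108, 88, 91, 58, 82, 81, 65, 61, 48, 61, 43, 33, 36),
  Index = c(16.7, 17.1, 18.2, 18.1, 17.2, 18.2, 16.0, 17.2,
            18.0, 17.2, 16.9, 17.1, 18.2, 17.3, 17.5, 16.6)
)
fit <- lm(Days ~ Index, data = ozone_dt)

# Total variation in Days around its mean
sst <- sum((ozone_dt$Days - mean(ozone_dt$Days))^2)

# Variation left unexplained by the fitted line
sse <- sum(resid(fit)^2)

# R^2 is the share of total variation removed by the regression
r2_manual <- 1 - sse / sst
c(manual = r2_manual, from_summary = summary(fit)$r.squared)
```

The two values should agree, which makes explicit that a low \(R^2\) means the residual sum of squares is nearly as large as the total sum of squares.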
Section 7: Model Assumptions
Linear regression relies on several assumptions, including linearity of the mean relationship, independence of errors, constant variance, and approximate normality of residuals. These assumptions are evaluated using diagnostic plots.
#create the 2x2 layout to display 4 diagnostic plots
par(mfrow = c(2,2))
#produce standard regression diagnostic plots for the fitted model
plot(fit)
#reset plotting layout back to a single plot
par(mfrow = c(1,1))
The residuals versus fitted plot is used to assess the assumptions of linearity and constant variance. The points appear to be randomly scattered around zero with no clear systematic pattern, suggesting that the linearity assumption is reasonable and that the variance of the errors is approximately constant.
The normal Q–Q plot is used to evaluate whether the residuals follow an approximately normal distribution. Most of the points lie close to the reference line, with only minor deviations in the tails, indicating that the normality assumption is reasonably satisfied.
The scale–location plot provides another check for constant variance. The spread of the standardized residuals appears fairly consistent across the range of fitted values, offering no strong evidence of heteroscedasticity.
The residuals versus leverage plot helps identify potentially influential observations. While a few points show moderate leverage, none appear to exceed Cook’s distance thresholds, suggesting that no single observation unduly influences the fitted model.
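The leverage and Cook's distance values behind the fourth diagnostic plot can also be listed numerically. This is a sketch only: the 2p/n cutoff for high leverage is a common rule of thumb assumed here, not a threshold used in the original analysis, and influence_tbl and lev_cut are hypothetical names. The data and model are re-entered so the block runs on its own.

```r
# Data and model re-entered from Sections 2 and 4 so this block is self-contained
ozone_dt <- data.frame(
  Year  = 1976:1991,
  Days  = c(91, 105, 106, 108, 88, 91, 58, 82, 81, 65, 61, 48, 61, 43, 33, 36),
  Index = c(16.7, 17.1, 18.2, 18.1, 17.2, 18.2, 16.0, 17.2,
            18.0, 17.2, 16.9, 17.1, 18.2, 17.3, 17.5, 16.6)
)
fit <- lm(Days ~ Index, data = ozone_dt)

# Leverage (hat values) and Cook's distance for each year
influence_tbl <- data.frame(
  Year     = ozone_dt$Year,
  Index    = ozone_dt$Index,
  leverage = hatvalues(fit),
  cooks_d  = cooks.distance(fit)
)

# Assumed rule of thumb: flag leverage above 2p/n, with p = 2 coefficients
lev_cut <- 2 * length(coef(fit)) / nrow(ozone_dt)
lev_cut

# Years sorted from most to least influential by Cook's distance
influence_tbl[order(-influence_tbl$cooks_d), ]
```

Years whose index lies far from the mean, at either end of the observed range, have the highest leverage; whether they are influential depends on how large their residuals are, which is what Cook's distance combines.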
One additional consideration is that the data consist of annual observations collected sequentially from 1976 to 1991 in the South Coast Air Basin, California. Because the observations are ordered in time, there is a possibility that ozone conditions in one year may be correlated with those in nearby years. This potential temporal dependence is not formally assessed in the current simple linear regression model. Future analyses could incorporate time series methods or additional diagnostics to evaluate whether autocorrelation is present in the residuals.
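A first look at temporal dependence needs only base R. The sketch below computes the Durbin-Watson statistic by hand and the lag-1 correlation of the residuals, then plots the residuals against year. It is an informal check under the assumption that the rows are in time order (they are sorted by Year); it is not a formal test, and dw_stat and lag1_cor are names introduced here.

```r
# Data and model re-entered from Sections 2 and 4 so this block is self-contained
ozone_dt <- data.frame(
  Year  = 1976:1991,
  Days  = c(91, 105, 106, 108, 88, 91, 58, 82, 81, 65, 61, 48, 61, 43, 33, 36),
  Index = c(16.7, 17.1, 18.2, 18.1, 17.2, 18.2, 16.0, 17.2,
            18.0, 17.2, 16.9, 17.1, 18.2, 17.3, 17.5, 16.6)
)
fit <- lm(Days ~ Index, data = ozone_dt)

# Residuals are in time order because the rows are sorted by Year
e <- resid(fit)

# Durbin-Watson statistic: values near 2 suggest no lag-1 autocorrelation,
# values well below 2 suggest positive autocorrelation
dw_stat <- sum(diff(e)^2) / sum(e^2)

# Sample correlation between each residual and the one from the previous year
lag1_cor <- cor(e[-1], e[-length(e)])

c(durbin_watson = dw_stat, lag1_cor = lag1_cor)

# Residuals against year: long runs of same-signed residuals signal dependence
plot(ozone_dt$Year, e, type = "b", xlab = "Year", ylab = "Residual")
abline(h = 0, lty = 2)
```

If the statistic falls well below 2, or the residual plot shows a sustained drift over the years, the standard errors and p-values from the simple linear regression would be less trustworthy than they appear.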
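Overall, the diagnostic plots do not indicate any serious violations of the linearity, constant variance, or normality assumptions. Although minor departures from ideal conditions are present, they are not substantial enough to invalidate the use of the linear regression model for this analysis. The independence assumption is the exception: as noted above, it is not assessed by these plots and remains unverified.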
Section 8: Confidence and Prediction Intervals
#build a grid of 100 index values spanning the observed range
new_index <- data.frame(Index = seq(min(ozone_dt$Index), max(ozone_dt$Index),
                                    length.out = 100))
#compute 95% confidence and prediction intervals along the grid
conf_int <- predict(fit, newdata = new_index, interval = "confidence", level = 0.95)
pred_int <- predict(fit, newdata = new_index, interval = "prediction", level = 0.95)
#plot the original data points; widen the y-axis so the prediction bands are not clipped
plot(ozone_dt$Index, ozone_dt$Days,
     xlab = "Seasonal Meteorological Index",
     ylab = "Ozone Exceedance Days",
     main = "Fitted Regression with 95% Confidence and Prediction Intervals",
     ylim = range(pred_int),
     pch = 10)
#create the fitted regression line
lines(new_index$Index, conf_int[,"fit"], lwd = 2)
#create the lower bound of the 95% CI
lines(new_index$Index, conf_int[,"lwr"], lty = 2)
#create the upper bound of the 95% CI
lines(new_index$Index, conf_int[,"upr"], lty = 2)
#create the lower bound of the 95% PI
lines(new_index$Index, pred_int[,"lwr"], lty = 3)
#create the upper bound of the 95% PI
lines(new_index$Index, pred_int[,"upr"], lty = 3)
The confidence interval describes the uncertainty in the estimated mean number of exceedance days for all years with a given index value. In contrast, the prediction interval is used to estimate the likely range for a single future year at that same index value. Because individual years can vary substantially due to other unmeasured influences, the prediction interval is wider than the confidence interval.
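The difference between the two intervals is visible in their standard formulas. The notation below is introduced here for illustration: \(x_0\) is the index value of interest, \(\hat{y}_0\) the fitted value at \(x_0\), \(s\) the residual standard error (23.79 in the regression output), \(n = 16\), \(\bar{x}\) the mean index, and \(S_{xx} = \sum_i (x_i - \bar{x})^2\).

\[
\text{95\% confidence interval:}\quad \hat{y}_0 \pm t_{0.975,\,n-2}\; s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}
\]

\[
\text{95\% prediction interval:}\quad \hat{y}_0 \pm t_{0.975,\,n-2}\; s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}
\]

The only difference is the extra 1 under the square root, which accounts for the variability of an individual year around the mean. As a result, the half-width of the prediction interval can never fall below \(t_{0.975,\,14} \times s \approx 2.145 \times 23.79 \approx 51\) days, whatever the index value. Both intervals are narrowest at \(x_0 = \bar{x}\) and widen toward the ends of the observed range.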
Section 9: Discussion
The results suggest a possible positive association between the meteorological index and ozone exceedance frequency, with higher index values tending to correspond to more high-ozone days. However, because the relationship is not statistically significant and the model explains only a limited portion of the variability, the index alone does not provide a strong predictive tool. Meteorological conditions represent only one part of the complex set of processes that influence ozone formation. Factors such as emissions, atmospheric chemistry, and local weather patterns likely also contribute substantially to year-to-year variation in exceedance days.
This analysis is based on a relatively small sample of sixteen years and includes only a single explanatory variable, which limits the ability of the model to capture the full range of influences on ozone levels. In addition, possible time trends, interactions among meteorological variables, and other environmental factors were not examined. Future work could improve upon this analysis by incorporating additional predictors, exploring multiple regression models, or applying time series methods to better represent the complexity of ozone dynamics.
Section 10: Conclusion
This analysis indicates a positive but not statistically significant relationship between the seasonal meteorological index and the number of ozone exceedance days. While the fitted regression line suggests that warmer atmospheric conditions may be associated with more frequent exceedances, the statistical evidence is not strong enough to confirm a linear relationship in the population. The model provides some insight into the data but leaves a substantial amount of variability unexplained. These results highlight the importance of considering multiple meteorological and emissions-related factors when evaluating long-term ozone behavior.