1 Introduction

To analyze if the seasonal meteorological index is a valid predictor for the number of high ozone days in South Coast Air Basin.

Understanding the relationship between Days and an explanatory Index is important for identifying trends and making predictions. In this report, we analyze this relationship using simple linear regression, where Days is modeled as a linear function of Index.

The analysis includes:

2 Davidson (1993) Dataset.

  • Response (y): Number of days ozone exceeded 0.20ppm

  • Predictor (x): Seasonal meteorological index.

a <- c(1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991)
y <- c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36)
x <- c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)

There are 16 observations, each consisting of an Index value and the corresponding number of Days.

3 Scatterplot of Days vs Index

The scatterplot shows that Days tend to decrease as Index increases. This visual pattern supports the use of a simple linear regression model.

plot(x,y,main="Scatterplot of Days vs Index", xlab ="Index", ylab = "Days", pch=19)

4 Simple Linear Regression Model

\[ y=\beta _{0} + \beta _{1}+\epsilon \] Here we are regressing days on index. This will show the summary of our function to get all the factors such as coefficients, residuals, slope, etc.

modela<- lm(y~x)
summary(modela)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41.70 -21.54   2.12  18.56  36.42 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -192.984    163.503  -1.180    0.258
## x             15.296      9.421   1.624    0.127
## 
## Residual standard error: 23.79 on 14 degrees of freedom
## Multiple R-squared:  0.1585, Adjusted R-squared:  0.09835 
## F-statistic: 2.636 on 1 and 14 DF,  p-value: 0.1267

4.1 Interpretation of Regression Coefficients

From the model output:

  • Intercept: Estimates values of Days.

  • Slope: Average change in Days for a one unit increase in index.

A positive slope indicates that lower index values are associated with more Days.

summary(lm(y~x))$coefficients
##               Estimate Std. Error   t value  Pr(>|t|)
## (Intercept) -192.98383 163.503283 -1.180306 0.2575450
## x             15.29637   9.420975  1.623650 0.1267446

4.2 Coefficient of Determination \(R^{2}\)

R² represents the proportion of variability in Days explained by Index.

summary(lm(y~x))$r.square
## [1] 0.1584636

These means 15% of variability in days is explained by index.

4.3 Hypothesis Test for the Slope

\[ H_{0}:\beta _{1}=0 \mid H_{a} :\beta _{1} \neq 0 \]

The p-value for the slope is obtained from the regression summary.

summary(lm(y~x))$coefficients[2,4]
## [1] 0.1267446

Interpretation:

  • If a p-value indicates strong evidence that Index is a significant predictor of Days.

  • This suggests a statistically meaningful linear relationship.

5 Model Adequacy

To validate the regression model, we check the assumptions on the error terms.

5.1 Normal Error

For Normal Error we need to plot the Normal Probability Plot.

#if the residuals lie approximately in a straight line, the normality assumption is reasonable.
residuals(lm(y~x))
##          1          2          3          4          5          6          7 
##  28.534535  36.415989  20.589987  24.119624  17.886353   5.589987   6.241991 
##          8          9         10         11         12         13         14 
##  11.886353  -1.350740  -5.113647  -4.524738 -20.584011 -24.410013 -28.643284 
##         15         16 
## -41.702557 -24.935828
residuals<-residuals(lm(y~x))
qqnorm(residuals,main="Normal Error")
qqline(residuals, col= "red", lwd=2)

As the residuals lie approximately in a straight line, the normality assumption is reasonable.

5.2 Constant Variance

Here we have to plot the graph between Residuals and Predicted values

where the Predicted values is determined by:

\[ y\hat{_{i}}=b_{0}+b_{1}x_{i} \]

residuals(lm(y~x))
##          1          2          3          4          5          6          7 
##  28.534535  36.415989  20.589987  24.119624  17.886353   5.589987   6.241991 
##          8          9         10         11         12         13         14 
##  11.886353  -1.350740  -5.113647  -4.524738 -20.584011 -24.410013 -28.643284 
##         15         16 
## -41.702557 -24.935828
residuals<-residuals(lm(y~x))
predict(lm(y~x))
##        1        2        3        4        5        6        7        8 
## 62.46546 68.58401 85.41001 83.88038 70.11365 85.41001 51.75801 70.11365 
##        9       10       11       12       13       14       15       16 
## 82.35074 70.11365 65.52474 68.58401 85.41001 71.64328 74.70256 60.93583
predicted<-predict(lm(y~x))
plot(residuals,predicted, xlab = "Predicted",ylab ="Residuals", main = "Constant Variance",col="blue", lwd=2)

A random scatter around zero indicates constant variance of errors.

Residuals vs Fitted: Ideally, we want to see a random cloud of points around the horizontal dashed line. If we see a “U” shape or a funnel shape, the model may be inappropriate.

5.3 Independence of Errors

The Independent errors graph plot between Residuals and Order of Collections.

Order<-1:length(residuals)
plot(residuals,Order, main="Independence of Error", col="green", lwd=2)

Here we can see that the points are showing some pattern which suggest that the errors are not independent.

6 Fitted Regression Line

Explanation for the Meteorologist:

  • Confidence intervals describe uncertainty in the mean Days at a given Index.

  • Prediction intervals describe uncertainty for a new future observation and are wider.

min(x)
## [1] 16
max(x)
## [1] 18.2
newx<-seq(16,18.2,.95)
conf <-predict(modela, data.frame(x=newx),interval="confidence")
pred <-predict(modela,data.frame(x=newx),interval = "prediction")
conf
##        fit      lwr      upr
## 1 51.75801 21.75791 81.75811
## 2 66.28956 51.25340 81.32571
## 3 80.82110 63.81774 97.82447
pred
##        fit      lwr      upr
## 1 51.75801 -7.44155 110.9576
## 2 66.28956 13.08553 119.4936
## 3 80.82110 27.02801 134.6142
plot(x,y)
abline(modela)
lines(newx,conf[,1],col="blue", lwd=2)
lines(newx,pred[,1],col="blue", lwd=2)
lines(newx,conf[,2], col="green", lwd=2)
lines(newx,conf[,3],col="green", lwd=2)
lines(newx,pred[,2],col="red", lwd=2)
lines(newx,pred[,3],col="red", lwd=2)

7 R script

a <- c(1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991)
y <- c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36)
x <- c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
plot(x,y,main="Scatterplot of Days vs Index", xlab ="Index", ylab = "Days", pch=19)

modela<- lm(y~x)
summary(modela)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41.70 -21.54   2.12  18.56  36.42 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -192.984    163.503  -1.180    0.258
## x             15.296      9.421   1.624    0.127
## 
## Residual standard error: 23.79 on 14 degrees of freedom
## Multiple R-squared:  0.1585, Adjusted R-squared:  0.09835 
## F-statistic: 2.636 on 1 and 14 DF,  p-value: 0.1267
summary(lm(y~x))$coefficients
##               Estimate Std. Error   t value  Pr(>|t|)
## (Intercept) -192.98383 163.503283 -1.180306 0.2575450
## x             15.29637   9.420975  1.623650 0.1267446
summary(lm(y~x))$r.square
## [1] 0.1584636
summary(lm(y~x))$coefficients[2,4]
## [1] 0.1267446
#if the residuals lie approximately in a straight line, the normality assumption is reasonable.
residuals(lm(y~x))
##          1          2          3          4          5          6          7 
##  28.534535  36.415989  20.589987  24.119624  17.886353   5.589987   6.241991 
##          8          9         10         11         12         13         14 
##  11.886353  -1.350740  -5.113647  -4.524738 -20.584011 -24.410013 -28.643284 
##         15         16 
## -41.702557 -24.935828
residuals<-residuals(lm(y~x))
qqnorm(residuals,main="Normal Error")
qqline(residuals, col= "red", lwd=2)

predict(lm(y~x))
##        1        2        3        4        5        6        7        8 
## 62.46546 68.58401 85.41001 83.88038 70.11365 85.41001 51.75801 70.11365 
##        9       10       11       12       13       14       15       16 
## 82.35074 70.11365 65.52474 68.58401 85.41001 71.64328 74.70256 60.93583
predicted<-predict(lm(y~x))
plot(residuals,predicted, xlab = "Predicted",ylab ="Residuals", main = "Constant Variance",col="blue", lwd=2)

Order<-1:length(residuals)
plot(residuals,Order, main="Independence of Error", col="green", lwd=2)

min(x)
## [1] 16
max(x)
## [1] 18.2
newx<-seq(16,18.2,.95)
conf <-predict(modela, data.frame(x=newx),interval="confidence")
pred <-predict(modela,data.frame(x=newx),interval = "prediction")
conf
##        fit      lwr      upr
## 1 51.75801 21.75791 81.75811
## 2 66.28956 51.25340 81.32571
## 3 80.82110 63.81774 97.82447
pred
##        fit      lwr      upr
## 1 51.75801 -7.44155 110.9576
## 2 66.28956 13.08553 119.4936
## 3 80.82110 27.02801 134.6142
plot(x,y)
abline(modela)
lines(newx,conf[,1],col="blue", lwd=2)
lines(newx,pred[,1],col="blue", lwd=2)
lines(newx,conf[,2], col="green", lwd=2)
lines(newx,conf[,3],col="green", lwd=2)
lines(newx,pred[,2],col="red", lwd=2)
lines(newx,pred[,3],col="red", lwd=2)