To analyze if the seasonal meteorological index is a valid predictor for the number of high ozone days in South Coast Air Basin.
Understanding the relationship between Days and an explanatory Index is important for identifying trends and making predictions. In this report, we analyze this relationship using simple linear regression, where Days is modeled as a linear function of Index.
The analysis includes:
Response (y): Number of days ozone exceeded 0.20ppm
Predictor (x): Seasonal meteorological index.
a <- c(1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991)
y <- c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36)
x <- c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
There are 16 observations, each consisting of an Index value and the corresponding number of Days.
The scatterplot shows that Days tend to decrease as Index increases. This visual pattern supports the use of a simple linear regression model.
plot(x,y,main="Scatterplot of Days vs Index", xlab ="Index", ylab = "Days", pch=19)
\[ y=\beta _{0} + \beta _{1}+\epsilon \] Here we are regressing days on index. This will show the summary of our function to get all the factors such as coefficients, residuals, slope, etc.
modela<- lm(y~x)
summary(modela)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.70 -21.54 2.12 18.56 36.42
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -192.984 163.503 -1.180 0.258
## x 15.296 9.421 1.624 0.127
##
## Residual standard error: 23.79 on 14 degrees of freedom
## Multiple R-squared: 0.1585, Adjusted R-squared: 0.09835
## F-statistic: 2.636 on 1 and 14 DF, p-value: 0.1267
From the model output:
Intercept: Estimates values of Days.
Slope: Average change in Days for a one unit increase in index.
A positive slope indicates that lower index values are associated with more Days.
summary(lm(y~x))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -192.98383 163.503283 -1.180306 0.2575450
## x 15.29637 9.420975 1.623650 0.1267446
R² represents the proportion of variability in Days explained by Index.
summary(lm(y~x))$r.square
## [1] 0.1584636
These means 15% of variability in days is explained by index.
\[ H_{0}:\beta _{1}=0 \mid H_{a} :\beta _{1} \neq 0 \]
The p-value for the slope is obtained from the regression summary.
summary(lm(y~x))$coefficients[2,4]
## [1] 0.1267446
Interpretation:
If a p-value indicates strong evidence that Index is a significant predictor of Days.
This suggests a statistically meaningful linear relationship.
To validate the regression model, we check the assumptions on the error terms.
For Normal Error we need to plot the Normal Probability Plot.
#if the residuals lie approximately in a straight line, the normality assumption is reasonable.
residuals(lm(y~x))
## 1 2 3 4 5 6 7
## 28.534535 36.415989 20.589987 24.119624 17.886353 5.589987 6.241991
## 8 9 10 11 12 13 14
## 11.886353 -1.350740 -5.113647 -4.524738 -20.584011 -24.410013 -28.643284
## 15 16
## -41.702557 -24.935828
residuals<-residuals(lm(y~x))
qqnorm(residuals,main="Normal Error")
qqline(residuals, col= "red", lwd=2)
As the residuals lie approximately in a straight line, the normality assumption is reasonable.
Here we have to plot the graph between Residuals and Predicted values
where the Predicted values is determined by:
\[ y\hat{_{i}}=b_{0}+b_{1}x_{i} \]
residuals(lm(y~x))
## 1 2 3 4 5 6 7
## 28.534535 36.415989 20.589987 24.119624 17.886353 5.589987 6.241991
## 8 9 10 11 12 13 14
## 11.886353 -1.350740 -5.113647 -4.524738 -20.584011 -24.410013 -28.643284
## 15 16
## -41.702557 -24.935828
residuals<-residuals(lm(y~x))
predict(lm(y~x))
## 1 2 3 4 5 6 7 8
## 62.46546 68.58401 85.41001 83.88038 70.11365 85.41001 51.75801 70.11365
## 9 10 11 12 13 14 15 16
## 82.35074 70.11365 65.52474 68.58401 85.41001 71.64328 74.70256 60.93583
predicted<-predict(lm(y~x))
plot(residuals,predicted, xlab = "Predicted",ylab ="Residuals", main = "Constant Variance",col="blue", lwd=2)
A random scatter around zero indicates constant variance of errors.
Residuals vs Fitted: Ideally, we want to see a random cloud of points around the horizontal dashed line. If we see a “U” shape or a funnel shape, the model may be inappropriate.
The Independent errors graph plot between Residuals and Order of Collections.
Order<-1:length(residuals)
plot(residuals,Order, main="Independence of Error", col="green", lwd=2)
Here we can see that the points are showing some pattern which suggest that the errors are not independent.
Explanation for the Meteorologist:
Confidence intervals describe uncertainty in the mean Days at a given Index.
Prediction intervals describe uncertainty for a new future observation and are wider.
min(x)
## [1] 16
max(x)
## [1] 18.2
newx<-seq(16,18.2,.95)
conf <-predict(modela, data.frame(x=newx),interval="confidence")
pred <-predict(modela,data.frame(x=newx),interval = "prediction")
conf
## fit lwr upr
## 1 51.75801 21.75791 81.75811
## 2 66.28956 51.25340 81.32571
## 3 80.82110 63.81774 97.82447
pred
## fit lwr upr
## 1 51.75801 -7.44155 110.9576
## 2 66.28956 13.08553 119.4936
## 3 80.82110 27.02801 134.6142
plot(x,y)
abline(modela)
lines(newx,conf[,1],col="blue", lwd=2)
lines(newx,pred[,1],col="blue", lwd=2)
lines(newx,conf[,2], col="green", lwd=2)
lines(newx,conf[,3],col="green", lwd=2)
lines(newx,pred[,2],col="red", lwd=2)
lines(newx,pred[,3],col="red", lwd=2)
a <- c(1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991)
y <- c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36)
x <- c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
plot(x,y,main="Scatterplot of Days vs Index", xlab ="Index", ylab = "Days", pch=19)
modela<- lm(y~x)
summary(modela)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.70 -21.54 2.12 18.56 36.42
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -192.984 163.503 -1.180 0.258
## x 15.296 9.421 1.624 0.127
##
## Residual standard error: 23.79 on 14 degrees of freedom
## Multiple R-squared: 0.1585, Adjusted R-squared: 0.09835
## F-statistic: 2.636 on 1 and 14 DF, p-value: 0.1267
summary(lm(y~x))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -192.98383 163.503283 -1.180306 0.2575450
## x 15.29637 9.420975 1.623650 0.1267446
summary(lm(y~x))$r.square
## [1] 0.1584636
summary(lm(y~x))$coefficients[2,4]
## [1] 0.1267446
#if the residuals lie approximately in a straight line, the normality assumption is reasonable.
residuals(lm(y~x))
## 1 2 3 4 5 6 7
## 28.534535 36.415989 20.589987 24.119624 17.886353 5.589987 6.241991
## 8 9 10 11 12 13 14
## 11.886353 -1.350740 -5.113647 -4.524738 -20.584011 -24.410013 -28.643284
## 15 16
## -41.702557 -24.935828
residuals<-residuals(lm(y~x))
qqnorm(residuals,main="Normal Error")
qqline(residuals, col= "red", lwd=2)
predict(lm(y~x))
## 1 2 3 4 5 6 7 8
## 62.46546 68.58401 85.41001 83.88038 70.11365 85.41001 51.75801 70.11365
## 9 10 11 12 13 14 15 16
## 82.35074 70.11365 65.52474 68.58401 85.41001 71.64328 74.70256 60.93583
predicted<-predict(lm(y~x))
plot(residuals,predicted, xlab = "Predicted",ylab ="Residuals", main = "Constant Variance",col="blue", lwd=2)
Order<-1:length(residuals)
plot(residuals,Order, main="Independence of Error", col="green", lwd=2)
min(x)
## [1] 16
max(x)
## [1] 18.2
newx<-seq(16,18.2,.95)
conf <-predict(modela, data.frame(x=newx),interval="confidence")
pred <-predict(modela,data.frame(x=newx),interval = "prediction")
conf
## fit lwr upr
## 1 51.75801 21.75791 81.75811
## 2 66.28956 51.25340 81.32571
## 3 80.82110 63.81774 97.82447
pred
## fit lwr upr
## 1 51.75801 -7.44155 110.9576
## 2 66.28956 13.08553 119.4936
## 3 80.82110 27.02801 134.6142
plot(x,y)
abline(modela)
lines(newx,conf[,1],col="blue", lwd=2)
lines(newx,pred[,1],col="blue", lwd=2)
lines(newx,conf[,2], col="green", lwd=2)
lines(newx,conf[,3],col="green", lwd=2)
lines(newx,pred[,2],col="red", lwd=2)
lines(newx,pred[,3],col="red", lwd=2)