1 Index (X) vs Days (Y) Analysis

#data sets are first put into vector format
Y <- c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36)
X <- c(16.7,17.1,18.2,18.1,17.2,18.2,16,17.2,18,17.2,16.9,17.1,18.2,17.3,17.5,16.6)

1.1 Data Visualization

The first step of our data analysis is to plot the Index (X) data vs the Days (Y) data. This visual will allow us to see if there are any obvious correlations between the data sets.

df <- data.frame(X,Y) #consolidate data
plot(df, main="Index vs Days") #plot data for visualization

Based on the above graph, there appears to be a slight linear correlation between our two data sets.

1.2 Least Squares Regression Plot

mod1 <- lm(Y ~ X) #put least squares linear regression model into its own object
plot(df, main="Least Squares Linear Regression") #plot is not stored between chunks
abline(mod1, col="blue") #add linear regression model into plot

Based on the visualization of our linear model alone, there is in fact a slight correlation between the number of days that the ozone levels exceeded 0.20 ppm, and the seasonal meteorological index. We can see this in the slope of the line that appears in the linear model shown on the graph above.

newx <- seq(min(X),max(X),length.out=50)
#finding 95% confidence and prediction intervals
conf <- predict(mod1, data.frame(X=newx), interval="confidence", level = 0.95)
pred <- predict(mod1, data.frame(X=newx), interval="prediction", level = 0.95)
#plot is not stored between chunks
plot(df, main="Confidence and Prediction Intervals") 
abline(mod1, col="blue")
#adding upper and lower limits of the confidence and prediction intervals
lines(newx,conf[,2], col="green") 
lines(newx,conf[,3], col="green")
lines(newx,pred[,2], col="red")
lines(newx,pred[,3], col="red")

From the above plot, we can see that our 95% confidence interval lies between the two green lines, and our 95% prediction interval lies between the two red lines. We can be 95% confident that our average values for the number of days above .2 ppm given a certain meteorological index will lie somewhere between the two green lines, and we can be 95% sure that a data point taken in the future will lie somewhere between the two red lines.

info <- summary(mod1)
info
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41.70 -21.54   2.12  18.56  36.42 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -192.984    163.503  -1.180    0.258
## X             15.296      9.421   1.624    0.127
## 
## Residual standard error: 23.79 on 14 degrees of freedom
## Multiple R-squared:  0.1585, Adjusted R-squared:  0.09835 
## F-statistic: 2.636 on 1 and 14 DF,  p-value: 0.1267

I would like to draw your attention to a few of the values generated from the summary report of our linear model for this data, namely our \(p\)-value of 0.1267, and our \(R^2\) value of 0.1585. Our \(p\)-value is greater than 0.05, so we can say that our response variable (days above .2 ppm) is likely not influenced by our predictor variable (seasonal meteorological index). Our \(R^2\) value of 0.1585 means that the correlation is not very strong, as it is quite close to 0. We can see from our slope of 15.296 that we have a positive correlation, with each 1 increase of the seasonal meteorological index increasing the prediction of the number of days above .2 ppm by 15.296.

1.3 Model Adequacy Plots

#calculate predicted outputs using linear model coefficients
Pred_Y <- (info$coefficients[1,1] + info$coefficients[2,1]*X)
#find residuals
e <- Y - Pred_Y 
#vector with index values for each given point
val_index <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)
p_val <- val_index / 16
z_val <- 1/p_val

In the two code chunks above, we are finding our error residuals for each value of the index based on the difference between our given number of days and the number of days predicted by the model. We then find the percentiles in the data (p_val), and the Z-quantiles (z_val). These are used to determine if our model is valid or not.

1.3.1 Normal Probability Plot (test for normal errors)

plot(z_val,e, main="Normal Probability")

As shown in the plot above, there appears to be a sign that our data has a light tail, which would mean non-normal error.

1.3.2 Residual by Predicted Plot (test for constant variance)

plot(Pred_Y,e, main="Residual by Predicted")

Coming to our constant variance test, we can see that there may be a pattern in the graph, with the data appearing to expand in the e range as the predicted number of days gets larger, and then the range of e decreases again starting around 70 for our predicted number of days. This pattern would mean we don’t have a constant variance.

1.3.3 Residual by Time Order of Collection (test for independence)

plot(val_index,e, main="Residual by Time")

Finally, testing for the independence, we can see a clearly negative linear slope in our values. This pattern means that there likely not a very strong correlation between our two data sets.

1.4 Conclusion

Given both our low \(R^2\) value and the patterns seen in our tests for the adequacy of our least squares linear regression model, there is likely not a strong correlation between the seasonal meteorological index and the number of days that the ozone levels exceed 0.20 ppm.

2 Complete Code

#data sets are first put into vector format
Y <- c(91,   105,  106,  108,  88,   91,   58, 82,   81, 65,   61)
X <- c(16.7, 17.1, 18.2, 18.1, 17.2, 18.2, 16, 17.2, 18, 17.2, 16.9)

df <- data.frame(X,Y) #consolidate data
plot(df, main="Index vs Days") #plot data for visualization

mod1 <- lm(Y ~ X) #put least squares linear regression model into its own object
abline(mod1, col="blue") #add linear regression model into plot

newx <- seq(min(X),max(X),length.out=50)
#finding 95% confidence and prediction intervals
conf <- predict(mod1, data.frame(X=newx), interval="confidence", level = 0.95)
pred <- predict(mod1, data.frame(X=newx), interval="prediction", level = 0.95)
#adding upper and lower limits of the confidence and prediction intervals
lines(newx,conf[,2], col="green") 
lines(newx,conf[,3], col="green")
lines(newx,pred[,2], col="red")
lines(newx,pred[,3], col="red")

info <- summary(mod1)
info

#calculate predicted outputs using linear model coefficients
Pred_Y <- (info$coefficients[1,1] + info$coefficients[2,1]*X)
#find residuals
e <- Y - Pred_Y 

#vector with index values for each given point
val_index <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)
p_val <- val_index / 16
z_val <- 1/p_val

plot(z_val,e, main="Normal Probability")

plot(Pred_Y,e, main="Residual by Predicted")

plot(val_index,e, main="Residual by Time")