1 Problem Statement

Davidson (“Update on Ozone Trends in California’s South Coast Air Basin,” Air and Waste, 43, 226, 1993) studied the ozone levels in the South Coast Air Basin of California for the years 1976-1991. He believes that the number of days the ozone levels exceeded 0.20 ppm (the response) depends on the seasonal meteorological index, which is the seasonal average 850-millibar temperature (the regressor). The following table gives the data.
Ozone Trends in California South Coast Basin 1976-1991

year  days  seasonal_index
1976    91            16.7
1977   105            17.1
1978   106            18.2
1979   108            18.1
1980    88            17.2
1981    91            18.2
1982    58            16.0
1983    82            17.2
1984    81            18.0
1985    65            17.2
1986    61            16.9
1987    48            17.1
1988    61            18.2
1989    43            17.3
1990    33            17.5
1991    36            16.6

The following is a scatter plot of the raw data.

2 Model Adequacy Analysis

In order to show model adequacy, we need to verify the following assumptions:

  • Normality of the errors
  • Constant variance
  • Independence of the errors

We fit the model \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\) and estimate the coefficients. The calculated equation is \(\hat{y} = -192.984 + 15.296x\). Using this fitted model, we will use graphical techniques to analyze the adequacy of our model.
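As a quick check of these values, the fit can be reproduced directly in R. This is a minimal sketch using the data from the table above; the complete code appears in Section 5.

#least-squares fit of days (y) on seasonal index (x)
x <- c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
y <- c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36)
fit <- lm(y ~ x)
coef(fit)   #(Intercept) -192.984, slope 15.296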

2.1 Normal Probability

For the first condition, normality of the errors, we use the following equation to calculate the residual for each observation: \(e_i = y_i - (b_0 + b_1 x_i) = y_i - \hat{y}_i\), where \(y_i\) is the observed value and \(\hat{y}_i = b_0 + b_1 x_i\) is the fitted value. We are assuming the errors are \(\varepsilon \sim N(0,{\sigma ^2})\).

As evidenced by the NPP plot, the residuals fall approximately along a straight line, so we can conclude that the first assumption, that the errors are normally distributed, is satisfied.
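A minimal sketch of how this plot is produced, assuming the fitted object fit from the previous section:

#normal probability (Q-Q) plot of the residuals e_i = y_i - y_hat_i
res <- residuals(fit)
qqnorm(res)
qqline(res, col = "red")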

2.2 Constant Variance

For our second assumption, we look for constant variance, which is evidenced by a random scatter in a plot of the residuals on the y-axis against the fitted values, \({{\hat y}_i} = b_0 + b_1 x_i\), on the x-axis.

From the above plot, the residuals display a random arrangement with no systematic pattern, so we can conclude that the variance is constant. Our second requirement of constant variance is therefore satisfied.
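A minimal sketch of this plot, again assuming the fitted object fit:

#residuals vs fitted values; a patternless horizontal band supports constant variance
plot(fitted(fit), residuals(fit), xlab = "Fitted values", ylab = "Residuals", main = "Constant Variance")
abline(h = 0, lty = 2)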

2.3 Independence

Our third and final requirement for model adequacy is independence. This is evidenced by a random scatter of the residuals when plotted against the time order in which the samples were collected; we should not see any periodic behaviour in the data.
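A minimal sketch of the residuals-versus-order plot, assuming the observations are listed in time order (1976 through 1991):

#residuals vs observation (time) order; no trend or periodic pattern supports independence
plot(seq_along(residuals(fit)), residuals(fit), xlab = "Observation order", ylab = "Residuals", main = "Independence")
abline(h = 0, lty = 2)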

The model does satisfy all three requirements for model adequacy.

3 Statistical Analysis

The following is the statistical analysis for our data:

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41.70 -21.54   2.12  18.56  36.42 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -192.984    163.503  -1.180    0.258
## x             15.296      9.421   1.624    0.127
## 
## Residual standard error: 23.79 on 14 degrees of freedom
## Multiple R-squared:  0.1585, Adjusted R-squared:  0.09835 
## F-statistic: 2.636 on 1 and 14 DF,  p-value: 0.1267
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##      -193.0         15.3
## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value Pr(>F)
## x          1 1492.6  1492.6  2.6362 0.1267
## Residuals 14 7926.8   566.2
## (Intercept)           x 
##   -1.180306    1.623650
##               Estimate Std. Error   t value  Pr(>|t|)
## (Intercept) -192.98383 163.503283 -1.180306 0.2575450
## x             15.29637   9.420975  1.623650 0.1267446

The 95% confidence interval for the mean response at a grid of seasonal-index values is given by the following:
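For reference, the confidence interval on the mean response and the prediction interval on a future observation at a point \(x_0\) take the standard forms

\[ \hat{y}_0 \pm t_{\alpha/2,\,n-2}\sqrt{MS_{Res}\left(\frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right)} \quad\text{and}\quad \hat{y}_0 \pm t_{\alpha/2,\,n-2}\sqrt{MS_{Res}\left(1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right)}, \]

where the prediction interval is wider because it also accounts for the variance of a new observation; this is visible in the second table below.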

##         fit      lwr       upr
## 1  51.75801 21.75791  81.75811
## 2  53.86126 26.35031  81.37221
## 3  55.96451 30.88192  81.04710
## 4  58.06776 35.33322  80.80230
## 5  60.17101 39.67663  80.66539
## 6  62.27426 43.87267  80.67585
## 7  64.37751 47.86522  80.88980
## 8  66.48076 51.57671  81.38481
## 9  68.58401 54.90761  82.26042
## 10 70.68726 57.74912  83.62540
## 11 72.79051 60.01612  85.56490
## 12 74.89376 61.68722  88.10031
## 13 76.99701 62.81679  91.17723
## 14 79.10026 63.50594  94.69458
## 15 81.20351 63.86209  98.54494
## 16 83.30676 63.97530 102.63822
## 17 85.41001 63.91295 106.90708
The corresponding 95% prediction interval for a new observation:

##         fit       lwr      upr
## 1  51.75801 -7.441550 110.9576
## 2  53.86126 -4.116617 111.8391
## 3  55.96451 -0.901284 112.8303
## 4  58.06776  2.197902 113.9376
## 5  60.17101  5.174631 115.1674
## 6  62.27426  8.022988 116.5255
## 7  64.37751 10.737623 118.0174
## 8  66.48076 13.313923 119.6476
## 9  68.58401 15.748171 121.4199
## 10 70.68726 18.037689 123.3368
## 11 72.79051 20.180940 125.4001
## 12 74.89376 22.177590 127.6099
## 13 76.99701 24.028524 129.9655
## 14 79.10026 25.735810 132.4647
## 15 81.20351 27.302613 135.1044
## 16 83.30676 28.733076 137.8804
## 17 85.41001 30.032167 140.7879

4 Conclusion

Our hypotheses for the significance of the regression are:

\[ \begin{array}{l} {H_0}:{\beta_1} = 0\\ {H_a}:{\beta_1} \ne 0 \end{array} \]
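Using the estimates from the summary output, the test statistic is

\[ t_0 = \frac{b_1}{se(b_1)} = \frac{15.296}{9.421} = 1.624, \]

which is smaller in magnitude than the critical value \(t_{0.025,14} = 2.145\), so we fail to reject \(H_0\) at \(\alpha = 0.05\).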

Since the \({R^2}\) value is 0.1585, the regressor explains only about 16% of the variability in the number of days. Also, with a p-value of 0.1267 at a significance level of \(\alpha = 0.05\), we fail to reject \(H_0\): the evidence is not strong enough to conclude a linear relationship between days and seasonal index.

5 Complete R Code

It is good practice to include the complete R code at the end of every RMarkdown document. The full code for this analysis follows.

#Derrell Dunn
#IE5344 Week 3 RMarkdown

#Input data y
y<-c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36)

#input data for x
x<-c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
#copy data to variable with the appropriate names
year<-1976:1991
days<-y
seasonal_index<-x

#let's print the data in a table for the user to view in the document
library(knitr)
library(kableExtra)
df <- data.frame( year,days,seasonal_index)

#display the table
knitr::kable(df, caption = "Ozone Trends in California South Coast Basin 1976-1991") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    position = "center"
  )

#For this section for ease of mathematical calculation, I am using x and y
#as variables instead of the names of years, days and seasonal index

#get the length of the input vectors
number_data<-length(y)

#sum the y vector
y_sum<-sum(y)

#get the mean for both the y and the x vector
y_mean<-mean(y)
x_mean<-mean(x)

#sum the x vector
x_sum<-sum(x)

#calculate the summation of (y * x)
y_mult_x<-sum(y*x)

#calculate the sum of the squared x values
x_sqrd<-sum(x^2)

#square the sum of the x values
x_sumsq<-x_sum^2

#We finally calculate the B1 coefficient using page 14 - eqn 2.7 from text and lecture
b1_coef<-((y_mult_x) - ((y_sum * x_sum)/number_data))/(x_sqrd-((x_sumsq)/number_data))

#calculate B0 using page 14 - eqn 2.6 in the text
b0_coef<-(y_mean)-(b1_coef*x_mean)

myfunction<-function(xvar) {
  b0_coef + (b1_coef * xvar)
}

#not needed for this analysis
#expected_y<-curve(myfunction(x), from = 16, to= 19)
#curve(myfunction(x), add =TRUE, col="orange")

#fitted values and residuals from the hand-calculated coefficients
expected_yi<-myfunction(x)
error_residuals_values<-y - expected_yi
data_num<-seq_along(y)

#residuals vs fitted values: check the constant variance assumption
plot(expected_yi, error_residuals_values, main="Constant Variance", pch=7, col="brown", panel.last = grid(10,10))

#residuals vs observation order: check the independence assumption
plot(data_num, error_residuals_values, main="Independence", pch=16, col="gold", panel.last = grid(10,10))
#now verify the result using the R function lm; we store the result
#into an object "linear_obj" for readability and SW best practice
linear_obj<-lm(y~x)

#scatter plot of the raw data with the fitted line overlaid
plot(x, y, xlab="seasonal_index", ylab="days", main="Ozone Days vs Seasonal Index")
abline(linear_obj, col="blue")

#setup the NPP to test condition 1. Normal distrb. of errors
res <- residuals(linear_obj)

qqnorm(res, pch=7, main = "Normal Q-Q Plot", xlab="Theoretical Quantiles", ylab="Sample Quantiles")
qqline(res, col = "red", lwd = 2)


#run the summary function to get our result
summary(linear_obj)

#print the fitted model coefficients
linear_obj

#do an ANOVA and get the t statistics
anova(linear_obj)
t_stats <- summary(linear_obj)$coefficients[, "t value"]
t_stats

#get a summary
summary(linear_obj)$coefficients

#grid of x values spanning the observed range (step chosen to give 17 points)
newx<-seq(min(x),max(x),.1375)

predict(linear_obj,data.frame(x=newx),interval="confidence") #confidence interval
predict(linear_obj,data.frame(x=newx),interval="prediction") #prediction interval
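#A possible extension, not part of the original analysis: overlay the two
#interval bands on the scatter plot. This sketch assumes the objects
#linear_obj and newx defined above.
conf_band<-predict(linear_obj, data.frame(x=newx), interval="confidence")
pred_band<-predict(linear_obj, data.frame(x=newx), interval="prediction")
plot(x, y, xlab="seasonal_index", ylab="days")
abline(linear_obj, col="blue")
matlines(newx, conf_band[,c("lwr","upr")], lty=2, col="darkgreen")   #confidence band
matlines(newx, pred_band[,c("lwr","upr")], lty=3, col="red")         #prediction band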