The R datasets library contains data on air quality in New York (airquality). Conduct a hypothesis test to evaluate if ozone levels are a function of month.

NOTE: dichotomize month. If that test were significant, what else would be required? Post your hypothesis test and R code with your discussion.

mydata <- airquality
str(mydata)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
# Month numbers (5 = May through 9 = September) and the ozone readings
Months <- mydata$Month
df <- mydata$Ozone

Are ozone levels a function of month? As a first look, check the correlation between ozone and temperature, since temperature changes with the month. Note that this test uses Temp, not Month.

cor.test(mydata$Ozone, mydata$Temp)
## 
##  Pearson's product-moment correlation
## 
## data:  mydata$Ozone and mydata$Temp
## t = 10.418, df = 114, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5913340 0.7812111
## sample estimates:
##       cor 
## 0.6983603

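Ozone and temperature are strongly positively correlated (r = 0.70, p-value < 2.2e-16). Because the test uses Temp rather than Month, it suggests a seasonal pattern but does not by itself show that ozone depends on month.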
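To address the stated question more directly, the same test can be run on Month itself. This is only a sketch: `month_cor` is a name introduced here for illustration, and a Pearson correlation treats the month number as a linear predictor, so it may understate a pattern that rises in midsummer and falls again.

```r
# Correlate ozone with the month number directly.
# cor.test() drops incomplete pairs, so the NA ozone readings are excluded.
month_cor <- cor.test(airquality$Ozone, airquality$Month)

month_cor$estimate   # sample correlation between Ozone and Month
month_cor$p.value    # p-value for H0: true correlation is 0
```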

Data are available for May through September. Check the mean ozone level for each month.

Ozone_May <- mean(airquality$Ozone[airquality$Month == 5], na.rm = TRUE)
Ozone_May
## [1] 23.61538
Ozone_June <- mean(airquality$Ozone[airquality$Month == 6], na.rm = TRUE)
Ozone_June
## [1] 29.44444
Ozone_July <- mean(airquality$Ozone[airquality$Month == 7], na.rm = TRUE)
Ozone_July
## [1] 59.11538
Ozone_August <- mean(airquality$Ozone[airquality$Month == 8], na.rm = TRUE)
Ozone_August
## [1] 59.96154
Ozone_September <- mean(airquality$Ozone[airquality$Month == 9], na.rm = TRUE)
Ozone_September
## [1] 31.44828
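The five monthly means above can also be computed in one call. A minimal sketch using base R's `tapply()`; `monthly_means` is a name introduced here for illustration.

```r
# Mean ozone per month in a single call; NA readings are ignored within each month.
monthly_means <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)

# Named vector indexed by month number ("5" through "9")
round(monthly_means, 5)
```

The result should match the individual values reported above, with July and August standing out as the highest months.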

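Develop a null hypothesis based on the data. The one-sample t-test below tests Ho: mu = 35 against Ha: mu != 35, where mu is the mean ozone level. (Comparing two groups of months would instead use Ho: Mu1 - Mu2 = 0.)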

mean(mydata$Ozone, na.rm = TRUE)
## [1] 42.12931
t.test(df, mu=35)
## 
##  One Sample t-test
## 
## data:  df
## t = 2.3277, df = 115, p-value = 0.02168
## alternative hypothesis: true mean is not equal to 35
## 95 percent confidence interval:
##  36.06240 48.19622
## sample estimates:
## mean of x 
##  42.12931

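Because the hypothesized mean of 35 lies outside the 95 percent confidence interval (36.06 to 48.20) and the p-value (0.02168) is below the 0.05 significance level, we reject the null hypothesis in favor of the alternative. This test concerns only the overall mean ozone level; it does not compare months.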
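The NOTE in the assignment asks for month to be dichotomized, which corresponds to the two-sample hypothesis Ho: Mu1 - Mu2 = 0. One possible sketch follows. The split into July-August versus the remaining months is an assumption made here, suggested by the monthly means, and `aq`, `Period`, and `split_test` are illustrative names that do not appear in the original analysis.

```r
# Keep only rows with an ozone reading
aq <- airquality[!is.na(airquality$Ozone), ]

# Assumed dichotomy: "peak" = July and August, "other" = May, June, September
aq$Period <- factor(ifelse(aq$Month %in% c(7, 8), "peak", "other"),
                    levels = c("other", "peak"))

# Welch two-sample t-test of Ho: Mu_other - Mu_peak = 0
split_test <- t.test(Ozone ~ Period, data = aq)
split_test
```

If this test is significant, it shows only that the two groups of months differ on average. A reasonable follow-up would be to check the t-test assumptions and, to see which individual months differ, to compare all five months (for example with `aov()` followed by `TukeyHSD()`).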

Ideally, we would have data for the entire year rather than only May through September. With this limited range, the results apply to those five months only and may not hold for the rest of the year.