For this week’s discussion, we are required to conduct a hypothesis test on the “airquality” dataset to evaluate whether Ozone levels are related to Month.

Null hypothesis: Mean Ozone levels are equal across all months

Alternative hypothesis: Mean Ozone levels are not equal across all months (at least one month differs)

We will first take an exploratory look at the dataset:

head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
str(airquality)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

The dataset contains 4 numerical variables - Ozone, Solar.R (solar radiation), Wind, Temp - and 2 variables - Month and Day - that are stored as integers but act as categorical variables. There are 153 rows of observations spanning Months 5 through 9 (May through September).

Looking at the structure of the dataset, there are “NA” values in the observations for Ozone and Solar.R.

colSums(is.na(airquality))
##   Ozone Solar.R    Wind    Temp   Month     Day 
##      37       7       0       0       0       0

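There are 37 missing values for Ozone and 7 for Solar.R. Left untreated, these would make functions such as mean() return NA, and modelling functions such as aov() would drop the affected rows by default.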
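
Before imputing, it can help to see how the missing Ozone readings are spread across the months. The following is a minimal sketch in base R; the names `na_by_month` and `na_share` are introduced here purely for illustration.

```r
# Count missing Ozone readings within each month (object names are illustrative)
na_by_month <- tapply(is.na(airquality$Ozone), airquality$Month, sum)
na_by_month

# Proportion of each month's Ozone readings that are missing
na_share <- tapply(is.na(airquality$Ozone), airquality$Month, mean)
round(na_share, 2)
```

If one month carries most of the missing values, its imputed mean will rest on relatively few real observations, which is worth keeping in mind when reading the results.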

Since we are only concerned, for now, with testing a hypothesis about Ozone levels as a function of Month, we will clean up the data frame by splitting it by month and replacing each missing value with that month’s mean:

# Create a subset for each month and drop the variable Day
month5 <- subset(airquality, Month == 5, select = -Day)
month6 <- subset(airquality, Month == 6, select = -Day)
month7 <- subset(airquality, Month == 7, select = -Day)
month8 <- subset(airquality, Month == 8, select = -Day)
month9 <- subset(airquality, Month == 9, select = -Day)
# Impute the monthly mean of Ozone; na.rm=TRUE ignores the missing values when computing the mean
month5$Ozone[is.na(month5$Ozone)] <- mean(month5$Ozone, na.rm = TRUE)
month6$Ozone[is.na(month6$Ozone)] <- mean(month6$Ozone, na.rm = TRUE)
month7$Ozone[is.na(month7$Ozone)] <- mean(month7$Ozone, na.rm = TRUE)
month8$Ozone[is.na(month8$Ozone)] <- mean(month8$Ozone, na.rm = TRUE)
month9$Ozone[is.na(month9$Ozone)] <- mean(month9$Ozone, na.rm = TRUE)
# Impute the monthly mean of Solar.R in the same way
month5$Solar.R[is.na(month5$Solar.R)] <- mean(month5$Solar.R, na.rm = TRUE)
month6$Solar.R[is.na(month6$Solar.R)] <- mean(month6$Solar.R, na.rm = TRUE)
month7$Solar.R[is.na(month7$Solar.R)] <- mean(month7$Solar.R, na.rm = TRUE)
month8$Solar.R[is.na(month8$Solar.R)] <- mean(month8$Solar.R, na.rm = TRUE)
month9$Solar.R[is.na(month9$Solar.R)] <- mean(month9$Solar.R, na.rm = TRUE)
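
The same per-month mean imputation can be written more compactly with `ave()`, which applies a function within groups and returns the values in the original row order. This is only a sketch of an alternative, not the approach used in this post; `air_alt` and `impute_mean` are hypothetical names introduced for the example.

```r
# Work on a copy without Day so the built-in dataset is left untouched
air_alt <- subset(airquality, select = -Day)

# Hypothetical helper: replace NAs in a vector with the mean of its non-missing values
impute_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# ave() runs impute_mean separately within each Month
air_alt$Ozone   <- ave(air_alt$Ozone,   air_alt$Month, FUN = impute_mean)
air_alt$Solar.R <- ave(air_alt$Solar.R, air_alt$Month, FUN = impute_mean)

# Check that no missing values remain
colSums(is.na(air_alt))
```

Because the rows never leave the data frame, this version needs no `rbind()` step afterwards.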

With each month’s subset now “cleaned up”, we will recombine them to create our new dataset “air”:

air<- rbind(month5,month6,month7,month8,month9)
air

Let us proceed with our analysis using ANOVA for the dataset air.

myanovadata<- aov(air$Ozone~air$Month)
summary(myanovadata)
##              Df Sum Sq Mean Sq F value Pr(>F)   
## air$Month     1   6790    6790   8.115  0.005 **
## Residuals   151 126343     837                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

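With a p-value of 0.005, well below the conventional 0.05 significance level, we reject the null hypothesis in favor of the alternative, i.e., Ozone levels are related to Month. Note that Month is still stored as an integer at this point, so aov() treats it as a single numeric predictor (hence Df = 1), and the test reflects a linear trend across month numbers rather than a comparison of the five monthly means.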
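
To compare the five months as separate groups, Month can be converted to a factor before fitting, which gives the Month term 4 degrees of freedom (5 groups minus 1). The sketch below assumes the same per-month mean imputation as above and rebuilds the data so that it runs on its own; `air_f` and `fit_factor` are illustrative names, and no output is shown because these results are not part of the original analysis.

```r
# Rebuild the imputed Ozone column so this block is self-contained
air_f <- airquality
air_f$Ozone <- ave(air_f$Ozone, air_f$Month,
                   FUN = function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))

# Treat Month as a grouping variable rather than a number
air_f$Month <- factor(air_f$Month, levels = 5:9, labels = month.abb[5:9])

# One-way ANOVA of Ozone across the five months
fit_factor <- aov(Ozone ~ Month, data = air_f)
summary(fit_factor)
```

Mean imputation reduces the spread within each month, so the resulting F statistic should be read with some caution.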

We can further show this graphically:

air$Month<- factor(air$Month,levels=5:9,labels=month.abb[5:9],ordered=TRUE)
boxplot(air$Ozone~air$Month,main="NY Air Quality: Ozone Levels by Month")

We can follow up with further testing, for example, using Tukey’s HSD (honestly significant difference) test to compare each pair of monthly means.
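
As a rough sketch of what that follow-up could look like, `TukeyHSD()` from base R can be applied to an `aov` fit in which Month is a factor. The block below rebuilds the imputed data under the same assumptions as earlier so that it runs on its own; `air_t`, `fit_t`, and `tukey_res` are hypothetical names, and the results are not shown because they were not part of the original analysis.

```r
# Rebuild the imputed data with Month as an (unordered) factor
air_t <- airquality
air_t$Ozone <- ave(air_t$Ozone, air_t$Month,
                   FUN = function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))
air_t$Month <- factor(air_t$Month, levels = 5:9, labels = month.abb[5:9])

# Fit the one-way ANOVA, then compare every pair of months
fit_t <- aov(Ozone ~ Month, data = air_t)
tukey_res <- TukeyHSD(fit_t, conf.level = 0.95)

# One row per pair: difference in means, confidence bounds, adjusted p-value
tukey_res
```

With five months there are ten pairwise comparisons; pairs whose confidence interval excludes zero are the ones that differ at the chosen level.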