I started by looking at the summary statistics for the data:
str(airquality)
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
summary(airquality)
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
library(psych)
describe(airquality)
## vars n mean sd median trimmed mad min max range skew
## Ozone 1 116 42.13 32.99 31.5 37.80 25.95 1.0 168.0 167 1.21
## Solar.R 2 146 185.93 90.06 205.0 190.34 98.59 7.0 334.0 327 -0.42
## Wind 3 153 9.96 3.52 9.7 9.87 3.41 1.7 20.7 19 0.34
## Temp 4 153 77.88 9.47 79.0 78.28 8.90 56.0 97.0 41 -0.37
## Month 5 153 6.99 1.42 7.0 6.99 1.48 5.0 9.0 4 0.00
## Day 6 153 15.80 8.86 16.0 15.80 11.86 1.0 31.0 30 0.00
## kurtosis se
## Ozone 1.11 3.06
## Solar.R -1.00 7.45
## Wind 0.03 0.28
## Temp -0.46 0.77
## Month -1.32 0.11
## Day -1.22 0.72
Next, I split the data by month. The summary shows that Month ranges from 5 to 9, which I label May through September:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
May <- filter(airquality, Month == 5)
June <- filter(airquality, Month == 6)
July <- filter(airquality, Month == 7)
August <- filter(airquality, Month == 8)
September <- filter(airquality, Month == 9)
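Before testing, it can help to see how ozone is distributed within each month. The following is a minimal sketch in base R that tabulates the number of non-missing observations, the mean, and the standard deviation of Ozone per month. The names `ozone`, `month`, and `monthly` are my own illustrative choices and are not part of the analysis above.

```r
# Per-month summary of Ozone, dropping missing values.
# `ozone`, `month`, and `monthly` are illustrative names, not from the original analysis.
ozone <- airquality$Ozone
month <- airquality$Month

monthly <- data.frame(
  month = month.name[sort(unique(month))],                      # 5..9 -> "May".."September"
  n     = as.vector(tapply(!is.na(ozone), month, sum)),         # non-missing observations
  mean  = as.vector(tapply(ozone, month, mean, na.rm = TRUE)),  # mean ozone per month
  sd    = as.vector(tapply(ozone, month, sd, na.rm = TRUE))     # spread per month
)
print(monthly)
```

Comparing the `sd` column across rows gives a rough sense of whether the spread is similar from month to month, and the `n` column shows how unevenly the missing values are distributed.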
Because the variability in ozone appears to differ considerably by month, I decided to use a different approach. To test whether ozone levels are a function of month, I ran two-sample t-tests comparing the ozone levels in May with those in each of the other months. By default, t.test() runs the Welch version of the test, which does not assume equal variances in the two groups. My hypotheses for each test are the following:
H0: The mean ozone level is the same in May and the other month.
HA: The mean ozone level differs between May and the other month.
t.test(May$Ozone,June$Ozone)
##
## Welch Two Sample t-test
##
## data: May$Ozone and June$Ozone
## t = -0.7801, df = 16.938, p-value = 0.4461
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -21.598419 9.940299
## sample estimates:
## mean of x mean of y
## 23.61538 29.44444
The p-value (0.4461) is greater than the significance level of 0.05, so I fail to reject the null hypothesis: there is not enough evidence to say that the mean ozone level differs between May and June.
t.test(May$Ozone,July$Ozone)
##
## Welch Two Sample t-test
##
## data: May$Ozone and July$Ozone
## t = -4.682, df = 44.843, p-value = 2.647e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -50.77291 -20.22709
## sample estimates:
## mean of x mean of y
## 23.61538 59.11538
The p-value (2.647e-05) is less than the significance level of 0.05, so I reject the null hypothesis and conclude that the mean ozone level differs between May and July.
t.test(May$Ozone,August$Ozone)
##
## Welch Two Sample t-test
##
## data: May$Ozone and August$Ozone
## t = -4.0749, df = 39.279, p-value = 0.0002169
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -54.38358 -18.30873
## sample estimates:
## mean of x mean of y
## 23.61538 59.96154
The p-value (0.0002169) is less than the significance level of 0.05, so I reject the null hypothesis and conclude that the mean ozone level differs between May and August.
t.test(May$Ozone,September$Ozone)
##
## Welch Two Sample t-test
##
## data: May$Ozone and September$Ozone
## t = -1.2527, df = 52.957, p-value = 0.2158
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -20.374201 4.708419
## sample estimates:
## mean of x mean of y
## 23.61538 31.44828
The p-value (0.2158) is greater than the significance level of 0.05, so I fail to reject the null hypothesis: there is not enough evidence to say that the mean ozone level differs between May and September.
Overall, I would say ozone levels are a function of month, because the mean ozone level in May differed significantly from the means in July and August, although not from those in June or September.
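Since this conclusion rests on four separate tests, one possible follow-up is to adjust the p-values for multiple comparisons. The sketch below is an added step, not something done in the analysis above: it reruns the four Welch tests in a loop and applies a Bonferroni correction with `p.adjust()`. The names `may`, `other_months`, `p_raw`, and `p_adj` are illustrative.

```r
# Rerun the four May-vs-other-month Welch t-tests and collect the p-values.
# This is an illustrative extension; the names below are not from the original analysis.
may <- airquality$Ozone[airquality$Month == 5]
other_months <- 6:9

p_raw <- sapply(other_months, function(m) {
  # t.test() drops missing values and uses the Welch test by default.
  t.test(may, airquality$Ozone[airquality$Month == m])$p.value
})
names(p_raw) <- month.name[other_months]

# Bonferroni: multiply each p-value by the number of tests, capped at 1.
p_adj <- p.adjust(p_raw, method = "bonferroni")
print(cbind(raw = p_raw, bonferroni = p_adj))
```

Bonferroni multiplies each p-value by four here, so the July and August p-values reported above (2.647e-05 and 0.0002169) would remain well below 0.05, and the June and September comparisons would remain non-significant.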