I started by performing a t-test to compare the mean ozone levels of two groups of months, in order to determine whether there is a significant difference between them. I set ‘Group 1: May and June’ and ‘Group 2: July, August, September’, with the following hypotheses:

Null Hypothesis (H0): \(\mu_1 = \mu_2\)

Alternative Hypothesis (H1): \(\mu_1 \neq \mu_2\)

# Load the air quality data set
data("airquality")

# Dichotomize the month
airquality$Month_Dichotomized <- ifelse(airquality$Month %in% c(5, 6), "First_Half", "Second_Half")

# Separate the data into two groups
group1 <- airquality$Ozone[airquality$Month_Dichotomized == "First_Half"]
group2 <- airquality$Ozone[airquality$Month_Dichotomized == "Second_Half"]

# Remove NA values for t-test
group1 <- na.omit(group1)
group2 <- na.omit(group2) 

# Perform two-sample t-test
t_test_result <- t.test(group1, group2)

# Print the t-test results
print(t_test_result)
## 
##  Welch Two Sample t-test
## 
## data:  group1 and group2
## t = -4.645, df = 100.63, p-value = 1.031e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -34.77405 -13.96034
## sample estimates:
## mean of x mean of y 
##  25.11429  49.48148
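
To make the output above easier to read, here is a minimal sketch of how the Welch statistic, its degrees of freedom, and the two-sided p-value can be reproduced by hand. The names `g1`, `g2`, `se1`, `se2`, `t_manual`, `df_manual`, and `p_manual` are illustrative and are not part of the original analysis; the grouping is assumed to match the one used above (May and June versus July through September).

```r
# Rebuild the two groups exactly as in the t-test above (NAs dropped)
data("airquality")
g1 <- na.omit(airquality$Ozone[airquality$Month %in% c(5, 6)])
g2 <- na.omit(airquality$Ozone[airquality$Month %in% c(7, 8, 9)])

# Squared standard error of each group mean: s^2 / n
se1 <- var(g1) / length(g1)
se2 <- var(g2) / length(g2)

# Welch t statistic: difference in means over the combined standard error
t_manual <- (mean(g1) - mean(g2)) / sqrt(se1 + se2)

# Welch-Satterthwaite approximation for the degrees of freedom
df_manual <- (se1 + se2)^2 /
  (se1^2 / (length(g1) - 1) + se2^2 / (length(g2) - 1))

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

print(c(t = t_manual, df = df_manual, p = p_manual))
```

These values should agree with the `t`, `df`, and `p-value` that `t.test()` reports, since `t.test()` uses the Welch (unequal variance) version by default.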

Given the small p-value (1.031e-05), we reject the null hypothesis and conclude that the mean ozone levels of the two groups differ. To break this down further, I ran an ANOVA test. Before running it, I wanted to check that the independence, approximate normality, and constant variance assumptions held.

Since these observations were made on different days in different months, there is no reason to believe the samples are not independent, so that assumption holds.

I then plotted histograms to assess approximate normality and boxplots to see whether the variances are similar.

# Load the necessary libraries
library(datasets)
library(ggplot2) 

# Plot histograms for each month
ggplot(airquality, aes(x = Ozone)) + 
  geom_histogram(binwidth = 10, fill = "skyblue", color = "black", alpha = 0.7) + 
  facet_wrap(~ Month, scales = "free_x") + 
  labs(title = "Histogram of Ozone Levels by Month", 
       x = "Ozone", 
       y = "Frequency") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).

# Create side-by-side boxplots for ozone levels by month
ggplot(airquality, aes(x = factor(Month), y = Ozone)) + 
  geom_boxplot(fill = "skyblue", color = "black", alpha = 0.7) + 
  labs(title = "Boxplots of Ozone Levels by Month", 
       x = "Month", 
       y = "Ozone Levels") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
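
As a numeric complement to the plots, the sketch below computes per-month sample sizes, means, and standard deviations, and runs two standard formal checks: Shapiro-Wilk for normality within each month and Bartlett's test for equal variances across months. These tests were not part of the original analysis; the object names (`by_month`, `month_summary`, `shapiro_p`, `bartlett_result`) are illustrative.

```r
data("airquality")

# Ozone readings split by month (NAs are still present here)
by_month <- split(airquality$Ozone, airquality$Month)

# Sample size, mean, and standard deviation per month, ignoring NAs
month_summary <- sapply(by_month, function(x) {
  c(n    = sum(!is.na(x)),
    mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE))
})
print(round(month_summary, 2))

# Shapiro-Wilk normality test within each month
# (shapiro.test() ignores missing values)
shapiro_p <- sapply(by_month, function(x) shapiro.test(x)$p.value)
print(shapiro_p)

# Bartlett's test of equal variances across months (NAs removed first)
ozone_ok <- !is.na(airquality$Ozone)
bartlett_result <- bartlett.test(airquality$Ozone[ozone_ok],
                                 factor(airquality$Month[ozone_ok]))
print(bartlett_result)
```

Small p-values from these tests would point the same way as a skewed histogram or unequal box heights. Note that Bartlett's test is itself sensitive to non-normality, so it is best read alongside the plots rather than instead of them.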

Looking at the histograms, the data do not appear to follow a normal distribution. The boxplots tell a similar story for the variance: a few of the months have similar spread, but July and August have much higher variability. One thing to note is that when I ran the initial t-test I did not replace any of the NA values and simply removed them. I’m not sure if this was the correct method. If so, I believe the next step would be to run a regression analysis to determine whether any other explanatory variables in the data set significantly affect ozone levels.
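
The ANOVA call and the proposed regression are not shown in this write-up, so the following is only a sketch of what they might look like. It assumes a one-way ANOVA of ozone on month (treated as a factor) and a linear model using the other numeric columns of `airquality` (`Solar.R`, `Wind`, `Temp`) as explanatory variables. The names `aq`, `anova_fit`, and `lm_fit` are hypothetical. Both `aov()` and `lm()` drop rows with missing values by default, which matches the removal approach used for the t-test.

```r
data("airquality")

# Copy the data and treat month as a categorical grouping variable
aq <- airquality
aq$Month <- factor(aq$Month)

# One-way ANOVA: does mean ozone differ across the five months?
anova_fit <- aov(Ozone ~ Month, data = aq)
print(summary(anova_fit))

# Multiple linear regression on the remaining numeric variables;
# rows with any NA in the model variables are dropped automatically
lm_fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = aq)
print(summary(lm_fit))

# Number of observations actually used by the regression
print(nobs(lm_fit))
```

Because the plots suggest that the normality and constant variance assumptions are doubtful, the ANOVA p-value from a fit like this should be interpreted with caution.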