Introduction

Ozone concentration is an important environmental indicator of air quality, with direct consequences for public health. This analysis uses R's built-in “airquality” dataset to examine how ozone concentrations vary from May to September. One challenge with this dataset is that many ozone values are missing. If ignored or handled poorly, missing values can lead to misleading conclusions. To address them, we apply multiple imputation using the Amelia package. This method fills in each missing value several times with plausible values drawn from a model of the observed data, so that the final estimates reflect the uncertainty about what was not observed. By comparing regression results from complete cases and imputed datasets, we can better understand how missing data may affect our interpretation of seasonal ozone trends.

Data Preparation

The relevant variables for this analysis are:

Ozone: The mean ozone concentration in parts per billion (ppb), measured from 13:00 to 15:00 each day.

Month: The month during which the measurement was recorded (May to September).

To prepare the data for analysis, we counted the missing values and the number of incomplete cases. Because 37 of the 153 Ozone values are missing, multiple imputation was used to reduce potential bias and retain more observations in the regression model.

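# Load the packages used in this analysis
library(dplyr)         # group_by(), summarise(), %>%
library(ggplot2)       # bar plot
library(modelsummary)  # regression table
library(Amelia)        # multiple imputation

# Import the data
data(airquality)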
dim(airquality)
## [1] 153   6
# Check the structure of the data and missing values
summary(airquality)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 
# Number of missing values
missing_values <- sum(is.na(airquality))
cat("Total missing values in the dataset:", missing_values, "\n")
## Total missing values in the dataset: 44
# Number of incomplete and complete cases (rows with and without any missing value)
incomplete_cases <- sum(!complete.cases(airquality))
complete_cases <- sum(complete.cases(airquality))
cat("Number of observations with missing data:", incomplete_cases, "\n")
## Number of observations with missing data: 42
cat("Number of complete cases:", complete_cases, "\n")
## Number of complete cases: 111
# Convert Month from numeric (5-9) to a factor labelled with month names
airquality$Month <- factor(airquality$Month, levels = 5:9, labels = month.name[5:9])
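
The missing-value counts above can be broken down by variable and by month, which shows where the gaps are concentrated. The following is a minimal sketch using base R only; it reads a fresh copy of the data through `datasets::airquality` so that it does not depend on the factor conversion, and the object names (`aq`, `na_per_variable`, `na_ozone_by_month`, `n_used`) are illustrative rather than part of the original analysis.

```r
# Fresh copy of the built-in data, independent of any edits made above
aq <- datasets::airquality

# Missing values per variable
na_per_variable <- colSums(is.na(aq))
print(na_per_variable)

# Missing Ozone readings per month (5 = May, ..., 9 = September)
na_ozone_by_month <- tapply(is.na(aq$Ozone), aq$Month, sum)
print(na_ozone_by_month)

# Rows that a regression of Ozone on Month can actually use:
# only Ozone needs to be observed, since Month is never missing
n_used <- sum(!is.na(aq$Ozone))
print(n_used)
```

Note that `n_used` counts rows with an observed Ozone value, which is larger than the number of fully complete cases because rows missing only Solar.R are still usable in a model of Ozone on Month.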

Original Analysis

# Average ozone level by month
avg_ozone_month <- airquality %>%
  group_by(Month) %>%
  summarise(avg_ozone = mean(Ozone, na.rm = TRUE))

# Print the result
avg_ozone_month
## # A tibble: 5 × 2
##   Month     avg_ozone
##   <fct>         <dbl>
## 1 May            23.6
## 2 June           29.4
## 3 July           59.1
## 4 August         60.0
## 5 September      31.4
# Bar plot for average ozone level by month
ggplot(avg_ozone_month, aes(x = Month, y = avg_ozone, fill = Month)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Average Ozone Level by Month", 
       x = "Month", 
       y = "Average Ozone Level (ppb)") +
  scale_fill_brewer(palette = "Set3")

# Linear regression of ozone and month
reg_ozone_month <- lm(Ozone ~ Month, data = airquality)
modelsummary(reg_ozone_month)
                 (1)
(Intercept)      23.615
                 (5.759)
MonthJune        5.829
                 (11.356)
MonthJuly        35.500
                 (8.144)
MonthAugust      36.346
                 (8.144)
MonthSeptember   7.833
                 (7.931)
Num.Obs.         116
R2               0.235
R2 Adj.          0.208
AIC              1120.2
BIC              1136.7
Log.Lik.         -554.092
F                8.536
RMSE             28.72

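The linear regression model, fitted to the 116 days with an observed ozone value, shows how ozone levels vary by month from May to September, with May as the reference month. The average ozone level in May is estimated to be 23.62 ppb. Compared to May, ozone levels are 5.83 ppb higher in June, 35.50 ppb higher in July, 36.35 ppb higher in August, and 7.83 ppb higher in September. The increases in July and August are statistically significant (each estimate is more than four times its standard error), suggesting that ozone levels peak during these summer months. The June and September differences are smaller than their standard errors and cannot be distinguished from zero.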
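
Because Month is the only predictor and enters as a factor, each coefficient can be read directly as a difference in monthly means. The sketch below checks this with base R on a fresh copy of the data; the object names (`aq`, `fit`, `month_means`, `from_means`) are illustrative and not part of the original analysis.

```r
# Fresh copy of the data with Month as a labelled factor (May is the reference level)
aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# lm() silently drops rows where Ozone is missing
fit <- lm(Ozone ~ Month, data = aq)

# Observed monthly means, ignoring missing Ozone values
month_means <- tapply(aq$Ozone, aq$Month, mean, na.rm = TRUE)

# Intercept = May mean; every other coefficient = that month's mean minus the May mean
from_means <- c(month_means[["May"]], month_means[-1] - month_means[["May"]])

# Side-by-side comparison of the two ways of computing the same quantities
print(round(cbind(coefficient = coef(fit), from_means = from_means), 3))

# p-values for each coefficient, to check which differences are significant
print(round(summary(fit)$coefficients[, "Pr(>|t|)"], 4))
```

The two columns should agree, which is why the intercept matches the May average in the table of monthly means above.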

Multiple Imputation Approach

# Multiple imputation using Amelia package
a.out <- amelia(x = airquality, cs = "Month", ts = NULL)
## -- Imputation 1 --
## 
##   1  2  3  4  5  6  7
## 
## -- Imputation 2 --
## 
##   1  2  3  4  5
## 
## -- Imputation 3 --
## 
##   1  2  3  4  5
## 
## -- Imputation 4 --
## 
##   1  2  3  4  5  6
## 
## -- Imputation 5 --
## 
##   1  2  3  4  5  6
# Linear regression on imputed dataset
z.out <- with(a.out, lm(Ozone ~ Month))
# Display results
dzout <- mi.combine(z.out)
tinytable::tt(dzout)
term            estimate  std.error  statistic  p.value       df         r           miss.info
(Intercept)     19.97475  5.331679   3.746428   1.823361e-04  3412.643   0.03544978  0.03480161
MonthJune       21.02556  7.595592   2.768126   5.665248e-03  3804.362   0.03351234  0.03293394
MonthJuly       37.50738  7.503248   4.998820   5.918152e-07  6547.053   0.02534409  0.02501544
MonthAugust     39.72965  7.798744   5.094365   5.281366e-07  423.162    0.10769536  0.10146148
MonthSeptember  11.40227  7.545552   1.511125   1.307869e-01  10465.781  0.01993970  0.01973719

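The pooled results from the five imputed datasets show how ozone levels vary by month from May to September. The average ozone level in May is estimated to be 19.97 ppb. Compared to May, ozone levels are 21.03 ppb higher in June, 37.51 ppb higher in July, 39.73 ppb higher in August, and 11.40 ppb higher in September. The June, July, and August differences are statistically significant at the 5% level, while the September difference is not (p = 0.13). Because imputation involves random draws and no seed is set, these values change slightly each time the code is run. Overall, ozone concentrations are highest in July and August and lowest in May and September.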
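
The pooled table combines the five separate regressions into a single estimate and standard error per coefficient. A minimal sketch of the standard combining rules (Rubin's rules) is given below in base R. The function `pool_rubin()` and its arguments are hypothetical names introduced here for illustration, and whether `mi.combine()` performs exactly these steps internally is an assumption, although the `r` and `df` columns in the table appear consistent with these formulas.

```r
# Pool one coefficient across m imputed datasets using Rubin's rules.
# `estimates` and `std_errors` hold that coefficient's estimate and standard
# error from each imputed-data regression (hypothetical inputs).
pool_rubin <- function(estimates, std_errors) {
  m <- length(estimates)
  q_bar <- mean(estimates)              # pooled point estimate
  u_bar <- mean(std_errors^2)           # average within-imputation variance
  b <- var(estimates)                   # between-imputation variance
  total_var <- u_bar + (1 + 1 / m) * b  # total variance of the pooled estimate
  r <- (1 + 1 / m) * b / u_bar          # relative increase in variance from missingness
  df <- (m - 1) * (1 + 1 / r)^2         # degrees of freedom for the t reference distribution
  list(estimate = q_bar, std.error = sqrt(total_var), r = r, df = df)
}

# Made-up values for illustration only, not output from the analysis above
pooled <- pool_rubin(estimates = c(10, 12, 14), std_errors = c(2, 2, 2))
str(pooled)
```

The pooled standard error is larger than any single-dataset standard error whenever the estimates disagree across imputations, which is how the uncertainty about the missing values enters the final inference.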

Conclusion

To address missing data in the “airquality” dataset, a multiple imputation approach was used. The Amelia algorithm created five imputed versions of the dataset, estimating each missing value from the other variables and the patterns in the observed data. Unlike listwise deletion, which discards every incomplete observation and can result in significant data loss, this approach retains all 153 days and carries the uncertainty about the imputed values into the final standard errors.

The initial linear regression analysis estimated that average ozone levels in July and August were 35.50 ppb and 36.35 ppb higher than in May, respectively. After applying multiple imputation to account for missing ozone values, the pooled regression results showed somewhat larger increases: 37.51 ppb for July and 39.73 ppb for August. The largest change is for June, whose estimated difference from May rises from 5.83 ppb to 21.03 ppb. Additionally, the estimated average ozone level for May decreased from 23.62 ppb to 19.97 ppb after imputation.
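
The comparison can be laid out side by side using the coefficients reported in the two regression tables above. This is a small sketch in base R; the data frame `comparison` and its column names are illustrative, the numbers are copied from those tables, and the imputed values would differ slightly in another run of the imputation.

```r
# Coefficients copied from the complete-case table and the pooled imputation table
comparison <- data.frame(
  term = c("(Intercept)", "MonthJune", "MonthJuly", "MonthAugust", "MonthSeptember"),
  complete_case = c(23.615, 5.829, 35.500, 36.346, 7.833),
  imputed = c(19.97475, 21.02556, 37.50738, 39.72965, 11.40227)
)

# Change in each estimate after imputation (positive = larger after imputation)
comparison$difference <- comparison$imputed - comparison$complete_case
print(comparison)

# Term whose estimate moved the most in absolute value
largest_shift <- comparison$term[which.max(abs(comparison$difference))]
print(largest_shift)
```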

These results suggest that the original analysis may have underestimated the magnitude of ozone increases in the summer months because of the missing ozone values. By incorporating more information through imputation, the regression model provided somewhat stronger evidence of seasonal differences in ozone concentration, with more pronounced peaks in July and August. The imputed estimates still depend on the imputation model being reasonable, but the comparison illustrates the value of multiple imputation for preserving data and checking how sensitive the conclusions are to missing values.