Ozone levels are an important environmental indicator that can influence air quality and public health. This analysis uses the built-in “airquality” dataset in R to examine how ozone concentrations vary from May to September. One challenge with this dataset is that many ozone values are missing. If these missing values are ignored or handled poorly, they can lead to misleading conclusions. To address them, we apply multiple imputation using the Amelia package. This method fills in missing values with plausible values drawn from patterns in the observed data, which can make estimates more efficient and less biased than discarding incomplete rows. By comparing regression results from the observed data with those from the imputed datasets, we can gain a better understanding of how missing data may affect our interpretation of seasonal ozone trends.
The relevant variables for this analysis are:
Ozone: The daily maximum ozone concentration measured in parts per billion (ppb).
Month: The month during which the measurement was recorded (May to September).
To prepare the data for analysis, missing values were assessed and the number of incomplete cases was recorded. Since the Ozone variable contains a substantial number of missing values, multiple imputation was used to address potential biases and retain more observations in the regression model.
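# Load the packages used in this analysis
library(dplyr)        # %>%, group_by(), summarise()
library(ggplot2)      # ggplot()
library(modelsummary) # modelsummary()
library(Amelia)       # amelia(), mi.combine()
# Import the data
data(airquality)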
dim(airquality)
## [1] 153 6
# Check the structure of the data and missing values
summary(airquality)
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
# Number of missing values
missing_values <- sum(is.na(airquality))
cat("Total missing values in the dataset:", missing_values, "\n")
## Total missing values in the dataset: 44
# Number of incomplete and complete cases (rows with and without any missing value)
incomplete_cases <- sum(!complete.cases(airquality))
complete_cases <- sum(complete.cases(airquality))
cat("Number of observations with missing data:", incomplete_cases, "\n")
## Number of observations with missing data: 42
cat("Number of complete cases:", complete_cases, "\n")
## Number of complete cases: 111
# Convert numeric months (5-9) to a factor labelled with month names
airquality$Month <- factor(airquality$Month, levels = 5:9, labels = month.name[5:9])
# Average ozone level by month
avg_ozone_month <- airquality %>%
  group_by(Month) %>%
  summarise(avg_ozone = mean(Ozone, na.rm = TRUE))
# Print the result
avg_ozone_month
## # A tibble: 5 × 2
## Month avg_ozone
## <fct> <dbl>
## 1 May 23.6
## 2 June 29.4
## 3 July 59.1
## 4 August 60.0
## 5 September 31.4
# Bar plot for average ozone level by month
ggplot(avg_ozone_month, aes(x = Month, y = avg_ozone, fill = Month)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Average Ozone Level by Month",
       x = "Month",
       y = "Average Ozone Level (ppb)") +
  scale_fill_brewer(palette = "Set3")
# Linear regression of ozone on month (lm() drops rows with missing Ozone)
reg_ozone_month <- lm(Ozone ~ Month, data = airquality)
modelsummary(reg_ozone_month)
Term | (1) |
---|---|
(Intercept) | 23.615 (5.759) |
MonthJune | 5.829 (11.356) |
MonthJuly | 35.500 (8.144) |
MonthAugust | 36.346 (8.144) |
MonthSeptember | 7.833 (7.931) |
Num.Obs. | 116 |
R2 | 0.235 |
R2 Adj. | 0.208 |
AIC | 1120.2 |
BIC | 1136.7 |
Log.Lik. | -554.092 |
F | 8.536 |
RMSE | 28.72 |
The results of the linear regression model, fitted to the 116 observations with a recorded ozone value, show how ozone levels vary by month from May to September. The average ozone level in May is estimated to be 23.62 ppb. Compared to May, ozone levels are 5.83 ppb higher in June, 35.50 ppb higher in July, 36.35 ppb higher in August, and 7.83 ppb higher in September. The increases in July and August are more than four times their standard errors and are statistically significant, suggesting that ozone levels tend to peak during these summer months. The June and September differences are small relative to their standard errors.
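Because Month is a factor with May as the reference level, each coefficient can be read as the difference between a month's mean and May's mean. The sketch below checks that reading with base R. The names `aq`, `fit`, `month_means`, and `from_means` are introduced here for illustration.

```r
# Scratch copy with Month as a labelled factor (names are illustrative)
aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# Dummy-coded regression; lm() drops rows where Ozone is NA
fit <- lm(Ozone ~ Month, data = aq)

# Observed monthly means, ignoring missing Ozone values
month_means <- tapply(aq$Ozone, aq$Month, mean, na.rm = TRUE)

# Intercept = May mean; other coefficients = month mean minus May mean
from_means <- c(month_means[["May"]], month_means[-1] - month_means[["May"]])
print(cbind(coefficient = coef(fit), from_means = from_means))

# t statistics and p-values behind the significance statement
print(round(summary(fit)$coefficients, 3))
```

The two columns of the first printout should agree, which is why the regression table and the table of monthly averages tell the same story.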
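# Multiple imputation using the Amelia package (m = 5 imputed datasets by default)
# Imputations are random draws, so results change between runs unless set.seed() is called first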
a.out <- amelia(x = airquality, cs = "Month", ts = NULL)
## -- Imputation 1 --
##
## 1 2 3 4 5 6 7
##
## -- Imputation 2 --
##
## 1 2 3 4 5
##
## -- Imputation 3 --
##
## 1 2 3 4 5
##
## -- Imputation 4 --
##
## 1 2 3 4 5 6
##
## -- Imputation 5 --
##
## 1 2 3 4 5 6
# Fit the linear regression separately on each imputed dataset
z.out <- with(a.out, lm(Ozone ~ Month))
# Pool the estimates across imputations and display the results
dzout <- mi.combine(z.out)
tinytable::tt(dzout)
term | estimate | std.error | statistic | p.value | df | r | miss.info |
---|---|---|---|---|---|---|---|
(Intercept) | 19.97475 | 5.331679 | 3.746428 | 1.823361e-04 | 3412.643 | 0.03544978 | 0.03480161 |
MonthJune | 21.02556 | 7.595592 | 2.768126 | 5.665248e-03 | 3804.362 | 0.03351234 | 0.03293394 |
MonthJuly | 37.50738 | 7.503248 | 4.998820 | 5.918152e-07 | 6547.053 | 0.02534409 | 0.02501544 |
MonthAugust | 39.72965 | 7.798744 | 5.094365 | 5.281366e-07 | 423.162 | 0.10769536 | 0.10146148 |
MonthSeptember | 11.40227 | 7.545552 | 1.511125 | 1.307869e-01 | 10465.781 | 0.01993970 | 0.01973719 |
The pooled results from the five imputed datasets show how ozone levels vary by month from May to September. The average ozone level in May is estimated to be 19.97 ppb. Compared to May, ozone levels are 21.03 ppb higher in June, 37.51 ppb higher in July, 39.73 ppb higher in August, and 11.40 ppb higher in September. The June, July, and August differences are statistically significant (p < 0.01), while the September difference is not (p = 0.13). These results indicate that ozone concentrations are generally highest in July and August, and lowest in May and September.
To address missing data in the “airquality” dataset, a multiple imputation approach was used. The Amelia algorithm created five imputed versions of the dataset by estimating missing values using information from the other variables and patterns in the data. The regression was then fitted to each imputed dataset and the results were pooled. This approach can yield more reliable estimates than traditional methods such as listwise deletion, which can discard a substantial share of the data.
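The pooled table reports one estimate per term even though each imputed dataset yields its own regression. Below is a minimal sketch of the standard combining rules (Rubin's rules), which appear consistent with the `df`, `r`, and `miss.info` columns in that table. The function name `pool_rubin` and the input numbers are hypothetical and chosen only for illustration.

```r
# Pool one coefficient across m imputed datasets (hypothetical helper).
# est: vector of m point estimates; se: vector of m standard errors
pool_rubin <- function(est, se) {
  m <- length(est)
  q_bar <- mean(est)               # pooled estimate
  w <- mean(se^2)                  # within-imputation variance
  b <- var(est)                    # between-imputation variance
  t_var <- w + (1 + 1 / m) * b     # total variance
  r <- (1 + 1 / m) * b / w         # relative increase in variance
  df <- (m - 1) * (1 + 1 / r)^2    # degrees of freedom
  miss_info <- (r + 2 / (df + 3)) / (r + 1)  # fraction of missing information
  c(estimate = q_bar, std.error = sqrt(t_var), df = df, r = r, miss.info = miss_info)
}

# Hypothetical estimates and standard errors from m = 5 imputations
pooled <- pool_rubin(est = c(19.5, 20.4, 19.8, 20.9, 19.3),
                     se  = c(5.2, 5.3, 5.1, 5.4, 5.2))
print(round(pooled, 3))
```

As a check, plugging the intercept's reported r = 0.03544978 into the df formula with m = 5 gives about 3412.6, in line with the pooled table.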
The initial linear regression, fitted only to observations with a recorded ozone value, estimated that average ozone levels in July and August were 35.50 ppb and 36.35 ppb higher than in May, respectively. After applying multiple imputation to account for missing ozone values, the pooled regression results showed somewhat larger increases: 37.51 ppb for July and 39.73 ppb for August. The largest change was for June, whose estimated difference from May rose from 5.83 ppb to 21.03 ppb. Additionally, the estimated average ozone level for May decreased from 23.62 ppb to 19.97 ppb after imputation.
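To make the comparison easier to scan, the coefficients from the two regression tables can be placed side by side. The sketch below copies the reported values by hand. The data frame name `comparison` and its column names are introduced here for illustration.

```r
# Coefficients copied from the two regression tables in this analysis
comparison <- data.frame(
  term = c("(Intercept)", "MonthJune", "MonthJuly", "MonthAugust", "MonthSeptember"),
  observed = c(23.615, 5.829, 35.500, 36.346, 7.833),
  imputed = c(19.97475, 21.02556, 37.50738, 39.72965, 11.40227)
)

# Change in each estimate after imputation
comparison$difference <- round(comparison$imputed - comparison$observed, 2)
print(comparison)
```

Laid out this way, the largest shift is in the June coefficient, which is worth checking against how many June readings were missing.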
These results suggest that the original analysis may have underestimated the magnitude of ozone increases in the summer months because of missing data. By incorporating more information through imputation, the regression model produced smaller standard errors for every month and somewhat more pronounced peaks in July and August. This comparison illustrates the value of using multiple imputation to preserve data and to check how sensitive statistical estimates are to missing values.