Air quality analysis of New York

from May 1, 1973 to September 30, 1973.

Why is New York orange?

huffingtonpost.co.uk
huffingtonpost.co.uk

0. Loading Data set

data("airquality")

How the first few rows of the dataset looks like?

head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

Let’s see the basic summary of the dataset.

summary(airquality)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 

How is the structure of the data looks?

str(airquality)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

Okay, that’s great. Seems the data is clean enough. (Obvious)

So, now let’s see the dimension of the data.

dim(airquality)
## [1] 153   6

Fine, so we have 153 row. Not so much of a big data. We know there are missing values from the summary of the data. But let’s use another way to see if there is any missing values.

colSums(is.na(airquality))
##   Ozone Solar.R    Wind    Temp   Month     Day 
##      37       7       0       0       0       0

Ohh! So, there are 37 rows under ‘Ozone’ that are empty and 7 rows under ‘Solar.R’. Let’s keep this in mind. This data set might need some data wrangling.

We would now see how each variables are correlated to each other.

cor(airquality[, 1:6])
##         Ozone Solar.R       Wind       Temp        Month          Day
## Ozone       1      NA         NA         NA           NA           NA
## Solar.R    NA       1         NA         NA           NA           NA
## Wind       NA      NA  1.0000000 -0.4579879 -0.178292579  0.027180903
## Temp       NA      NA -0.4579879  1.0000000  0.420947252 -0.130593175
## Month      NA      NA -0.1782926  0.4209473  1.000000000 -0.007961763
## Day        NA      NA  0.0271809 -0.1305932 -0.007961763  1.000000000

Okay, so you can see that because of the missing values in ‘Ozone’ & ‘Solar.R’, we are unable to get a good corelation matrix.

I. Time to Data Wrangle

# Load necessary libraries for wrangling
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Filter rows with missing values in Ozone column
airquality <- airquality %>%
  filter(!is.na(Ozone))

Column Ozone filtered of null values.

# Filter rows with missing values in Ozone column
airquality <- airquality %>%
  filter(!is.na(Solar.R))

Column Solar.R filtered of null values.

Let’s try correlation matrix again

cor(airquality[, 1:6])
##                Ozone     Solar.R        Wind       Temp        Month
## Ozone    1.000000000  0.34834169 -0.61249658  0.6985414  0.142885168
## Solar.R  0.348341693  1.00000000 -0.12718345  0.2940876 -0.074066683
## Wind    -0.612496576 -0.12718345  1.00000000 -0.4971897 -0.194495804
## Temp     0.698541410  0.29408764 -0.49718972  1.0000000  0.403971709
## Month    0.142885168 -0.07406668 -0.19449580  0.4039717  1.000000000
## Day     -0.005189769 -0.05775380  0.04987102 -0.0965458 -0.009001079
##                  Day
## Ozone   -0.005189769
## Solar.R -0.057753801
## Wind     0.049871017
## Temp    -0.096545800
## Month   -0.009001079
## Day      1.000000000

Hmm! Yes. See now you get all the values in correlation matrix.

Now, Let’s use Groupby, another cool function of dplyr.

# Let's group data by month and calculate mean Temp and Wind speed
airquality_summary <- airquality %>%
  group_by(Month) %>%
  summarize(mean_temp = mean(Temp), mean_wind = mean(Wind))
airquality_summary
## # A tibble: 5 × 3
##   Month mean_temp mean_wind
##   <int>     <dbl>     <dbl>
## 1     5      66.5     11.5 
## 2     6      78.2     12.2 
## 3     7      83.9      8.52
## 4     8      83.7      8.86
## 5     9      76.9     10.1

So, here we can see the mean temperature and the mean wind speed from the month of May 1973 to September 1973. Clearly, we can observe that the month of July had highest mean of maximum daily temperature at 83.88 Fahrenheit which is hot enough to motivate New Yorkers to get in the water and the wind speed were highest in June at around average 12.17 mph which is still considered gentle breeze.

II. Visualize to understand more

library(ggplot2)
# Example: Create a scatter plot of Ozone vs. Wind
ggplot(airquality, aes(x = Ozone, y = Wind)) +
  geom_point() +
  labs(title = "Scatter Plot of Ozone vs. Wind")

From the scatter plot, it can be easily detected that there is a negative correlation between Wind speed and Ozone layer. This explains how changes in prevailing wind speed & directions exhibit seasonal dependence and concentrations of transported O3 in atmosphere.

ggplot(airquality, aes(x = Month)) +
  geom_bar() +
  labs(title = "Count of Monthly Observations",
       x = "Month",
       y = "Count")

ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 10, fill = "blue", color = "black") +
  labs(title = "Histogram of Ozone Levels",
       x = "Ozone Level",
       y = "Frequency")

Very Obvious, a right skewed data when considering Ozone level. Higher Ozone layer means lower air quality. We can see there is one extraordinary case when the Ozone level was at max. When did this happen?

# Find the row with the maximum Ozone level
max_ozone_row <- airquality %>%
  filter(Ozone == max(Ozone))

# Get the month from the row with the maximum Ozone level
max_ozone_month <- max_ozone_row$Month

# Print the result
cat("The maximum Ozone level occurred in month:", max_ozone_month, "\n")
## The maximum Ozone level occurred in month: 8

Ohh! okay. So in the month of August, the ozone level touched the maximum. Can be a good question to ask why this extraordinary event happened in August? Was there any unusual factors involved?

Not in the scope of this study, so we move on to next chart.

# Define custom colors for each month
custom_colors <- c("blue", "green", "red", "purple", "orange", "brown")

# Create a ggplot with custom colors
ggplot(airquality, aes(x = Day, y = Temp, group = Month, color = factor(Month))) +
  geom_line() +
  labs(title = "Line Plot of Daily Temperature",
       x = "Day",
       y = "Temperature (Fahrenheit)") +
  scale_color_manual(values = custom_colors)

Woah!! So Cool! The line chart illustrates the daily temperature fluctuations in Fahrenheit over several months, with each colored line representing a different month. It allows for a quick and clear comparison of temperature patterns between months, highlighting seasonal variations and differences in temperature ranges. The chart readily identifies outliers and transitions between months with distinct temperature profiles, aiding in the interpretation of weather-related data and providing valuable insights into the changing climate throughout the observed period.

ggplot(airquality, aes(x = Month, y = Ozone)) +
  geom_boxplot() +
  labs(title = "Box Plot of Ozone Levels by Month",
       x = "Month",
       y = "Ozone Level")
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?

We can see the distribution of Ozone levels from this box plot. And we can identify there were two outliers. We identified one, what about the other?

# Sort the data in descending order of Ozone levels
sorted_data <- airquality %>%
  arrange(desc(Ozone))

# Extract the month with the second maximum Ozone level
second_max_ozone_month <- sorted_data$Month[2]

# Print the result
cat("The second maximum Ozone level occurred in month:", second_max_ozone_month, "\n")
## The second maximum Ozone level occurred in month: 7

Okay, so in month of July 1983.

ggplot(airquality, aes(x = Ozone, fill = Month)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of Ozone Levels",
       x = "Ozone Level",
       y = "Density")
## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

It more or less gives same idea as the histogram already did.

library(reshape2)

# Calculate the correlation matrix
cor_matrix <- cor(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])

# Convert to long format for ggplot2
cor_matrix_long <- melt(cor_matrix)

# Create a heatmap
ggplot(cor_matrix_long, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  labs(title = "Correlation Heatmap",
       x = "Variables",
       y = "Variables")

Now here comes my strongest move! You can easily infer from the correlation matrix above. But let me add pointers for better understanding.

  1. Ozone and Temperature (Temp): There is a strong positive correlation of approximately 0.70 between Ozone and Temperature. This suggests that as the Ozone levels increase, so does the temperature. Warmer days tend to have higher Ozone levels.

  2. Ozone and Wind: There is a significant negative correlation of approximately -0.61 between Ozone and Wind. This indicates that as the wind speed increases, Ozone levels tend to decrease. Wind might disperse or dilute Ozone in the air.

  3. Ozone and Solar Radiation (Solar.R): Ozone and Solar Radiation show a positive correlation of approximately 0.35. This suggests that there is some association between higher solar radiation levels and increased Ozone, although the correlation is not very strong.

  4. Temperature and Wind: Temperature and Wind exhibit a negative correlation of about -0.50. As the wind speed increases, temperatures tend to decrease. Wind can have a cooling effect.

  5. Month and Temperature: Month and Temperature are positively correlated at around 0.40. This indicates a seasonal pattern where temperatures tend to be higher in certain months, possibly corresponding to summer months.

  6. No Strong Relationship with Day: Day of the month (Day) doesn’t show strong correlations with the other variables, which is expected since day numbers don’t have a direct relationship with Ozone, Temperature, or other weather-related factors.

library(ggplot2)

ggplot(airquality, aes(x = Month, y = Ozone, fill = Month)) +
  geom_violin() +
  labs(title = "Violin Plot of Ozone Levels by Month",
       x = "Month",
       y = "Ozone Level")
## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

Hmm. Nothing much. Same inferences as the box plot, just visually easy to understand that most of the frequencies happened in the range of 0-50 Fahrenheit.

library(igraph) 
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
# Example: Create a correlation matrix and a graph based on correlations
correlation_matrix <- cor(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
g <- graph.adjacency(correlation_matrix, weighted = TRUE, mode = "undirected")

III. Machine learning to extract more insights

Regression: Predict Ozone Levels

# Load required libraries
library(caret)
## Loading required package: lattice
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
# Split the data into training and testing sets
set.seed(123)
split_index <- createDataPartition(airquality$Ozone, p = 0.7, list = FALSE)
train_data <- airquality[split_index, ]
test_data <- airquality[-split_index, ]

# Train a random forest regression model
rf_model <- randomForest(Ozone ~ ., data = train_data)

# Make predictions on the test data
predictions <- predict(rf_model, newdata = test_data)

# Evaluate the model
rmse <- sqrt(mean((test_data$Ozone - predictions)^2))
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
## Root Mean Squared Error (RMSE): 16.40662

Hmmmmmmm.

Our target variable, that is Ozone layer varies in the data from a range of 1 to 168, an RMSE of 16.41 suggests that there is room for improvement in predicting Ozone levels, especially if high precision is crucial for your application. Evaluating the impact of prediction errors in the specific context of our project is essential in determining whether the current level of accuracy is acceptable or if further model refinement is necessary.

Classification: AQI Classification

# Categorize air quality based on Ozone levels
airquality$AirQualityCategory <- cut(airquality$Ozone,
  breaks = c(0, 33, 66, Inf),
  labels = c("Good", "Moderate", "Unhealthy")
)

# Split the data into training and testing sets
set.seed(123)
split_index <- createDataPartition(airquality$AirQualityCategory, p = 0.7, list = FALSE)
train_data <- airquality[split_index, ]
test_data <- airquality[-split_index, ]

# Train a random forest classification model
rf_model <- randomForest(AirQualityCategory ~ ., data = train_data)

# Make predictions on the test data
predictions <- predict(rf_model, newdata = test_data)

# Evaluate the model
confusion_matrix <- table(predictions, test_data$AirQualityCategory)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.96875

Woah!! an accuracy of 96.875% indicates that the random forest classification model is effective in categorizing air quality based on Ozone levels and holds promise for practical applications in air quality assessment and monitoring.

———————————–E N D ————————————