AirQualityHW

Author

Robert Gravatt

Airquality Tutorial and Homework Assignment

Load the library

library(tidyverse)

Load the dataset into your global environment

data("airquality")

Look at the structure of the data
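
No code appears under this heading in the rendered output. A minimal sketch, assuming `str()` is the intended call (the copy named `aq` is illustrative and not from the original):

```r
# Assumption: str() is what "structure" refers to; the original shows no code here.
aq <- datasets::airquality   # fresh copy of the built-in data set
str(aq)                      # one line per column: name, type, first few values
dim(aq)                      # number of rows and number of columns
```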

View the data using the “head” function

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
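
The NA entries in rows 5 and 6 are missing values. A quick way to see how many each column has is sketched below; `aq` and `na_counts` are illustrative names, not part of the original assignment.

```r
aq <- datasets::airquality
na_counts <- colSums(is.na(aq))   # number of missing values in each column
na_counts
sum(complete.cases(aq))           # rows with no missing value in any column
```

Only rows that are complete in every column survive a later call to `na.omit()`, so this count shows how much data a model will actually use.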

Calculate Summary Statistics

mean(airquality$Temp)
[1] 77.88235

Calculate Median, Standard Deviation, and Variance

median(airquality$Temp)
[1] 79
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154
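
Temp and Wind have no missing values, so the calls above return numbers directly. For a column that does contain NA entries, such as Ozone, the same functions return NA unless `na.rm = TRUE` is passed. A minimal sketch (the name `aq` is illustrative):

```r
aq <- datasets::airquality
mean(aq$Ozone)                 # NA, because Ozone contains missing values
mean(aq$Ozone, na.rm = TRUE)   # mean of the non-missing values only
sd(aq$Ozone, na.rm = TRUE)     # same idea for the standard deviation

# The variance is the square of the standard deviation
all.equal(sd(aq$Wind)^2, var(aq$Wind))
```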

Rename the months from numbers to names

airquality$Month[airquality$Month == 5]<- "May"
airquality$Month[airquality$Month == 6]<- "June"
airquality$Month[airquality$Month == 7]<- "July"
airquality$Month[airquality$Month == 8]<- "August"
airquality$Month[airquality$Month == 9]<- "September"

Now look at the summary of the Month column. Because Month is now a character vector, summary() reports only its length, class, and mode.

summary(airquality$Month)
   Length     Class      Mode 
      153 character character 

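Month is a categorical variable. R stores categorical variables as factors, and the distinct categories are called levels. Setting the levels explicitly keeps the months in calendar order rather than alphabetical order.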

airquality$Month<-factor(airquality$Month, 
                         levels=c("May", "June","July", "August",
                                  "September"))
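
As an alternative sketch, the renaming and the factor conversion can be done in a single call by giving `factor()` both `levels` and `labels`. This works on a fresh copy where Month is still numeric; `aq` and `month_names` are illustrative names, not from the original.

```r
aq <- datasets::airquality            # Month is still numeric (5 to 9) here
month_names <- c("May", "June", "July", "August", "September")

# Map the codes 5..9 to names and fix the level order in one step
aq$Month <- factor(aq$Month, levels = 5:9, labels = month_names)

levels(aq$Month)     # levels in calendar order
summary(aq$Month)    # for a factor, summary() counts the rows in each level
```

Once Month is a factor, `summary()` becomes informative again: it reports how many observations fall in each month instead of just the length and class.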

Plot 1: Create a histogram categorized by Month

p1 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity")+
  scale_fill_discrete(name = "Month", 
                      labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")  #provide the data source

Plot 1 Output

p1
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Plot 2: Improve the histogram of Temperature by Month

p2 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity", alpha=0.5, binwidth = 5, color = "white")+
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")

Plot 2 Output

p2

Plot 3: Create side-by-side boxplots categorized by Month

p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September"))

Plot 3 Output

p3
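
The center line of each box is that month's median temperature. As a numeric companion to the plot, the base R sketch below computes the medians and interquartile ranges directly; `aq`, `month_names`, `temp_median`, and `temp_iqr` are illustrative names, not from the original.

```r
aq <- datasets::airquality
month_names <- c("May", "June", "July", "August", "September")
aq$Month <- factor(aq$Month, levels = 5:9, labels = month_names)

# One value per month, in level (calendar) order
temp_median <- tapply(aq$Temp, aq$Month, median)  # center line of each box
temp_iqr    <- tapply(aq$Temp, aq$Month, IQR)     # roughly the height of each box

temp_median
temp_iqr
```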

Plot 4: Side by Side Boxplots in Gray Scale

Plot 4 Code

p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot()+
  scale_fill_grey(name = "Month", labels = c("May", "June","July", "August", "September"))

Plot 4 Output

p4

Plot 5: Using a three-predictor linear model to predict Ozone levels

Plot 5 Code

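clean_air <- na.omit(airquality)    # drop rows with any missing values

# Model Ozone as a linear function of temperature, wind speed, and solar radiation
model <- lm(Ozone ~ Temp + Wind + Solar.R, data = clean_air)

# Fitted values for the same rows the model was trained on
clean_air$Predicted_Ozone <- predict(model)

# Scatter plot of actual vs. predicted ozone
plot(clean_air$Ozone, clean_air$Predicted_Ozone,
     xlab = "Actual Ozone",
     ylab = "Predicted Ozone",
     main = "Actual vs. Predicted Ozone Levels",
     pch = 19, col = "steelblue")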

# Fit a degree-2 polynomial of predicted ozone on actual ozone
poly_fit <- lm(Predicted_Ozone ~ poly(Ozone, 2), data = clean_air)

# Generate sequence of Ozone values for smooth curve
ozone_seq <- seq(min(clean_air$Ozone), max(clean_air$Ozone), length.out = 100)

# Predict fitted values from polynomial model
curve_points <- predict(poly_fit, newdata = data.frame(Ozone = ozone_seq))

# Add the polynomial curve
lines(ozone_seq, curve_points, col = "forestgreen", lwd = 2)

In this code, I built a linear model to predict ozone levels from three environmental variables: temperature, wind speed, and solar radiation. This is a multiple linear regression with three predictors, which models ozone concentration as a weighted sum of these inputs plus an intercept. I compared the predicted ozone levels to the actual ozone levels recorded in New York City from May to September 1973. In the scatter plot of predicted versus observed levels, the points did not follow a straight line well, so I added a quadratic polynomial fit to the plot. This curve helps reveal patterns that a straight-line model might miss. A perfect model would place every point on the line where predicted equals actual. Where the points sit consistently above that line, the model overestimates ozone; where they sit consistently below it, the model underestimates ozone.
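
One way to make the over- and underestimation check concrete is to draw the reference line where predicted equals actual and to average the residuals in different ozone ranges. The sketch below refits the same formula on a fresh copy of the data; the names `aq`, `fit`, `pred`, and `res`, and the choice of three equal-width ozone ranges, are assumptions for illustration and not part of the original analysis.

```r
aq   <- na.omit(datasets::airquality)                  # complete rows only
fit  <- lm(Ozone ~ Temp + Wind + Solar.R, data = aq)   # same formula as above
pred <- fitted(fit)                                    # predictions for the training rows

plot(aq$Ozone, pred,
     xlab = "Actual Ozone", ylab = "Predicted Ozone",
     pch = 19, col = "steelblue")
abline(a = 0, b = 1, lty = 2)   # reference line: predicted equals actual

# Residual = actual - predicted; a positive value means the model underestimated
res <- aq$Ozone - pred

# Mean residual in three equal-width ranges of actual ozone (illustrative choice)
tapply(res, cut(aq$Ozone, breaks = 3), mean)

summary(fit)$r.squared          # share of ozone variance explained by the model
```

If the mean residual changes sign from the low range to the high range, that is numeric evidence of the curvature the quadratic fit line is meant to show.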

N.B. I did have a chat with Copilot about this modeling. However, I selected the quadratic fit line myself as I had used it before in DATA 101 just for fun.