Air Quality Assignment

Author

Ryan Seabold

Rockville, MD AQI (AirNow.gov)

Load Tidyverse

Loading the tidyverse gives access to dplyr and ggplot2.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the data into the global environment

data(airquality)

Check the structure and statistics of the data

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
summary(airquality)
     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       
     Month            Day      
 Min.   :5.000   Min.   : 1.0  
 1st Qu.:6.000   1st Qu.: 8.0  
 Median :7.000   Median :16.0  
 Mean   :6.993   Mean   :15.8  
 3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :9.000   Max.   :31.0  
                               

Replace month numbers with names

# Create a backup month numbers column (for later)
airquality <- airquality |>
  mutate(Monthnum = Month)

airquality$Month[airquality$Month == 5] <- "May"
airquality$Month[airquality$Month == 6] <- "June"
airquality$Month[airquality$Month == 7] <- "July"
airquality$Month[airquality$Month == 8] <- "August"
airquality$Month[airquality$Month == 9] <- "September"

Give the months an order

airquality$Month <- factor(airquality$Month, levels =
                           c("May", "June","July", "August", "September"))
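The two steps above (renaming the months, then ordering them) can also be done in a single call. The following is a minimal sketch in base R, not the approach used in this assignment; it works on a copy named `aq` (a hypothetical name) so the `airquality` object above is left untouched.

```r
# Work on a fresh copy of the built-in dataset (aq is a hypothetical name)
aq <- datasets::airquality

# Keep the numeric month, as the backup column does above
aq$Monthnum <- aq$Month

# factor() maps each level (5..9) to its label and fixes the order in one step;
# month.name is a built-in vector of English month names
aq$Month <- factor(aq$Monthnum, levels = 5:9, labels = month.name[5:9])

levels(aq$Month)   # month names in calendar order
table(aq$Month)    # number of days recorded per month
```

Because the labels come from `month.name`, there is no risk of a typo in one of the five month names.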

Check the summary again

summary(airquality)
     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       
       Month         Day          Monthnum    
 May      :31   Min.   : 1.0   Min.   :5.000  
 June     :30   1st Qu.: 8.0   1st Qu.:6.000  
 July     :31   Median :16.0   Median :7.000  
 August   :31   Mean   :15.8   Mean   :6.993  
 September:30   3rd Qu.:23.0   3rd Qu.:8.000  
                Max.   :31.0   Max.   :9.000  
                                              

Plot 1: Create a histogram categorized by Month

p1 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity")+
  scale_fill_discrete(name = "Month", 
                      labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")  #provide the data source
p1
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Plot 2: Improve the histogram of Temperature by Month

p2 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity", alpha=0.5, binwidth = 5, color = "white")+
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
p2

Plot 3: Create side-by-side boxplots categorized by Month

p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September"))
p3

Plot 4: Side by Side Boxplots in Gray Scale

p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot()+
  scale_fill_grey(name = "Month", labels = c("May", "June","July", "August", "September"))
p4

Plot 5: Scatterplot of temperature by date

p5 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) +
  labs(x = "Months from May through September", y = "Temperatures",
       title = "Scatterplot of Daily Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_point(shape = 21, size = 3) +
  scale_fill_discrete(name = "Month", labels = c("May", "June", "July", "August", "September"))
p5

This plot only groups the temperatures by month, so every day in a month lands in the same vertical column.

To plot temperature by date, we'll have to create a new Date column and format the x-axis differently.

# Create a new Date column from the year, Monthnum, and Day so the data can be ordered by date
airquality <- airquality |>
  mutate(Date = as.Date(paste(1973, Monthnum, Day, sep = "-"))) # 1973 because the data is from 1973

p6 <- airquality |>
  ggplot(aes(Date, Temp, fill = Month)) +
  labs(x = "Date from May through September", y = "Temperatures",
       title = "Scatterplot of Daily Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_point(shape = 21, size = 3) +
  scale_x_date(date_breaks = "1 month", date_labels = "%B 1") +  # Show month names and "1"
  scale_fill_discrete(name = "Month", labels = c("May", "June", "July", "August", "September"))
p6

Essay

I created a scatterplot that shows the temperature fluctuations, not just between months, but within them. Because each day has its own position on the x-axis, rather than being grouped into distinct months as in the earlier scatterplot and histograms, this format would let us fit a more precise trend line to predict temperature behavior. Additionally, given data from multiple years, we could further predict how temperatures fluctuate within months.

As you can see, the temperatures within months tend to be rather chaotic, but some patterns do emerge. For instance, while the temperatures throughout May tend to stay around 65 ± 10 °F, the temperatures rise, then fall, then rise again in June. The same happens in August, before the temperatures fall quickly in September.
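The monthly patterns described above can be checked numerically. The following is a rough sketch in base R that summarizes the daily temperatures per month; the names `temps` and `monthly_temps` are hypothetical, and it reads the original built-in dataset, where Month is still numeric.

```r
# Fresh copy of the built-in dataset; Month is numeric (5 = May, ..., 9 = September)
temps <- datasets::airquality

# tapply() applies a function to Temp within each month, in month order
monthly_temps <- data.frame(
  Monthnum  = sort(unique(temps$Month)),
  mean_temp = as.vector(tapply(temps$Temp, temps$Month, mean)),
  min_temp  = as.vector(tapply(temps$Temp, temps$Month, min)),
  max_temp  = as.vector(tapply(temps$Temp, temps$Month, max))
)

monthly_temps
```

Each row gives the mean, lowest, and highest daily temperature for one month, which makes it easier to compare the spread within May against the later months.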

To achieve this plot, I had to create a new “Date” column. Some trouble emerged when I tried to use only the month and day: a Date value always includes a year, and as.Date() expects Y-m-d format by default, so I added 1973 as the year. For the axis, I chose to use scale_x_date() to show only the first of each month as a reference date, rather than showing every day separately, which would have produced far too many labels.
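The date handling described above can be sketched in base R as follows. This is an illustration on a copy named `aq` (a hypothetical name), assuming the original numeric Month column, not a replacement for the code used earlier.

```r
# Fresh copy of the built-in dataset; Month and Day are both numeric here
aq <- datasets::airquality

# paste() builds strings such as "1973-5-1"; as.Date() reads them as year-month-day
aq$Date <- as.Date(paste(1973, aq$Month, aq$Day, sep = "-"))

class(aq$Date)   # a Date column rather than character
range(aq$Date)   # first and last day in the data

# The stored value always carries the year; format() controls what is displayed
format(aq$Date[1], "%m-%d")
```

The same idea applies to the axis: `scale_x_date(date_labels = ...)` only changes how the dates are printed, while the underlying values keep their full year, month, and day.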

ChatGPT-4o was utilized for debugging purposes.