Air Quality HW Assignment

Author

Ash Ibasan

Air Quality Assignment

EPA Air Quality Index (AQI)

Load tidyverse library

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.4.1
Warning: package 'ggplot2' was built under R version 4.4.1
Warning: package 'tibble' was built under R version 4.4.1
Warning: package 'tidyr' was built under R version 4.4.1
Warning: package 'readr' was built under R version 4.4.1
Warning: package 'purrr' was built under R version 4.4.1
Warning: package 'dplyr' was built under R version 4.4.1
Warning: package 'stringr' was built under R version 4.4.1
Warning: package 'forcats' was built under R version 4.4.1
Warning: package 'lubridate' was built under R version 4.4.1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the data into the global environment

data(airquality)

View data

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Calculate summary statistics

mean(airquality$Temp) #calculate the mean
[1] 77.88235
mean(airquality[,4]) #another way to calc mean: refer to matrix [row,column]; refers to column 4, the Temp column
[1] 77.88235

Calculate median, standard deviation, and variance

median(airquality$Temp)
[1] 79
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154
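
The Temp and Wind columns have no missing values, but the head() output above shows NA entries in Ozone and Solar.R. The sketch below shows how the same summary functions behave on a column with missing values. It works on a fresh copy named aq (a name introduced here for illustration only) so that it does not depend on any other step in this report.

```r
# Fresh copy of the built-in data set (aq is an illustrative name)
aq <- datasets::airquality

# Ozone contains missing values, so the plain mean is NA
mean(aq$Ozone)

# Count the missing values, then drop them when averaging
sum(is.na(aq$Ozone))
mean(aq$Ozone, na.rm = TRUE)

# The variance is the square of the standard deviation
all.equal(sd(aq$Wind)^2, var(aq$Wind))
```

The same na.rm = TRUE argument is accepted by median(), sd(), and var().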

Rename months from numbers to names

airquality$Month[airquality$Month == 5]<- "May"
airquality$Month[airquality$Month == 6]<- "June"
airquality$Month[airquality$Month == 7]<- "July"
airquality$Month[airquality$Month == 8]<- "August"
airquality$Month[airquality$Month == 9]<- "September"

Look at summary statistics of dataset

summary(airquality$Month)
   Length     Class      Mode 
      153 character character 

Month is a categorical variable, so convert it to a factor whose levels are listed in calendar order

airquality$Month<-factor(airquality$Month,
                         levels=c("May", "June", "July", "August", "September"))
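
As an alternative sketch, the renaming and the factor conversion can be done in one step with the labels argument of factor() and the built-in month.name vector. This is not the approach used above, just an equivalent shortcut. It uses a fresh copy named aq (an illustrative name) so the original numeric Month codes are still present.

```r
# Fresh copy with Month still coded as 5 to 9 (aq is an illustrative name)
aq <- datasets::airquality

# Map the codes 5:9 to month names and fix the level order in one call
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# Levels now follow calendar order rather than alphabetical order
levels(aq$Month)

# summary() on a factor reports the number of rows per level
summary(aq$Month)
```

Because Month is now a factor, summary() returns a count per month instead of the Length, Class, and Mode line shown earlier for the character column.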

Plot 1: Histogram categorized by Month

p1 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity")+
  scale_fill_discrete(name = "Month", 
                      labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")  #provide the data source
p1
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Plot 2: Improved histogram of temperature by month

p2 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity", alpha=0.5, binwidth = 5, color = "white")+
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
p2

Plot 3: Create side-by-side boxplots categorized by Month

p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September"))
p3

Plot 4: Side-by-side boxplots in grayscale

p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot()+
  scale_fill_grey(name = "Month", labels = c("May", "June","July", "August", "September"))
p4
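
To read the boxplots numerically, the quartiles behind each box can be tabulated with dplyr. This is a minimal sketch: the names aq and temp_by_month are introduced here for illustration, and the hinges that geom_boxplot() draws may differ slightly from these quantile() values for some months.

```r
library(dplyr)

# Fresh copy with Month as an ordered set of factor levels (aq is an illustrative name)
aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# One row per month: lower quartile, median, and upper quartile of Temp
temp_by_month <- aq |>
  group_by(Month) |>
  summarise(
    q1     = quantile(Temp, 0.25, names = FALSE),
    median = median(Temp),
    q3     = quantile(Temp, 0.75, names = FALSE),
    .groups = "drop"
  )

temp_by_month
```

The median column corresponds to the line inside each box, and q1 and q3 approximate the bottom and top edges.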

Plot 5: Density Plot of Wind Speed by Month

p5 <- airquality |>
  ggplot(aes(x = Wind, fill = factor(Month))) +
  geom_density(alpha = 0.4) +
  scale_fill_discrete(name = "Month", labels = c("May", "June", "July", "August", "September")) +
  labs(x = "Wind Speed (mph)", 
       y = "Density", 
       title = "Density Plot of Wind Speed by Month",
       caption = "New York State Department of Conservation and the National Weather Service") +
  theme_minimal()
p5

Plot 5 Essay

Description

For Plot 5, I chose a density plot to visualize the distribution of wind speed in each month. A density plot is a smoothed estimate of how values are distributed over a continuous range. Here it shows the wind speeds recorded from May through September, with fill color representing the month.

Insights

The density plot suggests that wind speeds vary noticeably by month. The May and June curves sit furthest to the right, peaking around 10-12 mph, while the July and August curves peak lower, closer to 8 mph. September's curve is broader, indicating a more even spread of wind speeds. Because the curves are smoothed, these seasonal differences are easier to see than they would be in overlapping histograms, where bin-to-bin noise can obscure the overall shape.
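
The peak locations read off the plot can be checked numerically. The sketch below uses base R's density(), which by default uses a Gaussian kernel and the "nrd0" bandwidth, the same defaults as geom_density(). The names aq and peak_wind are introduced here for illustration, and the exact peaks depend on the bandwidth, so treat the results as approximate.

```r
# Fresh copy with Month as a factor in calendar order (aq is an illustrative name)
aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# For each month, estimate the density of Wind and find where it is highest
peak_wind <- sapply(split(aq$Wind, aq$Month), function(w) {
  d <- density(w)       # Gaussian kernel, default bandwidth
  d$x[which.max(d$y)]   # wind speed at the top of the curve
})

# Approximate peak wind speed (mph) per month
round(peak_wind, 1)

# Monthly mean wind speed, for comparison with the peaks
tapply(aq$Wind, aq$Month, mean)
```

Comparing the peaks with the monthly means gives a rough check on which months are windier overall and which simply have a more spread-out distribution.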

Special Code

To create this plot, I used geom_density() to draw a smoothed wind speed distribution for each month from May to September. Inside geom_density(), I set alpha = 0.4 so that the fills are semi-transparent and overlapping distributions remain visible. scale_fill_discrete() assigns a distinct fill color to each month and labels the legend. theme_minimal() removes the gray background and other nonessential elements, keeping the focus on the data.