Air Quality Homework Assignment

Author

Emma Poch

Maryland Department of the Environment, 2022

Loading the necessary packages

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading data in the global environment and viewing data

data("airquality")
head(airquality)

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Viewing summary statistics

mean(airquality$Wind)

[1] 9.957516

mean(airquality[,3])

[1] 9.957516

median(airquality$Ozone, na.rm = TRUE)

[1] 31.5

sd(airquality$Temp)

[1] 9.46527

var(airquality$Solar.R, na.rm = TRUE)

[1] 8110.519
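
The na.rm argument is needed for Ozone and Solar.R because those columns contain missing values. As a minimal sketch, the missing values can be counted per column and the same four statistics gathered in a single call. The object names aq, na_counts, and stats and the summary column names are my own choices for illustration, not part of the assignment.

```r
library(dplyr)

# Work on a fresh copy so the original data frame is left untouched
aq <- datasets::airquality

# Number of missing values in each column
na_counts <- colSums(is.na(aq))
na_counts

# The same statistics as above, gathered into one row
stats <- aq |>
  summarise(
    mean_wind    = mean(Wind),
    median_ozone = median(Ozone, na.rm = TRUE),
    sd_temp      = sd(Temp),
    var_solar    = var(Solar.R, na.rm = TRUE)
  )
stats
```

Without na.rm = TRUE, median() and var() return NA for any column that has a missing value, which is why Wind and Temp can be summarized without it.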

Converting months from numbers to names

airquality$Month[airquality$Month == 5] <- "May"
airquality$Month[airquality$Month == 6] <- "June"
airquality$Month[airquality$Month == 7] <- "July"
airquality$Month[airquality$Month == 8] <- "August"
airquality$Month[airquality$Month == 9] <- "September"
head(airquality$Month)

[1] "May" "May" "May" "May" "May" "May"

summary(airquality$Month)

   Length     Class      Mode 
      153 character character

Reordering months into non-alphabetical levels

airquality$Month <- factor(airquality$Month, levels = c("May", "June", "July", "August", "September"))
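
The renaming and reordering steps above could likely be combined into one. The sketch below starts from a fresh copy of the data, where Month is still numeric, and relies on the built-in month.name vector. The object name aq is an assumption made for illustration.

```r
# Fresh copy, with Month still stored as the integers 5 to 9
aq <- datasets::airquality

# month.name is a built-in vector of the twelve English month names;
# levels fixes the calendar order and labels supplies the names
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

levels(aq$Month)
table(aq$Month)
```

Because the levels are given in calendar order, any plot that uses Month keeps May through September in sequence instead of sorting them alphabetically.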

Plot 1: Creating a histogram categorized by month

p1 <- airquality |>
  ggplot(aes(x=Temp, fill =Month))+
  geom_histogram(position="identity")+
  scale_fill_discrete(name="Month", labels=c("May", "June", "July", "August", "September"))+
  labs(x = "Monthly Temperatures From May - Sept", y = "Frequency of Temps", title = "Histogram of Monthly Temps from May - Sept, 1973", caption = "New York State Department of Conservation and the National Weather Service")
p1

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This plot is useful to some extent: it shows the general range of temperatures that the city experiences between May and September, and therefore which months tend to be colder or warmer. However, because the bars for different months are drawn on top of one another, some of them are hidden. The visualization would likely be more intuitive if the x-axis were divided by month so that all temperatures for a given month could be shown in one place.
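
One possible way to show each month in its own place is to facet the histogram by month. This is only a sketch of that idea and was not part of the assignment. The names aq and p_facet, the bin width of 5, and the axis labels are assumptions chosen for illustration.

```r
library(ggplot2)

# Fresh copy with Month as an ordered, named factor
aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# One histogram panel per month, stacked vertically on a shared x-axis
p_facet <- ggplot(aq, aes(x = Temp, fill = Month)) +
  geom_histogram(binwidth = 5, color = "white") +
  facet_wrap(~ Month, ncol = 1) +
  guides(fill = "none") +  # the panel titles already name the months
  labs(x = "Temperature (degrees Fahrenheit)", y = "Number of days")
p_facet
```

Stacking the panels in one column keeps the temperature axis aligned, so the shift toward warmer temperatures from May to July can be read by scanning down the page.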

Plot 2: cleaning up and improving the previous plot

p2 <- airquality |>
  ggplot(aes(x=Temp, fill=Month))+
  geom_histogram(position="identity", alpha=0.5, binwidth=5, color="white")+
  scale_fill_discrete(name = "Month", labels=c("May", "June", "July", "August", "September"))+
  labs(x = "Monthly Temperatures From May - Sept", y = "Frequency of Temps", title = "Histogram of Monthly Temps from May - Sept, 1973", caption = "New York State Department of Conservation and the National Weather Service")
p2

Adding transparency, a fixed bin width of 5 degrees, and white bar outlines visibly improved the plot’s readability. The distinctions between months are much clearer, making it easier to see which temperature ranges occur across multiple months and which ones are only experienced during one or two of the months.

Plot 3: side by side boxplots, sorted by month

p3 <- airquality |>
  ggplot(aes(Month, Temp, fill=Month))+
  labs(x = "Months from May - September", y = "Temperatures", title = "Side-by-Side Boxplot of Monthly Temperature", caption = "New York State Department of Conservation and the National Weather Service")+
  geom_boxplot()+
  scale_fill_discrete(name = "Month", labels=c("May", "June", "July", "August", "September"))
p3

Plot 4: side-by-side boxplots in greyscale

p4 <- airquality |>
  ggplot(aes(Month, Temp, fill=Month))+
  labs(x = "Months from May - September", y = "Temperatures", title = "Side-by-Side Boxplot of Monthly Temperature", caption = "New York State Department of Conservation and the National Weather Service")+
  geom_boxplot()+
  scale_fill_grey(name = "Month", labels=c("May", "June", "July", "August", "September"))
p4

Plot 5: scatterplot visualization of the association between ozone level and temperature

library(viridis)

Warning: package 'viridis' was built under R version 4.3.3

Loading required package: viridisLite

p5 <- airquality |>
  ggplot(aes(Temp, Ozone))+
  geom_smooth(method="lm", se=FALSE, na.rm=TRUE, linewidth=0.5)+
  geom_point(aes(color=Ozone), na.rm=TRUE, size=3)+
  scale_color_viridis()+
  labs(x="Temperature (Degrees Fahrenheit)", y="Ozone (ppb)", title = "Relationship Between Temperature and Atmospheric Ozone Concentration, 1973", caption = "New York State Department of Conservation and the National Weather Service")

p5

`geom_smooth()` using formula = 'y ~ x'

I created a scatterplot showing how the measured ozone concentration changes with temperature. As the trend of the points and the line of best fit demonstrate, there is a positive relationship between temperature and ozone concentration; this makes sense, as sunlight drives the reactions that produce ozone. However, the relationship appears weaker below about 80 degrees Fahrenheit; the pattern becomes much more pronounced above the 80-degree threshold.

The only code used that was not a part of base R or ggplot2 was the package “viridis,” which provides the color scale. I chose to color the points by ozone level to give the viewer a better idea of the prevalence of different levels of ozone, as I worried that using y-axis position alone would make the graph more difficult to interpret.

I will also note that I did not do the proper calculations to determine whether a linear regression model would suit this dataset. I still included the line of best fit because it was intended as a visual guide to the trend of the data rather than as a distinct linear model. I hope this sufficiently justifies its use.
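
For anyone who wants to check the observations above numerically, the sketch below computes the Pearson correlation overall and on either side of 80 degrees, then fits the simple linear model and plots its residuals. None of this was run for the assignment. The object names and the choice of 80 as the cutoff are assumptions made for illustration.

```r
aq <- datasets::airquality

# Keep only the days that have an Ozone reading
aq_complete <- aq[!is.na(aq$Ozone), ]

# Pearson correlation across all complete days
r_all <- cor(aq_complete$Temp, aq_complete$Ozone)

# The same correlation below and at or above the 80-degree mark
cool <- aq_complete[aq_complete$Temp < 80, ]
warm <- aq_complete[aq_complete$Temp >= 80, ]
r_cool <- cor(cool$Temp, cool$Ozone)
r_warm <- cor(warm$Temp, warm$Ozone)
c(all = r_all, below_80 = r_cool, at_or_above_80 = r_warm)

# Simple linear fit; a curved pattern in the residuals would suggest
# that a straight line is only a rough guide
fit <- lm(Ozone ~ Temp, data = aq_complete)
summary(fit)$r.squared
plot(fitted(fit), resid(fit), xlab = "Fitted ozone", ylab = "Residual")
abline(h = 0, lty = 2)
```

Splitting at a fixed cutoff is a crude comparison, since each subset covers a narrower temperature range than the full data, so the two subset correlations should be read as a rough indication only.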