RealAirQualityHomework

Author

Julian Beckert

Load Library & Dataset

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Because airquality is a built-in dataset, we can load it into our environment with data() so it is available for later use.

data(airquality)
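
The data() call only loads the dataset into the session. If a copy on disk is also wanted, a minimal sketch could look like the following. The output location is an assumption: it uses a hypothetical "data" folder inside tempdir() so that the example runs anywhere, and the names out_dir, out_file and airquality_saved are illustrative.

```r
# Fresh copy of the built-in dataset
aq_raw <- datasets::airquality

# Hypothetical output folder; swap in your own project directory if needed
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
out_file <- file.path(out_dir, "airquality.csv")

# Write the data without row names, then read it back to confirm the round trip
write.csv(aq_raw, out_file, row.names = FALSE)
airquality_saved <- read.csv(out_file)

dim(airquality_saved)
```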

Exploring dataset properties

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
mean(airquality$Temp)
[1] 77.88235
mean(airquality[,4])
[1] 77.88235
median(airquality$Temp)
[1] 79
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154
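
The head() output above shows NA values in Ozone and Solar.R. Temp and Wind have no missing values, which is why the calls above work as written. For a column with missing values, the same functions return NA unless na.rm = TRUE is set. The sketch below illustrates that and adds a per-month summary with base R's aggregate(). The names aq_raw and monthly_means are illustrative.

```r
# Fresh copy so the example does not depend on earlier changes
aq_raw <- datasets::airquality

# Ozone contains missing values, so the plain mean is NA
mean(aq_raw$Ozone)

# Dropping the missing values gives a usable average
mean(aq_raw$Ozone, na.rm = TRUE)

# Mean temperature and wind speed for each month (5 = May, ..., 9 = September)
monthly_means <- aggregate(cbind(Temp, Wind) ~ Month, data = aq_raw, FUN = mean)
monthly_means
```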

Recode month values & explore summary

airquality$Month[airquality$Month == 5] <- "May"
airquality$Month[airquality$Month == 6] <- "June"
airquality$Month[airquality$Month == 7] <- "July"
airquality$Month[airquality$Month == 8] <- "August"
airquality$Month[airquality$Month == 9] <- "September"

summary(airquality$Month)
   Length     Class      Mode 
      153 character character 

Reordering the month levels

airquality$Month <- factor(airquality$Month,
                         levels=c("May", "June","July", "August",
                                  "September"))
# Character values are ordered alphabetically by default, so after recoding the months we set the level order explicitly.
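
As an alternative to the five separate assignments, the recoding and the ordering can be done in one step with R's built-in month.name vector. This is only a sketch on a fresh copy of the data (named aq here for illustration), not the approach used in the assignment.

```r
# Fresh copy; Month is still numeric (5 to 9) at this point
aq <- datasets::airquality

# month.name holds the twelve English month names in calendar order,
# so indexing it by the month number gives the name, and
# month.name[5:9] gives the levels in the order May to September
aq$Month <- factor(month.name[aq$Month], levels = month.name[5:9])

levels(aq$Month)
table(aq$Month)
```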

Plot 1: Histogram categorized by months

p1 <- airquality %>%
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity")+
  scale_fill_discrete(name = "Month", # create legend
                      labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept",
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")  #important to provide the data source

p1
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Plot 2, more cohesive

p2 <- airquality %>%
  ggplot(aes(x=Temp, fill=Month)) + 
  geom_histogram(position="identity", alpha=0.5, binwidth = 5, color = "white")+ # alpha defines transparency, binwidth defines width and color defines outlines
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")

p2

Here July stands out for having a high frequency of temperatures around 85 degrees. The dark purple color marks where months overlap, which is visible because of the transparency.
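
Because overlapping bars can be hard to separate by eye, one possible alternative is to give each month its own panel with facet_wrap(). The sketch below works on a fresh copy of the data. The names aq and p_facet are illustrative, and the axis labels are assumptions based on the dataset documentation, which records Temp in degrees Fahrenheit.

```r
library(ggplot2)

# Fresh copy with Month recoded as an ordered set of factor levels
aq <- datasets::airquality
aq$Month <- factor(month.name[aq$Month], levels = month.name[5:9])

# One histogram panel per month, stacked vertically for easy comparison
p_facet <- ggplot(aq, aes(x = Temp, fill = Month)) +
  geom_histogram(binwidth = 5, color = "white") +
  facet_wrap(~ Month, ncol = 1) +
  labs(x = "Temperature (degrees F)", y = "Count") +
  theme(legend.position = "none")  # panel titles already name the months

p_facet
```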

Plot 3, side-by-side boxplots categorized by month

p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September"))

p3

Plot 4: Side-by-side greyscale boxplots

p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) +
  labs(x = "Months from May through September", y = "Temperatures",
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot()+
  scale_fill_grey(name = "Month", labels = c("May", "June","July", "August", "September"))

p4

My own plot

??position
??scale_fill


p5 <- airquality %>%
  ggplot(aes(x=Wind, fill=Month)) + 
  geom_histogram(position="stack", alpha=0.5, binwidth = 3, color = "white")+
  scale_fill_brewer(palette="PuBuGn",name = "Month", labels = c("May", "June", "July", "August", "September")) +
  labs(x = "Monthly Wind Speed from May - Sept", 
       y = "Frequency of Wind Speed",
       title = "Histogram of Monthly Wind Speed from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")

p5

For this plot, I was mostly exploring the different elements and commands involved in the plots that Professor Saidi used earlier in this document. There are a lot of commands I don’t recognize, and I am trying to figure out what they do, when they should be used and how to use them effectively. I used ??position and ??scale_fill to search the documentation for these two terms.

My main goal was to learn how to change the colors of the months in the plot. I couldn’t figure out how to make the plot use colors I chose myself, and ??scale_fill didn’t help me very much, so I looked online and found an article from “Cookbook for R” explaining how scale_fill_brewer works. I ended up using a simple ColorBrewer palette.

I am not completely satisfied with the plot. I haven’t had the time I wanted to explore this assignment properly. I think the plot looks very pretty, but it’s difficult to read or understand what it’s trying to say.

Wind has been on my mind lately; earlier today, after being hit with a powerful gust of wind for the first time in a while, I was wondering which seasons are the windiest. The plot seems to indicate that June is the windiest of the five months in the data (May through September), with wind speeds of about 10 mph occurring more often than any other speed. Using a stacked histogram was interesting because it lets me see every month and every wind-speed bin at once. However, it’s confusing to read.
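
For hand-picked colors, ggplot2 provides scale_fill_manual(), which takes a values vector of color names or hex codes. The sketch below is one way it could be used on a fresh copy of the data. The hex codes and the names aq, month_colors and p_manual are illustrative choices, not part of the assignment. The last lines add a quick numeric check of mean wind speed per month, which may be easier to read than the stacked histogram.

```r
library(ggplot2)

# Fresh copy with Month recoded as an ordered set of factor levels
aq <- datasets::airquality
aq$Month <- factor(month.name[aq$Month], levels = month.name[5:9])

# Illustrative colors; naming the vector ties each color to a month
month_colors <- c(May = "#1b9e77", June = "#d95f02", July = "#7570b3",
                  August = "#e7298a", September = "#66a61e")

p_manual <- ggplot(aq, aes(x = Wind, fill = Month)) +
  geom_histogram(position = "stack", binwidth = 3, color = "white") +
  scale_fill_manual(name = "Month", values = month_colors) +
  labs(x = "Wind speed (mph)", y = "Count")

p_manual

# Numeric check: average wind speed for each month
wind_by_month <- tapply(aq$Wind, aq$Month, mean)
wind_by_month
```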

External sources used:

Cookbook for R. (n.d.). Colors (ggplot2). Cookbook for R. http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/.