Airquality HW

Author

R Kwan

Load the library

library(tidyverse)

Load the dataset into your global environment

data("airquality")

Look at the first few rows of the data

head(airquality)

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Rename the Months from number to names

airquality$Month[airquality$Month == 5] <- "May"
airquality$Month[airquality$Month == 6] <- "June"
airquality$Month[airquality$Month == 7] <- "July"
airquality$Month[airquality$Month == 8] <- "August"
airquality$Month[airquality$Month == 9] <- "September"

Now look at the summary of the Month column (it is now a character vector, so summary() reports only its length, class, and mode)

summary(airquality$Month)

   Length     Class      Mode 
      153 character character

Convert Month to a factor so the months keep their calendar order

airquality$Month <- factor(airquality$Month,
                           levels = c("May", "June", "July", "August",
                                      "September"))
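The renaming and factor conversion above could also be done in a single step. The following is a minimal sketch, not the approach used in this assignment: it works on a fresh copy of the dataset (the name `aq` is only for illustration) and relies on base R's built-in `month.name` constant.

```r
# Work on a fresh copy so the original airquality object is left untouched
aq <- datasets::airquality

# Map the month numbers 5..9 to month names and fix their order in one call
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# The levels follow calendar order rather than alphabetical order
levels(aq$Month)

# For a factor, summary() reports the number of rows per level
summary(aq$Month)
```

Because the result is a factor rather than a character vector, summary() returns a count per month instead of the Length/Class/Mode output shown earlier.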

Plot 1 Code

p1 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity")+
  scale_fill_discrete(name = "Month", 
                      labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")  #provide the data source

Plot 1 Output

p1

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
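The message above asks for an explicit bin width. One way to choose it is to look at the range of the temperatures and see how many bins a given width produces. The sketch below is only an illustration: the variable names are hypothetical, and the Freedman-Diaconis rule is a common rule of thumb included for comparison, not something this assignment requires.

```r
# Temperatures from a fresh copy of the built-in dataset
temps <- datasets::airquality$Temp

# Span of the data that the bins have to cover
temp_range <- range(temps)

# Approximate number of bins at a chosen width of 5 degrees
binwidth <- 5
n_bins <- ceiling(diff(temp_range) / binwidth)

# Freedman-Diaconis rule of thumb: 2 * IQR / n^(1/3)
fd_width <- 2 * IQR(temps) / length(temps)^(1/3)

temp_range
n_bins
fd_width
```

The exact number of bars drawn by geom_histogram() can differ by one depending on how the bins are aligned, so n_bins is best read as an estimate.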

Plot 2 Code

p2 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity", alpha=0.5, binwidth = 5, color = "white")+
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
p2

Plot 3 Code

p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September"))
p3

Plot 4 Code

p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot()+
  scale_fill_grey(name = "Month", labels = c("May", "June","July", "August", "September"))
p4

Plot 5 Code

p5 <- airquality |>
  ggplot(aes(Month, Ozone, fill = Month)) +
  labs(x = "May - September", y = "Ozone levels",
       title = "Boxplot of Monthly Ozone Levels") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June", "July", "August", "September")) +
  coord_flip()
p5

Warning: Removed 37 rows containing non-finite outside the scale range
(`stat_boxplot()`).
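The warning refers to rows where Ozone is missing (NA), which the boxplot drops. A quick check, sketched here on a fresh copy of the dataset with illustrative variable names, counts those rows overall and per month.

```r
# Fresh copy with the original numeric month codes (5 = May ... 9 = September)
aq <- datasets::airquality

# Total number of rows with a missing Ozone reading
n_missing <- sum(is.na(aq$Ozone))

# Missing Ozone readings broken down by month
missing_by_month <- tapply(is.na(aq$Ozone), aq$Month, sum)

n_missing
missing_by_month
```

The total should match the 37 rows reported in the warning.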

This plot is a boxplot of ozone levels for the months May through September. It shows how the median, upper quartile, and lower quartile of ozone differ from month to month, and it marks the high outliers for each month. July has the highest median ozone level of any month in the graph, but August is close behind, with a higher upper quartile and the highest maximum value.

I used coord_flip() because the horizontal boxes are easier for the viewer to process, and I personally find them easier to read. Since the key on the right is vertical, it makes sense for the boxplots to be stacked vertically as well. You can clearly see that August has the largest outlier of all the boxplots, while May, June, and September are more compact and have a smaller range than July and August.
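The readings from the boxplot can be checked numerically. The sketch below is one possible way to do that, assuming a fresh copy of the dataset with numeric month codes; the name `ozone_stats` is hypothetical and the code is not part of the original assignment.

```r
# Fresh copy with the original numeric month codes
aq <- datasets::airquality

# Median, quartiles, and maximum of Ozone per month, ignoring missing values
ozone_stats <- sapply(split(aq$Ozone, aq$Month), function(x) {
  c(median = median(x, na.rm = TRUE),
    q1     = unname(quantile(x, 0.25, na.rm = TRUE)),
    q3     = unname(quantile(x, 0.75, na.rm = TRUE)),
    max    = max(x, na.rm = TRUE))
})

# Replace the month codes with month names for readability
colnames(ozone_stats) <- month.name[as.integer(colnames(ozone_stats))]

round(ozone_stats, 1)
```

Each column is a month and each row a statistic, so the month with the highest median or maximum can be read directly from the table.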