Air Quality Homework Assignment

Author

J Liang

Joyce Liang

Air Quality Tutorial and Homework Assignment

Source: cleanairpartners.net

Load in the Library

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

This data set contains daily air quality measurements recorded over five months, from May to September 1973. Its sources are the New York State Department of Conservation and the National Weather Service.

Load the data set into your global environment

Because airquality is a built-in data set, we can load it into the global environment with the “data” function and use it right away.

data("airquality")

Look at the structure of the data

In the global environment, click on the row with the air quality data set and it will take you to a “spreadsheet” view of the data.

View the data using the “head” function

The “head” function displays only the first 6 rows of the data set. Notice in the global environment to the right that the data set has 153 observations (rows).

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Notice that all the variables are classified as either integers or numeric (continuous) values.
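One way to confirm the variable types without opening the spreadsheet view is to ask R for the structure directly. The sketch below uses base R's str() and sapply(). The name aq is only illustrative; it holds an untouched copy taken from the datasets package so the check does not depend on any other chunk.

```r
# Work on a fresh copy so the check does not depend on earlier chunks
aq <- datasets::airquality

# Compact overview: number of observations plus the type of every column
str(aq)

# Class of each column as a named character vector
col_classes <- sapply(aq, class)
col_classes

# Number of rows and columns
dim(aq)
```

Wind is the only column stored as numeric; the other five are integers.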

Calculate Summary Statistics

Here are two different ways to calculate “mean”

mean(airquality$Temp)
[1] 77.88235
mean(airquality[,4])
[1] 77.88235

In the second way, the [row, column] index selects column 4, which is the Temp column. Leaving the row position empty means all rows are used.
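The same column can be extracted in a few other equivalent ways. The sketch below, which works on a fresh copy named aq (an illustrative name), checks that each form gives the same mean.

```r
aq <- datasets::airquality

by_dollar   <- mean(aq$Temp)        # column by name with $
by_position <- mean(aq[, 4])        # all rows, column 4
by_name     <- mean(aq[, "Temp"])   # all rows, column named "Temp"
by_double   <- mean(aq[["Temp"]])   # list-style extraction by name

# All four values are the same
c(by_dollar, by_position, by_name, by_double)
```

Referring to a column by name tends to be less fragile than referring to it by position, because the code keeps working if the column order changes.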

Calculate Median, Standard Deviation, and Variance

median(airquality$Temp)
[1] 79
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154
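Temp and Wind have no missing values, but Ozone does (see the NA entries in the “head” output). For a column with missing values these functions return NA unless they are told to drop them. A minimal sketch using the na.rm argument, again on a fresh copy with the illustrative name aq:

```r
aq <- datasets::airquality

# With missing values present, the result is NA
mean(aq$Ozone)

# Count the missing values in the column
n_missing <- sum(is.na(aq$Ozone))
n_missing

# Drop the missing values before averaging
ozone_mean <- mean(aq$Ozone, na.rm = TRUE)
ozone_mean

# The variance is the square of the standard deviation
all.equal(var(aq$Wind), sd(aq$Wind)^2)
```

The same na.rm = TRUE argument works for median(), sd(), and var().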

Rename the Months from number to names

There are MANY ways to do this. Here is one way to convert numbers 5 - 9 to May through September.

airquality$Month[airquality$Month == 5] <- "May"
airquality$Month[airquality$Month == 6] <- "June"
airquality$Month[airquality$Month == 7] <- "July"
airquality$Month[airquality$Month == 8] <- "August"
airquality$Month[airquality$Month == 9] <- "September"
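As one alternative, base R ships a built-in vector, month.name, that holds the twelve English month names, so the month number can be used as an index into it. This sketch works on a fresh copy named aq (an illustrative name) so it does not interfere with the chunk above.

```r
aq <- datasets::airquality

# month.name[5] is "May", month.name[9] is "September"
aq$Month <- month.name[aq$Month]

# Month is now a character vector with five distinct values
class(aq$Month)
unique(aq$Month)
```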

Now look at the summary statistics of the dataset

See how Month has changed to have characters instead of numbers (it is now classified as “character” rather than “integer”)

summary(airquality$Month)
   Length     Class      Mode 
      153 character character 

Month is a categorical variable. R stores categorical variables as factors, and the distinct categories of a factor are called levels.

This is one way to reorder the months so they do not default to alphabetical order. The plot chunks below also list the month names in the labels argument of the fill scale; those labels are applied in the order of the factor levels set here.

airquality$Month <- factor(airquality$Month,
                           levels = c("May", "June", "July", "August", "September" ))
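To see what the levels argument changes, the sketch below builds the factor twice from the same character values: once with the default ordering and once with explicit levels. The object names are illustrative, and the data come from a fresh copy.

```r
month_names <- c("May", "June", "July", "August", "September")

aq <- datasets::airquality
aq$Month <- month.name[aq$Month]

# Default: levels are sorted alphabetically
default_order <- factor(aq$Month)
levels(default_order)

# Explicit levels: calendar order is preserved
calendar_order <- factor(aq$Month, levels = month_names)
levels(calendar_order)

# Counts per month now print in calendar order
table(calendar_order)
```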

Plot 1: Create a histogram categorized by Month

p1 <- airquality |>
  ggplot(aes(x = Temp, fill = Month)) +
  geom_histogram(position ="identity") +
  scale_fill_discrete(name = "Month", labels = c("May", "June", "July", "August", "September")) +
  labs (x = "Monthly Temperatures from May - Sept", 
        y = "Frequency of Temps",
        title = "Histogram of Monthly Temperatures from May - Sept, 1973", 
        caption = "New York State Department of Conservation and the National Weather Service")
p1
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Plot Output

Is this plot useful in answering questions about monthly temperature values?

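The plot provides some information, but it is neither easy to read nor pleasing to look at. Because the bars are drawn with position = "identity" and no transparency, the months cover one another, and the default of 30 bins makes the bars very narrow.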

Plot 2: Improve the histogram of Temperature by Month

p2 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position ="identity", alpha=0.5, binwidth = 5, color = "white")+
  scale_fill_discrete(name= "Month", labels = c("May", "June", "July", "August","September")) +
  labs(x= "Monthly Temperatures from May- Sept", 
       y= "Frequency of Temps", 
       title="Histogram of Monthly Temperatures from May - Sept, 1973",
        caption = "New York State Department of Conservation and the National Weather Service")
p2

Plot 2 Output

Did this improve the readability of the plot?

Plot 2 is easier to read, but the overlapping semi-transparent colors still make it hard to tell the months apart where their distributions overlap.
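One possible way around the overlap, offered here only as a sketch and not as part of the original assignment, is to give each month its own panel with facet_wrap(). It assumes ggplot2 is installed and rebuilds the Month factor from a fresh copy of the data; the names aq and p_facet are illustrative.

```r
library(ggplot2)

aq <- datasets::airquality
aq$Month <- factor(month.name[aq$Month],
                   levels = c("May", "June", "July", "August", "September"))

# One panel per month, so no bars are drawn on top of each other
p_facet <- ggplot(aq, aes(x = Temp, fill = Month)) +
  geom_histogram(binwidth = 5, color = "white", show.legend = FALSE) +
  facet_wrap(~ Month, ncol = 1) +
  labs(x = "Temperature",
       y = "Frequency of Temps",
       title = "Histogram of Temperatures by Month, May - Sept, 1973")
p_facet
```

Stacking the panels in one column keeps a shared temperature axis, which makes it easier to compare where each month's distribution sits.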

Plot 3: Create side-by-side boxplots categorized by Month

p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) +
  labs( x = "Month from May through September", y = "Temperatures", 
        title = "Side-by-Side Boxplot of Monthly Temperatures",
        caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June", "July", "August", "September"))
p3

Plot 4: Side by Side Boxplots in Gray Scale

p4 <- airquality |> 
  ggplot(aes(Month, Temp, fill= Month)) +
  labs(x= "Month from May through September", y = "Temperatures",
       title = "Side-by-Side Boxplot of Monthly Temperatures", 
       caption = "New York State Department of Conservation and the National Weather Service")+
  geom_boxplot()+ 
  scale_fill_grey(name = "Month", labels = c("May","June","July","August","September"))
p4

Plot 5: Side-by-Side Boxplots of Monthly Ozone Levels

p5 <- airquality |>
  ggplot(aes(Month, Ozone, fill = Month)) +
  labs(x = "Month", y = "Ozone Levels",
       title = "Side-by-Side Boxplot of Monthly Ozone Levels",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +  # Tried geom_point() here; decided against it
  scale_fill_discrete( name= "Month", labels = c("May", "June", "July", "August", "September"))
p5
Warning: Removed 37 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Brief Essay

The plot I chose is the box plot. I prefer it because it depicts the five-number summary (minimum, lower quartile, median, upper quartile, and maximum) and also marks outliers as individual points. From the box plot, readers can see that the median ozone level in July is higher than in May, June, August, and September. I did not use any special code to make my plot. I did try the geom_point function, but I did not like the result: it only displayed the data as points in a straight line above each month, so I went with the box plot.
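The numbers behind each box can also be computed directly. The sketch below uses base R's tapply() with fivenum() and median() on a fresh copy of the data (so Month still holds the numbers 5 to 9), dropping the missing Ozone values. The object names are illustrative. Note that fivenum() returns Tukey's hinges, which can differ slightly from the quartiles that geom_boxplot() draws.

```r
aq <- datasets::airquality

# Five-number summary of Ozone for each month (names 5..9 are month numbers)
ozone_summary <- tapply(aq$Ozone, aq$Month, fivenum, na.rm = TRUE)
ozone_summary

# Median Ozone per month, the line inside each box
ozone_median <- tapply(aq$Ozone, aq$Month, median, na.rm = TRUE)
ozone_median

# Month number with the highest median
names(which.max(ozone_median))
```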