Airquality Homework Tutorial

Author

A. Diaz-Nova

load in the library

# The airquality dataset ships with base R, so no extra package is needed for the data; tidyverse (installed once beforehand) provides ggplot2 for the plots
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

load the dataset into your global environment

# The New York State Department of Conservation and the National Weather Service worked in tandem to record daily data over a five-month period (May-Sept) in 1973
data("airquality")

look at the structure of the data

# We will use the head function here, but be careful: it only displays the first 6 rows of the dataset
head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
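Since head() only shows six rows, a fuller look at the structure can help. The lines below are a minimal sketch using base R only; the name na_counts is just an illustrative variable, not part of the assignment.

```r
# Load the built-in dataset
data("airquality")

# str() lists every column with its type and first few values
str(airquality)

# dim() gives the number of rows and columns
dim(airquality)

# Count the missing values (NA) in each column
na_counts <- colSums(is.na(airquality))
na_counts
```

The NA entries visible in rows 5 and 6 of the head() output are the kind of missing values that na_counts tallies for the whole dataset.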

calculate summary statistics

# Now that the data is loaded, compute a few specific statistics to get a grasp of it
mean(airquality$Temp)
[1] 77.88235

calculate median, standard deviation, and variance

median(airquality$Temp)
[1] 79
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154
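Temp and Wind have no missing values, so the calls above work directly. Columns such as Ozone do contain NA values (see the head() output), and there these functions return NA unless told to skip them. The sketch below works on a copy named aq (an illustrative name) and also checks that the variance is the square of the standard deviation.

```r
# Work on a fresh copy of the built-in data
aq <- datasets::airquality

# Without na.rm, a single NA makes the result NA
mean(aq$Ozone)

# na.rm = TRUE drops missing values before computing
ozone_mean <- mean(aq$Ozone, na.rm = TRUE)
ozone_sd   <- sd(aq$Ozone, na.rm = TRUE)
ozone_mean
ozone_sd

# The variance equals the standard deviation squared
all.equal(var(aq$Wind), sd(aq$Wind)^2)
```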

rename the months from number to names

# (Note to self - This may look confusing, but the numbers 5-9 represent May through Sept; assigning a name turns the whole column into character)
airquality$Month[airquality$Month == 5] <- "May"
airquality$Month[airquality$Month == 6] <- "June"
airquality$Month[airquality$Month == 7] <- "July"
airquality$Month[airquality$Month == 8] <- "August"
airquality$Month[airquality$Month == 9] <- "September"

now look at the summary statistics of the dataset

# Check whether Month has changed from numbers to characters; if so, the class and mode should both read character
summary(airquality$Month)
   Length     Class      Mode 
      153 character character 

month is a categorical variable; R calls these factors, and their categories are called levels

# Setting the levels explicitly keeps the months in calendar order instead of alphabetical order
airquality$Month <- factor(airquality$Month, levels = c("May", "June", "July", "August", "September"))
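The renaming and the factor conversion can also be done in a single step. This is only a sketch of an alternative, not the tutorial's method: it starts from a fresh copy (aq, an illustrative name) where Month is still numeric, and uses the base R constant month.name for the labels.

```r
# Fresh copy in which Month is still the numbers 5-9
aq <- datasets::airquality

# levels = the original codes, labels = the names to show, in calendar order
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# Confirm the order of the levels and count the days in each month
levels(aq$Month)
table(aq$Month)
```

Because the labels are attached to the numeric codes directly, the column never passes through the character stage.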

plot 1: create a histogram categorized by month

# The first plot shows the distribution of temperatures across all months. Remember to take notes about the histogram, for example where the median temp falls
p1 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity")+
  scale_fill_discrete(name = "Month", 
                      labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")  #provide the data source
p1
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# scale_fill_discrete sets the title and labels of the fill legend (shown to the right of the plot)

think critically: Is this plot useful in answering questions about monthly temperature values?

# With position = "identity", each month's bars are drawn on top of one another, so some bars hide others. That makes the plot hard to read and hard to draw concrete inferences from

plot 2: Improve the histogram using ggplot

# Outline the bars in WHITE using the color = "white"
# Use alpha to add some transparency
# Change the binwidth
p2 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity", alpha=0.5, binwidth = 5, color = "white")+
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
p2

# The transparency and white borders change the look of the histogram bars so that overlapping months stay visible.

Did the new adjustments improve the readability of the plot, yes or no?

# Yes. The transparency lets overlapping months show through, the white outlines separate neighboring bars, and the wider 5-degree bins replace the 30 default bins, so each month's distribution is easier to follow
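Another option for overlapping months, not part of the assignment, is to give each month its own panel. The sketch below assumes ggplot2 is installed and uses a fresh copy of the data; aq and p_facet are illustrative names.

```r
library(ggplot2)

# Fresh copy with Month as a factor in calendar order
aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

p_facet <- ggplot(aq, aes(x = Temp, fill = Month)) +
  # The legend is redundant because every panel is titled with its month
  geom_histogram(binwidth = 5, color = "white", show.legend = FALSE) +
  # One panel per month, stacked vertically so the x-axes line up
  facet_wrap(~ Month, ncol = 1) +
  labs(x = "Temperature", y = "Frequency of Temps",
       title = "Histogram of Temperatures by Month, May - Sept, 1973")
p_facet
```

With separate panels no bar can hide another, at the cost of a taller figure.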

Plot 3: create side by side boxplots categorized by Month

# Note that August shows some of the highest temperatures compared to the other months
p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September"))
p3 

Take note of any outliers that are present, such as those shown for June and July
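To see which values sit beyond the whiskers, base R's boxplot.stats() can list them per month. This is a rough cross-check rather than an exact reproduction of the plot: boxplot.stats() and geom_boxplot() compute the box edges slightly differently, so the flagged points may not match in every case. The names aq and outliers_by_month are illustrative.

```r
# Fresh copy where Month is still numeric (5 = May, ..., 9 = September)
aq <- datasets::airquality

# For each month, return the Temp values lying more than 1.5 * IQR beyond the box
outliers_by_month <- tapply(aq$Temp, aq$Month, function(x) boxplot.stats(x)$out)
outliers_by_month
```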

Plot 4: Make the same side by side boxplots, but in grey-scale

# Be sure to use the scale_fill_grey for the grey-scale legend, and again, use fill=Month in the aesthetics
p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot()+
  scale_fill_grey(name = "Month", labels = c("May", "June","July", "August", "September"))
p4

# All we did was change the color palette to grey-scale using scale_fill_grey
# Remember to include both scale_fill_grey and fill = Month (capital M, since R is case-sensitive) so the boxes and the legend share the grey-scale palette

Now make one plot on your own of any of the variables in this dataset. Any kind of plot is fine, but be sure to write a brief essay describing it in full, explicit detail

mean(airquality$Wind)
[1] 9.957516
median(airquality$Wind)
[1] 9.7
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154
p5 <- airquality |>
  ggplot(aes(Month, Wind, fill = Month)) + 
  # binwidth is a histogram argument that geom_boxplot ignores with a warning, so it is left out
  geom_boxplot(alpha = 0.8, color = "black") + 
  scale_fill_discrete(name = "Month", labels = c("May", "June", "July", "August", "September")) + 
  labs(x = "Months from May through September", y = "Wind (mph)",
       title = "Reports on Monthly Wind from May-Sept",
       caption = "New York State Department of Conservation and the National Weather Service") 
p5

# This plot takes a deeper look at how monthly wind can be depicted through the lens of a box plot. Functions such as labs and ggplot, to name a few, are key to making this plot stand out and stay readable. Note that June shows the most extreme maximum and minimum, which is interesting. What this plot makes me wonder about is May, because of its reported high winds. On which days did they occur? How long did they last?
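The questions about May can be explored directly in the data. The following is a minimal sketch, assuming that "high wind" simply means the five windiest days of the month; that cutoff, and the names aq, may, and top_may_wind, are illustrative choices.

```r
# Fresh copy where Month is still numeric (5 = May)
aq <- datasets::airquality

# Keep only the May rows and the columns of interest
may <- aq[aq$Month == 5, c("Month", "Day", "Wind")]

# Sort from windiest to calmest and keep the top five days
top_may_wind <- head(may[order(may$Wind, decreasing = TRUE), ], 5)
top_may_wind
```

Looking at the Day column of the result shows whether the windy days fall next to each other or are spread across the month.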