Airquality

Author

SenayL

Load in the library

Load the tidyverse library to access dplyr and ggplot2.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
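
The conflict notice above means that a bare filter() call now runs the dplyr version rather than the stats version. As a minimal sketch, either function can still be called explicitly with the :: operator. The object names hot_days and smoothed are illustrative only and do not appear elsewhere in this document.

```r
# dplyr::filter() keeps the rows of a data frame that match a condition.
hot_days <- dplyr::filter(datasets::airquality, Temp > 90)

# stats::filter() is unrelated: it applies a linear filter to a series,
# here a centered 3-point moving average.
smoothed <- stats::filter(1:5, rep(1/3, 3))

nrow(hot_days)
smoothed
```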

Load the dataset into your global environment

Because airquality is a built-in dataset that ships with R, there is no file to read. Calling data() loads a copy of it into the global environment, where we can inspect and modify it.

data("airquality")

View the data using the “head” function

By default, the head function displays only the first 6 rows of the data set. Notice that the global environment pane to the right lists 153 observations (rows).

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
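
The row count can also be confirmed in code, which helps when working outside an IDE with an environment pane (the reference to a pane "to the right" suggests RStudio, but that is an assumption). A short sketch using base R only:

```r
# Number of rows and columns, without relying on an environment pane.
dim(airquality)
nrow(airquality)

# Column names and types in one compact listing.
str(airquality)

# head() accepts a second argument for the number of rows to show.
head(airquality, 3)
```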

Calculate Summary Statistics

Here are 2 different ways to calculate the mean of Temp: by column name, and by column position (Temp is the fourth column).

mean(airquality$Temp)
[1] 77.88235
mean(airquality[,4])
[1] 77.88235
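
Both calls return the same value because they extract the same column. A small sketch that checks this directly and shows a third, string-based form; the variable names are illustrative:

```r
temp_by_name     <- airquality$Temp        # by column name with $
temp_by_position <- airquality[, 4]        # by column position
temp_by_string   <- airquality[["Temp"]]   # by column name with [[ ]]

# All three are the same numeric vector.
identical(temp_by_name, temp_by_position)
identical(temp_by_name, temp_by_string)
```

Selecting by name is usually the safer choice, since a position such as 4 silently points at a different variable if the columns are ever reordered.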

Calculate Median, Standard Deviation, and Variance

median(airquality$Temp)
[1] 79
sd(airquality$Wind)
[1] 3.523001
var(airquality[,4])
[1] 89.59133
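
Temp and Wind have no missing values, so the calls above work as written. Ozone and Solar.R do contain NA values (see rows 5 and 6 of the head output), and for those columns the same functions return NA unless told to skip the missing entries. A minimal sketch:

```r
# Ozone has missing values, so the plain mean is NA.
mean(airquality$Ozone)

# na.rm = TRUE drops the missing values before computing.
mean(airquality$Ozone, na.rm = TRUE)
median(airquality$Ozone, na.rm = TRUE)

# The standard deviation is the square root of the variance.
all.equal(sd(airquality$Temp), sqrt(var(airquality$Temp)))
```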

Rename the months from number to names

Sometimes we prefer the months to be numerical, but here we need them as month names. There are MANY ways to do this. Here is one way to convert the numbers 5 through 9 to May through September.

airquality$Month[airquality$Month == 5]<- "May"
airquality$Month[airquality$Month == 6]<- "June"
airquality$Month[airquality$Month == 7]<- "July"
airquality$Month[airquality$Month == 8]<- "August"
airquality$Month[airquality$Month == 9]<- "September"

Now look at the summary statistics of the dataset

summary(airquality$Month)
   Length     Class      Mode 
      153 character character 
summary(airquality$Wind)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.700   7.400   9.700   9.958  11.500  20.700 

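Month is a categorical variable. R stores categorical variables as factors, and the categories of a factor are called levels. Listing the levels explicitly keeps the months in calendar order rather than alphabetical order.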

airquality$Month<-factor(airquality$Month, 
                         levels=c("May", "June","July", "August",
                                  "September"))
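
As one of the many alternatives mentioned earlier, the renaming and the factor conversion can be combined in a single factor() call. This is only a sketch: it works on a fresh copy named aq (a name introduced here for illustration) because it needs the original month numbers.

```r
# Start from a fresh copy so the month numbers are still 5 to 9.
aq <- datasets::airquality

# levels are the values stored in the data; labels are the names shown for them.
# month.name is a built-in vector of the twelve English month names.
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

levels(aq$Month)
table(aq$Month)   # days recorded per month, in calendar order
```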

Plot 1: Histogram categorized by Month

p1 <- airquality |>
  ggplot(aes(x=Temp, fill = Month)) + 
  geom_histogram(position = "identity") +
  scale_fill_discrete(name = "Month",
                       labels = c("May", "June","July", "August", "September")) + 
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
p1
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Plot 2: Improve the histogram of daily temperatures by month

p2 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity", alpha=0.6, binwidth = 5, color = "black")+
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
p2

Here July stands out for its high frequency of temperatures around 85 degrees. The dark purple color shows where the bars of different months overlap; it is visible because alpha = 0.6 makes the bars semi-transparent.
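
Overlapping bars can be hard to read even with transparency. One possible alternative, sketched here rather than taken from the original analysis, is to draw one panel per month with facet_wrap(). The names aq and p_facets and the label text are illustrative.

```r
library(ggplot2)

# Fresh copy with Month as an ordered-by-calendar factor.
aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# One panel per month, stacked vertically so the shared x-axis lines up.
p_facets <- ggplot(aq, aes(x = Temp, fill = Month)) +
  geom_histogram(binwidth = 5, color = "black") +
  facet_wrap(~ Month, ncol = 1) +
  labs(x = "Temperature (degrees F)", y = "Number of days",
       title = "Daily Temperatures by Month, May - Sept, 1973") +
  theme(legend.position = "none")  # the panel titles already name the months
p_facets
```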

Plot 3: Create side-by-side boxplots categorized by Month

p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() 
p3

Plot 4: Side-by-Side Boxplots in Grayscale

Use the scale_fill_grey command to draw the boxes and the legend in shades of grey, and again use fill = Month in the aesthetics.

p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) +
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_grey(name = "Month", labels = c("May", "June","July", "August", "September"))
p4
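
The range of greys can be adjusted with the start and end arguments of scale_fill_grey(), which take values between 0 (black) and 1 (white). The sketch below uses lighter shades so the median lines stay visible; the values 0.4 and 0.9 and the names aq and p_grey are arbitrary choices for illustration.

```r
library(ggplot2)

# Fresh copy with Month as a factor in calendar order.
aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

p_grey <- ggplot(aq, aes(Month, Temp, fill = Month)) +
  geom_boxplot() +
  # start and end are grey levels from 0 (black) to 1 (white);
  # the defaults are 0.2 and 0.8.
  scale_fill_grey(name = "Month", start = 0.4, end = 0.9)
p_grey
```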

Plot 5: Side-by-Side Boxplots of Wind Speed by Month

p5 <- airquality |>
  ggplot(aes(Month, Wind, fill = Month)) + 
  labs(x = "Months from May through September", y = "Wind Speeds", 
       title = "Side-by-Side Boxplot of Wind speeds",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September"))
p5

This data visualization is a box plot of wind speeds for the months May through September. To make it, I did not use any special code; I reused the earlier box plot code and replaced the variable “Temp” with “Wind”. The box plot shows two outliers for wind speed in June, and June had both the lowest and the highest recorded wind speed. May had a greater median than any other month. Another insight from this box plot is that the median wind speeds for July and August are the same.
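
The readings taken from the box plot can be checked numerically. The following is a sketch using base R on a fresh copy of the data; the names aq, wind_medians, june, and june_outliers are illustrative. The outlier check applies the 1.5 x IQR rule that geom_boxplot() uses by default.

```r
# Fresh copy with Month as a factor in calendar order.
aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# Median wind speed for each month.
wind_medians <- tapply(aq$Wind, aq$Month, median)
wind_medians

# Months in which the overall minimum and maximum wind speeds occur.
aq$Month[which.min(aq$Wind)]
aq$Month[which.max(aq$Wind)]

# June values more than 1.5 * IQR beyond the quartiles, i.e. the points
# a default box plot draws individually as outliers.
june <- aq$Wind[aq$Month == "June"]
q <- unname(quantile(june, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
june_outliers <- june[june < q[1] - fence | june > q[2] + fence]
june_outliers
```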