Airquality

Author

Linh Le

Airquality Tutorial and Homework Assignment

Load in the Dataset

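Load the tidyverse package, which provides ggplot2 and the %>% pipe used in the plots below. The airquality dataset itself ships with base R (in the datasets package), so it is available without tidyverse.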

# install.packages("tidyverse")
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.3.0      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Load the dataset into your global environment

airquality <- airquality

Look at the structure of the data

str(airquality)
'data.frame':   153 obs. of  6 variables:
 $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
 $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
 $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
 $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
 $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

Calculating Summary Statistics

To look at specific statistics, you can refer to a column in more than one way. Here are 2 different ways to calculate the mean: by column name, and by column position (Temp is the fourth column).

mean(airquality$Temp)
[1] 77.88235
mean(airquality[, 4])
[1] 77.88235

Calculate Median, Standard Deviation, and Variance

median(airquality$Temp)
[1] 79
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154
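
The columns used above have no missing values, but Ozone and Solar.R do (see the NA entries in the str() output). As a minimal sketch, assuming a fresh copy of the dataset stored under the hypothetical name aq, this shows how the na.rm argument changes the result:

```r
# aq is a hypothetical name for a fresh, unmodified copy of the dataset
aq <- datasets::airquality

# With missing values present, mean() returns NA by default
ozone_default <- mean(aq$Ozone)
ozone_default

# na.rm = TRUE drops the missing values before averaging
ozone_mean <- mean(aq$Ozone, na.rm = TRUE)
ozone_mean

# Count how many Ozone values are missing
ozone_missing <- sum(is.na(aq$Ozone))
ozone_missing
```

The same na.rm = TRUE argument works for median(), sd(), and var().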

Change the Months from 5 - 9 to May through September

airquality$Month[airquality$Month == 5] <- "May"
airquality$Month[airquality$Month == 6] <- "June"
airquality$Month[airquality$Month == 7] <- "July"
airquality$Month[airquality$Month == 8] <- "August"
airquality$Month[airquality$Month == 9] <- "September"
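
As an alternative sketch (not the approach this tutorial follows), the same recoding can be done in one step with factor(), which maps the numeric codes to labels and fixes their order at the same time. The name aq is a hypothetical fresh copy of the dataset, and month.name is a constant built into base R:

```r
# aq is a hypothetical name for a fresh, unmodified copy of the dataset
aq <- datasets::airquality

# Map the codes 5..9 to month names; the levels keep calendar order
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

levels(aq$Month)   # the five month names, May first
table(aq$Month)    # number of daily observations per month
```

With this approach Month becomes a factor directly, rather than a character column.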

Look at the structure and summary statistics of the dataset, and see how Month now holds character strings instead of numbers

str(airquality)
'data.frame':   153 obs. of  6 variables:
 $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
 $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
 $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
 $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
 $ Month  : chr  "May" "May" "May" "May" ...
 $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
summary(airquality)
     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       
    Month                Day      
 Length:153         Min.   : 1.0  
 Class :character   1st Qu.: 8.0  
 Mode  :character   Median :16.0  
                    Mean   :15.8  
                    3rd Qu.:23.0  
                    Max.   :31.0  
                                  
head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67   May   1
2    36     118  8.0   72   May   2
3    12     149 12.6   74   May   3
4    18     313 11.5   62   May   4
5    NA      NA 14.3   56   May   5
6    28      NA 14.9   66   May   6

Month is a categorical variable. In R, a categorical variable is stored as a factor, and its categories are called levels.

Reorder the Months so they do not default to alphabetical

airquality$Month <- factor(airquality$Month, levels = c("May", "June", "July", "August", "September"))
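
To see why the levels argument matters, here is a small sketch comparing the default level order with an explicit one. The vector month_names and the two result names are hypothetical and used only for illustration:

```r
# Hypothetical example vector holding the five month names
month_names <- c("May", "June", "July", "August", "September")

# Without levels, factor() sorts the levels alphabetically
default_order <- levels(factor(month_names))
default_order

# With levels, the order is exactly the one supplied
custom_order <- levels(factor(month_names, levels = month_names))
custom_order
```

ggplot2 uses the level order for legends and discrete axes, so the explicit order keeps the months in calendar sequence in the plots.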

Plot 1: Create a histogram categorized by Month with qplot

qplot() stands for “quick plot” and is part of the ggplot2 package. It was deprecated in ggplot2 3.4.0, which is why the warning appears below.

p1 <- qplot(data = airquality, Temp, fill = Month, geom = "histogram", bins = 20)
Warning: `qplot()` was deprecated in ggplot2 3.4.0.
p1
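
Because qplot() is deprecated, it may help to see roughly the same histogram written with ggplot(). This is a sketch under the assumption that aq is a fresh copy of the dataset with Month already converted to an ordered factor; the names aq and p1_alt are hypothetical:

```r
library(ggplot2)

# aq is a hypothetical name for a fresh copy with Month as a factor
aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# Same mapping as the qplot() call: Temp on x, fill by Month, 20 bins
p1_alt <- ggplot(aq, aes(x = Temp, fill = Month)) +
  geom_histogram(bins = 20)
p1_alt
```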

Plot 2: Make a histogram using ggplot

p2 <- airquality %>%
  ggplot(aes(x = Temp, fill = Month)) +
  geom_histogram(position = "identity", alpha = 0.5, binwidth = 5, color = "white") +
  scale_fill_discrete(name = "Month", labels = c("May", "June", "July", "August", "September"))
p2

Plot 3: Create side-by-side boxplots categorized by Month

Setting fill = Month in the aesthetics fills each boxplot with a different color.

scale_fill_discrete() sets the title and labels of the legend for the discrete fill colors.

Side by Side Boxplots of Temperature by Month

p3 <- airquality %>%
  ggplot(aes(Month, Temp, fill = Month)) +
  ggtitle("Temperatures") +
  xlab("Month") +                     # x-axis shows the month
  ylab("Temperature (degrees F)") +   # y-axis shows temperature, not a frequency
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June", "July", "August", "September"))
p3

Plot 4: Make the same side-by-side boxplots, but in grey-scale

Use scale_fill_grey() for the grey-scale fill and legend, and again set fill = Month in the aesthetics.

Side by Side Boxplots in Grey Scale

p4 <- airquality %>%
  ggplot(aes(Month, Temp, fill = Month)) +
  ggtitle("Monthly Temperature Variations") +
  xlab("Month") +                     # x-axis shows the month
  ylab("Temperature (degrees F)") +   # y-axis shows temperature, not a frequency
  geom_boxplot() +
  scale_fill_grey(name = "Month", labels = c("May", "June", "July", "August", "September"))
p4

Plot 5: Make boxplots of solar radiation by Month using ggplot

p5 <- airquality %>%
  ggplot(aes(Solar.R, Month, fill = Month)) +
  ggtitle("Monthly Solar Radiation") +
  xlab("Solar.R") +
  ylab("Month") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June", "July", "August", "September"))
p5
Warning: Removed 7 rows containing non-finite values (`stat_boxplot()`).
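
The warning comes from the 7 missing Solar.R values reported by summary(). One way to make that removal explicit, sketched here with the hypothetical names aq, aq_solar, and p5_alt, is to filter out the missing rows before plotting:

```r
library(dplyr)
library(ggplot2)

# aq is a hypothetical name for a fresh copy with Month as a factor
aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# Keep only the rows where Solar.R is present
aq_solar <- aq %>% filter(!is.na(Solar.R))

# Number of rows dropped by the filter
rows_dropped <- nrow(aq) - nrow(aq_solar)
rows_dropped

# Same boxplot as Plot 5, drawn from the filtered data
p5_alt <- aq_solar %>%
  ggplot(aes(Solar.R, Month, fill = Month)) +
  geom_boxplot()
p5_alt
```

Filtering first keeps the count of excluded rows visible in the code instead of only in a warning.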

Brief essay: Plot 5 uses ggplot to create boxplots of solar radiation categorized by month, from May to September. Month is on the y-axis and sets the fill color, while solar radiation (Solar.R) is on the x-axis. Solar radiation appears to reach its highest value in May. I set fill = Month so that each boxplot has a different color, which makes the visualization of this dataset clearer. The warning reports that 7 rows with missing Solar.R values were removed before plotting.
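
A reading taken from a boxplot can be checked against a numeric summary. The following is a sketch, assuming a fresh copy of the dataset under the hypothetical name aq, and it reports the median and maximum Solar.R for each month:

```r
library(dplyr)

# aq is a hypothetical name for a fresh, unmodified copy of the dataset
aq <- datasets::airquality

# One row per month, ignoring missing Solar.R values
solar_by_month <- aq %>%
  group_by(Month) %>%
  summarise(
    median_solar = median(Solar.R, na.rm = TRUE),
    max_solar = max(Solar.R, na.rm = TRUE)
  )
solar_by_month
```

Comparing the two columns shows whether a month stands out because of a single high reading or because its typical values are higher.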