Airquality Homework

Author

Arif Jadji

Load in the Dataset

airquality is a dataset that ships with R.

Let's load the tidyverse.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the dataset into the global environment

airquality <- airquality

Look at the structure of the dataset

We use the head function to preview the data; by default it displays only the first 6 rows of the dataset.

head(airquality)

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
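The preview above already shows some NA values. As a minimal sketch of a fuller structure check, the base R functions below report column types, dimensions, and missing values per column. The names `aq` and `na_counts` are illustrative and not part of the original assignment.

```r
# Work on a copy so the original dataset is untouched (`aq` is an illustrative name)
aq <- datasets::airquality

str(aq)                          # column types and a preview of the values
dim(aq)                          # number of rows and columns
na_counts <- colSums(is.na(aq))  # count of missing values in each column
na_counts
```

Columns with a non-zero count in `na_counts` need `na.rm = TRUE` (or similar handling) before computing summary statistics.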

Calculate Summary Statistics

Here are two equivalent ways to calculate the mean of Temp: by column name and by column position (Temp is the fourth column).

mean(airquality$Temp)

[1] 77.88235

mean(airquality[,4])

[1] 77.88235
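Temp has no missing values, so mean works directly. For a column that does contain NA values, such as Ozone, a sketch of the usual handling looks like this. The variable names are illustrative.

```r
aq <- datasets::airquality

# mean() returns NA if any value is missing
ozone_raw <- mean(aq$Ozone)

# na.rm = TRUE drops the missing values before averaging
ozone_clean <- mean(aq$Ozone, na.rm = TRUE)

ozone_raw
ozone_clean
```

The same `na.rm` argument is accepted by median, sd, and var.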

Calculate the Median, Standard Deviation, and Variance

median(airquality$Temp)

[1] 79

sd(airquality$Wind)

[1] 3.523001

var(airquality$Wind)

[1] 12.41154
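The variance is the square of the standard deviation, which is why the two Wind results above are related (3.523001 squared is about 12.41154). A small sketch that checks this numerically, with illustrative variable names:

```r
wind <- datasets::airquality$Wind

wind_sd  <- sd(wind)   # standard deviation
wind_var <- var(wind)  # variance

# all.equal() compares with a small numeric tolerance instead of exact equality
same <- isTRUE(all.equal(wind_sd^2, wind_var))
same
```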

Change the months from numbers to names

Recode the numbers 5 through 9 as May through September.

airquality$Month[airquality$Month == 5] <- "May"
airquality$Month[airquality$Month == 6] <- "June"
airquality$Month[airquality$Month == 7] <- "July"
airquality$Month[airquality$Month == 8] <- "August"
airquality$Month[airquality$Month == 9] <- "September"

Let's look at the summary of the Month column. After recoding, it is a character vector.

summary(airquality$Month)

   Length     Class      Mode 
      153 character character

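Month is a categorical variable. R stores categorical variables as factors, and the distinct categories of a factor are called levels.

Convert Month to a factor with the levels in calendar order so they do not default to alphabetical order

airquality$Month <- factor(airquality$Month, levels = c("May", "June", "July", "August", "September"))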
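As an alternative sketch, the recoding and the ordering can be done in a single step by passing both `levels` and `labels` to factor. This starts from a fresh copy of the dataset (named `aq` here for illustration) and uses R's built-in `month.name` constant. It is not the approach used in this homework, only an equivalent shortcut.

```r
aq <- datasets::airquality

# levels = the original numeric codes, labels = the names to show, in calendar order
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

levels(aq$Month)  # level order follows `levels`, not the alphabet
table(aq$Month)   # number of observations in each month
```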

Plot 1: Histogram

Create a histogram of Temp, colored by month, with qplot. The name qplot stands for “quick plot”.

p1 <- qplot(data = airquality,Temp,fill = Month,geom = "histogram", bins = 20)

Warning: `qplot()` was deprecated in ggplot2 3.4.0.

p1
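Since qplot is deprecated as of ggplot2 3.4.0, here is a minimal sketch of the same histogram written with ggplot directly. It rebuilds the Month factor on a fresh copy so it runs on its own; the names `aq` and `p1_gg` are illustrative.

```r
library(ggplot2)

aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# Same mapping as the qplot call: Temp on x, fill by Month, 20 bins
p1_gg <- ggplot(aq, aes(x = Temp, fill = Month)) +
  geom_histogram(bins = 20)
p1_gg
```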

Plot 2: Histogram using ggplot

ggplot offers more control than qplot. Order the legend by calendar month rather than the default alphabetical order, and outline each bar in white with the color = “white” argument.

Histogram of Temperature by Month

p2 <- airquality %>%
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity", alpha=0.5, binwidth = 5, color = "white")+
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September")) +
  xlab("Monthly Temperatures") +
  ylab("Frequency") +
  ggtitle("Histogram of Monthly Temperatures")
p2

Plot 3: Side-by-side boxplots categorized by Month

Setting fill = Month in the aesthetics fills each boxplot with a different color. scale_fill_discrete controls the legend for the discrete fill colors. Use labs to add the title, axis labels, and a caption for the data source.

These are side-by-side boxplots of temperature by month.

p3 <- airquality %>%
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Month", y = "Temperature (°F)", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September"))
p3
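To read the boxplots numerically, the sketch below computes the median and quartiles of Temp for each month with dplyr. It works on a fresh copy of the data; the names `aq` and `temp_by_month` are illustrative. The quartiles here come from quantile with its default settings, which should correspond to the box edges that geom_boxplot draws.

```r
library(dplyr)

aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# One row per month: sample size, median, and first/third quartiles of Temp
temp_by_month <- aq %>%
  group_by(Month) %>%
  summarise(
    n = n(),
    q1 = quantile(Temp, 0.25),
    median_temp = median(Temp),
    q3 = quantile(Temp, 0.75)
  )
temp_by_month
```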

Plot 4: Side-by-side boxplot in grey-scale

Use scale_fill_grey for a grey-scale fill and legend, and keep fill = Month in the aesthetics.

p4 <- airquality %>%
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Month", y = "Temperature (°F)", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot()+
  scale_fill_grey(name = "Month", labels = c("May", "June","July", "August", "September"))
p4

Plot 5: My own histogram

p5 <- airquality %>%
  ggplot(aes(x = Wind, fill = Month)) +
  geom_histogram(position = "dodge", alpha = 0.7, binwidth = 5, color = "black") +
  scale_fill_discrete(name = "Month", labels = c("May", "June", "July", "August", "September")) +
  xlab("Monthly Winds") +
  ylab("Frequency") +
  ggtitle("Histogram of Monthly Winds")
p5

This histogram shows how often each range of wind speeds occurs in each month. A histogram suits this variable because each bar shows the frequency of observations within a particular interval, here 5 mph wide. Looking at the graph, the tallest bar is July at 10, so winds in that range occur most often in July. The second tallest bar is August at about 11. I noticed four bars that stand apart from the rest: May at about -2, July at 0, July at 20, and August at 21. I consider these outliers because they sit far from the other observations and represent extreme values. Note that with position = “dodge” the bars within one bin are drawn side by side, so a bar's x position is offset from the bin center and is not an exact wind speed. The bar at about -2 belongs to the bin centered at 0 (wind speed cannot be negative), and the bars at 10 and 11 belong to the same bin.
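Because dodged bar positions are hard to read as exact values, a quick numeric check can confirm which wind speeds are actually extreme. The sketch below uses base R's boxplot.stats, which flags values more than 1.5 times the interquartile range beyond the quartiles. This is one common convention for outliers, not necessarily the definition used above, and the variable names are illustrative.

```r
wind <- datasets::airquality$Wind

range(wind)  # smallest and largest observed wind speed (mph)

# Values beyond 1.5 * IQR from the hinges, by the boxplot convention
wind_out <- boxplot.stats(wind)$out
wind_out
```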

To make this plot, I used ggplot and changed the aesthetics to x = Wind, fill = Month, with position = “dodge”, alpha = 0.7, binwidth = 5, and color = “black”. I also changed the xlab to “Monthly Winds” and the ggtitle to “Histogram of Monthly Winds”.