Air Quality HW

Author

Carla Musafi

Load in the Library

Load library tidyverse in order to access dplyr and ggplot2

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the dataset into your global environment

Because air quality is a pre-built data set, we can write it to our data directory to store it for later use.

data("airquality")

Look at the structure of the data

In the global environment, click on the row with the air quality data set and it will take you to a “spreadsheet” view of the data.

View the data using the “head” function

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Calculate Summary Statistics

Here are 2 different ways to calculate “mean.”

mean(airquality$Temp)
[1] 77.88235
mean(airquality[,4]) 
[1] 77.88235

For the second way to calculate the mean, the matrix [row,column] is looking for column #4, which is the Temp column and we use all rows

Calculate Median, Standard Deviation, and Variance

median(airquality$Temp)
[1] 79
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154

Rename the Months from number to names

airquality$Month[airquality$Month == 5]<- "May"
airquality$Month[airquality$Month == 6]<- "June"
airquality$Month[airquality$Month == 7]<- "July"
airquality$Month[airquality$Month == 8]<- "August"
airquality$Month[airquality$Month == 9]<- "September"

Now look at the summary statistics of the dataset

summary(airquality$Month)
   Length     Class      Mode 
      153 character character 

Month is a categorical variable with different levels, called factors.

library(tidyverse)

airquality$Month<-factor(airquality$Month, 
                         levels=c("May", "June","July", "August",
                                  "September"))

Plot 1: Create a histogram categorized by Month

p1 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity")+
  scale_fill_discrete(name = "Month", 
                      labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")  #scale_fill_discrete(name = “Month” May-Sep)
p1
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Plot 2

p2 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity", alpha= 0.5, binwidth = 5, color = "white")+
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
p2

Plot 3

p3 <- airquality |>
  ggplot(aes(Month= , Temp,fill= Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September"))
p3

Plot 4

p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Monthly Temperatures", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot()+
  scale_fill_grey(name = "Month", labels = c("May", "June","July", "August", "September"))
p4

Plot 5

p5 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity", alpha= 0, binwidth = 7, color = "purple")+ 
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Income from May - Sept", 
       y = "Frequency of Income",
       title = "Histogram of Monthly Income from May - Sept, 2020",
       caption = "New York State Department of Conservation and the National bank Service")
p5

This data visualization shows the monthly income from the National bank service in 2020. As you can see the frequency increases until it meets around the end 88 of the months, which is around 81.5 and decreases after it passes 88, with a zero transparency. I noticed the middle of each bar no longer falls under an integer, In my visualization the center of all bar’s other than 80 fall between the beginning or towards the end of an integer. For these codes I used plot 1 as an example and I changed the variables to purple coloring and I replaced weather, to income. The coding technique used was ggplot and geom_histogram. I changed the binwidth from 5 to 7 and I noticed my histograms got a bit thicker and the bars decreased in value.