Airquality Homework Assignment

Author

Leika Joseph

Airquality Assignment

Code Orange Air Quality

Load the library tidyverse

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the data into the global environment

data(airquality)

View the structure of the dataset

view(airquality)
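
view() opens the data in a spreadsheet-style viewer, which is only available in an interactive session. As a minimal sketch, the structure can also be printed to the console with base R. The name `aq` is an illustrative copy of the dataset and is not used elsewhere in this assignment.

```r
# Fresh copy of the built-in dataset; `aq` is a hypothetical name for illustration.
aq <- datasets::airquality

str(aq)              # column names, types, and the first few values
dim(aq)              # number of rows and columns
colSums(is.na(aq))   # count of missing values in each column
```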

View the first rows of the dataset using the “head” function

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Calculate Summary Statistics

Two different ways to calculate “mean”

mean(airquality$Temp)
[1] 77.88235
mean(airquality[,4])
[1] 77.88235
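
Both calls above work because Temp has no missing values. For a column that does contain NA values, such as Ozone (see the head() output), mean() returns NA unless na.rm = TRUE is set. The sketch below uses an illustrative copy named `aq`, which is an assumption and not part of the assignment code.

```r
# Fresh copy of the built-in dataset; `aq` is a hypothetical name for illustration.
aq <- datasets::airquality

ozone_mean_raw   <- mean(aq$Ozone)                # NA, because Ozone has missing values
ozone_mean_clean <- mean(aq$Ozone, na.rm = TRUE)  # mean of the non-missing values only
temp_mean_byname <- mean(aq[["Temp"]])            # a third way to select the Temp column

ozone_mean_raw
ozone_mean_clean
temp_mean_byname
```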

Calculate Median, Standard Deviation, and Variance

median(airquality$Temp)
[1] 79
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154
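
The standard deviation is the square root of the variance, so the two Wind results above can be checked against each other. This is a small sketch using an illustrative copy named `aq` (an assumed name, not from the assignment).

```r
# Fresh copy of the built-in dataset; `aq` is a hypothetical name for illustration.
aq <- datasets::airquality

wind_sd  <- sd(aq$Wind)
wind_var <- var(aq$Wind)

# Squaring the standard deviation should reproduce the variance
# (all.equal allows for floating-point rounding).
all.equal(wind_sd^2, wind_var)
```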

Rename the months from numbers to names

airquality$Month[airquality$Month == 5] <- "May"
airquality$Month[airquality$Month == 6] <- "June"
airquality$Month[airquality$Month == 7] <- "July"
airquality$Month[airquality$Month == 8] <- "August"
airquality$Month[airquality$Month == 9] <- "September"

Summarize the Month column, then convert it to a factor so the months stay in calendar order

summary(airquality$Month)
   Length     Class      Mode 
      153 character character 
airquality$Month <- factor(airquality$Month, levels = c("May", "June", "July", "August", "September"))
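
As an alternative sketch, the renaming and the factor conversion can be done in one step with R's built-in `month.name` vector, which holds the twelve English month names. The copy `aq` is an illustrative name and an assumption; the assignment itself uses the five assignments above.

```r
# Fresh copy of the built-in dataset; `aq` is a hypothetical name for illustration.
aq <- datasets::airquality

# Index month.name by the numeric month (5 to 9), then fix the level order
# so plots and tables list the months chronologically instead of alphabetically.
aq$Month <- factor(month.name[aq$Month], levels = month.name[5:9])

levels(aq$Month)
table(aq$Month)   # number of daily observations per month
```

Once Month is a factor, summary() reports a count per month instead of only the length and class.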

Plot 1

p1 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity")+
  scale_fill_discrete(name = "Month", 
                      labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")  #provide the data source
p1
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This plot gives an overview of the temperature distribution in the dataset, but it has limitations that make it less useful. Because the bars are drawn at full opacity with position = "identity", the colors for different months cover each other.

Plot 2: Improve the histogram of Average Temperature by Month

p2 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity", alpha=0.5, binwidth = 5, color = "white")+
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
p2

This improved the readability of the plot: with semi-transparent bars, wider bins, and white outlines, the color for each month is visible and shows how the temperature distribution shifts from one month to the next.
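
If the overlapping bars are still hard to compare, one possible alternative is to give each month its own panel. The sketch below is not part of the assignment; it assumes ggplot2 is installed, and `aq` and `p_facet` are illustrative names.

```r
library(ggplot2)

# Fresh copy of the built-in dataset; `aq` and `p_facet` are hypothetical names.
aq <- datasets::airquality
aq$Month <- factor(month.name[aq$Month], levels = month.name[5:9])

# One histogram panel per month, stacked vertically on a shared x-axis,
# so no month is hidden behind another.
p_facet <- ggplot(aq, aes(x = Temp, fill = Month)) +
  geom_histogram(binwidth = 5, color = "white", show.legend = FALSE) +
  facet_wrap(~ Month, ncol = 1) +
  labs(x = "Temperature (degrees F)",
       y = "Frequency of Temps",
       title = "Histogram of Temperatures by Month, May - Sept 1973")
p_facet
```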

Plot 3: Create side-by-side boxplots categorized by Month.

p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September"))
p3
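
To read the boxplots numerically, the quartiles that each box summarizes can be computed per month. This is a hedged sketch in base R: `aq` and `temp_quartiles` are illustrative names, and the values should correspond to the lower edge, middle line, and upper edge of each box as drawn by geom_boxplot().

```r
# Fresh copy of the built-in dataset; `aq` and `temp_quartiles` are hypothetical names.
aq <- datasets::airquality
aq$Month <- factor(month.name[aq$Month], levels = month.name[5:9])

# Split Temp by month, then compute the 25th, 50th, and 75th percentiles
# for each group. The result is a matrix with one column per month.
temp_quartiles <- sapply(split(aq$Temp, aq$Month),
                         quantile, probs = c(0.25, 0.50, 0.75))
temp_quartiles
```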

Plot 4: Side-by-side Boxplots in Gray Scale

p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot()+
  scale_fill_grey(name = "Month", labels = c("May", "June","July", "August", "September"))
p4

Plot 5: Scatter Plot of Solar Radiation vs Ozone Levels

airquality <- airquality|> 
  filter(!is.na(Solar.R) & !is.na(Ozone))

p5 <- airquality |>
  ggplot(aes(x = Solar.R, y = Ozone)) +
  geom_point(aes(color = factor(Month)), size = 3, alpha = 0.8) +
  labs(x = "Solar Radiation (Langleys)", 
       y = "Ozone Concentration (ppb)",
       color = "Month",
       title = "Scatter Plot of Solar Radiation vs. Ozone Levels",
       caption = "New York State Department of Conservation and the National Weather Service") +
  scale_color_manual(values = c("red", "orange", "yellow", "green", "blue"),
                     labels = c("May", "June", "July", "August", "September")) 

p5

Brief essay on my scatter plot

Plot 5 is a scatter plot illustrating the relationship between solar radiation and ozone concentration from May to September. Higher solar radiation is associated with higher ozone levels, especially in June, July, and August, while ozone levels are lower in May and September. This suggests that ozone levels are affected by the change of seasons.

Before running the code for the plot, I used filter() with !is.na() to remove the rows with missing values in my two chosen variables, Solar.R and Ozone.
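
The effect of that filtering step can be checked by counting rows before and after. The sketch below uses base R rather than dplyr so that it runs on its own; `aq` and `aq_complete` are illustrative names, not objects from the assignment.

```r
# Fresh copy of the built-in dataset; `aq` and `aq_complete` are hypothetical names.
aq <- datasets::airquality

# Keep only rows where both Solar.R and Ozone are present,
# mirroring filter(!is.na(Solar.R) & !is.na(Ozone)).
aq_complete <- aq[!is.na(aq$Solar.R) & !is.na(aq$Ozone), ]

nrow(aq)           # 153 rows in the original data
nrow(aq_complete)  # 111 rows remain after removing incomplete rows
```

Storing the filtered rows under a new name, instead of overwriting the original data frame, keeps the full dataset available for any later step that does not need Solar.R or Ozone.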

Overall, I think the scatter plot shows the effect solar radiation can have on ozone levels. Season may also be a factor, because the amount of solar radiation changes as the seasons change.
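
One way to put numbers behind these observations is a correlation coefficient and a per-month average. This is a rough sketch, not part of the original analysis; `aq`, `aq_complete`, `solar_ozone_cor`, and `monthly_ozone` are illustrative names. A correlation only measures linear association and does not show that solar radiation causes the change in ozone.

```r
# Fresh copy of the built-in dataset; all object names here are hypothetical.
aq <- datasets::airquality
aq$Month <- factor(month.name[aq$Month], levels = month.name[5:9])

# Use only rows where both variables are observed.
aq_complete <- aq[complete.cases(aq[, c("Solar.R", "Ozone")]), ]

# Pearson correlation between solar radiation and ozone.
solar_ozone_cor <- cor(aq_complete$Solar.R, aq_complete$Ozone)
solar_ozone_cor

# Mean ozone concentration (ppb) for each month, to compare the seasons.
monthly_ozone <- tapply(aq_complete$Ozone, aq_complete$Month, mean)
monthly_ozone
```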