Airquality Assignment

Author

Qian.H

Load in the library

# Load the tidyverse, which provides ggplot2 and other data tools
library(tidyverse)

Load the dataset into your global environment

# Load the built-in airquality dataset
data("airquality")

Look at the structure of the data

View the data using the “head” function

# Show the first six rows of the data
head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
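
head() shows only the first six rows. As a minimal sketch of a fuller structure check, base R's str() and colSums(is.na()) report the column types and the number of missing values per column. The names aq and na_counts are introduced here for illustration, so the original data frame is left untouched.

```r
# Work on a copy (aq is a name introduced for this sketch)
aq <- datasets::airquality

# Column types and a preview of each column
str(aq)

# Dimensions: number of rows and columns
dim(aq)

# Count of missing values in each column
na_counts <- colSums(is.na(aq))
na_counts
```

The NA values visible in rows 5 and 6 of the head() output are the kind of gaps this count makes explicit.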

Calculate Summary Statistics

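# Calculate the mean temperature
# Method 1: select the column by name
mean(airquality$Temp)
[1] 77.88235
# Method 2: select the same column by position (Temp is column 4)
mean(airquality[,4])
[1] 77.88235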

Calculate Median, Standard Deviation, and Variance

#calculate median
median(airquality$Temp)
[1] 79
#calculate standard deviation
sd(airquality$Wind)
[1] 3.523001
#calculate variance
var(airquality$Wind)
[1] 12.41154
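
Variance is the square of the standard deviation, so the two Wind results above can be checked against each other. The following is a small sketch of that check; wind, wind_sd, and wind_var are names introduced here for illustration.

```r
wind <- datasets::airquality$Wind

wind_sd  <- sd(wind)
wind_var <- var(wind)

# The squared standard deviation should match the variance
all.equal(wind_sd^2, wind_var)

# summary() gives the minimum, quartiles, median, mean, and maximum in one call
summary(wind)
```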

Rename the months from numbers to names

# Replace each month number with its name (this turns Month into a character column)
airquality$Month[airquality$Month == 5]<- "May"
airquality$Month[airquality$Month == 6]<- "June"
airquality$Month[airquality$Month == 7]<- "July"
airquality$Month[airquality$Month == 8]<- "August"
airquality$Month[airquality$Month == 9]<- "September"
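
The five assignments above can also be written as one call. This is an alternative sketch, not the method used in this assignment: factor() maps the numeric codes 5 to 9 to labels taken from R's built-in month.name constant and keeps the months in calendar order. It has to run on the original numeric Month column, so it starts from a fresh copy (aq is a name introduced here).

```r
aq <- datasets::airquality  # Month is still numeric (5 to 9) in this copy

# Map codes 5..9 to "May".."September"; levels stay in calendar order
aq$Month <- factor(aq$Month,
                   levels = 5:9,
                   labels = month.name[5:9])

levels(aq$Month)  # the month names, in calendar order
table(aq$Month)   # number of daily observations per month
```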

Now look at the summary statistics of the dataset

# Month is now a character column, so summary() reports only its length, class, and mode
summary(airquality$Month)
   Length     Class      Mode 
      153 character character 

Month is a categorical variable. In R, categorical variables are stored as factors, and the distinct categories are called levels.

# Convert Month to a factor with its levels in calendar order
airquality$Month<-factor(airquality$Month,
                         levels=c("May","June","July","August",
                                  "September"))

Plot 1: Create a histogram categorized by Month

Plot 1 Code

p1 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity") +
  scale_fill_discrete(name = "Month", 
                      labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")  #provide the data source

Plot 1 Output

p1
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
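
The message above means that ggplot2 fell back to its default of 30 bins. One rough way to choose a binwidth is to look at the range of the data and see how many bins a candidate width would produce. The sketch below does that in base R; the width of 5 degrees is only a candidate value, and temp_range and n_bins are names introduced here.

```r
temp <- datasets::airquality$Temp

temp_range <- range(temp)   # lowest and highest recorded temperature
binwidth   <- 5             # candidate bin width, in degrees Fahrenheit

# Approximate number of bins needed to span the data at this width
n_bins <- ceiling(diff(temp_range) / binwidth)
n_bins
```

Far fewer bins than the default 30 gives each bar more observations, which usually makes the shape of the distribution easier to read.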

Plot 2: Improve the histogram of Temperature by Month

Plot 2 Code

p2 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity", alpha=0.6, binwidth = 5, color = "orange")+
  # binwidth sets the width of each bin in degrees; alpha makes overlapping bars see-through
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")

Plot 2 Output

p2

Plot 3: Create side-by-side boxplots categorized by Month

Plot 3 Code

p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September"))

Plot 3 Output

p3
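
Each box summarizes one month's temperatures by its median and quartiles. As a sketch, similar numbers can be computed directly with tapply(). It uses the numeric Month codes from a fresh copy of the data, so the groups are labeled 5 through 9 rather than by name; aq, month_medians, and month_fivenum are names introduced here.

```r
aq <- datasets::airquality

# Median temperature for each month (the line inside each box)
month_medians <- tapply(aq$Temp, aq$Month, median)
month_medians

# Tukey five-number summary per month: min, lower hinge, median, upper hinge, max.
# The hinges are close to, but not always identical to, the quartiles that
# geom_boxplot() draws, and its whiskers stop at 1.5 * IQR rather than min/max.
month_fivenum <- tapply(aq$Temp, aq$Month, fivenum)
month_fivenum
```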

Plot 4: Side-by-Side Boxplots in Grayscale

Plot 4 Code

p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot()+
  scale_fill_grey(name = "Month", labels = c("May", "June","July", "August", "September"))

Plot 4 Output

p4

Plot 5: Scatterplot of Wind Speed and Ozone Levels

Plot 5 Code

p5 <- airquality |>
  ggplot(aes(x=Wind, y=Ozone)) +
  geom_point(aes(color=factor(Month)),
             alpha=0.6,
             na.rm=TRUE)+
  geom_smooth(method = "lm")+
  labs(x = "Wind Speed", 
       y = "Ozone Level",
       title = "Scatterplot of Wind Speed and Ozone Levels",
       caption = "New York State Department of Conservation and the National Weather Service")

Plot 5 Output

p5
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 37 rows containing non-finite outside the scale range
(`stat_smooth()`).
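
The warning comes from stat_smooth(), which drops rows where Ozone is missing before fitting the line. A quick check, sketched in base R, confirms that the count in the warning matches the number of missing Ozone values; aq, n_missing, and n_complete are names introduced here.

```r
aq <- datasets::airquality

# Number of rows with a missing Ozone value
n_missing <- sum(is.na(aq$Ozone))
n_missing

# Rows that have both Wind and Ozone, and so can be plotted and fitted
n_complete <- sum(complete.cases(aq[, c("Wind", "Ozone")]))
n_complete
```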

Essay

In Plot 5, I created a scatterplot to show the relationship between wind speed and ozone levels in New York from May to September 1973. The x-axis is wind speed, and the y-axis is ozone level.

The blue regression line shows a general negative relationship between wind speed and ozone levels. When wind speed is low, ozone levels tend to be higher; as wind speed increases, ozone levels tend to decrease. This suggests that stronger winds may help disperse air pollutants and lower ozone concentrations. The points are quite spread out, so the relationship is not tight, but the downward trend is still visible.
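
The downward trend can also be put into numbers. The sketch below fits the same kind of straight line that geom_smooth(method = "lm") draws and computes the correlation between the two variables. The object names aq, fit, and wind_ozone_cor are introduced here for illustration, and rows with missing Ozone are dropped.

```r
aq <- datasets::airquality

# Linear model of Ozone on Wind; lm() drops rows with NA by default
fit <- lm(Ozone ~ Wind, data = aq)
coef(fit)  # intercept and slope; a negative slope means a downward trend

# Pearson correlation using only rows where both values are present
wind_ozone_cor <- cor(aq$Wind, aq$Ozone, use = "complete.obs")
wind_ozone_cor
```

A negative slope and a negative correlation would both agree with the direction of the blue line, while the size of the correlation indicates how widely the points scatter around it.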

I used geom_point() to display the data points and included na.rm = TRUE so that rows with missing ozone values are dropped without a warning. I also mapped color to factor(Month) so that each month's points have their own color, which makes the plot more visually engaging. I adjusted the transparency with alpha so that overlapping points are easier to see. Finally, I used geom_smooth(method = "lm") to add the blue line, which makes the downward trend easier to see.