Airquality Assignment

Author

C Matenje

Load the library

library(tidyverse)

Loading dataset into Global Environment

data("airquality")
# There are 153 observations and 6 variables

Looking at the structure of the data

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
# There are 6 variables: Ozone, Solar.R, Wind, Temp, Month, and Day

Calculate summary statistics

# I learned the below code in my Data 101 class with Professor Hairimun

summary(airquality) # Summary statistics for each variable in the dataset; not meaningful for Month and Day
     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       
     Month            Day      
 Min.   :5.000   Min.   : 1.0  
 1st Qu.:6.000   1st Qu.: 8.0  
 Median :7.000   Median :16.0  
 Mean   :6.993   Mean   :15.8  
 3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :9.000   Max.   :31.0  
                               
#Calculating Standard Deviation and Variance for numeric variables

#Ozone

sd(airquality$Ozone)
[1] NA
var(airquality$Ozone)
[1] NA
# Ozone has missing values (37 NA's in the summary above), so sd() and var() return NA unless na.rm = TRUE is set

#Solar.R

sd(airquality$Solar.R)
[1] NA
var(airquality$Solar.R)
[1] NA
# Solar.R has missing values (7 NA's in the summary above), so sd() and var() return NA unless na.rm = TRUE is set

#Wind

sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154
#Temp

sd(airquality$Temp)
[1] 9.46527
var(airquality$Temp)
[1] 89.59133
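
The NA results above for Ozone and Solar.R come from missing values. Both sd() and var() accept na.rm = TRUE, which drops the NAs before computing. The following is a minimal sketch, assuming it is acceptable to ignore the missing observations for these summaries; the names `aq` and `numeric_vars` are introduced here only for illustration.

```r
# Work on a fresh copy so the original data frame is untouched
aq <- datasets::airquality

# na.rm = TRUE removes missing values before computing the statistic
ozone_sd  <- sd(aq$Ozone, na.rm = TRUE)
ozone_var <- var(aq$Ozone, na.rm = TRUE)
solar_sd  <- sd(aq$Solar.R, na.rm = TRUE)
solar_var <- var(aq$Solar.R, na.rm = TRUE)

# The same idea applied to all four measurement columns at once
numeric_vars <- c("Ozone", "Solar.R", "Wind", "Temp")
sd_by_var <- sapply(aq[numeric_vars], sd, na.rm = TRUE)
sd_by_var
```

For Wind and Temp, which have no missing values, the results are the same with or without na.rm = TRUE.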

Renaming months from numeric to names

airquality$Month[airquality$Month == 5]<- "May"
airquality$Month[airquality$Month == 6]<- "June"
airquality$Month[airquality$Month == 7]<- "July"
airquality$Month[airquality$Month == 8]<- "August"
airquality$Month[airquality$Month == 9]<- "September"
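
The five assignments above could also be written as a single lookup. This is only a sketch of an alternative, assuming base R's built-in `month.name` constant (a character vector of the twelve English month names); `aq` is a hypothetical fresh copy of the data, not the object used in the rest of this assignment.

```r
# Fresh copy with Month still stored as integers 5 through 9
aq <- datasets::airquality

# month.name[5] is "May", so indexing by month number returns the name
aq$Month <- month.name[aq$Month]

# The distinct month names, in the order they first appear in the data
unique(aq$Month)
```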

Checking that the months have been renamed and that Month has changed from integer to character

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67   May   1
2    36     118  8.0   72   May   2
3    12     149 12.6   74   May   3
4    18     313 11.5   62   May   4
5    NA      NA 14.3   56   May   5
6    28      NA 14.9   66   May   6
# Checking whether Month has changed from integer to character
summary(airquality)
     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       
    Month                Day      
 Length:153         Min.   : 1.0  
 Class :character   1st Qu.: 8.0  
 Mode  :character   Median :16.0  
                    Mean   :15.8  
                    3rd Qu.:23.0  
                    Max.   :31.0  
                                  
summary(airquality$Month)
   Length     Class      Mode 
      153 character character 
view(airquality)

Reordering the months so that they do not appear in alphabetical order

# After running head(airquality) above and viewing the dataset in a separate tab, I noticed the rows already appear in calendar order. A character column is still sorted alphabetically in plots, so I will convert Month to a factor with explicit levels to make sure.

airquality$Month<-factor(airquality$Month, 
                         levels=c("May", "June","July", "August",
                                  "September"))

view(airquality) #Viewing in separate tab to double check changes
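
The viewer shows the row order, which does not reveal how the factor levels are stored. A more direct check is to print the levels themselves. The sketch below works on a hypothetical copy named `aq` and assumes that renaming and ordering in one factor() call (through its `levels` and `labels` arguments) is equivalent for this purpose.

```r
aq <- datasets::airquality

# Rename and order in one step: levels are the stored values, labels the names
aq$Month <- factor(aq$Month, levels = 5:9,
                   labels = c("May", "June", "July", "August", "September"))

levels(aq$Month)                      # order used for plot axes and legends
sort(unique(as.character(aq$Month)))  # alphabetical order, for comparison
table(aq$Month)                       # number of days recorded per month
```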

Plot 1: Create a histogram categorized by Month

p1 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity")+
  scale_fill_discrete(name = "Month", 
                      labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
p1
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Plot 2: Improve the histogram of Average Temperature by Month

p2 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity", alpha=0.5, binwidth = 5, color = "white")+
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
p2
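
Even with transparency, overlapping histograms can be hard to read when five groups share one axis. One alternative, offered here as a sketch rather than as part of the assignment, is to give each month its own panel with facet_wrap(). The object names `aq` and `p2_facet` and the axis labels are my own choices.

```r
library(ggplot2)

aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9,
                   labels = c("May", "June", "July", "August", "September"))

p2_facet <- ggplot(aq, aes(x = Temp, fill = Month)) +
  geom_histogram(binwidth = 5, color = "white") +
  facet_wrap(~ Month, ncol = 1) +    # one panel per month, stacked vertically
  labs(x = "Temperature (°F)",
       y = "Number of days",
       title = "Daily Temperatures by Month, May - Sept 1973") +
  theme(legend.position = "none")    # panel strips already name each month
p2_facet
```

Because the panels share the same x-axis, the shift toward higher temperatures in July and August can be compared directly from one panel to the next.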

Plot 3: Create side-by-side boxplots categorized by Month

p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September"))
p3

Plot 4: Side by Side Boxplots in Gray Scale

p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot()+
  scale_fill_grey(name = "Month", labels = c("May", "June","July", "August", "September"))
p4

Plot 5: Scatterplot of Temperature and Ozone Levels

p5 <- airquality |>
  ggplot(aes(x = Temp, y = Ozone)) +
  geom_point(color = "orange") +
  labs(
    title = "Relationship Between Temperature and Ozone Levels (1973)",
    x = "Temperature (°F)",
    y = "Ozone (ppb)",
    caption = "Source: New York State Department of Conservation and the National Weather Service"
  )
p5
Warning: Removed 37 rows containing missing values or values outside the scale range
(`geom_point()`).

Essay on Plot 5 - Scatterplot demonstrating the relationship between temperature and ozone levels

For Plot 5, I wanted to explore the potential relationship between temperature and ozone levels from May to September of 1973. I vaguely recalled from a high school environmental science class that increases in temperature typically result in increases in ozone levels, which can lead to poor air quality. Since temperature and ozone are both continuous variables in this dataset, I felt a scatterplot would be appropriate for this purpose.

I used geom_point() to create the scatterplot, with temperature on the x-axis and ozone on the y-axis. I chose orange to make the data points easier to see. Overall, there appears to be a positive relationship between temperature and ozone levels: as temperature increases, ozone levels tend to increase, and lower temperatures are associated with lower ozone levels. A few outliers depart from this pattern, which is expected given the variability in the data.
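
The visual impression of a positive relationship can also be checked numerically with cor(). This is a minimal sketch, assuming a Pearson correlation on the complete Temp/Ozone pairs is a reasonable summary; `aq` and `temp_ozone_cor` are names introduced for illustration.

```r
aq <- datasets::airquality

# use = "complete.obs" drops rows where either Temp or Ozone is NA
temp_ozone_cor <- cor(aq$Temp, aq$Ozone, use = "complete.obs")

# A value above 0 indicates a positive linear association
temp_ozone_cor
```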

I will note that R showed a warning that 37 rows containing missing values were removed. This matches the 37 NA values reported for Ozone in the summary statistics, and I also observed the missing values labeled “NA” when checking the entire dataset in a separate tab.
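
The source of the warning can be confirmed by counting missing values per column instead of scanning the viewer. A short sketch, using a hypothetical copy named `aq`:

```r
aq <- datasets::airquality

# Number of NA values in each column
na_counts <- colSums(is.na(aq))
na_counts

# Rows geom_point() can draw: both Temp and Ozone must be present
plottable <- sum(complete.cases(aq[, c("Temp", "Ozone")]))
plottable
```

Temp has no missing values, so every dropped row is one where Ozone is NA.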