Airquality final

Author

K Bedassa

Air quality Assignment

Load the library

library(tidyverse)

Load the dataset into your global environment

data("airquality")

View the data using the “head” function

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Calculate Summary Statistics

mean(airquality$Temp)
[1] 77.88235
mean(airquality[,4])
[1] 77.88235

Calculate Median, Standard Deviation, and Variance

median(airquality$Temp)
[1] 79
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154
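As a quick sanity check on the values above, the variance should equal the square of the standard deviation. The sketch below also shows what happens with a column that contains missing values, such as Ozone (row 5 of the head() output is NA). The names wind_sd and wind_var are illustrative only.

```r
# Work on the built-in dataset directly (datasets ships with base R)
wind_sd  <- sd(datasets::airquality$Wind)
wind_var <- var(datasets::airquality$Wind)

# Variance is the squared standard deviation (compared with a numeric tolerance)
all.equal(wind_sd^2, wind_var)

# Ozone contains NA values, so the plain mean is NA
mean(datasets::airquality$Ozone)

# na.rm = TRUE drops the missing values before averaging
mean(datasets::airquality$Ozone, na.rm = TRUE)
```

Temp and Wind have no missing values, which is why the calls above work without na.rm.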

Rename the Months from Numbers to Names

airquality$Month[airquality$Month == 5] <- "May"
airquality$Month[airquality$Month == 6] <- "June"
airquality$Month[airquality$Month == 7] <- "July"
airquality$Month[airquality$Month == 8] <- "August"
airquality$Month[airquality$Month == 9] <- "September"

Now look at the summary statistics of the dataset

summary(airquality$Month)
   Length     Class      Mode 
      153 character character 

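Month is a categorical variable. In R, categorical variables are stored as factors, and the distinct values a factor can take are called levels. Setting the levels explicitly keeps the months in calendar order instead of alphabetical order.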

airquality$Month<-factor(airquality$Month,
                         levels=c("May","June","July","August","September")) 
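To illustrate why the levels argument matters, the sketch below compares the default (alphabetical) level order with an explicit calendar order. It also shows one possible alternative that recodes the numeric months and sets the order in a single factor() call. The names month_names, default_order, calendar_order, and aq are illustrative and not part of the assignment.

```r
month_names <- c("May", "June", "July", "August", "September")

# Without levels =, factor() sorts the labels alphabetically
default_order <- factor(month_names)
levels(default_order)

# With levels =, the order is exactly the one supplied
calendar_order <- factor(month_names, levels = month_names)
levels(calendar_order)

# Alternative sketch: recode 5..9 and fix the order in one step,
# using a fresh copy so the data frame above is not overwritten
aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])
table(aq$Month)
```

The level order is what ggplot2 uses to arrange the legend entries and the boxplot positions in the plots that follow.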

Plot 1: Create a histogram categorized by Month

p1 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity")+
  scale_fill_discrete(name = "Month", 
                      labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")  #provide the data source
p1
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Plot 2: Improve the histogram of Temperature by Month

p2 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity", alpha=0.5, binwidth = 5, color = "white")+
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
p2
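To get a feel for what binwidth = 5 means here, the sketch below estimates how many 5-degree bins are needed to span the observed temperatures. This is a rough count only, since ggplot2 may place bin edges slightly differently. The names aq, temp_range, bin_width, and n_bins are illustrative.

```r
# Fresh copy of the built-in data (Temp is unaffected by the Month recoding)
aq <- datasets::airquality

temp_range <- range(aq$Temp)   # lowest and highest recorded temperature
bin_width  <- 5                # same width as in Plot 2

# Approximate number of bins needed to cover the full range
n_bins <- ceiling(diff(temp_range) / bin_width)

temp_range
n_bins
```

The result is far fewer bins than the default of 30 used in Plot 1, which is why the bars in Plot 2 look wider and less sparse.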

Plot 3: Create side-by-side boxplots categorized by Month

p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September"))
p3
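The boxplots summarize each month by its median and quartiles. As a rough numeric companion to the plot, the sketch below computes the median and interquartile range of Temp for each month with tapply(). The names aq, month_medians, and month_iqr are illustrative.

```r
# Fresh copy with Month recoded as an ordered-by-calendar factor
aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# Median temperature per month (the center line of each box)
month_medians <- tapply(aq$Temp, aq$Month, median)

# Interquartile range per month (the height of each box)
month_iqr <- tapply(aq$Temp, aq$Month, IQR)

month_medians
month_iqr
```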

Plot 4: Side by Side Boxplots in Gray Scale

p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot()+
  scale_fill_grey(name = "Month", labels = c("May", "June","July", "August", "September"))
p4

Plot 5: Scatterplot of Solar.R and Temperature

p5 <- airquality |>
  ggplot(aes(x=Solar.R, y=Temp)) +
  geom_point(aes(color=factor(Month)),
             alpha=0.6,
             na.rm=TRUE)+
  geom_smooth(method = "lm")+
  labs(x = "Solar.R", 
       y = "Temp",
       title = "Scatterplot of Solar.R and Temp",
       caption = "New York State Department of Conservation and the National Weather Service")
p5
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_smooth()`).

Essay

In Plot 5, I created a scatterplot to examine how solar radiation (Solar.R) relates to temperature (Temp) in the airquality dataset. I used ggplot() and mapped Solar.R to the x-axis and Temp to the y-axis inside aes(). I then added geom_point() so that each observation appears as a point, and set alpha = 0.6 so that overlapping points remain visible.

To see whether the relationship varies over the season, I colored the points by month by mapping color = factor(Month) inside geom_point(). Treating Month as a factor gives each month its own discrete color, which makes seasonal patterns easier to spot. I also added geom_smooth(method = "lm"), which fits a linear regression and draws the fitted line with a confidence band. The upward slope of that line indicates a positive association: temperature readings tend to be higher when solar radiation is higher.

Finally, I used labs() to add axis labels, a title that describes the plot, and a caption that credits the data source. Overall, the plot shows a positive relationship between solar radiation and temperature while keeping the month-to-month variation visible.
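The relationship described in the essay can also be checked numerically. The sketch below is one possible way to do that: it computes the Pearson correlation and the slope of a simple linear model, and counts the missing Solar.R values that explain the stat_smooth() warning under Plot 5. The names aq, r, fit, and slope are illustrative and not part of the original assignment.

```r
# Fresh copy of the built-in data
aq <- datasets::airquality

# Pearson correlation, using only rows where both values are present
r <- cor(aq$Solar.R, aq$Temp, use = "complete.obs")

# Same model that geom_smooth(method = "lm") fits: Temp as a function of Solar.R
# (lm() drops rows with missing values by default)
fit   <- lm(Temp ~ Solar.R, data = aq)
slope <- coef(fit)[["Solar.R"]]

# Number of missing Solar.R values, i.e. the rows removed in the warning
n_missing <- sum(is.na(aq$Solar.R))

r
slope
n_missing
```

A positive r and a positive slope agree with the upward trend in the plot, though the size of r says how strong the linear relationship is, which the fitted line alone does not show.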