Airquality Homework Assignment

Author

Ayomide Joe-Adigwe

Airquality Assignment

airquality index

Load the library tidyverse

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the data into the global environment

data("airquality")

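Use the head function to display the first 6 rows of the dataset

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6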

Calculating Summary Statistics

mean(airquality$Temp)
[1] 77.88235

Calculating the Median of Temp and the Standard Deviation and Variance of Wind

median(airquality$Temp)
[1] 79
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154
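
The two Wind results above are related: the variance is the square of the standard deviation. The short sketch below checks that relationship and also shows what happens when the same functions are applied to a column with missing values. It works on a fresh copy of the built-in dataset; the names aq, wind_sd, and wind_var are illustrative and not part of the assignment.

```r
# Work on a fresh copy of the built-in dataset (aq is an illustrative name)
aq <- datasets::airquality

# The variance should equal the squared standard deviation
wind_sd  <- sd(aq$Wind)
wind_var <- var(aq$Wind)
all.equal(wind_sd^2, wind_var)

# Ozone contains missing values, so the default call returns NA
mean(aq$Ozone)

# Dropping the missing values first gives a usable estimate
mean(aq$Ozone, na.rm = TRUE)

# Count the missing values in every column
colSums(is.na(aq))
```

Temp and Wind have no missing values, which is why the calls above work without na.rm = TRUE.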

Renaming the Months from numbers to names

airquality$Month[airquality$Month == 5]<- "May"
airquality$Month[airquality$Month == 6]<- "June"
airquality$Month[airquality$Month == 7]<- "July"
airquality$Month[airquality$Month == 8]<- "August"
airquality$Month[airquality$Month == 9]<- "September"

Summary of the Month column (now a character vector after renaming)

summary(airquality$Month)
   Length     Class      Mode 
      153 character character 

Reorder the months so they appear in calendar order rather than alphabetical order.

airquality$Month<-factor(airquality$Month, 
                         levels=c("May", "June","July", "August",
                                  "September"))
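
As a possible alternative to renaming each month and then setting the factor levels, both steps can be done in a single factor() call on the original numeric column. This is only a sketch on a fresh copy of the data (aq is an illustrative name); month.name is a constant built into base R.

```r
# Fresh copy of the built-in dataset, where Month is still numeric (5 to 9)
aq <- datasets::airquality

# Map the numeric codes to month names and fix the calendar order in one step
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# Levels are in calendar order, not alphabetical order
levels(aq$Month)

# Number of daily observations per month
table(aq$Month)
```

Because the levels are declared up front, any later plot or table lists the months from May to September without further reordering.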

PLOT 1: A histogram categorized by Month:

p1 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity")+
  scale_fill_discrete(name = "Month", 
                      labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")  #provide the data source
print(p1)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram is useful for analyzing monthly temperatures because it shows the distribution of temperatures for each month in a different color, which helps compare how temperature ranges vary from May to September. However, with position = "identity" the bars for different months are drawn on top of each other, so some are hidden. Adjusting the bin width and adding transparency can improve clarity.

PLOT 2: Improving the histogram of Temperature by Month

p2 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity", alpha=0.5, binwidth = 5, color = "white")+
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
print(p2)

Yes, readability improves. In Plot 2, the transparency (alpha = 0.5), the wider bins (binwidth = 5), and the white bar outlines make overlapping months easier to tell apart. The side-by-side boxplots in Plot 3 below go further and clearly compare temperature distributions across months. They visually separate each month's temperatures, highlighting August as having the highest temperatures, and they identify outliers in June and July, which gives insight into unusual temperature values. The distinct colors and clear labels further help in displaying and comparing monthly temperature patterns.

PLOT 3: Side-by-side boxplots categorized by Month

p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September"))
print(p3)
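
The points a boxplot draws as outliers can also be listed numerically. The sketch below applies the 1.5 × IQR rule, which is the default used by geom_boxplot(), to each month's temperatures, and also reports each month's median and maximum. It uses a fresh copy of the data with numeric month codes (5 to 9); find_outliers and the other names are hypothetical helpers, not part of the assignment.

```r
# Fresh copy of the built-in dataset; Month is numeric here (5 = May, ..., 9 = September)
aq <- datasets::airquality

# Return the values lying more than 1.5 * IQR beyond the quartiles
find_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[[2]] - q[[1]]
  x[x < q[[1]] - 1.5 * iqr | x > q[[2]] + 1.5 * iqr]
}

# Split the temperatures into one vector per month
temps_by_month <- split(aq$Temp, aq$Month)

# Outlying temperatures for each month (an empty vector means none)
outliers_by_month <- lapply(temps_by_month, find_outliers)
outliers_by_month

# Median and maximum temperature for each month
monthly_median <- sapply(temps_by_month, median)
monthly_max <- sapply(temps_by_month, max)
monthly_median
monthly_max
```

Comparing monthly_median with monthly_max shows whether the month with the single hottest day is also the month that is warmest on a typical day.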

PLOT 4: Side-by-Side Boxplots in Grayscale

p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot()+
  scale_fill_grey(name = "Month", labels = c("May", "June","July", "August", "September"))
print(p4)

PLOT 5: Scatterplot of Solar Radiation vs. Temperature

p5 <- airquality |>
  ggplot(aes(x = Solar.R, y = Temp)) +
  geom_point(aes(color = Month), alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(x = "Solar Radiation (Langley)",
       y = "Temperature (°F)",
       title = "Scatterplot of Solar Radiation vs. Temperature",
       caption = "New York State Department of Conservation and the National Weather Service") +
  scale_color_discrete(name = "Month", labels = c("May", "June", "July", "August", "September"))
print(p5)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 7 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_point()`).

Brief Essay

Plot Type: The plot is a scatterplot displaying the relationship between Solar Radiation (Solar.R) and Temperature (Temp).

Insights: This scatterplot reveals how solar radiation levels are associated with temperature variations. The plot shows a general trend where higher solar radiation tends to be associated with higher temperatures. The added regression line (black line) provides a visual indication of this relationship, suggesting a positive correlation between the two variables. Different colors represent the months, allowing us to see if this relationship varies by month.

Special Code: I used geom_point() to create the scatterplot and geom_smooth() with method = "lm" to add a linear regression line, which helps to identify the overall trend. The alpha = 0.7 parameter in geom_point() adds transparency to the points, making overlapping points easier to distinguish. Mapping color = Month inside aes() colors the points by month, and scale_color_discrete() sets the legend title and labels. The colors add another layer of information for comparing how the relationship between solar radiation and temperature might differ across months.
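
The trend described in the essay can also be checked numerically. The sketch below computes the correlation between Solar.R and Temp and fits the same kind of straight line that geom_smooth(method = "lm") draws. It uses a fresh copy of the data; aq, r, and fit are illustrative names. The rows with missing Solar.R values are excluded, matching the warnings printed with the plot.

```r
# Fresh copy of the built-in dataset
aq <- datasets::airquality

# Pearson correlation, using only rows where both values are present
r <- cor(aq$Solar.R, aq$Temp, use = "complete.obs")
r

# Simple linear regression of temperature on solar radiation;
# lm() drops rows with missing values by default
fit <- lm(Temp ~ Solar.R, data = aq)

# Intercept and slope of the fitted line
coef(fit)

# Number of observations actually used in the fit
nobs(fit)
```

A positive value of r and a positive Solar.R coefficient agree with the upward-sloping line in the plot. The size of r indicates how strong the linear relationship is, which the plot alone does not quantify.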