Airquality HW

Author

A Elmi

Load the library

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the dataset into your global environment

data("airquality")

Temp chunk

plotX <-airquality |>
  ggplot(aes(x=Ozone, y=Temp)) +
  geom_point()
plotX
Warning: Removed 37 rows containing missing values or values outside the scale range
(`geom_point()`).

View the data using the “head” function

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Calculate summary statistics

mean(airquality$Temp)
[1] 77.88235
mean(airquality[,4]) 
[1] 77.88235

Calculate Median, Standard Deviation, and Variance

median(airquality$Temp)
[1] 79
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154

Rename the Months from number to names

airquality$Month[airquality$Month == 5]<- "May"
airquality$Month[airquality$Month == 6]<- "June"
airquality$Month[airquality$Month == 7]<- "July"
airquality$Month[airquality$Month == 8]<- "August"
airquality$Month[airquality$Month == 9]<- "September"

Now look at the summary statistics of the dataset

summary(airquality$Month)
   Length     Class      Mode 
      153 character character 

Month is a categorical variable with different levels, called factors.

airquality$Month<-factor(airquality$Month, 
                         levels=c("May", "June","July", "August",
                                  "September"))

Plot 1: Create a histogram categorized by Month

p1 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity")+
  scale_fill_discrete(name = "Month", 
                      labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")  #provide the data source
p1
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Plot 2: Improve the histogram of Average Temperature by Month

p2 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity", alpha=0.5, binwidth = 5, color = "white")+
  scale_fill_discrete(name = "Months", labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
p2

Plot 3: Create side-by-side boxplots categorized by Month

p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September"))
p3

Plot 4: Side by Side Boxplots in Gray Scale

p4 <- airquality |>
ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Monthly Temperatures", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot()+
  scale_fill_grey(name = "Month", labels = c("May", "June","July", "August", "September"))
p4

Plot 5:Scatterplot of Temperature vs Solar.R

plotX <-airquality |>
  ggplot(aes(x=Solar.R, y=Temp)) +
   labs(x = "Solar.R", y = "Temperatures", 
       title = "Scatterplot of Temperature vs Solar.R",  
      caption = "New York State Department of Conservation and the National Weather Service") +
  geom_point(colour="red", alpha=0.6,)
plotX
Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_point()`).

Essay

Describe the plot type you have created:

For my plot (plot 5), Temperature vs Solar. R, I decided to make a scatterplot of SolarR as my x-axis and Temperature as my y-axis. I chose a scatterplot because it’s easy to view and it clearly shows that an increase solar radiation, increases the temperatures.

Any insights that the plot shows:

The plot showed that there is a positive correlation between temperature and Solar R. The higher the Solar.R, the higher the temperature tends to be, according to the plot. This data is from 1973; it would be interesting to observe it in 2025.

Describe any special code you used to make this plot:

To make this plot, I played around with the geom_point(). I added the color red because when I think of hot temperatures and Solar.R I think of red. I also used alpha 0.6 because some points overlapped, and I wanted each point to be seen clearly so the trend could be easily identified. Lastly, I used labs() and added the axis labels, captions, and title.