Airquality HW

Load in the library

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the data set into the global environment

data("airquality")

head(airquality)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

Calculate Summary Statistics

mean(airquality$Temp)

## [1] 77.88235

mean(airquality[,4])

## [1] 77.88235

Calculate Median, Standard Deviation, and Variance

median(airquality$Temp)

## [1] 79

sd(airquality$Wind)

## [1] 3.523001

var(airquality$Wind)

## [1] 12.41154
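
Temp and Wind have no missing values, so the calls above return numbers directly. Ozone and Solar.R do contain NA (see rows 5 and 6 of the head() output), and for those columns the same functions return NA unless na.rm = TRUE is passed. A minimal sketch of that, plus a check that the variance is the square of the standard deviation; the variable names here are only illustrative:

```r
# Fresh copy of the built-in data; `aq` is an illustrative name
aq <- datasets::airquality

# Ozone contains NA, so the plain mean is NA
ozone_mean_raw <- mean(aq$Ozone)

# na.rm = TRUE drops the missing values before averaging
ozone_mean <- mean(aq$Ozone, na.rm = TRUE)

# Variance is the squared standard deviation
wind_sd  <- sd(aq$Wind)
wind_var <- var(aq$Wind)
all.equal(wind_sd^2, wind_var)
```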

Rename the Months from numbers to names

Number 5-9 to May through September

airquality$Month[airquality$Month == 5]<- "May"
airquality$Month[airquality$Month == 6]<- "June"
airquality$Month[airquality$Month == 7]<- "July"
airquality$Month[airquality$Month == 8]<- "August"
airquality$Month[airquality$Month == 9]<- "September"

Now look at the Summary statistics of the data set

See how the months have changed from numbers to names (the column is now of class character)

summary(airquality$Month)

##    Length     Class      Mode 
##       153 character character

Month is a categorical variable with different levels, which R calls a factor

This is one way to reorder the Months so they do not default to alphabetical (you will see another way to reorder DIRECTLY in the chunk that creates the plot below in Plot 1)

airquality$Month<-factor(airquality$Month,  levels = c("May", "June","July", "August", "September"))
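
As a sketch of an alternative, the five assignments and the factor() call can be collapsed into one step with base R's built-in month.name vector. This works on a fresh copy of the data (aq_alt is an illustrative name), because it assumes Month is still numeric:

```r
aq_alt <- datasets::airquality

# month.name is a built-in character vector: "January", ..., "December".
# Indexing it by the numeric month gives the name; `levels` fixes the order.
aq_alt$Month <- factor(month.name[aq_alt$Month], levels = month.name[5:9])

levels(aq_alt$Month)

# Without explicit levels, factor() sorts the names alphabetically
levels(factor(month.name[5:9]))
```

Setting the levels once on the data means every later plot and summary inherits the chronological order.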

Plot 1: Create a histogram categorized by Month

Here is a first attempt at viewing a histogram of temperature by the months May through September. We will see that temperatures increase over these months. The median temperature is 79 degrees, as computed above.

Reorder the legend so that it is not the default (alphabetical), but rather in chronological order.

fill = Month colors the histogram by months between May - Sept.

scale_fill_discrete(name = “Month”…) provides the month names on the right side as a legend.

p1 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity")+
  scale_fill_discrete(name = "Month", 
                      labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")  #provide the data source
p1

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
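
The message above means ggplot2 fell back to its default of 30 bins. A rough way to pick a binwidth instead is to look at the range of the data and see how many bins a given width would produce. The sketch below uses base R only; the width of 5 degrees is just one reasonable choice, and the count is approximate because ggplot2 also decides where the bin edges fall.

```r
temp <- datasets::airquality$Temp

# Smallest and largest observed temperatures
temp_range <- range(temp)

# Candidate bin width in degrees (an assumption, not a rule)
binwidth <- 5

# Approximate number of bins needed to cover the range
approx_bins <- ceiling(diff(temp_range) / binwidth)
approx_bins
```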

No, the graph is not particularly useful in answering questions about monthly temperature values. While I can identify the variables being observed, the way the data is presented, especially with the colors of different months drawn on top of each other, makes it confusing and difficult for me to interpret.

Plot 2: Improve the histogram using ggplot

Outline the bars in white using the color = "white" argument. Use alpha to add some transparency (values between 0 and 1), and then change the binwidth.

Histogram of Average Temp by Month

Add some transparency and white borders around the histogram bars. Here July stands out for having a high frequency of 85-degree temperatures. The dark purple color indicates where months overlap, which is visible because of the transparency.

p2 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity", alpha=0.5, binwidth = 5, color = "white")+
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
p2

Yes, by making the colors more transparent and adding a white border, the data has become much clearer to see. The lighter colors also enhance the visual appeal, making the graph easier on one’s eyes. Overall, the readability of the plot has been significantly improved.

Plot 3: Create side-by-side boxplots categorized by Month

We can see that August has the highest temperatures based on the boxplot distribution.

p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September"))
p3
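
The boxplot reading can be checked numerically. A sketch that summarizes Temp by month with dplyr, run on a fresh copy of the data where Month is still numeric (5 to 9); the column names median_temp, mean_temp, and max_temp are illustrative:

```r
library(dplyr)

temp_by_month <- datasets::airquality |>
  group_by(Month) |>
  summarise(
    median_temp = median(Temp),  # middle line of each box
    mean_temp   = mean(Temp),
    max_temp    = max(Temp)      # highest point or whisker end for the month
  )

temp_by_month
```

Comparing the July and August rows shows how close the two months are, and whether "highest" refers to the median, the mean, or the maximum.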

Plot 4: Make the same side-by-side boxplots, but in grey-scale

Use the scale_fill_grey command for the grey-scale legend, and again, use fill=Month in the aesthetics.

Side by Side Boxplots in Gray Scale

p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot()+
  scale_fill_grey(name = "Month", labels = c("May", "June","July", "August", "September"))
p4

Plot 5: Bar Graph

I wanted to focus on average wind speed by month. I first tried to use mean directly, but I kept getting errors. After looking at an online source (https://sparkbyexamples.com/r-programming/calculate-mean-or-average-in-r/), I was able to calculate the averages, and from that point I plotted them in a bar graph.

avg_wind_per_month <- airquality |>
  group_by(Month) |>
  summarise(Avg_Wind_Speed = mean(Wind, na.rm = TRUE))

print(avg_wind_per_month)

## # A tibble: 5 × 2
##   Month     Avg_Wind_Speed
##   <fct>              <dbl>
## 1 May                11.6 
## 2 June               10.3 
## 3 July                8.94
## 4 August              8.79
## 5 September          10.2

p5 <- ggplot(avg_wind_per_month, aes(x = Month, y = Avg_Wind_Speed, fill = Month)) +
  # Month is already a factor in chronological order, so no relabelling is needed here
  geom_bar(stat = "identity", color = "white", alpha = 0.8, width = 0.7) +  # stat = "identity" uses the precomputed means as bar heights
  labs(
    x = "Month", 
    y = "Average Wind Speed", 
    title = "Average Wind Speed by Month")
print(p5)

The visualization I developed is a bar graph that displays average wind speed by month. I initially chose Temp and Wind as my variables and had a lot of trouble. While I was working, I saw a message in the class Discord saying that we need to pick one quantitative and one qualitative variable for the visualization. As mentioned earlier, I really wanted to work with averages, so I used the online source linked above to figure out how to average the wind speeds. I initially kept trying to use mean("Wind"), but it was not working.

I also used the bar graph tutorial we went through last week. I chose to border the bars in white to make the graph look better. I used alpha=0.8 as in the bar graph tutorial and set width=0.7 because without it the bars were very wide and did not make for a good visualization. I also made sure that my axes were labeled and that the visualization had a title.

When I was writing my code, I kept getting an error when I ended the chunk with "p5". The code shared for the assignment usually ended in "p3" or "p4", so I added print() because that seemed to be the only thing that worked for me.
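
The trouble with mean("Wind") comes from passing the column name as a string: mean() then receives the text "Wind" rather than the numbers in the column, so it returns NA with a warning. A minimal base R sketch of the difference, with tapply() as one alternative way to get per-month means (variable names are illustrative, and Month is numeric in this fresh copy):

```r
aq <- datasets::airquality

# A string is not numeric, so this yields NA (with a warning)
bad <- suppressWarnings(mean("Wind"))

# Pass the column itself to get the overall mean
overall <- mean(aq$Wind)

# Per-month means: apply mean() to Wind within each Month group
by_month <- tapply(aq$Wind, aq$Month, mean)
round(by_month, 2)
```

The per-month values should agree with the Avg_Wind_Speed column in the tibble printed earlier.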