Load in the library
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Load the data set into the global environment
data("airquality")
head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
Calculate Summary Statistics
mean(airquality$Temp)
## [1] 77.88235
mean(airquality[,4])
## [1] 77.88235
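The mean is only one summary of a column. As a brief sketch (not part of the original assignment code), the base R functions below give the median, the spread, and a five-number summary of the same temperatures. The name `temp` is just a local variable used here for illustration.

```r
# Read the column straight from the built-in data set so this runs on its own
temp <- datasets::airquality$Temp

median(temp)    # middle value of the temperatures
sd(temp)        # standard deviation
var(temp)       # variance (the square of the standard deviation)
summary(temp)   # minimum, quartiles, mean, and maximum in one call
```

Comparing the median with the mean is a quick way to see whether the distribution is skewed.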
Rename the Months from numbers to names
Recode the numbers 5-9 as May through September
airquality$Month[airquality$Month == 5]<- "May"
airquality$Month[airquality$Month == 6]<- "June"
airquality$Month[airquality$Month == 7]<- "July"
airquality$Month[airquality$Month == 8]<- "August"
airquality$Month[airquality$Month == 9]<- "September"
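The five assignments above can also be written as a single call to factor(). This is only a sketch on a separate copy of the data (the name `aq` is hypothetical), using R's built-in `month.name` constant. It recodes the numbers and fixes the chronological order in one step.

```r
# Separate copy, so the airquality object used in the rest of the report is untouched
aq <- datasets::airquality

# levels = the values found in the data, labels = the names to show instead
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

levels(aq$Month)   # month names in chronological order
table(aq$Month)    # number of daily observations in each month
```

Because the result is already a factor with ordered levels, no later reordering step would be needed with this approach.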
Now look at the summary statistics of the data set
See how the months have changed from numbers to names (Month is now a character vector)
summary(airquality$Month)
##    Length     Class      Mode
##       153 character character
Month is a categorical variable with different levels, called factors.
This is one way to reorder the months so they do not default to
alphabetical order (you will see another way to reorder DIRECTLY in the
chunk that creates the plot below in Plot 1)
airquality$Month<-factor(airquality$Month, levels = c("May", "June","July", "August", "September"))
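To see why the levels argument matters, here is a small standalone sketch (the vector `m` is made up for illustration). Without levels, factor() sorts the labels alphabetically. With levels, the order you give is kept.

```r
m <- c("May", "June", "July", "August", "September")

# Default: levels are sorted alphabetically
levels(factor(m))
# "August" "July" "June" "May" "September"

# Explicit levels: chronological order is preserved
levels(factor(m, levels = m))
# "May" "June" "July" "August" "September"
```

ggplot2 uses the level order for axes and legends, which is why the factor step affects every plot that follows.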
Plot 1: Create a histogram categorized by Month
Here is a first attempt at viewing a histogram of temperature by the
months May through September. We will see that temperatures increase
over these months. The center of the distribution is in the high 70s
(the mean computed above is 77.9 degrees).
Reorder the legend so that it is not the default (alphabetical), but
rather in chronological order.
fill = Month colors the histogram by month, May through September.
scale_fill_discrete(name = "Month", ...) provides the month names on the
right side as a legend.
p1 <- airquality |>
ggplot(aes(x=Temp, fill=Month)) +
geom_histogram(position="identity")+
scale_fill_discrete(name = "Month",
labels = c("May", "June","July", "August", "September")) +
labs(x = "Monthly Temperatures from May - Sept",
y = "Frequency of Temps",
title = "Histogram of Monthly Temperatures from May - Sept, 1973",
caption = "New York State Department of Conservation and the National Weather Service") #provide the data source
p1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

No, the graph is not particularly useful in answering questions
about monthly temperature values. While I can identify the variables
being observed, the way the data is presented, especially with the
colors of different months stacked on each other, makes it confusing and
difficult for me to interpret.
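One possible way around the overlapping colors, offered as a sketch and not as part of the assignment, is to give each month its own panel with facet_wrap(). The names `aq` and `p_facets` are hypothetical, and the binwidth of 5 is an assumption chosen for illustration.

```r
library(ggplot2)

# Separate copy with Month as a chronologically ordered factor
aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

p_facets <- ggplot(aq, aes(x = Temp, fill = Month)) +
  geom_histogram(binwidth = 5, color = "white") +
  facet_wrap(~ Month, ncol = 1) +        # one panel per month, stacked vertically
  labs(x = "Temperature", y = "Frequency of Temps") +
  theme(legend.position = "none")        # panel titles already name the month
p_facets
```

Stacking the panels in one column keeps a shared x-axis, so the shift in temperatures from month to month can be read by looking down the page.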
Plot 2: Improve the histogram using ggplot
Outline the bars in white using the color = "white" argument. Use
alpha to add some transparency (values between 0 and 1). Then change
the binwidth.
Histogram of Average Temp by Month
Add some transparency and white borders around the histogram bars.
Here July stands out for having a high frequency of 85-degree
temperatures. The dark purple color indicates where months overlap,
which is visible because of the transparency.
p2 <- airquality |>
ggplot(aes(x=Temp, fill=Month)) +
geom_histogram(position="identity", alpha=0.5, binwidth = 5, color = "white")+
scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September")) +
labs(x = "Monthly Temperatures from May - Sept",
y = "Frequency of Temps",
title = "Histogram of Monthly Temperatures from May - Sept, 1973",
caption = "New York State Department of Conservation and the National Weather Service")
p2

Yes, by making the colors more transparent and adding a white
border, the data has become much clearer to see. The lighter colors also
enhance the visual appeal, making the graph easier on one’s eyes.
Overall, the readability of the plot has been significantly
improved.
Plot 3: Create side-by-side boxplots categorized by Month
We can see that August has the highest temperatures based on the
boxplot distribution.
p3 <- airquality |>
ggplot(aes(Month, Temp, fill = Month)) +
labs(x = "Months from May through September", y = "Temperatures",
title = "Side-by-Side Boxplot of Monthly Temperatures",
caption = "New York State Department of Conservation and the National Weather Service") +
geom_boxplot() +
scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September"))
p3
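A claim read off a boxplot can be checked numerically. The sketch below works on a separate copy of the data (the name `aq` is hypothetical) and computes the median and maximum temperature for each month. Which month counts as "highest" can depend on whether you compare medians or maxima.

```r
aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# Median temperature within each month (the thick line inside each box)
med_by_month <- tapply(aq$Temp, aq$Month, median)
med_by_month

# Maximum temperature within each month (the top whisker or outlier)
max_by_month <- tapply(aq$Temp, aq$Month, max)
max_by_month
```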

Plot 4: Make the same side-by-side boxplots, but in grey-scale
Use the scale_fill_grey command for the grey-scale legend, and
again, use fill=Month in the aesthetics.
Side by Side Boxplots in Gray Scale
p4 <- airquality |>
ggplot(aes(Month, Temp, fill = Month)) +
labs(x = "Months from May through September", y = "Temperatures",
title = "Side-by-Side Boxplot of Monthly Temperatures",
caption = "New York State Department of Conservation and the National Weather Service") +
geom_boxplot()+
scale_fill_grey(name = "Month", labels = c("May", "June","July", "August", "September"))
p4

Plot 5: Bar Graph
I wanted to focus on average wind speed by month. I first tried to
take the mean directly, but I kept getting errors. After looking at an
online source
(https://sparkbyexamples.com/r-programming/calculate-mean-or-average-in-r/),
I was able to compute the monthly averages and then plot them in a bar
graph.
avg_wind_per_month <- airquality |>
group_by(Month) |>
summarise(Avg_Wind_Speed = mean(Wind, na.rm = TRUE))
print(avg_wind_per_month)
## # A tibble: 5 × 2
##   Month     Avg_Wind_Speed
##   <fct>              <dbl>
## 1 May                11.6
## 2 June               10.3
## 3 July                8.94
## 4 August              8.79
## 5 September          10.2
p5 <- ggplot(avg_wind_per_month, aes(x = Month, y = Avg_Wind_Speed, fill = Month)) + # Month is already a factor in chronological order
geom_bar(stat = "identity", color = "white", alpha = 0.8, width = 0.7) + # stat = "identity" plots the precomputed averages instead of counting rows
labs(
x = "Month",
y = "Average Wind Speed",
title = "Average Wind Speed by Month")
print(p5)
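ggplot2 also provides geom_col(), which is shorthand for geom_bar(stat = "identity"). The sketch below rebuilds the monthly averages on a separate copy with base R's aggregate() so that it runs on its own. The names `aq`, `avg_wind`, and `p5_col` are hypothetical.

```r
library(ggplot2)

aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# One row per month, with the mean wind speed in the Wind column
avg_wind <- aggregate(Wind ~ Month, data = aq, FUN = mean)

p5_col <- ggplot(avg_wind, aes(x = Month, y = Wind, fill = Month)) +
  geom_col(color = "white", alpha = 0.8, width = 0.7) +  # bar heights are the values in Wind
  labs(x = "Month",
       y = "Average Wind Speed",
       title = "Average Wind Speed by Month")
p5_col
```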

The visualization I developed using code is a bar graph. It displays
the average wind speed for each month. I initially chose Temp and Wind
as my variables and had a lot of trouble. However, while I was working
I saw a message in the class Discord saying that we need to pick one
quantitative and one qualitative variable for the visualization. As
mentioned earlier, I really wanted to work with averages, so I found the
online source linked above, which helped me figure out how to average
the wind speeds. I initially kept trying to use mean("Wind"), but it was
not working. I also used the bar graph tutorial we went through last
week. I chose to border the bars with white to make the graph look
better. I used alpha=0.8 as in the bar graph tutorial and set width=0.7
because without it the bars were very wide and did not make for a good
visualization. I also made sure that my axes were labeled and that the
visualization had a title. However, when I was writing my code I kept
getting an error when I ended with "p5". The code shared for the
assignment usually ended in "p4" or "p3", so I added print() because
that seemed to be the only thing that worked for me.
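One likely reason mean("Wind") did not work is offered here as an assumption, since the original error message is not shown: putting the name in quotes passes the text "Wind" to mean(), not the column of numbers. The sketch below contrasts the two on a separate copy of the data (the names `aq` and `bad` are hypothetical).

```r
aq <- datasets::airquality

# A quoted name is just a character string, so mean() warns and returns NA
bad <- suppressWarnings(mean("Wind"))
bad

# Referring to the column itself gives the overall average wind speed
mean(aq$Wind)

# One base R way to get the average per month (Month is still 5-9 in this copy)
wind_by_month <- tapply(aq$Wind, aq$Month, mean)
wind_by_month
```

The group_by() and summarise() pipeline used for Plot 5 does the same per-month calculation and returns a tibble that can go straight into ggplot().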