Airquality HW
Load the library
library(tidyverse)
Load the dataset into your global environment
data("airquality")
Look at the structure of the data
View the data using the “head” function
head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
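The heading "Look at the structure of the data" has no code of its own in this report. A minimal sketch of one way to do it with base R's `str()` and `dim()` follows. The name `aq` is a hypothetical working copy introduced here so that the original `airquality` object is left untouched.

```r
# Work on a fresh copy of the built-in data set (aq is a hypothetical name).
aq <- datasets::airquality

# Compact overview: one line per column with its type and first few values.
str(aq)

# Number of rows and columns.
dim(aq)
```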
Calculate Summary Statistics
mean(airquality$Temp)
[1] 77.88235
mean(airquality[,4])
[1] 77.88235
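Both calls above return the same mean because `Temp` is the fourth column. As a small sketch (using a hypothetical copy named `aq`), the three common ways of selecting that column can be compared directly:

```r
aq <- datasets::airquality  # hypothetical working copy

by_name     <- aq$Temp       # select by name with $
by_position <- aq[, 4]       # select by position (Temp is column 4)
by_string   <- aq[["Temp"]]  # select by name with [[ ]]

# All three return the same numeric vector, so the means agree.
identical(by_name, by_position)
identical(by_name, by_string)
mean(by_name)
```

Selecting by name is usually safer than selecting by position, since the code keeps working if the column order changes.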
Calculate Median, Standard Deviation, and Variance
median(airquality$Temp)
[1] 79
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154
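The standard deviation and variance above are related: the variance is the square of the standard deviation (3.523001 squared is about 12.41154). A short sketch that checks this, and also recomputes the sample variance by hand, might look like the following. The names `wind` and `manual_var` are introduced here for illustration only.

```r
wind <- datasets::airquality$Wind

# The variance is the squared standard deviation.
all.equal(sd(wind)^2, var(wind))

# Sample variance by hand: squared deviations from the mean, divided by n - 1.
n <- length(wind)
manual_var <- sum((wind - mean(wind))^2) / (n - 1)
all.equal(manual_var, var(wind))
```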
Rename the Months from number to names
airquality$Month[airquality$Month == 5] <- "May"
airquality$Month[airquality$Month == 6] <- "June"
airquality$Month[airquality$Month == 7] <- "July"
airquality$Month[airquality$Month == 8] <- "August"
airquality$Month[airquality$Month == 9] <- "September"
Now look at the summary statistics of the dataset
summary(airquality$Month)
Length Class Mode
153 character character
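The summary above reports Month as a character vector, which R would sort alphabetically in plots. As an alternative sketch (not the approach used in this report), the month numbers can be turned into an ordered set of named categories in one step with the `levels` and `labels` arguments of `factor()`. The names `aq` and `month_names` are hypothetical and used only in this example.

```r
aq <- datasets::airquality  # fresh copy, Month is still numeric (5 to 9)
month_names <- c("May", "June", "July", "August", "September")

# Map the numbers 5..9 to names and fix the calendar order at the same time.
aq$Month <- factor(aq$Month, levels = 5:9, labels = month_names)

levels(aq$Month)  # calendar order
table(aq$Month)   # number of daily observations per month

# Without explicit levels, a character vector is ordered alphabetically.
levels(factor(as.character(aq$Month)))
```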
Month is a categorical variable. R stores categorical variables as factors, and the distinct categories of a factor are called levels.
airquality$Month <- factor(airquality$Month,
                           levels = c("May", "June", "July", "August",
                                      "September"))
Plot 1: Create a histogram categorized by Month
Plot 1 Code
p1 <- airquality |>
  ggplot(aes(x = Temp, fill = Month)) +
  geom_histogram(position = "identity") +
  scale_fill_discrete(name = "Month",
                      labels = c("May", "June", "July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept",
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service") # provide the data source
Plot 1 Output
p1
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Plot 2: Improve the histogram of Average Temperature by Month
Plot 2 Code
p2 <- airquality |>
  ggplot(aes(x = Temp, fill = Month)) +
  geom_histogram(position = "identity", alpha = 0.5, binwidth = 5, color = "white") +
  scale_fill_discrete(name = "Month", labels = c("May", "June", "July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept",
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
Plot 2 Output
p2
Plot 3: Create side-by-side boxplots categorized by Month
p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) +
  labs(x = "Months from May through September", y = "Temperatures",
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June", "July", "August", "September"))
Plot 3 Output
p3
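The boxes in this plot summarize each month's temperatures by their quartiles. A rough sketch of how approximately the same numbers can be computed in base R follows. The names `aq`, `month_names`, and `temp_quartiles` are hypothetical, and the data are reloaded so the example runs on its own.

```r
aq <- datasets::airquality  # fresh copy for a self-contained example
month_names <- c("May", "June", "July", "August", "September")
aq$Month <- factor(aq$Month, levels = 5:9, labels = month_names)

# Split Temp by month, then compute the 25th, 50th and 75th percentiles.
# The result is a matrix with one column per month.
temp_quartiles <- sapply(split(aq$Temp, aq$Month),
                         quantile, probs = c(0.25, 0.5, 0.75))
temp_quartiles
```

The middle row is the median drawn as the line inside each box, and the other two rows correspond to the bottom and top edges of the box.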
Plot 4: Side by Side Boxplots in Gray Scale
Plot 4 Code
p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) +
  labs(x = "Monthly Temperatures", y = "Temperatures",
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_grey(name = "Month", labels = c("May", "June", "July", "August", "September"))
Plot 4 Output
p4
Plot 5:
mean_ozone <- mean(airquality$Ozone, na.rm = TRUE)
p5 <- airquality |>
  ggplot(aes(x = Ozone, fill = Month)) +
  geom_histogram(position = "identity", alpha = .3, binwidth = 10, color = "white") +
  geom_vline(aes(xintercept = mean(Ozone, na.rm = TRUE)), color = "darkred", linetype = "solid", linewidth = 1) +
  scale_fill_brewer(name = "Month", labels = c("May", "June", "July", "August", "September"), palette = "RdPu", direction = 1) +
  labs(x = "Ozone Levels from May - Sept",
       y = "Frequency of Ozone",
       title = "Ozone Levels in May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
Plot 5 Output
p5
Warning: Removed 37 rows containing non-finite outside the scale range
(`stat_bin()`).
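The warning appears because the Ozone column contains missing values, which the histogram drops before binning. A small sketch (with a hypothetical copy named `aq`) that counts them and shows why `na.rm = TRUE` is needed when computing the mean:

```r
aq <- datasets::airquality  # hypothetical working copy

# Number of missing ozone readings; this matches the rows removed in the warning.
sum(is.na(aq$Ozone))

# Without na.rm, a single NA makes the whole mean NA.
mean(aq$Ozone)

# With na.rm = TRUE, the mean is computed over the observed values only.
mean(aq$Ozone, na.rm = TRUE)
```

If the warning is not wanted in the rendered report, passing `na.rm = TRUE` to `geom_histogram()` should remove the missing values silently.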
Write a brief essay here
The visualization I created for Plot 5 shows ozone levels from May through September. I reused the code from Plot 2 with two main changes: I replaced the default color palette and added a line marking the mean.
First, I changed the plotted variable from Temp to Ozone in the line “ggplot(aes(x=Ozone, fill=Month))”. Then I adjusted the titles to fit the data.
For the color palette, I looked at the “scale_fill_discrete” code from the tutorial and found a similar function that worked, “scale_fill_brewer”. The “brewer” part refers to the ColorBrewer palettes, which are one way to change the colors of a plot. This led me to research R’s color palettes further, where I found a palette (RdPu) that I like and that suits the data. I also adjusted the bin width and the opacity of the histogram bars so the plot is clearer to read.
Code for the color palette:
scale_fill_brewer(name = "Month", labels = c("May", "June", "July", "August", "September"), palette = "RdPu", direction = 1)
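To see which colors a palette such as RdPu actually contains, the RColorBrewer package can be queried directly. This is only a sketch, and it assumes RColorBrewer is installed (it is typically pulled in alongside ggplot2). The name `pal` is introduced here for illustration.

```r
# Five colors from the RdPu palette, one per month, as hex codes.
pal <- RColorBrewer::brewer.pal(n = 5, name = "RdPu")
pal

# Uncomment to draw an overview of all available ColorBrewer palettes.
# RColorBrewer::display.brewer.all()
```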
I added the mean line with “geom_vline”, which I found while analyzing the previous plot’s code. Hovering over the function showed me how to use it, and I then customized the line’s color, width, and style so that it stands out against the data. I did that by adding this code:
geom_vline(aes(xintercept = mean(Ozone, na.rm = TRUE)), color = "darkred", linetype = "solid", linewidth = 1)
Overall, the dark red line makes the data easier to analyze by showing the mean ozone level from May through September. This can help identify outliers or abnormal levels. Changing the color palette to include lighter shades also makes it easier to see where the ozone frequencies of different months overlap.
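The Plot 5 code computes `mean_ozone` but then recalculates the mean inside `aes()`. As an alternative sketch (not the code used for the plot above), the precomputed value can be passed to `geom_vline()` as a constant. The names `aq` and `p_mean` are hypothetical, and the month fill is left out to keep the example short.

```r
library(ggplot2)

aq <- datasets::airquality  # hypothetical working copy

# Compute the mean once, ignoring missing ozone readings.
mean_ozone <- mean(aq$Ozone, na.rm = TRUE)

# Pass the value as a constant, outside aes(), since it does not vary by row.
p_mean <- ggplot(aq, aes(x = Ozone)) +
  geom_histogram(binwidth = 10, color = "white", na.rm = TRUE) +
  geom_vline(xintercept = mean_ozone, color = "darkred", linewidth = 1)

p_mean
```

Both approaches place the line at the same position. The constant version makes it explicit that a single overall mean is drawn, not one mean per month.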