Airquality HW
Load the library
library(tidyverse)
Load the dataset into your global environment
data("airquality")
View the data using the “head” function
head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
Calculate Summary Statistics
mean(airquality$Temp)
[1] 77.88235
or
mean(airquality[,4])
[1] 77.88235
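Temp has no missing values, so mean() works directly. The head() output above shows that Ozone and Solar.R do contain NA, and mean() returns NA for such columns unless told to skip them. A minimal sketch, using a hypothetical copy named aq so the original data frame is left alone:

```r
# Work on a copy of the built-in dataset (the name `aq` is only for illustration)
aq <- datasets::airquality

mean(aq$Ozone)                # NA, because Ozone contains missing values
mean(aq$Ozone, na.rm = TRUE)  # mean of the non-missing readings only
sum(is.na(aq$Ozone))          # how many Ozone readings are missing
```

The same na.rm = TRUE argument is accepted by median(), sd(), and var().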
Calculate Median, Standard Deviation, and Variance
median(airquality$Temp)
[1] 79
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154
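The two Wind results are related: the variance is the square of the standard deviation (3.523001 squared is about 12.41154). A short check, with the sample variance also computed by hand; the variable names here are only for illustration:

```r
wind <- datasets::airquality$Wind

sd_wind  <- sd(wind)
var_wind <- var(wind)

# Variance equals the squared standard deviation
all.equal(sd_wind^2, var_wind)

# Sample variance by hand: squared deviations from the mean, divided by n - 1
manual_var <- sum((wind - mean(wind))^2) / (length(wind) - 1)
all.equal(manual_var, var_wind)
```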
Rename the Months from number to names
airquality$Month[airquality$Month == 5]<- "May"
airquality$Month[airquality$Month == 6]<- "June"
airquality$Month[airquality$Month == 7]<- "July"
airquality$Month[airquality$Month == 8]<- "August"
airquality$Month[airquality$Month == 9]<- "September"
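The five assignments above could also be written as a single lookup. This is only an alternative sketch, not the approach used in this assignment: it relies on month.name, a constant built into base R that holds the twelve month names, and it works on a hypothetical copy named aq.

```r
# Fresh copy with Month still stored as the numbers 5 to 9
aq <- datasets::airquality

# Use each month number as an index into the built-in vector of month names
aq$Month <- month.name[aq$Month]

unique(aq$Month)
```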
Now look at the summary statistics of the dataset
summary(airquality$Month)
Length Class Mode
153 character character
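The output lists only length, class, and mode because Month is now a character vector, and summary() has nothing to count for plain text. Once the same values are stored as a factor, summary() reports how many rows fall in each month. A small sketch of the difference, assuming a hypothetical vector named months_chr built from a fresh copy of the data:

```r
# Month names as plain character strings
months_chr <- month.name[datasets::airquality$Month]
summary(months_chr)   # Length, Class, Mode only

# Without explicit levels, factor() sorts the levels alphabetically
levels(factor(months_chr))

# With explicit levels, the calendar order is kept and summary() counts days
months_fct <- factor(months_chr, levels = month.name[5:9])
counts <- summary(months_fct)
counts
```

The level order matters later on, because ggplot2 uses it to order the legend entries and the boxplots.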
Month is a categorical variable. R stores categorical variables as factors, and the distinct categories of a factor are called levels.
airquality$Month<-factor(airquality$Month,
levels=c("May", "June","July", "August",
"September"))
Plot 1: Create a histogram categorized by Month
p1 <- airquality |>
ggplot(aes(x=Temp, fill=Month)) +
geom_histogram(position="identity")+
scale_fill_discrete(name = "Month",
labels = c("May", "June","July", "August", "September")) +
labs(x = "Monthly Temperatures from May - Sept",
y = "Frequency of Temps",
title = "Histogram of Monthly Temperatures from May - Sept, 1973",
caption = "New York State Department of Conservation and the National Weather Service") #provide the data source
p1
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Plot 2: Improve the histogram of Average Temperature by Month
p2 <- airquality |>
ggplot(aes(x=Temp, fill=Month)) +
geom_histogram(position="identity", alpha=0.5, binwidth = 5, color = "white")+
scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September")) +
labs(x = "Monthly Temperatures from May - Sept",
y = "Frequency of Temps",
title = "Histogram of Monthly Temperatures from May - Sept, 1973",
caption = "New York State Department of Conservation and the National Weather Service")
p2
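Even with alpha = 0.5, five overlapping histograms can be hard to tell apart. One alternative, offered only as a sketch, is to draw one panel per month with facet_wrap(). The names aq and p2_facets are hypothetical, and the data is reloaded so the block runs on its own.

```r
library(ggplot2)

# Fresh copy with Month converted to an ordered set of month names
aq <- datasets::airquality
aq$Month <- factor(month.name[aq$Month], levels = month.name[5:9])

# One histogram per month, stacked vertically so the x-axes line up
p2_facets <- ggplot(aq, aes(x = Temp, fill = Month)) +
  geom_histogram(binwidth = 5, color = "white", show.legend = FALSE) +
  facet_wrap(~ Month, ncol = 1) +
  labs(x = "Temperature (degrees F)", y = "Number of days")
p2_facets
```

The legend is dropped here because the panel titles already identify each month.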
Plot 3: Create side-by-side boxplots categorized by Month
p3 <- airquality |>
ggplot(aes(Month, Temp, fill = Month)) +
labs(x = "Months from May through September", y = "Temperatures",
title = "Side-by-Side Boxplot of Monthly Temperatures",
caption = "New York State Department of Conservation and the National Weather Service") +
geom_boxplot() +
scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September"))
p3
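The box edges and the middle line in each boxplot are the first quartile, median, and third quartile of Temp for that month. These can be computed directly as a numeric companion to the plot. A minimal sketch on a fresh copy of the data, where Month is still the number 5 to 9 and the names are only illustrative:

```r
aq <- datasets::airquality

# Split the temperatures into one vector per month (named "5" to "9")
temp_by_month <- split(aq$Temp, aq$Month)

# Quartiles per month: rows are "25%", "50%", "75%"; columns are the months
quartiles <- sapply(temp_by_month, quantile, probs = c(0.25, 0.5, 0.75))
quartiles
```

The whiskers and outlier points in the plot are derived from these quartiles, so they are not shown in this table.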
Plot 4: Side by Side Boxplots in Gray Scale
p4 <- airquality |>
ggplot(aes(Month, Temp, fill = Month)) +
labs(x = "Months from May through September", y = "Temperatures",
title = "Side-by-Side Boxplot of Monthly Temperatures",
caption = "New York State Department of Conservation and the National Weather Service") +
geom_boxplot()+
scale_fill_grey(name = "Month", labels = c("May", "June","July", "August", "September"))
p4
Plot 5: Scatterplot of Ozone and Solar Radiation by Month
p5 <- airquality |>
ggplot(aes(x=Solar.R, y=Ozone, colour = Month)) +
geom_point() +
labs(x = "Solar Radiation (lang)",
y = "Ozone (ppb)",
title = "Scatterplot of Ozone and Solar Radiation",
caption = "New York State Department of Conservation and the National Weather Service")
p5
Warning: Removed 42 rows containing missing values or values outside the scale range
(`geom_point()`).
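The warning appears because geom_point() cannot draw a point when either coordinate is NA. The 42 dropped rows can be confirmed by counting the rows where Ozone or Solar.R is missing. A short sketch on a fresh copy of the data, with illustrative names:

```r
aq <- datasets::airquality

# TRUE for rows where Ozone or Solar.R (or both) is missing
incomplete <- !complete.cases(aq[, c("Ozone", "Solar.R")])

sum(incomplete)    # rows the scatterplot leaves out
sum(!incomplete)   # rows that are actually plotted
```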
Write a brief essay here
For plot 5, I created a scatterplot of ozone (in parts per billion) against solar radiation (in Langleys). The plot suggests a relationship between the two variables: the highest ozone readings fall between roughly 150 and 300 Langleys of solar radiation, and many of the points with elevated ozone in that range are from July and August. To make this plot, I piped the airquality dataset into the tidyverse library’s ggplot() function and added geom_point() to draw the scatterplot. I used the labs() function to set the title, caption, and axis labels. I included “colour = Month” in the aes() call inside ggplot() so that each data point is colored by month and a key assigns each color to a month from the dataset (from intro2r.com 5.2.2).
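The visual impression of a relationship between ozone and solar radiation can be backed with a number. The sketch below computes a correlation coefficient using only the rows where both values are present. It is an optional check rather than part of the original assignment, and the names are illustrative.

```r
aq <- datasets::airquality

# Pearson correlation, ignoring rows where either variable is NA
r_pearson <- cor(aq$Ozone, aq$Solar.R, use = "complete.obs")

# Spearman rank correlation, which is less sensitive to a few extreme readings
r_spearman <- cor(aq$Ozone, aq$Solar.R, use = "complete.obs", method = "spearman")

c(pearson = r_pearson, spearman = r_spearman)
```

Values near 0 indicate a weak linear association and values near 1 a strong positive one, so the result shows how far the pattern in the scatterplot can be described by a straight line.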