AirQualityHW
Airquality Tutorial and Homework Assignment
Load the library
library(tidyverse)
Load the dataset into your global environment
data("airquality")
Look at the structure of the data
View the data using the “head” function
head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
Calculate Summary Statistics
mean(airquality$Temp)
[1] 77.88235
Calculate Median, Standard Deviation, and Variance
median(airquality$Temp)
[1] 79
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154
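Temp and Wind have no missing values, but Ozone and Solar.R do (see the NA entries in rows 5 and 6 of the head output). As a minimal sketch that is not part of the original assignment, the same functions can be applied to Ozone by passing na.rm = TRUE. The names ozone_mean and wind_sd are only illustrative.

```r
# Built-in dataset; no packages required
data("airquality")

# Without na.rm = TRUE, a single NA in the column makes the result NA
mean(airquality$Ozone)

# Drop the missing values before averaging
ozone_mean <- mean(airquality$Ozone, na.rm = TRUE)
ozone_mean

# The variance is the square of the standard deviation
wind_sd <- sd(airquality$Wind)
all.equal(wind_sd^2, var(airquality$Wind))
```

The last line is a quick consistency check on the two Wind results printed above.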
Rename the months from numbers to names
airquality$Month[airquality$Month == 5] <- "May"
airquality$Month[airquality$Month == 6] <- "June"
airquality$Month[airquality$Month == 7] <- "July"
airquality$Month[airquality$Month == 8] <- "August"
airquality$Month[airquality$Month == 9] <- "September"
Now look at the summary statistics of the dataset
summary(airquality$Month)
   Length     Class      Mode
      153 character character
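Month is a categorical variable. In R, categorical variables are stored as factors, and their distinct values are called levels. Setting the levels explicitly keeps the months in calendar order instead of alphabetical order.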
airquality$Month <- factor(airquality$Month,
                           levels = c("May", "June", "July", "August",
                                      "September"))
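The recoding and factor steps above can also be combined. This is an alternative sketch, not the method used in this tutorial: it works on a copy with the hypothetical name aq and relies on R's built-in month.name vector.

```r
# Start from the untouched built-in data so Month is still numeric (5 to 9)
aq <- datasets::airquality

# month.name is a built-in vector: "January", ..., "December"
aq$Month <- factor(month.name[aq$Month], levels = month.name[5:9])

# Levels are in calendar order, not alphabetical order
levels(aq$Month)

# Number of daily observations in each month
table(aq$Month)
```

Because the copy is taken from datasets::airquality, running this block does not change the airquality object used in the rest of the tutorial.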
Plot 1: Create a histogram categorized by Month
p1 <- airquality |>
  ggplot(aes(x = Temp, fill = Month)) +
  geom_histogram(position = "identity") +
  scale_fill_discrete(name = "Month",
                      labels = c("May", "June", "July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept",
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service") # provide the data source
Plot 1 Output
p1
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Plot 2: Improve the histogram of Average Temperature by Month
p2 <- airquality |>
  ggplot(aes(x = Temp, fill = Month)) +
  geom_histogram(position = "identity", alpha = 0.5, binwidth = 5, color = "white") +
  scale_fill_discrete(name = "Month", labels = c("May", "June", "July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept",
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
Plot 2 Output
p2
Plot 3: Create side-by-side boxplots categorized by Month
p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) +
  labs(x = "Months from May through September", y = "Temperatures",
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June", "July", "August", "September"))
Plot 3 Output
p3
Plot 4: Side by Side Boxplots in Gray Scale
Plot 4 Code
p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) +
  labs(x = "Months from May through September", y = "Temperatures",
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_grey(name = "Month", labels = c("May", "June", "July", "August", "September"))
Plot 4 Output
p4
Plot 5: Using a 3 factor linear model to predict Ozone levels
Plot 5 Code
clean_air <- na.omit(airquality) ## get rid of those nasty NA's
model <- lm(Ozone ~ Temp + Wind + Solar.R, data = clean_air)
clean_air$Predicted_Ozone <- predict(model)
plot(clean_air$Ozone, clean_air$Predicted_Ozone,
     xlab = "Actual Ozone",
     ylab = "Predicted Ozone",
     main = "Actual vs. Predicted Ozone Levels",
     pch = 19, col = "steelblue")
# Fit an order 2 polynomial
poly_fit <- lm(Predicted_Ozone ~ poly(Ozone, 2), data = clean_air)
# Generate sequence of Ozone values for smooth curve
ozone_seq <- seq(min(clean_air$Ozone), max(clean_air$Ozone), length.out = 100)
# Predict fitted values from polynomial model
curve_points <- predict(poly_fit, newdata = data.frame(Ozone = ozone_seq))
# Add the polynomial curve
lines(ozone_seq, curve_points, col = "forestgreen", lwd = 2)
In this code I built a linear model to predict ozone levels from three environmental factors: temperature, wind speed, and solar radiation. This is a multiple linear regression with three predictors, which fits a straight-line relationship between these inputs and ozone concentration. I then compared the predicted ozone levels to the actual ozone levels recorded in New York City from May through September of 1973. The scatter plot of predicted against actual levels was not well described by a straight line, so I added a quadratic polynomial fit to the plot of actual vs. predicted ozone values. This curve helps reveal patterns that a straight-line fit might miss. Where the predicted values sit consistently above or below the actual values, the model is overestimating or underestimating ozone in those conditions.
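To check for over- or underestimation as described above, one option is to draw the line where predicted equals actual and to look at the residuals. The following is a minimal sketch under assumptions, not part of the original analysis: the names air_complete, ozone_model, and resid_vals are illustrative, and abline(0, 1) is simply the y = x reference line.

```r
# Use the untouched built-in data so this block runs on its own
air_complete <- na.omit(datasets::airquality)

# Same three-predictor model as in the Plot 5 code
ozone_model <- lm(Ozone ~ Temp + Wind + Solar.R, data = air_complete)
air_complete$Predicted_Ozone <- predict(ozone_model)

plot(air_complete$Ozone, air_complete$Predicted_Ozone,
     xlab = "Actual Ozone", ylab = "Predicted Ozone",
     main = "Actual vs. Predicted Ozone with y = x Reference",
     pch = 19, col = "steelblue")

# Points on this dashed line are predicted exactly
abline(a = 0, b = 1, lty = 2)

# Residual = actual - predicted; positive means the model underestimated
resid_vals <- residuals(ozone_model)
summary(resid_vals)

# Share of the variation in Ozone explained by the model
summary(ozone_model)$r.squared
```

Points below the dashed line are days where the model predicted less ozone than was observed, and points above it are days where it predicted more.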
N.B. I did have a chat with Copilot about this modeling. However, I selected the quadratic fit line myself as I had used it before in DATA 101 just for fun.