Airquality Assignment
Load the library
library(tidyverse)
Loading dataset into Global Environment
data("airquality")
# There are 153 observations and 6 variables
Looking at the structure of the data
head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
# There are 6 variables: Ozone, Solar.R, Wind, Temp, Month, and Day
Calculate summary statistics
# I learned the code below in my Data 101 class with Professor Hairimun
summary(airquality) # Summary statistics for each variable in the dataset. Not useful for Month/Day
     Ozone           Solar.R           Wind             Temp
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00
 NA's   :37       NA's   :7
     Month            Day
 Min.   :5.000   Min.   : 1.0
 1st Qu.:6.000   1st Qu.: 8.0
 Median :7.000   Median :16.0
 Mean   :6.993   Mean   :15.8
 3rd Qu.:8.000   3rd Qu.:23.0
 Max.   :9.000   Max.   :31.0
# Calculating standard deviation and variance for numeric variables
# Ozone
sd(airquality$Ozone)
[1] NA
var(airquality$Ozone)
[1] NA
# Ozone has missing values, so sd and variance come back NA
# Solar.R
sd(airquality$Solar.R)
[1] NA
var(airquality$Solar.R)
[1] NA
# Solar.R has missing values, so sd and variance come back NA
# Wind
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154
# Temp
sd(airquality$Temp)
[1] 9.46527
var(airquality$Temp)
[1] 89.59133
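The NA results for Ozone and Solar.R come from the missing values in those columns. One common way around this, sketched below, is to pass na.rm = TRUE so the missing values are dropped before the calculation. The copy `aq` and the result names are hypothetical and only used for illustration; this is not part of the original assignment.

```r
# Work on a fresh copy of the built-in dataset (the name `aq` is an assumption)
aq <- datasets::airquality

# na.rm = TRUE removes NA values before computing each statistic
ozone_sd  <- sd(aq$Ozone, na.rm = TRUE)
ozone_var <- var(aq$Ozone, na.rm = TRUE)
solar_sd  <- sd(aq$Solar.R, na.rm = TRUE)
solar_var <- var(aq$Solar.R, na.rm = TRUE)

# Count how many values were dropped for each variable
ozone_missing <- sum(is.na(aq$Ozone))
solar_missing <- sum(is.na(aq$Solar.R))

ozone_sd
ozone_var
solar_sd
solar_var
```

Keep in mind that these statistics then describe only the days with recorded values, not all 153 days.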
Renaming months from numeric to names
airquality$Month[airquality$Month == 5] <- "May"
airquality$Month[airquality$Month == 6] <- "June"
airquality$Month[airquality$Month == 7] <- "July"
airquality$Month[airquality$Month == 8] <- "August"
airquality$Month[airquality$Month == 9] <- "September"
Checking to see that months have been successfully renamed and that the variable has changed from integer to character
head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67   May   1
2    36     118  8.0   72   May   2
3    12     149 12.6   74   May   3
4    18     313 11.5   62   May   4
5    NA      NA 14.3   56   May   5
6    28      NA 14.9   66   May   6
#Checking to see if Month has changed from Integer to Character
summary(airquality)
     Ozone           Solar.R           Wind             Temp
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00
 NA's   :37       NA's   :7
    Month                Day
 Length:153         Min.   : 1.0
 Class :character   1st Qu.: 8.0
 Mode  :character   Median :16.0
                    Mean   :15.8
                    3rd Qu.:23.0
                    Max.   :31.0
summary(airquality$Month)
   Length     Class      Mode
      153 character character
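The change from integer to character happens because an R vector can hold only one type: assigning the string "May" into an integer column coerces the whole column to character. The sketch below shows this on a fresh copy of the data, followed by a possible one-step alternative that uses factor() with levels and labels. The names `aq` and `aq2` are hypothetical, and the alternative is only an illustration, not the method used in this assignment.

```r
# Fresh copies of the built-in dataset (names `aq` and `aq2` are assumptions)
aq  <- datasets::airquality
aq2 <- datasets::airquality

class_before <- class(aq$Month)   # type of Month in the original dataset

# A single character assignment coerces the entire column to character
aq$Month[aq$Month == 5] <- "May"
class_after <- class(aq$Month)
partly_renamed <- unique(aq$Month) # remaining months are now strings like "6"

# One-step alternative: map the codes 5 to 9 onto month names as a factor,
# which also fixes the calendar order of the levels
aq2$Month <- factor(aq2$Month,
                    levels = 5:9,
                    labels = c("May", "June", "July", "August", "September"))

class_before
class_after
partly_renamed
levels(aq2$Month)
table(aq2$Month)
```

A factor built this way keeps the months in calendar order when plotting, so no separate reordering step would be needed.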
view(airquality)
Reordering the months so that they do not appear in alphabetical order
# After running head(airquality) above and viewing the dataset in a separate tab, I noticed the rows already follow the correct order. Plots sort character values alphabetically, though, so I will run the code below to set the month order explicitly and make sure.
airquality$Month <- factor(airquality$Month,
                           levels = c("May", "June", "July", "August",
                                      "September"))
view(airquality) # Viewing in a separate tab to double-check the changes
Plot 1: Create a histogram categorized by Month
p1 <- airquality |>
  ggplot(aes(x = Temp, fill = Month)) +
  # position = "identity" overlays the months instead of stacking them
  geom_histogram(position = "identity") +
  scale_fill_discrete(name = "Month",
                      labels = c("May", "June", "July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept",
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
p1
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Plot 2: Improve the histogram of Average Temperature by Month
p2 <- airquality |>
  ggplot(aes(x = Temp, fill = Month)) +
  # alpha makes overlapping bars visible; binwidth = 5 groups temperatures into 5-degree bins
  geom_histogram(position = "identity", alpha = 0.5, binwidth = 5, color = "white") +
  scale_fill_discrete(name = "Month",
                      labels = c("May", "June", "July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept",
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
p2
Plot 3: Create side-by-side boxplots categorized by Month
p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) +
  labs(x = "Months from May through September", y = "Temperatures",
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month",
                      labels = c("May", "June", "July", "August", "September"))
p3
Plot 4: Side by Side Boxplots in Gray Scale
p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) +
  # The x-axis shows months, so it is labeled to match Plot 3
  labs(x = "Months from May through September", y = "Temperatures",
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_grey(name = "Month",
                  labels = c("May", "June", "July", "August", "September"))
p4
Plot 5: Scatterplot of Temperature and Ozone Levels
p5 <- airquality |>
  ggplot(aes(x = Temp, y = Ozone)) +
  geom_point(color = "orange") +
  labs(
    title = "Relationship Between Temperature and Ozone Levels (1973)",
    x = "Temperature (°F)",
    y = "Ozone (ppb)",
    caption = "Source: New York State Department of Conservation and the National Weather Service"
  )
p5
Warning: Removed 37 rows containing missing values or values outside the scale range
(`geom_point()`).
Essay on Plot 5 - Scatterplot demonstrating relationship between temperature and ozone Levels
For Plot 5, I wanted to explore the potential relationship between temperature and ozone levels from May to September of 1973. I vaguely recalled from a high school environmental science class that higher temperatures typically come with higher ozone levels, which can lead to poor air quality. Since temperature and ozone are both continuous variables in this dataset, I felt a scatterplot would be appropriate for this purpose.
I used geom_point to create the scatterplot, with temperature on the x-axis and ozone on the y-axis. I chose the color orange to make the data points easier to see. Overall, there appears to be a positive relationship between ozone levels and temperature: as temperature increases, ozone levels tend to increase as well, while lower temperatures are associated with lower ozone levels. A few points fall outside this pattern, which is expected given the variability in the data.
I will note that R showed a warning that 37 rows containing missing values were removed. These match the 37 missing Ozone values reported in the summary statistics, and I also observed the values labeled “NA” when checking the entire dataset in a separate tab.
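As a rough numerical check on the pattern described above, the sketch below counts the missing values in the two plotted variables and computes a Pearson correlation using only the days where both values are present. The names `aq`, `n_missing_ozone`, `n_missing_temp`, and `temp_ozone_cor` are hypothetical, and a correlation only summarizes a linear trend, so it is meant as a complement to the scatterplot rather than a replacement for it.

```r
# Fresh copy of the built-in dataset (the name `aq` is an assumption)
aq <- datasets::airquality

# Missing values in each plotted variable; rows with an NA in either
# variable are the ones geom_point() drops
n_missing_ozone <- sum(is.na(aq$Ozone))
n_missing_temp  <- sum(is.na(aq$Temp))

# Pearson correlation on complete pairs only
temp_ozone_cor <- cor(aq$Temp, aq$Ozone, use = "complete.obs")

n_missing_ozone
n_missing_temp
temp_ozone_cor
```

A value above zero would be consistent with the positive relationship seen in the plot.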