# We will not have to install any packages since this is a pre-built dataset (airquality)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
load the dataset into your global environment
# The NY State Department of Conservation and the National Weather Service worked in tandem to record daily data over a five-month period (May-Sept)
data("airquality")
look at the structure of the data
# We will use the head function here, but be careful: it only displays the first 6 rows of the dataset
head(airquality)
# Since we have data to pull from, we should look into specific stats to get a grasp of the data
mean(airquality$Temp)
[1] 77.88235
calculate median, standard deviation, and variance
median(airquality$Temp)
[1] 79
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154
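The four calls above can also be collected into one small table. This is only a sketch: the helper name `four_stats` is made up for illustration and is not part of base R or the tidyverse.

```r
# Hypothetical helper (not from the original notes): returns the four
# summary statistics used above for one numeric vector
four_stats <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), var = var(x))
}

# Apply it to the two columns examined so far; Temp and Wind contain no
# missing values, so na.rm is not needed here
stats_table <- sapply(datasets::airquality[c("Temp", "Wind")], four_stats)
round(stats_table, 3)
```

Each column of `stats_table` holds the statistics for one variable, which makes it easier to compare Temp and Wind side by side.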
rename the months from number to names
# (Note to self - This may look confusing, but the numbers 5-9 still represent May through Sept)
airquality$Month[airquality$Month == 5] <- "May"
airquality$Month[airquality$Month == 6] <- "June"
airquality$Month[airquality$Month == 7] <- "July"
airquality$Month[airquality$Month == 8] <- "August"
airquality$Month[airquality$Month == 9] <- "September"
now look at the summary statistics of the dataset
# Check whether Month has changed from numbers to characters; if so, the class and mode should now be character
summary(airquality$Month)
Length Class Mode
153 character character
Month is a categorical variable with different levels; in R such variables are called factors
# One way to reorder the months so they are not sorted alphabetically
airquality$Month <- factor(airquality$Month,
                           levels = c("May", "June", "July", "August", "September"))
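The renaming and the factor conversion can also be done in a single step. The sketch below works on a fresh copy of the data (the name `aq` is an assumption for illustration), so it does not disturb the `airquality` object modified above. `month.name` is a constant built into base R.

```r
# Fresh copy of the built-in data, where Month is still stored as 5-9
aq <- datasets::airquality

# levels gives the codes found in the data, labels gives the names to show;
# the order of the levels also fixes the order used in plots and legends
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

levels(aq$Month)
table(aq$Month)
```

Because the labels are attached in level order, this avoids the intermediate character column and the alphabetical-ordering problem in one call.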
plot 1: create a histogram categorized by month
# The first plot will view temperatures through every month. Remember to take notes about the histogram, for example the median temp
p1 <- airquality |>
  ggplot(aes(x = Temp, fill = Month)) +
  geom_histogram(position = "identity") +
  scale_fill_discrete(name = "Month", labels = c("May", "June", "July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept",
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service") # provide the data source
p1
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# scale_fill_discrete sets the title and labels of the fill legend (shown to the right of the plot)
think critically: Is this plot useful in answering questions about monthly temperature values?
# In most places the bars for different months overlap one another, which makes it hard to draw concrete inferences. In short, the plot is hard to read and its message is unclear
plot 2: Improve the histogram using ggplot
# Outline the bars in WHITE using color = "white"
# Use alpha to add some transparency
# Change the binwidth
p2 <- airquality |>
  ggplot(aes(x = Temp, fill = Month)) +
  geom_histogram(position = "identity", alpha = 0.5, binwidth = 5, color = "white") +
  scale_fill_discrete(name = "Month", labels = c("May", "June", "July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept",
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
p2
# Add some transparency and white borders to the look of the histogram bars.
Did the new adjustments improve the readability of the plot, yes or no?
# Yes. The transparency lets overlapping bars show through, the white outlines separate neighboring bars, and the wider bins (binwidth = 5) smooth out the shape, so each month is easier to pick out
Plot 3: create side by side boxplots categorized by Month
# August can be noted to have higher temperatures compared to other months
p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) +
  labs(x = "Months from May through September",
       y = "Temperatures",
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June", "July", "August", "September"))
p3
Be keen to note if there are outliers present, as shown for June and July
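To check which observations a boxplot flags, the outlying temperatures can be listed per month. This is a rough sketch that assumes the usual 1.5 x IQR rule; `boxplot.stats()` computes the box edges slightly differently from `geom_boxplot()`, so the flagged points may not match the plot exactly. The names `aq` and `temp_outliers` are illustrative.

```r
# Fresh copy so the result does not depend on earlier edits to Month
aq <- datasets::airquality

# For each month (coded 5-9), keep the temperatures lying beyond the whiskers
temp_outliers <- tapply(aq$Temp, aq$Month, function(x) boxplot.stats(x)$out)
temp_outliers
```

An empty entry for a month means no temperature in that month fell outside the whiskers under this rule.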
Plot 4: Make the same side by side boxplots, but in grey-scale
# Be sure to use scale_fill_grey for the grey-scale legend, and again, use fill = Month in the aesthetics
p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) +
  labs(x = "Monthly Temperatures",
       y = "Temperatures",
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_grey(name = "Month", labels = c("May", "June", "July", "August", "September"))
p4
# All we did was change the color palette to grey scale using scale_fill_grey
# With this in mind, remember to include both scale_fill_grey and fill = Month in your plot to get the grey-scale fill and its legend
Now make one plot on your own of any of the variables in this dataset. It can be any kind, but be sure to write a brief essay describing it in full, explicit detail
mean(airquality$Wind)
[1] 9.957516
median(airquality$Wind)
[1] 9.7
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154
# binwidth applies to histograms, not boxplots, so it is dropped here
# The axis labels now match the mapping: Month on x, wind speed on y
p5 <- airquality |>
  ggplot(aes(Month, Wind, fill = Month)) +
  geom_boxplot(position = "identity", alpha = 0.8, color = "black") +
  scale_fill_discrete(name = "Month", labels = c("May", "June", "July", "August", "September")) +
  labs(x = "Month (May-Sept)",
       y = "Wind Speed (mph)",
       title = "Reports on Monthly Wind from May-Sept",
       caption = "New York State Department of Conservation and the National Weather Service")
p5
# This plot looks more closely at how monthly wind can be depicted with a box plot.
# Functions such as labs and ggplot, to name a few, are key to making this plot stand out and stay readable.
# Note that June has the most extreme maximum and minimum, which is interesting.
# The plot also makes me think about May because of its reported high winds.
# On which days did those occur? How long did they last?
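The closing questions about which days had the strongest winds can be explored directly in the data. A small sketch on a fresh copy of the dataset (the names `aq` and `windiest` are illustrative, and Month appears here in its original 5-9 coding):

```r
# Fresh copy of the built-in data
aq <- datasets::airquality

# Sort by wind speed, strongest first, and keep the date columns
windiest <- aq[order(-aq$Wind), c("Month", "Day", "Wind")]

# Show the five windiest days on record
head(windiest, 5)
```

Checking whether the top rows fall on consecutive days gives a first idea of how long a windy spell lasted.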