Load the tidyverse library to access dplyr and ggplot2
library(tidyverse)
Warning: package 'readr' was built under R version 4.4.3
Warning: package 'lubridate' was built under R version 4.4.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
This dataset comes from the New York State Department of Conservation and the National Weather Service. It contains daily measurements recorded over five months, from May to September 1973.
Load the dataset into your global environment
Because airquality is a pre-built dataset, we can load it directly with data(); we can also write it to our data directory to store it for later use.
data(airquality)
Look at the structure of the data
In the global environment, click on the row with the airquality dataset and it will take you to a “spreadsheet” view of the data.
View the data using the “head” function
The head function displays only the first 6 rows of the dataset by default. Notice in the global environment to the right that there are 153 observations (rows).
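The call itself is not shown above, so here is a minimal sketch of what it might look like. The variable name first_rows is only for illustration, and the n = 10 call is an added example of asking for a different number of rows.

```r
# airquality ships with base R (datasets package), so no file is needed.
first_rows <- head(airquality)   # first 6 rows by default
first_rows

head(airquality, n = 10)         # ask for a different number of rows
dim(airquality)                  # rows and columns of the full dataset
```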
Notice that all the variables are classified as either integers or continuous (numeric) values.
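One way to confirm the variable types without opening the spreadsheet view is sketched below. It assumes the dataset has not been modified yet (Month is converted later in this tutorial), and column_classes is a hypothetical name.

```r
# Compact overview: one line per column with its type and first values.
str(airquality)

# The class of each column as a named character vector.
column_classes <- sapply(airquality, class)
column_classes
```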
Calculate Summary Statistics
If you want to look at specific statistics, here are some variations on the code. Here are 2 different ways to calculate “mean.”
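The two mean calls are not shown in this text, so the following is a sketch of two common ways to write them. It assumes Temp is the fourth column, as in the unmodified dataset. The names mean_by_name and mean_by_position are illustrative.

```r
# Way 1: pick the column by name with $.
mean_by_name <- mean(airquality$Temp)

# Way 2: pick the column by position (Temp is column 4 of airquality).
mean_by_position <- mean(airquality[, 4])

mean_by_name
mean_by_position
```

Selecting by name is usually safer, because it keeps working if the column order changes.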
median(airquality$Temp)
[1] 79
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154
Rename the Months from numbers to names
Sometimes we prefer the months to be numeric, but here we need them as month names. There are MANY ways to do this. Here is one way to convert the numbers 5-9 to May through September.
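The conversion code does not appear in this text. A minimal sketch of one possible approach, using plain subset assignment, could look like this (it may not be the exact code the tutorial used):

```r
# The first assignment turns the integer column into character;
# the later comparisons still work because R compares "6" with 6 as text.
airquality$Month[airquality$Month == 5] <- "May"
airquality$Month[airquality$Month == 6] <- "June"
airquality$Month[airquality$Month == 7] <- "July"
airquality$Month[airquality$Month == 8] <- "August"
airquality$Month[airquality$Month == 9] <- "September"

table(airquality$Month)   # number of daily rows per month name
```

Running the block a second time leaves the column unchanged, since none of the comparisons match once the names are in place.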
See how Month has changed to hold characters instead of numbers (it is now classified as “character” rather than “integer”)
summary(airquality$Month)
Length Class Mode
153 character character
Month is a categorical variable. R stores categorical variables as factors, and the distinct values of a factor are called levels.
This is one way to reorder the Months so they do not default to alphabetical order (you will see another way to reorder DIRECTLY in the chunk that creates the plot in Plot #1)
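The reordering code is missing from this text. A sketch of one way to do it with factor() follows. The is.numeric guard is an added convenience so the block also runs on a fresh copy of the data, and month_levels is a hypothetical name.

```r
month_levels <- c("May", "June", "July", "August", "September")

# If Month still holds the numbers 5-9, map them to names first.
# month.name is a built-in vector of the twelve English month names.
if (is.numeric(airquality$Month)) {
  airquality$Month <- month.name[airquality$Month]
}

# Setting levels explicitly fixes the order used by plots and legends.
airquality$Month <- factor(airquality$Month, levels = month_levels)
levels(airquality$Month)
```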
Here is a first attempt at viewing a histogram of temperature by the months May through September. We will see that temperatures increase over these months. The center of the distribution appears to be in the upper 70s, consistent with the median of 79 degrees computed above.
fill = Month colors the histogram by months between May-Sept.
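scale_fill_discrete(name = “Month”, ...) provides the month names on the right side as a legend in chronological order. This is a different way to control the legend than what is shown above; note that labels are matched to the levels in their existing order, so it relies on the levels already being chronological.
labs allows us to add a title, axis labels, and a caption for the data source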
Plot 1 Code
p1 <- airquality |>
  ggplot(aes(x = Temp, fill = Month)) +
  geom_histogram(position = "identity") +
  scale_fill_discrete(name = "Month",
                      labels = c("May", "June", "July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept",
       y = "Frequency of Temp",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service") # provide the data source
Plot 1 Output
p1
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Is this plot useful in answering questions about monthly temperature values?
Plot 2: Improve the histogram of Average Temperature by Month
Outline the bars in white using the color = “white” argument
Use alpha to add some transparency (value between 0 and 1)
Change the binwidth
Add some transparency and white borders around the histogram bars.
Plot 2 Code
p2 <- airquality |>
  ggplot(aes(x = Temp, fill = Month)) +
  geom_histogram(position = "identity", alpha = 0.5, binwidth = 5, color = "white") +
  scale_fill_discrete(name = "Month",
                      labels = c("May", "June", "July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept",
       y = "Frequency of Temp",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
p2
Here June stands out for its frequency of 85-degree temperatures. The dark purple color shows where months overlap, which is visible because of the transparency.
Did this improve the readability of the plot?
Plot 3: Create side-by-side boxplots categorized by Month
We can see that August has the highest temperature based on the boxplot distribution.
p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) +
  labs(x = "Months from May through September",
       y = "Temperature",
       title = "Side-by-Side Boxplot of Monthly Temperature",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month",
                      labels = c("May", "June", "July", "August", "September"))
Plot 3 Output
p3
Notice that the points above and below the boxplots in June and July are outliers.
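What the boxplots suggest can also be checked numerically. The sketch below uses base R's tapply to compute the median and maximum temperature for each month. It works whether Month holds numbers or names, and monthly_median and monthly_max are hypothetical names.

```r
# Median temperature per month (the thick line inside each box).
monthly_median <- tapply(airquality$Temp, airquality$Month, median)
monthly_median

# Highest recorded temperature per month.
monthly_max <- tapply(airquality$Temp, airquality$Month, max)
monthly_max
```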
Plot 4: Side by Side Boxplot in Gray Scale
Use the scale_fill_grey function for the grey-scale legend, and again, use fill = Month in the aesthetics.
Plot 4 Code
Here we just changed the color palette to grey scale using scale_fill_grey
p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) +
  labs(x = "Monthly Temperatures",
       y = "Temperatures",
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_grey(name = "Month",
                  labels = c("May", "June", "July", "August", "September"))
Plot 4 Output
p4
Plot 5
p5 <- airquality |>
  ggplot(aes(x = Wind, y = Ozone)) +
  geom_point(color = "purple", size = 3, alpha = 0.6) +
  labs(title = "Scatterplot of Ozone vs Wind",
       x = "Wind",
       y = "Ozone",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_smooth()
p5
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 37 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 37 rows containing missing values or values outside the scale range
(`geom_point()`).
For Plot 5, I created a scatterplot to explore the relationship between Wind speed and Ozone levels using the airquality dataset. I thought these two variables might be connected, and the scatterplot made it easier to see if there’s a pattern.
I used geom_point() to plot the data points in purple with some transparency, so overlapping points would still be visible. Then I added a smoothing line using geom_smooth() to show the general trend. The shape of the line suggests that when the wind is strong, ozone levels tend to be lower. This could be because wind helps clear out pollutants in the air.
I also added a title, axis labels, and a caption to make the plot more complete and professional. I didn’t filter out missing values manually in this case, so ggplot2 gave a warning and dropped the 37 rows with NA values on its own. This plot helped me see how wind might play a role in improving air quality.
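To handle the missing values explicitly instead of relying on the warning, one option is to count them and drop incomplete rows before plotting. This is a sketch in base R. The names aq_complete, ozone_missing, wind_missing, and wind_ozone_cor are illustrative, and the correlation is only a rough linear summary of the curved trend the smoother shows.

```r
# Count missing values in the two plotted variables.
ozone_missing <- sum(is.na(airquality$Ozone))
wind_missing  <- sum(is.na(airquality$Wind))
ozone_missing
wind_missing

# Keep only rows where both Wind and Ozone are present.
aq_complete <- airquality[complete.cases(airquality[, c("Wind", "Ozone")]), ]
nrow(aq_complete)

# A negative value supports the "stronger wind, lower ozone" reading.
wind_ozone_cor <- cor(aq_complete$Wind, aq_complete$Ozone)
wind_ozone_cor
```

Passing aq_complete to the same ggplot call in place of airquality should produce the same picture without the removed-rows warnings.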