Source: AirNow.gov
Load library tidyverse in order to access dplyr and ggplot2
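The chunk for this step is not shown; a minimal sketch, assuming the tidyverse package is already installed, would be:

```r
# Attach the tidyverse, which loads dplyr and ggplot2 among other packages.
# Assumes install.packages("tidyverse") has been run once beforehand.
library(tidyverse)
```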
The data come from the New York State Department of Conservation and the National Weather Service. They were recorded daily for five months, from May to September 1973.
Because airquality is a pre-built dataset, we can write it to our data directory to store it for later use.
In the global environment, click on the row with the airquality dataset and it will take you to a “spreadsheet” view of the data.
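The loading step described above might look like the sketch below. The directory name "data" and the file name "airquality.csv" are assumptions made for illustration; they do not appear in the original.

```r
# Load the built-in dataset into the global environment.
data("airquality")

# View(airquality) opens the same "spreadsheet" view from code
# (interactive sessions only, so it is left commented out here).

# Optionally store a copy for later use.
# ASSUMPTION: the folder "data" and the file name are hypothetical.
data_dir <- "data"
dir.create(data_dir, showWarnings = FALSE)
csv_path <- file.path(data_dir, "airquality.csv")
write.csv(airquality, csv_path, row.names = FALSE)
```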
The head function displays only the first 6 rows of the dataset. Notice that the global environment pane to the right lists 153 observations (rows).
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
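The output above can be reproduced with a call along these lines. The variable name first_rows is introduced here only for illustration.

```r
# head() returns the first 6 rows by default.
first_rows <- head(airquality)
first_rows

# nrow() confirms the number of observations reported in the environment pane.
nrow(airquality)
```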
Notice that all the variables are classified as either integers or numeric (continuous) values.
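One way to check the variable types, sketched here with base R only (the name col_classes is an illustrative assumption):

```r
# Start from the untouched built-in dataset.
data("airquality")

# str() prints each column with its type and first few values.
str(airquality)

# sapply() with class() returns just the type of each column.
col_classes <- sapply(airquality, class)
col_classes
```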
If you want to look at specific statistics, here are some variations on coding. Here are 2 different ways to calculate “mean.”
In the second way, the bracket notation [row, column] selects column 4, which is the Temp column. Leaving the row position empty keeps all rows.
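The two approaches can be sketched as follows. The variable names are chosen here for illustration and are not from the original.

```r
# Way 1: refer to the column by name with the $ operator.
mean_by_name <- mean(airquality$Temp)

# Way 2: refer to the column by position; an empty row index means "all rows".
mean_by_index <- mean(airquality[, 4])

# Both calls read the same column, so the results agree.
mean_by_name
mean_by_index
```

Referring to the column by name is usually the safer choice, since it keeps working if the column order changes.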
Sometimes we prefer the months to be numerical, but here we need them as month names. There are MANY ways to do this. Here is one way to convert the numbers 5 through 9 to May through September.
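The conversion chunk is not shown, so the following is only one possible sketch: it replaces each month number with its name, one assignment at a time. The first assignment turns the whole column into character values, and the later comparisons still match because R coerces the number to text before comparing.

```r
# Start from the untouched built-in dataset.
data("airquality")

# Replace each month number with its name.
airquality$Month[airquality$Month == 5] <- "May"
airquality$Month[airquality$Month == 6] <- "June"
airquality$Month[airquality$Month == 7] <- "July"
airquality$Month[airquality$Month == 8] <- "August"
airquality$Month[airquality$Month == 9] <- "September"

# Check the new type of the column.
class(airquality$Month)
```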
See how Month has changed to hold characters instead of numbers (it is now classified as “character” rather than “integer”).
This is one way to reorder the months so they do not default to alphabetical order (you will see another way to reorder DIRECTLY in the chunk that creates Plot #1 below).
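A minimal sketch of the reordering, using factor() with explicit levels. So that it runs on its own, the first lines repeat the name conversion in a compact form; the helper vector month_levels is an assumption introduced here.

```r
# Rebuild the character Month column from the pristine dataset
# (compact stand-in for the conversion step, so this block runs alone).
airquality <- datasets::airquality
month_levels <- c("May", "June", "July", "August", "September")
airquality$Month <- month_levels[airquality$Month - 4]

# As plain characters, the months sort alphabetically.
sort(unique(airquality$Month))

# A factor with explicit levels keeps them in calendar order.
airquality$Month <- factor(airquality$Month, levels = month_levels)
levels(airquality$Month)
```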
Here is a first attempt at viewing a histogram of temperature by the months May through September. We will see that temperatures increase over these months. The median temperature appears to be about 75 degrees.
fill = Month colors the histogram bars by month, from May to September.
scale_fill_discrete(name = “Month”…) supplies the month names for the legend on the right, in chronological order. This is a different way to set the order than the one shown above.
labs lets us add a title, axis labels, and a caption for the data source.
p1 <- airquality |>
  ggplot(aes(x = Temp, fill = Month)) +
  geom_histogram(position = "identity") +
  scale_fill_discrete(name = "Month",
                      labels = c("May", "June", "July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept",
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service") # provide the data source
p1 # display the plot
Is this plot useful in answering questions about monthly temperature values?
Outline the bars in white using the color = “white” argument
Use alpha to add some transparency (values between 0 and 1)
Change the binwidth
Add some transparency and white borders around the histogram bars.
p2 <- airquality |>
  ggplot(aes(x = Temp, fill = Month)) +
  geom_histogram(position = "identity", alpha = 0.5, binwidth = 5, color = "white") +
  scale_fill_discrete(name = "Month", labels = c("May", "June", "July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept",
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
p2 # display the plot
Here July stands out for having a high frequency of 85-degree temperatures. The dark purple color marks where months overlap, which is visible because of the transparency.
Did this improve the readability of the plot?
We can see that August has the highest temperatures based on the boxplot distribution.
p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) +
  labs(x = "Months from May through September", y = "Temperatures",
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June", "July", "August", "September"))
p3 # display the plot
Notice that the points above and below the boxplots in June and July are outliers.
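What the boxplots show can be cross-checked numerically. The sketch below is not part of the original analysis: it works on a separate copy of the data (the name aq is an assumption), summarises temperature by month with dplyr, and lists the values that fall more than 1.5 times the interquartile range beyond the box, which is the default rule geom_boxplot uses for drawing points. boxplot.stats computes its hinges slightly differently from ggplot2's quartiles, so borderline values may not match the plot exactly.

```r
library(dplyr)

# Work on a pristine copy so the modified airquality object is left alone.
# Month is numeric (5 to 9) in this copy.
aq <- datasets::airquality

# Median and maximum temperature for each month.
monthly_temps <- aq |>
  group_by(Month) |>
  summarise(median_temp = median(Temp),
            max_temp = max(Temp),
            .groups = "drop")
monthly_temps

# Values flagged as outliers by the 1.5 * IQR rule, month by month.
outliers_by_month <- lapply(split(aq$Temp, aq$Month),
                            function(x) boxplot.stats(x)$out)
outliers_by_month
```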
Make the same side-by-side boxplots, but in grey-scale
Use the scale_fill_grey command for the grey-scale legend, and again, use fill=Month in the aesthetics.
Here we just changed the color palette to grey-scale using scale_fill_grey.
p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) +
  labs(x = "Monthly Temperatures", y = "Temperatures",
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_grey(name = "Month", labels = c("May", "June", "July", "August", "September"))
p4 # display the plot
1. The histogram plot is useful for analyzing monthly temperature values as it shows the distribution of temperatures for each month with different colors. It helps compare how temperature ranges vary across the months from May to September. To improve clarity, consider adjusting bin width or adding transparency if needed.
2. Yes, this plot improves readability by using side-by-side boxplots to clearly compare temperature distributions across months. It visually distinguishes each month’s temperatures, highlighting August as having the highest temperatures. The plot also identifies outliers in June and July, providing insight into unusual temperature values. The use of distinct colors and clear labels further enhances the plot’s effectiveness in displaying and comparing monthly temperature patterns.
brief essay
Plot Type: The plot is a scatterplot displaying the relationship between Solar Radiation (Solar.R) and Temperature (Temp).
Insights: This scatterplot reveals how solar radiation levels are associated with temperature variations. The plot shows a general trend where higher solar radiation tends to be associated with higher temperatures. The added regression line (black line) provides a visual indication of this relationship, suggesting a positive correlation between the two variables. Different colors represent the months, allowing us to see if this relationship varies by month.
Special Code: I used geom_point() to create the scatterplot and geom_smooth() with method = “lm” to add a linear regression line, which helps to identify the overall trend. The alpha = 0.7 parameter in geom_point() adds transparency to the points, making overlapping points easier to distinguish. The scale_color_discrete() function colors the points by month, adding an additional layer of information to compare how the relationship between solar radiation and temperature might differ across months.
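The code for this plot is not shown, so the following is a reconstruction from the description above rather than the original chunk. The object names aq and p5, the axis labels, and the title are assumptions. Rows with a missing Solar.R value are dropped by ggplot2 with a warning.

```r
library(ggplot2)

# Work on a separate copy with Month as an ordered-by-calendar factor.
aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9,
                   labels = c("May", "June", "July", "August", "September"))

p5 <- ggplot(aq, aes(x = Solar.R, y = Temp)) +
  # Points colored by month; alpha makes overlapping points easier to see.
  geom_point(aes(color = Month), alpha = 0.7) +
  # One black linear regression line fitted across all months.
  geom_smooth(method = "lm", formula = y ~ x, color = "black") +
  scale_color_discrete(name = "Month") +
  labs(x = "Solar Radiation (Solar.R)",
       y = "Temperature (Temp)",
       title = "Temperature vs. Solar Radiation, May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
p5 # display the plot
```

Mapping color inside geom_point rather than in the top-level aes keeps the regression line as a single overall fit; moving color = Month into ggplot(aes(...)) would instead fit one line per month.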