Airquality Assignment

Author

V. Lyon

Airquality Tutorial and Homework Assignment

Load in the library

Load the tidyverse library to access dplyr and ggplot2.

library(tidyverse)

Warning: package 'readr' was built under R version 4.4.3

Warning: package 'lubridate' was built under R version 4.4.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

This dataset contains daily air quality measurements for New York over five months, May through September 1973. The sources are the New York State Department of Conservation (ozone data) and the National Weather Service (meteorological data).

Load the dataset into your global environment

Because airquality is a pre-built dataset that ships with R, we can load it into our global environment with data() and keep working with it from there.

data(airquality)

Look at the structure of the data

In the global environment, click on the row with the airquality dataset and it will take you to a “spreadsheet” view of the data.
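The same structure can also be inspected from the console instead of the spreadsheet view. This is a minimal sketch using base R's str() and dim() on the built-in dataset:

```r
# Compact overview: one line per column with its type and first values
str(airquality)

# Number of rows (observations) and columns (variables)
aq_dim <- dim(airquality)
aq_dim
```

str() is handy in scripts and rendered documents, where the point-and-click viewer is not available.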

View the data using the “head” function

By default, the head function displays only the first 6 rows of the dataset. Notice in the global environment to the right that there are 153 observations (rows).

head(airquality)

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Notice that all the variables are classified as either integers or numeric (continuous) values.
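The head() output above also shows NA entries in Ozone and Solar.R. As a quick sketch, the number of missing values in each column can be counted with base R; the name na_counts is just an illustrative choice:

```r
# Count missing values in each column of the untouched built-in dataset
na_counts <- colSums(is.na(datasets::airquality))
na_counts
```

Knowing which columns contain NA values helps explain the warnings that ggplot2 prints later when it drops incomplete rows.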

Calculate Summary Statistics

If you want to look at specific statistics, here are some variations on coding. The examples below calculate the median of Temp and the standard deviation and variance of Wind.

median(airquality$Temp)

[1] 79

sd(airquality$Wind)

[1] 3.523001

var(airquality$Wind)

[1] 12.41154
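The mean can be calculated in the same way. The sketch below shows two equivalent ways to select the Temp column, by name and by position (Temp is the fourth column); the variable names are illustrative only:

```r
# Select the column by name with $
mean_by_name <- mean(airquality$Temp)

# Select the same column by position: all rows, fourth column
mean_by_index <- mean(airquality[, 4])

mean_by_name
mean_by_index
```

Selecting by name is usually safer, because it keeps working if the column order changes.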

Rename the Months from number to names

Sometimes we prefer the months to be numeric, but here, we need them as the month names. There are MANY ways to do this. Here is one way to convert numbers 5-9 to May through September

# Replace each month number with its name (column names are case-sensitive)
airquality$Month[airquality$Month == 5] <- "May"
airquality$Month[airquality$Month == 6] <- "June"
airquality$Month[airquality$Month == 7] <- "July"
airquality$Month[airquality$Month == 8] <- "August"
airquality$Month[airquality$Month == 9] <- "September"
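As one of the many alternatives mentioned above, here is a sketch that uses R's built-in month.name vector to look up the names by month number. It works on a copy named aq (a name chosen here for illustration) so the original data frame is left alone:

```r
# Work on a fresh copy of the built-in data, where Month is still numeric
aq <- datasets::airquality

# month.name[5] is "May", month.name[9] is "September"
aq$Month <- factor(month.name[aq$Month], levels = month.name[5:9])

levels(aq$Month)
month_counts <- table(aq$Month)
month_counts
```

This does the renaming and the chronological ordering in a single step, and it avoids repeating one assignment per month.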

Now look at the summary statistics of the dataset

See how Month has changed to have characters instead of numbers (it is now classified as “character” rather than “integer”).

summary(airquality$Month)

   Length     Class      Mode 
      153 character character

Month is a categorical variable. R stores categorical variables as factors, and the distinct categories of a factor are called levels.
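To see why level order matters, the following sketch builds a small factor from the five month names. The vector month_names is made up for illustration and is not part of the dataset:

```r
month_names <- c("May", "June", "July", "August", "September")

# Without explicit levels, factor() sorts the levels alphabetically
default_levels <- levels(factor(month_names))
default_levels

# With explicit levels, the order is whatever we specify
ordered_levels <- levels(factor(month_names, levels = month_names))
ordered_levels
```

Plots and legends follow the level order, so alphabetical levels would put August before May.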

This is one way to reorder the months so they do not default to alphabetical order (you will see another approach used directly in the chunk that creates Plot 1).

airquality$Month<-factor(airquality$Month,
                         levels=c("May", "June", "July", "August",
                                  "September"))

Plot 1: Create a histogram categorized by Month

Here is a first attempt at viewing a histogram of temperature by month, May through September. We will see that temperatures tend to increase from May into the summer months. From the plot, the median temperature appears to be in the mid-to-upper 70s (median(airquality$Temp) above returns 79).

fill = Month colors the histogram by month, May through September.
scale_fill_discrete(name = “Month”..) sets the legend title and labels the legend entries on the right side with the month names in chronological order. Note that labels only renames the entries in the order of the existing factor levels, so the labels must be listed in that same order.
labs allows us to add a title, axis labels, and a caption for the data source.

Plot 1 Code

p1 <- airquality |>
  ggplot(aes(x = Temp, fill = Month)) +
  geom_histogram(position = "identity") +
  scale_fill_discrete(name = "Month",
                      labels = c("May", "June", "July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept",
       y = "Frequency of Temp",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service") # provide the data source

Plot 1 Output

p1

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Is this plot useful in answering questions about monthly temperature values?

Plot 2: Improve the histogram of Average Temperature by Month

Outline the bars in white using the color = “white” argument.
Use alpha to add some transparency (a value between 0 and 1).
Change the binwidth.

Plot 2 Code

p2 <- airquality |>
  ggplot(aes(x = Temp, fill = Month)) +
  geom_histogram(position = "identity", alpha = 0.5, binwidth = 5, color = "white") +
  scale_fill_discrete(name = "Month", labels = c("May", "June", "July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept",
       y = "Frequency of Temp",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")

p2

Here July stands out for having a high frequency of temperatures around 85 degrees. The dark purple color indicates where months overlap because of the transparency.

Did this improve the readability of the plot?
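If the overlapping bars are still hard to read, one option is to give each month its own panel. The sketch below is an assumption about what might help rather than part of the assignment; it uses ggplot2's facet_wrap() and recodes Month on a copy named aq (an illustrative name) so it runs on its own:

```r
library(ggplot2)

# Fresh copy with Month as a chronologically ordered factor
aq <- datasets::airquality
aq$Month <- factor(month.name[aq$Month], levels = month.name[5:9])

# One histogram panel per month, stacked vertically on a shared x-axis
p2_facets <- ggplot(aq, aes(x = Temp, fill = Month)) +
  geom_histogram(binwidth = 5, color = "white", show.legend = FALSE) +
  facet_wrap(~ Month, ncol = 1) +
  labs(x = "Temperature",
       y = "Frequency of Temp",
       title = "Histogram of Temperatures by Month, May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")

p2_facets
```

Because the panels share one x-axis, the shift in temperatures from month to month can be compared without relying on transparency.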

Plot 3: Create side-by-side boxplots categorized by Month

We can see that August has the highest temperature based on the boxplot distribution.

p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) +
  labs(x = "Months from May through September", y = "Temperature",
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June", "July", "August", "September"))

Plot 3 Output

p3

Notice that the points above and below the boxplots in June and July are outliers.
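The boxplot can be backed up with numbers. This sketch computes the median and maximum temperature for each month and lists the values that base R's boxplot.stats() flags as outliers with the usual 1.5 x IQR rule. Note that boxplot.stats() uses hinges, so its results may differ slightly from the points geom_boxplot() draws; the names below are illustrative:

```r
# Fresh copy with Month as a chronologically ordered factor
aq <- datasets::airquality
aq$Month <- factor(month.name[aq$Month], levels = month.name[5:9])

# Median and maximum temperature for each month
month_median <- tapply(aq$Temp, aq$Month, median)
month_max <- tapply(aq$Temp, aq$Month, max)
month_median
month_max

# Values beyond the whiskers in each month (empty if there are none)
temp_outliers <- tapply(aq$Temp, aq$Month, function(x) boxplot.stats(x)$out)
temp_outliers
```

Comparing the medians with the maximums makes it clear whether "highest temperature" refers to a typical day or to the single hottest day.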

Plot 4: Side by Side Boxplot in Gray Scale

Use the scale_fill_grey command for the grey-scale legend, and again, use fill = Month in the aesthetics.

Plot 4 Code

Here we just changed the color palette to grey scale using scale_fill_grey.

p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) +
  labs(x = "Months from May through September", y = "Temperature",
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_grey(name = "Month", labels = c("May", "June", "July", "August", "September"))

Plot 4 Output

p4

Plot 5

p5 <- airquality |>
  ggplot(aes(x = Wind, y = Ozone)) +
  geom_point(color = "purple", size = 3, alpha = 0.6) +
  labs(
    title = "Scatterplot of Ozone vs Wind",
    x = "Wind",
    y = "Ozone",
    caption = "New York State Department of Conservation and the National Weather Service"
  ) +
  geom_smooth() # add a smoothed trend line (loess by default for this data size)

p5

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Warning: Removed 37 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 37 rows containing missing values or values outside the scale range
(`geom_point()`).

For Plot 5, I created a scatterplot to explore the relationship between Wind speed and Ozone levels using the airquality dataset. I thought these two variables might be connected, and the scatterplot made it easier to see if there’s a pattern.

I used geom_point() to plot the data points in purple with some transparency, so overlapping points would still be visible. Then I added a smoothing line using geom_smooth() to show the general trend. The shape of the line suggests that when the wind is strong, ozone levels tend to be lower. This could be because wind helps clear out pollutants in the air.

I also added a title, axis labels, and a caption to make the plot more complete and professional. I didn’t filter out missing values manually in this case, so ggplot2 issued warnings and removed the 37 rows with NA values on its own. This plot helped me see how wind might play a role in improving air quality.
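The missing values could also be handled explicitly before plotting. This is a minimal sketch, assuming we only care about the Wind and Ozone columns; it drops incomplete rows with base R's complete.cases() and then checks the direction of the relationship with cor(). The names wind_ozone, n_dropped, and wind_ozone_cor are illustrative:

```r
aq <- datasets::airquality

# Keep only the rows where both Wind and Ozone are present
wind_ozone <- aq[complete.cases(aq[, c("Wind", "Ozone")]), ]

# How many rows were dropped because of missing values
n_dropped <- nrow(aq) - nrow(wind_ozone)
n_dropped

# Pearson correlation: a negative value means ozone tends to fall as wind rises
wind_ozone_cor <- cor(wind_ozone$Wind, wind_ozone$Ozone)
wind_ozone_cor
```

Passing wind_ozone to ggplot() instead of airquality would produce the same plot without the warnings. A correlation only describes the association; it does not by itself show that wind causes lower ozone.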