averma13_DATA110_Week_2_Airquality_Assignment

Author

Adi Ve

Airquality Homework Assignment

https://knowyourmeme.com/memes/elmo-rise

Load in the Library

Just btw, I’m copying all the text from Professor Saidi’s original work too.

Because airquality is a pre-built dataset, we can write it to our data directory to store it for later use.
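A minimal sketch of that step, assuming a folder named data inside the working directory. The folder name, file name, and variable names are made up for illustration and are not from the original assignment:

```r
# Hypothetical location: a "data" folder in the current working directory
data_dir <- "data"
if (!dir.exists(data_dir)) dir.create(data_dir)

# airquality ships with base R in the datasets package
csv_path <- file.path(data_dir, "airquality.csv")
write.csv(datasets::airquality, csv_path, row.names = FALSE)

# Read it back to confirm the round trip kept every row and column
aq_saved <- read.csv(csv_path)
dim(aq_saved)
```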

The data come from the New York State Department of Conservation and the National Weather Service. They were recorded daily for five months, from May to September 1973.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the dataset into your global environment

data("airquality")

Look at the structure of the data

The head function displays only the first 6 rows of the dataset. Notice in the global environment pane to the right that there are 153 observations (rows).

View the data using the “head” function

head(airquality)

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
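Since head only shows the first six rows, a few base R calls can confirm the overall shape and show where the NA values above come from. This is a quick sketch on a fresh copy of the dataset; the name aq is just for illustration:

```r
aq <- datasets::airquality  # fresh copy, untouched by later steps

dim(aq)             # number of rows and columns
str(aq)             # column types and a preview of the values
colSums(is.na(aq))  # number of missing values in each column
```

Ozone and Solar.R are the only columns with missing values, which matters for any summary statistic computed on them.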

Calculate summary statistics

mean(airquality$Temp)

[1] 77.88235

mean(airquality[,4])

[1] 77.88235

median(airquality$Temp)

[1] 79

sd(airquality$Wind)

[1] 3.523001

var(airquality$Wind)

[1] 12.41154
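Temp and Wind have no missing values, so the calls above work directly. For a column with NAs, such as Ozone, the same functions return NA unless na.rm = TRUE is set. A small sketch (aq is an illustrative name):

```r
aq <- datasets::airquality

mean(aq$Ozone)                # NA, because Ozone has missing values
mean(aq$Ozone, na.rm = TRUE)  # mean of the non-missing readings

# The variance is the square of the standard deviation
all.equal(sd(aq$Wind)^2, var(aq$Wind))
```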

Rename the months from numbers to names

Numbers 5 - 9 become May through September

airquality$Month[airquality$Month == 5]<- "May"
airquality$Month[airquality$Month == 6]<- "June"
airquality$Month[airquality$Month == 7]<- "July"
airquality$Month[airquality$Month == 8]<- "August"
airquality$Month[airquality$Month == 9]<- "September"

Now look at the summary statistics of the dataset

See how Month has changed to have characters instead of numbers

summary(airquality$Month)

   Length     Class      Mode 
      153 character character

Month is a categorical variable. In R, categorical variables are stored as factors, and their distinct categories are called levels.

This is one way to reorder the Months so they do not default to alphabetical (you will see another way to reorder DIRECTLY in the chunk that creates the plot below in Plot 1)

airquality$Month<-factor(airquality$Month, levels=c("May", "June","July", "August", "September"))
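The rename and the reorder can also be done in one step. This sketch starts from a fresh copy with numeric months and uses R's built-in month.name vector; aq is an illustrative name, and this is an alternative to the approach above, not a replacement for it:

```r
aq <- datasets::airquality  # Month is still numeric (5 to 9) here

# month.name holds "January" ... "December"; index it by month number,
# then fix the level order to May through September
aq$Month <- factor(month.name[aq$Month], levels = month.name[5:9])

levels(aq$Month)  # chronological, not alphabetical
table(aq$Month)   # days recorded in each month
```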

Plot 1: Create a histogram categorized by Month

Here is a first attempt at viewing a histogram of temperature by the months May through September. We will see that temperatures rise from May into the summer months. The median temperature computed above is 79 degrees.

Reorder the legend so that it is not the default (alphabetical), but rather in chronological order.

fill = Month colors the histogram by months between May - Sept.

scale_fill_discrete(name = “Month”…) provides the month names on the right side as a legend.

p1 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity")+
  scale_fill_discrete(name = "Month", 
                      labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")  #provide the data source
p1

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

I guess this plot could be somewhat useful? But basically only as much as having the computer do the counts for you. It’s so ugly that it’s hard to tell what’s even going on: with position = "identity" the bars for different months are drawn on top of each other, so I can’t know if there are bars hidden under the ones I can see.

Plot 2: Improve the histogram using ggplot

Outline the bars in white using the color = “white” argument

Use alpha to add some transparency (values between 0 and 1)

Change the binwidth

Histogram of Average Temperature by Month

Add some transparency and white borders around the histogram bars. Here July stands out for having high frequency of 85 degree temperatures. The dark purple color indicates overlaps of months due to the transparency.

p2 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity", alpha=0.5, binwidth = 5, color = "white")+
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
p2

Yes, this improved the readability of the graph a lot. I can actually see roughly where the bars are, but it is still super crowded and hard to read.
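One possible way to reduce the crowding is to give each month its own panel with facet_wrap, so that no bars overlap. This is a sketch of an alternative, not part of the original assignment; aq and p2_facet are illustrative names:

```r
library(ggplot2)

aq <- datasets::airquality
aq$Month <- factor(month.name[aq$Month], levels = month.name[5:9])

p2_facet <- ggplot(aq, aes(x = Temp, fill = Month)) +
  geom_histogram(binwidth = 5, color = "white", show.legend = FALSE) +
  facet_wrap(~ Month, ncol = 1) +  # one panel per month, stacked vertically
  labs(x = "Daily Temperature (F)",
       y = "Frequency of Temps",
       title = "Histogram of Temperatures by Month, May - Sept, 1973")
p2_facet
```

Stacking the panels in one column keeps the temperature axis aligned, which makes it easier to compare months by eye.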

Plot 3: Create side-by-side boxplots categorized by Month

We can see that August has the highest temperatures based on the boxplot distribution.

p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September"))
p3

Notice that the points above and below the boxplots in June and July are outliers.
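To list those points rather than read them off the plot, base R's boxplot.stats flags values that lie more than 1.5 times the interquartile range beyond the box. This is a sketch, and it is only closely related to what geom_boxplot draws: the two use slightly different quartile definitions, so results can differ in edge cases. The names aq and outliers_by_month are illustrative:

```r
aq <- datasets::airquality
aq$Month <- factor(month.name[aq$Month], levels = month.name[5:9])

# Temperatures flagged as outliers within each month (1.5 * IQR rule)
outliers_by_month <- tapply(aq$Temp, aq$Month,
                            function(x) boxplot.stats(x)$out)
outliers_by_month

# Median and maximum temperature for each month
tapply(aq$Temp, aq$Month, median)
tapply(aq$Temp, aq$Month, max)
```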

Plot 4: Make the same side-by-side boxplots, but in grey-scale

Use the scale_fill_grey command for the grey-scale legend, and again, use fill=Month in the aesthetics

Side by Side Boxplots in Gray Scale

Here we just changed the color palette to gray scale using scale_fill_grey

p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot()+
  scale_fill_grey(name = "Month", labels = c("May", "June","July", "August", "September"))
p4

Plot 5: Now make one plot on your own of any of the variables in this dataset. It may be a scatterplot, histogram, or boxplot.

model <- lm(airquality$Ozone~airquality$Temp)
# wow this is so cool, I didn't expect there to be such a strong
# relationship right off the bat
summary(model)


Call:
lm(formula = airquality$Ozone ~ airquality$Temp)

Residuals:
    Min      1Q  Median      3Q     Max 
-40.729 -17.409  -0.587  11.306 118.271 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -146.9955    18.2872  -8.038 9.37e-13 ***
airquality$Temp    2.4287     0.2331  10.418  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23.71 on 114 degrees of freedom
  (37 observations deleted due to missingness)
Multiple R-squared:  0.4877,    Adjusted R-squared:  0.4832 
F-statistic: 108.5 on 1 and 114 DF,  p-value: < 2.2e-16

plot(model)
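The same regression can also be written with the data argument, which keeps the coefficient names shorter and makes it easy to pull individual numbers out of the fit. A sketch with illustrative names (aq, model_alt):

```r
aq <- datasets::airquality

# Same regression as above, written with data = so the term is just "Temp"
model_alt <- lm(Ozone ~ Temp, data = aq)

coef(model_alt)               # intercept and slope
summary(model_alt)$r.squared  # about 0.49, matching the summary output
nobs(model_alt)               # rows used after dropping missing Ozone values
```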

p5 <- ggplot(airquality,aes(x=Temp, y=Ozone)) +
          geom_point() +
          geom_smooth(method='loess', aes(color='Loess'), fill='#b8e186') +
          geom_smooth(method='lm', aes(color='Linear'), fill='#f1b6da') +
          theme_minimal() +
          labs(x='Temperature (F)', y='Ozone (ppb)', title='Models of Ozone vs. Temperature',
               caption = 'all p of linear model < 0.01, R^2~0.48') +
          theme(plot.title = element_text(hjust=0.5, size=20, face='bold')) +
          scale_color_manual(values = c(Loess = '#4dac26', Linear = '#d01c8b'),
                              labels = c(Loess = "Loess", Linear = "Linear"))
          
suppressWarnings(print(p5))

`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

The plot I created graphs Ozone against Temperature as a scatterplot, with Temperature as the explanatory variable. I found this super interesting, because it suggests to me that there may be some important thermokinetic effect on the ozone propagation reaction. Of course, there could very well be plenty of confounding or collinearity that explains this effect.

Along with the scatterplot, there are two lines: the pink one is a linear regression and the green one is a Loess smooth, each with its own shaded confidence band. This excited me so much, because you can see a marked sigmoid shape in the Loess curve, which corresponds super well with the Residuals vs. Fitted plot produced by plot(model). I feel that this can show viewers, without the need for much statistical or chemical knowledge, that the relationship between ozone and temperature is likely not linear. This has another super cool possible link to chemistry, where sigmoid curves are a familiar shape (acid-base titration curves, for example), though here that is only an analogy. I think it’s very interesting that all this different information about chemical kinetics, appropriateness of linear models, and uncertainty can be conveyed in one plot.

On the other hand, it is important to note that there are very few sections in which the confidence bands of the linear regression and the Loess smooth do not overlap. Overlapping bands are not a formal test, but at α = 0.05 they suggest there is no strong evidence against the linear model, perhaps just a visual hint which could be otherwise informed by knowledge from chemistry or other subjects.

To create this plot I used the geom_point, geom_smooth, labs, theme, and scale_color_manual functions in ggplot. I got the format for the original linear regression from statology: https://www.statology.org/ggplot2-linear-regression/.

I also got stuck when trying to make the legend. It took me a good hour of messing with the aes function to get it right, because I didn’t understand that I would need to specify the aesthetics in the very first ggplot command as well. This post on stackoverflow helped me to understand that: https://stackoverflow.com/questions/74392543. This was a super fun visualization to make, and I was really lucky to find a relationship so strong on my first try poking through the variables.
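Comparing the shaded bands by eye is only an informal check. One hedged way to test for curvature more directly is to compare the straight-line model against one with an added quadratic term using anova. This is a sketch of one option, not the analysis from the assignment; the quadratic form is an assumption, since a true sigmoid would need a different model, and all names here are illustrative:

```r
aq <- datasets::airquality
aq <- aq[!is.na(aq$Ozone), ]  # keep the rows that both models can use

fit_linear    <- lm(Ozone ~ Temp, data = aq)
fit_quadratic <- lm(Ozone ~ poly(Temp, 2), data = aq)  # adds a curvature term

# F-test: does the extra term reduce the residual error more than chance would?
curvature_test <- anova(fit_linear, fit_quadratic)
curvature_test
```

A small p-value in the second row of the table would suggest that the straight line is missing some curvature.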