Air Quality Assignment

Author

Zaid Hageman

New York Air Quality, May to September 1973


FIRST SLIDE

The first step is always to load the packages and data you are working with.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data(airquality)

This dataset contains daily air quality measurements taken in New York over five months, from May to September 1973. The sources are the New York State Department of Conservation (ozone data) and the National Weather Service (meteorological data).


SECOND SLIDE

View the Data Using the “head” function

This displays the first 6 rows of the data set

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

All of the columns are stored as either integers or numeric (decimal) values. Wind is the only numeric column; the rest are integers.
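One way to confirm how each column is stored is to ask R for its class. The sketch below works on a fresh copy called aq (a name chosen for illustration, not part of the original slides) so that it does not disturb the airquality object used elsewhere.

```r
# Fresh copy of the built-in dataset; `aq` is an illustrative name
aq <- datasets::airquality

# Storage class of every column
col_classes <- sapply(aq, class)
print(col_classes)

# Number of missing values (NA) in every column
na_counts <- colSums(is.na(aq))
print(na_counts)
```

The NA counts are worth checking early, because columns with missing values need extra care in later calculations.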


THIRD SLIDE

Calculating Summary Statistics

mean(airquality$Temp)
[1] 77.88235
mean(airquality[,4]) 
[1] 77.88235

The $ operator selects a column by name. The matrix-style index [row, column] achieves the same thing by position: leaving the row blank and asking for column 4 returns every value of Temp.
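A quick way to check that these ways of selecting a column really agree is to compare their results directly. This is a minimal sketch on a fresh copy named aq (an illustrative name); the double-bracket form is a third option not shown on the slide.

```r
# Fresh copy of the built-in dataset; `aq` is an illustrative name
aq <- datasets::airquality

by_name     <- aq$Temp        # column selected by name
by_position <- aq[, 4]        # all rows, fourth column
by_string   <- aq[["Temp"]]   # column name supplied as a string

# TRUE only if all three selections hold exactly the same values
same <- identical(by_name, by_position) && identical(by_name, by_string)
print(same)
```

Selecting by name is usually the safer habit, since it keeps working if the column order changes.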


FOURTH SLIDE

Calculating Median, St. Dev, and Variance

median(airquality$Temp)
[1] 79
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154
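Two points are worth a short sketch here: the variance is the square of the standard deviation, and these functions return NA when a column has missing values unless na.rm = TRUE is supplied. The copy named aq and the other variable names are chosen for illustration.

```r
# Fresh copy of the built-in dataset; `aq` is an illustrative name
aq <- datasets::airquality

# Variance equals the standard deviation squared
wind_sd  <- sd(aq$Wind)
wind_var <- var(aq$Wind)
print(all.equal(wind_sd^2, wind_var))

# Ozone contains missing values, so the plain mean is NA
ozone_mean_raw <- mean(aq$Ozone)
print(ozone_mean_raw)

# na.rm = TRUE drops the missing values before averaging
ozone_mean_clean <- mean(aq$Ozone, na.rm = TRUE)
print(ozone_mean_clean)
```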

FIFTH SLIDE

How to Rename a Variable from Numbers to Letters

For this example, the month numbers are replaced with the month names. The first assignment converts the whole column from integer to character.

airquality$Month[airquality$Month == 5]<- "May"
airquality$Month[airquality$Month == 6]<- "June"
airquality$Month[airquality$Month == 7]<- "July"
airquality$Month[airquality$Month == 8]<- "August"
airquality$Month[airquality$Month == 9]<- "September"
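As an alternative to five separate assignments, the same renaming can be sketched in a single step with factor(), which maps each number to a label and keeps the months in calendar order. This is one possible approach rather than the method used in these slides; aq is an illustrative name, and month.name is a constant built into R.

```r
# Fresh copy of the built-in dataset; `aq` is an illustrative name
aq <- datasets::airquality

# Map the codes 5..9 to "May".."September" and keep that order as the levels
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

print(levels(aq$Month))   # month names in calendar order
print(table(aq$Month))    # number of daily records per month
```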

SIXTH SLIDE

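Checking the Summary Statistics

Because Month is now stored as text (character) rather than as integers, summary() no longer reports numeric statistics. It reports only the length, class, and mode of the column.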

summary(airquality$Month)
   Length     Class      Mode 
      153 character character 

SEVENTH SLIDE

Month is a Categorical Variable, Stored in R as a “Factor” with Different “Levels”

This is how you would reorder the months so they do not appear alphabetically

airquality$Month <- factor(airquality$Month,
                           levels = c("May", "June", "July", "August",
                                      "September"))

EIGHTH SLIDE

Plot 1: Histogram Organized by Month

p1 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity")+
  scale_fill_discrete(name = "Month", 
                      labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = paste("New York State Department of Conservation and the",
                       "National Weather Service"))  # provide the data source

NINTH SLIDE

Plot 1 Output

p1
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


TENTH SLIDE

Plot 2: Improve the histogram

The improvements are a semi-transparent fill (alpha), a bin width of 5 degrees (binwidth), and a white border around each bar (color).

p2 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity", alpha=0.5, binwidth = 5, color = "white")+
  scale_fill_discrete(name = "Month", 
                      labels = c("May", "June", "July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = paste("New York State Department of Conservation and the",
                       "National Weather Service"))

ELEVENTH SLIDE

Plot 2 Output

p2

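This greatly improves the readability of the plot: the transparency keeps overlapping months visible, and the wider bins with white borders make individual bars easier to tell apart.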


TWELFTH SLIDE

Plot 3: Side by Side Box Plot Organized by Month

p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = paste("New York State Department of Conservation and the",
                       "National Weather Service")) +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August",
                                                 "September"))

THIRTEENTH SLIDE

Plot 3 Output

p3

This shows the median and spread of each month's temperatures and presents the outliers clearly as individual points.
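To see which values a boxplot treats as outliers, the usual rule can be sketched by hand: a point is flagged when it lies more than 1.5 times the interquartile range beyond the quartiles, which is the default rule geom_boxplot() documents. The helper find_outliers and the copy aq are hypothetical names introduced for this illustration.

```r
# Fresh copy with month names as an ordered factor; `aq` is an illustrative name
aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# Hypothetical helper: values beyond 1.5 * IQR from the lower or upper quartile
find_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[2] - q[1])
  x[x < q[1] - fence | x > q[2] + fence]
}

# Apply the helper to the temperatures of each month
outliers_by_month <- tapply(aq$Temp, aq$Month, find_outliers, simplify = FALSE)
print(outliers_by_month)
```

A month with no flagged values shows an empty vector, matching a box with no separate points.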


FOURTEENTH SLIDE

Plot 4: Side by Side Boxplot in Greyscale

p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = paste("New York State Department of Conservation and the",
                       "National Weather Service")) +
  geom_boxplot()+
  scale_fill_grey(name = "Month", labels = c("May", "June","July", "August", 
                                             "September"))

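scale_fill_grey() is the focus here: it replaces the default color fill with shades of grey, which is useful when the plot will be printed in black and white.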
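To preview the kind of shades involved without drawing a plot, base R's grey.colors() can generate a ramp of greys. The start and end values of 0.2 and 0.8 below are assumed to mirror the defaults of scale_fill_grey(), which is worth confirming in the ggplot2 documentation; the name shades is chosen for illustration.

```r
# Five grey shades, one per month, from dark (0.2) to light (0.8)
# Assumption: 0.2 and 0.8 match the scale_fill_grey() defaults
shades <- grey.colors(5, start = 0.2, end = 0.8)
names(shades) <- month.name[5:9]
print(shades)
```

If the default greys are too close together, scale_fill_grey() accepts its own start and end arguments to widen or narrow the range.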


FIFTEENTH SLIDE

Plot 4 Output

p4


SIXTEENTH SLIDE

Plot 5: A Scatterplot Connecting Ozone Concentration and Solar Radiation

clean_data <- na.omit(airquality)

p5 <- clean_data |>
  ggplot(aes(Ozone, Solar.R)) +
  geom_point(color = "red", alpha = 0.6) +
  labs(
    x = "Ozone Concentration (Parts per Billion)", y = "Solar Radiation", 
    title = "Scatterplot Connecting Ozone Concentration with Solar Radiation",
    caption = paste("New York State Department of Conservation and the",
                    "National Weather Service")
    ) + 
  theme_minimal() +
  theme(
    plot.title = element_text(size = 15, face = "bold"),
    axis.title = element_text(size = 13, face = "bold"),
    axis.text = element_text(size = 9),
    plot.caption = element_text(size = 7)
    )

SEVENTEENTH SLIDE

Plot 5 Output

p5


EIGHTEENTH SLIDE

Plot 5 Analysis

Plot 5 is a scatter plot illustrating the relationship between ozone concentration and solar radiation. The plot suggests a positive association between the two variables: days with higher solar radiation tend to have higher ozone concentrations. However, the data points are widely scattered, indicating that the relationship is loose and that other factors may also influence ozone levels.

The plot was created using the ggplot2 package in R, with the na.omit() function used to remove rows that contain missing data. The geom_point() layer added the scatter plot points, and the theme_minimal() function applied a minimalist theme. The plot’s labels and aesthetics were customized using the labs() and theme() functions, which set the font sizes and whether the text was bold.
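The visual impression from the scatter plot can be checked with a correlation coefficient, and it is also useful to know how many rows na.omit() removes. The sketch below is one way to do both; the names aq, clean, rows_dropped, and r are chosen for illustration.

```r
# Fresh copy of the built-in dataset; `aq` is an illustrative name
aq <- datasets::airquality

# Keep only the rows with no missing values in any column
clean <- na.omit(aq)
rows_dropped <- nrow(aq) - nrow(clean)
print(rows_dropped)

# Pearson correlation between ozone and solar radiation on the cleaned rows
r <- cor(clean$Ozone, clean$Solar.R)
print(r)
```

A value of r near 0 indicates little linear association, while values closer to 1 or -1 indicate a stronger positive or negative one.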