Air Quality Homework

Author

Jonathan RH

Load in the library

Load library tidyverse in order to access dplyr and ggplot2

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The data come from the New York State Department of Conservation and the National Weather Service: daily measurements recorded over five months, from May to September 1973.

Load the data set into your global environment

Because airquality is a built-in data set, we can load it into our global environment with data() and use it from there.

data("airquality")

Look at the structure of the data

In the global environment, click on the row with the air quality data set and it will take you to a “spreadsheet” view of the data.

View the data using the “head” function

The head function displays only the first 6 rows of the data set. Notice in the global environment to the right that there are 153 observations (rows).

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Notice that all the variables are numeric: Wind is stored as a decimal (numeric) value and the others as integers.
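
One way to check the variable types is to ask R for the class of each column. This is a minimal sketch in base R; aq is only an illustrative name for a fresh copy of the built-in data, not something used elsewhere in this homework.

```r
# Fresh copy of the built-in data set (aq is an illustrative name)
aq <- datasets::airquality

# Number of rows and columns
print(dim(aq))

# Class of every column, returned as a named character vector
col_classes <- sapply(aq, class)
print(col_classes)

# str() gives a compact view of classes and the first few values
str(aq)
```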

Calculate Summary Statistics

If you want to look at specific statistics, there are several ways to code them. Here are two different ways to calculate the mean.

mean(airquality$Temp)
[1] 77.88235
mean(airquality[,4])
[1] 77.88235

In the second way, the data frame index [row, column] selects column 4, which is the Temp column; leaving the row position empty keeps all rows.
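
The two approaches above are not the only options. The sketch below, which assumes a plain data frame (not a tibble) and uses the illustrative name aq, shows a few equivalent ways to pick out the Temp column, and what happens with a column that contains missing values.

```r
aq <- datasets::airquality

by_dollar   <- mean(aq$Temp)        # column by name with $
by_position <- mean(aq[, 4])        # all rows, column 4
by_name     <- mean(aq[, "Temp"])   # all rows, column named "Temp"
by_double   <- mean(aq[["Temp"]])   # extract a single column as a vector

# All four select the same column, so the results agree
print(c(by_dollar, by_position, by_name, by_double))

# Ozone contains missing values, so mean() returns NA unless told to drop them
ozone_raw   <- mean(aq$Ozone)
ozone_clean <- mean(aq$Ozone, na.rm = TRUE)
print(c(ozone_raw, ozone_clean))
```

Selecting by name is usually safer than selecting by position, because the code keeps working if the column order changes.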

Calculate Median, Standard Deviation, and Variance

median(airquality$Temp)
[1] 79
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154
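
As a quick sanity check on these numbers, the variance should equal the square of the standard deviation. A minimal sketch, again using the illustrative name aq:

```r
aq <- datasets::airquality

wind_sd  <- sd(aq$Wind)
wind_var <- var(aq$Wind)

# The variance is the square of the standard deviation
print(all.equal(wind_sd^2, wind_var))

# summary() reports the minimum, quartiles, median, mean, and maximum at once
print(summary(aq$Temp))
```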

Rename the Months from number to names

Sometimes we prefer the months to be numerical, but here, we need them as the month names. There are MANY ways to do this. Here is one way to convert numbers 5 - 9 to May through September

airquality$Month[airquality$Month == 5]<- "May"
airquality$Month[airquality$Month == 6]<- "June"
airquality$Month[airquality$Month == 7]<- "July"
airquality$Month[airquality$Month == 8]<- "August"
airquality$Month[airquality$Month == 9]<- "September"
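
As one alternative among the many mentioned above, the sketch below recodes all five months in a single step with factor() and the built-in month.name vector. It works on a fresh copy named aq (an illustrative name), so it does not change the airquality object used in the rest of this homework. Note that it produces a factor with the months already in calendar order, rather than a character column.

```r
# Fresh copy where Month is still the integers 5 to 9
aq <- datasets::airquality

# month.name is a built-in character vector of the twelve English month names
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# The levels are in calendar order, and every row has been recoded
print(levels(aq$Month))
print(table(aq$Month))
```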

Now look at the summary statistics of the data set

See how Month has changed to have characters instead of numbers (it is now classified as “character” rather than “integer”)

summary(airquality$Month)
   Length     Class      Mode 
      153 character character 

Month is a categorical variable. R stores categorical variables as factors, and the categories of a factor are called levels.

This is one way to reorder the months so they do not default to alphabetical order (you will see another way to reorder directly in the chunk that creates Plot 1 below).

airquality$Month<-factor(airquality$Month, 
                         levels=c("May", "June","July", "August",
                                  "September"))
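
To see why the levels matter, compare how a character vector and a factor sort. This is a small self-contained sketch; months_chr and months_fct are illustrative names.

```r
months_chr <- c("May", "June", "July", "August", "September")

# Character values sort alphabetically
alphabetical <- sort(months_chr)
print(alphabetical)

# A factor sorts by the order of its levels instead
months_fct <- factor(months_chr, levels = months_chr)
chronological <- as.character(sort(months_fct))
print(chronological)
```

Plots that group by a factor follow the level order in the same way, which is why the months appear chronologically once the levels are set.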

Plot 1: Create a histogram categorized by Month

Here is a first attempt at viewing a histogram of temperature by the months May through September. Temperatures rise from May into midsummer and fall again in September. The median temperature computed above is 79 degrees.

  • fill = Month colors the histogram by months between May - Sept.

  • scale_fill_discrete(name = “Month”, …) titles the legend on the right side and sets the text shown for each month. The labels argument only relabels the legend entries in their existing order; the chronological order itself comes from the factor levels set above.

  • labs allows us to add a title, axes labels, and a caption for the data source

Plot 1 Code

p1 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity")+
  scale_fill_discrete(name = "Month", 
                      labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")  #provide the data source

Plot 1 Output

p1
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Is this plot useful in answering questions about monthly temperature values?

Plot 2: Improve the histogram of Average Temperature by Month

Outline the bars in white using the color = “white” argument.

Use alpha to add some transparency (values between 0 and 1).

Change the bin width using the binwidth argument.

Plot 2 Code

p2 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity", alpha=0.5, binwidth = 5, color = "white")+
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")

Plot 2 Output

p2

Here July stands out for having a high frequency of temperatures around 85 degrees. The dark purple color shows where months overlap, which is visible because of the transparency.

Did this improve the readability of the plot?

Plot 3: Create side-by-side box plots categorized by Month

We can see that August has the highest temperatures based on the box plot distribution.

p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September"))

Plot 3 Output

p3

Notice that the points above and below the box plots in June and July are outliers.
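
The observations read off the box plots can be checked numerically. The sketch below is one way to do that in base R, using a fresh copy named aq (an illustrative name). It assumes boxplot.stats() is a fair stand-in for the plot: it applies a 1.5 * IQR rule like the box plot does, but it computes the hinges slightly differently from ggplot2, so the flagged points may not match the plot exactly in edge cases.

```r
aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# Median and maximum temperature per month
monthly_median <- tapply(aq$Temp, aq$Month, median)
monthly_max    <- tapply(aq$Temp, aq$Month, max)
print(monthly_median)
print(monthly_max)

# Values flagged as outliers by the 1.5 * IQR rule, one list entry per month
outliers_by_month <- lapply(split(aq$Temp, aq$Month),
                            function(x) boxplot.stats(x)$out)
print(outliers_by_month)
```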

Plot 4: Side by Side Box plots in Gray Scale

Make the same side-by-side box plots, but in grey-scale. Use the scale_fill_grey function for the grey-scale legend, and again, use fill=Month in the aesthetics.

Plot 4 Code

Here we just changed the color palette to gray scale using scale_fill_grey

p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot()+
  scale_fill_grey(name = "Month", labels = c("May", "June","July", "August", "September"))

Plot 4 Output

p4

Plot 5:

Now make one new plot on your own, that is meaningfully different from the 4 I have shown you. You can select any of the variables in this data set. Be sure to explore the data set to see which variables are included that we have not explored yet. You may create a scatter plot, histogram, box plot, or something else. Be sure to include a title, axes labels, and caption for the data source in your Plot 5. Then finally, below your chunk of code for your plot 5, ….

Cleaning for Plot 5:

clean_AQ <- airquality |>
  drop_na(Solar.R)

clean_AQ
    Ozone Solar.R Wind Temp     Month Day
1      41     190  7.4   67       May   1
2      36     118  8.0   72       May   2
3      12     149 12.6   74       May   3
4      18     313 11.5   62       May   4
5      23     299  8.6   65       May   7
6      19      99 13.8   59       May   8
7       8      19 20.1   61       May   9
8      NA     194  8.6   69       May  10
9      16     256  9.7   69       May  12
10     11     290  9.2   66       May  13
11     14     274 10.9   68       May  14
12     18      65 13.2   58       May  15
13     14     334 11.5   64       May  16
14     34     307 12.0   66       May  17
15      6      78 18.4   57       May  18
16     30     322 11.5   68       May  19
17     11      44  9.7   62       May  20
18      1       8  9.7   59       May  21
19     11     320 16.6   73       May  22
20      4      25  9.7   61       May  23
21     32      92 12.0   61       May  24
22     NA      66 16.6   57       May  25
23     NA     266 14.9   58       May  26
24     23      13 12.0   67       May  28
25     45     252 14.9   81       May  29
26    115     223  5.7   79       May  30
27     37     279  7.4   76       May  31
28     NA     286  8.6   78      June   1
29     NA     287  9.7   74      June   2
30     NA     242 16.1   67      June   3
31     NA     186  9.2   84      June   4
32     NA     220  8.6   85      June   5
33     NA     264 14.3   79      June   6
34     29     127  9.7   82      June   7
35     NA     273  6.9   87      June   8
36     71     291 13.8   90      June   9
37     39     323 11.5   87      June  10
38     NA     259 10.9   93      June  11
39     NA     250  9.2   92      June  12
40     23     148  8.0   82      June  13
41     NA     332 13.8   80      June  14
42     NA     322 11.5   79      June  15
43     21     191 14.9   77      June  16
44     37     284 20.7   72      June  17
45     20      37  9.2   65      June  18
46     12     120 11.5   73      June  19
47     13     137 10.3   76      June  20
48     NA     150  6.3   77      June  21
49     NA      59  1.7   76      June  22
50     NA      91  4.6   76      June  23
51     NA     250  6.3   76      June  24
52     NA     135  8.0   75      June  25
53     NA     127  8.0   78      June  26
54     NA      47 10.3   73      June  27
55     NA      98 11.5   80      June  28
56     NA      31 14.9   77      June  29
57     NA     138  8.0   83      June  30
58    135     269  4.1   84      July   1
59     49     248  9.2   85      July   2
60     32     236  9.2   81      July   3
61     NA     101 10.9   84      July   4
62     64     175  4.6   83      July   5
63     40     314 10.9   83      July   6
64     77     276  5.1   88      July   7
65     97     267  6.3   92      July   8
66     97     272  5.7   92      July   9
67     85     175  7.4   89      July  10
68     NA     139  8.6   82      July  11
69     10     264 14.3   73      July  12
70     27     175 14.9   81      July  13
71     NA     291 14.9   91      July  14
72      7      48 14.3   80      July  15
73     48     260  6.9   81      July  16
74     35     274 10.3   82      July  17
75     61     285  6.3   84      July  18
76     79     187  5.1   87      July  19
77     63     220 11.5   85      July  20
78     16       7  6.9   74      July  21
79     NA     258  9.7   81      July  22
80     NA     295 11.5   82      July  23
81     80     294  8.6   86      July  24
82    108     223  8.0   85      July  25
83     20      81  8.6   82      July  26
84     52      82 12.0   86      July  27
85     82     213  7.4   88      July  28
86     50     275  7.4   86      July  29
87     64     253  7.4   83      July  30
88     59     254  9.2   81      July  31
89     39      83  6.9   81    August   1
90      9      24 13.8   81    August   2
91     16      77  7.4   82    August   3
92    122     255  4.0   89    August   7
93     89     229 10.3   90    August   8
94    110     207  8.0   90    August   9
95     NA     222  8.6   92    August  10
96     NA     137 11.5   86    August  11
97     44     192 11.5   86    August  12
98     28     273 11.5   82    August  13
99     65     157  9.7   80    August  14
100    NA      64 11.5   79    August  15
101    22      71 10.3   77    August  16
102    59      51  6.3   79    August  17
103    23     115  7.4   76    August  18
104    31     244 10.9   78    August  19
105    44     190 10.3   78    August  20
106    21     259 15.5   77    August  21
107     9      36 14.3   72    August  22
108    NA     255 12.6   75    August  23
109    45     212  9.7   79    August  24
110   168     238  3.4   81    August  25
111    73     215  8.0   86    August  26
112    NA     153  5.7   88    August  27
113    76     203  9.7   97    August  28
114   118     225  2.3   94    August  29
115    84     237  6.3   96    August  30
116    85     188  6.3   94    August  31
117    96     167  6.9   91 September   1
118    78     197  5.1   92 September   2
119    73     183  2.8   93 September   3
120    91     189  4.6   93 September   4
121    47      95  7.4   87 September   5
122    32      92 15.5   84 September   6
123    20     252 10.9   80 September   7
124    23     220 10.3   78 September   8
125    21     230 10.9   75 September   9
126    24     259  9.7   73 September  10
127    44     236 14.9   81 September  11
128    21     259 15.5   76 September  12
129    28     238  6.3   77 September  13
130     9      24 10.9   71 September  14
131    13     112 11.5   71 September  15
132    46     237  6.9   78 September  16
133    18     224 13.8   67 September  17
134    13      27 10.3   76 September  18
135    24     238 10.3   68 September  19
136    16     201  8.0   82 September  20
137    13     238 12.6   64 September  21
138    23      14  9.2   71 September  22
139    36     139 10.3   81 September  23
140     7      49 10.3   69 September  24
141    14      20 16.6   63 September  25
142    30     193  6.9   70 September  26
143    NA     145 13.2   77 September  27
144    14     191 14.3   75 September  28
145    18     131  8.0   76 September  29
146    20     223 11.5   68 September  30
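
To confirm what the cleaning step removed, it helps to count the missing values in each column. This is a minimal base R sketch on a fresh copy; aq and clean_aq are illustrative names, and the subsetting line is meant as a base R equivalent of drop_na(Solar.R).

```r
aq <- datasets::airquality

# Number of missing values in each column
na_counts <- colSums(is.na(aq))
print(na_counts)

# Keep only the rows where Solar.R is present
clean_aq <- aq[!is.na(aq$Solar.R), ]
print(nrow(clean_aq))

# Ozone was not cleaned, so it can still contain missing values
print(sum(is.na(clean_aq$Ozone)))
```

Only rows with a missing Solar.R value are dropped, which is why NA still appears in the Ozone column of the output above.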

Plot 5 Code:

Plot5 <- clean_AQ |>
  ggplot(aes(Solar.R, Temp, col = Month)) +
  geom_point() +
  theme_light() +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(x = "Solar Radiation", 
       y = "Temperature (Degree F)",
       title = "Relationship Between Solar Radiation and Temperature",
       caption = "New York State Department of Conservation and the National Weather Service")

Plot 5 Output

Plot5
`geom_smooth()` using formula = 'y ~ x'

A Brief Essay

The plot I created is a scatter plot with a linear regression line that represents the relationship between solar radiation and temperature. The plot shows a weak positive relationship between the two variables. The relationship is weak because most of the points lie far from the line, and it is positive because the regression line slopes upward. This suggests that other factors may affect temperature considerably more than solar radiation does. The “special codes” that I used to complete this plot are as follows:

  • col = in ggplot(aes()), to color the dots by month, since fill = was not working (the default point shape has no fill).

  • geom_point(), to make the scatter plot.

  • theme_light(), to provide a brighter background.

  • geom_smooth(), to add the linear regression line.

  • color = “black” in geom_smooth(), to have one line of linear regression.
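
The strength of the relationship described in the essay can also be put into numbers. The sketch below is one way to do that, assuming the same cleaning step (rows with missing Solar.R removed); aq, clean_aq, and fit are illustrative names.

```r
aq <- datasets::airquality
clean_aq <- aq[!is.na(aq$Solar.R), ]

# Pearson correlation: values near 0 are weak, values near 1 or -1 are strong
r <- cor(clean_aq$Solar.R, clean_aq$Temp)

# The same straight line that geom_smooth(method = "lm") draws
fit <- lm(Temp ~ Solar.R, data = clean_aq)
slope <- coef(fit)[["Solar.R"]]

# Share of the variation in Temp explained by Solar.R
r_squared <- summary(fit)$r.squared

print(c(r = r, slope = slope, r_squared = r_squared))
```

For a regression with a single predictor, r_squared is the square of r, so a small correlation means solar radiation explains only a small share of the variation in temperature.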

BONUS

Bonus <- clean_AQ |>
  ggplot(aes(Solar.R, Temp, col = Month)) +
  geom_point() +
  theme_light() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Solar Radiation", 
       y = "Temperature (Degree F)",
       title = "Relationship Between Solar Radiation and Temperature",
       caption = "New York State Department of Conservation and the National Weather Service")
  
Bonus
`geom_smooth()` using formula = 'y ~ x'
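
Without color = "black", geom_smooth() fits one line per month because the color aesthetic groups the data. The slopes of those lines can be computed directly, as in this rough base R sketch (aq, clean_aq, and slopes are illustrative names):

```r
aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])
clean_aq <- aq[!is.na(aq$Solar.R), ]

# Fit Temp ~ Solar.R separately for each month and keep the slope
slopes <- sapply(split(clean_aq, clean_aq$Month), function(d) {
  coef(lm(Temp ~ Solar.R, data = d))[["Solar.R"]]
})
print(slopes)
```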