Airquality HW

Author

P Daniel-Orie

cbs New York

Load in the library

Load the tidyverse library to access dplyr and ggplot2.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The data come from the New York State Department of Conservation and the National Weather Service. They are daily measurements recorded over five months, from May to September 1973.

Load the dataset into your global environment

data("airquality")

Look at the structure of the data

In the global environment, click on the row with the airquality dataset and it will take you to a “spreadsheet” view of the data.

View the data using the “head” function

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Notice that every variable is numeric: Wind is stored as a continuous (double) value, and the other five columns are integers.
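To check the column types directly rather than reading them off the printed rows, a minimal sketch is shown here. It works on a fresh copy called aq (a name introduced only for this illustration) so it does not depend on any other chunk.

```r
# Fresh copy of the built-in dataset; `aq` is a hypothetical name used only here
aq <- datasets::airquality

# str() prints the dimensions and the type of every column
str(aq)

# sapply() returns the class of each column as a named character vector
col_classes <- sapply(aq, class)
col_classes
```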

Calculate Summary Statistics

If you want to look at specific statistics, here are some variations on coding. Here are 2 different ways to calculate “mean.”

mean(airquality$Temp)
[1] 77.88235
mean(airquality[,4])
[1] 77.88235

In the second way, the [row, column] index selects column 4, which is the Temp column. Leaving the row position blank keeps all rows.
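As a further illustration, the sketch below shows a few equivalent ways to select the Temp column. The object names (aq, by_name, and so on) are introduced here only for the example.

```r
# Fresh copy of the built-in dataset; all names below are illustrative
aq <- datasets::airquality

by_name     <- aq$Temp        # column by name with $
by_position <- aq[, 4]        # all rows, 4th column
by_label    <- aq[, "Temp"]   # all rows, column selected by its name
by_double   <- aq[["Temp"]]   # list-style extraction of one column

# All four approaches return the same integer vector
identical(by_name, by_position)
identical(by_label, by_double)

mean(by_label)
```

Selecting by name is usually safer than selecting by position, because the code keeps working if the column order changes.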

Calculate Median, Standard Deviation, and Variance

median(airquality$Temp)
[1] 79
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154
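Two details are worth checking with a short sketch: the variance is the square of the standard deviation, and columns with missing values (such as Ozone) need na.rm = TRUE. The object names below are assumptions made for this example.

```r
# Fresh copy of the built-in dataset; names are illustrative
aq <- datasets::airquality

wind_sd  <- sd(aq$Wind)
wind_var <- var(aq$Wind)

# The variance equals the standard deviation squared (up to rounding)
all.equal(wind_sd^2, wind_var)

# Ozone contains NA values, so the plain mean is NA
mean(aq$Ozone)

# na.rm = TRUE drops the missing values before computing the mean
ozone_mean <- mean(aq$Ozone, na.rm = TRUE)
ozone_mean
```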

Rename the Months from number to names

Sometimes we prefer the months to be numerical, but here we need them as month names. There are MANY ways to do this. Here is one way to convert the numbers 5 through 9 to May through September.

airquality$Month[airquality$Month == 5]<- "May"
airquality$Month[airquality$Month == 6]<- "June"
airquality$Month[airquality$Month == 7]<- "July"
airquality$Month[airquality$Month == 8]<- "August"
airquality$Month[airquality$Month == 9]<- "September"

Now look at the summary statistics of the dataset

summary(airquality$Month)
   Length     Class      Mode 
      153 character character 

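Month is now stored as a character vector, which is why summary() reports only its length and class. To treat it as a categorical variable with an explicit set of levels, convert it to a factor.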

This is one way to reorder the months so they do not default to alphabetical order (the chunk that creates Plot 1 below also sets the month names directly in the legend).

airquality$Month<-factor(airquality$Month, 
                         levels=c("May","June", "July",  "August", "September"))
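Since there are many ways to do this conversion, here is one alternative sketch that renames and orders the months in a single step. It starts from a fresh copy (aq, a name introduced here) where Month is still numeric, and relies on the built-in month.name constant.

```r
# Fresh copy where Month is still the integers 5 to 9; `aq` is an illustrative name
aq <- datasets::airquality

# levels gives the existing codes, labels gives the names to show, in the same order
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

levels(aq$Month)   # chronological order, not alphabetical
table(aq$Month)    # number of daily observations per month
```

Because the factor is built from the numeric codes, the levels stay in calendar order without a separate reordering step.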

Plot 1: Create a histogram categorized by Month

Here is a first attempt at viewing a histogram of temperature by month, May through September. We will see that temperatures rise from May into midsummer. The median temperature, computed above, is 79 degrees.

  • fill = Month colors the histogram by months between May - Sept.

  • scale_fill_discrete(name = “Month”…) sets the legend title and the month names shown in the legend on the right side. Note that labels only renames the existing levels in their current order; the chronological order itself comes from the factor levels set above.

  • labs allows us to add a title, axes labels, and a caption for the data source

Plot 1 Code

p1 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity")+
  scale_fill_discrete(name = "Month", 
                      labels = c("May", "June","July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")  #provide the data source
p1
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Plot 2: Improve the histogram of Average Temperature by Month

  • Outline the bars in white using the color = “white” argument

  • Use alpha to add some transparency (values between 0 and 1)

  • Change the binwidth

p2 <- airquality |>
  ggplot(aes(x=Temp, fill=Month)) +
  geom_histogram(position="identity", alpha=0.5,binwidth = 5, color = "white")+
  labs(x = "Monthly Temperatures from May - Sept", 
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
print(p2)

Plot 2 Output

Here July stands out for having a high frequency of 85-degree temperatures. The dark purple color shows where months overlap, which is visible because of the transparency.

Did this improve the readability of the plot?
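If the overlapping bars are still hard to read, one possible alternative is to give each month its own panel. The sketch below is only an illustration of that option, not part of the original assignment; it uses a fresh copy named aq and a plot object named p_facet, both introduced here.

```r
library(ggplot2)

# Fresh copy so the sketch does not depend on earlier chunks; names are illustrative
aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

p_facet <- ggplot(aq, aes(x = Temp, fill = Month)) +
  geom_histogram(binwidth = 5, color = "white") +
  facet_wrap(~ Month, ncol = 1) +          # one stacked panel per month
  labs(x = "Temperature (°F)",
       y = "Frequency of Temps",
       title = "Histogram of Daily Temperatures by Month, May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service") +
  theme(legend.position = "none")          # panel titles already name the month
p_facet
```

With separate panels no transparency is needed, since the bars from different months never cover each other.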

Plot 3: Create side-by-side boxplots categorized by Month

We can see that August has the highest temperatures based on the boxplot distribution.

p3 <- airquality |>
  ggplot(aes(Month,Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month", labels = c("May", "June","July", "August", "September"))
print(p3)

Notice that the points above and below the boxplots in June and July are outliers.
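The boxplot readings can be checked numerically. The following sketch assumes the usual 1.5 × IQR rule for flagging outliers and computes the median and the number of flagged points per month. The names aq, monthly, and the summary columns are introduced here for illustration.

```r
library(dplyr)

# Fresh copy where Month is still numeric (5 = May, ..., 9 = September)
aq <- datasets::airquality

monthly <- aq |>
  group_by(Month) |>
  summarise(
    median_temp = median(Temp),
    q1 = quantile(Temp, 0.25, names = FALSE),   # lower quartile
    q3 = quantile(Temp, 0.75, names = FALSE),   # upper quartile
    # count the days that fall more than 1.5 * IQR outside the quartiles
    n_outliers = sum(Temp < q1 - 1.5 * (q3 - q1) |
                     Temp > q3 + 1.5 * (q3 - q1))
  )
monthly
```

Comparing median_temp across rows is a quick way to confirm which month the boxplots show as warmest.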

Plot 4: Side by Side Boxplots in Gray Scale

Make the same side-by-side boxplots, but in grey-scale

Use the scale_fill_grey command for the grey-scale legend, and again, use fill=Month in the aesthetics.

Plot 4 Code

Here we just changed the color palette to gray scale using scale_fill_grey

p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) + 
  labs(x = "Months from May through September", y = "Temperatures", 
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot()+
  scale_fill_grey(name = "Month", labels = c("May", "June","July", "August", "September"))
p4

Plot 5:

Now make one new plot on your own, that is meaningfully different from the 4 I have shown you. You can select any of the variables in this dataset. Be sure to explore the dataset to see which variables are included that we have not explored yet. You may create a scatterplot, histogram, boxplot, or something else. Be sure to include a title, axes labels, and caption for the data source in your Plot 5. Then finally, below your chunk of code for your plot 5, ….

# Remove rows with missing values
cleaned_data <- na.omit(airquality)
cleaned_data
    Ozone Solar.R Wind Temp     Month Day
1      41     190  7.4   67       May   1
2      36     118  8.0   72       May   2
3      12     149 12.6   74       May   3
4      18     313 11.5   62       May   4
7      23     299  8.6   65       May   7
8      19      99 13.8   59       May   8
9       8      19 20.1   61       May   9
12     16     256  9.7   69       May  12
13     11     290  9.2   66       May  13
14     14     274 10.9   68       May  14
15     18      65 13.2   58       May  15
16     14     334 11.5   64       May  16
17     34     307 12.0   66       May  17
18      6      78 18.4   57       May  18
19     30     322 11.5   68       May  19
20     11      44  9.7   62       May  20
21      1       8  9.7   59       May  21
22     11     320 16.6   73       May  22
23      4      25  9.7   61       May  23
24     32      92 12.0   61       May  24
28     23      13 12.0   67       May  28
29     45     252 14.9   81       May  29
30    115     223  5.7   79       May  30
31     37     279  7.4   76       May  31
38     29     127  9.7   82      June   7
40     71     291 13.8   90      June   9
41     39     323 11.5   87      June  10
44     23     148  8.0   82      June  13
47     21     191 14.9   77      June  16
48     37     284 20.7   72      June  17
49     20      37  9.2   65      June  18
50     12     120 11.5   73      June  19
51     13     137 10.3   76      June  20
62    135     269  4.1   84      July   1
63     49     248  9.2   85      July   2
64     32     236  9.2   81      July   3
66     64     175  4.6   83      July   5
67     40     314 10.9   83      July   6
68     77     276  5.1   88      July   7
69     97     267  6.3   92      July   8
70     97     272  5.7   92      July   9
71     85     175  7.4   89      July  10
73     10     264 14.3   73      July  12
74     27     175 14.9   81      July  13
76      7      48 14.3   80      July  15
77     48     260  6.9   81      July  16
78     35     274 10.3   82      July  17
79     61     285  6.3   84      July  18
80     79     187  5.1   87      July  19
81     63     220 11.5   85      July  20
82     16       7  6.9   74      July  21
85     80     294  8.6   86      July  24
86    108     223  8.0   85      July  25
87     20      81  8.6   82      July  26
88     52      82 12.0   86      July  27
89     82     213  7.4   88      July  28
90     50     275  7.4   86      July  29
91     64     253  7.4   83      July  30
92     59     254  9.2   81      July  31
93     39      83  6.9   81    August   1
94      9      24 13.8   81    August   2
95     16      77  7.4   82    August   3
99    122     255  4.0   89    August   7
100    89     229 10.3   90    August   8
101   110     207  8.0   90    August   9
104    44     192 11.5   86    August  12
105    28     273 11.5   82    August  13
106    65     157  9.7   80    August  14
108    22      71 10.3   77    August  16
109    59      51  6.3   79    August  17
110    23     115  7.4   76    August  18
111    31     244 10.9   78    August  19
112    44     190 10.3   78    August  20
113    21     259 15.5   77    August  21
114     9      36 14.3   72    August  22
116    45     212  9.7   79    August  24
117   168     238  3.4   81    August  25
118    73     215  8.0   86    August  26
120    76     203  9.7   97    August  28
121   118     225  2.3   94    August  29
122    84     237  6.3   96    August  30
123    85     188  6.3   94    August  31
124    96     167  6.9   91 September   1
125    78     197  5.1   92 September   2
126    73     183  2.8   93 September   3
127    91     189  4.6   93 September   4
128    47      95  7.4   87 September   5
129    32      92 15.5   84 September   6
130    20     252 10.9   80 September   7
131    23     220 10.3   78 September   8
132    21     230 10.9   75 September   9
133    24     259  9.7   73 September  10
134    44     236 14.9   81 September  11
135    21     259 15.5   76 September  12
136    28     238  6.3   77 September  13
137     9      24 10.9   71 September  14
138    13     112 11.5   71 September  15
139    46     237  6.9   78 September  16
140    18     224 13.8   67 September  17
141    13      27 10.3   76 September  18
142    24     238 10.3   68 September  19
143    16     201  8.0   82 September  20
144    13     238 12.6   64 September  21
145    23      14  9.2   71 September  22
146    36     139 10.3   81 September  23
147     7      49 10.3   69 September  24
148    14      20 16.6   63 September  25
149    30     193  6.9   70 September  26
151    14     191 14.3   75 September  28
152    18     131  8.0   76 September  29
153    20     223 11.5   68 September  30
# Build the plot from cleaned_data (not airquality) so the rows with missing
# values are actually excluded, and store it in its own object (p5)
p5 <- cleaned_data |>
  ggplot(aes(x = Temp, y = Ozone, color = Month)) + 
  labs(x = "Temperature (°F)", y = "Ozone Levels (ppb)", 
       title = "Scatter Plot of Ozone Levels vs Temperature",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_point() +
  scale_color_discrete(name = "Month", labels = c("May", "June", "July", "August", "September"))
p5

Write a brief essay here

  • Describe the plot type you have created: Plot 5 is a scatter plot that visualizes the relationship between temperature and ozone levels in the airquality dataset. Each point represents one observation, with temperature on the x-axis and ozone level on the y-axis. The points are colored by month, which lets us see how this relationship varies from May to September.

  • Any insights that the plot shows: The scatter plot shows a general trend in which higher temperatures are associated with higher ozone levels. Coloring the points by month lets us observe how this relationship differs across months. For instance: May and September (lighter colors) show lower temperatures and a wider range of ozone levels. June, July, and August (darker colors) tend to have higher temperatures, and the relationship between temperature and ozone is more pronounced.

  • Describe any special code you used to make this plot: I used na.omit() to remove observations with missing values. Note that it drops every row that has an NA in any column, not only in Ozone.
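To make the effect of the missing-value handling concrete, the sketch below counts the NA values per column and compares na.omit() with dropping only the rows that lack an Ozone value. It also computes the correlation behind the trend described above. All object names are assumptions made for this illustration.

```r
# Fresh copy of the built-in dataset; names are illustrative
aq <- datasets::airquality

# Number of missing values in each column
colSums(is.na(aq))

# na.omit() drops a row if ANY column is missing
complete_rows <- na.omit(aq)

# Alternative: keep every row that has an Ozone value, whatever else is missing
ozone_rows <- aq[!is.na(aq$Ozone), ]

nrow(complete_rows)   # 111 rows
nrow(ozone_rows)      # 116 rows (153 minus the 37 missing Ozone values)

# Strength of the linear association between temperature and ozone
temp_ozone_cor <- cor(ozone_rows$Temp, ozone_rows$Ozone)
temp_ozone_cor
```

The second approach keeps a few more points in a Temp versus Ozone plot, because rows that are missing only Solar.R are still usable for those two variables.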