Load library tidyverse in order to access dplyr and ggplot2
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The data come from the New York State Department of Conservation and the National Weather Service. They were recorded daily for five months, from May to September 1973.
Load the data set into your global environment
Because airquality is a built-in data set, data() can load it straight into the global environment; no file needs to be read.
data("airquality")
Look at the structure of the data
In the global environment, click on the row with the air quality data set and it will take you to a “spreadsheet” view of the data.
View the data using the “head” function
The head function displays only the first 6 rows of the data set. Notice in the global environment to the right that there are 153 observations (rows).
Notice that all the variables are classified as either integers or numeric (continuous) values.
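The calls for these two steps are not shown above, so the following is only a minimal sketch of how the structure and the first rows could be inspected. It assumes the built-in airquality data set has not been modified yet.

```r
# Compact overview: one line per variable with its type and first values
str(airquality)

# First 6 rows (the default for head)
head(airquality)

# Number of rows and columns; should agree with the global environment pane
dim(airquality)
```

Passing n, as in head(airquality, n = 10), shows a different number of rows.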
Calculate Summary Statistics
If you want to look at specific statistics, here are some variations on coding. Here are 2 different ways to calculate “mean.”
mean(airquality$Temp)
[1] 77.88235
mean(airquality[,4])
[1] 77.88235
In the second way, the bracket notation [row, column] selects column 4, which is the Temp column. Leaving the row position empty keeps all rows.
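As a quick check that both forms address the same column, the sketch below works on a copy named aq (a name chosen here for illustration) and also selects the column by name inside the brackets.

```r
# Work on a fresh copy so the original data set is left untouched
aq <- datasets::airquality

# Confirm that position 4 really is the Temp column
names(aq)[4]

by_dollar   <- mean(aq$Temp)       # select by name with $
by_position <- mean(aq[, 4])       # all rows, column 4
by_name     <- mean(aq[, "Temp"])  # all rows, column named "Temp"

# All three approaches give the same mean
c(by_dollar, by_position, by_name)
```

Selecting by name is usually safer than selecting by position, because it keeps working if the column order changes.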
Calculate Median, Standard Deviation, and Variance
median(airquality$Temp)
[1] 79
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154
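The standard deviation and variance above are linked: the variance is the square of the standard deviation. A small sketch to confirm this on the Wind column, using a copy named aq (an illustrative name):

```r
aq <- datasets::airquality

wind_sd  <- sd(aq$Wind)   # same units as Wind
wind_var <- var(aq$Wind)  # squared units

# Squaring the standard deviation recovers the variance
all.equal(wind_sd^2, wind_var)
```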
Rename the Months from number to names
Sometimes we prefer the months to be numerical, but here, we need them as the month names. There are MANY ways to do this. Here is one way to convert numbers 5 - 9 to May through September
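The conversion code itself is not shown here, so the sketch below is one possible approach and not necessarily the one originally used. It relies on R's built-in month.name constant and works on a copy named aq (an illustrative name); assigning to airquality$Month instead would change the data set used in the rest of this walkthrough.

```r
aq <- datasets::airquality

# month.name holds "January" ... "December", so indexing it with the
# month numbers 5 to 9 returns "May" ... "September"
aq$Month <- month.name[aq$Month]

class(aq$Month)   # now character instead of integer
unique(aq$Month)  # the five month names, in order of appearance
```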
Now look at the summary statistics of the data set
See how Month has changed to have characters instead of numbers (it is now classified as “character” rather than “integer”)
summary(airquality$Month)
Length Class Mode
153 character character
Month is a categorical variable. R stores categorical variables as factors, and the distinct values of a factor are called its levels.
This is one way to reorder the months so they do not default to alphabetical order (you will see another way to reorder DIRECTLY in the chunk that creates the plot below, in Plot #1).
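The reordering code is not shown, so here is a hedged sketch of one common way to do it: convert Month to a factor and list the levels in calendar order. The names aq and month_order are illustrative, and the copy starts from the original numeric months so the example runs on its own.

```r
aq <- datasets::airquality
aq$Month <- month.name[aq$Month]  # numbers 5 to 9 become month names

# Without explicit levels, factor() would sort the names alphabetically
# (August, July, June, May, September)
month_order <- c("May", "June", "July", "August", "September")
aq$Month <- factor(aq$Month, levels = month_order)

levels(aq$Month)  # calendar order, which plots and legends will follow
```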
Here is a first attempt at a histogram of temperature for the months May through September. We will see that temperatures rise from May into the summer months. Judging by the plot, the median temperature appears to be about 75 degrees; the median computed above is 79.
fill = Month colors the histogram by months between May - Sept.
scale_fill_discrete(name = “Month”…) provides the month names on the right side as a legend in chronological order. This is a different way to order than what was shown above.
labs allows us to add a title, axes labels, and a caption for the data source
Plot 1 Code
p1 <- airquality |>
  ggplot(aes(x = Temp, fill = Month)) +
  geom_histogram(position = "identity") +
  scale_fill_discrete(name = "Month",
                      labels = c("May", "June", "July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept",
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service") # provide the data source
Plot 1 Output
p1
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Is this plot useful in answering questions about monthly temperature values?
Plot 2: Improve the histogram of Average Temperature by Month
Outline the bars in white using the color = “white” command
Use alpha to add some transparency (values between 0 and 1)
Change the bin width
Add some transparency and white borders around the histogram bars.
Plot 2 Code
p2 <- airquality |>
  ggplot(aes(x = Temp, fill = Month)) +
  geom_histogram(position = "identity", alpha = 0.5, binwidth = 5, color = "white") +
  scale_fill_discrete(name = "Month",
                      labels = c("May", "June", "July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept",
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
Plot 2 Output
p2
Here July stands out for having high frequency of 85 degree temperatures. The dark purple color indicates overlaps of months due to the transparency.
Did this improve the readability of the plot?
Plot 3: Create side-by-side box plots categorized by Month
We can see that August has the highest temperatures based on the box plot distribution.
p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) +
  labs(x = "Months from May through September",
       y = "Temperatures",
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month",
                      labels = c("May", "June", "July", "August", "September"))
Plot 3 Output
p3
Notice that the points above and below the box plots in June and July are outliers.
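To see which temperature readings a box plot would draw as separate points, the usual rule can be applied by hand: a value is flagged when it lies more than 1.5 times the interquartile range beyond the first or third quartile. The sketch below is an illustration only; flag_outliers, aq, and the other names are hypothetical helpers, not part of the original code.

```r
# Return the values of x that fall outside the 1.5 * IQR fences
flag_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  fence <- 1.5 * (q[2] - q[1])
  x[x < q[1] - fence | x > q[2] + fence]
}

aq <- datasets::airquality

# Split the temperatures by month name, then flag outliers within each month
temps_by_month <- split(aq$Temp, month.name[aq$Month])
outliers_by_month <- lapply(temps_by_month, flag_outliers)
outliers_by_month
```

Months with no flagged values return an empty vector.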
Plot 4: Side by Side Box plots in Gray Scale
Make the same side-by-side box plots, but in grey scale. Use the scale_fill_grey command for the grey-scale legend and, again, use fill = Month in the aesthetics.
Plot 4 Code
Here we just changed the color palette to gray scale using scale_fill_grey
p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) +
  labs(x = "Months from May through September", # the x axis shows months, not temperatures
       y = "Temperatures",
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_grey(name = "Month",
                  labels = c("May", "June", "July", "August", "September"))
Plot 4 Output
p4
Plot 5:
Now make one new plot on your own, that is meaningfully different from the 4 I have shown you. You can select any of the variables in this data set. Be sure to explore the data set to see which variables are included that we have not explored yet. You may create a scatter plot, histogram, box plot, or something else. Be sure to include a title, axes labels, and caption for the data source in your Plot 5. Then finally, below your chunk of code for your plot 5, ….
Cleaning for Plot 5:
clean_AQ <- airquality |>
  drop_na(Solar.R)
clean_AQ
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 May 1
2 36 118 8.0 72 May 2
3 12 149 12.6 74 May 3
4 18 313 11.5 62 May 4
5 23 299 8.6 65 May 7
6 19 99 13.8 59 May 8
7 8 19 20.1 61 May 9
8 NA 194 8.6 69 May 10
9 16 256 9.7 69 May 12
10 11 290 9.2 66 May 13
11 14 274 10.9 68 May 14
12 18 65 13.2 58 May 15
13 14 334 11.5 64 May 16
14 34 307 12.0 66 May 17
15 6 78 18.4 57 May 18
16 30 322 11.5 68 May 19
17 11 44 9.7 62 May 20
18 1 8 9.7 59 May 21
19 11 320 16.6 73 May 22
20 4 25 9.7 61 May 23
21 32 92 12.0 61 May 24
22 NA 66 16.6 57 May 25
23 NA 266 14.9 58 May 26
24 23 13 12.0 67 May 28
25 45 252 14.9 81 May 29
26 115 223 5.7 79 May 30
27 37 279 7.4 76 May 31
28 NA 286 8.6 78 June 1
29 NA 287 9.7 74 June 2
30 NA 242 16.1 67 June 3
31 NA 186 9.2 84 June 4
32 NA 220 8.6 85 June 5
33 NA 264 14.3 79 June 6
34 29 127 9.7 82 June 7
35 NA 273 6.9 87 June 8
36 71 291 13.8 90 June 9
37 39 323 11.5 87 June 10
38 NA 259 10.9 93 June 11
39 NA 250 9.2 92 June 12
40 23 148 8.0 82 June 13
41 NA 332 13.8 80 June 14
42 NA 322 11.5 79 June 15
43 21 191 14.9 77 June 16
44 37 284 20.7 72 June 17
45 20 37 9.2 65 June 18
46 12 120 11.5 73 June 19
47 13 137 10.3 76 June 20
48 NA 150 6.3 77 June 21
49 NA 59 1.7 76 June 22
50 NA 91 4.6 76 June 23
51 NA 250 6.3 76 June 24
52 NA 135 8.0 75 June 25
53 NA 127 8.0 78 June 26
54 NA 47 10.3 73 June 27
55 NA 98 11.5 80 June 28
56 NA 31 14.9 77 June 29
57 NA 138 8.0 83 June 30
58 135 269 4.1 84 July 1
59 49 248 9.2 85 July 2
60 32 236 9.2 81 July 3
61 NA 101 10.9 84 July 4
62 64 175 4.6 83 July 5
63 40 314 10.9 83 July 6
64 77 276 5.1 88 July 7
65 97 267 6.3 92 July 8
66 97 272 5.7 92 July 9
67 85 175 7.4 89 July 10
68 NA 139 8.6 82 July 11
69 10 264 14.3 73 July 12
70 27 175 14.9 81 July 13
71 NA 291 14.9 91 July 14
72 7 48 14.3 80 July 15
73 48 260 6.9 81 July 16
74 35 274 10.3 82 July 17
75 61 285 6.3 84 July 18
76 79 187 5.1 87 July 19
77 63 220 11.5 85 July 20
78 16 7 6.9 74 July 21
79 NA 258 9.7 81 July 22
80 NA 295 11.5 82 July 23
81 80 294 8.6 86 July 24
82 108 223 8.0 85 July 25
83 20 81 8.6 82 July 26
84 52 82 12.0 86 July 27
85 82 213 7.4 88 July 28
86 50 275 7.4 86 July 29
87 64 253 7.4 83 July 30
88 59 254 9.2 81 July 31
89 39 83 6.9 81 August 1
90 9 24 13.8 81 August 2
91 16 77 7.4 82 August 3
92 122 255 4.0 89 August 7
93 89 229 10.3 90 August 8
94 110 207 8.0 90 August 9
95 NA 222 8.6 92 August 10
96 NA 137 11.5 86 August 11
97 44 192 11.5 86 August 12
98 28 273 11.5 82 August 13
99 65 157 9.7 80 August 14
100 NA 64 11.5 79 August 15
101 22 71 10.3 77 August 16
102 59 51 6.3 79 August 17
103 23 115 7.4 76 August 18
104 31 244 10.9 78 August 19
105 44 190 10.3 78 August 20
106 21 259 15.5 77 August 21
107 9 36 14.3 72 August 22
108 NA 255 12.6 75 August 23
109 45 212 9.7 79 August 24
110 168 238 3.4 81 August 25
111 73 215 8.0 86 August 26
112 NA 153 5.7 88 August 27
113 76 203 9.7 97 August 28
114 118 225 2.3 94 August 29
115 84 237 6.3 96 August 30
116 85 188 6.3 94 August 31
117 96 167 6.9 91 September 1
118 78 197 5.1 92 September 2
119 73 183 2.8 93 September 3
120 91 189 4.6 93 September 4
121 47 95 7.4 87 September 5
122 32 92 15.5 84 September 6
123 20 252 10.9 80 September 7
124 23 220 10.3 78 September 8
125 21 230 10.9 75 September 9
126 24 259 9.7 73 September 10
127 44 236 14.9 81 September 11
128 21 259 15.5 76 September 12
129 28 238 6.3 77 September 13
130 9 24 10.9 71 September 14
131 13 112 11.5 71 September 15
132 46 237 6.9 78 September 16
133 18 224 13.8 67 September 17
134 13 27 10.3 76 September 18
135 24 238 10.3 68 September 19
136 16 201 8.0 82 September 20
137 13 238 12.6 64 September 21
138 23 14 9.2 71 September 22
139 36 139 10.3 81 September 23
140 7 49 10.3 69 September 24
141 14 20 16.6 63 September 25
142 30 193 6.9 70 September 26
143 NA 145 13.2 77 September 27
144 14 191 14.3 75 September 28
145 18 131 8.0 76 September 29
146 20 223 11.5 68 September 30
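The output above has 146 rows rather than 153 because drop_na(Solar.R) removes only the rows where Solar.R is missing; rows with a missing Ozone value are kept. A base-R sketch of the same filtering, using illustrative names (aq, n_missing, clean_base) and a fresh copy of the data:

```r
aq <- datasets::airquality

# Count the rows that will be dropped
n_missing <- sum(is.na(aq$Solar.R))

# Keep only the rows where Solar.R is present
clean_base <- aq[!is.na(aq$Solar.R), ]

n_missing         # rows removed
nrow(clean_base)  # rows remaining
```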
Plot 5 Code:
Plot5 <- clean_AQ |>
  ggplot(aes(Solar.R, Temp, col = Month)) +
  geom_point() +
  theme_light() +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(x = "Solar Radiation",
       y = "Temperature (Degree F)",
       title = "Relationship Between Solar Radiation and Temperature",
       caption = "New York State Department of Conservation and the National Weather Service")
Plot 5 Output
Plot5
`geom_smooth()` using formula = 'y ~ x'
A Brief Essay
The plot I created is a scatter plot with a linear regression line to represent the relationship between solar radiation and temperature. The plot shows that there is a weak positive relationship between the two variables. The relationship is weak as most of the dots are further away from the line, and it is positive because the linear regression line increases. This shows that there may be other factors that affect temperature significantly more than solar radiation. Some “special codes” that I used to complete this plot are as follows:
col = (short for color), in ggplot(aes()), to color the dots by month; fill = did not work because the default point shape has no fill.
geom_point(), to make the scatter plot.
theme_light(), to provide a brighter background.
geom_smooth(), to add the linear regression line.
color = "black", inside geom_smooth(), to override the Month coloring for that layer and draw a single linear regression line.
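The "weak positive relationship" described in the essay can also be put into numbers. The sketch below is an optional check, not part of the original analysis: it computes the correlation coefficient and fits the same kind of linear model that geom_smooth(method = "lm") draws. The names aq, clean, r, and fit are illustrative.

```r
aq <- datasets::airquality
clean <- aq[!is.na(aq$Solar.R), ]  # same rows as clean_AQ

# Pearson correlation: values near 0 are weak, values near 1 are strong
r <- cor(clean$Solar.R, clean$Temp)

# Simple linear regression of temperature on solar radiation
fit <- lm(Temp ~ Solar.R, data = clean)
r_squared <- summary(fit)$r.squared  # share of variance in Temp explained

r
coef(fit)
r_squared
```

For a regression with a single predictor, r_squared equals r squared, so a small correlation means solar radiation explains only a small share of the variation in temperature.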
BONUS
Bonus <- clean_AQ |>
  ggplot(aes(Solar.R, Temp, col = Month)) +
  geom_point() +
  theme_light() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Solar Radiation",
       y = "Temperature (Degree F)",
       title = "Relationship Between Solar Radiation and Temperature",
       caption = "New York State Department of Conservation and the National Weather Service")
Bonus