Load the tidyverse library to access dplyr and ggplot2
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The data come from the New York State Department of Conservation and the National Weather Service. Measurements were recorded daily for five months, from May through September 1973.
Load the dataset into your global environment
data("airquality")
Look at the structure of the data
In the global environment, click on the row with the airquality dataset and it will take you to a “spreadsheet” view of the data.
Notice that every variable is stored as either an integer or a numeric (continuous) value.
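The same information is available from the console. The lines below are a minimal sketch that reads the built-in copy of the data through `datasets::airquality`, so it does not touch anything in your global environment; the names `aq` and `col_types` are only illustrative.

```r
# Work on a fresh copy of the built-in dataset (hypothetical name: aq)
aq <- datasets::airquality

# str() prints one line per column: name, storage type, first few values
str(aq)

# sapply() over the columns returns the class of each one as a named vector
col_types <- sapply(aq, class)
print(col_types)
```

In the untouched dataset, Wind is the only column stored as numeric; the other columns are integers.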
Calculate Summary Statistics
If you want specific statistics, there are several ways to code them. Here are two different ways to calculate the mean.
mean(airquality$Temp)
[1] 77.88235
mean(airquality[,4])
[1] 77.88235
In the second version, the bracket notation [row, column] selects column #4, which is the Temp column. Leaving the row position empty keeps all rows.
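To confirm that these selections really return the same values, here is a short sketch. It assumes the standard column order of the built-in dataset and uses illustrative variable names (`by_name`, `by_position`, `by_string`).

```r
# Fresh copy of the built-in dataset (hypothetical name: aq)
aq <- datasets::airquality

by_name     <- aq$Temp        # column selected by name with $
by_position <- aq[, 4]        # all rows, 4th column
by_string   <- aq[["Temp"]]   # column name supplied as a string

# The 4th column name should be "Temp"
print(names(aq)[4])

# identical() is TRUE only if the vectors match exactly
same <- identical(by_name, by_position) && identical(by_name, by_string)
print(same)
```

Selecting by name is usually safer than selecting by position, because the code keeps working if the column order changes.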
Calculate Median, Standard Deviation, and Variance
median(airquality$Temp)
[1] 79
sd(airquality$Wind)
[1] 3.523001
var(airquality$Wind)
[1] 12.41154
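Two details are worth checking here. The variance is the square of the standard deviation, and any of these functions returns NA when the column contains missing values unless you pass `na.rm = TRUE`. The sketch below illustrates both, using Ozone as the column with missing values; the variable names are illustrative.

```r
# Fresh copy of the built-in dataset (hypothetical name: aq)
aq <- datasets::airquality

wind_sd  <- sd(aq$Wind)
wind_var <- var(aq$Wind)

# all.equal() compares with a small numeric tolerance
print(all.equal(wind_sd^2, wind_var))

# Ozone has missing values, so the plain mean is NA
ozone_mean_raw   <- mean(aq$Ozone)
# na.rm = TRUE drops the missing values before averaging
ozone_mean_clean <- mean(aq$Ozone, na.rm = TRUE)

print(ozone_mean_raw)
print(ozone_mean_clean)
```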
Rename the Months from number to names
Sometimes we prefer the months to be numerical, but here we need them as month names. There are MANY ways to do this. Here is one way to convert the numbers 5 - 9 to May through September.
Month is a categorical variable. R stores categorical variables as factors, and the categories of a factor are called levels.
This is one way to reorder the months so they do not default to alphabetical order (you will see another way to reorder DIRECTLY in the chunk that creates the plot below, in Plot #1).
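The code for this step does not appear above, so here is a minimal sketch of one possible approach, not necessarily the one used originally. It works on a copy called `aq` (a hypothetical name) and relies on the built-in constant `month.name`.

```r
# Work on a copy so the original data frame is left alone (hypothetical name: aq)
aq <- datasets::airquality

# month.name is a built-in vector: "January", "February", ..., "December".
# Indexing it with the month numbers turns 5 into "May", ..., 9 into "September".
aq$Month <- month.name[aq$Month]

# Convert to a factor with explicit levels so tables and plots
# run May..September instead of alphabetically
aq$Month <- factor(aq$Month, levels = month.name[5:9])

print(levels(aq$Month))
print(table(aq$Month))
```

Setting `levels` explicitly is what controls the order; without it, `factor()` sorts the names alphabetically and April-style ordering problems (August first) appear in plots.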
Here is a first attempt at a histogram of temperature, colored by month from May through September. Temperatures rise from May into July and August. The median temperature is 79 degrees, as computed above.
fill = Month colors the histogram by months between May - Sept.
scale_fill_discrete(name = “Month”…) provides the month names on the right side as a legend in chronological order. This is a different way to order than what was shown above.
labs allows us to add a title, axes labels, and a caption for the data source
Plot 1 Code
p1 <- airquality |>
  ggplot(aes(x = Temp, fill = Month)) +
  geom_histogram(position = "identity") +
  scale_fill_discrete(name = "Month",
                      labels = c("May", "June", "July", "August", "September")) +
  labs(x = "Monthly Temperatures from May - Sept",
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service") # provide the data source
p1
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Plot 2: Improve the histogram of Average Temperature by Month
Outline the bars in white using the color = “white” argument
Use alpha to add some transparency (values between 0 and 1)
Change the binwidth
Add some transparency and white borders around the histogram bars.
p2 <- airquality |>
  ggplot(aes(x = Temp, fill = Month)) +
  geom_histogram(position = "identity", alpha = 0.5, binwidth = 5, color = "white") +
  labs(x = "Monthly Temperatures from May - Sept",
       y = "Frequency of Temps",
       title = "Histogram of Monthly Temperatures from May - Sept, 1973",
       caption = "New York State Department of Conservation and the National Weather Service")
print(p2)
Plot 2 Output
Here July stands out for having a high frequency of 85-degree temperatures. The dark purple color shows where months overlap, which is visible because of the transparency.
Did this improve the readability of the plot?
Plot 3: Create side-by-side boxplots categorized by Month
We can see that August has the highest temperatures based on the boxplot distribution.
p3 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) +
  labs(x = "Months from May through September",
       y = "Temperatures",
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_discrete(name = "Month",
                      labels = c("May", "June", "July", "August", "September"))
print(p3)
Notice that the points above and below the boxplots in June and July are outliers.
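A boxplot marks a point as an outlier when it lies more than 1.5 times the interquartile range beyond the first or third quartile. The sketch below applies that rule to each month so you can list the flagged temperatures. The helper `flag_outliers` is a hypothetical function written for this illustration, and the months are left as the numbers 5 - 9.

```r
# Fresh copy of the built-in dataset (hypothetical name: aq)
aq <- datasets::airquality

# Hypothetical helper: return the values outside Q1 - coef*IQR and Q3 + coef*IQR
flag_outliers <- function(x, coef = 1.5) {
  q   <- quantile(x, c(0.25, 0.75), names = FALSE)  # first and third quartiles
  iqr <- q[2] - q[1]                                # interquartile range
  x[x < q[1] - coef * iqr | x > q[2] + coef * iqr]
}

# Apply the helper to the temperatures of each month; the result is a list
outliers_by_month <- tapply(aq$Temp, aq$Month, flag_outliers)
print(outliers_by_month)
```

Months with no flagged values show up as empty vectors in the printed list.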
Plot 4: Side by Side Boxplots in Gray Scale
Make the same side-by-side boxplots, but in grey-scale
Use the scale_fill_grey command for the grey-scale legend, and again, use fill=Month in the aesthetics.
Plot 4 Code
Here we just changed the color palette to gray scale using scale_fill_grey
p4 <- airquality |>
  ggplot(aes(Month, Temp, fill = Month)) +
  labs(x = "Monthly Temperatures",
       y = "Temperatures",
       title = "Side-by-Side Boxplot of Monthly Temperatures",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_boxplot() +
  scale_fill_grey(name = "Month",
                  labels = c("May", "June", "July", "August", "September"))
p4
Plot 5:
Now make one new plot on your own, that is meaningfully different from the 4 I have shown you. You can select any of the variables in this dataset. Be sure to explore the dataset to see which variables are included that we have not explored yet. You may create a scatterplot, histogram, boxplot, or something else. Be sure to include a title, axes labels, and caption for the data source in your Plot 5. Then finally, below your chunk of code for your plot 5, ….
# Remove rows with missing values
cleaned_data <- na.omit(airquality)
cleaned_data
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 May 1
2 36 118 8.0 72 May 2
3 12 149 12.6 74 May 3
4 18 313 11.5 62 May 4
7 23 299 8.6 65 May 7
8 19 99 13.8 59 May 8
9 8 19 20.1 61 May 9
12 16 256 9.7 69 May 12
13 11 290 9.2 66 May 13
14 14 274 10.9 68 May 14
15 18 65 13.2 58 May 15
16 14 334 11.5 64 May 16
17 34 307 12.0 66 May 17
18 6 78 18.4 57 May 18
19 30 322 11.5 68 May 19
20 11 44 9.7 62 May 20
21 1 8 9.7 59 May 21
22 11 320 16.6 73 May 22
23 4 25 9.7 61 May 23
24 32 92 12.0 61 May 24
28 23 13 12.0 67 May 28
29 45 252 14.9 81 May 29
30 115 223 5.7 79 May 30
31 37 279 7.4 76 May 31
38 29 127 9.7 82 June 7
40 71 291 13.8 90 June 9
41 39 323 11.5 87 June 10
44 23 148 8.0 82 June 13
47 21 191 14.9 77 June 16
48 37 284 20.7 72 June 17
49 20 37 9.2 65 June 18
50 12 120 11.5 73 June 19
51 13 137 10.3 76 June 20
62 135 269 4.1 84 July 1
63 49 248 9.2 85 July 2
64 32 236 9.2 81 July 3
66 64 175 4.6 83 July 5
67 40 314 10.9 83 July 6
68 77 276 5.1 88 July 7
69 97 267 6.3 92 July 8
70 97 272 5.7 92 July 9
71 85 175 7.4 89 July 10
73 10 264 14.3 73 July 12
74 27 175 14.9 81 July 13
76 7 48 14.3 80 July 15
77 48 260 6.9 81 July 16
78 35 274 10.3 82 July 17
79 61 285 6.3 84 July 18
80 79 187 5.1 87 July 19
81 63 220 11.5 85 July 20
82 16 7 6.9 74 July 21
85 80 294 8.6 86 July 24
86 108 223 8.0 85 July 25
87 20 81 8.6 82 July 26
88 52 82 12.0 86 July 27
89 82 213 7.4 88 July 28
90 50 275 7.4 86 July 29
91 64 253 7.4 83 July 30
92 59 254 9.2 81 July 31
93 39 83 6.9 81 August 1
94 9 24 13.8 81 August 2
95 16 77 7.4 82 August 3
99 122 255 4.0 89 August 7
100 89 229 10.3 90 August 8
101 110 207 8.0 90 August 9
104 44 192 11.5 86 August 12
105 28 273 11.5 82 August 13
106 65 157 9.7 80 August 14
108 22 71 10.3 77 August 16
109 59 51 6.3 79 August 17
110 23 115 7.4 76 August 18
111 31 244 10.9 78 August 19
112 44 190 10.3 78 August 20
113 21 259 15.5 77 August 21
114 9 36 14.3 72 August 22
116 45 212 9.7 79 August 24
117 168 238 3.4 81 August 25
118 73 215 8.0 86 August 26
120 76 203 9.7 97 August 28
121 118 225 2.3 94 August 29
122 84 237 6.3 96 August 30
123 85 188 6.3 94 August 31
124 96 167 6.9 91 September 1
125 78 197 5.1 92 September 2
126 73 183 2.8 93 September 3
127 91 189 4.6 93 September 4
128 47 95 7.4 87 September 5
129 32 92 15.5 84 September 6
130 20 252 10.9 80 September 7
131 23 220 10.3 78 September 8
132 21 230 10.9 75 September 9
133 24 259 9.7 73 September 10
134 44 236 14.9 81 September 11
135 21 259 15.5 76 September 12
136 28 238 6.3 77 September 13
137 9 24 10.9 71 September 14
138 13 112 11.5 71 September 15
139 46 237 6.9 78 September 16
140 18 224 13.8 67 September 17
141 13 27 10.3 76 September 18
142 24 238 10.3 68 September 19
143 16 201 8.0 82 September 20
144 13 238 12.6 64 September 21
145 23 14 9.2 71 September 22
146 36 139 10.3 81 September 23
147 7 49 10.3 69 September 24
148 14 20 16.6 63 September 25
149 30 193 6.9 70 September 26
151 14 191 14.3 75 September 28
152 18 131 8.0 76 September 29
153 20 223 11.5 68 September 30
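Note that na.omit() drops a row when any variable is missing, not only Ozone. The following sketch counts the missing values per column and compares the two ways of filtering. It uses a fresh copy of the data, and the names `complete_rows` and `ozone_rows` are illustrative.

```r
# Fresh copy of the built-in dataset (hypothetical name: aq)
aq <- datasets::airquality

# Number of missing values in each variable
na_per_column <- colSums(is.na(aq))
print(na_per_column)

# na.omit() removes a row if ANY column is NA
complete_rows <- na.omit(aq)

# Logical indexing removes a row only when Ozone is NA
ozone_rows <- aq[!is.na(aq$Ozone), ]

print(c(all = nrow(aq),
        complete = nrow(complete_rows),
        ozone_known = nrow(ozone_rows)))
```

If a plot only needs Temp and Ozone, filtering on Ozone alone keeps the rows whose only gap is in Solar.R.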
# Plot the rows kept by na.omit() so no points are dropped for missing values
p5 <- cleaned_data |>
  ggplot(aes(x = Temp, y = Ozone, color = factor(Month))) +
  labs(x = "Temperature (°F)",
       y = "Ozone Levels (ppb)",
       title = "Scatter Plot of Ozone Levels vs Temperature",
       caption = "New York State Department of Conservation and the National Weather Service") +
  geom_point() +
  scale_color_discrete(name = "Month",
                       labels = c("May", "June", "July", "August", "September"))
p5
Write a brief essay here
-Describe the plot type you have created
Plot 5 is a scatter plot that visualizes the relationship between temperature and ozone levels in the airquality dataset. Each point represents one observation, with temperature on the x-axis and ozone level on the y-axis. The points are colored by month, which lets us see how this relationship varies from May to September.
-Any insights that the plot shows
The scatter plot shows a general trend in which higher temperatures are associated with higher ozone levels. Coloring the points by month lets us observe how this relationship differs across months. For instance, May and September (lighter colors) show lower temperatures and a wider range of ozone levels. June, July, and August (darker colors) tend to have higher temperatures, and the relationship between temperature and ozone is more pronounced.
-Describe any special code you used to make this plot
I used na.omit() to remove every observation with a missing value in any variable, which includes the missing readings of the Ozone variable.
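The trend described in the essay can also be checked numerically. The sketch below computes the Pearson correlation between temperature and ozone, overall and within each month. This is an added check rather than part of the original analysis; it uses a fresh copy of the data with months left as the numbers 5 - 9, and the variable names are illustrative.

```r
# Fresh copy of the built-in dataset (hypothetical name: aq)
aq <- datasets::airquality

# use = "complete.obs" skips rows where either value is missing
r_overall <- cor(aq$Temp, aq$Ozone, use = "complete.obs")
print(r_overall)

# split() breaks the data frame into one piece per month;
# sapply() then computes the correlation inside each piece
r_by_month <- sapply(split(aq, aq$Month), function(d) {
  cor(d$Temp, d$Ozone, use = "complete.obs")
})
print(r_by_month)
```

A positive value means ozone tends to rise with temperature. Some months have few ozone readings, so the per-month values are rough estimates.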