Use the built-in airquality R dataset to visualize the air quality in 1973. Explore the patterns in the data by creating histograms, boxplots, line graphs, heatmaps, and scatterplots.
Load the libraries ggplot2 and reshape2 in the setup chunk below.
# Add your library below.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.3
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.4.3
We will use the airquality dataset that you have as part of your R installation. Read the airquality data using the data.frame() function and save it as air.
# Write your code below.
airquality
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
## 9 8 19 20.1 61 5 9
## 10 NA 194 8.6 69 5 10
## 11 7 NA 6.9 74 5 11
## 12 16 256 9.7 69 5 12
## 13 11 290 9.2 66 5 13
## 14 14 274 10.9 68 5 14
## 15 18 65 13.2 58 5 15
## 16 14 334 11.5 64 5 16
## 17 34 307 12.0 66 5 17
## 18 6 78 18.4 57 5 18
## 19 30 322 11.5 68 5 19
## 20 11 44 9.7 62 5 20
## 21 1 8 9.7 59 5 21
## 22 11 320 16.6 73 5 22
## 23 4 25 9.7 61 5 23
## 24 32 92 12.0 61 5 24
## 25 NA 66 16.6 57 5 25
## 26 NA 266 14.9 58 5 26
## 27 NA NA 8.0 57 5 27
## 28 23 13 12.0 67 5 28
## 29 45 252 14.9 81 5 29
## 30 115 223 5.7 79 5 30
## 31 37 279 7.4 76 5 31
## 32 NA 286 8.6 78 6 1
## 33 NA 287 9.7 74 6 2
## 34 NA 242 16.1 67 6 3
## 35 NA 186 9.2 84 6 4
## 36 NA 220 8.6 85 6 5
## 37 NA 264 14.3 79 6 6
## 38 29 127 9.7 82 6 7
## 39 NA 273 6.9 87 6 8
## 40 71 291 13.8 90 6 9
## 41 39 323 11.5 87 6 10
## 42 NA 259 10.9 93 6 11
## 43 NA 250 9.2 92 6 12
## 44 23 148 8.0 82 6 13
## 45 NA 332 13.8 80 6 14
## 46 NA 322 11.5 79 6 15
## 47 21 191 14.9 77 6 16
## 48 37 284 20.7 72 6 17
## 49 20 37 9.2 65 6 18
## 50 12 120 11.5 73 6 19
## 51 13 137 10.3 76 6 20
## 52 NA 150 6.3 77 6 21
## 53 NA 59 1.7 76 6 22
## 54 NA 91 4.6 76 6 23
## 55 NA 250 6.3 76 6 24
## 56 NA 135 8.0 75 6 25
## 57 NA 127 8.0 78 6 26
## 58 NA 47 10.3 73 6 27
## 59 NA 98 11.5 80 6 28
## 60 NA 31 14.9 77 6 29
## 61 NA 138 8.0 83 6 30
## 62 135 269 4.1 84 7 1
## 63 49 248 9.2 85 7 2
## 64 32 236 9.2 81 7 3
## 65 NA 101 10.9 84 7 4
## 66 64 175 4.6 83 7 5
## 67 40 314 10.9 83 7 6
## 68 77 276 5.1 88 7 7
## 69 97 267 6.3 92 7 8
## 70 97 272 5.7 92 7 9
## 71 85 175 7.4 89 7 10
## 72 NA 139 8.6 82 7 11
## 73 10 264 14.3 73 7 12
## 74 27 175 14.9 81 7 13
## 75 NA 291 14.9 91 7 14
## 76 7 48 14.3 80 7 15
## 77 48 260 6.9 81 7 16
## 78 35 274 10.3 82 7 17
## 79 61 285 6.3 84 7 18
## 80 79 187 5.1 87 7 19
## 81 63 220 11.5 85 7 20
## 82 16 7 6.9 74 7 21
## 83 NA 258 9.7 81 7 22
## 84 NA 295 11.5 82 7 23
## 85 80 294 8.6 86 7 24
## 86 108 223 8.0 85 7 25
## 87 20 81 8.6 82 7 26
## 88 52 82 12.0 86 7 27
## 89 82 213 7.4 88 7 28
## 90 50 275 7.4 86 7 29
## 91 64 253 7.4 83 7 30
## 92 59 254 9.2 81 7 31
## 93 39 83 6.9 81 8 1
## 94 9 24 13.8 81 8 2
## 95 16 77 7.4 82 8 3
## 96 78 NA 6.9 86 8 4
## 97 35 NA 7.4 85 8 5
## 98 66 NA 4.6 87 8 6
## 99 122 255 4.0 89 8 7
## 100 89 229 10.3 90 8 8
## 101 110 207 8.0 90 8 9
## 102 NA 222 8.6 92 8 10
## 103 NA 137 11.5 86 8 11
## 104 44 192 11.5 86 8 12
## 105 28 273 11.5 82 8 13
## 106 65 157 9.7 80 8 14
## 107 NA 64 11.5 79 8 15
## 108 22 71 10.3 77 8 16
## 109 59 51 6.3 79 8 17
## 110 23 115 7.4 76 8 18
## 111 31 244 10.9 78 8 19
## 112 44 190 10.3 78 8 20
## 113 21 259 15.5 77 8 21
## 114 9 36 14.3 72 8 22
## 115 NA 255 12.6 75 8 23
## 116 45 212 9.7 79 8 24
## 117 168 238 3.4 81 8 25
## 118 73 215 8.0 86 8 26
## 119 NA 153 5.7 88 8 27
## 120 76 203 9.7 97 8 28
## 121 118 225 2.3 94 8 29
## 122 84 237 6.3 96 8 30
## 123 85 188 6.3 94 8 31
## 124 96 167 6.9 91 9 1
## 125 78 197 5.1 92 9 2
## 126 73 183 2.8 93 9 3
## 127 91 189 4.6 93 9 4
## 128 47 95 7.4 87 9 5
## 129 32 92 15.5 84 9 6
## 130 20 252 10.9 80 9 7
## 131 23 220 10.3 78 9 8
## 132 21 230 10.9 75 9 9
## 133 24 259 9.7 73 9 10
## 134 44 236 14.9 81 9 11
## 135 21 259 15.5 76 9 12
## 136 28 238 6.3 77 9 13
## 137 9 24 10.9 71 9 14
## 138 13 112 11.5 71 9 15
## 139 46 237 6.9 78 9 16
## 140 18 224 13.8 67 9 17
## 141 13 27 10.3 76 9 18
## 142 24 238 10.3 68 9 19
## 143 16 201 8.0 82 9 20
## 144 13 238 12.6 64 9 21
## 145 23 14 9.2 71 9 22
## 146 36 139 10.3 81 9 23
## 147 7 49 10.3 69 9 24
## 148 14 20 16.6 63 9 25
## 149 30 193 6.9 70 9 26
## 150 NA 145 13.2 77 9 27
## 151 14 191 14.3 75 9 28
## 152 18 131 8.0 76 9 29
## 153 20 223 11.5 68 9 30
air <- data.frame(airquality)
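Printing the whole data frame works, but a few summary calls are usually easier to read. The following is a minimal sketch of a more compact inspection; the name air_preview is hypothetical and is used only so that the air object above is left untouched.

```r
# Copy the built-in dataset (air_preview is an illustrative name).
air_preview <- data.frame(airquality)

# Compact inspection instead of printing every row.
dim(air_preview)       # number of rows and columns
head(air_preview, 5)   # first five rows
str(air_preview)       # column names and types
summary(air_preview)   # per-column summaries, including NA counts
```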
After you load the data, there will be some NA values in the data. You need to figure out what to do about those nasty NAs. Google is your friend. Use Google actively to find a way to get rid of NAs using R.
HINT: Find columns that include NA values first (use the colSums() function), then replace NAs with the mean value of these columns.
Optional: If you want to implement a better approach to dealing with missing data, “multiple imputation” would be where you should go.
You can check sources like:
# Write your code below.
# Find columns with NAs
colSums(is.na(air))
## Ozone Solar.R Wind Temp Month Day
## 37 7 0 0 0 0
# Replace NAs in each column with that column's mean (ignore NAs when computing mean)
for (col in names(air)) {
  if (any(is.na(air[[col]]))) {
    air[[col]][is.na(air[[col]])] <- mean(air[[col]], na.rm = TRUE)
  }
}
# Check again
colSums(is.na(air))
## Ozone Solar.R Wind Temp Month Day
## 0 0 0 0 0 0
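Mean replacement is simple, but it has a side effect worth seeing: the column mean stays the same while the variance shrinks. The sketch below illustrates this and shows a median-based variant for comparison. The helper impute_na and the objects air_mean and air_median are hypothetical names introduced only for this example; this is not a substitute for multiple imputation.

```r
# Hypothetical helper: replace NAs in a numeric vector with a summary statistic.
impute_na <- function(x, stat = mean) {
  x[is.na(x)] <- stat(x, na.rm = TRUE)
  x
}

air_mean   <- data.frame(airquality)
air_median <- data.frame(airquality)

# Only Ozone and Solar.R contain NAs in airquality.
for (col in c("Ozone", "Solar.R")) {
  air_mean[[col]]   <- impute_na(airquality[[col]], mean)
  air_median[[col]] <- impute_na(airquality[[col]], median)
}

# Mean imputation leaves the column mean unchanged but reduces the variance.
mean(airquality$Ozone, na.rm = TRUE)
mean(air_mean$Ozone)
var(airquality$Ozone, na.rm = TRUE)
var(air_mean$Ozone)
```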
Create the following visualizations:
- Histograms for each of the variables
- A boxplot for Ozone, and boxplots for different Wind values (round Wind to get a good number of “buckets”)
HINT: You can create one graph with Ozone on the x-axis and Wind on the y-axis. Alternatively, you can create two graphs (one with Ozone values on the y-axis, and another whose y-axis value is the rounded value of Wind). In this case, you need to create a fake grouping variable (x = factor(0)).
# Write your code below.
# Melt the data for easy faceted histograms
air_long <- melt(air)
## No id variables; using all as measure variables
ggplot(air_long, aes(x = value)) +
  geom_histogram(bins = 20, fill = "steelblue", color = "black") +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Histograms of Airquality Variables", x = "Value", y = "Count")
# Boxplot for Ozone
ggplot(air, aes(x = factor(0), y = Ozone)) +
  geom_boxplot(fill = "tomato") +
  labs(title = "Boxplot of Ozone Levels", x = "", y = "Ozone")
# Boxplots for Ozone grouped by rounded Wind
air$WindGroup <- round(air$Wind)
ggplot(air, aes(x = factor(WindGroup), y = Ozone)) +
  geom_boxplot(fill = "violet") +
  labs(title = "Ozone by Rounded Wind Groups", x = "Rounded Wind", y = "Ozone")
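Whether rounding gives "a good number of buckets" can be checked before plotting by counting the observations per rounded Wind value. This is a small sketch using base R only. The names wind_group and bucket_counts are hypothetical, and the threshold of 3 is an arbitrary assumption for flagging sparse groups.

```r
# Round Wind to whole numbers; each distinct value becomes one boxplot group.
wind_group <- round(airquality$Wind)

# How many observations fall into each bucket?
bucket_counts <- table(wind_group)
bucket_counts

# Buckets with very few observations (threshold of 3 is an assumption)
# produce boxplots that carry little information.
bucket_counts[bucket_counts < 3]
```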
First, create the appropriate dates (this data was from 1973). Then create line charts for Ozone, Temp, Wind and Solar.R (one line chart for each, and then one chart with 4 lines, each variable having a different color).
HINT:
- Combine (use the paste() function) 1973 with the Month and Day columns, using “/” in between
- Use the as.Date() function to transform the value to “Date” format and create a Date column to record this value. Use ?as.Date to find example use of this function.
Note that for the chart with 4 lines, you need to think about how to effectively use the y-axis.
HINT: You need to “melt” the columns so that the data becomes “long” format. You can either use the melt() function or use gather() from the “tidyr” package for the process. Also, compared to other values, the Wind values are too small, so they are close to the x-axis. You need to multiply Wind by 10 to address this problem.
The correct result should look like this:
# Write your code below.
# Create a Date column for 1973
air$Date <- as.Date(paste(1973, air$Month, air$Day, sep = "/"), format = "%Y/%m/%d")
# Line charts for each variable
ggplot(air, aes(x = Date, y = Ozone)) +
  geom_line(color = "royalblue", linewidth = 1) +
  labs(title = "Ozone Over Time", x = "Date", y = "Ozone")
ggplot(air, aes(x = Date, y = Temp)) +
  geom_line(color = "firebrick", linewidth = 1) +
  labs(title = "Temperature Over Time", x = "Date", y = "Temperature (F)")
ggplot(air, aes(x = Date, y = Wind)) +
  geom_line(color = "lightseagreen", linewidth = 1) +
  labs(title = "Wind Over Time", x = "Date", y = "Wind (mph)")
ggplot(air, aes(x = Date, y = Solar.R)) +
  geom_line(color = "goldenrod", linewidth = 1) +
  labs(title = "Solar Radiation Over Time", x = "Date", y = "Solar Radiation (lang)")
# Line chart with all 4 variables (Wind multiplied by 10 for scaling)
air_melted <- melt(air, id.vars = "Date", measure.vars = c("Ozone", "Temp", "Solar.R", "Wind"))
air_melted$value_scaled <- air_melted$value
air_melted$value_scaled[air_melted$variable == "Wind"] <- air_melted$value[air_melted$variable == "Wind"] * 10
ggplot(air_melted, aes(x = Date, y = value_scaled, color = variable)) +
  geom_line(linewidth = 1) +
  labs(title = "Air Quality Metrics Over Time", x = "Date", y = "Value (Wind x10)", color = "Variable")
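Because as.Date() returns NA without warning when a string does not match the format, it can help to verify the constructed dates before plotting. The sketch below rebuilds the dates the same way as the Date column above and runs a few sanity checks; dates is a hypothetical name used only here.

```r
# Build the dates as "1973/Month/Day", then parse them.
dates <- as.Date(paste(1973, airquality$Month, airquality$Day, sep = "/"),
                 format = "%Y/%m/%d")

# Sanity checks: no parse failures, and one row per consecutive day.
stopifnot(!anyNA(dates))
range(dates)            # first and last day covered by the data
all(diff(dates) == 1)   # TRUE when there are no gaps or duplicate days
```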
Create a heatmap, with each day (using dates) along the x-axis and Ozone, Temp, Wind and Solar.R along the y-axis.
Note that you need to figure out how to show the relative change equally across all the variables.
HINTS:
- You may find this post helpful: https://learnr.wordpress.com/2010/01/26/ggplot2-quick-heatmap-plotting/
- Melt the original table to make a long format before visualization using geom_tile()
Remember, your programming skills will improve as much as you struggle and put effort into them.
The correct result should look like this:
# Write your code below.
# Normalize for heatmap (z-score within each variable)
air_heat <- melt(air, id.vars = "Date", measure.vars = c("Ozone", "Temp", "Solar.R", "Wind"))
# scale() returns a one-column matrix, so convert it back to a plain numeric vector
air_heat$zvalue <- ave(air_heat$value, air_heat$variable,
                       FUN = function(x) as.numeric(scale(x)))
ggplot(air_heat, aes(x = Date, y = variable, fill = zvalue)) +
  geom_tile(color = "grey80") +
  scale_fill_gradient2(low = "steelblue", high = "tomato", mid = "white", midpoint = 0) +
  labs(title = "Heatmap of Air Quality Variables", x = "Date", y = "Variable", fill = "Z-Score")
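Z-scores are one way to put the four variables on a common scale. Another common option is min-max rescaling, which maps each variable onto the range 0 to 1. The sketch below shows that alternative on the raw data. The helper rescale01 and the objects vars and air_scaled are hypothetical names; the result could be melted and passed to geom_tile() in the same way as the z-scores.

```r
# Hypothetical helper: map a numeric vector onto [0, 1].
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

vars <- c("Ozone", "Temp", "Solar.R", "Wind")

# Rescale each variable separately so all four share the same 0-1 range.
air_scaled <- as.data.frame(lapply(airquality[vars], rescale01))

# Each column now spans 0 to 1 (NAs are left in place in this sketch).
sapply(air_scaled, range, na.rm = TRUE)
```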
Create a scatter chart, with:
- the x-axis representing the Wind
- the y-axis representing the Temp
- the size of each dot representing the Ozone
- the color representing Solar.R
# Write your code below.
ggplot(air, aes(x = Wind, y = Temp, size = Ozone, color = Solar.R)) +
  geom_point(alpha = 0.7) +
  labs(title = "Scatterplot: Temp vs Wind\n(Size = Ozone, Color = Solar.R)",
       x = "Wind (mph)", y = "Temperature (F)", size = "Ozone", color = "Solar Radiation")
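A scatterplot with four encodings can be hard to read precisely, so a correlation matrix is a useful numeric companion. This is a minimal sketch on the raw data, restricted to rows where all four values are present; vars and cor_mat are hypothetical names.

```r
vars <- c("Wind", "Temp", "Ozone", "Solar.R")

# Pairwise Pearson correlations, using only rows with no missing values.
cor_mat <- cor(airquality[vars], use = "complete.obs")
round(cor_mat, 2)
```

The sign of each entry shows the direction of the relationship that the scatterplot displays visually, for example whether windier days tend to be cooler.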
Do you see any patterns after exploring the data?
There appears to be a seasonal trend in Ozone and Solar.R, with higher levels in summer months. Temperature closely follows this pattern.
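One way to check a seasonal reading like this against the numbers is to compute monthly means. The sketch below does so on the raw data, skipping missing values rather than imputing them; monthly is a hypothetical name.

```r
# Monthly means from the raw data, ignoring missing values.
monthly <- aggregate(
  airquality[c("Ozone", "Solar.R", "Temp")],
  by = list(Month = airquality$Month),
  FUN = mean, na.rm = TRUE
)
monthly
```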
What was the most useful visualization?
The heatmap made it easy to compare daily fluctuations and spot multi-variable peaks and valleys across the season.