Instructions

Use the built-in airquality R dataset to visualize the air quality in 1973. Explore the patterns in the data by creating histograms, boxplots, line graphs, heatmaps, and scatterplots.

Load the libraries ggplot2 and reshape2 in the setup chunk below.


# Add your library below.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.3
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.4.3

Step 1 - Load the Data

We will use the airquality dataset that you have as part of your R installation. Read the airquality data using the data.frame() function and save it as air.

# Write your code below.
airquality
##     Ozone Solar.R Wind Temp Month Day
## 1      41     190  7.4   67     5   1
## 2      36     118  8.0   72     5   2
## 3      12     149 12.6   74     5   3
## 4      18     313 11.5   62     5   4
## 5      NA      NA 14.3   56     5   5
## 6      28      NA 14.9   66     5   6
## 7      23     299  8.6   65     5   7
## 8      19      99 13.8   59     5   8
## 9       8      19 20.1   61     5   9
## 10     NA     194  8.6   69     5  10
## 11      7      NA  6.9   74     5  11
## 12     16     256  9.7   69     5  12
## 13     11     290  9.2   66     5  13
## 14     14     274 10.9   68     5  14
## 15     18      65 13.2   58     5  15
## 16     14     334 11.5   64     5  16
## 17     34     307 12.0   66     5  17
## 18      6      78 18.4   57     5  18
## 19     30     322 11.5   68     5  19
## 20     11      44  9.7   62     5  20
## 21      1       8  9.7   59     5  21
## 22     11     320 16.6   73     5  22
## 23      4      25  9.7   61     5  23
## 24     32      92 12.0   61     5  24
## 25     NA      66 16.6   57     5  25
## 26     NA     266 14.9   58     5  26
## 27     NA      NA  8.0   57     5  27
## 28     23      13 12.0   67     5  28
## 29     45     252 14.9   81     5  29
## 30    115     223  5.7   79     5  30
## 31     37     279  7.4   76     5  31
## 32     NA     286  8.6   78     6   1
## 33     NA     287  9.7   74     6   2
## 34     NA     242 16.1   67     6   3
## 35     NA     186  9.2   84     6   4
## 36     NA     220  8.6   85     6   5
## 37     NA     264 14.3   79     6   6
## 38     29     127  9.7   82     6   7
## 39     NA     273  6.9   87     6   8
## 40     71     291 13.8   90     6   9
## 41     39     323 11.5   87     6  10
## 42     NA     259 10.9   93     6  11
## 43     NA     250  9.2   92     6  12
## 44     23     148  8.0   82     6  13
## 45     NA     332 13.8   80     6  14
## 46     NA     322 11.5   79     6  15
## 47     21     191 14.9   77     6  16
## 48     37     284 20.7   72     6  17
## 49     20      37  9.2   65     6  18
## 50     12     120 11.5   73     6  19
## 51     13     137 10.3   76     6  20
## 52     NA     150  6.3   77     6  21
## 53     NA      59  1.7   76     6  22
## 54     NA      91  4.6   76     6  23
## 55     NA     250  6.3   76     6  24
## 56     NA     135  8.0   75     6  25
## 57     NA     127  8.0   78     6  26
## 58     NA      47 10.3   73     6  27
## 59     NA      98 11.5   80     6  28
## 60     NA      31 14.9   77     6  29
## 61     NA     138  8.0   83     6  30
## 62    135     269  4.1   84     7   1
## 63     49     248  9.2   85     7   2
## 64     32     236  9.2   81     7   3
## 65     NA     101 10.9   84     7   4
## 66     64     175  4.6   83     7   5
## 67     40     314 10.9   83     7   6
## 68     77     276  5.1   88     7   7
## 69     97     267  6.3   92     7   8
## 70     97     272  5.7   92     7   9
## 71     85     175  7.4   89     7  10
## 72     NA     139  8.6   82     7  11
## 73     10     264 14.3   73     7  12
## 74     27     175 14.9   81     7  13
## 75     NA     291 14.9   91     7  14
## 76      7      48 14.3   80     7  15
## 77     48     260  6.9   81     7  16
## 78     35     274 10.3   82     7  17
## 79     61     285  6.3   84     7  18
## 80     79     187  5.1   87     7  19
## 81     63     220 11.5   85     7  20
## 82     16       7  6.9   74     7  21
## 83     NA     258  9.7   81     7  22
## 84     NA     295 11.5   82     7  23
## 85     80     294  8.6   86     7  24
## 86    108     223  8.0   85     7  25
## 87     20      81  8.6   82     7  26
## 88     52      82 12.0   86     7  27
## 89     82     213  7.4   88     7  28
## 90     50     275  7.4   86     7  29
## 91     64     253  7.4   83     7  30
## 92     59     254  9.2   81     7  31
## 93     39      83  6.9   81     8   1
## 94      9      24 13.8   81     8   2
## 95     16      77  7.4   82     8   3
## 96     78      NA  6.9   86     8   4
## 97     35      NA  7.4   85     8   5
## 98     66      NA  4.6   87     8   6
## 99    122     255  4.0   89     8   7
## 100    89     229 10.3   90     8   8
## 101   110     207  8.0   90     8   9
## 102    NA     222  8.6   92     8  10
## 103    NA     137 11.5   86     8  11
## 104    44     192 11.5   86     8  12
## 105    28     273 11.5   82     8  13
## 106    65     157  9.7   80     8  14
## 107    NA      64 11.5   79     8  15
## 108    22      71 10.3   77     8  16
## 109    59      51  6.3   79     8  17
## 110    23     115  7.4   76     8  18
## 111    31     244 10.9   78     8  19
## 112    44     190 10.3   78     8  20
## 113    21     259 15.5   77     8  21
## 114     9      36 14.3   72     8  22
## 115    NA     255 12.6   75     8  23
## 116    45     212  9.7   79     8  24
## 117   168     238  3.4   81     8  25
## 118    73     215  8.0   86     8  26
## 119    NA     153  5.7   88     8  27
## 120    76     203  9.7   97     8  28
## 121   118     225  2.3   94     8  29
## 122    84     237  6.3   96     8  30
## 123    85     188  6.3   94     8  31
## 124    96     167  6.9   91     9   1
## 125    78     197  5.1   92     9   2
## 126    73     183  2.8   93     9   3
## 127    91     189  4.6   93     9   4
## 128    47      95  7.4   87     9   5
## 129    32      92 15.5   84     9   6
## 130    20     252 10.9   80     9   7
## 131    23     220 10.3   78     9   8
## 132    21     230 10.9   75     9   9
## 133    24     259  9.7   73     9  10
## 134    44     236 14.9   81     9  11
## 135    21     259 15.5   76     9  12
## 136    28     238  6.3   77     9  13
## 137     9      24 10.9   71     9  14
## 138    13     112 11.5   71     9  15
## 139    46     237  6.9   78     9  16
## 140    18     224 13.8   67     9  17
## 141    13      27 10.3   76     9  18
## 142    24     238 10.3   68     9  19
## 143    16     201  8.0   82     9  20
## 144    13     238 12.6   64     9  21
## 145    23      14  9.2   71     9  22
## 146    36     139 10.3   81     9  23
## 147     7      49 10.3   69     9  24
## 148    14      20 16.6   63     9  25
## 149    30     193  6.9   70     9  26
## 150    NA     145 13.2   77     9  27
## 151    14     191 14.3   75     9  28
## 152    18     131  8.0   76     9  29
## 153    20     223 11.5   68     9  30
air <- data.frame(airquality)
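
Printing the whole data frame is rarely needed just to confirm that it loaded. A minimal sketch of a lighter check, assuming only base R and the built-in airquality dataset:

```r
# airquality ships with R (package 'datasets'), so no file read is needed.
air <- data.frame(airquality)

# Compact structural summary: column names, types, and the first few values.
str(air)

# Dimensions and the first six rows, instead of the full printout.
dim(air)
head(air)
```

The dim() call should report 153 rows and 6 columns, matching the listing above.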

Step 2 - Clean the Data

After you load the data, there will be some NA values in the data. You need to figure out what to do about those nasty NAs. Google is your friend. Use Google actively to find a way to get rid of NAs using R.

HINT: Find columns that include NA values first (use the colSums() function), then replace the NAs with the mean value of those columns.

Optional: If you want to implement a better approach to dealing with missing data, “multiple imputation” would be where you should go.
You can check sources like:

# Write your code below.
# Find columns with NAs
colSums(is.na(air))
##   Ozone Solar.R    Wind    Temp   Month     Day 
##      37       7       0       0       0       0
# Replace NAs in each column with that column's mean (ignore NAs when computing mean)
for (col in names(air)) {
  if (any(is.na(air[[col]]))) {
    air[[col]][is.na(air[[col]])] <- mean(air[[col]], na.rm = TRUE)
  }
}
# Check again
colSums(is.na(air))
##   Ozone Solar.R    Wind    Temp   Month     Day 
##       0       0       0       0       0       0
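
Mean imputation is simple, but it has a side effect worth seeing once. The sketch below is one way to check it, not part of the assignment solution; it works on a fresh copy of the data, and the helper name impute_mean is an assumption introduced here for illustration.

```r
# Start again from the raw data so the comparison is against the original NAs.
raw <- data.frame(airquality)

# Hypothetical helper: replace the NAs in one numeric vector with its mean.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply to every column; columns without NAs keep their values.
imputed <- as.data.frame(lapply(raw, impute_mean))

# The column mean is preserved, but the spread shrinks because every
# filled-in value sits exactly at the mean.
c(mean_before = mean(raw$Ozone, na.rm = TRUE), mean_after = mean(imputed$Ozone))
c(sd_before = sd(raw$Ozone, na.rm = TRUE), sd_after = sd(imputed$Ozone))
```

The reduced standard deviation is one reason the optional note above points to multiple imputation as a better approach.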

Step 3 - Understand the data distribution

Create the following visualizations:

HINT: You can create one graph with Ozone on the X-axis and Wind on the Y-axis. Alternatively, you can create two graphs (one with the Ozone values on the Y-axis, and another whose Y-axis value is the rounded value of Wind). In this case, you need to create a fake grouping variable (x = factor(0)).

# Write your code below.
# Melt the data for easy faceted histograms
air_long <- melt(air)
## No id variables; using all as measure variables
ggplot(air_long, aes(x = value)) +
  geom_histogram(bins = 20, fill = "steelblue", color = "black") +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Histograms of Airquality Variables", x = "Value", y = "Count")

# Boxplot for Ozone
ggplot(air, aes(x = factor(0), y = Ozone)) +
  geom_boxplot(fill = "tomato") +
  labs(title = "Boxplot of Ozone Levels", x = "", y = "Ozone")

# Boxplots for Ozone grouped by rounded Wind
air$WindGroup <- round(air$Wind)
ggplot(air, aes(x = factor(WindGroup), y = Ozone)) +
  geom_boxplot(fill = "violet") +
  labs(title = "Ozone by Rounded Wind Groups", x = "Rounded Wind", y = "Ozone")
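
Before reading the grouped boxplots, it can help to know how many days fall into each rounded Wind bucket. A minimal sketch using base R only; the cutoff of five observations is an arbitrary assumption, not something the assignment specifies.

```r
air <- data.frame(airquality)

# Bucket Wind (mph) to the nearest whole number, as in the grouped boxplot.
wind_group <- round(air$Wind)

# Number of days in each bucket.
counts <- table(wind_group)
counts

# Buckets with fewer than five days (threshold chosen arbitrarily);
# boxes drawn from so few points should be read with caution.
names(counts[counts < 5])
```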


Step 4 - Explore how the data changes over time

First, create the appropriate dates (this data was from 1973). Then create line charts for Ozone, Temp, Wind and Solar.R (one line chart for each, and then one chart with 4 lines, each variable having a different color).

HINT:

  1. Combine (use paste() function) 1973 with the Month and Day columns, using “/” in between
  2. Use as.Date() function to transform the value to “Date” format and create a Date column to record this value. Use ?as.Date to find example use of this function.
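
The two steps in the hint can be sketched as follows. This is an illustration on a fresh copy of the data; the intermediate name date_string is an assumption introduced here.

```r
air <- data.frame(airquality)

# Step 1: build strings such as "1973/5/1".
date_string <- paste(1973, air$Month, air$Day, sep = "/")

# Step 2: parse them; on input, %m and %d accept values without a leading zero.
air$Date <- as.Date(date_string, format = "%Y/%m/%d")

# The series should run daily from 1 May to 30 September 1973 with no gaps.
range(air$Date)
all(diff(air$Date) == 1)
```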

Note that for the chart with 4 lines, you need to think about how to effectively use the y-axis.

HINT: You need to “melt” the columns so that the data becomes “long” format. You can use either the melt() function or the gather() function from the “tidyr” package for this. Also, compared to the other values, the Wind values are too small, so they sit close to the x-axis. You need to multiply Wind by 10 to address this problem.
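
To see what “melting” does, here is a small sketch on a three-day excerpt (the first three rows of the dataset, with only Ozone and Wind kept). It assumes reshape2 is installed; the names wide, long, and is_wind are introduced here for illustration.

```r
library(reshape2)

# Three-day excerpt in "wide" format: one column per variable.
wide <- data.frame(
  Date  = as.Date(c("1973-05-01", "1973-05-02", "1973-05-03")),
  Ozone = c(41, 36, 12),
  Wind  = c(7.4, 8.0, 12.6)
)

# "Long" format: one row per (Date, variable) pair.
long <- melt(wide, id.vars = "Date", measure.vars = c("Ozone", "Wind"))

# Rescale Wind only, so it shares a readable y-axis with Ozone.
is_wind <- long$variable == "Wind"
long$value[is_wind] <- long$value[is_wind] * 10
long
```

The long table has the columns Date, variable, and value, which is the shape ggplot2 needs in order to map color to variable.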

The correct result should look like this: Step 4 Line Graph

# Write your code below.
# Create a Date column for 1973
air$Date <- as.Date(paste(1973, air$Month, air$Day, sep = "/"), format = "%Y/%m/%d")

# Line charts for each variable
ggplot(air, aes(x = Date, y = Ozone)) +
  geom_line(color = "royalblue", linewidth = 1) +
  labs(title = "Ozone Over Time", x = "Date", y = "Ozone")

ggplot(air, aes(x = Date, y = Temp)) +
  geom_line(color = "firebrick", linewidth = 1) +
  labs(title = "Temperature Over Time", x = "Date", y = "Temperature (F)")

ggplot(air, aes(x = Date, y = Wind)) +
  geom_line(color = "lightseagreen", linewidth = 1) +
  labs(title = "Wind Over Time", x = "Date", y = "Wind (mph)")

ggplot(air, aes(x = Date, y = Solar.R)) +
  geom_line(color = "goldenrod", linewidth = 1) +
  labs(title = "Solar Radiation Over Time", x = "Date", y = "Solar Radiation (lang)")

# Line chart with all 4 variables (Wind multiplied by 10 for scaling)
air_melted <- melt(air, id.vars = "Date", measure.vars = c("Ozone", "Temp", "Solar.R", "Wind"))
air_melted$value_scaled <- air_melted$value
air_melted$value_scaled[air_melted$variable == "Wind"] <- air_melted$value[air_melted$variable == "Wind"] * 10

ggplot(air_melted, aes(x = Date, y = value_scaled, color = variable)) +
  geom_line(linewidth = 1) +
  labs(title = "Air Quality Metrics Over Time", x = "Date", y = "Value (Wind x10)", color = "Variable")

Step 5 - Look at all the data via a heatmap

Create a heatmap, with each day (using dates) along the x-axis and Ozone, Temp, Wind and Solar.R along the y-axis.

Note that you need to figure out how to show the relative change equally across all the variables.
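
One way to put the variables on a common footing is to rescale each one separately. The sketch below shows two common options, z-scores and min-max rescaling, using base R only; the helper name rescale01 is an assumption introduced here, and either option is a choice rather than the required answer.

```r
air <- data.frame(airquality)

# z-score: subtract the mean and divide by the standard deviation.
# scale() returns a one-column matrix, so convert back to a plain vector.
temp_z <- as.numeric(scale(air$Temp))
wind_z <- as.numeric(scale(air$Wind))

# After scaling, both variables have mean 0 and standard deviation 1,
# so a given fill color means the same relative deviation for each.
round(c(mean(temp_z), sd(temp_z), mean(wind_z), sd(wind_z)), 6)

# Alternative (hypothetical helper): min-max rescaling to the interval [0, 1].
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
range(rescale01(air$Temp))
```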

HINTS:

  - You may find this post helpful: https://learnr.wordpress.com/2010/01/26/ggplot2-quick-heatmap-plotting/
  - Melt the original table to make a long format before visualization using geom_tile()

Remember, your programming skills will improve as much as you struggle and put effort into them.

The correct result should look like this: Step 4 Line Graph

# Write your code below.
# Normalize for heatmap (z-score within each variable)
air_heat <- melt(air, id.vars = "Date", measure.vars = c("Ozone", "Temp", "Solar.R", "Wind"))
air_heat$zvalue <- ave(air_heat$value, air_heat$variable, FUN = function(x) as.numeric(scale(x)))

ggplot(air_heat, aes(x = Date, y = variable, fill = zvalue)) +
  geom_tile(color = "grey80") +
  scale_fill_gradient2(low = "steelblue", high = "tomato", mid = "white", midpoint = 0) +
  labs(title = "Heatmap of Air Quality Variables", x = "Date", y = "Variable", fill = "Z-Score")

Step 6 - Look at all the data via a scatter chart

Create a scatter chart, with:

  - the x-axis representing the Wind
  - the y-axis representing the Temp
  - the size of each dot representing the Ozone
  - the color representing Solar.R

# Write your code below.
ggplot(air, aes(x = Wind, y = Temp, size = Ozone, color = Solar.R)) +
  geom_point(alpha = 0.7) +
  labs(title = "Scatterplot: Temp vs Wind\n(Size = Ozone, Color = Solar.R)",
       x = "Wind (mph)", y = "Temperature (F)", size = "Ozone", color = "Solar Radiation")
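
The relationships suggested by the scatterplot can be cross-checked numerically. A minimal sketch, assuming base R only and working on the raw data, so missing values are dropped pair by pair instead of being imputed:

```r
air <- data.frame(airquality)

# Pairwise Pearson correlations between the four plotted variables.
vars <- c("Ozone", "Solar.R", "Wind", "Temp")
cor_matrix <- cor(air[, vars], use = "pairwise.complete.obs")
round(cor_matrix, 2)
```

A negative Wind/Temp entry and a positive Ozone/Temp entry would agree with what the plot shows.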

Step 7 - Final analysis

You MUST answer these questions in your compiled file.

Do you see any patterns after exploring the data?

There appears to be a seasonal trend in Ozone and Solar.R, with higher levels in summer months. Temperature closely follows this pattern.
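
A seasonal pattern like this can be checked with monthly averages. The sketch below is one possible check using base R on the raw data (missing values ignored rather than imputed); the name monthly is introduced here for illustration.

```r
air <- data.frame(airquality)

# Mean of each variable by calendar month (5 = May ... 9 = September),
# ignoring missing values.
vars <- c("Ozone", "Solar.R", "Temp", "Wind")
monthly <- aggregate(air[, vars], by = list(Month = air$Month),
                     FUN = mean, na.rm = TRUE)
round(monthly, 1)
```

Comparing the July and August rows with May and September shows how strong the summer peak is for each variable.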

What was the most useful visualization?

The heatmap made it easy to compare daily fluctuations and spot multi-variable peaks and valleys across the season.