Objective:
You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.
Dataset
Source: Built-in R dataset airquality.
Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.
Variables (selected for this assignment):
Notes
-The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.
-If you encounter errors, check that tidyverse and corrplot are installed and loaded.
-Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
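As a minimal sketch of the missing-value handling mentioned in the notes, the snippet below counts the NAs per column and shows what na.rm = TRUE and use = "complete.obs" change. The object names (na_counts, mean_with_na, mean_no_na, r_ozone_temp) are illustrative and not part of the assignment.

```r
# Illustrative sketch: handling missing values in airquality.
data("airquality")

# Count missing values per column (Ozone and Solar.R contain NAs).
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Without na.rm, a single NA makes the whole result NA.
mean_with_na <- mean(airquality$Ozone)

# With na.rm = TRUE, missing values are dropped before averaging.
mean_no_na <- mean(airquality$Ozone, na.rm = TRUE)

# For correlations, use = "complete.obs" keeps only rows where
# both variables are observed.
r_ozone_temp <- cor(airquality$Ozone, airquality$Temp, use = "complete.obs")

print(c(mean_with_na = mean_with_na, mean_no_na = mean_no_na,
        r_ozone_temp = r_ozone_temp))
```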
Instructions:
Complete the following tasks using R to analyze the airquality dataset. Submit your RPubs link, which should include code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.
#Load your dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data("airquality")
Tasks and Questions
Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.
#Your code for Ozone goes here
# Ozone
mean(airquality$Ozone, na.rm = TRUE)
## [1] 42.12931
median(airquality$Ozone, na.rm = TRUE)
## [1] 31.5
sd(airquality$Ozone, na.rm = TRUE)
## [1] 32.98788
min(airquality$Ozone, na.rm = TRUE)
## [1] 1
max(airquality$Ozone, na.rm = TRUE)
## [1] 168
#Your code for Temp goes here
# Temp
mean(airquality$Temp, na.rm = TRUE)
## [1] 77.88235
median(airquality$Temp, na.rm = TRUE)
## [1] 79
sd(airquality$Temp, na.rm = TRUE)
## [1] 9.46527
min(airquality$Temp, na.rm = TRUE)
## [1] 56
max(airquality$Temp, na.rm = TRUE)
## [1] 97
#Your code for Wind goes here
# Wind
mean(airquality$Wind, na.rm = TRUE)
## [1] 9.957516
median(airquality$Wind, na.rm = TRUE)
## [1] 9.7
sd(airquality$Wind, na.rm = TRUE)
## [1] 3.523001
min(airquality$Wind, na.rm = TRUE)
## [1] 1.7
max(airquality$Wind, na.rm = TRUE)
## [1] 20.7
Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?
For Ozone, the mean (42.13) is well above the median (31.5), which indicates a right-skewed distribution. The standard deviation (32.99) is large relative to the mean, showing high variability in ozone levels. For Temp, the mean (77.88) and median (79) are close, with the median slightly higher, which suggests a mild left skew. The standard deviation (9.47) shows moderate variability in temperature. For Wind, the mean (9.96) and median (9.7) are similar, indicating a roughly symmetric distribution. The standard deviation (3.52) is the smallest of the three in absolute terms, although the variables are measured in different units.
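The mean-versus-median comparison can also be done in one pass. This is a sketch only: the names vars, center_spread, and mean_minus_median are hypothetical helpers, and the sign of the difference is a rough indicator of skew direction, not a formal skewness statistic.

```r
# Illustrative sketch: compare mean and median for the three variables.
data("airquality")

vars <- c("Ozone", "Temp", "Wind")

# Rows: mean, median, sd. Columns: Ozone, Temp, Wind.
center_spread <- sapply(airquality[vars], function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE))
})

# Positive difference hints at right skew, negative at left skew,
# and a value near zero at a roughly symmetric distribution.
mean_minus_median <- center_spread["mean", ] - center_spread["median", ]

print(round(center_spread, 2))
print(round(mean_minus_median, 2))
```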
Generate the histogram for Ozone.
#Your code goes here
library(ggplot2)
p1 <- ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 10, fill = "purple", color = "black") +
  labs(title = "Histogram of Ozone Levels",
       x = "Ozone",
       y = "Frequency")
p1
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).
Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?
The distribution of Ozone levels in the airquality dataset is right-skewed. There are outliers at the upper end of the range (the maximum is 168), representing days with unusually high ozone concentration. These extreme values pull the mean upward and contribute to the right-skewed shape.
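One common way to make "outlier" concrete is the 1.5 × IQR rule that boxplots use. The sketch below applies it to Ozone. The rule is a convention, not the only possible definition, and the names ozone, upper_fence, lower_fence, and outliers are illustrative.

```r
# Illustrative sketch: flag Ozone outliers with the 1.5 * IQR rule.
data("airquality")

# Drop missing values first.
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# First and third quartiles and the interquartile range.
q <- quantile(ozone, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Values beyond the fences are treated as outliers under this rule.
upper_fence <- q[[2]] + 1.5 * iqr
lower_fence <- q[[1]] - 1.5 * iqr
outliers <- ozone[ozone > upper_fence | ozone < lower_fence]

print(c(lower_fence = lower_fence, upper_fence = upper_fence))
print(sort(outliers))
```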
Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.
# Your code here
aq <- airquality |>
  mutate(month_name = factor(case_when(
    Month == 5 ~ "May",
    Month == 6 ~ "June",
    Month == 7 ~ "July",
    Month == 8 ~ "August",
    Month == 9 ~ "September"),
    levels = c("May", "June", "July", "August", "September")))
aq
## Ozone Solar.R Wind Temp Month Day month_name
## 1 41 190 7.4 67 5 1 May
## 2 36 118 8.0 72 5 2 May
## 3 12 149 12.6 74 5 3 May
## 4 18 313 11.5 62 5 4 May
## 5 NA NA 14.3 56 5 5 May
## 6 28 NA 14.9 66 5 6 May
## 7 23 299 8.6 65 5 7 May
## 8 19 99 13.8 59 5 8 May
## 9 8 19 20.1 61 5 9 May
## 10 NA 194 8.6 69 5 10 May
## 11 7 NA 6.9 74 5 11 May
## 12 16 256 9.7 69 5 12 May
## 13 11 290 9.2 66 5 13 May
## 14 14 274 10.9 68 5 14 May
## 15 18 65 13.2 58 5 15 May
## 16 14 334 11.5 64 5 16 May
## 17 34 307 12.0 66 5 17 May
## 18 6 78 18.4 57 5 18 May
## 19 30 322 11.5 68 5 19 May
## 20 11 44 9.7 62 5 20 May
## 21 1 8 9.7 59 5 21 May
## 22 11 320 16.6 73 5 22 May
## 23 4 25 9.7 61 5 23 May
## 24 32 92 12.0 61 5 24 May
## 25 NA 66 16.6 57 5 25 May
## 26 NA 266 14.9 58 5 26 May
## 27 NA NA 8.0 57 5 27 May
## 28 23 13 12.0 67 5 28 May
## 29 45 252 14.9 81 5 29 May
## 30 115 223 5.7 79 5 30 May
## 31 37 279 7.4 76 5 31 May
## 32 NA 286 8.6 78 6 1 June
## 33 NA 287 9.7 74 6 2 June
## 34 NA 242 16.1 67 6 3 June
## 35 NA 186 9.2 84 6 4 June
## 36 NA 220 8.6 85 6 5 June
## 37 NA 264 14.3 79 6 6 June
## 38 29 127 9.7 82 6 7 June
## 39 NA 273 6.9 87 6 8 June
## 40 71 291 13.8 90 6 9 June
## 41 39 323 11.5 87 6 10 June
## 42 NA 259 10.9 93 6 11 June
## 43 NA 250 9.2 92 6 12 June
## 44 23 148 8.0 82 6 13 June
## 45 NA 332 13.8 80 6 14 June
## 46 NA 322 11.5 79 6 15 June
## 47 21 191 14.9 77 6 16 June
## 48 37 284 20.7 72 6 17 June
## 49 20 37 9.2 65 6 18 June
## 50 12 120 11.5 73 6 19 June
## 51 13 137 10.3 76 6 20 June
## 52 NA 150 6.3 77 6 21 June
## 53 NA 59 1.7 76 6 22 June
## 54 NA 91 4.6 76 6 23 June
## 55 NA 250 6.3 76 6 24 June
## 56 NA 135 8.0 75 6 25 June
## 57 NA 127 8.0 78 6 26 June
## 58 NA 47 10.3 73 6 27 June
## 59 NA 98 11.5 80 6 28 June
## 60 NA 31 14.9 77 6 29 June
## 61 NA 138 8.0 83 6 30 June
## 62 135 269 4.1 84 7 1 July
## 63 49 248 9.2 85 7 2 July
## 64 32 236 9.2 81 7 3 July
## 65 NA 101 10.9 84 7 4 July
## 66 64 175 4.6 83 7 5 July
## 67 40 314 10.9 83 7 6 July
## 68 77 276 5.1 88 7 7 July
## 69 97 267 6.3 92 7 8 July
## 70 97 272 5.7 92 7 9 July
## 71 85 175 7.4 89 7 10 July
## 72 NA 139 8.6 82 7 11 July
## 73 10 264 14.3 73 7 12 July
## 74 27 175 14.9 81 7 13 July
## 75 NA 291 14.9 91 7 14 July
## 76 7 48 14.3 80 7 15 July
## 77 48 260 6.9 81 7 16 July
## 78 35 274 10.3 82 7 17 July
## 79 61 285 6.3 84 7 18 July
## 80 79 187 5.1 87 7 19 July
## 81 63 220 11.5 85 7 20 July
## 82 16 7 6.9 74 7 21 July
## 83 NA 258 9.7 81 7 22 July
## 84 NA 295 11.5 82 7 23 July
## 85 80 294 8.6 86 7 24 July
## 86 108 223 8.0 85 7 25 July
## 87 20 81 8.6 82 7 26 July
## 88 52 82 12.0 86 7 27 July
## 89 82 213 7.4 88 7 28 July
## 90 50 275 7.4 86 7 29 July
## 91 64 253 7.4 83 7 30 July
## 92 59 254 9.2 81 7 31 July
## 93 39 83 6.9 81 8 1 August
## 94 9 24 13.8 81 8 2 August
## 95 16 77 7.4 82 8 3 August
## 96 78 NA 6.9 86 8 4 August
## 97 35 NA 7.4 85 8 5 August
## 98 66 NA 4.6 87 8 6 August
## 99 122 255 4.0 89 8 7 August
## 100 89 229 10.3 90 8 8 August
## 101 110 207 8.0 90 8 9 August
## 102 NA 222 8.6 92 8 10 August
## 103 NA 137 11.5 86 8 11 August
## 104 44 192 11.5 86 8 12 August
## 105 28 273 11.5 82 8 13 August
## 106 65 157 9.7 80 8 14 August
## 107 NA 64 11.5 79 8 15 August
## 108 22 71 10.3 77 8 16 August
## 109 59 51 6.3 79 8 17 August
## 110 23 115 7.4 76 8 18 August
## 111 31 244 10.9 78 8 19 August
## 112 44 190 10.3 78 8 20 August
## 113 21 259 15.5 77 8 21 August
## 114 9 36 14.3 72 8 22 August
## 115 NA 255 12.6 75 8 23 August
## 116 45 212 9.7 79 8 24 August
## 117 168 238 3.4 81 8 25 August
## 118 73 215 8.0 86 8 26 August
## 119 NA 153 5.7 88 8 27 August
## 120 76 203 9.7 97 8 28 August
## 121 118 225 2.3 94 8 29 August
## 122 84 237 6.3 96 8 30 August
## 123 85 188 6.3 94 8 31 August
## 124 96 167 6.9 91 9 1 September
## 125 78 197 5.1 92 9 2 September
## 126 73 183 2.8 93 9 3 September
## 127 91 189 4.6 93 9 4 September
## 128 47 95 7.4 87 9 5 September
## 129 32 92 15.5 84 9 6 September
## 130 20 252 10.9 80 9 7 September
## 131 23 220 10.3 78 9 8 September
## 132 21 230 10.9 75 9 9 September
## 133 24 259 9.7 73 9 10 September
## 134 44 236 14.9 81 9 11 September
## 135 21 259 15.5 76 9 12 September
## 136 28 238 6.3 77 9 13 September
## 137 9 24 10.9 71 9 14 September
## 138 13 112 11.5 71 9 15 September
## 139 46 237 6.9 78 9 16 September
## 140 18 224 13.8 67 9 17 September
## 141 13 27 10.3 76 9 18 September
## 142 24 238 10.3 68 9 19 September
## 143 16 201 8.0 82 9 20 September
## 144 13 238 12.6 64 9 21 September
## 145 23 14 9.2 71 9 22 September
## 146 36 139 10.3 81 9 23 September
## 147 7 49 10.3 69 9 24 September
## 148 14 20 16.6 63 9 25 September
## 149 30 193 6.9 70 9 26 September
## 150 NA 145 13.2 77 9 27 September
## 151 14 191 14.3 75 9 28 September
## 152 18 131 8.0 76 9 29 September
## 153 20 223 11.5 68 9 30 September
ggplot(aq, aes(x = month_name, y = Ozone)) +
  geom_boxplot(fill = "red", color = "black", na.rm = TRUE) +
  labs(title = "Ozone Levels by Month",
       x = "Month",
       y = "Ozone") +
  theme_minimal()
Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?
From May to July, ozone concentration increases, reaching its highest median in July. After July, ozone levels decrease in August and September.
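The medians that the boxplot draws as center lines can be checked numerically. This is a sketch using base R. median_by_month and highest are hypothetical names, and month.name is R's built-in vector of English month names.

```r
# Illustrative sketch: median Ozone per month, ignoring missing values.
data("airquality")

# split() groups Ozone by Month; sapply() applies median to each group.
median_by_month <- sapply(split(airquality$Ozone, airquality$Month),
                          median, na.rm = TRUE)

# Replace the numeric month labels (5 to 9) with month names.
names(median_by_month) <- month.name[as.integer(names(median_by_month))]
print(median_by_month)

# Month with the highest median ozone.
highest <- names(which.max(median_by_month))
print(highest)
```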
Produce the scatterplot of Temp vs. Ozone, colored by Month.
# Your code goes here
ggplot(aq, aes(x = Temp, y = Ozone, color = month_name)) +
  geom_point(size = 3, alpha = 0.7, na.rm = TRUE) +
  labs(title = "Temperature vs Ozone",
       x = "Temperature (F)",
       y = "Ozone",
       color = "Month") +
  theme_minimal()
Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.
The relationship is positive: as temperature increases, ozone levels increase, so warmer days tend to have higher ozone concentrations. May and June observations appear at lower temperatures and lower ozone levels. July and August observations appear at higher temperatures and higher ozone levels. September is slightly cooler, and ozone levels start to decrease.
Compute and visualize the correlation matrix for Ozone, Temp, and Wind.
# Your code goes here
aq_clean <- airquality |>
  select(Ozone, Temp, Wind) |>
  na.omit()
aq_clean
## Ozone Temp Wind
## 1 41 67 7.4
## 2 36 72 8.0
## 3 12 74 12.6
## 4 18 62 11.5
## 6 28 66 14.9
## 7 23 65 8.6
## 8 19 59 13.8
## 9 8 61 20.1
## 11 7 74 6.9
## 12 16 69 9.7
## 13 11 66 9.2
## 14 14 68 10.9
## 15 18 58 13.2
## 16 14 64 11.5
## 17 34 66 12.0
## 18 6 57 18.4
## 19 30 68 11.5
## 20 11 62 9.7
## 21 1 59 9.7
## 22 11 73 16.6
## 23 4 61 9.7
## 24 32 61 12.0
## 28 23 67 12.0
## 29 45 81 14.9
## 30 115 79 5.7
## 31 37 76 7.4
## 38 29 82 9.7
## 40 71 90 13.8
## 41 39 87 11.5
## 44 23 82 8.0
## 47 21 77 14.9
## 48 37 72 20.7
## 49 20 65 9.2
## 50 12 73 11.5
## 51 13 76 10.3
## 62 135 84 4.1
## 63 49 85 9.2
## 64 32 81 9.2
## 66 64 83 4.6
## 67 40 83 10.9
## 68 77 88 5.1
## 69 97 92 6.3
## 70 97 92 5.7
## 71 85 89 7.4
## 73 10 73 14.3
## 74 27 81 14.9
## 76 7 80 14.3
## 77 48 81 6.9
## 78 35 82 10.3
## 79 61 84 6.3
## 80 79 87 5.1
## 81 63 85 11.5
## 82 16 74 6.9
## 85 80 86 8.6
## 86 108 85 8.0
## 87 20 82 8.6
## 88 52 86 12.0
## 89 82 88 7.4
## 90 50 86 7.4
## 91 64 83 7.4
## 92 59 81 9.2
## 93 39 81 6.9
## 94 9 81 13.8
## 95 16 82 7.4
## 96 78 86 6.9
## 97 35 85 7.4
## 98 66 87 4.6
## 99 122 89 4.0
## 100 89 90 10.3
## 101 110 90 8.0
## 104 44 86 11.5
## 105 28 82 11.5
## 106 65 80 9.7
## 108 22 77 10.3
## 109 59 79 6.3
## 110 23 76 7.4
## 111 31 78 10.9
## 112 44 78 10.3
## 113 21 77 15.5
## 114 9 72 14.3
## 116 45 79 9.7
## 117 168 81 3.4
## 118 73 86 8.0
## 120 76 97 9.7
## 121 118 94 2.3
## 122 84 96 6.3
## 123 85 94 6.3
## 124 96 91 6.9
## 125 78 92 5.1
## 126 73 93 2.8
## 127 91 93 4.6
## 128 47 87 7.4
## 129 32 84 15.5
## 130 20 80 10.9
## 131 23 78 10.3
## 132 21 75 10.9
## 133 24 73 9.7
## 134 44 81 14.9
## 135 21 76 15.5
## 136 28 77 6.3
## 137 9 71 10.9
## 138 13 71 11.5
## 139 46 78 6.9
## 140 18 67 13.8
## 141 13 76 10.3
## 142 24 68 10.3
## 143 16 82 8.0
## 144 13 64 12.6
## 145 23 71 9.2
## 146 36 81 10.3
## 147 7 69 10.3
## 148 14 63 16.6
## 149 30 70 6.9
## 151 14 75 14.3
## 152 18 76 8.0
## 153 20 68 11.5
cor_matrix <- cor(aq_clean)
cor_matrix
## Ozone Temp Wind
## Ozone 1.0000000 0.6983603 -0.6015465
## Temp 0.6983603 1.0000000 -0.5110750
## Wind -0.6015465 -0.5110750 1.0000000
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black",
         tl.srt = 45)
Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.
Ozone is more strongly correlated with Temp (r = 0.698) than with Wind (r = -0.602). This means that higher temperatures tend to go with higher ozone levels, while windier days tend to have lower ozone levels. The weakest correlation is between Temp and Wind (r = -0.511), a moderate negative relationship.
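The strongest and weakest correlations can also be ranked programmatically rather than read off the matrix. This is a sketch, assuming the same complete-case rows as above. pairs_idx and cor_pairs are illustrative names.

```r
# Illustrative sketch: rank the pairwise correlations by absolute strength.
data("airquality")

# use = "complete.obs" drops rows with any NA in the three columns,
# which matches the na.omit() approach.
cor_matrix <- cor(airquality[, c("Ozone", "Temp", "Wind")],
                  use = "complete.obs")

# Keep each pair once by taking the upper triangle of the matrix.
pairs_idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_matrix)[pairs_idx[, "row"]],
  var2 = colnames(cor_matrix)[pairs_idx[, "col"]],
  r    = cor_matrix[upper.tri(cor_matrix)]
)

# Sort from strongest to weakest by absolute value of r.
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs)
```

The first row of cor_pairs is the strongest relationship and the last row is the weakest. The sign of r gives the direction.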
Generate the summary table grouped by Month. It should include the count, mean ozone, mean temperature, and mean wind speed per month.
# your code goes here
summary_table <- airquality |>
  group_by(Month) |>
  summarize(
    count = n(),
    avg_Ozone = mean(Ozone, na.rm = TRUE),
    avg_Temp = mean(Temp, na.rm = TRUE),
    avg_Wind = mean(Wind, na.rm = TRUE)
  )
summary_table
## # A tibble: 5 × 5
## Month count avg_Ozone avg_Temp avg_Wind
## <int> <int> <dbl> <dbl> <dbl>
## 1 5 31 23.6 65.5 11.6
## 2 6 30 29.4 79.1 10.3
## 3 7 31 59.1 83.9 8.94
## 4 8 31 60.0 84.0 8.79
## 5 9 30 31.4 76.9 10.2
Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?
According to the summary table, August has the highest average ozone level (about 60.0), narrowly ahead of July (59.1). Average temperature rises from May through August and then drops in September, while average wind speed is highest in May and lowest in July and August. Hotter summer temperatures tend to coincide with higher ozone concentrations, while higher wind speeds on cooler days help disperse ozone.
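To pick out the top month without reading the table by eye, the monthly means can be sorted or filtered. This is a sketch using dplyr. The names monthly, monthly_sorted, and top_ozone are hypothetical.

```r
# Illustrative sketch: find the month with the highest mean Ozone.
library(dplyr)
data("airquality")

monthly <- airquality |>
  group_by(Month) |>
  summarize(
    avg_Ozone = mean(Ozone, na.rm = TRUE),
    avg_Temp  = mean(Temp, na.rm = TRUE),
    avg_Wind  = mean(Wind, na.rm = TRUE)
  )

# Sort months from highest to lowest average ozone.
monthly_sorted <- monthly |> arrange(desc(avg_Ozone))
print(monthly_sorted)

# Keep only the single row with the largest average ozone.
top_ozone <- monthly |> slice_max(avg_Ozone, n = 1)
print(top_ozone)
```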
Submission Requirements
Publish your report to RPubs and submit the link on Blackboard.