Objective:
You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.
Dataset
Source: Built-in R dataset airquality.
Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.
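Variables (selected for this assignment):
-Ozone: ozone concentration (ppb)
-Temp: daily maximum temperature (°F)
-Wind: wind speed (mph)
-Month: month of observation (5–9, May to September)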
Notes
-The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.
-If you encounter errors, check that tidyverse and corrplot are installed and loaded.
-Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
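As a quick check of the first note, the sketch below counts the missing values in each column and shows what happens when na.rm is left out. It uses only base R and is an illustration, not part of the required tasks.

```r
# Count missing values in each column of the built-in dataset
data("airquality")
na_counts <- colSums(is.na(airquality))
na_counts

# Without na.rm = TRUE, a single NA makes the whole result NA
mean_with_na <- mean(airquality$Ozone)

# With na.rm = TRUE, the NAs are dropped before averaging
mean_without_na <- mean(airquality$Ozone, na.rm = TRUE)

c(with_na = mean_with_na, without_na = mean_without_na)
```

Columns with a count of zero need no special handling, so na.rm = TRUE is harmless but unnecessary for them.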
Instructions:
Complete the following tasks using R to analyze the airquality dataset. Submit an RPubs link that includes your code, outputs (tables and plots), and written interpretations for each task. Load the dataset with data(airquality), and install and load the tidyverse and corrplot packages.
#Load your dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data("airquality")
Tasks and Questions
Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.
#Your code for Ozone goes here
# Column names use the letter O (Ozone), not the digit 0
summary_stats <- airquality |>
  summarise(
    mean_Ozone = mean(Ozone, na.rm = TRUE),
    median_Ozone = median(Ozone, na.rm = TRUE),
    sd_Ozone = sd(Ozone, na.rm = TRUE),
    min_Ozone = min(Ozone, na.rm = TRUE),
    max_Ozone = max(Ozone, na.rm = TRUE)
  )
summary_stats
##   mean_Ozone median_Ozone sd_Ozone min_Ozone max_Ozone
## 1   42.12931         31.5 32.98788         1       168
Mean vs Median: The mean Ozone level is 42.13 ppb, and the median is 31.5 ppb. These values are noticeably different, with the mean being higher than the median. This suggests that the distribution of ozone concentration is right-skewed, meaning there are some days with unusually high ozone levels that pull the mean upward.
Standard Deviation: The standard deviation is approximately 32.99 ppb, which is large relative to the mean of 42.13 ppb (about 78% of it). Ozone concentrations therefore vary substantially from day to day.
Range (Min–Max): Ozone levels range from 1 to 168 ppb, so the dataset covers a wide range of air quality conditions. With the median at 31.5 ppb, half of the measured days fall at or below that value, while a few days reach very high levels.
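One way to back up the skewness reading is to compare how far the quartiles sit from the median. The sketch below is a rough check using base R's quantile(); the variable names lower_gap and upper_gap are introduced here for illustration only.

```r
# Compare the distance from the median to each quartile for Ozone
data("airquality")
q <- quantile(airquality$Ozone, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

lower_gap <- q[["50%"]] - q[["25%"]]  # median minus first quartile
upper_gap <- q[["75%"]] - q[["50%"]]  # third quartile minus median

c(lower_gap = lower_gap, upper_gap = upper_gap)
```

If upper_gap is clearly larger than lower_gap, the upper half of the data is more stretched out than the lower half, which is consistent with a right-skewed distribution.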
#Your code for Temp goes here
summary_stats <- airquality |>
  summarise(
    mean_Temp = mean(Temp, na.rm = TRUE),
    median_Temp = median(Temp, na.rm = TRUE),
    sd_Temp = sd(Temp, na.rm = TRUE),
    min_Temp = min(Temp, na.rm = TRUE),
    max_Temp = max(Temp, na.rm = TRUE)
  )
summary_stats
##   mean_Temp median_Temp sd_Temp min_Temp max_Temp
## 1  77.88235          79 9.46527       56       97
Mean vs Median: The mean temperature is 77.88°F, and the median is 79°F. These values are very close, which suggests that the temperature distribution is fairly symmetric and not strongly skewed.
Standard Deviation: The standard deviation is approximately 9.47°F. If temperatures are roughly bell-shaped, about two-thirds of daily maximum temperatures should fall within ±9.5°F of the mean (roughly 68°F to 87°F). This indicates a moderate spread in temperature values, with no extreme variability.
Range (Min–Max): Temperatures range from 56°F to 97°F, showing that the dataset captures a wide range of summer temperatures in New York. The clustering around the mean and median suggests that most days were warm to hot, typical of the May–September period.
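The "within about one standard deviation of the mean" reading can be checked directly rather than assumed. The sketch below computes the share of days whose temperature lies within one standard deviation of the mean; it is a simple illustration in base R, and the name within_one_sd is introduced here.

```r
# Share of days with Temp within one standard deviation of the mean
data("airquality")
temp_mean <- mean(airquality$Temp)
temp_sd <- sd(airquality$Temp)

# TRUE/FALSE per day, averaged to give a proportion
within_one_sd <- mean(abs(airquality$Temp - temp_mean) <= temp_sd)
within_one_sd
```

A value near 0.68 would match what a normal distribution predicts; a noticeably different value suggests the shape departs from normal.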
#Your code for Wind goes here
summary_stats <- airquality |>
  summarise(
    mean_Wind = mean(Wind, na.rm = TRUE),
    median_Wind = median(Wind, na.rm = TRUE),
    sd_Wind = sd(Wind, na.rm = TRUE),
    min_Wind = min(Wind, na.rm = TRUE),
    max_Wind = max(Wind, na.rm = TRUE)
  )
summary_stats
##   mean_Wind median_Wind  sd_Wind min_Wind max_Wind
## 1  9.957516         9.7 3.523001      1.7     20.7
Mean vs Median: The mean wind speed is 9.96 mph, and the median is 9.7 mph. These values are very close, suggesting that the distribution of wind speeds is fairly symmetric and not significantly skewed.
Standard Deviation: The standard deviation is approximately 3.52 mph, indicating moderate variability in daily wind speeds. If the distribution is roughly bell-shaped, about two-thirds of days should have wind speeds within ±3.5 mph of the mean (roughly 6.4 to 13.5 mph).
Range (Min–Max): Wind speeds range from 1.7 to 20.7 mph, showing that the dataset includes both calm and breezy days. The clustering around the mean and median suggests that most days experienced moderate wind conditions.
Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?
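To compare the three variables side by side, the same five statistics can also be computed in one call. The sketch below uses dplyr's across() as an alternative to the three separate blocks above; it is not the form the task asks for, and the object name all_stats is introduced here.

```r
library(dplyr)
data("airquality")

# One summarise() call covering Ozone, Temp, and Wind
all_stats <- airquality |>
  summarise(across(
    c(Ozone, Temp, Wind),
    list(
      mean   = ~ mean(.x, na.rm = TRUE),
      median = ~ median(.x, na.rm = TRUE),
      sd     = ~ sd(.x, na.rm = TRUE),
      min    = ~ min(.x, na.rm = TRUE),
      max    = ~ max(.x, na.rm = TRUE)
    ),
    .names = "{.fn}_{.col}"  # e.g. mean_Ozone, sd_Wind
  ))
all_stats
```

The result is a single row with 15 columns, which makes it easy to see that the mean and median differ most for Ozone.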
Generate the histogram for Ozone.
#Your code goes here
library(ggplot2)
ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 10, fill = "#ff7f0e", color = "black") +
  labs(title = "Histogram of Ozone levels",
       x = "Ozone levels (ppb)",
       y = "Count") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).
Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?
The histogram of ozone concentration appears right-skewed (a few days have very high ozone values). Most observations are clustered around 20–60 ppb, which is typical for the summer air quality in New York. There are a few outliers on the higher end, with ozone levels reaching up to 168 ppb. These extreme values likely represent days with unusual atmospheric conditions rather than errors, but they do pull the mean upward compared to the median.
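The outliers mentioned above can be listed explicitly. The sketch below applies the common 1.5 × IQR rule, which is an assumption about how "outlier" is defined here and not something the assignment specifies; the names upper_fence and high_outliers are illustrative.

```r
# Flag unusually high Ozone values with the 1.5 * IQR rule
data("airquality")
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

q1 <- quantile(ozone, 0.25, names = FALSE)
q3 <- quantile(ozone, 0.75, names = FALSE)
upper_fence <- q3 + 1.5 * (q3 - q1)

# Values above the fence, sorted from lowest to highest
high_outliers <- sort(ozone[ozone > upper_fence])
upper_fence
high_outliers
```

Other outlier definitions (for example, a z-score cutoff) would flag a different set of days, so the list should be read as one reasonable convention.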
Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.
# Your code here
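airquality_named <- airquality %>%
  mutate(
    month_name = case_when(
      Month == 5 ~ "May",
      Month == 6 ~ "June",
      Month == 7 ~ "July",
      Month == 8 ~ "August",
      Month == 9 ~ "September"
    ),
    # Convert to a factor so the x-axis runs May to September
    # instead of alphabetically (August, July, June, ...)
    month_name = factor(
      month_name,
      levels = c("May", "June", "July", "August", "September")
    )
  )
ggplot(airquality_named, aes(x = month_name, y = Ozone)) +
  geom_boxplot(fill = "#2ca02c") +
  labs(
    title = "Boxplot of Ozone Levels by Month",
    x = "Month",
    y = "Ozone (ppb)"
  ) +
  theme_minimal()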
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?
Variation Across Months: Ozone is lowest in May and June, rises sharply in July and August, and falls again in September. The July and August boxes are also taller, so day-to-day variability is greatest in mid-summer.
Highest Median: July has the highest median ozone (about 60 ppb), with August next (about 52 ppb). This differs slightly from the ranking by mean, where the monthly summary table later in this report puts August (60.0 ppb) marginally ahead of July (59.1 ppb).
Outliers: Points above the upper whiskers mark individual days with unusually high ozone, up to the dataset maximum of 168 ppb. These most likely reflect conditions that favor ozone formation (e.g., hot, sunny, stagnant air) rather than measurement errors.
Caveat: 37 days have no ozone reading and are dropped from the plot, so some months are summarized from fewer observations than others.
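The boxplot question asks about medians by month, which can also be read off a small table. The sketch below is one way to compute them; it uses base R's month.name constant in place of case_when purely for brevity, and the object name monthly_ozone is introduced here.

```r
library(dplyr)
data("airquality")

# Median and IQR of Ozone for each month, in calendar order
monthly_ozone <- airquality |>
  mutate(month_name = factor(month.name[Month], levels = month.name[5:9])) |>
  group_by(month_name) |>
  summarise(
    median_ozone = median(Ozone, na.rm = TRUE),
    iqr_ozone = IQR(Ozone, na.rm = TRUE)
  )
monthly_ozone
```

The median_ozone column corresponds to the thick line inside each box, and iqr_ozone to the height of the box.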
Produce the scatterplot of Temp vs. Ozone, colored by Month.
# Your code goes here
ggplot(airquality, aes(x = Temp, y = Ozone, color = factor(Month))) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Scatterplot of Temperature vs. Ozone",
    x = "Temperature (°F)",
    y = "Ozone (ppb)",
    color = "Month"
  ) +
  scale_color_manual(values = c(
    "5" = "#1f77b4", # May - blue
    "6" = "#ff7f0e", # June - orange
    "7" = "#2ca02c", # July - green
    "8" = "#d62728", # August - red
    "9" = "#9467bd"  # September - purple
  )) +
  theme_minimal()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).
Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.
The scatterplot suggests that higher temperatures are generally associated with higher ozone levels. As temperature increases, ozone concentrations tend to rise, indicating a positive relationship between these two variables. Certain months appear to cluster together: July and August (the warmest months) show many points in the upper range of both temperature and ozone, suggesting that hotter summer days often coincide with elevated ozone levels. May clusters at the lowest temperatures and ozone concentrations, reflecting cooler late-spring conditions. June and September are milder (monthly mean temperatures in the high 70s °F) and mostly show low to moderate ozone. Overall, the pattern indicates that ozone pollution tends to be worse on hotter days, which aligns with known atmospheric chemistry, where sunlight and heat accelerate ozone formation.
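The visual impression of rising ozone with temperature can be summarized numerically by grouping days into temperature bands. In the sketch below, the band edges (50, 70, 80, 90, 100°F) are arbitrary choices made for illustration, and temp_bands is a name introduced here.

```r
library(dplyr)
data("airquality")

# Median Ozone within each temperature band (days with missing Ozone dropped)
temp_bands <- airquality |>
  filter(!is.na(Ozone)) |>
  mutate(temp_band = cut(Temp, breaks = c(50, 70, 80, 90, 100))) |>
  group_by(temp_band) |>
  summarise(
    days = n(),
    median_ozone = median(Ozone)
  )
temp_bands
```

If median_ozone climbs from the coolest band to the hottest, that supports the positive relationship described above; different band edges would change the exact numbers but should not reverse the pattern.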
Compute and visualize the correlation matrix for Ozone, Temp, and Wind.
# Your code goes here
cor_matrix <- cor(airquality[, c("Ozone", "Temp", "Wind")], use = "complete.obs")
corrplot(cor_matrix,
         method = "color",
         type = "upper",
         order = "hclust",
         tl.col = "black",
         tl.srt = 45,
         addCoef.col = "black")
Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.
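Ozone vs. Temp (r ≈ 0.70): Strong positive correlation, and the strongest of the three pairs. Higher temperatures are associated with higher ozone levels, which makes sense because ozone formation increases on hot, sunny days.
Ozone vs. Wind (r ≈ -0.60): Moderately strong negative correlation. Higher wind speeds tend to disperse pollutants, leading to lower ozone concentrations.
Temp vs. Wind (r ≈ -0.5): Moderate negative correlation, and the weakest of the three pairs. Windy days tend to be cooler, which explains this relationship.
Overall, ozone is more strongly correlated with temperature than with wind speed. These values describe linear association only; they do not by themselves show that one variable causes another.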
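Rather than reading the strongest and weakest pairs off the plot, the correlation matrix can be reshaped into a ranked table. The sketch below is one base R approach; cor_table and its column names are introduced here for illustration.

```r
data("airquality")
vars <- airquality[, c("Ozone", "Temp", "Wind")]
cor_matrix <- cor(vars, use = "complete.obs")

# Keep each pair once by blanking the diagonal and the lower triangle
upper_only <- cor_matrix
upper_only[lower.tri(upper_only, diag = TRUE)] <- NA

# Reshape to one row per pair and sort by absolute correlation
cor_table <- as.data.frame(as.table(upper_only))
names(cor_table) <- c("var1", "var2", "r")
cor_table <- cor_table[!is.na(cor_table$r), ]
cor_table <- cor_table[order(-abs(cor_table$r)), ]
cor_table
```

The first row is the strongest relationship and the last row the weakest, regardless of sign.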
Generate the summary table grouped by Month. It should include the count, mean ozone, mean temperature, and mean wind speed for each month.
# your code goes here
summary_table <- airquality %>%
  group_by(Month) %>%
  summarise(
    Count = n(),
    Avg_Ozone = mean(Ozone, na.rm = TRUE),
    Avg_Temp = mean(Temp, na.rm = TRUE),
    Avg_Wind = mean(Wind, na.rm = TRUE)
  )
summary_table
## # A tibble: 5 × 5
##   Month Count Avg_Ozone Avg_Temp Avg_Wind
##   <int> <int>     <dbl>    <dbl>    <dbl>
## 1     5    31      23.6     65.5    11.6
## 2     6    30      29.4     79.1    10.3
## 3     7    31      59.1     83.9     8.94
## 4     8    31      60.0     84.0     8.79
## 5     9    30      31.4     76.9    10.2
Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?
Highest Average Ozone: August has the highest average ozone level (60.0 ppb), with July essentially tied (59.1 ppb). This suggests that mid-summer conditions favor ozone formation.
Temperature Variation: Temperatures peak in July and August (≈84°F), which aligns with the highest ozone levels. Hot, sunny weather accelerates photochemical reactions that produce ozone.
Wind Speed Variation: Wind speeds are lowest in July and August (≈8.9 and 8.8 mph, respectively). Lower wind speeds mean less dispersion of pollutants, allowing ozone to accumulate.
Environmental Explanation: High ozone in July and August is likely due to hot temperatures, strong sunlight, and stagnant air conditions, which are ideal for ozone formation. May and September have cooler temperatures and higher wind speeds, reducing ozone levels.
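One caution when reading the monthly ozone averages: n() counts every day in the month, including days with no ozone reading, so it does not show how many values each average is based on. The sketch below adds that information; the column names ozone_days and ozone_missing are introduced here.

```r
library(dplyr)
data("airquality")

# Days per month versus days with a recorded Ozone value
ozone_coverage <- airquality |>
  group_by(Month) |>
  summarise(
    days = n(),
    ozone_days = sum(!is.na(Ozone)),
    ozone_missing = sum(is.na(Ozone))
  )
ozone_coverage
```

Months with many missing readings have less reliable averages, which is worth keeping in mind when comparing them.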
Submission Requirements
Publish your report to RPubs and submit the link on Blackboard.