Objective:
You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.
Dataset
Source: Built-in R dataset airquality.
Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.
Variables (selected for this assignment): Ozone (ppb), Temp (°F), Wind (mph), and Month (5–9).
Notes
- The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.
- If you encounter errors, check that tidyverse and corrplot are installed and loaded.
- Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
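The missing-value note above can be checked directly. The following is a minimal sketch in base R (the variable names `na_counts`, `mean_with_na`, `mean_na_removed`, and `n_complete` are chosen here for illustration) that counts the NA values per column and shows what happens when na.rm is left out:

```r
data("airquality")

# Count missing values in each column
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Without na.rm = TRUE, a single NA makes the whole result NA
mean_with_na <- mean(airquality$Ozone)
mean_na_removed <- mean(airquality$Ozone, na.rm = TRUE)
print(c(with_na = mean_with_na, na_removed = mean_na_removed))

# use = "complete.obs" keeps only rows with no NA in the selected columns
n_complete <- sum(complete.cases(airquality[, c("Ozone", "Temp", "Wind")]))
print(n_complete)
```

Comparing `n_complete` with the 153 rows in the dataset shows how many days the correlation step actually uses.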
Instructions:
Complete the following tasks using R to analyze the airquality dataset. Submit an RPubs link to a report that includes code, outputs (tables and plots), and written interpretations for each task. Load the dataset with data(airquality), and install and load the tidyverse and corrplot packages.
#Load your dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data("airquality")
Tasks and Questions
Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.
#Your code for Ozone goes here
mean(airquality$Ozone, na.rm = TRUE)
## [1] 42.12931
median(airquality$Ozone, na.rm = TRUE)
## [1] 31.5
sd(airquality$Ozone, na.rm = TRUE)
## [1] 32.98788
min(airquality$Ozone, na.rm = TRUE)
## [1] 1
max(airquality$Ozone, na.rm = TRUE)
## [1] 168
#Your code for Temp goes here
mean(airquality$Temp, na.rm = TRUE)
## [1] 77.88235
median(airquality$Temp, na.rm = TRUE)
## [1] 79
sd(airquality$Temp, na.rm = TRUE)
## [1] 9.46527
min(airquality$Temp, na.rm = TRUE)
## [1] 56
max(airquality$Temp, na.rm = TRUE)
## [1] 97
#Your code for Wind goes here
mean(airquality$Wind, na.rm = TRUE)
## [1] 9.957516
median(airquality$Wind, na.rm = TRUE)
## [1] 9.7
sd(airquality$Wind, na.rm = TRUE)
## [1] 3.523001
min(airquality$Wind, na.rm = TRUE)
## [1] 1.7
max(airquality$Wind, na.rm = TRUE)
## [1] 20.7
Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?
The variables show different degrees of symmetry. For Ozone, the mean (42.13) is considerably higher than the median (31.5), which indicates a right-skewed (positively skewed) distribution in which a few very high-pollution days pull the average upward. In contrast, Temp has a mean (77.88) slightly below its median (79), suggesting a roughly symmetric distribution with, at most, a mild left skew. Wind is also fairly symmetric, with a mean of 9.96 and a median of 9.7.
Standard deviation measures how far values typically fall from the mean. Because the three variables are measured in different units (ppb, °F, and mph), their standard deviations are best judged relative to each variable's mean. Ozone's standard deviation (32.99) is about 78% of its mean, so ozone levels vary widely across the season. Wind's (3.52) is about 35% of its mean, a moderate spread. Temp's (9.47) is only about 12% of its mean, so daily temperatures cluster most tightly around their average.
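Standard deviations in different units (ppb, °F, mph) are hard to compare directly. One common workaround, sketched here as an illustration rather than as part of the assignment, is the coefficient of variation (SD divided by the mean). The names `vars`, `spread_summary`, and `cv` are assumptions introduced for this example:

```r
data("airquality")
vars <- c("Ozone", "Temp", "Wind")

# One row per variable: mean, median, SD, and coefficient of variation
spread_summary <- t(sapply(airquality[vars], function(x) {
  x <- x[!is.na(x)]  # drop missing values before summarizing
  c(mean = mean(x), median = median(x), sd = sd(x), cv = sd(x) / mean(x))
}))
print(round(spread_summary, 2))
```

A larger `cv` means more spread relative to the variable's typical level, which makes the three rows comparable even though their units differ.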
Generate the histogram for Ozone.
#Your code goes here
# Generate the histogram for ozone
hist(airquality$Ozone,
     main = "Distribution of Ozone Levels",
     xlab = "Ozone (ppb)",
     col = "skyblue",
     border = "white")
Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?
The Ozone distribution is unimodal (one clear peak) and strongly right-skewed. Most of the data is concentrated between 0 and 50 ppb, with a long tail stretching toward higher values. A few days sit far out in that tail, up to the maximum of 168 ppb; these potential outliers likely reflect unusual weather or pollution episodes well outside the typical daily range.
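A histogram only suggests outliers visually. As a hedged cross-check, the sketch below applies the usual 1.5 × IQR boxplot rule through base R's boxplot.stats() and computes the share of days at or below 50 ppb. The names `ozone`, `flagged`, and `share_low` are illustrative:

```r
data("airquality")
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# boxplot.stats() flags values more than 1.5 * IQR beyond the hinges
flagged <- boxplot.stats(ozone)$out
print(sort(flagged))

# Share of measured days at or below 50 ppb, where the main hump sits
share_low <- mean(ozone <= 50)
print(round(share_low, 2))
```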
Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.
# Your code here
# Recode the Month variable into names
airquality_clean <- airquality %>%
  mutate(month_name = case_when(
    Month == 5 ~ "May",
    Month == 6 ~ "June",
    Month == 7 ~ "July",
    Month == 8 ~ "August",
    Month == 9 ~ "September"
  ))
# Convert to factor to keep them in calendar order rather than alphabetical
airquality_clean$month_name <- factor(airquality_clean$month_name,
                                      levels = c("May", "June", "July", "August", "September"))
# Generate the boxplot
boxplot(Ozone ~ month_name, data = airquality_clean,
        main = "Ozone Levels by Month",
        xlab = "Month",
        ylab = "Ozone (ppb)",
        col = "orange")
Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?
The boxplot reveals that ozone levels are not consistent across the months, showing a clear seasonal increase that peaks in July, which has the highest median ozone level. July and August also exhibit the most variability, evidenced by the taller boxes and longer whiskers, indicating more volatile fluctuations during the peak of summer compared to the more stable, lower levels in May and September. The individual points above the whiskers represent outliers: specific days with unusually high pollution levels that fall outside the typical range for those months.
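The medians and box heights read off the plot can be confirmed numerically. This is a minimal base R sketch that uses the numeric Month column (5 to 9) rather than month_name so that it runs on its own; `monthly_median` and `monthly_iqr` are names introduced here:

```r
data("airquality")

# Median and interquartile range of Ozone within each month, ignoring NAs
monthly_median <- tapply(airquality$Ozone, airquality$Month, median, na.rm = TRUE)
monthly_iqr <- tapply(airquality$Ozone, airquality$Month, IQR, na.rm = TRUE)

# Columns are months 5 (May) through 9 (September)
print(rbind(median = monthly_median, iqr = monthly_iqr))
```

The median row corresponds to the thick line in each box, and the IQR row to the height of the box.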
Produce the scatterplot of Temp vs. Ozone, colored by Month.
# Your code goes here
library(tidyverse)
# Create scatterplot of Temp vs Ozone colored by Month
ggplot(airquality_clean, aes(x = Temp, y = Ozone, color = month_name)) +
  geom_point(size = 2) +
  labs(title = "Relationship between Temperature and Ozone",
       x = "Temperature (°F)",
       y = "Ozone (ppb)",
       color = "Month") +
  theme_minimal()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).
Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.
There is a clear positive relationship between temperature and ozone levels: as temperature increases, ozone tends to rise as well. The color coding shows that July and August (the warmest months) cluster in the top-right of the plot, with the highest ozone and temperature values. In contrast, May and September tend to cluster in the bottom-left, where both temperature and ozone are lower. This pattern is consistent with heat being a major driver of higher ozone concentrations.
#### Task 5: Correlation Matrix
Compute and visualize the correlation matrix for Ozone, Temp, and Wind.
# Your code goes here
# 1. Create a subset of the data with just the three variables
air_subset <- airquality %>%
  select(Ozone, Temp, Wind)
# 2. Compute the correlation matrix
cor_matrix <- cor(air_subset, use = "complete.obs")
print(cor_matrix)
##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000
# 3. Visualize the correlation matrix
corrplot(cor_matrix, method = "color", addCoef.col = "black",
         type = "upper", diag = FALSE)
Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.
The strongest correlation is between Ozone and Temp (r = 0.70): hotter days tend to have higher ozone. Ozone and Wind show a moderately strong negative correlation (r = -0.60), consistent with higher wind speeds dispersing pollutants. The weakest of the three is between Temp and Wind (r = -0.51). Ozone is therefore more strongly correlated with temperature than with wind speed. These values show that heat and air movement are closely associated with air quality, although correlation alone does not establish that they cause the changes.
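A correlation matrix reports strength but not statistical significance. As a hedged illustration, the sketch below runs base R's cor.test() on each pair, using the same complete rows as the matrix above. The names `cc`, `pairs_to_test`, and `results` are introduced for this example:

```r
data("airquality")

# Keep only rows with no missing Ozone, Temp, or Wind (mirrors "complete.obs")
cc <- na.omit(airquality[, c("Ozone", "Temp", "Wind")])

pairs_to_test <- list(c("Ozone", "Temp"), c("Ozone", "Wind"), c("Temp", "Wind"))

# Pearson correlation and two-sided p-value for each pair
results <- do.call(rbind, lapply(pairs_to_test, function(p) {
  test <- cor.test(cc[[p[1]]], cc[[p[2]]])
  data.frame(pair = paste(p, collapse = "-"),
             r = unname(test$estimate),
             p_value = test$p.value)
}))
print(results)
```

A small p-value indicates that a correlation this large would be unlikely if the two variables were unrelated; it still says nothing about which variable drives the other.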
Generate a summary table grouped by month. It should include the count of observations and the mean Ozone, Temp, and Wind for each month.
# your code goes here
# Generate the summary table
summary_table <- airquality_clean %>%
  group_by(month_name) %>%
  summarize(
    count = n(),
    avg_ozone = mean(Ozone, na.rm = TRUE),
    avg_temp = mean(Temp, na.rm = TRUE),
    avg_wind = mean(Wind, na.rm = TRUE)
  )
# View the table
print(summary_table)
## # A tibble: 5 × 5
##   month_name count avg_ozone avg_temp avg_wind
##   <fct>      <int>     <dbl>    <dbl>    <dbl>
## 1 May           31      23.6     65.5    11.6
## 2 June          30      29.4     79.1    10.3
## 3 July          31      59.1     83.9     8.94
## 4 August        31      60.0     84.0     8.79
## 5 September     30      31.4     76.9    10.2
Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?
The summary table shows that August has the highest average ozone level (60.0 ppb), narrowly ahead of July (59.1 ppb); these are also the two warmest months (84.0 °F and 83.9 °F). This pattern is consistent with the strong positive correlation (0.70) between temperature and ozone, as ground-level ozone forms more readily through chemical reactions driven by intense sunlight and heat. Conversely, wind speed shows a negative correlation (-0.60) with ozone and is lowest in July and August (8.94 and 8.79 mph), suggesting that calmer mid-summer winds allow pollutants to stagnate and accumulate, while the higher wind speeds in May (11.6 mph) help disperse them. Together, high temperatures and calm air in July and August create the conditions most favorable to high ozone concentrations.
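Because Ozone is right-skewed, the month with the highest mean need not be the month with the highest median. The following base R sketch puts the two side by side; `monthly_mean`, `monthly_median`, `month_labels`, and `comparison` are names chosen for this illustration:

```r
data("airquality")

# Mean and median Ozone per month, ignoring missing values
monthly_mean <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
monthly_median <- tapply(airquality$Ozone, airquality$Month, median, na.rm = TRUE)

# Translate month numbers (5-9) into names using the built-in month.name vector
month_labels <- month.name[as.integer(names(monthly_mean))]

comparison <- data.frame(
  month = month_labels,
  mean_ozone = round(as.numeric(monthly_mean), 1),
  median_ozone = as.numeric(monthly_median)
)
print(comparison)

# Month with the highest mean versus the highest median
print(month_labels[which.max(monthly_mean)])
print(month_labels[which.max(monthly_median)])
```

When the two answers differ, it is worth stating which statistic a claim about the "highest" month refers to.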
Submission Requirements
Publish your report to RPubs and submit the link on Blackboard.