Objective:
You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.
Dataset
Source: Built-in R dataset airquality.
Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.
Variables (selected for this assignment):
Notes
- The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.
- If you encounter errors, check that tidyverse and corrplot are installed and loaded.
- Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
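Before relying on na.rm = TRUE, it can help to see how much data is actually missing. The following is a minimal base-R sketch; the name na_counts is introduced here only for illustration.

```r
# Load the built-in dataset and count missing values in each column
data("airquality")
na_counts <- colSums(is.na(airquality))
na_counts

# Number of rows that are complete for the three variables analyzed below
sum(complete.cases(airquality[, c("Ozone", "Temp", "Wind")]))
```

Columns with a count of zero are unaffected by na.rm = TRUE; for the others, every statistic is computed on fewer rows than the dataset contains.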
Instructions:
Complete the following tasks using R to analyze the airquality dataset. Submit your RPubs link that includes code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.
#Load your dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data("airquality")
Tasks and Questions
Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.
#Your code for Ozone goes here
summary_stats <- airquality |>
  summarise(
    mean_ozone = mean(Ozone, na.rm = TRUE),
    median_ozone = median(Ozone, na.rm = TRUE),
    sd_ozone = sd(Ozone, na.rm = TRUE),
    min_ozone = min(Ozone, na.rm = TRUE),
    max_ozone = max(Ozone, na.rm = TRUE))
summary_stats
##   mean_ozone median_ozone sd_ozone min_ozone max_ozone
## 1   42.12931         31.5 32.98788         1       168
#Your code for Temp goes here
summary_stats <- airquality |>
  summarise(
    mean_temp = mean(Temp, na.rm = TRUE),
    median_temp = median(Temp, na.rm = TRUE),
    sd_temp = sd(Temp, na.rm = TRUE),
    min_temp = min(Temp, na.rm = TRUE),
    max_temp = max(Temp, na.rm = TRUE))
summary_stats
##   mean_temp median_temp sd_temp min_temp max_temp
## 1  77.88235          79 9.46527       56       97
#Your code for Wind goes here
summary_stats <- airquality |>
  summarise(
    mean_wind = mean(Wind, na.rm = TRUE),
    median_wind = median(Wind, na.rm = TRUE),
    sd_wind = sd(Wind, na.rm = TRUE),
    min_wind = min(Wind, na.rm = TRUE),
    max_wind = max(Wind, na.rm = TRUE))
summary_stats
##   mean_wind median_wind  sd_wind min_wind max_wind
## 1  9.957516         9.7 3.523001      1.7     20.7
Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?
Ozone: The mean (42.1) is well above the median (31.5), which suggests a right-skewed distribution: most ozone values are relatively low, and a few high values pull the mean upward. The standard deviation (33.0) is large relative to the mean, indicating high variability in ozone levels.
Temp: The mean (77.9) and median (79) are close, which suggests a roughly symmetric distribution without many extreme values. The standard deviation (9.5) indicates moderate variability.
Wind: The mean (9.96) and median (9.7) are almost the same, which indicates an approximately symmetric distribution. The standard deviation (3.5) shows moderate variability: wind speeds vary somewhat but not dramatically.
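The mean-versus-median comparison can be backed up with a numeric skewness estimate. The sketch below uses the moment-based definition (third central moment divided by the cubed standard deviation); the helper name skewness is defined here for illustration and is not a base R function.

```r
data("airquality")

# Moment-based sample skewness: positive values indicate a longer right tail
skewness <- function(x) {
  x <- x[!is.na(x)]                       # drop missing values first
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Apply the helper to the three variables of interest
skew_values <- sapply(airquality[, c("Ozone", "Temp", "Wind")], skewness)
round(skew_values, 2)
```

A value near zero is consistent with a roughly symmetric distribution, while a clearly positive value is consistent with the right skew inferred from the mean exceeding the median.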
Generate the histogram for Ozone.
#Your code goes here
library(ggplot2)
ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 20, fill = "#1f77b4", color = "black") +
  labs(title = "Histogram of Ozone Levels", x = "Ozone", y = "Count") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).
Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?
The distribution is unimodal and right-skewed: most days fall in the lower bins, and a long tail extends toward high ozone values. This matches the summary statistics, where the mean is greater than the median. The large standard deviation also reflects this spread, with a few unusually high values (the maximum is 168) that may be outliers.
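To check the visual impression of high outliers, one option is the 1.5 × IQR rule that boxplots also use. This is a rough sketch in base R; the names ozone, upper_fence, and high_ozone are illustrative, and quantile() may place the fence slightly differently from the hinges a boxplot uses.

```r
data("airquality")
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Upper fence of the 1.5 * IQR rule
q <- quantile(ozone, c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])

# Observations above the fence are candidate outliers
high_ozone <- sort(ozone[ozone > upper_fence])
high_ozone
```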
Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.
# Your code here
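airquality <- airquality %>%
  mutate(
    month_name = case_when(
      Month == 5 ~ "May",
      Month == 6 ~ "June",
      Month == 7 ~ "July",
      Month == 8 ~ "August",
      Month == 9 ~ "September"
    ),
    # Order the levels chronologically; a character column would be sorted alphabetically on the axis
    month_name = factor(month_name,
                        levels = c("May", "June", "July", "August", "September"))
  )
ggplot(airquality, aes(x = month_name, y = Ozone)) +
  geom_boxplot(fill = "pink", color = "purple") +
  labs(
    title = "Ozone Levels by Month",
    x = "Month",
    y = "Ozone (ppb)"
  ) +
  theme_minimal()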
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?
Ozone levels rise from May to July and then decline by September. July has the highest median ozone. September shows the most outliers, and May, June, and August also show some. Summer outliers likely reflect days of intense heat and sunlight, while September outliers may reflect isolated warm spells or local pollution episodes.
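The readings taken from the boxplot can be cross-checked numerically. The sketch below uses base R only; median_by_month and outliers_by_month are illustrative names, and boxplot.stats() applies the same default 1.5 × IQR rule as boxplot(), which may flag slightly different points than the ggplot2 version.

```r
data("airquality")

# Median ozone per month, ignoring missing values
median_by_month <- tapply(airquality$Ozone, airquality$Month,
                          median, na.rm = TRUE)

# Number of points per month flagged as outliers by the boxplot rule
outliers_by_month <- tapply(
  airquality$Ozone, airquality$Month,
  function(x) length(boxplot.stats(x[!is.na(x)])$out)
)

# Months (5 to 9) appear as column names
rbind(median = median_by_month, n_outliers = outliers_by_month)
```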
Produce the scatterplot of Temp vs. Ozone, colored by Month.
# Your code goes here
# month_name was already added to airquality in the boxplot task, so it is reused here
ggplot(airquality, aes(x = Temp, y = Ozone, color = month_name)) +
  geom_point(size = 2, alpha = 0.7) +
  labs(
    title = "Scatterplot of Ozone vs Temperature by Month",
    x = "Temperature (°F)",
    y = "Ozone (ppb)",
    color = "Month"
  ) +
  theme_minimal()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).
Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.
Yes. There is a positive relationship between temperature and ozone: as temperature increases, ozone tends to increase as well. Points from cooler months such as May cluster toward the lower left of the plot, at low temperatures and low ozone levels. Points from warmer months such as July and August cluster toward the upper middle and right of the plot, where the higher ozone levels are located.
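One way to put a number on the trend seen in the scatterplot is a simple linear regression. This is an illustrative sketch, not part of the assignment's required output; it assumes a straight-line relationship, which the plot only roughly supports.

```r
data("airquality")

# Regress ozone on temperature; lm() drops rows with missing values by default
fit <- lm(Ozone ~ Temp, data = airquality)

# A positive Temp coefficient means ozone tends to rise with temperature
coef(fit)

# Share of ozone variance explained by temperature alone
summary(fit)$r.squared
```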
Compute and visualize the correlation matrix for Ozone, Temp, and Wind.
# Your code goes here
cor_matrix <- cor(
  airquality |>
    select(Ozone, Temp, Wind),
  use = "complete.obs")
cor_matrix
##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45, addCoef.col = "black",
         title = "Correlation Matrix of Numeric Variables",
         mar = c(0, 0, 2, 0))  # extra top margin so the title is not clipped
Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.
By absolute value, the strongest correlation is between Ozone and Temp (r ≈ 0.70). It is positive, suggesting that higher temperatures tend to coincide with higher ozone levels. Ozone is therefore more strongly correlated with temperature than with wind speed (r ≈ -0.60). The Ozone and Wind correlation is negative, suggesting that higher wind speeds tend to be associated with lower ozone levels, possibly because wind disperses ozone and prevents accumulation. The weakest correlation is between Temp and Wind (r ≈ -0.51), which is also negative: windier days tend to be cooler.
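Ranking the correlations by absolute value makes "strongest" and "weakest" unambiguous when signs differ. The following base-R sketch rebuilds the same matrix and lists each pair once; cor_pairs is a name chosen here for illustration.

```r
data("airquality")
cor_matrix <- cor(airquality[, c("Ozone", "Temp", "Wind")],
                  use = "complete.obs")

# Keep each pair once (upper triangle) and sort by absolute strength
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
cor_pairs
```

The first row is the strongest relationship regardless of direction, and the last row is the weakest.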
Generate the summary table grouped by Month. It should include the count, mean ozone, mean temperature, and mean wind speed per month.
# your code goes here
monthly_summary <- airquality %>%
  group_by(Month) %>%
  summarise(
    count = n(),
    mean_ozone = mean(Ozone, na.rm = TRUE),
    mean_temp = mean(Temp, na.rm = TRUE),
    mean_wind = mean(Wind, na.rm = TRUE)
  )
monthly_summary
## # A tibble: 5 × 5
##   Month count mean_ozone mean_temp mean_wind
##   <int> <int>      <dbl>     <dbl>     <dbl>
## 1     5    31       23.6      65.5     11.6 
## 2     6    30       29.4      79.1     10.3 
## 3     7    31       59.1      83.9      8.94
## 4     8    31       60.0      84.0      8.79
## 5     9    30       31.4      76.9     10.2 
Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?
August has the highest average ozone level (60.0), only slightly above July (59.1). Average temperature rises from May through August and drops in September, as expected with the change of seasons. Average wind speed decreases from May through August and increases slightly in September. Warmer, calmer conditions in July and August are consistent with the high ozone levels in those months, since heat and sunlight promote ozone formation and low wind limits its dispersion.
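One caveat when reading the table: count = n() counts every day in the month, including days where Ozone is missing, so each mean_ozone may rest on fewer observations than the count suggests. The base-R sketch below compares the two; days_per_month and ozone_per_month are illustrative names.

```r
data("airquality")

# Days recorded per month versus days with a usable Ozone reading
days_per_month  <- tapply(airquality$Ozone, airquality$Month, length)
ozone_per_month <- tapply(!is.na(airquality$Ozone), airquality$Month, sum)

# Months (5 to 9) appear as column names
rbind(days = days_per_month, ozone_readings = ozone_per_month)
```

A month with few ozone readings yields a less reliable mean, which is worth keeping in mind when comparing months.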
Submission Requirements
Publish it to RPubs and submit your link on Blackboard.