Objective:
You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.
Dataset
Source: Built-in R dataset airquality.
Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.
Variables (selected for this assignment):
Notes
- The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.
- If you encounter errors, check that tidyverse and corrplot are installed and loaded.
- Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
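As a quick check on the first note, the following sketch counts the missing values in each column. It uses only base R and the built-in dataset; the object names `na_counts` and `n_ozone_complete` are illustrative, not part of the assignment.

```r
# Count missing values in each column of the built-in dataset
data("airquality")
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Number of rows with an observed Ozone value
# (the rows that remain after ggplot2 drops the missing ones)
n_ozone_complete <- sum(!is.na(airquality$Ozone))
print(n_ozone_complete)
```

Columns with a count of zero, such as Temp and Wind, do not need na.rm = TRUE, although passing it is harmless.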
Instructions:
Complete the following tasks using R to analyze the airquality dataset. Submit an RPubs link to a document that includes code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.
#Load your dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.5.3
## corrplot 0.95 loaded
data("airquality")
Tasks and Questions
Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.
#Your code for Ozone goes here
airquality %>%
  summarise(mean = mean(Ozone, na.rm = TRUE), median = median(Ozone, na.rm = TRUE),
            sd = sd(Ozone, na.rm = TRUE), min = min(Ozone, na.rm = TRUE), max = max(Ozone, na.rm = TRUE))
## mean median sd min max
## 1 42.12931 31.5 32.98788 1 168
#Your code for Temp goes here
airquality %>%
  summarise(mean = mean(Temp), median = median(Temp),
            sd = sd(Temp), min = min(Temp), max = max(Temp))
## mean median sd min max
## 1 77.88235 79 9.46527 56 97
#Your code for Wind goes here
airquality %>%
  summarise(mean = mean(Wind), median = median(Wind),
            sd = sd(Wind), min = min(Wind), max = max(Wind))
## mean median sd min max
## 1 9.957516 9.7 3.523001 1.7 20.7
Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?
For Ozone, the mean (42.13) is well above the median (31.50), which suggests a right-skewed distribution in which a few high-ozone days pull the average up. The large standard deviation (32.99) relative to the mean indicates high variability in daily ozone levels. For Temp (mean 77.88, median 79) and Wind (mean 9.96, median 9.7), the mean and median are close, suggesting roughly symmetric distributions; the Temp median sits slightly above its mean, hinting at a mild left skew. Wind has the smallest standard deviation (3.52), though the three variables are measured in different units, so their standard deviations are not directly comparable.
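The mean-versus-median comparison can be backed up with a numeric skewness estimate. The sketch below is one possible check in base R, not part of the assignment: `skew` is a hypothetical helper written here (base R has no built-in skewness function), and it uses the simple moment-based definition.

```r
# Moment-based sample skewness: mean cubed deviation divided by the cubed SD.
# 'skew' is a hypothetical helper name, not a base R function.
skew <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))
  mean((x - m)^3) / s^3
}

data("airquality")
vars <- airquality[, c("Ozone", "Temp", "Wind")]

# Positive values in either column point to a right skew, negative to a left skew
shape_check <- data.frame(
  mean_minus_median = sapply(vars, function(x) mean(x, na.rm = TRUE) - median(x, na.rm = TRUE)),
  skewness = sapply(vars, skew)
)
print(round(shape_check, 2))
```

A skewness near zero is consistent with a roughly symmetric distribution, while a clearly positive value matches a long right tail.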
Generate the histogram for Ozone.
#Your code goes here
ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(fill = "steelblue", color = "white", bins = 30) +
  labs(title = "Distribution of Ozone Levels", x = "Ozone (ppb)", y = "Frequency")
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).
Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?
The distribution of ozone is strongly right-skewed (positively skewed). Most observations cluster at lower concentrations, between 0 and 50 ppb. There are several unusually high values at the upper end of the scale, particularly those exceeding 100 ppb, which likely represent extreme air pollution events.
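To make "outlier" concrete, one common convention is the 1.5 × IQR rule that boxplots use. The sketch below applies it to Ozone in base R; the choice of this rule and the variable names are assumptions for illustration, and other outlier definitions would flag different points.

```r
data("airquality")
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Upper fence under the 1.5 * IQR convention
q <- quantile(ozone, c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * IQR(ozone)

# Observations above the fence
high_values <- sort(ozone[ozone > upper_fence])
print(upper_fence)
print(high_values)

# For comparison, the number of days above the 100 ppb mark mentioned in the text
print(sum(ozone > 100))
```

The fence-based count and the 100 ppb count need not agree, since 100 ppb is a visual threshold rather than a statistical one.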
Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.
# Your code here
airquality_clean <- airquality %>%
  mutate(month_name = case_when(
    Month == 5 ~ "May",
    Month == 6 ~ "June",
    Month == 7 ~ "July",
    Month == 8 ~ "August",
    Month == 9 ~ "September"
  )) %>%
  # Setting levels ensures they appear in calendar order, not alphabetical
  mutate(month_name = factor(month_name, levels = c("May", "June", "July", "August", "September")))
ggplot(airquality_clean, aes(x = month_name, y = Ozone, fill = month_name)) +
  geom_boxplot() +
  labs(title = "Ozone Levels by Month", x = "Month", y = "Ozone (ppb)")
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?
Ozone levels rise during the summer and fall again in September. July has the highest median ozone level, indicated by the center line in the boxplot. Multiple months show outliers, with August containing the most extreme one (above 150 ppb). These outliers likely correspond to specific days with conditions that favor ozone buildup, such as high heat and stagnant air.
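Reading medians off a boxplot is approximate, so it can help to compute them directly. A minimal base R sketch, assuming only the built-in dataset (`median_by_month` is an illustrative name):

```r
data("airquality")

# Median Ozone per month, ignoring missing values
median_by_month <- tapply(airquality$Ozone, airquality$Month, median, na.rm = TRUE)

# Replace the numeric month labels (5 to 9) with month names
names(median_by_month) <- month.name[as.integer(names(median_by_month))]
print(median_by_month)

# Month with the highest median
print(names(which.max(median_by_month)))
```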
Produce the scatterplot of Temp vs. Ozone, colored by Month.
# Your code goes here
ggplot(airquality_clean, aes(x = Temp, y = Ozone, color = month_name)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(title = "Temperature vs. Ozone Levels", x = "Temperature (°F)", y = "Ozone (ppb)")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).
Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.
There is a visible positive, roughly linear relationship: as temperature increases, ozone levels tend to rise as well. The months also cluster. May and September points fall mostly in the cooler, lower-ozone region, while July (green) and August (blue) points cluster in the upper right.
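The black trend line in the scatterplot corresponds to a simple linear regression of Ozone on Temp. The sketch below fits that model directly with base R's lm() to put a number on the slope; treating the relationship as linear is an assumption carried over from the plot, and the object names are illustrative.

```r
data("airquality")

# Simple linear regression; rows with missing Ozone are dropped by default
fit <- lm(Ozone ~ Temp, data = airquality)

# Estimated change in Ozone (ppb) per one-degree (°F) increase in Temp
slope <- coef(fit)[["Temp"]]

# Share of the variance in Ozone explained by Temp
r_squared <- summary(fit)$r.squared

print(slope)
print(r_squared)
```

For a model with a single predictor, the square root of R-squared equals the absolute value of the Pearson correlation between the two variables.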
Compute and visualize the correlation matrix for Ozone, Temp, and Wind.
# Your code goes here
# Select the three variables and compute correlations on complete rows only
cor_data <- airquality %>%
  select(Ozone, Temp, Wind) %>%
  cor(use = "complete.obs")
corrplot(cor_data, method = "color", addCoef.col = "black", type = "upper")
Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.
The strongest correlation is between Ozone and Temp (0.70): higher temperatures are strongly associated with higher ozone levels. Ozone and Wind are negatively correlated (-0.60), which is consistent with higher wind speeds dispersing ozone and lowering its concentration. Ozone is therefore more strongly correlated with temperature than with wind speed. The remaining pair, Temp and Wind, has the weakest correlation of the three in magnitude.
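To rank the pairs without reading values off the plot, the correlation matrix can be flattened into a small table sorted by absolute correlation. This is a base R sketch; `cor_mat` and `pair_tbl` are illustrative names.

```r
data("airquality")
cor_mat <- cor(airquality[, c("Ozone", "Temp", "Wind")], use = "complete.obs")

# Keep each pair once by taking the upper triangle of the matrix
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
pair_tbl <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r = cor_mat[idx]
)

# Strongest relationship first, regardless of sign
pair_tbl <- pair_tbl[order(-abs(pair_tbl$r)), ]
print(pair_tbl, row.names = FALSE)
```

Sorting by absolute value matters here because a correlation of -0.60 is stronger than one of 0.50, even though it is numerically smaller.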
Generate the summary table grouped by Month. It should include the count of observations, mean ozone, mean temperature, and mean wind speed per month.
# your code goes here
airquality_clean %>%
  group_by(month_name) %>%
  summarise(
    count = n(),
    avg_ozone = mean(Ozone, na.rm = TRUE),
    avg_temp = mean(Temp, na.rm = TRUE),
    avg_wind = mean(Wind, na.rm = TRUE)
  )
## # A tibble: 5 × 5
## month_name count avg_ozone avg_temp avg_wind
## <fct> <int> <dbl> <dbl> <dbl>
## 1 May 31 23.6 65.5 11.6
## 2 June 30 29.4 79.1 10.3
## 3 July 31 59.1 83.9 8.94
## 4 August 31 60.0 84.0 8.79
## 5 September 30 31.4 76.9 10.2
Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?
August has the highest average ozone level at 59.96 ppb, narrowly ahead of July (59.1 ppb). From May to August, average temperatures rise while average wind speeds fall; in September, temperatures drop and winds pick up again. The combination of high heat and lower wind speeds in July and August likely limits the dispersal of pollutants, which is consistent with the higher ozone concentrations in those months.
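One caveat when reading the summary table: the count column counts every day in the month, including days where Ozone is missing, while avg_ozone is computed only from the observed values. The base R sketch below shows how many Ozone readings back each monthly mean; the `coverage` table and its column names are illustrative.

```r
data("airquality")

# Days per month, and days per month with an observed Ozone value
n_days <- tapply(airquality$Ozone, airquality$Month, length)
n_ozone <- tapply(airquality$Ozone, airquality$Month, function(x) sum(!is.na(x)))

coverage <- data.frame(
  month = month.name[as.integer(names(n_days))],
  n_days = as.vector(n_days),
  n_ozone = as.vector(n_ozone)
)
print(coverage)
```

A monthly mean based on few observed days is less reliable than one based on nearly all of them, which is worth keeping in mind when comparing months.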
Submission Requirements
Publish it to RPubs and submit your link on Blackboard.