Objective:
You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.
Dataset
Source: Built-in R dataset airquality.
Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.
Variables (selected for this assignment):
Notes
- The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.
- If you encounter errors, check that tidyverse and corrplot are installed and loaded.
- Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
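As a quick illustration of the missing-value note above, the sketch below uses base R only and works on a copy named aq (a name chosen here for illustration) so the original data frame is left untouched. It shows how one NA propagates through mean(), and how na.rm = TRUE and use = "complete.obs" avoid that.

```r
# Work on a copy so the session's airquality object is not modified.
aq <- datasets::airquality

# Count missing values per column to see where the NAs are.
na_counts <- colSums(is.na(aq))
print(na_counts)

# Without na.rm, a single NA makes the result NA.
mean_with_na <- mean(aq$Ozone)

# With na.rm = TRUE, NAs are dropped before averaging.
mean_without_na <- mean(aq$Ozone, na.rm = TRUE)

# cor() takes a different argument: use = "complete.obs" keeps only
# the rows where both variables are present.
ozone_temp_cor <- cor(aq$Ozone, aq$Temp, use = "complete.obs")
print(c(mean_with_na, mean_without_na, ozone_temp_cor))
```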
Instructions:
Complete the following tasks using R to analyze the airquality dataset. Submit an RPubs link to a report that includes code, outputs (tables and plots), and written interpretations for each task. Make sure you load the dataset using data(airquality) and install and load the tidyverse and corrplot packages.
#Load your dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data("airquality")
Tasks and Questions
Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.
#Your code for Ozone goes here
# For Ozone
ozone_stats <- c(
  Mean = mean(airquality$Ozone, na.rm = TRUE),
  Median = median(airquality$Ozone, na.rm = TRUE),
  SD = sd(airquality$Ozone, na.rm = TRUE),
  Min = min(airquality$Ozone, na.rm = TRUE),
  Max = max(airquality$Ozone, na.rm = TRUE)
)
ozone_stats
##      Mean    Median        SD       Min       Max
##  42.12931  31.50000  32.98788   1.00000 168.00000
#Your code for Temp goes here
# For Temp
temp_stats <- c(
  Mean = mean(airquality$Temp, na.rm = TRUE),
  Median = median(airquality$Temp, na.rm = TRUE),
  SD = sd(airquality$Temp, na.rm = TRUE),
  Min = min(airquality$Temp, na.rm = TRUE),
  Max = max(airquality$Temp, na.rm = TRUE)
)
temp_stats
##     Mean   Median       SD      Min      Max
## 77.88235 79.00000  9.46527 56.00000 97.00000
#Your code for Wind goes here
# For Wind
wind_stats <- c(
  Mean = mean(airquality$Wind, na.rm = TRUE),
  Median = median(airquality$Wind, na.rm = TRUE),
  SD = sd(airquality$Wind, na.rm = TRUE),
  Min = min(airquality$Wind, na.rm = TRUE),
  Max = max(airquality$Wind, na.rm = TRUE)
)
wind_stats
##      Mean    Median        SD       Min       Max
##  9.957516  9.700000  3.523001  1.700000 20.700000
Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?
Ozone: Mean (42.13) is greater than median (31.5), suggesting right (positive) skewness. The standard deviation (32.99) is high, indicating large variability in ozone levels.
Temp: Mean (77.88) and median (79) are very close, suggesting a symmetric distribution. Standard deviation (9.47) indicates moderate variability.
Wind: Mean (9.96) and median (9.7) are similar, suggesting near symmetry. Standard deviation (3.52) indicates relatively low variability compared to the mean.
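The mean-versus-median comparison is only an informal check of skewness. One way to back it up is a moment-based skewness coefficient, sketched below in base R. The helper skewness() is a hypothetical function written for this illustration (it is not part of base R), and aq is an illustrative copy of the data.

```r
aq <- datasets::airquality

# Hypothetical helper: moment-based skewness, i.e. the third central
# moment divided by the second central moment raised to the power 1.5.
skewness <- function(x) {
  x <- x[!is.na(x)]          # drop missing values first
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Positive values point to a longer right tail, negative to a longer left tail.
skew_values <- sapply(aq[c("Ozone", "Temp", "Wind")], skewness)
print(round(skew_values, 2))
```

Values near zero are consistent with a roughly symmetric distribution. A clearly positive value supports the right-skew reading suggested by a mean above the median.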
Generate the histogram for Ozone.
#Your code goes here
# Histogram for Ozone
ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 10, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Ozone Levels", x = "Ozone (ppb)", y = "Frequency") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).
Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?
The distribution of ozone is right-skewed and unimodal, with most values concentrated below 60 ppb. There are a few high values above 100 ppb, which may be considered outliers. The skewness aligns with the mean > median observation in Task 1.
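Whether the high values count as outliers can be checked with the usual 1.5 × IQR rule instead of judging by eye. The sketch below uses boxplot.stats() from base R (grDevices), which drops NAs and returns the points lying beyond the whiskers. The copy aq is an illustrative name.

```r
aq <- datasets::airquality

# boxplot.stats() computes the hinges and flags points more than
# 1.5 * IQR beyond them; $out holds the flagged values.
ozone_outliers <- boxplot.stats(aq$Ozone)$out
print(ozone_outliers)
```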
Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from Week 4. Generate a boxplot of Ozone by month_name.
# Your code here
# Recode Month into month_name; a factor with explicit levels keeps the
# months in calendar order on the plot axis instead of alphabetical order
airquality <- airquality %>%
  mutate(month_name = factor(
    case_when(
      Month == 5 ~ "May",
      Month == 6 ~ "June",
      Month == 7 ~ "July",
      Month == 8 ~ "August",
      Month == 9 ~ "September"
    ),
    levels = c("May", "June", "July", "August", "September")
  ))
# Boxplot of Ozone by month_name
ggplot(airquality, aes(x = month_name, y = Ozone, fill = month_name)) +
  geom_boxplot() +
  labs(title = "Boxplot of Ozone Levels by Month", x = "Month", y = "Ozone (ppb)") +
  theme_minimal() +
  theme(legend.position = "none")
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?
Ozone levels generally increase from May to July, then decrease in August and September. July has the highest median ozone level. There are outliers in May, June, and July, which could indicate unusually high ozone days, possibly due to specific weather conditions or pollution events.
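Readings taken from a boxplot can be cross-checked numerically. The sketch below, in base R on an illustrative copy aq, computes the median ozone per month and lists the points flagged by the 1.5 × IQR rule within each month. One caveat: boxplot.stats() uses hinges while geom_boxplot() uses quantiles, so a borderline point may be flagged by one and not the other.

```r
aq <- datasets::airquality

# Median ozone per month; the formula interface drops rows with NA by default.
monthly_median <- aggregate(Ozone ~ Month, data = aq, FUN = median)
print(monthly_median)

# Points beyond 1.5 * IQR from the hinges, computed separately for each month.
monthly_outliers <- tapply(aq$Ozone, aq$Month,
                           function(x) boxplot.stats(x)$out)
print(monthly_outliers)
```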
Produce the scatterplot of Temp vs. Ozone, colored by Month.
# Your code goes here
# Scatterplot of Temp vs Ozone, colored by Month
ggplot(airquality, aes(x = Temp, y = Ozone, color = factor(Month))) +
  geom_point() +
  labs(title = "Scatterplot of Temperature vs. Ozone", x = "Temperature (F)", y = "Ozone (ppb)") +
  scale_color_discrete(name = "Month") +
  theme_minimal()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).
Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.
There is a positive relationship between temperature and ozone: as temperature increases, ozone levels tend to increase. Months with higher temperatures (July, August) cluster in the upper right, showing higher ozone levels. Cooler months (May, September) tend to have lower ozone.
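The visual trend can be given a rough number with a simple linear fit. The sketch below is one possible check and was not required by the assignment. It uses base R on an illustrative copy aq, and the straight-line model is an assumption, since the scatterplot may well be curved.

```r
aq <- datasets::airquality

# Straight-line fit of Ozone on Temp; lm() drops rows with missing Ozone.
fit <- lm(Ozone ~ Temp, data = aq)
temp_slope <- coef(fit)[["Temp"]]   # change in ozone (ppb) per 1 degree F
print(temp_slope)

# Monthly means of both variables, computed only on rows where Ozone
# is present (the formula interface omits incomplete rows).
monthly_means <- aggregate(cbind(Ozone, Temp) ~ Month, data = aq, FUN = mean)
print(monthly_means)
```

A positive slope agrees with the upward pattern in the plot, and the monthly means show whether the warmer months also sit higher on the ozone axis.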
Compute and visualize the correlation matrix for Ozone, Temp, and Wind.
# Your code goes here
# Compute correlation matrix
cor_matrix <- airquality %>%
  select(Ozone, Temp, Wind) %>%
  cor(use = "complete.obs")
cor_matrix
##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000
# Visualize correlation matrix
corrplot(cor_matrix, method = "color", type = "upper",
tl.col = "black", tl.srt = 45, addCoef.col = "black")
Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.
Strongest correlation: Ozone and Temp (0.70), a strong positive relationship.
Weakest correlation: Temp and Wind (-0.51), a moderate negative relationship. Ozone is more strongly correlated with temperature (0.70) than with wind speed (-0.60). This suggests that higher temperatures are associated with higher ozone levels, while higher wind speeds are associated with lower ozone levels (likely due to dispersion).
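Reading the strongest and weakest pairs off a matrix by eye is error-prone, so the ranking can also be done in code. The sketch below uses base R on an illustrative copy aq (the names cor_mat and cor_pairs are chosen here for illustration). It lists each variable pair once and sorts by absolute correlation.

```r
aq <- datasets::airquality

cor_mat <- cor(aq[c("Ozone", "Temp", "Wind")], use = "complete.obs")

# The upper triangle holds each pair exactly once and skips the diagonal.
upper <- upper.tri(cor_mat)
cor_pairs <- data.frame(
  var1 = rownames(cor_mat)[row(cor_mat)[upper]],
  var2 = colnames(cor_mat)[col(cor_mat)[upper]],
  r    = cor_mat[upper]
)

# Sort by absolute value so the sign does not affect "strength".
cor_pairs <- cor_pairs[order(abs(cor_pairs$r), decreasing = TRUE), ]
print(cor_pairs)
```

The first row of cor_pairs is the strongest relationship and the last row is the weakest.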
Generate the summary table grouped by Month. It should include the count and the mean ozone, mean temperature, and mean wind speed for each month.
# your code goes here
# Summary table by month
summary_table <- airquality %>%
  group_by(Month) %>%
  summarise(
    Count = n(),
    Avg_Ozone = mean(Ozone, na.rm = TRUE),
    Avg_Temp = mean(Temp, na.rm = TRUE),
    Avg_Wind = mean(Wind, na.rm = TRUE)
  )
summary_table
## # A tibble: 5 × 5
##   Month Count Avg_Ozone Avg_Temp Avg_Wind
##   <int> <int>     <dbl>    <dbl>    <dbl>
## 1     5    31      23.6     65.5    11.6
## 2     6    30      29.4     79.1    10.3
## 3     7    31      59.1     83.9     8.94
## 4     8    31      60.0     84.0     8.79
## 5     9    30      31.4     76.9    10.2
Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?
August has the highest average ozone level (60.0 ppb), narrowly ahead of July (59.1 ppb).
Temperature is highest in July and August (about 84 °F), and wind speed is lowest in those two months (under 9 mph).
Higher temperatures and lower wind speeds in midsummer likely contribute to ozone formation and accumulation, explaining the peak in July and August.
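Which month tops the table can be confirmed in code rather than read off the rounded output. The sketch below uses base R on an illustrative copy aq. It also counts the non-missing ozone readings per month, because Count = n() counts days, not ozone observations.

```r
aq <- datasets::airquality

# Mean ozone per month, ignoring missing readings.
avg_ozone <- tapply(aq$Ozone, aq$Month, mean, na.rm = TRUE)

# Month number with the largest mean, then its name via the built-in month.name.
top_month <- as.integer(names(avg_ozone)[which.max(avg_ozone)])
print(month.name[top_month])

# Number of days per month that actually have an ozone reading.
ozone_n <- tapply(!is.na(aq$Ozone), aq$Month, sum)
print(ozone_n)
```

Months with few valid readings give less reliable averages, which is worth keeping in mind when comparing months.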
Submission Requirements
Publish your report to RPubs and submit the link on Blackboard.