Objective:
You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.
Dataset
Source: Built-in R dataset airquality.
Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.
Variables (selected for this assignment): Ozone (ppb), Temp (degrees F), Wind (mph), and Month (5–9, May to September).
Notes
- The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.
- If you encounter errors, check that tidyverse and corrplot are installed and loaded.
- Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
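As a quick check of the first note, the base R sketch below counts the missing values in each column and shows what happens when na.rm is left out. The object names na_counts, mean_default, and mean_clean are illustrative and not part of the assignment.

```r
# Load the built-in dataset
data("airquality")

# Count missing values in each column
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Without na.rm = TRUE, a single NA makes the result NA
mean_default <- mean(airquality$Ozone)

# With na.rm = TRUE, the NAs are dropped before averaging
mean_clean <- mean(airquality$Ozone, na.rm = TRUE)

print(mean_default)
print(round(mean_clean, 2))
```

The count for Ozone should match the 37 rows that ggplot2 reports as removed in the plotting warnings later in this report.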
Instructions:
Complete the following tasks using R to analyze the airquality dataset. Submit your Rpubs link that includes code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.
#Load your dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data("airquality")
Tasks and Questions
Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.
#Your code for Ozone goes here
ozone_stats <- airquality |>
summarise(
Mean = mean(Ozone, na.rm = TRUE),
Median = median(Ozone, na.rm = TRUE),
SD = sd(Ozone, na.rm = TRUE),
Min = min(Ozone, na.rm = TRUE),
Max = max(Ozone, na.rm = TRUE)
)
ozone_stats
## Mean Median SD Min Max
## 1 42.12931 31.5 32.98788 1 168
#Your code for Temp goes here
temp_stats <- airquality |>
summarise(
Mean = mean(Temp, na.rm = TRUE),
Median = median(Temp, na.rm = TRUE),
SD = sd(Temp, na.rm = TRUE),
Min = min(Temp, na.rm = TRUE),
Max = max(Temp, na.rm = TRUE)
)
temp_stats
## Mean Median SD Min Max
## 1 77.88235 79 9.46527 56 97
#Your code for Wind goes here
wind_stats <- airquality |>
summarise(
Mean = mean(Wind, na.rm = TRUE),
Median = median(Wind, na.rm = TRUE),
SD = sd(Wind, na.rm = TRUE),
Min = min(Wind, na.rm = TRUE),
Max = max(Wind, na.rm = TRUE)
)
wind_stats
## Mean Median SD Min Max
## 1 9.957516 9.7 3.523001 1.7 20.7
Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?
Ozone: The mean (42.13) is well above the median (31.5), which suggests a right-skewed distribution pulled upward by a few very high values. The standard deviation of 32.99 is large relative to the mean, indicating high variability.
Temp: The mean (77.88) is very close to the median (79), which suggests a roughly symmetric distribution with a very slight left skew. The standard deviation of 9.47 is small relative to the mean, indicating moderate variability; temperatures were fairly stable over the season.
Wind: The mean (9.96) is nearly identical to the median (9.7), which suggests a roughly symmetric distribution. The standard deviation of 3.52 indicates low variability in absolute terms: a typical day falls within about one standard deviation of the mean, roughly 6 to 13 mph, which is a fairly steady range.
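One way to put a number on the mean-versus-median comparison is Pearson's second skewness coefficient, 3 * (mean - median) / sd. The sketch below is a supplementary check, not part of the assigned tasks, and pearson_skew is a hypothetical helper name introduced only for illustration.

```r
data("airquality")

# Hypothetical helper: Pearson's second skewness coefficient,
# 3 * (mean - median) / sd. Positive values point to right skew,
# negative values to left skew, and values near zero to symmetry.
pearson_skew <- function(x) {
  x <- x[!is.na(x)]  # drop missing values first
  3 * (mean(x) - median(x)) / sd(x)
}

# Apply the helper to the three variables used in this task
skew_by_var <- sapply(airquality[c("Ozone", "Temp", "Wind")], pearson_skew)
print(round(skew_by_var, 2))
```

Using the statistics printed above, the coefficients should come out near 0.97 for Ozone, -0.35 for Temp, and 0.22 for Wind, which agrees with the written interpretation: clear right skew for Ozone and only mild asymmetry for the other two.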
Generate the histogram for Ozone.
#Your code goes here
ggplot(airquality, aes(x = Ozone)) +
geom_histogram(binwidth = 10, fill = "#1f77b4", color = "black") +
labs(title = "Histogram of Ozone Levels",
x = "Ozone (ppb)",
y = "Frequency") +
theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).
Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?
The ozone distribution is unimodal, with one clear peak at low values, and strongly right-skewed. At the far right of the x-axis there is an outlier near the maximum of 168 ppb, separated from the rest of the data by a gap of empty bins. This gap suggests that the extremely high ozone reading was an isolated event rather than part of the usual pattern.
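The visual impression of an outlier can be cross-checked numerically. The sketch below applies the common 1.5 * IQR rule, which is one convention among several and an assumption here rather than something the assignment prescribes. The names upper_fence and high_outliers are illustrative.

```r
data("airquality")

# Work with the non-missing ozone readings only
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Quartiles and the upper fence used by the 1.5 * IQR rule
q <- quantile(ozone, probs = c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])

# Values above the fence are flagged as potential high outliers
high_outliers <- sort(ozone[ozone > upper_fence])

print(upper_fence)
print(high_outliers)
```

Any value listed in high_outliers lies beyond the upper whisker a boxplot of the pooled ozone data would draw, so the list can be compared directly with the isolated bar on the right of the histogram.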
Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.
# Your code here
airquality_num <- airquality |>
mutate(
month_name = case_when(
Month == 5 ~ "May",
Month == 6 ~ "June",
Month == 7 ~ "July",
Month == 8 ~ "August",
Month == 9 ~ "September"
),
month_name = factor(month_name, levels = c("May", "June", "July", "August", "September"))
)
#plot
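# Map fill to month_name so each color is tied to its month by name, not by position
ggplot(airquality_num, aes(x = month_name, y = Ozone, fill = month_name)) +
geom_boxplot(na.rm = TRUE, show.legend = FALSE) +
scale_fill_manual(values = c("May" = "red", "June" = "orange", "July" = "yellow",
"August" = "green", "September" = "blue")) +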
labs(title = "Boxplot of Ozone Levels by Month",
x = "Month",
y = "Ozone (ppb)") +
theme_dark()
Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?
Ozone levels are low in May and June, rise sharply in July, stay at a similar level in August, and then drop sharply in September. July has the highest median ozone. Every month except July shows outliers, with most of them in September. These outliers likely represent individual days of unusually poor air quality or extreme heat. Because heat waves are most common in midsummer, the outliers in May and September are more likely to reflect isolated air quality episodes than heat.
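The medians read off the boxplot can be confirmed with a short table. This sketch uses base R's aggregate() and the built-in month.name vector instead of the case_when recode above, purely to stay self-contained. The names monthly_median and top_month are illustrative.

```r
data("airquality")

# Median ozone per month; the formula interface drops rows with missing Ozone
monthly_median <- aggregate(Ozone ~ Month, data = airquality, FUN = median)

# Replace month numbers 5 to 9 with names using the built-in month.name vector
monthly_median$Month <- month.name[monthly_median$Month]
print(monthly_median)

# Month with the highest median ozone
top_month <- monthly_median$Month[which.max(monthly_median$Ozone)]
print(top_month)
```

The month printed last should agree with the month whose box has the highest center line in the plot.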
Produce the scatterplot of Temp vs. Ozone, colored by Month.
# Your code goes here
ggplot(airquality_num, aes(x = Temp, y = Ozone, color = month_name)) +
geom_point(alpha = 0.7) +
labs(
title = "Scatterplot of Temperature vs. Ozone Levels",
x = "Temperature (F)",
y = "Ozone (ppb)",
color = "Month"
) +
scale_color_manual(values = c("May" = "red",
"June" = "orange",
"July" = "yellow",
"August" = "green",
"September" = "blue")) +
theme_minimal()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).
Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.
Yes. Temperature and ozone show a fairly strong positive relationship: hotter days tend to have higher ozone levels. The months also cluster seasonally. Most May and September points sit in the lower left of the plot, most July and August points sit in the upper right, and June falls in the transition between the two clusters. Most of the extreme ozone values occur once the temperature passes 80 degrees, except for one May observation just under 80 degrees.
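The observation about the 80 degree mark can be checked by splitting the days at that temperature. The 80 F cutoff is an assumption taken from the visual reading of the scatterplot, not a threshold defined in the assignment, and the object names below are illustrative.

```r
data("airquality")

# Keep only days with a recorded ozone value
obs <- airquality[!is.na(airquality$Ozone), ]

# Assumed cutoff of 80 F, taken from the visual reading of the scatterplot
temp_group <- ifelse(obs$Temp >= 80, "80 F or above", "below 80 F")

# Mean and maximum ozone within each temperature group
ozone_mean <- tapply(obs$Ozone, temp_group, mean)
ozone_max  <- tapply(obs$Ozone, temp_group, max)

print(round(ozone_mean, 1))
print(ozone_max)
```

A clearly higher mean in the warmer group supports the positive relationship seen in the plot, while the maximum in the cooler group shows how high ozone got on days that stayed under the cutoff.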
Compute and visualize the correlation matrix for Ozone, Temp, and Wind.
cor_matrix <- cor(
airquality |>
select(Ozone, Temp, Wind), use = "complete.obs")
cor_matrix
## Ozone Temp Wind
## Ozone 1.0000000 0.6983603 -0.6015465
## Temp 0.6983603 1.0000000 -0.5110750
## Wind -0.6015465 -0.5110750 1.0000000
# Visualize correlation matrix
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45, addCoef.col = "black",
title = "Correlation Matrix of Air Quality Variables", mar = c(0, 0, 2, 0)) # top margin keeps the title from being clipped
Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.
The strongest correlation is between Ozone and Temp (about 0.70), and the weakest is between Temp and Wind (about -0.51). Ozone is more strongly correlated with temperature than with wind speed (-0.60), although both relationships are moderately strong. The positive value for Ozone and Temp suggests that ozone tends to be high on hot days. The negative value for Ozone and Wind suggests that higher wind speeds may help disperse ozone and lower its concentration.
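Strength of correlation is judged by absolute value, so the sign has to be ignored when ranking. The sketch below lists each pair once and sorts the pairs from strongest to weakest. It recomputes the matrix so that it runs on its own, and cor_pairs is an illustrative name.

```r
data("airquality")

# Same correlation matrix as above, recomputed so this block is self-contained
cor_matrix <- cor(airquality[c("Ozone", "Temp", "Wind")], use = "complete.obs")

# Keep each pair once by taking the upper triangle (diagonal excluded)
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx],
  row.names = NULL
)

# Sort from strongest to weakest by absolute value
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs)
```

The first row is the strongest relationship and the last row the weakest, regardless of whether the coefficient is positive or negative.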
Generate the summary table grouped by Month. It should include the count, mean ozone, mean temperature, and mean wind speed per month.
# your code goes here
summary_table <- airquality_num |>
group_by(month_name) |>
summarise(
Count = n(),
Avg_Ozone = mean(Ozone, na.rm = TRUE),
Avg_Temp = mean(Temp, na.rm = TRUE),
Avg_Wind = mean(Wind, na.rm = TRUE)
)
summary_table
## # A tibble: 5 × 5
## month_name Count Avg_Ozone Avg_Temp Avg_Wind
## <fct> <int> <dbl> <dbl> <dbl>
## 1 May 31 23.6 65.5 11.6
## 2 June 30 29.4 79.1 10.3
## 3 July 31 59.1 83.9 8.94
## 4 August 31 60.0 84.0 8.79
## 5 September 30 31.4 76.9 10.2
Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?
August has the highest average ozone level at 60.0 ppb, far above May's 23.6 ppb. Average temperature rises from its coolest point in May, peaks in July and August, and falls again in September. Average wind speed is highest in May, slightly lower in June and September, and lowest in July and August. Several environmental factors might explain these differences. First, ozone forms when pollutants react in the presence of sunlight and heat, both of which peak in July and August. Second, lower wind speeds let pollutants stagnate, so ozone accumulates to higher levels, as the July and August data show. Lastly, transitional seasons such as spring and fall tend to have lower ozone levels because temperatures are lower and wind speeds are higher, which helps disperse pollutants.
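One caveat when reading the summary table: Count = n() counts every day in the month, including days with a missing ozone reading, while the ozone average uses only the non-missing days. The sketch below compares the two counts per month. It is a supplementary check, and the names days_total, days_valid, and coverage are illustrative.

```r
data("airquality")

# Days per month versus days with a non-missing ozone reading
days_total <- tapply(airquality$Ozone, airquality$Month, length)
days_valid <- tapply(airquality$Ozone, airquality$Month,
                     function(x) sum(!is.na(x)))

# Combine both counts into one table with month names
coverage <- data.frame(
  Month       = month.name[as.integer(names(days_total))],
  Days        = as.vector(days_total),
  Valid_Ozone = as.vector(days_valid)
)
print(coverage)
```

A month with few valid readings has a less reliable ozone average, so the Valid_Ozone column is worth checking before comparing months closely.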
Submission Requirements
Publish it to Rpubs and submit your link on Blackboard.