Objective:
You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.
Dataset
Source: Built-in R dataset airquality.
Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.
Variables (selected for this assignment):
Notes
- The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.
- If you encounter errors, check that tidyverse and corrplot are installed and loaded.
- Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
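To see how much data the missing values affect before relying on na.rm = TRUE or use = "complete.obs", they can be counted per column. This is a minimal base R sketch; the names aq, na_counts, and n_complete are illustrative and not part of the assignment.

```r
# Work on a copy so the built-in dataset stays untouched
aq <- datasets::airquality

# Number of missing values in each column
na_counts <- colSums(is.na(aq))
print(na_counts)

# Rows with a value for all of Ozone, Temp, and Wind
# (the rows that use = "complete.obs" keeps for those three columns)
n_complete <- sum(complete.cases(aq[, c("Ozone", "Temp", "Wind")]))
print(n_complete)
```

The Ozone count matches the 37 rows that the plotting warnings report as removed.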
Instructions:
Complete the following tasks using R to analyze the airquality dataset. Submit an RPubs link to a document that includes code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.
#Load your dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data("airquality")
Tasks and Questions
Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.
summary_stats <- airquality |>
  summarise(
    mean_Ozone = mean(Ozone, na.rm = TRUE),
    median_Ozone = median(Ozone, na.rm = TRUE),
    sd_Ozone = sd(Ozone, na.rm = TRUE),
    min_Ozone = min(Ozone, na.rm = TRUE),
    max_Ozone = max(Ozone, na.rm = TRUE)
  )
summary_stats
##   mean_Ozone median_Ozone sd_Ozone min_Ozone max_Ozone
## 1   42.12931         31.5 32.98788         1       168
summary_stats <- airquality |>
  summarise(
    mean_temp = mean(Temp, na.rm = TRUE),
    median_temp = median(Temp, na.rm = TRUE),
    sd_temp = sd(Temp, na.rm = TRUE),
    min_temp = min(Temp, na.rm = TRUE),
    max_temp = max(Temp, na.rm = TRUE)
  )
summary_stats
##   mean_temp median_temp sd_temp min_temp max_temp
## 1  77.88235          79 9.46527       56       97
summary_stats <- airquality |>
  summarise(
    mean_wind = mean(Wind, na.rm = TRUE),
    median_wind = median(Wind, na.rm = TRUE),
    sd_wind = sd(Wind, na.rm = TRUE),
    min_wind = min(Wind, na.rm = TRUE),
    max_wind = max(Wind, na.rm = TRUE)
  )
summary_stats
##   mean_wind median_wind  sd_wind min_wind max_wind
## 1  9.957516         9.7 3.523001      1.7     20.7
Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?
Wind has the smallest mean and median (9.96 and 9.7), Ozone is in the middle (42.13 and 31.5), and Temp has the largest (77.88 and 79). For Wind and Temp the mean and median are close, which suggests roughly symmetric distributions with little skew; Temp's mean is slightly below its median, hinting at a mild left skew. For Ozone the mean is about 10.6 units above the median, which points to a right-skewed distribution pulled upward by a few very high days. The standard deviations tell a similar story: Wind (3.52) and Temp (9.47) are small relative to their means, so those values stay fairly close to the average, while Ozone's (32.99) is about 78% of its mean, so ozone varies much more from day to day.
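The comparison above can be checked numerically by putting the mean, the median, their gap, and the standard deviation relative to the mean (the coefficient of variation) side by side. This is a minimal base R sketch; aq, vars, and shape_check are illustrative names, and the coefficient of variation is an added convenience rather than something the assignment asks for.

```r
aq <- datasets::airquality
vars <- c("Ozone", "Temp", "Wind")

# For each variable: mean, median, their gap, and the coefficient of variation
shape_check <- sapply(aq[vars], function(x) {
  m   <- mean(x, na.rm = TRUE)
  med <- median(x, na.rm = TRUE)
  s   <- sd(x, na.rm = TRUE)
  c(mean = m, median = med, mean_minus_median = m - med, cv = s / m)
})

print(round(shape_check, 2))
```

A clearly positive mean_minus_median suggests a right skew, and a larger cv means more spread relative to the typical value.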
Generate the histogram for Ozone.
library(ggplot2)
ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 5, fill = "#1f77b4", color = "black") +
  labs(title = "Ozone Levels", x = "Ozone (ppb)", y = "Count") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).
Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?
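The distribution is unimodal and right-skewed: most days have low ozone values, and a long tail stretches toward higher values. The bar past 150 on the x-axis (the maximum, 168) is isolated from the rest of the data and looks like an outlier. The 37 days with missing ozone values are not shown.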
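Whether the high values count as outliers can be checked with the same 1.5 × IQR rule a boxplot uses. This is a small base R sketch, assuming that rule is an acceptable definition of "outlier" here; aq, ozone_stats, and ozone_outliers are illustrative names.

```r
aq <- datasets::airquality

# boxplot.stats() flags points more than 1.5 * IQR beyond the hinges;
# missing values are dropped automatically
ozone_stats <- boxplot.stats(aq$Ozone)
ozone_outliers <- ozone_stats$out

print(ozone_stats$stats)   # lower whisker, lower hinge, median, upper hinge, upper whisker
print(ozone_outliers)
```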
Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.
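airquality <- airquality |>
  mutate(
    month_name = case_when(
      Month == 5 ~ "May",
      Month == 6 ~ "June",
      Month == 7 ~ "July",
      Month == 8 ~ "August",
      Month == 9 ~ "September"
    ),
    # Order the months by calendar rather than alphabetically on the x-axis
    month_name = factor(month_name, levels = c("May", "June", "July", "August", "September"))
  )
ggplot(airquality, aes(x = month_name, y = Ozone)) +
  geom_boxplot(fill = "#2ca02c") +
  labs(title = "Ozone levels by month",
       x = "Month", y = "Ozone") +
  theme_minimal()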
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?
Ozone is lowest in May, rises to its highest levels in July and August, and drops again in September. July has the highest median. Every month except July has outliers. These probably mark individual days when ozone spiked well above what was typical for that month before returning to the usual levels.
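The medians and outliers read off the boxplot can be confirmed with a few lines of base R. This is a sketch only; aq, ozone_median, and outlier_count are illustrative names, and boxplot.stats() computes quartiles slightly differently from ggplot2, so borderline points may not match the plot exactly.

```r
aq <- datasets::airquality

# Median ozone for each month (5 = May, ..., 9 = September)
ozone_median <- tapply(aq$Ozone, aq$Month, median, na.rm = TRUE)
print(ozone_median)

# Number of points beyond the 1.5 * IQR whiskers in each month
outlier_count <- tapply(
  aq$Ozone, aq$Month,
  function(x) length(boxplot.stats(x)$out)
)
print(outlier_count)
```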
Produce the scatterplot of Temp vs. Ozone, colored by Month.
ggplot(airquality, aes(x = Temp, y = Ozone, color = Month)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Scatterplot of Temp vs. Ozone",
    x = "Temp",
    y = "Ozone",
    color = "Month"
  ) +
  theme_minimal()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).
Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.
Yes: ozone tends to rise as temperature rises. The points are densest between about 65 and a little over 80 degrees. The months also cluster: the darkest points (May) sit mostly in the 60s with low ozone, so lower temperatures go with earlier in the season. Lighter points from the later months appear as the temperature climbs through the 70s and 80s, and the July and August days at the warm end show the highest ozone.
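Because Month is stored as an integer, ggplot2 maps it to a continuous color gradient, which makes individual months hard to tell apart. One possible alternative, sketched here under the assumption that a discrete legend is acceptable for the task, is to convert the month to a labeled factor first. The names aq, month_name, and p are illustrative.

```r
library(ggplot2)

aq <- datasets::airquality

# Convert the integer month (5-9) to a labeled factor so each month gets its own color
aq$month_name <- factor(
  aq$Month,
  levels = 5:9,
  labels = c("May", "June", "July", "August", "September")
)

p <- ggplot(aq, aes(x = Temp, y = Ozone, color = month_name)) +
  geom_point(alpha = 0.7, na.rm = TRUE) +  # na.rm = TRUE drops missing rows without a warning
  labs(
    title = "Temp vs. Ozone by month",
    x = "Temp",
    y = "Ozone",
    color = "Month"
  ) +
  theme_minimal()

print(p)
```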
Compute and visualize the correlation matrix for Ozone, Temp, and Wind.
cor_matrix <- cor(
  airquality |>
    select(Ozone, Temp, Wind),
  use = "complete.obs"
)
cor_matrix
##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000
Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.
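The strongest correlation is between Ozone and Temp (r ≈ 0.70), a fairly strong positive relationship: hotter days tend to have more ozone. The weakest is between Temp and Wind (r ≈ -0.51), with Ozone and Wind in between (r ≈ -0.60). Both are moderate negative relationships, so windier days tend to be cooler and to have less ozone. Ozone is therefore more strongly correlated with temperature than with wind speed, although a correlation alone does not show that temperature causes the higher ozone.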
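The task also asks for a visualization of the matrix. One way to draw it with the corrplot package loaded earlier is sketched below. The styling arguments (method, type, addCoef.col, tl.col) are just one reasonable choice, and aq is an illustrative name for a fresh copy of the data.

```r
library(corrplot)

aq <- datasets::airquality

# Rebuild the matrix here so the snippet runs on its own
cor_matrix <- cor(aq[, c("Ozone", "Temp", "Wind")], use = "complete.obs")

# Colored tiles for the upper triangle, with each coefficient printed in its cell
corrplot(
  cor_matrix,
  method = "color",
  type = "upper",
  addCoef.col = "black",
  tl.col = "black"
)
```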
Generate a summary table grouped by Month. It should include the count of observations and the mean ozone, mean temperature, and mean wind speed for each month.
summary_table <- airquality |>
  group_by(Month) |>
  summarise(
    Count = n(),
    Avg_ozone = mean(Ozone, na.rm = TRUE),
    Avg_temp = mean(Temp, na.rm = TRUE),
    Avg_wind = mean(Wind, na.rm = TRUE)
  )
summary_table
## # A tibble: 5 × 5
##   Month Count Avg_ozone Avg_temp Avg_wind
##   <int> <int>     <dbl>    <dbl>    <dbl>
## 1     5    31      23.6     65.5    11.6
## 2     6    30      29.4     79.1    10.3
## 3     7    31      59.1     83.9     8.94
## 4     8    31      60.0     84.0     8.79
## 5     9    30      31.4     76.9    10.2
Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?
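August has the highest average ozone level (60.0), just ahead of July (59.1). Average temperature climbs from 65.5 in May to about 84 in July and August, then falls to 76.9 in September. Wind moves the opposite way: it is highest in May (11.6), lowest in July and August (about 8.9), and picks up again in September. One likely explanation is that July and August are the hottest part of the summer, when there is more heat and sunlight to drive ozone formation and less wind to disperse it, while May is still coming out of the cooler spring weather.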
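Note that n() counts every row in a month, including days with no ozone reading, so the monthly ozone averages rest on fewer days than the Count column suggests. The sketch below adds a count of non-missing ozone days and the monthly median; aq, monthly_detail, and the column names are illustrative additions, not part of the required table.

```r
library(dplyr)

aq <- datasets::airquality

monthly_detail <- aq |>
  group_by(Month) |>
  summarise(
    days         = n(),                  # all rows, including days with no ozone reading
    ozone_days   = sum(!is.na(Ozone)),   # days that actually contribute to the ozone mean
    avg_ozone    = mean(Ozone, na.rm = TRUE),
    median_ozone = median(Ozone, na.rm = TRUE)
  )

print(monthly_detail)
```

If one month shows far fewer ozone_days than the others, its average deserves more caution when months are compared.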
Submission Requirements
Publish your document to RPubs and submit the link on Blackboard.