Objective:
You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.
Dataset
Source: Built-in R dataset airquality.
Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.
Variables (selected for this assignment):
Notes
- The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.
- If you encounter errors, check that tidyverse and corrplot are installed and loaded.
- Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
Instructions:
Complete the following tasks using R to analyze the airquality dataset. Submit an RPubs link to a report that includes code, outputs (tables and plots), and written interpretations for each task. Make sure you load the dataset using data(airquality) and install and load the tidyverse and corrplot packages.
#Load your dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
library(ggplot2)
data("airquality")
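Since the notes mention missing values, it may help to count them before computing anything. The following is a minimal sketch; the name aq is chosen here only so the example does not overwrite airquality.

```r
# Copy of the built-in data under a new name (aq is a name chosen for this sketch)
aq <- datasets::airquality

# Number of missing values in each column
na_counts <- colSums(is.na(aq))
na_counts

# Number of rows with no missing value in any column
complete_rows <- sum(complete.cases(aq))
complete_rows
```

Only Ozone and Solar.R should show nonzero counts, which is why na.rm = TRUE matters mainly for the ozone calculations.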
Tasks and Questions
Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.
#Your code for Ozone goes here
mean(airquality$Ozone,na.rm=TRUE)
## [1] 42.12931
median(airquality$Ozone,na.rm=TRUE)
## [1] 31.5
sd(airquality$Ozone,na.rm=TRUE)
## [1] 32.98788
min(airquality$Ozone,na.rm=TRUE)
## [1] 1
max(airquality$Ozone,na.rm=TRUE)
## [1] 168
#Your code for Temp goes here
mean(airquality$Temp,na.rm=TRUE)
## [1] 77.88235
median(airquality$Temp,na.rm=TRUE)
## [1] 79
sd(airquality$Temp,na.rm=TRUE)
## [1] 9.46527
min(airquality$Temp,na.rm=TRUE)
## [1] 56
max(airquality$Temp,na.rm=TRUE)
## [1] 97
#Your code for Wind goes here
mean(airquality$Wind,na.rm=TRUE)
## [1] 9.957516
median(airquality$Wind,na.rm=TRUE)
## [1] 9.7
sd(airquality$Wind,na.rm=TRUE)
## [1] 3.523001
min(airquality$Wind,na.rm=TRUE)
## [1] 1.7
max(airquality$Wind,na.rm=TRUE)
## [1] 20.7
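The fifteen separate calls above can also be collapsed into one table. This is only a sketch of one possible approach; describe is a hypothetical helper name, not a built-in function.

```r
aq <- datasets::airquality

# Hypothetical helper: the five requested statistics for one numeric vector
describe <- function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE))
}

# Apply the helper to each column; the result is a 5 x 3 matrix
stats_table <- sapply(aq[, c("Ozone", "Temp", "Wind")], describe)
round(stats_table, 2)
```

Having the three variables side by side makes the mean-versus-median comparison in the next question easier to read.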
Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?
The mean and median are close for both temperature (77.9 vs. 79) and wind (9.96 vs. 9.7), indicating that any skew is minimal. In contrast, the mean for ozone (42.1) is far above the median (31.5), indicating a pronounced right (positive) skew: a tail of high values pulls the mean upward.
The median temperature is slightly above the mean temperature, suggesting a slight left (negative) skew. The median wind speed is slightly below the mean wind speed, suggesting a slight right (positive) skew.
Relative to the mean, the standard deviation is largest for ozone (32.99 against a mean of 42.13), meaning that ozone has the highest relative variability. Temperature has the lowest (9.47 against a mean of 77.88).
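The comparison of spread and skew can be made numeric. The sketch below computes the coefficient of variation (standard deviation divided by mean) and a moment-based sample skewness. Base R has no skewness function, so skewness_moment is a hypothetical helper written for this example; other packages may use slightly different formulas.

```r
aq <- datasets::airquality
vars <- aq[, c("Ozone", "Temp", "Wind")]

# Coefficient of variation: standard deviation relative to the mean
coef_var <- sapply(vars, function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE))

# Hypothetical helper: third central moment scaled by the variance to the power 3/2
skewness_moment <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}
skew <- sapply(vars, skewness_moment)

round(rbind(coef_var, skew), 2)
```

A positive skewness value points to a longer right tail and a negative value to a longer left tail, which should agree in sign with the mean-minus-median comparison.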
Generate the histogram for Ozone.
#Your code goes here
hist(airquality$Ozone)
Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?
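The ozone distribution is unimodal and strongly right-skewed: most values fall at the low end, with a long tail toward high values. Its shape loosely resembles a Poisson distribution with a small mean. A few high values in the tail (up to 168) may be outliers, but there are too few of them to form a second mode.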
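To check the shape and the possible outliers more concretely, one option is to fix the bin width and apply the same 1.5 × IQR rule that boxplots use. This is a sketch; the 10-unit bin width and the colors are arbitrary choices made here.

```r
aq <- datasets::airquality

# Histogram with explicit 10-unit bins; hist() drops NA values on its own
ozone_hist <- hist(aq$Ozone,
                   breaks = seq(0, 170, by = 10),
                   main = "Distribution of Ozone",
                   xlab = "Ozone (ppb)",
                   col = "steelblue",
                   border = "white")

# Values lying more than 1.5 * IQR beyond the hinges (the boxplot rule)
ozone_outliers <- boxplot.stats(aq$Ozone)$out
ozone_outliers
```

Narrower bins show the concentration of low values more clearly, while the outlier list gives a rule-based answer to whether the largest readings are unusual.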
Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from Week 4. Generate a boxplot of Ozone by month_name.
# Your code here
airquality <- airquality |>
  mutate(
    month_name = case_when(
      Month == 5 ~ "May",
      Month == 6 ~ "June",
      Month == 7 ~ "July",
      Month == 8 ~ "August",
      Month == 9 ~ "September",
      TRUE ~ NA_character_
    ),
    month_name = factor(month_name,
                        levels = c("May", "June", "July", "August", "September"),
                        ordered = TRUE)
  )
boxplot(Ozone ~ month_name,
        data = airquality,
        main = "Ozone Levels by Month",
        xlab = "Month",
        ylab = "Ozone",
        las = 2)
Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?
Ozone clearly peaks in midsummer, with the highest median occurring in July. There are outliers, which may indicate atmospheric inversion events that trap ozone near ground level, or other atmospheric phenomena.
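The monthly medians read off the boxplot can be confirmed numerically. The sketch below also shows an alternative to case_when for the recoding, using the built-in month.name vector; it works on a copy named aq (a name chosen here) and is not the method the assignment asks for.

```r
aq <- datasets::airquality

# Alternative recoding: month.name is a built-in vector of English month names
aq$month_name <- factor(aq$Month, levels = 5:9, labels = month.name[5:9])

# Median ozone per month, ignoring missing values
monthly_median <- tapply(aq$Ozone, aq$month_name, median, na.rm = TRUE)
monthly_median

# Month with the highest median
names(which.max(monthly_median))
```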
Produce the scatterplot of Temp vs. Ozone, colored by Month.
# Your code goes here
ggplot(airquality, aes(x = Temp, y = Ozone, color = Month)) +
  geom_point(size = 2, alpha = 0.7) +
  labs(title = "Ozone vs Temperature by Month",
       x = "Temperature",
       y = "Ozone",
       color = "Month") +
  theme_minimal()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).
Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.
There is a visible positive correlation between ozone and temperature. Earlier months seem to cluster together at lower levels, whereas months mid-season tend to be more diffuse.
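Because Month is stored as an integer, ggplot2 maps it to a continuous color gradient. If distinct colors per month are easier to read, one option is to convert it to a factor first. This is a sketch on a copy of the data; aq_complete and month_f are names chosen here.

```r
library(ggplot2)

aq <- datasets::airquality

# Drop rows without an ozone reading so no points are silently removed
aq_complete <- aq[!is.na(aq$Ozone), ]

# Treat month as a category so each month gets its own color
aq_complete$month_f <- factor(aq_complete$Month,
                              levels = 5:9,
                              labels = month.name[5:9])

temp_ozone_plot <- ggplot(aq_complete, aes(x = Temp, y = Ozone, color = month_f)) +
  geom_point(size = 2, alpha = 0.7) +
  labs(title = "Ozone vs Temperature by Month",
       x = "Temperature (degrees F)",
       y = "Ozone (ppb)",
       color = "Month") +
  theme_minimal()

print(temp_ozone_plot)
```

With a discrete legend, clusters such as the cooler, low-ozone May points are easier to pick out than on a gradient.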
Compute and visualize the correlation matrix for Ozone, Temp, and Wind.
# Your code goes here
cor(airquality[, c("Ozone", "Temp", "Wind")], use = "complete.obs")
##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000
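The task also asks for a visualization of the matrix. A minimal sketch with corrplot follows; the display options (method, type, and the label colors) are choices made here, not requirements of the assignment.

```r
library(corrplot)

aq <- datasets::airquality

# Same correlation matrix as above, computed on rows with no missing values
cor_matrix <- cor(aq[, c("Ozone", "Temp", "Wind")], use = "complete.obs")

# Colored tiles for the upper triangle, with the coefficients printed on top
corrplot(cor_matrix,
         method = "color",
         type = "upper",
         addCoef.col = "black",
         tl.col = "black")
```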
Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.
The strongest correlation is between ozone and temperature (r ≈ 0.70), whereas the weakest in magnitude is between wind and temperature (r ≈ -0.51). Ozone is therefore more strongly correlated with temperature than with wind speed (r ≈ -0.60): ozone tends to rise as temperature rises and to fall as wind speed increases. A correlation of 0.70 corresponds to r² ≈ 0.49, so roughly half of the variation in ozone is linearly associated with temperature. Correlation alone does not show that temperature causes the change in ozone.
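One way to gauge how much of the variation in ozone is linearly associated with each variable is to square the correlation coefficient. The sketch below does this for temperature and wind; it describes association only and says nothing about cause.

```r
aq <- datasets::airquality

# Pairwise correlations with ozone, using rows where both values are present
r_temp <- cor(aq$Ozone, aq$Temp, use = "complete.obs")
r_wind <- cor(aq$Ozone, aq$Wind, use = "complete.obs")

# Squared correlation: share of variance linearly associated with each variable
r_squared <- c(Temp = r_temp^2, Wind = r_wind^2)
round(r_squared, 3)
```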
Generate the summary table grouped by Month. It should include the count, mean ozone, mean temperature, and mean wind speed for each month.
# your code goes here
airquality |>
  group_by(Month) |>
  summarise(
    count = n(),
    avg_ozone = mean(Ozone, na.rm = TRUE),
    avg_temp = mean(Temp, na.rm = TRUE),
    avg_wind = mean(Wind, na.rm = TRUE)
  )
## # A tibble: 5 × 5
##   Month count avg_ozone avg_temp avg_wind
##   <int> <int>     <dbl>    <dbl>    <dbl>
## 1     5    31      23.6     65.5    11.6 
## 2     6    30      29.4     79.1    10.3 
## 3     7    31      59.1     83.9     8.94
## 4     8    31      60.0     84.0     8.79
## 5     9    30      31.4     76.9    10.2 
Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?
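The highest average ozone is in August (60.0), only slightly above July (59.1). Average temperature rises from May (65.5) to a peak in July and August (about 84) and falls again in September, while average wind speed follows the opposite pattern, dropping from 11.6 in May to about 8.8 in August. This could be explained by heat domes trapping hot, stagnant air over urban areas in midsummer, which slows the wind, and by stronger winds accompanying the change of seasons in spring and fall.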
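Note that n() counts every row in a month, including days with no ozone reading, so the monthly ozone means rest on fewer observations than the count column suggests. A sketch that adds the number of non-missing ozone values (ozone_n is a column name chosen here):

```r
library(dplyr)

aq <- datasets::airquality

monthly_summary <- aq |>
  group_by(Month) |>
  summarise(
    count = n(),                          # all days in the month
    ozone_n = sum(!is.na(Ozone)),         # days with an ozone reading
    avg_ozone = mean(Ozone, na.rm = TRUE),
    .groups = "drop"
  )

monthly_summary
```

Months where ozone_n is much smaller than count deserve extra caution when comparing their averages.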
Submission Requirements
Publish your report to RPubs and submit the link on Blackboard.