Objective:
You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.
Dataset
Source: Built-in R dataset airquality.
Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.
Variables (selected for this assignment):
Notes
-The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.
-If you encounter errors, check that tidyverse and corrplot are installed and loaded.
-Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
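The note on missing values can be checked directly before starting the tasks. The following is a minimal sketch using only base R; the object names (na_counts, mean_with_na, mean_without_na) are illustrative and not part of the assignment. It counts the NA values in each column and shows how mean() behaves with and without na.rm = TRUE.

```r
# Load the built-in dataset
data("airquality")

# Number of missing values in each column
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Without na.rm = TRUE, a single NA makes the result NA
mean_with_na <- mean(airquality$Ozone)

# With na.rm = TRUE, missing values are dropped before averaging
mean_without_na <- mean(airquality$Ozone, na.rm = TRUE)

print(c(with_na = mean_with_na, na_removed = mean_without_na))
```

The Ozone count should match the number of rows that ggplot2 reports as removed in the plot warnings.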
Instructions:
Complete the following tasks using R to analyze the airquality dataset. Submit your Rpubs link that includes code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.
#Load your dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data("airquality")
Tasks and Questions
Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.
#Your code for Ozone goes here
ozone <- airquality |>
  select(Ozone) |>
  summarize(
    Mean_Ozone = mean(Ozone, na.rm = TRUE),
    Median_Ozone = median(Ozone, na.rm = TRUE),
    SD_Ozone = sd(Ozone, na.rm = TRUE),
    Min_Ozone = min(Ozone, na.rm = TRUE),
    Max_Ozone = max(Ozone, na.rm = TRUE)
  )
ozone
## Mean_Ozone Median_Ozone SD_Ozone Min_Ozone Max_Ozone
## 1 42.12931 31.5 32.98788 1 168
#Your code for Temp goes here
temp <- airquality |>
  select(Temp) |>
  summarize(
    Mean_Temp = mean(Temp, na.rm = TRUE),
    Median_Temp = median(Temp, na.rm = TRUE),
    SD_Temp = sd(Temp, na.rm = TRUE),
    Min_Temp = min(Temp, na.rm = TRUE),
    Max_Temp = max(Temp, na.rm = TRUE)
  )
temp
## Mean_Temp Median_Temp SD_Temp Min_Temp Max_Temp
## 1 77.88235 79 9.46527 56 97
#Your code for Wind goes here
wind <- airquality |>
  select(Wind) |>
  summarize(
    Mean_Wind = mean(Wind, na.rm = TRUE),
    Median_Wind = median(Wind, na.rm = TRUE),
    SD_Wind = sd(Wind, na.rm = TRUE),
    Min_Wind = min(Wind, na.rm = TRUE),
    Max_Wind = max(Wind, na.rm = TRUE)
  )
wind
## Mean_Wind Median_Wind SD_Wind Min_Wind Max_Wind
## 1 9.957516 9.7 3.523001 1.7 20.7
Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?
For ozone and wind, the median is below the mean (medians of 31.5 and 9.7 versus means of 42.1 and 9.96, respectively), while for temperature the median is above the mean (79 versus 77.9). Ozone's mean is well above its median, which suggests a right-skewed distribution. Wind's mean is also above its median, but the two are nearly equal, so its distribution is probably close to symmetric. Temperature's median is slightly above its mean, which suggests a mild left skew. Ozone has the largest standard deviation (33.0), followed by temperature (9.5) and wind (3.5), so ozone values are the most spread out around their mean and wind values the least. Because the three variables are measured in different units, this comparison of raw standard deviations is only approximate.
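The mean-versus-median comparison can be put on a common scale with Pearson's second skewness coefficient, 3 * (mean - median) / sd. The sketch below is one possible way to compute it in base R. The helper name pearson_skew is hypothetical and introduced here only for illustration; it is not a built-in function and not part of the Week 6 material.

```r
data("airquality")

# Hypothetical helper: Pearson's second skewness coefficient,
# 3 * (mean - median) / sd. Positive values suggest right skew,
# negative values suggest left skew, values near 0 suggest symmetry.
pearson_skew <- function(x) {
  x <- x[!is.na(x)]  # drop missing values first
  3 * (mean(x) - median(x)) / sd(x)
}

# Apply the helper to each of the three variables
skew <- sapply(airquality[c("Ozone", "Temp", "Wind")], pearson_skew)
print(round(skew, 2))
```

The sign of each coefficient should agree with the direction of skew inferred from the summary tables, and the magnitude gives a rough sense of how pronounced it is.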
Generate the histogram for Ozone.
#Your code goes here
ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 10, fill = "red", color = "black") +
  labs(title = "Histogram of Ozone Levels", x = "Ozone (ppb)", y = "Count") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).
Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?
The histogram is unimodal and right-skewed, with most ozone readings at the low end, between about 10 ppb and 20 ppb. There is one apparent outlier near 170 ppb.
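Outliers read off a histogram can be confirmed numerically. The sketch below, in base R, applies the usual 1.5 * IQR rule in two ways: with boxplot.stats(), which works from the hinges, and by hand with quantile(). The two may differ slightly because hinges and quantile()-based quartiles are not always identical. The object names are illustrative.

```r
data("airquality")

# Keep only the non-missing ozone readings
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# boxplot.stats() returns the points lying beyond the whiskers
# (more than 1.5 * IQR from the hinges by default)
ozone_outliers <- boxplot.stats(ozone)$out
print(ozone_outliers)

# The same rule written out by hand using quantile()-based quartiles
q <- quantile(ozone, c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
print(ozone[ozone > upper_fence])
```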
Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from Week 4. Generate a boxplot of Ozone by month_name.
# Your code here
ozone_month <- airquality |>
  mutate(month_name = factor(
    case_when(
      Month == 5 ~ "May",
      Month == 6 ~ "June",
      Month == 7 ~ "July",
      Month == 8 ~ "August",
      Month == 9 ~ "September"
    ),
    # Set the level order so months plot chronologically, not alphabetically
    levels = c("May", "June", "July", "August", "September")
  ))
# month_name is already a factor, so it can be mapped to x directly
ggplot(ozone_month, aes(x = month_name, y = Ozone)) +
  geom_boxplot(fill = "green", color = "black") +
  labs(title = "Monthly Ozone Levels",
       x = "Month", y = "Ozone Level (ppb)") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?
Ozone levels in May, June, and September fall in a similar, fairly narrow range of roughly 0 to 45 ppb. July and August are more spread out, roughly 10 to 125 ppb. July has the highest median ozone level, followed by August, June, September, and May. May, June, August, and September all have outliers, with September having the most (four). These outliers indicate that ozone on individual days in these months can be much higher than is typical for the month.
Produce the scatterplot of Temp vs. Ozone, colored by Month.
# Your code goes here
ggplot(ozone_month, aes(x = Temp, y = Ozone, color = month_name)) +
  geom_point(size = 3) +
  labs(title = "Temperature vs. Ozone",
       x = "Temperature (F)", y = "Ozone (ppb)",
       color = "Month") +
  theme_minimal()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).
Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.
There appears to be a positive relationship between temperature and ozone: as temperature rises, ozone levels tend to increase. May and September points cluster toward the lower left of the plot, where both temperature and ozone are lower. July and August points cluster toward the upper right, where temperatures are higher and ozone levels rise.
Compute and visualize the correlation matrix for Ozone, Temp, and Wind.
# Your code goes here
cor_matrix <- cor(
  airquality |>
    select(Ozone, Temp, Wind),
  use = "complete.obs"
)
cor_matrix
## Ozone Temp Wind
## Ozone 1.0000000 0.6983603 -0.6015465
## Temp 0.6983603 1.0000000 -0.5110750
## Wind -0.6015465 -0.5110750 1.0000000
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45, addCoef.col = "white",
         title = "Correlation Between Ozone, Temperature, and Wind",
         mar = c(0, 0, 2, 0))  # extra top margin so the title is not clipped
Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.
Ozone and temperature have the strongest correlation, a strong positive correlation of 0.70. The second strongest is between ozone and wind, a moderate negative correlation of -0.60. The weakest is between temperature and wind, a moderate negative correlation of -0.51. Ozone is therefore more strongly correlated with temperature than with wind speed: ozone tends to be higher on hotter days and lower on windier days.
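The choice of use = "complete.obs" affects more than the Ozone correlations. Of these three variables only Ozone has missing values, so complete.obs also drops those rows when correlating Temp with Wind, even though both are fully observed. The base R sketch below compares it with use = "pairwise.complete.obs", which uses every row available for each pair. The object names are illustrative.

```r
data("airquality")
vars <- airquality[c("Ozone", "Temp", "Wind")]

# Listwise deletion: any row with an NA in any column is dropped for all pairs
cor_complete <- cor(vars, use = "complete.obs")

# Pairwise deletion: each pair uses all rows where both variables are present
cor_pairwise <- cor(vars, use = "pairwise.complete.obs")

print(round(cor_complete, 3))
print(round(cor_pairwise, 3))
```

The Ozone rows of the two matrices should agree, while the Temp and Wind entry differs because pairwise deletion uses all 153 days for that pair. Neither option is automatically better; complete.obs keeps every coefficient on the same set of days, which is a reasonable default when the coefficients are compared with each other.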
Generate the summary table grouped by Month. It should include the count and the mean ozone, mean temperature, and mean wind speed per month.
# your code goes here
summary_table <- ozone_month |>
  group_by(month_name) |>
  summarise(
    Count = n(),
    Avg_Ozone = mean(Ozone, na.rm = TRUE),
    Avg_Temp = mean(Temp, na.rm = TRUE),
    Avg_Wind = mean(Wind, na.rm = TRUE)
  )
summary_table
## # A tibble: 5 × 5
## month_name Count Avg_Ozone Avg_Temp Avg_Wind
## <fct> <int> <dbl> <dbl> <dbl>
## 1 May 31 23.6 65.5 11.6
## 2 June 30 29.4 79.1 10.3
## 3 July 31 59.1 83.9 8.94
## 4 August 31 60.0 84.0 8.79
## 5 September 30 31.4 76.9 10.2
Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?
August has the highest average ozone level, at 59.96 ppb, just ahead of July (59.1 ppb). The two months with the highest average temperature, July and August, also have the lowest average wind speed, while the cooler months have higher average wind speeds. July and August fall in midsummer, when temperatures are typically higher because of stronger sunlight, which can promote ozone formation; calmer conditions in these months may also let ozone build up. May is in the spring, June is the transition from spring into summer, and September is moving into the fall, so temperatures in these months are typically cooler, which is consistent with their higher average wind speeds and lower ozone levels.
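One caveat when reading the monthly ozone averages: a row count per month includes days whose Ozone value is missing, so each mean may rest on fewer readings than the count suggests. The base R sketch below tabulates, for each month, the number of days recorded and the number of non-missing ozone readings. The data frame name coverage and its column names are illustrative.

```r
data("airquality")

# Days recorded in each month (rows, including those with missing Ozone)
days_total <- tapply(airquality$Ozone, airquality$Month, length)

# Days in each month with a non-missing Ozone reading
days_with_ozone <- tapply(airquality$Ozone, airquality$Month,
                          function(x) sum(!is.na(x)))

coverage <- data.frame(
  Month = names(days_total),
  Days = as.vector(days_total),
  Ozone_Readings = as.vector(days_with_ozone)
)
print(coverage)
```

Months with few ozone readings relative to their day count give less reliable averages, which is worth keeping in mind when ranking months by mean ozone.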
Submission Requirements
Publish it to RPubs and submit your link on Blackboard.