Objective:
You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.
Dataset
Source: Built-in R dataset airquality.
Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.
Variables (selected for this assignment):
Notes
- The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.
- If you encounter errors, check that tidyverse and corrplot are installed and loaded.
- Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
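To make the note about missing values concrete, here is a minimal base R sketch. The names `aq` and `na_counts` are hypothetical and used only for illustration; `aq` is a separate copy, so the `airquality` object used in the tasks is left untouched.

```r
# Illustrative copy of the built-in dataset (the name `aq` is an assumption)
aq <- datasets::airquality

# Count missing values in every column
na_counts <- colSums(is.na(aq))
na_counts

# Without na.rm = TRUE, the mean of a column that contains NA is itself NA
mean(aq$Ozone)
mean(aq$Ozone, na.rm = TRUE)

# use = "complete.obs" makes cor() drop rows with an NA in either variable
cor(aq$Ozone, aq$Temp, use = "complete.obs")
```

Only Ozone and Solar.R should show non-zero counts, which is why the other columns give the same results with or without na.rm = TRUE.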
Instructions:
Complete the following tasks using R to analyze the airquality dataset. Submit your RPubs link, which should include code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.
#Load your dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.5.2
## corrplot 0.95 loaded
library(ggplot2)
data("airquality")
str(airquality)
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
Tasks and Questions
Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.
#Your code for Ozone goes here
summary_stats_Ozone <- airquality |>
  select(Ozone) |>
  summarize(
    # Refer to the piped column directly instead of airquality$Ozone
    Mean_Ozone = mean(Ozone, na.rm = TRUE),
    Median_Ozone = median(Ozone, na.rm = TRUE),
    Sd_Ozone = sd(Ozone, na.rm = TRUE),
    Min_Ozone = min(Ozone, na.rm = TRUE),
    Max_Ozone = max(Ozone, na.rm = TRUE)
  )
summary_stats_Ozone
## Mean_Ozone Median_Ozone Sd_Ozone Min_Ozone Max_Ozone
## 1 42.12931 31.5 32.98788 1 168
#Your code for Temp goes here
summary_stats_Temp <- airquality |>
  select(Temp) |>
  summarize(
    # Refer to the piped column directly instead of airquality$Temp
    Mean_Temp = mean(Temp, na.rm = TRUE),
    Median_Temp = median(Temp, na.rm = TRUE),
    Sd_Temp = sd(Temp, na.rm = TRUE),
    Min_Temp = min(Temp, na.rm = TRUE),
    Max_Temp = max(Temp, na.rm = TRUE)
  )
summary_stats_Temp
## Mean_Temp Median_Temp Sd_Temp Min_Temp Max_Temp
## 1 77.88235 79 9.46527 56 97
#Your code for Wind goes here
summary_stats_Wind <- airquality |>
  select(Wind) |>
  summarize(
    # Refer to the piped column directly instead of airquality$Wind
    Mean_Wind = mean(Wind, na.rm = TRUE),
    Median_Wind = median(Wind, na.rm = TRUE),
    Sd_Wind = sd(Wind, na.rm = TRUE),
    Min_Wind = min(Wind, na.rm = TRUE),
    Max_Wind = max(Wind, na.rm = TRUE)
  )
summary_stats_Wind
## Mean_Wind Median_Wind Sd_Wind Min_Wind Max_Wind
## 1 9.957516 9.7 3.523001 1.7 20.7
Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?
Mean_Ozone = 42.12931, Mean_Temp = 77.88235, Mean_Wind = 9.957516
Median_Ozone = 31.5, Median_Temp = 79, Median_Wind = 9.7
Sd_Ozone = 32.98788, Sd_Temp = 9.46527, Sd_Wind = 3.523001
When I look at the mean and median for Ozone, they are clearly different: the mean (42.1) is well above the median (31.5). This suggests the distribution is skewed to the right, with a few very high readings pulling the mean upward. The standard deviation (about 33) is large compared with the mean, so ozone readings vary a lot from day to day.
When I look at the mean and median for Temperature, they are close to each other (77.9 and 79), with the mean slightly lower. This suggests a roughly symmetric distribution with, at most, a slight left skew. The standard deviation (about 9.5) is small compared with the 56 to 97 range, so most days sit fairly close to the mean.
When I look at the mean and median for Wind, they are very close (9.96 and 9.7), which suggests a roughly symmetric distribution. The standard deviation (about 3.5) indicates moderate variability around the mean.
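The comparison above can be put into numbers with a short base R sketch. The names `aq` and `spread_check` are hypothetical, and the coefficient of variation (standard deviation divided by the mean) is added here only as a rough guide to relative spread; it is not one of the statistics the task asks for. It is least meaningful for Temp, because the zero point of the Fahrenheit scale is arbitrary.

```r
# Illustrative copy of the built-in dataset (the name `aq` is an assumption)
aq <- datasets::airquality

# For each variable: mean, median, their gap, and sd / mean
spread_check <- sapply(aq[c("Ozone", "Temp", "Wind")], function(x) {
  m  <- mean(x, na.rm = TRUE)
  md <- median(x, na.rm = TRUE)
  s  <- sd(x, na.rm = TRUE)
  c(mean = m, median = md, mean_minus_median = m - md, cv = s / m)
})
round(spread_check, 2)
```

A clearly positive mean_minus_median points toward right skew, a clearly negative one toward left skew, and a value near zero toward rough symmetry.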
Generate the histogram for Ozone.
#Your code goes here
# ggplot2 is already attached by library(tidyverse)
ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 20, fill = "#1f77b4", color = "black") +
  labs(title = "Histogram of Ozone Levels", x = "Ozone (ppb)", y = "Frequency") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).
Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?
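The ozone distribution is skewed to the right: its main peak is at the low end, and a long tail stretches toward high values. There are outliers on the right side of the graph, including an isolated bar above 150 ppb (the maximum is 168 ppb).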
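One way to check the visual impression of outliers is the boxplot rule, which flags points more than 1.5 times the interquartile range beyond the hinges. The sketch below uses base R's `boxplot.stats()`; the names `aq` and `ozone_out` are hypothetical, and this rule is only one of several reasonable definitions of an outlier.

```r
# Illustrative copy of the built-in dataset (the name `aq` is an assumption)
aq <- datasets::airquality

# boxplot.stats() omits NA values and returns points beyond the whiskers in $out
ozone_out <- boxplot.stats(aq$Ozone)$out
ozone_out

# Days on which those readings occurred
aq[which(aq$Ozone %in% ozone_out), c("Month", "Day", "Ozone")]
```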
Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.
# Your code here
airquality <- airquality |>
  mutate(
    # Store the names as a factor so the boxes appear in calendar order,
    # not alphabetical order
    month_name = factor(
      case_when(
        Month == 5 ~ "May",
        Month == 6 ~ "June",
        Month == 7 ~ "July",
        Month == 8 ~ "August",
        Month == 9 ~ "September"
      ),
      levels = c("May", "June", "July", "August", "September")
    )
  )
ggplot(airquality, aes(x = month_name, y = Ozone)) +
  geom_boxplot(fill = "purple", outlier.color = "red") +
  labs(title = "Ozone Levels by Month",
       x = "Month",
       y = "Ozone (ppb)") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?
Ozone levels vary a lot by month, with monthly medians ranging from roughly 20 ppb to 60 ppb. The month with the highest median ozone is July. There are outliers in May, June, August, and September. These are days with unusually high ozone compared with the rest of the month, and they can pull the monthly mean above the median.
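The medians read off the boxplot can be confirmed numerically. This is a minimal base R sketch using `tapply()`; the names `aq` and `ozone_median` are hypothetical, and `month.name` is R's built-in vector of English month names.

```r
# Illustrative copy of the built-in dataset (the name `aq` is an assumption)
aq <- datasets::airquality

# Median ozone per month (Month is coded 5 to 9), ignoring missing readings
ozone_median <- tapply(aq$Ozone, aq$Month, median, na.rm = TRUE)
names(ozone_median) <- month.name[as.integer(names(ozone_median))]
ozone_median

# Month with the highest median
names(which.max(ozone_median))
```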
Produce the scatterplot of Temp vs. Ozone, colored by Month.
# Your code goes here
# month_name was already added to airquality in the boxplot task, so it is reused here
ggplot(airquality, aes(x = Temp, y = Ozone, color = month_name)) +
  geom_point(size = 3) +
  labs(title = "Scatterplot of Temperature vs. Ozone",
       x = "Temperature (°F)", y = "Ozone (ppb)",
       color = "Month") +
  theme_minimal()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).
Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.
Yes, there is a visible positive relationship between temperature and ozone: ozone levels tend to rise as temperature rises. The months also overlap. Points from June, July, August, and September cluster together where the temperature is around 81-82 degrees, while the May points mostly sit at lower temperatures and lower ozone levels. Across the months, temperature and ozone increase into the summer and then come back down in September, which is why the September points overlap with those from earlier months.
Compute and visualize the correlation matrix for Ozone, Temp, and Wind.
# Your code goes here
cor_matrix <- cor(
  airquality |>
    select(Ozone, Temp, Wind),
  use = "complete.obs"
)
cor_matrix
## Ozone Temp Wind
## Ozone 1.0000000 0.6983603 -0.6015465
## Temp 0.6983603 1.0000000 -0.5110750
## Wind -0.6015465 -0.5110750 1.0000000
# mar leaves room at the top so the title is not clipped
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 40, addCoef.col = "black",
         title = "Correlation Matrix of Ozone, Temp and Wind",
         mar = c(0, 0, 2, 0))
Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.
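The strongest correlation is between Ozone and Temperature (r = 0.70), and the weakest is between Temperature and Wind (r = -0.51). Ozone is more strongly correlated with temperature than with wind speed (r = -0.60).
This suggests that ozone levels tend to be higher on hotter days and lower on windier days. Wind speed is not unrelated to ozone: its correlation is negative and moderately strong. These are correlations, so they show that the variables move together but do not prove that one causes the other.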
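Strength is judged by the absolute value of r, so a negative coefficient can be stronger than a positive one. The base R sketch below lists each pair once and sorts by absolute value; the names `aq`, `r_mat`, `idx`, and `pairs_ranked` are hypothetical.

```r
# Illustrative copy of the built-in dataset (the name `aq` is an assumption)
aq <- datasets::airquality

r_mat <- cor(aq[c("Ozone", "Temp", "Wind")], use = "complete.obs")

# Keep each pair once (upper triangle), then sort by absolute strength
idx <- which(upper.tri(r_mat), arr.ind = TRUE)
pairs_ranked <- data.frame(
  var1 = rownames(r_mat)[idx[, "row"]],
  var2 = colnames(r_mat)[idx[, "col"]],
  r    = r_mat[idx]
)
pairs_ranked <- pairs_ranked[order(-abs(pairs_ranked$r)), ]
pairs_ranked
```

The first row is the strongest pair and the last row is the weakest.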
Generate the summary table grouped by Month. It should include the count and the mean ozone, mean temperature, and mean wind speed per month.
# your code goes here
summary_table <- airquality |>
  group_by(Month) |>
  summarize(
    count = n(),
    mean_Ozone = mean(Ozone, na.rm = TRUE),
    mean_Temp = mean(Temp, na.rm = TRUE),
    mean_Wind = mean(Wind, na.rm = TRUE)
  )
summary_table
## # A tibble: 5 × 5
## Month count mean_Ozone mean_Temp mean_Wind
## <int> <int> <dbl> <dbl> <dbl>
## 1 5 31 23.6 65.5 11.6
## 2 6 30 29.4 79.1 10.3
## 3 7 31 59.1 83.9 8.94
## 4 8 31 60.0 84.0 8.79
## 5 9 30 31.4 76.9 10.2
Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?
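The month with the highest average ozone level is August (60.0), just ahead of July (59.1). Average temperature rises from May (65.5) to August (84.0) and then drops in September (76.9). Average wind speed moves the opposite way: it falls from May (11.6) to August (8.79) and then picks up again in September (10.2). Hot, calm summer days tend to favor ozone formation and let it build up, which may explain why ozone peaks in July and August.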
Submission Requirements
Publish it to RPubs and submit your link on Blackboard.