Objective:
You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.
Dataset
Source: Built-in R dataset airquality.
Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.
Variables (selected for this assignment):
Notes
- The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.
- If you encounter errors, check that tidyverse and corrplot are installed and loaded.
- Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
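To see why the missing-value handling in the first note matters, the following minimal base R sketch counts the missing values per column and shows what happens without na.rm = TRUE. The object names na_counts and n_complete are illustrative only and are not part of the assignment.

```r
data("airquality")

# Number of missing values in each column
na_counts <- colSums(is.na(airquality))
na_counts

# Without na.rm = TRUE, a single NA makes the whole result NA
mean(airquality$Ozone)
mean(airquality$Ozone, na.rm = TRUE)

# Rows that cor(..., use = "complete.obs") would keep for these three columns
n_complete <- sum(complete.cases(airquality[, c("Ozone", "Temp", "Wind")]))
n_complete
```

The 37 missing Ozone values found here are the same 37 rows that ggplot2 reports removing in its warnings.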
Instructions:
Complete the following tasks using R to analyze the airquality dataset. Submit your Rpubs link that includes code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.
#Load your dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data("airquality")
Tasks and Questions
Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.
summary_ozone <- airquality |>
  summarise(
    mean_ozone = mean(Ozone, na.rm = TRUE),
    median_ozone = median(Ozone, na.rm = TRUE),
    sd_ozone = sd(Ozone, na.rm = TRUE),
    min_ozone = min(Ozone, na.rm = TRUE),
    max_ozone = max(Ozone, na.rm = TRUE)
  )
summary_ozone
##   mean_ozone median_ozone sd_ozone min_ozone max_ozone
## 1   42.12931         31.5 32.98788         1       168
summary_temp <- airquality |>
  summarise(
    mean_temp = mean(Temp, na.rm = TRUE),
    median_temp = median(Temp, na.rm = TRUE),
    sd_temp = sd(Temp, na.rm = TRUE),
    min_temp = min(Temp, na.rm = TRUE),
    max_temp = max(Temp, na.rm = TRUE)
  )
summary_temp
##   mean_temp median_temp sd_temp min_temp max_temp
## 1  77.88235          79 9.46527       56       97
summary_wind <- airquality |>
  summarise(
    mean_wind = mean(Wind, na.rm = TRUE),
    median_wind = median(Wind, na.rm = TRUE),
    sd_wind = sd(Wind, na.rm = TRUE),
    min_wind = min(Wind, na.rm = TRUE),
    max_wind = max(Wind, na.rm = TRUE)
  )
summary_wind
##   mean_wind median_wind  sd_wind min_wind max_wind
## 1  9.957516         9.7 3.523001      1.7     20.7
Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?
For ozone, the mean (42.13) is well above the median (31.5), which suggests a right-skewed distribution. For temperature, the median (79) is slightly above the mean (77.88), indicating mild left skew. The mean (9.96) and median (9.7) for wind are very close, suggesting a roughly symmetric distribution. Ozone has the largest standard deviation (32.99), so its values are the most spread out around the mean. Temperature varies less (SD 9.47), and wind has the smallest standard deviation of the three (3.52). Because the three variables are measured in different units, these standard deviations describe the spread within each variable rather than a direct ranking across them.
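The mean-versus-median comparison can be put on a common scale with Pearson's second skewness coefficient, 3 × (mean − median) / sd. This was not part of the assignment; it is a rough sketch, and the helper name pearson_skew is made up for illustration. Positive values point to right skew, negative values to left skew, and values near zero to approximate symmetry.

```r
data("airquality")

# Hypothetical helper: Pearson's second skewness coefficient
pearson_skew <- function(x) {
  3 * (mean(x, na.rm = TRUE) - median(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Apply it to the three variables used in this task
skew <- sapply(airquality[c("Ozone", "Temp", "Wind")], pearson_skew)
round(skew, 2)
```

Because the coefficient divides by the standard deviation, it is unit-free and can be compared across Ozone, Temp, and Wind, unlike the raw standard deviations.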
Generate the histogram for Ozone.
ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 5, fill = "purple", color = "black") +
  labs(title = "Histogram of Ozone Concentration", x = "Ozone Concentration (PPB)") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).
Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?
The ozone distribution is right-skewed, with most values at low concentrations and a long tail toward high values. There is an outlier near the maximum ozone concentration of 168 ppb.
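A visual impression of outliers can be cross-checked numerically. The sketch below applies the common 1.5 × IQR rule, the same rule geom_boxplot() uses for its whiskers, to the non-missing Ozone values. The names upper_fence and flagged are illustrative, and the 1.5 multiplier is a convention rather than a strict definition of an outlier.

```r
data("airquality")

# Drop missing values before computing quartiles
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Upper fence: third quartile plus 1.5 times the interquartile range
q <- quantile(ozone, c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * IQR(ozone)

# Observations above the fence are flagged as potential outliers
flagged <- sort(ozone[ozone > upper_fence])
upper_fence
flagged
```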
Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from Week 4. Generate a boxplot of Ozone by month_name.
ozone_months <- airquality |>
  mutate(
    month_name = case_when(
      Month == 5 ~ "May",
      Month == 6 ~ "June",
      Month == 7 ~ "July",
      Month == 8 ~ "August",
      Month == 9 ~ "September"
    ),
    # Fix the level order so months plot chronologically, not alphabetically
    month_name = factor(month_name, levels = c("May", "June", "July", "August", "September"))
  )
ggplot(ozone_months, aes(x = month_name, y = Ozone)) +
  geom_boxplot() +
  labs(title = "Boxplot of Ozone Concentration by Month",
       x = "Months", y = "Ozone Concentration (PPB)") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?
Ozone levels differ by month. Most values fall between 0 and 75 ppb, but July and August run noticeably higher than May, June, and September. July has the highest median ozone level. There are outliers in May, June, August, and September, which indicates that ozone on individual days in these months can reach unusually high values compared with the rest of the month.
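Medians read off a boxplot can be confirmed directly. This base R sketch computes the median Ozone value per month and reports which month is highest. The name monthly_median is illustrative, and month.name is R's built-in vector of English month names.

```r
data("airquality")

# Median ozone per month, ignoring missing values
monthly_median <- tapply(airquality$Ozone, airquality$Month, median, na.rm = TRUE)

# Replace the numeric month labels (5-9) with month names
names(monthly_median) <- month.name[as.integer(names(monthly_median))]
monthly_median

# Month with the highest median
top_month <- names(which.max(monthly_median))
top_month
```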
Produce the scatterplot of Temp vs. Ozone, colored by Month.
ggplot(ozone_months, aes(x = Temp, y = Ozone, color = month_name)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Scatterplot of Temp vs. Ozone Concentration",
    x = "Temp",
    y = "Ozone Concentration (PPB)",
    color = "Month"
  ) +
  theme_minimal()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).
Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.
There is a visible positive relationship between temperature and ozone: as temperature increases, ozone levels tend to rise. The data points for each month also appear to cluster together. For example, the points for May are grouped at ozone levels between 0 and 50 ppb with temperatures between 60 and 70 degrees Fahrenheit, while the points for July cluster around 80 to 90 degrees Fahrenheit with higher ozone levels, between 50 and 150 ppb.
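One rough way to quantify the trend seen in the scatterplot is a simple linear regression of Ozone on Temp. This goes beyond the descriptive tools the assignment asks for, and it assumes a straight-line relationship that the data may not follow exactly, so treat it as a sketch. The name fit is illustrative. lm() drops rows with missing Ozone by default.

```r
data("airquality")

# Straight-line fit: expected change in ozone (ppb) per degree Fahrenheit
fit <- lm(Ozone ~ Temp, data = airquality)
coef(fit)

# Mean temperature per month, to check whether months separate by temperature
tapply(airquality$Temp, airquality$Month, mean)
```

A positive Temp coefficient is consistent with ozone rising as temperature rises.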
Compute and visualize the correlation matrix for Ozone, Temp, and Wind.
cor_matrix <- cor(
  airquality |>
    select(Ozone, Temp, Wind),
  use = "complete.obs"
)
cor_matrix
##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
         tl.col = "black", addCoef.col = "black",
         title = "Correlation Matrix of Numeric Variables",
         mar = c(0, 0, 2, 0))  # extra top margin so the title is not clipped
Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.
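The strongest correlation is between ozone and temperature (r = 0.70), and the weakest is between wind and temperature (r = -0.51). Ozone is more strongly correlated with temperature (0.70) than with wind speed (-0.60). These values suggest a fairly strong positive relationship between temperature and ozone, and moderate negative relationships between wind and both ozone and temperature: windier days tend to have lower ozone and lower temperatures.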
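With only three variables the strongest and weakest pairs are easy to spot by eye, but the comparison can also be done programmatically. The sketch below lists each pair once and sorts by absolute correlation. The names pair_table and ranked are illustrative, and base R indexing is used here instead of select() so the example needs no packages.

```r
data("airquality")

cor_matrix <- cor(airquality[, c("Ozone", "Temp", "Wind")], use = "complete.obs")

# Keep each pair once by taking positions in the upper triangle
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r = cor_matrix[idx]
)

# Sort from strongest to weakest by absolute value, ignoring sign
ranked <- pair_table[order(-abs(pair_table$r)), ]
ranked
```

Sorting on the absolute value matters because a correlation of -0.60 is stronger than one of 0.51 even though it is numerically smaller.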
Generate the summary table grouped by Month. It should include the count, mean ozone, mean temperature, and mean wind speed per month.
summary_table <- ozone_months |>
  group_by(month_name) |>
  summarise(
    Count = n(),
    avg_ozone = mean(Ozone, na.rm = TRUE),
    avg_wind = mean(Wind, na.rm = TRUE),
    avg_temp = mean(Temp, na.rm = TRUE)
  )
summary_table
## # A tibble: 5 × 5
##   month_name Count avg_ozone avg_wind avg_temp
##   <fct>      <int>     <dbl>    <dbl>    <dbl>
## 1 May           31      23.6    11.6      65.5
## 2 June          30      29.4    10.3      79.1
## 3 July          31      59.1     8.94     83.9
## 4 August        31      60.0     8.79     84.0
## 5 September     30      31.4    10.2      76.9
Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?
August has the highest average ozone level, at about 60.0 ppb, just ahead of July (59.1 ppb). Temperature and wind speed move in opposite directions across the months: May is the coolest (65.5 °F) and windiest (11.6 mph) month, while July and August are the warmest (83.9 and 84.0 °F) and calmest (8.94 and 8.79 mph). One environmental factor that might explain these differences is the stronger sunlight and heat of midsummer, which promote ozone formation, while calmer winds allow ozone to build up rather than disperse.
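One caveat when reading the summary table: Count = n() counts every day in the month, including days with a missing Ozone value, so each avg_ozone is based on fewer observations than Count suggests. The base R sketch below compares the two counts per month. The data frame name coverage and its column names are illustrative.

```r
data("airquality")

# Days per month versus days with a recorded (non-missing) Ozone value
coverage <- data.frame(
  Month = sort(unique(airquality$Month)),
  n_days = as.vector(table(airquality$Month)),
  n_ozone = as.vector(tapply(!is.na(airquality$Ozone), airquality$Month, sum))
)
coverage
```

Months with few recorded Ozone values have less reliable monthly averages, which is worth keeping in mind when comparing them.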
Submission Requirements
Publish it to Rpubs and submit your link on Blackboard.