Objective:
You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.
Dataset
Source: Built-in R dataset airquality.
Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.
Variables (selected for this assignment):
Notes
- The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.
- If you encounter errors, check that tidyverse and corrplot are installed and loaded.
- Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
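To see how much data the missing-value handling actually affects, it can help to count the NA values per column first. The following is a minimal base R sketch; the object names na_counts and n_complete are illustrative and not part of the assignment.

```r
# Count missing values in each column of the built-in airquality data
data("airquality")
na_counts <- colSums(is.na(airquality))
na_counts

# Rows with no missing value in Ozone, Temp, or Wind; these are the rows
# that use = "complete.obs" keeps when correlating those three variables
n_complete <- sum(complete.cases(airquality[, c("Ozone", "Temp", "Wind")]))
n_complete
```

The Ozone count should match the number of rows that ggplot2 reports as removed in the plot warnings further down.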
Instructions:
Complete the following tasks using R to analyze the airquality dataset. Submit your Rpubs link that includes code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.
#Load your dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data("airquality")
Tasks and Questions
Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.
#Your code for Ozone goes here
summary_Ozonestats1 <- airquality |>
  summarise(
    mean_Ozone1 = mean(Ozone, na.rm = TRUE),
    median_Ozone1 = median(Ozone, na.rm = TRUE),
    sd_Ozone1 = sd(Ozone, na.rm = TRUE),
    min_Ozone1 = min(Ozone, na.rm = TRUE),
    max_Ozone1 = max(Ozone, na.rm = TRUE)
  )
summary_Ozonestats1
##   mean_Ozone1 median_Ozone1 sd_Ozone1 min_Ozone1 max_Ozone1
## 1    42.12931          31.5  32.98788          1        168
#Your code for Temp goes here
summary_tempstats <- airquality |>
  summarise(
    mean_temp = mean(Temp, na.rm = TRUE),
    median_temp = median(Temp, na.rm = TRUE),
    sd_temp = sd(Temp, na.rm = TRUE),
    min_temp = min(Temp, na.rm = TRUE),
    max_temp = max(Temp, na.rm = TRUE)
  )
summary_tempstats
##   mean_temp median_temp sd_temp min_temp max_temp
## 1  77.88235          79 9.46527       56       97
#Your code for Wind goes here
summary_Windstats <- airquality |>
  summarise(
    mean_Wind = mean(Wind, na.rm = TRUE),
    median_Wind = median(Wind, na.rm = TRUE),
    sd_Wind = sd(Wind, na.rm = TRUE),
    min_Wind = min(Wind, na.rm = TRUE),
    max_Wind = max(Wind, na.rm = TRUE)
  )
summary_Windstats
##   mean_Wind median_Wind  sd_Wind min_Wind max_Wind
## 1  9.957516         9.7 3.523001      1.7     20.7
Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?
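Ozone has the largest gap between mean (42.1) and median (31.5), so it is the most skewed; a mean above the median indicates a right skew. The mean and median are fairly close for Temp (77.9 vs. 79), which is slightly left-skewed or roughly symmetric, and for Wind (9.96 vs. 9.7), which is essentially symmetric. Judging by standard deviation, Ozone has the most variability (32.99), Temp much less (9.47), and Wind the least (3.52).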
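One way to put numbers on the mean-versus-median comparison is Pearson's second skewness coefficient, 3 × (mean − median) / SD, alongside the coefficient of variation (SD / mean), which lets spread be compared across variables measured in different units. The sketch below is an illustration in base R; the helper describe_var and the object skew_table are hypothetical names, not part of the assignment.

```r
data("airquality")

# Hypothetical helper: summary measures for one numeric vector
describe_var <- function(x) {
  m   <- mean(x, na.rm = TRUE)
  med <- median(x, na.rm = TRUE)
  s   <- sd(x, na.rm = TRUE)
  c(mean = m,
    median = med,
    sd = s,
    pearson_skew = 3 * (m - med) / s,  # > 0 suggests right skew, < 0 left skew
    cv = s / m)                        # unit-free measure of spread
}

# One column of results per variable
skew_table <- sapply(airquality[, c("Ozone", "Temp", "Wind")], describe_var)
round(skew_table, 3)
```

Because the coefficient of variation scales the SD by the mean, it can rank the variables differently from the raw standard deviations.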
Generate the histogram for Ozone.
#Your code goes here
library(ggplot2)
ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 20, fill = "green", color = "black") +
  labs(title = "Histogram of Ozone Levels", x = "Ozone", y = "Count") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).
Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?
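The distribution is skewed right: most observations fall at low ozone values, with a long tail reaching the maximum of 168.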
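The question also asks about outliers. A common convention, and the one boxplots use, flags values that lie more than 1.5 × IQR beyond the quartiles. A minimal sketch with base R's boxplot.stats() follows; ozone_outliers is an illustrative name.

```r
data("airquality")

# boxplot.stats() ignores NA values and returns the points lying more than
# 1.5 * IQR beyond the hinges in its $out element
ozone_outliers <- boxplot.stats(airquality$Ozone)$out
ozone_outliers
```

Any values returned here are candidates for the "unusual features" part of the answer; whether they are errors or genuine high-ozone days is a separate judgment.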
Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.
# Your code here
# Month is stored as an integer, so compare against numbers rather than strings
aq_month_name <- airquality |>
  mutate(month_name = case_when(
    Month == 5 ~ "May",
    Month == 6 ~ "June",
    Month == 7 ~ "July",
    Month == 8 ~ "August",
    Month == 9 ~ "September"
  ))
ggplot(aq_month_name, aes(x = month_name, y = Ozone)) +
  geom_boxplot(fill = c("red", "green", "blue", "orange", "purple")) +
  labs(title = "Ozone Levels by Month",
       x = "Month", y = "Ozone Level") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?
Median ozone is similar in June and September, with more outliers in September. It is slightly lower in May and higher in July and August. Outliers appear in May, June, August, and September.
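Because month_name is stored as text, plots and grouped tables order the months alphabetically rather than by calendar. One possible alternative to case_when, sketched here, indexes R's built-in month.name vector and stores the result as a factor with calendar-ordered levels; the same object then gives the median ozone per month directly. The names aq_ordered and median_by_month are illustrative.

```r
library(dplyr)

data("airquality")

# month.name is a built-in character vector ("January", ..., "December");
# indexing it by the numeric Month gives the name, and the factor levels
# fix the calendar order for plots and tables
aq_ordered <- airquality |>
  mutate(month_name = factor(month.name[Month], levels = month.name[5:9]))

# Median ozone per month, listed in calendar order
median_by_month <- aq_ordered |>
  group_by(month_name) |>
  summarise(median_ozone = median(Ozone, na.rm = TRUE))
median_by_month
```

Passing aq_ordered to the same ggplot() call would draw the boxes from May through September.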
Produce the scatterplot of Temp vs. Ozone, colored by Month.
# Your code goes here
ggplot(aq_month_name, aes(x = Temp, y = Ozone, color = month_name)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Temp vs. Ozone",
    x = "Temp",
    y = "Ozone",
    color = "Month"
  ) +
  scale_color_discrete() +
  theme_minimal()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).
Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.
It looks like higher temperatures tend to go with higher ozone levels. The months also appear to cluster together to some degree, with some months showing more variation than others.
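The visual impression of an upward trend can be checked with a simple linear fit. This is only a sketch under the assumption that a straight line is a reasonable first approximation; the scatterplot may well suggest curvature, and fit is an illustrative name.

```r
data("airquality")

# Simple linear regression of Ozone on Temp; lm() drops rows with NA by default
fit <- lm(Ozone ~ Temp, data = airquality)

coef(fit)               # intercept and slope (change in ozone per degree of Temp)
summary(fit)$r.squared  # share of ozone variance explained by Temp
```

A positive slope is consistent with higher ozone on warmer days. For a one-predictor model, the R-squared equals the squared correlation between the two variables.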
Compute and visualize the correlation matrix for Ozone, Temp, and Wind.
# Your code goes here
cor_matrix <- cor(
  aq_month_name |>
    select(Ozone, Temp, Wind),
  use = "complete.obs"
)
cor_matrix
##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000
Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.
Ozone and Temp have the strongest correlation (r = 0.70), and it is positive: as temperature goes up, ozone goes up. Temp and Wind have the weakest correlation (r = -0.51), and it is negative: as temperature goes up, wind speed goes down. Ozone is more strongly correlated with temperature (0.70) than with wind speed (-0.60).
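The task asks for the matrix to be visualized as well as computed. One possible way, sketched here on the assumption that the corrplot package is installed, is corrplot() with colored tiles and printed coefficients. The argument choices (method = "color", addCoef.col, tl.col) are illustrative styling, not requirements of the assignment.

```r
data("airquality")

# Pairwise correlations using only rows complete in all three variables
cor_matrix <- cor(airquality[, c("Ozone", "Temp", "Wind")],
                  use = "complete.obs")

# Draw the matrix only if corrplot is installed
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_matrix,
                     method = "color",       # colored tiles
                     addCoef.col = "black",  # print the coefficient in each tile
                     tl.col = "black")       # color of the variable labels
}
```

Printing the coefficients on the tiles keeps the exact values visible, which makes it easier to name the strongest and weakest pairs.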
Generate the summary table grouped by Month. It should include count, average mean of ozone, average mean of temperature, and average mean of wind per month.
Is "average mean" a mean of means or just the mean? The code below computes the plain mean of each variable within each month.
# your code goes here
summary_table <- aq_month_name |>
  group_by(month_name) |>
  summarise(
    Count = n(),
    Avg_ozone = mean(Ozone, na.rm = TRUE),
    Avg_temperature = mean(Temp, na.rm = TRUE),
    Avg_wind = mean(Wind, na.rm = TRUE)
  )
summary_table
## # A tibble: 5 × 5
##   month_name Count Avg_ozone Avg_temperature Avg_wind
##   <chr>      <int>     <dbl>           <dbl>    <dbl>
## 1 August        31      60.0            84.0     8.79
## 2 July          31      59.1            83.9     8.94
## 3 June          30      29.4            79.1    10.3
## 4 May           31      23.6            65.5    11.6
## 5 September     30      31.4            76.9    10.2
Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?
August has the highest average ozone level (60.0). Temperature and wind speed appear to have an inverse relationship across months: higher wind usually goes with lower temperature.
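To read the answer straight off the table, the monthly summary can be sorted by average ozone, and the month-level relationship between temperature and wind can be summarized with a correlation of the five monthly means. This is a sketch only: five points are very few, so the correlation is descriptive at best, and the names monthly and temp_wind_cor are illustrative.

```r
library(dplyr)

data("airquality")

# Same per-month means as the summary table, sorted from highest to lowest ozone
monthly <- airquality |>
  group_by(Month) |>
  summarise(
    Count = n(),
    Avg_ozone = mean(Ozone, na.rm = TRUE),
    Avg_temperature = mean(Temp, na.rm = TRUE),
    Avg_wind = mean(Wind, na.rm = TRUE)
  ) |>
  arrange(desc(Avg_ozone))
monthly

# Correlation between the monthly mean temperature and monthly mean wind speed
temp_wind_cor <- cor(monthly$Avg_temperature, monthly$Avg_wind)
temp_wind_cor
```

The first row of monthly is the month with the highest average ozone, and a negative temp_wind_cor matches the inverse pattern described in the answer.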
Submission Requirements
Publish your report to RPubs and submit the link on Blackboard.