---
title: "HW6"
author: "Enter your name here"
output: html_document
---
Objective:
You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.
Dataset
Source: Built-in R dataset airquality.
Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.
Variables (selected for this assignment):
Notes
- The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.
- If you encounter errors, check that tidyverse and corrplot are installed and loaded.
- Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
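As a quick illustration of the first note, the sketch below counts the missing values per column and shows why na.rm = TRUE is needed. It uses only base R; the names na_counts and mean_ozone are chosen here for illustration.

```r
# Load the built-in dataset
data("airquality")

# Number of missing values in each column
na_counts <- colSums(is.na(airquality))
na_counts

# Without na.rm = TRUE, a single NA makes the result NA
mean(airquality$Ozone)

# With na.rm = TRUE, the missing days are dropped before averaging
mean_ozone <- mean(airquality$Ozone, na.rm = TRUE)
mean_ozone
```

Only Ozone and Solar.R should show nonzero counts, which is why the Temp and Wind summaries are unaffected by the na.rm argument.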
Instructions:
Complete the following tasks using R to analyze the airquality dataset. Submit your RPubs link that includes code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.
#Load your dataset
library(tidyverse)
library(corrplot)
data("airquality")
Tasks and Questions
Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.
#Your code for Ozone goes here
summary_ozone <- airquality |>
  summarise(
    mean_ozone = mean(Ozone, na.rm = TRUE),
    median_ozone = median(Ozone, na.rm = TRUE),
    sd_ozone = sd(Ozone, na.rm = TRUE),
    min_ozone = min(Ozone, na.rm = TRUE),
    max_ozone = max(Ozone, na.rm = TRUE))  # max is required by the task
summary_ozone
##   mean_ozone median_ozone sd_ozone min_ozone max_ozone
## 1   42.12931         31.5 32.98788         1       168
#Your code for Temp goes here
summary_temp <- airquality |>
  summarise(
    mean_temp = mean(Temp, na.rm = TRUE),
    median_temp = median(Temp, na.rm = TRUE),
    sd_temp = sd(Temp, na.rm = TRUE),
    min_temp = min(Temp, na.rm = TRUE),
    max_temp = max(Temp, na.rm = TRUE))  # max is required by the task
summary_temp
##   mean_temp median_temp sd_temp min_temp max_temp
## 1  77.88235          79 9.46527       56       97
#Your code for Wind goes here
summary_wind <- airquality |>
  summarise(
    mean_wind = mean(Wind, na.rm = TRUE),
    median_wind = median(Wind, na.rm = TRUE),
    sd_wind = sd(Wind, na.rm = TRUE),
    min_wind = min(Wind, na.rm = TRUE),
    max_wind = max(Wind, na.rm = TRUE))  # max is required by the task
summary_wind
##   mean_wind median_wind  sd_wind min_wind max_wind
## 1  9.957516         9.7 3.523001      1.7     20.7
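Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?
Ozone’s mean (42.1) is well above its median (31.5), which indicates a right-skewed distribution with some high values. Temperature’s mean (77.9) and median (79) are close, so it is fairly symmetric, with the mean slightly below the median. Wind’s mean (9.96) and median (9.7) are also close, so it is not strongly skewed. The standard deviation measures the typical distance of observations from the mean: Ozone has high variability (SD 33.0 against a mean of 42.1), Wind shows moderate variability (SD 3.5 against a mean of 10.0), and Temperature varies little relative to its mean (SD 9.5 against a mean of 77.9).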
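The mean-versus-median comparison can be put on a common scale across the three variables. The sketch below is one way to do that, not part of the assignment: describe_var() is a hypothetical helper that reports the coefficient of variation (SD divided by mean) and the mean-median gap expressed in SD units.

```r
data("airquality")

# describe_var() is a hypothetical helper introduced for illustration
describe_var <- function(x) {
  x <- x[!is.na(x)]                      # drop missing values first
  c(mean      = mean(x),
    median    = median(x),
    sd        = sd(x),
    cv        = sd(x) / mean(x),         # spread relative to the mean
    gap_in_sd = (mean(x) - median(x)) / sd(x))  # > 0 hints at right skew
}

# One column per variable, one row per statistic
stats_tbl <- sapply(airquality[c("Ozone", "Temp", "Wind")], describe_var)
round(stats_tbl, 2)
```

A clearly positive gap_in_sd, as for Ozone, is consistent with right skew, while values near zero suggest a roughly symmetric distribution. This is a rough indicator only and does not replace looking at the histogram.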
Generate the histogram for Ozone.
#Your code goes here
library(ggplot2)
ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 5, fill = "purple", color = "pink") +
  labs(title = "Histogram for Ozone levels", x = "Ozone Levels") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).
Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?
The ozone level distribution is unimodal, right-skewed, and contains possible outliers at higher values.
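To make "possible outliers" concrete, one common convention is the 1.5 × IQR rule, which is also what boxplots use for flagging points. The sketch below applies it to Ozone with base R; the variable names are illustrative, and the rule itself is a convention rather than a formal test.

```r
data("airquality")

# Work with the non-missing ozone readings only
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Quartiles and interquartile range
q <- quantile(ozone, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Fences at 1.5 * IQR beyond the quartiles
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

# Readings outside the fences are flagged as potential outliers
outliers <- ozone[ozone < lower_fence | ozone > upper_fence]
sort(outliers)
```

Because the distribution is right-skewed, any flagged values are expected on the high side only.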
Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.
# Your code here
library(ggplot2)
airquality <- airquality |>
  mutate(
    month_name = case_when(
      Month == 5 ~ "May",
      Month == 6 ~ "June",
      Month == 7 ~ "July",
      Month == 8 ~ "August",
      Month == 9 ~ "September"
    ),
    # Fix the level order so the x-axis runs May to September, not alphabetically
    month_name = factor(month_name,
                        levels = c("May", "June", "July", "August", "September"))
  )
ggplot(airquality, aes(x = month_name, y = Ozone, fill = month_name)) +
  geom_boxplot() +
  labs(title = "Ozone levels by Month", x = "Month", y = "Ozone (ppb)") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?
Ozone levels increase from May to July, peak in July, and then decrease through August and September. The highest median ozone occurs in July, so a typical July day has more ozone than a typical day in any other month. There are outliers in several months, especially in July and August. These outliers likely represent days with unusually high ozone levels, which could be caused by hot weather, air pollution, or stagnant air conditions that trap pollutants near the surface.
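The boxplot reading can be cross-checked numerically. The sketch below uses base R to compute the median ozone per month and to count the points flagged as outliers in each month. Note that boxplot.stats() uses Tukey hinges, which can differ slightly from the quartiles ggplot2 draws, so the counts are a guide rather than an exact match to the plot.

```r
data("airquality")

# Median ozone for each month (names are the month numbers 5 to 9)
median_by_month <- tapply(airquality$Ozone, airquality$Month,
                          median, na.rm = TRUE)
median_by_month

# Month number with the highest median
names(which.max(median_by_month))

# Number of points flagged as outliers per month (NAs are omitted)
outliers_by_month <- sapply(
  split(airquality$Ozone, airquality$Month),
  function(x) length(boxplot.stats(x)$out)
)
outliers_by_month
```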
Produce the scatterplot of Temp vs. Ozone, colored by Month.
# Your code goes here
# Aesthetic names are case-sensitive: use `color`, not `Color`
ggplot(airquality, aes(x = Temp, y = Ozone, color = factor(Month))) +
  geom_point(size = 3, alpha = 1, na.rm = TRUE) +
  labs(title = "Temperature vs Ozone",
       x = "Temperature (°F)",
       y = "Ozone", color = "Month") +
  theme_minimal()
Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.
There is a positive relationship between temperature and ozone levels: as temperature increases, ozone levels also tend to rise. Warmer months like July and August show higher ozone concentrations, while cooler months such as May and June have lower ozone values. The points for July and August cluster toward the upper-right corner of the plot, indicating higher temperatures and higher ozone, whereas May and June cluster lower, showing cooler temperatures with less ozone.
Compute and visualize the correlation matrix for Ozone, Temp, and Wind.
# Your code goes here
cor_matrix <- cor(
  airquality |>
    select(Ozone, Temp, Wind),
  use = "complete.obs")
cor_matrix
## Ozone Temp Wind
## Ozone 1.0000000 0.6983603 -0.6015465
## Temp 0.6983603 1.0000000 -0.5110750
## Wind -0.6015465 -0.5110750 1.0000000
Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.
The strongest correlation is between Ozone and Temp (r ≈ 0.70), which is positive: as temperature increases, ozone levels also rise. Ozone is negatively correlated with Wind (r ≈ -0.60), so higher wind speeds tend to go with lower ozone levels, but this relationship is weaker than the one with temperature. The weakest correlation is between Temp and Wind (r ≈ -0.51). This suggests that hot, calm days favor higher ozone concentrations, while cooler or windier conditions help disperse ozone and reduce its levels.
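The task also asks for a visualization of the matrix. A minimal sketch using the corrplot package follows; the styling arguments (method, type, addCoef.col, tl.col) are one reasonable choice among many. The last lines rank the three pairs by absolute correlation, using a small illustrative vector named strength.

```r
library(corrplot)
data("airquality")

# Correlation matrix on rows with no missing values in the three columns
cor_matrix <- cor(airquality[, c("Ozone", "Temp", "Wind")],
                  use = "complete.obs")

# Colored tiles for the upper triangle, with coefficients printed on top
corrplot(cor_matrix, method = "color", type = "upper",
         addCoef.col = "black", tl.col = "black")

# Rank the variable pairs by the absolute size of their correlation
strength <- c(
  "Ozone-Temp" = abs(cor_matrix["Ozone", "Temp"]),
  "Ozone-Wind" = abs(cor_matrix["Ozone", "Wind"]),
  "Temp-Wind"  = abs(cor_matrix["Temp", "Wind"])
)
sort(strength, decreasing = TRUE)
```

Comparing absolute values matters here because a correlation of -0.60 is stronger than one of -0.51, even though it is the smaller number.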
Generate the summary table grouped by Month. It should include the count, mean ozone, mean temperature, and mean wind speed per month.
# your code goes here
library(dplyr)
summary_table <- airquality |>
  group_by(Month) |>
  summarise(
    count = n(),
    avg_ozone = mean(Ozone, na.rm = TRUE),
    avg_temp = mean(Temp, na.rm = TRUE),
    avg_wind = mean(Wind, na.rm = TRUE)
  )
summary_table
## # A tibble: 5 × 5
## Month count avg_ozone avg_temp avg_wind
## <int> <int> <dbl> <dbl> <dbl>
## 1 5 31 23.6 65.5 11.6
## 2 6 30 29.4 79.1 10.3
## 3 7 31 59.1 83.9 8.94
## 4 8 31 60.0 84.0 8.79
## 5 9 30 31.4 76.9 10.2
Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?
August has the highest average ozone level (60.0), narrowly ahead of July (59.1). Temperature rises from May (65.5) to a peak in July and August (83.9 and 84.0), then falls in September (76.9). Wind speed is highest in May (11.6), lowest in July and August (8.94 and 8.79), and picks up again in September (10.2). These differences can be explained by seasonal weather patterns: higher temperatures and lighter winds in summer allow ozone to build up, while cooler, windier conditions in spring and early fall help disperse pollutants.
Submission Requirements
Publish it to RPubs and submit your link on Blackboard.