Objective:

You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.

Dataset

Source: Built-in R dataset airquality.

Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.

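Variables (selected for this assignment): Ozone (mean ozone, ppb), Temp (maximum daily temperature, °F), Wind (average wind speed, mph), and Month (numeric, 5–9).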

Notes

-The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.

-If you encounter errors, check that tidyverse and corrplot are installed and loaded.

-Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.

Instructions:

Complete the following tasks using R to analyze the airquality dataset. Submit your RPubs link, which must include code, outputs (tables and plots), and written interpretations for each task. Make sure you load the dataset using data(airquality) and install and load the tidyverse and corrplot packages.

#Load your dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.5.2
## corrplot 0.95 loaded
data("airquality")

Tasks and Questions

Task 1: Measures of Central Tendency and Spread

Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.

#Your code for Ozone goes here

summary_ozone <- airquality |>
  summarise(
    mean_ozone = mean(Ozone, na.rm = TRUE),
    median_ozone = median(Ozone, na.rm = TRUE),
    sd_ozone = sd(Ozone, na.rm = TRUE),
    min_ozone = min(Ozone, na.rm = TRUE),
    max_ozone = max(Ozone, na.rm = TRUE))
summary_ozone
##   mean_ozone median_ozone sd_ozone min_ozone max_ozone
## 1   42.12931         31.5 32.98788         1       168
#Your code for Temp goes here
summary_temp <- airquality |>
  summarise(
    mean_temp = mean(Temp, na.rm = TRUE),
    median_temp = median(Temp, na.rm = TRUE),
    sd_temp = sd(Temp, na.rm = TRUE),
    min_temp = min(Temp, na.rm = TRUE),
    max_temp = max(Temp, na.rm = TRUE))
summary_temp
##   mean_temp median_temp sd_temp min_temp max_temp
## 1  77.88235          79 9.46527       56       97
#Your code for Wind goes here
summary_wind <- airquality |>
  summarise(
    mean_wind = mean(Wind, na.rm = TRUE),
    median_wind = median(Wind, na.rm = TRUE),
    sd_wind = sd(Wind, na.rm = TRUE),
    min_wind = min(Wind, na.rm = TRUE),
    max_wind = max(Wind, na.rm = TRUE))
summary_wind
##   mean_wind median_wind  sd_wind min_wind max_wind
## 1  9.957516         9.7 3.523001      1.7     20.7

Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?

Answer: For Ozone, the mean (42.1) is well above the median (31.5), which indicates a right-skewed distribution: a few very high ozone days pull the average up. The standard deviation (33.0) is large relative to the mean, so ozone levels vary a lot from day to day. For Temp, the mean (77.9) is slightly below the median (79), so the distribution is roughly symmetric with at most a slight left skew. The standard deviation (9.5) is small relative to the mean, meaning temperatures stay fairly steady. For Wind, the mean (9.96) and median (9.7) are close, suggesting a roughly symmetric distribution. The standard deviation (3.5) is moderate relative to the mean, so wind speeds vary somewhat but far less than ozone does.
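The mean-versus-median comparison can also be made explicit in code. The sketch below is one possible approach and is not part of the required assignment code. It computes the gap between mean and median and the coefficient of variation (standard deviation divided by the mean), so that spread can be compared across variables measured in different units. The helper name describe_var and the row labels are made up for illustration.

```r
data(airquality)

# Illustrative helper (hypothetical name): summarises one numeric vector.
describe_var <- function(x) {
  x <- x[!is.na(x)]                       # drop missing values first
  c(
    mean        = mean(x),
    median      = median(x),
    mean_median = mean(x) - median(x),    # > 0 hints at right skew, < 0 at left skew
    sd          = sd(x),
    cv          = sd(x) / mean(x)         # unit-free measure of relative spread
  )
}

# One column per variable, one row per statistic
comparison <- sapply(airquality[c("Ozone", "Temp", "Wind")], describe_var)
round(comparison, 2)
```

A positive mean_median value is only a rough hint of right skew; the histogram in Task 2 remains the better check.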

Task 2: Histogram

Generate the histogram for Ozone.

library(ggplot2)

ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 6, fill = "#a3b7ca", color = "black") +
  labs(title = "Histogram of Ozone Levels", x = "Ozone (ppb)", y = "Frequency") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).

Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?

Answer: The histogram of ozone levels is right-skewed. Most values are low, but a few very high ones stretch out to the right, showing that ozone is usually low to moderate with some high days. These elevated values are potential outliers and form a long tail on the right side of the graph. Overall, the distribution is unimodal and not normal because of the skewness.

Task 3: Boxplot

Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names, using case_when from Week 4. Generate a boxplot of Ozone by month_name.

# Your code here

airquality <- airquality |>
  mutate(month_name = case_when(
    Month == 5 ~ "May",
    Month == 6 ~ "June",
    Month == 7 ~ "July",
    Month == 8 ~ "August",
    Month == 9 ~ "September"
  ))

ggplot(airquality, aes(x = month_name, y = Ozone, fill = month_name)) +
  geom_boxplot() +
  labs(title = "Ozone Levels by Month", x = "Month", y = "Ozone (ppb)") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
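Because month_name is a character column, ggplot2 arranges the boxes alphabetically (August, July, June, May, September) rather than in calendar order. One possible fix, sketched below, is to convert the labels to a factor with explicit levels. The names month_levels, airquality_ordered, and p_box are illustrative and not part of the assignment.

```r
library(dplyr)
library(ggplot2)
data(airquality)

# Calendar order for the five months in the dataset
month_levels <- c("May", "June", "July", "August", "September")

airquality_ordered <- airquality |>
  mutate(
    month_name = case_when(
      Month == 5 ~ "May",
      Month == 6 ~ "June",
      Month == 7 ~ "July",
      Month == 8 ~ "August",
      Month == 9 ~ "September"
    ),
    # A factor with explicit levels makes ggplot2 keep calendar order
    month_name = factor(month_name, levels = month_levels)
  )

p_box <- ggplot(airquality_ordered, aes(x = month_name, y = Ozone, fill = month_name)) +
  geom_boxplot(na.rm = TRUE) +   # na.rm = TRUE drops missing Ozone values silently
  labs(title = "Ozone Levels by Month", x = "Month", y = "Ozone (ppb)") +
  theme_minimal()
p_box
```

The same factor column would also put a grouped summary table in calendar order.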

Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?

Answer: Ozone levels vary by month. They are lowest in May, rise to a peak in July and August, and fall again in September. The median ozone is highest in July. Outliers appear in May, June, August, and September (the highest value, 168 ppb, is in August), while July shows none because its box is already wide. These unusually high days may be attributable to hot weather or the accumulation of pollution.
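The medians and outliers read off the boxplot can be checked numerically. The following is a minimal base R sketch, with the object names ozone_medians and ozone_outliers chosen only for illustration. Note that boxplot.stats() uses hinges, whereas geom_boxplot() uses quantiles, so borderline points might be classified differently.

```r
data(airquality)

# Median ozone per month (months are coded 5 to 9)
ozone_medians <- tapply(airquality$Ozone, airquality$Month, median, na.rm = TRUE)
ozone_medians

# Values a base R boxplot would draw as outliers in each month
# (beyond 1.5 * IQR from the hinges); NAs are omitted by boxplot.stats()
ozone_outliers <- lapply(
  split(airquality$Ozone, airquality$Month),
  function(x) boxplot.stats(x)$out
)
ozone_outliers
```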

Task 4: Scatterplot

Produce the scatterplot of Temp vs. Ozone, colored by Month.

# Your code goes here

ggplot(airquality, aes(x = Temp, y = Ozone, color = month_name)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(title = "Temperature vs Ozone by Month",
       x = "Temperature (°F)",    # copy the degree symbol (°) from the website
       y = "Ozone (ppb)",
       color = "Month") +
  theme_minimal()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.

Answer: Yes, the scatterplot shows a positive relationship between temperature and ozone: as temperature rises, ozone levels also rise. July and August cluster at higher temperatures and ozone levels, while May and September show lower values. This means hotter weather is linked to higher ozone concentrations.
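The visual trend can be backed up with a fitted line. The sketch below overlays a single linear trend on the scatterplot and reads off the slope from lm(). Treating the relationship as linear is an assumption made here for simplicity, and the names p_trend and fit are illustrative.

```r
library(ggplot2)
data(airquality)

# Scatterplot with one overall linear trend line (colour applies to points only)
p_trend <- ggplot(airquality, aes(x = Temp, y = Ozone)) +
  geom_point(aes(color = factor(Month)), size = 3, alpha = 0.7, na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, color = "black", na.rm = TRUE) +
  labs(title = "Temperature vs Ozone with Linear Trend",
       x = "Temperature (°F)",
       y = "Ozone (ppb)",
       color = "Month") +
  theme_minimal()
p_trend

# Slope of the fitted line: change in ozone (ppb) per one-degree rise in temperature
fit <- lm(Ozone ~ Temp, data = airquality)   # rows with missing Ozone are dropped
coef(fit)["Temp"]
```

A straight line is only a rough summary; if the points curve upward at high temperatures, a smoother such as method = "loess" may describe them better.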

Task 5: Correlation Matrix

Compute and visualize the correlation matrix for Ozone, Temp, and Wind.

# Your code goes here

cor_matrix <- cor(
  airquality |>
    select(Ozone, Temp, Wind), use = "complete.obs")

cor_matrix
##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000
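The task also asks for a visualization of the matrix. One way to do this with the corrplot package loaded earlier is sketched below. The styling arguments are a matter of taste and not prescribed by the assignment.

```r
library(corrplot)
data(airquality)

# Same correlation matrix as above, using only rows with no missing values
cor_matrix <- cor(airquality[, c("Ozone", "Temp", "Wind")], use = "complete.obs")

# Coloured tiles with the coefficient printed in each cell
corrplot(
  cor_matrix,
  method = "color",        # fill each cell according to the correlation
  type = "upper",          # show the upper triangle only
  addCoef.col = "black",   # print the correlation values
  tl.col = "black",        # colour of the variable labels
  diag = FALSE             # hide the trivial 1.0 diagonal
)
```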

Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.

Answer: The strongest correlation is between Ozone and Temp (r = 0.70): ozone tends to rise with temperature. Ozone and Wind are negatively correlated (r = -0.60), meaning higher wind speeds are linked to lower ozone levels. The weakest correlation is between Temp and Wind (r = -0.51), but it is still a moderate negative relationship rather than a negligible one. Ozone is therefore more strongly correlated with temperature than with wind speed.

Task 6: Summary Table

Generate the summary table grouped by Month. It should include the count, mean ozone, mean temperature, and mean wind speed per month.

# your code goes here

summary_table <- airquality |>
  group_by(month_name) |>
  summarise(
    count = n(),
    avg_Ozone = mean(Ozone, na.rm = TRUE),
    avg_Temp = mean(Temp, na.rm = TRUE),
    avg_Wind = mean(Wind, na.rm = TRUE)
  )
summary_table
## # A tibble: 5 × 5
##   month_name count avg_Ozone avg_Temp avg_Wind
##   <chr>      <int>     <dbl>    <dbl>    <dbl>
## 1 August        31      60.0     84.0     8.79
## 2 July          31      59.1     83.9     8.94
## 3 June          30      29.4     79.1    10.3 
## 4 May           31      23.6     65.5    11.6 
## 5 September     30      31.4     76.9    10.2

Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?

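Answer: August has the highest average ozone level (60.0 ppb), just ahead of July (59.1 ppb). These two months are also the hottest (about 84 °F on average) and the least windy (under 9 mph). Hot, calm weather helps ozone build up in the air. May is the coolest and windiest month and has the lowest average ozone; June and September fall in between.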
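Sorting the table makes the ranking easier to read than scanning the rows. The sketch below is one possible variant of the summary. It groups by the numeric Month column, adds a count of non-missing ozone readings (the column n_ozone is an addition not requested by the task), and sorts by average ozone. The name monthly_ranked is illustrative.

```r
library(dplyr)
data(airquality)

monthly_ranked <- airquality |>
  group_by(Month) |>
  summarise(
    count     = n(),                    # all days in the month
    n_ozone   = sum(!is.na(Ozone)),     # days with an ozone reading
    avg_Ozone = mean(Ozone, na.rm = TRUE),
    avg_Temp  = mean(Temp, na.rm = TRUE),
    avg_Wind  = mean(Wind, na.rm = TRUE)
  ) |>
  arrange(desc(avg_Ozone))              # highest average ozone first
monthly_ranked
```

Comparing count with n_ozone shows how many days each monthly ozone average is actually based on, which matters when many readings are missing.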

Submission Requirements

Publish your report to RPubs and submit the link on Blackboard.