Objective:

You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.

Dataset

Source: Built-in R dataset airquality.

Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.

Variables (selected for this assignment): Ozone (ppb), Temp (°F), Wind (mph), and Month (5 = May through 9 = September).

Notes

-The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.

-If you encounter errors, check that tidyverse and corrplot are installed and loaded.

-Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
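
To make the note on missing values concrete, here is a minimal base R sketch that counts the NA values per column and shows what mean() and cor() return when the missing values are not handled. The object names (na_counts, mean_with_na, and so on) are illustrative only.

```r
data("airquality")

# Count missing values in each column
na_counts <- colSums(is.na(airquality))
na_counts

# Without na.rm = TRUE, a single NA makes the whole result NA
mean_with_na <- mean(airquality$Ozone)
mean_without_na <- mean(airquality$Ozone, na.rm = TRUE)

# cor() behaves the same way unless told which rows to use
cor_default <- cor(airquality$Ozone, airquality$Temp)
cor_complete <- cor(airquality$Ozone, airquality$Temp, use = "complete.obs")
```

Printing mean_with_na or cor_default should show NA, which is usually the first sign that a na.rm or use argument was left out.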

Instructions:

Complete the following tasks using R to analyze the airquality dataset. Submit your RPubs link; the published report should include code, outputs (tables and plots), and written interpretations for each task. Load the dataset with data(airquality), and install and load the tidyverse and corrplot packages.

#Load your dataset

library(tidyverse)
## Warning: package 'dplyr' was built under R version 4.5.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.5.1
## corrplot 0.95 loaded
data("airquality")

Tasks and Questions

Task 1: Measures of Central Tendency and Spread

Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.

#Your code for Ozone goes here
# Summary statistics for Ozone
ozone_stats <- airquality |>
  summarise(
    mean_ozone = mean(Ozone, na.rm = TRUE),
    median_ozone = median(Ozone, na.rm = TRUE),
    sd_ozone = sd(Ozone, na.rm = TRUE),
    min_ozone = min(Ozone, na.rm = TRUE),
    max_ozone = max(Ozone, na.rm = TRUE)
  )

ozone_stats
##   mean_ozone median_ozone sd_ozone min_ozone max_ozone
## 1   42.12931         31.5 32.98788         1       168
#Your code for Temp goes here

temp_stats <- airquality |>
  summarise(
    mean_temp = mean(Temp, na.rm = TRUE),
    median_temp = median(Temp, na.rm = TRUE),
    sd_temp = sd(Temp, na.rm = TRUE),
    min_temp = min(Temp, na.rm = TRUE),
    max_temp = max(Temp, na.rm = TRUE)
  )

temp_stats
##   mean_temp median_temp sd_temp min_temp max_temp
## 1  77.88235          79 9.46527       56       97
#Your code for Wind goes here

wind_stats <- airquality |>
  summarise(
    mean_wind = mean(Wind, na.rm = TRUE),
    median_wind = median(Wind, na.rm = TRUE),
    sd_wind = sd(Wind, na.rm = TRUE),
    min_wind = min(Wind, na.rm = TRUE),
    max_wind = max(Wind, na.rm = TRUE)
  )

wind_stats
##   mean_wind median_wind  sd_wind min_wind max_wind
## 1  9.957516         9.7 3.523001      1.7     20.7

Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?

The mean of the ozone variable (42.13) is well above its median (31.5), which indicates a right-skewed distribution, and the standard deviation (32.99) is large relative to the mean, showing high variability. The mean (77.88) and median (79) of temperature are very close, which suggests a roughly symmetric distribution, and the standard deviation of 9.47 indicates moderate variability in daily temperatures. The mean (9.96) and median (9.7) of wind are nearly identical, which also suggests a symmetric distribution, and the standard deviation of 3.52 shows relatively consistent wind speeds across days.
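
The mean-versus-median comparison above can be cross-checked with a skewness coefficient. The sketch below uses a hypothetical helper, skewness_simple(), which is not a base R function; it computes the moment-based skewness (third central moment divided by the second central moment raised to 1.5). Positive values point to a longer right tail, and values near zero to a roughly symmetric distribution.

```r
data("airquality")

# Hypothetical helper (not part of base R): moment-based sample skewness
skewness_simple <- function(x) {
  x <- x[!is.na(x)]                      # drop missing values first
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5  # third moment / second moment^1.5
}

# Apply the helper to the three variables used in Task 1
skew_values <- sapply(airquality[c("Ozone", "Temp", "Wind")], skewness_simple)
round(skew_values, 2)
```

Other skewness definitions apply small-sample corrections, so values from add-on packages may differ slightly from this sketch.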

Task 2: Histogram

Generate the histogram for Ozone.

#Your code goes here
ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 15, fill = "blue", color = "black", na.rm = TRUE) +
  labs(title = "Histogram of Ozone Levels",
       x = "Ozone (ppb)", 
       y = "Count") +
  theme_minimal()

Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?

The ozone distribution is strongly right-skewed and unimodal, with a peak around 10-30 ppb. The potential outliers are in the 150+ ppb range.
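
One way to back up a visual judgment about outliers is the 1.5 × IQR rule, the same convention boxplots use by default. The sketch below applies it to the full Ozone column; it is one common rule of thumb, not the only definition of an outlier, and the variable names are illustrative.

```r
data("airquality")

# Work with the non-missing ozone readings only
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Quartiles and interquartile range
q <- quantile(ozone, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Fences at 1.5 * IQR beyond the quartiles
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

# Readings that fall outside the fences
flagged <- sort(ozone[ozone < lower_fence | ozone > upper_fence])
flagged
```

Because the fences here come from all months pooled together, the flagged values need not match the points drawn as outliers in a boxplot split by month.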

Task 3: Boxplot

Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.

# Your code here

airquality_month <- airquality |>
  mutate(
    month_name = case_when(
      Month == 5 ~ "May",
      Month == 6 ~ "June", 
      Month == 7 ~ "July",
      Month == 8 ~ "August",
      Month == 9 ~ "September",
      TRUE ~ as.character(Month)
    ),
    month_name = factor(month_name, levels = c("May", "June", "July", "August", "September"))
  )

# Create boxplot
ggplot(airquality_month, aes(x = month_name, y = Ozone)) +
  geom_boxplot(fill = c("blue", "yellow", "purple", "red", "green")) +
  labs(title = "Boxplot of Ozone Levels by Month",
       x = "Month",
       y = "Ozone (ppb)") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?

July and August have the highest median ozone levels, and August reaches the highest values overall, while May has the lowest median. Several outliers appear in June and September, indicating days with unusually high ozone levels for those months.
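
Medians can be hard to read precisely off a boxplot, so it may help to compute them directly. This is a small sketch using dplyr (part of the tidyverse); the object name ozone_by_month is illustrative.

```r
library(dplyr)
data("airquality")

# Median and maximum ozone per month, highest median first
ozone_by_month <- airquality |>
  group_by(Month) |>
  summarise(
    median_ozone = median(Ozone, na.rm = TRUE),
    max_ozone = max(Ozone, na.rm = TRUE)
  ) |>
  arrange(desc(median_ozone))

ozone_by_month
```

The first row of the result answers the "highest median" question, and the max_ozone column shows which month contains the most extreme reading.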

Task 4: Scatterplot

Produce the scatterplot of Temp vs. Ozone, colored by Month.

# Your code goes here
ggplot(airquality_month, aes(x = Temp, y = Ozone, color = month_name)) +
  geom_point(alpha = 0.7, na.rm = TRUE) +
  labs(title = "Scatterplot of Temperature vs. Ozone Levels",
       x = "Temperature (°F)",
       y = "Ozone (ppb)",
       color = "Month") +
  scale_color_manual(values = c("May" = "blue", 
                               "June" = "orange",
                               "July" = "green",
                               "August" = "red",
                               "September" = "black")) +
  theme_minimal()

Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.

There is a strong positive relationship between temperature and ozone levels. I notice in the scatterplot that as temperature increases, ozone concentrations generally increase. The months are also clustered by temperature. May has the lowest temperatures and ozone levels, while July and August have the highest temperatures and ozone levels.
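
The visual trend can also be summarized with a simple linear fit. The sketch below is an optional check, not part of the assigned tasks: it regresses Ozone on Temp with lm() and extracts the slope and R-squared. A straight line is an assumption here; the scatterplot may suggest some curvature at high temperatures.

```r
data("airquality")

# lm() drops rows with missing Ozone by default (na.action = na.omit)
fit <- lm(Ozone ~ Temp, data = airquality)

# Change in ozone (ppb) associated with a 1 °F increase in temperature
slope <- coef(fit)[["Temp"]]

# Share of ozone variance explained by temperature alone
r_squared <- summary(fit)$r.squared

c(slope = slope, r_squared = r_squared)
```

For a one-predictor model, the square root of R-squared equals the absolute value of the Pearson correlation between the two variables, so this result can be compared with the correlation matrix in Task 5.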

Task 5: Correlation Matrix

Compute and visualize the correlation matrix for Ozone, Temp, and Wind.

# Your code goes here

cor_matrix <- airquality |>
  select(Ozone, Temp, Wind) |>
  cor(use = "complete.obs")

cor_matrix
##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000
# Visualize correlation matrix
corrplot(cor_matrix, method = "color", type = "upper", 
         tl.col = "black", tl.srt = 45, addCoef.col = "black",
         title = "Correlation Matrix of Air Quality Variables",
         mar = c(0, 0, 2, 0))  # extra top margin so the title is not clipped

Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.

The strongest correlation is between Ozone and Temp (r = 0.70), a strong positive relationship in which higher temperatures are associated with higher ozone levels. The second strongest is between Ozone and Wind (r = -0.60), a negative relationship in which higher wind speeds are associated with lower ozone levels. The weakest correlation is between Temp and Wind (r = -0.51), though this still represents a moderate negative relationship.
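
With only three variables the ranking is easy to read by eye, but the same comparison can be done programmatically. This base R sketch lists each variable pair once and sorts the pairs by the absolute value of the correlation; the names cor_pairs and pairs_ranked are illustrative.

```r
data("airquality")

vars <- airquality[c("Ozone", "Temp", "Wind")]
cor_mat <- cor(vars, use = "complete.obs")

# Keep each pair once by taking the upper triangle of the matrix
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r = cor_mat[idx]
)

# Strongest relationship first, regardless of sign
pairs_ranked <- cor_pairs[order(-abs(cor_pairs$r)), ]
pairs_ranked
```

Sorting on the absolute value matters because a correlation of -0.60 is stronger than one of 0.51, even though it is numerically smaller.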

Task 6: Summary Table

Generate the summary table grouped by Month. It should include the count, mean ozone, mean temperature, and mean wind speed per month.

# your code goes here

summary_table <- airquality_month |>
  group_by(month_name) |>
  summarise(
    Count = n(),
    Average_Ozone = mean(Ozone, na.rm = TRUE),
    Average_Temp = mean(Temp, na.rm = TRUE),
    Average_Wind = mean(Wind, na.rm = TRUE)
  )

summary_table
## # A tibble: 5 × 5
##   month_name Count Average_Ozone Average_Temp Average_Wind
##   <fct>      <int>         <dbl>        <dbl>        <dbl>
## 1 May           31          23.6         65.5        11.6 
## 2 June          30          29.4         79.1        10.3 
## 3 July          31          59.1         83.9         8.94
## 4 August        31          60.0         84.0         8.79
## 5 September     30          31.4         76.9        10.2

Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?

August has the highest average ozone level (60.0 ppb), narrowly ahead of July (59.1 ppb). August also has the highest average temperature (83.97°F) and the lowest average wind speed (8.79 mph). May has the lowest average ozone (23.62 ppb), the lowest average temperature (65.55°F), and the highest average wind speed (11.62 mph). This pattern is consistent with warm, calm summer conditions favoring ozone buildup, while cooler, windier days keep concentrations lower.
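
One caveat when reading the table: Count = n() counts every row in a month, including days where Ozone is missing, so the ozone averages rest on fewer observations than Count suggests. The sketch below, using dplyr, adds a hypothetical Ozone_N column with the number of non-missing readings and picks out the top month programmatically.

```r
library(dplyr)
data("airquality")

monthly <- airquality |>
  group_by(Month) |>
  summarise(
    Count = n(),                        # all rows, including missing Ozone
    Ozone_N = sum(!is.na(Ozone)),       # rows that contribute to the mean
    Average_Ozone = mean(Ozone, na.rm = TRUE)
  )

# Month with the highest average ozone
top_month <- monthly |>
  slice_max(Average_Ozone, n = 1)

monthly
top_month
```

Months where Ozone_N is much smaller than Count deserve a more cautious interpretation, since their averages are based on only a handful of days.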

Submission Requirements

Publish your report to RPubs and submit the link on Blackboard.