Objective:

You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.

Dataset

Source: Built-in R dataset airquality.

Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.

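Variables (selected for this assignment):

-Ozone: mean ozone concentration, in parts per billion (ppb)

-Temp: maximum daily temperature, in degrees Fahrenheit (°F)

-Wind: average wind speed, in miles per hour (mph)

-Month: month of observation, coded 5–9 (May to September)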

Notes

-The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.

-If you encounter errors, check that tidyverse and corrplot are installed and loaded.

-Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.

Instructions:

Complete the following tasks using R to analyze the airquality dataset. Submit your RPubs link; the published report must include code, outputs (tables and plots), and written interpretations for each task. Load the dataset with data(airquality), and install and load the tidyverse and corrplot packages.

#Load your dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data("airquality")

Tasks and Questions

Task 1: Measures of Central Tendency and Spread

Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.

#Your code for Ozone goes here
mean(airquality$Ozone, na.rm = TRUE)
## [1] 42.12931
median(airquality$Ozone, na.rm = TRUE)
## [1] 31.5
sd(airquality$Ozone, na.rm = TRUE)
## [1] 32.98788
min(airquality$Ozone, na.rm = TRUE)
## [1] 1
max(airquality$Ozone, na.rm = TRUE)
## [1] 168
#Your code for Temp goes here
mean(airquality$Temp, na.rm = TRUE)
## [1] 77.88235
median(airquality$Temp, na.rm = TRUE)
## [1] 79
sd(airquality$Temp, na.rm = TRUE)
## [1] 9.46527
min(airquality$Temp, na.rm = TRUE)
## [1] 56
max(airquality$Temp, na.rm = TRUE)
## [1] 97
#Your code for Wind goes here
mean(airquality$Wind, na.rm = TRUE)
## [1] 9.957516
median(airquality$Wind, na.rm = TRUE)
## [1] 9.7
sd(airquality$Wind, na.rm = TRUE)
## [1] 3.523001
min(airquality$Wind, na.rm = TRUE)
## [1] 1.7
max(airquality$Wind, na.rm = TRUE)
## [1] 20.7
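
The separate calls above can also be collected into a single table. The following is a minimal base-R sketch rather than part of the original assignment; the names `vars` and `stats_table` are illustrative.

```r
data("airquality")

# Columns to summarise (the three variables named in Task 1)
vars <- c("Ozone", "Temp", "Wind")

# Apply the same five statistics to each column; na.rm = TRUE drops the
# missing Ozone readings before each calculation
stats_table <- sapply(airquality[vars], function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE))
})

# One column per variable, one row per statistic
round(stats_table, 2)
```

Putting the three variables side by side makes the mean and median comparison in the question below easier to read off.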

Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?

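Ozone's mean (42.1 ppb) is well above its median (31.5 ppb), which indicates a right-skewed distribution pulled upward by a few very high readings (the maximum is 168 ppb). Temp (mean 77.9, median 79) and Wind (mean 9.96, median 9.7) have means and medians that nearly coincide, indicating roughly symmetric distributions. Ozone also has the largest standard deviation (33.0 ppb). Because the three variables are measured in different units, the standard deviations are easier to compare relative to each mean: about 78% for Ozone, 35% for Wind, and 12% for Temp. By that measure Ozone is by far the most variable and Temp the most consistent, even though Wind has the smallest standard deviation in absolute terms (3.5 mph).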
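
Ozone, Temp, and Wind are measured in different units, so their raw standard deviations are not directly comparable. One common way to put them on the same footing is the coefficient of variation (standard deviation divided by mean). This is a sketch of that idea, not something the assignment requires; `cv` is an illustrative name.

```r
data("airquality")

vars <- c("Ozone", "Temp", "Wind")

# Coefficient of variation: spread expressed as a fraction of the mean
cv <- sapply(airquality[vars], function(x) {
  sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
})

round(cv, 2)
# Ozone  Temp  Wind
#  0.78  0.12  0.35
```

The coefficient of variation is only meaningful for variables with a true zero and positive values, which is a reasonable assumption for ozone concentration and wind speed but more debatable for temperature in °F.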

Task 2: Histogram

Generate the histogram for Ozone.

#Your code goes here

# Basic histogram of Ozone
hist(airquality$Ozone,
     main = "Histogram of Ozone Levels",
     xlab = "Ozone (ppb)",
     col = "purple")

Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?
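
A description of the histogram's shape can be backed up numerically. The sketch below, which is an illustration and not part of the assignment, tabulates the bin counts that hist() draws, computes a simple moment-based skewness estimate, and lists the points that the usual boxplot rule would flag as outliers. The names `ozone`, `bins`, and `skewness` are illustrative.

```r
data("airquality")

# Keep only the days with an Ozone reading
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Bin counts behind the histogram, computed without drawing it
h <- hist(ozone, plot = FALSE)
bins <- data.frame(from  = head(h$breaks, -1),
                   to    = h$breaks[-1],
                   count = h$counts)
bins

# Simple moment-based skewness: positive values mean a longer right tail
skewness <- mean((ozone - mean(ozone))^3) / sd(ozone)^3
skewness

# Values more than 1.5 * IQR beyond the hinges (the boxplot outlier rule)
boxplot.stats(ozone)$out
```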

Task 3: Boxplot

Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from Week 4. Generate a boxplot of Ozone by month_name.

# Your code here
airquality <- airquality |>
  mutate(month_name = case_when(
    Month == 5 ~ "May",
    Month == 6 ~ "June",
    Month == 7 ~ "July",
    Month == 8 ~ "August",
    Month == 9 ~ "September"
  ))

boxplot(Ozone ~ month_name, 
        data = airquality,
        main = "Ozone Levels by Month",
        xlab = "Month",
        ylab = "Ozone (ppb)",
        col = "lightblue",
        border = "black")
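
Because case_when returns a character column, boxplot() orders the months alphabetically (August, July, June, May, September). If calendar order is preferred, one option is to store the names as a factor with explicit levels. The sketch below works on a copy called `aq` (a hypothetical name) so the data frame used in the rest of the report is untouched, and it swaps in R's built-in `month.name` constant in place of case_when to stay free of package dependencies.

```r
data("airquality")

# Work on a copy so later tasks still see the original month_name column
aq <- airquality

# Calendar order for the five months in the dataset
month_levels <- c("May", "June", "July", "August", "September")

# month.name is a built-in vector: month.name[5] is "May", and so on
aq$month_name <- factor(month.name[aq$Month], levels = month_levels)

# Boxes now appear in factor-level (calendar) order
boxplot(Ozone ~ month_name,
        data = aq,
        main = "Ozone Levels by Month",
        xlab = "Month",
        ylab = "Ozone (ppb)",
        col = "lightblue",
        border = "black")
```

The same idea works with the case_when version: wrapping its result in factor(..., levels = month_levels) should give the same ordering.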

Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?

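The boxplot shows that ozone levels vary by month. Because month_name is a character column, the boxes are drawn in alphabetical order rather than calendar order, so the months have to be read by label. Ozone concentrations are lowest in May and June, rise sharply in July, stay high in August, and decline again in September. July has the highest median; August's median is somewhat lower, but its readings are more spread out and include the dataset maximum of 168 ppb. Several outliers appear in May, June, and September, indicating occasional high-ozone days outside the usual range for those months. These extreme values could be related to specific weather patterns. Overall, ozone levels peak in midsummer and decline as the season transitions to fall.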
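
Medians can be hard to judge by eye when two boxes are close, so it can help to compute them directly. This is a small base-R sketch, not part of the assignment; `ozone_median` is an illustrative name, and the result is indexed by the numeric Month code (5 to 9).

```r
data("airquality")

# Median Ozone per month, ignoring missing readings
ozone_median <- tapply(airquality$Ozone, airquality$Month,
                       median, na.rm = TRUE)
ozone_median

# Month code with the highest median
names(which.max(ozone_median))
```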

Task 4: Scatterplot

Produce the scatterplot of Temp vs. Ozone, colored by Month.

# Your code goes here
ggplot(airquality, aes(x = Temp, y = Ozone, color = month_name)) +
  geom_point(size = 2, alpha = 0.8) +
  labs(title = "Temperature vs. Ozone by Month",
       x = "Temperature (°F)",
       y = "Ozone (ppb)",
       color = "Month") +
  theme_minimal()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.

The scatterplot shows a positive relationship between temperature and ozone levels: as temperature increases, ozone concentrations tend to rise. Points from July and August cluster in the upper right, showing both high temperatures and elevated ozone levels. In contrast, May and June points cluster toward the lower left, reflecting cooler temperatures and lower ozone. This pattern is consistent with ozone formation being favored by heat, with the highest values occurring during the warmest summer months. Sunlight intensity is a plausible contributor as well, but Solar.R is not shown in this plot. The 37 rows dropped in the warning are the days with a missing Ozone reading.
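
The visual trend can be quantified with a correlation coefficient and a simple linear fit. The sketch below is an illustration under the assumption that a straight line is an adequate first summary; it is not part of the assignment, and `temp_ozone_cor` and `fit` are illustrative names.

```r
data("airquality")

# Pearson correlation between Temp and Ozone on days with both values
temp_ozone_cor <- cor(airquality$Temp, airquality$Ozone,
                      use = "complete.obs")
temp_ozone_cor

# Simple linear regression; lm() drops rows with missing Ozone by default
fit <- lm(Ozone ~ Temp, data = airquality)

# The Temp coefficient is the estimated change in Ozone (ppb) per 1 °F
coef(fit)
```

A positive Temp coefficient agrees with the upward trend in the scatterplot, although a straight line may understate how quickly ozone rises on the hottest days.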

Task 5: Correlation Matrix

Compute and visualize the correlation matrix for Ozone, Temp, and Wind.

# Your code goes here
otw <- airquality[, c("Ozone", "Temp", "Wind")]
cor_matrix <- cor(otw, use = "complete.obs")
cor_matrix
##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000
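
The task also asks for a visualization of the matrix. One possible sketch uses corrplot(), which is loaded at the top of the report. The specific styling arguments here (method, type, addCoef.col, tl.col) are one choice among many, and the requireNamespace() guard is only there so the block does not fail where the package is missing.

```r
data("airquality")

# Rebuild the matrix so this block runs on its own
otw <- airquality[, c("Ozone", "Temp", "Wind")]
cor_matrix <- cor(otw, use = "complete.obs")

# Draw the upper triangle as colored tiles with the coefficients printed
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_matrix,
                     method = "color",
                     type = "upper",
                     addCoef.col = "black",
                     tl.col = "black")
}
```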

Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.

The strongest correlation is between ozone and temperature (≈ 0.70), indicating that higher temperatures are strongly associated with higher ozone levels. Next is the negative correlation between ozone and wind (≈ –0.60), indicating that higher wind speeds are associated with lower ozone concentrations. The weakest correlation is between temperature and wind (≈ –0.51), showing that warmer days tend to have lighter winds, though the relationship is only moderate. Ozone is therefore more strongly correlated with temperature than with wind speed. These results are consistent with hot, calm days favoring ozone buildup and windy conditions lowering it, although correlation alone does not establish cause.

Task 6: Summary Table

Generate the summary table grouped by Month. It should include the count of days and the mean ozone, mean temperature, and mean wind speed for each month.

# your code goes here
summary_table <- airquality |>
  group_by(month_name) |>
  summarise(
    count = n(),
    mean_ozone = mean(Ozone, na.rm = TRUE),
    mean_temp  = mean(Temp, na.rm = TRUE),
    mean_wind  = mean(Wind, na.rm = TRUE)
  )

summary_table
## # A tibble: 5 × 5
##   month_name count mean_ozone mean_temp mean_wind
##   <chr>      <int>      <dbl>     <dbl>     <dbl>
## 1 August        31       60.0      84.0      8.79
## 2 July          31       59.1      83.9      8.94
## 3 June          30       29.4      79.1     10.3 
## 4 May           31       23.6      65.5     11.6 
## 5 September     30       31.4      76.9     10.2
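
Note that n() counts every day in the month, including days with a missing Ozone reading, so the count column is not the number of values behind mean_ozone. The base-R sketch below, which is illustrative and not part of the assignment, compares the two counts per month; `n_days`, `n_ozone`, and `counts` are hypothetical names, and months are shown by numeric code.

```r
data("airquality")

# Total days recorded in each month
n_days <- tapply(airquality$Ozone, airquality$Month, length)

# Days in each month that actually have an Ozone reading
n_ozone <- tapply(!is.na(airquality$Ozone), airquality$Month, sum)

counts <- data.frame(Month          = names(n_days),
                     days           = as.vector(n_days),
                     ozone_readings = as.vector(n_ozone))
counts
```

A month with few ozone readings has a less reliable mean, which is worth keeping in mind when comparing months.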

Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?

August has the highest average ozone level (60.0 ppb), with July close behind (59.1 ppb). These two months also have the highest average temperatures (about 84 °F) and the lowest average wind speeds (under 9 mph). May, June, and September have markedly lower ozone averages (23.6 to 31.4 ppb), lower temperatures, and stronger winds (above 10 mph). This pattern is consistent with hot temperatures and calm air promoting ozone buildup, while windier or cooler conditions limit ozone accumulation.

Submission Requirements

Publish your report to RPubs and submit the link on Blackboard.