Objective:

You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.

Dataset

Source: Built-in R dataset airquality.

Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.

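Variables (selected for this assignment):

-Ozone: mean ozone concentration, in parts per billion (contains missing values).

-Temp: maximum daily temperature, in degrees Fahrenheit.

-Wind: average wind speed, in miles per hour.

-Month: month of observation, coded 5 (May) through 9 (September).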

Notes

-The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.

-If you encounter errors, check that tidyverse and corrplot are installed and loaded.

-Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
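
The missing-value handling mentioned in the notes above can be sketched as follows. This is a minimal base R illustration, not part of the assignment template; the object names (na_counts, ozone_mean_raw, ozone_mean, ozone_temp_cor) are chosen here for illustration only.

```r
# Load the built-in dataset
data(airquality)

# Count missing values in each column (Ozone and Solar.R contain NAs)
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Without na.rm = TRUE, a single NA makes the whole result NA
ozone_mean_raw <- mean(airquality$Ozone)

# na.rm = TRUE drops the missing values before computing the mean
ozone_mean <- mean(airquality$Ozone, na.rm = TRUE)

# use = "complete.obs" keeps only the rows where both variables are present
ozone_temp_cor <- cor(airquality$Ozone, airquality$Temp, use = "complete.obs")

print(c(raw = ozone_mean_raw, na_removed = ozone_mean, cor = ozone_temp_cor))
```

Forgetting either argument does not raise an error; the result is simply NA, which is easy to miss in a longer summary table.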

Instructions:

Complete the following tasks using R to analyze the airquality dataset. Submit your RPubs link, which must include code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.

#Load your dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
library(dplyr)
library(ggplot2)

data("airquality")
view(airquality)

Tasks and Questions

Task 1: Measures of Central Tendency and Spread

Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.

#Your code for Ozone goes here
# Scope -> entire dataset
Ozone_summary <- airquality |> 
  summarise(
    Ozone_Mean = round(mean(Ozone, na.rm = TRUE), 0),
    Ozone_Median = round(median(Ozone, na.rm = TRUE), 0),
    Ozone_SD = round(sd(Ozone, na.rm = TRUE), 0),
    Ozone_Min = min(Ozone, na.rm = TRUE),
    Ozone_Max = max(Ozone, na.rm = TRUE)
  )

Ozone_summary
##   Ozone_Mean Ozone_Median Ozone_SD Ozone_Min Ozone_Max
## 1         42           32       33         1       168
#Your code for Temp goes here
Temp_summary <- airquality |> 
  summarise(
    Temp_Mean = round(mean(Temp, na.rm = TRUE), 0),
    Temp_Median = round(median(Temp, na.rm = TRUE), 0),
    Temp_SD = round(sd(Temp, na.rm = TRUE), 0),
    Temp_Min = min(Temp, na.rm = TRUE),
    Temp_Max = max(Temp, na.rm = TRUE)
  )

Temp_summary
##   Temp_Mean Temp_Median Temp_SD Temp_Min Temp_Max
## 1        78          79       9       56       97
#Your code for Wind goes here
Wind_summary <- airquality |> 
  summarise(
    Wind_Mean = round(mean(Wind, na.rm = TRUE), 0),
    Wind_Median = round(median(Wind, na.rm = TRUE), 0),
    Wind_SD = round(sd(Wind, na.rm = TRUE), 0),
    Wind_Min = min(Wind, na.rm = TRUE),
    Wind_Max = max(Wind, na.rm = TRUE)
  )

Wind_summary
##   Wind_Mean Wind_Median Wind_SD Wind_Min Wind_Max
## 1        10          10       4      1.7     20.7

Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?

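Answer: Comparing the mean and median of each variable indicates the shape of its distribution. For Ozone, the mean (42) is well above the median (32), which indicates positive (right) skew: a few high values pull the mean upward. Temp has a mean of 78 and a median of 79, so it is nearly symmetric, with at most a slight left skew. Wind has a mean and median that both round to 10, which suggests a roughly symmetric distribution; equal mean and median alone do not show that the data are normal. The standard deviation measures how far values typically fall from the mean, so a larger value means more variability. Ozone varies the most (SD 33, large relative to its mean of 42), while Temp (SD 9) and Wind (SD 4) vary less relative to their means.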
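
Rounding to whole numbers can hide small gaps between the mean and the median. As a rough sketch, assuming one decimal place is enough for this comparison, the same statistics can be computed for all three variables at once in base R. The helper describe() and the object names are hypothetical, introduced only for this example.

```r
data(airquality)

# Keep only the three variables of interest
vars <- airquality[, c("Ozone", "Temp", "Wind")]

# Hypothetical helper: the five statistics requested in Task 1
describe <- function(x) {
  c(
    mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE)
  )
}

# One row per variable, one column per statistic, rounded to one decimal
summary_table <- round(t(sapply(vars, describe)), 1)
print(summary_table)

# A positive gap points toward right skew, a negative gap toward left skew
mean_minus_median <- summary_table[, "mean"] - summary_table[, "median"]
print(mean_minus_median)
```

The gap between mean and median is only a quick indicator of skewness; the histogram in Task 2 is the more direct check.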

Task 2: Histogram

Generate the histogram for Ozone.

#Your code goes here
ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(fill = "#1f77b4", 
           color = "black",
           bins = 30) +
  labs(
    title = "Ozone Levels Measured",
    x = "Ozone level",
    y = "Count"
  ) +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).

Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?

Answer: The Ozone distribution is positively (right) skewed: most values cluster at the lower end, and a long tail extends to the right. A few unusually large observations in that tail, up to the maximum of 168, stand apart from the bulk of the data and contribute to the skewness.
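
One way to check which high values count as outliers, rather than judging by eye, is the 1.5 × IQR rule that boxplots use. The sketch below assumes that rule is an acceptable definition of "outlier" for this assignment; the names ozone, upper_fence, and flagged are illustrative.

```r
data(airquality)

# Drop missing values so quantiles and comparisons are well defined
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Upper fence: third quartile plus 1.5 times the interquartile range
quartiles <- quantile(ozone, c(0.25, 0.75))
upper_fence <- quartiles[[2]] + 1.5 * IQR(ozone)

# Observations above the fence are flagged as potential outliers
flagged <- sort(ozone[ozone > upper_fence])

print(upper_fence)
print(flagged)
```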

Task 3: Boxplot

Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.

# Your code here
month_name <- airquality |> 
    mutate(Month = as.factor(Month)) |>
    mutate(Month = fct_recode(Month,"May" = "5", "June" = "6", "July" = "7", "August" = "8",  "September" = "9"))

levels(month_name$Month)
## [1] "May"       "June"      "July"      "August"    "September"
ggplot(month_name, aes(x = Month,
                       y = Ozone)) +
  geom_boxplot(fill = "#1f77b4", 
           color = "black") +
  labs(
    title = "Ozone by Month",
    x = "Month",
    y = "Ozone level"
  ) +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?

Answer: Ozone levels rise from May to a peak in July and August and then fall in September, showing a clear seasonal pattern. The highest median ozone occurs in July, with August somewhat lower. Outliers appear in May, June, August, and September, suggesting occasional days whose conditions produced unusually high ozone concentrations.
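
The task statement asks for a new month_name column built with case_when, whereas the code above recodes Month in place with fct_recode. A minimal sketch of the case_when approach follows, together with the monthly medians that the question asks about. The data frame name airquality_named and the column ozone_median are hypothetical, chosen for this example.

```r
library(dplyr)

data(airquality)

# airquality_named is a hypothetical name for the recoded copy
airquality_named <- airquality |>
  mutate(
    # Map the numeric month codes to names
    month_name = case_when(
      Month == 5 ~ "May",
      Month == 6 ~ "June",
      Month == 7 ~ "July",
      Month == 8 ~ "August",
      Month == 9 ~ "September"
    ),
    # Fix the level order so plots follow the calendar, not the alphabet
    month_name = factor(
      month_name,
      levels = c("May", "June", "July", "August", "September")
    )
  )

# Median ozone per month, ignoring missing values
monthly_medians <- airquality_named |>
  group_by(month_name) |>
  summarise(ozone_median = median(Ozone, na.rm = TRUE))

print(monthly_medians)
```

Passing airquality_named to ggplot() with aes(x = month_name, y = Ozone) should reproduce the boxplot above. Without the factor() step, the months would be ordered alphabetically on the axis.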

Task 4: Scatterplot

Produce the scatterplot of Temp vs. Ozone, colored by Month.

# Your code goes here
ggplot(month_name, aes(x = Temp, y = Ozone, color = Month)) +
  geom_point(size = 3, alpha = 0.8) +
  labs(
    title = "Ozone vs Temperature by Month",
    x = "Temperature",
    y = "Ozone",
    color = "Month"
  ) +
  theme_minimal()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.

Answer: The scatterplot shows a positive relationship between temperature and ozone levels: as temperature increases, ozone levels also rise. July and August form clear clusters at higher temperatures and ozone values, indicating that warmer months tend to have higher ozone concentrations. August shows a larger range, suggesting greater variability during that month.

Task 5: Correlation Matrix

Compute and visualize the correlation matrix for Ozone, Temp, and Wind.

# Your code goes here
# using only rows of observations without NAs
df <- airquality |>
  select(Ozone, Temp, Wind) |>
  filter(complete.cases(across(everything()))) |>
  cor()

df
##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000
corrplot(df, method = "color", addCoef.col = "black",
         tl.col = "black", number.cex = 1.2)

Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.

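Answer: Excluding each variable's correlation with itself, the strongest correlation is between Ozone and Temp (about 0.70), a fairly strong positive relationship: higher temperatures are generally associated with higher ozone levels. Ozone is also negatively correlated with Wind (about -0.60), so windier days tend to have lower ozone; ozone is therefore more strongly correlated with temperature than with wind speed. The weakest correlation is between Temp and Wind (about -0.51), which is still a moderate negative relationship. These values describe association only and do not by themselves establish causation.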
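
Reading the strongest and weakest pairs off a matrix by eye is error-prone, because the sign can distract from the magnitude. As a sketch in base R, the off-diagonal correlations can be listed and sorted by absolute value. The names cor_matrix, idx, and cor_pairs are illustrative; use = "complete.obs" is assumed here as the NA handling, which drops the same rows as the complete.cases filter above.

```r
data(airquality)

# Correlation matrix on rows with no missing values in the three variables
cor_matrix <- cor(airquality[, c("Ozone", "Temp", "Wind")], use = "complete.obs")

# Positions of the upper triangle, so each pair appears exactly once
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)

# One row per variable pair with its correlation coefficient
cor_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)

# Sort by strength of association, ignoring the sign
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs, row.names = FALSE)
```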

Task 6: Summary Table

Generate a summary table grouped by Month. It should include the count of observations and the mean ozone, mean temperature, and mean wind speed for each month.

# your code goes here
# Scope -> individual months -> group_by
df_Month_summary <- month_name |>
  group_by(Month) |>
  summarise(
    count = n(),
    Ozone_mean = round(mean(Ozone, na.rm = TRUE),0),
    Temp_mean = round(mean(Temp, na.rm = TRUE),0),
    Wind_mean = round(mean(Wind, na.rm = TRUE),0)
  )

df_Month_summary
## # A tibble: 5 × 5
##   Month     count Ozone_mean Temp_mean Wind_mean
##   <fct>     <int>      <dbl>     <dbl>     <dbl>
## 1 May          31         24        66        12
## 2 June         30         29        79        10
## 3 July         31         59        84         9
## 4 August       31         60        84         9
## 5 September    30         31        77        10

Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?

Answer: August has the highest average ozone level (60), with July close behind (59). Average temperature rises from May (66) to July and August (84 each) and then drops in September (77). Average wind speed falls from May (12) to its lowest values in July and August (9 each) before rising slightly in September (10). These patterns are consistent with warmer temperatures and lower wind speeds contributing to higher ozone levels, since heat and stagnant air can promote ozone formation.
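
Two details of the table above are worth checking: n() counts every day in the month, including days with a missing Ozone reading, and whole-number rounding makes July and August look almost tied. The following sketch, with hypothetical column names (days, ozone_days, and so on), adds a count of non-missing Ozone values and keeps one decimal place.

```r
library(dplyr)

data(airquality)

monthly_detail <- airquality |>
  group_by(Month) |>
  summarise(
    days       = n(),                  # all rows in the month
    ozone_days = sum(!is.na(Ozone)),   # rows with an Ozone reading
    ozone_mean = round(mean(Ozone, na.rm = TRUE), 1),
    temp_mean  = round(mean(Temp), 1), # Temp and Wind have no missing values
    wind_mean  = round(mean(Wind), 1)
  )

print(monthly_detail)
```

A monthly ozone mean based on few readings is less reliable than the day count alone suggests, which is worth keeping in mind when comparing months.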

Submission Requirements

Publish your work to RPubs and submit the link on Blackboard.