Objective:

You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.

Dataset

Source: Built-in R dataset airquality.

Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.

Variables (selected for this assignment):

Notes

-The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.

-If you encounter errors, check that tidyverse and corrplot are installed and loaded.

-Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
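
The effect of the two missing-value options mentioned in the notes can be seen in a minimal sketch. It uses only base R, and the object name `missing_per_column` is made up for illustration.

```r
# Load the built-in dataset
data("airquality")

# Count the missing values in each column
missing_per_column <- colSums(is.na(airquality))
missing_per_column

# Without na.rm = TRUE, a single NA makes the whole result NA
mean(airquality$Ozone)

# With na.rm = TRUE, the NAs are dropped before the mean is computed
mean(airquality$Ozone, na.rm = TRUE)

# use = "complete.obs" keeps only rows where both variables are present
cor(airquality$Ozone, airquality$Temp, use = "complete.obs")
```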

Instructions:

Complete the following tasks using R to analyze the airquality dataset. Submit your RPubs link that includes code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.

#Load your dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data("airquality")

Tasks and Questions

Task 1: Measures of Central Tendency and Spread

Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.

#Your code for Ozone goes here

oz_summary_stats <- airquality |>
  select(Ozone) |>
  summarise(
    mean_oz = mean(Ozone, na.rm = TRUE),
    median_oz = median(Ozone, na.rm = TRUE),
    sd_oz = sd(Ozone, na.rm = TRUE),
    min_oz = min(Ozone, na.rm = TRUE),
    max_oz = max(Ozone, na.rm = TRUE)
  )

oz_summary_stats
##    mean_oz median_oz    sd_oz min_oz max_oz
## 1 42.12931      31.5 32.98788      1    168
#Your code for Temp goes here
temp_summary_stats <- airquality |>
  select(Temp) |>
  summarise(
    mean_temp = mean(Temp, na.rm = TRUE),
    median_temp = median(Temp, na.rm = TRUE),
    sd_temp = sd(Temp, na.rm = TRUE),
    min_temp = min(Temp, na.rm = TRUE),
    max_temp = max(Temp, na.rm = TRUE)
  )

temp_summary_stats
##   mean_temp median_temp sd_temp min_temp max_temp
## 1  77.88235          79 9.46527       56       97
#Your code for Wind goes here
wind_summary_stats <- airquality |>
  select(Wind) |>
  summarise(
    mean_wind = mean(Wind, na.rm = TRUE),
    median_wind = median(Wind, na.rm = TRUE),
    sd_wind = sd(Wind, na.rm = TRUE),
    min_wind = min(Wind, na.rm = TRUE),
    max_wind = max(Wind, na.rm = TRUE)
  )

wind_summary_stats
##   mean_wind median_wind  sd_wind min_wind max_wind
## 1  9.957516         9.7 3.523001      1.7     20.7

Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?

Ozone- The mean (42.13) is well above the median (31.5), which indicates a right-skewed distribution: a small number of high readings pull the mean upward. The standard deviation of approximately 32.99 is large relative to the mean, so ozone readings vary widely from day to day.

Temp- The mean (77.88) and median (79) are close together, so the distribution is roughly symmetric, with the slightly lower mean hinting at a mild left skew. The standard deviation of approximately 9.47 is small relative to the mean, suggesting moderate variability.

Wind- The mean (9.96) and median (9.7) are nearly equal, so the distribution is not strongly skewed. The standard deviation of approximately 3.52 suggests fairly low variability in absolute terms, although it is about a third of the mean.
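
The mean-versus-median comparison above can also be done for all three variables in one table. The following is a rough sketch, not part of the required answer: the column names `mean_minus_median` and `cv` are invented here, and the coefficient of variation is only a loose guide for Temp, since Fahrenheit has no true zero.

```r
library(dplyr)
library(tidyr)

data("airquality")

# Stack the three variables into one long column so they share one summarise() call
spread_summary <- airquality |>
  select(Ozone, Temp, Wind) |>
  pivot_longer(everything(), names_to = "variable", values_to = "value") |>
  group_by(variable) |>
  summarise(
    mean_value = mean(value, na.rm = TRUE),
    median_value = median(value, na.rm = TRUE),
    sd_value = sd(value, na.rm = TRUE),
    # Positive gap: mean above median (right skew); negative gap: left skew
    mean_minus_median = mean_value - median_value,
    # Coefficient of variation: spread relative to the mean
    cv = sd_value / mean_value
  )

spread_summary
```

A larger `cv` means more spread relative to the typical value, which makes variables measured in different units easier to compare than raw standard deviations.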

Task 2: Histogram

Generate the histogram for Ozone.

#Your code goes here

library(ggplot2)

ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 20, fill = "#1f77b4", color = "black") +
  labs(title = "Histogram of Ozone Levels", x = "Ozone (ppb)", y = "Count") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).

Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?

The distribution is right-skewed, with its main peak at low ozone values: most readings fall in the lower part of the range, and the counts taper off into a long right tail. The few days with very high readings, up to the maximum of 168 ppb, sit far from the bulk of the data and are potential outliers. The warning also shows that the 37 days with missing ozone values are excluded from the histogram.
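
Whether a high reading counts as an outlier can be checked numerically rather than by eye. The sketch below applies the common 1.5 × IQR rule, which is one convention among several and is an assumption here, not something the assignment specifies. The names `ozone`, `upper_fence`, and `high_outliers` are illustrative.

```r
data("airquality")

# Keep only the observed ozone readings
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Upper fence of the 1.5 * IQR rule
q3 <- quantile(ozone, 0.75)
upper_fence <- q3 + 1.5 * IQR(ozone)

# Readings above the fence are flagged as potential high outliers
high_outliers <- ozone[ozone > upper_fence]
sort(high_outliers)
```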

Task 3: Boxplot

Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.

# Your code here
library(ggplot2)

airquality_data <- airquality %>%
  mutate(
    month_name = case_when(
      Month == 5 ~ "May",
      Month == 6 ~ "June",
      Month == 7 ~ "July",
      Month == 8 ~ "August",
      Month == 9 ~ "September",
    )
  )

airquality_data$month_name <- factor(
  airquality_data$month_name,
  levels = c("May", "June", "July", "August", "September")
)

ggplot(airquality_data, aes(x = month_name, y = Ozone)) +
  geom_boxplot() +
  labs(
    title = "Ozone Levels by Month",
    x = "Month",
    y = "Ozone Level"
  ) +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?

Ozone levels are lowest in May and June, rise sharply in July, decline somewhat in August, and fall back toward early-season levels in September. July has the highest median ozone. Outliers appear in May, June, August, and September; they mark individual days whose ozone was unusually high for that month rather than a shift in the month's typical level.
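
The medians read off the boxplot can be confirmed with a short table. This is a minimal sketch that groups by the original numeric Month column, so it runs without the recoding step. The names `monthly_medians`, `median_ozone`, and `iqr_ozone` are made up for illustration.

```r
library(dplyr)

data("airquality")

# Median and interquartile range of ozone for each month, highest median first
monthly_medians <- airquality |>
  group_by(Month) |>
  summarise(
    median_ozone = median(Ozone, na.rm = TRUE),
    iqr_ozone = IQR(Ozone, na.rm = TRUE)
  ) |>
  arrange(desc(median_ozone))

monthly_medians
```

The first row is the month with the highest median, and `iqr_ozone` corresponds to the height of each box in the plot.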

Task 4: Scatterplot

Produce the scatterplot of Temp vs. Ozone, colored by Month.

# Your code goes here

# Create the scatterplot
ggplot(airquality_data, aes(x = Temp, y = Ozone, color = month_name)) +
  geom_point(size = 3) +
  scale_color_manual(values = c("#EE3B3B", "#9BCD9B", "#ff7f0e", "#1f77b4", "#9f77b4"), labels = c("May", "June", "July", "August", "September")) +
  labs(title = "Temperature vs. Ozone",
       x = "Temperature",
       y = "Ozone",
       color = "Month") +
  theme_minimal()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.

There is a visible positive relationship between temperature and ozone: ozone readings trend upward as temperature rises. The points also cluster by month. Adjacent months overlap in both temperature and ozone, with the cooler months grouped at lower ozone levels and the hotter months, July and August, grouped at higher temperatures and higher ozone levels.
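
One rough way to put a number on the upward trend is a simple linear fit of Ozone on Temp. This sketch assumes a straight-line relationship, which the scatterplot may only partly support, so the slope is a summary of direction rather than a precise model. The name `temp_fit` is illustrative.

```r
data("airquality")

# With the default na.action, lm() drops rows where Ozone is missing
temp_fit <- lm(Ozone ~ Temp, data = airquality)

# The Temp coefficient is the estimated change in ozone per one-degree rise
coef(temp_fit)

# Number of days actually used in the fit
nobs(temp_fit)
```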

Task 5: Correlation Matrix

Compute and visualize the correlation matrix for Ozone, Temp, and Wind.

# Your code goes here

cor_matrix <- cor(
  airquality |>
    select(Ozone, Temp, Wind), use = "complete.obs")

cor_matrix
##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000

Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.

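Ozone is more strongly correlated with temperature (r ≈ 0.70) than with wind speed (r ≈ -0.60). The strongest correlation is the positive one between ozone and temperature: hotter days tend to have higher ozone. The negative correlation between ozone and wind is moderate, so windier days tend to have lower ozone. The weakest correlation is between temperature and wind (r ≈ -0.51), which is still a moderate negative association. These values describe linear association only and do not by themselves show that one variable causes changes in another.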
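
Task 5 also asks for a visualization of the matrix. A minimal sketch using corrplot follows; the styling arguments (`method`, `type`, `addCoef.col`, `tl.col`) are one possible choice, not a requirement of the assignment. The matrix is rebuilt with base R indexing so the block runs on its own.

```r
library(corrplot)

data("airquality")

# Same matrix as above, using only rows with no missing values
cor_matrix <- cor(
  airquality[, c("Ozone", "Temp", "Wind")],
  use = "complete.obs"
)

# Colored tiles with the coefficients printed on top, upper triangle only
corrplot(
  cor_matrix,
  method = "color",
  type = "upper",
  addCoef.col = "black",
  tl.col = "black"
)
```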

Task 6: Summary Table

Generate the summary table grouped by Month. It should include the count, mean ozone, mean temperature, and mean wind speed for each month.

# your code goes here

summary_table <- airquality_data |>
  group_by(month_name) |>
  summarise(
    Count = n(),
    avg_ozone = mean(Ozone, na.rm = TRUE),
    avg_temp = mean(Temp, na.rm = TRUE),
    avg_wind = mean(Wind, na.rm = TRUE),
  )
  
print(summary_table)
## # A tibble: 5 × 5
##   month_name Count avg_ozone avg_temp avg_wind
##   <fct>      <int>     <dbl>    <dbl>    <dbl>
## 1 May           31      23.6     65.5    11.6 
## 2 June          30      29.4     79.1    10.3 
## 3 July          31      59.1     83.9     8.94
## 4 August        31      60.0     84.0     8.79
## 5 September     30      31.4     76.9    10.2

Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?

August has the highest average ozone level (60.0) and the highest average temperature (84.0), followed closely by July on both measures (59.1 and 83.9). These two months also have the lowest average wind speeds (8.79 and 8.94), while May has the highest (11.6), followed by June (10.3) and September (10.2). This pattern suggests a positive relationship between temperature and ozone and a negative relationship between wind and ozone. September is cooler than June (76.9 vs. 79.1) yet has slightly higher average ozone (31.4 vs. 29.4); its marginally lower average wind may contribute, but the gap is small and the monthly ozone averages are computed only from days with non-missing readings.
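
Because Count in the table is the number of days in each month, not the number of days with an ozone reading, it can help to check how many readings each monthly average rests on. The sketch below is an optional addition; `ozone_coverage`, `days`, and `ozone_days` are illustrative names.

```r
library(dplyr)

data("airquality")

# Days per month versus days that actually have an ozone reading
ozone_coverage <- airquality |>
  group_by(Month) |>
  summarise(
    days = n(),
    ozone_days = sum(!is.na(Ozone)),
    avg_ozone = mean(Ozone, na.rm = TRUE)
  )

ozone_coverage
```

A month with few `ozone_days` has a less reliable average, which is worth keeping in mind when comparing months whose averages are close.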

Submission Requirements

Publish your report to RPubs and submit the link on Blackboard.