Objective:

You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.

Dataset

Source: Built-in R dataset airquality.

Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.

Variables (selected for this assignment):

Notes

-The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.

-If you encounter errors, check that tidyverse and corrplot are installed and loaded.

-Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
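
As a minimal sketch of what the first note means in practice, the lines below compare a mean computed with and without na.rm = TRUE, and a correlation computed with use = "complete.obs". It uses base R only. The names aq, ozone_mean_raw, ozone_mean_obs, and ozone_temp_r are illustrative and not part of the assignment.

```r
# Work on a copy so the original airquality object is left untouched
aq <- datasets::airquality

# Without na.rm, a single missing value makes the result NA
ozone_mean_raw <- mean(aq$Ozone)

# na.rm = TRUE drops the missing values before averaging
ozone_mean_obs <- mean(aq$Ozone, na.rm = TRUE)

# use = "complete.obs" keeps only rows where both variables are observed
ozone_temp_r <- cor(aq$Ozone, aq$Temp, use = "complete.obs")

print(c(raw = ozone_mean_raw, observed = ozone_mean_obs, r = ozone_temp_r))
```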

Instructions:

Complete the following tasks using R to analyze the airquality dataset. Submit your Rpubs link that includes code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.

#Load your dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data("airquality")




Tasks and Questions

Task 1: Measures of Central Tendency and Spread

Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.

colSums(is.na(airquality))
##   Ozone Solar.R    Wind    Temp   Month     Day 
##      37       7       0       0       0       0
#airquality <- airquality %>%
  #mutate(Ozone_imputed = if_else(
    #is.na(Ozone), mean(Ozone, na.rm = TRUE), Ozone))

summary_stats_ozone <- airquality |>
  summarise(
    # Columns are referenced directly; summarise() looks them up in the piped data
    Mean_Ozone = mean(Ozone, na.rm = TRUE),
    Median_Ozone = median(Ozone, na.rm = TRUE),
    SD_Ozone = sd(Ozone, na.rm = TRUE),
    Min_Ozone = min(Ozone, na.rm = TRUE),
    Max_Ozone = max(Ozone, na.rm = TRUE))

summary_stats_ozone
##   Mean_Ozone Median_Ozone SD_Ozone Min_Ozone Max_Ozone
## 1   42.12931         31.5 32.98788         1       168
summary_stats_temp <- airquality |>
  summarise(
    Mean_Temp = mean(Temp, na.rm = TRUE),
    Median_Temp = median(Temp, na.rm = TRUE),
    SD_Temp = sd(Temp, na.rm = TRUE),
    Min_Temp = min(Temp, na.rm = TRUE),
    Max_Temp = max(Temp, na.rm = TRUE))

summary_stats_temp
##   Mean_Temp Median_Temp SD_Temp Min_Temp Max_Temp
## 1  77.88235          79 9.46527       56       97
summary_stats_wind <- airquality |>
  summarise(
    Mean_Wind = mean(Wind, na.rm = TRUE),
    Median_Wind = median(Wind, na.rm = TRUE),
    SD_Wind = sd(Wind, na.rm = TRUE),
    Min_Wind = min(Wind, na.rm = TRUE),
    Max_Wind = max(Wind, na.rm = TRUE))

summary_stats_wind
##   Mean_Wind Median_Wind  SD_Wind Min_Wind Max_Wind
## 1  9.957516         9.7 3.523001      1.7     20.7

Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?

The mean is greater than the median for the ozone variable (42.1 vs. 31.5), indicating a right-skewed distribution. The mean is slightly less than the median for the temp variable (77.9 vs. 79), indicating a slight left skew. The wind variable’s mean is slightly greater than its median (9.96 vs. 9.7), but the two are close enough to suggest a roughly symmetric distribution that is not heavily skewed one way or the other.

The standard deviation for the ozone variable is ~33 indicating that the data is widely scattered around the mean. The temp variable has a standard deviation of ~9.5, indicating a fair amount of spread but less than the ozone variable. The wind variable has a standard deviation of ~3.5, indicating a fairly low spread of data around the mean.
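
The comparison above can be put on a common footing with a short sketch. It computes the gap between mean and median (a rough sign check for skew) and the coefficient of variation (SD divided by mean) for each variable. The helper cv and the object spread_check are hypothetical names introduced here. The coefficient of variation for Temp should be read loosely, since the Fahrenheit scale has an arbitrary zero.

```r
# Work on a copy so the original airquality object is left untouched
aq <- datasets::airquality

# Coefficient of variation: SD relative to the mean (unitless)
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

# One column per variable; rows are the two diagnostics
spread_check <- sapply(aq[c("Ozone", "Temp", "Wind")], function(x) {
  c(mean_minus_median = mean(x, na.rm = TRUE) - median(x, na.rm = TRUE),
    cv = cv(x))
})

print(round(spread_check, 2))
```

A positive mean_minus_median points toward a right skew and a negative one toward a left skew. A larger cv means more spread relative to the variable's typical size.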


Task 2: Histogram

Generate the histogram for Ozone.

hist(airquality$Ozone, main = "Histogram of Ozone Levels", 
     xlab = "Ozone (ppb)", col = "lightblue", breaks = 20)

Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?

The distribution is right-skewed, with most of the ozone values clustered below 50 and a long tail extending toward the higher values. It is unimodal, with its peak around 10-20. There are some outliers above 150.
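
One way to check the outlier reading numerically is the 1.5 × IQR rule, sketched below. This is only one common convention, and it may flag a different set of points than a visual read of the histogram. The names ozone, upper_fence, and ozone_outliers are illustrative.

```r
# Work on a copy so the original airquality object is left untouched
aq <- datasets::airquality

# Keep only the observed ozone values
ozone <- aq$Ozone[!is.na(aq$Ozone)]

# Upper fence: third quartile plus 1.5 times the interquartile range
q3 <- quantile(ozone, 0.75, names = FALSE)
upper_fence <- q3 + 1.5 * IQR(ozone)

# Values beyond the fence are flagged as potential outliers
ozone_outliers <- sort(ozone[ozone > upper_fence])

print(upper_fence)
print(ozone_outliers)
```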


Task 3: Boxplot

Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.

airquality <- airquality %>%
  mutate(month_name = case_when(
    Month == 5 ~ "May",
    Month == 6 ~ "June",
    Month == 7 ~ "July",
    Month == 8 ~ "August",
    Month == 9 ~ "September"
  ))

airquality$month_name <- factor(airquality$month_name, levels = c("May", "June", "July", "August", "September"))

ggplot(airquality, aes(x = month_name, y = Ozone)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(title = "Ozone Levels by Month", 
       x = "Month", y = "Ozone Levels") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?

Ozone levels vary considerably across the months, peaking in July and August, while May, June, and September show lower levels. July has the highest median, meaning its typical daily ozone concentration was the highest of the five months. There are noticeable outliers in May, June, August, and September, with September having the most. These outliers might indicate short, transient events that spiked ozone levels.
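
The medians read off the boxplot can be confirmed with a short base R sketch. It groups by the original numeric Month column (5 to 9), so it does not depend on month_name. The name ozone_median_by_month is illustrative.

```r
# Work on a copy so the original airquality object is left untouched
aq <- datasets::airquality

# Median ozone per month, ignoring missing readings
ozone_median_by_month <- tapply(aq$Ozone, aq$Month, median, na.rm = TRUE)

print(ozone_median_by_month)

# Month number (as a string) with the highest median
print(names(which.max(ozone_median_by_month)))
```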


Task 4: Scatterplot

Produce the scatterplot of Temp vs. Ozone, colored by Month.

ggplot(airquality, aes(x = Temp, y = Ozone, color = factor(month_name))) +
  geom_point(size = 3) +
  scale_color_manual(values = c(
      "May" = "red",      
      "June" = "orange",     
      "July" = "yellow",   
      "August" = "green",   
      "September" = "blue" 
  ), labels = c("May", "June", "July", "August", "September")) +
  labs(title = "Ozone Levels vs. Daily Temperature by Month",
       x = "Temperature (F)", y = "Ozone",
       color = "Month") +
  theme_minimal()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.

There is a noticeable positive relationship between temperature and ozone levels: apart from a few extreme outliers, points further to the right (warmer days) tend to sit higher (more ozone). There is also some evidence of clustering by month. May and June generally show lower ozone values and cooler temperatures, July and August show higher ozone levels and warmer temperatures, and September shifts back toward lower temperatures and lower ozone levels. The outliers suggest that conditions other than temperature can cause transient bursts of ozone.
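
As a rough numeric companion to the scatterplot, a simple linear fit gives the average change in ozone per degree of temperature. This is a sketch under the assumption that a straight line is an adequate summary, which the curvature and outliers in the plot may not fully support. The names fit and temp_slope are illustrative.

```r
# Work on a copy so the original airquality object is left untouched
aq <- datasets::airquality

# lm() drops rows with missing Ozone by default (na.action = na.omit)
fit <- lm(Ozone ~ Temp, data = aq)

# Estimated change in ozone for each additional degree Fahrenheit
temp_slope <- coef(fit)[["Temp"]]

print(temp_slope)
print(nobs(fit))  # number of days actually used in the fit
```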


Task 5: Correlation Matrix

Compute and visualize the correlation matrix for Ozone, Temp, and Wind.

cor_matrix <- cor(
  airquality |>
    select(Ozone, Temp, Wind), use = "complete.obs")

cor_matrix
##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45, addCoef.col = "black",
         title = "Correlation Matrix of Ozone, Temp, and Wind",
         mar = c(0, 0, 2, 0))  # top margin so the title is not clipped

Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.

Ozone and temperature show the strongest correlation, with an r value of 0.70. This means that as temperature increases, ozone levels tend to increase as well, as seen in the previous scatterplot. The weakest correlation is between wind and temperature, with an r value of -0.51. This suggests a moderate negative relationship, indicating that higher wind speeds are somewhat associated with lower temperatures. Ozone and wind are also negatively correlated, with an r value of -0.60, so ozone is more strongly correlated with temperature than with wind speed. Higher wind speeds are associated with lower ozone values, although correlation alone does not show that wind causes the decrease.
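
To gauge how much uncertainty sits behind these r values, cor.test() from base R reports a confidence interval alongside each Pearson correlation. The sketch below is one way to do this. The names ozone_temp_test and ozone_wind_test are illustrative.

```r
# Work on a copy so the original airquality object is left untouched
aq <- datasets::airquality

# cor.test() uses only the pairs where both values are observed
ozone_temp_test <- cor.test(aq$Ozone, aq$Temp)
ozone_wind_test <- cor.test(aq$Ozone, aq$Wind)

# Point estimate and 95% confidence interval for each pair
print(ozone_temp_test$estimate)
print(ozone_temp_test$conf.int)
print(ozone_wind_test$estimate)
print(ozone_wind_test$conf.int)
```

An interval that excludes zero supports the direction of the relationship, but it says nothing about cause.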


Task 6: Summary Table

Generate the summary table grouped by Month. It should include the count, mean ozone, mean temperature, and mean wind speed per month.

summary_table <- airquality |>
  group_by(month_name) |>
  summarise(
    Count = n(),
    Avg_Ozone = mean(Ozone, na.rm = TRUE),
    Avg_Temp = mean(Temp, na.rm = TRUE),
    Avg_Wind = mean(Wind, na.rm = TRUE)
  ) 

summary_table
## # A tibble: 5 × 5
##   month_name Count Avg_Ozone Avg_Temp Avg_Wind
##   <fct>      <int>     <dbl>    <dbl>    <dbl>
## 1 May           31      23.6     65.5    11.6 
## 2 June          30      29.4     79.1    10.3 
## 3 July          31      59.1     83.9     8.94
## 4 August        31      60.0     84.0     8.79
## 5 September     30      31.4     76.9    10.2

Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?

August has the highest average ozone level, at ~60. Temperatures increase from May through August and then drop in September. Wind speeds are highest in May, June, and September, with July and August showing a decline. This suggests that higher temperatures and lower wind speeds are associated with higher ozone levels. Hot summer conditions seem to exacerbate ozone levels compared to the cooler, windier months.
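
One caveat when reading the table: Count is the number of days in each month, not the number of ozone readings behind Avg_Ozone, because missing values are dropped by na.rm = TRUE. A short sketch to count the observed readings per month follows. It groups by the numeric Month column, and the names ozone_n_by_month and days_by_month are illustrative.

```r
# Work on a copy so the original airquality object is left untouched
aq <- datasets::airquality

# Number of non-missing ozone readings in each month
ozone_n_by_month <- tapply(!is.na(aq$Ozone), aq$Month, sum)

# Number of days (rows) in each month, for comparison
days_by_month <- tapply(aq$Ozone, aq$Month, length)

print(rbind(days = days_by_month, ozone_readings = ozone_n_by_month))
```

Months with few readings give a less reliable average, so differences between months should be interpreted with that in mind.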