Objective:

You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.

Dataset

Source: Built-in R dataset airquality.

Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.

Variables (selected for this assignment):

Notes

-The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.

-If you encounter errors, check that tidyverse and corrplot are installed and loaded.

-Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
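To see how much data the missing-value note above refers to, a short check like the following can help. It is a minimal sketch in base R; the object names na_counts and n_cor_rows are illustrative and not part of the assignment.

```r
data("airquality")

# Count missing values in every column
na_counts <- colSums(is.na(airquality))
na_counts

# Without na.rm = TRUE, a single NA makes the whole result NA
mean(airquality$Ozone)
mean(airquality$Ozone, na.rm = TRUE)

# Number of rows that use = "complete.obs" would keep
# when correlating Ozone, Temp, and Wind
n_cor_rows <- sum(complete.cases(airquality[, c("Ozone", "Temp", "Wind")]))
n_cor_rows
```

Because na.rm = TRUE drops values one variable at a time while use = "complete.obs" drops whole rows, the two approaches can be working with different numbers of observations.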

Instructions:

Complete the following tasks using R to analyze the airquality dataset. Submit your Rpubs link that includes code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.

#Load your dataset

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.3
## corrplot 0.95 loaded
data("airquality")

Tasks and Questions

Task 1: Measures of Central Tendency and Spread

Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.

ozone_summary_stats <- airquality |>
  summarise(
    mean_ozone = mean(Ozone, na.rm = TRUE),
    median_ozone = median(Ozone, na.rm = TRUE),
    sd_ozone = sd(Ozone, na.rm = TRUE),
    min_ozone = min(Ozone, na.rm = TRUE),
    max_ozone = max(Ozone, na.rm = TRUE))

ozone_summary_stats
##   mean_ozone median_ozone sd_ozone min_ozone max_ozone
## 1   42.12931         31.5 32.98788         1       168
temp_summary_stats <- airquality |>
  summarise(
    mean_temp = mean(Temp, na.rm = TRUE),
    median_temp = median(Temp, na.rm = TRUE),
    sd_temp = sd(Temp, na.rm = TRUE),
    min_temp = min(Temp, na.rm = TRUE),
    max_temp = max(Temp, na.rm = TRUE))

temp_summary_stats
##   mean_temp median_temp sd_temp min_temp max_temp
## 1  77.88235          79 9.46527       56       97
wind_summary_stats <- airquality |>
  summarise(
    mean_wind = mean(Wind, na.rm = TRUE),
    median_wind = median(Wind, na.rm = TRUE),
    sd_wind = sd(Wind, na.rm = TRUE),
    min_wind = min(Wind, na.rm = TRUE),
    max_wind = max(Wind, na.rm = TRUE))

wind_summary_stats
##   mean_wind median_wind  sd_wind min_wind max_wind
## 1  9.957516         9.7 3.523001      1.7     20.7

Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?

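The mean wind speed (9.96) is only a bit higher than the median (9.7), so the wind data is close to symmetric with a mild right skew. For ozone, the mean (42.13) is well above the median (31.5), so that one is clearly skewed to the right by a few very high readings. On the other hand, the median temperature (79) is slightly higher than the mean (77.88), suggesting the temperature data is mildly left-skewed. The standard deviation shows how far values typically fall from the mean: ozone is the most variable relative to its mean (SD 32.99), wind is moderately variable (SD 3.52), and temperature is the most consistent (SD 9.47 on a mean of about 78).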
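The comparison above can be made more compact by putting the three variables side by side. The following is a rough sketch in base R, not the method required by the assignment; the helper name describe_var and the two extra rows (mean minus median, and SD divided by mean) are assumptions added here for illustration.

```r
data("airquality")

# Hypothetical helper: summary statistics for one numeric vector
describe_var <- function(x) {
  m   <- mean(x, na.rm = TRUE)
  med <- median(x, na.rm = TRUE)
  s   <- sd(x, na.rm = TRUE)
  c(mean = m,
    median = med,
    sd = s,
    mean_minus_median = m - med,  # positive hints at right skew
    cv = s / m)                   # spread relative to the mean
}

# Apply the helper to each selected column; result is a matrix
stats <- sapply(airquality[, c("Ozone", "Temp", "Wind")], describe_var)
round(stats, 2)
```

A positive mean_minus_median is only a quick hint of right skew, and the histogram in Task 2 is a better check of the actual shape.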

Task 2: Histogram

Generate the histogram for Ozone.

hist(airquality$Ozone, main = "Histogram of Ozone Levels", 
     xlab = "Value", col = "indianred1", breaks = 20)

Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?

In this plot I see that the ozone data is skewed to the right: most values sit at the low end and there is a longer tail toward higher levels. There are a few unusually high readings (the maximum is 168), which is also why the mean is higher than the median.
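One way to back up the visual impression of outliers is to list the values that a boxplot would flag. The sketch below assumes the common 1.5 × IQR convention, which the assignment does not prescribe; the names ozone_outliers and share_above_mean are illustrative.

```r
data("airquality")

# Keep only the observed ozone readings
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# boxplot.stats() returns, in $out, the points lying more than
# 1.5 * IQR beyond the hinges (the default coef = 1.5)
ozone_outliers <- boxplot.stats(ozone)$out
ozone_outliers

# Share of days with ozone above the mean; a value below 0.5
# is consistent with a right-skewed distribution
share_above_mean <- mean(ozone > mean(ozone))
share_above_mean
```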

Task 3: Boxplot

Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.

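ozone_month <- airquality |>
  mutate(
    # Month is stored as an integer, so compare against numbers
    month_name = case_when(
      Month == 5 ~ "May",
      Month == 6 ~ "June",
      Month == 7 ~ "July",
      Month == 8 ~ "August",
      Month == 9 ~ "September"
    ),
    # Set the factor order so plots show months chronologically, not alphabetically
    month_name = factor(month_name,
                        levels = c("May", "June", "July", "August", "September"))
  )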

#plot
ggplot(ozone_month, aes(x = month_name, y = Ozone)) +
  geom_boxplot(fill = "palegreen2", color = "black") +
  labs(title = "Ozone Levels by Month", 
       x = "Month", y = "Ozone Levels") 
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?

Ozone levels rise from May to a peak in July and August and then fall again in September. July has the highest median ozone level and May has the lowest, while August has the highest single reading. The outliers I see are in May, June, and September, meaning that on a few days in those months ozone was unusually high compared to the month's normal range.
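Reading medians off a boxplot is approximate, so the ranking above can be checked numerically. This is a small base R sketch; monthly_median and top_month are illustrative names, and month.name is R's built-in vector of English month names.

```r
data("airquality")

# With the formula interface, aggregate() drops rows where Ozone is NA
monthly_median <- aggregate(Ozone ~ Month, data = airquality, FUN = median)

# Attach readable month names using the built-in month.name constant
monthly_median$month_name <- month.name[monthly_median$Month]
monthly_median

# Month with the highest median ozone
top_month <- monthly_median$month_name[which.max(monthly_median$Ozone)]
top_month
```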

Task 4: Scatterplot

Produce the scatterplot of Temp vs. Ozone, colored by Month.

ggplot(ozone_month, aes(x = Temp, y = Ozone, color = month_name)) +
  geom_point(size = 1) +
  labs(
    title = "Scatterplot of Temp vs. Ozone",
    x = "Temp",
    y = "Ozone Levels",
    color = "Month"
  ) + 
  theme_minimal()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.

There is a clear positive relationship: ozone tends to rise as temperature rises, and warmer months usually have higher ozone levels. Some months cluster together, like May, whose points sit mostly at the cooler, low-ozone end. There are a few outliers when the temperature is around 80 or above, especially in May, June, July, and August, where the ozone levels are much higher than what is normal for those months.
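To put a rough number on the trend described above, a straight-line fit can be used. This is only a sketch: treating the relationship as linear is an assumption, and the scatterplot may suggest some curvature at high temperatures. The names fit, slope, and r_squared are illustrative.

```r
data("airquality")

# Simple linear regression of ozone on temperature;
# lm() omits rows with missing values by default
fit <- lm(Ozone ~ Temp, data = airquality)

# Estimated change in ozone per one-degree increase in temperature
slope <- coef(fit)[["Temp"]]
slope

# Proportion of ozone variance explained by temperature alone
r_squared <- summary(fit)$r.squared
r_squared
```

In a ggplot2 scatterplot, adding geom_smooth(method = "lm") would draw the same kind of fitted line over the points.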

Task 5: Correlation Matrix

Compute and visualize the correlation matrix for Ozone, Temp, and Wind.

cor_matrix <- cor(
  airquality |>
    select(Ozone, Temp, Wind), use = "complete.obs")

cor_matrix
##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45, addCoef.col = "black",
         title = "Correlation Matrix for Ozone, Temp, and Wind",
         mar = c(0, 0, 2, 0))  # leave room at the top so the title is not clipped

Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.

The strongest correlation is between ozone and temperature, with an r value of 0.70. This means that as temperature increases, ozone levels also tend to increase. The weakest correlation (smallest in absolute value) is between wind and temperature, with an r value of -0.51, showing that higher wind speeds are usually linked to somewhat lower temperatures. Ozone also has a negative correlation with wind (r = -0.60), meaning windier days tend to have lower ozone levels. Overall, ozone is more strongly related to temperature than to wind speed, although these correlations describe association only and do not show cause.
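With only three variables the ranking is easy to read off the matrix, but the same comparison can be done programmatically. The following is a minimal base R sketch; pair_table is a hypothetical name, and ranking by absolute value is the convention assumed here for "strongest" and "weakest".

```r
data("airquality")

# Same correlation matrix as in the task, computed with base R indexing
cor_matrix <- cor(airquality[, c("Ozone", "Temp", "Wind")],
                  use = "complete.obs")

# Positions of the unique variable pairs (upper triangle, no diagonal)
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)

# One row per pair, sorted from strongest to weakest by |r|
pair_table <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)
pair_table <- pair_table[order(-abs(pair_table$r)), ]
pair_table
```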

Task 6: Summary Table

Generate the summary table grouped by Month. It should include the count of observations and the mean ozone, mean temperature, and mean wind speed per month.

summary_table <- ozone_month |>
  group_by(Month) |>
  summarise(
    Count = n(),
    Avg_Ozone = mean(Ozone, na.rm = TRUE),
    Avg_Temp = mean(Temp, na.rm = TRUE),
    Avg_Wind = mean(Wind, na.rm = TRUE)
  ) 
summary_table
## # A tibble: 5 × 5
##   Month Count Avg_Ozone Avg_Temp Avg_Wind
##   <int> <int>     <dbl>    <dbl>    <dbl>
## 1     5    31      23.6     65.5    11.6 
## 2     6    30      29.4     79.1    10.3 
## 3     7    31      59.1     83.9     8.94
## 4     8    31      60.0     84.0     8.79
## 5     9    30      31.4     76.9    10.2

Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?

August has the highest average ozone level (60.0), just ahead of July (59.1). Average temperatures rise from May (65.5) through August (84.0) and then drop in September (76.9). Wind speeds are strongest in May, June, and September but lower in July and August. These patterns most likely reflect seasonal changes. Warmer temperatures and calmer winds in the summer allow ozone to build up more easily, while cooler and windier months help disperse it.
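One caveat when reading the table is that Count comes from n(), which counts every day in the month, while Avg_Ozone is based only on days with an ozone reading. The sketch below, in base R, shows how the two counts could be compared; the names coverage, Days, and Ozone_Days are illustrative.

```r
data("airquality")

# Total number of rows (days) per month
days_total <- tapply(airquality$Ozone, airquality$Month, length)

# Number of days per month with a non-missing ozone reading
days_with_ozone <- tapply(airquality$Ozone, airquality$Month,
                          function(x) sum(!is.na(x)))

# Combine both counts into one small table
coverage <- data.frame(
  Month      = as.integer(names(days_total)),
  Days       = as.vector(days_total),
  Ozone_Days = as.vector(days_with_ozone)
)
coverage
```

Months with few ozone readings have less reliable averages, so differences between those months should be interpreted with more caution.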