Objective:

You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.

Dataset

Source: Built-in R dataset airquality.

Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.

Variables (selected for this assignment):

Notes

-The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.

-If you encounter errors, check that tidyverse and corrplot are installed and loaded.

-Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
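To make the note on missing values concrete, the following base R sketch counts the NAs in each column and shows what happens when na.rm = TRUE is left out. The variable names (na_counts, mean_with_na, and so on) are illustrative only and are not part of the assignment template.

```r
# Load the built-in dataset
data("airquality")

# Count missing values in every column
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Without na.rm = TRUE, a single NA makes the whole result NA
mean_with_na <- mean(airquality$Ozone)
mean_without_na <- mean(airquality$Ozone, na.rm = TRUE)
print(c(with_na = mean_with_na, na_removed = mean_without_na))

# use = "complete.obs" keeps only rows with no NA in the selected columns;
# complete.cases() shows how many rows that leaves
complete_rows <- sum(complete.cases(airquality[, c("Ozone", "Temp", "Wind")]))
print(complete_rows)
```

Checking the NA counts first explains the "Removed 37 rows" warnings that ggplot2 prints for the Ozone plots later in the assignment.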

Instructions:

Complete the following tasks using R to analyze the airquality dataset. Submit your RPubs link that includes code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.
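One possible way to check the package requirement before loading anything is sketched below. It only reports which packages are missing and leaves the actual installation to you; required_pkgs and missing_pkgs are illustrative names.

```r
# Packages this assignment relies on
required_pkgs <- c("tidyverse", "corrplot")

# requireNamespace() returns FALSE when a package is not installed
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  # Install these once from the console with install.packages(missing_pkgs)
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```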

#Load your dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data("airquality")

Tasks and Questions

Task 1: Measures of Central Tendency and Spread

Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.

#Your code for Ozone goes here
summary_ozone <- airquality |>
  summarise(
    mean_ozone = mean(Ozone, na.rm = TRUE),
    median_ozone = median(Ozone, na.rm = TRUE),
    sd_ozone = sd(Ozone, na.rm = TRUE),
    min_ozone = min(Ozone, na.rm = TRUE),
    max_ozone = max(Ozone, na.rm = TRUE))

summary_ozone
##   mean_ozone median_ozone sd_ozone min_ozone max_ozone
## 1   42.12931         31.5 32.98788         1       168
#Your code for Temp goes here
summary_temp <- airquality |>
  summarise(
    mean_temp = mean(Temp, na.rm = TRUE),
    median_temp = median(Temp, na.rm = TRUE),
    sd_temp = sd(Temp, na.rm = TRUE),
    min_temp = min(Temp, na.rm = TRUE),
    max_temp = max(Temp, na.rm = TRUE))

summary_temp
##   mean_temp median_temp sd_temp min_temp max_temp
## 1  77.88235          79 9.46527       56       97
#Your code for Wind goes here
summary_wind <- airquality |>
  summarise(
    mean_wind = mean(Wind, na.rm = TRUE),
    median_wind = median(Wind, na.rm = TRUE),
    sd_wind = sd(Wind, na.rm = TRUE),
    min_wind = min(Wind, na.rm = TRUE),
    max_wind = max(Wind, na.rm = TRUE))

summary_wind
##   mean_wind median_wind  sd_wind min_wind max_wind
## 1  9.957516         9.7 3.523001      1.7     20.7

Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?

For Temp and Wind, the mean and median are close to each other (77.9 vs. 79 and 9.96 vs. 9.7), which suggests roughly symmetric distributions. For Ozone, the mean (42.1) is well above the median (31.5), which indicates a right-skewed (positively skewed) distribution. The standard deviation describes how far the data points typically fall from the mean: a higher standard deviation means the values are more spread out (higher variability), and a lower one means they cluster more closely around the mean. In their respective units, Ozone has the largest standard deviation (33.0), followed by Temp (9.5) and Wind (3.5).
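The mean-versus-median comparison can be backed up with a numeric skewness measure. The sketch below uses a simple moment-based formula; skewness() here is a hypothetical helper defined in the block, not a base R function, and other packages use slightly different formulas.

```r
data("airquality")

# Moment-based sample skewness (hypothetical helper, not part of base R):
# positive values indicate a right skew, values near 0 indicate rough symmetry
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

vars <- airquality[, c("Ozone", "Temp", "Wind")]

# Skewness and mean-minus-median gap for each variable
skew_values <- sapply(vars, skewness)
mean_minus_median <- sapply(vars, function(x) {
  mean(x, na.rm = TRUE) - median(x, na.rm = TRUE)
})

print(round(skew_values, 2))
print(round(mean_minus_median, 2))
```

A large positive gap between mean and median together with a clearly positive skewness value points to a right-skewed variable.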

Task 2: Histogram

Generate the histogram for Ozone.

#Your code goes here
library(ggplot2)

ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 20, fill = "#8EE5EE", color = "black") +
  labs(title = "Histogram of Ozone Concentration", x = "Ozone (ppb)", y = "Count") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).

Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?

The ozone distribution is right-skewed (positively skewed): most values are low, with a long tail toward high concentrations. There appear to be outliers at the upper end, around 150 ppb and above (the maximum is 168 ppb), which pull the mean well above the median.
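To check the visual impression of outliers, one option is the 1.5 × IQR rule that boxplots use. This is a minimal sketch with base R's boxplot.stats(); the rule is a convention rather than a definitive test, and ozone_outliers is an illustrative name.

```r
data("airquality")

# Drop missing values before applying the rule
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# boxplot.stats() returns, in $out, the points lying more than
# 1.5 * IQR beyond the hinges (the same rule a boxplot uses)
ozone_outliers <- boxplot.stats(ozone)$out
print(ozone_outliers)
```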

Task 3: Boxplot

Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from Week 4. Generate a boxplot of Ozone by month_name.

# Your code here
airquality_month <- airquality |>
  mutate(month_name = case_when(
    Month == 5 ~ "May",
    Month == 6 ~ "June",
    Month == 7 ~ "July", 
    Month == 8 ~ "August",
    Month == 9 ~ "September"
  ))

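#plot
# Setting the factor levels keeps the boxes in calendar order instead of
# alphabetical order; the fill colors are listed in that same order.
ggplot(airquality_month,
       aes(x = factor(month_name, levels = month.name[5:9]), y = Ozone)) +
  geom_boxplot(fill = c("#FF8C00", "#FFD700", "#2ca02c", "#97FFFF", "#CD5C5C")) +
  labs(title = "Boxplot of Ozone Concentration by Month",
       x = "Month", y = "Ozone (ppb)") +
  theme_minimal()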
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?

Ozone levels vary noticeably by month. They are highest at the peak of summer, in July and August, and lower in the cooler months. July has the highest median ozone. There are outliers in every month except July. These outliers may reflect unusual environmental conditions or episodes of heavier pollution on particular days.
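The medians read off the boxplot can be confirmed numerically. The sketch below uses base R so that it runs without extra packages; it relies on the built-in month.name vector to label the months, and median_by_month and top_month are illustrative names.

```r
data("airquality")

# month.name is a built-in vector of English month names;
# fixing the levels keeps the months in calendar order
month_factor <- factor(month.name[airquality$Month], levels = month.name[5:9])

# Median ozone per month, ignoring missing values
median_by_month <- tapply(airquality$Ozone, month_factor, median, na.rm = TRUE)
print(median_by_month)

# Month with the highest median
top_month <- names(which.max(median_by_month))
print(top_month)
```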

Task 4: Scatterplot

Produce the scatterplot of Temp vs. Ozone, colored by Month.

# Your code goes here
ggplot(airquality_month, aes(x = Ozone, y = Temp, color = month_name)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Scatterplot of Temperature vs. Ozone",
    x = "Ozone (ppb)",
    y = "Temp (°F)",
    color = "Month"
  ) +
  scale_color_manual(values = c("May" = "#FF8C00",
                                "June" = "#FFD700",
                                "July" = "#2ca02c", 
                                "August" = "#97FFFF",
                                "September" = "#CD5C5C")) + 
  theme_minimal()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.

Yes, there is a visible positive relationship between temperature and ozone levels. Ozone tends to be higher in the warmer months, especially July and August, than in the cooler ones. The relationship does not look clearly linear; it looks more like a curve that flattens out, similar to a logarithmic shape, and in every month most of the points cluster at temperatures around 80 °F.
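Because the pattern looks curved rather than straight, it can help to compare the Pearson correlation, which measures linear association, with the rank-based Spearman correlation, which only assumes a monotonic trend. This is a sketch using base R's cor(); pearson_r and spearman_rho are illustrative names.

```r
data("airquality")

# Pearson: strength of the linear relationship
pearson_r <- cor(airquality$Ozone, airquality$Temp,
                 use = "complete.obs", method = "pearson")

# Spearman: strength of the monotonic (rank-based) relationship
spearman_rho <- cor(airquality$Ozone, airquality$Temp,
                    use = "complete.obs", method = "spearman")

print(c(pearson = pearson_r, spearman = spearman_rho))
```

If the Spearman value is noticeably larger than the Pearson value, that is consistent with a relationship that is monotonic but not linear.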

Task 5: Correlation Matrix

Compute and visualize the correlation matrix for Ozone, Temp, and Wind.

# Your code goes here
cor_matrix <- cor(
    airquality |>
    select(Ozone, Temp, Wind), use = "complete.obs")

cor_matrix
##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45, addCoef.col = "black",
         title = "Correlation Matrix of Ozone, Temperature, and Wind",
         mar = c(0, 0, 2, 0))  # extra top margin so the title is not clipped

Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.

From the correlation matrix, the strongest correlation is between Ozone and Temp (r ≈ 0.70), and the weakest is between Temp and Wind (r ≈ -0.51). Ozone is therefore more strongly correlated with temperature than with wind speed.

-Ozone vs. Temp (0.70): the strongest correlation, and positive. Ozone tends to be higher on hotter days, which aligns with the scatterplot created earlier.
-Ozone vs. Wind (-0.60): a moderate negative correlation. Ozone tends to be lower on windier days.
-Temp vs. Wind (-0.51): the weakest of the three, and also negative. Hotter days tend to be less windy.
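Strength of correlation is judged by the absolute value of r, so a negative coefficient can still be stronger than a positive one. The sketch below ranks the three pairs by |r| in base R; cor_pairs is an illustrative name.

```r
data("airquality")

# Same correlation matrix as in the task, built with base R indexing
cor_matrix <- cor(airquality[, c("Ozone", "Temp", "Wind")], use = "complete.obs")

# Keep each pair once: upper triangle, excluding the diagonal
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r = cor_matrix[idx]
)

# Sort from strongest to weakest by absolute correlation
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs, row.names = FALSE)
```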

Task 6: Summary Table

Generate the summary table grouped by Month. It should include the count and the mean of ozone, temperature, and wind for each month.

# your code goes here
summary_table <- airquality_month |>
  group_by(month_name) |>
  summarise(
    Count = n(),
    Avg_Ozone = mean(Ozone, na.rm = TRUE),
    Avg_Temp = mean(Temp, na.rm = TRUE),
    Avg_Wind = mean(Wind, na.rm = TRUE),
  ) 
summary_table
## # A tibble: 5 × 5
##   month_name Count Avg_Ozone Avg_Temp Avg_Wind
##   <chr>      <int>     <dbl>    <dbl>    <dbl>
## 1 August        31      60.0     84.0     8.79
## 2 July          31      59.1     83.9     8.94
## 3 June          30      29.4     79.1    10.3 
## 4 May           31      23.6     65.5    11.6 
## 5 September     30      31.4     76.9    10.2

Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?

August has the highest average ozone level (60.0), just ahead of July (59.1). Average temperature rises from May (65.5) to a peak in July and August (about 84) and falls again in September (76.9). Wind speed moves the opposite way: it is highest in May (11.6), drops during the hotter summer months (under 9 in July and August), and rises again in September (10.2). The main environmental factor that would explain these differences is the change of seasons. May and September are the months when the seasons are in transition, which goes along with higher wind speeds and lower temperatures in these months.
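The seasonal pattern is easier to follow when the months appear in calendar order rather than alphabetically. As a sketch, the monthly means can be recomputed in base R grouped by the numeric Month column, which sorts chronologically; by_month is an illustrative name.

```r
data("airquality")

# Mean of each variable per month; na.rm = TRUE is passed through to mean()
by_month <- aggregate(
  airquality[, c("Ozone", "Temp", "Wind")],
  by = list(Month = airquality$Month),
  FUN = mean, na.rm = TRUE
)

# Add readable month names using the built-in month.name vector
by_month$month_name <- month.name[by_month$Month]
print(by_month)

# Months with the highest average ozone and the highest average wind
highest_ozone <- by_month$month_name[which.max(by_month$Ozone)]
highest_wind <- by_month$month_name[which.max(by_month$Wind)]
print(c(ozone = highest_ozone, wind = highest_wind))
```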

Submission Requirements

Publish it to RPubs and submit your link on Blackboard.