Objective:

You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.

Dataset

Source: Built-in R dataset airquality.

Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.

Variables (selected for this assignment):

Notes

-The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.

-If you encounter errors, check that tidyverse and corrplot are installed and loaded.

-Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
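
The effect of the missing values mentioned in the notes can be checked directly. The following is a minimal sketch in base R, so it runs without extra packages; the object name na_counts is illustrative and not part of the assignment.

```r
# Load the built-in dataset
data("airquality")

# Count missing values in each column
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Without na.rm = TRUE, a single NA makes the whole result NA
mean(airquality$Ozone)

# With na.rm = TRUE, missing values are dropped before averaging
mean(airquality$Ozone, na.rm = TRUE)
```

Only Ozone and Solar.R should show non-zero counts, which is why na.rm = TRUE makes no difference for Temp and Wind.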

Instructions:

Complete the following tasks using R to analyze the airquality dataset. Submit an RPubs link to a report that includes code, outputs (tables and plots), and written interpretations for each task. Load the dataset with data(airquality), and install and load the tidyverse and corrplot packages.

#Load your dataset

library(tidyverse)
## Warning: package 'forcats' was built under R version 4.5.3
## Warning: package 'lubridate' was built under R version 4.5.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.5.3
## corrplot 0.95 loaded
data("airquality")

Tasks and Questions

Task 1: Measures of Central Tendency and Spread

Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.

#Your code for Ozone goes here

mean(airquality$Ozone, na.rm = TRUE)
## [1] 42.12931
median(airquality$Ozone, na.rm = TRUE)
## [1] 31.5
sd(airquality$Ozone, na.rm = TRUE)
## [1] 32.98788
min(airquality$Ozone, na.rm = TRUE)
## [1] 1
max(airquality$Ozone, na.rm = TRUE)
## [1] 168
#Your code for Temp goes here

mean(airquality$Temp, na.rm = TRUE)
## [1] 77.88235
median(airquality$Temp, na.rm = TRUE)
## [1] 79
sd(airquality$Temp, na.rm = TRUE)
## [1] 9.46527
min(airquality$Temp, na.rm = TRUE)
## [1] 56
max(airquality$Temp, na.rm = TRUE)
## [1] 97
#Your code for Wind goes here

mean(airquality$Wind, na.rm = TRUE)
## [1] 9.957516
median(airquality$Wind, na.rm = TRUE)
## [1] 9.7
sd(airquality$Wind, na.rm = TRUE)
## [1] 3.523001
min(airquality$Wind, na.rm = TRUE)
## [1] 1.7
max(airquality$Wind, na.rm = TRUE)
## [1] 20.7

Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?

Answer: The mean of Ozone (42.13 ppb) is well above the median (31.5 ppb). This gap indicates right skew: most days have low to moderate ozone, while a few days with very high readings pull the mean upward. The standard deviation (32.99 ppb) is large relative to the mean, indicating high variability. For Temp, the mean (77.88°F) is slightly below the median (79°F), so the distribution is roughly symmetric with, at most, a slight left skew; the standard deviation of 9.47°F is small relative to the mean. For Wind, the mean (9.96 mph) and median (9.7 mph) are almost the same, which suggests a roughly symmetric distribution, with a standard deviation of 3.52 mph.
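
The mean and median comparison above can also be produced in one table. This is a sketch in base R, not the method required by the assignment; the helper name describe and the object names stats and skew_hint are hypothetical.

```r
data("airquality")

# Hypothetical helper: the five statistics from Task 1 for one vector
describe <- function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE))
}

# Apply the helper to each column; the result is a 5 x 3 matrix
stats <- sapply(airquality[, c("Ozone", "Temp", "Wind")], describe)
print(round(stats, 2))

# A positive mean - median gap hints at right skew, a negative gap at left skew
skew_hint <- stats["mean", ] - stats["median", ]
print(round(skew_hint, 2))
```

The gap between mean and median is only a rough indicator of skewness; the histogram in Task 2 is the more reliable check.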

Task 2: Histogram

Generate the histogram for Ozone.

#Your code goes here

ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 10, fill = "steelblue", color = "white", na.rm = TRUE) +
  labs(
    title = "Distribution of Ozone Levels",
    x = "Ozone (ppb)",
    y = "Count"
  ) +
  theme_minimal()

Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?

Answer: The ozone distribution is unimodal and positively (right) skewed. Most observations fall in the lower range (roughly 0 to 50 ppb), with a long tail extending to the right. Several days exceed 100 ppb, and the largest value is 168 ppb. These high values explain the large gap between the mean and the median.
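
The high-ozone days described above can be counted rather than read off the plot. The sketch below uses base R; the 100 ppb threshold comes from the answer above, and the object names are illustrative.

```r
data("airquality")

# Keep only the days with a recorded ozone value
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Number of days above the 100 ppb threshold
n_over_100 <- sum(ozone > 100)
print(n_over_100)

# Values beyond the boxplot whiskers (1.5 times the box length by default)
flagged <- boxplot.stats(ozone)$out
print(sort(flagged))
```

boxplot.stats() applies the same whisker rule that boxplots use by default, so the flagged values are the points a boxplot of the full Ozone column would draw individually.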

Task 3: Boxplot

Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.

# Your code here

airquality <- airquality %>%
  mutate(month_name = case_when(
    Month == 5 ~ "May",
    Month == 6 ~ "June",
    Month == 7 ~ "July",
    Month == 8 ~ "August",
    Month == 9 ~ "September"
  ))

airquality$month_name <- factor(
  airquality$month_name,
  levels = c("May", "June", "July", "August", "September")
)

ggplot(airquality, aes(x = month_name, y = Ozone, fill = month_name)) +
  geom_boxplot(na.rm = TRUE) +
  labs(
    title = "Ozone Levels by Month",
    x = "Month",
    y = "Ozone (ppb)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?

Answer: Ozone concentrations rise from May to a peak in July and August, then decline in September. July and August have the highest median concentrations, which corresponds to the warmest part of summer, when high temperatures and strong sunlight promote the photochemical reactions that form ozone. May has the lowest median and comparatively little spread. Outliers appear in most months, especially May and August, where a few days recorded ozone concentrations far above the typical values for that month; these may reflect unusually hot, sunny, or calm days.
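
To answer the "highest median" question with numbers instead of reading the boxplot, the monthly medians and interquartile ranges can be tabulated. This is a base R sketch; the data frame name monthly is hypothetical.

```r
data("airquality")

# Median and interquartile range of Ozone for each month, ignoring NAs
med <- tapply(airquality$Ozone, airquality$Month, median, na.rm = TRUE)
iqr <- tapply(airquality$Ozone, airquality$Month, IQR, na.rm = TRUE)

# month.name is a built-in vector of English month names
monthly <- data.frame(
  month  = month.name[as.integer(names(med))],
  median = as.vector(med),
  iqr    = as.vector(iqr)
)
print(monthly)

# Row with the single highest median
print(monthly[which.max(monthly$median), ])
```

The iqr column gives the height of each box, which is a more direct measure of within-month variation than a visual comparison.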

Task 4: Scatterplot

Produce the scatterplot of Temp vs. Ozone, colored by Month.

# Your code goes here

ggplot(airquality, aes(x = Temp, y = Ozone, color = factor(Month))) +
  geom_point(na.rm = TRUE, size = 2, alpha = 0.8) +
  labs(
    title = "Temperature vs. Ozone Levels",
    x = "Temperature (°F)",
    y = "Ozone (ppb)",
    color = "Month"
  ) +
  scale_color_manual(
    values = c("5" = "#4393c3", "6" = "#74c476", "7" = "#fd8d3c",
               "8" = "#e31a1c", "9" = "#9e9ac8"),
    labels = c("5" = "May", "6" = "June", "7" = "July",
               "8" = "August", "9" = "September")
  ) +
  theme_minimal()

Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.

Answer: Yes, there is a clear positive relationship: as temperature rises, ozone levels tend to rise as well, and the points trend upward from left to right. The points also cluster by month. May (blue) sits in the bottom left, with low temperatures and low ozone, while July and August (orange and red) sit in the top right, with high temperatures and high ozone.
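
One way to check the upward pattern numerically is to fit a straight line to the same two variables. This goes slightly beyond what the task asks for and is offered only as a sketch; the object names fit and slope are illustrative.

```r
data("airquality")

# Simple linear fit of Ozone on Temp; rows with missing Ozone are dropped by default
fit <- lm(Ozone ~ Temp, data = airquality)

# Estimated change in ozone (ppb) per one-degree (°F) increase in temperature
slope <- coef(fit)[["Temp"]]
print(slope)

# Number of days actually used in the fit
print(nobs(fit))
```

A positive slope agrees with the visual impression from the scatterplot. A straight line is an assumption here, and the plot suggests the increase may be steeper at high temperatures.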

Task 5: Correlation Matrix

Compute and visualize the correlation matrix for Ozone, Temp, and Wind.

# Your code goes here

cor_data <- airquality[, c("Ozone", "Temp", "Wind")]
cor_matrix <- cor(cor_data, use = "complete.obs")

print(round(cor_matrix, 4))
##         Ozone    Temp    Wind
## Ozone  1.0000  0.6984 -0.6015
## Temp   0.6984  1.0000 -0.5111
## Wind  -0.6015 -0.5111  1.0000
corrplot(cor_matrix,
         method = "color",
         type = "upper",
         addCoef.col = "black",
         tl.col = "black",
         title = "Correlation Matrix: Ozone, Temp, Wind",
         mar = c(0, 0, 1, 0))

Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.

Answer: The strongest correlation is between Ozone and Temp (r = 0.70), a fairly strong positive relationship: hotter days tend to have higher ozone concentrations, which is consistent with heat and sunlight driving ozone formation. The second strongest is between Ozone and Wind (r = -0.60); it is negative, plausibly because stronger winds disperse pollutants and lower their concentrations. Ozone is therefore more strongly correlated with temperature than with wind speed. The weakest correlation is between Temp and Wind (r = -0.51), which is also negative, indicating that windier days tend to be cooler.
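
The choice of use = "complete.obs" affects the Temp and Wind value reported above, because that option drops every row with a missing Ozone reading before any correlation is computed. The sketch below compares it with use = "pairwise.complete.obs", a standard option of cor(); the object names are illustrative.

```r
data("airquality")
vars <- airquality[, c("Ozone", "Temp", "Wind")]

# Listwise deletion: only rows complete in all three columns are used
cor_listwise <- cor(vars, use = "complete.obs")

# Pairwise deletion: each pair uses every row where both variables are present
cor_pairwise <- cor(vars, use = "pairwise.complete.obs")

print(round(cor_listwise, 4))
print(round(cor_pairwise, 4))
```

Among these three columns only Ozone has missing values, so the two Ozone correlations are identical under both options, while the Temp and Wind correlation under pairwise deletion uses all 153 days.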

Task 6: Summary Table

Generate the summary table grouped by Month. It should include the count of observations and the mean ozone, mean temperature, and mean wind speed per month.

# your code goes here

summary_table <- airquality %>%
  group_by(Month) %>%
  summarise(
    count     = n(),
    avg_ozone = round(mean(Ozone, na.rm = TRUE), 2),
    avg_temp  = round(mean(Temp,  na.rm = TRUE), 2),
    avg_wind  = round(mean(Wind,  na.rm = TRUE), 2)
  )

print(summary_table)
## # A tibble: 5 × 5
##   Month count avg_ozone avg_temp avg_wind
##   <int> <int>     <dbl>    <dbl>    <dbl>
## 1     5    31      23.6     65.6    11.6 
## 2     6    30      29.4     79.1    10.3 
## 3     7    31      59.1     83.9     8.94
## 4     8    31      60.0     84.0     8.79
## 5     9    30      31.4     76.9    10.2

Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?

Answer: August has the highest average ozone level (59.96 ppb), closely followed by July (59.12 ppb). These two months also have the highest average temperatures (approximately 84°F), which is consistent with high temperatures promoting ozone formation. Average temperature rises from May (65.55°F) to July and August, then falls in September (76.9°F). Wind speed shows the opposite pattern: it is highest in May (11.62 mph), declines to its minimum in August (8.79 mph), and picks up again in September (10.2 mph). A plausible explanation is that lighter summer winds allow pollutants to accumulate, while hot, sunny days drive the photochemical reactions that produce ozone.
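
Note that count = n() in the table counts all days in a month, including days with no ozone reading, so each monthly ozone mean may rest on fewer observations than the count suggests. The following base R sketch shows how many days per month actually have an Ozone value; the data frame name coverage is hypothetical.

```r
data("airquality")

# Total days per month (what n() reports in the summary table)
days_total <- tapply(airquality$Ozone, airquality$Month, length)

# Days per month with a non-missing Ozone reading
days_with_ozone <- tapply(airquality$Ozone, airquality$Month,
                          function(x) sum(!is.na(x)))

coverage <- data.frame(
  Month           = as.integer(names(days_total)),
  days_total      = as.vector(days_total),
  days_with_ozone = as.vector(days_with_ozone)
)
print(coverage)
```

Months with few ozone readings give less reliable averages, which is worth keeping in mind when comparing months.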

Submission Requirements

Publish your report to RPubs and submit the link on Blackboard.