Objective:

You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.

Dataset

Source: Built-in R dataset airquality.

Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.

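Variables (selected for this assignment):

-Ozone: mean ozone concentration in parts per billion

-Temp: maximum daily temperature in degrees Fahrenheit

-Wind: average wind speed in miles per hour

-Month: month of the observation as a number (5 to 9)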

Notes

-The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.

-If you encounter errors, check that tidyverse and corrplot are installed and loaded.

-Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
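
As a quick illustration of the note on missing values, the sketch below counts the NA entries per column and shows what happens to mean() with and without na.rm = TRUE. The variable names (na_counts, mean_with_na, mean_without_na) are chosen here for illustration and are not part of the assignment.

```r
# Load the built-in dataset (ships with base R's datasets package)
data("airquality")

# Count missing values in each column
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Without na.rm = TRUE, the mean of a column containing NA is itself NA
mean_with_na <- mean(airquality$Ozone)
mean_without_na <- mean(airquality$Ozone, na.rm = TRUE)

print(mean_with_na)
print(mean_without_na)
```

The Ozone count should match the number of rows that ggplot2 reports as removed in the plot warnings.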

Instructions:

Complete the following tasks using R to analyze the airquality dataset. Submit your RPubs link, which should include code, outputs (tables and plots), and written interpretations for each task. Make sure you load the dataset using data(airquality) and install and load the tidyverse and corrplot packages.

#Load your dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data("airquality")

Tasks and Questions

Task 1: Measures of Central Tendency and Spread

Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.

#Your code for Ozone goes here
mean(airquality$Ozone, na.rm = TRUE)
## [1] 42.12931
median(airquality$Ozone, na.rm = TRUE)
## [1] 31.5
sd(airquality$Ozone, na.rm = TRUE)
## [1] 32.98788
min(airquality$Ozone, na.rm = TRUE)
## [1] 1
max(airquality$Ozone, na.rm = TRUE)
## [1] 168
#Your code for Temp goes here
mean(airquality$Temp, na.rm = TRUE)
## [1] 77.88235
median(airquality$Temp, na.rm = TRUE)
## [1] 79
sd(airquality$Temp, na.rm = TRUE)
## [1] 9.46527
min(airquality$Temp, na.rm = TRUE)
## [1] 56
max(airquality$Temp, na.rm = TRUE)
## [1] 97
#Your code for Wind goes here
mean(airquality$Wind, na.rm = TRUE)
## [1] 9.957516
median(airquality$Wind, na.rm = TRUE)
## [1] 9.7
sd(airquality$Wind, na.rm = TRUE)
## [1] 3.523001
min(airquality$Wind, na.rm = TRUE)
## [1] 1.7
max(airquality$Wind, na.rm = TRUE)
## [1] 20.7

Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?

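For Ozone, the mean (42.1) is well above the median (31.5), which points to a right-skewed distribution. For Wind, the mean (9.96) is only slightly above the median (9.7), so the distribution is close to symmetric with at most a mild right skew. For Temp, the mean (77.9) is slightly below the median (79), which suggests a mild left skew. The standard deviation measures how far values typically fall from the mean. Ozone varies the most (SD 33.0 against a mean of 42.1), Wind is moderately variable (SD 3.5 against a mean of 9.96), and Temp is the most stable relative to its mean (SD 9.5 against a mean of 77.9).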
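
The same comparison can be collected into one table. The sketch below is one possible approach, not the method required by the assignment: describe_column is a hypothetical helper written for this example, and the coefficient of variation (SD divided by mean) is added as an assumption about how to compare variability across variables with different units.

```r
data("airquality")

# Hypothetical helper: the five requested statistics plus a
# coefficient of variation (sd / mean) for comparing spread across units
describe_column <- function(x) {
  c(
    mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE),
    cv     = sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
  )
}

# Apply the helper to each of the three columns; the result is a matrix
# with one row per statistic and one column per variable
stats_table <- sapply(airquality[c("Ozone", "Temp", "Wind")], describe_column)
print(round(stats_table, 2))
```

A mean noticeably above the median in a column hints at right skew, and a larger cv row entry indicates more variability relative to the mean.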

Task 2: Histogram

Generate the histogram for Ozone.

#Your code goes here
library(ggplot2)

ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 10, fill = "#1f77b4", color = "black") +
  labs(title = "Histogram of Ozone", x = "Ozone", y = "Count") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).

Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?

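The ozone distribution is unimodal and right-skewed: most observations fall at the low end of the range, and a long tail extends toward high values. The largest observation, 168, sits well apart from the bulk of the data and looks like an outlier.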
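
One way to check for outliers numerically is the 1.5 × IQR rule that boxplots use. The sketch below applies base R's boxplot.stats() to Ozone. Treating points beyond the whiskers as outliers is a convention assumed here, not the only possible definition, and the names ozone_stats and ozone_outliers are illustrative.

```r
data("airquality")

# boxplot.stats() ignores NA values; $stats holds the five-number summary
# used for the box and whiskers, and $out holds points beyond the whiskers
ozone_stats <- boxplot.stats(airquality$Ozone)
ozone_outliers <- ozone_stats$out

print(ozone_stats$stats)
print(ozone_outliers)
```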

Task 3: Boxplot

Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.

# Your code here
ozone_air_quality <- airquality |>
  mutate(month_name = case_when(
    Month == 5 ~ "May",
    Month == 6 ~ "June",
    Month == 7 ~ "July",
    Month == 8 ~ "August",
    Month == 9 ~ "September"
  ))

# Map the Ozone column by name inside aes() rather than with `$`,
# and order the months chronologically (month.name is a base R constant)
month_levels <- month.name[5:9]

ggplot(ozone_air_quality, aes(x = factor(month_name, levels = month_levels), y = Ozone)) +
  geom_boxplot(aes(fill = month_name)) +
  labs(title = "Boxplot of Ozone Levels By Month",
       x = "Month", y = "Ozone Level") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?

Ozone levels differ noticeably from month to month, both in their medians and in their spread. July has the highest median ozone, at around 60. Every month except July shows outliers, and all of them lie above the upper whisker of their box. This indicates right skew within those months and suggests that days with unusually high ozone are rare.
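
The medians read off the boxplot can be confirmed numerically. This is a minimal sketch using base R's tapply(); the names monthly_median and top_month are introduced here for illustration.

```r
data("airquality")

# Median Ozone per month, ignoring missing values
monthly_median <- tapply(airquality$Ozone, airquality$Month, median, na.rm = TRUE)

# Replace the numeric month labels (5 to 9) with month names
names(monthly_median) <- month.name[as.integer(names(monthly_median))]
print(monthly_median)

# Month with the highest median
top_month <- names(which.max(monthly_median))
print(top_month)
```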

Task 4: Scatterplot

Produce the scatterplot of Temp vs. Ozone, colored by Month.

# Your code goes here
ggplot(ozone_air_quality, aes(x = Temp, y = Ozone, color = month_name)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Scatterplot of Temp vs. Ozone",
    x = "Temp",
    y = "Ozone",
    color = "Month Name"
  ) +
  scale_color_manual(values = c("May" = "red", 
                     "June" = "yellow",
                     "July" = "green",
                     "August" = "blue",
                     "September" = "purple")) + 
  theme_minimal()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.

There is a visible positive relationship: as temperature increases, ozone levels tend to increase as well. The months also cluster. Points for May bunch together at low temperatures and low ozone levels (bottom left), while points for July sit in the middle to right of the plot, at relatively high temperatures with ozone levels ranging from low to high.
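
As an optional check on the pattern in the scatterplot, the sketch below fits a simple straight line of Ozone on Temp and computes monthly means of both variables. A linear fit is an assumption made for illustration only (the relationship may well be curved), and ozone_fit, temp_slope, and monthly_means are names chosen for this example.

```r
data("airquality")

# lm() drops rows with missing Ozone by default (na.action = na.omit)
ozone_fit <- lm(Ozone ~ Temp, data = airquality)

# Slope: estimated change in Ozone per one-degree increase in Temp
temp_slope <- coef(ozone_fit)[["Temp"]]
print(temp_slope)

# Monthly means of Ozone and Temp, to see where each month's points centre;
# the formula interface of aggregate() also drops rows with missing values
monthly_means <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, FUN = mean)
print(monthly_means)
```

A positive slope agrees with the upward trend in the plot, and the monthly means show roughly where each month's cluster is centred.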

Task 5: Correlation Matrix

Compute and visualize the correlation matrix for Ozone, Temp, and Wind.

# Your code goes here
ozone_temp_wind <- cor(
  airquality |>
    select(Ozone, Temp, Wind), use = "complete.obs")

corrplot(ozone_temp_wind, method = "color", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45, addCoef.col = "black",
         title = "Correlation Matrix for Ozone, Temp, and Wind")

Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.

The strongest correlation is the positive one between Ozone and Temp (about 0.7), followed by the negative one between Ozone and Wind (about -0.6). Ozone is therefore more strongly correlated with temperature than with wind speed. The remaining pair, Temp and Wind, is also negatively correlated and is the weakest of the three. Overall, ozone tends to rise with temperature, whereas higher wind speeds tend to go with lower ozone and lower temperatures.
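
To rank the correlations without reading them off the plot, the pairs can be listed and sorted by absolute value. This is a sketch in base R; cor_matrix, pairs_idx, and pair_table are illustrative names, and the matrix is recomputed here so the block runs on its own.

```r
data("airquality")

# Same correlation matrix as in the task, using complete cases only
cor_matrix <- cor(airquality[c("Ozone", "Temp", "Wind")], use = "complete.obs")
print(round(cor_matrix, 2))

# Row/column positions of the upper triangle, so each pair appears once
pairs_idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)

pair_table <- data.frame(
  var1 = rownames(cor_matrix)[pairs_idx[, "row"]],
  var2 = colnames(cor_matrix)[pairs_idx[, "col"]],
  r    = cor_matrix[pairs_idx]
)

# Sort from strongest to weakest by absolute correlation
pair_table <- pair_table[order(-abs(pair_table$r)), ]
print(pair_table)
```

The first row of pair_table is the strongest pair and the last row is the weakest.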

Task 6: Summary Table

Generate the summary table grouped by Month. It should include the count, mean ozone, mean temperature, and mean wind speed for each month.

# your code goes here
summary_table <- airquality |>
  group_by(Month) |>
  summarise(
    Count = n(),
    Avg_Ozone = mean(Ozone, na.rm = TRUE),
    Avg_Temp = mean(Temp, na.rm = TRUE),
    Avg_Wind = mean(Wind, na.rm = TRUE)
  )
summary_table
## # A tibble: 5 × 5
##   Month Count Avg_Ozone Avg_Temp Avg_Wind
##   <int> <int>     <dbl>    <dbl>    <dbl>
## 1     5    31      23.6     65.5    11.6 
## 2     6    30      29.4     79.1    10.3 
## 3     7    31      59.1     83.9     8.94
## 4     8    31      60.0     84.0     8.79
## 5     9    30      31.4     76.9    10.2

Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?

August has the highest average ozone level (60.0), with July a close second (59.1). Average temperature rises from 65.5 in May to about 84 in July and August, then falls to 76.9 in September. Average wind speed moves in the opposite direction, dropping from 11.6 in May to 8.79 in August before rising to 10.2 in September. The change of seasons may explain these differences, since each season has its own climate. The amount of sunlight reaching the ground also changes with the season, which affects the temperature and wind on particular days or in particular months.

Submission Requirements

Publish your report to RPubs and submit the link on Blackboard.