Objective:

You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.

Dataset

Source: Built-in R dataset airquality.

Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.

Variables (selected for this assignment):

Notes

-The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.

-If you encounter errors, check that tidyverse and corrplot are installed and loaded.

-Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
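The note on missing values can be illustrated with a minimal sketch. It assumes only base R and the built-in dataset; the object names (na_counts, mean_with_na, and so on) are made up for illustration and are not part of the assignment code.

```r
# Minimal sketch: how missing values affect base-R summaries of airquality
data("airquality")

# Count the missing values in each column
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Without na.rm = TRUE, the mean of a column that contains NA is NA
mean_with_na <- mean(airquality$Ozone)
mean_no_na   <- mean(airquality$Ozone, na.rm = TRUE)

# cor() takes use = "complete.obs" (straight quotes) to drop incomplete rows
ozone_temp_cor <- cor(airquality$Ozone, airquality$Temp, use = "complete.obs")
```

Note that the curly quotes produced by some word processors are not valid R string delimiters, so use = "complete.obs" has to be typed with straight quotes.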

Instructions:

Complete the following tasks using R to analyze the airquality dataset. Submit your RPubs link, which should include code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.

#Load your dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data("airquality")

Tasks and Questions

Task 1: Measures of Central Tendency and Spread

Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.

# Ozone
mean_ozone   <- mean(airquality$Ozone, na.rm = TRUE)
median_ozone <- median(airquality$Ozone, na.rm = TRUE)
sd_ozone     <- sd(airquality$Ozone, na.rm = TRUE)
min_ozone    <- min(airquality$Ozone, na.rm = TRUE)
max_ozone    <- max(airquality$Ozone, na.rm = TRUE)
# Temperature
mean_temp   <- mean(airquality$Temp)
median_temp <- median(airquality$Temp)
sd_temp     <- sd(airquality$Temp)
min_temp    <- min(airquality$Temp)
max_temp    <- max(airquality$Temp)
# Wind
mean_wind   <- mean(airquality$Wind)
median_wind <- median(airquality$Wind)
sd_wind     <- sd(airquality$Wind)
min_wind    <- min(airquality$Wind)
max_wind    <- max(airquality$Wind)

Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? For Ozone, the mean (42.13) is well above the median (31.50), which suggests a positively (right) skewed distribution. For Temp, the mean (77.88) and median (79.00) are very close, which indicates a roughly symmetric distribution; the mean sits slightly below the median, so any skew is mildly to the left. For Wind, the mean (9.96) and median (9.70) are also very close, which is consistent with a roughly symmetric, approximately normal distribution.

What does the standard deviation indicate about variability? The SD for Ozone is 32.99, which is large relative to its mean of 42.13 (about 78% of the mean) and shows that ozone varies widely from day to day. The SD for Temp is 9.47, which is small relative to its mean of 77.88 (about 12%), so temperature is far more predictable and less volatile than ozone. The SD for Wind is 3.52, the smallest of the three in absolute terms, but the three variables are measured in different units, so the raw SDs are not directly comparable. Relative to its mean of 9.96 (about 35%), wind is less variable than ozone but more variable than temperature.
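Because Ozone, Temp, and Wind are measured in different units, one way to compare their variability is the coefficient of variation (SD divided by the mean). The sketch below is an optional addition, not part of the required task; the names vars, means, sds, and cv are illustrative.

```r
# Sketch: compare spread across variables with the coefficient of variation
data("airquality")

vars  <- airquality[, c("Ozone", "Temp", "Wind")]
means <- sapply(vars, mean, na.rm = TRUE)  # na.rm drops the missing Ozone values
sds   <- sapply(vars, sd, na.rm = TRUE)
cv    <- sds / means                       # unitless relative spread

# One row per variable, rounded for display
print(round(data.frame(mean = means, sd = sds, cv = cv), 2))
```

A larger cv means the variable fluctuates more relative to its typical level, which makes the comparison independent of the measurement units.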

Task 2: Histogram

Generate the histogram for Ozone.

# Histogram of Ozone
hist(airquality$Ozone, 
     main = "Distribution of Ozone", 
     xlab = "Ozone (ppb)", 
     col = "blue", 
     breaks = 15) 

Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?

The distribution is unimodal and right-skewed, with most values at the low end and a few extreme high values (outliers). These extremes matter because they stretch the right tail and pull the mean above the median.
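The visual impression of high-end outliers can be checked numerically. The sketch below applies the common 1.5 × IQR rule, which is one convention among several and is an assumption here rather than something the assignment prescribes; the object names are illustrative.

```r
# Sketch: flag unusually high Ozone values with the 1.5 * IQR rule
data("airquality")

ozone <- airquality$Ozone[!is.na(airquality$Ozone)]  # drop missing values

q1 <- quantile(ozone, 0.25, names = FALSE)
q3 <- quantile(ozone, 0.75, names = FALSE)
upper_fence <- q3 + 1.5 * (q3 - q1)

# Observations above the upper fence are candidate outliers
high_values <- sort(ozone[ozone > upper_fence])
print(upper_fence)
print(high_values)
```

Values flagged this way are candidates for a closer look, not automatically errors; on hot, still days very high ozone readings can be genuine.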

Task 3: Boxplot

Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from Week 4. Generate a boxplot of Ozone by month_name.

# Boxplot of Ozone Levels by Month
aq_data <- airquality %>%
  mutate(month_name = factor(case_when(
    Month == 5 ~ "May",
    Month == 6 ~ "June",
    Month == 7 ~ "July",
    Month == 8 ~ "August",
    Month == 9 ~ "September"
  ), levels = c("May", "June", "July", "August", "September")))
ggplot(aq_data, aes(x = month_name, y = Ozone, fill = month_name)) +
  geom_boxplot(na.rm = TRUE) +
  labs(title = "Ozone Levels by Month",
       x = "Month",
       y = "Ozone (ppb)") +
  theme_minimal() +
  guides(fill = "none")

Question: How do ozone levels vary across months? Which month has the highest median ozone?

Based on the boxplot, ozone levels rise as the weather warms, peak during the mid-summer months (July and August), and drop sharply in September. July has the highest median ozone.

Are there outliers in any month, and what might they indicate?

Yes, there are outliers, especially in July and August.
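The boxplot reading can be backed up with a small table of monthly medians and outlier counts. This is a sketch under the assumption that outliers are defined by quantile-based 1.5 × IQR fences; the name monthly_box_stats and its columns are illustrative, and the counts may differ slightly from a plot that uses a different hinge definition.

```r
# Sketch: monthly medians and counts of points outside the 1.5 * IQR fences
library(dplyr)
data("airquality")

monthly_box_stats <- airquality %>%
  filter(!is.na(Ozone)) %>%               # boxplots ignore missing Ozone
  group_by(Month) %>%
  summarise(
    median_ozone = median(Ozone),
    q1 = quantile(Ozone, 0.25, names = FALSE),
    q3 = quantile(Ozone, 0.75, names = FALSE),
    # count observations beyond either fence
    n_outliers = sum(Ozone > q3 + 1.5 * (q3 - q1) |
                     Ozone < q1 - 1.5 * (q3 - q1))
  )

print(monthly_box_stats)
```

Sorting this table by median_ozone gives a direct answer to which month has the highest median, and n_outliers shows which months contain points drawn beyond the whiskers.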

Task 4: Scatterplot

Produce the scatterplot of Temp vs. Ozone, colored by Month.

# Scatterplot of Temperature Vs. Ozone by month
ggplot(aq_data, aes(x = Temp, y = Ozone, color = month_name)) +
  geom_point(size = 3, alpha = 0.7, na.rm = TRUE) +
  labs(title = "Relationship Between Temperature and Ozone Levels",
       subtitle = "Color Diversity by Month (May - September)",
       x = "Temperature (°F)",
       y = "Ozone (ppb)",
       color = "Month") +
  theme_minimal()

Question: Is there a visible relationship between temperature and ozone levels? Yes, there is a clear, strong positive relationship between temperature and ozone levels: as temperature increases, ozone concentrations tend to rise.

Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.

May and June cluster at the cooler temperatures, and July and August cluster at the warmer temperatures. September is the exception: its points are spread across much of the temperature range rather than forming a tight cluster.

Task 5: Correlation Matrix

Compute and visualize the correlation matrix for Ozone, Temp, and Wind.

library(tidyverse)
library(corrplot)

# Selected variables and handling of missing values (NAs)
cor_data <- airquality %>%
  select(Ozone, Temp, Wind) %>%
  drop_na()

# Computing the Pearson correlation matrix
cor_matrix <- cor(cor_data)

# Numerical matrix
print(cor_matrix)
##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000
# Visualizing with a correlation plot
corrplot(cor_matrix, method = "color", 
         addCoef.col = "black", 
         type = "upper", 
         tl.col = "black", 
         title = "Correlation: Ozone, Temp, and Wind",
         mar = c(0,0,1,0))

Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? The strongest correlation is between Ozone and Temp (r = 0.70). The weakest is between Temp and Wind (r = -0.51). Ozone is more strongly correlated with temperature (0.70) than with wind speed (-0.60).

Explain what the correlation values suggest about relationships between variables. Ozone and Temp have a correlation of 0.70, a fairly strong positive relationship: hotter days tend to have higher ozone concentrations. Ozone and Wind have a correlation of -0.60, a moderate inverse relationship: windier days tend to have lower ozone. Temp and Wind (-0.51) are also negatively related. Correlation alone does not show that one variable causes another.
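When picking the strongest and weakest correlations, it helps to rank the pairs by absolute value so that negative coefficients are compared on strength rather than sign. The sketch below does this with base R; the names cor_pairs and pairs_idx are illustrative.

```r
# Sketch: rank variable pairs by the absolute size of their correlation
data("airquality")

cols <- c("Ozone", "Temp", "Wind")
complete_rows <- airquality[complete.cases(airquality[, cols]), cols]
cor_matrix <- cor(complete_rows)

# Keep each pair once by taking the upper triangle of the matrix
pairs_idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_matrix)[pairs_idx[, "row"]],
  var2 = colnames(cor_matrix)[pairs_idx[, "col"]],
  r    = cor_matrix[upper.tri(cor_matrix)]
)

# Strongest relationship first, regardless of sign
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs)
```

The first row is the strongest relationship and the last row the weakest, which avoids mistaking a negative coefficient for a weak one.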

Task 6: Summary Table

Generate the summary table grouped by Month. It should include the count and the mean of ozone, temperature, and wind for each month.

library(tidyverse)

# Summary table with months by name

monthly_summary_names <- airquality %>%
  mutate(month_name = factor(case_when(
    Month == 5 ~ "May",
    Month == 6 ~ "June",
    Month == 7 ~ "July",
    Month == 8 ~ "August",
    Month == 9 ~ "September"
  ), levels = c("May", "June", "July", "August", "September"))) %>%
  group_by(month_name) %>%
  summarise(
    count = n(),
    avg_ozone = mean(Ozone, na.rm = TRUE),
    avg_temp  = mean(Temp, na.rm = TRUE),
    avg_wind  = mean(Wind, na.rm = TRUE)
  )

print(monthly_summary_names)
## # A tibble: 5 × 5
##   month_name count avg_ozone avg_temp avg_wind
##   <fct>      <int>     <dbl>    <dbl>    <dbl>
## 1 May           31      23.6     65.5    11.6 
## 2 June          30      29.4     79.1    10.3 
## 3 July          31      59.1     83.9     8.94
## 4 August        31      60.0     84.0     8.79
## 5 September     30      31.4     76.9    10.2

Question: Which month has the highest average ozone level? August has the highest average ozone level at about 60.0 ppb, followed closely by July at 59.1 ppb.

How do temperature and wind speed vary across months? Temperature follows a seasonal curve: it is lowest in May, peaks in July and August, and drops off in September.

Wind moves inversely to temperature: it is highest in May and lowest in July and August.

What environmental factors might explain these differences? Heat and sunlight most likely drive these patterns: ozone tends to build up on hot, sunny days, while stronger winds help disperse it.
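The sunlight explanation can be explored with the Solar.R column that the dataset already contains. The sketch below adds it to a monthly summary; treating Solar.R as a rough proxy for sunlight is an assumption made for illustration, and the name monthly_with_solar is hypothetical.

```r
# Sketch: monthly means including solar radiation (Solar.R)
library(dplyr)
data("airquality")

monthly_with_solar <- airquality %>%
  group_by(Month) %>%
  summarise(
    avg_ozone = mean(Ozone, na.rm = TRUE),
    avg_solar = mean(Solar.R, na.rm = TRUE),  # Solar.R also has missing values
    avg_temp  = mean(Temp),
    avg_wind  = mean(Wind)
  )

print(monthly_with_solar)
```

If monthly ozone tracks temperature more closely than it tracks solar radiation, that would suggest heat is the better single explanation in this sample, although five monthly averages are too few to draw firm conclusions.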

Submission Requirements

Publish it to RPubs and submit your link on Blackboard.