Objective:

You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.

Dataset

Source: Built-in R dataset airquality.

Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.

Variables (selected for this assignment):

Notes

- The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.

- If you encounter errors, check that tidyverse and corrplot are installed and loaded.

- Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
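To make the note on missing values concrete, here is a minimal base-R sketch. It counts the NAs per column and shows what happens to a mean with and without na.rm = TRUE. The object names (na_counts, mean_with_na, and so on) are illustrative only and are not part of the assignment.

```r
# Count missing values in each column of the built-in dataset
data("airquality")
na_counts <- colSums(is.na(airquality))
na_counts

# Without na.rm = TRUE, a single NA makes the whole result NA
mean_with_na <- mean(airquality$Ozone)
mean_without_na <- mean(airquality$Ozone, na.rm = TRUE)
mean_with_na
mean_without_na

# use = "complete.obs" in cor() keeps only rows with no NA in the selected columns;
# complete.cases() shows how many rows that leaves
complete_rows <- sum(complete.cases(airquality[, c("Ozone", "Temp", "Wind")]))
complete_rows
```

The 37 missing Ozone values counted here are the same 37 rows that the ggplot2 warnings later report as removed.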

Instructions:

Complete the following tasks using R to analyze the airquality dataset. Submit your Rpubs link that includes code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.

#Load your dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data("airquality")

Tasks and Questions

Task 1: Measures of Central Tendency and Spread

Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.

#Your code for Ozone goes here
ozone_summary <- airquality |>
  summarise(
    mean_ozone = mean(Ozone, na.rm = TRUE),
    median_ozone = median(Ozone, na.rm = TRUE),
    sd_ozone = sd(Ozone, na.rm = TRUE),
    min_ozone = min(Ozone, na.rm = TRUE),
    max_ozone = max(Ozone, na.rm = TRUE)
  )
ozone_summary
##   mean_ozone median_ozone sd_ozone min_ozone max_ozone
## 1   42.12931         31.5 32.98788         1       168
#Your code for Temp goes here
temp_summary <- airquality |>
  summarise(
    mean_temp = mean(Temp, na.rm = TRUE),
    median_temp = median(Temp, na.rm = TRUE),
    sd_temp = sd(Temp, na.rm = TRUE),
    min_temp = min(Temp, na.rm = TRUE),
    max_temp = max(Temp, na.rm = TRUE)
  )
temp_summary
##   mean_temp median_temp sd_temp min_temp max_temp
## 1  77.88235          79 9.46527       56       97
#Your code for Wind goes here
wind_summary <- airquality |>
  summarise(
    mean_wind = mean(Wind, na.rm = TRUE),
    median_wind = median(Wind, na.rm = TRUE),
    sd_wind = sd(Wind, na.rm = TRUE),
    min_wind = min(Wind, na.rm = TRUE),
    max_wind = max(Wind, na.rm = TRUE)
  )
wind_summary
##   mean_wind median_wind  sd_wind min_wind max_wind
## 1  9.957516         9.7 3.523001      1.7     20.7

Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?

For Ozone, the mean (42.1) is well above the median (31.5), which suggests a right-skewed distribution: a few days with very high readings pull the mean upward. For Temp (mean 77.9, median 79) and Wind (mean 9.96, median 9.7), the mean and median are close, which suggests roughly symmetric distributions. Ozone's standard deviation (33.0) is large relative to its mean, so ozone levels vary widely from day to day. Temp's standard deviation (9.5) is small relative to its mean, so daily temperatures cluster fairly tightly, while Wind's standard deviation (3.5) is about a third of its mean, indicating moderate variability. Because the three variables are measured in different units, their standard deviations are best judged against each variable's own scale rather than against one another.
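The comparison above can be put side by side with a small base-R sketch. The helper describe_var() is hypothetical (it is not a function from this week's material). It reports the gap between mean and median as a rough skewness signal and the ratio of standard deviation to mean as a rough measure of relative spread. That ratio is only a loose guide for Temp, since the Fahrenheit scale has an arbitrary zero.

```r
data("airquality")

# Hypothetical helper: centre and relative spread for one numeric vector
describe_var <- function(x) {
  m <- mean(x, na.rm = TRUE)
  med <- median(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  c(mean = m, median = med, mean_minus_median = m - med,
    sd = s, sd_over_mean = s / m)
}

# One column of results per variable
spread_table <- sapply(airquality[, c("Ozone", "Temp", "Wind")], describe_var)
round(spread_table, 2)
```

A clearly positive mean_minus_median points toward right skew, and a larger sd_over_mean means more spread relative to the variable's typical level.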

Task 2: Histogram

Generate the histogram for Ozone.

#Your code goes here
ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 20, fill = "#1f77b4", color = "black") +
  labs(title = "Histogram of Ozone Levels", x = "Ozone (ppb)", y = "Count") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).

Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?

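The ozone distribution is unimodal and strongly right-skewed: most days fall in the lowest bins, and a long tail extends toward the maximum of 168. The few values at the far right of the tail sit well apart from the bulk of the data and are worth checking as potential outliers.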
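One way to check for outliers beyond reading the histogram is the common 1.5 × IQR rule. The sketch below is an illustration under that assumption, not a required part of the task, and the names upper_fence, lower_fence, and flagged are made up for this example.

```r
data("airquality")
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Quartiles and interquartile range
q1 <- unname(quantile(ozone, 0.25))
q3 <- unname(quantile(ozone, 0.75))
iqr <- q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as potential outliers
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr
flagged <- ozone[ozone > upper_fence | ozone < lower_fence]

upper_fence
flagged
```

Any value printed in flagged lies in the far right tail of the histogram; whether it counts as an error or a genuine high-ozone day is a judgment call.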

Task 3: Boxplot

Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.

# Your code here
airquality <- airquality |>
  mutate(
    month_name = case_when(
      Month == 5 ~ "May",
      Month == 6 ~ "June",
      Month == 7 ~ "July",
      Month == 8 ~ "August",
      Month == 9 ~ "September"
    ),
    # Order months chronologically; a plain character column would sort alphabetically
    month_name = factor(month_name, levels = c("May", "June", "July", "August", "September"))
  )
ggplot(data = airquality, aes(x = month_name, y = Ozone)) +
  geom_boxplot(fill = "blue") +
  labs(title = "Ozone Levels by Month",
       x = "Month", y = "Ozone Levels") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?

Ozone levels are higher in July and August and lower in May, June, and September. July has the highest median ozone and is the only month without outliers; every other month shows at least one. The outliers mark individual days whose ozone was unusually high compared with the rest of that month.
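The boxplot reading can be cross-checked numerically. This base-R sketch computes the median ozone per month and counts the points that boxplot.stats() would flag as outliers. Note that ggplot2 computes its quartiles slightly differently from boxplot.stats(), so the counts may differ at the margins from what the plot shows.

```r
data("airquality")

# Split ozone readings by month number (5 to 9)
ozone_by_month <- split(airquality$Ozone, airquality$Month)

# Median ozone per month, ignoring missing values
median_by_month <- sapply(ozone_by_month, median, na.rm = TRUE)

# Number of points beyond the whiskers per month (boxplot.stats() omits NAs)
outliers_by_month <- sapply(ozone_by_month, function(x) length(boxplot.stats(x)$out))

median_by_month
outliers_by_month

# Month number with the highest median
names(which.max(median_by_month))
```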

Task 4: Scatterplot

Produce the scatterplot of Temp vs. Ozone, colored by Month.

# Your code goes here
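# factor(Month) gives each month its own discrete color instead of a continuous gradient
ggplot(data = airquality, aes(x = Temp, y = Ozone, color = factor(Month))) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Scatterplot of Temp vs. Ozone",
    x = "Temp",
    y = "Ozone",
    color = "Month"
  )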
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.

Yes, there is a positive relationship between temperature and ozone: ozone levels tend to rise as temperature rises. Months 7 (July) and 8 (August) cluster together at the high-temperature, high-ozone end of the plot, which shows that the warmest months tend to have the highest ozone levels.
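To put a rough number on the upward trend in the scatterplot, a simple linear fit can be sketched as follows. This goes a little beyond what the task asks and assumes a straight line is a reasonable first approximation, which the curved shape of the point cloud may not fully support.

```r
data("airquality")

# lm() drops rows with a missing Ozone value by default
fit <- lm(Ozone ~ Temp, data = airquality)

# Estimated change in ozone for each one-degree increase in temperature
slope <- coef(fit)[["Temp"]]
slope

# Share of the variation in ozone accounted for by temperature
r_squared <- summary(fit)$r.squared
r_squared
```

A positive slope agrees with the visual impression. For a single predictor, the square root of r_squared equals the absolute correlation between the two variables.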

Task 5: Correlation Matrix

Compute and visualize the correlation matrix for Ozone, Temp, and Wind.

# Your code goes here
cor_matrix <- cor(
  airquality |>
    select(Ozone, Temp, Wind), use = "complete.obs")

cor_matrix
##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45, addCoef.col = "black",
         title = "Correlation Matrix of Ozone, Temp, and Wind")

Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.

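The strongest correlation is between ozone and temperature (0.70), a fairly strong positive relationship: days with higher temperatures tend to have higher ozone. Ozone and wind are negatively correlated (-0.60), so windier days tend to have lower ozone; ozone is therefore more strongly correlated with temperature than with wind speed. The weakest correlation is between temperature and wind (-0.51), a moderate negative relationship.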
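The strongest and weakest pairs can also be pulled out of the matrix programmatically, by ranking the off-diagonal entries by absolute value. This is a minimal base-R sketch; pair_table, strongest, and weakest are illustrative names.

```r
data("airquality")
cor_matrix <- cor(airquality[, c("Ozone", "Temp", "Wind")], use = "complete.obs")

# Keep each pair once by taking only the upper triangle
pairs_idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(cor_matrix)[pairs_idx[, "row"]],
  var2 = colnames(cor_matrix)[pairs_idx[, "col"]],
  r = cor_matrix[pairs_idx]
)

# Rank by strength (absolute value), strongest first
pair_table <- pair_table[order(-abs(pair_table$r)), ]
strongest <- pair_table[1, ]
weakest <- pair_table[nrow(pair_table), ]

pair_table
```

Ranking by absolute value matters here because a correlation of -0.60 is stronger than one of -0.51, even though it is the smaller number.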

Task 6: Summary Table

Generate the summary table grouped by Month. It should include the count and the mean ozone, mean temperature, and mean wind speed per month.

# your code goes here
summary_table <- airquality |>
  group_by(Month) |>
  summarise(
    Count = n(),
    Avg_Ozone = mean(Ozone, na.rm = TRUE),
    Avg_Temp = mean(Temp, na.rm = TRUE),
    Avg_Wind = mean(Wind, na.rm = TRUE)
  )
summary_table
## # A tibble: 5 × 5
##   Month Count Avg_Ozone Avg_Temp Avg_Wind
##   <int> <int>     <dbl>    <dbl>    <dbl>
## 1     5    31      23.6     65.5    11.6 
## 2     6    30      29.4     79.1    10.3 
## 3     7    31      59.1     83.9     8.94
## 4     8    31      60.0     84.0     8.79
## 5     9    30      31.4     76.9    10.2

Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?

According to the summary table, August has the highest average ozone level (about 60.0), just ahead of July (59.1). Average temperature rises sharply from May to June, by roughly 14 degrees (65.5 to 79.1), peaks in July and August (83.9 and 84.0), and then falls by about 7 degrees in September (84.0 to 76.9). Average wind speed follows the opposite pattern: it is lowest in July (8.94) and August (8.79) and higher in May, June, and September. These patterns are consistent with the correlations in Task 5: hot days with light wind tend to have higher ozone, plausibly because heat and sunlight favor ozone formation while weak winds let it accumulate instead of dispersing it.
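One caveat when reading the table: Count comes from n(), which counts every day in the month, including days with a missing Ozone value. The sketch below is an optional extension, not part of the required table, and shows how many ozone readings actually sit behind each monthly average. The column names days, ozone_obs, and ozone_missing are illustrative.

```r
library(dplyr)
data("airquality")

ozone_coverage <- airquality |>
  group_by(Month) |>
  summarise(
    days = n(),                        # all rows in the month
    ozone_obs = sum(!is.na(Ozone)),    # days with an ozone reading
    ozone_missing = sum(is.na(Ozone))  # days without one
  )
ozone_coverage
```

A monthly average based on few readings is less reliable than one based on nearly every day, so it is worth checking ozone_obs before comparing months.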

Submission Requirements

Publish it to RPubs and submit your link on Blackboard.