Objective:

You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.

Dataset

Source: Built-in R dataset airquality.

Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.

Variables (selected for this assignment):

Notes

- The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.

- If you encounter errors, check that tidyverse and corrplot are installed and loaded.

- Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
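As a quick check on the first note, the sketch below counts the missing values per column and shows what happens when na.rm = TRUE is left out. It uses base R only; the object names na_counts and mean_ozone are illustrative and not part of the assignment.

```r
data("airquality")

# Count missing values in every column of the dataset
na_counts <- colSums(is.na(airquality))
na_counts

# Without na.rm = TRUE, a single NA makes the whole result NA
mean(airquality$Ozone)

# With na.rm = TRUE, the missing days are dropped before averaging
mean_ozone <- mean(airquality$Ozone, na.rm = TRUE)
mean_ozone
```

Only Ozone and Solar.R contain missing values, which is why the Temp and Wind summaries later in the report do not need na.rm = TRUE.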

Instructions:

Complete the following tasks using R to analyze the airquality dataset. Submit an RPubs link to a report that includes code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.

#Load your dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data("airquality")

Tasks and Questions

Task 1: Measures of Central Tendency and Spread

Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.

#Your code for Ozone goes here

ozone_summary <- airquality |>
  summarise(
    # Refer to columns by name inside summarise(); na.rm = TRUE skips missing days
    mean_ozone = mean(Ozone, na.rm = TRUE),
    median_ozone = median(Ozone, na.rm = TRUE),
    sd_ozone = sd(Ozone, na.rm = TRUE),
    min_ozone = min(Ozone, na.rm = TRUE),
    max_ozone = max(Ozone, na.rm = TRUE)
  )

ozone_summary
##   mean_ozone median_ozone sd_ozone min_ozone max_ozone
## 1   42.12931         31.5 32.98788         1       168
#Your code for Temp goes here

temp_summary <- airquality |>
  summarise(
    # Temp has no missing values, so na.rm is not needed
    mean_temp = mean(Temp),
    median_temp = median(Temp),
    sd_temp = sd(Temp),
    max_temp = max(Temp),
    min_temp = min(Temp)
  )

temp_summary
##   mean_temp median_temp sd_temp max_temp min_temp
## 1  77.88235          79 9.46527       97       56
#Your code for Wind goes here

wind_summary <- airquality |>
  summarise(
    # Wind has no missing values, so na.rm is not needed
    mean_wind = mean(Wind),
    median_wind = median(Wind),
    sd_wind = sd(Wind),
    max_wind = max(Wind),
    min_wind = min(Wind)
  )

wind_summary
##   mean_wind median_wind  sd_wind max_wind min_wind
## 1  9.957516         9.7 3.523001     20.7      1.7

Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?

For Ozone, the mean (42.13) is noticeably higher than the median (31.5), which suggests a right-skewed distribution: a few very high ozone values pull the mean upward. For Temp, the mean (77.88) and median (79) are close, with the mean slightly below the median, so the distribution is roughly symmetric with at most a slight left skew. For Wind, the mean (9.96) and median (9.7) are also similar, which suggests an approximately symmetric distribution; agreement between the mean and median does not by itself show that the data are normal. The standard deviation measures how far values typically fall from the mean. Ozone has the largest standard deviation (32.99), meaning ozone levels fluctuate widely from day to day; Temp is intermediate (9.47) and Wind is smallest (3.52). Because the three variables are measured in different units, these values are not directly comparable: relative to its mean, Temp actually varies the least (9.47 / 77.88 ≈ 0.12, versus 3.52 / 9.96 ≈ 0.35 for Wind and 32.99 / 42.13 ≈ 0.78 for Ozone).
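Standard deviations measured in different units are hard to compare directly. One common way to put them on a shared scale is the coefficient of variation (standard deviation divided by the mean). The sketch below is a supplementary check in base R, not part of the assigned task; vars and spread are illustrative names.

```r
data("airquality")

# Keep only the three variables used in Task 1
vars <- airquality[, c("Ozone", "Temp", "Wind")]

# For each variable, compute the mean, the SD, and the coefficient of variation
spread <- sapply(vars, function(x) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  c(mean = m, sd = s, cv = s / m)
})

# Rows are the statistics, columns are the variables
round(spread, 3)
```

A larger cv means more variability relative to the typical level of that variable, regardless of its units.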

Task 2: Histogram

Generate the histogram for Ozone.

#Your code goes here

library(ggplot2)

ggplot(airquality, aes(x=Ozone)) +
  geom_histogram(binwidth= 20, fill= "#1f77b4", color = "black")+
  labs(title= "Histogram of ozone levels" , x = "ozone levels", y = "count")+
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).

Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?

The histogram of ozone levels is right-skewed: most observations are concentrated at lower ozone values, with a few much higher values extending into the right tail. The distribution appears unimodal, with one clear peak in the lower ozone range. There are also a few unusually high values, which represent days with extremely high ozone concentrations.
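A histogram shows the long right tail but does not say which values count as outliers. One conventional rule, assumed here for illustration, flags anything more than 1.5 times the interquartile range above the third quartile. The sketch uses base R; ozone, upper_fence, and high_values are illustrative names.

```r
data("airquality")

# Drop the missing ozone readings before computing quartiles
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Upper fence of the 1.5 * IQR rule
quartiles <- quantile(ozone, probs = c(0.25, 0.75))
upper_fence <- quartiles[["75%"]] + 1.5 * IQR(ozone)

# Ozone readings that fall above the fence
high_values <- sort(ozone[ozone > upper_fence])
upper_fence
high_values
```

Other outlier definitions exist, so the flagged days are a starting point for inspection rather than a definitive list.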

Task 3: Boxplot

Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when() from Week 4. Generate a boxplot of Ozone by month_name.

# Your code here

airquality_bymonth <- airquality |>
  
  mutate(month_name = case_when(
    Month == 5 ~ "May",
    Month == 6 ~ "June",
    Month == 7 ~ "July",
    Month == 8 ~ "August",
    Month == 9 ~ "September"
))



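# Set the factor levels explicitly so months plot in calendar order, not alphabetically
month_levels <- c("May", "June", "July", "August", "September")

ggplot(airquality_bymonth, aes(x = factor(month_name, levels = month_levels), y = Ozone)) +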
  geom_boxplot(fill= "#2ca02c") +
  labs(title = "Ozone Levels by Month",
       x = "Month", y = "Ozone") +
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?

Ozone levels vary noticeably across the months. The boxplot shows that July has the highest median ozone level, followed closely by August, indicating that ozone concentrations tend to peak in midsummer, when temperatures are highest. May, June, and September have lower median ozone levels, suggesting that ozone concentrations are generally lower and more consistent in those months. There are outliers in May, June, August, and September. These likely represent days with unusually high ozone levels, possibly caused by specific weather conditions such as heat waves.
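The medians read off the boxplot can be confirmed numerically. The sketch below is a base R cross-check rather than the assigned case_when() approach: it builds the month names from R's built-in month.name constant and computes the median ozone per month. The names month_levels, month_factor, and median_by_month are illustrative.

```r
data("airquality")

# Month names in calendar order, used as explicit factor levels
month_levels <- c("May", "June", "July", "August", "September")
month_factor <- factor(month.name[airquality$Month], levels = month_levels)

# Median ozone per month, ignoring missing readings
median_by_month <- tapply(airquality$Ozone, month_factor, median, na.rm = TRUE)
median_by_month

# Month with the highest median
names(which.max(median_by_month))
```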

Task 4: Scatterplot

Produce the scatterplot of Temp vs. Ozone, colored by Month.

# Your code goes here

ggplot(airquality_bymonth, aes(x= Temp, y= Ozone,color= Month))+
  geom_point(alpha=0.7)+
  labs(title = "Scatterplot of Temperature vs Ozone",
       x ="Temp" , 
       y = "Ozone", 
       color ="Month")+
  theme_minimal()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.

Yes, there is a clear positive relationship between temperature and ozone levels: as temperature increases, ozone levels also tend to rise. The scatterplot also shows clustering by month. Because Month is mapped as a number, the points are shaded on a continuous blue scale, with earlier months darker and later months lighter. The lighter blue points (representing July and August) are mostly found in the upper right part of the plot, where both temperature and ozone are higher. The darker points (representing May and June) appear toward the lower left, showing cooler temperatures and lower ozone values.
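The upward trend in the scatterplot can be summarized with a simple straight-line fit. This is a supplementary sketch, not part of the assigned task, and it assumes a linear relationship is a reasonable first approximation; fit and slope are illustrative names.

```r
data("airquality")

# Fit Ozone as a linear function of Temp; lm() drops rows with missing Ozone by default
fit <- lm(Ozone ~ Temp, data = airquality)

# Estimated change in ozone for each additional degree of temperature
slope <- coef(fit)[["Temp"]]
slope

# Number of days actually used in the fit
nobs(fit)
```

A positive slope agrees with the visual impression that hotter days tend to have higher ozone readings.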

Task 5: Correlation Matrix

Compute and visualize the correlation matrix for Ozone, Temp, and Wind.

# Your code goes here

cor_matrix <- cor(
  airquality |>
    select(Ozone,Temp,Wind), use= "complete.obs"
)

cor_matrix
##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45, addCoef.col = "black",
         title = "Correlation Matrix for Ozone, Temp, and Wind",
         mar = c(0, 0, 2, 0))  # extra top margin so the title is not clipped

Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.

The strongest correlation is between Ozone and Temp (r = 0.70), a strong positive relationship: as temperature increases, ozone levels also tend to increase, so warmer days are usually associated with higher ozone concentrations. The weakest correlation is between Wind and Temp (r = -0.51), a moderate negative relationship in which higher wind speeds are generally linked to lower temperatures. The correlation between Wind and Ozone is also negative and moderately strong (r = -0.60), meaning that when wind speed increases, ozone levels tend to decrease. Ozone is therefore more strongly correlated with temperature than with wind speed.
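Ranking the pairs by absolute correlation makes the strongest and weakest relationships explicit, and counting complete cases shows how many days the matrix is actually based on. The sketch below uses base R; vars, n_complete, pair_cor, and ranked are illustrative names.

```r
data("airquality")

# The three variables used in the correlation matrix
vars <- airquality[, c("Ozone", "Temp", "Wind")]

# use = "complete.obs" keeps only rows with no missing value in any of the three
n_complete <- sum(complete.cases(vars))
cor_complete <- cor(vars, use = "complete.obs")

# Collect the three distinct pairs and sort them by strength (absolute value)
pair_cor <- c(
  "Ozone-Temp" = cor_complete["Ozone", "Temp"],
  "Ozone-Wind" = cor_complete["Ozone", "Wind"],
  "Temp-Wind"  = cor_complete["Temp", "Wind"]
)
ranked <- pair_cor[order(abs(pair_cor), decreasing = TRUE)]

n_complete
ranked
```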

Task 6: Summary Table

Generate a summary table grouped by Month. It should include the number of observations and the mean ozone, mean temperature, and mean wind speed for each month.

# your code goes here

summary_table <- airquality |>
  group_by(Month)|>
  summarise(
    count= n(),
    Avg_ozone = mean(Ozone, na.rm = TRUE),
    Avg_temp = mean(Temp),
    Avg_wind = mean(Wind)
  )

summary_table
## # A tibble: 5 × 5
##   Month count Avg_ozone Avg_temp Avg_wind
##   <int> <int>     <dbl>    <dbl>    <dbl>
## 1     5    31      23.6     65.5    11.6 
## 2     6    30      29.4     79.1    10.3 
## 3     7    31      59.1     83.9     8.94
## 4     8    31      60.0     84.0     8.79
## 5     9    30      31.4     76.9    10.2

Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?

The month with the highest average ozone level is August (60.0), followed closely by July (59.1). These two months also have the highest average temperatures (84.0 and 83.9) and the lowest average wind speeds (8.79 and 8.94). In contrast, May and June show lower average ozone levels (23.6 and 29.4), along with cooler temperatures (65.5 and 79.1) and stronger winds (11.6 and 10.3). September returns to a similar ozone level (31.4) as temperatures fall and winds pick up. These patterns suggest that hotter temperatures and calmer winds are associated with higher ozone concentrations.
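The count column in the summary table gives the number of days in each month, not the number of days with an ozone reading, so the monthly ozone averages rest on fewer observations than the table suggests. The sketch below adds a count of non-missing ozone values per month. It assumes dplyr is installed; ozone_coverage, days, and ozone_days are illustrative names.

```r
library(dplyr)
data("airquality")

ozone_coverage <- airquality |>
  group_by(Month) |>
  summarise(
    days = n(),                        # all days recorded in the month
    ozone_days = sum(!is.na(Ozone)),   # days with a non-missing ozone reading
    avg_ozone = mean(Ozone, na.rm = TRUE)
  )

ozone_coverage
```

Months with few ozone readings give less reliable averages, which is worth keeping in mind when comparing them.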

Submission Requirements

Publish your report to RPubs and submit the link on Blackboard.