Objective:

You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.

Dataset

Source: Built-in R dataset airquality.

Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.

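Variables (selected for this assignment):

-Ozone: ozone concentration in parts per billion (ppb).

-Temp: maximum daily temperature in degrees Fahrenheit.

-Wind: average wind speed in miles per hour (mph).

-Month: month of observation (5 = May through 9 = September).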

Notes

-The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.

-If you encounter errors, check that tidyverse and corrplot are installed and loaded.

-Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
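
As a quick illustration of the first note, the following sketch counts the missing values per column and shows what happens when na.rm is left out. It uses base R only, and the object names (na_counts, mean_with_na, and so on) are illustrative rather than part of the assignment.

```r
data("airquality")

# Count missing values in each column
na_counts <- colSums(is.na(airquality))
na_counts

# Without na.rm = TRUE, a single NA makes the whole result NA
mean_with_na <- mean(airquality$Ozone)
mean_no_na <- mean(airquality$Ozone, na.rm = TRUE)
mean_with_na
mean_no_na

# use = "complete.obs" keeps only rows where both variables are present
r_ozone_solar <- cor(airquality$Ozone, airquality$Solar.R, use = "complete.obs")
r_ozone_solar
```

Only Ozone and Solar.R should show non-zero counts, which is why Temp and Wind give the same results with or without na.rm = TRUE.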

Instructions:

Complete the following tasks using R to analyze the airquality dataset. Submit your RPubs link, which should include code, outputs (tables and plots), and written interpretations for each task. Make sure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.

Loading dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data("airquality")

Tasks and Questions

Task 1: Measures of Central Tendency and Spread

Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.

Ozone

summary_stats <- airquality |>
  summarise(
    mean_ozone = mean(Ozone, na.rm = TRUE),
    median_ozone = median(Ozone, na.rm = TRUE),
    sd_ozone = sd(Ozone, na.rm = TRUE),
    min_ozone = min(Ozone, na.rm = TRUE),
    max_ozone = max(Ozone, na.rm = TRUE)
  )

summary_stats
##   mean_ozone median_ozone sd_ozone min_ozone max_ozone
## 1   42.12931         31.5 32.98788         1       168

Temp

summary_stats2 <- airquality |>
  summarise(
    mean_temp = mean(Temp, na.rm = TRUE),
    median_temp = median(Temp, na.rm = TRUE),
    sd_temp = sd(Temp, na.rm = TRUE),
    min_temp = min(Temp, na.rm = TRUE),
    max_temp = max(Temp, na.rm = TRUE)
  )

summary_stats2
##   mean_temp median_temp sd_temp min_temp max_temp
## 1  77.88235          79 9.46527       56       97

Wind

summary_stats3 <- airquality |>
  summarise(
    mean_wind = mean(Wind, na.rm = TRUE),
    median_wind = median(Wind, na.rm = TRUE),
    sd_wind = sd(Wind, na.rm = TRUE),
    min_wind = min(Wind, na.rm = TRUE),
    max_wind = max(Wind, na.rm = TRUE)
  )

summary_stats3
##   mean_wind median_wind  sd_wind min_wind max_wind
## 1  9.957516         9.7 3.523001      1.7     20.7

Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?

Ozone

The mean for Ozone is 42.13 and the median is 31.5. Because the mean is well above the median, the distribution appears right-skewed: a small number of high-ozone days pull the mean upward. The standard deviation is 32.99, which is large relative to the mean, so Ozone levels are widely spread and highly variable.

Temp

The mean for Temp is 77.88 and the median is 79. The two are close, so the distribution is approximately symmetric, with at most a slight left skew because the mean sits just below the median. The standard deviation is 9.47, meaning daily maximum temperatures typically differ from the mean by about 9.5°F. This is a moderate spread with no extreme variability.

Wind

The mean for Wind is 9.96 and the median is 9.7. The two are close, so the distribution is approximately symmetric and not strongly skewed. The standard deviation is 3.52, meaning average daily wind speeds typically differ from the mean by about 3.5 mph. This is a small spread in wind speed with no extreme variability.
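
The comparison above can also be made in one table. The sketch below is a base R alternative to the three summarise() calls. It adds two columns that are not part of the assignment: mean_minus_median (a positive value suggests right skew) and cv, the standard deviation divided by the mean, as a rough measure of relative spread.

```r
data("airquality")
vars <- c("Ozone", "Temp", "Wind")

# One row per variable: mean, median, sd, min, max (missing values ignored)
stats <- t(sapply(airquality[vars], function(x) {
  c(mean = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    min = min(x, na.rm = TRUE),
    max = max(x, na.rm = TRUE))
}))

# Add the mean-median gap and the relative spread (sd / mean)
stats <- cbind(stats,
               mean_minus_median = stats[, "mean"] - stats[, "median"],
               cv = stats[, "sd"] / stats[, "mean"])
round(stats, 2)
```

Putting the variables side by side makes it easier to see that Ozone has both the largest mean-median gap and the largest relative spread.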

Task 2: Histogram

Generate the histogram for Ozone.

library(ggplot2)

ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 10, fill = "Blue", color = "black") +
  labs(title = "Histogram of Ozone levels", 
       x = "Ozone concentration (ppb)", 
       y = "Count") +
  theme_minimal()

Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?

Most of the data lie below roughly 60 ppb, and the center is around 30-40 ppb. The distribution is right-skewed, with a tail stretching into the higher range (160-170 ppb). It is unimodal, meaning it has only one peak.

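The bars near 0 do not represent missing values coded as 0. Missing Ozone readings are stored as NA and are dropped before plotting, and the smallest recorded value is 1 ppb (see Task 1), so these bars reflect real days with very low ozone. On the high end, a small number of days have very high ozone values (up to the maximum of 168 ppb), which are extreme compared with the majority of the dataset. These are potential outliers.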
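
Both ends of the distribution can be checked numerically. The sketch below uses base R only and assumes the common 1.5 × IQR rule as the definition of a potential outlier. That rule is one convention among several, and the object names are illustrative.

```r
data("airquality")
ozone <- airquality$Ozone

# Missing readings are NA (not 0) and are dropped from the histogram
n_missing <- sum(is.na(ozone))
ozone_min <- min(ozone, na.rm = TRUE)

# 1.5 * IQR rule: values above the upper fence are flagged as potential outliers
q <- quantile(ozone, c(0.25, 0.75), na.rm = TRUE)
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
high_values <- sort(ozone[!is.na(ozone) & ozone > upper_fence])

n_missing
ozone_min
upper_fence
high_values
```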

Task 3: Boxplot

Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.

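# Recode Month (5-9) into a factor ordered May to September,
# so the boxes appear in calendar order rather than alphabetically
month_levels <- c("May", "June", "July", "August", "September")
plot2 <- airquality |>
  mutate(month_name = factor(
    case_when(
      Month == 5 ~ "May",
      Month == 6 ~ "June",
      Month == 7 ~ "July",
      Month == 8 ~ "August",
      Month == 9 ~ "September"
    ),
    levels = month_levels
  ))

# Map fill to month_name and set the colors with a named scale
ggplot(plot2, aes(x = month_name, y = Ozone, fill = month_name)) +
  geom_boxplot(show.legend = FALSE) +
  scale_fill_manual(values = c("May" = "pink", "June" = "blue", "July" = "#2ca02c",
                               "August" = "#FF4040", "September" = "yellow")) +
  labs(title = "Ozone levels by month",
       x = "Month",
       y = "Ozone (ppb)") +
  theme_minimal()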
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?

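Ozone levels vary clearly across months. July has the highest median ozone (about 60 ppb), followed by August, while May, June, and September have much lower medians, with May the lowest. July also shows no outliers. Where points do appear beyond the whiskers in other months, they mark isolated days with unusually high ozone for that month. Overall, ozone is highest, not lowest, in the hottest part of the summer.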
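
The monthly medians can be computed directly rather than read off the plot. This is a minimal base R sketch in which med_by_month and top_month are illustrative names. The dplyr equivalent would be group_by(Month) followed by summarise() with median(Ozone, na.rm = TRUE).

```r
data("airquality")

# Median Ozone for each month, ignoring missing values
med_by_month <- tapply(airquality$Ozone, airquality$Month, median, na.rm = TRUE)

# Replace the month numbers (5-9) with month names
names(med_by_month) <- month.name[as.integer(names(med_by_month))]
med_by_month

# Month with the highest median
top_month <- names(which.max(med_by_month))
top_month
```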

Task 4: Scatterplot

Produce the scatterplot of Temp vs. Ozone, colored by Month.

ggplot(plot2, aes(x = Ozone, y = Temp, color = month_name)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Scatterplot of Temp vs. Ozone levels",
    x = "Ozone (ppb)",
    y = "Temperature (°F)",
    color = "Month"
  ) +
  scale_color_manual(values = c("May" = "#2ca02c", 
                                "June" = "#FF4040",
                                "July"= "#009999",
                                "August"= "#ff7f0e",
                                "September"= "magenta")) +
  theme_minimal()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.

There is a visible positive relationship between temperature and ozone: as temperature increases, ozone levels tend to increase. Points from the hotter months (July and August) cluster toward the upper right of the plot, with high temperatures and high ozone. Points from the cooler months (May and September) cluster toward the lower left, with lower temperatures and lower ozone. However, correlation does not imply causation.
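
The strength of the relationship can be quantified as well as seen. The sketch below uses base R only. Regressing Ozone on Temp with lm() is one possible choice for summarizing the trend and is not required by the assignment.

```r
data("airquality")

# Pearson correlation between Temp and Ozone on rows where both are present
r_temp_ozone <- cor(airquality$Temp, airquality$Ozone, use = "complete.obs")

# Simple linear fit: estimated change in Ozone (ppb) per 1 degree F rise in Temp.
# lm() drops rows with missing values by default.
fit <- lm(Ozone ~ Temp, data = airquality)
slope <- coef(fit)[["Temp"]]

r_temp_ozone
slope
```

A positive slope agrees with the upward pattern in the scatterplot, but it describes association only and says nothing about cause.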

Task 5: Correlation Matrix

Compute and visualize the correlation matrix for Ozone, Temp, and Wind.

Computing the correlation matrix for Ozone, Temp, and Wind

cor_matrix <- cor(
  airquality |>
    select(Ozone, Temp, Wind), use = "complete.obs")

cor_matrix
##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000

Visualizing

corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45, addCoef.col = "black",
         title = "Correlation Matrix of Ozone, Temp, and Wind",
         mar = c(0, 0, 2, 0)) # extra top margin so the title is not clipped

Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.

Temperature and Ozone have the strongest correlation, 0.70: as temperature increases, ozone levels tend to increase. Ozone and Wind have a moderate negative correlation of -0.60, so ozone tends to be higher on days with lower wind speed. Temperature and Wind have the weakest correlation, -0.51, indicating that temperature tends to be higher when wind speed is lower. Ozone is therefore more strongly correlated with temperature than with wind speed.
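
With only three variables the ranking is easy to read from the matrix, but it can also be extracted programmatically. The sketch below uses base R only, and pair_table is an illustrative name. It lists each pair once and sorts the pairs by the absolute value of the correlation.

```r
data("airquality")
cor_matrix <- cor(airquality[, c("Ozone", "Temp", "Wind")], use = "complete.obs")

# Keep each pair once (upper triangle of the matrix)
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r = cor_matrix[idx]
)

# Strongest relationship first, judged by absolute correlation
pair_table <- pair_table[order(-abs(pair_table$r)), ]
pair_table
```

Sorting on the absolute value matters here because a correlation of -0.60 is stronger than one of -0.51, even though it is the smaller number.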

Task 6: Summary Table

Generate the summary table grouped by Month. It should include the count and the mean of ozone, temperature, and wind for each month.

summary_table <- airquality |>
  group_by(Month) |>
  summarise(
    Count = n(),
    Avg_ozone = mean(Ozone, na.rm = TRUE),
    Avg_temp = mean(Temp, na.rm = TRUE),
    Avg_wind = mean(Wind, na.rm = TRUE),
  ) 
summary_table
## # A tibble: 5 × 5
##   Month Count Avg_ozone Avg_temp Avg_wind
##   <int> <int>     <dbl>    <dbl>    <dbl>
## 1     5    31      23.6     65.5    11.6 
## 2     6    30      29.4     79.1    10.3 
## 3     7    31      59.1     83.9     8.94
## 4     8    31      60.0     84.0     8.79
## 5     9    30      31.4     76.9    10.2

Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?

August has the highest average ozone level (60.0 ppb), only slightly above July (59.1 ppb). Average temperature rises from May to August and drops in September. In contrast, average wind speed falls from May to August and rises again in September. One plausible explanation is that ground-level ozone forms more readily in hot, sunny conditions and that calmer air disperses it less, so the hot, less windy months of July and August have the highest ozone.
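
Because Ozone has missing values, it helps to see how many readings each monthly average is based on. The sketch below uses base R only. The n_ozone column and the sort by average ozone are additions for illustration and are not part of the required table.

```r
data("airquality")

# Monthly means of Ozone, Temp, and Wind, ignoring missing values
monthly <- aggregate(airquality[c("Ozone", "Temp", "Wind")],
                     by = list(Month = airquality$Month),
                     FUN = mean, na.rm = TRUE)

# Number of non-missing Ozone readings behind each monthly average
monthly$n_ozone <- as.vector(tapply(!is.na(airquality$Ozone), airquality$Month, sum))

# Highest average ozone first
monthly <- monthly[order(-monthly$Ozone), ]
monthly
```

Note that Count = n() in the summary table counts all days in the month, while n_ozone counts only the days with an ozone reading, so a monthly average based on few readings deserves more caution.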

Submission Requirements

Publish your report to RPubs and submit the link on Blackboard.