Objective:

You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.

Dataset

Source: Built-in R dataset airquality.

Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.

Variables (selected for this assignment):

Notes

-The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.

-If you encounter errors, check that tidyverse and corrplot are installed and loaded.

-Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
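
To see how much data the missing-value settings mentioned in the notes actually drop, a quick count of NA values per column can help. The following is a minimal sketch in base R; the names na_counts and n_complete are illustrative and not part of the assignment.

```r
# Load the built-in dataset
data("airquality")

# Count missing values in each column
na_counts <- colSums(is.na(airquality))
na_counts

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(airquality))
n_complete
```

Note that Task 5 selects only Ozone, Temp, and Wind before calling cor, so use = "complete.obs" there drops only the rows where one of those three columns is missing.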

Instructions:

Complete the following tasks using R to analyze the airquality dataset. Submit your RPubs link that includes code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.

#Load your dataset

library(tidyverse)
library(corrplot)
library(ggplot2)

data("airquality")

Tasks and Questions

Task 1: Measures of Central Tendency and Spread

Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.

#Your code for Ozone goes here
summary_ozone <- airquality |>
  summarise(
    mean_ozone = mean(Ozone, na.rm = TRUE),
    median_ozone = median(Ozone, na.rm = TRUE),
    sd_ozone = sd(Ozone, na.rm = TRUE),
    min_ozone = min(Ozone, na.rm = TRUE),
    max_ozone = max(Ozone, na.rm = TRUE)
  )
summary_ozone
##   mean_ozone median_ozone sd_ozone min_ozone max_ozone
## 1   42.12931         31.5 32.98788         1       168
#Your code for Temp goes here
summary_temp <- airquality |>
  summarise(
    mean_temp = mean(Temp, na.rm = TRUE),
    median_temp = median(Temp, na.rm = TRUE),
    sd_temp = sd(Temp, na.rm = TRUE),
    min_temp = min(Temp, na.rm = TRUE),
    max_temp = max(Temp, na.rm = TRUE)
  )
summary_temp
##   mean_temp median_temp sd_temp min_temp max_temp
## 1  77.88235          79 9.46527       56       97
#Your code for Wind goes here
summary_wind <- airquality |>
  summarise(
    mean_wind = mean(Wind, na.rm = TRUE),
    median_wind = median(Wind, na.rm = TRUE),
    sd_wind = sd(Wind, na.rm = TRUE),
    min_wind = min(Wind, na.rm = TRUE),
    max_wind = max(Wind, na.rm = TRUE)
  )
summary_wind
##   mean_wind median_wind  sd_wind min_wind max_wind
## 1  9.957516         9.7 3.523001      1.7     20.7

Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?

The mean for ozone is 42.13 and the median is 31.5. The mean is well above the median, which suggests a right-skewed distribution: a few high readings (up to the maximum of 168) pull the mean upward.

The standard deviation of ozone is 32.99, about 78% of the mean, so ozone varies widely from day to day.

The mean for temp is 77.88 and the median is 79. They are close, with the mean slightly below the median, suggesting a roughly symmetric distribution with at most a slight left skew.

The standard deviation of temp is 9.47, which is small relative to the observed range of 56 to 97, so temperature shows no extreme variability.

The mean for wind is 9.96 and the median is 9.7. They are very similar, suggesting a roughly symmetric distribution that is not strongly skewed.

The standard deviation of wind is 3.52, about 35% of the mean, indicating a moderate amount of variability.
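
One way to make the comparison of skew and variability less subjective is to compute the gap between mean and median and the coefficient of variation (standard deviation divided by the mean) for each variable. The sketch below is only an illustration; the helper name spread_check is hypothetical. The coefficient of variation depends on where zero sits on the scale, so for Temp in degrees Fahrenheit it is only a rough guide.

```r
data("airquality")

# Hypothetical helper: mean-median gap and coefficient of variation for one vector
spread_check <- function(x) {
  m <- mean(x, na.rm = TRUE)
  c(
    mean_minus_median = m - median(x, na.rm = TRUE),  # > 0 hints at right skew
    cv = sd(x, na.rm = TRUE) / m                      # SD relative to the mean
  )
}

# Apply the helper to the three variables; the result is a 2 x 3 matrix
checks <- sapply(airquality[c("Ozone", "Temp", "Wind")], spread_check)
round(checks, 2)
```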

Task 2: Histogram

Generate the histogram for Ozone.

#Your code goes here
ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 20, fill = "darkmagenta", color = "black") +
  labs(title = "Histogram of Ozone Levels", x = "Ozone (ppb)", y = "Count") +
  theme_minimal()

Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?

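The ozone distribution is unimodal and heavily skewed to the right: most readings fall in the lowest bins, and a long tail extends to the maximum of 168 ppb. The highest values in that tail are potential outliers. There also appears to be a mild spike around 70-80 ppb in the distribution.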
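
A histogram alone does not say which values count as outliers. One common convention is the boxplot rule, which flags points more than 1.5 times the IQR beyond the hinges. The sketch below applies that rule with base R's boxplot.stats; the variable names are illustrative, and other outlier definitions would give different answers.

```r
data("airquality")

# Values flagged by the boxplot rule (beyond 1.5 * IQR from the hinges);
# boxplot.stats ignores missing values
ozone_outliers <- boxplot.stats(airquality$Ozone)$out
ozone_outliers

# End of the upper whisker: the largest value not flagged as an outlier
upper_whisker <- boxplot.stats(airquality$Ozone)$stats[5]
upper_whisker
```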

Task 3: Boxplot

Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.

# Your code here
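airquality_named <- airquality |>  # copy with Month recoded as an ordered month_name factor
  mutate(month_name = factor(case_when(
    Month == 5 ~ "May", Month == 6 ~ "June", Month == 7 ~ "July",
    Month == 8 ~ "August", Month == 9 ~ "September"),
    levels = c("May", "June", "July", "August", "September")))
ggplot(airquality_named, aes(x = month_name, y = Ozone)) +
  geom_boxplot(fill = c("darkred", "magenta", "purple", "darkcyan", "darkgreen")) +
  labs(title = "Boxplot of Ozone Levels by Month", x = "Month", y = "Ozone (ppb)") +
  theme_minimal()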

Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?

Ozone levels are similar in May, June, and September, with medians around 25 ppb and little spread. They are higher in July and August, with medians around 50-60 ppb and greater spread.

July has the highest median ozone.

May, June, August, and September include outliers. These may reflect measurement or recording errors, or unusual conditions such as weather events that affected ozone levels on those days.
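
Medians and outliers read off a boxplot can be cross-checked numerically. The sketch below is one possible way to do that with dplyr; the column names are illustrative. boxplot.stats uses Tukey hinges, which can differ slightly from the quartiles that ggplot2 draws, so the outlier counts may not match the plot exactly.

```r
library(dplyr)
data("airquality")

# Median, IQR, and number of boxplot-rule outliers for Ozone in each month
monthly_ozone <- airquality |>
  group_by(Month) |>
  summarise(
    median_ozone = median(Ozone, na.rm = TRUE),
    iqr_ozone = IQR(Ozone, na.rm = TRUE),
    n_outliers = length(boxplot.stats(Ozone)$out)  # points beyond the whiskers
  )
monthly_ozone
```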

Task 4: Scatterplot

Produce the scatterplot of Temp vs. Ozone, colored by Month.

# Your code goes here
ggplot(airquality, aes(x = Temp, y = Ozone, color = factor(Month))) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Scatterplot of Temperature vs. Ozone",
    x = "Temperature (F)",
    y = "Ozone (ppb)",
    color = "Month"
  ) +
  scale_color_manual(values = c("5" = "darkred", 
                                "6" = "magenta", 
                                "7" = "purple", 
                                "8" = "darkcyan", 
                                "9" = "darkgreen")) +
  theme_minimal()

Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.

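There is a visible positive relationship between temperature and ozone levels: as temperature increases, ozone concentration tends to increase. Months cluster together by temperature, with May points at the cooler end and July and August points at the warmer end, where ozone is also highest. There seem to be more outliers at 80-85 degrees.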
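
The visible trend can be summarized with a simple linear fit. This is a rough sketch only: the relationship may not be strictly linear, so a straight line is an approximation, and the name fit is illustrative.

```r
data("airquality")

# Simple linear fit of Ozone on Temp; lm() drops rows with missing Ozone by default
fit <- lm(Ozone ~ Temp, data = airquality)

# The Temp coefficient is the estimated change in ozone (ppb) per degree F
coef(fit)

# Number of observations actually used in the fit
nobs(fit)
```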

Task 5: Correlation Matrix

Compute and visualize the correlation matrix for Ozone, Temp, and Wind.

# Your code goes here
#compute
cor_matrix <- cor(
  airquality |>
    select(Ozone, Temp, Wind), use = "complete.obs")
cor_matrix
##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000
#visualize
# mar adds space above the plot so the title is not clipped
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45, addCoef.col = "white",
         title = "Correlation Matrix of Numeric Variables",
         mar = c(0, 0, 2, 0))

Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.

The strongest correlation is between ozone and temperature, with a correlation coefficient of 0.70. This suggests that ozone tends to be higher on warmer days and lower on cooler days. Ozone is also negatively correlated with wind (-0.60), so ozone is more strongly correlated with temperature than with wind speed.

The weakest correlation is between wind and temperature, with a correlation coefficient of -0.51. This is still a moderate negative relationship: windier days tend to be cooler, and vice versa.
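
Ranking the correlations by absolute value makes the strongest and weakest pairs explicit instead of reading them off the matrix. The following is a minimal sketch in base R; the name cor_pairs is illustrative.

```r
data("airquality")

# Same correlation matrix as in the task, using base-R column selection
cor_matrix <- cor(airquality[c("Ozone", "Temp", "Wind")], use = "complete.obs")

# Keep each pair once (upper triangle) and sort by absolute correlation
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r = cor_matrix[idx]
)
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
cor_pairs
```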

Task 6: Summary Table

Generate the summary table grouped by Month. It should include the count, mean ozone, mean temperature, and mean wind for each month.

# your code goes here
summary_table <- airquality |>
  group_by(Month) |>
  summarise(
    count = n(),
    avg_ozone = mean(Ozone, na.rm = TRUE),
    avg_temp = mean(Temp, na.rm = TRUE),
    avg_wind = mean(Wind, na.rm = TRUE)
  )
summary_table
## # A tibble: 5 × 5
##   Month count avg_ozone avg_temp avg_wind
##   <int> <int>     <dbl>    <dbl>    <dbl>
## 1     5    31      23.6     65.5    11.6 
## 2     6    30      29.4     79.1    10.3 
## 3     7    31      59.1     83.9     8.94
## 4     8    31      60.0     84.0     8.79
## 5     9    30      31.4     76.9    10.2

Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?

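August has the highest average ozone level at 59.96 ppb, only slightly above July (59.1 ppb).

Average temperature rises each month from May (65.5) to August (84.0) before dropping in September (76.9).

Average wind speed falls each month from May (11.6) to August (8.79) before rising in September (10.2).

The differences in ozone likely reflect seasonal changes: the hottest, calmest months (July and August) have the highest ozone, which is consistent with the positive ozone-temperature correlation and the negative ozone-wind correlation found in Task 5.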
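
Because count = n() counts every row in a month, including days with a missing Ozone value, the monthly ozone averages rest on fewer observations than the count column suggests. The sketch below adds a count of non-missing ozone readings next to the row count; the names ozone_coverage and n_ozone are illustrative.

```r
library(dplyr)
data("airquality")

# Per-month row count next to the number of non-missing Ozone readings
ozone_coverage <- airquality |>
  group_by(Month) |>
  summarise(
    count = n(),                   # all days in the month
    n_ozone = sum(!is.na(Ozone)),  # days with an ozone measurement
    avg_ozone = mean(Ozone, na.rm = TRUE)
  )
ozone_coverage
```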

Submission Requirements

Publish it to RPubs and submit your link on Blackboard.