Objective: You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.

Dataset
Source: Built-in R dataset airquality.
Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.
Variables (selected for this assignment):
- Ozone: Mean ozone concentration in parts per billion (ppb, numeric).
- Temp: Maximum daily temperature in Fahrenheit (numeric).
- Wind: Average wind speed in miles per hour (numeric).
- Month: Month of the year (5 = May, 6 = June, …, 9 = September, categorical).

Notes
- The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.
- If you encounter errors, check that tidyverse and corrplot are installed and loaded.
- Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
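Before computing any statistics, it can help to confirm where the missing values actually are. The following is a minimal base R sketch; the name na_counts is only illustrative and is not used elsewhere in this report.

```r
# airquality ships with R's built-in datasets package, so it is available directly

# Count missing values in each column
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Number of rows with no missing value in any column
sum(complete.cases(airquality))
```

The count for Ozone should match the "Removed 37 rows" warnings that ggplot2 prints in the plotting tasks.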

Instructions: Complete the following tasks using R to analyze the airquality dataset. Submit your RPubs link that includes code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.

#Load your dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data("airquality")

Tasks and Questions

Task 1: Measures of Central Tendency and Spread

Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.

ozone_mean <- mean(airquality$Ozone, na.rm = TRUE)
ozone_median <- median(airquality$Ozone, na.rm = TRUE)
ozone_sd <- sd(airquality$Ozone, na.rm = TRUE)
ozone_min <- min(airquality$Ozone, na.rm = TRUE)
ozone_max <- max(airquality$Ozone, na.rm = TRUE)

cat("Ozone Statistics:\n")
## Ozone Statistics:
cat("Mean:", ozone_mean, "\n")
## Mean: 42.12931
cat("Median:", ozone_median, "\n")
## Median: 31.5
cat("Standard Deviation:", ozone_sd, "\n")
## Standard Deviation: 32.98788
cat("Min:", ozone_min, "\n")
## Min: 1
cat("Max:", ozone_max, "\n\n")
## Max: 168
temp_mean <- mean(airquality$Temp, na.rm = TRUE)
temp_median <- median(airquality$Temp, na.rm = TRUE)
temp_sd <- sd(airquality$Temp, na.rm = TRUE)
temp_min <- min(airquality$Temp, na.rm = TRUE)
temp_max <- max(airquality$Temp, na.rm = TRUE)

cat("Temperature Statistics:\n")
## Temperature Statistics:
cat("Mean:", temp_mean, "\n")
## Mean: 77.88235
cat("Median:", temp_median, "\n")
## Median: 79
cat("Standard Deviation:", temp_sd, "\n")
## Standard Deviation: 9.46527
cat("Min:", temp_min, "\n")
## Min: 56
cat("Max:", temp_max, "\n\n")
## Max: 97
wind_mean <- mean(airquality$Wind, na.rm = TRUE)
wind_median <- median(airquality$Wind, na.rm = TRUE)
wind_sd <- sd(airquality$Wind, na.rm = TRUE)
wind_min <- min(airquality$Wind, na.rm = TRUE)
wind_max <- max(airquality$Wind, na.rm = TRUE)

cat("Wind Statistics:\n")
## Wind Statistics:
cat("Mean:", wind_mean, "\n")
## Mean: 9.957516
cat("Median:", wind_median, "\n")
## Median: 9.7
cat("Standard Deviation:", wind_sd, "\n")
## Standard Deviation: 3.523001
cat("Min:", wind_min, "\n")
## Min: 1.7
cat("Max:", wind_max, "\n\n")
## Max: 20.7

Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?

The mean ozone is about 42.1 ppb while the median is 31.5 ppb. The mean sits well above the median, which indicates a right-skewed distribution: a few high-ozone days pull the mean upward. Temperature has a mean of 77.9 °F and a median of 79 °F. These are close, so the distribution is roughly symmetric, with the slightly lower mean hinting at a mild left skew. Wind has a mean of 9.96 mph and a median of 9.7 mph, which are also close and indicate a roughly symmetric distribution with minimal skew.

The standard deviation measures how far values typically fall from the mean. The ozone standard deviation (about 33.0 ppb) is large relative to its mean, indicating high variability and large day-to-day fluctuations. Temperature (about 9.5 °F) and wind (about 3.5 mph) have standard deviations that are small to moderate relative to their means, indicating more consistent values.
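The comparison of mean, median, and spread can also be collected into one table. This is a sketch in base R rather than part of the assigned solution; mean_minus_median and cv (standard deviation divided by the mean) are illustrative helper statistics introduced here to make the skew and relative-variability arguments easier to check.

```r
vars <- c("Ozone", "Temp", "Wind")

# One column per variable, one row per statistic
stats <- sapply(airquality[vars], function(x) {
  m <- mean(x, na.rm = TRUE)
  med <- median(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  c(mean = m,
    median = med,
    sd = s,
    mean_minus_median = m - med,  # positive values suggest right skew
    cv = s / m)                   # spread relative to the mean
})

round(stats, 2)
```

A clearly positive mean_minus_median together with a large cv for Ozone would support the right-skew and high-variability reading above.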

Task 2: Histogram

Generate the histogram for Ozone.

ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 10, fill = "steelblue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Ozone Levels",
       subtitle = "New York, May-September 1973",
       x = "Ozone (ppb)",
       y = "Frequency") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5))
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).

Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?

The distribution is right-skewed and unimodal: there is one clear peak at low ozone values and a long tail toward high values. There is an isolated high value at the far right, near the maximum of 168 ppb reported in Task 1, which appears to be an outlier.
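To back up the visual description with numbers, the bin counts can be tabulated without drawing a plot. This is a base R sketch that assumes 10 ppb bins starting at 0; the bin edges may not line up exactly with the ones ggplot2 chooses, and the names bins and bin_table are illustrative.

```r
# Keep only days with an ozone reading
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Tabulate 10 ppb-wide bins without drawing the histogram
bins <- hist(ozone, breaks = seq(0, 170, by = 10), plot = FALSE)

bin_table <- data.frame(
  lower = head(bins$breaks, -1),
  upper = tail(bins$breaks, -1),
  count = bins$counts
)
print(bin_table)
```

Empty bins between the main body of the data and the last occupied bin are one simple sign of an isolated extreme value.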

Task 3: Boxplot

Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.

airquality <- airquality %>%
  mutate(month_name = case_when(
    Month == 5 ~ "May",
    Month == 6 ~ "June",
    Month == 7 ~ "July",
    Month == 8 ~ "August",
    Month == 9 ~ "September",
    TRUE ~ as.character(Month)
  ))

ggplot(airquality, aes(x = month_name, y = Ozone, fill = month_name)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Ozone Levels by Month",
       subtitle = "New York, May-September 1973",
       x = "Month",
       y = "Ozone (ppb)") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5),
        legend.position = "none")
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
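Because month_name is a character column, ggplot2 orders the boxes alphabetically (August, July, June, May, September) rather than by calendar. One possible adjustment, sketched here under the assumption that calendar order is preferred, is to convert the recoded column to a factor with explicit levels. The names aq_ordered and month_levels are illustrative.

```r
library(dplyr)

month_levels <- c("May", "June", "July", "August", "September")

aq_ordered <- airquality %>%
  mutate(
    month_name = case_when(
      Month == 5 ~ "May",
      Month == 6 ~ "June",
      Month == 7 ~ "July",
      Month == 8 ~ "August",
      Month == 9 ~ "September"
    ),
    # Fix the display order to calendar order instead of alphabetical
    month_name = factor(month_name, levels = month_levels)
  )

levels(aq_ordered$month_name)
```

Passing aq_ordered to the same ggplot() call would then draw the boxes from May through September.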

Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?

Ozone levels show clear seasonal variation, with noticeably higher values in July and August than in May, June, and September. July has the highest median ozone and May has the lowest. Several months show outliers: August has one near 168 ppb (the overall maximum from Task 1), May has one at about 115 ppb, June has one at about 70 ppb, and September has four between roughly 70 and 100 ppb. These outliers may indicate days with exceptional weather conditions, such as extremely high temperatures, higher solar radiation, or unexpectedly stagnant air masses.
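Outlier values read off a boxplot are approximate, so it can be useful to list them directly. The sketch below assumes the 1.5 × IQR rule that geom_boxplot() applies by default, with quartiles from quantile(); the object name outliers and the fence columns are illustrative.

```r
library(dplyr)

outliers <- airquality %>%
  filter(!is.na(Ozone)) %>%
  group_by(Month) %>%
  mutate(
    q1 = quantile(Ozone, 0.25),
    q3 = quantile(Ozone, 0.75),
    lower_fence = q1 - 1.5 * (q3 - q1),
    upper_fence = q3 + 1.5 * (q3 - q1)
  ) %>%
  # Keep only the days that fall outside the whisker range for their month
  filter(Ozone < lower_fence | Ozone > upper_fence) %>%
  ungroup() %>%
  select(Month, Day, Ozone, lower_fence, upper_fence) %>%
  arrange(Month, Ozone)

print(outliers)
```

Each row is one day whose ozone reading lies beyond the fences for its month, which gives exact values to quote instead of estimates from the plot.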

Task 4: Scatterplot

Produce the scatterplot of Temp vs. Ozone, colored by Month.

ggplot(airquality, aes(x = Temp, y = Ozone, color = month_name)) +
  geom_point(size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed") +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Relationship between Temperature and Ozone Levels",
       subtitle = "Colored by Month",
       x = "Temperature (°F)",
       y = "Ozone (ppb)",
       color = "Month") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.

Yes, there is a fairly strong positive relationship between temperature and ozone levels: ozone tends to rise as temperature rises. May and September are generally clustered at the lower temperatures, while July and August are clustered at the higher temperatures. June, on the other hand, is not clustered in one specific place and is spread fairly evenly across the graph.
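The strength of the pattern in the scatterplot can be quantified with a few base R calls. This is a sketch, not part of the assigned task: it fits one overall linear model of Ozone on Temp and counts how many days in each month have an ozone reading. The names r_temp_ozone, fit, slope, and ozone_days are illustrative.

```r
# Pearson correlation, using only days where both values are present
r_temp_ozone <- cor(airquality$Temp, airquality$Ozone, use = "complete.obs")

# Slope of an overall linear fit, in ppb of ozone per degree Fahrenheit
# (lm() drops rows with missing Ozone by default)
fit <- lm(Ozone ~ Temp, data = airquality)
slope <- coef(fit)[["Temp"]]

# Number of days with an ozone reading in each month
ozone_days <- table(airquality$Month[!is.na(airquality$Ozone)])

round(r_temp_ozone, 2)
round(slope, 2)
ozone_days
```

A month with few recorded ozone days contributes few points to the plot, which is worth keeping in mind when judging whether that month forms a cluster.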

Task 5: Correlation Matrix

Compute and visualize the correlation matrix for Ozone, Temp, and Wind.

cor_matrix <- cor(
  airquality %>%
    select(Ozone, Temp, Wind), 
  use = "complete.obs"
)
print("Correlation Matrix:")
## [1] "Correlation Matrix:"
corrplot(cor_matrix, 
         method = "color",
         type = "lower",
         order = "hclust",
         addCoef.col = "black",
         tl.col = "black",
         tl.srt = 45,
         number.cex = 0.8,
         col = colorRampPalette(c("#6D9EC1", "white", "#E46726"))(200),
         title = "Correlation Matrix: Ozone, Temperature, and Wind",
         mar = c(0, 0, 2, 0))

Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.

The strongest correlation is between temperature and ozone (r = 0.7), and the weakest in magnitude is between temperature and wind (r = -0.5). Ozone is therefore more strongly correlated with temperature than with wind speed. The positive temperature-ozone correlation suggests that heat drives ozone production. The negative wind-ozone correlation is consistent with wind dispersing ozone and its precursors. Lastly, the temperature-wind correlation is negative and moderate: hotter days tend to be somewhat calmer, but this is the loosest coupling of the three pairs.
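The code above prints only a label, so the coefficients are visible only inside the plot. A small sketch for printing them and ranking the pairs by absolute value follows; it recomputes cor_matrix the same way as above, and the names idx and cor_pairs are illustrative.

```r
library(dplyr)

cor_matrix <- cor(
  airquality %>% select(Ozone, Temp, Wind),
  use = "complete.obs"
)

# Show the coefficients themselves, rounded for readability
print(round(cor_matrix, 2))

# List each pair once (lower triangle), strongest to weakest by |r|
idx <- which(lower.tri(cor_matrix), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r = cor_matrix[idx]
)
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs)
```

Sorting by the absolute value of r matters here because two of the three correlations are negative, and "strongest" refers to magnitude rather than sign.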

Task 6: Summary Table

Generate the summary table grouped by Month. It should include the count, mean ozone, mean temperature, and mean wind speed per month.

summary_table <- airquality %>%
  group_by(month_name) %>%
  summarise(
    Count = n(),
    Avg_Ozone = round(mean(Ozone, na.rm = TRUE), 2),
    Avg_Temp = round(mean(Temp, na.rm = TRUE), 2),
    Avg_Wind = round(mean(Wind, na.rm = TRUE), 2)
  )

print("Summary Statistics by Month:")
## [1] "Summary Statistics by Month:"
print(summary_table)
## # A tibble: 5 × 5
##   month_name Count Avg_Ozone Avg_Temp Avg_Wind
##   <chr>      <int>     <dbl>    <dbl>    <dbl>
## 1 August        31      60.0     84.0     8.79
## 2 July          31      59.1     83.9     8.94
## 3 June          30      29.4     79.1    10.3 
## 4 May           31      23.6     65.6    11.6 
## 5 September     30      31.4     76.9    10.2

Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?

August has the highest average ozone level at about 60.0 ppb, just ahead of July at about 59.1 ppb. From highest to lowest, average ozone ranks August, July, September, June, May. Average temperature ranks August, July, June, September, May: it rises from May (65.6 °F) to a peak in July and August (about 84 °F) and then falls in September. Average wind speed ranks May, June, September, July, August, so it moves roughly opposite to temperature and ozone: the windiest month (May, 11.6 mph) has the lowest average ozone, and the calmest months (July and August) have the highest. Environmental factors that might explain these differences include day length and the angle of the sun, precipitation patterns, and atmospheric circulation patterns.
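Rankings are easy to get wrong when read from a table sorted alphabetically by month name, so they can be computed instead. This sketch groups by the numeric Month column and sorts by each average in turn; the names monthly, ozone_rank, temp_rank, and wind_rank are illustrative.

```r
library(dplyr)

monthly <- airquality %>%
  group_by(Month) %>%
  summarise(
    Count = n(),
    Avg_Ozone = round(mean(Ozone, na.rm = TRUE), 2),
    Avg_Temp = round(mean(Temp, na.rm = TRUE), 2),
    Avg_Wind = round(mean(Wind, na.rm = TRUE), 2)
  )

# Month numbers ordered from highest to lowest on each measure
ozone_rank <- monthly %>% arrange(desc(Avg_Ozone)) %>% pull(Month)
temp_rank <- monthly %>% arrange(desc(Avg_Temp)) %>% pull(Month)
wind_rank <- monthly %>% arrange(desc(Avg_Wind)) %>% pull(Month)

ozone_rank
temp_rank
wind_rank
```

Note that Count is the number of days in the month, including days where Ozone is missing, so the ozone averages rest on fewer observations than Count suggests.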

Submission Requirements

Publish your work to RPubs and submit your link on Blackboard.