Objective:

You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.

Dataset

Source: Built-in R dataset airquality.

Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.

Variables (selected for this assignment):

Notes

- The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.

- If you encounter errors, check that tidyverse and corrplot are installed and loaded.

- Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
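
The note on missing values can be checked directly. The following is a minimal base R sketch (the name na_counts is only illustrative) that counts the NA entries per column and shows why na.rm = TRUE is needed:

```r
# Load a fresh copy of the built-in dataset
data("airquality")

# Count missing values in every column
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Without na.rm = TRUE, a single NA makes the result NA
mean(airquality$Ozone)

# With na.rm = TRUE, missing values are dropped before averaging
mean(airquality$Ozone, na.rm = TRUE)
```

Columns with a count of zero need no special handling, so na.rm = TRUE is harmless but optional for them.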

Instructions:

Complete the following tasks using R to analyze the airquality dataset. Submit an RPubs link to a report that includes your code, outputs (tables and plots), and written interpretations for each task. Load the dataset with data(airquality), and install and load the tidyverse and corrplot packages.

# Load the packages and the dataset
library(tidyverse)
library(corrplot)
data("airquality")

Tasks and Questions

Task 1: Measures of Central Tendency and Spread

Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.

#Your code for Ozone goes here

mean(airquality$Ozone, na.rm = TRUE)
## [1] 42.12931
median(airquality$Ozone, na.rm = TRUE)
## [1] 31.5
sd(airquality$Ozone, na.rm = TRUE)
## [1] 32.98788
min(airquality$Ozone, na.rm = TRUE)
## [1] 1
max(airquality$Ozone, na.rm = TRUE)
## [1] 168
#Your code for Temp goes here

mean(airquality$Temp, na.rm = TRUE)
## [1] 77.88235
median(airquality$Temp, na.rm = TRUE)
## [1] 79
sd(airquality$Temp, na.rm = TRUE)
## [1] 9.46527
min(airquality$Temp, na.rm = TRUE)
## [1] 56
max(airquality$Temp, na.rm = TRUE)
## [1] 97
#Your code for Wind goes here

mean(airquality$Wind, na.rm = TRUE)
## [1] 9.957516
median(airquality$Wind, na.rm = TRUE)
## [1] 9.7
sd(airquality$Wind, na.rm = TRUE)
## [1] 3.523001
min(airquality$Wind, na.rm = TRUE)
## [1] 1.7
max(airquality$Wind, na.rm = TRUE)
## [1] 20.7

Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?

The mean and median are close for Temp (77.9 vs. 79) and Wind (9.96 vs. 9.7), which suggests roughly symmetric distributions. For Ozone, the mean (42.1) is well above the median (31.5), which suggests a right-skewed distribution pulled upward by a few high readings. The standard deviation measures how far values typically fall from the mean. Ozone's standard deviation (33.0) is large relative to its mean, so Ozone is far more variable than Temp (9.5) or Wind (3.5).
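
The fifteen separate calls in Task 1 can be condensed. This is a sketch only, using a hypothetical helper named describe (not part of the assignment) together with sapply. It also computes the mean minus median gap and the coefficient of variation (sd divided by mean), which is one rough way to compare spread across variables measured in different units:

```r
data("airquality")

# Hypothetical helper: the five Task 1 statistics for one numeric vector
describe <- function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE))
}

# Apply the helper to each column; the result is a 5 x 3 matrix
stats <- sapply(airquality[, c("Ozone", "Temp", "Wind")], describe)
print(round(stats, 2))

# A clearly positive gap is a rough sign of right skew
skew_gap <- stats["mean", ] - stats["median", ]
print(skew_gap)

# Unit-free measure of spread: standard deviation relative to the mean
cv <- stats["sd", ] / stats["mean", ]
print(round(cv, 2))
```

The gap and the coefficient of variation are informal diagnostics, not replacements for looking at the histogram.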

Task 2: Histogram

Generate the histogram for Ozone.

#Your code goes here

hist(airquality$Ozone,
     main = "Histogram of Ozone",
     xlab = "Ozone",
     col = "lightblue")

Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?

The ozone distribution is right-skewed, with most values at the lower end and a long tail of high values stretching to the right. It is unimodal, with a single main peak. The highest readings, up to the maximum of 168, sit apart from the bulk of the data and could be considered outliers.
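
Whether the high values count as outliers can be made concrete. The sketch below assumes the common 1.5 x IQR rule (similar to the default whisker rule in boxplot) as the definition of an outlier; other definitions are possible, and the names upper_fence and outliers are only illustrative:

```r
data("airquality")

# Drop missing readings before computing quartiles
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Third quartile and interquartile range
q3 <- quantile(ozone, 0.75)[["75%"]]
upper_fence <- q3 + 1.5 * IQR(ozone)

# Readings above the fence are flagged as potential outliers
outliers <- ozone[ozone > upper_fence]
print(upper_fence)
print(sort(outliers))
```

Flagged values are candidates for a closer look, not necessarily errors; very high ozone days can be real events.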

Task 3: Boxplot

Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from Week 4. Generate a boxplot of Ozone by month_name.

# Your code here

airquality <- airquality %>%
  mutate(month_name = case_when(
    Month == 5 ~ "May",
    Month == 6 ~ "June",
    Month == 7 ~ "July",
    Month == 8 ~ "August",
    Month == 9 ~ "September"
  ))

airquality$month_name <- factor(airquality$month_name,
                               levels = c("May", "June", "July", "August", "September"))

boxplot(Ozone ~ month_name,
        data = airquality,
        main = "Ozone Levels by Month",
        xlab = "Month",
        ylab = "Ozone",
        col = "lightgreen")

Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?

Ozone levels vary across the months: they are lowest in May, peak in July and August, and fall again in September. July has the highest median ozone level (about 60), ahead of August (about 52), even though August has the slightly higher mean. There are also outliers in some months, which may represent unusually high ozone days.
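
Medians are hard to read precisely off a boxplot, so it can help to compute them. This is a base R sketch; it builds the month labels with factor(levels, labels) as an alternative to the case_when recode that the task requires, and the name ozone_medians is only illustrative:

```r
data("airquality")

# Alternative recode: map month numbers 5..9 to ordered month names
month_labels <- c("May", "June", "July", "August", "September")
airquality$month_name <- factor(airquality$Month,
                                levels = 5:9,
                                labels = month_labels)

# Median ozone per month, ignoring missing readings
ozone_medians <- tapply(airquality$Ozone, airquality$month_name,
                        median, na.rm = TRUE)
print(ozone_medians)

# Name of the month with the highest median
print(names(which.max(ozone_medians)))
```

Comparing these medians with the monthly means from Task 6 shows whether a month's average is being pulled up by a few extreme days.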

Task 4: Scatterplot

Produce the scatterplot of Temp vs. Ozone, colored by Month.

# Your code goes here

ggplot(airquality, aes(x = Temp, y = Ozone, color = factor(Month))) +
  geom_point(na.rm = TRUE) +
  labs(title = "Temperature vs Ozone",
       x = "Temperature",
       y = "Ozone",
       color = "Month")

Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.

There is a visible positive relationship between temperature and ozone: higher temperatures tend to go with higher ozone levels. The months also cluster, with July and August points concentrated at high temperatures and high ozone and May points at low temperatures and low ozone. This suggests ozone tends to be higher during the hottest part of the year.
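
The visual trend can also be quantified. The sketch below assumes a straight-line fit is an adequate first summary, which may not hold if the relationship is curved; the names fit, slope, and r are only illustrative:

```r
data("airquality")

# Simple linear regression; lm() drops rows with missing Ozone by default
fit <- lm(Ozone ~ Temp, data = airquality)

# Estimated change in Ozone for a one-degree rise in Temp
slope <- coef(fit)[["Temp"]]
print(slope)

# Pearson correlation on the rows where both values are present
r <- cor(airquality$Temp, airquality$Ozone, use = "complete.obs")
print(r)
```

A positive slope and a positive r agree with the upward pattern in the scatterplot. Adding geom_smooth(method = "lm") to the ggplot call would draw the same fitted line on the plot.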

Task 5: Correlation Matrix

Compute and visualize the correlation matrix for Ozone, Temp, and Wind.

# Your code goes here

cor_matrix <- cor(airquality[, c("Ozone", "Temp", "Wind")],
                  use = "complete.obs")

corrplot(cor_matrix, method = "circle")

cor_matrix
##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000

Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.

The strongest correlation is between ozone and temperature (r ≈ 0.70), and it is positive. The correlation between ozone and wind is negative (r ≈ -0.60), which means higher wind speeds are associated with lower ozone levels. The weakest relationship is between temperature and wind (r ≈ -0.51). These values suggest ozone is more strongly related to temperature than to wind, although both relationships are moderately strong.
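
With only three variables the ranking can be read off the matrix, but the same comparison can be done in code. This sketch lists each pair once and sorts by absolute correlation; the names idx, pairs, and ranked are only illustrative:

```r
data("airquality")

cor_matrix <- cor(airquality[, c("Ozone", "Temp", "Wind")],
                  use = "complete.obs")

# Positions in the upper triangle, so each pair appears exactly once
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)

pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)

# Strongest relationship first, judged by absolute value
ranked <- pairs[order(-abs(pairs$r)), ]
print(ranked)
```

Sorting by absolute value matters because a strong negative correlation is just as strong as a positive one of the same size.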

Task 6: Summary Table

Generate the summary table grouped by Month. It should include the count of observations, mean ozone, mean temperature, and mean wind speed for each month.

# your code goes here

airquality %>%
  group_by(month_name) %>%
  summarise(
    count = n(),
    avg_ozone = mean(Ozone, na.rm = TRUE),
    avg_temp = mean(Temp, na.rm = TRUE),
    avg_wind = mean(Wind, na.rm = TRUE)
  )
## # A tibble: 5 × 5
##   month_name count avg_ozone avg_temp avg_wind
##   <fct>      <int>     <dbl>    <dbl>    <dbl>
## 1 May           31      23.6     65.5    11.6 
## 2 June          30      29.4     79.1    10.3 
## 3 July          31      59.1     83.9     8.94
## 4 August        31      60.0     84.0     8.79
## 5 September     30      31.4     76.9    10.2

Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?

August has the highest average ozone level (60.0), just ahead of July (59.1). Average temperature rises from 65.5 in May to about 84 in July and August, then falls to 76.9 in September. Average wind speed moves the opposite way, from 11.6 in May down to about 8.8 in August. Warmer weather and lower wind may help explain why ozone is higher in midsummer.
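
One caveat about the table: count = n() counts every day in the month, including days where Ozone is missing, so the ozone averages rest on fewer observations than the count suggests. The base R sketch below (the names n_days and n_ozone are only illustrative) compares the two counts; inside summarise, the equivalent would be sum(!is.na(Ozone)):

```r
data("airquality")

# Number of days recorded in each month (rows, missing or not)
n_days <- tapply(airquality$Ozone, airquality$Month, length)

# Number of days with a non-missing Ozone reading in each month
n_ozone <- tapply(!is.na(airquality$Ozone), airquality$Month, sum)

# Side-by-side comparison, one row per month number
counts <- cbind(n_days, n_ozone)
print(counts)
```

A month with many missing readings deserves extra caution, since its average is based on a small sample.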

Submission Requirements

Publish your report to RPubs and submit the link on Blackboard.