HW6

Objective:

You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.

Dataset

Source: Built-in R dataset airquality.

Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.

Variables (selected for this assignment):

Ozone: Mean ozone concentration in parts per billion (ppb, numeric).
Temp: Maximum daily temperature in Fahrenheit (numeric).
Wind: Average wind speed in miles per hour (numeric).
Month: Month of the year (5 = May, 6 = June, …, 9 = September, categorical).

Notes

-The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.

-If you encounter errors, check that tidyverse and corrplot are installed and loaded.

-Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
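
The note on missing values can be illustrated with a short sketch. It counts the NA entries in each column and shows why na.rm = TRUE is needed. The object names (na_counts, mean_with_na, mean_without_na) are illustrative and not part of the assignment.

```r
# airquality ships with base R (datasets package), so no extra packages are needed

# Count missing values in each column
na_counts <- colSums(is.na(airquality))
na_counts

# Without na.rm = TRUE, a single NA makes the whole result NA
mean_with_na <- mean(airquality$Ozone)
mean_with_na

# With na.rm = TRUE, missing values are dropped before averaging
mean_without_na <- mean(airquality$Ozone, na.rm = TRUE)
mean_without_na
```

Columns with a count of zero, such as Temp and Wind, give the same result with or without na.rm = TRUE.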

Instructions:

Complete the following tasks using R to analyze the airquality dataset. Submit your RPubs link that includes code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.

#Load your dataset

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(corrplot)

## corrplot 0.95 loaded

data("airquality")

Tasks and Questions

Task 1: Measures of Central Tendency and Spread

Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.

#Your code for Ozone goes here
mean(airquality$Ozone, na.rm = TRUE)

## [1] 42.12931

median(airquality$Ozone, na.rm = TRUE)

## [1] 31.5

sd(airquality$Ozone, na.rm = TRUE)

## [1] 32.98788

min(airquality$Ozone, na.rm = TRUE)

## [1] 1

max(airquality$Ozone, na.rm = TRUE)

## [1] 168

#Your code for Temp goes here
mean(airquality$Temp, na.rm = TRUE)

## [1] 77.88235

median(airquality$Temp, na.rm = TRUE)

## [1] 79

sd(airquality$Temp, na.rm = TRUE)

## [1] 9.46527

min(airquality$Temp, na.rm = TRUE)

## [1] 56

max(airquality$Temp, na.rm = TRUE)

## [1] 97

#Your code for Wind goes here
mean(airquality$Wind, na.rm = TRUE)

## [1] 9.957516

median(airquality$Wind, na.rm = TRUE)

## [1] 9.7

sd(airquality$Wind, na.rm = TRUE)

## [1] 3.523001

min(airquality$Wind, na.rm = TRUE)

## [1] 1.7

max(airquality$Wind, na.rm = TRUE)

## [1] 20.7

Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?

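The mean and median are close for Temp (77.88 vs. 79) and Wind (9.96 vs. 9.7), which suggests roughly symmetric distributions; Temp's mean sits slightly below its median, hinting at a mild left skew. For Ozone, the mean (42.13) is well above the median (31.5), which suggests a right-skewed distribution pulled upward by a few very high readings. The standard deviation tells how spread out the values are around the mean, so a higher SD means more variation. Ozone is by far the most variable relative to its mean (SD 32.99), Temp varies little relative to its mean (SD 9.47), and Wind falls in between (SD 3.52).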
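
As a compact cross-check of the numbers above, the same five statistics can be gathered into one table. This is only a sketch: the helper name describe and the object stats are made up for illustration, and the assignment itself asks for the functions to be called separately.

```r
# Hypothetical helper: the five statistics for one numeric vector
describe <- function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE))
}

# One column per variable, one row per statistic
stats <- sapply(airquality[c("Ozone", "Temp", "Wind")], describe)
round(stats, 2)

# A positive gap (mean above median) hints at right skew
stats["mean", ] - stats["median", ]
```

The last line gives a quick sign check for skewness: a clearly positive value points to a right tail, a value near zero to a roughly symmetric distribution.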

Task 2: Histogram

Generate the histogram for Ozone.

#Your code goes here
hist(airquality$Ozone,
     main = "Histogram of Ozone Levels",
     xlab = "Ozone (ppb)",
     col = "lightblue",
     border = "black")

Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?

The ozone data looks skewed to the right (not normal). There are a few higher values that might be outliers.
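
One way to back up the visual impression of outliers is the 1.5 × IQR rule that boxplots use. The sketch below applies it with boxplot.stats() and also lists the bin counts behind the histogram. The object names are illustrative, and the choice of the 1.5 × IQR rule is an assumption; other outlier definitions exist.

```r
# boxplot.stats() flags points more than 1.5 * IQR beyond the box; NAs are omitted
ozone_stats <- boxplot.stats(airquality$Ozone)
ozone_outliers <- ozone_stats$out
ozone_outliers

# Bin counts behind the histogram, computed without drawing it
ozone_hist <- hist(airquality$Ozone, plot = FALSE)
data.frame(bin_start = head(ozone_hist$breaks, -1),
           count     = ozone_hist$counts)
```

If the counts fall off steadily toward the higher bins, that matches a right-skewed, unimodal shape.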

Task 3: Boxplot

Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from Week 4. Generate a boxplot of Ozone by month_name.

# Your code here
airquality <- airquality %>%
  mutate(month_name = case_when(
    Month == 5 ~ "May",
    Month == 6 ~ "June",
    Month == 7 ~ "July",
    Month == 8 ~ "August",
    Month == 9 ~ "September"
  ))
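# month_name is a character column, so boxplot() would sort it alphabetically;
# a factor with explicit levels keeps the boxes in calendar order
boxplot(Ozone ~ factor(month_name,
                       levels = c("May", "June", "July", "August", "September")),
        data = airquality,
        main = "Ozone Levels by Month",
        xlab = "Month",
        ylab = "Ozone (ppb)",
        col = "lightgreen")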

Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?

Answer: Ozone levels are higher in July and August than in May, June, and September. These two months have the highest median ozone, and some months show outliers, which could be days with very high pollution.
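
To name the month with the highest median rather than read it off the plot, the medians can be computed directly. This is a minimal sketch in base R. The names month_factor and ozone_medians are illustrative, and it labels the months from the numeric Month column so it does not depend on month_name.

```r
# Label months 5-9 with their names, in calendar order
month_factor <- factor(airquality$Month, levels = 5:9, labels = month.name[5:9])

# Median ozone per month, ignoring missing values
ozone_medians <- tapply(airquality$Ozone, month_factor, median, na.rm = TRUE)
ozone_medians

# Name of the month with the highest median
names(which.max(ozone_medians))
```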

Task 4: Scatterplot

Produce the scatterplot of Temp vs. Ozone, colored by Month.

# Your code goes here
ggplot(airquality, aes(x = Temp, y = Ozone, color = as.factor(Month))) +
  geom_point() +
  labs(title = "Temperature vs Ozone Levels",
       x = "Temperature (\u00B0F)",
       y = "Ozone (ppb)",
       color = "Month")

## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.

Answer: There seems to be a positive relationship: higher temperatures usually come with higher ozone. The warmer months (July and August) tend to have more ozone.
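
A rough numeric check of the pattern in the scatterplot is to compare mean ozone on cooler and warmer days. Splitting at the median temperature is an arbitrary choice made for this sketch, not part of the assignment, and the object names are illustrative.

```r
# Split days at the median temperature (a rough cut, not a formal test)
warm_day <- airquality$Temp > median(airquality$Temp)

# Mean ozone on cooler (FALSE) versus warmer (TRUE) days
ozone_by_warmth <- tapply(airquality$Ozone, warm_day, mean, na.rm = TRUE)
ozone_by_warmth
```

A clearly higher mean in the TRUE group agrees with the upward trend visible in the plot.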

Task 5: Correlation Matrix

Compute and visualize the correlation matrix for Ozone, Temp, and Wind.

# Your code goes here
corr_data <- airquality %>%
  select(Ozone, Temp, Wind) %>%
  drop_na()

corr_matrix <- cor(corr_data, use = "complete.obs")
corr_matrix

##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000

corrplot(corr_matrix, method = "number", type = "upper")

Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.

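Answer: Ozone and Temp have the strongest correlation (r ≈ 0.70, positive): when it is hotter, ozone levels tend to be higher. Ozone and Wind have a moderate negative correlation (r ≈ -0.60), meaning windier days tend to have lower ozone. The weakest correlation is between Temp and Wind (r ≈ -0.51). Ozone is therefore more strongly correlated with temperature than with wind speed.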
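
The strongest and weakest pairs can also be picked out programmatically by ranking the off-diagonal entries by absolute value. The sketch below uses base R only, and corr_pairs and idx are illustrative names.

```r
corr_matrix <- cor(airquality[c("Ozone", "Temp", "Wind")], use = "complete.obs")

# Keep each pair once by taking the upper triangle
idx <- which(upper.tri(corr_matrix), arr.ind = TRUE)
corr_pairs <- data.frame(
  var1 = rownames(corr_matrix)[idx[, "row"]],
  var2 = colnames(corr_matrix)[idx[, "col"]],
  r    = corr_matrix[upper.tri(corr_matrix)]
)

# Sort by absolute strength, strongest first
corr_pairs <- corr_pairs[order(-abs(corr_pairs$r)), ]
corr_pairs
```

Ranking by the absolute value matters here because a correlation of -0.60 is stronger than one of -0.51 even though it is the smaller number.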

Task 6: Summary Table

Generate the summary table grouped by Month. It should include the count, mean ozone, mean temperature, and mean wind speed per month.

# your code goes here
summary_table <- airquality %>%
  group_by(month_name) %>%
  summarise(
    count = n(),
    avg_ozone = mean(Ozone, na.rm = TRUE),
    avg_temp = mean(Temp, na.rm = TRUE),
    avg_wind = mean(Wind, na.rm = TRUE)
  )
summary_table

## # A tibble: 5 × 5
##   month_name count avg_ozone avg_temp avg_wind
##   <chr>      <int>     <dbl>    <dbl>    <dbl>
## 1 August        31      60.0     84.0     8.79
## 2 July          31      59.1     83.9     8.94
## 3 June          30      29.4     79.1    10.3 
## 4 May           31      23.6     65.5    11.6 
## 5 September     30      31.4     76.9    10.2

Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?

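August has the highest average ozone (60.0 ppb), just ahead of July (59.1 ppb). Average temperature rises from 65.5 °F in May to about 84 °F in July and August, then falls to 76.9 °F in September. Average wind speed moves the opposite way: it is highest in May (11.6 mph) and lowest in July and August (about 8.8 to 8.9 mph). Hotter, sunnier, calmer summer days might explain the higher ozone, since sunlight and heat drive the reactions that form ozone and lighter winds disperse it less.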
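
Two details of the table are easy to miss: the rows are sorted alphabetically, and count includes days where Ozone is missing. A possible variant, sketched below, orders the months by calendar and adds the number of days with an ozone reading. The names summary_ordered and n_ozone are illustrative, and the variant is an optional refinement rather than the required answer.

```r
library(dplyr)

summary_ordered <- airquality %>%
  # A factor with explicit levels keeps the rows in calendar order
  mutate(month_name = factor(Month, levels = 5:9, labels = month.name[5:9])) %>%
  group_by(month_name) %>%
  summarise(
    count     = n(),                 # all days in the month
    n_ozone   = sum(!is.na(Ozone)),  # days with an ozone reading
    avg_ozone = mean(Ozone, na.rm = TRUE),
    avg_temp  = mean(Temp, na.rm = TRUE),
    avg_wind  = mean(Wind, na.rm = TRUE)
  )
summary_ordered

# Month with the highest average ozone
summary_ordered$month_name[which.max(summary_ordered$avg_ozone)]
```

Comparing count with n_ozone shows how many days each monthly ozone average actually rests on.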

Submission Requirements

Publish your report to RPubs and submit the link on Blackboard.