Objective:

You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.

Dataset

Source: Built-in R dataset airquality.

Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.

Variables (selected for this assignment):

Notes

-The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.

-If you encounter errors, check that tidyverse and corrplot are installed and loaded.

-Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
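
As a rough illustration of the first note, the sketch below shows what the two missing-value options do on this dataset. The object names (na_counts, mean_with_na, and so on) are illustrative choices and are not part of the assignment.

```r
# Built-in dataset; no extra packages are needed for this sketch
data("airquality")

# Count the missing values in each column
na_counts <- colSums(is.na(airquality))
na_counts

# Without na.rm = TRUE, a single NA makes the whole result NA
mean_with_na <- mean(airquality$Ozone)
mean_without_na <- mean(airquality$Ozone, na.rm = TRUE)

# use = "complete.obs" keeps only the rows where both variables are present
r_ozone_temp <- cor(airquality$Ozone, airquality$Temp, use = "complete.obs")

mean_with_na
mean_without_na
r_ozone_temp
```

Because Temp and Wind have no missing values, the rows dropped by "complete.obs" here are just the rows where Ozone is missing.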

Instructions:

Complete the following tasks using R to analyze the airquality dataset. Submit an RPubs link that includes code, outputs (tables and plots), and written interpretations for each task. Make sure you load the dataset using data(airquality) and install and load the tidyverse and corrplot packages.

#Load your dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data("airquality")

Tasks and Questions

Task 1: Measures of Central Tendency and Spread

Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.

#Your code for Ozone goes here

summary(airquality$Ozone)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   18.00   31.50   42.13   63.25  168.00      37
sd(airquality$Ozone, na.rm = TRUE)
## [1] 32.98788
#Your code for Temp goes here

summary(airquality$Temp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   56.00   72.00   79.00   77.88   85.00   97.00
sd(airquality$Temp, na.rm = TRUE)
## [1] 9.46527
#Your code for Wind goes here

summary(airquality$Wind)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.700   7.400   9.700   9.958  11.500  20.700
sd(airquality$Wind, na.rm = TRUE)
## [1] 3.523001

Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?

Explain: For Ozone, the mean is 42.13 and the median is 31.50. The mean is noticeably higher than the median, which suggests the distribution is right-skewed: a few very high ozone days pull the mean up. The standard deviation is 32.99 (about 33), which is large relative to the mean, so ozone varies a lot from day to day. For Temp, the mean is 77.88 and the median is 79. They are close, which suggests a roughly symmetric distribution (the mean is slightly below the median, so any skew is slightly to the left). The standard deviation is about 9.5, which indicates low to moderate variability. For Wind, the mean is 9.958 and the median is 9.700. They are also close, which again suggests a roughly symmetric distribution. The standard deviation is about 3.5, which indicates fairly low variability.
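
One way to make the mean-versus-median comparison concrete is to compute the gap between the two, along with the standard deviation relative to the mean (the coefficient of variation). This is a minimal sketch; the names vars, stats_table, mean_median_gap, and cv are illustrative and not required by the task.

```r
data("airquality")

vars <- c("Ozone", "Temp", "Wind")

# One column per variable, one row per statistic
stats_table <- sapply(airquality[vars], function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE))
})

# Mean minus median: a clearly positive gap hints at a right tail
mean_median_gap <- stats_table["mean", ] - stats_table["median", ]

# Standard deviation as a fraction of the mean
cv <- stats_table["sd", ] / stats_table["mean", ]

round(mean_median_gap, 2)
round(cv, 2)
```

A gap near zero is consistent with a roughly symmetric distribution, although it does not prove symmetry on its own, so the histogram is still worth checking.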

Task 2: Histogram

Generate the histogram for Ozone.

#Your code goes here

hist(airquality$Ozone,
     main = "Histogram for Ozone", xlab = "Ozone (ppb)", ylab = "Frequency")

Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?

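Explain: The ozone distribution is unimodal and right-skewed: most days fall at the low end, and a long tail extends toward high values. There is an apparent outlier in the far right tail, near the maximum of 168 ppb. The quartiles tell us that about 50% of the observations fall between the 1st quartile (18 ppb) and the 3rd quartile (63.25 ppb), with a median of 31.5 ppb.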
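
To check for outliers with numbers instead of by eye, one common convention (the same one boxplots use by default) flags values more than 1.5 times the interquartile range above the 3rd quartile. The sketch below applies that rule to Ozone; the names ozone, q, iqr, upper_fence, and high_values are illustrative.

```r
data("airquality")

# Drop missing values first so the quantiles are well defined
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# 1st and 3rd quartiles, then the interquartile range
q <- unname(quantile(ozone, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Upper fence of the 1.5 * IQR rule
upper_fence <- q[2] + 1.5 * iqr

# Observations above the fence are candidate outliers
high_values <- sort(ozone[ozone > upper_fence])

upper_fence
high_values
```

The 1.5 multiplier is a rule of thumb, so values it flags are worth a closer look but are not necessarily errors.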

Task 3: Boxplot

Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name, with month names, using case_when() from Week 4. Then generate a boxplot of Ozone by month_name.

# Your code here
 
airquality <- airquality |>
  mutate(month_name = case_when(
    Month == 5 ~ "May",
    Month == 6 ~ "June",
    Month == 7 ~ "July",
    Month == 8 ~ "August",
    Month == 9 ~ "September"
  ))

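# Plot the months in calendar order; a character column would be sorted alphabetically
month_order <- c("May", "June", "July", "August", "September")
boxplot(Ozone ~ factor(month_name, levels = month_order), data = airquality,
        main = "Ozone Levels by Month",
        xlab = "Month",
        ylab = "Ozone (ppb)")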

Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?

Explain: Ozone levels are highest in July and August. July has the highest median, at about 60 ppb, with August next in the low 50s; both are well above the other months. May, June, and September have lower ozone levels. There are outliers: in May, for example, one day reaches about 115 ppb, which is very unusual given that the rest of May's readings top out around 45 ppb. This may indicate unusual conditions on that day, such as low wind that allowed ozone to build up, among other possibilities.
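
Medians are hard to read precisely off a boxplot, so it can help to compute them directly. The following sketch uses base R's tapply() on the original numeric Month column; month_labels, monthly_median, and top_month are illustrative names, and the labels assume the months appear as 5 through 9, as they do in this dataset.

```r
data("airquality")

# Labels in calendar order, matching Month values 5 through 9
month_labels <- c("May", "June", "July", "August", "September")

# Median ozone per month, ignoring missing values
monthly_median <- tapply(airquality$Ozone, airquality$Month,
                         median, na.rm = TRUE)
names(monthly_median) <- month_labels

# Month with the highest median
top_month <- names(which.max(monthly_median))

monthly_median
top_month
```

Comparing these medians with the monthly means is also a quick check on how much the high-ozone days pull each month's average upward.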

Task 4: Scatterplot

Produce the scatterplot of Temp vs. Ozone, colored by Month.

# Your code goes here

ggplot(airquality, aes(x = Temp, y = Ozone, color = month_name)) +
  geom_point() +
  labs(title = "Temp vs Ozone", x = "Temperature", y = "Ozone (ppb)", color = "Month")
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.

Explain: There is a visible positive relationship between temperature and ozone: as temperature rises, ozone tends to rise as well. The highest ozone levels occur mostly in July and August, along with some days in September. These points cluster where temperatures reach about 80 to 90 degrees.

Task 5: Correlation Matrix

Compute and visualize the correlation matrix for Ozone, Temp, and Wind.

# Your code goes here

library(corrplot)

correlation_matrix <- cor(airquality[, c("Ozone","Temp","Wind")], use = "complete.obs")
correlation_matrix
##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000
corrplot(correlation_matrix, method = "number", type = "upper")

Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.

Explain: The strongest correlation is between Ozone and Temp, at 0.70. This positive value suggests that as temperature rises, ozone tends to increase as well. The next strongest is between Ozone and Wind, at -0.60, which suggests that as wind speed goes up, ozone tends to decrease. So ozone is more strongly correlated with temperature than with wind speed. The weakest correlation is between Temp and Wind, at -0.51, which suggests that when temperature rises, wind speed tends to go down. All three are moderate to fairly strong, and they describe association, not proof that one variable causes another.
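
Since strength depends on the absolute value of a correlation and not on its sign, ranking the pairs by |r| is one way to double-check the ordering above. This is a sketch in base R; cm, pairs_idx, pair_table, and ranked are illustrative names.

```r
data("airquality")

cm <- cor(airquality[, c("Ozone", "Temp", "Wind")], use = "complete.obs")

# Keep each pair once by taking the upper triangle of the matrix
pairs_idx <- which(upper.tri(cm), arr.ind = TRUE)

pair_table <- data.frame(
  var1 = rownames(cm)[pairs_idx[, "row"]],
  var2 = colnames(cm)[pairs_idx[, "col"]],
  r    = cm[upper.tri(cm)]
)

# Sort from strongest to weakest by absolute value
ranked <- pair_table[order(-abs(pair_table$r)), ]
ranked
```

Sorting on the raw values instead would put the positive correlation first and the most negative one last, which mixes up direction with strength.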

Task 6: Summary Table

Generate the summary table grouped by Month. It should include the count, mean ozone, mean temperature, and mean wind speed per month.

# your code goes here

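# Named monthly_summary so the object does not share a name with base R's summary()
monthly_summary <- airquality |>
  group_by(month_name) |>
  summarise(count = n(),
            ozone_average = mean(Ozone, na.rm = TRUE),
            temp_average = mean(Temp, na.rm = TRUE),
            wind_average = mean(Wind, na.rm = TRUE)
            )

monthly_summary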
## # A tibble: 5 × 5
##   month_name count ozone_average temp_average wind_average
##   <chr>      <int>         <dbl>        <dbl>        <dbl>
## 1 August        31          60.0         84.0         8.79
## 2 July          31          59.1         83.9         8.94
## 3 June          30          29.4         79.1        10.3 
## 4 May           31          23.6         65.5        11.6 
## 5 September     30          31.4         76.9        10.2

Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?

Explain: August has the highest average ozone level, at about 60.0 ppb, just ahead of July at 59.1 ppb. Temperature is highest in July and August (about 84 degrees), and wind speed in these hot months averages about 8.8 to 8.9, which is lower than in the cooler months. In May and September the temperature is lower; May averages 65.5 degrees, about 18 degrees below August, and has the highest average wind speed at 11.6. One possible explanation for these differences is that more heat and direct sunlight promote ozone formation, and because the hot months are also less windy, the ozone is able to accumulate.

Submission Requirements

Publish it to RPubs and submit your link on Blackboard.