Objective:

You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.

Dataset

Source: Built-in R dataset airquality.

Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.

Variables (selected for this assignment):

Notes

-The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.

-If you encounter errors, check that tidyverse and corrplot are installed and loaded.

-Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
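The first note can be illustrated with a minimal sketch of how the two missing-value options behave. It uses base R only; the object names na_counts, vars, and cor_complete are chosen here for illustration and are not part of the assignment.

```r
# Base R only; airquality ships with the datasets package.
data("airquality")

# Count missing values per column to see where NA handling matters.
na_counts <- colSums(is.na(airquality))
na_counts

# Without na.rm = TRUE, a single NA makes the whole result NA.
mean(airquality$Ozone)
mean(airquality$Ozone, na.rm = TRUE)

# use = "complete.obs" drops every row that has an NA in any of the
# selected columns before the correlations are computed.
vars <- airquality[, c("Ozone", "Temp", "Wind")]
cor_complete <- cor(vars, use = "complete.obs")
cor_complete
```

Note that na.rm = TRUE works per variable, while use = "complete.obs" removes whole rows, so the two approaches can be based on different numbers of observations.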

Instructions:

Complete the following tasks using R to analyze the airquality dataset. Submit your Rpubs link that includes code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.

#Load your dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data("airquality")

Tasks and Questions

Task 1: Measures of Central Tendency and Spread

Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.

#Your code for Ozone goes here
summary_stats1 <- airquality |>
  summarise(mean_ozone=mean(Ozone,na.rm=TRUE),
            median_ozone =median(Ozone,na.rm = TRUE),
            sd_ozone=sd(Ozone,na.rm=TRUE),
            min_ozone=min(Ozone,na.rm=TRUE),
            max_ozone=max(Ozone,na.rm=TRUE))
summary_stats1
##   mean_ozone median_ozone sd_ozone min_ozone max_ozone
## 1   42.12931         31.5 32.98788         1       168
#Your code for Temp goes here
summary_stats2<-airquality|>
  summarise(mean_temp=mean(Temp,na.rm=TRUE),
            median_temp=median(Temp,na.rm=TRUE),
            sd_temp=sd(Temp,na.rm=TRUE),
            min_temp=min(Temp,na.rm=TRUE),
            max_temp=max(Temp,na.rm=TRUE))
summary_stats2
##   mean_temp median_temp sd_temp min_temp max_temp
## 1  77.88235          79 9.46527       56       97
#Your code for Wind goes here
summary_stats3 <- airquality |>
  summarise(mean_wind=mean(Wind,na.rm=TRUE),
            median_wind=median(Wind,na.rm=TRUE),
            sd_wind=sd(Wind,na.rm=TRUE),
            min_wind=min(Wind,na.rm=TRUE),
            max_wind=max(Wind,na.rm=TRUE))
summary_stats3
##   mean_wind median_wind  sd_wind min_wind max_wind
## 1  9.957516         9.7 3.523001      1.7     20.7

Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?

Ozone: Mean vs Median: The mean ozone is 42.13 and the median ozone is 31.5. These values are not very close, and the mean is well above the median, which suggests that the ozone distribution is right-skewed rather than symmetric.

Standard Deviation: The SD is about 32.99, meaning that most ozone values fall within roughly ±33 units of the mean (approximately 9 to 75). This shows that there is a large amount of variability in ozone levels. In other words, ozone concentrations change a lot from one observation to another rather than staying consistent.

Temp: Mean vs Median: The mean temp is 77.88 and the median temp is 79. These values are close, which suggests that the temp distribution is roughly symmetric. The mean sits slightly below the median, so any skew is slight and to the left.

Standard Deviation: The SD is about 9.47, meaning that most temperature values fall within roughly ±9 degrees of the mean (about 68 to 87). This shows moderate variability in temperature, indicating that daily temperatures are relatively consistent without extreme fluctuations.

Wind: Mean vs Median: The mean wind is 9.96 and the median wind is 9.7. These values are close, which suggests that the wind distribution is roughly symmetric and not strongly skewed.

Standard Deviation: The SD is about 3.52, meaning that most wind speeds fall within roughly ±3.5 of the mean (about 6.5 to 13.5). This shows low to moderate variability, suggesting that wind conditions are generally stable across the observations.
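The claims above (mean versus median as a skewness hint, and "most values fall within one SD of the mean") can be checked directly. The following is a rough sketch in base R; spread_check is a hypothetical helper written for this illustration, not a function from the assignment or from any package.

```r
data("airquality")

# Hypothetical helper: reports the gap between mean and median and the
# share of non-missing values that lie within one SD of the mean.
spread_check <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sd(x)
  c(
    mean_minus_median = m - median(x),
    share_within_1sd  = mean(abs(x - m) <= s)
  )
}

# One column of results per variable.
checks <- sapply(airquality[, c("Ozone", "Temp", "Wind")], spread_check)
round(checks, 2)
```

A clearly positive mean_minus_median points toward right skew and a negative one toward left skew. The share_within_1sd row shows how literally "most values" can be taken for each variable.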

Task 2: Histogram

Generate the histogram for Ozone.

#Your code goes here
ggplot(airquality,aes(x=Ozone))+
  geom_histogram(binwidth=20, fill="#8EE5EE",color="black")+
  labs(title="Histogram of Ozone",x="Ozone(ppb)",y="Count")+
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).

Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?

The histogram is right-skewed, and the distribution is unimodal, with one main peak at the lower ozone values. On the high end, a handful of days have ozone values far above the bulk of the data, up to the maximum of 168 ppb. These are potential outliers.
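One way to make "potential outliers" concrete is the 1.5 × IQR rule that boxplots conventionally use. The sketch below assumes that rule as the cutoff; the assignment itself does not prescribe one, and the names ozone, upper_fence, and high_values are illustrative.

```r
data("airquality")

# Work with the non-missing ozone readings only.
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Upper fence under the 1.5 * IQR convention (an assumed cutoff).
q <- quantile(ozone, c(0.25, 0.75))
upper_fence <- unname(q[2] + 1.5 * (q[2] - q[1]))
upper_fence

# Readings above the fence, largest first.
high_values <- sort(ozone[ozone > upper_fence], decreasing = TRUE)
high_values
```

Values flagged this way are candidates for a closer look rather than errors. High ozone days can be genuine events.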

Task 3: Boxplot

Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.

# Your code here
airquality1 <- airquality |>
  mutate(month_name = case_when(Month == 5 ~ "May",
                                Month == 6 ~ "June",
                                Month == 7~"July",
                                Month==8~"August",
                                Month==9~"September"))

airquality1$month_name <- factor(airquality1$month_name,levels = c("May","June","July","August","September")) 

ggplot(airquality1,aes(x=factor(month_name),y=Ozone))+
  geom_boxplot(fill=c("#FF69B4","#9370DB","#6B8E23","#4169E1","#FFA500"))+
  labs(title="Boxplot of Ozone by Month",x="Month",y="Ozone(ppb)")+
  theme_minimal()
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?

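Ozone levels vary noticeably across the months in the boxplot. May shows the lowest overall ozone levels, and June also has a low median. July has the highest median ozone level (about 60 ppb), with August somewhat lower (about 52 ppb), although August has the widest spread and the single highest value. Every month except July has at least one outlier. These are days with unusually high ozone for that month. Overall, ozone levels tend to rise in the warmer months and stay lower in the spring and fall.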
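Medians are hard to read precisely from a boxplot, so it can help to tabulate them. This is a minimal base R sketch; monthly_median is an illustrative name, and month.name is R's built-in vector of English month names.

```r
data("airquality")

# Median ozone per month; the formula interface drops rows with NA Ozone.
monthly_median <- aggregate(Ozone ~ Month, data = airquality, FUN = median)

# Attach readable month names (5 -> "May", ..., 9 -> "September").
monthly_median$month_name <- month.name[monthly_median$Month]

# Sort from highest to lowest median.
monthly_median[order(-monthly_median$Ozone), ]
```

The same table could be produced with group_by and summarise. Base R is used here only to keep the sketch free of package dependencies.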

Task 4: Scatterplot

Produce the scatterplot of Temp vs. Ozone, colored by Month.

# Your code goes here
ggplot(airquality1,aes(x=Temp,y=Ozone,color=month_name))+
  geom_point(alpha=0.7)+
  labs(
    title= "Scatterplot of Temp vs Ozone",
    x="Temp",y="Ozone",color = "Month") +
  scale_color_manual(values = c("#0000FF","#FF7F50","#7FFF00","#2F4F4F","#B22222"))+
  theme_minimal()
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.

The scatterplot shows a positive relationship between temperature and ozone levels. As temperature increases, ozone levels also tend to rise, so higher temperatures are usually linked with higher ozone concentrations. The points for July and August cluster in the higher-temperature, higher-ozone range, showing that ozone levels are generally higher in the warmer summer months.

Task 5: Correlation Matrix

Compute and visualize the correlation matrix for Ozone, Temp, and Wind.

# Your code goes here
cor_matrix <- cor(airquality1 |>
  select(Ozone,Temp,Wind),use = "complete.obs")
cor_matrix
##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000
# mar leaves room at the top so the title is not clipped
corrplot(cor_matrix,method ="color",type="upper",order = "hclust",
         tl.col = "black",tl.srt = 45,addCoef.col = "black",
         title = "Correlation Matrix of Numeric Variables",
         mar = c(0, 0, 2, 0))

Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.

The correlation matrix shows that ozone and temperature have the strongest correlation (r = 0.70), a fairly strong positive relationship, meaning that higher temperatures are usually linked with higher ozone levels. Ozone and wind have a moderate negative correlation (r = -0.60), which suggests that when wind speed increases, ozone levels tend to decrease. The weakest correlation is between temperature and wind (r = -0.51), which is still a moderate negative relationship. Ozone is therefore more strongly correlated with temperature than with wind speed.
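Strength of correlation is judged by the absolute value of r, so ranking the pairs by |r| makes the strongest and weakest explicit. The following is a sketch in base R; cor_pairs and idx are illustrative names.

```r
data("airquality")

cor_matrix <- cor(airquality[, c("Ozone", "Temp", "Wind")],
                  use = "complete.obs")

# Keep each pair once by taking the upper triangle of the matrix.
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)

# Order by absolute correlation, strongest first.
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
cor_pairs
```

Sorting on abs(r) rather than r matters here because two of the three correlations are negative.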

Task 6: Summary Table

Generate the summary table grouped by Month. It should include the count, mean ozone, mean temperature, and mean wind speed per month.

# your code goes here
summary_table <- airquality1 |>
  group_by(month_name)|>
  summarise(
    Count = n(),
    Avg_ozone = mean(Ozone, na.rm=TRUE),
    Avg_temp = mean(Temp, na.rm = TRUE),
    Avg_wind= mean(Wind,na.rm=TRUE)
  )
summary_table
## # A tibble: 5 × 5
##   month_name Count Avg_ozone Avg_temp Avg_wind
##   <fct>      <int>     <dbl>    <dbl>    <dbl>
## 1 May           31      23.6     65.5    11.6 
## 2 June          30      29.4     79.1    10.3 
## 3 July          31      59.1     83.9     8.94
## 4 August        31      60.0     84.0     8.79
## 5 September     30      31.4     76.9    10.2

Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?

August has the highest average ozone level (60.0), just ahead of July (59.1). July and August also have the highest average temperatures and the lowest average wind speeds. May, June, and September have cooler temperatures and higher winds than July and August. Overall, ozone levels rise during the hot, calm summer months and decrease when the weather is cooler or windier.
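One caveat when reading the summary table: Count = n() counts days, not ozone readings, so the monthly ozone averages may rest on fewer observations than the Count column suggests. This can be checked with a small base R sketch; days, ozone_obs, and counts are illustrative names.

```r
data("airquality")

# Number of rows (days) per month.
days <- tapply(airquality$Ozone, airquality$Month, length)

# Number of non-missing ozone readings per month.
ozone_obs <- tapply(airquality$Ozone, airquality$Month,
                    function(x) sum(!is.na(x)))

counts <- data.frame(
  month     = month.name[as.integer(names(days))],
  days      = as.vector(days),
  ozone_obs = as.vector(ozone_obs)
)
counts
```

A month with few ozone readings gives a less reliable average, which is worth keeping in mind when comparing months.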

Submission Requirements

Publish your report to RPubs and submit the link on Blackboard.