Objective:
You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.
Dataset
Source: Built-in R dataset airquality.
Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.
Variables (selected for this assignment):
Notes
-The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.
-If you encounter errors, check that tidyverse and corrplot are installed and loaded.
-Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
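As a minimal sketch of the missing-value handling mentioned in the notes, the snippet below contrasts a statistic computed with and without na.rm = TRUE, and shows use = "complete.obs" in cor(). The object names (ozone_mean_raw, ozone_mean, cor_complete) are illustrative only and are not part of the assignment.

```r
data(airquality)

# Without na.rm, a single missing value makes the whole result NA
ozone_mean_raw <- mean(airquality$Ozone)
ozone_mean_raw

# na.rm = TRUE drops the missing values before computing the statistic
ozone_mean <- mean(airquality$Ozone, na.rm = TRUE)
ozone_mean

# For correlations, use = "complete.obs" keeps only the rows in which
# every selected column is observed
cor_complete <- cor(airquality[, c("Ozone", "Temp", "Wind")],
                    use = "complete.obs")
cor_complete
```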
Instructions:
Complete the following tasks using R to analyze the airquality dataset. Submit your Rpubs link that includes code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.
#Load your dataset and check structure, NAs, show 6 lines
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
library(RColorBrewer)
data("airquality")
str(airquality) # showing 153 obs. of 6 variables, int/num
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
print (colSums(is.na(airquality))) # showing the NA in Ozone and Solar.R
## Ozone Solar.R Wind Temp Month Day
## 37 7 0 0 0 0
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
Tasks and Questions
Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.
#Your code for Ozone goes here
airquality |>
summarize( # for Ozone get:
mean(Ozone, na.rm = TRUE), # mean
median (Ozone, na.rm = TRUE), # median
sd(Ozone, na.rm = TRUE), # std dev
min(Ozone, na.rm = TRUE), # min
max(Ozone, na.rm = TRUE) # max
)
## mean(Ozone, na.rm = TRUE) median(Ozone, na.rm = TRUE) sd(Ozone, na.rm = TRUE)
## 1 42.12931 31.5 32.98788
## min(Ozone, na.rm = TRUE) max(Ozone, na.rm = TRUE)
## 1 1 168
#Your code for Temp goes here
airquality |>
summarize( # for Temp get:
mean(Temp, na.rm = TRUE), # mean
median (Temp, na.rm = TRUE), # median
sd(Temp, na.rm = TRUE), # std dev
min(Temp), # min
max(Temp) # max
)
## mean(Temp, na.rm = TRUE) median(Temp, na.rm = TRUE) sd(Temp, na.rm = TRUE)
## 1 77.88235 79 9.46527
## min(Temp) max(Temp)
## 1 56 97
#Your code for Wind goes here
airquality |>
summarize( # for Wind get:
mean(Wind, na.rm = TRUE), # mean
median (Wind, na.rm = TRUE), # median
sd(Wind, na.rm = TRUE), # std dev
min(Wind), # min
max(Wind) # max
)
## mean(Wind, na.rm = TRUE) median(Wind, na.rm = TRUE) sd(Wind, na.rm = TRUE)
## 1 9.957516 9.7 3.523001
## min(Wind) max(Wind)
## 1 1.7 20.7
Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?
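Ozone's mean (42.1 ppb) is well above its median (31.5 ppb), which indicates a right-skewed distribution, and its standard deviation (33.0 ppb) is large relative to the mean, so the data are highly variable (as one would expect from ozone concentration, which depends on several atmospheric parameters and on solar radiation). For Temp (mean 77.9, median 79) and Wind (mean 9.96, median 9.7) the mean and median nearly coincide, which suggests roughly symmetric distributions; this is consistent with normality but does not by itself establish it. Their standard deviations (9.5 F and 3.5 mph) show moderate variability. Relative to its mean, Wind varies more than Temp, although Temp has the larger standard deviation in absolute terms.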
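To put the spread of the three variables on a comparable footing, one option is to tabulate the mean-median gap and the coefficient of variation (sd divided by mean) side by side. This is only a rough sketch: shape_check is an illustrative name, and the coefficient of variation for Temp depends on the arbitrary zero of the Fahrenheit scale, so it should be read loosely.

```r
data(airquality)

vars <- c("Ozone", "Temp", "Wind")

# One column per variable, one row per statistic
shape_check <- sapply(airquality[vars], function(x) {
  m   <- mean(x, na.rm = TRUE)
  med <- median(x, na.rm = TRUE)
  s   <- sd(x, na.rm = TRUE)
  c(mean = m,
    median = med,
    sd = s,
    mean_minus_median = m - med,  # clearly positive values hint at right skew
    cv = s / m)                   # spread relative to the mean
})

round(shape_check, 2)
```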
Generate the histogram for Ozone.
#Your code goes here
hist(airquality$Ozone, # hist ignores NA values by default
main = "Ozone levels in New York from May to September 1973", # Title
xlab = "Ozone, ppb", # X-axis label
ylab = "Frequency", # Y-axis label
col = "skyblue", # Bar color
border = "green", # enhance a little
breaks = 200 # call for more bins to narrow line
)
Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?
The ozone distribution is right-skewed with a long upper tail. The histogram shows an isolated value near 170; browsing the dataset confirms a single observation at 168 ppb, the maximum.
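Reading outliers off a histogram is subjective, so a rule-based check can complement it. The sketch below applies the common 1.5 × IQR rule (a convention, not something the assignment prescribes) to the non-missing ozone values; it may flag more values than the visual inspection did. The names ozone, upper_fence and flagged are illustrative.

```r
data(airquality)

# Work with the observed ozone values only
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Upper fence of the 1.5 * IQR rule: third quartile plus 1.5 times the IQR
q3 <- unname(quantile(ozone, 0.75))
upper_fence <- q3 + 1.5 * IQR(ozone)

# Values above the fence are flagged as potential high outliers
flagged <- sort(ozone[ozone > upper_fence])
upper_fence
flagged
```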
Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.
# Your code here
# first recode months and create factor. Just following the class activity (storms)
airquality2 <- airquality |>
mutate(
month_name = case_when(
Month == 5 ~ "May",
Month == 6 ~ "June",
Month == 7 ~ "July",
Month == 8 ~ "August",
Month == 9 ~ "September" ),
# Convert to factor with explicit order
month_name = factor(month_name,
levels = c("May", "June", "July", "August", "September"))
)
# second remove rows with ozone NA, then boxplot
airquality3 <- filter(airquality2, !is.na(Ozone)) # filter rows with NAs in Ozone
boxplot(Ozone ~ month_name, data=airquality3,
main = "Ozone by Month, 1973",
xlab = "Month",
ylab = "Ozone, ppb",
col = "lightblue")
Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?
Ozone levels rise from May, peak in July and August, and fall in September. The highest median ozone is in July (ca. 60 ppb). There are outliers in May (ca. 115 ppb), June (ca. 75 ppb) and September (ca. 70, 80, 90 and 100 ppb); these are isolated days with unusually high ozone for an otherwise low-ozone month, possibly hot, sunny days with little wind. In general, ozone formation is favored by higher temperatures and stronger solar radiation, which occur in the warmer months of the year.
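The medians read off the boxplot can be checked numerically. A minimal base-R sketch follows, using factor() with labels as an alternative to the case_when recode; month_labels and ozone_median are illustrative names.

```r
data(airquality)

# Map month numbers 5-9 to ordered month names
month_labels <- c("May", "June", "July", "August", "September")
month_name <- factor(airquality$Month, levels = 5:9, labels = month_labels)

# Median ozone per month, ignoring missing values
ozone_median <- tapply(airquality$Ozone, month_name, median, na.rm = TRUE)
ozone_median

# Month with the highest median ozone
names(which.max(ozone_median))
```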
Produce the scatterplot of Temp vs. Ozone, colored by Month.
# Your code goes here
plot(Temp ~ Ozone, data = airquality3,
  col = as.integer(airquality3$month_name), # one palette color per month level
  main = "Temperature vs Ozone Level in 1973, somewhere in New York",
  ylab = "Temperature (F)",
  xlab = "Ozone (ppb)"
)
# legend added for clarity; the colors are indexed by factor level, so they match the points regardless of row order
legend("bottomright", legend = levels(airquality3$month_name),
  fill = seq_along(levels(airquality3$month_name)))
Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.
Yes. Ozone rises with temperature, and the increase is sharpest above ca. 80 F. The data show that such temperatures occur in June, July, August and September. In general, the highest ozone levels occur in July and August, and the lowest in May.
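The visual impression of a rise above roughly 80 F can be backed with two quick numbers. This is only a sketch: the 80 F cut-off is taken from the answer above, not from any standard, and the Spearman correlation is used here because it does not assume a straight-line relationship. The names aq, warm and ozone_by_warm are illustrative.

```r
data(airquality)

# Keep the days with an observed ozone value
aq <- airquality[!is.na(airquality$Ozone), ]

# Mean ozone on days at or below 80 F (FALSE) versus above 80 F (TRUE)
warm <- aq$Temp > 80
ozone_by_warm <- tapply(aq$Ozone, warm, mean)
ozone_by_warm

# Rank-based (Spearman) correlation between temperature and ozone
temp_ozone_rho <- cor(aq$Temp, aq$Ozone, method = "spearman")
temp_ozone_rho
```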
Compute and visualize the correlation matrix for Ozone, Temp, and Wind.
# Your code goes here
cor_airquality <- cor (
airquality3 |>
select(Ozone, Temp, Wind)
)
cor_airquality
## Ozone Temp Wind
## Ozone 1.0000000 0.6983603 -0.6015465
## Temp 0.6983603 1.0000000 -0.5110750
## Wind -0.6015465 -0.5110750 1.0000000
corrplot(cor_airquality, method = "color", type = "upper", order = "hclust",
tl.col = "black", tl.srt = 40, addCoef.col = "black", col = brewer.pal(n = 8, name = "PiYG"),
title = "Correlation among Ozone, Temperature and Wind",
mar = c(0, 0, 2, 0)) # extra top margin so the title is not clipped
Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.
The strongest correlation is between Ozone and Temp (0.70) and the weakest is between Wind and Temp (-0.51). Ozone is therefore more strongly correlated with temperature than with wind speed (-0.60). The negative correlation of Ozone with Wind indicates that windier days tend to have lower ozone levels. Part of the explanation may lie in the negative correlation between Wind and Temp (-0.51): windier days tend to be cooler, and cooler days tend to have lower ozone. On this reading temperature, rather than wind speed, would be the leading explanatory variable, although correlations alone cannot establish the direction of cause.
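The idea that wind is related to ozone mainly through temperature can be probed with a partial correlation: correlate what is left of Ozone and of Wind after a linear fit on Temp has been removed from each. This is a rough sketch under a linearity assumption, not a causal analysis; if the partial correlation stays clearly negative, wind is associated with ozone beyond what temperature explains. The names aq, ozone_resid, wind_resid and partial_ozone_wind are illustrative.

```r
data(airquality)

# Rows with an observed ozone value; Temp and Wind have no missing values
aq <- airquality[!is.na(airquality$Ozone), c("Ozone", "Temp", "Wind")]

# Remove the linear effect of temperature from ozone and from wind
ozone_resid <- resid(lm(Ozone ~ Temp, data = aq))
wind_resid  <- resid(lm(Wind ~ Temp, data = aq))

# Partial correlation of Ozone and Wind, controlling for Temp
partial_ozone_wind <- cor(ozone_resid, wind_resid)
partial_ozone_wind

# Plain correlation, for comparison
cor(aq$Ozone, aq$Wind)
```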
Generate the summary table grouped by Month. It should include count, mean ozone, mean temperature, and mean wind speed per month.
# your code goes here
airquality3 |>
group_by(month_name) |>
summarize(
  count = n(),             # days with a non-missing Ozone value
  Ozone_avg = mean(Ozone), # airquality3 has no NA in Ozone, Temp or Wind
  Temp_avg = mean(Temp),
  Wind_avg = mean(Wind)
)
## # A tibble: 5 × 5
## month_name count Ozone_avg Temp_avg Wind_avg
## <fct> <int> <dbl> <dbl> <dbl>
## 1 May 26 23.6 66.7 11.5
## 2 June 9 29.4 78.2 12.2
## 3 July 26 59.1 83.9 8.52
## 4 August 26 60.0 84.0 8.57
## 5 September 29 31.4 76.9 10.1
Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?
The highest average ozone level occurs in August (60.0 ppb), with July almost identical (59.1 ppb). Average temperature rises from May, peaks in July and August (about 84 F) and drops in September. Average wind speed is highest in May and June (11.5 and 12.2), falls to its minimum in July and August (about 8.5) and rises again in September. So average ozone levels track average temperature directly and average wind speed inversely. As stated in previous tasks, temperature appears to be the leading factor behind ozone levels in these data. Temperature (a measure of thermal energy, or “heat”) increases with incoming solar radiation, which is greatest in the summer months, when the Sun is higher in the sky and the days are longer. Stagnant air allows heat and ozone to accumulate. Wind disperses and dilutes stagnant air, lowering air temperature and ozone levels. Another underlying environmental factor is ongoing global warming: accumulated CO2 strengthens the greenhouse effect, which raises atmospheric temperature and could thus favor ozone formation. However, that effect would have been essentially constant over the short May-September timeframe of this dataset, so its variation is not evident in the data.
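One caveat on the summary table: its counts and averages come from the rows that remain after dropping missing ozone values, so the months are not equally represented. The sketch below, with illustrative names (month_labels, coverage), compares the number of calendar days per month with the number of days that have an ozone reading.

```r
data(airquality)

# Map month numbers 5-9 to ordered month names
month_labels <- c("May", "June", "July", "August", "September")
month_name <- factor(airquality$Month, levels = 5:9, labels = month_labels)

# Days in the dataset versus days with an observed ozone value, per month
coverage <- data.frame(
  days = as.vector(table(month_name)),
  ozone_days = as.vector(tapply(!is.na(airquality$Ozone), month_name, sum)),
  row.names = month_labels
)
coverage$ozone_missing <- coverage$days - coverage$ozone_days
coverage
```

Months with few ozone readings, such as June in the table above, give less reliable monthly averages than months with nearly complete records.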
Submission Requirements
Publish it to RPubs and submit your link on Blackboard.
Published at https://rpubs.com/rmiranda/1358568