#Shelsy Chouakong
Objective:
You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.
Dataset
Source: Built-in R dataset airquality.
Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.
Variables (selected for this assignment):
Notes
-The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.
-If you encounter errors, check that tidyverse and corrplot are installed and loaded.
-Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
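As a quick check on the first note, the missing values can be counted per column before any statistics are computed. This is a minimal base R sketch; `na_counts` is an illustrative name and not part of the assignment.

```r
# Load the built-in dataset and count missing values in each column
data(airquality)
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Number of rows with no missing value in any column
# (the rows that use = "complete.obs" keeps when every column is included)
print(sum(complete.cases(airquality)))
```

Columns with a count of zero need no special handling; the others need na.rm = TRUE or use = "complete.obs".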
Instructions:
Complete the following tasks using R to analyze the airquality dataset. Submit your Rpubs link that includes code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.
#Load your dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
data(airquality)
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
Tasks and Questions
Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.
#Your code for Ozone goes here
mean(airquality$Ozone, na.rm = TRUE)
## [1] 42.12931
median(airquality$Ozone, na.rm = TRUE)
## [1] 31.5
sd(airquality$Ozone, na.rm = TRUE)
## [1] 32.98788
min(airquality$Ozone, na.rm = TRUE)
## [1] 1
max(airquality$Ozone, na.rm = TRUE)
## [1] 168
#Your code for Temp goes here
#temperature stats
mean(airquality$Temp, na.rm = TRUE)
## [1] 77.88235
median(airquality$Temp, na.rm = TRUE)
## [1] 79
sd(airquality$Temp, na.rm = TRUE)
## [1] 9.46527
min(airquality$Temp, na.rm = TRUE)
## [1] 56
max(airquality$Temp, na.rm = TRUE)
## [1] 97
#Your code for Wind goes here
#Wind Stats
mean(airquality$Wind, na.rm = TRUE)
## [1] 9.957516
median(airquality$Wind, na.rm = TRUE)
## [1] 9.7
sd(airquality$Wind, na.rm = TRUE)
## [1] 3.523001
min(airquality$Wind, na.rm = TRUE)
## [1] 1.7
max(airquality$Wind, na.rm = TRUE)
## [1] 20.7
Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?
-Across variables the means and medians differ a lot because the variables are on different scales, but within each variable they can be compared directly. For Ozone the mean (42.13) is well above the median (31.5), which suggests a right-skewed distribution. For Temp and Wind the mean and median are close (77.88 vs. 79 and 9.96 vs. 9.7), so both look roughly symmetric. Temp's mean is slightly below its median, which hints at a slight left skew, while Wind's mean is slightly above its median, which hints at a slight right skew. The standard deviation measures how spread out the values are around the mean: the larger the value, the greater the variability. Ozone (SD 32.99) is the most variable relative to its mean, while Temp (SD 9.47) and Wind (SD 3.52) vary much less.
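The mean-versus-median comparison can also be made in one pass instead of fifteen separate calls. The following is a sketch in base R, assuming a small helper named `describe` (a hypothetical name introduced here for illustration); the difference between mean and median is only a rough hint about skewness, not a formal skewness statistic.

```r
data(airquality)

# Illustrative helper (not part of the assignment):
# five summary statistics for one numeric vector, ignoring NAs
describe <- function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE))
}

# One column of statistics per variable
stats_table <- sapply(airquality[, c("Ozone", "Temp", "Wind")], describe)
print(round(stats_table, 2))

# Mean minus median: positive hints at right skew, negative at left skew
skew_hint <- stats_table["mean", ] - stats_table["median", ]
print(round(skew_hint, 2))
```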
Generate the histogram for Ozone.
#Your code goes here
# hist() drops missing values on its own, so na.rm is not needed
hist(airquality$Ozone)
Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?
The ozone distribution is unimodal and skewed to the right: most values pile up at the low end and a long tail stretches toward high values. The two tallest bars sit at the low end and are about twice as tall as the next bar. They represent the bulk of the data rather than outliers. The unusual values are the few very high readings in the right tail, up to the maximum of 168.
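To back up the visual impression, possible outliers can be listed with the same 1.5 × IQR rule that boxplots use. This is a sketch in base R; the 10-unit bin width and the names `ozone` and `ozone_outliers` are choices made here for illustration.

```r
data(airquality)

# Keep only the observed ozone values
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# boxplot.stats() flags points more than 1.5 * IQR beyond the hinges
ozone_outliers <- boxplot.stats(ozone)$out
print(ozone_outliers)

# Redraw the histogram with narrower bins to see the right tail more clearly
hist(ozone,
     breaks = seq(0, 170, by = 10),
     main = "Histogram of Ozone",
     xlab = "Ozone")
```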
Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.
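# Your code here
# Recode Month (5-9) into month names; the factor levels keep the
# boxes in calendar order instead of alphabetical order
airquality$month_name <- factor(
  case_when(
    airquality$Month == 5 ~ "May",
    airquality$Month == 6 ~ "June",
    airquality$Month == 7 ~ "July",
    airquality$Month == 8 ~ "August",
    airquality$Month == 9 ~ "September"
  ),
  levels = c("May", "June", "July", "August", "September")
)
# Rows with missing Ozone are dropped by the default na.action
boxplot(Ozone ~ month_name, data = airquality,
        main = "Ozone Levels by Month",
        xlab = "Month",
        ylab = "Ozone")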
Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?
Based on the data, ozone levels peak in the middle of summer, in July and August, and are lower in May, June, and September. There are outliers, mainly toward the middle of summer, which may indicate days of unusually high temperature or pollution that raised ground-level ozone concentrations during those months.
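The question about the highest median can be answered numerically rather than by reading the boxplot. A minimal sketch in base R follows; `median_by_month` is an illustrative name, and the months are identified by their original numbers (5 to 9).

```r
data(airquality)

# Median ozone for each month, ignoring missing values
median_by_month <- tapply(airquality$Ozone, airquality$Month,
                          median, na.rm = TRUE)
print(median_by_month)

# Month number with the highest median ozone
print(names(which.max(median_by_month)))
```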
Produce the scatterplot of Temp vs. Ozone, colored by Month.
# Your code goes here
plot(airquality$Temp, airquality$Ozone,
xlab = "Temperature",
ylab = "Ozone",
col = airquality$Month,
pch = 16,
main = "TEMP VS OZONE, Colored By Month")
Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.
-There is a visible positive relationship between temperature and ozone: ozone tends to rise as temperature rises. The months also tend to cluster, with the mid-summer months (July and August) grouped more tightly at higher temperatures and higher ozone levels than the months before and after.
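Month clusters are easier to read when the plot carries a legend. The following base R sketch shows one way to add it, assuming a helper factor named `month_factor` (introduced here for illustration) so that colors and legend entries stay in step.

```r
data(airquality)

# Month as a labelled factor so each month maps to one color index (1-5)
month_factor <- factor(airquality$Month, levels = 5:9,
                       labels = month.name[5:9])

# Points with missing Ozone are skipped by plot()
plot(airquality$Temp, airquality$Ozone,
     col = as.integer(month_factor),
     pch = 16,
     xlab = "Temperature",
     ylab = "Ozone",
     main = "Temp vs. Ozone, Colored by Month")

# Legend uses the same color indices as the points
legend("topleft",
       legend = levels(month_factor),
       col = seq_along(levels(month_factor)),
       pch = 16,
       title = "Month")
```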
Compute and visualize the correlation matrix for Ozone, Temp, and Wind.
# Your code goes here
cor(airquality[, c("Ozone", "Temp", "Wind")], use ="complete.obs")
## Ozone Temp Wind
## Ozone 1.0000000 0.6983603 -0.6015465
## Temp 0.6983603 1.0000000 -0.5110750
## Wind -0.6015465 -0.5110750 1.0000000
# corrplot and tidyverse are already loaded above, so only the plot call is needed
corrplot(cor(airquality[, c("Ozone", "Temp", "Wind")], use = "complete.obs"))
Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.
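-The strongest correlation is between Ozone and Temp (r ≈ 0.70). It is positive, so ozone tends to be higher on warmer days. Ozone and Wind have a moderately strong negative correlation (r ≈ -0.60), so ozone tends to be lower on windier days. The weakest correlation is between Temp and Wind (r ≈ -0.51), although it is still moderate and negative. Ozone is therefore more strongly correlated with temperature than with wind speed. These values describe linear association only and do not show that one variable causes another.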
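Strength should be judged by the absolute value of r, since a negative correlation can be as strong as a positive one. As a sketch, the pairs can be ranked that way in base R; `cor_matrix` and `cor_pairs` are illustrative names.

```r
data(airquality)

# Same complete-case correlation matrix as in the task
cor_matrix <- cor(airquality[, c("Ozone", "Temp", "Wind")],
                  use = "complete.obs")

# Keep each pair once (upper triangle), then sort by absolute strength
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs)
```

The first row is the strongest pair and the last row is the weakest.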
Generate the summary table grouped by Month. It should include the count, mean ozone, mean temperature, and mean wind speed per month.
# your code goes here
airquality |>
group_by(Month) |>
summarize(
count = n(),
mean_ozone = mean(Ozone, na.rm = TRUE),
mean_temp = mean(Temp, na.rm = TRUE),
mean_wind = mean(Wind, na.rm = TRUE)
)
## # A tibble: 5 × 5
## Month count mean_ozone mean_temp mean_wind
## <int> <int> <dbl> <dbl> <dbl>
## 1 5 31 23.6 65.5 11.6
## 2 6 30 29.4 79.1 10.3
## 3 7 31 59.1 83.9 8.94
## 4 8 31 60.0 84.0 8.79
## 5 9 30 31.4 76.9 10.2
Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?
-The month with the highest average ozone level is August (60.0), narrowly ahead of July (59.1). These are also the two warmest months (mean temperatures of 84.0 and 83.9) and the two least windy (mean wind speeds of 8.79 and 8.94), while May is the coolest (65.5) and windiest (11.6) month and has the lowest mean ozone (23.6). A likely explanation is that heat and strong sunlight favor ozone formation, while stronger wind disperses pollutants and lowers ozone concentrations.
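Because the July and August means are so close, it is safer to let R pick the maximum than to read it off the rounded table. This is a base R sketch of the same monthly means; `monthly_means` and `top_month` are illustrative names, and `na.action = na.pass` is used so that only the missing values within each column are dropped by `na.rm = TRUE`.

```r
data(airquality)

# Monthly means for the three variables; na.pass keeps all rows so that
# na.rm = TRUE removes missing values column by column
monthly_means <- aggregate(cbind(Ozone, Temp, Wind) ~ Month,
                           data = airquality,
                           FUN = mean, na.rm = TRUE,
                           na.action = na.pass)
print(monthly_means)

# Month number with the highest mean ozone
top_month <- monthly_means$Month[which.max(monthly_means$Ozone)]
print(top_month)
```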
Submission Requirements
Publish it to RPubs and submit your link on Blackboard.