Objective:
You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.
Dataset
Source: Built-in R dataset airquality.
Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.
Variables (selected for this assignment): Ozone (ppb), Wind (mph), Temp (degrees Fahrenheit), and Month (5 to 9).
Notes
- The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.
- If you encounter errors, check that tidyverse and corrplot are installed and loaded.
- Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
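As a quick check on the missing-value note, the sketch below counts the NA values in each column and shows what na.rm = TRUE changes. It uses only base R; the name na_counts is introduced here for illustration and is not part of the assignment.

```r
# Load the built-in dataset
data(airquality)

# Count missing values in each column
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Without na.rm = TRUE, the mean of a column that contains NA is NA
print(mean(airquality$Ozone))

# With na.rm = TRUE, the missing days are dropped before averaging
print(mean(airquality$Ozone, na.rm = TRUE))
```

Temp and Wind have no missing values, which is why the later calls on those columns work without na.rm = TRUE.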
Instructions:
Complete the following tasks using R to analyze the airquality dataset. Submit your RPubs link, which should include code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.
#Load your dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
data("airquality")
Tasks and Questions
Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.
max(airquality$Ozone, na.rm = TRUE)
## [1] 168
min(airquality$Ozone, na.rm = TRUE)
## [1] 1
mean(airquality$Ozone, na.rm = TRUE)
## [1] 42.12931
median(airquality$Ozone, na.rm = TRUE)
## [1] 31.5
sd(airquality$Ozone, na.rm = TRUE)
## [1] 32.98788
mean(airquality$Temp)
## [1] 77.88235
median(airquality$Temp)
## [1] 79
sd(airquality$Temp)
## [1] 9.46527
min(airquality$Temp)
## [1] 56
max(airquality$Temp)
## [1] 97
mean(airquality$Wind)
## [1] 9.957516
median(airquality$Wind)
## [1] 9.7
sd(airquality$Wind)
## [1] 3.523001
min(airquality$Wind)
## [1] 1.7
max(airquality$Wind)
## [1] 20.7
Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?
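For Wind, the mean (9.96) and median (9.7) are close, which suggests a roughly symmetric distribution. For Temp, the mean (77.88) and median (79) are also fairly close, with the mean slightly below the median, so the distribution is roughly symmetric with at most a slight left skew. For Ozone, the mean (42.13) is well above the median (31.5), which indicates a right-skewed distribution. The standard deviation measures how spread out the values are around the mean. Ozone's standard deviation (32.99) is nearly as large as its mean, so ozone is by far the most variable of the three columns. Temp (9.47) and Wind (3.52) are much less spread out relative to their means.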
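The mean and median comparison can also be laid out in one table. The following is a minimal sketch in base R, not the required approach for the task. The names stats_table and skew_hint are introduced here for illustration, and the scaled gap (mean minus median, divided by the standard deviation) is only a rough indicator of skew, not a formal skewness statistic.

```r
data(airquality)
vars <- c("Ozone", "Temp", "Wind")

# Compute the five statistics for each variable in one pass
stats_table <- sapply(airquality[vars], function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE))
})
print(round(stats_table, 2))

# Gap between mean and median, scaled by the standard deviation.
# Values well above 0 hint at right skew; values near 0 hint at symmetry.
skew_hint <- (stats_table["mean", ] - stats_table["median", ]) / stats_table["sd", ]
print(round(skew_hint, 2))
```

Under this rough measure Ozone stands out from Temp and Wind, which matches the comparison of means and medians.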
Generate the histogram for Ozone.
hist(airquality$Ozone)
Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?
The distribution for the ozone plot appears to be skewed right. There are outliers visible on the right end of the distribution.
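To make the outlier claim concrete, the sketch below lists the Ozone values that fall outside the 1.5 × IQR whiskers, using boxplot.stats() from base R, and redraws the histogram with narrower bins. The name ozone_outliers, the choice of 20 breaks, and the axis labels are assumptions made for illustration.

```r
data(airquality)

# boxplot.stats() applies the same 1.5 * IQR rule that boxplot() uses;
# missing values are ignored
ozone_outliers <- boxplot.stats(airquality$Ozone)$out
print(ozone_outliers)

# Histogram with narrower bins to show the long right tail
hist(airquality$Ozone, breaks = 20,
     main = "Daily ozone", xlab = "Ozone (ppb)")
```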
Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.
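# Recode Month into a month_name column with case_when.
# The result is stored in a copy so later tasks still see the original columns.
airquality_months <- airquality |>
  mutate(month_name = case_when(
    Month == 5 ~ "May",
    Month == 6 ~ "June",
    Month == 7 ~ "July",
    Month == 8 ~ "August",
    Month == 9 ~ "September"
  ))
# Keep the months in calendar order rather than alphabetical order
airquality_months$month_name <- factor(airquality_months$month_name,
  levels = c("May", "June", "July", "August", "September"))
# Boxplot of Ozone for each month
boxplot(Ozone ~ month_name, data = airquality_months)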
Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?
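Ozone is lowest in May and June, rises sharply in July and August, and falls again in September. July has the highest median ozone (60), followed by August (52). The spread is also widest in July and August, so ozone varies more from day to day in midsummer. There are high outliers in several months, for example 115 in May and several values above 70 in September. These may indicate unusually hot, still days when ozone built up well beyond what is typical for that month.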
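The question of which month has the highest median can also be checked numerically. This is a small base R sketch; the name median_by_month is introduced here for illustration.

```r
data(airquality)

# Median ozone for each month, ignoring missing days
median_by_month <- tapply(airquality$Ozone, airquality$Month,
                          median, na.rm = TRUE)

# Replace the month numbers (5 to 9) with names from the built-in month.name
names(median_by_month) <- month.name[as.integer(names(median_by_month))]
print(median_by_month)

# Month with the highest median ozone
print(names(which.max(median_by_month)))
```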
Produce the scatter plot of Temp vs. Ozone, colored by Month.
# Scatter plot of Ozone against Temp, with points colored by month
ggplot(airquality, aes(x = Temp, y = Ozone, color = factor(Month))) +
  geom_point(na.rm = TRUE) +
  labs(color = "Month")
Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.
There is a positive relationship between temperature and ozone: ozone levels tend to rise as temperature rises. The months also cluster. May days sit at the cool, low-ozone end of the plot, while July and August days sit at the hot end and account for most of the high ozone readings.
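One way to back up the visual impression is to fit a straight line and to compare monthly averages. The sketch below is illustrative only and assumes a simple linear model is a reasonable first summary; the names fit and monthly are introduced here.

```r
data(airquality)

# Simple linear regression of Ozone on Temp.
# lm() drops rows with missing Ozone by default.
fit <- lm(Ozone ~ Temp, data = airquality)
print(coef(fit))

# Mean ozone and temperature per month.
# With the formula interface, rows with missing Ozone are dropped first,
# so the Temp means here cover only days that have an ozone reading.
monthly <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, FUN = mean)
print(monthly)
```

A positive Temp coefficient is consistent with the upward trend in the scatter plot, and the monthly means show whether the warmer months also have higher ozone.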
Compute and visualize the correlation matrix for Ozone, Temp, and Wind.
# Correlation matrix for Ozone, Temp, and Wind (rows with missing values dropped)
cor_matrix <- cor(airquality[c("Ozone", "Temp", "Wind")], use = "complete.obs")
cor_matrix
corrplot(cor_matrix, method = "number")
Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.
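The strongest correlation is between Ozone and Temp, a fairly strong positive correlation of about 0.70: hotter days tend to have higher ozone. Ozone and Wind have a moderate negative correlation of about -0.60, so windier days tend to have lower ozone. The weakest correlation is between Temp and Wind, which is still negative (roughly -0.5): the more the wind blows, the lower the temperature tends to be, and vice versa. Ozone is therefore more strongly correlated with temperature than with wind speed.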
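To identify the strongest and weakest pairs without reading them off a plot, the correlations can be ranked by absolute value. This is a base R sketch; the names cor_complete, pairs_idx, and pair_table are introduced here for illustration.

```r
data(airquality)
vars <- airquality[c("Ozone", "Temp", "Wind")]

# complete.obs keeps only rows where all three variables are present
cor_complete <- cor(vars, use = "complete.obs")
print(round(cor_complete, 2))

# Pull out each pair once (upper triangle) and rank by absolute correlation
pairs_idx <- which(upper.tri(cor_complete), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(cor_complete)[pairs_idx[, "row"]],
  var2 = colnames(cor_complete)[pairs_idx[, "col"]],
  r    = cor_complete[pairs_idx]
)
pair_table <- pair_table[order(-abs(pair_table$r)), ]
print(pair_table)
```

The first row of pair_table is the strongest relationship and the last row is the weakest; the sign of r gives the direction.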
Generate the summary table grouped by Month. It should include the count, mean ozone, mean temperature, and mean wind speed for each month.
# Summary table by month: count, mean ozone, mean temperature, mean wind
airquality |>
  group_by(Month) |>
  summarise(
    count = n(),
    mean_ozone = mean(Ozone, na.rm = TRUE),
    mean_temp = mean(Temp),
    mean_wind = mean(Wind)
  )
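Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?
Looking at the data, August (month 8) has the highest average ozone level at about 60.0, narrowly ahead of July at about 59.1. Average temperature rises from May (about 65.5) to a peak in July and August (about 84) and falls again in September, while average wind speed does the reverse, dropping from about 11.6 in May to about 8.8 in August. The months covered run from late spring through early fall, so the seasonal climate likely explains these differences: hot, sunny, calm summer days favor ozone formation, while cooler and windier days keep it lower.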
Submission Requirements
Publish it to RPubs and submit your link on Blackboard.