Objective:

You will use R to analyze the built-in airquality dataset, applying descriptive statistics techniques to explore environmental data. The assignment covers measures of central tendency, spread, histograms, boxplots, scatterplots, correlations, and summary tables, aligning with the Week 6 agenda on Descriptive Statistics.

Dataset

Source: Built-in R dataset airquality.

Description: Contains 153 observations of daily air quality measurements in New York from May to September 1973.

Variables (selected for this assignment):

Notes

-The airquality dataset has missing values in Ozone and Solar.R. The code uses na.rm = TRUE or use = "complete.obs" to handle them.

-If you encounter errors, check that tidyverse and corrplot are installed and loaded.

-Feel free to modify plot aesthetics (e.g., colors, binwidth) to enhance clarity.
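
As a small illustration of the first note, the sketch below shows what happens when missing values are left unhandled and how na.rm = TRUE and use = "complete.obs" change the result. It uses base R only; the variable names (na_counts, mean_with_na, and so on) are made up for this example.

```r
data("airquality")

# Count missing values per column; only Ozone and Solar.R are non-zero
na_counts <- colSums(is.na(airquality))
na_counts

# Without na.rm = TRUE, a single NA makes the whole result NA
mean_with_na <- mean(airquality$Ozone)
mean_with_na

# na.rm = TRUE drops the missing values before averaging
mean_without_na <- mean(airquality$Ozone, na.rm = TRUE)
mean_without_na

# use = "complete.obs" keeps only the rows where both variables are present
r_ozone_solar <- cor(airquality$Ozone, airquality$Solar.R, use = "complete.obs")
r_ozone_solar
```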

Instructions:

Complete the following tasks using R to analyze the airquality dataset. Submit your RPubs link; the published document should include code, outputs (tables and plots), and written interpretations for each task. Ensure you load the dataset using data(airquality) and install/load the tidyverse and corrplot packages.

#Load your dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
data("airquality")

Tasks and Questions

Task 1: Measures of Central Tendency and Spread

Using functions you learned this week, compute mean, median, standard deviation, min, and max separately for Ozone, Temp, and Wind.

max(airquality$Ozone, na.rm = TRUE)
## [1] 168
min(airquality$Ozone, na.rm = TRUE)
## [1] 1
mean(airquality$Ozone, na.rm = TRUE)
## [1] 42.12931
median(airquality$Ozone, na.rm = TRUE)
## [1] 31.5
sd(airquality$Ozone, na.rm = TRUE)
## [1] 32.98788
mean(airquality$Temp)
## [1] 77.88235
median(airquality$Temp)
## [1] 79
sd(airquality$Temp)
## [1] 9.46527
min(airquality$Temp)
## [1] 56
max(airquality$Temp)
## [1] 97
mean(airquality$Wind)
## [1] 9.957516
median(airquality$Wind)
## [1] 9.7
sd(airquality$Wind)
## [1] 3.523001
min(airquality$Wind)
## [1] 1.7
max(airquality$Wind)
## [1] 20.7

Question: Compare the mean and median for each variable. Are they similar or different, and what does this suggest about the distribution (e.g., skewness)? What does the standard deviation indicate about variability?

For Wind, the mean (9.96) and median (9.7) are close, which suggests a roughly symmetric distribution. The same holds for Temp (mean 77.88, median 79), although the mean sitting slightly below the median hints at a mild left skew. For Ozone, the mean (42.13) is well above the median (31.5), which indicates a right-skewed distribution. The standard deviation measures how far values typically fall from the mean. Ozone's standard deviation (32.99) is nearly as large as its mean, so I would say ozone shows much more variability, relative to its typical level, than Temp (9.47) or Wind (3.52).
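
One way to put the three variables on a common footing is the coefficient of variation (standard deviation divided by mean). The sketch below is only an illustration in base R; the helper object stats is a made-up name, and the coefficient of variation is an extra measure that the task does not ask for.

```r
data("airquality")

vars <- c("Ozone", "Temp", "Wind")

# One column per variable, one row per statistic
stats <- sapply(airquality[vars], function(x) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  c(mean = m,
    median = median(x, na.rm = TRUE),
    sd = s,
    cv = s / m)  # spread relative to the typical level
})

round(stats, 2)
```

A mean well above the median points to right skew, and a larger cv means more variability relative to the variable's own scale.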

Task 2: Histogram

Generate the histogram for Ozone.

hist(airquality$Ozone)

Question: Describe the shape of the ozone distribution (e.g., normal, skewed, unimodal). Are there any outliers or unusual features?

The distribution for the ozone plot appears to be skewed right. There are outliers visible on the right end of the distribution.
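
To back up the visual impression of outliers, the sketch below applies the common 1.5 × IQR rule to the ozone values. The rule and the names ozone, upper_fence, and high_outliers are assumptions chosen for this illustration; other outlier definitions would be equally reasonable.

```r
data("airquality")

# Keep only the observed ozone values
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Histogram with more bins than the default, to make the right tail easier to see
hist(ozone, breaks = 20,
     main = "Daily ozone, May to September 1973",
     xlab = "Ozone (ppb)")

# Upper fence: third quartile plus 1.5 times the interquartile range
q <- quantile(ozone, c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])

# Values beyond the fence are flagged as high outliers
high_outliers <- ozone[ozone > upper_fence]
high_outliers
```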

Task 3: Boxplot

Create a boxplot of ozone levels (Ozone) by month, with months displayed as names (May, June, July, August, September) instead of numbers (5–9). Recode the Month variable into a new column called month_name with month names using case_when from week 4. Generate a boxplot of Ozone by month_name.

# Your code here

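# Recode Month (5 to 9) into names with case_when, stored in a new column month_name
airquality_months <- airquality |>
  mutate(month_name = case_when(Month == 5 ~ "May",
                                Month == 6 ~ "June",
                                Month == 7 ~ "July",
                                Month == 8 ~ "August",
                                Month == 9 ~ "September"),
         # Fix the level order so months plot in calendar order, not alphabetically
         month_name = factor(month_name,
                             levels = c("May", "June", "July", "August", "September")))
# Boxplot of Ozone by month_name (one box per month)
boxplot(Ozone ~ month_name, data = airquality_months)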

Question: How do ozone levels vary across months? Which month has the highest median ozone? Are there outliers in any month, and what might they indicate?

Ozone levels are clearly higher in July and August than in May, June, and September. July has the highest median ozone (60, compared with 52 in August). The ozone values are skewed right overall, and there are high outliers, which may indicate that ozone spikes on particular days rather than staying at a steady level within each month.
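
The median comparison across months can also be checked numerically rather than read off the plot. The following is a minimal base R sketch; month_labels and monthly_median are names invented for this example, and the indexing trick assumes Month only takes the values 5 to 9.

```r
data("airquality")

# Map month numbers 5..9 onto names, keeping calendar order as the factor levels
month_labels <- c("May", "June", "July", "August", "September")
month_name <- factor(month_labels[airquality$Month - 4], levels = month_labels)

# Median ozone per month, ignoring missing values
monthly_median <- tapply(airquality$Ozone, month_name, median, na.rm = TRUE)
monthly_median

# Month with the highest median ozone
names(which.max(monthly_median))
```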

Task 4: Scatterplot

Produce the scatter plot of Temp vs. Ozone, colored by Month.

scatterplot(airquality$Temp, airquality$Ozone)

# Scatterplot of Temp vs. Ozone with one color per month.
# factor(Month) makes Month discrete so each month gets its own color;
# na.rm = TRUE silently drops the rows with missing Ozone.
ggplot(airquality, aes(x = Temp, y = Ozone, color = factor(Month))) +
  geom_point(na.rm = TRUE) +
  labs(color = "Month")

Question: Is there a visible relationship between temperature and ozone levels? Do certain months cluster together (e.g., higher ozone in warmer months)? Describe any patterns.

There seems to be a positive relationship between ozone and temperature, according to the trend shown on the graph. The residuals are clustered together below the line of best fit across the entire graph.
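
The direction of the trend can be checked with a simple linear fit, and the month clustering with per-month means. This is only a sketch in base R: fit, slope, and monthly_means are hypothetical names, and a straight line is an assumption that may not describe the relationship well at high temperatures.

```r
data("airquality")

# Linear fit of Ozone on Temp; lm() drops rows with missing Ozone by default
fit <- lm(Ozone ~ Temp, data = airquality)

# A positive slope means ozone tends to rise with temperature
slope <- coef(fit)[["Temp"]]
slope

# Mean Temp and Ozone per month, computed over days with an ozone reading
# (the formula interface drops rows where either variable is missing)
monthly_means <- aggregate(cbind(Temp, Ozone) ~ Month, data = airquality, FUN = mean)
monthly_means
```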

Task 5: Correlation Matrix

Compute and visualize the correlation matrix for Ozone, Temp, and Wind.

# Your code goes here

# Correlation matrix for Ozone, Temp, and Wind, using only rows with no missing values
cor_matrix <- cor(airquality[c("Ozone", "Temp", "Wind")], use = "complete.obs")
cor_matrix
# Visualize the matrix; addCoef.col prints each coefficient on its cell
corrplot(cor_matrix, method = "color", addCoef.col = "black")

Question: Identify the strongest and weakest correlations. For example, is ozone more strongly correlated with temperature or wind speed? Explain what the correlation values suggest about relationships between variables.

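Ozone is most strongly correlated with temperature (r ≈ 0.70): ozone tends to be higher on hotter days. Ozone is also negatively correlated with wind (r ≈ -0.60): ozone tends to be lower on windier days. The weakest of the three correlations is between wind and temperature, which is still negative: the more the wind blows, the lower the temperature tends to be, and vice versa.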
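
To rank the correlations instead of reading them off the plot, the pairs can be pulled out of the matrix and sorted by absolute value. The sketch below uses base R only; cm, idx, and pair_table are names made up for this illustration.

```r
data("airquality")

# Pearson correlations on rows where all three variables are observed
cm <- cor(airquality[c("Ozone", "Temp", "Wind")], use = "complete.obs")

# Each unordered pair appears once in the upper triangle
idx <- which(upper.tri(cm), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r = cm[upper.tri(cm)]
)

# Strongest relationship first, judged by absolute value
pair_table <- pair_table[order(-abs(pair_table$r)), ]
pair_table
```

Sorting by absolute value matters here because a strong negative correlation is still a strong relationship.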

Task 6: Summary Table

Generate the summary table grouped by Month. It should include the count, mean ozone, mean temperature, and mean wind per month.

# your code goes here
# One row per month: number of days plus mean ozone, temperature, and wind
airquality |>
  group_by(Month) |>
  summarise(
    count = n(),
    mean_ozone = mean(Ozone, na.rm = TRUE),  # Ozone has missing values
    mean_temp = mean(Temp),
    mean_wind = mean(Wind)
  )

Question: Which month has the highest average ozone level? How do temperature and wind speed vary across months? What environmental factors might explain these differences?

Looking at the data, month 8 (August) seems to have the highest average ozone level. Temperature and wind speed vary throughout the months depending on the seasonal climate. The months covered are late spring through early fall, which may explain the differences in these factors by month.
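
Because Ozone has missing values, the number of days in a month and the number of days with an ozone reading are not the same thing. The base R sketch below builds a comparable monthly table and reports both counts; the object name monthly and its column names are assumptions for this example.

```r
data("airquality")

# Mean Temp and Wind per month (neither column has missing values)
monthly <- aggregate(cbind(Temp, Wind) ~ Month, data = airquality, FUN = mean)

# Mean ozone per month, ignoring missing readings
monthly$mean_ozone <- as.vector(
  tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
)

# Days per month versus days that actually have an ozone reading
monthly$n_days <- as.vector(table(airquality$Month))
monthly$n_ozone <- as.vector(
  tapply(!is.na(airquality$Ozone), airquality$Month, sum)
)

monthly

# Month number with the highest mean ozone
monthly$Month[which.max(monthly$mean_ozone)]
```

A month with few ozone readings has a less reliable mean, which is worth keeping in mind when comparing months.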

Submission Requirements

Publish your document to RPubs and submit the link on Blackboard.