This homework has two parts. Part 1 uses lubridate and
factors on NYC flight data. Part 2 does a full EDA on the built-in
airquality dataset.
# install.packages(c("nycflights13", "lubridate", "zoo", "forcats")) # if needed
library(nycflights13)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(forcats)
data(flights)
Quick look:
str(flights)
## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
## $ flight : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num [1:336776] 1400 1416 1089 1576 762 ...
## $ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
head(flights)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
# Q1. Create a column dep_datetime by combining year, month, day, and dep_time into a
# POSIXct datetime using lubridate's make_datetime().
# Hint: hour = dep_time %/% 100, minute = dep_time %% 100
# Show the first 5 rows with year, month, day, dep_time, and dep_datetime.
flights <- flights |>
  mutate(dep_datetime = make_datetime(year, month, day, dep_time %/% 100, dep_time %% 100))
flights |> select(year, month, day, dep_time, dep_datetime) |> head(5)
## # A tibble: 5 × 5
## year month day dep_time dep_datetime
## <int> <int> <int> <int> <dttm>
## 1 2013 1 1 517 2013-01-01 05:17:00
## 2 2013 1 1 533 2013-01-01 05:33:00
## 3 2013 1 1 542 2013-01-01 05:42:00
## 4 2013 1 1 544 2013-01-01 05:44:00
## 5 2013 1 1 554 2013-01-01 05:54:00
# Q2. After creating dep_datetime, use lubridate's month() to filter flights that
# departed in JUNE 2013. How many flights are there?
# (Hint: filter(month(dep_datetime) == 6))
# Rows with a missing dep_time have an NA dep_datetime, so filter() drops them.
june_flights <- flights |> filter(month(dep_datetime) == 6)
nrow(june_flights)
## [1] 27234
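The HHMM arithmetic behind Q1 can be checked on a few values before it is applied to the whole table. The sketch below is only an illustration: `hhmm` is a made-up vector, not part of the assignment, and it assumes lubridate is installed.

```r
library(lubridate)

# Hypothetical HHMM clock readings (not taken from the flights data)
hhmm <- c(517L, 1004L, 2359L)

hours   <- hhmm %/% 100   # integer division keeps the hour part
minutes <- hhmm %% 100    # remainder keeps the minute part

# make_datetime() uses the UTC time zone unless told otherwise
dt <- make_datetime(2013, 1, 1, hours, minutes)
format(dt, "%H:%M")
```

Checking the split on a handful of readings first makes it easier to spot a swapped `%/%` and `%%` before the mistake spreads through 336,776 rows.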
# Q3. The carrier column is a character. Convert it to a factor and check its levels.
flights$carrier <- factor(flights$carrier)
levels(flights$carrier)
## [1] "9E" "AA" "AS" "B6" "DL" "EV" "F9" "FL" "HA" "MQ" "OO" "UA" "US" "VX" "WN"
## [16] "YV"
# Q4. Use fct_collapse() to keep "UA", "AA", and "DL" as their own levels and lump
# everything else into "Other". Then count flights per recoded carrier level.
# (Hint: see the fct_collapse demo from the Wrangling Activity Part H)
flights <- flights |>
  mutate(carrier_group = fct_collapse(
    carrier,
    UA = "UA", AA = "AA", DL = "DL",
    Other = c("9E", "AS", "B6", "EV", "F9", "FL", "HA", "MQ", "OO", "US", "VX", "WN", "YV")
  ))
count(flights, carrier_group)
## # A tibble: 4 × 2
## carrier_group n
## <fct> <int>
## 1 Other 197272
## 2 AA 32729
## 3 DL 48110
## 4 UA 58665
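When only a few levels are kept, forcats also offers `fct_other()`, which avoids listing every level that goes into "Other". The sketch below uses a small made-up factor rather than the flights data, so the counts are not comparable to the table above. Note that `fct_other()` places the "Other" level last, whereas the `fct_collapse()` result above lists it first.

```r
library(forcats)

# Toy carrier codes, invented for illustration
carrier <- factor(c("UA", "AA", "9E", "DL", "B6", "UA", "EV"))

# Keep three levels; every other level is relabeled "Other"
carrier_group <- fct_other(carrier, keep = c("UA", "AA", "DL"))

levels(carrier_group)
table(carrier_group)
```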
# Q5. Missing data: how many flights have NA for dep_delay? Filter them out and report
# the remaining row count.
sum(is.na(flights$dep_delay))
## [1] 8255
flights_no_na <- flights |> filter(!is.na(dep_delay))
nrow(flights_no_na)
## [1] 328521
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ ggplot2 4.0.3 ✔ stringr 1.6.0
## ✔ purrr 1.2.2 ✔ tibble 3.3.1
## ✔ readr 2.2.0 ✔ tidyr 1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data("airquality")
# Q6. For Ozone, Temp, and Wind: compute mean, median, sd, min, max
# (use na.rm = TRUE where needed).
summary(airquality$Ozone)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 18.00 31.50 42.13 63.25 168.00 37
mean(airquality$Ozone, na.rm = TRUE)
## [1] 42.12931
median(airquality$Ozone, na.rm = TRUE)
## [1] 31.5
sd(airquality$Ozone, na.rm = TRUE)
## [1] 32.98788
summary(airquality$Temp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 56.00 72.00 79.00 77.88 85.00 97.00
mean(airquality$Temp)
## [1] 77.88235
median(airquality$Temp)
## [1] 79
sd(airquality$Temp)
## [1] 9.46527
summary(airquality$Wind)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.700 7.400 9.700 9.958 11.500 20.700
mean(airquality$Wind)
## [1] 9.957516
median(airquality$Wind)
## [1] 9.7
sd(airquality$Wind)
## [1] 3.523001
Q7. Compare the mean and median for each variable. What does the relationship between mean and median suggest about distribution skewness? What does the standard deviation tell you about variability?
For Ozone, the mean (42.13) is well above the median (31.5), which indicates a right-skewed distribution: a few high readings pull the mean upward. For Temp, the mean (77.88) is slightly below the median (79), so the distribution is roughly symmetric with, at most, a mild left skew. For Wind, the mean (9.96) is close to the median (9.7), which suggests little to no skew. The standard deviation measures how far values typically fall from the mean. Ozone has the largest standard deviation (32.99), Temp is next (9.47), and Wind has the smallest (3.52). Because the three variables are measured in different units, these values are not directly comparable; relative to its mean, Ozone is still by far the most variable, while Temp is the least.
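One way to put numbers on this comparison is to rescale each summary so that variables in different units can be set side by side. The sketch below computes two quantities the assignment does not ask for: the coefficient of variation (sd divided by mean) and the mean-median gap in sd units, whose sign hints at the direction of skew. The helper name `describe` is made up for this example.

```r
# Use the untouched built-in copy of the data
aq <- datasets::airquality[, c("Ozone", "Temp", "Wind")]

# Hypothetical helper: relative spread and a rough skew indicator
describe <- function(x) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  c(cv = s / m, skew_hint = (m - median(x, na.rm = TRUE)) / s)
}

result <- sapply(aq, describe)
round(result, 2)
```

A positive `skew_hint` means the mean sits above the median (right skew); a negative value means the opposite. It is a rough indicator, not a formal skewness statistic.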
# Q8. Make a histogram of Ozone.
hist(airquality$Ozone, main = "Histogram of Ozone", xlab = "Ozone")
Q9. Describe the shape of the Ozone distribution. Any outliers or unusual features?
The Ozone distribution is right-skewed: most observations fall at the low end (the median is 31.5), and a long tail extends toward high values. The maximum of 168 lies well above the third quartile (63.25) and beyond the usual 1.5 × IQR fence (63.25 + 1.5 × 45.25 ≈ 131.1), so the largest readings are potential outliers. Ozone also has 37 missing values, which the histogram omits.
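The usual boxplot rule flags values more than 1.5 × IQR above the third quartile. A minimal sketch of that check on Ozone follows; the variable names are chosen for this example only.

```r
ozone <- datasets::airquality$Ozone

# Quartiles as reported by summary() (quantile type 7, the default)
q <- quantile(ozone, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
upper_fence <- q[2] + 1.5 * (q[2] - q[1])

# Non-missing readings above the fence are candidate outliers
flagged <- ozone[!is.na(ozone) & ozone > upper_fence]
upper_fence
flagged
```

Flagged values are candidates for a closer look, not automatically errors; in a right-skewed variable some high readings are expected.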
# Q10. Create a new column month_name with full month names (May–September) using case_when.
# Then make a boxplot of Ozone by month_name.
# (Hint: case_when was covered in the Wrangling Activity.)
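airquality <- airquality |>
  mutate(month_name = case_when(
    Month == 5 ~ "May", Month == 6 ~ "June", Month == 7 ~ "July",
    Month == 8 ~ "August", Month == 9 ~ "September"
  )) |>
  # Store as a factor with calendar-ordered levels; otherwise boxplot() sorts the months alphabetically.
  mutate(month_name = factor(month_name, levels = c("May", "June", "July", "August", "September")))
boxplot(Ozone ~ month_name, data = airquality, main = "Ozone by Month", xlab = "Month", ylab = "Ozone")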
Q11. Which month has the highest median Ozone? Are there outliers in any month?
July has the highest median Ozone: its median line sits above those of all other months in the boxplot. Outliers (points beyond the whiskers) appear in May, June, and September.
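The boxplot reading can be cross-checked numerically. This sketch computes the median Ozone for each month with base R and lists the points that `boxplot()` would draw as outliers; it works from the original `Month` codes (5 to 9) rather than the month names.

```r
aq <- datasets::airquality

# Median Ozone per month, ignoring missing readings
med <- tapply(aq$Ozone, aq$Month, median, na.rm = TRUE)
med
names(which.max(med))   # month code with the highest median

# Values beyond the whiskers in each month (same rule boxplot() uses)
outs <- lapply(split(aq$Ozone, aq$Month), function(x) boxplot.stats(x)$out)
outs
```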
# Q12. Scatterplot of Temp vs Ozone, colored by Month.
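# Map months 5-9 to palette colors 1-5 and add a legend so the colors can be read.
plot(airquality$Temp, airquality$Ozone, col = airquality$Month - 4, xlab = "Temperature", ylab = "Ozone", main = "Temperature vs. Ozone")
legend("topleft", legend = month.name[5:9], col = 1:5, pch = 1)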
Q13. Is there a visible relationship between temperature and ozone?
Yes. The scatterplot shows a positive relationship: ozone tends to rise as temperature rises. The points do not fall on a tight line, though, and a few unusually high ozone readings stand apart from the main cloud of points.
# Q14. Compute the correlation matrix for Ozone, Temp, and Wind.
# (Hint: cor(airquality[, c("Ozone","Temp","Wind")], use = "complete.obs"))
cor(airquality[, c("Ozone", "Temp", "Wind")], use = "complete.obs")
## Ozone Temp Wind
## Ozone 1.0000000 0.6983603 -0.6015465
## Temp 0.6983603 1.0000000 -0.5110750
## Wind -0.6015465 -0.5110750 1.0000000
Q15. Which pair has the strongest correlation? What does that suggest?
The strongest correlation is between Ozone and Temp (r ≈ 0.70): hotter days tend to have higher ozone levels. Ozone and Wind have the next strongest relationship (r ≈ -0.60), in the opposite direction. These are associations only; a correlation by itself does not show that higher temperatures cause higher ozone.
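The strongest pair can also be picked out programmatically instead of by eye. The sketch below blanks the diagonal of the correlation matrix and looks up the largest absolute value; the object names are illustrative.

```r
aq <- datasets::airquality[, c("Ozone", "Temp", "Wind")]
r <- cor(aq, use = "complete.obs")

# Ignore the 1s on the diagonal, then find the largest |r|
off_diag <- abs(r)
diag(off_diag) <- 0
idx <- which(off_diag == max(off_diag), arr.ind = TRUE)[1, ]

strongest <- c(rownames(r)[idx["row"]], colnames(r)[idx["col"]])
strongest
r[idx["row"], idx["col"]]
```

Comparing absolute values matters here: a strong negative correlation such as Ozone and Wind would otherwise never be considered.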
# Q16. Generate a summary table grouped by Month with: count, mean Ozone, mean Temp,
# mean Wind for each month.
airquality |>
  group_by(Month) |>
  summarize(
    count = n(),
    mean_ozone = mean(Ozone, na.rm = TRUE),
    mean_temp = mean(Temp, na.rm = TRUE),
    mean_wind = mean(Wind, na.rm = TRUE)
  )
## # A tibble: 5 × 5
## Month count mean_ozone mean_temp mean_wind
## <int> <int> <dbl> <dbl> <dbl>
## 1 5 31 23.6 65.5 11.6
## 2 6 30 29.4 79.1 10.3
## 3 7 31 59.1 83.9 8.94
## 4 8 31 60.0 84.0 8.79
## 5 9 30 31.4 76.9 10.2
Q17. Which month has the highest average ozone? How do temperature and wind vary across months?
August has the highest average ozone (about 60.0), just ahead of July (59.1). Mean temperature rises from May (65.5) to August (84.0) and then falls in September (76.9). Mean wind speed is highest in May (11.6) and lowest in July and August (8.94 and 8.79). This suggests that ozone peaks in the warmest months, when wind speeds are lowest.