This homework has two parts. Part 1 uses lubridate and
factors on NYC flight data. Part 2 does a full EDA on the built-in
airquality dataset.
# install.packages(c("nycflights13", "lubridate", "zoo", "forcats")) # if needed
library(nycflights13)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(forcats)
data(flights)
Quick look:
str(flights)
## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
## $ flight : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num [1:336776] 1400 1416 1089 1576 762 ...
## $ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
head(flights)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
dim(flights)
## [1] 336776 19
summary(flights)
## year month day dep_time sched_dep_time
## Min. :2013 Min. : 1.000 Min. : 1.00 Min. : 1 Min. : 106
## 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 907 1st Qu.: 906
## Median :2013 Median : 7.000 Median :16.00 Median :1401 Median :1359
## Mean :2013 Mean : 6.549 Mean :15.71 Mean :1349 Mean :1344
## 3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:1744 3rd Qu.:1729
## Max. :2013 Max. :12.000 Max. :31.00 Max. :2400 Max. :2359
## NAs :8255
## dep_delay arr_time sched_arr_time arr_delay
## Min. : -43.00 Min. : 1 Min. : 1 Min. : -86.000
## 1st Qu.: -5.00 1st Qu.:1104 1st Qu.:1124 1st Qu.: -17.000
## Median : -2.00 Median :1535 Median :1556 Median : -5.000
## Mean : 12.64 Mean :1502 Mean :1536 Mean : 6.895
## 3rd Qu.: 11.00 3rd Qu.:1940 3rd Qu.:1945 3rd Qu.: 14.000
## Max. :1301.00 Max. :2400 Max. :2359 Max. :1272.000
## NAs :8255 NAs :8713 NAs :9430
## carrier flight tailnum origin
## Length :336776 Min. : 1 Length :336776 Length :336776
## N.unique : 16 1st Qu.: 553 N.unique : 4043 N.unique : 3
## N.blank : 0 Median :1496 N.blank : 0 N.blank : 0
## Min.nchar: 2 Mean :1972 Min.nchar: 5 Min.nchar: 3
## Max.nchar: 2 3rd Qu.:3465 Max.nchar: 6 Max.nchar: 3
## Max. :8500 NAs : 2512
##
## dest air_time distance hour
## Length :336776 Min. : 20.0 Min. : 17 Min. : 1.00
## N.unique : 105 1st Qu.: 82.0 1st Qu.: 502 1st Qu.: 9.00
## N.blank : 0 Median :129.0 Median : 872 Median :13.00
## Min.nchar: 3 Mean :150.7 Mean :1040 Mean :13.18
## Max.nchar: 3 3rd Qu.:192.0 3rd Qu.:1389 3rd Qu.:17.00
## Max. :695.0 Max. :4983 Max. :23.00
## NAs :9430
## minute time_hour
## Min. : 0.00 Min. :2013-01-01 05:00:00
## 1st Qu.: 8.00 1st Qu.:2013-04-04 13:00:00
## Median :29.00 Median :2013-07-03 10:00:00
## Mean :26.23 Mean :2013-07-03 05:22:54
## 3rd Qu.:44.00 3rd Qu.:2013-10-01 07:00:00
## Max. :59.00 Max. :2013-12-31 23:00:00
##
# Q1. Create a column dep_datetime by combining year, month, day, and dep_time into a
# POSIXct datetime using lubridate's make_datetime().
# Hint: hour = dep_time %/% (integer division) 100, minute = dep_time %% (modulo/remainder) 100
# Show the first 5 rows with year, month, day, dep_time, and dep_datetime.
# read in original data, then creating a new dataset with an additional column using the piping and mutate function
flights_dep_datetime <- flights |>
mutate(
dep_datetime = make_datetime(
year,
month,
day,
hour = dep_time %/% 100,
min = dep_time %% 100
)
)
# Show the first 5 rows with year, month, day, dep_time, and dep_datetime.
flights_dep_datetime |>
select(year, month, day, dep_time, dep_datetime) |>
head(5)
## # A tibble: 5 × 5
## year month day dep_time dep_datetime
## <int> <int> <int> <int> <dttm>
## 1 2013 1 1 517 2013-01-01 05:17:00
## 2 2013 1 1 533 2013-01-01 05:33:00
## 3 2013 1 1 542 2013-01-01 05:42:00
## 4 2013 1 1 544 2013-01-01 05:44:00
## 5 2013 1 1 554 2013-01-01 05:54:00
# Q2. After creating dep_datetime, use lubridate's month() to filter flights that
# departed in JUNE 2013. How many flights are there?
# (Hint: filter(month(dep_datetime) == 6))
june_2013_flights_dep_datetime <- flights_dep_datetime |>
filter(month(dep_datetime) == 6)
nrow(june_2013_flights_dep_datetime)
## [1] 27234
# There are 27234 flights from June 2013
# Q3. The carrier column is a character. Convert it to a factor and check its levels.
flights_carrier_as_factor <-flights |>
mutate(
carrier_character = carrier, # want to preserve carrier as character column just in case
carrier = factor(carrier)
)
levels(flights_carrier_as_factor$carrier)
## [1] "9E" "AA" "AS" "B6" "DL" "EV" "F9" "FL" "HA" "MQ" "OO" "UA" "US" "VX" "WN"
## [16] "YV"
# Q4. Use fct_collapse() to keep "UA", "AA", and "DL" as their own levels and lump
# everything else into "Other". Then count flights per recorded carrier level.
# (Hint: see the fct_collapse demo from the Wrangling Activity Part H)
flights_three_levels_and_others <- flights |>
mutate(
carrier_grouped = fct_collapse(
carrier,
UA = "UA",
AA = "AA",
DL = "DL",
other_level = "Others"
)
)
flights_three_levels_and_others |>
count(carrier_grouped)
## # A tibble: 4 × 2
## carrier_grouped n
## <fct> <int>
## 1 AA 32729
## 2 DL 48110
## 3 UA 58665
## 4 Others 197272
# There are 32,729 AA flights, 48110 Delta flights, 58665 United Airlines flights and the rest, 197272 flights were other.
# Q5. Missing data: how many flights have NA for dep_delay? Filter them out and report
# the remaining row count.
sum(is.na(flights$dep_delay)) # tells me how many NA's in dep_delay column
## [1] 8255
flights_no_NA <- flights |> # create and name new data with flights without NA
filter(!is.na(dep_delay)) # keeps the rows without NA
nrow(flights_no_NA) # counting flights without NA
## [1] 328521
# There are 8255 flights with NA for dep_delay.
# There are 328521 flights without NA for dep_delay.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ ggplot2 4.0.3 ✔ stringr 1.6.0
## ✔ purrr 1.2.2 ✔ tibble 3.3.1
## ✔ readr 2.2.0 ✔ tidyr 1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data("airquality")
# Q6. For Ozone, Temp, and Wind: compute mean, median, sd, min, max
# (use na.rm = TRUE where needed).
question_6_summary <- airquality |>
summarise(
Ozone_mean = mean(Ozone, na.rm = TRUE),
Ozone_median = median(Ozone, na.rm = TRUE),
Ozone_sd = sd(Ozone, na.rm = TRUE),
Ozone_min = min(Ozone, na.rm = TRUE),
Ozone_max = max(Ozone, na.rm = TRUE),
Temp_mean = mean(Temp, na.rm = TRUE),
Temp_median = median(Temp, na.rm = TRUE),
Temp_sd = sd(Temp, na.rm = TRUE),
Temp_min = min(Temp, na.rm = TRUE),
Temp_max = max(Temp, na.rm = TRUE),
Wind_mean = mean(Wind, na.rm = TRUE),
Wind_median = median(Wind, na.rm = TRUE),
Wind_sd = sd(Wind, na.rm = TRUE),
Wind_min = min(Wind, na.rm = TRUE),
Wind_max = max(Wind, na.rm = TRUE),
)
question_6_summary # row version
## Ozone_mean Ozone_median Ozone_sd Ozone_min Ozone_max Temp_mean Temp_median
## 1 42.12931 31.5 32.98788 1 168 77.88235 79
## Temp_sd Temp_min Temp_max Wind_mean Wind_median Wind_sd Wind_min Wind_max
## 1 9.46527 56 97 9.957516 9.7 3.523001 1.7 20.7
t(question_6_summary) #column look
## [,1]
## Ozone_mean 42.129310
## Ozone_median 31.500000
## Ozone_sd 32.987885
## Ozone_min 1.000000
## Ozone_max 168.000000
## Temp_mean 77.882353
## Temp_median 79.000000
## Temp_sd 9.465270
## Temp_min 56.000000
## Temp_max 97.000000
## Wind_mean 9.957516
## Wind_median 9.700000
## Wind_sd 3.523001
## Wind_min 1.700000
## Wind_max 20.700000
Q7. Compare the mean and median for each variable. What does the relationship between mean and median suggest about distribution skewness? What does the standard deviation tell you about variability?
For Ozone, the mean was 42.1 and the median was 31.5. The mean is to the right of the median so the distribution is skewed to the right (tails on the right). The standard deviation was 33 units, which is large compared to the mean, so the variability is high. Min was 1 and max was 168.
For Temperature, the mean was 77.9 and the median was 79.0. The median and mean are close so the distribution is approximately symmetric. The standard deviation was 9.5, which is not that big compared to the mean, so temperatures have moderate fluctuation. Min was 56 and max was 97.
For Wind, the mean was 9.96 and the median was 9.70. They are close to each other so the distribition should be roughly symmmetric. The standard deviation was 3.5, so there was not much variability in wind. Min was 1.7 and the max was 20.7
# Q8. Make a histogram of Ozone.
ggplot(airquality, aes(x = Ozone)) +
geom_histogram(binwidth = 15, fill = "dodgerblue" , color = "white" , na.rm = TRUE) +
labs(title = "Ozone Distribution", x = "Ozone (parts/billion" , y = "count")
Q9. Describe the shape of the Ozone distribution. Any outliers or unusual features?
The Ozone distribution is right skewed which is consistent with the mean of 42.1 being higher than the median of 31.5. From the histogram there seem to be outliers greater than 150 parts/billion. Using R, we found using the Inter Quartile Range and 1.5 X IQR to find outliers of 135 and 168. These may contribute to the high standard deviation.
Q1 <- quantile(airquality$Ozone, 0.25, na.rm = TRUE)
Q3 <- quantile(airquality$Ozone, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
Q1 - 1.5 * IQR # lower bound
## 25%
## -49.875
Q3 + 1.5 * IQR # upper bound
## 75%
## 131.125
airquality |>
filter(Ozone > 131.125) # which values are larger than 131 and so outliers?
## Ozone Solar.R Wind Temp Month Day
## 1 135 269 4.1 84 7 1
## 2 168 238 3.4 81 8 25
# Q10. Create a new column month_name with full month names (May–September) using case_when.
# Then make a boxplot of Ozone by month_name.
# (Hint: case_when was covered in the Wrangling Activity.)
# first the case_when and creating a new data set
airquality_months_may_to_september <- airquality |>
mutate(
month_name = case_when(
Month == 5 ~ "May",
Month == 6 ~ "June",
Month == 7 ~ "July",
Month == 8 ~ "August",
Month == 9 ~"September"
)
)
# want box plots to be plotted graphed in chronological order, not alphabetical.
airquality_months_may_to_september_chrono <- airquality_months_may_to_september |>
mutate(
month_name = fct_relevel(month_name, "May", "June", "July", "August", "September")
)
# now the box plots
ggplot (airquality_months_may_to_september_chrono, aes(x = month_name, y = Ozone)) +
geom_boxplot(fill = "dodgerblue", na.rm = TRUE) +
labs(title = "Ozone by Month", x = "Month", y = "Ozone (ppb)")
Q11. Which month has the highest median Ozone? Are there outliers in any month?
Using R, it appears that July has highest median Ozone of 60.
May has one outlier. June has one outlier. July has none. August has one outlier. September has four.
airquality_months_may_to_september |>
group_by(month_name) |>
summarise(median_ozone = median(Ozone, na.rm = TRUE)) |>
arrange(desc(median_ozone))
## # A tibble: 5 × 2
## month_name median_ozone
## <chr> <dbl>
## 1 July 60
## 2 August 52
## 3 June 23
## 4 September 23
## 5 May 18
ggplot(airquality_months_may_to_september_chrono, aes(x = Temp, y = Ozone, color = month_name)) +
geom_point(na.rm = TRUE) +
labs(title = "Temperature vs Ozone", x = "Temperature (F)", y = "Ozone (ppb)", color = "Month")
Q13. Is there a visible relationship between temperature and ozone?
It appears that higher Ozone ppb are associated with Temperatures.
# Q14. Compute the correlation matrix for Ozone, Temp, and Wind.
# (Hint: cor(airquality[, c("Ozone","Temp","Wind")], use = "complete.obs"))
cor(airquality[, c("Ozone","Temp","Wind")], use = "complete.obs")
## Ozone Temp Wind
## Ozone 1.0000000 0.6983603 -0.6015465
## Temp 0.6983603 1.0000000 -0.5110750
## Wind -0.6015465 -0.5110750 1.0000000
Q15. Which pair has the strongest correlation? What does that suggest?
It appears that Temperature and Ozone (ppb) have the strongest correlation of 0.698 . It happens to be a positive correlation. The higher temperatures are strongly associated with higher Ozone levels.
There does seem to be a moderately strong correlation between Wind and Ozone of -0.601. It happens to be negative correlation. The windier days are associated with the lower ozone levels.
# Q16. Generate a summary table grouped by Month with: count (for number of rows or days in month with no NAs), mean Ozone, mean Temp,
# mean Wind for each month.
airquality_months_may_to_september_chrono |>
group_by(month_name) |>
summarise(
count = n(),
Ozone_mean = mean(Ozone, na.rm = TRUE),
Temp_mean = mean(Temp, na.rm = TRUE),
Wind_mean = mean(Wind, na.rm = TRUE)
)
## # A tibble: 5 × 5
## month_name count Ozone_mean Temp_mean Wind_mean
## <fct> <int> <dbl> <dbl> <dbl>
## 1 May 31 23.6 65.5 11.6
## 2 June 30 29.4 79.1 10.3
## 3 July 31 59.1 83.9 8.94
## 4 August 31 60.0 84.0 8.79
## 5 September 30 31.4 76.9 10.2
Q17. Which month has the highest average ozone? How do temperature and wind vary across months?
It appears that August has the highest mean Ozone of 59.962 ppb, beating out July by only about .847 parts per billion.
Mean temperature was lowest in May (65.5 degrees F) and it’s highest in July and August, each around 84 degrees F. In September, mean temperature fell to about 77 degrees, two degrees lower than June.
Mean wind speed was at its highest in May at about 11.6 (miles per hour?) then dropping through June and with July and August having close mean wind speeds around 8.8 / 8.9 miles per hour. In September, the mean wind speed rose again to June levels at 10.2 miles per hour.
Months with the highest ozone also had the highest temperatures and lower wind speeds. This seems consistent with the positive correlation we found between ozone and temperature and the negative correlation between ozone and wind.