Homework 3 — Dates, Factors + EDA

This homework has two parts. Part 1 uses lubridate and factors on NYC flight data. Part 2 does a full EDA on the built-in airquality dataset.

Part 1 — NYC Flights: Dates and Factors

#install.packages(c("nycflights13", "lubridate", "zoo", "forcats"))  # if needed
library(nycflights13)
library(dplyr)

## Warning: package 'dplyr' was built under R version 4.5.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lubridate)

## Warning: package 'lubridate' was built under R version 4.5.2

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(zoo)

## Warning: package 'zoo' was built under R version 4.5.2

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(forcats)
data(flights)

Quick look:

str(flights)

## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
##  $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
##  $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
##  $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

head(flights)

## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

# Q1. Create a column dep_datetime by combining year, month, day, and dep_time into a
#     POSIXct datetime using lubridate's make_datetime().
#     Hint: hour = dep_time %/% 100, minute = dep_time %% 100
#     Show the first 5 rows with year, month, day, dep_time, and dep_datetime.

flights1 <- flights |>
  mutate(
    # dep_time is stored as HHMM (e.g. 517 means 05:17), so split it first
    hours = dep_time %/% 100,
    minutes = dep_time %% 100,
    dep_datetime = make_datetime(year, month, day, hour = hours, min = minutes)
  )

flights1 |>
  select(year, month, day, dep_time, dep_datetime) |>
  head(5)

## # A tibble: 5 × 5
##    year month   day dep_time dep_datetime       
##   <int> <int> <int>    <int> <dttm>             
## 1  2013     1     1      517 2013-01-01 05:17:00
## 2  2013     1     1      533 2013-01-01 05:33:00
## 3  2013     1     1      542 2013-01-01 05:42:00
## 4  2013     1     1      544 2013-01-01 05:44:00
## 5  2013     1     1      554 2013-01-01 05:54:00
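
The hint's arithmetic (hour = dep_time %/% 100, minute = dep_time %% 100) can be checked on a few hand-made values. This is a minimal sketch, assuming base R only: the vector hhmm is made up for illustration, and ISOdatetime() stands in for lubridate's make_datetime() so the snippet has no package dependencies.

```r
# Hypothetical HHMM-encoded clock times, in the same format as dep_time
hhmm <- c(517, 533, 1004, 2359)

hour   <- hhmm %/% 100   # integer division keeps the leading digits
minute <- hhmm %% 100    # the remainder keeps the last two digits

# Base R stand-in for make_datetime(), fixed to 1 January 2013 in UTC
dt <- ISOdatetime(2013, 1, 1, hour, minute, 0, tz = "UTC")
format(dt, "%Y-%m-%d %H:%M")
```

Passing the raw HHMM value as the hour would instead be read as hundreds of hours and roll the date forward by days.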

# Q2. After creating dep_datetime, use lubridate's month() to filter flights that
#     departed in JUNE 2013. How many flights are there?
#     (Hint: filter(month(dep_datetime) == 6))

june_flights <- flights1 |> 
  filter(month(dep_datetime) == 6)

# How many flights?
nrow(june_flights)

## [1] 27688

# Q3. The carrier column is a character. Convert it to a factor and check its levels.

# Q3. Convert carrier to factor and check levels
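# Start from flights1 (not flights) so that dep_datetime from Q1 is kept
flights1 <- flights1 |> 
  mutate(carrier = as.factor(carrier))

# Check the levels on the converted data frame, not on the original flights
levels(flights1$carrier)

##  [1] "9E" "AA" "AS" "B6" "DL" "EV" "F9" "FL" "HA" "MQ" "OO" "UA" "US" "VX" "WN"
## [16] "YV"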

# Q4. Use fct_collapse() to keep "UA", "AA", and "DL" as their own levels and lump
#     everything else into "Other". Then count flights per recoded carrier level.
#     (Hint: see the fct_collapse demo from the Wrangling Activity Part H)

# Q4. Collapse carriers: keep UA, AA, DL; others → "Other"
flights1 <- flights1 |> 
  mutate(
    carrier_recoded = fct_collapse(
      carrier,
      UA = "UA",
      AA = "AA",
      DL = "DL",
      Other = c("9E", "AS", "B6", "EV", "F9", "FL", "HA", "MQ", "OO", "US", "VX", "WN", "YV")
    )
  )

# Count flights per recoded carrier
flights1 |> 
  count(carrier_recoded, sort = TRUE)

## # A tibble: 4 × 2
##   carrier_recoded      n
##   <fct>            <int>
## 1 Other           197272
## 2 UA               58665
## 3 DL               48110
## 4 AA               32729
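
Listing all thirteen remaining carriers by hand works, but it is easy to miss one. As an alternative sketch (not the approach the assignment asks for), forcats also provides fct_other(), which keeps the named levels and lumps the rest. The toy vector below is hypothetical and is not the flights data.

```r
library(forcats)

# Hypothetical toy vector of carrier codes
carrier <- factor(c("UA", "AA", "DL", "B6", "EV", "UA", "WN", "DL"))

# Keep three levels; every other level becomes "Other"
recoded <- fct_other(carrier, keep = c("UA", "AA", "DL"), other_level = "Other")

table(recoded)
```

The kept levels retain their original order and "Other" is placed last.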

# Q5. Missing data: how many flights have NA for dep_delay? Filter them out and report
#     the remaining row count.

# Q5. How many flights have NA in dep_delay?
num_missing <- sum(is.na(flights1$dep_delay))
num_missing

## [1] 8255

# Filter out NAs and report remaining row count
flights_clean <- flights1 |> 
  filter(!is.na(dep_delay))

nrow(flights_clean)

## [1] 328521
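
A quick sanity check for this kind of filtering is that the missing and remaining counts add up to the original row count (here 8255 + 328521 = 336776). A minimal sketch of the same bookkeeping on a made-up vector:

```r
# Hypothetical delays with two missing values
dep_delay <- c(2, NA, -1, 15, NA, 0)

n_missing   <- sum(is.na(dep_delay))    # each TRUE counts as 1
n_remaining <- sum(!is.na(dep_delay))

# The two counts should account for every element
c(missing = n_missing, remaining = n_remaining, total = length(dep_delay))
```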

Part 2 — Airquality EDA

library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.5.2

## Warning: package 'tibble' was built under R version 4.5.2

## Warning: package 'tidyr' was built under R version 4.5.2

## Warning: package 'readr' was built under R version 4.5.2

## Warning: package 'purrr' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ ggplot2 4.0.2     ✔ stringr 1.6.0
## ✔ purrr   1.2.1     ✔ tibble  3.3.1
## ✔ readr   2.1.6     ✔ tidyr   1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

data("airquality")

# Q6. For Ozone, Temp, and Wind: compute mean, median, sd, min, max
#     (use na.rm = TRUE where needed).

# Q6. Compute mean, median, sd, min, max for Ozone, Temp, and Wind
airquality |> 
  select(Ozone, Temp, Wind) |> 
  summarise(
    across(everything(),
           list(
             mean = ~mean(.x, na.rm = TRUE),
             median = ~median(.x, na.rm = TRUE),
             sd = ~sd(.x, na.rm = TRUE),
             min = ~min(.x, na.rm = TRUE),
             max = ~max(.x, na.rm = TRUE)
           ))
  )

##   Ozone_mean Ozone_median Ozone_sd Ozone_min Ozone_max Temp_mean Temp_median
## 1   42.12931         31.5 32.98788         1       168  77.88235          79
##   Temp_sd Temp_min Temp_max Wind_mean Wind_median  Wind_sd Wind_min Wind_max
## 1 9.46527       56       97  9.957516         9.7 3.523001      1.7     20.7

Q7. Compare the mean and median for each variable. What does the relationship between mean and median suggest about distribution skewness? What does the standard deviation tell you about variability?

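Answer: For Ozone, the mean (42.1) is well above the median (31.5), which suggests a right-skewed distribution. For Temp (77.9 vs. 79) and Wind (9.96 vs. 9.7) the mean and median are close, so those distributions are roughly symmetric. The standard deviation measures the typical spread of values around the mean: Ozone (sd ≈ 33.0) is far more variable than Temp (sd ≈ 9.5) or Wind (sd ≈ 3.5).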
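
The comparison can be made a little more systematic. The sketch below computes the mean-minus-median gap and the coefficient of variation (sd divided by mean) for each variable. The coefficient of variation is an addition for illustration, not something the assignment asks for; it helps here because the three variables are measured in different units.

```r
# airquality ships with base R (datasets package)
vars <- airquality[, c("Ozone", "Temp", "Wind")]

stats <- sapply(vars, function(x) {
  m  <- mean(x, na.rm = TRUE)
  md <- median(x, na.rm = TRUE)
  s  <- sd(x, na.rm = TRUE)
  c(mean_minus_median = m - md,  # a clearly positive gap hints at right skew
    cv = s / m)                  # spread relative to the mean (unitless)
})

round(stats, 2)
```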

# Q8. Make a histogram of Ozone.

ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 10, fill = "brown", color = "black") +
  labs(title = "Distribution of Ozone Levels", x = "Ozone", y = "Count")

## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).

# The histogram has a long tail to the right, which means the mean is greater
# than the median and the mode: the distribution is right-skewed.

Q9. Describe the shape of the Ozone distribution. Any outliers or unusual features?

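The Ozone distribution is right-skewed: most readings sit at the low end, and a long tail extends up to the maximum of 168. The few very high readings in that tail are potential outliers. Note also that 37 of the rows have a missing Ozone value and are dropped from the plot.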
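
One common way to make "outlier" concrete is the boxplot rule, which flags points more than 1.5 times the interquartile range beyond the hinges. A minimal sketch using base R's boxplot.stats(); the choice of this rule is an assumption, since the question does not specify one.

```r
ozone <- airquality$Ozone

# boxplot.stats() ignores NAs and returns the flagged points in $out
out <- boxplot.stats(ozone)$out

sort(out)
```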

# Q10. Create a new column month_name with full month names (May–September) using case_when.
#      Then make a boxplot of Ozone by month_name.
#      (Hint: case_when was covered in the Wrangling Activity.)

# Q10. Create month_name and boxplot of Ozone by month
airquality <- airquality |> 
  mutate(
    month_name = case_when(
      Month == 5 ~ "May",
      Month == 6 ~ "June",
      Month == 7 ~ "July",
      Month == 8 ~ "August",
      Month == 9 ~ "September"
    )
  )

# Boxplot
ggplot(airquality, aes(x = month_name, y = Ozone)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Ozone Levels by Month", x = "Month", y = "Ozone (ppb)")

## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
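
Because month_name is a character column, ggplot2 orders the boxes alphabetically (August, July, June, May, September) rather than chronologically. One possible fix, sketched here on a made-up vector, is to store the names as a factor whose levels are in calendar order. It relies on month.name, a constant built into base R.

```r
# Hypothetical Month values like those in airquality$Month
month_num <- c(5, 6, 7, 8, 9, 5)

# Look up the full names and fix the level order to May through September
month_name <- factor(month.name[month_num], levels = month.name[5:9])

levels(month_name)
```

With a factor like this on the x axis, the boxes would be drawn from May to September.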

Q11. Which month has the highest median Ozone? Are there outliers in any month?

July has the highest median Ozone. The boxplot shows outliers in May, June, August, and September.
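
Reading medians off a boxplot can be confirmed numerically. A small sketch with base R's tapply(), grouping by the numeric Month column:

```r
# Median Ozone per month, ignoring missing values
med <- tapply(airquality$Ozone, airquality$Month, median, na.rm = TRUE)
med

# Month number with the highest median
top_month <- names(med)[which.max(med)]
top_month
```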

# Q12. Scatterplot of Temp vs Ozone, colored by Month.

ggplot(airquality, aes(x = Temp, y = Ozone, color = factor(month_name))) +
  geom_point(size = 2, alpha = 0.7) +
  labs(title = "Temperature vs Ozone", 
       x = "Temperature (°F)", 
       y = "Ozone (ppb)",
       color = "Month") +
  theme_dark()

## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

Q13. Is there a visible relationship between temperature and ozone?

Yes, there is a positive relationship: higher temperatures are generally associated with higher ozone levels.
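
To put a rough number on the trend seen in the scatterplot, a simple linear fit can be used. This is an illustrative sketch only; lm() is not part of the assignment, and a straight line is an assumption about the shape of the relationship.

```r
# Rows with a missing Ozone value are dropped by lm() by default
fit <- lm(Ozone ~ Temp, data = airquality)

# The Temp coefficient is the estimated change in Ozone per degree Fahrenheit
coef(fit)
```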

# Q14. Compute the correlation matrix for Ozone, Temp, and Wind.
#      (Hint: cor(airquality[, c("Ozone","Temp","Wind")], use = "complete.obs"))

cor_matrix <- cor(airquality[, c("Ozone", "Temp", "Wind")], 
                  use = "complete.obs")
cor_matrix

##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000

Q15. Which pair has the strongest correlation? What does that suggest?

The strongest correlation is between Ozone and Temp (r ≈ 0.70, positive). This suggests that hotter days tend to have higher ozone, so temperature is a useful linear predictor of ozone levels, although correlation alone does not show causation.
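
The strongest pair can also be picked out of the matrix programmatically rather than by eye. A minimal sketch that blanks the diagonal and one triangle, then looks for the largest absolute value:

```r
cm <- cor(airquality[, c("Ozone", "Temp", "Wind")], use = "complete.obs")

# The diagonal is always 1 and the matrix is symmetric, so mask both
cm[upper.tri(cm, diag = TRUE)] <- NA

# Position of the largest absolute correlation among the remaining cells
idx  <- which(abs(cm) == max(abs(cm), na.rm = TRUE), arr.ind = TRUE)
pair <- c(rownames(cm)[idx[1, "row"]], colnames(cm)[idx[1, "col"]])
pair
```

Using the absolute value matters because a strong negative correlation (such as Ozone and Wind) would otherwise be overlooked.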

# Q16. Generate a summary table grouped by Month with: count, mean Ozone, mean Temp,
#      mean Wind for each month.

# Q16. Summary table grouped by Month
airquality |> 
  group_by(month_name) |> 
  summarise(
    count = n(),
    mean_ozone = mean(Ozone, na.rm = TRUE),
    mean_temp = mean(Temp, na.rm = TRUE),
    mean_wind = mean(Wind, na.rm = TRUE),
  )

## # A tibble: 5 × 5
##   month_name count mean_ozone mean_temp mean_wind
##   <chr>      <int>      <dbl>     <dbl>     <dbl>
## 1 August        31       60.0      84.0      8.79
## 2 July          31       59.1      83.9      8.94
## 3 June          30       29.4      79.1     10.3 
## 4 May           31       23.6      65.5     11.6 
## 5 September     30       31.4      76.9     10.2

Q17. Which month has the highest average ozone? How do temperature and wind vary across months?

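August has the highest average ozone (60.0), with July close behind (59.1). Temperature peaks in July and August (about 84°F) and is lowest in May (65.5°F). Wind speed moves the other way: it is highest in May (11.6 mph) and lowest in July and August (about 8.8 to 8.9 mph).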
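
These readings of the summary table can be double-checked in code. The sketch below rebuilds the monthly means with base R's aggregate() and looks up the extreme months with which.max() and which.min(). It groups by the numeric Month column, so the rows come out in calendar order.

```r
by_month <- aggregate(
  airquality[, c("Ozone", "Temp", "Wind")],
  by = list(Month = airquality$Month),
  FUN = mean, na.rm = TRUE            # na.rm is passed through to mean()
)

by_month$Month[which.max(by_month$Ozone)]  # month with highest mean ozone
by_month$Month[which.min(by_month$Temp)]   # coolest month on average
by_month$Month[which.max(by_month$Wind)]   # windiest month on average
```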
