This homework has two parts. Part 1 uses lubridate and factors on NYC flight data. Part 2 does a full EDA on the built-in airquality dataset.

Part 1 — NYC Flights: Dates and Factors

# install.packages(c("nycflights13", "lubridate", "zoo", "forcats"))  # if needed
library(nycflights13)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(zoo)

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(forcats)
data(flights)

Quick look:

str(flights)

## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
##  $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
##  $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
##  $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

head(flights)

## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

dim(flights)

## [1] 336776     19

summary(flights)

##       year          month             day           dep_time    sched_dep_time
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 106  
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 906  
##  Median :2013   Median : 7.000   Median :16.00   Median :1401   Median :1359  
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349   Mean   :1344  
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729  
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359  
##                                                  NAs    :8255                 
##    dep_delay          arr_time    sched_arr_time   arr_delay       
##  Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000  
##  1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124   1st Qu.: -17.000  
##  Median :  -2.00   Median :1535   Median :1556   Median :  -5.000  
##  Mean   :  12.64   Mean   :1502   Mean   :1536   Mean   :   6.895  
##  3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945   3rd Qu.:  14.000  
##  Max.   :1301.00   Max.   :2400   Max.   :2359   Max.   :1272.000  
##  NAs    :8255      NAs    :8713                  NAs    :9430      
##       carrier           flight          tailnum             origin      
##  Length   :336776   Min.   :   1   Length   :336776   Length   :336776  
##  N.unique :    16   1st Qu.: 553   N.unique :  4043   N.unique :     3  
##  N.blank  :     0   Median :1496   N.blank  :     0   N.blank  :     0  
##  Min.nchar:     2   Mean   :1972   Min.nchar:     5   Min.nchar:     3  
##  Max.nchar:     2   3rd Qu.:3465   Max.nchar:     6   Max.nchar:     3  
##                     Max.   :8500   NAs      :  2512                     
##                                                                         
##         dest           air_time        distance         hour      
##  Length   :336776   Min.   : 20.0   Min.   :  17   Min.   : 1.00  
##  N.unique :   105   1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00  
##  N.blank  :     0   Median :129.0   Median : 872   Median :13.00  
##  Min.nchar:     3   Mean   :150.7   Mean   :1040   Mean   :13.18  
##  Max.nchar:     3   3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00  
##                     Max.   :695.0   Max.   :4983   Max.   :23.00  
##                     NAs    :9430                                  
##      minute        time_hour                  
##  Min.   : 0.00   Min.   :2013-01-01 05:00:00  
##  1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00  
##  Median :29.00   Median :2013-07-03 10:00:00  
##  Mean   :26.23   Mean   :2013-07-03 05:22:54  
##  3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00  
##  Max.   :59.00   Max.   :2013-12-31 23:00:00  
##

# Q1. Create a column dep_datetime by combining year, month, day, and dep_time into a
#     POSIXct datetime using lubridate's make_datetime().
#     Hint: hour = dep_time %/% (integer division) 100, minute = dep_time %% (modulo/remainder) 100
#     Show the first 5 rows with year, month, day, dep_time, and dep_datetime.

    #     read in original data, then creating a new dataset with an additional column using the piping and mutate function

flights_dep_datetime <- flights |>
  mutate(
    dep_datetime = make_datetime(
      year,
      month,
      day,
      hour = dep_time %/% 100,
      min = dep_time %% 100
    )
  )

    #     Show the first 5 rows with year, month, day, dep_time, and dep_datetime.

flights_dep_datetime |>
  select(year, month, day, dep_time, dep_datetime) |>
  head(5)

## # A tibble: 5 × 5
##    year month   day dep_time dep_datetime       
##   <int> <int> <int>    <int> <dttm>             
## 1  2013     1     1      517 2013-01-01 05:17:00
## 2  2013     1     1      533 2013-01-01 05:33:00
## 3  2013     1     1      542 2013-01-01 05:42:00
## 4  2013     1     1      544 2013-01-01 05:44:00
## 5  2013     1     1      554 2013-01-01 05:54:00

# Q2. After creating dep_datetime, use lubridate's month() to filter flights that
#     departed in JUNE 2013. How many flights are there?
#     (Hint: filter(month(dep_datetime) == 6))

june_2013_flights_dep_datetime <- flights_dep_datetime |>
  filter(month(dep_datetime) == 6)

nrow(june_2013_flights_dep_datetime)

## [1] 27234

#   There are 27234 flights from June 2013

# Q3. The carrier column is a character. Convert it to a factor and check its levels.

flights_carrier_as_factor <-flights |>
  mutate(
    carrier_character = carrier,     # want to preserve carrier as character column just in case
    carrier = factor(carrier)
  )

levels(flights_carrier_as_factor$carrier)

##  [1] "9E" "AA" "AS" "B6" "DL" "EV" "F9" "FL" "HA" "MQ" "OO" "UA" "US" "VX" "WN"
## [16] "YV"

# Q4. Use fct_collapse() to keep "UA", "AA", and "DL" as their own levels and lump
#     everything else into "Other". Then count flights per recorded carrier level.
#     (Hint: see the fct_collapse demo from the Wrangling Activity Part H)

flights_three_levels_and_others <- flights |>
  mutate(
    carrier_grouped = fct_collapse(
      carrier,
      UA = "UA",
      AA = "AA",
      DL = "DL",
      other_level = "Others"
    )
  )

flights_three_levels_and_others |>
  count(carrier_grouped)

## # A tibble: 4 × 2
##   carrier_grouped      n
##   <fct>            <int>
## 1 AA               32729
## 2 DL               48110
## 3 UA               58665
## 4 Others          197272

# There are 32,729 AA flights, 48110 Delta flights, 58665 United Airlines flights and the rest, 197272 flights were other.

# Q5. Missing data: how many flights have NA for dep_delay? Filter them out and report
#     the remaining row count.

sum(is.na(flights$dep_delay)) # tells me how many NA's in dep_delay column

## [1] 8255

flights_no_NA <- flights |>   # create and name new data with flights without NA
  filter(!is.na(dep_delay))   # keeps the rows without NA

nrow(flights_no_NA)           # counting flights without NA

## [1] 328521

#   There are 8255 flights with NA for dep_delay.
#   There are 328521 flights without NA for dep_delay.

Part 2 — Airquality EDA

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ ggplot2 4.0.3     ✔ stringr 1.6.0
## ✔ purrr   1.2.2     ✔ tibble  3.3.1
## ✔ readr   2.2.0     ✔ tidyr   1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

data("airquality")

# Q6. For Ozone, Temp, and Wind: compute mean, median, sd, min, max
#     (use na.rm = TRUE where needed).

question_6_summary <- airquality |>
  summarise(
    Ozone_mean     = mean(Ozone,   na.rm = TRUE),
    Ozone_median   = median(Ozone, na.rm = TRUE),
    Ozone_sd       = sd(Ozone,     na.rm = TRUE),
    Ozone_min      = min(Ozone,    na.rm = TRUE),
    Ozone_max      = max(Ozone,    na.rm = TRUE),
    
    Temp_mean      = mean(Temp,    na.rm = TRUE), 
    Temp_median    = median(Temp,  na.rm = TRUE),
    Temp_sd        = sd(Temp,      na.rm = TRUE),
    Temp_min       = min(Temp,     na.rm = TRUE),
    Temp_max       = max(Temp,     na.rm = TRUE),
    
    Wind_mean      = mean(Wind,     na.rm = TRUE),
    Wind_median    = median(Wind,   na.rm = TRUE),
    Wind_sd        = sd(Wind,       na.rm = TRUE),
    Wind_min       = min(Wind,      na.rm = TRUE),
    Wind_max       = max(Wind,      na.rm = TRUE), 
  )

question_6_summary      # row version

##   Ozone_mean Ozone_median Ozone_sd Ozone_min Ozone_max Temp_mean Temp_median
## 1   42.12931         31.5 32.98788         1       168  77.88235          79
##   Temp_sd Temp_min Temp_max Wind_mean Wind_median  Wind_sd Wind_min Wind_max
## 1 9.46527       56       97  9.957516         9.7 3.523001      1.7     20.7

t(question_6_summary)   #column look

##                    [,1]
## Ozone_mean    42.129310
## Ozone_median  31.500000
## Ozone_sd      32.987885
## Ozone_min      1.000000
## Ozone_max    168.000000
## Temp_mean     77.882353
## Temp_median   79.000000
## Temp_sd        9.465270
## Temp_min      56.000000
## Temp_max      97.000000
## Wind_mean      9.957516
## Wind_median    9.700000
## Wind_sd        3.523001
## Wind_min       1.700000
## Wind_max      20.700000

Q7. Compare the mean and median for each variable. What does the relationship between mean and median suggest about distribution skewness? What does the standard deviation tell you about variability?

For Ozone, the mean was 42.1 and the median was 31.5. The mean is to the right of the median so the distribution is skewed to the right (tails on the right). The standard deviation was 33 units, which is large compared to the mean, so the variability is high. Min was 1 and max was 168.

For Temperature, the mean was 77.9 and the median was 79.0. The median and mean are close so the distribution is approximately symmetric. The standard deviation was 9.5, which is not that big compared to the mean, so temperatures have moderate fluctuation. Min was 56 and max was 97.

For Wind, the mean was 9.96 and the median was 9.70. They are close to each other so the distribition should be roughly symmmetric. The standard deviation was 3.5, so there was not much variability in wind. Min was 1.7 and the max was 20.7

# Q8. Make a histogram of Ozone.

ggplot(airquality, aes(x = Ozone)) +
  geom_histogram(binwidth = 15, fill = "dodgerblue" , color = "white" , na.rm = TRUE) +
  labs(title = "Ozone Distribution", x = "Ozone (parts/billion" , y = "count")

Q9. Describe the shape of the Ozone distribution. Any outliers or unusual features?

The Ozone distribution is right skewed which is consistent with the mean of 42.1 being higher than the median of 31.5. From the histogram there seem to be outliers greater than 150 parts/billion. Using R, we found using the Inter Quartile Range and 1.5 X IQR to find outliers of 135 and 168. These may contribute to the high standard deviation.

Q1  <- quantile(airquality$Ozone, 0.25, na.rm = TRUE)
Q3  <- quantile(airquality$Ozone, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1

Q1 - 1.5 * IQR   # lower bound

##     25% 
## -49.875

Q3 + 1.5 * IQR   # upper bound

##     75% 
## 131.125

airquality |>
  filter(Ozone > 131.125)     #  which values are larger than 131 and so outliers?

##   Ozone Solar.R Wind Temp Month Day
## 1   135     269  4.1   84     7   1
## 2   168     238  3.4   81     8  25

# Q10. Create a new column month_name with full month names (May–September) using case_when.
#      Then make a boxplot of Ozone by month_name.
#      (Hint: case_when was covered in the Wrangling Activity.)

# first the case_when and creating a new data set
airquality_months_may_to_september <- airquality |>
  mutate(
    month_name = case_when(
      Month == 5 ~ "May",
      Month == 6 ~ "June",
      Month == 7 ~ "July",
      Month == 8 ~ "August",
      Month == 9 ~"September" 
    )
  )

# want box plots to be plotted graphed in chronological order, not alphabetical.

airquality_months_may_to_september_chrono <- airquality_months_may_to_september |>
  mutate(
    month_name = fct_relevel(month_name, "May", "June", "July", "August", "September")
  )
  
# now the box plots
ggplot (airquality_months_may_to_september_chrono, aes(x = month_name, y = Ozone)) +
  geom_boxplot(fill = "dodgerblue", na.rm = TRUE) +
  labs(title = "Ozone by Month", x = "Month", y = "Ozone (ppb)")

Q11. Which month has the highest median Ozone? Are there outliers in any month?

Using R, it appears that July has highest median Ozone of 60.

May has one outlier. June has one outlier. July has none. August has one outlier. September has four.

airquality_months_may_to_september |>
  group_by(month_name) |>
  summarise(median_ozone = median(Ozone, na.rm = TRUE)) |>
  arrange(desc(median_ozone))

## # A tibble: 5 × 2
##   month_name median_ozone
##   <chr>             <dbl>
## 1 July                 60
## 2 August               52
## 3 June                 23
## 4 September            23
## 5 May                  18

Q12. Scatterplot of Temp vs Ozone, colored by Month.

ggplot(airquality_months_may_to_september_chrono, aes(x = Temp, y = Ozone, color = month_name)) + 
  geom_point(na.rm = TRUE) +
  labs(title = "Temperature vs Ozone", x = "Temperature (F)", y = "Ozone (ppb)", color = "Month")

Q13. Is there a visible relationship between temperature and ozone?

It appears that higher Ozone ppb are associated with Temperatures.

# Q14. Compute the correlation matrix for Ozone, Temp, and Wind.
#      (Hint: cor(airquality[, c("Ozone","Temp","Wind")], use = "complete.obs"))

cor(airquality[, c("Ozone","Temp","Wind")], use = "complete.obs")

##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000

Q15. Which pair has the strongest correlation? What does that suggest?

It appears that Temperature and Ozone (ppb) have the strongest correlation of 0.698 . It happens to be a positive correlation. The higher temperatures are strongly associated with higher Ozone levels.

There does seem to be a moderately strong correlation between Wind and Ozone of -0.601. It happens to be negative correlation. The windier days are associated with the lower ozone levels.

# Q16. Generate a summary table grouped by Month with: count (for number of rows or days in month with no NAs), mean Ozone, mean Temp,
#      mean Wind for each month.


airquality_months_may_to_september_chrono |>
  group_by(month_name) |>
  summarise(
    count = n(),
    Ozone_mean      = mean(Ozone,   na.rm = TRUE),
    Temp_mean       = mean(Temp,    na.rm = TRUE),
    Wind_mean       = mean(Wind,    na.rm = TRUE)
    
  )

## # A tibble: 5 × 5
##   month_name count Ozone_mean Temp_mean Wind_mean
##   <fct>      <int>      <dbl>     <dbl>     <dbl>
## 1 May           31       23.6      65.5     11.6 
## 2 June          30       29.4      79.1     10.3 
## 3 July          31       59.1      83.9      8.94
## 4 August        31       60.0      84.0      8.79
## 5 September     30       31.4      76.9     10.2

Q17. Which month has the highest average ozone? How do temperature and wind vary across months?

It appears that August has the highest mean Ozone of 59.962 ppb, beating out July by only about .847 parts per billion.

Mean temperature was lowest in May (65.5 degrees F) and it’s highest in July and August, each around 84 degrees F. In September, mean temperature fell to about 77 degrees, two degrees lower than June.

Mean wind speed was at its highest in May at about 11.6 (miles per hour?) then dropping through June and with July and August having close mean wind speeds around 8.8 / 8.9 miles per hour. In September, the mean wind speed rose again to June levels at 10.2 miles per hour.

Months with the highest ozone also had the highest temperatures and lower wind speeds. This seems consistent with the positive correlation we found between ozone and temperature and the negative correlation between ozone and wind.

Homework 3 — Dates, Factors + EDA

CGilbert A

Part 1 — NYC Flights: Dates and Factors

Part 2 — Airquality EDA

Q12. Scatterplot of Temp vs Ozone, colored by Month.