Homework 3 — Dates, Factors + EDA

This homework has two parts. Part 1 uses lubridate and factors on NYC flight data. Part 2 does a full EDA on the built-in airquality dataset.

Part 1 — NYC Flights: Dates and Factors

# install.packages(c("nycflights13", "lubridate", "zoo", "forcats"))  # if needed
library(nycflights13)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(zoo)

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(forcats)
data(flights)

Quick look:

str(flights)

## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
##  $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
##  $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
##  $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

head(flights)

## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

# Q1. Create a column dep_datetime by combining year, month, day, and dep_time into a
#     POSIXct datetime using lubridate's make_datetime().
#     Hint: hour = dep_time %/% 100, minute = dep_time %% 100
#     Show the first 5 rows with year, month, day, dep_time, and dep_datetime.
new_flights <- flights %>%
  mutate(dep_datetime=make_datetime(year, month, day, hour = dep_time %/% 100, min = dep_time %% 100)) %>%
  select(year, month, day, dep_time, dep_datetime) %>%
  head(5)

# I used the mutate function to make a new column called dep_datetime and used the make_datetime function to populate the column. I could pull directly from the year, month, and day columns for the year, month, and day parts of the datetime. I used the formulas given above for the hour and minute parts of the datetime. I kept getting an error for minute. I asked Claude and it explained that I had to use min, not minute.

# Q2. After creating dep_datetime, use lubridate's month() to filter flights that
#     departed in JUNE 2013. How many flights are there?
#     (Hint: filter(month(dep_datetime) == 6))
flights %>%
  mutate(dep_datetime=make_datetime(year, month, day, hour = dep_time %/% 100, min = dep_time %% 100)) %>%
  filter(month(dep_datetime)==6) %>%
  nrow

## [1] 27234

# I used the same mutate function as above to make the dep_datetime column again. Then I used the month function within the filter function and set month part of dep_datetime equal to 6 to filter for just the flights that departed in June. Then I counted the rows to find how many flights met my criteria. 

# Q3. The carrier column is a character. Convert it to a factor and check its levels.
flights <- flights %>%
  mutate(carrier=as.factor(carrier)) 

levels(flights$carrier)

##  [1] "9E" "AA" "AS" "B6" "DL" "EV" "F9" "FL" "HA" "MQ" "OO" "UA" "US" "VX" "WN"
## [16] "YV"

# I used the as.factor function within the mutate function to convert the carrier column into factors. I originally tried using the levels function inside the pipe, but I kept getting a null result. I was able to use it successfully outside of the pipe, but I needed to save the new flights df first. 

# Q4. Use fct_collapse() to keep "UA", "AA", and "DL" as their own levels and lump
#     everything else into "Other". Then count flights per recoded carrier level.
#     (Hint: see the fct_collapse demo from the Wrangling Activity Part H)
flights <- flights %>%
  mutate(carrier_collapse=fct_collapse(carrier,
                                       "UA" = "UA",
                                       "AA" = "AA",
                                       "DL" = "DL",
                                       other_level = "other"))

flights %>%
  count(carrier_collapse)

## # A tibble: 4 × 2
##   carrier_collapse      n
##   <fct>             <int>
## 1 AA                32729
## 2 DL                48110
## 3 UA                58665
## 4 other            197272

# I used mutate to create a new column where I used fct_collapse to collapse the factors as explained in the question. I saved the new dataframe and then ran the count function on my new column. 

# Q5. Missing data: how many flights have NA for dep_delay? Filter them out and report
#     the remaining row count.
data("flights")
sum(is.na(flights$dep_delay))

## [1] 8255

flights %>%
  filter(dep_delay != is.na(dep_delay)) %>%
  nrow

## [1] 312007

# I reloaded the flights data because I thought I might have lost some rows with what I did above. I first tried this in the pipe but got an error. Claude explained that because sum is a base R function, it doesn't work the way I expect in the pipe. I successfully used the sum function outside the pipe to count the na flights in the dep_delay column. 

# I first tried flights %>% filter(dep_delay != NA) to filter out NA flights, but R gave me a warning and explained I need to use is.na to check if the data is NA. So I used is.na inside of the filter function. Then I used nrow to count the remaining rows.

Part 2 — Airquality EDA

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ ggplot2 4.0.3     ✔ stringr 1.6.0
## ✔ purrr   1.2.2     ✔ tibble  3.3.1
## ✔ readr   2.2.0     ✔ tidyr   1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

data("airquality")

# Q6. For Ozone, Temp, and Wind: compute mean, median, sd, min, max
#     (use na.rm = TRUE where needed).
summary(airquality$Ozone)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.     NAs 
##    1.00   18.00   31.50   42.13   63.25  168.00      37

summary(airquality$Temp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   56.00   72.00   79.00   77.88   85.00   97.00

summary(airquality$Wind)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.700   7.400   9.700   9.958  11.500  20.700

eda_function <- function(column){
  paste(
    "Mean =",
    mean(column, na.rm=TRUE),
    "Median =",
    median(column, na.rm=TRUE),
    "SD =",
    sd(column, na.rm=TRUE),
    "min =", 
    min(column, na.rm=TRUE),
    "max =",
    max(column, na.rm=TRUE)
  )
}
eda_function(airquality$Ozone)

## [1] "Mean = 42.1293103448276 Median = 31.5 SD = 32.987884514434 min = 1 max = 168"

eda_function(airquality$Temp)

## [1] "Mean = 77.8823529411765 Median = 79 SD = 9.46526974097146 min = 56 max = 97"

eda_function(airquality$Wind)

## [1] "Mean = 9.95751633986928 Median = 9.7 SD = 3.5230013522126 min = 1.7 max = 20.7"

# First I tried using the summary function, but it did not give me sd. I created a custom function (eda_function) that contains the eda functions the question asks for so I don't have to keep calling them individually. I labeled each output with the function by using the paste function and putting "Name of function =" in front of each operation. Then I called my new custom function for each column the question asked for.

Q7. Compare the mean and median for each variable. What does the relationship between mean and median suggest about distribution skewness? What does the standard deviation tell you about variability?

The position of the median and mean compared to each other indicates skewness. If the median is higher than the mean, that means there are extreme low values that make the mean lower than the median, thus there is a significant tail of low values on the left. Thus, there is a left skew, since lower values are on the left. If the median is lower than the mean, that means there are extreme high values that make the mean higher than the median, thus there is a significant tail of high values. Thus there is a right skew, since higher values are on the right.

The ozone variable has a mean higher than the median, so it is skewed right.

The temp and wind variables have means and medians very close together, indicating minimal skew (normal distribution).

A larger SD indicates more variability in the data.

The ozone variable has much more variability than the temp and wind variable because its SD is much higher.

The wind variable has the least variability because its SD is the smallest.

# Q8. Make a histogram of Ozone.
ggplot (airquality, aes (x=Ozone)) +
  geom_histogram(binwidth=10, fill="blue", color="white") +
  labs(title="Distribution of Ozone Values", x="Ozone", y="Number of Values") +
  theme_minimal()

## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_bin()`).

#I used ggplot's geom_histogram functions. I set binwidth to 10.

Q9. Describe the shape of the Ozone distribution. Any outliers or unusual features? There is a signficant right skew. There is a long right tail and an extreme outlier in the 175-185 bin.

# Q10. Create a new column month_name with full month names (May–September) using case_when.
#      Then make a boxplot of Ozone by month_name.
#      (Hint: case_when was covered in the Wrangling Activity.)

new_airquality <- airquality %>%
  mutate (month_name = 
            case_when(
              Month == 5 ~ "May",
              Month == 6 ~ "June",
              Month == 7 ~ "July",
              Month == 8 ~ "August",
              Month == 9 ~ "September"
                      )
  )

ggplot (new_airquality, aes(x= month_name , y= Ozone , fill = month_name)) +
  geom_boxplot() +
  labs(title="Ozone by Month", x="Month", y="Ozone") + 
  theme_minimal () +
  theme(axis.text.x=element_text(angle=45, hjust=1))

## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

# I used the mutate function to add a column called month_name and used case_when to label 5 in the Month column as May in the month_name column, 6 in the Month column as June in the month_name column, etc.
# Then I used ggplot's geom_boxplot function. I set x=month_name, y=Ozone, and filled by month_name.

Q11. Which month has the highest median Ozone? Are there outliers in any month? July has the highest median Ozone. August, June, May, and September all have outliers.

# Q12. Scatterplot of Temp vs Ozone, colored by Month.
ggplot(new_airquality, aes(x = Temp, y = Ozone, color = month_name)) +
  geom_point(alpha = 0.4) +
  labs(title = "Temp vs Ozone Color by Month",
       x = "Temp", y = "Ozone") +
  theme_minimal()

## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

# I used ggplot's geom_point function and added color=month_name to the aestetics in order to color by month.

Q13. Is there a visible relationship between temperature and ozone? Yes. As temperature increases, ozone also tends to increase. There is a positive correlation.

# Q14. Compute the correlation matrix for Ozone, Temp, and Wind.
#      (Hint: cor(airquality[, c("Ozone","Temp","Wind")], use = "complete.obs"))
cor(airquality[, c("Ozone","Temp","Wind")], use = "complete.obs")

##            Ozone       Temp       Wind
## Ozone  1.0000000  0.6983603 -0.6015465
## Temp   0.6983603  1.0000000 -0.5110750
## Wind  -0.6015465 -0.5110750  1.0000000

# I used the function given in the hint.

Q15. Which pair has the strongest correlation? What does that suggest? Temperature and ozone have the strongest correlation. That suggests that temperature and ozone are good predictors of one another; better than wind and ozone or wind and temperature.

# Q16. Generate a summary table grouped by Month with: count, mean Ozone, mean Temp,
#      mean Wind for each month.

airquality %>%
  group_by(Month) %>%
  summarize("count" = n(), "mean_Ozone" = mean(Ozone, na.rm=TRUE), "mean_Temp" = mean(Temp), "mean_Wind" = mean(Wind)) %>%
  arrange(desc(mean_Ozone))

## # A tibble: 5 × 5
##   Month count mean_Ozone mean_Temp mean_Wind
##   <int> <int>      <dbl>     <dbl>     <dbl>
## 1     8    31       60.0      84.0      8.79
## 2     7    31       59.1      83.9      8.94
## 3     9    30       31.4      76.9     10.2 
## 4     6    30       29.4      79.1     10.3 
## 5     5    31       23.6      65.5     11.6

# I used dplyr's group_by and summarize function. First I tried this without removing NA values for Ozone, but I got NA for mean Ozone, so I realized that I needed to remove the NA values.

Q17. Which month has the highest average ozone? How do temperature and wind vary across months? August had the highest mean Ozone. Temperature is lowest in May, increases until it peaks in August, and then begins to decrease in September. Wind does the opposite. It is highest in may, decreases until the minimum in August, and then begins to increase in September.

Homework 3 — Dates, Factors + EDA

Vania Halvonik

Part 1 — NYC Flights: Dates and Factors

Part 2 — Airquality EDA