A VERY fast introduction to the Tidyverse

If you haven’t used the Tidyverse before I recommend:

Let’s load the required packages and see what data is included:

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

Examining the table

The glimpse() output below lists every column in the flights table along with its type and first few values:

## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2~
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, ~
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, ~
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1~
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,~
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,~
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1~
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "~
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4~
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394~
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",~
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",~
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1~
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, ~
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6~
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0~
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0~
?flights
## starting httpd help server ... done

Filter

The flights table contains more than one carrier (airline):

unique(flights$carrier)
##  [1] "UA" "AA" "B6" "DL" "EV" "MQ" "US" "WN" "VX" "FL" "AS" "9E" "F9" "HA" "YV"
## [16] "OO"

To keep only the rows for one (or more) carriers, use the filter() function.

Filter for flights from United Airlines (UA):

ua_flights <- flights %>% 
  filter(carrier == "UA")

unique(ua_flights$carrier)
## [1] "UA"
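
To keep more than one carrier at a time, the %in% operator can be used inside filter(). The sketch below is a minimal illustration on a made-up stand-in table (toy_flights and ua_aa_flights are hypothetical names, not part of nycflights13); the same pattern should apply to the real flights table.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the flights table (made-up rows, for illustration only)
toy_flights <- tibble(
  carrier   = c("UA", "AA", "B6", "DL", "UA"),
  dep_delay = c(2, 2, -1, -6, -4)
)

# %in% keeps rows whose carrier matches any value in the vector
ua_aa_flights <- toy_flights %>%
  filter(carrier %in% c("UA", "AA"))

unique(ua_aa_flights$carrier)
```

Using %in% avoids chaining several carrier == "..." comparisons together with |.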

Select and Mutate

flights %>% 
  select(year, month, day, dep_time, flight, carrier) %>% 
  head()
## # A tibble: 6 x 6
##    year month   day dep_time flight carrier
##   <int> <int> <int>    <int>  <int> <chr>  
## 1  2013     1     1      517   1545 UA     
## 2  2013     1     1      533   1714 UA     
## 3  2013     1     1      542   1141 AA     
## 4  2013     1     1      544    725 B6     
## 5  2013     1     1      554    461 DL     
## 6  2013     1     1      554   1696 UA

mutate() creates new columns within the data frame. The make_date() function from the lubridate package (along with its sibling make_datetime()) makes creating date or datetime vectors within a data frame quite easy. Here we use the integer columns year, month, and day to make a new column called date_flight:

flights %>% 
  select(year, month, day) %>% 
  mutate(date_flight = make_date(year, month, day)) %>% 
  head()
## # A tibble: 6 x 4
##    year month   day date_flight
##   <int> <int> <int> <date>     
## 1  2013     1     1 2013-01-01 
## 2  2013     1     1 2013-01-01 
## 3  2013     1     1 2013-01-01 
## 4  2013     1     1 2013-01-01 
## 5  2013     1     1 2013-01-01 
## 6  2013     1     1 2013-01-01
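
For a full datetime rather than a date, lubridate also provides make_datetime(), which accepts hour and minute components as well. The sketch below uses a small made-up table (toy_flights and sched_dep are hypothetical names) whose columns mirror the year, month, day, hour, and minute columns of flights.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Toy rows mirroring the date and time columns of flights (made-up data)
toy_flights <- tibble(
  year = 2013L, month = 1L, day = 1L,
  hour = c(5, 5, 6), minute = c(15, 29, 0)
)

# make_datetime() combines the pieces into a POSIXct column (UTC by default)
toy_flights <- toy_flights %>%
  mutate(sched_dep = make_datetime(year, month, day, hour, minute))

toy_flights
```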

Group By, Summarise and Arrange

group_by() is a dplyr function (part of the tidyverse) that groups the rows of a data frame by one or more key columns. The data itself is unchanged; only the grouping metadata is added:

flights_grouped_carrier <- flights %>% 
  group_by(carrier)

flights_grouped_carrier
## # A tibble: 336,776 x 19
## # Groups:   carrier [16]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

From there we can compute per-group summaries, such as the number of rows and the average dep_delay, and sort the result:

flights_grouped_carrier %>% 
  summarise(cnt = n(),
            avg_delay = mean(dep_delay, na.rm = TRUE)) %>% 
  arrange(desc(avg_delay))
## # A tibble: 16 x 3
##    carrier   cnt avg_delay
##    <chr>   <int>     <dbl>
##  1 F9        685     20.2 
##  2 EV      54173     20.0 
##  3 YV        601     19.0 
##  4 FL       3260     18.7 
##  5 WN      12275     17.7 
##  6 9E      18460     16.7 
##  7 B6      54635     13.0 
##  8 VX       5162     12.9 
##  9 OO         32     12.6 
## 10 UA      58665     12.1 
## 11 MQ      26397     10.6 
## 12 DL      48110      9.26
## 13 AA      32729      8.59
## 14 AS        714      5.80
## 15 HA        342      4.90
## 16 US      20536      3.78
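
One detail worth checking on a small example: n() counts every row in the group, including rows where dep_delay is missing, while mean(..., na.rm = TRUE) drops the missing values before averaging. The sketch below uses made-up data (toy_flights and toy_summary are hypothetical names) so the arithmetic can be verified by hand.

```r
library(dplyr)
library(tibble)

# Made-up delays; one UA row has a missing dep_delay
toy_flights <- tibble(
  carrier   = c("UA", "UA", "AA", "AA", "AA"),
  dep_delay = c(10, NA, 2, 4, 6)
)

toy_summary <- toy_flights %>%
  group_by(carrier) %>%
  summarise(cnt = n(),                                  # counts all rows, NA included
            avg_delay = mean(dep_delay, na.rm = TRUE)) %>%  # NA dropped before averaging
  arrange(desc(avg_delay))

toy_summary
```

Here UA has cnt = 2 but its average is computed from the single non-missing value (10), while AA has cnt = 3 and an average of 4.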

Visualization with ggplot2

https://r4ds.had.co.nz/data-visualisation.html

  1. Begin a plot with the ggplot() function

  2. Include the dataset

  3. Choose how to map the data using the aes() aesthetics function

  4. Add layers, like a geom layer, which generates the plot and has additional settings

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

If we wanted to create a scatter plot with the ua_flights dataset and dep_delay on the x, arr_delay on the y, then we would use:

ggplot(data = ua_flights, mapping = aes(x = dep_delay, y = arr_delay)) + 
  geom_point(alpha = 0.2)
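
Further layers are added with + in the same way. As a minimal sketch on made-up data (toy_delays and p are hypothetical names), the plot below adds a dashed reference line where arrival delay equals departure delay, plus axis labels; the delays in nycflights13 are recorded in minutes.

```r
library(ggplot2)

# Made-up delays in minutes, standing in for ua_flights
toy_delays <- data.frame(
  dep_delay = c(-5, 0, 10, 30, 60),
  arr_delay = c(-10, -2, 12, 25, 70)
)

p <- ggplot(data = toy_delays, mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_point(alpha = 0.2) +                                     # layer 1: scatter plot
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +  # layer 2: y = x reference
  labs(x = "Departure delay (min)", y = "Arrival delay (min)")

p
```

Points above the dashed line lost additional time in the air; points below it made up some of their departure delay.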

Conditional filtering

Use | and & to conditionally filter a dataset:

early_january_weather <- weather %>% 
  filter(origin == "LGA" & (month == 1 & day <= 15))

head(early_january_weather)
## # A tibble: 6 x 15
##   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust
##   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>
## 1 LGA     2013     1     1     1  39.9  26.1  57.3      260       13.8      23.0
## 2 LGA     2013     1     1     2  41    26.1  55.0      260       17.3      25.3
## 3 LGA     2013     1     1     3  41    26.1  55.0      260       16.1      24.2
## 4 LGA     2013     1     1     4  41    26.1  55.0      260       17.3      25.3
## 5 LGA     2013     1     1     5  39.9  25.0  54.8      250       15.0      21.9
## 6 LGA     2013     1     1     6  39.9  25.0  54.8      260       16.1      23.0
## # ... with 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
## #   time_hour <dttm>
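
The example above only uses &. As a minimal sketch of | (logical OR), and of combining the two with parentheses, the code below filters a small made-up table (toy_weather, jan_or_lga, and jan_lga_jfk are hypothetical names).

```r
library(dplyr)
library(tibble)

# Made-up rows standing in for the weather table
toy_weather <- tibble(
  origin = c("LGA", "JFK", "EWR", "LGA", "JFK"),
  month  = c(1L, 1L, 2L, 2L, 3L),
  temp   = c(39.9, 41.0, 30.2, 28.4, 45.0)
)

# | keeps rows where at least one condition holds
jan_or_lga <- toy_weather %>%
  filter(origin == "LGA" | month == 1)

# Parentheses make the grouping explicit when mixing | and &
jan_lga_jfk <- toy_weather %>%
  filter((origin == "LGA" | origin == "JFK") & month == 1)
```

Since & binds more tightly than | in R, the parentheses in the second filter are needed to get "either airport, in January".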

Plotting a time series:

ggplot(data = early_january_weather, 
       mapping = aes(x = time_hour, y = temp)) + 
  geom_line() +
  ggtitle("Weather in early January") +
  ylab("Temperature (F)") +
  xlab("Datetime")

Next lecture we’ll cover tsibble objects (we’re working with tibble objects today) and the autoplot() function, which simplifies plotting:

library(tsibble)
## 
## Attaching package: 'tsibble'
## The following object is masked from 'package:lubridate':
## 
##     interval
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union
library(feasts)
## Loading required package: fabletools
early_january_weather %>% 
  as_tsibble(index = time_hour) %>% 
  autoplot(temp)

Lab 1

  1. What kind of object is early_january_weather (use str())?
str(early_january_weather)
## tibble [358 x 15] (S3: tbl_df/tbl/data.frame)
##  $ origin    : chr [1:358] "LGA" "LGA" "LGA" "LGA" ...
##  $ year      : int [1:358] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month     : int [1:358] 1 1 1 1 1 1 1 1 1 1 ...
##  $ day       : int [1:358] 1 1 1 1 1 1 1 1 1 1 ...
##  $ hour      : int [1:358] 1 2 3 4 5 6 7 8 9 10 ...
##  $ temp      : num [1:358] 39.9 41 41 41 39.9 ...
##  $ dewp      : num [1:358] 26.1 26.1 26.1 26.1 25 ...
##  $ humid     : num [1:358] 57.3 55 55 55 54.8 ...
##  $ wind_dir  : num [1:358] 260 260 260 260 250 260 250 250 260 270 ...
##  $ wind_speed: num [1:358] 13.8 17.3 16.1 17.3 15 ...
##  $ wind_gust : num [1:358] 23 25.3 24.2 25.3 21.9 ...
##  $ precip    : num [1:358] 0 0 0 0 0 0 0 0 0 0 ...
##  $ pressure  : num [1:358] 1012 1012 1012 1012 1011 ...
##  $ visib     : num [1:358] 10 10 10 10 10 10 10 10 10 10 ...
##  $ time_hour : POSIXct[1:358], format: "2013-01-01 01:00:00" "2013-01-01 02:00:00" ...
### early_january_weather is a tibble.
  2. Plot wind speeds (weather table) in February by origin. What stands out?
february <- weather %>%
  filter(month == 2)

# Map color to origin so each airport is drawn as its own line
ggplot(february) +
  geom_line(aes(x = time_hour, y = wind_speed, color = origin))

### Somewhere between February 11 and February 15 there is an outlier: a very high wind speed at one particular time. Apart from that, wind speed appears relatively similar across the rest of the month.
  3. What month, on average, had the highest wind speed with origin = JFK?
JFK <- weather %>%
  filter(origin == "JFK")

JFK_Monthly_Avg <- JFK %>%
  group_by(month) %>%
  summarize(Monthly_AVG = mean(wind_speed, na.rm = TRUE)) %>%
  arrange(desc(Monthly_AVG))
### March has the highest monthly average wind speed, at 13.997021 miles/hour.

In Class Discussion

### Last class we discussed the different factors that can change the price of a car being sold. These included: the age of the car, its accident history, how many miles are on the car, its miles per gallon, current market conditions, additional features, and whether the car runs on gas or is electric.