Class 03: Basic Aspects of R

Introduction

In this class we are going to keep working with out databases from Class 2 and create new databases by merging them according our necessities.

The first step is install the package that we are going to use for this class. In this time we will work with tidyverse, which allow us to use variables from our dataset and create new ones. Also, the dataset we will use is obtained from nycflights13, this package contains information about all flights that departed from New York City and its airports to destinations in the United States, Puerto Rico, and the American Virgin Islands in 2013.

This package provides the following data tables.

flights: all flights that departed from NYC in 2013
weather: hourly meterological data for each airport
planes: construction information about each plane
airports: airport names and locations
airlines: translation between two letter carrier codes and names

To install these two package, we must run the following two lines:

#install.packages("nycflights13")
#install.packages("tidyverse")
library(nycflights13)
library(tidyverse)

## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Remember that to run effectively those two lines, you must delete the # symbol. Then, the command library() allows us to use the package and have access to the commands or datasets included in that package. Take your time to check what we have access in each dataset.

data(flights)
flights

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

This flights dataset includes information for:

year, month, day: Date of departure.
dep_time, Actual departure time (format HHMM or HMM, HH are hours and MM are minutes), local time zone.
arr_time: Actual arrival time (format HHMM or HMM, HH are hours and MM are minutes), local time zone.
sched_dep_time: Scheduled departure time (format HHMM or HMM, HH are hours and MM are minutes), local time zone.
sched_arr_time: Scheduled arrival time (format HHMM or HMM, HH are hours and MM are minutes), local time zone.
dep_delay: Departure delays, in minutes. Negative times represent early departures.
arr_delay: Arrival delays, in minutes. Negative times represent early arrivals.
carrier: Two letter carrier abbreviation. See airlines to get name.
flight: Flight number.
tailnum: Plane tail number. See planes for additional metadata.
origin, dest: Origin and destination, respectively. See airports for additional metadata.
air_time: Amount of time spent in the air, in minutes.
distance: Distance between airports, in miles.
hour, minute: Time of scheduled departure broken into hour and minutes.
time_hour: Scheduled date and hour of the flight as a POSIXct date. Along with origin, can be used to join flights data to weather data.

data(weather)
weather

## # A tibble: 26,115 x 15
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
## # ... with 26,105 more rows, and 5 more variables: wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>

This weather dataset includes information for:

origin: Weather station. Named origin to facilitate merging with flights data.
year, month, day, hour: Time of recording.
temp, dewp: Temperature and dewpoint in F.
humid: Relative humidity.
wind_dir, wind_speed, wind_gust: Wind direction (in degrees), speed and gust speed (in mph).
precip: Precipitation, in inches.
pressure: Sea level pressure in millibars.
visib: Visibility in miles.
time_hour: Date and hour of the recording as a POSIXct date.

data(planes)
planes

## # A tibble: 3,322 x 9
##    tailnum  year type          manufacturer   model  engines seats speed engine 
##    <chr>   <int> <chr>         <chr>          <chr>    <int> <int> <int> <chr>  
##  1 N10156   2004 Fixed wing m~ EMBRAER        EMB-1~       2    55    NA Turbo-~
##  2 N102UW   1998 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
##  3 N103US   1999 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
##  4 N104UW   1999 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
##  5 N10575   2002 Fixed wing m~ EMBRAER        EMB-1~       2    55    NA Turbo-~
##  6 N105UW   1999 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
##  7 N107US   1999 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
##  8 N108UW   1999 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
##  9 N109UW   1999 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
## 10 N110UW   1999 Fixed wing m~ AIRBUS INDUST~ A320-~       2   182    NA Turbo-~
## # ... with 3,312 more rows

This planes dataset includes information for:

tailnum: Tail number.
year: Year manufactured.
type: Type of plane.
manufacturer, model: Manufacturer and model.
engines, seats: Number of engines and seats.
speed: Average cruising speed in mph.
engine: Type of engine.

data(airports)
airports

## # A tibble: 1,458 x 8
##    faa   name                       lat    lon   alt    tz dst   tzone          
##    <chr> <chr>                    <dbl>  <dbl> <dbl> <dbl> <chr> <chr>          
##  1 04G   Lansdowne Airport         41.1  -80.6  1044    -5 A     America/New_Yo~
##  2 06A   Moton Field Municipal A~  32.5  -85.7   264    -6 A     America/Chicago
##  3 06C   Schaumburg Regional       42.0  -88.1   801    -6 A     America/Chicago
##  4 06N   Randall Airport           41.4  -74.4   523    -5 A     America/New_Yo~
##  5 09J   Jekyll Island Airport     31.1  -81.4    11    -5 A     America/New_Yo~
##  6 0A9   Elizabethton Municipal ~  36.4  -82.2  1593    -5 A     America/New_Yo~
##  7 0G6   Williams County Airport   41.5  -84.5   730    -5 A     America/New_Yo~
##  8 0G7   Finger Lakes Regional A~  42.9  -76.8   492    -5 A     America/New_Yo~
##  9 0P2   Shoestring Aviation Air~  39.8  -76.6  1000    -5 U     America/New_Yo~
## 10 0S9   Jefferson County Intl     48.1 -123.    108    -8 A     America/Los_An~
## # ... with 1,448 more rows

This airports dataset includes information for:

faa: FAA airport code.
name: Usual name of the aiport.
lat, lon: Location of airport, latitude and longitude.
alt: Altitude, in feet.
tz: Timezone offset from GMT.
dst: Daylight savings time zone. A = Standard US DST: starts on the second Sunday of March, ends on the first Sunday of November. U = unknown. N = no dst.
tzone: IANA time zone, as determined by GeoNames webservice.

data(airlines)
airlines

## # A tibble: 16 x 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

This airlines dataset includes information for:

carrier: Two letter abbreviation.
name: Full name.

Relational Data

In this time we will relate different databases, create new variables with mutate or filtering part of our dataset. Our first tasks are:

Connect flights with planes database by using tailnum variable.
Connect flights with airlines database by using carrier variable.
Connect flights with airports database by using origin, and destiny variables.
Connect flights with weather database by using origin, year, month, day, and hour variables.

The first thing to do is be sure that our identifiers are unique, for that reason we will count them:

planes %>% 
  count(tailnum) %>% 
  filter(n > 1)

## # A tibble: 0 x 2
## # ... with 2 variables: tailnum <chr>, n <int>

weather %>% 
  count(year, month, day, hour, origin) %>% 
  filter(n > 1)

## # A tibble: 3 x 6
##    year month   day  hour origin     n
##   <int> <int> <int> <int> <chr>  <int>
## 1  2013    11     3     1 EWR        2
## 2  2013    11     3     1 JFK        2
## 3  2013    11     3     1 LGA        2

flights %>% 
  count(year, month, day, flight) %>% 
  filter(n > 1)

## # A tibble: 29,768 x 5
##     year month   day flight     n
##    <int> <int> <int>  <int> <int>
##  1  2013     1     1      1     2
##  2  2013     1     1      3     2
##  3  2013     1     1      4     2
##  4  2013     1     1     11     3
##  5  2013     1     1     15     2
##  6  2013     1     1     21     2
##  7  2013     1     1     27     4
##  8  2013     1     1     31     2
##  9  2013     1     1     32     2
## 10  2013     1     1     35     2
## # ... with 29,758 more rows

flights %>% 
  count(year, month, day, tailnum) %>% 
  filter(n > 1)

## # A tibble: 64,928 x 5
##     year month   day tailnum     n
##    <int> <int> <int> <chr>   <int>
##  1  2013     1     1 N0EGMQ      2
##  2  2013     1     1 N11189      2
##  3  2013     1     1 N11536      2
##  4  2013     1     1 N11544      3
##  5  2013     1     1 N11551      2
##  6  2013     1     1 N12540      2
##  7  2013     1     1 N12567      2
##  8  2013     1     1 N13123      2
##  9  2013     1     1 N13538      3
## 10  2013     1     1 N13566      3
## # ... with 64,918 more rows

Example

To join two or more databases, first we need to understand the methods we can use for it. In R, there are two way to join two datasets, one it is by using join comands and the other is by using mutate.

# First, we create a new dataset to use it in our join without modify the original one.
names(flights)

##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"

flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)
flights2

## # A tibble: 336,776 x 8
##     year month   day  hour origin dest  tailnum carrier
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
##  1  2013     1     1     5 EWR    IAH   N14228  UA     
##  2  2013     1     1     5 LGA    IAH   N24211  UA     
##  3  2013     1     1     5 JFK    MIA   N619AA  AA     
##  4  2013     1     1     5 JFK    BQN   N804JB  B6     
##  5  2013     1     1     6 LGA    ATL   N668DN  DL     
##  6  2013     1     1     5 EWR    ORD   N39463  UA     
##  7  2013     1     1     6 EWR    FLL   N516JB  B6     
##  8  2013     1     1     6 LGA    IAD   N829AS  EV     
##  9  2013     1     1     6 JFK    MCO   N593JB  B6     
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA     
## # ... with 336,766 more rows

# By using join functions
flights2 %>%
  select(-origin, -dest) %>% 
  left_join(airlines, by = "carrier")

## # A tibble: 336,776 x 7
##     year month   day  hour tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>   <chr>   <chr>                   
##  1  2013     1     1     5 N14228  UA      United Air Lines Inc.   
##  2  2013     1     1     5 N24211  UA      United Air Lines Inc.   
##  3  2013     1     1     5 N619AA  AA      American Airlines Inc.  
##  4  2013     1     1     5 N804JB  B6      JetBlue Airways         
##  5  2013     1     1     6 N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1     5 N39463  UA      United Air Lines Inc.   
##  7  2013     1     1     6 N516JB  B6      JetBlue Airways         
##  8  2013     1     1     6 N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1     6 N593JB  B6      JetBlue Airways         
## 10  2013     1     1     6 N3ALAA  AA      American Airlines Inc.  
## # ... with 336,766 more rows

# By using mutate functions
flights2 %>%
  select(-origin, -dest) %>% 
  mutate(name = airlines$name[match(carrier, airlines$carrier)])

## # A tibble: 336,776 x 7
##     year month   day  hour tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>   <chr>   <chr>                   
##  1  2013     1     1     5 N14228  UA      United Air Lines Inc.   
##  2  2013     1     1     5 N24211  UA      United Air Lines Inc.   
##  3  2013     1     1     5 N619AA  AA      American Airlines Inc.  
##  4  2013     1     1     5 N804JB  B6      JetBlue Airways         
##  5  2013     1     1     6 N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1     5 N39463  UA      United Air Lines Inc.   
##  7  2013     1     1     6 N516JB  B6      JetBlue Airways         
##  8  2013     1     1     6 N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1     6 N593JB  B6      JetBlue Airways         
## 10  2013     1     1     6 N3ALAA  AA      American Airlines Inc.  
## # ... with 336,766 more rows

Understanding Join commands

The command we will explore in this time are: inner_join, full_join, left_join, semi_join, anti_join, etc.. Lets create a temporal datasets to understand how our functions work.

food <- data.frame(CustomerId = c(1:6), 
                Product = c("sushi","ramen",
                            "udon","karage",
                            "soba","chahan"))

drink <- data.frame(CustomerId = c(2, 4, 6, 7, 8), 
                State = c("green tea","water",
                          "orange juice","beer",
                          "sake"))

food

##   CustomerId Product
## 1          1   sushi
## 2          2   ramen
## 3          3    udon
## 4          4  karage
## 5          5    soba
## 6          6  chahan

drink

##   CustomerId        State
## 1          2    green tea
## 2          4        water
## 3          6 orange juice
## 4          7         beer
## 5          8         sake

It is important to understand that in INNER JOIN functions we only keep the variable that match with the requirements, the others are ignored. In R, you have two functions useful for this purpose: join (used this time), and merge (you can check in R help documents). For this last function, the syntax is as follow:

merge(x, y, by.x, by.y, all.x, all.y), where: + x and y are the dataset names. + by.x and by.y are the names of the columns that are common in both datasets to make the merge. If it is not listed, R will use the columns that have the same names. + all.x and all.y are the options that specify the type of merge we are using. By default is all=FALSE.

For this example we will use both types of join functions:

dataxy <- merge(x=food, y=drink, by="CustomerId")
dataxy

##   CustomerId Product        State
## 1          2   ramen    green tea
## 2          4  karage        water
## 3          6  chahan orange juice

library(dplyr)
food %>% inner_join(drink, by="CustomerId")

##   CustomerId Product        State
## 1          2   ramen    green tea
## 2          4  karage        water
## 3          6  chahan orange juice

For OUTER JOIN the function will return the information for all individuals in our dataset. In the case of using merge() function, we have to make it clear to put the option all=TRUE. In the case of JOIN function, we can use full_join.

dataxy <- merge(x=food, y=drink, by="CustomerId", all=TRUE)
dataxy

##   CustomerId Product        State
## 1          1   sushi         <NA>
## 2          2   ramen    green tea
## 3          3    udon         <NA>
## 4          4  karage        water
## 5          5    soba         <NA>
## 6          6  chahan orange juice
## 7          7    <NA>         beer
## 8          8    <NA>         sake

food %>% full_join(drink, by="CustomerId")

##   CustomerId Product        State
## 1          1   sushi         <NA>
## 2          2   ramen    green tea
## 3          3    udon         <NA>
## 4          4  karage        water
## 5          5    soba         <NA>
## 6          6  chahan orange juice
## 7          7    <NA>         beer
## 8          8    <NA>         sake

However, in some cases we need to keep the information only from one of our dataset rather than all the information from both of them, or eliminate them. in this case, we case use the following functions and options:

dataxy <- merge(x=food, y=drink, by="CustomerId", all.x=TRUE)
dataxy

##   CustomerId Product        State
## 1          1   sushi         <NA>
## 2          2   ramen    green tea
## 3          3    udon         <NA>
## 4          4  karage        water
## 5          5    soba         <NA>
## 6          6  chahan orange juice

food %>% left_join(drink, by="CustomerId")

##   CustomerId Product        State
## 1          1   sushi         <NA>
## 2          2   ramen    green tea
## 3          3    udon         <NA>
## 4          4  karage        water
## 5          5    soba         <NA>
## 6          6  chahan orange juice

For the right side:

dataxy <- merge(x=food, y=drink, by="CustomerId", all.y=TRUE)
dataxy

##   CustomerId Product        State
## 1          2   ramen    green tea
## 2          4  karage        water
## 3          6  chahan orange juice
## 4          7    <NA>         beer
## 5          8    <NA>         sake

food %>% right_join(drink, by="CustomerId")

##   CustomerId Product        State
## 1          2   ramen    green tea
## 2          4  karage        water
## 3          6  chahan orange juice
## 4          7    <NA>         beer
## 5          8    <NA>         sake

The special case of Cross join; in this case the information of one dataset is copy in every row of another table.

dataxy <- merge(x=food, y=drink, by=NULL)
dataxy

##    CustomerId.x Product CustomerId.y        State
## 1             1   sushi            2    green tea
## 2             2   ramen            2    green tea
## 3             3    udon            2    green tea
## 4             4  karage            2    green tea
## 5             5    soba            2    green tea
## 6             6  chahan            2    green tea
## 7             1   sushi            4        water
## 8             2   ramen            4        water
## 9             3    udon            4        water
## 10            4  karage            4        water
## 11            5    soba            4        water
## 12            6  chahan            4        water
## 13            1   sushi            6 orange juice
## 14            2   ramen            6 orange juice
## 15            3    udon            6 orange juice
## 16            4  karage            6 orange juice
## 17            5    soba            6 orange juice
## 18            6  chahan            6 orange juice
## 19            1   sushi            7         beer
## 20            2   ramen            7         beer
## 21            3    udon            7         beer
## 22            4  karage            7         beer
## 23            5    soba            7         beer
## 24            6  chahan            7         beer
## 25            1   sushi            8         sake
## 26            2   ramen            8         sake
## 27            3    udon            8         sake
## 28            4  karage            8         sake
## 29            5    soba            8         sake
## 30            6  chahan            8         sake

semi join is the case when we only keep the information of the left dataset:

food %>% semi_join(drink, by="CustomerId")

##   CustomerId Product
## 1          2   ramen
## 2          4  karage
## 3          6  chahan

Anti join is the case when we only keep the information that is differenct in both dataset, it is like food-drink information:

food %>% anti_join(drink, by="CustomerId")

##   CustomerId Product
## 1          1   sushi
## 2          3    udon
## 3          5    soba

Applications to our flight dataset:

In this subsection we will apply join functions to our dataset:

names(flights2)

## [1] "year"    "month"   "day"     "hour"    "origin"  "dest"    "tailnum"
## [8] "carrier"

head(flights2)

## # A tibble: 6 x 8
##    year month   day  hour origin dest  tailnum carrier
##   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
## 1  2013     1     1     5 EWR    IAH   N14228  UA     
## 2  2013     1     1     5 LGA    IAH   N24211  UA     
## 3  2013     1     1     5 JFK    MIA   N619AA  AA     
## 4  2013     1     1     5 JFK    BQN   N804JB  B6     
## 5  2013     1     1     6 LGA    ATL   N668DN  DL     
## 6  2013     1     1     5 EWR    ORD   N39463  UA

names(weather)

##  [1] "origin"     "year"       "month"      "day"        "hour"      
##  [6] "temp"       "dewp"       "humid"      "wind_dir"   "wind_speed"
## [11] "wind_gust"  "precip"     "pressure"   "visib"      "time_hour"

flights2 %>% 
  left_join(weather)

## Joining, by = c("year", "month", "day", "hour", "origin")

## # A tibble: 336,776 x 18
##     year month   day  hour origin dest  tailnum carrier  temp  dewp humid
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA       39.0  28.0  64.4
##  2  2013     1     1     5 LGA    IAH   N24211  UA       39.9  25.0  54.8
##  3  2013     1     1     5 JFK    MIA   N619AA  AA       39.0  27.0  61.6
##  4  2013     1     1     5 JFK    BQN   N804JB  B6       39.0  27.0  61.6
##  5  2013     1     1     6 LGA    ATL   N668DN  DL       39.9  25.0  54.8
##  6  2013     1     1     5 EWR    ORD   N39463  UA       39.0  28.0  64.4
##  7  2013     1     1     6 EWR    FLL   N516JB  B6       37.9  28.0  67.2
##  8  2013     1     1     6 LGA    IAD   N829AS  EV       39.9  25.0  54.8
##  9  2013     1     1     6 JFK    MCO   N593JB  B6       37.9  27.0  64.3
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA       39.9  25.0  54.8
## # ... with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
## #   wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>

When we want to specify the variable to make the join:

names(planes)

## [1] "tailnum"      "year"         "type"         "manufacturer" "model"       
## [6] "engines"      "seats"        "speed"        "engine"

names(flights2)

## [1] "year"    "month"   "day"     "hour"    "origin"  "dest"    "tailnum"
## [8] "carrier"

flights2 %>% 
  left_join(planes, by = "tailnum")

## # A tibble: 336,776 x 16
##    year.x month   day  hour origin dest  tailnum carrier year.y type 
##     <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>
##  1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixe~
##  2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixe~
##  3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixe~
##  4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixe~
##  5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixe~
##  6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixe~
##  7   2013     1     1     6 EWR    FLL   N516JB  B6        2000 Fixe~
##  8   2013     1     1     6 LGA    IAD   N829AS  EV        1998 Fixe~
##  9   2013     1     1     6 JFK    MCO   N593JB  B6        2004 Fixe~
## 10   2013     1     1     6 LGA    ORD   N3ALAA  AA          NA <NA> 
## # ... with 336,766 more rows, and 6 more variables: manufacturer <chr>,
## #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>

When we want to use many variables for make the join:

names(airports)

## [1] "faa"   "name"  "lat"   "lon"   "alt"   "tz"    "dst"   "tzone"

names(flights2)

## [1] "year"    "month"   "day"     "hour"    "origin"  "dest"    "tailnum"
## [8] "carrier"

head(airports)

## # A tibble: 6 x 8
##   faa   name                          lat   lon   alt    tz dst   tzone         
##   <chr> <chr>                       <dbl> <dbl> <dbl> <dbl> <chr> <chr>         
## 1 04G   Lansdowne Airport            41.1 -80.6  1044    -5 A     America/New_Y~
## 2 06A   Moton Field Municipal Airp~  32.5 -85.7   264    -6 A     America/Chica~
## 3 06C   Schaumburg Regional          42.0 -88.1   801    -6 A     America/Chica~
## 4 06N   Randall Airport              41.4 -74.4   523    -5 A     America/New_Y~
## 5 09J   Jekyll Island Airport        31.1 -81.4    11    -5 A     America/New_Y~
## 6 0A9   Elizabethton Municipal Air~  36.4 -82.2  1593    -5 A     America/New_Y~

head(flights2)

## # A tibble: 6 x 8
##    year month   day  hour origin dest  tailnum carrier
##   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
## 1  2013     1     1     5 EWR    IAH   N14228  UA     
## 2  2013     1     1     5 LGA    IAH   N24211  UA     
## 3  2013     1     1     5 JFK    MIA   N619AA  AA     
## 4  2013     1     1     5 JFK    BQN   N804JB  B6     
## 5  2013     1     1     6 LGA    ATL   N668DN  DL     
## 6  2013     1     1     5 EWR    ORD   N39463  UA

flights2 %>% 
  left_join(airports, c("dest" = "faa"))

## # A tibble: 336,776 x 15
##     year month   day  hour origin dest  tailnum carrier name    lat   lon   alt
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr> <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA      Geor~  30.0 -95.3    97
##  2  2013     1     1     5 LGA    IAH   N24211  UA      Geor~  30.0 -95.3    97
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      Miam~  25.8 -80.3     8
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      <NA>   NA    NA      NA
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Hart~  33.6 -84.4  1026
##  6  2013     1     1     5 EWR    ORD   N39463  UA      Chic~  42.0 -87.9   668
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      Fort~  26.1 -80.2     9
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      Wash~  38.9 -77.5   313
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      Orla~  28.4 -81.3    96
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      Chic~  42.0 -87.9   668
## # ... with 336,766 more rows, and 3 more variables: tz <dbl>, dst <chr>,
## #   tzone <chr>

flights2 %>% 
  left_join(airports, c("origin" = "faa"))

## # A tibble: 336,776 x 15
##     year month   day  hour origin dest  tailnum carrier name    lat   lon   alt
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr> <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA      Newa~  40.7 -74.2    18
##  2  2013     1     1     5 LGA    IAH   N24211  UA      La G~  40.8 -73.9    22
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      John~  40.6 -73.8    13
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      John~  40.6 -73.8    13
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      La G~  40.8 -73.9    22
##  6  2013     1     1     5 EWR    ORD   N39463  UA      Newa~  40.7 -74.2    18
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      Newa~  40.7 -74.2    18
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      La G~  40.8 -73.9    22
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      John~  40.6 -73.8    13
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      La G~  40.8 -73.9    22
## # ... with 336,766 more rows, and 3 more variables: tz <dbl>, dst <chr>,
## #   tzone <chr>

Now imagine, we want to obtain the information of the top less visited destinations:

top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  head(10)
top_dest

## # A tibble: 10 x 2
##    dest      n
##    <chr> <int>
##  1 ORD   17283
##  2 ATL   17215
##  3 LAX   16174
##  4 BOS   15508
##  5 MCO   14082
##  6 CLT   14064
##  7 SFO   13331
##  8 FLL   12055
##  9 MIA   11728
## 10 DCA    9705

If we want to extract information from those destinations and create a new database, we can make it by two methods:

flights %>% 
  filter(dest %in% top_dest$dest)

## # A tibble: 141,145 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      542            540         2      923            850
##  2  2013     1     1      554            600        -6      812            837
##  3  2013     1     1      554            558        -4      740            728
##  4  2013     1     1      555            600        -5      913            854
##  5  2013     1     1      557            600        -3      838            846
##  6  2013     1     1      558            600        -2      753            745
##  7  2013     1     1      558            600        -2      924            917
##  8  2013     1     1      558            600        -2      923            937
##  9  2013     1     1      559            559         0      702            706
## 10  2013     1     1      600            600         0      851            858
## # ... with 141,135 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

flights %>% 
  semi_join(top_dest)

## Joining, by = "dest"

## # A tibble: 141,145 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      542            540         2      923            850
##  2  2013     1     1      554            600        -6      812            837
##  3  2013     1     1      554            558        -4      740            728
##  4  2013     1     1      555            600        -5      913            854
##  5  2013     1     1      557            600        -3      838            846
##  6  2013     1     1      558            600        -2      753            745
##  7  2013     1     1      558            600        -2      924            917
##  8  2013     1     1      558            600        -2      923            937
##  9  2013     1     1      559            559         0      702            706
## 10  2013     1     1      600            600         0      851            858
## # ... with 141,135 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Working with Strings datasets:

When we want to work with datasets that have string information, we have to be careful to extract or work with the information storage in it.

Lets create some data to work with:

library(tidyverse)
rm(list = ls())
string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'

double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"

class(string1)

## [1] "character"

Datasets:

x <- c("\"", "\\")
y <- c("Augusto", "Ricardo", 'Delgado', "Narro")
x

## [1] "\"" "\\"

## [1] "Augusto" "Ricardo" "Delgado" "Narro"

x <- "\u00b5"
x

## [1] "µ"

z <- c("one", "two", "three")

Some important functions:

# To write the elements in lines or rows:
writeLines(y)

## Augusto
## Ricardo
## Delgado
## Narro

# To count the number of characters includying empty spaces:
str_length(c("a", "R for data science", NA))

## [1]  1 18 NA

str_length(y)

## [1] 7 7 7 5

# To concatenate the string elements. We also can use a separator:
str_c("x", "y")

## [1] "xy"

str_c("x", "y", "z")

## [1] "xyz"

str_c("x", "y", sep = ", ")

## [1] "x, y"

x <- c("abc", NA)
x

## [1] "abc" NA

str_c("|-", y, "-|")

## [1] "|-Augusto-|" "|-Ricardo-|" "|-Delgado-|" "|-Narro-|"

str_c("|-", str_replace_na(x), "-|") #if we want to include NA.

## [1] "|-abc-|" "|-NA-|"

str_c("prefix-", c("a", "b", "c"), "-suffix")

## [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"

Some important functions:

name <- "Augusto"
time_of_day <- "morning"
birthday <- TRUE

str_c(
  "Good ", time_of_day, " ", name,
  if (birthday) " and HAPPY BIRTHDAY",
  "."
)

## [1] "Good morning Augusto and HAPPY BIRTHDAY."

str_c(c("x", "y", "z"), collapse = ", ")

## [1] "x, y, z"

str_c(y, collapse = " ")

## [1] "Augusto Ricardo Delgado Narro"

Subsetting string datasets:

x <- c("Apple", "Banana", "Pear", "Mandarin", "Papaya", "Pineapple")
str_sub(x, 1, 3)

## [1] "App" "Ban" "Pea" "Man" "Pap" "Pin"

str_sub(x, -3, -1)

## [1] "ple" "ana" "ear" "rin" "aya" "ple"

str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x

## [1] "apple"     "banana"    "pear"      "mandarin"  "papaya"    "pineapple"

x <- c("apple", "eggplant", "banana", "tomato", "potatoe", "tofu")
# Sorting our dataset
str_sort(x, locale = "en")  # English

## [1] "apple"    "banana"   "eggplant" "potatoe"  "tofu"     "tomato"

str_sort(x, locale = "haw") # Hawaiian

## [1] "apple"    "eggplant" "banana"   "potatoe"  "tofu"     "tomato"

x <- c("Apple", "Banana", "Pear", "Mandarin", "Papaya", "Pineapple")
str_view(x, "an")

str_view(x, ".a.")

# To create the regular expression, we need \\
dot <- "\\."

# But the expression itself only contains one:
writeLines(dot)

## \.

# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")

x <- "a\\b"
writeLines(x)

## a\b

#$ to match the end of the string.

x <- c("apple", "banana", "pear", "Mandarin", "Papaya", "Pineapple", "Alto")
str_view(x, "^a")

str_view(x, "a$")

x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")

str_view(x, "^apple$")

#Character Classes and Alternatives
#\d: matches any digit.
#\s: matches any whitespace (e.g. space, tab, newline).
#[abc]: matches a, b, or c.
#[^abc]: matches anything except a, b, or c.

# Look for a literal character that normally has special meaning in a regex
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")

str_view(c("abc", "a.c", "a*c", "b*c", "a c"), ".[*]c")

str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")

str_view(c("grey", "gray"), "gr(e|a)y")

#Repetition
#?: 0 or 1
#+: 1 or more
#*: 0 or more

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")

str_view(x, "CC+")

str_view(x, 'C[LX]+')

#{n}: exactly n
#{n,}: n or more
#{,m}: at most m
#{n,m}: between n and m

str_view(x, "C{2}")

str_view(x, "C{2,}")

str_view(x, "C{2,3}")

str_view(x, 'C{2,3}?')

str_view(x, 'C[LX]+?')

Applied Econometrics: Geocomputation and Spatial Methods

Prof. Augusto Delgado (Faculty of Political Science and Economics - Waseda University)