In this class we are going to keep working with out databases from Class 2 and create new databases by merging them according our necessities.
The first step is install the package that we are going to use for this class. In this time we will work with tidyverse, which allow us to use variables from our dataset and create new ones. Also, the dataset we will use is obtained from nycflights13, this package contains information about all flights that departed from New York City and its airports to destinations in the United States, Puerto Rico, and the American Virgin Islands in 2013.
This package provides the following data tables.
flights: all flights that departed from NYC in 2013weather: hourly meterological data for each airportplanes: construction information about each planeairports: airport names and locationsairlines: translation between two letter carrier codes and namesTo install these two package, we must run the following two lines:
#install.packages("nycflights13")
#install.packages("tidyverse")
library(nycflights13)
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Remember that to run effectively those two lines, you must delete the # symbol. Then, the command library() allows us to use the package and have access to the commands or datasets included in that package. Take your time to check what we have access in each dataset.
data(flights)
flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
This flights dataset includes information for:
year, month, day: Date of departure.dep_time, Actual departure time (format HHMM or HMM, HH are hours and MM are minutes), local time zone.arr_time: Actual arrival time (format HHMM or HMM, HH are hours and MM are minutes), local time zone.sched_dep_time: Scheduled departure time (format HHMM or HMM, HH are hours and MM are minutes), local time zone.sched_arr_time: Scheduled arrival time (format HHMM or HMM, HH are hours and MM are minutes), local time zone.dep_delay: Departure delays, in minutes. Negative times represent early departures.arr_delay: Arrival delays, in minutes. Negative times represent early arrivals.carrier: Two letter carrier abbreviation. See airlines to get name.flight: Flight number.tailnum: Plane tail number. See planes for additional metadata.origin, dest: Origin and destination, respectively. See airports for additional metadata.air_time: Amount of time spent in the air, in minutes.distance: Distance between airports, in miles.hour, minute: Time of scheduled departure broken into hour and minutes.time_hour: Scheduled date and hour of the flight as a POSIXct date. Along with origin, can be used to join flights data to weather data.data(weather)
weather
## # A tibble: 26,115 x 15
## origin year month day hour temp dewp humid wind_dir wind_speed
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
## 7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0
## 8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4
## 9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0
## 10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8
## # ... with 26,105 more rows, and 5 more variables: wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
This weather dataset includes information for:
origin: Weather station. Named origin to facilitate merging with flights data.year, month, day, hour: Time of recording.temp, dewp: Temperature and dewpoint in F.humid: Relative humidity.wind_dir, wind_speed, wind_gust: Wind direction (in degrees), speed and gust speed (in mph).precip: Precipitation, in inches.pressure: Sea level pressure in millibars.visib: Visibility in miles.time_hour: Date and hour of the recording as a POSIXct date.data(planes)
planes
## # A tibble: 3,322 x 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wing m~ EMBRAER EMB-1~ 2 55 NA Turbo-~
## 2 N102UW 1998 Fixed wing m~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## 3 N103US 1999 Fixed wing m~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## 4 N104UW 1999 Fixed wing m~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## 5 N10575 2002 Fixed wing m~ EMBRAER EMB-1~ 2 55 NA Turbo-~
## 6 N105UW 1999 Fixed wing m~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## 7 N107US 1999 Fixed wing m~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## 8 N108UW 1999 Fixed wing m~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## 9 N109UW 1999 Fixed wing m~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## 10 N110UW 1999 Fixed wing m~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## # ... with 3,312 more rows
This planes dataset includes information for:
tailnum: Tail number.year: Year manufactured.type: Type of plane.manufacturer, model: Manufacturer and model.engines, seats: Number of engines and seats.speed: Average cruising speed in mph.engine: Type of engine.data(airports)
airports
## # A tibble: 1,458 x 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_Yo~
## 2 06A Moton Field Municipal A~ 32.5 -85.7 264 -6 A America/Chicago
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chicago
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/New_Yo~
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/New_Yo~
## 6 0A9 Elizabethton Municipal ~ 36.4 -82.2 1593 -5 A America/New_Yo~
## 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/New_Yo~
## 8 0G7 Finger Lakes Regional A~ 42.9 -76.8 492 -5 A America/New_Yo~
## 9 0P2 Shoestring Aviation Air~ 39.8 -76.6 1000 -5 U America/New_Yo~
## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/Los_An~
## # ... with 1,448 more rows
This airports dataset includes information for:
faa: FAA airport code.name: Usual name of the aiport.lat, lon: Location of airport, latitude and longitude.alt: Altitude, in feet.tz: Timezone offset from GMT.dst: Daylight savings time zone. A = Standard US DST: starts on the second Sunday of March, ends on the first Sunday of November. U = unknown. N = no dst.tzone: IANA time zone, as determined by GeoNames webservice.data(airlines)
airlines
## # A tibble: 16 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
This airlines dataset includes information for:
carrier: Two letter abbreviation.name: Full name.In this time we will relate different databases, create new variables with mutate or filtering part of our dataset. Our first tasks are:
The first thing to do is be sure that our identifiers are unique, for that reason we will count them:
planes %>%
count(tailnum) %>%
filter(n > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: tailnum <chr>, n <int>
weather %>%
count(year, month, day, hour, origin) %>%
filter(n > 1)
## # A tibble: 3 x 6
## year month day hour origin n
## <int> <int> <int> <int> <chr> <int>
## 1 2013 11 3 1 EWR 2
## 2 2013 11 3 1 JFK 2
## 3 2013 11 3 1 LGA 2
flights %>%
count(year, month, day, flight) %>%
filter(n > 1)
## # A tibble: 29,768 x 5
## year month day flight n
## <int> <int> <int> <int> <int>
## 1 2013 1 1 1 2
## 2 2013 1 1 3 2
## 3 2013 1 1 4 2
## 4 2013 1 1 11 3
## 5 2013 1 1 15 2
## 6 2013 1 1 21 2
## 7 2013 1 1 27 4
## 8 2013 1 1 31 2
## 9 2013 1 1 32 2
## 10 2013 1 1 35 2
## # ... with 29,758 more rows
flights %>%
count(year, month, day, tailnum) %>%
filter(n > 1)
## # A tibble: 64,928 x 5
## year month day tailnum n
## <int> <int> <int> <chr> <int>
## 1 2013 1 1 N0EGMQ 2
## 2 2013 1 1 N11189 2
## 3 2013 1 1 N11536 2
## 4 2013 1 1 N11544 3
## 5 2013 1 1 N11551 2
## 6 2013 1 1 N12540 2
## 7 2013 1 1 N12567 2
## 8 2013 1 1 N13123 2
## 9 2013 1 1 N13538 3
## 10 2013 1 1 N13566 3
## # ... with 64,918 more rows
To join two or more databases, first we need to understand the methods we can use for it. In R, there are two way to join two datasets, one it is by using join comands and the other is by using mutate.
# First, we create a new dataset to use it in our join without modify the original one.
names(flights)
## [1] "year" "month" "day" "dep_time"
## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "carrier" "flight" "tailnum"
## [13] "origin" "dest" "air_time" "distance"
## [17] "hour" "minute" "time_hour"
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
flights2
## # A tibble: 336,776 x 8
## year month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA
## 2 2013 1 1 5 LGA IAH N24211 UA
## 3 2013 1 1 5 JFK MIA N619AA AA
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL
## 6 2013 1 1 5 EWR ORD N39463 UA
## 7 2013 1 1 6 EWR FLL N516JB B6
## 8 2013 1 1 6 LGA IAD N829AS EV
## 9 2013 1 1 6 JFK MCO N593JB B6
## 10 2013 1 1 6 LGA ORD N3ALAA AA
## # ... with 336,766 more rows
# By using join functions
flights2 %>%
select(-origin, -dest) %>%
left_join(airlines, by = "carrier")
## # A tibble: 336,776 x 7
## year month day hour tailnum carrier name
## <int> <int> <int> <dbl> <chr> <chr> <chr>
## 1 2013 1 1 5 N14228 UA United Air Lines Inc.
## 2 2013 1 1 5 N24211 UA United Air Lines Inc.
## 3 2013 1 1 5 N619AA AA American Airlines Inc.
## 4 2013 1 1 5 N804JB B6 JetBlue Airways
## 5 2013 1 1 6 N668DN DL Delta Air Lines Inc.
## 6 2013 1 1 5 N39463 UA United Air Lines Inc.
## 7 2013 1 1 6 N516JB B6 JetBlue Airways
## 8 2013 1 1 6 N829AS EV ExpressJet Airlines Inc.
## 9 2013 1 1 6 N593JB B6 JetBlue Airways
## 10 2013 1 1 6 N3ALAA AA American Airlines Inc.
## # ... with 336,766 more rows
# By using mutate functions
flights2 %>%
select(-origin, -dest) %>%
mutate(name = airlines$name[match(carrier, airlines$carrier)])
## # A tibble: 336,776 x 7
## year month day hour tailnum carrier name
## <int> <int> <int> <dbl> <chr> <chr> <chr>
## 1 2013 1 1 5 N14228 UA United Air Lines Inc.
## 2 2013 1 1 5 N24211 UA United Air Lines Inc.
## 3 2013 1 1 5 N619AA AA American Airlines Inc.
## 4 2013 1 1 5 N804JB B6 JetBlue Airways
## 5 2013 1 1 6 N668DN DL Delta Air Lines Inc.
## 6 2013 1 1 5 N39463 UA United Air Lines Inc.
## 7 2013 1 1 6 N516JB B6 JetBlue Airways
## 8 2013 1 1 6 N829AS EV ExpressJet Airlines Inc.
## 9 2013 1 1 6 N593JB B6 JetBlue Airways
## 10 2013 1 1 6 N3ALAA AA American Airlines Inc.
## # ... with 336,766 more rows
The command we will explore in this time are: inner_join, full_join, left_join, semi_join, anti_join, etc.. Lets create a temporal datasets to understand how our functions work.
food <- data.frame(CustomerId = c(1:6),
Product = c("sushi","ramen",
"udon","karage",
"soba","chahan"))
drink <- data.frame(CustomerId = c(2, 4, 6, 7, 8),
State = c("green tea","water",
"orange juice","beer",
"sake"))
food
## CustomerId Product
## 1 1 sushi
## 2 2 ramen
## 3 3 udon
## 4 4 karage
## 5 5 soba
## 6 6 chahan
drink
## CustomerId State
## 1 2 green tea
## 2 4 water
## 3 6 orange juice
## 4 7 beer
## 5 8 sake
It is important to understand that in INNER JOIN functions we only keep the variable that match with the requirements, the others are ignored. In R, you have two functions useful for this purpose: join (used this time), and merge (you can check in R help documents). For this last function, the syntax is as follow:
merge(x, y, by.x, by.y, all.x, all.y), where: + x and y are the dataset names. + by.x and by.y are the names of the columns that are common in both datasets to make the merge. If it is not listed, R will use the columns that have the same names. + all.x and all.y are the options that specify the type of merge we are using. By default is all=FALSE.
For this example we will use both types of join functions:
dataxy <- merge(x=food, y=drink, by="CustomerId")
dataxy
## CustomerId Product State
## 1 2 ramen green tea
## 2 4 karage water
## 3 6 chahan orange juice
library(dplyr)
food %>% inner_join(drink, by="CustomerId")
## CustomerId Product State
## 1 2 ramen green tea
## 2 4 karage water
## 3 6 chahan orange juice
For OUTER JOIN the function will return the information for all individuals in our dataset. In the case of using merge() function, we have to make it clear to put the option all=TRUE. In the case of JOIN function, we can use full_join.
dataxy <- merge(x=food, y=drink, by="CustomerId", all=TRUE)
dataxy
## CustomerId Product State
## 1 1 sushi <NA>
## 2 2 ramen green tea
## 3 3 udon <NA>
## 4 4 karage water
## 5 5 soba <NA>
## 6 6 chahan orange juice
## 7 7 <NA> beer
## 8 8 <NA> sake
food %>% full_join(drink, by="CustomerId")
## CustomerId Product State
## 1 1 sushi <NA>
## 2 2 ramen green tea
## 3 3 udon <NA>
## 4 4 karage water
## 5 5 soba <NA>
## 6 6 chahan orange juice
## 7 7 <NA> beer
## 8 8 <NA> sake
However, in some cases we need to keep the information only from one of our dataset rather than all the information from both of them, or eliminate them. in this case, we case use the following functions and options:
dataxy <- merge(x=food, y=drink, by="CustomerId", all.x=TRUE)
dataxy
## CustomerId Product State
## 1 1 sushi <NA>
## 2 2 ramen green tea
## 3 3 udon <NA>
## 4 4 karage water
## 5 5 soba <NA>
## 6 6 chahan orange juice
food %>% left_join(drink, by="CustomerId")
## CustomerId Product State
## 1 1 sushi <NA>
## 2 2 ramen green tea
## 3 3 udon <NA>
## 4 4 karage water
## 5 5 soba <NA>
## 6 6 chahan orange juice
For the right side:
dataxy <- merge(x=food, y=drink, by="CustomerId", all.y=TRUE)
dataxy
## CustomerId Product State
## 1 2 ramen green tea
## 2 4 karage water
## 3 6 chahan orange juice
## 4 7 <NA> beer
## 5 8 <NA> sake
food %>% right_join(drink, by="CustomerId")
## CustomerId Product State
## 1 2 ramen green tea
## 2 4 karage water
## 3 6 chahan orange juice
## 4 7 <NA> beer
## 5 8 <NA> sake
The special case of Cross join; in this case the information of one dataset is copy in every row of another table.
dataxy <- merge(x=food, y=drink, by=NULL)
dataxy
## CustomerId.x Product CustomerId.y State
## 1 1 sushi 2 green tea
## 2 2 ramen 2 green tea
## 3 3 udon 2 green tea
## 4 4 karage 2 green tea
## 5 5 soba 2 green tea
## 6 6 chahan 2 green tea
## 7 1 sushi 4 water
## 8 2 ramen 4 water
## 9 3 udon 4 water
## 10 4 karage 4 water
## 11 5 soba 4 water
## 12 6 chahan 4 water
## 13 1 sushi 6 orange juice
## 14 2 ramen 6 orange juice
## 15 3 udon 6 orange juice
## 16 4 karage 6 orange juice
## 17 5 soba 6 orange juice
## 18 6 chahan 6 orange juice
## 19 1 sushi 7 beer
## 20 2 ramen 7 beer
## 21 3 udon 7 beer
## 22 4 karage 7 beer
## 23 5 soba 7 beer
## 24 6 chahan 7 beer
## 25 1 sushi 8 sake
## 26 2 ramen 8 sake
## 27 3 udon 8 sake
## 28 4 karage 8 sake
## 29 5 soba 8 sake
## 30 6 chahan 8 sake
semi join is the case when we only keep the information of the left dataset:
food %>% semi_join(drink, by="CustomerId")
## CustomerId Product
## 1 2 ramen
## 2 4 karage
## 3 6 chahan
Anti join is the case when we only keep the information that is differenct in both dataset, it is like food-drink information:
food %>% anti_join(drink, by="CustomerId")
## CustomerId Product
## 1 1 sushi
## 2 3 udon
## 3 5 soba
In this subsection we will apply join functions to our dataset:
names(flights2)
## [1] "year" "month" "day" "hour" "origin" "dest" "tailnum"
## [8] "carrier"
head(flights2)
## # A tibble: 6 x 8
## year month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA
## 2 2013 1 1 5 LGA IAH N24211 UA
## 3 2013 1 1 5 JFK MIA N619AA AA
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL
## 6 2013 1 1 5 EWR ORD N39463 UA
names(weather)
## [1] "origin" "year" "month" "day" "hour"
## [6] "temp" "dewp" "humid" "wind_dir" "wind_speed"
## [11] "wind_gust" "precip" "pressure" "visib" "time_hour"
flights2 %>%
left_join(weather)
## Joining, by = c("year", "month", "day", "hour", "origin")
## # A tibble: 336,776 x 18
## year month day hour origin dest tailnum carrier temp dewp humid
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH N14228 UA 39.0 28.0 64.4
## 2 2013 1 1 5 LGA IAH N24211 UA 39.9 25.0 54.8
## 3 2013 1 1 5 JFK MIA N619AA AA 39.0 27.0 61.6
## 4 2013 1 1 5 JFK BQN N804JB B6 39.0 27.0 61.6
## 5 2013 1 1 6 LGA ATL N668DN DL 39.9 25.0 54.8
## 6 2013 1 1 5 EWR ORD N39463 UA 39.0 28.0 64.4
## 7 2013 1 1 6 EWR FLL N516JB B6 37.9 28.0 67.2
## 8 2013 1 1 6 LGA IAD N829AS EV 39.9 25.0 54.8
## 9 2013 1 1 6 JFK MCO N593JB B6 37.9 27.0 64.3
## 10 2013 1 1 6 LGA ORD N3ALAA AA 39.9 25.0 54.8
## # ... with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
## # wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## # visib <dbl>, time_hour <dttm>
When we want to specify the variable to make the join:
names(planes)
## [1] "tailnum" "year" "type" "manufacturer" "model"
## [6] "engines" "seats" "speed" "engine"
names(flights2)
## [1] "year" "month" "day" "hour" "origin" "dest" "tailnum"
## [8] "carrier"
flights2 %>%
left_join(planes, by = "tailnum")
## # A tibble: 336,776 x 16
## year.x month day hour origin dest tailnum carrier year.y type
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <int> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA 1999 Fixe~
## 2 2013 1 1 5 LGA IAH N24211 UA 1998 Fixe~
## 3 2013 1 1 5 JFK MIA N619AA AA 1990 Fixe~
## 4 2013 1 1 5 JFK BQN N804JB B6 2012 Fixe~
## 5 2013 1 1 6 LGA ATL N668DN DL 1991 Fixe~
## 6 2013 1 1 5 EWR ORD N39463 UA 2012 Fixe~
## 7 2013 1 1 6 EWR FLL N516JB B6 2000 Fixe~
## 8 2013 1 1 6 LGA IAD N829AS EV 1998 Fixe~
## 9 2013 1 1 6 JFK MCO N593JB B6 2004 Fixe~
## 10 2013 1 1 6 LGA ORD N3ALAA AA NA <NA>
## # ... with 336,766 more rows, and 6 more variables: manufacturer <chr>,
## # model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
When we want to use many variables for make the join:
names(airports)
## [1] "faa" "name" "lat" "lon" "alt" "tz" "dst" "tzone"
names(flights2)
## [1] "year" "month" "day" "hour" "origin" "dest" "tailnum"
## [8] "carrier"
head(airports)
## # A tibble: 6 x 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_Y~
## 2 06A Moton Field Municipal Airp~ 32.5 -85.7 264 -6 A America/Chica~
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chica~
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/New_Y~
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/New_Y~
## 6 0A9 Elizabethton Municipal Air~ 36.4 -82.2 1593 -5 A America/New_Y~
head(flights2)
## # A tibble: 6 x 8
## year month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA
## 2 2013 1 1 5 LGA IAH N24211 UA
## 3 2013 1 1 5 JFK MIA N619AA AA
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL
## 6 2013 1 1 5 EWR ORD N39463 UA
flights2 %>%
left_join(airports, c("dest" = "faa"))
## # A tibble: 336,776 x 15
## year month day hour origin dest tailnum carrier name lat lon alt
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH N14228 UA Geor~ 30.0 -95.3 97
## 2 2013 1 1 5 LGA IAH N24211 UA Geor~ 30.0 -95.3 97
## 3 2013 1 1 5 JFK MIA N619AA AA Miam~ 25.8 -80.3 8
## 4 2013 1 1 5 JFK BQN N804JB B6 <NA> NA NA NA
## 5 2013 1 1 6 LGA ATL N668DN DL Hart~ 33.6 -84.4 1026
## 6 2013 1 1 5 EWR ORD N39463 UA Chic~ 42.0 -87.9 668
## 7 2013 1 1 6 EWR FLL N516JB B6 Fort~ 26.1 -80.2 9
## 8 2013 1 1 6 LGA IAD N829AS EV Wash~ 38.9 -77.5 313
## 9 2013 1 1 6 JFK MCO N593JB B6 Orla~ 28.4 -81.3 96
## 10 2013 1 1 6 LGA ORD N3ALAA AA Chic~ 42.0 -87.9 668
## # ... with 336,766 more rows, and 3 more variables: tz <dbl>, dst <chr>,
## # tzone <chr>
flights2 %>%
left_join(airports, c("origin" = "faa"))
## # A tibble: 336,776 x 15
## year month day hour origin dest tailnum carrier name lat lon alt
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH N14228 UA Newa~ 40.7 -74.2 18
## 2 2013 1 1 5 LGA IAH N24211 UA La G~ 40.8 -73.9 22
## 3 2013 1 1 5 JFK MIA N619AA AA John~ 40.6 -73.8 13
## 4 2013 1 1 5 JFK BQN N804JB B6 John~ 40.6 -73.8 13
## 5 2013 1 1 6 LGA ATL N668DN DL La G~ 40.8 -73.9 22
## 6 2013 1 1 5 EWR ORD N39463 UA Newa~ 40.7 -74.2 18
## 7 2013 1 1 6 EWR FLL N516JB B6 Newa~ 40.7 -74.2 18
## 8 2013 1 1 6 LGA IAD N829AS EV La G~ 40.8 -73.9 22
## 9 2013 1 1 6 JFK MCO N593JB B6 John~ 40.6 -73.8 13
## 10 2013 1 1 6 LGA ORD N3ALAA AA La G~ 40.8 -73.9 22
## # ... with 336,766 more rows, and 3 more variables: tz <dbl>, dst <chr>,
## # tzone <chr>
Now imagine, we want to obtain the information of the top less visited destinations:
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
head(10)
top_dest
## # A tibble: 10 x 2
## dest n
## <chr> <int>
## 1 ORD 17283
## 2 ATL 17215
## 3 LAX 16174
## 4 BOS 15508
## 5 MCO 14082
## 6 CLT 14064
## 7 SFO 13331
## 8 FLL 12055
## 9 MIA 11728
## 10 DCA 9705
If we want to extract information from those destinations and create a new database, we can make it by two methods:
flights %>%
filter(dest %in% top_dest$dest)
## # A tibble: 141,145 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 542 540 2 923 850
## 2 2013 1 1 554 600 -6 812 837
## 3 2013 1 1 554 558 -4 740 728
## 4 2013 1 1 555 600 -5 913 854
## 5 2013 1 1 557 600 -3 838 846
## 6 2013 1 1 558 600 -2 753 745
## 7 2013 1 1 558 600 -2 924 917
## 8 2013 1 1 558 600 -2 923 937
## 9 2013 1 1 559 559 0 702 706
## 10 2013 1 1 600 600 0 851 858
## # ... with 141,135 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
flights %>%
semi_join(top_dest)
## Joining, by = "dest"
## # A tibble: 141,145 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 542 540 2 923 850
## 2 2013 1 1 554 600 -6 812 837
## 3 2013 1 1 554 558 -4 740 728
## 4 2013 1 1 555 600 -5 913 854
## 5 2013 1 1 557 600 -3 838 846
## 6 2013 1 1 558 600 -2 753 745
## 7 2013 1 1 558 600 -2 924 917
## 8 2013 1 1 558 600 -2 923 937
## 9 2013 1 1 559 559 0 702 706
## 10 2013 1 1 600 600 0 851 858
## # ... with 141,135 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
When we want to work with datasets that have string information, we have to be careful to extract or work with the information storage in it.
Lets create some data to work with:
library(tidyverse)
rm(list = ls())
string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
class(string1)
## [1] "character"
Datasets:
x <- c("\"", "\\")
y <- c("Augusto", "Ricardo", 'Delgado', "Narro")
x
## [1] "\"" "\\"
y
## [1] "Augusto" "Ricardo" "Delgado" "Narro"
x <- "\u00b5"
x
## [1] "µ"
z <- c("one", "two", "three")
Some important functions:
# To write the elements in lines or rows:
writeLines(y)
## Augusto
## Ricardo
## Delgado
## Narro
# To count the number of characters includying empty spaces:
str_length(c("a", "R for data science", NA))
## [1] 1 18 NA
str_length(y)
## [1] 7 7 7 5
# To concatenate the string elements. We also can use a separator:
str_c("x", "y")
## [1] "xy"
str_c("x", "y", "z")
## [1] "xyz"
str_c("x", "y", sep = ", ")
## [1] "x, y"
x <- c("abc", NA)
x
## [1] "abc" NA
str_c("|-", y, "-|")
## [1] "|-Augusto-|" "|-Ricardo-|" "|-Delgado-|" "|-Narro-|"
str_c("|-", str_replace_na(x), "-|") #if we want to include NA.
## [1] "|-abc-|" "|-NA-|"
str_c("prefix-", c("a", "b", "c"), "-suffix")
## [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"
Some important functions:
name <- "Augusto"
time_of_day <- "morning"
birthday <- TRUE
str_c(
"Good ", time_of_day, " ", name,
if (birthday) " and HAPPY BIRTHDAY",
"."
)
## [1] "Good morning Augusto and HAPPY BIRTHDAY."
str_c(c("x", "y", "z"), collapse = ", ")
## [1] "x, y, z"
str_c(y, collapse = " ")
## [1] "Augusto Ricardo Delgado Narro"
Subsetting string datasets:
x <- c("Apple", "Banana", "Pear", "Mandarin", "Papaya", "Pineapple")
str_sub(x, 1, 3)
## [1] "App" "Ban" "Pea" "Man" "Pap" "Pin"
str_sub(x, -3, -1)
## [1] "ple" "ana" "ear" "rin" "aya" "ple"
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
## [1] "apple" "banana" "pear" "mandarin" "papaya" "pineapple"
x <- c("apple", "eggplant", "banana", "tomato", "potatoe", "tofu")
# Sorting our dataset
str_sort(x, locale = "en") # English
## [1] "apple" "banana" "eggplant" "potatoe" "tofu" "tomato"
str_sort(x, locale = "haw") # Hawaiian
## [1] "apple" "eggplant" "banana" "potatoe" "tofu" "tomato"
x <- c("Apple", "Banana", "Pear", "Mandarin", "Papaya", "Pineapple")
str_view(x, "an")
str_view(x, ".a.")
# To create the regular expression, we need \\
dot <- "\\."
# But the expression itself only contains one:
writeLines(dot)
## \.
# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")
x <- "a\\b"
writeLines(x)
## a\b
#$ to match the end of the string.
x <- c("apple", "banana", "pear", "Mandarin", "Papaya", "Pineapple", "Alto")
str_view(x, "^a")
str_view(x, "a$")
x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")
str_view(x, "^apple$")
#Character Classes and Alternatives
#\d: matches any digit.
#\s: matches any whitespace (e.g. space, tab, newline).
#[abc]: matches a, b, or c.
#[^abc]: matches anything except a, b, or c.
# Look for a literal character that normally has special meaning in a regex
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
str_view(c("abc", "a.c", "a*c", "b*c", "a c"), ".[*]c")
str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")
str_view(c("grey", "gray"), "gr(e|a)y")
#Repetition
#?: 0 or 1
#+: 1 or more
#*: 0 or more
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
str_view(x, "CC+")
str_view(x, 'C[LX]+')
#{n}: exactly n
#{n,}: n or more
#{,m}: at most m
#{n,m}: between n and m
str_view(x, "C{2}")
str_view(x, "C{2,}")
str_view(x, "C{2,3}")
str_view(x, 'C{2,3}?')
str_view(x, 'C[LX]+?')