Import two related datasets from the TidyTuesday project.
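library(dplyr) # provides %>%, select(), sample_n(), and the join functions
# Read the two TidyTuesday files (paths as used in this project)
full_trains <- read.csv("../00_data/data/full_trains")
small_trains <- read.csv("../00_data/data/small_trains")
# Keep only the three columns of interest; the 10-row samples are redrawn with a seed further down
full_trains_small <- full_trains %>% select(year, num_late_at_departure, departure_station) %>% sample_n(10)
small_trains_small <- small_trains %>% select(year, num_late_at_departure, departure_station)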
Describe the two datasets: After selecting columns, both datasets contain the year, the number of late departures, and the departure station. With them I can check whether there is one specific station from which trains depart late especially often.
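One way to run the check described above is to total the late departures per station and sort the result. The following is only a minimal sketch: the data frame `trains` is a hypothetical stand-in built from a few rows of the sample output shown in this report, not the full dataset.

```r
library(dplyr)

# Hypothetical stand-in data: five rows copied from the printed sample
trains <- data.frame(
  year = c(2016, 2018, 2017, 2016, 2017),
  num_late_at_departure = c(6, 52, 29, 23, 33),
  departure_station = c("PARIS MONTPARNASSE", "PARIS MONTPARNASSE",
                        "PARIS MONTPARNASSE", "PARIS LYON", "PARIS LYON")
)

# Sum the late departures per station, largest total first
late_by_station <- trains %>%
  group_by(departure_station) %>%
  summarise(total_late = sum(num_late_at_departure), .groups = "drop") %>%
  arrange(desc(total_late))

late_by_station
```

On the real data the same pipeline would be applied to full_trains or small_trains instead of the stand-in.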
Data 1
Data 2
set.seed(1234)
full_trains_small <- full_trains %>% select(year,num_late_at_departure,departure_station) %>% sample_n(10)
small_trains_small <- small_trains %>% select(year,num_late_at_departure,departure_station) %>% sample_n(10)
full_trains_small
## year num_late_at_departure departure_station
## 1 2017 14 GRENOBLE
## 2 2015 9 BESANCON FRANCHE COMTE TGV
## 3 2016 6 PARIS MONTPARNASSE
## 4 2015 26 NANTES
## 5 2018 52 PARIS MONTPARNASSE
## 6 2017 29 PARIS MONTPARNASSE
## 7 2016 23 PARIS LYON
## 8 2017 33 PARIS LYON
## 9 2017 43 PARIS NORD
## 10 2017 18 PARIS MONTPARNASSE
small_trains_small
## year num_late_at_departure departure_station
## 1 2015 11 PARIS LYON
## 2 2018 18 PARIS LYON
## 3 2018 6 BREST
## 4 2017 13 PARIS LYON
## 5 2015 4 PARIS MONTPARNASSE
## 6 2017 2 SAINT ETIENNE CHATEAUCREUX
## 7 2018 27 RENNES
## 8 2018 65 REIMS
## 9 2017 31 PARIS MONTPARNASSE
## 10 2016 4 ANNECY
Describe the resulting data: The result shows the number of late departures at PARIS MONTPARNASSE and PARIS LYON, the only stations that appear with the same year in both samples. For example, in 2017 the full_trains sample records 33 late departures from PARIS LYON, while the small_trains sample records only 13.
How is it different from the original two datasets? It is a very small dataset. It compares the number of late departures in full_trains_small with those in small_trains_small for the same year and departure station.
full_trains_small %>% inner_join(small_trains_small, join_by("year","departure_station"))
## year num_late_at_departure.x departure_station num_late_at_departure.y
## 1 2017 29 PARIS MONTPARNASSE 31
## 2 2017 33 PARIS LYON 13
## 3 2017 18 PARIS MONTPARNASSE 31
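In the inner_join output above, the small_trains_small value 31 appears twice because two rows of full_trains_small share the key (2017, PARIS MONTPARNASSE). The sketch below rebuilds just the matching rows by hand (the names `full_s`, `small_s`, `joined` and `dup_keys` are made up for illustration) and counts the keys to show where the repetition comes from.

```r
library(dplyr)

# Hand-built stand-ins containing only the rows that matched in the output above
full_s <- data.frame(
  year = c(2017, 2017, 2017),
  num_late_at_departure = c(29, 33, 18),
  departure_station = c("PARIS MONTPARNASSE", "PARIS LYON", "PARIS MONTPARNASSE")
)
small_s <- data.frame(
  year = c(2017, 2017),
  num_late_at_departure = c(13, 31),
  departure_station = c("PARIS LYON", "PARIS MONTPARNASSE")
)

# Each row of full_s is paired with the small_s row that has the same key
joined <- inner_join(full_s, small_s, by = c("year", "departure_station"))

# Keys that occur more than once in full_s explain the repeated small_s value
dup_keys <- full_s %>%
  count(year, departure_station) %>%
  filter(n > 1)

joined
dup_keys
```

Counting keys before joining is a quick way to tell whether a join will repeat rows from the other table.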
Describe the resulting data: left_join keeps all observations in x, here all rows of full_trains_small, and adds the matching values from small_trains_small. Because no by argument is given, all three shared columns serve as join keys, so no new columns are added and the output looks the same as full_trains_small. The highest number of late departures in it is 52, at PARIS MONTPARNASSE in 2018.
How is it different from the original two datasets? It contains only the year, departure station, and number of late departures from the sample data. That is why there are only 10 rows: the sample size is 10.
full_trains_small %>% left_join(small_trains_small)
## Joining with `by = join_by(year, num_late_at_departure, departure_station)`
## year num_late_at_departure departure_station
## 1 2017 14 GRENOBLE
## 2 2015 9 BESANCON FRANCHE COMTE TGV
## 3 2016 6 PARIS MONTPARNASSE
## 4 2015 26 NANTES
## 5 2018 52 PARIS MONTPARNASSE
## 6 2017 29 PARIS MONTPARNASSE
## 7 2016 23 PARIS LYON
## 8 2017 33 PARIS LYON
## 9 2017 43 PARIS NORD
## 10 2017 18 PARIS MONTPARNASSE
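The "Joining with" message above shows that, without a by argument, every shared column is used as a key, including num_late_at_departure. A minimal sketch of the difference an explicit key makes, using a few hand-copied rows from the samples (the names `full_s`, `small_s`, `natural` and `keyed` are made up for illustration):

```r
library(dplyr)

# A few rows copied from the printed samples, to keep the sketch short
full_s <- data.frame(
  year = c(2017, 2017, 2018),
  num_late_at_departure = c(33, 43, 52),
  departure_station = c("PARIS LYON", "PARIS NORD", "PARIS MONTPARNASSE")
)
small_s <- data.frame(
  year = c(2017, 2018),
  num_late_at_departure = c(13, 65),
  departure_station = c("PARIS LYON", "REIMS")
)

# Natural join: all three shared columns are keys, so no columns are added
natural <- left_join(full_s, small_s)

# Explicit keys: num_late_at_departure becomes a value column (.x and .y)
keyed <- left_join(full_s, small_s, by = c("year", "departure_station"))

keyed
```

With explicit keys, rows of `full_s` without a partner get NA in num_late_at_departure.y, which makes matches and non-matches visible.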
Describe the resulting data: Same as left_join, but now all rows of small_trains_small are kept and the matching data from full_trains_small is added. The highest number of late departures here is 65, at REIMS in 2018.
How is it different from the original two datasets? Only 10 rows because of the sample size. The output contains only the columns we selected.
full_trains_small %>% right_join(small_trains_small)
## Joining with `by = join_by(year, num_late_at_departure, departure_station)`
## year num_late_at_departure departure_station
## 1 2015 11 PARIS LYON
## 2 2018 18 PARIS LYON
## 3 2018 6 BREST
## 4 2017 13 PARIS LYON
## 5 2015 4 PARIS MONTPARNASSE
## 6 2017 2 SAINT ETIENNE CHATEAUCREUX
## 7 2018 27 RENNES
## 8 2018 65 REIMS
## 9 2017 31 PARIS MONTPARNASSE
## 10 2016 4 ANNECY
Describe the resulting data: full_join keeps all observations from both datasets, so the output contains every row of full_trains_small and of small_trains_small, 20 rows in total. However, no rows are combined, which suggests that no row matches on all three columns.
How is it different from the original two datasets? It is a small dataset with only 3 columns.
full_trains_small %>% full_join(small_trains_small)
## Joining with `by = join_by(year, num_late_at_departure, departure_station)`
## year num_late_at_departure departure_station
## 1 2017 14 GRENOBLE
## 2 2015 9 BESANCON FRANCHE COMTE TGV
## 3 2016 6 PARIS MONTPARNASSE
## 4 2015 26 NANTES
## 5 2018 52 PARIS MONTPARNASSE
## 6 2017 29 PARIS MONTPARNASSE
## 7 2016 23 PARIS LYON
## 8 2017 33 PARIS LYON
## 9 2017 43 PARIS NORD
## 10 2017 18 PARIS MONTPARNASSE
## 11 2015 11 PARIS LYON
## 12 2018 18 PARIS LYON
## 13 2018 6 BREST
## 14 2017 13 PARIS LYON
## 15 2015 4 PARIS MONTPARNASSE
## 16 2017 2 SAINT ETIENNE CHATEAUCREUX
## 17 2018 27 RENNES
## 18 2018 65 REIMS
## 19 2017 31 PARIS MONTPARNASSE
## 20 2016 4 ANNECY
Describe the resulting data: semi_join returns no rows. No row in small_trains_small matches a row in full_trains_small on all three columns. This agrees with what I observed with full_join before.
How is it different from the original two datasets? In the samples of 10 rows each, there is no matching data. Therefore semi_join returns an empty result.
full_trains_small %>% semi_join(small_trains_small)
## Joining with `by = join_by(year, num_late_at_departure, departure_station)`
## [1] year num_late_at_departure departure_station
## <0 rows> (or 0-length row.names)
Describe the resulting data: anti_join returns all rows from full_trains_small that have no matching values in small_trains_small. Again, this confirms the observation that there are no matching rows in our sample data.
How is it different from the original two datasets? It returns all of full_trains_small, since there are no matching entries in small_trains_small.
full_trains_small %>% anti_join(small_trains_small)
## Joining with `by = join_by(year, num_late_at_departure, departure_station)`
## year num_late_at_departure departure_station
## 1 2017 14 GRENOBLE
## 2 2015 9 BESANCON FRANCHE COMTE TGV
## 3 2016 6 PARIS MONTPARNASSE
## 4 2015 26 NANTES
## 5 2018 52 PARIS MONTPARNASSE
## 6 2017 29 PARIS MONTPARNASSE
## 7 2016 23 PARIS LYON
## 8 2017 33 PARIS LYON
## 9 2017 43 PARIS NORD
## 10 2017 18 PARIS MONTPARNASSE
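The semi_join and anti_join calls above match on all three shared columns, which is why one comes back empty and the other returns everything. As a hedged sketch, the same two filtering joins with year and departure_station as the only keys behave as follows on a few hand-copied rows (`full_s`, `small_s`, `keys`, `matched` and `unmatched` are illustrative names):

```r
library(dplyr)

# A few rows copied from the printed samples
full_s <- data.frame(
  year = c(2017, 2017, 2015),
  num_late_at_departure = c(33, 43, 26),
  departure_station = c("PARIS LYON", "PARIS NORD", "NANTES")
)
small_s <- data.frame(
  year = 2017,
  num_late_at_departure = 13,
  departure_station = "PARIS LYON"
)

keys <- c("year", "departure_station")

# Rows of full_s that have a partner in small_s (columns of full_s only)
matched <- semi_join(full_s, small_s, by = keys)

# Rows of full_s that have no partner in small_s
unmatched <- anti_join(full_s, small_s, by = keys)

matched
unmatched
```

Together the two results always partition `full_s`: every row lands in exactly one of them.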