Week 9: Apply it to your data 8

1. Import your data

Import two related datasets from the TidyTuesday project.

library(dplyr)  # provides %>%, select(), sample_n(), and the join functions

full_trains <- read.csv("../00_data/data/full_trains")
small_trains <- read.csv("../00_data/data/small_trains")

2. Make data small

full_trains_small <- full_trains %>% select(year, num_late_at_departure, departure_station) %>% sample_n(10)

small_trains_small <- small_trains %>% select(year,num_late_at_departure,departure_station)

Describe the two datasets: Both datasets contain the year, the number of trains that were late at departure, and the departure station. With them I can check whether there is one specific station from which trains depart late particularly often.

Data 1

Columns: year, num_late_at_departure, departure_station
Rows: 10

Data 2

Columns: year, num_late_at_departure, departure_station
Rows: 10 after sampling (32,772 before sample_n(10))

set.seed(1234)
full_trains_small <- full_trains %>% select(year,num_late_at_departure,departure_station) %>% sample_n(10)
small_trains_small <- small_trains %>% select(year,num_late_at_departure,departure_station) %>% sample_n(10)

full_trains_small

##    year num_late_at_departure          departure_station
## 1  2017                    14                   GRENOBLE
## 2  2015                     9 BESANCON FRANCHE COMTE TGV
## 3  2016                     6         PARIS MONTPARNASSE
## 4  2015                    26                     NANTES
## 5  2018                    52         PARIS MONTPARNASSE
## 6  2017                    29         PARIS MONTPARNASSE
## 7  2016                    23                 PARIS LYON
## 8  2017                    33                 PARIS LYON
## 9  2017                    43                 PARIS NORD
## 10 2017                    18         PARIS MONTPARNASSE

small_trains_small

##    year num_late_at_departure          departure_station
## 1  2015                    11                 PARIS LYON
## 2  2018                    18                 PARIS LYON
## 3  2018                     6                      BREST
## 4  2017                    13                 PARIS LYON
## 5  2015                     4         PARIS MONTPARNASSE
## 6  2017                     2 SAINT ETIENNE CHATEAUCREUX
## 7  2018                    27                     RENNES
## 8  2018                    65                      REIMS
## 9  2017                    31         PARIS MONTPARNASSE
## 10 2016                     4                     ANNECY

3. inner_join

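Describe the resulting data: I see the number of late departures at the stations PARIS MONTPARNASSE and PARIS LYON, all for 2017. For example, in 2017 the full_trains sample records 33 late departures from PARIS LYON and the small_trains sample records only 13. PARIS MONTPARNASSE appears twice because two rows in full_trains_small match the same row in small_trains_small.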

Columns: year, num_late_at_departure.x, departure_station, num_late_at_departure.y
Rows: 3

How is it different from the original two datasets? It is a very small dataset. It compares the number of late departures in the full_trains sample with the number in the small_trains sample for the same year and departure station.

full_trains_small %>% inner_join(small_trains_small, by = join_by(year, departure_station))

##   year num_late_at_departure.x  departure_station num_late_at_departure.y
## 1 2017                      29 PARIS MONTPARNASSE                      31
## 2 2017                      33         PARIS LYON                      13
## 3 2017                      18 PARIS MONTPARNASSE                      31
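The row count of 3 can be reproduced with a minimal sketch that uses only the 2017 Paris rows copied from the two printed samples. The object names full_2017 and small_2017 are made up for this illustration and are not part of the assignment.

```r
library(dplyr)

# 2017 Paris rows copied from the printed full_trains_small sample
full_2017 <- data.frame(
  year = c(2017, 2017, 2017),
  num_late_at_departure = c(29, 33, 18),
  departure_station = c("PARIS MONTPARNASSE", "PARIS LYON", "PARIS MONTPARNASSE")
)

# 2017 Paris rows copied from the printed small_trains_small sample
small_2017 <- data.frame(
  year = c(2017, 2017),
  num_late_at_departure = c(13, 31),
  departure_station = c("PARIS LYON", "PARIS MONTPARNASSE")
)

# Join on year and station only; the shared non-key column gets .x / .y suffixes
joined <- full_2017 %>%
  inner_join(small_2017, by = join_by(year, departure_station))

joined
nrow(joined)  # two MONTPARNASSE rows in x each match one row in y, plus one PARIS LYON match
```

Each row of the left table is paired with every matching row of the right table, which is why one row in the small_trains sample can show up twice in the result.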

4. left_join

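Describe the resulting data: left_join keeps all observations in x, here all rows of full_trains_small, and adds the matching data from small_trains_small. Because no by argument is given, the join uses all three shared columns as keys, so no rows match and no columns are added. In this sample, the highest number of late departures (52) is at PARIS MONTPARNASSE in 2018.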

Columns: year, num_late_at_departure, departure_station
Rows: 10

How is it different from the original two datasets? It contains only the year, departure station, and number of late departures from the full_trains sample. There are only 10 rows because that is the sample size.

full_trains_small %>% left_join(small_trains_small)

## Joining with `by = join_by(year, num_late_at_departure, departure_station)`

##    year num_late_at_departure          departure_station
## 1  2017                    14                   GRENOBLE
## 2  2015                     9 BESANCON FRANCHE COMTE TGV
## 3  2016                     6         PARIS MONTPARNASSE
## 4  2015                    26                     NANTES
## 5  2018                    52         PARIS MONTPARNASSE
## 6  2017                    29         PARIS MONTPARNASSE
## 7  2016                    23                 PARIS LYON
## 8  2017                    33                 PARIS LYON
## 9  2017                    43                 PARIS NORD
## 10 2017                    18         PARIS MONTPARNASSE
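For comparison, here is a small sketch of what left_join returns when the keys are restricted to year and departure_station. It uses a few rows copied from the printed samples; the names full_part and small_part are hypothetical and only serve this example.

```r
library(dplyr)

# Three rows copied from the printed full_trains_small sample
full_part <- data.frame(
  year = c(2017, 2017, 2017),
  num_late_at_departure = c(14, 33, 43),
  departure_station = c("GRENOBLE", "PARIS LYON", "PARIS NORD")
)

# Two rows copied from the printed small_trains_small sample
small_part <- data.frame(
  year = c(2017, 2018),
  num_late_at_departure = c(13, 65),
  departure_station = c("PARIS LYON", "REIMS")
)

# No `by`: all three shared columns become keys, nothing matches, no column is added
natural <- left_join(full_part, small_part)

# Explicit keys: every row of x is kept, unmatched rows get NA in the y column
keyed <- left_join(full_part, small_part, by = join_by(year, departure_station))

natural
keyed
```

With explicit keys the result gains a num_late_at_departure.y column, which is filled only for PARIS LYON in 2017 and is NA for the other two rows.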

5. right_join

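Describe the resulting data: right_join works like left_join, but now it keeps all rows of small_trains_small and adds the matching data from full_trains_small. Because the join uses all three shared columns as keys, no rows match and no columns are added. In this sample, the highest number of late departures (65) is at REIMS in 2018.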

Columns: year, num_late_at_departure, departure_station
Rows: 10

How is it different from the original two datasets? It has only 10 rows because of the sample size, and it contains only the columns we selected.

full_trains_small %>% right_join(small_trains_small)

## Joining with `by = join_by(year, num_late_at_departure, departure_station)`

##    year num_late_at_departure          departure_station
## 1  2015                    11                 PARIS LYON
## 2  2018                    18                 PARIS LYON
## 3  2018                     6                      BREST
## 4  2017                    13                 PARIS LYON
## 5  2015                     4         PARIS MONTPARNASSE
## 6  2017                     2 SAINT ETIENNE CHATEAUCREUX
## 7  2018                    27                     RENNES
## 8  2018                    65                      REIMS
## 9  2017                    31         PARIS MONTPARNASSE
## 10 2016                     4                     ANNECY

6. full_join

Describe the resulting data: full_join keeps all observations from both tables, so the output contains every row of full_trains_small followed by every row of small_trains_small. No rows were merged, because the join uses all three shared columns as keys and no row is identical in both samples.

Columns: year, num_late_at_departure, departure_station
Rows: 20

How is it different from the original two datasets? It is a small dataset with only 3 columns and 20 rows, the two samples of 10 rows stacked together.

full_trains_small %>% full_join(small_trains_small)

## Joining with `by = join_by(year, num_late_at_departure, departure_station)`

##    year num_late_at_departure          departure_station
## 1  2017                    14                   GRENOBLE
## 2  2015                     9 BESANCON FRANCHE COMTE TGV
## 3  2016                     6         PARIS MONTPARNASSE
## 4  2015                    26                     NANTES
## 5  2018                    52         PARIS MONTPARNASSE
## 6  2017                    29         PARIS MONTPARNASSE
## 7  2016                    23                 PARIS LYON
## 8  2017                    33                 PARIS LYON
## 9  2017                    43                 PARIS NORD
## 10 2017                    18         PARIS MONTPARNASSE
## 11 2015                    11                 PARIS LYON
## 12 2018                    18                 PARIS LYON
## 13 2018                     6                      BREST
## 14 2017                    13                 PARIS LYON
## 15 2015                     4         PARIS MONTPARNASSE
## 16 2017                     2 SAINT ETIENNE CHATEAUCREUX
## 17 2018                    27                     RENNES
## 18 2018                    65                      REIMS
## 19 2017                    31         PARIS MONTPARNASSE
## 20 2016                     4                     ANNECY
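The sketch below shows how full_join behaves once a row does match, assuming the keys are limited to year and departure_station. The rows are copied from the printed samples; full_part and small_part are hypothetical names for this illustration.

```r
library(dplyr)

# Three rows copied from the printed full_trains_small sample
full_part <- data.frame(
  year = c(2017, 2017, 2017),
  num_late_at_departure = c(14, 33, 43),
  departure_station = c("GRENOBLE", "PARIS LYON", "PARIS NORD")
)

# Two rows copied from the printed small_trains_small sample
small_part <- data.frame(
  year = c(2017, 2018),
  num_late_at_departure = c(13, 65),
  departure_station = c("PARIS LYON", "REIMS")
)

# Keep every row of both tables; matched rows are merged into one
both <- full_join(full_part, small_part, by = join_by(year, departure_station))

both
nrow(both)  # 3 rows from x plus the 1 unmatched row from y
```

Here PARIS LYON in 2017 is merged into a single row with both counts, while REIMS in 2018 is appended with NA in num_late_at_departure.x.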

7. semi_join

Describe the resulting data: semi_join returns no rows. No row in full_trains_small has a match in small_trains_small when all three columns are used as keys. This agrees with what I observed with full_join.

Columns: year, num_late_at_departure, departure_station
Rows: 0

How is it different from the original two datasets? With samples of 10 rows each, there are no matching rows. Therefore semi_join returns an empty result.

full_trains_small %>% semi_join(small_trains_small)

## Joining with `by = join_by(year, num_late_at_departure, departure_station)`

## [1] year                  num_late_at_departure departure_station    
## <0 rows> (or 0-length row.names)

8. anti_join

Describe the resulting data: anti_join returns all rows from full_trains_small that have no match in small_trains_small. Again, this confirms our observation that there are no matching rows in our sample data.

Columns: year, num_late_at_departure, departure_station
Rows: 10

How is it different from the original two datasets? It returns exactly the dataset full_trains_small, since none of its rows has a match in small_trains_small.

full_trains_small %>% anti_join(small_trains_small)

## Joining with `by = join_by(year, num_late_at_departure, departure_station)`

##    year num_late_at_departure          departure_station
## 1  2017                    14                   GRENOBLE
## 2  2015                     9 BESANCON FRANCHE COMTE TGV
## 3  2016                     6         PARIS MONTPARNASSE
## 4  2015                    26                     NANTES
## 5  2018                    52         PARIS MONTPARNASSE
## 6  2017                    29         PARIS MONTPARNASSE
## 7  2016                    23                 PARIS LYON
## 8  2017                    33                 PARIS LYON
## 9  2017                    43                 PARIS NORD
## 10 2017                    18         PARIS MONTPARNASSE
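semi_join and anti_join split the rows of x into those with and without a match, which is easier to see when at least one row matches. A minimal sketch, assuming the keys are limited to year and departure_station and using a few rows copied from the printed samples (full_part and small_part are hypothetical names):

```r
library(dplyr)

# Three rows copied from the printed full_trains_small sample
full_part <- data.frame(
  year = c(2017, 2017, 2017),
  num_late_at_departure = c(14, 33, 43),
  departure_station = c("GRENOBLE", "PARIS LYON", "PARIS NORD")
)

# Two rows copied from the printed small_trains_small sample
small_part <- data.frame(
  year = c(2017, 2018),
  num_late_at_departure = c(13, 65),
  departure_station = c("PARIS LYON", "REIMS")
)

# Rows of x that have a match in y (columns of y are never added)
matched <- semi_join(full_part, small_part, by = join_by(year, departure_station))

# Rows of x that have no match in y
unmatched <- anti_join(full_part, small_part, by = join_by(year, departure_station))

matched
unmatched
```

The two results together always contain every row of x exactly once: here PARIS LYON ends up in matched, while GRENOBLE and PARIS NORD end up in unmatched.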