Tidyverse Skills Check: Tour De France (SOLUTIONS)

Learning Objectives:

Students will use this exercise to assess their ability to translate questions into code using tidyverse packages and verbs.

dplyr: filter(), mutate(), group_by(), summarise(), count(), ect
ggplot2: ggplot(), aes(), geom_bar, geom_col, geom_histogram, geom_boxplot, geom_density

Step 0: Library tidyverse

library(tidyverse)

Step 1: Load the Data

The Tour de France is an annual men’s multiple-stage bicycle race primarily held in France, while also occasionally passing through nearby countries. Like the other Grand Tours, it consists of 21 stages, each a day long, over the course of 23 days, coinciding with the Bastille Day holiday.

https://en.wikipedia.org/wiki/Tour_de_France

In this section we will use the data about Tour de France winners from 1903 to 2019 and comes from Alastair Rushworth’s Data Package tdf and Kaggle.

tdf_winners <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-04-07/tdf_winners.csv')

## Rows: 106 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): winner_name, winner_team, full_name, nickname, birth_town, birth_c...
## dbl  (9): edition, distance, time_overall, time_margin, stage_wins, stages_l...
## date (3): start_date, born, died
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Step 2: Look at the data

What variables are available to work with?

str(tdf_winners)

## spec_tbl_df [106 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ edition      : num [1:106] 1 2 3 4 5 6 7 8 9 10 ...
##  $ start_date   : Date[1:106], format: "1903-07-01" "1904-07-02" ...
##  $ winner_name  : chr [1:106] "Maurice Garin" "Henri Cornet" "Louis Trousselier" "René Pottier" ...
##  $ winner_team  : chr [1:106] "La Française" "Conte" "Peugeot–Wolber" "Peugeot–Wolber" ...
##  $ distance     : num [1:106] 2428 2428 2994 4637 4488 ...
##  $ time_overall : num [1:106] 94.6 96.1 NA NA NA ...
##  $ time_margin  : num [1:106] 2.99 2.27 NA NA NA ...
##  $ stage_wins   : num [1:106] 3 1 5 5 2 5 6 4 2 3 ...
##  $ stages_led   : num [1:106] 6 3 10 12 5 13 13 3 13 13 ...
##  $ height       : num [1:106] 1.62 NA NA NA NA NA 1.78 NA NA NA ...
##  $ weight       : num [1:106] 60 NA NA NA NA NA 88 NA NA NA ...
##  $ age          : num [1:106] 32 19 24 27 24 25 22 22 26 23 ...
##  $ born         : Date[1:106], format: "1871-03-03" "1884-08-04" ...
##  $ died         : Date[1:106], format: "1957-02-19" "1941-03-18" ...
##  $ full_name    : chr [1:106] NA NA NA NA ...
##  $ nickname     : chr [1:106] "The Little Chimney-sweep" "Le rigolo (The joker)" "Levaloy / Trou-trou" NA ...
##  $ birth_town   : chr [1:106] "Arvier" "Desvres" "Paris" "Moret-sur-Loing" ...
##  $ birth_country: chr [1:106] "Italy" "France" "France" "France" ...
##  $ nationality  : chr [1:106] " France" " France" " France" " France" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   edition = col_double(),
##   ..   start_date = col_date(format = ""),
##   ..   winner_name = col_character(),
##   ..   winner_team = col_character(),
##   ..   distance = col_double(),
##   ..   time_overall = col_double(),
##   ..   time_margin = col_double(),
##   ..   stage_wins = col_double(),
##   ..   stages_led = col_double(),
##   ..   height = col_double(),
##   ..   weight = col_double(),
##   ..   age = col_double(),
##   ..   born = col_date(format = ""),
##   ..   died = col_date(format = ""),
##   ..   full_name = col_character(),
##   ..   nickname = col_character(),
##   ..   birth_town = col_character(),
##   ..   birth_country = col_character(),
##   ..   nationality = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Part 1

A. Convert to Miles

This data set uses the metric system. Please convert the value for distance to the imperial system that is standard in the United States.

1 km (kilometer) = 0.621371 miles

Create a new data frame to accomplish this so that we can use these values in subsequent parts.

tdf_Miles<-tdf_winners%>%
  mutate(distance_Mi = distance*0.621371)

B. Calculate Speed

Add a new column to the data set for speed (miles/hour). Which country is the fastest on average? Create a new data frame to accomplish this.

tdf_Fast<-tdf_Miles%>%
  mutate(speed = distance_Mi / time_overall)%>%
  group_by(nationality)%>%
  summarise(avgSpeed=mean(speed, na.rm=TRUE))%>%
  arrange(desc(avgSpeed))

head(tdf_Fast)

## # A tibble: 6 × 2
##   nationality    avgSpeed
##   <chr>             <dbl>
## 1  Colombia          25.1
## 2  Great Britain     25.0
## 3  Australia         24.7
## 4  United States     24.6
## 5  Germany           24.4
## 6  Denmark           24.4

Part 2

A. Unique Cyclists

How many unique cyclists are in these data? Create a new data frame to accomplish this and order it.
Which cyclist won the most Tour de France races?

winnerCount<-tdf_winners %>%
  group_by(winner_name)%>%
  count()%>%
  arrange(desc(n))

# how many winners? 
dim(winnerCount)

## [1] 63  2

# look at the top winners
head(winnerCount)

## # A tibble: 6 × 2
## # Groups:   winner_name [6]
##   winner_name          n
##   <chr>            <int>
## 1 Lance Armstrong      7
## 2 Bernard Hinault      5
## 3 Eddy Merckx          5
## 4 Jacques Anquetil     5
## 5 Miguel Induráin      5
## 6 Chris Froome         4

B. Graphic - Histogram

Make a histogram showing the distribution for the number of wins. Don’t forget to title your graph.

ggplot(winnerCount, aes(x=n))+
  geom_histogram()+
  ggtitle("Right Skewed Distribution of TDF Wins")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

C. Insight - Comment on the shape of this histogram:


This distribution has a strong right skew.  Most winners only one once.  There appears to be a possible outlier on the right with 7 total wins, Lance Armstrong.

Professor Provided Wrangling:

These data are somewhat repetitive because the physical metrics for an individual cyclist are repeated every time they win. Please run the following code and use it in the following sections to keep only the unique metrics for a given cylist.

uniqueWinners<-tdf_winners%>%
  group_by(winner_name)%>%
  summarise(wins=n(),
            height=unique(height), 
            weight=unique(weight), 
            born=unique(born), 
            died=unique(died), 
            nationality=unique(nationality))

#dim(uniqueWinners)

One more observation I made is that the nationality variable has a weird space before every country name. Let’s take this out using str_squish.

uniqueWinners$nationality<-stringr::str_squish(uniqueWinners$nationality)

#str(uniqueWinners)

Part 3

A. Unique Nations

There are 14 unique nations represented in the data set.

How many winners came from each? Create a new data frame to accomplish this.
Which country has the most winners? Use matrix indexing to only yield the desired row.

# Hint: Use the nationality variable

winnerCountry<-uniqueWinners %>%
  group_by(nationality)%>%
  count()%>%
  arrange(desc(n))

### first
winnerCountry[1,]

## # A tibble: 1 × 2
## # Groups:   nationality [1]
##   nationality     n
##   <chr>       <int>
## 1 France         21

B. Graphic - Bar Graph

Create a bar graph for the number of wins for each country.

Please also include the following code to improve your bar graph:

Order your categories from big to small using the function fct_infreq() around the variable on the x-axis
To have a more reabable graphic you can flip the x and y axes using the coord_flip() function as another layer at the end of your ggplot code

ggplot(uniqueWinners, aes(x=fct_infreq(nationality)))+
  geom_bar()+
  coord_flip()

Part 4

A. Imperial System

Please convert the values for height and weight to the imperial system, which is standard in the United States.

1 m (meter) = 39.3701 inches
1 kg (kilogram) = 2.205 pounds

Create a new data frame to accomplish this so that we can use these values in subsequent Parts 4, 5, and 6.

tdf_Imp<- uniqueWinners%>%
  mutate(height_in = height*39.3701)%>%
  mutate(weight_lb = weight*2.205)

str(tdf_Imp)

## tibble [63 × 9] (S3: tbl_df/tbl/data.frame)
##  $ winner_name: chr [1:63] "Alberto Contador" "André Leducq" "Andy Schleck" "Antonin Magne" ...
##  $ wins       : int [1:63] 2 2 1 2 5 2 1 1 1 1 ...
##  $ height     : num [1:63] 1.76 NA 1.86 NA 1.74 NA 1.84 1.9 1.74 1.73 ...
##  $ weight     : num [1:63] 62 NA 68 NA 62 NA 71 69 64 60 ...
##  $ born       : Date[1:63], format: "1982-12-06" "1904-02-27" ...
##  $ died       : Date[1:63], format: NA "1980-06-18" ...
##  $ nationality: chr [1:63] "Spain" "France" "Luxembourg" "France" ...
##  $ height_in  : num [1:63] 69.3 NA 73.2 NA 68.5 ...
##  $ weight_lb  : num [1:63] 137 NA 150 NA 137 ...

A. Height

Make a data frame that displays just the average height (in) of the winners from each country.

What country has the tallest cyclists?
What country has the shortest cyclists?

Hint: Recall the functions head() and tail()

tdf_AvgHeight <- tdf_Imp %>%
  group_by(nationality)%>%
  summarise(avgH = mean(height_in, na.rm=TRUE))%>%
  arrange(desc(avgH))

## Tallest
head(tdf_AvgHeight)

## # A tibble: 6 × 2
##   nationality    avgH
##   <chr>         <dbl>
## 1 Great Britain  73.4
## 2 Denmark        72.4
## 3 Germany        72.0
## 4 Belgium        71.7
## 5 Luxembourg     70.5
## 6 United States  69.9

## Shortest
tail(tdf_AvgHeight)

## # A tibble: 6 × 2
##   nationality  avgH
##   <chr>       <dbl>
## 1 Ireland      68.9
## 2 Spain        68.8
## 3 Australia    68.5
## 4 Netherlands  68.1
## 5 France       67.9
## 6 Switzerland NaN

B. Graphic - Box Plot

Create a side-by-side box plot to compare the height(in) distributions for cyclists from France, Italy, and Spain.

Hint: This will require a wrangling step before the graphic

tdf_Imp%>%
  filter(nationality %in% c("France", "Italy", "Spain"))%>%
  ggplot(aes(x=nationality, y=height_in, fill=nationality))+
  geom_boxplot()

## Warning: Removed 17 rows containing non-finite values (stat_boxplot).

C. Insights:

Please provide observations from the side-by-side box plot above.

The median heights of these countries don't appear to be significantly difference, since they fall within each others IQR.  

Of these three countries, Spain has the tallest cyclist, who is also an outlier within their own country.

Part 5

A. Join

Add a column for the average height (in) for each country to the data frame for information for unique cyclists.

Hint: Use left_join

tdf_Join<-tdf_Imp%>% 
  left_join(tdf_AvgHeight)

## Joining, by = "nationality"

B. Deviations

Now create a column to calculate the “deviation” between a cyclists height (in) and the average height (in) of their fellow country men.

tdf_Join%>%
  mutate(deviation=height_in-avgH)

## # A tibble: 63 × 11
##    winner_name  wins height weight born       died       natio…¹ heigh…² weigh…³
##    <chr>       <int>  <dbl>  <dbl> <date>     <date>     <chr>     <dbl>   <dbl>
##  1 Alberto Co…     2   1.76     62 1982-12-06 NA         Spain      69.3    137.
##  2 André Ledu…     2  NA        NA 1904-02-27 1980-06-18 France     NA       NA 
##  3 Andy Schle…     1   1.86     68 1985-06-10 NA         Luxemb…    73.2    150.
##  4 Antonin Ma…     2  NA        NA 1904-01-15 1983-09-08 France     NA       NA 
##  5 Bernard Hi…     5   1.74     62 1954-11-14 NA         France     68.5    137.
##  6 Bernard Th…     2  NA        NA 1948-01-10 NA         France     NA       NA 
##  7 Bjarne Riis     1   1.84     71 1964-04-03 NA         Denmark    72.4    157.
##  8 Bradley Wi…     1   1.9      69 1980-04-28 NA         Great …    74.8    152.
##  9 Cadel Evans     1   1.74     64 1977-02-14 NA         Austra…    68.5    141.
## 10 Carlos Sas…     1   1.73     60 1975-04-22 NA         Spain      68.1    132.
## # … with 53 more rows, 2 more variables: avgH <dbl>, deviation <dbl>, and
## #   abbreviated variable names ¹nationality, ²height_in, ³weight_lb

Part 6

A. BMI

Body Mass Index (BMI) is a convenient rule of thumb used to broadly categorize a person as underweight, normal weight, overweight, or obese based on tissue mass (muscle, fat, and bone) and height.

\[BMI = \frac{weight (lbs) \times 703}{height^2 (in^2)}\] Tasks/Questions:

Add a new column to the data set for body mass index (BMI).
Which cyclist had the highest BMI? What was his BMI?

tdf_BMI <- tdf_Imp %>%
  mutate(bmi = weight_lb*703 / height_in^2)%>% #bmi equation
  arrange(desc(bmi))

# highest
tdf_BMI[1,]

## # A tibble: 1 × 10
##   winner_name   wins height weight born       died       natio…¹ heigh…² weigh…³
##   <chr>        <int>  <dbl>  <dbl> <date>     <date>     <chr>     <dbl>   <dbl>
## 1 François Fa…     1   1.78     88 1887-01-26 1915-05-09 Luxemb…    70.1    194.
## # … with 1 more variable: bmi <dbl>, and abbreviated variable names
## #   ¹nationality, ²height_in, ³weight_lb

Part 7

A. Stages

The Tour de France is made up of multiple stages. Which cyclist led the most stages across their career (in this data set)?

Create a new data frame to accomplish this. You will use this in the next question.

tdf_Stages<-tdf_winners%>%
  group_by(winner_name)%>%
  summarise(sum_stage_wins=sum(stage_wins, na.rm = TRUE))%>%
  arrange(desc(sum_stage_wins))

head(tdf_Stages)

## # A tibble: 6 × 2
##   winner_name      sum_stage_wins
##   <chr>                     <dbl>
## 1 Eddy Merckx                  32
## 2 Bernard Hinault              21
## 3 Lance Armstrong              20
## 4 Jacques Anquetil             16
## 5 Miguel Induráin              10
## 6 Gino Bartali                  9

B. No Stage Wins

How many winners of the race, as a whole, did not win a single stage?

tdf_Stages%>%
  filter(sum_stage_wins==0)

## # A tibble: 5 × 2
##   winner_name     sum_stage_wins
##   <chr>                    <dbl>
## 1 Egan Bernal                  0
## 2 Gastone Nencini              0
## 3 Lucien Aimar                 0
## 4 Óscar Pereiro                0
## 5 Roger Walkowiak              0

Part 8

A. Longevity

Creating a dataset for only cyclists who have deceased, add a new column to the data set for a cyclist’s lifetime (in years). You can assume that each year has 365 days.

Questions:

Which cyclist lived the longest?
How many years old was he?

Hints:

Note that both born and died are as.Date variables. Subtracting as.Date variables will give the number of days between the dates.
If you want to remove the days label, use the as.numeric() wrapper function.

tdf_died<-uniqueWinners%>%
  filter(is.na(died)==FALSE)%>%
  mutate(lifeDays=died-born)%>%
  mutate(lifeYears=as.numeric(lifeDays/365))%>%
  arrange(desc(lifeYears))

head(tdf_died)

## # A tibble: 6 × 9
##   winner_name   wins height weight born       died       natio…¹ lifeD…² lifeY…³
##   <chr>        <int>  <dbl>  <dbl> <date>     <date>     <chr>   <drtn>    <dbl>
## 1 Ferdinand K…     1  NA        NA 1919-07-24 2016-12-29 Switze… 35588 …    97.5
## 2 Roger Walko…     1  NA        NA 1927-03-02 2017-02-06 France  32849 …    90.0
## 3 Lucien Buys…     1  NA        NA 1892-09-11 1980-01-03 Belgium 31889 …    87.4
## 4 Nicolas Fra…     2  NA        NA 1899-11-04 1985-11-08 Luxemb… 31415 …    86.1
## 5 Maurice Gar…     1   1.62     60 1871-03-03 1957-02-19 France  31399 …    86.0
## 6 Gino Bartali     2   1.72     66 1914-07-18 2000-05-05 Italy   31338 …    85.9
## # … with abbreviated variable names ¹nationality, ²lifeDays, ³lifeYears

FUN EXTRA CREDIT BONUS!!!!

Write code to calculate a cyclist’s win streak. Then find which cyclist had the longest win streak.