Students will use this exercise to assess their ability to translate
questions into code using tidyverse
packages and verbs.
dplyr
: filter()
, mutate()
,
group_by()
, summarise()
, count()
,
ectggplot2
: ggplot()
, aes()
,
geom_bar
, geom_col
,
geom_histogram
, geom_boxplot
,
geom_density
library(tidyverse)
The Tour de France is an annual men’s multiple-stage bicycle race primarily held in France, while also occasionally passing through nearby countries. Like the other Grand Tours, it consists of 21 stages, each a day long, over the course of 23 days, coinciding with the Bastille Day holiday.
https://en.wikipedia.org/wiki/Tour_de_France
In this section we will use the data about Tour de France winners from 1903 to 2019 and comes from Alastair Rushworth’s Data Package tdf and Kaggle.
tdf_winners <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-04-07/tdf_winners.csv')
## Rows: 106 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): winner_name, winner_team, full_name, nickname, birth_town, birth_c...
## dbl (9): edition, distance, time_overall, time_margin, stage_wins, stages_l...
## date (3): start_date, born, died
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
What variables are available to work with?
str(tdf_winners)
## spec_tbl_df [106 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ edition : num [1:106] 1 2 3 4 5 6 7 8 9 10 ...
## $ start_date : Date[1:106], format: "1903-07-01" "1904-07-02" ...
## $ winner_name : chr [1:106] "Maurice Garin" "Henri Cornet" "Louis Trousselier" "René Pottier" ...
## $ winner_team : chr [1:106] "La Française" "Conte" "Peugeot–Wolber" "Peugeot–Wolber" ...
## $ distance : num [1:106] 2428 2428 2994 4637 4488 ...
## $ time_overall : num [1:106] 94.6 96.1 NA NA NA ...
## $ time_margin : num [1:106] 2.99 2.27 NA NA NA ...
## $ stage_wins : num [1:106] 3 1 5 5 2 5 6 4 2 3 ...
## $ stages_led : num [1:106] 6 3 10 12 5 13 13 3 13 13 ...
## $ height : num [1:106] 1.62 NA NA NA NA NA 1.78 NA NA NA ...
## $ weight : num [1:106] 60 NA NA NA NA NA 88 NA NA NA ...
## $ age : num [1:106] 32 19 24 27 24 25 22 22 26 23 ...
## $ born : Date[1:106], format: "1871-03-03" "1884-08-04" ...
## $ died : Date[1:106], format: "1957-02-19" "1941-03-18" ...
## $ full_name : chr [1:106] NA NA NA NA ...
## $ nickname : chr [1:106] "The Little Chimney-sweep" "Le rigolo (The joker)" "Levaloy / Trou-trou" NA ...
## $ birth_town : chr [1:106] "Arvier" "Desvres" "Paris" "Moret-sur-Loing" ...
## $ birth_country: chr [1:106] "Italy" "France" "France" "France" ...
## $ nationality : chr [1:106] " France" " France" " France" " France" ...
## - attr(*, "spec")=
## .. cols(
## .. edition = col_double(),
## .. start_date = col_date(format = ""),
## .. winner_name = col_character(),
## .. winner_team = col_character(),
## .. distance = col_double(),
## .. time_overall = col_double(),
## .. time_margin = col_double(),
## .. stage_wins = col_double(),
## .. stages_led = col_double(),
## .. height = col_double(),
## .. weight = col_double(),
## .. age = col_double(),
## .. born = col_date(format = ""),
## .. died = col_date(format = ""),
## .. full_name = col_character(),
## .. nickname = col_character(),
## .. birth_town = col_character(),
## .. birth_country = col_character(),
## .. nationality = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
This data set uses the metric system. Please convert the value for distance to the imperial system that is standard in the United States.
Create a new data frame to accomplish this so that we can use these values in subsequent parts.
tdf_Miles<-tdf_winners%>%
mutate(distance_Mi = distance*0.621371)
Add a new column to the data set for speed (miles/hour). Which country is the fastest on average? Create a new data frame to accomplish this.
tdf_Fast<-tdf_Miles%>%
mutate(speed = distance_Mi / time_overall)%>%
group_by(nationality)%>%
summarise(avgSpeed=mean(speed, na.rm=TRUE))%>%
arrange(desc(avgSpeed))
head(tdf_Fast)
## # A tibble: 6 × 2
## nationality avgSpeed
## <chr> <dbl>
## 1 Colombia 25.1
## 2 Great Britain 25.0
## 3 Australia 24.7
## 4 United States 24.6
## 5 Germany 24.4
## 6 Denmark 24.4
How many unique cyclists are in these data? Create a new data frame to accomplish this and order it.
Which cyclist won the most Tour de France races?
winnerCount<-tdf_winners %>%
group_by(winner_name)%>%
count()%>%
arrange(desc(n))
# how many winners?
dim(winnerCount)
## [1] 63 2
# look at the top winners
head(winnerCount)
## # A tibble: 6 × 2
## # Groups: winner_name [6]
## winner_name n
## <chr> <int>
## 1 Lance Armstrong 7
## 2 Bernard Hinault 5
## 3 Eddy Merckx 5
## 4 Jacques Anquetil 5
## 5 Miguel Induráin 5
## 6 Chris Froome 4
Make a histogram showing the distribution for the number of wins. Don’t forget to title your graph.
ggplot(winnerCount, aes(x=n))+
geom_histogram()+
ggtitle("Right Skewed Distribution of TDF Wins")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This distribution has a strong right skew. Most winners only one once. There appears to be a possible outlier on the right with 7 total wins, Lance Armstrong.
These data are somewhat repetitive because the physical metrics for an individual cyclist are repeated every time they win. Please run the following code and use it in the following sections to keep only the unique metrics for a given cylist.
uniqueWinners<-tdf_winners%>%
group_by(winner_name)%>%
summarise(wins=n(),
height=unique(height),
weight=unique(weight),
born=unique(born),
died=unique(died),
nationality=unique(nationality))
#dim(uniqueWinners)
One more observation I made is that the nationality
variable has a weird space before every country name. Let’s take this
out using str_squish
.
uniqueWinners$nationality<-stringr::str_squish(uniqueWinners$nationality)
#str(uniqueWinners)
There are 14 unique nations represented in the data set.
How many winners came from each? Create a new data frame to accomplish this.
Which country has the most winners? Use matrix indexing to only yield the desired row.
# Hint: Use the nationality variable
winnerCountry<-uniqueWinners %>%
group_by(nationality)%>%
count()%>%
arrange(desc(n))
### first
winnerCountry[1,]
## # A tibble: 1 × 2
## # Groups: nationality [1]
## nationality n
## <chr> <int>
## 1 France 21
Create a bar graph for the number of wins for each country.
Please also include the following code to improve your bar graph:
fct_infreq()
around the variable on the x-axiscoord_flip()
function as another layer at the end of
your ggplot
codeggplot(uniqueWinners, aes(x=fct_infreq(nationality)))+
geom_bar()+
coord_flip()
Please convert the values for height and weight to the imperial system, which is standard in the United States.
Create a new data frame to accomplish this so that we can use these values in subsequent Parts 4, 5, and 6.
tdf_Imp<- uniqueWinners%>%
mutate(height_in = height*39.3701)%>%
mutate(weight_lb = weight*2.205)
str(tdf_Imp)
## tibble [63 × 9] (S3: tbl_df/tbl/data.frame)
## $ winner_name: chr [1:63] "Alberto Contador" "André Leducq" "Andy Schleck" "Antonin Magne" ...
## $ wins : int [1:63] 2 2 1 2 5 2 1 1 1 1 ...
## $ height : num [1:63] 1.76 NA 1.86 NA 1.74 NA 1.84 1.9 1.74 1.73 ...
## $ weight : num [1:63] 62 NA 68 NA 62 NA 71 69 64 60 ...
## $ born : Date[1:63], format: "1982-12-06" "1904-02-27" ...
## $ died : Date[1:63], format: NA "1980-06-18" ...
## $ nationality: chr [1:63] "Spain" "France" "Luxembourg" "France" ...
## $ height_in : num [1:63] 69.3 NA 73.2 NA 68.5 ...
## $ weight_lb : num [1:63] 137 NA 150 NA 137 ...
Make a data frame that displays just the average height (in) of the winners from each country.
Hint: Recall the functions head()
and
tail()
tdf_AvgHeight <- tdf_Imp %>%
group_by(nationality)%>%
summarise(avgH = mean(height_in, na.rm=TRUE))%>%
arrange(desc(avgH))
## Tallest
head(tdf_AvgHeight)
## # A tibble: 6 × 2
## nationality avgH
## <chr> <dbl>
## 1 Great Britain 73.4
## 2 Denmark 72.4
## 3 Germany 72.0
## 4 Belgium 71.7
## 5 Luxembourg 70.5
## 6 United States 69.9
## Shortest
tail(tdf_AvgHeight)
## # A tibble: 6 × 2
## nationality avgH
## <chr> <dbl>
## 1 Ireland 68.9
## 2 Spain 68.8
## 3 Australia 68.5
## 4 Netherlands 68.1
## 5 France 67.9
## 6 Switzerland NaN
Create a side-by-side box plot to compare the height(in) distributions for cyclists from France, Italy, and Spain.
Hint: This will require a wrangling step before the graphic
tdf_Imp%>%
filter(nationality %in% c("France", "Italy", "Spain"))%>%
ggplot(aes(x=nationality, y=height_in, fill=nationality))+
geom_boxplot()
## Warning: Removed 17 rows containing non-finite values (stat_boxplot).
Please provide observations from the side-by-side box plot above.
The median heights of these countries don't appear to be significantly difference, since they fall within each others IQR.
Of these three countries, Spain has the tallest cyclist, who is also an outlier within their own country.
Add a column for the average height (in) for each country to the data frame for information for unique cyclists.
Hint: Use left_join
tdf_Join<-tdf_Imp%>%
left_join(tdf_AvgHeight)
## Joining, by = "nationality"
Now create a column to calculate the “deviation” between a cyclists height (in) and the average height (in) of their fellow country men.
tdf_Join%>%
mutate(deviation=height_in-avgH)
## # A tibble: 63 × 11
## winner_name wins height weight born died natio…¹ heigh…² weigh…³
## <chr> <int> <dbl> <dbl> <date> <date> <chr> <dbl> <dbl>
## 1 Alberto Co… 2 1.76 62 1982-12-06 NA Spain 69.3 137.
## 2 André Ledu… 2 NA NA 1904-02-27 1980-06-18 France NA NA
## 3 Andy Schle… 1 1.86 68 1985-06-10 NA Luxemb… 73.2 150.
## 4 Antonin Ma… 2 NA NA 1904-01-15 1983-09-08 France NA NA
## 5 Bernard Hi… 5 1.74 62 1954-11-14 NA France 68.5 137.
## 6 Bernard Th… 2 NA NA 1948-01-10 NA France NA NA
## 7 Bjarne Riis 1 1.84 71 1964-04-03 NA Denmark 72.4 157.
## 8 Bradley Wi… 1 1.9 69 1980-04-28 NA Great … 74.8 152.
## 9 Cadel Evans 1 1.74 64 1977-02-14 NA Austra… 68.5 141.
## 10 Carlos Sas… 1 1.73 60 1975-04-22 NA Spain 68.1 132.
## # … with 53 more rows, 2 more variables: avgH <dbl>, deviation <dbl>, and
## # abbreviated variable names ¹nationality, ²height_in, ³weight_lb
Body Mass Index (BMI) is a convenient rule of thumb used to broadly categorize a person as underweight, normal weight, overweight, or obese based on tissue mass (muscle, fat, and bone) and height.
\[BMI = \frac{weight (lbs) \times 703}{height^2 (in^2)}\] Tasks/Questions:
tdf_BMI <- tdf_Imp %>%
mutate(bmi = weight_lb*703 / height_in^2)%>% #bmi equation
arrange(desc(bmi))
# highest
tdf_BMI[1,]
## # A tibble: 1 × 10
## winner_name wins height weight born died natio…¹ heigh…² weigh…³
## <chr> <int> <dbl> <dbl> <date> <date> <chr> <dbl> <dbl>
## 1 François Fa… 1 1.78 88 1887-01-26 1915-05-09 Luxemb… 70.1 194.
## # … with 1 more variable: bmi <dbl>, and abbreviated variable names
## # ¹nationality, ²height_in, ³weight_lb
The Tour de France is made up of multiple stages. Which cyclist led the most stages across their career (in this data set)?
Create a new data frame to accomplish this. You will use this in the next question.
tdf_Stages<-tdf_winners%>%
group_by(winner_name)%>%
summarise(sum_stage_wins=sum(stage_wins, na.rm = TRUE))%>%
arrange(desc(sum_stage_wins))
head(tdf_Stages)
## # A tibble: 6 × 2
## winner_name sum_stage_wins
## <chr> <dbl>
## 1 Eddy Merckx 32
## 2 Bernard Hinault 21
## 3 Lance Armstrong 20
## 4 Jacques Anquetil 16
## 5 Miguel Induráin 10
## 6 Gino Bartali 9
How many winners of the race, as a whole, did not win a single stage?
tdf_Stages%>%
filter(sum_stage_wins==0)
## # A tibble: 5 × 2
## winner_name sum_stage_wins
## <chr> <dbl>
## 1 Egan Bernal 0
## 2 Gastone Nencini 0
## 3 Lucien Aimar 0
## 4 Óscar Pereiro 0
## 5 Roger Walkowiak 0
Creating a dataset for only cyclists who have deceased, add a new column to the data set for a cyclist’s lifetime (in years). You can assume that each year has 365 days.
Questions:
Hints:
born
and died
are
as.Date
variables. Subtracting as.Date
variables will give the number of days between the dates.days
label, use the
as.numeric()
wrapper function.tdf_died<-uniqueWinners%>%
filter(is.na(died)==FALSE)%>%
mutate(lifeDays=died-born)%>%
mutate(lifeYears=as.numeric(lifeDays/365))%>%
arrange(desc(lifeYears))
head(tdf_died)
## # A tibble: 6 × 9
## winner_name wins height weight born died natio…¹ lifeD…² lifeY…³
## <chr> <int> <dbl> <dbl> <date> <date> <chr> <drtn> <dbl>
## 1 Ferdinand K… 1 NA NA 1919-07-24 2016-12-29 Switze… 35588 … 97.5
## 2 Roger Walko… 1 NA NA 1927-03-02 2017-02-06 France 32849 … 90.0
## 3 Lucien Buys… 1 NA NA 1892-09-11 1980-01-03 Belgium 31889 … 87.4
## 4 Nicolas Fra… 2 NA NA 1899-11-04 1985-11-08 Luxemb… 31415 … 86.1
## 5 Maurice Gar… 1 1.62 60 1871-03-03 1957-02-19 France 31399 … 86.0
## 6 Gino Bartali 2 1.72 66 1914-07-18 2000-05-05 Italy 31338 … 85.9
## # … with abbreviated variable names ¹nationality, ²lifeDays, ³lifeYears
Write code to calculate a cyclist’s win streak. Then find which cyclist had the longest win streak.