The goal of this project is to create a vignette of one or more
features of a TidyVerse package in R. For this project, I will be
focusing on some of the functions of the
tidyr package and I will also use
ggplot2 to demonstrate how these
transformations can be helpful.
library(tidyr)
library(ggplot2)
library(dplyr)
The tidyr library provides functions to
tidy data while the ggplot2 library
provides functions to plot and visualize data. We will also utilize some
functions from dplyr to further manipulate
our data frame.
In order to load these libraries, you will need to first install the
packages. You can do this by using
install.packages("tidyr"),
install.packages("ggplot2"), and
install.packages("dplyr"). You can also access these
libraries by installing and using the
tidyverse package which includes these as
well as other packages.
The data set that we will focus on includes historical Elo ratings
and game scores for NBA teams from 1947 till 2023. The data set was
taken from FiveThirtEight.com and can
be accessed here
by downloading the nba_elo.csv file.
nba <- read.csv(url("https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_607/Tidyverse_CREATE/nba_elo.csv"))
Let’s take a look at the data:
rmarkdown::paged_table(nba)
There are 73,335 rows and 27 columns in this data set.
Each row constitutes a game played between two teams. The observations from these games can be seen by looking at the column names.
names(nba)
## [1] "date" "season" "neutral" "playoff"
## [5] "team1" "team2" "elo1_pre" "elo2_pre"
## [9] "elo_prob1" "elo_prob2" "elo1_post" "elo2_post"
## [13] "carm.elo1_pre" "carm.elo2_pre" "carm.elo_prob1" "carm.elo_prob2"
## [17] "carm.elo1_post" "carm.elo2_post" "raptor1_pre" "raptor2_pre"
## [21] "raptor_prob1" "raptor_prob2" "score1" "score2"
## [25] "quality" "importance" "total_rating"
Five Thirty Eight uses this data set to predict NBA season outcomes. According to their explanation online of how their predictions work, they began by using the Elo ratings for each team throughout history. They later realized that the team Elo rating did not factor for player skill and lacked the ability to predict based on player injuries, trades, or signings. In 2015, they came up with a new system, the “CARM-Elo” which factored in CARMELO player projections. In 2019 they created a new metric for player rating, the RAPTOR metric, which is what they currently use.
summary(nba)
## date season neutral playoff
## Length:73335 Min. :1947 Min. :0.000000 Length:73335
## Class :character 1st Qu.:1977 1st Qu.:0.000000 Class :character
## Mode :character Median :1994 Median :0.000000 Mode :character
## Mean :1993 Mean :0.002823
## 3rd Qu.:2009 3rd Qu.:0.000000
## Max. :2023 Max. :1.000000
##
## team1 team2 elo1_pre elo2_pre
## Length:73335 Length:73335 Min. :1106 Min. :1092
## Class :character Class :character 1st Qu.:1420 1st Qu.:1421
## Mode :character Mode :character Median :1503 Median :1504
## Mean :1498 Mean :1498
## 3rd Qu.:1578 3rd Qu.:1578
## Max. :1856 Max. :1865
##
## elo_prob1 elo_prob2 elo1_post elo2_post
## Min. :0.06262 Min. :0.01832 Min. :1100 Min. :1086
## 1st Qu.:0.49901 1st Qu.:0.24107 1st Qu.:1420 1st Qu.:1421
## Median :0.63992 Median :0.36008 Median :1503 Median :1504
## Mean :0.62103 Mean :0.37897 Mean :1497 Mean :1498
## 3rd Qu.:0.75893 3rd Qu.:0.50099 3rd Qu.:1578 3rd Qu.:1578
## Max. :0.98168 Max. :0.93738 Max. :1860 Max. :1865
## NA's :49 NA's :49
## carm.elo1_pre carm.elo2_pre carm.elo_prob1 carm.elo_prob2
## Min. :1188 Min. :1193 Min. :0.06 Min. :0.01
## 1st Qu.:1439 1st Qu.:1439 1st Qu.:0.50 1st Qu.:0.22
## Median :1514 Median :1513 Median :0.65 Median :0.35
## Mean :1514 Mean :1513 Mean :0.63 Mean :0.37
## 3rd Qu.:1590 3rd Qu.:1587 3rd Qu.:0.78 3rd Qu.:0.50
## Max. :1873 Max. :1882 Max. :0.99 Max. :0.94
## NA's :68086 NA's :68086 NA's :68086 NA's :68086
## carm.elo1_post carm.elo2_post raptor1_pre raptor2_pre
## Min. :1188 Min. :1192 Min. : 865.2 Min. : 936.1
## 1st Qu.:1438 1st Qu.:1439 1st Qu.:1428.0 1st Qu.:1425.1
## Median :1513 Median :1514 Median :1513.7 Median :1508.1
## Mean :1513 Mean :1514 Mean :1503.9 Mean :1500.9
## 3rd Qu.:1589 3rd Qu.:1588 3rd Qu.:1593.4 3rd Qu.:1590.4
## Max. :1877 Max. :1882 Max. :1850.6 Max. :1842.7
## NA's :68086 NA's :68086 NA's :67094 NA's :67094
## raptor_prob1 raptor_prob2 score1 score2
## Min. :0.03 Min. :0.01 Min. : 2.0 Min. : 0.0
## 1st Qu.:0.47 1st Qu.:0.23 1st Qu.: 96.0 1st Qu.: 92.0
## Median :0.63 Median :0.37 Median :105.0 Median :102.0
## Mean :0.61 Mean :0.39 Mean :105.4 Mean :101.8
## 3rd Qu.:0.77 3rd Qu.:0.53 3rd Qu.:115.0 3rd Qu.:111.0
## Max. :0.99 Max. :0.97 Max. :184.0 Max. :186.0
## NA's :67094 NA's :67094 NA's :49 NA's :49
## quality importance total_rating
## Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 20.00 1st Qu.: 6.00 1st Qu.: 20.00
## Median : 44.00 Median : 18.00 Median : 40.00
## Mean : 45.24 Mean : 28.23 Mean : 39.24
## 3rd Qu.: 69.00 3rd Qu.: 42.00 3rd Qu.: 54.00
## Max. :100.00 Max. :100.00 Max. :100.00
## NA's :70720 NA's :70720
We can see from the summary that majority of the ratings are missing for the “carm-elo” and “raptor” columns, as these were implemented for later seasons. We will therefore choose to focus on the “elo” and “score” columns for this assignment.
nba_filtered <- nba |>
select(date, season, team1, team2, elo1_pre, elo2_pre, elo1_post, elo2_post, score1, score2)
Each row of the data set includes metrics for both teams that played against each other in a game. If we wanted to graph the Elo ratings or scores for a specific team over time, we would need to have the metrics for this team in a single column.
To do this, we will use pivot_longer() to expand the
data frame. The basic structure for pivot_longer() is as
follows:
pivot_longer(data, cols, names_to = "names", values_to = "values").
Arguments:
data - the data frame which is being transformedcols - the columns which are being collapsed into
thenames_to - a column for which the column names will be
placed into (you supply the name for this column)values_to - a column for which the values of each
selected column will be placed (you supply the name for this
column)When using pivot_longer(), the values from other columns
will be duplicated the amount of times for each column that is being
collapsed into the new columns.
Before we use pivot_longer(), let’s use
tidyr’s unite() function to
join the columns pertaining to each team into a single column for each
team. In doing this, we can easily expand all columns without having to
pivot multiple times, create data frames for each separate value to then
combine, or running into duplication issues.
The basic syntax for unite() is as follows:
unite(data, col, sep = "_")
Arguments:
data - the data frame and columns which are being
combined (list first the data frame and then the columns)col - a column name to which the combined columns will
be assignedsep - a separator to use between columnsGenerally, the columns can be called in any order you wish and you can use any character, numeric, or symbol as your separator (or even nothing). For this assignment, we will be separating the columns after we pivot, so we will use the same order of columns and we will use a semicolon as our separator so it is unique from any other value.
nba_united <- nba_filtered |>
unite(team1,elo1_pre,elo1_post,score1, col = "team1", sep=";") |>
unite(team2,elo2_pre,elo2_post,score2, col = "team2", sep=";")
rmarkdown::paged_table(nba_united)
We can now use pivot_longer() to expand this data
frame.
nba_longer <- nba_united |>
pivot_longer(cols = c(team1, team2),
names_to = "team",
values_to = "stats")
rmarkdown::paged_table(nba_longer)
We now have a data frame which a row for each team from each game and their stats.
Let’s now separate the stats back into their original columns using
the separate() function. The basic syntax for
separate() is as follows:
separate(data, col, into, sep = "_").
Arguments:
data - the data framecol - the column which is being separatedinto - a vector of names to assign for each new
columnsep - the separator by which to split the columnsKeep in mind that the column created from unite() is now
a character vector. The original data types are, therefore, not
preserved and we will need to recast these columns if we wish to perform
analysis on these numbers. To do this, we will use
dplyr’s mutate() function to
recast these columns to numeric. If we want to graph over time using the
date, we will also need to cast the date column as a date
data type.
nba_final <- nba_longer |>
separate(stats, into = c("team_abr", "pre_elo", "post_elo", "score"), sep=";") |>
mutate(date = as.Date(date),
pre_elo = as.numeric(pre_elo),
post_elo = as.numeric(post_elo),
score = as.numeric(score))
rmarkdown::paged_table(nba_final)
We can now plot the change in Elo ratings for teams over time.
First, lets take the average Elo rating for each team for each season.
average_elo <- nba_final |>
group_by(season, team_abr) |>
summarize(avg_pre_elo = mean(pre_elo),
avg_post_elo = mean(post_elo)) |>
ungroup()
rmarkdown::paged_table(average_elo)
Now let’s use ggplot() to graph the Elo ratings for the
current top rated teams according to Five Thirty Eight’s current
ratings: Boston Celtics, Philadelphia 76ers, Denver Nuggets, Milwaukee
Bucks. We will use the pre game Elo ratings for this.
The data set from Five Thirty Eight only used the team abbreviations. For this, I chose to join the final data table to a table of the actual NBA team names found here.
team_names <- read.csv(url("https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_607/Tidyverse_CREATE/NBA_teams.csv"))
# only include teams that are in the NBA
average_elo <- average_elo |>
right_join(team_names, by = "team_abr")
average_elo |>
filter(team_name %in% c("Boston Celtics", "Philadelphia 76ers", "Denver Nuggets", "Milwaukee Bucks")) |>
ggplot(aes(x = season, y = avg_pre_elo, color = team_name)) +
geom_line() +
labs(title = "Average Pre Game Elo Rating Over Time", x = "Season", y = "Average Pre Elo Rating")
We can also plot the change in Elo ratings for each team based on the date of the game to see how their actual Elo rating changed overtime. Let’s do this for the Boston Celtics.
nba_final_names <- nba_final |>
right_join(team_names)
nba_final_names |>
filter(team_name %in% c("Boston Celtics", "Philadelphia 76ers", "Denver Nuggets", "Milwaukee Bucks")) |>
ggplot(aes(x = date, y = pre_elo)) +
geom_line() +
facet_wrap(~team_name) +
labs(title = "Pre Game Elo Rating Over Time", x = "Year", y = "Pre Elo Rating")
Tidy data is essential for effective data analysis as it allows for efficient and accurate data manipulation, transformation, and visualization. The principles of tidy data, such as having each variable in a separate column, each observation in a separate row, and having a single type of data in each column, ensure that data is organized in a way that facilitates easy analysis.
The tidyr package in R provides a
powerful set of tools for tidying messy data and transforming it into a
tidy format. The package offers functions for pivoting, gathering, and
separating data, making it easier to clean and manipulate data for
analysis. The tidyr package, when used in
conjunction with other data manipulation packages in R, allows for
efficient and streamlined data cleaning and analysis.
In this project, we explored some of the functions of the
tidyr package, including
pivot_longer(), unite(), and
separate(). These are not only helpful for tidying data,
but also for manipulating tidy data into a new format that works better
for the purposes of a specific analysis.
Other functions of tidyr include
pivot_longer() which works in the reverse of
pivot_longer() to widen data sets, and functions to handle
missing values such as fill(), drop_na(), and
replace_na().
Explanation and syntax for further
tidyr functions can be found here.