We’re going to pick up right where we left off yesterday in Problem
Set 1, where you wrote a script to create a tbl containing
several shooting statistics for NBA players between the 1996-97 season
and the 2021-22 season. Your script might look something like the
code block below. For reasons that will become clear shortly, we are
going to call our tbl raw_shooting (remember: writing code
in script files is good practice!).
library(tidyverse)
# Read in data
raw_shooting <- read_csv(file = "data/nba_shooting_1997_2022.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## PLAYER = col_character(),
## SEASON = col_double(),
## FGM = col_double(),
## FGA = col_double(),
## TPM = col_double(),
## TPA = col_double(),
## FTM = col_double(),
## FTA = col_double()
## )
# Create new columns
raw_shooting <-
mutate(raw_shooting,
FGP = FGM / FGA,
TPP = TPM / TPA,
FTP = FTM / FTA,
eFGP = (FGM + 0.5 * TPM) / (FGA),
PTS = FTM + 2 * FGM + TPM,
TSP = PTS/(2 * (FGA + 0.44 * FTA)))
# Sort by the TSP in descending order
raw_shooting <- arrange(raw_shooting, desc(TSP))
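To see what these formulas compute, here is a small hand check on one row that appears in a later printout (292 field goals on 600 attempts, 221 made threes, 106 of 118 free throws). The lowercase variable names are illustrative only and are not part of the script.

```r
# Shooting totals copied from one printed row of the data (illustrative check)
fgm <- 292; fga <- 600   # field goals made / attempted
tpm <- 221               # three-pointers made
ftm <- 106; fta <- 118   # free throws made / attempted

# Effective FG%: a made three counts as 1.5 made field goals
efgp <- (fgm + 0.5 * tpm) / fga

# Points: 1 per free throw, 2 per field goal, plus 1 extra per made three
pts <- ftm + 2 * fgm + tpm

# True shooting %: points divided by twice the estimated shooting attempts
tsp <- pts / (2 * (fga + 0.44 * fta))

round(c(eFGP = efgp, PTS = pts, TSP = tsp), 3)
```

The eFGP value rounds to 0.671, which matches the eFGP printed for that row.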
As we learned at the end of Coding Lecture 1, rather than repeatedly
referring to our tbl, we can use the pipe
%>% to chain together our operations for a much
cleaner chunk of code:
# First read in the data, assigning pipeline to raw_shooting
raw_shooting <- read_csv(file = "data/nba_shooting_1997_2022.csv") %>%
# Next create the new columns
mutate(FGP = FGM / FGA,
TPP = TPM / TPA,
FTP = FTM / FTA,
eFGP = (FGM + 0.5 * TPM) / (FGA),
PTS = FTM + 2 * FGM + TPM,
TSP = PTS / (2 * (FGA + 0.44 * FTA))) %>%
# And finally sort by TSP
arrange(desc(TSP))
raw_shooting
## # A tibble: 11,841 × 14
## PLAYER SEASON FGM FGA TPM TPA FTM FTA FGP TPP FTP eFGP
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Donnel… 2005 2 2 0 0 4 4 1 NaN 1 1
## 2 Tyson … 1999 1 1 1 1 1 2 1 1 0.5 1.5
## 3 David … 2018 2 3 2 3 4 4 0.667 0.667 1 1
## 4 Jackie… 2005 4 4 0 0 2 2 1 NaN 1 1
## 5 Udonis… 2021 2 2 0 0 0 0 1 NaN NaN 1
## 6 Rakeem… 2016 2 2 0 0 0 0 1 NaN NaN 1
## 7 Marcus… 2009 2 2 0 0 0 0 1 NaN NaN 1
## 8 David … 2001 3 3 0 0 0 0 1 NaN NaN 1
## 9 Julyan… 2013 2 2 0 0 3 4 1 NaN 0.75 1
## 10 Kostas… 2020 3 3 0 0 1 2 1 NaN 0.5 1
## # … with 11,831 more rows, and 2 more variables: PTS <dbl>, TSP <dbl>
The tbl contains 11,841 player-seasons. As you can see, it also includes players with very few field goal attempts (FGA).
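The NaN entries in the TPP and FTP columns come from dividing zero makes by zero attempts. A minimal sketch, using made-up values rather than rows from the data:

```r
# A player with no three-point attempts: 0 made out of 0 attempted
tpm <- 0
tpa <- 0
tpp <- tpm / tpa   # 0 / 0 is undefined, so R returns NaN

is.nan(tpp)

# NaN propagates through arithmetic, so summaries need care
mean(c(0.35, 0.40, tpp))                 # NaN
mean(c(0.35, 0.40, tpp), na.rm = TRUE)   # average of the two defined values
```

This is one more reason to filter out player-seasons with very few attempts before summarizing the percentages.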
We will use visualizations to answer some questions about the data. Specifically, we will study the distribution of the individual columns as well as try to understand the relationship between pairs of variables. For instance, are players attempting and making more three-point shots now than they did 10 years ago?
Throughout the week, we will be using the popular ggplot2 package
(again created by Hadley Wickham and a member of the
tidyverse) for all of our data visualizations. The
gg stands for the grammar of graphics, an
intuitive framework for data visualization. Given a dataset, such as
raw_shooting, we want to map the columns to certain
aesthetics of a visualization, such as the x-axis,
y-axis, size, color, etc. A geometric object, such as the bars of a
bar chart or the points of a scatterplot, then represents those aesthetics visually. This
framework separates the process of visualization into different
components: data, aesthetic mappings, and geometric objects. These
components are then added together (or layered) to
produce the final graph. The ggplot2 package is the most
popular implementation of the grammar of graphics and
is relatively easy to use.
As you saw in Prof. Wyner’s lectures, histograms are a powerful way
to describe a single dataset. Let’s start with making a histogram of
field-goal percentage (FGP). To do so with ggplot2, we
start by telling R which tbl we want to use as
the data for the plot. This is done using the ggplot()
function:
ggplot(data = raw_shooting)
As you can see, nothing is displayed! That’s because we’ve just initialized the dataset to be used for creating the plot. Next, we need to map variables/columns from the data to aesthetics/properties of the plot. Examples include:
- x: the variable that will be on the x-axis
- y: the variable that will be on the y-axis
- color: the variable that categorizes data by color
- shape: the variable that categorizes data by shape
For the histogram, we map FGP to the x aesthetic with
the aes() function, displaying values of FGP along the
x-axis:
ggplot(data = raw_shooting, aes(x = FGP))
Now we can see an axis for FGP, but still no histogram! That’s
because we need to add the geometric layer of the
histogram to the plot. The various geometric objects available in
ggplot2 are referred to as geoms, and these
ultimately determine the type of plot that will be created. Examples
include:
- geom_point(): creates a scatterplot
- geom_histogram(): creates a histogram
- geom_line(): creates a line
- geom_boxplot(): creates a boxplot
To display the FGP histogram, we will simply add the
geom_histogram() layer to the plot using the +
operator:
ggplot(data = raw_shooting, aes(x = FGP)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
And of course, we can generate the same figure by
piping our dataset into the ggplot
function. The remaining examples this week will use the pipe operator
%>%, emphasizing how you
can manipulate the tbl with various other functions prior
to displaying the data:
raw_shooting %>%
ggplot(aes(x = FGP)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Each geom has various attributes that can be
modified, either by mapping variables with aes or through
fixed settings that affect the final displayed plot. A common
problem with histograms is choosing the width of the bins or
the number of bins: how much smoothing of the data do we want? You
might have noticed that by default geom_histogram uses
30 bins and prints a message about this decision. We can
easily modify the number of bins in the geom_histogram
function using the bins argument:
raw_shooting %>%
ggplot(aes(x = FGP)) +
geom_histogram(bins = 50)
Or we can directly specify the width of the bins, that is, the range of
FGP values covered by each bin, using binwidth:
raw_shooting %>%
ggplot(aes(x = FGP)) +
geom_histogram(binwidth = 0.05)
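To compare the two ways of controlling bins, the sketch below inspects the bins that ggplot2 computes, using layer_data() on simulated toy data. The toy data frame and its 200 random values are assumptions for illustration, not the NBA data.

```r
library(ggplot2)

# Hypothetical toy data: 200 simulated shooting percentages in [0, 1]
set.seed(1)
toy <- data.frame(FGP = runif(200))

p_bins  <- ggplot(toy, aes(x = FGP)) + geom_histogram(bins = 10)
p_width <- ggplot(toy, aes(x = FGP)) + geom_histogram(binwidth = 0.05)

# layer_data() returns the computed bins, one row per bar
bins_tbl  <- layer_data(p_bins)
width_tbl <- layer_data(p_width)

# Every observation lands in exactly one bin either way
sum(bins_tbl$count)
sum(width_tbl$count)

# Narrower bins mean more bars and a less smooth picture
nrow(bins_tbl)
nrow(width_tbl)
```

Both settings describe the same data; they only change how finely the x-axis is cut up.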
Now you should spend time making histograms of FTP and TPP, and other variables created in the code above. Discuss the differences with others.
By this point, it should be clear that there are a number of rather curious features in our dataset. For instance, there seem to be several players who have never made a field goal but have made every one of their free throw attempts. As it turns out, we have several players who have attempted fewer than 5 field goals in a single season. To understand how the rate and efficiency of three-point shots have changed over time, we’d like to remove all of the players who have not attempted a considerable number of field goals and three-pointers.
The filter() function is used to pull out subsets of
observations that satisfy some logical condition like
FGA > 100, or FGA > 100 and
FTA > 50. To make such comparisons in R, we have the
following operators at our disposal:
- == for “equal to”
- != for “not equal to”
- < and <= for “less than” and “less than or equal to”
- > and >= for “greater than” and “greater than or equal to”
- &, |, ! for “AND”, “OR”, and “NOT”
The code below keeps only the player-seasons with more than 100 field goal attempts (FGA).
raw_shooting %>%
filter(FGA > 100)
## # A tibble: 9,142 × 14
## PLAYER SEASON FGM FGA TPM TPA FTM FTA FGP TPP FTP eFGP
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Robert … 2022 271 368 0 1 65 90 0.736 0 0.722 0.736
## 2 DeAndre… 2021 190 249 0 1 46 92 0.763 0 0.5 0.763
## 3 Rudy Go… 2022 362 508 0 4 303 439 0.713 0 0.690 0.713
## 4 Mitchel… 2020 253 341 0 0 84 148 0.742 NaN 0.568 0.742
## 5 Chris W… 2013 110 153 0 1 39 58 0.719 0 0.672 0.719
## 6 Dwight … 2022 269 401 13 37 166 212 0.671 0.351 0.783 0.687
## 7 Mitchel… 2022 261 343 0 0 88 181 0.761 NaN 0.486 0.761
## 8 Robert … 2021 186 258 0 2 45 73 0.721 0 0.616 0.721
## 9 Onyeka … 2022 156 226 0 0 80 110 0.690 NaN 0.727 0.690
## 10 Damian … 2021 70 103 1 5 42 58 0.680 0.2 0.724 0.684
## # … with 9,132 more rows, and 2 more variables: PTS <dbl>, TSP <dbl>
When we run this code, you’ll notice that R prints out a
tbl with 9,142 rows. The original tbl contained 11,841
player-seasons.
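A minimal sketch of how filter() treats boundary and missing values, using a hypothetical four-row tbl (the players and numbers are made up):

```r
library(dplyr)

# Hypothetical toy tbl with four player-seasons
toy <- tibble(PLAYER = c("A", "B", "C", "D"),
              FGA    = c(250, 100, 40, NA))

kept <- toy %>% filter(FGA > 100)

kept        # only player A: 100 is not > 100, and the NA row is dropped
nrow(toy)   # filter() returns a new tbl and leaves toy untouched
```

Because the result is not assigned back to the original name, the original tbl keeps all of its rows.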
We can also filter on more complicated conditions constructed using
the AND, OR, and NOT operators: &, |, and
!. For instance, to filter observations with at least 100
field goal attempts OR 50 three-point attempts, we would do
raw_shooting %>%
filter(FGA >= 100 | TPA >= 50)
## # A tibble: 9,203 × 14
## PLAYER SEASON FGM FGA TPM TPA FTM FTA FGP TPP FTP eFGP
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Robert … 2022 271 368 0 1 65 90 0.736 0 0.722 0.736
## 2 DeAndre… 2021 190 249 0 1 46 92 0.763 0 0.5 0.763
## 3 Rudy Go… 2022 362 508 0 4 303 439 0.713 0 0.690 0.713
## 4 Mitchel… 2020 253 341 0 0 84 148 0.742 NaN 0.568 0.742
## 5 Chris W… 2013 110 153 0 1 39 58 0.719 0 0.672 0.719
## 6 Dwight … 2022 269 401 13 37 166 212 0.671 0.351 0.783 0.687
## 7 Mitchel… 2022 261 343 0 0 88 181 0.761 NaN 0.486 0.761
## 8 Robert … 2021 186 258 0 2 45 73 0.721 0 0.616 0.721
## 9 Onyeka … 2022 156 226 0 0 80 110 0.690 NaN 0.727 0.690
## 10 Damian … 2021 70 103 1 5 42 58 0.680 0.2 0.724 0.684
## # … with 9,193 more rows, and 2 more variables: PTS <dbl>, TSP <dbl>
We can combine several conditions, using parentheses to make the grouping explicit:
raw_shooting %>%
filter((FGA >= 100 & TPA >= 50) | (FGP >= 0.45 & FGP <= 0.5))
## # A tibble: 7,222 × 14
## PLAYER SEASON FGM FGA TPM TPA FTM FTA FGP TPP FTP eFGP
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Keith… 2014 3 6 3 6 3 3 0.5 0.5 1 0.75
## 2 Reggi… 2018 1 2 1 2 0 0 0.5 0.5 NaN 0.75
## 3 Coty … 2016 2 4 2 2 0 0 0.5 1 NaN 0.75
## 4 Jamaa… 2015 1 2 1 2 0 0 0.5 0.5 NaN 0.75
## 5 Tony … 2021 88 171 62 109 11 11 0.515 0.569 1 0.696
## 6 Kyle … 2015 292 600 221 449 106 118 0.487 0.492 0.898 0.671
## 7 Chand… 2022 1 2 0 0 2 2 0.5 NaN 1 0.5
## 8 Chuck… 2016 1 2 0 0 2 2 0.5 NaN 1 0.5
## 9 Etdri… 2001 2 4 0 0 4 4 0.5 NaN 1 0.5
## 10 Aleks… 1997 8 16 5 7 4 5 0.5 0.714 0.8 0.656
## # … with 7,212 more rows, and 2 more variables: PTS <dbl>, TSP <dbl>
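The grouping matters. In R, & is evaluated before |, so the parentheses above make the default grouping explicit, while a different grouping selects different rows. A sketch on a hypothetical four-row tbl:

```r
library(dplyr)

# Hypothetical toy tbl (values made up for illustration)
toy <- tibble(FGA = c(200, 200, 10, 10),
              TPA = c(80, 5, 80, 5),
              FGP = c(0.40, 0.47, 0.47, 0.60))

# Grouping used in the lecture: (high volume) OR (FGP between 0.45 and 0.5)
grouped <- toy %>%
  filter((FGA >= 100 & TPA >= 50) | (FGP >= 0.45 & FGP <= 0.5))

# Without parentheses, & binds tighter than |, giving the same rows
no_parens <- toy %>%
  filter(FGA >= 100 & TPA >= 50 | FGP >= 0.45 & FGP <= 0.5)

# A different grouping asks a different question and keeps fewer rows
regrouped <- toy %>%
  filter(FGA >= 100 & (TPA >= 50 | FGP >= 0.45) & FGP <= 0.5)

nrow(grouped)
nrow(no_parens)
nrow(regrouped)
```

Writing the parentheses even when they are not strictly required keeps the intent readable.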
What if we wanted to pull out the observations corresponding to the
2021-22 and 2014-15 seasons? We could do something like
filter(raw_shooting, (SEASON == 2022) | (SEASON == 2015)),
which would be perfectly fine. However, what if we wanted data from
2021-22, 2011-12, and 2014-15? Typing a lot of expressions like
SEASON == ... would be rather tedious. The
%in% operator lets us avoid this tedium:
raw_shooting %>%
filter(SEASON %in% c(2022, 2012, 2015))
## # A tibble: 1,547 × 14
## PLAYER SEASON FGM FGA TPM TPA FTM FTA FGP TPP FTP eFGP
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Rayjon … 2022 6 9 4 5 5 7 0.667 0.8 0.714 0.889
## 2 Juwan M… 2022 2 3 1 2 0 0 0.667 0.5 NaN 0.833
## 3 McKinle… 2022 2 3 1 2 0 0 0.667 0.5 NaN 0.833
## 4 Ryan Re… 2012 4 5 0 0 0 0 0.8 NaN NaN 0.8
## 5 D.J. Wi… 2022 11 15 0 2 8 10 0.733 0 0.8 0.733
## 6 Arinze … 2015 12 14 0 0 3 8 0.857 NaN 0.375 0.857
## 7 Craig S… 2022 3 4 0 1 0 0 0.75 0 NaN 0.75
## 8 Jamaal … 2015 1 2 1 2 0 0 0.5 0.5 NaN 0.75
## 9 Robert … 2022 271 368 0 1 65 90 0.736 0 0.722 0.736
## 10 Udoka A… 2022 37 49 0 0 6 11 0.755 NaN 0.545 0.755
## # … with 1,537 more rows, and 2 more variables: PTS <dbl>, TSP <dbl>
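As a quick check that %in% is shorthand for a chain of == comparisons joined by |, here is a sketch on a small made-up vector of seasons:

```r
# Hypothetical vector of season labels
seasons <- c(2010, 2012, 2015, 2022, 2022)

# Long form: one comparison per season, joined with OR
via_or <- (seasons == 2022) | (seasons == 2012) | (seasons == 2015)

# Short form: is each element a member of the set?
via_in <- seasons %in% c(2022, 2012, 2015)

identical(via_or, via_in)
```

Both produce the same logical vector, so either can be used inside filter().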
A typical NBA season consists of 82 games, but a few NBA seasons
have been shortened. Maybe we want to exclude those seasons with fewer
than 82 games. The NBA had two lockout-shortened seasons, 1998-99 and
2011-12, and COVID-19 shortened the 2019-20 and 2020-21
seasons. To remove these four seasons, we can use a
combination of the NOT operator ! and
%in%:
raw_shooting %>%
filter(!SEASON %in% c(1999, 2012, 2020, 2021))
## # A tibble: 9,972 × 14
## PLAYER SEASON FGM FGA TPM TPA FTM FTA FGP TPP FTP eFGP
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Donne… 2005 2 2 0 0 4 4 1 NaN 1 1
## 2 David… 2018 2 3 2 3 4 4 0.667 0.667 1 1
## 3 Jacki… 2005 4 4 0 0 2 2 1 NaN 1 1
## 4 Rakee… 2016 2 2 0 0 0 0 1 NaN NaN 1
## 5 Marcu… 2009 2 2 0 0 0 0 1 NaN NaN 1
## 6 David… 2001 3 3 0 0 0 0 1 NaN NaN 1
## 7 Julya… 2013 2 2 0 0 3 4 1 NaN 0.75 1
## 8 Kanie… 2004 3 3 0 0 1 2 1 NaN 0.5 1
## 9 Rayjo… 2022 6 9 4 5 5 7 0.667 0.8 0.714 0.889
## 10 Amir … 2006 7 10 2 3 4 4 0.7 0.667 1 0.8
## # … with 9,962 more rows, and 2 more variables: PTS <dbl>, TSP <dbl>
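The expression !SEASON %in% c(...) works without extra parentheses because %in% is evaluated before !. A small sketch on a made-up vector shows that both spellings agree:

```r
# Hypothetical vector of season labels
seasons   <- c(1998, 1999, 2012, 2019)
shortened <- c(1999, 2012, 2020, 2021)

# %in% has higher precedence than !, so these two are the same
no_parens   <- !seasons %in% shortened
with_parens <- !(seasons %in% shortened)

identical(no_parens, with_parens)

# The seasons that survive the filter
seasons[no_parens]
```

If the unparenthesized form looks ambiguous to you, the version with parentheses is equally valid inside filter().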
For the remainder of this module, we will focus on the full seasons
(i.e., !SEASON %in% c(1999, 2012, 2020, 2021)), but we still
need to determine a cutoff for FGA and TPA. Let’s start by making
histograms of the two variables to see their individual distributions,
and rather than creating a new temporary tbl we’ll take
advantage of the %>%. First for FGA:
raw_shooting %>%
filter(!SEASON %in% c(1999, 2012, 2020, 2021)) %>%
ggplot(aes(x = FGA)) +
geom_histogram(binwidth = 50)
And now for TPA:
raw_shooting %>%
filter(!SEASON %in% c(1999, 2012, 2020, 2021)) %>%
ggplot(aes(x = TPA)) +
geom_histogram(binwidth = 25)
It might not make sense, however, to choose cutoffs for FGA and TPA without considering their relationship.
We can proceed to view the joint distribution of FGA
and TPA, and the relationship between the two variables, by displaying a
scatterplot of the data. Obviously, as TPA increases the FGA will
increase (since FGA is the sum of TPA and the number of two-point
attempts). We create a scatterplot by mapping multiple variables to
x and y, and use geom_point to
display the desired geometric object of points instead of
geom_histogram from before.
raw_shooting %>%
filter(!SEASON %in% c(1999, 2012, 2020, 2021)) %>%
ggplot(aes(x = TPA, y = FGA)) +
geom_point()
Immediately, we notice a few things about the figure. First, we see the clear relationship between TPA and FGA, where TPA provides a lower bound on FGA (recall why this is expected). We also see what appears to be a group of points with a small number of three-point attempts but a wide range of values for FGA, while the main block of points shows a clear increasing relationship. A major drawback of this scatterplot is its inability to show the relative density of points. For instance, all we see is a solid black mass at the lower range of values for both TPA and FGA, making it hard to determine where an appropriate cutoff should be made.
One way to address this is to use alpha-blending to change
the transparency of each point. When there are many points plotted in
the same region, that region will appear darker. Just like
binwidth or bins for
geom_histogram, geom_point has settings we can
change. Here we use the alpha setting, which ranges
from 0 (completely transparent) to 1 (solid and opaque, the
default).
raw_shooting %>%
filter(!SEASON %in% c(1999, 2012, 2020, 2021)) %>%
ggplot(aes(x = TPA, y = FGA)) +
geom_point(alpha = 0.1)
Now we have a much better idea of where the majority of points are,
with a clear group of players displaying a smaller number of TPA.
Another type of plot commonly used in this situation is a
heatmap which you can think of as a two-dimensional
histogram. To form a heatmap, you start by dividing the coordinate plane
into many evenly-sized two-dimensional bins and then count the number of
points within each bin. You then color the bin according to the count.
While you can conceptually make the bins any shape you want, there are
two popular conventions: rectangular binning and hexagonal binning. For
this plot, we will focus on rectangular binning, using the
geom_bin2d() function.
raw_shooting %>%
filter(!SEASON %in% c(1999, 2012, 2020, 2021)) %>%
ggplot(aes(x = TPA, y = FGA)) +
geom_bin2d()
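The counting step behind a rectangular heatmap can be sketched by hand. The toy data and the bin width of 5 below are made up for illustration; this mirrors the idea of rectangular binning rather than the internals of geom_bin2d().

```r
library(dplyr)

# Hypothetical toy data: five points in the plane
toy <- tibble(TPA = c(1, 2, 8, 9, 9),
              FGA = c(1, 2, 8, 9, 1))

# Assign each point to a 5-by-5 rectangular bin, then count points per bin
bin_counts <- toy %>%
  mutate(x_bin = TPA %/% 5,   # integer division gives the bin index on x
         y_bin = FGA %/% 5) %>%   # and on y
  count(x_bin, y_bin)

bin_counts
```

A heatmap then colors each occupied bin according to its count n.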
A color scale now appears, telling us the number of points in each bin. As with histograms, we can increase the number of bins to get a higher-resolution view of our data.
raw_shooting %>%
filter(!SEASON %in% c(1999, 2012, 2020, 2021)) %>%
ggplot(aes(x = TPA, y = FGA)) +
geom_bin2d(bins = 100)
In this example, using geom_point with lower values for
alpha provides a clear interpretation of where to make the
cutoff. Given the histograms we previously made for each variable as
well, let’s use a cutoff of TPA > 50 and
FGA > 150. We can create the scatterplot as before but,
to demonstrate a convenient feature of ggplot2, we’ll now
assign the plot to a variable named fga_tpa_plot,
fga_tpa_plot <- raw_shooting %>%
filter(!SEASON %in% c(1999, 2012, 2020, 2021)) %>%
ggplot(aes(x = TPA, y = FGA)) +
geom_point(alpha = 0.1)
Notice how this code runs without displaying the plot. To display it,
we simply run fga_tpa_plot in the console and the plot
appears:
fga_tpa_plot
Now we’re going to annotate this plot with the cutoffs we’ve
determined. To do so, we’re going to add a vertical line to provide the
cutoff for the minimum value of TPA, and a horizontal line for the
minimum value of FGA. Both of these can be accomplished using the
geoms geom_vline and geom_hline
where we specify for each the values for the intercepts to draw the
lines at. Since we’ve stored the plot in fga_tpa_plot, we
can add these layers with the + operator to
fga_tpa_plot directly:
fga_tpa_plot +
# Add TPA cutoff
geom_vline(xintercept = 50) +
# Add FGA cutoff
geom_hline(yintercept = 150)
This gives us a good indication of what we’re cutting off,
but the lines should stand out more clearly from the points.
Since these are both geoms with their own set of
attributes, we’ll change their color and make the line type
dashed:
fga_tpa_plot +
# Add TPA cutoff
geom_vline(xintercept = 50, color = "red", linetype = "dashed") +
# Add FGA cutoff
geom_hline(yintercept = 150, color = "red", linetype = "dashed")
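Adding layers with + returns a new plot object rather than modifying the stored one, which is why the same stored plot can be reused across chunks. A minimal sketch with a hypothetical three-row data frame; the names base_plot and annotated are illustrative:

```r
library(ggplot2)

# Hypothetical toy data (values made up for illustration)
toy <- data.frame(TPA = c(10, 60, 120), FGA = c(90, 200, 400))

base_plot <- ggplot(toy, aes(x = TPA, y = FGA)) +
  geom_point()

# Adding layers creates a new object; base_plot itself is unchanged
annotated <- base_plot +
  geom_vline(xintercept = 50, linetype = "dashed") +
  geom_hline(yintercept = 150, linetype = "dashed")

length(base_plot$layers)   # still just the point layer
length(annotated$layers)   # points plus the two reference lines
```

To keep the annotations, assign the result to a name as done here; otherwise the annotated plot is only displayed once.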
Although ggplot2 automatically provides axis labels
based on the variables we’ve mapped to the aesthetics in
aes, we really need text describing what is shown in the
plot. The easiest way to do this is by adding a label layer with the
labs() function. In labs, you can supply
the same aesthetic names used inside aes (such as x and y), as well as other
parts of the plot to label, such as the title, subtitle, and caption. We
add appropriate labels to the plot above, with better descriptions for
the axes, as well as an informative subtitle regarding the red-dashed
lines:
fga_tpa_plot +
# Add TPA cutoff
geom_vline(xintercept = 50, color = "red", linetype = "dashed") +
# Add FGA cutoff
geom_hline(yintercept = 150, color = "red", linetype = "dashed") +
# Add appropriate labels:
labs(x = "Number of three-point attempts (TPA)",
y = "Number of field goal attempts (FGA)",
title = "Scatterplot of FGA and TPA (excluding shortened seasons '99, '12, '20, '21)",
subtitle = "Red-dashed lines indicate cutoffs for TPA > 50 and FGA > 150",
caption = "Created by INSERT YOUR NAME HERE")
As a reminder, the code chunk above is equivalent to running the entire pipeline without creating any intermediate objects, although this is not recommended:
read_csv(file = "data/nba_shooting_1997_2022.csv") %>%
filter(!SEASON %in% c(1999, 2012, 2020, 2021)) %>%
ggplot(aes(x = TPA, y = FGA)) +
geom_point(alpha = 0.1) +
geom_vline(xintercept = 50, color = "red", linetype = "dashed") +
geom_hline(yintercept = 150, color = "red", linetype = "dashed") +
labs(x = "Number of three-point attempts (TPA)",
y = "Number of field goal attempts (FGA)",
title = "Scatterplot of FGA and TPA (excluding shortened seasons '99, '12, '20, '21)",
subtitle = "Red-dashed lines indicate cutoffs for TPA > 50 and FGA > 150",
caption = "Created by INSERT YOUR NAME HERE")
For the next lecture, Coding Lecture 3, we will focus on the dataset
of players who attempted more than 150 field goals and more than 50 three-pointers
in the full (non-shortened) seasons. We’ll make this clean dataset using
the filter() function, separating each condition with
commas, and refer to the resulting dataset as nba_shooting.
Finally, we’ll add a mutate() line to create the
three_point_fg_rate = TPA / FGA variable from Coding
Lecture 1.
nba_shooting <- raw_shooting %>%
# Filter on the conditions
filter(!SEASON %in% c(1999, 2012, 2020, 2021),
TPA > 50,
FGA > 150) %>%
# Create the three-point attempt rate variable
mutate(three_point_fg_rate = TPA / FGA)
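Conditions separated by commas inside filter() are combined with AND, so a row must satisfy all of them to be kept. A sketch on a hypothetical four-row tbl, comparing the comma form with an explicit &:

```r
library(dplyr)

# Hypothetical toy tbl (values made up for illustration)
toy <- tibble(SEASON = c(1999, 2005, 2005, 2018),
              TPA    = c(200, 51, 50, 300),
              FGA    = c(900, 151, 400, 150))

with_commas <- toy %>%
  filter(!SEASON %in% c(1999, 2012, 2020, 2021),
         TPA > 50,
         FGA > 150)

with_and <- toy %>%
  filter(!SEASON %in% c(1999, 2012, 2020, 2021) & TPA > 50 & FGA > 150)

all.equal(with_commas, with_and)
with_commas
```

Only the second row survives: the first is in a shortened season, and the strict inequalities drop the rows with exactly 50 three-point attempts or exactly 150 field goal attempts.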
While we can always re-run the commands used to produce this
tbl from our script, when data analyses become more
complicated, it is helpful to save these objects. R has its
own special file format for efficiently saving data on your
computer.
We will use the save() command.
save(nba_shooting, file = "data/clean_nba_shooting.RData")
Then when we want to load the data back into R, we can use
the load() function.
load("data/clean_nba_shooting.RData")
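A minimal round-trip sketch of save() and load(), using a temporary file and a hypothetical stand-in object (nba_toy) so that it runs without the course data:

```r
# Hypothetical stand-in for nba_shooting, saved to a temporary file
nba_toy <- data.frame(PLAYER = c("A", "B"), TPA = c(60, 120))
path <- tempfile(fileext = ".RData")

save(nba_toy, file = path)

original <- nba_toy
rm(nba_toy)                  # simulate starting a fresh session

# load() restores objects under their original names
# and returns those names as a character vector
loaded_names <- load(path)

loaded_names
identical(nba_toy, original)
```

Note that load() recreates the object under the name it was saved with, so there is no need to assign its result to get the data back.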
Next, proceed to practice this lecture’s lessons in Problem Set 2.