R Studio Server: http://rstudio.saintannsny.org:8787/

Today, we’ll be using the ggplot2 package to visualize NBA shot data.

You may want to look back over our past labs:

Data Transformation: http://rpubs.com/jcross/data_transformation_lahman
More Data Transformation: http://rpubs.com/jcross/data_transformation2
Grouping and Summarizing: http://rpubs.com/jcross/nba_play_by_play

You may also want to read the relevant section in Garrett Grolemund and Hadley Wickham’s online textbook R for Data Science (http://r4ds.had.co.nz/). This lab practices the material in chapter 3.

The Data and the Packages

Let’s start by loading the data and the packages:

# Reading in data on NBA jump shots taken in the 1st quarter of games in December 2015
n <- read.csv('/home/rstudioshared/shared_files/data/nba_savant_jumpshots_dec2015_q1.csv')

# loading the dplyr and ggplot2 libraries
library(dplyr)
library(ggplot2)

Scatterplots

With the ggplot function, you first specify what data you want to use and then add “layers” to your plot (separated by + signs). Let’s use n (the NBA shot data) as our data and tell ggplot to put shot distance on the x-axis and defender distance on the y-axis. Note that “aes” stands for “aesthetic”:

ggplot(data = n) + 
  geom_point(mapping = aes(x = shot_distance, y = defender_distance))

You can, of course, change the color, size and shape of the points:

ggplot(data = n) + 
  geom_point(mapping = aes(x = shot_distance, y = defender_distance), col="red")

ggplot(data = n) + 
  geom_point(mapping = aes(x = shot_distance, y = defender_distance), col="red", size=0.1)

ggplot(data = n) + 
  geom_point(mapping = aes(x = shot_distance, y = defender_distance), col="dark green", size=1.5, pch=3)

We could also color the points based on shot_type (although notice that now we need to do this inside aes(), because the color depends on a column of the data):

ggplot(data = n) + 
  geom_point(mapping = aes(x = shot_distance, y = defender_distance, col=shot_type))

… or color the points based on action type and shape them based on shot type:

ggplot(data = n) + 
  geom_point(mapping = aes(x = shot_distance, y = defender_distance, col=action_type, pch=shot_type))
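The difference between setting a color outside aes() and mapping one inside aes() can be checked on a small data frame. This is only a sketch: the toy data frame and all of its values are made up for illustration, and only the column names follow the lab's data.

```r
library(ggplot2)

# Made-up stand-in for the lab's data; the values are invented for illustration
toy <- data.frame(
  shot_distance     = c(3, 8, 15, 22, 24, 26),
  defender_distance = c(2, 4, 3, 6, 5, 7),
  shot_type         = c("2PT Field Goal", "2PT Field Goal", "2PT Field Goal",
                        "3PT Field Goal", "3PT Field Goal", "3PT Field Goal")
)

# Outside aes(): every point gets the same fixed color and no legend is drawn
p_fixed <- ggplot(data = toy) +
  geom_point(mapping = aes(x = shot_distance, y = defender_distance), col = "red")

# Inside aes(): the color depends on the shot_type column and a legend is added
p_mapped <- ggplot(data = toy) +
  geom_point(mapping = aes(x = shot_distance, y = defender_distance, col = shot_type))

p_fixed
p_mapped
```

A quick rule of thumb: if the value is a column name, it belongs inside aes(); if it is a constant such as "red" or 0.1, it belongs outside.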

Note: You can also arrange your code differently so that multiple layers share the same mapping. Below, the second layer is a smooth curve through your data.

ggplot(data = n, mapping = aes(x = shot_distance, y = defender_distance))+
  geom_point()+
  geom_smooth()
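A mapping given in ggplot() is inherited by every layer, and an individual layer can still add aesthetics of its own. The sketch below assumes a made-up toy data frame (the values are invented) and uses method = "lm" only because the default smoother is not well suited to six points.

```r
library(ggplot2)

# Made-up stand-in for the lab's data; the values are invented for illustration
toy <- data.frame(
  shot_distance     = c(3, 8, 15, 22, 24, 26),
  defender_distance = c(2, 4, 3, 6, 5, 7),
  shot_type         = c("2PT Field Goal", "2PT Field Goal", "2PT Field Goal",
                        "3PT Field Goal", "3PT Field Goal", "3PT Field Goal")
)

# x and y are shared by both layers; color is added to the point layer only,
# so the points are colored by shot_type but a single line is fitted to all shots
p_shared <- ggplot(data = toy, mapping = aes(x = shot_distance, y = defender_distance)) +
  geom_point(mapping = aes(col = shot_type)) +
  geom_smooth(method = "lm", se = FALSE)

p_shared
```

Moving col = shot_type up into ggplot() would instead apply it to both layers and fit one line per shot type.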

Shot Locations

Each shot also has an x and a y coordinate in the data set, and by graphing shots using their coordinates we can see how the nbasavant coordinate system works:

ggplot(data = n) + 
  geom_point(mapping = aes(x = x, y = y, color=shot_type))

Perhaps most importantly, you can combine ggplot with the functions you learned in previous labs. Here are the Warriors’ shot locations. Notice that in ggplot() we don’t need to specify the data, since the data is piped in from the previous commands:

n %>% 
  filter(team_name=="Golden State Warriors") %>% 
  ggplot() + geom_point(mapping = aes(x = x, y = y, color=shot_type))

We could compare the Knicks and Warriors shots as follows:

n %>% 
  filter(team_name %in% c("Golden State Warriors", "New York Knicks")) %>% 
  ggplot()+
  geom_point(mapping = aes(x = x, y = y, color=team_name))

But it might be even better to put them side by side, which we can do using facets:

n %>% 
  filter(team_name %in% c("Golden State Warriors", "New York Knicks")) %>% 
  ggplot()+
  geom_point(mapping = aes(x = x, y = y, color=shot_type)) +
  facet_wrap(~team_name)

We could go one step fancier and make a facet grid. Take a couple of moments to see if you can understand what we’re looking at here:

n %>% 
  filter(team_name %in% c("Golden State Warriors", "New York Knicks")) %>% 
  ggplot()+
  geom_point(mapping = aes(x = x, y = y, color=shot_type)) +
  facet_grid(shot_made_flag~team_name)
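The difference between facet_wrap() and facet_grid() is easier to see on a tiny example. The sketch below uses a made-up toy data frame (coordinates and outcomes are invented): facet_wrap(~team_name) gives one panel per team, while facet_grid(shot_made_flag ~ team_name) puts shot_made_flag on the rows and team_name on the columns.

```r
library(ggplot2)

# Made-up shots for two teams; coordinates and outcomes are invented
toy <- data.frame(
  x              = c(-10, 5, 20, -15, 0, 12, -8, 18),
  y              = c(5, 12, 20, 8, 3, 25, 15, 10),
  team_name      = rep(c("Golden State Warriors", "New York Knicks"), each = 4),
  shot_made_flag = c(1, 0, 1, 0, 0, 1, 1, 0)
)

# One panel per team
p_wrap <- ggplot(data = toy) +
  geom_point(mapping = aes(x = x, y = y)) +
  facet_wrap(~team_name)

# Rows are missed (0) and made (1) shots, columns are teams: 2 x 2 = 4 panels
p_grid <- ggplot(data = toy) +
  geom_point(mapping = aes(x = x, y = y)) +
  facet_grid(shot_made_flag ~ team_name)

p_wrap
p_grid
```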

In our last lab, we looked at how shooting percentage changes with defender distance. If we want to make a graph showing those numbers, we could do something like the following:

n %>% group_by(def_dist_bin=round(defender_distance,0)) %>% 
  summarize(num_shots = length(name), shooting_percentage=mean(shot_made_flag)) %>%
  filter(def_dist_bin<=10) %>%
  ggplot(aes(x=def_dist_bin, y=shooting_percentage, size=num_shots)) + geom_point() +
  geom_smooth(aes(weight=num_shots)) +ggtitle("Shooting Percentage v. Defender Distance (ft.)")
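It can help to look at the summary table before piping it into ggplot(). The sketch below runs the same group_by() and summarize() steps on a made-up toy data frame (the values are invented); it counts rows with dplyr's n() because the toy data has no name column.

```r
library(dplyr)
library(ggplot2)

# Made-up shots; distances and outcomes are invented for illustration
toy <- data.frame(
  defender_distance = c(1.2, 0.9, 1.4, 3.1, 2.8, 3.3, 5.2, 4.9),
  shot_made_flag    = c(0, 1, 0, 1, 0, 1, 1, 1)
)

# Step 1: one row per rounded defender distance
binned <- toy %>%
  group_by(def_dist_bin = round(defender_distance, 0)) %>%
  summarize(num_shots = n(), shooting_percentage = mean(shot_made_flag))

# Inspect the table that the plot will be built from
print(binned)

# Step 2: plot the summary, sizing each point by the number of shots behind it
p_binned <- ggplot(binned, aes(x = def_dist_bin, y = shooting_percentage, size = num_shots)) +
  geom_point()

p_binned
```

Checking the table first makes it easier to spot bins that contain only a handful of shots, which is why the plot sizes the points by num_shots.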

Other types of graphs

We are, of course, not limited to scatterplots. Here are two bar charts made using geom_bar. Each shows the number of 2-point and 3-point field goals attempted, and the second one splits the bars into sections for shots made and shots missed.

ggplot(data = n) + 
  geom_bar(mapping = aes(x = shot_type, fill=shot_type))

ggplot(data = n) + 
  geom_bar(mapping = aes(x = shot_type, fill=as.factor(shot_made_flag)))

The following is a histogram of defender distance. Remember making bins for defender distance in the last lab? A histogram makes the bins for you and counts how many shots fall into each bin:

ggplot(data = n) + 
  geom_histogram(aes(defender_distance))

Of course, you might want to create your own bins. Here is the same data with 5-foot-wide bins:

ggplot(data = n) + 
  geom_histogram(aes(defender_distance), breaks=5*(0:4), fill=1:4)
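Bins can be specified either by listing the break points or by giving a bin width and a boundary to align to. The sketch below uses a made-up toy data frame (the distances are invented) and reads the bin counts back with layer_data() to confirm that both approaches produce the same bins.

```r
library(ggplot2)

# Made-up defender distances in feet; values are invented for illustration
toy <- data.frame(defender_distance = c(1, 2, 3, 6, 7, 12, 13, 14, 18))

# Explicit break points at 0, 5, 10, 15 and 20 feet
p_breaks <- ggplot(data = toy) +
  geom_histogram(aes(defender_distance), breaks = 5 * (0:4))

# The same bins described as 5-foot-wide bins aligned to 0
p_width <- ggplot(data = toy) +
  geom_histogram(aes(defender_distance), binwidth = 5, boundary = 0)

# Number of shots that fall into each bin
layer_data(p_breaks)$count
layer_data(p_width)$count

p_breaks
```

Listing breaks is handy when the bins are uneven or when you want to cut off the plot at a fixed distance; binwidth is shorter when all bins are the same size.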

Maybe I want to make a map showing where shots are most often taken. I could make a hexbin plot or a contour plot:

ggplot(data = n) + 
  geom_hex(aes(x=x, y=y))

ggplot(data = n) + 
  geom_density2d(aes(x=x, y=y))

… and in either case I could try out a different color scheme:

ggplot(data = n) + 
  geom_hex(aes(x=x, y=y)) + 
  scale_fill_gradient(low = 'white', high = 'red', guide = 'colorbar')

ggplot(data = n) + 
  # after_stat(level) replaces the deprecated ..level.. notation (ggplot2 3.3.0 and later)
  stat_density2d(aes(x=x, y=y, color = after_stat(level))) + 
  scale_color_gradient(low = 'yellow', high = 'red', guide = 'colorbar')

Take Home Ideas

As you explore data, you will frequently need to make scatterplots and histograms to get a sense of the shape of the data. You will usually need fancier plots (density plots and hexbins) only when it comes time to present your research, and you can look up how to create these plots when the time comes. A good reference for making plots using the ggplot2 package can be found here.

Explore