RStudio Server: http://rstudio.saintannsny.org:8787/

Once again we’ll be using the dplyr package in R, this time with play-by-play NBA data. We’ll also use the summarize and group_by functions for the first time.

You may want to look back over our past two labs:

Data Transformation: http://rpubs.com/jcross/data_transformation_lahman

More Data Transformation: http://rpubs.com/jcross/data_transformation2

You may also want to read the relevant section in Garrett Grolemund and Hadley Wickham’s online textbook *R for Data Science* (http://r4ds.had.co.nz/). This lab practices the material in Section 5.6.

Scripts and Work Flow

Click on the icon in the upper left (or on the File menu and then New File) and create a new R script. This will open an empty script pane in the upper left where you can save code. Note that you will need to save it yourself: unlike a Google document, this file will not save automatically.

You should save your (successful) code from this lab into an R script. You can do this either by typing code into the script, running it, and then deleting code that is not useful, or by trying out code in the console (lower left) and then copying and pasting successful code into your script.

Comments

Your script should include comments telling me (and you in the weeks to come) what your code does. Comments are simply text following a # sign. Your script should start as follows:

# Reading in data on NBA jump shots taken in the 1st quarter of games in December 2015
n <- read.csv('/home/rstudioshared/shared_files/data/nba_savant_jumpshots_dec2015_q1.csv')

# loading the dplyr library
library(dplyr)

# looking at the first six rows of data
head(n)

You don’t need to overdo it with comments (which I might be doing above) but know that you will invariably find that you want to revisit past work. If this happens the next day, you might well remember the meaning of each line of code. If you revisit your code weeks or months later or share it with someone, however, your pal or the future you will be lost without comments.

Note that the data in this lab is from http://nbasavant.com/shot_search.php.

Summarize

Once you have run the code above (loading the data and the dplyr library) we can start summarizing the data. The shot_made_flag column takes the value 1 for a shot made and 0 for a shot missed, so its sum is the number of shots made and its mean is the fraction of shots made. We can get the total number of shots made (in this data set) and the shooting percentage as follows:

n %>% summarise(tot_shots_made = sum(shot_made_flag))

n %>% summarise(shooting_percentage = mean(shot_made_flag))

Note that we can create multiple summaries at once. For instance:

n %>% summarise(tot_shots_made = sum(shot_made_flag), shooting_percentage = mean(shot_made_flag))
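To see why the mean of a 0/1 column works as a shooting percentage, here is a small sketch on a made-up data frame. The name toy and its five shots are hypothetical and are not part of the lab data.

```r
library(dplyr)

# Hypothetical toy data: five shots, three of them made
toy <- data.frame(shot_made_flag = c(1, 0, 1, 1, 0))

toy_summary <- toy %>%
  summarise(tot_shots_made = sum(shot_made_flag),        # adds up the 1s
            shooting_percentage = mean(shot_made_flag))  # share of shots that are 1s

toy_summary
```

With three made shots out of five, the two summaries should come out as 3 and 0.6.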

Problems:

  1. Calculate the mean amount of time left on the shot clock

  2. Calculate the mean and median “touch time”. Which is larger? Why do you think that might be?

Grouping

I might well be interested in summarizing data by group rather than overall. Groups could be teams, players, types of shots or distances from the basket. To do this we can use the group_by function. For instance, we can get the shooting percentage for every team with:

n %>% group_by(team_name) %>% summarise(shooting_percentage = mean(shot_made_flag))

We can also combine this with the functions we learned in previous labs (filter, arrange, top_n…). Here are the top 10 shooting teams (remember, this data is only from the first quarter of December 2015 games):

n %>% group_by(team_name) %>% 
  summarise(shooting_percentage = mean(shot_made_flag)) %>%
  top_n(10, shooting_percentage) %>%
  arrange(desc(shooting_percentage))
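If your installed dplyr is version 1.0.0 or later (an assumption about your setup), its documentation describes top_n as superseded by slice_max. The sketch below shows the same group, summarise, and pick-the-top pattern on a made-up data frame. The name toy and the teams "A" and "B" are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: two teams, four shots each
toy <- data.frame(
  team_name = c("A", "A", "A", "A", "B", "B", "B", "B"),
  shot_made_flag = c(1, 1, 1, 0, 1, 0, 0, 0),
  stringsAsFactors = FALSE
)

team_pct <- toy %>%
  group_by(team_name) %>%
  summarise(shooting_percentage = mean(shot_made_flag))

# Keep the single best-shooting team; use n = 10 for a top 10 list
best <- team_pct %>%
  slice_max(shooting_percentage, n = 1) %>%
  arrange(desc(shooting_percentage))

best
```

Like top_n, slice_max keeps ties by default, so a top 10 list can contain more than 10 rows.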

Grouping Problem:

Find the 10 players who made the most shots and arrange them in a top 10 list.

Coding these summaries can get a bit tricky. Let’s say we wanted to find the ten players with the highest shooting percentages. We would probably want to limit our top ten list to players who had taken some minimum number of shots. To do this we would need to group the data, then summarize, then filter…

n %>% group_by(name) %>% 
  summarise(shots_taken = length(name), shooting_percentage = mean(shot_made_flag)) %>%
  filter(shots_taken >= 50) %>%
  top_n(10, shooting_percentage) %>%
  arrange(desc(shooting_percentage))
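The code above counts shots with length(name). dplyr also provides n(), which counts the rows in the current group without naming a column. A minimal sketch on a made-up data frame (toy and the player names are hypothetical) suggests the two give the same counts:

```r
library(dplyr)

# Hypothetical toy data: three players with 3, 2 and 1 shots
toy <- data.frame(
  name = c("Player X", "Player X", "Player X", "Player Y", "Player Y", "Player Z"),
  shot_made_flag = c(1, 0, 1, 1, 1, 0),
  stringsAsFactors = FALSE
)

by_player <- toy %>%
  group_by(name) %>%
  summarise(shots_taken = n(),                 # rows in each group
            shots_taken_alt = length(name),    # same count, as in the lab code
            shooting_percentage = mean(shot_made_flag)) %>%
  filter(shots_taken >= 2) %>%                 # drop players with too few shots
  arrange(desc(shooting_percentage))

by_player
```

Here Player Z is filtered out for having only one shot, which is the same idea as the 50-shot minimum above.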

Quick Triggers?

Find the 5 players who took shots with the most time left on the shot clock, on average, and the 5 players who took shots with the least. Use some reasonable minimum for the number of shots taken.

Multiple Groupings

You can group by multiple criteria at the same time. The following code groups shots by type and whether the shots were made or missed and finds the median amount of time left on the shot clock for each group:

n %>% group_by(action_type, shot_made_flag) %>%
  summarise(shots_taken = length(name), median_shot_clock = median(shot_clock)) %>%
  filter(shots_taken >= 100) %>%
  arrange(desc(median_shot_clock))

Is the median amount of time remaining greater for jump shots that are made or jump shots that are missed?
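One thing to be aware of: after summarising a data frame grouped by two variables, dplyr (version 1.0.0 or later, which is an assumption about your setup) keeps the result grouped by the first variable and usually prints a message saying so. The sketch below, on a made-up data frame named toy (hypothetical), checks the remaining grouping with group_vars and drops it with the .groups argument.

```r
library(dplyr)

# Hypothetical toy data: five shots of two types
toy <- data.frame(
  action_type = c("Jump Shot", "Jump Shot", "Jump Shot", "Layup", "Layup"),
  shot_made_flag = c(1, 0, 0, 1, 1),
  shot_clock = c(10, 14, 6, 20, 18),
  stringsAsFactors = FALSE
)

# Default behaviour: only the last grouping variable is removed
still_grouped <- toy %>%
  group_by(action_type, shot_made_flag) %>%
  summarise(median_shot_clock = median(shot_clock))

group_vars(still_grouped)

# Asking for a fully ungrouped result
ungrouped <- toy %>%
  group_by(action_type, shot_made_flag) %>%
  summarise(median_shot_clock = median(shot_clock), .groups = "drop")

group_vars(ungrouped)
```

The leftover grouping rarely matters for filter and arrange, but it changes what functions like top_n do, since they then work within each remaining group.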

Multiple grouping problem:

Find the mean shooting percentage by team and shot type. Which team/shot type combinations were the most successful? You should limit this to some reasonable minimum number of shots taken.

Cut

Grouping by continuous variables is problematic. Take a look at what happens when you compute shooting percentage by shot distance:

n %>% group_by(shot_distance) %>% 
  summarize(num_shots = length(name), shooting_percentage=mean(shot_made_flag)) %>%
  arrange(shot_distance)

Shot distance can take on any value (rounded to the nearest 0.1 feet), so each individual shot distance includes very few shots. To get a sense of how shooting percentage changes with shot distance, we will want to make larger bins for shot distance. We can use the cut function to do this:

n <- n %>% mutate(shot_distance_bin = cut(shot_distance, breaks=c(0,5,10,15,20,25,30,100)))

head(n)

The first line above adds a new column called shot_distance_bin that says which bucket or bin the shot_distance falls in: (0,5], (5,10], (10,15], (15,20], (20,25], (25,30] or (30,100]. Remember that a square bracket is inclusive and a parenthesis is exclusive, so a shot taken from exactly 5 feet falls in the first bin, (0,5], not the second.
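To check how cut treats values that sit exactly on a break, we can try it on a few made-up distances. The breaks are the ones used in the lab; the distances vector is hypothetical.

```r
# Same breaks as in the lab; the distances are made-up values near the endpoints
breaks <- c(0, 5, 10, 15, 20, 25, 30, 100)
distances <- c(4.9, 5, 5.1, 10)

bins <- cut(distances, breaks = breaks)

# Show each distance next to the bin it was assigned to
data.frame(distances, bins)
```

Both 4.9 and 5 land in (0,5], while 5.1 and 10 land in (5,10], because each bin includes its right endpoint but not its left one.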

Now, let’s look at shooting percentage by shot distance bin:

n %>% group_by(shot_distance_bin) %>%
  summarize(num_shots = length(name), shooting_percentage=mean(shot_made_flag)) %>%
  arrange(shot_distance_bin)

Now we’re getting somewhere but there are a couple of issues. First, there are a number of NA values. These are shots from distances that did not fall into any of our bins. To see why, we can look at a summary of the shot distances:

summary(n$shot_distance)

No shots were taken from over 100 feet, but some shots were taken from 0 feet, which isn’t included in the (0,5] bin. Let’s fix this by altering our bins to include the left endpoint of the first interval. Next, we probably don’t need (or want) this many decimal places, so we can round our calculated shooting percentages to two decimal places. These changes are made below:

n <- n %>% 
  mutate(shot_distance_bin = cut(shot_distance, breaks=c(0,5,10,15,20,25,30,100), include.lowest=TRUE))

n %>% group_by(shot_distance_bin) %>%
  summarize(num_shots = length(name), shooting_percentage=round(mean(shot_made_flag),2)) %>%
  arrange(shot_distance_bin)
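To confirm what include.lowest changes, here is a small sketch that bins a single shot taken from 0 feet, with and without the argument. The breaks are the ones used above; the variable names default_bin and fixed_bin are hypothetical.

```r
# Same breaks as in the lab
breaks <- c(0, 5, 10, 15, 20, 25, 30, 100)

# Without include.lowest, a distance of 0 falls outside (0,5]
default_bin <- cut(0, breaks = breaks)

# With include.lowest = TRUE, the first bin is closed on the left
fixed_bin <- cut(0, breaks = breaks, include.lowest = TRUE)

is.na(default_bin)
as.character(fixed_bin)
```

The first bin is now labelled [0,5], with a square bracket on the left to show that 0 is included. The other bins are unchanged.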

Challenges:

Save the code that answers these questions in your script. You can include the question in your script as a comment.

  1. How does shooting percentage depend on defender distance?
  2. How does shooting percentage depend on the time remaining on the shot clock?
  3. How do defender distance and shot distance depend on time left on the shot clock?

Explore:

What can you discover about individual players, teams or the NBA more generally by exploring this data set? Remember to save your best bits of code in your script along with comments describing what you did.