In this document, I will demonstrate how to compute and chart the average percentage of shots taken as three point attempts in the NBA from the 1980-81 season through the 2014-15 season.
Why would I want to do this? In my view there has been a lot of loose talk centering around the style of play in the NBA and its merits as “good” basketball. Typically, it’s someone “of a certain age” who makes the “things were better in my day” comment. If it’s not that type of comment, then it’s the “fill in the blank is hurting the league” line of argument. Some people may even argue that it is time to move the three point line further back because perimeter play needs to be tempered. As you might guess, I don’t subscribe to these arguments, and normally find such hand wringing laughable. What makes these arguments laughable is the lack of a proper historical perspective on the use of the three point shot in NBA offenses. In this document, I’ll show how to chart the average share of shots taken as three pointers, and how a better perspective on recent developments can help formulate a more reasonable argument.
The data I will use are from the basketball-reference.com website. I thank them for making the data available. The first thing to do is go to basketball-reference.com and download the Team Stats section for each season. For our example, this was done for each of the 35 seasons from 1980-81 through the last complete season as of this writing, 2014-15.
I chose to download the files as .csv files, but please note that the files will have to be edited. For example, I found that I had to delete the first and last lines of each file. Also, in the Team Stats section of basketball-reference.com an asterisk (*) is placed next to the name of each team that made the playoffs that season, and it has to be removed so that team names match across seasons. In addition, if you want to do anything with the data with respect to franchises, you will have to account for team name changes. For example, in the time period we are looking at, the Sacramento Kings franchise was previously the Kansas City Kings. Issues such as these can be fixed with simple text edits. I used the bash shell to make the changes from the command line. Examples of changes are listed below:
# All data files start with the name leagues_NBA_
for file in leagues_NBA_*; do sed -i '1d;$d' "$file";done
for file in leagues_NBA_*; do sed -i 's/\*//' "$file";done
for file in leagues_NBA_*; do sed -i 's/,Sacramento Kings,/,Kansas City - Sacramento Kings,/' "$file"; done
For each of the files two additional columns were made. One column, SEASON, contains the season value for each of the records in each file. For example, the table containing team data for the 2003-04 season has a column with ‘2003-04’ for each of the records. In addition, each of the files has a column that denotes which decade the data are from. The value of the column follows this breakdown:
Also note the use of ‘%’ in the column names to denote percentage or percent. From what I’ve found, R will not handle this well, so it may be best to switch out ‘%’ for ‘pct’ using methods similar to the ones above.
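As a rough illustration of what happens to ‘%’ in column names, the sketch below runs a few header names through make.names(), which is the function R applies to headers by default when reading a file. The header names are an illustrative subset chosen for this example, not a full listing of the columns in the downloaded files.

```r
# Header names as they might appear in a downloaded file (illustrative subset).
raw_names <- c("Team", "FGA", "FG%", "3PA", "3P%")

# Default behaviour: invalid characters become "." and names that start
# with a digit get an "X" prefix.
default_names <- make.names(raw_names)

# Swapping "%" for "pct" first keeps the meaning visible in the column name.
clean_names <- make.names(gsub("%", "pct", raw_names, fixed = TRUE))

print(default_names)
print(clean_names)
```

This is also why the three point attempts column shows up as X3PA once the data are in R.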
So, now we have the individual files edited to our satisfaction. There is still the matter of loading the data into R for analysis. There are a number of ways to do this, but the most correct and time efficient example I found was on this blog post. Actually, the method I used is in the comments replying to the post. Before executing the code, you should make sure of a few things:
file_list <- list.files() # create a list of the files in the directory
dataset <- do.call("rbind", lapply(file_list,
  FUN=function(files){
    read.table(files, header=TRUE, sep=",") # the edited files are comma-separated
  }))
write.csv(dataset,file='league_data.csv',row.names=FALSE) # in case you wish to refer to the data at a later date
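For readers who want to try the stacking step without the real downloads, here is a minimal, self-contained sketch. The two small files, their numbers, and the demo file names are made up for illustration. The pattern argument is an assumption on my part, based on the data files all starting with leagues_NBA_; it keeps other files in the directory from being read in by mistake.

```r
# Write two tiny stand-in files to a temporary directory (made-up numbers).
tmp <- file.path(tempdir(), "nba_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(Team = c("A", "B"), FGA = c(7000, 7200), SEASON = "1980-81"),
          file.path(tmp, "leagues_NBA_demo_1981.csv"), row.names = FALSE)
write.csv(data.frame(Team = c("A", "B"), FGA = c(6900, 7100), SEASON = "1981-82"),
          file.path(tmp, "leagues_NBA_demo_1982.csv"), row.names = FALSE)

# List only the per-season files, so anything else in the directory
# (for example a previously written league_data.csv) is skipped.
file_list <- list.files(tmp, pattern = "^leagues_NBA_", full.names = TRUE)

# Read every file and stack the rows; all files must share the same columns.
dataset <- do.call("rbind", lapply(file_list, read.csv, stringsAsFactors = FALSE))

print(dim(dataset))
```

Note that rbind() stops with an error if the files do not all have the same column names, which makes it a handy check that the edits were applied consistently.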
Now that the data are in order, we can load the additional libraries we’ll need for this project. We will use ggplot2 for plotting and plyr to manipulate the data for increased usability. We will also need the grid library for plotting, as well as the boot library for some additional statistical measures.
library(ggplot2)
library(plyr)
library(grid)
library(boot)
We can now load our data set into memory and create another column for it. We will call the additional column ‘pct3’. For each of the team records, pct3 will be the result of dividing the number of three point attempts by the total number of field goals attempted, giving us the share of shots taken as three pointers (as a proportion between 0 and 1). In addition, we will define the levels of the DECADE variable so that the decades are in order of occurrence. A subset of the data will be created containing the team name, season, decade, and the share of shots taken as three pointers.
data1 <- read.csv('league_data.csv', header=TRUE)
data1$pct3 <- data1$X3PA/data1$FGA # share of field goal attempts that were threes
data1$DECADE <- factor(data1$DECADE, levels=c('80s','90s','00s','10s')) # chronological order
data2 <- data1[,c('Team','SEASON','DECADE','pct3')]
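To make the arithmetic concrete, here is a small sketch on made-up totals for two hypothetical teams. The column names follow the ones used above; the numbers and the extra pct3_percent column are for illustration only.

```r
# Made-up team totals, using the same column names as the real data.
toy <- data.frame(Team = c("Team A", "Team B"),
                  X3PA = c(200, 1800),   # three point attempts
                  FGA  = c(7000, 7200))  # all field goal attempts

toy$pct3 <- toy$X3PA / toy$FGA                # a proportion between 0 and 1
toy$pct3_percent <- round(100 * toy$pct3, 1)  # the same value as a percentage

print(toy)
```

So a team with 1800 threes out of 7200 attempts has a pct3 of 0.25, or 25 percent of its shots.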
With our new subset of data, we will ascertain the mean value of pct3. We can do that using the ddply() function from plyr as well as the mean() function. We will also construct a helper function that, with the boot() function, will give us the percentile bootstrap quantiles at .025 and .975, effectively giving us a 95% confidence interval.
mymean <- function(x,i){
mean(x[i])
}
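To see what the helper and boot() produce together, here is a minimal sketch on made-up values standing in for one season. The numbers, the seed, and the choice of 2,000 resamples are assumptions for the example only.

```r
library(boot)

# Same helper as in the text: boot() passes the data and a vector of
# resampled indices, and the helper returns the mean of that resample.
mymean <- function(x, i) {
  mean(x[i])
}

set.seed(1)                                   # only so the sketch is repeatable
pct3_demo <- c(0.02, 0.03, 0.03, 0.04, 0.05,  # made-up values for one season
               0.05, 0.06, 0.08, 0.10, 0.15)

b <- boot(pct3_demo, mymean, R = 2000)        # 2000 resamples with replacement
ci <- quantile(b$t, c(0.025, 0.975))          # percentile limits of the resampled means

print(mean(pct3_demo))
print(ci)
```

The element b$t holds one resampled mean per replicate, and the two quantiles of those means are the lower and upper limits of the interval.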
Now, one might ask why go through all of this for confidence limits. Well, in my view assuming a normal distribution of the mean value for percentage of shots taken as three pointers is a bit shaky. Also, the league contains, at most, thirty (30) teams. It’s not a great deal of data to go off of. In the chart below, the data seem to bear this out. Using the original data frame…
# A histogram of all of the percentages of all teams in each season.
ggplot(data1,aes(x=pct3))+geom_histogram(color='red',fill='black')+facet_wrap(~SEASON,ncol=7)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
With this in mind, we will compute another data frame with the mean values of pct3 for each season as well as values for the .025 and .975 limits of the bootstrapped samples of each season. Each season is sampled 10,000 times with replacement.
data3 <- ddply(data2, .(SEASON, DECADE), summarize,
               avg3  = mean(pct3),
               ci025 = quantile(boot(pct3, mymean, 10000)$t, .025),
               ci975 = quantile(boot(pct3, mymean, 10000)$t, .975))
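The call above runs boot() twice per season, so the two limits come from separate sets of resamples. A possible variation, which is not what was used for the chart here, is to pass ddply() a function that bootstraps once per season and takes both limits from the same run. The sketch below uses a made-up stand-in for data2 and fewer resamples to keep it quick.

```r
library(plyr)
library(boot)

mymean <- function(x, i) mean(x[i])

# Made-up stand-in for data2: two seasons, five teams each.
demo <- data.frame(
  SEASON = rep(c("1980-81", "2014-15"), each = 5),
  DECADE = rep(c("80s", "10s"), each = 5),
  pct3   = c(0.01, 0.02, 0.02, 0.03, 0.04, 0.20, 0.25, 0.27, 0.30, 0.35)
)

set.seed(1)  # only so the sketch is repeatable
demo_ci <- ddply(demo, .(SEASON, DECADE), function(d) {
  b <- boot(d$pct3, mymean, R = 2000)   # one bootstrap run per season
  q <- quantile(b$t, c(0.025, 0.975))   # both limits from the same run
  data.frame(avg3 = mean(d$pct3), ci025 = q[[1]], ci975 = q[[2]])
})

print(demo_ci)
```

With 10,000 resamples the difference between the two approaches should be small, but a single run halves the computing time.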
Using this data set, we can prepare to chart the data.
ggplot(data=data3, aes(x=SEASON, y=avg3, color=DECADE)) +
  geom_pointrange(aes(ymin=ci025, ymax=ci975), size=.5, fill='white', shape=21) +
  theme_bw() +
  theme(axis.text.x=element_text(angle=90, color='black')) +
  scale_y_continuous(name='Percentage of shots taken as threes') +
  scale_color_discrete(breaks=c('80s','90s','00s','10s')) # DECADE is mapped to color, not fill
One thing you’ll note rather quickly is the group of three data points that stand out in the middle of the graph. They represent the three seasons (1994-95 through 1996-97) in which the three point line was brought in to 22 feet to help stimulate scoring. After those three seasons, the line was pushed back to 23 feet 9 inches except in the corners of the court, where it remains at 22 feet.
In addition, you can see that the intervals within seasons have widened over time: teams now differ more from one another in how much they use the three point shot than they did when the rule was first instituted.
Except for those three years you will also notice that, since the mid-1980s, the percentage of shots taken as three pointers has risen somewhat regularly. In fact, the rate of change across those three points seems to be similar to the trend as a whole.
Of course, you’ll never hear this during a typical NBA broadcast. At best, some commentator will refer to how the game has “evolved” without ever really discussing the revolution. You’ll get some off the cuff statement regarding trade-offs between two point and three point shots. Even articles praising the use of statistical information in basketball can get it wrong. This article refers to Charles Barkley’s era as being of slower pace and with emphasis on the individual over the team. I don’t have data on the “individual” issue, but a cursory review of data on basketball-reference.com shows that the take on pace is completely wrong. Pace is measured as the number of possessions a team has per 48 minutes. Last season, 2014-15, pace for the league was estimated at 93.9. During Charles Barkley’s MVP season, 1992-93, pace was measured at 96.8. As of today’s writing, 2/29/16, pace for the league is measured at 95.7, still lower than in 1992-93. Even during Barkley’s retirement season of 1999-00, the end of his era, pace was at 93.1, which is comparable to last season’s level.
The point I’m trying to make is that, though there is much information to distill when discussing the league, it does not take that much more effort to use available historical data to inform the public on changes in the NBA. A more informed perspective leads to a better served public, dispelling myths and stereotypes regarding the NBA and its players.