Behind the Dime: a statistical analysis of the basketball assist

Section I: Introduction
- Research Question
- Variables
Section II: Exploratory Data Analysis
- Univariate analysis
- Bivariate Analysis
Section III: Simple Linear Regression
- Assists & Turnovers
- Assists & Position
Section IV: Multiple Linear Regression
Section V: Hypothesis Testing
- Analysis of Hypothesis Test
Section VI: Conclusion
Honor Code

Section I: Introduction

My project draws on the NBAPlayerStatistics0910 dataset, which is located in the SportsAnalytics package. This dataset provides 441 observations on 25 variables. The 441 observations are all registered players in the National Basketball Association, referred to here as the NBA. The 25 variables include organizational information such as “Player name” and “Player Position,” as well as performance information such as “Total minutes played” and “Field goals attempted.”

Research Question

My research question is the following:

“What are the relationships between a player’s number of assists in a season and other aspects of their gameplay?”

“Assists” is my response variable. NBA.com defines the assist as “a pass that leads directly to a basket.” Great assisters are essential to any good offense. They typically have excellent ball handling skills and court vision as well as a strong understanding of their teammates’ abilities. I chose to focus on the assist because it has remained a consistent part of basketball even as other aspects of the game have changed (for example, the increasing prevalence of the 3-point shot). Therefore, my analysis of the assist should be better able to reveal truths about the game of basketball as a whole, not just basketball as it was played in the 2009-10 season.

Variables

The variables I will be analyzing in relation to assists are “FieldGoalsMade,” “Position,” “Turnovers,” and “PersonalFouls.”

FieldGoalsMade

“FieldGoalsMade” is a numeric variable that uses whole numbers to measure the number of field goals that a player scores in a season. A field goal is a basket scored on any shot or tap other than a free throw. The relationship between assists and field goals is clear: an assist always leads to a field goal. However, this is not to say that players who get lots of assists score lots of points. In fact, there is reason to believe that the opposite is true. Players who pass and dribble frequently, such as point guards (described below), often score less than their teammates because their job is to help the players around them score. By analyzing the relationship between field goals made and assists, we can examine whether good passers are typically good scorers, bad scorers, or neither.

Position

“Position” is a categorical variable. A player’s position refers to his role and/or typical location on the court. The five standard positions are point guard (pg), shooting guard (sg), small forward (sf), power forward (pf), and center (c).

The point guard and shooting guard primarily operate outside of and around the three-point line. The point guard leads the offense. He is almost always the principal ball handler and is responsible for setting up his teammates to score. I predict that the point guards will have the highest number of assists.

The shooting guard’s primary goal is to score. He does this by moving around the perimeter of the court and taking shots from beyond the 3-point line. I predict that the shooting guards will have an average number of assists.

The small and power forward travel around the court, frequently streaking towards the basket. They are the most versatile players on the court. I predict that the small and power forward will have an average number of assists.

The center stays almost exclusively around the basket or the “paint” area. He is typically bigger and stronger than his teammates and is a key player on defense. I predict that the center will have the lowest number of assists.

By analyzing each position in relation to its number of assists, we’ll be able to figure out which positions pass the most/least on average and from there, make inferences about general offensive structure in the NBA.

Turnovers

“Turnovers” is a numeric variable that uses whole numbers to measure the number of turnovers that a player commits in a season. A turnover occurs when a team loses possession of the ball to the opposing team before a player takes a shot at his team’s basket. Assists are related to turnovers because a pass that is intercepted is considered to be a turnover. So there could be a negative relationship between assists and turnovers because bad passers turn the ball over often. Alternatively, there could be a positive relationship between assists and turnovers because players who are good passers typically pass the ball often and therefore may have more chances to commit turnovers than their teammates. By analyzing the relationship between the two variables, we will be able to determine which of these two hypotheses is correct, or if there is no correlation.

Steals

Used only in Section IV and beyond

“Steals” is a numeric variable that uses whole numbers to measure the number of steals that a player makes in a season. A steal occurs when a player on defense takes control of the basketball from the opposing team, putting his own team on offense. Steals are related to assists because, for reasons theorized in Section III, players who make lots of steals may also make lots of assists. Examining the relationship between the two will allow us to make inferences about differences between players and positions.

ThreesAttempted

Used only in Section IV and beyond

“ThreesAttempted” is a numeric variable that uses whole numbers to measure the number of three-pointers that a player attempts in a season. A three-point shot is a shot taken from behind the three-point line, an arc that makes a semicircle around the basket. The number of threes attempted is related to assists because, for reasons theorized in Section III, players who attempt lots of threes may also make a lot of assists. Examining the relationship between the two will allow us to make inferences about differences between players and positions.

OffensiveRebounds

Used only in Section IV and beyond

“OffensiveRebounds” is a numeric variable that uses whole numbers to measure the number of offensive rebounds that a player collects in a season. An offensive rebound occurs when a player grabs the ball after a missed shot has hit the rim or the backboard of the hoop; the rebounder must be on the same team as the player who just missed the shot. The number of offensive rebounds is related to assists because, for reasons theorized in Section III, players who get lots of offensive rebounds may not make a lot of assists. Examining the relationship between the two will allow us to make inferences about differences between players and positions.

TotalMinutesPlayed

Used only in Section V and beyond

“TotalMinutesPlayed” is a numeric variable that uses whole numbers to measure the number of minutes that a player is on the court for the entire season. TotalMinutesPlayed, or “playing time,” can be used as an indicator of a player’s importance to their team, and also of their overall skill, since the NBA is such a competitive league. Playing time is important to my project because I attribute much of the variance in my data to differences in playing time.

Section II: Exploratory Data Analysis

Univariate analysis

Assists

summary(NBAPlayerStatistics0910$Assists)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    21.0    72.0   118.5   161.0   892.0

The summary of Assists tells us that the average number of assists in a season is around 118, and that the maximum is 892. Already, we can see from the distance between these two measurements that there is a large difference in assists between the average NBA player and the best passer(s).

hist(NBAPlayerStatistics0910$Assists, main="Histogram 1.1, Assists", xlab="Number of Assists", ylab="Number of Players", xlim = c(0, 1000), breaks=5)

Histogram 1.1 confirms what was suspected from the data’s summary: the majority of the NBA (over 300 players out of 441) recorded fewer than 200 assists in a season. From this, we can infer that passing is a specialized skill in the NBA. If passing is specialized, we would expect one of the positions to have drastically higher average assists than the others. Earlier I predicted that this position would be the point guard. When I present the relationship between Assists and Position we will be able to see if my hypothesis was correct.

FieldGoalsMade

summary(NBAPlayerStatistics0910$FieldGoalsMade)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    56.0   171.0   210.3   314.0   794.0

The summary of FieldGoalsMade tells a story similar to that of Assists. There is a large difference between the mean number of field goals made (210) and the maximum (794). Those who follow basketball, however, might not be surprised by this difference because they know that certain players often dominate point-scoring on teams by consistently scoring around 30 points per game, which is roughly 10-15 field goals. If you take 12.5 (the midpoint of 10 and 15) and multiply it by the average number of games played in the NBA (~56), then the result is roughly 700, which is fairly close to the max of 794.

mean(NBAPlayerStatistics0910$GamesPlayed)

## [1] 56.25397

12.5*56.25397

## [1] 703.1746

Therefore, the high disparity between average field goals scored and maximum field goals scored can be explained by the existence of offensive stars, or players who are much better at scoring than their counterparts.

hist(NBAPlayerStatistics0910$FieldGoalsMade, main="Histogram 1.2, Field Goals Scored", xlab="Number Field Goals Scored", ylab="Number of Players", breaks=5)

This histogram reveals a slightly different story than Histogram 1.1. For starters, the transition between ~200 field goals and ~700 field goals is much more gradual, as opposed to the harsh drop after ~200 assists in Histogram 1.1. The more gradual decline could mean that the disparity between average and maximum field goals made is simply due to differences in offensive ability, rather than specialization, as could be the case with assists. We will be able to make more inferences once we begin bivariate analysis.

Position

table(NBAPlayerStatistics0910$Position)

## 
##  C PF PG SF SG 
## 85 88 96 90 81

The tabular breakdown of players’ positions reveals a fairly even split between the 5 positions. The point guard position holds the most players, at 96, and the shooting guard position has the fewest players, at 81. It should be considered, however, that many players can perform multiple roles on the court. Point guards and small forwards are often good shooters, which could account for their higher prevalence.

Below is a visual representation of the table.

plot(NBAPlayerStatistics0910$Position, main="Bar Plot 1.3, Position", xlab="Positions", ylab="Number of Players", ylim=c(0, 100))
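To put numbers on how even the split is, the counts from the table can be converted to percentages. This is only a sketch: the counts are typed in by hand from the table() output above rather than read from the dataset, and `counts` is a name I am introducing for illustration.

```r
# Position counts copied by hand from the table() output above
counts <- c(C = 85, PF = 88, PG = 96, SF = 90, SG = 81)

# Total number of players with a recorded position
total <- sum(counts)
total

# Share of players at each position, as a percentage
shares <- round(100 * counts / total, 1)
shares
```

The counts sum to 440 rather than 441, which is consistent with the "1 observation deleted due to missingness" note that appears later in the Assists vs Position regression output.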

Turnovers

summary(NBAPlayerStatistics0910$Turnovers)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   22.00   63.00   75.65  112.00  295.00

Just as with Assists and FieldGoalsMade, the summary reveals a non-normal distribution of data, with the maximum number of turnovers more than doubling the third quartile. However, we still don’t know enough about the players to determine who these players are or why they are turning the ball over much more often than their teammates. Once we move into bivariate analysis we will be able to explore the relationship between Turnovers and Assists, and make inferences about who is turning the ball over more, good passers or bad passers.

hist(NBAPlayerStatistics0910$Turnovers, main="Histogram 1.4, Turnovers", xlab="Number of Turnovers", ylab="Number of Players", ylim=c(0, 200))

The descent from the mean of around 75 turnovers to the maximum of around 300 turnovers is much more gradual than those of Histograms 1.1 and 1.2. The more gradual descent could be explained by the nature of turnovers contrasted with assists and field goals. Teams want to make assists and score field goals, so they specialize players to do so or encourage the development of offensive stars. Turnovers, on the other hand, are equally undesirable to all players/positions, so there is no specialization or development of star turnover players.

Bivariate Analysis

Assists & FieldGoalsMade

plot(NBAPlayerStatistics0910$FieldGoalsMade, NBAPlayerStatistics0910$Assists, main="Scatterplot 2.1, Assists vs Field Goals", xlab="Number of Field Goals", ylab="Number of Assists")

The central tendency of Scatterplot 2.1 is located in the bottom left and mostly consists of players who make fewer than 50 field goals and fewer than 100 assists. Players who make fewer than 50 field goals in an entire season are almost always primarily bench players who are only put in the game to help out the starters or to give them a break. These players skew the data because they don’t perform the same roles as the starters and the more important bench players. Because of this, my analysis will mostly ignore players with fewer than 100 assists in order to more accurately make inferences about the game of basketball.

Perhaps the most important take-away from this scatterplot is that the relationship between Assists and FieldGoalsMade is generally positive, meaning that as field goals scored increase, so do assists.

Despite the positive trend, the best passers weren’t necessarily the best scorers, and vice versa. The best passers were mainly concentrated in the 100-500 scoring range and the best scorers were mainly concentrated in the 200-600 assist range. Discounting the existence of the primarily bench players, we can conclude that the best passers are generally only average to above average at scoring.

Assists & Position

boxplot(NBAPlayerStatistics0910$Assists~NBAPlayerStatistics0910$Position, main="Boxplot 2.2, Assists vs Position", xlab="Position", ylab="Number of Assists")

As we can see from above, the point guard is clearly the primary passer, followed by the shooting guard, small forward, power forward and center. I accurately predicted that the point guard would have the most assists and that the center would have the least. But what’s interesting is the differences between the other three positions. The shooting guard has noticeably more assists than the forwards, even though forwards are typically more versatile and athletic. This could be because the forwards are constantly moving around the court, giving shooting guards on the perimeter opportunities to rack up assists. An evaluation of the relationship between FieldGoalsMade and Position could help clarify the relationship between forwards and shooting guards.

FieldGoalsMade & Position

boxplot(NBAPlayerStatistics0910$FieldGoalsMade~NBAPlayerStatistics0910$Position, main="Boxplot 2.3, FieldGoalsMade vs Position", xlab="Position", ylab="Number of Field Goals Made")

This boxplot, unlike Boxplot 2.2, doesn’t really distinguish the positions from each other. It must be noted, however, that shooting guards mainly shoot 3 pointers, so their offensive impact is probably greater than expressed here. Still, the data presented above suggests that most positions make about the same number of field goals. So if the guards make notably more assists than the forwards and they score just as much, why even have forwards?

Forwards probably have more of an impact on the defensive side of the court (an observation of Blocks vs Position might offer some insight into this). On offense, the guards take control, with notably greater numbers of assists and marginally greater field goals made.

Assists & Turnovers

plot(NBAPlayerStatistics0910$Turnovers, NBAPlayerStatistics0910$Assists, main="Scatterplot 2.4, Assists vs Turnovers", xlab="Number of Turnovers", ylab="Number of Assists")

The central tendency of this scatterplot, like that of Scatterplot 2.1, is located in the bottom left and consists mostly of players who make fewer than 100 assists, which is less than the mean of 118.5. This group also has a fairly wide range of turnovers, between 0 and 100, which surpasses the mean of 75.65. In contrast, the group in the 0-100 assist range in “Scatterplot 2.1, Assists vs FieldGoalsMade,” which I classified as “primarily bench players,” almost all made close to 0 field goals. From this, we can infer that primarily bench players typically make 0-100 assists and turn the ball over 0-100 times, but do not make very many field goals. The significant numbers of assists and turnovers compared to the insignificant number of field goals can be explained by the primarily bench players’ nature of play. Primarily bench players, when they’re on the court, are not expected to shoot the basketball as often as the more important bench players or the starters. They are also not as skilled as the other two groups of players, and therefore are more likely to commit turnovers. As stated earlier, because of their differing style of play I will discount the primarily bench players from the next part of my analysis.

This scatterplot has an even more distinct positive relationship than Scatterplot 2.1, meaning that as a player’s assists increase, so does the number of turnovers that he commits. This relationship supports my hypothesis that players who are good passers typically pass the ball often and therefore may have more chances to commit turnovers than their teammates. My alternative hypothesis, that bad passers turn the ball over more often, is not supported. In fact, this scatterplot’s positive relationship suggests that a player’s number of turnovers says little about his skill as a passer. This is visible even in the outliers: the best passer, with 892 assists, also committed the most turnovers, 295.

Given this relationship, we could predict that the position that completed the most assists, the point guard, also made the most turnovers. An analysis of Turnovers vs Position would reveal the truth of this hypothesis.

Turnovers & Position

boxplot(NBAPlayerStatistics0910$Turnovers~NBAPlayerStatistics0910$Position, main="Boxplot 2.5, Turnovers vs Position", xlab="Position", ylab="Turnovers")

As shown, point guards had by far the highest number of turnovers, despite their superior ball handling and passing abilities. This suggests that the more a player handles the ball, regardless of his skill level, the more turnovers he will commit.

Section III: Simple Linear Regression

In this section, I will present a regression analysis of the relationship between Assists and two other variables: Turnovers and Position.

Assists & Turnovers

As stated earlier, a turnover occurs when a team loses possession of the ball to the opposing team before a player takes a shot at his team’s basket. Turnovers are related to assists because a pass that is intercepted by the opposing team is a turnover. In the univariate analysis of Turnovers we determined that the mean number of turnovers is 75.65, and that the transition from the mean to the max. was much more gradual than those of Assists and FieldGoalsMade because teams don’t specialize turnovers in the same way they do passing and scoring. In the bivariate analysis of Assists & Turnovers we saw that there is a distinctly positive relationship between assists and turnovers. I also pointed out the existence of the primarily bench players, who make up the central tendency of data in the relationship because they all share the same characteristics. With the analysis of Turnovers and Position, we were able to determine that good passers commit the most turnovers because they handle the ball more often than their teammates. In the upcoming regression analysis, I expect to find a regression line with a positive slope. I also expect that my R-squared value will indicate that my model is fairly good.

lm(NBAPlayerStatistics0910$Assists~NBAPlayerStatistics0910$Turnovers)

## 
## Call:
## lm(formula = NBAPlayerStatistics0910$Assists ~ NBAPlayerStatistics0910$Turnovers)
## 
## Coefficients:
##                       (Intercept)  NBAPlayerStatistics0910$Turnovers  
##                           -17.204                              1.794

Just as I predicted, the linear model provides a regression line with a positive slope. This means that, on average, for every additional turnover a player commits, he will also make roughly 1.8 assists. The intercept is most likely negative due to the primarily bench players, who mostly had very low numbers of assists and turnovers. Below is the regression line drawn over Scatterplot 2.4.

plot(NBAPlayerStatistics0910$Assists~NBAPlayerStatistics0910$Turnovers, main="Scatterplot w/ Regression Line 3.1, Assists vs Turnovers", xlab="Number of Turnovers", ylab="Number of Assists")
line1 <- lm(NBAPlayerStatistics0910$Assists~NBAPlayerStatistics0910$Turnovers)
abline(line1)
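As a quick sanity check on the fitted line, the coefficients can be plugged in by hand. The sketch below copies the rounded intercept and slope from the lm() output above; `predict_assists` is a helper name I made up for illustration, not something from the SportsAnalytics package.

```r
# Coefficients copied (rounded) from the lm() output above
intercept <- -17.204
slope <- 1.794

# Hypothetical helper: predicted assists for a given number of turnovers
predict_assists <- function(turnovers) {
  intercept + slope * turnovers
}

# Prediction at the mean number of turnovers (75.65)
at_mean <- predict_assists(75.65)
at_mean

# Predictions for a few other turnover totals
predict_assists(c(50, 100, 200))
```

A least-squares line always passes through the point (mean of x, mean of y), so the prediction at the mean number of turnovers should land very close to the mean number of assists, 118.5.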

Now we can take a look at the summary of the model to evaluate its “goodness.”

summary(lm(NBAPlayerStatistics0910$Assists~NBAPlayerStatistics0910$Turnovers))

## 
## Call:
## lm(formula = NBAPlayerStatistics0910$Assists ~ NBAPlayerStatistics0910$Turnovers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -330.33  -35.26    3.93   21.23  391.39 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)                       -17.20366    5.81636  -2.958  0.00327
## NBAPlayerStatistics0910$Turnovers   1.79392    0.05917  30.317  < 2e-16
##                                      
## (Intercept)                       ** 
## NBAPlayerStatistics0910$Turnovers ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 77.99 on 439 degrees of freedom
## Multiple R-squared:  0.6768, Adjusted R-squared:  0.676 
## F-statistic: 919.1 on 1 and 439 DF,  p-value: < 2.2e-16

R-squared gives the proportion of the variation of the response variable (Assists) that is being explained by the independent variable (Turnovers). In other words, R-squared measures the “goodness” of my model. The closer R-squared is to 1, the better the model is. My R-squared value was 0.677. The limited accuracy of my model could be explained by differences between primarily bench players and more important bench players and starters. There is low variance among the primarily bench players because they aren’t even on the court most of the time, and therefore do not have the chance to make assists/commit turnovers. On the other hand, the more important bench players and starters have the playing time necessary to provide solid data, but since basketball players all have different roles and playing styles, there is a high variance, which affects the accuracy of the model. Therefore the presence of the primarily bench players probably drives up the R-squared value. Paradoxically, if primarily bench players were excluded from this dataset, there might be an even lower R-squared value, even though I would consider my model to be more accurate.

I believe that my model is slightly less “good” than the R-squared value suggests. But paired with my analysis of “Boxplot 2.5, Turnovers vs Position,” I hold it as sufficient to support my statement that the more a player handles the ball, regardless of his skill level, the more turnovers he will commit.
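One way to read the R-squared value: in a simple linear regression with a single predictor, R-squared equals the square of the correlation between the two variables. The sketch below works backward from the reported values (0.6768 and the positive slope of 1.794), so it is a rough illustration rather than a recomputation from the data.

```r
# Values copied from the summary() output above
r_squared <- 0.6768
slope <- 1.794

# With one predictor, the correlation is the signed square root of R-squared
r <- sign(slope) * sqrt(r_squared)
round(r, 3)

# Share of the variation in Assists left unexplained by Turnovers
unexplained <- 1 - r_squared
unexplained
```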

Assists & Position

A player’s position refers to his role and/or typical location on the court. The five standard positions are point guard (pg), shooting guard (sg), small forward (sf), power forward (pf), and center (c). In univariate analysis we determined that the number of players at each position is relatively even, which is good because it allows us to better view the differences between them. In bivariate analysis we saw that the point guard, as I predicted, had by far more assists than any other position, followed by the shooting guard, the forwards, and the center. The reason behind the shooting guard’s higher number of assists was unclear, so I analyzed the relationship between FieldGoalsMade and Position and was able to infer that the shooting guard is just generally more of an offensive presence on the court than the forwards and the center, accounting for differences in assists and points scored. In the upcoming regression analysis, I expect that the mean assist values of the 5 positions will be highest for point guards and lowest for centers, and that the R-squared value will be low.

lm(NBAPlayerStatistics0910$Assists~factor(NBAPlayerStatistics0910$Position))

## 
## Call:
## lm(formula = NBAPlayerStatistics0910$Assists ~ factor(NBAPlayerStatistics0910$Position))
## 
## Coefficients:
##                                (Intercept)  
##                                      57.72  
## factor(NBAPlayerStatistics0910$Position)PF  
##                                      12.67  
## factor(NBAPlayerStatistics0910$Position)PG  
##                                     183.52  
## factor(NBAPlayerStatistics0910$Position)SF  
##                                      32.37  
## factor(NBAPlayerStatistics0910$Position)SG  
##                                      64.44

This model supports my analysis of “Boxplot 2.2, Assists vs Position” because it provides averages to numerically support what before I was only eyeballing. The intercept represents the lowest average number of assists, belonging to the center. The number listed for each other position is the value that you add to the intercept to reach the average number of assists for that position. For example, you would add 64.44 to 57.72 to reach the average number of assists for the shooting guard position (122.16). This value is above the mean number of assists for all positions, which is 118.5, supporting my earlier statement that guards are more involved on the court offensively than the forwards. Also, my earlier prediction that point guards have the greatest average number of assists and centers have the lowest was true. Point guards lead the pack with 241.24 assists, compared with the centers’ measly 57.72. Below is “Boxplot 2.2, Assists vs Position” so that we can once again see the differences between the positions.

boxplot(NBAPlayerStatistics0910$Assists~NBAPlayerStatistics0910$Position, main="Boxplot 2.2, Assists vs Position", xlab="Position", ylab="Number of Assists")
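The add-to-the-intercept arithmetic can be done for every position at once. This is a sketch using the coefficients copied from the lm() output above; the center is the baseline level, so its offset is 0. The names `offsets` and `position_means` are mine, introduced only for illustration.

```r
# Coefficients copied from the lm() output above; C is the baseline level
intercept <- 57.72
offsets <- c(C = 0, PF = 12.67, PG = 183.52, SF = 32.37, SG = 64.44)

# Average assists per position = intercept + that position's coefficient
position_means <- intercept + offsets

# Positions ordered from most to fewest average assists
sort(position_means, decreasing = TRUE)
```

Sorting the result gives the same ordering seen in the boxplot: point guard first, then shooting guard, small forward, power forward, and center.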

Now we can take a look at the summary of the model to evaluate its “goodness.”

summary(lm(NBAPlayerStatistics0910$Assists~factor(NBAPlayerStatistics0910$Position)))

## 
## Call:
## lm(formula = NBAPlayerStatistics0910$Assists ~ factor(NBAPlayerStatistics0910$Position))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -239.24  -58.11  -26.55   47.44  650.76 
## 
## Coefficients:
##                                            Estimate Std. Error t value
## (Intercept)                                   57.72      12.96   4.454
## factor(NBAPlayerStatistics0910$Position)PF    12.67      18.17   0.697
## factor(NBAPlayerStatistics0910$Position)PG   183.52      17.79  10.315
## factor(NBAPlayerStatistics0910$Position)SF    32.37      18.07   1.792
## factor(NBAPlayerStatistics0910$Position)SG    64.44      18.55   3.474
##                                            Pr(>|t|)    
## (Intercept)                                1.07e-05 ***
## factor(NBAPlayerStatistics0910$Position)PF 0.485983    
## factor(NBAPlayerStatistics0910$Position)PG  < 2e-16 ***
## factor(NBAPlayerStatistics0910$Position)SF 0.073896 .  
## factor(NBAPlayerStatistics0910$Position)SG 0.000564 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 119.5 on 435 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.2471, Adjusted R-squared:  0.2402 
## F-statistic: 35.69 on 4 and 435 DF,  p-value: < 2.2e-16

As I predicted, the R-squared of my model is low, although admittedly, I didn’t think it was going to be THAT low. I believe that the reason for the low value is the large range of assists within each position. Take, for example, the small forward position. The dataset has 90 small forwards. Many of these small forwards are primarily bench players, who are only making 0-100 assists. Some more are primarily defensive players, who probably make no more than 200 assists per season. And a few are most likely offensive stars (see Boxplot 2.3 to see the outliers) who make up to 500 or 600 assists and drive up the variance of the model. To summarize, the reason for the large range of assists is a combination of differences in playing time and in playing style. If the primarily bench players were excluded from the data, the R-squared value would probably go up, but it would still be pretty low due to differences between players.

Overall, I would still vouch for the reliability of this model because the low R-squared value appears to be natural and because the results of the model are consistent with what I know from my years of experience with the sport of basketball.

Section IV: Multiple Linear Regression

Below are a couple of interesting multiple linear regression models of my dataset. I’ve decided to start using the syntax “data = NBAPlayerStatistics0910” at the end of my code because I’ll be dealing with more variables now.

Assists vs ThreesAttempted and ThreesMade

summary(lm(Assists ~ ThreesAttempted + ThreesMade, data = NBAPlayerStatistics0910))

## 
## Call:
## lm(formula = Assists ~ ThreesAttempted + ThreesMade, data = NBAPlayerStatistics0910)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -237.62  -55.44  -34.63   34.38  694.20 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      52.0450     7.2912   7.138 3.96e-12 ***
## ThreesAttempted   1.4298     0.3285   4.352 1.68e-05 ***
## ThreesMade       -2.1800     0.8720  -2.500   0.0128 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 114.4 on 438 degrees of freedom
## Multiple R-squared:  0.3061, Adjusted R-squared:  0.303 
## F-statistic: 96.62 on 2 and 438 DF,  p-value: < 2.2e-16

Assists vs Steals, ThreesAttempted and Offensive Rebounds

summary(lm(Assists ~ Steals + ThreesAttempted + OffensiveRebounds, data=NBAPlayerStatistics0910))

## 
## Call:
## lm(formula = Assists ~ Steals + ThreesAttempted + OffensiveRebounds, 
##     data = NBAPlayerStatistics0910)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -202.07  -34.16   -8.76   15.63  748.69 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.89236    6.99433   0.842    0.400    
## Steals             3.23599    0.18367  17.619  < 2e-16 ***
## ThreesAttempted    0.04830    0.04783   1.010    0.313    
## OffensiveRebounds -0.36896    0.08046  -4.586 5.92e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 85.56 on 437 degrees of freedom
## Multiple R-squared:  0.6127, Adjusted R-squared:   0.61 
## F-statistic: 230.4 on 3 and 437 DF,  p-value: < 2.2e-16

Model Analysis

I’ve chosen to analyze further the above multiple linear regression, “Assists vs Steals, ThreesAttempted and Offensive Rebounds,” because it allows us to further examine the relationship between Assists and Position without having to use the linear model of Assists vs Position, which is hard to draw conclusions from for reasons discussed in the last section. Steals, ThreesAttempted and OffensiveRebounds are all variables that vary heavily between positions. Therefore, statistically significant relationships between Assists and each of these variables can help us further understand the differences between positions, and by extension, the game of basketball.

Steals

H0: beta(Steals) = 0

H1: beta(Steals) ≠ 0

Null hypothesis: the beta of Steals is equal to zero

Alternative hypothesis: the beta of Steals is not equal to zero

In the model, Steals has a beta of 3.23599 and a P-value, Pr(>|t|), of less than 2e-16. A result is statistically significant when its P-value is less than the level of significance, which is .05 for this project. Since the P-value is far below .05, we can reject the Null Hypothesis and conclude that beta(Steals) is different from zero.
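As a rough check on these numbers, the t value and two-sided P-value for Steals can be recomputed from the estimate and standard error printed in the summary. This is a minimal sketch that uses only the rounded values from the output above and the 437 residual degrees of freedom; the variable names are my own and are not part of the dataset.

```r
# Values copied from the model summary above (rounded as printed)
est_steals <- 3.23599   # coefficient estimate for Steals
se_steals  <- 0.18367   # standard error of that estimate
df_resid   <- 437       # residual degrees of freedom

# t statistic: estimate divided by its standard error
t_steals <- est_steals / se_steals

# Two-sided P-value from the t distribution
p_steals <- 2 * pt(-abs(t_steals), df = df_resid)

# Approximate 95% confidence interval for beta(Steals)
ci_steals <- est_steals + c(-1, 1) * qt(0.975, df_resid) * se_steals

print(t_steals)
print(p_steals < 0.05)
print(ci_steals)
```

If the interval excludes zero, it agrees with the decision to reject the Null Hypothesis at the .05 level.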

Now we know that for every 1 unit increase in Steals, Assists increase by 3.23599 on average, holding ThreesAttempted and OffensiveRebounds constant. This value makes sense because players in the position that tends to make the most assists, point guards, almost always guard other point guards on defense. And point guards, as we know, handle the ball more than almost any other player on the court. So whoever is guarding the point guards will tend to get the most steals from opportunity alone. Additionally, point guards are quicker and more agile than most of their teammates, which gives them a natural advantage in stealing the ball.

Below is a boxplot of Steals vs Position

boxplot(Steals ~ factor(Position), data = NBAPlayerStatistics0910, main="Boxplot 3.1, Steals vs Position", xlab="Position", ylab="Number of Steals")

As we can see, the number of steals appears to be highest for the point guard and lowest for the center. This supports my statement that the position that typically guards the opposing team’s ball handlers, the point guard, gets the most steals. Additionally, the position that is typically larger and less agile, the center, gets the fewest steals.

ThreesAttempted

H0: beta(ThreesAttempted) = 0

H1: beta(ThreesAttempted) ≠ 0

Null hypothesis: the beta of ThreesAttempted is equal to zero

Alternative hypothesis: the beta of ThreesAttempted is not equal to zero

In the model, ThreesAttempted has a beta of 0.04830 and a P-value of 0.313. Although beta(ThreesAttempted) is greater than zero, the P-value is greater than .05, so we cannot reject the Null Hypothesis.

I was surprised by the absence of a statistically significant relationship between Assists and ThreesAttempted in the model. Point guards mostly stay outside of the 3 point line in order to have more space to dribble and pass. I thought that since point guards stay outside of the three point line, they should attempt more threes and therefore there would be a positive relationship between Assists and ThreesAttempted. This model does not support that hypothesis.

Let it be noted, however, that the lack of a statistically significant relationship between Assists and ThreesAttempted could be specific to this model. In the other multiple linear regression model that I presented, “Assists vs ThreesAttempted and ThreesMade,” ThreesAttempted has a beta of 1.4298 with a P-value of 1.68e-05. A predictor’s coefficient depends on which other predictors are in the model, so it is still possible that the relationship exists.

OffensiveRebounds

H0: beta(OffensiveRebounds) = 0

H1: beta(OffensiveRebounds) ≠ 0

Null hypothesis: the beta of OffensiveRebounds is equal to zero

Alternative hypothesis: the beta of OffensiveRebounds is not equal to zero

In the model, OffensiveRebounds has a beta of -0.36896 and a Pr(>|t|) of 5.92e-06. The P-value is less than .05, so we can reject the Null Hypothesis; the estimated beta(OffensiveRebounds) is less than zero.

Now we know that for every 1 unit increase in OffensiveRebounds, there is a 0.36896 decrease in Assists on average, holding Steals and ThreesAttempted constant. This relationship makes sense for the same reasons that I expected the relationship between ThreesAttempted and Assists to make sense. Point guards typically play on the perimeter of the court, far away from the rim. Because of this, they don’t have great positioning to get offensive rebounds.
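To make the sign of this coefficient concrete, the fitted equation can be written out by hand using the estimates from the summary above. This is only an illustrative sketch: `predict_assists` is a hypothetical helper, and the two players compared are made-up season totals, not rows from the dataset.

```r
# Coefficients copied from the model summary above
coefs <- c(intercept = 5.89236,
           steals = 3.23599,
           threes_attempted = 0.04830,
           offensive_rebounds = -0.36896)

# Hypothetical helper: fitted Assists for a given set of season totals
predict_assists <- function(steals, threes_attempted, offensive_rebounds) {
  unname(coefs["intercept"] +
           coefs["steals"] * steals +
           coefs["threes_attempted"] * threes_attempted +
           coefs["offensive_rebounds"] * offensive_rebounds)
}

# Two made-up players who differ by exactly one offensive rebound
base     <- predict_assists(steals = 50, threes_attempted = 100, offensive_rebounds = 40)
plus_one <- predict_assists(steals = 50, threes_attempted = 100, offensive_rebounds = 41)

# The difference equals the OffensiveRebounds coefficient
print(plus_one - base)
```

The difference is negative, which is what a negative beta means: one more offensive rebound goes with fewer predicted assists when the other two predictors are held fixed.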

Below is a boxplot of OffensiveRebounds vs Position

boxplot(OffensiveRebounds ~ factor(Position), data = NBAPlayerStatistics0910, main="Boxplot 3.2, OffensiveRebounds vs Position", xlab="Position", ylab="Number of Offensive Rebounds")

As we can see, the center and the power forward appear to have much higher numbers of offensive rebounds than positions that typically operate behind the 3 point line, such as the point guard and the shooting guard.

Model Assessment

My model had a fairly good R^2 value of 0.6127. The good value could be due in part to the presence of multiple predictors, although the adjusted R^2 of 0.61 is nearly identical. I believe that this model supports my earlier hypothesis that point guards record the most assists and centers record the fewest, because there were statistically significant relationships between Assists and Steals and between Assists and OffensiveRebounds. Those two predictors both vary greatly across positions.
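The adjusted R^2 and the F-statistic in the summary both follow from R^2, the number of observations and the number of predictors. The sketch below recomputes them from the rounded values printed above, assuming n = 441 players and p = 3 predictors, so small rounding differences from the output are expected.

```r
# Values taken from the model summary above
r2 <- 0.6127   # multiple R-squared
n  <- 441      # observations (437 residual df + 3 predictors + intercept)
p  <- 3        # predictors: Steals, ThreesAttempted, OffensiveRebounds

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F statistic on p and n - p - 1 degrees of freedom
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

print(adj_r2)
print(f_stat)
```

Because the adjusted value barely drops below 0.6127, the fit does not appear to come simply from having added more predictors.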

Residual Diagnosis

Residuals are another way in which statisticians can show how well or poorly a model represents relationships. According to University of Virginia’s Research Data Services and Sciences, “Residuals are leftover of the outcome variable after fitting a model (predictors) to data and they could reveal unexplained patterns in the data by the fitted model.” The following diagnostic plots show residuals in four different ways. In this section, I will briefly explain the first three plots and their significance.

plot(lm(Assists ~ Steals + ThreesAttempted + OffensiveRebounds, data=NBAPlayerStatistics0910))

Residuals vs Fitted

This plot shows whether the residuals have non-linear patterns. If the plot shows no distinctive pattern, the model has probably not missed a non-linear relationship. From what I can see, there is nothing but a slight quadratic curve. Since the curve is slight, we can conclude that my data probably does not have any strong non-linear relationships.

Normal Q-Q

This plot shows whether the residuals are normally distributed. If the points fall along a straight diagonal line, the residuals are probably normal. Here, the plot shows a shape that could be considered cubic, but again, it’s not pronounced enough for me to declare that the residuals are not normal.

Scale-Location

In this plot we check for constant variance. Similar to the first plot, we want the residuals to display no distinctive patterns. I believe that the upward sloping curved line in this plot constitutes a distinct pattern, and so my model does not have constant variance. I believe this could be due to the presence of bench players. Players who are on the court a large amount of time will naturally have greater statistical differences between each other than players who don’t play much.

Section V: Hypothesis Testing

Throughout my project, I have claimed that the presence of bench players has a profound effect on my data. However, I’ve never directly addressed the presence of these players using a model. To test whether playing time really is unevenly distributed among players, I will run a formal test of hypothesis on the variable TotalMinutesPlayed.

TotalMinutesPlayed

There are 48 minutes in a basketball game. Each team is allowed 12 players on its active roster and only 5 players can be on the court at any given time. The total number of minutes played by all of a team’s players together in one game will be 48 x 5, or 240 minutes. And so, 240 / 12, or 20, should be the number of minutes played by each player per game if the coach chooses to play them all for an equal amount of time (in other words, assuming that there is no actual distinction between bench players and starters). Now, according to Wikipedia, there were 82 regular season games in the 2009-10 season. Subtract about 4 games to account for injuries and roster changes and we get 78 total games for each player. Therefore, 78 x 20, or 1560, should be the total minutes played for every player in the NBA, assuming that they all play equal amounts of time per game.
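The arithmetic behind the 1560-minute benchmark can be written out step by step. This short sketch only restates the assumptions from the paragraph above (a 12-player roster, equal playing time, and about 4 missed games); the variable names are my own.

```r
# Assumptions stated in the text above
minutes_per_game <- 48       # length of a regulation game
players_on_court <- 5        # players per team on the court at once
roster_size      <- 12       # active roster size
games_assumed    <- 82 - 4   # season games minus an allowance for absences

# Total player-minutes a team distributes in one game
team_minutes_per_game <- minutes_per_game * players_on_court

# Minutes per player per game if playing time were split evenly
equal_share_per_game <- team_minutes_per_game / roster_size

# Expected season total under equal playing time
expected_season_minutes <- games_assumed * equal_share_per_game

print(expected_season_minutes)
```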

With this information, we are able to run a formal test of hypothesis (t.test) with the mean equal to 1560 in order to see if there really is a significant population of majority bench players in the NBA.

t.test(NBAPlayerStatistics0910$TotalMinutesPlayed, mu=1560)

## 
##  One Sample t-test
## 
## data:  NBAPlayerStatistics0910$TotalMinutesPlayed
## t = -4.8365, df = 440, p-value = 1.83e-06
## alternative hypothesis: true mean is not equal to 1560
## 95 percent confidence interval:
##  1262.065 1434.239
## sample estimates:
## mean of x 
##  1348.152

Analysis of Hypothesis Test

H0: mu = 1560

H1: mu ≠ 1560

Null hypothesis: true mean is equal to 1560

Alternative hypothesis: true mean is not equal to 1560

The t.test reported a sample mean of 1348.152 for TotalMinutesPlayed and a p-value of 1.83e-06. The p-value is less than .05, so we are able to reject the null hypothesis that the true mean is 1560. Additionally, the t.test gives a 95% confidence interval with bounds of 1262.065 and 1434.239. Since the hypothesized mean of 1560 does not fall within these bounds, the confidence interval leads to the same conclusion.
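The t statistic and p-value can be cross-checked using only the numbers printed in the t.test output. The sketch below backs the standard error out of the 95% confidence interval, which is an indirect route I am using because the raw data are not reloaded here; the variable names are my own.

```r
# Values copied from the t.test output above
x_bar <- 1348.152               # sample mean of TotalMinutesPlayed
mu0   <- 1560                   # hypothesized mean under H0
df    <- 440                    # degrees of freedom
ci    <- c(1262.065, 1434.239)  # 95 percent confidence interval

# Standard error recovered from the half-width of the interval
se <- (ci[2] - ci[1]) / (2 * qt(0.975, df))

# One-sample t statistic and two-sided p-value
t_stat <- (x_bar - mu0) / se
p_val  <- 2 * pt(-abs(t_stat), df)

# Does the hypothesized mean fall inside the interval?
mu0_inside_ci <- mu0 >= ci[1] && mu0 <= ci[2]

print(t_stat)
print(p_val)
print(mu0_inside_ci)
```

For a two-sided test at the .05 level, rejecting the null hypothesis and finding 1560 outside the 95% interval are two views of the same result.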

Since the sample mean is significantly less than the number of minutes that each player would play if they all played equal amounts of time per game, we can infer that majority bench players exist, and the result is consistent with them outnumbering starters. This is significant because it supports the statements that I have made throughout my project that there is a large number of majority bench players that drive down the “goodness” of my models.

Section VI: Conclusion

Throughout my project I have identified and explored a number of relationships between variables in the dataset, NBAPlayerStatistics0910. But the focus of my project has always been the variable, Assists. After determining the presence of specialization and the strong relationship between Assists and Turnovers, I was able to infer that 1) strong passers aren’t always offensive stars, 2) the point guard is typically the best passer on the team, followed by the shooting guard, forwards and center, and 3) time spent with the ball, not ball-handling skill, was the primary indicator of turnovers.

I moved on to support my second and third inferences using linear models. However, my model involving position was limited by a notably low R^2 value, which I attributed to the presence of primarily bench players and significant differences in playing style.

Unsatisfied with my position model, I decided to examine the relationship between Assists and Position in a different way: by choosing variables that I knew varied highly across positions and making them predictors in a multiple linear regression with Assists as the response variable. The results were good. Two of the three predictors that I chose had statistically significant relationships with Assists, supporting my earlier model and deepening our understanding of differences between positions.

Finally, I used the formal test of hypothesis to pin down the scourge of my project: primarily bench players. Throughout my project, I had blamed these players for elements including clotted scatterplots, low R^2 values and irregular residual diagnostic plots. Needless to say, the absence of primarily bench players would have been the iceberg to my Titanic. Fortunately, a little bit of math and a formal hypothesis test were able to confirm large numbers of primarily bench players and, by extension, most of my analysis.

Honor Code

On my honor, I have neither given nor received any unauthorized assistance on this work.

Pledged, Joseph Amado Williams

Joe Williams

March 09, 2018
