Difference between PER ratings for NBA players of different ages

Does Age affect an NBA player’s output?

Arvind Krishnamoorthy s3765630

June 2 2019

Introduction

The National Basketball Association (NBA) has some of the most athletic players. Few sports around the world demand the athletic traits that basketball does. However, athleticism alone does not drive success in the NBA: a high level of skill is also required.

This investigation examines whether age, used here as a proxy for athleticism, affects a player’s output, as measured by the Player Efficiency Rating (PER).

PER is a weighted score that takes into account a number of different basketball-related factors. It is widely used as an effective way of measuring an NBA player’s total output.

Problem Statement

The problem is to determine the effect on a player’s PER as that player gets older and loses athleticism. We will create two groups based on age: young players, aged 18 to 26, and old players, aged 27 to 35. We will take a sample of players who have played in the NBA and compare PER across the two age categories.

I will use each player’s age in every season, as well as the player’s average PER score for that season. The calculation of PER does not need to be explained here; the reader only needs to be aware that, for the purposes of this investigation, PER is the single measure of player efficiency.

Data

To gather this information, I used open-source data (Kaggle and Google data sets; the link is in the References section) that included every basketball player’s statistics consolidated per season. The data was already in CSV format, so I was able to download it and manipulate it in Excel.

Once this information was loaded into Excel, I removed the statistics for 1950 to 1970, as PER was not available for these years.

Furthermore, I found that after the age of 35 there was a high level of variability in the data. To obtain a more consistent model, I limited the age range to 18 to 35.

Data Cont.

I then split the data into two smaller groups:
Young (0) = Age >= 18 & Age <= 26
Old (1) = Age >= 27 & Age <= 35
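The filtering and grouping steps were carried out in Excel. A minimal base-R sketch of the same logic is shown here on a toy dataframe. The column names Year, Age and Per, the toy rows, and the choice to treat 1970 itself as excluded are assumptions made for illustration. Both bounds of each age range have to hold at once, so the conditions are combined with `&`.

```r
# Toy stand-in for the season-level data; the column names and values are
# assumptions, not taken from the Kaggle file.
nba <- data.frame(
  Player = c("A", "B", "C", "D", "E", "F"),
  Year   = c(1965, 1985, 1999, 2005, 2010, 2016),
  Age    = c(24, 18, 26, 27, 35, 38),
  Per    = c(NA, 11.2, 15.8, 14.1, 9.7, 6.3)
)

# Keep seasons after 1970 and ages 18 to 35 (both age bounds inclusive).
nba_clean <- subset(nba, Year > 1970 & Age >= 18 & Age <= 35)

# 0 = young (18 to 26), 1 = old (27 to 35).
nba_clean$Age_cat <- ifelse(nba_clean$Age >= 18 & nba_clean$Age <= 26, 0, 1)

# One dataframe per age category, as in the report.
nba_young <- subset(nba_clean, Age_cat == 0)
nba_old   <- subset(nba_clean, Age_cat == 1)

print(nba_clean)
```

In this toy data, player A is dropped by the year filter and player F by the age filter, leaving two young and two old rows.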

I loaded the data into R as a dataframe and then created two separate dataframes (one for young players and one for old players).

I then renamed the columns. There are four variables:
Player - the name of the player
Age - the age of the player in a particular season
Age_category (Age_cat in the code) - the category that the player falls into (0 if age is between 18 and 26, 1 if age is between 27 and 35)
PER (Per in the code) - the player’s PER for a particular season

Data Cont.

I also noticed that PER can be negative; however, these are outlier results, so I excluded negative PER values from my testing.

With regard to sampling, I used simple random sampling (SRS) within each of these two groups, using the following code:
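The exclusion of negative PER values can be sketched in base R as follows. The toy values are hypothetical, and keeping a PER of exactly zero while dropping missing values is an assumption of this sketch.

```r
# Hypothetical PER values, including a negative, a zero and a missing value.
per_example <- data.frame(
  Player = c("A", "B", "C", "D"),
  Per    = c(12.4, -3.1, 0.0, NA)
)

# Keep rows with a recorded, non-negative PER (zero is retained here).
per_kept <- per_example[!is.na(per_example$Per) & per_example$Per >= 0, ]

print(per_kept)
```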

sample_young <- sample_n(nba_young, 200)
sample_old <- sample_n(nba_old, 200)

This produced two samples: sample_young contains 200 randomly sampled rows for “young” players, and sample_old contains 200 randomly sampled rows for “old” players.

I used 200 observations for each group. This is large enough for the Central Limit Theorem (CLT) to apply.

For testing, I then combined the two dataframes into one using the following code:
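Random sampling gives a different sample on every run unless the random number generator is seeded. The sketch below shows one way to make the draw repeatable. The seed value and the simulated stand-in populations are assumptions, and the row-index approach is a base-R equivalent of `sample_n(df, 200)` rather than the code used in the report.

```r
# Draw n rows without replacement after fixing the seed.
draw_sample <- function(df, n, seed) {
  set.seed(seed)
  df[sample(nrow(df), n), , drop = FALSE]
}

# Simulated stand-ins for nba_young and nba_old (not real PER data).
set.seed(2019)  # arbitrary seed; any fixed integer works
nba_young <- data.frame(Per = rnorm(5000, mean = 13, sd = 6))
nba_old   <- data.frame(Per = rnorm(5000, mean = 13, sd = 4))

sample_young <- draw_sample(nba_young, 200, seed = 1)
sample_old   <- draw_sample(nba_old, 200, seed = 2)

# The same seed reproduces exactly the same rows.
identical(sample_young, draw_sample(nba_young, 200, seed = 1))
```

Fixing the seed before sampling means the descriptive statistics and every later test are computed on the same 200 rows per group.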

sample_all <- rbind(sample_young,sample_old)

Age category is my qualitative variable, and I needed to convert it into a factor so that I could later test for it. I did this by using the following code:

sample_all$Age_cat_updated <- as.factor(sample_all$Age_cat)
is.factor(sample_all$Age_cat_updated)
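`as.factor()` keeps the raw codes 0 and 1 as the level names. An optional variation, not used in the report, is to attach readable labels with `factor()`. The toy dataframe and the labels "Young" and "Old" are assumptions for illustration.

```r
# Toy stand-in for sample_all.
sample_all <- data.frame(
  Age_cat = c(0, 1, 1, 0),
  Per     = c(12.1, 14.3, 9.8, 16.0)
)

# Map code 0 to "Young" and code 1 to "Old".
sample_all$Age_cat_updated <- factor(
  sample_all$Age_cat,
  levels = c(0, 1),
  labels = c("Young", "Old")
)

is.factor(sample_all$Age_cat_updated)
table(sample_all$Age_cat_updated)
```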

Descriptive Statistics and Visualisation

Here are the descriptive statistics for young (18 - 26 years old) players

sample_young %>% summarise(Min = min(Per, na.rm = TRUE),
                           Q1 = quantile(Per, probs = 0.25, na.rm = TRUE),
                           Median = median(Per, na.rm = TRUE),
                           Q3 = quantile(Per, probs = 0.75, na.rm = TRUE),
                           Max = max(Per, na.rm = TRUE),
                           Mean = mean(Per, na.rm = TRUE),
                           SD = sd(Per, na.rm = TRUE),
                           n = n(),
                           Missing = sum(is.na(Per))) -> table1
knitr::kable(table1)
Min   Q1     Median   Q3       Max    Mean      SD         n     Missing
3.7   10.2   12.8     15.925   76.1   13.5255   6.902676   200   0

Descriptive Statistics Cont.

Here are the descriptive statistics for old (27 - 35 years old) players

sample_old %>% summarise(Min = min(Per,na.rm = TRUE),
                         Q1 = quantile(Per, probs = 0.25, na.rm = TRUE),
                         Median = median(Per,na.rm = TRUE),
                         Q3 = quantile(Per, probs = 0.75, na.rm = TRUE),
                         Max = max(Per,na.rm = TRUE),
                         Mean = mean(Per,na.rm = TRUE),
                         SD = sd(Per,na.rm = TRUE),
                         n = n(),
                         Missing = sum(is.na(Per))) -> table3
knitr::kable(table3)
Min   Q1     Median   Q3       Max    Mean     SD         n     Missing
1.4   10.1   12.95    15.225   24.1   12.695   4.186863   200   0

Descriptive Statistics Cont.

Here is the box plot of PER for young and old NBA players.
‘0’ represents young players between the ages of 18 and 26
‘1’ represents old players between the ages of 27 and 35

sample_all %>% boxplot(Per ~ Age_cat, data = ., xlab = 'Age category', ylab = 'Per', main = 'Per summary for both age categories')

Hypothesis Testing

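The null and alternative hypotheses for this investigation are as follows, where \(\mu_1\) is the population mean PER of young players and \(\mu_2\) is the population mean PER of old players.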

\[H_0: \mu_1 = \mu_2 \]

\[H_A: \mu_1 \ne \mu_2\]

Hypothesis Testing Cont.

model2 <- lm(Per ~ Age_cat,data = sample_all)
model2 %>% summary()
## 
## Call:
## lm(formula = Per ~ Age_cat, data = sample_all)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.295  -3.095  -0.225   2.505  62.574 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  13.5255     0.4037  33.507   <2e-16 ***
## Age_cat      -0.8305     0.5709  -1.455    0.147    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.709 on 398 degrees of freedom
## Multiple R-squared:  0.00529,    Adjusted R-squared:  0.00279 
## F-statistic: 2.116 on 1 and 398 DF,  p-value: 0.1465
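With a single 0/1 predictor, the slope of the linear model is the difference between the two group means, and its t test is the same as an equal-variance two-sample t-test. As a check, the sketch below recomputes the test from the means, standard deviations and sample sizes reported in the descriptive statistics tables. The variable names are introduced here for illustration. The result should reproduce the Age_cat row and the residual standard error in the output above, up to rounding.

```r
# Summary statistics taken from the descriptive statistics tables.
n_young <- 200; mean_young <- 13.5255; sd_young <- 6.902676
n_old   <- 200; mean_old   <- 12.695;  sd_old   <- 4.186863

# Pooled standard deviation (this is the residual standard error of the model).
df <- n_young + n_old - 2
pooled_sd <- sqrt(((n_young - 1) * sd_young^2 + (n_old - 1) * sd_old^2) / df)

# Standard error of the difference in means.
std_error <- pooled_sd * sqrt(1 / n_young + 1 / n_old)

# Old minus young, matching the sign of the Age_cat coefficient.
diff_means <- mean_old - mean_young
t_stat <- diff_means / std_error

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

round(c(estimate = diff_means, std_error = std_error,
        t = t_stat, p = p_value, pooled_sd = pooled_sd), 4)
```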

Hypothesis Testing Cont.

We are going to perform a two-sided hypothesis test.

Testing the Assumption of Normality

sample_young$Per %>% qqPlot(dist = "norm")
sample_old$Per %>% qqPlot(dist = "norm")

Central Limit Theorem

Due to the Central Limit Theorem, we know that the sampling distribution of a mean will be approximately normally distributed, regardless of the underlying population distribution, when the sample size is large (n > 30).
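The Central Limit Theorem can be illustrated with a short simulation. This sketch draws repeated samples of size 200 from a strongly right-skewed population and looks at the distribution of the sample means. The exponential population, the seed and the number of replications are assumptions chosen for illustration; no NBA data is used.

```r
set.seed(1)  # arbitrary seed so the simulation is repeatable

n_per_sample <- 200   # same sample size as each age group in the report
n_reps <- 5000        # number of simulated samples

# Exponential population with rate 1: mean 1, standard deviation 1, skewed.
sample_means <- replicate(n_reps, mean(rexp(n_per_sample, rate = 1)))

# Theory: the sample means centre on 1 with spread 1 / sqrt(200).
c(mean_of_means = mean(sample_means),
  sd_of_means   = sd(sample_means),
  theoretical_sd = 1 / sqrt(n_per_sample))

# Quartiles of the sample means sit roughly symmetrically around the median.
quantile(sample_means, probs = c(0.25, 0.5, 0.75))
```

A histogram or Q-Q plot of `sample_means` would show an approximately normal shape even though the underlying population is skewed.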

Hypothesis Testing Cont.

Homogeneity of variance

Here we need to test the assumption of equal variance. This is done using Levene’s test.

leveneTest(Per ~ Age_cat_updated,data = sample_all)


p = 0.46. Because this is greater than 0.05, we fail to reject the null hypothesis of equal variances and can assume equal variance.
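To show what Levene’s test computes, the sketch below builds the median-centred version (the default `center = median` in `car::leveneTest`) from base R: it is a one-way ANOVA on the absolute deviations of each observation from its group median. The simulated data, the seed and the variable names are assumptions, so the p-value will not match the one reported here.

```r
set.seed(42)  # arbitrary seed for the simulated data

# Simulated PER-like values for two groups of 200 (not real data).
per   <- c(rnorm(200, mean = 13, sd = 5), rnorm(200, mean = 13, sd = 5))
group <- factor(rep(c("0", "1"), each = 200))

# Absolute deviation of each value from its own group median.
abs_dev <- abs(per - ave(per, group, FUN = median))

# One-way ANOVA on the absolute deviations is the Levene test statistic.
levene <- anova(lm(abs_dev ~ group))
p_value <- levene[["Pr(>F)"]][1]

# Decision rule: a p-value above 0.05 means equal variance is not rejected.
equal_variance_assumed <- p_value > 0.05

print(levene)
equal_variance_assumed
```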

Hypothesis Testing Cont.

t.test - we can use a t-test assuming equal variance, with a two-sided alternative hypothesis

t.test(Per ~ Age_cat, data = sample_all, var.equal = TRUE, alternative = "two.sided")

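95% confidence interval for the difference in means: [-1.04436, 0.6653615]
t statistic = -0.4358

Critical value (tcrit):
qt(p = 0.025, df = 200 + 200 - 2)
Answer = -1.966

Difference between the means of group 0 and group 1:
12.9975 - 12.7725
Answer = 0.225

The confidence interval captures 0, the difference in means stated by H0, and the absolute value of the t statistic (0.4358) is smaller than the critical value (1.966). We therefore fail to reject H0: there is no statistically significant difference between the means.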
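The decision rule can be written out explicitly. This sketch takes the t statistic and confidence interval reported above as given and applies the critical-value, p-value and confidence-interval checks; the variable names are introduced here for illustration.

```r
# Values reported by the t-test above.
t_stat <- -0.4358
ci     <- c(-1.04436, 0.6653615)

df    <- 200 + 200 - 2
alpha <- 0.05

# Two-sided critical value (upper tail; the lower one is its negative).
t_crit <- qt(1 - alpha / 2, df)

# Two-sided p-value for the observed t statistic.
p_value <- 2 * pt(-abs(t_stat), df)

# Three equivalent ways of stating the decision.
reject_by_critical_value <- abs(t_stat) > t_crit
reject_by_p_value        <- p_value < alpha
ci_contains_zero         <- ci[1] <= 0 && 0 <= ci[2]

c(t_crit = round(t_crit, 3), p_value = round(p_value, 3))
c(reject_by_critical_value, reject_by_p_value, ci_contains_zero)
```

All three checks agree: the statistic lies inside the critical values, the p-value exceeds 0.05, and the interval contains 0, so H0 is not rejected.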

Discussion

In this investigation we tested whether there is a difference between the means of two samples, one containing players between the ages of 18 and 26 and the other containing players between the ages of 27 and 35, in order to draw a conclusion about the population.

The major finding from this investigation is that there isn’t a statistically significant difference in the Player Efficiency Rating score between players in the age range 18 to 26 and players in the age range 27 to 35.

One limitation of this investigation is that it used only two broad age categories.

If I were to perform this investigation again, I would segment the ages into more groups, which may provide a more granular view of the relationship between an NBA player’s age and his PER.

In conclusion, there is not a statistically significant difference in the mean PER between young (players between the ages of 18 and 26) and old (players between the ages of 27 and 35) players.
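One possible way to set up a finer segmentation is sketched below, using `cut()` to form four age bands and a one-way ANOVA to compare their means. The band boundaries, the seed and the simulated ages and PER values are all assumptions; nothing here is a result from the NBA data.

```r
set.seed(7)  # arbitrary seed for the simulated data

# Simulated ages and PER-like values (not real data).
age <- sample(18:35, 400, replace = TRUE)
per <- rnorm(400, mean = 13, sd = 5)

# Four hypothetical age bands; intervals are closed on the right,
# so 17 < age <= 22 falls in "18-22", and so on.
age_group <- cut(
  age,
  breaks = c(17, 22, 26, 30, 35),
  labels = c("18-22", "23-26", "27-30", "31-35")
)

table(age_group)

# One-way ANOVA: H0 is that all four band means are equal.
anova_fit <- aov(per ~ age_group)
summary(anova_fit)
```

With more than two groups, a one-way ANOVA takes the place of the two-sample t-test, and a significant result would call for follow-up pairwise comparisons.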

References

Kiggins, (2019). Kaggle: Your home for Data Science, Retrieved from https://www.kaggle.com/drgilermo/nba-players-stats