Have NBA Players’ Scoring Patterns Changed Over Time?

A Statistical Analysis of NBA Player Performance: 2011 vs 2015

Sahana Ramamurthy - s4058517

Last updated: 01 June, 2025

Introduction

Basketball, particularly in the NBA, has evolved significantly over the years in terms of pace, player roles, and scoring dynamics. As gameplay has become faster and more offense-oriented, questions arise about whether these changes have led to statistically measurable differences in player performance over time. One of the most direct indicators of offensive performance is the total number of points scored by players in a season. By focusing on this metric, we can gain insights into how scoring trends have shifted across different NBA eras.

Introduction Cont.

Problem Statement

The main question driving this investigation is: Has the average total points scored by NBA players changed significantly between the 2011 and 2015 seasons?

This question stems from visible changes in NBA gameplay, including increased offensive pace, changing player roles, and evolving team strategies. To explore this, we use real player statistics from the Kaggle NBA dataset and apply a two-sample t-test to compare scoring averages between the two seasons. Supporting analysis includes descriptive statistics and visualisations to identify distribution patterns and assess variability. This approach enables us to determine whether any observed difference in scoring is statistically significant or likely due to random chance.

Data

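This investigation uses open-source data from Kaggle’s “NBA Player Stats” repository. Two CSV files were used: Players.csv, which holds player attributes such as height, and Seasons_Stats.csv, which holds per-season statistics such as total points (PTS).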

Loading the datasets

# Load packages (readr for read_csv, dplyr for the pipe, ggplot2 for plots)
library(tidyverse)

# Load datasets
players <- read_csv("Players.csv")
stats <- read_csv("Seasons_Stats.csv")

No primary data collection was conducted. The data is publicly available, and the analysis can be replicated by downloading the same files from Kaggle.

The dataset represents a full population of NBA players per season from 2000–2023. For this project, we focused on the seasons 2011 and 2015 to compare scoring patterns. This selection ensures relevance to modern gameplay while providing distinct temporal comparison points.

To prepare the dataset, we filtered for these two years and removed rows with missing values in key columns (PTS and height). The maximum PTS per player per season was retained to represent peak performance. The data was then merged on the Player field to link height with scoring data.

Data Cont.

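The final cleaned dataset includes the following key variables: Player (player name), Year (season, 2011 or 2015), TotalPoints (season points total, derived from PTS), and height (player height in cm).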

All numeric variables are measured on a ratio scale, allowing valid arithmetic and statistical operations. For example, PTS ranges from 0 to over 2000 depending on player performance, and height typically ranges from 170 cm to 230 cm.

Sampling Method

A sampling method is the strategy used to select a subset of individuals or observations from a larger population for the purpose of statistical analysis. Since it’s often impractical to collect data from an entire population, sampling allows researchers to make inferences about the whole based on a manageable group. There are various types of sampling methods—including random, stratified, systematic, cluster, and convenience sampling—each chosen depending on the research objective, data availability, and required accuracy.
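To make the distinction concrete, the sketch below contrasts simple random, stratified, and convenience-style selection on a small made-up data frame. The data frame `toy`, its column names, and the sample sizes are illustrative assumptions; none of the values come from the NBA dataset.

```r
# Hypothetical toy population: 12 players across two seasons (not real data)
set.seed(42)
toy <- data.frame(
  player = paste0("P", 1:12),
  year   = rep(c(2011, 2015), each = 6),
  pts    = c(120, 850, 430, 1500, 60, 990, 300, 710, 1800, 45, 520, 1100)
)

# Simple random sampling: every row has the same chance of selection
srs <- toy[sample(nrow(toy), size = 4), ]

# Stratified sampling: draw 2 rows at random within each season
strata <- split(toy, toy$year)
stratified <- do.call(rbind, lapply(strata, function(d) d[sample(nrow(d), size = 2), ]))

# Convenience-style selection: keep whatever rows meet a fixed filter
convenience <- toy[toy$year %in% c(2011, 2015) & !is.na(toy$pts), ]

nrow(srs)               # 4 rows
table(stratified$year)  # 2 rows per season
nrow(convenience)       # all 12 rows pass the filter
```

Only the first two approaches involve random selection; the third simply keeps every row that satisfies the filter.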

This project did not draw a random sample. Instead, it used a form of convenience sampling: the data was filtered to two specific seasons, 2011 and 2015, and to the players who had a recorded points total in those years.

Data Preprocessing

To identify seasons with sufficient player data for comparison, we first examined how many players had valid total points (PTS) recorded for each year from 2000 onward. This helps ensure that the years selected for analysis are based on complete and representative data.

# Count number of players per year with valid PTS
stats %>%
  filter(Year >= 2000, !is.na(PTS)) %>%
  group_by(Year) %>%
  summarise(count = n_distinct(Player)) %>%
  arrange(desc(count))

Based on the previous step, the 2011 and 2015 seasons were selected for analysis due to their completeness. We then filtered the dataset to include only these two years and extracted each player’s total points (PTS). For players with multiple entries per year, only the highest point total was retained to represent their peak seasonal performance.

stats_filtered <- stats %>%
  filter(Year %in% c(2011, 2015), !is.na(PTS)) %>%
  select(Player, Year, PTS)

stats_clean <- stats_filtered %>%
  group_by(Player, Year) %>%
  summarise(TotalPoints = max(PTS, na.rm = TRUE), .groups = "drop")
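The effect of this aggregation step can be seen on a small example. The rows below are invented for illustration: the player names, team codes, and point values are hypothetical, and the example assumes that a player who changed teams has one row per team plus a combined season row.

```r
library(dplyr)

# Hypothetical rows: "Player A" appears three times in 2011
toy_stats <- tibble(
  Player = c("Player A", "Player A", "Player A", "Player B"),
  Year   = c(2011, 2011, 2011, 2011),
  Tm     = c("TOT", "AAA", "BBB", "CCC"),
  PTS    = c(900, 500, 400, 650)
)

# Keep the largest PTS value per player and season
toy_clean <- toy_stats %>%
  group_by(Player, Year) %>%
  summarise(TotalPoints = max(PTS, na.rm = TRUE), .groups = "drop")

toy_clean
# Player A keeps 900 (the combined row), Player B keeps 650
```

Under that assumption the maximum coincides with the season total. If no combined row were present, the maximum would understate the total, so this is worth checking against the raw file.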

Descriptive Statistics and Visualisation

The boxplot below visualises the distribution of total points scored by NBA players in the 2011 and 2015 seasons. Both distributions are right-skewed, with several high-value outliers representing top-scoring players. The medians are similar (409 in 2011 and 420.5 in 2015), while the interquartile range (IQR) is somewhat wider in 2011 (144.5 to 813.25) than in 2015 (141 to 772). These visuals support the summary statistics and give a clearer view of how scoring was distributed in each season.

# Boxplot of total points by season, with outliers highlighted in red
ggplot(stats_clean, aes(x = as.factor(Year), y = TotalPoints, fill = as.factor(Year))) +
  geom_boxplot(outlier.color = "red", outlier.shape = 16, width = 0.6) +
  labs(title = "Total Points Comparison: 2011 vs 2015",
       x = "Season Year",
       y = "Total Points") +
  theme_minimal()

# Average total points per year (mean of PTS across all rows with valid PTS)
stats %>%
  filter(!is.na(PTS), Year >= 2000) %>%
  group_by(Year) %>%
  summarise(MeanPoints = mean(PTS, na.rm = TRUE)) %>%
  ggplot(aes(x = Year, y = MeanPoints)) +
  geom_line(color = "blue", linewidth = 1) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_point() +
  labs(title = "Average Total Points Per Year",
       x = "Year", y = "Mean Points") +
  theme_minimal()

This investigation focuses on two key numeric variables: total points scored in a season (PTS) and player height.

To explore patterns in scoring performance, descriptive statistics such as mean, median, standard deviation, and range were calculated. Visualisation techniques such as boxplots were used to highlight differences between seasons and detect potential outliers.

Missing data was handled during preprocessing. Specifically, records with missing values in PTS or height were removed. This ensured a clean dataset with complete observations. Outliers were retained to reflect real-world variability in scoring (e.g., exceptionally high scorers).
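As a minimal sketch of this kind of filtering, the example below drops rows with a missing value in either key column. The data frame `toy_merged` and its values are made up for illustration and do not come from the Kaggle files.

```r
# Hypothetical merged rows with gaps (values are made up)
toy_merged <- data.frame(
  Player = c("Player A", "Player B", "Player C", "Player D"),
  PTS    = c(900, NA, 650, 120),
  height = c(198, 203, NA, 185)
)

# Keep only rows where both key columns are present
complete_rows <- toy_merged[complete.cases(toy_merged[, c("PTS", "height")]), ]

nrow(toy_merged) - nrow(complete_rows)  # number of rows dropped: 2
```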

Below are R chunks used for analysis and their outputs:

# Grouped summary by season year
stats_clean %>%
  group_by(Year) %>%
  summarise(
    Mean = mean(TotalPoints, na.rm = TRUE),
    Median = median(TotalPoints, na.rm = TRUE),
    SD = sd(TotalPoints, na.rm = TRUE),
    Q1 = quantile(TotalPoints, 0.25, na.rm = TRUE),
    Q3 = quantile(TotalPoints, 0.75, na.rm = TRUE),
    Min = min(TotalPoints, na.rm = TRUE),
    Max = max(TotalPoints, na.rm = TRUE),
    Count = n()
  ) %>%
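  knitr::kable(caption = "Summary Statistics by Year")

Summary Statistics by Year

| Year |     Mean | Median |       SD |    Q1 |     Q3 | Min |  Max | Count |
|-----:|---------:|-------:|---------:|------:|-------:|----:|-----:|------:|
| 2011 | 541.8009 |  409.0 | 476.4902 | 144.5 | 813.25 |   0 | 2161 |   452 |
| 2015 | 500.0711 |  420.5 | 422.4312 | 141.0 | 772.00 |   0 | 2217 |   492 |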
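The quartiles in the summary table can be used to check why the boxplot flags high scorers as outliers. The sketch below applies the conventional 1.5 × IQR rule to the reported values. It is an approximation, since `geom_boxplot()` computes its own hinges and these may differ slightly from the `quantile()` results.

```r
# Quartiles and maxima taken from the summary table
q <- data.frame(
  Year = c(2011, 2015),
  Q1   = c(144.5, 141.0),
  Q3   = c(813.25, 772.00),
  Max  = c(2161, 2217)
)

q$IQR         <- q$Q3 - q$Q1
q$upper_fence <- q$Q3 + 1.5 * q$IQR   # points above this are drawn as outliers
q$lower_fence <- q$Q1 - 1.5 * q$IQR   # negative here, so no low outliers are possible

q
# IQR: 668.75 (2011) and 631 (2015)
# upper fence: 1816.375 (2011) and 1718.5 (2015); both maxima exceed their fence
```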

To prepare for hypothesis testing, we extracted total point values for the 2011 and 2015 NBA seasons. After filtering the dataset and aggregating the maximum PTS per player, we created two separate vectors — data_2011 and data_2015 — containing the total points scored by all valid players in each season. These vectors form the basis for a two-sample t-test to determine whether the difference in scoring averages between the two years is statistically significant. As shown in the output, there were 452 players in 2011 and 492 players in 2015.

table(stats_clean$Year)
## 
## 2011 2015 
##  452  492

# Now test
data_2011 <- stats_clean %>% filter(Year == 2011) %>% pull(TotalPoints)
data_2015 <- stats_clean %>% filter(Year == 2015) %>% pull(TotalPoints)

Hypothesis Testing and Confidence Interval

A Welch two-sample t-test was conducted to compare the average total points scored by NBA players in the 2011 and 2015 seasons. The null hypothesis (H₀) is that the 2011 and 2015 means are equal, and the two-sided alternative (H₁) is that they differ. This test does not assume equal variances and is appropriate for comparing two independent groups.

The test returned t = 1.4188 on approximately 904 degrees of freedom, with a p-value of 0.1563, which is greater than the standard significance level of 0.05. Therefore, we fail to reject the null hypothesis: the difference in average total points between 2011 and 2015 is not statistically significant.

The 95% confidence interval for the difference in means (2011 minus 2015) is [-15.99, 99.45], which includes 0, further supporting this conclusion. While the mean in 2011 was higher (≈542) than in 2015 (≈500), a gap of about 42 points is within the range expected from sampling variability.

Overall, the test provides no strong evidence of a shift in player scoring averages between the two selected seasons.

# Perform Welch's t-test
t_test_result <- t.test(data_2011, data_2015, var.equal = FALSE)
t_test_result
## 
##  Welch Two Sample t-test
## 
## data:  data_2011 and data_2015
## t = 1.4188, df = 904.35, p-value = 0.1563
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -15.99200  99.45149
## sample estimates:
## mean of x mean of y 
##  541.8009  500.0711

# Load the car package, which provides leveneTest()
library(car)

# Run Levene's test to check for equality of variances between the two seasons
leveneTest(TotalPoints ~ as.factor(Year), data = stats_clean)

# Extract the 95% confidence interval from the t-test result
conf_interval <- t_test_result$conf.int
conf_interval
## [1] -15.99200  99.45149
## attr(,"conf.level")
## [1] 0.95
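The Welch statistic, its degrees of freedom, and the confidence interval can be reproduced by hand from the summary statistics reported earlier. The sketch below is a cross-check rather than a replacement for `t.test()`. Because it starts from rounded inputs, it should agree with the output above only to within rounding. The variable names are chosen here for illustration.

```r
# Summary statistics reported in the summary table
m1 <- 541.8009; s1 <- 476.4902; n1 <- 452   # 2011
m2 <- 500.0711; s2 <- 422.4312; n2 <- 492   # 2015

# Standard error of the difference in means (variances are not pooled)
v1 <- s1^2 / n1
v2 <- s2^2 / n2
se <- sqrt(v1 + v2)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (m1 - m2) / se
df     <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value and 95% confidence interval for the difference
p_value <- 2 * pt(-abs(t_stat), df)
ci      <- (m1 - m2) + c(-1, 1) * qt(0.975, df) * se

round(c(t = t_stat, df = df, p = p_value), 4)
round(ci, 2)
```

The degrees of freedom fall just below n1 + n2 - 2 = 942 because the two sample variances differ. This is the adjustment that distinguishes Welch's test from the pooled-variance t-test.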

Regression Analysis

# Merge height with total points
players_clean <- players %>% select(Player, height) %>% filter(!is.na(height))
reg_data <- inner_join(players_clean, stats_clean, by = "Player")

# Linear regression model
model <- lm(TotalPoints ~ height, data = reg_data)
summary(model)
## 
## Call:
## lm(formula = TotalPoints ~ height, data = reg_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -575.1 -365.9 -101.0  267.1 1682.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1126.344    319.840   3.522  0.00045 ***
## height        -3.018      1.590  -1.898  0.05806 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 448.8 on 942 degrees of freedom
## Multiple R-squared:  0.003808,   Adjusted R-squared:  0.00275 
## F-statistic: 3.601 on 1 and 942 DF,  p-value: 0.05806

A simple linear regression was performed to investigate whether player height predicts total points scored. The model used TotalPoints as the dependent variable and height as the independent variable.

The estimated slope for height was -3.018 (p = 0.0581), meaning each additional centimetre of height is associated with about 3 fewer total points on average. Because the p-value is slightly above the 0.05 significance threshold, we fail to reject the null hypothesis that the slope is zero: there is insufficient evidence to conclude that height is linearly associated with scoring performance.

Additionally, the R-squared value is 0.0038, indicating that less than 0.4% of the variability in total points is explained by player height. The relationship is extremely weak and practically negligible.
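The reported figures are internally consistent, which can be verified with a short calculation. In a regression with a single predictor, the F statistic equals the squared t value of the slope, and R² can be recovered from F and the residual degrees of freedom. The sketch below uses only numbers from the regression summary; the variable names are illustrative.

```r
# Values reported in the regression summary
t_slope <- -1.898
f_stat  <- 3.601
df_res  <- 942

# With a single predictor, the F statistic is the square of the slope's t value
t_slope^2                         # about 3.60, matching F up to rounding

# R-squared can be recovered from F and the residual degrees of freedom
r_squared <- f_stat / (f_stat + df_res)
r_squared                         # about 0.0038

# Implied correlation between height and total points (sign follows the slope)
r <- sign(t_slope) * sqrt(r_squared)
r                                 # about -0.06
```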

These findings suggest that height alone is not a meaningful predictor of total points scored, and other factors (e.g., playing time, role, skillset) likely play a much greater role in performance.
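To put the slope in practical terms, the sketch below plugs two heights into the fitted equation and compares the resulting gap with the residual standard error. The two heights (180 cm and 210 cm) are arbitrary illustrative choices; the coefficients are those reported in the regression summary.

```r
# Coefficients reported in the regression summary
intercept <- 1126.344
slope     <- -3.018
resid_se  <- 448.8

# Fitted season totals for two illustrative heights (cm); the heights are arbitrary
heights   <- c(180, 210)
predicted <- intercept + slope * heights
predicted                      # 583.104 and 492.564

# Difference associated with 30 cm of height, relative to the residual spread
gap <- diff(predicted)         # -90.54
abs(gap) / resid_se            # roughly 0.2 residual standard errors
```

Even a 30 cm difference in height shifts the fitted total by only about a fifth of the typical residual, which is consistent with the very low R².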

# Regression plot
ggplot(reg_data, aes(x = height, y = TotalPoints)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE, color = "blue") +
  labs(title = "Linear Regression: Height vs Total Points",
       x = "Height (cm)", y = "Total Points") +
  theme_minimal()

- Model: TotalPoints = β₀ + β₁ × Height + ε
- R² value: Indicates how much variance in total points is explained by height.
- p-value of slope: If < 0.05, height significantly predicts scoring performance.

Discussion

References