Introduction

Basketball, particularly in the NBA, has evolved significantly over the years in terms of pace, player roles, and scoring dynamics. As gameplay has become faster and more offense-oriented, questions arise about whether these changes have led to statistically measurable differences in player performance over time. One of the most direct indicators of offensive performance is the total number of points scored by players in a season. By focusing on this metric, we can gain insights into how scoring trends have shifted across different NBA eras.

Introduction Cont.

The NBA has undergone significant evolution in gameplay, with increased pace and offensive strategies over time.
Total points scored by players is a key metric to assess offensive performance and league-wide trends.
This project investigates whether average player scoring significantly changed between the 2011 and 2015 seasons.
Rule changes, rise in analytics, and a shift toward high-efficiency shots (like 3-pointers) may have influenced scoring patterns.
Statistical tools like the two-sample t-test offer a reliable way to test if observed differences are significant or random.
Real-world NBA data from Kaggle ensures transparency, replicability, and practical relevance for sports analysis.
Descriptive statistics and visualisations (boxplots, summaries) support deeper understanding of data distribution.

Problem Statement

The main question driving this investigation is: Has the average total points scored by NBA players changed significantly between the 2011 and 2015 seasons?

This question stems from visible changes in NBA gameplay, including increased offensive pace, changing player roles, and evolving team strategies. To explore this, we use real player statistics from the Kaggle NBA dataset and apply a two-sample t-test to compare scoring averages between the two seasons. Supporting analysis includes descriptive statistics and visualisations to identify distribution patterns and assess variability. This approach enables us to determine whether any observed difference in scoring is statistically significant or likely due to random chance.

Data

This investigation uses open-source data sourced from Kaggle’s “NBA Player Stats” repository. Two specific CSV files were used:

Players.csv: Contains player identifiers and physical characteristics (e.g., height).
Seasons_Stats.csv: Contains detailed season performance statistics, including total points scored (PTS).

Loading the datasets

# Load packages and datasets
library(tidyverse)  # provides read_csv(), the dplyr verbs, and ggplot2
players <- read_csv("Players.csv")
stats <- read_csv("Seasons_Stats.csv")

No primary data collection was conducted. The data is publicly available, and the analysis can be replicated by downloading the same files from Kaggle.

The dataset represents a full population of NBA players per season from 2000–2023. For this project, we focused on the seasons 2011 and 2015 to compare scoring patterns. This selection ensures relevance to modern gameplay while providing distinct temporal comparison points.

To prepare the dataset, we filtered for these two years and removed rows with missing values in key columns (PTS and height). The maximum PTS per player per season was retained to represent peak performance. The data was then merged on the Player field to link height with scoring data.

Data Cont.

The final cleaned dataset includes the following key variables:

Player: The full name of the NBA player (character).
Year: Season year of participation (numeric).
PTS: Total points scored by the player during that season (numeric, ratio scale).
height: Player height in centimetres (numeric, ratio scale).

PTS and height are measured on a ratio scale, allowing valid arithmetic and statistical operations; Year is used only as a grouping variable. For example, PTS ranges from 0 to over 2000 depending on player performance, and height typically ranges from 170 cm to 230 cm.

Sampling Method

A sampling method is the strategy used to select a subset of individuals or observations from a larger population for the purpose of statistical analysis. Since it’s often impractical to collect data from an entire population, sampling allows researchers to make inferences about the whole based on a manageable group. There are various types of sampling methods—including random, stratified, systematic, cluster, and convenience sampling—each chosen depending on the research objective, data availability, and required accuracy.

This project used a form of convenience sampling: the analysis was restricted to NBA player data from two specific seasons, 2011 and 2015. Rather than drawing a random sample from the full dataset, we filtered the data to include only those players who had recorded total points in the selected years.

Data Preprocessing

To identify seasons with sufficient player data for comparison, we first examined how many players had valid total points (PTS) recorded for each year from 2000 onward. This helps ensure that the years selected for analysis are based on complete and representative data.

# Count number of players per year with valid PTS
stats %>%
  filter(Year >= 2000, !is.na(PTS)) %>%
  group_by(Year) %>%
  summarise(count = n_distinct(Player)) %>%
  arrange(desc(count))

Based on the previous step, the 2011 and 2015 seasons were selected for analysis due to their completeness. We then filtered the dataset to include only these two years and extracted each player’s total points (PTS). For players with multiple entries per year, only the highest point total was retained to represent their peak seasonal performance.

stats_filtered <- stats %>%
  filter(Year %in% c(2011, 2015), !is.na(PTS)) %>%
  select(Player, Year, PTS)

stats_clean <- stats_filtered %>%
  group_by(Player, Year) %>%
  summarise(TotalPoints = max(PTS, na.rm = TRUE), .groups = "drop")
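
The effect of keeping the maximum PTS per player and year can be illustrated on a small toy example. This is a minimal sketch in base R, not the report's own pipeline: the toy data frame, the player names, the Tm column, and the "TOT" label for a combined row are all hypothetical and assume that a player who changed teams mid-season appears once per team plus once with a season total.

```r
# Hypothetical toy rows (assumption): "Player A" changed teams mid-season and
# therefore has one row per team plus a combined season-total row ("TOT").
toy <- data.frame(
  Player = c("Player A", "Player A", "Player A", "Player B"),
  Year   = c(2011, 2011, 2011, 2011),
  Tm     = c("TOT", "AAA", "BBB", "CCC"),
  PTS    = c(900, 500, 400, 300)
)

# Keep the largest PTS value per player and year, mirroring max(PTS) above.
toy_clean <- aggregate(PTS ~ Player + Year, data = toy, FUN = max)
names(toy_clean)[names(toy_clean) == "PTS"] <- "TotalPoints"

toy_clean
```

Under that assumption the maximum coincides with the combined season total, so each player contributes exactly one row per year. If a file had no combined row, the maximum would instead pick the player's best single-team stint and understate the season total.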

Descriptive Statistics and Visualisation

The boxplot below visualises the distribution of total points scored by NBA players in the 2011 and 2015 seasons. Both distributions appear right-skewed with multiple high-value outliers, representing top-scoring players. The medians are similar between years (409 in 2011 and 420.5 in 2015), while the interquartile range (IQR) is somewhat wider in 2011 (about 669 points) than in 2015 (631 points). These visuals support our summary statistics and provide a clearer view of scoring dynamics across seasons.

# Boxplot
ggplot(stats_clean, aes(x = as.factor(Year), y = TotalPoints, fill = as.factor(Year))) +
  geom_boxplot(outlier.color = "red", outlier.shape = 16, width = 0.6) +
  labs(title = "Total Points Comparison: 2011 vs 2015",
       x = "Season Year",
       y = "Total Points") +
  theme_minimal()

# Average total points per year (linewidth replaces the deprecated size argument for lines in ggplot2 >= 3.4.0)
stats %>%
  filter(!is.na(PTS), Year >= 2000) %>%
  group_by(Year) %>%
  summarise(MeanPoints = mean(PTS, na.rm = TRUE)) %>%
  ggplot(aes(x = Year, y = MeanPoints)) +
  geom_line(color = "blue", linewidth = 1) +
  geom_point() +
  labs(title = "Average Total Points Per Year",
       x = "Year", y = "Mean Points") +
  theme_minimal()

This investigation focuses on two key numeric variables:

TotalPoints: The total number of points scored by a player in a season (used as the primary performance measure).
Year: The season in which the points were recorded (used to group and compare trends over time).

To explore patterns in scoring performance, descriptive statistics such as mean, median, standard deviation, and range were calculated. Visualisation techniques such as boxplots were used to highlight differences between seasons and detect potential outliers.

Missing data was handled during preprocessing. Specifically, records with missing values in PTS or height were removed. This ensured a clean dataset with complete observations. Outliers were retained to reflect real-world variability in scoring (e.g., exceptionally high scorers).

Below are R chunks used for analysis and their outputs:

# Grouped summary by season year
stats_clean %>%
  group_by(Year) %>%
  summarise(
    Mean = mean(TotalPoints, na.rm = TRUE),
    Median = median(TotalPoints, na.rm = TRUE),
    SD = sd(TotalPoints, na.rm = TRUE),
    Q1 = quantile(TotalPoints, 0.25, na.rm = TRUE),
    Q3 = quantile(TotalPoints, 0.75, na.rm = TRUE),
    Min = min(TotalPoints, na.rm = TRUE),
    Max = max(TotalPoints, na.rm = TRUE),
    Count = n()
  ) %>%
  knitr::kable(caption = "Summary Statistics by Year")

Summary Statistics by Year
Year	Mean	Median	SD	Q1	Q3	Min	Max	Count
2011	541.8009	409.0	476.4902	144.5	813.25	0	2161	452
2015	500.0711	420.5	422.4312	141.0	772.00	0	2217	492
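
The quartiles in this table also indicate roughly where the boxplot starts flagging outliers. The sketch below applies the usual 1.5 × IQR rule to the reported Q1 and Q3 values. The name summary_tbl is introduced only for illustration, and the boxplot's hinges can differ slightly from the quantile() values, so the fences are approximate.

```r
# Q1 and Q3 copied from the summary table above.
summary_tbl <- data.frame(
  Year = c(2011, 2015),
  Q1   = c(144.5, 141.0),
  Q3   = c(813.25, 772.00)
)

# Interquartile range and the 1.5 * IQR fences used to flag outliers.
summary_tbl$IQR        <- summary_tbl$Q3 - summary_tbl$Q1
summary_tbl$LowerFence <- summary_tbl$Q1 - 1.5 * summary_tbl$IQR
summary_tbl$UpperFence <- summary_tbl$Q3 + 1.5 * summary_tbl$IQR

summary_tbl
```

Both lower fences are negative, so no low-scoring player can be flagged, while the reported maxima (2161 and 2217) lie above the upper fences. This is consistent with outliers appearing only at the high end of each distribution.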

To prepare for hypothesis testing, we extracted total point values for the 2011 and 2015 NBA seasons. After filtering the dataset and aggregating the maximum PTS per player, we created two separate vectors — data_2011 and data_2015 — containing the total points scored by all valid players in each season. These vectors form the basis for a two-sample t-test to determine whether the difference in scoring averages between the two years is statistically significant. As shown in the output, there were 452 players in 2011 and 492 players in 2015.

table(stats_clean$Year)

## 
## 2011 2015 
##  452  492

# Now test
data_2011 <- stats_clean %>% filter(Year == 2011) %>% pull(TotalPoints)
data_2015 <- stats_clean %>% filter(Year == 2015) %>% pull(TotalPoints)

Hypothesis Testing and Confidence Interval

A Welch two-sample t-test was conducted to compare the average total points scored by NBA players in the 2011 and 2015 seasons. This test does not assume equal variances and is appropriate for comparing two independent groups.

Null Hypothesis (H₀): The mean total points in 2011 is equal to that in 2015.
Alternative Hypothesis (H₁): The mean total points in 2011 is not equal to that in 2015.

The result returned a p-value of 0.1563, which is greater than the standard significance level of 0.05. Therefore, we fail to reject the null hypothesis, suggesting that the difference in average total points between 2011 and 2015 is not statistically significant.

The 95% confidence interval for the difference in means (2011 minus 2015) is [-15.99, 99.45], which includes 0, further supporting this conclusion. While the mean in 2011 was higher (≈542) than in 2015 (≈500), the difference of about 42 points is not statistically significant.

The test therefore provides no strong evidence of a shift in average player scoring between the two selected seasons.

# Perform Welch's t-test
t_test_result <- t.test(data_2011, data_2015, var.equal = FALSE)
t_test_result

## 
##  Welch Two Sample t-test
## 
## data:  data_2011 and data_2015
## t = 1.4188, df = 904.35, p-value = 0.1563
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -15.99200  99.45149
## sample estimates:
## mean of x mean of y 
##  541.8009  500.0711
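
As a cross-check, the Welch statistic, its degrees of freedom, the p-value, and the confidence interval can be reconstructed from the summary statistics alone. The sketch below uses the rounded means, standard deviations, and counts reported in the summary table, so it should agree with the output above only up to rounding. The variable names are introduced here for illustration.

```r
# Summary statistics copied from the "Summary Statistics by Year" table.
m1 <- 541.8009; s1 <- 476.4902; n1 <- 452   # 2011
m2 <- 500.0711; s2 <- 422.4312; n2 <- 492   # 2015

# Per-group variance of the sample mean and the combined standard error.
v1 <- s1^2 / n1
v2 <- s2^2 / n2
se <- sqrt(v1 + v2)

# Welch t statistic and Welch-Satterthwaite degrees of freedom.
t_stat <- (m1 - m2) / se
df     <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value and 95% confidence interval for the difference in means.
p_value <- 2 * pt(-abs(t_stat), df)
ci      <- (m1 - m2) + c(-1, 1) * qt(0.975, df) * se

round(c(t = t_stat, df = df, p = p_value, lower = ci[1], upper = ci[2]), 4)
```

Because the standard error (about 29 points) is large relative to the difference in means (about 42 points), the interval comfortably spans zero.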

# Load car package if not already loaded
library(car)

# Run Levene’s Test to check for equality of variances
leveneTest(TotalPoints ~ as.factor(Year), data = stats_clean)
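
To make the idea behind Levene's test concrete: with its default median centring, leveneTest() amounts to a one-way ANOVA on each observation's absolute deviation from its group median. The sketch below shows this in base R on simulated data. The toy data frame, its sample sizes, and the exponential distributions are assumptions made purely for illustration, so the resulting numbers say nothing about the NBA data.

```r
# Simulated, right-skewed toy data (assumption: not the NBA dataset).
set.seed(1)
toy <- data.frame(
  Year        = factor(rep(c(2011, 2015), times = c(50, 60))),
  TotalPoints = c(rexp(50, rate = 1 / 540), rexp(60, rate = 1 / 500))
)

# Absolute deviation of each observation from its own group's median.
group_median <- ave(toy$TotalPoints, toy$Year, FUN = median)
toy$AbsDev   <- abs(toy$TotalPoints - group_median)

# One-way ANOVA on the absolute deviations gives the Levene-type F test.
levene_fit <- anova(lm(AbsDev ~ Year, data = toy))
f_stat     <- levene_fit[["F value"]][1]
p_value    <- levene_fit[["Pr(>F)"]][1]

c(F = f_stat, p = p_value)
```

A small p-value here would suggest that the typical spread around the median differs between the groups, which is the condition under which Welch's test is preferred over the pooled-variance t-test.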

# Confidence Interval
conf_interval <- t_test_result$conf.int
conf_interval

## [1] -15.99200  99.45149
## attr(,"conf.level")
## [1] 0.95

Null Hypothesis (H₀): Average total points scored in 2011 = 2015.
Alternative Hypothesis (H₁): Average total points scored in 2011 ≠ 2015.
p-value: 0.1562877
Conclusion: Fail to reject H₀ – no significant difference

Regression Analysis

# Merge height with total points
players_clean <- players %>% select(Player, height) %>% filter(!is.na(height))
reg_data <- inner_join(players_clean, stats_clean, by = "Player")

# Linear regression model
model <- lm(TotalPoints ~ height, data = reg_data)
summary(model)

## 
## Call:
## lm(formula = TotalPoints ~ height, data = reg_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -575.1 -365.9 -101.0  267.1 1682.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1126.344    319.840   3.522  0.00045 ***
## height        -3.018      1.590  -1.898  0.05806 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 448.8 on 942 degrees of freedom
## Multiple R-squared:  0.003808,   Adjusted R-squared:  0.00275 
## F-statistic: 3.601 on 1 and 942 DF,  p-value: 0.05806

A simple linear regression was performed to investigate whether player height predicts total points scored. The model used TotalPoints as the dependent variable and height as the independent variable.

Null Hypothesis (H₀): Height has no effect on total points scored (slope = 0).
Alternative Hypothesis (H₁): Height has a linear effect on total points scored (slope ≠ 0).

The regression coefficient for height was -3.018 with a p-value of 0.0581, which is slightly above the 0.05 significance threshold. This means we fail to reject the null hypothesis — there is insufficient evidence to conclude that height significantly affects scoring performance.

Additionally, the R-squared value is 0.0038, indicating that less than 0.4% of the variability in total points is explained by player height. The relationship is extremely weak and practically negligible.

These findings suggest that height alone is not a meaningful predictor of total points scored, and other factors (e.g., playing time, role, skillset) likely play a much greater role in performance.
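
To give a sense of scale, the fitted line can be evaluated by hand from the reported coefficients. The sketch below is illustrative only: predict_points() is a hypothetical helper, the heights of 180 cm and 210 cm are arbitrary examples, and all inputs are the rounded values printed in the model summary.

```r
# Coefficients, standard error, F statistic, and residual df from summary(model).
b0 <- 1126.344
b1 <- -3.018
se_b1 <- 1.590
f_stat <- 3.601
df_resid <- 942

# Hypothetical helper: fitted total points for a given height in centimetres.
predict_points <- function(height_cm) b0 + b1 * height_cm

pred_180 <- predict_points(180)
pred_210 <- predict_points(210)

# The slope's t value is the estimate divided by its standard error.
t_slope <- b1 / se_b1

# With a single predictor, R-squared can be recovered from F and the residual df.
r_squared <- f_stat / (f_stat + df_resid)

c(pred_180 = pred_180, pred_210 = pred_210, t = t_slope, R2 = r_squared)
```

A 30 cm difference in height shifts the fitted value by only about 90 points, which is small next to the residual standard error of 448.8 and is in line with the very low R-squared.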

# Regression plot
ggplot(reg_data, aes(x = height, y = TotalPoints)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE, color = "blue") +
  labs(title = "Linear Regression: Height vs Total Points",
       x = "Height (cm)", y = "Total Points") +
  theme_minimal()

- Model: TotalPoints = β₀ + β₁ × Height + ε
- R² value: Indicates how much variance in total points is explained by height.
- p-value of slope: If < 0.05, height significantly predicts scoring performance.

Discussion

The Welch’s t-test produced a p-value of 0.156, meaning we fail to reject the null hypothesis. There is no statistically significant difference in the average total points scored by players in the 2011 and 2015 seasons.
Levene’s Test returned a p-value of 0.033, indicating unequal variances in total points between 2011 and 2015. This justified the use of Welch’s t-test for comparing the means.
The linear regression analysis found that height was not a significant predictor of total points scored. The p-value for the height variable was 0.058, and the R² value was only 0.0038, suggesting an extremely weak relationship.
Strengths of this investigation include the use of real-world NBA data, proper data cleaning, use of visualisation techniques, and checking statistical assumptions with Levene’s Test.
Limitations include the exclusion of key performance factors such as minutes played, player position, and game pace. The analysis was also limited to only two seasons.
Future investigations could expand to multiple seasons, explore per-minute scoring rates, or examine interaction effects based on position or team strategies.
Conclusion: This study found no significant scoring change between 2011 and 2015, and no evidence that taller players score more. NBA player performance appears to be influenced by a broader set of contextual and tactical variables beyond just physical height or year of play.

Have NBA Players’ Scoring Patterns Changed Over Time?

A Statistical Analysis of NBA Player Performance: 2011 vs 2015
