The dataset I’ve chosen is the per-game player statistics for the 2021-2022 NBA season. Some stats found in the dataset are shooting percentages, rebound totals, assists, turnovers, and scoring outputs. My goal for this project is to figure out which performance metrics best predict a player’s scoring output. Regression modeling can help me understand which variables influence points the most.
#2 Data Cleaning
I will first import my libraries, set my working directory, and load the dataset.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 812 Columns: 30
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): player, pos, tm
dbl (27): age, g, gs, mp, fg, fga, fgpercent, x3p, x3pa, x3ppercent, x2p, x2...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Now that the prep is done, I will clean the dataset.
# Clean column names
names(nba) <- tolower(names(nba))

# Convert categorical
nba_clean <- nba %>%
  filter(mp > 5) %>%  # remove players with extremely low minutes as they aren't impactful
  drop_na(pts, fgpercent, x3ppercent, ftpercent) %>%
  mutate(
    pos = as.factor(pos),
    tm = as.factor(tm),
    player = as.factor(player)
  )
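To see what the cleaning step removes, here is a minimal sketch on a made-up four-row data frame. The `toy` data and the player labels are hypothetical and are not rows from the NBA dataset. It applies the same `filter()` and `drop_na()` calls as above. Note that in the summary in the next section `mp` runs from 7 to 2854, which suggests it holds total minutes rather than minutes per game, so `mp > 5` likely only removes players with almost no playing time.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, NOT taken from the NBA dataset
toy <- data.frame(
  player     = c("A", "B", "C", "D"),
  mp         = c(1200, 3, 450, 800),
  pts        = c(21.5, 10.0, 14.2, 18.0),
  fgpercent  = c(0.47, 0.50, 0.41, 0.52),
  x3ppercent = c(0.36, 0.00, NA, 0.33),  # NA: player C attempted no threes
  ftpercent  = c(0.81, 0.75, 0.70, 0.66)
)

toy_clean <- toy %>%
  filter(mp > 5) %>%                              # drops player B (3 minutes)
  drop_na(pts, fgpercent, x3ppercent, ftpercent)  # drops player C (missing 3P%)

nrow(toy_clean)
as.character(toy_clean$player)
```

Comparing `nrow()` before and after each step on the real data would show how many rows each rule removes.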
#3 Exploring Dataset
summary(nba_clean)
player pos age tm
Juancho Hernangómez: 4 SG :168 Min. :19.00 TOT : 85
Aaron Holiday : 3 PF :130 1st Qu.:23.00 POR : 26
Alize Johnson : 3 PG :129 Median :26.00 DAL : 23
Andre Drummond : 3 SF :127 Mean :26.14 DET : 23
Armoni Brooks : 3 C :105 3rd Qu.:29.00 IND : 23
Bruno Fernando : 3 SF-SG : 4 Max. :41.00 MIL : 23
(Other) :655 (Other): 11 (Other):471
g gs mp fg
Min. : 1.00 Min. : 0.00 Min. : 7.0 Min. : 0.000
1st Qu.:21.00 1st Qu.: 1.00 1st Qu.: 302.5 1st Qu.: 5.525
Median :45.00 Median : 7.00 Median : 804.0 Median : 7.100
Mean :42.39 Mean :19.70 Mean : 971.7 Mean : 7.359
3rd Qu.:65.00 3rd Qu.:31.75 3rd Qu.:1572.2 3rd Qu.: 8.900
Max. :82.00 Max. :82.00 Max. :2854.0 Max. :18.700
fga fgpercent x3p x3pa
Min. : 3.40 Min. :0.0000 Min. :0.000 Min. : 0.000
1st Qu.:13.40 1st Qu.:0.3950 1st Qu.:1.300 1st Qu.: 4.300
Median :16.20 Median :0.4415 Median :2.200 Median : 6.900
Mean :16.67 Mean :0.4417 Mean :2.292 Mean : 6.842
3rd Qu.:19.48 3rd Qu.:0.4900 3rd Qu.:3.300 3rd Qu.: 9.175
Max. :41.20 Max. :0.7360 Max. :7.800 Max. :19.900
x3ppercent x2p x2pa x2ppercent
Min. :0.0000 Min. : 0.000 Min. : 0.000 Min. :0.0000
1st Qu.:0.2792 1st Qu.: 3.025 1st Qu.: 6.500 1st Qu.:0.4617
Median :0.3330 Median : 4.800 Median : 9.400 Median :0.5165
Mean :0.3165 Mean : 5.069 Mean : 9.828 Mean :0.5092
3rd Qu.:0.3767 3rd Qu.: 6.700 3rd Qu.:12.500 3rd Qu.:0.5710
Max. :1.0000 Max. :18.700 Max. :27.500 Max. :1.0000
NA's :2
ft fta ftpercent orb
Min. : 0.000 Min. : 0.200 Min. :0.0000 Min. : 0.000
1st Qu.: 1.700 1st Qu.: 2.325 1st Qu.:0.6893 1st Qu.: 0.900
Median : 2.700 Median : 3.650 Median :0.7730 Median : 1.600
Mean : 3.153 Mean : 4.170 Mean :0.7570 Mean : 2.271
3rd Qu.: 4.000 3rd Qu.: 5.300 3rd Qu.:0.8488 3rd Qu.: 3.200
Max. :14.200 Max. :17.400 Max. :1.0000 Max. :13.700
drb trb ast stl
Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :0.00
1st Qu.: 4.800 1st Qu.: 5.900 1st Qu.: 2.500 1st Qu.:1.10
Median : 6.000 Median : 7.700 Median : 3.700 Median :1.40
Mean : 6.666 Mean : 8.938 Mean : 4.555 Mean :1.54
3rd Qu.: 8.100 3rd Qu.:11.375 3rd Qu.: 6.175 3rd Qu.:1.90
Max. :19.200 Max. :33.000 Max. :16.100 Max. :9.00
blk tov pf pts
Min. :0.0000 Min. : 0.000 Min. : 0.000 Min. : 2.60
1st Qu.:0.4000 1st Qu.: 1.700 1st Qu.: 3.300 1st Qu.:15.50
Median :0.8000 Median : 2.300 Median : 4.050 Median :19.40
Mean :0.9469 Mean : 2.538 Mean : 4.343 Mean :20.16
3rd Qu.:1.2000 3rd Qu.: 3.100 3rd Qu.: 5.100 3rd Qu.:23.90
Max. :6.0000 Max. :13.700 Max. :13.700 Max. :49.90
ortg drtg
Min. : 17.0 Min. : 87.0
1st Qu.:103.0 1st Qu.:110.0
Median :111.0 Median :113.0
Mean :109.8 Mean :112.5
3rd Qu.:117.0 3rd Qu.:115.0
Max. :187.0 Max. :122.0
# Scatter plot between FG% and PTS
ggplot(nba_clean, aes(x = fgpercent, y = pts)) +
  geom_point(alpha = 0.6, color = "purple") +
  labs(
    title = "FG% vs Points per Game",
    x = "Field Goal Percentage",
    y = "Points per Game"
  )
Call:
lm(formula = pts ~ fgpercent + x3ppercent + ftpercent + stl +
tov + age + mp, data = nba_clean)
Residuals:
Min 1Q Median 3Q Max
-13.7093 -3.4095 -0.5074 2.9910 21.1214
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.2329902  1.9045511  -2.223  0.02658 *
fgpercent   26.3862303  2.2457870  11.749  < 2e-16 ***
x3ppercent   5.3577804  1.6566081   3.234  0.00128 **
ftpercent    9.9319982  1.3836097   7.178 1.89e-12 ***
stl         -0.7039262  0.2449417  -2.874  0.00418 **
tov          2.0631389  0.1563717  13.194  < 2e-16 ***
age         -0.0976222  0.0474454  -2.058  0.04002 *
mp           0.0019848  0.0002741   7.240 1.24e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.057 on 666 degrees of freedom
Multiple R-squared: 0.4583, Adjusted R-squared: 0.4526
F-statistic: 80.5 on 7 and 666 DF, p-value: < 2.2e-16
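Based on the regression output, every predictor in the final model is statistically significant at the 5% level, and together they explain about 46% of the variation in points (R-squared = 0.458). Turnovers and field goal percentage have the largest t values (13.19 and 11.75), so they are the strongest predictors in this model, followed by minutes played and free throw percentage. The coefficients for field goal percentage, three-point percentage, and free throw percentage are all positive, which shows that shooting efficiency goes along with higher scoring. The minutes coefficient looks small (about 0.002) only because it is measured per minute played; more playing time still goes with more points. Turnovers have a positive coefficient (2.06), not a negative one. This most likely reflects usage, since players who handle the ball more tend to both score more and turn it over more, rather than turnovers helping scoring. Steals and age have small negative coefficients. Assists are not in the final model, so I cannot draw a conclusion about them from this output.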
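Because the predictors are on very different scales, the raw coefficients cannot be compared directly. As a rough check, the sketch below multiplies each coefficient from the output above by a step size. The step sizes are my own illustrative choices (10 percentage points for the shooting percentages, one unit for steals, turnovers, and age, and 500 minutes for `mp`) and are not part of the original analysis.

```r
# Coefficients copied from the regression output above
coefs <- c(fgpercent = 26.3862303, x3ppercent = 5.3577804,
           ftpercent = 9.9319982, stl = -0.7039262,
           tov = 2.0631389, age = -0.0976222, mp = 0.0019848)

# Illustrative step sizes (an assumption, not from the analysis):
# shooting percentages are stored as proportions, so 0.10 = 10 percentage points
steps <- c(fgpercent = 0.10, x3ppercent = 0.10, ftpercent = 0.10,
           stl = 1, tov = 1, age = 1, mp = 500)

# Expected change in pts for each step, holding the other predictors fixed
effect <- coefs * steps[names(coefs)]
round(effect, 2)
```

Under these step sizes, a 10-point rise in field goal percentage corresponds to roughly 2.6 more points, while the turnover effect is positive, matching the sign in the output.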
#5 Final Visualization
# Creates a summary based on position
pos_summary <- nba_clean %>%
  group_by(pos) %>%
  summarize(avg_pts = mean(pts))

# Plots average points per game by position
ggplot(pos_summary, aes(x = pos, y = avg_pts, fill = pos)) +
  geom_col(color = "black") +
  scale_fill_brewer(palette = "Set3") +  # automatically sets color
  labs(
    title = "Average Points per Game by Position (NBA 2021–22)",
    x = "Position",
    y = "Average Points",
    caption = "Source: NBA Player Stats Dataset (Kaggle)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
#6 Conclusion
Process:
I prepared the data by removing players with very low minutes, whose averages are based on too little playing time to be meaningful. I also dropped rows with missing values in scoring and the shooting percentages, and converted the categorical variables to factors. The regression analysis showed that minutes played, field goal percentage, and three-point percentage are significant predictors of scoring. This matches common basketball knowledge: players who play more tend to score more, and shooting efficiency matters a great deal to offensive success. The backward elimination process kept the model focused on meaningful predictors and reduced the risk of overfitting.
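The backward elimination step is not shown in the output, so here is a minimal sketch of one common way to do it in R, using `step()` from the built-in stats package. This is an assumption about the approach, not a record of what I ran: `step()` drops terms based on AIC, which can give a different result from removing predictors by p-value. The data below is simulated (the `sim` data frame and its `noise` column are hypothetical) so the example runs on its own.

```r
set.seed(42)

# Simulated data (hypothetical, NOT the NBA dataset): pts depends on
# fgpercent and mp, while noise has no true effect on pts
n <- 300
sim <- data.frame(
  fgpercent = runif(n, 0.35, 0.60),
  mp        = runif(n, 100, 2800),
  noise     = rnorm(n)
)
sim$pts <- 5 + 25 * sim$fgpercent + 0.002 * sim$mp + rnorm(n, sd = 2)

# Start from the full model with every candidate predictor
full_model <- lm(pts ~ fgpercent + mp + noise, data = sim)

# step() removes one term at a time for as long as doing so lowers the AIC
reduced_model <- step(full_model, direction = "backward", trace = 0)

formula(reduced_model)
```

On the real data the same call would start from an `lm()` with all candidate stats, and `summary(reduced_model)` would give a table like the one in the regression section.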
Visualization Insight:
The bar chart compares average scoring by position. It shows that guards and forwards score differently, which reflects their roles and the differences in shot selection across positions. The Set3 Brewer color palette is meant to make the chart easier to read and more professional.
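One thing to keep in mind when reading the bars is that the groups are very unequal in size: in the summary output, SG has 168 rows while SF-SG has only 4, so the average for a small group can be driven by one or two players. A small sketch of how the summary could also carry the group size is below. The `toy` data frame is hypothetical and is not taken from the NBA dataset.

```r
library(dplyr)

# Hypothetical toy data, NOT rows from the NBA dataset
toy <- data.frame(
  pos = c("PG", "PG", "PG", "C", "C", "SF-SG"),
  pts = c(22, 18, 20, 15, 17, 30)
)

pos_summary <- toy %>%
  group_by(pos) %>%
  summarize(
    n       = n(),        # number of player rows in the position group
    avg_pts = mean(pts),  # same average used for the bar heights
    .groups = "drop"
  ) %>%
  arrange(desc(n))

pos_summary
```

In the toy data the single SF-SG row has the highest average simply because it is the only player in that group, which is the kind of pattern the `n` column helps flag.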
Limitations:
The main limitations of this analysis I can think of are that it only considers one season and does not take into account whether a team was tanking or its pace of play. A future visualization could include defensive metrics, since scoring isn’t the only stat that can highlight a player’s importance.