Data 110 Project 1

By: Ameer Adegun

#1 Introduction

The dataset I’ve chosen is the player per game statistics for the 2021-2022 NBA season. Some stats found in the dataset are shooting percentages, rebound totals, assists, turnovers, and scoring outputs. My goal for this project is to figure out which performance metrics best predict a player’s scoring output. Regression modeling can help me understand which variables influence points the most.

#2 Data Cleaning

I will first import my libraries, set my working directory, and load the dataset.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("C:/Users/SwagD/Downloads/Data 110")

nba <- read_csv("nba-player-stats-2021.csv")
Rows: 812 Columns: 30
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): player, pos, tm
dbl (27): age, g, gs, mp, fg, fga, fgpercent, x3p, x3pa, x3ppercent, x2p, x2...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nba #make sure it loaded properly
# A tibble: 812 × 30
   player  pos     age tm        g    gs    mp    fg   fga fgpercent   x3p  x3pa
   <chr>   <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl>
 1 Precio… C        22 TOR      73    28  1725   7.7  17.5     0.439   1.6   4.5
 2 Steven… C        28 MEM      76    75  1999   5     9.2     0.547   0     0  
 3 Bam Ad… C        24 MIA      56    56  1825  11.1  20       0.557   0     0.2
 4 Santi … PF       21 MEM      32     0   360   7    17.5     0.402   0.8   6.4
 5 LaMarc… C        36 BRK      47    12  1050  11.6  21.1     0.55    0.6   2.1
 6 Nickei… SG       23 TOT      65    21  1466   8.5  22.9     0.372   3.5  11.4
 7 Nickei… SG       23 NOP      50    19  1317   8.9  23.7     0.375   3.6  11.4
 8 Nickei… SG       23 UTA      15     2   149   5.3  15.9     0.333   3.3  10.9
 9 Grayso… SG       26 MIL      66    61  1805   6.8  15.1     0.448   4.2  10.4
10 Jarret… C        23 CLE      56    56  1809  10.2  15       0.677   0     0.3
# ℹ 802 more rows
# ℹ 18 more variables: x3ppercent <dbl>, x2p <dbl>, x2pa <dbl>,
#   x2ppercent <dbl>, ft <dbl>, fta <dbl>, ftpercent <dbl>, orb <dbl>,
#   drb <dbl>, trb <dbl>, ast <dbl>, stl <dbl>, blk <dbl>, tov <dbl>, pf <dbl>,
#   pts <dbl>, ortg <dbl>, drtg <dbl>

Now that the prep is done, I will clean the dataset.

# Clean column names
names(nba) <- tolower(names(nba))

# Filter out low-minute players, drop rows with missing values, and convert categorical columns to factors
nba_clean <- nba %>%
  filter(mp > 5) %>%   # remove players with extremely low minutes as they aren't impactful
  drop_na(pts, fgpercent, x3ppercent, ftpercent) %>%
  mutate(
    pos = as.factor(pos),
    tm  = as.factor(tm),
    player = as.factor(player)
  )
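
One thing the printed rows above suggest is that a player who changed teams appears several times: once per team, plus a combined `TOT` row (for example, rows 6 to 8, where 50 + 15 games add up to the 65 in the `TOT` row). The cleaning step keeps all of these rows. A minimal sketch of one possible way to keep a single row per player is shown below. It uses hypothetical toy data (the player names and the `one_row_per_player` object are made up for illustration), and it is an optional extra step, not part of the analysis in this report.

```r
library(dplyr)

# Hypothetical toy rows shaped like the dataset (names are illustrative only)
toy <- data.frame(
  player = c("Player A", "Player A", "Player A", "Player B"),
  tm     = c("TOT", "NOP", "UTA", "MIL"),
  mp     = c(1466, 1317, 149, 1805),
  stringsAsFactors = FALSE
)

# Keep the combined TOT row for traded players,
# and the single existing row for everyone else
one_row_per_player <- toy %>%
  group_by(player) %>%
  filter(n() == 1 | tm == "TOT") %>%
  ungroup()

one_row_per_player
```

Applied to the real data, the same `group_by()` and `filter()` pattern would run on `nba_clean` instead of `toy`, which would change the row counts and the regression results shown later.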

#3 Exploring Dataset

summary(nba_clean)
                 player         pos           age              tm     
 Juancho Hernangómez:  4   SG     :168   Min.   :19.00   TOT    : 85  
 Aaron Holiday      :  3   PF     :130   1st Qu.:23.00   POR    : 26  
 Alize Johnson      :  3   PG     :129   Median :26.00   DAL    : 23  
 Andre Drummond     :  3   SF     :127   Mean   :26.14   DET    : 23  
 Armoni Brooks      :  3   C      :105   3rd Qu.:29.00   IND    : 23  
 Bruno Fernando     :  3   SF-SG  :  4   Max.   :41.00   MIL    : 23  
 (Other)            :655   (Other): 11                   (Other):471  
       g               gs              mp               fg        
 Min.   : 1.00   Min.   : 0.00   Min.   :   7.0   Min.   : 0.000  
 1st Qu.:21.00   1st Qu.: 1.00   1st Qu.: 302.5   1st Qu.: 5.525  
 Median :45.00   Median : 7.00   Median : 804.0   Median : 7.100  
 Mean   :42.39   Mean   :19.70   Mean   : 971.7   Mean   : 7.359  
 3rd Qu.:65.00   3rd Qu.:31.75   3rd Qu.:1572.2   3rd Qu.: 8.900  
 Max.   :82.00   Max.   :82.00   Max.   :2854.0   Max.   :18.700  
                                                                  
      fga          fgpercent           x3p             x3pa       
 Min.   : 3.40   Min.   :0.0000   Min.   :0.000   Min.   : 0.000  
 1st Qu.:13.40   1st Qu.:0.3950   1st Qu.:1.300   1st Qu.: 4.300  
 Median :16.20   Median :0.4415   Median :2.200   Median : 6.900  
 Mean   :16.67   Mean   :0.4417   Mean   :2.292   Mean   : 6.842  
 3rd Qu.:19.48   3rd Qu.:0.4900   3rd Qu.:3.300   3rd Qu.: 9.175  
 Max.   :41.20   Max.   :0.7360   Max.   :7.800   Max.   :19.900  
                                                                  
   x3ppercent          x2p              x2pa          x2ppercent    
 Min.   :0.0000   Min.   : 0.000   Min.   : 0.000   Min.   :0.0000  
 1st Qu.:0.2792   1st Qu.: 3.025   1st Qu.: 6.500   1st Qu.:0.4617  
 Median :0.3330   Median : 4.800   Median : 9.400   Median :0.5165  
 Mean   :0.3165   Mean   : 5.069   Mean   : 9.828   Mean   :0.5092  
 3rd Qu.:0.3767   3rd Qu.: 6.700   3rd Qu.:12.500   3rd Qu.:0.5710  
 Max.   :1.0000   Max.   :18.700   Max.   :27.500   Max.   :1.0000  
                                                    NA's   :2       
       ft              fta           ftpercent           orb        
 Min.   : 0.000   Min.   : 0.200   Min.   :0.0000   Min.   : 0.000  
 1st Qu.: 1.700   1st Qu.: 2.325   1st Qu.:0.6893   1st Qu.: 0.900  
 Median : 2.700   Median : 3.650   Median :0.7730   Median : 1.600  
 Mean   : 3.153   Mean   : 4.170   Mean   :0.7570   Mean   : 2.271  
 3rd Qu.: 4.000   3rd Qu.: 5.300   3rd Qu.:0.8488   3rd Qu.: 3.200  
 Max.   :14.200   Max.   :17.400   Max.   :1.0000   Max.   :13.700  
                                                                    
      drb              trb              ast              stl      
 Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   :0.00  
 1st Qu.: 4.800   1st Qu.: 5.900   1st Qu.: 2.500   1st Qu.:1.10  
 Median : 6.000   Median : 7.700   Median : 3.700   Median :1.40  
 Mean   : 6.666   Mean   : 8.938   Mean   : 4.555   Mean   :1.54  
 3rd Qu.: 8.100   3rd Qu.:11.375   3rd Qu.: 6.175   3rd Qu.:1.90  
 Max.   :19.200   Max.   :33.000   Max.   :16.100   Max.   :9.00  
                                                                  
      blk              tov               pf              pts       
 Min.   :0.0000   Min.   : 0.000   Min.   : 0.000   Min.   : 2.60  
 1st Qu.:0.4000   1st Qu.: 1.700   1st Qu.: 3.300   1st Qu.:15.50  
 Median :0.8000   Median : 2.300   Median : 4.050   Median :19.40  
 Mean   :0.9469   Mean   : 2.538   Mean   : 4.343   Mean   :20.16  
 3rd Qu.:1.2000   3rd Qu.: 3.100   3rd Qu.: 5.100   3rd Qu.:23.90  
 Max.   :6.0000   Max.   :13.700   Max.   :13.700   Max.   :49.90  
                                                                   
      ortg            drtg      
 Min.   : 17.0   Min.   : 87.0  
 1st Qu.:103.0   1st Qu.:110.0  
 Median :111.0   Median :113.0  
 Mean   :109.8   Mean   :112.5  
 3rd Qu.:117.0   3rd Qu.:115.0  
 Max.   :187.0   Max.   :122.0  
                                
# Scatter plot between FG% and PTS
ggplot(nba_clean, aes(x = fgpercent, y = pts)) +
  geom_point(alpha = 0.6, color = "purple") +
  labs(title = "FG% vs Points per Game",
       x = "Field Goal Percentage",
       y = "Points per Game")

#4 Regression Modeling

full_model <- lm(pts ~ fgpercent + x3ppercent + ftpercent + trb + ast + stl + blk + tov + age + mp, data = nba_clean)
summary(full_model)

Call:
lm(formula = pts ~ fgpercent + x3ppercent + ftpercent + trb + 
    ast + stl + blk + tov + age + mp, data = nba_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.3975  -3.4987  -0.4391   3.0038  21.3060 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.1637388  1.9357454  -2.151  0.03184 *  
fgpercent   25.6256286  2.5937110   9.880  < 2e-16 ***
x3ppercent   5.4202553  1.7795005   3.046  0.00241 ** 
ftpercent    9.9400011  1.4103334   7.048 4.57e-12 ***
trb         -0.0419243  0.0679909  -0.617  0.53770    
ast         -0.0234253  0.0912092  -0.257  0.79739    
stl         -0.7448954  0.2501306  -2.978  0.00301 ** 
blk          0.4261241  0.3019228   1.411  0.15861    
tov          2.1258232  0.1912072  11.118  < 2e-16 ***
age         -0.0905922  0.0483618  -1.873  0.06148 .  
mp           0.0020250  0.0002815   7.195 1.70e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.06 on 663 degrees of freedom
Multiple R-squared:  0.4601,    Adjusted R-squared:  0.4519 
F-statistic:  56.5 on 10 and 663 DF,  p-value: < 2.2e-16
step_model <- step(full_model, direction = "backward")
Start:  AIC=2196.6
pts ~ fgpercent + x3ppercent + ftpercent + trb + ast + stl + 
    blk + tov + age + mp

             Df Sum of Sq   RSS    AIC
- ast         1       1.7 16979 2194.7
- trb         1       9.7 16987 2195.0
<none>                    16978 2196.6
- blk         1      51.0 17029 2196.6
- age         1      89.9 17067 2198.2
- stl         1     227.1 17205 2203.6
- x3ppercent  1     237.6 17215 2204.0
- ftpercent   1    1272.0 18250 2243.3
- mp          1    1325.5 18303 2245.3
- fgpercent   1    2499.6 19477 2287.2
- tov         1    3165.2 20143 2309.8

Step:  AIC=2194.67
pts ~ fgpercent + x3ppercent + ftpercent + trb + stl + blk + 
    tov + age + mp

             Df Sum of Sq   RSS    AIC
- trb         1       8.2 16987 2193.0
<none>                    16979 2194.7
- blk         1      54.2 17033 2194.8
- age         1      97.3 17077 2196.5
- stl         1     235.8 17215 2202.0
- x3ppercent  1     246.6 17226 2202.4
- ftpercent   1    1274.3 18254 2241.4
- mp          1    1360.1 18339 2244.6
- fgpercent   1    2499.4 19479 2285.2
- tov         1    4250.7 21230 2343.3

Step:  AIC=2193
pts ~ fgpercent + x3ppercent + ftpercent + stl + blk + tov + 
    age + mp

             Df Sum of Sq   RSS    AIC
- blk         1      46.1 17034 2192.8
<none>                    16987 2193.0
- age         1     100.2 17088 2195.0
- stl         1     228.6 17216 2200.0
- x3ppercent  1     300.1 17288 2202.8
- ftpercent   1    1349.5 18337 2242.5
- mp          1    1375.5 18363 2243.5
- fgpercent   1    2706.2 19694 2290.6
- tov         1    4488.8 21476 2349.0

Step:  AIC=2192.82
pts ~ fgpercent + x3ppercent + ftpercent + stl + tov + age + 
    mp

             Df Sum of Sq   RSS    AIC
<none>                    17034 2192.8
- age         1     108.3 17142 2195.1
- stl         1     211.2 17245 2199.1
- x3ppercent  1     267.5 17301 2201.3
- ftpercent   1    1317.9 18351 2241.1
- mp          1    1340.7 18374 2241.9
- fgpercent   1    3530.6 20564 2317.8
- tov         1    4452.1 21486 2347.3
summary(step_model)

Call:
lm(formula = pts ~ fgpercent + x3ppercent + ftpercent + stl + 
    tov + age + mp, data = nba_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.7093  -3.4095  -0.5074   2.9910  21.1214 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.2329902  1.9045511  -2.223  0.02658 *  
fgpercent   26.3862303  2.2457870  11.749  < 2e-16 ***
x3ppercent   5.3577804  1.6566081   3.234  0.00128 ** 
ftpercent    9.9319982  1.3836097   7.178 1.89e-12 ***
stl         -0.7039262  0.2449417  -2.874  0.00418 ** 
tov          2.0631389  0.1563717  13.194  < 2e-16 ***
age         -0.0976222  0.0474454  -2.058  0.04002 *  
mp           0.0019848  0.0002741   7.240 1.24e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.057 on 666 degrees of freedom
Multiple R-squared:  0.4583,    Adjusted R-squared:  0.4526 
F-statistic:  80.5 on 7 and 666 DF,  p-value: < 2.2e-16

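Based on the final model, I can conclude that turnovers, field goal percentage, minutes played, and free throw percentage are the strongest predictors of points, since they have the largest t values (13.2, 11.7, 7.2, and 7.2). Field goal percentage has a coefficient of about 26.4, and because it is measured on a 0 to 1 scale, a 0.10 increase in FG% is associated with roughly 2.6 more points, holding the other variables constant. Free throw percentage (9.9) and 3-point percentage (5.4) are also positive, which shows that scoring efficiency matters. Minutes played has a small coefficient (about 0.002) only because it is measured in total minutes; it is still highly significant, so more playing time goes with more points. Turnovers have a positive coefficient (about 2.1), not a negative one, which most likely reflects that players who handle the ball more both score more and turn it over more, rather than turnovers helping scoring. Steals and age have small negative coefficients. Assists, rebounds, and blocks were dropped by backward elimination, so they did not add predictive value once the other variables were in the model. The adjusted R-squared is 0.4526, so the model explains about 45% of the variation in points.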
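
Because the predictors are on very different scales (shooting percentages run from 0 to 1, while minutes run into the thousands), the raw coefficients cannot be compared by size. One common way to compare them is to standardize the predictors before fitting. The sketch below uses simulated stand-in data, not the NBA dataset, and the objects `sim`, `raw_fit`, and `std_fit` are hypothetical names chosen for illustration.

```r
set.seed(110)

# Simulated stand-in data (hypothetical; not the NBA dataset)
n <- 200
sim <- data.frame(
  fgpercent = runif(n, 0.30, 0.65),  # proportion on a 0-1 scale
  mp        = runif(n, 10, 2800)     # total minutes, a much larger scale
)
sim$pts <- 5 + 25 * sim$fgpercent + 0.002 * sim$mp + rnorm(n, sd = 2)

# Raw fit: coefficients are in "points per unit", and the units differ
raw_fit <- lm(pts ~ fgpercent + mp, data = sim)

# Standardized fit: each coefficient is "points per one standard deviation"
sim_std <- sim
sim_std$fgpercent <- as.numeric(scale(sim$fgpercent))
sim_std$mp        <- as.numeric(scale(sim$mp))
std_fit <- lm(pts ~ fgpercent + mp, data = sim_std)

coef(raw_fit)
coef(std_fit)
```

Each standardized coefficient equals the raw coefficient multiplied by that predictor's standard deviation, so the fit itself does not change, only the units the effects are reported in.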

#5 Final Visualization

# Creates a summary based on position
pos_summary <- nba_clean %>%
  group_by(pos) %>%
  summarize(avg_pts = mean(pts))

# Plots average points per game by position
ggplot(pos_summary, aes(x = pos, y = avg_pts, fill = pos)) +
  geom_col(color = "black") +
  scale_fill_brewer(palette = "Set3") +  # automatically sets color 
  labs(
    title = "Average Points per Game by Position (NBA 2021–22)",
    x = "Position",
    y = "Average Points",
    caption = "Source: NBA Player Stats Dataset (Kaggle)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

#6 Conclusion

Process:

I prepared the data by removing players with very low minutes, since their averages come from too little playing time to be meaningful. I also dropped rows with missing values in points and the shooting percentages, and converted the categorical variables to factors. The regression analysis revealed that minutes played, field goal percentage, and three-point accuracy are significant predictors of scoring, along with free throw percentage and turnovers. This matches common knowledge around the basketball world: players who play more are likely to score more, and shooting efficiency is extremely impactful to offensive success. Backward elimination by AIC removed assists, rebounds, and blocks, which left a simpler model with seven predictors and nearly the same adjusted R-squared (0.4526 versus 0.4519), although this by itself does not guarantee the model avoids overfitting.
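
One way to check for overfitting more directly is to fit the model on part of the data and measure its error on the rest. The sketch below is a minimal example of that idea on simulated data, not the NBA dataset. The variables `x1`, `x2`, and `noise` and the helper `rmse()` are hypothetical and only there for illustration.

```r
set.seed(110)

# Simulated stand-in data (hypothetical; not the NBA dataset)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- 3 + 2 * sim$x1 - 1 * sim$x2 + rnorm(n)

# Hold out 20% of the rows for testing
train_idx <- sample(seq_len(n), size = round(0.8 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit on the training rows only
fit <- lm(y ~ x1 + x2 + noise, data = train)

# Root mean squared error: typical size of a prediction miss
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

train_rmse <- rmse(train$y, predict(fit, newdata = train))
test_rmse  <- rmse(test$y,  predict(fit, newdata = test))
c(train = train_rmse, test = test_rmse)
```

If the test error were much larger than the training error, that would be a sign of overfitting. With the real data, the same split could be applied to `nba_clean` before calling `lm()` and `step()`.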

Visualization Insight:

The bar chart compares average scoring by position. It is meant to show how scoring differs between guards, forwards, and centers, which reflects the roles they play and shot selection differences across positions. The colors come from the ColorBrewer Set3 palette and are meant to make the visualization easier to read and more professional.
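
The position counts in the summary output include hybrid labels such as `SF-SG`, and each of these gets its own bar in the chart. A possible optional step, sketched below on a hypothetical toy vector, is to keep only the first listed position so the chart has fewer bars. The name `primary_pos` is made up for illustration.

```r
# Hypothetical toy vector of position labels
pos <- c("SG", "SF-SG", "C", "PF")

# Drop everything from the first hyphen onward, keeping the first listed position
primary_pos <- sub("-.*$", "", pos)
primary_pos
```

On the real data this could be done inside `mutate()` before `group_by(pos)`, with the trade-off that hybrid players are counted under only one of their positions.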

Limitations:

The only fault of this analysis I can think of is that it only considers one season and does not take into account whether a team was tanking or its pace of play. A future visualization could include defensive metrics, since scoring isn’t the only stat that can highlight a player’s importance.