NBA Player Performance Analysis for the 2021-2022 Season
Introduction
This nba-player-stats-2021 data set contains statistics about 812 player-team stints during the 2021-2022 NBA regular season. Below are the variables used in the data set:
Variable | Description
--- | ---
player | Name of player
pos | Player’s designated position
age | Player’s age on February 1st of the season
tm | Name of team
g | Number of games
gs | Number of games started
mp | Number of minutes played
fg | Field goals per 100 team possessions
fga | Field goal attempts per 100 team possessions
fgpercent | Field goal percentage
x3p | 3-point field goals per 100 team possessions
x3pa | 3-point field goal attempts per 100 team possessions
x3ppercent | 3-point field goal percentage
x2p | 2-point field goals per 100 team possessions
x2pa | 2-point field goal attempts per 100 team possessions
x2ppercent | 2-point field goal percentage
ft | Free throws per 100 team possessions
fta | Free throw attempts per 100 team possessions
ftpercent | Free throw percentage
orb | Offensive rebounds per 100 team possessions
drb | Defensive rebounds per 100 team possessions
trb | Total rebounds per 100 team possessions
ast | Assists per 100 team possessions
stl | Steals per 100 team possessions
blk | Blocks per 100 team possessions
tov | Turnovers per 100 team possessions
pf | Personal fouls per 100 team possessions
pts | Points per 100 team possessions
ortg | Offensive Rating: an estimate of points produced per 100 possessions
drtg | Defensive Rating: an estimate of points allowed per 100 possessions
Elmore R (2020). ballr: Access to Current and Historical Basketball Data. R package version 0.2.6.
I will explore a linear regression, a tree map, and a box plot for the data set above.
For the box plot analysis, I will explore shooting efficiency and its variability across player positions in the NBA data set. By examining the distribution of field goal percentage for guards, forwards, and centers, I will compare the offensive contributions and playing styles characteristic of each position category.
For the tree map analysis, I will show how total playing time is distributed across player positions. Comparing the minutes played by guards, forwards, and centers offers insight into how teams use each position and provides a foundation for understanding lineup composition and player development.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(treemap)
# Extracting the data from the CSV file.
setwd("/Users/thejitharajapakshe/Desktop/DATA 110")
nba <- read_csv("nba-player-stats-2021.csv")
Rows: 812 Columns: 30
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): player, pos, tm
dbl (27): age, g, gs, mp, fg, fga, fgpercent, x3p, x3pa, x3ppercent, x2p, x2...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Linear Regression
positions_default <- c("PG", "PF", "C", "SF", "SG") # Ignore the extra positions, since very few rows have them
# Remove rows with NA in pts or x3ppercent
nba_nona <- nba |>
  filter(!is.na(pts) & !is.na(x3ppercent))
# Keep only the rows whose pos is one of the five default positions
nbaclean <- nba_nona |>
  filter(pos %in% positions_default)
# Scatter plot with pts on the x axis and 3-point percentage on the y axis
ggplot(nbaclean, aes(x = pts, y = x3ppercent, color = pos)) +
  geom_point(na.rm = TRUE) + # Drop any remaining NAs from the points without a warning
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x, linewidth = 0.7, color = "black") + # Fitted regression line; linewidth replaces the size argument deprecated in ggplot2 3.4.0
  scale_color_brewer(palette = "Accent", name = "Position") + # RColorBrewer preset for the point colors, with the legend titled "Position"
  labs(x = "Points per 100 Team Possessions",
       y = "3-Point Percentage",
       title = "Point Distribution Regards To 3-Point Percentage",
       caption = "Source: Elmore R (2020). ballr") + # Axis labels, title, and caption
  theme_minimal()
Above, I create a scatter plot of the two quantitative variables pts and x3ppercent, colored by position (pos), and add a fitted linear regression line at the same time.
Calculating the Correlation Coefficient & Model Summary
cor(nbaclean$pts, nbaclean$x3ppercent) # function to find the correlation coefficient.
[1] 0.3274789
model1 <- lm(x3ppercent ~ pts, data = nbaclean) # Fit the linear model
summary(model1) # Summary with the p-values and adjusted R^2 value
Call:
lm(formula = x3ppercent ~ pts, data = nbaclean)
Residuals:
Min 1Q Median 3Q Max
-0.48670 -0.04838 0.01641 0.06541 0.71480
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1874521 0.0133090 14.085 <2e-16 ***
pts 0.0059969 0.0006435 9.319 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1317 on 723 degrees of freedom
Multiple R-squared: 0.1072, Adjusted R-squared: 0.106
F-statistic: 86.85 on 1 and 723 DF, p-value: < 2.2e-16
x3ppercent = 0.0060(pts) + 0.187 # coefficient and intercept found above
For each additional point per 100 team possessions, the model predicts an increase in 3-point percentage of about 0.006 (0.6 percentage points).
The p-values for the intercept and for pts (both < 2e-16) are marked with three asterisks, which means they are significant at the 0.001 level. This indicates a statistically significant positive linear association between points and 3-point percentage; on its own it does not show that the increase in points is a result of the three-pointers.
Adjusted R-squared (R^2) value: about 10.6% of the variation in 3-point percentage is explained by the model, which means about 89.4% of the variation is not explained by the model.
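As a rough check on the interpretation above, the sketch below plugs the values printed by cor() and summary(model1) into base R. The input of 20 points per 100 possessions is an arbitrary illustrative value, and n = 725 is inferred from the 723 residual degrees of freedom plus the two estimated coefficients.

```r
# Values copied from the cor() and summary(model1) output above
intercept <- 0.1874521
slope     <- 0.0059969
r         <- 0.3274789
n         <- 723 + 2  # residual degrees of freedom + 2 estimated coefficients

# Predicted 3-point percentage at an illustrative 20 and 21 pts per 100 possessions
pred_20 <- intercept + slope * 20
pred_21 <- intercept + slope * 21
slope_check <- pred_21 - pred_20  # one extra point shifts the prediction by the slope

# In simple linear regression, R^2 equals the squared correlation coefficient
r_squared <- r^2
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - 2)

print(round(c(pred_20 = pred_20, r_squared = r_squared,
              adj_r_squared = adj_r_squared), 4))
```

The squared correlation reproduces the Multiple R-squared of 0.1072, and the adjustment gives roughly 0.106, in line with the summary output.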
Tree Map
treemap(nbaclean, index ="pos", vSize ="mp", title ="Total Minutes Played by Player Position") #tree map function
Box Plot
# Box plot with pos on the x axis and fgpercent on the y axis
ggplot(nbaclean, aes(x = pos, y = fgpercent, fill = pos)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Spectral", name = "Position") +
  labs(title = "Field Goal Percentage by Player Position", # Title, axis labels, and caption
       x = "Player Position",
       y = "Field Goal Percentage",
       caption = "Source: Elmore R (2020). ballr") +
  theme_minimal()
a. How you cleaned the dataset up (be detailed and specific, using proper terminology where appropriate).
nba_nona <- nba |> filter(!is.na(pts) & !is.na(x3ppercent))
nbaclean <- nba_nona |> filter(pos %in% positions_default)
The first statement removes every row that has an NA in the pts or x3ppercent column.
The second statement keeps only the rows whose position is one of “PG”, “PF”, “C”, “SF”, or “SG”, and omits everything else.
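To make the two cleaning steps concrete, here is a minimal sketch on a made-up five-row table. The tibble toy and its values are hypothetical and are not taken from the NBA file; only the two filter() calls mirror the ones used in this report.

```r
library(dplyr)

positions_default <- c("PG", "PF", "C", "SF", "SG")

# Hypothetical toy data: one non-default position and two rows with an NA
toy <- tibble(
  player     = c("A", "B", "C", "D", "E"),
  pos        = c("PG", "SG-SF", "C", "PF", "SF"),
  pts        = c(20.1, 15.3, NA, 12.0, 9.5),
  x3ppercent = c(0.35, 0.40, 0.10, NA, 0.33)
)

# Step 1: drop rows with an NA in pts or x3ppercent (removes players C and D)
toy_nona <- toy |> filter(!is.na(pts) & !is.na(x3ppercent))

# Step 2: keep only the five default positions (removes player B)
toy_clean <- toy_nona |> filter(pos %in% positions_default)

print(toy_clean)
```

Only players A and E survive both filters, which is the same kind of row-wise subsetting that turns nba into nbaclean.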
b. What the visualization represents, any interesting patterns or surprises that arise within the visualization.
Tree Map
The treemap visualization represents the distribution of player positions in the NBA data set, with each rectangle representing a specific position category and the size of the rectangle proportional to the total minutes played by players in that position.
This visualization offers an at-a-glance view of player positions, allowing for easy comparison of playing time across different roles on the basketball court.
Larger rectangles indicate positions where players collectively spent more time on the court, while smaller rectangles represent positions with fewer minutes played.
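The rectangle sizes described above correspond to minutes summed within each position. The sketch below reproduces that aggregation with dplyr on hypothetical toy values (the tibble toy is made up for illustration), assuming the tree map sizes each rectangle by the total of mp per pos.

```r
library(dplyr)

# Hypothetical toy data: minutes played for four player-team stints
toy <- tibble(
  pos = c("PG", "PG", "C", "SF"),
  mp  = c(1000, 500, 1500, 1000)
)

# Total minutes per position and each position's share of all minutes
minutes_by_pos <- toy |>
  group_by(pos) |>
  summarise(total_mp = sum(mp)) |>
  mutate(share = total_mp / sum(total_mp))

print(minutes_by_pos)
```

Running the same summary on nbaclean would give the numbers behind each rectangle, which are hard to read off the tree map itself.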
Boxplot
The box plot visualization presents a comparison of field goal percentage (fgpercent) across different player positions (pos).
Each box in the plot represents the distribution of field goal percentage for players within a specific position category, with the central line indicating the median value. The box (top line to bottom line) represents the interquartile range (IQR), depicting the spread of the middle half of the data around the median, while the whiskers extend to the largest and smallest values that lie within 1.5 times the IQR of the box edges. The individual points beyond the whiskers are the outliers.
This visualization indicates how shooting efficiency varies among player positions.
By visually comparing the spread of field goal percentage across different positions, the box plot provides insight into how scoring efficiency differs from one position to another.
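The box, whisker, and outlier rules described above can be worked through by hand. This is a minimal base R sketch on a hypothetical vector of field goal percentages (the values in x are made up), assuming the quartiles come from the default quantile() definition.

```r
# Hypothetical field goal percentages for one position
x <- c(0.30, 0.40, 0.42, 0.45, 0.47, 0.50, 0.55, 0.90)

# Box edges: first and third quartiles
q <- unname(quantile(x, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences sit 1.5 * IQR beyond the box edges
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Whiskers end at the most extreme values still inside the fences
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

# Anything outside the fences is drawn as an individual outlier point
outliers <- x[x < lower_fence | x > upper_fence]

print(list(quartiles = q, whiskers = whiskers, outliers = outliers))
```

In this toy vector only 0.90 falls outside the upper fence, so it would appear as a single point above the upper whisker.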
c. Anything that you might have shown that you could not get to work or that you wished you could have included:
I wanted to use only the top 10 teams instead of all of them, but when I tried sorting the points in ascending order, the sorted values were no longer matched to the right player.
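One possible way around this is to sort whole rows rather than a single column, so each value stays attached to its player. The sketch below is only an illustration under stated assumptions: the tibble toy is hypothetical, "top" teams are ranked by the mean of their players' pts (one of several reasonable definitions), and n = 2 stands in for the 10 teams wanted on the real data.

```r
library(dplyr)

# Hypothetical toy data: six players on three made-up teams
toy <- tibble(
  player = c("P1", "P2", "P3", "P4", "P5", "P6"),
  tm     = c("AAA", "AAA", "BBB", "BBB", "CCC", "CCC"),
  pts    = c(30, 20, 25, 15, 10, 12)
)

# Rank teams by mean player pts and keep the top n (use n = 10 on the real data)
top_teams <- toy |>
  group_by(tm) |>
  summarise(mean_pts = mean(pts, na.rm = TRUE)) |>
  slice_max(mean_pts, n = 2)

# Keep only players on those teams, then sort entire rows by pts
toy_top <- toy |>
  semi_join(top_teams, by = "tm") |>
  arrange(desc(pts))

print(toy_top)
```

Because arrange() reorders complete rows, the player, tm, and pts columns move together and cannot fall out of alignment.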