Project 1

Author

Thejitha Rajapakshe

NBA Player Performance Analysis for the 2021-2022 Season

Introduction

This nba-player-stats-2021 data set contains statistics about 812 player-team stints during the 2021-2022 NBA regular season. Below are the variables used in the data set:

Variable	Description
player	Name of player
pos	Player’s designated position
age	Player’s age on February 1st of the season
tm	Name of team
g	Number of games
gs	Number of games started
mp	Number of minutes played
fg	Field goals per 100 team possessions
fga	Field goal attempts per 100 team possessions
fgpercent	Field goal percentage
x3p	3 point field goals per 100 team possessions
x3pa	3 point field goal attempts per 100 team possessions
x3ppercent	3 point field goal percentage
x2p	2 point field goals per 100 team possessions
x2pa	2 point field goal attempts per 100 team possessions
x2ppercent	2 point field goal percentage
ft	Free throws per 100 team possessions
fta	Free throw attempts per 100 team possessions
ftpercent	Free throw percentage
orb	Offensive rebounds per 100 team possessions
drb	Defensive rebounds per 100 team possessions
trb	Total rebounds per 100 team possessions
ast	Assists per 100 team possessions
stl	Steals per 100 team possessions
blk	Blocks per 100 team possessions
tov	Turnovers per 100 team possessions
pf	Personal fouls per 100 team possessions
pts	Points per 100 team possessions
ortg	Offensive Rating - an estimate of points produced per 100 possessions scale
drtg	Defensive Rating - an estimate of points allowed per 100 possessions scale

Source: https://data.scorenetwork.org/basketball/nba-player-stats.html.

Elmore R (2020). ballr: Access to Current and Historical Basketball Data. R package version 0.2.6.

I will explore the data set above with a linear regression, a tree map, and a box plot.

For the box plot analysis, I will explore shooting efficiency and its variability across player positions in the NBA data set. By examining the distribution of field goal percentage for guards, forwards, and centers, this analysis looks at the offensive contributions and playing styles characteristic of each position category.

For the tree map analysis, I will show the distribution of playing time across player positions. Comparing playing time among guards, forwards, and centers offers insight into how teams use each position and serves as a foundation for understanding lineup composition and player development.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(RColorBrewer)
library(ggplot2)
library(ggfortify)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(treemap)

#Extracting the data from the csv file.
setwd("/Users/thejitharajapakshe/Desktop/DATA 110")
nba <- read_csv("nba-player-stats-2021.csv")

Rows: 812 Columns: 30
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): player, pos, tm
dbl (27): age, g, gs, mp, fg, fga, fgpercent, x3p, x3pa, x3ppercent, x2p, x2...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Linear Regression

positions_default <- c("PG", "PF", "C", "SF", "SG") # Keep only the five standard positions; the extra position labels apply to very few players

nba_nona <- nba |>
  filter(!is.na(pts) & !is.na(x3ppercent))  # Remove rows with NA in pts or x3ppercent

nbaclean <- nba_nona |>
  filter(pos %in% positions_default) # Keep only rows whose pos is one of the five standard positions

ggplot(nbaclean, aes(x = pts, y = x3ppercent, color = pos)) + # Scatter plot with pts on the x axis and 3-point percentage on the y axis
  geom_point(na.rm = TRUE) + # Silently drop any points with NAs
  geom_smooth(method = 'lm', se = FALSE, formula = y ~ x, linewidth = 0.7, color = "black") + # One overall regression line; linewidth replaces size, which is deprecated for lines since ggplot2 3.4.0
  scale_color_brewer(palette = "Accent", name = "Position") + # Preset RColorBrewer palette for the points, with the legend titled "Position"
  labs(x = "Points per 100 Team Possessions", y = "3-Point Percentage", title = "3-Point Percentage vs. Points per 100 Team Possessions", caption = "Source: Elmore R (2020). ballr") + # Axis labels, title, and caption
  theme_minimal()

Above, I create a scatter plot of the two quantitative variables pts and x3ppercent, with points colored by pos, and add a single regression line fitted to all points.

Calculating the Correlation Coefficient & Model Summary

cor(nbaclean$pts, nbaclean$x3ppercent) # function to find the correlation coefficient.

[1] 0.3274789

model1 <- lm(x3ppercent ~ pts, data = nbaclean) #function to obtain summary of pvalues and adjusted R^2 values
summary(model1)


Call:
lm(formula = x3ppercent ~ pts, data = nbaclean)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.48670 -0.04838  0.01641  0.06541  0.71480 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.1874521  0.0133090  14.085   <2e-16 ***
pts         0.0059969  0.0006435   9.319   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1317 on 723 degrees of freedom
Multiple R-squared:  0.1072,    Adjusted R-squared:  0.106 
F-statistic: 86.85 on 1 and 723 DF,  p-value: < 2.2e-16

x3ppercent = 0.005997(pts) + 0.187 #coefficient and intercept found above

For each additional point per 100 team possessions, the predicted 3-point percentage increases by about 0.006 (roughly 0.6 percentage points, since x3ppercent is stored as a proportion).
The p-values for the intercept and for pts are both below 2e-16 and carry three asterisks, so the positive association between pts and x3ppercent is highly statistically significant. Significance alone does not show that three-point shooting causes the higher scoring.
Adjusted R-squared (R^2): about 10.6% of the variation in 3-point percentage is explained by the model, which means the remaining 89.4% is not explained by pts.
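
As a quick check on the numbers above: with a single predictor, the multiple R-squared equals the squared correlation coefficient, and 0.3274789^2 is about 0.1072, which matches the model summary. Plugging pts = 20 into the fitted equation gives 0.187 + 0.006 × 20, or roughly 0.307. The sketch below illustrates both ideas on simulated data (not the NBA data set); the variable names x, y, and fit are placeholders chosen for this example.

```r
# Simulated data, for illustration only (not the NBA data set)
set.seed(1)
x <- rnorm(200)
y <- 0.2 + 0.5 * x + rnorm(200)

fit <- lm(y ~ x)

r  <- cor(x, y)                # correlation coefficient
r2 <- summary(fit)$r.squared   # multiple R-squared reported by the model

# With one predictor these agree up to floating-point error
all.equal(r^2, r2)

# A prediction is the intercept plus the slope times the predictor value
b <- coef(fit)
pred_manual  <- unname(b[1] + b[2] * 1.5)
pred_builtin <- unname(predict(fit, newdata = data.frame(x = 1.5)))
all.equal(pred_manual, pred_builtin)
```

The same two checks could be run on nbaclean by comparing cor(nbaclean$pts, nbaclean$x3ppercent)^2 with summary(model1)$r.squared.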

Tree Map

treemap(nbaclean, index = "pos", vSize = "mp", title = "Total Minutes Played by Player Position") #tree map function

Box Plot

ggplot(nbaclean, aes(x = pos, y = fgpercent, fill = pos)) + ##Creating the axis while setting the x axis as pos and y as fgpercent.
  geom_boxplot() +
  scale_fill_brewer(palette = "Spectral", name = "Position") +
  labs(title = "Field Goal Percentage by Player Position",#Labeling Axis and caption
       x = "Player Position",
       y = "Field Goal Percentage",
       caption = "Source: Elmore R (2020). ballr") +
  theme_minimal()

a. How you cleaned the dataset up (be detailed and specific, using proper terminology where appropriate).

nba_nona <- nba |> filter(!is.na(pts) & !is.na(x3ppercent))

nbaclean <- nba_nona |> filter(pos %in% positions_default)

The first statement removes every row that has an NA in the pts or x3ppercent column.

The second statement keeps only the rows whose pos is one of my five chosen positions (“PG”, “PF”, “C”, “SF”, “SG”) and omits every other position.
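
To make the effect of the two filters concrete, here is a minimal sketch on a small made-up tibble. The player names, the values, and the combined position label "SG-SF" are hypothetical stand-ins, not rows from the real data set.

```r
library(dplyr)

# Toy stand-in for the nba tibble (all values are made up for illustration)
toy <- tibble(
  player     = c("A", "B", "C", "D", "E"),
  pos        = c("PG", "SG-SF", "C", "PF", "SF"),
  pts        = c(20.1, 15.3, NA, 18.2, 12.0),
  x3ppercent = c(0.35, 0.40, 0.10, NA, 0.33)
)

positions_default <- c("PG", "PF", "C", "SF", "SG")

# Step 1: drop rows with a missing value in either column
toy_nona <- toy |>
  filter(!is.na(pts) & !is.na(x3ppercent))

# Step 2: keep only the five standard positions
toy_clean <- toy_nona |>
  filter(pos %in% positions_default)

# Rows remaining after each step
c(original = nrow(toy), no_na = nrow(toy_nona), clean = nrow(toy_clean))
```

Printing the row counts after each step is a simple way to confirm how many observations each filter removes.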

b. What the visualization represents, any interesting patterns or surprises that arise within the visualization.

Tree Map

The treemap represents the distribution of playing time across player positions in the NBA data set. Each rectangle is one position category, and its area is proportional to the total minutes played by players at that position.
Because all rectangles share one space, playing time is easy to compare across the different roles on the basketball court.
Larger rectangles indicate positions where players collectively spent more time on the court, while smaller rectangles represent positions with fewer minutes played.
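
The rectangle sizes can also be read as a table. As I understand it, treemap() sums the vSize column within each index group by default, so the sketch below computes the same totals with dplyr on made-up data. The tibble toy and the column names total_mp and share are placeholders for this example.

```r
library(dplyr)

# Toy stand-in for nbaclean (made-up minutes, for illustration only)
toy <- tibble(
  pos = c("PG", "PG", "SG", "C", "C", "C"),
  mp  = c(1200, 800, 1500, 600, 900, 300)
)

# Total minutes per position, plus each position's share of all minutes
minutes_by_pos <- toy |>
  group_by(pos) |>
  summarise(total_mp = sum(mp), .groups = "drop") |>
  mutate(share = total_mp / sum(total_mp)) |>
  arrange(desc(total_mp))

minutes_by_pos
```

A table like this gives exact totals to go with the visual comparison of rectangle areas.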

Boxplot

The box plot compares field goal percentage (fgpercent) across player positions (pos).
Each box represents the distribution of field goal percentage among players at one position, with the central line marking the median. The box (bottom line to top line) spans the interquartile range (IQR), the middle 50% of the data, while the whiskers extend to the most extreme values that lie within 1.5 times the IQR of the box edges. Individual points beyond the whiskers are outliers.
This visualization shows how shooting efficiency varies among player positions.
By comparing the spread of field goal percentage across positions, the box plot shows how consistent or varied shooting efficiency is within each position.
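
The pieces of a box can be computed by hand. The sketch below uses a made-up vector of field goal percentages for a single position and follows the quartile-based rule described above; the values and variable names are illustrative only.

```r
# Made-up field goal percentages for one position (illustration only)
fg <- c(0.38, 0.41, 0.43, 0.44, 0.45, 0.46, 0.47, 0.49, 0.52, 0.75)

q1  <- unname(quantile(fg, 0.25))  # bottom edge of the box
med <- median(fg)                  # central line
q3  <- unname(quantile(fg, 0.75))  # top edge of the box
iqr <- q3 - q1

# Fences: 1.5 * IQR beyond the box edges
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
whisker_low  <- min(fg[fg >= lower_fence])
whisker_high <- max(fg[fg <= upper_fence])

# Points outside the fences are drawn individually as outliers
outliers <- fg[fg < lower_fence | fg > upper_fence]
outliers
```

Here the value 0.75 falls above the upper fence, so it would appear as a single outlier point while the upper whisker stops at 0.52.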

c. Anything that you might have shown that you could not get to work or that you wished you could have included:

I wanted to use only the top 10 teams instead of all of them, but when I tried sorting the points in ascending order, the sorted values no longer lined up with the right players.
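
One possible way around this is to sort and filter whole rows rather than a single column, so that every value stays attached to its player. The sketch below is only an illustration on made-up data. It assumes "top" means the highest mean pts among a team's players, which is my own working definition, and the team codes, n_top, top_teams, and toy_top are hypothetical names.

```r
library(dplyr)

# Toy stand-in for nbaclean (made-up values, for illustration only)
toy <- tibble(
  player = c("A", "B", "C", "D", "E", "F"),
  tm     = c("AAA", "AAA", "BBB", "BBB", "CCC", "CCC"),
  pts    = c(30, 20, 25, 15, 10, 12)
)

n_top <- 2  # would be 10 for the full data set

# Rank teams by the mean pts of their players (an assumed definition of "top")
top_teams <- toy |>
  group_by(tm) |>
  summarise(mean_pts = mean(pts), .groups = "drop") |>
  slice_max(mean_pts, n = n_top)

# Keep whole rows for players on those teams, so each value stays with its player
toy_top <- toy |>
  filter(tm %in% top_teams$tm) |>
  arrange(desc(pts))

toy_top
```

Sorting with arrange() reorders entire rows, which avoids the mismatch that can happen when one column is sorted separately and assigned back.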