Descrption of the
Data
Here is what the author on Kaggle had to say about the
2023-24 NBA Player Stats Dataset: “This dataset
contains [2023-2024] regular season NBA player stats per game. Note that
there are duplicate player names resulted from team changes.”
A detailed description of the variables within the dataset is given
below:
Rk: Rank, (player identification number)
[integer]
Player: Player’s name [character]
Pos: Position [categorical]
Age: Player’s age [numerical]
Tm: Team [categorical]
G: Games played [numerical]
GS: Games started [numerical]
MP: Minutes played per game [numerical]
FG: Field goals per game [numerical]
FGA: Field goal attempts per game
[numerical]
FG%: Field goal percentage [numerical]
3P: 3-point field goals per game
[numerical]
3PA: 3-point field goal attempts per game
[numerical]
3P%: 3-point field goal percentage
[numerical]
2P: 2-point field goals per game
[numerical]
2PA: 2-point field goal attempts per game
[numerical]
2P%: 2-point field goal percentage
[numerical]
eFG%: Effective field goal percentage
[numerical]
FT: Free throws per game [numerical]
FTA: Free throw attempts per game
[numerical]
FT%: Free throw percentage [numerical]
ORB: Offensive rebounds per game
[numerical]
DRB: Defensive rebounds per game
[numerical]
TRB: Total rebounds per game [numerical]
AST: Assists per game [numerical]
STL: Steals per game [numerical]
BLK: Blocks per game [numerical]
TOV: Turnovers per game [numerical]
PF: Personal fouls per game [numerical]
PTS: Points per game [numerical]
Feature
Engineering
Here are some additional feature engineering that is needed for
ananysis found later in the report:
Team
Reconfiguration
In the instance where a player played for multiple teams in the
2023-24 season, that player has multiple observations in the dataset,
one for each team they played on. The player also has an additional
observation with the overall season stats for both teams. This in
indicated when Tm equals “TOT”, which stand for total.
To avoid duplicate results, any players who played for multiple teams
have their non-total observations removed. Also, their Tm
value will be updated to include both team names
For Example If Player A played for PHI and CHI,
their new Tm value will be PHI-CHI
nba.data.non.duplicates <- nba.data %>%
group_by(Player) %>%
filter(n() == 1)
nba.data.duplicates <- nba.data %>%
group_by(Player) %>%
filter(n() > 1)
nba.data.duplictaes.updated <- nba.data.duplicates %>%
mutate(
Tm = ifelse(
Tm == "TOT",
paste(setdiff(unique(Tm), "TOT"), collapse = "-"),
Tm
)
) %>%
filter(str_detect(Tm, "-"))
nba.data.filtered <- bind_rows(nba.data.non.duplicates, nba.data.duplictaes.updated)
Variable
Renaming
All percentage variables will be renamed with a -P at the end of
their variable name instead of an -% for more consistent processing.
Also, all numbers that begin with a number will now have an “X” in front
of its name. The table below reflects this change:
nba.data.filtered <- nba.data.filtered %>%
rename(
FGP = FG.,
X3PP = X3P.,
X2PP = X2P.,
eFGP = eFG.,
FTP= FT.
)
perc.vars <- data.frame(
org.name = c("FG%", "3P%", "2P%", "eFG%", "FT%", "3P", "3PA", "2P", "2PA"),
new.name = c("FGP", "X3PP", "X2PP", "eFGP", "FTP", "X3P", "X3PA", "X2P", "X2PA")
)
kable(perc.vars,
col.names = c("Original Name", "New Name"),
caption = "Renaming of Percentage Variables") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
Renaming of Percentage Variables
|
Original Name
|
New Name
|
|
FG%
|
FGP
|
|
3P%
|
X3PP
|
|
2P%
|
X2PP
|
|
eFG%
|
eFGP
|
|
FT%
|
FTP
|
|
3P
|
X3P
|
|
3PA
|
X3PA
|
|
2P
|
X2P
|
|
2PA
|
X2PA
|
Handling Missing
Values
The next step is to determine if there are any missing variables.
Missing variables can cause biased models to be generated, ultimately
causing decisions that may be incorrect.
Another issue with missing values is how to replace them. There are
many ways to replace missing values, including mean and median
imputation, and model-based imputation.
However, I will first determine if there are any missing values
within the dataset before exploring replacement techniques.
missing.values.summary <- data.frame(
"Variable Name" = names(nba.data.filtered),
"Missing Values" = colSums(is.na(nba.data.filtered))
)
kable(missing.values.summary,
col.names = c("Variable Name", "Missing Values"),
caption = "Summary of Missing Values by Variable") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
Summary of Missing Values by Variable
|
|
Variable Name
|
Missing Values
|
|
Rk
|
Rk
|
0
|
|
Player
|
Player
|
0
|
|
Pos
|
Pos
|
0
|
|
Age
|
Age
|
0
|
|
Tm
|
Tm
|
0
|
|
G
|
G
|
0
|
|
GS
|
GS
|
0
|
|
MP
|
MP
|
0
|
|
FG
|
FG
|
0
|
|
FGA
|
FGA
|
0
|
|
FGP
|
FGP
|
0
|
|
X3P
|
X3P
|
0
|
|
X3PA
|
X3PA
|
0
|
|
X3PP
|
X3PP
|
0
|
|
X2P
|
X2P
|
0
|
|
X2PA
|
X2PA
|
0
|
|
X2PP
|
X2PP
|
0
|
|
eFGP
|
eFGP
|
0
|
|
FT
|
FT
|
0
|
|
FTA
|
FTA
|
0
|
|
FTP
|
FTP
|
0
|
|
ORB
|
ORB
|
0
|
|
DRB
|
DRB
|
0
|
|
TRB
|
TRB
|
0
|
|
AST
|
AST
|
0
|
|
STL
|
STL
|
0
|
|
BLK
|
BLK
|
0
|
|
TOV
|
TOV
|
0
|
|
PF
|
PF
|
0
|
|
PTS
|
PTS
|
0
|
The above summary table shows that 0 of the
variables have missing values. Therefore, no imputations are necessary
for the dataset.
Final Dataset
The final dataset has 735 observations each with
30 variables. All 30 variables will be held within the
dataset for any further research, regardless of their use in this
research.
Below is a quick summary of all numeric variables in the final
dataset:
num.data <- nba.data.filtered[, sapply(nba.data.filtered, is.numeric)]
num.sum <- data.frame(
Variable = names(num.data),
Min = sapply(num.data, function(x) min(x, na.rm = TRUE)),
Mean = sapply(num.data, function(x) mean(x, na.rm = TRUE)),
Median = sapply(num.data, function(x) median(x, na.rm = TRUE)),
Max = sapply(num.data, function(x) max(x, na.rm = TRUE))
)
kable(num.sum,
col.names = c("Variable", "Min", "Mean", "Median", "Max"),
caption = "Summary of Numeric Variables") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
Summary of Numeric Variables
|
|
Variable
|
Min
|
Mean
|
Median
|
Max
|
|
Rk
|
Rk
|
1.0
|
286.5000000
|
286.5000
|
572.000
|
|
Age
|
Age
|
19.0
|
25.7430070
|
25.0000
|
39.000
|
|
G
|
G
|
1.0
|
46.1538462
|
51.0000
|
84.000
|
|
GS
|
GS
|
0.0
|
21.5034965
|
7.0000
|
82.000
|
|
MP
|
MP
|
0.5
|
18.6506993
|
17.3500
|
37.800
|
|
FG
|
FG
|
0.0
|
3.1202797
|
2.4000
|
11.500
|
|
FGA
|
FGA
|
0.0
|
6.6833916
|
5.1000
|
23.600
|
|
FGP
|
FGP
|
0.0
|
0.4495839
|
0.4520
|
0.747
|
|
X3P
|
X3P
|
0.0
|
0.9409091
|
0.7000
|
4.800
|
|
X3PA
|
X3PA
|
0.0
|
2.6463287
|
2.1000
|
11.800
|
|
X3PP
|
X3PP
|
0.0
|
0.2994808
|
0.3390
|
1.000
|
|
X2P
|
X2P
|
0.0
|
2.1774476
|
1.6000
|
11.000
|
|
X2PA
|
X2PA
|
0.0
|
4.0363636
|
3.0000
|
18.300
|
|
X2PP
|
X2PP
|
0.0
|
0.5196748
|
0.5340
|
1.000
|
|
eFGP
|
eFGP
|
0.0
|
0.5164983
|
0.5330
|
0.917
|
|
FT
|
FT
|
0.0
|
1.2442308
|
0.8000
|
10.200
|
|
FTA
|
FTA
|
0.0
|
1.6013986
|
1.0000
|
11.600
|
|
FTP
|
FTP
|
0.0
|
0.7013951
|
0.7605
|
1.000
|
|
ORB
|
ORB
|
0.0
|
0.8576923
|
0.7000
|
4.600
|
|
DRB
|
DRB
|
0.0
|
2.5229021
|
2.2000
|
10.100
|
|
TRB
|
TRB
|
0.0
|
3.3730769
|
3.0000
|
13.700
|
|
AST
|
AST
|
0.0
|
2.0013986
|
1.3000
|
10.900
|
|
STL
|
STL
|
0.0
|
0.5912587
|
0.5500
|
2.100
|
|
BLK
|
BLK
|
0.0
|
0.4027972
|
0.3000
|
3.600
|
|
TOV
|
TOV
|
0.0
|
0.9846154
|
0.7000
|
4.400
|
|
PF
|
PF
|
0.0
|
1.4909091
|
1.5000
|
3.600
|
|
PTS
|
PTS
|
0.0
|
8.4232517
|
6.4000
|
34.700
|
Below is a quick summary of all categorical variables in the final
dataset:
cat.data <- nba.data.filtered[, sapply(nba.data.filtered, is.character)]
cat.sum <- lapply(names(cat.data), function(var) {
counts <- table(cat.data[[var]], useNA = "no")
sorted.counts <- sort(counts, decreasing = TRUE)
top.value <- if (length(sorted.counts) > 0) names(sorted.counts)[1] else NA
top.count <- if (length(sorted.counts) > 0) as.numeric(sorted.counts[1]) else 0
data.frame(
Variable = var,
`Top Value` = top.value,
`Top Count` = top.count,
`Unique Values` = length(counts)
)
}) %>%
bind_rows()
kable(cat.sum,
col.names = c("Variable", "Top Value", "Top Count", "Unique Values"),
caption = "Summary of Categorical Variables with Top Value") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
Summary of Categorical Variables with Top Value
|
Variable
|
Top Value
|
Top Count
|
Unique Values
|
|
Player
|
A.J. Green
|
1
|
572
|
|
Pos
|
SG
|
131
|
12
|
|
Tm
|
MEM
|
22
|
92
|
Note A.J. Green is listed as the top value for
Player. This is because A.J. Green is the first player
alphabetically within the dataset, and each player only appears
once.
Note MEM is listed as the top value for
Tm. This only includes players who played for MEM ONLY.
Players who were traded within the season were not added to this count.
A list of what each abbreviation stand for can be found within the next
section.
Exploratory Data
Analysis
Next I will perform some Exploratpory
Data Analysis, or
EDA. EDA is an important step in the overarching data
analysis process, as it is important to familiarize yourself with the
dataset.
Distribution of
Position
Below is a pie chart of Pos:
nba.data.position.pie.chart <- nba.data.filtered %>%
separate_rows(Pos, sep = "-")
position <- nba.data.position.pie.chart %>%
group_by(Pos) %>%
summarise(count = n()) %>%
ungroup() %>%
mutate(perc = count / sum(count)) %>%
arrange(perc) %>%
mutate(labels = scales::percent(perc))
ggplot(position, aes(x = "", y = perc, fill = Pos)) +
geom_col() +
geom_text(aes(label = paste(count, "\n(", labels, ")", sep = "")),
position = position_stack(vjust = (0.5))) +
coord_polar(theta = "y") +
labs(title = "Pie Chart of Position", y = "", x = "") +
guides(fill = guide_legend(title = "Position"))

As shown in the pie chart above, the most common position in the
dataset is Shooting Guard (23.4%) and the least common position
is Center (16.8%).
Note Any player who is classified as having multiple
positions counts as both positions for this visual representation
only.
Distribution of
Age
Below is a histogram of Age:
hist(nba.data.filtered$Age,
breaks = seq(15, 40, 1),
main = "Distribution of Age",
xlim = c(15, 40),
xaxt = "n",
xlab = "Age (years)",
ylim = c(0, 100),
ylab = "Number of Players",
col = "red",
labels = TRUE)
abline(h = seq(10, 100, 10), col = "gray", lty = "dotted")
axis(1, at = seq(15, 50, 5))

The histogram above shows that Age has a right-skewed
distribution. The youngest players in the dataset are 18 years old, and
the oldest player is 39 years old.
Distribution of
Team
Below is a table of each team abbreviation un-abbreviated:
nba.teams <- data.frame(
abbreviation = c("ATL", "BOS", "BKN", "CHA", "CHI", "CLE", "DAL", "DEN", "DET", "GSW", "HOU", "IND", "LAC", "LAL", "MEM", "MIA", "MIL", "MIN", "NOP", "NYK", "OKC", "ORL", "PHI", "PHX", "POR", "SAC", "SAS", "TOR", "UTA", "WAS"),
full.name = c("Atlanta Hawks", "Boston Celtics", "Brooklyn Nets", "Charlotte Hornets", "Chicago Bulls", "Cleveland Cavaliers", "Dallas Mavericks", "Denver Nuggets", "Detroit Pistons", "Golden State Warriors", "Houston Rockets", "Indiana Pacers", "Los Angeles Clippers", "Los Angeles Lakers", "Memphis Grizzlies", "Miami Heat", "Milwaukee Bucks", "Minnesota Timberwolves", "New Orleans Pelicans", "New York Knicks", "Oklahoma City Thunder", "Orlando Magic", "Philadelphia 76ers", "Phoenix Suns", "Portland Trail Blazers", "Sacramento Kings", "San Antonio Spurs", "Toronto Raptors", "Utah Jazz", "Washington Wizards")
)
kable(nba.teams,
col.names = c("Team Abbreviation", "Full Name"),
caption = "List of NBA Teams and Their Abbreviations") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
List of NBA Teams and Their Abbreviations
|
Team Abbreviation
|
Full Name
|
|
ATL
|
Atlanta Hawks
|
|
BOS
|
Boston Celtics
|
|
BKN
|
Brooklyn Nets
|
|
CHA
|
Charlotte Hornets
|
|
CHI
|
Chicago Bulls
|
|
CLE
|
Cleveland Cavaliers
|
|
DAL
|
Dallas Mavericks
|
|
DEN
|
Denver Nuggets
|
|
DET
|
Detroit Pistons
|
|
GSW
|
Golden State Warriors
|
|
HOU
|
Houston Rockets
|
|
IND
|
Indiana Pacers
|
|
LAC
|
Los Angeles Clippers
|
|
LAL
|
Los Angeles Lakers
|
|
MEM
|
Memphis Grizzlies
|
|
MIA
|
Miami Heat
|
|
MIL
|
Milwaukee Bucks
|
|
MIN
|
Minnesota Timberwolves
|
|
NOP
|
New Orleans Pelicans
|
|
NYK
|
New York Knicks
|
|
OKC
|
Oklahoma City Thunder
|
|
ORL
|
Orlando Magic
|
|
PHI
|
Philadelphia 76ers
|
|
PHX
|
Phoenix Suns
|
|
POR
|
Portland Trail Blazers
|
|
SAC
|
Sacramento Kings
|
|
SAS
|
San Antonio Spurs
|
|
TOR
|
Toronto Raptors
|
|
UTA
|
Utah Jazz
|
|
WAS
|
Washington Wizards
|
Below is a table of Tm:
player.count <- nba.data.filtered %>%
separate_rows(Tm, sep = "-") %>%
group_by(Tm) %>%
summarise(count = n()) %>%
arrange(desc(count))
player.count %>%
kable(
col.names = c("Team", "Number of Players"),
align = "lc",
caption = "Players by Team"
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
Players by Team
|
Team
|
Number of Players
|
|
MEM
|
33
|
|
DET
|
31
|
|
TOR
|
30
|
|
PHI
|
28
|
|
CHO
|
26
|
|
NYK
|
26
|
|
WAS
|
24
|
|
DAL
|
22
|
|
IND
|
22
|
|
OKC
|
22
|
|
PHO
|
22
|
|
POR
|
22
|
|
BRK
|
21
|
|
LAC
|
21
|
|
LAL
|
21
|
|
MIA
|
21
|
|
MIL
|
21
|
|
UTA
|
21
|
|
NOP
|
20
|
|
SAC
|
20
|
|
SAS
|
20
|
|
ATL
|
19
|
|
BOS
|
19
|
|
MIN
|
19
|
|
CHI
|
18
|
|
CLE
|
18
|
|
GSW
|
18
|
|
ORL
|
18
|
|
DEN
|
17
|
|
HOU
|
17
|
As shown in the table above, the Memphis Grizzlies had the most
amount of players in the dataset.
Note Any player who is played for multiple teams
counts for both teams in the above table.
Distribution of Games
Played
Below is a histogram of G:
hist(nba.data.filtered$G,
breaks = seq(0, 90, 5),
main = "Distribution of Games Played",
xlim = c(0, 90),
xaxt = "n",
xlab = "Games Played",
ylim = c(0, 60),
ylab = "Number of Players",
col = "red",
labels = TRUE)
abline(h = seq(5, 60, 5), col = "gray", lty = "dotted")
axis(1, at = seq(0, 90, 10))

The histogram above shows that the distribution of G has
two peaks around the 70’s and 5-10. It is important to note that there
are 82 games in a NBA regular season.
Distribution of Games
Started
Below is a histogram of GS:
hist(nba.data.filtered$GS,
breaks = seq(0, 90, 5),
main = "Distribution of Games Started",
xlim = c(0, 90),
xaxt = "n",
xlab = "Games Started",
ylim = c(0, 300),
ylab = "Number of Players",
col = "red",
labels = TRUE)
abline(h = seq(25, 300, 25), col = "gray", lty = "dotted")
axis(1, at = seq(0, 90, 10))

The histogram above shows that for GS, almost half of
the players in the dataset started less than 5 games. Only 5 players can
start a game per team per game, meaning that only ~ 1/3rd to 1/6th of an
NBA team’s roster will not start on any given game, given their roster
size.
Distribution of
Minutes Played per Game
Below is a histogram of MP:
hist(nba.data.filtered$MP,
breaks = seq(0, 48, 4),
main = "Distribution of Minutes Played per Game",
xlim = c(0, 48),
xaxt = "n",
xlab = "Minutes Played per Game",
ylim = c(0, 100),
ylab = "Number of Players",
col = "red",
labels = TRUE)
abline(h = seq(10, 100, 10), col = "gray", lty = "dotted")
axis(1, at = seq(0, 48, 8))

The histogram above shows that GS has a much more
evenly-distributed distribution, This means that even though many
players did not start a game all season, almost all players played at
least 4 minutes per game on average.
Distribution of Field
Goals
Below is a scatterplot of FGA vs FG, with a
color scaling of FGP:
ggplot(nba.data.filtered, aes(x = FGA, y = FG)) +
geom_point(aes(color = FGP)) +
geom_abline(slope = 1, intercept = 0, color = "black", linetype = "dashed", size = 1) +
scale_color_gradient(low = "red", high = "green") +
labs(
title = "Field Goal Attempts vs Field Goals Scored",
x = "Field Goal Attempts per Game",
y = "Field Goal Scored per Game",
color = "Field Goal Percentage"
) +
coord_equal() +
scale_x_continuous(limits = c(0, 24)) +
scale_y_continuous(limits = c(0, 24))

As shown in the scatterplot above, players seem to clump together for
field goal data, following a straight-line. The black dashed line
represents if a player has a 100% field goal percentage.
Distribution of
2-Point Field Goals
Below is a scatterplot of X2PA vs X2P, with
a color scaling of X2PP:
ggplot(nba.data.filtered, aes(x = X2PA, y = X2P)) +
geom_point(aes(color = X2PP)) +
geom_abline(slope = 1, intercept = 0, color = "black", linetype = "dashed", size = 1) +
scale_color_gradient(low = "red", high = "green") +
labs(
title = "2-Point Attempts vs 2-Point Scored",
x = "2-Point Attempts per Game",
y = "2-Point Scored per Game",
color = "2-Point Percentage"
) +
coord_equal() +
scale_x_continuous(limits = c(0, 20)) +
scale_y_continuous(limits = c(0, 20))

As shown above, there appears to be six outliers past the 8 2-point
scored per game range. Those players are below:
two.pt.outliers <- nba.data.filtered %>%
filter(X2P > 8) %>%
dplyr::select(Player, X2PA, X2P, X2PP)
kable(two.pt.outliers,
col.names = c("Player", "2-Point Attempts per Game", "2-Point Goals per Game", "2 Point Percentage"),
caption = "Players with more than 8 2-Point Field Goals Scored") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
Players with more than 8 2-Point Field Goals Scored
|
Player
|
2-Point Attempts per Game
|
2-Point Goals per Game
|
2 Point Percentage
|
|
Giannis Antetokounmpo
|
17.1
|
11.0
|
0.645
|
|
Anthony Davis
|
15.5
|
9.0
|
0.582
|
|
Joel Embiid
|
18.3
|
10.2
|
0.556
|
|
Shai Gilgeous-Alexander
|
16.2
|
9.3
|
0.576
|
|
Nikola Jokic
|
14.9
|
9.4
|
0.626
|
|
Zion Williamson
|
15.4
|
8.8
|
0.574
|
Players Giannis Antetokounmpo, Anthony Davis, Joel Embiid, Shai
Gilgeous-Alexander, Nikola Jokic, and Zion Williamson are the “unusual”
players for 2-point field goals.
Distribution of
3-Point Field Goals
Below is a scatterplot of X3PA vs X3P, with
a color scaling of X3PP:
ggplot(nba.data.filtered, aes(x = X3PA, y = X3P)) +
geom_point(aes(color = X3PP)) +
geom_abline(slope = 1, intercept = 0, color = "black", linetype = "dashed", size = 1) +
scale_color_gradient(low = "red", high = "green") +
labs(
title = "3-Point Attempts vs 3-Point Scored",
x = "3-Point Attempts per Game",
y = "3-Point Scored per Game",
color = "3-Point Percentage"
) +
coord_equal() +
scale_x_continuous(limits = c(0, 12)) +
scale_y_continuous(limits = c(0, 12))

As shown above, there appears to be two outliers past the 10 3-point
attempts per game range. Those players are below:
three.pt.outliers <- nba.data.filtered %>%
filter(X3PA > 10) %>%
dplyr::select(Player, X3PA, X3P, X3PP)
kable(three.pt.outliers,
col.names = c("Player", "3-Point Attempts per Game", "3-Point Goals per Game", "3 Point Percentage"),
caption = "Players with more than 10 3-Point Attemps") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
Players with more than 10 3-Point Attemps
|
Player
|
3-Point Attempts per Game
|
3-Point Goals per Game
|
3 Point Percentage
|
|
Stephen Curry
|
11.8
|
4.8
|
0.408
|
|
Luka Doncic
|
10.6
|
4.1
|
0.382
|
Players Stephen Curry and Luka Doncic are the standout players for
3-point field goals.
Distribution of
Effective Field Goal Percentage
Effective Field Goal Percentage (eFGP) is an adjusted
field goal percentage that accounts for 2-point field goals being worth
2 points, and 3-point field goals being worth 3 points.
Below is a scatterplot of X2PP vs X3PP,
with a color scaling of eFGP:
ggplot(nba.data.filtered, aes(x = X2PP, y = X3PP)) +
geom_point(aes(color = eFGP)) +
scale_color_gradient(low = "red", high = "green") +
labs(
title = "3-Point Attempts vs 3-Point Scored",
x = "2-Point Field Goal Percentage",
y = "3-Point Field Goal Percentage",
color = "Effective Field Goal Percentage"
) +
coord_equal() +
scale_x_continuous(limits = c(0, 1)) +
scale_y_continuous(limits = c(0, 1))

As shown in the scatterplot above, there appear to be some players
who have a 0 for X2PP and/or X3PP. Apart from
these observations, most players are in a clump together.
Distribution of Free
Throws
Below is a scatterplot of FTA vs FT, with a
color scaling of FTP:
ggplot(nba.data.filtered, aes(x = FTA, y = FT)) +
geom_point(aes(color = FTP)) +
geom_abline(slope = 1, intercept = 0, color = "black", linetype = "dashed", size = 1) +
scale_color_gradient(low = "red", high = "green") +
labs(
title = "Free Throw Attempts vs Free Throws Scored",
x = "Free Throw Attempts per Game",
y = "Free Throws Scored per Game",
color = "Free Throw Percentage"
) +
coord_equal() +
scale_x_continuous(limits = c(0, 12)) +
scale_y_continuous(limits = c(0, 12))

As shown above, there appears to be four outliers past the 8 free
throw attempts per game range. Those players are below:
free.throw.outliers <- nba.data.filtered %>%
filter(FTA > 8) %>%
dplyr::select(Player, FTA, FT, FTP)
kable(free.throw.outliers,
col.names = c("Player", "Free Throw Attempts per Game", "Free Throw per Game", "Free Throw Percentage"),
caption = "Players with more than 10 Free Throw Attempts") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
Players with more than 10 Free Throw Attempts
|
Player
|
Free Throw Attempts per Game
|
Free Throw per Game
|
Free Throw Percentage
|
|
Giannis Antetokounmpo
|
10.7
|
7.0
|
0.657
|
|
Luka Doncic
|
8.7
|
6.8
|
0.786
|
|
Joel Embiid
|
11.6
|
10.2
|
0.883
|
|
Shai Gilgeous-Alexander
|
8.7
|
7.6
|
0.874
|
Players Giannis Antetokounmpo, Luka Doncic, Joel Embiid, and Shai
Gilgeous-Alexander are the “unusual” players for free throws.
Distribution of
Rebounds
Below is a scatterplot of ORB vs DRB, with
a color scaling of TRB:
ggplot(nba.data.filtered, aes(x = ORB, y = DRB)) +
geom_point(aes(color = TRB)) +
scale_color_gradient(low = "red", high = "green") +
labs(
title = "Offensive Rebounds vs Defensive Rebounds",
x = "Offensive Rebounds per Game",
y = "Defensive Rebounds per Game",
color = "Total Rebounds"
)

As shown above, the relationship between ORB and
DRB have a slight fanning shape as both variables increase
in value.
Distribution of
Assists
Below is a histogram of AST:
hist(nba.data.filtered$AST,
breaks = seq(0, 12, 1),
main = "Distribution of Assists per Game",
xlim = c(0, 12),
xaxt = "n",
xlab = "Assists per Game",
ylim = c(0, 250),
ylab = "Number of Players",
col = "red",
labels = TRUE)
abline(h = seq(25, 250, 25), col = "gray", lty = "dotted")
axis(1, at = seq(0, 12, 2))

The histogram above shows that AST has a right
skewed-distribution.
Distribution of
Steals
Below is a histogram of STL:
hist(nba.data.filtered$STL,
breaks = seq(0, 2.5, .25),
main = "Distribution of Steals per Game",
xlim = c(0, 2.5),
xaxt = "n",
xlab = "Steals per Game",
ylim = c(0, 200),
ylab = "Number of Players",
col = "red",
labels = TRUE)
abline(h = seq(25, 200, 25), col = "gray", lty = "dotted")
axis(1, at = seq(0, 2.5, .5))

The histogram above shows that STL has right-skewed
distribution, but not as aggressive as assists.
Distribution of
Blocks
Below is a histogram of BLK:
hist(nba.data.filtered$BLK,
breaks = seq(0, 4, .4),
main = "Distribution of Blocks per Game",
xlim = c(0, 4),
xaxt = "n",
xlab = "Blocks per Game",
ylim = c(0, 400),
ylab = "Number of Players",
col = "red",
labels = TRUE)
abline(h = seq(50, 400, 50), col = "gray", lty = "dotted")
axis(1, at = seq(0, 4, .8))

The histogram above shows that BLK has an aggressively
right-skewed distribution. There appears to be one outlier who is above
the 3 blocks per game. That player is below:
blocks.outliers <- nba.data.filtered %>%
filter(BLK > 3) %>%
dplyr::select(Player, G, GS, BLK)
kable(blocks.outliers,
col.names = c("Player", "Games Played", "Games Started", "Blocks"),
caption = "Players with more than 3 Blocks per Game") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
Players with more than 3 Blocks per Game
|
Player
|
Games Played
|
Games Started
|
Blocks
|
|
Victor Wembanyama
|
71
|
71
|
3.6
|
Victor Wembanyama was the league leader in blocks per game.
Distribution of
Turnovers
Below is a histogram of TOV:
hist(nba.data.filtered$TOV,
breaks = seq(0, 5, .25),
main = "Distribution of Turnovers per Game",
xlim = c(0, 5),
xaxt = "n",
xlab = "Turnovers per Game",
ylim = c(0, 150),
ylab = "Number of Players",
col = "red",
labels = TRUE)
abline(h = seq(25, 150, 25), col = "gray", lty = "dotted")
axis(1, at = seq(0, 5, .5))

The histogram above shows that BLK has a right-skewed
distribution. There appears to be one outlier who is above the 4
turnovers per game. That player is below:
turnovers.outliers <- nba.data.filtered %>%
filter(TOV > 4) %>%
dplyr::select(Player, G, GS, TOV)
kable(turnovers.outliers,
col.names = c("Player", "Games Played", "Games Started", "Turnovers"),
caption = "Players with more than 4 TUrnovers per Game") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
Players with more than 4 TUrnovers per Game
|
Player
|
Games Played
|
Games Started
|
Turnovers
|
|
Trae Young
|
54
|
54
|
4.4
|
Trae Young was the league leader in blocks per game.
Distribution of
Personal Fouls
Below is a histogram of PF:
hist(nba.data.filtered$PF,
breaks = seq(0, 4, .4),
main = "Distribution of Personal Fouls per Game",
xlim = c(0, 4),
xaxt = "n",
xlab = "Personal Fouls per Game",
ylim = c(0, 120),
ylab = "Number of Players",
col = "red",
labels = TRUE)
abline(h = seq(10, 120, 10), col = "gray", lty = "dotted")
axis(1, at = seq(0, 4, .8))

The histogram above shows that BLK has a right-skewed
distribution.
Distribution of
Points Scored
Below is a histogram of PTS:
hist(nba.data.filtered$PTS,
breaks = seq(0, 36, 4),
main = "Distribution of Points Scored per Game",
xlim = c(0, 36),
xaxt = "n",
xlab = "Points Scored per Game",
ylim = c(0, 200),
ylab = "Number of Players",
col = "red",
labels = TRUE)
abline(h = seq(25, 200, 25), col = "gray", lty = "dotted")
axis(1, at = seq(0, 36, 4))

Local Outlier Factor
(LOF)
The LOF is the most well-known local anomaly detection algorithm
whose idea is carried out in many nearest-neighbor-based algorithms.
When a point is considered as an outlier bases on its local
neighborhood, it is a local outlier. LOF will identify an outlier
considering the density of the neighborhood. LOF performs well when the
density of the data is not the same throughout the dataset.
Outlier
Definition
For the sake of this analysis, I will define an outlier as any player
who has a combined z-score for the following variables higher than
3:
MP
PTS
FGP
X3PP
TRB
AST
STL
BLK
TOV
A new variable Outlier will be created for each
observation. If their combined z-score is higher than 3, their
Outlier variable will be 1, otherwise it will be
0.
LOF Distribution
The distribution of LOF scores is usually very skewed. To create a
better visual representation, I truncate the extremely large LOF scores.
Next, I extracted the LOF scores of each player in the dataset and
summarize the distribution.
Below is the summary:
lof.data <- nba.data.filtered[, c("MP", "PTS", "FGP", "X3PP", "TRB", "AST", "STL", "BLK", "TOV")]
lof.data <- scale(lof.data)
lof.scores.50 <- lof(lof.data, minPts = 50)
summary(lof.scores.50)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9566 1.0154 1.0765 1.1638 1.2304 3.5534
hist(lof.scores.50[lof.scores.50 < 5],
breaks = seq(0, 4, .1),
main = "Distribution of LOF",
xlim = c(0, 4),
xaxt = "n",
xlab = "LOF < 5",
ylim = c(0, 250),
ylab = "Number of Players",
col = "gray",
labels = TRUE)
abline(h = seq(25, 250, 25), col = "gray", lty = "dotted")
axis(1, at = seq(0, 4, .4))

Determining LOF
cutoff
In order for outliers to be identified, there must be a consistent
cutoff point. To determine the cutoff point, I will graph the suggested
LOF cutoffs against the successful LOF catching rate.
lof.cut = seq(1, 4, length = 20)
lof.catch.rate = numeric(length(lof.cut))
vars <- c("MP", "PTS", "FGP", "X3PP", "TRB", "AST", "STL", "BLK", "TOV")
z.scores <- scale(nba.data.filtered[, vars])
nba.data.filtered$combined.z <- apply(z.scores, 1, function(row) sqrt(sum(row^2)))
nba.data.filtered$Outlier <- nba.data.filtered$combined.z > 3
for (i in 1:length(lof.cut)){
ID = lof.scores.50 > lof.cut[i]
Outlier <- nba.data.filtered$Outlier[ID]
lof.catch.rate[i] <- sum(Outlier) / sum(ID)
}
default.rate <- sum(nba.data.filtered$Outlier) / nrow(nba.data.filtered)
rel.detect.rate <- (lof.catch.rate - default.rate) / default.rate
par(mfrow = c(1, 2))
plot(lof.cut, 100*lof.catch.rate, type = "l", col = "red", lwd = 2,
xlab = "LOF Cutoff",
ylab = "LOF Catching Rate (%)",
main = "LOF Catching Rate of Outliers")
plot(lof.cut, rel.detect.rate, type = "l", col = "red", lwd = 2,
xlab = "LOF Cutoff",
ylab = "LOF and random catching rate ratio",
main = "LOF catching rate improvement")

The above figures show that, with k = 100, the LOF-based
outlier-catching rate is much higher than the random guess. I will next
sketch an ROC to assess the global performance in terms of catching
rate.
category = as.character(nba.data.filtered$Outlier)
category <- ifelse(category == TRUE, "1", "0")
roc.obj.lof.50 <- roc(category, lof.scores.50, levels = c("1", "0"), direction = ">")
sen.lof.50 = roc.obj.lof.50$sensitivities
fnr.lof.50 = 1 - roc.obj.lof.50$specificities
par(type = "s")
plot(fnr.lof.50, sen.lof.50, type = "l", lwd = 2, col = "red",
xlim = c(0, 1),
ylim = c(0, 1),
xlab = "1 - Specificity",
ylab = "Sensitivity",
main = "ROC Curves of LOF Detection")
segments(0, 0, 1, 1, lwd = 1, col = "black", lty = 2)
auc.50 = roc.obj.lof.50$auc
text(0.87, 0.20, paste("AUC (50) = ", round(auc.50, 4)), col = "red", cex = 0.7, adj = 1)

Tuning Hyperparameter
k
The LOF defines local outliers. The LOF scores will be depending on
the value of \(k\) (the number of
nearest players of every player). For illustration purposes, I will use
3 alternate values of k in an attempt to improve the AUC value.
lof.scores.100 <- lof(lof.data, minPts = 100)
roc.obj.lof.100 <- roc(category, lof.scores.100, levels = c("1", "0"), direction = ">")
sen.lof.100 = roc.obj.lof.100$sensitivities
fnr.lof.100 = 1 - roc.obj.lof.100$specificities
lof.scores.200 <- lof(lof.data, minPts = 200)
roc.obj.lof.200 <- roc(category, lof.scores.200, levels = c("1", "0"), direction = ">")
sen.lof.200 = roc.obj.lof.200$sensitivities
fnr.lof.200 = 1 - roc.obj.lof.200$specificities
lof.scores.400 <- lof(lof.data, minPts = 400)
roc.obj.lof.400 <- roc(category, lof.scores.400, levels = c("1", "0"), direction = ">")
sen.lof.400 = roc.obj.lof.400$sensitivities
fnr.lof.400 = 1 - roc.obj.lof.400$specificities
par(type = "s")
colors = c("red", "blue", "orange", "darkgreen")
plot(fnr.lof.50, sen.lof.50, type = "l", lwd = 2, col = colors[1],
xlim = c(0, 1),
ylim = c(0, 1),
xlab = "1 - Specificity",
ylab = "Sensitivity",
main = "ROC Curves of LOF Detection Comparison")
lines(fnr.lof.100, sen.lof.100, lwd = 2, lty = 2, col = colors[2])
lines(fnr.lof.200, sen.lof.200, lwd = 1, col = colors[3])
lines(fnr.lof.400, sen.lof.400, lwd = 1, col = colors[4])
segments(0, 0, 1, 1, lwd = 1, col = "black", lty = 2)
legend("bottomright", c("LOF.50", "LOF.100", "LOF.200", "LOF.400"),
col = colors, lwd = c(2, 2, 1, 1, 1),
lty = c(1, 2, 1, 1, 2), bty = "n", cex = 0.7)
AUC.50 = roc.obj.lof.50$auc
AUC.100 = roc.obj.lof.100$auc
AUC.200 = roc.obj.lof.200$auc
AUC.400 = roc.obj.lof.400$auc
text(0.87, 0.17, paste("AUC (50) = ", round(AUC.50, 4)), col = colors[1], cex = 0.7, adj = 1)
text(0.87, 0.12, paste("AUC (100) = ", round(AUC.100, 4)), col = colors[2], cex = 0.7, adj = 1)
text(0.87, 0.07, paste("AUC (200) = ", round(AUC.200, 4)), col = colors[3], cex = 0.7, adj = 1)
text(0.87, 0.02, paste("AUC (400) = ", round(AUC.400, 4)), col = colors[4], cex = 0.7, adj = 1)

As shown above, the best \(k\) for
the given data from the selection is 400, as it has the highest AUC
value. However, it is advised to be careful choosing a \(k\) value that is deemed to be “too close”
to the number of observations within the dataset.
---
title: "Grouping NBA Players Utilizing Unsupervised Learning Algroithms"
author: "Evan Parker"
date: "2024-11-17"
output: 
  html_document:
    toc: yes
    toc_float: yes
    toc_depth: 4
    fig_width: 6
    fig_height: 4
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  font-weight:bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-family: system-ui;
  color: DarkRed;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: system-ui;
    font-weight:bold;
    color: navy;
    text-align: left;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-weight:bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-weight:bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```

```{r setup, include=FALSE}
options(repos = list(CRAN="http://cran.rstudio.com/"))

update.packages()

if (!require("car")) {
  install.packages("car")
  library(car)
}

if (!require("caret")) {
  install.packages("caret")
  library(caret)
}

if (!require("caTools")) {
  install.packages("caTools")
  library(CaTools)
}

if (!require("cluster")) {
  install.packages("cluster")
  library(cluster)
}

if (!require("dbscan")) {
   install.packages("dbscan")
   library(dbscan)
}

if (!require("dplyr")) {
   install.packages("dplyr")
   library(dplyr)
}

if (!require("e1071")) {
  install.packages("e1071")
  library(e1071)
}

if (!require("factoextra")) {
  install.packages("factoextra")
  library(factoextra)
}

if (!require("ggfortify")) {
  install.packages("ggfortify")
  library(ggfortify)
}

if (!require("ggplot2")) {
   install.packages("ggplot2")
   library(ggplot2)
}

if (!require("ggpubr")) {
  install.packages("ggpubr")
  library(ggpubr)
}

if (!require("ipred")) {
  install.packages("ipred")
  library(ipred)
}

if (!require("kableExtra")) {
  install.packages("kableExtra")
  library(kableExtra)
}

if (!require("keras")) {
  install.packages("keras")
  library(keras)
}

if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}

if (!require("MASS")) {
  install.packages("MASS")
  library(MASS)
}

if (!require("neuralnet")) {
  install.packages("neuralnet")
  library(neuralnet)
}

if (!require("nnet")) {
  install.packages("nnet")
  library(nnet)
}

if (!require("pROC")) {
  install.packages("pROC")
  library(pROC)
}

if (!require("psych")) {
  install.packages("psych")
  library(psych)
}

if (!require("rpart")) {
  install.packages("rpart")
  library(rpart)
}

if (!require("rpart.plot")) {
  install.packages("rpart.plot")
  library(rpart.plot)
}

if (!require("tidyverse")) {
   install.packages("tidyverse")
   library(tidyverse)
}

knitr::opts_chunk$set(echo = TRUE,       
                      warning = FALSE,   
                      result = TRUE,   
                      message = FALSE,
                      comment = NA)

options(scipen = 999)
```

# Problem Statement and Background

The aim of the following analysis is to examine the relationships between different player measurements for NBA players in the 2023-24 season. This analysis hopes to identify key factors to cluster groups of players together, hoping to ease player comparison.

This dataset comes from Kaggle, a free web-based platform that data scientists and statisticians use to share both ideas and datasets. The link to the **2023-24 NBA Player Stats** dataset is below:

[Loan Approval Classification Dataset](https://www.kaggle.com/datasets/vivovinco/2023-2024-nba-player-stats/data)

This dataset may be updated over time. I downloaded this dataset on Sunday, November 17th, 2024. Link to this version of the dataset is below:

[11/17/24 Version](https://github.com/EPKeep32/STA551/blob/main/2023-2024%20NBA%20Player%20Stats%20-%20Regular.csv)

R/R Markdown were used in this project as it is free and open-source, allowing users to customize their experience with various libraries and features that other coding software such as SAS do not have. R provides an easy-to-use and comprehensive toolset of statistical analyses and tests. While these advanced analyses will not be used in this project, further work can be done to provide additional insights.

This analysis seeks to answer the questions:

-   *Can the dataset be split into separate groups using a multitude of different player statistics?*
-   *Can we determine any players who are considered outliers in comparison to their group?*

```{r data integration, include = TRUE}
nba.data <- read.csv("https://github.com/EPKeep32/STA551/raw/refs/heads/main/2023-2024%20NBA%20Player%20Stats%20-%20Regular.csv", sep = ";")
```

# Descrption of the Data

Here is what the author on Kaggle had to say about the **2023-24 NBA Player Stats** Dataset: "This dataset contains [2023-2024] regular season NBA player stats per game. Note that there are duplicate player names resulted from team changes."

A detailed description of the variables within the dataset is given below:

`Rk`: Rank, (player identification number) [*integer*]

`Player`: Player's name [*character*]

`Pos`: Position [*categorical*]

`Age`: Player's age [*numerical*]

`Tm`: Team [*categorical*]

`G`: Games played [*numerical*]

`GS`: Games started [*numerical*]

`MP`: Minutes played per game [*numerical*]

`FG`: Field goals per game [*numerical*]

`FGA`: Field goal attempts per game [*numerical*]

`FG%`: Field goal percentage [*numerical*]

`3P`: 3-point field goals per game [*numerical*]

`3PA`: 3-point field goal attempts per game [*numerical*]

`3P%`: 3-point field goal percentage [*numerical*]

`2P`: 2-point field goals per game [*numerical*]

`2PA`: 2-point field goal attempts per game [*numerical*]

`2P%`: 2-point field goal percentage [*numerical*]

`eFG%`: Effective field goal percentage [*numerical*]

`FT`: Free throws per game [*numerical*]

`FTA`: Free throw attempts per game [*numerical*]

`FT%`: Free throw percentage [*numerical*]

`ORB`: Offensive rebounds per game [*numerical*]

`DRB`: Defensive rebounds per game [*numerical*]

`TRB`: Total rebounds per game [*numerical*]

`AST`: Assists per game [*numerical*]

`STL`: Steals per game [*numerical*]

`BLK`: Blocks per game [*numerical*]

`TOV`: Turnovers per game [*numerical*]

`PF`: Personal fouls per game [*numerical*]

`PTS`: Points per game [*numerical*]

## Feature Engineering

Here are some additional feature engineering that is needed for ananysis found later in the report:

### Team Reconfiguration

In the instance where a player played for multiple teams in the 2023-24 season, that player has multiple observations in the dataset, one for each team they played on. The player also has an additional observation with the overall season stats for both teams. This in indicated when `Tm` equals "TOT", which stand for total.

To avoid duplicate results, any players who played for multiple teams have their non-total observations removed. Also, their `Tm` value will be updated to include both team names

**For Example** If Player A played for PHI and CHI, their new `Tm` value will be PHI-CHI

```{r team reconfiguration, include = TRUE}
nba.data.non.duplicates <- nba.data %>%
  group_by(Player) %>%
  filter(n() == 1)

nba.data.duplicates <- nba.data %>%
  group_by(Player) %>%
  filter(n() > 1)

nba.data.duplictaes.updated <- nba.data.duplicates %>%
  mutate(
    Tm = ifelse(
      Tm == "TOT",
      paste(setdiff(unique(Tm), "TOT"), collapse = "-"),
      Tm
    )
  ) %>%
  filter(str_detect(Tm, "-"))

nba.data.filtered <- bind_rows(nba.data.non.duplicates, nba.data.duplictaes.updated)
```

### Variable Renaming

All percentage variables will be renamed with a -P at the end of their variable name instead of an -% for more consistent processing. Also, all numbers that begin with a number will now have an "X" in front of its name. The table below reflects this change:

```{r percentage variables, include = TRUE}
nba.data.filtered <- nba.data.filtered %>%
  rename(
    FGP = FG.,
    X3PP = X3P.,
    X2PP = X2P.,
    eFGP = eFG.,
    FTP= FT.
  )

perc.vars <- data.frame(
  org.name = c("FG%", "3P%", "2P%", "eFG%", "FT%", "3P", "3PA", "2P", "2PA"),
  new.name = c("FGP", "X3PP", "X2PP", "eFGP", "FTP", "X3P", "X3PA", "X2P", "X2PA")
)

kable(perc.vars,
      col.names = c("Original Name", "New Name"),
      caption = "Renaming of Percentage Variables") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
```

## Handling Missing Values

The next step is to determine if there are any missing variables. Missing variables can cause biased models to be generated, ultimately causing decisions that may be incorrect.

Another issue with missing values is how to replace them. There are many ways to replace missing values, including mean and median imputation, and model-based imputation.

However, I will first determine if there are any missing values within the dataset before exploring replacement techniques.

```{r missing values, include = TRUE}
missing.values.summary <- data.frame(
  "Variable Name" = names(nba.data.filtered),
  "Missing Values" = colSums(is.na(nba.data.filtered))
)

kable(missing.values.summary,
      col.names = c("Variable Name", "Missing Values"),
      caption = "Summary of Missing Values by Variable") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
```

The above summary table shows that **0** of the variables have missing values. Therefore, no imputations are necessary for the dataset.

## Final Dataset

The final dataset has **735** observations each with **30** variables. All 30 variables will be held within the dataset for any further research, regardless of their use in this research.

Below is a quick summary of all numeric variables in the final dataset:

```{r num data, include = TRUE}
num.data <- nba.data.filtered[, sapply(nba.data.filtered, is.numeric)]

num.sum <- data.frame(
  Variable = names(num.data),
  Min = sapply(num.data, function(x) min(x, na.rm = TRUE)),
  Mean = sapply(num.data, function(x) mean(x, na.rm = TRUE)),
  Median = sapply(num.data, function(x) median(x, na.rm = TRUE)),
  Max = sapply(num.data, function(x) max(x, na.rm = TRUE))
)

kable(num.sum,
      col.names = c("Variable", "Min", "Mean", "Median", "Max"),
      caption = "Summary of Numeric Variables") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
```

Below is a quick summary of all categorical variables in the final dataset:

```{r cat data, include = TRUE}
cat.data <- nba.data.filtered[, sapply(nba.data.filtered, is.character)]

cat.sum <- lapply(names(cat.data), function(var) {
  counts <- table(cat.data[[var]], useNA = "no")
  sorted.counts <- sort(counts, decreasing = TRUE)
  top.value <- if (length(sorted.counts) > 0) names(sorted.counts)[1] else NA
  top.count <- if (length(sorted.counts) > 0) as.numeric(sorted.counts[1]) else 0
  
  data.frame(
    Variable = var,
    `Top Value` = top.value,
    `Top Count` = top.count,
    `Unique Values` = length(counts)
  )
}) %>%
  bind_rows()

kable(cat.sum,
      col.names = c("Variable", "Top Value", "Top Count", "Unique Values"),
      caption = "Summary of Categorical Variables with Top Value") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
```

**Note** A.J. Green is listed as the top value for `Player`. This is because A.J. Green is the first player alphabetically within the dataset, and each player only appears once.

**Note** MEM is listed as the top value for `Tm`. This only includes players who played for MEM ONLY. Players who were traded within the season were not added to this count. A list of what each abbreviation stand for can be found within the next section.

# Exploratory Data Analysis

Next I will perform some **E**xploratpory **D**ata **A**nalysis, or **EDA**. EDA is an important step in the overarching data analysis process, as it is important to familiarize yourself with the dataset.

## Distribution of Position

Below is a pie chart of `Pos`:

```{r position pie chart, include = TRUE}
nba.data.position.pie.chart <- nba.data.filtered %>%
  separate_rows(Pos, sep = "-")

position <- nba.data.position.pie.chart %>%
  group_by(Pos) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  mutate(perc = count / sum(count)) %>%
  arrange(perc) %>%
  mutate(labels = scales::percent(perc))

ggplot(position, aes(x = "", y = perc, fill = Pos)) +
  geom_col() +
  geom_text(aes(label = paste(count, "\n(", labels, ")", sep = "")),
            position = position_stack(vjust = (0.5))) +
  coord_polar(theta = "y") +
  labs(title = "Pie Chart of Position", y = "", x = "") +
  guides(fill = guide_legend(title = "Position"))
```

As shown in the pie chart above, the most common position in the dataset is Shooting Guard (*23.4%*) and the least common position is Center (*16.8%*).

**Note** Any player who is classified as having multiple positions counts as both positions for this visual representation only.

## Distribution of Age

Below is a histogram of `Age`:

```{r age histogram, include = TRUE}
hist(nba.data.filtered$Age,
     breaks = seq(15, 40, 1),
     main = "Distribution of Age",
     xlim = c(15, 40),
     xaxt = "n",
     xlab = "Age (years)",
     ylim = c(0, 100),
     ylab = "Number of Players",
     col = "red",
     labels = TRUE)
abline(h = seq(10, 100, 10), col = "gray", lty = "dotted")
axis(1, at = seq(15, 50, 5))
```

The histogram above shows that `Age` has a right-skewed distribution. The youngest players in the dataset are 18 years old, and the oldest player is 39 years old.

## Distribution of Team

Below is a table of each team abbreviation un-abbreviated:

```{r team bar graph chart, include = TRUE}
nba.teams <- data.frame(
  abbreviation = c("ATL", "BOS", "BKN", "CHA", "CHI", "CLE", "DAL", "DEN", "DET", "GSW", "HOU", "IND", "LAC", "LAL", "MEM", "MIA", "MIL", "MIN", "NOP", "NYK", "OKC", "ORL", "PHI", "PHX", "POR", "SAC", "SAS", "TOR", "UTA", "WAS"),
  
  full.name = c("Atlanta Hawks", "Boston Celtics", "Brooklyn Nets", "Charlotte Hornets", "Chicago Bulls", "Cleveland Cavaliers", "Dallas Mavericks", "Denver Nuggets", "Detroit Pistons", "Golden State Warriors", "Houston Rockets", "Indiana Pacers", "Los Angeles Clippers", "Los Angeles Lakers", "Memphis Grizzlies", "Miami Heat", "Milwaukee Bucks", "Minnesota Timberwolves", "New Orleans Pelicans", "New York Knicks", "Oklahoma City Thunder", "Orlando Magic", "Philadelphia 76ers", "Phoenix Suns", "Portland Trail Blazers", "Sacramento Kings", "San Antonio Spurs", "Toronto Raptors", "Utah Jazz", "Washington Wizards")
)

kable(nba.teams,
      col.names = c("Team Abbreviation", "Full Name"),
      caption = "List of NBA Teams and Their Abbreviations") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
```

Below is a table of `Tm`:

```{r team bar graph chart2, include = TRUE}
player.count <- nba.data.filtered %>%
  separate_rows(Tm, sep = "-") %>%
  group_by(Tm) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

player.count %>%
  kable(
    col.names = c("Team", "Number of Players"),
    align = "lc",
    caption = "Players by Team"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
```

As shown in the table above, the Memphis Grizzlies had the most amount of players in the dataset.

**Note** Any player who is played for multiple teams counts for both teams in the above table.

## Distribution of Games Played

Below is a histogram of `G`:

```{r games played histogram, include = TRUE}
hist(nba.data.filtered$G,
     breaks = seq(0, 90, 5),
     main = "Distribution of Games Played",
     xlim = c(0, 90),
     xaxt = "n",
     xlab = "Games Played",
     ylim = c(0, 60),
     ylab = "Number of Players",
     col = "red",
     labels = TRUE)
abline(h = seq(5, 60, 5), col = "gray", lty = "dotted")
axis(1, at = seq(0, 90, 10))
```

The histogram above shows that the distribution of `G` has two peaks around the 70's and 5-10. It is important to note that there are 82 games in a NBA regular season.

## Distribution of Games Started

Below is a histogram of `GS`:

```{r games started histogram, include = TRUE}
hist(nba.data.filtered$GS,
     breaks = seq(0, 90, 5),
     main = "Distribution of Games Started",
     xlim = c(0, 90),
     xaxt = "n",
     xlab = "Games Started",
     ylim = c(0, 300),
     ylab = "Number of Players",
     col = "red",
     labels = TRUE)
abline(h = seq(25, 300, 25), col = "gray", lty = "dotted")
axis(1, at = seq(0, 90, 10))
```

The histogram above shows that for `GS`, almost half of the players in the dataset started less than 5 games. Only 5 players can start a game per team per game, meaning that only \~ 1/3rd to 1/6th of an NBA team's roster will not start on any given game, given their roster size.

## Distribution of Minutes Played per Game

Below is a histogram of `MP`:

```{r minutes played histogram, include = TRUE}
hist(nba.data.filtered$MP,
     breaks = seq(0, 48, 4),
     main = "Distribution of Minutes Played per Game",
     xlim = c(0, 48),
     xaxt = "n",
     xlab = "Minutes Played per Game",
     ylim = c(0, 100),
     ylab = "Number of Players",
     col = "red",
     labels = TRUE)
abline(h = seq(10, 100, 10), col = "gray", lty = "dotted")
axis(1, at = seq(0, 48, 8))
```

The histogram above shows that `GS` has a much more evenly-distributed distribution, This means that even though many players did not start a game all season, almost all players played at least 4 minutes per game on average.

## Distribution of Field Goals

Below is a scatterplot of `FGA` vs `FG`, with a color scaling of `FGP`:

```{r field goal scatteplot, include = TRUE}
ggplot(nba.data.filtered, aes(x = FGA, y = FG)) +
  geom_point(aes(color = FGP)) +
  geom_abline(slope = 1, intercept = 0, color = "black", linetype = "dashed", size = 1) +
  scale_color_gradient(low = "red", high = "green") +
  labs(
    title = "Field Goal Attempts vs Field Goals Scored",
    x = "Field Goal Attempts per Game",
    y = "Field Goal Scored per Game",
    color = "Field Goal Percentage"
  ) +
  coord_equal() +
  scale_x_continuous(limits = c(0, 24)) +
  scale_y_continuous(limits = c(0, 24))
```

As shown in the scatterplot above, players seem to clump together for field goal data, following a straight-line. The black dashed line represents if a player has a 100% field goal percentage.

## Distribution of 2-Point Field Goals

Below is a scatterplot of `X2PA` vs `X2P`, with a color scaling of `X2PP`:

```{r 2 point scatterplot, include = TRUE}
ggplot(nba.data.filtered, aes(x = X2PA, y = X2P)) +
  geom_point(aes(color = X2PP)) +
  geom_abline(slope = 1, intercept = 0, color = "black", linetype = "dashed", size = 1) +
  scale_color_gradient(low = "red", high = "green") +
  labs(
    title = "2-Point Attempts vs 2-Point Scored",
    x = "2-Point Attempts per Game",
    y = "2-Point Scored per Game",
    color = "2-Point Percentage"
  ) +
  coord_equal() +
  scale_x_continuous(limits = c(0, 20)) +
  scale_y_continuous(limits = c(0, 20))
```

As shown above, there appears to be six outliers past the 8 2-point scored per game range. Those players are below:

```{r 2-point outliers, include = TRUE}
two.pt.outliers <- nba.data.filtered %>%
  filter(X2P > 8) %>%
  dplyr::select(Player, X2PA, X2P, X2PP)

kable(two.pt.outliers,
      col.names = c("Player", "2-Point Attempts per Game", "2-Point Goals per Game", "2 Point Percentage"),
      caption = "Players with more than 8 2-Point Field Goals Scored") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
```

Players Giannis Antetokounmpo, Anthony Davis, Joel Embiid, Shai Gilgeous-Alexander, Nikola Jokic, and Zion Williamson are the "unusual" players for 2-point field goals.

## Distribution of 3-Point Field Goals

Below is a scatterplot of `X3PA` vs `X3P`, with a color scaling of `X3PP`:

```{r 3 point scatterplot, include = TRUE}
ggplot(nba.data.filtered, aes(x = X3PA, y = X3P)) +
  geom_point(aes(color = X3PP)) +
  geom_abline(slope = 1, intercept = 0, color = "black", linetype = "dashed", size = 1) +
  scale_color_gradient(low = "red", high = "green") +
  labs(
    title = "3-Point Attempts vs 3-Point Scored",
    x = "3-Point Attempts per Game",
    y = "3-Point Scored per Game",
    color = "3-Point Percentage"
  ) +
  coord_equal() +
  scale_x_continuous(limits = c(0, 12)) +
  scale_y_continuous(limits = c(0, 12))
```

As shown above, there appears to be two outliers past the 10 3-point attempts per game range. Those players are below:

```{r 3-point outliers, include = TRUE}
three.pt.outliers <- nba.data.filtered %>%
  filter(X3PA > 10) %>%
  dplyr::select(Player, X3PA, X3P, X3PP)

kable(three.pt.outliers,
      col.names = c("Player", "3-Point Attempts per Game", "3-Point Goals per Game", "3 Point Percentage"),
      caption = "Players with more than 10 3-Point Attemps") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
```

Players Stephen Curry and Luka Doncic are the standout players for 3-point field goals.

## Distribution of Effective Field Goal Percentage

Effective Field Goal Percentage (`eFGP`) is an adjusted field goal percentage that accounts for 2-point field goals being worth 2 points, and 3-point field goals being worth 3 points.

Below is a scatterplot of `X2PP` vs `X3PP`, with a color scaling of `eFGP`:

```{r eFGP scatterplot, include = TRUE}
ggplot(nba.data.filtered, aes(x = X2PP, y = X3PP)) +
  geom_point(aes(color = eFGP)) +
  scale_color_gradient(low = "red", high = "green") +
  labs(
    title = "3-Point Attempts vs 3-Point Scored",
    x = "2-Point Field Goal Percentage",
    y = "3-Point Field Goal Percentage",
    color = "Effective Field Goal Percentage"
  ) +
  coord_equal() +
  scale_x_continuous(limits = c(0, 1)) +
  scale_y_continuous(limits = c(0, 1))
```

As shown in the scatterplot above, there appear to be some players who have a 0 for `X2PP` and/or `X3PP`. Apart from these observations, most players are in a clump together.

## Distribution of Free Throws

Below is a scatterplot of `FTA` vs `FT`, with a color scaling of `FTP`:

```{r free throw scatterplot, include = TRUE}
ggplot(nba.data.filtered, aes(x = FTA, y = FT)) +
  geom_point(aes(color = FTP)) +
  geom_abline(slope = 1, intercept = 0, color = "black", linetype = "dashed", size = 1) +
  scale_color_gradient(low = "red", high = "green") +
  labs(
    title = "Free Throw Attempts vs Free Throws Scored",
    x = "Free Throw Attempts per Game",
    y = "Free Throws Scored per Game",
    color = "Free Throw Percentage"
  ) +
  coord_equal() +
  scale_x_continuous(limits = c(0, 12)) +
  scale_y_continuous(limits = c(0, 12))
```

As shown above, there appears to be four outliers past the 8 free throw attempts per game range. Those players are below:

```{r free throw outliers, include = TRUE}
free.throw.outliers <- nba.data.filtered %>%
  filter(FTA > 8) %>%
  dplyr::select(Player, FTA, FT, FTP)

kable(free.throw.outliers,
      col.names = c("Player", "Free Throw Attempts per Game", "Free Throw per Game", "Free Throw Percentage"),
      caption = "Players with more than 10 Free Throw Attempts") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
```

Players Giannis Antetokounmpo, Luka Doncic, Joel Embiid, and Shai Gilgeous-Alexander are the "unusual" players for free throws.

## Distribution of Rebounds

Below is a scatterplot of `ORB` vs `DRB`, with a color scaling of `TRB`:

```{r rebound scatterplot, include = TRUE}
ggplot(nba.data.filtered, aes(x = ORB, y = DRB)) +
  geom_point(aes(color = TRB)) +
  scale_color_gradient(low = "red", high = "green") +
  labs(
    title = "Offensive Rebounds vs Defensive Rebounds",
    x = "Offensive Rebounds per Game",
    y = "Defensive Rebounds per Game",
    color = "Total Rebounds"
  )
```

As shown above, the relationship between `ORB` and `DRB` have a slight fanning shape as both variables increase in value.

## Distribution of Assists

Below is a histogram of `AST`:

```{r assists histogram, include = TRUE}
hist(nba.data.filtered$AST,
     breaks = seq(0, 12, 1),
     main = "Distribution of Assists per Game",
     xlim = c(0, 12),
     xaxt = "n",
     xlab = "Assists per Game",
     ylim = c(0, 250),
     ylab = "Number of Players",
     col = "red",
     labels = TRUE)
abline(h = seq(25, 250, 25), col = "gray", lty = "dotted")
axis(1, at = seq(0, 12, 2))
```

The histogram above shows that `AST` has a right skewed-distribution.

## Distribution of Steals

Below is a histogram of `STL`:

```{r steals histogram, include = TRUE}
hist(nba.data.filtered$STL,
     breaks = seq(0, 2.5, .25),
     main = "Distribution of Steals per Game",
     xlim = c(0, 2.5),
     xaxt = "n",
     xlab = "Steals per Game",
     ylim = c(0, 200),
     ylab = "Number of Players",
     col = "red",
     labels = TRUE)
abline(h = seq(25, 200, 25), col = "gray", lty = "dotted")
axis(1, at = seq(0, 2.5, .5))
```

The histogram above shows that `STL` has right-skewed distribution, but not as aggressive as assists.

## Distribution of Blocks

Below is a histogram of `BLK`:

```{r blocks histogram, include = TRUE}
hist(nba.data.filtered$BLK,
     breaks = seq(0, 4, .4),
     main = "Distribution of Blocks per Game",
     xlim = c(0, 4),
     xaxt = "n",
     xlab = "Blocks per Game",
     ylim = c(0, 400),
     ylab = "Number of Players",
     col = "red",
     labels = TRUE)
abline(h = seq(50, 400, 50), col = "gray", lty = "dotted")
axis(1, at = seq(0, 4, .8))
```

The histogram above shows that `BLK` has an aggressively right-skewed distribution. There appears to be one outlier who is above the 3 blocks per game. That player is below:

```{r blocks outliers, include = TRUE}
blocks.outliers <- nba.data.filtered %>%
  filter(BLK > 3) %>%
  dplyr::select(Player, G, GS, BLK)

kable(blocks.outliers,
      col.names = c("Player", "Games Played", "Games Started", "Blocks"),
      caption = "Players with more than 3 Blocks per Game") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
```

Victor Wembanyama was the league leader in blocks per game.

## Distribution of Turnovers

Below is a histogram of `TOV`:

```{r turnovers histogram, include = TRUE}
hist(nba.data.filtered$TOV,
     breaks = seq(0, 5, .25),
     main = "Distribution of Turnovers per Game",
     xlim = c(0, 5),
     xaxt = "n",
     xlab = "Turnovers per Game",
     ylim = c(0, 150),
     ylab = "Number of Players",
     col = "red",
     labels = TRUE)
abline(h = seq(25, 150, 25), col = "gray", lty = "dotted")
axis(1, at = seq(0, 5, .5))
```

The histogram above shows that `BLK` has a right-skewed distribution. There appears to be one outlier who is above the 4 turnovers per game. That player is below:

```{r turnovers outliers, include = TRUE}
turnovers.outliers <- nba.data.filtered %>%
  filter(TOV > 4) %>%
  dplyr::select(Player, G, GS, TOV)

kable(turnovers.outliers,
      col.names = c("Player", "Games Played", "Games Started", "Turnovers"),
      caption = "Players with more than 4 TUrnovers per Game") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
```

Trae Young was the league leader in blocks per game.

## Distribution of Personal Fouls

Below is a histogram of `PF`:

```{r personal fouls histogram, include = TRUE}
hist(nba.data.filtered$PF,
     breaks = seq(0, 4, .4),
     main = "Distribution of Personal Fouls per Game",
     xlim = c(0, 4),
     xaxt = "n",
     xlab = "Personal Fouls per Game",
     ylim = c(0, 120),
     ylab = "Number of Players",
     col = "red",
     labels = TRUE)
abline(h = seq(10, 120, 10), col = "gray", lty = "dotted")
axis(1, at = seq(0, 4, .8))
```

The histogram above shows that `BLK` has a right-skewed distribution.

## Distribution of Points Scored

Below is a histogram of `PTS`:

```{r points scored histogram, include = TRUE}
hist(nba.data.filtered$PTS,
     breaks = seq(0, 36, 4),
     main = "Distribution of Points Scored per Game",
     xlim = c(0, 36),
     xaxt = "n",
     xlab = "Points Scored per Game",
     ylim = c(0, 200),
     ylab = "Number of Players",
     col = "red",
     labels = TRUE)
abline(h = seq(25, 200, 25), col = "gray", lty = "dotted")
axis(1, at = seq(0, 36, 4))
```

# K-Means Clustering Analysis

K-Means clustering is a machine learning algorithm that is used to partition a dataset into distinct groups or clusters based on their similarity. It is often used to identify natural groupings within data and simplify data interpretation. For this project, I will perform K-Means Clustering to group players together.

## Determining the "K"

I will begin this process by determining the amount of clusters needed for the analysis. Below is an elbow plot to visualize the number of clusters needed. The number along the x-axis where a "kink" appears in the graph is an indicator of where the cluster cutoff should be

```{r elbow plot, include = TRUE}
kmeans.data <- subset(num.data, select = -Rk)
kmeans.scaled <- as.data.frame(scale(kmeans.data))

wss = NULL
k = 10
for(i in 1:k){
  wss[i] = kmeans(kmeans.scaled, i, 1)$tot.withinss
}

plot(1:k, wss, type = "b",
     col = "red",
     xlab = "Number of Clusters",
     ylab = "Within Sums of Squares",
     main = "Elbow Plot for Selecting Optimal Number of Clusters")
```

As shown above, the elbow appears to be at k = 2. To confirm or deny this, below is a silhouette plot

```{r silhouette plot, include = TRUE}
fviz_nbclust(kmeans.scaled, FUN = hcut, method = "silhouette")
```

The silhouette plot confirms that the ideal number of clusters for the analysis is k = **2**

## Clustering

Next, I will split the dataset into two distinct clusters. The scatterplot below shows the relationship between `PTS` and `FGP`, with a the clusters shown by the color of each point.

```{r k2, include = TRUE}
k2 <- kmeans(x = kmeans.scaled,
             centers = 2,
             iter.max = 10,
             nstart = 25,
             algorithm = "Lloyd",
             trace = FALSE)

kmeans.scaled$group = as.factor(k2$cluster)

plot(kmeans.scaled$PTS, kmeans.scaled$FGP,
     pch = 19,
     col = factor(kmeans.scaled$group),
     xlab = "Points Scored (scaled)",
     ylab = "Field Goal Percentage (scaled)",
     main = "Clustering Performance Visual Check")

legend("bottomright",
       legend = levels(factor(kmeans.scaled$group)),
       pch = 19,
       col = 1:length(levels(kmeans.scaled$group)))
```

The scatterplot above shows how the dataset is split into two distinct groups.

# Principal Component Analysis (PCA)

Principle Component Analysis (PCA) is a tool to reduce the dimensionality of the dataset while retaining as much variability as possible. The subsetted numerical dataset has 27 variables. Therefore, we want to reduce this dimension to a smaller number, while not losing any valuable information.

Below is the factor loading for the fitted PCA. 

```{r pca creation, include = TRUE}
log.num.data = log(kmeans.data + 1)

log.pca <- prcomp(log.num.data, center = TRUE, scale = TRUE)

kable(round(log.pca$rotation, 2),
      caption = "PCA Loadings") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
```

## Determining the Amount of kept PCA's

The object of PCA is to reduce the dimension without losing a significant amount of information. The following table gives the importance of the principal components

```{r pca, include = TRUE}
kable(summary(log.pca)$importance,
      caption = "Principle Component Importance") %>%
   kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
```

TO visualize these PCA's, below is a scree plot

```{r scree plot, include = TRUE}
var.explained <- log.pca$sdev^2 / sum(log.pca$sdev^2)

cum.var <- cumsum(var.explained)

plot(cum.var, 
     type = "b", 
     xlab = "Number of Principal Components",
     ylab = "Cumultive Variance Explained",
     main = "Scree Plot")
abline(h = .9, col = "red", lty = 2)
```

A common cutoff for PCA's is *90%* of cumulative variance explained. Therefore, we want to keep the first 8 PCA's.

Below is the kept PCA scores:

```{r kept pcas, include = TRUE}
kable(log.pca$x[1:10, 1:8],
      caption = "The first 8 PC scored transformed from the original variables") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
```

# Local Outlier Factor (LOF)

The LOF is the most well-known local anomaly detection algorithm whose idea is carried out in many nearest-neighbor-based algorithms. When a point is considered as an outlier bases on its local neighborhood, it is a local outlier. LOF will identify an outlier considering the density of the neighborhood. LOF performs well when the density of the data is not the same throughout the dataset.

## Outlier Definition

For the sake of this analysis, I will define an outlier as any player who has a combined z-score for the following variables higher than 3:

- `MP`
- `PTS`
- `FGP`
- `X3PP`
- `TRB`
- `AST`
- `STL`
- `BLK`
- `TOV`

A new variable `Outlier` will be created for each observation. If their combined z-score is higher than 3, their `Outlier` variable will be *1*, otherwise it will be *0*.

## LOF Distribution

The distribution of LOF scores is usually very skewed. To create a better visual representation, I truncate the extremely large LOF scores. Next, I extracted the LOF scores of each player in the dataset and summarize the distribution.

Below is the summary:

```{r lof summary, include = TRUE}
lof.data <- nba.data.filtered[, c("MP", "PTS", "FGP", "X3PP", "TRB", "AST", "STL", "BLK", "TOV")]

lof.data <- scale(lof.data)

lof.scores.50 <- lof(lof.data, minPts = 50)
summary(lof.scores.50)
```

```{r lof histogram, include = TRUE}
hist(lof.scores.50[lof.scores.50 < 5],
     breaks = seq(0, 4, .1),
     main = "Distribution of LOF",
     xlim = c(0, 4),
     xaxt = "n",
     xlab = "LOF < 5",
     ylim = c(0, 250),
     ylab = "Number of Players",
     col = "gray",
     labels = TRUE)
abline(h = seq(25, 250, 25), col = "gray", lty = "dotted")
axis(1, at = seq(0, 4, .4))
```

## Determining LOF cutoff

In order for outliers to be identified, there must be a consistent cutoff point. To determine the cutoff point, I will graph the suggested LOF cutoffs against the successful LOF catching rate.

```{r lof cutoff, include = TRUE}
lof.cut = seq(1, 4, length = 20)

lof.catch.rate = numeric(length(lof.cut))

vars <- c("MP", "PTS", "FGP", "X3PP", "TRB", "AST", "STL", "BLK", "TOV")
z.scores <- scale(nba.data.filtered[, vars])
nba.data.filtered$combined.z <- apply(z.scores, 1, function(row) sqrt(sum(row^2)))
nba.data.filtered$Outlier <- nba.data.filtered$combined.z > 3

for (i in 1:length(lof.cut)){
  ID = lof.scores.50 > lof.cut[i]
  Outlier <- nba.data.filtered$Outlier[ID]
  lof.catch.rate[i] <- sum(Outlier) / sum(ID)
}

default.rate <- sum(nba.data.filtered$Outlier) / nrow(nba.data.filtered)
rel.detect.rate <- (lof.catch.rate - default.rate) / default.rate

par(mfrow = c(1, 2))

plot(lof.cut, 100*lof.catch.rate, type = "l", col = "red", lwd = 2,
     xlab = "LOF Cutoff",
     ylab = "LOF Catching Rate (%)",
     main = "LOF Catching Rate of Outliers")

plot(lof.cut, rel.detect.rate, type = "l", col = "red", lwd = 2,
     xlab = "LOF Cutoff",
     ylab = "LOF and random catching rate ratio",
     main = "LOF catching rate improvement")
```

The above figures show that, with k = 100, the LOF-based outlier-catching rate is much higher than the random guess. I will next sketch an ROC to assess the global performance in terms of catching rate.

```{r roc, include = TRUE}
category = as.character(nba.data.filtered$Outlier)
category <- ifelse(category == TRUE, "1", "0")

roc.obj.lof.50 <- roc(category, lof.scores.50, levels = c("1", "0"), direction = ">")

sen.lof.50 = roc.obj.lof.50$sensitivities
fnr.lof.50 = 1 - roc.obj.lof.50$specificities

par(type = "s")
plot(fnr.lof.50, sen.lof.50, type = "l", lwd = 2, col = "red",
     xlim = c(0, 1),
     ylim = c(0, 1),
     xlab = "1 - Specificity",
     ylab = "Sensitivity",
     main = "ROC Curves of LOF Detection")

segments(0, 0, 1, 1, lwd = 1, col = "black", lty = 2)
auc.50 = roc.obj.lof.50$auc
text(0.87, 0.20, paste("AUC (50) = ", round(auc.50, 4)), col = "red", cex = 0.7, adj = 1)
```

## Tuning Hyperparameter k

The LOF defines local outliers. The LOF scores will be depending on the value of $k$ (the number of nearest players of every player). For illustration purposes, I will use 3 alternate values of k in an attempt to improve the AUC value.

```{r alt k 100, include = TRUE}
lof.scores.100 <- lof(lof.data, minPts = 100)

roc.obj.lof.100 <- roc(category, lof.scores.100, levels = c("1", "0"), direction = ">")

sen.lof.100 = roc.obj.lof.100$sensitivities
fnr.lof.100 = 1 - roc.obj.lof.100$specificities
```

```{r alt k 200, include = TRUE}
lof.scores.200 <- lof(lof.data, minPts = 200)

roc.obj.lof.200 <- roc(category, lof.scores.200, levels = c("1", "0"), direction = ">")

sen.lof.200 = roc.obj.lof.200$sensitivities
fnr.lof.200 = 1 - roc.obj.lof.200$specificities
```

```{r alt k 400, include = TRUE}
lof.scores.400 <- lof(lof.data, minPts = 400)

roc.obj.lof.400 <- roc(category, lof.scores.400, levels = c("1", "0"), direction = ">")

sen.lof.400 = roc.obj.lof.400$sensitivities
fnr.lof.400 = 1 - roc.obj.lof.400$specificities
```

```{r alt auc, include = TRUE}
par(type = "s")
colors = c("red", "blue", "orange", "darkgreen")

plot(fnr.lof.50, sen.lof.50, type = "l", lwd = 2, col = colors[1],
     xlim = c(0, 1),
     ylim = c(0, 1),
     xlab = "1 - Specificity",
     ylab = "Sensitivity",
     main = "ROC Curves of LOF Detection Comparison")

lines(fnr.lof.100, sen.lof.100, lwd = 2, lty = 2, col = colors[2])
lines(fnr.lof.200, sen.lof.200, lwd = 1, col = colors[3])
lines(fnr.lof.400, sen.lof.400, lwd = 1, col = colors[4])

segments(0, 0, 1, 1, lwd = 1, col = "black", lty = 2)
legend("bottomright", c("LOF.50", "LOF.100", "LOF.200", "LOF.400"),
       col = colors, lwd = c(2, 2, 1, 1, 1),
       lty = c(1, 2, 1, 1, 2), bty = "n", cex = 0.7)

AUC.50 = roc.obj.lof.50$auc
AUC.100 = roc.obj.lof.100$auc
AUC.200 = roc.obj.lof.200$auc
AUC.400 = roc.obj.lof.400$auc

text(0.87, 0.17, paste("AUC (50) = ", round(AUC.50, 4)), col = colors[1], cex = 0.7, adj = 1)
text(0.87, 0.12, paste("AUC (100) = ", round(AUC.100, 4)), col = colors[2], cex = 0.7, adj = 1)
text(0.87, 0.07, paste("AUC (200) = ", round(AUC.200, 4)), col = colors[3], cex = 0.7, adj = 1)
text(0.87, 0.02, paste("AUC (400) = ", round(AUC.400, 4)), col = colors[4], cex = 0.7, adj = 1)
```

As shown above, the best $k$ for the given data from the selection is 400, as it has the highest AUC value. However, it is advised to be careful choosing a $k$ value that is deemed to be "too close" to the number of observations within the dataset. 