Required packages

This section loads the required packages

# Loads required packages
library(dplyr)
library(readr)
library(stringr)
library(tidyr)
library(forecast)
library(Hmisc)

Executive Summary

In this assignment, we applied what we learnt during the semester on data wrangling and data pre-processing techniques. Two data sets were merged using the following key pre-processing steps:

1.Data (Part 1 of 2): The two data sets used for this assignment is related to the NBA (National Basketball Association). The NBA is an American men’s professional basketball league. It’s made up of 30 teams. It is one of the top men’s professional basketball league in the world.

They were both collected from the basketball reference site: [https://www.basketball-reference.com/] which holds all types of basketball related statistics.

Before the data sets could be merged, there’s was a few de-duplication and manipulation steps. In the detailed sections below, I’ve broken the data sections into two parts. The first part is the initial inspection and the second part was where i merged the two data sets.

2.Understand: In this section, I inspected the attributes and also looked at a few samples. I found that there were some illegal characters that needed to be stripped out. There was also some character->numeric and character-factors conversion completed here,In particular an ordered factor around player position.

3.De-duplication & Scan I: In this section, I’ve introduced a step where i de-duplicated both data sets. It is very common in the NBA that a player can be traded during the playing season. Hence the players stats will have multiple records if a player played for multiple teams, This is the same for the contracts data sets as two teams could be paying a salary. For this assignment, I’ve consolidated the players stats to be one total and also assigned it to his latest team since i wanted to use the whole season as a comparison and not in between the season.

4.Tidy & Manipulate Data I: In this section, I checked the tidiness of the data sets and found that the contracts data set had to be converted from a wide format to a long format. I used the gather function here to do the conversion. The players statistics data set was already tidy after the de-duplication and cleansing of the data in previous steps.

5.Data (Part 1 of 2): After some de-duplication and manipulation, I finally merged my two data sets in this section.

6.Tidy & Manipulate Data II: In this section, I created two new fields. player efficiency and salary per game. The formula for efficiency can be found here: [https://en.wikipedia.org/wiki/Efficiency_(basketball)] and is used across most basketball league to measure a players’ basketball efficiency using counting statistics. Salary per game was a field i was interested to see how much a player earns on average for the season.

7.Scan II - Outliers: In this section, I explored the numeric data to detect outliers and we were able to remove the outliers using a capping technique that we applied to the salary and player efficiency data.

8.Transform:

Transformed the data using a few techniques to normalised the salary data. In this section, I tried a few tehcniquesnamely Log10, Square Root and Box-Cox methods normalised the salary data. Both Log10 and Box-Cox techniques were the best at achieiving this.

Data (Part 1 of 2)

This is the detailed data section where the data set was initially looked at. The first dataset relates to NBA players salary starting from the 2019-2020 Season. The second dataset is related to NBA player total statistics for the 2019-2020 Season.

Data - Data Set 1: NBA Player Contracts 2019-2020.

Data Set 1: NBA Player Contracts 2019-2020

https://www.basketball-reference.com/contracts/players.html

The following variables are found in the dataset and defined as follows:

  • Rk - Rank of salary sorted by the 2019-2020 NBA Season.
  • Player - The player’s Name.
  • Tm - The team the players plays for.
  • 2019-20 - The Player salary’s for the 2019-2020 NBA Season.
  • 2020-21 - The Player salary’s for the 2020-2021 NBA Season.
  • 2021-22 - The Player salary’s for the 2021-2022 NBA Season.
  • 2022-23 - The Player salary’s for the 2022-2023 NBA Season.
  • 2023-24 - The Player salary’s for the 2023-2024 NBA Season.
  • 2024-25 - The Player salary’s for the 2024-2025 NBA Season.
  • Signed Using - The type of contract the player signed up for.
  • Guaranteed - The amount of a player’s remaining salary that is guaranteed.
## Load data section for data set 1: NBA Player Contracts 2019-2020 

# Create variable to hold the column names.
player_contract_header <- c("Rk","Player","Tm","2019-20","2020-21","2021-22","2022-23","2023-24","2024-25","Signed Using","Guaranteed") 

# Loading in the data via the "read_csv" command.
player_contract <- read_csv("2019_20_NBA_Player_Contracts.csv",skip=2, col_names=player_contract_header)
Parsed with column specification:
cols(
  Rk = col_double(),
  Player = col_character(),
  Tm = col_character(),
  `2019-20` = col_character(),
  `2020-21` = col_character(),
  `2021-22` = col_character(),
  `2022-23` = col_character(),
  `2023-24` = col_character(),
  `2024-25` = col_character(),
  `Signed Using` = col_character(),
  Guaranteed = col_character()
)
# Getting the dimensions of the dataset.
dim(player_contract)
[1] 568  11
# Having a peek of the dataset.
head(player_contract)

Data set 1 contains 568 observations(Rows) and 11 variables(Columns).

Data - Dataset 2: NBA Player Total Statistics for the 2019-2020 NBA Season.

Data Set 2: NBA Player Total Statistics for the 2019-2020 NBA Season.

https://www.basketball-reference.com/leagues/NBA_2020_totals.html

The following variables are found in the dataset and defined as follows:

  • Rk - Rank of salary sorted by the player’s last name in ascending order.
  • Player - The player’s Name.
  • Pos - The player’s basketball position that they play on the basketball court.
  • Age - The age of the player as of the 1st of February of the season.
  • Tm - The team the players plays for.
  • G - The number of games the player played in the Season.
  • GS - The number of games the player started in the Season.
  • MP - The amount of time the player played measured in minutes.
  • FG - The number of field goals baskets made.
  • FGA - The number of field goals baskets attempted.
  • FG% - The percentage of a player’s made field goal.
  • 3P - The number of 3 pointers field goals baskets made.
  • 3PA - The number of 3 pointers field goals baskets attempted.
  • 3P% - The percentage of a player’s 3 pointer made field goal.
  • 2P - The number of 2 point field goals baskets made.
  • 2PA - The number of 2 point field goals baskets attempted.
  • 2P% - The percentage of a player’s 2 point made field goal.
  • eFG% - The effective field goal percentage, based on the formula: (FG + 0.5 * 3P) / FGA
  • FT - The number of free throws made.
  • FTA - The number of free throws attempted.
  • FT% - The percentage of a player’s 2 point made field goal.
  • ORB - The number of offensive Rebounds.
  • DRB - The number of defensive Rebounds.
  • TRB - The number of total Rebounds.
  • AST - The number of assists.
  • STL - The number of steals.
  • BLK - The number of blocks.
  • TOV - The number of turnovers.
  • PF - The number of personal fouls.
  • PTS - The number of points.
## Load data section for data set 2: NBA Player Total Statistics for the 2019-2020 NBA Season.

# Loading in the data via the "read_csv" command.
player_stats <- read_csv("2019_20_NBA_Player_Stats_Totals.csv")
Parsed with column specification:
cols(
  .default = col_double(),
  Player = col_character(),
  Pos = col_character(),
  Tm = col_character()
)
See spec(...) for full column specifications.
# Getting the dimensions of the dataset.
dim(player_stats)
[1] 651  30
# Having a peek of the dataset.
head(player_stats)

Data set 2 contains 651 observations(Rows) and 30 variables(Columns).

We are unable to merge the 2 datasets in this section as there’s a number of conversion and tidy steps we need to do before we can merge the datasets. The mergining of the dataset will be defined in part 2 of the data section.

Understand

In this section, we seek to understand the 2 data sets and apply any conversions to the attributes of the data.

Understand - Data Set 1: NBA Player Contracts 2019-2020.

# Checking the structure of the data set
str(player_contract)
tibble [568 x 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Rk          : num [1:568] 1 2 3 4 5 6 7 8 9 10 ...
 $ Player      : chr [1:568] "Stephen Curry\\curryst01" "Chris Paul\\paulch01" "Russell Westbrook\\westbru01" "John Wall\\walljo01" ...
 $ Tm          : chr [1:568] "GSW" "OKC" "HOU" "WAS" ...
 $ 2019-20     : chr [1:568] "$40,231,758" "$38,506,482" "$38,178,000" "$37,800,000" ...
 $ 2020-21     : chr [1:568] "$43,006,362" "$41,358,814" "$41,006,000" "$40,824,000" ...
 $ 2021-22     : chr [1:568] "$45,780,966" "$44,211,146" "$43,848,000" "$43,848,000" ...
 $ 2022-23     : chr [1:568] NA NA "$46,662,000" "$46,872,000" ...
 $ 2023-24     : chr [1:568] NA NA NA NA ...
 $ 2024-25     : chr [1:568] NA NA NA NA ...
 $ Signed Using: chr [1:568] "Bird Rights" "Bird Rights" "Bird Rights" "Bird Rights" ...
 $ Guaranteed  : chr [1:568] "$129,019,086" "$79,865,296" "$123,032,000" "$122,472,000" ...
 - attr(*, "spec")=
  .. cols(
  ..   Rk = col_double(),
  ..   Player = col_character(),
  ..   Tm = col_character(),
  ..   `2019-20` = col_character(),
  ..   `2020-21` = col_character(),
  ..   `2021-22` = col_character(),
  ..   `2022-23` = col_character(),
  ..   `2023-24` = col_character(),
  ..   `2024-25` = col_character(),
  ..   `Signed Using` = col_character(),
  ..   Guaranteed = col_character()
  .. )

It appears that after a sneak peek of the dataset, we need to apply some conversions to the dataset, stripping of certain characters and also converting character fields to numeric fields. We will also rename the fields in the section to make them more readable.

# Stripping out illegal characters
# Cleaning up the "$" and "\" symbols from the salary columns.
# str_replace(player_contract,"\\$","")
pattern = c("\\$|\\,")
player_contract <- player_contract %>%
  mutate_all(function(x) gsub(pattern,"",x))

# Split out the player field into the two new fields: Player Name and Player ID
player_contract <- player_contract %>% separate(Player, c("Player_Name","Player_ID"), sep="\\\\")

# Renaming the columns
player_contract <- player_contract %>% rename(Salary_Rank = Rk)
player_contract <- player_contract %>% rename(Team = Tm)
player_contract <- player_contract %>% rename("Salary_2019-20" = "2019-20")
player_contract <- player_contract %>% rename("Salary_2020-21" = "2020-21")
player_contract <- player_contract %>% rename("Salary_2021-22" = "2021-22")
player_contract <- player_contract %>% rename("Salary_2022-23" = "2022-23")
player_contract <- player_contract %>% rename("Salary_2023-24" = "2023-24")
player_contract$`Salary_2019-20` <- as.numeric(player_contract$`Salary_2019-20`)
player_contract$`Salary_2020-21` <- as.numeric(player_contract$`Salary_2020-21`)
player_contract$`Salary_2021-22` <- as.numeric(player_contract$`Salary_2021-22`)
player_contract$`Salary_2022-23` <- as.numeric(player_contract$`Salary_2022-23`)
player_contract$`Salary_2023-24` <- as.numeric(player_contract$`Salary_2023-24`)

# Keep required columns only for the 2019-2020 Season
player_contract <- player_contract %>% 
                   select(Player_Name,Player_ID,Team,"Salary_2019-20","Salary_2020-21")

# Checking the structure of the data set after the conversions
str(player_contract)
tibble [568 x 5] (S3: tbl_df/tbl/data.frame)
 $ Player_Name   : chr [1:568] "Stephen Curry" "Chris Paul" "Russell Westbrook" "John Wall" ...
 $ Player_ID     : chr [1:568] "curryst01" "paulch01" "westbru01" "walljo01" ...
 $ Team          : chr [1:568] "GSW" "OKC" "HOU" "WAS" ...
 $ Salary_2019-20: num [1:568] 40231758 38506482 38178000 37800000 37800000 ...
 $ Salary_2020-21: num [1:568] 43006362 41358814 41006000 40824000 40824000 ...

The salary column has been converted to numeric as expected and also column names are more readable.

Understand - Dataset 2: NBA Player Total Statistics for the 2019-2020 NBA Season.

# Checking the structure of the data set
str(player_stats)
tibble [651 x 30] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Rk    : num [1:651] 1 2 3 4 5 6 7 8 9 10 ...
 $ Player: chr [1:651] "Steven Adams\\adamsst01" "Bam Adebayo\\adebaba01" "LaMarcus Aldridge\\aldrila01" "Kyle Alexander\\alexaky01" ...
 $ Pos   : chr [1:651] "C" "PF" "C" "PF" ...
 $ Age   : num [1:651] 26 22 34 23 21 24 21 27 29 26 ...
 $ Tm    : chr [1:651] "OKC" "MIA" "SAS" "MIA" ...
 $ G     : num [1:651] 63 72 53 2 47 38 70 10 18 10 ...
 $ GS    : num [1:651] 63 72 53 0 1 0 64 0 2 1 ...
 $ MP    : num [1:651] 1680 2417 1754 13 591 ...
 $ FG    : num [1:651] 283 440 391 1 98 117 302 19 25 10 ...
 $ FGA   : num [1:651] 478 790 793 2 266 251 465 44 86 38 ...
 $ FG%   : num [1:651] 0.592 0.557 0.493 0.5 0.368 0.466 0.649 0.432 0.291 0.263 ...
 $ 3P    : num [1:651] 1 2 61 0 46 57 0 5 9 6 ...
 $ 3PA   : num [1:651] 3 14 157 0 133 141 6 16 36 29 ...
 $ 3P%   : num [1:651] 0.333 0.143 0.389 NA 0.346 0.404 0 0.313 0.25 0.207 ...
 $ 2P    : num [1:651] 282 438 330 1 52 60 302 14 16 4 ...
 $ 2PA   : num [1:651] 475 776 636 2 133 110 459 28 50 9 ...
 $ 2P%   : num [1:651] 0.594 0.564 0.519 0.5 0.391 0.545 0.658 0.5 0.32 0.444 ...
 $ eFG%  : num [1:651] 0.593 0.558 0.532 0.5 0.455 0.58 0.649 0.489 0.343 0.342 ...
 $ FT    : num [1:651] 117 264 158 0 25 39 171 7 19 2 ...
 $ FTA   : num [1:651] 201 382 191 0 37 45 270 11 29 4 ...
 $ FT%   : num [1:651] 0.582 0.691 0.827 NA 0.676 0.867 0.633 0.636 0.655 0.5 ...
 $ ORB   : num [1:651] 207 176 103 2 9 8 216 2 24 1 ...
 $ DRB   : num [1:651] 376 559 289 1 75 77 455 7 63 20 ...
 $ TRB   : num [1:651] 583 735 392 3 84 85 671 9 87 21 ...
 $ AST   : num [1:651] 146 368 129 0 89 52 110 21 21 8 ...
 $ STL   : num [1:651] 51 82 36 0 17 10 40 5 18 0 ...
 $ BLK   : num [1:651] 67 93 87 0 8 2 92 2 8 6 ...
 $ TOV   : num [1:651] 94 204 74 1 54 33 77 8 17 4 ...
 $ PF    : num [1:651] 122 182 128 1 57 53 162 7 27 13 ...
 $ PTS   : num [1:651] 684 1146 1001 2 267 ...
 - attr(*, "spec")=
  .. cols(
  ..   Rk = col_double(),
  ..   Player = col_character(),
  ..   Pos = col_character(),
  ..   Age = col_double(),
  ..   Tm = col_character(),
  ..   G = col_double(),
  ..   GS = col_double(),
  ..   MP = col_double(),
  ..   FG = col_double(),
  ..   FGA = col_double(),
  ..   `FG%` = col_double(),
  ..   `3P` = col_double(),
  ..   `3PA` = col_double(),
  ..   `3P%` = col_double(),
  ..   `2P` = col_double(),
  ..   `2PA` = col_double(),
  ..   `2P%` = col_double(),
  ..   `eFG%` = col_double(),
  ..   FT = col_double(),
  ..   FTA = col_double(),
  ..   `FT%` = col_double(),
  ..   ORB = col_double(),
  ..   DRB = col_double(),
  ..   TRB = col_double(),
  ..   AST = col_double(),
  ..   STL = col_double(),
  ..   BLK = col_double(),
  ..   TOV = col_double(),
  ..   PF = col_double(),
  ..   PTS = col_double()
  .. )
# Stripping out illegal characters
# Cleaning up the "$" and "\" symbols from the salary columns.
# str_replace(player_contract,"\\$","")
pattern = c("\\$|\\,")
player_stats <- player_stats %>%
  mutate_all(function(x) gsub(pattern,"",x))

# Split out the player field into the two new fields: Player Name and Player ID
player_stats <- player_stats %>% separate(Player, c("Player_Name","Player_ID"), sep="\\\\")

# Renaming the columns
player_stats <- player_stats %>% rename(Player_Rank = Rk) 
player_stats <- player_stats %>% rename(Position = Pos)
player_stats <- player_stats %>% rename(Team = Tm)
player_stats <- player_stats %>% rename(Games_played = G)
player_stats <- player_stats %>% rename(Games_started = GS)
player_stats <- player_stats %>% rename(minutes_played = MP)
player_stats <- player_stats %>% rename(field_goals = FG)
player_stats <- player_stats %>% rename(field_goals_attempted = FGA)
player_stats <- player_stats %>% rename(three_pointers = "3P")
player_stats <- player_stats %>% rename(three_pointers_attempted = "3PA")
player_stats <- player_stats %>% rename(two_pointers = "2P")
player_stats <- player_stats %>% rename(two_pointers_attempted = "2PA")
player_stats <- player_stats %>% rename(free_throws = FT)
player_stats <- player_stats %>% rename(free_throws_attempted = FTA)
player_stats <- player_stats %>% rename(offensive_rebounds = ORB)
player_stats <- player_stats %>% rename(defensive_rebounds = DRB)
player_stats <- player_stats %>% rename(total_rebounds = TRB)
player_stats <- player_stats %>% rename(assists = AST)
player_stats <- player_stats %>% rename(steals = STL)
player_stats <- player_stats %>% rename(blocks = BLK)
player_stats <- player_stats %>% rename(turnovers = TOV)
player_stats <- player_stats %>% rename(personal_fouls = PF)
player_stats <- player_stats %>% rename(points = PTS)

# Keep required columns only for the 2019-2020 Season
player_stats <- player_stats %>% 
                   select(Player_Rank, Player_Name,Player_ID,Position,Team,Games_played,
                          Games_started,minutes_played,field_goals,field_goals_attempted,
                          three_pointers,three_pointers_attempted,two_pointers,
                          two_pointers_attempted,free_throws,free_throws_attempted,
                          offensive_rebounds,defensive_rebounds,total_rebounds,
                          assists,steals,blocks,turnovers,personal_fouls,points)

# Converting various character fields to numeric fields
player_stats$Games_played <- as.numeric(player_stats$Games_played)
# player_stats[,5:24] <- as.numeric(player_stats[,5:24])

player_stats[6:25] <- lapply(player_stats[6:25], as.numeric)

# Converting the position field to be an order factor field
player_stats$Position <- factor(player_stats$Position, 
                                levels= c("PG", "SG", "SF", "PF", "C", "Other"), 
                                labels = c("PG", "SG", "SF", "PF", "C", "Other"), 
                                ordered = TRUE)

# Converting the rank field to be an order factor field
player_stats$Player_Rank <- as.factor(player_stats$Player_Rank)

# Checking the structure of the data set
str(player_stats)
tibble [651 x 25] (S3: tbl_df/tbl/data.frame)
 $ Player_Rank             : Factor w/ 651 levels "1","10","100",..: 1 112 223 334 445 556 619 630 641 2 ...
 $ Player_Name             : chr [1:651] "Steven Adams" "Bam Adebayo" "LaMarcus Aldridge" "Kyle Alexander" ...
 $ Player_ID               : chr [1:651] "adamsst01" "adebaba01" "aldrila01" "alexaky01" ...
 $ Position                : Ord.factor w/ 6 levels "PG"<"SG"<"SF"<..: 5 4 5 4 2 2 5 2 4 3 ...
 $ Team                    : chr [1:651] "OKC" "MIA" "SAS" "MIA" ...
 $ Games_played            : num [1:651] 63 72 53 2 47 38 70 10 18 10 ...
 $ Games_started           : num [1:651] 63 72 53 0 1 0 64 0 2 1 ...
 $ minutes_played          : num [1:651] 1680 2417 1754 13 591 ...
 $ field_goals             : num [1:651] 283 440 391 1 98 117 302 19 25 10 ...
 $ field_goals_attempted   : num [1:651] 478 790 793 2 266 251 465 44 86 38 ...
 $ three_pointers          : num [1:651] 1 2 61 0 46 57 0 5 9 6 ...
 $ three_pointers_attempted: num [1:651] 3 14 157 0 133 141 6 16 36 29 ...
 $ two_pointers            : num [1:651] 282 438 330 1 52 60 302 14 16 4 ...
 $ two_pointers_attempted  : num [1:651] 475 776 636 2 133 110 459 28 50 9 ...
 $ free_throws             : num [1:651] 117 264 158 0 25 39 171 7 19 2 ...
 $ free_throws_attempted   : num [1:651] 201 382 191 0 37 45 270 11 29 4 ...
 $ offensive_rebounds      : num [1:651] 207 176 103 2 9 8 216 2 24 1 ...
 $ defensive_rebounds      : num [1:651] 376 559 289 1 75 77 455 7 63 20 ...
 $ total_rebounds          : num [1:651] 583 735 392 3 84 85 671 9 87 21 ...
 $ assists                 : num [1:651] 146 368 129 0 89 52 110 21 21 8 ...
 $ steals                  : num [1:651] 51 82 36 0 17 10 40 5 18 0 ...
 $ blocks                  : num [1:651] 67 93 87 0 8 2 92 2 8 6 ...
 $ turnovers               : num [1:651] 94 204 74 1 54 33 77 8 17 4 ...
 $ personal_fouls          : num [1:651] 122 182 128 1 57 53 162 7 27 13 ...
 $ points                  : num [1:651] 684 1146 1001 2 267 ...

De-duplication & Scan I

In this section, we seek to de-duplicate the data in the two data sets and also scan for missing and/or special values and also any obvious errors.

De-duplication & Scan I - Data Set 1: NBA Player Contracts 2019-2020.

# De-duplication Process

# Sample a record to understand the duplicate i.e Pau Gasol
sum(duplicated(player_contract$Player_Name))
[1] 42
# Which rows are duplicates?
which(duplicated(player_contract$Player_Name))
 [1] 290 292 296 348 369 441 446 447 470 471 474 475 476 478 481 485 487 488 490 504 505 507 512 514
[25] 516 522 523 526 528 529 534 535 536 537 542 544 545 549 551 553 556 563
# Let's pick on Pau Gasol
which(player_contract$Player_Name=="Pau Gasol")
[1] 198 290
# Get both his records to see why it's duplicated
player_contract %>% filter(player_contract$Player_Name=="Pau Gasol")
# Separate out the duplicate players
tbl_player_name <- table(player_contract$Player_Name)
df_dup_player_name <- player_contract[player_contract$Player_Name %in% names(tbl_player_name)[tbl_player_name>1],]

# Consolidate the duplicate records by summing up the salary
stage_salary <- df_dup_player_name %>% group_by(Player_ID)  %>%
  summarise(`Salary_2019-20` = sum(`Salary_2019-20`),
            `Salary_2020-21` = sum(`Salary_2020-21`)) %>% ungroup()

# Create a dataset to obtain the player's last team that he played for
stage_lastrecord <- df_dup_player_name %>%
  group_by(Player_Name) %>%
  slice(n()) %>%
  ungroup()

# Refine the selected Columns
lastrecord_select <- select(stage_lastrecord, Player_Name,Player_ID,Team)

# Finalised the duplicate records with the consolidated slary and final team that the player played for:
df_dup_player_name_final <- stage_salary %>% inner_join(lastrecord_select, by="Player_ID") %>% select(Player_Name,Player_ID,Team,`Salary_2019-20`,`Salary_2020-21`)


# Separate out the non-duplicate players
df_non_dup_player_name <- player_contract[player_contract$Player_Name %in% names(tbl_player_name)[tbl_player_name<2],]
df_non_dup_player_name <- df_non_dup_player_name %>% select(Player_Name,Player_ID,Team,`Salary_2019-20`,`Salary_2020-21`)
# Combining the two datasets 
player_contract_final <- union(df_dup_player_name_final,df_non_dup_player_name)

# Sort the data set by highest salary
player_contract_final <- player_contract_final %>% arrange(.,desc(`Salary_2019-20`))

# Add a "Ranking" field to the salary. We will name it Salary_Rank
player_contract_final <- tibble::rowid_to_column(player_contract_final, "Salary_Rank")

# Convert the rank field to be a factor
player_contract_final$Salary_Rank <- as.factor(player_contract_final$Salary_Rank)

# Assign the updated dataset to the original one
player_contract <- player_contract_final

# Check for duplicates
sum(duplicated(player_contract$Player_Name))
[1] 0
sum(duplicated(player_contract$Player_ID))
[1] 0
# Check the structure
str(player_contract)
tibble [526 x 6] (S3: tbl_df/tbl/data.frame)
 $ Salary_Rank   : Factor w/ 526 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Player_Name   : chr [1:526] "Stephen Curry" "Chris Paul" "Russell Westbrook" "John Wall" ...
 $ Player_ID     : chr [1:526] "curryst01" "paulch01" "westbru01" "walljo01" ...
 $ Team          : chr [1:526] "GSW" "OKC" "HOU" "WAS" ...
 $ Salary_2019-20: num [1:526] 40231758 38506482 38178000 37800000 37800000 ...
 $ Salary_2020-21: num [1:526] 43006362 41358814 41006000 40824000 40824000 ...

In this section i checked for any missing values (NA’s), infinite or NaN values.

# Check for NA's
sapply(player_contract, function(x) sum(is.na(x)))
   Salary_Rank    Player_Name      Player_ID           Team Salary_2019-20 Salary_2020-21 
             0              0              0              0              0            193 
# Check for infinites
sapply(player_contract, function(x) sum(is.infinite(x)))
   Salary_Rank    Player_Name      Player_ID           Team Salary_2019-20 Salary_2020-21 
             0              0              0              0              0              0 
# Check for NaN
sapply(player_contract, function(x) sum(is.nan(x)))
   Salary_Rank    Player_Name      Player_ID           Team Salary_2019-20 Salary_2020-21 
             0              0              0              0              0              0 

Looks like we found some NA’s in the salary data for 2020-2021. Since the data is numeric, taking the average salary is an option that will be used to impute into the NA’s


# Imputes the mean for the missing values
player_contract$`Salary_2020-21` <- impute(player_contract$`Salary_2020-21`, fun=mean(player_contract$`Salary_2020-21`, na.rm = TRUE)) 

# Check missing value after imputation
sapply(player_contract, function(x) sum(is.na(x)))
   Salary_Rank    Player_Name      Player_ID           Team Salary_2019-20 Salary_2020-21 
             0              0              0              0              0              0 
colSums(is.na(player_contract))
   Salary_Rank    Player_Name      Player_ID           Team Salary_2019-20 Salary_2020-21 
             0              0              0              0              0              0 

De-duplication & Scan I - Dataset 2: NBA Player Total Statistics for the 2019-2020 NBA Season.

# De-duplication Process

# Sample a record to understand the duplicate i.e Trevor Ariza
sum(duplicated(player_stats$Player_Name))
[1] 122
# Which row numbers has the duplicates?
which(duplicated(player_stats$Player_Name))
  [1]  20  21  33  34  38  39  43  44  47  48  51  52  64  65  94  95  97  98 102 103 109 110 117 118
 [25] 124 125 128 129 132 133 144 145 147 148 150 151 155 156 161 162 167 168 174 175 185 186 194 195
 [49] 198 199 201 202 218 219 236 237 246 247 254 255 260 261 274 275 278 279 287 288 317 318 328 329
 [73] 333 334 347 348 352 353 372 373 413 414 415 435 436 438 439 452 453 460 461 482 483 518 519 521
 [97] 522 532 533 562 563 566 567 578 579 580 590 591 595 596 600 601 606 607 613 614 621 622 629 630
[121] 636 637
# Let see which row is Trevor Ariza
which(player_stats$Player_Name=="Trevor Ariza")
[1] 19 20 21
# Have a look at the full records from Trevor Ariza
player_stats %>% filter(player_stats$Player_Name=="Trevor Ariza")

# Separate out the duplicate players
tbl_player_stats_name <- table(player_stats$Player_Name)
df_dup_player_stats_name <- player_stats[player_stats$Player_Name %in% names(tbl_player_stats_name)[tbl_player_stats_name>1],]

# Remove the rows where the team equals "TOT". This row is redundant
stage_player_stats <- df_dup_player_stats_name %>% filter(df_dup_player_stats_name$Team != "TOT")

# Consolidate the duplicate records by summing up the counting stats
stage_player_stats2 <- stage_player_stats  %>% group_by(Player_ID)  %>% summarise(
                    `Games_played` = sum(`Games_played`),
                    `Games_started` = sum(`Games_started`),
                    `minutes_played` = sum(`minutes_played`),
                    `field_goals` = sum(`field_goals`),
                    `field_goals_attempted` = sum(`field_goals_attempted`),
                    `two_pointers` = sum(`two_pointers`),
                    `two_pointers_attempted` = sum(`two_pointers_attempted`),
                    `three_pointers` = sum(`three_pointers`),
                    `three_pointers_attempted` = sum(`three_pointers_attempted`),
                    `free_throws` = sum(`free_throws`),
                    `free_throws_attempted` = sum(`free_throws_attempted`),
                    `offensive_rebounds` = sum(`offensive_rebounds`),
                    `defensive_rebounds` = sum(`defensive_rebounds`),
                    `total_rebounds` = sum(`total_rebounds`),
                    `assists` = sum(`assists`),
                    `steals` = sum(`steals`),
                    `blocks` = sum(`blocks`),
                    `turnovers` = sum(`turnovers`),
                    `personal_fouls` = sum(`personal_fouls`),
                    `points` = sum(`points`)
            ) %>% ungroup()
`summarise()` ungrouping output (override with `.groups` argument)
# Get the last row for each player as we need to get their last position team. We do this by using the slice function
last_team_player_stats <- stage_player_stats %>%
  group_by(Player_Name) %>%
  slice(n()) %>%
  ungroup()

# Keeping only the required columns
last_team_player_stats <- last_team_player_stats %>% select(Player_Name,Player_ID,Position, Team)

# Joining the the two stage datasets to complete the de-duplication process.
stage_player_stats3 <- last_team_player_stats %>% inner_join(stage_player_stats2, by="Player_ID") 
df_dup_player_stats_name <- stage_player_stats3

# Separate out the non-duplicate players
df_non_dup_player_stats_name <- player_stats[player_stats$Player_Name %in% names(tbl_player_stats_name)[tbl_player_stats_name<2],]

# Removing the plyer-rank column
df_non_dup_player_stats_name <- df_non_dup_player_stats_name %>% select(-Player_Rank)

# Check to see if the there's any duplicates in the non-duplicate players stats dataset. Expected results should be zero
sum(duplicated(df_non_dup_player_stats_name$Player_Name))
[1] 0
sum(duplicated(df_non_dup_player_stats_name$Player_ID))
[1] 0
# Combining the two datasets 
player_stats_final <- union(df_dup_player_stats_name,df_non_dup_player_stats_name)

# Check to see if the there's any duplicates in the combined players stats dataset. Expected results should be zero
sum(duplicated(player_stats_final$Player_Name))
[1] 0
sum(duplicated(player_stats_final$Player_ID))
[1] 0

Check for any NA’s and Special Values

# Check for NA's
sapply(player_stats_final, function(x) sum(is.na(x)))
             Player_Name                Player_ID                 Position                     Team 
                       0                        0                        0                        0 
            Games_played            Games_started           minutes_played              field_goals 
                       0                        0                        0                        0 
   field_goals_attempted             two_pointers   two_pointers_attempted           three_pointers 
                       0                        0                        0                        0 
three_pointers_attempted              free_throws    free_throws_attempted       offensive_rebounds 
                       0                        0                        0                        0 
      defensive_rebounds           total_rebounds                  assists                   steals 
                       0                        0                        0                        0 
                  blocks                turnovers           personal_fouls                   points 
                       0                        0                        0                        0 
# Check for infinites
sapply(player_stats_final, function(x) sum(is.infinite(x)))
             Player_Name                Player_ID                 Position                     Team 
                       0                        0                        0                        0 
            Games_played            Games_started           minutes_played              field_goals 
                       0                        0                        0                        0 
   field_goals_attempted             two_pointers   two_pointers_attempted           three_pointers 
                       0                        0                        0                        0 
three_pointers_attempted              free_throws    free_throws_attempted       offensive_rebounds 
                       0                        0                        0                        0 
      defensive_rebounds           total_rebounds                  assists                   steals 
                       0                        0                        0                        0 
                  blocks                turnovers           personal_fouls                   points 
                       0                        0                        0                        0 
# Check for NaN
sapply(player_stats_final, function(x) sum(is.nan(x)))
             Player_Name                Player_ID                 Position                     Team 
                       0                        0                        0                        0 
            Games_played            Games_started           minutes_played              field_goals 
                       0                        0                        0                        0 
   field_goals_attempted             two_pointers   two_pointers_attempted           three_pointers 
                       0                        0                        0                        0 
three_pointers_attempted              free_throws    free_throws_attempted       offensive_rebounds 
                       0                        0                        0                        0 
      defensive_rebounds           total_rebounds                  assists                   steals 
                       0                        0                        0                        0 
                  blocks                turnovers           personal_fouls                   points 
                       0                        0                        0                        0 

There a no missing values(NA’s), infinite or NaN values in the players stats data set

Tidy & Manipulate Data I

Tidy & Manipulate Data I - Data Set 1: NBA Player Contracts 2019-2020.

The contracts data set was found to break the following tidy principle:

  • Each observation has it’s own row

The contracts data set is untidy as each observation for a salary does not form it’s own row and the salary years are across the columns.To tidy the dataset up, we will use the gather function from the tidyverse to change the data from a wide format to a long format.

#Tidying the contracts data set by using the gather() function
player_contract_final <- player_contract_final %>% gather(Salary_Year,Salary,5:6)

# Updating the value to reflect the year
player_contract_final$Salary_Year <- player_contract_final$Salary_Year %>% str_replace("Salary_2019-20", "2019-2020")
player_contract_final$Salary_Year <- player_contract_final$Salary_Year %>% str_replace("Salary_2020-21", "2020-2021")

Tidy & Manipulate Data I - Dataset 2: NBA Player Total Statistics for the 2019-2020 NBA Season.

The players stats dataset was tidy.

It conforms to these standards: * Each variable must have its own column. * Each observation must have its own row. * Each value must have its own cell.

Data (Part 2)

We can now merge the 2 datasets after the de-duplication process. The new merged data set will be player performance

# Filtering the contract data as we only want the current season.
player_contract_filter <- player_contract_final %>% filter(player_contract_final$Salary_Year=="2019-2020")

# Creating the merged dataset named player_performance.
# using an inner join as there a some players that have a contract but no stats, and also there's a stats for players that have been paid but have not been captured by the contracts dataset.
player_performance <- player_stats_final %>% inner_join(player_contract_filter, by="Player_ID")

# Removing the duplication columns and cleaning up the column names
player_performance <- player_performance %>% select(-Team.y,-Player_Name.y,-Salary_Year)
player_performance <- player_performance %>% rename(Player_Name = Player_Name.x)
player_performance <- player_performance %>% rename(Team = Team.x)

# Check for any dups after the merged
sum(duplicated(player_performance$Player_Name))
[1] 0
sum(duplicated(player_performance$Player_ID))
[1] 0

Tidy & Manipulate Data II

In this section, after the data sets were merged, I created the two new fields: player_efficiency & Salary_per_game

# Creating the 2 new fields using mutate
player_performance_new <- player_performance %>% mutate(.,
                                                        player_efficiency = (points+
                                                                            total_rebounds+
                                                                            assists+
                                                                            steals+
                                                                            blocks-
                                                                            (field_goals_attempted-field_goals)-
                                                                            (free_throws_attempted-free_throws)-
                                                                            turnovers)/Games_played
                                                        ,Salary_per_game= Salary/Games_played
                                                                            )
# Sorting the Dataset
player_performance_new <- player_performance_new %>% arrange(.,desc(player_efficiency))

# Selected the required columns
player_performance_new <- player_performance_new %>% select(Player_Name,Player_ID,Position,Team,Games_played,Games_started,minutes_played,
                                  Salary_Rank,Salary,player_efficiency,Salary_per_game)
# The new merged dataset
head(player_performance_new)
str(player_performance_new)
tibble [479 x 11] (S3: tbl_df/tbl/data.frame)
 $ Player_Name      : chr [1:479] "Giannis Antetokounmpo" "James Harden" "Luka Don?i?" "Karl-Anthony Towns" ...
 $ Player_ID        : chr [1:479] "antetgi01" "hardeja01" "doncilu01" "townska01" ...
 $ Position         : Ord.factor w/ 6 levels "PG"<"SG"<"SF"<..: 4 2 1 5 4 1 1 5 3 1 ...
 $ Team             : chr [1:479] "MIL" "HOU" "DAL" "MIN" ...
 $ Games_played     : num [1:479] 63 68 61 35 62 67 66 73 57 20 ...
 $ Games_started    : num [1:479] 63 68 61 35 62 67 66 73 57 20 ...
 $ minutes_played   : num [1:479] 1917 2483 2047 1187 2131 ...
 $ Salary_Rank      : Factor w/ 526 levels "1","2","3","4",..: 42 5 158 33 35 6 21 38 14 17 ...
 $ Salary           : num [1:479] 25842697 37800000 7683360 27250000 27093019 ...
 $ player_efficiency: num [1:479] 34.6 32.6 30.8 30.6 29.8 ...
 $ Salary_per_game  : num [1:479] 410202 555882 125957 778571 436984 ...

Scan II

In this section,I scanned the numeric data set for any outliers.

# Check for Outliers using boxplots
df_outliers <- player_performance_new %>% select_if(is.numeric)
for (i in 1:6) {
  boxplot(df_outliers[c(i)], xlab = colnames(df_outliers)[i])
}

NA

After running the boxplots to dettect outliers, It appears that the salary field and newly created player efficiency field has a few outliers that we need to address.The salary per game field is based of salary so in the next section i’ll just look at capping salary.

# Cap Function
cap <- function(x){
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}

# Apply the capping
Salary <- player_performance_new$Salary %>% cap()
player_efficiency <- player_performance_new$player_efficiency %>% cap()

# Let's Recreate the boxplots to ensure that's it's all fixed
par(mfrow = c(2,2))
boxplot(Salary, xlab = "Salary ($)")
boxplot(player_efficiency, xlab = "Player Efficiency Score")

As you can see in the boxplots after the capping that it did a good job of removing all but one outlier. Looking into this lone outlier, it was steph curry who got injured 5 games into the season. For those 5 games, he earned $8,046,351.6 (8mil) per game for the 2019-2020. How’s that for a pay day.

Transform

In this section, We used to the salary field to do the transformation. We will use a combination of logarithms, square root and boxcox techniques to apply and then we will choose the best technique to use from the transformation.

# plot the histogram first
hist(player_performance_new$Salary)

The standard histogram has a right skewed effect to it.

# Using the square root technique
salary_square <- sqrt(player_performance_new$Salary)
hist(salary_square)

# Using the logarithm10 technique
Log <- log10(player_performance_new$Salary)
hist(Log)

# Using the Box-Cox technique
boxcox_hist <- BoxCox(player_performance_new$Salary, lambda = "auto")
hist(boxcox_hist)

Both Log10 and Box-Cox methods normalises the salary data.


---
title: "MATH2349 Data Wrangling"
author: "Daniel Tran Nguyen (s9707655)"
subtitle: Assignment 2
output:
  html_notebook: default
  pdf_document: default
  html_document:
    df_print: paged
---

## Required packages 

This section loads the required packages 

```{r}
# Loads required packages
library(dplyr)
library(readr)
library(stringr)
library(tidyr)
library(forecast)
library(Hmisc)
```

## Executive Summary 

In this assignment, we applied what we learnt during the semester on data wrangling and data pre-processing techniques. Two data sets were merged using the following key pre-processing steps:

1.Data (Part 1 of 2):
The two data sets used for this assignment is related to the NBA (National Basketball Association). The NBA is an American men's professional basketball league. It's made up of 30 teams. It is one of the top men's professional basketball league in the world.

They were both collected from the basketball reference site: [https://www.basketball-reference.com/] which holds all types of basketball related statistics.

Before the data sets could be merged, there's was a few de-duplication and manipulation steps. In the detailed sections below, I've broken the data sections into two parts. The first part is the initial inspection and the second part was where i merged the two data sets. 

2.Understand:
In this section, I inspected the attributes and also looked at a few samples. I found that there were some illegal characters that needed to be stripped out. There was also some character->numeric and character-factors conversion completed here,In particular an ordered factor around player position.

3.De-duplication & Scan I:
In this section, I've introduced a step where i de-duplicated both data sets.
It is very common in the NBA that a player can be traded during the playing season. Hence the players stats will have multiple records if a player played for multiple teams, This is the same for the contracts data sets as two teams could be paying a salary. For this assignment, I've consolidated the players stats to be one total and also assigned it to his latest team since i wanted to use the whole season as a comparison and not in between the season. 

4.Tidy & Manipulate Data I:
In this section, I checked the tidiness of the data sets and found that the contracts data set had to be converted from a wide format to a long format. I used the gather function here to do the conversion.
The players statistics data set was already tidy after the de-duplication and cleansing of the data in previous steps.

5.Data (Part 1 of 2):
After some de-duplication and manipulation, I finally merged my two data sets in this section.

6.Tidy & Manipulate Data II:
In this section, I created two new fields. player efficiency and salary per game.
The formula for efficiency can be found here: [https://en.wikipedia.org/wiki/Efficiency_(basketball)] and is used across most basketball league to measure a players' basketball efficiency using counting statistics. 
Salary per game was a field i was interested to see how much a player earns on average for the season.

7.Scan II - Outliers: 
In this section, I explored the numeric data to detect outliers and we were able to remove the outliers using a capping technique that we applied to the salary and player efficiency data.

8.Transform:

Transformed the data using a few techniques to normalised the salary data.
In this section, I tried a few tehcniquesnamely Log10, Square Root and Box-Cox methods normalised the salary data.
Both Log10 and Box-Cox techniques were the best at achieiving this.  

\newpage
## Data (Part 1 of 2)

This is the detailed data section where the data set was initially looked at.
The first dataset relates to NBA players salary starting from the 2019-2020 Season.
The second dataset is related to NBA player total statistics for the 2019-2020 Season.

### Data - Data Set 1: NBA Player Contracts 2019-2020. 

Data Set 1: NBA Player Contracts 2019-2020 

https://www.basketball-reference.com/contracts/players.html

The following variables are found in the dataset and defined as follows: 

* Rk - Rank of salary sorted by the 2019-2020 NBA Season.
* Player - The player's Name.
* Tm - The team the players plays for.	
* 2019-20	- The Player salary's for the 2019-2020 NBA Season.
* 2020-21	- The Player salary's for the 2020-2021 NBA Season.	
* 2021-22	- The Player salary's for the 2021-2022 NBA Season.
* 2022-23	- The Player salary's for the 2022-2023 NBA Season.
* 2023-24	- The Player salary's for the 2023-2024 NBA Season.
* 2024-25	- The Player salary's for the 2024-2025 NBA Season.
* Signed Using - 	The type of contract the player signed up for.
* Guaranteed - The amount of a player's remaining salary that is guaranteed.

```{r}
## Load data section for data set 1: NBA Player Contracts 2019-2020 

# Create variable to hold the column names.
player_contract_header <- c("Rk","Player","Tm","2019-20","2020-21","2021-22","2022-23","2023-24","2024-25","Signed Using","Guaranteed") 

# Loading in the data via the "read_csv" command.
player_contract <- read_csv("2019_20_NBA_Player_Contracts.csv",skip=2, col_names=player_contract_header)

# Getting the dimensions of the dataset.
dim(player_contract)

# Having a peek of the dataset.
head(player_contract)
```
Data set 1 contains 568 observations(Rows) and 11 variables(Columns).

### Data - Dataset 2: NBA Player Total Statistics for the 2019-2020 NBA Season.

Data Set 2: NBA Player Total Statistics for the 2019-2020 NBA Season.

https://www.basketball-reference.com/leagues/NBA_2020_totals.html

The following variables are found in the dataset and defined as follows: 

* Rk - Rank of salary sorted by the player's last name in ascending order.
* Player - The player's Name.
* Pos	- The player's basketball position that they play on the basketball court.
* Age	- The age of the player as of the 1st of February of the season.
* Tm - The team the players plays for.		
* G	- The number of games the player played in the Season.
* GS - The number of games the player started in the Season.	
* MP - The amount of time the player played measured in minutes.	
* FG - The number of field goals baskets made.
* FGA - The number of field goals baskets attempted.	
* FG%	- The percentage of a player's made field goal.
* 3P - The number of 3 pointers field goals baskets made.	
* 3PA	- The number of 3 pointers field goals baskets attempted.
* 3P% - The percentage of a player's 3 pointer made field goal.	
* 2P - The number of 2 point field goals baskets made.	
* 2PA - The number of 2 point field goals baskets attempted.	
* 2P%	- The percentage of a player's 2 point made field goal. 
* eFG% - The effective field goal percentage, based on the formula: (FG + 0.5 * 3P) / FGA	
* FT - The number of free throws made.	
* FTA	- The number of free throws attempted.	
* FT%	- The percentage of a player's 2 point made field goal. 
* ORB - The number of offensive Rebounds.
* DRB	- The number of defensive Rebounds.
* TRB - The number of total Rebounds.	
* AST	- The number of assists.
* STL - The number of steals.	
* BLK - The number of blocks.	
* TOV - The number of turnovers.	
* PF - The number of personal fouls.	
* PTS - The number of points.	

```{r}
## Load data section for data set 2: NBA Player Total Statistics for the 2019-2020 NBA Season.

# Loading in the data via the "read_csv" command.
player_stats <- read_csv("2019_20_NBA_Player_Stats_Totals.csv")

# Getting the dimensions of the dataset.
dim(player_stats)

# Having a peek of the dataset.
head(player_stats)
```
Data set 2 contains 651 observations(Rows) and 30 variables(Columns).

We are unable to merge the 2 datasets in this section as there's a number of conversion and tidy steps we need to do before we can merge the datasets. The mergining of the dataset will be defined in part 2 of the data section.

\newpage
## Understand 

In this section, we seek to understand the 2 data sets and apply any conversions to the attributes of the data.

### Understand - Data Set 1: NBA Player Contracts 2019-2020. 

```{r}
# Checking the structure of the data set
str(player_contract)
```
It appears that after a sneak peek of the dataset, we need to apply some conversions to the dataset, stripping of certain characters and also converting character fields to numeric fields. We will also rename the fields in the section to make them more readable.

```{r}
# Stripping out illegal characters
# Cleaning up the "$" and "\" symbols from the salary columns.
# str_replace(player_contract,"\\$","")
pattern = c("\\$|\\,")
player_contract <- player_contract %>%
  mutate_all(function(x) gsub(pattern,"",x))

# Split out the player field into the two new fields: Player Name and Player ID
player_contract <- player_contract %>% separate(Player, c("Player_Name","Player_ID"), sep="\\\\")

# Renaming the columns
player_contract <- player_contract %>% rename(Salary_Rank = Rk)
player_contract <- player_contract %>% rename(Team = Tm)
player_contract <- player_contract %>% rename("Salary_2019-20" = "2019-20")
player_contract <- player_contract %>% rename("Salary_2020-21" = "2020-21")
player_contract <- player_contract %>% rename("Salary_2021-22" = "2021-22")
player_contract <- player_contract %>% rename("Salary_2022-23" = "2022-23")
player_contract <- player_contract %>% rename("Salary_2023-24" = "2023-24")
player_contract$`Salary_2019-20` <- as.numeric(player_contract$`Salary_2019-20`)
player_contract$`Salary_2020-21` <- as.numeric(player_contract$`Salary_2020-21`)
player_contract$`Salary_2021-22` <- as.numeric(player_contract$`Salary_2021-22`)
player_contract$`Salary_2022-23` <- as.numeric(player_contract$`Salary_2022-23`)
player_contract$`Salary_2023-24` <- as.numeric(player_contract$`Salary_2023-24`)

# Keep required columns only for the 2019-2020 Season
player_contract <- player_contract %>% 
                   select(Player_Name,Player_ID,Team,"Salary_2019-20","Salary_2020-21")

# Checking the structure of the data set after the conversions
str(player_contract)

```
The salary column has been converted to numeric as expected and also column names are more readable.

### Understand - Dataset 2: NBA Player Total Statistics for the 2019-2020 NBA Season.

```{r}
# Checking the structure of the data set
str(player_stats)
```

```{r}
# Stripping out illegal characters
# Cleaning up the "$" and "\" symbols from the salary columns.
# str_replace(player_contract,"\\$","")
pattern = c("\\$|\\,")
player_stats <- player_stats %>%
  mutate_all(function(x) gsub(pattern,"",x))

# Split out the player field into the two new fields: Player Name and Player ID
player_stats <- player_stats %>% separate(Player, c("Player_Name","Player_ID"), sep="\\\\")

# Renaming the columns
player_stats <- player_stats %>% rename(Player_Rank = Rk) 
player_stats <- player_stats %>% rename(Position = Pos)
player_stats <- player_stats %>% rename(Team = Tm)
player_stats <- player_stats %>% rename(Games_played = G)
player_stats <- player_stats %>% rename(Games_started = GS)
player_stats <- player_stats %>% rename(minutes_played = MP)
player_stats <- player_stats %>% rename(field_goals = FG)
player_stats <- player_stats %>% rename(field_goals_attempted = FGA)
player_stats <- player_stats %>% rename(three_pointers = "3P")
player_stats <- player_stats %>% rename(three_pointers_attempted = "3PA")
player_stats <- player_stats %>% rename(two_pointers = "2P")
player_stats <- player_stats %>% rename(two_pointers_attempted = "2PA")
player_stats <- player_stats %>% rename(free_throws = FT)
player_stats <- player_stats %>% rename(free_throws_attempted = FTA)
player_stats <- player_stats %>% rename(offensive_rebounds = ORB)
player_stats <- player_stats %>% rename(defensive_rebounds = DRB)
player_stats <- player_stats %>% rename(total_rebounds = TRB)
player_stats <- player_stats %>% rename(assists = AST)
player_stats <- player_stats %>% rename(steals = STL)
player_stats <- player_stats %>% rename(blocks = BLK)
player_stats <- player_stats %>% rename(turnovers = TOV)
player_stats <- player_stats %>% rename(personal_fouls = PF)
player_stats <- player_stats %>% rename(points = PTS)

# Keep required columns only for the 2019-2020 Season
player_stats <- player_stats %>% 
                   select(Player_Rank, Player_Name,Player_ID,Position,Team,Games_played,
                          Games_started,minutes_played,field_goals,field_goals_attempted,
                          three_pointers,three_pointers_attempted,two_pointers,
                          two_pointers_attempted,free_throws,free_throws_attempted,
                          offensive_rebounds,defensive_rebounds,total_rebounds,
                          assists,steals,blocks,turnovers,personal_fouls,points)

# Converting various character fields to numeric fields
player_stats$Games_played <- as.numeric(player_stats$Games_played)
# player_stats[,5:24] <- as.numeric(player_stats[,5:24])

player_stats[6:25] <- lapply(player_stats[6:25], as.numeric)

# Converting the position field to be an order factor field
player_stats$Position <- factor(player_stats$Position, 
                                levels= c("PG", "SG", "SF", "PF", "C", "Other"), 
                                labels = c("PG", "SG", "SF", "PF", "C", "Other"), 
                                ordered = TRUE)

# Converting the rank field to be an order factor field
player_stats$Player_Rank <- as.factor(player_stats$Player_Rank)

# Checking the structure of the data set
str(player_stats)
```
\newpage
##	De-duplication & Scan I 

In this section, we seek to de-duplicate the data in the two data sets and also scan for missing and/or special values and also any obvious errors.

### De-duplication & Scan I  - Data Set 1: NBA Player Contracts 2019-2020. 

```{r}
# De-duplication Process

# Sample a record to understand the duplicate i.e Pau Gasol
sum(duplicated(player_contract$Player_Name))

# Which rows are duplicates?
which(duplicated(player_contract$Player_Name))

# Let's pick on Pau Gasol
which(player_contract$Player_Name=="Pau Gasol")

# Get both his records to see why it's duplicated
player_contract %>% filter(player_contract$Player_Name=="Pau Gasol")
```

```{r}
# Separate out the duplicate players
tbl_player_name <- table(player_contract$Player_Name)
df_dup_player_name <- player_contract[player_contract$Player_Name %in% names(tbl_player_name)[tbl_player_name>1],]

# Consolidate the duplicate records by summing up the salary
stage_salary <- df_dup_player_name %>% group_by(Player_ID)  %>%
  summarise(`Salary_2019-20` = sum(`Salary_2019-20`),
            `Salary_2020-21` = sum(`Salary_2020-21`)) %>% ungroup()

# Create a dataset to obtain the player's last team that he played for
stage_lastrecord <- df_dup_player_name %>%
  group_by(Player_Name) %>%
  slice(n()) %>%
  ungroup()

# Refine the selected Columns
lastrecord_select <- select(stage_lastrecord, Player_Name,Player_ID,Team)

# Finalised the duplicate records with the consolidated slary and final team that the player played for:
df_dup_player_name_final <- stage_salary %>% inner_join(lastrecord_select, by="Player_ID") %>% select(Player_Name,Player_ID,Team,`Salary_2019-20`,`Salary_2020-21`)


# Separate out the non-duplicate players
df_non_dup_player_name <- player_contract[player_contract$Player_Name %in% names(tbl_player_name)[tbl_player_name<2],]
df_non_dup_player_name <- df_non_dup_player_name %>% select(Player_Name,Player_ID,Team,`Salary_2019-20`,`Salary_2020-21`)

```

```{r}
# Combining the two datasets 
player_contract_final <- union(df_dup_player_name_final,df_non_dup_player_name)

# Sort the data set by highest salary
player_contract_final <- player_contract_final %>% arrange(.,desc(`Salary_2019-20`))

# Add a "Ranking" field to the salary. We will name it Salary_Rank
player_contract_final <- tibble::rowid_to_column(player_contract_final, "Salary_Rank")

# Convert the rank field to be a factor
player_contract_final$Salary_Rank <- as.factor(player_contract_final$Salary_Rank)

# Assign the updated dataset to the original one
player_contract <- player_contract_final

# Check for duplicates
sum(duplicated(player_contract$Player_Name))
sum(duplicated(player_contract$Player_ID))

# Check the structure
str(player_contract)

```

In this section i checked for any missing values (NA's), infinite or NaN values.

```{r}
# Check for NA's
sapply(player_contract, function(x) sum(is.na(x)))

# Check for infinites
sapply(player_contract, function(x) sum(is.infinite(x)))

# Check for NaN
sapply(player_contract, function(x) sum(is.nan(x)))
```

Looks like we found some NA's in the salary data for 2020-2021.
Since the data is numeric, taking the average salary is an option that will be used to impute into the NA's
```{r}

# Imputes the mean for the missing values
player_contract$`Salary_2020-21` <- impute(player_contract$`Salary_2020-21`, fun=mean(player_contract$`Salary_2020-21`, na.rm = TRUE)) 

# Check missing value after imputation
sapply(player_contract, function(x) sum(is.na(x)))
colSums(is.na(player_contract))
```

### De-duplication & Scan I  - Dataset 2: NBA Player Total Statistics for the 2019-2020 NBA Season.
```{r}
# De-duplication Process

# Sample a record to understand the duplicate i.e Trevor Ariza
sum(duplicated(player_stats$Player_Name))

# Which row numbers has the duplicates?
which(duplicated(player_stats$Player_Name))

# Let see which row is Trevor Ariza
which(player_stats$Player_Name=="Trevor Ariza")

# Have a look at the full records from Trevor Ariza
player_stats %>% filter(player_stats$Player_Name=="Trevor Ariza")

# Separate out the duplicate players
tbl_player_stats_name <- table(player_stats$Player_Name)
df_dup_player_stats_name <- player_stats[player_stats$Player_Name %in% names(tbl_player_stats_name)[tbl_player_stats_name>1],]

# Remove the rows where the team equals "TOT". This row is redundant
stage_player_stats <- df_dup_player_stats_name %>% filter(df_dup_player_stats_name$Team != "TOT")

# Consolidate the duplicate records by summing up the counting stats
stage_player_stats2 <- stage_player_stats  %>% group_by(Player_ID)  %>% summarise(
                    `Games_played` = sum(`Games_played`),
                    `Games_started` = sum(`Games_started`),
                    `minutes_played` = sum(`minutes_played`),
                    `field_goals` = sum(`field_goals`),
                    `field_goals_attempted` = sum(`field_goals_attempted`),
                    `two_pointers` = sum(`two_pointers`),
                    `two_pointers_attempted` = sum(`two_pointers_attempted`),
                    `three_pointers` = sum(`three_pointers`),
                    `three_pointers_attempted` = sum(`three_pointers_attempted`),
                    `free_throws` = sum(`free_throws`),
                    `free_throws_attempted` = sum(`free_throws_attempted`),
                    `offensive_rebounds` = sum(`offensive_rebounds`),
                    `defensive_rebounds` = sum(`defensive_rebounds`),
                    `total_rebounds` = sum(`total_rebounds`),
                    `assists` = sum(`assists`),
                    `steals` = sum(`steals`),
                    `blocks` = sum(`blocks`),
                    `turnovers` = sum(`turnovers`),
                    `personal_fouls` = sum(`personal_fouls`),
                    `points` = sum(`points`)
            ) %>% ungroup()

# Get the last row for each player as we need to get their last position team. We do this by using the slice function
last_team_player_stats <- stage_player_stats %>%
  group_by(Player_Name) %>%
  slice(n()) %>%
  ungroup()

# Keeping only the required columns
last_team_player_stats <- last_team_player_stats %>% select(Player_Name,Player_ID,Position, Team)

# Joining the the two stage datasets to complete the de-duplication process.
stage_player_stats3 <- last_team_player_stats %>% inner_join(stage_player_stats2, by="Player_ID") 
df_dup_player_stats_name <- stage_player_stats3

# Separate out the non-duplicate players
df_non_dup_player_stats_name <- player_stats[player_stats$Player_Name %in% names(tbl_player_stats_name)[tbl_player_stats_name<2],]

# Removing the plyer-rank column
df_non_dup_player_stats_name <- df_non_dup_player_stats_name %>% select(-Player_Rank)

# Check to see if the there's any duplicates in the non-duplicate players stats dataset. Expected results should be zero
sum(duplicated(df_non_dup_player_stats_name$Player_Name))
sum(duplicated(df_non_dup_player_stats_name$Player_ID))

# Combining the two datasets 
player_stats_final <- union(df_dup_player_stats_name,df_non_dup_player_stats_name)

# Check to see if the there's any duplicates in the combined players stats dataset. Expected results should be zero
sum(duplicated(player_stats_final$Player_Name))
sum(duplicated(player_stats_final$Player_ID))
```
Check for any NA's and Special Values

```{r}
# Check for NA's
sapply(player_stats_final, function(x) sum(is.na(x)))

# Check for infinites
sapply(player_stats_final, function(x) sum(is.infinite(x)))

# Check for NaN
sapply(player_stats_final, function(x) sum(is.nan(x)))

```
There a no missing values(NA's), infinite or NaN values in the players stats data set

\newpage
##	Tidy & Manipulate Data I 

### Tidy & Manipulate Data I - Data Set 1: NBA Player Contracts 2019-2020. 

The contracts data set was found to break the following tidy principle:

* Each observation has it’s own row 

The contracts data set is untidy as each observation for a salary does not form it's own row and the salary years are across the columns.To tidy the dataset up, we will use the gather function from the tidyverse to change the data from a wide format to a long format.

```{r}
#Tidying the contracts data set by using the gather() function
player_contract_final <- player_contract_final %>% gather(Salary_Year,Salary,5:6)

# Updating the value to reflect the year
player_contract_final$Salary_Year <- player_contract_final$Salary_Year %>% str_replace("Salary_2019-20", "2019-2020")
player_contract_final$Salary_Year <- player_contract_final$Salary_Year %>% str_replace("Salary_2020-21", "2020-2021")

```

### Tidy & Manipulate Data I - Dataset 2: NBA Player Total Statistics for the 2019-2020 NBA Season.

The players stats dataset was tidy.

It conforms to these standards:
* Each variable must have its own column.
* Each observation must have its own row.
* Each value must have its own cell.

\newpage
## Data (Part 2)

We can now merge the 2 datasets after the de-duplication process. The new merged data set will be player performance
```{r}
# Filtering the contract data as we only want the current season.
player_contract_filter <- player_contract_final %>% filter(player_contract_final$Salary_Year=="2019-2020")

# Creating the merged dataset named player_performance.
# using an inner join as there a some players that have a contract but no stats, and also there's a stats for players that have been paid but have not been captured by the contracts dataset.
player_performance <- player_stats_final %>% inner_join(player_contract_filter, by="Player_ID")

# Removing the duplication columns and cleaning up the column names
player_performance <- player_performance %>% select(-Team.y,-Player_Name.y,-Salary_Year)
player_performance <- player_performance %>% rename(Player_Name = Player_Name.x)
player_performance <- player_performance %>% rename(Team = Team.x)

# Check for any dups after the merged
sum(duplicated(player_performance$Player_Name))
sum(duplicated(player_performance$Player_ID))
```
\newpage
##	Tidy & Manipulate Data II 

In this section, after the data sets were merged, I created the two new fields: player_efficiency & Salary_per_game

```{r}
# Creating the 2 new fields using mutate
player_performance_new <- player_performance %>% mutate(.,
                                                        player_efficiency = (points+
                                                                            total_rebounds+
                                                                            assists+
                                                                            steals+
                                                                            blocks-
                                                                            (field_goals_attempted-field_goals)-
                                                                            (free_throws_attempted-free_throws)-
                                                                            turnovers)/Games_played
                                                        ,Salary_per_game= Salary/Games_played
                                                                            )
# Sorting the Dataset
player_performance_new <- player_performance_new %>% arrange(.,desc(player_efficiency))

# Selected the required columns
player_performance_new <- player_performance_new %>% select(Player_Name,Player_ID,Position,Team,Games_played,Games_started,minutes_played,
                                  Salary_Rank,Salary,player_efficiency,Salary_per_game)

```


```{r}
# The new merged dataset
head(player_performance_new)
str(player_performance_new)
```

\newpage
##	Scan II

In this section,I scanned the numeric data set for any outliers. 

```{r}
# Check for Outliers using boxplots
df_outliers <- player_performance_new %>% select_if(is.numeric)
for (i in 1:6) {
  boxplot(df_outliers[c(i)], xlab = colnames(df_outliers)[i])
}

```
After running the boxplots to dettect outliers, It appears that the salary field and newly created player efficiency field has a few outliers that we need to address.The salary per game field is based of salary so in the next section i'll just look at capping salary.

```{r}
# Cap Function
cap <- function(x){
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}

# Apply the capping
Salary <- player_performance_new$Salary %>% cap()
player_efficiency <- player_performance_new$player_efficiency %>% cap()

# Let's Recreate the boxplots to ensure that's it's all fixed
par(mfrow = c(2,2))
boxplot(Salary, xlab = "Salary ($)")
boxplot(player_efficiency, xlab = "Player Efficiency Score")
```
As you can see in the boxplots after the capping that it did a good job of removing all but one outlier.
Looking into this lone outlier, it was steph curry who got injured 5 games into the season. For those 5 games, he earned $8,046,351.6 (8mil) per game for the 2019-2020. How's that for a pay day.

\newpage
##	Transform 

In this section, We used to the salary field to do the transformation.
We will use a combination of logarithms, square root and boxcox techniques to apply and then we will choose the best technique to use from the transformation.

```{r}
# plot the histogram first
hist(player_performance_new$Salary)
```

The standard histogram has a right skewed effect to it.

```{r}
# Using the square root technique
salary_square <- sqrt(player_performance_new$Salary)
hist(salary_square)
```

```{r}
# Using the logarithm10 technique
Log <- log10(player_performance_new$Salary)
hist(Log)
```


```{r}
# Using the Box-Cox technique
boxcox_hist <- BoxCox(player_performance_new$Salary, lambda = "auto")
hist(boxcox_hist)
```
Both Log10 and Box-Cox methods normalises the salary data.

<br>