MATH2349 Data Wrangling

Required packages

The necessary packages were installed and loaded.

# readr, dplyr, tidyr, deducorrect, editrules, forecast, stringr, outliers, MVN, ggplot2
library(readr)
library(dplyr)
library(tidyr)
library(deducorrect)
library(editrules)
library(forecast)
library(stringr)
library(outliers)
library(MVN)
library(ggplot2)

Executive Summary

The NBA games and details datasets are about all games from 2004 season to last updated date teams and all statistics of players for a given game. The games and details datasets will be able to be joined by GAME_ID. After joining the two datasets, only 17 important variables will be selected and the dataset will be filtered only by 2019 season games by Los Angeles Lakers and Miami Heat basketball teams. The joined dataset is not untidy so it will be reshaped into a tidy format. Also one new variable TEAM_NAME will be created. Some categorical variables including HOME_TEAM_WINS, TEAM_ABBREVIATION, TEAM_CITY and START_POSITION will be changed to factors. The dataset has missing values so those missing values will be replaced properly by considering of the characteristic of the real-world data. Also, I’ll identify if the dataset has some special values or not. In addition to this, I’ll check for obvious inconsistencies or errors by edit rules and see if some violations are exist or not. All numeric variables will be scanned for outliers and I’ll discuss how to control these outliers by considering of the NBA real-world data. Lastly, about the points stat scored by the NBA players, I’ll try to apply the most useful data transformations and find the best data transformation for decreasing the skewness and converting the distribution into a normal distribution.

Data

The NBA Games datasets (Kaggle 2020) were download from Kaggle (see here). The datasets consist of two CSV files which are games.csv and games_details.csv. The games.csv has 23,195 rows and 21 variables about all games from 2004 season to last updated date teams and some details like number of points, etc. The games_details.csv has 576,782 rows and 28 variables about details of games dataset, all statistics of players for a given game. The relationship key between the games and games_detail dataset is GAME_ID [KEY] variable. By using GAME_ID variable, we can join games and games_details datasets. I’ll join two datasets and create new dataset games_info.

The datasets include variables and the details of these variables are given below:

[NBA Games datasets]

(1) games.csv
* GAME_DATE_EST: Game’s date
* GAME_ID [KEY]: ID of the game
* GAME_STATUS_TEXT: Status : Final means that the is completed
* HOME_TEAM_ID: ID of the home team
* VISITOR_TEAM_ID: ID of the visitor team
* SEASON: Season when the game occured
* TEAM_ID_home: ID of the home team (dupplicate of HOME_TEAM_ID)
* PTS_home: Number of points scored by home team
* FG_PCT_home: Field Goal Percentage home team
* FT_PCT_home: Free Throw Percentage of the home team
* FG3_PCT_home: Three Point Percentageof the home team
* AST_home: Assists of the home team
* REB_home: Rebounds of the home team
* TEAM_ID_away: ID of the away team (dupplicate of VISITOR_TEAM_ID)
* PTS_away: Number of points scored by away team
* FG_PCT_away: Field Goal Percentage away team
* FT_PCT_away: Free Throw Percentage of the away team
* FG3_PCT_away: Three Point Percentage of the away team
* AST_away: Assists of the away team
* REB_away: Rebounds of the away team
* HOME_TEAM_WINS: If home team won the game

(2) games_details.csv
* GAME_ID [KEY]: ID of the game
* TEAM_ID: ID of the team
* TEAM_ABBREVIATION: Team’s abbreviation
* TEAM_CITY: City where the game was played
* PLAYER_ID: ID of the player
* PLAYER_NAME: Player’s name
* START_POSITION: Position of the player (if nothing then he’s on the bench)
* COMMENT: Comment
* MIN: Minutes played
* FGM: Field Goals Made
* FGA: Field Goals Attempted
* FG_PCT: Field Goal Percentage
* FG3M: Three Pointers Made
* FG3A: Three Pointers Attempted
* FG3_PCT: Three Point Percentage
* FTM: Free Throws Made
* FTA: Free Throws Attempted
* FT_PCT: Free Throw Percentage
* OREB: Offensive Rebounds
* DREB: Defensive Rebounds
* REB: Rebounds
* AST: Assists
* STL: Steals
* BLK: Blocked shots
* TO: Turnovers
* PF: Personnal Foul
* PTS: Number of points scored by the player
* PLUS_MINUS: Plus - Minus

# Read the NBA games dataset by using the read.csv() function in Base R package.
games <- read_csv("./datasets/games.csv")

# Read the NBA games_details dataset by using the read.csv() function in Base R package.
games_details <- read_csv("./datasets/games_details.csv")

## Warning: 250965 parsing failures.
## row col   expected actual                           file
##   1 MIN valid date  27:08 './datasets/games_details.csv'
##   2 MIN valid date  34:55 './datasets/games_details.csv'
##   3 MIN valid date  26:25 './datasets/games_details.csv'
##   4 MIN valid date  27:35 './datasets/games_details.csv'
##   6 MIN valid date  24:52 './datasets/games_details.csv'
## ... ... .......... ...... ..............................
## See problems(...) for more details.

The games and games_details datasets have a relation with a key variable GAME_ID. So, we can join the datasets by GAME_ID. By joining the games and games_details datasets by GAME_ID, we can create the new dataset games_info. I’ll use this games_info dataset for the next steps.

# Join the games and games_details dataset by using the inner_join() function in dplyr package.
# games_info
games_info <- inner_join(games, games_details, by=c("GAME_ID"))
games_info %>% head(3)

## # A tibble: 3 x 48
##   GAME_DATE_EST GAME_ID GAME_STATUS_TEXT HOME_TEAM_ID VISITOR_TEAM_ID SEASON
##   <date>          <dbl> <chr>                   <dbl>           <dbl>  <dbl>
## 1 2020-03-01     2.19e7 Final              1610612766      1610612749   2019
## 2 2020-03-01     2.19e7 Final              1610612766      1610612749   2019
## 3 2020-03-01     2.19e7 Final              1610612766      1610612749   2019
## # … with 42 more variables: TEAM_ID_home <dbl>, PTS_home <dbl>,
## #   FG_PCT_home <dbl>, FT_PCT_home <dbl>, FG3_PCT_home <dbl>, AST_home <dbl>,
## #   REB_home <dbl>, TEAM_ID_away <dbl>, PTS_away <dbl>, FG_PCT_away <dbl>,
## #   FT_PCT_away <dbl>, FG3_PCT_away <dbl>, AST_away <dbl>, REB_away <dbl>,
## #   HOME_TEAM_WINS <dbl>, TEAM_ID <dbl>, TEAM_ABBREVIATION <chr>,
## #   TEAM_CITY <chr>, PLAYER_ID <dbl>, PLAYER_NAME <chr>, START_POSITION <chr>,
## #   COMMENT <chr>, MIN <time>, FGM <dbl>, FGA <dbl>, FG_PCT <dbl>, FG3M <dbl>,
## #   FG3A <dbl>, FG3_PCT <dbl>, FTM <dbl>, FTA <dbl>, FT_PCT <dbl>, OREB <dbl>,
## #   DREB <dbl>, REB <dbl>, AST <dbl>, STL <dbl>, BLK <dbl>, TO <dbl>, PF <dbl>,
## #   PTS <dbl>, PLUS_MINUS <dbl>

Understand

To summarise the types of variables and data structures, and also check the attributes in the games_info dataset str() function was used.

# Summarise the types of variables and data structures, check the attributes in the games_info dataset.
games_info %>% str()

## tibble [576,782 × 48] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ GAME_DATE_EST    : Date[1:576782], format: "2020-03-01" "2020-03-01" ...
##  $ GAME_ID          : num [1:576782] 21900895 21900895 21900895 21900895 21900895 ...
##  $ GAME_STATUS_TEXT : chr [1:576782] "Final" "Final" "Final" "Final" ...
##  $ HOME_TEAM_ID     : num [1:576782] 1.61e+09 1.61e+09 1.61e+09 1.61e+09 1.61e+09 ...
##  $ VISITOR_TEAM_ID  : num [1:576782] 1.61e+09 1.61e+09 1.61e+09 1.61e+09 1.61e+09 ...
##  $ SEASON           : num [1:576782] 2019 2019 2019 2019 2019 ...
##  $ TEAM_ID_home     : num [1:576782] 1.61e+09 1.61e+09 1.61e+09 1.61e+09 1.61e+09 ...
##  $ PTS_home         : num [1:576782] 85 85 85 85 85 85 85 85 85 85 ...
##  $ FG_PCT_home      : num [1:576782] 0.354 0.354 0.354 0.354 0.354 0.354 0.354 0.354 0.354 0.354 ...
##  $ FT_PCT_home      : num [1:576782] 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 ...
##  $ FG3_PCT_home     : num [1:576782] 0.229 0.229 0.229 0.229 0.229 0.229 0.229 0.229 0.229 0.229 ...
##  $ AST_home         : num [1:576782] 22 22 22 22 22 22 22 22 22 22 ...
##  $ REB_home         : num [1:576782] 47 47 47 47 47 47 47 47 47 47 ...
##  $ TEAM_ID_away     : num [1:576782] 1.61e+09 1.61e+09 1.61e+09 1.61e+09 1.61e+09 ...
##  $ PTS_away         : num [1:576782] 93 93 93 93 93 93 93 93 93 93 ...
##  $ FG_PCT_away      : num [1:576782] 0.402 0.402 0.402 0.402 0.402 0.402 0.402 0.402 0.402 0.402 ...
##  $ FT_PCT_away      : num [1:576782] 0.762 0.762 0.762 0.762 0.762 0.762 0.762 0.762 0.762 0.762 ...
##  $ FG3_PCT_away     : num [1:576782] 0.226 0.226 0.226 0.226 0.226 0.226 0.226 0.226 0.226 0.226 ...
##  $ AST_away         : num [1:576782] 20 20 20 20 20 20 20 20 20 20 ...
##  $ REB_away         : num [1:576782] 61 61 61 61 61 61 61 61 61 61 ...
##  $ HOME_TEAM_WINS   : num [1:576782] 0 0 0 0 0 0 0 0 0 0 ...
##  $ TEAM_ID          : num [1:576782] 1.61e+09 1.61e+09 1.61e+09 1.61e+09 1.61e+09 ...
##  $ TEAM_ABBREVIATION: chr [1:576782] "MIL" "MIL" "MIL" "MIL" ...
##  $ TEAM_CITY        : chr [1:576782] "Milwaukee" "Milwaukee" "Milwaukee" "Milwaukee" ...
##  $ PLAYER_ID        : num [1:576782] 202083 203507 201572 1628978 202339 ...
##  $ PLAYER_NAME      : chr [1:576782] "Wesley Matthews" "Giannis Antetokounmpo" "Brook Lopez" "Donte DiVincenzo" ...
##  $ START_POSITION   : chr [1:576782] "F" "F" "C" "G" ...
##  $ COMMENT          : chr [1:576782] NA NA NA NA ...
##  $ MIN              : 'hms' num [1:576782] NA NA NA NA ...
##   ..- attr(*, "units")= chr "secs"
##  $ FGM              : num [1:576782] 3 17 4 1 2 2 1 1 0 4 ...
##  $ FGA              : num [1:576782] 11 28 11 5 8 5 5 2 1 11 ...
##  $ FG_PCT           : num [1:576782] 0.273 0.607 0.364 0.2 0.25 0.4 0.2 0.5 0 0.364 ...
##  $ FG3M             : num [1:576782] 2 1 1 0 0 1 0 1 0 1 ...
##  $ FG3A             : num [1:576782] 7 4 5 3 1 4 0 2 1 4 ...
##  $ FG3_PCT          : num [1:576782] 0.286 0.25 0.2 0 0 0.25 0 0.5 0 0.25 ...
##  $ FTM              : num [1:576782] 0 6 7 0 0 1 0 0 0 2 ...
##  $ FTA              : num [1:576782] 0 7 9 0 0 2 0 0 0 3 ...
##  $ FT_PCT           : num [1:576782] 0 0.857 0.778 0 0 0.5 0 0 0 0.667 ...
##  $ OREB             : num [1:576782] 4 2 2 1 1 2 1 0 0 2 ...
##  $ DREB             : num [1:576782] 4 18 5 6 0 3 2 3 2 3 ...
##  $ REB              : num [1:576782] 8 20 7 7 1 5 3 3 2 5 ...
##  $ AST              : num [1:576782] 2 6 0 5 2 1 0 0 2 2 ...
##  $ STL              : num [1:576782] 2 1 0 0 1 0 0 0 1 2 ...
##  $ BLK              : num [1:576782] 0 0 3 1 0 0 1 0 1 0 ...
##  $ TO               : num [1:576782] 0 3 0 2 3 1 2 1 1 3 ...
##  $ PF               : num [1:576782] 0 2 2 0 2 2 1 0 1 1 ...
##  $ PTS              : num [1:576782] 8 41 16 2 4 6 2 3 0 11 ...
##  $ PLUS_MINUS       : num [1:576782] 11 22 16 14 6 0 -12 -8 -11 2 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   GAME_DATE_EST = col_date(format = ""),
##   ..   GAME_ID = col_double(),
##   ..   GAME_STATUS_TEXT = col_character(),
##   ..   HOME_TEAM_ID = col_double(),
##   ..   VISITOR_TEAM_ID = col_double(),
##   ..   SEASON = col_double(),
##   ..   TEAM_ID_home = col_double(),
##   ..   PTS_home = col_double(),
##   ..   FG_PCT_home = col_double(),
##   ..   FT_PCT_home = col_double(),
##   ..   FG3_PCT_home = col_double(),
##   ..   AST_home = col_double(),
##   ..   REB_home = col_double(),
##   ..   TEAM_ID_away = col_double(),
##   ..   PTS_away = col_double(),
##   ..   FG_PCT_away = col_double(),
##   ..   FT_PCT_away = col_double(),
##   ..   FG3_PCT_away = col_double(),
##   ..   AST_away = col_double(),
##   ..   REB_away = col_double(),
##   ..   HOME_TEAM_WINS = col_double()
##   .. )

The joined dataset games_info has now 576,782 rows and 48 variables. However, the data size is too big and there are too many variables in games_info dataset, so I’ll select 17 important variables and filter the games_info dataset about 2019 season games by Los Angeles Lakers and Miami Heat basketball teams as below:

Only select 17 important variables from the games_info dataset.
Filter the games_info dataset about 2019 season games by Los Angeles Lakers and Miami Heat basketball teams.

# Select 17 variables and filter 2020 season games by LA Lakers and Miami Heat basketball teams.
games_info <- games_info %>% 
  filter(games_info$SEASON == 2019, 
         (games_info$TEAM_ABBREVIATION == 'LAL' | games_info$TEAM_ABBREVIATION == 'MIA')) %>% 
  select(GAME_DATE_EST, GAME_ID, GAME_STATUS_TEXT, HOME_TEAM_ID, VISITOR_TEAM_ID, SEASON, 
         HOME_TEAM_WINS, TEAM_ID, TEAM_ABBREVIATION, TEAM_CITY, PLAYER_ID, PLAYER_NAME, 
         START_POSITION, COMMENT, PTS, REB, AST)
games_info %>% head(3)

## # A tibble: 3 x 17
##   GAME_DATE_EST GAME_ID GAME_STATUS_TEXT HOME_TEAM_ID VISITOR_TEAM_ID SEASON
##   <date>          <dbl> <chr>                   <dbl>           <dbl>  <dbl>
## 1 2020-03-01     2.19e7 Final              1610612740      1610612747   2019
## 2 2020-03-01     2.19e7 Final              1610612740      1610612747   2019
## 3 2020-03-01     2.19e7 Final              1610612740      1610612747   2019
## # … with 11 more variables: HOME_TEAM_WINS <dbl>, TEAM_ID <dbl>,
## #   TEAM_ABBREVIATION <chr>, TEAM_CITY <chr>, PLAYER_ID <dbl>,
## #   PLAYER_NAME <chr>, START_POSITION <chr>, COMMENT <chr>, PTS <dbl>,
## #   REB <dbl>, AST <dbl>

games_info %>% count()

## # A tibble: 1 x 1
##       n
##   <int>
## 1  1728

The games_info dataset has 1,728 rows and 17 variables now. (Only 2019 season (2019~2020) games by LA Lakers and Miami Heat basketball teams) I’ll use this games_info dataset for the next steps.

All ID type variables were changed from int to character type.

# as.character()
games_info$GAME_ID <- as.character(games_info$GAME_ID)
games_info$HOME_TEAM_ID <- as.character(games_info$HOME_TEAM_ID)
games_info$VISITOR_TEAM_ID <- as.character(games_info$VISITOR_TEAM_ID)
games_info$TEAM_ID <- as.character(games_info$TEAM_ID)
games_info$PLAYER_ID <- as.character(games_info$PLAYER_ID)

To check which variables will be able to be factors, table() function was used here.

table(games_info$GAME_STATUS_TEXT)

## 
## Final 
##  1728

table(games_info$SEASON)

## 
## 2019 
## 1728

table(games_info$HOME_TEAM_WINS)

## 
##    0    1 
##  719 1009

table(games_info$TEAM_ABBREVIATION)

## 
## LAL MIA 
## 864 864

table(games_info$TEAM_CITY)

## 
## Los Angeles       Miami 
##         864         864

GAME_STATUS_TEXT variable has only ‘Final’ value and SEASON variable has only ‘2019’ value. So, factor variables are clearly summarized as follows:

[Factor variables]
HOME_TEAM_WINS: 0, 1
TEAM_ABBREVIATION: LAL, MIA
TEAM_CITY: Los Angeles, Miami

HOME_TEAM_WINS, TEAM_ABBREVIATION and TEAM_CITY variables were changed to factors.

# factor: games_info$HOME_TEAM_WINS
games_info$HOME_TEAM_WINS <- factor(
  games_info$HOME_TEAM_WINS, levels = c('0', '1'), labels = c('Loss', 'Win'), ordered = FALSE
)
str(games_info$HOME_TEAM_WINS)

##  Factor w/ 2 levels "Loss","Win": 1 1 1 1 1 1 1 1 1 1 ...

# 
# factor: games_info$TEAM_ABBREVIATION (LAL / MIA)
games_info$TEAM_ABBREVIATION <- factor(
  games_info$TEAM_ABBREVIATION, levels = c('LAL', 'MIA'), labels = c('LAL', 'MIA'), ordered = FALSE
)
str(games_info$TEAM_ABBREVIATION)

##  Factor w/ 2 levels "LAL","MIA": 1 1 1 1 1 1 1 1 1 1 ...

# 
# factor: games_info$TEAM_CITY (Los Angeles / Miami)
games_info$TEAM_CITY <- factor(
  games_info$TEAM_CITY, levels = c('Los Angeles', 'Miami'), labels = c('Los Angeles', 'Miami'), ordered = FALSE
)
str(games_info$TEAM_CITY)

##  Factor w/ 2 levels "Los Angeles",..: 1 1 1 1 1 1 1 1 1 1 ...

Tidy & Manipulate Data I

The games_info dataset doesn’t conform the tidy data principle because columns headers (‘PTS’, ‘REB’, ‘AST’) are not variables but observations. Those three observations are NBA player’s stats. So, the games_info dataset is untidy. It will be able to be one variable named ‘STAT’. The games_info dataset was reshaped into a tidy format by using gather() function in tidyr package.

# tidy: player stats
games_info <- games_info %>% gather(`PTS`, `REB`, `AST`, key = "STAT_NAME", value = "STAT_VALUE")
games_info$STAT_NAME %>% head(10)

##  [1] "PTS" "PTS" "PTS" "PTS" "PTS" "PTS" "PTS" "PTS" "PTS" "PTS"

games_info$STAT_VALUE %>% head(10)

##  [1] 34 20  8 10 10 13  8 10  4  5

games_info %>% count()

## # A tibble: 1 x 1
##       n
##   <int>
## 1  5184

The games_info dataset has 5,184 rows and 16 variables now as tidy dataset. I’ll use this games_info dataset for the next steps.

The STAT_NAME variable should be changed to a factor.

# factor: games_info$STAT_NAME (PTS/REB/AST)
games_info$STAT_NAME <- factor(
  games_info$STAT_NAME, levels = c('PTS', 'REB', 'AST'), labels = c('PTS', 'REB', 'AST'), ordered = FALSE
)
str(games_info$STAT_NAME)

##  Factor w/ 3 levels "PTS","REB","AST": 1 1 1 1 1 1 1 1 1 1 ...

Tidy & Manipulate Data II

The games_info dataset has TEAM_CITY variable however it doesn’t have the full NBA team names. (For example, the full NBA team name of the ‘Boston’ is ‘Boston Celtics’) So, one new variable ‘STAT’. The games_info dataset was reshaped into a tidy format by using TEAM_NAME from the variable TEAM_CITY was created. The games_info dataset was divided two datasets which are games_info_LAL and games_info_MIA with the new variable TEAM_NAME for the LA Lakers and Miami Heat NBA teams.

# New variable TEAM_NAME from the variable TEAM_CITY was created
# games_info_LAL
games_info_LAL <- games_info %>% 
  filter(TEAM_ABBREVIATION == "LAL") %>% 
  mutate(TEAM_NAME = paste(TEAM_CITY, "Lakers"))
games_info_LAL$TEAM_NAME %>% head(3)

## [1] "Los Angeles Lakers" "Los Angeles Lakers" "Los Angeles Lakers"

# games_info_MIA
games_info_MIA <- games_info %>% 
  filter(TEAM_ABBREVIATION == "MIA") %>% 
  mutate(TEAM_NAME = paste(TEAM_CITY, "Heat"))
games_info_MIA$TEAM_NAME %>% head(3)

## [1] "Miami Heat" "Miami Heat" "Miami Heat"

The games_info dataset was created again by binding the games_info_LAL and games_info_MIA datasets.

# bind rows
games_info <- bind_rows(games_info_LAL, games_info_MIA)
games_info$TEAM_NAME %>% str()

##  chr [1:5184] "Los Angeles Lakers" "Los Angeles Lakers" ...

The games_info dataset has 5,184 rows and 17 variables now with the new variable TEAM_NAME. I’ll use this games_info dataset for the next steps.

Scan I

The games_info dataset was scanned for missing values, special values and obvious errors. The is.na() and colSums() function were used to identify the total missing values in each column from the dataset. To replace the missing values, which() function was used. Also to identify the missing value NA, special values -Inf, Inf and NaN, the sapply(), sum(), is.na(), is.infinite(), is.nan() functions were used.

The games_info dataset was scaned to find missing values by using is.na() function with colSums() function. The results are as follows:

# Scan: missing values
colSums(is.na(games_info))

##     GAME_DATE_EST           GAME_ID  GAME_STATUS_TEXT      HOME_TEAM_ID 
##                 0                 0                 0                 0 
##   VISITOR_TEAM_ID            SEASON    HOME_TEAM_WINS           TEAM_ID 
##                 0                 0                 0                 0 
## TEAM_ABBREVIATION         TEAM_CITY         PLAYER_ID       PLAYER_NAME 
##                 0                 0                 0                 0 
##    START_POSITION           COMMENT         STAT_NAME        STAT_VALUE 
##              3234              4143                 0              1041 
##         TEAM_NAME 
##                 0

There are three variables START_POSITION, COMMENT and STAT_VALUE having missing values.

The START_POSITION consists of ‘C’, ‘F’ and ‘G’. The ‘C’ means center position, ‘F’ means forward position and ‘G’ means guard position of the starting 5 players. The starting 5 players in a NBA game have these START_POSITION values. However, other bench players have missing value NA in START_POSITION column. So, I’ll create a new symbol ‘B’ meaning a bench player and replace the missing value NA to ‘B’.

# START_POSITION: bench player as 'B'
games_info$START_POSITION[which(is.na(games_info$START_POSITION))] = 'B'
table(games_info$START_POSITION)

## 
##    B    C    F    G 
## 3234  390  780  780

Totally 3,234 missing values were replaced to bench player value ‘B’.
* ‘B’: bench player (3,234)
* ‘C’: center position starting player (390)
* ‘F’: forward position starting player (780)
* ‘G’: guard position starting player (780)

The START_POSITION variable should be changed to a factor.

# factor: games_info$START_POSITION (B/C/F/G)
games_info$START_POSITION <- factor(
  games_info$START_POSITION,
  levels = c('B', 'C', 'F', 'G'),
  labels = c('B', 'C', 'F', 'G'),
  ordered = FALSE
)
str(games_info$START_POSITION)

##  Factor w/ 4 levels "B","C","F","G": 3 3 2 4 4 1 1 1 1 1 ...

The COMMENT variable describes a NBA player’s status. If it has missing value, it means that the NBA player played in the game. So, missing values replaced to ‘GAME - played’.

# COMMENT: 'GAME - played'
games_info$COMMENT[which(is.na(games_info$COMMENT))] = 'GAME - played'
table(games_info$COMMENT)

## 
## DND - Coach's Decision   DND - Injury/Illness         DND - Personal 
##                      3                    108                      3 
##             DND - Rest DNP - Coach's Decision   DNP - Injury/Illness 
##                     15                    705                    111 
##             DNP - Rest          GAME - played NWT - Coach's Decision 
##                     27                   4143                      3 
##   NWT - Injury/Illness         NWT - Personal             NWT - Rest 
##                     33                      6                      3 
##      NWT_NOT_WITH_TEAM 
##                     24

The STAT_VALUE variable describes the points, rebounds or assists performed by a NBA player. If it has missing value, it means that the NBA player didn’t play in the game and the points, rebounds and assists are 0. So, missing values replaced to ‘0’.

# STAT_VALUE: 0
games_info$STAT_VALUE[which(is.na(games_info$STAT_VALUE))] = 0

Let’s identify the missing value NA again.

# Scan: missing values
# colSums(is.na(games_info))

# Identify the missing value NA
sapply(games_info, function(x) sum(is.na(x)))

##     GAME_DATE_EST           GAME_ID  GAME_STATUS_TEXT      HOME_TEAM_ID 
##                 0                 0                 0                 0 
##   VISITOR_TEAM_ID            SEASON    HOME_TEAM_WINS           TEAM_ID 
##                 0                 0                 0                 0 
## TEAM_ABBREVIATION         TEAM_CITY         PLAYER_ID       PLAYER_NAME 
##                 0                 0                 0                 0 
##    START_POSITION           COMMENT         STAT_NAME        STAT_VALUE 
##                 0                 0                 0                 0 
##         TEAM_NAME 
##                 0

There’s no missing value from the games_info dataset.

In addition to missing values, there are a few special values that are used in R. These are -Inf, Inf and NaN. To identify the special values in games_info dataset, the is.finite(), is.infinite() and is.nan() functions were used.

# Scan: Special Values
# Identify the special value -Inf, Inf
sapply(games_info, function(x) sum(is.infinite(x)))

##     GAME_DATE_EST           GAME_ID  GAME_STATUS_TEXT      HOME_TEAM_ID 
##                 0                 0                 0                 0 
##   VISITOR_TEAM_ID            SEASON    HOME_TEAM_WINS           TEAM_ID 
##                 0                 0                 0                 0 
## TEAM_ABBREVIATION         TEAM_CITY         PLAYER_ID       PLAYER_NAME 
##                 0                 0                 0                 0 
##    START_POSITION           COMMENT         STAT_NAME        STAT_VALUE 
##                 0                 0                 0                 0 
##         TEAM_NAME 
##                 0

There’s no special value -Inf, Inf from the games_info dataset.

# Identify the special value NaN
sapply(games_info, function(x) sum(is.nan(x)))

##     GAME_DATE_EST           GAME_ID  GAME_STATUS_TEXT      HOME_TEAM_ID 
##                 0                 0                 0                 0 
##   VISITOR_TEAM_ID            SEASON    HOME_TEAM_WINS           TEAM_ID 
##                 0                 0                 0                 0 
## TEAM_ABBREVIATION         TEAM_CITY         PLAYER_ID       PLAYER_NAME 
##                 0                 0                 0                 0 
##    START_POSITION           COMMENT         STAT_NAME        STAT_VALUE 
##                 0                 0                 0                 0 
##         TEAM_NAME 
##                 0

There’s no special value NaN from the games_info dataset.

An obvious inconsistency occurs when a data record contains a value or combination of values that cannot correspond to a real-world situation. So, I defined edit rules on numerical, categorical and mixed-type data sets in the editrules.txt to check for obvious inconsistencies or errors.

[editrules.txt]
1 # numerical rules
2 SEASON == 2019
3 STAT_VALUE >= 0
4
5 # categorical rules
6 HOME_TEAM_WINS %in% c(“Loss”, “Win”)
7 TEAM_ABBREVIATION %in% c(“LAL”, “MIA”)
8 TEAM_CITY %in% c(“Los Angeles”, “Miami”)
9 TEAM_NAME %in% c(“Los Angeles Lakers”, “Miami Heat”)
10 START_POSITION %in% c(“B”, “C”, “F”, “G”)
11 STAT_NAME %in% c(“PTS”, “REB”, “AST”)
12
13 # mixed rules
14 if (STAT_NAME == “PTS”) STAT_VALUE <= 100
15 if (STAT_NAME == “REB”) STAT_VALUE <= 50
16 if (STAT_NAME == “AST”) STAT_VALUE <= 50

[Explaination about edit rules]
* The dataset is about 2019 season games. So, SEASON variable is equal to 2019.
* The basketball stats (STAT_VALUE variable) including points, rebounds and assists is positive number or zero.
* HOME_TEAM_WINS variable must have ‘Loss’ or ‘Win’ value.
* TEAM_ABBREVIATION variable must have ‘LAL’ or ‘MIA’ value.
* TEAM_CITY variable must have ‘Los Angeles’ or ‘Miami’ value.
* TEAM_NAME variable must have ‘Los Angeles Lakers’ or ‘Miami Heat’ value.
* START_POSITION variable must have ‘B’, ‘C’, ‘F’, ‘G’ value.
* STAT_NAME variable must have ‘PTS’, ‘REB’, ‘AST’ value.
* About points stat, there’s no any NBA player who has scored more than 100 points in NBA history.
* About rebounds stat, there’s no any NBA player who has done more than 50 rebounds in NBA history.
* About assists stat, there’s no any NBA player who has done more than 50 assists in NBA history.

# Scan: obvious errors (inconsistencies)
Rules <- editfile("./datasets/editrules.txt", type = "all")
Rules

## 
## Data model:
## dat3 : HOME_TEAM_WINS %in% c('Loss', 'Win')
## dat4 : START_POSITION %in% c('B', 'C', 'F', 'G')
## dat5 : STAT_NAME %in% c('AST', 'PTS', 'REB')
## dat6 : TEAM_ABBREVIATION %in% c('LAL', 'MIA')
## dat7 : TEAM_CITY %in% c('Los Angeles', 'Miami')
## dat8 : TEAM_NAME %in% c('Los Angeles Lakers', 'Miami Heat') 
## 
## Edit set:
## num1 : SEASON == 2019
## num2 : 0 <= STAT_VALUE
## mix3 : if( 100 < STAT_VALUE ) STAT_NAME != 'PTS'
## mix4 : if( 50 < STAT_VALUE ) STAT_NAME != 'REB'
## mix5 : if( 50 < STAT_VALUE ) STAT_NAME != 'AST'

Violated <- violatedEdits(Rules, games_info)
summary(Violated)

## No violations detected, 0 checks evaluated to NA

## NULL

There’s no violation from the games_info dataset by the edit rules.

Scan II

In statistics, an outlier is defined as an observation which stands far away from the most of other observations. However in NBA basketball stats, outliers means some NBA superstars’ high stats. (Genereally, NBA superstars play high points, rebounds and assists games.) So, I’ll identify the outliers of the NBA players’ points, rebounds and assists in LA Lakers and Miami Heat teams from the games_info dataset and compare them with the outliers of Lebron James’ stats. (Lebron James is a NBA superstar of the LA Lakers team.) The important thing is that the outliers in NBA basketball stats can’t be removed from the dataset because the outliers are not garbage data but NBA superstars’ high stats. Therefore, I’ll not remove the outliers from the dataset but just prove that the outliers in the NBA basketball stats are the NBA superstars’ stats. Also I’ll show how to remove those outliers from the dataset if it’s necessary.

The target numeric variables for outliers are PTS, REB and AST stats. The box plots for each PTS, REB and AST stats by the NBA players in LA Lakers and Miami Heat teams are as follows. To apply two different colors (the R Graph Gallery 2020), TEAM_COLORS variable was created. To arrange the plots together, the par(mfrow) (The Pirate’s Guide to R 2020) was used.

# Arranging plots with par(mfrow)
par(mfrow=c(1, 3))
# Add color to specific groups of a boxplot
TEAM_COLORS <- ifelse(levels(games_info$TEAM_ABBREVIATION) == "LAL", rgb(0.1, 0.1, 0.7, 0.5), 
  ifelse(levels(games_info$TEAM_ABBREVIATION) == "MIA", rgb(0.8, 0.1, 0.3, 0.6), "grey90"))
# 
# PTS stat by the NBA players in LA Lakers and Miami Heat teams
player_point <- games_info %>% filter(STAT_NAME == "PTS") %>% select(TEAM_ABBREVIATION, STAT_VALUE)
boxplot(player_point$STAT_VALUE ~ player_point$TEAM_ABBREVIATION, main="The NBA Players' Points", xlab="NBA Teams",ylab="Number of points scored by the NBA players", col = TEAM_COLORS)
# REB stat by the NBA players in LA Lakers and Miami Heat teams
player_rebound <- games_info %>% filter(STAT_NAME == "REB") %>% select(TEAM_ABBREVIATION, STAT_VALUE)
boxplot(player_rebound$STAT_VALUE ~ player_rebound$TEAM_ABBREVIATION, main="The NBA Players' Rebounds", xlab="NBA Teams",ylab="Number of rebounds", col = TEAM_COLORS)
# AST stat by the NBA players in LA Lakers and Miami Heat teams
player_assist <- games_info %>% filter(STAT_NAME == "AST") %>% select(TEAM_ABBREVIATION, STAT_VALUE)
boxplot(player_assist$STAT_VALUE ~ player_assist$TEAM_ABBREVIATION, main="The NBA Players' Assists", xlab="NBA Teams", ylab="Number of assists", col = TEAM_COLORS
)

About the NBA players’ points, LA Lakers team has 13 outliers but Miami Heat team only has 2 outlier. Because the NBA superstar Lebron James plays for LA Lakers team. He scores high points in his games and those points stat display as outliers in box plot. For the rebounds stat, both LA Lakers team and Miami Heat team have 6 outliers. For the assists stat, LA Lakers team has 12 outliers and Miami Heat team has 6 outliers. The LA Lakers superstart Lebron James usually helps his teammates by high assist games. So, LA Lakers team has more assists outliers than Miami Heat team.

On the other hand, the NBA superstar Lebron James’ box plots for points, rebounds and assists stats are as follows. To arrange the plots together, the par(mfrow) (The Pirate’s Guide to R 2020) was used.

# Arranging plots with par(mfrow)
par(mfrow=c(1, 3))
# 
# PTS stat by the NBA superstar Lebron James
lebron_point <- games_info %>% filter(PLAYER_NAME == "LeBron James", STAT_NAME == "PTS") %>% select(STAT_VALUE)
boxplot(lebron_point$STAT_VALUE, main="Lebron James' Points", ylab="Number of points scored by Lebron James", col = "green")
# REB by the NBA superstar Lebron James
lebron_rebound <- games_info %>% filter(PLAYER_NAME == "LeBron James", STAT_NAME == "REB") %>% select(STAT_VALUE)
boxplot(lebron_rebound$STAT_VALUE, main="Lebron James' Rebounds", xlab="NBA Teams",ylab="Number of rebounds", col = "yellow"
)
# AST by the NBA superstar Lebron James
lebron_assist <- games_info %>% filter(PLAYER_NAME == "LeBron James", STAT_NAME == "AST") %>% select(STAT_VALUE)
boxplot(lebron_assist$STAT_VALUE, main="Lebron James' Assists", ylab="Number of assists", col = "orange"
)

Obviously, Lebron’s points stat doesn’t display many outliers in box plot. Only one outlier is exist. It means that Lebron’s points stat should be displayed as outliers in LA Lakers team dataset but not outliers in his points stat because he is the NBA superstar and scores high points in his games. Lebron’s rebounds stat doesn’t have any outlier. Lebron’s assists stat has 2 outliers. Those 2 outliers mean that Lebron James played the lowest assist game and highest assist game in this season and the lowest and highest assist stats are displayed as 2 outliers in box plot.

I can conclude that the outliers of the PTS, REB and AST stats come from some NBA superstar players with high stats. Therefore, those outliers should be kept and don’t need to be removed from the games_info dataset by considering of the characteristic of the real-world data.

To see the multivariate outlier detection using Mahalanobis distance with QQ plots, new player_stat dataset with PTS, REB and AST was created. The QQ plots is given below:

# PTS
player_point <- games_info %>% filter(STAT_NAME == "PTS") %>% select(STAT_VALUE)
# REB
player_rebound <- games_info %>% filter(STAT_NAME == "REB") %>% select(STAT_VALUE)
# AST
player_assist <- games_info %>% filter(STAT_NAME == "AST") %>% select(STAT_VALUE)
# Binding columns by PTS and REB
player_stat <- bind_cols(player_point$STAT_VALUE, player_rebound$STAT_VALUE)
# Binding columns by PTS, REB and AST
player_stat <- bind_cols(player_stat, player_assist$STAT_VALUE)
# Set column names as POINTS, REBOUNDS and ASSISTS
colnames(player_stat) <- c("POINTS", "REBOUNDS", 'ASSISTS')

# Multivariate outlier detection using Mahalanobis distance with QQ plots
results <- mvn(data = player_stat, multivariateOutlierMethod = "quan", showOutliers = TRUE)

# Observation Mahalanobis Distance Outlier (Display top 5 outputs from 504 outliers)
results$multivariateOutliers %>% head(3)

##   Observation Mahalanobis Distance Outlier
## 1           1              479.141    TRUE
## 2           2              323.873    TRUE
## 3           3              306.677    TRUE

There are 504 outliers of the PTS, REB and AST stats in the games. These outliers are the NBA superstars stats in the games and don’t need to be removed from the games_info dataset.

Lastly, I’ll show how to remove those outliers from the dataset regardless of the characteristic of the real-world data.
The outliers about points stat can be excluded as follows:

# PTS
z.scores <- player_point$STAT_VALUE %>% scores(type = "z")
z.scores %>% summary()
which(abs(z.scores) > 3)
length(which( abs(z.scores) > 3))
# Excluding outliers (PTS)
player_point_clean <- player_point$STAT_VALUE[-which(abs(z.scores) > 3)]

The outliers about rebounds stat can be excluded as follows:

# REB
z.scores <- player_rebound$STAT_VALUE %>% scores(type = "z")
z.scores %>% summary()
which(abs(z.scores) > 3)
length(which( abs(z.scores) > 3))
# Excluding outliers (REB)
player_rebound_clean <- player_rebound$STAT_VALUE[-which(abs(z.scores) > 3)]

The outliers about assists stat can be excluded as follows:

# AST
z.scores <- player_assist$STAT_VALUE %>% scores(type = "z")
z.scores %>% summary()
which(abs(z.scores) > 3)
length(which( abs(z.scores) > 3))
# Excluding outliers (AST)
player_assist_clean <- player_assist$STAT_VALUE[-which(abs(z.scores) > 3)]

Transform

The points scored by the NBA players from the games_info dataset are not symmetric but right/positive skew. So, data transformation is necessary to decrease the skewness and convert the distribution into a normal distribution. I’ll try to apply the most useful data transformations and find the best data transformation for decreasing the skewness and converting the distribution into a normal distribution.

[Data transformation]
* (DT-1) Log transformation (base 10)
* (DT-2) Base e (log e)
* (DT-3) Square root transformation
* (DT-4) Square transformation
* (DT-5) Reciprocal transformation
* (DT-6) Box-Cox Transformation

The points scored by the NBA players from the games_info dataset are not symmetric but right/positive skew. The histogram by using ggplot2 (Statistical tools for high-throughput data analysis 2020) is given below:

# Histogram with density plot for the points scored by the NBA players
ggplot(player_point, aes(x=STAT_VALUE)) + 
 geom_histogram(aes(y=..density..), colour="black", fill="white", bins = 30) + 
 geom_density(alpha=.2, fill="#FF6666") + 
 geom_vline(aes(xintercept=mean(STAT_VALUE)), color="blue", linetype="dashed", size=1)

For data transformation, the points stat having 0 point is not meaningful. So, the games_info dataset was filtered without 0 point stats.

player_point <- games_info %>% filter(STAT_NAME == "PTS", STAT_VALUE != 0) %>% select(STAT_VALUE)
data_transformation <- player_point

The histograms of the data transformations from (DT-1) to (DT-6) are as follows. To arrange the plots together, the par(mfrow) (The Pirate’s Guide to R 2020) was used.

# Arranging plots with par(mfrow)
par(mfrow=c(2, 3))
# (DT-1) Log transformation (base 10)
data_transformation$STAT_VALUE <- log10(player_point$STAT_VALUE)
hist(data_transformation$STAT_VALUE, main = "Log transformation (base 10)", col="skyblue")

# (DT-2) Base e (log e)
data_transformation$STAT_VALUE <- log(player_point$STAT_VALUE)
hist(data_transformation$STAT_VALUE, main = "Base e (log e)", col="skyblue")

# (DT-3) Square root transformation
data_transformation$STAT_VALUE <- sqrt(player_point$STAT_VALUE)
hist(data_transformation$STAT_VALUE, main = "Square root transformation", col="skyblue")

# (DT-4) Square transformation
data_transformation$STAT_VALUE <- player_point$STAT_VALUE^2
hist(data_transformation$STAT_VALUE, main = "Square transformation", col="skyblue")

# (DT-5) Reciprocal transformation
data_transformation$STAT_VALUE <- 1/player_point$STAT_VALUE
hist(data_transformation$STAT_VALUE, main = "Reciprocal transformation", col="skyblue")

# (DT-6) Box-Cox Transformation
data_transformation$STAT_VALUE <- BoxCox(player_point$STAT_VALUE, lambda = "auto")
hist(data_transformation$STAT_VALUE, main = "Box-Cox Transformation", col="skyblue")

[Data transformation results]
(DT-1) Log transformation (base 10): Left/Negative skewed
(DT-2) Base e (log e): Left/Negative skewed
(DT-3) Square root transformation: Slightly Right/Positive skewed
(DT-4) Square transformation: Right/Positive skewed
(DT-5) Reciprocal transformation: Right/Positive skewed
(DT-6) Box-Cox Transformation: Left/Negative skewed

As seen above, the data transformation was not perpect to get the symmetrical normal distribution for this data set. However, by the (DT-3) Square root transformation, the right/positive skewness of the original dataset was decreased. It’s still slightly right skewed but it has much better symmetric distribution than the original dataset.

Therefore, the best data transformation to decrease the skewness for symmetrical normal distribution is the square root transformation.

# Square root transformation
data_transformation$STAT_VALUE <- sqrt(player_point$STAT_VALUE)
# 
ggplot(data_transformation, aes(x=STAT_VALUE)) + 
 geom_histogram(aes(y=..density..), colour="black", fill="white", bins=30) + 
 geom_density(alpha=.2, fill="#FF6666")

References

Kaggle 2020, NBA games data, Nathan Lauga, viewed 20 October 2020, https://www.kaggle.com/nathanlauga/nba-games
Statistical tools for high-throughput data analysis 2020, ggplot2 histogram plot, viewed 20 October 2020, http://www.sthda.com/english/wiki/ggplot2-histogram-plot-quick-start-guide-r-software-and-data-visualization
The Pirate’s Guide to R 2020, Arranging plots with par(mfrow) and layout(), viewed 20 October 2020, https://bookdown.org/ndphillips/YaRrr/arranging-plots-with-parmfrow-and-layout.html
the R Graph Gallery 2020, Add color to specific groups of a boxplot, viewed 20 October 2020, https://www.r-graph-gallery.com/23-add-colors-to-specific-groups-of-a-boxplot.html