1. Executive Summary:

The data analysis process includes many steps from defining the problem statement to data pre-processing which includes getting the data ready to analyze so that we can explore the data to gain meaningful insights from it. This report provides an analysis on a NBA game data set for the season of 2019-2020 season following the 3 steps of Data Analysis: Defining a problem statement, Data Collecting and Data pre processing.

The data collection for this analysis is retrieved from Kaggle.com and R is used to carry out the data pre processing. Our data-preprocessing methodology is highlighted below:

  1. The two separate data sets called “games.csv” and “games_details.csv” is imported using the ‘readr’ library .

  2. The redundant columns, for example Comment,s were dropped from the data set and the untidy dataset were made tidy using the Tidy Principles.

  3. The two data sets were joined together to make one data set using the primary keys GAME_ID and HOME_TEAM_ID.

  4. For analysis a new variable was created called OREB_To_DREB_RATIO, which would help in the further analysis because it indicates the offensive contribution in terms of the defensive contribution of a NBA player.

  5. After the above steps were implemented the missing values were analyzed in the data set and were imputed or removed accordingly.

  6. Next, outliers were found out for the numeric columns and “capping” technique was applied to replace the outliers.

  7. Finally, Box Cox transformations were applied to ensure the normality of the data, which makes it ready for the analysis.

From the analysis of the data set it can be concluded that the nba_games_dataset is the final cleaned data set which is free from any impurities and can be used for further analysis.

2. Required packages [R code]:

#Loading the required packages
library(readr)
library(dplyr)
library(tidyr)
library(car)
library(outliers)
library(stringr)
library(forecast)
library(magrittr)

3. Data:

The “games.csv” contains information about NBA Games during 2019 Season. It contains 23195 observations and 21 variables. It was collected from Kaggle[https://www.kaggle.com/nathanlauga/nba-games?select=games.csv].

The “games.csv” contains the following attributes:

Loading the “games.csv” data set. viewing the contents using the head(). Checking the attributes and their data types using the str function

#Loading the data set "games.csv"
#Stripping the white spaces where it is possible
games <- read.csv("data/games.csv", strip.white = TRUE)
#Viewing the games dataset
head(games)
##   GAME_DATE_EST  GAME_ID GAME_STATUS_TEXT HOME_TEAM_ID VISITOR_TEAM_ID SEASON
## 1    2020-03-01 21900895            Final   1610612766      1610612749   2019
## 2    2020-03-01 21900896            Final   1610612750      1610612742   2019
## 3    2020-03-01 21900897            Final   1610612746      1610612755   2019
## 4    2020-03-01 21900898            Final   1610612743      1610612761   2019
## 5    2020-03-01 21900899            Final   1610612758      1610612765   2019
## 6    2020-03-01 21900900            Final   1610612740      1610612747   2019
##   TEAM_ID_home PTS_home FG_PCT_home FT_PCT_home FG3_PCT_home AST_home REB_home
## 1   1610612766       85       0.354       0.900        0.229       22       47
## 2   1610612750       91       0.364       0.400        0.310       19       57
## 3   1610612746      136       0.592       0.805        0.542       25       37
## 4   1610612743      133       0.566       0.700        0.500       38       41
## 5   1610612758      106       0.407       0.885        0.257       18       51
## 6   1610612740      114       0.421       0.818        0.219       24       52
##   TEAM_ID_away PTS_away FG_PCT_away FT_PCT_away FG3_PCT_away AST_away REB_away
## 1   1610612749       93       0.402       0.762        0.226       20       61
## 2   1610612742      111       0.468       0.632        0.275       28       56
## 3   1610612755      130       0.505       0.650        0.488       27       37
## 4   1610612761      118       0.461       0.897        0.263       24       36
## 5   1610612765      100       0.413       0.667        0.429       23       42
## 6   1610612747      122       0.515       0.900        0.371       23       36
##   HOME_TEAM_WINS
## 1              0
## 2              0
## 3              1
## 4              1
## 5              1
## 6              0
#Viewing the data types of  attributes in the "games.csv" dataset
str(games)
## 'data.frame':    23195 obs. of  21 variables:
##  $ GAME_DATE_EST   : chr  "2020-03-01" "2020-03-01" "2020-03-01" "2020-03-01" ...
##  $ GAME_ID         : int  21900895 21900896 21900897 21900898 21900899 21900900 21900901 21900887 21900888 21900889 ...
##  $ GAME_STATUS_TEXT: chr  "Final" "Final" "Final" "Final" ...
##  $ HOME_TEAM_ID    : int  1610612766 1610612750 1610612746 1610612743 1610612758 1610612740 1610612744 1610612752 1610612737 1610612748 ...
##  $ VISITOR_TEAM_ID : int  1610612749 1610612742 1610612755 1610612761 1610612765 1610612747 1610612764 1610612741 1610612757 1610612751 ...
##  $ SEASON          : int  2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...
##  $ TEAM_ID_home    : int  1610612766 1610612750 1610612746 1610612743 1610612758 1610612740 1610612744 1610612752 1610612737 1610612748 ...
##  $ PTS_home        : num  85 91 136 133 106 114 110 125 129 116 ...
##  $ FG_PCT_home     : num  0.354 0.364 0.592 0.566 0.407 0.421 0.472 0.553 0.548 0.451 ...
##  $ FT_PCT_home     : num  0.9 0.4 0.805 0.7 0.885 0.818 0.708 0.697 0.864 0.833 ...
##  $ FG3_PCT_home    : num  0.229 0.31 0.542 0.5 0.257 0.219 0.321 0.4 0.429 0.368 ...
##  $ AST_home        : num  22 19 25 38 18 24 25 29 34 27 ...
##  $ REB_home        : num  47 57 37 41 51 52 52 50 36 45 ...
##  $ TEAM_ID_away    : int  1610612749 1610612742 1610612755 1610612761 1610612765 1610612747 1610612764 1610612741 1610612757 1610612751 ...
##  $ PTS_away        : num  93 111 130 118 100 122 124 115 117 113 ...
##  $ FG_PCT_away     : num  0.402 0.468 0.505 0.461 0.413 0.515 0.488 0.461 0.5 0.465 ...
##  $ FT_PCT_away     : num  0.762 0.632 0.65 0.897 0.667 0.9 0.889 0.696 0.714 0.739 ...
##  $ FG3_PCT_away    : num  0.226 0.275 0.488 0.263 0.429 0.371 0.667 0.486 0.286 0.364 ...
##  $ AST_away        : num  20 28 27 24 23 23 24 26 14 30 ...
##  $ REB_away        : num  61 56 37 36 42 36 34 33 42 44 ...
##  $ HOME_TEAM_WINS  : int  0 0 1 1 1 0 0 1 1 1 ...
#Checking the dimension of the "games.csv" dataset
dim(games)
## [1] 23195    21

The “games_details.csv” contains extensive game details information about NBA Games during 2019-2020 Season. It contains 576782 observations and observations and 28 variables.It was collected from Kaggle[https://www.kaggle.com/nathanlauga/nba-games?select=games_details.csv].

The “games_details.csv” contains the following attributes:

#Loading the "games_details.csv" data set
games_details <- read.csv("data/games_details.csv", strip.white = TRUE)
head(games_details)
##    GAME_ID    TEAM_ID TEAM_ABBREVIATION TEAM_CITY PLAYER_ID
## 1 21900895 1610612749               MIL Milwaukee    202083
## 2 21900895 1610612749               MIL Milwaukee    203507
## 3 21900895 1610612749               MIL Milwaukee    201572
## 4 21900895 1610612749               MIL Milwaukee   1628978
## 5 21900895 1610612749               MIL Milwaukee    202339
## 6 21900895 1610612749               MIL Milwaukee   1626192
##             PLAYER_NAME START_POSITION COMMENT   MIN FGM FGA FG_PCT FG3M FG3A
## 1       Wesley Matthews              F         27:08   3  11  0.273    2    7
## 2 Giannis Antetokounmpo              F         34:55  17  28  0.607    1    4
## 3           Brook Lopez              C         26:25   4  11  0.364    1    5
## 4      Donte DiVincenzo              G         27:35   1   5  0.200    0    3
## 5          Eric Bledsoe              G         22:17   2   8  0.250    0    1
## 6       Pat Connaughton                        24:52   2   5  0.400    1    4
##   FG3_PCT FTM FTA FT_PCT OREB DREB REB AST STL BLK TO PF PTS PLUS_MINUS
## 1   0.286   0   0  0.000    4    4   8   2   2   0  0  0   8         11
## 2   0.250   6   7  0.857    2   18  20   6   1   0  3  2  41         22
## 3   0.200   7   9  0.778    2    5   7   0   0   3  0  2  16         16
## 4   0.000   0   0  0.000    1    6   7   5   0   1  2  0   2         14
## 5   0.000   0   0  0.000    1    0   1   2   1   0  3  2   4          6
## 6   0.250   1   2  0.500    2    3   5   1   0   0  1  2   6          0
#Viewing the attributes in the "games_details.csv" data set
str(games_details)
## 'data.frame':    576782 obs. of  28 variables:
##  $ GAME_ID          : int  21900895 21900895 21900895 21900895 21900895 21900895 21900895 21900895 21900895 21900895 ...
##  $ TEAM_ID          : int  1610612749 1610612749 1610612749 1610612749 1610612749 1610612749 1610612749 1610612749 1610612749 1610612749 ...
##  $ TEAM_ABBREVIATION: chr  "MIL" "MIL" "MIL" "MIL" ...
##  $ TEAM_CITY        : chr  "Milwaukee" "Milwaukee" "Milwaukee" "Milwaukee" ...
##  $ PLAYER_ID        : int  202083 203507 201572 1628978 202339 1626192 201577 1628425 101107 201588 ...
##  $ PLAYER_NAME      : chr  "Wesley Matthews" "Giannis Antetokounmpo" "Brook Lopez" "Donte DiVincenzo" ...
##  $ START_POSITION   : chr  "F" "F" "C" "G" ...
##  $ COMMENT          : chr  "" "" "" "" ...
##  $ MIN              : chr  "27:08" "34:55" "26:25" "27:35" ...
##  $ FGM              : num  3 17 4 1 2 2 1 1 0 4 ...
##  $ FGA              : num  11 28 11 5 8 5 5 2 1 11 ...
##  $ FG_PCT           : num  0.273 0.607 0.364 0.2 0.25 0.4 0.2 0.5 0 0.364 ...
##  $ FG3M             : num  2 1 1 0 0 1 0 1 0 1 ...
##  $ FG3A             : num  7 4 5 3 1 4 0 2 1 4 ...
##  $ FG3_PCT          : num  0.286 0.25 0.2 0 0 0.25 0 0.5 0 0.25 ...
##  $ FTM              : num  0 6 7 0 0 1 0 0 0 2 ...
##  $ FTA              : num  0 7 9 0 0 2 0 0 0 3 ...
##  $ FT_PCT           : num  0 0.857 0.778 0 0 0.5 0 0 0 0.667 ...
##  $ OREB             : num  4 2 2 1 1 2 1 0 0 2 ...
##  $ DREB             : num  4 18 5 6 0 3 2 3 2 3 ...
##  $ REB              : num  8 20 7 7 1 5 3 3 2 5 ...
##  $ AST              : num  2 6 0 5 2 1 0 0 2 2 ...
##  $ STL              : num  2 1 0 0 1 0 0 0 1 2 ...
##  $ BLK              : num  0 0 3 1 0 0 1 0 1 0 ...
##  $ TO               : num  0 3 0 2 3 1 2 1 1 3 ...
##  $ PF               : num  0 2 2 0 2 2 1 0 1 1 ...
##  $ PTS              : num  8 41 16 2 4 6 2 3 0 11 ...
##  $ PLUS_MINUS       : num  11 22 16 14 6 0 -12 -8 -11 2 ...
#Checking the dimension of the "games_details.csv" data set
dim(games_details)
## [1] 576782     28

We are renaming column TEAM_ID to HOME_TEAM_ID in “games_details.csv” data set , so that it becomes parts of the primary key, before we perform the join operation.

colnames(games_details)[2] <- "HOME_TEAM_ID"
head(games_details)
##    GAME_ID HOME_TEAM_ID TEAM_ABBREVIATION TEAM_CITY PLAYER_ID
## 1 21900895   1610612749               MIL Milwaukee    202083
## 2 21900895   1610612749               MIL Milwaukee    203507
## 3 21900895   1610612749               MIL Milwaukee    201572
## 4 21900895   1610612749               MIL Milwaukee   1628978
## 5 21900895   1610612749               MIL Milwaukee    202339
## 6 21900895   1610612749               MIL Milwaukee   1626192
##             PLAYER_NAME START_POSITION COMMENT   MIN FGM FGA FG_PCT FG3M FG3A
## 1       Wesley Matthews              F         27:08   3  11  0.273    2    7
## 2 Giannis Antetokounmpo              F         34:55  17  28  0.607    1    4
## 3           Brook Lopez              C         26:25   4  11  0.364    1    5
## 4      Donte DiVincenzo              G         27:35   1   5  0.200    0    3
## 5          Eric Bledsoe              G         22:17   2   8  0.250    0    1
## 6       Pat Connaughton                        24:52   2   5  0.400    1    4
##   FG3_PCT FTM FTA FT_PCT OREB DREB REB AST STL BLK TO PF PTS PLUS_MINUS
## 1   0.286   0   0  0.000    4    4   8   2   2   0  0  0   8         11
## 2   0.250   6   7  0.857    2   18  20   6   1   0  3  2  41         22
## 3   0.200   7   9  0.778    2    5   7   0   0   3  0  2  16         16
## 4   0.000   0   0  0.000    1    6   7   5   0   1  2  0   2         14
## 5   0.000   0   0  0.000    1    0   1   2   1   0  3  2   4          6
## 6   0.250   1   2  0.500    2    3   5   1   0   0  1  2   6          0

Performing the join operation operation of the two data sets: “games.csv” dataset and “games_details.csv”. The merged data set contains 288643 observations and 47 attributes found using dim() function before any kind of data pre-processing.

#Performing the Join operation of the data section
nba_games_dataset <- games_details %>% inner_join(games, by = c('GAME_ID','HOME_TEAM_ID'))
head(nba_games_dataset)
##    GAME_ID HOME_TEAM_ID TEAM_ABBREVIATION TEAM_CITY PLAYER_ID     PLAYER_NAME
## 1 21900895   1610612766               CHA Charlotte   1628970   Miles Bridges
## 2 21900895   1610612766               CHA Charlotte   1629023 P.J. Washington
## 3 21900895   1610612766               CHA Charlotte    202687 Bismack Biyombo
## 4 21900895   1610612766               CHA Charlotte   1628984 Devonte' Graham
## 5 21900895   1610612766               CHA Charlotte   1626179    Terry Rozier
## 6 21900895   1610612766               CHA Charlotte   1628998     Cody Martin
##   START_POSITION COMMENT   MIN FGM FGA FG_PCT FG3M FG3A FG3_PCT FTM FTA FT_PCT
## 1              F         35:15   3  13  0.231    1    7   0.143   0   0      0
## 2              F         31:52   5  14  0.357    1    8   0.125   1   1      1
## 3              C         22:07   2   8  0.250    0    0   0.000   4   4      1
## 4              G         32:21   7  18  0.389    3    8   0.375   0   1      0
## 5              G         36:05   6  18  0.333    0    3   0.000   1   1      1
## 6                        29:08   4   8  0.500    2    5   0.400   1   1      1
##   OREB DREB REB AST STL BLK TO PF PTS PLUS_MINUS GAME_DATE_EST GAME_STATUS_TEXT
## 1    1    3   4   2   2   2  2  2   7         -4    2020-03-01            Final
## 2    1    5   6   3   0   2  2  3  12        -13    2020-03-01            Final
## 3    4    5   9   2   0   2  1  3   8        -15    2020-03-01            Final
## 4    0    2   2   3   1   0  0  2  17        -14    2020-03-01            Final
## 5    2    1   3   4   1   0  2  2  13        -20    2020-03-01            Final
## 6    0    5   5   2   0   1  1  2  11          2    2020-03-01            Final
##   VISITOR_TEAM_ID SEASON TEAM_ID_home PTS_home FG_PCT_home FT_PCT_home
## 1      1610612749   2019   1610612766       85       0.354         0.9
## 2      1610612749   2019   1610612766       85       0.354         0.9
## 3      1610612749   2019   1610612766       85       0.354         0.9
## 4      1610612749   2019   1610612766       85       0.354         0.9
## 5      1610612749   2019   1610612766       85       0.354         0.9
## 6      1610612749   2019   1610612766       85       0.354         0.9
##   FG3_PCT_home AST_home REB_home TEAM_ID_away PTS_away FG_PCT_away FT_PCT_away
## 1        0.229       22       47   1610612749       93       0.402       0.762
## 2        0.229       22       47   1610612749       93       0.402       0.762
## 3        0.229       22       47   1610612749       93       0.402       0.762
## 4        0.229       22       47   1610612749       93       0.402       0.762
## 5        0.229       22       47   1610612749       93       0.402       0.762
## 6        0.229       22       47   1610612749       93       0.402       0.762
##   FG3_PCT_away AST_away REB_away HOME_TEAM_WINS
## 1        0.226       20       61              0
## 2        0.226       20       61              0
## 3        0.226       20       61              0
## 4        0.226       20       61              0
## 5        0.226       20       61              0
## 6        0.226       20       61              0
#Checking the dimensions of the merged dataset
dim(nba_games_dataset)
## [1] 288643     47

4. Tidy & Manipulate Data I:

The “games.csv” data set is not tidy because it contains 2 attributes “AWAY_TEAM_ID” and “TEAM_ID_away” which are duplicates of each other.Therefore, we are dropping “Team_ID_away” from the data set.

According to the Tidy Principles:Each variable forms a column, whereas variables here are forming multiple columns. Therefore, I am dropping “Team_ID_home” and “Team_ID_away” attributes from the “nba_games_dataset” dataset.

#Dropping TEAM_ID_HOME and TEAM_ID_AWAY
nba_games_dataset <- nba_games_dataset %>% subset( select= -c(TEAM_ID_home,TEAM_ID_away) )
head(nba_games_dataset)
##    GAME_ID HOME_TEAM_ID TEAM_ABBREVIATION TEAM_CITY PLAYER_ID     PLAYER_NAME
## 1 21900895   1610612766               CHA Charlotte   1628970   Miles Bridges
## 2 21900895   1610612766               CHA Charlotte   1629023 P.J. Washington
## 3 21900895   1610612766               CHA Charlotte    202687 Bismack Biyombo
## 4 21900895   1610612766               CHA Charlotte   1628984 Devonte' Graham
## 5 21900895   1610612766               CHA Charlotte   1626179    Terry Rozier
## 6 21900895   1610612766               CHA Charlotte   1628998     Cody Martin
##   START_POSITION COMMENT   MIN FGM FGA FG_PCT FG3M FG3A FG3_PCT FTM FTA FT_PCT
## 1              F         35:15   3  13  0.231    1    7   0.143   0   0      0
## 2              F         31:52   5  14  0.357    1    8   0.125   1   1      1
## 3              C         22:07   2   8  0.250    0    0   0.000   4   4      1
## 4              G         32:21   7  18  0.389    3    8   0.375   0   1      0
## 5              G         36:05   6  18  0.333    0    3   0.000   1   1      1
## 6                        29:08   4   8  0.500    2    5   0.400   1   1      1
##   OREB DREB REB AST STL BLK TO PF PTS PLUS_MINUS GAME_DATE_EST GAME_STATUS_TEXT
## 1    1    3   4   2   2   2  2  2   7         -4    2020-03-01            Final
## 2    1    5   6   3   0   2  2  3  12        -13    2020-03-01            Final
## 3    4    5   9   2   0   2  1  3   8        -15    2020-03-01            Final
## 4    0    2   2   3   1   0  0  2  17        -14    2020-03-01            Final
## 5    2    1   3   4   1   0  2  2  13        -20    2020-03-01            Final
## 6    0    5   5   2   0   1  1  2  11          2    2020-03-01            Final
##   VISITOR_TEAM_ID SEASON PTS_home FG_PCT_home FT_PCT_home FG3_PCT_home AST_home
## 1      1610612749   2019       85       0.354         0.9        0.229       22
## 2      1610612749   2019       85       0.354         0.9        0.229       22
## 3      1610612749   2019       85       0.354         0.9        0.229       22
## 4      1610612749   2019       85       0.354         0.9        0.229       22
## 5      1610612749   2019       85       0.354         0.9        0.229       22
## 6      1610612749   2019       85       0.354         0.9        0.229       22
##   REB_home PTS_away FG_PCT_away FT_PCT_away FG3_PCT_away AST_away REB_away
## 1       47       93       0.402       0.762        0.226       20       61
## 2       47       93       0.402       0.762        0.226       20       61
## 3       47       93       0.402       0.762        0.226       20       61
## 4       47       93       0.402       0.762        0.226       20       61
## 5       47       93       0.402       0.762        0.226       20       61
## 6       47       93       0.402       0.762        0.226       20       61
##   HOME_TEAM_WINS
## 1              0
## 2              0
## 3              0
## 4              0
## 5              0
## 6              0

The “games_details.csv” data set is not tidy because “COMMENT” is not really a variable. According to Tidy Principle,

  1. Each variable should form a column, whereas comment is not a variable
  2. Comment is absent in almost all the rows, for it to be considered as part of a table.

To tidy this problem, we drop “COMMENT” column from the joined “nba_games_dataset” dataset.

#Dropping the comment column
nba_games_dataset <- nba_games_dataset %>% subset( select= -c(COMMENT) )
head(nba_games_dataset)
##    GAME_ID HOME_TEAM_ID TEAM_ABBREVIATION TEAM_CITY PLAYER_ID     PLAYER_NAME
## 1 21900895   1610612766               CHA Charlotte   1628970   Miles Bridges
## 2 21900895   1610612766               CHA Charlotte   1629023 P.J. Washington
## 3 21900895   1610612766               CHA Charlotte    202687 Bismack Biyombo
## 4 21900895   1610612766               CHA Charlotte   1628984 Devonte' Graham
## 5 21900895   1610612766               CHA Charlotte   1626179    Terry Rozier
## 6 21900895   1610612766               CHA Charlotte   1628998     Cody Martin
##   START_POSITION   MIN FGM FGA FG_PCT FG3M FG3A FG3_PCT FTM FTA FT_PCT OREB
## 1              F 35:15   3  13  0.231    1    7   0.143   0   0      0    1
## 2              F 31:52   5  14  0.357    1    8   0.125   1   1      1    1
## 3              C 22:07   2   8  0.250    0    0   0.000   4   4      1    4
## 4              G 32:21   7  18  0.389    3    8   0.375   0   1      0    0
## 5              G 36:05   6  18  0.333    0    3   0.000   1   1      1    2
## 6                29:08   4   8  0.500    2    5   0.400   1   1      1    0
##   DREB REB AST STL BLK TO PF PTS PLUS_MINUS GAME_DATE_EST GAME_STATUS_TEXT
## 1    3   4   2   2   2  2  2   7         -4    2020-03-01            Final
## 2    5   6   3   0   2  2  3  12        -13    2020-03-01            Final
## 3    5   9   2   0   2  1  3   8        -15    2020-03-01            Final
## 4    2   2   3   1   0  0  2  17        -14    2020-03-01            Final
## 5    1   3   4   1   0  2  2  13        -20    2020-03-01            Final
## 6    5   5   2   0   1  1  2  11          2    2020-03-01            Final
##   VISITOR_TEAM_ID SEASON PTS_home FG_PCT_home FT_PCT_home FG3_PCT_home AST_home
## 1      1610612749   2019       85       0.354         0.9        0.229       22
## 2      1610612749   2019       85       0.354         0.9        0.229       22
## 3      1610612749   2019       85       0.354         0.9        0.229       22
## 4      1610612749   2019       85       0.354         0.9        0.229       22
## 5      1610612749   2019       85       0.354         0.9        0.229       22
## 6      1610612749   2019       85       0.354         0.9        0.229       22
##   REB_home PTS_away FG_PCT_away FT_PCT_away FG3_PCT_away AST_away REB_away
## 1       47       93       0.402       0.762        0.226       20       61
## 2       47       93       0.402       0.762        0.226       20       61
## 3       47       93       0.402       0.762        0.226       20       61
## 4       47       93       0.402       0.762        0.226       20       61
## 5       47       93       0.402       0.762        0.226       20       61
## 6       47       93       0.402       0.762        0.226       20       61
##   HOME_TEAM_WINS
## 1              0
## 2              0
## 3              0
## 4              0
## 5              0
## 6              0

“MIN” is the total time played by a player. Where the minutes are stored as hours and minutes are concatenated by “:” According to Tidy Principle, 1. Each variable should form a column, whereas MIN contains multiples variables stored in one column.

To tidy this problem, we divide “MIN” column into “HOURS” and “MINUTES” COLUMNS using “:” as the separator.

#Separating MIN into HOURS and MINUTES TO MAKE the merged dataset tidy
nba_games_dataset <- nba_games_dataset %>% separate(MIN, into = c("HOURS", "MINUTES"), sep = ":")
head(nba_games_dataset)
##    GAME_ID HOME_TEAM_ID TEAM_ABBREVIATION TEAM_CITY PLAYER_ID     PLAYER_NAME
## 1 21900895   1610612766               CHA Charlotte   1628970   Miles Bridges
## 2 21900895   1610612766               CHA Charlotte   1629023 P.J. Washington
## 3 21900895   1610612766               CHA Charlotte    202687 Bismack Biyombo
## 4 21900895   1610612766               CHA Charlotte   1628984 Devonte' Graham
## 5 21900895   1610612766               CHA Charlotte   1626179    Terry Rozier
## 6 21900895   1610612766               CHA Charlotte   1628998     Cody Martin
##   START_POSITION HOURS MINUTES FGM FGA FG_PCT FG3M FG3A FG3_PCT FTM FTA FT_PCT
## 1              F    35      15   3  13  0.231    1    7   0.143   0   0      0
## 2              F    31      52   5  14  0.357    1    8   0.125   1   1      1
## 3              C    22      07   2   8  0.250    0    0   0.000   4   4      1
## 4              G    32      21   7  18  0.389    3    8   0.375   0   1      0
## 5              G    36      05   6  18  0.333    0    3   0.000   1   1      1
## 6                   29      08   4   8  0.500    2    5   0.400   1   1      1
##   OREB DREB REB AST STL BLK TO PF PTS PLUS_MINUS GAME_DATE_EST GAME_STATUS_TEXT
## 1    1    3   4   2   2   2  2  2   7         -4    2020-03-01            Final
## 2    1    5   6   3   0   2  2  3  12        -13    2020-03-01            Final
## 3    4    5   9   2   0   2  1  3   8        -15    2020-03-01            Final
## 4    0    2   2   3   1   0  0  2  17        -14    2020-03-01            Final
## 5    2    1   3   4   1   0  2  2  13        -20    2020-03-01            Final
## 6    0    5   5   2   0   1  1  2  11          2    2020-03-01            Final
##   VISITOR_TEAM_ID SEASON PTS_home FG_PCT_home FT_PCT_home FG3_PCT_home AST_home
## 1      1610612749   2019       85       0.354         0.9        0.229       22
## 2      1610612749   2019       85       0.354         0.9        0.229       22
## 3      1610612749   2019       85       0.354         0.9        0.229       22
## 4      1610612749   2019       85       0.354         0.9        0.229       22
## 5      1610612749   2019       85       0.354         0.9        0.229       22
## 6      1610612749   2019       85       0.354         0.9        0.229       22
##   REB_home PTS_away FG_PCT_away FT_PCT_away FG3_PCT_away AST_away REB_away
## 1       47       93       0.402       0.762        0.226       20       61
## 2       47       93       0.402       0.762        0.226       20       61
## 3       47       93       0.402       0.762        0.226       20       61
## 4       47       93       0.402       0.762        0.226       20       61
## 5       47       93       0.402       0.762        0.226       20       61
## 6       47       93       0.402       0.762        0.226       20       61
##   HOME_TEAM_WINS
## 1              0
## 2              0
## 3              0
## 4              0
## 5              0
## 6              0

For tidying the character variables, The strip.white = True was used while reading the data set which removes any white space in the data set but the data set contains newline character in the characters. This ensures that there are no white spaces in the character columns.

nba_games_dataset$TEAM_ABBREVIATION <- str_replace_all(nba_games_dataset$TEAM_ABBREVIATION, "[\r\n]" , "")
nba_games_dataset$TEAM_CITY <- str_replace_all(nba_games_dataset$TEAM_CITY, "[\r\n]" , "")
nba_games_dataset$PLAYER_NAME <- str_replace_all(nba_games_dataset$PLAYER_NAME, "[\r\n]" , "")
nba_games_dataset$GAME_DATE_EST <- str_replace_all(nba_games_dataset$GAME_DATE_EST, "[\r\n]" , "")

5. Tidy & Manipulate Data II:

Mutating a new variable called “OREB_To_DREB_RATIO” that measures offensive rebounds as a comparison to defensive rebounds. The value “0.00001” is added in the denominator so that ratio doesn’t turn to the infinity value.

#This disables the scientific notation
options(scipen = 999)
#Mutating the variable "OREB_To_DREB_RATIO"
nba_games_dataset <-  nba_games_dataset %>% mutate(OREB_To_DREB_RATIO = round((OREB/(DREB+0.00001)),2))
head(nba_games_dataset)
##    GAME_ID HOME_TEAM_ID TEAM_ABBREVIATION TEAM_CITY PLAYER_ID     PLAYER_NAME
## 1 21900895   1610612766               CHA Charlotte   1628970   Miles Bridges
## 2 21900895   1610612766               CHA Charlotte   1629023 P.J. Washington
## 3 21900895   1610612766               CHA Charlotte    202687 Bismack Biyombo
## 4 21900895   1610612766               CHA Charlotte   1628984 Devonte' Graham
## 5 21900895   1610612766               CHA Charlotte   1626179    Terry Rozier
## 6 21900895   1610612766               CHA Charlotte   1628998     Cody Martin
##   START_POSITION HOURS MINUTES FGM FGA FG_PCT FG3M FG3A FG3_PCT FTM FTA FT_PCT
## 1              F    35      15   3  13  0.231    1    7   0.143   0   0      0
## 2              F    31      52   5  14  0.357    1    8   0.125   1   1      1
## 3              C    22      07   2   8  0.250    0    0   0.000   4   4      1
## 4              G    32      21   7  18  0.389    3    8   0.375   0   1      0
## 5              G    36      05   6  18  0.333    0    3   0.000   1   1      1
## 6                   29      08   4   8  0.500    2    5   0.400   1   1      1
##   OREB DREB REB AST STL BLK TO PF PTS PLUS_MINUS GAME_DATE_EST GAME_STATUS_TEXT
## 1    1    3   4   2   2   2  2  2   7         -4    2020-03-01            Final
## 2    1    5   6   3   0   2  2  3  12        -13    2020-03-01            Final
## 3    4    5   9   2   0   2  1  3   8        -15    2020-03-01            Final
## 4    0    2   2   3   1   0  0  2  17        -14    2020-03-01            Final
## 5    2    1   3   4   1   0  2  2  13        -20    2020-03-01            Final
## 6    0    5   5   2   0   1  1  2  11          2    2020-03-01            Final
##   VISITOR_TEAM_ID SEASON PTS_home FG_PCT_home FT_PCT_home FG3_PCT_home AST_home
## 1      1610612749   2019       85       0.354         0.9        0.229       22
## 2      1610612749   2019       85       0.354         0.9        0.229       22
## 3      1610612749   2019       85       0.354         0.9        0.229       22
## 4      1610612749   2019       85       0.354         0.9        0.229       22
## 5      1610612749   2019       85       0.354         0.9        0.229       22
## 6      1610612749   2019       85       0.354         0.9        0.229       22
##   REB_home PTS_away FG_PCT_away FT_PCT_away FG3_PCT_away AST_away REB_away
## 1       47       93       0.402       0.762        0.226       20       61
## 2       47       93       0.402       0.762        0.226       20       61
## 3       47       93       0.402       0.762        0.226       20       61
## 4       47       93       0.402       0.762        0.226       20       61
## 5       47       93       0.402       0.762        0.226       20       61
## 6       47       93       0.402       0.762        0.226       20       61
##   HOME_TEAM_WINS OREB_To_DREB_RATIO
## 1              0               0.33
## 2              0               0.20
## 3              0               0.80
## 4              0               0.00
## 5              0               2.00
## 6              0               0.00

6. Understand:

Displaying the attributes in the data set to understand the data types in the merged data set

#Displaying the datatype of the attributes before data type conversions
str(nba_games_dataset)
## 'data.frame':    288643 obs. of  46 variables:
##  $ GAME_ID           : int  21900895 21900895 21900895 21900895 21900895 21900895 21900895 21900895 21900895 21900895 ...
##  $ HOME_TEAM_ID      : int  1610612766 1610612766 1610612766 1610612766 1610612766 1610612766 1610612766 1610612766 1610612766 1610612766 ...
##  $ TEAM_ABBREVIATION : chr  "CHA" "CHA" "CHA" "CHA" ...
##  $ TEAM_CITY         : chr  "Charlotte" "Charlotte" "Charlotte" "Charlotte" ...
##  $ PLAYER_ID         : int  1628970 1629023 202687 1628984 1626179 1628998 1629667 1626195 1628997 201587 ...
##  $ PLAYER_NAME       : chr  "Miles Bridges" "P.J. Washington" "Bismack Biyombo" "Devonte' Graham" ...
##  $ START_POSITION    : chr  "F" "F" "C" "G" ...
##  $ HOURS             : chr  "35" "31" "22" "32" ...
##  $ MINUTES           : chr  "15" "52" "07" "21" ...
##  $ FGM               : num  3 5 2 7 6 4 1 4 2 NA ...
##  $ FGA               : num  13 14 8 18 18 8 2 9 6 NA ...
##  $ FG_PCT            : num  0.231 0.357 0.25 0.389 0.333 0.5 0.5 0.444 0.333 NA ...
##  $ FG3M              : num  1 1 0 3 0 2 0 0 1 NA ...
##  $ FG3A              : num  7 8 0 8 3 5 1 1 2 NA ...
##  $ FG3_PCT           : num  0.143 0.125 0 0.375 0 0.4 0 0 0.5 NA ...
##  $ FTM               : num  0 1 4 0 1 1 0 2 0 NA ...
##  $ FTA               : num  0 1 4 1 1 1 0 2 0 NA ...
##  $ FT_PCT            : num  0 1 1 0 1 1 0 1 0 NA ...
##  $ OREB              : num  1 1 4 0 2 0 0 3 1 NA ...
##  $ DREB              : num  3 5 5 2 1 5 1 10 3 NA ...
##  $ REB               : num  4 6 9 2 3 5 1 13 4 NA ...
##  $ AST               : num  2 3 2 3 4 2 1 4 1 NA ...
##  $ STL               : num  2 0 0 1 1 0 0 2 1 NA ...
##  $ BLK               : num  2 2 2 0 0 1 1 0 0 NA ...
##  $ TO                : num  2 2 1 0 2 1 0 1 1 NA ...
##  $ PF                : num  2 3 3 2 2 2 1 0 3 NA ...
##  $ PTS               : num  7 12 8 17 13 11 2 10 5 NA ...
##  $ PLUS_MINUS        : num  -4 -13 -15 -14 -20 2 0 11 13 NA ...
##  $ GAME_DATE_EST     : chr  "2020-03-01" "2020-03-01" "2020-03-01" "2020-03-01" ...
##  $ GAME_STATUS_TEXT  : chr  "Final" "Final" "Final" "Final" ...
##  $ VISITOR_TEAM_ID   : int  1610612749 1610612749 1610612749 1610612749 1610612749 1610612749 1610612749 1610612749 1610612749 1610612749 ...
##  $ SEASON            : int  2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...
##  $ PTS_home          : num  85 85 85 85 85 85 85 85 85 85 ...
##  $ FG_PCT_home       : num  0.354 0.354 0.354 0.354 0.354 0.354 0.354 0.354 0.354 0.354 ...
##  $ FT_PCT_home       : num  0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 ...
##  $ FG3_PCT_home      : num  0.229 0.229 0.229 0.229 0.229 0.229 0.229 0.229 0.229 0.229 ...
##  $ AST_home          : num  22 22 22 22 22 22 22 22 22 22 ...
##  $ REB_home          : num  47 47 47 47 47 47 47 47 47 47 ...
##  $ PTS_away          : num  93 93 93 93 93 93 93 93 93 93 ...
##  $ FG_PCT_away       : num  0.402 0.402 0.402 0.402 0.402 0.402 0.402 0.402 0.402 0.402 ...
##  $ FT_PCT_away       : num  0.762 0.762 0.762 0.762 0.762 0.762 0.762 0.762 0.762 0.762 ...
##  $ FG3_PCT_away      : num  0.226 0.226 0.226 0.226 0.226 0.226 0.226 0.226 0.226 0.226 ...
##  $ AST_away          : num  20 20 20 20 20 20 20 20 20 20 ...
##  $ REB_away          : num  61 61 61 61 61 61 61 61 61 61 ...
##  $ HOME_TEAM_WINS    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ OREB_To_DREB_RATIO: num  0.33 0.2 0.8 0 2 0 0 0.3 0.33 NA ...

GAME_ID, HOME_TEAM_ID,VISITOR_TEAM_ID, PLAYER_ID are not factors in the merged dataset, which needs to be converted into a factor because Identification numbers are factors.

Checking the Data Type

a <- nba_games_dataset$GAME_ID  %>% class()
b <- nba_games_dataset$HOME_TEAM_ID  %>% class()
c <- nba_games_dataset$VISITOR_TEAM_ID  %>% class()
d <- nba_games_dataset$PLAYER_ID %>% class()
print(c(a,b,c,d))
## [1] "integer" "integer" "integer" "integer"

After Data Type Conversion

#Conversion to Factor
nba_games_dataset$GAME_ID <- nba_games_dataset$GAME_ID  %>% as.factor()
nba_games_dataset$HOME_TEAM_ID <- nba_games_dataset$HOME_TEAM_ID  %>% as.factor()
nba_games_dataset$VISITOR_TEAM_ID <- nba_games_dataset$VISITOR_TEAM_ID  %>% as.factor()
nba_games_dataset$PLAYER_ID <- nba_games_dataset$PLAYER_ID %>% as.integer() %>% as.factor()
a <- nba_games_dataset$GAME_ID  %>% class()
b <- nba_games_dataset$HOME_TEAM_ID  %>% class()
c <- nba_games_dataset$VISITOR_TEAM_ID  %>% class()
d <- nba_games_dataset$PLAYER_ID %>% class()
print(c(a,b,c,d))
## [1] "factor" "factor" "factor" "factor"

TEAM_CITY, TEAM_ABBREVIATION,PLAYER_NAME are all character variables. So no changes need to be made.

#Checking the data type
nba_games_dataset$TEAM_CITY  %>% class()
## [1] "character"
nba_games_dataset$TEAM_ABBREVIATION  %>% class()
## [1] "character"
nba_games_dataset$PLAYER_NAME  %>% class()
## [1] "character"

START_POSITION is a character variable in the merged dataset. In reality it is a ordered factor variable where positions are ranked by height. Guard(G)Forward(F)<Center(C). G is the smallest player on the team and C is the largest player on the team. Therefore START POSITION needs to be converted to a ordered factor variable.

Checking the data type of START_POSITION

nba_games_dataset$START_POSITION %>% class()
## [1] "character"

Converting START_POSITION to an ordered factor variable and checking the levels of the factor variable

nba_games_dataset$START_POSITION<-factor(nba_games_dataset$START_POSITION ,levels=c("G", "F", "C"),ordered=TRUE)
head(nba_games_dataset$START_POSITION)
## [1] F    F    C    G    G    <NA>
## Levels: G < F < C
#Checking the levels of the factor variable
levels(nba_games_dataset$START_POSITION)
## [1] "G" "F" "C"

HOURS and MINUTES are all characters variables in the merged data set.

#Checking the data type
nba_games_dataset$HOURS  %>% class()
## [1] "character"
nba_games_dataset$MINUTES  %>% class()
## [1] "character"

HOURS and MINUTES are all numeric variables, more specifically, Integer variables logically. Converting the data type of HOURS MINUTES TO numeric variables

nba_games_dataset$HOURS<-nba_games_dataset$HOURS  %>% as.numeric()
nba_games_dataset$MINUTES<-nba_games_dataset$MINUTES  %>% as.numeric()
#Checking the data type
nba_games_dataset$HOURS  %>% class()
## [1] "numeric"
nba_games_dataset$MINUTES  %>% class()
## [1] "numeric"

FGM,FGA,FG3M,FTM,FTA,OREB,DREB,REB,AST,STL,BLK,TO, PF,PTS,PLUS_MINUS,PTS_home,AST_home,REB_home,PTS_away,AST_away,REB_away are double variables(i.e. numeric) in the merged dataset. Therefore no data type conversions needs to be made.

#Checking the data type
a <- nba_games_dataset$FGM  %>% class()
b <- nba_games_dataset$FGA  %>% class()
c <- nba_games_dataset$FG3M  %>% class()
d <- nba_games_dataset$FTM  %>% class()
e <- nba_games_dataset$FTA  %>% class()
f <- nba_games_dataset$OREB  %>% class()
g <- nba_games_dataset$DREB  %>% class()
h <- nba_games_dataset$REB  %>% class()
i <- nba_games_dataset$AST%>% class()
j <- nba_games_dataset$STL  %>% class()
k <- nba_games_dataset$BLK  %>% class()
l <- nba_games_dataset$TO%>% class()
m <- nba_games_dataset$PF %>% class()
n <- nba_games_dataset$PTS %>% class()
o <- nba_games_dataset$PLUS_MINUS %>% class()
p <- nba_games_dataset$PTS_home %>% class()
q <- nba_games_dataset$AST_home%>% class()
r <- nba_games_dataset$REB_home %>% class()
s <- nba_games_dataset$PTS_away %>% class()
t <- nba_games_dataset$AST_away%>% class()
u <- nba_games_dataset$REB_away %>% class()
print(c(a,b,c,d,e,f,g,h,i,j,l,m,n,o,p,q,r,s,t,u))
##  [1] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
##  [8] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## [15] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"

FG_PCT,FG3_PCT,FT_PCT,FT_PCT_home,FG3_PCT_home,FT_PCT_away,FG3_PCT_away are all double variables (i.e.) because they contain percentage value.Therefore, no data type conversion need to be made.

#Checking the data type
a <- nba_games_dataset$FG_PCT  %>% typeof()
b <- nba_games_dataset$FG3_PCT %>% typeof()
c <- nba_games_dataset$FT_PCT  %>% typeof()
d <- nba_games_dataset$FT_PCT_home  %>% typeof()
e <- nba_games_dataset$FG3_PCT_home  %>% typeof()
f <- nba_games_dataset$FT_PCT_away  %>% typeof()
g <- nba_games_dataset$FG3_PCT_away  %>% typeof()

print(c(a,b,c,d,e,f,g))
## [1] "double" "double" "double" "double" "double" "double" "double"

SEASON is a numeric variable in the merged dataset. In reality is a factor variable.

#Checking the datatype of SEASON Variable
nba_games_dataset$SEASON %>% class()
## [1] "integer"

Converting the SEASON variable to factor variable

nba_games_dataset$SEASON<-nba_games_dataset$SEASON %>% as.factor()
#Checking the data type after data type conversion
nba_games_dataset$SEASON %>% class() 
## [1] "factor"

GAME_STATUS_TEXT is a character variable in the merged dataset. However, it is a factor variable with “FINAL” being one of the states. The other states might include “HALF-TIME” and others

#Checking the data type of GAME_STATUS_TEXT
nba_games_dataset$GAME_STATUS_TEXT %>% class() 
## [1] "character"

Converting the GAME_STATUS_TEXT variable to factor variable

#Converting the GAME_STATUS_TEXT variable to factor variable
nba_games_dataset$GAME_STATUS_TEXT<-nba_games_dataset$GAME_STATUS_TEXT %>% as.factor()
#Checking the data type of of GAME_STATUS_TEXT
nba_games_dataset$GAME_STATUS_TEXT %>% class() 
## [1] "factor"

HOME_TEAM_WINS is a numeric variable is the merged dataset. However it is encoded ordered factor where 0 means loss and 1 means WIN. The hierarchy is LOSS<WIN

#Checking the data type of HOME_TEAM_WINS
nba_games_dataset$HOME_TEAM_WINS %>% class()
## [1] "integer"

Changing it to the right hierarchy.

nba_games_dataset$HOME_TEAM_WINS <- factor(nba_games_dataset$HOME_TEAM_WINS,levels = c(0,1),labels = c("LOSS","WIN"),ordered = TRUE)
levels(nba_games_dataset$HOME_TEAM_WINS )
## [1] "LOSS" "WIN"
#Checking the data type
nba_games_dataset$HOME_TEAM_WINS %>% class()
## [1] "ordered" "factor"

GAME_DATE_EST is a character in the merged data set. So the string must be converted to date.

#Checking the data type of GAME_DATE_EST
nba_games_dataset$GAME_DATE_EST  %>% class()
## [1] "character"
#Converting the string to date
nba_games_dataset$GAME_DATE_EST <- nba_games_dataset$GAME_DATE_EST  %>% as.Date(format = "%YYYY/%MM/%DD")
#Checking the data type
nba_games_dataset$GAME_DATE_EST  %>% class()
## [1] "Date"

The mutated variable OREB_To_DREB_RATIO is a ratio, the data type must be “double” which it already is in the original data set.

#Checking the data type of Offensive_Rebound_To_Defensive_Rebound_Ratio
nba_games_dataset$OREB_To_DREB_RATIO %>% typeof()
## [1] "double"

7. Feature Selection:

Due to the constraints of the maximum page limit of 25 pages, we are selecting important features to reduce the number of features in the merged data set. FG_PCT,FGM and FGA are multi-collinear features since FG_PCT can be derived from FGM and FGA, therefore dropping FGM and FGA. FG_3PCT,FG3M,FG3A are multi-collinear features since FG_3PCT can be derived from FG3M and FG3A, therefore dropping FG3M and FG3A. FT_PCT,FTA,FTM are multi-collinear features since FT_PCT can be derived from FTM and FTA, therefore dropping FTM and FTA. There is a high degree of multi-collinearity between REB,OREB,DREB since REB = (OREB+DREB), therefore dropping OREB and DREB from the merged data set.

GAME_ID,PLAYER_ID,HOME_TEAM_ID and VISITOR_TEAM_ID are identifiers which have a sequential pattern. If this kind of identifiers are inserted into a model, it will introduce biases and learning wrong patterns. Therefore these attributes are dropped from the merged data set.

#Performing Feature Selection by dropping columns which might be problematic for machine learning
nba_games_dataset <- nba_games_dataset %>% subset( select= -c(OREB,DREB,FGM,FGA,FG3M,FG3A,FTM,FTA,GAME_ID,PLAYER_ID,HOME_TEAM_ID,VISITOR_TEAM_ID) )

8. Scan I:

We are scanning for any “NA” values in the data set, which are the missing values,as it can affect the results of analysis.

# creating a dataframe with the sum of na values in the data set 
a<-  data.frame(colSums(is.na (nba_games_dataset  )))
a
##                    colSums.is.na.nba_games_dataset..
## TEAM_ABBREVIATION                                  0
## TEAM_CITY                                          0
## PLAYER_NAME                                        0
## START_POSITION                                177959
## HOURS                                          46304
## MINUTES                                        58253
## FG_PCT                                         46304
## FG3_PCT                                        46304
## FT_PCT                                         46304
## REB                                            46304
## AST                                            46304
## STL                                            46304
## BLK                                            46304
## TO                                             46304
## PF                                             46304
## PTS                                            46304
## PLUS_MINUS                                     58253
## GAME_DATE_EST                                 288643
## GAME_STATUS_TEXT                                   0
## SEASON                                             0
## PTS_home                                           0
## FG_PCT_home                                        0
## FT_PCT_home                                        0
## FG3_PCT_home                                       0
## AST_home                                           0
## REB_home                                           0
## PTS_away                                           0
## FG_PCT_away                                        0
## FT_PCT_away                                        0
## FG3_PCT_away                                       0
## AST_away                                           0
## REB_away                                           0
## HOME_TEAM_WINS                                     0
## OREB_To_DREB_RATIO                             46304

From the output of the columns and their missing values are as follows:

START_POSITION 177959,
MINUTES 58253,
FG_PCT 46304,
FG3_PCT 46304,
FT_PCT 46304,
REB 46304,

As it is seen that major columns have a missing values of 46304 which would not add any value and if replaced by any other value it would give wrong results while analysis and hence omitting these rows.

 nba_games_dataset <- nba_games_dataset %>% drop_na(REB)
#Checking the dimensions after dropping the all the rows with NA values
dim(nba_games_dataset)
## [1] 242339     34
# checking the missing values 
b<- data.frame( colSums(is.na(nba_games_dataset)))
b
##                    colSums.is.na.nba_games_dataset..
## TEAM_ABBREVIATION                                  0
## TEAM_CITY                                          0
## PLAYER_NAME                                        0
## START_POSITION                                131655
## HOURS                                              0
## MINUTES                                        11949
## FG_PCT                                             0
## FG3_PCT                                            0
## FT_PCT                                             0
## REB                                                0
## AST                                                0
## STL                                                0
## BLK                                                0
## TO                                                 0
## PF                                                 0
## PTS                                                0
## PLUS_MINUS                                     11949
## GAME_DATE_EST                                 242339
## GAME_STATUS_TEXT                                   0
## SEASON                                             0
## PTS_home                                           0
## FG_PCT_home                                        0
## FT_PCT_home                                        0
## FG3_PCT_home                                       0
## AST_home                                           0
## REB_home                                           0
## PTS_away                                           0
## FG_PCT_away                                        0
## FT_PCT_away                                        0
## FG3_PCT_away                                       0
## AST_away                                           0
## REB_away                                           0
## HOME_TEAM_WINS                                     0
## OREB_To_DREB_RATIO                                 0

We are left with missing values from

START_POSITION - 131655, MINUTES - 11949,

As the time column was split into HOURS and MINUTES all of the columns have the same number of missing values and since the hours played for each match are almost the same, replacing it with mean values of the columns.

# replacing the HOURS and MINUTES  column.
nba_games_dataset$HOURS[is.na(nba_games_dataset$HOURS)] <- mean(nba_games_dataset$HOURS, na.rm =TRUE)
nba_games_dataset$MINUTES[is.na(nba_games_dataset$MINUTES)] <- mean(nba_games_dataset$MINUTES, na.rm =TRUE)

As the start position has many missing values and the guards(“G”) can play small Forward position which is “F”, and Center(“C”) players can also play in the power Forward position which is “F”, hence replacing the na values with “F”

# checking the levels of the column
levels(nba_games_dataset$START_POSITION)
## [1] "G" "F" "C"
# replacing
nba_games_dataset$START_POSITION[is.na(nba_games_dataset$START_POSITION)] <- "F"
# checking if it is factor
is.factor(nba_games_dataset$START_POSITION)
## [1] TRUE
# Checking if all the na values were replaced
b<- data.frame( colSums(is.na(nba_games_dataset)))
b
##                    colSums.is.na.nba_games_dataset..
## TEAM_ABBREVIATION                                  0
## TEAM_CITY                                          0
## PLAYER_NAME                                        0
## START_POSITION                                     0
## HOURS                                              0
## MINUTES                                            0
## FG_PCT                                             0
## FG3_PCT                                            0
## FT_PCT                                             0
## REB                                                0
## AST                                                0
## STL                                                0
## BLK                                                0
## TO                                                 0
## PF                                                 0
## PTS                                                0
## PLUS_MINUS                                     11949
## GAME_DATE_EST                                 242339
## GAME_STATUS_TEXT                                   0
## SEASON                                             0
## PTS_home                                           0
## FG_PCT_home                                        0
## FT_PCT_home                                        0
## FG3_PCT_home                                       0
## AST_home                                           0
## REB_home                                           0
## PTS_away                                           0
## FG_PCT_away                                        0
## FT_PCT_away                                        0
## FG3_PCT_away                                       0
## AST_away                                           0
## REB_away                                           0
## HOME_TEAM_WINS                                     0
## OREB_To_DREB_RATIO                                 0
#Checking the dimension of the merged dataset
dim(nba_games_dataset)
## [1] 242339     34

From the last output it can be seen that the data set does not have any missing values anymore.

9. Scan II:

In this section we are scanning for outliers in all of the numeric columns.

# creating a data set with only the numerical columns called d
d<-data.frame(select_if(nba_games_dataset, is.numeric))
str(d)
## 'data.frame':    242339 obs. of  26 variables:
##  $ HOURS             : num  35 31 22 32 36 29 9 20 23 22 ...
##  $ MINUTES           : num  15 52 7 21 5 8 21 21 30 59 ...
##  $ FG_PCT            : num  0.231 0.357 0.25 0.389 0.333 0.5 0.5 0.444 0.333 0.5 ...
##  $ FG3_PCT           : num  0.143 0.125 0 0.375 0 0.4 0 0 0.5 0.5 ...
##  $ FT_PCT            : num  0 1 1 0 1 1 0 1 0 0 ...
##  $ REB               : num  4 6 9 2 3 5 1 13 4 6 ...
##  $ AST               : num  2 3 2 3 4 2 1 4 1 1 ...
##  $ STL               : num  2 0 0 1 1 0 0 2 1 1 ...
##  $ BLK               : num  2 2 2 0 0 1 1 0 0 2 ...
##  $ TO                : num  2 2 1 0 2 1 0 1 1 2 ...
##  $ PF                : num  2 3 3 2 2 2 1 0 3 3 ...
##  $ PTS               : num  7 12 8 17 13 11 2 10 5 9 ...
##  $ PLUS_MINUS        : num  -4 -13 -15 -14 -20 2 0 11 13 -9 ...
##  $ PTS_home          : num  85 85 85 85 85 85 85 85 85 91 ...
##  $ FG_PCT_home       : num  0.354 0.354 0.354 0.354 0.354 0.354 0.354 0.354 0.354 0.364 ...
##  $ FT_PCT_home       : num  0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.4 ...
##  $ FG3_PCT_home      : num  0.229 0.229 0.229 0.229 0.229 0.229 0.229 0.229 0.229 0.31 ...
##  $ AST_home          : num  22 22 22 22 22 22 22 22 22 19 ...
##  $ REB_home          : num  47 47 47 47 47 47 47 47 47 57 ...
##  $ PTS_away          : num  93 93 93 93 93 93 93 93 93 111 ...
##  $ FG_PCT_away       : num  0.402 0.402 0.402 0.402 0.402 0.402 0.402 0.402 0.402 0.468 ...
##  $ FT_PCT_away       : num  0.762 0.762 0.762 0.762 0.762 0.762 0.762 0.762 0.762 0.632 ...
##  $ FG3_PCT_away      : num  0.226 0.226 0.226 0.226 0.226 0.226 0.226 0.226 0.226 0.275 ...
##  $ AST_away          : num  20 20 20 20 20 20 20 20 20 28 ...
##  $ REB_away          : num  61 61 61 61 61 61 61 61 61 56 ...
##  $ OREB_To_DREB_RATIO: num  0.33 0.2 0.8 0 2 0 0 0.3 0.33 1 ...

Plotting the box-plot to visualize the outliers

#Plotting the boxplot
boxplot(d,main="Boxplot of all numerical columns")

Since the box-plot does not show detailed display all the columns and the outliers clearly, we will have to plot individual box plot to identify any kind of outliers. We have also plotted the histograms of the data set to better understand the outliers.

par(mfrow=c(3,4))
# plotting the histogram
for(i in colnames(nba_games_dataset[,c("HOURS","MINUTES","FG_PCT","FG3_PCT","FT_PCT","REB","AST","STL","BLK","TO","PF","PTS","PLUS_MINUS","PTS_home","FG_PCT_home","FT_PCT_home","FG3_PCT_home","AST_home","REB_home","PTS_away","FG_PCT_away","FT_PCT_away","FG3_PCT_away","AST_away","REB_away","OREB_To_DREB_RATIO")]))
{
  hist(nba_games_dataset[,i],main = colnames(nba_games_dataset[i]))
  boxplot(nba_games_dataset[i])
  title(colnames(nba_games_dataset[i]))
}

From the box-plots and histograms it can be seen that all the numerical columns except PLUS_MINUS,PTS_HOme,FG_PCT_home,REB_home,PTS_away and FG_PCT_away are all skewed. Since PLUS_MINUS,PTS_HOme,FG_PCT_home,REB_home,PTS_away and FG_PCT_away is normally distributed, we could have used the z-score to only identify the outliers. But we have chosen “Capping” method, as it does not require for the columns to be normally distributed, allowing us to us this method for all the numerical columns regardless of the distribution of the data.

As we can see from the box-plot, there are multiple columns with outliers, HOURS,FG3_PCT,REB,AST,STL,BLK,TO,PF,PTS,PLUS_MINUS,PTS_home,FG_PCT_home,FT_PCT_home,FG3_PCT,AST_home,REB_home,PTS_away,FG_PCT_away,FT_PCT_away,FG3_PCT_away,AST_away and REB_away. Eliminating these outliers would not be right for analysis since these outliers do not present any meaningful information and would not affect the data analysis of the data. The advantage of “Capping” method is that it replaces the outliers with the best value near it.This makes this method an appropriate method for outlier detection of all numerical columns and imputations with appropriate values.

# Defining the function for the "capping" method based on the upper and lower fences of the boxplot
cap <-function(x){ quantiles <- quantile( x, c(.05,0.25,0.75,.95) ) 
x[ x < quantiles[2] -1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] +1.5*IQR(x) ] <- quantiles[4] 
x }

par(mfrow=c(3,4))

#Selecting the numerical columns
for (i in colnames(nba_games_dataset[,c("HOURS","MINUTES","FG_PCT","FG3_PCT","FT_PCT","REB","AST","STL","BLK","TO","PF","PTS","PTS_home","FG_PCT_home","FT_PCT_home","FG3_PCT_home","AST_home","REB_home","PTS_away","FG_PCT_away","FT_PCT_away","FG3_PCT_away","AST_away","REB_away","OREB_To_DREB_RATIO")])) 
{
  nba_games_dataset[,c("HOURS","MINUTES","FG_PCT","FG3_PCT","FT_PCT","REB","AST","STL","BLK","TO","PF","PTS","PTS_home","FG_PCT_home","FT_PCT_home","FG3_PCT_home","AST_home","REB_home","PTS_away","FG_PCT_away","FT_PCT_away","FG3_PCT_away","AST_away","REB_away","OREB_To_DREB_RATIO")] <- sapply( nba_games_dataset[,c("HOURS","MINUTES","FG_PCT","FG3_PCT","FT_PCT","REB","AST","STL","BLK","TO","PF","PTS","PTS_home","FG_PCT_home","FT_PCT_home","FG3_PCT_home","AST_home","REB_home","PTS_away","FG_PCT_away","FT_PCT_away","FG3_PCT_away","AST_away","REB_away","OREB_To_DREB_RATIO")], FUN = cap)
}


#Plot the histograms and box-plots
for(i in colnames(nba_games_dataset[,c("HOURS","MINUTES","FG_PCT","FG3_PCT","FT_PCT","REB","AST","STL","BLK","TO","PF","PTS","PTS_home","FG_PCT_home","FT_PCT_home","FG3_PCT_home","AST_home","REB_home","PTS_away","FG_PCT_away","FT_PCT_away","FG3_PCT_away","AST_away","REB_away","OREB_To_DREB_RATIO")]))
{
  hist(nba_games_dataset[,i],main = colnames(nba_games_dataset[i]))
  boxplot(nba_games_dataset[i])
  title(colnames(nba_games_dataset[i]))
}

From the output of the box-plots above, it can be seen that there are no outliers remaining in almost all of the numerical columns, except for columns FG3_PCT,STL and OREB_To_DREB RATIO.. It can be also seen that after the outliers are removed, all the numerical columns except columns MINUTES,FG_PCT,FG3PCT,FT_PCT,AST,STL,TO,PF,PTS and OREB_To_DREB RATIO, have become normally distributed.

10. Transform

In this section, we are applying Box-Cox transformation to the columns to reduce the skewness of the numerical columns.Box-Cox Transformation is used instead it can be used to reduce both left and right skewness of the data. FG3_PCT, FT_PCT, AST, STL, TO,PF,PTS and OREB_To_DREB_RATIO columns which are not normally distributed.This is because most machine learning models, for example regression models, require the data to be normally distributed.

Box-Cox transformation is applied to reduce the skewness of the columns that are not normally distributed.

#Box Cox Transformation of all skewed numerical columns
#The code for Box-Cox Transformation
par(mfrow=c(2,2))
#Plotting the histogram of MINUTES before and after BOX COX transformation
hist(nba_games_dataset$MINUTES,main = "Before Box Cox: MINUTES")
nba_games_dataset$MINUTES<-BoxCox(nba_games_dataset$MINUTES,lambda = "auto")
hist(nba_games_dataset$MINUTES,main = "After Box Cox: MINUTES")
#Plotting the histogram of FG_PCT before and after BOX COX transformation
hist(nba_games_dataset$FG_PCT, main = "Before Box Cox: FG_PCT")
nba_games_dataset$FG_PCT<-BoxCox(nba_games_dataset$FG_PCT,lambda = "auto")
hist(nba_games_dataset$FG_PCT, main = "After Box Cox: FG_PCT")

#Plotting the histogram of FG3_PCT before and after BOX COX transformation
hist(nba_games_dataset$FG3_PCT, main = "Before Box Cox: FG3_PCT")
nba_games_dataset$FG3_PCT<-BoxCox(nba_games_dataset$FG3_PCT,lambda = "auto")
hist(nba_games_dataset$FG3_PCT, main="After Box Cox: FG3_PCT")
#Plotting the histogram of FT_PCT before and after BOX COX transformation
hist(nba_games_dataset$FT_PCT, main = "Before Box Cox: FT_PCT")
nba_games_dataset$FT_PCT<-BoxCox(nba_games_dataset$FT_PCT,lambda = "auto")
hist(nba_games_dataset$FT_PCT, main="After Box Cox: FT_PCT")

#Plotting the histogram of AST before and after BOX COX transformation
hist(nba_games_dataset$AST, main = "Before Box Cox: AST")
nba_games_dataset$AST<-BoxCox(nba_games_dataset$AST,lambda = "auto")
hist(nba_games_dataset$AST, main="After Box Cox: AST")
#Plotting the histogram of STL before and after BOX COX transformation
hist(nba_games_dataset$STL, main = "Before Box Cox: STL")
nba_games_dataset$STL<-BoxCox(nba_games_dataset$STL,lambda = "auto")
hist(nba_games_dataset$STL, main="After Box Cox: STL")

#Plotting the histogram of TO before and after BOX COX transformation
hist(nba_games_dataset$TO, main = "Before Box Cox: TO")
nba_games_dataset$TO<-BoxCox(nba_games_dataset$TO,lambda = "auto")
hist(nba_games_dataset$TO, main="After Box Cox: TO")
#Plotting the histogram of PF before and after BOX COX transformation
hist(nba_games_dataset$PF, main = "Before Box Cox: PF")
nba_games_dataset$PF<-BoxCox(nba_games_dataset$PF,lambda = "auto")
hist(nba_games_dataset$PF, main="After Box Cox: PF")

#Plotting the histogram of PTS before and after BOX COX transformation
hist(nba_games_dataset$PTS, main = "Before Box Cox: PTS")
nba_games_dataset$PTS<-BoxCox(nba_games_dataset$PTS,lambda = "auto")
hist(nba_games_dataset$PTS, main="After Box Cox: PTS")
#Plotting the histogram of OREB_To_DREB_RATIO before and after BOX COX transformation
hist(nba_games_dataset$OREB_To_DREB_RATIO, main = "Before Box Cox: OREB_To_DREB_RATIO")
nba_games_dataset$OREB_To_DREB_RATIO<-BoxCox(nba_games_dataset$OREB_To_DREB_RATIO,
                                                                       lambda = "auto")
hist(nba_games_dataset$OREB_To_DREB_RATIO,main="After Box Cox: OREB_To_DREB_RATIO")

From the comparison of histograms before and after Box-Cox transformation, it can be seen that the skewness of the numerical columns have decreased across most of the numerical columns.

11. Conclusion

The final “nba_games_dataset” is clean data set with no missing values with 242339 observations and 34 variables. Almost all of the outliers were removed and replaced by the capping method. Finally, transformation was applied to reduce the skewness, of all numerical variables which were not normally distributed. In short, the data was cleaned in order to ensure the data set does not have any discrepancies relating to the NBA games 2019-2020 data set.

References

  1. En.wikipedia.org. 2020. Rules Of Basketball. [online] Available at: https://en.wikipedia.org/wiki/Rules_of_basketball [Accessed 14 October 2020].

  2. Kaggle.com. n.d. NBA Games Data. [online] Available at: https://www.kaggle.com/nathanlauga/nba-games?select=games.csv [Accessed 13 October 2020].

  3. Kaggle.com. n.d. NBA Games Data. [online] Available at: https://www.kaggle.com/nathanlauga/nba-games?select=games_details.csv [Accessed 13 October 2020].

  4. Ladd, T., 2020. NBA Official Basketball Rules And Regulations For Beginners. [online] Sportsierra. Available at: https://sportsierra.com/nba-official-basketball-rules-and-regulations/#:~:text=%20NBA%20basketball%20rules%20%201%20Regulation%20NBA,4%20Fighting%20and%20flagrant%20fouls.%20%20More%20 [Accessed 14 October 2020].