Data Wrangling Assessment Task 3: Dataset challenge

Setup

Section 2 - Required packages

Load Required Packages

The following packages were loaded for this assignment:

### Load Packages

library(readr)
library(readxl)
library(foreign)
library(gdata)
library(xml2)
library(rvest)
library(dplyr)
library(tidyr)
library(deductive)
library(deducorrect)
library(editrules)
library(validate)
library(Hmisc)
library(forecast)
library(stringr)
library(lubridate)
library(car)
library(outliers)
library(MVN)
library(infotheo)
library(MASS)
library(caret)
library(mlr)
library(ggplot2)
library(knitr)
library(magrittr)  
library(printr)
library(htmlwidgets)
library(plotly)

Section 3 - Data

A clear description of data sets, their sources, and variable descriptions should be provided. In this section, you must also provide the R codes with outputs (head of data sets) that you used to import/read/scrape the data set. You need to fulfill steps #1-2 and merge at least two data sets to create the one you are going to work on. In addition to the R codes and outputs, you need to explain the steps that you have taken.

Part 1 - Locate Data

Two data sets were downloaded from Kaggle in CSV format.

Data Set 1

Site Called - NBA Players - Biometric, biographic and basic box score features from 1996 to 2019 season - CSV named “all_seasons.csv”.

Site location - https://www.kaggle.com/justinas/nba-players-data

Description of Data Set: A listing of NBA players' game and physical statistics, with one record per player per season.

Data Points Set 1:

"" - This numeric row number field did not have a name/header - Not required, to be deleted
player_name - Name of the player statistics are relevant to
team_abbreviation - Team the player played for in the associated season
age - Age of the player in the associated season
player_height - Height of player
player_weight - Weight of player in associated season
college - College the player attended
country - the country the player was from
draft_year - The year in which the player was drafted
draft_round - The round of the draft the player was picked from
draft_number - The number in the draft the player was picked
gp - Games played in the season
pts - Average points scored per game
reb - Average rebounds made per game
ast - Average assists made per game
net_rating - Unknown data point to be deleted
oreb_pct - Unknown data point to be deleted
dreb_pct - Unknown data point to be deleted
usg_pct - Unknown data point to be deleted
ts_pct - Unknown data point to be deleted
ast_pct - Unknown data point to be deleted
season - The playing season in which the statistics were gathered

Note that a player can have multiple record rows as statistics were unique to the season.

Data Set 2

Site Called - NBA Salaries By Players of Season 2000 to 2019 - CSV named “NBA_Full_Salaries_2000-2019.csv”.

Site Location - https://www.kaggle.com/hrfang1995/nba-salaries-by-players-of-season-2000-to-2019

Description of Data Set: A listing of NBA players' annual salaries.

Data Points Set 2:

"" - This numeric row number field did not have a name/header - Not required, to be deleted
Name - Name of the player the annual salary is relevant to
Year - The year to which the salary applies
Salaries - The annual player salary
Rank - The rank at which the player salary sat when compared to other players

Note that a player can have multiple record rows as salaries are recorded per season.

Part 2 - Read/Import Data

Import the Data - Both CSV files were saved in the project directory so that they could be read directly with the read.csv function.

### Import and read CSV files obtained from open data sources - per references above.

Players <- read.csv("all_seasons.csv", header = TRUE, stringsAsFactors = FALSE, sep = ",")
Salary <- read.csv("NBA_Full_Salaries_2000-2019.csv", header = TRUE, stringsAsFactors = FALSE, sep = ",")

print(head(Players)) ### Check header

##   X       player_name team_abbreviation age player_height player_weight
## 1 0     Dennis Rodman               CHI  36        198.12      99.79024
## 2 1 Dwayne Schintzius               LAC  28        215.90     117.93392
## 3 2      Earl Cureton               TOR  39        205.74      95.25432
## 4 3       Ed O'Bannon               DAL  24        203.20     100.69742
## 5 4       Ed Pinckney               MIA  34        205.74     108.86208
## 6 5     Eddie Johnson               HOU  38        200.66      97.52228
##                       college country draft_year draft_round draft_number gp
## 1 Southeastern Oklahoma State     USA       1986           2           27 55
## 2                     Florida     USA       1990           1           24 15
## 3               Detroit Mercy     USA       1979           3           58  9
## 4                        UCLA     USA       1995           1            9 64
## 5                   Villanova     USA       1985           1           10 27
## 6                    Illinois     USA       1981           2           29 52
##   pts  reb ast net_rating oreb_pct dreb_pct usg_pct ts_pct ast_pct season
## 1 5.7 16.1 3.1       16.1    0.186    0.323   0.100  0.479   0.113   1996
## 2 2.3  1.5 0.3       12.3    0.078    0.151   0.175  0.430   0.048   1996
## 3 0.8  1.0 0.4       -2.1    0.105    0.102   0.103  0.376   0.148   1996
## 4 3.7  2.3 0.6       -8.7    0.060    0.149   0.167  0.399   0.077   1996
## 5 2.4  2.4 0.2      -11.2    0.109    0.179   0.127  0.611   0.040   1996
## 6 8.2  2.7 1.0        4.1    0.034    0.126   0.220  0.541   0.102   1996

print(head(Salary)) ### Check header

##   X             Name Year Salaries Rank
## 1 1 Shaquille O'Neal 2000 17142000    1
## 2 2    Kevin Garnett 2000 16806000    2
## 3 3  Alonzo Mourning 2000 15004000    3
## 4 4     Juwan Howard 2000 15000000    4
## 5 5   Scottie Pippen 2000 14795000    5
## 6 6      Karl Malone 2000 14000000    6

Actions for Section 3 - Part 2 - Read/Import Data

Named two new dataframes, “Players” and “Salary”
Used the read.csv function to import the data from each of the CSVs into the relevant dataframe
Used header = TRUE to assign the top row of data as the header
stringsAsFactors = FALSE keeps text columns as character vectors instead of converting them to factors; sep = “,” sets the comma as the field separator
fill = TRUE (the read.csv default, not set explicitly in the code above) pads rows that have fewer fields than the header with blank fields
Printed the first six rows of each dataframe using the head function to check the column headers
Reviewed the printed rows to check the data content
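
The effect of these arguments can be checked on a small in-memory CSV. The following is a minimal sketch only: the three-row sample and the name csv_text are invented for illustration and are not taken from either Kaggle file.

```r
# Hypothetical CSV text with an unnamed first column, mimicking the
# row-number column in the two source files (all values are made up).
csv_text <- ',Name,Year,Salaries
1,Player A,2000,1000000
2,Player B,2000,
3,Player C,2001,750000'

# text = lets read.csv parse a string instead of a file on disk.
demo <- read.csv(text = csv_text, header = TRUE,
                 stringsAsFactors = FALSE, sep = ",")

names(demo)        # the unnamed first column is given the name "X"
class(demo$Name)   # character, because stringsAsFactors = FALSE
demo$Salaries      # the empty salary field is read as NA
```

This mirrors the X column and the NA salaries seen in the printed output, although it does not prove how the blanks in the real file arose.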

Part 3 - Merge Data

### Need to change variable headers to allow for merge by "Name" - the player's name and "season" the year in which the salary and performance statistics were recorded. 

Players <- Players %>% 
  rename(
    Name = player_name)

Salary <- Salary %>% 
  rename(
    season = Year)

### Merge by both "Name" and "season".

NBA <- merge(Players, Salary, by = c("Name","season"))

print(head(NBA)) ### Capture merged dataframe content

##           Name season  X.x team_abbreviation age player_height player_weight
## 1   A.C. Green   2000 1948               MIA  37        205.74     102.05820
## 2 Aaron Brooks   2007 5025               HOU  23        182.88      73.02831
## 3 Aaron Brooks   2008 5504               HOU  24        182.88      73.02831
## 4 Aaron Brooks   2009 5804               HOU  25        182.88      73.02831
## 5 Aaron Brooks   2010 6491               PHX  26        182.88      73.02831
## 6 Aaron Brooks   2012 7158               HOU  28        182.88      73.02831
##        college country draft_year draft_round draft_number gp  pts reb ast
## 1 Oregon State     USA       1985           1           23 82  4.5 3.8 0.5
## 2       Oregon     USA       2007           1           26 51  5.2 1.1 1.7
## 3       Oregon     USA       2007           1           26 80 11.2 2.0 3.0
## 4       Oregon     USA       2007           1           26 82 19.6 2.6 5.3
## 5       Oregon     USA       2007           1           26 59 10.7 1.3 3.9
## 6       Oregon     USA       2007           1           26 53  7.1 1.5 2.2
##   net_rating oreb_pct dreb_pct usg_pct ts_pct ast_pct   X.y Salaries Rank
## 1        3.3    0.089    0.171   0.141  0.492   0.050   260       NA  260
## 2       -0.5    0.026    0.085   0.224  0.535   0.249 14032       NA  935
## 3        4.2    0.021    0.071   0.231  0.521   0.201 15903   972720  368
## 4       -0.7    0.021    0.065   0.258  0.549   0.253 17774  1045560  358
## 5       -6.5    0.017    0.053   0.257  0.489   0.289 19645  1118520  350
## 6      -10.7    0.014    0.077   0.181  0.555   0.190 23387       NA 1158

Actions for Section 3 - Part 3 - Merge Data

Use the rename function and state the variables which need to be renamed so that both dataframes share the key names needed for the merge
Check the columns to ensure the name changes align.
Use the merge function with “Name” and “season” as the common variables. By default merge keeps only the rows whose Name and season appear in both dataframes
Printed the new NBA dataframe to review the data content
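
Because the default merge is an inner join, player-seasons without a matching salary row are dropped silently. The sketch below, on invented toy data (the names players, salary, inner and left are hypothetical), shows one way to see how many rows fail to match by comparing against a left join.

```r
# Toy stand-ins for the Players and Salary dataframes (values invented).
players <- data.frame(Name   = c("Player A", "Player A", "Player B"),
                      season = c(2000, 2001, 2000),
                      pts    = c(10.2, 12.5, 4.1),
                      stringsAsFactors = FALSE)
salary  <- data.frame(Name     = c("Player A", "Player B", "Player C"),
                      season   = c(2000, 2000, 2000),
                      Salaries = c(1000000, 500000, 750000),
                      stringsAsFactors = FALSE)

# Default merge() is an inner join: only Name/season pairs present in
# BOTH dataframes are kept.
inner <- merge(players, salary, by = c("Name", "season"))

# all.x = TRUE keeps every player-season row and fills the missing
# salaries with NA, which makes the unmatched rows visible.
left <- merge(players, salary, by = c("Name", "season"), all.x = TRUE)

nrow(inner)                # matched rows only
sum(is.na(left$Salaries))  # player-seasons with no salary record
```

Matching on player name also assumes that names are spelled identically in both files, which is worth checking before relying on the merged row count.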

Section 4 - Understand

Summarise the types of variables and data structures, check the attributes in the data and apply proper data type conversions. In addition to the R codes and outputs, explain briefly the steps that you have taken. In this section, show that you have fulfilled steps #3-5.

Part 1 - Check data dimensions and types

### Check dimension and types

print(nrow(NBA))

## [1] 8777

print(ncol(NBA))

## [1] 25

print(dim(NBA))

## [1] 8777   25

print(str(NBA))

## 'data.frame':    8777 obs. of  25 variables:
##  $ Name             : chr  "A.C. Green" "Aaron Brooks" "Aaron Brooks" "Aaron Brooks" ...
##  $ season           : int  2000 2007 2008 2009 2010 2012 2013 2014 2015 2016 ...
##  $ X.x              : int  1948 5025 5504 5804 6491 7158 7731 8199 8749 9146 ...
##  $ team_abbreviation: chr  "MIA" "HOU" "HOU" "HOU" ...
##  $ age              : int  37 23 24 25 26 28 29 30 31 32 ...
##  $ player_height    : num  206 183 183 183 183 ...
##  $ player_weight    : num  102 73 73 73 73 ...
##  $ college          : chr  "Oregon State" "Oregon" "Oregon" "Oregon" ...
##  $ country          : chr  "USA" "USA" "USA" "USA" ...
##  $ draft_year       : chr  "1985" "2007" "2007" "2007" ...
##  $ draft_round      : chr  "1" "1" "1" "1" ...
##  $ draft_number     : chr  "23" "26" "26" "26" ...
##  $ gp               : int  82 51 80 82 59 53 72 82 69 65 ...
##  $ pts              : num  4.5 5.2 11.2 19.6 10.7 7.1 9 11.6 7.1 5 ...
##  $ reb              : num  3.8 1.1 2 2.6 1.3 1.5 1.9 2 1.5 1.1 ...
##  $ ast              : num  0.5 1.7 3 5.3 3.9 2.2 3.2 3.2 2.6 1.9 ...
##  $ net_rating       : num  3.3 -0.5 4.2 -0.7 -6.5 -10.7 -2.5 5.2 -1.4 -3 ...
##  $ oreb_pct         : num  0.089 0.026 0.021 0.021 0.017 0.014 0.031 0.019 0.02 0.022 ...
##  $ dreb_pct         : num  0.171 0.085 0.071 0.065 0.053 0.077 0.069 0.078 0.078 0.064 ...
##  $ usg_pct          : num  0.141 0.224 0.231 0.258 0.257 0.181 0.205 0.252 0.231 0.191 ...
##  $ ts_pct           : num  0.492 0.535 0.521 0.549 0.489 0.555 0.518 0.534 0.494 0.507 ...
##  $ ast_pct          : num  0.05 0.249 0.201 0.253 0.289 0.19 0.238 0.245 0.265 0.216 ...
##  $ X.y              : int  260 14032 15903 17774 19645 23387 25258 27129 29000 30871 ...
##  $ Salaries         : int  NA NA 972720 1045560 1118520 NA 5750000 1027424 915243 2250000 ...
##  $ Rank             : int  260 935 368 358 350 1158 124 359 406 271 ...
## NULL

Actions for Section 4 - Part 1 - Dimensional and type check

Checked the number of rows using the nrow function
Checked the number of columns using the ncol function
Checked the overall dimensions of the dataframe using the dim function (shows both rows and columns)
Used the str function to show the structure of the dataframe and review the variable types. str prints its result directly and returns nothing, so wrapping it in print adds the trailing NULL line to the output.
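
The str output lists draft_year, draft_round and draft_number as character even though the values shown are numbers. A possible way to find out why is sketched below, assuming the columns contain some non-numeric entries; the value "Undrafted" is an assumed example and the vector is invented.

```r
# Hypothetical values; "Undrafted" is an assumed non-numeric entry.
draft_year <- c("1985", "2007", "Undrafted", "1992")

# as.integer() returns NA (with a warning) for text that is not a number.
draft_year_num <- suppressWarnings(as.integer(draft_year))

# Entries that were not NA before conversion but are NA afterwards are
# the ones that prevent the column from being read as numeric.
draft_year[is.na(draft_year_num) & !is.na(draft_year)]
```

Running the same check on the real columns would show whether a numeric conversion is safe or whether those entries need to be handled first.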

Part 2 - Convert, rename and order variables

NBA <- NBA %>% 
  mutate(Yearnum = as.numeric(paste(season)))

### Change variable types - Step 4

NBA$season <- as.Date(as.character(NBA$season), format = "%Y")

NBA$team_abbreviation <- as.factor(NBA$team_abbreviation)
print(levels(NBA$team_abbreviation)) ## Check levels

##  [1] "ATL" "BKN" "BOS" "CHA" "CHH" "CHI" "CLE" "DAL" "DEN" "DET" "GSW" "HOU"
## [13] "IND" "LAC" "LAL" "MEM" "MIA" "MIL" "MIN" "NJN" "NOH" "NOK" "NOP" "NYK"
## [25] "OKC" "ORL" "PHI" "PHX" "POR" "SAC" "SAS" "SEA" "TOR" "UTA" "VAN" "WAS"

NBA$country <- as.factor(NBA$country)
NBA$college <- as.factor(NBA$college)
str(NBA) ## Check

## 'data.frame':    8777 obs. of  26 variables:
##  $ Name             : chr  "A.C. Green" "Aaron Brooks" "Aaron Brooks" "Aaron Brooks" ...
##  $ season           : Date, format: "2000-02-23" "2007-02-23" ...
##  $ X.x              : int  1948 5025 5504 5804 6491 7158 7731 8199 8749 9146 ...
##  $ team_abbreviation: Factor w/ 36 levels "ATL","BKN","BOS",..: 17 12 12 12 28 12 9 6 6 13 ...
##  $ age              : int  37 23 24 25 26 28 29 30 31 32 ...
##  $ player_height    : num  206 183 183 183 183 ...
##  $ player_weight    : num  102 73 73 73 73 ...
##  $ college          : Factor w/ 273 levels " ","Alabama",..: 174 173 173 173 173 173 173 173 173 173 ...
##  $ country          : Factor w/ 72 levels "Argentina","Australia",..: 69 69 69 69 69 69 69 69 69 69 ...
##  $ draft_year       : chr  "1985" "2007" "2007" "2007" ...
##  $ draft_round      : chr  "1" "1" "1" "1" ...
##  $ draft_number     : chr  "23" "26" "26" "26" ...
##  $ gp               : int  82 51 80 82 59 53 72 82 69 65 ...
##  $ pts              : num  4.5 5.2 11.2 19.6 10.7 7.1 9 11.6 7.1 5 ...
##  $ reb              : num  3.8 1.1 2 2.6 1.3 1.5 1.9 2 1.5 1.1 ...
##  $ ast              : num  0.5 1.7 3 5.3 3.9 2.2 3.2 3.2 2.6 1.9 ...
##  $ net_rating       : num  3.3 -0.5 4.2 -0.7 -6.5 -10.7 -2.5 5.2 -1.4 -3 ...
##  $ oreb_pct         : num  0.089 0.026 0.021 0.021 0.017 0.014 0.031 0.019 0.02 0.022 ...
##  $ dreb_pct         : num  0.171 0.085 0.071 0.065 0.053 0.077 0.069 0.078 0.078 0.064 ...
##  $ usg_pct          : num  0.141 0.224 0.231 0.258 0.257 0.181 0.205 0.252 0.231 0.191 ...
##  $ ts_pct           : num  0.492 0.535 0.521 0.549 0.489 0.555 0.518 0.534 0.494 0.507 ...
##  $ ast_pct          : num  0.05 0.249 0.201 0.253 0.289 0.19 0.238 0.245 0.265 0.216 ...
##  $ X.y              : int  260 14032 15903 17774 19645 23387 25258 27129 29000 30871 ...
##  $ Salaries         : int  NA NA 972720 1045560 1118520 NA 5750000 1027424 915243 2250000 ...
##  $ Rank             : int  260 935 368 358 350 1158 124 359 406 271 ...
##  $ Yearnum          : num  2000 2007 2008 2009 2010 ...

### Rename variables - Step 5a

NBA <- NBA %>% 
  rename(
    Team = team_abbreviation)

print(head(NBA))

##           Name     season  X.x Team age player_height player_weight
## 1   A.C. Green 2000-02-23 1948  MIA  37        205.74     102.05820
## 2 Aaron Brooks 2007-02-23 5025  HOU  23        182.88      73.02831
## 3 Aaron Brooks 2008-02-23 5504  HOU  24        182.88      73.02831
## 4 Aaron Brooks 2009-02-23 5804  HOU  25        182.88      73.02831
## 5 Aaron Brooks 2010-02-23 6491  PHX  26        182.88      73.02831
## 6 Aaron Brooks 2012-02-23 7158  HOU  28        182.88      73.02831
##        college country draft_year draft_round draft_number gp  pts reb ast
## 1 Oregon State     USA       1985           1           23 82  4.5 3.8 0.5
## 2       Oregon     USA       2007           1           26 51  5.2 1.1 1.7
## 3       Oregon     USA       2007           1           26 80 11.2 2.0 3.0
## 4       Oregon     USA       2007           1           26 82 19.6 2.6 5.3
## 5       Oregon     USA       2007           1           26 59 10.7 1.3 3.9
## 6       Oregon     USA       2007           1           26 53  7.1 1.5 2.2
##   net_rating oreb_pct dreb_pct usg_pct ts_pct ast_pct   X.y Salaries Rank
## 1        3.3    0.089    0.171   0.141  0.492   0.050   260       NA  260
## 2       -0.5    0.026    0.085   0.224  0.535   0.249 14032       NA  935
## 3        4.2    0.021    0.071   0.231  0.521   0.201 15903   972720  368
## 4       -0.7    0.021    0.065   0.258  0.549   0.253 17774  1045560  358
## 5       -6.5    0.017    0.053   0.257  0.489   0.289 19645  1118520  350
## 6      -10.7    0.014    0.077   0.181  0.555   0.190 23387       NA 1158
##   Yearnum
## 1    2000
## 2    2007
## 3    2008
## 4    2009
## 5    2010
## 6    2012

### Order variables - Step 5b

NBA <- NBA[
  order( NBA$Yearnum, NBA$Rank ),  # Yearnum is column 26, Rank is column 25
]

Actions for Section 4 - Part 2 - Convert, rename and order variables

Convert

Change a variable's type by referencing the variable in the dataframe and overwriting it with the result of the relevant conversion function.
Before the conversion, a numeric copy of season was stored in a new variable, Yearnum, using mutate.
The following variables were converted:
season - converted from integer to date (used as.Date with format = "%Y"). Because only the year is supplied, as.Date fills in the month and day from the date on which the code is run, which is why every value in the output ends in "-02-23"
team_abbreviation - converted from character to factor (used as.factor)
college - converted from character to factor (used as.factor)
country - converted from character to factor (used as.factor)
Following the conversion to factors it was possible to check the associated levels of each factor using the levels function.
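
The season dates in the output carry the month and day of the day the report was knitted. The sketch below, on an invented vector of years, contrasts that behaviour with a date built from a fixed month and day; the variable names are hypothetical.

```r
years <- c(2000, 2007, 2008)

# With only "%Y" supplied, the month and day are taken from the current
# date, so the result depends on when the code is run.
run_dependent <- as.Date(as.character(years), format = "%Y")

# Pasting a fixed month and day gives the same result on every run.
fixed <- as.Date(paste0(years, "-01-01"))

format(fixed, "%Y")  # the year can still be recovered as text
```

Using 1 January is an arbitrary choice made for the sketch; a date closer to the season start could be used instead.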

Rename

Use the rename function and state the variable which needs to be renamed - this was performed for the team_abbreviation variable, which became Team
Check the column header to ensure the name change has been applied.

Order

Order by year and then by salary rank so that each year can be scanned from the highest-paid player downwards
Use the order function and give the sort columns in priority order: Yearnum (column 26) first, then Rank (column 25)
Overwrite the NBA dataframe with the newly ordered data
Check the outcome when the dataframe is next printed.
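
The ordering step can be illustrated on a four-row toy dataframe. This is a sketch with invented values; only the column names Yearnum and Rank come from the report.

```r
# Toy data (values invented) with the two sort keys used above.
toy <- data.frame(Name    = c("P1", "P2", "P3", "P4"),
                  Yearnum = c(2001, 2000, 2000, 2001),
                  Rank    = c(2, 2, 1, 1),
                  stringsAsFactors = FALSE)

# order() returns the row positions that sort by Yearnum first,
# then by Rank within each year.
toy_sorted <- toy[order(toy$Yearnum, toy$Rank), ]

toy_sorted$Name
```

Referring to the columns by name rather than by position keeps the ordering correct even if columns are added or removed later.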

Section 5 - Tidy & Manipulate Data I

Check if the data conforms the tidy data principles. If your data is untidy, reshape your data into a tidy format (step #6). In addition to the R codes and outputs, explain everything that you do in this step.

Clean the data so that it conforms with the tidy data principles:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

### Conclusion: the data already conforms to the tidy data principles listed above. Some minor amendments to variable names and types can still be made, including removing unnecessary columns. 

#### Remove Unnecessary columns 

#### X.x #### Variable 3
#### net_rating #### Variable 17
#### oreb_pct #### Variable 18
#### dreb_pct #### Variable 19
#### usg_pct #### Variable 20
#### ts_pct #### Variable 21
#### ast_pct #### Variable 22
#### X.y #### Variable 23

NBA <- NBA[, c(1:2, 4:16,24:26)]
print(head(NBA)) ### Check header

##                  Name     season Team age player_height player_weight
## 7551 Shaquille O'Neal 2000-02-23  LAL  29        215.90     142.88148
## 4875    Kevin Garnett 2000-02-23  MIN  25        210.82      99.79024
## 248   Alonzo Mourning 2000-02-23  MIA  31        208.28     118.38751
## 4666     Juwan Howard 2000-02-23  DAL  28        205.74     113.39800
## 7418   Scottie Pippen 2000-02-23  POR  35        200.66     103.41898
## 4699      Karl Malone 2000-02-23  UTA  37        205.74     116.11955
##               college country draft_year draft_round draft_number gp  pts  reb
## 7551  Louisiana State     USA       1992           1            1 74 28.7 12.7
## 4875             None     USA       1995           1            5 81 22.0 11.4
## 248        Georgetown     USA       1992           1            2 13 13.6  7.8
## 4666         Michigan     USA       1994           1            5 81 18.0  7.1
## 7418 Central Arkansas     USA       1987           1            5 64 11.3  5.2
## 4699   Louisiana Tech     USA       1985           1           13 81 23.2  8.3
##      ast Salaries Rank Yearnum
## 7551 3.7 17142000    1    2000
## 4875 5.0 16806000    2    2000
## 248  0.9 15004000    3    2000
## 4666 2.8 15000000    4    2000
## 7418 4.6 14795000    5    2000
## 4699 4.5 14000000    6    2000

Actions for Section 5 - Part 1 - Clean data

Restructure the NBA dataframe by selecting, by position, only the columns that are to be retained from the original NBA dataframe
Check the column header to ensure the columns have been removed correctly.
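
Selecting by position works for the dataframe as it stands, but the positions shift if a column is added upstream. As an alternative sketch, the same columns can be dropped by name; the toy dataframe below holds invented values and only a subset of the real column names.

```r
# Toy dataframe with a subset of the merged column names (values invented).
toy <- data.frame(Name = "Player A", season = 2000, X.x = 1,
                  pts = 10.2, net_rating = 3.3, X.y = 260,
                  Salaries = 1000000, stringsAsFactors = FALSE)

# Columns to drop, listed by name instead of by position.
drop_cols <- c("X.x", "net_rating", "oreb_pct", "dreb_pct",
               "usg_pct", "ts_pct", "ast_pct", "X.y")

# Keep every column whose name is not in drop_cols; names that are
# absent from the dataframe are simply ignored.
toy_clean <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

names(toy_clean)
```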

Section 6 - Tidy & Manipulate Data II

Create/mutate at least one variable from the existing variables (step #7). In addition to the R codes and outputs, explain everything that you do in this step.

Part 1 - Create new variables

For this example, 3 new variables have been created:
1. Average earnings per game
2. Total annual points
3. Earnings per point

### Create Variables

### Average Earnings Per Game 

NBA <- NBA %>% 
  mutate(pay.per.game = as.numeric(Salaries/gp))

### Total Points Annually

NBA <- NBA %>% 
  mutate(annual.points = as.numeric(gp*pts))

### Earnings Per Point

NBA <- NBA %>% 
  mutate(pay.per.point = as.numeric(Salaries/annual.points))

print(head(NBA))

##                  Name     season Team age player_height player_weight
## 7551 Shaquille O'Neal 2000-02-23  LAL  29        215.90     142.88148
## 4875    Kevin Garnett 2000-02-23  MIN  25        210.82      99.79024
## 248   Alonzo Mourning 2000-02-23  MIA  31        208.28     118.38751
## 4666     Juwan Howard 2000-02-23  DAL  28        205.74     113.39800
## 7418   Scottie Pippen 2000-02-23  POR  35        200.66     103.41898
## 4699      Karl Malone 2000-02-23  UTA  37        205.74     116.11955
##               college country draft_year draft_round draft_number gp  pts  reb
## 7551  Louisiana State     USA       1992           1            1 74 28.7 12.7
## 4875             None     USA       1995           1            5 81 22.0 11.4
## 248        Georgetown     USA       1992           1            2 13 13.6  7.8
## 4666         Michigan     USA       1994           1            5 81 18.0  7.1
## 7418 Central Arkansas     USA       1987           1            5 64 11.3  5.2
## 4699   Louisiana Tech     USA       1985           1           13 81 23.2  8.3
##      ast Salaries Rank Yearnum pay.per.game annual.points pay.per.point
## 7551 3.7 17142000    1    2000     231648.6        2123.8      8071.381
## 4875 5.0 16806000    2    2000     207481.5        1782.0      9430.976
## 248  0.9 15004000    3    2000    1154153.8         176.8     84864.253
## 4666 2.8 15000000    4    2000     185185.2        1458.0     10288.066
## 7418 4.6 14795000    5    2000     231171.9         723.2     20457.688
## 4699 4.5 14000000    6    2000     172839.5        1879.2      7449.979

Actions for Section 6 - Part 1 - Create new variables

Use the pipe operator (%>%) to pass the NBA dataframe to the mutate function, which adds a new column based on the calculation specified.
Name each new column in the mutate call: pay.per.game, annual.points and pay.per.point.
Print the NBA dataframe to check that mutate added the new columns with the expected values.
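
One edge case worth keeping in mind for the earnings-per-point calculation is a season with zero points, since R returns Inf for division by zero instead of raising an error. The sketch below uses base R on invented values and assumes such seasons could occur; it does not claim that they are present in the merged data.

```r
# Toy data (values invented); the second player scored no points.
toy <- data.frame(Salaries = c(1000000, 500000),
                  gp       = c(50, 10),
                  pts      = c(8.0, 0.0))

toy$pay.per.game  <- toy$Salaries / toy$gp
toy$annual.points <- toy$gp * toy$pts

# Dividing by zero gives Inf in R rather than an error, so a zero-point
# season is set to NA explicitly.
toy$pay.per.point <- ifelse(toy$annual.points > 0,
                            toy$Salaries / toy$annual.points,
                            NA_real_)
toy
```

Setting the value to NA is one possible treatment; the row would then be picked up by the missing-value scan in the next section.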

Section 7 - Scan I

Part 1 Scan and Omit

Scan the data for missing values, inconsistencies and obvious errors. In this step, you should fulfill the step #8. In addition to the R codes and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.

Options:
1. Remove the rows where salary information is not provided.
2. Use the mean of other salary data to fill in years where salary information is not provided.

Decision - Remove all rows where salary data is not provided. Based on the scan conducted and a review of the data, it was concluded that where salary values were blank it was not reasonable to substitute the mean of the player's other salaries, because the number of seasons recorded for each player varied. It was therefore best to omit the rows where the player salary was not provided.

### Check for "NA" data 

print(colSums(is.na(NBA)))

##          Name        season          Team           age player_height 
##             0             0             0             0             0 
## player_weight       college       country    draft_year   draft_round 
##             0             0             0             0             0 
##  draft_number            gp           pts           reb           ast 
##             0             0             0             0             0 
##      Salaries          Rank       Yearnum  pay.per.game annual.points 
##          1800             0             0          1800             0 
## pay.per.point 
##          1800

### Remove blanks

NBA <- na.omit(NBA)
dim(NBA) ## Dataframe has now reduced in size

## [1] 6977   21

print(colSums(is.na(NBA)))

##          Name        season          Team           age player_height 
##             0             0             0             0             0 
## player_weight       college       country    draft_year   draft_round 
##             0             0             0             0             0 
##  draft_number            gp           pts           reb           ast 
##             0             0             0             0             0 
##      Salaries          Rank       Yearnum  pay.per.game annual.points 
##             0             0             0             0             0 
## pay.per.point 
##             0

Actions for Section 7 - Part 1 - Scan and omit

Use colSums(is.na(NBA)) to count the number of NA data points in each column of the data set
Use na.omit across the full NBA dataframe to remove the rows where NA data points were present
This reduced the dataframe by 1800 rows (from 8777 to 6977); the number of columns is unchanged at 21.
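
na.omit removes a row when any column is NA. In this data set only the salary-derived columns contain NA, so the result is the same as filtering on Salaries alone, but the two approaches differ in general. The sketch below shows the difference on invented values.

```r
# Toy data (values invented) with NA in two different columns.
toy <- data.frame(Name     = c("P1", "P2", "P3"),
                  college  = c("Oregon", NA, "UCLA"),
                  Salaries = c(1000000, 500000, NA),
                  stringsAsFactors = FALSE)

# na.omit() drops a row if ANY column is NA.
all_cols <- na.omit(toy)

# Filtering on one column drops only the rows with a missing salary.
salary_only <- toy[!is.na(toy$Salaries), ]

nrow(all_cols)     # P2 and P3 are both removed
nrow(salary_only)  # only P3 is removed
```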

Section 8 - Scan II

Part 1 - Identify outliers

Scan the numeric data for outliers. In this step, you should fulfill the step #9. In addition to the R codes and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.

### Boxplot 1 - Salary comparison by team

Teambox <- ggplot(data = NBA,aes(Team, Salaries, fill=factor(Team)))+
  geom_boxplot( )+
  ggtitle("Salary Boxplot by Team") +
  theme(plot.title = element_text(hjust = 0.5,face = "bold", colour="Black", size = (16)))+
  theme(axis.text.x = element_text(angle = 45, hjust = 0.5, size = 6, vjust = 0.5))+
  labs(x='Team', y='Salary ($)')+
  theme(legend.text=element_text(size=6),
        legend.key.size = unit(0.5, 'cm'), #change legend key size
        legend.key.height = unit(0.5, 'cm'), #change legend key height
        legend.key.width = unit(0.5, 'cm'))+
  labs(fill='Team')

ggplotly(Teambox)

### Boxplot 2 - Salary comparison by Season

Seasonbox <- ggplot(data = NBA,aes(Yearnum, Salaries, fill=factor(Yearnum)))+
  geom_boxplot( )+
  ggtitle("Salary Boxplot by Year") +
  theme(plot.title = element_text(hjust = 0.5,face = "bold", colour="Black", size = (16)))+
  theme(axis.text.x = element_text(angle = 45, hjust = 0.5, size = 6, vjust = 0.5))+
  labs(x='Year', y='Salary ($)')+
  theme(legend.text=element_text(size=6),
        legend.key.size = unit(0.5, 'cm'), #change legend key size
        legend.key.height = unit(0.5, 'cm'), #change legend key height
        legend.key.width = unit(0.5, 'cm'))+
  labs(fill='Year')

ggplotly(Seasonbox)

Actions for Section 8 - Part 1 - Scan for outliers

Run a ggplot boxplot of player salary by team, using the fill colour to distinguish the teams
Use ggplot theme settings to adjust the title, axis labels and legend
Add a plotly layer (ggplotly) to the ggplot so that hovering over a point shows a tooltip that identifies the outliers
Repeat for the salary by season boxplot

Outcome

High-value outliers can be observed for each team and each season. Given the salary profile of the NBA this is not a data quality concern: the top players typically draw much higher salaries than the average player, so these values are genuine and remain in the data.
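
The points drawn beyond the whiskers of a boxplot are conventionally those lying more than 1.5 times the interquartile range outside the quartiles. The sketch below applies that rule numerically to an invented salary vector, so that flagged values can be listed instead of read off a plot; all names and values are hypothetical.

```r
# Invented salary values with one very large contract.
salaries <- c(0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 2.5, 3.0, 25.0) * 1e6

# First and third quartiles, and the interquartile range.
q   <- unname(quantile(salaries, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences at 1.5 x IQR beyond the quartiles.
upper_fence <- q[2] + 1.5 * iqr
lower_fence <- q[1] - 1.5 * iqr

flagged <- salaries[salaries > upper_fence | salaries < lower_fence]
flagged
```

On the real data this could be run within each team or season to match the grouping used in the boxplots.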

Section 8 - Additional work - Scatterplot and correlation review

#### Scatter plot of salary against total points scored to identify whether there is a correlation

SalaryScatter <- ggplot(NBA, aes(Salaries, annual.points)) +
    geom_point(shape=1) +    # Use hollow circles
    geom_smooth(method=lm,   # Add linear regression line
                se=TRUE)     # Add shaded confidence region

print(SalaryScatter)

## `geom_smooth()` using formula 'y ~ x'

#### Check the correlation

CorrSP <- cor.test(NBA$Salaries, NBA$annual.points,  method="pearson")

print(CorrSP)

## 
##  Pearson's product-moment correlation
## 
## data:  NBA$Salaries and NBA$annual.points
## t = 37.394, df = 6975, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3889186 0.4280156
## sample estimates:
##       cor 
## 0.4086546

Correlation summary

It can be concluded that there is a moderate positive correlation (r = 0.41, 95% confidence interval 0.39 to 0.43, p < 2.2e-16) between total annual points and player salary. This makes sense: points are not the only measure of a player's ability and scoring depends on the role the player has, so points scored are a relevant but not a dominant consideration in salary assignment.

Section 8 - Additional actions taken

Run a ggplot scatter plot of Salaries against annual.points with a linear regression line drawn over the top
Run a test for correlation using cor.test to measure the strength and direction of the linear association between the two variables.
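
Pearson's correlation measures linear association and can be influenced by extreme values such as very high salaries. As a sketch of a possible cross-check, the code below runs cor.test with both the Pearson and the rank-based Spearman method on synthetic data; the data is simulated and does not reproduce the NBA result.

```r
set.seed(1)  # synthetic data, for illustration only

x <- rlnorm(200)            # right-skewed values, loosely like salaries
y <- 0.5 * x + rnorm(200)   # noisy linear relationship with x

pearson  <- cor.test(x, y, method = "pearson")
spearman <- cor.test(x, y, method = "spearman")

pearson$estimate    # linear association
spearman$estimate   # rank-based association, less sensitive to extreme values
```

If the two estimates differ widely on the real data, that would suggest that a small number of extreme observations is driving the Pearson figure.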

Section 9 - Transform

Part 1 - Review

Apply an appropriate transformation for at least one of the variables. In addition to the R codes and outputs, explain everything that you do in this step. In this step, you should fulfill the step #10.

Histo <- hist(NBA$Salaries, 
     main = "Histogram of Salary figures", 
     xlab = "Salary")

print(Histo)

## $breaks
##  [1] 0.0e+00 2.0e+06 4.0e+06 6.0e+06 8.0e+06 1.0e+07 1.2e+07 1.4e+07 1.6e+07
## [10] 1.8e+07 2.0e+07 2.2e+07 2.4e+07 2.6e+07 2.8e+07 3.0e+07 3.2e+07 3.4e+07
## [19] 3.6e+07 3.8e+07
## 
## $counts
##  [1] 2591 1508  900  532  346  295  259  183  116   88   48   45   26   17    8
## [16]    9    1    4    1
## 
## $density
##  [1] 1.856815e-07 1.080694e-07 6.449764e-08 3.812527e-08 2.479576e-08
##  [6] 2.114089e-08 1.856099e-08 1.311452e-08 8.313029e-09 6.306435e-09
## [11] 3.439874e-09 3.224882e-09 1.863265e-09 1.218289e-09 5.733123e-10
## [16] 6.449764e-10 7.166404e-11 2.866562e-10 7.166404e-11
## 
## $mids
##  [1] 1.0e+06 3.0e+06 5.0e+06 7.0e+06 9.0e+06 1.1e+07 1.3e+07 1.5e+07 1.7e+07
## [10] 1.9e+07 2.1e+07 2.3e+07 2.5e+07 2.7e+07 2.9e+07 3.1e+07 3.3e+07 3.5e+07
## [19] 3.7e+07
## 
## $xname
## [1] "NBA$Salaries"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

Review

The data is right skewed (positively skewed): most observations fall in the lowest salary bins and a long tail extends towards the highest salaries, showing that many players were paid comparatively little. Because the data runs across multiple seasons this makes sense, as salaries were lower in earlier years. To reduce the skew, a logarithmic transformation is recommended.

Actions for Section 9 - Part 1 - Review

Run a histogram of the frequency of observations in each salary group.
Print the histogram object to view the breaks and counts - ggplot could also be used here to add features, but it is not required on this occasion.
Assess the data to identify the distribution and skew.
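
The direction of the skew can be cross-checked from the histogram output itself by comparing an approximate mean with the bin that holds the median. The sketch below copies the counts and bin midpoints from the printed hist() object; the variable names are hypothetical and the mean is only an approximation because it is based on binned data.

```r
# Bin counts and midpoints copied from the hist() output above.
counts <- c(2591, 1508, 900, 532, 346, 295, 259, 183, 116, 88,
            48, 45, 26, 17, 8, 9, 1, 4, 1)
mids   <- seq(1e6, 37e6, by = 2e6)

# Approximate mean from the binned data.
binned_mean <- sum(counts * mids) / sum(counts)

# Midpoint of the bin that contains the median observation.
median_bin_mid <- mids[which(cumsum(counts) >= sum(counts) / 2)[1]]

# A mean above the median indicates a longer right-hand tail.
binned_mean > median_bin_mid
```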

Part 2 - Transform

### Perform a logarithmic transformation 

log_NBAsal <- log10(NBA$Salaries)

LogHisto <- hist(log_NBAsal, 
     main = "Histogram of base 10 log salary", 
     xlab = "Base 10 log of Salary")

print(LogHisto)

## $breaks
##  [1] 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6
## 
## $counts
##  [1]    3    9   17   59   21   30  130  310  718  889  898  942 1067  784  719
## [16]  331   50
## 
## $density
##  [1] 0.002149921 0.006449764 0.012182887 0.042281783 0.015049448 0.021499212
##  [7] 0.093163251 0.222158521 0.514547800 0.637093307 0.643543070 0.675075247
## [13] 0.764655296 0.561846066 0.515264440 0.237207969 0.035832019
## 
## $mids
##  [1] 4.3 4.5 4.7 4.9 5.1 5.3 5.5 5.7 5.9 6.1 6.3 6.5 6.7 6.9 7.1 7.3 7.5
## 
## $xname
## [1] "log_NBAsal"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

Actions for Section 9 - Part 2 - Transform

Create a new vector, log_NBAsal, by applying the log10 function to NBA$Salaries to reduce the skew
Run a histogram of the frequency of observations in each log salary group.
Print the histogram object to view the breaks and counts - ggplot could also be used here to add features, but it is not required on this occasion.
Assess the transformed data to identify the distribution and skew.
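
The effect of the transformation can also be summarised with a single skewness figure before and after taking logs. The sketch below does this on synthetic right-skewed data; the skewness helper is a hand-written, moment-based formula defined here for illustration, and the simulated values are not the NBA salaries.

```r
set.seed(42)  # synthetic, right-skewed values; not the NBA data

x <- rlnorm(1000, meanlog = 14, sdlog = 1)

# Moment-based sample skewness, written out to avoid an extra package.
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(x)         # strongly positive for right-skewed data
skew_log <- skewness(log10(x))  # expected to be much closer to zero

c(raw = skew_raw, log10 = skew_log)
```

Applying the same helper to NBA$Salaries and log_NBAsal would give a numeric measure to set alongside the two histograms.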