library(readr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(outliers)
library(magrittr)
library(lubridate)
library(eeptools)
library(forecast)
The report provides wrangling steps used for the NBA player datasets. The files are imported to RStudio and then merged based on the player name. Later the variables are checked and the data types are correctly set. The height variable had improper values that are rectified and saved in inches unit.
A new variable age is created based on the birth date of player. The data is then scanned for missing values and outliers and the Salary variable is transformed to normal distribution from left skewed distribution.
The R-code and outputs are present for each steps performed.
The datasets are two comma-separated files for NBA players. The Player.csv has players data and NBA_season1718_salary.csv contains the salary of players for year 2017-2018. The two files were taken from kaggle.com and are merged based on the Players name.
The following variables are present in the files;
Player
** Name - Name of the Player
** Year_Start - The year player started playing
** Year_End - The year player stopped playing (mostly has values 2017 and 2018)
** Position - The position at which the player plays
** Height - Height of the player in feet inches
** Weight - Weight of the player in pounds
** Birth_date - Birth date of the player
** College - College the player went to
NBA_season1718_salary
** X1 - Index
** Player - Name of the Player
** Tm - Team for which the player plays
** seson1718 - Salary of the player
player <- read_csv("~/Chaitali/RMIT/Semester 2/Data Preprocessing/Assignment 3/Player.csv")
head(player)
salary <- read_csv("~/Chaitali/RMIT/Semester 2/Data Preprocessing/Assignment 3/NBA_season1718_salary.csv")
## Warning: Missing column names filled in: 'X1' [1]
head(salary)
salary <- salary[-1]
player_salary <- player %>% select(name, year_start, year_end, position:college)
player_salary <- player_salary %>% left_join(salary, by = c("name" = "Player"))
head(player_salary)
The following changes are done to the data; * The Birthdate is converted from character to Date format. * The year_start and year_end are converted to factor from integer. * The position variable is also changed to factor and labelled. * The colnames of last two columns are changed to Team and Salary * The Team variable is converted to factor. * The height column is character and the values are stored are not proper which can be changed in tidy step.
player_salary$birth_date <- as.Date(player_salary$birth_date, format = '%B %d, %Y', origin = "1970-01-01")
player_salary$year_start <- as.factor(player_salary$year_start)
player_salary$year_end <- as.factor(player_salary$year_end)
player_salary$position <- factor(player_salary$position, levels = c("C", "C-F", "F", "F-C", "F-G", "G", "G-F"), labels = c("Centre", "Centre Forward", "Forward", "Centre Forward", "Guard Forward", "Guard", "Guard Forward"))
colnames(player_salary)[colnames(player_salary) == "Tm"] <- "Team"
colnames(player_salary)[colnames(player_salary) == "seson17_18"] <- "Salary"
player_salary$Team <- as.factor(player_salary$Team)
head(player_salary)
As seen the height variable is not in tidy format the inches are seen as months, so feet-month has to be changed to inches.
player_salary <- player_salary %>% separate(height, into = c("feet", "inches"), sep = "-")
player_salary$inches <- factor(player_salary$inches, levels = c("0", "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), labels = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
player_salary$inches <- as.integer(player_salary$inches)
player_salary$feet <- as.integer(player_salary$feet)
player_salary <- mutate(player_salary, height = (feet*12) + inches)
player_salary <- player_salary %>% select(-(feet:inches))
head(player_salary)
Lets create age variable from Birthdate.
player_salary <- mutate(player_salary, Age = age_calc(birth_date, units = 'years'))
player_salary <- player_salary %>% select(-(birth_date))
head(player_salary)
colSums(is.na(player_salary))
## name year_start year_end position weight college
## 0 0 0 0 0 100
## Team Salary height Age
## 0 0 0 0
The missing values are found for College variable but that cannot be rectified. Lets us check for inconsistencies in height, weight, age and Salary.
summary(player_salary)
## name year_start year_end position
## Length:558 2017 : 83 2018 :472 Centre : 54
## Class :character 2018 : 69 2017 : 54 Centre Forward: 75
## Mode :character 2016 : 57 2016 : 11 Forward :140
## 2015 : 47 2015 : 6 Guard Forward : 72
## 2014 : 43 2014 : 4 Guard :217
## 2013 : 34 2013 : 3
## (Other):225 (Other): 8
## weight college Team Salary
## Min. :150.0 Length:558 ATL : 27 Min. : 17224
## 1st Qu.:200.0 Class :character CHI : 24 1st Qu.: 1312611
## Median :219.5 Mode :character HOU : 22 Median : 2500000
## Mean :218.4 MIL : 22 Mean : 6040341
## 3rd Qu.:238.0 DAL : 21 3rd Qu.: 8402750
## Max. :307.0 NYK : 21 Max. :34682550
## (Other):421
## height Age
## Min. :70.00 Min. :19.60
## 1st Qu.:77.00 1st Qu.:24.24
## Median :80.00 Median :27.02
## Mean :79.98 Mean :28.04
## 3rd Qu.:83.00 3rd Qu.:30.74
## Max. :88.00 Max. :64.19
##
There are no missing values for any of the columns.
The height, weight, age and Salary variables are checked for outliers.
The height does not have any outliers. The weight, age and Salary have outliers.
par(mfrow = c(2,2))
player_salary$height %>% boxplot(main = "Boxplot of Height", ylab = "Height (in inches)")
player_salary$weight %>% boxplot(main = "Boxplot of Weight", ylab = "Weight (in pounds)")
player_salary$Age %>% boxplot(main = "Boxplot of Age", ylab = "Age (in years)")
player_salary$Salary %>% boxplot(main = "Boxplot of Salary", ylab = "Salary")
boxplot.stats(player_salary$weight)$out
## [1] 307
boxplot.stats(player_salary$Age)$out
## [1] 41.33973 60.15890 42.09589 64.19178 40.83836 51.74247 62.36712
## [8] 42.93425 59.29315 49.85357 40.62740 45.38356 40.70411 57.40548
boxplot.stats(player_salary$Salary)$out
## [1] 22471910 21461010 19578455 22471910 26243760 23112004 22434783
## [8] 23775506 25289390 19301070 28530608 19332500 34682550 23775506
## [15] 27739975 23775506 25000000 20559599 22642350 19508958 21974719
## [22] 29512900 28299399 29727900 20000000 25686667 27734405 23500000
## [29] 20061729 33285709 22642350 20566802 26153057 22642350 22642350
## [36] 28703704 23962573 31269231 21000000 23112004 24599495 24773250
## [43] 23000000 28530608 23775506
cap <- function(x){
quantiles <- quantile( x, c(.05, .95 ), na.rm = TRUE)
x[ x < quantiles[1] ] <- quantiles[1]
x[ x > quantiles[2] ] <- quantiles[2]
x
}
player_salary$weight <- player_salary$weight %>% cap()
player_salary <- player_salary[!player_salary$Age > 39,]
par(mfrow = c(1,2))
player_salary$weight %>% boxplot(main = "Boxplot of Weight", ylab = "Weight (in pounds)")
player_salary$Age %>% boxplot(main = "Boxplot of Age", ylab = "Age (in years)")
The weight has two outliers after the 95th percentile, but can be true as players can have that weight, but are capped to the nearest neighbours of 95th percentile.
The age has values greater than 41 years of age, are removed as the players that play are between the age of 19 to 39.
The salary variable is kept as it is, as the players can have higher salary and cannot be treated as outliers.
The Salary variable is left skewed, as there are very few players with very high salary. So to make the distribution normal, different transformation are applied.
log_salary <- log(player_salary$Salary)
log10_salary <- log10(player_salary$Salary)
sqrt_Salary <- sqrt(player_salary$Salary)
reci_Salary <- 1/player_salary$Salary
box_Salary <- BoxCox(player_salary$Salary, lambda = "auto")
par(mfrow = c(3,3))
player_salary$Salary %>% hist(main = "Original Salary")
log_salary %>% hist(main = "Log Salary")
log10_salary %>% hist(main = "Log to base 10 Salary")
sqrt_Salary %>% hist(main = "Squareroot Salary")
reci_Salary %>% hist(main = "Reciprocal Salary")
box_Salary %>% hist(main = "BoxCox transformed Salary")
The log, log10 and boxcox transformation makes the distribution to some what normal as compared to other transformations.