NBA 2017

Required packages

Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.

library(rvest)

## Loading required package: xml2

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(forecast)
library(outliers)
library(editrules)

## Loading required package: igraph

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union

## The following object is masked from 'package:tidyr':
## 
##     crossing

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

## 
## Attaching package: 'editrules'

## The following objects are masked from 'package:igraph':
## 
##     blocks, normalize

## The following object is masked from 'package:dplyr':
## 
##     contains

## The following object is masked from 'package:tidyr':
## 
##     separate

library(MVN)

## sROC 0.1-2 loaded

library(mlr)

## Loading required package: ParamHelpers

## 
## Attaching package: 'ParamHelpers'

## The following object is masked from 'package:editrules':
## 
##     isFeasible

library(knitr)
library(magrittr)

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:tidyr':
## 
##     extract

library(lattice)

Executive Summary

The purpose of this assignment is to investigate relationship between player salary and team valuation in the entire NBA league based on the data collected in 2017. Throughout the entire preprocessing datasets, which including tiding up datasets, scanning and dealing with missing values and outliers, and data transformation, the visualised graphs have drawn the conclusion that extremely high salary does not affect much on team valuations but average salary range does depend on team valuations in a certain level.

Data

Three seperated datasets have been imported:

all 30 nba team names with its abbreviations
nba 2017 team valuations, some team/club was valuated over billion dollars while some of them in millions
nba season 2017-2018 all players salary

# scrap HTML table data - nba team name and abbreviations from link below:
url <- "https://en.wikipedia.org/wiki/Wikipedia:WikiProject_National_Basketball_Association/National_Basketball_Association_team_abbreviations"
nba_team_name <- read_html(url)

# select the first element of the html_nodes
nba_team_name <- html_table(html_nodes(nba_team_name, "table")[[1]])

nba_team_name # it's a standard matching sheet of all 30 nba team names and its abbreivation

# import nba 2017 team valuations.csv dataset downloaded from:
# https://www.kaggle.com/noahgift/social-power-nba
nba_team_valuations <- read.csv("nba_2017_team_valuations.csv")

nba_team_valuations

# import nba season 2017-2018 salary file from:
# https://www.kaggle.com/noahgift/social-power-nba
nba_salary <- read.csv("NBA_season1718_salary.csv")

nba_salary

Understand

in order to keep consistency and make dataset merging easier, in this part, there are some data type and structure handling for all three datasets to get them ready for the next step

# for nba team name dataset, 
# rename col "X1" as "team", and col "X2" as "team_name" 
colnames(nba_team_name) <- c("team", "team_name")

# also remove first row, coz it won't be needed in this assignment
nba_team_name1 <- nba_team_name [-1, ]

nba_team_name1

# similarly, for team valuation dataset,
# rename col TEAM as "team_name", and change col UPPER case to lower case
colnames(nba_team_valuations) <- c("team_name", "value_millions")

nba_team_valuations

# for nba 2017-2018 season salary dataset, remove first column
nba_salary1 <- nba_salary [, -1]

# covert salary cell value to millions so that it becomes more readable
nba_salary1$season17_18 <- nba_salary1[, 3]/1000000 %>% round(3)

# rename salary column
colnames(nba_salary1) <- c("player", "team", "salary_millions")

nba_salary1

# check dataset types
str(nba_team_name1)

## 'data.frame':    30 obs. of  2 variables:
##  $ team     : chr  "ATL" "BKN" "BOS" "CHA" ...
##  $ team_name: chr  "Atlanta Hawks" "Brooklyn Nets" "Boston Celtics" "Charlotte Hornets" ...

str(nba_team_valuations)

## 'data.frame':    30 obs. of  2 variables:
##  $ team_name     : Factor w/ 30 levels "Atlanta Hawks",..: 20 14 10 5 2 13 3 11 7 16 ...
##  $ value_millions: num  3300 3000 2600 2500 2200 2000 1800 1650 1450 1350 ...

head(levels(nba_team_valuations$team_name))

## [1] "Atlanta Hawks"       "Boston Celtics"      "Brooklyn Nets"      
## [4] "Charlotte Hornets"   "Chicago Bulls"       "Cleveland Cavaliers"

str(nba_salary1)

## 'data.frame':    573 obs. of  3 variables:
##  $ player         : Factor w/ 535 levels "A.J. Hammons",..: 461 322 410 181 40 309 439 366 215 115 ...
##  $ team           : Factor w/ 30 levels "ATL","BKN","BOS",..: 10 6 8 3 9 28 21 15 11 28 ...
##  $ salary_millions: num  34.7 33.3 31.3 29.7 29.5 ...

head(levels(nba_salary1$player))

## [1] "A.J. Hammons" "Aaron Brooks" "Aaron Gordon" "Aaron Gray"  
## [5] "Abdel Nader"  "Al Horford"

head(levels(nba_salary1$team))

## [1] "ATL" "BKN" "BOS" "CHA" "CHI" "CLE"

Tidy & Manipulate Data I

in this part, all three datasets will be merged together by using xxx_join functions. The complete merged dataset includes 5 variables: team, team_name, value_millions, player and salary_millions

# full join team name and team valuations datasets, and assigned to nba_team_join1
nba_team_join1 <- nba_team_name1 %>% full_join(nba_team_valuations, by = "team_name")

## Warning: Column `team_name` joining character vector and factor, coercing
## into character vector

nba_team_join1

# full join nba_team_join1 and salary datasets
nba_team_join2 <- nba_team_join1 %>% full_join(nba_salary1, by = "team")

## Warning: Column `team` joining character vector and factor, coercing into
## character vector

# the merged datasets showcase players' salary associated with 
# their correspondent residing team/club valuations, named as nba_team_join2
nba_team_join2

str(nba_team_join2)

## 'data.frame':    573 obs. of  5 variables:
##  $ team           : chr  "ATL" "ATL" "ATL" "ATL" ...
##  $ team_name      : chr  "Atlanta Hawks" "Atlanta Hawks" "Atlanta Hawks" "Atlanta Hawks" ...
##  $ value_millions : num  885 885 885 885 885 885 885 885 885 885 ...
##  $ player         : Factor w/ 535 levels "A.J. Hammons",..: 293 119 371 210 340 158 130 369 425 334 ...
##  $ salary_millions: num  16.91 15.5 12.5 10.94 6.31 ...

head(levels(nba_team_join2$player))

## [1] "A.J. Hammons" "Aaron Brooks" "Aaron Gordon" "Aaron Gray"  
## [5] "Abdel Nader"  "Al Horford"

Tidy & Manipulate Data II

in order to investigate the relationship between salary and team valueations further, a new variable has to be created to see the ratio/percentage of a player’s salary to each of their home team valuations

# remove the team name column as it's not being used for the investigation
nba_team_sub1 <- nba_team_join2 %>% select(team, player, salary_millions, value_millions)

# use mutate fucntion to create a new column to record the ratio 
# between player's salary and their correspondent team valuations
# assgin to nba_team_sub2
nba_team_sub2 <- mutate(nba_team_sub1, salary_value_ratio = salary_millions/value_millions*100)

nba_team_sub2

Scan I

this part is to check if there is any missing values, special values and obvious errors in dataset nba_team_sub2.

# bring the dataset in
nba_team_sub2

# identify missing values
colSums(is.na(nba_team_sub2)) # as shown, there is no missing values

##               team             player    salary_millions 
##                  0                  0                  0 
##     value_millions salary_value_ratio 
##                  0                  0

# self-defined function to check special values for dataframe type
is.special <- function(x){
  if (is.numeric(x)) !is.finite(x) else is.na(x)
}

# identify special values for dataset nba_team_sub2
colSums(sapply(nba_team_sub2, is.special)) # as shown, there is no special values

##               team             player    salary_millions 
##                  0                  0                  0 
##     value_millions salary_value_ratio 
##                  0                  0

# identify obvious errors
# the ruels set out below is based on information through nba.com 
Rules <- editset(c("value_millions >= 500", "salary_millions <= 50", "salary_millions >= 0.001"))

nba_rules <- violatedEdits(Rules, nba_team_sub2)

colSums(nba_rules) # no obvious errors based on the rules

## num1 num2 num3 
##    0    0    0

As per result, there is no missing values, no speical values or obvious errors in this dataset

Scan II

this part is to identify outliers by the use of boxplot and scatter plot; capping outlier method is used to handle the outliers

# identify outliers
boxplot(nba_team_sub2$salary_value_ratio ~ nba_team_sub2$team, 
        main = "Salary to Team Valuations amongst all 30 NBA teams", 
        ylab = "Salary_Value_Ratio", 
        xlab = "Team", col = "pink")

nba_team_sub2 %>% plot(salary_millions ~ value_millions, 
                       data = ., 
                       ylab="Salary in millions", 
                       xlab="Value in millions", 
                       main="Salary by Value of in the entire league"
                       )

summary(nba_team_sub2$salary_millions)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.01722  1.31261  2.38686  5.85895  7.93651 34.68255

# cap outliers
cap <- function(x){
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}

nba_salary_capped <- nba_team_sub2$salary_millions %>% cap()
summary(nba_salary_capped)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.01722  1.31261  2.38686  5.77156  7.93651 22.64235

As shown in the boxplot, the mean of players’ salary to team valuations percentage is within 1%. There are some extremely high salary proporations almost in each team, looks like outliers. Also the range of salary to team valuation ratio from mininum to maxmium in each team is quite distinct and fluctuate.

As shown in the scatter plot, it gives more direct in dollar figure of the relationship between salary and team valuations. Most dots are around salary 0-10 million range throughout the entire team valuations axis, from 20 onwards, it becomes much less condensed.

The reason to use capping method is because, in reality, paying extremely high salary to a super star player is part of team strategies. Therefore, simply to remove the outliers or impute with mean doesn’t seem to make much sense. After capping the outliers, it relatively brings down the maximum salary payout in the league, sort of like setting up a standard for the league.

Transform

boxcox function is used for transforming

# data transformation
hist(nba_team_sub2$salary_value_ratio, main = "salary_value_ratio", col = "grey")

boxcox_nba_salary<- BoxCox(nba_team_sub2$salary_value_ratio,lambda = "auto")

hist(boxcox_nba_salary, main = "salary_value_ratio",col = "gold")

As shown above through histgram, the Box-Cox transformation is successful in transforming skewed distributions into a relatively symmetric distribution. It’s not perfect symmetric, but it actually largely transformed from initial heavily right-skewed distribution.

Conclusion

The entire investigation in this assignment has actually indicated several very interesting points:

teams that being valuated over 2.5 or 3 billion dollars does not neccessarily set a high average salary for their players. They do pay extremely well for their super stars, but the pay between super stars and other players are very substaintial.
teams that being valuated lower than 1 billion dollars, it seems like they are willing to spend more on their players’ salary than those high valuated teams would do. And the pay gap between most players are not very outstanding. On top of that, they have maintained to also have their super stars who gets paid really well.

Overall from above, the ratio between salary and team valuations does fluctuate depending on team strategies, especially for average salary range. On the other hand, extremely high salary does not seem to affect team valuations unless the team is being valuated much much lower than any others. Therefore it is to say that MVP players get paid well no matter which team they go. A recommendations would be instead of focusing on spending a lot on a MVP by shrinking average salary to other players, if a highly valuated team can well balance out its rewarding, managing, coaching, marketing and operations systems, it will go a long way.