All the installed packages were loaded for this assignment:
### Load Packages
library(readr)
library(readxl)
library(foreign)
library(gdata)
library(xml2)
library(rvest)
library(dplyr)
library(tidyr)
library(deductive)
library(deducorrect)
library(editrules)
library(validate)
library(Hmisc)
library(forecast)
library(stringr)
library(lubridate)
library(car)
library(outliers)
library(MVN)
library(infotheo)
library(MASS)
library(caret)
library(mlr)
library(ggplot2)
library(knitr)
library(magrittr)
library(printr)
library(htmlwidgets)
library(plotly)
A clear description of data sets, their sources, and variable descriptions should be provided. In this section, you must also provide the R codes with outputs (head of data sets) that you used to import/read/scrape the data set. You need to fulfill steps #1-2 and merge at least two data sets to create the one you are going to work on. In addition to the R codes and outputs, you need to explain the steps that you have taken.
Two data sets were downloaded from Kaggle in CSV format.
Data set 1 - NBA Players: biometric, biographic and basic box score features from the 1996 to 2019 seasons - CSV named “all_seasons.csv”.
Site location - https://www.kaggle.com/justinas/nba-players-data
Description of data set: A listing of all NBA players' game and physical statistics, recorded annually.
Note that a player can have multiple rows because statistics are recorded per season.
Data set 2 - NBA Salaries By Players of Season 2000 to 2019 - CSV named “NBA_Full_Salaries_2000-2019.csv”.
Site location - https://www.kaggle.com/hrfang1995/nba-salaries-by-players-of-season-2000-to-2019
Description of data set: A listing of NBA players' annual salaries.
Note that a player can have multiple rows because salaries are recorded per season.
Import the data - Both files were saved into the project directory as CSV so that they could be read directly with read.csv.
### Import and read CSV files obtained from open data sources - per references above.
Players <- read.csv("all_seasons.csv", header = TRUE, stringsAsFactors = FALSE, sep = ",")
Salary <- read.csv("NBA_Full_Salaries_2000-2019.csv", header = TRUE, stringsAsFactors = FALSE, sep = ",")
print(head(Players)) ### Check header
## X player_name team_abbreviation age player_height player_weight
## 1 0 Dennis Rodman CHI 36 198.12 99.79024
## 2 1 Dwayne Schintzius LAC 28 215.90 117.93392
## 3 2 Earl Cureton TOR 39 205.74 95.25432
## 4 3 Ed O'Bannon DAL 24 203.20 100.69742
## 5 4 Ed Pinckney MIA 34 205.74 108.86208
## 6 5 Eddie Johnson HOU 38 200.66 97.52228
## college country draft_year draft_round draft_number gp
## 1 Southeastern Oklahoma State USA 1986 2 27 55
## 2 Florida USA 1990 1 24 15
## 3 Detroit Mercy USA 1979 3 58 9
## 4 UCLA USA 1995 1 9 64
## 5 Villanova USA 1985 1 10 27
## 6 Illinois USA 1981 2 29 52
## pts reb ast net_rating oreb_pct dreb_pct usg_pct ts_pct ast_pct season
## 1 5.7 16.1 3.1 16.1 0.186 0.323 0.100 0.479 0.113 1996
## 2 2.3 1.5 0.3 12.3 0.078 0.151 0.175 0.430 0.048 1996
## 3 0.8 1.0 0.4 -2.1 0.105 0.102 0.103 0.376 0.148 1996
## 4 3.7 2.3 0.6 -8.7 0.060 0.149 0.167 0.399 0.077 1996
## 5 2.4 2.4 0.2 -11.2 0.109 0.179 0.127 0.611 0.040 1996
## 6 8.2 2.7 1.0 4.1 0.034 0.126 0.220 0.541 0.102 1996
print(head(Salary)) ### Check header
## X Name Year Salaries Rank
## 1 1 Shaquille O'Neal 2000 17142000 1
## 2 2 Kevin Garnett 2000 16806000 2
## 3 3 Alonzo Mourning 2000 15004000 3
## 4 4 Juwan Howard 2000 15000000 4
## 5 5 Scottie Pippen 2000 14795000 5
## 6 6 Karl Malone 2000 14000000 6
### Rename the key columns so the two data frames can be merged by "Name" (the player's name) and "season" (the year in which the salary and performance statistics were recorded).
Players <- Players %>%
rename(
Name = player_name)
Salary <- Salary %>%
rename(
season = Year)
### Merge by both "Name" and "season".
NBA <- merge(Players, Salary, by = c("Name","season"))
print(head(NBA)) ### Capture merged dataframe content
## Name season X.x team_abbreviation age player_height player_weight
## 1 A.C. Green 2000 1948 MIA 37 205.74 102.05820
## 2 Aaron Brooks 2007 5025 HOU 23 182.88 73.02831
## 3 Aaron Brooks 2008 5504 HOU 24 182.88 73.02831
## 4 Aaron Brooks 2009 5804 HOU 25 182.88 73.02831
## 5 Aaron Brooks 2010 6491 PHX 26 182.88 73.02831
## 6 Aaron Brooks 2012 7158 HOU 28 182.88 73.02831
## college country draft_year draft_round draft_number gp pts reb ast
## 1 Oregon State USA 1985 1 23 82 4.5 3.8 0.5
## 2 Oregon USA 2007 1 26 51 5.2 1.1 1.7
## 3 Oregon USA 2007 1 26 80 11.2 2.0 3.0
## 4 Oregon USA 2007 1 26 82 19.6 2.6 5.3
## 5 Oregon USA 2007 1 26 59 10.7 1.3 3.9
## 6 Oregon USA 2007 1 26 53 7.1 1.5 2.2
## net_rating oreb_pct dreb_pct usg_pct ts_pct ast_pct X.y Salaries Rank
## 1 3.3 0.089 0.171 0.141 0.492 0.050 260 NA 260
## 2 -0.5 0.026 0.085 0.224 0.535 0.249 14032 NA 935
## 3 4.2 0.021 0.071 0.231 0.521 0.201 15903 972720 368
## 4 -0.7 0.021 0.065 0.258 0.549 0.253 17774 1045560 358
## 5 -6.5 0.017 0.053 0.257 0.489 0.289 19645 1118520 350
## 6 -10.7 0.014 0.077 0.181 0.555 0.190 23387 NA 1158
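The call to merge() above uses the default inner join, so only Name/season pairs present in both data frames survive. The sketch below uses small made-up data frames (the values and the lowercase object names are assumptions for illustration, not the Kaggle data) to show how a left join can reveal how many player rows have no salary match.

```r
# Toy stand-ins for the two data sets (values are made up for illustration).
players <- data.frame(
  Name   = c("A", "A", "B", "C"),
  season = c(2000, 2001, 2000, 2000),
  pts    = c(10.1, 12.3, 5.0, 7.7)
)
salary <- data.frame(
  Name     = c("A", "B", "D"),
  season   = c(2000, 2000, 2000),
  Salaries = c(1e6, 5e5, 2e6)
)

# Inner join (the default): only Name/season pairs found in both tables.
inner <- merge(players, salary, by = c("Name", "season"))

# Left join: keep every player row; unmatched salaries become NA.
left <- merge(players, salary, by = c("Name", "season"), all.x = TRUE)

# Number of player rows without a salary match.
unmatched <- sum(is.na(left$Salaries))
```

On real data this count would also include rows whose salary is already NA in the salary file, so the two cases would need to be separated before drawing conclusions.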
Summarise the types of variables and data structures, check the attributes in the data and apply proper data type conversions. In addition to the R codes and outputs, explain briefly the steps that you have taken. In this section, show that you have fulfilled steps #3-5.
### Check dimension and types
print(nrow(NBA))
## [1] 8777
print(ncol(NBA))
## [1] 25
print(dim(NBA))
## [1] 8777 25
str(NBA)
## 'data.frame': 8777 obs. of 25 variables:
## $ Name : chr "A.C. Green" "Aaron Brooks" "Aaron Brooks" "Aaron Brooks" ...
## $ season : int 2000 2007 2008 2009 2010 2012 2013 2014 2015 2016 ...
## $ X.x : int 1948 5025 5504 5804 6491 7158 7731 8199 8749 9146 ...
## $ team_abbreviation: chr "MIA" "HOU" "HOU" "HOU" ...
## $ age : int 37 23 24 25 26 28 29 30 31 32 ...
## $ player_height : num 206 183 183 183 183 ...
## $ player_weight : num 102 73 73 73 73 ...
## $ college : chr "Oregon State" "Oregon" "Oregon" "Oregon" ...
## $ country : chr "USA" "USA" "USA" "USA" ...
## $ draft_year : chr "1985" "2007" "2007" "2007" ...
## $ draft_round : chr "1" "1" "1" "1" ...
## $ draft_number : chr "23" "26" "26" "26" ...
## $ gp : int 82 51 80 82 59 53 72 82 69 65 ...
## $ pts : num 4.5 5.2 11.2 19.6 10.7 7.1 9 11.6 7.1 5 ...
## $ reb : num 3.8 1.1 2 2.6 1.3 1.5 1.9 2 1.5 1.1 ...
## $ ast : num 0.5 1.7 3 5.3 3.9 2.2 3.2 3.2 2.6 1.9 ...
## $ net_rating : num 3.3 -0.5 4.2 -0.7 -6.5 -10.7 -2.5 5.2 -1.4 -3 ...
## $ oreb_pct : num 0.089 0.026 0.021 0.021 0.017 0.014 0.031 0.019 0.02 0.022 ...
## $ dreb_pct : num 0.171 0.085 0.071 0.065 0.053 0.077 0.069 0.078 0.078 0.064 ...
## $ usg_pct : num 0.141 0.224 0.231 0.258 0.257 0.181 0.205 0.252 0.231 0.191 ...
## $ ts_pct : num 0.492 0.535 0.521 0.549 0.489 0.555 0.518 0.534 0.494 0.507 ...
## $ ast_pct : num 0.05 0.249 0.201 0.253 0.289 0.19 0.238 0.245 0.265 0.216 ...
## $ X.y : int 260 14032 15903 17774 19645 23387 25258 27129 29000 30871 ...
## $ Salaries : int NA NA 972720 1045560 1118520 NA 5750000 1027424 915243 2250000 ...
## $ Rank : int 260 935 368 358 350 1158 124 359 406 271 ...
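### Keep a numeric copy of the year before season is converted to a Date
NBA <- NBA %>%
mutate(Yearnum = as.numeric(season))
### Change variable types - Step 4
### Note: with format = "%Y" only, as.Date() fills in the month and day from the date the code is run (hence "-02-23" in the output)
NBA$season <- as.Date(as.character(NBA$season), format = "%Y")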
NBA$team_abbreviation <- as.factor(NBA$team_abbreviation)
print(levels(NBA$team_abbreviation)) ## Check levels
## [1] "ATL" "BKN" "BOS" "CHA" "CHH" "CHI" "CLE" "DAL" "DEN" "DET" "GSW" "HOU"
## [13] "IND" "LAC" "LAL" "MEM" "MIA" "MIL" "MIN" "NJN" "NOH" "NOK" "NOP" "NYK"
## [25] "OKC" "ORL" "PHI" "PHX" "POR" "SAC" "SAS" "SEA" "TOR" "UTA" "VAN" "WAS"
NBA$country <- as.factor(NBA$country)
NBA$college <- as.factor(NBA$college)
str(NBA) ## Check
## 'data.frame': 8777 obs. of 26 variables:
## $ Name : chr "A.C. Green" "Aaron Brooks" "Aaron Brooks" "Aaron Brooks" ...
## $ season : Date, format: "2000-02-23" "2007-02-23" ...
## $ X.x : int 1948 5025 5504 5804 6491 7158 7731 8199 8749 9146 ...
## $ team_abbreviation: Factor w/ 36 levels "ATL","BKN","BOS",..: 17 12 12 12 28 12 9 6 6 13 ...
## $ age : int 37 23 24 25 26 28 29 30 31 32 ...
## $ player_height : num 206 183 183 183 183 ...
## $ player_weight : num 102 73 73 73 73 ...
## $ college : Factor w/ 273 levels " ","Alabama",..: 174 173 173 173 173 173 173 173 173 173 ...
## $ country : Factor w/ 72 levels "Argentina","Australia",..: 69 69 69 69 69 69 69 69 69 69 ...
## $ draft_year : chr "1985" "2007" "2007" "2007" ...
## $ draft_round : chr "1" "1" "1" "1" ...
## $ draft_number : chr "23" "26" "26" "26" ...
## $ gp : int 82 51 80 82 59 53 72 82 69 65 ...
## $ pts : num 4.5 5.2 11.2 19.6 10.7 7.1 9 11.6 7.1 5 ...
## $ reb : num 3.8 1.1 2 2.6 1.3 1.5 1.9 2 1.5 1.1 ...
## $ ast : num 0.5 1.7 3 5.3 3.9 2.2 3.2 3.2 2.6 1.9 ...
## $ net_rating : num 3.3 -0.5 4.2 -0.7 -6.5 -10.7 -2.5 5.2 -1.4 -3 ...
## $ oreb_pct : num 0.089 0.026 0.021 0.021 0.017 0.014 0.031 0.019 0.02 0.022 ...
## $ dreb_pct : num 0.171 0.085 0.071 0.065 0.053 0.077 0.069 0.078 0.078 0.064 ...
## $ usg_pct : num 0.141 0.224 0.231 0.258 0.257 0.181 0.205 0.252 0.231 0.191 ...
## $ ts_pct : num 0.492 0.535 0.521 0.549 0.489 0.555 0.518 0.534 0.494 0.507 ...
## $ ast_pct : num 0.05 0.249 0.201 0.253 0.289 0.19 0.238 0.245 0.265 0.216 ...
## $ X.y : int 260 14032 15903 17774 19645 23387 25258 27129 29000 30871 ...
## $ Salaries : int NA NA 972720 1045560 1118520 NA 5750000 1027424 915243 2250000 ...
## $ Rank : int 260 935 368 358 350 1158 124 359 406 271 ...
## $ Yearnum : num 2000 2007 2008 2009 2010 ...
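The season dates in the output carry the month and day of the run date, and the draft columns remain character. A minimal sketch of two possible conversions follows, using toy vectors: pinning the date to 1 January is an arbitrary choice, and the "Undrafted" marker is an assumed example of a non-numeric entry rather than something shown in the output.

```r
season <- c(2000, 2007, 2008)

# Pin month and day explicitly so the result does not depend on the run date.
season_date <- as.Date(paste0(season, "-01-01"))

# Hypothetical draft values: a non-numeric marker forces the column to character.
draft_year <- c("1985", "2007", "Undrafted")

# Non-numeric entries become NA; the coercion warning is expected here.
draft_year_int <- suppressWarnings(as.integer(draft_year))
```

Whether NA is an acceptable encoding for such entries depends on the analysis; a separate logical column could preserve that information.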
### Rename variables - Step 5a
NBA <- NBA %>%
rename(
Team = team_abbreviation)
print(head(NBA))
## Name season X.x Team age player_height player_weight
## 1 A.C. Green 2000-02-23 1948 MIA 37 205.74 102.05820
## 2 Aaron Brooks 2007-02-23 5025 HOU 23 182.88 73.02831
## 3 Aaron Brooks 2008-02-23 5504 HOU 24 182.88 73.02831
## 4 Aaron Brooks 2009-02-23 5804 HOU 25 182.88 73.02831
## 5 Aaron Brooks 2010-02-23 6491 PHX 26 182.88 73.02831
## 6 Aaron Brooks 2012-02-23 7158 HOU 28 182.88 73.02831
## college country draft_year draft_round draft_number gp pts reb ast
## 1 Oregon State USA 1985 1 23 82 4.5 3.8 0.5
## 2 Oregon USA 2007 1 26 51 5.2 1.1 1.7
## 3 Oregon USA 2007 1 26 80 11.2 2.0 3.0
## 4 Oregon USA 2007 1 26 82 19.6 2.6 5.3
## 5 Oregon USA 2007 1 26 59 10.7 1.3 3.9
## 6 Oregon USA 2007 1 26 53 7.1 1.5 2.2
## net_rating oreb_pct dreb_pct usg_pct ts_pct ast_pct X.y Salaries Rank
## 1 3.3 0.089 0.171 0.141 0.492 0.050 260 NA 260
## 2 -0.5 0.026 0.085 0.224 0.535 0.249 14032 NA 935
## 3 4.2 0.021 0.071 0.231 0.521 0.201 15903 972720 368
## 4 -0.7 0.021 0.065 0.258 0.549 0.253 17774 1045560 358
## 5 -6.5 0.017 0.053 0.257 0.489 0.289 19645 1118520 350
## 6 -10.7 0.014 0.077 0.181 0.555 0.190 23387 NA 1158
## Yearnum
## 1 2000
## 2 2007
## 3 2008
## 4 2009
## 5 2010
## 6 2012
### Order variables - Step 5b (by year, then by salary rank)
NBA <- NBA[order(NBA$Yearnum, NBA$Rank), ]
Check if the data conforms the tidy data principles. If your data is untidy, reshape your data into a tidy format (step #6). In addition to the R codes and outputs, explain everything that you do in this step.
### Conclusion: the data already conform to the tidy data principles (each variable is a column and each observation, one player-season, is a row). Only minor amendments to variable names and types remain, including removing unnecessary columns.
#### Remove Unnecessary columns
#### X.x #### Variable 3
#### net_rating #### Variable 17
#### oreb_pct #### Variable 18
#### dreb_pct #### Variable 19
#### usg_pct #### Variable 20
#### ts_pct #### Variable 21
#### ast_pct #### Variable 22
#### X.y #### Variable 23
NBA <- NBA[, c(1:2, 4:16,24:26)]
print(head(NBA)) ### Check header
## Name season Team age player_height player_weight
## 7551 Shaquille O'Neal 2000-02-23 LAL 29 215.90 142.88148
## 4875 Kevin Garnett 2000-02-23 MIN 25 210.82 99.79024
## 248 Alonzo Mourning 2000-02-23 MIA 31 208.28 118.38751
## 4666 Juwan Howard 2000-02-23 DAL 28 205.74 113.39800
## 7418 Scottie Pippen 2000-02-23 POR 35 200.66 103.41898
## 4699 Karl Malone 2000-02-23 UTA 37 205.74 116.11955
## college country draft_year draft_round draft_number gp pts reb
## 7551 Louisiana State USA 1992 1 1 74 28.7 12.7
## 4875 None USA 1995 1 5 81 22.0 11.4
## 248 Georgetown USA 1992 1 2 13 13.6 7.8
## 4666 Michigan USA 1994 1 5 81 18.0 7.1
## 7418 Central Arkansas USA 1987 1 5 64 11.3 5.2
## 4699 Louisiana Tech USA 1985 1 13 81 23.2 8.3
## ast Salaries Rank Yearnum
## 7551 3.7 17142000 1 2000
## 4875 5.0 16806000 2 2000
## 248 0.9 15004000 3 2000
## 4666 2.8 15000000 4 2000
## 7418 4.6 14795000 5 2000
## 4699 4.5 14000000 6 2000
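Selecting by position works, but it silently breaks if the column order changes. As a hedged alternative, the sketch below drops columns by name using base R on a toy data frame (the object names df, drop_cols and df_small are illustrative). Base R indexing is used because MASS is loaded after dplyr in this report, so an unqualified select() would resolve to the MASS version.

```r
# Toy frame with a few of the column names from the merged data.
df <- data.frame(Name = c("A", "B"), X.x = 1:2, pts = c(3.5, 4.5),
                 net_rating = c(0.1, -0.2), X.y = 3:4)

# Columns to discard, listed by name instead of by position.
drop_cols <- c("X.x", "net_rating", "X.y")

# Keep everything not in drop_cols; drop = FALSE preserves the data frame.
df_small <- df[, !(names(df) %in% drop_cols), drop = FALSE]
```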
Create/mutate at least one variable from the existing variables (step #7). In addition to the R codes and outputs, explain everything that you do in this step.
For this example, three new variables have been created:
1. Average earnings per game
2. Total annual points
3. Earnings per point
### Create Variables
### Average Earnings Per Game
NBA <- NBA %>%
mutate(pay.per.game = as.numeric(Salaries/gp))
### Total Points Annually
NBA <- NBA %>%
mutate(annual.points = as.numeric(gp*pts))
### Earnings Per Point
NBA <- NBA %>%
mutate(pay.per.point = as.numeric(Salaries/annual.points))
print(head(NBA))
## Name season Team age player_height player_weight
## 7551 Shaquille O'Neal 2000-02-23 LAL 29 215.90 142.88148
## 4875 Kevin Garnett 2000-02-23 MIN 25 210.82 99.79024
## 248 Alonzo Mourning 2000-02-23 MIA 31 208.28 118.38751
## 4666 Juwan Howard 2000-02-23 DAL 28 205.74 113.39800
## 7418 Scottie Pippen 2000-02-23 POR 35 200.66 103.41898
## 4699 Karl Malone 2000-02-23 UTA 37 205.74 116.11955
## college country draft_year draft_round draft_number gp pts reb
## 7551 Louisiana State USA 1992 1 1 74 28.7 12.7
## 4875 None USA 1995 1 5 81 22.0 11.4
## 248 Georgetown USA 1992 1 2 13 13.6 7.8
## 4666 Michigan USA 1994 1 5 81 18.0 7.1
## 7418 Central Arkansas USA 1987 1 5 64 11.3 5.2
## 4699 Louisiana Tech USA 1985 1 13 81 23.2 8.3
## ast Salaries Rank Yearnum pay.per.game annual.points pay.per.point
## 7551 3.7 17142000 1 2000 231648.6 2123.8 8071.381
## 4875 5.0 16806000 2 2000 207481.5 1782.0 9430.976
## 248 0.9 15004000 3 2000 1154153.8 176.8 84864.253
## 4666 2.8 15000000 4 2000 185185.2 1458.0 10288.066
## 7418 4.6 14795000 5 2000 231171.9 723.2 20457.688
## 4699 4.5 14000000 6 2000 172839.5 1879.2 7449.979
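The ratios above would evaluate to Inf for any player-season with zero games or zero points. The output does not show whether such rows exist, so the following is only a precautionary sketch on made-up values; safe_divide is a hypothetical helper, not part of any package.

```r
df <- data.frame(Salaries = c(1000000, 500000, 750000),
                 gp       = c(50, 10, 0),
                 pts      = c(10, 0, 0))

df$annual.points <- df$gp * df$pts

# Divide only where the denominator is positive; otherwise record NA.
safe_divide <- function(num, den) ifelse(den > 0, num / den, NA_real_)

df$pay.per.game  <- safe_divide(df$Salaries, df$gp)
df$pay.per.point <- safe_divide(df$Salaries, df$annual.points)
```

Recording NA rather than Inf means these rows are picked up by the later missing-value scan.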
Scan the data for missing values, inconsistencies and obvious errors. In this step, you should fulfill the step #8. In addition to the R codes and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.
Options:
1. Remove the rows where salary information is not provided.
2. Use the mean of the other salary data to fill years where salary information is not provided.
Decision - Remove all rows where salary is not provided. Based on the scan conducted and a review of the data, it was concluded that where salary values were missing it was not reasonable to substitute the mean of the player's other seasons, because the number of seasons recorded for each player varied. It was therefore best to omit rows where the player's salary was not provided.
### Check for "NA" data
print(colSums(is.na(NBA)))
## Name season Team age player_height
## 0 0 0 0 0
## player_weight college country draft_year draft_round
## 0 0 0 0 0
## draft_number gp pts reb ast
## 0 0 0 0 0
## Salaries Rank Yearnum pay.per.game annual.points
## 1800 0 0 1800 0
## pay.per.point
## 1800
### Remove blanks
NBA <- na.omit(NBA)
dim(NBA) ## Dataframe has now reduced in size
## [1] 6977 21
print(colSums(is.na(NBA)))
## Name season Team age player_height
## 0 0 0 0 0
## player_weight college country draft_year draft_round
## 0 0 0 0 0
## draft_number gp pts reb ast
## 0 0 0 0 0
## Salaries Rank Yearnum pay.per.game annual.points
## 0 0 0 0 0
## pay.per.point
## 0
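na.omit() removes rows containing NA or NaN but leaves infinite values in place. A short sketch of a complementary check is given below on a toy data frame; the names df, num_cols and non_finite are illustrative.

```r
df <- data.frame(a = c(1, NA, 3), b = c(2, Inf, NaN), name = c("x", "y", "z"))

# Count non-finite entries (NA, NaN, Inf, -Inf) in each numeric column.
num_cols <- sapply(df, is.numeric)
non_finite <- sapply(df[num_cols], function(col) sum(!is.finite(col)))
```

A count of zero in every numeric column would confirm that no infinite values remain after the derived ratios are computed.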
Scan the numeric data for outliers. In this step, you should fulfill the step #9. In addition to the R codes and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.
### Boxplot 1 - Salary comparison by team
Teambox <- ggplot(data = NBA,aes(Team, Salaries, fill=factor(Team)))+
geom_boxplot( )+
ggtitle("Salary Boxplot by Team") +
theme(plot.title = element_text(hjust = 0.5,face = "bold", colour="Black", size = (16)))+
theme(axis.text.x = element_text(angle = 45, hjust = 0.5, size = 6, vjust = 0.5))+
labs(x='Team', y='Salary ($)')+
theme(legend.text=element_text(size=6),
legend.key.size = unit(0.5, 'cm'), #change legend key size
legend.key.height = unit(0.5, 'cm'), #change legend key height
legend.key.width = unit(0.5, 'cm'))+
labs(fill='Team')
ggplotly(Teambox)
### Boxplot 2 - Salary comparison by Season
Seasonbox <- ggplot(data = NBA,aes(Yearnum, Salaries, fill=factor(Yearnum)))+
geom_boxplot( )+
ggtitle("Salary Boxplot by Year") +
theme(plot.title = element_text(hjust = 0.5,face = "bold", colour="Black", size = (16)))+
theme(axis.text.x = element_text(angle = 45, hjust = 0.5, size = 6, vjust = 0.5))+
labs(x='Year', y='Salary ($)')+
theme(legend.text=element_text(size=6),
legend.key.size = unit(0.5, 'cm'), #change legend key size
legend.key.height = unit(0.5, 'cm'), #change legend key height
legend.key.width = unit(0.5, 'cm'))+
labs(fill='Year')
ggplotly(Seasonbox)
High-value outliers can be observed for each team and each season. Given the profile of the NBA this is not a data quality concern: the top 10 players will typically draw much higher salaries than the average, so these values are retained in the data.
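The visual check can be backed by a numeric one. As a sketch on made-up salary values, boxplot.stats() from base R flags points lying more than 1.5 times the interquartile range beyond the hinges, which is the usual boxplot whisker rule.

```r
# Made-up salaries with one very large value.
salaries <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 30) * 1e6

# Values beyond 1.5 * IQR from the hinges are returned in $out.
flagged <- boxplot.stats(salaries)$out

# Share of observations flagged as outliers.
share_flagged <- length(flagged) / length(salaries)
```

Applied per team or per season, such a count would show how many salaries are flagged without implying that they should be removed.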
#### Scatter plot of salary against total points scored to check for a correlation
SalaryScatter <- ggplot(NBA, aes(Salaries, annual.points)) +
geom_point(shape=1) + # Use hollow circles
geom_smooth(method=lm, # Add linear regression line
se=TRUE) # Add shaded confidence region
print(SalaryScatter)
## `geom_smooth()` using formula 'y ~ x'
#### Check the correlation
CorrSP <- cor.test(NBA$Salaries, NBA$annual.points, method="pearson")
print(CorrSP)
##
## Pearson's product-moment correlation
##
## data: NBA$Salaries and NBA$annual.points
## t = 37.394, df = 6975, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3889186 0.4280156
## sample estimates:
## cor
## 0.4086546
It can be concluded that there is a moderate positive correlation (r = 0.41) between total annual points and player salary. This makes sense: points are not the only measure of a player's ability and depend on the role the player has, but scoring still appears to be a moderately important consideration in salary assignment.
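One way to put the strength of this relationship in context is to square the reported estimate, which gives the share of variance in one variable that is linearly associated with the other.

```r
# Correlation estimate reported by cor.test() above.
r <- 0.4086546

# Coefficient of determination: share of variance in one variable
# associated linearly with the other.
r_squared <- r^2
```

The result is roughly 0.17, so about 17% of the variation in salary is linearly associated with annual points, which is consistent with describing the correlation as moderate.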
Apply an appropriate transformation for at least one of the variables. In addition to the R codes and outputs, explain everything that you do in this step. In this step, you should fulfill the step #10.
Histo <- hist(NBA$Salaries,
main = "Histogram of Salary figures",
xlab = "Salary")
print(Histo)
## $breaks
## [1] 0.0e+00 2.0e+06 4.0e+06 6.0e+06 8.0e+06 1.0e+07 1.2e+07 1.4e+07 1.6e+07
## [10] 1.8e+07 2.0e+07 2.2e+07 2.4e+07 2.6e+07 2.8e+07 3.0e+07 3.2e+07 3.4e+07
## [19] 3.6e+07 3.8e+07
##
## $counts
## [1] 2591 1508 900 532 346 295 259 183 116 88 48 45 26 17 8
## [16] 9 1 4 1
##
## $density
## [1] 1.856815e-07 1.080694e-07 6.449764e-08 3.812527e-08 2.479576e-08
## [6] 2.114089e-08 1.856099e-08 1.311452e-08 8.313029e-09 6.306435e-09
## [11] 3.439874e-09 3.224882e-09 1.863265e-09 1.218289e-09 5.733123e-10
## [16] 6.449764e-10 7.166404e-11 2.866562e-10 7.166404e-11
##
## $mids
## [1] 1.0e+06 3.0e+06 5.0e+06 7.0e+06 9.0e+06 1.1e+07 1.3e+07 1.5e+07 1.7e+07
## [10] 1.9e+07 2.1e+07 2.3e+07 2.5e+07 2.7e+07 2.9e+07 3.1e+07 3.3e+07 3.5e+07
## [19] 3.7e+07
##
## $xname
## [1] "NBA$Salaries"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
The data appear right-skewed (positively skewed): most player-seasons fall in the lowest salary bins, with a long tail of very high salaries. Because the data span multiple seasons this makes sense, as salaries were lower in earlier years. To reduce the skew, a logarithmic transformation is applied.
### Perform a logarithmic transformation
log_NBAsal <- log10(NBA$Salaries)
LogHisto <- hist(log_NBAsal,
main = "Histogram of base 10 log salary",
xlab = "Base 10 log of Salary")
print(LogHisto)
## $breaks
## [1] 4.2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6.0 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6
##
## $counts
## [1] 3 9 17 59 21 30 130 310 718 889 898 942 1067 784 719
## [16] 331 50
##
## $density
## [1] 0.002149921 0.006449764 0.012182887 0.042281783 0.015049448 0.021499212
## [7] 0.093163251 0.222158521 0.514547800 0.637093307 0.643543070 0.675075247
## [13] 0.764655296 0.561846066 0.515264440 0.237207969 0.035832019
##
## $mids
## [1] 4.3 4.5 4.7 4.9 5.1 5.3 5.5 5.7 5.9 6.1 6.3 6.5 6.7 6.9 7.1 7.3 7.5
##
## $xname
## [1] "log_NBAsal"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
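The effect of the transformation can also be summarised with a skewness statistic. The sketch below uses simulated log-normal values as a stand-in for salaries (the simulation parameters are assumptions) and a simple moment-based skewness function defined inline rather than taken from a package.

```r
set.seed(1)

# Simulated right-skewed values (log-normal), standing in for salaries.
x <- rlnorm(1000, meanlog = 14, sdlog = 1)

# Moment-based sample skewness: positive means a longer right tail.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(x)
skew_log <- skewness(log10(x))
```

A value near zero after the log transformation indicates a roughly symmetric distribution, while a clearly positive value on the raw scale reflects the long right tail.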