
Seasons 1950-2017




Introduction


This is an exploratory analysis of NBA seasons 1950-2017. If you are familiar with NBA statistics terminology, do not care how the data is processed, or just want to go straight to the fun, please skip this boring part and proceed to the next one.


About Me and the Project

It was the NBA that first introduced me to the broad world of statistics, in an entertaining way, back in my childhood. Since then I have always been captivated by statistics and data.

Despite my interest in statistics, and despite occasionally doing amateurish statistical analyses for pleasure on anything that piqued my interest, I never had proper training, nor did I consider the subject as a career until recently. Not long ago, without prior knowledge in the area, I started teaching myself Data Science coding obsessively, diving into MOOCs (Massive Open Online Courses) religiously in my spare time.

Not having followed the NBA regularly for a long time heightened my excitement to explore, analyze and gain insight from the data. This dataset is therefore perfect for my first self-directed project, as it gives me a convenient way to dig deeper into Data Science while enjoying the process.

I expect to use this opportunity to gain experience by putting my new skills into practice and expanding them: exploring various methods of data wrangling and visualization, then communicating the findings efficiently and attractively. In this way, I can shape my skills through practical learning.

The objective is to present statistical facts about NBA seasons through compelling visualizations and tables. I do not make any predictive analysis here (I will get there later).

This presentation is designed for anyone who is interested in the subject, with no prior technical knowledge required, hence the simplifications and references. It is also intended to draw feedback from experts in the subject so that I can improve.

I hope this may be useful as a reference for understanding NBA statistics and NBA season history, or as a code reference for R learners.


The Dataset

The dataset contains aggregate individual statistics for 67 NBA seasons from 1950 to 2017. The data was scraped from Basketball-reference and is outdated. Nevertheless, the code is designed to make updating easy once new data becomes available. For simplicity, I took only the basic stats (FG, PTS, etc.) and left out the advanced stats (VORP, PER, OBPM, USG%, etc.).

Since the dataset provides only regular-season statistics and basic player biodata, I need to point out its limitations:

  • No NBA champion, playoff or All-Star game data is present in the dataset.
  • There is no win-loss or score information.
  • No team stats are given.
  • Data about titles or awards (e.g. MVP, Rookie of the Year, All-Star selections) is not provided in the dataset.
  • Player stats are aggregated per season, so it is not possible to extract information about individual games.
  • No regional division information is given.




Glossary



More about Positions

Historically, only three positions were recognized (two guards, two forwards, and one center) based on where they played on the court: Guards generally played outside and away from the hoop and forwards played outside and near the baseline, with the center usually positioned in the key. During the 1980s, as team strategy evolved after the three-point field goal and the three-point arc were added to the basketball court, more specialized roles developed, resulting in the five position designations used today. (source)

Here, for simplicity and aesthetic purposes, I arbitrarily mapped all players onto the five standard positions used today and color-coded them. This is essential because I use these positions frequently for comparison.

The five positions and their abbreviations are:

  • C: Center
  • PF: Power Forward
  • SF: Small Forward
  • SG: Shooting Guard
  • PG: Point Guard




Data Preparation Stage


Loading necessary packages.

library(dplyr)
library(tidyr)
library(measurements)
library(sqldf)
library(kableExtra)


Create additional functions

# Mode: the most frequent value in v
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Convert a single "feet-inches" height string (e.g. "6-2") to whole centimetres
convertHeight <- function(x) {
    x <- as.character(x)
    split <- strsplit(x, "-")
    feet <- as.numeric(split[[1]][1])
    inch <- as.numeric(split[[1]][2])
    round(conv_unit(feet, "ft", "cm") + conv_unit(inch, "inch", "cm"),0)
}
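
The `convertHeight` helper above reads only the first element of its input, which is why it is later applied row by row. As a minimal sketch of a vectorized alternative, the hypothetical `convertHeightVec` below (not part of the original analysis) uses base R only and assumes the standard factors 1 ft = 30.48 cm and 1 in = 2.54 cm instead of `conv_unit`:

```r
# Hypothetical vectorized variant of convertHeight (base R only).
# Assumes heights are written as "feet-inches", e.g. "6-2".
convertHeightVec <- function(x) {
    parts <- strsplit(as.character(x), "-")
    # Pick the first (feet) and second (inches) piece of every height string
    feet <- as.numeric(vapply(parts, `[`, character(1), 1))
    inch <- as.numeric(vapply(parts, `[`, character(1), 2))
    round(feet * 30.48 + inch * 2.54, 0)
}

heights_cm <- convertHeightVec(c("6-2", "7-0"))
print(heights_cm)
```

With a function like this, the height column could be converted in one call on the whole column rather than once per row.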


Load the dataset


Downloading and extracting the files.

Dataset from: https://www.kaggle.com/drgilermo/nba-players-stats/version/2

fileURL <- "https://www.kaggle.com/drgilermo/nba-players-stats/downloads/nba-players-stats.zip/2"
filename <- "NBASeason1950-2017.zip"

# Check whether the archive already exists.
if (!file.exists(filename)){
  download.file(fileURL, filename, method="curl")
}  

# Checking if folder exists
if (!file.exists("NBA Season Dataset")) { 
  unzip(filename)
}

Download date: 28 July 2018


Load the data.

NBA <- read.csv("NBA Season Dataset/Seasons_Stats.csv")[, c(2:9, 11:20, 32:53)]
PlayerData <- read.csv("NBA Season Dataset/player_data.csv")[, c(1:3, 5:6)]


Tidying data


Clean up data.

# Remove rows with a missing Year or Player
NBA <- NBA %>% filter(!is.na(Year), !is.na(Player))
# Remove Tm == "TOT" rows (season totals for a player who played for more than one team in a season)
NBA <- NBA[NBA$Tm != "TOT",]
# Remove the trailing "*", which indicates that a player is a member of the NBA Hall of Fame
NBA$Player <- gsub("\\*$", "", NBA$Player)
# Fix individual records by hand (row numbers refer to the files as downloaded)
PlayerData[2143, 4] <- "6-2"  # height; a plain string works for both character and factor columns
PlayerData[2143, 5] <- 190    # weight in lbs
NBA[21304, 3] <- "SG"         # position
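
To make the effect of the clean-up steps concrete, here is a small sketch on made-up data. The players and teams in `toy` are hypothetical and only mimic the columns used above; base R subsetting stands in for `filter`:

```r
# Toy data with hypothetical players, mimicking the columns used above
toy <- data.frame(
    Year   = c(2017, 2017, 2017, NA),
    Player = c("Player A*", "Player B", "Player B", "Player C"),
    Tm     = c("BOS", "TOT", "LAL", "NYK"),
    stringsAsFactors = FALSE
)

toy <- toy[!is.na(toy$Year) & !is.na(toy$Player), ]  # drop incomplete rows
toy <- toy[toy$Tm != "TOT", ]                        # drop multi-team season totals
toy$Player <- gsub("\\*$", "", toy$Player)           # strip the Hall of Fame marker
print(toy)
```

Dropping the TOT rows keeps the per-team rows, so a traded player's season stays split by team instead of being counted twice.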


Renaming some columns of the dataset

colnames(PlayerData) <- c("Player", "YearStart", "YearEnd", "Height-feet", "Weight-lbs")


Merging data

NBA <- sqldf("SELECT * FROM NBA JOIN PlayerData ON NBA.Player = PlayerData.Player 
      WHERE NBA.Year >= PlayerData.YearStart AND NBA.Year <= PlayerData.YearEnd")
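
The year-range condition in the query presumably guards against players who share a name: joined on the name alone, each of them would also pick up the other's biodata. A minimal base R sketch on made-up data (the name and numbers are hypothetical) applies the same logic as the `sqldf` query:

```r
# Two different hypothetical players who share one name
stats <- data.frame(
    Year   = c(1985, 2010),
    Player = c("John Doe", "John Doe"),
    PTS    = c(900, 1200),
    stringsAsFactors = FALSE
)
bio <- data.frame(
    Player    = c("John Doe", "John Doe"),
    YearStart = c(1980, 2005),
    YearEnd   = c(1990, 2015),
    Height    = c("6-2", "6-9"),
    stringsAsFactors = FALSE
)

# Joining on the name alone pairs every season with every same-named player
joined <- merge(stats, bio, by = "Player")

# Keep only the pairs where the season falls inside the player's career span
matched <- joined[joined$Year >= joined$YearStart & joined$Year <= joined$YearEnd, ]
matched <- matched[order(matched$Year), ]
print(matched)
```

Two caveats follow from the query itself: it is an inner join, so seasons of players missing from `PlayerData` are dropped, and same-named players whose careers overlap would still be matched twice.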


Fixing Position

Here, for simplicity and aesthetic purposes, I arbitrarily mapped all players onto the five standard positions used today and color-coded them:

  • “C-F” in original dataset becomes “C”
  • “F-C” in original dataset becomes “PF”
  • “F” in original dataset becomes “PF”
  • “F-G” in original dataset becomes “SF”
  • “G” in original dataset becomes “SG”
  • “G-F” in original dataset becomes “SF”

NBA$Pos[NBA$Pos == "C-F"] <- "C"
NBA$Pos[NBA$Pos == "F-C"] <- "PF"
NBA$Pos[NBA$Pos == "F"] <- "PF"
NBA$Pos[NBA$Pos == "F-G"] <- "SF"
NBA$Pos[NBA$Pos == "G"] <- "SG"
NBA$Pos[NBA$Pos == "G-F"] <- "SF"
NBA$Pos <- factor(NBA$Pos, levels = c("C", "PF", "SF", "SG", "PG"))
PosColorCode <- c("C"="#FF0000", "PF"="#FFA500", "SF"="#DDDD00" ,"SG"="#0000FF", "PG"="#32CD32")
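
The six assignments above can also be written as a single lookup. The sketch below is an alternative formulation, not the code used in this analysis; `pos_map` and the sample vector `pos` are hypothetical names introduced for illustration:

```r
# Old label -> new label, taken from the mapping listed above
pos_map <- c("C-F" = "C", "F-C" = "PF", "F" = "PF",
             "F-G" = "SF", "G" = "SG", "G-F" = "SF")

pos <- c("C-F", "G", "PG", "F-G", "SF")  # sample input

# Recode the labels found in the map and keep all others unchanged
recoded <- ifelse(pos %in% names(pos_map), pos_map[pos], pos)
recoded <- factor(unname(recoded), levels = c("C", "PF", "SF", "SG", "PG"))
print(recoded)
```

Any label that is neither in the map nor one of the five levels turns into `NA` when the factor is built, so a check such as `sum(is.na(recoded))` after recoding helps catch unmapped positions.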


Create new variables:

    Player info dataset

  • Height: Height converted from feet and inches to cm
  • Weight: Weight converted from lbs to kg
  • BMI: Body Mass Index, calculated from Weight and Height
  • Born: Year of birth
  • ORpG (Offensive Rebounds per Game): Average offensive rebounds a player made in a game. (ORB/G)
  • DRpG (Defensive Rebounds per Game): Average defensive rebounds a player made in a game. (DRB/G)
  • RpG (Rebounds per Game): Average rebounds a player made in a game. (TRB/G)
  • ApG (Assists per Game): Average assists a player made in a game. (AST/G)
  • SpG (Steals per Game): Average steals a player made in a game. (STL/G)
  • BpG (Blocks per Game): Average blocks a player made in a game. (BLK/G)
  • TpG (Turnovers per Game): Average turnovers a player made in a game. (TOV/G)
  • PpG (Points per Game): Average points a player made in a game. (PTS/G)
  • Position: Same as Pos, but prettified


NBA <- NBA %>%
    rowwise() %>%
    mutate(Height = convertHeight(`Height-feet`),
           Weight = round(conv_unit(`Weight-lbs`, "lbs", "kg"),0),
           BMI = round(Weight / (Height / 100)^2, 2),
           Born = Year - Age,
           ORpG = ORB / G,
           DRpG = DRB / G,
           RpG = TRB / G,
           ApG = AST / G,
           SpG = STL / G,
           BpG = BLK / G,
           TpG = TOV / G,
           PpG = PTS / G,
           Position = cell_spec(Pos,
                            color = "white",
                            align = "c",
                            background = factor(Pos, c("C", "PF", "SF", "SG", "PG"),
                                                PosColorCode)))
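
As a worked example of the derived variables, the sketch below computes them by hand for one made-up season line. All input numbers are hypothetical, and the unit factors (30.48 cm per foot, 2.54 cm per inch, 0.45359237 kg per pound) are written out instead of calling `conv_unit`:

```r
# Hypothetical player-season: listed at 6-6 and 216 lbs, 2460 points in 82 games
feet <- 6; inch <- 6
weight_lbs <- 216
G <- 82; PTS <- 2460
Year <- 2017; Age <- 30

height_cm <- round(feet * 30.48 + inch * 2.54, 0)   # 198.12 rounds to 198
weight_kg <- round(weight_lbs * 0.45359237, 0)      # 97.98 rounds to 98
bmi  <- round(weight_kg / (height_cm / 100)^2, 2)   # weight in kg over height in m, squared
ppg  <- PTS / G                                     # points per game
born <- Year - Age                                  # approximate year of birth

print(c(height_cm = height_cm, weight_kg = weight_kg, bmi = bmi, ppg = ppg, born = born))
```

Note that BMI is computed from the already rounded height and weight, as in the pipeline above, and that `Born` is only an approximation because it depends on when in the season the age was recorded.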


Arrange the table.

NBA <- NBA %>%
    ungroup() %>%  # drop the rowwise grouping so that later column-wise steps see whole columns
    select(Year:MP, YearStart:BMI, "Born", FG:PTS, TS.:TOV., ORpG:PpG, "Position", -c("Height-feet", "Weight-lbs"))


Create duplicate table with normalized stats.

NBA_Scaled <- NBA %>% mutate_at(vars(c(G:MP, FG:PpG)), scale)

NBA_pM <- NBA %>%
    mutate_at(vars(c(FG:PTS)), function(x){ x/NBA$MP}) %>%
    rename_at(vars(c(FG:PTS)), ~paste0(.,"_pM"))

NBA_Scaled <- cbind(NBA_Scaled, NBA_pM[,c(15:36)])
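
For readers unfamiliar with `scale`, the sketch below shows on a made-up three-row table what the two transformations do: `scale` turns a column into z-scores (value minus column mean, divided by column standard deviation), and the `_pM` columns are plain per-minute rates. The numbers in `toy` are hypothetical:

```r
toy <- data.frame(MP = c(1000, 2000, 3000), PTS = c(400, 1000, 1800))

# scale() returns a one-column matrix; as.numeric() flattens it to a vector
z_pts  <- as.numeric(scale(toy$PTS))
manual <- (toy$PTS - mean(toy$PTS)) / sd(toy$PTS)   # the same z-score by hand

# Per-minute rate, as in the _pM columns
pts_pm <- toy$PTS / toy$MP

print(data.frame(z_pts, manual, pts_pm))
```

Because the scaling above is applied to each column as a whole, every z-score is relative to all seasons in the table pooled together, not to a single season.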


Displaying raw tidy data table

NBA


  • Number of rows: 22345
  • Number of columns: 56
  • Number of players: 3889
  • Number of teams: 68
  • File Size: 7275.5 Kb



Write tidy dataset for later use…

write.csv(NBA, file = "NBA_TidySet.csv")
write.csv(NBA_Scaled, file = "NBA_Scaled_TidySet.csv")
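
One detail to keep in mind when reading these files back: `write.csv` writes row names by default, and they come back from `read.csv` as an extra first column named `X`. A minimal sketch of a round trip without that column, using a made-up table and a temporary file:

```r
toy <- data.frame(Player = c("Player A", "Player B"), PpG = c(12.5, 8.1),
                  stringsAsFactors = FALSE)

path <- tempfile(fileext = ".csv")
write.csv(toy, file = path, row.names = FALSE)   # skip the row-name column
back <- read.csv(path, stringsAsFactors = FALSE)

print(names(back))
```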




End of Session


