drawing

Seasons 1950-2017





Setup Stage


Loading necessary packages.

library(dplyr)
library(tidyr)
library(ggplot2)
library(RColorBrewer)
library(plotly)
library(kableExtra)


Create additional functions

# Mode average
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}


Load the dataset


Dataset taken from: https://www.kaggle.com/drgilermo/nba-players-stats/version/2

If you need to see the glossary and the data tidying process, please visit the first part here.

Loading datasets…

NBA <- read.csv("NBA_TidySet.csv")[,-c(1)]
NBA_Scaled <- read.csv("NBA_Scaled_TidySet.csv")
NBA$Pos <- factor(NBA$Pos, levels = c("C", "PF", "SF", "SG", "PG"))
NBA_Scaled$Pos <- factor(NBA_Scaled$Pos, levels = c("C", "PF", "SF", "SG", "PG"))
PosColorCode <- c("C"="#FF0000", "PF"="#FFA500", "SF"="#DDDD00" ,"SG"="#0000FF", "PG"="#32CD32")


Displaying raw tidy data table

NBA


  • Number of rows: 22345
  • Number of columns: 55
  • Number of players: 3889
  • Number of teams: 68
  • File Size: 7188.1 Kb




Players

In this session, we start by exploring the players. The variables that will be enticing to explore include their positions, height, weight, born and age.


NBA Teams growth


We start by exploring how NBA teams and the number of players has grown over the years. The table will show the detailed numbers.

Team_Player <- NBA %>%
  group_by(Year) %>%
  summarise(nPlayers = n_distinct(Player),
            nTeams = n_distinct(Tm),
            nGames = max(G),
            Players_per_Team = round(nPlayers/nTeams, 2)) 
Team_Player %>%
    kable(escape = FALSE, align='c', caption = "Players, Teams and Games") %>%
    kable_styling("striped", full_width = T) %>%
    column_spec(1, bold = T) %>%
    scroll_box(width = "100%", height = "300px")
Players, Teams and Games
Year nPlayers nTeams nGames Players_per_Team
1950 219 17 68 12.88
1951 126 11 69 11.45
1952 112 10 66 11.20
1953 123 10 72 12.30
1954 108 9 72 12.00
1955 96 9 72 10.67
1956 91 8 72 11.38
1957 96 8 72 12.00
1958 98 8 72 12.25
1959 91 8 72 11.38
1960 95 8 75 11.88
1961 92 8 79 11.50
1962 106 9 80 11.78
1963 116 9 80 12.89
1964 111 9 80 12.33
1965 114 9 80 12.67
1966 109 9 80 12.11
1967 121 10 81 12.10
1968 149 12 82 12.42
1969 166 14 82 11.86
1970 167 14 82 11.93
1971 213 17 82 12.53
1972 212 17 82 12.47
1973 210 17 82 12.35
1974 217 17 82 12.76
1975 231 18 82 12.83
1976 233 18 82 12.94
1977 289 22 82 13.14
1978 281 22 82 12.77
1979 274 22 82 12.45
1980 281 22 82 12.77
1981 297 23 82 12.91
1982 309 23 82 13.43
1983 309 23 82 13.43
1984 306 23 82 13.30
1985 314 23 82 13.65
1986 319 23 82 13.87
1987 330 23 82 14.35
1988 328 23 82 14.26
1989 348 25 82 13.92
1990 376 27 82 13.93
1991 383 27 82 14.19
1992 385 27 82 14.26
1993 386 27 82 14.30
1994 399 27 82 14.78
1995 402 27 82 14.89
1996 424 29 82 14.62
1997 437 29 82 15.07
1998 434 29 82 14.97
1999 436 29 50 15.03
2000 435 29 82 15.00
2001 437 29 82 15.07
2002 436 29 82 15.03
2003 425 29 82 14.66
2004 439 29 82 15.14
2005 460 30 82 15.33
2006 454 30 82 15.13
2007 456 30 82 15.20
2008 448 30 82 14.93
2009 442 30 82 14.73
2010 440 30 82 14.67
2011 449 30 82 14.97
2012 475 30 66 15.83
2013 464 30 82 15.47
2014 477 30 82 15.90
2015 489 30 82 16.30
2016 472 30 82 15.73
2017 481 30 82 16.03


Now we see how NBA teams have grown, with this simple plot.

Team_Player %>%
    ggplot() +
    geom_line(aes(Year, nTeams, linetype = "Trend line")) +
    ggtitle("Number of NBA Teams by Year") +
    geom_hline(aes(yintercept = mean(Team_Player$nTeams), linetype = "Average line"),
               col = "red",
               alpha = 0.5) +
    scale_x_continuous(breaks = seq(1950, 2017, 10)) +
    scale_linetype_manual(name = "", values = c(2, 1), guide = guide_legend(reverse = TRUE)) +
    ylab("Number of Teams") +
    theme(legend.position="bottom")


  • The number of NBA teams has grown from 17 in 1950 to 30 in 2017.
  • Lowest number of teams occur during 1956-1961 seasons, with only 8 teams competes.
  • The highest number of teams occur from 2005 until present with 30 teams.




Number of NBA Players


We can plot number of NBA players year by year based on the same table.

Team_Player %>%
    ggplot(aes(Year, nPlayers, fill=nPlayers)) +
    geom_bar(stat = "identity") +
    ggtitle("Number of NBA Players by Year") +
    geom_hline(aes(yintercept = mean(Team_Player$nPlayers), linetype = "Average line"),
               col = "red",
               alpha = 0.5) +
    scale_fill_gradient(low = "green", high = "red") +
    scale_x_continuous(breaks = seq(1950, 2017, 10)) +
    scale_linetype_manual(name = "", values = 2) +
    ylab("Number of Players") +
    theme(legend.position="bottom")


  • The number of NBA players has grown more than doubled, from 219 in 1950 to 481 in 2017.
  • The average number of players compete in NBA regular season is 294.82
  • The least number of players competes in a season is 91 players during 1955-1956 & 1958-1959 season.
  • The most number of players competes in a season is 489 players in 2014-2015 season.




Number of games for each team in a season


Almost all NBA fans know that there are 82 games to play in a season. However, I’d like to see the past history, I wonder if it’s has grown over time, too.

Team_Player %>% 
    ggplot(aes(Year, nGames, fill=nGames)) +
    ggtitle("Number of Games in a Season") +
    geom_bar(stat = "identity") +
    scale_fill_gradient(low = "green", high = "red") +
    scale_x_continuous(breaks = seq(1950, 2017, 10)) +
    ylab("Number of Games") +
    theme(legend.position="bottom")


Since 1967–68 season, NBA expands its regular season to 82 games per team, where it still stands to this date. Except for these notable occurrences:

  • 1998–99 NBA season: number of games is 50, due to a lockout
  • 2011-12 NBA season: number of games is 66, due to another lockout




Position Ratio in the NBA


My next question, is the position ratio always evenly distributed?

NBA %>%
    ggplot(aes(Year, group=Pos, color = Pos, fill = Pos)) +
    geom_density(alpha = 0.5, position = "fill") +
    ggtitle("Position Ratio by Year") +
    scale_color_manual("Pos", values = PosColorCode) +
    scale_fill_manual("Pos", values = PosColorCode) +
    scale_x_continuous(breaks = seq(1950, 2017, 10)) +
    theme(legend.position="bottom")




Height Distribution


We all know basketball players are tall, naturally, I want to know how tall most of them are, and how small we are (average height person) compared to their standard.

HeightMean <- mean(NBA$Height)
HeightSD <- sd(NBA$Height)
NBA %>% ggplot(aes(Height, fill=TRUE)) +
    geom_density() +
    scale_x_continuous(breaks = seq(160, 240, 10)) +
    geom_vline(aes(xintercept = HeightMean, linetype = "Average height of NBA players"),
               col = "red",
               alpha = 0.8) +
    geom_vline(aes(xintercept = 177, linetype = "Average height of American male"),
               col = "blue",
               alpha = 0.8) +
    geom_vline(xintercept = c(seq(HeightMean, 240, HeightSD), seq(HeightMean, 160, -HeightSD)),
               col = "blue",
               alpha = 0.3,
               linetype = 5) +
    scale_linetype_manual(name = "", values = c(1, 1)) +
    guides(fill=FALSE) +
    theme(legend.position="bottom")


  • Average height of NBA players is: 199.6 cm, with standard deviation: 9.3.
  • Average height of American men is 177 cm (source) is shorter by more than two standard deviation away from average NBA players.


Now let’s see the groundcrawlers and the skyscrapers in the NBA.


NBA %>%
    group_by(Height, Player) %>%
    summarise(Pos = getmode(Position),
              YearActive = paste(mean(YearStart), "-", mean(YearEnd)),
              Team = getmode(Tm),
              Games = sum(G),
              PPG = round(sum(PTS)/sum(G), 2)) %>%
    arrange(Height) %>%
    head() %>%
    kable(escape = FALSE, align='c', caption = "Shortest Players") %>%
    kable_styling("striped", full_width = T) %>%
    column_spec(2, bold = T) %>%
    column_spec(1, bold = T, color = "white", background = "#777777")
Shortest Players
Height Player Pos YearActive Team Games PPG
160 Muggsy Bogues PG 1988 - 2001 CHH 889 7.71
165 Earl Boykins PG 1999 - 2012 DEN 652 8.88
168 Spud Webb PG 1986 - 1998 ATL 814 9.92
170 Greg Grant PG 1990 - 1996 PHI 274 2.80
170 Keith Jennings PG 1993 - 1995 GSW 164 6.65
170 Monte Towe PG 1976 - 1977 DEN 51 2.55


NBA %>%
    group_by(Height, Player) %>%
    summarise(Pos = getmode(Position),
              YearActive = paste(mean(YearStart), "-", mean(YearEnd)),
              Team = getmode(Tm),
              Games = sum(G),
              PPG = round(sum(PTS)/sum(G), 2)) %>%
    arrange(desc(Height)) %>%
    head(n=8) %>%
    kable(escape = FALSE, align='c', caption = "Tallest Players") %>%
    kable_styling("striped", full_width = T) %>%
    column_spec(2, bold = T) %>%
    column_spec(1, bold = T, color = "white", background = "#777777")
Tallest Players
Height Player Pos YearActive Team Games PPG
231 Gheorghe Muresan C 1994 - 2000 WSB 307 9.84
231 Manute Bol C 1986 - 1995 WSB 624 2.56
229 Shawn Bradley C 1994 - 2005 DAL 832 8.12
229 Yao Ming C 2003 - 2011 HOU 486 19.03
226 Chuck Nevitt C 1983 - 1994 HOU 155 1.62
226 Pavel Podkolzin C 2005 - 2006 DAL 6 0.67
226 Sim Bhullar C 2015 - 2015 SAC 3 0.67
226 Slavko Vranes C 2004 - 2004 POR 1 0.00



Height Comparison


Next, I’d like to compare them side-by-side, from Center to Point Guard, and find their averages and ranges.

NBA %>%
    group_by(Pos) %>%
    summarise(MinHeight = min(Height),
              MaxHeight = max(Height),
              MedianHeight = median(Height),
              ModeHeight = getmode(Height),
              MeanHeight = round(mean(`Height`), 2)) %>%
    mutate(Pos = cell_spec(Pos,
                            color = "white",
                            align = "c",
                            background = factor(Pos, c("C", "PF", "SF", "SG", "PG"),
                                                PosColorCode))) %>%
    kable(escape = FALSE, align='c', caption = "Height: Averages and Range by Position") %>%
    kable_styling("striped", full_width = T)
Height: Averages and Range by Position
Pos MinHeight MaxHeight MedianHeight ModeHeight MeanHeight
C 196 231 211 211 210.61
PF 185 224 206 206 204.91
SF 178 213 201 201 200.48
SG 165 211 196 196 194.03
PG 160 211 188 190 187.14


Violin plot not only gives us the averages and ranges, but it also gives us the distribution.

NBA %>%
  ggplot(aes(Pos, Height, color=Pos)) +
  geom_violin() +
  ggtitle("Height distribution by position") +
  stat_summary(fun.y=mean, geom="point", shape=8, size=6) +
  geom_point() +
  geom_hline(aes(yintercept = mean(NBA$Height, na.rm=T), linetype = "Average NBA players"),
             col = "red",
             alpha = 0.5) +
  geom_hline(aes(yintercept = 177, linetype = "Average American male"),
             col = "blue",
             alpha = 0.5) +
  scale_color_manual("Pos", values = PosColorCode) +
  scale_linetype_manual(name = "", values = c(1, 1)) +
  theme(legend.position="bottom")




Height by Years


Exploring the height of NBA players would not feel complete without taking a look at it from the chronological perspective. This might not produce significant insight, but I just can’t resist seeing the plot.

HeightYear <- NBA %>%
    group_by(Year) %>%
    summarise(meanHeight = round(mean(Height, na.rm = T), 1))
NBA %>%
    group_by(Year, Pos) %>%
    summarise(meanHeight = round(mean(Height, na.rm = T), 1)) %>%
    ggplot() +
    geom_line(aes(Year, meanHeight, group=Pos, color=Pos), size = 1.2, alpha = 1) +
    geom_line(aes(Year, meanHeight, linetype = "Average line"),
              data = HeightYear, color = "black", size = 0.8, alpha = 0.5) +
    ggtitle("Height by position by Year") +
    scale_x_continuous(breaks = seq(1950, 2017, 10)) +
    scale_color_manual("Pos", values = PosColorCode) +
    scale_linetype_manual(name = "", values = c(3)) +
    ylab("Height") +
    guides(group=FALSE) +
    theme(legend.position="bottom")




BMI Distribution


Next, let’s explore the BMI (Body Mass Index) of the players.

BMI <- NBA %>%
    group_by(Player) %>%
    filter(!is.na(BMI)) %>%
    mutate(BMIGroup = ifelse(BMI < 18.5, "Underweight",
                               ifelse(BMI >= 18.5 & BMI < 25, "Healthy weight",
                                      ifelse(BMI >= 25 & BMI < 30, "Overweight",
                                             "Obese Class I")))) %>%
    summarise(Pos = getmode(Position),
              Height = getmode(Height),
              Weight = getmode(Weight),
              BMI = getmode(BMI),
              BMIGroup = getmode(BMIGroup),
              Games = sum(G),
              PPG = round(sum(PTS)/Games, 1))
shade <- data.frame(xstart = c(15, 25, 30), xend = c(18.5, 30, 33), col = c("#F00", "#0F0", "#00F"))
BMIclass <- data.frame(X = c(15.7, 20, 26.5, 31), Y = 0.27, label = c("Underweight", "Healthy weight", "Overweight", "Obese"))
BMIMean <- mean(BMI$BMI)
BMISD <- sd(BMI$BMI)
BMI %>% ggplot() +
    geom_rect(data = shade,
              aes(xmin = xstart, xmax = xend, ymin = 0, ymax = Inf, fill = col),
              alpha = 0.3) +
    geom_density(aes(BMI, fill=TRUE)) +
    scale_x_continuous(breaks = seq(15, 33, 1)) +
    geom_vline(aes(xintercept = BMIMean, linetype = "Average line"),
               col = "black",
               alpha = 0.8) +
    geom_vline(xintercept = c(seq(BMIMean, 33, BMISD), seq(BMIMean, 15, -BMISD)),
               col = "blue",
               alpha = 0.3,
               linetype = 5) +
    geom_text(data = BMIclass,
              mapping = aes(x = X, y = Y, label = label),
              size = 3,
              vjust = 0,
              hjust = 0,
              color = "forestgreen") +
    scale_linetype_manual(name = "", values = c(1, 1, 1, 1)) +
    guides(fill=FALSE) +
    theme(legend.position="bottom")


    BMI category based on WHO standard BMI classification.

  • Average BMI: 24.1, with standard deviation: 1.7
  • As expected, most players, 73.3% are in the “Healthy Weight” category.
  • 26.2% players are in “Overweight” category, which is still common in athletes.
  • 15 players(0.4%) are in “Obese Class I” category, Shaquille O’Neal is one of them.
  • Only 1 player falls under the “Underweight” category. As one of the tallest player in the NBA (231 cm), he also the skinniest player in the league history with BMI only 17.1, his name is Manute Bol


Now I’m interested to see the real BIG guys in the NBA.


BMI %>%
    arrange(desc(BMI)) %>%
    filter(BMIGroup == "Obese Class I") %>%
    select(BMI, everything()) %>%
    kable(escape = FALSE, align='c', caption = "Highest BMI (All in Obese Class I group)") %>%
    kable_styling("striped", full_width = T) %>%
    column_spec(2, bold = T) %>%
    column_spec(1, bold = T, color = "white", background = "#777777") %>%
    scroll_box(width = "100%", height = "300px")
Highest BMI (All in Obese Class I group)
BMI Player Pos Height Weight BMIGroup Games PPG
31.91 Sim Bhullar C 226 163 Obese Class I 3 0.7
31.56 Thomas Hamilton C 218 150 Obese Class I 33 3.2
31.51 Shaquille O'Neal C 216 147 Obese Class I 1207 23.7
31.45 Dexter Pittman C 211 140 Obese Class I 50 2.3
31.30 Robert Traylor C 203 129 Obese Class I 438 4.8
31.22 Nikola Pekovic C 211 139 Obese Class I 271 12.6
31.11 Jahidi White C 206 132 Obese Class I 334 5.9
31.00 Garret Siler C 211 138 Obese Class I 21 2.1
30.87 Glen Davis PF 206 131 Obese Class I 514 8.0
30.40 Kevin Seraphin C 206 129 Obese Class I 423 5.9
30.33 Elton Brand PF 203 125 Obese Class I 1058 15.9
30.33 Mike Sweetney C 203 125 Obese Class I 233 6.5
30.28 Al Jefferson C 208 131 Obese Class I 879 16.0
30.20 DeJuan Blair PF 201 122 Obese Class I 424 6.8
30.09 Garth Joseph C 218 143 Obese Class I 4 0.5



Chronological Size Growth


Inspired by Hans Rosling’s animated bubble chart from Gapminder, I challenged myself to see if I can make one myself with this dataset. Weight in the x-axis, height in the y-axis, size for BMI, and color represent their position in the field. Don’t forget to click “Play” to see them grows year-by-year.players are in “Overweight” category, which is still common in athletes.

pBubble <- NBA %>%
    filter(!is.na(BMI), !is.na(Weight), !is.na(Height)) %>%
    plot_ly(x = ~Weight,
            y = ~Height,
            size = ~BMI,
            color = ~Pos,
            colors = PosColorCode,
            frame = ~Year,
            text = ~Player, 
            hoverinfo = "text",
            type = 'scatter',
            mode = 'markers') %>% 
    animation_opts(1000, easing = "elastic",
                   redraw = FALSE) %>% 
    animation_button(x = 1,
                     xanchor = "right",
                     y = 0,
                     yanchor = "bottom") %>%
    animation_slider(currentvalue = list(prefix = "Year: ", font = list(color="red")))
htmlwidgets::saveWidget(as.widget(pBubble), "pBubble.html")

Note: If this animated plot (which is the most awesome plot of all), does not appear in your browser, that means I still have technical difficulty with attaching the plot with the ‘iframe’ method. Will come back to this later when the solution emerge



Ability by Position Comparison


Since we now know that height is strongly correlated with their position in the field, naturally I’m interested to see how their abilities differ based on their roles.

To compare multiple variables with different scales, I use normalized (scaled) data I created in part 1: Data Preparation Stage of this series. This makes “0” in the scale is the average of each variable, anything below “0” tells us that the group performance in that particular category is below average, and vice versa.

RadarHeight <- NBA_Scaled %>%
    group_by(Pos) %>%
    summarise(Shoot2P = mean(X2P., na.rm=T),
              Shoot3P = mean(X3P., na.rm=T),
              ShootFT = mean(FT., na.rm=T),
              OffensiveRB = mean(ORB, na.rm=T),
              Assist = mean(AST, na.rm=T),
              DefensiveRB = mean(DRB, na.rm=T),
              Steal = mean(STL, na.rm=T),
              Block = mean(BLK, na.rm=T)) %>%
    select(-Pos)
pRadar <- plot_ly(type = 'scatterpolar',
        fill = 'toself',
        mode = 'lines') %>%
    add_trace(r = as.numeric(as.vector(RadarHeight[1,])),
              theta = as.character(as.vector(colnames(RadarHeight))),
              name = 'C',
              fillcolor = "#FF0000",
              opacity = 0.5) %>%
    add_trace(r = as.numeric(as.vector(RadarHeight[2,])),
              theta = as.character(as.vector(colnames(RadarHeight))),
              name = 'PF',
              fillcolor = "#FFA500",
              opacity = 0.5) %>%
    add_trace(r = as.numeric(as.vector(RadarHeight[3,])),
              theta = as.character(as.vector(colnames(RadarHeight))),
              name = 'SF',
              fillcolor = "#DDDD00",
              opacity = 0.5) %>%
    add_trace(r = as.numeric(as.vector(RadarHeight[4,])),
              theta = as.character(as.vector(colnames(RadarHeight))),
              name = 'SG',
              fillcolor = "#0000FF",
              opacity = 0.5) %>%
    add_trace(r = as.numeric(as.vector(RadarHeight[5,])),
              theta = as.character(as.vector(colnames(RadarHeight))),
              name = 'PG',
              fillcolor = "#32CD32",
              opacity = 0.5) %>%
    layout(polar = list(
        radialaxis = list(
        visible = T,
        range = c(-1, 1))))
RadarHeight

click for interactive plotly graph


How to interpret the plot:

Click on the ‘position legend’ in the upper right of the graph to select/unselect each position, double-click to isolate the trace. You can compare them in any way as you desire.

So, as we can see, the graph does not just confirm the well-known fact that the tall guys (Centers) are the best shot-blockers and the short guys (Point Guards) are the best at passing the ball, we can also measure how far better they perform these task.




Annual Age Range


Same like with height, it would be compelling to see the age range in the NBA year-by-year.

AgeNBA <- NBA %>%
    group_by(Year) %>%
    summarise(Average = round(mean(Age, na.rm = T), 2),
              Max = max(Age, na.rm = T),
              Min = min(Age, na.rm = T))
AgeNBA %>% kable(align = "c", caption = "Age: Average and Range by Year") %>%
    kable_styling("striped", full_width = T) %>%
    column_spec(1, bold = T) %>%
    scroll_box(width = "100%", height = "300px")
Age: Average and Range by Year
Year Average Max Min
1950 26.06 36 20
1951 26.26 33 22
1952 26.07 34 21
1953 25.99 35 21
1954 25.81 38 19
1955 26.10 35 21
1956 25.83 34 20
1957 25.87 35 21
1958 26.21 35 22
1959 25.96 33 21
1960 26.06 34 21
1961 26.11 34 22
1962 25.59 34 22
1963 25.83 34 21
1964 26.02 35 21
1965 25.76 33 21
1966 26.19 34 21
1967 25.89 35 21
1968 25.96 36 21
1969 26.08 36 22
1970 26.45 41 21
1971 26.05 37 21
1972 26.17 38 21
1973 26.23 39 21
1974 26.31 37 21
1975 26.35 37 19
1976 26.27 35 18
1977 25.87 36 19
1978 25.88 37 20
1979 25.81 35 21
1980 26.16 36 19
1981 25.87 35 20
1982 26.01 36 20
1983 26.12 37 21
1984 26.12 38 21
1985 26.22 37 21
1986 26.43 38 20
1987 26.36 39 20
1988 26.55 40 21
1989 26.70 41 21
1990 26.72 39 20
1991 26.79 37 21
1992 26.78 38 21
1993 27.00 39 20
1994 27.05 40 20
1995 27.34 41 20
1996 27.39 42 19
1997 27.51 43 18
1998 27.55 40 18
1999 27.51 40 18
2000 27.66 39 19
2001 27.77 39 19
2002 27.22 39 19
2003 27.12 40 19
2004 27.13 41 18
2005 27.05 42 18
2006 26.55 39 18
2007 26.48 44 19
2008 26.84 41 19
2009 26.63 42 19
2010 26.69 39 19
2011 26.71 38 19
2012 26.62 39 19
2013 26.55 40 19
2014 26.44 39 19
2015 26.52 38 19
2016 26.71 39 19
2017 26.38 40 19


And here’s the plot…

AgeNBA %>% ggplot(aes(Year)) +
    geom_line(aes(y = Max, linetype = "Oldest"), color = "red", alpha = 0.5) +
    geom_line(aes(y = Average, linetype = "Average"), color = "black") +
    geom_line(aes(y = Min, linetype = "Youngest"), color = "blue", alpha = 0.5) +
    ggtitle("Age Range by Year") +
    scale_x_continuous(breaks = seq(1950, 2017, 10)) +
    scale_linetype_manual(name = "", values = c(1, 1, 1)) +
    guides(group=FALSE) +
    theme(legend.position="bottom")



Player Distribution by Generation


NBA has been around since 1950, this suggests that many generations have passed since the first NBA season. Would it be nice to see which generation dominated the league each year?

NBA %>%
    group_by(Year, Player) %>%
    filter(!is.na(Born)) %>%
    mutate(Generation = ifelse(Born <= 1921, "The Depression Era",
                          ifelse(Born %in% 1922:1927, "World War II",
                            ifelse(Born %in% 1928:1945, "Post-War Cohort",
                              ifelse(Born %in% 1946:1954, "Baby Boomers",
                                ifelse(Born %in% 1955:1965, "Generation Jones",
                                  ifelse(Born %in% 1966:1976, "Generation X",
                                    ifelse(Born %in% 1977:1995, "Millennials",
                                      "Generation Z")))))))) %>%
    summarise(Generation = getmode(Generation)) %>%
    arrange(Year, Player) %>%
    mutate(Generation = factor(Generation, levels = c("The Depression Era",
                                                     "World War II",
                                                     "Post-War Cohort",
                                                     "Baby Boomers",
                                                     "Generation Jones",
                                                     "Generation X",
                                                     "Millennials",
                                                     "Generation Z"))) %>%
    ggplot(aes(Year, y=..count.., colour=Generation, fill=Generation)) +
    geom_density(alpha=0.55) +
    ggtitle("Generation Count by Years") +
    ylab("Count") +
    scale_x_continuous(breaks = seq(1950, 2017, 10)) +
    theme(legend.position='bottom')

Generation classification based on WJS marketing research


And here’s the oldest generation of the NBA…

Earliest born:

NBA %>%
    group_by(Player, Born) %>%
    summarise(Pos = getmode(Position),
              ActiveYears = paste(getmode(YearStart), "-", getmode(YearEnd)),
              NBASeasons = n_distinct(Year),
              Games = sum(G),
              StartingAge = min(Age),
              FinalAge = max(Age),
              PPG = round(sum(PTS)/sum(G), 1)) %>%
    arrange(Born) %>%
    select(Born, everything()) %>%
    head(9) %>%
    kable(escape = F, align = "c", caption = "Oldest NBA Players") %>%
    column_spec(1, bold = T, color = "white", background = "#777777") %>%
    column_spec(2, bold = T) %>%
    kable_styling("striped", full_width = T)
Oldest NBA Players
Born Player Pos ActiveYears NBASeasons Games StartingAge FinalAge PPG
1914 Charley Shipp SF 1950 - 1950 1 23 36 36 4.7
1915 Chick Reiser SF 1948 - 1950 1 67 35 35 9.0
1916 Mike Novak C 1949 - 1954 2 65 34 38 1.5
1917 Dick Schulz SF 1947 - 1950 1 50 33 33 4.2
1918 Al Cervi PG 1950 - 1953 4 202 32 35 7.9
1918 Bob Carpenter PF 1950 - 1951 2 122 32 33 7.7
1918 Buddy Jeannette SG 1948 - 1950 1 37 32 32 5.2
1918 Ed Sadowski C 1947 - 1950 1 69 32 32 12.6
1918 Gene Englund PF 1950 - 1950 1 46 32 32 7.8


Notice that in the ActiveYears column some players started their career before 1950 and the number of seasons in the NBASeasons column summed up after 1949. This because the NBA formally formed in 1949 by the merger of two rival organizations, the National Basketball League (founded 1937) and the Basketball Association of America (founded 1946). You can dig more into NBA history here.



Career Length in the NBA


My next question is how long do NBA careers last?

This can be answered by summing up the number of seasons each player participated in.

SeasonNBA <- NBA %>%
    group_by(Player, Born) %>%
    summarise(NBASeasons = n_distinct(Year))
SeasonNBA %>%
    ggplot(aes(NBASeasons)) +
    geom_bar(aes(fill = ..count..)) +
    ggtitle("Player Distribution by Number of Seasons") +
    geom_vline(aes(xintercept = mean(NBASeasons, na.rm = T), linetype = "Average line"),
               col = "black",
               alpha = 0.5) +
    scale_fill_gradient(low = "green", high = "red") +
    scale_linetype_manual(name = "", values = 2) +
    xlab("Number of NBA Seasons") +
    ylab("Count") +
    theme(legend.position="bottom")


  • 1068 players, or more than a quarter (27.2%) of all players in the NBA history didn’t survive after their first season.
  • Average career length in the NBA is 5.1 years.
  • Less than half of all NBA players (41.7%) participate in at 5 seasons or more.
  • And only 1 out of 5 players (19.9%), can survive for 10 seasons or more.


Now let’s find out who has been in the NBA the longest…


NBA %>%
    group_by(Player, Born) %>%
    summarise(Pos = getmode(Position),
              Team = getmode(Tm),
              ActiveYears = paste(min(Year), "-", max(Year)),
              RookieAge = min(Age),
              RetirementAge = max(Age),
              Seasons = n_distinct(Year),
              RpG = round(sum(TRB)/sum(G), 1),
              PpG = round(sum(PTS)/sum(G), 1)) %>%
    arrange(desc(Seasons), ActiveYears) %>%
    select(Seasons, everything(), -Born) %>%
    head(17) %>%
    kable(escape = F, align = "c", caption = "Longest Career in the NBA") %>%
    column_spec(1, bold = T, color = "white", background = "#777777") %>%
    column_spec(2, bold = T) %>%
    kable_styling("striped", full_width = T)
Longest Career in the NBA
Seasons Player Pos Team ActiveYears RookieAge RetirementAge RpG PpG
21 Robert Parish C BOS 1977 - 1997 23 43 9.1 14.5
21 Kevin Willis PF ATL 1985 - 2007 22 44 8.4 12.1
21 Kevin Garnett PF MIN 1996 - 2016 19 39 10.0 17.8
20 Kareem Abdul-Jabbar C LAL 1970 - 1989 22 41 11.2 24.6
20 Kobe Bryant SG LAL 1997 - 2016 18 37 5.2 25.0
19 Moses Malone C HOU 1977 - 1995 21 39 12.2 20.6
19 James Edwards C PHO 1978 - 1996 22 40 5.1 12.7
19 John Stockton PG UTA 1985 - 2003 22 40 2.7 13.1
19 Charles Oakley PF NYK 1986 - 2004 22 40 9.5 9.7
19 Karl Malone PF UTA 1986 - 2004 22 40 10.1 25.0
19 Shaquille O'Neal C LAL 1993 - 2011 20 38 10.9 23.7
19 Jason Kidd PG DAL 1995 - 2013 21 39 6.3 12.6
19 Juwan Howard PF WAS 1995 - 2013 21 39 6.1 13.4
19 Tim Duncan C SAS 1998 - 2016 21 39 10.8 19.0
19 Dirk Nowitzki PF DAL 1999 - 2017 20 38 7.8 21.7
19 Paul Pierce SF BOS 1999 - 2017 21 39 5.6 19.7
19 Vince Carter SG TOR 1999 - 2017 22 40 4.6 18.2

The table above listed the players who have a career in the NBA for 19 years or more.

    Few things I noticed in this group are:

  • Most of them are well-known great players, 94.1 % of them have double digits overall points per game, which only about 1 in 5 NBA players can accomplish this feat.
  • What interesting is the Centers are Power Forwards (translated: tall guys) dominated the group with 35.3 % each, which makes up 70.6 % in total.



The Rookies vs the Retirees


Now we take a closer look at the age when they started and finished their career in the NBA.

RRAge <- NBA %>%
    group_by(Player) %>%
    filter(!is.na(Age)) %>%
    summarise(RookieAge = min(Age),
              RetirementAge = max(Age)) %>%
    gather(Parameter, Value, RookieAge:RetirementAge)
RRAge %>%
    ggplot(aes(x=as.factor(Value),fill=Parameter)) + 
    geom_bar(data=filter(RRAge, Parameter == "RetirementAge")) + 
    geom_bar(data=filter(RRAge, Parameter == "RookieAge"), aes(y = ..count.. * (-1))) +
    ggtitle("Rookie Age vs, Retirement Age in the NBA") +
    xlab("Age") +
    ylab("Count") +
    scale_y_continuous(breaks=seq(-1500,1500,500),labels=abs(seq(-1500,1500,500))) +
    scale_fill_brewer(palette = "Set1") + 
    coord_flip() +
    theme(legend.position="bottom")

RRAge <- RRAge %>%
    spread(Parameter, Value)


  • Most of the players (68.4%) start their career in NBA between age 22-24.
  • 2.7% of NBA players starting their career before their 20 years old birthday.
  • Only 50 players (1.3%) starting their career in their 30’s.


NBA %>%
    filter(YearStart >= 1950) %>%
    group_by(Player) %>%
    summarise(RookieAge = min(Age),
              Pos = getmode(Position),
              Team = getmode(Tm),
              RetirementAge = max(Age),
              NBAYears = paste(min(Year), "-", max(Year)),
              NBASeasons = n_distinct(Year),
              Games = sum(G),
              PpG = round(sum(PTS)/Games,2)) %>%
    arrange(desc(RookieAge)) %>%
    select(RookieAge, everything()) %>%
    head(10) %>%
    kable(escape = F, align = "c", caption = "Oldest NBA Rookies") %>%
    column_spec(2, bold = T) %>%
    column_spec(1, bold = T, color = "white", background = "#777777") %>%
    kable_styling("striped", full_width = T)
Oldest NBA Rookies
RookieAge Player Pos Team RetirementAge NBAYears NBASeasons Games PpG
36 Charley Shipp SF WAT 36 1950 - 1950 1 23 4.65
35 Pablo Prigioni PG NYK 38 2013 - 2016 4 270 3.50
33 Steve Jones SG POR 33 1976 - 1976 1 64 6.47
32 Al Cervi PG SYR 35 1950 - 1953 4 202 7.88
32 Bob Carpenter PF FTW 33 1950 - 1951 2 122 7.68
32 Byron Beck PF DEN 32 1977 - 1977 1 53 4.72
32 Gene Englund PF BOS 32 1950 - 1950 1 46 7.83
32 Louie Dampier PG SAS 34 1977 - 1979 3 232 6.69
32 Marcelo Huertas PG LAL 33 2016 - 2017 2 76 3.95
32 Mel Daniels C NYN 32 1977 - 1977 1 11 3.55

Bear in mind that these are rookies as in the NBA. Some of them might already have years of experience in another professional basketball league.


NBA %>%
    select(Age, Player, Position, Year, Tm, G, GS, PTS) %>%
    mutate(PPG = round(PTS/G, 2)) %>%
    arrange(desc(Age)) %>%
    head(n=10) %>%
    kable(escape = F, align = 'c', caption = "Oldest NBA Players") %>%
    kable_styling("striped", full_width = T) %>%
    column_spec(2, bold = T) %>%
    column_spec(1, bold = T, color = "white", background = "#777777")
Oldest NBA Players
Age Player Position Year Tm G GS PTS PPG
44 Kevin Willis PF 2007 DAL 5 0 12 2.40
43 Robert Parish C 1997 CHI 43 3 161 3.74
42 Robert Parish C 1996 CHH 74 34 290 3.92
42 Kevin Willis C 2005 ATL 29 5 87 3.00
42 Dikembe Mutombo C 2009 HOU 9 2 16 1.78
41 Bob Cousy PG 1970 CIN 7 NA 5 0.71
41 Kareem Abdul-Jabbar C 1989 LAL 74 74 748 10.11
41 Robert Parish C 1995 CHH 81 4 389 4.80
41 Kevin Willis C 2004 SAS 48 0 164 3.42
41 Dikembe Mutombo C 2008 HOU 39 25 118 3.03



Age vs Chance and Ability


Lastly, I’d like to know how age correlates with their chances and ability in the field.

AgeCor <- NBA_Scaled %>%
    group_by(Age) %>%
    summarise(Games = mean(G, na.rm = T),
              GameStarted = mean(GS, na.rm = T),
              MinutesPlayed = mean(MP, na.rm = T),
              Shooting = mean(TS., na.rm = T),
              ShootAttemps = mean(FGA, na.rm = T),
              Rebound = mean(RpG, na.rm = T),
              Assist = mean(ApG, na.rm = T),
              Steal = mean(SpG, na.rm = T),
              Block = mean(BpG, na.rm = T),
              Turnover = mean(TpG, na.rm = T),
              Points = mean(PpG, na.rm = T)) %>%
    gather(variable, value, -Age) %>%
    filter(!is.na(Age)) %>%
    mutate(variable = factor(variable, levels = c("Games", "GameStarted", "MinutesPlayed", "Shooting", "ShootAttemps", "Rebound", "Assist", "Steal", "Block", "Turnover", "Points")))
AgeCor %>%
    ggplot(aes(Age, variable, fill=value)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0, 0)) +
    scale_fill_gradientn(colors = brewer.pal(9, "Reds")) +
    theme(panel.grid = element_blank()) +
    scale_y_discrete(limits = rev(levels(AgeCor$variable))) +
    ggtitle("Age, Chance and Ability") +
    ylab("Parameter") +
    theme(legend.position="bottom")


It turns out that their peak around 25-30 years old, with some exception in shooting (TS%) and blocks. That is because, as we have seen in the previous table, only a handful of players still active after their 40 and 90% of them are post players who are shot-blockers with a good FG%.




End of Session


---
title: 'Visualizing NBA Seasons Part 1: Players'
author: "Nunno Nugroho"
output:
  html_notebook:
    css: style.css
    theme: paper
    code_folding: hide
  html_document:
    css: style.css
    theme: paper
    code_folding: hide
---

<br/>
<br/>

<div class="header">

<center><img src="images/NBALogoTransp.png" alt="drawing" width="200px" heigth="100px"/>

**Seasons 1950-2017**

<div class="dropdown">

<button class="dropbtn">Part 1: Players</button>

<div class="dropdown-content">

<a href="https://rpubs.com/ninjazzle/NBASeasons0">Introduction & Preparation</a>

<a href="https://rpubs.com/ninjazzle/NBASeasons2">Part 2: Points</a>

<a href="https://rpubs.com/ninjazzle/NBASeasons3">Part 3: Shootings (FG and FT)</a>

<a href="https://rpubs.com/ninjazzle/NBASeasons4">Part 4: Shootings (3-Points and Mixed)</a>

<a href="https://rpubs.com/ninjazzle/NBASeasons5">Part 5: Skills</a>

<a href="https://rpubs.com/ninjazzle/NBASeasons6">Part 6: Roles</a>

</div>

</div>

</center>

</div>

<br/>
<br/>

---

<br/>

# Setup Stage

<br/>

Loading necessary packages.

```{r setup, message=FALSE}
library(dplyr)
library(tidyr)
library(ggplot2)
library(RColorBrewer)
library(plotly)
library(kableExtra)
```

<br/>

Create additional functions

```{r}
# Mode average
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}
```

<br/>

#### Load the dataset

<br/>

Dataset taken from: https://www.kaggle.com/drgilermo/nba-players-stats/version/2

If you need to see the glossary and the data tidying process, please visit the first part [here](https://rpubs.com/ninjazzle/NBASeasons0).

Loading datasets...

```{r}
NBA <- read.csv("NBA_TidySet.csv")[,-c(1)]
NBA_Scaled <- read.csv("NBA_Scaled_TidySet.csv")
NBA$Pos <- factor(NBA$Pos, levels = c("C", "PF", "SF", "SG", "PG"))
NBA_Scaled$Pos <- factor(NBA_Scaled$Pos, levels = c("C", "PF", "SF", "SG", "PG"))
PosColorCode <- c("C"="#FF0000", "PF"="#FFA500", "SF"="#DDDD00" ,"SG"="#0000FF", "PG"="#32CD32")
```

<br/>

**Displaying raw tidy data table**

<div class='maintable'>

```{r fig.width=9}
NBA
```

<br/>

<div class="Fact">

<ul class="CustomList">

<li>Number of rows: `r nrow(NBA)`</li>
<li>Number of columns: `r ncol(NBA)`</li>
<li>Number of players: `r n_distinct(NBA$Player)`</li>
<li>Number of teams: `r n_distinct(NBA$Tm)`</li>
<li>File Size: `r format(object.size(NBA), units = "Kb")`</li>

</ul>

</div>

</div>

<br/>

---

<br/>

# Players

In this session, we start by exploring the players. The variables that will be enticing to explore include their positions, height, weight, born and age.

<br/>

### NBA Teams growth

<div class='Box'>

<br/>

We start by exploring how NBA teams and the number of players has grown over the years. The table will show the detailed numbers.

<div class="table">

```{r}
Team_Player <- NBA %>%
  group_by(Year) %>%
  summarise(nPlayers = n_distinct(Player),
            nTeams = n_distinct(Tm),
            nGames = max(G),
            Players_per_Team = round(nPlayers/nTeams, 2)) 

Team_Player %>%
    kable(escape = FALSE, align='c', caption = "Players, Teams and Games") %>%
    kable_styling("striped", full_width = T) %>%
    column_spec(1, bold = T) %>%
    scroll_box(width = "100%", height = "300px")
```

</div>

<br/>

Now we see how NBA teams have grown, with this simple plot.

```{r fig.width = 9, fig.height=5}
Team_Player %>%
    ggplot() +
    geom_line(aes(Year, nTeams, linetype = "Trend line")) +
    ggtitle("Number of NBA Teams by Year") +
    geom_hline(aes(yintercept = mean(Team_Player$nTeams), linetype = "Average line"),
               col = "red",
               alpha = 0.5) +
    scale_x_continuous(breaks = seq(1950, 2017, 10)) +
    scale_linetype_manual(name = "", values = c(2, 1), guide = guide_legend(reverse = TRUE)) +
    ylab("Number of Teams") +
    theme(legend.position="bottom")
```

<br/>

<div class="Fact">

<ul class="CustomList">

<li>The number of NBA teams has grown from 17 in 1950 to 30 in 2017.</li>
<li>Lowest number of teams occur during 1956-1961 seasons, with only 8 teams competes.</li>
<li>The highest number of teams occur from 2005 until present with 30 teams.</li>
    
</ul>

</div>

<br/>

</div>

<br/>
<br/>

### Number of NBA Players

<div class='Box'>

<br/>

We can plot number of NBA players year by year based on the same table.

```{r fig.width = 9, fig.height=5}
Team_Player %>%
    ggplot(aes(Year, nPlayers, fill=nPlayers)) +
    geom_bar(stat = "identity") +
    ggtitle("Number of NBA Players by Year") +
    geom_hline(aes(yintercept = mean(Team_Player$nPlayers), linetype = "Average line"),
               col = "red",
               alpha = 0.5) +
    scale_fill_gradient(low = "green", high = "red") +
    scale_x_continuous(breaks = seq(1950, 2017, 10)) +
    scale_linetype_manual(name = "", values = 2) +
    ylab("Number of Players") +
    theme(legend.position="bottom")
```

<br/>

<div class="Fact">

<ul class="CustomList">

<li>The number of NBA players has grown more than doubled, from 219 in 1950 to 481 in 2017.</li>
<li>The average number of players compete in NBA regular season is `r round(mean(Team_Player$nPlayers), 2)`</li>
<li>The least number of players competes in a season is `r min(Team_Player$nPlayers)` players during 1955-1956 & 1958-1959 season.</li>
<li>The most number of players competes in a season is `r max(Team_Player$nPlayers)` players in 2014-2015 season.</li>

</ul>

</div>

<br/>

</div>

<br/>
<br/>

### Number of games for each team in a season

<div class="Box">

<br/>

Almost all NBA fans know that there are 82 games to play in a season. However, I'd like to see the past history, I wonder if it's has grown over time, too.

```{r fig.width = 9, fig.height=5}
Team_Player %>% 
    ggplot(aes(Year, nGames, fill=nGames)) +
    ggtitle("Number of Games in a Season") +
    geom_bar(stat = "identity") +
    scale_fill_gradient(low = "green", high = "red") +
    scale_x_continuous(breaks = seq(1950, 2017, 10)) +
    ylab("Number of Games") +
    theme(legend.position="bottom")
```

<br/>

<div class="Fact">

Since 1967–68 season, NBA expands its regular season to 82 games per team, where it still stands to this date. Except for these notable occurrences:

<ul class="CustomList">

<li>1998–99 NBA season: number of games is 50, due to a [lockout](https://en.wikipedia.org/wiki/1998%E2%80%9399_NBA_lockout)</li>
<li>2011-12 NBA season: number of games is 66, due to another [lockout](https://en.wikipedia.org/wiki/2011_NBA_lockout)</li>

</ul>

</div>

<br/>

</div>

<br/>
<br/>

### Position Ratio in the NBA

<div class="Box">

<br/>

My next question, is the position ratio always evenly distributed?

```{r fig.width = 9}
NBA %>%
    ggplot(aes(Year, group=Pos, color = Pos, fill = Pos)) +
    geom_density(alpha = 0.5, position = "fill") +
    ggtitle("Position Ratio by Year") +
    scale_color_manual("Pos", values = PosColorCode) +
    scale_fill_manual("Pos", values = PosColorCode) +
    scale_x_continuous(breaks = seq(1950, 2017, 10)) +
    theme(legend.position="bottom")
```

<br/>

</div>

<br/>
<br/>

### Height Distribution

<div class="Box">

<br/>

We all know basketball players are tall, naturally, I want to know how tall most of them are, and how small we are (average height person) compared to their standard.

```{r fig.width = 9, fig.height=5}
HeightMean <- mean(NBA$Height)
HeightSD <- sd(NBA$Height)

NBA %>% ggplot(aes(Height, fill=TRUE)) +
    geom_density() +
    scale_x_continuous(breaks = seq(160, 240, 10)) +
    geom_vline(aes(xintercept = HeightMean, linetype = "Average height of NBA players"),
               col = "red",
               alpha = 0.8) +
    geom_vline(aes(xintercept = 177, linetype = "Average height of American male"),
               col = "blue",
               alpha = 0.8) +
    geom_vline(xintercept = c(seq(HeightMean, 240, HeightSD), seq(HeightMean, 160, -HeightSD)),
               col = "blue",
               alpha = 0.3,
               linetype = 5) +
    scale_linetype_manual(name = "", values = c(1, 1)) +
    guides(fill=FALSE) +
    theme(legend.position="bottom")
```

<br/>

<div class="Fact">

<ul class="CustomList">

<li>Average height of NBA players is: `r round(mean(NBA$Height, na.rm=T), 1)` cm, with standard deviation: `r round(sd(NBA$Height, na.rm=T), 1)`.</li>
<li>Average height of American men is 177 cm ([source](https://halls.md/average-height-men-height-weight/)) is shorter by more than two standard deviation away from average NBA players.</li>

</ul>

</div>

<br/>

Now let's see the groundcrawlers and the skyscrapers in the NBA.

<br/>

<div class="table">

```{r}
NBA %>%
    group_by(Height, Player) %>%
    summarise(Pos = getmode(Position),
              YearActive = paste(mean(YearStart), "-", mean(YearEnd)),
              Team = getmode(Tm),
              Games = sum(G),
              PPG = round(sum(PTS)/sum(G), 2)) %>%
    arrange(Height) %>%
    head() %>%
    kable(escape = FALSE, align='c', caption = "Shortest Players") %>%
    kable_styling("striped", full_width = T) %>%
    column_spec(2, bold = T) %>%
    column_spec(1, bold = T, color = "white", background = "#777777")
```

<br/>

```{r}
NBA %>%
    group_by(Height, Player) %>%
    summarise(Pos = getmode(Position),
              YearActive = paste(mean(YearStart), "-", mean(YearEnd)),
              Team = getmode(Tm),
              Games = sum(G),
              PPG = round(sum(PTS)/sum(G), 2)) %>%
    arrange(desc(Height)) %>%
    head(n=8) %>%
    kable(escape = FALSE, align='c', caption = "Tallest Players") %>%
    kable_styling("striped", full_width = T) %>%
    column_spec(2, bold = T) %>%
    column_spec(1, bold = T, color = "white", background = "#777777")
```

</div>

</div>

<br/>
<br/>

### Height Comparison

<div class="Box">

<br/>

<div class="table">

Next, I'd like to compare them side-by-side, from Center to Point Guard, and find their averages and ranges.

```{r}
NBA %>%
    group_by(Pos) %>%
    summarise(MinHeight = min(Height),
              MaxHeight = max(Height),
              MedianHeight = median(Height),
              ModeHeight = getmode(Height),
              MeanHeight = round(mean(`Height`), 2)) %>%
    mutate(Pos = cell_spec(Pos,
                            color = "white",
                            align = "c",
                            background = factor(Pos, c("C", "PF", "SF", "SG", "PG"),
                                                PosColorCode))) %>%
    kable(escape = FALSE, align='c', caption = "Height: Averages and Range by Position") %>%
    kable_styling("striped", full_width = T)
```

</div>

<br/>

Violin plot not only gives us the averages and ranges, but it also gives us the distribution.

```{r fig.width = 9, fig.height=5}
NBA %>%
  ggplot(aes(Pos, Height, color=Pos)) +
  geom_violin() +
  ggtitle("Height distribution by position") +
  stat_summary(fun.y=mean, geom="point", shape=8, size=6) +
  geom_point() +
  geom_hline(aes(yintercept = mean(NBA$Height, na.rm=T), linetype = "Average NBA players"),
             col = "red",
             alpha = 0.5) +
  geom_hline(aes(yintercept = 177, linetype = "Average American male"),
             col = "blue",
             alpha = 0.5) +
  scale_color_manual("Pos", values = PosColorCode) +
  scale_linetype_manual(name = "", values = c(1, 1)) +
  theme(legend.position="bottom")
```

<br/>

</div>

<br/>
<br/>

### Height by Years

<div class="Box">

<br/>

Exploring the height of NBA players would not feel complete without taking a look at it from the chronological perspective. This might not produce significant insight, but I just can't resist seeing the plot.

```{r fig.width = 9, fig.height=5}
HeightYear <- NBA %>%
    group_by(Year) %>%
    summarise(meanHeight = round(mean(Height, na.rm = T), 1))

NBA %>%
    group_by(Year, Pos) %>%
    summarise(meanHeight = round(mean(Height, na.rm = T), 1)) %>%
    ggplot() +
    geom_line(aes(Year, meanHeight, group=Pos, color=Pos), size = 1.2, alpha = 1) +
    geom_line(aes(Year, meanHeight, linetype = "Average line"),
              data = HeightYear, color = "black", size = 0.8, alpha = 0.5) +
    ggtitle("Height by position by Year") +
    scale_x_continuous(breaks = seq(1950, 2017, 10)) +
    scale_color_manual("Pos", values = PosColorCode) +
    scale_linetype_manual(name = "", values = c(3)) +
    ylab("Height") +
    guides(group=FALSE) +
    theme(legend.position="bottom")
```

<br/>

</div>

<br/>
<br/>

### BMI Distribution

<div class="Box">

<br/>

Next, let's explore the BMI (Body Mass Index) of the players.

```{r fig.width=9, fig.height=5}
BMI <- NBA %>%
    group_by(Player) %>%
    filter(!is.na(BMI)) %>%
    mutate(BMIGroup = ifelse(BMI < 18.5, "Underweight",
                               ifelse(BMI >= 18.5 & BMI < 25, "Healthy weight",
                                      ifelse(BMI >= 25 & BMI < 30, "Overweight",
                                             "Obese Class I")))) %>%
    summarise(Pos = getmode(Position),
              Height = getmode(Height),
              Weight = getmode(Weight),
              BMI = getmode(BMI),
              BMIGroup = getmode(BMIGroup),
              Games = sum(G),
              PPG = round(sum(PTS)/Games, 1))

shade <- data.frame(xstart = c(15, 25, 30), xend = c(18.5, 30, 33), col = c("#F00", "#0F0", "#00F"))
BMIclass <- data.frame(X = c(15.7, 20, 26.5, 31), Y = 0.27, label = c("Underweight", "Healthy weight", "Overweight", "Obese"))
BMIMean <- mean(BMI$BMI)
BMISD <- sd(BMI$BMI)

BMI %>% ggplot() +
    geom_rect(data = shade,
              aes(xmin = xstart, xmax = xend, ymin = 0, ymax = Inf, fill = col),
              alpha = 0.3) +
    geom_density(aes(BMI, fill=TRUE)) +
    scale_x_continuous(breaks = seq(15, 33, 1)) +
    geom_vline(aes(xintercept = BMIMean, linetype = "Average line"),
               col = "black",
               alpha = 0.8) +
    geom_vline(xintercept = c(seq(BMIMean, 33, BMISD), seq(BMIMean, 15, -BMISD)),
               col = "blue",
               alpha = 0.3,
               linetype = 5) +
    geom_text(data = BMIclass,
              mapping = aes(x = X, y = Y, label = label),
              size = 3,
              vjust = 0,
              hjust = 0,
              color = "forestgreen") +
    scale_linetype_manual(name = "", values = c(1, 1, 1, 1)) +
    guides(fill=FALSE) +
    theme(legend.position="bottom")
```

<br/>

<div class="Fact">

<ul class="CustomList">
BMI category based on [WHO standard BMI classification.](https://en.wikipedia.org/wiki/Body_mass_index#Categories)

<li>Average BMI: `r round(mean(BMI$BMI), 1)`, with standard deviation: `r round(sd(BMI$BMI),1)`</li>
<li>As expected, most players, `r round(mean(BMI$BMIGroup == "Healthy weight")*100, 1)`% are in the "Healthy Weight" category.</li>
<li>`r round(mean(BMI$BMIGroup == "Overweight")*100, 1)`% players are in "Overweight" category, which is still common in athletes.</li>
<li>`r nrow(BMI[BMI$BMIGroup == "Obese Class I",])` players(`r round(mean(BMI$BMIGroup == "Obese Class I")*100, 1)`%) are in "Obese Class I" category, Shaquille O'Neal is one of them.</li>
<li>Only 1 player falls under the "Underweight" category. As one of the tallest player in the NBA (231 cm), he also the skinniest player in the league history with BMI only 17.1, his name is [Manute Bol](https://2kmtcentral.com/img/players/18/generated/8045i1.png/cache-1509469726/8045i1.png)</li>

</ul>

</div>

<br/>

Now I'm interested to see the real BIG guys in the NBA.

<br/>

<div class="table">

```{r}
BMI %>%
    arrange(desc(BMI)) %>%
    filter(BMIGroup == "Obese Class I") %>%
    select(BMI, everything()) %>%
    kable(escape = FALSE, align='c', caption = "Highest BMI (All in Obese Class I group)") %>%
    kable_styling("striped", full_width = T) %>%
    column_spec(2, bold = T) %>%
    column_spec(1, bold = T, color = "white", background = "#777777") %>%
    scroll_box(width = "100%", height = "300px")
```

</div>

</div>

<br/>
<br/>

### Chronological Size Growth

<div class="Box">

<br/>

Inspired by Hans Rosling's animated bubble chart from [Gapminder](https://www.gapminder.org/tools/#$chart-type=bubbles), I challenged myself to see if I can make one myself with this dataset. Weight in the x-axis, height in the y-axis, size for BMI, and color represent their position in the field. Don't forget to click **"Play"** to see them grows year-by-year.players are in "Overweight" category, which is still common in athletes.

```{r warning=FALSE, plotly=TRUE}
pBubble <- NBA %>%
    filter(!is.na(BMI), !is.na(Weight), !is.na(Height)) %>%
    plot_ly(x = ~Weight,
            y = ~Height,
            size = ~BMI,
            color = ~Pos,
            colors = PosColorCode,
            frame = ~Year,
            text = ~Player, 
            hoverinfo = "text",
            type = 'scatter',
            mode = 'markers') %>% 
    animation_opts(1000, easing = "elastic",
                   redraw = FALSE) %>% 
    animation_button(x = 1,
                     xanchor = "right",
                     y = 0,
                     yanchor = "bottom") %>%
    animation_slider(currentvalue = list(prefix = "Year: ", font = list(color="red")))

htmlwidgets::saveWidget(as.widget(pBubble), "pBubble.html")
```

<iframe align = "center" width="800" height="600" frameborder="0" scrolling="no" src="pBubble.html"></iframe>

**Note:** *If this animated plot (which is the most awesome plot of all), does not appear in your browser, that means I still have technical difficulty with attaching the plot with the 'iframe' method. Will come back to this later when the solution emerge*

</div>

<br/>
<br/>

### Ability by Position Comparison

<div class="Box">

<br/>

Since we now know that height is strongly correlated with their position in the field, naturally I'm interested to see how their abilities differ based on their roles.

To compare multiple variables with different scales, I use normalized (scaled) data I created in [part 1: Data Preparation Stage]() of this series. This makes "0" in the scale is the average of each variable, anything below "0" tells us that the group performance in that particular category is below average, and vice versa.

```{r warning=FALSE, message=FALSE, plotly=TRUE}
RadarHeight <- NBA_Scaled %>%
    group_by(Pos) %>%
    summarise(Shoot2P = mean(X2P., na.rm=T),
              Shoot3P = mean(X3P., na.rm=T),
              ShootFT = mean(FT., na.rm=T),
              OffensiveRB = mean(ORB, na.rm=T),
              Assist = mean(AST, na.rm=T),
              DefensiveRB = mean(DRB, na.rm=T),
              Steal = mean(STL, na.rm=T),
              Block = mean(BLK, na.rm=T)) %>%
    select(-Pos)

pRadar <- plot_ly(type = 'scatterpolar',
        fill = 'toself',
        mode = 'lines') %>%
    add_trace(r = as.numeric(as.vector(RadarHeight[1,])),
              theta = as.character(as.vector(colnames(RadarHeight))),
              name = 'C',
              fillcolor = "#FF0000",
              opacity = 0.5) %>%
    add_trace(r = as.numeric(as.vector(RadarHeight[2,])),
              theta = as.character(as.vector(colnames(RadarHeight))),
              name = 'PF',
              fillcolor = "#FFA500",
              opacity = 0.5) %>%
    add_trace(r = as.numeric(as.vector(RadarHeight[3,])),
              theta = as.character(as.vector(colnames(RadarHeight))),
              name = 'SF',
              fillcolor = "#DDDD00",
              opacity = 0.5) %>%
    add_trace(r = as.numeric(as.vector(RadarHeight[4,])),
              theta = as.character(as.vector(colnames(RadarHeight))),
              name = 'SG',
              fillcolor = "#0000FF",
              opacity = 0.5) %>%
    add_trace(r = as.numeric(as.vector(RadarHeight[5,])),
              theta = as.character(as.vector(colnames(RadarHeight))),
              name = 'PG',
              fillcolor = "#32CD32",
              opacity = 0.5) %>%
    layout(polar = list(
        radialaxis = list(
        visible = T,
        range = c(-1, 1))))
```

<div><a href="https://plot.ly/~NunNoDeCabunic/1/?share_key=0fG8BrFQ60qIAsDREQbL4y" target="_blank" title="RadarHeight" style="display: block; text-align: center;"><img src="https://plot.ly/~NunNoDeCabunic/1.png?share_key=0fG8BrFQ60qIAsDREQbL4y" alt="RadarHeight" style="max-width: 100%;width: 600px;"  width="600" onerror="this.onerror=null;this.src='https://plot.ly/404.png';" /></a><script data-plotly="NunNoDeCabunic:1" sharekey-plotly="0fG8BrFQ60qIAsDREQbL4y" src="https://plot.ly/embed.js" async></script></div>
*click for interactive plotly graph*

<br/>

**How to interpret the plot:**

Click on the 'position legend' in the upper right of the graph to select/unselect each position, double-click to isolate the trace. You can compare them in any way as you desire.

So, as we can see, the graph does not just confirm the well-known fact that the tall guys (Centers) are the best shot-blockers and the short guys (Point Guards) are the best at passing the ball, we can also measure how far better they perform these task.

<br/>

</div>

<br/>
<br/>

### Annual Age Range

<div class="Box">

<br/>

<div class="table">

Same like with height, it would be compelling to see the age range in the NBA year-by-year.

```{r}
AgeNBA <- NBA %>%
    group_by(Year) %>%
    summarise(Average = round(mean(Age, na.rm = T), 2),
              Max = max(Age, na.rm = T),
              Min = min(Age, na.rm = T))

AgeNBA %>% kable(align = "c", caption = "Age: Average and Range by Year") %>%
    kable_styling("striped", full_width = T) %>%
    column_spec(1, bold = T) %>%
    scroll_box(width = "100%", height = "300px")
```
</div>

<br/>

And here's the plot...

```{r fig.width = 9, fig.height=5}
AgeNBA %>% ggplot(aes(Year)) +
    geom_line(aes(y = Max, linetype = "Oldest"), color = "red", alpha = 0.5) +
    geom_line(aes(y = Average, linetype = "Average"), color = "black") +
    geom_line(aes(y = Min, linetype = "Youngest"), color = "blue", alpha = 0.5) +
    ggtitle("Age Range by Year") +
    scale_x_continuous(breaks = seq(1950, 2017, 10)) +
    scale_linetype_manual(name = "", values = c(1, 1, 1)) +
    guides(group=FALSE) +
    theme(legend.position="bottom")
```

</div>

<br/>
<br/>

### Player Distribution by Generation

<div class="Box">

<br/>

NBA has been around since 1950, this suggests that many generations have passed since the first NBA season. Would it be nice to see which generation dominated the league each year?

```{r fig.width = 9, fig.height=5}
NBA %>%
    group_by(Year, Player) %>%
    filter(!is.na(Born)) %>%
    mutate(Generation = ifelse(Born <= 1921, "The Depression Era",
                          ifelse(Born %in% 1922:1927, "World War II",
                            ifelse(Born %in% 1928:1945, "Post-War Cohort",
                              ifelse(Born %in% 1946:1954, "Baby Boomers",
                                ifelse(Born %in% 1955:1965, "Generation Jones",
                                  ifelse(Born %in% 1966:1976, "Generation X",
                                    ifelse(Born %in% 1977:1995, "Millennials",
                                      "Generation Z")))))))) %>%
    summarise(Generation = getmode(Generation)) %>%
    arrange(Year, Player) %>%
    mutate(Generation = factor(Generation, levels = c("The Depression Era",
                                                     "World War II",
                                                     "Post-War Cohort",
                                                     "Baby Boomers",
                                                     "Generation Jones",
                                                     "Generation X",
                                                     "Millennials",
                                                     "Generation Z"))) %>%
    ggplot(aes(Year, y=..count.., colour=Generation, fill=Generation)) +
    geom_density(alpha=0.55) +
    ggtitle("Generation Count by Years") +
    ylab("Count") +
    scale_x_continuous(breaks = seq(1950, 2017, 10)) +
    theme(legend.position='bottom')
```

<div class="footnote">
Generation classification based on [WJS marketing research](http://socialmarketing.org/archives/generations-xy-z-and-the-others/)
</div>

<br/>

And here's the oldest generation of the NBA...

#### Earliest born:

<div class="table">

```{r}
NBA %>%
    group_by(Player, Born) %>%
    summarise(Pos = getmode(Position),
              ActiveYears = paste(getmode(YearStart), "-", getmode(YearEnd)),
              NBASeasons = n_distinct(Year),
              Games = sum(G),
              StartingAge = min(Age),
              FinalAge = max(Age),
              PPG = round(sum(PTS)/sum(G), 1)) %>%
    arrange(Born) %>%
    select(Born, everything()) %>%
    head(9) %>%
    kable(escape = F, align = "c", caption = "Oldest NBA Players") %>%
    column_spec(1, bold = T, color = "white", background = "#777777") %>%
    column_spec(2, bold = T) %>%
    kable_styling("striped", full_width = T)
```

<br/>

Notice that in the `ActiveYears` column some players started their career before 1950 and the number of seasons in the `NBASeasons` column summed up after 1949. This because the NBA formally formed in 1949 by the merger of two rival organizations, the *National Basketball League* (founded 1937) and the *Basketball Association of America* (founded 1946). You can dig more into NBA history [here](https://www.britannica.com/topic/National-Basketball-Association).

</div>

</div>

<br/>
<br/>

### Career Length in the NBA

<br/>

<div class='Box'>

My next question is how long do NBA careers last?

This can be answered by summing up the number of seasons each player participated in.

```{r fig.width = 9, fig.height = 5, warning=FALSE}
SeasonNBA <- NBA %>%
    group_by(Player, Born) %>%
    summarise(NBASeasons = n_distinct(Year))

SeasonNBA %>%
    ggplot(aes(NBASeasons)) +
    geom_bar(aes(fill = ..count..)) +
    ggtitle("Player Distribution by Number of Seasons") +
    geom_vline(aes(xintercept = mean(NBASeasons, na.rm = T), linetype = "Average line"),
               col = "black",
               alpha = 0.5) +
    scale_fill_gradient(low = "green", high = "red") +
    scale_linetype_manual(name = "", values = 2) +
    xlab("Number of NBA Seasons") +
    ylab("Count") +
    theme(legend.position="bottom")
```

<br/>

<div class="Fact">

<ul class="CustomList">

<li>`r nrow(SeasonNBA[SeasonNBA$NBASeasons <= 1,])` players, or more than a quarter (`r round((nrow(SeasonNBA[SeasonNBA$NBASeasons <= 1,])/nrow(SeasonNBA))*100, 1)`%) of all players in the NBA history didn't survive after their first season.</li>
<li>Average career length in the NBA is `r round(mean(SeasonNBA$NBASeasons, na.rm=T),1)` years.</li>
<li>Less than half of all NBA players (`r round((nrow(SeasonNBA[SeasonNBA$NBASeasons >= 5,])/nrow(SeasonNBA))*100, 1)`%) participate in at 5 seasons or more.</li>
<li>And only 1 out of 5 players (`r round((nrow(SeasonNBA[SeasonNBA$NBASeasons >= 10,])/nrow(SeasonNBA))*100, 1)`%), can survive for 10 seasons or more.</li>

</ul>

</div>

<br/>

Now let's find out who has been in the NBA the longest...

<br/>

<div class="table">

```{r}
NBA %>%
    group_by(Player, Born) %>%
    summarise(Pos = getmode(Position),
              Team = getmode(Tm),
              ActiveYears = paste(min(Year), "-", max(Year)),
              RookieAge = min(Age),
              RetirementAge = max(Age),
              Seasons = n_distinct(Year),
              RpG = round(sum(TRB)/sum(G), 1),
              PpG = round(sum(PTS)/sum(G), 1)) %>%
    arrange(desc(Seasons), ActiveYears) %>%
    select(Seasons, everything(), -Born) %>%
    head(17) %>%
    kable(escape = F, align = "c", caption = "Longest Career in the NBA") %>%
    column_spec(1, bold = T, color = "white", background = "#777777") %>%
    column_spec(2, bold = T) %>%
    kable_styling("striped", full_width = T)
```

The table above listed the players who have a career in the NBA for 19 years or more.

<div class="Fact">

<ul class="CustomList">

Few things I noticed in this group are:

<li>Most of them are well-known great players, `r round(16/17 * 100, 1)` % of them have double digits overall points per game, which only about 1 in 5 NBA players can accomplish this feat.</li>
<li>What interesting is the Centers are Power Forwards (translated: tall guys) dominated the group with `r round(6/17 * 100, 1)` % each, which makes up `r round(12/17 * 100, 1)` % in total.</li>

</ul>

</div>

</div>

</div>

<br/>
<br/>

### The Rookies vs the Retirees

<div class='Box'>

<br/>

Now we take a closer look at the age when they started and finished their career in the NBA.

```{r fig.width=9, fig.height=5}
RRAge <- NBA %>%
    group_by(Player) %>%
    filter(!is.na(Age)) %>%
    summarise(RookieAge = min(Age),
              RetirementAge = max(Age)) %>%
    gather(Parameter, Value, RookieAge:RetirementAge)

RRAge %>%
    ggplot(aes(x=as.factor(Value),fill=Parameter)) + 
    geom_bar(data=filter(RRAge, Parameter == "RetirementAge")) + 
    geom_bar(data=filter(RRAge, Parameter == "RookieAge"), aes(y = ..count.. * (-1))) +
    ggtitle("Rookie Age vs, Retirement Age in the NBA") +
    xlab("Age") +
    ylab("Count") +
    scale_y_continuous(breaks=seq(-1500,1500,500),labels=abs(seq(-1500,1500,500))) +
    scale_fill_brewer(palette = "Set1") + 
    coord_flip() +
    theme(legend.position="bottom")

RRAge <- RRAge %>%
    spread(Parameter, Value)
```

<br/>

<div class="Fact">

<ul class="CustomList">

<li>Most of the players (`r round((nrow(RRAge[RRAge$RookieAge > 21 & RRAge$RookieAge < 25,])/nrow(RRAge))*100, 1)`%) start their career in NBA between age 22-24.</li>
<li>`r round((nrow(RRAge[RRAge$RookieAge < 20,])/nrow(RRAge))*100, 1)`% of NBA players starting their career before their 20 years old birthday.</li>
<li>Only `r nrow(RRAge[RRAge$RookieAge >= 30,])` players (`r round((nrow(RRAge[RRAge$RookieAge >= 30,])/nrow(RRAge))*100, 1)`%) starting their career in their 30's.</li>

</ul>

</div>

<br/>

<div class="table">

```{r}
NBA %>%
    filter(YearStart >= 1950) %>%
    group_by(Player) %>%
    summarise(RookieAge = min(Age),
              Pos = getmode(Position),
              Team = getmode(Tm),
              RetirementAge = max(Age),
              NBAYears = paste(min(Year), "-", max(Year)),
              NBASeasons = n_distinct(Year),
              Games = sum(G),
              PpG = round(sum(PTS)/Games,2)) %>%
    arrange(desc(RookieAge)) %>%
    select(RookieAge, everything()) %>%
    head(10) %>%
    kable(escape = F, align = "c", caption = "Oldest NBA Rookies") %>%
    column_spec(2, bold = T) %>%
    column_spec(1, bold = T, color = "white", background = "#777777") %>%
    kable_styling("striped", full_width = T)
```

Bear in mind that these are rookies as in the NBA. Some of them might already have years of experience in another professional basketball league.

<br/>

```{r}
NBA %>%
    select(Age, Player, Position, Year, Tm, G, GS, PTS) %>%
    mutate(PPG = round(PTS/G, 2)) %>%
    arrange(desc(Age)) %>%
    head(n=10) %>%
    kable(escape = F, align = 'c', caption = "Oldest NBA Players") %>%
    kable_styling("striped", full_width = T) %>%
    column_spec(2, bold = T) %>%
    column_spec(1, bold = T, color = "white", background = "#777777")
```

</div>

</div>

<br/>
<br/>

### Age vs Chance and Ability

<div class="Box">

<br/>

Lastly, I'd like to know how age correlates with their chances and ability in the field.

```{r fig.width = 9}
AgeCor <- NBA_Scaled %>%
    group_by(Age) %>%
    summarise(Games = mean(G, na.rm = T),
              GameStarted = mean(GS, na.rm = T),
              MinutesPlayed = mean(MP, na.rm = T),
              Shooting = mean(TS., na.rm = T),
              ShootAttemps = mean(FGA, na.rm = T),
              Rebound = mean(RpG, na.rm = T),
              Assist = mean(ApG, na.rm = T),
              Steal = mean(SpG, na.rm = T),
              Block = mean(BpG, na.rm = T),
              Turnover = mean(TpG, na.rm = T),
              Points = mean(PpG, na.rm = T)) %>%
    gather(variable, value, -Age) %>%
    filter(!is.na(Age)) %>%
    mutate(variable = factor(variable, levels = c("Games", "GameStarted", "MinutesPlayed", "Shooting", "ShootAttemps", "Rebound", "Assist", "Steal", "Block", "Turnover", "Points")))

AgeCor %>%
    ggplot(aes(Age, variable, fill=value)) +
    geom_tile(color = "grey50") +
    scale_x_continuous(expand = c(0, 0)) +
    scale_fill_gradientn(colors = brewer.pal(9, "Reds")) +
    theme(panel.grid = element_blank()) +
    scale_y_discrete(limits = rev(levels(AgeCor$variable))) +
    ggtitle("Age, Chance and Ability") +
    ylab("Parameter") +
    theme(legend.position="bottom")
```

<br/>

It turns out that their peak around 25-30 years old, with some exception in shooting (TS%) and blocks. That is because, as we have seen in the previous table, only a handful of players still active after their 40 and 90% of them are post players who are shot-blockers with a good FG%. 

</div>
 
</div>

<br/>
<br/>

---

<div class="row">

 <div class="column left">
 <a href="https://rpubs.com/ninjazzle/NBASeasons0" target="_blank">
<button class="leftbtn"></button>
 </a>
 </div>
 
 <div class="column middle">

End of Session

  </div>
  
  <div class="column right">
  <a href="https://rpubs.com/ninjazzle/NBASeasons2" target="_blank">
<button class="rightbtn"></button>
  </a>
  </div>
  
</div>

---