Week 3 Data Dive

This week I will be creating a series of data frames that showcase some additional insight into my data.

These data frames include three Group By data frames and one Combination data frame.

I will start by explaining what each data frame is and stating my hypothesis for what it will show. Then I will present the code and visuals for the data frame. Finally, I will provide a summary covering my insights, any significant outcomes, and any additional questions.

Group By Data Frames:

First Data Frame:

The first data frame dives into the difference in scoring between playoff teams and non-playoff teams.

My hypothesis is that playoff teams score more points per 100 possessions than teams that miss the playoffs.

group_playoffs_pts <- aggregate(
  PTS_per_100 ~ Playoffs,
  data = NBA_Stats_100,
  FUN = function(x) c(
    avg = mean(x, na.rm = TRUE),
    sd  = sd(x, na.rm = TRUE),
    n   = length(x)
  )
)

# Convert matrix column to data frame
group_playoffs_pts <- data.frame(
  Playoffs = group_playoffs_pts$Playoffs,
  teams = group_playoffs_pts$PTS_per_100[, "n"],
  avg_pts = group_playoffs_pts$PTS_per_100[, "avg"],
  sd_pts = group_playoffs_pts$PTS_per_100[, "sd"]
)

group_playoffs_pts
##   Playoffs teams  avg_pts   sd_pts
## 1    FALSE   637 104.9846 4.761452
## 2     TRUE   765 107.9314 4.245263
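The conversion step above is needed because aggregate() stores a vector-valued FUN result as a single matrix column rather than as separate columns. The sketch below illustrates this on a small made-up data frame (toy, grp, and val are hypothetical names, not part of the NBA data) and shows do.call(data.frame, ...) as one possible shortcut for flattening the result.

```r
# Hypothetical toy data, used only to illustrate the matrix-column behaviour
toy <- data.frame(
  grp = c("a", "a", "b", "b", "b"),
  val = c(1, 3, 10, 20, 30)
)

agg <- aggregate(
  val ~ grp,
  data = toy,
  FUN = function(x) c(avg = mean(x), n = length(x))
)

# agg has only two columns: grp, and a matrix column val with columns avg and n
ncol(agg)
is.matrix(agg$val)

# Flatten the matrix column into ordinary columns named val.avg and val.n
flat <- do.call(data.frame, agg)
names(flat)
```

The explicit data.frame() call used in this report does the same job and has the advantage of letting each column be renamed in one step.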
group_playoffs_pts$probability <-
  group_playoffs_pts$teams / sum(group_playoffs_pts$teams)

group_playoffs_pts$rare_group <-
  group_playoffs_pts$probability == min(group_playoffs_pts$probability)

group_playoffs_pts
##   Playoffs teams  avg_pts   sd_pts probability rare_group
## 1    FALSE   637 104.9846 4.761452   0.4543509       TRUE
## 2     TRUE   765 107.9314 4.245263   0.5456491      FALSE
playoff_labels <- ifelse(
  group_playoffs_pts$Playoffs,
  "Made Playoffs",
  "Missed Playoffs"
)

# Set y-axis limits slightly wider than the data range
y_min <- floor(min(group_playoffs_pts$avg_pts) - 2)
y_max <- ceiling(max(group_playoffs_pts$avg_pts) + 2)

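# xpd = FALSE clips the bars at the lower axis limit; without it, bars drawn
# from zero spill below the plot region when ylim does not start at 0
barplot(
  height = group_playoffs_pts$avg_pts,
  names.arg = playoff_labels,
  col = c("skyblue", "steelblue"),
  ylim = c(y_min, y_max),
  xpd = FALSE,
  ylab = "AVG Points per 100 Poss.",
  main = "AVG Points per 100 Poss. by Playoff Status"
)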

Summary:

The first data frame supports my initial hypothesis: teams that made the playoffs averaged about 107.9 points per 100 possessions, compared with about 105.0 for teams that missed the playoffs, a gap of roughly 2.9 points. The standard deviations (about 4.2 and 4.8) show that the two groups still overlap considerably, so this suggests that scoring is associated with team success rather than proving it. If I were to dive further into this concept, I would ask whether there is a significant difference in points allowed per 100 possessions between teams that made the playoffs and teams that missed them.
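One way to check whether a gap of this size could plausibly be due to chance is a Welch two-sample t-test. The sketch below is not part of the original analysis: it recomputes the test from the summary statistics printed above (the variable names are my own), and it assumes team-seasons can be treated as independent observations, which is only approximately true.

```r
# Summary statistics copied from the group_playoffs_pts output above
n   <- c(missed = 637, made = 765)
avg <- c(missed = 104.9846, made = 107.9314)
s   <- c(missed = 4.761452, made = 4.245263)

# Difference in means and its standard error (unequal variances)
diff_pts <- unname(avg["made"] - avg["missed"])
se       <- sqrt(sum(s^2 / n))
t_stat   <- diff_pts / se

# Welch-Satterthwaite approximation for the degrees of freedom
df <- sum(s^2 / n)^2 / sum((s^2 / n)^2 / (n - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df)

c(difference = diff_pts, t = t_stat, df = df, p = p_value)
```

With these inputs the t statistic comes out at roughly 12, so the difference in means is very unlikely to be noise, although that still says nothing about causation. With the full data loaded, t.test(PTS_per_100 ~ Playoffs, data = NBA_Stats_100) should give essentially the same result directly.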

Second Data Frame:

The second data frame shows the average turnovers per 100 possessions for playoff teams and non-playoff teams in each season.

My hypothesis is that turnovers will be more prevalent among teams that miss the playoffs, and that playoff teams will average fewer turnovers.

group_season_tov <- aggregate(
  TOV_per_100 ~ Season + Playoffs,
  data = NBA_Stats_100,
  FUN = function(x) c(
    avg = mean(x, na.rm = TRUE),
    n   = length(x)
  )
)

group_season_tov <- data.frame(
  Season = group_season_tov$Season,
  Playoffs = group_season_tov$Playoffs,
  teams = group_season_tov$TOV_per_100[, "n"],
  avg_tov = group_season_tov$TOV_per_100[, "avg"]
)

group_season_tov
##      Season Playoffs teams  avg_tov
## 1   1973-74    FALSE    10 18.97000
## 2   1974-75    FALSE    10 18.95000
## 3   1975-76    FALSE    12 18.39167
## 4   1976-77    FALSE    10 19.29000
## 5   1977-78    FALSE    10 18.76000
## 6   1978-79    FALSE    10 18.58000
## 7   1979-80    FALSE    10 18.29000
## 8   1980-81    FALSE    11 18.06364
## 9   1981-82    FALSE    11 17.60909
## 10  1982-83    FALSE    11 18.60000
## 11  1983-84    FALSE     7 17.71429
## 12  1984-85    FALSE     7 18.18571
## 13  1985-86    FALSE     7 17.84286
## 14  1986-87    FALSE     7 17.87143
## 15  1987-88    FALSE     7 17.50000
## 16  1988-89    FALSE     9 17.81111
## 17  1989-90    FALSE    11 16.49091
## 18  1990-91    FALSE    11 16.35455
## 19  1991-92    FALSE    11 16.23636
## 20  1992-93    FALSE    11 16.98182
## 21  1993-94    FALSE    11 16.95455
## 22  1994-95    FALSE    11 17.63636
## 23  1995-96    FALSE    13 17.57692
## 24  1996-97    FALSE    13 17.69231
## 25  1997-98    FALSE    13 17.33077
## 26  1998-99    FALSE    13 17.16154
## 27  1999-00    FALSE    13 17.15385
## 28  2000-01    FALSE    13 16.92308
## 29  2001-02    FALSE    13 16.34615
## 30  2002-03    FALSE    13 16.94615
## 31  2003-04    FALSE    13 17.06923
## 32  2004-05    FALSE    14 16.06429
## 33  2005-06    FALSE    14 16.30000
## 34  2006-07    FALSE    14 16.82857
## 35  2007-08    FALSE    14 15.64286
## 36  2008-09    FALSE    14 15.65000
## 37  2009-10    FALSE    14 15.49286
## 38  2010-11    FALSE    14 15.70000
## 39  2011-12    FALSE    14 16.07857
## 40  2012-13    FALSE    14 15.65000
## 41  2013-14    FALSE    14 15.57857
## 42  2014-15    FALSE    14 15.50714
## 43  2015-16    FALSE    14 15.32143
## 44  2016-17    FALSE    14 14.36429
## 45  2017-18    FALSE    14 14.72143
## 46  2018-19    FALSE    14 14.18571
## 47  2019-20    FALSE    14 14.81429
## 48  2020-21    FALSE    14 14.20714
## 49  2021-22    FALSE    14 14.17143
## 50  2022-23    FALSE    30 14.10333
## 51  2023-24    FALSE    23 13.95652
## 52  1973-74     TRUE    17 18.25294
## 53  1974-75     TRUE    18 18.16667
## 54  1975-76     TRUE    15 18.34667
## 55  1976-77     TRUE    12 19.15833
## 56  1977-78     TRUE    12 18.60000
## 57  1978-79     TRUE    12 18.66667
## 58  1979-80     TRUE    12 18.18333
## 59  1980-81     TRUE    12 18.52500
## 60  1981-82     TRUE    12 17.31667
## 61  1982-83     TRUE    12 18.31667
## 62  1983-84     TRUE    16 17.42500
## 63  1984-85     TRUE    16 17.03750
## 64  1985-86     TRUE    16 17.12500
## 65  1986-87     TRUE    16 16.25000
## 66  1987-88     TRUE    16 16.38750
## 67  1988-89     TRUE    16 16.55000
## 68  1989-90     TRUE    16 16.06250
## 69  1990-91     TRUE    16 16.24375
## 70  1991-92     TRUE    16 15.78750
## 71  1992-93     TRUE    16 15.85625
## 72  1993-94     TRUE    16 16.60625
## 73  1994-95     TRUE    16 16.60625
## 74  1995-96     TRUE    16 16.76875
## 75  1996-97     TRUE    16 16.85000
## 76  1997-98     TRUE    16 16.76250
## 77  1998-99     TRUE    16 17.07500
## 78  1999-00     TRUE    16 15.99375
## 79  2000-01     TRUE    16 15.90625
## 80  2001-02     TRUE    16 15.43750
## 81  2002-03     TRUE    16 15.68750
## 82  2003-04     TRUE    16 16.05625
## 83  2004-05     TRUE    16 15.61250
## 84  2005-06     TRUE    16 15.34375
## 85  2006-07     TRUE    16 15.86250
## 86  2007-08     TRUE    16 14.76250
## 87  2008-09     TRUE    16 14.76875
## 88  2009-10     TRUE    16 14.96875
## 89  2010-11     TRUE    16 15.05000
## 90  2011-12     TRUE    16 15.64375
## 91  2012-13     TRUE    16 15.74375
## 92  2013-14     TRUE    16 15.35625
## 93  2014-15     TRUE    16 14.85000
## 94  2015-16     TRUE    16 14.51250
## 95  2016-17     TRUE    16 14.35625
## 96  2017-18     TRUE    16 14.44375
## 97  2018-19     TRUE    16 13.79375
## 98  2019-20     TRUE    16 14.05625
## 99  2020-21     TRUE    16 13.57500
## 100 2021-22     TRUE    16 13.70000
## 101 2023-24     TRUE     7 13.01429
group_season_tov$probability <-
  group_season_tov$teams / sum(group_season_tov$teams)

group_season_tov$rare_group <-
  group_season_tov$teams == min(group_season_tov$teams)

group_season_tov
##      Season Playoffs teams  avg_tov probability rare_group
## 1   1973-74    FALSE    10 18.97000 0.007132668      FALSE
## 2   1974-75    FALSE    10 18.95000 0.007132668      FALSE
## 3   1975-76    FALSE    12 18.39167 0.008559201      FALSE
## 4   1976-77    FALSE    10 19.29000 0.007132668      FALSE
## 5   1977-78    FALSE    10 18.76000 0.007132668      FALSE
## 6   1978-79    FALSE    10 18.58000 0.007132668      FALSE
## 7   1979-80    FALSE    10 18.29000 0.007132668      FALSE
## 8   1980-81    FALSE    11 18.06364 0.007845934      FALSE
## 9   1981-82    FALSE    11 17.60909 0.007845934      FALSE
## 10  1982-83    FALSE    11 18.60000 0.007845934      FALSE
## 11  1983-84    FALSE     7 17.71429 0.004992867       TRUE
## 12  1984-85    FALSE     7 18.18571 0.004992867       TRUE
## 13  1985-86    FALSE     7 17.84286 0.004992867       TRUE
## 14  1986-87    FALSE     7 17.87143 0.004992867       TRUE
## 15  1987-88    FALSE     7 17.50000 0.004992867       TRUE
## 16  1988-89    FALSE     9 17.81111 0.006419401      FALSE
## 17  1989-90    FALSE    11 16.49091 0.007845934      FALSE
## 18  1990-91    FALSE    11 16.35455 0.007845934      FALSE
## 19  1991-92    FALSE    11 16.23636 0.007845934      FALSE
## 20  1992-93    FALSE    11 16.98182 0.007845934      FALSE
## 21  1993-94    FALSE    11 16.95455 0.007845934      FALSE
## 22  1994-95    FALSE    11 17.63636 0.007845934      FALSE
## 23  1995-96    FALSE    13 17.57692 0.009272468      FALSE
## 24  1996-97    FALSE    13 17.69231 0.009272468      FALSE
## 25  1997-98    FALSE    13 17.33077 0.009272468      FALSE
## 26  1998-99    FALSE    13 17.16154 0.009272468      FALSE
## 27  1999-00    FALSE    13 17.15385 0.009272468      FALSE
## 28  2000-01    FALSE    13 16.92308 0.009272468      FALSE
## 29  2001-02    FALSE    13 16.34615 0.009272468      FALSE
## 30  2002-03    FALSE    13 16.94615 0.009272468      FALSE
## 31  2003-04    FALSE    13 17.06923 0.009272468      FALSE
## 32  2004-05    FALSE    14 16.06429 0.009985735      FALSE
## 33  2005-06    FALSE    14 16.30000 0.009985735      FALSE
## 34  2006-07    FALSE    14 16.82857 0.009985735      FALSE
## 35  2007-08    FALSE    14 15.64286 0.009985735      FALSE
## 36  2008-09    FALSE    14 15.65000 0.009985735      FALSE
## 37  2009-10    FALSE    14 15.49286 0.009985735      FALSE
## 38  2010-11    FALSE    14 15.70000 0.009985735      FALSE
## 39  2011-12    FALSE    14 16.07857 0.009985735      FALSE
## 40  2012-13    FALSE    14 15.65000 0.009985735      FALSE
## 41  2013-14    FALSE    14 15.57857 0.009985735      FALSE
## 42  2014-15    FALSE    14 15.50714 0.009985735      FALSE
## 43  2015-16    FALSE    14 15.32143 0.009985735      FALSE
## 44  2016-17    FALSE    14 14.36429 0.009985735      FALSE
## 45  2017-18    FALSE    14 14.72143 0.009985735      FALSE
## 46  2018-19    FALSE    14 14.18571 0.009985735      FALSE
## 47  2019-20    FALSE    14 14.81429 0.009985735      FALSE
## 48  2020-21    FALSE    14 14.20714 0.009985735      FALSE
## 49  2021-22    FALSE    14 14.17143 0.009985735      FALSE
## 50  2022-23    FALSE    30 14.10333 0.021398003      FALSE
## 51  2023-24    FALSE    23 13.95652 0.016405136      FALSE
## 52  1973-74     TRUE    17 18.25294 0.012125535      FALSE
## 53  1974-75     TRUE    18 18.16667 0.012838802      FALSE
## 54  1975-76     TRUE    15 18.34667 0.010699001      FALSE
## 55  1976-77     TRUE    12 19.15833 0.008559201      FALSE
## 56  1977-78     TRUE    12 18.60000 0.008559201      FALSE
## 57  1978-79     TRUE    12 18.66667 0.008559201      FALSE
## 58  1979-80     TRUE    12 18.18333 0.008559201      FALSE
## 59  1980-81     TRUE    12 18.52500 0.008559201      FALSE
## 60  1981-82     TRUE    12 17.31667 0.008559201      FALSE
## 61  1982-83     TRUE    12 18.31667 0.008559201      FALSE
## 62  1983-84     TRUE    16 17.42500 0.011412268      FALSE
## 63  1984-85     TRUE    16 17.03750 0.011412268      FALSE
## 64  1985-86     TRUE    16 17.12500 0.011412268      FALSE
## 65  1986-87     TRUE    16 16.25000 0.011412268      FALSE
## 66  1987-88     TRUE    16 16.38750 0.011412268      FALSE
## 67  1988-89     TRUE    16 16.55000 0.011412268      FALSE
## 68  1989-90     TRUE    16 16.06250 0.011412268      FALSE
## 69  1990-91     TRUE    16 16.24375 0.011412268      FALSE
## 70  1991-92     TRUE    16 15.78750 0.011412268      FALSE
## 71  1992-93     TRUE    16 15.85625 0.011412268      FALSE
## 72  1993-94     TRUE    16 16.60625 0.011412268      FALSE
## 73  1994-95     TRUE    16 16.60625 0.011412268      FALSE
## 74  1995-96     TRUE    16 16.76875 0.011412268      FALSE
## 75  1996-97     TRUE    16 16.85000 0.011412268      FALSE
## 76  1997-98     TRUE    16 16.76250 0.011412268      FALSE
## 77  1998-99     TRUE    16 17.07500 0.011412268      FALSE
## 78  1999-00     TRUE    16 15.99375 0.011412268      FALSE
## 79  2000-01     TRUE    16 15.90625 0.011412268      FALSE
## 80  2001-02     TRUE    16 15.43750 0.011412268      FALSE
## 81  2002-03     TRUE    16 15.68750 0.011412268      FALSE
## 82  2003-04     TRUE    16 16.05625 0.011412268      FALSE
## 83  2004-05     TRUE    16 15.61250 0.011412268      FALSE
## 84  2005-06     TRUE    16 15.34375 0.011412268      FALSE
## 85  2006-07     TRUE    16 15.86250 0.011412268      FALSE
## 86  2007-08     TRUE    16 14.76250 0.011412268      FALSE
## 87  2008-09     TRUE    16 14.76875 0.011412268      FALSE
## 88  2009-10     TRUE    16 14.96875 0.011412268      FALSE
## 89  2010-11     TRUE    16 15.05000 0.011412268      FALSE
## 90  2011-12     TRUE    16 15.64375 0.011412268      FALSE
## 91  2012-13     TRUE    16 15.74375 0.011412268      FALSE
## 92  2013-14     TRUE    16 15.35625 0.011412268      FALSE
## 93  2014-15     TRUE    16 14.85000 0.011412268      FALSE
## 94  2015-16     TRUE    16 14.51250 0.011412268      FALSE
## 95  2016-17     TRUE    16 14.35625 0.011412268      FALSE
## 96  2017-18     TRUE    16 14.44375 0.011412268      FALSE
## 97  2018-19     TRUE    16 13.79375 0.011412268      FALSE
## 98  2019-20     TRUE    16 14.05625 0.011412268      FALSE
## 99  2020-21     TRUE    16 13.57500 0.011412268      FALSE
## 100 2021-22     TRUE    16 13.70000 0.011412268      FALSE
## 101 2023-24     TRUE     7 13.01429 0.004992867       TRUE
barplot(
  height = group_season_tov$avg_tov,
  names.arg = paste(group_season_tov$Season, group_season_tov$Playoffs),
  las = 2,
  ylab = "AVG Turnovers per 100 Poss.",
  main = "Turnovers by Season and Playoff Status"
)

Summary:

The main insight from the second data frame is not the gap between playoff and non-playoff teams, but that average turnovers have decreased over time for both groups. My initial hypothesis was only weakly supported: in most seasons playoff teams do average slightly fewer turnovers, but the gap within a season is small (for example, 18.25 versus 18.97 in 1973-74), while the average for either group has fallen by roughly five turnovers per 100 possessions between the 1970s and the 2020s. This shows that NBA teams as a whole have become better at minimizing turnovers over time, even if the decline is not perfectly steady from year to year. One caveat is that the table lists all 30 teams from 2022-23 as non-playoff teams and has no playoff row for that season, so the playoff flag appears to be incomplete for the most recent seasons. If I were to ask additional questions on this topic, I would want to know whether minimizing turnovers on offense or forcing turnovers on defense matters more for making the playoffs.
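The comparison between the within-season gap and the change across eras can be made explicit by pivoting the grouped table so that each season is one row. The sketch below uses only three seasons, with averages copied from the output above; tov, wide, gap, and era_drop are names I chose for illustration.

```r
# A few season averages copied from the group_season_tov output above
tov <- data.frame(
  Season   = rep(c("1973-74", "1999-00", "2021-22"), times = 2),
  Playoffs = rep(c(FALSE, TRUE), each = 3),
  avg_tov  = c(18.97000, 17.15385, 14.17143,
               18.25294, 15.99375, 13.70000)
)

# Pivot to one row per season and one column per playoff status
wide <- tapply(tov$avg_tov, list(tov$Season, tov$Playoffs), mean)

# Within-season gap: non-playoff average minus playoff average
gap <- wide[, "FALSE"] - wide[, "TRUE"]

# Change across eras for non-playoff teams
era_drop <- wide["1973-74", "FALSE"] - wide["2021-22", "FALSE"]

gap
era_drop
```

For these three seasons the within-season gap stays below about 1.2 turnovers per 100 possessions, while the drop across eras is close to 5. Running the same pivot on the full group_season_tov table would show whether that pattern holds in every season.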

Third Data Frame:

The third data frame shows the number of team-seasons in NBA history that fall into the “Low”, “Medium”, “High”, and “Elite” scoring tiers.

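Low - A team with a poor NBA offense, at or below 110 points per 100 possessions (high probability)

Medium - A team with an average NBA offense, above 110 and up to 115 points per 100 possessions (average probability)

High - A team with a very good NBA offense, above 115 and up to 120 points per 100 possessions (low probability)

Elite - A team that is one of the greatest offenses in NBA history, above 120 points per 100 possessions (very low probability)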

My hypothesis is that the “Elite” scoring tier will be very exclusive and only see some of the greatest offensive teams of all time.

NBA_Stats_100$scoring_tier <- cut(
  NBA_Stats_100$PTS_per_100,
  breaks = c(0, 110, 115, 120, Inf),
  labels = c("Low", "Medium", "High", "Elite")
)
group_scoring_tier <- aggregate(
  PTS_per_100 ~ scoring_tier,
  data = NBA_Stats_100,
  FUN = function(x) c(
    avg = mean(x, na.rm = TRUE),
    n   = length(x)
  )
)

group_scoring_tier <- data.frame(
  scoring_tier = group_scoring_tier$scoring_tier,
  teams = group_scoring_tier$PTS_per_100[, "n"],
  avg_pts = group_scoring_tier$PTS_per_100[, "avg"]
)

group_scoring_tier
##   scoring_tier teams  avg_pts
## 1          Low  1072 104.6693
## 2       Medium   275 112.0356
## 3         High    53 116.6642
## 4        Elite     2 122.1000
group_scoring_tier$probability <-
  group_scoring_tier$teams / sum(group_scoring_tier$teams)

group_scoring_tier$rare_group <-
  group_scoring_tier$probability == min(group_scoring_tier$probability)

group_scoring_tier
##   scoring_tier teams  avg_pts probability rare_group
## 1          Low  1072 104.6693 0.764621969      FALSE
## 2       Medium   275 112.0356 0.196148359      FALSE
## 3         High    53 116.6642 0.037803138      FALSE
## 4        Elite     2 122.1000 0.001426534       TRUE
barplot(
  height = group_scoring_tier$teams,
  names.arg = group_scoring_tier$scoring_tier,
  col = "darkseagreen",
  ylab = "Number of Team-Seasons",
  main = "Distribution of Teams by Scoring Tier"
)
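Because the tier counts depend on exactly where each boundary falls, it is worth checking how cut() treats values that sit on a break. By default its intervals are closed on the right, so a team at exactly 110 lands in “Low”. The sketch below uses the same breaks and labels as above with a handful of made-up values (pts is hypothetical, not taken from the data).

```r
# Same breaks and labels as the scoring_tier definition above
breaks <- c(0, 110, 115, 120, Inf)
labels <- c("Low", "Medium", "High", "Elite")

# Hypothetical values chosen to sit on or just past each boundary
pts <- c(104.7, 110.0, 110.1, 115.0, 120.0, 120.1)

# Default right = TRUE: intervals are (0, 110], (110, 115], (115, 120], (120, Inf]
tiers <- cut(pts, breaks = breaks, labels = labels)

# right = FALSE: intervals become [0, 110), [110, 115), [115, 120), [120, Inf)
tiers_left <- cut(pts, breaks = breaks, labels = labels, right = FALSE)

data.frame(pts, tiers, tiers_left)
```

With only two “Elite” team-seasons, moving a single team across the 120 boundary would change that tier's count by half, so the choice of closed side is not purely cosmetic.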

Summary:

The third data frame shows that the large majority of team-seasons in NBA history fall into the “Low” tier: 1,072 of 1,402, or about 76%. Only 53 offenses qualify as “High” and only 2 qualify as “Elite”. This matches my hypothesis that the “Elite” tier would contain only some of the greatest offenses in NBA history. It also shows how wide the gap in scoring efficiency is between typical offenses and the very best ones. If I were to continue diving into this topic, I would want to know whether modern NBA offenses score more points per 100 possessions than offenses from before the 2000s.

Combination Data Frame:

The combination data frame combines the scoring tier of a team with whether or not they made the playoffs.

If I were to make a prediction I would say that the low scoring tier will be more associated with non-playoff teams and that the medium, high, and elite scoring tiers will be more associated with playoff teams.

all_combinations <- expand.grid(
  Playoffs = unique(NBA_Stats_100$Playoffs),
  scoring_tier = levels(NBA_Stats_100$scoring_tier)
)

all_combinations
##   Playoffs scoring_tier
## 1    FALSE          Low
## 2     TRUE          Low
## 3    FALSE       Medium
## 4     TRUE       Medium
## 5    FALSE         High
## 6     TRUE         High
## 7    FALSE        Elite
## 8     TRUE        Elite
observed_combinations <- unique(
  NBA_Stats_100[, c("Playoffs", "scoring_tier")]
)

observed_combinations
##     Playoffs scoring_tier
## 1      FALSE         High
## 2       TRUE        Elite
## 3      FALSE       Medium
## 5      FALSE          Low
## 7       TRUE         High
## 12     FALSE        Elite
## 62      TRUE       Medium
## 104     TRUE          Low
missing_combinations <- all_combinations[
  !apply(all_combinations, 1, function(row)
    any(
      observed_combinations$Playoffs == row[1] &
      observed_combinations$scoring_tier == row[2]
    )
  ),
]

missing_combinations
## [1] Playoffs     scoring_tier
## <0 rows> (or 0-length row.names)
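Since every combination occurs in the NBA data, the check above returns an empty result. To show what a non-empty result looks like, the sketch below runs the same idea on a small made-up data set in which one combination is absent. It compares text keys instead of using apply(), which is an alternative I am suggesting rather than the method used above; toy, all_combos, and missing_combos are hypothetical names.

```r
# Hypothetical toy data: no playoff team appears in the "Elite" tier
toy <- data.frame(
  Playoffs = c(FALSE, TRUE, FALSE, TRUE, FALSE),
  scoring_tier = factor(
    c("Low", "Low", "High", "High", "Elite"),
    levels = c("Low", "High", "Elite")
  )
)

all_combos <- expand.grid(
  Playoffs = c(FALSE, TRUE),
  scoring_tier = levels(toy$scoring_tier)
)

# Build one text key per row and keep the keys that never occur in the data
all_keys  <- paste(all_combos$Playoffs, all_combos$scoring_tier)
seen_keys <- paste(toy$Playoffs, toy$scoring_tier)
missing_combos <- all_combos[!(all_keys %in% seen_keys), ]
missing_combos

# Cross-check: table() keeps unobserved factor combinations as zero counts
counts <- as.data.frame(
  table(Playoffs = toy$Playoffs, scoring_tier = toy$scoring_tier)
)
zero_cells <- counts[counts$Freq == 0, ]
zero_cells
```

Both approaches should flag the same single combination here (a playoff team in the “Elite” tier). The table() version has the side benefit of producing the counts needed for the next step.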
combo_counts <- as.data.frame(
  table(NBA_Stats_100$Playoffs, NBA_Stats_100$scoring_tier)
)

colnames(combo_counts) <- c("Playoffs", "Scoring_Tier", "Count")

combo_counts
##   Playoffs Scoring_Tier Count
## 1    FALSE          Low   551
## 2     TRUE          Low   521
## 3    FALSE       Medium    60
## 4     TRUE       Medium   215
## 5    FALSE         High    25
## 6     TRUE         High    28
## 7    FALSE        Elite     1
## 8     TRUE        Elite     1
barplot(
  combo_counts$Count,
  names.arg = paste(combo_counts$Playoffs, combo_counts$Scoring_Tier),
  las = 2,
  ylab = "Number of Team-Seasons",
  main = "Playoff Status by Scoring Tier"
)

Summary:

The combination data frame shows that there isn’t a perfect correlation between scoring tier and making the playoffs, but there is some association. Teams in the low scoring tier miss the playoffs slightly more often than they make them (551 versus 521). Teams in the medium scoring tier make the playoffs far more often than they miss them (215 versus 60, or about 78%). However, the split between playoff and non-playoff teams is almost exactly even for the high tier (28 versus 25) and the elite tier (1 versus 1). In this data set, then, the medium tier has the highest playoff rate, although the elite tier contains only two team-seasons and the season-level table above lists every 2022-23 team as a non-playoff team, which may understate the playoff rate of recent high-scoring teams. If I were to dive deeper into this topic, I would want to find out whether the high and elite scoring tier teams are predominantly modern NBA teams that are simply competing with each other for playoff spots.
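Raw counts make the tiers hard to compare because the tiers differ so much in size, so it can help to convert them to a playoff rate per tier. The sketch below rebuilds combo_counts from the numbers printed above and uses xtabs() with prop.table(); playoff_rate is a name I introduced for illustration.

```r
# Counts copied from the combo_counts output above
combo_counts <- data.frame(
  Playoffs = rep(c(FALSE, TRUE), times = 4),
  Scoring_Tier = factor(
    rep(c("Low", "Medium", "High", "Elite"), each = 2),
    levels = c("Low", "Medium", "High", "Elite")
  ),
  Count = c(551, 521, 60, 215, 25, 28, 1, 1)
)

# Tier-by-playoff contingency table
counts <- xtabs(Count ~ Scoring_Tier + Playoffs, data = combo_counts)

# Share of each tier that made the playoffs (row proportions, TRUE column)
playoff_rate <- prop.table(counts, margin = 1)[, "TRUE"]
round(playoff_rate, 3)
```

The rates are about 0.49 for Low, 0.78 for Medium, 0.53 for High, and 0.50 for Elite, which makes the medium tier's advantage easier to see than the raw counts do.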