This week I will create a series of data frames that offer additional insight into my data.
These include three group-by data frames and one combination data frame.
For each one, I will start by explaining what the data frame is and stating my hypothesis for what it will show. Then I will present the code and visuals. Finally, I will summarize my insights, any significant outcomes, and any follow-up questions.
The first data frame dives into the difference in scoring between playoff teams and non-playoff teams.
My hypothesis is that playoff teams score more points per 100 possessions than teams that miss the playoffs.
group_playoffs_pts <- aggregate(
  PTS_per_100 ~ Playoffs,
  data = NBA_Stats_100,
  FUN = function(x) c(
    avg = mean(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    n = length(x)
  )
)
# Convert matrix column to data frame
group_playoffs_pts <- data.frame(
  Playoffs = group_playoffs_pts$Playoffs,
  teams = group_playoffs_pts$PTS_per_100[, "n"],
  avg_pts = group_playoffs_pts$PTS_per_100[, "avg"],
  sd_pts = group_playoffs_pts$PTS_per_100[, "sd"]
)
group_playoffs_pts
## Playoffs teams avg_pts sd_pts
## 1 FALSE 637 104.9846 4.761452
## 2 TRUE 765 107.9314 4.245263
group_playoffs_pts$probability <-
  group_playoffs_pts$teams / sum(group_playoffs_pts$teams)
group_playoffs_pts$rare_group <-
  group_playoffs_pts$probability == min(group_playoffs_pts$probability)
group_playoffs_pts
## Playoffs teams avg_pts sd_pts probability rare_group
## 1 FALSE 637 104.9846 4.761452 0.4543509 TRUE
## 2 TRUE 765 107.9314 4.245263 0.5456491 FALSE
playoff_labels <- ifelse(
  group_playoffs_pts$Playoffs,
  "Made Playoffs",
  "Missed Playoffs"
)
# Set y-axis limits slightly wider than the data range
y_min <- floor(min(group_playoffs_pts$avg_pts) - 2)
y_max <- ceiling(max(group_playoffs_pts$avg_pts) + 2)
barplot(
  height = group_playoffs_pts$avg_pts,
  names.arg = playoff_labels,
  col = c("skyblue", "steelblue"),
  ylim = c(y_min, y_max),
  # The y-axis does not start at 0, so clip the bars to the plot region
  xpd = FALSE,
  ylab = "AVG Points per 100 Poss.",
  main = "AVG Points per 100 Poss. by Playoff Status"
)
The first data frame supports my initial hypothesis: teams that made the playoffs averaged about 107.9 points per 100 possessions, compared with about 105.0 for teams that missed the playoffs, a gap of roughly 2.9 points. Because the standard deviation within each group is between 4 and 5 points, the two groups still overlap considerably, so this suggests that scoring is associated with team success rather than proving that it is vital on its own. If I were to dive further into this concept, I would ask whether there is a comparable difference in points allowed per 100 possessions between teams that made the playoffs and teams that missed them.
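One way to gauge how large this gap is relative to the spread within each group is a Welch t statistic and a standardized effect size computed from the summary table. This is a minimal sketch and not part of the original analysis: the means, standard deviations, and counts are copied from the printed `group_playoffs_pts` output, and the variable names are illustrative.

```r
# Summary statistics copied from the group_playoffs_pts output
n_missed <- 637; mean_missed <- 104.9846; sd_missed <- 4.761452
n_made   <- 765; mean_made   <- 107.9314; sd_made   <- 4.245263

# Difference in average points per 100 possessions
diff_pts <- mean_made - mean_missed

# Welch standard error (does not assume equal variances)
v_missed <- sd_missed^2 / n_missed
v_made   <- sd_made^2 / n_made
se_diff  <- sqrt(v_missed + v_made)
t_stat   <- diff_pts / se_diff

# Welch-Satterthwaite degrees of freedom and two-sided p-value
df_welch <- (v_missed + v_made)^2 /
  (v_missed^2 / (n_missed - 1) + v_made^2 / (n_made - 1))
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

# Cohen's d using the pooled standard deviation
sd_pooled <- sqrt(
  ((n_missed - 1) * sd_missed^2 + (n_made - 1) * sd_made^2) /
    (n_missed + n_made - 2)
)
cohens_d <- diff_pts / sd_pooled

round(c(diff = diff_pts, t = t_stat, d = cohens_d), 3)
p_value
```

Team-seasons from the same franchise are not independent observations, so these numbers are best read as a descriptive summary rather than a formal test.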
The second data frame shows the average turnovers per 100 possessions for playoff teams and non-playoff teams in each season.
My hypothesis is that teams that miss the playoffs average more turnovers than playoff teams.
group_season_tov <- aggregate(
  TOV_per_100 ~ Season + Playoffs,
  data = NBA_Stats_100,
  FUN = function(x) c(
    avg = mean(x, na.rm = TRUE),
    n = length(x)
  )
)
group_season_tov <- data.frame(
  Season = group_season_tov$Season,
  Playoffs = group_season_tov$Playoffs,
  teams = group_season_tov$TOV_per_100[, "n"],
  avg_tov = group_season_tov$TOV_per_100[, "avg"]
)
group_season_tov
## Season Playoffs teams avg_tov
## 1 1973-74 FALSE 10 18.97000
## 2 1974-75 FALSE 10 18.95000
## 3 1975-76 FALSE 12 18.39167
## 4 1976-77 FALSE 10 19.29000
## 5 1977-78 FALSE 10 18.76000
## 6 1978-79 FALSE 10 18.58000
## 7 1979-80 FALSE 10 18.29000
## 8 1980-81 FALSE 11 18.06364
## 9 1981-82 FALSE 11 17.60909
## 10 1982-83 FALSE 11 18.60000
## 11 1983-84 FALSE 7 17.71429
## 12 1984-85 FALSE 7 18.18571
## 13 1985-86 FALSE 7 17.84286
## 14 1986-87 FALSE 7 17.87143
## 15 1987-88 FALSE 7 17.50000
## 16 1988-89 FALSE 9 17.81111
## 17 1989-90 FALSE 11 16.49091
## 18 1990-91 FALSE 11 16.35455
## 19 1991-92 FALSE 11 16.23636
## 20 1992-93 FALSE 11 16.98182
## 21 1993-94 FALSE 11 16.95455
## 22 1994-95 FALSE 11 17.63636
## 23 1995-96 FALSE 13 17.57692
## 24 1996-97 FALSE 13 17.69231
## 25 1997-98 FALSE 13 17.33077
## 26 1998-99 FALSE 13 17.16154
## 27 1999-00 FALSE 13 17.15385
## 28 2000-01 FALSE 13 16.92308
## 29 2001-02 FALSE 13 16.34615
## 30 2002-03 FALSE 13 16.94615
## 31 2003-04 FALSE 13 17.06923
## 32 2004-05 FALSE 14 16.06429
## 33 2005-06 FALSE 14 16.30000
## 34 2006-07 FALSE 14 16.82857
## 35 2007-08 FALSE 14 15.64286
## 36 2008-09 FALSE 14 15.65000
## 37 2009-10 FALSE 14 15.49286
## 38 2010-11 FALSE 14 15.70000
## 39 2011-12 FALSE 14 16.07857
## 40 2012-13 FALSE 14 15.65000
## 41 2013-14 FALSE 14 15.57857
## 42 2014-15 FALSE 14 15.50714
## 43 2015-16 FALSE 14 15.32143
## 44 2016-17 FALSE 14 14.36429
## 45 2017-18 FALSE 14 14.72143
## 46 2018-19 FALSE 14 14.18571
## 47 2019-20 FALSE 14 14.81429
## 48 2020-21 FALSE 14 14.20714
## 49 2021-22 FALSE 14 14.17143
## 50 2022-23 FALSE 30 14.10333
## 51 2023-24 FALSE 23 13.95652
## 52 1973-74 TRUE 17 18.25294
## 53 1974-75 TRUE 18 18.16667
## 54 1975-76 TRUE 15 18.34667
## 55 1976-77 TRUE 12 19.15833
## 56 1977-78 TRUE 12 18.60000
## 57 1978-79 TRUE 12 18.66667
## 58 1979-80 TRUE 12 18.18333
## 59 1980-81 TRUE 12 18.52500
## 60 1981-82 TRUE 12 17.31667
## 61 1982-83 TRUE 12 18.31667
## 62 1983-84 TRUE 16 17.42500
## 63 1984-85 TRUE 16 17.03750
## 64 1985-86 TRUE 16 17.12500
## 65 1986-87 TRUE 16 16.25000
## 66 1987-88 TRUE 16 16.38750
## 67 1988-89 TRUE 16 16.55000
## 68 1989-90 TRUE 16 16.06250
## 69 1990-91 TRUE 16 16.24375
## 70 1991-92 TRUE 16 15.78750
## 71 1992-93 TRUE 16 15.85625
## 72 1993-94 TRUE 16 16.60625
## 73 1994-95 TRUE 16 16.60625
## 74 1995-96 TRUE 16 16.76875
## 75 1996-97 TRUE 16 16.85000
## 76 1997-98 TRUE 16 16.76250
## 77 1998-99 TRUE 16 17.07500
## 78 1999-00 TRUE 16 15.99375
## 79 2000-01 TRUE 16 15.90625
## 80 2001-02 TRUE 16 15.43750
## 81 2002-03 TRUE 16 15.68750
## 82 2003-04 TRUE 16 16.05625
## 83 2004-05 TRUE 16 15.61250
## 84 2005-06 TRUE 16 15.34375
## 85 2006-07 TRUE 16 15.86250
## 86 2007-08 TRUE 16 14.76250
## 87 2008-09 TRUE 16 14.76875
## 88 2009-10 TRUE 16 14.96875
## 89 2010-11 TRUE 16 15.05000
## 90 2011-12 TRUE 16 15.64375
## 91 2012-13 TRUE 16 15.74375
## 92 2013-14 TRUE 16 15.35625
## 93 2014-15 TRUE 16 14.85000
## 94 2015-16 TRUE 16 14.51250
## 95 2016-17 TRUE 16 14.35625
## 96 2017-18 TRUE 16 14.44375
## 97 2018-19 TRUE 16 13.79375
## 98 2019-20 TRUE 16 14.05625
## 99 2020-21 TRUE 16 13.57500
## 100 2021-22 TRUE 16 13.70000
## 101 2023-24 TRUE 7 13.01429
group_season_tov$probability <-
  group_season_tov$teams / sum(group_season_tov$teams)
group_season_tov$rare_group <-
  group_season_tov$teams == min(group_season_tov$teams)
group_season_tov
## Season Playoffs teams avg_tov probability rare_group
## 1 1973-74 FALSE 10 18.97000 0.007132668 FALSE
## 2 1974-75 FALSE 10 18.95000 0.007132668 FALSE
## 3 1975-76 FALSE 12 18.39167 0.008559201 FALSE
## 4 1976-77 FALSE 10 19.29000 0.007132668 FALSE
## 5 1977-78 FALSE 10 18.76000 0.007132668 FALSE
## 6 1978-79 FALSE 10 18.58000 0.007132668 FALSE
## 7 1979-80 FALSE 10 18.29000 0.007132668 FALSE
## 8 1980-81 FALSE 11 18.06364 0.007845934 FALSE
## 9 1981-82 FALSE 11 17.60909 0.007845934 FALSE
## 10 1982-83 FALSE 11 18.60000 0.007845934 FALSE
## 11 1983-84 FALSE 7 17.71429 0.004992867 TRUE
## 12 1984-85 FALSE 7 18.18571 0.004992867 TRUE
## 13 1985-86 FALSE 7 17.84286 0.004992867 TRUE
## 14 1986-87 FALSE 7 17.87143 0.004992867 TRUE
## 15 1987-88 FALSE 7 17.50000 0.004992867 TRUE
## 16 1988-89 FALSE 9 17.81111 0.006419401 FALSE
## 17 1989-90 FALSE 11 16.49091 0.007845934 FALSE
## 18 1990-91 FALSE 11 16.35455 0.007845934 FALSE
## 19 1991-92 FALSE 11 16.23636 0.007845934 FALSE
## 20 1992-93 FALSE 11 16.98182 0.007845934 FALSE
## 21 1993-94 FALSE 11 16.95455 0.007845934 FALSE
## 22 1994-95 FALSE 11 17.63636 0.007845934 FALSE
## 23 1995-96 FALSE 13 17.57692 0.009272468 FALSE
## 24 1996-97 FALSE 13 17.69231 0.009272468 FALSE
## 25 1997-98 FALSE 13 17.33077 0.009272468 FALSE
## 26 1998-99 FALSE 13 17.16154 0.009272468 FALSE
## 27 1999-00 FALSE 13 17.15385 0.009272468 FALSE
## 28 2000-01 FALSE 13 16.92308 0.009272468 FALSE
## 29 2001-02 FALSE 13 16.34615 0.009272468 FALSE
## 30 2002-03 FALSE 13 16.94615 0.009272468 FALSE
## 31 2003-04 FALSE 13 17.06923 0.009272468 FALSE
## 32 2004-05 FALSE 14 16.06429 0.009985735 FALSE
## 33 2005-06 FALSE 14 16.30000 0.009985735 FALSE
## 34 2006-07 FALSE 14 16.82857 0.009985735 FALSE
## 35 2007-08 FALSE 14 15.64286 0.009985735 FALSE
## 36 2008-09 FALSE 14 15.65000 0.009985735 FALSE
## 37 2009-10 FALSE 14 15.49286 0.009985735 FALSE
## 38 2010-11 FALSE 14 15.70000 0.009985735 FALSE
## 39 2011-12 FALSE 14 16.07857 0.009985735 FALSE
## 40 2012-13 FALSE 14 15.65000 0.009985735 FALSE
## 41 2013-14 FALSE 14 15.57857 0.009985735 FALSE
## 42 2014-15 FALSE 14 15.50714 0.009985735 FALSE
## 43 2015-16 FALSE 14 15.32143 0.009985735 FALSE
## 44 2016-17 FALSE 14 14.36429 0.009985735 FALSE
## 45 2017-18 FALSE 14 14.72143 0.009985735 FALSE
## 46 2018-19 FALSE 14 14.18571 0.009985735 FALSE
## 47 2019-20 FALSE 14 14.81429 0.009985735 FALSE
## 48 2020-21 FALSE 14 14.20714 0.009985735 FALSE
## 49 2021-22 FALSE 14 14.17143 0.009985735 FALSE
## 50 2022-23 FALSE 30 14.10333 0.021398003 FALSE
## 51 2023-24 FALSE 23 13.95652 0.016405136 FALSE
## 52 1973-74 TRUE 17 18.25294 0.012125535 FALSE
## 53 1974-75 TRUE 18 18.16667 0.012838802 FALSE
## 54 1975-76 TRUE 15 18.34667 0.010699001 FALSE
## 55 1976-77 TRUE 12 19.15833 0.008559201 FALSE
## 56 1977-78 TRUE 12 18.60000 0.008559201 FALSE
## 57 1978-79 TRUE 12 18.66667 0.008559201 FALSE
## 58 1979-80 TRUE 12 18.18333 0.008559201 FALSE
## 59 1980-81 TRUE 12 18.52500 0.008559201 FALSE
## 60 1981-82 TRUE 12 17.31667 0.008559201 FALSE
## 61 1982-83 TRUE 12 18.31667 0.008559201 FALSE
## 62 1983-84 TRUE 16 17.42500 0.011412268 FALSE
## 63 1984-85 TRUE 16 17.03750 0.011412268 FALSE
## 64 1985-86 TRUE 16 17.12500 0.011412268 FALSE
## 65 1986-87 TRUE 16 16.25000 0.011412268 FALSE
## 66 1987-88 TRUE 16 16.38750 0.011412268 FALSE
## 67 1988-89 TRUE 16 16.55000 0.011412268 FALSE
## 68 1989-90 TRUE 16 16.06250 0.011412268 FALSE
## 69 1990-91 TRUE 16 16.24375 0.011412268 FALSE
## 70 1991-92 TRUE 16 15.78750 0.011412268 FALSE
## 71 1992-93 TRUE 16 15.85625 0.011412268 FALSE
## 72 1993-94 TRUE 16 16.60625 0.011412268 FALSE
## 73 1994-95 TRUE 16 16.60625 0.011412268 FALSE
## 74 1995-96 TRUE 16 16.76875 0.011412268 FALSE
## 75 1996-97 TRUE 16 16.85000 0.011412268 FALSE
## 76 1997-98 TRUE 16 16.76250 0.011412268 FALSE
## 77 1998-99 TRUE 16 17.07500 0.011412268 FALSE
## 78 1999-00 TRUE 16 15.99375 0.011412268 FALSE
## 79 2000-01 TRUE 16 15.90625 0.011412268 FALSE
## 80 2001-02 TRUE 16 15.43750 0.011412268 FALSE
## 81 2002-03 TRUE 16 15.68750 0.011412268 FALSE
## 82 2003-04 TRUE 16 16.05625 0.011412268 FALSE
## 83 2004-05 TRUE 16 15.61250 0.011412268 FALSE
## 84 2005-06 TRUE 16 15.34375 0.011412268 FALSE
## 85 2006-07 TRUE 16 15.86250 0.011412268 FALSE
## 86 2007-08 TRUE 16 14.76250 0.011412268 FALSE
## 87 2008-09 TRUE 16 14.76875 0.011412268 FALSE
## 88 2009-10 TRUE 16 14.96875 0.011412268 FALSE
## 89 2010-11 TRUE 16 15.05000 0.011412268 FALSE
## 90 2011-12 TRUE 16 15.64375 0.011412268 FALSE
## 91 2012-13 TRUE 16 15.74375 0.011412268 FALSE
## 92 2013-14 TRUE 16 15.35625 0.011412268 FALSE
## 93 2014-15 TRUE 16 14.85000 0.011412268 FALSE
## 94 2015-16 TRUE 16 14.51250 0.011412268 FALSE
## 95 2016-17 TRUE 16 14.35625 0.011412268 FALSE
## 96 2017-18 TRUE 16 14.44375 0.011412268 FALSE
## 97 2018-19 TRUE 16 13.79375 0.011412268 FALSE
## 98 2019-20 TRUE 16 14.05625 0.011412268 FALSE
## 99 2020-21 TRUE 16 13.57500 0.011412268 FALSE
## 100 2021-22 TRUE 16 13.70000 0.011412268 FALSE
## 101 2023-24 TRUE 7 13.01429 0.004992867 TRUE
barplot(
  height = group_season_tov$avg_tov,
  names.arg = paste(group_season_tov$Season, group_season_tov$Playoffs),
  las = 2,
  ylab = "AVG Turnovers per 100 Poss.",
  main = "Turnovers by Season and Playoff Status"
)
The main insight from the second data frame is not the gap between playoff and non-playoff teams, but that average turnovers have decreased over time for both groups. Playoff teams do average slightly fewer turnovers than non-playoff teams in most seasons, so my hypothesis holds in direction, but the gap stays under two turnovers per 100 possessions. By comparison, averages fell from roughly 18 to 19 turnovers per 100 possessions in the 1970s to roughly 13 to 14 in the 2020s. This shows that NBA teams as a whole have become steadily better at minimizing turnovers. One caveat is that the table lists all 30 teams in 2022-23 as missing the playoffs and only 7 playoff teams in 2023-24, so the Playoffs flag appears to be incomplete for the most recent seasons. If I were to ask additional questions on this topic, I would want to know whether minimizing turnovers on offense or forcing turnovers on defense matters more for making the playoffs.
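To compare the within-season gap with the change across eras, the long table can be reshaped so that each season has one column per playoff status. The following is a minimal sketch using only five seasons copied from the `group_season_tov` output; the subset and the names `tov_subset`, `tov_gap`, and `era_change` are illustrative and not part of the original analysis.

```r
# A few rows copied from the group_season_tov output (illustrative subset)
tov_subset <- data.frame(
  Season   = rep(c("1973-74", "1986-87", "1999-00", "2012-13", "2021-22"),
                 times = 2),
  Playoffs = rep(c(FALSE, TRUE), each = 5),
  avg_tov  = c(18.97000, 17.87143, 17.15385, 15.65000, 14.17143,
               18.25294, 16.25000, 15.99375, 15.74375, 13.70000)
)

# Split by playoff status and join the two halves on Season
missed <- tov_subset[!tov_subset$Playoffs, c("Season", "avg_tov")]
made   <- tov_subset[tov_subset$Playoffs,  c("Season", "avg_tov")]
tov_gap <- merge(missed, made, by = "Season",
                 suffixes = c("_missed", "_made"))

# Positive gap: non-playoff teams turned the ball over more that season
tov_gap$gap <- tov_gap$avg_tov_missed - tov_gap$avg_tov_made
tov_gap

# Change across eras for non-playoff teams, for comparison with the gaps
era_change <- tov_gap$avg_tov_missed[tov_gap$Season == "1973-74"] -
  tov_gap$avg_tov_missed[tov_gap$Season == "2021-22"]
era_change
```

In this subset the within-season gap is smaller than the drop between 1973-74 and 2021-22, and in 2012-13 it is slightly negative, meaning playoff teams averaged more turnovers that season.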
The third data frame shows the number of team-seasons in NBA history that fall into the “Low”, “Medium”, “High”, and “Elite” scoring tiers, based on points per 100 possessions.
Low - A team with a poor NBA offense, at most 110 points (high probability)
Medium - A team with an average NBA offense, above 110 and at most 115 points (average probability)
High - A team with a very good NBA offense, above 115 and at most 120 points (low probability)
Elite - A team that is one of the greatest offenses in NBA history, above 120 points (very low probability)
My hypothesis is that the “Elite” scoring tier will be very exclusive and only see some of the greatest offensive teams of all time.
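The tiers are assigned with `cut()`, which by default treats each interval as closed on the right, so a value exactly on a cutoff falls into the lower tier. The sketch below illustrates this with made-up values placed on and around the cutoffs of 110, 115, and 120 used in this section; `example_pts` and `example_tier` are hypothetical names and the values are not taken from the data.

```r
# Illustrative values chosen to sit on and around the cutoffs
example_pts <- c(104.7, 110.0, 110.1, 115.0, 119.9, 120.0, 122.1)

# Same breaks and labels as the scoring_tier definition in this section
example_tier <- cut(
  example_pts,
  breaks = c(0, 110, 115, 120, Inf),
  labels = c("Low", "Medium", "High", "Elite")
)

# 110.0 is labeled Low and 120.0 is labeled High because right = TRUE
data.frame(example_pts, example_tier)
```

Passing `right = FALSE` would instead place boundary values in the higher tier, which could shift a few team-seasons between tiers.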
NBA_Stats_100$scoring_tier <- cut(
  NBA_Stats_100$PTS_per_100,
  breaks = c(0, 110, 115, 120, Inf),
  labels = c("Low", "Medium", "High", "Elite")
)
group_scoring_tier <- aggregate(
  PTS_per_100 ~ scoring_tier,
  data = NBA_Stats_100,
  FUN = function(x) c(
    avg = mean(x, na.rm = TRUE),
    n = length(x)
  )
)
group_scoring_tier <- data.frame(
  scoring_tier = group_scoring_tier$scoring_tier,
  teams = group_scoring_tier$PTS_per_100[, "n"],
  avg_pts = group_scoring_tier$PTS_per_100[, "avg"]
)
group_scoring_tier
## scoring_tier teams avg_pts
## 1 Low 1072 104.6693
## 2 Medium 275 112.0356
## 3 High 53 116.6642
## 4 Elite 2 122.1000
group_scoring_tier$probability <-
  group_scoring_tier$teams / sum(group_scoring_tier$teams)
group_scoring_tier$rare_group <-
  group_scoring_tier$probability == min(group_scoring_tier$probability)
group_scoring_tier
## scoring_tier teams avg_pts probability rare_group
## 1 Low 1072 104.6693 0.764621969 FALSE
## 2 Medium 275 112.0356 0.196148359 FALSE
## 3 High 53 116.6642 0.037803138 FALSE
## 4 Elite 2 122.1000 0.001426534 TRUE
barplot(
  height = group_scoring_tier$teams,
  names.arg = group_scoring_tier$scoring_tier,
  col = "darkseagreen",
  ylab = "Number of Team-Seasons",
  main = "Distribution of Teams by Scoring Tier"
)
The third data frame shows that the large majority of NBA offenses across history, 1,072 of 1,402 team-seasons (about 76%), fall into the “Low” tier. Only 53 offenses (about 4%) reach the “High” tier, and only 2 qualify as “Elite”. This matches my hypothesis that the “Elite” tier would contain only a handful of the highest-scoring offenses in NBA history. Because the tiers use fixed cutoffs, this distribution may reflect differences between eras as much as differences in the talent levels of individual offenses. If I were to continue diving into this topic, I would want to know whether modern NBA offenses score more points per 100 possessions than offenses from before the 2000s.
The combination data frame combines the scoring tier of a team with whether or not they made the playoffs.
If I were to make a prediction I would say that the low scoring tier will be more associated with non-playoff teams and that the medium, high, and elite scoring tiers will be more associated with playoff teams.
all_combinations <- expand.grid(
  Playoffs = unique(NBA_Stats_100$Playoffs),
  scoring_tier = levels(NBA_Stats_100$scoring_tier)
)
all_combinations
## Playoffs scoring_tier
## 1 FALSE Low
## 2 TRUE Low
## 3 FALSE Medium
## 4 TRUE Medium
## 5 FALSE High
## 6 TRUE High
## 7 FALSE Elite
## 8 TRUE Elite
observed_combinations <- unique(
  NBA_Stats_100[, c("Playoffs", "scoring_tier")]
)
observed_combinations
## Playoffs scoring_tier
## 1 FALSE High
## 2 TRUE Elite
## 3 FALSE Medium
## 5 FALSE Low
## 7 TRUE High
## 12 FALSE Elite
## 62 TRUE Medium
## 104 TRUE Low
missing_combinations <- all_combinations[
  !apply(all_combinations, 1, function(row)
    any(
      observed_combinations$Playoffs == row[1] &
        observed_combinations$scoring_tier == row[2]
    )
  ),
]
missing_combinations
## [1] Playoffs scoring_tier
## <0 rows> (or 0-length row.names)
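The check above loops over rows with `apply()`, which converts each row to character before comparing, so the result depends on how the values are formatted. An alternative sketch is to cross-tabulate the two columns and keep the cells with a count of zero. It assumes `Playoffs` is logical and `scoring_tier` is a factor carrying all four levels; the `toy` data frame below is hypothetical and stands in for `NBA_Stats_100`.

```r
# Hypothetical stand-in for NBA_Stats_100 with two of the same columns
toy <- data.frame(
  Playoffs = c(TRUE, FALSE, TRUE, FALSE),
  scoring_tier = factor(
    c("Low", "Low", "Medium", "High"),
    levels = c("Low", "Medium", "High", "Elite")
  )
)

# Cross-tabulate; unused factor levels still appear with a count of 0
toy_counts <- as.data.frame(
  table(Playoffs = toy$Playoffs, scoring_tier = toy$scoring_tier)
)

# Combinations that never occur in the data
toy_missing <- toy_counts[toy_counts$Freq == 0, c("Playoffs", "scoring_tier")]
toy_missing
```

Because `table()` keeps every factor level, no separate grid of all combinations is needed, and the same counts can be reused for the plot.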
combo_counts <- as.data.frame(
  table(NBA_Stats_100$Playoffs, NBA_Stats_100$scoring_tier)
)
colnames(combo_counts) <- c("Playoffs", "Scoring_Tier", "Count")
combo_counts
## Playoffs Scoring_Tier Count
## 1 FALSE Low 551
## 2 TRUE Low 521
## 3 FALSE Medium 60
## 4 TRUE Medium 215
## 5 FALSE High 25
## 6 TRUE High 28
## 7 FALSE Elite 1
## 8 TRUE Elite 1
barplot(
  combo_counts$Count,
  names.arg = paste(combo_counts$Playoffs, combo_counts$Scoring_Tier),
  las = 2,
  ylab = "Number of Team-Seasons",
  main = "Playoff Status by Scoring Tier"
)
The combination data frame shows that scoring tier and making the playoffs are associated, but not perfectly. Teams in the low scoring tier missed the playoffs slightly more often than they made them (551 versus 521). Teams in the medium scoring tier made the playoffs far more often than they missed them (215 versus 60, or about 78%). However, the split is almost exactly even for the high tier (28 versus 25) and the elite tier (1 versus 1). Taken at face value, this suggests that teams in the medium tier had the best chance of making the playoffs, although the high and elite tiers contain few team-seasons, so their rates are less reliable. If I were to dive deeper into this topic, I would want to find out whether the high and elite scoring tier teams are predominantly modern NBA teams that are just competing with each other for playoff spots.
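Because the tiers differ so much in size, raw counts are hard to compare directly. The sketch below converts the counts into the share of each tier that made the playoffs. The counts are copied from the `combo_counts` output; the names `combo_matrix`, `playoff_rate`, and `overall_rate` are illustrative.

```r
# Counts copied from the combo_counts output (rows: tier, columns: Playoffs)
combo_matrix <- matrix(
  c(551, 521,
     60, 215,
     25,  28,
      1,   1),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    Scoring_Tier = c("Low", "Medium", "High", "Elite"),
    Playoffs = c("FALSE", "TRUE")
  )
)

# Row-wise proportions: share of each tier that missed or made the playoffs
playoff_rate <- prop.table(combo_matrix, margin = 1)
round(playoff_rate, 3)

# Overall share of team-seasons that made the playoffs, for comparison
overall_rate <- sum(combo_matrix[, "TRUE"]) / sum(combo_matrix)
round(overall_rate, 3)
```

With only two team-seasons in the elite tier, its 50% rate says very little on its own, so the low and medium rows carry most of the information.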