In this week’s data dive I am going to isolate 3 samples from the NBA Team Stats data that I have been using. I will first build a data frame with the random samples. I will then scrutinize the samples and draw conclusions about the following subjects:
How different they are
Anomalies
Is there consistency?
Effects of changing relative size (10%, 25%, 75%)
Lastly, I will consider how this investigation affects how I might draw conclusions about the data in the future.
sample_frac <- 0.25 # Tried 0.10, 0.25, and 0.75
n_samples <- 3
df_samples <- data.frame()
set.seed(123) # Fix the seed so the samples are reproducible
for (i in 1:n_samples) {
  sample_size <- round(sample_frac * nrow(NBA_Stats_100))
  # Draw row indices with replacement, so a team-season can repeat within a sample
  df_i <- NBA_Stats_100[
    sample(1:nrow(NBA_Stats_100), size = sample_size, replace = TRUE),
  ]
  df_i$sample_num <- i # Tag each row with its sample number
  df_samples <- rbind(df_samples, df_i)
}
head(df_samples)
##      Season League                   Team Abbreviation Playoffs Games
## 415 2010-11    NBA Portland Trail Blazers          POR     TRUE    82
## 463 2008-09    NBA     Los Angeles Lakers          LAL     TRUE    82
## 179 2018-19    NBA              Utah Jazz          UTA     TRUE    82
## 526 2006-07    NBA        Milwaukee Bucks          MIL    FALSE    82
## 195 2017-18    NBA      Memphis Grizzlies          MEM    FALSE    82
## 938 1992-93    NBA       Sacramento Kings          SAC    FALSE    82
##     Minutes_Played FG_per_100 FGA_per_100 FG_Percent X3p_per_100 X3pa_per_100
## 415          19805       40.7        91.0      0.447         7.1         20.7
## 463          19780       42.6        89.8      0.474         7.0         19.5
## 179          19755       40.1        85.8      0.468        12.0         33.8
## 526          19855       40.9        87.9      0.465         6.9         19.3
## 195          19705       38.7        87.1      0.444         9.7         27.6
## 938          19780       40.4        87.4      0.463         3.2          9.5
##     X3p_Percent X2p_per_100 X2pa_per_100 X2p_Percent FT_per_100 FTA_per_100
## 415       0.345        33.5         70.2       0.477       20.3        25.3
## 463       0.361        35.5         70.3       0.505       20.7        26.9
## 179       0.356        28.1         52.0       0.541       18.7        25.3
## 526       0.356        34.0         68.6       0.495       18.4        25.1
## 195       0.352        28.9         59.5       0.486       17.5        22.2
## 938       0.332        37.3         78.0       0.478       22.5        29.5
##     FT_Percent ORB_per_100 DRB_per_100 TRB_per_100 AST_per_100 STL_per_100
## 415      0.804        13.7        30.7        44.5        23.9         9.1
## 463      0.770        13.1        33.3        46.3        24.5         9.2
## 179      0.736         9.9        36.1        46.0        25.8         8.0
## 526      0.733        12.3        29.7        42.1        23.2         7.7
## 195      0.786        10.0        32.6        42.6        22.7         7.9
## 938      0.762        13.7        27.5        41.1        25.0         9.2
##     BLK_per_100 TOV_per_100 PF_per_100 PTS_per_100 sample_num
## 415         4.9        14.7       21.8       108.8          1
## 463         5.4        14.2       21.8       112.8          1
## 179         5.8        15.0       20.9       110.9          1
## 526         2.9        16.2       23.8       107.0          1
## 195         5.1        15.7       24.4       104.5          1
## 938         4.2        16.4       25.1       106.5          1
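Because sample() is called with replace = TRUE, the same team-season can show up more than once within a sample. One way to check this is sketched below. It is a minimal sketch that uses a small made-up stand-in for NBA_Stats_100; the `toy` data frame, its `id` column, and its values are assumptions for illustration only, not the real data.

```r
# Stand-in data: 'toy' is a made-up placeholder for NBA_Stats_100 (assumption).
set.seed(1)
toy <- data.frame(
  id = 1:200,
  PTS_per_100 = rnorm(200, mean = 106, sd = 4)
)

sample_frac <- 0.25
n_samples <- 3
sample_size <- round(sample_frac * nrow(toy))

# Draw the samples with replacement, as in the loop above.
toy_samples <- do.call(rbind, lapply(seq_len(n_samples), function(i) {
  idx <- sample(seq_len(nrow(toy)), size = sample_size, replace = TRUE)
  out <- toy[idx, ]
  out$sample_num <- i
  out
}))

# Rows per sample: each count should equal sample_size.
rows_per_sample <- table(toy_samples$sample_num)

# Repeated rows within each sample (possible only because replace = TRUE).
dups_per_sample <- tapply(
  toy_samples$id, toy_samples$sample_num,
  function(x) sum(duplicated(x))
)

rows_per_sample
dups_per_sample
```

If the duplicate counts are a concern, setting replace = FALSE would give samples with no repeated rows, at the cost of changing the sampling scheme used here.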
There aren’t many insights to provide for the first part of the data dive. The only insight I can provide at the moment is that sample_frac plays a major role in how much the samples vary from one another. This part of the data dive is significant because it sets up the samples for the next parts. The only question I have is: how much of an impact does changing the relative size of the samples have?
This section of the data dive looks at what stands out about the subsamples.
How different are they?
To illustrate this point, I chose two examples: points per 100 possessions and whether the team made the playoffs.
The actual average for PTS_per_100 in the full data is 106.593.
The actual percentage of playoff teams in the full data is 56.3%.
The first sample is about 0.04 points too high on its points average and about 0.2 percentage points too high on its share of playoff teams.
The second sample is about 0.12 points too low on its points average and 2.3 percentage points too low on its share of playoff teams.
The third sample is also about 0.04 points too high on its points average and about 0.2 percentage points too high on its share of playoff teams.
Based on these results, I would not consider the first or third sample an anomaly. I would, however, call the second sample an anomaly because its averages sit noticeably further from the full-data values. Looking at the data more closely, I would be more likely to consider a team in the “High” or “Elite” scoring tier an anomaly in the second sample.
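The gaps quoted above can be computed directly instead of by hand. The following is a minimal sketch, assuming a made-up stand-in data frame (`toy`) and stand-in samples (`toy_samples`) in place of NBA_Stats_100 and df_samples; the values are illustrative only, and the derived column names are hypothetical.

```r
# Stand-in data: 'toy' and 'toy_samples' are made-up placeholders (assumption).
set.seed(2)
toy <- data.frame(
  PTS_per_100 = rnorm(400, mean = 106, sd = 4),
  Playoffs = runif(400) < 0.56
)
toy_samples <- do.call(rbind, lapply(1:3, function(i) {
  out <- toy[sample(seq_len(nrow(toy)), size = 100, replace = TRUE), ]
  out$sample_num <- i
  out
}))

# Full-data benchmarks.
full_pts <- mean(toy$PTS_per_100)
full_playoffs <- mean(toy$Playoffs)

# Per-sample means, then the signed difference from the full data.
by_sample <- aggregate(
  cbind(PTS_per_100, Playoffs) ~ sample_num,
  data = toy_samples,
  FUN = mean
)
by_sample$pts_diff <- by_sample$PTS_per_100 - full_pts
# Playoff difference expressed in percentage points.
by_sample$playoff_diff_pp <- 100 * (by_sample$Playoffs - full_playoffs)

by_sample
```

A positive value means the sample is too high relative to the full data, and a negative value means it is too low.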
This comparison shifts when the relative size changes to 10% or 75%.
When I changed the relative size to 10%:
The first sample is 0.16 points too low on points average and 0.6 percentage points too low on playoff percentage.
The second sample is 0.01 points too high on points average and 0.8 percentage points too high on playoff percentage.
The third sample is 0.1 points too low on points average and 5.6 percentage points too low on playoff percentage.
When I changed the relative size to 75%, I saw slightly more accurate results in the samples:
The first sample is 0.01 points too low on points average and 0.6 percentage points too low on playoff percentage.
The second sample is 0.09 points too high on points average and 2.6 percentage points off on playoff percentage.
The third sample is 0.02 points too low on points average and 1.4 percentage points too low on playoff percentage.
As you can see, the gaps between the sample values and the full-data values change noticeably whenever the relative sample size changes.
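The comparison across relative sizes can also be run in one pass instead of editing sample_frac by hand. This is a minimal sketch, assuming a made-up stand-in data frame (`toy`) and a hypothetical helper `mean_gaps()`; neither comes from the original analysis.

```r
# Stand-in data: 'toy' is a made-up placeholder for NBA_Stats_100 (assumption).
set.seed(3)
toy <- data.frame(PTS_per_100 = rnorm(1000, mean = 106, sd = 4))

# Hypothetical helper: draw n_samples samples at one fraction and return the
# absolute gap between each sample mean and the full-data mean.
mean_gaps <- function(data, frac, n_samples = 3) {
  size <- round(frac * nrow(data))
  sapply(seq_len(n_samples), function(i) {
    idx <- sample(seq_len(nrow(data)), size = size, replace = TRUE)
    abs(mean(data$PTS_per_100[idx]) - mean(data$PTS_per_100))
  })
}

# One column per relative size, one row per sample.
fracs <- c(0.10, 0.25, 0.75)
gaps <- sapply(fracs, function(f) mean_gaps(toy, f))
colnames(gaps) <- paste0(100 * fracs, "%")
rownames(gaps) <- paste("Sample", 1:3)
gaps
```

With only three samples per size, a single column can still look better or worse than expected, so the table is best read as a rough comparison.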
pts_by_sample <- aggregate(
  PTS_per_100 ~ sample_num,
  data = df_samples,
  FUN = mean,
  na.rm = TRUE
)
pts_by_sample
##   sample_num PTS_per_100
## 1          1    106.6366
## 2          2    106.4763
## 3          3    106.6349
barplot(
  pts_by_sample$PTS_per_100,
  names.arg = paste("Sample", pts_by_sample$sample_num),
  ylab = "Average Points per 100 Possessions",
  main = "Average Scoring Across Random Subsamples"
)
playoff_rate <- aggregate(
  Playoffs ~ sample_num,
  data = df_samples,
  FUN = mean
)
playoff_rate
##   sample_num  Playoffs
## 1          1 0.5657143
## 2          2 0.5400000
## 3          3 0.5657143
df_samples$scoring_tier <- cut(
  df_samples$PTS_per_100,
  breaks = c(0, 110, 115, 120, Inf),
  labels = c("Low", "Medium", "High", "Elite")
)
tier_counts <- as.data.frame(
  table(df_samples$sample_num, df_samples$scoring_tier)
)
colnames(tier_counts) <- c("Sample", "Scoring_Tier", "Count")
tier_counts
##    Sample Scoring_Tier Count
## 1       1          Low   271
## 2       2          Low   264
## 3       3          Low   271
## 4       1       Medium    68
## 5       2       Medium    68
## 6       3       Medium    64
## 7       1         High    11
## 8       2         High    18
## 9       3         High    14
## 10      1        Elite     0
## 11      2        Elite     0
## 12      3        Elite     1
One insight I gathered from this section of the data dive is that the larger the relative size of the sample, the less volatile the results tend to be. This is significant because it shows that increasing the relative size of the samples does not guarantee a more accurate reflection of the data, but it does substantially improve the chances that a sample represents the data as a whole. My question is: how different could a 5% sample be from a sample of 75% or more?
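The pattern described here is consistent with the standard error of a sample mean. As a sketch, assume $n$ rows are drawn with replacement from data with $N$ rows and standard deviation $\sigma$ (these symbols are introduced here for illustration and do not appear in the analysis above):

```latex
\mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}},
\qquad
\frac{\mathrm{SE}_{10\%}}{\mathrm{SE}_{75\%}}
  = \sqrt{\frac{0.75\,N}{0.10\,N}}
  = \sqrt{7.5} \approx 2.74
```

Under that assumption, the mean of a 10% sample would typically land about 2.7 times further from the full-data mean than the mean of a 75% sample. This describes typical behavior only, so an individual large sample can still miss by more than an individual small one, as the second 75% sample (0.09 points off) did relative to the second 10% sample (0.01 points off).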
After completing this data dive, I have changed my perspective on how I might draw conclusions about data in the future. Going forward, I will understand the value of having a larger sample of data. If I want the most accurate analysis, I will try to use as large a sample of the data as possible.
Additionally, I may take into account how certain averages or values could be skewed if I separate my data into a smaller group of teams that only fit a certain characteristic. This data dive as a whole greatly helped my understanding of sampling and of how results can be skewed.
My main takeaway from drawing these conclusions is that an analysis can be skewed when the data is limited in some way. This is significant because a smaller sample of data is less indicative of the larger population as a whole. My only question is: what are some methods for analyzing subsections of data without introducing bias or skewing the results?