Week 4 Data Dive - Sampling and Drawing Conclusions

In this week’s data dive I will draw 3 random samples from the NBA Team Stats data that I have been using. I will first build a data frame that holds the random samples. I will then scrutinize the samples to see how different they are from one another and from the full data set.

Lastly, I will consider how this investigation affects how I might draw conclusions about the data in the future.

Building a Data Frame with Random Samples

sample_frac <- 0.25   # Tried 0.10, 0.25, and 0.75
n_samples <- 3

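df_samples <- data.frame()
set.seed(123)   # fix the seed so the same rows are drawn on every run

# Number of rows in each sample (a fixed fraction of the full data)
sample_size <- round(sample_frac * nrow(NBA_Stats_100))

for (i in seq_len(n_samples)) {
  # Draw row indices with replacement, so a row can appear more than once
  df_i <- NBA_Stats_100[
    sample(seq_len(nrow(NBA_Stats_100)), size = sample_size, replace = TRUE),
  ]
  df_i$sample_num <- i   # tag each row with the sample it belongs to
  df_samples <- rbind(df_samples, df_i)
}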

head(df_samples)
##      Season League                   Team Abbreviation Playoffs Games
## 415 2010-11    NBA Portland Trail Blazers          POR     TRUE    82
## 463 2008-09    NBA     Los Angeles Lakers          LAL     TRUE    82
## 179 2018-19    NBA              Utah Jazz          UTA     TRUE    82
## 526 2006-07    NBA        Milwaukee Bucks          MIL    FALSE    82
## 195 2017-18    NBA      Memphis Grizzlies          MEM    FALSE    82
## 938 1992-93    NBA       Sacramento Kings          SAC    FALSE    82
##     Minutes_Played FG_per_100 FGA_per_100 FG_Percent X3p_per_100 X3pa_per_100
## 415          19805       40.7        91.0      0.447         7.1         20.7
## 463          19780       42.6        89.8      0.474         7.0         19.5
## 179          19755       40.1        85.8      0.468        12.0         33.8
## 526          19855       40.9        87.9      0.465         6.9         19.3
## 195          19705       38.7        87.1      0.444         9.7         27.6
## 938          19780       40.4        87.4      0.463         3.2          9.5
##     X3p_Percent X2p_per_100 X2pa_per_100 X2p_Percent FT_per_100 FTA_per_100
## 415       0.345        33.5         70.2       0.477       20.3        25.3
## 463       0.361        35.5         70.3       0.505       20.7        26.9
## 179       0.356        28.1         52.0       0.541       18.7        25.3
## 526       0.356        34.0         68.6       0.495       18.4        25.1
## 195       0.352        28.9         59.5       0.486       17.5        22.2
## 938       0.332        37.3         78.0       0.478       22.5        29.5
##     FT_Percent ORB_per_100 DRB_per_100 TRB_per_100 AST_per_100 STL_per_100
## 415      0.804        13.7        30.7        44.5        23.9         9.1
## 463      0.770        13.1        33.3        46.3        24.5         9.2
## 179      0.736         9.9        36.1        46.0        25.8         8.0
## 526      0.733        12.3        29.7        42.1        23.2         7.7
## 195      0.786        10.0        32.6        42.6        22.7         7.9
## 938      0.762        13.7        27.5        41.1        25.0         9.2
##     BLK_per_100 TOV_per_100 PF_per_100 PTS_per_100 sample_num
## 415         4.9        14.7       21.8       108.8          1
## 463         5.4        14.2       21.8       112.8          1
## 179         5.8        15.0       20.9       110.9          1
## 526         2.9        16.2       23.8       107.0          1
## 195         5.1        15.7       24.4       104.5          1
## 938         4.2        16.4       25.1       106.5          1
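
One detail worth checking at this point is that sample() is called with replace = TRUE, so the same row can be drawn more than once within a sample. The sketch below uses a small made-up data frame (toy_df is a hypothetical stand-in, not the NBA data) to show one way to count repeated draws with and without replacement.

```r
# Hypothetical stand-in for NBA_Stats_100; the values are made up.
toy_df <- data.frame(id = 1:20, PTS_per_100 = seq(100, 119))

set.seed(123)
with_repl    <- toy_df[sample(seq_len(nrow(toy_df)), size = 10, replace = TRUE), ]
without_repl <- toy_df[sample(seq_len(nrow(toy_df)), size = 10, replace = FALSE), ]

# Repeated draws: rows whose id already appeared earlier in the same sample
n_dup_with    <- sum(duplicated(with_repl$id))
n_dup_without <- sum(duplicated(without_repl$id))  # always 0 without replacement

n_dup_with
n_dup_without
```

On the real data, table(df_samples$sample_num) would confirm how many rows each sample holds.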

Insights, Significance, and Further Questions:

There aren’t many insights to provide for the first part of the data dive. The only insight I can provide at the moment is that sample_frac plays a major role in how much the samples vary from one another. This part of the data dive is significant because it sets up the samples for the next parts. The only question I have is: how much of an impact does changing the relative size of the samples have?
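
A rough way to reason about that question: for draws made with replacement, the standard error of a sample mean is sd / sqrt(n), so it shrinks as the sample grows. The sketch below works through the three fractions tried for sample_frac. The row count n_total = 1400 is an assumption for illustration, not a value reported in this write-up.

```r
n_total <- 1400                     # assumed number of rows in the full data
fracs   <- c(0.10, 0.25, 0.75)      # the three values tried for sample_frac
n       <- round(fracs * n_total)   # rows per sample at each fraction

# Standard error is proportional to 1 / sqrt(n); express it relative to the 25% sample
rel_se <- sqrt(n[2] / n)

data.frame(sample_frac = fracs, rows = n, se_relative_to_25pct = round(rel_se, 2))
```

Under these assumptions a 10% sample is about 1.6 times as noisy as a 25% sample, and a 75% sample is a little more than half as noisy.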

Scrutinizing Sub-samples

This section of the data dive looks at what stands out about the sub-samples.

How different are they?

To illustrate this point, I chose two examples: points per 100 possessions and whether the team made the playoffs.

The actual average for PTS_per_100 is 106.593.

The actual percentage of playoff teams in the data is 56.3%.

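Because of these results, I would not call the first and third samples anomalies: their averages (about 106.64 points and a 56.6% playoff rate) sit close to the full-data values. The second sample stands out by comparison. Its averages (106.48 points and a 54.0% playoff rate) are the furthest from the full-data values, so I would call it the anomaly of the three. Going into the data specifically, I would be more likely to consider a team in the “High” or “Elite” scoring tier to be an anomaly in the second sample.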

This comparison changes drastically when you change the relative size to 10% or 75%.

When I changed the relative size to 10%:

When I changed the relative size to 75%, I saw slightly more accurate results in the samples.

As you can see, the results change noticeably whenever you change the relative sample size.

pts_by_sample <- aggregate(
  PTS_per_100 ~ sample_num,
  data = df_samples,
  FUN = mean,
  na.rm = TRUE
)

pts_by_sample
##   sample_num PTS_per_100
## 1          1    106.6366
## 2          2    106.4763
## 3          3    106.6349

# Bar chart of the average PTS_per_100 in each sample
barplot(
  pts_by_sample$PTS_per_100,
  names.arg = paste("Sample", pts_by_sample$sample_num),
  ylab = "Average Points per 100 Possessions",
  main = "Average Scoring Across Random Subsamples"
)

playoff_rate <- aggregate(
  Playoffs ~ sample_num,
  data = df_samples,
  FUN = mean
)

playoff_rate
##   sample_num  Playoffs
## 1          1 0.5657143
## 2          2 0.5400000
## 3          3 0.5657143
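
# Bucket each row into a scoring tier. cut() uses right-closed intervals by
# default, so "Low" covers (0, 110] and "Medium" covers (110, 115].
df_samples$scoring_tier <- cut(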
  df_samples$PTS_per_100,
  breaks = c(0, 110, 115, 120, Inf),
  labels = c("Low", "Medium", "High", "Elite")
)

tier_counts <- as.data.frame(
  table(df_samples$sample_num, df_samples$scoring_tier)
)

colnames(tier_counts) <- c("Sample", "Scoring_Tier", "Count")

tier_counts
##    Sample Scoring_Tier Count
## 1       1          Low   271
## 2       2          Low   264
## 3       3          Low   271
## 4       1       Medium    68
## 5       2       Medium    68
## 6       3       Medium    64
## 7       1         High    11
## 8       2         High    18
## 9       3         High    14
## 10      1        Elite     0
## 11      2        Elite     0
## 12      3        Elite     1
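
The playoff rates printed above can be put on a common scale by asking how many standard errors each sample sits from the full-data rate. The sketch below is one rough way to do that, assuming a simple binomial standard error and n = 350 rows per sample (the tier counts above sum to 350 for each sample). It is a back-of-the-envelope check, not a formal test.

```r
p_full <- 0.563   # full-data playoff rate reported in the text
n      <- 350     # rows per sample, taken from the tier counts above
rates  <- c(0.5657143, 0.5400000, 0.5657143)   # playoff_rate output above

# Binomial standard error of a proportion for samples of size n
se <- sqrt(p_full * (1 - p_full) / n)

# Distance of each sample from the full-data rate, in standard errors
z <- (rates - p_full) / se
round(z, 2)
```

With these numbers all three samples fall within one standard error of the full-data rate, and the second sample is the furthest away at a little under one standard error.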

Insights, Significance, and Further Questions:

One insight I gathered from this section of the data dive is that the larger the relative size of the sample, the less volatile the results are. This is significant because it shows that increasing the relative size of the samples does not guarantee a more accurate reflection of the data, but it does drastically improve your chances of drawing a sample that represents the data as a whole. My question would be: how drastically different could a 5% sample be from a sample of 75% or higher?
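
One way to explore that question is to draw many samples at each fraction and compare how much the sample means spread out. The sketch below does this on simulated values. The vector pts is a hypothetical stand-in for PTS_per_100 (the row count of 1400 and the standard deviation of 4 are assumptions), and spread_of_means is a helper written only for this illustration.

```r
set.seed(42)
# Hypothetical stand-in for the PTS_per_100 column; values are simulated.
pts <- rnorm(1400, mean = 106.6, sd = 4)

# Standard deviation of the sample mean across many repeated samples
spread_of_means <- function(x, frac, reps = 1000) {
  n <- round(frac * length(x))
  means <- replicate(reps, mean(sample(x, size = n, replace = TRUE)))
  sd(means)
}

sd_05 <- spread_of_means(pts, 0.05)
sd_75 <- spread_of_means(pts, 0.75)

c(frac_05 = sd_05, frac_75 = sd_75, ratio = sd_05 / sd_75)
```

In theory the ratio should land near sqrt(0.75 / 0.05), which is about 3.9, so a 5% sample mean would be expected to bounce around roughly four times as much as a 75% sample mean.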

Drawing Conclusions

After completing this data dive, I have certainly altered my perspective on how I might draw conclusions about data in the future. Going forward, I will understand the value of having a larger sample of data. If I want the most accurate analysis, I will try to work with as large a sample of the data as possible.

Additionally, I will keep in mind how certain averages or values could be skewed if I separate my data into a smaller group of teams that only fit a certain characteristic. This data dive as a whole greatly helped my understanding of sampling and of how results can be skewed.
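
The difference between a random sample and a group chosen by a characteristic can be sketched as follows. The data frame toy_df is a hypothetical stand-in with simulated values, and the scoring gap between playoff and non-playoff teams is built into the simulation, so the sketch only illustrates the mechanism and says nothing about the real NBA data.

```r
set.seed(1)
# Hypothetical stand-in data: playoff teams are simulated to score slightly more.
n_rows   <- 1000
playoffs <- rbinom(n_rows, size = 1, prob = 0.56) == 1
toy_df <- data.frame(
  Playoffs    = playoffs,
  PTS_per_100 = rnorm(n_rows, mean = ifelse(playoffs, 108, 105), sd = 3)
)

full_mean <- mean(toy_df$PTS_per_100)

# A random 25% sample: every row has the same chance of being drawn
random_rows <- sample(seq_len(n_rows), size = 250, replace = TRUE)
random_mean <- mean(toy_df$PTS_per_100[random_rows])

# A filtered subgroup: only rows that meet a condition
playoff_mean <- mean(toy_df$PTS_per_100[toy_df$Playoffs])

round(c(full = full_mean, random_sample = random_mean, playoff_only = playoff_mean), 2)
```

In this setup the random sample stays close to the full mean, while the playoff-only group is shifted upward because membership in the group is related to the value being averaged.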

Insights, Significance, and Further Questions:

The main insight from my Drawing Conclusions section is that analysis can be skewed if the data is limited in some capacity. This is significant because I now understand how a smaller sample of data is less indicative of the larger population as a whole. My only question would be: what are some methods for analyzing sub-sections of data without introducing bias or skewing results?