# load conflicted so package masking is handled explicitly below
library(conflicted)  

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.

Week 4 Data Dive - Sampling and Drawing Conclusions

A collection of 5 random samples of the data

The data:

ncaa <- read.csv("./ncaa_clean.csv", header = TRUE)
ncaa_clean <- ncaa 
# because of my environment and other projects, I use
# ncaa and ncaa_clean interchangeably here

Note: the above data has changed from prior submissions. Here's a link showing the cleaning I have done and its results.

https://rpubs.com/tbreedy/1222442

5 random samples with replacement:

ncaa_size <- nrow(ncaa)

for (x in 1:5){
  set.seed(x)
  # note: 1:ncaa_size-1 evaluates to 0:(ncaa_size-1); an index of 0 is
  # silently dropped, which is why the samples end up slightly different sizes
  sample_df <- ncaa[sample(1:ncaa_size-1, ncaa_size/2, replace=TRUE),]
  assign(paste0("ncaa_samp", x), sample_df)
}

The next chunk creates a list of the sampled data frames so I can loop through them. My guess is that since these samples have over 40,000 rows, they are going to be pretty similar, because of the Law of Large Numbers and the Central Limit Theorem.

dataframes <- list(ncaa_samp1,ncaa_samp2,ncaa_samp3,
                   ncaa_samp4,ncaa_samp5)
sport_samp <- ncaa_clean |>
  group_by(sports) |>
  summarise(population = n())

i <- 1

for (data in dataframes) {
  row <- data |>
    group_by(sports) |>
    summarise(cnt = n())
  sport_samp[paste0("sample",i)] <- row$cnt
  i <- i + 1
}
sport_samp
## # A tibble: 37 × 7
##    sports             population sample1 sample2 sample3 sample4 sample5
##    <chr>                   <int>   <int>   <int>   <int>   <int>   <int>
##  1 All Track Combined       5961    3059    3002    2970    2946    3000
##  2 Archery                    27      17      13      18      16      18
##  3 Badminton                   5       3       4       1       2       1
##  4 Baseball                 4693    2312    2351    2333    2322    2365
##  5 Basketball              10747    5388    5351    5338    5387    5397
##  6 Beach Volleyball          348     174     184     182     171     183
##  7 Bowling                   462     196     255     225     220     237
##  8 Diving                     48      32      28      28      21      29
##  9 Equestrian                350     181     190     183     159     164
## 10 Fencing                   373     210     169     161     171     194
## # ℹ 27 more rows

Comparing Number of Sports Programs by Sample

The above code has a potential problem: if a sport like Badminton (or really any sport) didn't get drawn in one of the samples, the counts wouldn't correspond to the right rows and the column assignment would fail, so the data frame wouldn't build. Luckily this didn't happen given the seeds, so I'm not going to mess with that right now. On that same note, there's also the chance that a single row is selected all of the roughly 40,000 times. Although that probability is vanishingly small, it's still greater than zero, and we would need to merge rows in a different way (for example, by joining on the sport name, as sketched below) to ensure this code always works.
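
If I wanted to guard against that, one option would be to join each sample's counts onto the population table by sport name instead of assigning by position. A rough sketch, using the same objects defined above:

# Sketch: join-based merge, so a sport missing from a sample shows up as NA
# instead of breaking the column assignment
sport_counts <- ncaa_clean |>
  count(sports, name = "population")

for (i in seq_along(dataframes)) {
  sample_counts <- count(dataframes[[i]], sports, name = paste0("sample", i))
  sport_counts <- left_join(sport_counts, sample_counts, by = "sports")
}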

As expected though, everything was pretty similar. There's some natural variation, but nothing appears out of the ordinary. This matches our expectations from the Law of Large Numbers and the Central Limit Theorem.

I want to build multiple large data frames I can use to compare different metrics of each sample directly to one another: the number of unique teams, the number of athletes, and the amounts of revenue and expenses.

# creating big df to compare samples

# gets initial column in df
inst <- c()
for (data in dataframes) {
  sum <- n_distinct(data$institution_name)
  #sum <- mean(data[,'rev_men'], na.rm = TRUE)
  inst <- append(inst, sum)
  #print(size(sum))
}

# creates df I will use to start the other dfs
comparisons <- data.frame("Unique_Teams" = inst)
#loops through each column name listed below
comp_rows <- list("sum_partic_men", "rev_men", "exp_men", 
                  "sum_partic_women", "rev_women", "exp_women")

comparisons_mean <- comparisons

for (i in 1:length(comp_rows)) {
  column = c()
  name = unlist(comp_rows[i])
  #print(name)
  
  for (data in dataframes) {
    sum <- mean(data[,name], na.rm = TRUE)
    column <- append(column, sum)
  }
  comparisons_mean[, name] <- column
  #print(column)
}

Again… but for max and min instead of just mean.

comparisons_max <- comparisons

for (i in 1:length(comp_rows)) {
  column = c()
  name = unlist(comp_rows[i])
  #print(name)
  
  for (data in dataframes) {
    sum <- max(data[,name], na.rm = TRUE)
    column <- append(column, sum)
  }
  comparisons_max[, name] <- column
  #print(column)
}
comparisons_min <- comparisons

for (i in 1:length(comp_rows)) {
  column = c()
  name = unlist(comp_rows[i])
  #print(name)
  
  for (data in dataframes) {
    sum <- min(data[,name], na.rm = TRUE)
    column <- append(column, sum)
  }
  comparisons_min[, name] <- column
  #print(column)
}
print(comparisons_mean)
##   Unique_Teams sum_partic_men rev_men  exp_men sum_partic_women rev_women
## 1         1141       34.31596 1189805 932206.9         23.38037  354354.1
## 2         1140       34.11097 1150036 925341.9         23.26037  352469.4
## 3         1141       34.10762 1141143 927377.8         23.26491  359824.0
## 4         1140       34.01641 1158671 918603.1         23.13146  353422.0
## 5         1140       33.99525 1143433 913943.3         23.18916  356988.2
##   exp_women
## 1  434759.4
## 2  430596.8
## 3  438094.3
## 4  430408.9
## 5  436831.3
print(comparisons_min)
##   Unique_Teams sum_partic_men rev_men exp_men sum_partic_women rev_women
## 1         1141              1    1001    1616                1       821
## 2         1140              1    1001    1057                1       821
## 3         1141              1    1001    1057                1      1001
## 4         1140              1    1001    1057                1      1001
## 5         1140              1    1001    1415                1         0
##   exp_women
## 1      1369
## 2      1369
## 3      1369
## 4      1369
## 5      1426
print(comparisons_max)
##   Unique_Teams sum_partic_men   rev_men  exp_men sum_partic_women rev_women
## 1         1141            249 143064180 69718059              327  20043215
## 2         1140            290 129023591 69718059              327  21440365
## 3         1141            230 143064180 68893857              327  21440365
## 4         1140            290 156147208 62252389              327  21440365
## 5         1140            249 144426105 62252389              248  20043215
##   exp_women
## 1   9485162
## 2   9485162
## 3   8497345
## 4   8323428
## 5   7839480
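
As a side note, the three loops above could be collapsed into a single helper that takes the summary function as an argument. A sketch (the helper name is mine, but the columns and objects are the same ones used above):

# Sketch: build any of the comparison tables by passing mean, min, or max
build_comparison <- function(dataframes, cols, fun) {
  out <- data.frame(
    Unique_Teams = sapply(dataframes, function(d) n_distinct(d$institution_name))
  )
  for (col in cols) {
    out[[col]] <- sapply(dataframes, function(d) fun(d[[col]], na.rm = TRUE))
  }
  out
}

# should reproduce comparisons_mean, comparisons_min, and comparisons_max
comparisons_mean2 <- build_comparison(dataframes, unlist(comp_rows), mean)
comparisons_min2  <- build_comparison(dataframes, unlist(comp_rows), min)
comparisons_max2  <- build_comparison(dataframes, unlist(comp_rows), max)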

How different are the samples?

Looking at means… everything is pretty much identical. Only minor changes are seen from sample to sample, but this is exactly what we would expect to happen. There's a chance that one sample might be heavily skewed upwards in an area like men's revenue at a major football school if that row was chosen multiple times, but since over forty thousand rows factor into each sample, the chance that large schools are repeated often enough to make a significant impact is minimal, especially across only five samples.

Looking at maximums, there's also not much to see. The only data that stands out is some of the women's stats in the fifth sample, where the highest number of participating women and the highest women's expenses are noticeably lower than in the other samples, by roughly 18-24%. I think it's likely that this sample simply didn't draw an outlier or two that the other samples did. Every other sample had the exact same maximum for participating women, with the fifth being the exception, reinforcing my suspicion that an outlier or two is to blame for this difference.

For minimums, the only real outlier comes from women's revenue: the fifth sample has a minimum of zero while the other samples have marginally higher minimums of either 821 or 1001. Considering the mean women's revenue is around $355,000, this few-hundred-dollar difference is a mere rounding error. It does make me wonder if there is an error in an entry, and it also makes me want to further explore the NA values. Do NAs come from sports that have very low revenues, so that instead of recording a few hundred dollars here and there the schools opt not to spend resources tracking and recording? This is something we can explore later.
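
As a first step toward that NA question, something like the following sketch could count how often women's revenue is missing for each sport and what the typical reported value looks like (column names as used above):

# Sketch: how often is rev_women missing per sport, and what is the typical
# reported value when it is present?
ncaa_clean |>
  group_by(sports) |>
  summarise(
    n_rows     = n(),
    n_missing  = sum(is.na(rev_women)),
    median_rev = median(rev_women, na.rm = TRUE)
  ) |>
  arrange(desc(n_missing))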

I think this continues to reinforce the Law of Large Numbers. With over 40,000 observations available, each sample draws about half the population (though with replacement we almost always cover less than half of the distinct rows, with duplicates "making up the rest"), so apart from a few lucky outliers being included or excluded, we should expect our samples to be very comparable to the population and to each other.
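
To put a rough number on that parenthetical: drawing n/2 rows with replacement from n rows is expected to cover only about 39% of the distinct rows, not 50%. A quick sketch of the calculation:

# Expected share of distinct rows covered when drawing n/2 rows with
# replacement from n rows; approaches 1 - exp(-1/2), about 0.39, for large n
n <- nrow(ncaa)
1 - (1 - 1/n)^(n / 2)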

Further analysis…

I could keep trying to zoom in and see if I can find a single school that was repeated in one sample more than others, maybe compare badminton or diving which are infrequent sports and can come with high variability because of that, but the importance would likely only manifest in very niche analysis.

We can show the little significance of this by looking at how often each institution is mentioned in each sample.

# Check whether samples 1 and 3 contain the same set of institution names
# (setdiff returns names present in the first sample but not the second)
setdiff(ncaa_samp1$institution_name,ncaa_samp3$institution_name)
## character(0)
# Find which institutions in sample 1 are missing from samples 2, 4, and 5
# Sample 2
print(setdiff(ncaa_samp1$institution_name,ncaa_samp2$institution_name))
## [1] "Columbia-Greene Community College"
# Sample 4
print(setdiff(ncaa_samp1$institution_name,ncaa_samp4$institution_name))
## [1] "Columbia-Greene Community College"
# Sample 5
print(setdiff(ncaa_samp1$institution_name,ncaa_samp5$institution_name))
## [1] "Columbia-Greene Community College"
#ncaa_clean[ncaa_clean['institution_name'] == "Columbia-Greene Community College"]
subset(ncaa_clean, institution_name == "Columbia-Greene Community College")
##           X year unitid                  institution_name city_txt state_cd
## 36296 36296 2019 190169 Columbia-Greene Community College   Hudson       NY
## 36297 36297 2019 190169 Columbia-Greene Community College   Hudson       NY
## 82169 82169 2019 190169 Columbia-Greene Community College   Hudson       NY
##       zip_text classification_code                classification_name
## 36296    12534                   7 NCAA Division III without football
## 36297    12534                   7 NCAA Division III without football
## 82169    12534                   7 NCAA Division III without football
##       classification_other ef_male_count ef_female_count ef_total_count
## 36296                   NA           235             292            527
## 36297                   NA           235             292            527
## 82169                   NA           235             292            527
##       sector_cd    sector_name sportscode partic_men partic_women
## 36296         4 Public, 2-year          1         18           NA
## 36297         4 Public, 2-year          2         14           NA
## 82169         4 Public, 2-year         15         NA           14
##       partic_coed_men partic_coed_women sum_partic_men sum_partic_women rev_men
## 36296              NA                NA             18               NA   37602
## 36297              NA                NA             14               NA   33403
## 82169              NA                NA             NA               14      NA
##       rev_women total_rev_menwomen exp_men exp_women total_exp_menwomen
## 36296        NA              37602   14567        NA              14567
## 36297        NA              33403   33403        NA              33403
## 82169     26400              26400      NA     26400              26400
##           sports   pct_men
## 36296   Baseball 0.4459203
## 36297 Basketball 0.4459203
## 82169     Soccer 0.4459203

It looks like this school had an extremely small athletics program that began in 2019.

To make some extra analysis a bit easier, I'm going to remove instances of this school from samples 1 and 3. I don't think this is a problem: the school only opened its program in one of the five years this dataset tracks, and since it isn't an outlier in a sea of comparable programs, omitting its data won't create any meaningful impact on the affected samples.

print(nrow(ncaa_samp1))
## [1] 43433
print(nrow(ncaa_samp3))
## [1] 43436
ncaa_samp1 <- subset(ncaa_samp1, institution_name != "Columbia-Greene Community College")
ncaa_samp3 <- subset(ncaa_samp3, institution_name != "Columbia-Greene Community College")
print(nrow(ncaa_samp1))
## [1] 43430
print(nrow(ncaa_samp3))
## [1] 43432

Surprisingly, samples 1 and 3 drew this school's 3 rows a total of 3 and 4 times, respectively (with replacement, the same row can appear more than once). Removing these few entries shouldn't make a noticeable impact on any analysis of these samples, but it will help significantly in comparing institutions.

# update dataframes for the adjusted samples
dataframes <- list(ncaa_samp1,ncaa_samp2,ncaa_samp3,
                   ncaa_samp4,ncaa_samp5)
# confirm with the earlier code that the number of unique schools is the same
inst <- c()
for (data in dataframes) {
  sum <- n_distinct(data$institution_name)
  #sum <- mean(data[,'rev_men'], na.rm = TRUE)
  inst <- append(inst, sum)
  #print(size(sum))
}
inst
## [1] 1140 1140 1140 1140 1140
woCGCC <- subset(ncaa_clean, institution_name != "Columbia-Greene Community College")
team_comparison <- woCGCC |>
  group_by(institution_name) |>
  summarise(population = n())

# note: i was not reset after the earlier loops (it is still 6), so these new
# columns come out named sample6 through sample10 even though they hold samples 1-5
for (data in dataframes) {
  row <- data |>
    group_by(institution_name) |>
    summarise(cnt = n())
  team_comparison[paste0("sample",i)] <- row$cnt
  i <- i + 1
}
team_comparison
## # A tibble: 1,140 × 7
##    institution_name          population sample6 sample7 sample8 sample9 sample10
##    <chr>                          <int>   <int>   <int>   <int>   <int>    <int>
##  1 Abilene Christian Univer…         63      31      38      25      32       47
##  2 Academy of Art University         60      28      40      32      32       35
##  3 Adams State University            78      27      35      39      36       34
##  4 Adelphi University               115      50      65      59      48       57
##  5 Adrian College                   132      60      65      76      57       71
##  6 Alabama A & M University          70      30      31      41      41       30
##  7 Alabama State University          70      33      31      34      21       41
##  8 Albany State University           62      33      37      27      33       36
##  9 Albertus Magnus College           66      38      29      27      29       38
## 10 Albion College                   125      59      52      66      56       72
## # ℹ 1,130 more rows
team_comparison['samp_dif'] <- apply(team_comparison[3:7], 1, function(x) max(x) - min(x))
team_comparison
## # A tibble: 1,140 × 8
##    institution_name population sample6 sample7 sample8 sample9 sample10 samp_dif
##    <chr>                 <int>   <int>   <int>   <int>   <int>    <int>    <int>
##  1 Abilene Christi…         63      31      38      25      32       47       22
##  2 Academy of Art …         60      28      40      32      32       35       12
##  3 Adams State Uni…         78      27      35      39      36       34       12
##  4 Adelphi Univers…        115      50      65      59      48       57       17
##  5 Adrian College          132      60      65      76      57       71       19
##  6 Alabama A & M U…         70      30      31      41      41       30       11
##  7 Alabama State U…         70      33      31      34      21       41       20
##  8 Albany State Un…         62      33      37      27      33       36       10
##  9 Albertus Magnus…         66      38      29      27      29       38       11
## 10 Albion College          125      59      52      66      56       72       20
## # ℹ 1,130 more rows
# flag institutions where the spread between samples is more than half of
# their total row count in the population
for (row in 1:nrow(team_comparison)) {
  if(team_comparison[row,'samp_dif'] * 2 > team_comparison[row,'population']){
    print(as.character(team_comparison[row,'institution_name']))
    print(as.numeric(team_comparison[row,'samp_dif']))
  }
}
## [1] "Armstrong State University"
## [1] 7
## [1] "Birmingham-Southern College"
## [1] 10
## [1] "Concordia University Texas"
## [1] 28
## [1] "Daniel Webster College"
## [1] 9
## [1] "Ferris State University"
## [1] 7
## [1] "Georgia College and State University"
## [1] 15
## [1] "Hesston College"
## [1] 5
## [1] "Mary Baldwin College"
## [1] 4
## [1] "Montgomery College"
## [1] 5
## [1] "SUNY College at Cortland"
## [1] 15
## [1] "Western Colorado University"
## [1] 6
## [1] "Western State Colorado University"
## [1] 20
## [1] "Wheeling University"
## [1] 7
## [1] "Wheelock College"
## [1] 12
## [1] "Widener University-Main Campus"
## [1] 9

These are some colleges with high variation between samples. This can matter in specific situations, for example if one of these schools offered a unique sport or was one of the high-revenue football schools, where significantly more or less representation can skew some statistics. Looking at the data above, most of the flagged schools have relatively few rows in the full data set, so high relative variation is expected when the denominator is that small, even though the absolute change in counts is still fairly small.
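
To see those population counts next to the flagged schools, the loop above can also be written as a dplyr filter; a quick sketch:

# Sketch: same flag as the loop above, but keeping the population column in view
team_comparison |>
  filter(samp_dif * 2 > population) |>
  select(institution_name, population, samp_dif)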

No matter how much we scrutinize this data, unless we look at infrequent events under a microscope we're not going to find anything notable. What could be notable, though, is what would happen if we hadn't sampled such a large percentage of a large data set. What if we randomly selected a smaller number of rows?

A Monte Carlo Simulation

Sample sizes, instead of being half of the population at over 40,000, will range from 10 to 1,000. Instead of 5 samples, we'll draw 1,000 of each size so the chances of an unlikely event occurring are higher, but runtime won't be excessive. We'll look at the sum of sporting revenue in each sample and show the distribution of those sums in a histogram.

range <- c()
for (i in 1:1000) {
  samp <- ncaa[sample(1:ncaa_size-1, 10, replace=TRUE),]
  revenue <- sum(samp['rev_men'], na.rm = TRUE) + 
             sum(samp['rev_women'], na.rm = TRUE)
  range <- append(range, revenue)
}

hist(range, main="1,000 Samples of Size 10", xlab="Sum Revenue")

range <- c()
for (i in 1:1000) {
  samp <- ncaa[sample(1:ncaa_size-1, 20, replace=TRUE),]
  revenue <- sum(samp['rev_men'], na.rm = TRUE) + 
             sum(samp['rev_women'], na.rm = TRUE)
  range <- append(range, revenue)
}

hist(range, main="1,000 Samples of Size 20", xlab="Sum Revenue")

range <- c()
for (i in 1:1000) {
  samp <- ncaa[sample(1:ncaa_size-1, 100, replace=TRUE),]
  revenue <- sum(samp['rev_men'], na.rm = TRUE) + 
             sum(samp['rev_women'], na.rm = TRUE)
  range <- append(range, revenue)
}

hist(range, main="1,000 Samples of Size 100", xlab="Sum Revenue")

range <- c()
for (i in 1:1000) {
  samp <- ncaa[sample(1:ncaa_size-1, 1000, replace=TRUE),]
  revenue <- sum(samp['rev_men'], na.rm = TRUE) + 
             sum(samp['rev_women'], na.rm = TRUE)
  range <- append(range, revenue)
}

hist(range, main="1,000 Samples of Size 1,000", xlab="Sum Revenue")
## Warning in breaks[-1L] + breaks[-nB]: NAs produced by integer overflow

(The sums of revenue got pretty large by the end there, so the last graph may not be accurate at the right end of its tail, but even without that information, the trend in the data is still apparent.)
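
The four near-identical chunks above could also be wrapped in one function. A sketch along those lines (the function name is mine; it samples from 1:nrow(ncaa) and converts revenue to double, which also avoids the integer overflow warning):

# Sketch: one simulation function for all sample sizes
simulate_revenue <- function(sample_size, reps = 1000) {
  replicate(reps, {
    rows <- ncaa[sample(nrow(ncaa), sample_size, replace = TRUE), ]
    sum(as.numeric(rows$rev_men), na.rm = TRUE) +
      sum(as.numeric(rows$rev_women), na.rm = TRUE)
  })
}

for (n in c(10, 20, 100, 1000)) {
  hist(simulate_revenue(n),
       main = paste("1,000 samples of size", n),
       xlab = "Sum Revenue")
}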

Looking at these different graphs, we can see how the distribution shifts from what looks like exponential decay to something much closer to a normal distribution. From prior data dives, we saw that revenue has a long-tailed, right-skewed distribution. When sample sizes are small, randomly drawing a program from the far right of that distribution can push the sample's total far beyond what the population mean would suggest. However, as the sample size gets larger, these heavy draws no longer pack the same punch, so the data slowly reflect a more normal distribution. For analysis, this means small sample sizes can cause some pretty big distortions, while larger sample sizes give more accurate and consistent conclusions.
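
One way to quantify that narrowing (a rough sketch, treating each row's combined revenue as an independent draw): the relative spread of a sum of n draws shrinks like 1/sqrt(n).

# Sketch: relative spread (sd / mean) of the simulated sums should shrink
# roughly like 1 / sqrt(n)
row_total <- rowSums(cbind(ncaa$rev_men, ncaa$rev_women), na.rm = TRUE)
rel_spread <- sd(row_total) / mean(row_total)
sapply(c(10, 20, 100, 1000), function(n) rel_spread / sqrt(n))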

Drawing conclusions in the future

I think this shows that if I have sufficiently large data to work with, I can be pretty confident that my results will mimic the population. However, my data has some extremely wide ranges, so if I limit my investigation too narrowly, it’s likely I’m going to miss something.

That is, unless I ask a very specific question. With how large my data set is, it's going to be nearly impossible to draw a conclusion about much of anything beyond a very high-level view by using all the data at once. Even though it's only NCAA schools, there is extreme variation between conferences and between sports in different divisions. So when I'm looking at a niche part of the data set, maybe a specific regional sport within a certain division, I'll need to accept some higher variation in my conclusions because of the smaller sample I'm drawing from; if I add more "similar" data to the mix, I'll likely just be adding noise rather than data that is meaningfully comparable.

I'd be curious to see if I can get more data after this project, maybe from 2020-2024 or 2012-2015 (the former might be a bit askew because of COVID), and see how much variation there is in the conclusions to some of the specialized questions I've referenced, compared to a broader look at the entire NCAA.