# use conflicted to resolve function-name conflicts explicitly and avoid masking errors
library(conflicted)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.
A collection of 5 random samples of the data
The data:
ncaa <- read.csv("./ncaa_clean.csv", header = TRUE)
ncaa_clean <- ncaa
# because of my environment and other projects, I use
# ncaa and ncaa_clean interchangeably here
**Note:** the above data has changed from prior submissions. Here's a link showing the cleaning I have done and its results:
https://rpubs.com/tbreedy/1222442
5 random samples with replacement:
ncaa_size = nrow(ncaa)
for (x in 1:5){
  set.seed(x)
  sample_df <- ncaa[sample(1:ncaa_size-1, ncaa_size/2, replace=TRUE),]
  assign(paste0("ncaa_samp", x), sample_df)
}
Create a list of the sampled data frames so I can loop through them. My guess is that since these samples each have over 40,000 rows, they are going to be pretty similar because of the Law of Large Numbers and the Central Limit Theorem.
dataframes <- list(ncaa_samp1, ncaa_samp2, ncaa_samp3,
                   ncaa_samp4, ncaa_samp5)
sport_samp <- ncaa_clean |>
  group_by(sports) |>
  summarise(population = n())
i <- 1
for (data in dataframes) {
  counts <- data |>
    group_by(sports) |>
    summarise(cnt = n())
  sport_samp[paste0("sample", i)] <- counts$cnt
  i <- i + 1
}
sport_samp
## # A tibble: 37 × 7
## sports population sample1 sample2 sample3 sample4 sample5
## <chr> <int> <int> <int> <int> <int> <int>
## 1 All Track Combined 5961 3059 3002 2970 2946 3000
## 2 Archery 27 17 13 18 16 18
## 3 Badminton 5 3 4 1 2 1
## 4 Baseball 4693 2312 2351 2333 2322 2365
## 5 Basketball 10747 5388 5351 5338 5387 5397
## 6 Beach Volleyball 348 174 184 182 171 183
## 7 Bowling 462 196 255 225 220 237
## 8 Diving 48 32 28 28 21 29
## 9 Equestrian 350 181 190 183 159 164
## 10 Fencing 373 210 169 161 171 194
## # ℹ 27 more rows
The above code has a potential problem: if a sport like Badminton (or really any sport) didn't get drawn in one of the samples, the counts wouldn't line up with the right rows, and the data frame wouldn't build correctly. Luckily that didn't happen with these seeds, so I'm not going to mess with it right now. On that same note, there's also the chance that a single row is selected all ~40,000 times. Although the chance of that is astronomically small (far below the 10^(-256) ballpark), it's still greater than 0%, and we would need to merge rows in a different way to ensure this code always works.
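For reference, here's a sketch of a join-based way to line up the counts that would still work if a sport were missing from a sample: joining on the sport name fills an absent sport with NA instead of shifting counts onto the wrong rows. It reuses the dataframes list from above and builds a separate sport_counts table.
sport_counts <- ncaa_clean |>
  group_by(sports) |>
  summarise(population = n())
for (i in seq_along(dataframes)) {
  samp_counts <- dataframes[[i]] |>
    count(sports, name = paste0("sample", i))
  # left_join matches on sport name, so a missing sport becomes NA
  sport_counts <- left_join(sport_counts, samp_counts, by = "sports")
}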
As expected though, everything was pretty similar. There's some natural variation, but nothing appears out of the ordinary. This matches our expectations from the Law of Large Numbers and the Central Limit Theorem.
I want to build multiple large data frames I can use to compare different metrics of each sample directly against each other, with columns for the number of unique teams, number of athletes, revenues, and expenses.
# creating a big df to compare samples
# gets the initial column of the df
inst <- c()
for (data in dataframes) {
  n_teams <- n_distinct(data$institution_name)
  inst <- append(inst, n_teams)
}
# creates the df I will use to start the other dfs
comparisons <- data.frame("Unique_Teams" = inst)
# loops through each column name listed below
comp_rows <- list("sum_partic_men", "rev_men", "exp_men",
                  "sum_partic_women", "rev_women", "exp_women")
comparisons_mean <- comparisons
for (i in 1:length(comp_rows)) {
  column <- c()
  name <- unlist(comp_rows[i])
  for (data in dataframes) {
    col_mean <- mean(data[, name], na.rm = TRUE)
    column <- append(column, col_mean)
  }
  comparisons_mean[, name] <- column
}
Again… but for max and min instead of just mean.
comparisons_max <- comparisons
for (i in 1:length(comp_rows)) {
  column <- c()
  name <- unlist(comp_rows[i])
  for (data in dataframes) {
    col_max <- max(data[, name], na.rm = TRUE)
    column <- append(column, col_max)
  }
  comparisons_max[, name] <- column
}
comparisons_min <- comparisons
for (i in 1:length(comp_rows)) {
  column <- c()
  name <- unlist(comp_rows[i])
  for (data in dataframes) {
    col_min <- min(data[, name], na.rm = TRUE)
    column <- append(column, col_min)
  }
  comparisons_min[, name] <- column
}
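As an aside, the three loops above differ only in the summary function applied, so they could be collapsed into one small helper. A sketch, assuming the dataframes list, comparisons, and comp_rows defined above:
summarise_samples <- function(fun) {
  out <- comparisons
  for (name in unlist(comp_rows)) {
    # apply the chosen summary function to this column in each sample
    out[[name]] <- sapply(dataframes, function(d) fun(d[[name]], na.rm = TRUE))
  }
  out
}
comparisons_mean <- summarise_samples(mean)
comparisons_max <- summarise_samples(max)
comparisons_min <- summarise_samples(min)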
print(comparisons_mean)
## Unique_Teams sum_partic_men rev_men exp_men sum_partic_women rev_women
## 1 1141 34.31596 1189805 932206.9 23.38037 354354.1
## 2 1140 34.11097 1150036 925341.9 23.26037 352469.4
## 3 1141 34.10762 1141143 927377.8 23.26491 359824.0
## 4 1140 34.01641 1158671 918603.1 23.13146 353422.0
## 5 1140 33.99525 1143433 913943.3 23.18916 356988.2
## exp_women
## 1 434759.4
## 2 430596.8
## 3 438094.3
## 4 430408.9
## 5 436831.3
print(comparisons_min)
## Unique_Teams sum_partic_men rev_men exp_men sum_partic_women rev_women
## 1 1141 1 1001 1616 1 821
## 2 1140 1 1001 1057 1 821
## 3 1141 1 1001 1057 1 1001
## 4 1140 1 1001 1057 1 1001
## 5 1140 1 1001 1415 1 0
## exp_women
## 1 1369
## 2 1369
## 3 1369
## 4 1369
## 5 1426
print(comparisons_max)
## Unique_Teams sum_partic_men rev_men exp_men sum_partic_women rev_women
## 1 1141 249 143064180 69718059 327 20043215
## 2 1140 290 129023591 69718059 327 21440365
## 3 1141 230 143064180 68893857 327 21440365
## 4 1140 290 156147208 62252389 327 21440365
## 5 1140 249 144426105 62252389 248 20043215
## exp_women
## 1 9485162
## 2 9485162
## 3 8497345
## 4 8323428
## 5 7839480
Looking at means… everything is pretty much identical. Only minor changes are seen from sample to sample, but this is exactly what we would expect to happen. There's a chance that one sample might be heavily skewed upwards in an area like men's revenue if a row from a major football program was chosen multiple times, but since over forty thousand rows are factored into each sample, the chance that large schools are repeated often enough to make a significant impact is minimal, especially across a span of only five samples.
Looking at maximums, there's also not much to see here. The only data that stands out is some of the women's stats in the fifth sample, with the highest number of participating women and the highest women's expenses being significantly lower than in the other samples, by about 18-24%. I think it's likely that this sample simply didn't draw an outlier or two that the other samples had. Every other sample had the exact same max for participating women, with the fifth being the exception, reinforcing my suspicion that an outlier or two is to blame for this difference.
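A quick way to check that suspicion would be to compare each sample's maximum against the full data's maximum; a sketch using the dataframes list from above:
# max women's participation in each sample versus in the whole dataset
sapply(dataframes, function(d) max(d$sum_partic_women, na.rm = TRUE))
max(ncaa_clean$sum_partic_women, na.rm = TRUE)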
For minimums, the only real outlier comes from women's revenue, with the fifth sample having zero revenue while the other samples have a marginally higher amount of either 821 or 1001. Considering the mean women's revenue was about $350,000, this few-hundred-dollar difference is a mere rounding error. It does make me wonder if there is an error in an entry, and it also makes me want to further explore the NA values. Do NAs come from sports that have very low revenues, so that instead of recording a few hundred dollars here and there the schools opt not to spend resources tracking and recording them? This is something we can explore later.
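When we do, something along these lines could test that idea: group by sport and compare how often women's revenue is missing with the typical reported revenue for that sport. This is only a sketch built on the ncaa_clean columns shown above:
ncaa_clean |>
  group_by(sports) |>
  summarise(
    na_share_rev_women = mean(is.na(rev_women)),        # share of rows with missing women's revenue
    median_rev_women = median(rev_women, na.rm = TRUE)  # typical reported revenue
  ) |>
  arrange(desc(na_share_rev_women))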
I think this continues to reinforce the Law of Large Numbers. With over 40,000 observations available and each sample drawing about half the population (since we sample with replacement, we will almost always cover less than half of the distinct rows, with duplicates "making up the rest"), we should expect, aside from a few lucky outliers being excluded or included, our samples to be very comparable to the population and to each other.
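That parenthetical can be made concrete: drawing N/2 rows with replacement from N rows is expected to reach only about 39% of the distinct rows for large N. A quick back-of-the-envelope check:
# expected fraction of distinct rows that appear when drawing N/2 times with
# replacement; this approaches 1 - exp(-1/2), about 0.39, as N grows
N <- nrow(ncaa)
1 - (1 - 1 / N)^(N / 2)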
I could keep trying to zoom in to see if I can find a single school that was repeated in one sample more than in the others, or maybe compare badminton or diving, which are infrequent sports and come with high variability because of that, but the importance would likely only show up in very niche analysis.
We can show how little this matters by looking at how often each institution appears in each sample.
# Check whether samples 1 and 3 contain the same set of institutions
setdiff(ncaa_samp1$institution_name,ncaa_samp3$institution_name)
## character(0)
# Find which institutions from sample 1 are missing from samples 2, 4, and 5
# Sample 2
print(setdiff(ncaa_samp1$institution_name,ncaa_samp2$institution_name))
## [1] "Columbia-Greene Community College"
# Sample 4
print(setdiff(ncaa_samp1$institution_name,ncaa_samp4$institution_name))
## [1] "Columbia-Greene Community College"
# Sample 5
print(setdiff(ncaa_samp1$institution_name,ncaa_samp5$institution_name))
## [1] "Columbia-Greene Community College"
subset(ncaa_clean, institution_name == "Columbia-Greene Community College")
## X year unitid institution_name city_txt state_cd
## 36296 36296 2019 190169 Columbia-Greene Community College Hudson NY
## 36297 36297 2019 190169 Columbia-Greene Community College Hudson NY
## 82169 82169 2019 190169 Columbia-Greene Community College Hudson NY
## zip_text classification_code classification_name
## 36296 12534 7 NCAA Division III without football
## 36297 12534 7 NCAA Division III without football
## 82169 12534 7 NCAA Division III without football
## classification_other ef_male_count ef_female_count ef_total_count
## 36296 NA 235 292 527
## 36297 NA 235 292 527
## 82169 NA 235 292 527
## sector_cd sector_name sportscode partic_men partic_women
## 36296 4 Public, 2-year 1 18 NA
## 36297 4 Public, 2-year 2 14 NA
## 82169 4 Public, 2-year 15 NA 14
## partic_coed_men partic_coed_women sum_partic_men sum_partic_women rev_men
## 36296 NA NA 18 NA 37602
## 36297 NA NA 14 NA 33403
## 82169 NA NA NA 14 NA
## rev_women total_rev_menwomen exp_men exp_women total_exp_menwomen
## 36296 NA 37602 14567 NA 14567
## 36297 NA 33403 33403 NA 33403
## 82169 26400 26400 NA 26400 26400
## sports pct_men
## 36296 Baseball 0.4459203
## 36297 Basketball 0.4459203
## 82169 Soccer 0.4459203
It looks like this school had an extremely small athletics program that began in 2019.
To make some extra analysis a bit easier, I'm going to remove instances of this school from samples 1 and 3. I don't think this is a problem: the school's program only appears in one of the five years this dataset tracks, and it isn't some outlier in a sea of comparable programs, so omitting its data won't have any meaningful impact on the affected samples.
print(nrow(ncaa_samp1))
## [1] 43433
print(nrow(ncaa_samp3))
## [1] 43436
ncaa_samp1 <- subset(ncaa_samp1, institution_name != "Columbia-Greene Community College")
ncaa_samp3 <- subset(ncaa_samp3, institution_name != "Columbia-Greene Community College")
print(nrow(ncaa_samp1))
## [1] 43430
print(nrow(ncaa_samp3))
## [1] 43432
Surprisingly, these two samples drew the school's 3 entries a total of 3 and 4 times, respectively (more than 3 is possible because we sampled with replacement). Removing these few rows shouldn't make a noticeable impact on any analysis of these samples, but it will help significantly when comparing institutions.
# update dataframes for the adjusted samples
dataframes <- list(ncaa_samp1, ncaa_samp2, ncaa_samp3,
                   ncaa_samp4, ncaa_samp5)
# confirm with the old code that the number of unique schools is the same
inst <- c()
for (data in dataframes) {
  n_teams <- n_distinct(data$institution_name)
  inst <- append(inst, n_teams)
}
inst
## [1] 1140 1140 1140 1140 1140
woCGCC <- subset(ncaa_clean, institution_name != "Columbia-Greene Community College")
team_comparison <- woCGCC |>
  group_by(institution_name) |>
  summarise(population = n())
i <- 1
for (data in dataframes) {
  counts <- data |>
    group_by(institution_name) |>
    summarise(cnt = n())
  team_comparison[paste0("sample", i)] <- counts$cnt
  i <- i + 1
}
team_comparison
## # A tibble: 1,140 × 7
## institution_name population sample1 sample2 sample3 sample4 sample5
## <chr> <int> <int> <int> <int> <int> <int>
## 1 Abilene Christian Univer… 63 31 38 25 32 47
## 2 Academy of Art University 60 28 40 32 32 35
## 3 Adams State University 78 27 35 39 36 34
## 4 Adelphi University 115 50 65 59 48 57
## 5 Adrian College 132 60 65 76 57 71
## 6 Alabama A & M University 70 30 31 41 41 30
## 7 Alabama State University 70 33 31 34 21 41
## 8 Albany State University 62 33 37 27 33 36
## 9 Albertus Magnus College 66 38 29 27 29 38
## 10 Albion College 125 59 52 66 56 72
## # ℹ 1,130 more rows
team_comparison['samp_dif'] <- apply(team_comparison[3:7], 1, function(x) max(x) - min(x))
team_comparison
## # A tibble: 1,140 × 8
## institution_name population sample1 sample2 sample3 sample4 sample5 samp_dif
## <chr> <int> <int> <int> <int> <int> <int> <int>
## 1 Abilene Christi… 63 31 38 25 32 47 22
## 2 Academy of Art … 60 28 40 32 32 35 12
## 3 Adams State Uni… 78 27 35 39 36 34 12
## 4 Adelphi Univers… 115 50 65 59 48 57 17
## 5 Adrian College 132 60 65 76 57 71 19
## 6 Alabama A & M U… 70 30 31 41 41 30 11
## 7 Alabama State U… 70 33 31 34 21 41 20
## 8 Albany State Un… 62 33 37 27 33 36 10
## 9 Albertus Magnus… 66 38 29 27 29 38 11
## 10 Albion College 125 59 52 66 56 72 20
## # ℹ 1,130 more rows
# flag institutions whose sample-to-sample spread exceeds half their total row count
for (row in 1:nrow(team_comparison)) {
  if (team_comparison[row, 'samp_dif'] * 2 > team_comparison[row, 'population']) {
    print(as.character(team_comparison[row, 'institution_name']))
    print(as.numeric(team_comparison[row, 'samp_dif']))
  }
}
## [1] "Armstrong State University"
## [1] 7
## [1] "Birmingham-Southern College"
## [1] 10
## [1] "Concordia University Texas"
## [1] 28
## [1] "Daniel Webster College"
## [1] 9
## [1] "Ferris State University"
## [1] 7
## [1] "Georgia College and State University"
## [1] 15
## [1] "Hesston College"
## [1] 5
## [1] "Mary Baldwin College"
## [1] 4
## [1] "Montgomery College"
## [1] 5
## [1] "SUNY College at Cortland"
## [1] 15
## [1] "Western Colorado University"
## [1] 6
## [1] "Western State Colorado University"
## [1] 20
## [1] "Wheeling University"
## [1] 7
## [1] "Wheelock College"
## [1] 12
## [1] "Widener University-Main Campus"
## [1] 9
These are some colleges where there was high variation between samples. This could matter in specific situations, such as if one of these schools offered a unique sport or was one of the high-revenue football schools, where having significantly more or less representation could skew some statistics. Looking at the data above, most of these programs only have a small number of rows to begin with, so high relative variation is expected when the denominator is so small, even though the actual change in counts is still relatively insignificant.
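One quick way to put those swings in context (a sketch built on the team_comparison columns above) is to express each school's spread as a share of its total row count:
# spread relative to each school's number of rows in the population
team_comparison |>
  mutate(rel_dif = samp_dif / population) |>
  arrange(desc(rel_dif)) |>
  head(10)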
No matter how much we scrutinize this data, unless we look at infrequent events under a microscope, we're not going to find anything notable. What could be notable, though, is what would happen if we hadn't sampled such a large percentage of a large data set. What if we randomly selected a smaller number of rows?
Sample sizes, instead of being half of the population at over 40,000, will range from 10 to 1,000. And instead of 5 samples, we'll take 1,000 of each size, so the chances of an unlikely event occurring are higher but runtime won't be excessive. We'll look at the sum of sporting revenue in each sample and show the distribution of those sums in a histogram.
range <- c()
for (i in 1:1000) {
  samp10 <- ncaa[sample(1:ncaa_size-1, 10, replace=TRUE),]
  revenue <- sum(samp10['rev_men'], na.rm = TRUE) +
    sum(samp10['rev_women'], na.rm = TRUE)
  range <- append(range, revenue)
}
hist(range, main="1,000 Samples of Size 10", xlab="Sum of Revenue")
range <- c()
for (i in 1:1000) {
  samp20 <- ncaa[sample(1:ncaa_size-1, 20, replace=TRUE),]
  revenue <- sum(samp20['rev_men'], na.rm = TRUE) +
    sum(samp20['rev_women'], na.rm = TRUE)
  range <- append(range, revenue)
}
hist(range, main="1,000 Samples of Size 20", xlab="Sum of Revenue")
range <- c()
for (i in 1:1000) {
  samp100 <- ncaa[sample(1:ncaa_size-1, 100, replace=TRUE),]
  revenue <- sum(samp100['rev_men'], na.rm = TRUE) +
    sum(samp100['rev_women'], na.rm = TRUE)
  range <- append(range, revenue)
}
hist(range, main="1,000 Samples of Size 100", xlab="Sum of Revenue")
range <- c()
for (i in 1:1000) {
  samp1000 <- ncaa[sample(1:ncaa_size-1, 1000, replace=TRUE),]
  revenue <- sum(samp1000['rev_men'], na.rm = TRUE) +
    sum(samp1000['rev_women'], na.rm = TRUE)
  range <- append(range, revenue)
}
hist(range, main="1,000 Samples of Size 1000", xlab="Sum of Revenue")
## Warning in breaks[-1L] + breaks[-nB]: NAs produced by integer overflow
(The sums of revenue got pretty large by the end there, so the last graph may not be accurate at its right-end tail, but even without that information, the trend in the data is still apparent.)
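As an aside, the four blocks above differ only in the sample size. Here is a sketch of one helper (assuming ncaa is already loaded) that would draw the same kind of histogram for any size and that keeps the totals in double precision, which should sidestep the integer-overflow warning:
revenue_hist <- function(size, reps = 1000) {
  sums <- replicate(reps, {
    s <- ncaa[sample(nrow(ncaa), size, replace = TRUE), ]
    # as.numeric() keeps the running totals out of integer range
    sum(as.numeric(s$rev_men), na.rm = TRUE) +
      sum(as.numeric(s$rev_women), na.rm = TRUE)
  })
  hist(sums, main = paste(reps, "samples of size", size), xlab = "Sum of Revenue")
}
for (size in c(10, 20, 100, 1000)) revenue_hist(size)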
Looking at these different graphs, we can see how the distribution shifts from what looks like exponential decay toward something much closer to a normal distribution. From prior data dives, we saw that revenue has a very long-tailed, right-skewed distribution. When sample sizes are small, randomly drawing a program from the far right of that distribution can push the sample's total far beyond what the population mean would suggest. However, as the sample size gets larger, these heavy draws no longer pack the same punch, so the data slowly reflects a more normal distribution. For analysis, this means that small sample sizes can cause some pretty big distortions, while larger sample sizes will give more accurate and consistent conclusions.
I think this shows that if I have sufficiently large data to work with, I can be pretty confident that my results will mimic the population. However, my data has some extremely wide ranges, so if I limit my investigation too narrowly, it’s likely I’m going to miss something.
That is, unless I ask a very specific question. With how large my data set is, it's going to be nearly impossible to conclude much of anything beyond a very high-level view by using all the data at once. Even though it's only NCAA schools, there is extreme variation between conferences and between sports across divisions. So when I'm looking at a niche part of the data set, maybe a specific regional sport in a certain division, I'll need to accept that there will be higher variation in my conclusions because of the smaller sample I'm drawing from; padding the sample with more "similar" data would likely just add garbage rather than data that is meaningfully comparable.
I'd be curious to see if I can get more data after this project, maybe from 2020-2024 or 2012-2015 (the former might be a bit askew because of COVID), and see how much variation there is in the conclusions of some of the specialized questions I have referenced compared to a broader look at the entire NCAA.