This week, for our data dive, we will be looking at sampling in our data set with different columns and looking at the differences and what it means for our data.
Loading in the data, tidyverse and setting seed (for reproducble results):
# Loading tidyverse
library(tidyverse)
#Loading in Data
nhl_draft <- read_csv("nhldraft.csv")
# Setting seed
set.seed(1)
The task for this data dive is to choose a variety of columns from the data set and make sub samples for each column 5-10 times with roughly 50%-75% of the data for each sub sample.
To do this, I made a custom function taking 3 parameters: column name, the percent of data sampled from, and the amount of samples calculated. This function then uses a for loop to calculate the amount of samples and store each sample in a list. It also makes a vactor of sample names and stores each name for each iteration. Once the for loop is done, it binds the columns of the list together and stores it as a data frame in “df”. Finally, the names generated in the sample name vector is then applied to the data frame and the function returns the completed data frame.
sampling_function <- function(column_name, sample_size_percent, num_samples){
columns <- list()
col_names <- c()
for(i in 1:num_samples){
columns[[i]] <- nhl_draft |> sample_frac(sample_size_percent, replace = TRUE) |> pluck(column_name)
col_names[i] <- paste("sample",i)
}
df <- bind_cols(columns)
names(df) <- col_names
return(df)
}
Now we can calculate statistics for each sub sample based on the column that we choose. For every sub sample I decided to use 50% of the column for the sample and I decided that I would do 5 samples for each column. To show this, you can see for each column’s sample function I included 0.5 (50%) and 5 (number of samples) as arguments for the function.
The 7 columns I chose for this data dive are as follows:
gp_s <- sampling_function("games_played", 0.5, 5)
summary(gp_s)
## sample 1 sample 2 sample 3 sample 4
## Min. : 1.0 Min. : 1.0 Min. : 1 Min. : 1.0
## 1st Qu.: 23.0 1st Qu.: 25.0 1st Qu.: 25 1st Qu.: 25.0
## Median : 141.0 Median : 142.0 Median : 152 Median : 155.0
## Mean : 296.6 Mean : 297.1 Mean : 316 Mean : 305.4
## 3rd Qu.: 501.5 3rd Qu.: 491.5 3rd Qu.: 550 3rd Qu.: 536.0
## Max. :1779.0 Max. :1779.0 Max. :1635 Max. :1779.0
## NA's :3510 NA's :3577 NA's :3530 NA's :3489
## sample 5
## Min. : 1
## 1st Qu.: 23
## Median : 135
## Mean : 291
## 3rd Qu.: 488
## Max. :1733
## NA's :3576
Looking at the summary statistics for the samples above, we can see that the minimum values for each sample are 1.0 but have different numbers for everything else. Samples 1, 2 and 4 have the largest max values at 1779 while Sample 5 has a max value of 1733 and Sample 3 has a max value at 1635 which is the lowest for all the max values of the samples. All of the sample means are varied with sample 3’s mean being the highest at 316 while sample 1 has the lowest mean at 296. We can also see that there are a lot of NA’s which means that a lot of players have either unrecorded amount of games or haven’t played a NHL game yet. In this case scenario, abnormalities don’t exist as there is a variety of reasons that the domain of games played. Players who have played in the league for longer tend to have more games recorded than those who haven’t played at all.
team_s <- sampling_function("team", 0.5, 5)
prop.table(table(team_s$`sample 1`))
##
## Anaheim Ducks Arizona Coyotes Atlanta Flames
## 0.0199183673 0.0065306122 0.0071836735
## Atlanta Thrashers Boston Bruins Buffalo Sabres
## 0.0084897959 0.0368979592 0.0391836735
## Calgary Flames California Golden Seals Carolina Hurricanes
## 0.0334693878 0.0053877551 0.0163265306
## Chicago Blackhawks Cleveland Barons Colorado Avalanche
## 0.0486530612 0.0009795918 0.0187755102
## Colorado Rockies Columbus Blue Jackets Dallas Stars
## 0.0037551020 0.0155102041 0.0169795918
## Detroit Red Wings Edmonton Oilers Florida Panthers
## 0.0409795918 0.0300408163 0.0228571429
## Hartford Whalers Kansas City Scouts Los Angeles Kings
## 0.0153469388 0.0022857143 0.0352653061
## Minnesota North Stars Minnesota Wild Montreal Canadiens
## 0.0230204082 0.0115918367 0.0519183673
## Nashville Predators New Jersey Devils New York Islanders
## 0.0148571429 0.0359183673 0.0378775510
## New York Rangers Oakland Seals Ottawa Senators
## 0.0468571429 0.0011428571 0.0210612245
## Philadelphia Flyers Phoenix Coyotes Pittsburgh Penguins
## 0.0391836735 0.0135510204 0.0373877551
## Quebec Nordiques San Jose Sharks Seattle Kraken
## 0.0145306122 0.0191020408 0.0011428571
## St. Louis Blues Tampa Bay Lightning Toronto Maple Leafs
## 0.0383673469 0.0191020408 0.0455510204
## Vancouver Canucks Vegas Golden Knights Washington Capitals
## 0.0364081633 0.0031020408 0.0403265306
## Winnipeg Jets
## 0.0231836735
For the team sub samples, using our logic from the previous data dive, we can see than each team’s probability of getting selected is small but the teams with the longest history and most records will appear the most. The Detroit Red Wings and Chicago Black hawks prove this with their higher proportions. Across sub samples this is accurate too.
age_s <- sampling_function("age", 0.5, 5)
summary(age_s)
## sample 1 sample 2 sample 3 sample 4 sample 5
## Min. :16.00 Min. :16.00 Min. :16.00 Min. :16.00 Min. :16.0
## 1st Qu.:18.00 1st Qu.:18.00 1st Qu.:18.00 1st Qu.:18.00 1st Qu.:18.0
## Median :18.00 Median :18.00 Median :18.00 Median :18.00 Median :18.0
## Mean :18.69 Mean :18.66 Mean :18.67 Mean :18.63 Mean :18.7
## 3rd Qu.:19.00 3rd Qu.:19.00 3rd Qu.:19.00 3rd Qu.:19.00 3rd Qu.:19.0
## Max. :32.00 Max. :30.00 Max. :37.00 Max. :30.00 Max. :31.0
## NA's :1995 NA's :2008 NA's :1961 NA's :1964 NA's :1978
For the age variable we can see some interesting min and max values with all of the samples having minimum values of 16 years old while samples 1 and 3 have max values of 37. The max value of 37 years old for age I would consider as an abnormality for a draft in which the median age for each sub sample is 18 years old. We also can see consistent 1st and 3rd quartiles of 18 and 19 years old with every sub sample. This means that 50% of the sub samples are between 18 and 19 years old which justifies that 37 and 16 are outliers. The NA’s most likely come from players in the 1960s as that data is not recorded as well as more of the recent years.
position_s <- sampling_function("position", 0.5, 5)
prop.table(table(position_s$`sample 1`))
##
## C C / R C/LW C/RW C/W D
## 0.2235313369 0.0001636393 0.0044182622 0.0039273441 0.0001636393 0.3264604811
## D/LW D/RW D/W F G LW
## 0.0006545574 0.0004909180 0.0001636393 0.0013091147 0.0999836361 0.1710031091
## LW/C LW/D RW RW/C W
## 0.0016363934 0.0009818360 0.1605301915 0.0006545574 0.0039273441
For positions, we can see the main positions of the data hold true for the subsamples with the Center position being 21% of this sub sample, Defense being 32%, Goalie being 10%, Left wing being 16% and Right wing being 17%. This hold true over other subsamples too.
points_s <- sampling_function("points", 0.5, 5)
summary(points_s)
## sample 1 sample 2 sample 3 sample 4
## Min. : 0.0 Min. : 0.0 Min. : 0 Min. : 0.0
## 1st Qu.: 3.0 1st Qu.: 3.0 1st Qu.: 3 1st Qu.: 3.0
## Median : 28.0 Median : 31.0 Median : 35 Median : 25.0
## Mean : 132.2 Mean : 142.1 Mean : 142 Mean : 139.1
## 3rd Qu.: 156.8 3rd Qu.: 170.0 3rd Qu.: 171 3rd Qu.: 169.0
## Max. :1723.0 Max. :1887.0 Max. :1887 Max. :1921.0
## NA's :3567 NA's :3468 NA's :3535 NA's :3544
## sample 5
## Min. : 0.0
## 1st Qu.: 3.5
## Median : 30.0
## Mean : 129.4
## 3rd Qu.: 162.0
## Max. :1771.0
## NA's :3450
The points column in my opinion is one of the most interesting columns of the entire data set. This is because of how the NHL has evolved over the years and the increase in player points. The sub samples for this data also show that total career points of a player in the NHL. Every sample has a minimum point score of 0 points which makes sense because a lot of players have come through the league and been on the bench but never made any points. The 1st quartile is also consistent through all subsamples at 3 points. This is also very interesting because that means that most players drafted have their point totals right skewed. This means that their careers certain players more focused on defense than offense. Most likely certain positions have more points than others. The NA count is in the 3500 range for each subsample most likely because those players either never scored points or it was never recorded. The max values for samples 1, 2, and 5 all have 1771 points while sample 3 has 1798 and sample 4 has the largest max value at 1921.
assists_s <- sampling_function("assists", 0.5, 5)
summary(assists_s)
## sample 1 sample 2 sample 3 sample 4
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 2.00 1st Qu.: 2.00 1st Qu.: 2.00 1st Qu.: 2.00
## Median : 18.00 Median : 20.00 Median : 18.00 Median : 19.00
## Mean : 85.94 Mean : 85.23 Mean : 83.47 Mean : 85.13
## 3rd Qu.: 113.00 3rd Qu.: 107.00 3rd Qu.: 104.25 3rd Qu.: 107.00
## Max. :1249.00 Max. :1249.00 Max. :1249.00 Max. :1193.00
## NA's :3480 NA's :3497 NA's :3485 NA's :3502
## sample 5
## Min. : 0.00
## 1st Qu.: 2.00
## Median : 18.00
## Mean : 87.11
## 3rd Qu.: 106.00
## Max. :1155.00
## NA's :3543
For assists, the minimum value between each sub sample is 0 which makes sense and each sub sample also has the same 1st quartile at 2 assists. The max assist value comes from samples 1 and 4 at 1249. The mean amount of assists varies between 80 and 90 with 80.06 being the minimum average in sample 3 and 90.08 being the maximum average in sample 4. There are also around 3500 NA values which most likely come from players who don’t have points either.
goaliegp_s <- sampling_function("goalie_games_played", 0.5, 5)
summary(goaliegp_s)
## sample 1 sample 2 sample 3 sample 4
## Min. : 1.0 Min. : 1.0 Min. : 1.0 Min. : 1.0
## 1st Qu.: 7.0 1st Qu.: 5.0 1st Qu.: 6.0 1st Qu.: 10.0
## Median : 67.0 Median : 74.0 Median : 58.0 Median : 78.0
## Mean :172.5 Mean : 200.8 Mean :172.3 Mean : 199.7
## 3rd Qu.:258.5 3rd Qu.: 349.0 3rd Qu.:309.0 3rd Qu.: 356.8
## Max. :887.0 Max. :1266.0 Max. :939.0 Max. :1266.0
## NA's :5871 NA's :5874 NA's :5884 NA's :5889
## sample 5
## Min. : 1.0
## 1st Qu.: 6.0
## Median : 52.5
## Mean : 157.2
## 3rd Qu.: 220.2
## Max. :1266.0
## NA's :5849
Looking at the summary statistics for the samples above, we can see that the minimum values for each sample are 1.0 but have different numbers for everything else. The mean for sample 4 is the largest at 200.4 while sample 1 has the lowest mean at 168.6. Samples 1, 3, and 5 have the larest maximum values at 1044 while sample 4 has the lowest max value at 887. Compared to the games played variable we can see that the maximum values for goalie game played are lower than games played and have 2000 more NA values than games played. This means that a lot of players have either unrecorded amount of games or haven’t played a NHL game yet.Just like the games played variable, abnormalities don’t exist as there is a variety of reasons that the domain of goalie games played. Players who have played in the league for longer tend to have more games recorded than those who haven’t played at all.
Each column gives us insight into how the data is layed out and their distributions. This is useful for purpose of sampling and seeing if our original hunches from the first data dive were correct.