Week 4 Data Dive

This week, for our data dive, we will be looking at sampling in our data set with different columns and looking at the differences and what it means for our data.

Loading in the data, tidyverse and setting seed (for reproducble results):

# Loading tidyverse 

library(tidyverse)

#Loading in Data

nhl_draft <- read_csv("nhldraft.csv")

# Setting seed

set.seed(1)

The task for this data dive is to choose a variety of columns from the data set and make sub samples for each column 5-10 times with roughly 50%-75% of the data for each sub sample.

To do this, I made a custom function taking 3 parameters: column name, the percent of data sampled from, and the amount of samples calculated. This function then uses a for loop to calculate the amount of samples and store each sample in a list. It also makes a vactor of sample names and stores each name for each iteration. Once the for loop is done, it binds the columns of the list together and stores it as a data frame in “df”. Finally, the names generated in the sample name vector is then applied to the data frame and the function returns the completed data frame.

sampling_function <- function(column_name, sample_size_percent, num_samples){
  
  columns <- list()
  col_names <- c()
  
  for(i in 1:num_samples){
    columns[[i]] <- nhl_draft |> sample_frac(sample_size_percent, replace = TRUE) |> pluck(column_name)
    col_names[i] <- paste("sample",i)
  }
  
  df <- bind_cols(columns)
  
  names(df) <- col_names
  
  return(df)  
}

Now we can calculate statistics for each sub sample based on the column that we choose. For every sub sample I decided to use 50% of the column for the sample and I decided that I would do 5 samples for each column. To show this, you can see for each column’s sample function I included 0.5 (50%) and 5 (number of samples) as arguments for the function.

The 7 columns I chose for this data dive are as follows:

Games Played

gp_s <- sampling_function("games_played", 0.5, 5)

summary(gp_s)

##     sample 1         sample 2         sample 3       sample 4     
##  Min.   :   1.0   Min.   :   1.0   Min.   :   1   Min.   :   1.0  
##  1st Qu.:  23.0   1st Qu.:  25.0   1st Qu.:  25   1st Qu.:  25.0  
##  Median : 141.0   Median : 142.0   Median : 152   Median : 155.0  
##  Mean   : 296.6   Mean   : 297.1   Mean   : 316   Mean   : 305.4  
##  3rd Qu.: 501.5   3rd Qu.: 491.5   3rd Qu.: 550   3rd Qu.: 536.0  
##  Max.   :1779.0   Max.   :1779.0   Max.   :1635   Max.   :1779.0  
##  NA's   :3510     NA's   :3577     NA's   :3530   NA's   :3489    
##     sample 5   
##  Min.   :   1  
##  1st Qu.:  23  
##  Median : 135  
##  Mean   : 291  
##  3rd Qu.: 488  
##  Max.   :1733  
##  NA's   :3576

Looking at the summary statistics for the samples above, we can see that the minimum values for each sample are 1.0 but have different numbers for everything else. Samples 1, 2 and 4 have the largest max values at 1779 while Sample 5 has a max value of 1733 and Sample 3 has a max value at 1635 which is the lowest for all the max values of the samples. All of the sample means are varied with sample 3’s mean being the highest at 316 while sample 1 has the lowest mean at 296. We can also see that there are a lot of NA’s which means that a lot of players have either unrecorded amount of games or haven’t played a NHL game yet. In this case scenario, abnormalities don’t exist as there is a variety of reasons that the domain of games played. Players who have played in the league for longer tend to have more games recorded than those who haven’t played at all.

Team

team_s <- sampling_function("team", 0.5, 5)

prop.table(table(team_s$`sample 1`))

## 
##           Anaheim Ducks         Arizona Coyotes          Atlanta Flames 
##            0.0199183673            0.0065306122            0.0071836735 
##       Atlanta Thrashers           Boston Bruins          Buffalo Sabres 
##            0.0084897959            0.0368979592            0.0391836735 
##          Calgary Flames California Golden Seals     Carolina Hurricanes 
##            0.0334693878            0.0053877551            0.0163265306 
##      Chicago Blackhawks        Cleveland Barons      Colorado Avalanche 
##            0.0486530612            0.0009795918            0.0187755102 
##        Colorado Rockies   Columbus Blue Jackets            Dallas Stars 
##            0.0037551020            0.0155102041            0.0169795918 
##       Detroit Red Wings         Edmonton Oilers        Florida Panthers 
##            0.0409795918            0.0300408163            0.0228571429 
##        Hartford Whalers      Kansas City Scouts       Los Angeles Kings 
##            0.0153469388            0.0022857143            0.0352653061 
##   Minnesota North Stars          Minnesota Wild      Montreal Canadiens 
##            0.0230204082            0.0115918367            0.0519183673 
##     Nashville Predators       New Jersey Devils      New York Islanders 
##            0.0148571429            0.0359183673            0.0378775510 
##        New York Rangers           Oakland Seals         Ottawa Senators 
##            0.0468571429            0.0011428571            0.0210612245 
##     Philadelphia Flyers         Phoenix Coyotes     Pittsburgh Penguins 
##            0.0391836735            0.0135510204            0.0373877551 
##        Quebec Nordiques         San Jose Sharks          Seattle Kraken 
##            0.0145306122            0.0191020408            0.0011428571 
##         St. Louis Blues     Tampa Bay Lightning     Toronto Maple Leafs 
##            0.0383673469            0.0191020408            0.0455510204 
##       Vancouver Canucks    Vegas Golden Knights     Washington Capitals 
##            0.0364081633            0.0031020408            0.0403265306 
##           Winnipeg Jets 
##            0.0231836735

For the team sub samples, using our logic from the previous data dive, we can see than each team’s probability of getting selected is small but the teams with the longest history and most records will appear the most. The Detroit Red Wings and Chicago Black hawks prove this with their higher proportions. Across sub samples this is accurate too.

age_s <- sampling_function("age", 0.5, 5)

summary(age_s)

##     sample 1        sample 2        sample 3        sample 4        sample 5   
##  Min.   :16.00   Min.   :16.00   Min.   :16.00   Min.   :16.00   Min.   :16.0  
##  1st Qu.:18.00   1st Qu.:18.00   1st Qu.:18.00   1st Qu.:18.00   1st Qu.:18.0  
##  Median :18.00   Median :18.00   Median :18.00   Median :18.00   Median :18.0  
##  Mean   :18.69   Mean   :18.66   Mean   :18.67   Mean   :18.63   Mean   :18.7  
##  3rd Qu.:19.00   3rd Qu.:19.00   3rd Qu.:19.00   3rd Qu.:19.00   3rd Qu.:19.0  
##  Max.   :32.00   Max.   :30.00   Max.   :37.00   Max.   :30.00   Max.   :31.0  
##  NA's   :1995    NA's   :2008    NA's   :1961    NA's   :1964    NA's   :1978

For the age variable we can see some interesting min and max values with all of the samples having minimum values of 16 years old while samples 1 and 3 have max values of 37. The max value of 37 years old for age I would consider as an abnormality for a draft in which the median age for each sub sample is 18 years old. We also can see consistent 1st and 3rd quartiles of 18 and 19 years old with every sub sample. This means that 50% of the sub samples are between 18 and 19 years old which justifies that 37 and 16 are outliers. The NA’s most likely come from players in the 1960s as that data is not recorded as well as more of the recent years.

Position

position_s <- sampling_function("position", 0.5, 5)

prop.table(table(position_s$`sample 1`))

## 
##            C        C / R         C/LW         C/RW          C/W            D 
## 0.2235313369 0.0001636393 0.0044182622 0.0039273441 0.0001636393 0.3264604811 
##         D/LW         D/RW          D/W            F            G           LW 
## 0.0006545574 0.0004909180 0.0001636393 0.0013091147 0.0999836361 0.1710031091 
##         LW/C         LW/D           RW         RW/C            W 
## 0.0016363934 0.0009818360 0.1605301915 0.0006545574 0.0039273441

For positions, we can see the main positions of the data hold true for the subsamples with the Center position being 21% of this sub sample, Defense being 32%, Goalie being 10%, Left wing being 16% and Right wing being 17%. This hold true over other subsamples too.

Points

points_s <- sampling_function("points", 0.5, 5)

summary(points_s)

##     sample 1         sample 2         sample 3       sample 4     
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0   Min.   :   0.0  
##  1st Qu.:   3.0   1st Qu.:   3.0   1st Qu.:   3   1st Qu.:   3.0  
##  Median :  28.0   Median :  31.0   Median :  35   Median :  25.0  
##  Mean   : 132.2   Mean   : 142.1   Mean   : 142   Mean   : 139.1  
##  3rd Qu.: 156.8   3rd Qu.: 170.0   3rd Qu.: 171   3rd Qu.: 169.0  
##  Max.   :1723.0   Max.   :1887.0   Max.   :1887   Max.   :1921.0  
##  NA's   :3567     NA's   :3468     NA's   :3535   NA's   :3544    
##     sample 5     
##  Min.   :   0.0  
##  1st Qu.:   3.5  
##  Median :  30.0  
##  Mean   : 129.4  
##  3rd Qu.: 162.0  
##  Max.   :1771.0  
##  NA's   :3450

The points column in my opinion is one of the most interesting columns of the entire data set. This is because of how the NHL has evolved over the years and the increase in player points. The sub samples for this data also show that total career points of a player in the NHL. Every sample has a minimum point score of 0 points which makes sense because a lot of players have come through the league and been on the bench but never made any points. The 1st quartile is also consistent through all subsamples at 3 points. This is also very interesting because that means that most players drafted have their point totals right skewed. This means that their careers certain players more focused on defense than offense. Most likely certain positions have more points than others. The NA count is in the 3500 range for each subsample most likely because those players either never scored points or it was never recorded. The max values for samples 1, 2, and 5 all have 1771 points while sample 3 has 1798 and sample 4 has the largest max value at 1921.

Assists

assists_s <- sampling_function("assists", 0.5, 5)

summary(assists_s)

##     sample 1          sample 2          sample 3          sample 4      
##  Min.   :   0.00   Min.   :   0.00   Min.   :   0.00   Min.   :   0.00  
##  1st Qu.:   2.00   1st Qu.:   2.00   1st Qu.:   2.00   1st Qu.:   2.00  
##  Median :  18.00   Median :  20.00   Median :  18.00   Median :  19.00  
##  Mean   :  85.94   Mean   :  85.23   Mean   :  83.47   Mean   :  85.13  
##  3rd Qu.: 113.00   3rd Qu.: 107.00   3rd Qu.: 104.25   3rd Qu.: 107.00  
##  Max.   :1249.00   Max.   :1249.00   Max.   :1249.00   Max.   :1193.00  
##  NA's   :3480      NA's   :3497      NA's   :3485      NA's   :3502     
##     sample 5      
##  Min.   :   0.00  
##  1st Qu.:   2.00  
##  Median :  18.00  
##  Mean   :  87.11  
##  3rd Qu.: 106.00  
##  Max.   :1155.00  
##  NA's   :3543

For assists, the minimum value between each sub sample is 0 which makes sense and each sub sample also has the same 1st quartile at 2 assists. The max assist value comes from samples 1 and 4 at 1249. The mean amount of assists varies between 80 and 90 with 80.06 being the minimum average in sample 3 and 90.08 being the maximum average in sample 4. There are also around 3500 NA values which most likely come from players who don’t have points either.

Goalie Games Played

goaliegp_s <- sampling_function("goalie_games_played", 0.5, 5)

summary(goaliegp_s)

##     sample 1        sample 2         sample 3        sample 4     
##  Min.   :  1.0   Min.   :   1.0   Min.   :  1.0   Min.   :   1.0  
##  1st Qu.:  7.0   1st Qu.:   5.0   1st Qu.:  6.0   1st Qu.:  10.0  
##  Median : 67.0   Median :  74.0   Median : 58.0   Median :  78.0  
##  Mean   :172.5   Mean   : 200.8   Mean   :172.3   Mean   : 199.7  
##  3rd Qu.:258.5   3rd Qu.: 349.0   3rd Qu.:309.0   3rd Qu.: 356.8  
##  Max.   :887.0   Max.   :1266.0   Max.   :939.0   Max.   :1266.0  
##  NA's   :5871    NA's   :5874     NA's   :5884    NA's   :5889    
##     sample 5     
##  Min.   :   1.0  
##  1st Qu.:   6.0  
##  Median :  52.5  
##  Mean   : 157.2  
##  3rd Qu.: 220.2  
##  Max.   :1266.0  
##  NA's   :5849

Looking at the summary statistics for the samples above, we can see that the minimum values for each sample are 1.0 but have different numbers for everything else. The mean for sample 4 is the largest at 200.4 while sample 1 has the lowest mean at 168.6. Samples 1, 3, and 5 have the larest maximum values at 1044 while sample 4 has the lowest max value at 887. Compared to the games played variable we can see that the maximum values for goalie game played are lower than games played and have 2000 more NA values than games played. This means that a lot of players have either unrecorded amount of games or haven’t played a NHL game yet.Just like the games played variable, abnormalities don’t exist as there is a variety of reasons that the domain of goalie games played. Players who have played in the league for longer tend to have more games recorded than those who haven’t played at all.

Each column gives us insight into how the data is layed out and their distributions. This is useful for purpose of sampling and seeing if our original hunches from the first data dive were correct.

Week 4 Data Dive

Connor Bryson

9/16/2023