Week 5 Data Dive

Loading in the tidyverse, data and setting seed

# Loading tidyverse 

library(tidyverse)

#Loading in Data

nhl_draft <- read_csv("nhldraft.csv")

# Setting seed

set.seed(1)

For this week’s Data Dive, we’re going to talk about documentation and why it matters to understand what each variable means. That includes why certain values are NA, why the person who created the data set chose a particular name or encoding for a variable, and what more we can do with the data once we have that information.

Looking at the data set we can see all of the variables the data set contains:

colnames(nhl_draft)

##  [1] "id"                    "year"                  "overall_pick"         
##  [4] "team"                  "player"                "nationality"          
##  [7] "position"              "age"                   "to_year"              
## [10] "amateur_team"          "games_played"          "goals"                
## [13] "assists"               "points"                "plus_minus"           
## [16] "penalties_minutes"     "goalie_games_played"   "goalie_wins"          
## [19] "goalie_losses"         "goalie_ties_overtime"  "save_percentage"      
## [22] "goals_against_average" "point_shares"

Most of the variables look self-explanatory, like points, draft id, draft year, player, overall pick, etc. However, a couple of variables can be confusing without documentation.

If you would like to look at the documentation for this data set and download the data yourself, I got it from Kaggle, an open data website owned by Google where the data science community comes together to share ideas, compete in competitions, and post data they web-scraped or collected. Each Kaggle post contains the data and has a Usability rating, determined by multiple factors, that helps people accurately assess the validity of the data posted.

The data that I have been using for the data dives can be found at the link below.

https://www.kaggle.com/datasets/mattop/nhl-draft-hockey-player-data-1963-2022

Big thanks to user MATT OP for posting the data!

Now, let’s first look at the plus_minus variable. The data card describes it only as the “Plus minus of a player”. If you don’t follow sports, this can be confusing.

When you plot the variable, you get this:

nhl_draft |>
  ggplot(mapping = aes(x = plus_minus)) +
  geom_bar()

## Warning: Removed 7016 rows containing non-finite values (`stat_count()`).

Which doesn’t tell you much!

Plus minus in sports is defined by Wikipedia as “a sports statistic used to measure a player’s impact, represented by the difference between their team’s total scoring versus their opponent’s when the player is in the game.”

So plus/minus is a general term that you need to understand on your own, since it applies to multiple sports, and the documentation did not do a great job of explaining it to non-sports people. Without the context of what a player does during a game, analysis of this variable would make little sense, and an outlier such as a plus/minus of 800 would be treated like any other value. If you don’t know sports terms and have no documentation or prior understanding, this variable can seem pointless or be used incorrectly.
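One way to check extreme plus/minus values like the one mentioned above is to pull out the rows with the largest absolute values and read them next to games played. The sketch below is only an illustration: `top_plus_minus()` is a hypothetical helper name, not something from the data set or its documentation, and it assumes the column names shown in the `colnames()` output.

```r
library(dplyr)

# Hypothetical helper (an assumption for illustration, not part of the data card):
# return the n rows with the largest absolute plus/minus, ignoring missing values.
top_plus_minus <- function(df, n = 5) {
  df |>
    filter(!is.na(plus_minus)) |>
    slice_max(abs(plus_minus), n = n, with_ties = FALSE) |>
    select(player, year, games_played, plus_minus)
}

# Usage, assuming nhl_draft was loaded with read_csv() as above:
# top_plus_minus(nhl_draft)
```

Seeing the extreme values beside games_played makes it easier to judge whether a large number reflects a long career or a possible data problem.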

Next, let’s talk about the point_shares variable.

The documentation gives no explanation of point shares at all! This is likely because point shares is something that only sports fans would know.

According to Wikipedia, Point Shares is “an estimate of the number of points contributed by a player.”

So I imagine the author assumed that the audience for the data would be sports fans. In this case, point shares is a more straightforward term, but it still needs context to figure out.

This is what point shares looks like plotted:

nhl_draft |>
  ggplot(mapping = aes(x = point_shares)) +
  geom_bar()

## Warning: Removed 7004 rows containing non-finite values (`stat_count()`).

Finally, I want to explain the “to_year” variable. At first glance, without context, it’s easy to forget what kind of data this data set holds.

This data set is draft data, so a missing value means either that the player’s data wasn’t recorded (picks from the 1960s and 70s) or that the player hasn’t made the league yet (picks in the last decade).

The documentation defines to_year as “Year draft pick played to”. For players who have played in the league, this variable gives the year the player stopped playing, so each player has a range of years associated with them: the initial year is the draft year and the final year is the value of to_year for that observation.
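The range of years described above can be turned into a single number per player. This is a minimal sketch: `add_career_span()` and the `career_span` column are hypothetical names, and it takes the draft year as the start of the range, as the text does. Since players often debut after their draft year, the result is best read as the span from draft to final season rather than true playing time.

```r
library(dplyr)

# Hypothetical helper (names are assumptions for illustration):
# add the number of years between the draft year and the last year played.
# Rows with a missing to_year get NA for career_span.
add_career_span <- function(df) {
  df |>
    mutate(career_span = to_year - year)
}

# Usage, assuming nhl_draft was loaded with read_csv() as above:
# nhl_draft |> add_career_span() |> select(player, year, to_year, career_span)
```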

Note: Draft picks from more recent years have blank values for to_year because they either haven’t played in the league yet (minor league players) or are currently playing in the league.
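To see whether missing to_year values really cluster in the earliest and most recent drafts, the share of missing values can be computed per draft year. The sketch below is an illustration only: `missing_to_year_by_draft()` and its output column names are hypothetical, and it assumes the `year` and `to_year` columns from the `colnames()` output.

```r
library(dplyr)

# Hypothetical helper (names are assumptions for illustration):
# for each draft year, count the picks and the share with a missing to_year.
missing_to_year_by_draft <- function(df) {
  df |>
    group_by(year) |>
    summarise(
      picks = n(),
      share_missing = mean(is.na(to_year)),
      .groups = "drop"
    ) |>
    arrange(year)
}

# Usage, assuming nhl_draft was loaded with read_csv() as above:
# missing_to_year_by_draft(nhl_draft)
```

If the explanation in the text holds, share_missing should be highest at both ends of the year range.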

Documentation is important for understanding the data you work with, and this data is no exception. There will always be flaws in certain values or variables that need to be taken into account during analysis (sports variables included). Without documentation, analysis becomes skewed, especially when it is data that you didn’t create.

Connor Bryson

9/21/2023