1 The data

For this section we will use the nfl data along with the tidyverse and ggiraph packages. If you have not already done so, first install ggiraph with install.packages("ggiraph").

library(tidyverse)
library(ggplot2)
library(ggiraph)

Do any interesting relationships exist between NFL combine metrics and draft position?

Players who perform better in the combine drills are plausibly more likely to be drafted, and to be drafted earlier. The combine consists of exercises that measure a wide range of athletic abilities, so excelling in any one of them can help a player stand out to teams.

Below is data on every player who participated in the 2018 NFL draft combine or was drafted by an NFL team in 2018. Let’s read in the data and convert it to a tibble to make it easier to view. The argument stringsAsFactors = FALSE ensures that variables such as Player, Pos, and School are read in as type character.

  • Player: player’s name
  • Pos: player’s position
  • School: college of player
  • Ht: height in inches
  • Wt: weight in pounds
  • Dash40: forty yard dash time in seconds
  • Vertical: vertical jump in inches
  • Bench: number of bench press repetitions at 225 lbs
  • Broad.Jump: broad jump distance in inches
  • Cone3: 3 cone drill time in seconds
  • Shuttle: twenty yard shuttle time in seconds
  • Team: team that drafted the player
  • Round: round the player was drafted (0 means no round)
  • Pick: draft selection (0 means not drafted)
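The read-in step described above might look like the following minimal sketch. The file name and location of the real nfl data are not given here, so the example reads a few made-up rows from an inline string instead. The object name nfl_demo and all values are hypothetical.

```r
# Hypothetical rows standing in for the real nfl file (values are made up)
csv_text <- "Player,Pos,School,Dash40,Team
Player A,WR,School X,4.41,Team One
Player B,DT,School Y,,
Player C,CB,School Z,4.32,Team Two"

# stringsAsFactors = FALSE keeps text columns as character
# (this is already the default in R >= 4.0.0);
# na.strings = "" turns empty text fields into NA
nfl_demo <- read.csv(text = csv_text, stringsAsFactors = FALSE, na.strings = "")

# Convert to a tibble for nicer printing, if the tibble package is installed
if (requireNamespace("tibble", quietly = TRUE)) {
  nfl_demo <- tibble::as_tibble(nfl_demo)
}

str(nfl_demo)
```

With the real file, the same arguments would be passed to read.csv() together with the file path in place of text = csv_text.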

2 Investigation

As you can see above, nfl contains a lot of missing values, represented by NA. Not all players participate in the NFL combine, and some who do participate do not perform every skills test. Also, undrafted players have NA values for their Team. Before we create visualizations, let’s try to better understand some of the data.
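One quick way to see how much is missing is to count the NA values in each column. The sketch below uses a small made-up data frame (toy) with the same kinds of gaps as nfl. The values are hypothetical, and the same calls should work on nfl itself.

```r
# Hypothetical toy data with the same kinds of gaps as nfl (values are made up)
toy <- data.frame(
  Player = c("Player A", "Player B", "Player C", "Player D"),
  Dash40 = c(4.41, NA, 4.32, 4.60),
  Bench  = c(NA, 25, NA, 18),
  Team   = c("Team One", NA, "Team Two", "Team One"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Number of rows with no missing values at all
complete_rows <- sum(complete.cases(toy))
complete_rows
```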

2.1 Exercises

  1. How many players went undrafted in the data set nfl?
with(nfl, sum(is.na(Team)))
[1] 184
  2. Which player had the fastest 40 yard dash time? What was his time?
min(nfl$Dash40, na.rm = TRUE)
[1] 4.32
nfl$Player[which.min(nfl$Dash40)]
[1] "Donte Jackson"
  3. How many players are in the data set for each position? Hint: table()
table(nfl$Pos)

   C   CB   DB   DE   DT EDGE   FB  ILB    K   LB   LS   OG   OL  OLB   OT    P 
   9   41    2   16   24   23    1   15    4    3    1   14    2   13   23    7 
  QB   RB    S   TE   WR 
  19   31   27   17   44 
  4. What was the mean number of bench press repetitions for all players listed at the position of DT?
nfl %>%
  filter(Pos == "DT") %>%   # keep only the players listed at DT
  summarise(mean_bench = mean(Bench, na.rm = TRUE))
  5. Which team drafted the most players in 2018?
players.per.team <- table(nfl$Team)
# sort the counts from largest to smallest and keep the first entry
sort(players.per.team, decreasing = TRUE)[1]
Denver Broncos 
              9 
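Two details in the exercises above are easy to miss: which.min() reports only the first minimum, and table() silently drops NA values. The sketch below illustrates both on a small made-up data frame. The name toy and all values are hypothetical.

```r
# Hypothetical toy data (values are made up)
toy <- data.frame(
  Player = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  Dash40 = c(4.41, 4.32, NA, 4.32, 4.60),
  Team   = c("Team One", "Team Two", NA, "Team Two", "Team One"),
  stringsAsFactors = FALSE
)

# which.min() reports only the first minimum
first_fastest <- toy$Player[which.min(toy$Dash40)]

# Compare against the minimum to recover every tied player
fastest_time <- min(toy$Dash40, na.rm = TRUE)
all_fastest  <- toy$Player[which(toy$Dash40 == fastest_time)]

# table() drops NA by default; useNA = "ifany" makes them visible
team_counts <- table(toy$Team, useNA = "ifany")
sort(team_counts, decreasing = TRUE)
```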

3 Visualizations with ggplot

3.1 Exercises

Recreate plots 1 - 5. Add comments to the code that generated plots 6 and 7 to explain what is being done in those plots. Use the available hints before looking at the solution. Comment on any interesting trends/relationships you observe.

Plot 1

Plot

Hints

  • geom_point()

  • geom_smooth()
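As a rough illustration of how these two layers combine, here is a minimal sketch on the built-in mtcars data, which stands in for nfl and is not the solution to Plot 1. The choice of method = "lm" is an assumption made for the example.

```r
library(ggplot2)

# mtcars is a built-in stand-in here, not the nfl data
p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                                  # one point per car
  geom_smooth(method = "lm", formula = y ~ x) +   # fitted line with confidence band
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
p1
```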

Plot 2

Plot

Hints

  • geom_boxplot()
  • colors used: purple, black
  • caption argument in labs()

Plot 3

Plot

Hints

  • subset nfl to create a data frame with only WR who were drafted
  • colors used: purple, black
  • stat = "identity" in geom_bar()
  • flip coordinates
  • scale_y_continuous(breaks = seq(0, 1, .1), labels = seq(4, 5, .1))
  • to sort the bars use reorder(Player, -Dash40)
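The axis hint above makes more sense once you see that bars always start at zero: one way to use it is to plot the time minus 4 and then relabel the axis so it reads 4 to 5 seconds. The following is a minimal sketch of that idea on made-up data. The player names and times are hypothetical, and this is one possible reading of the hints rather than the official solution.

```r
library(ggplot2)

# Hypothetical toy data; names and times are made up for illustration
toy <- data.frame(
  Player = c("Player A", "Player B", "Player C", "Player D"),
  Dash40 = c(4.35, 4.52, 4.41, 4.60)
)

# Plot Dash40 - 4 so the bars start at 4 seconds, then relabel the axis
p3 <- ggplot(toy, aes(x = reorder(Player, -Dash40), y = Dash40 - 4)) +
  geom_bar(stat = "identity", fill = "purple", color = "black") +
  scale_y_continuous(breaks = seq(0, 1, .1), labels = seq(4, 5, .1)) +
  coord_flip() +
  labs(x = "Player", y = "40 yard dash time (seconds)")
p3
```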

Plot 4

Plot

Hints

  • geom_histogram()
  • colors used: darkgreen, black
  • binwidth = 5

Plot 5

Plot

Hints

  • subset nfl to only include the positions of CB, S, RB, WR, and TE for players that were drafted
  • color used: black
  • shape = 21
  • scale_size(range=c(10, 3))
  • theme_bw()
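The reversed range in scale_size() is the least obvious hint: listing the larger size first means the smallest data values get the largest points. Here is a minimal sketch on the built-in mtcars data, which stands in for nfl. The variable choices are assumptions made for the example.

```r
library(ggplot2)

# mtcars is a built-in stand-in here, not the nfl data
p5 <- ggplot(mtcars, aes(x = wt, y = mpg, size = hp, fill = factor(cyl))) +
  geom_point(shape = 21, color = "black") +   # shape 21: filled circle with an outline
  scale_size(range = c(10, 3)) +              # reversed range: small hp gives a large point
  theme_bw() +
  labs(fill = "cyl")
p5
```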

Plot 6

Plot

Plot 7

Plot

4 Data know-how

In this section, we will get some practice reading in data sets and working with the apply function.

4.1 Read data

Data may be

  1. available in base R or through an R package, such as the diamonds data set that comes with ggplot2 (part of the tidyverse);

  2. read into R from a file on your computer;

  3. read into R directly from a website;

  4. scraped from a website.

Today we will get practice with examples that involve items 2 and 3. Some of the most popular R functions to accomplish this are

  • read.table()

  • read.csv()

  • load()
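To show how the three functions differ, the sketch below writes a small made-up data set to temporary files and reads it back in each way. The object and column names (toy, id, score) are hypothetical.

```r
# Write a small hypothetical data set to temporary files so the example is self-contained
toy <- data.frame(id = 1:3, score = c(90.5, NA, 78))
csv_path   <- tempfile(fileext = ".csv")
rdata_path <- tempfile(fileext = ".RData")
write.csv(toy, csv_path, row.names = FALSE)
save(toy, file = rdata_path)

# read.csv() assumes a header row and comma separators
from_csv <- read.csv(csv_path)

# read.table() is the general form: state the header and separator explicitly
from_table <- read.table(csv_path, header = TRUE, sep = ",")

# load() restores saved R objects under their original names
rm(toy)
load(rdata_path)   # recreates the object 'toy'
```

Note that load() does not return the data for assignment; it recreates the saved objects in the current environment.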

4.1.1 Exercises

Read in the following data sets and save each one as a well-named R object. A preview of each data set is given below for you to check your answer. Make sure the variable types match the preview, headers are read in when applicable, and NA values appear where appropriate.

  1. nevada_casino_sqft.csv (available on D2L)
  2. nevada_casino_sqft.csv (available at: http://www.stat.ufl.edu/~winner/data/nevada_casino_sqft.csv)
  3. qqq.tsv (available on D2L)
  4. spy.txt (available on D2L)
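The checks listed above (types, headers, NA values) can be sketched on a small made-up tab-separated file. The column names, values, and the "." missing-value code below are hypothetical and are not taken from qqq.tsv or spy.txt.

```r
# Hypothetical tab-separated file in which missing values are coded as "."
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("date\tprice\tvolume",
             "2020-01-02\t101.5\t1200",
             "2020-01-03\t.\t1500"),
           tsv_path)

# sep = "\t" for tabs, header = TRUE for column names,
# na.strings lists the codes that should become NA
prices <- read.table(tsv_path, header = TRUE, sep = "\t",
                     na.strings = c("NA", "."), stringsAsFactors = FALSE)

# Check column types and missing values against what you expect
sapply(prices, class)
colSums(is.na(prices))

# read.csv() and read.table() also accept a URL in place of a file path, e.g.
# casino <- read.csv("http://www.stat.ufl.edu/~winner/data/nevada_casino_sqft.csv")
```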

4.2 Apply

4.2.1 Matrix examples

The examples below can also be applied to data frames.

# create a 3 x 3 matrix
my.matrix <- matrix(data = c(3, 5, 10, -1, 0, 2, 18, 5, -3),
                    nrow = 3, ncol = 3)

# matrix is filled column-wise, view matrix
my.matrix

# by columns
apply(X = my.matrix, MARGIN = 2, FUN = mean)
apply(X = my.matrix, MARGIN = 2, FUN = max)

# by rows
apply(X = my.matrix, MARGIN = 1, FUN = sd)
apply(X = my.matrix, MARGIN = 1, FUN = which.max)
apply(X = my.matrix, MARGIN = 1, FUN = sort)
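One detail of the last call is worth a closer look: when FUN returns a vector rather than a single number, apply() stores each result as a column of the output, even for MARGIN = 1. The sketch below rebuilds my.matrix from above and shows this, along with the dedicated colMeans() shortcut.

```r
my.matrix <- matrix(data = c(3, 5, 10, -1, 0, 2, 18, 5, -3),
                    nrow = 3, ncol = 3)

# MARGIN = 2 with mean matches the dedicated colMeans() function
col_means <- apply(X = my.matrix, MARGIN = 2, FUN = mean)
all.equal(col_means, colMeans(my.matrix))

# When FUN returns a vector, each result becomes a COLUMN of the output,
# so the sorted rows come back as columns
sorted_rows <- apply(X = my.matrix, MARGIN = 1, FUN = sort)
sorted_rows[, 1]   # sorted version of row 1

# Transpose to get one sorted row per row
t(sorted_rows)
```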
  1. Create a 4 x 3 matrix and then have the data filled in by rows.
# 12 values are needed for a 4 x 3 matrix; byrow = TRUE fills it row by row
amatrix <- matrix(data = c(3, 5, 10, -1, 0, 2, 18, 5, -3, 3, 5, 10),
                    nrow = 4, ncol = 3, byrow = TRUE)
amatrix
     [,1] [,2] [,3]
[1,]    3    5   10
[2,]   -1    0    2
[3,]   18    5   -3
[4,]    3    5   10
apply(X = amatrix, MARGIN = 1, FUN = which.max)
[1] 3 3 1 3
  2. Explain what the which.max function is doing.
The which.max function returns the position (index) of the first maximum value in a vector. Used with apply and MARGIN = 1, it returns the column index of the largest value in each row.
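A short sketch of this behavior on a made-up vector, including what happens when the maximum is tied:

```r
x <- c(2, 7, 7, 1)
which.max(x)          # index of the first maximum only
which(x == max(x))    # indices of all maxima

# For a named vector, the name of the maximum is carried along
named_x <- c(a = 2, b = 7, c = 1)
which.max(named_x)
```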

4.2.2 Array examples

# create a 2 x 2 x 3 array that contains the numbers 1 - 12
my.array <- array(data = c(1:12), dim = c(2,2,3))

# view the array
my.array

# apply sum over 1 dimension
apply(my.array, 1, sum)
apply(my.array, 2, sum)
apply(my.array, 3, sum)

# apply sum over multiple dimensions
apply(my.array, c(1,2), sum)
apply(my.array, c(1,3), sum)
apply(my.array, c(2,3), sum)
apply(my.array, c(3,1), sum)
apply(my.array, c(3,2), sum)
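For arrays, MARGIN names the dimensions that are kept; the function is applied over everything else. The sketch below rebuilds my.array and stores a few of the results so they can be inspected. The object names slice_sums and cell_sums are introduced only for this example.

```r
my.array <- array(data = 1:12, dim = c(2, 2, 3))

# MARGIN = 3 keeps the third dimension: one sum per 2 x 2 slice
slice_sums <- apply(my.array, 3, sum)
slice_sums

# MARGIN = c(1, 2) keeps rows and columns and sums across the three slices
cell_sums <- apply(my.array, c(1, 2), sum)
cell_sums

# The order of MARGIN sets the order of the output dimensions,
# so c(3, 1) is the transpose of c(1, 3)
all.equal(apply(my.array, c(3, 1), sum), t(apply(my.array, c(1, 3), sum)))
```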

4.3 Further details

A summary of the apply, lapply, sapply, and tapply functions

Command                      Description
apply(X, MARGIN, FUN, ...)   Obtain a vector/array/list by applying FUN along the specified MARGIN of an array or matrix X
lapply(X, FUN, ...)          Obtain a list by applying FUN to the elements of a list or vector X
sapply(X, FUN, ...)          Wrapper around lapply that simplifies the result to a vector/array when possible instead of returning a list
tapply(X, INDEX, FUN, ...)   Obtain an array (table) by applying FUN to each combination of the factors given in INDEX

  • The above functions are good alternatives to loops

  • They are typically more concise than explicit loops; any speed advantage over a well-written for loop is usually modest, because these functions still loop internally

  • They take practice to get used to, but make analysis easier to debug and less prone to error when used effectively

  • Look at the examples on each function's help page, for instance with ?apply or example(sapply)
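The differences between these functions are easiest to see side by side. The following minimal sketch uses made-up inputs (scores, values, groups) and ends with the equivalent for loop for comparison.

```r
scores <- list(a = 1:3, b = c(2, 4))

# lapply() always returns a list
lapply(scores, mean)

# sapply() simplifies the same result to a named vector
means <- sapply(scores, mean)
means

# tapply() applies FUN within the groups defined by INDEX
values <- c(10, 20, 30, 40)
groups <- c("x", "y", "x", "y")
group_means <- tapply(values, groups, mean)
group_means

# The equivalent for loop needs more bookkeeping
loop_means <- numeric(length(scores))
for (i in seq_along(scores)) loop_means[i] <- mean(scores[[i]])
```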