1 The data

For this section we will use the nfl data. We will use the tidyverse and ggiraph packages. You will need to first install ggiraph with install.packages("ggiraph").

library(tidyverse)
library(ggplot2)
library(ggiraph)

Do any interesting relationships exist between NFL combine metrics and draft position?

The better a player performs in the combine, the more likely that player is to be drafted early. The combine consists of exercises that measure a wide range of player abilities, and excelling in any of them can help a player stand out and be picked.

Below are data on every player who participated in the 2018 NFL draft combine or was drafted by an NFL team in 2018. Let’s read in the data and convert it to a tibble to make it easier to view. The argument stringsAsFactors = FALSE ensures that variables such as Player, Pos, and School are read in as type character.
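The read-in step might look like the minimal sketch below. The actual file name and location are not given here, so the sketch writes a small made-up CSV to a temporary file and reads it back. The names `demo_rows`, `demo_path`, and `nfl_demo` are hypothetical, and the two rows are invented placeholders, not real combine results.

```r
# Hypothetical stand-in for the course file: two made-up rows, a few nfl columns
demo_rows <- data.frame(
  Player = c("Player A", "Player B"),
  Pos    = c("WR", "DT"),
  School = c("School X", "School Y"),
  Dash40 = c(4.45, NA),
  Team   = c("Team Z", NA),
  stringsAsFactors = FALSE
)
demo_path <- tempfile(fileext = ".csv")
write.csv(demo_rows, demo_path, row.names = FALSE)

# stringsAsFactors = FALSE keeps text columns as character instead of factor
nfl_demo <- read.csv(demo_path, stringsAsFactors = FALSE)

# Convert to a tibble for compact printing (tibble is installed with the tidyverse)
if (requireNamespace("tibble", quietly = TRUE)) {
  nfl_demo <- tibble::as_tibble(nfl_demo)
}

# Inspect column types and the NA entries
str(nfl_demo)
```

In R 4.0.0 and later, stringsAsFactors = FALSE is already the default for read.csv(), so the argument mainly matters on older installations.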

Player: player’s name
Pos: player’s position
School: college of player
Ht: height in inches
Wt: weight in pounds
Dash40: forty yard dash time in seconds
Vertical: vertical jump in inches
Bench: number of bench press repetitions at 225 lbs
Broad.Jump: broad jump distance in inches
Cone3: 3 cone drill time in seconds
Shuttle: twenty yard shuttle time in seconds
Team: team that drafted the player
Round: round the player was drafted (0 means no round)
Pick: draft selection (0 means not drafted)

2 Investigation

As you can see above, nfl contains a lot of missing values - represented by NA. Not all players participate in the NFL combine, and some who do participate do not perform each skills test. Also, undrafted players have NA values for their Team. Before we create visualizations let’s try to better understand some of the data.
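One way to see how much is missing is to count the NA values in each column. The sketch below uses a small made-up data frame (`nfl_demo`, hypothetical) in place of nfl; the same two calls should work on the real data.

```r
# Made-up stand-in for nfl with the same kind of gaps (NA) described above
nfl_demo <- data.frame(
  Player = c("Player A", "Player B", "Player C"),
  Dash40 = c(4.45, NA, 4.60),
  Bench  = c(NA, 25, NA),
  Team   = c("Team Z", NA, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_per_column <- colSums(is.na(nfl_demo))
na_per_column

# Proportion of rows with a recorded value in each column
round(colMeans(!is.na(nfl_demo)), 2)
```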

2.1 Exercises

How many players went undrafted in the data set nfl?

with(nfl, sum(is.na(Team)))

[1] 184

Which player had the fastest 40 yard dash time? What was his time?

min(nfl$Dash40, na.rm = TRUE)

[1] 4.32

nfl$Player[which.min(nfl$Dash40)]

[1] "Donte Jackson"

How many players are in the data set for each position? Hint: table()

table(nfl$Pos)


   C   CB   DB   DE   DT EDGE   FB  ILB    K   LB   LS   OG   OL  OLB   OT    P 
   9   41    2   16   24   23    1   15    4    3    1   14    2   13   23    7 
  QB   RB    S   TE   WR 
  19   31   27   17   44

What was the mean number of bench press repetitions for all players listed at the position of DT?

# keep only the DT rows, then average Bench while ignoring missing values
nfl %>%
  filter(Pos == "DT") %>%
  summarise(mean_bench = mean(Bench, na.rm = TRUE))

Which team drafted the most players in 2018?

players.per.team <- table(nfl$Team)
# sort the counts from largest to smallest and keep the first entry
sort(players.per.team, decreasing = TRUE)[1]

Denver Broncos 
             9 

3 Visualizations with ggplot

3.1 Exercises

Recreate plots 1 - 5. Add comments to the code that generated plots 6 and 7 to explain what is being done in those plots. Use the available hints before looking at the solution. Comment on any interesting trends/relationships you observe.

Plot 1

Plot

Hints

geom_point()
geom_smooth()

Plot 2

Plot

Hints

geom_boxplot()
colors used: purple, black
caption argument in labs()

Plot 3

Plot

Hints

subset nfl to create a data frame with only WR who were drafted
colors used: purple, black
stat = "identity" in geom_bar()
flip coordinates
scale_y_continuous(breaks = seq(0, 1, .1), labels = seq(4, 5, .1))
to sort the bars use reorder(Player, -Dash40)
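The hints for Plot 3 can be combined as in the sketch below. It is only one possible reading of the hints, not the official solution. The data frame `nfl_demo` and its rows are made up, "drafted" is taken to mean a non-missing Team, and the assignment of purple to the fill and black to the outline is an assumption. The axis trick is that the bars are drawn for Dash40 - 4, and the labels then add the 4 seconds back.

```r
library(ggplot2)

# Made-up stand-in for nfl: three drafted WRs, one undrafted WR, one non-WR
nfl_demo <- data.frame(
  Player = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  Pos    = c("WR", "WR", "WR", "WR", "CB"),
  Dash40 = c(4.41, 4.55, 4.48, 4.60, 4.35),
  Team   = c("Team X", "Team Y", "Team Z", NA, "Team X"),
  stringsAsFactors = FALSE
)

# Keep drafted wide receivers with a recorded 40-yard dash time
wr_drafted <- subset(nfl_demo, Pos == "WR" & !is.na(Team) & !is.na(Dash40))

# Bars show time above 4 seconds; the axis labels add the 4 seconds back
p <- ggplot(wr_drafted, aes(x = reorder(Player, -Dash40), y = Dash40 - 4)) +
  geom_bar(stat = "identity", fill = "purple", color = "black") +
  scale_y_continuous(breaks = seq(0, 1, .1), labels = seq(4, 5, .1)) +
  coord_flip() +
  labs(x = "Player", y = "40-yard dash time (seconds)")
p
```

Starting the bars at 4 seconds rather than 0 spreads out the small differences between players, at the cost of exaggerating them visually.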

Plot 4

Plot

Hints

geom_histogram()
colors used: darkgreen, black
binwidth = 5

Plot 5

Plot

Hints

subset nfl to only include the positions of CB, S, RB, WR, and TE for players that were drafted
color used: black
shape = 21
scale_size(range=c(10, 3))
theme_bw()

Plot 6

Plot

Plot 7

Plot

4 Data know hows

In this section, we will get some practice reading in data sets and working with the apply function.

4.1 Read data

Data may be

1. available in base R or through an R package, such as the diamonds data set that is available through tidyverse;
2. read in to R from a file on your computer;
3. read in to R directly from a website;
4. scraped from a website.

Today we will get practice with examples that involve 2 and 3. Some of the most popular R functions to accomplish this are

read.table()
read.csv()
load()
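The three functions can be compared with a short round trip. The sketch below writes a made-up data frame (`demo`, hypothetical) to temporary files and reads each one back, so it does not depend on any course file.

```r
# Made-up data written to temporary files so the sketch runs anywhere
demo <- data.frame(id = 1:3, label = c("a", "b", NA), stringsAsFactors = FALSE)

csv_path   <- tempfile(fileext = ".csv")
txt_path   <- tempfile(fileext = ".txt")
rdata_path <- tempfile(fileext = ".RData")

write.csv(demo, csv_path, row.names = FALSE)
write.table(demo, txt_path, sep = "\t", row.names = FALSE, quote = FALSE)
save(demo, file = rdata_path)

# read.csv(): comma-separated values, header = TRUE by default
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

# read.table(): general delimited text; header and separator are given explicitly
from_txt <- read.table(txt_path, header = TRUE, sep = "\t",
                       stringsAsFactors = FALSE)

# load(): restores saved objects under their original names and returns those names
rm(demo)
restored_names <- load(rdata_path)
restored_names
```

Note that load() does not return the data itself. It recreates the saved objects in the workspace, so there is nothing to assign unless you want the vector of object names.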

4.1.1 Exercises

Read in the following data sets and save each one as a well-named R object. A preview of each data set is given below for you to check your answer. Make sure the variable types match the preview, headers are available when applicable, and NA values appear where appropriate.

nevada_casino_sqft.csv (available on D2L)

nevada_casino_sqft.csv (available at: http://www.stat.ufl.edu/~winner/data/nevada_casino_sqft.csv)

qqq.tsv (available on D2L)

spy.txt (available on D2L)
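When a file uses its own code for missing values, the na.strings argument controls what is read in as NA. The layout of the files above is not shown here, so the sketch below uses a made-up tab-separated file; the column names, the values, and the "." missing-value code are all assumptions for illustration.

```r
# Made-up tab-separated file in which missing values are coded as "." (an assumption)
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("date\tclose\tvolume",
             "2018-01-02\t10.5\t100",
             "2018-01-03\t.\t250"),
           tsv_path)

# na.strings lists the codes to treat as NA; header = TRUE keeps the column names
prices_demo <- read.table(tsv_path, header = TRUE, sep = "\t",
                          na.strings = c("NA", "."),
                          stringsAsFactors = FALSE)

# Check column types and missing values against the preview
str(prices_demo)
```

Without na.strings, the "." entry would force the whole close column to be read as character, which is a common reason variable types do not match a preview.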

4.2 Apply

4.2.1 Matrix examples

The examples below can also be applied to data frames.

# create a 3 x 3 matrix
my.matrix <- matrix(data = c(3, 5, 10, -1, 0, 2, 18, 5, -3),
                    nrow = 3, ncol = 3)

# matrix is filled column-wise, view matrix
my.matrix

# by columns
apply(X = my.matrix, MARGIN = 2, FUN = mean)
apply(X = my.matrix, MARGIN = 2, FUN = max)

# by rows
apply(X = my.matrix, MARGIN = 1, FUN = sd)
apply(X = my.matrix, MARGIN = 1, FUN = which.max)
apply(X = my.matrix, MARGIN = 1, FUN = sort)

Create a 4 x 3 matrix and have the data filled in by rows.

# byrow = TRUE fills the matrix row by row; 12 values are supplied so nothing is recycled
amatrix <- matrix(data = c(3, 5, 10, -1, 0, 2, 18, 5, -3, 7, 4, 1),
                  nrow = 4, ncol = 3, byrow = TRUE)
amatrix

     [,1] [,2] [,3]
[1,]    3    5   10
[2,]   -1    0    2
[3,]   18    5   -3
[4,]    7    4    1

apply(X = amatrix, MARGIN = 1, FUN = which.max)

[1] 3 3 1 1

Explain what the which.max function is doing.

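The which.max function returns the position (index) of the first maximum value in a vector. With MARGIN = 1, apply() calls it on each row, so the result gives the column that holds each row's largest value.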

4.2.2 Array examples

# create a 2 x 2 x 3 array that contains the numbers 1 - 12
my.array <- array(data = c(1:12), dim = c(2,2,3))

# view the array
my.array

# apply sum over 1 dimension
apply(my.array, 1, sum)
apply(my.array, 2, sum)
apply(my.array, 3, sum)

# apply sum over multiple dimensions
apply(my.array, c(1,2), sum)
apply(my.array, c(1,3), sum)
apply(my.array, c(2,3), sum)
apply(my.array, c(3,1), sum)
apply(my.array, c(3,2), sum)
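To see what MARGIN does on an array, it can help to compare apply() with the same sums computed by hand. The sketch below rebuilds my.array so it runs on its own; `slice_sums` and `cell_sums` are names introduced here for illustration.

```r
my.array <- array(data = 1:12, dim = c(2, 2, 3))

# MARGIN = 3 keeps the third dimension: one sum per 2 x 2 slice
slice_sums <- apply(my.array, 3, sum)
slice_sums
# Same values computed slice by slice
c(sum(my.array[, , 1]), sum(my.array[, , 2]), sum(my.array[, , 3]))

# MARGIN = c(1, 2) keeps rows and columns and sums across the three slices
cell_sums <- apply(my.array, c(1, 2), sum)
cell_sums
# Same values from adding the slices element-wise
my.array[, , 1] + my.array[, , 2] + my.array[, , 3]

# Swapping the order of the margins transposes the result
apply(my.array, c(3, 1), sum)
t(apply(my.array, c(1, 3), sum))
```

In each case the dimensions named in MARGIN are the ones that survive in the result; the function is applied over everything else.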

4.3 Further details

A summary of the a,l,s,t apply functions

Command	Description
`apply(X, MARGIN, FUN, ...)`	Obtain a vector/array/list by applying `FUN` along the specified `MARGIN` of an array or matrix `X`
`lapply(X, FUN, ...)`	Obtain a list by applying `FUN` to the elements of a list `X`
`sapply(X, FUN, ...)`	Simplified version of `lapply`. Returns a vector/array instead of list.
`tapply(X, INDEX, FUN, ...)`	Obtain a table by applying `FUN` to each combination of the factors given in `INDEX`

The above functions are good alternatives to loops
They are usually more concise than loops and can run faster on large data sets, although the gain over a well-written loop is often small
They take practice to get used to, but make analysis easier to debug and less prone to error when used effectively
Look at the examples in each function's help page, e.g. ?apply
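The remaining three functions in the table can be tried on small inputs. The sketch below uses made-up values; `scores`, `bench`, and `pos` are hypothetical objects, not part of nfl.

```r
# Made-up list of numeric vectors
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = c(5, NA, 7))

# lapply() returns a list with one element per input element
means_list <- lapply(scores, mean, na.rm = TRUE)
means_list

# sapply() simplifies the same result to a named numeric vector
means_vec <- sapply(scores, mean, na.rm = TRUE)
means_vec

# tapply() applies a function to a vector split by a grouping variable
bench <- c(20, 25, 30, 15)          # made-up values
pos   <- c("DT", "DT", "OT", "WR")  # made-up groups
group_means <- tapply(bench, pos, mean)
group_means
```

Extra arguments such as na.rm = TRUE are passed through to FUN, which is what the `...` in each signature is for.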

Data visualizations and know-hows

M3 ICA3
