For this section we will use the nfl data. We will use the tidyverse and ggiraph packages. You will need to first install ggiraph with install.packages("ggiraph").
library(tidyverse)
library(ggplot2)
library(ggiraph)
Do any interesting relationships exist between NFL combine metrics and draft position?
The better a player performs on the combine metrics, the more likely that player is to be drafted. The combine consists of exercises that measure a wide range of player abilities, so excelling in any one exercise can help a player stand out and be picked.
Below is data on every player who participated in the 2018 NFL draft combine or was drafted by an NFL team in 2018. Let’s read in the data and convert it to a tibble to make it easier to view. The argument stringsAsFactors = FALSE ensures that variables such as Player, Pos, School, etc., are read in as type character.
As you can see above, nfl contains many missing values, represented by NA. Not all players participate in the NFL combine, and some who do participate do not perform every skills test. Also, undrafted players have NA values for their Team. Before we create visualizations, let’s try to better understand some of the data.
How many undrafted players are in nfl?
with(nfl, sum(is.na(Team)))
[1] 184
min(nfl$Dash40, na.rm = TRUE)
[1] 4.32
nfl$Player[which.min(nfl$Dash40)]
[1] "Donte Jackson"
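The role of na.rm and which.min() is easier to see on a toy vector. The sketch below uses made-up dash times rather than values from nfl; the names dash and player are hypothetical and exist only for illustration.

```r
# Toy 40-yard dash times; NA marks a player who skipped the test (made-up values)
dash <- c(4.50, NA, 4.38, 4.61)
player <- c("A", "B", "C", "D")

# Without na.rm = TRUE, a single NA makes the result NA
slowest.guess <- min(dash)            # NA

# With na.rm = TRUE, missing values are dropped before taking the minimum
fastest <- min(dash, na.rm = TRUE)    # 4.38

# which.min() discards NA values and returns the index of the first minimum
fastest.player <- player[which.min(dash)]   # "C"

# Count the missing values, as done for Team above
n.missing <- sum(is.na(dash))         # 1
```

Because which.min() returns a position, it can index any other vector of the same length, which is how the player name is looked up from the dash times.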
# number of players at each position
table(nfl$Pos)
   C   CB   DB   DE   DT EDGE   FB  ILB    K   LB   LS   OG   OL  OLB   OT    P
   9   41    2   16   24   23    1   15    4    3    1   14    2   13   23    7
  QB   RB    S   TE   WR
  19   31   27   17   44
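What is the mean Bench for players at DT?
# the TRUE row is DT; the FALSE row is every other position
nfl %>%
group_by(Pos == "DT") %>%
summarise(mean.bench = mean(Bench, na.rm = TRUE))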
players.per.team <- table(nfl$Team)
# sort the counts from largest to smallest and keep the team with the most players
sort(players.per.team, decreasing = TRUE)[1]
Denver Broncos
9
Recreate plots 1 - 5. Add comments to the code that generated plots 6 and 7 to explain what is being done in those plots. Use the available hints before looking at the solution. Comment on any interesting trends/relationships you observe.
geom_point()
geom_smooth()
geom_boxplot()
caption argument in labs()
Subset nfl to create a data frame with only WR who were drafted
stat = "identity" in geom_bar()
scale_y_continuous(breaks = seq(0, 1, .1), labels = seq(4, 5, .1))
reorder(Player, -Dash40)
geom_histogram()
binwidth = 5
Subset nfl to only include the positions of CB, S, RB, WR, and TE for players that were drafted
shape = 21
scale_size(range = c(10, 3))
theme_bw()

In this section, we will get some practice reading in data sets and working with the apply function.
Data may be
1. available in base R or through an R package, such as the diamonds data set that is available through tidyverse;
2. read into R from a file on your computer;
3. read into R directly from a website;
4. scraped from a website.
Today we will get practice with examples that involve 2 and 3. Some of the most popular R functions to accomplish this are
read.table()
read.csv()
load()
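As a minimal sketch of how read.csv() and read.table() are typically called, the example below writes two tiny made-up files to a temporary location and reads them back. The file contents, the column names, the object names casinos and prices, and the use of "." as a missing-value code are all assumptions for illustration; they are not the course data sets.

```r
# Write two small made-up files so the example is self-contained
csv.path <- tempfile(fileext = ".csv")
tsv.path <- tempfile(fileext = ".tsv")
writeLines(c("name,sqft", "Alpha,1200", "Beta,NA"), csv.path)
writeLines(c("date\tclose", "2018-01-02\t10.5", "2018-01-03\t."), tsv.path)

# Comma-separated file with a header row; the string "NA" is read as missing by default
casinos <- read.csv(csv.path, header = TRUE, stringsAsFactors = FALSE)

# Tab-separated file; na.strings tells R that "." (an assumed code) means missing
prices <- read.table(tsv.path, header = TRUE, sep = "\t",
                     na.strings = ".", stringsAsFactors = FALSE)

# Check the variable types and the NA values
str(casinos)
str(prices)
```

Both functions also accept a URL in place of a file path, which covers reading directly from a website. Checking the result with str() is a quick way to confirm that headers, variable types, and NA values came through as intended.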
Read in the following data sets and save each one as a well-named R object. A preview of each data set is given below for you to check your answer. Make sure all variable types match the preview, headers are available when applicable, and NA values appear where appropriate.
nevada_casino_sqft.csv (available on D2L)
nevada_casino_sqft.csv (available at: http://www.stat.ufl.edu/~winner/data/nevada_casino_sqft.csv)
qqq.tsv (available on D2L)
spy.txt (available on D2L)

The examples below can also be applied to data frames.
# create a 3 x 3 matrix
my.matrix <- matrix(data = c(3, 5, 10, -1, 0, 2, 18, 5, -3),
nrow = 3, ncol = 3)
# matrix is filled column-wise, view matrix
my.matrix
# by columns
apply(X = my.matrix, MARGIN = 2, FUN = mean)
apply(X = my.matrix, MARGIN = 2, FUN = max)
# by rows
apply(X = my.matrix, MARGIN = 1, FUN = sd)
apply(X = my.matrix, MARGIN = 1, FUN = which.max)
apply(X = my.matrix, MARGIN = 1, FUN = sort)
# only 9 values are supplied for 12 cells, so R recycles them (and issues a warning)
amatrix <- matrix(data = c(3, 5, 10, -1, 0, 2, 18, 5, -3),
nrow = 4, ncol = 3)
amatrix
     [,1] [,2] [,3]
[1,]    3    0   -3
[2,]    5    2    3
[3,]   10   18    5
[4,]   -1    5   10
apply(X = amatrix, MARGIN = 1, FUN = which.max)
[1] 1 1 2 3
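The which.max function returns the position of the (first) element with the maximum value in a vector, so with MARGIN = 1 the result gives, for each row, the column index of that row's largest value.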
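Since these examples also carry over to data frames, here is a minimal sketch using a small made-up data frame (the name scores and its values are hypothetical). It also illustrates a common surprise: when FUN returns a vector, as sort does, each result becomes a column of the output.

```r
# A small made-up data frame with numeric columns only
scores <- data.frame(test1 = c(80, 95, 70),
                     test2 = c(85, 90, 75),
                     test3 = c(90, 60, 88))

# apply() converts the data frame to a matrix first, then works by row or column
col.means <- apply(scores, 2, mean)        # one mean per column
best.test <- apply(scores, 1, which.max)   # column index of each row's maximum

# When FUN returns a vector, each result is stored as a COLUMN of the output,
# so column i of sorted.rows holds the sorted values of row i of scores
sorted.rows <- apply(scores, 1, sort)
sorted.rows
```

Wrapping the last call in t() gives the sorted values back in row orientation. Note that apply() is best suited to data frames whose columns all share one type, because the conversion to a matrix forces a single type.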
# create a 2 x 2 x 3 array that contains the numbers 1 - 12
my.array <- array(data = c(1:12), dim = c(2,2,3))
# view the array
my.array
# apply sum over 1 dimension
apply(my.array, 1, sum)
apply(my.array, 2, sum)
apply(my.array, 3, sum)
# apply sum over multiple dimensions
apply(my.array, c(1,2), sum)
apply(my.array, c(1,3), sum)
apply(my.array, c(2,3), sum)
apply(my.array, c(3,1), sum)
apply(my.array, c(3,2), sum)
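One way to check what a given MARGIN does is to compare the apply() result with explicit slicing. The sketch below rebuilds the same array and does this for two of the calls above; the object names other than my.array are hypothetical.

```r
# Same 2 x 2 x 3 array as above
my.array <- array(data = 1:12, dim = c(2, 2, 3))

# MARGIN = 3: one sum per 2 x 2 slice
slice.sums <- apply(my.array, 3, sum)      # 10 26 42
manual.slice.sums <- c(sum(my.array[, , 1]),
                       sum(my.array[, , 2]),
                       sum(my.array[, , 3]))

# MARGIN = c(1, 2): keep rows and columns, sum across the third dimension
rc.sums <- apply(my.array, c(1, 2), sum)
manual.rc.sums <- my.array[, , 1] + my.array[, , 2] + my.array[, , 3]

# Swapping the order of the margins transposes the result
swapped <- apply(my.array, c(3, 1), sum)
transposed <- t(apply(my.array, c(1, 3), sum))

all(slice.sums == manual.slice.sums)
all(rc.sums == manual.rc.sums)
all(swapped == transposed)
```

The dimensions named in MARGIN are the ones that survive in the output; FUN collapses everything else.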
A summary of the apply family of functions (apply, lapply, sapply, and tapply)
| Command | Description |
|---|---|
| apply(X, MARGIN, FUN, ...) | Obtain a vector/array/list by applying FUN along the specified MARGIN of an array or matrix X |
| lapply(X, FUN, ...) | Obtain a list by applying FUN to the elements of a list X |
| sapply(X, FUN, ...) | Simplified version of lapply. Returns a vector/array instead of a list when possible. |
| tapply(X, INDEX, FUN, ...) | Obtain a table by applying FUN to each combination of the factors given in INDEX |
The above functions are good alternatives to loops.
They are typically more concise than loops and can run faster on large data sets, although the size of the speed gain depends on the task.
They take practice to get used to, but make analysis easier to debug and less prone to error when used effectively.
Look at the examples on each function's help page (for instance, ?apply).
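To make the comparison with loops concrete, here is a minimal sketch on made-up data. The vectors reps and pos are hypothetical (they are not taken from nfl), and split() is a base R helper used here only to build a list for lapply() and sapply().

```r
# Made-up data: bench-press reps for six players at three positions
reps <- c(20, 25, 30, 18, 22, 27)
pos  <- c("DT", "DT", "OG", "WR", "WR", "OG")

# lapply(): apply a function to each element of a list, returning a list
by.pos <- split(reps, pos)            # list with one numeric vector per position
means.list <- lapply(by.pos, mean)

# sapply(): same computation, simplified to a named numeric vector
means.vec <- sapply(by.pos, mean)

# tapply(): group reps by pos and summarise in a single call
means.tab <- tapply(reps, pos, mean)

# Equivalent for loop, shown for comparison
means.loop <- numeric(length(by.pos))
names(means.loop) <- names(by.pos)
for (p in names(by.pos)) {
  means.loop[p] <- mean(by.pos[[p]])
}

means.vec
```

All four approaches compute the same group means; the apply-family versions simply avoid the bookkeeping of creating and filling the result vector by hand.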