Cleaning Scraped Data for Algorithm Processing

Tutorial 4

First, load the necessary libraries…

library(dplyr)
library(tidyr)

Now, taking the scraped and cleaned data produced in tutorial #3, I will begin the manipulation process. First I want to convert the existing data into long form, and I do so with the gather function…

Old…

receiving.df <- read.csv("receiving.df.csv", stringsAsFactors = FALSE)
head(receiving.df)

##             PLAYER POS TEAM REC TAR  YDS  AVG TD LONG Twenty.Plus YDS.G
## 1  Steve Smith Sr.  WR  CAR 103   0 1563 15.2 12   80          22  97.7
## 2     Santana Moss  WR  WSH  84   0 1483 17.7  9   78          24  92.7
## 3     Chad Johnson  WR  CIN  97   0 1432 14.8  9   70          16  89.5
## 4 Larry Fitzgerald  WR  ARI 103   0 1409 13.7 10   47          27  88.1
## 5    Anquan Boldin  WR  ARI 102   0 1402 13.7  7   54          21 100.1
## 6       Torry Holt  WR   LA 102   0 1331 13.0  9   44          15  95.1
##   FUM YAC First.Dns Year
## 1   1   0        70 2005
## 2   2   0        60 2005
## 3   1   0        74 2005
## 4   0   0        67 2005
## 5   2   0        68 2005
## 6   2   0        63 2005

New…

receiving.df1 <- receiving.df %>%
  gather("Pro.Record", "Value", 4:14) 
head(receiving.df1)

##             PLAYER POS TEAM Year Pro.Record Value
## 1  Steve Smith Sr.  WR  CAR 2005        REC   103
## 2     Santana Moss  WR  WSH 2005        REC    84
## 3     Chad Johnson  WR  CIN 2005        REC    97
## 4 Larry Fitzgerald  WR  ARI 2005        REC   103
## 5    Anquan Boldin  WR  ARI 2005        REC   102
## 6       Torry Holt  WR   LA 2005        REC   102
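
To see what gather is doing on a smaller scale, here is a minimal sketch on a made-up two-player data frame (the `toy` data and its values are invented for illustration and are not taken from the scraped file). It also shows pivot_longer, which tidyr 1.0.0 and later recommend in place of gather; I am assuming that version or newer is installed.

```r
library(tidyr)

# Hypothetical toy data: two players, two stat columns (values are made up)
toy <- data.frame(
  PLAYER = c("Player A", "Player B"),
  REC    = c(10, 20),
  FUM    = c(1, 0),
  Year   = c(2005, 2005),
  stringsAsFactors = FALSE
)

# gather stacks columns 2:3 (REC and FUM) into key/value pairs
toy.long <- gather(toy, "Pro.Record", "Value", 2:3)

# pivot_longer builds the same pairs, but orders rows by player rather than by stat
toy.long2 <- pivot_longer(toy, cols = c(REC, FUM),
                          names_to = "Pro.Record", values_to = "Value")

print(toy.long)
```

Either call leaves PLAYER and Year untouched and turns the two stat columns into four rows, one per player and stat.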

Now create a column that holds negative results - in this case fumbles - and title it “Con.Record”.

receiving.df1 <- receiving.df1 %>%
  mutate(Con.Record = "FUM")
head(receiving.df1)

##             PLAYER POS TEAM Year Pro.Record Value Con.Record
## 1  Steve Smith Sr.  WR  CAR 2005        REC   103        FUM
## 2     Santana Moss  WR  WSH 2005        REC    84        FUM
## 3     Chad Johnson  WR  CIN 2005        REC    97        FUM
## 4 Larry Fitzgerald  WR  ARI 2005        REC   103        FUM
## 5    Anquan Boldin  WR  ARI 2005        REC   102        FUM
## 6       Torry Holt  WR   LA 2005        REC   102        FUM

Now that the Con.Record column has been created, set it to NA for every row whose Pro.Record is not “FUM”…

receiving.df1$Con.Record[receiving.df1$Pro.Record != "FUM"] <- NA
head(receiving.df1)

##             PLAYER POS TEAM Year Pro.Record Value Con.Record
## 1  Steve Smith Sr.  WR  CAR 2005        REC   103       <NA>
## 2     Santana Moss  WR  WSH 2005        REC    84       <NA>
## 3     Chad Johnson  WR  CIN 2005        REC    97       <NA>
## 4 Larry Fitzgerald  WR  ARI 2005        REC   103       <NA>
## 5    Anquan Boldin  WR  ARI 2005        REC   102       <NA>
## 6       Torry Holt  WR   LA 2005        REC   102       <NA>
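
The last two steps (fill the column with “FUM”, then blank out the non-fumble rows) could also be done in one pass. This is only a sketch of an alternative on made-up data, not the approach used in this tutorial; `toy.long` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical long-form toy data (values are made up)
toy.long <- data.frame(
  PLAYER     = c("Player A", "Player A", "Player B", "Player B"),
  Pro.Record = c("REC", "FUM", "REC", "FUM"),
  Value      = c(10, 1, 20, 0),
  stringsAsFactors = FALSE
)

# Label fumble rows "FUM" and leave every other row as a missing value
toy.long <- toy.long %>%
  mutate(Con.Record = ifelse(Pro.Record == "FUM", "FUM", NA_character_))

print(toy.long)
```

Using NA_character_ keeps the new column a character column with true missing values, the same result the two-step version produces.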

Then sort so that the fumble rows come first (arrange always places NA values last)…

receiving.df1 <- receiving.df1 %>%
  arrange(desc(Con.Record))
head(receiving.df1)

##             PLAYER POS TEAM Year Pro.Record Value Con.Record
## 1  Steve Smith Sr.  WR  CAR 2005        FUM     1        FUM
## 2     Santana Moss  WR  WSH 2005        FUM     2        FUM
## 3     Chad Johnson  WR  CIN 2005        FUM     1        FUM
## 4 Larry Fitzgerald  WR  ARI 2005        FUM     0        FUM
## 5    Anquan Boldin  WR  ARI 2005        FUM     2        FUM
## 6       Torry Holt  WR   LA 2005        FUM     2        FUM

Make a second value column, titled “Value2”, that will hold negative values for the “FUM” variable…

receiving.df1 <- receiving.df1 %>%
  # Con.Record holds a true missing value (NA), not the string "NA", so test it with is.na()
  mutate(Value2 = ifelse(!is.na(Con.Record), -Value, Value)) %>%
  arrange(desc(Con.Record))
head(receiving.df1)

##             PLAYER POS TEAM Year Pro.Record Value Con.Record Value2
## 1  Steve Smith Sr.  WR  CAR 2005        FUM     1        FUM     -1
## 2     Santana Moss  WR  WSH 2005        FUM     2        FUM     -2
## 3     Chad Johnson  WR  CIN 2005        FUM     1        FUM     -1
## 4 Larry Fitzgerald  WR  ARI 2005        FUM     0        FUM      0
## 5    Anquan Boldin  WR  ARI 2005        FUM     2        FUM     -2
## 6       Torry Holt  WR   LA 2005        FUM     2        FUM     -2
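
One thing to watch when building Value2: Con.Record contains real missing values, and comparing a missing value with the string "NA" returns NA rather than TRUE or FALSE. The sketch below uses two made-up vectors (`con` and `value` are hypothetical stand-ins for Con.Record and Value) to show the difference between the string comparison and is.na().

```r
# Hypothetical vectors standing in for Con.Record and Value
con   <- c("FUM", NA, "FUM", NA)
value <- c(2, 103, 1, 84)

# Comparing against the string "NA": missing entries compare as NA, so ifelse returns NA
by.string <- ifelse(con != "NA", -value, value)

# Testing with is.na(): missing entries are handled explicitly and keep their value
by.is.na <- ifelse(!is.na(con), -value, value)

print(by.string)  # -2 NA -1 NA
print(by.is.na)   # -2 103 -1 84
```

The head() output above only shows fumble rows, so the two versions look identical there; the difference appears in the non-fumble rows further down the data frame.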

After all of this we are essentially where we want to be; however, I only want to keep the rows for the relevant positions under the column “POS” (e.g. wide receivers, running backs, etc.). To do this, use the code below.

# First, find out what POS are in this df with this code below...
Pro.Record.POS<- unique(receiving.df1[["POS"]])
print(Pro.Record.POS)

##  [1] " WR" " TE" " RB" " FB" " CB" " QB" " LS" " S"  " OL" " FS" " DE"
## [12] " DB" " OT" " DT" " LB"

# Then, filter for the desired positions under column "POS". (Note: an exact-match filter such as POS == "WR" finds nothing here, most likely because the scraped values carry a leading space, e.g. " WR", as the output above shows; grepl matches the position code anywhere in the string, so I used grepl with filter.)
receiving.df11 <- receiving.df1 %>%
  filter(grepl("WR|RB|TE|FB", POS))

Pro.Record.POS<- unique(receiving.df11[["POS"]])
print(Pro.Record.POS)

## [1] " WR" " TE" " RB" " FB"
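
The printed POS values each start with a space, which is a likely reason an exact-match filter comes back empty. As an alternative to grepl, the whitespace can be stripped first and the positions matched exactly. This is a minimal sketch on made-up data (`toy` and `toy.skill` are hypothetical names), using base R's trimws.

```r
library(dplyr)

# Hypothetical toy data with the same leading space seen in the scraped POS values
toy <- data.frame(
  PLAYER = c("Player A", "Player B", "Player C"),
  POS    = c(" WR", " QB", " TE"),
  stringsAsFactors = FALSE
)

# Strip surrounding whitespace, then match positions exactly
toy.skill <- toy %>%
  mutate(POS = trimws(POS)) %>%
  filter(POS %in% c("WR", "RB", "TE", "FB"))

print(unique(toy.skill$POS))
```

Exact matching with %in% avoids accidental partial matches, which would matter if another position code ever contained one of these strings.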

Done! Now this data can be used for a variety of purposes including a weighting algorithm / function that I am planning to publish in a later tutorial.

Mitchell Walker

June 30, 2016
