Web Scraping / Data Wrangling NFL Game Results

library(dplyr)
library(tidyr)
library(XML)

First, I need to create a dataframe for the abbreviated team names - note that what is usually “WAS” has to be “WSH” for the particular website that I am interested in scraping.

T.Code.DF1 <- data_frame(T.Code = c("1","2","3","4","5","6","7","8"
                                      ,"9","10","11","12","13","14","15","16"
                                      ,"17","18","19","20","21","22","23","24"
                                      ,"25","26","27","28","29","30","31","32"),
                         TEAM = c("BAL","CIN","CLE","PIT","HOU","IND","JAX","TEN","BUF","MIA","NE"
                                  ,"NYJ","DEN","KC","OAK","SD","CHI","DET","GB","MIN","ATL","CAR"
                                  ,"NO","TB","DAL","NYG","PHI","WSH","ARI","LA","SF","SEA"))

T.Code.DF1 <- T.Code.DF1 %>% mutate(Team.LowerC = tolower(TEAM))

So, with the above df I can create a for loop that will in turn create a df listing the URLs I’m intending to scrape.

  for(i in T.Code.DF1[3]){ 
    links1  <<- expand.grid(LinkRoot = "http://www.espn.com/nfl/team/schedule/_/name/",
                           Code = i,
                           KEEP.OUT.ATTRS = F, stringsAsFactors = F)  %>%
      mutate(URL = paste0(LinkRoot, Code)) 
  }
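Since paste0() is vectorised, the same table of URLs can probably be built without a loop. This is a minimal sketch in base R; the three team codes are just an illustrative subset, and the names team_codes, link_root and links are my own, not part of the code above.

```r
# Illustrative subset of the lower-case team codes
team_codes <- c("bal", "cin", "wsh")
link_root  <- "http://www.espn.com/nfl/team/schedule/_/name/"

# paste0() recycles link_root across every code, so no loop is needed
links <- data.frame(Code = team_codes,
                    URL  = paste0(link_root, team_codes),
                    stringsAsFactors = FALSE)

print(links)
```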

Then, with this df I can parse out the tables within every URL’s respective site.

  tables1 <- getNodeSet(htmlParse(links1$URL[1]), "//table") 

  dataBuild1 <<- readHTMLTable(tables1[[1]], header = F, trim = TRUE
                              , stringsAsFactors = FALSE) %>%
    mutate(Team = links1$Code[1])

At this point I have created a df named ‘dataBuild1’. This df is essentially the df that all of the other dfs will be bound to.

data <- head(dataBuild1, n = 10)
knitr::kable(data, caption = "Table with kable")

Table with kable
V1	V2	V3	V4	V5	V6	V7	Team
2016 Regular Season Schedule	NA	NA	NA	NA	NA	NA	bal
WK	DATE	OPPONENT	RESULT	HI PASSING	HI RUSHING	HI RECEIVING	bal
1	Sun, Sep 11	vsBuffalo	W13-7	Flacco 258	Forsett 41	Wallace 91	bal
2	Sun, Sep 18	@Cleveland	W25-20	Flacco 302	West 42	Pitta 102	bal
3	Sun, Sep 25	@Jacksonville	W19-17	Flacco 214	West 45	Smith Sr. 87	bal
4	Sun, Oct 2	vsOakland	L28-27	Flacco 298	West 113	Smith Sr. 111	bal
5	Sun, Oct 9	vsWashington	L16-10	Flacco 210	West 95	Wallace 63	bal
6	Sun, Oct 16	@New York	L27-23	Flacco 307	West 87	Wallace 97	bal
7	Sun, Oct 23	@New York	L24-16	Flacco 248	West 10	Wallace 120	bal
8	BYE WEEK	NA	NA	NA	NA	NA	bal

This df does indeed have a lot of information that I don’t want; however, I will address this after all of the data has been scraped. For now I have created another for-loop which will scrape the data from every remaining URL that is fed into it. (Note, at the bottom of this code you’ll find ‘#print(i)’ which - with the # removed - is a nice way for the scraper to count out every completed loop.)

  for(i in 2:nrow(links1)){
    tables1 <- getNodeSet(htmlParse(links1$URL[i]), "//table")
    
    nfldata1 <<- as.data.frame(readHTMLTable(tables1[[1]], header = FALSE, trim = TRUE, 
                                                stringsAsFactors = F)) %>%
                                                mutate(Team = links1$Code[i])
    
    dataBuild1 <<- rbind(dataBuild1, nfldata1) 
    
     #print(i) 
  }
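A scraping loop like this stops at the first page that fails to load, and it copies the growing df on every rbind. One possible alternative, sketched here under assumptions, is to collect each page in a list, skip failures with tryCatch, and bind once at the end. The helper scrape_all and the stand-in fake_fetch are hypothetical names; in real use the fetch function would wrap the htmlParse / readHTMLTable calls.

```r
# Hypothetical helper: apply fetch_one() to every code, skip failures, bind once
scrape_all <- function(codes, fetch_one) {
  pieces <- lapply(codes, function(code) {
    tryCatch({
      df <- fetch_one(code)
      df$Team <- code
      df
    }, error = function(e) {
      message("Skipping ", code, ": ", conditionMessage(e))
      NULL
    })
  })
  pieces <- Filter(Negate(is.null), pieces)  # drop the pages that failed
  do.call(rbind, pieces)
}

# Stand-in for a real page fetch so the sketch runs offline (made-up rows)
fake_fetch <- function(code) {
  if (code == "bad") stop("page not found")
  data.frame(V1 = c("WK", "1"), V2 = c("DATE", "Sun, Sep 11"),
             stringsAsFactors = FALSE)
}

demo <- scrape_all(c("bal", "bad", "cin"), fake_fetch)
print(demo)
```

The failed code is reported with a message and left out of the result, so one bad page does not cost the whole run.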

Now as you previously saw, the df is essentially a complete mess. As such, we are now at the point of cleaning / wrangling the data. First, I want to store what should be my column names in a df that I will name ‘dataBuild2.Row1’ and get rid of everything that cannot be used as a column name and isn’t data. Also, I will do this while preserving the original scraped data by giving this new df a separate unique name of ‘dataBuild2’.

dataBuild2.Row1 <- dataBuild1 %>% filter(V1 == "WK") %>% select(1:7) %>% distinct() %>% print()

##   V1   V2       V3        V4         V5         V6           V7
## 1 WK DATE OPPONENT    RESULT HI PASSING HI RUSHING HI RECEIVING
## 2 WK DATE OPPONENT TIME (ET)    TICKETS       <NA>         <NA>
## 3 WK DATE OPPONENT    RESULT  RESOURCES       <NA>         <NA>

dataBuild2 <- dataBuild1 %>% 
                filter(V1 != "0") %>% filter(V1 != "WK") %>% 
                filter(V5 != "Box Score") %>% filter(V5 != "RESOURCES") %>% filter(V5 != "TICKETS")

From here I enter the column names from my ‘dataBuild2.Row1’ df that I pasted together with a “,”, and follow this up with a rather quick way of removing every row that contains ‘NAs’ (i.e. ‘complete.cases’)…

  colnames(dataBuild2)[which(colnames(dataBuild2) %in% c(
    "V1", "V2", "V3", "V4", "V5", "V6", "V7") )] <- c("Week"
                     , "Date", "Opponent", "Result", "High Passing", "High Rushing", "High Receiving") 

  dataBuild2 <- dataBuild2[complete.cases(dataBuild2),]

At this point my df is really starting to come together; however, I want to make some additional columns that will give me what I’m after. For instance, I want to know which teams are playing “Home” and “Away”, while also getting rid of some pesky - yet helpful - symbols and abbreviations such as “@” and “vs”…

(Note, I kept the “NA” at the end of the ifelse command not because I want NAs in my df, but to see if any entries slipped through the cracks of my filter. Luckily, my grepl/filter got everything; however, if it didn’t I would at best have to make some corrections to the data or - at worst - to the actual code itself.)

  dataBuild2 <- dataBuild2 %>% data.frame(
    Away = gsub(pattern = "(@)(.*)", replacement = "\\2", x = dataBuild2[,3], perl=T),
    Home = gsub(pattern = "(vs)(.*)", replacement = "\\2", x = dataBuild2[,3], perl=T))  
  
  dataBuild2$Home <- as.character(dataBuild2$Home)  
  dataBuild2$Away <- as.character(dataBuild2$Away)  
  
  dataBuild2 <<- dataBuild2 %>% mutate(Where = ifelse(grepl("vs", Away), "Home", "Away"))
  
  dataBuild2 <<- dataBuild2 %>% mutate(Opponent.1 = ifelse(grepl("vs", Opponent), Home, 
                                                    ifelse(grepl("@", Opponent), Away, "NA")))
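The home/away logic can also be sketched with anchored prefixes, which avoids the intermediate “Home” and “Away” columns. This is only an illustration in base R: the sample vector is made up (“TBD” stands in for an entry with neither prefix), and startsWith() needs R 3.3.0 or later.

```r
# Sample opponent strings in the scraped format; "TBD" has neither prefix
opponent <- c("vsBuffalo", "@Cleveland", "@New York", "TBD")

# "vs" at the start means a home game, "@" at the start means an away game
where <- ifelse(startsWith(opponent, "vs"), "Home",
         ifelse(startsWith(opponent, "@"), "Away", NA_character_))

# Strip the prefix only where one was recognised; leave NA otherwise
opponent_clean <- ifelse(is.na(where), NA_character_,
                         sub("^(vs|@)", "", opponent))

print(data.frame(opponent, where, opponent_clean))
```

As with the “NA” in the ifelse above, any entry that matches neither prefix shows up as a missing value, so it is easy to spot.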

Now things are starting to get a little tricky. For instance, I want to make a column that identifies wins, losses, and ties; however, my data is a bit of a mess. Not to worry though because there is a myriad of ways to attack this issue. With that in mind I demonstrate how you could use ‘gsub’ and ‘grepl’. Honestly, I just knee-jerked and used gsub when I could have used grepl with an ifelse statement all along. Still, what could having more examples hurt? With that, be sure to take a look at how the df is coming together - we are really starting to get into the weeds…

dataBuild2 <- dataBuild2 %>% data.frame(
    Win = gsub(pattern = "(W)(.*)", replacement = "\\2", x = dataBuild2[,4], perl=T),
    Lose = gsub(pattern = "(L)(.*)", replacement = "\\2", x = dataBuild2[,4], perl=T),
    Tie = gsub(pattern = "(T)(.*)( OT)", replacement = "\\2", x = dataBuild2[,4], perl=T)) 

  dataBuild2$Win <- as.character(dataBuild2$Win)  
  dataBuild2$Lose <- as.character(dataBuild2$Lose) 
  dataBuild2$Tie <- as.character(dataBuild2$Tie) 
    
  dataBuild2 <<- dataBuild2 %>% mutate(Result.1 = ifelse(grepl("W", Result), Win, 
                                                  ifelse(grepl("L", Result), Lose, 
                                                  ifelse(grepl("OT", Result), Tie,"NA"))),
                                       Win = ifelse(grepl("W", Result), 1, 
                                             ifelse(grepl("L", Result), -1,
                                             ifelse(grepl("T", Result), 0, "NA")))) 
  
  data <- head(dataBuild2, n = 7)
knitr::kable(data, caption = "The 'dataBuild2' data frame")

The ‘dataBuild2’ data frame
Week	Date	Opponent	Result	High.Passing	High.Rushing	High.Receiving	Team	Away	Home	Where	Opponent.1	Win	Lose	Tie	Result.1
1	Sun, Sep 11	vsBuffalo	W13-7	Flacco 258	Forsett 41	Wallace 91	bal	vsBuffalo	Buffalo	Home	Buffalo	1	W13-7	W13-7	13-7
2	Sun, Sep 18	@Cleveland	W25-20	Flacco 302	West 42	Pitta 102	bal	Cleveland	@Cleveland	Away	Cleveland	1	W25-20	W25-20	25-20
3	Sun, Sep 25	@Jacksonville	W19-17	Flacco 214	West 45	Smith Sr. 87	bal	Jacksonville	@Jacksonville	Away	Jacksonville	1	W19-17	W19-17	19-17
4	Sun, Oct 2	vsOakland	L28-27	Flacco 298	West 113	Smith Sr. 111	bal	vsOakland	Oakland	Home	Oakland	-1	28-27	L28-27	28-27
5	Sun, Oct 9	vsWashington	L16-10	Flacco 210	West 95	Wallace 63	bal	vsWashington	Washington	Home	Washington	-1	16-10	L16-10	16-10
6	Sun, Oct 16	@New York	L27-23	Flacco 307	West 87	Wallace 97	bal	New York	@New York	Away	New York	-1	27-23	L27-23	27-23
7	Sun, Oct 23	@New York	L24-16	Flacco 248	West 10	Wallace 120	bal	New York	@New York	Away	New York	-1	24-16	L24-16	24-16
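For comparison, the result strings can probably be taken apart in one pass with a single regular expression instead of separate gsub calls. This is a sketch in base R, assuming every result looks like a letter, two scores joined by “-”, and an optional “ OT”; the two overtime strings in the sample are made up to exercise that case.

```r
# Sample results; the two " OT" entries are invented for illustration
result <- c("W13-7", "L28-27", "T6-6 OT", "W19-16 OT")

# Capture the outcome letter and both scores
pattern <- "^([WLT])([0-9]+)-([0-9]+)"
m <- regmatches(result, regexec(pattern, result))
outcome <- sapply(m, `[`, 2)

parsed <- data.frame(
  Result    = result,
  Win       = unname(c(W = 1, L = -1, T = 0)[outcome]),  # same +1 / -1 / 0 coding
  Score.1   = as.integer(sapply(m, `[`, 3)),
  Score.2   = as.integer(sapply(m, `[`, 4)),
  Over.Time = as.integer(grepl(" OT$", result)),
  stringsAsFactors = FALSE
)

print(parsed)
```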

Okay, so there are all kinds of confusing things going on. For starters, three columns caught my eye: “Home”, “Away”, and “Where”. In the “Home” and “Away” columns, an entry that still carries its “@” or “vs” prefix means the game was actually played in the other column’s setting. For example, “vsBuffalo” under “Away” means the game was played at home, which is exactly what “Where” records. This is the intended result of the mutate/ifelse/grepl code I wrote earlier, and the “Home” and “Away” columns will eventually be removed. The entries that I am interested in now are the scores as well as those describing the high passing, receiving, and rushing yards of the games. Each of these entries holds two bits of data, and I want to make a column for each bit - so to speak. I do this with the following code. It is going to get a little redundant; however, it’s all worth it in the end…

#1 This separates the scores into two separate columns...
spl <<-strsplit(as.character(dataBuild2$Result.1), "-")
  
  dataBuild2 <- data.frame(Score.1 = sapply(spl, "[", 1), Score.2 = sapply(spl, "[", 2)
                           , dataBuild2, stringsAsFactors = F) 
  
#####2 In column "Score.2" there is another bit of info, "OT" (or overtime), that also has to be moved into a separate column (e.g. Nothing.1)...  
spl <<-strsplit(as.character(dataBuild2$Score.2), " ")
  
  dataBuild2 <- data.frame(Score.2.0 = sapply(spl, "[", 1), Nothing.1 = sapply(spl, "[", 2)
                           , dataBuild2, stringsAsFactors = F) 

##### Turn all NAs into 0  
  dataBuild2[is.na(dataBuild2)] <- 0

#3 Here I make columns that show the score made by the team this row of data represents, and the score made by its opponent...    
dataBuild2 <- dataBuild2 %>%  mutate(Team.Score = ifelse(Win == 1, Score.1, Score.2.0),
                                      Opponent.Score = ifelse(Win == -1, Score.1, Score.2.0))

#4  This is a binary column titled "Over.Time" that shows whether a game went into overtime...                                    
dataBuild2 <- dataBuild2 %>%  mutate(Over.Time = ifelse(grepl("OT", Result),1,0))

#5 Now I have to make separate columns for the QBs, RBs, and WRs with the game's highest yardage...
  # Note: this is a little rough because some of the player names have things like "III" or "jr.", etc. at the end, so every time there is a space in the name a separate column has to be made. The columns housing just the portions of the name must eventually be pasted together into one column; then all that's left is said name and its respective yardage... 
spl <<-strsplit(as.character(dataBuild2$High.Passing), " ") 
  
  dataBuild2 <- data.frame(QB = sapply(spl, "[", 1), Nothing.2 = sapply(spl, "[", 2)
                           , Passing.Yards = sapply(spl, "[", 3)
                           , Passing.Yards.1 = sapply(spl, "[", 4)
                           , dataBuild2, stringsAsFactors = F) 
  
dataBuild2$QB <- paste(dataBuild2$QB, dataBuild2$Nothing.2) 

dataBuild2[is.na(dataBuild2)] <- 0  
  
dataBuild2 <- dataBuild2 %>% mutate(Passing.Yards = ifelse(Passing.Yards.1 == "0"
                                                           , Passing.Yards, Passing.Yards.1))  

    
spl <<-strsplit(as.character(dataBuild2$High.Rushing), " ")
   
  dataBuild2 <- data.frame(RB = sapply(spl, "[", 1), Nothing.3 = sapply(spl, "[", 2)
                           , Rushing.Yards = sapply(spl, "[", 3)
                           , Rushing.Yards.1 = sapply(spl, "[", 4)
                           , Rushing.Yards.2 = sapply(spl, "[", 5)
                           , dataBuild2, stringsAsFactors = F) 
  
dataBuild2$RB <- paste(dataBuild2$RB, dataBuild2$Nothing.3)
  
dataBuild2[is.na(dataBuild2)] <- 0 
  
dataBuild2 <- dataBuild2 %>% mutate(Rushing.Yards = ifelse(Rushing.Yards.1 == "0"
                                                             , Rushing.Yards, Rushing.Yards.1))  

    
spl <<-strsplit(as.character(dataBuild2$High.Receiving), " ")
  
  dataBuild2 <- data.frame(WR = sapply(spl, "[", 1), Nothing.4 = sapply(spl, "[", 2),
                           Receiving.Yards = sapply(spl, "[", 3)
                           , Receiving.Yards.1 = sapply(spl, "[", 4)
                           , Receiving.Yards.2 = sapply(spl, "[", 5)
                           , dataBuild2, stringsAsFactors = F)
  
dataBuild2$WR <- paste(dataBuild2$WR, dataBuild2$Nothing.4)
  
dataBuild2[is.na(dataBuild2)] <- 0 
  
dataBuild2 <- dataBuild2 %>% mutate(Receiving.Yards = ifelse(Receiving.Yards.1 == "0"
                                                             , Receiving.Yards, Receiving.Yards.1))

# Turn all the "Team" entries into uppercase...
dataBuild2$Team <- toupper(dataBuild2$Team)
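Because the yardage is always the last token, one alternative to splitting on every space is to peel the trailing number off with a regular expression and keep whatever precedes it as the name. This is a sketch under that assumption; “Griffin III 190” is a made-up entry added to show a suffix case.

```r
# Sample entries; the third one is invented to show a name with a suffix
high <- c("Flacco 258", "Smith Sr. 87", "Griffin III 190")

# Name: everything before the trailing whitespace + number
player <- sub("[[:space:]]+-?[0-9]+$", "", high)

# Yards: the trailing number itself (a leading minus sign is allowed)
yards <- as.integer(sub("^.*[[:space:]](-?[0-9]+)$", "\\1", high))

print(data.frame(player, yards))
```

This handles names with any number of parts, so no extra “.1” / “.2” columns are needed for suffixes.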

Now I have created a lot of columns. I want to see them and select only those I want to keep. Still, keep in mind that I might want some of the columns at another time, so I saved this new df under a new unique name.

# First I look...
colnames(dataBuild2)

##  [1] "WR"                "Nothing.4"         "Receiving.Yards"  
##  [4] "Receiving.Yards.1" "Receiving.Yards.2" "RB"               
##  [7] "Nothing.3"         "Rushing.Yards"     "Rushing.Yards.1"  
## [10] "Rushing.Yards.2"   "QB"                "Nothing.2"        
## [13] "Passing.Yards"     "Passing.Yards.1"   "Score.2.0"        
## [16] "Nothing.1"         "Score.1"           "Score.2"          
## [19] "Week"              "Date"              "Opponent"         
## [22] "Result"            "High.Passing"      "High.Rushing"     
## [25] "High.Receiving"    "Team"              "Away"             
## [28] "Home"              "Where"             "Opponent.1"       
## [31] "Win"               "Lose"              "Tie"              
## [34] "Result.1"          "Team.Score"        "Opponent.Score"   
## [37] "Over.Time"

# Then I select (or deselect in this case)
dataBuild4<- dataBuild2 %>% select(-c(2,4,5,7,9,10,12,14:18,23:25))  

# And then take another look...
colnames(dataBuild4)

##  [1] "WR"              "Receiving.Yards" "RB"             
##  [4] "Rushing.Yards"   "QB"              "Passing.Yards"  
##  [7] "Week"            "Date"            "Opponent"       
## [10] "Result"          "Team"            "Away"           
## [13] "Home"            "Where"           "Opponent.1"     
## [16] "Win"             "Lose"            "Tie"            
## [19] "Result.1"        "Team.Score"      "Opponent.Score" 
## [22] "Over.Time"

So things are looking pretty good right now! However, there is something else that I’d like to do. Essentially, each team has won and/or lost (and possibly tied) games and I want to tally these game results up. To do this I take the work already completed (i.e. the portion of wrangling above that made wins equal to +1, losses equal to -1, and ties equal to 0), and subject it to a for-loop that adds a week’s game outcome to the game(s) before it. There might be an easier way to do this, but I’m happy with my method for now.

Temp <- dataBuild4 %>% mutate(C1 = Win)
Temp$C1 <- as.numeric(Temp$C1)
Temp$Win <- as.numeric(Temp$Win)
Temp <- Temp %>%ungroup()  
Temp <- Temp %>% group_by(Team) %>% mutate(Game.Num = seq_len(n())) # This counts each team's game number regardless of the actual week in which the game took place. It is a very useful bit of code to know, so try messing around with it. 

# This chunk of code is used to make the first/ foundational data frame for the following dfs to be joined to.
Temp.3 <- Temp %>% filter(Team == T.Code.DF1$TEAM[1]) %>% 
  mutate(C2 = C1) 
Temp.3 <- Temp.3 %>%  mutate(C2 = ifelse(Game.Num == 2, Temp.3[2,23] + Temp.3[1,25], C1)) 
Temp.3$C2 <- as.numeric(Temp.3$C2)
Temp.3 <- Temp.3 %>% mutate(C2 = ifelse(Game.Num == 3, Temp.3[3,23] + Temp.3[2,25], C2))
Temp.3$C2 <- as.numeric(Temp.3$C2)
Temp.3 <- Temp.3 %>% mutate(C2 = ifelse(Game.Num == 4, Temp.3[4,23] + Temp.3[3,25], C2))
Temp.3$C2 <- as.numeric(Temp.3$C2)
Temp.3 <- Temp.3 %>% mutate(C2 = ifelse(Game.Num == 5, Temp.3[5,23] + Temp.3[4,25], C2))
Temp.3$C2 <- as.numeric(Temp.3$C2)
Temp.3 <- Temp.3 %>% mutate(C2 = ifelse(Game.Num == 6, Temp.3[6,23] + Temp.3[5,25], C2))
Temp.3$C2 <- as.numeric(Temp.3$C2)
Temp.3 <- Temp.3 %>% mutate(C2 = ifelse(Game.Num == 7, Temp.3[7,23] + Temp.3[6,25], C2))
Temp.3$C2 <- as.numeric(Temp.3$C2)
Temp.3 <- Temp.3 %>% mutate(C2 = ifelse(Game.Num == 8, Temp.3[8,23] + Temp.3[7,25], C2))
Temp.3$C2 <- as.numeric(Temp.3$C2)
Temp.3 <- Temp.3 %>% mutate(C2 = ifelse(Game.Num == 9, Temp.3[9,23] + Temp.3[8,25], C2))
Temp.3$C2 <- as.numeric(Temp.3$C2)
Temp.3 <- Temp.3 %>% mutate(C2 = ifelse(Game.Num == 10, Temp.3[10,23] + Temp.3[9,25], C2))
Temp.3$C2 <- as.numeric(Temp.3$C2)

Temp.3 <- Temp.3 %>% ungroup()

# Now, this code is used to manipulate the remaining data and join it to the df just created...
for(i in 2:nrow(T.Code.DF1)){
Temp.2 <- Temp %>% filter(Team == T.Code.DF1$TEAM[i]) %>% 
  mutate(C2 = C1) 
Temp.2 <- Temp.2 %>%  mutate(C2 = ifelse(Game.Num == 2, Temp.2[2,23] + Temp.2[1,25], C1)) 
Temp.2$C2 <- as.numeric(Temp.2$C2)
Temp.2 <- Temp.2 %>% mutate(C2 = ifelse(Game.Num == 3, Temp.2[3,23] + Temp.2[2,25], C2))
Temp.2$C2 <- as.numeric(Temp.2$C2)
Temp.2 <- Temp.2 %>% mutate(C2 = ifelse(Game.Num == 4, Temp.2[4,23] + Temp.2[3,25], C2))
Temp.2$C2 <- as.numeric(Temp.2$C2)
Temp.2 <- Temp.2 %>% mutate(C2 = ifelse(Game.Num == 5, Temp.2[5,23] + Temp.2[4,25], C2))
Temp.2$C2 <- as.numeric(Temp.2$C2)
Temp.2 <- Temp.2 %>% mutate(C2 = ifelse(Game.Num == 6, Temp.2[6,23] + Temp.2[5,25], C2))
Temp.2$C2 <- as.numeric(Temp.2$C2)
Temp.2 <- Temp.2 %>% mutate(C2 = ifelse(Game.Num == 7, Temp.2[7,23] + Temp.2[6,25], C2))
Temp.2$C2 <- as.numeric(Temp.2$C2)
Temp.2 <- Temp.2 %>% mutate(C2 = ifelse(Game.Num == 8, Temp.2[8,23] + Temp.2[7,25], C2))
Temp.2$C2 <- as.numeric(Temp.2$C2)
Temp.2 <- Temp.2 %>% mutate(C2 = ifelse(Game.Num == 9, Temp.2[9,23] + Temp.2[8,25], C2))
Temp.2$C2 <- as.numeric(Temp.2$C2)
Temp.2 <- Temp.2 %>% mutate(C2 = ifelse(Game.Num == 10, Temp.2[10,23] + Temp.2[9,25], C2))
Temp.2$C2 <- as.numeric(Temp.2$C2)
Temp.2 <- Temp.2 %>% mutate(C2 = ifelse(Game.Num == 11, Temp.2[11,23] + Temp.2[10,25], C2))
Temp.2$C2 <- as.numeric(Temp.2$C2)
Temp.2 <- Temp.2 %>% mutate(C2 = ifelse(Game.Num == 12, Temp.2[12,23] + Temp.2[11,25], C2))
Temp.2$C2 <- as.numeric(Temp.2$C2)
Temp.2 <- Temp.2 %>% mutate(C2 = ifelse(Game.Num == 13, Temp.2[13,23] + Temp.2[12,25], C2))
Temp.2$C2 <- as.numeric(Temp.2$C2)
Temp.2 <- Temp.2 %>% mutate(C2 = ifelse(Game.Num == 14, Temp.2[14,23] + Temp.2[13,25], C2))
Temp.2$C2 <- as.numeric(Temp.2$C2)
Temp.2 <- Temp.2 %>% mutate(C2 = ifelse(Game.Num == 15, Temp.2[15,23] + Temp.2[14,25], C2))
Temp.2$C2 <- as.numeric(Temp.2$C2)
Temp.2 <- Temp.2 %>% mutate(C2 = ifelse(Game.Num == 16, Temp.2[16,23] + Temp.2[15,25], C2))
Temp.2$C2 <- as.numeric(Temp.2$C2)

Temp.2 <- Temp.2 %>% ungroup()

Temp.3 <- rbind(Temp.3, Temp.2)

#print(i)
}
dataBuild4 <- Temp.3 # I made "dataBuild4" so that any other manipulations won't alter what's already been created.
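The running tally built above is a per-team cumulative sum, so a shorter route may be cumsum() applied within each team. Here is a minimal sketch in base R using ave(); the small games df is made-up data, with column names borrowed from the code above.

```r
# Made-up outcomes: +1 win, -1 loss, 0 tie
games <- data.frame(Team = c("BAL", "BAL", "BAL", "CIN", "CIN"),
                    Win  = c(1, 1, -1, -1, 0),
                    stringsAsFactors = FALSE)

# Game number within each team, regardless of calendar week
games$Game.Num <- ave(games$Win, games$Team, FUN = seq_along)

# Running total of outcomes within each team
games$C2 <- ave(games$Win, games$Team, FUN = cumsum)

print(games)
```

With dplyr loaded, `group_by(Team) %>% mutate(C2 = cumsum(Win))` should give the same running total, and neither version depends on how many games a team has played.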

So that worked! Now I add a few columns with regards to game outcomes, Divisions, and Conferences…

dataBuild4 <- dataBuild4 %>%  
                      mutate(Division = ifelse(
                        Team == "PHI"|Team =="DAL"|Team =="WSH"|Team =="NYG", "NFC East", 
                                        ifelse(
                        Team == "GB"|Team =="DET"|Team =="MIN"|Team =="CHI", "NFC North", 
                                        ifelse(
                        Team == "CAR"|Team =="NO"|Team =="ATL"|Team =="TB", "NFC South",
                                        ifelse(
                        Team == "SEA"|Team =="ARI"|Team =="LA"|Team =="SF", "NFC West",
                                        ifelse(
                        Team == "BUF"|Team =="MIA"|Team =="NYJ"|Team =="NE", "AFC East",
                                        ifelse(
                        Team == "CIN"|Team =="CLE"|Team =="PIT"|Team =="BAL", "AFC North",
                                        ifelse(
                        Team == "HOU"|Team =="IND"|Team =="TEN"|Team =="JAX", "AFC South",
                                        ifelse(
                        Team == "KC"|Team =="OAK"|Team =="DEN"|Team =="SD", "AFC West", "NA")))))))),
                  Outcome = ifelse(
                        Win == 1, "Win", ifelse(Win == 0, "Tie", ifelse(Win == -1, "Lose", "NA"))),
                  Conference = ifelse(grepl("NFC", Division), "NFC", "AFC"))
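The nested ifelse works, but the same mapping can be sketched as a lookup table, which may be easier to read and to update. The division memberships below are copied from the code above; the names divisions and division_of are my own, and “XXX” is a deliberately unknown code.

```r
# Division membership, taken from the ifelse chain
divisions <- list(
  "NFC East"  = c("PHI", "DAL", "WSH", "NYG"),
  "NFC North" = c("GB", "DET", "MIN", "CHI"),
  "NFC South" = c("CAR", "NO", "ATL", "TB"),
  "NFC West"  = c("SEA", "ARI", "LA", "SF"),
  "AFC East"  = c("BUF", "MIA", "NYJ", "NE"),
  "AFC North" = c("CIN", "CLE", "PIT", "BAL"),
  "AFC South" = c("HOU", "IND", "TEN", "JAX"),
  "AFC West"  = c("KC", "OAK", "DEN", "SD")
)

# Named vector: team code -> division name
division_of <- setNames(rep(names(divisions), lengths(divisions)),
                        unlist(divisions, use.names = FALSE))

team       <- c("BAL", "SEA", "XXX")
division   <- unname(division_of[team])  # unknown codes come back as NA
conference <- substr(division, 1, 3)

print(data.frame(team, division, conference))
```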

Lastly, because the project that I am working on focuses on data related to calendar date in Tableau, I have to do more string splitting to make the date syntax Tableau appropriate…

spl <<-strsplit(as.character(dataBuild4$Date), ",")

dataBuild4 <- data.frame(Day = sapply(spl, "[", 1), Date = sapply(spl, "[", 2)
                        , dataBuild4, stringsAsFactors = F) 

dataBuild4 <- dataBuild4 %>% mutate(Year = 2016)

dataBuild4$Date <- paste(dataBuild4$Date, dataBuild4$Year)
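If a true Date column is preferred over a pasted string, something like the following sketch might work. It assumes every entry looks like “Sun, Sep 11” and that all games fall in 2016; it uses the built-in month.abb constant so the month lookup does not depend on the system locale.

```r
raw_date <- c("Sun, Sep 11", "Sun, Oct 2")

# Drop the weekday and comma, then split into month abbreviation and day
month_day <- strsplit(sub("^[A-Za-z]+,[[:space:]]*", "", raw_date), "[[:space:]]+")
month <- match(sapply(month_day, `[`, 1), month.abb)
day   <- as.integer(sapply(month_day, `[`, 2))

# Assemble ISO dates; the year is an assumption, as in the code above
game_date <- as.Date(sprintf("%04d-%02d-%02d", 2016L, month, day))

print(game_date)
```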

And with that - I have a df that is almost complete and yet complete enough to store as a csv in case I have another df that I can join it to.

write.csv(dataBuild4, "Win_Loss.2016.csv", row.names = F)

Following this I want to create another web scraper that takes info regarding Power Rankings. This data isn’t from a table so it is a little different than the one made above. Still, it has a similar feel to it…

library(rvest)

The links have a special id code, so I just made a list out of the actual URLs.

T.Links <- data_frame(T.Code = c("1","2","3","4","5","6","7","8","9"),
                      URL.Links = c("http://www.espn.com/nfl/story/_/id/17473933/nfl-2016-week-1-power-rankings", "http://www.espn.com/nfl/story/_/id/17531081/nfl-2016-week-2-power-rankings", "http://www.espn.com/nfl/story/_/id/17547005/nfl-2016-week-3-power-rankings", "http://www.espn.com/nfl/story/_/id/17636431/nfl-2016-week-4-power-rankings", "http://www.espn.com/nfl/story/_/id/17704994/nfl-2016-week-5-power-rankings", "http://www.espn.com/nfl/story/_/id/17759755/nfl-2016-week-6-power-rankings", "http://www.espn.com/nfl/story/_/id/17813507/nfl-2016-week-7-power-rankings", "http://www.espn.com/nfl/story/_/id/17869586/nfl-2016-week-8-power-rankings", "http://www.espn.com/nfl/story/_/id/17931127/nfl-2016-week-9-power-rankings"),
                      Week.Num = c("1","2","3","4","5","6","7","8","9"))  

# T.Links$Week.Num <- as.numeric(T.Links$Week.Num)

Below is my use of the rvest function read_html as well as some of the other html commands… Also, I think it is important to mention that I used the Chrome selector tool to identify the html node pertinent to this web scrape (i.e. the “h2”).

PowerRank.Scrape.1 <- read_html(T.Links$URL.Links[1])

PowerRank.Scrape.1 <- PowerRank.Scrape.1 %>% 
                        html_nodes("h2") %>%
                        html_text()

Here I turn my scrape into a df, change a column name, change the class of one of the df’s columns, and add a column that is essential to my project: the week number (“Week”, taken from “Week.Num”).

PowerRank.Scrape.1 <- as.data.frame(PowerRank.Scrape.1)
names(PowerRank.Scrape.1)[1]<-"Text"
PowerRank.Scrape.1$Text <- as.character(PowerRank.Scrape.1$Text)

PowerRank.Scrape.1 <- PowerRank.Scrape.1 %>% mutate(Week = T.Links$Week.Num[1])

With this I created the df that I will use to bind the other dfs that are created the same way but with the addition of the for-loop…

for(i in 2:nrow(T.Links)){
PowerRank.Scrape.2 <- read_html(T.Links$URL.Links[i])

PowerRank.Scrape.2 <- PowerRank.Scrape.2 %>% 
  html_nodes("h2") %>%
  html_text() 

PowerRank.Scrape.2 <- as.data.frame(PowerRank.Scrape.2)
names(PowerRank.Scrape.2)[1]<-"Text"
PowerRank.Scrape.2$Text <- as.character(PowerRank.Scrape.2$Text)

PowerRank.Scrape.2 <- PowerRank.Scrape.2 %>% mutate(Week = T.Links$Week.Num[i])

PowerRank.Scrape.1 <<- rbind(PowerRank.Scrape.1,PowerRank.Scrape.2)

#print(i)
}

From here I have to do a little more wrangling - as I did above… First - I have the interesting issue of having cells that have random non-matching text strings that can’t be merely filtered out. As such, I use grepl to find words to filter out that I know won’t be found in any other - wanted - cell…

PowerRank.Scrape.1 <- PowerRank.Scrape.1 %>%
  mutate(NewText = ifelse(grepl(
    "NFL|odds|for|The|problem|rank|Which|Siemian|into|where|dominant|Vegas|learned|Debate", Text), 0, Text)) %>% 
  filter(NewText != 0)
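A list of unwanted words has to be extended whenever a new headline appears. A possible alternative, sketched here, is to keep only the headings that look like a ranking line. This assumes each ranking heading starts with a number followed by a period, which I have inferred from the clean-up of the “Rank” column rather than confirmed; the sample headings are made up.

```r
# Made-up headings in the assumed "rank. City Nickname" format
headings <- data.frame(
  Text = c("1. Seattle Seahawks", "NFL Power Rankings: Week 1",
           "2. New England Patriots", "Which team is most dominant?"),
  stringsAsFactors = FALSE
)

# Keep rows that begin with digits, a period, and whitespace
is_rank_line <- grepl("^[0-9]+\\.[[:space:]]", headings$Text)
kept <- headings[is_rank_line, , drop = FALSE]

print(kept)
```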

# ...Some more of the usual string splitting...
spl <<-strsplit(as.character(PowerRank.Scrape.1$NewText), " ")

PowerRank.Scrape.1 <- data.frame(Rank = sapply(spl, "[", 1), City.1 = sapply(spl, "[", 2)
                     , City.2 = sapply(spl, "[", 3), Something = sapply(spl, "[", 4)
                         , PowerRank.Scrape.1, stringsAsFactors = F) 

#...however, this is just an interesting way of removing a "." from strings in one of the columns as well as a few other special characters that I just want removed from my data...
PowerRank.Scrape.1$Rank <- gsub( "\\.|/|\\-|\"|\\s" , "" , PowerRank.Scrape.1$Rank )

# Then I have to make a column that shows the cities for teams. The challenge this poses is quite similar to that presented above with player names with suffixes...
PowerRank.Scrape.1[is.na(PowerRank.Scrape.1)] <- 0 

PowerRank.Scrape.1 <- PowerRank.Scrape.1 %>% mutate(TEAM = ifelse(Something == 0, City.2, Something)) 

PowerRank.Scrape.1 <- PowerRank.Scrape.1 %>% mutate(Team.City = ifelse(Something != 0, paste(City.1, City.2), City.1))

PowerRank.Scrape.1 <- PowerRank.Scrape.1 %>% select(Rank,TEAM,Team.City, Week)

Then I have to change all of these city names into their abbreviated versions so that this df can be joined to the one created above (this is a pretty sizable chunk of code)…

PowerRank.Scrape.1 <- PowerRank.Scrape.1 %>%  mutate(
  Team = ifelse(
  Team.City == "Seattle", "SEA", ifelse(
  Team.City == "Pittsburgh", "PIT", ifelse(
  Team.City == "New England", "NE", ifelse( 
  Team.City == "Arizona", "ARI", ifelse(
  Team.City == "Carolina", "CAR", ifelse(
  Team.City == "Green Bay", "GB", ifelse(
  Team.City == "Cincinnati", "CIN", ifelse(
  Team.City == "Denver", "DEN", ifelse(
  Team.City == "Kansas City", "KC", ifelse(
  Team.City == "Oakland", "OAK", ifelse(
  Team.City == "Minnesota", "MIN", ifelse(
  Team.City == "Houston", "HOU", ifelse(
  Team.City == "Washington", "WSH", ifelse(
  Team.City == "Baltimore", "BAL", ifelse(
  Team.City == "Buffalo", "BUF", ifelse(
  Team.City == "Indianapolis", "IND", ifelse(
  Team.City == "Jacksonville", "JAX", ifelse(
  Team.City == "Detroit", "DET", ifelse( 
  Team.City == "Miami", "MIA", ifelse(
  Team.City == "Atlanta", "ATL", ifelse(
  Team.City == "Dallas", "DAL", ifelse(
  Team.City == "Tampa Bay", "TB", ifelse(
  Team.City == "Philadelphia", "PHI", ifelse(
  Team.City == "San Diego", "SD", ifelse(
  Team.City == "New Orleans", "NO", ifelse(
  Team.City == "Los Angeles", "LA", ifelse(
  Team.City == "Chicago", "CHI", ifelse(
  Team.City == "Tennessee", "TEN", ifelse( 
  Team.City == "Cleveland", "CLE", ifelse(
  Team.City == "San Francisco", "SF", ifelse(
  TEAM == "Giants", "NYG", ifelse(TEAM == "Jets", "NYJ", "NA")))))))))))))))))))))))))))))))))
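The city-to-abbreviation mapping can also be sketched as a two-step lookup: first by city, then by nickname for the rows the city cannot settle (the two New York teams). Only a few of the cities are included here to keep it short, and “Springfield Isotopes” is a made-up row that should fall through both lookups.

```r
# Partial lookups, using abbreviations from the code above
city_codes     <- c("Seattle" = "SEA", "New England" = "NE", "Green Bay" = "GB")
nickname_codes <- c("Giants" = "NYG", "Jets" = "NYJ")

ranks <- data.frame(
  Team.City = c("Seattle", "New York", "New York", "Green Bay", "Springfield"),
  TEAM      = c("Seahawks", "Giants", "Jets", "Packers", "Isotopes"),
  stringsAsFactors = FALSE
)

# Step 1: look up by city; step 2: fill the gaps by nickname
code <- unname(city_codes[ranks$Team.City])
unresolved <- is.na(code)
code[unresolved] <- unname(nickname_codes[ranks$TEAM[unresolved]])
ranks$Team <- code

print(ranks)
```

Anything still missing after both steps is left as NA, which plays the same role as the “NA” at the end of the ifelse chain.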

Now I want the higher-ranked teams to have a greater value than the lower-ranked teams, so I flip the values by subtracting each team’s ranking from a number one greater than the number of existing teams (i.e. 33, since there are 32 teams). That way the team ranked 1 gets a value of 32 and the team ranked 32 gets a value of 1. Also, I created a csv of this df so I have it when / if I need it.

PowerRank.Scrape.1$Rank <- as.integer(PowerRank.Scrape.1$Rank)

PowerRank.Scrape.1 <- PowerRank.Scrape.1 %>% mutate(True.Rank = 33-Rank)
  
write.csv(PowerRank.Scrape.1, "Power.Rnk.csv", row.names = F)

Now all I have left to do is convert the class of the column that I want to join on, join the dfs, and write the csv that I am going to use for my project.

dataBuild4$Week <- as.numeric(dataBuild4$Week)
PowerRank.Scrape.1$Week <- as.numeric(PowerRank.Scrape.1$Week) 

The.df <- left_join(dataBuild4, PowerRank.Scrape.1, by = c("Week", "Team"))
write.csv(The.df, "NFL.Win_Lose.11.3.2016_17.csv", row.names = F)

# Lastly, I included the specific date in the name of the csv file with the intention of changing it manually every time I want to create a new df. With this - I am done ... for now.
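After a left join it can be worth checking which rows found no match, since a mismatched abbreviation or week number fails silently. This is a sketch in base R with merge(); both small dfs and their values, including the ranks, are made up for illustration.

```r
# Made-up game rows and ranking rows sharing the join keys
games <- data.frame(Week = c(1, 1, 2), Team = c("BAL", "CIN", "BAL"),
                    Win = c(1, -1, 1), stringsAsFactors = FALSE)
ranks <- data.frame(Week = c(1, 1), Team = c("BAL", "CIN"),
                    Rank = c(18L, 7L), stringsAsFactors = FALSE)

# all.x = TRUE keeps every game row, like left_join()
joined <- merge(games, ranks, by = c("Week", "Team"), all.x = TRUE)

# Rows with no ranking attached
unmatched <- joined[is.na(joined$Rank), c("Week", "Team")]

print(unmatched)
```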
