All related data and files for this project can be found in my github. Additionally, I’ve published it to rpubs.com. For this week’s assignment, all the data used is contained within this document.

Load Libraries

Load a few libraries that we will use along the way:

library(stringr) 
library(tidyr)
library(ggplot2)

Load the Data

First we load the data and take a look:

source_data<- "https://raw.githubusercontent.com/plb2018/DATA607/master/DATA_607_Project1/tournamentinfo.txt" 
rankings <- unlist(read.table(source_data,header=FALSE,sep="\n")) 
head(rankings,10)
##                                                                                        V11 
##  ----------------------------------------------------------------------------------------- 
##                                                                                        V12 
##  Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round|  
##                                                                                        V13 
##  Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  
##                                                                                        V14 
##  ----------------------------------------------------------------------------------------- 
##                                                                                        V15 
##      1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4| 
##                                                                                        V16 
##     ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    | 
##                                                                                        V17 
##  ----------------------------------------------------------------------------------------- 
##                                                                                        V18 
##      2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7| 
##                                                                                        V19 
##     MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    | 
##                                                                                       V110 
##  ----------------------------------------------------------------------------------------- 
## 131 Levels: ----------------------------------------------------------------------------------------- ...

We can see immediately that the data needs a bit of processing before we can do anything with it. We’ll try to parse it with some regular expressions

Parse the Data

It looks like we can extract what we need using regular expressions. The data that we need to grab is spread across 2 rows for each player. In both cases, the fields are consistently separated by pipes, making things a bit easier. The first row (r1) that we want to grab begins with the player number followed by a space and a “|” delimiter. The second row (r2) that we want begins with the the state-code (2 uppercase letters), followed by a space and a “|” delimiter. This information should be sufficent for us to grab and recombine the rows in a more usable format. I sometimes find regular expressions hard to decipher, particularly if I haven’t worked with them in a while, so I tend to explicitly explain them in my comments.

#if we see a 1-4 digit # followed by a space and a pipe, grab the it + the rest of the row
r1 <- unlist(str_extract_all(rankings[5:length(rankings)],"\\d{1,4}\\s\\|.+")) 

#if we see 2 uppercase letters followed by a space and a pipe, grab it + the rest of the row
r2 <- unlist(str_extract_all(rankings[5:length(rankings)],"[[:upper:]]{2}\\s\\|.+")) 

#while we're at it, we'll replace the "W's" "L's" "D's" and "B's" in r1... we just want thte #s
r1 <- unlist(str_replace_all(r1,"\\|(W|L|B|D|H|U)\\s","|"))


head(r1,5)
## [1] "1 | GARY HUA                        |6.0  | 39| 21| 18| 14|  7| 12|  4|"
## [2] "2 | DAKSHESH DARURI                 |6.0  | 63| 58|  4| 17| 16| 20|  7|"
## [3] "3 | ADITYA BAJAJ                    |6.0  |  8| 61| 25| 21| 11| 13| 12|"
## [4] "4 | PATRICK H SCHILLING             |5.5  | 23| 28|  2| 26|  5| 19|  1|"
## [5] "5 | HANSHI ZUO                      |5.5  | 45| 37| 12| 13|  4| 14| 17|"
head(r2,5)
## [1] "ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [2] "MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [3] "MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [4] "MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
## [5] "MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |"

The regex appears to have works as intended, so now we’ll split the rows on the “|”’s while loading the data to a dataframe. We’ll then down-select to just the cols that we need.

#create the df
df <- data.frame(str_split_fixed(r1,"\\|",n=11),str_split_fixed(r2,"\\|",n=10))


#grab the data we need
cols <- c(1:10,12:13)
df <- df[,cols]

#re-order the cols 
cols <- c(1,2,11,12,3:10)
df <- df[,cols]



#we'll also rename all the cols that we're going to keep while we're at it
names(df) <- c("PlayerNum","PlayerName","PlayerState",
               "USCF_Pre_Post","TotalPts","r1","r2",
               "r3","r4","r5","r6","r7")

#trim the whitespace
df <- as.data.frame(apply(df,2,function(x)gsub("^\\s+|\\s+$", "", x)))

head(df,5)
##   PlayerNum          PlayerName PlayerState               USCF_Pre_Post
## 1         1            GARY HUA          ON 15445895 / R: 1794   ->1817
## 2         2     DAKSHESH DARURI          MI 14598900 / R: 1553   ->1663
## 3         3        ADITYA BAJAJ          MI 14959604 / R: 1384   ->1640
## 4         4 PATRICK H SCHILLING          MI 12616049 / R: 1716   ->1744
## 5         5          HANSHI ZUO          MI 14601533 / R: 1655   ->1690
##   TotalPts r1 r2 r3 r4 r5 r6 r7
## 1      6.0 39 21 18 14  7 12  4
## 2      6.0 63 58  4 17 16 20  7
## 3      6.0  8 61 25 21 11 13 12
## 4      5.5 23 28  2 26  5 19  1
## 5      5.5 45 37 12 13  4 14 17

Now the data looks reasonably close to where we want it to be, however, we still need to deal with that last column (ID_Pre_Post). Because the columns appear to be well formated, we’re going to split the column into 3 using separate() from tidyr. We’re really only after the pre_rating column here. Another thing of note is that the pre-rating column sometimes contains a “p” with the ranking that we want consistently on the lefthand side of the “p”.

#split the columns 
df <- separate(data=df,col=USCF_Pre_Post,sep=" / R: ",into = c("USCF ID","Pre_Post"))
df <- separate(data=df,col=Pre_Post,sep="   ->",into = c("Pre_Rating","Post_Rating"))

#pre_rating sometimes contains a "p".  the resultant "etc" column is a throw-away
df <- separate(data=df,col=Pre_Rating,sep="P",into = c("Player_Pre_Rating","etc"))

head(df,5)
##   PlayerNum          PlayerName PlayerState  USCF ID Player_Pre_Rating
## 1         1            GARY HUA          ON 15445895              1794
## 2         2     DAKSHESH DARURI          MI 14598900              1553
## 3         3        ADITYA BAJAJ          MI 14959604              1384
## 4         4 PATRICK H SCHILLING          MI 12616049              1716
## 5         5          HANSHI ZUO          MI 14601533              1655
##    etc Post_Rating TotalPts r1 r2 r3 r4 r5 r6 r7
## 1 <NA>        1817      6.0 39 21 18 14  7 12  4
## 2 <NA>        1663      6.0 63 58  4 17 16 20  7
## 3 <NA>        1640      6.0  8 61 25 21 11 13 12
## 4 <NA>        1744      5.5 23 28  2 26  5 19  1
## 5 <NA>        1690      5.5 45 37 12 13  4 14 17

Compute the Pre-Chess Rating

The table looks good - all I have to do now is compute the average pre-chess score and drop all the unwanted columns. To compute the average score, I’m going to loop over each player, figure out who they played against (cols 9:15) and use that data to look-up the pre-game ratings. Using that data, I can compute the averages.

pre_chess_rating = c()

for (i in 1:nrow(df)){

  #get all players who played against this player
  opponent_nums <- df[i,9:15]
  opponent_nums <-  as.numeric(levels(unlist(opponent_nums)))[unlist(opponent_nums)]
  opponent_nums <- opponent_nums[!is.na(opponent_nums)]
  
  #pull scores of the opponents and compute the avg
  pre_chess_rating[i] <- trunc(mean(as.numeric(df[opponent_nums,5])))  
 
}

df$Player_Pre_Rating <- as.numeric(df$Player_Pre_Rating )

df$Avg_Opponent_Rating = pre_chess_rating

Write the File

Now we drop all the unwanted cols and write the data to file:

df <- df[c(2:3,8,5,16)]

head(df)
##            PlayerName PlayerState TotalPts Player_Pre_Rating
## 1            GARY HUA          ON      6.0              1794
## 2     DAKSHESH DARURI          MI      6.0              1553
## 3        ADITYA BAJAJ          MI      6.0              1384
## 4 PATRICK H SCHILLING          MI      5.5              1716
## 5          HANSHI ZUO          MI      5.5              1655
## 6         HANSEN SONG          OH      5.0              1686
##   Avg_Opponent_Rating
## 1                1605
## 2                1469
## 3                1563
## 4                1573
## 5                1500
## 6                1518
write.csv(df, file = "data_607_project1.csv")

Bonus Plot

I just wanted to take a look and see whether there were any obvious relatioships in the data. We see below that there appears to be a positive relatioship between total points and player_pre_rating. I also wanted to see if i could extract any info by adding an additional dimension to the plot (avg_opponent_rating) via dot-color.

ggplot(df, aes(Player_Pre_Rating,TotalPts)) + geom_point(aes(color=Avg_Opponent_Rating)) +
ggtitle("")