All related data and files for this project can be found in my github. Additionally, I’ve published it to rpubs.com. For this week’s assignment, all the data used is contained within this document.
Load a few libraries that we will use along the way:
library(stringr)
library(tidyr)
library(ggplot2)
First we load the data and take a look:
source_data<- "https://raw.githubusercontent.com/plb2018/DATA607/master/DATA_607_Project1/tournamentinfo.txt"
rankings <- unlist(read.table(source_data,header=FALSE,sep="\n"))
head(rankings,10)
## V11
## -----------------------------------------------------------------------------------------
## V12
## Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|
## V13
## Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
## V14
## -----------------------------------------------------------------------------------------
## V15
## 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
## V16
## ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
## V17
## -----------------------------------------------------------------------------------------
## V18
## 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|
## V19
## MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |
## V110
## -----------------------------------------------------------------------------------------
## 131 Levels: ----------------------------------------------------------------------------------------- ...
We can see immediately that the data needs a bit of processing before we can do anything with it. We’ll try to parse it with some regular expressions
It looks like we can extract what we need using regular expressions. The data that we need to grab is spread across 2 rows for each player. In both cases, the fields are consistently separated by pipes, making things a bit easier. The first row (r1) that we want to grab begins with the player number followed by a space and a “|” delimiter. The second row (r2) that we want begins with the the state-code (2 uppercase letters), followed by a space and a “|” delimiter. This information should be sufficent for us to grab and recombine the rows in a more usable format. I sometimes find regular expressions hard to decipher, particularly if I haven’t worked with them in a while, so I tend to explicitly explain them in my comments.
#if we see a 1-4 digit # followed by a space and a pipe, grab the it + the rest of the row
r1 <- unlist(str_extract_all(rankings[5:length(rankings)],"\\d{1,4}\\s\\|.+"))
#if we see 2 uppercase letters followed by a space and a pipe, grab it + the rest of the row
r2 <- unlist(str_extract_all(rankings[5:length(rankings)],"[[:upper:]]{2}\\s\\|.+"))
#while we're at it, we'll replace the "W's" "L's" "D's" and "B's" in r1... we just want thte #s
r1 <- unlist(str_replace_all(r1,"\\|(W|L|B|D|H|U)\\s","|"))
head(r1,5)
## [1] "1 | GARY HUA |6.0 | 39| 21| 18| 14| 7| 12| 4|"
## [2] "2 | DAKSHESH DARURI |6.0 | 63| 58| 4| 17| 16| 20| 7|"
## [3] "3 | ADITYA BAJAJ |6.0 | 8| 61| 25| 21| 11| 13| 12|"
## [4] "4 | PATRICK H SCHILLING |5.5 | 23| 28| 2| 26| 5| 19| 1|"
## [5] "5 | HANSHI ZUO |5.5 | 45| 37| 12| 13| 4| 14| 17|"
head(r2,5)
## [1] "ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [2] "MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [3] "MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
## [4] "MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |"
## [5] "MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |"
The regex appears to have works as intended, so now we’ll split the rows on the “|”’s while loading the data to a dataframe. We’ll then down-select to just the cols that we need.
#create the df
df <- data.frame(str_split_fixed(r1,"\\|",n=11),str_split_fixed(r2,"\\|",n=10))
#grab the data we need
cols <- c(1:10,12:13)
df <- df[,cols]
#re-order the cols
cols <- c(1,2,11,12,3:10)
df <- df[,cols]
#we'll also rename all the cols that we're going to keep while we're at it
names(df) <- c("PlayerNum","PlayerName","PlayerState",
"USCF_Pre_Post","TotalPts","r1","r2",
"r3","r4","r5","r6","r7")
#trim the whitespace
df <- as.data.frame(apply(df,2,function(x)gsub("^\\s+|\\s+$", "", x)))
head(df,5)
## PlayerNum PlayerName PlayerState USCF_Pre_Post
## 1 1 GARY HUA ON 15445895 / R: 1794 ->1817
## 2 2 DAKSHESH DARURI MI 14598900 / R: 1553 ->1663
## 3 3 ADITYA BAJAJ MI 14959604 / R: 1384 ->1640
## 4 4 PATRICK H SCHILLING MI 12616049 / R: 1716 ->1744
## 5 5 HANSHI ZUO MI 14601533 / R: 1655 ->1690
## TotalPts r1 r2 r3 r4 r5 r6 r7
## 1 6.0 39 21 18 14 7 12 4
## 2 6.0 63 58 4 17 16 20 7
## 3 6.0 8 61 25 21 11 13 12
## 4 5.5 23 28 2 26 5 19 1
## 5 5.5 45 37 12 13 4 14 17
Now the data looks reasonably close to where we want it to be, however, we still need to deal with that last column (ID_Pre_Post). Because the columns appear to be well formated, we’re going to split the column into 3 using separate() from tidyr. We’re really only after the pre_rating column here. Another thing of note is that the pre-rating column sometimes contains a “p” with the ranking that we want consistently on the lefthand side of the “p”.
#split the columns
df <- separate(data=df,col=USCF_Pre_Post,sep=" / R: ",into = c("USCF ID","Pre_Post"))
df <- separate(data=df,col=Pre_Post,sep=" ->",into = c("Pre_Rating","Post_Rating"))
#pre_rating sometimes contains a "p". the resultant "etc" column is a throw-away
df <- separate(data=df,col=Pre_Rating,sep="P",into = c("Player_Pre_Rating","etc"))
head(df,5)
## PlayerNum PlayerName PlayerState USCF ID Player_Pre_Rating
## 1 1 GARY HUA ON 15445895 1794
## 2 2 DAKSHESH DARURI MI 14598900 1553
## 3 3 ADITYA BAJAJ MI 14959604 1384
## 4 4 PATRICK H SCHILLING MI 12616049 1716
## 5 5 HANSHI ZUO MI 14601533 1655
## etc Post_Rating TotalPts r1 r2 r3 r4 r5 r6 r7
## 1 <NA> 1817 6.0 39 21 18 14 7 12 4
## 2 <NA> 1663 6.0 63 58 4 17 16 20 7
## 3 <NA> 1640 6.0 8 61 25 21 11 13 12
## 4 <NA> 1744 5.5 23 28 2 26 5 19 1
## 5 <NA> 1690 5.5 45 37 12 13 4 14 17
The table looks good - all I have to do now is compute the average pre-chess score and drop all the unwanted columns. To compute the average score, I’m going to loop over each player, figure out who they played against (cols 9:15) and use that data to look-up the pre-game ratings. Using that data, I can compute the averages.
pre_chess_rating = c()
for (i in 1:nrow(df)){
#get all players who played against this player
opponent_nums <- df[i,9:15]
opponent_nums <- as.numeric(levels(unlist(opponent_nums)))[unlist(opponent_nums)]
opponent_nums <- opponent_nums[!is.na(opponent_nums)]
#pull scores of the opponents and compute the avg
pre_chess_rating[i] <- trunc(mean(as.numeric(df[opponent_nums,5])))
}
df$Player_Pre_Rating <- as.numeric(df$Player_Pre_Rating )
df$Avg_Opponent_Rating = pre_chess_rating
Now we drop all the unwanted cols and write the data to file:
df <- df[c(2:3,8,5,16)]
head(df)
## PlayerName PlayerState TotalPts Player_Pre_Rating
## 1 GARY HUA ON 6.0 1794
## 2 DAKSHESH DARURI MI 6.0 1553
## 3 ADITYA BAJAJ MI 6.0 1384
## 4 PATRICK H SCHILLING MI 5.5 1716
## 5 HANSHI ZUO MI 5.5 1655
## 6 HANSEN SONG OH 5.0 1686
## Avg_Opponent_Rating
## 1 1605
## 2 1469
## 3 1563
## 4 1573
## 5 1500
## 6 1518
write.csv(df, file = "data_607_project1.csv")
I just wanted to take a look and see whether there were any obvious relatioships in the data. We see below that there appears to be a positive relatioship between total points and player_pre_rating. I also wanted to see if i could extract any info by adding an additional dimension to the plot (avg_opponent_rating) via dot-color.
ggplot(df, aes(Player_Pre_Rating,TotalPts)) + geom_point(aes(color=Avg_Opponent_Rating)) +
ggtitle("")