In this project we are given a semi-structured text file and tasked with parsing it and exporting it as a structured CSV file for later analysis. With the stringr package, we can extract specific text of interest using regular expressions, begin to structure the data into variables, and manipulate it for our needs.
Below we read the text file using the readLines function, which creates a character vector containing each line of the text file as an element.
library(stringr)  # str_trim, str_split, str_extract, str_extract_all
library(DT)       # datatable
l <- readLines("https://raw.githubusercontent.com/jreznyc/DATA607/master/Projects/Project%201/tournamentinfo.txt")
head(l)
## [1] "-----------------------------------------------------------------------------------------"
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [4] "-----------------------------------------------------------------------------------------"
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
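As a minimal sketch of what readLines returns, the snippet below writes three made-up lines to a temporary file and reads them back. The file contents and the names tmp and lines_read are assumptions for illustration only; they are not part of the project data.

```r
# Hypothetical local file standing in for the remote tournament file
tmp <- tempfile(fileext = ".txt")
writeLines(c("-----", " Pair | Player Name ", " Num  | USCF ID "), tmp)

# readLines returns a character vector with one element per line
lines_read <- readLines(tmp)
print(length(lines_read))

# Leading and trailing whitespace is kept as-is, which is why a later
# trimming step is needed
print(lines_read[2])
```

Because nothing is parsed at this stage, all structure has to be recovered afterwards from the text of each element.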
To make subsequent analysis easier, we remove the unnecessary horizontal lines of dashes (and any empty lines) from the original text and trim the surrounding whitespace. This leaves each player’s information on exactly two consecutive lines [1].
l <- str_trim(grep("^\\|?-+\\|?$|^$", l, value=TRUE, invert=TRUE))
head(l)
## [1] "Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|"
## [2] "Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |"
## [3] "1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [4] "ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [5] "2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [6] "MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
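The filtering step can be tried in isolation on a few lines. This is only a sketch: sample_lines is a made-up four-line sample in the same layout as the file, and base R's trimws stands in for str_trim so the snippet runs without any packages.

```r
# Hypothetical four-line sample in the same layout as the tournament file
sample_lines <- c(
  "-----------------------------------------",
  "    1 | GARY HUA                        |6.0  |W  39|",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |",
  ""
)

# invert = TRUE keeps only the lines that do NOT match the pattern,
# i.e. everything except dash-only rules and empty lines
kept <- trimws(grep("^\\|?-+\\|?$|^$", sample_lines, value = TRUE, invert = TRUE))
print(kept)
```

The pattern has two alternatives: one for a line made only of dashes (optionally wrapped in pipes) and one for an empty line. Anything else, including the header rows, passes through.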
Here we declare a vector for each variable we want in the final output. At this stage we build the following variables: playerID, name, state, points, rating (the pre-tournament rating), and opponents_matrix (a 64 x 7 matrix holding, for each player, the pair numbers of the opponents in the seven rounds).
Below we declare these variables and loop over the lines to extract the desired information. Since each player’s information sits on an odd line followed by an even line, the loop skips the two header lines and runs the nested code on each odd line, reading the line that follows it for the state and rating.
For each variable we split the line on the pipes “|”, which were left intact in every line of player information, and pick the piece we need by position; a regular expression then pulls the digits out of the rating and round fields. Below we see the output of this step.
playerID<-c()
name<-c()
state<-c()
points<-c()
rating<-c()
opponents_matrix<-matrix(nrow=64,ncol=7)
for (i in seq_along(l)) {
  # Lines 1-2 are the header; player records start on every odd line from 3
  if (i %% 2 != 0 && i != 1) {
    # Split the odd line and the even line after it on the pipe delimiter
    top <- str_trim(unlist(str_split(l[i], "\\|")))
    bottom <- str_trim(unlist(str_split(l[i + 1], "\\|")))
    thisplayer <- as.integer(top[1])
    playerID <- c(playerID, thisplayer)
    name <- c(name, top[2])
    state <- c(state, bottom[1])
    points <- c(points, top[3])
    # Pre-tournament rating: the digits that follow "R:"
    rating <- c(rating, as.integer(str_extract(bottom[2], "(?<=R: ?)[0-9]+")))
    # Drop pair number, name, points and the empty trailing piece, leaving the
    # seven round fields; keep each opponent's pair number (NA if there is none)
    opponents_matrix[thisplayer, ] <- as.integer(
      str_extract_all(top[-c(1, 2, 3, 11)], "\\d+"))
  }
}
df<- data.frame(name,state,points,rating)
rownames(df)<- playerID
head(df)
## name state points rating
## 1 GARY HUA ON 6.0 1794
## 2 DAKSHESH DARURI MI 6.0 1553
## 3 ADITYA BAJAJ MI 6.0 1384
## 4 PATRICK H SCHILLING MI 5.5 1716
## 5 HANSHI ZUO MI 5.5 1655
## 6 HANSEN SONG OH 5.0 1686
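To make the field positions concrete, here is a sketch that applies the same split to a single player, using the two lines for pair number 1 copied from the cleaned output above. The variable names are illustrative. The last statement uses str_match with a capture group, which is an alternative to the lookbehind and not the method used in this project; it is shown because `\\s*` accepts any amount of padding after "R:", should the raw file contain more than one space there (not verified here).

```r
library(stringr)

# One player's two lines, copied from the cleaned output shown earlier
top    <- "1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
bottom <- "ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"

# Splitting on the literal pipe yields 11 pieces; the last one is empty
# because each line ends with a pipe
top_fields    <- str_trim(str_split(top, "\\|")[[1]])
bottom_fields <- str_trim(str_split(bottom, "\\|")[[1]])

player_name  <- top_fields[2]
player_state <- bottom_fields[1]

# Pieces 4 to 10 are the seven rounds; keep only the opponent's pair number
opponents <- as.integer(str_extract(top_fields[4:10], "\\d+"))

# Lookbehind: digits preceded by "R:" and at most one space
rating_lookbehind <- as.integer(str_extract(bottom_fields[2], "(?<=R: ?)[0-9]+"))

# Capture-group variant (an assumption, not the original method)
rating_capture <- as.integer(str_match(bottom_fields[2], "R:\\s*(\\d+)")[, 2])

print(opponents)
print(c(rating_lookbehind, rating_capture))
```

Both rating expressions agree on this line. They would differ only where the padding after "R:" is wider than the lookbehind allows.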
The next step is to calculate the average opponent rating for each player. Using opponents_matrix, with each playerID as the key, we iterate over the players in the dataframe, look up their opponents’ ratings, calculate the average, and assign it to a new column named “avg_opp_rating”.
At the end of this step we have the desired dataframe, which we can then export to a CSV file.
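# Mean pre-tournament rating of one player's opponents, rounded to an integer.
# Rounds without an opponent are NA in opponents_matrix and are dropped by na.rm.
getavg <- function(id) {
  round(mean(df[opponents_matrix[id, ], "rating"], na.rm = TRUE))
}
# Row i of df holds the player with pair number i, so i serves as the lookup key
for (i in seq_len(nrow(df))) {
  df$avg_opp_rating[i] <- getavg(i)
}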
datatable(df)
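The lookup is easier to check by hand on a tiny example. The sketch below uses invented data (four players, two rounds); toy, toy_opponents and toy_avg are hypothetical names, and the ratings are not taken from the tournament file.

```r
# Hypothetical toy data: four players and two rounds
toy <- data.frame(rating = c(1500L, 1600L, 1700L, 1800L))
toy_opponents <- matrix(c(2L, 3L,    # player 1 played players 2 and 3
                          1L, NA,    # player 2 played player 1, then had no game
                          4L, 1L,    # player 3 played players 4 and 1
                          3L, NA),   # player 4 played player 3, then had no game
                        nrow = 4, byrow = TRUE)

# Indexing rows with NA yields NA, which na.rm = TRUE drops before averaging
toy_avg <- function(id) {
  round(mean(toy[toy_opponents[id, ], "rating"], na.rm = TRUE))
}

# vapply fills the whole column in one call and checks that every result
# is a single number
toy$avg_opp_rating <- vapply(seq_len(nrow(toy)), toy_avg, numeric(1))
print(toy)
```

Player 2 has only one real opponent (rated 1500), so the average is taken over one game rather than two. The same happens for byes and unplayed rounds in the real data.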
Finally, we write the dataframe to a CSV file containing the desired columns.
write.csv(df, file="output_data.csv", quote=FALSE)
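To see what the exported file looks like, the following sketch writes a two-row stand-in to a temporary file and reads it back. The data frame demo and the path out_path are assumptions for illustration; the two rows reuse values shown in the output above.

```r
# Hypothetical two-row data frame standing in for df
demo <- data.frame(name   = c("GARY HUA", "DAKSHESH DARURI"),
                   state  = c("ON", "MI"),
                   points = c("6.0", "6.0"),
                   rating = c(1794L, 1553L))

out_path <- tempfile(fileext = ".csv")  # temporary file, nothing is overwritten
write.csv(demo, file = out_path, quote = FALSE)

# The first line is the header; row names land in an unnamed first column
csv_lines <- readLines(out_path)
print(csv_lines)

# row.names = 1 turns that first column back into row names
round_trip <- read.csv(out_path, row.names = 1)
print(round_trip)
```

With quote=FALSE the output is easier to read, but a field containing a comma would be split into two columns on import. That is safe here only as long as no player name contains a comma.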
Sources:
1) https://stackoverflow.com/questions/21114598/importing-a-text-file-into-r
2) https://stackoverflow.com/questions/35804379/stringr-str-extract-how-to-do-positive-lookbehind