Extracting Data From a Text File

By William Outcault

Problem

In this project, you’re given a text file of chess tournament results in which the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (which could, for example, be imported into a SQL database) with the following information for all of the players:

The Data

Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents

For the first player, the information would be:

“Gary Hua, ON, 6.0, 1794, 1605”

library(stringr)
raw_data <- as.matrix(read.table("https://raw.githubusercontent.com/wco1216/Data-607/master/tournamentinfo.txt",
                                 skip = 4, stringsAsFactors = FALSE, sep = "\t"))
head(raw_data)

##      V1                                                                                         
## [1,] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
## [2,] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [3,] "-----------------------------------------------------------------------------------------"
## [4,] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
## [5,] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [6,] "-----------------------------------------------------------------------------------------"

tail(raw_data)

##        V1                                                                                         
## [187,] "   63 | THOMAS JOSEPH HOSMER            |1.0  |L   2|L  48|D  49|L  43|L  45|H    |U    |"
## [188,] "   MI | 15057092 / R: 1175   ->1125     |     |W    |B    |W    |B    |B    |     |     |"
## [189,] "-----------------------------------------------------------------------------------------"
## [190,] "   64 | BEN LI                          |1.0  |L  22|D  30|L  31|D  49|L  46|L  42|L  54|"
## [191,] "   MI | 15006561 / R: 1163   ->1112     |     |B    |W    |W    |B    |W    |B    |B    |"
## [192,] "-----------------------------------------------------------------------------------------"

Extracting the Different Observations From the .txt File

To retrieve the information this problem asks for, I used functions from the stringr library. I identified a pattern to locate each name, state, point total, and pre-tournament rating. The data still need a little cleaning, but every field of interest is extracted and assigned to its own data frame.

name <- data.frame(unlist(str_extract_all(raw_data, "[:upper:]+ ?[:upper:]* ?[:upper:]* [:upper:]+")))
state <- data.frame(unlist(str_extract_all(raw_data, "   [:upper:][:upper:]")))
points <- data.frame(unlist(str_extract_all(raw_data, "[:digit:]\\.[:digit:]")))
pre_rating <- data.frame(unlist(str_extract_all(raw_data, "R:  ?[:digit:]{2,}")))
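
To see what each pattern picks up, the four expressions can be tried on a single record. The sketch below applies them to the two lines for the first player shown in the head(raw_data) output above. The names `record` and `extract_field()` are introduced here for illustration only and are not part of the original analysis.

```r
library(stringr)

# The two raw lines for player 1, copied from head(raw_data) above.
record <- c(
  "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
)

# Hypothetical helper: collect every match of `pattern` across the lines.
extract_field <- function(lines, pattern) {
  unlist(str_extract_all(lines, pattern))
}

name_1   <- extract_field(record, "[:upper:]+ ?[:upper:]* ?[:upper:]* [:upper:]+")
state_1  <- extract_field(record, "   [:upper:][:upper:]")
points_1 <- extract_field(record, "[:digit:]\\.[:digit:]")
rating_1 <- extract_field(record, "R:  ?[:digit:]{2,}")

print(c(name_1, state_1, points_1, rating_1))
```

At this point the state match still carries its leading spaces and the rating match still carries the `R:` prefix, which is why a cleaning step is needed afterwards.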

Creating our Data Frame with the Clean Data

Before extracting each opponent's pre-tournament rating and finding the average for each competitor, I bound the data frames together to create my “clean data” data frame. I also renamed the headers and cleaned the excess characters and spaces out of some of the elements.

clean_data <- cbind(name, state, points, pre_rating)
colnames(clean_data) <- c("name", "state", "points", "pre_rating")
clean_data$pre_rating <- str_remove(clean_data$pre_rating, "R: +")
clean_data$state <- str_remove_all(clean_data$state, " ") 
head(clean_data)

##                  name state points pre_rating
## 1            GARY HUA    ON    6.0       1794
## 2     DAKSHESH DARURI    MI    6.0       1553
## 3        ADITYA BAJAJ    MI    6.0       1384
## 4 PATRICK H SCHILLING    MI    5.5       1716
## 5          HANSHI ZUO    MI    5.5       1655
## 6         HANSEN SONG    OH    5.0       1686
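
The extracted columns are still character strings and the names are in upper case, whereas the target row in the problem statement reads “Gary Hua, ON, 6.0, 1794, 1605”. The sketch below shows one way to close that gap, assuming stringr's str_to_title() gives acceptable casing for the names. `sample_rows` is a hypothetical stand-in rebuilt from the head(clean_data) output above, not an object from the original code.

```r
library(stringr)

# Hypothetical stand-in for a few rows of clean_data (values copied from the
# printed output above).
sample_rows <- data.frame(
  name = c("GARY HUA", "DAKSHESH DARURI", "PATRICK H SCHILLING"),
  state = c("ON", "MI", "MI"),
  points = c("6.0", "6.0", "5.5"),
  pre_rating = c("1794", "1553", "1716"),
  stringsAsFactors = FALSE
)

# Title-case the names and convert the numeric fields from character.
sample_rows$name <- str_to_title(sample_rows$name)
sample_rows$points <- as.numeric(sample_rows$points)
sample_rows$pre_rating <- as.integer(sample_rows$pre_rating)

str(sample_rows)
```

Converting the ratings to numbers at this stage also makes them usable for averaging later on.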

Find the Average Opponent Pre-Tournament Rating

I ran into an issue when replacing each opponent's ID number with the corresponding pre-tournament rating. From the raw data I was able to extract the IDs, but only as a single vector with 64 observations. I figured there were two separate approaches I could take: replace each value based on its pattern within the vector, or separate the values into seven different vectors and iterate over the matrix.

opp <- as.matrix(unlist(str_extract_all(raw_data, 
                                        "[:upper:]  [:digit:]? ?[:digit:]\\|([:upper:]  [:digit:]? ?[:digit:]? ?\\|){1,7}")))

opp_index <- as.matrix(unlist(str_replace_all(opp, "\\|?[:upper:]?", "")))
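
It can help to look at what these two steps return for a single record. The sketch below runs the same two patterns on the first player's line from the head(raw_data) output and then splits the result into integer IDs with one extra call. `line_1`, `opp_1`, `opp_index_1` and `ids_1` are names introduced for illustration.

```r
library(stringr)

line_1 <- "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"

# Same patterns as above: grab the run of round results, then drop the pipes
# and result letters so that only the padded opponent IDs remain.
opp_1 <- unlist(str_extract_all(
  line_1,
  "[:upper:]  [:digit:]? ?[:digit:]\\|([:upper:]  [:digit:]? ?[:digit:]? ?\\|){1,7}"
))
opp_index_1 <- str_replace_all(opp_1, "\\|?[:upper:]?", "")

# Extra step (not in the original code): split the single string into IDs.
ids_1 <- as.integer(str_extract_all(opp_index_1, "[:digit:]+")[[1]])
print(ids_1)
```

Each record collapses to one string, so the full file yields one element per matched player line. Splitting each element as in the last step gives a vector of opponent IDs per player, which can then be used to look up ratings.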

The approach seen below is my attempt to create an empty data frame with seven columns and 64 rows. I then wanted to iterate over each row and column of the data frame, assigning the first extracted value to row 1, column 1. The extracted value would then be removed and we would move to row 1, column 2, extracting and removing the next value. Unfortunately, there are flaws in this method and I have been running into a web of errors.

#opp_rankings <- data.frame(matrix(NA, nrow = 64, ncol = 7))

#for (i in 1:64) {
#  for (j in 1:7) {
#    opp_rankings[i, j] <- str_extract(opp_index, " [:digit:]? ?[:digit:]")
#    opp_index <- str_remove(opp_index, " [:digit:]? ?[:digit:]")
#  }
#}
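
One likely source of trouble in the loop above is that str_extract() is vectorised: called on all of opp_index it returns one value per element, which cannot be assigned to a single cell. A sketch of an alternative that avoids the nested loop follows. It is not the method used in this project. It assumes players are listed in pair-number order, so that an opponent ID can index the rating vector directly, and it runs on `toy_lines`, a made-up three-player tournament in the same layout (all names, IDs and ratings in it are invented for illustration).

```r
library(stringr)

# Made-up tournament (3 players, 2 rounds) in the source file's layout.
# "H" marks a bye, so that round has no opponent ID.
toy_lines <- c(
  "    1 | PLAYER A                        |2.0  |W   2|W   3|",
  "   MI | 10000001 / R: 1500   ->1510     |N:2  |W    |B    |",
  "--------------------------------------------------------",
  "    2 | PLAYER B                        |0.5  |L   1|H    |",
  "   MI | 10000002 / R: 1400   ->1395     |     |B    |     |",
  "--------------------------------------------------------",
  "    3 | PLAYER C                        |0.5  |H    |L   1|",
  "   OH | 10000003 / R: 1300   ->1295     |     |     |W    |",
  "--------------------------------------------------------"
)

# The first line of each record starts with the pair number and a pipe;
# the second line carries the pre-rating after "R:".
player_lines <- toy_lines[str_detect(toy_lines, "^\\s*\\d+ \\|")]
rating_lines <- toy_lines[str_detect(toy_lines, "R:")]
pre_rating <- as.numeric(str_match(rating_lines, "R:\\s*(\\d+)")[, 2])

# Opponent IDs: a result letter, padding, then digits ending at a pipe.
# Rounds without an opponent (byes) have no digits and are skipped.
opp_ids <- lapply(
  str_match_all(player_lines, "[A-Z]\\s+(\\d+)\\|"),
  function(m) as.integer(m[, 2])
)

# Look up each opponent's pre-rating by ID and average per player.
avg_opp_rating <- vapply(
  opp_ids,
  function(ids) round(mean(pre_rating[ids])),
  numeric(1)
)

print(data.frame(player = seq_along(avg_opp_rating), avg_opp_rating))
```

On the real file the same steps would be applied to raw_data, and the resulting vector could be added to clean_data as a fifth column. That has not been verified against the full data set here.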

Writing the Data to a CSV File

Finally, I write the data frame “clean_data” to a CSV file in my working directory. Although it is missing a key attribute (the average pre-tournament rating of each player's opponents), there is still some value in the data that was cleaned and prepared.

write.csv(clean_data, 'chessratings.csv')
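
By default write.csv() also writes the row names as an extra, unnamed first column. If the file should contain only the five fields listed in the problem statement, passing row.names = FALSE is one option. The sketch below uses the first player's expected values from the problem statement and a temporary file. `first_row`, `csv_path` and the column name avg_opp_rating are illustrative names, not part of the original code.

```r
# One row built from the expected output quoted in the problem statement.
first_row <- data.frame(
  name = "Gary Hua",
  state = "ON",
  points = 6.0,
  pre_rating = 1794,
  avg_opp_rating = 1605,
  stringsAsFactors = FALSE
)

csv_path <- tempfile(fileext = ".csv")  # temporary file, for illustration
write.csv(first_row, csv_path, row.names = FALSE)

# Read the file back to confirm that only the five fields were written.
round_trip <- read.csv(csv_path, stringsAsFactors = FALSE)
print(names(round_trip))
```

Leaving the row names out keeps the column count equal to the number of fields, which tends to make a later import into a database table simpler.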

Conclusion

The main benefit I take away from this project is learning how to identify patterns in .txt files. Prior to this assignment I had very little experience with this kind of work. Being able to specify so precisely which characters to extract offers great potential for data scraping. I look forward to reading other individuals' projects and finding out the most efficient way to go about this.

At one point I did consider writing 64 lines of code with the str_replace() function, replacing all the opponent ID values with their pre-tournament ratings, but I figured there wouldn’t be much learning or creativity involved in that. Searching for patterns and cleaning data can be meticulous work, but it is very rewarding. There is immense satisfaction when you can take data in a structured text file and put it into a data frame.