I began by reading in the data. I decided to use read.fwf which is designed to be used with files with arbitrary formatting but where every line is the same width and all the information is in the same position on each line. The resulting data frame is a bit messy but can be cleaned up quite easily.
con <- file('https://raw.githubusercontent.com/brian-cuny/607project1/master/tournamentinfo.txt', open='r')
raw.data <- read.fwf(con, widths=c(6, 2, 32, 1, 5, 43), skip=4, stringsAsFactors=FALSE)
close(con)
raw.data[which(1:nrow(raw.data) %% 3 != 0), ] %>% head() %>% kable()
| V1 | V2 | V3 | V4 | V5 | V6 | |
|---|---|---|---|---|---|---|
| 1 | 1 | | | GARY HUA | | | 6.0 | |W 39|W 21|W 18|W 14|W 7|D 12|D 4| |
| 2 | ON | | | 15445895 / R: 1794 ->1817 | | | N:2 | |W |B |W |B |W |B |W | |
| 4 | 2 | | | DAKSHESH DARURI | | | 6.0 | |W 63|W 58|L 4|W 17|W 16|W 20|W 7| |
| 5 | MI | | | 14598900 / R: 1553 ->1663 | | | N:2 | |B |W |B |W |B |W |B | |
| 7 | 3 | | | ADITYA BAJAJ | | | 6.0 | |L 8|W 61|W 25|W 21|W 11|W 13|W 12| |
| 8 | MI | | | 14959604 / R: 1384 ->1640 | | | N:2 | |W |B |W |B |W |B |W | |
I created two data frames, each one holding one line of information about each player. I ignored the line with transition ‘-’ symbols. The columns labled with a ‘d’ stand for dummy and are immediately removed. I added the number row to the second data frame to have a common column upon which to join the data frames.
players.row.1 <- raw.data[seq(1, nrow(raw.data), 3), ] %>%
setNames(c('number', 'd1', 'name', 'd2', 'points', 'opponents')) %>%
select(-d1, -d2)
players.row.2 <- raw.data[seq(2, nrow(raw.data), 3), ] %>%
setNames(c('location', 'd1', 'pre', 'd2', 'd3', 'd4')) %>%
select(-d1, -d2, -d3, -d4) %>%
mutate(number=players.row.1$number)
players.row.1 %>% head() %>% kable()
| number | name | points | opponents | |
|---|---|---|---|---|
| 1 | 1 | GARY HUA | 6.0 | |W 39|W 21|W 18|W 14|W 7|D 12|D 4| |
| 4 | 2 | DAKSHESH DARURI | 6.0 | |W 63|W 58|L 4|W 17|W 16|W 20|W 7| |
| 7 | 3 | ADITYA BAJAJ | 6.0 | |L 8|W 61|W 25|W 21|W 11|W 13|W 12| |
| 10 | 4 | PATRICK H SCHILLING | 5.5 | |W 23|D 28|W 2|W 26|D 5|W 19|D 1| |
| 13 | 5 | HANSHI ZUO | 5.5 | |W 45|W 37|D 12|D 13|D 4|W 14|W 17| |
| 16 | 6 | HANSEN SONG | 5.0 | |W 34|D 29|L 11|W 35|D 10|W 27|W 21| |
players.row.2 %>% head() %>% kable()
| location | pre | number | |
|---|---|---|---|
| 2 | ON | 15445895 / R: 1794 ->1817 | 1 |
| 5 | MI | 14598900 / R: 1553 ->1663 | 2 |
| 8 | MI | 14959604 / R: 1384 ->1640 | 3 |
| 11 | MI | 12616049 / R: 1716 ->1744 | 4 |
| 14 | MI | 14601533 / R: 1655 ->1690 | 5 |
| 17 | OH | 15055204 / R: 1686 ->1687 | 6 |
I combined the data frames into one and rearranged the columns to match the intended output.
players.combined <- left_join(players.row.1, players.row.2, by='number') %>%
subset(select=c(2, 5, 3, 6, 4)) %>%
map_df(str_trim)
players.combined %>% head() %>% kable()
| name | location | points | pre | opponents |
|---|---|---|---|---|
| GARY HUA | ON | 6.0 | 15445895 / R: 1794 ->1817 | |W 39|W 21|W 18|W 14|W 7|D 12|D 4| |
| DAKSHESH DARURI | MI | 6.0 | 14598900 / R: 1553 ->1663 | |W 63|W 58|L 4|W 17|W 16|W 20|W 7| |
| ADITYA BAJAJ | MI | 6.0 | 14959604 / R: 1384 ->1640 | |L 8|W 61|W 25|W 21|W 11|W 13|W 12| |
| PATRICK H SCHILLING | MI | 5.5 | 12616049 / R: 1716 ->1744 | |W 23|D 28|W 2|W 26|D 5|W 19|D 1| |
| HANSHI ZUO | MI | 5.5 | 14601533 / R: 1655 ->1690 | |W 45|W 37|D 12|D 13|D 4|W 14|W 17| |
| HANSEN SONG | OH | 5.0 | 15055204 / R: 1686 ->1687 | |W 34|D 29|L 11|W 35|D 10|W 27|W 21| |
Three columns need to be further modified from the raw data contained inside of them. The ‘points’ column needs to be changed to a numeric. The ‘pre’ column has the specific points pulled and changed to a numeric. The ‘opponents’ column needs more work. I used ‘str_extract_all’ to remove all the numbers and then change them to numerics. At this moment I had a vector of numerics representing the opponents played by each player. From there I used ‘mapvalues’ to swap out the opponents number and swap in their pre score. Finally, I found the mean of those values and rounded it to the nearest whole number.
players.combined$points %<>%
as.numeric()
players.combined$pre %<>%
str_sub(15, 18) %>%
as.numeric()
players.combined$opponents %<>%
map_dbl(. %>% str_extract_all('\\d+') %>%
unlist() %>%
as.numeric() %>%
mapvalues(seq_along(players.combined$pre), players.combined$pre) %>%
mean() %>%
round()
)
players.combined %>% head() %>% kable()
| name | location | points | pre | opponents |
|---|---|---|---|---|
| GARY HUA | ON | 6.0 | 1794 | 1605 |
| DAKSHESH DARURI | MI | 6.0 | 1553 | 1469 |
| ADITYA BAJAJ | MI | 6.0 | 1384 | 1564 |
| PATRICK H SCHILLING | MI | 5.5 | 1716 | 1574 |
| HANSHI ZUO | MI | 5.5 | 1655 | 1501 |
| HANSEN SONG | OH | 5.0 | 1686 | 1519 |
Finally, I wrote the wrote the file to a csv. It is ready to be read into an sql table.
write.csv(players.combined, file='ChessData.csv')
The data has been successfully processed as described and written to a csv file. It is also ready for further processing in this R file if needed. I actually wrote out a seperate solution first that relied more heavily on regular expressions but I ended up liking this answer better. The regular expressions needed to extract each piece of information grew to an unwieldy length. I still use regular expressions in this solution, just when they were the single best tool to accomplish the task. I also made use of ‘str_trim’ and ‘str_sub’.