Project 1

By Brian Weinfeld

February 18, 2018

I began by reading in the data. I decided to use read.fwf which is designed to be used with files with arbitrary formatting but where every line is the same width and all the information is in the same position on each line. The resulting data frame is a bit messy but can be cleaned up quite easily.

con <- file('https://raw.githubusercontent.com/brian-cuny/607project1/master/tournamentinfo.txt', open='r')
raw.data <- read.fwf(con, widths=c(6, 2, 32, 1, 5, 43), skip=4, stringsAsFactors=FALSE) 
close(con)
raw.data[which(1:nrow(raw.data) %% 3 != 0), ] %>% head() %>% kable()
V1 V2 V3 V4 V5 V6
1 1 | GARY HUA | 6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
2 ON | 15445895 / R: 1794 ->1817 | N:2 |W |B |W |B |W |B |W |
4 2 | DAKSHESH DARURI | 6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|
5 MI | 14598900 / R: 1553 ->1663 | N:2 |B |W |B |W |B |W |B |
7 3 | ADITYA BAJAJ | 6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|
8 MI | 14959604 / R: 1384 ->1640 | N:2 |W |B |W |B |W |B |W |

I created two data frames, each one holding one line of information about each player. I ignored the line with transition ‘-’ symbols. The columns labled with a ‘d’ stand for dummy and are immediately removed. I added the number row to the second data frame to have a common column upon which to join the data frames.

players.row.1 <- raw.data[seq(1, nrow(raw.data), 3), ] %>% 
  setNames(c('number', 'd1', 'name', 'd2', 'points', 'opponents')) %>% 
  select(-d1, -d2)
players.row.2 <- raw.data[seq(2, nrow(raw.data), 3), ] %>% 
  setNames(c('location', 'd1', 'pre', 'd2', 'd3', 'd4')) %>% 
  select(-d1, -d2, -d3, -d4) %>% 
  mutate(number=players.row.1$number)
players.row.1 %>% head() %>% kable()
number name points opponents
1 1 GARY HUA 6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
4 2 DAKSHESH DARURI 6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|
7 3 ADITYA BAJAJ 6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|
10 4 PATRICK H SCHILLING 5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|
13 5 HANSHI ZUO 5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|
16 6 HANSEN SONG 5.0 |W 34|D 29|L 11|W 35|D 10|W 27|W 21|
players.row.2 %>% head() %>% kable()
location pre number
2 ON 15445895 / R: 1794 ->1817 1
5 MI 14598900 / R: 1553 ->1663 2
8 MI 14959604 / R: 1384 ->1640 3
11 MI 12616049 / R: 1716 ->1744 4
14 MI 14601533 / R: 1655 ->1690 5
17 OH 15055204 / R: 1686 ->1687 6

I combined the data frames into one and rearranged the columns to match the intended output.

players.combined <- left_join(players.row.1, players.row.2, by='number') %>% 
  subset(select=c(2, 5, 3, 6, 4)) %>%
  map_df(str_trim)
players.combined %>% head() %>% kable()
name location points pre opponents
GARY HUA ON 6.0 15445895 / R: 1794 ->1817 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
DAKSHESH DARURI MI 6.0 14598900 / R: 1553 ->1663 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|
ADITYA BAJAJ MI 6.0 14959604 / R: 1384 ->1640 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|
PATRICK H SCHILLING MI 5.5 12616049 / R: 1716 ->1744 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|
HANSHI ZUO MI 5.5 14601533 / R: 1655 ->1690 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|
HANSEN SONG OH 5.0 15055204 / R: 1686 ->1687 |W 34|D 29|L 11|W 35|D 10|W 27|W 21|

Three columns need to be further modified from the raw data contained inside of them.
The ‘points’ column needs to be changed to a numeric.
The ‘pre’ column has the specific points pulled and changed to a numeric.
The ‘opponents’ column needs more work. I used ‘str_extract_all’ to remove all the numbers and then change them to numerics. At this moment I had a vector of numerics representing the opponents played by each player. From there I used ‘mapvalues’ to swap out the opponents number and swap in their pre score. Finally, I found the mean of those values and rounded it to the nearest whole number.

players.combined$points %<>% 
  as.numeric()
players.combined$pre %<>% 
  str_sub(15, 18) %>% 
  as.numeric()
players.combined$opponents %<>% 
  map_dbl(. %>% str_extract_all('\\d+') %>%
            unlist() %>% 
            as.numeric() %>% 
            mapvalues(seq_along(players.combined$pre), players.combined$pre) %>% 
            mean() %>% 
            round()
         )
players.combined %>% head() %>% kable()
name location points pre opponents
GARY HUA ON 6.0 1794 1605
DAKSHESH DARURI MI 6.0 1553 1469
ADITYA BAJAJ MI 6.0 1384 1564
PATRICK H SCHILLING MI 5.5 1716 1574
HANSHI ZUO MI 5.5 1655 1501
HANSEN SONG OH 5.0 1686 1519

Finally, I wrote the wrote the file to a csv. It is ready to be read into an sql table.

write.csv(players.combined, file='ChessData.csv')

The data has been successfully processed as described and written to a csv file. It is also ready for further processing in this R file if needed. I actually wrote out a seperate solution first that relied more heavily on regular expressions but I ended up liking this answer better. The regular expressions needed to extract each piece of information grew to an unwieldy length. I still use regular expressions in this solution, just when they were the single best tool to accomplish the task. I also made use of ‘str_trim’ and ‘str_sub’.