Project 1

By Brian Weinfeld

February 18, 2018

Chess Ranking Manipulations

I began by reading in the data. I decided to use read.fwf which is designed to be used with files with arbitrary formatting but where every line is the same width and all the information is in the same position on each line. The resulting data frame is a bit messy but can be cleaned up quite easily.

con <- file('https://raw.githubusercontent.com/brian-cuny/607project1/master/tournamentinfo.txt', open='r')
raw.data <- read.fwf(con, widths=c(6, 2, 32, 1, 5, 43), skip=4, stringsAsFactors=FALSE) 
close(con)
raw.data[which(1:nrow(raw.data) %% 3 != 0), ] %>% head() %>% kable()

	V1	V2	V3	V4	V5	V6
1	1	\|	GARY HUA	\|	6.0	\|W 39\|W 21\|W 18\|W 14\|W 7\|D 12\|D 4\|
2	ON	\|	15445895 / R: 1794 ->1817	\|	N:2	\|W \|B \|W \|B \|W \|B \|W \|
4	2	\|	DAKSHESH DARURI	\|	6.0	\|W 63\|W 58\|L 4\|W 17\|W 16\|W 20\|W 7\|
5	MI	\|	14598900 / R: 1553 ->1663	\|	N:2	\|B \|W \|B \|W \|B \|W \|B \|
7	3	\|	ADITYA BAJAJ	\|	6.0	\|L 8\|W 61\|W 25\|W 21\|W 11\|W 13\|W 12\|
8	MI	\|	14959604 / R: 1384 ->1640	\|	N:2	\|W \|B \|W \|B \|W \|B \|W \|

I created two data frames, each one holding one line of information about each player. I ignored the line with transition ‘-’ symbols. The columns labled with a ‘d’ stand for dummy and are immediately removed. I added the number row to the second data frame to have a common column upon which to join the data frames.

players.row.1 <- raw.data[seq(1, nrow(raw.data), 3), ] %>% 
  setNames(c('number', 'd1', 'name', 'd2', 'points', 'opponents')) %>% 
  select(-d1, -d2)
players.row.2 <- raw.data[seq(2, nrow(raw.data), 3), ] %>% 
  setNames(c('location', 'd1', 'pre', 'd2', 'd3', 'd4')) %>% 
  select(-d1, -d2, -d3, -d4) %>% 
  mutate(number=players.row.1$number)
players.row.1 %>% head() %>% kable()

	number	name	points	opponents
1	1	GARY HUA	6.0	\|W 39\|W 21\|W 18\|W 14\|W 7\|D 12\|D 4\|
4	2	DAKSHESH DARURI	6.0	\|W 63\|W 58\|L 4\|W 17\|W 16\|W 20\|W 7\|
7	3	ADITYA BAJAJ	6.0	\|L 8\|W 61\|W 25\|W 21\|W 11\|W 13\|W 12\|
10	4	PATRICK H SCHILLING	5.5	\|W 23\|D 28\|W 2\|W 26\|D 5\|W 19\|D 1\|
13	5	HANSHI ZUO	5.5	\|W 45\|W 37\|D 12\|D 13\|D 4\|W 14\|W 17\|
16	6	HANSEN SONG	5.0	\|W 34\|D 29\|L 11\|W 35\|D 10\|W 27\|W 21\|

players.row.2 %>% head() %>% kable()

	location	pre	number
2	ON	15445895 / R: 1794 ->1817	1
5	MI	14598900 / R: 1553 ->1663	2
8	MI	14959604 / R: 1384 ->1640	3
11	MI	12616049 / R: 1716 ->1744	4
14	MI	14601533 / R: 1655 ->1690	5
17	OH	15055204 / R: 1686 ->1687	6

I combined the data frames into one and rearranged the columns to match the intended output.

players.combined <- left_join(players.row.1, players.row.2, by='number') %>% 
  subset(select=c(2, 5, 3, 6, 4)) %>%
  map_df(str_trim)
players.combined %>% head() %>% kable()

name	location	points	pre	opponents
GARY HUA	ON	6.0	15445895 / R: 1794 ->1817	\|W 39\|W 21\|W 18\|W 14\|W 7\|D 12\|D 4\|
DAKSHESH DARURI	MI	6.0	14598900 / R: 1553 ->1663	\|W 63\|W 58\|L 4\|W 17\|W 16\|W 20\|W 7\|
ADITYA BAJAJ	MI	6.0	14959604 / R: 1384 ->1640	\|L 8\|W 61\|W 25\|W 21\|W 11\|W 13\|W 12\|
PATRICK H SCHILLING	MI	5.5	12616049 / R: 1716 ->1744	\|W 23\|D 28\|W 2\|W 26\|D 5\|W 19\|D 1\|
HANSHI ZUO	MI	5.5	14601533 / R: 1655 ->1690	\|W 45\|W 37\|D 12\|D 13\|D 4\|W 14\|W 17\|
HANSEN SONG	OH	5.0	15055204 / R: 1686 ->1687	\|W 34\|D 29\|L 11\|W 35\|D 10\|W 27\|W 21\|

Three columns need to be further modified from the raw data contained inside of them.
The ‘points’ column needs to be changed to a numeric.
The ‘pre’ column has the specific points pulled and changed to a numeric.
The ‘opponents’ column needs more work. I used ‘str_extract_all’ to remove all the numbers and then change them to numerics. At this moment I had a vector of numerics representing the opponents played by each player. From there I used ‘mapvalues’ to swap out the opponents number and swap in their pre score. Finally, I found the mean of those values and rounded it to the nearest whole number.

players.combined$points %<>% 
  as.numeric()
players.combined$pre %<>% 
  str_sub(15, 18) %>% 
  as.numeric()
players.combined$opponents %<>% 
  map_dbl(. %>% str_extract_all('\\d+') %>%
            unlist() %>% 
            as.numeric() %>% 
            mapvalues(seq_along(players.combined$pre), players.combined$pre) %>% 
            mean() %>% 
            round()
         )
players.combined %>% head() %>% kable()

name	location	points	pre	opponents
GARY HUA	ON	6.0	1794	1605
DAKSHESH DARURI	MI	6.0	1553	1469
ADITYA BAJAJ	MI	6.0	1384	1564
PATRICK H SCHILLING	MI	5.5	1716	1574
HANSHI ZUO	MI	5.5	1655	1501
HANSEN SONG	OH	5.0	1686	1519

Finally, I wrote the wrote the file to a csv. It is ready to be read into an sql table.

write.csv(players.combined, file='ChessData.csv')

The data has been successfully processed as described and written to a csv file. It is also ready for further processing in this R file if needed. I actually wrote out a seperate solution first that relied more heavily on regular expressions but I ended up liking this answer better. The regular expressions needed to extract each piece of information grew to an unwieldy length. I still use regular expressions in this solution, just when they were the single best tool to accomplish the task. I also made use of ‘str_trim’ and ‘str_sub’.