Why This Project

This project is completed as partial fulfillment of the coursework for IS 607, which forms part of the Data Analytics Masters Degree at the City University of New York. Due to scheduling constraints this was carried out as a solo project.

History of the Struggle.

This is an example of the kind of work that is sourced from many other people with more experience than me!

I began this quest with a total red herring – having just done a ton of work in regex, I assumed quite incorrectly that the heavy lifting of scraping the data would need to be done by regex. In the interest of humility, and because my regex was pretty killer, I leave as an appendix the tables of regex analysis I originally did at the beginning of this project.

Then Arun Kumar saved me from toiling fruitlessly when he cryptically said something to the effect that I was missing the point and that there was a faster way to do this. Upon which I exclaimed, “Aha! It has to do with delimiters, doesn’t it?”. The vertical pipes were staring me in the face. We exchanged a few messages back and forth (he was helpful without being too helpful) and I began a new approach.

The first problem I felt I’d need to handle would be getting a vector of lines into some sort of data frame that was three columns wide – one column for each row relating to a single player. This is where StackExchange rescued me (http://stackoverflow.com/questions/17752830/r-reshape-a-vector-into-multiple-columns, note the helpful advice to RTFM, which was a new and instantly understandable acronym.) Once I could get my data reshaped, the rest seemed pretty darn easy (thanks to Arun).

I did struggle a bit to figure out just how to use apply(), so I document here my actual to-and-fro of figuring out how to do it. This document is therefore not the shortest distance between two points but rather the documentation of a thought process.

The Good Stuff – Setting up the Data

First, I get the data from my GitHub, leaving out the first three rows of header data. Note, the warning about the incomplete final line can be ignored.

download.file("https://raw.githubusercontent.com/pm0kjp/IS607/master/tournamentinfo.txt", "tournamentinfo.txt", method="curl")
chess_text<-readLines("tournamentinfo.txt")

## Warning in readLines("tournamentinfo.txt"): incomplete final line found on
## 'tournamentinfo.txt'

chess_text<-chess_text[5:196]
head(chess_text)

## [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
## [2] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [3] "-----------------------------------------------------------------------------------------"
## [4] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
## [5] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [6] "-----------------------------------------------------------------------------------------"

Then I’ll reshape it into a data frame that’s three columns wide.

chess_df<-data.frame(matrix(chess_text, ncol =3 , byrow = TRUE))
head(chess_df)

##                                                                                          X1
## 1     1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|
## 2     2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|
## 3     3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|
## 4     4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|
## 5     5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|
## 6     6 | HANSEN SONG                     |5.0  |W  34|D  29|L  11|W  35|D  10|W  27|W  21|
##                                                                                          X2
## 1    ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |
## 2    MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |
## 3    MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |
## 4    MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |
## 5    MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |
## 6    OH | 15055204 / R: 1686   ->1687     |N:3  |W    |B    |W    |B    |B    |W    |B    |
##                                                                                          X3
## 1 -----------------------------------------------------------------------------------------
## 2 -----------------------------------------------------------------------------------------
## 3 -----------------------------------------------------------------------------------------
## 4 -----------------------------------------------------------------------------------------
## 5 -----------------------------------------------------------------------------------------
## 6 -----------------------------------------------------------------------------------------

I don’t really need the last column, which consists of all hyphens.

chess_df<-chess_df[,1:2]

Now I want to split my two remaining column values, using the vertical pipe as my delimiter:

row1<-data.frame(do.call('rbind', strsplit(as.character(chess_df[,1]), '|', fixed=TRUE)))
row2<-data.frame(do.call('rbind', strsplit(as.character(chess_df[,2]), '|', fixed=TRUE)))
# I recombine them into one:
chess_df<-cbind(row1, row2)
head(chess_df)

##       X1                                X2    X3    X4    X5    X6    X7
## 1     1   GARY HUA                         6.0   W  39 W  21 W  18 W  14
## 2     2   DAKSHESH DARURI                  6.0   W  63 W  58 L   4 W  17
## 3     3   ADITYA BAJAJ                     6.0   L   8 W  61 W  25 W  21
## 4     4   PATRICK H SCHILLING              5.5   W  23 D  28 W   2 W  26
## 5     5   HANSHI ZUO                       5.5   W  45 W  37 D  12 D  13
## 6     6   HANSEN SONG                      5.0   W  34 D  29 L  11 W  35
##      X8    X9   X10     X1                                X2    X3    X4
## 1 W   7 D  12 D   4    ON   15445895 / R: 1794   ->1817      N:2   W    
## 2 W  16 W  20 W   7    MI   14598900 / R: 1553   ->1663      N:2   B    
## 3 W  11 W  13 W  12    MI   14959604 / R: 1384   ->1640      N:2   W    
## 4 D   5 W  19 D   1    MI   12616049 / R: 1716   ->1744      N:2   W    
## 5 D   4 W  14 W  17    MI   14601533 / R: 1655   ->1690      N:2   B    
## 6 D  10 W  27 W  21    OH   15055204 / R: 1686   ->1687      N:3   W    
##      X5    X6    X7    X8    X9   X10
## 1 B     W     B     W     B     W    
## 2 W     B     W     B     W     B    
## 3 B     W     B     W     B     W    
## 4 B     W     B     W     B     B    
## 5 W     B     W     B     W     B    
## 6 B     W     B     B     W     B

I’m far from done, because I have some columns that contain several fields that need extraction, while other columns should be deleted. Let’s drop and rename to make things a bit simpler:

chess_df<-chess_df[,1:12]
colnames(chess_df)<-c('id','name','points', 'round_1', 'round_2',
                      'round_3','round_4','round_5', 'round_6', 'round_7',
                      'state', 'rating_change')

So far, id, name, and points are already separated. However, the seven games have both the game status (a letter) and the opponent (a number). We want to keep the number only. Heck, we’ll go ahead and make sure it’s an integer, while we’re at it.

library(stringr)
chess_df$round_1<-as.integer(str_extract(chess_df$round_1, "[[:digit:]]+"))
chess_df$round_2<-as.integer(str_extract(chess_df$round_2, "[[:digit:]]+"))
chess_df$round_3<-as.integer(str_extract(chess_df$round_3, "[[:digit:]]+"))
chess_df$round_4<-as.integer(str_extract(chess_df$round_4, "[[:digit:]]+"))
chess_df$round_5<-as.integer(str_extract(chess_df$round_5, "[[:digit:]]+"))
chess_df$round_6<-as.integer(str_extract(chess_df$round_6, "[[:digit:]]+"))
chess_df$round_7<-as.integer(str_extract(chess_df$round_7, "[[:digit:]]+"))

We’re getting much closer!

head(chess_df)

##       id                              name points round_1 round_2 round_3
## 1     1   GARY HUA                          6.0        39      21      18
## 2     2   DAKSHESH DARURI                   6.0        63      58       4
## 3     3   ADITYA BAJAJ                      6.0         8      61      25
## 4     4   PATRICK H SCHILLING               5.5        23      28       2
## 5     5   HANSHI ZUO                        5.5        45      37      12
## 6     6   HANSEN SONG                       5.0        34      29      11
##   round_4 round_5 round_6 round_7  state                     rating_change
## 1      14       7      12       4    ON   15445895 / R: 1794   ->1817     
## 2      17      16      20       7    MI   14598900 / R: 1553   ->1663     
## 3      21      11      13      12    MI   14959604 / R: 1384   ->1640     
## 4      26       5      19       1    MI   12616049 / R: 1716   ->1744     
## 5      13       4      14      17    MI   14601533 / R: 1655   ->1690     
## 6      35      10      27      21    OH   15055204 / R: 1686   ->1687

But… we still have to extract pre and post ratings from the rating_change column. Luckily I still have my awesome regex from back when I thought I had to regex the world.

pre_post<-data.frame(str_match(chess_df$rating_change,".+\\:\\s+([[:digit:]]+).+[\\-]>\\s*([[:digit:]]+)"))
head(pre_post)

##                             X1   X2   X3
## 1  15445895 / R: 1794   ->1817 1794 1817
## 2  14598900 / R: 1553   ->1663 1553 1663
## 3  14959604 / R: 1384   ->1640 1384 1640
## 4  12616049 / R: 1716   ->1744 1716 1744
## 5  14601533 / R: 1655   ->1690 1655 1690
## 6  15055204 / R: 1686   ->1687 1686 1687

I’ll remove the old column with the ratings information and replace it with a columns that represent the pre ratings.

library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

chess_df<-select(chess_df, -rating_change)
chess_df<-cbind(chess_df,pre_post[2])
# add a better name
colnames(chess_df)[12]<-"pre_rating"
# oh, and make it an integer, too, by way of character (factor->character-> integer)
chess_df$pre_rating<-as.integer(as.character(chess_df$pre_rating))

How are we doing so far? Pretty good! Let’s also look at the attributes of the data frame to see if we want to do additional cleanup before proceeding:

head(chess_df)

##       id                              name points round_1 round_2 round_3
## 1     1   GARY HUA                          6.0        39      21      18
## 2     2   DAKSHESH DARURI                   6.0        63      58       4
## 3     3   ADITYA BAJAJ                      6.0         8      61      25
## 4     4   PATRICK H SCHILLING               5.5        23      28       2
## 5     5   HANSHI ZUO                        5.5        45      37      12
## 6     6   HANSEN SONG                       5.0        34      29      11
##   round_4 round_5 round_6 round_7  state pre_rating
## 1      14       7      12       4    ON        1794
## 2      17      16      20       7    MI        1553
## 3      21      11      13      12    MI        1384
## 4      26       5      19       1    MI        1716
## 5      13       4      14      17    MI        1655
## 6      35      10      27      21    OH        1686

str(chess_df)

## 'data.frame':    64 obs. of  12 variables:
##  $ id        : Factor w/ 64 levels "    1 ","    2 ",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ name      : Factor w/ 64 levels " ADITYA BAJAJ                    ",..: 24 12 1 51 28 27 23 21 59 5 ...
##  $ points    : Factor w/ 11 levels "1.0  ","1.5  ",..: 11 11 11 10 10 9 9 9 9 9 ...
##  $ round_1   : int  39 63 8 23 45 34 57 3 25 16 ...
##  $ round_2   : int  21 58 61 28 37 29 46 32 18 19 ...
##  $ round_3   : int  18 4 25 2 12 11 13 14 59 55 ...
##  $ round_4   : int  14 17 21 26 13 35 11 9 8 31 ...
##  $ round_5   : int  7 16 11 5 4 10 1 47 26 6 ...
##  $ round_6   : int  12 20 13 19 14 27 9 28 7 25 ...
##  $ round_7   : int  4 7 12 1 17 21 2 19 20 18 ...
##  $ state     : Factor w/ 3 levels "   MI ","   OH ",..: 3 1 1 1 1 2 1 1 3 1 ...
##  $ pre_rating: int  1794 1553 1384 1716 1655 1686 1649 1641 1411 1365 ...

Before we go any farther, let’s do a tiny bit more cleanup. We want the id column to be integer, and we should trim the spaces in the name and state field, just to preserve disk space and really get tight, tidy data. We’ll also ensure that the points column is numeric.

chess_df$id<-as.integer(chess_df$id)
chess_df$name<-str_trim(as.character(chess_df$name, side = "both"))
chess_df$state<-str_trim(as.character(chess_df$state), side = "both")
# for the points column, it's a factor, so we have to first convert it to character, then numeric.
chess_df$points<-as.numeric(as.character(chess_df$points))

The Better Stuff – Calculation!

The nice thing about having the id’s of opponents in the various columns of any player is that we can use those values as row indexes. So, if I want to find the pre-rating score for player 1’s first opponent, I figure out how to locate the id of player 1’s first opponent, which is:

chess_df[1,"round_1"]

## [1] 39

Then, I can use that as the row index, and pre_rating as the column identifier, for the value of the first opponent’s pre rating:

chess_df[chess_df[1,"round_1"],"pre_rating"]

## [1] 1436

I think I know how to calculate the sum of opponent pre-ratings for a given tournament player. Let me test my hypothesis. This is what I came up with for player number 1:

sum(chess_df[chess_df[1,"round_1"],"pre_rating"],chess_df[chess_df[1,"round_2"],"pre_rating"], chess_df[chess_df[1,"round_3"],"pre_rating"],chess_df[chess_df[1,"round_4"],"pre_rating"], chess_df[chess_df[1,"round_5"],"pre_rating"],chess_df[chess_df[1,"round_6"],"pre_rating"], chess_df[chess_df[1,"round_7"],"pre_rating"], na.rm=TRUE)

## [1] 11237

Is this true? Well, the opponents of player 1 and their scores are below:

Opponent	Opponent’s Pre-Rating
39	1436
21	1563
18	1600
14	1610
7	1649
12	1663
4	1716
TOTAL	11237

OK, great, the total seems legit. What about the number of opponents? I think that I can find the total number of opponents with this R code:

sum(!is.na(c(chess_df[chess_df[1,"round_1"],"pre_rating"],chess_df[chess_df[1,"round_2"],"pre_rating"], chess_df[chess_df[1,"round_3"],"pre_rating"],chess_df[chess_df[1,"round_4"],"pre_rating"], chess_df[chess_df[1,"round_5"],"pre_rating"],chess_df[chess_df[1,"round_6"],"pre_rating"], chess_df[chess_df[1,"round_7"],"pre_rating"])))

## [1] 7

As expected, this gives me 7. So with the total rating and the number of opponents, I can easily figure out the average. I should also test my method using a row I know has some “empty opponents” or games that were not played.

I’ll do exactly what I did for player 1, but this time for player 53, who had several unplayed rounds.

sum(chess_df[chess_df[53,"round_1"],"pre_rating"],chess_df[chess_df[53,"round_2"],"pre_rating"], chess_df[chess_df[53,"round_3"],"pre_rating"],chess_df[chess_df[53,"round_4"],"pre_rating"], chess_df[chess_df[53,"round_5"],"pre_rating"],chess_df[chess_df[53,"round_6"],"pre_rating"], chess_df[chess_df[53,"round_7"],"pre_rating"], na.rm=TRUE)

## [1] 4036

Is this true? Well, the opponents of player 53 and their scores are below:

Opponent	Opponent’s Pre-Rating
25	1745
44	1199
57	1092
TOTAL	4036

OK, great, the total seems legit. What about the number of opponents? I think that I can find the total number of opponents with this R code:

sum(!is.na(c(chess_df[chess_df[53,"round_1"],"pre_rating"],chess_df[chess_df[53,"round_2"],"pre_rating"], chess_df[chess_df[53,"round_3"],"pre_rating"],chess_df[chess_df[53,"round_4"],"pre_rating"], chess_df[chess_df[53,"round_5"],"pre_rating"],chess_df[chess_df[53,"round_6"],"pre_rating"], chess_df[chess_df[53,"round_7"],"pre_rating"])))

## [1] 3

As expected, I come up with the value 3.

Well, I have a model, but I really don’t want to rewrite this 64 times OR create a loop that does it. I’ll try to use apply. This means that I first have to make an array (or data frame that can be coerced to an array), this time with only the rounds in it. I bet this means I can shorten all that repetitive “round_1”, “round_2” retyping!

opponent_df<-select(chess_df, starts_with("round_"))
head(opponent_df)

##   round_1 round_2 round_3 round_4 round_5 round_6 round_7
## 1      39      21      18      14       7      12       4
## 2      63      58       4      17      16      20       7
## 3       8      61      25      21      11      13      12
## 4      23      28       2      26       5      19       1
## 5      45      37      12      13       4      14      17
## 6      34      29      11      35      10      27      21

What if I just used apply to do the indexing I did above manually in order to find the pre ratings of opponents? Something like this?

opponent_pre_ratings<-apply(opponent_df, c(1,2), function(x) chess_df[x,"pre_rating"])

Let’s check the first row and the fifty-third, since they represent the first and fifty-third tournament participant, and we’ve already looked all that stuff up:

opponent_pre_ratings[1,]

## round_1 round_2 round_3 round_4 round_5 round_6 round_7 
##    1436    1563    1600    1610    1649    1663    1716

opponent_pre_ratings[53,]

## round_1 round_2 round_3 round_4 round_5 round_6 round_7 
##      NA    1745      NA    1199      NA    1092      NA

This is great! Now all I have to do is find the mean and add that to my original data frame. None of the typing and repetition I thought it would be.

chess_df$opponent_avg_pre_rating<-round(rowMeans(opponent_pre_ratings, na.rm=TRUE))

Now I’ll use dplyr arrange to limit the data to just what I was asked to provide:

final_chess_df<-select(chess_df, name, state, points, pre_rating, opponent_avg_pre_rating)
head(final_chess_df)

##                  name state points pre_rating opponent_avg_pre_rating
## 1            GARY HUA    ON    6.0       1794                    1605
## 2     DAKSHESH DARURI    MI    6.0       1553                    1469
## 3        ADITYA BAJAJ    MI    6.0       1384                    1564
## 4 PATRICK H SCHILLING    MI    5.5       1716                    1574
## 5          HANSHI ZUO    MI    5.5       1655                    1501
## 6         HANSEN SONG    OH    5.0       1686                    1519

All done. Now to write the csv:

write.csv(final_chess_df,"final_chess_data.csv")

Clean up

rm(list=ls())

Appendix: All Regex, All the Time

We start out with a sample file and consider what the potential regex code would be for each element:

Header Lines (Lines 1 - 4 of the text file:) Ignore (we’ll remove these)

Three Lines Apiece for Each Player: Note: We don’t include the EOL or \n character, since flattening the text seems helpful.

Line 1: Player identification, play results, and opponents

Description	Regex	Disposition
some spaces	\\s+	toss this
some numbers	[[:digit:]]+	keep this as player id
a space, maybe two?	\\s+	toss this
a vertical pipe (divider)	\\\|	toss this
a space, maybe two?	\\s+	toss this
some alphabetics and spaces	[\\s,[:alpha:]]+	keep as name (can trim spaces off the end later)
a vertical pipe (divider)	\\\|	toss this
some numbers with a decimal point	[\\.,[:digit:]]+	keep this as points
a space, maybe two?	\\s+	toss this
a vertical pipe (divider)	\\\|	toss this
Italicized lines below can repeat up to seven times, with possible absences among the repetitions
A capital letter	[A-Z]	toss this, we only care about opponent
a space, maybe two?	\\s+	toss this
some numbers, maybe!	[[:digit:]]*	keep as opponent ID
a vertical pipe (divider)	\\\|	toss this

Line 2: Player state, before and after rating

Description	Regex	Disposition
some spaces	\\s+	toss this
some letters (probably two)	[[:alpha:]]+	keep this as player state
a space, maybe two?	\\s+	toss this
a bunch of characters not including a colon	[^:]	toss this out
a colon and space	:\\s	this is a divider, toss it
some numbers	[[:digit:]]+	keep this as previous rating
a bunch of characters not including a hyphen	[^\\-]+	toss it
a forward arrow (hyphen gt)	\\-\\<	toss it
some numbers	[[:digit:]]+	keep as post-tourney rating
a bunch of other characters with no hyphen	[^\\-]+	toss – we’re done!

Line 3: All hyphens

Description	Regex	Disposition
A bunch of hyphens	\\-+	Toss it. Separator only.

Project 1 — IS 607

Joy Payton

September 24, 2015

Why This Project

History of the Struggle.

The Good Stuff – Setting up the Data

The Better Stuff – Calculation!

Appendix: All Regex, All the Time