Introduction

In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players: Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents

Data Upload & Initial Investigation

I started with loading the file into R. It was clear that the raw text file used the “|” character to delineate cells, so the read.delim function was the logical first step. Once the data was loaded, I wanted to explore how it was structured, and where my target information was located.

library(RCurl)
x <- getURL("https://raw.githubusercontent.com/ChristopherBloome/607/master/tournamentinfo.txt")
Table <- read.delim("https://raw.githubusercontent.com/ChristopherBloome/607/master/tournamentinfo.txt", header = FALSE, sep = "|")
head(Table,10)

##                                                                                           V1
## 1  -----------------------------------------------------------------------------------------
## 2                                                                                      Pair 
## 3                                                                                      Num  
## 4  -----------------------------------------------------------------------------------------
## 5                                                                                         1 
## 6                                                                                        ON 
## 7  -----------------------------------------------------------------------------------------
## 8                                                                                         2 
## 9                                                                                        MI 
## 10 -----------------------------------------------------------------------------------------
##                                   V2    V3    V4    V5    V6    V7    V8    V9
## 1                                                                             
## 2   Player Name                      Total Round Round Round Round Round Round
## 3   USCF ID / Rtg (Pre->Post)         Pts    1     2     3     4     5     6  
## 4                                                                             
## 5   GARY HUA                         6.0   W  39 W  21 W  18 W  14 W   7 D  12
## 6   15445895 / R: 1794   ->1817      N:2   W     B     W     B     W     B    
## 7                                                                             
## 8   DAKSHESH DARURI                  6.0   W  63 W  58 L   4 W  17 W  16 W  20
## 9   14598900 / R: 1553   ->1663      N:2   B     W     B     W     B     W    
## 10                                                                            
##      V10 V11
## 1         NA
## 2  Round  NA
## 3    7    NA
## 4         NA
## 5  D   4  NA
## 6  W      NA
## 7         NA
## 8  W   7  NA
## 9  B      NA
## 10        NA

Strategy

The first thing that stands out is that each “cell” as drawn in the ANSI characters takes up 2 rows in our table. These rows are separated by an additional row that is mostly blank, with the exception of the first column which contains a string of en-dashes “-”.

The relevant information for each competitor is spread across each of these two rows. The top row contains the competitors name, their total points, and a series of cells that indicate who their competitor was for that round and if they won or loss. The lower row contains a cell which has the player ID and their rating before and after the tournament, followed by information indicating which color they played in a given round.

My first step is to remove the “extra” rows from our data-set, then restructure the data such that all relevant information for each competitor is in one row. Once we have one row per competitor, we can use Regex functions to pull our desired data from cells with multiple pieces of information. From here - we can generate a new table with the information requested of us and export as a CSV file.

Data Restructuring

The easiest way to go about this is to first remove the header rows and the rows serving as deliminators from our Data Frame, then create two tables: one composed entirely of “upper rows” and another composed of the “lower rows”, such that the position of the rows in each table are consistent for each competitor across tables (R retains the row numbers from the original tables, however, my desired outcome is to have each competitor in the same relative position in each table). From here, we can easily join the two tables into a new data frame with the cbind function.

Table <- Table[-c(1:4),]
Table <- subset(Table, Table$V2 != "")
head(Table, 5)

##        V1                                V2    V3    V4    V5    V6    V7    V8
## 5      1   GARY HUA                         6.0   W  39 W  21 W  18 W  14 W   7
## 6     ON   15445895 / R: 1794   ->1817      N:2   W     B     W     B     W    
## 8      2   DAKSHESH DARURI                  6.0   W  63 W  58 L   4 W  17 W  16
## 9     MI   14598900 / R: 1553   ->1663      N:2   B     W     B     W     B    
## 11     3   ADITYA BAJAJ                     6.0   L   8 W  61 W  25 W  21 W  11
##       V9   V10 V11
## 5  D  12 D   4  NA
## 6  B     W      NA
## 8  W  20 W   7  NA
## 9  W     B      NA
## 11 W  13 W  12  NA

Now that the header has been removed, we need to find a way to easily distinguish the upper and lower rows. It initially seems as though, in column 1, the upper rows contain numbers while the lower rows contain state abbreviations, however, further examination proves that the top row hosts numbers stored as strings. For this reason, the is.numeric function will not work to distinguish upper and lower rows. Because as.numeric returns an error message when used on a string that can not be converted to a number, we can use this function coupled with the is.na function for our purposes. Finally, as there are leading and trailing spaces in this data set, we also use a trimws function to remove these.

Table$Split <- is.na(as.numeric(trimws(Table$V1)))

Table.Upper <- subset(Table, Table$Split == TRUE)
Table.Lower <- subset(Table, Table$Split == FALSE)

Table.both <- cbind(Table.Upper, Table.Lower)

A little more processing to remove extra information, and our table is starting to become more useful:

colnames(Table.both) <- c(1:24)
rownames(Table.both) <- c(1:64)
Table.both <- Table.both[,-c(3:13,23,24)]
colnames(Table.both) <- c(1:11)
head(Table.both,5)

##        1                                 2                                 3
## 1    ON   15445895 / R: 1794   ->1817       GARY HUA                        
## 2    MI   14598900 / R: 1553   ->1663       DAKSHESH DARURI                 
## 3    MI   14959604 / R: 1384   ->1640       ADITYA BAJAJ                    
## 4    MI   12616049 / R: 1716   ->1744       PATRICK H SCHILLING             
## 5    MI   14601533 / R: 1655   ->1690       HANSHI ZUO                      
##       4     5     6     7     8     9    10    11
## 1 6.0   W  39 W  21 W  18 W  14 W   7 D  12 D   4
## 2 6.0   W  63 W  58 L   4 W  17 W  16 W  20 W   7
## 3 6.0   L   8 W  61 W  25 W  21 W  11 W  13 W  12
## 4 5.5   W  23 D  28 W   2 W  26 D   5 W  19 D   1
## 5 5.5   W  45 W  37 D  12 D  13 D   4 W  14 W  17

Cleaning Cell Contents

Columns 5 through 11 contain both the opponent number, and the outcome of the match. As column 4 contains the total points, we actually do not need the outcome - only the opponent from these cells. Additionally, as there is one space between the W/L and the opponent number, we can use this to delineate the contents of the cell. I elected to use a gsub function to remove everything before the first space. As I know we will be referencing this value later, I also took the opportunity remove any spaces which may have been left over.

Table.both$`5` <- trimws(gsub( "^.* ", "",Table.both$`5`))
Table.both$`6` <- trimws(gsub( "^.* ", "",Table.both$`6`))
Table.both$`7` <- trimws(gsub( "^.* ", "",Table.both$`7`))
Table.both$`8` <- trimws(gsub( "^.* ", "",Table.both$`8`))
Table.both$`9` <- trimws(gsub( "^.* ", "",Table.both$`9`))
Table.both$`10` <- trimws(gsub( "^.* ", "",Table.both$`10`))
Table.both$`11` <- trimws(gsub( "^.* ", "",Table.both$`11`))

From here, we draw our attention to column 2, which contains a few bits of information: the rating before and after the tournament, as well as the players identification number. We are only concerned with the players pre-tournament rating, which is conveniently located between a “:” and an “-”. Additionally, these are the only occurrences of these characters in the string, so we can remove this extra information using our GSub function.

Also of note - some players have an additional bit of information in their rating This organization treats a players ratings early in their tenure as “provisional.” As a result, players who have finished a small amount of games have this quantity listed with the letter “p” followed by the number of finished games. As this status is irrelevant for our purposes, I removed it in this stage.

Table.both$`2` <- gsub("^.*:","",Table.both$`2`)
Table.both$`2` <- gsub("-.*$","",Table.both$`2`)
Table.both$`2` <- gsub("P.*$","",Table.both$`2`)

Pairing opponents with their rating

Our next challenge is to take the player numbers in Columns 5 through 11 and index column 2 to find each player’s ratings. As our table is still in the correct order (player 1 is now in row 1), this is relatively straight forward. Of note - both the values signifying the player number and the values with the player rating are currently text strings with leading spaces, so we will be cleaning and reformatting our data as we move forward.

Table.both$`12` <- as.numeric(trimws(Table.both$`2`[as.numeric(Table.both$`5`)]))
Table.both$`13` <- as.numeric(trimws(Table.both$`2`[as.numeric(Table.both$`6`)]))
Table.both$`14` <- as.numeric(trimws(Table.both$`2`[as.numeric(Table.both$`7`)]))
Table.both$`15` <- as.numeric(trimws(Table.both$`2`[as.numeric(Table.both$`8`)]))
Table.both$`16` <- as.numeric(trimws(Table.both$`2`[as.numeric(Table.both$`9`)]))
Table.both$`17` <- as.numeric(trimws(Table.both$`2`[as.numeric(Table.both$`10`)]))
Table.both$`18` <- as.numeric(trimws(Table.both$`2`[as.numeric(Table.both$`11`)]))

Calculating mean

With each players rating in place, we can calculate the average rating of each competitors opponents.

Table.both$`19` <- round(rowMeans(Table.both[, 12:18],na.rm = TRUE), digits=0)

head(Table.both,5)

##        1        2                                 3     4  5  6  7  8  9 10 11
## 1    ON   1794     GARY HUA                         6.0   39 21 18 14  7 12  4
## 2    MI   1553     DAKSHESH DARURI                  6.0   63 58  4 17 16 20  7
## 3    MI   1384     ADITYA BAJAJ                     6.0    8 61 25 21 11 13 12
## 4    MI   1716     PATRICK H SCHILLING              5.5   23 28  2 26  5 19  1
## 5    MI   1655     HANSHI ZUO                       5.5   45 37 12 13  4 14 17
##     12   13   14   15   16   17   18   19
## 1 1436 1563 1600 1610 1649 1663 1716 1605
## 2 1175  917 1716 1629 1604 1595 1649 1469
## 3 1641  955 1745 1563 1712 1666 1663 1564
## 4 1363 1507 1553 1579 1655 1564 1794 1574
## 5 1242  980 1663 1666 1716 1610 1629 1501

Building our final table

Now that we have all of our relevant information - we can build our final table. As this will be our final product, we will label the columns in an intuitive manner.

Table.Final  <- data.frame(Table.both$`3`,Table.both$`1`,Table.both$`4`,Table.both$`2`,Table.both$`19`)

colnames(Table.Final) <- c("Player Name","Player State","Total Points","Pre-Tournament Rating","Average Rating of Opponents")

head(Table.Final,5)

##                         Player Name Player State Total Points
## 1  GARY HUA                                  ON         6.0  
## 2  DAKSHESH DARURI                           MI         6.0  
## 3  ADITYA BAJAJ                              MI         6.0  
## 4  PATRICK H SCHILLING                       MI         5.5  
## 5  HANSHI ZUO                                MI         5.5  
##   Pre-Tournament Rating Average Rating of Opponents
## 1               1794                           1605
## 2               1553                           1469
## 3               1384                           1564
## 4               1716                           1574
## 5               1655                           1501

Export as CSV

As our table looks correct and contains all the requested information, we are prepared to export as a CSV

write.csv(Table.Final,file = "BloomeProject1.CSV",row.names = FALSE)

607 Project 1

Christopher Bloome

2/18/2020