In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players: Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents
I started with loading the file into R. It was clear that the raw text file used the “|” character to delineate cells, so the read.delim function was the logical first step. Once the data was loaded, I wanted to explore how it was structured, and where my target information was located.
library(RCurl)
x <- getURL("https://raw.githubusercontent.com/ChristopherBloome/607/master/tournamentinfo.txt")
Table <- read.delim("https://raw.githubusercontent.com/ChristopherBloome/607/master/tournamentinfo.txt", header = FALSE, sep = "|")
head(Table,10)
## V1
## 1 -----------------------------------------------------------------------------------------
## 2 Pair
## 3 Num
## 4 -----------------------------------------------------------------------------------------
## 5 1
## 6 ON
## 7 -----------------------------------------------------------------------------------------
## 8 2
## 9 MI
## 10 -----------------------------------------------------------------------------------------
## V2 V3 V4 V5 V6 V7 V8 V9
## 1
## 2 Player Name Total Round Round Round Round Round Round
## 3 USCF ID / Rtg (Pre->Post) Pts 1 2 3 4 5 6
## 4
## 5 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7 D 12
## 6 15445895 / R: 1794 ->1817 N:2 W B W B W B
## 7
## 8 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17 W 16 W 20
## 9 14598900 / R: 1553 ->1663 N:2 B W B W B W
## 10
## V10 V11
## 1 NA
## 2 Round NA
## 3 7 NA
## 4 NA
## 5 D 4 NA
## 6 W NA
## 7 NA
## 8 W 7 NA
## 9 B NA
## 10 NA
The first thing that stands out is that each “cell” as drawn in the ANSI characters takes up 2 rows in our table. These rows are separated by an additional row that is mostly blank, with the exception of the first column which contains a string of en-dashes “-”.
The relevant information for each competitor is spread across each of these two rows. The top row contains the competitors name, their total points, and a series of cells that indicate who their competitor was for that round and if they won or loss. The lower row contains a cell which has the player ID and their rating before and after the tournament, followed by information indicating which color they played in a given round.
My first step is to remove the “extra” rows from our data-set, then restructure the data such that all relevant information for each competitor is in one row. Once we have one row per competitor, we can use Regex functions to pull our desired data from cells with multiple pieces of information. From here - we can generate a new table with the information requested of us and export as a CSV file.
The easiest way to go about this is to first remove the header rows and the rows serving as deliminators from our Data Frame, then create two tables: one composed entirely of “upper rows” and another composed of the “lower rows”, such that the position of the rows in each table are consistent for each competitor across tables (R retains the row numbers from the original tables, however, my desired outcome is to have each competitor in the same relative position in each table). From here, we can easily join the two tables into a new data frame with the cbind function.
Table <- Table[-c(1:4),]
Table <- subset(Table, Table$V2 != "")
head(Table, 5)
## V1 V2 V3 V4 V5 V6 V7 V8
## 5 1 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7
## 6 ON 15445895 / R: 1794 ->1817 N:2 W B W B W
## 8 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17 W 16
## 9 MI 14598900 / R: 1553 ->1663 N:2 B W B W B
## 11 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25 W 21 W 11
## V9 V10 V11
## 5 D 12 D 4 NA
## 6 B W NA
## 8 W 20 W 7 NA
## 9 W B NA
## 11 W 13 W 12 NA
Now that the header has been removed, we need to find a way to easily distinguish the upper and lower rows. It initially seems as though, in column 1, the upper rows contain numbers while the lower rows contain state abbreviations, however, further examination proves that the top row hosts numbers stored as strings. For this reason, the is.numeric function will not work to distinguish upper and lower rows. Because as.numeric returns an error message when used on a string that can not be converted to a number, we can use this function coupled with the is.na function for our purposes. Finally, as there are leading and trailing spaces in this data set, we also use a trimws function to remove these.
Table$Split <- is.na(as.numeric(trimws(Table$V1)))
Table.Upper <- subset(Table, Table$Split == TRUE)
Table.Lower <- subset(Table, Table$Split == FALSE)
Table.both <- cbind(Table.Upper, Table.Lower)
A little more processing to remove extra information, and our table is starting to become more useful:
colnames(Table.both) <- c(1:24)
rownames(Table.both) <- c(1:64)
Table.both <- Table.both[,-c(3:13,23,24)]
colnames(Table.both) <- c(1:11)
head(Table.both,5)
## 1 2 3
## 1 ON 15445895 / R: 1794 ->1817 GARY HUA
## 2 MI 14598900 / R: 1553 ->1663 DAKSHESH DARURI
## 3 MI 14959604 / R: 1384 ->1640 ADITYA BAJAJ
## 4 MI 12616049 / R: 1716 ->1744 PATRICK H SCHILLING
## 5 MI 14601533 / R: 1655 ->1690 HANSHI ZUO
## 4 5 6 7 8 9 10 11
## 1 6.0 W 39 W 21 W 18 W 14 W 7 D 12 D 4
## 2 6.0 W 63 W 58 L 4 W 17 W 16 W 20 W 7
## 3 6.0 L 8 W 61 W 25 W 21 W 11 W 13 W 12
## 4 5.5 W 23 D 28 W 2 W 26 D 5 W 19 D 1
## 5 5.5 W 45 W 37 D 12 D 13 D 4 W 14 W 17
Columns 5 through 11 contain both the opponent number, and the outcome of the match. As column 4 contains the total points, we actually do not need the outcome - only the opponent from these cells. Additionally, as there is one space between the W/L and the opponent number, we can use this to delineate the contents of the cell. I elected to use a gsub function to remove everything before the first space. As I know we will be referencing this value later, I also took the opportunity remove any spaces which may have been left over.
Table.both$`5` <- trimws(gsub( "^.* ", "",Table.both$`5`))
Table.both$`6` <- trimws(gsub( "^.* ", "",Table.both$`6`))
Table.both$`7` <- trimws(gsub( "^.* ", "",Table.both$`7`))
Table.both$`8` <- trimws(gsub( "^.* ", "",Table.both$`8`))
Table.both$`9` <- trimws(gsub( "^.* ", "",Table.both$`9`))
Table.both$`10` <- trimws(gsub( "^.* ", "",Table.both$`10`))
Table.both$`11` <- trimws(gsub( "^.* ", "",Table.both$`11`))
From here, we draw our attention to column 2, which contains a few bits of information: the rating before and after the tournament, as well as the players identification number. We are only concerned with the players pre-tournament rating, which is conveniently located between a “:” and an “-”. Additionally, these are the only occurrences of these characters in the string, so we can remove this extra information using our GSub function.
Also of note - some players have an additional bit of information in their rating This organization treats a players ratings early in their tenure as “provisional.” As a result, players who have finished a small amount of games have this quantity listed with the letter “p” followed by the number of finished games. As this status is irrelevant for our purposes, I removed it in this stage.
Table.both$`2` <- gsub("^.*:","",Table.both$`2`)
Table.both$`2` <- gsub("-.*$","",Table.both$`2`)
Table.both$`2` <- gsub("P.*$","",Table.both$`2`)
Our next challenge is to take the player numbers in Columns 5 through 11 and index column 2 to find each player’s ratings. As our table is still in the correct order (player 1 is now in row 1), this is relatively straight forward. Of note - both the values signifying the player number and the values with the player rating are currently text strings with leading spaces, so we will be cleaning and reformatting our data as we move forward.
Table.both$`12` <- as.numeric(trimws(Table.both$`2`[as.numeric(Table.both$`5`)]))
Table.both$`13` <- as.numeric(trimws(Table.both$`2`[as.numeric(Table.both$`6`)]))
Table.both$`14` <- as.numeric(trimws(Table.both$`2`[as.numeric(Table.both$`7`)]))
Table.both$`15` <- as.numeric(trimws(Table.both$`2`[as.numeric(Table.both$`8`)]))
Table.both$`16` <- as.numeric(trimws(Table.both$`2`[as.numeric(Table.both$`9`)]))
Table.both$`17` <- as.numeric(trimws(Table.both$`2`[as.numeric(Table.both$`10`)]))
Table.both$`18` <- as.numeric(trimws(Table.both$`2`[as.numeric(Table.both$`11`)]))
With each players rating in place, we can calculate the average rating of each competitors opponents.
Table.both$`19` <- round(rowMeans(Table.both[, 12:18],na.rm = TRUE), digits=0)
head(Table.both,5)
## 1 2 3 4 5 6 7 8 9 10 11
## 1 ON 1794 GARY HUA 6.0 39 21 18 14 7 12 4
## 2 MI 1553 DAKSHESH DARURI 6.0 63 58 4 17 16 20 7
## 3 MI 1384 ADITYA BAJAJ 6.0 8 61 25 21 11 13 12
## 4 MI 1716 PATRICK H SCHILLING 5.5 23 28 2 26 5 19 1
## 5 MI 1655 HANSHI ZUO 5.5 45 37 12 13 4 14 17
## 12 13 14 15 16 17 18 19
## 1 1436 1563 1600 1610 1649 1663 1716 1605
## 2 1175 917 1716 1629 1604 1595 1649 1469
## 3 1641 955 1745 1563 1712 1666 1663 1564
## 4 1363 1507 1553 1579 1655 1564 1794 1574
## 5 1242 980 1663 1666 1716 1610 1629 1501
Now that we have all of our relevant information - we can build our final table. As this will be our final product, we will label the columns in an intuitive manner.
Table.Final <- data.frame(Table.both$`3`,Table.both$`1`,Table.both$`4`,Table.both$`2`,Table.both$`19`)
colnames(Table.Final) <- c("Player Name","Player State","Total Points","Pre-Tournament Rating","Average Rating of Opponents")
head(Table.Final,5)
## Player Name Player State Total Points
## 1 GARY HUA ON 6.0
## 2 DAKSHESH DARURI MI 6.0
## 3 ADITYA BAJAJ MI 6.0
## 4 PATRICK H SCHILLING MI 5.5
## 5 HANSHI ZUO MI 5.5
## Pre-Tournament Rating Average Rating of Opponents
## 1 1794 1605
## 2 1553 1469
## 3 1384 1564
## 4 1716 1574
## 5 1655 1501
As our table looks correct and contains all the requested information, we are prepared to export as a CSV
write.csv(Table.Final,file = "BloomeProject1.CSV",row.names = FALSE)