DATA 607 Project 1: Chess Tournament

Solution

The problem is handled by reading in the data, processing it using regular expressions, and querying the data to arrive at the necessary calculated values.

Reading the Data

The text file is read in using the read.table function, with the | character as the column delimiter. The column names are specified based on the file, noting that an extra blank column (EOL) is created because each line ends with |. The first four rows are skipped, as they do not contain data. The fill = TRUE argument is specified so that rows with fewer fields than the specified number of columns are padded with blank fields.

chess_fields <- c("Number", "Name", "Points", "R1", "R2", "R3", "R4", "R5", "R6", "R7", "EOL")
tournament <- read.table("./tournamentinfo.txt", header = FALSE, skip = 4, sep = "|", fill = TRUE, stringsAsFactors = FALSE, col.names = chess_fields)
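
As a minimal sketch of how these arguments behave, the call can be tried on a few in-memory lines. The three lines below are made up to imitate the layout described above (they are not taken from tournamentinfo.txt), and the names sample_lines, demo_fields and demo are illustrative only.

```r
# Made-up lines imitating the pipe-delimited layout (hypothetical data)
sample_lines <- c(
  "    1 | PLAYER ONE                  |6.0  |W  39|W  21|",
  "   ON | 12345678 / R: 1500   ->1525 |N:2  |W    |B    |",
  "-------------------------------------------------------"
)
demo_fields <- c("Number", "Name", "Points", "R1", "R2", "EOL")

# text = reads from the character vector instead of a file
demo <- read.table(text = sample_lines, header = FALSE, sep = "|",
                   fill = TRUE, stringsAsFactors = FALSE,
                   col.names = demo_fields)

# The hyphen-only line has a single field; fill = TRUE pads the rest
str(demo)
```

In this sketch the separator line ends up with hyphens in Number and an empty string in Name, which is the property the next step relies on to drop those rows.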

The resulting data frame contains the rows that separate entries in the original text file. These rows consist only of hyphens in the Number column, with all other columns populated with blank strings. These rows, as well as the extra EOL column, are removed:

tournament <- subset(tournament, Name != "", select = Number:R7)

Cleansing the Data

The resulting data set is in need of some cleanup: there are extra spaces from the text file, and the information about each player is spread across two rows.

First, the stringr package is loaded, and each column is trimmed to remove the padding spaces:

library(stringr)
# Trim leading and trailing whitespace from every column
for (i in seq_along(tournament)) {
  tournament[, i] <- str_trim(tournament[, i])
}

Now that the padding spaces have been removed, the data can be combined into a single row for each player.

The player’s state is found in the Number field of the row following the player’s name.

The player’s pre-tournament rating, a 3-4 digit number that follows a colon and a space, can now be extracted using the regular expression "[[:blank:]]{1}[[:digit:]]{3,4}". Because the padding has been trimmed, the ID number at the start of the field is not preceded by a blank and so is not matched. Similar to the state above, the rating is in the Name field of the row following the player’s name.

These two fields are created as columns in the data frame. Once these fields are created, the rows below the players’ names are no longer needed, and are removed.

for (i in 1:nrow(tournament)) {
  tournament$State[i] <- tournament$Number[i + 1]
  tournament$Player[i] <- str_trim(str_extract(tournament$Name[i + 1], "[[:blank:]]{1}[[:digit:]]{3,4}"))
}
tournament <- subset(tournament, !is.na(Player))
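
The behavior of the rating pattern can be checked on a few strings. This is a small sketch: the values in info are hypothetical examples written to resemble the second-row Name field, not lines from the file.

```r
library(stringr)

# Hypothetical Name values: a rated player, a 3-digit provisional-style
# rating, and a first-row value holding only a name
info <- c("12345678 / R: 1500   ->1525",
          "12345679 / R:  955P11 -> 979P18",
          "PLAYER ONE")

rating_pattern <- "[[:blank:]]{1}[[:digit:]]{3,4}"

# First blank followed by 3-4 digits, then strip the blank
pre_rating <- str_trim(str_extract(info, rating_pattern))
pre_rating

# Without trimming first, a leading blank lets the ID number match instead
untrimmed <- str_trim(str_extract(" 12345678 / R: 1500", rating_pattern))
untrimmed
```

The name-only string yields NA, which is why filtering on !is.na(Player) keeps exactly one row per player.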

Matching Opponents

Now that the data is consolidated such that each player is represented by a single row, the individual matches each player competed in must be considered. For the purposes of this task, the result of the match (Win, Loss, Draw) does not matter – only the number of the player faced. Each round in which a player played a match is represented by the result of the match, followed by a space and then the opponent’s number. Rounds in which a player did not play a match are noted by just a letter.

The number of each player’s opponent is extracted, with NA returned for rounds in which a player did not face an opponent:

for (i in 4:10) {
  tournament[, i] <- str_trim(str_extract(tournament[, i], "[[:space:]]+[[:digit:]]{1,2}"))
}
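
For illustration, the same pattern applied to a handful of made-up round values (the vector rounds is hypothetical, written in the result-then-number format described above):

```r
library(stringr)

# Hypothetical round entries: played games carry an opponent number,
# unplayed rounds are a single letter
rounds <- c("W  39", "D  12", "H", "L   4", "B")

# One or more whitespace characters followed by a 1-2 digit number
opponents <- str_trim(str_extract(rounds, "[[:space:]]+[[:digit:]]{1,2}"))
opponents
```

Entries without a number do not match the pattern, so str_extract returns NA for them.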

Finding Opponent Ratings

Now that the opponents have been identified, their pre-ratings must be pulled in. This is accomplished by finding the rating of each player faced:

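for (i in 1:nrow(tournament)) {
  for (j in 4:10) {
    # Replace the opponent's number with that opponent's pre-rating (Player column)
    tournament[i, j] <- tournament[tournament$Number == tournament[i, j], "Player"][1]
    # [1] keeps a single value: an NA opponent makes the comparison all NA,
    # which would otherwise return one NA per row and fail on assignment
  }
}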
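
The same lookup can also be written column by column with match(), which is an alternative to the nested loop rather than the approach used in this report. The sketch below uses a toy data frame (players, with made-up numbers and ratings) to show the idea.

```r
# Toy data (hypothetical): three players and two rounds
players <- data.frame(
  Number = c("1", "2", "3"),
  Player = c("1500", "1400", "1300"),   # pre-ratings, still as text
  R1     = c("2", "1", NA),             # opponent numbers, NA = no game
  R2     = c("3", NA, "1"),
  stringsAsFactors = FALSE
)

for (col in c("R1", "R2")) {
  # match() gives the row index of each opponent number (NA if none),
  # and indexing Player by that index returns the opponent's rating
  players[[col]] <- players$Player[match(players[[col]], players$Number)]
}
players
```

Because match() returns NA for a missing opponent, no special handling is needed for rounds that were not played.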

Now that the round-by-round opponent ratings have been gathered, they must be converted to numbers and the average rating of opponents calculated. Player pre-rating and total points are also converted to numbers.

for (i in 4:10) {
  tournament[, i] <- as.numeric(tournament[, i])
}
tournament$Player <- as.numeric(tournament$Player)
tournament$Points <- as.numeric(tournament$Points)
tournament$Opponent <- round(rowMeans(tournament[, c(4:10)], na.rm = TRUE), 1)
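
A small sketch of how rowMeans treats missing rounds, using made-up ratings in a hypothetical data frame opp:

```r
# Hypothetical opponent ratings: NA marks a round with no opponent
opp <- data.frame(R1 = c(1436, 1500, NA),
                  R2 = c(1563, NA, NA))

# na.rm = TRUE averages only the rounds actually played
avg <- round(rowMeans(opp, na.rm = TRUE), 1)
avg
```

With na.rm = TRUE the divisor is the number of games played, not the number of rounds. A row with no games at all produces NaN rather than a number, which would need separate handling if such a player appeared in the data.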

Results

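The relevant columns are saved to a new data frame, which is then exported as a CSV file. The Player column holds each player’s pre-rating, and the Opponent column holds the average pre-rating of the opponents faced.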

ratings <- subset(tournament, TRUE, c(Name, State, Points, Player, Opponent))
row.names(ratings) <- seq_len(nrow(ratings))  # renumber rows without hard-coding the player count
write.table(ratings, file = "chessratings.csv", sep = ",", row.names = FALSE)

A sample of the final data set:

##                        Name State Points Player Opponent
## 55                ALEX KONG    MI    2.0   1186   1406.0
## 49         MICHAEL J MARTIN    MI    2.5   1291   1285.8
## 64                   BEN LI    MI    1.0   1163   1263.0
## 44       JUSTIN D SCHILLING    MI    3.0   1199   1327.0
## 33                  JADE GE    MI    3.5   1449   1276.9
## 9               STEFANO LEE    ON    5.0   1411   1523.1
## 11 CAMERON WILLIAM MC LEMAN    MI    4.5   1712   1467.6
## 6               HANSEN SONG    OH    5.0   1686   1518.7
## 4       PATRICK H SCHILLING    MI    5.5   1716   1573.6
## 15   ZACHARY JAMES HOUGHTON    MI    4.5   1220   1483.9