Introduction

In this assignment, we are given a text file containing the outcomes of a chess tournament. The text file is not in a standard format that can be readily loaded, and it requires considerable cleaning and parsing.

Our first task is to read the text file and reformat it as a standard table that other programs can readily read and modify. We are to extract the following information for every player:

  • Player name

  • Player state

  • Total points

  • Pre-tournament rating

  • Average pre-tournament rating of the player's opponents

All of these values appear in the raw text file except the last one, which we need to compute.

We will start by loading some packages that we will be using.

library(stringr)
library(DT)
library(ggplot2)

The data for running this code can be downloaded from https://drive.google.com/file/d/1vOPVInjQ9Jx0M61DR5zpm6YNdTdtABic/view?usp=sharing

Reading and Parsing the Data

Loading the text file

Since the data is not in a standard format, we start by reading it in as plain lines of text and displaying the first few.

raw <- "chess_table.txt"
results <- readLines(raw)
head(results, 6)
## [1] "-----------------------------------------------------------------------------------------" 
## [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
## [4] "-----------------------------------------------------------------------------------------" 
## [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
## [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"

Preprocessing

Looking at the text file, we note that the first four lines are header lines that carry no player data, so we can drop them. We also drop any empty lines. Every third remaining line is just a row of dashes; those are filtered out in the next step. After the two removals we take another look at the data.

# Remove the first 4 (header) lines
ChessTable <- results[-(1:4)]
# Remove empty lines
ChessTable <- ChessTable[nchar(ChessTable) > 0]
head(ChessTable, 10)
##  [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
##  [2] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
##  [3] "-----------------------------------------------------------------------------------------"
##  [4] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
##  [5] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
##  [6] "-----------------------------------------------------------------------------------------"
##  [7] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
##  [8] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
##  [9] "-----------------------------------------------------------------------------------------"
## [10] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|"

Looking at our data we see the following pattern:

  • Lines 1, 4, 7, 10, … have the same type of information.

  • Lines 2, 5, 8, 11, … have the same type of information.

  • Lines 3, 6, 9, 12, … are just dash lines with no information.

Thus we extract the lines whose numbers have the form \(3k+1\) and \(3k+2\) for \(k=0,1,2,…\).

# Lines 1,4,7, ...
Idx_3k1 <- seq(1, length(ChessTable), by = 3)
L_3k1 <- ChessTable[Idx_3k1]
head(L_3k1)
## [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
## [2] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
## [3] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
## [4] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|"
## [5] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|"
## [6] "    6 | HANSEN SONG                     |5.0  |W  34|D  29|L  11|W  35|D  10|W  27|W  21|"
# Lines 2,5,8, ...
Idx_3k2 <- seq(2, length(ChessTable), by = 3)
L_3k2 <- ChessTable[Idx_3k2]
head(L_3k2)
## [1] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [2] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [3] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [4] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
## [5] "   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [6] "   OH | 15055204 / R: 1686   ->1687     |N:3  |W    |B    |W    |B    |B    |W    |B    |"
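
The positional split relies on the file following a strict three-line pattern. As a cross-check, the same split can be sketched by content rather than by position. This is only an illustration: `sample_lines` is a shortened excerpt of the lines shown earlier, and the names `is_dash`, `data_lines`, `first_lines`, and `second_lines` are introduced here and are not part of the original code.

```r
# Shortened excerpt shaped like ChessTable (illustration only)
sample_lines <- c(
  "    1 | GARY HUA                        |6.0  |W  39|",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |",
  "-------------------------------------------------------",
  "    2 | DAKSHESH DARURI                 |6.0  |W  63|",
  "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |",
  "-------------------------------------------------------"
)

# A dash line consists of dashes only
is_dash <- grepl("^-+$", sample_lines)
data_lines <- sample_lines[!is_dash]

# The remaining lines alternate: player line, then state/rating line
first_lines  <- data_lines[c(TRUE, FALSE)]
second_lines <- data_lines[c(FALSE, TRUE)]

# Both groups should describe the same number of players
stopifnot(length(first_lines) == length(second_lines))
```

If the check fails on the real file, the three-line pattern is broken somewhere and the positional indices would pair up the wrong lines.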

Extracting data with regex

Looking at the two sets of lines that we separated in the last step, the first thing we can do is extract the names from lines 1, 4, 7, …
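
The code for this step does not appear in the text, yet the later steps rely on a `name` vector. A minimal sketch, assuming the name is always the second `|`-separated field of these lines (as it is in every row displayed earlier), could look like the following. It uses base R only; `example_lines`, `fields`, and `example_names` are names introduced for this illustration.

```r
# Two of the player lines shown earlier, used as a stand-in for L_3k1
example_lines <- c(
  "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|",
  "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|"
)

# Split each line on the literal "|" and keep the second field
fields <- strsplit(example_lines, "|", fixed = TRUE)
example_names <- trimws(vapply(fields, function(f) f[2], character(1)))

print(example_names)
```

Applied to the full data, the same expression with `L_3k1` in place of `example_lines` would produce the `name` vector used in the rest of the analysis.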

Next, we can extract the states from lines 2, 5, 8, …

state <- str_extract(L_3k2, "[[:alpha:]]{2}")

For the total points, we match the decimal number in lines 1, 4, 7, … and convert it to a numeric value:

total_points <- str_extract(L_3k1, "[[:digit:]]+\\.[[:digit:]]")
total_points <- as.numeric(total_points)

Finally, we extract the pre-tournament rating from lines 2, 5, 8, … and the opponent number for each game from lines 1, 4, 7, …

# Capture the 3-4 digits that follow "R:", allowing for extra spaces
pre_rating <- str_match(L_3k2, "R:\\s*([[:digit:]]{3,4})")[, 2]
pre_rating <- as.numeric(pre_rating)
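
To see what the rating pattern has to cope with, here is a small base-R sketch. Only the first line comes from the data shown earlier; the other two are hypothetical lines made up to illustrate a three-digit rating with extra padding and a rating followed by a non-digit suffix. The names `rating_lines` and `example_pre` are introduced here.

```r
rating_lines <- c(
  # taken from the data shown earlier
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |",
  # hypothetical: three-digit rating with an extra space
  "   XX | 00000000 / R:  955   -> 979     |N:4  |B    |W    |",
  # hypothetical: rating followed directly by a non-digit suffix
  "   XX | 00000000 / R: 1641P17->1657P24  |N:3  |W    |B    |"
)

# Keep only the run of digits that follows "R:" and optional spaces
example_pre <- as.numeric(
  sub(".*R:[[:space:]]*([[:digit:]]+).*", "\\1", rating_lines)
)

print(example_pre)
```

Capturing only the digits after `R:` means that neither the padding nor any trailing characters reach `as.numeric`.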

# Match the 1-2 digits immediately before each "|"; the lookahead keeps "|" out of the match
opponent_number <- str_extract_all(L_3k1, "[[:digit:]]{1,2}(?=\\|)")
opponent_number <- lapply(opponent_number, as.numeric)
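
As a worked example of the opponent-number extraction, the sketch below applies the idea to the first player line using base R. It assumes, as in the rows shown earlier, that an opponent number is always followed directly by `|`, while the pair number and the total points are followed by spaces. The names `example_line`, `matches`, and `example_opponents` are introduced here.

```r
example_line <- "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"

# Find every run of 1-2 digits that sits directly before a "|"
matches <- regmatches(
  example_line,
  gregexpr("[[:digit:]]{1,2}(?=\\|)", example_line, perl = TRUE)
)
example_opponents <- as.numeric(matches[[1]])

print(example_opponents)
```

The pair number `1` and the total `6.0` are skipped because each is followed by a space rather than by `|`. A line with fewer numbered rounds would simply yield a shorter vector.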

Data Processing

As indicated in the introduction, we must calculate one additional value for each player: the average pre-tournament rating of the player's opponents. Because the players are listed in order of their pair numbers, each opponent number is also that opponent's position in our vectors, so it can be used directly as an index into pre_rating.

We start with a numeric vector holding one entry per player. We then loop over all players; for each one, we use the list of opponents to look up their pre-tournament ratings and compute the mean.

# Vector to hold each player's average opponent pre-tournament rating
opp_pre_rating <- numeric(length(name))
# Loop over the players
for (i in seq_along(name)) {
  # Opponent numbers double as indices into pre_rating
  opponents <- opponent_number[[i]]
  opp_pre_rating[i] <- mean(pre_rating[opponents])
}
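
The same lookup can be written without an explicit loop. The sketch below is a toy illustration, not the tournament data: the four ratings are those of the first four players shown earlier, but the pairings in `toy_opponents` are made up, and all `toy_*` names are introduced here.

```r
# Pre-tournament ratings of four players, indexed by pair number
toy_pre_rating <- c(1794, 1553, 1384, 1716)

# Hypothetical pairings: player 1 met players 2 and 3, and so on
toy_opponents <- list(c(2, 3), c(1, 4), c(1, 4), c(2, 3))

# For each player, look up the opponents' ratings and average them
toy_opp_avg <- vapply(
  toy_opponents,
  function(opp) mean(toy_pre_rating[opp]),
  numeric(1)
)

print(toy_opp_avg)
```

For player 1 this gives (1553 + 1384) / 2 = 1468.5, and for player 2 it gives (1794 + 1716) / 2 = 1755, which is easy to verify by hand before trusting the result on the full data.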

Creating a database

As the final step, we combine all of the variables listed in the introduction into a single data frame and save it as a CSV file.

# Load the dplyr package
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Combine the variables into a data frame
data_frame <- data.frame(name, state, total_points, pre_rating, opp_pre_rating)
# Give the columns descriptive names
names(data_frame) <- c("Player name", "Player state",
                       "Total points", "Pre-tournament rating",
                       "Pre-tournament rating of opponents")
write.csv(data_frame, file = "Chess_DATA.csv")
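
Because the column names contain spaces, reading the file back takes a little care: by default `read.csv` rewrites such names into syntactically valid ones (for example `Player.name`). The sketch below shows a round trip on a two-row toy data frame written to a temporary file; `toy_frame`, `csv_path`, and `read_back` are names introduced for this illustration.

```r
# Toy data frame with the same style of column names
toy_frame <- data.frame(
  "Player name" = c("GARY HUA", "DAKSHESH DARURI"),
  "Total points" = c(6.0, 6.0),
  check.names = FALSE
)

# Write to a temporary file so that no real output is overwritten
csv_path <- tempfile(fileext = ".csv")
write.csv(toy_frame, file = csv_path)

# write.csv stores row names in the first column, so read them back as such;
# check.names = FALSE keeps the spaces in the column names
read_back <- read.csv(csv_path, row.names = 1, check.names = FALSE)

stopifnot(identical(names(read_back), names(toy_frame)))
```

The same `read.csv` arguments should work for `Chess_DATA.csv`, since it is written with the default row names as well.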