The goal of this project is to tidy a dataset of chess tournament standings that are presented in a scoresheet format. The dataset must be parsed, cleaned and exported to a csv file for use elsewhere, either in a database or spreadsheet software.

The final output for the project should include:

Player’s Name
Player’s State
Total Number of Points
Player’s Pre-Rating
Average Pre Chess Rating of Opponents

Import and Clean

The initial step is to import the data from the txt file using readLines.

library(stringr)
setwd("C:/Users/Buddy Snacks/Documents/CUNY_Assignments/Data")
chess_raw = readLines("tournament_data.txt")
chess_raw[1:15]    # chess_raw is the raw imported text file... Each row a string
##  [1] "-----------------------------------------------------------------------------------------" 
##  [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
##  [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
##  [4] "-----------------------------------------------------------------------------------------" 
##  [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
##  [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
##  [7] "-----------------------------------------------------------------------------------------" 
##  [8] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
##  [9] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
## [10] "-----------------------------------------------------------------------------------------" 
## [11] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|" 
## [12] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
## [13] "-----------------------------------------------------------------------------------------" 
## [14] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|" 
## [15] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"

The data is formatted in such a way that it can be easily read, but not easily processed.

The first step in cleaning the data is to remove the header and filter out the dividing “—” lines. Once that is complete the remaining lines are subset into two vectors. The first I call name_line which are the lines containing names, id and opponent info. The second I call data_line which are the lines containing state and rating information.

chess_no_head = chess_raw[-1:-4]          # remove header
dashed_strings = grep("^-+", chess_no_head)  # filter lines with dashes...
chess_info = chess_no_head[-dashed_strings]   # ...and remove

name_line = chess_info[c(TRUE, FALSE)]    # subset every other line into its own vector     
data_line = chess_info[c(FALSE, TRUE)]        

Here are the resulting vectors:

## [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
## [2] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
## [3] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
## [4] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|"
## [5] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|"
## [1] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [2] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [3] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [4] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
## [5] "   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |"

Build the Dataframe

In order to build a dataframe, the neccesary column vectors need to be parsed from the string vectors. I use several functions from the stringr package to build these column vectors. Note that opp_scores is a list of string vectors.

# make the column vectors...
id = c(1:(length(name_line)))
names = str_extract(name_line, "([A-Z.-]+ )+")
points = str_extract(name_line, "[0-9]\\.[0-9]")
opp_scores = str_extract_all(name_line, perl("[0-9]{1,2}(?=\\|)"))  # list
states = str_extract(data_line, "[A-Z]{2}")
rating =str_extract(data_line, perl("(?<=R: )( ?[0-9]{3,4})"))


df1 = data.frame(
  as.numeric(id),
  names,
  as.factor(states),
  as.numeric(rating),
  as.numeric(points),
  stringsAsFactors = FALSE
)

c("id", "name", "state", "pre_tourn_rating", "points") -> colnames(df1)

The resulting dataframe includes name, state, pre-tourn rating and points. An id column was also included for convenience in case the dataset ever needs to be imported into an RDB.

##   id                 name state pre_tourn_rating points
## 1  1            GARY HUA     ON             1794    6.0
## 2  2     DAKSHESH DARURI     MI             1553    6.0
## 3  3        ADITYA BAJAJ     MI             1384    6.0
## 4  4 PATRICK H SCHILLING     MI             1716    5.5
## 5  5          HANSHI ZUO     MI             1655    5.5
## 6  6         HANSEN SONG     OH             1686    5.0

Compute the Average Opponent Rating

The final column in the project can now be computed using a nested set of lapply and sapply functions. The inner callback function simply matches the pre-tournament rating with the dataframe id column and creates it in the correct index of the matched_scores list.

opp_scores = lapply(opp_scores,as.numeric)

matched_scores = lapply(opp_scores, function(x){
    sapply(x, function(y){
        y = df1$pre_tourn_rating[df1$id == y] 
    })
})

matched_scores[1:5]
## [[1]]
## [1] 1436 1563 1600 1610 1649 1663 1716
## 
## [[2]]
## [1] 1175  917 1716 1629 1604 1595 1649
## 
## [[3]]
## [1] 1641  955 1745 1563 1712 1666 1663
## 
## [[4]]
## [1] 1363 1507 1553 1579 1655 1564 1794
## 
## [[5]]
## [1] 1242  980 1663 1666 1716 1610 1629

Finally, these matched_scores are averaged and appended to the dataframe as the last column.

avg_scores = sapply(matched_scores,mean)
df1$avg_opp_ratings = round(avg_scores, 0)

The final dataframe:

##   id                 name state pre_tourn_rating points avg_opp_ratings
## 1  1            GARY HUA     ON             1794    6.0            1605
## 2  2     DAKSHESH DARURI     MI             1553    6.0            1469
## 3  3        ADITYA BAJAJ     MI             1384    6.0            1564
## 4  4 PATRICK H SCHILLING     MI             1716    5.5            1574
## 5  5          HANSHI ZUO     MI             1655    5.5            1501
## 6  6         HANSEN SONG     OH             1686    5.0            1519

Export

The dataset can now be easily exported into the local working directory using write.csv() and imported into a spreadsheet or RDB.

write.csv(df1, "chess_output.csv", row.names = FALSE, quote = FALSE)