Introduction

Use the given text file which contains the results from a chess tournament to generate .CSV files that can be used to load into a database for further analysis.

Excerpt from text file

Each player in the tournament is listed along with the results from the 7 rounds of the tournament.

-----------------------------------------------------------------------------------------
 Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| 
 Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | 
-----------------------------------------------------------------------------------------
    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|
   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |
-----------------------------------------------------------------------------------------
    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|
   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |
-----------------------------------------------------------------------------------------
    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|
   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |
-----------------------------------------------------------------------------------------
    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|
   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |
-----------------------------------------------------------------------------------------
    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|
   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |
-----------------------------------------------------------------------------------------

Load required R Libraries

library(tidyverse)

Import and Clean Data

Import tournament data

infoFile = 'https://raw.githubusercontent.com/dab31415/DATA607/main/Projects/Project_1/TournamentInfo.txt'
raw_tournament <-read.table(file = infoFile,
                            sep = '|',
                            strip.white = TRUE,
                            fill = TRUE)

head(raw_tournament,7)
##                                                                                          V1
## 1 -----------------------------------------------------------------------------------------
## 2                                                                                      Pair
## 3                                                                                       Num
## 4 -----------------------------------------------------------------------------------------
## 5                                                                                         1
## 6                                                                                        ON
## 7 -----------------------------------------------------------------------------------------
##                            V2    V3    V4    V5    V6    V7    V8    V9   V10
## 1                                                                            
## 2                 Player Name Total Round Round Round Round Round Round Round
## 3   USCF ID / Rtg (Pre->Post)   Pts     1     2     3     4     5     6     7
## 4                                                                            
## 5                    GARY HUA   6.0 W  39 W  21 W  18 W  14 W   7 D  12 D   4
## 6 15445895 / R: 1794   ->1817   N:2     W     B     W     B     W     B     W
## 7                                                                            
##   V11
## 1  NA
## 2  NA
## 3  NA
## 4  NA
## 5  NA
## 6  NA
## 7  NA

Remove unneeded rows and columns

The first four rows of data contain multiple header and row separators. The last column is created because each row of the text file ends with a column separator. The data file also contains row separators between each player.

tournament <- raw_tournament[-c(1:4),-11] %>%
    filter(grepl('[A-Z0-9]',V1))

Set Column Names

names(tournament)[1] = 'num_state'
names(tournament)[2] = 'name_rating'
names(tournament)[3] = 'points'
names(tournament)[4:10] = paste('R', 1:7, sep = '')

Create Data Frames

Extract player details

The first row for each player contains their player number in the first field which is also used in the round data to match players with their opponent. Filter on numeric first column and select the player’s name and points earned in the tournament.

players <- as_tibble(tournament %>%
  filter(grepl('[0-9]',num_state)) %>%
  transmute(
    playerNum = as.numeric(num_state),
    playerName = name_rating,
    points = as.numeric(points),
  )) %>%
  rowid_to_column('id')

head(players,7)
## # A tibble: 7 x 4
##      id playerNum playerName          points
##   <int>     <dbl> <chr>                <dbl>
## 1     1         1 GARY HUA               6  
## 2     2         2 DAKSHESH DARURI        6  
## 3     3         3 ADITYA BAJAJ           6  
## 4     4         4 PATRICK H SCHILLING    5.5
## 5     5         5 HANSHI ZUO             5.5
## 6     6         6 HANSEN SONG            5  
## 7     7         7 GARY DEE SWATHELL      5

Extract Player State and Rating Details

The second row of player data contains the player’s state, USCF identification number, and their ratings both before and after tournament completion.

ratings <- as_tibble(tournament %>%
  filter(grepl('[A-Z]',num_state)) %>%
  separate(name_rating, into = c('USCF_ID','rawRatings'), sep = '/') %>%
  separate(rawRatings, into = c('rawpreRating','rawpostRating'), sep = '->') %>%
  transmute(
    state = num_state,
    USCF_ID = as.numeric(USCF_ID),
    preRating = as.numeric(str_extract(rawpreRating,'\\d{3,4}')),
    postRating = as.numeric(str_extract(rawpostRating,'\\d{3,4}'))
  )) %>%
  rowid_to_column('id')

head(ratings,7)
## # A tibble: 7 x 5
##      id state  USCF_ID preRating postRating
##   <int> <chr>    <dbl>     <dbl>      <dbl>
## 1     1 ON    15445895      1794       1817
## 2     2 MI    14598900      1553       1663
## 3     3 MI    14959604      1384       1640
## 4     4 MI    12616049      1716       1744
## 5     5 MI    14601533      1655       1690
## 6     6 OH    15055204      1686       1687
## 7     7 MI    11146376      1649       1673

Merge players and ratings data

We avoid using player number to merge the two data frames because we’d be dependent on the raw text file being sorted and numbered in order. By implementing a row number, we ensure that consecutive rows in the data file are merged together.

playerDetails <- players %>%
  merge(ratings,by = 'id') %>%
  select(playerNum, USCF_ID, playerName, state, preRating, points, postRating)

head(playerDetails,7)
##   playerNum  USCF_ID          playerName state preRating points postRating
## 1         1 15445895            GARY HUA    ON      1794    6.0       1817
## 2         2 14598900     DAKSHESH DARURI    MI      1553    6.0       1663
## 3         3 14959604        ADITYA BAJAJ    MI      1384    6.0       1640
## 4         4 12616049 PATRICK H SCHILLING    MI      1716    5.5       1744
## 5         5 14601533          HANSHI ZUO    MI      1655    5.5       1690
## 6         6 15055204         HANSEN SONG    OH      1686    5.0       1687
## 7         7 11146376   GARY DEE SWATHELL    MI      1649    5.0       1673

Extract match results

The first row of player data contains the results of the matches played in each round. We generate a data frame of all players and the match results by round.

matchResults <- as_tibble(tournament %>%
  filter(grepl('[0-9]',num_state)) %>%
  gather(roundNum,roundResult,R1:R7) %>%
  transmute(
    playerNum = as.numeric(num_state),
    roundNum = as.numeric(str_extract(roundNum,'\\d+')),
    oppNum = as.numeric(str_extract(roundResult,'\\d+')),
    result = str_extract(roundResult,'[A-Z]')
  ))

head(matchResults,7)
## # A tibble: 7 x 4
##   playerNum roundNum oppNum result
##       <dbl>    <dbl>  <dbl> <chr> 
## 1         1        1     39 W     
## 2         2        1     63 W     
## 3         3        1      8 L     
## 4         4        1     23 W     
## 5         5        1     45 W     
## 6         6        1     34 W     
## 7         7        1     57 W
matchResults %>%
  count(result)
## # A tibble: 7 x 2
##   result     n
##   <chr>  <int>
## 1 B          7
## 2 D         58
## 3 H         16
## 4 L        175
## 5 U         16
## 6 W        175
## 7 X          1

Add opponent rating to match results

Adding the opponent’s pre-tournament rating to the match results data frame will make it easier to calculate the average opponents’ rating and determine the expected match results.

I had some issues joining player details with match results on columns of different names which is why I used a select command to rename the ‘playerNum’ column to ‘oppNum’.

matchResults <- matchResults %>%
  left_join(playerDetails %>% select(oppNum = playerNum, oppRating = preRating),by = 'oppNum')

head(matchResults,7)
## # A tibble: 7 x 5
##   playerNum roundNum oppNum result oppRating
##       <dbl>    <dbl>  <dbl> <chr>      <dbl>
## 1         1        1     39 W           1436
## 2         2        1     63 W           1175
## 3         3        1      8 L           1641
## 4         4        1     23 W           1363
## 5         5        1     45 W           1242
## 6         6        1     34 W           1399
## 7         7        1     57 W           1092

Calculate Average Opponent Rating

For each match we join to player details to determine the opponents pre-tournament rating, then calculate the mean grouped by player. The calculation of average opponents’ rating accounts for players that played less than 7 matches.

playerDetails <- playerDetails %>%
  inner_join(matchResults %>%
               group_by(playerNum) %>%
               summarize(avgOppRating = as.integer(mean(oppRating, na.rm = TRUE))),by = 'playerNum')

head(playerDetails,7)
##   playerNum  USCF_ID          playerName state preRating points postRating
## 1         1 15445895            GARY HUA    ON      1794    6.0       1817
## 2         2 14598900     DAKSHESH DARURI    MI      1553    6.0       1663
## 3         3 14959604        ADITYA BAJAJ    MI      1384    6.0       1640
## 4         4 12616049 PATRICK H SCHILLING    MI      1716    5.5       1744
## 5         5 14601533          HANSHI ZUO    MI      1655    5.5       1690
## 6         6 15055204         HANSEN SONG    OH      1686    5.0       1687
## 7         7 11146376   GARY DEE SWATHELL    MI      1649    5.0       1673
##   avgOppRating
## 1         1605
## 2         1469
## 3         1563
## 4         1573
## 5         1500
## 6         1518
## 7         1372

Create CSV files

Generate a csv of the players and another for the results of the tournament. These files can then be used to load into a database.

local_path = 'C:\\Users\\dab31\\OneDrive\\Documents\\GitHub\\DATA607\\Projects\\Project_1\\'

playerDetails %>%
  write_csv(paste(local_path, 'playersDetails.csv', sep = ''))

matchResults %>%
  write_csv(paste(local_path,'matchResults.csv', sep = ''))