In this project, we will read a .txt file that contains the chess results for a given tournament. This information will be read, formatted and a .csv file will be produced from it. The format of the .csv file will be as follows: Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents

For the scrapping of the data from the .txt file and some formatting we will use the stringr package. Also, we will expect that the .txt file that we will read is available in the working directory (tournamentinfo.txt)

library(stringr)
library(plyr)
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## Loading required package: DBI

Reading information from tournamentinfo.txt into r

Although the data in tournamentinfo.txt has some structure, each row do not have the same format. Because of this we will use readlines command instead of read.table

t <- readLines(con="tournamentinfo.txt", n = 196, ok = TRUE, warn = TRUE,
          encoding = "unknown", skipNul = FALSE)
## Warning in readLines(con = "tournamentinfo.txt", n = 196, ok = TRUE, warn =
## TRUE, : incomplete final line found on 'tournamentinfo.txt'

We will look at the first 15 raws of the data.

t[1:15]
##  [1] "-----------------------------------------------------------------------------------------" 
##  [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
##  [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
##  [4] "-----------------------------------------------------------------------------------------" 
##  [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
##  [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
##  [7] "-----------------------------------------------------------------------------------------" 
##  [8] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
##  [9] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
## [10] "-----------------------------------------------------------------------------------------" 
## [11] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|" 
## [12] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
## [13] "-----------------------------------------------------------------------------------------" 
## [14] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|" 
## [15] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
length(t)
## [1] 196

From the results, it is clear that starting from line 5, there is a pattern of 2 lines containing data and 1 line acting as a separator. We will contruct a vector of lines that we will want to skip Mainly, line 1-4 (header) and every 3 line starting on line 7 until line 196 (last line in structure). We will then strip these lines from our structure t, and create a new working structure t2.

skip_lines <- c(1, 2, 3, 4, seq(from = 7, to = 196, by=3))

t2 <- t[-skip_lines]

t2[1:15]
##  [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
##  [2] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
##  [3] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
##  [4] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
##  [5] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
##  [6] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
##  [7] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|"
##  [8] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
##  [9] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|"
## [10] "   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [11] "    6 | HANSEN SONG                     |5.0  |W  34|D  29|L  11|W  35|D  10|W  27|W  21|"
## [12] "   OH | 15055204 / R: 1686   ->1687     |N:3  |W    |B    |W    |B    |B    |W    |B    |"
## [13] "    7 | GARY DEE SWATHELL               |5.0  |W  57|W  46|W  13|W  11|L   1|W   9|L   2|"
## [14] "   MI | 11146376 / R: 1649   ->1673     |N:3  |W    |B    |W    |B    |B    |W    |W    |"
## [15] "    8 | EZEKIEL HOUGHTON                |5.0  |W   3|W  32|L  14|L   9|W  47|W  28|W  19|"

Analysis of the data structure

From this structure, it is clear that we have 2 formats, one for odd lines and one for even lines. In both case the “|” is acting a separator. Hence for each line type, we will first split the data into smaller groups using “|” as the separator. This will give us 2 distinct groups of strings.

We will continue the analysis for each type.

Odd Line Format First string is the player pair number in tournament, 2nd string contains the name of player, 3rd string is the total score of player, and the next 7 represent the results for player in matches he played. each of this represents results and the pair number of opponent.

If the player does not meet an opponent for a given round, extraction process will assign NA for that round.

Even Line Format First string is the state of the player, 2nd string contains 3 piece of information; UScF ID of player and player pre-tournament rating (following R:) We are not concerns at this time with the remaining information.

we will write a function to perform the extraction and return the result into a data frame.

odd_line_f <- function(line_string){
  # function to format odd lines from tournament table
  # split string into individual group of fields based on '|' as separator
  odd_line <- unlist(str_split(line_string, "\\|"))
  
  # 1st position represents player pair number, we recast string as integer
  player_pair_num <-as.integer(str_trim(odd_line[1]))
  
  # 2nd position represents player name, we are stripping extra blank
  player_name <- str_trim(odd_line[2])
  
  # 3rd position represent player total points, we recast string as numerical
  player_total_pts <- as.numeric(str_trim(odd_line[3]))
  
  # Next 7 positions represents player opponents and results
  # We first extract alpha charater and stored in results
  player_results <- unlist(str_extract(odd_line, "[[:alpha:]]"))[c(4, 5, 6, 7, 8, 9, 10)]
  
  # Next we extract digits and store in player_opponents
  player_opponents <- as.integer(str_extract(odd_line, "\\d{1,2}")[c(4, 5, 6, 7, 8, 9, 10)])
  
  df_out <- data.frame(player_pair_num, player_name, player_total_pts, player_opponents, player_results)
  
  return(df_out)
}

even_line_f <- function(line_string){
  # Function to format even lines from tournament table
  # split string into individual group of fields based on '|' as separator
  even_line <- unlist(str_split(line_string, "\\|"))

  # extract player's state and store
  player_state <- str_trim((even_line[1]))
  
  # split up 2nd field into individuals using :blank: as separator, limit to 6 groups to trim extra space 
  player_numbers_raw <- unlist(str_split(even_line[2], "[[:blank:]]", 6))
  # Only consider grouping with digits
  player_numbers <- grep("[[:digit:]]", player_numbers_raw, value = TRUE)
  
  # Extract 2nd position as player's USCF ID 
  player_USCF_id <- str_trim(player_numbers[1])
  
  # Extract 5th position as player pre-tournament score
  player_pre_score <- str_extract(player_numbers[2], "\\d*")
  
  df_out <- data.frame(player_state, player_USCF_id, player_pre_score)
  
  return(df_out)
}

Processing of the data structure

For each pair of rows in the data structure t2, we will process the first one using function: odd_line_f and the second row using even_line_f. The output dataframe for each will be binded together and then appended to our final data frame. This processing will be done with in a for loop.

for (i in 1:length(t2)){
  if(i %% 2 == 1){
    df_odd <- odd_line_f(t2[i])
  }else{
    df_even <- even_line_f(t2[i])
    df_row <- cbind(df_odd, df_even)
    
    if(i == 2){
      df_final <- df_row
    }else{
      df_final <- rbind(df_final, df_row)
    }
  }
}

Computation of average of pre-tournament rating of opponents

We first extract the list of players and their pre-tournament score. We also only consider the unique rows.
Then, we merge this dataframe with original, matching on the player opponents. This will bring the pre-score for each opponent.

We sort the resulting data frame by player_pair_num.

The merge function cast the scores as factor. To perform any aggregate function on these, we have to type cast them as numeric using as.numeric(as.character(df$factor_fiel)), as.numeric directly on factor does not work.

Finally, we will average this value accross all rows for each player.

# Extract player_pair_num and player pre-tournament score in different data frame
df_players_list <- df_final[, c(1, 8)]
df_players_list <- unique(df_players_list)

# match the opponent number with new list of player/pre-tournament score to populate opponent pre-tournament score
df_final_all <- merge(df_final, df_players_list, by.x="player_opponents", by.y="player_pair_num")

# sort the resulting data frame in order of player_pair_num
df_final_all_sorted <-df_final_all[order(df_final_all$player_pair_num), ]

# display structure of resulting sorted data frame
str(df_final_all_sorted)
## 'data.frame':    408 obs. of  9 variables:
##  $ player_opponents  : int  4 7 12 14 18 21 39 4 7 16 ...
##  $ player_pair_num   : int  1 1 1 1 1 1 1 2 2 2 ...
##  $ player_name       : Factor w/ 64 levels "GARY HUA","DAKSHESH DARURI",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ player_total_pts  : num  6 6 6 6 6 6 6 6 6 6 ...
##  $ player_results    : Factor w/ 7 levels "D","W","L","H",..: 1 2 1 2 2 2 2 3 2 2 ...
##  $ player_state      : Factor w/ 3 levels "ON","MI","OH": 1 1 1 1 1 1 1 2 2 2 ...
##  $ player_USCF_id    : Factor w/ 64 levels "15445895","14598900",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ player_pre_score.x: Factor w/ 64 levels "1794","1553",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ player_pre_score.y: Factor w/ 64 levels "1794","1553",..: 4 7 12 14 18 21 39 4 7 16 ...
# recast pre-tournament score for player and opponent from factor to numeric (must do character first)
# rename new field for better clarity
df_final_all_sorted$player_pre_score = as.numeric(as.character(df_final_all_sorted$player_pre_score.x))
df_final_all_sorted$opponent_pre_score = as.numeric(as.character(df_final_all_sorted$player_pre_score.y))

# display data frame structure data frame
str(df_final_all_sorted)
## 'data.frame':    408 obs. of  11 variables:
##  $ player_opponents  : int  4 7 12 14 18 21 39 4 7 16 ...
##  $ player_pair_num   : int  1 1 1 1 1 1 1 2 2 2 ...
##  $ player_name       : Factor w/ 64 levels "GARY HUA","DAKSHESH DARURI",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ player_total_pts  : num  6 6 6 6 6 6 6 6 6 6 ...
##  $ player_results    : Factor w/ 7 levels "D","W","L","H",..: 1 2 1 2 2 2 2 3 2 2 ...
##  $ player_state      : Factor w/ 3 levels "ON","MI","OH": 1 1 1 1 1 1 1 2 2 2 ...
##  $ player_USCF_id    : Factor w/ 64 levels "15445895","14598900",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ player_pre_score.x: Factor w/ 64 levels "1794","1553",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ player_pre_score.y: Factor w/ 64 levels "1794","1553",..: 4 7 12 14 18 21 39 4 7 16 ...
##  $ player_pre_score  : num  1794 1794 1794 1794 1794 ...
##  $ opponent_pre_score: num  1716 1649 1663 1610 1600 ...
# using ddply calculate average pre-tournament score of opponents 
df_final_average <- ddply(df_final_all_sorted, c("player_pair_num", "player_name", "player_state", "player_total_pts", "player_pre_score"),summarise, player_average_rating = round(mean(opponent_pre_score, na.rm=TRUE),0))

# display resulting data frame
str(df_final_average)
## 'data.frame':    64 obs. of  6 variables:
##  $ player_pair_num      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ player_name          : Factor w/ 64 levels "GARY HUA","DAKSHESH DARURI",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ player_state         : Factor w/ 3 levels "ON","MI","OH": 1 2 2 2 2 3 2 2 1 2 ...
##  $ player_total_pts     : num  6 6 6 5.5 5.5 5 5 5 5 5 ...
##  $ player_pre_score     : num  1794 1553 1384 1716 1655 ...
##  $ player_average_rating: num  1605 1469 1564 1574 1501 ...
# display the first 20 rows of resulting data frame
df_final_average[1:20, ]
##    player_pair_num              player_name player_state player_total_pts
## 1                1                 GARY HUA           ON              6.0
## 2                2          DAKSHESH DARURI           MI              6.0
## 3                3             ADITYA BAJAJ           MI              6.0
## 4                4      PATRICK H SCHILLING           MI              5.5
## 5                5               HANSHI ZUO           MI              5.5
## 6                6              HANSEN SONG           OH              5.0
## 7                7        GARY DEE SWATHELL           MI              5.0
## 8                8         EZEKIEL HOUGHTON           MI              5.0
## 9                9              STEFANO LEE           ON              5.0
## 10              10                ANVIT RAO           MI              5.0
## 11              11 CAMERON WILLIAM MC LEMAN           MI              4.5
## 12              12           KENNETH J TACK           MI              4.5
## 13              13        TORRANCE HENRY JR           MI              4.5
## 14              14             BRADLEY SHAW           MI              4.5
## 15              15   ZACHARY JAMES HOUGHTON           MI              4.5
## 16              16             MIKE NIKITIN           MI              4.0
## 17              17       RONALD GRZEGORCZYK           MI              4.0
## 18              18            DAVID SUNDEEN           MI              4.0
## 19              19             DIPANKAR ROY           MI              4.0
## 20              20              JASON ZHENG           MI              4.0
##    player_pre_score player_average_rating
## 1              1794                  1605
## 2              1553                  1469
## 3              1384                  1564
## 4              1716                  1574
## 5              1655                  1501
## 6              1686                  1519
## 7              1649                  1372
## 8              1641                  1468
## 9              1411                  1523
## 10             1365                  1554
## 11             1712                  1468
## 12             1663                  1506
## 13             1666                  1498
## 14             1610                  1515
## 15             1220                  1484
## 16             1604                  1386
## 17             1629                  1499
## 18             1600                  1480
## 19             1564                  1426
## 20             1595                  1411

Finally we will create a .csv file from the final resulting data frame.

write.csv(df_final_average, file = "chess_tournament.csv")