Introduction

The objective is to import a text file containing the results of a chess tournament, extract the information needed, and export to a CSV file that could be used in a database.

A preview of the text file being used


Data Dictionary

Column Description
player_name player’s full name
player_state player’s state
total_pts total points
pre-rating player’s pre-rating entering the tournament
avg_opp_rating average pre-rating of opponents entering the tournament

Required Libraries

library(tidyverse)
library(stringr)
library(ggrepel)

Read Text File

This will import the text file, reading each line individually and appending it to a DataFrame.

df <- data.frame()

con = file('data.txt', "r")
while (TRUE) {
  line = readLines(con, n = 1)
  
  if (length(line) == 0) {
     break
  }
  
  df <- bind_rows(df, data.frame(line))
}
close(con)

The output of the DataFrame looks like this.

line
—————————————————————————————–
Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|
Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
—————————————————————————————–
1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |

Drop Non-Alphanumeric Rows

The next step is to remove rows in the DataFrame that does not provide any useful information at all, such as rows only containing “—”. Using str_detect(), I looked for rows containing at least some form of alphanumeric values within it.

# find rows that has alpha-numeric characters
for (row in df){
  df$str_alpha <- str_detect(row, '[a-zA-Z0-9]')
}

We can see which rows were alphanumeric and which ones were not

line str_alpha
—————————————————————————————– FALSE
Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| TRUE
Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | TRUE
—————————————————————————————– FALSE
1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4| TRUE
ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W | TRUE

Now lets drop these rows and the header rows we do not need

# drop non-alpha-numeric rows
df_alpha_num <- df %>% 
  filter(., str_alpha == TRUE) %>%
  select(., line)

# drop first two rows containing header info
df_alpha_num <- df_alpha_num %>%
  slice(-c(1, 2))

Here is the output contaning only the player information

line
1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|
MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |
3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|
MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |

Extract Player Data

Every two rows contains a individual’s tournament information. The DataFrame will be parsed between “|” characters and added into individual columns for each player. The columns of interest will then be selected and placed into a new DataFrame.

# create empty dataframe
player_df <- data.frame()

# loop through each row 
for (i in seq(1, nrow(df_alpha_num), by = 2)){
  
  # read every odd row for player id, name and total points
  row_info <- str_split_fixed(df_alpha_num[i, ],regex("|", literal=TRUE),n=Inf)
  df_temp <- data.frame(row_info)
  temp_cols <- c("player_id", "player_name", "total_pts", "round_1", "round_2", "round_3", "round_4", "round_5", "round_6", "round_7")
  colnames(df_temp) = temp_cols
  df_export <- df_temp %>% 
      select(1:10)
  
  # read every even row to add additional columns for player info state and pre-rating
  row_info <- str_split_fixed(df_alpha_num[i+1, ],regex("|", literal=TRUE),n=Inf)
  df_temp <- data.frame(row_info)
  temp_cols <- c("player_state", "pre_rating")
  colnames(df_temp) = temp_cols
  df_export <- bind_cols(df_export, df_temp %>% 
      select(1:2))
  
  # extract individual player info to all players dataframe
  player_df <- bind_rows(player_df, df_export)
}

Here we can see an example of how the information was separated

player_state pre_rating NA NA NA NA NA NA NA NA NA
MI 15006561 / R: 1163 ->1112 B W W B W B B

Finally, we get a DataFrame that begins to look more manageable to work with

player_id player_name total_pts round_1 round_2 round_3 round_4 round_5 round_6 round_7 player_state pre_rating
1 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7 D 12 D 4 ON 15445895 / R: 1794 ->1817
2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17 W 16 W 20 W 7 MI 14598900 / R: 1553 ->1663
3 ADITYA BAJAJ 6.0 L 8 W 61 W 25 W 21 W 11 W 13 W 12 MI 14959604 / R: 1384 ->1640
4 PATRICK H SCHILLING 5.5 W 23 D 28 W 2 W 26 D 5 W 19 D 1 MI 12616049 / R: 1716 ->1744
5 HANSHI ZUO 5.5 W 45 W 37 D 12 D 13 D 4 W 14 W 17 MI 14601533 / R: 1655 ->1690
6 HANSEN SONG 5.0 W 34 D 29 L 11 W 35 D 10 W 27 W 21 OH 15055204 / R: 1686 ->1687

Clean Strings

We can see that some of the columns need to be further cleaned because they contain more information than needed. The next step was removing the decision such as “W” for win, and solely keeping the numeric value as it tells us the player ID of the person they played that round.

for (i in 4:10){
  player_df[, i] <- gsub("[^0-9]", "", player_df[, i])
}

Next, the player’s pre-rating is right after the “:” but before “<-”. For example, in “14959604 / R: 1384 ->1640”, 1384 would be the player’s pre-rating. The number will be extracted using parse_number()

player_df <- player_df %>% 
  separate_wider_delim(pre_rating, delim = ": ", names = c('rtg_1', 'rtg_2'))

player_df$pre_rating <- player_df$rtg_2 %>% parse_number()

This will remove preceding and following whitespaces in a string throughout the DataFrame.

player_df <- player_df %>% 
  mutate(across(where(is.character), str_trim))

Finally, the columns data types were change where appropriately needed.

change_type <- c("player_id", "total_pts", "round_1", "round_2", "round_3", "round_4", "round_5", "round_6", "round_7", "pre_rating")
player_df <- player_df %>% 
  mutate_at(change_type, as.numeric)

The DataFrame now looks like this.

player_id player_name total_pts round_1 round_2 round_3 round_4 round_5 round_6 round_7 player_state rtg_1 rtg_2 pre_rating
1 GARY HUA 6.0 39 21 18 14 7 12 4 ON 15445895 / R 1794 ->1817 1794
2 DAKSHESH DARURI 6.0 63 58 4 17 16 20 7 MI 14598900 / R 1553 ->1663 1553
3 ADITYA BAJAJ 6.0 8 61 25 21 11 13 12 MI 14959604 / R 1384 ->1640 1384
4 PATRICK H SCHILLING 5.5 23 28 2 26 5 19 1 MI 12616049 / R 1716 ->1744 1716
5 HANSHI ZUO 5.5 45 37 12 13 4 14 17 MI 14601533 / R 1655 ->1690 1655
6 HANSEN SONG 5.0 34 29 11 35 10 27 21 OH 15055204 / R 1686 ->1687 1686

Determine Opponents Average Pre-Ratings

As mentioned before, each round column contains the opponent they played. Now, that opponent’s player ID will be replaced with the opponent’s pre-rating instead. This is so we can calculate the opponent’s average pre-rating for the player.

for (i in 1:nrow(player_df)){
  for (j in 4:10){
    if (!is.na(player_df[[i,j]])){
      player_df[i,j] <- player_df %>% 
        filter(., player_id == player_df[[i, j]]) %>% 
        select(pre_rating)
    }
  }
}
player_id player_name total_pts round_1 round_2 round_3 round_4 round_5 round_6 round_7 player_state rtg_1 rtg_2 pre_rating
1 GARY HUA 6.0 1436 1563 1600 1610 1649 1663 1716 ON 15445895 / R 1794 ->1817 1794
2 DAKSHESH DARURI 6.0 1175 917 1716 1629 1604 1595 1649 MI 14598900 / R 1553 ->1663 1553
3 ADITYA BAJAJ 6.0 1641 955 1745 1563 1712 1666 1663 MI 14959604 / R 1384 ->1640 1384
4 PATRICK H SCHILLING 5.5 1363 1507 1553 1579 1655 1564 1794 MI 12616049 / R 1716 ->1744 1716
5 HANSHI ZUO 5.5 1242 980 1663 1666 1716 1610 1629 MI 14601533 / R 1655 ->1690 1655
6 HANSEN SONG 5.0 1399 1602 1712 1438 1365 1552 1563 OH 15055204 / R 1686 ->1687 1686

Here, we finally calculate the average opponent pre-rating

player_df <- player_df %>% 
  mutate(avg_opp_pre_rating = round(rowMeans(pick(round_1:round_7), na.rm=TRUE), 2))
player_id player_name total_pts round_1 round_2 round_3 round_4 round_5 round_6 round_7 player_state rtg_1 rtg_2 pre_rating avg_opp_pre_rating
1 GARY HUA 6.0 1436 1563 1600 1610 1649 1663 1716 ON 15445895 / R 1794 ->1817 1794 1605.29
2 DAKSHESH DARURI 6.0 1175 917 1716 1629 1604 1595 1649 MI 14598900 / R 1553 ->1663 1553 1469.29
3 ADITYA BAJAJ 6.0 1641 955 1745 1563 1712 1666 1663 MI 14959604 / R 1384 ->1640 1384 1563.57
4 PATRICK H SCHILLING 5.5 1363 1507 1553 1579 1655 1564 1794 MI 12616049 / R 1716 ->1744 1716 1573.57
5 HANSHI ZUO 5.5 1242 980 1663 1666 1716 1610 1629 MI 14601533 / R 1655 ->1690 1655 1500.86
6 HANSEN SONG 5.0 1399 1602 1712 1438 1365 1552 1563 OH 15055204 / R 1686 ->1687 1686 1518.71

Export to CSV

Now that we have the data we want, we can export the columns of interest to CSV.

export_df <- player_df %>% 
  select(-c(1, 4:10, 12, 13)) %>% 
  select(c(1, 3, 2, 4, 5)) 

write_csv(export_df, 'chess_players.csv')
player_name player_state total_pts pre_rating avg_opp_pre_rating
GARY HUA ON 6.0 1794 1605.29
DAKSHESH DARURI MI 6.0 1553 1469.29
ADITYA BAJAJ MI 6.0 1384 1563.57
PATRICK H SCHILLING MI 5.5 1716 1573.57
HANSHI ZUO MI 5.5 1655 1500.86
HANSEN SONG OH 5.0 1686 1518.71

Conclusion

Calculating the pre-rating difference between the player and average player rating, we can see some players had a tougher road to reach the top in total points earned. Aditya Bajaj had the largest difference of almost -180 pre-rating to their average opponent, yet collected 6.0 total points.

player_name player_state total_pts rating_diff
ADITYA BAJAJ MI 6.0 -179.57
DAKSHESH DARURI MI 6.0 83.71
GARY HUA ON 6.0 188.71
PATRICK H SCHILLING MI 5.5 142.43
HANSHI ZUO MI 5.5 154.14
STEFANO LEE ON 5.0 -112.14
HANSEN SONG OH 5.0 167.29
DINH DANG BUI ON 4.0 92.57
ALAN BUI ON 4.0 149.14

Recommendations

Reviewing the methods used, alternatives along the way could have been done. Utilizing regular expressions and creating flexibility on the column headers. This would allow us to create a DataFrame that collected all the data provided in long format.