The objective is to import a text file containing the results of a chess tournament, extract the information needed, and export to a CSV file that could be used in a database.
A preview of the text file being used
Column | Description |
---|---|
player_name | player’s full name |
player_state | player’s state |
total_pts | total points |
pre-rating | player’s pre-rating entering the tournament |
avg_opp_rating | average pre-rating of opponents entering the tournament |
library(tidyverse)
library(stringr)
library(ggrepel)
This will import the text file, reading each line individually and appending it to a DataFrame.
df <- data.frame()
con = file('data.txt', "r")
while (TRUE) {
line = readLines(con, n = 1)
if (length(line) == 0) {
break
}
df <- bind_rows(df, data.frame(line))
}
close(con)
The output of the DataFrame looks like this.
line |
---|
—————————————————————————————– |
Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| |
Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
—————————————————————————————– |
1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4| |
ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W | |
The next step is to remove rows in the DataFrame that does not provide any useful information at all, such as rows only containing “—”. Using str_detect(), I looked for rows containing at least some form of alphanumeric values within it.
# find rows that has alpha-numeric characters
for (row in df){
df$str_alpha <- str_detect(row, '[a-zA-Z0-9]')
}
We can see which rows were alphanumeric and which ones were not
line | str_alpha |
---|---|
—————————————————————————————– | FALSE |
Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| | TRUE |
Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | | TRUE |
—————————————————————————————– | FALSE |
1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4| | TRUE |
ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W | | TRUE |
Now lets drop these rows and the header rows we do not need
# drop non-alpha-numeric rows
df_alpha_num <- df %>%
filter(., str_alpha == TRUE) %>%
select(., line)
# drop first two rows containing header info
df_alpha_num <- df_alpha_num %>%
slice(-c(1, 2))
Here is the output contaning only the player information
line |
---|
1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4| |
ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W | |
2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7| |
MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B | |
3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12| |
MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W | |
Every two rows contains a individual’s tournament information. The DataFrame will be parsed between “|” characters and added into individual columns for each player. The columns of interest will then be selected and placed into a new DataFrame.
# create empty dataframe
player_df <- data.frame()
# loop through each row
for (i in seq(1, nrow(df_alpha_num), by = 2)){
# read every odd row for player id, name and total points
row_info <- str_split_fixed(df_alpha_num[i, ],regex("|", literal=TRUE),n=Inf)
df_temp <- data.frame(row_info)
temp_cols <- c("player_id", "player_name", "total_pts", "round_1", "round_2", "round_3", "round_4", "round_5", "round_6", "round_7")
colnames(df_temp) = temp_cols
df_export <- df_temp %>%
select(1:10)
# read every even row to add additional columns for player info state and pre-rating
row_info <- str_split_fixed(df_alpha_num[i+1, ],regex("|", literal=TRUE),n=Inf)
df_temp <- data.frame(row_info)
temp_cols <- c("player_state", "pre_rating")
colnames(df_temp) = temp_cols
df_export <- bind_cols(df_export, df_temp %>%
select(1:2))
# extract individual player info to all players dataframe
player_df <- bind_rows(player_df, df_export)
}
Here we can see an example of how the information was separated
player_state | pre_rating | NA | NA | NA | NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|---|---|---|---|---|
MI | 15006561 / R: 1163 ->1112 | B | W | W | B | W | B | B |
Finally, we get a DataFrame that begins to look more manageable to work with
player_id | player_name | total_pts | round_1 | round_2 | round_3 | round_4 | round_5 | round_6 | round_7 | player_state | pre_rating |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | GARY HUA | 6.0 | W 39 | W 21 | W 18 | W 14 | W 7 | D 12 | D 4 | ON | 15445895 / R: 1794 ->1817 |
2 | DAKSHESH DARURI | 6.0 | W 63 | W 58 | L 4 | W 17 | W 16 | W 20 | W 7 | MI | 14598900 / R: 1553 ->1663 |
3 | ADITYA BAJAJ | 6.0 | L 8 | W 61 | W 25 | W 21 | W 11 | W 13 | W 12 | MI | 14959604 / R: 1384 ->1640 |
4 | PATRICK H SCHILLING | 5.5 | W 23 | D 28 | W 2 | W 26 | D 5 | W 19 | D 1 | MI | 12616049 / R: 1716 ->1744 |
5 | HANSHI ZUO | 5.5 | W 45 | W 37 | D 12 | D 13 | D 4 | W 14 | W 17 | MI | 14601533 / R: 1655 ->1690 |
6 | HANSEN SONG | 5.0 | W 34 | D 29 | L 11 | W 35 | D 10 | W 27 | W 21 | OH | 15055204 / R: 1686 ->1687 |
We can see that some of the columns need to be further cleaned because they contain more information than needed. The next step was removing the decision such as “W” for win, and solely keeping the numeric value as it tells us the player ID of the person they played that round.
for (i in 4:10){
player_df[, i] <- gsub("[^0-9]", "", player_df[, i])
}
Next, the player’s pre-rating is right after the “:” but before “<-”. For example, in “14959604 / R: 1384 ->1640”, 1384 would be the player’s pre-rating. The number will be extracted using parse_number()
player_df <- player_df %>%
separate_wider_delim(pre_rating, delim = ": ", names = c('rtg_1', 'rtg_2'))
player_df$pre_rating <- player_df$rtg_2 %>% parse_number()
This will remove preceding and following whitespaces in a string throughout the DataFrame.
player_df <- player_df %>%
mutate(across(where(is.character), str_trim))
Finally, the columns data types were change where appropriately needed.
change_type <- c("player_id", "total_pts", "round_1", "round_2", "round_3", "round_4", "round_5", "round_6", "round_7", "pre_rating")
player_df <- player_df %>%
mutate_at(change_type, as.numeric)
The DataFrame now looks like this.
player_id | player_name | total_pts | round_1 | round_2 | round_3 | round_4 | round_5 | round_6 | round_7 | player_state | rtg_1 | rtg_2 | pre_rating |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | GARY HUA | 6.0 | 39 | 21 | 18 | 14 | 7 | 12 | 4 | ON | 15445895 / R | 1794 ->1817 | 1794 |
2 | DAKSHESH DARURI | 6.0 | 63 | 58 | 4 | 17 | 16 | 20 | 7 | MI | 14598900 / R | 1553 ->1663 | 1553 |
3 | ADITYA BAJAJ | 6.0 | 8 | 61 | 25 | 21 | 11 | 13 | 12 | MI | 14959604 / R | 1384 ->1640 | 1384 |
4 | PATRICK H SCHILLING | 5.5 | 23 | 28 | 2 | 26 | 5 | 19 | 1 | MI | 12616049 / R | 1716 ->1744 | 1716 |
5 | HANSHI ZUO | 5.5 | 45 | 37 | 12 | 13 | 4 | 14 | 17 | MI | 14601533 / R | 1655 ->1690 | 1655 |
6 | HANSEN SONG | 5.0 | 34 | 29 | 11 | 35 | 10 | 27 | 21 | OH | 15055204 / R | 1686 ->1687 | 1686 |
As mentioned before, each round column contains the opponent they played. Now, that opponent’s player ID will be replaced with the opponent’s pre-rating instead. This is so we can calculate the opponent’s average pre-rating for the player.
for (i in 1:nrow(player_df)){
for (j in 4:10){
if (!is.na(player_df[[i,j]])){
player_df[i,j] <- player_df %>%
filter(., player_id == player_df[[i, j]]) %>%
select(pre_rating)
}
}
}
player_id | player_name | total_pts | round_1 | round_2 | round_3 | round_4 | round_5 | round_6 | round_7 | player_state | rtg_1 | rtg_2 | pre_rating |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | GARY HUA | 6.0 | 1436 | 1563 | 1600 | 1610 | 1649 | 1663 | 1716 | ON | 15445895 / R | 1794 ->1817 | 1794 |
2 | DAKSHESH DARURI | 6.0 | 1175 | 917 | 1716 | 1629 | 1604 | 1595 | 1649 | MI | 14598900 / R | 1553 ->1663 | 1553 |
3 | ADITYA BAJAJ | 6.0 | 1641 | 955 | 1745 | 1563 | 1712 | 1666 | 1663 | MI | 14959604 / R | 1384 ->1640 | 1384 |
4 | PATRICK H SCHILLING | 5.5 | 1363 | 1507 | 1553 | 1579 | 1655 | 1564 | 1794 | MI | 12616049 / R | 1716 ->1744 | 1716 |
5 | HANSHI ZUO | 5.5 | 1242 | 980 | 1663 | 1666 | 1716 | 1610 | 1629 | MI | 14601533 / R | 1655 ->1690 | 1655 |
6 | HANSEN SONG | 5.0 | 1399 | 1602 | 1712 | 1438 | 1365 | 1552 | 1563 | OH | 15055204 / R | 1686 ->1687 | 1686 |
Here, we finally calculate the average opponent pre-rating
player_df <- player_df %>%
mutate(avg_opp_pre_rating = round(rowMeans(pick(round_1:round_7), na.rm=TRUE), 2))
player_id | player_name | total_pts | round_1 | round_2 | round_3 | round_4 | round_5 | round_6 | round_7 | player_state | rtg_1 | rtg_2 | pre_rating | avg_opp_pre_rating |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | GARY HUA | 6.0 | 1436 | 1563 | 1600 | 1610 | 1649 | 1663 | 1716 | ON | 15445895 / R | 1794 ->1817 | 1794 | 1605.29 |
2 | DAKSHESH DARURI | 6.0 | 1175 | 917 | 1716 | 1629 | 1604 | 1595 | 1649 | MI | 14598900 / R | 1553 ->1663 | 1553 | 1469.29 |
3 | ADITYA BAJAJ | 6.0 | 1641 | 955 | 1745 | 1563 | 1712 | 1666 | 1663 | MI | 14959604 / R | 1384 ->1640 | 1384 | 1563.57 |
4 | PATRICK H SCHILLING | 5.5 | 1363 | 1507 | 1553 | 1579 | 1655 | 1564 | 1794 | MI | 12616049 / R | 1716 ->1744 | 1716 | 1573.57 |
5 | HANSHI ZUO | 5.5 | 1242 | 980 | 1663 | 1666 | 1716 | 1610 | 1629 | MI | 14601533 / R | 1655 ->1690 | 1655 | 1500.86 |
6 | HANSEN SONG | 5.0 | 1399 | 1602 | 1712 | 1438 | 1365 | 1552 | 1563 | OH | 15055204 / R | 1686 ->1687 | 1686 | 1518.71 |
Now that we have the data we want, we can export the columns of interest to CSV.
export_df <- player_df %>%
select(-c(1, 4:10, 12, 13)) %>%
select(c(1, 3, 2, 4, 5))
write_csv(export_df, 'chess_players.csv')
player_name | player_state | total_pts | pre_rating | avg_opp_pre_rating |
---|---|---|---|---|
GARY HUA | ON | 6.0 | 1794 | 1605.29 |
DAKSHESH DARURI | MI | 6.0 | 1553 | 1469.29 |
ADITYA BAJAJ | MI | 6.0 | 1384 | 1563.57 |
PATRICK H SCHILLING | MI | 5.5 | 1716 | 1573.57 |
HANSHI ZUO | MI | 5.5 | 1655 | 1500.86 |
HANSEN SONG | OH | 5.0 | 1686 | 1518.71 |
Calculating the pre-rating difference between the player and average player rating, we can see some players had a tougher road to reach the top in total points earned. Aditya Bajaj had the largest difference of almost -180 pre-rating to their average opponent, yet collected 6.0 total points.
player_name | player_state | total_pts | rating_diff |
---|---|---|---|
ADITYA BAJAJ | MI | 6.0 | -179.57 |
DAKSHESH DARURI | MI | 6.0 | 83.71 |
GARY HUA | ON | 6.0 | 188.71 |
PATRICK H SCHILLING | MI | 5.5 | 142.43 |
HANSHI ZUO | MI | 5.5 | 154.14 |
STEFANO LEE | ON | 5.0 | -112.14 |
HANSEN SONG | OH | 5.0 | 167.29 |
DINH DANG BUI | ON | 4.0 | 92.57 |
ALAN BUI | ON | 4.0 | 149.14 |
Reviewing the methods used, alternatives along the way could have been done. Utilizing regular expressions and creating flexibility on the column headers. This would allow us to create a DataFrame that collected all the data provided in long format.