Overview

For this project, we were given a text document with a very specific formatting and we were asked to reformat into dataframe of columns “Id”, “Player”, “State”, “Total”, “Player Pre-Tournament Rtg”, “Avg. Openent Tournament Rating” that can exported as a csv.

I read the text file in two ways:

First, I initialized a variable for my filename, a list of column headers, and number of columns to use later on.

file <-'tournamentinfo.txt'

headers <- c("player_id", "Player", "Total", "Round 1", "Round 2", "Round 3", "Round 4", "Round 5", "Round 6", "Round 7", "State", "USCFID/Rtg (Pre->Post)", "n", "Side 1", "Side 2", "Side 3", "Side 4", "Side 5", "Side 6", "Side 7")

num_cols <- length(headers)

Reading Text File Using as readLines

The readLines function provides a simple method for reading the file as type char. After reading in the file, I began to parse the data. First, I used grep to remove any line that includes the record breaks (“—”). I then split each line into columns using the strsplit function and setting my separator to “|”. This resulted in a list.

# readlines 
raw_lines <- readLines(file)
## Warning in readLines(file): incomplete final line found on 'tournamentinfo.txt'
# subset only data rows by omitting the breaker lines
data_rows <- raw_lines[grep("---", raw_lines, invert = TRUE)]
# split by |
rows_split = strsplit(data_rows, "\\|")

Joining data from two rows

The resulting list still had each player’s information spread accross two rows. I needed to convert it to dataframe. I first used unlist to expand my data into a long format, putting every value on a single line. I then used the matrix function into the right shape with 20 cols by 64 rows. I casted the result into a dataframe and set my headers.

# convert list to dataframe
parsed_list <- rows_split[3:length(rows_split)]
lines_df <- data.frame(matrix(unlist(parsed_list),  ncol=num_cols, byrow = 1))
# add column names
names(lines_df) <- headers


# convert dataframe to tibble and trim white space
line_data <- as_tibble(lines_df) |>
   mutate(across(where(is.character), str_trim))
dim(line_data)
## [1] 64 20
gt(line_data[1:5,])
player_id Player Total Round 1 Round 2 Round 3 Round 4 Round 5 Round 6 Round 7 State USCFID/Rtg (Pre->Post) n Side 1 Side 2 Side 3 Side 4 Side 5 Side 6 Side 7
1 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7 D 12 D 4 ON 15445895 / R: 1794 ->1817 N:2 W B W B W B W
2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17 W 16 W 20 W 7 MI 14598900 / R: 1553 ->1663 N:2 B W B W B W B
3 ADITYA BAJAJ 6.0 L 8 W 61 W 25 W 21 W 11 W 13 W 12 MI 14959604 / R: 1384 ->1640 N:2 W B W B W B W
4 PATRICK H SCHILLING 5.5 W 23 D 28 W 2 W 26 D 5 W 19 D 1 MI 12616049 / R: 1716 ->1744 N:2 W B W B W B B
5 HANSHI ZUO 5.5 W 45 W 37 D 12 D 13 D 4 W 14 W 17 MI 14601533 / R: 1655 ->1690 N:2 B W B W B W B

Reading using read.delim

Although my data ended up in the correct format, I wasn’t sure if I loved the approach. I opted to try using the read.delim function declare my separators and remove blank columns ahead of time. This method also had the added benefit of returning a dataframe, so I didn’t need to convert it as in the last step. The record breaks (“—”) converted in NAs so I used drop_na to remove these rows from my dataset.

# read text file usig read.delim
delim_data <- read.delim(file, header=FALSE, sep="|", skip=4, skipNul=1, blank.lines.skip=1, na.strings=c(""," ","NA")) 

# drop all NA's
delim_data_df <- Filter(function(x)!all(is.na(x)), delim_data) |>
  drop_na()

Consolidating Rows

I used the modulo within the slice function to split my data rows into two dataframes, each containing half of the record’s data. I added a column for the row number, which I used as my key when stitching the dataframes back together using merge function. I later dropped the row_id column and declared my new dataframes column names. There’s probably a more elegant way to do this, but I wasn’t able to find a method that worked.

# split data into two dfs
df_even <- delim_data_df |>
  slice(which(row_number() %% 2 == 1)) |>
  mutate(row_id = row_number())

df_odd <- delim_data_df |>
  slice(which(row_number() %% 2 == 0)) |>
  mutate(row_id = row_number())

# merge data by row_id
chess_data <- merge(df_even, df_odd, by="row_id") |>
  subset(select = -c(row_id)) 

# add column names
names(chess_data) <- headers
head(chess_data)
##   player_id                            Player Total Round 1 Round 2 Round 3
## 1        1   GARY HUA                         6.0     W  39   W  21   W  18
## 2        2   DAKSHESH DARURI                  6.0     W  63   W  58   L   4
## 3        3   ADITYA BAJAJ                     6.0     L   8   W  61   W  25
## 4        4   PATRICK H SCHILLING              5.5     W  23   D  28   W   2
## 5        5   HANSHI ZUO                       5.5     W  45   W  37   D  12
## 6        6   HANSEN SONG                      5.0     W  34   D  29   L  11
##   Round 4 Round 5 Round 6 Round 7  State            USCFID/Rtg (Pre->Post)
## 1   W  14   W   7   D  12   D   4    ON   15445895 / R: 1794   ->1817     
## 2   W  17   W  16   W  20   W   7    MI   14598900 / R: 1553   ->1663     
## 3   W  21   W  11   W  13   W  12    MI   14959604 / R: 1384   ->1640     
## 4   W  26   D   5   W  19   D   1    MI   12616049 / R: 1716   ->1744     
## 5   D  13   D   4   W  14   W  17    MI   14601533 / R: 1655   ->1690     
## 6   W  35   D  10   W  27   W  21    OH   15055204 / R: 1686   ->1687     
##       n Side 1 Side 2 Side 3 Side 4 Side 5 Side 6 Side 7
## 1 N:2    W      B      W      B      W      B      W    
## 2 N:2    B      W      B      W      B      W      B    
## 3 N:2    W      B      W      B      W      B      W    
## 4 N:2    W      B      W      B      W      B      B    
## 5 N:2    B      W      B      W      B      W      B    
## 6 N:3    W      B      W      B      B      W      B

Tidying Up the Dataframe

I next had to begin tidying up the dataframe. First, I needed to extract the multiple values stored in the “USCFID/Rtg (Pre->Post)” column, including the pre-tournament rating column. I used the separate function in several steps to break up “USCFID/Rtg (Pre->Post)” to “USCFID”, ‘Rtg (Pre)’, ‘Rtg (Post)’, ‘Rtg (Pre P)’, ‘Rtg (Post P)’. R displayed some warnings since the “P” column had many missing values, but these can be ignored since we won’t be using these columns. I created a subset with only the values I needed and assigned this to the dataframe ‘chess_player_rating’.

I also needed to do something similar with the Round # columns, as this column stored both the outcome (W=Win, L=Loss, H=Half Point, U=Withdrew).

# Break up "USCFID/Rtg (Pre->Post)" into its parts
chess_player <- chess_data |>
  separate(col="USCFID/Rtg (Pre->Post)", into=c('USCFID', 'Rtg'), sep = "\\/ R:") 
  
chess_player_rating <- chess_player |>
  separate(col="Rtg", into=c('Rtg (Pre)', 'Rtg (Post)'), sep = "->") |>
  separate(col="Rtg (Pre)", into=c('rtg_pre', 'Rtg (Pre P)'), sep = "P") |>
  separate(col="Rtg (Post)", into=c('rtg_post', 'Rtg (Post P)'), sep = "P") |>
  mutate(across(where(is.character), str_trim)) |>
  mutate(
    rtg_pre = as.integer(rtg_pre),
    rtg_post = as.integer(rtg_post)
  ) |>
  select("player_id", "Player", "State", "Total", rtg_pre, rtg_post)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 54 rows [1, 2, 3, 4, 5,
## 6, 7, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 22, 23, ...].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 56 rows [1, 2, 3, 4, 5,
## 6, 7, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 21, 22, ...].

I also needed to do something similar with the Round # columns, as this column stored both the outcome (W=Win, L=Loss, H=Half Point, U=Withdrew) and the opponent’s id. I used pivot_longer to get every observation into its own row, then separate to break up the result and opponent’s id into their own colummn. I removed any white space then filtered for W|L rows as H|U represent unplayed matches. I did a right_join to my player ratings dataframe using the oppenent_id as a key. Finally, I ran a summary to get mean or average oppenent rating pre-tournament.

# get opponent pre ratings per match
tournament_opponents_ratings <- chess_player |>
  select( "player_id","Round 1", "Round 2", "Round 3", "Round 4", "Round 5", "Round 6", "Round 7") |>
  pivot_longer(
    cols = !(player_id),
    names_to = c( "round"), 
    values_to = "results"
  ) |>
  separate(col="results", into=c("result","opponent_id"), sep="(\\s{2})") |>
  mutate(across(where(is.character), str_trim)) |>
  filter(result == "W" | result =="L") |>
  right_join(chess_player_rating, join_by(opponent_id == player_id))
## Warning: Expected 2 pieces. Additional pieces discarded in 40 rows [82, 108, 112, 152,
## 189, 203, 250, 253, 257, 264, 285, 286, 287, 302, 332, 334, 342, 343, 348, 365,
## ...].
# summarise oppent ratings per player
opponent_pre_rtgs <- tournament_opponents_ratings  |>
  group_by(player_id) |>
  summarise(
    avg_opponent_rtg =  mean(rtg_pre),
    avg_opponent_rtg = round(avg_opponent_rtg,0)
  )

Merging the results

Having my player ratings and their openent ratings, I conducted a left_merge using the player_id as my primary key to combine the two datasets into their final form.

# merge final dataframe
chess_player_stats <- chess_player_rating |>
  left_join(opponent_pre_rtgs, join_by(player_id == player_id)) |>
  select("Player", "State", "Total", rtg_pre, avg_opponent_rtg)

gt(chess_player_stats)
Player State Total rtg_pre avg_opponent_rtg
GARY HUA ON 6.0 1794 1572
DAKSHESH DARURI MI 6.0 1553 1469
ADITYA BAJAJ MI 6.0 1384 1564
PATRICK H SCHILLING MI 5.5 1716 1515
HANSHI ZUO MI 5.5 1655 1365
HANSEN SONG OH 5.0 1686 1533
GARY DEE SWATHELL MI 5.0 1649 1372
EZEKIEL HOUGHTON MI 5.0 1641 1468
STEFANO LEE ON 5.0 1411 1523
ANVIT RAO MI 5.0 1365 1518
CAMERON WILLIAM MC LEMAN MI 4.5 1712 1475
KENNETH J TACK MI 4.5 1663 1397
TORRANCE HENRY JR MI 4.5 1666 1472
BRADLEY SHAW MI 4.5 1610 1509
ZACHARY JAMES HOUGHTON MI 4.5 1220 1470
MIKE NIKITIN MI 4.0 1604 1391
RONALD GRZEGORCZYK MI 4.0 1629 1499
DAVID SUNDEEN MI 4.0 1600 1480
DIPANKAR ROY MI 4.0 1564 1451
JASON ZHENG MI 4.0 1595 1411
DINH DANG BUI ON 4.0 1563 1470
EUGENE L MCCLURE MI 4.0 1555 1373
ALAN BUI ON 4.0 1363 1214
MICHAEL R ALDRICH MI 4.0 1229 1357
LOREN SCHWIEBERT MI 3.5 1745 1357
MAX ZHU ON 3.5 1579 1518
GAURAV GIDWANI MI 3.5 1552 1144
SOFIA ADINA STANESCU-BELLU MI 3.5 1507 1505
CHIEDOZIE OKORIE MI 3.5 1602 1239
GEORGE AVERY JONES ON 3.5 1522 1141
RISHI SHETTY MI 3.5 1494 1272
JOSHUA PHILIP MATHEWS ON 3.5 1441 1345
JADE GE MI 3.5 1449 1264
MICHAEL JEFFERY THOMAS MI 3.5 1399 1314
JOSHUA DAVID LEE MI 3.5 1438 1186
SIDDHARTH JHA MI 3.5 1355 1343
AMIYATOSH PWNANANDAM MI 3.5 980 1385
BRIAN LIU MI 3.0 1423 1505
JOEL R HENDON MI 3.0 1436 1430
FOREST ZHANG MI 3.0 1348 1391
KYLE WILLIAM MURPHY MI 3.0 1403 1248
JARED GE MI 3.0 1332 1225
ROBERT GLEN VASEY MI 3.0 1283 1107
JUSTIN D SCHILLING MI 3.0 1199 1327
DEREK YAN MI 3.0 1242 1182
JACOB ALEXANDER LAVALLEY MI 3.0 377 1358
ERIC WRIGHT MI 2.5 1362 1456
DANIEL KHAIN MI 2.5 1382 1461
MICHAEL J MARTIN MI 2.5 1291 1364
SHIVAM JHA MI 2.5 1056 1296
TEJAS AYYAGARI MI 2.5 1011 1355
ETHAN GUO MI 2.5 935 1522
JOSE C YBARRA MI 2.0 1393 1345
LARRY HODGE MI 2.0 1270 1206
ALEX KONG MI 2.0 1186 1425
MARISA RICCI MI 2.0 1153 1414
MICHAEL LU MI 2.0 1092 1363
VIRAJ MOHILE MI 2.0 917 1391
SEAN M MC CORMICK MI 2.0 853 1319
JULIA SHEN MI 1.5 967 1359
JEZZEL FARKAS ON 1.5 955 1326
ASHWIN BALAJI MI 1.0 1530 1186
THOMAS JOSEPH HOSMER MI 1.0 1175 1365
BEN LI MI 1.0 1163 1206

Writing the Dataframe to CSV

Finally, I used write.csv to export my dataframe as a csv.

write.csv(chess_player_stats,"chess_tournament_player_stats.csv")