For this project, we were given a text document with a very specific formatting and we were asked to reformat into dataframe of columns “Id”, “Player”, “State”, “Total”, “Player Pre-Tournament Rtg”, “Avg. Openent Tournament Rating” that can exported as a csv.
I read the text file in two ways:
First, I initialized a variable for my filename, a list of column headers, and number of columns to use later on.
file <-'tournamentinfo.txt'
headers <- c("player_id", "Player", "Total", "Round 1", "Round 2", "Round 3", "Round 4", "Round 5", "Round 6", "Round 7", "State", "USCFID/Rtg (Pre->Post)", "n", "Side 1", "Side 2", "Side 3", "Side 4", "Side 5", "Side 6", "Side 7")
num_cols <- length(headers)
The readLines function provides a simple method for reading the file as type char. After reading in the file, I began to parse the data. First, I used grep to remove any line that includes the record breaks (“—”). I then split each line into columns using the strsplit function and setting my separator to “|”. This resulted in a list.
# readlines
raw_lines <- readLines(file)
## Warning in readLines(file): incomplete final line found on 'tournamentinfo.txt'
# subset only data rows by omitting the breaker lines
data_rows <- raw_lines[grep("---", raw_lines, invert = TRUE)]
# split by |
rows_split = strsplit(data_rows, "\\|")
The resulting list still had each player’s information spread accross two rows. I needed to convert it to dataframe. I first used unlist to expand my data into a long format, putting every value on a single line. I then used the matrix function into the right shape with 20 cols by 64 rows. I casted the result into a dataframe and set my headers.
# convert list to dataframe
parsed_list <- rows_split[3:length(rows_split)]
lines_df <- data.frame(matrix(unlist(parsed_list), ncol=num_cols, byrow = 1))
# add column names
names(lines_df) <- headers
# convert dataframe to tibble and trim white space
line_data <- as_tibble(lines_df) |>
mutate(across(where(is.character), str_trim))
dim(line_data)
## [1] 64 20
gt(line_data[1:5,])
| player_id | Player | Total | Round 1 | Round 2 | Round 3 | Round 4 | Round 5 | Round 6 | Round 7 | State | USCFID/Rtg (Pre->Post) | n | Side 1 | Side 2 | Side 3 | Side 4 | Side 5 | Side 6 | Side 7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GARY HUA | 6.0 | W 39 | W 21 | W 18 | W 14 | W 7 | D 12 | D 4 | ON | 15445895 / R: 1794 ->1817 | N:2 | W | B | W | B | W | B | W |
| 2 | DAKSHESH DARURI | 6.0 | W 63 | W 58 | L 4 | W 17 | W 16 | W 20 | W 7 | MI | 14598900 / R: 1553 ->1663 | N:2 | B | W | B | W | B | W | B |
| 3 | ADITYA BAJAJ | 6.0 | L 8 | W 61 | W 25 | W 21 | W 11 | W 13 | W 12 | MI | 14959604 / R: 1384 ->1640 | N:2 | W | B | W | B | W | B | W |
| 4 | PATRICK H SCHILLING | 5.5 | W 23 | D 28 | W 2 | W 26 | D 5 | W 19 | D 1 | MI | 12616049 / R: 1716 ->1744 | N:2 | W | B | W | B | W | B | B |
| 5 | HANSHI ZUO | 5.5 | W 45 | W 37 | D 12 | D 13 | D 4 | W 14 | W 17 | MI | 14601533 / R: 1655 ->1690 | N:2 | B | W | B | W | B | W | B |
Although my data ended up in the correct format, I wasn’t sure if I loved the approach. I opted to try using the read.delim function declare my separators and remove blank columns ahead of time. This method also had the added benefit of returning a dataframe, so I didn’t need to convert it as in the last step. The record breaks (“—”) converted in NAs so I used drop_na to remove these rows from my dataset.
# read text file usig read.delim
delim_data <- read.delim(file, header=FALSE, sep="|", skip=4, skipNul=1, blank.lines.skip=1, na.strings=c(""," ","NA"))
# drop all NA's
delim_data_df <- Filter(function(x)!all(is.na(x)), delim_data) |>
drop_na()
I used the modulo within the slice function to split my data rows into two dataframes, each containing half of the record’s data. I added a column for the row number, which I used as my key when stitching the dataframes back together using merge function. I later dropped the row_id column and declared my new dataframes column names. There’s probably a more elegant way to do this, but I wasn’t able to find a method that worked.
# split data into two dfs
df_even <- delim_data_df |>
slice(which(row_number() %% 2 == 1)) |>
mutate(row_id = row_number())
df_odd <- delim_data_df |>
slice(which(row_number() %% 2 == 0)) |>
mutate(row_id = row_number())
# merge data by row_id
chess_data <- merge(df_even, df_odd, by="row_id") |>
subset(select = -c(row_id))
# add column names
names(chess_data) <- headers
head(chess_data)
## player_id Player Total Round 1 Round 2 Round 3
## 1 1 GARY HUA 6.0 W 39 W 21 W 18
## 2 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4
## 3 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25
## 4 4 PATRICK H SCHILLING 5.5 W 23 D 28 W 2
## 5 5 HANSHI ZUO 5.5 W 45 W 37 D 12
## 6 6 HANSEN SONG 5.0 W 34 D 29 L 11
## Round 4 Round 5 Round 6 Round 7 State USCFID/Rtg (Pre->Post)
## 1 W 14 W 7 D 12 D 4 ON 15445895 / R: 1794 ->1817
## 2 W 17 W 16 W 20 W 7 MI 14598900 / R: 1553 ->1663
## 3 W 21 W 11 W 13 W 12 MI 14959604 / R: 1384 ->1640
## 4 W 26 D 5 W 19 D 1 MI 12616049 / R: 1716 ->1744
## 5 D 13 D 4 W 14 W 17 MI 14601533 / R: 1655 ->1690
## 6 W 35 D 10 W 27 W 21 OH 15055204 / R: 1686 ->1687
## n Side 1 Side 2 Side 3 Side 4 Side 5 Side 6 Side 7
## 1 N:2 W B W B W B W
## 2 N:2 B W B W B W B
## 3 N:2 W B W B W B W
## 4 N:2 W B W B W B B
## 5 N:2 B W B W B W B
## 6 N:3 W B W B B W B
I next had to begin tidying up the dataframe. First, I needed to extract the multiple values stored in the “USCFID/Rtg (Pre->Post)” column, including the pre-tournament rating column. I used the separate function in several steps to break up “USCFID/Rtg (Pre->Post)” to “USCFID”, ‘Rtg (Pre)’, ‘Rtg (Post)’, ‘Rtg (Pre P)’, ‘Rtg (Post P)’. R displayed some warnings since the “P” column had many missing values, but these can be ignored since we won’t be using these columns. I created a subset with only the values I needed and assigned this to the dataframe ‘chess_player_rating’.
I also needed to do something similar with the Round # columns, as this column stored both the outcome (W=Win, L=Loss, H=Half Point, U=Withdrew).
# Break up "USCFID/Rtg (Pre->Post)" into its parts
chess_player <- chess_data |>
separate(col="USCFID/Rtg (Pre->Post)", into=c('USCFID', 'Rtg'), sep = "\\/ R:")
chess_player_rating <- chess_player |>
separate(col="Rtg", into=c('Rtg (Pre)', 'Rtg (Post)'), sep = "->") |>
separate(col="Rtg (Pre)", into=c('rtg_pre', 'Rtg (Pre P)'), sep = "P") |>
separate(col="Rtg (Post)", into=c('rtg_post', 'Rtg (Post P)'), sep = "P") |>
mutate(across(where(is.character), str_trim)) |>
mutate(
rtg_pre = as.integer(rtg_pre),
rtg_post = as.integer(rtg_post)
) |>
select("player_id", "Player", "State", "Total", rtg_pre, rtg_post)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 54 rows [1, 2, 3, 4, 5,
## 6, 7, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 22, 23, ...].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 56 rows [1, 2, 3, 4, 5,
## 6, 7, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 21, 22, ...].
I also needed to do something similar with the Round # columns, as this column stored both the outcome (W=Win, L=Loss, H=Half Point, U=Withdrew) and the opponent’s id. I used pivot_longer to get every observation into its own row, then separate to break up the result and opponent’s id into their own colummn. I removed any white space then filtered for W|L rows as H|U represent unplayed matches. I did a right_join to my player ratings dataframe using the oppenent_id as a key. Finally, I ran a summary to get mean or average oppenent rating pre-tournament.
# get opponent pre ratings per match
tournament_opponents_ratings <- chess_player |>
select( "player_id","Round 1", "Round 2", "Round 3", "Round 4", "Round 5", "Round 6", "Round 7") |>
pivot_longer(
cols = !(player_id),
names_to = c( "round"),
values_to = "results"
) |>
separate(col="results", into=c("result","opponent_id"), sep="(\\s{2})") |>
mutate(across(where(is.character), str_trim)) |>
filter(result == "W" | result =="L") |>
right_join(chess_player_rating, join_by(opponent_id == player_id))
## Warning: Expected 2 pieces. Additional pieces discarded in 40 rows [82, 108, 112, 152,
## 189, 203, 250, 253, 257, 264, 285, 286, 287, 302, 332, 334, 342, 343, 348, 365,
## ...].
# summarise oppent ratings per player
opponent_pre_rtgs <- tournament_opponents_ratings |>
group_by(player_id) |>
summarise(
avg_opponent_rtg = mean(rtg_pre),
avg_opponent_rtg = round(avg_opponent_rtg,0)
)
Having my player ratings and their openent ratings, I conducted a left_merge using the player_id as my primary key to combine the two datasets into their final form.
# merge final dataframe
chess_player_stats <- chess_player_rating |>
left_join(opponent_pre_rtgs, join_by(player_id == player_id)) |>
select("Player", "State", "Total", rtg_pre, avg_opponent_rtg)
gt(chess_player_stats)
| Player | State | Total | rtg_pre | avg_opponent_rtg |
|---|---|---|---|---|
| GARY HUA | ON | 6.0 | 1794 | 1572 |
| DAKSHESH DARURI | MI | 6.0 | 1553 | 1469 |
| ADITYA BAJAJ | MI | 6.0 | 1384 | 1564 |
| PATRICK H SCHILLING | MI | 5.5 | 1716 | 1515 |
| HANSHI ZUO | MI | 5.5 | 1655 | 1365 |
| HANSEN SONG | OH | 5.0 | 1686 | 1533 |
| GARY DEE SWATHELL | MI | 5.0 | 1649 | 1372 |
| EZEKIEL HOUGHTON | MI | 5.0 | 1641 | 1468 |
| STEFANO LEE | ON | 5.0 | 1411 | 1523 |
| ANVIT RAO | MI | 5.0 | 1365 | 1518 |
| CAMERON WILLIAM MC LEMAN | MI | 4.5 | 1712 | 1475 |
| KENNETH J TACK | MI | 4.5 | 1663 | 1397 |
| TORRANCE HENRY JR | MI | 4.5 | 1666 | 1472 |
| BRADLEY SHAW | MI | 4.5 | 1610 | 1509 |
| ZACHARY JAMES HOUGHTON | MI | 4.5 | 1220 | 1470 |
| MIKE NIKITIN | MI | 4.0 | 1604 | 1391 |
| RONALD GRZEGORCZYK | MI | 4.0 | 1629 | 1499 |
| DAVID SUNDEEN | MI | 4.0 | 1600 | 1480 |
| DIPANKAR ROY | MI | 4.0 | 1564 | 1451 |
| JASON ZHENG | MI | 4.0 | 1595 | 1411 |
| DINH DANG BUI | ON | 4.0 | 1563 | 1470 |
| EUGENE L MCCLURE | MI | 4.0 | 1555 | 1373 |
| ALAN BUI | ON | 4.0 | 1363 | 1214 |
| MICHAEL R ALDRICH | MI | 4.0 | 1229 | 1357 |
| LOREN SCHWIEBERT | MI | 3.5 | 1745 | 1357 |
| MAX ZHU | ON | 3.5 | 1579 | 1518 |
| GAURAV GIDWANI | MI | 3.5 | 1552 | 1144 |
| SOFIA ADINA STANESCU-BELLU | MI | 3.5 | 1507 | 1505 |
| CHIEDOZIE OKORIE | MI | 3.5 | 1602 | 1239 |
| GEORGE AVERY JONES | ON | 3.5 | 1522 | 1141 |
| RISHI SHETTY | MI | 3.5 | 1494 | 1272 |
| JOSHUA PHILIP MATHEWS | ON | 3.5 | 1441 | 1345 |
| JADE GE | MI | 3.5 | 1449 | 1264 |
| MICHAEL JEFFERY THOMAS | MI | 3.5 | 1399 | 1314 |
| JOSHUA DAVID LEE | MI | 3.5 | 1438 | 1186 |
| SIDDHARTH JHA | MI | 3.5 | 1355 | 1343 |
| AMIYATOSH PWNANANDAM | MI | 3.5 | 980 | 1385 |
| BRIAN LIU | MI | 3.0 | 1423 | 1505 |
| JOEL R HENDON | MI | 3.0 | 1436 | 1430 |
| FOREST ZHANG | MI | 3.0 | 1348 | 1391 |
| KYLE WILLIAM MURPHY | MI | 3.0 | 1403 | 1248 |
| JARED GE | MI | 3.0 | 1332 | 1225 |
| ROBERT GLEN VASEY | MI | 3.0 | 1283 | 1107 |
| JUSTIN D SCHILLING | MI | 3.0 | 1199 | 1327 |
| DEREK YAN | MI | 3.0 | 1242 | 1182 |
| JACOB ALEXANDER LAVALLEY | MI | 3.0 | 377 | 1358 |
| ERIC WRIGHT | MI | 2.5 | 1362 | 1456 |
| DANIEL KHAIN | MI | 2.5 | 1382 | 1461 |
| MICHAEL J MARTIN | MI | 2.5 | 1291 | 1364 |
| SHIVAM JHA | MI | 2.5 | 1056 | 1296 |
| TEJAS AYYAGARI | MI | 2.5 | 1011 | 1355 |
| ETHAN GUO | MI | 2.5 | 935 | 1522 |
| JOSE C YBARRA | MI | 2.0 | 1393 | 1345 |
| LARRY HODGE | MI | 2.0 | 1270 | 1206 |
| ALEX KONG | MI | 2.0 | 1186 | 1425 |
| MARISA RICCI | MI | 2.0 | 1153 | 1414 |
| MICHAEL LU | MI | 2.0 | 1092 | 1363 |
| VIRAJ MOHILE | MI | 2.0 | 917 | 1391 |
| SEAN M MC CORMICK | MI | 2.0 | 853 | 1319 |
| JULIA SHEN | MI | 1.5 | 967 | 1359 |
| JEZZEL FARKAS | ON | 1.5 | 955 | 1326 |
| ASHWIN BALAJI | MI | 1.0 | 1530 | 1186 |
| THOMAS JOSEPH HOSMER | MI | 1.0 | 1175 | 1365 |
| BEN LI | MI | 1.0 | 1163 | 1206 |
Finally, I used write.csv to export my dataframe as a csv.
write.csv(chess_player_stats,"chess_tournament_player_stats.csv")