MCD607 - Project 1

Overview

For this project, we were given a text document with a very specific formatting and we were asked to reformat into dataframe of columns “Id”, “Player”, “State”, “Total”, “Player Pre-Tournament Rtg”, “Avg. Openent Tournament Rating” that can exported as a csv.

I read the text file in two ways:

using readLines,
using read.delim

First, I initialized a variable for my filename, a list of column headers, and number of columns to use later on.

file <-'tournamentinfo.txt'

headers <- c("player_id", "Player", "Total", "Round 1", "Round 2", "Round 3", "Round 4", "Round 5", "Round 6", "Round 7", "State", "USCFID/Rtg (Pre->Post)", "n", "Side 1", "Side 2", "Side 3", "Side 4", "Side 5", "Side 6", "Side 7")

num_cols <- length(headers)

Reading Text File Using as readLines

The readLines function provides a simple method for reading the file as type char. After reading in the file, I began to parse the data. First, I used grep to remove any line that includes the record breaks (“—”). I then split each line into columns using the strsplit function and setting my separator to “|”. This resulted in a list.

# readlines 
raw_lines <- readLines(file)

## Warning in readLines(file): incomplete final line found on 'tournamentinfo.txt'

# subset only data rows by omitting the breaker lines
data_rows <- raw_lines[grep("---", raw_lines, invert = TRUE)]
# split by |
rows_split = strsplit(data_rows, "\\|")

Joining data from two rows

The resulting list still had each player’s information spread accross two rows. I needed to convert it to dataframe. I first used unlist to expand my data into a long format, putting every value on a single line. I then used the matrix function into the right shape with 20 cols by 64 rows. I casted the result into a dataframe and set my headers.

# convert list to dataframe
parsed_list <- rows_split[3:length(rows_split)]
lines_df <- data.frame(matrix(unlist(parsed_list),  ncol=num_cols, byrow = 1))
# add column names
names(lines_df) <- headers


# convert dataframe to tibble and trim white space
line_data <- as_tibble(lines_df) |>
   mutate(across(where(is.character), str_trim))
dim(line_data)

## [1] 64 20

gt(line_data[1:5,])

player_id	Player	Total	Round 1	Round 2	Round 3	Round 4	Round 5	Round 6	Round 7	State	USCFID/Rtg (Pre->Post)	n	Side 1	Side 2	Side 3	Side 4	Side 5	Side 6	Side 7
1	GARY HUA	6.0	W 39	W 21	W 18	W 14	W 7	D 12	D 4	ON	15445895 / R: 1794 ->1817	N:2	W	B	W	B	W	B	W
2	DAKSHESH DARURI	6.0	W 63	W 58	L 4	W 17	W 16	W 20	W 7	MI	14598900 / R: 1553 ->1663	N:2	B	W	B	W	B	W	B
3	ADITYA BAJAJ	6.0	L 8	W 61	W 25	W 21	W 11	W 13	W 12	MI	14959604 / R: 1384 ->1640	N:2	W	B	W	B	W	B	W
4	PATRICK H SCHILLING	5.5	W 23	D 28	W 2	W 26	D 5	W 19	D 1	MI	12616049 / R: 1716 ->1744	N:2	W	B	W	B	W	B	B
5	HANSHI ZUO	5.5	W 45	W 37	D 12	D 13	D 4	W 14	W 17	MI	14601533 / R: 1655 ->1690	N:2	B	W	B	W	B	W	B

Reading using read.delim

Although my data ended up in the correct format, I wasn’t sure if I loved the approach. I opted to try using the read.delim function declare my separators and remove blank columns ahead of time. This method also had the added benefit of returning a dataframe, so I didn’t need to convert it as in the last step. The record breaks (“—”) converted in NAs so I used drop_na to remove these rows from my dataset.

# read text file usig read.delim
delim_data <- read.delim(file, header=FALSE, sep="|", skip=4, skipNul=1, blank.lines.skip=1, na.strings=c(""," ","NA")) 

# drop all NA's
delim_data_df <- Filter(function(x)!all(is.na(x)), delim_data) |>
  drop_na()

Consolidating Rows

I used the modulo within the slice function to split my data rows into two dataframes, each containing half of the record’s data. I added a column for the row number, which I used as my key when stitching the dataframes back together using merge function. I later dropped the row_id column and declared my new dataframes column names. There’s probably a more elegant way to do this, but I wasn’t able to find a method that worked.

# split data into two dfs
df_even <- delim_data_df |>
  slice(which(row_number() %% 2 == 1)) |>
  mutate(row_id = row_number())

df_odd <- delim_data_df |>
  slice(which(row_number() %% 2 == 0)) |>
  mutate(row_id = row_number())

# merge data by row_id
chess_data <- merge(df_even, df_odd, by="row_id") |>
  subset(select = -c(row_id)) 

# add column names
names(chess_data) <- headers
head(chess_data)

##   player_id                            Player Total Round 1 Round 2 Round 3
## 1        1   GARY HUA                         6.0     W  39   W  21   W  18
## 2        2   DAKSHESH DARURI                  6.0     W  63   W  58   L   4
## 3        3   ADITYA BAJAJ                     6.0     L   8   W  61   W  25
## 4        4   PATRICK H SCHILLING              5.5     W  23   D  28   W   2
## 5        5   HANSHI ZUO                       5.5     W  45   W  37   D  12
## 6        6   HANSEN SONG                      5.0     W  34   D  29   L  11
##   Round 4 Round 5 Round 6 Round 7  State            USCFID/Rtg (Pre->Post)
## 1   W  14   W   7   D  12   D   4    ON   15445895 / R: 1794   ->1817     
## 2   W  17   W  16   W  20   W   7    MI   14598900 / R: 1553   ->1663     
## 3   W  21   W  11   W  13   W  12    MI   14959604 / R: 1384   ->1640     
## 4   W  26   D   5   W  19   D   1    MI   12616049 / R: 1716   ->1744     
## 5   D  13   D   4   W  14   W  17    MI   14601533 / R: 1655   ->1690     
## 6   W  35   D  10   W  27   W  21    OH   15055204 / R: 1686   ->1687     
##       n Side 1 Side 2 Side 3 Side 4 Side 5 Side 6 Side 7
## 1 N:2    W      B      W      B      W      B      W    
## 2 N:2    B      W      B      W      B      W      B    
## 3 N:2    W      B      W      B      W      B      W    
## 4 N:2    W      B      W      B      W      B      B    
## 5 N:2    B      W      B      W      B      W      B    
## 6 N:3    W      B      W      B      B      W      B

Tidying Up the Dataframe

I next had to begin tidying up the dataframe. First, I needed to extract the multiple values stored in the “USCFID/Rtg (Pre->Post)” column, including the pre-tournament rating column. I used the separate function in several steps to break up “USCFID/Rtg (Pre->Post)” to “USCFID”, ‘Rtg (Pre)’, ‘Rtg (Post)’, ‘Rtg (Pre P)’, ‘Rtg (Post P)’. R displayed some warnings since the “P” column had many missing values, but these can be ignored since we won’t be using these columns. I created a subset with only the values I needed and assigned this to the dataframe ‘chess_player_rating’.

I also needed to do something similar with the Round # columns, as this column stored both the outcome (W=Win, L=Loss, H=Half Point, U=Withdrew).

# Break up "USCFID/Rtg (Pre->Post)" into its parts
chess_player <- chess_data |>
  separate(col="USCFID/Rtg (Pre->Post)", into=c('USCFID', 'Rtg'), sep = "\\/ R:") 
  
chess_player_rating <- chess_player |>
  separate(col="Rtg", into=c('Rtg (Pre)', 'Rtg (Post)'), sep = "->") |>
  separate(col="Rtg (Pre)", into=c('rtg_pre', 'Rtg (Pre P)'), sep = "P") |>
  separate(col="Rtg (Post)", into=c('rtg_post', 'Rtg (Post P)'), sep = "P") |>
  mutate(across(where(is.character), str_trim)) |>
  mutate(
    rtg_pre = as.integer(rtg_pre),
    rtg_post = as.integer(rtg_post)
  ) |>
  select("player_id", "Player", "State", "Total", rtg_pre, rtg_post)

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 54 rows [1, 2, 3, 4, 5,
## 6, 7, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 22, 23, ...].

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 56 rows [1, 2, 3, 4, 5,
## 6, 7, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 21, 22, ...].

I also needed to do something similar with the Round # columns, as this column stored both the outcome (W=Win, L=Loss, H=Half Point, U=Withdrew) and the opponent’s id. I used pivot_longer to get every observation into its own row, then separate to break up the result and opponent’s id into their own colummn. I removed any white space then filtered for W|L rows as H|U represent unplayed matches. I did a right_join to my player ratings dataframe using the oppenent_id as a key. Finally, I ran a summary to get mean or average oppenent rating pre-tournament.

# get opponent pre ratings per match
tournament_opponents_ratings <- chess_player |>
  select( "player_id","Round 1", "Round 2", "Round 3", "Round 4", "Round 5", "Round 6", "Round 7") |>
  pivot_longer(
    cols = !(player_id),
    names_to = c( "round"), 
    values_to = "results"
  ) |>
  separate(col="results", into=c("result","opponent_id"), sep="(\\s{2})") |>
  mutate(across(where(is.character), str_trim)) |>
  filter(result == "W" | result =="L") |>
  right_join(chess_player_rating, join_by(opponent_id == player_id))

## Warning: Expected 2 pieces. Additional pieces discarded in 40 rows [82, 108, 112, 152,
## 189, 203, 250, 253, 257, 264, 285, 286, 287, 302, 332, 334, 342, 343, 348, 365,
## ...].

# summarise oppent ratings per player
opponent_pre_rtgs <- tournament_opponents_ratings  |>
  group_by(player_id) |>
  summarise(
    avg_opponent_rtg =  mean(rtg_pre),
    avg_opponent_rtg = round(avg_opponent_rtg,0)
  )

Merging the results

Having my player ratings and their openent ratings, I conducted a left_merge using the player_id as my primary key to combine the two datasets into their final form.

# merge final dataframe
chess_player_stats <- chess_player_rating |>
  left_join(opponent_pre_rtgs, join_by(player_id == player_id)) |>
  select("Player", "State", "Total", rtg_pre, avg_opponent_rtg)

gt(chess_player_stats)

Player	State	Total	rtg_pre	avg_opponent_rtg
GARY HUA	ON	6.0	1794	1572
DAKSHESH DARURI	MI	6.0	1553	1469
ADITYA BAJAJ	MI	6.0	1384	1564
PATRICK H SCHILLING	MI	5.5	1716	1515
HANSHI ZUO	MI	5.5	1655	1365
HANSEN SONG	OH	5.0	1686	1533
GARY DEE SWATHELL	MI	5.0	1649	1372
EZEKIEL HOUGHTON	MI	5.0	1641	1468
STEFANO LEE	ON	5.0	1411	1523
ANVIT RAO	MI	5.0	1365	1518
CAMERON WILLIAM MC LEMAN	MI	4.5	1712	1475
KENNETH J TACK	MI	4.5	1663	1397
TORRANCE HENRY JR	MI	4.5	1666	1472
BRADLEY SHAW	MI	4.5	1610	1509
ZACHARY JAMES HOUGHTON	MI	4.5	1220	1470
MIKE NIKITIN	MI	4.0	1604	1391
RONALD GRZEGORCZYK	MI	4.0	1629	1499
DAVID SUNDEEN	MI	4.0	1600	1480
DIPANKAR ROY	MI	4.0	1564	1451
JASON ZHENG	MI	4.0	1595	1411
DINH DANG BUI	ON	4.0	1563	1470
EUGENE L MCCLURE	MI	4.0	1555	1373
ALAN BUI	ON	4.0	1363	1214
MICHAEL R ALDRICH	MI	4.0	1229	1357
LOREN SCHWIEBERT	MI	3.5	1745	1357
MAX ZHU	ON	3.5	1579	1518
GAURAV GIDWANI	MI	3.5	1552	1144
SOFIA ADINA STANESCU-BELLU	MI	3.5	1507	1505
CHIEDOZIE OKORIE	MI	3.5	1602	1239
GEORGE AVERY JONES	ON	3.5	1522	1141
RISHI SHETTY	MI	3.5	1494	1272
JOSHUA PHILIP MATHEWS	ON	3.5	1441	1345
JADE GE	MI	3.5	1449	1264
MICHAEL JEFFERY THOMAS	MI	3.5	1399	1314
JOSHUA DAVID LEE	MI	3.5	1438	1186
SIDDHARTH JHA	MI	3.5	1355	1343
AMIYATOSH PWNANANDAM	MI	3.5	980	1385
BRIAN LIU	MI	3.0	1423	1505
JOEL R HENDON	MI	3.0	1436	1430
FOREST ZHANG	MI	3.0	1348	1391
KYLE WILLIAM MURPHY	MI	3.0	1403	1248
JARED GE	MI	3.0	1332	1225
ROBERT GLEN VASEY	MI	3.0	1283	1107
JUSTIN D SCHILLING	MI	3.0	1199	1327
DEREK YAN	MI	3.0	1242	1182
JACOB ALEXANDER LAVALLEY	MI	3.0	377	1358
ERIC WRIGHT	MI	2.5	1362	1456
DANIEL KHAIN	MI	2.5	1382	1461
MICHAEL J MARTIN	MI	2.5	1291	1364
SHIVAM JHA	MI	2.5	1056	1296
TEJAS AYYAGARI	MI	2.5	1011	1355
ETHAN GUO	MI	2.5	935	1522
JOSE C YBARRA	MI	2.0	1393	1345
LARRY HODGE	MI	2.0	1270	1206
ALEX KONG	MI	2.0	1186	1425
MARISA RICCI	MI	2.0	1153	1414
MICHAEL LU	MI	2.0	1092	1363
VIRAJ MOHILE	MI	2.0	917	1391
SEAN M MC CORMICK	MI	2.0	853	1319
JULIA SHEN	MI	1.5	967	1359
JEZZEL FARKAS	ON	1.5	955	1326
ASHWIN BALAJI	MI	1.0	1530	1186
THOMAS JOSEPH HOSMER	MI	1.0	1175	1365
BEN LI	MI	1.0	1163	1206

Writing the Dataframe to CSV

Finally, I used write.csv to export my dataframe as a csv.

write.csv(chess_player_stats,"chess_tournament_player_stats.csv")

MCD607 - Project 1 - Chess

Marco Castro

2024-10-03