Reading the data

This a project that reads chess tournament results and writes a summarisation to a csv. To begin the data is read in skipping the first several rows that contain what would be the headers. We also want to filter out the dividing rows (those consisting of the “-” character) and trim any leading or trailing whitespace in the column values. This will normalize the data and make it easier to manipulate

chess_raw <- read_delim("tournamentinfo.txt", delim = "|", skip = 4, col_names = FALSE) |>
  filter(!str_detect(X1, "^-+-$")) |>
  mutate(across(where(is.character), str_trim))

chess_raw
## # A tibble: 128 × 11
##    X1    X2                X3    X4    X5    X6    X7    X8    X9    X10   X11  
##    <chr> <chr>             <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl>
##  1 1     GARY HUA          6.0   W  39 W  21 W  18 W  14 W   7 D  12 D   4 NA   
##  2 ON    15445895 / R: 17… N:2   W     B     W     B     W     B     W     NA   
##  3 2     DAKSHESH DARURI   6.0   W  63 W  58 L   4 W  17 W  16 W  20 W   7 NA   
##  4 MI    14598900 / R: 15… N:2   B     W     B     W     B     W     B     NA   
##  5 3     ADITYA BAJAJ      6.0   L   8 W  61 W  25 W  21 W  11 W  13 W  12 NA   
##  6 MI    14959604 / R: 13… N:2   W     B     W     B     W     B     W     NA   
##  7 4     PATRICK H SCHILL… 5.5   W  23 D  28 W   2 W  26 D   5 W  19 D   1 NA   
##  8 MI    12616049 / R: 17… N:2   W     B     W     B     W     B     B     NA   
##  9 5     HANSHI ZUO        5.5   W  45 W  37 D  12 D  13 D   4 W  14 W  17 NA   
## 10 MI    14601533 / R: 16… N:2   B     W     B     W     B     W     B     NA   
## # ℹ 118 more rows

Divide the Rows

Each entry in the original data is split into two rows in the dataframe. It is necessary to split the odd rows (those with a number in the first column) and even rows (those with the state in the first column) into separate dataframes in order to accurately label the data. By indexing the second dataframe by row number the dataframes can be rejoined in the future.

first_row <- chess_raw |>
  filter(str_detect(X1, r"{^\d+}")) |>
  mutate(across(X1, as.integer))

second_row <- chess_raw |>
  filter(str_detect(X1, r"{^\D+}")) |>
  mutate(id = row_number())

first_row
## # A tibble: 64 × 11
##       X1 X2                X3    X4    X5    X6    X7    X8    X9    X10   X11  
##    <int> <chr>             <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl>
##  1     1 GARY HUA          6.0   W  39 W  21 W  18 W  14 W   7 D  12 D   4 NA   
##  2     2 DAKSHESH DARURI   6.0   W  63 W  58 L   4 W  17 W  16 W  20 W   7 NA   
##  3     3 ADITYA BAJAJ      6.0   L   8 W  61 W  25 W  21 W  11 W  13 W  12 NA   
##  4     4 PATRICK H SCHILL… 5.5   W  23 D  28 W   2 W  26 D   5 W  19 D   1 NA   
##  5     5 HANSHI ZUO        5.5   W  45 W  37 D  12 D  13 D   4 W  14 W  17 NA   
##  6     6 HANSEN SONG       5.0   W  34 D  29 L  11 W  35 D  10 W  27 W  21 NA   
##  7     7 GARY DEE SWATHELL 5.0   W  57 W  46 W  13 W  11 L   1 W   9 L   2 NA   
##  8     8 EZEKIEL HOUGHTON  5.0   W   3 W  32 L  14 L   9 W  47 W  28 W  19 NA   
##  9     9 STEFANO LEE       5.0   W  25 L  18 W  59 W   8 W  26 L   7 W  20 NA   
## 10    10 ANVIT RAO         5.0   D  16 L  19 W  55 W  31 D   6 W  25 W  18 NA   
## # ℹ 54 more rows
second_row
## # A tibble: 64 × 12
##    X1    X2          X3    X4    X5    X6    X7    X8    X9    X10   X11      id
##    <chr> <chr>       <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <int>
##  1 ON    15445895 /… N:2   W     B     W     B     W     B     W     NA        1
##  2 MI    14598900 /… N:2   B     W     B     W     B     W     B     NA        2
##  3 MI    14959604 /… N:2   W     B     W     B     W     B     W     NA        3
##  4 MI    12616049 /… N:2   W     B     W     B     W     B     B     NA        4
##  5 MI    14601533 /… N:2   B     W     B     W     B     W     B     NA        5
##  6 OH    15055204 /… N:3   W     B     W     B     B     W     B     NA        6
##  7 MI    11146376 /… N:3   W     B     W     B     B     W     W     NA        7
##  8 MI    15142253 /… N:3   B     W     B     W     B     W     W     NA        8
##  9 ON    14954524 /… N:2   W     B     W     B     W     B     B     NA        9
## 10 MI    14150362 /… N:3   W     W     B     B     W     B     W     NA       10
## # ℹ 54 more rows

Naming Columns

Now that the data is separated it can be accurately labeled and parsed. Using functions such as parse_number makes it easy to ignore the non-numeric characters in many of the fields. The dataframes can then be combined by joining on the id column.

first_row_rename <- first_row |>
  select(!X11) |>
  rename(id = "X1", name = "X2", score = "X3", r1 = "X4", r2 = "X5", r3 = "X6", r4 = "X7", r5 = "X8", r6 = "X9", r7 = "X10") |>
  mutate(across(matches(r"{r\d}"), parse_number))

second_row_rename <- second_row |> 
  rename(state = "X1", rating = "X2") |>
  mutate(prev_rating = parse_number(str_extract(rating, r"-{R:\W+\d{1,4}}-"))) |>
  select(id, state, prev_rating)

working_df <- inner_join(first_row_rename, second_row_rename)
working_df
## # A tibble: 64 × 12
##       id name  score    r1    r2    r3    r4    r5    r6    r7 state prev_rating
##    <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>       <dbl>
##  1     1 GARY… 6.0      39    21    18    14     7    12     4 ON           1794
##  2     2 DAKS… 6.0      63    58     4    17    16    20     7 MI           1553
##  3     3 ADIT… 6.0       8    61    25    21    11    13    12 MI           1384
##  4     4 PATR… 5.5      23    28     2    26     5    19     1 MI           1716
##  5     5 HANS… 5.5      45    37    12    13     4    14    17 MI           1655
##  6     6 HANS… 5.0      34    29    11    35    10    27    21 OH           1686
##  7     7 GARY… 5.0      57    46    13    11     1     9     2 MI           1649
##  8     8 EZEK… 5.0       3    32    14     9    47    28    19 MI           1641
##  9     9 STEF… 5.0      25    18    59     8    26     7    20 ON           1411
## 10    10 ANVI… 5.0      16    19    55    31     6    25    18 MI           1365
## # ℹ 54 more rows

Calculate Opponent’s Averages

The opponent id number in columns r1-r6 can be easily replaced with the initial rating of the opponent, making a row wise average a simple operation. Remembering to drop NA values is important to account for matches not played and prevent the NAs from passing to the final value.

averaged_op <- working_df |>
  mutate(across(r1:r7, ~ working_df$prev_rating[.x])) |>
  mutate(mean = round(rowMeans(pick(r1:r7), na.rm = TRUE)))

averaged_op
## # A tibble: 64 × 13
##       id name  score    r1    r2    r3    r4    r5    r6    r7 state prev_rating
##    <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>       <dbl>
##  1     1 GARY… 6.0    1436  1563  1600  1610  1649  1663  1716 ON           1794
##  2     2 DAKS… 6.0    1175   917  1716  1629  1604  1595  1649 MI           1553
##  3     3 ADIT… 6.0    1641   955  1745  1563  1712  1666  1663 MI           1384
##  4     4 PATR… 5.5    1363  1507  1553  1579  1655  1564  1794 MI           1716
##  5     5 HANS… 5.5    1242   980  1663  1666  1716  1610  1629 MI           1655
##  6     6 HANS… 5.0    1399  1602  1712  1438  1365  1552  1563 OH           1686
##  7     7 GARY… 5.0    1092   377  1666  1712  1794  1411  1553 MI           1649
##  8     8 EZEK… 5.0    1384  1441  1610  1411  1362  1507  1564 MI           1641
##  9     9 STEF… 5.0    1745  1600   853  1641  1579  1649  1595 ON           1411
## 10    10 ANVI… 5.0    1604  1564  1186  1494  1686  1745  1600 MI           1365
## # ℹ 54 more rows
## # ℹ 1 more variable: mean <dbl>

Select final columns and marshal before writing

Renaming and reordering the columns to match the desired format, the results are finally written to a csv.

final_df <- averaged_op |>
  select(name, state, score, prev_rating, mean) |>
  rename(`Player’s Name` = name, `Player’s State` = state, `Total Number of Points` = score, `Player’s Pre-Rating` = prev_rating, `Average Pre Chess Rating of Opponents` = mean)

write_csv(final_df, "chess_summary.csv")