project_1

Approach

Schema

  • name
  • state
  • total_points
  • pre_rating
  • opponent_pre_rating

Viewing the Data

-----------------------------------------------------------------------------------------
 Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| 
 Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | 
-----------------------------------------------------------------------------------------
    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|
   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |
-----------------------------------------------------------------------------------------
    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|
   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |
-----------------------------------------------------------------------------------------
    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|
   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |
-----------------------------------------------------------------------------------------
    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|
   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |
-----------------------------------------------------------------------------------------
    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|
   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |
-----------------------------------------------------------------------------------------

Use gsub to remove -

  • Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|

  • Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |

  • 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|

  • ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |

I’m wondering if it’s even worth saving the headers. We could just readline into a vector where each row would just be the complete line. So you can just vector[3:] and get the important part:

  • 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
  • ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |

sub “|W, |L, |N, |B, |W” into ’ | ’ & / into ’ ’ - strip whitespace

  • 1 | GARY L HUA | 6.0 | 39 | 21 | 18 | 14 | 7 | 12 | 4
  • ON | 15445895 R: 1794 ->1817 | 2

I notice that some players have this weird “980P12->1077P17” P thing in their rating.

  • sub '->' into ’ | ’

    • gives us 980P12 1077P17
  • sub P\\d+ into ’ ’

    • gives us 980 1077
  • sub R: into ’ ’ and N: into ’ ’

  • strip whitespace

  • 1 | GARY L HUA | 6.0 | 39 | 21 | 18 | 14 | 7 | 12 | 4

  • ON | 15445895 | 1794 | 1817 | 2

So the schema reads as such

  • odd_rows
    • id, name, points, opponent id 1 -> 7
  • even_rows
    • state, match id, pre rating, post rating

v <- readlines("tournamentinfo.txt")

We can do df <- tibble(raw_string = v) |> mutate(index = row_number()) and str_split() by |, drop raw_string.Might have to strip white space.

df1 <- df |> filter(index %% 2 != 0) df2 <- df |> filter(index %% 2 == 0)

  • df1 schema
    • col1 = id
    • col2 = name
    • col4 = total_points
    • col5+ = opponents

df1 <- select(index, id, name, total_points)

We need keys because now index is disjointed. df1 would be like 1 3 5. df2 would be 2 4 6. We could just index + 1 to df2 to match df1.

  • df2 schema
    • col1 = index
    • col2 = state
    • col3 = match_id
    • col4 = pre_rating
    • col5 = post_rating
    • col6 = idk - can just drop it haha

df2 <- select(index, state, pre_rating, post_rating) df2 <- mutate(index = index + 1)

join via index

player_df <- df1 |> leftjoin(df2, by ='index')

  • new schema

    • index
    • id
    • name
    • state
    • pre_rating
    • post_rating
    • total_points
    • opp_ + [1:7]
  • 1, 1, GARY HUA, ON, 15445895, 1794, 1817, 6.0, 39, 21, 18, 14, 7, 12, 4

  • The following need to be summarized.

    • post_rating by mean
    • total_points by sum

Need to perform a lookup function on the opponent columns, to the id, return pre-rating. Probably easier to just create a lookup table

lookup <- df |> select(id, pre_rating)

Then we can have a vector to store the column names for the opponents op_cols <- paste0("opp_", 1:7)

So, we could do a pipe where

player_df |> mutate(across(all_of(op_cols)), ~ lookup$pre_rating[match(.x, lookup$id)]) |> mutate(op_avg = rowMeans(across(all_of(op_cols)), na.rm = TRUE)) |> select(!all_of(p_cols))

Mutate(across()) says, we are going to alter, across says these columns, the function (~) takes the lookup table by rating column and returns the value of the id match to subset. Then we mean all those rows, then we keep the mean, remove the op_cols.

Objective

Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents

name, state, total_points, pre_rating, post_rating, op_avg

That should be it!

Codebase

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#start
df_raw <- read_lines("tournamentinfo.txt")
df_raw_t <- tibble(df_raw)
var_1 <- c("pair", "name", "total_pts", paste0("opp_", 1:7))
var_2 <- c("state", "match", "pre_rating", "post_rating")
var_3 <- c("pair", "name", "state", "pre_rating", "post_rating", "total_pts", paste0("opp_", 1:7))
var_4 <- c("name", "state", "pre_rating", "post_rating", "total_pts")

#Working
df <- tidyr::tibble(raw = df_raw) |>
  filter(
    !str_detect(raw, "^-"),
    !str_detect(raw, "\\s*Pair"),
    !str_detect(raw, "\\s*Num")
  ) |>
  mutate(index = row_number()) |>
  mutate(
    raw = str_trim(raw),
    raw = str_replace(raw, "\\|$", ""),
    raw = str_replace_all(raw, "\\|[BWLDXUHN]\\s*", "|"),
    raw = str_replace_all(raw, "\\s*\\|\\s*", "|"),
    raw = str_replace_all(raw, "\\|:\\d\\s*", "|"),
    raw = str_replace_all(raw, "P\\d+", ""),
    raw = str_replace(raw, "->", "|"),
    raw = str_replace(raw, "R:", ""),
    raw = str_replace(raw, "/", "|"),
    raw = str_replace(raw, "\\|*\\|$", ""),
    raw = str_trim(raw)
    )

df1 <- df |> filter(index %% 2 != 0) |>
  separate(raw, sep = "\\|", into = var_1, extra = "drop", fill = "right") |>
  mutate(across(everything(), str_trim)) |>
  mutate(across(everything(), ~ na_if(.x, "")))

df2 <- df |> filter(index %% 2 == 0) |>
  mutate(index = index - 1) |>
  separate(raw, sep = "\\|", into = var_2, extra = "drop", fill = "right") |>
  mutate(across(everything(), str_trim)) |>
  mutate(across(everything(), ~ na_if(.x, "")))

player_df <- df1 |> left_join(df2, by ='index')|> 
  select(all_of(var_3)) |>
  readr::type_convert()

── Column specification ────────────────────────────────────────────────────────
cols(
  pair = col_double(),
  name = col_character(),
  state = col_character(),
  pre_rating = col_double(),
  post_rating = col_double(),
  total_pts = col_double(),
  opp_1 = col_double(),
  opp_2 = col_double(),
  opp_3 = col_double(),
  opp_4 = col_double(),
  opp_5 = col_double(),
  opp_6 = col_double(),
  opp_7 = col_double()
)
rating_map <- setNames(player_df$pre_rating, player_df$pair)

player_df <- player_df |>
  mutate(across(starts_with("opp"), ~ unname(rating_map[.x]))) |>
  mutate(op_avg_rating = rowMeans(pick(starts_with("opp")), na.rm = TRUE)) |>
  select(-all_of(paste0("opp_", 1:7)))

player_df
# A tibble: 64 × 7
    pair name               state pre_rating post_rating total_pts op_avg_rating
   <dbl> <chr>              <chr>      <dbl>       <dbl>     <dbl>         <dbl>
 1     1 GARY HUA           ON          1794        1817       6           1605.
 2     2 DAKSHESH DARURI    MI          1553        1663       6           1469.
 3     3 ADITYA BAJAJ       MI          1384        1640       6           1564.
 4     4 PATRICK H SCHILLI… MI          1716        1744       5.5         1574.
 5     5 HANSHI ZUO         MI          1655        1690       5.5         1501.
 6     6 HANSEN SONG        OH          1686        1687       5           1519.
 7     7 GARY DEE SWATHELL  MI          1649        1673       5           1372.
 8     8 EZEKIEL HOUGHTON   MI          1641        1657       5           1468.
 9     9 STEFANO LEE        ON          1411        1564       5           1523.
10    10 ANVIT RAO          MI          1365        1544       5           1554.
# ℹ 54 more rows
glimpse(player_df)
Rows: 64
Columns: 7
$ pair          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ name          <chr> "GARY HUA", "DAKSHESH DARURI", "ADITYA BAJAJ", "PATRICK …
$ state         <chr> "ON", "MI", "MI", "MI", "MI", "OH", "MI", "MI", "ON", "M…
$ pre_rating    <dbl> 1794, 1553, 1384, 1716, 1655, 1686, 1649, 1641, 1411, 13…
$ post_rating   <dbl> 1817, 1663, 1640, 1744, 1690, 1687, 1673, 1657, 1564, 15…
$ total_pts     <dbl> 6.0, 6.0, 6.0, 5.5, 5.5, 5.0, 5.0, 5.0, 5.0, 5.0, 4.5, 4…
$ op_avg_rating <dbl> 1605.286, 1469.286, 1563.571, 1573.571, 1500.857, 1518.7…

This was my workspace. I used this to check each str_replace and grep formula.

rdf <- tibble(raw = df_raw)
rdf <- rdf |> filter(!str_detect(raw, "^-"))
rdf <- rdf |> filter(!str_detect(raw, "\\s*Pair"))
rdf <- rdf |> filter(!str_detect(raw, "\\s*Num"))
rdf <- rdf |> mutate(index = row_number())
rdf <- rdf |> mutate(raw = str_trim(raw))
rdf <- rdf |> mutate(raw = str_replace(raw, "\\|$", ""))
rdf <- rdf |> mutate(raw = str_replace_all(raw, "\\|[BWLDXUHN]\\s*", "|"))
rdf <- rdf |> mutate(raw = str_replace_all(raw, "\\s*\\|\\s*", "|"))
rdf <- rdf |> mutate(raw = str_replace_all(raw, "\\|:\\d\\s*", "|"))
rdf <- rdf |> mutate(raw = str_replace_all(raw, "P\\d+", ""))
rdf <- rdf |> mutate(raw = str_replace(raw, "->", "|"))
rdf <- rdf |> mutate(raw = str_replace(raw, "R:", ""))
rdf <- rdf |> mutate(raw = str_replace(raw, "/", "|"))
rdf <- rdf |> mutate(raw = str_replace(raw, "\\|*\\|$", ""))
rdf <- rdf |> mutate(raw = str_trim(raw))