DATA 607 Project 1

Intro

In this project, my goal is to generate a CSV file with the following information for each chess player:

Player’s Name
Player’s State
Total Number of Points
Player’s Pre-Rating
Average Pre Chess Rating of Opponents

I have placed the provided text file in my GitHub repo and explore it below:

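# Packages used below: stringr, tibble and dplyr (all loaded by tidyverse), plus knitr for kable()
library(tidyverse)
library(knitr)

data_src <- "https://raw.githubusercontent.com/cdube89128/DATA-607/refs/heads/main/project-01/tournamentinfo.txt"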

# Read file lines
lines <- readLines(data_src)

## Warning in readLines(data_src): incomplete final line found on
## 'https://raw.githubusercontent.com/cdube89128/DATA-607/refs/heads/main/project-01/tournamentinfo.txt'

Reading the file produced a warning. A quick search suggests that it is caused by the text file lacking a newline character at the end, which does not affect the lines that were read, so I am continuing as normal.
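To see where this warning comes from, here is a minimal sketch that assumes a throwaway temporary file (not the tournament file) whose last line has no trailing newline. It also shows warn = FALSE, a readLines argument that silences the message without changing what is read.

```r
# Write a two-line file whose final line has no trailing newline
tmp <- tempfile(fileext = ".txt")
cat("line one\nline two", file = tmp)

# Record whether readLines() raises a warning for this file
got_warning <- tryCatch(
  {
    readLines(tmp)
    FALSE
  },
  warning = function(w) TRUE
)

# warn = FALSE suppresses the message; the content read is the same
quiet_lines <- readLines(tmp, warn = FALSE)

unlink(tmp)  # clean up the temporary file
```

Both lines come back intact either way, which is why the warning can be ignored here.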

head(lines, 10)

##  [1] "-----------------------------------------------------------------------------------------" 
##  [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
##  [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
##  [4] "-----------------------------------------------------------------------------------------" 
##  [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
##  [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
##  [7] "-----------------------------------------------------------------------------------------" 
##  [8] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
##  [9] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
## [10] "-----------------------------------------------------------------------------------------"

The file starts with 4 lines of header text. Each entry takes up two lines followed by a divider line, and the fields I need are delimited in different ways within the text. In other words, there is no single standard delimiter such as a comma.

I am going to clean up these lines to get the data for each entry consolidated together.

# Remove header lines in file
lines <- lines[-c(1:4)]

# Each entry is 2 lines, followed by a divider line (---)

# Remove those divider lines
lines <- lines[!grepl("^-", lines)]

# Checking in again
head(lines, 10)

##  [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
##  [2] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
##  [3] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
##  [4] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
##  [5] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
##  [6] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
##  [7] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|"
##  [8] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
##  [9] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|"
## [10] "   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |"

# Group into chunks of 2 lines per player
player_lines <- split(lines, ceiling(seq_along(lines)/2))

# Combine each pair into one string
combined <- sapply(player_lines, paste, collapse = "")

# The whitespace is messy, cleaning that up
combined <- gsub("\\s+", " ", combined)   
combined <- trimws(combined)

# Checking in again
head(combined, 5)

##                                                                                                                           1 
##           "1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4| ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |" 
##                                                                                                                           2 
##    "2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7| MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |" 
##                                                                                                                           3 
##      "3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12| MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |" 
##                                                                                                                           4 
## "4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1| MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |" 
##                                                                                                                           5 
##        "5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17| MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |"
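Pairing lines by position only works if every entry really does occupy exactly two lines. As a hedged sketch of a sanity check, the snippet below uses just the first two entries shown above (the names sample_lines, first_rows and second_rows are illustrative) and confirms that each first row starts with a pair number and each second row starts with a two-letter state code before combining them.

```r
# First two entries from the file, after header and divider lines were removed
sample_lines <- c(
  "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |",
  "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|",
  "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
)

# An odd number of lines would mean the two-line pairing is off
even_count <- length(sample_lines) %% 2 == 0

# Odd positions hold the first row of each entry, even positions the second
first_rows  <- sample_lines[c(TRUE, FALSE)]
second_rows <- sample_lines[c(FALSE, TRUE)]

# First rows should begin with a pair number, second rows with a state code
starts_with_pair  <- grepl("^\\s*\\d+\\s*\\|", first_rows, perl = TRUE)
starts_with_state <- grepl("^\\s*[A-Z]{2}\\s*\\|", second_rows, perl = TRUE)

# Combine each pair and normalise whitespace, as in the main code
combined_sample <- trimws(gsub("\\s+", " ", paste0(first_rows, second_rows)))
```

If any of the checks came back FALSE on the full file, the split by position would be mixing lines from different players.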

The data looks much more easily parsable now. Next I will focus on pulling out the individual pieces of data per entry (e.g., Player’s Number, Player’s Name, Player’s Rating, etc.).

# Almost all of the distinct values are separated by pipes (|), so split on those
split_data <- str_split(combined, "\\|")

# Checking in again
head(split_data, 2)

## [[1]]
##  [1] "1 "                          " GARY HUA "                 
##  [3] "6.0 "                        "W 39"                       
##  [5] "W 21"                        "W 18"                       
##  [7] "W 14"                        "W 7"                        
##  [9] "D 12"                        "D 4"                        
## [11] " ON "                        " 15445895 / R: 1794 ->1817 "
## [13] "N:2 "                        "W "                         
## [15] "B "                          "W "                         
## [17] "B "                          "W "                         
## [19] "B "                          "W "                         
## [21] ""                           
## 
## [[2]]
##  [1] "2 "                          " DAKSHESH DARURI "          
##  [3] "6.0 "                        "W 63"                       
##  [5] "W 58"                        "L 4"                        
##  [7] "W 17"                        "W 16"                       
##  [9] "W 20"                        "W 7"                        
## [11] " MI "                        " 14598900 / R: 1553 ->1663 "
## [13] "N:2 "                        "B "                         
## [15] "W "                          "B "                         
## [17] "W "                          "B "                         
## [19] "W "                          "B "                         
## [21] ""

class(split_data)

## [1] "list"

class(split_data[[1]])

## [1] "character"

# It looks like split_data is a list of character vectors
# Create a function to parse each entry
parse_player <- function(x) {
  x <- str_trim(x)   # trim whitespace because it was still slightly irregular

  tibble(
    Pair = as.numeric(x[1]),
    Name = x[2],
    Total = as.numeric(x[3]),
    Round_1 = as.numeric(str_extract(x[4], "\\d+")),
    Round_2 = as.numeric(str_extract(x[5], "\\d+")),
    Round_3 = as.numeric(str_extract(x[6], "\\d+")),
    Round_4 = as.numeric(str_extract(x[7], "\\d+")),
    Round_5 = as.numeric(str_extract(x[8], "\\d+")),
    Round_6 = as.numeric(str_extract(x[9], "\\d+")),
    Round_7 = as.numeric(str_extract(x[10], "\\d+")),
    State = x[11],
    # After this, more complicated parsing is needed
    ID = str_extract(x[12], "\\d+"),                          # get 1st group of digits
    Pre_Rating = as.numeric(str_extract(x[12], "(?<=R: )\\d+")),     # get group of digits after R:
    Post_Rating = as.numeric(str_extract(x[12], "(?<=->)\\d+"))      # get group of digits after ->
  )
}

# Apply my function to each element of split_data
# Bind the resulting rows together into a new dataframe
my_df <- bind_rows(lapply(split_data, parse_player))

# Checking in
head(my_df, 5)

## # A tibble: 5 × 14
##    Pair Name       Total Round_1 Round_2 Round_3 Round_4 Round_5 Round_6 Round_7
##   <dbl> <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1     1 GARY HUA     6        39      21      18      14       7      12       4
## 2     2 DAKSHESH …   6        63      58       4      17      16      20       7
## 3     3 ADITYA BA…   6         8      61      25      21      11      13      12
## 4     4 PATRICK H…   5.5      23      28       2      26       5      19       1
## 5     5 HANSHI ZUO   5.5      45      37      12      13       4      14      17
## # ℹ 4 more variables: State <chr>, ID <chr>, Pre_Rating <dbl>,
## #   Post_Rating <dbl>
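The three patterns applied to the ID and rating field can be tried in isolation. The sketch below uses base R (regexpr and regmatches with perl = TRUE) as a stand-in for str_extract so that it runs without extra packages. The helper extract_first is a hypothetical name introduced only for this illustration, and the input string is the field from the first entry shown earlier.

```r
# Return the first match of a pattern in a single string, or NA if none
# (base R stand-in for stringr::str_extract; helper name is illustrative)
extract_first <- function(text, pattern) {
  m <- regexpr(pattern, text, perl = TRUE)
  if (m == -1) NA_character_ else regmatches(text, m)
}

rating_field <- "15445895 / R: 1794 ->1817"

# First run of digits in the field: the USCF ID
id <- extract_first(rating_field, "\\d+")

# Lookbehind: digits that directly follow "R: " (pre-tournament rating)
pre_rating <- as.numeric(extract_first(rating_field, "(?<=R: )\\d+"))

# Lookbehind: digits that directly follow "->" (post-tournament rating)
post_rating <- as.numeric(extract_first(rating_field, "(?<=->)\\d+"))
```

The lookbehind only anchors where the match starts, so the prefix itself ("R: " or "->") is not part of the extracted text.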

I have almost everything I wanted from this data now. Next I will calculate the average pre-tournament rating of each player’s opponents. I could not find a really clean way to do this, so I opted for a small anonymous helper function that looks up each opponent’s pre-rating by pair number.

# Get the column names for the rounds
round_cols <- paste0("Round_", 1:7) 

# Adding in the Average Pre Chess Rating of Opponents
my_df <- my_df %>%
  rowwise() %>%
  mutate(
    Avg_Opp_Rating = round(
      mean(
        sapply(c_across(all_of(round_cols)), function(opponent_pair) {
          if (!is.na(opponent_pair)) {
            my_df$Pre_Rating[my_df$Pair == opponent_pair]
          } else {
            NA_real_  # handling NAs from rounds without opponents
          }
        }),
        na.rm = TRUE  # ignore NAs in the mean calculation
      ),
      0  # round to 0 decimal places
    )
  ) %>%
  ungroup()
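As a possible alternative to the row-wise lookup, the same calculation can be sketched in a vectorized way with match() and rowMeans(). This is only an illustration on a made-up four-player data frame (the toy object and its ratings are invented for the example), not a replacement for the code above.

```r
# Toy tournament: 4 players, 2 rounds; NA marks a round without an opponent
toy <- data.frame(
  Pair       = 1:4,
  Pre_Rating = c(1500, 1400, 1300, 1200),
  Round_1    = c(2, 1, 4, 3),
  Round_2    = c(3, NA, 1, NA)
)
toy_round_cols <- c("Round_1", "Round_2")

# Matrix of opponent pair numbers, one row per player
opp_pairs <- as.matrix(toy[toy_round_cols])

# match() maps every pair number to its row in toy; NA stays NA.
# Rebuilding the matrix with the same row count keeps the layout intact.
opp_ratings <- matrix(
  toy$Pre_Rating[match(opp_pairs, toy$Pair)],
  nrow = nrow(opp_pairs)
)

# Average per player, skipping rounds without an opponent
toy$Avg_Opp_Rating <- round(rowMeans(opp_ratings, na.rm = TRUE))
```

Because match() looks up by pair number rather than by row position, this sketch would still work if the rows were not sorted by Pair.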

Lastly, I am selecting only the columns of information that I wanted from this data.

# Reduce this to only the columns of interest
my_df <- my_df %>%
  select(Name, State, Total, Pre_Rating, Avg_Opp_Rating)

# Checking in
head(my_df, 5)

## # A tibble: 5 × 5
##   Name                State Total Pre_Rating Avg_Opp_Rating
##   <chr>               <chr> <dbl>      <dbl>          <dbl>
## 1 GARY HUA            ON      6         1794           1605
## 2 DAKSHESH DARURI     MI      6         1553           1469
## 3 ADITYA BAJAJ        MI      6         1384           1564
## 4 PATRICK H SCHILLING MI      5.5       1716           1574
## 5 HANSHI ZUO          MI      5.5       1655           1501
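As a spot check on the first row, Gary Hua's opponents were pairs 39, 21, 18, 14, 7, 12 and 4. The sketch below averages their pre-ratings by hand, reading the values from the full table further down on the assumption that its row order matches the pair numbers.

```r
# Pre-ratings of Gary Hua's seven opponents, in round order
# (pairs 39, 21, 18, 14, 7, 12, 4)
opponent_pre_ratings <- c(1436, 1563, 1600, 1610, 1649, 1663, 1716)

# Same calculation as in the pipeline: mean, rounded to 0 decimal places
manual_avg <- round(mean(opponent_pre_ratings), 0)
```

The mean is 11237 / 7, roughly 1605.29, which rounds to the 1605 shown for Gary Hua.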

The cleaned data that will go into the CSV file is now ready.

kable(my_df)

Name	State	Total	Pre_Rating	Avg_Opp_Rating
GARY HUA	ON	6.0	1794	1605
DAKSHESH DARURI	MI	6.0	1553	1469
ADITYA BAJAJ	MI	6.0	1384	1564
PATRICK H SCHILLING	MI	5.5	1716	1574
HANSHI ZUO	MI	5.5	1655	1501
HANSEN SONG	OH	5.0	1686	1519
GARY DEE SWATHELL	MI	5.0	1649	1372
EZEKIEL HOUGHTON	MI	5.0	1641	1468
STEFANO LEE	ON	5.0	1411	1523
ANVIT RAO	MI	5.0	1365	1554
CAMERON WILLIAM MC LEMAN	MI	4.5	1712	1468
KENNETH J TACK	MI	4.5	1663	1506
TORRANCE HENRY JR	MI	4.5	1666	1498
BRADLEY SHAW	MI	4.5	1610	1515
ZACHARY JAMES HOUGHTON	MI	4.5	1220	1484
MIKE NIKITIN	MI	4.0	1604	1386
RONALD GRZEGORCZYK	MI	4.0	1629	1499
DAVID SUNDEEN	MI	4.0	1600	1480
DIPANKAR ROY	MI	4.0	1564	1426
JASON ZHENG	MI	4.0	1595	1411
DINH DANG BUI	ON	4.0	1563	1470
EUGENE L MCCLURE	MI	4.0	1555	1300
ALAN BUI	ON	4.0	1363	1214
MICHAEL R ALDRICH	MI	4.0	1229	1357
LOREN SCHWIEBERT	MI	3.5	1745	1363
MAX ZHU	ON	3.5	1579	1507
GAURAV GIDWANI	MI	3.5	1552	1222
SOFIA ADINA STANESCU-BELLU	MI	3.5	1507	1522
CHIEDOZIE OKORIE	MI	3.5	1602	1314
GEORGE AVERY JONES	ON	3.5	1522	1144
RISHI SHETTY	MI	3.5	1494	1260
JOSHUA PHILIP MATHEWS	ON	3.5	1441	1379
JADE GE	MI	3.5	1449	1277
MICHAEL JEFFERY THOMAS	MI	3.5	1399	1375
JOSHUA DAVID LEE	MI	3.5	1438	1150
SIDDHARTH JHA	MI	3.5	1355	1388
AMIYATOSH PWNANANDAM	MI	3.5	980	1385
BRIAN LIU	MI	3.0	1423	1539
JOEL R HENDON	MI	3.0	1436	1430
FOREST ZHANG	MI	3.0	1348	1391
KYLE WILLIAM MURPHY	MI	3.0	1403	1248
JARED GE	MI	3.0	1332	1150
ROBERT GLEN VASEY	MI	3.0	1283	1107
JUSTIN D SCHILLING	MI	3.0	1199	1327
DEREK YAN	MI	3.0	1242	1152
JACOB ALEXANDER LAVALLEY	MI	3.0	377	1358
ERIC WRIGHT	MI	2.5	1362	1392
DANIEL KHAIN	MI	2.5	1382	1356
MICHAEL J MARTIN	MI	2.5	1291	1286
SHIVAM JHA	MI	2.5	1056	1296
TEJAS AYYAGARI	MI	2.5	1011	1356
ETHAN GUO	MI	2.5	935	1495
JOSE C YBARRA	MI	2.0	1393	1345
LARRY HODGE	MI	2.0	1270	1206
ALEX KONG	MI	2.0	1186	1406
MARISA RICCI	MI	2.0	1153	1414
MICHAEL LU	MI	2.0	1092	1363
VIRAJ MOHILE	MI	2.0	917	1391
SEAN M MC CORMICK	MI	2.0	853	1319
JULIA SHEN	MI	1.5	967	1330
JEZZEL FARKAS	ON	1.5	955	1327
ASHWIN BALAJI	MI	1.0	1530	1186
THOMAS JOSEPH HOSMER	MI	1.0	1175	1350
BEN LI	MI	1.0	1163	1263

It is time to create a csv of this information.

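# row.names = FALSE keeps the row index out of the file, so it holds only the five fields listed in the intro
write.csv(my_df, "chess_tournament_player_info.csv", row.names = FALSE)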
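A quick way to confirm what ends up in the file is to read it back. The sketch below does a round trip on a two-row stand-in data frame written to a temporary path (toy_df and out_path are illustrative names). It uses row.names = FALSE; with row.names = TRUE, write.csv adds an unnamed leading column of row numbers to the file.

```r
# Two rows with the same columns as the final data frame
toy_df <- data.frame(
  Name           = c("GARY HUA", "DAKSHESH DARURI"),
  State          = c("ON", "MI"),
  Total          = c(6.0, 6.0),
  Pre_Rating     = c(1794, 1553),
  Avg_Opp_Rating = c(1605, 1469)
)

# Write to a temporary file without the row index, then read it back
out_path <- tempfile(fileext = ".csv")
write.csv(toy_df, out_path, row.names = FALSE)
round_trip <- read.csv(out_path, stringsAsFactors = FALSE)

unlink(out_path)  # clean up the temporary file
```

The columns read back should be exactly the five fields, in the same order.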

Catherine Dube
