Week 4 Assignment

Introduction

Project 1 is structured around scraping data from an unfriendly text table about chess statistics to get it into R. Once the data has been wrangled, the next focus is transferring information that exists one multiple other rows to another single row, for every row. Our ultimate goal is to end up with a CSV that includes the data formatted into the columns: Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents.

Loading The Data

To begin with, we need to load the data into R in whatever format we can get to further work on processing it. My initial thought was loading in the text file with readr’s read_delim function. With each row being delimited by the blocks made up of “-”, and each column being delimited by individual “|”. Unfortunately, changing the row delimiter is not possible. So I needed to create a grouping id every second row and group every two rows together using reframe. This leaves us with a pretty good starting point for processing the individual columns.

tournament_info_rd <- read_delim("https://raw.githubusercontent.com/alu-potato/DATA607/main/Assignments/Week%204%20Project/tournamentinfo.txt", 
                                        delim = "|", escape_double = FALSE, comment = "--", 
                                        trim_ws = TRUE, skip = 1, col_names = FALSE, show_col_types = FALSE)
head(tournament_info_rd,10)

## # A tibble: 10 × 11
##    X1    X2                X3    X4    X5    X6    X7    X8    X9    X10   X11  
##    <chr> <chr>             <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl>
##  1 Pair  Player Name       Total Round Round Round Round Round Round Round NA   
##  2 Num   USCF ID / Rtg (P… Pts   1     2     3     4     5     6     7     NA   
##  3 1     GARY HUA          6.0   W  39 W  21 W  18 W  14 W   7 D  12 D   4 NA   
##  4 ON    15445895 / R: 17… N:2   W     B     W     B     W     B     W     NA   
##  5 2     DAKSHESH DARURI   6.0   W  63 W  58 L   4 W  17 W  16 W  20 W   7 NA   
##  6 MI    14598900 / R: 15… N:2   B     W     B     W     B     W     B     NA   
##  7 3     ADITYA BAJAJ      6.0   L   8 W  61 W  25 W  21 W  11 W  13 W  12 NA   
##  8 MI    14959604 / R: 13… N:2   W     B     W     B     W     B     W     NA   
##  9 4     PATRICK H SCHILL… 5.5   W  23 D  28 W   2 W  26 D   5 W  19 D   1 NA   
## 10 MI    12616049 / R: 17… N:2   W     B     W     B     W     B     B     NA

tournament_info_rd2 <- tournament_info_rd %>%
  mutate(grouping_id = rep(1:(n()/2),each = 2)) %>%
  group_by(grouping_id) %>%
  reframe(across(X1:X10, ~trimws(paste0(.x, collapse=" "))))
names(tournament_info_rd2) <- tournament_info_rd2 %>% slice(1) %>% unlist()
tournament_info_rd3 <- tournament_info_rd2[,-1] %>% slice(-1)
head(tournament_info_rd3,10)

## # A tibble: 10 × 10
##    `Pair Num` Player N…¹ Total…² Round…³ Round…⁴ Round…⁵ Round…⁶ Round…⁷ Round…⁸
##    <chr>      <chr>      <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
##  1 1 ON       GARY HUA … 6.0 N:2 W  39 W W  21 B W  18 W W  14 B W   7 W D  12 B
##  2 2 MI       DAKSHESH … 6.0 N:2 W  63 B W  58 W L   4 B W  17 W W  16 B W  20 W
##  3 3 MI       ADITYA BA… 6.0 N:2 L   8 W W  61 B W  25 W W  21 B W  11 W W  13 B
##  4 4 MI       PATRICK H… 5.5 N:2 W  23 W D  28 B W   2 W W  26 B D   5 W W  19 B
##  5 5 MI       HANSHI ZU… 5.5 N:2 W  45 B W  37 W D  12 B D  13 W D   4 B W  14 W
##  6 6 OH       HANSEN SO… 5.0 N:3 W  34 W D  29 B L  11 W W  35 B D  10 B W  27 W
##  7 7 MI       GARY DEE … 5.0 N:3 W  57 W W  46 B W  13 W W  11 B L   1 B W   9 W
##  8 8 MI       EZEKIEL H… 5.0 N:3 W   3 B W  32 W L  14 B L   9 W W  47 B W  28 W
##  9 9 ON       STEFANO L… 5.0 N:2 W  25 W L  18 B W  59 W W   8 B W  26 W L   7 B
## 10 10 MI      ANVIT RAO… 5.0 N:3 D  16 W L  19 W W  55 B W  31 B D   6 W W  25 B
## # … with 1 more variable: `Round 7` <chr>, and abbreviated variable names
## #   ¹`Player Name USCF ID / Rtg (Pre->Post)`, ²`Total Pts`, ³`Round 1`,
## #   ⁴`Round 2`, ⁵`Round 3`, ⁶`Round 4`, ⁷`Round 5`, ⁸`Round 6`

Loading the Data Again

At some point I had realized that this project was supposed to be based on string parsing and regex, so I started working on the alternative approach of loading the data in as a string and getting it into a dataframe before we began processing the columns themselves.

To begin with we read the text file that has been uploaded onto Github within R as a variable. As we can see with the first few lines, the string itself has a lot of characters that need to be removed to get in a better shape.

tournament_info_string <- read_file("https://raw.githubusercontent.com/alu-potato/DATA607/main/Assignments/Week%204%20Project/tournamentinfo.txt")
substr(tournament_info_string,1,800)

## [1] "-----------------------------------------------------------------------------------------\r\n Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| \r\n Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | \r\n-----------------------------------------------------------------------------------------\r\n    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|\r\n   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |\r\n-----------------------------------------------------------------------------------------\r\n    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|\r\n   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    "

Processing The String into a Tibble

To begin our processing journey we remove newline and carriage return. Then we turn the string into an array of each row of data based on the dash block used to separate each observation. Next we remove the first and last entry of the array which were not useful as they ended with dash blocks but did not have an observation afterwards. Since we need to eventually turn this into a tibble, we obtain the row of column names that will be used for the data frame by parsing the first entry of the character vector at the “|” character. Finally we remove that first entry, and we have a character vector ready to be turned into a tibble. Pretty good shape compared to the mess of the string we had before.

tournament_info_string2 <- str_replace_all(tournament_info_string, regex(r"(\r|\n)"), "")
tournament_info_array1 <- str_split_1(tournament_info_string2,regex(r"(-+-)"))
tournament_info_array2 <- head(tournament_info_array1,-1)[-1]
tournament_info_names <- str_trim(str_split_1(tournament_info_array2[1],fixed("|")))
tournament_info_array3 <- tournament_info_array2[-1]
tournament_info_array3[1:4]

## [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [2] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [3] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [4] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"

Next we transform the character vector into a data frame by temporarily turning it into a tibble and then passing that tibble through separate_wider_delim which provides a starting point for only extracting and processing the rows we want. While we’re doing this, simplest and first thing that we can do here to process the data within the tibble is to drop the columns that do not contain any information pertaining to the data that we want. In this case, the last 9 columns hold nothing of interest for us.

suppressMessages(
tournament_info_dataframe <- tibble(tournament_info_array3) %>%
  separate_wider_delim(tournament_info_array3, delim = "|", names = tournament_info_names, names_repair = "unique")
)
tournament_info_dataframe2 <- tournament_info_dataframe[,1:12]
glimpse(tournament_info_dataframe2)

## Rows: 64
## Columns: 12
## $ Pair                        <chr> "    1 ", "    2 ", "    3 ", "    4 ", " …
## $ `Player Name`               <chr> " GARY HUA                        ", " DAK…
## $ Total                       <chr> "6.0  ", "6.0  ", "6.0  ", "5.5  ", "5.5  …
## $ Round...4                   <chr> "W  39", "W  63", "L   8", "W  23", "W  45…
## $ Round...5                   <chr> "W  21", "W  58", "W  61", "D  28", "W  37…
## $ Round...6                   <chr> "W  18", "L   4", "W  25", "W   2", "D  12…
## $ Round...7                   <chr> "W  14", "W  17", "W  21", "W  26", "D  13…
## $ Round...8                   <chr> "W   7", "W  16", "W  11", "D   5", "D   4…
## $ Round...9                   <chr> "D  12", "W  20", "W  13", "W  19", "W  14…
## $ Round...10                  <chr> "D   4", "W   7", "W  12", "D   1", "W  17…
## $ Num                         <chr> "   ON ", "   MI ", "   MI ", "   MI ", " …
## $ `USCF ID / Rtg (Pre->Post)` <chr> " 15445895 / R: 1794   ->1817     ", " 145…

Processing The Tibble

Now that we have a Tibble we can work with, we still want to further manipulate the entries of each columns. We’ll begin with removing the unnecessary white space caught in glimpse. After this, we extract only the pre-rating of the player with a regex match. Then we get only the player IDs from the round column which is what we want, at the same time we convert the other columns that have numbers to the dbl datatype instead of chr as they previously were. One thing to note, is that in the case of players not having an opponent in a round for whatever reason, they have had their opponent match converted to NA in that case.

Next we give the tibble even more apt title names just so it looks nicer.

tournament_info_dataframe3 <- tournament_info_dataframe2 %>%
  mutate(across(1:12,str_trim)) %>%
    mutate(`USCF ID / Rtg (Pre->Post)` = str_extract(`USCF ID / Rtg (Pre->Post)`,r"((?<=R:\s{1,2})\d{3,4})")) %>%
      mutate(across(starts_with('Round')|'Pair'|'Total'|`USCF ID / Rtg (Pre->Post)`, parse_number))
names(tournament_info_dataframe3) <- c('Player ID','Player’s Name','Total Number of Points', 'Round 1 Opponent','Round 2 Opponent','Round 3 Opponent','Round 4 Opponent','Round 5 Opponent','Round 6 Opponent','Round 7 Opponent','Player’s State', 'Player’s Pre-Rating')
head(tournament_info_dataframe3,5)

## # A tibble: 5 × 12
##   `Player ID` Player’s…¹ Total…² Round…³ Round…⁴ Round…⁵ Round…⁶ Round…⁷ Round…⁸
##         <dbl> <chr>        <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1           1 GARY HUA       6        39      21      18      14       7      12
## 2           2 DAKSHESH …     6        63      58       4      17      16      20
## 3           3 ADITYA BA…     6         8      61      25      21      11      13
## 4           4 PATRICK H…     5.5      23      28       2      26       5      19
## 5           5 HANSHI ZUO     5.5      45      37      12      13       4      14
## # … with 3 more variables: `Round 7 Opponent` <dbl>, `Player’s State` <chr>,
## #   `Player’s Pre-Rating` <dbl>, and abbreviated variable names
## #   ¹`Player’s Name`, ²`Total Number of Points`, ³`Round 1 Opponent`,
## #   ⁴`Round 2 Opponent`, ⁵`Round 3 Opponent`, ⁶`Round 4 Opponent`,
## #   ⁷`Round 5 Opponent`, ⁸`Round 6 Opponent`

Generating the Average Pre Chess Rating of Opponents

To generate the average pre-chess rating of the opponents, we’ll use the opponent columns as helper columns to join the player’s pre-rating from the ID contained within it. This was done through a for loop because I could not for the life of me figure out how to do this with dplyr functions. Then we can get the mean of those columns in the new column that becomes the average pre chess rating of opponents.

tournament_info_dataframe4 <- tournament_info_dataframe3
x = 3
for(i in tournament_info_dataframe4[4:10]) {
  x = x + 1
  y = 0
  for(j in i) {
    y = y+1
    if (!is.na(j)) {
      tournament_info_dataframe4[y,x] <- tournament_info_dataframe4[j,][['Player’s Pre-Rating']]
    }
  }
}
tournament_info_dataframe5 <- tournament_info_dataframe4 |>
  mutate('Average Chess Pre-Rating of Opponents' = round(rowMeans(tournament_info_dataframe4[4:10], na.rm=TRUE))) |>
    select(c('Player’s Name', 'Player’s State', 'Total Number of Points', 'Player’s Pre-Rating', 'Average Chess Pre-Rating of Opponents'))
head(tournament_info_dataframe5,10)

## # A tibble: 10 × 5
##    `Player’s Name`     `Player’s State` `Total Number of Points` Playe…¹ Avera…²
##    <chr>               <chr>                               <dbl>   <dbl>   <dbl>
##  1 GARY HUA            ON                                    6      1794    1605
##  2 DAKSHESH DARURI     MI                                    6      1553    1469
##  3 ADITYA BAJAJ        MI                                    6      1384    1564
##  4 PATRICK H SCHILLING MI                                    5.5    1716    1574
##  5 HANSHI ZUO          MI                                    5.5    1655    1501
##  6 HANSEN SONG         OH                                    5      1686    1519
##  7 GARY DEE SWATHELL   MI                                    5      1649    1372
##  8 EZEKIEL HOUGHTON    MI                                    5      1641    1468
##  9 STEFANO LEE         ON                                    5      1411    1523
## 10 ANVIT RAO           MI                                    5      1365    1554
## # … with abbreviated variable names ¹`Player’s Pre-Rating`,
## #   ²`Average Chess Pre-Rating of Opponents`

Generating the CSV File

Now that we have the data in the format that we want, all that’s needed is the simple process of outputting a .csv file.

write.csv(tournament_info_dataframe5, file = "tournament_info.csv", row.names = FALSE)