library(tidyverse)
Project 1 is structured around scraping data from an unfriendly text table about chess statistics to get it into R. Once the data has been wrangled, the next focus is transferring information that exists one multiple other rows to another single row, for every row. Our ultimate goal is to end up with a CSV that includes the data formatted into the columns: Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents.
To begin with, we need to load the data into R in whatever format we can get to further work on processing it. My initial thought was loading in the text file with readr’s read_delim function. With each row being delimited by the blocks made up of “-”, and each column being delimited by individual “|”. Unfortunately, changing the row delimiter is not possible. So I needed to create a grouping id every second row and group every two rows together using reframe. This leaves us with a pretty good starting point for processing the individual columns.
tournament_info_rd <- read_delim("https://raw.githubusercontent.com/alu-potato/DATA607/main/Assignments/Week%204%20Project/tournamentinfo.txt",
delim = "|", escape_double = FALSE, comment = "--",
trim_ws = TRUE, skip = 1, col_names = FALSE, show_col_types = FALSE)
head(tournament_info_rd,10)
## # A tibble: 10 × 11
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl>
## 1 Pair Player Name Total Round Round Round Round Round Round Round NA
## 2 Num USCF ID / Rtg (P… Pts 1 2 3 4 5 6 7 NA
## 3 1 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7 D 12 D 4 NA
## 4 ON 15445895 / R: 17… N:2 W B W B W B W NA
## 5 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17 W 16 W 20 W 7 NA
## 6 MI 14598900 / R: 15… N:2 B W B W B W B NA
## 7 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25 W 21 W 11 W 13 W 12 NA
## 8 MI 14959604 / R: 13… N:2 W B W B W B W NA
## 9 4 PATRICK H SCHILL… 5.5 W 23 D 28 W 2 W 26 D 5 W 19 D 1 NA
## 10 MI 12616049 / R: 17… N:2 W B W B W B B NA
tournament_info_rd2 <- tournament_info_rd %>%
mutate(grouping_id = rep(1:(n()/2),each = 2)) %>%
group_by(grouping_id) %>%
reframe(across(X1:X10, ~trimws(paste0(.x, collapse=" "))))
names(tournament_info_rd2) <- tournament_info_rd2 %>% slice(1) %>% unlist()
tournament_info_rd3 <- tournament_info_rd2[,-1] %>% slice(-1)
head(tournament_info_rd3,10)
## # A tibble: 10 × 10
## `Pair Num` Player N…¹ Total…² Round…³ Round…⁴ Round…⁵ Round…⁶ Round…⁷ Round…⁸
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 ON GARY HUA … 6.0 N:2 W 39 W W 21 B W 18 W W 14 B W 7 W D 12 B
## 2 2 MI DAKSHESH … 6.0 N:2 W 63 B W 58 W L 4 B W 17 W W 16 B W 20 W
## 3 3 MI ADITYA BA… 6.0 N:2 L 8 W W 61 B W 25 W W 21 B W 11 W W 13 B
## 4 4 MI PATRICK H… 5.5 N:2 W 23 W D 28 B W 2 W W 26 B D 5 W W 19 B
## 5 5 MI HANSHI ZU… 5.5 N:2 W 45 B W 37 W D 12 B D 13 W D 4 B W 14 W
## 6 6 OH HANSEN SO… 5.0 N:3 W 34 W D 29 B L 11 W W 35 B D 10 B W 27 W
## 7 7 MI GARY DEE … 5.0 N:3 W 57 W W 46 B W 13 W W 11 B L 1 B W 9 W
## 8 8 MI EZEKIEL H… 5.0 N:3 W 3 B W 32 W L 14 B L 9 W W 47 B W 28 W
## 9 9 ON STEFANO L… 5.0 N:2 W 25 W L 18 B W 59 W W 8 B W 26 W L 7 B
## 10 10 MI ANVIT RAO… 5.0 N:3 D 16 W L 19 W W 55 B W 31 B D 6 W W 25 B
## # … with 1 more variable: `Round 7` <chr>, and abbreviated variable names
## # ¹`Player Name USCF ID / Rtg (Pre->Post)`, ²`Total Pts`, ³`Round 1`,
## # ⁴`Round 2`, ⁵`Round 3`, ⁶`Round 4`, ⁷`Round 5`, ⁸`Round 6`
At some point I had realized that this project was supposed to be based on string parsing and regex, so I started working on the alternative approach of loading the data in as a string and getting it into a dataframe before we began processing the columns themselves.
To begin with we read the text file that has been uploaded onto Github within R as a variable. As we can see with the first few lines, the string itself has a lot of characters that need to be removed to get in a better shape.
tournament_info_string <- read_file("https://raw.githubusercontent.com/alu-potato/DATA607/main/Assignments/Week%204%20Project/tournamentinfo.txt")
substr(tournament_info_string,1,800)
## [1] "-----------------------------------------------------------------------------------------\r\n Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| \r\n Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | \r\n-----------------------------------------------------------------------------------------\r\n 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|\r\n ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |\r\n-----------------------------------------------------------------------------------------\r\n 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|\r\n MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W "
To begin our processing journey we remove newline and carriage return. Then we turn the string into an array of each row of data based on the dash block used to separate each observation. Next we remove the first and last entry of the array which were not useful as they ended with dash blocks but did not have an observation afterwards. Since we need to eventually turn this into a tibble, we obtain the row of column names that will be used for the data frame by parsing the first entry of the character vector at the “|” character. Finally we remove that first entry, and we have a character vector ready to be turned into a tibble. Pretty good shape compared to the mess of the string we had before.
tournament_info_string2 <- str_replace_all(tournament_info_string, regex(r"(\r|\n)"), "")
tournament_info_array1 <- str_split_1(tournament_info_string2,regex(r"(-+-)"))
tournament_info_array2 <- head(tournament_info_array1,-1)[-1]
tournament_info_names <- str_trim(str_split_1(tournament_info_array2[1],fixed("|")))
tournament_info_array3 <- tournament_info_array2[-1]
tournament_info_array3[1:4]
## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4| ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [2] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7| MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [3] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12| MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
## [4] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1| MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |"
Next we transform the character vector into a data frame by temporarily turning it into a tibble and then passing that tibble through separate_wider_delim which provides a starting point for only extracting and processing the rows we want. While we’re doing this, simplest and first thing that we can do here to process the data within the tibble is to drop the columns that do not contain any information pertaining to the data that we want. In this case, the last 9 columns hold nothing of interest for us.
suppressMessages(
tournament_info_dataframe <- tibble(tournament_info_array3) %>%
separate_wider_delim(tournament_info_array3, delim = "|", names = tournament_info_names, names_repair = "unique")
)
tournament_info_dataframe2 <- tournament_info_dataframe[,1:12]
glimpse(tournament_info_dataframe2)
## Rows: 64
## Columns: 12
## $ Pair <chr> " 1 ", " 2 ", " 3 ", " 4 ", " …
## $ `Player Name` <chr> " GARY HUA ", " DAK…
## $ Total <chr> "6.0 ", "6.0 ", "6.0 ", "5.5 ", "5.5 …
## $ Round...4 <chr> "W 39", "W 63", "L 8", "W 23", "W 45…
## $ Round...5 <chr> "W 21", "W 58", "W 61", "D 28", "W 37…
## $ Round...6 <chr> "W 18", "L 4", "W 25", "W 2", "D 12…
## $ Round...7 <chr> "W 14", "W 17", "W 21", "W 26", "D 13…
## $ Round...8 <chr> "W 7", "W 16", "W 11", "D 5", "D 4…
## $ Round...9 <chr> "D 12", "W 20", "W 13", "W 19", "W 14…
## $ Round...10 <chr> "D 4", "W 7", "W 12", "D 1", "W 17…
## $ Num <chr> " ON ", " MI ", " MI ", " MI ", " …
## $ `USCF ID / Rtg (Pre->Post)` <chr> " 15445895 / R: 1794 ->1817 ", " 145…
Now that we have a Tibble we can work with, we still want to further manipulate the entries of each columns. We’ll begin with removing the unnecessary white space caught in glimpse. After this, we extract only the pre-rating of the player with a regex match. Then we get only the player IDs from the round column which is what we want, at the same time we convert the other columns that have numbers to the dbl datatype instead of chr as they previously were. One thing to note, is that in the case of players not having an opponent in a round for whatever reason, they have had their opponent match converted to NA in that case.
Next we give the tibble even more apt title names just so it looks nicer.
tournament_info_dataframe3 <- tournament_info_dataframe2 %>%
mutate(across(1:12,str_trim)) %>%
mutate(`USCF ID / Rtg (Pre->Post)` = str_extract(`USCF ID / Rtg (Pre->Post)`,r"((?<=R:\s{1,2})\d{3,4})")) %>%
mutate(across(starts_with('Round')|'Pair'|'Total'|`USCF ID / Rtg (Pre->Post)`, parse_number))
names(tournament_info_dataframe3) <- c('Player ID','Player’s Name','Total Number of Points', 'Round 1 Opponent','Round 2 Opponent','Round 3 Opponent','Round 4 Opponent','Round 5 Opponent','Round 6 Opponent','Round 7 Opponent','Player’s State', 'Player’s Pre-Rating')
head(tournament_info_dataframe3,5)
## # A tibble: 5 × 12
## `Player ID` Player’s…¹ Total…² Round…³ Round…⁴ Round…⁵ Round…⁶ Round…⁷ Round…⁸
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 GARY HUA 6 39 21 18 14 7 12
## 2 2 DAKSHESH … 6 63 58 4 17 16 20
## 3 3 ADITYA BA… 6 8 61 25 21 11 13
## 4 4 PATRICK H… 5.5 23 28 2 26 5 19
## 5 5 HANSHI ZUO 5.5 45 37 12 13 4 14
## # … with 3 more variables: `Round 7 Opponent` <dbl>, `Player’s State` <chr>,
## # `Player’s Pre-Rating` <dbl>, and abbreviated variable names
## # ¹`Player’s Name`, ²`Total Number of Points`, ³`Round 1 Opponent`,
## # ⁴`Round 2 Opponent`, ⁵`Round 3 Opponent`, ⁶`Round 4 Opponent`,
## # ⁷`Round 5 Opponent`, ⁸`Round 6 Opponent`
To generate the average pre-chess rating of the opponents, we’ll use the opponent columns as helper columns to join the player’s pre-rating from the ID contained within it. This was done through a for loop because I could not for the life of me figure out how to do this with dplyr functions. Then we can get the mean of those columns in the new column that becomes the average pre chess rating of opponents.
tournament_info_dataframe4 <- tournament_info_dataframe3
x = 3
for(i in tournament_info_dataframe4[4:10]) {
x = x + 1
y = 0
for(j in i) {
y = y+1
if (!is.na(j)) {
tournament_info_dataframe4[y,x] <- tournament_info_dataframe4[j,][['Player’s Pre-Rating']]
}
}
}
tournament_info_dataframe5 <- tournament_info_dataframe4 |>
mutate('Average Chess Pre-Rating of Opponents' = round(rowMeans(tournament_info_dataframe4[4:10], na.rm=TRUE))) |>
select(c('Player’s Name', 'Player’s State', 'Total Number of Points', 'Player’s Pre-Rating', 'Average Chess Pre-Rating of Opponents'))
head(tournament_info_dataframe5,10)
## # A tibble: 10 × 5
## `Player’s Name` `Player’s State` `Total Number of Points` Playe…¹ Avera…²
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 GARY HUA ON 6 1794 1605
## 2 DAKSHESH DARURI MI 6 1553 1469
## 3 ADITYA BAJAJ MI 6 1384 1564
## 4 PATRICK H SCHILLING MI 5.5 1716 1574
## 5 HANSHI ZUO MI 5.5 1655 1501
## 6 HANSEN SONG OH 5 1686 1519
## 7 GARY DEE SWATHELL MI 5 1649 1372
## 8 EZEKIEL HOUGHTON MI 5 1641 1468
## 9 STEFANO LEE ON 5 1411 1523
## 10 ANVIT RAO MI 5 1365 1554
## # … with abbreviated variable names ¹`Player’s Pre-Rating`,
## # ²`Average Chess Pre-Rating of Opponents`
Now that we have the data in the format that we want, all that’s needed is the simple process of outputting a .csv file.
write.csv(tournament_info_dataframe5, file = "tournament_info.csv", row.names = FALSE)
With the CSV file in hand we could take further action such as adding this information to a database. Within that database we could be tracking the gamut of competitive chess players, and perhaps looking into if there are some players who get “luckier” than others and consistently get matched with players much lower rated than themselves.