607_Project1_DylanGold

Approach

In this assignment we parse a given text file of chess tournament results and use it to compute the average pre-tournament rating of each player's opponents. We then create a csv with the Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents.

The initial challenge is parsing the text file. All the information we need appears directly in each player's rows as plain text, except for the average pre-chess rating of opponents. For that we also need to store every match-up and look up each opponent's rating. Because the entries are separated by |, we could probably use this to split the information. This might be easier in Python with its built-in split, but R has str_split() as well, which should make it easier. We can try some regex for separating the result and the opponent in each round, though it would probably be easier to just take the first character for the win status.
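As a rough illustration of the split-on-| idea, here is a minimal sketch using stringr on one row copied from the tournament file. The variable names (row, fields, rounds, result, opponent) are only for this example and are not used in the rest of the report.

```r
library(stringr)

# One data row copied from the tournament file
row <- "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"

# "|" is a regex metacharacter, so it has to be escaped as "\\|"
fields <- str_trim(str_split(row, "\\|")[[1]])
fields <- fields[fields != ""]   # the trailing "|" leaves one empty field

# Fields 4 onward are the rounds: the first character is the result,
# the rest (after trimming) is the opponent's pair number
rounds   <- fields[4:length(fields)]
result   <- str_sub(rounds, 1, 1)
opponent <- as.integer(str_trim(str_sub(rounds, 2)))

print(fields[2])   # "GARY HUA"
print(result)
print(opponent)
```

This works for a row where every round has an opponent; rounds without one would give NA for the opponent number, which is one reason the regex approach used later is more convenient.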

I will create a data frame in the output format we need, then use write.csv() to create our csv.

Codebase

Set up the tidyverse (for dplyr and stringr), as well as stringi for better string parsing

library(tidyverse)
library(stringi)

Let's start by getting the tournament info into our project as a character vector, one string per line.

#Use readLines() to get it from a url
tournament_txt <- readLines("https://raw.githubusercontent.com/DylanGoldJ/607-Project-1/refs/heads/main/tournamentinfo.txt")
head(tournament_txt, 12) #Look at the top of the txt.

 [1] "-----------------------------------------------------------------------------------------" 
 [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
 [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
 [4] "-----------------------------------------------------------------------------------------" 
 [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
 [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
 [7] "-----------------------------------------------------------------------------------------" 
 [8] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
 [9] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
[10] "-----------------------------------------------------------------------------------------" 
[11] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|" 
[12] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"

First, let's split each line on the “-+$” pattern. readLines() has already given us one string per line, so the purpose of this split is to empty out the separator lines, which consist only of dashes. We will also use omit_empty = TRUE so that the empty pieces are dropped.

tournament_lines <- stri_split(tournament_txt, regex = "-+$", omit_empty = TRUE) #Use a run of dashes that ends the line as the delimiter
#The separator lines are left as empty list elements, so we remove them with a filter for lengths greater than 0
tournament_lines <- tournament_lines[lengths(tournament_lines) > 0]
head(tournament_lines, 8)

[[1]]
[1] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "

[[2]]
[1] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "

[[3]]
[1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"

[[4]]
[1] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"

[[5]]
[1] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"

[[6]]
[1] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"

[[7]]
[1] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"

[[8]]
[1] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
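To make the effect of the split easier to see, here is a small sketch on three sample lines (a shortened separator and two shortened data rows). It also shows an alternative that simply drops the all-dash lines with stri_detect(); that alternative is my own suggestion, not what the code above does.

```r
library(stringi)

sample_lines <- c(
  "-----------------------------------------",
  "    1 | GARY HUA                        |6.0  |W  39|",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |"
)

# Splitting on a run of dashes at the end of the line empties the separator line,
# while data rows (which end in "|") come back unchanged as a single piece
pieces <- stri_split(sample_lines, regex = "-+$", omit_empty = TRUE)
print(lengths(pieces))

kept <- pieces[lengths(pieces) > 0]

# Alternative: keep every line that is not made up entirely of dashes
alt <- sample_lines[!stri_detect(sample_lines, regex = "^-+$")]
print(identical(unlist(kept), alt))
```

The alternative returns a plain character vector instead of a list, which would make the later indexing a little simpler.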

We now have a list of the important lines. Each player takes up two consecutive rows, so we can separate every other row to organize the data. The R equivalent of Python's array slicing is to index with seq(), so I will create an odd and an even sequence to split the list and pair the elements together.

odd_sequence <- seq(1, length(tournament_lines), 2)
even_sequence <- seq(2, length(tournament_lines),2)
player_info1 <- tournament_lines[odd_sequence] #A list of one-element character vectors: the name/points/rounds rows
player_info2 <- tournament_lines[even_sequence] #The state/rating rows
#Remove the header row from each list
player_info1 <- player_info1[-1]
player_info2 <- player_info2[-1]
#Show the new first values, should be Gary's lines
player_info1[[1]]

[1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"

player_info2[[1]]

[1] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
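For comparison, the same odd/even split can also be written with a recycled logical index instead of seq(). This is only a sketch on a made-up four-element list, not part of the pipeline above.

```r
# Made-up stand-in for the list of alternating rows
x <- list("name row 1", "rating row 1", "name row 2", "rating row 2")

# c(TRUE, FALSE) is recycled along the list, so it keeps positions 1, 3, 5, ...
odd  <- x[c(TRUE, FALSE)]
# c(FALSE, TRUE) keeps positions 2, 4, 6, ...
even <- x[c(FALSE, TRUE)]

# Both approaches select the same elements
print(identical(odd,  x[seq(1, length(x), 2)]))
print(identical(even, x[seq(2, length(x), 2)]))
```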

We now have a more organized format and can start parsing for our data.

As I went along I realized that avoiding regex would get pretty messy, so I switched to regex. Next time I could parse with regex from the start.
We can use str_match() with regex capture groups to extract values.
We can start with the player number, even though the rows are already sorted by it.

#Extract the player number from each string in player_info1; column 2 of the result holds the first capture group
player_num <- str_match(player_info1, "(\\d+) \\| ")[,2] 
player_num

 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
[31] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45"
[46] "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56" "57" "58" "59" "60"
[61] "61" "62" "63" "64"

I realized I can use multiple capture groups in one larger regex for the whole row.
As far as I can tell, a single str_match() cannot capture a repeated group (such as each round's opponent) more than once, so we can hold off on that.

player_groupings1 <- str_match(player_info1, "(\\d+) \\| ([\\w -]+) *\\|(\\d\\.\\d) *\\|") #Inside [] we list the allowed characters directly: word characters, spaces and hyphens
player_num <- player_groupings1[,2]
player_name <- player_groupings1[,3] %>%
  str_trim() %>% #Because the name group allows spaces, it also captures the padding after the name. str_trim will remove this whitespace
  str_to_title()
player_points <- player_groupings1[,4]

head(player_name, 8) #Show example

[1] "Gary Hua"            "Dakshesh Daruri"     "Aditya Bajaj"       
[4] "Patrick H Schilling" "Hanshi Zuo"          "Hansen Song"        
[7] "Gary Dee Swathell"   "Ezekiel Houghton"
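A quick way to check a pattern like this is to run it on a couple of rows, including an awkward one. The sketch below uses the character class [\\w -] (word characters, spaces and hyphens), which is an assumption of this example, and a made-up second row with a hyphenated name and pair number 99 that does not come from the tournament file.

```r
library(stringr)

rows <- c(
  "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|",
  # Made-up row, only here to exercise the hyphen case
  "   99 | ANNA SMITH-JONES                |3.5  |W   1|L   2|D   3|H    |U    |W   4|L   5|"
)

# Columns of the result: 1 = full match, 2 = pair number, 3 = padded name, 4 = points
m <- str_match(rows, "(\\d+) \\| ([\\w -]+) *\\|(\\d\\.\\d) *\\|")

num    <- as.integer(m[, 2])
name   <- str_trim(m[, 3])
points <- as.numeric(m[, 4])

print(num)
print(name)
print(points)
```

If a row does not fit the pattern, str_match() returns NA for every column of that row, so checking for NA values is a simple way to spot lines the regex misses.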

Now we do the same for the other list, which holds the state and rating row of each player.

player_groupings2 <- str_match(player_info2, "(\\w{2}) *\\| \\d+ \\/ R: *(\\d+)(P\\d*)?")
player_state <- player_groupings2[,2]
player_rating <- player_groupings2[,3]
head(player_rating, 8)

[1] "1794" "1553" "1384" "1716" "1655" "1686" "1649" "1641"
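Some pre-ratings in the file carry a P suffix, which is what the optional (P\\d*)? group is there to absorb. The sketch below checks this on one real row and one made-up row; the USCF ID 10000001 and the 1220P13 / 1416P20 values in the second row are placeholders for illustration only.

```r
library(stringr)

rows <- c(
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |",
  # Made-up row with a P suffix on the pre-rating
  "   MI | 10000001 / R: 1220P13 ->1416P20  |N:3  |B    |W    |B    |W    |B    |W    |B    |"
)

# Columns of the result: 1 = full match, 2 = state, 3 = pre-rating digits, 4 = optional P suffix
m <- str_match(rows, "(\\w{2}) *\\| \\d+ \\/ R: *(\\d+)(P\\d*)?")

state  <- m[, 2]
rating <- as.numeric(m[, 3])
suffix <- m[, 4]   # NA when the rating has no P suffix

print(state)
print(rating)
print(suffix)
```

Because (\\d+) stops at the first non-digit, the number before the P is captured as the pre-rating and the suffix is kept out of it.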

Now we need to get the capture groups for the opponent matches.
We can do this with str_match_all(). I show the first player as an example.

#We create a regex for a single round entry; str_match_all will find every occurrence of it.
opponent_groupings <- str_match_all(player_info1, "\\|[WLHBDUX] *(\\d*)") #Inside [] the letters are listed directly, no | is needed
#opponent_groupings is a list of 7x2 character matrices, one per player
#We can get the opponents of the first player as a sample
(opponent_groupings[[1]])[,2]

[1] "39" "21" "18" "14" "7"  "12" "4"

Now we make a list of these values.
For each player, str_match_all() returned a matrix whose first column is the full match and whose second column is the capture group. We just want the second column.

#I was not sure how to get this sub-list all at once so I used a loop
#We initialize the list with vector() to give it its length upfront
player_opponents <- vector("list", length(opponent_groupings))
for(i in seq_along(opponent_groupings)){ #seq_along is safe even if the list is empty
  player_opponents[[i]] <- opponent_groupings[[i]][,2]
}
head(player_opponents, 15)

[[1]]
[1] "39" "21" "18" "14" "7"  "12" "4" 

[[2]]
[1] "63" "58" "4"  "17" "16" "20" "7" 

[[3]]
[1] "8"  "61" "25" "21" "11" "13" "12"

[[4]]
[1] "23" "28" "2"  "26" "5"  "19" "1" 

[[5]]
[1] "45" "37" "12" "13" "4"  "14" "17"

[[6]]
[1] "34" "29" "11" "35" "10" "27" "21"

[[7]]
[1] "57" "46" "13" "11" "1"  "9"  "2" 

[[8]]
[1] "3"  "32" "14" "9"  "47" "28" "19"

[[9]]
[1] "25" "18" "59" "8"  "26" "7"  "20"

[[10]]
[1] "16" "19" "55" "31" "6"  "25" "18"

[[11]]
[1] "38" "56" "6"  "7"  "3"  "34" "26"

[[12]]
[1] "42" "33" "5"  "38" ""   "1"  "3" 

[[13]]
[1] "36" "27" "7"  "5"  "33" "3"  "32"

[[14]]
[1] "54" "44" "8"  "1"  "27" "5"  "31"

[[15]]
[1] "19" "16" "30" "22" "54" "33" "38"
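The second column can also be pulled out of every matrix at once with lapply(). The sketch below compares that with the loop on the first two rows of the file; the names rows, groups, opponents_loop and opponents_apply exist only in this example.

```r
library(stringr)

rows <- c(
  "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|",
  "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
)

# One matrix per row: column 1 = full match, column 2 = opponent number
groups <- str_match_all(rows, "\\|[WLHBDUX] *(\\d*)")

# Loop version
opponents_loop <- vector("list", length(groups))
for (i in seq_along(groups)) {
  opponents_loop[[i]] <- groups[[i]][, 2]
}

# lapply version: apply the same column extraction to every matrix
opponents_apply <- lapply(groups, function(m) m[, 2])

print(opponents_apply)
print(identical(opponents_loop, opponents_apply))
```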

We now have all the pieces; we just need the opponents' average rating. First I will combine all of this into a data frame.

player_data <- data.frame (
  number = as.integer(player_num), #Convert from character to int
  name = player_name,
  state = player_state,
  points = as.numeric(player_points),
  rating = as.numeric(player_rating)
)
player_data <- player_data %>% mutate(
  opponents = player_opponents
)

head(player_data,12) #Dataframe is made properly

   number                     name state points rating
1       1                 Gary Hua    ON    6.0   1794
2       2          Dakshesh Daruri    MI    6.0   1553
3       3             Aditya Bajaj    MI    6.0   1384
4       4      Patrick H Schilling    MI    5.5   1716
5       5               Hanshi Zuo    MI    5.5   1655
6       6              Hansen Song    OH    5.0   1686
7       7        Gary Dee Swathell    MI    5.0   1649
8       8         Ezekiel Houghton    MI    5.0   1641
9       9              Stefano Lee    ON    5.0   1411
10     10                Anvit Rao    MI    5.0   1365
11     11 Cameron William Mc Leman    MI    4.5   1712
12     12           Kenneth J Tack    MI    4.5   1663
                    opponents
1    39, 21, 18, 14, 7, 12, 4
2    63, 58, 4, 17, 16, 20, 7
3   8, 61, 25, 21, 11, 13, 12
4     23, 28, 2, 26, 5, 19, 1
5   45, 37, 12, 13, 4, 14, 17
6  34, 29, 11, 35, 10, 27, 21
7     57, 46, 13, 11, 1, 9, 2
8    3, 32, 14, 9, 47, 28, 19
9    25, 18, 59, 8, 26, 7, 20
10  16, 19, 55, 31, 6, 25, 18
11    38, 56, 6, 7, 3, 34, 26
12      42, 33, 5, 38, , 1, 3
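The opponents column is a list column: each cell holds a whole character vector. Passing a list straight to data.frame() would treat its elements as separate columns, which is why it is added in a second step. A small base R sketch of two ways to build such a column, on made-up two-row data:

```r
# Made-up opponents for two players, of different lengths
opp <- list(c("39", "21"), c("63", "58", "4"))

# Option 1: create the data frame first, then assign the list as a column
df <- data.frame(number = 1:2)
df$opponents <- opp

# Option 2: wrap the list in I() so data.frame() keeps it as a single column
df2 <- data.frame(number = 1:2, opponents = I(opp))

print(is.list(df$opponents))     # TRUE: one list element per row
print(df$opponents[[2]])         # the full vector for the second player
print(ncol(df2))                 # 2
```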

Now we just need to use the opponents column to get the average rating of the opponents

#I create a function that, given a chr vector of opponent numbers, returns the average pre-rating of those players

get_avg_opp_rating <- function(opponents){ #I just loop through
  opponents <- as.integer(opponents[opponents != ""]) #Get rid of the empty strings (rounds with no opponent) and convert to int
  count <- 0
  total <- 0 #Named total rather than sum so we don't shadow base::sum()
  
  for (opp in opponents){
      count <- count + 1
      opp_rating <- filter(player_data, number == opp) %>% select("rating") %>% pull() #Filter to get the row, select the rating column, then pull it out as a number
      total <- total + opp_rating
  }
  
  return (total/count)
}
#Test
print(get_avg_opp_rating(player_data[[1,6]]))

[1] 1605.286

We have created a function that gives us the average rating for a given vector of opponents. This value matches the value given for the first player.
Now we add a new column with mutate() using this function.

player_data <- player_data %>% 
  rowwise() %>%
  mutate(avg_opponent_rating = get_avg_opp_rating(c_across(c("opponents"))))

head(player_data)

# A tibble: 6 × 7
# Rowwise: 
  number name                state points rating opponents avg_opponent_rating
   <int> <chr>               <chr>  <dbl>  <dbl> <list>                  <dbl>
1      1 Gary Hua            ON       6     1794 <chr [7]>               1605.
2      2 Dakshesh Daruri     MI       6     1553 <chr [7]>               1469.
3      3 Aditya Bajaj        MI       6     1384 <chr [7]>               1564.
4      4 Patrick H Schilling MI       5.5   1716 <chr [7]>               1574.
5      5 Hanshi Zuo          MI       5.5   1655 <chr [7]>               1501.
6      6 Hansen Song         OH       5     1686 <chr [7]>               1519.
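The same lookup can be done without a loop by using match() to find each opponent's row. The following is a sketch in base R under stated assumptions: the ratings table is made up, and avg_opp_rating() with its lookup argument is a hypothetical helper, not the function defined above.

```r
# Made-up lookup table of pair numbers and pre-ratings
ratings <- data.frame(number = 1:4, rating = c(1500, 1600, 1700, 1800))

# Hypothetical helper: average rating of the opponents listed in a chr vector
avg_opp_rating <- function(opponents, lookup) {
  ids <- as.integer(opponents[opponents != ""])    # drop rounds without an opponent
  mean(lookup$rating[match(ids, lookup$number)])   # match() gives each opponent's row index
}

# A single player: opponents 2 and 4, with one empty round
single <- avg_opp_rating(c("2", "", "4"), ratings)
print(single)   # mean of 1600 and 1800

# A whole list column at once
opponents <- list(c("2", "3"), c("1", "", "3"))
all_avgs <- sapply(opponents, avg_opp_rating, lookup = ratings)
print(all_avgs)
```

Passing the lookup table as an argument, rather than reading a global data frame inside the function, makes the helper easier to test on small inputs like this.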

We can also check a row with empty entries to make sure everything went all right.
Ashwin Balaji, player 62, faced just a single opponent, player 55. No player faced 0 opponents, so we don’t have to worry about dividing by zero.
We can see that Ashwin’s avg_opponent_rating is the same as the rating of player 55, Alex Kong.

filter(player_data, number == 62)

# A tibble: 1 × 7
# Rowwise: 
  number name          state points rating opponents avg_opponent_rating
   <int> <chr>         <chr>  <dbl>  <dbl> <list>                  <dbl>
1     62 Ashwin Balaji MI         1   1530 <chr [7]>                1186

filter(player_data, number == 55)

# A tibble: 1 × 7
# Rowwise: 
  number name      state points rating opponents avg_opponent_rating
   <int> <chr>     <chr>  <dbl>  <dbl> <list>                  <dbl>
1     55 Alex Kong MI         2   1186 <chr [7]>                1406

I will also test one more just to make sure. Let's look at player 16, who faced players 10, 15, 39, 2, and 36.
I calculate the average of their ratings myself: 1365, 1220, 1436, 1553, 1355.
This example includes opponents with a P after their rating.

mean(c(1365, 1220, 1436, 1553, 1355))

[1] 1385.8

filter(player_data, number == 16)

# A tibble: 1 × 7
# Rowwise: 
  number name         state points rating opponents avg_opponent_rating
   <int> <chr>        <chr>  <dbl>  <dbl> <list>                  <dbl>
1     16 Mike Nikitin MI         4   1604 <chr [7]>               1386.

We tested some of the rows and everything seems to be working. Now let's create a data frame with the output information.

#Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents.
output_df <-player_data %>% 
  select(c("Player’s Name" = "name","Player’s State" = "state","Total Number of Points" = "points","Player’s Pre-Rating" = "rating", "Average Pre Chess Rating of Opponents" = "avg_opponent_rating")) #also set the names for better readability in the csv
head(output_df)

# A tibble: 6 × 5
# Rowwise: 
  `Player’s Name`  `Player’s State` Total Number of Poin…¹ `Player’s Pre-Rating`
  <chr>            <chr>                             <dbl>                 <dbl>
1 Gary Hua         ON                                  6                    1794
2 Dakshesh Daruri  MI                                  6                    1553
3 Aditya Bajaj     MI                                  6                    1384
4 Patrick H Schil… MI                                  5.5                  1716
5 Hanshi Zuo       MI                                  5.5                  1655
6 Hansen Song      OH                                  5                    1686
# ℹ abbreviated name: ¹`Total Number of Points`
# ℹ 1 more variable: `Average Pre Chess Rating of Opponents` <dbl>

Now convert to csv using write.csv()

write.csv(output_df, "tournamentinfo.csv", row.names = FALSE) #Write the file; row.names = FALSE gets rid of the index column

We can check our tournamentinfo.csv to see we now have the expected output.
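One way to do this check in code is to read the file back in and compare it with the data frame. The sketch below is a round trip on a temporary file with a two-row stand-in for the output (values taken from the tables above, column names shortened for the example), so it does not touch tournamentinfo.csv.

```r
# Two-row stand-in for the output data frame
out <- data.frame(
  "Player Name"            = c("Gary Hua", "Dakshesh Daruri"),
  "Player State"           = c("ON", "MI"),
  "Total Number of Points" = c(6, 6),
  "Player Pre-Rating"      = c(1794, 1553),
  check.names = FALSE      # keep the spaces in the column names
)

path <- tempfile(fileext = ".csv")
write.csv(out, path, row.names = FALSE)

# check.names = FALSE stops read.csv from replacing the spaces with dots
back <- read.csv(path, check.names = FALSE)

print(names(back))
print(nrow(back) == nrow(out))
print(all.equal(back[["Player Pre-Rating"]], out[["Player Pre-Rating"]]))
```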

Conclusion

In this project I was able to parse a given file. The file had a uniform format across its lines, which made parsing much easier. I used regex capture groups to extract the data I needed from each string. After doing this I used dplyr functions, along with a function of my own, to generate a data frame with the needed information. Once the data frame was created I used write.csv() to create a csv file containing the formatted data from the original text.
One way I could further improve this project is to look at how winning and losing each match-up affected the players' post-tournament ratings. I could also explore how the player pairings were determined. It would be difficult to apply the same parsing to a text file of a different format, but exploring other data sources with new regex could also be interesting.