Project 1

Read text file

The tournament file is saved on my github account where I am pulling it using the function “read.delim”. I tried multiple separators, but decided on “\t”.

tournamentInfo <- read.delim("https://raw.githubusercontent.com/jglendrange/DATA607/main/tournamentinfo.txt", header=FALSE, sep="\t")

head(tournamentInfo)

##                                                                                           V1
## 1  -----------------------------------------------------------------------------------------
## 2  Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| 
## 3  Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | 
## 4  -----------------------------------------------------------------------------------------
## 5      1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|
## 6     ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |

Cleanup

Currently all of the fields are stored in 1 field: V1. In the next block of code I clean up the data using regular expressions. Then, once I’m comfortable with how the data is formatted I split V1 into 10 columns by the character “|”.

tournamentInfo <- filter(tournamentInfo, !grepl('-{89}', V1))

tournamentInfo$V1 <- gsub("\\s+\\|", "|", tournamentInfo$V1)
tournamentInfo$V1 <- gsub("\\|\\s+", "|", tournamentInfo$V1)
tournamentInfo$V1 <- sub("\\s+", "", tournamentInfo$V1)
tournamentInfo$V1 <- sub("\\|$", "", tournamentInfo$V1)

tournamentInfo <- data.frame(do.call("rbind", strsplit(as.character(tournamentInfo$V1), "|", fixed=TRUE)))

## Warning in rbind(c("Pair", "Player Name", "Total", "Round", "Round", "Round", :
## number of columns of result is not a multiple of vector length (arg 34)

colnames(tournamentInfo) <- c("pair","player_name","total_points","round1","round2","round3","round4","round5", "round6","round7")
tournamentInfo <- tournamentInfo[-1,] 
tournamentInfo <- tournamentInfo[-1,] 

head(tournamentInfo)

##   pair                 player_name total_points round1 round2 round3 round4
## 3    1                    GARY HUA          6.0  W  39  W  21  W  18  W  14
## 4   ON 15445895 / R: 1794   ->1817          N:2      W      B      W      B
## 5    2             DAKSHESH DARURI          6.0  W  63  W  58  L   4  W  17
## 6   MI 14598900 / R: 1553   ->1663          N:2      B      W      B      W
## 7    3                ADITYA BAJAJ          6.0  L   8  W  61  W  25  W  21
## 8   MI 14959604 / R: 1384   ->1640          N:2      W      B      W      B
##   round5 round6 round7
## 3  W   7  D  12  D   4
## 4      W      B      W
## 5  W  16  W  20  W   7
## 6      B      W      B
## 7  W  11  W  13  W  12
## 8      W      B      W

Split

The next problem we face is 1 chess players data is spread across 2 rows. In the next chunk I split the dataframe into 2. The first contains the chess player information and the second contains their state and elo information. In the elo_rating table I’m only interested in the elo_rating and the state, so I remove all the other columns

players <- filter(tournamentInfo, !grepl('\\d+', player_name))
elo_rating <- filter(tournamentInfo, grepl('\\d+', player_name))

elo_rating <- elo_rating %>% select(1,2) 
colnames(elo_rating) <- c('state','elo_rating')

head(players)

##   pair         player_name total_points round1 round2 round3 round4 round5
## 1    1            GARY HUA          6.0  W  39  W  21  W  18  W  14  W   7
## 2    2     DAKSHESH DARURI          6.0  W  63  W  58  L   4  W  17  W  16
## 3    3        ADITYA BAJAJ          6.0  L   8  W  61  W  25  W  21  W  11
## 4    4 PATRICK H SCHILLING          5.5  W  23  D  28  W   2  W  26  D   5
## 5    5          HANSHI ZUO          5.5  W  45  W  37  D  12  D  13  D   4
## 6    6         HANSEN SONG          5.0  W  34  D  29  L  11  W  35  D  10
##   round6 round7
## 1  D  12  D   4
## 2  W  20  W   7
## 3  W  13  W  12
## 4  W  19  D   1
## 5  W  14  W  17
## 6  W  27  W  21

head(elo_rating)

##   state                  elo_rating
## 1    ON 15445895 / R: 1794   ->1817
## 2    MI 14598900 / R: 1553   ->1663
## 3    MI 14959604 / R: 1384   ->1640
## 4    MI 12616049 / R: 1716   ->1744
## 5    MI 14601533 / R: 1655   ->1690
## 6    OH 15055204 / R: 1686   ->1687

Extract Pre Rating

In the elo rating field I have both the pre and post elo rating. Since I am only interested in each players pre rating I extract it using regular expressions. Then I add the 2 fields to the players table. From now on I will only be using the players table.

elo_rating$elo_rating <- gsub(".*R:\\s+(\\d+).*->.+", "\\1", elo_rating$elo_rating)

players$elo_rating <- elo_rating$elo_rating
players$state <- elo_rating$state

head(players)

##   pair         player_name total_points round1 round2 round3 round4 round5
## 1    1            GARY HUA          6.0  W  39  W  21  W  18  W  14  W   7
## 2    2     DAKSHESH DARURI          6.0  W  63  W  58  L   4  W  17  W  16
## 3    3        ADITYA BAJAJ          6.0  L   8  W  61  W  25  W  21  W  11
## 4    4 PATRICK H SCHILLING          5.5  W  23  D  28  W   2  W  26  D   5
## 5    5          HANSHI ZUO          5.5  W  45  W  37  D  12  D  13  D   4
## 6    6         HANSEN SONG          5.0  W  34  D  29  L  11  W  35  D  10
##   round6 round7 elo_rating state
## 1  D  12  D   4       1794    ON
## 2  W  20  W   7       1553    MI
## 3  W  13  W  12       1384    MI
## 4  W  19  D   1       1716    MI
## 5  W  14  W  17       1655    MI
## 6  W  27  W  21       1686    OH

Functions

Here I define 2 functions. The first, “extract_pair”, uses regular expressions to pull out the pair ID and then filter the plaeyers table to grab the corresponding elo rating. The second grabs the game status (Win, Lose, Draw, etc) and return either 1 or 0 depending on if they actually played a game.

extract_pair <- function(a) {
  extracted_pair <- str_replace(a,"[A-Z]\\s+(\\d+)", "\\1")
  if (extracted_pair == "B" | extracted_pair == "H" | extracted_pair == "U" | extracted_pair == "X") {
    return(0)
  }
  else {
    return(as.integer(filter(players, pair == as.integer(extracted_pair))$elo_rating))
  }
}

extract_status <- function(a) {
  val <- str_replace(a,"([A-Z])\\s+\\d+", "\\1")
  if (val == "W" | val == "L" | val == "D") {
    return(1)
  }
  else {
    return(0)
  }
}

Calculate # of games played and sum all elo from all opponents

In a for loop I iterate through each row in the players data frame and use the two functions I defined on each round. I sum up all the returned results and store them in 2 separate lists. Then, after the for loop I divide the 2 lists to get the average elo of opponents faced.

games_played <- c()
total_elo <- c()

for (i in 1:nrow(players)) {
    total_sum <- extract_status(players[i,]$round1) + extract_status(players[i,]$round2) + extract_status(players[i,]$round3) + extract_status(players[i,]$round4) + extract_status(players[i,]$round5) + extract_status(players[i,]$round6) + extract_status(players[i,]$round7)
  
  games_played <- c(games_played, total_sum)
  
    elo_sum <- extract_pair(players[i,]$round1) + extract_pair(players[i,]$round2) + extract_pair(players[i,]$round3) + extract_pair(players[i,]$round4) + extract_pair(players[i,]$round5) + extract_pair(players[i,]$round6) + extract_pair(players[i,]$round7)
  
  total_elo <- c(total_elo, elo_sum)
}

average_faced_rating <- round(total_elo / games_played)

Final

Lastly, I select the columns I’m interested in and add the field “average_faced_rating” to the table. With my completed table I write the results to a csv.

cleaned_tournament_info <- players %>% select(2,12,3,11)

cleaned_tournament_info$avg_opponent_rating <- average_faced_rating

write.csv(cleaned_tournament_info,"./tournamentResults.csv")

head(cleaned_tournament_info)

##           player_name state total_points elo_rating avg_opponent_rating
## 1            GARY HUA    ON          6.0       1794                1605
## 2     DAKSHESH DARURI    MI          6.0       1553                1469
## 3        ADITYA BAJAJ    MI          6.0       1384                1564
## 4 PATRICK H SCHILLING    MI          5.5       1716                1574
## 5          HANSHI ZUO    MI          5.5       1655                1501
## 6         HANSEN SONG    OH          5.0       1686                1519