Project 1: Data Analysis: Chess Tournament

Reading the Data

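#loading the packages used below: dplyr (rename), tidyr (separate), stringr (str_extract)
library(tidyverse)
URL <- "https://raw.githubusercontent.com/okhaimova/DATA-607/master/Project1/tournamentinfo"
tournamenttemp <- read.csv(URL, header = FALSE, sep = "|")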

Cleaning up the Data

After reading the data, we have to clean it up. I removed the rows of dashes and then split the data into two data frames, because each player’s record spans two lines. The first consists of the pair number, player name, total points, and round results. The second consists of the state, USCF ID, pre-rating, post-rating, and additional round information. They are labeled odd and even respectively. I also split some variables further for analysis and converted several of them to numeric.

#removing the lines with dashes
tournamenttemp <- tournamenttemp[!grepl("---", tournamenttemp[ ,1]), ]

#removing the first 2 rows (the column header lines)
tournamenttemp <- tournamenttemp[-c(1:2), ]

#removing V11 column
tournamenttemp <- tournamenttemp[ ,-c(11)]

#splitting data frame into two
odd <- tournamenttemp[seq(1, nrow(tournamenttemp), 2), ]
even <- tournamenttemp[seq(2, nrow(tournamenttemp), 2), ]

#renaming variables in odd
colnames(odd) <- c("pair", "name", "points", 1:7)

#clean up the even data frame
#removing + renaming variables
even <- even[ ,-c(3:10)]
even <- rename(even, "state" = "V1")

#splitting columns
even <- separate(even, 2, c("uscfID", "rating" ), "/")
even <- separate(even, "rating", c("pre-rating", "post-rating"), "->")

#extracting the digits from the ratings so they can be converted to numeric
even$`pre-rating` <- str_extract(even$`pre-rating`, "\\d+\\d")
even$`post-rating` <- str_extract(even$`post-rating`, "\\d+\\d")

#combining data frames
tournament <- cbind(odd, even)

#extracting the digits so the columns can be made numeric
#the dot in the points pattern is escaped so it matches a literal decimal point
tournament$pair <- str_extract(tournament$pair, "\\d+")
tournament$points <- str_extract(tournament$points, "\\d\\.\\d")

#changing class
tournament$`pre-rating` <- as.numeric(tournament$`pre-rating`)
tournament$points <- round(as.double(tournament$points), 2)
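
To illustrate the rating extraction above on individual values, here is a minimal sketch in base R. The two sample strings are hypothetical, made up to resemble the "USCF ID / R: pre -> post" layout, and the helper first_digits is a name introduced only for this example. It uses base regular expressions in place of separate() and str_extract() so it runs without extra packages.

```r
# Hypothetical sample fields (not taken from the data file)
fields <- c(" 12345678 / R: 1500   ->1525     ",
            " 87654321 / R: 1400P12->1430P18  ")

# Split on "/" and then on "->", mirroring the two separate() calls
after_slash <- sub("^[^/]*/", "", fields)
pre_part    <- sub("->.*$", "", after_slash)
post_part   <- sub("^.*->", "", after_slash)

# Base R counterpart of str_extract(x, "\\d+\\d"):
# take the first run of two or more digits and convert it to numeric
first_digits <- function(x) {
  as.numeric(regmatches(x, regexpr("[0-9]{2,}", x)))
}

pre_rating  <- first_digits(pre_part)
post_rating <- first_digits(post_part)

print(data.frame(pre_rating, post_rating))
```

Because only the first run of digits is kept, any trailing characters after the rating (such as the "P12" in the second sample) are dropped.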

To match each round to the opponent’s pre-rating, I extracted the opponents’ pair numbers from the round columns and stored them in a matrix. A for loop then replaced each pair number in the matrix with that opponent’s pre-rating. Afterwards, I converted the matrix to numeric.

#assigning the round columns its own variable
rounds <- tournament[4:10]

#extracting the digits from the round columns and putting them in a matrix
test <- matrix(str_extract(unlist(rounds), "\\d+"), ncol = 7)

#for loop
#it goes through the `test` matrix and replaces the values with the opponents' pre-rating
for (row in 1:nrow(test))
{
  for (col in 1:ncol(test))
  {
    i <- test[row, col]
    
    if(!is.na(i))
    {
      test[row, col] <- tournament$`pre-rating`[tournament$pair == i]
    }
  }
}

#changing the character matrix into numeric
class(test) <- "numeric"

#assigning the `test` matrix back to `rounds` (which is now a numeric matrix)
rounds <- test
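
The nested loop above can also be written as a single vectorized lookup. The following is only a sketch on hypothetical toy data (three players, two rounds), not the code used for the results in this report. It relies on base R's match(), which returns the position of each opponent's pair number in the pair vector, or NA where no game was played.

```r
# Toy data (hypothetical): three players and their pre-ratings
pair       <- c("1", "2", "3")
pre_rating <- c(1500, 1400, 1300)

# Opponent pair numbers per round; NA marks a round that was not played
opponents <- matrix(c("2", "3",
                      "1", NA,
                      NA,  "1"),
                    ncol = 2, byrow = TRUE)

# match() gives the row position of each opponent in `pair`;
# indexing pre_rating by those positions yields the opponents' ratings.
# matrix() restores the original shape (both steps work column by column).
opp_ratings <- matrix(pre_rating[match(opponents, pair)],
                      nrow = nrow(opponents))

print(opp_ratings)
```

The result is already numeric, so no class conversion is needed afterwards, and unplayed rounds stay NA.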

Calculations

I then calculated the average of each player’s opponents’ pre-ratings. NA values are ignored, so the average is taken only over the rounds the player actually played.

#calculating the average of the opponents' pre-ratings
tournament$avg <- round(rowMeans(rounds, na.rm = TRUE))
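
As a small illustration of what na.rm = TRUE does here, the sketch below uses a hypothetical two-player matrix of opponent ratings. The second player has only one played round, so the mean is taken over that single value and not over all three columns.

```r
# Hypothetical opponent ratings: rows are players, columns are rounds
opp <- matrix(c(1400, 1300, 1600,
                1500,   NA,   NA),
              nrow = 2, byrow = TRUE)

# Mean over played rounds only; NA entries are dropped before averaging
avg_played <- round(rowMeans(opp, na.rm = TRUE))

# Number of rounds each player actually played
games_played <- rowSums(!is.na(opp))

print(avg_played)
print(games_played)
```

Without na.rm = TRUE, the second player's average would be NA.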

Output

I then created the toutput data frame with only the name, state, points, pre-rating, and avg variables. Afterwards, I wrote that data frame to a CSV file.

#reordering columns and creating output data frame
toutput <- tournament[c("name", "state", "points", "pre-rating", "avg")]

head(toutput)

##                                 name  state points pre-rating  avg
## 5   GARY HUA                            ON     6.0       1794 1605
## 8   DAKSHESH DARURI                     MI     6.0       1553 1469
## 11  ADITYA BAJAJ                        MI     6.0       1384 1564
## 14  PATRICK H SCHILLING                 MI     5.5       1716 1574
## 17  HANSHI ZUO                          MI     5.5       1655 1501
## 20  HANSEN SONG                         OH     5.0       1686 1519

#creating csv file
write.csv(toutput, "tournament.csv", row.names = FALSE)

Calculating the players’ expected scores

Using the Elo formula, I determined each player’s expected result (number of points) based on his or her pre-tournament rating and the average pre-tournament rating of the player’s opponents. The formula gives the expected score per game (labeled win_prob below), which I multiplied by the number of rounds the player actually played to get the expected number of points. The diff column is the absolute difference between the actual and expected points.

toutput$win_prob <- round((1 / (1 + 10 ^ ((toutput$avg - toutput$`pre-rating`) / 400))), 4)

toutput$predicted_score <- round(toutput$win_prob * rowSums(!is.na(rounds)), 4)

toutput$diff <- round(abs(toutput$points - toutput$predicted_score), 4)

head(toutput[order(toutput$diff),], 5)

##                                  name  state points pre-rating  avg win_prob
## 149  MICHAEL J MARTIN                    MI     2.5       1291 1286   0.5072
## 20   HANSEN SONG                         OH     5.0       1686 1519   0.7234
## 44   BRADLEY SHAW                        MI     4.5       1610 1515   0.6334
## 122  FOREST ZHANG                        MI     3.0       1348 1391   0.4384
## 50   MIKE NIKITIN                        MI     4.0       1604 1386   0.7781
##     predicted_score   diff
## 149          2.5360 0.0360
## 20           5.0638 0.0638
## 44           4.4338 0.0662
## 122          3.0688 0.0688
## 50           3.8905 0.1095
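
The same formula can be wrapped in a small function and checked against the first row of the output above (pre-rating 1291, opponent average 1286, predicted score 2.5360). This is a sketch. The function name elo_expected is introduced here for illustration, and the five played rounds are inferred from the predicted score and win_prob in that row.

```r
# Expected score per game for a player rated `rating`
# against opposition with average rating `opp_avg`
elo_expected <- function(rating, opp_avg) {
  1 / (1 + 10 ^ ((opp_avg - rating) / 400))
}

# First row of the output above: pre-rating 1291, avg 1286
p <- elo_expected(1291, 1286)
win_prob <- round(p, 4)

# Expected points over five played rounds (assumed from the table values)
predicted <- round(win_prob * 5, 4)

print(c(win_prob = win_prob, predicted = predicted))
```

Two equally rated players each get an expected score of 0.5, and the expected scores of the two sides always add up to 1.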

Analysis of Expected Scores

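According to the calculations, Michael J Martin’s actual score (2.5) was closest to his expected score (2.536), followed by Hansen Song and Bradley Shaw. Because diff is an absolute value, this ranking shows who performed most nearly as predicted, not who most outperformed their rating.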
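
To see who scored above or below expectation, a signed difference can be used in place of the absolute one. The sketch below re-enters the five rows shown above by hand. The column name signed_diff is hypothetical and not part of the original analysis.

```r
# The five rows from the output above, entered by hand
top5 <- data.frame(
  name = c("MICHAEL J MARTIN", "HANSEN SONG", "BRADLEY SHAW",
           "FOREST ZHANG", "MIKE NIKITIN"),
  points = c(2.5, 5.0, 4.5, 3.0, 4.0),
  predicted_score = c(2.5360, 5.0638, 4.4338, 3.0688, 3.8905)
)

# Positive values mean the player scored more than expected
top5$signed_diff <- round(top5$points - top5$predicted_score, 4)

# Largest over-performance first
print(top5[order(-top5$signed_diff), c("name", "signed_diff")])
```

Among these five players, only Mike Nikitin and Bradley Shaw scored above their expected result. Applying the same signed difference to the full toutput data frame would rank all players by over-performance.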