Project 1

For Project 1, we’re given a text file with chess tournament results where the information has some structure.

The code below, written as an R Markdown document, generates a CSV file (which could, for example, be imported into a SQL database) with the following information for each player:
Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents

# Fetching chess tournament data
url <- "https://raw.githubusercontent.com/hbedros/DATA607_Proj1/main/chessTextData.txt"

textData <- suppressWarnings(readLines(url))

First, we remove separator lines, empty lines, and the first header line from textData, and store the result in a new vector called filtered_textData.

filtered_textData <- textData[!grepl("----", textData) & nchar(trimws(textData)) > 0 & !grepl("Pair | Player", textData)]

head(filtered_textData)

## [1] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
## [2] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
## [3] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
## [4] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
## [5] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
## [6] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
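The filtering step can be tried on a handful of lines without fetching the file. The sketch below uses a made-up sample_lines vector that only imitates the layout of the tournament file; the three conditions are the same as in the code above. Note that in a regular expression the | in "Pair | Player" acts as an alternation, so the pattern matches either "Pair " or " Player", which is still enough to catch the first header line.

```r
# Made-up sample imitating the file layout (an assumption, not the real data)
sample_lines <- c(
  "-----------------------------------------",
  " Pair | Player Name                     |Total|",
  " Num  | USCF ID / Rtg (Pre->Post)       | Pts |",
  "-----------------------------------------",
  "    1 | GARY HUA                        |6.0  |",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |",
  ""
)

keep <- !grepl("----", sample_lines) &     # drop dashed separator lines
  nchar(trimws(sample_lines)) > 0 &        # drop empty lines
  !grepl("Pair | Player", sample_lines)    # drop the first header line

filtered_sample <- sample_lines[keep]
print(filtered_sample)
```

The second header line (" Num  | USCF ID ...") survives the filter, which is why the loops that follow start at line 2.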

Next, we build the player_df data frame from the raw lines with a loop over filtered_textData. Line 1 is the remaining column header, so player records start at line 2: each even-numbered line holds a player’s ID and name, and the line right after it holds their state and ratings.

The loop is using these regex patterns:

\\|: Splits data at the pipe (|) character.
/ R:: Splits data at the pattern / R: .
P.*$: Removes ‘P’ and everything after it in a string.
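To see the three patterns working together, here is a minimal sketch that wraps them in a helper. The name extract_pre_rating is hypothetical (it does not appear in the project code), and the second input line is a made-up example of a provisional rating written in the same format.

```r
# Hypothetical helper applying the same three patterns to one "state" line
extract_pre_rating <- function(line) {
  fields <- trimws(unlist(strsplit(line, "\\|")))    # split at the pipes
  after_r <- strsplit(fields[2], "/ R: ")[[1]][2]     # text after "/ R: "
  before_arrow <- strsplit(after_r, " ->")[[1]][1]    # text before " ->", when present
  as.integer(gsub("P.*$", "", before_arrow))          # drop 'P' and what follows
}

# Regular rating: the split at " ->" isolates "1794"
regular <- extract_pre_rating("   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |")

# Provisional rating (made-up line): no space before "->", so gsub() does the trimming
provisional <- extract_pre_rating("   MI | 10000001 / R: 1641P17->1657P24  |N:3  |W    |")

print(c(regular, provisional))
```

The second case shows why the P.*$ pattern is needed: when the rating carries a provisional suffix, the " ->" split finds nothing to split on, and removing everything from 'P' onward is what leaves the bare number.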

# Initializing an empty data frame
player_df <- data.frame(
  PlayerID = integer(),
  Player = character(),
  State = character(),
  PreRating = integer(),
  stringsAsFactors = FALSE
)

for (i in seq(2, length(filtered_textData) - 1, by = 2)) {
  # Splitting each line by the delimiter "|"
  line1 <- trimws(unlist(strsplit(filtered_textData[i], "\\|")))
  line2 <- trimws(unlist(strsplit(filtered_textData[i+1], "\\|")))
  
  playerID <- as.integer(line1[1])
  player <- line1[2]
  state <- line2[1]
  preRatingString <- strsplit(strsplit(line2[2], "/ R: ")[[1]][2], " ->")[[1]][1]
  # Removing 'P' and any characters or digits following it
  preRatingString <- gsub("P.*$", "", preRatingString)
  preRating <- as.integer(preRatingString)
  
  # Binding player's data to the data frame
  player_df <- rbind(player_df, data.frame(
    PlayerID = playerID,
    Player = player,
    State = state,
    PreRating = preRating
  ))
}

head(player_df)

##   PlayerID              Player State PreRating
## 1        1            GARY HUA    ON      1794
## 2        2     DAKSHESH DARURI    MI      1553
## 3        3        ADITYA BAJAJ    MI      1384
## 4        4 PATRICK H SCHILLING    MI      1716
## 5        5          HANSHI ZUO    MI      1655
## 6        6         HANSEN SONG    OH      1686
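Calling rbind() inside a loop copies the data frame on every iteration. As an alternative sketch (not the approach used above), the rows can be collected with lapply() and bound once. The sample_lines vector is a small stand-in for filtered_textData, and the single sub() capture is a substitution of mine for the two strsplit() calls.

```r
# Small stand-in for filtered_textData (first lines taken from the output shown earlier)
sample_lines <- c(
  " Num  | USCF ID / Rtg (Pre->Post)       | Pts |",
  "    1 | GARY HUA                        |6.0  |",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |",
  "    2 | DAKSHESH DARURI                 |6.0  |",
  "   MI | 14598900 / R: 1553   ->1663     |N:2  |"
)

idx <- seq(2, length(sample_lines) - 1, by = 2)   # even-numbered lines start a record

rows <- lapply(idx, function(i) {
  line1 <- trimws(unlist(strsplit(sample_lines[i], "\\|")))
  line2 <- trimws(unlist(strsplit(sample_lines[i + 1], "\\|")))
  # Capture the digits that follow "/ R:" (stops at a space, 'P', or "->")
  rating <- sub("^.*/ R: *([0-9]+).*$", "\\1", line2[2])
  data.frame(PlayerID = as.integer(line1[1]), Player = line1[2],
             State = line2[1], PreRating = as.integer(rating),
             stringsAsFactors = FALSE)
})

sample_players <- do.call(rbind, rows)   # bind all rows in one step
print(sample_players)
```

For a file of this size the difference is negligible, so this is a matter of style rather than a required change.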

Next, we set up schedule_df before building the final table. A loop over the even-numbered lines of filtered_textData reads each player’s ID, and an inner loop over fields 4 to 10 (the seven rounds) extracts each match result and opponent ID.

The loop uses these regex patterns:

\\|: Splits data at the pipe (|) character.
[0-9]: Checks if there’s any digit in match_data.
[^0-9]: Matches any character that’s not a digit.
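As an illustration of how one round cell is decoded, the sketch below applies the same patterns to three sample cells. The helper name parse_round is hypothetical, and the scoring rule mirrors the one used in the loop that follows (W scores 1, D or H scores 0.5, anything else 0).

```r
# Hypothetical helper decoding one trimmed round cell such as "W  39"
parse_round <- function(cell) {
  result <- substr(cell, 1, 1)                       # first character is the result code
  opponent_id <- if (grepl("[0-9]", cell)) {
    as.integer(gsub("[^0-9]", "", cell))             # keep only the digits
  } else {
    0L                                               # no opponent in this round
  }
  points <- if (result == "W") 1 else if (result %in% c("D", "H")) 0.5 else 0
  data.frame(Result = result, OpponentID = opponent_id, Points = points,
             stringsAsFactors = FALSE)
}

rounds <- rbind(parse_round("W  39"), parse_round("D  12"), parse_round("H"))
print(rounds)
```

The third cell has no digits, so its OpponentID falls back to 0, which is the marker the later steps use for a round without an opponent.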

# Initializing an empty data frame
schedule_df <- data.frame(
  PlayerID = integer(),
  OpponentID = integer(),
  OpponentPreRating = integer(),
  Result = character(),
  Points = numeric(),
  stringsAsFactors = FALSE
)


for (i in seq(2, length(filtered_textData) - 1, by = 2)) {
  line1 <- trimws(unlist(strsplit(filtered_textData[i], "\\|")))
  playerID <- as.integer(line1[1])
  
  for (j in 4:10) {
    match_data <- line1[j]
    result <- substr(match_data, 1, 1)  # First char is the result
    
    # Extracting opponentID if present, else give it value 0
    if (grepl("[0-9]", match_data)) {
      opponentID <- as.integer(gsub("[^0-9]", "", match_data))
    } else {
      opponentID <- 0
    }
    
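    # Assigning points based on match result: W = 1, D or H = 0.5, any other code = 0
    # (the player's reported total is also available in line1[3])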
    if (result == "W") {
      points <- 1
    } else if (result == "D" || result == "H") {
      points <- 0.5
    } else {
      points <- 0
    }
    
    # Fetching opponent's PreRating, default to 0 if not found
    opponentPreRating <- ifelse(length(player_df$PreRating[player_df$PlayerID == opponentID]) > 0,
                                player_df$PreRating[player_df$PlayerID == opponentID],
                                0)
    
    # Adding match data to `schedule_df`
    schedule_df <- rbind(schedule_df, data.frame(
      PlayerID = playerID,
      OpponentID = opponentID,
      OpponentPreRating = opponentPreRating,
      Result = result,
      Points = points
    ))
  }
}

head(schedule_df)

##   PlayerID OpponentID OpponentPreRating Result Points
## 1        1         39              1436      W    1.0
## 2        1         21              1563      W    1.0
## 3        1         18              1600      W    1.0
## 4        1         14              1610      W    1.0
## 5        1          7              1649      W    1.0
## 6        1         12              1663      D    0.5

Finally, creating the final dataframe final_df with the following criteria:

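From player_df, retrieving the player’s ID, name, state, and pre-rating.
From schedule_df, creating a games_played variable that counts the instances of ‘W’, ‘D’, and ‘L’ in the Result column for each player ID.
From schedule_df, creating a total_points variable (the TotalPts column) that is the sum of the points for each player ID.
From schedule_df, generating an avg_opponent_rating variable (the AvgOponentRating column), calculated as the sum of the opponent pre-ratings divided by the number of rounds for each player. Note that the code divides by all seven rounds, so a round without an opponent (stored with rating 0) lowers the average; dividing by games_played instead would restrict the average to games actually played.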
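The difference between dividing by all rounds and dividing by games played can be seen on a toy schedule. The sketch below uses made-up data for two hypothetical players over three rounds, with rating 0 marking a round without an opponent; it is an illustration of the alternative, not the code used in this report.

```r
# Made-up schedule: player 2 has only one real game (rounds 2 and 3 have no opponent)
toy_schedule <- data.frame(
  PlayerID = c(1, 1, 1, 2, 2, 2),
  OpponentID = c(2, 3, 4, 1, 0, 0),
  OpponentPreRating = c(1500, 1600, 1700, 1400, 0, 0),
  Result = c("W", "D", "L", "L", "H", "U"),
  stringsAsFactors = FALSE
)

# Average over every round, as the aggregate() call in this report does
avg_all <- aggregate(OpponentPreRating ~ PlayerID, data = toy_schedule,
                     FUN = function(x) round(sum(x) / length(x)))

# Average over games actually played (results W, D or L)
played <- toy_schedule[toy_schedule$Result %in% c("W", "D", "L"), ]
avg_played <- aggregate(OpponentPreRating ~ PlayerID, data = played,
                        FUN = function(x) round(mean(x)))

print(avg_all)     # player 2: 1400 / 3 rounds, rounded to 467
print(avg_played)  # player 2: 1400, the rating of the one opponent faced
```

Player 1 played every round, so both versions agree (1600); only players with unplayed rounds are affected.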

# Fetching the already existing data
final_df <- player_df[, c("PlayerID", "Player", "State", "PreRating")]

# Calculating games played (results W, D or L) for each Player ID
# (kept for reference; not merged into final_df below)
games_played <- as.data.frame(table(schedule_df$PlayerID[schedule_df$Result %in% c("W", "D", "L")]))

# Calculating total points for each Player ID
total_points <- aggregate(Points ~ PlayerID, data = schedule_df, sum)
colnames(total_points)[2] <- "TotalPts"
final_df <- merge(final_df, total_points, by = "PlayerID")

# Calculating the average opponent rating for each Player ID
# (length(x) counts all 7 rounds, so rounds without an opponent contribute a rating of 0)
avg_opponent_rating <- aggregate(OpponentPreRating ~ PlayerID, data = schedule_df, FUN = function(x) round(sum(x)/length(x)))
colnames(avg_opponent_rating)[2] <- "AvgOponentRating"
final_df <- merge(final_df, avg_opponent_rating, by = "PlayerID")

# This is the final layout for the chess tournament data
head(final_df)

##   PlayerID              Player State PreRating TotalPts AvgOponentRating
## 1        1            GARY HUA    ON      1794      6.0             1605
## 2        2     DAKSHESH DARURI    MI      1553      6.0             1469
## 3        3        ADITYA BAJAJ    MI      1384      6.0             1564
## 4        4 PATRICK H SCHILLING    MI      1716      5.5             1574
## 5        5          HANSHI ZUO    MI      1655      5.5             1501
## 6        6         HANSEN SONG    OH      1686      5.0             1519

This chunk writes final_df to a CSV file with the write.csv() function so the output can be loaded into a SQL database:

# Specify the filename and path
file_path <- "chess_tournament_data.csv"

# Write the dataframe to the CSV
write.csv(final_df, file_path, row.names = FALSE)

# Notify the user about the location
message("The CSV file was created at: ", getwd(), "/", file_path)

Extra Credit - ELO calculations

Based on the difference in ratings between the chess players and each of their opponents in our Project 1 tournament, calculate each player’s expected score (e.g. 4.3) and the difference from their actual score (e.g. 4.0). List the five players who most overperformed relative to their expected score, and the five players who most underperformed relative to their expected score.

You’ll find some small differences between implementations of ELO formulas. You may use any reasonably-sourced formula, but please cite your source.
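For reference, the expected score in the prompt above can be sketched in base R with the standard Elo logistic formula on a 400-point scale. The function name expected_score is hypothetical; to my understanding, elo.prob() from the elo package computes the same quantity with its default settings, but that is an assumption worth checking against the package documentation.

```r
# Expected score of player A against player B (standard Elo formula, scale 400)
expected_score <- function(rating_a, rating_b) {
  1 / (1 + 10^((rating_b - rating_a) / 400))
}

even_match <- expected_score(1500, 1500)   # equal ratings give 0.5
favourite  <- expected_score(1794, 1436)   # higher-rated player
underdog   <- expected_score(1436, 1794)   # same pairing seen from the other side

# A round stored with opponent rating 0 yields an expected score of almost 1
no_opponent <- expected_score(1530, 0)

print(c(even_match, favourite, underdog, no_opponent))
```

The two sides of a pairing always sum to 1. The last value shows why a round without an opponent should be left out before summing expected scores: a placeholder rating of 0 adds nearly a full expected point for a game that was never played.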

library(elo)

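# Using the elo.prob() function for expected score
# (player_df$PreRating[schedule_df$PlayerID] indexes by position, which assumes
#  PlayerID equals the row number in player_df)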
# Reference for the ELO computation: https://cran.r-project.org/web/packages/elo/vignettes/intro.html
schedule_df$ExpectedScore <- mapply(function(player_rating, opponent_rating) {
  elo.prob(player_rating, opponent_rating)
}, player_df$PreRating[schedule_df$PlayerID], schedule_df$OpponentPreRating)

# Aggregating the expected scores for each player
expected_scores <- aggregate(ExpectedScore ~ PlayerID, data = schedule_df, sum)
colnames(expected_scores)[2] <- "ExpectedTotalPts"
final_df <- merge(final_df, expected_scores, by = "PlayerID")

# Calculating the difference between actual and expected scores
final_df$ScoreDifference <- final_df$TotalPts - final_df$ExpectedTotalPts

# Identifying top performers and underperformers
overperformers <- final_df[order(-final_df$ScoreDifference), ][1:5, ]
underperformers <- final_df[order(final_df$ScoreDifference), ][1:5, ]

overperformers

##    PlayerID                   Player State PreRating TotalPts AvgOponentRating
## 3         3             ADITYA BAJAJ    MI      1384      6.0             1564
## 15       15   ZACHARY JAMES HOUGHTON    MI      1220      4.5             1484
## 10       10                ANVIT RAO    MI      1365      5.0             1554
## 46       46 JACOB ALEXANDER LAVALLEY    MI       377      3.0             1358
## 9         9              STEFANO LEE    ON      1411      5.0             1523
##    ExpectedTotalPts ScoreDifference
## 3        1.94508791        4.054912
## 15       1.37330887        3.126691
## 10       1.94485405        3.055146
## 46       0.04324981        2.956750
## 9        2.28654888        2.713451

underperformers

##    PlayerID              Player State PreRating TotalPts AvgOponentRating
## 62       62       ASHWIN BALAJI    MI      1530      1.0              169
## 53       53       JOSE C YBARRA    MI      1393      2.0              577
## 54       54         LARRY HODGE    MI      1270      1.0             1034
## 41       41 KYLE WILLIAM MURPHY    MI      1403      2.0              713
## 25       25    LOREN SCHWIEBERT    MI      1745      3.5             1363
##    ExpectedTotalPts ScoreDifference
## 62         6.877807       -5.877807
## 53         5.718298       -3.718298
## 54         4.398246       -3.398246
## 41         5.363915       -3.363915
## 25         6.275650       -2.775650

Conclusion:

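The code above fetches the raw tournament file, filters out separator and header lines, and parses the remaining lines into per-player information (player_df) and per-round match details (schedule_df). These are combined into final_df and exported as a CSV file.

Using elo.prob() from the elo package (see the CRAN vignette cited above), the code calculates each player’s expected score from the rating difference with each opponent, and lists the players who most overperformed and underperformed relative to that expectation. One caveat applies: rounds without an opponent are stored with an opponent rating of 0, so each such round adds an expected score of almost 1 and lowers AvgOponentRating. This explains values such as the average opponent rating of 169 and the expected total of 6.88 for the top underperformer, so the underperformers list should be read with that in mind.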

Haig Bedros

2023-09-24