Project 1

For Project 1, we’re given a text file with chess tournament results where the information has some structure.

The below code is creating an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:
Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents

# Fetching chess tournament data
url <- "https://raw.githubusercontent.com/hbedros/DATA607_Proj1/main/chessTextData.txt"

textData <- suppressWarnings(readLines(url))

First, we’re removing separator lines unnecessary headers, and unwanted elements from textData and creating a new vector called filtered_textData.

filtered_textData <- textData[!grepl("----", textData) & nchar(trimws(textData)) > 0 & !grepl("Pair | Player", textData)]

head(filtered_textData)

## [1] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
## [2] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
## [3] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
## [4] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
## [5] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
## [6] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"

Next, making a player_df dataframe from raw data using a loop. This loop looks at filtered_textData. Even-numbered lines give player’s ID and name. The line right after gives their state and rating.

The loop is using these regex patterns:

\\|: Splits data at the pipe (|) character.
/ R:: Splits data at the pattern / R: .
P.*$: Removes ‘P’ and everything after it in a string.

# Initializing an empty data frame
player_df <- data.frame(
  PlayerID = integer(),
  Player = character(),
  State = character(),
  PreRating = integer(),
  stringsAsFactors = FALSE
)

for (i in seq(2, length(filtered_textData) - 1, by = 2)) {
  # Splitting each line by the delimiter "|"
  line1 <- trimws(unlist(strsplit(filtered_textData[i], "\\|")))
  line2 <- trimws(unlist(strsplit(filtered_textData[i+1], "\\|")))
  
  playerID <- as.integer(line1[1])
  player <- line1[2]
  state <- line2[1]
  preRatingString <- strsplit(strsplit(line2[2], "/ R: ")[[1]][2], " ->")[[1]][1]
  # Removing 'P' and any characters or digits following it
  preRatingString <- gsub("P.*$", "", preRatingString)
  preRating <- as.integer(preRatingString)
  
  # Binding player's data to the data frame
  player_df <- rbind(player_df, data.frame(
    PlayerID = playerID,
    Player = player,
    State = state,
    PreRating = preRating
  ))
}

head(player_df)

##   PlayerID              Player State PreRating
## 1        1            GARY HUA    ON      1794
## 2        2     DAKSHESH DARURI    MI      1553
## 3        3        ADITYA BAJAJ    MI      1384
## 4        4 PATRICK H SCHILLING    MI      1716
## 5        5          HANSHI ZUO    MI      1655
## 6        6         HANSEN SONG    OH      1686

Next, setting up schedule_df before making the final table. I’m using a loop to get match details from filtered_textData. Even-numbered lines have player info. Another loop inside this gets the match results and opponent info.

The loop uses these regex patterns:

\\|: Splits data at the pipe (|) character.
[0-9]: Checks if there’s any digit in match_data.
[^0-9]: Matches any character that’s not a digit.

# Initializing an empty data frame
schedule_df <- data.frame(
  PlayerID = integer(),
  OponentID = integer(),
  OponentRating = integer(),
  Result = character(),
  Points = integer(),
  stringsAsFactors = FALSE
)


for (i in seq(2, length(filtered_textData) - 1, by = 2)) {
  line1 <- trimws(unlist(strsplit(filtered_textData[i], "\\|")))
  playerID <- as.integer(line1[1])
  
  for (j in 4:10) {
    match_data <- line1[j]
    result <- substr(match_data, 1, 1)  # First char is the result
    
    # Extracting opponentID if present, else give it value 0
    if (grepl("[0-9]", match_data)) {
      opponentID <- as.integer(gsub("[^0-9]", "", match_data))
    } else {
      opponentID <- 0
    }
    
    # Assigning points based on match result
    if (result == "W") {
      points <- 1
    } else if (result == "D" || result == "H") {
      points <- 0.5
    } else {
      points <- 0
    }
    
    # Fetching opponent's PreRating, default to 0 if not found
    opponentPreRating <- ifelse(length(player_df$PreRating[player_df$PlayerID == opponentID]) > 0,
                                player_df$PreRating[player_df$PlayerID == opponentID],
                                0)
    
    # Adding match data to `schedule_df`
    schedule_df <- rbind(schedule_df, data.frame(
      PlayerID = playerID,
      OpponentID = opponentID,
      OpponentPreRating = opponentPreRating,
      Result = result,
      Points = points
    ))
  }
}

head(schedule_df)

##   PlayerID OpponentID OpponentPreRating Result Points
## 1        1         39              1436      W    1.0
## 2        1         21              1563      W    1.0
## 3        1         18              1600      W    1.0
## 4        1         14              1610      W    1.0
## 5        1          7              1649      W    1.0
## 6        1         12              1663      D    0.5

Finally, creating the final dataframe final_df with the following criteria:

From player_df, retrieving the player’ ID, name, state, and pre-rating.
From schedule_df, creating a games_played variable that counts the instances of ‘W’, ‘D’, and ‘L’ in the results column for each player ID.
From schedule_df, creating a total_scores variable that represents the sum of the points for each player ID.
From schedule_df, generating an avg_opp_rate variable that is being calculated as the sum of the opponent rating divided by the games_played for each player.

# Fetching the already existing data
final_df <- player_df[, c("PlayerID", "Player", "State", "PreRating")]

# Calculating games played for each Player ID
games_played <- as.data.frame(table(schedule_df$PlayerID[schedule_df$Result %in% c("W", "D", "L")]))

# Calculating total points for each Player ID
total_points <- aggregate(Points ~ PlayerID, data = schedule_df, sum)
colnames(total_points)[2] <- "TotalPts"
final_df <- merge(final_df, total_points, by = "PlayerID")

# Calculating the average opponent rating for each Player ID
avg_opponent_rating <- aggregate(OpponentPreRating ~ PlayerID, data = schedule_df, FUN = function(x) round(sum(x)/length(x)))
colnames(avg_opponent_rating)[2] <- "AvgOponentRating"
final_df <- merge(final_df, avg_opponent_rating, by = "PlayerID")

# This is the final layout for the chess tournament data
head(final_df)

##   PlayerID              Player State PreRating TotalPts AvgOponentRating
## 1        1            GARY HUA    ON      1794      6.0             1605
## 2        2     DAKSHESH DARURI    MI      1553      6.0             1469
## 3        3        ADITYA BAJAJ    MI      1384      6.0             1564
## 4        4 PATRICK H SCHILLING    MI      1716      5.5             1574
## 5        5          HANSHI ZUO    MI      1655      5.5             1501
## 6        6         HANSEN SONG    OH      1686      5.0             1519

This piece generates a CSV file using the write.csv() function so the output can be loaded to a sql database:

# Specify the filename and path
file_path <- "chess_tournament_data.csv"

# Write the dataframe to the CSV
write.csv(final_df, file_path, row.names = FALSE)

# Notify the user about the location
message("The CSV file was created at: ", getwd(), "/", file_path)

Project 1

Haig Bedros

2023-09-24