DATA607_Project1

Overview

The purpose of this document is to systematically import a text file containing data from a chess tournament cross table. The below R code uses string manipulations and extractions in order to clean up the text file and extract relevant information formatted in a clean, reusable, and exported dataframe with the goal of writing the newly created table to a CSV file for easy use.

library(stringr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Import and Clean

First you’ll need to import the raw data from my github repo. You can find a link to the data here: <https://raw.githubusercontent.com/awrubes/DATA607_Project1/refs/heads/main/tournamentinfo.txt?token=GHSAT0AAAAAACW23YBZCXZQRWYMYX2AVM46ZXQO5ZA>. I find it easier to read in the file as lines in a string vector so that we can use stringr functions to extract specific pieces of data.

#import chess data as txt file
chess_data <- readLines("https://raw.githubusercontent.com/awrubes/DATA607_Project1/main/tournamentinfo.txt")

## Warning in
## readLines("https://raw.githubusercontent.com/awrubes/DATA607_Project1/main/tournamentinfo.txt"):
## incomplete final line found on
## 'https://raw.githubusercontent.com/awrubes/DATA607_Project1/main/tournamentinfo.txt'

Next, we’ll want to clean up the text file so that it’s easier to parse. We’ll do this by getting rid of the header rows and removing the “—” row separators. This way the only text in the file will be the data we want to selectively extract.

#remove the header lines
chess_data <- chess_data[-(1:3)]
head(chess_data)

## [1] "-----------------------------------------------------------------------------------------"
## [2] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
## [3] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [4] "-----------------------------------------------------------------------------------------"
## [5] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
## [6] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"

#removes the "-" lines separating the rows of player data
chess_data <- chess_data[!grepl("^-+$", chess_data)]
head(chess_data)

## [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
## [2] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [3] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
## [4] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [5] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
## [6] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"

Initializing and Populating the DataFrame

Before we begin extracting values from the text file, we’ll want to initialize the dataframe with appropriate headers and cell value types.

#creates the dataframe with desired columns
chess_dataFrame <- data.frame(
  PlayerID = numeric(),
  PlayerName = character(),
  State = character(),
  TotalPoints = numeric(),
  PreRating = numeric(),
  GameCount = character(),
  AvgOpponentRating = numeric(),
  stringsAsFactors = FALSE
)

Once the dataframe is initialized, now we can start extracting data from the text file and adding it to the appropriate cells. In this particular case, I took a more cautious approach to extracting the data (which you’ll see momentarily). Instead of extracting everything simultaneously in a single loop indexing through vector rows, I extracted data one piece at a time, which allowed me to appropriately debug at every step to ensure the correct data was being extracted. Understood, this method is by no means the most computationally efficient.

You’ll also notice that for the total games and opponent ID’s, I extract both together and include in a single cell as values to parse later on, once the foundations of the dataframe are set.

#sequentially move through file and add player ID's dynamically
start_index <- 1

#extract player names
for (i in seq(1, length(chess_data), by = 2)) {
  
  # Get player info from the current line
  player_info <- chess_data[i]
  
  playerID <- start_index
  
  # Extract the names of players
  player_names <- paste(unlist(str_extract(player_info, "(?<=\\|)\\s*[A-Z\\s]+(?=\\s*\\|)")))
  player_names <- trimws(player_names)
  
  # Add ID and names to the data frame, with placeholders for others
  chess_dataFrame <- rbind(chess_dataFrame, data.frame(
    PlayerID = playerID,
    PlayerName = player_names,
    State = NA,  # Placeholder
    TotalPoints = NA,  # Placeholder
    PreRating = NA,  # Placeholder
    GameCount = NA,  # Placeholder
    AvgOpponentRating = NA,  # Placeholder
    stringsAsFactors = FALSE
  ))
  
  # Increment ID
  start_index <- start_index + 1
  
}

#extract states
start_index <- 1

for (i in seq(1, length(chess_data),by=2)) {
  
  #split player info into lines
  player_info <- chess_data[i]
  player_rating_info <- chess_data[i+1]
  
  #extracts the states, replace NA with "na" for easier differentiation
  player_state <- paste(str_extract(player_rating_info, "^\\s*[A-Z]{2}\\b"))
  player_state <- trimws(player_state)
  
  chess_dataFrame$State[start_index] <- player_state
  start_index <- start_index + 1
  
}

#extract total points

start_index <- 1

for (i in seq(1, length(chess_data),by=2)) {
  
  #split player info into lines
  player_info <- chess_data[i]
  player_rating_info <- chess_data[i+1]
  
  #player total points
  total_points <- paste(unlist(str_extract_all(player_info, "(?<=|)\\d\\.\\d")))
  
  chess_dataFrame$TotalPoints[start_index] <- total_points
  start_index <- start_index + 1

}

#extract player preratings

start_index <- 1

for (i in seq(1, length(chess_data),by=2)) {
  
  #split player info into lines
  player_info <- chess_data[i]
  player_rating_info <- chess_data[i+1]
  
  #pre-ratings for players
  player_rating <- paste(unlist(str_extract_all(player_rating_info, "(?<=R:)\\s*(\\d+)(?=\\s*->|\\w)")))
  player_rating <- trimws(player_rating)
  
  chess_dataFrame$PreRating[start_index] <- as.numeric(player_rating)
  
  start_index <- start_index + 1
}


#extract game outcomes + opponent IDs
start_index <- 1

for (i in seq(1, length(chess_data),by=2)) {
  
  #split player info into lines
  player_info <- chess_data[i]
  player_rating_info <- chess_data[i+1]
  
  #game outcomes + opponent ID
  games_played <- paste(unlist(str_extract_all(player_info, "\\b(W|L|D){1}\\s*\\d+")), collapse =", ")
  
  chess_dataFrame$GameCount[start_index] <- games_played
  
  start_index <- start_index + 1
}

Now you’ll see we should have a pretty solid table built out. We still need to calculate avg opponent rating (sum of opponent pre-tournament score divided by games played)

head(chess_dataFrame)

##   PlayerID          PlayerName State TotalPoints PreRating
## 1        1            GARY HUA    ON         6.0      1794
## 2        2     DAKSHESH DARURI    MI         6.0      1553
## 3        3        ADITYA BAJAJ    MI         6.0      1384
## 4        4 PATRICK H SCHILLING    MI         5.5      1716
## 5        5          HANSHI ZUO    MI         5.5      1655
## 6        6         HANSEN SONG    OH         5.0      1686
##                                         GameCount AvgOpponentRating
## 1 W  39, W  21, W  18, W  14, W   7, D  12, D   4                NA
## 2 W  63, W  58, L   4, W  17, W  16, W  20, W   7                NA
## 3 L   8, W  61, W  25, W  21, W  11, W  13, W  12                NA
## 4 W  23, D  28, W   2, W  26, D   5, W  19, D   1                NA
## 5 W  45, W  37, D  12, D  13, D   4, W  14, W  17                NA
## 6 W  34, D  29, L  11, W  35, D  10, W  27, W  21                NA

We can do this by looping through the chess dataframe extracting the number of games played and then dynamically looping through the vector of opponent ID’s to locate their pre-tournament score.

#calculate avg opponent rating
for(i in 1:nrow(chess_dataFrame)){
  game_count_str <- chess_dataFrame$GameCount[i]
  
  total_games <- str_count(game_count_str, "[A-Z]")
  
  opponent_ids <- as.numeric(unlist(str_extract_all(game_count_str, "\\d+")))
  
  total_opponent_rating <- 0
  index <- 1
  for(opponent_id in opponent_ids){
    opponent_rating <- chess_dataFrame$PreRating[opponent_id]
    
    total_opponent_rating <- total_opponent_rating + opponent_rating
    
  }
  
  avg_op_rating <- total_opponent_rating / total_games
  
  chess_dataFrame$AvgOpponentRating[i] = avg_op_rating
}

The table is pretty much complete. However, there are some unnecessary columns that we can trim before exporting the final to CSV.

head(chess_dataFrame)

##   PlayerID          PlayerName State TotalPoints PreRating
## 1        1            GARY HUA    ON         6.0      1794
## 2        2     DAKSHESH DARURI    MI         6.0      1553
## 3        3        ADITYA BAJAJ    MI         6.0      1384
## 4        4 PATRICK H SCHILLING    MI         5.5      1716
## 5        5          HANSHI ZUO    MI         5.5      1655
## 6        6         HANSEN SONG    OH         5.0      1686
##                                         GameCount AvgOpponentRating
## 1 W  39, W  21, W  18, W  14, W   7, D  12, D   4          1605.286
## 2 W  63, W  58, L   4, W  17, W  16, W  20, W   7          1469.286
## 3 L   8, W  61, W  25, W  21, W  11, W  13, W  12          1563.571
## 4 W  23, D  28, W   2, W  26, D   5, W  19, D   1          1573.571
## 5 W  45, W  37, D  12, D  13, D   4, W  14, W  17          1500.857
## 6 W  34, D  29, L  11, W  35, D  10, W  27, W  21          1518.714

Trim and Export

In the final table, we only need 5 columns: Name, State, TotalPoints, PreRating, and Avg Opponent Rating. So, we’ll subset this dataframe in a new one, excluding the unneeded columns.

After doing that, we’re ready to export as a CSV!

#subset the dataframe
selected_columns <- chess_dataFrame[, c("PlayerName", "State", "TotalPoints", "PreRating", "AvgOpponentRating")]

#write to a csv
write.csv(selected_columns, file = "chess_tournament_selected_columns.csv", row.names = FALSE)

head(selected_columns)

##            PlayerName State TotalPoints PreRating AvgOpponentRating
## 1            GARY HUA    ON         6.0      1794          1605.286
## 2     DAKSHESH DARURI    MI         6.0      1553          1469.286
## 3        ADITYA BAJAJ    MI         6.0      1384          1563.571
## 4 PATRICK H SCHILLING    MI         5.5      1716          1573.571
## 5          HANSHI ZUO    MI         5.5      1655          1500.857
## 6         HANSEN SONG    OH         5.0      1686          1518.714

DATA607_Project1

aw

2024-09-25

Overview

Import and Clean

Initializing and Populating the DataFrame

Trim and Export