DATA 607 Project 1 - Chess Scores

Description

In this project, I was given a text file with chess tournament results where the information has some structure. The task is to create an R Markdown file that generates a .CSV file with the following information for all of the players:

Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents

For the first player, the information would be: Gary Hua, ON, 6.0, 1794, 1605

1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.

Setting Up My Environment

Before I explain my whole approach to this task I will do a bit of preliminary work to set up my environment. I will load the libraries I will be using:

library(readr)
library(stringr)
library(dplyr)

Next I will read in the chess data using readr’s read_lines function which loads each line into a character vector.

lines <- read_lines("https://raw.githubusercontent.com/mikeasilva/CUNY-SPS/master/DATA607/data/tournamentinfo.txt")
lines[1:7]

## [1] "-----------------------------------------------------------------------------------------" 
## [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
## [4] "-----------------------------------------------------------------------------------------" 
## [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
## [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
## [7] "-----------------------------------------------------------------------------------------"

My Approach

I will loop through the character vector. I will check if the line is made up of dashes. After reading two dashed lines I will start extracting data. There are two lines that have useful data. The first has the player’s id, name and match data. The second line has the player’s state and starting score.

I will extract the player data into a list then load that into a dataframe. The match data will be a set of player ids that I will join to the player scores to compute the average. Once I have these averages I will merge that in with the player data.

The name data is in all capital letters. We will need to function that will capitalize the first letter.

capitalize_first_letter <- function(x) {
  x <- tolower(x)
  substr(x, 1, 1) <- toupper(substr(x, 1, 1))
  x
}

Now we are ready to process the file. First we will initialize some variables.

# Initialize counters
dash_lines_found <- 0
data_line_count <- 0
# Initialize the extract data flag
extract_data <- FALSE
# Initialize a vector that will hold the match data
matches <- c()

Data Extraction

Now for the heart of the project: the for loop.

# Loop through line by line
for (line in lines){
  # Look for the dash line
  if (line == "-----------------------------------------------------------------------------------------"){
    # Increment the dash line count
    dash_lines_found <- dash_lines_found + 1
    # Check if we need to reset the data line counter
    if (extract_data){
      data_line_count <- 0
    }
    # Check if we need to extract data
    if (dash_lines_found == 2){
      extract_data <- TRUE
    }
  } else if (extract_data) {
    # Increase the data line count by one
    data_line_count <- data_line_count + 1
    # Split the data by the pipe character
    split_line <- unlist(strsplit(line, "\\|"))
    # Extract the data
    if (data_line_count == 1){
      # This is the first line so extract the player's id
      player_id <- str_extract(split_line[1], "[[:digit:]]+")
      # Extract name, split name into parts, capitalize the first letter and
      # put it back together
      player_name <- str_extract_all(split_line[2], "[[:alpha:]]+")
      player_name <- unlist(player_name)
      player_name <- capitalize_first_letter(player_name)
      player_name <- paste(player_name, collapse = " ")
      # Extract total number of points
      player_total_points <- as.numeric(trimws(split_line[3]))
      # Loop through the match results
      for (game in split_line[4:10]){
        # Extract if it is a win, loss, draw, etc.
        result <- str_extract(game, "[[:alpha:]]")
        # Extract the opponent's id
        opponent_id <- str_extract(game, "[[:digit:]]+")
        # Add the data to the match vector
        matches <- rbind(matches, c(player_id, opponent_id, result))
      }
    } else {
      # This is the second line so Get the player's rating
      player_rating <- unlist(str_extract_all(split_line[2], "[[:digit:]]+"))
      player_rating <- as.numeric(player_rating[2])
      # Get the player's state
      player_state <- str_extract(split_line[1], "[[:alpha:]]+")
      # Pull all the extracted player data together into a list
      player_data <- list("player_id" = player_id,
                          "players_name" = player_name,
                          "players_state" = player_state,
                          "player_total_points" = player_total_points,
                          "players_pre_rating" = player_rating)
      # Turn that list into a dataframe with one row
      player_row <- data.frame(bind_rows(player_data))
      # Add it to the players dataframe
      if (!exists("player_df")){
        player_df <- player_row
      } else {
        player_df <- rbind(player_df, player_row)
      }
    }
  }
}

Extracton Results

Now that the data is extracted we need to take a look at it. Here’s what the extracted player data looks like:

head(player_df)

##   player_id        players_name players_state player_total_points
## 1         1            Gary Hua            ON                 6.0
## 2         2     Dakshesh Daruri            MI                 6.0
## 3         3        Aditya Bajaj            MI                 6.0
## 4         4 Patrick H Schilling            MI                 5.5
## 5         5          Hanshi Zuo            MI                 5.5
## 6         6         Hansen Song            OH                 5.0
##   players_pre_rating
## 1               1794
## 2               1553
## 3               1384
## 4               1716
## 5               1655
## 6               1686

We also have the match data:

matches[1:6,]

##      [,1] [,2] [,3]
## [1,] "1"  "39" "W" 
## [2,] "1"  "21" "W" 
## [3,] "1"  "18" "W" 
## [4,] "1"  "14" "W" 
## [5,] "1"  "7"  "W" 
## [6,] "1"  "12" "D"

Post-Processing

Now that we have extracted the data we move into the post-processing stage. First we need to turn the match results data into a dataframe, rename the variables and add in the player ranking data.

# Take the match results
matches <- matches %>%
  data.frame() %>%
  rename(player_id = X1,
         opponent = X2,
         results = X3)
# Add in the player's initial rating
matches <- player_df %>%
  rename(opponent = player_id) %>%
  select(opponent, players_pre_rating) %>%
  merge(matches)

Here’s what our match results dataframe looks like

head(matches)

##   opponent players_pre_rating player_id results
## 1        1               1794         7       L
## 2        1               1794        39       L
## 3        1               1794        18       L
## 4        1               1794        12       D
## 5        1               1794         4       D
## 6        1               1794        21       L

Now we need to compute the averages. We need to group the data by player id and compute an average using a dplyr pipeline.

averages <- matches %>%
  group_by(player_id) %>%
  summarise(average = mean(players_pre_rating))

Here’s what the averages looks like:

head(averages)

## # A tibble: 6 x 2
##   player_id average
##   <fct>       <dbl>
## 1 1           1605.
## 2 10          1554.
## 3 11          1468.
## 4 12          1506.
## 5 13          1498.
## 6 14          1515

Now we need to merge the averages to the player dataframe and drop the player id so it matches what is asked for in the description. The averages are doubles, not integers like the example so we will adjust that

df <- merge(player_df, averages) %>%
  mutate(average_opponent_pre_rating = as.integer(average)) %>%
    select(-player_id, -average)

Here’s what the dataframe looks like:

head(df)

##               players_name players_state player_total_points
## 1                 Gary Hua            ON                 6.0
## 2                Anvit Rao            MI                 5.0
## 3 Cameron William Mc Leman            MI                 4.5
## 4           Kenneth J Tack            MI                 4.5
## 5        Torrance Henry Jr            MI                 4.5
## 6             Bradley Shaw            MI                 4.5
##   players_pre_rating average_opponent_pre_rating
## 1               1794                        1605
## 2               1365                        1554
## 3               1712                        1467
## 4               1663                        1506
## 5               1666                        1497
## 6               1610                        1515

Finalizing the CSV

Now that the player data is in the form specified we need to save it as a CSV.

write_csv(player_df, 'data/chess.csv')

Fun with the Data

I really enjoy graph theory and couldn’t pass up the oportunity to create a network graph from this chess data. I recently learned how to use the igraph library from DataCamp so I will practice my skills on this data. This is beyond the scope of the project and is for the fun of it.

I want the graph to use the players names as the ids

library(igraph)

players_names_1 <- player_df %>%
  select(player_id, players_name)

players_names_2 <- player_df %>%
  rename(opponent = player_id,
         opponents_name = players_name) %>%
  select(opponent, opponents_name)

G <- matches %>%
  select(player_id, opponent) %>%
  merge(players_names_1) %>%
  merge(players_names_2) %>%
  select(players_name, opponents_name) %>%
  graph_from_data_frame(., directed = FALSE) %>%
  simplify()

Now that we have loaded up the data into a graph let’s see who Gary Hua played

neighbors(G, 'Gary Hua')

## + 7/64 vertices, named, from 75df4ad:
## [1] Gary Dee Swathell   Kenneth J Tack      Patrick H Schilling
## [4] Bradley Shaw        Joel R Hendon       David Sundeen      
## [7] Dinh Dang Bui

One last thing. I want to see if any of the people who played Gary Hua played each other. I want to visualize it.

G_sub <- induced_subgraph(G, neighbors(G, 'Gary Hua'))
layout <- layout.reingold.tilford(G_sub, circular=T)
plot(G_sub, layout=layout, vertex.size=20, vertex.color='white', vertex.frame.color = 'cornflower blue')

It looks like Joel R Hendon and Dinh Dang Bui were the only two who played each other and Gary Hua.