In this project, the task is to transform a text file with chess tournament results into a clean tidy data format and store the results as a .CSV file with the following information for all of the players:

  • Player’s Name
  • Player’s State
  • Total Number of Points
  • Player’s Pre-Rating
  • Average Pre Chess Rating of Opponents

The challenge is that the input data format has a complicated structure with multiple lines holding content for a single observation.

Below you will find the outputs of the processing steps. In order to see the R code used to generate them, select “Show all code” on the “Code” button in the top right corner of the page or next to each of the outputs.

If you want to run the code on your machine, download the folder with the code and data from GitHub and change the folder paths in the code to the downloaded directory.


Step 1: Read the raw input file and turn it into a data frame

  • Read the raw file
  • Select the necessary data from even and uneven lines and combine them in a single data frame

At the beginning, the raw input data looks like this

# Load packages 

require(dplyr)
require(readr)
require(tidyr)
require(stringr)
require(knitr)

# Read the file as text and remove the annoying dash separators

raw = readLines("/Users/dima/Google Drive/CUNY MSDA/DATA 607 Data Acquisition/Week_4/tournamentinfo.txt")

raw = str_replace_all(raw,"-","") 

print(raw[1:9])
## [1] ""                                                                                          
## [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num  | USCF ID / Rtg (Pre>Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | " 
## [4] ""                                                                                          
## [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
## [6] "   ON | 15445895 / R: 1794   >1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"  
## [7] ""                                                                                          
## [8] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
## [9] "   MI | 14598900 / R: 1553   >1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"

After processing the raw file, we get a nice data frame:

# Read cleaned data into a data frame and drop the empty column

inp = read.delim(textConnection(raw),header = F, sep = "|", stringsAsFactors = F) # The source for this solution is provided at the bottom of the page

inp = inp[,-11]

# The first two lines are the header that should be one line and also simplified

hdr = paste(trimws(as.character(inp[1,])), trimws(as.character(inp[2,])))

hdr = make.names(str_sub(hdr, 1,11))

# Add the header

inp = as.tbl(inp[-1:-2,])
names(inp) = hdr

# Player ids and names are at uneven positions
# Player states and ranks are at even positions
p_id = seq(1,nrow(inp),2)
s_r_id = seq(2,nrow(inp),2)

players = inp[p_id,]
state_ranks = select(inp[s_r_id,], State = Pair.Num, Pre.Rating = Player.Name)

# As we only care about wins, losses and draws, we can ignore the second line of
# game records and just bind the columns:

players = bind_cols(players, state_ranks)

kable(players[1:5,])
Pair.Num Player.Name Total.Pts Round.1 Round.2 Round.3 Round.4 Round.5 Round.6 Round.7 State Pre.Rating
1 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7 D 12 D 4 ON 15445895 / R: 1794 >1817
2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17 W 16 W 20 W 7 MI 14598900 / R: 1553 >1663
3 ADITYA BAJAJ 6.0 L 8 W 61 W 25 W 21 W 11 W 13 W 12 MI 14959604 / R: 1384 >1640
4 PATRICK H SCHILLING 5.5 W 23 D 28 W 2 W 26 D 5 W 19 D 1 MI 12616049 / R: 1716 >1744
5 HANSHI ZUO 5.5 W 45 W 37 D 12 D 13 D 4 W 14 W 17 MI 14601533 / R: 1655 >1690

However, the column values in this table are still not in the required format. Also the usage of columns starting with “Round” does not conform to the tidy data format (recording one unique fact/observation per row, which is a chess round in our case). This makes subsequent aggregation of player ratings difficult. Both of the issues are solved in the next step.

Step 2: Clean up and further process required columns

  • Format the columns to the required output format
  • Use a self-join to add opponent pre-ratings

After formatting and converting into the tidy data format, the data frame looks like this

# Cleanup the Pre.Rating column to single out the numerical identifier of the opponent in each round

players = players %>% 
  mutate(Pre.Rating = str_replace(Pre.Rating,"\\d+\\s+/ R:\\s+","")) %>% 
  mutate(Pre.Rating = str_replace(Pre.Rating,"^[P].+|>.+","")) %>% 
  mutate(Pre.Rating = as.integer(str_extract(Pre.Rating, "\\d+"))) 
  
# Fix formats and types of the Total points, Pair.Num, and Player.Name columns
players = players %>% 
  mutate(Total.Pts = as.numeric(Total.Pts),
         Pair.Num = parse_number(Pair.Num),
         Player.Name = trimws(Player.Name))
  

# Transform into a tidy format for easier aggregation

players = players %>%
  gather("Round","OppPairNum", 4:10) %>% 
  mutate(OppPairNum = parse_number(OppPairNum)) 

# Join player pre ratings

playerRatings = select(players, Pair.Num,OppPreRating = Pre.Rating) %>% distinct()

players = left_join(players, playerRatings, by = c("OppPairNum" = "Pair.Num"))

kable(players[1:8,] )
Pair.Num Player.Name Total.Pts State Pre.Rating Round OppPairNum OppPreRating
1 GARY HUA 6.0 ON 1794 Round.1 39 1436
2 DAKSHESH DARURI 6.0 MI 1553 Round.1 63 1175
3 ADITYA BAJAJ 6.0 MI 1384 Round.1 8 1641
4 PATRICK H SCHILLING 5.5 MI 1716 Round.1 23 1363
5 HANSHI ZUO 5.5 MI 1655 Round.1 45 1242
6 HANSEN SONG 5.0 OH 1686 Round.1 34 1399
7 GARY DEE SWATHELL 5.0 MI 1649 Round.1 57 1092
8 EZEKIEL HOUGHTON 5.0 MI 1641 Round.1 3 1384

Step 3: Final Output

  • Calculate final table with average opponent ratings
  • Store a .CSV file
# Calculate final table with average opponent ratings

players = players %>% 
  group_by(Player.Name) %>% 
  mutate(AvgOppPreRating = round(mean(OppPreRating, na.rm=T)))

result = players %>% 
  select(Player.Name, Player.State = State, Total.Pts, Pre.Rating, AvgOppPreRating) %>% 
  ungroup() %>% 
  distinct()

# Store a csv

write_csv(result, "/Users/dima/Google Drive/CUNY MSDA/DATA 607 Data Acquisition/Week_4/project_1_output.csv")

The final table looks like this:

kable(head(result))
Player.Name Player.State Total.Pts Pre.Rating AvgOppPreRating
GARY HUA ON 6.0 1794 1605
DAKSHESH DARURI MI 6.0 1553 1469
ADITYA BAJAJ MI 6.0 1384 1564
PATRICK H SCHILLING MI 5.5 1716 1574
HANSHI ZUO MI 5.5 1655 1501
HANSEN SONG OH 5.0 1686 1519

Bonus: Diagram

The diagram below shows a comparison between the player’s own pre-rating vs. their opponents’ average pre-rating. The color corresponds to the total number of points scored in the tournament.
All observations above the gray diagonal line represent the players who had a higher pre-rating than the average of their opponents, and the points below the line represent the opposite case.

library(plotly)

plot_ly(result, x = Pre.Rating, y = AvgOppPreRating, color = Total.Pts, 
        mode = "markers", text = paste(Player.Name,"Total Points:",Total.Pts) ) %>% 
  add_trace(x = c(1000,max(Pre.Rating)), 
            y= c(1000, max(AvgOppPreRating)), mode = "lines", 
            line=list(color = "gray")) %>% 
  layout (xaxis=list (title = "Player Pre-Rating"),
          yaxis=list (title = "Average Pre-Rating of Opponents"),
          title = "Pre-Tournament Ratings: Player vs Opponents' Average",
          showlegend = FALSE)

Sources

StackOverflow