In this project, the task is to transform a text file with chess tournament results into a clean tidy data format and store the results as a .CSV file with the following information for all of the players:
The challenge is that the input data format has a complicated structure with multiple lines holding content for a single observation.
Below you will find the outputs of the processing steps. In order to see the R code used to generate them, select “Show all code” on the “Code” button in the top right corner of the page or next to each of the outputs.
If you want to run the code on your machine, download the folder with the code and data from GitHub and change the folder paths in the code to the downloaded directory.
At the beginning, the raw input data looks like this
# Load packages
require(dplyr)
require(readr)
require(tidyr)
require(stringr)
require(knitr)
# Read the file as text and remove the annoying dash separators
raw = readLines("/Users/dima/Google Drive/CUNY MSDA/DATA 607 Data Acquisition/Week_4/tournamentinfo.txt")
raw = str_replace_all(raw,"-","")
print(raw[1:9])
## [1] ""
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num | USCF ID / Rtg (Pre>Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [4] ""
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 >1817 |N:2 |W |B |W |B |W |B |W |"
## [7] ""
## [8] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [9] " MI | 14598900 / R: 1553 >1663 |N:2 |B |W |B |W |B |W |B |"
After processing the raw file, we get a nice data frame:
# Read cleaned data into a data frame and drop the empty column
inp = read.delim(textConnection(raw),header = F, sep = "|", stringsAsFactors = F) # The source for this solution is provided at the bottom of the page
inp = inp[,-11]
# The first two lines are the header that should be one line and also simplified
hdr = paste(trimws(as.character(inp[1,])), trimws(as.character(inp[2,])))
hdr = make.names(str_sub(hdr, 1,11))
# Add the header
inp = as.tbl(inp[-1:-2,])
names(inp) = hdr
# Player ids and names are at uneven positions
# Player states and ranks are at even positions
p_id = seq(1,nrow(inp),2)
s_r_id = seq(2,nrow(inp),2)
players = inp[p_id,]
state_ranks = select(inp[s_r_id,], State = Pair.Num, Pre.Rating = Player.Name)
# As we only care about wins, losses and draws, we can ignore the second line of
# game records and just bind the columns:
players = bind_cols(players, state_ranks)
kable(players[1:5,])
| Pair.Num | Player.Name | Total.Pts | Round.1 | Round.2 | Round.3 | Round.4 | Round.5 | Round.6 | Round.7 | State | Pre.Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GARY HUA | 6.0 | W 39 | W 21 | W 18 | W 14 | W 7 | D 12 | D 4 | ON | 15445895 / R: 1794 >1817 |
| 2 | DAKSHESH DARURI | 6.0 | W 63 | W 58 | L 4 | W 17 | W 16 | W 20 | W 7 | MI | 14598900 / R: 1553 >1663 |
| 3 | ADITYA BAJAJ | 6.0 | L 8 | W 61 | W 25 | W 21 | W 11 | W 13 | W 12 | MI | 14959604 / R: 1384 >1640 |
| 4 | PATRICK H SCHILLING | 5.5 | W 23 | D 28 | W 2 | W 26 | D 5 | W 19 | D 1 | MI | 12616049 / R: 1716 >1744 |
| 5 | HANSHI ZUO | 5.5 | W 45 | W 37 | D 12 | D 13 | D 4 | W 14 | W 17 | MI | 14601533 / R: 1655 >1690 |
However, the column values in this table are still not in the required format. Also the usage of columns starting with “Round” does not conform to the tidy data format (recording one unique fact/observation per row, which is a chess round in our case). This makes subsequent aggregation of player ratings difficult. Both of the issues are solved in the next step.
After formatting and converting into the tidy data format, the data frame looks like this
# Cleanup the Pre.Rating column to single out the numerical identifier of the opponent in each round
players = players %>%
mutate(Pre.Rating = str_replace(Pre.Rating,"\\d+\\s+/ R:\\s+","")) %>%
mutate(Pre.Rating = str_replace(Pre.Rating,"^[P].+|>.+","")) %>%
mutate(Pre.Rating = as.integer(str_extract(Pre.Rating, "\\d+")))
# Fix formats and types of the Total points, Pair.Num, and Player.Name columns
players = players %>%
mutate(Total.Pts = as.numeric(Total.Pts),
Pair.Num = parse_number(Pair.Num),
Player.Name = trimws(Player.Name))
# Transform into a tidy format for easier aggregation
players = players %>%
gather("Round","OppPairNum", 4:10) %>%
mutate(OppPairNum = parse_number(OppPairNum))
# Join player pre ratings
playerRatings = select(players, Pair.Num,OppPreRating = Pre.Rating) %>% distinct()
players = left_join(players, playerRatings, by = c("OppPairNum" = "Pair.Num"))
kable(players[1:8,] )
| Pair.Num | Player.Name | Total.Pts | State | Pre.Rating | Round | OppPairNum | OppPreRating |
|---|---|---|---|---|---|---|---|
| 1 | GARY HUA | 6.0 | ON | 1794 | Round.1 | 39 | 1436 |
| 2 | DAKSHESH DARURI | 6.0 | MI | 1553 | Round.1 | 63 | 1175 |
| 3 | ADITYA BAJAJ | 6.0 | MI | 1384 | Round.1 | 8 | 1641 |
| 4 | PATRICK H SCHILLING | 5.5 | MI | 1716 | Round.1 | 23 | 1363 |
| 5 | HANSHI ZUO | 5.5 | MI | 1655 | Round.1 | 45 | 1242 |
| 6 | HANSEN SONG | 5.0 | OH | 1686 | Round.1 | 34 | 1399 |
| 7 | GARY DEE SWATHELL | 5.0 | MI | 1649 | Round.1 | 57 | 1092 |
| 8 | EZEKIEL HOUGHTON | 5.0 | MI | 1641 | Round.1 | 3 | 1384 |
# Calculate final table with average opponent ratings
players = players %>%
group_by(Player.Name) %>%
mutate(AvgOppPreRating = round(mean(OppPreRating, na.rm=T)))
result = players %>%
select(Player.Name, Player.State = State, Total.Pts, Pre.Rating, AvgOppPreRating) %>%
ungroup() %>%
distinct()
# Store a csv
write_csv(result, "/Users/dima/Google Drive/CUNY MSDA/DATA 607 Data Acquisition/Week_4/project_1_output.csv")
The final table looks like this:
kable(head(result))
| Player.Name | Player.State | Total.Pts | Pre.Rating | AvgOppPreRating |
|---|---|---|---|---|
| GARY HUA | ON | 6.0 | 1794 | 1605 |
| DAKSHESH DARURI | MI | 6.0 | 1553 | 1469 |
| ADITYA BAJAJ | MI | 6.0 | 1384 | 1564 |
| PATRICK H SCHILLING | MI | 5.5 | 1716 | 1574 |
| HANSHI ZUO | MI | 5.5 | 1655 | 1501 |
| HANSEN SONG | OH | 5.0 | 1686 | 1519 |
The diagram below shows a comparison between the player’s own pre-rating vs. their opponents’ average pre-rating. The color corresponds to the total number of points scored in the tournament.
All observations above the gray diagonal line represent the players who had a higher pre-rating than the average of their opponents, and the points below the line represent the opposite case.
library(plotly)
plot_ly(result, x = Pre.Rating, y = AvgOppPreRating, color = Total.Pts,
mode = "markers", text = paste(Player.Name,"Total Points:",Total.Pts) ) %>%
add_trace(x = c(1000,max(Pre.Rating)),
y= c(1000, max(AvgOppPreRating)), mode = "lines",
line=list(color = "gray")) %>%
layout (xaxis=list (title = "Player Pre-Rating"),
yaxis=list (title = "Average Pre-Rating of Opponents"),
title = "Pre-Tournament Ratings: Player vs Opponents' Average",
showlegend = FALSE)