Cleaning, Viewing and Exporting Structured Data

Introduction
Creation of Data Table
Basic Analysis with Structured Data
- Summary Statistics and Distribution
- Analysis of Tournament Points vs. Ratings

Introduction

Project Overview:

The subsequent file, tournament, is a text file giving the results of an arbitrary chess tournament. Due to its semi-structured state, it is feasible to extract pertinent information using some basic R and package tools. The main push of the assignment is to load in the .txt and extract the prescribed information (namely: the player’s name; their state; their point total; the player’s pre-rating and the combined pre-tournament rating average of their respective opponents) and export the data to a csv file. The following exhibits one particular method of achieving this.

Relevence:

The majority of data encountered in “the field” is not organized and pretty. It is mangled and from multiple sources. It could be financial data in some bank-specific format or scrapped data from an API with html code hugging every target value. The ability to operate-on and clean such data is pertinent in the toolbox of any data scientist.

Data Reference and Libraries

The .txt is available in its raw format here. There is no data dictionary, but one is necessary for the scope of this assignment. For the layman/woman: some further reading is available from this wiki.

Outside of formatting packages, the libraries used for this project are:

stringr - For regex-based functions to clean and restructure the document.
DT - To create an interactive data table.
ggplot - Provides some more advanced plotting utilities.

Creation of Data Table

Cleanup

The data from the chess tournament, in its semi-structured state, is easy for the human eye to comprehend. After a few glances and a little understanding of some chess lingo, we can get a general idea of how each individual performed. An example of four individuals under the current structure:

 [1] "-----------------------------------------------------------------------------------------" 
 [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
 [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
 [4] "-----------------------------------------------------------------------------------------" 
 [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
 [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
 [7] "-----------------------------------------------------------------------------------------" 
 [8] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
 [9] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
[10] "-----------------------------------------------------------------------------------------" 
[11] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|" 
[12] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
[13] "-----------------------------------------------------------------------------------------" 
[14] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|" 
[15] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |" 
[16] "-----------------------------------------------------------------------------------------"

If we wanted to do any computational analysis, i.e, convert the data into a table and do basic summary statistics, it would be a very difficult feat under the data’s current format. There are numerous things standing in our way, e.g., white space padding and the “—” lines. With some regex and tools from the stringr package, it is relatively easy to remove these issues and leave us with clearer sense of the data we are working with:

# Remove header, leading, tailing and excessive whitespace:
tournament <- tournament[5:length(tournament)]
tournament <- gsub("^[[:space:]]+?|(?<=\\|)[[:space:]]+?|[[:space:]]+?(?=\\|)", 
    "", tournament, perl = TRUE)
tournament <- gsub("[[:space:]]{2,}", " ", tournament, perl = TRUE)

# Remove '---...' vectors:
tournament <- tournament[!str_detect(tournament, "^-+")]
head(tournament)

[1] " 1|GARY HUA|6.0|W 39|W 21|W 18|W 14|W 7|D 12|D 4|"       
[2] " ON|15445895 / R: 1794 ->1817|N:2|W|B|W|B|W|B|W|"        
[3] " 2|DAKSHESH DARURI|6.0|W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
[4] " MI|14598900 / R: 1553 ->1663|N:2|B|W|B|W|B|W|B|"        
[5] " 3|ADITYA BAJAJ|6.0|L 8|W 61|W 25|W 21|W 11|W 13|W 12|"  
[6] " MI|14959604 / R: 1384 ->1640|N:2|W|B|W|B|W|B|W|"

Merge and Subset Table

We are left with something that could be coerced into a table. However, we still have a big issue: Our individual’s performance is split over two consecutive lines. For the scope of this project, we are only concerned with the player’s state and pre-rating from the second line, so the other variables will be excluded. Merging the two consecutive lines, the table can be created:

# Merge and create a table
tournament <- str_c(tournament[seq(1, length(tournament), 2)], tournament[seq(2, 
    length(tournament), 2)])
tournament <- read.table(text = tournament, sep = "|", stringsAsFactors = FALSE)
tournament <- subset(tournament, select = c(1:12))

# Isolate player's pre-rating:
tournament$V12 <- as.numeric(str_extract(tournament$V12, "(?<=R: )\\d+"))

head(tournament)

  V1                  V2  V3   V4   V5   V6   V7   V8   V9  V10 V11  V12
1  1            GARY HUA 6.0 W 39 W 21 W 18 W 14  W 7 D 12  D 4  ON 1794
2  2     DAKSHESH DARURI 6.0 W 63 W 58  L 4 W 17 W 16 W 20  W 7  MI 1553
3  3        ADITYA BAJAJ 6.0  L 8 W 61 W 25 W 21 W 11 W 13 W 12  MI 1384
4  4 PATRICK H SCHILLING 5.5 W 23 D 28  W 2 W 26  D 5 W 19  D 1  MI 1716
5  5          HANSHI ZUO 5.5 W 45 W 37 D 12 D 13  D 4 W 14 W 17  MI 1655
6  6         HANSEN SONG 5.0 W 34 D 29 L 11 W 35 D 10 W 27 W 21  OH 1686

Averaging Opponents Pre-Rating

The number in the variables V4:V10, let’s call these matches, corresponds to the key of the opponent in V1. To average the opponent’s pre-rating, the opponent’s pre-rating needs to be looked up by their key in the match variable. The pre-ratings will be summed across the matches of V4:V10 and divided by the number of opponents played.

# Isolate player key and, if present, lookup pre-rating. Sum accross matches
# and divide by oppents faced:
for (player in 1:nrow(tournament)) {
    opp_sum <- 0
    count <- 0
    for (i in c(4:10)) {
        if (str_detect(tournament[player, i], "[^HUBX]")) {
            count <- count + 1
            opp_sum <- opp_sum + tournament$V12[tournament[, 1] == str_extract(tournament[player, 
                i], "\\d+")]
        }
    }
    tournament[player, 13] <- round(opp_sum/count, 0)
}

# Subset table in to desired variables and output:
tournament <- subset(tournament, select = c(2, 3, 11, 12, 13))
names(tournament) <- c("name", "points", "state", "pre_rating", "avg_opp_pre_rating")

Export as .csv and Display Table

If the following code is run, a table of the data in csv format will be on your working directory titled Byrne_Pro1_Output.csv. The data, the summation of our labor, is in an interactive format below:

write.csv(tournament, file = "Byrne_Pro1_Output.csv", row.names = TRUE, na = "")
datatable(tournament)

Basic Analysis with Structured Data

Summary Statistics and Distribution

Now that we have wrangled the data in to a format that R understands, some analysis of the chess tournament is possible. Some basic statistics and the distribution of player ratings are provided below:

summary(tournament[c(2, 4, 5)])

     points        pre_rating   avg_opp_pre_rating
 Min.   :1.000   Min.   : 377   Min.   :1107      
 1st Qu.:2.500   1st Qu.:1227   1st Qu.:1310      
 Median :3.500   Median :1407   Median :1382      
 Mean   :3.438   Mean   :1378   Mean   :1379      
 3rd Qu.:4.000   3rd Qu.:1583   3rd Qu.:1481      
 Max.   :6.000   Max.   :1794   Max.   :1605

ggplot(tournament, aes(x = tournament$pre_rating)) + geom_histogram(aes(y = ..density..), 
    binwidth = 100, color = "black", fill = "white") + geom_density(alpha = 0.2, 
    fill = "#FF6666") + labs(title = "Player Rating Distribution", x = "Player's Pre-Rating")

boxplot(tournament$pre_rating, ylab = "Player Pre-Rating")

The data distribution is left-skewed, meaning there were a few low-ranking players in the tournament up against more advanced players. Outside of these relatively low ranking players, the distribution is fairly normal.

Analysis of Tournament Points vs. Ratings

It would be interesting to see the points acquired relative to the caliber of opponents faced. Did players win the most points because they were up against low-ranking players, or were they just better? To do this, we will first look at the distribution of the difference in player’s ratings relative to points acquired. If average opponent’s pre-rating - player’s pre-rating is negative, the player was up against weaker players, if positive they were up against stronger players.

rate_diff <- tournament$avg_opp_pre_rating - tournament$pre_rating

summary(rate_diff)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-382.000 -172.200  -68.000    0.125  139.800  981.000

boxplot(rate_diff)

From the single box plot, it appears that the difference in players ratings during the tournament is relatively evenly distributed about a median close to 0. Overall, players typically faced opponents of the same approximate ratings. Let us drill down a little further and see points earned based off of the difference in ratings.

ggplot(tournament, aes(factor(tournament$points), rate_diff)) + geom_boxplot() + 
    labs(x = "Points", y = "Average Opponent Rating - Player Rating")

From the box plots, it appears that the players who won less than 3 points were typically up against stronger players and the players who won more than 3 points were matched against weaker players.

Below, we will plot points earned as a function of the difference in ratings of the players. The shaded region corresponds to a 95% confidence interval.

ggplot(tournament, aes(x = tournament$avg_opp_pre_rating - tournament$pre_rating, 
    y = tournament$points)) + geom_point() + stat_smooth() + geom_vline(xintercept = 0) + 
    labs(x = "Average of Opponent Ratings - Player Rating", y = "Points")

The polynomial trendline shows a predictive model for guessing the amount of points earned based off the difference in ratings. Not surprisingly, if the player is matched up against weaker players, i.e., rating difference < 0, a greater amount of points would be expected from the tournament. Chess tournaments, like most games, are not set up to pit the best against the best and the weakest against the weakest. It is based off of pre-game standings and the result of each subsequent match, so this should not be a signal of foul play. It does show that the players who, by the structure of tournament rules, managed to face overall weaker players, were more likely to score the most points.