Data Wrangling

It is often necessary to transform raw data to a condensed and more useful format to facilitate downstream analysis. This process is often refered to as ‘Data Wrangling’

In this demo, we will wrangle data from a chess tournament. The raw data is given to us as a .txt file and our goal is to transform the text data into a much more concise tabular form and to export the data as a .csv (comma separated value) file. However, we are not simply transcribing to text data. Rather, we are extracting specific elements and performing some simple manipulations to represent a subset of the information given in the text file.

We need to extract the following information from the raw data:

  1. Player’s Name
  2. Player’s State
  3. Total Number of Points
  4. Player’s Pre-Rating
  5. Average Pre Chess Rating of Opponents

From the image above, we see that there are many irrelevant characters. However, we also observe that the raw data has some predictable structure that we can take advantage of to parse the information that we need to format our deliverable.

To achieve this goal, we will perform the following steps:

  1. Import the Raw Data
  2. Initialize a data.frame to serve as an intermediary
  3. Parse the text for the data we need
  4. Write our data.frame to a .csv file
  5. Form some preliminary insights

1. Import the Raw Data

For the convenience of this demo, the .txt raw data has been uploaded to the author’s github account. We can access the data as follows:

##                                                                               tournament_text
## 1   -----------------------------------------------------------------------------------------
## 2   Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| 
## 3   Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | 
## 4   -----------------------------------------------------------------------------------------
## 5       1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|
## 6      ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |
## 7   -----------------------------------------------------------------------------------------
## 8       2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|
## 9      MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |
## 10  -----------------------------------------------------------------------------------------
## 11      3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|
## 12     MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |



2. Initialize a data.frame to serve as an intermediary

Now we will initialize a data.frame to hold the results of our data wrangling. The features that we want to extract have been defined for us, so we can set up the corresponding number of collumns and give them descriptive names. If we open up and inspect the text file, we can scroll down and see that there are 64 records. Therefore, we will preallocate 64 rows.

Is it necessary to preallocate space for our data?
No, not necessarily, at least not in R. However, since we know the size and shape of the data we are expecting, specifiying so in the size and shape of the data.frame can help catch some data processing errors early on.



3. Parse the text

If we look at the data.frame ‘tournament_text_stringsdf’ we can see that most of the data we are interested in parsing out is relatively easy to locate within the reoccuring pattern of the text.

Four of the five data features we are interested in are indicated in the figure below:

  1. Player’s Name (red circle)
  2. Player’s State (blue circle)
  3. Total Number of Points (orange circle)
  4. Player’s Pre-Rating (purple circle)



Now, to extract the data. These are the basic steps we will take:

  • set up idices for the rows of text that hold data
  • use strsplit() to split the rows by the delimeting character ‘|’
  • select out the specific split of interst
  • reformat the split & add the data to the proallocated data.frame
#the data for the features 'Player's Name' & Total Number of Points'
#first appears in row 5 and then re-occures every 3 lines.
#set up and list of indices for these rows:
namesRow_IDX <- seq( 5, 194, 3)
#the data for the features 'Player's State' & Player's Pre-Rating'
#first appears in row 6 and then re-occures every 3 lines.
#set up and list of indices for these rows:
IDnumRow_IDX <- seq( 6, 195, 3)

#use strsplit() to split the rows of interest into string fragments 
#delimited by the character '|'. Use the lists 'NamesRow_IDX' & 
#'IDnumRow_IDX' to select subsets of rows to help select for data features
splitNamesRow <- strsplit(tournament_text_stringsdf[namesRow_IDX,], "\\|")
splitIDnumRow <- strsplit(tournament_text_stringsdf[IDnumRow_IDX,], "\\|")

#use the function unlist() with lappy() to access a specific string
#from the list of split substrings.
#for the 'Player's Name feature, we need to select the 2nd element 
#from splitNamesRow and use the trimws() to eliminate leading and trailing
#blank spaces.
tournament_df$Name <- trimws(unlist(lapply(splitNamesRow, '[[', 2))) 
#for the 'Player's State' feature, we need to select the 1st element 
#from splitIDnumRow and use the trimws() to eliminate leading and trailing
#blank spaces.
tournament_df$State <- trimws(unlist(lapply(splitIDnumRow, '[[', 1)))
#for the 'Total Number of Points' feature, we need to select the 3rd element 
#from splitNamesRow and use the as.numeric() to cast from character to
#numeric data
tournament_df$Total_Num_Points <- 
    as.numeric(trimws(unlist(lapply(splitNamesRow, '[[', 3))))
#The 'Player’s Pre-Rating' feature is a bit more complicated. 
#we need to select the 2nd element from splitIDnumRow with lappy() & unlist().
#Next, we use a regular expression pattern to select the Player's
#pre-tournament rating. We then use the gsub() function to select
#the information and as.numeric() to re-cast the data.
tournament_df$Pre_Rating <- unlist(lapply(splitIDnumRow,'[[', 2))
#this pattern will select a variable length of numeric characters that are
#preceded by 'R:' and followed by any other type of character 
#(in the case of our data this is either '-' or 'P').
pat <- "^.*R:.*?([0-9]+).*"
tournament_df$Pre_Rating <- as.numeric(
    gsub(pat, "\\1", tournament_df$Pre_Rating))
#display the head of the tournament_df dataframe to visually inspect our result.
head( tournament_df )
##                  Name State Total_Num_Points Pre_Rating
## 1            GARY HUA    ON              6.0       1794
## 2     DAKSHESH DARURI    MI              6.0       1553
## 3        ADITYA BAJAJ    MI              6.0       1384
## 4 PATRICK H SCHILLING    MI              5.5       1716
## 5          HANSHI ZUO    MI              5.5       1655
## 6         HANSEN SONG    OH              5.0       1686
##   Ave_Pre_Rating_Opponents
## 1                       NA
## 2                       NA
## 3                       NA
## 4                       NA
## 5                       NA
## 6                       NA



Great!, we are almost done wrangling our text data into the form we want.
But there is one more feature left, ‘Average Pre Chess Rating of Opponents’.
This feature is a bit more of a challenge to extract, so we will be taking the following steps:

  • select the 4th through 10th string elements from ‘splitNamesRow’
  • select the numeric digits from each element
  • With a for loop, use these numbers as indices to:
    • keep track of rounds without ppponent numbers
    • select an opponent’s numeric ID
  • Find the average pre-tournament rating of a player’s opponents
#select the 4th through 10th string elements from 'splitNamesRow'
tournament_df$Ave_Pre_Rating_Opponents <- lapply(splitNamesRow,'[', 4:10)

#use a combination of gsub() and str_extract_all() (from the stringr library)
#to select the numeric character for each round.
library(stringr)
tournament_df$Ave_Pre_Rating_Opponents <- str_extract_all(
    gsub( '\\"[A-Za-z]', '\\1', tournament_df$Ave_Pre_Rating_Opponents),
    "\\(?[0-9]+\\)?" )

#The numbers we extracted give the numeric ID of a Player's opponent
#for every round where a player faced an opponent (win, lose, or draw)
#in this for loop, we use the numeric ID as an index to select the opponent's
#pre-tournament ranking. We will keep track of the pre-tournament ranking
#and keep track of the number of rounds where there was no opponent.
num_records <- length(tournament_df$Ave_Pre_Rating_Opponents)
scores <- rep(0, num_records)
numRounds <- rep(0, num_records)
for (round in seq(1:7) ){
    scoresIDX <- as.numeric(
      lapply(tournament_df$Ave_Pre_Rating_Opponents, '[', round))
    scores2ADD <- tournament_df$Pre_Rating[scoresIDX]
    NAadd <- as.numeric(is.na(scores2ADD ))
    numRounds <- numRounds + NAadd
    scores2ADD[is.na(scores2ADD)] <- 0
    scores <- scores + scores2ADD
}

#Find the average pre-tournament rating of a player's opponents
#subtract the number of Rounds where there was no opponent (e.g. byes)
totalRoundsPlayed <- rep(7, num_records)-numRounds
#divide the sum of the opponent scores by the number of opponents
tournament_df$Ave_Pre_Rating_Opponents <- round(scores/totalRoundsPlayed)

head( tournament_df )
##                  Name State Total_Num_Points Pre_Rating
## 1            GARY HUA    ON              6.0       1794
## 2     DAKSHESH DARURI    MI              6.0       1553
## 3        ADITYA BAJAJ    MI              6.0       1384
## 4 PATRICK H SCHILLING    MI              5.5       1716
## 5          HANSHI ZUO    MI              5.5       1655
## 6         HANSEN SONG    OH              5.0       1686
##   Ave_Pre_Rating_Opponents
## 1                     1605
## 2                     1469
## 3                     1564
## 4                     1574
## 5                     1501
## 6                     1519



4. Write our data.frame to a .csv file

Our hard work paid off: we wrangled our data into a concise tabular form as an r data.frame. Now we need to write the data.frame to a .csv file format. Ths csv format is convenient, because it can be loaded into various platforms for further data management and/or manipulation.



5. Form some preliminary insights

Now that we have extracted the data of interest from the original text file, we can perform some preliminary analysis.

We can start by using the summary() function to look at some simple descriptive statistics of the features (columns) that we derived from the raw text:

##      Name              State           Total_Num_Points   Pre_Rating  
##  Length:64          Length:64          Min.   :1.000    Min.   : 377  
##  Class :character   Class :character   1st Qu.:2.500    1st Qu.:1227  
##  Mode  :character   Mode  :character   Median :3.500    Median :1407  
##                                        Mean   :3.438    Mean   :1378  
##                                        3rd Qu.:4.000    3rd Qu.:1583  
##                                        Max.   :6.000    Max.   :1794  
##  Ave_Pre_Rating_Opponents
##  Min.   :1107            
##  1st Qu.:1310            
##  Median :1382            
##  Mean   :1379            
##  3rd Qu.:1481            
##  Max.   :1605



We can explore the data further with a few simple visualizations

Let’s plot the average pre-tournament rating of a player’s opponents as a function of the player’s pre tournament rating:

From the figure above, we see that there is a suggestion of a positive trend in the data given by the regression line (blue line). However, the variability of the data points is very high about the regression line and the R2 value is very low.

What if we look for regional differences in the data? Do different states have players with higher average pre-tournament rankings?

## # A tibble: 3 x 5
##   State numPlayers  Mean Median IQRange
##   <chr>      <int> <dbl>  <dbl>   <dbl>
## 1 MI            55 1362   1393      370
## 2 OH             1 1686   1686        0
## 3 ON             8 1454.  1482.     168

As we can see from both the results of the aggregation and the boxplot visualization, there are very few data points from the states ‘OH’ and ‘ON’ (presumably Ohio and Ontario). Therefore, we cannot draw any solid conclusions about differences between the groups. However, we observe that players that traveled from a state other than Michigan have a higher median pre-tournament rating.

Conclusions

In this demo we wrangled data from a text file, restructured the data features into a concise format and wrote the output to a .csv file. The reformated data is more tractable that the raw text version because it omits superfluous characters. Additionally, the .csv output can be imported to many different platforms for further downstream analysis. Therefore, although we were not able to draw many meaningful insights from this as a standalone dataset, the data can be combined with other information in future analyses and/or this code can be repurposed for a similar but much larger raw data set.

Images from The Simpsons ‘The Cad and the Hat’ Season 28,Episode 14.