It is often necessary to transform raw data to a condensed and more useful format to facilitate downstream analysis. This process is often refered to as ‘Data Wrangling’
In this demo, we will wrangle data from a chess tournament. The raw data is given to us as a .txt file and our goal is to transform the text data into a much more concise tabular form and to export the data as a .csv (comma separated value) file. However, we are not simply transcribing to text data. Rather, we are extracting specific elements and performing some simple manipulations to represent a subset of the information given in the text file.
We need to extract the following information from the raw data:
From the image above, we see that there are many irrelevant characters. However, we also observe that the raw data has some predictable structure that we can take advantage of to parse the information that we need to format our deliverable.
To achieve this goal, we will perform the following steps:
For the convenience of this demo, the .txt raw data has been uploaded to the author’s github account. We can access the data as follows:
#URL for the chess tournament text data
tournament_text_URL <- 'https://raw.githubusercontent.com/SmilodonCub/DATA607/master/tournamentinfo.txt'
#pass 'tournament_text_URL' to the readLines() function.
#Each line of the text file will be read into r.
tournament_text <- readLines(tournament_text_URL,warn=FALSE)
#format the text as an r data.frame
tournament_text_stringsdf <- data.frame(tournament_text,stringsAsFactors = FALSE)
#use the head() function to display the first several lines in the data.frame
head( tournament_text_stringsdf, 12 )## tournament_text
## 1 -----------------------------------------------------------------------------------------
## 2 Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|
## 3 Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
## 4 -----------------------------------------------------------------------------------------
## 5 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
## 6 ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
## 7 -----------------------------------------------------------------------------------------
## 8 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|
## 9 MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |
## 10 -----------------------------------------------------------------------------------------
## 11 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|
## 12 MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |
Now we will initialize a data.frame to hold the results of our data wrangling. The features that we want to extract have been defined for us, so we can set up the corresponding number of collumns and give them descriptive names. If we open up and inspect the text file, we can scroll down and see that there are 64 records. Therefore, we will preallocate 64 rows.
Is it necessary to preallocate space for our data?
No, not necessarily, at least not in R. However, since we know the size and shape of the data we are expecting, specifiying so in the size and shape of the data.frame can help catch some data processing errors early on.
#initialize a dataframe & name the columns
chessfeatures <- c("Name","State","Total_Num_Points",
"Pre_Rating","Ave_Pre_Rating_Opponents")
#initialize a dataframe & give the columns the names from 'chessfeatures'
tournament_df <- setNames(data.frame(matrix(ncol = length( chessfeatures ), nrow = 64)), chessfeatures) #64 rows for the 64 chess recordsIf we look at the data.frame ‘tournament_text_stringsdf’ we can see that most of the data we are interested in parsing out is relatively easy to locate within the reoccuring pattern of the text.
Four of the five data features we are interested in are indicated in the figure below:
Now, to extract the data. These are the basic steps we will take:
#the data for the features 'Player's Name' & Total Number of Points'
#first appears in row 5 and then re-occures every 3 lines.
#set up and list of indices for these rows:
namesRow_IDX <- seq( 5, 194, 3)
#the data for the features 'Player's State' & Player's Pre-Rating'
#first appears in row 6 and then re-occures every 3 lines.
#set up and list of indices for these rows:
IDnumRow_IDX <- seq( 6, 195, 3)
#use strsplit() to split the rows of interest into string fragments
#delimited by the character '|'. Use the lists 'NamesRow_IDX' &
#'IDnumRow_IDX' to select subsets of rows to help select for data features
splitNamesRow <- strsplit(tournament_text_stringsdf[namesRow_IDX,], "\\|")
splitIDnumRow <- strsplit(tournament_text_stringsdf[IDnumRow_IDX,], "\\|")
#use the function unlist() with lappy() to access a specific string
#from the list of split substrings.
#for the 'Player's Name feature, we need to select the 2nd element
#from splitNamesRow and use the trimws() to eliminate leading and trailing
#blank spaces.
tournament_df$Name <- trimws(unlist(lapply(splitNamesRow, '[[', 2)))
#for the 'Player's State' feature, we need to select the 1st element
#from splitIDnumRow and use the trimws() to eliminate leading and trailing
#blank spaces.
tournament_df$State <- trimws(unlist(lapply(splitIDnumRow, '[[', 1)))
#for the 'Total Number of Points' feature, we need to select the 3rd element
#from splitNamesRow and use the as.numeric() to cast from character to
#numeric data
tournament_df$Total_Num_Points <-
as.numeric(trimws(unlist(lapply(splitNamesRow, '[[', 3))))
#The 'Player’s Pre-Rating' feature is a bit more complicated.
#we need to select the 2nd element from splitIDnumRow with lappy() & unlist().
#Next, we use a regular expression pattern to select the Player's
#pre-tournament rating. We then use the gsub() function to select
#the information and as.numeric() to re-cast the data.
tournament_df$Pre_Rating <- unlist(lapply(splitIDnumRow,'[[', 2))
#this pattern will select a variable length of numeric characters that are
#preceded by 'R:' and followed by any other type of character
#(in the case of our data this is either '-' or 'P').
pat <- "^.*R:.*?([0-9]+).*"
tournament_df$Pre_Rating <- as.numeric(
gsub(pat, "\\1", tournament_df$Pre_Rating))
#display the head of the tournament_df dataframe to visually inspect our result.
head( tournament_df )## Name State Total_Num_Points Pre_Rating
## 1 GARY HUA ON 6.0 1794
## 2 DAKSHESH DARURI MI 6.0 1553
## 3 ADITYA BAJAJ MI 6.0 1384
## 4 PATRICK H SCHILLING MI 5.5 1716
## 5 HANSHI ZUO MI 5.5 1655
## 6 HANSEN SONG OH 5.0 1686
## Ave_Pre_Rating_Opponents
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
Great!, we are almost done wrangling our text data into the form we want.
But there is one more feature left, ‘Average Pre Chess Rating of Opponents’.
This feature is a bit more of a challenge to extract, so we will be taking the following steps:
#select the 4th through 10th string elements from 'splitNamesRow'
tournament_df$Ave_Pre_Rating_Opponents <- lapply(splitNamesRow,'[', 4:10)
#use a combination of gsub() and str_extract_all() (from the stringr library)
#to select the numeric character for each round.
library(stringr)
tournament_df$Ave_Pre_Rating_Opponents <- str_extract_all(
gsub( '\\"[A-Za-z]', '\\1', tournament_df$Ave_Pre_Rating_Opponents),
"\\(?[0-9]+\\)?" )
#The numbers we extracted give the numeric ID of a Player's opponent
#for every round where a player faced an opponent (win, lose, or draw)
#in this for loop, we use the numeric ID as an index to select the opponent's
#pre-tournament ranking. We will keep track of the pre-tournament ranking
#and keep track of the number of rounds where there was no opponent.
num_records <- length(tournament_df$Ave_Pre_Rating_Opponents)
scores <- rep(0, num_records)
numRounds <- rep(0, num_records)
for (round in seq(1:7) ){
scoresIDX <- as.numeric(
lapply(tournament_df$Ave_Pre_Rating_Opponents, '[', round))
scores2ADD <- tournament_df$Pre_Rating[scoresIDX]
NAadd <- as.numeric(is.na(scores2ADD ))
numRounds <- numRounds + NAadd
scores2ADD[is.na(scores2ADD)] <- 0
scores <- scores + scores2ADD
}
#Find the average pre-tournament rating of a player's opponents
#subtract the number of Rounds where there was no opponent (e.g. byes)
totalRoundsPlayed <- rep(7, num_records)-numRounds
#divide the sum of the opponent scores by the number of opponents
tournament_df$Ave_Pre_Rating_Opponents <- round(scores/totalRoundsPlayed)
head( tournament_df )## Name State Total_Num_Points Pre_Rating
## 1 GARY HUA ON 6.0 1794
## 2 DAKSHESH DARURI MI 6.0 1553
## 3 ADITYA BAJAJ MI 6.0 1384
## 4 PATRICK H SCHILLING MI 5.5 1716
## 5 HANSHI ZUO MI 5.5 1655
## 6 HANSEN SONG OH 5.0 1686
## Ave_Pre_Rating_Opponents
## 1 1605
## 2 1469
## 3 1564
## 4 1574
## 5 1501
## 6 1519
Our hard work paid off: we wrangled our data into a concise tabular form as an r data.frame. Now we need to write the data.frame to a .csv file format. Ths csv format is convenient, because it can be loaded into various platforms for further data management and/or manipulation.
#this step is a simple oneliner with the function write.csv():
write.csv( tournament_df, file ='tournamentDATA.csv', row.names = FALSE)Now that we have extracted the data of interest from the original text file, we can perform some preliminary analysis.
We can start by using the summary() function to look at some simple descriptive statistics of the features (columns) that we derived from the raw text:
## Name State Total_Num_Points Pre_Rating
## Length:64 Length:64 Min. :1.000 Min. : 377
## Class :character Class :character 1st Qu.:2.500 1st Qu.:1227
## Mode :character Mode :character Median :3.500 Median :1407
## Mean :3.438 Mean :1378
## 3rd Qu.:4.000 3rd Qu.:1583
## Max. :6.000 Max. :1794
## Ave_Pre_Rating_Opponents
## Min. :1107
## 1st Qu.:1310
## Median :1382
## Mean :1379
## 3rd Qu.:1481
## Max. :1605
We can explore the data further with a few simple visualizations
Let’s plot the average pre-tournament rating of a player’s opponents as a function of the player’s pre tournament rating:
library( ggplot2 )
library(ggpmisc)
my.formula <- y ~ x
p <- ggplot(tournament_df, aes(x=Pre_Rating, y=Ave_Pre_Rating_Opponents)) +
geom_point(size=2, shape=23) +
geom_smooth(method=lm, se=T) +
ggtitle('Mean Opponent Rating as a function of\nPre-tournament Rating') +
xlab('Player’s Pre-Tournament Rating') +
ylab("Opponent's Average Pre-Tournament Rating") +
stat_poly_eq(formula = my.formula,
aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")),
parse = TRUE)
pFrom the figure above, we see that there is a suggestion of a positive trend in the data given by the regression line (blue line). However, the variability of the data points is very high about the regression line and the R2 value is very low.
What if we look for regional differences in the data? Do different states have players with higher average pre-tournament rankings?
# Set color by cond
ggplot(tournament_df, aes(x=Pre_Rating, y=Ave_Pre_Rating_Opponents,
color=State)) +
geom_point(alpha=0.5, size=4) +
ggtitle('Mean Opponent Rating as a function of Pre-tournament Rating\nGrouped by State') +
xlab('Player’s Pre-Tournament Rating') +
ylab("Opponent's Average Pre-Tournament Rating")library(dplyr)
agg_byState <- tournament_df %>% group_by(State) %>%
summarise( numPlayers = n(), Mean = mean(Pre_Rating), Median = median(Pre_Rating), IQRange = IQR(Pre_Rating))
agg_byState## # A tibble: 3 x 5
## State numPlayers Mean Median IQRange
## <chr> <int> <dbl> <dbl> <dbl>
## 1 MI 55 1362 1393 370
## 2 OH 1 1686 1686 0
## 3 ON 8 1454. 1482. 168
# Basic box plot
p <- ggplot(tournament_df, aes(x=reorder(State, Pre_Rating, FUN=median), y=Pre_Rating, fill=State)) +
geom_boxplot() +
ggtitle('Pre-tournament Rating Grouped by State') +
ylab('Mean Pre-Tournament Rating') +
xlab("Player's State")
pAs we can see from both the results of the aggregation and the boxplot visualization, there are very few data points from the states ‘OH’ and ‘ON’ (presumably Ohio and Ontario). Therefore, we cannot draw any solid conclusions about differences between the groups. However, we observe that players that traveled from a state other than Michigan have a higher median pre-tournament rating.
In this demo we wrangled data from a text file, restructured the data features into a concise format and wrote the output to a .csv file. The reformated data is more tractable that the raw text version because it omits superfluous characters. Additionally, the .csv output can be imported to many different platforms for further downstream analysis. Therefore, although we were not able to draw many meaningful insights from this as a standalone dataset, the data can be combined with other information in future analyses and/or this code can be repurposed for a similar but much larger raw data set.
Images from The Simpsons ‘The Cad and the Hat’ Season 28,Episode 14.