How can we transform a text file with partially structured data into a more useful form? In this project I examine one such file that reports the proceedings of a chess tournament. By exploiting the structure of the text file and employing tools in R, I show how to create a tibble containing all the information in the text file in a form more amenable to further analysis. Finally, I generate a CSV containing a subset of this tibble.
There are 6 steps to the transformation procedure.
Create a tibble containing the raw data from the text file, delimited by line breaks.
Reorganize the rows of the table so that all the information pertaining to each player resides in a single row.
Reorganize the columns of the table so that each field resides in a single column.
Resolve anomalies in the data.
Compute the average pre-tournament opponent rating for each player.
Create a CSV containing the requested information on each player.
data_on_github = "https://raw.githubusercontent.com/dmoscoe/SPS/main/DATA607/proj1/tournamentinfo.txt"
chess_line_delim <- read.delim(url(data_on_github), header = FALSE, sep = "\n")
head(chess_line_delim)
## V1
## 1 -----------------------------------------------------------------------------------------
## 2 Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|
## 3 Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
## 4 -----------------------------------------------------------------------------------------
## 5 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
## 6 ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
chess_line_collapse <- c()
j <- 1
for (i in seq(1: length(chess_line_delim[,1]))) {
if (i %% 3 == 2) {
next_line <- str_c(chess_line_delim[i, 1], chess_line_delim[i + 1, 1])
chess_line_collapse <- c(chess_line_collapse, next_line)
j <- j + 1
}
}
head(chess_line_collapse)
## [1] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [2] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4| ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [3] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7| MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [4] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12| MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
## [5] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1| MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |"
## [6] " 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17| MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |"
for (i in seq(1:length(chess_line_collapse))) {
write_lines(chess_line_collapse[i], str_c(path,"chess_line_collapse.txt"), append = TRUE)
}
Viewing the txt file with a fixed-width font, and placing a “ruler” across the top of the file, allow me to easily read off the starting and ending position of each field.
I use these positions to construct \(\texttt{tidy_chess}\), a tibble with each player’s information in one row, and each field in one column. This tibble contains more information than is necessary to produce the CSV file that is the final product for this project. However, since the information is easy to capture, I preserve it.
tidy_chess <- read_fwf(chess_line_collapse,
trim_ws = TRUE, skip = 1, fwf_cols(
PairNum = c(1, 6),
USCFID = c(98, 105),
State = c(93, 94),
PlayerName = c(8, 40),
PreRating = c(112, 118),
RtgPost = c(121, 129),
TotalPtsN = c(133, 133),
TotalPoints = c(42, 46),
Rd1Opp = c(51, 52),
Rd1Color = c(137, 137),
Rd1Result = c(48, 48),
Rd2Opp = c(57, 58),
Rd2Color = c(143, 143),
Rd2Result = c(54, 54),
Rd3Opp = c(63, 64),
Rd3Color = c(149, 149),
Rd3Result = c(60, 60),
Rd4Opp = c(69, 70),
Rd4Color = c(155, 155),
Rd4Result = c(66, 66),
Rd5Opp = c(75, 76),
Rd5Color = c(161, 161),
Rd5Result = c(72, 72),
Rd6Opp = c(81, 82),
Rd6Color = c(167, 167),
Rd6Result = c(78, 78),
Rd7Opp = c(87, 88),
Rd7Color = c(173, 173),
Rd7Result = c(84, 84)
))
head(tidy_chess)
## # A tibble: 6 x 29
## PairNum USCFID State PlayerName PreRating RtgPost TotalPtsN TotalPoints Rd1Opp
## <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 1.54e7 ON GARY HUA 1794 1817 2 6 39
## 2 2 1.46e7 MI DAKSHESH ~ 1553 1663 2 6 63
## 3 3 1.50e7 MI ADITYA BA~ 1384 1640 2 6 8
## 4 4 1.26e7 MI PATRICK H~ 1716 1744 2 5.5 23
## 5 5 1.46e7 MI HANSHI ZUO 1655 1690 2 5.5 45
## 6 6 1.51e7 OH HANSEN SO~ 1686 1687 3 5 34
## # ... with 20 more variables: Rd1Color <chr>, Rd1Result <chr>, Rd2Opp <dbl>,
## # Rd2Color <chr>, Rd2Result <chr>, Rd3Opp <dbl>, Rd3Color <chr>,
## # Rd3Result <chr>, Rd4Opp <dbl>, Rd4Color <chr>, Rd4Result <chr>,
## # Rd5Opp <dbl>, Rd5Color <chr>, Rd5Result <chr>, Rd6Opp <dbl>,
## # Rd6Color <chr>, Rd6Result <chr>, Rd7Opp <dbl>, Rd7Color <chr>,
## # Rd7Result <chr>
tidy_chess <- tidy_chess %>%
mutate(PreRating = parse_number(PreRating),
RtgPost = parse_number(RtgPost))
tidy_chess$Rd1OppRtg <- tidy_chess$PreRating[match(tidy_chess$Rd1Opp, tidy_chess$PairNum)]
tidy_chess$Rd2OppRtg <- tidy_chess$PreRating[match(tidy_chess$Rd2Opp, tidy_chess$PairNum)]
tidy_chess$Rd3OppRtg <- tidy_chess$PreRating[match(tidy_chess$Rd3Opp, tidy_chess$PairNum)]
tidy_chess$Rd4OppRtg <- tidy_chess$PreRating[match(tidy_chess$Rd4Opp, tidy_chess$PairNum)]
tidy_chess$Rd5OppRtg <- tidy_chess$PreRating[match(tidy_chess$Rd5Opp, tidy_chess$PairNum)]
tidy_chess$Rd6OppRtg <- tidy_chess$PreRating[match(tidy_chess$Rd6Opp, tidy_chess$PairNum)]
tidy_chess$Rd7OppRtg <- tidy_chess$PreRating[match(tidy_chess$Rd7Opp, tidy_chess$PairNum)]
I then calculate the average across the new columns for each player to determine the player’s average pre-chess opponent rating.
tidy_chess <- tidy_chess %>%
rowwise() %>%
mutate(AvgPreChessRtgOpp = round(
mean(
c(Rd1OppRtg, Rd2OppRtg, Rd3OppRtg, Rd4OppRtg, Rd5OppRtg, Rd6OppRtg, Rd7OppRtg),
na.rm = TRUE),
0))
head(tidy_chess)
## # A tibble: 6 x 37
## # Rowwise:
## PairNum USCFID State PlayerName PreRating RtgPost TotalPtsN TotalPoints Rd1Opp
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1.54e7 ON GARY HUA 1794 1817 2 6 39
## 2 2 1.46e7 MI DAKSHESH ~ 1553 1663 2 6 63
## 3 3 1.50e7 MI ADITYA BA~ 1384 1640 2 6 8
## 4 4 1.26e7 MI PATRICK H~ 1716 1744 2 5.5 23
## 5 5 1.46e7 MI HANSHI ZUO 1655 1690 2 5.5 45
## 6 6 1.51e7 OH HANSEN SO~ 1686 1687 3 5 34
## # ... with 28 more variables: Rd1Color <chr>, Rd1Result <chr>, Rd2Opp <dbl>,
## # Rd2Color <chr>, Rd2Result <chr>, Rd3Opp <dbl>, Rd3Color <chr>,
## # Rd3Result <chr>, Rd4Opp <dbl>, Rd4Color <chr>, Rd4Result <chr>,
## # Rd5Opp <dbl>, Rd5Color <chr>, Rd5Result <chr>, Rd6Opp <dbl>,
## # Rd6Color <chr>, Rd6Result <chr>, Rd7Opp <dbl>, Rd7Color <chr>,
## # Rd7Result <chr>, Rd1OppRtg <dbl>, Rd2OppRtg <dbl>, Rd3OppRtg <dbl>,
## # Rd4OppRtg <dbl>, Rd5OppRtg <dbl>, Rd6OppRtg <dbl>, Rd7OppRtg <dbl>,
## # AvgPreChessRtgOpp <dbl>
final_product <- tidy_chess %>%
select(PlayerName,State,TotalPoints,PreRating, AvgPreChessRtgOpp)
write_csv(final_product, str_c(path,"final_product.csv"))
head(final_product)
## # A tibble: 6 x 5
## # Rowwise:
## PlayerName State TotalPoints PreRating AvgPreChessRtgOpp
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 GARY HUA ON 6 1794 1605
## 2 DAKSHESH DARURI MI 6 1553 1469
## 3 ADITYA BAJAJ MI 6 1384 1564
## 4 PATRICK H SCHILLING MI 5.5 1716 1574
## 5 HANSHI ZUO MI 5.5 1655 1501
## 6 HANSEN SONG OH 5 1686 1519
Transforming data into a useful form almost always relies on identifying and exploiting structure in the raw data. Sometimes this structure is found in reliable patterns within the data itself. In this case, extracting sequences based on a match with a regular expression can be useful. Sometimes there is a delimiter, like a comma or a tab, that distinguishes each entry from its neighbors. The data examined in this project was not consistently delimited. However, it was presented in a pseudo-tabular structure, presumably so that it could be easily read by a person. In preparing this data for analysis, I relied on this pseudo-tabular structure, along with a “ruler” that helped me identify the starting and ending positions of each field. While this strategy is simple, there is also something satisfying about relying only on the structure intentionally supplied by the “authors” of this file. It enabled me to easily capture all the data in the file, rather than narrowing my efforts to only the information required for producing the final product. The consequence is that options for future analysis of this data are broader than they otherwise might have been.
I believe there is room for improvement in my approach to computing the average pre-chess opponent rating, since this is the most repetitive part of my code. Is there a way I could have used a loop to perform this same column-creating operation seven times, without writing seven separate lines? Is there some other way to compute the required result without generating additional columns in the first place?
In future assignments I will look for ways to create more visually appealing R Markdown documents. I would like to learn more about dividing my work into visually recognizable sections, or including a sidebar for navigation.