For this project students were provided a text file containing the results of a chess tournament structured as a cross-table. Our task was to deliver an R Markdown file that generates a CSV file with the name of each player included in the given text file, their state or province, total number of points earned in terms of round outcomes (e.g. wins, losses, draws, byes, forfeitures, etc.), their Elo rating prior to tournament play, and the average pre-tournament rating of their opponents. I relied heavily on the text processing functions provided in the stringr package to manipulate the given cross-table text into a form that could be more easily analyzed and accurately rendered as a CSV file with the desired variables.
library(readr)
library(stringr)
library(ggplot2)
The read_lines() function from the readr package works in the same way as the base::readLines() function, returning a character vector with one element for each line read in from the source string or text file. I use it here to import the tournamentinfo.txt file from my remote GitHub repository for this project. Unlike the corresponding base functions, readr functions can connect to and read in from HTTPS URLs without additional workarounds or function calls.
raw_data <- read_lines("https://github.com/juddanderman/Data_607_Project_1/raw/master/tournamentinfo.txt")
head(raw_data)
## [1] "-----------------------------------------------------------------------------------------"
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [4] "-----------------------------------------------------------------------------------------"
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
In the first step of cleaning the vector of character strings assigned to raw_data above, I loop through each element of raw_data and use the str_detect() and str_c() functions to concatenate adjacent vector elements on the condition that both the indexed element and the next proceeding element contain meaningful content, i.e do not begin with “-” as the separator rows in the given cross-table do. During each iteration, the concatenated strings, which correspond to the variable labels in the cross-table and the data for the players in the tournament, are incrementally assigned (using a separate index from the loop counter) to the elements of cleaner, which was initialized as a vector of empty strings with the appropriate length before the loop to avoid growing the vector during execution of the loop.
cleaner <- vector(mode = "character", length = length(raw_data[str_detect(raw_data, "^[^-]")]) / 2)
j <- 1
for (i in 1:length(raw_data)) {
if (str_detect(raw_data[i], "^[^-]") && str_detect(raw_data[i + 1], "^[^-]")) {
cleaner[j] <- str_c(raw_data[i], raw_data[i+1])
j <- j + 1
}
}
cleaner[1:3]
## [1] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [2] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4| ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [3] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7| MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
In the next data cleaning step, I use the str_split() function to split each element of the cleaner vector at every vertical bar, forward slash, and right arrow (i.e. a hypen followed by a greater than sign, “->”), which represent column boundaries, separators between USCF ID numbers and ratings, and separators between pre- and post-ratings respectively in the original cross-table. Here, I use the argument simplify = TRUE in the function call so that str_split() returns a matrix of character vectors, and in this case, splitting leaves us with individual character strings for each variable label and data element.
cleaner <- str_split(cleaner, "[|]|/|(->)", simplify = TRUE)
cleaner[1:3, -23]
## [,1] [,2] [,3] [,4] [,5]
## [1,] " Pair " " Player Name " "Total" "Round" "Round"
## [2,] " 1 " " GARY HUA " "6.0 " "W 39" "W 21"
## [3,] " 2 " " DAKSHESH DARURI " "6.0 " "W 63" "W 58"
## [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] "Round" "Round" "Round" "Round" "Round" " Num " " USCF ID "
## [2,] "W 18" "W 14" "W 7" "D 12" "D 4" " ON " " 15445895 "
## [3,] "L 4" "W 17" "W 16" "W 20" "W 7" " MI " " 14598900 "
## [,13] [,14] [,15] [,16] [,17] [,18] [,19]
## [1,] " Rtg (Pre" "Post) " " Pts " " 1 " " 2 " " 3 " " 4 "
## [2,] " R: 1794 " "1817 " "N:2 " "W " "B " "W " "B "
## [3,] " R: 1553 " "1663 " "N:2 " "B " "W " "B " "W "
## [,20] [,21] [,22]
## [1,] " 5 " " 6 " " 7 "
## [2,] "W " "B " "W "
## [3,] "B " "W " "B "
cleaner[, 23]
## [1] " " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
## [18] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
## [35] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
## [52] "" "" "" "" "" "" "" "" "" "" "" "" "" ""
Noting that the elements in the 23rd column of what is now a cleaner matrix are either exclusively space characters or empty strings, I remove that column from the matrix. Then I replace a number of the elements of the first row of cleaner with new strings reflecting clearer names for the variables of interest. I also duplicate the columns containing strings that refer to the outcomes and opponents played in each round and combine those columns to cleaner so that I can retain round outcome and opponent values as distinct columns and eventually process those variables independently. Before moving on to cleaning the strings representing the player data, or values for each of the variables, I use colnames() to name the columns of the matrix with the character strings from the first row which represent the names of the variables, and then drop that row from the matrix.
cleaner <- cleaner[, -23]
cleaner[1, c(1, 3, 4:10, 11, 13:15, 16:22)] <- c("Player_ID", "Total_Pts", str_c("Round", 1:7, "_Pair"), "State", "Rtg_Pre", "Rtg_Post", "Norm", str_c("Round", 1:7, "_Color"))
cleaner[1, ] <- str_trim(cleaner[1, ])
cleaner <- cbind(cleaner, cleaner[, 4:10])
cleaner[1, 23:29] <- str_c("Round", 1:7, "_Outcome")
cleaner[1, ]
## [1] "Player_ID" "Player Name" "Total_Pts" "Round1_Pair"
## [5] "Round2_Pair" "Round3_Pair" "Round4_Pair" "Round5_Pair"
## [9] "Round6_Pair" "Round7_Pair" "State" "USCF ID"
## [13] "Rtg_Pre" "Rtg_Post" "Norm" "Round1_Color"
## [17] "Round2_Color" "Round3_Color" "Round4_Color" "Round5_Color"
## [21] "Round6_Color" "Round7_Color" "Round1_Outcome" "Round2_Outcome"
## [25] "Round3_Outcome" "Round4_Outcome" "Round5_Outcome" "Round6_Outcome"
## [29] "Round7_Outcome"
colnames(cleaner) <- cleaner[1, ]
cleaner <- cleaner[-1, ]
cleaner[1:3, ]
## Player_ID Player Name Total_Pts Round1_Pair
## [1,] " 1 " " GARY HUA " "6.0 " "W 39"
## [2,] " 2 " " DAKSHESH DARURI " "6.0 " "W 63"
## [3,] " 3 " " ADITYA BAJAJ " "6.0 " "L 8"
## Round2_Pair Round3_Pair Round4_Pair Round5_Pair Round6_Pair
## [1,] "W 21" "W 18" "W 14" "W 7" "D 12"
## [2,] "W 58" "L 4" "W 17" "W 16" "W 20"
## [3,] "W 61" "W 25" "W 21" "W 11" "W 13"
## Round7_Pair State USCF ID Rtg_Pre Rtg_Post Norm
## [1,] "D 4" " ON " " 15445895 " " R: 1794 " "1817 " "N:2 "
## [2,] "W 7" " MI " " 14598900 " " R: 1553 " "1663 " "N:2 "
## [3,] "W 12" " MI " " 14959604 " " R: 1384 " "1640 " "N:2 "
## Round1_Color Round2_Color Round3_Color Round4_Color Round5_Color
## [1,] "W " "B " "W " "B " "W "
## [2,] "B " "W " "B " "W " "B "
## [3,] "W " "B " "W " "B " "W "
## Round6_Color Round7_Color Round1_Outcome Round2_Outcome
## [1,] "B " "W " "W 39" "W 21"
## [2,] "W " "B " "W 63" "W 58"
## [3,] "B " "W " "L 8" "W 61"
## Round3_Outcome Round4_Outcome Round5_Outcome Round6_Outcome
## [1,] "W 18" "W 14" "W 7" "D 12"
## [2,] "L 4" "W 17" "W 16" "W 20"
## [3,] "W 25" "W 21" "W 11" "W 13"
## Round7_Outcome
## [1,] "D 4"
## [2,] "W 7"
## [3,] "W 12"
Below, I loop through the columns of cleaner and use regular expressions and the str_extract(), str_replace(), and str_trim() functions to replace the column elements with cleaned-up strings representing single values for each variable without leading or trailing spaces. These string manipulation and matrix element replacement operations are conditioned on the column names of cleaner so that the element re-assignments occur as expected. For example, the elements of the columns called RoundX_Pair and RoundX_Outcome which formerly contained identical character strings, are here assigned just the opponent pair ID numbers in the first case and just the match outcomes in the second. In addition, provisional ratings, e.g “P17”, and extraneous characters, e.g. “R: ”, are removed from the strings corresponding to players’ pre- and post-tournament ratings. Lastly, the matrix is coerced to a data frame so that the elements can eventually also be coerced to a mix of the appropriate data types.
for (i in 1:ncol(cleaner)) {
if (str_detect(colnames(cleaner)[i], "Pair")) {
cleaner[, i] <- str_extract(cleaner[, i], "(\\d+$)")
} else if (str_detect(colnames(cleaner)[i], "Outcome")) {
cleaner[, i] <- str_extract(cleaner[, i], "(^\\w+)")
} else if (colnames(cleaner)[i] == "Rtg_Pre") {
cleaner[, i] <- str_trim(str_replace(str_extract(cleaner[, i], "(\\w+)(( +)?$)"), "P\\d+", ""))
} else if (colnames(cleaner)[i] == "Rtg_Post") {
cleaner[, i] <- str_trim(str_replace(cleaner[, i], "P\\d+", ""))
} else {
cleaner[ , i] <- str_trim(cleaner[, i])
}
}
cleaner <- as.data.frame(cleaner, stringsAsFactors = FALSE)
cleaner[1:3, ]
## Player_ID Player Name Total_Pts Round1_Pair Round2_Pair Round3_Pair
## 1 1 GARY HUA 6.0 39 21 18
## 2 2 DAKSHESH DARURI 6.0 63 58 4
## 3 3 ADITYA BAJAJ 6.0 8 61 25
## Round4_Pair Round5_Pair Round6_Pair Round7_Pair State USCF ID Rtg_Pre
## 1 14 7 12 4 ON 15445895 1794
## 2 17 16 20 7 MI 14598900 1553
## 3 21 11 13 12 MI 14959604 1384
## Rtg_Post Norm Round1_Color Round2_Color Round3_Color Round4_Color
## 1 1817 N:2 W B W B
## 2 1663 N:2 B W B W
## 3 1640 N:2 W B W B
## Round5_Color Round6_Color Round7_Color Round1_Outcome Round2_Outcome
## 1 W B W W W
## 2 B W B W W
## 3 W B W L W
## Round3_Outcome Round4_Outcome Round5_Outcome Round6_Outcome
## 1 W W W D
## 2 L W W W
## 3 W W W W
## Round7_Outcome
## 1 D
## 2 W
## 3 W
Below, I create a new vector opp_mean to store the average pre-tournament rating for each player’s opponents, calculate those mean values (using the argument na.rm = TRUE to disregard unplayed rounds), round the means to integer values, and assign the resulting value to the element in opp_mean corresponding to the relevant player. Then the desired columns from cleaner and the opp_mean vector are combined in a new data frame called output which will be used to write the CSV output. The structure of output is quickly examined using the str() function and its contents are reproduced as a table in my R Markdown output using the knitr::kable() function.
opp_mean <- vector(mode = "integer", length = nrow(cleaner))
for (i in 1:nrow(cleaner)) {
opp_mean[i] <- round(mean(as.numeric(cleaner$Rtg_Pre[as.numeric(unlist(cleaner[i, 4:10]))]), na.rm = TRUE), digits = 0)
}
output <- data.frame(Player_Name = cleaner$`Player Name`, State_Code = cleaner$State, Total_Pts = cleaner$Total_Pts, Pre_Rating = as.numeric(cleaner$Rtg_Pre), Opponents_Avg_Pre_Rtg = opp_mean, Post_Rating = as.numeric(cleaner$Rtg_Post), stringsAsFactors = FALSE)
str(output)
## 'data.frame': 64 obs. of 6 variables:
## $ Player_Name : chr "GARY HUA" "DAKSHESH DARURI" "ADITYA BAJAJ" "PATRICK H SCHILLING" ...
## $ State_Code : chr "ON" "MI" "MI" "MI" ...
## $ Total_Pts : chr "6.0" "6.0" "6.0" "5.5" ...
## $ Pre_Rating : num 1794 1553 1384 1716 1655 ...
## $ Opponents_Avg_Pre_Rtg: num 1605 1469 1564 1574 1501 ...
## $ Post_Rating : num 1817 1663 1640 1744 1690 ...
knitr::kable(output)
| Player_Name | State_Code | Total_Pts | Pre_Rating | Opponents_Avg_Pre_Rtg | Post_Rating |
|---|---|---|---|---|---|
| GARY HUA | ON | 6.0 | 1794 | 1605 | 1817 |
| DAKSHESH DARURI | MI | 6.0 | 1553 | 1469 | 1663 |
| ADITYA BAJAJ | MI | 6.0 | 1384 | 1564 | 1640 |
| PATRICK H SCHILLING | MI | 5.5 | 1716 | 1574 | 1744 |
| HANSHI ZUO | MI | 5.5 | 1655 | 1501 | 1690 |
| HANSEN SONG | OH | 5.0 | 1686 | 1519 | 1687 |
| GARY DEE SWATHELL | MI | 5.0 | 1649 | 1372 | 1673 |
| EZEKIEL HOUGHTON | MI | 5.0 | 1641 | 1468 | 1657 |
| STEFANO LEE | ON | 5.0 | 1411 | 1523 | 1564 |
| ANVIT RAO | MI | 5.0 | 1365 | 1554 | 1544 |
| CAMERON WILLIAM MC LEMAN | MI | 4.5 | 1712 | 1468 | 1696 |
| KENNETH J TACK | MI | 4.5 | 1663 | 1506 | 1670 |
| TORRANCE HENRY JR | MI | 4.5 | 1666 | 1498 | 1662 |
| BRADLEY SHAW | MI | 4.5 | 1610 | 1515 | 1618 |
| ZACHARY JAMES HOUGHTON | MI | 4.5 | 1220 | 1484 | 1416 |
| MIKE NIKITIN | MI | 4.0 | 1604 | 1386 | 1613 |
| RONALD GRZEGORCZYK | MI | 4.0 | 1629 | 1499 | 1610 |
| DAVID SUNDEEN | MI | 4.0 | 1600 | 1480 | 1600 |
| DIPANKAR ROY | MI | 4.0 | 1564 | 1426 | 1570 |
| JASON ZHENG | MI | 4.0 | 1595 | 1411 | 1569 |
| DINH DANG BUI | ON | 4.0 | 1563 | 1470 | 1562 |
| EUGENE L MCCLURE | MI | 4.0 | 1555 | 1300 | 1529 |
| ALAN BUI | ON | 4.0 | 1363 | 1214 | 1371 |
| MICHAEL R ALDRICH | MI | 4.0 | 1229 | 1357 | 1300 |
| LOREN SCHWIEBERT | MI | 3.5 | 1745 | 1363 | 1681 |
| MAX ZHU | ON | 3.5 | 1579 | 1507 | 1564 |
| GAURAV GIDWANI | MI | 3.5 | 1552 | 1222 | 1539 |
| SOFIA ADINA STANESCU-BELLU | MI | 3.5 | 1507 | 1522 | 1513 |
| CHIEDOZIE OKORIE | MI | 3.5 | 1602 | 1314 | 1508 |
| GEORGE AVERY JONES | ON | 3.5 | 1522 | 1144 | 1444 |
| RISHI SHETTY | MI | 3.5 | 1494 | 1260 | 1444 |
| JOSHUA PHILIP MATHEWS | ON | 3.5 | 1441 | 1379 | 1433 |
| JADE GE | MI | 3.5 | 1449 | 1277 | 1421 |
| MICHAEL JEFFERY THOMAS | MI | 3.5 | 1399 | 1375 | 1400 |
| JOSHUA DAVID LEE | MI | 3.5 | 1438 | 1150 | 1392 |
| SIDDHARTH JHA | MI | 3.5 | 1355 | 1388 | 1367 |
| AMIYATOSH PWNANANDAM | MI | 3.5 | 980 | 1385 | 1077 |
| BRIAN LIU | MI | 3.0 | 1423 | 1539 | 1439 |
| JOEL R HENDON | MI | 3.0 | 1436 | 1430 | 1413 |
| FOREST ZHANG | MI | 3.0 | 1348 | 1391 | 1346 |
| KYLE WILLIAM MURPHY | MI | 3.0 | 1403 | 1248 | 1341 |
| JARED GE | MI | 3.0 | 1332 | 1150 | 1256 |
| ROBERT GLEN VASEY | MI | 3.0 | 1283 | 1107 | 1244 |
| JUSTIN D SCHILLING | MI | 3.0 | 1199 | 1327 | 1199 |
| DEREK YAN | MI | 3.0 | 1242 | 1152 | 1191 |
| JACOB ALEXANDER LAVALLEY | MI | 3.0 | 377 | 1358 | 1076 |
| ERIC WRIGHT | MI | 2.5 | 1362 | 1392 | 1341 |
| DANIEL KHAIN | MI | 2.5 | 1382 | 1356 | 1335 |
| MICHAEL J MARTIN | MI | 2.5 | 1291 | 1286 | 1259 |
| SHIVAM JHA | MI | 2.5 | 1056 | 1296 | 1111 |
| TEJAS AYYAGARI | MI | 2.5 | 1011 | 1356 | 1097 |
| ETHAN GUO | MI | 2.5 | 935 | 1495 | 1092 |
| JOSE C YBARRA | MI | 2.0 | 1393 | 1345 | 1359 |
| LARRY HODGE | MI | 2.0 | 1270 | 1206 | 1200 |
| ALEX KONG | MI | 2.0 | 1186 | 1406 | 1163 |
| MARISA RICCI | MI | 2.0 | 1153 | 1414 | 1140 |
| MICHAEL LU | MI | 2.0 | 1092 | 1363 | 1079 |
| VIRAJ MOHILE | MI | 2.0 | 917 | 1391 | 941 |
| SEAN M MC CORMICK | MI | 2.0 | 853 | 1319 | 878 |
| JULIA SHEN | MI | 1.5 | 967 | 1330 | 984 |
| JEZZEL FARKAS | ON | 1.5 | 955 | 1327 | 979 |
| ASHWIN BALAJI | MI | 1.0 | 1530 | 1186 | 1535 |
| THOMAS JOSEPH HOSMER | MI | 1.0 | 1175 | 1350 | 1125 |
| BEN LI | MI | 1.0 | 1163 | 1263 | 1112 |
Here, the requested CSV file is written to the working directory, the file name and path can be changed as needed.
# Write CSV file to working directory
write.csv(output, file = "chess_player_data.csv")
Finally, I visualize a few of the variables from the tournament data set using plotting functions from the ggplot2 package. In particular, I graph the distribution of Total_Pts which represents point totals earned through match outcomes of win, draw, bye, and so on; the distribution of players’ pre-tournament Elo ratings; the relationship between players’ pre-ratings and their opponents’ average pre-ratings; the relationship between Total_Pts and opponents’ average pre-ratings; and the relationship between opponents’ average pre-ratings and the difference between each player’s pre- and post-tournament rating.
ggplot(output, aes(Total_Pts)) + geom_bar()
ggplot(output, aes(Pre_Rating)) + geom_histogram(binwidth = 50)
ggplot(output, aes(x = Pre_Rating, y = Opponents_Avg_Pre_Rtg)) + geom_point()
ggplot(output, aes(x = Total_Pts, y = Opponents_Avg_Pre_Rtg)) + geom_boxplot() + coord_flip()
ggplot(output, aes(x = Opponents_Avg_Pre_Rtg, y = Post_Rating - Pre_Rating)) + geom_point() +
ylab("Delta_Rating") +
geom_hline(yintercept = mean(output$Post_Rating - output$Pre_Rating)) +
geom_vline(xintercept = mean(output$Opponents_Avg_Pre_Rtg))