Project overview

For this project students were provided a text file containing the results of a chess tournament structured as a cross-table. Our task was to deliver an R Markdown file that generates a CSV file with the name of each player included in the given text file, their state or province, total number of points earned in terms of round outcomes (e.g. wins, losses, draws, byes, forfeitures, etc.), their Elo rating prior to tournament play, and the average pre-tournament rating of their opponents. I relied heavily on the text processing functions provided in the stringr package to manipulate the given cross-table text into a form that could be more easily analyzed and accurately rendered as a CSV file with the desired variables.

Load required packages

library(readr)
library(stringr)
library(ggplot2)

Import data from remote GitHub repository

The read_lines() function from the readr package works in the same way as the base::readLines() function, returning a character vector with one element for each line read in from the source string or text file. I use it here to import the tournamentinfo.txt file from my remote GitHub repository for this project. Unlike the corresponding base functions, readr functions can connect to and read in from HTTPS URLs without additional workarounds or function calls.

raw_data <- read_lines("https://github.com/juddanderman/Data_607_Project_1/raw/master/tournamentinfo.txt")

head(raw_data)
## [1] "-----------------------------------------------------------------------------------------" 
## [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
## [4] "-----------------------------------------------------------------------------------------" 
## [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
## [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"

Cleaning up the raw data

In the first step of cleaning the vector of character strings assigned to raw_data above, I loop through each element of raw_data and use the str_detect() and str_c() functions to concatenate adjacent vector elements on the condition that both the indexed element and the next proceeding element contain meaningful content, i.e do not begin with “-” as the separator rows in the given cross-table do. During each iteration, the concatenated strings, which correspond to the variable labels in the cross-table and the data for the players in the tournament, are incrementally assigned (using a separate index from the loop counter) to the elements of cleaner, which was initialized as a vector of empty strings with the appropriate length before the loop to avoid growing the vector during execution of the loop.

cleaner <- vector(mode = "character", length = length(raw_data[str_detect(raw_data, "^[^-]")]) / 2)
j <- 1

for (i in 1:length(raw_data)) {
  if (str_detect(raw_data[i], "^[^-]") && str_detect(raw_data[i + 1], "^[^-]")) {
    cleaner[j] <- str_c(raw_data[i], raw_data[i+1])
    j <- j + 1
  }
}

cleaner[1:3]
## [1] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round|  Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
## [2] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"  
## [3] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"

In the next data cleaning step, I use the str_split() function to split each element of the cleaner vector at every vertical bar, forward slash, and right arrow (i.e. a hypen followed by a greater than sign, “->”), which represent column boundaries, separators between USCF ID numbers and ratings, and separators between pre- and post-ratings respectively in the original cross-table. Here, I use the argument simplify = TRUE in the function call so that str_split() returns a matrix of character vectors, and in this case, splitting leaves us with individual character strings for each variable label and data element.

cleaner <- str_split(cleaner, "[|]|/|(->)", simplify = TRUE)

cleaner[1:3, -23]
##      [,1]     [,2]                                [,3]    [,4]    [,5]   
## [1,] " Pair " " Player Name                     " "Total" "Round" "Round"
## [2,] "    1 " " GARY HUA                        " "6.0  " "W  39" "W  21"
## [3,] "    2 " " DAKSHESH DARURI                 " "6.0  " "W  63" "W  58"
##      [,6]    [,7]    [,8]    [,9]    [,10]   [,11]     [,12]       
## [1,] "Round" "Round" "Round" "Round" "Round" "  Num  " " USCF ID " 
## [2,] "W  18" "W  14" "W   7" "D  12" "D   4" "   ON "  " 15445895 "
## [3,] "L   4" "W  17" "W  16" "W  20" "W   7" "   MI "  " 14598900 "
##      [,13]         [,14]          [,15]   [,16]   [,17]   [,18]   [,19]  
## [1,] " Rtg (Pre"   "Post)       " " Pts " "  1  " "  2  " "  3  " "  4  "
## [2,] " R: 1794   " "1817     "    "N:2  " "W    " "B    " "W    " "B    "
## [3,] " R: 1553   " "1663     "    "N:2  " "B    " "W    " "B    " "W    "
##      [,20]   [,21]   [,22]  
## [1,] "  5  " "  6  " "  7  "
## [2,] "W    " "B    " "W    "
## [3,] "B    " "W    " "B    "
cleaner[, 23]
##  [1] " " ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  "" 
## [18] ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  "" 
## [35] ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  "" 
## [52] ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""

Noting that the elements in the 23rd column of what is now a cleaner matrix are either exclusively space characters or empty strings, I remove that column from the matrix. Then I replace a number of the elements of the first row of cleaner with new strings reflecting clearer names for the variables of interest. I also duplicate the columns containing strings that refer to the outcomes and opponents played in each round and combine those columns to cleaner so that I can retain round outcome and opponent values as distinct columns and eventually process those variables independently. Before moving on to cleaning the strings representing the player data, or values for each of the variables, I use colnames() to name the columns of the matrix with the character strings from the first row which represent the names of the variables, and then drop that row from the matrix.

cleaner <- cleaner[, -23]

cleaner[1, c(1, 3, 4:10, 11, 13:15, 16:22)] <- c("Player_ID", "Total_Pts", str_c("Round", 1:7, "_Pair"), "State", "Rtg_Pre", "Rtg_Post", "Norm", str_c("Round", 1:7, "_Color"))
cleaner[1, ] <- str_trim(cleaner[1, ])

cleaner <- cbind(cleaner, cleaner[, 4:10])
cleaner[1, 23:29] <- str_c("Round", 1:7, "_Outcome")

cleaner[1, ]
##  [1] "Player_ID"      "Player Name"    "Total_Pts"      "Round1_Pair"   
##  [5] "Round2_Pair"    "Round3_Pair"    "Round4_Pair"    "Round5_Pair"   
##  [9] "Round6_Pair"    "Round7_Pair"    "State"          "USCF ID"       
## [13] "Rtg_Pre"        "Rtg_Post"       "Norm"           "Round1_Color"  
## [17] "Round2_Color"   "Round3_Color"   "Round4_Color"   "Round5_Color"  
## [21] "Round6_Color"   "Round7_Color"   "Round1_Outcome" "Round2_Outcome"
## [25] "Round3_Outcome" "Round4_Outcome" "Round5_Outcome" "Round6_Outcome"
## [29] "Round7_Outcome"
colnames(cleaner) <- cleaner[1, ]
cleaner <- cleaner[-1, ]

cleaner[1:3, ]
##      Player_ID Player Name                         Total_Pts Round1_Pair
## [1,] "    1 "  " GARY HUA                        " "6.0  "   "W  39"    
## [2,] "    2 "  " DAKSHESH DARURI                 " "6.0  "   "W  63"    
## [3,] "    3 "  " ADITYA BAJAJ                    " "6.0  "   "L   8"    
##      Round2_Pair Round3_Pair Round4_Pair Round5_Pair Round6_Pair
## [1,] "W  21"     "W  18"     "W  14"     "W   7"     "D  12"    
## [2,] "W  58"     "L   4"     "W  17"     "W  16"     "W  20"    
## [3,] "W  61"     "W  25"     "W  21"     "W  11"     "W  13"    
##      Round7_Pair State    USCF ID      Rtg_Pre       Rtg_Post    Norm   
## [1,] "D   4"     "   ON " " 15445895 " " R: 1794   " "1817     " "N:2  "
## [2,] "W   7"     "   MI " " 14598900 " " R: 1553   " "1663     " "N:2  "
## [3,] "W  12"     "   MI " " 14959604 " " R: 1384   " "1640     " "N:2  "
##      Round1_Color Round2_Color Round3_Color Round4_Color Round5_Color
## [1,] "W    "      "B    "      "W    "      "B    "      "W    "     
## [2,] "B    "      "W    "      "B    "      "W    "      "B    "     
## [3,] "W    "      "B    "      "W    "      "B    "      "W    "     
##      Round6_Color Round7_Color Round1_Outcome Round2_Outcome
## [1,] "B    "      "W    "      "W  39"        "W  21"       
## [2,] "W    "      "B    "      "W  63"        "W  58"       
## [3,] "B    "      "W    "      "L   8"        "W  61"       
##      Round3_Outcome Round4_Outcome Round5_Outcome Round6_Outcome
## [1,] "W  18"        "W  14"        "W   7"        "D  12"       
## [2,] "L   4"        "W  17"        "W  16"        "W  20"       
## [3,] "W  25"        "W  21"        "W  11"        "W  13"       
##      Round7_Outcome
## [1,] "D   4"       
## [2,] "W   7"       
## [3,] "W  12"

Below, I loop through the columns of cleaner and use regular expressions and the str_extract(), str_replace(), and str_trim() functions to replace the column elements with cleaned-up strings representing single values for each variable without leading or trailing spaces. These string manipulation and matrix element replacement operations are conditioned on the column names of cleaner so that the element re-assignments occur as expected. For example, the elements of the columns called RoundX_Pair and RoundX_Outcome which formerly contained identical character strings, are here assigned just the opponent pair ID numbers in the first case and just the match outcomes in the second. In addition, provisional ratings, e.g “P17”, and extraneous characters, e.g. “R: ”, are removed from the strings corresponding to players’ pre- and post-tournament ratings. Lastly, the matrix is coerced to a data frame so that the elements can eventually also be coerced to a mix of the appropriate data types.

for (i in 1:ncol(cleaner)) {
  if (str_detect(colnames(cleaner)[i], "Pair")) {
    cleaner[, i] <- str_extract(cleaner[, i], "(\\d+$)")
  } else if (str_detect(colnames(cleaner)[i], "Outcome")) {
    cleaner[, i] <- str_extract(cleaner[, i], "(^\\w+)")
  } else if (colnames(cleaner)[i] == "Rtg_Pre") {
    cleaner[, i] <- str_trim(str_replace(str_extract(cleaner[, i], "(\\w+)(( +)?$)"), "P\\d+", ""))
  } else if (colnames(cleaner)[i] == "Rtg_Post") {
    cleaner[, i] <- str_trim(str_replace(cleaner[, i], "P\\d+", ""))
  } else {
    cleaner[ , i] <- str_trim(cleaner[, i])
  }
}

cleaner <- as.data.frame(cleaner, stringsAsFactors = FALSE)

cleaner[1:3, ]
##   Player_ID     Player Name Total_Pts Round1_Pair Round2_Pair Round3_Pair
## 1         1        GARY HUA       6.0          39          21          18
## 2         2 DAKSHESH DARURI       6.0          63          58           4
## 3         3    ADITYA BAJAJ       6.0           8          61          25
##   Round4_Pair Round5_Pair Round6_Pair Round7_Pair State  USCF ID Rtg_Pre
## 1          14           7          12           4    ON 15445895    1794
## 2          17          16          20           7    MI 14598900    1553
## 3          21          11          13          12    MI 14959604    1384
##   Rtg_Post Norm Round1_Color Round2_Color Round3_Color Round4_Color
## 1     1817  N:2            W            B            W            B
## 2     1663  N:2            B            W            B            W
## 3     1640  N:2            W            B            W            B
##   Round5_Color Round6_Color Round7_Color Round1_Outcome Round2_Outcome
## 1            W            B            W              W              W
## 2            B            W            B              W              W
## 3            W            B            W              L              W
##   Round3_Outcome Round4_Outcome Round5_Outcome Round6_Outcome
## 1              W              W              W              D
## 2              L              W              W              W
## 3              W              W              W              W
##   Round7_Outcome
## 1              D
## 2              W
## 3              W

Below, I create a new vector opp_mean to store the average pre-tournament rating for each player’s opponents, calculate those mean values (using the argument na.rm = TRUE to disregard unplayed rounds), round the means to integer values, and assign the resulting value to the element in opp_mean corresponding to the relevant player. Then the desired columns from cleaner and the opp_mean vector are combined in a new data frame called output which will be used to write the CSV output. The structure of output is quickly examined using the str() function and its contents are reproduced as a table in my R Markdown output using the knitr::kable() function.

opp_mean <- vector(mode = "integer", length = nrow(cleaner))

for (i in 1:nrow(cleaner)) {
  opp_mean[i] <- round(mean(as.numeric(cleaner$Rtg_Pre[as.numeric(unlist(cleaner[i, 4:10]))]), na.rm = TRUE), digits = 0)
}

output <- data.frame(Player_Name = cleaner$`Player Name`, State_Code = cleaner$State, Total_Pts = cleaner$Total_Pts, Pre_Rating = as.numeric(cleaner$Rtg_Pre), Opponents_Avg_Pre_Rtg = opp_mean, Post_Rating = as.numeric(cleaner$Rtg_Post), stringsAsFactors = FALSE)

str(output)
## 'data.frame':    64 obs. of  6 variables:
##  $ Player_Name          : chr  "GARY HUA" "DAKSHESH DARURI" "ADITYA BAJAJ" "PATRICK H SCHILLING" ...
##  $ State_Code           : chr  "ON" "MI" "MI" "MI" ...
##  $ Total_Pts            : chr  "6.0" "6.0" "6.0" "5.5" ...
##  $ Pre_Rating           : num  1794 1553 1384 1716 1655 ...
##  $ Opponents_Avg_Pre_Rtg: num  1605 1469 1564 1574 1501 ...
##  $ Post_Rating          : num  1817 1663 1640 1744 1690 ...
knitr::kable(output)
Player_Name State_Code Total_Pts Pre_Rating Opponents_Avg_Pre_Rtg Post_Rating
GARY HUA ON 6.0 1794 1605 1817
DAKSHESH DARURI MI 6.0 1553 1469 1663
ADITYA BAJAJ MI 6.0 1384 1564 1640
PATRICK H SCHILLING MI 5.5 1716 1574 1744
HANSHI ZUO MI 5.5 1655 1501 1690
HANSEN SONG OH 5.0 1686 1519 1687
GARY DEE SWATHELL MI 5.0 1649 1372 1673
EZEKIEL HOUGHTON MI 5.0 1641 1468 1657
STEFANO LEE ON 5.0 1411 1523 1564
ANVIT RAO MI 5.0 1365 1554 1544
CAMERON WILLIAM MC LEMAN MI 4.5 1712 1468 1696
KENNETH J TACK MI 4.5 1663 1506 1670
TORRANCE HENRY JR MI 4.5 1666 1498 1662
BRADLEY SHAW MI 4.5 1610 1515 1618
ZACHARY JAMES HOUGHTON MI 4.5 1220 1484 1416
MIKE NIKITIN MI 4.0 1604 1386 1613
RONALD GRZEGORCZYK MI 4.0 1629 1499 1610
DAVID SUNDEEN MI 4.0 1600 1480 1600
DIPANKAR ROY MI 4.0 1564 1426 1570
JASON ZHENG MI 4.0 1595 1411 1569
DINH DANG BUI ON 4.0 1563 1470 1562
EUGENE L MCCLURE MI 4.0 1555 1300 1529
ALAN BUI ON 4.0 1363 1214 1371
MICHAEL R ALDRICH MI 4.0 1229 1357 1300
LOREN SCHWIEBERT MI 3.5 1745 1363 1681
MAX ZHU ON 3.5 1579 1507 1564
GAURAV GIDWANI MI 3.5 1552 1222 1539
SOFIA ADINA STANESCU-BELLU MI 3.5 1507 1522 1513
CHIEDOZIE OKORIE MI 3.5 1602 1314 1508
GEORGE AVERY JONES ON 3.5 1522 1144 1444
RISHI SHETTY MI 3.5 1494 1260 1444
JOSHUA PHILIP MATHEWS ON 3.5 1441 1379 1433
JADE GE MI 3.5 1449 1277 1421
MICHAEL JEFFERY THOMAS MI 3.5 1399 1375 1400
JOSHUA DAVID LEE MI 3.5 1438 1150 1392
SIDDHARTH JHA MI 3.5 1355 1388 1367
AMIYATOSH PWNANANDAM MI 3.5 980 1385 1077
BRIAN LIU MI 3.0 1423 1539 1439
JOEL R HENDON MI 3.0 1436 1430 1413
FOREST ZHANG MI 3.0 1348 1391 1346
KYLE WILLIAM MURPHY MI 3.0 1403 1248 1341
JARED GE MI 3.0 1332 1150 1256
ROBERT GLEN VASEY MI 3.0 1283 1107 1244
JUSTIN D SCHILLING MI 3.0 1199 1327 1199
DEREK YAN MI 3.0 1242 1152 1191
JACOB ALEXANDER LAVALLEY MI 3.0 377 1358 1076
ERIC WRIGHT MI 2.5 1362 1392 1341
DANIEL KHAIN MI 2.5 1382 1356 1335
MICHAEL J MARTIN MI 2.5 1291 1286 1259
SHIVAM JHA MI 2.5 1056 1296 1111
TEJAS AYYAGARI MI 2.5 1011 1356 1097
ETHAN GUO MI 2.5 935 1495 1092
JOSE C YBARRA MI 2.0 1393 1345 1359
LARRY HODGE MI 2.0 1270 1206 1200
ALEX KONG MI 2.0 1186 1406 1163
MARISA RICCI MI 2.0 1153 1414 1140
MICHAEL LU MI 2.0 1092 1363 1079
VIRAJ MOHILE MI 2.0 917 1391 941
SEAN M MC CORMICK MI 2.0 853 1319 878
JULIA SHEN MI 1.5 967 1330 984
JEZZEL FARKAS ON 1.5 955 1327 979
ASHWIN BALAJI MI 1.0 1530 1186 1535
THOMAS JOSEPH HOSMER MI 1.0 1175 1350 1125
BEN LI MI 1.0 1163 1263 1112

Here, the requested CSV file is written to the working directory, the file name and path can be changed as needed.

# Write CSV file to working directory
write.csv(output, file = "chess_player_data.csv")

Visualizing the chess tournament data

Finally, I visualize a few of the variables from the tournament data set using plotting functions from the ggplot2 package. In particular, I graph the distribution of Total_Pts which represents point totals earned through match outcomes of win, draw, bye, and so on; the distribution of players’ pre-tournament Elo ratings; the relationship between players’ pre-ratings and their opponents’ average pre-ratings; the relationship between Total_Pts and opponents’ average pre-ratings; and the relationship between opponents’ average pre-ratings and the difference between each player’s pre- and post-tournament rating.

ggplot(output, aes(Total_Pts)) + geom_bar()

ggplot(output, aes(Pre_Rating)) + geom_histogram(binwidth = 50)

ggplot(output, aes(x = Pre_Rating, y = Opponents_Avg_Pre_Rtg)) + geom_point()

ggplot(output, aes(x = Total_Pts, y = Opponents_Avg_Pre_Rtg)) + geom_boxplot() + coord_flip()

ggplot(output, aes(x = Opponents_Avg_Pre_Rtg, y = Post_Rating - Pre_Rating)) + geom_point() + 
  ylab("Delta_Rating") + 
  geom_hline(yintercept = mean(output$Post_Rating - output$Pre_Rating)) +
  geom_vline(xintercept = mean(output$Opponents_Avg_Pre_Rtg))