Data 607 Project 1

Project overview

For this project students were provided a text file containing the results of a chess tournament structured as a cross-table. Our task was to deliver an R Markdown file that generates a CSV file with the name of each player included in the given text file, their state or province, total number of points earned in terms of round outcomes (e.g. wins, losses, draws, byes, forfeitures, etc.), their Elo rating prior to tournament play, and the average pre-tournament rating of their opponents. I relied heavily on the text processing functions provided in the stringr package to manipulate the given cross-table text into a form that could be more easily analyzed and accurately rendered as a CSV file with the desired variables.

Load required packages

library(readr)
library(stringr)
library(ggplot2)

Import data from remote GitHub repository

The read_lines() function from the readr package works in the same way as the base::readLines() function, returning a character vector with one element for each line read in from the source string or text file. I use it here to import the tournamentinfo.txt file from my remote GitHub repository for this project. Unlike the corresponding base functions, readr functions can connect to and read in from HTTPS URLs without additional workarounds or function calls.

raw_data <- read_lines("https://github.com/juddanderman/Data_607_Project_1/raw/master/tournamentinfo.txt")

head(raw_data)

## [1] "-----------------------------------------------------------------------------------------" 
## [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
## [4] "-----------------------------------------------------------------------------------------" 
## [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
## [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"

Cleaning up the raw data

In the first step of cleaning the vector of character strings assigned to raw_data above, I loop through each element of raw_data and use the str_detect() and str_c() functions to concatenate adjacent vector elements on the condition that both the indexed element and the next proceeding element contain meaningful content, i.e do not begin with “-” as the separator rows in the given cross-table do. During each iteration, the concatenated strings, which correspond to the variable labels in the cross-table and the data for the players in the tournament, are incrementally assigned (using a separate index from the loop counter) to the elements of cleaner, which was initialized as a vector of empty strings with the appropriate length before the loop to avoid growing the vector during execution of the loop.

cleaner <- vector(mode = "character", length = length(raw_data[str_detect(raw_data, "^[^-]")]) / 2)
j <- 1

for (i in 1:length(raw_data)) {
  if (str_detect(raw_data[i], "^[^-]") && str_detect(raw_data[i + 1], "^[^-]")) {
    cleaner[j] <- str_c(raw_data[i], raw_data[i+1])
    j <- j + 1
  }
}

cleaner[1:3]

## [1] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round|  Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
## [2] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"  
## [3] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"

In the next data cleaning step, I use the str_split() function to split each element of the cleaner vector at every vertical bar, forward slash, and right arrow (i.e. a hypen followed by a greater than sign, “->”), which represent column boundaries, separators between USCF ID numbers and ratings, and separators between pre- and post-ratings respectively in the original cross-table. Here, I use the argument simplify = TRUE in the function call so that str_split() returns a matrix of character vectors, and in this case, splitting leaves us with individual character strings for each variable label and data element.

cleaner <- str_split(cleaner, "[|]|/|(->)", simplify = TRUE)

cleaner[1:3, -23]

##      [,1]     [,2]                                [,3]    [,4]    [,5]   
## [1,] " Pair " " Player Name                     " "Total" "Round" "Round"
## [2,] "    1 " " GARY HUA                        " "6.0  " "W  39" "W  21"
## [3,] "    2 " " DAKSHESH DARURI                 " "6.0  " "W  63" "W  58"
##      [,6]    [,7]    [,8]    [,9]    [,10]   [,11]     [,12]       
## [1,] "Round" "Round" "Round" "Round" "Round" "  Num  " " USCF ID " 
## [2,] "W  18" "W  14" "W   7" "D  12" "D   4" "   ON "  " 15445895 "
## [3,] "L   4" "W  17" "W  16" "W  20" "W   7" "   MI "  " 14598900 "
##      [,13]         [,14]          [,15]   [,16]   [,17]   [,18]   [,19]  
## [1,] " Rtg (Pre"   "Post)       " " Pts " "  1  " "  2  " "  3  " "  4  "
## [2,] " R: 1794   " "1817     "    "N:2  " "W    " "B    " "W    " "B    "
## [3,] " R: 1553   " "1663     "    "N:2  " "B    " "W    " "B    " "W    "
##      [,20]   [,21]   [,22]  
## [1,] "  5  " "  6  " "  7  "
## [2,] "W    " "B    " "W    "
## [3,] "B    " "W    " "B    "

cleaner[, 23]

##  [1] " " ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  "" 
## [18] ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  "" 
## [35] ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  "" 
## [52] ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""

Noting that the elements in the 23^rd column of what is now a cleaner matrix are either exclusively space characters or empty strings, I remove that column from the matrix. Then I replace a number of the elements of the first row of cleaner with new strings reflecting clearer names for the variables of interest. I also duplicate the columns containing strings that refer to the outcomes and opponents played in each round and combine those columns to cleaner so that I can retain round outcome and opponent values as distinct columns and eventually process those variables independently. Before moving on to cleaning the strings representing the player data, or values for each of the variables, I use colnames() to name the columns of the matrix with the character strings from the first row which represent the names of the variables, and then drop that row from the matrix.

cleaner <- cleaner[, -23]

cleaner[1, c(1, 3, 4:10, 11, 13:15, 16:22)] <- c("Player_ID", "Total_Pts", str_c("Round", 1:7, "_Pair"), "State", "Rtg_Pre", "Rtg_Post", "Norm", str_c("Round", 1:7, "_Color"))
cleaner[1, ] <- str_trim(cleaner[1, ])

cleaner <- cbind(cleaner, cleaner[, 4:10])
cleaner[1, 23:29] <- str_c("Round", 1:7, "_Outcome")

cleaner[1, ]

##  [1] "Player_ID"      "Player Name"    "Total_Pts"      "Round1_Pair"   
##  [5] "Round2_Pair"    "Round3_Pair"    "Round4_Pair"    "Round5_Pair"   
##  [9] "Round6_Pair"    "Round7_Pair"    "State"          "USCF ID"       
## [13] "Rtg_Pre"        "Rtg_Post"       "Norm"           "Round1_Color"  
## [17] "Round2_Color"   "Round3_Color"   "Round4_Color"   "Round5_Color"  
## [21] "Round6_Color"   "Round7_Color"   "Round1_Outcome" "Round2_Outcome"
## [25] "Round3_Outcome" "Round4_Outcome" "Round5_Outcome" "Round6_Outcome"
## [29] "Round7_Outcome"

colnames(cleaner) <- cleaner[1, ]
cleaner <- cleaner[-1, ]

cleaner[1:3, ]

##      Player_ID Player Name                         Total_Pts Round1_Pair
## [1,] "    1 "  " GARY HUA                        " "6.0  "   "W  39"    
## [2,] "    2 "  " DAKSHESH DARURI                 " "6.0  "   "W  63"    
## [3,] "    3 "  " ADITYA BAJAJ                    " "6.0  "   "L   8"    
##      Round2_Pair Round3_Pair Round4_Pair Round5_Pair Round6_Pair
## [1,] "W  21"     "W  18"     "W  14"     "W   7"     "D  12"    
## [2,] "W  58"     "L   4"     "W  17"     "W  16"     "W  20"    
## [3,] "W  61"     "W  25"     "W  21"     "W  11"     "W  13"    
##      Round7_Pair State    USCF ID      Rtg_Pre       Rtg_Post    Norm   
## [1,] "D   4"     "   ON " " 15445895 " " R: 1794   " "1817     " "N:2  "
## [2,] "W   7"     "   MI " " 14598900 " " R: 1553   " "1663     " "N:2  "
## [3,] "W  12"     "   MI " " 14959604 " " R: 1384   " "1640     " "N:2  "
##      Round1_Color Round2_Color Round3_Color Round4_Color Round5_Color
## [1,] "W    "      "B    "      "W    "      "B    "      "W    "     
## [2,] "B    "      "W    "      "B    "      "W    "      "B    "     
## [3,] "W    "      "B    "      "W    "      "B    "      "W    "     
##      Round6_Color Round7_Color Round1_Outcome Round2_Outcome
## [1,] "B    "      "W    "      "W  39"        "W  21"       
## [2,] "W    "      "B    "      "W  63"        "W  58"       
## [3,] "B    "      "W    "      "L   8"        "W  61"       
##      Round3_Outcome Round4_Outcome Round5_Outcome Round6_Outcome
## [1,] "W  18"        "W  14"        "W   7"        "D  12"       
## [2,] "L   4"        "W  17"        "W  16"        "W  20"       
## [3,] "W  25"        "W  21"        "W  11"        "W  13"       
##      Round7_Outcome
## [1,] "D   4"       
## [2,] "W   7"       
## [3,] "W  12"

Below, I loop through the columns of cleaner and use regular expressions and the str_extract(), str_replace(), and str_trim() functions to replace the column elements with cleaned-up strings representing single values for each variable without leading or trailing spaces. These string manipulation and matrix element replacement operations are conditioned on the column names of cleaner so that the element re-assignments occur as expected. For example, the elements of the columns called RoundX_Pair and RoundX_Outcome which formerly contained identical character strings, are here assigned just the opponent pair ID numbers in the first case and just the match outcomes in the second. In addition, provisional ratings, e.g “P17”, and extraneous characters, e.g. “R: ”, are removed from the strings corresponding to players’ pre- and post-tournament ratings. Lastly, the matrix is coerced to a data frame so that the elements can eventually also be coerced to a mix of the appropriate data types.

for (i in 1:ncol(cleaner)) {
  if (str_detect(colnames(cleaner)[i], "Pair")) {
    cleaner[, i] <- str_extract(cleaner[, i], "(\\d+$)")
  } else if (str_detect(colnames(cleaner)[i], "Outcome")) {
    cleaner[, i] <- str_extract(cleaner[, i], "(^\\w+)")
  } else if (colnames(cleaner)[i] == "Rtg_Pre") {
    cleaner[, i] <- str_trim(str_replace(str_extract(cleaner[, i], "(\\w+)(( +)?$)"), "P\\d+", ""))
  } else if (colnames(cleaner)[i] == "Rtg_Post") {
    cleaner[, i] <- str_trim(str_replace(cleaner[, i], "P\\d+", ""))
  } else {
    cleaner[ , i] <- str_trim(cleaner[, i])
  }
}

cleaner <- as.data.frame(cleaner, stringsAsFactors = FALSE)

cleaner[1:3, ]

##   Player_ID     Player Name Total_Pts Round1_Pair Round2_Pair Round3_Pair
## 1         1        GARY HUA       6.0          39          21          18
## 2         2 DAKSHESH DARURI       6.0          63          58           4
## 3         3    ADITYA BAJAJ       6.0           8          61          25
##   Round4_Pair Round5_Pair Round6_Pair Round7_Pair State  USCF ID Rtg_Pre
## 1          14           7          12           4    ON 15445895    1794
## 2          17          16          20           7    MI 14598900    1553
## 3          21          11          13          12    MI 14959604    1384
##   Rtg_Post Norm Round1_Color Round2_Color Round3_Color Round4_Color
## 1     1817  N:2            W            B            W            B
## 2     1663  N:2            B            W            B            W
## 3     1640  N:2            W            B            W            B
##   Round5_Color Round6_Color Round7_Color Round1_Outcome Round2_Outcome
## 1            W            B            W              W              W
## 2            B            W            B              W              W
## 3            W            B            W              L              W
##   Round3_Outcome Round4_Outcome Round5_Outcome Round6_Outcome
## 1              W              W              W              D
## 2              L              W              W              W
## 3              W              W              W              W
##   Round7_Outcome
## 1              D
## 2              W
## 3              W

Below, I create a new vector opp_mean to store the average pre-tournament rating for each player’s opponents, calculate those mean values (using the argument na.rm = TRUE to disregard unplayed rounds), round the means to integer values, and assign the resulting value to the element in opp_mean corresponding to the relevant player. Then the desired columns from cleaner and the opp_mean vector are combined in a new data frame called output which will be used to write the CSV output. The structure of output is quickly examined using the str() function and its contents are reproduced as a table in my R Markdown output using the knitr::kable() function.

opp_mean <- vector(mode = "integer", length = nrow(cleaner))

for (i in 1:nrow(cleaner)) {
  opp_mean[i] <- round(mean(as.numeric(cleaner$Rtg_Pre[as.numeric(unlist(cleaner[i, 4:10]))]), na.rm = TRUE), digits = 0)
}

output <- data.frame(Player_Name = cleaner$`Player Name`, State_Code = cleaner$State, Total_Pts = cleaner$Total_Pts, Pre_Rating = as.numeric(cleaner$Rtg_Pre), Opponents_Avg_Pre_Rtg = opp_mean, Post_Rating = as.numeric(cleaner$Rtg_Post), stringsAsFactors = FALSE)

str(output)

## 'data.frame':    64 obs. of  6 variables:
##  $ Player_Name          : chr  "GARY HUA" "DAKSHESH DARURI" "ADITYA BAJAJ" "PATRICK H SCHILLING" ...
##  $ State_Code           : chr  "ON" "MI" "MI" "MI" ...
##  $ Total_Pts            : chr  "6.0" "6.0" "6.0" "5.5" ...
##  $ Pre_Rating           : num  1794 1553 1384 1716 1655 ...
##  $ Opponents_Avg_Pre_Rtg: num  1605 1469 1564 1574 1501 ...
##  $ Post_Rating          : num  1817 1663 1640 1744 1690 ...

knitr::kable(output)

Player_Name	State_Code	Total_Pts	Pre_Rating	Opponents_Avg_Pre_Rtg	Post_Rating
GARY HUA	ON	6.0	1794	1605	1817
DAKSHESH DARURI	MI	6.0	1553	1469	1663
ADITYA BAJAJ	MI	6.0	1384	1564	1640
PATRICK H SCHILLING	MI	5.5	1716	1574	1744
HANSHI ZUO	MI	5.5	1655	1501	1690
HANSEN SONG	OH	5.0	1686	1519	1687
GARY DEE SWATHELL	MI	5.0	1649	1372	1673
EZEKIEL HOUGHTON	MI	5.0	1641	1468	1657
STEFANO LEE	ON	5.0	1411	1523	1564
ANVIT RAO	MI	5.0	1365	1554	1544
CAMERON WILLIAM MC LEMAN	MI	4.5	1712	1468	1696
KENNETH J TACK	MI	4.5	1663	1506	1670
TORRANCE HENRY JR	MI	4.5	1666	1498	1662
BRADLEY SHAW	MI	4.5	1610	1515	1618
ZACHARY JAMES HOUGHTON	MI	4.5	1220	1484	1416
MIKE NIKITIN	MI	4.0	1604	1386	1613
RONALD GRZEGORCZYK	MI	4.0	1629	1499	1610
DAVID SUNDEEN	MI	4.0	1600	1480	1600
DIPANKAR ROY	MI	4.0	1564	1426	1570
JASON ZHENG	MI	4.0	1595	1411	1569
DINH DANG BUI	ON	4.0	1563	1470	1562
EUGENE L MCCLURE	MI	4.0	1555	1300	1529
ALAN BUI	ON	4.0	1363	1214	1371
MICHAEL R ALDRICH	MI	4.0	1229	1357	1300
LOREN SCHWIEBERT	MI	3.5	1745	1363	1681
MAX ZHU	ON	3.5	1579	1507	1564
GAURAV GIDWANI	MI	3.5	1552	1222	1539
SOFIA ADINA STANESCU-BELLU	MI	3.5	1507	1522	1513
CHIEDOZIE OKORIE	MI	3.5	1602	1314	1508
GEORGE AVERY JONES	ON	3.5	1522	1144	1444
RISHI SHETTY	MI	3.5	1494	1260	1444
JOSHUA PHILIP MATHEWS	ON	3.5	1441	1379	1433
JADE GE	MI	3.5	1449	1277	1421
MICHAEL JEFFERY THOMAS	MI	3.5	1399	1375	1400
JOSHUA DAVID LEE	MI	3.5	1438	1150	1392
SIDDHARTH JHA	MI	3.5	1355	1388	1367
AMIYATOSH PWNANANDAM	MI	3.5	980	1385	1077
BRIAN LIU	MI	3.0	1423	1539	1439
JOEL R HENDON	MI	3.0	1436	1430	1413
FOREST ZHANG	MI	3.0	1348	1391	1346
KYLE WILLIAM MURPHY	MI	3.0	1403	1248	1341
JARED GE	MI	3.0	1332	1150	1256
ROBERT GLEN VASEY	MI	3.0	1283	1107	1244
JUSTIN D SCHILLING	MI	3.0	1199	1327	1199
DEREK YAN	MI	3.0	1242	1152	1191
JACOB ALEXANDER LAVALLEY	MI	3.0	377	1358	1076
ERIC WRIGHT	MI	2.5	1362	1392	1341
DANIEL KHAIN	MI	2.5	1382	1356	1335
MICHAEL J MARTIN	MI	2.5	1291	1286	1259
SHIVAM JHA	MI	2.5	1056	1296	1111
TEJAS AYYAGARI	MI	2.5	1011	1356	1097
ETHAN GUO	MI	2.5	935	1495	1092
JOSE C YBARRA	MI	2.0	1393	1345	1359
LARRY HODGE	MI	2.0	1270	1206	1200
ALEX KONG	MI	2.0	1186	1406	1163
MARISA RICCI	MI	2.0	1153	1414	1140
MICHAEL LU	MI	2.0	1092	1363	1079
VIRAJ MOHILE	MI	2.0	917	1391	941
SEAN M MC CORMICK	MI	2.0	853	1319	878
JULIA SHEN	MI	1.5	967	1330	984
JEZZEL FARKAS	ON	1.5	955	1327	979
ASHWIN BALAJI	MI	1.0	1530	1186	1535
THOMAS JOSEPH HOSMER	MI	1.0	1175	1350	1125
BEN LI	MI	1.0	1163	1263	1112

Here, the requested CSV file is written to the working directory, the file name and path can be changed as needed.

# Write CSV file to working directory
write.csv(output, file = "chess_player_data.csv")

Visualizing the chess tournament data

Finally, I visualize a few of the variables from the tournament data set using plotting functions from the ggplot2 package. In particular, I graph the distribution of Total_Pts which represents point totals earned through match outcomes of win, draw, bye, and so on; the distribution of players’ pre-tournament Elo ratings; the relationship between players’ pre-ratings and their opponents’ average pre-ratings; the relationship between Total_Pts and opponents’ average pre-ratings; and the relationship between opponents’ average pre-ratings and the difference between each player’s pre- and post-tournament rating.

ggplot(output, aes(Total_Pts)) + geom_bar()

ggplot(output, aes(Pre_Rating)) + geom_histogram(binwidth = 50)

ggplot(output, aes(x = Pre_Rating, y = Opponents_Avg_Pre_Rtg)) + geom_point()

ggplot(output, aes(x = Total_Pts, y = Opponents_Avg_Pre_Rtg)) + geom_boxplot() + coord_flip()

ggplot(output, aes(x = Opponents_Avg_Pre_Rtg, y = Post_Rating - Pre_Rating)) + geom_point() + 
  ylab("Delta_Rating") + 
  geom_hline(yintercept = mean(output$Post_Rating - output$Pre_Rating)) +
  geom_vline(xintercept = mean(output$Opponents_Avg_Pre_Rtg))