Libraries used

library(stringr)
library(knitr)

Reading the text file from GitHub into R

First, I read in the text file from my GitHub account. I initially saved it from Blackboard, then used R to read it into my project code as a single string with the lines joined by newline characters.

chess_data <- paste(readLines("https://raw.githubusercontent.com/zachalexander/data607_cunysps/master/Project1/chess_ratings.txt"), collapse = '\n')
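
As a minimal sketch of what this step produces, the snippet below runs the same readLines() and paste() combination on a small temporary file. The two sample lines and the names `tmp`, `lines`, and `sample_text` are assumptions for illustration, not the real tournament file.

```r
# Hypothetical two-line sample standing in for the tournament file
tmp <- tempfile(fileext = ".txt")
writeLines(c("    1 | GARY HUA",
             "   ON | 15445895 / R: 1794   ->1817"), tmp)

lines <- readLines(tmp)                       # character vector, one element per line
sample_text <- paste(lines, collapse = "\n")  # one string with embedded newlines

length(lines)        # 2
length(sample_text)  # 1

# Because the newline is now part of the string, a pattern can anchor on it
grepl("\n   ON", sample_text, fixed = TRUE)   # TRUE
```

Collapsing to one string is what later allows a pattern to start with a newline when picking out each player's state.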

Using Regular Expressions

Next, I used regular expressions to extract the important information from the long string. I started by isolating the player names. Since all of them are in uppercase, I used that pattern to capture them into a vector.

player_names <- unlist(str_extract_all(chess_data, '([:upper:]+\\s[:upper:]+(\\s[:upper:]+)?(\\s[:upper:]+)?)'))

# remove the top value since it's not a name
player_names <- player_names[player_names != "USCF ID"]

# change to 'proper' case
player_names <- str_to_title(player_names)
head(player_names)
## [1] "Gary Hua"            "Dakshesh Daruri"     "Aditya Bajaj"       
## [4] "Patrick H Schilling" "Hanshi Zuo"          "Hansen Song"
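
To see how the name pattern behaves, here is a small sketch on three sample lines. It uses base R (regmatches() with gregexpr() in place of str_extract_all(), and gsub() in place of str_to_title()) so that it runs without extra packages. The sample text is an assumption modeled on the layout above.

```r
# Hypothetical sample: one header line and two player lines
sample_text <- paste(
  " Num  | USCF ID / Rtg (Pre->Post)       | Pts |",
  "    1 | GARY HUA                        |6.0  |W  39|W  21|",
  "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|",
  sep = "\n"
)

# Two to four uppercase words, each separated by a single whitespace character
name_pattern <- "[[:upper:]]+\\s[[:upper:]]+(\\s[[:upper:]]+)?(\\s[[:upper:]]+)?"
found <- regmatches(sample_text, gregexpr(name_pattern, sample_text, perl = TRUE))[[1]]
found
# "USCF ID" "GARY HUA" "PATRICK H SCHILLING"

# Drop the header match, then convert to title case
names_only <- found[found != "USCF ID"]
gsub("\\b([a-z])", "\\U\\1", tolower(names_only), perl = TRUE)
# "Gary Hua" "Patrick H Schilling"
```

The result codes such as `W  39` are not picked up because the pattern requires exactly one whitespace character followed by another uppercase letter. Note that the pattern captures at most four words, so a name with more parts would presumably be cut short.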

For Player’s State (location), Pre-rating, Total Points, and Post-rating, I also used regular expressions to store this specific information in vectors. See below for my code:

# found that each player's state could be extracted through this regular expression combination
location <- unlist(str_extract_all(chess_data, '(\n)\\s\\s\\s[:upper:]+'))
location <- unlist(str_replace_all(location, '^\\s+|\\s+$', ""))
head(location)
## [1] "ON" "MI" "MI" "MI" "MI" "OH"
# found that each player's pre-rating could be extracted through this regular expression combination
pre_rating <- unlist(str_extract_all(chess_data, '(R:)\\s{1,}\\d+'))
pre_rating <- unlist(str_replace_all(pre_rating, '(R:)\\s+', ""))
head(pre_rating)
## [1] "1794" "1553" "1384" "1716" "1655" "1686"
# found that each player's post-rating could be extracted through this regular expression combination (used later in data visualization)
post_rating <- unlist(str_extract_all(chess_data, '(->)(\\s{1,})?\\d+'))
post_rating <- unlist(str_replace_all(post_rating, '(->)(\\s+)?', ""))
# found that each player's total points could be extracted through this regular expression combination
total_points <- unlist(str_extract_all(chess_data, '(\\d\\.\\d)'))
head(total_points)
## [1] "6.0" "6.0" "6.0" "5.5" "5.5" "5.0"
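
The state and rating patterns can be checked on a few sample lines. This is a sketch in base R under stated assumptions: the second player, the ID `00000000`, and the helper `extract_all()` are hypothetical, and the lines only imitate the layout of the file, including a three-digit rating followed by a suffix.

```r
# Hypothetical sample: two players, two lines each
sample_text <- paste(
  "    1 | GARY HUA                        |6.0  |W  39|",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |",
  "   46 | HYPOTHETICAL PLAYER             |3.0  |L  22|",
  "   MI | 00000000 / R:  980P12->1077P17  |N:3  |B    |",
  sep = "\n"
)

# Hypothetical helper: every match of `pattern` in `text` (base R stand-in for str_extract_all)
extract_all <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# State: a newline, three spaces, then uppercase letters
state <- trimws(extract_all("\\n\\s{3}[[:upper:]]+", sample_text))

# Ratings: strip the marker and any amount of padding before the digits
pre  <- sub("R:\\s+", "", extract_all("R:\\s+\\d+", sample_text))
post <- sub("->\\s*", "", extract_all("->\\s*\\d+", sample_text))

state             # "ON" "MI"
as.numeric(pre)   # 1794  980
as.numeric(post)  # 1817 1077
```

Lines that begin with a pair number (such as `   46`) are skipped by the state pattern because a digit, not an uppercase letter, follows the three spaces.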

Dealing with the player opponent data

Once the above information was isolated and stored in vectors, I needed a way to grab each player’s tournament results as a single string. To do this, I used a regular expression that matches just the result and the opponent number for each of a player’s games. See below:

results <- unlist(str_extract_all(chess_data, '(\\d\\.\\d\\s{2}(\\|[:upper:]\\s{2}(\\s)?(\\s)?(\\d+)?){1,})'))
results <- unlist(str_replace(results, '\\d\\.\\d\\s{2}\\|', ''))
kable(head(results))
x
W 39|W 21|W 18|W 14|W 7|D 12|D 4
W 63|W 58|L 4|W 17|W 16|W 20|W 7
L 8|W 61|W 25|W 21|W 11|W 13|W 12
W 23|D 28|W 2|W 26|D 5|W 19|D 1
W 45|W 37|D 12|D 13|D 4|W 14|W 17
W 34|D 29|L 11|W 35|D 10|W 27|W 21

Storing these in a separate data frame

After grabbing the results, I needed to clean them up in order to eventually replace each opponent number with that opponent’s pre-tournament Elo rating. I stored this information in a temporary data frame (‘tempdf’) so that I could bind it together with the full data later:

opponents <- unlist(str_replace_all(results, '[:upper:]\\s+', ''))
opponents <- unlist(str_replace_all(opponents, '\\|', '\\,'))
opponents <- unlist(str_replace_all(opponents, '\\,{2,}', '\\,'))
opponents <- unlist(str_replace_all(opponents, '(\\,$)', ''))
opponents <- unlist(str_replace_all(opponents, '^\\,', ''))
tempdf <- data.frame(V1 = opponents)
kable(head(tempdf))
V1
39,21,18,14,7,12,4
63,58,4,17,16,20,7
8,61,25,21,11,13,12
23,28,2,26,5,19,1
45,37,12,13,4,14,17
34,29,11,35,10,27,21
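
The cleanup steps can be traced on two result strings, one of which contains rounds with no opponent number. This is a sketch using base R gsub() in place of str_replace_all(); the second string is a hypothetical example of a player with unplayed rounds.

```r
results <- c("W  39|W  21|W  18|W  14|W   7|D  12|D   4",
             "L  49|B    |L  43|L  45|H    |U    ")   # hypothetical row with unplayed rounds

opponents <- gsub("[[:upper:]]\\s+", "", results)      # drop result letter and padding
opponents <- gsub("|", ",", opponents, fixed = TRUE)   # pipes become commas
opponents <- gsub(",{2,}", ",", opponents)             # collapse runs left by empty rounds
opponents <- gsub("^,|,$", "", opponents)              # trim a leading or trailing comma

opponents
# "39,21,18,14,7,12,4" "49,43,45"
```

Rounds without an opponent simply disappear from the string, so the second player ends up with three opponent numbers instead of seven.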

Once the temporary dataframe had just the opponents data in one column, I wanted to split the strings and create separate columns for each opponent based on the game. To do this, I created a for loop:

# Loop over the seven games, using strsplit() on the comma to place each opponent number from V1 into its own column. Players with fewer than seven opponents get NA in the remaining columns.
for(i in 1:7){
  tempdf[, paste0('game', i)] <- sapply(strsplit(as.character(tempdf$V1),','), "[", i)
}
kable(head(tempdf))
V1                    game1  game2  game3  game4  game5  game6  game7
39,21,18,14,7,12,4    39     21     18     14     7      12     4
63,58,4,17,16,20,7    63     58     4      17     16     20     7
8,61,25,21,11,13,12   8      61     25     21     11     13     12
23,28,2,26,5,19,1     23     28     2      26     5      19     1
45,37,12,13,4,14,17   45     37     12     13     4      14     17
34,29,11,35,10,27,21  34     29     11     35     10     27     21
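
Here is a small sketch of how the split fills the columns when a player has fewer than seven opponents. The second string is a hypothetical short row, and `parts` and `games` are names introduced only for this example.

```r
opponents <- c("39,21,18,14,7,12,4", "49,43,45")   # second row is hypothetical

# One character vector of opponent numbers per player
parts <- strsplit(opponents, ",", fixed = TRUE)

# Indexing past the end of a vector returns NA, which pads the short row
games <- as.data.frame(
  sapply(1:7, function(i) as.numeric(sapply(parts, "[", i)))
)
names(games) <- paste0("game", 1:7)

games$game2   # 21 43
games$game4   # 14 NA
```

One consequence worth keeping in mind: because empty rounds were removed before splitting, `game2` for the second player holds the second opponent actually played, not necessarily the round 2 opponent. That does not affect an average taken across all games.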

Merging data into one data frame

Now that the opponent data was stored correctly in a separate dataframe, I decided to merge this data in with the rest of my vectors to create one large dataframe (‘chess_ratings_df’):

chess_ratings_df <- data.frame(player_name = player_names, 
                               player_state = location,
                               total_points = total_points,
                               player_pre_rating = pre_rating, 
                               player_post_rating = post_rating,
                               game1 = tempdf$game1,
                               game2 = tempdf$game2,
                               game3 = tempdf$game3,
                               game4 = tempdf$game4,
                               game5 = tempdf$game5,
                               game6 = tempdf$game6,
                               game7 = tempdf$game7)

kable(head(chess_ratings_df))
player_name          player_state  total_points  player_pre_rating  player_post_rating  game1  game2  game3  game4  game5  game6  game7
Gary Hua             ON            6.0           1794               1817                39     21     18     14     7      12     4
Dakshesh Daruri      MI            6.0           1553               1663                63     58     4      17     16     20     7
Aditya Bajaj         MI            6.0           1384               1640                8      61     25     21     11     13     12
Patrick H Schilling  MI            5.5           1716               1744                23     28     2      26     5      19     1
Hanshi Zuo           MI            5.5           1655               1690                45     37     12     13     4      14     17
Hansen Song          OH            5.0           1686               1687                34     29     11     35     10     27     21

Changing the data types from strings to numbers

In order to do proper average calculations later on, I needed to make sure my data types were correct. For this, I changed the character vectors to numeric vectors:

# just to be safe, convert the rating, points, and game columns to numeric; as.character() first guards against columns stored as factors
numeric_cols <- c("player_pre_rating", "player_post_rating", "total_points", paste0("game", 1:7))
chess_ratings_df[numeric_cols] <- lapply(chess_ratings_df[numeric_cols], function(x) as.numeric(as.character(x)))
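
The reason for going through as.character() first shows up when a column is a factor, which is what data.frame() produced for strings by default before R 4.0.0. A minimal sketch, with three ratings taken from the output above:

```r
ratings <- factor(c("1794", "1553", "980"))

# Converting a factor directly returns its internal level codes, not the values
as.numeric(ratings)
# 2 1 3

# Converting to character first recovers the actual numbers
as.numeric(as.character(ratings))
# 1794 1553  980
```

If the columns are already character vectors, the extra as.character() call is harmless, so the two-step conversion is a safe default either way.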

Using loops to substitute opponent numbers with their pre-tournament Elo ratings

To substitute each opponent number with that opponent’s pre-tournament Elo rating, I created another set of for loops:

# Iterate over the game columns (6 through 12) and over every player row. Each opponent number is used as a row index to look up that opponent's pre-tournament Elo rating, which then replaces the number.
for(i in 6:12) {
  for(j in seq_len(nrow(chess_ratings_df))) {
    value <- chess_ratings_df[,i][j]
    chess_ratings_df[,i][j] <- chess_ratings_df$player_pre_rating[value]
  }
}

kable(head(chess_ratings_df))
player_name          player_state  total_points  player_pre_rating  player_post_rating  game1  game2  game3  game4  game5  game6  game7
Gary Hua             ON            6.0           1794               1817                1436   1563   1600   1610   1649   1663   1716
Dakshesh Daruri      MI            6.0           1553               1663                1175   917    1716   1629   1604   1595   1649
Aditya Bajaj         MI            6.0           1384               1640                1641   955    1745   1563   1712   1666   1663
Patrick H Schilling  MI            5.5           1716               1744                1363   1507   1553   1579   1655   1564   1794
Hanshi Zuo           MI            5.5           1655               1690                1242   980    1663   1666   1716   1610   1629
Hansen Song          OH            5.0           1686               1687                1399   1602   1712   1438   1365   1552   1563
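
The same lookup can also be written without explicit loops, since R accepts a whole vector of row numbers as an index. This is an alternative sketch on a three-player toy data set, not the code used above; `pre_rating`, `games`, and `opp_ratings` are names chosen for the example.

```r
# Toy data: three players, opponent numbers refer to positions in pre_rating
pre_rating <- c(1794, 1553, 1384)
games <- data.frame(game1 = c(2, 3, 1),
                    game2 = c(3, NA, 2))   # NA marks a game that was not played

# Index pre_rating by each column of opponent numbers in one step
opp_ratings <- games
opp_ratings[] <- lapply(games, function(idx) pre_rating[idx])

opp_ratings$game1   # 1553 1384 1794
opp_ratings$game2   # 1384   NA 1553
```

Indexing with NA returns NA, so unplayed games stay missing, just as they do in the loop version. Both approaches assume that the rows are in pair-number order, so that opponent number n is row n.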

Taking the average of the pre-tournament Elo ratings

Finally, after the pre-tournament Elo ratings had been substituted in, I could take the average of these ratings across the seven games of the tournament to obtain the Average Pre Chess Rating of Opponents. Games a player did not play are stored as NA and are skipped via na.rm = TRUE:

chess_ratings_df$average_opp_rating <- round(rowMeans(chess_ratings_df[,6:12], na.rm = TRUE), digits = 0)
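
As a worked check of this calculation, the sketch below uses Gary Hua's seven opponent ratings from the output above, plus a hypothetical second player who played only two games.

```r
opp <- data.frame(
  game1 = c(1436, 1400), game2 = c(1563, 1500), game3 = c(1600, NA),
  game4 = c(1610, NA),   game5 = c(1649, NA),   game6 = c(1663, NA),
  game7 = c(1716, NA)
)

# Row 1: 11237 / 7 = 1605.29, which rounds to 1605
# Row 2: only the two played games count, (1400 + 1500) / 2 = 1450
avg <- round(rowMeans(opp, na.rm = TRUE), digits = 0)
unname(avg)
# 1605 1450
```

Without na.rm = TRUE, the second row would come back as NA, because a single missing value makes the whole mean missing.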

For future data visualization work (see at the end), I created a few more columns in the ‘chess_ratings_df’ data frame.

I wanted to see the difference between each player’s pre-tournament and post-tournament rating to find out who had the largest Elo rating gain. Additionally, I created a logical column indicating whether that rating difference was positive (or zero) rather than negative. Finally, for charting purposes, I created one more column holding the difference with no negative numbers, by squaring the difference and taking the square root, which is equivalent to taking its absolute value (will be used later):

chess_ratings_df$rating_difference <- chess_ratings_df$player_post_rating - chess_ratings_df$player_pre_rating
chess_ratings_df$change_pos <- chess_ratings_df$rating_difference >= 0
chess_ratings_df$rating_difference_sqr <- sqrt(chess_ratings_df$rating_difference ^ 2)
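
A quick sketch of the three derived columns on a few values. The first two pre/post pairs come from the output above; the third pair is hypothetical and is included only to show a rating loss.

```r
pre  <- c(1794, 1553, 1500)
post <- c(1817, 1663, 1470)   # third pair is hypothetical

rating_difference <- post - pre             # 23 110 -30
change_pos <- rating_difference >= 0        # TRUE TRUE FALSE
rating_magnitude <- sqrt(rating_difference ^ 2)

rating_magnitude                            # 23 110 30
all(rating_magnitude == abs(rating_difference))   # TRUE
```

Squaring and then taking the square root gives the same result as abs() here, so either form works for charting the size of the change while the logical column records its direction.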


The final data frame to be exported to a csv file

In the end, you can see the data frame required for the project (‘final_df’).

final_df <- chess_ratings_df[, c(1:4, 13)]
kable(final_df)
player_name               player_state  total_points  player_pre_rating  average_opp_rating
Gary Hua                  ON            6.0           1794               1605
Dakshesh Daruri           MI            6.0           1553               1469
Aditya Bajaj              MI            6.0           1384               1564
Patrick H Schilling       MI            5.5           1716               1574
Hanshi Zuo                MI            5.5           1655               1501
Hansen Song               OH            5.0           1686               1519
Gary Dee Swathell         MI            5.0           1649               1372
Ezekiel Houghton          MI            5.0           1641               1468
Stefano Lee               ON            5.0           1411               1523
Anvit Rao                 MI            5.0           1365               1554
Cameron William Mc Leman  MI            4.5           1712               1468
Kenneth J Tack            MI            4.5           1663               1506
Torrance Henry Jr         MI            4.5           1666               1498
Bradley Shaw              MI            4.5           1610               1515
Zachary James Houghton    MI            4.5           1220               1484
Mike Nikitin              MI            4.0           1604               1386
Ronald Grzegorczyk        MI            4.0           1629               1499
David Sundeen             MI            4.0           1600               1480
Dipankar Roy              MI            4.0           1564               1426
Jason Zheng               MI            4.0           1595               1411
Dinh Dang Bui             ON            4.0           1563               1470
Eugene L Mcclure          MI            4.0           1555               1300
Alan Bui                  ON            4.0           1363               1214
Michael R Aldrich         MI            4.0           1229               1357
Loren Schwiebert          MI            3.5           1745               1363
Max Zhu                   ON            3.5           1579               1507
Gaurav Gidwani            MI            3.5           1552               1222
Sofia Adina Stanescu      MI            3.5           1507               1522
Chiedozie Okorie          MI            3.5           1602               1314
George Avery Jones        ON            3.5           1522               1144
Rishi Shetty              MI            3.5           1494               1260
Joshua Philip Mathews     ON            3.5           1441               1379
Jade Ge                   MI            3.5           1449               1277
Michael Jeffery Thomas    MI            3.5           1399               1375
Joshua David Lee          MI            3.5           1438               1150
Siddharth Jha             MI            3.5           1355               1388
Amiyatosh Pwnanandam      MI            3.5           980                1385
Brian Liu                 MI            3.0           1423               1539
Joel R Hendon             MI            3.0           1436               1430
Forest Zhang              MI            3.0           1348               1391
Kyle William Murphy       MI            3.0           1403               1248
Jared Ge                  MI            3.0           1332               1150
Robert Glen Vasey         MI            3.0           1283               1107
Justin D Schilling        MI            3.0           1199               1327
Derek Yan                 MI            3.0           1242               1152
Jacob Alexander Lavalley  MI            3.0           377                1358
Eric Wright               MI            2.5           1362               1392
Daniel Khain              MI            2.5           1382               1356
Michael J Martin          MI            2.5           1291               1286
Shivam Jha                MI            2.5           1056               1296
Tejas Ayyagari            MI            2.5           1011               1356
Ethan Guo                 MI            2.5           935                1495
Jose C Ybarra             MI            2.0           1393               1345
Larry Hodge               MI            2.0           1270               1206
Alex Kong                 MI            2.0           1186               1406
Marisa Ricci              MI            2.0           1153               1414
Michael Lu                MI            2.0           1092               1363
Viraj Mohile              MI            2.0           917                1391
Sean M Mc Cormick         MI            2.0           853                1319
Julia Shen                MI            1.5           967                1330
Jezzel Farkas             ON            1.5           955                1327
Ashwin Balaji             MI            1.0           1530               1186
Thomas Joseph Hosmer      MI            1.0           1175               1350
Ben Li                    MI            1.0           1163               1263

Here’s the syntax for exporting it to my local computer. However, feel free to change the filepath to export it anywhere else.

filename <- '~/Desktop/grad_school/data_607_cunysps/data607_cunysps/Project1/chess_ratings.csv'
write.csv(final_df, file = filename)
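
One detail to be aware of: write.csv() writes the row names as an extra unnamed first column by default. The sketch below is an optional variant, not the export used above. It writes a two-row stand-in for ‘final_df’ to a temporary file with row.names = FALSE and reads it back; `out_file` and `check` are names chosen for the example.

```r
# Two-row stand-in for final_df, using values from the table above
final_df <- data.frame(player_name = c("Gary Hua", "Dakshesh Daruri"),
                       player_state = c("ON", "MI"),
                       total_points = c(6.0, 6.0),
                       player_pre_rating = c(1794, 1553),
                       average_opp_rating = c(1605, 1469))

# Write to a temporary location; swap in any file path you prefer
out_file <- file.path(tempdir(), "chess_ratings.csv")
write.csv(final_df, file = out_file, row.names = FALSE)

# Reading it back gives exactly the five original columns
check <- read.csv(out_file, stringsAsFactors = FALSE)
names(check)
# "player_name" "player_state" "total_points" "player_pre_rating" "average_opp_rating"
```

Whether to keep the row-name column depends on what will consume the file; dropping it keeps the CSV limited to the five required fields.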


Data Visualization

To go a step further with this project, I created a micro-site using a JavaScript framework (Angular) and deployed it to the internet with Google Firebase. At this link:

https://data606-chess.firebaseapp.com/

…you’ll be able to find a ‘mock’ newspaper article that displays a data visualization and tables built with d3.js (a charting library for JavaScript). The data visualization displays data on hover, and clicking the associated button shows more data: the +/- change in Elo ratings.

As a quick note, the URL says ‘data606’ but it should be data607 (I made a typo during setup, sorry about that!)

If interested, you can find the files for this on my GitHub as well, at this repository.