library(stringr)
library(knitr)
First, I read in the text file from my GitHub account. I had initially saved it from Blackboard, and then used R to read it into my project code.
chess_data <- paste(readLines("https://raw.githubusercontent.com/zachalexander/data607_cunysps/master/Project1/chess_ratings.txt"), collapse = '\n')
Next, I wanted to take the long string and use regular expressions to extract the important information. I initially focused on isolating the player names. Since all of them are in uppercase, I used this as a way to save them into a vector.
player_names <- unlist(str_extract_all(chess_data, '([:upper:]+\\s[:upper:]+(\\s[:upper:]+)?(\\s[:upper:]+)?)'))
# remove the top value since it's not a name
player_names <- player_names[player_names != "USCF ID"]
# change to 'proper' case
player_names <- str_to_title(player_names)
head(player_names)
## [1] "Gary Hua" "Dakshesh Daruri" "Aditya Bajaj"
## [4] "Patrick H Schilling" "Hanshi Zuo" "Hansen Song"
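To see why the `USCF ID` fragment has to be dropped, the same pattern can be run on a few lines shaped like the tournament file. This is only a sketch: the sample text below is an assumed approximation of the file's layout, not a copy of it.

```r
library(stringr)

# Assumed sample: one header line plus one two-line player record
sample_text <- paste(
  " Num  | USCF ID / Rtg (Pre->Post)       | Pts |",
  "    1 | GARY HUA                        |6.0  |W  39|W  21|",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |",
  sep = "\n"
)

# Same pattern as above: two to four uppercase words separated by single spaces
name_pattern <- '([:upper:]+\\s[:upper:]+(\\s[:upper:]+)?(\\s[:upper:]+)?)'
found <- unlist(str_extract_all(sample_text, name_pattern))

# Drop the header fragment, then convert to title case
sample_names <- str_to_title(found[found != "USCF ID"])
print(sample_names)
```

The header fragment matches because it is also two uppercase words separated by a single space, which is why it is filtered out by value rather than by position.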
For Player’s State (location), Pre-rating, Total Points, and Post-rating, I also used regular expressions to store this specific information in vectors. See below for my code:
# found that each player's state could be extracted through this regular expression combination
location <- unlist(str_extract_all(chess_data, '(\n)\\s\\s\\s[:upper:]+'))
location <- unlist(str_replace_all(location, '^\\s+|\\s+$', ""))
head(location)
## [1] "ON" "MI" "MI" "MI" "MI" "OH"
# found that each player's pre-rating could be extracted through this regular expression combination
pre_rating <- unlist(str_extract_all(chess_data, '(R:)\\s{1,}\\d+'))
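# '\\s+' removes all of the padding after "R:", so three-digit ratings are not left with a leading space
pre_rating <- unlist(str_replace_all(pre_rating, '(R:)\\s+', ""))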
head(pre_rating)
## [1] "1794" "1553" "1384" "1716" "1655" "1686"
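As a possible alternative to extracting and then trimming, a capture group can pull out just the digits in one step. The sketch below assumes rating fields shaped like the two sample strings, where a lower rating is padded with an extra space and may carry a provisional suffix such as `P12`. These samples are illustrative, not copied from the file.

```r
library(stringr)

# Assumed sample rating fields; the second has extra padding and a "P" suffix
rating_fields <- c("R: 1794   ->1817", "R:  980P12->1077P17")

# Capture only the digits after "R:", allowing any amount of whitespace.
# str_match() returns a matrix; column 2 holds the first capture group.
pre <- str_match(rating_fields, 'R:\\s*(\\d+)')[, 2]
pre_numeric <- as.numeric(pre)
print(pre_numeric)
```

Because `\\d+` stops at the first non-digit, the provisional suffix never makes it into the result.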
# found that each player's post-rating could be extracted through this regular expression combination (used later in data visualization)
post_rating <- unlist(str_extract_all(chess_data, '(->)(\\s{1,})?\\d+'))
post_rating <- unlist(str_replace_all(post_rating, '(->)(\\s)?', ""))
# found that each player's total points could be extracted through this regular expression combination
total_points <- unlist(str_extract_all(chess_data, '(\\d\\.\\d)'))
head(total_points)
## [1] "6.0" "6.0" "6.0" "5.5" "5.5" "5.0"
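Since each field comes from its own pattern, one stray match would shift a vector relative to the others. A quick length check before combining them can catch that early. The helper below, `check_same_length`, is a hypothetical addition and not part of the original analysis.

```r
# Hypothetical helper: stop early if the extracted vectors do not all
# describe the same number of players
check_same_length <- function(...) {
  lengths_found <- lengths(list(...))
  if (length(unique(lengths_found)) != 1) {
    stop("Extracted vectors differ in length: ",
         paste(lengths_found, collapse = ", "))
  }
  invisible(lengths_found[1])
}

# Toy vectors standing in for player_names, location and total_points
n <- check_same_length(c("Gary Hua", "Dakshesh Daruri"),
                       c("ON", "MI"),
                       c("6.0", "6.0"))
print(n)
```

In the real workflow the call would take `player_names`, `location`, `pre_rating`, `post_rating` and `total_points` as arguments.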
Once the above information was isolated and stored in vectors, I needed to find a way to grab each player's tournament results as a single string. To do this, I used a regular expression that matches just the result and the opponent number for each of a player’s games. See below:
results <- unlist(str_extract_all(chess_data, '(\\d\\.\\d\\s{2}(\\|[:upper:]\\s{2}(\\s)?(\\s)?(\\d+)?){1,})'))
results <- unlist(str_replace(results, '\\d\\.\\d\\s{2}\\|', ''))
kable(head(results))
| x |
|---|
| W 39\|W 21\|W 18\|W 14\|W 7\|D 12\|D 4 |
| W 63\|W 58\|L 4\|W 17\|W 16\|W 20\|W 7 |
| L 8\|W 61\|W 25\|W 21\|W 11\|W 13\|W 12 |
| W 23\|D 28\|W 2\|W 26\|D 5\|W 19\|D 1 |
| W 45\|W 37\|D 12\|D 13\|D 4\|W 14\|W 17 |
| W 34\|D 29\|L 11\|W 35\|D 10\|W 27\|W 21 |
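
To make the two-step extraction easier to follow, here is the same pair of patterns applied to a single line. The line is an assumed sample in the shape of the first row of a player record, including the column padding.

```r
library(stringr)

# Assumed sample of the first line of one player record
record_line <- "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"

# Same pattern as above: total points, two spaces, then repeated
# "|<letter><padding><opponent number>" cells
result_pattern <- '(\\d\\.\\d\\s{2}(\\|[:upper:]\\s{2}(\\s)?(\\s)?(\\d+)?){1,})'
sample_result <- str_extract(record_line, result_pattern)

# Strip the leading total points and first pipe so only the round cells remain
sample_result <- str_replace(sample_result, '\\d\\.\\d\\s{2}\\|', '')
print(sample_result)
```

The optional `(\\d+)?` at the end of each cell is what lets the pattern keep matching through rounds with no opponent number.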
After grabbing the results, I needed to clean them up a bit in order to eventually replace each opponent number with that opponent's pre-tournament Elo rating. I stored this information in a temporary dataframe (‘tempdf’) so that I could bind it together with the full data later:
opponents <- unlist(str_replace_all(results, '[:upper:]\\s+', ''))
opponents <- unlist(str_replace_all(opponents, '\\|', '\\,'))
opponents <- unlist(str_replace_all(opponents, '\\,{2,}', '\\,'))
opponents <- unlist(str_replace_all(opponents, '(\\,$)', ''))
opponents <- unlist(str_replace_all(opponents, '^\\,', ''))
tempdf <- data.frame(V1 = opponents)
kable(head(tempdf))
| V1 |
|---|
| 39,21,18,14,7,12,4 |
| 63,58,4,17,16,20,7 |
| 8,61,25,21,11,13,12 |
| 23,28,2,26,5,19,1 |
| 45,37,12,13,4,14,17 |
| 34,29,11,35,10,27,21 |
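
A shorter route to the same opponent numbers, offered as an alternative sketch rather than the method used above, is to extract every run of digits directly. The sample strings are assumptions, with `H` and `U` standing in for rounds that have no opponent.

```r
library(stringr)

# Assumed result strings; the second player has two rounds without an opponent
sample_results <- c("W  39|W  21|D   4", "W  23|H    |U    ")

# Pull every run of digits out of each string, giving one integer vector per player
opponent_ids <- lapply(str_extract_all(sample_results, '\\d+'), as.integer)
print(opponent_ids)
```

This yields a list rather than a comma-separated string, so the later column-splitting step would need to be adapted if this approach were used.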
Once the temporary dataframe had just the opponents data in one column, I wanted to split the strings and create separate columns for each opponent based on the game. To do this, I created a for loop:
# used a for loop here to iterate over V1, using strsplit and the comma to separate the opponent numbers into the appropriate columns. Players with fewer than seven opponents get NA in the remaining columns.
for (i in 1:7) {
  tempdf[, paste0('game', i)] <- sapply(strsplit(as.character(tempdf$V1), ','), "[", i)
}
kable(head(tempdf))
| V1 | game1 | game2 | game3 | game4 | game5 | game6 | game7 |
|---|---|---|---|---|---|---|---|
| 39,21,18,14,7,12,4 | 39 | 21 | 18 | 14 | 7 | 12 | 4 |
| 63,58,4,17,16,20,7 | 63 | 58 | 4 | 17 | 16 | 20 | 7 |
| 8,61,25,21,11,13,12 | 8 | 61 | 25 | 21 | 11 | 13 | 12 |
| 23,28,2,26,5,19,1 | 23 | 28 | 2 | 26 | 5 | 19 | 1 |
| 45,37,12,13,4,14,17 | 45 | 37 | 12 | 13 | 4 | 14 | 17 |
| 34,29,11,35,10,27,21 | 34 | 29 | 11 | 35 | 10 | 27 | 21 |
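
The NA padding is easier to see on a small case. This sketch uses toy data with three game columns instead of seven, and splits the strings once before the loop.

```r
# Toy data: the second player only faced two opponents
toy <- data.frame(V1 = c("39,21,18", "23,28"), stringsAsFactors = FALSE)

# Split once, then pick the i-th piece from every row; missing pieces become NA
pieces <- strsplit(toy$V1, ",")
for (i in 1:3) {
  toy[, paste0("game", i)] <- sapply(pieces, "[", i)
}
print(toy)
```

Indexing past the end of a vector returns NA in R, which is why no explicit padding is needed.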
Now that the opponent data was stored correctly in a separate dataframe, I decided to merge this data in with the rest of my vectors to create one large dataframe (‘chess_ratings_df’):
chess_ratings_df <- data.frame(player_name = player_names,
                               player_state = location,
                               total_points = total_points,
                               player_pre_rating = pre_rating,
                               player_post_rating = post_rating,
                               game1 = tempdf$game1,
                               game2 = tempdf$game2,
                               game3 = tempdf$game3,
                               game4 = tempdf$game4,
                               game5 = tempdf$game5,
                               game6 = tempdf$game6,
                               game7 = tempdf$game7)
kable(head(chess_ratings_df))
| player_name | player_state | total_points | player_pre_rating | player_post_rating | game1 | game2 | game3 | game4 | game5 | game6 | game7 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gary Hua | ON | 6.0 | 1794 | 1817 | 39 | 21 | 18 | 14 | 7 | 12 | 4 |
| Dakshesh Daruri | MI | 6.0 | 1553 | 1663 | 63 | 58 | 4 | 17 | 16 | 20 | 7 |
| Aditya Bajaj | MI | 6.0 | 1384 | 1640 | 8 | 61 | 25 | 21 | 11 | 13 | 12 |
| Patrick H Schilling | MI | 5.5 | 1716 | 1744 | 23 | 28 | 2 | 26 | 5 | 19 | 1 |
| Hanshi Zuo | MI | 5.5 | 1655 | 1690 | 45 | 37 | 12 | 13 | 4 | 14 | 17 |
| Hansen Song | OH | 5.0 | 1686 | 1687 | 34 | 29 | 11 | 35 | 10 | 27 | 21 |
In order to do proper average calculations later on, I needed to make sure my data types were correct. For this, I changed the character vectors to numeric vectors:
# just to be safe, I changed the data types for these columns to numbers
chess_ratings_df$player_pre_rating <- as.numeric(as.character(chess_ratings_df$player_pre_rating))
chess_ratings_df$player_post_rating <- as.numeric(as.character(chess_ratings_df$player_post_rating))
chess_ratings_df$total_points <- as.numeric(as.character(chess_ratings_df$total_points))
chess_ratings_df$game1 <- as.numeric(as.character(chess_ratings_df$game1))
chess_ratings_df$game2 <- as.numeric(as.character(chess_ratings_df$game2))
chess_ratings_df$game3 <- as.numeric(as.character(chess_ratings_df$game3))
chess_ratings_df$game4 <- as.numeric(as.character(chess_ratings_df$game4))
chess_ratings_df$game5 <- as.numeric(as.character(chess_ratings_df$game5))
chess_ratings_df$game6 <- as.numeric(as.character(chess_ratings_df$game6))
chess_ratings_df$game7 <- as.numeric(as.character(chess_ratings_df$game7))
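The ten conversions above all follow the same shape, so they could also be written as one pass over a vector of column names. This is a sketch on toy data with hypothetical column names, not a change to the code above.

```r
# Toy data frame with numbers stored as text (hypothetical column names)
toy <- data.frame(name = c("A", "B"),
                  rating = c("1794", "1553"),
                  game1 = c("39", NA),
                  stringsAsFactors = FALSE)

# Convert every column except the name column in a single pass
numeric_cols <- setdiff(names(toy), "name")
toy[numeric_cols] <- lapply(toy[numeric_cols],
                            function(col) as.numeric(as.character(col)))
str(toy)
```

Keeping `as.character()` inside the conversion guards against columns that were read in as factors, where `as.numeric()` alone would return the factor level codes.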
To substitute each opponent number with that opponent’s pre-tournament Elo rating, I created another set of for loops:
# iterate over the game columns (6 through 12) and over every player row, look up the opponent's pre-tournament Elo rating by opponent number, and replace the opponent number with that rating.
for (i in 6:12) {
  for (j in seq_len(nrow(chess_ratings_df))) {
    value <- chess_ratings_df[, i][j]
    chess_ratings_df[, i][j] <- chess_ratings_df$player_pre_rating[value]
  }
}
kable(head(chess_ratings_df))
| player_name | player_state | total_points | player_pre_rating | player_post_rating | game1 | game2 | game3 | game4 | game5 | game6 | game7 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gary Hua | ON | 6.0 | 1794 | 1817 | 1436 | 1563 | 1600 | 1610 | 1649 | 1663 | 1716 |
| Dakshesh Daruri | MI | 6.0 | 1553 | 1663 | 1175 | 917 | 1716 | 1629 | 1604 | 1595 | 1649 |
| Aditya Bajaj | MI | 6.0 | 1384 | 1640 | 1641 | 955 | 1745 | 1563 | 1712 | 1666 | 1663 |
| Patrick H Schilling | MI | 5.5 | 1716 | 1744 | 1363 | 1507 | 1553 | 1579 | 1655 | 1564 | 1794 |
| Hanshi Zuo | MI | 5.5 | 1655 | 1690 | 1242 | 980 | 1663 | 1666 | 1716 | 1610 | 1629 |
| Hansen Song | OH | 5.0 | 1686 | 1687 | 1399 | 1602 | 1712 | 1438 | 1365 | 1552 | 1563 |
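
Because R can index a vector with another vector, the same lookup can be done a whole column at a time. The sketch below uses toy data and, like the loops above, assumes that each opponent number equals that opponent's row position in the data frame.

```r
# Toy data: three players, opponents stored as row numbers
toy <- data.frame(pre_rating = c(1794, 1553, 1384),
                  game1 = c(2, 3, 1),
                  game2 = c(3, NA, 2))

game_cols <- c("game1", "game2")

# Index the rating vector with each opponent column; NA opponents stay NA
toy[game_cols] <- lapply(toy[game_cols], function(ids) toy$pre_rating[ids])
print(toy)
```

The ratings are read from `pre_rating`, which is never overwritten, so the order in which the game columns are replaced does not matter.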
Finally, after the pre-tournament Elo ratings had been substituted in, I could take the average of these ratings across the seven games of the tournament to obtain the Average Pre Chess Rating of Opponents:
chess_ratings_df$average_opp_rating <- round(rowMeans(chess_ratings_df[,6:12], na.rm = TRUE), digits = 0)
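As a worked check of this calculation, the first row of the table above can be averaged by hand. The second part uses made-up ratings to show what `na.rm = TRUE` does for a player with an unplayed round.

```r
# Opponent pre-ratings for Gary Hua, copied from the first row of the table above
gary_opponents <- c(1436, 1563, 1600, 1610, 1649, 1663, 1716)

# The sum is 11237; dividing by 7 games gives about 1605.29, which rounds to 1605
gary_average <- round(mean(gary_opponents), digits = 0)
print(gary_average)

# Made-up ratings with one unplayed round: the NA is dropped,
# so the mean is taken over the two games actually played
short_average <- mean(c(1500, NA, 1300), na.rm = TRUE)
print(short_average)
```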
For my future data visualization, I decided to create a few more columns. First, I calculated the difference between each player’s pre-tournament and post-tournament rating to see who had the largest Elo rating gain. Second, I created a logical column flagging whether that difference was non-negative. Finally, for charting purposes, I created a column without negative numbers by squaring the difference and taking the square root, which is equivalent to taking its absolute value (will be used later):
chess_ratings_df$rating_difference <- chess_ratings_df$player_post_rating - chess_ratings_df$player_pre_rating
chess_ratings_df$change_pos <- ifelse(chess_ratings_df$rating_difference >=0, TRUE, FALSE)
chess_ratings_df$rating_difference_sqr <- sqrt((chess_ratings_df$rating_difference ^ 2))
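For reference, the square-then-square-root step gives the same values as `abs()`, and the comparison on its own already returns a logical vector. A small sketch with hypothetical rating changes:

```r
# Hypothetical rating changes, including one loss and one unchanged rating
rating_difference <- c(23, 110, -15, 0)

# sqrt(x^2) and abs(x) give the same magnitudes
via_sqrt <- sqrt(rating_difference ^ 2)
via_abs <- abs(rating_difference)
print(via_abs)

# The comparison itself yields TRUE/FALSE, so wrapping it in ifelse() is optional
change_pos <- rating_difference >= 0
print(change_pos)
```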
In the end, you can see the data frame required for the project (‘final_df’).
final_df <- chess_ratings_df[, c(1:4, 13)]
kable(final_df)
| player_name | player_state | total_points | player_pre_rating | average_opp_rating |
|---|---|---|---|---|
| Gary Hua | ON | 6.0 | 1794 | 1605 |
| Dakshesh Daruri | MI | 6.0 | 1553 | 1469 |
| Aditya Bajaj | MI | 6.0 | 1384 | 1564 |
| Patrick H Schilling | MI | 5.5 | 1716 | 1574 |
| Hanshi Zuo | MI | 5.5 | 1655 | 1501 |
| Hansen Song | OH | 5.0 | 1686 | 1519 |
| Gary Dee Swathell | MI | 5.0 | 1649 | 1372 |
| Ezekiel Houghton | MI | 5.0 | 1641 | 1468 |
| Stefano Lee | ON | 5.0 | 1411 | 1523 |
| Anvit Rao | MI | 5.0 | 1365 | 1554 |
| Cameron William Mc Leman | MI | 4.5 | 1712 | 1468 |
| Kenneth J Tack | MI | 4.5 | 1663 | 1506 |
| Torrance Henry Jr | MI | 4.5 | 1666 | 1498 |
| Bradley Shaw | MI | 4.5 | 1610 | 1515 |
| Zachary James Houghton | MI | 4.5 | 1220 | 1484 |
| Mike Nikitin | MI | 4.0 | 1604 | 1386 |
| Ronald Grzegorczyk | MI | 4.0 | 1629 | 1499 |
| David Sundeen | MI | 4.0 | 1600 | 1480 |
| Dipankar Roy | MI | 4.0 | 1564 | 1426 |
| Jason Zheng | MI | 4.0 | 1595 | 1411 |
| Dinh Dang Bui | ON | 4.0 | 1563 | 1470 |
| Eugene L Mcclure | MI | 4.0 | 1555 | 1300 |
| Alan Bui | ON | 4.0 | 1363 | 1214 |
| Michael R Aldrich | MI | 4.0 | 1229 | 1357 |
| Loren Schwiebert | MI | 3.5 | 1745 | 1363 |
| Max Zhu | ON | 3.5 | 1579 | 1507 |
| Gaurav Gidwani | MI | 3.5 | 1552 | 1222 |
| Sofia Adina Stanescu | MI | 3.5 | 1507 | 1522 |
| Chiedozie Okorie | MI | 3.5 | 1602 | 1314 |
| George Avery Jones | ON | 3.5 | 1522 | 1144 |
| Rishi Shetty | MI | 3.5 | 1494 | 1260 |
| Joshua Philip Mathews | ON | 3.5 | 1441 | 1379 |
| Jade Ge | MI | 3.5 | 1449 | 1277 |
| Michael Jeffery Thomas | MI | 3.5 | 1399 | 1375 |
| Joshua David Lee | MI | 3.5 | 1438 | 1150 |
| Siddharth Jha | MI | 3.5 | 1355 | 1388 |
| Amiyatosh Pwnanandam | MI | 3.5 | 980 | 1385 |
| Brian Liu | MI | 3.0 | 1423 | 1539 |
| Joel R Hendon | MI | 3.0 | 1436 | 1430 |
| Forest Zhang | MI | 3.0 | 1348 | 1391 |
| Kyle William Murphy | MI | 3.0 | 1403 | 1248 |
| Jared Ge | MI | 3.0 | 1332 | 1150 |
| Robert Glen Vasey | MI | 3.0 | 1283 | 1107 |
| Justin D Schilling | MI | 3.0 | 1199 | 1327 |
| Derek Yan | MI | 3.0 | 1242 | 1152 |
| Jacob Alexander Lavalley | MI | 3.0 | 377 | 1358 |
| Eric Wright | MI | 2.5 | 1362 | 1392 |
| Daniel Khain | MI | 2.5 | 1382 | 1356 |
| Michael J Martin | MI | 2.5 | 1291 | 1286 |
| Shivam Jha | MI | 2.5 | 1056 | 1296 |
| Tejas Ayyagari | MI | 2.5 | 1011 | 1356 |
| Ethan Guo | MI | 2.5 | 935 | 1495 |
| Jose C Ybarra | MI | 2.0 | 1393 | 1345 |
| Larry Hodge | MI | 2.0 | 1270 | 1206 |
| Alex Kong | MI | 2.0 | 1186 | 1406 |
| Marisa Ricci | MI | 2.0 | 1153 | 1414 |
| Michael Lu | MI | 2.0 | 1092 | 1363 |
| Viraj Mohile | MI | 2.0 | 917 | 1391 |
| Sean M Mc Cormick | MI | 2.0 | 853 | 1319 |
| Julia Shen | MI | 1.5 | 967 | 1330 |
| Jezzel Farkas | ON | 1.5 | 955 | 1327 |
| Ashwin Balaji | MI | 1.0 | 1530 | 1186 |
| Thomas Joseph Hosmer | MI | 1.0 | 1175 | 1350 |
| Ben Li | MI | 1.0 | 1163 | 1263 |
Here’s the syntax for exporting it to my local computer. However, feel free to change the filepath to export it anywhere else.
filename <- '~/Desktop/grad_school/data_607_cunysps/data607_cunysps/Project1/chess_ratings.csv'
write.csv(final_df, file = filename)
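By default `write.csv()` also writes the row names, which shows up as an extra unnamed column of row numbers in the file. If that column is not wanted, `row.names = FALSE` leaves it out. The sketch below uses a toy data frame and a temporary file so that it runs on any machine.

```r
# Toy stand-in for final_df
toy <- data.frame(player_name = c("Gary Hua", "Dakshesh Daruri"),
                  total_points = c(6.0, 6.0))

# Write to a temporary file; row.names = FALSE omits the row-number column
tmp <- tempfile(fileext = ".csv")
write.csv(toy, file = tmp, row.names = FALSE)

# Read it back to confirm that only the two real columns were written
round_trip <- read.csv(tmp)
print(names(round_trip))
```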
To go a step further with this project, I created a micro-site using a JavaScript framework (Angular), and used Google Firebase to deploy it to the internet. At this link:
https://data606-chess.firebaseapp.com/
…you’ll be able to find a ‘mock’ newspaper article that displays data visualizations and tables built with d3.js (a charting library for JavaScript). The data visualization displays data on hover, and shows more data (the +/- Elo rating changes) if you click the associated button.
As a quick note, the URL says ‘data606’ but it should be data607 (I made a typo during setup, sorry about that!)
If interested, you can find the files for this on my GitHub as well, at this repository.