Data 607 - Project #1

Libraries used

library(stringr)
library(knitr)

Reading the text file from GitHub into R

First, I saved the text file from Blackboard to my GitHub account, and then read it into R from the raw GitHub URL as one long string.

chess_data <- paste(readLines("https://raw.githubusercontent.com/zachalexander/data607_cunysps/master/Project1/chess_ratings.txt"), collapse = '\n')

Using Regular Expressions

Next, I wanted to take the long string and use regular expressions to extract the important information. I initially focused on isolating the player names. Since all of them are in uppercase, I used this as a way to save them into a vector.

player_names <- unlist(str_extract_all(chess_data, '([:upper:]+\\s[:upper:]+(\\s[:upper:]+)?(\\s[:upper:]+)?)'))

# remove the top value since it's not a name
player_names <- player_names[player_names != "USCF ID"]

# change to 'proper' case
player_names <- str_to_title(player_names)
head(player_names)

## [1] "Gary Hua"            "Dakshesh Daruri"     "Aditya Bajaj"       
## [4] "Patrick H Schilling" "Hanshi Zuo"          "Hansen Song"
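To make the name pattern more concrete, here is a minimal sketch that runs it against a small hand-written sample. The sample layout is an assumption about how the tournament file is formatted, and the ID numbers are placeholders rather than real USCF IDs.

```r
library(stringr)

# Hypothetical two-player sample in the assumed two-line-per-player layout
# (the ID numbers are placeholders, not real USCF IDs).
sample_text <- paste(
  "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|",
  "   ON | 00000000 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |",
  "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|",
  "   MI | 00000000 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |",
  sep = "\n"
)

# Same pattern as above: two uppercase words, then up to two more,
# each separated by exactly one whitespace character.
name_pattern <- "([:upper:]+\\s[:upper:]+(\\s[:upper:]+)?(\\s[:upper:]+)?)"
sample_names <- str_to_title(unlist(str_extract_all(sample_text, name_pattern)))
print(sample_names)
```

The single-letter result codes (W, D, L) are not picked up in this sample because they are followed by more than one space or by a pipe, so the pattern needs no extra filtering here beyond the header value removed earlier.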

For Player’s State (location), Pre-rating, Total Points, and Post-rating, I also used regular expressions to store this specific information in vectors. See below for my code:

# found that each player's state could be extracted through this regular expression combination
location <- unlist(str_extract_all(chess_data, '(\n)\\s\\s\\s[:upper:]+'))
location <- unlist(str_replace_all(location, '^\\s+|\\s+$', ""))
head(location)

## [1] "ON" "MI" "MI" "MI" "MI" "OH"

# found that each player's pre-rating could be extracted through this regular expression combination
pre_rating <- unlist(str_extract_all(chess_data, '(R:)\\s{1,}\\d+'))
pre_rating <- unlist(str_replace_all(pre_rating, '(R:)\\s+', ""))
head(pre_rating)

## [1] "1794" "1553" "1384" "1716" "1655" "1686"

# found that each player's post-rating could be extracted through this regular expression combination (used later in data visualization)
post_rating <- unlist(str_extract_all(chess_data, '(->)(\\s{1,})?\\d+'))
post_rating <- unlist(str_replace_all(post_rating, '(->)\\s*', ""))

# found that each player's total points could be extracted through this regular expression combination
total_points <- unlist(str_extract_all(chess_data, '(\\d\\.\\d)'))
head(total_points)

## [1] "6.0" "6.0" "6.0" "5.5" "5.5" "5.0"
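As an alternative sketch (not the approach used in this project), capture groups can pull each value out in one step instead of extracting and then stripping the prefix. The record below is a hand-written sample in the layout I assume the file uses, with a placeholder ID; the patterns are illustrative and use [A-Z] in place of [:upper:].

```r
library(stringr)

# Hypothetical single record; the leading "\n" stands in for the end of the previous line.
record <- paste0(
  "\n    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|",
  "\n   ON | 00000000 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
)

# str_match() returns a matrix: column 1 is the full match, column 2 the first capture group.
state  <- str_match(record, "\\n\\s{3}([A-Z]{2})\\s\\|")[, 2]
pre    <- as.numeric(str_match(record, "R:\\s*(\\d+)")[, 2])
post   <- as.numeric(str_match(record, "->\\s*(\\d+)")[, 2])
points <- as.numeric(str_match(record, "\\|(\\d\\.\\d)\\s*\\|")[, 2])

print(list(state = state, pre = pre, post = post, points = points))
```

Because the digits are captured directly, the amount of whitespace after "R:" or "->" does not matter.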

Dealing with the player opponent data

Once the above information was isolated and stored in vectors, I needed to find a way to grab each player’s tournament results as a string. To do this, I used a regular expression that matches just the result and the opponent number for each of a player’s games. See below:

results <- unlist(str_extract_all(chess_data, '(\\d\\.\\d\\s{2}(\\|[:upper:]\\s{2}(\\s)?(\\s)?(\\d+)?){1,})'))
results <- unlist(str_replace(results, '\\d\\.\\d\\s{2}\\|', ''))
kable(head(results))

x
W 39\|W 21\|W 18\|W 14\|W 7\|D 12\|D 4
W 63\|W 58\|L 4\|W 17\|W 16\|W 20\|W 7
L 8\|W 61\|W 25\|W 21\|W 11\|W 13\|W 12
W 23\|D 28\|W 2\|W 26\|D 5\|W 19\|D 1
W 45\|W 37\|D 12\|D 13\|D 4\|W 14\|W 17
W 34\|D 29\|L 11\|W 35\|D 10\|W 27\|W 21

Storing these in a separate data frame

After grabbing the results, I needed to clean them up a bit in order to eventually replace each opponent number with that opponent’s pre-tournament Elo rating. I then stored this information in a temporary data frame (‘tempdf’) in order to bind it together with the full data later:

opponents <- unlist(str_replace_all(results, '[:upper:]\\s+', ''))
opponents <- unlist(str_replace_all(opponents, '\\|', '\\,'))
opponents <- unlist(str_replace_all(opponents, '\\,{2,}', '\\,'))
opponents <- unlist(str_replace_all(opponents, '(\\,$)', ''))
opponents <- unlist(str_replace_all(opponents, '^\\,', ''))
tempdf <- data.frame(V1 = opponents)
kable(head(tempdf))

V1
39,21,18,14,7,12,4
63,58,4,17,16,20,7
8,61,25,21,11,13,12
23,28,2,26,5,19,1
45,37,12,13,4,14,17
34,29,11,35,10,27,21
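A shorter route to the same opponent numbers, sketched here as an alternative rather than the method used above, is to capture only the digits that follow a result letter. The first line is the first player’s results from the output above; the second line, with byes and unplayed rounds, is hypothetical.

```r
library(stringr)

line_full <- "|6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
line_bye  <- "|2.5  |W  10|H    |L   3|U    |D   8|B    |L   5|"  # hypothetical

# Hypothetical helper: keep only the digits that follow "|<letter><spaces>".
# Rounds with no opponent number (byes, unplayed games) simply produce no match.
opponent_ids <- function(x) {
  as.integer(str_match_all(x, "\\|[A-Z]\\s+(\\d+)")[[1]][, 2])
}

print(opponent_ids(line_full))
print(opponent_ids(line_bye))
```

This skips the comma clean-up steps, at the cost of no longer recording which round each opponent was played in.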

Once the temporary dataframe had just the opponents data in one column, I wanted to split the strings and create separate columns for each opponent based on the game. To do this, I created a for loop:

# used a for loop here to iterate over V1, using strsplit on the comma to separate the opponent numbers into the appropriate columns. This also stores NA in columns where a player didn't play an opponent.
for(i in 1:7){
  tempdf[, paste0('game', i)] <- sapply(strsplit(as.character(tempdf$V1),','), "[", i)
}
kable(head(tempdf))

V1	game1	game2	game3	game4	game5	game6	game7
39,21,18,14,7,12,4	39	21	18	14	7	12	4
63,58,4,17,16,20,7	63	58	4	17	16	20	7
8,61,25,21,11,13,12	8	61	25	21	11	13	12
23,28,2,26,5,19,1	23	28	2	26	5	19	1
45,37,12,13,4,14,17	45	37	12	13	4	14	17
34,29,11,35,10,27,21	34	29	11	35	10	27	21
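To show where the NA values come from, here is a small base R sketch of the same split. The first string is from the table above; the second, shorter string is a hypothetical player with only four opponents.

```r
opponents <- c("39,21,18,14,7,12,4", "10,3,8,5")  # second entry is hypothetical

parts <- strsplit(opponents, ",")

# Indexing past the end of a vector returns NA, so [1:7] pads short rows.
games <- t(sapply(parts, function(p) as.integer(p)[1:7]))
colnames(games) <- paste0("game", 1:7)
print(games)
```

The padding relies on R returning NA for out-of-range positions, which is the same behavior the "[" call inside the loop above depends on.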

Merging data into one data frame

Now that the opponent data was stored correctly in a separate dataframe, I decided to merge this data in with the rest of my vectors to create one large dataframe (‘chess_ratings_df’):

chess_ratings_df <- data.frame(player_name = player_names, 
                               player_state = location,
                               total_points = total_points,
                               player_pre_rating = pre_rating, 
                               player_post_rating = post_rating,
                               game1 = tempdf$game1,
                               game2 = tempdf$game2,
                               game3 = tempdf$game3,
                               game4 = tempdf$game4,
                               game5 = tempdf$game5,
                               game6 = tempdf$game6,
                               game7 = tempdf$game7)

kable(head(chess_ratings_df))

player_name	player_state	total_points	player_pre_rating	player_post_rating	game1	game2	game3	game4	game5	game6	game7
Gary Hua	ON	6.0	1794	1817	39	21	18	14	7	12	4
Dakshesh Daruri	MI	6.0	1553	1663	63	58	4	17	16	20	7
Aditya Bajaj	MI	6.0	1384	1640	8	61	25	21	11	13	12
Patrick H Schilling	MI	5.5	1716	1744	23	28	2	26	5	19	1
Hanshi Zuo	MI	5.5	1655	1690	45	37	12	13	4	14	17
Hansen Song	OH	5.0	1686	1687	34	29	11	35	10	27	21

Changing the data types from strings to numbers

In order to do proper average calculations later on, I needed to make sure my data types were correct. For this, I changed the character vectors to numeric vectors:

# just to be safe, I changed the data types for these columns to numbers
numeric_cols <- c('player_pre_rating', 'player_post_rating', 'total_points', paste0('game', 1:7))
chess_ratings_df[numeric_cols] <- lapply(chess_ratings_df[numeric_cols], function(x) as.numeric(as.character(x)))

Using loops to substitute opponent numbers with their pre-tournament Elo ratings

To substitute each opponent number with that opponent’s pre-tournament Elo rating, I created another set of for loops:

# for each game column (columns 6 to 12) and each player row, use the opponent number as a row index into player_pre_rating, then replace the opponent number with that opponent's Elo rating. An NA opponent number stays NA.
for(i in 6:12) {
  for(j in seq_len(nrow(chess_ratings_df))) {
    value <- chess_ratings_df[j, i]
    chess_ratings_df[j, i] <- chess_ratings_df$player_pre_rating[value]
  }
}

kable(head(chess_ratings_df))

player_name	player_state	total_points	player_pre_rating	player_post_rating	game1	game2	game3	game4	game5	game6	game7
Gary Hua	ON	6.0	1794	1817	1436	1563	1600	1610	1649	1663	1716
Dakshesh Daruri	MI	6.0	1553	1663	1175	917	1716	1629	1604	1595	1649
Aditya Bajaj	MI	6.0	1384	1640	1641	955	1745	1563	1712	1666	1663
Patrick H Schilling	MI	5.5	1716	1744	1363	1507	1553	1579	1655	1564	1794
Hanshi Zuo	MI	5.5	1655	1690	1242	980	1663	1666	1716	1610	1629
Hansen Song	OH	5.0	1686	1687	1399	1602	1712	1438	1365	1552	1563
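The same lookup can also be written without explicit loops, since indexing a vector by a vector of positions is already vectorized in R. The sketch below is an alternative, not the code used above; the three ratings come from the first three players in the table, while the pairings are hypothetical.

```r
# Toy data: three players; the pairings in game1 and game2 are hypothetical.
toy <- data.frame(
  player_pre_rating = c(1794, 1553, 1384),
  game1 = c(2, 1, 1),
  game2 = c(3, 3, NA)
)

game_cols <- c("game1", "game2")

# Each column of opponent numbers is used as a vector of row positions.
# An NA position yields an NA rating.
toy[game_cols] <- lapply(toy[game_cols], function(idx) toy$player_pre_rating[idx])
print(toy)
```

This works because the opponent number is the same as the player’s row position in the data frame, which is also the assumption the nested loops rely on.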

Taking the average of the pre-tournament Elo ratings

Finally, after the pre-tournament Elo ratings had been substituted in, I could take the average of these ratings across the seven games of the tournament to obtain the Average Pre Chess Rating of Opponents:

chess_ratings_df$average_opp_rating <- round(rowMeans(chess_ratings_df[,6:12], na.rm = TRUE), digits = 0)
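As a quick worked check, the calculation can be reproduced by hand for the first player. The seven ratings below are Gary Hua’s opponents from the table above; the second row is a hypothetical player with only two games played, included to show the effect of na.rm = TRUE.

```r
opp <- rbind(
  gary    = c(1436, 1563, 1600, 1610, 1649, 1663, 1716),
  partial = c(1400, NA, 1500, NA, NA, NA, NA)  # hypothetical player
)

# The ratings in the first row sum to 11237, and 11237 / 7 is about 1605.29.
# With na.rm = TRUE the second row averages only its two played games.
avg <- round(rowMeans(opp, na.rm = TRUE), digits = 0)
print(avg)
```

Without na.rm = TRUE, any player with an unplayed round would get NA as their average.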

For future data visualization work (see at the end), I created a few more columns in the ‘chess_ratings_df’ data frame.

I wanted to see the difference between each player’s pre-tournament and post-tournament rating to find who had the largest Elo rating gain. I also created a column flagging whether that difference was positive or negative. Finally, for charting purposes, I created one more column without negative numbers by squaring the difference and taking the square root, which is equivalent to taking its absolute value (used later):

chess_ratings_df$rating_difference <- chess_ratings_df$player_post_rating - chess_ratings_df$player_pre_rating
chess_ratings_df$change_pos <- chess_ratings_df$rating_difference >= 0
chess_ratings_df$rating_difference_sqr <- sqrt(chess_ratings_df$rating_difference ^ 2)
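A small sketch of what these three columns hold. The first two differences come from the table above (1817 - 1794 and 1663 - 1553); the negative value is hypothetical, added only to show a rating loss.

```r
rating_difference <- c(23, 110, -15)  # -15 is a hypothetical rating loss

change_pos <- rating_difference >= 0         # TRUE for a gain (or no change)
magnitude  <- sqrt(rating_difference ^ 2)    # square, then square root

print(change_pos)
print(magnitude)

# Squaring and taking the square root gives the same result as abs().
print(isTRUE(all.equal(magnitude, abs(rating_difference))))
```

Keeping the sign in one column and the magnitude in another lets a chart draw every bar upward while still coloring gains and losses differently.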

The final data frame to be exported to a csv file

The data frame required for the project (‘final_df’) keeps the first four columns plus the average opponent rating:

final_df <- chess_ratings_df[, c(1:4, 13)]
kable(final_df)

player_name	player_state	total_points	player_pre_rating	average_opp_rating
Gary Hua	ON	6.0	1794	1605
Dakshesh Daruri	MI	6.0	1553	1469
Aditya Bajaj	MI	6.0	1384	1564
Patrick H Schilling	MI	5.5	1716	1574
Hanshi Zuo	MI	5.5	1655	1501
Hansen Song	OH	5.0	1686	1519
Gary Dee Swathell	MI	5.0	1649	1372
Ezekiel Houghton	MI	5.0	1641	1468
Stefano Lee	ON	5.0	1411	1523
Anvit Rao	MI	5.0	1365	1554
Cameron William Mc Leman	MI	4.5	1712	1468
Kenneth J Tack	MI	4.5	1663	1506
Torrance Henry Jr	MI	4.5	1666	1498
Bradley Shaw	MI	4.5	1610	1515
Zachary James Houghton	MI	4.5	1220	1484
Mike Nikitin	MI	4.0	1604	1386
Ronald Grzegorczyk	MI	4.0	1629	1499
David Sundeen	MI	4.0	1600	1480
Dipankar Roy	MI	4.0	1564	1426
Jason Zheng	MI	4.0	1595	1411
Dinh Dang Bui	ON	4.0	1563	1470
Eugene L Mcclure	MI	4.0	1555	1300
Alan Bui	ON	4.0	1363	1214
Michael R Aldrich	MI	4.0	1229	1357
Loren Schwiebert	MI	3.5	1745	1363
Max Zhu	ON	3.5	1579	1507
Gaurav Gidwani	MI	3.5	1552	1222
Sofia Adina Stanescu	MI	3.5	1507	1522
Chiedozie Okorie	MI	3.5	1602	1314
George Avery Jones	ON	3.5	1522	1144
Rishi Shetty	MI	3.5	1494	1260
Joshua Philip Mathews	ON	3.5	1441	1379
Jade Ge	MI	3.5	1449	1277
Michael Jeffery Thomas	MI	3.5	1399	1375
Joshua David Lee	MI	3.5	1438	1150
Siddharth Jha	MI	3.5	1355	1388
Amiyatosh Pwnanandam	MI	3.5	980	1385
Brian Liu	MI	3.0	1423	1539
Joel R Hendon	MI	3.0	1436	1430
Forest Zhang	MI	3.0	1348	1391
Kyle William Murphy	MI	3.0	1403	1248
Jared Ge	MI	3.0	1332	1150
Robert Glen Vasey	MI	3.0	1283	1107
Justin D Schilling	MI	3.0	1199	1327
Derek Yan	MI	3.0	1242	1152
Jacob Alexander Lavalley	MI	3.0	377	1358
Eric Wright	MI	2.5	1362	1392
Daniel Khain	MI	2.5	1382	1356
Michael J Martin	MI	2.5	1291	1286
Shivam Jha	MI	2.5	1056	1296
Tejas Ayyagari	MI	2.5	1011	1356
Ethan Guo	MI	2.5	935	1495
Jose C Ybarra	MI	2.0	1393	1345
Larry Hodge	MI	2.0	1270	1206
Alex Kong	MI	2.0	1186	1406
Marisa Ricci	MI	2.0	1153	1414
Michael Lu	MI	2.0	1092	1363
Viraj Mohile	MI	2.0	917	1391
Sean M Mc Cormick	MI	2.0	853	1319
Julia Shen	MI	1.5	967	1330
Jezzel Farkas	ON	1.5	955	1327
Ashwin Balaji	MI	1.0	1530	1186
Thomas Joseph Hosmer	MI	1.0	1175	1350
Ben Li	MI	1.0	1163	1263

Here’s the code for exporting it to my local computer. Feel free to change the file path to export it anywhere else.

filename <- '~/Desktop/grad_school/data_607_cunysps/data607_cunysps/Project1/chess_ratings.csv'
write.csv(final_df, file = filename)
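One detail worth knowing: by default write.csv also writes the row names as an extra, unnamed first column. The sketch below is a variant, not the export used above. It writes a two-row sample (values from the table above) to a temporary file with row.names = FALSE and reads it back; the file name is hypothetical.

```r
demo_df <- data.frame(
  player_name = c("Gary Hua", "Dakshesh Daruri"),
  player_state = c("ON", "MI"),
  total_points = c(6.0, 6.0),
  player_pre_rating = c(1794, 1553),
  average_opp_rating = c(1605, 1469)
)

# Hypothetical output location inside the session's temporary directory
out_path <- file.path(tempdir(), "chess_ratings_demo.csv")

# row.names = FALSE leaves out the extra index column
write.csv(demo_df, file = out_path, row.names = FALSE)

round_trip <- read.csv(out_path, stringsAsFactors = FALSE)
print(names(round_trip))
```

Whether to keep the index column depends on what reads the file next; dropping it leaves exactly the five project columns.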

Data Visualization

To go a step further with this project, I created a micro-site using a JavaScript framework (Angular), and used Google Firebase to deploy it to the internet. At this link:

https://data606-chess.firebaseapp.com/

…you’ll be able to find a ‘mock’ newspaper article that displays a data visualization and tables using d3.js (a charting library for JavaScript). The data visualization displays data on hover, and shows more data (the +/- Elo rating changes) if you click the associated button.

As a quick note, the URL says ‘data606’ but it should be data607 (I made a typo during setup - sorry about that!)

Steps I used to create my data visualization
- Created an Angular project using the Angular CLI
- Created a Firebase project and deployed my Angular project using Firebase
- Took the CSV file that I created in R and converted the data to JSON format
- Using JavaScript and TypeScript, created a d3.js bar chart displaying some of this data
- Used Bootstrap (CSS library) to generate the tables

If interested, you can find the files for this on my GitHub as well, at this repository.