Possibly needed libraries
library(tidyverse)
library(openintro)
library(stringr)
library(dplyr)
library(readr)
library(tidyr)
In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:
Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents
For the first player, the information would be: Gary Hua, ON, 6.0, 1794, 1605
1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.
If you have questions about the meaning of the data or the results, please post them on the discussion forum. Data science, like chess, is a game of back and forth…
The chess rating system (invented by a Minnesota statistician named Arpad Elo) has been used in many other contexts, including assessing relative strength of employment candidates by human resource departments. You may substitute another text file (or set of text files, or data scraped from web pages) of similar or greater complexity, and create your own assignment and solution. You may work in a small team. All of your code should be in an R markdown file (and published to rpubs.com); with your data accessible for the person running the script.
chess_url <- "https://raw.githubusercontent.com/RonBalaban/CUNY-SPS-R/main/607%20Project%201%20Chess%20Raw"
chess_raw <- read.table(chess_url,header=FALSE,sep="\n")
head(chess_raw, 15)
## V1
## 1 -----------------------------------------------------------------------------------------
## 2 Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|
## 3 Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
## 4 -----------------------------------------------------------------------------------------
## 5 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
## 6 ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
## 7 -----------------------------------------------------------------------------------------
## 8 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|
## 9 MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |
## 10 -----------------------------------------------------------------------------------------
## 11 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|
## 12 MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |
## 13 -----------------------------------------------------------------------------------------
## 14 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|
## 15 MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |
nrow(chess_raw) # 196 rows
## [1] 196
ncol(chess_raw) # 1 column
## [1] 1
# We need to generate a csv file with the Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents.
# This is 5 columns, and the last column is an average of at most 7 other numbers. Unfortunately, the data is all just 1 column, and must be split up
# Looking at the data itself, it seems we don't need the first 4 rows, we only care about the data with the players themselves, starting with Gary Hua. We should also ignore all the rows that just have the "-" character, which happens on row 4,7,10,13- so every 3 rows.
# Bridge program showed R has matrices built into it, that we can use to create a matrix from some values
# https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/matrix
chess_matrix <- matrix(unlist(chess_raw), byrow = TRUE)
chess_players <- chess_matrix[seq(5,length(chess_matrix),3)]
# This line says we start at row 5, then pull every 3 rows- all of which contain the actual data of the players themselves
head(chess_players)
## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [2] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [3] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [4] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|"
## [5] " 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|"
## [6] " 6 | HANSEN SONG |5.0 |W 34|D 29|L 11|W 35|D 10|W 27|W 21|"
# So we have the information below;
# 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
# But this is only half the data, so I need a second matrix from chess_raw, which contains the other half of data such as state , ID, rating
chess_players2 <- chess_matrix[seq(6,length(chess_matrix),3)]
head(chess_players2)
## [1] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [2] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [3] " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
## [4] " MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |"
## [5] " MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |"
## [6] " OH | 15055204 / R: 1686 ->1687 |N:3 |W |B |W |B |B |W |B |"
# Now I will combine them together
combined_players <- paste (chess_players, chess_players2, sep = "")
head(combined_players)
## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4| ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [2] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7| MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [3] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12| MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
## [4] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1| MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |"
## [5] " 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17| MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |"
## [6] " 6 | HANSEN SONG |5.0 |W 34|D 29|L 11|W 35|D 10|W 27|W 21| OH | 15055204 / R: 1686 ->1687 |N:3 |W |B |W |B |B |W |B |"
# Now that we have the data in a proper table, rather than in 1 column, we can use Regex to extract what we need
# Regex resources:
# https://www3.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html
# https://www.rexegg.com/regex-quickstart.html
# https://regex101.com/
#-------------------------------------------------------------------------------
# Player Name
chess_player_name <- str_trim(str_extract(chess_players, "(\\w+\\s){2,3}"))
# The regex \w is a meta character, like \d, that looks for any one word/non-word character, which are [a-zA-Z0-9_]. Just easier to type
# The regex \s is any one space/non-space character, which are [ \n\r\t\f] (newline, return, tab, form feed)
# {2,3} is repetition, meaning some word to repeat between 2 or 3 times.
# Altogether, the Regex above simply looks for a word
head(chess_player_name)
## [1] "GARY HUA" "DAKSHESH DARURI" "ADITYA BAJAJ"
## [4] "PATRICK H SCHILLING" "HANSHI ZUO" "HANSEN SONG"
#-------------------------------------------------------------------------------
# Player State
chess_player_state <- str_extract(chess_players2, "[A-Z]{2}")
head(chess_player_state)
## [1] "ON" "MI" "MI" "MI" "MI" "OH"
# Same as \w above
#-------------------------------------------------------------------------------
# Player Points
chess_player_points <- as.numeric(str_extract(chess_players, "[0-9]\\.[0-9]"))
head(chess_player_points)
## [1] 6.0 6.0 6.0 5.5 5.5 5.0
#-------------------------------------------------------------------------------
# player pre-game rating
# It's currently stored as a string, which we will convert
chess_player_prerating <- str_extract(chess_players2, "\\s\\d{3,4}[^\\d]") #This looks for a players score which could be 3 or 4 digits
chess_player_prerating <- as.integer(str_extract(chess_player_prerating, "\\d+")) # Convert to number
head(chess_player_prerating)
## [1] 1794 1553 1384 1716 1655 1686
#-------------------------------------------------------------------------------
# We got the first 4 requested fields, now we need to find the average pre-chess rating for all opponents, for each player.
# First we should find how many opponents each player had. We can get this from looking at our first matrix, chess_players, which has the Number ID of each player. The ID is a 2-digit numeric character
# matching all combinations of 1 letter 2 spaces and any numbers
Opponent_IDs <- str_extract_all(chess_players, "\\d+\\|")
# Look for the ID's, which occur right before the | character in our first matrix
# This Regex looks for numeric digits that occur 1 or more times.
Opponent_IDs <- str_extract_all(Opponent_IDs, "\\d+")
# I originally did "\\d+" , which does in fact pull all the numbers, but it also pulled the row number itself, along with the total points, which is wrong. The +\\| is needed to let the Regex know to only look at the numbers followed by a |
head(Opponent_IDs)
## [[1]]
## [1] "39" "21" "18" "14" "7" "12" "4"
##
## [[2]]
## [1] "63" "58" "4" "17" "16" "20" "7"
##
## [[3]]
## [1] "8" "61" "25" "21" "11" "13" "12"
##
## [[4]]
## [1] "23" "28" "2" "26" "5" "19" "1"
##
## [[5]]
## [1] "45" "37" "12" "13" "4" "14" "17"
##
## [[6]]
## [1] "34" "29" "11" "35" "10" "27" "21"
# Now that we know the ID's of each players opponent, we know how many players they faced, thus we can calculate the average pre-chess rating for each player's opponents.
avg_opponent_prerating <- c()
num_opponents <- c(1: length(Opponent_IDs)) # Define a length to loop from for each player
# Start with an empty list, which we will append the average numbers to, with a for loop
for(i in num_opponents)
{
avg_opponent_prerating[i] <- mean(chess_player_prerating[as.numeric(Opponent_IDs[[i]])])
}
# This loops looks at the chess_player_pre-rating for each opponent, averages them, for each player
# For loop documentation; https://www.dataquest.io/blog/for-loop-in-r/
# Round to nearest number
avg_opponent_prerating <- round(avg_opponent_prerating, 0)
head(avg_opponent_prerating)
## [1] 1605 1469 1564 1574 1501 1519
#-------------------------------------------------------------------------------
# Turn into dataframe
Chess_dataframe_final <- data.frame(list(chess_player_name, chess_player_state, chess_player_points, chess_player_prerating,avg_opponent_prerating))
colnames(Chess_dataframe_final) <- c("Name", "State", "Total Points", "Player Pre-Rating", "Average Opponent Pre-Rating")
Chess_dataframe_final # Display the dataframe with all finalized fields.
## Name State Total Points Player Pre-Rating
## 1 GARY HUA ON 6.0 1794
## 2 DAKSHESH DARURI MI 6.0 1553
## 3 ADITYA BAJAJ MI 6.0 1384
## 4 PATRICK H SCHILLING MI 5.5 1716
## 5 HANSHI ZUO MI 5.5 1655
## 6 HANSEN SONG OH 5.0 1686
## 7 GARY DEE SWATHELL MI 5.0 1649
## 8 EZEKIEL HOUGHTON MI 5.0 1641
## 9 STEFANO LEE ON 5.0 1411
## 10 ANVIT RAO MI 5.0 1365
## 11 CAMERON WILLIAM MC MI 4.5 1712
## 12 KENNETH J TACK MI 4.5 1663
## 13 TORRANCE HENRY JR MI 4.5 1666
## 14 BRADLEY SHAW MI 4.5 1610
## 15 ZACHARY JAMES HOUGHTON MI 4.5 1220
## 16 MIKE NIKITIN MI 4.0 1604
## 17 RONALD GRZEGORCZYK MI 4.0 1629
## 18 DAVID SUNDEEN MI 4.0 1600
## 19 DIPANKAR ROY MI 4.0 1564
## 20 JASON ZHENG MI 4.0 1595
## 21 DINH DANG BUI ON 4.0 1563
## 22 EUGENE L MCCLURE MI 4.0 1555
## 23 ALAN BUI ON 4.0 1363
## 24 MICHAEL R ALDRICH MI 4.0 1229
## 25 LOREN SCHWIEBERT MI 3.5 1745
## 26 MAX ZHU ON 3.5 1579
## 27 GAURAV GIDWANI MI 3.5 1552
## 28 SOFIA ADINA MI 3.5 1507
## 29 CHIEDOZIE OKORIE MI 3.5 1602
## 30 GEORGE AVERY JONES ON 3.5 1522
## 31 RISHI SHETTY MI 3.5 1494
## 32 JOSHUA PHILIP MATHEWS ON 3.5 1441
## 33 JADE GE MI 3.5 1449
## 34 MICHAEL JEFFERY THOMAS MI 3.5 1399
## 35 JOSHUA DAVID LEE MI 3.5 1438
## 36 SIDDHARTH JHA MI 3.5 1355
## 37 AMIYATOSH PWNANANDAM MI 3.5 980
## 38 BRIAN LIU MI 3.0 1423
## 39 JOEL R HENDON MI 3.0 1436
## 40 FOREST ZHANG MI 3.0 1348
## 41 KYLE WILLIAM MURPHY MI 3.0 1403
## 42 JARED GE MI 3.0 1332
## 43 ROBERT GLEN VASEY MI 3.0 1283
## 44 JUSTIN D SCHILLING MI 3.0 1199
## 45 DEREK YAN MI 3.0 1242
## 46 JACOB ALEXANDER LAVALLEY MI 3.0 377
## 47 ERIC WRIGHT MI 2.5 1362
## 48 DANIEL KHAIN MI 2.5 1382
## 49 MICHAEL J MARTIN MI 2.5 1291
## 50 SHIVAM JHA MI 2.5 1056
## 51 TEJAS AYYAGARI MI 2.5 1011
## 52 ETHAN GUO MI 2.5 935
## 53 JOSE C YBARRA MI 2.0 1393
## 54 LARRY HODGE MI 2.0 1270
## 55 ALEX KONG MI 2.0 1186
## 56 MARISA RICCI MI 2.0 1153
## 57 MICHAEL LU MI 2.0 1092
## 58 VIRAJ MOHILE MI 2.0 917
## 59 SEAN M MC MI 2.0 853
## 60 JULIA SHEN MI 1.5 967
## 61 JEZZEL FARKAS ON 1.5 955
## 62 ASHWIN BALAJI MI 1.0 1530
## 63 THOMAS JOSEPH HOSMER MI 1.0 1175
## 64 BEN LI MI 1.0 1163
## Average Opponent Pre-Rating
## 1 1605
## 2 1469
## 3 1564
## 4 1574
## 5 1501
## 6 1519
## 7 1372
## 8 1468
## 9 1523
## 10 1554
## 11 1468
## 12 1506
## 13 1498
## 14 1515
## 15 1484
## 16 1386
## 17 1499
## 18 1480
## 19 1426
## 20 1411
## 21 1470
## 22 1300
## 23 1214
## 24 1357
## 25 1363
## 26 1507
## 27 1222
## 28 1522
## 29 1314
## 30 1144
## 31 1260
## 32 1379
## 33 1277
## 34 1375
## 35 1150
## 36 1388
## 37 1385
## 38 1539
## 39 1430
## 40 1391
## 41 1248
## 42 1150
## 43 1107
## 44 1327
## 45 1152
## 46 1358
## 47 1392
## 48 1356
## 49 1286
## 50 1296
## 51 1356
## 52 1495
## 53 1345
## 54 1206
## 55 1406
## 56 1414
## 57 1363
## 58 1391
## 59 1319
## 60 1330
## 61 1327
## 62 1186
## 63 1350
## 64 1263
write.csv(Chess_dataframe_final, file = "Chess_Results.csv", col.names= TRUE, row.names= FALSE)