Assignment information

In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:

Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents

For the first player, the information would be: Gary Hua, ON, 6.0, 1794, 1605

1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.

If you have questions about the meaning of the data or the results, please post them on the discussion forum. Data science, like chess, is a game of back and forth…

The chess rating system (invented by a Minnesota statistician named Arpad Elo) has been used in many other contexts, including assessing relative strength of employment candidates by human resource departments. You may substitute another text file (or set of text files, or data scraped from web pages) of similar or greater complexity, and create your own assignment and solution. You may work in a small team. All of your code should be in an R markdown file (and published to rpubs.com); with your data accessible for the person running the script.

chess_url <- "https://raw.githubusercontent.com/RonBalaban/CUNY-SPS-R/main/607%20Project%201%20Chess%20Raw"
chess_raw <- read.table(chess_url,header=FALSE,sep="\n")
head(chess_raw, 15)

##                                                                                            V1
## 1   -----------------------------------------------------------------------------------------
## 2   Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| 
## 3   Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | 
## 4   -----------------------------------------------------------------------------------------
## 5       1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|
## 6      ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |
## 7   -----------------------------------------------------------------------------------------
## 8       2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|
## 9      MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |
## 10  -----------------------------------------------------------------------------------------
## 11      3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|
## 12     MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |
## 13  -----------------------------------------------------------------------------------------
## 14      4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|
## 15     MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |

nrow(chess_raw)  # 196 rows

## [1] 196

ncol(chess_raw) # 1 column

## [1] 1

# We need to generate a csv file with the Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents.
# This is 5 columns, and the last column is an average of at most 7 other numbers. Unfortunately, the data is all just 1 column, and must be split up 

# Looking at the data itself, it seems we don't need the first 4 rows, we only care about the data with the players themselves, starting with Gary Hua. We should also ignore all the rows that just have the "-" character, which happens on row 4,7,10,13- so every 3 rows.


# Bridge program showed R has matrices built into it, that we can use to create a matrix from some values
# https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/matrix
chess_matrix <- matrix(unlist(chess_raw), byrow = TRUE)
chess_players <- chess_matrix[seq(5,length(chess_matrix),3)]  
# This line says we start at row 5, then pull every 3 rows- all of which contain the actual data of the players themselves
head(chess_players)

## [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
## [2] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
## [3] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
## [4] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|"
## [5] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|"
## [6] "    6 | HANSEN SONG                     |5.0  |W  34|D  29|L  11|W  35|D  10|W  27|W  21|"

# So we have the information below; 
#  1 | GARY HUA |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|


# But this is only half the data, so I need a second matrix from chess_raw, which contains the other half of data such as state , ID, rating
chess_players2 <- chess_matrix[seq(6,length(chess_matrix),3)]
head(chess_players2)

## [1] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [2] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [3] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [4] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
## [5] "   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [6] "   OH | 15055204 / R: 1686   ->1687     |N:3  |W    |B    |W    |B    |B    |W    |B    |"

# Now I will combine them together
combined_players <- paste (chess_players, chess_players2, sep = "")
head(combined_players)

## [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [2] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [3] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [4] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
## [5] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [6] "    6 | HANSEN SONG                     |5.0  |W  34|D  29|L  11|W  35|D  10|W  27|W  21|   OH | 15055204 / R: 1686   ->1687     |N:3  |W    |B    |W    |B    |B    |W    |B    |"

# Now that we have the data in a proper table, rather than in 1 column, we can use Regex to extract what we need
# Regex resources: 
# https://www3.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html
# https://www.rexegg.com/regex-quickstart.html
# https://regex101.com/


#-------------------------------------------------------------------------------
# Player Name
chess_player_name <-  str_trim(str_extract(chess_players, "(\\w+\\s){2,3}")) 
# The regex \w is a meta character, like \d, that looks for any one word/non-word character, which are [a-zA-Z0-9_]. Just easier to type
# The regex \s is any one space/non-space character, which are [ \n\r\t\f]  (newline, return, tab, form feed)
# {2,3} is repetition, meaning some word to repeat between 2 or 3 times.
# Altogether, the Regex above simply looks for a word
head(chess_player_name)

## [1] "GARY HUA"            "DAKSHESH DARURI"     "ADITYA BAJAJ"       
## [4] "PATRICK H SCHILLING" "HANSHI ZUO"          "HANSEN SONG"

#-------------------------------------------------------------------------------
# Player State
chess_player_state <- str_extract(chess_players2, "[A-Z]{2}")
head(chess_player_state)

## [1] "ON" "MI" "MI" "MI" "MI" "OH"

# Same as \w above


#-------------------------------------------------------------------------------
# Player Points
chess_player_points <- as.numeric(str_extract(chess_players, "[0-9]\\.[0-9]"))
head(chess_player_points)

## [1] 6.0 6.0 6.0 5.5 5.5 5.0

#-------------------------------------------------------------------------------
# player pre-game rating
# It's currently stored as a string, which we will convert
chess_player_prerating <- str_extract(chess_players2, "\\s\\d{3,4}[^\\d]")  #This looks for a players score which could be 3 or 4 digits
chess_player_prerating <- as.integer(str_extract(chess_player_prerating, "\\d+")) # Convert to number
head(chess_player_prerating)

## [1] 1794 1553 1384 1716 1655 1686

#-------------------------------------------------------------------------------
# We got the first 4 requested fields, now we need to find the average pre-chess rating for all opponents, for each player.
# First we should find how many opponents each player had. We can get this from looking at our first matrix, chess_players, which has the Number ID of each player. The ID is a 2-digit numeric character

  
# matching all combinations of 1 letter 2 spaces and any numbers
Opponent_IDs <- str_extract_all(chess_players, "\\d+\\|")
# Look for the ID's, which occur right before the | character in our first matrix
# This Regex looks for numeric digits that occur 1 or more times.
Opponent_IDs <- str_extract_all(Opponent_IDs, "\\d+")
# I originally did  "\\d+"  , which does in fact pull all the numbers, but it also pulled the row number itself, along with the total points, which is wrong.  The +\\| is needed to let the Regex know to only look at the numbers followed by a |
head(Opponent_IDs)

## [[1]]
## [1] "39" "21" "18" "14" "7"  "12" "4" 
## 
## [[2]]
## [1] "63" "58" "4"  "17" "16" "20" "7" 
## 
## [[3]]
## [1] "8"  "61" "25" "21" "11" "13" "12"
## 
## [[4]]
## [1] "23" "28" "2"  "26" "5"  "19" "1" 
## 
## [[5]]
## [1] "45" "37" "12" "13" "4"  "14" "17"
## 
## [[6]]
## [1] "34" "29" "11" "35" "10" "27" "21"

# Now that we know the ID's of each players opponent, we know how many players they faced, thus we can calculate the average pre-chess rating for each player's opponents. 




avg_opponent_prerating <- c()
num_opponents <- c(1: length(Opponent_IDs)) # Define a length to loop from for each player

# Start with an empty list, which we will append the average numbers to, with a for loop
for(i in num_opponents)
  {
  avg_opponent_prerating[i] <- mean(chess_player_prerating[as.numeric(Opponent_IDs[[i]])])
  }
# This loops looks at the chess_player_pre-rating for each opponent, averages them, for each player
# For loop documentation; https://www.dataquest.io/blog/for-loop-in-r/

# Round to nearest number
avg_opponent_prerating <- round(avg_opponent_prerating, 0)
head(avg_opponent_prerating)

## [1] 1605 1469 1564 1574 1501 1519

#-------------------------------------------------------------------------------
# Turn into dataframe
Chess_dataframe_final <- data.frame(list(chess_player_name, chess_player_state, chess_player_points, chess_player_prerating,avg_opponent_prerating))
colnames(Chess_dataframe_final) <- c("Name", "State", "Total Points", "Player Pre-Rating", "Average Opponent Pre-Rating")


Chess_dataframe_final  # Display the dataframe with all finalized fields.

##                        Name State Total Points Player Pre-Rating
## 1                  GARY HUA    ON          6.0              1794
## 2           DAKSHESH DARURI    MI          6.0              1553
## 3              ADITYA BAJAJ    MI          6.0              1384
## 4       PATRICK H SCHILLING    MI          5.5              1716
## 5                HANSHI ZUO    MI          5.5              1655
## 6               HANSEN SONG    OH          5.0              1686
## 7         GARY DEE SWATHELL    MI          5.0              1649
## 8          EZEKIEL HOUGHTON    MI          5.0              1641
## 9               STEFANO LEE    ON          5.0              1411
## 10                ANVIT RAO    MI          5.0              1365
## 11       CAMERON WILLIAM MC    MI          4.5              1712
## 12           KENNETH J TACK    MI          4.5              1663
## 13        TORRANCE HENRY JR    MI          4.5              1666
## 14             BRADLEY SHAW    MI          4.5              1610
## 15   ZACHARY JAMES HOUGHTON    MI          4.5              1220
## 16             MIKE NIKITIN    MI          4.0              1604
## 17       RONALD GRZEGORCZYK    MI          4.0              1629
## 18            DAVID SUNDEEN    MI          4.0              1600
## 19             DIPANKAR ROY    MI          4.0              1564
## 20              JASON ZHENG    MI          4.0              1595
## 21            DINH DANG BUI    ON          4.0              1563
## 22         EUGENE L MCCLURE    MI          4.0              1555
## 23                 ALAN BUI    ON          4.0              1363
## 24        MICHAEL R ALDRICH    MI          4.0              1229
## 25         LOREN SCHWIEBERT    MI          3.5              1745
## 26                  MAX ZHU    ON          3.5              1579
## 27           GAURAV GIDWANI    MI          3.5              1552
## 28              SOFIA ADINA    MI          3.5              1507
## 29         CHIEDOZIE OKORIE    MI          3.5              1602
## 30       GEORGE AVERY JONES    ON          3.5              1522
## 31             RISHI SHETTY    MI          3.5              1494
## 32    JOSHUA PHILIP MATHEWS    ON          3.5              1441
## 33                  JADE GE    MI          3.5              1449
## 34   MICHAEL JEFFERY THOMAS    MI          3.5              1399
## 35         JOSHUA DAVID LEE    MI          3.5              1438
## 36            SIDDHARTH JHA    MI          3.5              1355
## 37     AMIYATOSH PWNANANDAM    MI          3.5               980
## 38                BRIAN LIU    MI          3.0              1423
## 39            JOEL R HENDON    MI          3.0              1436
## 40             FOREST ZHANG    MI          3.0              1348
## 41      KYLE WILLIAM MURPHY    MI          3.0              1403
## 42                 JARED GE    MI          3.0              1332
## 43        ROBERT GLEN VASEY    MI          3.0              1283
## 44       JUSTIN D SCHILLING    MI          3.0              1199
## 45                DEREK YAN    MI          3.0              1242
## 46 JACOB ALEXANDER LAVALLEY    MI          3.0               377
## 47              ERIC WRIGHT    MI          2.5              1362
## 48             DANIEL KHAIN    MI          2.5              1382
## 49         MICHAEL J MARTIN    MI          2.5              1291
## 50               SHIVAM JHA    MI          2.5              1056
## 51           TEJAS AYYAGARI    MI          2.5              1011
## 52                ETHAN GUO    MI          2.5               935
## 53            JOSE C YBARRA    MI          2.0              1393
## 54              LARRY HODGE    MI          2.0              1270
## 55                ALEX KONG    MI          2.0              1186
## 56             MARISA RICCI    MI          2.0              1153
## 57               MICHAEL LU    MI          2.0              1092
## 58             VIRAJ MOHILE    MI          2.0               917
## 59                SEAN M MC    MI          2.0               853
## 60               JULIA SHEN    MI          1.5               967
## 61            JEZZEL FARKAS    ON          1.5               955
## 62            ASHWIN BALAJI    MI          1.0              1530
## 63     THOMAS JOSEPH HOSMER    MI          1.0              1175
## 64                   BEN LI    MI          1.0              1163
##    Average Opponent Pre-Rating
## 1                         1605
## 2                         1469
## 3                         1564
## 4                         1574
## 5                         1501
## 6                         1519
## 7                         1372
## 8                         1468
## 9                         1523
## 10                        1554
## 11                        1468
## 12                        1506
## 13                        1498
## 14                        1515
## 15                        1484
## 16                        1386
## 17                        1499
## 18                        1480
## 19                        1426
## 20                        1411
## 21                        1470
## 22                        1300
## 23                        1214
## 24                        1357
## 25                        1363
## 26                        1507
## 27                        1222
## 28                        1522
## 29                        1314
## 30                        1144
## 31                        1260
## 32                        1379
## 33                        1277
## 34                        1375
## 35                        1150
## 36                        1388
## 37                        1385
## 38                        1539
## 39                        1430
## 40                        1391
## 41                        1248
## 42                        1150
## 43                        1107
## 44                        1327
## 45                        1152
## 46                        1358
## 47                        1392
## 48                        1356
## 49                        1286
## 50                        1296
## 51                        1356
## 52                        1495
## 53                        1345
## 54                        1206
## 55                        1406
## 56                        1414
## 57                        1363
## 58                        1391
## 59                        1319
## 60                        1330
## 61                        1327
## 62                        1186
## 63                        1350
## 64                        1263

write.csv(Chess_dataframe_final, file = "Chess_Results.csv", col.names= TRUE, row.names= FALSE)

607 Project 1

Ron Balaban

2023-09-23

Assignment information