Introduction

Below, I’ll go step by step into converting the chess tournament text data into a workable data frame and .csv file.

I’ll start by importing tidyverse to handle analyses as they may arise.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Reading the data

I downloaded the text data, so I’ll read it in here and check the dimensions.

data <- read.delim('../data/data607_project1_chess.txt')

dim(data)
## [1] 195   1
colnames(data)
## [1] "X........................................................................................."

The data appears to have read in as a data frame with 1 column of 195 rows. The only column name is “X” followed by an indeterminate amount of dots.

In order to make it easier to select the column as a vector, I’ll change the name of it to ‘col1’, then call the whole column.

colnames(data) <- c('col1')

For now I’ll reset the data to start with the first line of an actual player, Gary Hua. I can re-insert the column names later as needed.

data1 <- data$col1[4:length(data$col1)]

head(data1, 20)
##  [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
##  [2] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
##  [3] "-----------------------------------------------------------------------------------------"
##  [4] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
##  [5] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
##  [6] "-----------------------------------------------------------------------------------------"
##  [7] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
##  [8] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
##  [9] "-----------------------------------------------------------------------------------------"
## [10] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|"
## [11] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
## [12] "-----------------------------------------------------------------------------------------"
## [13] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|"
## [14] "   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [15] "-----------------------------------------------------------------------------------------"
## [16] "    6 | HANSEN SONG                     |5.0  |W  34|D  29|L  11|W  35|D  10|W  27|W  21|"
## [17] "   OH | 15055204 / R: 1686   ->1687     |N:3  |W    |B    |W    |B    |B    |W    |B    |"
## [18] "-----------------------------------------------------------------------------------------"
## [19] "    7 | GARY DEE SWATHELL               |5.0  |W  57|W  46|W  13|W  11|L   1|W   9|L   2|"
## [20] "   MI | 11146376 / R: 1649   ->1673     |N:3  |W    |B    |W    |B    |B    |W    |W    |"

Properly separating the data

The ‘dash lines’ are on index multiples of 3, starting at 3. I’ll make a sequence of numbers from 3 to the length of data1 and remove all values with those indices.

data2 <-data1[-seq(0, length(data1), by = 3)] 

head(data2, 20)
##  [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
##  [2] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
##  [3] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
##  [4] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
##  [5] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
##  [6] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
##  [7] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|"
##  [8] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
##  [9] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|"
## [10] "   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [11] "    6 | HANSEN SONG                     |5.0  |W  34|D  29|L  11|W  35|D  10|W  27|W  21|"
## [12] "   OH | 15055204 / R: 1686   ->1687     |N:3  |W    |B    |W    |B    |B    |W    |B    |"
## [13] "    7 | GARY DEE SWATHELL               |5.0  |W  57|W  46|W  13|W  11|L   1|W   9|L   2|"
## [14] "   MI | 11146376 / R: 1649   ->1673     |N:3  |W    |B    |W    |B    |B    |W    |W    |"
## [15] "    8 | EZEKIEL HOUGHTON                |5.0  |W   3|W  32|L  14|L   9|W  47|W  28|W  19|"
## [16] "   MI | 15142253 / R: 1641P17->1657P24  |N:3  |B    |W    |B    |W    |B    |W    |W    |"
## [17] "    9 | STEFANO LEE                     |5.0  |W  25|L  18|W  59|W   8|W  26|L   7|W  20|"
## [18] "   ON | 14954524 / R: 1411   ->1564     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
## [19] "   10 | ANVIT RAO                       |5.0  |D  16|L  19|W  55|W  31|D   6|W  25|W  18|"
## [20] "   MI | 14150362 / R: 1365   ->1544     |N:3  |W    |W    |B    |B    |W    |B    |W    |"

Now every pair of rows contains the full data for one individual (albeit still in a messy character format). I should be able to loop through the even indices and construct a new vector consisting of one element for each player.

evens <- seq(2, length(data2), by = 2)

data3 <- c()

for (x in evens) {
  row1 <- data2[x-1]
  row2 <- data2[x]
  
  new_row <- paste(row1, row2)
  
  data3 <- append(data3, new_row)
}

head(data3, 20)
##  [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|    ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
##  [2] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|    MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
##  [3] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|    MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
##  [4] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|    MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
##  [5] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|    MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
##  [6] "    6 | HANSEN SONG                     |5.0  |W  34|D  29|L  11|W  35|D  10|W  27|W  21|    OH | 15055204 / R: 1686   ->1687     |N:3  |W    |B    |W    |B    |B    |W    |B    |"
##  [7] "    7 | GARY DEE SWATHELL               |5.0  |W  57|W  46|W  13|W  11|L   1|W   9|L   2|    MI | 11146376 / R: 1649   ->1673     |N:3  |W    |B    |W    |B    |B    |W    |W    |"
##  [8] "    8 | EZEKIEL HOUGHTON                |5.0  |W   3|W  32|L  14|L   9|W  47|W  28|W  19|    MI | 15142253 / R: 1641P17->1657P24  |N:3  |B    |W    |B    |W    |B    |W    |W    |"
##  [9] "    9 | STEFANO LEE                     |5.0  |W  25|L  18|W  59|W   8|W  26|L   7|W  20|    ON | 14954524 / R: 1411   ->1564     |N:2  |W    |B    |W    |B    |W    |B    |B    |"
## [10] "   10 | ANVIT RAO                       |5.0  |D  16|L  19|W  55|W  31|D   6|W  25|W  18|    MI | 14150362 / R: 1365   ->1544     |N:3  |W    |W    |B    |B    |W    |B    |W    |"
## [11] "   11 | CAMERON WILLIAM MC LEMAN        |4.5  |D  38|W  56|W   6|L   7|L   3|W  34|W  26|    MI | 12581589 / R: 1712   ->1696     |N:3  |B    |W    |B    |W    |B    |W    |B    |"
## [12] "   12 | KENNETH J TACK                  |4.5  |W  42|W  33|D   5|W  38|H    |D   1|L   3|    MI | 12681257 / R: 1663   ->1670     |N:3  |W    |B    |W    |B    |     |W    |B    |"
## [13] "   13 | TORRANCE HENRY JR               |4.5  |W  36|W  27|L   7|D   5|W  33|L   3|W  32|    MI | 15082995 / R: 1666   ->1662     |N:3  |B    |W    |B    |B    |W    |W    |B    |"
## [14] "   14 | BRADLEY SHAW                    |4.5  |W  54|W  44|W   8|L   1|D  27|L   5|W  31|    MI | 10131499 / R: 1610   ->1618     |N:3  |W    |B    |W    |W    |B    |B    |W    |"
## [15] "   15 | ZACHARY JAMES HOUGHTON          |4.5  |D  19|L  16|W  30|L  22|W  54|W  33|W  38|    MI | 15619130 / R: 1220P13->1416P20  |N:3  |B    |B    |W    |W    |B    |B    |W    |"
## [16] "   16 | MIKE NIKITIN                    |4.0  |D  10|W  15|H    |W  39|L   2|W  36|U    |    MI | 10295068 / R: 1604   ->1613     |N:3  |B    |W    |     |B    |W    |B    |     |"
## [17] "   17 | RONALD GRZEGORCZYK              |4.0  |W  48|W  41|L  26|L   2|W  23|W  22|L   5|    MI | 10297702 / R: 1629   ->1610     |N:3  |W    |B    |W    |B    |W    |B    |W    |"
## [18] "   18 | DAVID SUNDEEN                   |4.0  |W  47|W   9|L   1|W  32|L  19|W  38|L  10|    MI | 11342094 / R: 1600   ->1600     |N:3  |B    |W    |B    |W    |B    |W    |B    |"
## [19] "   19 | DIPANKAR ROY                    |4.0  |D  15|W  10|W  52|D  28|W  18|L   4|L   8|    MI | 14862333 / R: 1564   ->1570     |N:3  |W    |B    |W    |B    |W    |W    |B    |"
## [20] "   20 | JASON ZHENG                     |4.0  |L  40|W  49|W  23|W  41|W  28|L   2|L   9|    MI | 14529060 / R: 1595   ->1569     |N:4  |W    |B    |W    |B    |W    |B    |W    |"

Splitting each “row”

I now have 64 rows, matching the number of players in the .txt data. Fortunately, despite some messiness with spaces, each field we care about is separated from the others by a single character: “|”. I can make a new list of vectors from these strings by splitting them on that character, recognizing that “|” is a special character in regex and requires two backslashes to operate properly.

data4 <- str_split(data3, pattern = '\\|')

head(data4)
## [[1]]
##  [1] "    1 "                            " GARY HUA                        "
##  [3] "6.0  "                             "W  39"                            
##  [5] "W  21"                             "W  18"                            
##  [7] "W  14"                             "W   7"                            
##  [9] "D  12"                             "D   4"                            
## [11] "    ON "                           " 15445895 / R: 1794   ->1817     "
## [13] "N:2  "                             "W    "                            
## [15] "B    "                             "W    "                            
## [17] "B    "                             "W    "                            
## [19] "B    "                             "W    "                            
## [21] ""                                 
## 
## [[2]]
##  [1] "    2 "                            " DAKSHESH DARURI                 "
##  [3] "6.0  "                             "W  63"                            
##  [5] "W  58"                             "L   4"                            
##  [7] "W  17"                             "W  16"                            
##  [9] "W  20"                             "W   7"                            
## [11] "    MI "                           " 14598900 / R: 1553   ->1663     "
## [13] "N:2  "                             "B    "                            
## [15] "W    "                             "B    "                            
## [17] "W    "                             "B    "                            
## [19] "W    "                             "B    "                            
## [21] ""                                 
## 
## [[3]]
##  [1] "    3 "                            " ADITYA BAJAJ                    "
##  [3] "6.0  "                             "L   8"                            
##  [5] "W  61"                             "W  25"                            
##  [7] "W  21"                             "W  11"                            
##  [9] "W  13"                             "W  12"                            
## [11] "    MI "                           " 14959604 / R: 1384   ->1640     "
## [13] "N:2  "                             "W    "                            
## [15] "B    "                             "W    "                            
## [17] "B    "                             "W    "                            
## [19] "B    "                             "W    "                            
## [21] ""                                 
## 
## [[4]]
##  [1] "    4 "                            " PATRICK H SCHILLING             "
##  [3] "5.5  "                             "W  23"                            
##  [5] "D  28"                             "W   2"                            
##  [7] "W  26"                             "D   5"                            
##  [9] "W  19"                             "D   1"                            
## [11] "    MI "                           " 12616049 / R: 1716   ->1744     "
## [13] "N:2  "                             "W    "                            
## [15] "B    "                             "W    "                            
## [17] "B    "                             "W    "                            
## [19] "B    "                             "B    "                            
## [21] ""                                 
## 
## [[5]]
##  [1] "    5 "                            " HANSHI ZUO                      "
##  [3] "5.5  "                             "W  45"                            
##  [5] "W  37"                             "D  12"                            
##  [7] "D  13"                             "D   4"                            
##  [9] "W  14"                             "W  17"                            
## [11] "    MI "                           " 14601533 / R: 1655   ->1690     "
## [13] "N:2  "                             "B    "                            
## [15] "W    "                             "B    "                            
## [17] "W    "                             "B    "                            
## [19] "W    "                             "B    "                            
## [21] ""                                 
## 
## [[6]]
##  [1] "    6 "                            " HANSEN SONG                     "
##  [3] "5.0  "                             "W  34"                            
##  [5] "D  29"                             "L  11"                            
##  [7] "W  35"                             "D  10"                            
##  [9] "W  27"                             "W  21"                            
## [11] "    OH "                           " 15055204 / R: 1686   ->1687     "
## [13] "N:3  "                             "W    "                            
## [15] "B    "                             "W    "                            
## [17] "B    "                             "B    "                            
## [19] "W    "                             "B    "                            
## [21] ""

I now have 64 individual vectors that constitute the values needed for our final data frame. Even though the data is still quite messy, it’s consistent enough to put into a data frame and manipulate from there.

Across each entry, here is an index mapping of all relevant fields and what I need them to be:

data4[[x]][1] = the player ID, which I will need to construct a dictionary matching IDs to pre-tournament rating. Need to eliminate excess spaces. data4[[x]][2] = the player name, with 1 excess space to the left and many to the right. data4[[x]][3] = the player’s total points for the tournament (required as is, except for extra spaces) data4[[x]][4 through 10] = the results of the 7 rounds of the tournament. For the purpose of this project, the letters in the fields (W, L and D signifying a win, loss or draw) are not necessary. All that matters is the number after each letter, signifying which opponent the player faced in that round. These numbers will eventually allow me to sub in and calculate player ratings. data4[[x]][11] = player’s state, required as is except for extra spaces on both sides data4[[x]][12] = player’s USCF ID, pre-tournament rating and post-tournament rating. Only the pre-rating matters in this context, so I will extract it based on its position after “R:” but before “->”

All other values in the dataset are irrelevant for our purposes. At a later date, I would be interested in exploring the degree to which a player playing white or black was predictive of victory in this tournament, but it will have to wait for now!

I’ll make a new list of player stats with only the rows I care about.

data5 <- c()

for (x in seq(1, length(data4))) {
  #make it a list, otherwise they are appended in one long chain
  new_vec <- list(data4[[x]][c(1:12)])
  data5 <- append(data5, new_vec)
}

head(data5)
## [[1]]
##  [1] "    1 "                            " GARY HUA                        "
##  [3] "6.0  "                             "W  39"                            
##  [5] "W  21"                             "W  18"                            
##  [7] "W  14"                             "W   7"                            
##  [9] "D  12"                             "D   4"                            
## [11] "    ON "                           " 15445895 / R: 1794   ->1817     "
## 
## [[2]]
##  [1] "    2 "                            " DAKSHESH DARURI                 "
##  [3] "6.0  "                             "W  63"                            
##  [5] "W  58"                             "L   4"                            
##  [7] "W  17"                             "W  16"                            
##  [9] "W  20"                             "W   7"                            
## [11] "    MI "                           " 14598900 / R: 1553   ->1663     "
## 
## [[3]]
##  [1] "    3 "                            " ADITYA BAJAJ                    "
##  [3] "6.0  "                             "L   8"                            
##  [5] "W  61"                             "W  25"                            
##  [7] "W  21"                             "W  11"                            
##  [9] "W  13"                             "W  12"                            
## [11] "    MI "                           " 14959604 / R: 1384   ->1640     "
## 
## [[4]]
##  [1] "    4 "                            " PATRICK H SCHILLING             "
##  [3] "5.5  "                             "W  23"                            
##  [5] "D  28"                             "W   2"                            
##  [7] "W  26"                             "D   5"                            
##  [9] "W  19"                             "D   1"                            
## [11] "    MI "                           " 12616049 / R: 1716   ->1744     "
## 
## [[5]]
##  [1] "    5 "                            " HANSHI ZUO                      "
##  [3] "5.5  "                             "W  45"                            
##  [5] "W  37"                             "D  12"                            
##  [7] "D  13"                             "D   4"                            
##  [9] "W  14"                             "W  17"                            
## [11] "    MI "                           " 14601533 / R: 1655   ->1690     "
## 
## [[6]]
##  [1] "    6 "                            " HANSEN SONG                     "
##  [3] "5.0  "                             "W  34"                            
##  [5] "D  29"                             "L  11"                            
##  [7] "W  35"                             "D  10"                            
##  [9] "W  27"                             "W  21"                            
## [11] "    OH "                           " 15055204 / R: 1686   ->1687     "

Creating a data frame

At this point I could consider cleaning the data (removing excess spaces, narrowing values to only what’s needed, making sure each variable is the correct data type), but that might more easily be accomplished once the data is in a data frame and the columns can be adjusted and transformed. So at this stage I’ll create a data frame, albeit a messy one.

df <- data.frame(data5)

#this interprets each player as a column rather than a row, so I need to transpose the data frame.

df <- t(df)

#make sure it's considered a data frame rather than a collection of atomic vectors

df <- as.data.frame(df)

head(df)
##                                                                         V1
## c......1......GARY.HUA............................6.0......W..39...     1 
## c......2......DAKSHESH.DARURI.....................6.0......W..63...     2 
## c......3......ADITYA.BAJAJ........................6.0......L...8...     3 
## c......4......PATRICK.H.SCHILLING.................5.5......W..23...     4 
## c......5......HANSHI.ZUO..........................5.5......W..45...     5 
## c......6......HANSEN.SONG.........................5.0......W..34...     6 
##                                                                                                    V2
## c......1......GARY.HUA............................6.0......W..39...  GARY HUA                        
## c......2......DAKSHESH.DARURI.....................6.0......W..63...  DAKSHESH DARURI                 
## c......3......ADITYA.BAJAJ........................6.0......L...8...  ADITYA BAJAJ                    
## c......4......PATRICK.H.SCHILLING.................5.5......W..23...  PATRICK H SCHILLING             
## c......5......HANSHI.ZUO..........................5.5......W..45...  HANSHI ZUO                      
## c......6......HANSEN.SONG.........................5.0......W..34...  HANSEN SONG                     
##                                                                        V3    V4
## c......1......GARY.HUA............................6.0......W..39... 6.0   W  39
## c......2......DAKSHESH.DARURI.....................6.0......W..63... 6.0   W  63
## c......3......ADITYA.BAJAJ........................6.0......L...8... 6.0   L   8
## c......4......PATRICK.H.SCHILLING.................5.5......W..23... 5.5   W  23
## c......5......HANSHI.ZUO..........................5.5......W..45... 5.5   W  45
## c......6......HANSEN.SONG.........................5.0......W..34... 5.0   W  34
##                                                                        V5    V6
## c......1......GARY.HUA............................6.0......W..39... W  21 W  18
## c......2......DAKSHESH.DARURI.....................6.0......W..63... W  58 L   4
## c......3......ADITYA.BAJAJ........................6.0......L...8... W  61 W  25
## c......4......PATRICK.H.SCHILLING.................5.5......W..23... D  28 W   2
## c......5......HANSHI.ZUO..........................5.5......W..45... W  37 D  12
## c......6......HANSEN.SONG.........................5.0......W..34... D  29 L  11
##                                                                        V7    V8
## c......1......GARY.HUA............................6.0......W..39... W  14 W   7
## c......2......DAKSHESH.DARURI.....................6.0......W..63... W  17 W  16
## c......3......ADITYA.BAJAJ........................6.0......L...8... W  21 W  11
## c......4......PATRICK.H.SCHILLING.................5.5......W..23... W  26 D   5
## c......5......HANSHI.ZUO..........................5.5......W..45... D  13 D   4
## c......6......HANSEN.SONG.........................5.0......W..34... W  35 D  10
##                                                                        V9   V10
## c......1......GARY.HUA............................6.0......W..39... D  12 D   4
## c......2......DAKSHESH.DARURI.....................6.0......W..63... W  20 W   7
## c......3......ADITYA.BAJAJ........................6.0......L...8... W  13 W  12
## c......4......PATRICK.H.SCHILLING.................5.5......W..23... W  19 D   1
## c......5......HANSHI.ZUO..........................5.5......W..45... W  14 W  17
## c......6......HANSEN.SONG.........................5.0......W..34... W  27 W  21
##                                                                         V11
## c......1......GARY.HUA............................6.0......W..39...     ON 
## c......2......DAKSHESH.DARURI.....................6.0......W..63...     MI 
## c......3......ADITYA.BAJAJ........................6.0......L...8...     MI 
## c......4......PATRICK.H.SCHILLING.................5.5......W..23...     MI 
## c......5......HANSHI.ZUO..........................5.5......W..45...     MI 
## c......6......HANSEN.SONG.........................5.0......W..34...     OH 
##                                                                                                   V12
## c......1......GARY.HUA............................6.0......W..39...  15445895 / R: 1794   ->1817     
## c......2......DAKSHESH.DARURI.....................6.0......W..63...  14598900 / R: 1553   ->1663     
## c......3......ADITYA.BAJAJ........................6.0......L...8...  14959604 / R: 1384   ->1640     
## c......4......PATRICK.H.SCHILLING.................5.5......W..23...  12616049 / R: 1716   ->1744     
## c......5......HANSHI.ZUO..........................5.5......W..45...  14601533 / R: 1655   ->1690     
## c......6......HANSEN.SONG.........................5.0......W..34...  15055204 / R: 1686   ->1687

Now will set column names–even though they are not formatted properly, its’ a reminder of what I’m aspiring to for each column.

colnames(df) <- c(
  "player_id",
  "player_name",
  "total_points",
  "rd1_opp_id",
  "rd2_opp_id",
  "rd3_opp_id",
  "rd4_opp_id",
  "rd5_opp_id",
  "rd6_opp_id",
  "rd7_opp_id",
  "player_state",
  "player_pre_rating"
)

#I see there is also a strange compiled index/rowname column, which I'll eliminate now

rownames(df) <- NULL

head(df)
##   player_id                       player_name total_points rd1_opp_id
## 1        1   GARY HUA                                6.0        W  39
## 2        2   DAKSHESH DARURI                         6.0        W  63
## 3        3   ADITYA BAJAJ                            6.0        L   8
## 4        4   PATRICK H SCHILLING                     5.5        W  23
## 5        5   HANSHI ZUO                              5.5        W  45
## 6        6   HANSEN SONG                             5.0        W  34
##   rd2_opp_id rd3_opp_id rd4_opp_id rd5_opp_id rd6_opp_id rd7_opp_id
## 1      W  21      W  18      W  14      W   7      D  12      D   4
## 2      W  58      L   4      W  17      W  16      W  20      W   7
## 3      W  61      W  25      W  21      W  11      W  13      W  12
## 4      D  28      W   2      W  26      D   5      W  19      D   1
## 5      W  37      D  12      D  13      D   4      W  14      W  17
## 6      D  29      L  11      W  35      D  10      W  27      W  21
##   player_state                 player_pre_rating
## 1          ON   15445895 / R: 1794   ->1817     
## 2          MI   14598900 / R: 1553   ->1663     
## 3          MI   14959604 / R: 1384   ->1640     
## 4          MI   12616049 / R: 1716   ->1744     
## 5          MI   14601533 / R: 1655   ->1690     
## 6          OH   15055204 / R: 1686   ->1687

Cleaning the data frame

I will need to conduct the following transformations on my columns.

player_id: extract number, set as integer player_name: remove excess spaces total_points: extract number, set as float ALL rd[x]_opp_id columns: extract number, set as integer. Ensure cells with no digit characters are interpreted as NULL for future averaging. player_state: remove excess spaces player_pre_rating: extract number after “R:” and before “->”, as well as removing excess spacing. As a first step, any occurrance of “P” followed by a number should be removed, as we are ignoring that aspect. For consistency’s sake they could be replaced with ” “.

#player_id: extract number, set as integer
df$player_id <- as.integer(str_extract(df$player_id, '[0-9]+'))

#player_name: remove excess spaces
df$player_name <- trimws(df$player_name)

#total_points: extract number, set as double
df$total_points <- as.double(trimws(df$total_points))

#ALL rd[x]_opp_id columns: extract number, set as integer. Ensure cells with no digit characters are interpreted as NULL for future averaging. 

#not sure why this doesnt work but whatever
#for (i in colnames(df)[4:10]) {
  #df$(i) <- as.integer(str_extract(df$(i), '[0-9]+'))
#}

df$rd1_opp_id <- as.integer(str_extract(df$rd1_opp_id , '[0-9]+'))
df$rd2_opp_id <- as.integer(str_extract(df$rd2_opp_id , '[0-9]+'))
df$rd3_opp_id <- as.integer(str_extract(df$rd3_opp_id , '[0-9]+'))
df$rd4_opp_id <- as.integer(str_extract(df$rd4_opp_id , '[0-9]+'))
df$rd5_opp_id <- as.integer(str_extract(df$rd5_opp_id , '[0-9]+'))
df$rd6_opp_id <- as.integer(str_extract(df$rd6_opp_id , '[0-9]+'))
df$rd7_opp_id <- as.integer(str_extract(df$rd7_opp_id , '[0-9]+'))

#player_state: remove excess spaces
df$player_state <- trimws(df$player_state)

#player_pre_rating: extract number after "R: " and before "->". I should be able to do this by removing all spaces, and then finding the number (of any number of digits) immediately following "R:". We are ignoring the "P" number that appears in some cases, and restricting the regex to digits should account for this.


df$player_pre_rating <- str_extract(str_remove_all(df$player_pre_rating, ' '), pattern = '(?<=R:)[0-9]+')

df$player_pre_rating <- as.integer(df$player_pre_rating)

head(df)
##   player_id         player_name total_points rd1_opp_id rd2_opp_id rd3_opp_id
## 1         1            GARY HUA          6.0         39         21         18
## 2         2     DAKSHESH DARURI          6.0         63         58          4
## 3         3        ADITYA BAJAJ          6.0          8         61         25
## 4         4 PATRICK H SCHILLING          5.5         23         28          2
## 5         5          HANSHI ZUO          5.5         45         37         12
## 6         6         HANSEN SONG          5.0         34         29         11
##   rd4_opp_id rd5_opp_id rd6_opp_id rd7_opp_id player_state player_pre_rating
## 1         14          7         12          4           ON              1794
## 2         17         16         20          7           MI              1553
## 3         21         11         13         12           MI              1384
## 4         26          5         19          1           MI              1716
## 5         13          4         14         17           MI              1655
## 6         35         10         27         21           OH              1686

Incorporating player ratings

Next, I need to make new columns that indicate the pre-rating for each opponent in each round, these values serving as the foundation for a ‘mean opponent rating’ calculation. Fortunately, the format of the data frame lends itself to a solution.

Vectors can be called by an index or series of indices. Since the player_id’s mirror the data frame’s index, I can call a column of opponent player_id’s as the indices needed to fetch the appropriate pre-rating. For example, the round 1 opponent for the first player, Gary Hua, has the player id 39. Player 39 is Joel R Hendon, whose pre-rating was 1439. If I call df$player_pre_rating[39], I get 1439. The same holds if I call the entire column as an index, which I can do across each round.

df$rd1_opp_pre_rating <- df$player_pre_rating[df$rd1_opp_id]
df$rd2_opp_pre_rating <- df$player_pre_rating[df$rd2_opp_id]
df$rd3_opp_pre_rating <- df$player_pre_rating[df$rd3_opp_id]
df$rd4_opp_pre_rating <- df$player_pre_rating[df$rd4_opp_id]
df$rd5_opp_pre_rating <- df$player_pre_rating[df$rd5_opp_id]
df$rd6_opp_pre_rating <- df$player_pre_rating[df$rd6_opp_id]
df$rd7_opp_pre_rating <- df$player_pre_rating[df$rd7_opp_id]

#spot checks across random players, opponents and ratings show the mapping worked as expected

head(df)
##   player_id         player_name total_points rd1_opp_id rd2_opp_id rd3_opp_id
## 1         1            GARY HUA          6.0         39         21         18
## 2         2     DAKSHESH DARURI          6.0         63         58          4
## 3         3        ADITYA BAJAJ          6.0          8         61         25
## 4         4 PATRICK H SCHILLING          5.5         23         28          2
## 5         5          HANSHI ZUO          5.5         45         37         12
## 6         6         HANSEN SONG          5.0         34         29         11
##   rd4_opp_id rd5_opp_id rd6_opp_id rd7_opp_id player_state player_pre_rating
## 1         14          7         12          4           ON              1794
## 2         17         16         20          7           MI              1553
## 3         21         11         13         12           MI              1384
## 4         26          5         19          1           MI              1716
## 5         13          4         14         17           MI              1655
## 6         35         10         27         21           OH              1686
##   rd1_opp_pre_rating rd2_opp_pre_rating rd3_opp_pre_rating rd4_opp_pre_rating
## 1               1436               1563               1600               1610
## 2               1175                917               1716               1629
## 3               1641                955               1745               1563
## 4               1363               1507               1553               1579
## 5               1242                980               1663               1666
## 6               1399               1602               1712               1438
##   rd5_opp_pre_rating rd6_opp_pre_rating rd7_opp_pre_rating
## 1               1649               1663               1716
## 2               1604               1595               1649
## 3               1712               1666               1663
## 4               1655               1564               1794
## 5               1716               1610               1629
## 6               1365               1552               1563

Now I just need to average the opponent ratings for each player in a new column.

#Setting na.rm to TRUE ensures NAs are dropped from the calculation rather than resulting in an NA for the final value.
df$opp_mean_pre_rating <- rowMeans(df[13:19], na.rm = TRUE)

head(df)
##   player_id         player_name total_points rd1_opp_id rd2_opp_id rd3_opp_id
## 1         1            GARY HUA          6.0         39         21         18
## 2         2     DAKSHESH DARURI          6.0         63         58          4
## 3         3        ADITYA BAJAJ          6.0          8         61         25
## 4         4 PATRICK H SCHILLING          5.5         23         28          2
## 5         5          HANSHI ZUO          5.5         45         37         12
## 6         6         HANSEN SONG          5.0         34         29         11
##   rd4_opp_id rd5_opp_id rd6_opp_id rd7_opp_id player_state player_pre_rating
## 1         14          7         12          4           ON              1794
## 2         17         16         20          7           MI              1553
## 3         21         11         13         12           MI              1384
## 4         26          5         19          1           MI              1716
## 5         13          4         14         17           MI              1655
## 6         35         10         27         21           OH              1686
##   rd1_opp_pre_rating rd2_opp_pre_rating rd3_opp_pre_rating rd4_opp_pre_rating
## 1               1436               1563               1600               1610
## 2               1175                917               1716               1629
## 3               1641                955               1745               1563
## 4               1363               1507               1553               1579
## 5               1242                980               1663               1666
## 6               1399               1602               1712               1438
##   rd5_opp_pre_rating rd6_opp_pre_rating rd7_opp_pre_rating opp_mean_pre_rating
## 1               1649               1663               1716            1605.286
## 2               1604               1595               1649            1469.286
## 3               1712               1666               1663            1563.571
## 4               1655               1564               1794            1573.571
## 5               1716               1610               1629            1500.857
## 6               1365               1552               1563            1518.714

Creating the final data frame

With the mean calculated, I can select only the columns that are necessary.

df_final <- df[c('player_name', 'player_state', 
                 'total_points','player_pre_rating', 
                 'opp_mean_pre_rating')]

df_final
##                   player_name player_state total_points player_pre_rating
## 1                    GARY HUA           ON          6.0              1794
## 2             DAKSHESH DARURI           MI          6.0              1553
## 3                ADITYA BAJAJ           MI          6.0              1384
## 4         PATRICK H SCHILLING           MI          5.5              1716
## 5                  HANSHI ZUO           MI          5.5              1655
## 6                 HANSEN SONG           OH          5.0              1686
## 7           GARY DEE SWATHELL           MI          5.0              1649
## 8            EZEKIEL HOUGHTON           MI          5.0              1641
## 9                 STEFANO LEE           ON          5.0              1411
## 10                  ANVIT RAO           MI          5.0              1365
## 11   CAMERON WILLIAM MC LEMAN           MI          4.5              1712
## 12             KENNETH J TACK           MI          4.5              1663
## 13          TORRANCE HENRY JR           MI          4.5              1666
## 14               BRADLEY SHAW           MI          4.5              1610
## 15     ZACHARY JAMES HOUGHTON           MI          4.5              1220
## 16               MIKE NIKITIN           MI          4.0              1604
## 17         RONALD GRZEGORCZYK           MI          4.0              1629
## 18              DAVID SUNDEEN           MI          4.0              1600
## 19               DIPANKAR ROY           MI          4.0              1564
## 20                JASON ZHENG           MI          4.0              1595
## 21              DINH DANG BUI           ON          4.0              1563
## 22           EUGENE L MCCLURE           MI          4.0              1555
## 23                   ALAN BUI           ON          4.0              1363
## 24          MICHAEL R ALDRICH           MI          4.0              1229
## 25           LOREN SCHWIEBERT           MI          3.5              1745
## 26                    MAX ZHU           ON          3.5              1579
## 27             GAURAV GIDWANI           MI          3.5              1552
## 28 SOFIA ADINA STANESCU-BELLU           MI          3.5              1507
## 29           CHIEDOZIE OKORIE           MI          3.5              1602
## 30         GEORGE AVERY JONES           ON          3.5              1522
## 31               RISHI SHETTY           MI          3.5              1494
## 32      JOSHUA PHILIP MATHEWS           ON          3.5              1441
## 33                    JADE GE           MI          3.5              1449
## 34     MICHAEL JEFFERY THOMAS           MI          3.5              1399
## 35           JOSHUA DAVID LEE           MI          3.5              1438
## 36              SIDDHARTH JHA           MI          3.5              1355
## 37       AMIYATOSH PWNANANDAM           MI          3.5               980
## 38                  BRIAN LIU           MI          3.0              1423
## 39              JOEL R HENDON           MI          3.0              1436
## 40               FOREST ZHANG           MI          3.0              1348
## 41        KYLE WILLIAM MURPHY           MI          3.0              1403
## 42                   JARED GE           MI          3.0              1332
## 43          ROBERT GLEN VASEY           MI          3.0              1283
## 44         JUSTIN D SCHILLING           MI          3.0              1199
## 45                  DEREK YAN           MI          3.0              1242
## 46   JACOB ALEXANDER LAVALLEY           MI          3.0               377
## 47                ERIC WRIGHT           MI          2.5              1362
## 48               DANIEL KHAIN           MI          2.5              1382
## 49           MICHAEL J MARTIN           MI          2.5              1291
## 50                 SHIVAM JHA           MI          2.5              1056
## 51             TEJAS AYYAGARI           MI          2.5              1011
## 52                  ETHAN GUO           MI          2.5               935
## 53              JOSE C YBARRA           MI          2.0              1393
## 54                LARRY HODGE           MI          2.0              1270
## 55                  ALEX KONG           MI          2.0              1186
## 56               MARISA RICCI           MI          2.0              1153
## 57                 MICHAEL LU           MI          2.0              1092
## 58               VIRAJ MOHILE           MI          2.0               917
## 59          SEAN M MC CORMICK           MI          2.0               853
## 60                 JULIA SHEN           MI          1.5               967
## 61              JEZZEL FARKAS           ON          1.5               955
## 62              ASHWIN BALAJI           MI          1.0              1530
## 63       THOMAS JOSEPH HOSMER           MI          1.0              1175
## 64                     BEN LI           MI          1.0              1163
##    opp_mean_pre_rating
## 1             1605.286
## 2             1469.286
## 3             1563.571
## 4             1573.571
## 5             1500.857
## 6             1518.714
## 7             1372.143
## 8             1468.429
## 9             1523.143
## 10            1554.143
## 11            1467.571
## 12            1506.167
## 13            1497.857
## 14            1515.000
## 15            1483.857
## 16            1385.800
## 17            1498.571
## 18            1480.000
## 19            1426.286
## 20            1410.857
## 21            1470.429
## 22            1300.333
## 23            1213.857
## 24            1357.000
## 25            1363.286
## 26            1506.857
## 27            1221.667
## 28            1522.143
## 29            1313.500
## 30            1144.143
## 31            1259.857
## 32            1378.714
## 33            1276.857
## 34            1375.286
## 35            1149.714
## 36            1388.167
## 37            1384.800
## 38            1539.167
## 39            1429.571
## 40            1390.571
## 41            1248.500
## 42            1149.857
## 43            1106.571
## 44            1327.000
## 45            1152.000
## 46            1357.714
## 47            1392.000
## 48            1355.800
## 49            1285.800
## 50            1296.000
## 51            1356.143
## 52            1494.571
## 53            1345.333
## 54            1206.167
## 55            1406.000
## 56            1414.400
## 57            1363.000
## 58            1391.000
## 59            1319.000
## 60            1330.200
## 61            1327.286
## 62            1186.000
## 63            1350.200
## 64            1263.000

Writing the .csv file

Satisfied with the result, I can now write it into a .csv for further analysis. Some day :)

#Ror presentation's sakem the mean opponent rating should be rounded to the nearest integer

df_final$opp_mean_pre_rating <- round(df_final$opp_mean_pre_rating)

write.csv(df_final, '../data/data607_project1_chess_FINAL.csv')