Below, I’ll go step by step into converting the chess tournament text data into a workable data frame and .csv file.
I’ll start by importing tidyverse to handle analyses as they may arise.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
I downloaded the text data, so I’ll read it in here and check the dimensions.
data <- read.delim('../data/data607_project1_chess.txt')
dim(data)
## [1] 195 1
colnames(data)
## [1] "X........................................................................................."
The data appears to have read in as a data frame with 1 column of 195 rows. The only column name is “X” followed by an indeterminate amount of dots.
In order to make it easier to select the column as a vector, I’ll change the name of it to ‘col1’, then call the whole column.
colnames(data) <- c('col1')
For now I’ll reset the data to start with the first line of an actual player, Gary Hua. I can re-insert the column names later as needed.
data1 <- data$col1[4:length(data$col1)]
head(data1, 20)
## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [2] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [3] "-----------------------------------------------------------------------------------------"
## [4] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [5] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [6] "-----------------------------------------------------------------------------------------"
## [7] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [8] " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
## [9] "-----------------------------------------------------------------------------------------"
## [10] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|"
## [11] " MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |"
## [12] "-----------------------------------------------------------------------------------------"
## [13] " 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|"
## [14] " MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |"
## [15] "-----------------------------------------------------------------------------------------"
## [16] " 6 | HANSEN SONG |5.0 |W 34|D 29|L 11|W 35|D 10|W 27|W 21|"
## [17] " OH | 15055204 / R: 1686 ->1687 |N:3 |W |B |W |B |B |W |B |"
## [18] "-----------------------------------------------------------------------------------------"
## [19] " 7 | GARY DEE SWATHELL |5.0 |W 57|W 46|W 13|W 11|L 1|W 9|L 2|"
## [20] " MI | 11146376 / R: 1649 ->1673 |N:3 |W |B |W |B |B |W |W |"
The ‘dash lines’ are on index multiples of 3, starting at 3. I’ll make a sequence of numbers from 3 to the length of data1 and remove all values with those indices.
data2 <-data1[-seq(0, length(data1), by = 3)]
head(data2, 20)
## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [2] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [3] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [4] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [5] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [6] " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
## [7] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|"
## [8] " MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |"
## [9] " 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|"
## [10] " MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |"
## [11] " 6 | HANSEN SONG |5.0 |W 34|D 29|L 11|W 35|D 10|W 27|W 21|"
## [12] " OH | 15055204 / R: 1686 ->1687 |N:3 |W |B |W |B |B |W |B |"
## [13] " 7 | GARY DEE SWATHELL |5.0 |W 57|W 46|W 13|W 11|L 1|W 9|L 2|"
## [14] " MI | 11146376 / R: 1649 ->1673 |N:3 |W |B |W |B |B |W |W |"
## [15] " 8 | EZEKIEL HOUGHTON |5.0 |W 3|W 32|L 14|L 9|W 47|W 28|W 19|"
## [16] " MI | 15142253 / R: 1641P17->1657P24 |N:3 |B |W |B |W |B |W |W |"
## [17] " 9 | STEFANO LEE |5.0 |W 25|L 18|W 59|W 8|W 26|L 7|W 20|"
## [18] " ON | 14954524 / R: 1411 ->1564 |N:2 |W |B |W |B |W |B |B |"
## [19] " 10 | ANVIT RAO |5.0 |D 16|L 19|W 55|W 31|D 6|W 25|W 18|"
## [20] " MI | 14150362 / R: 1365 ->1544 |N:3 |W |W |B |B |W |B |W |"
Now every pair of rows contains the full data for one individual (albeit still in a messy character format). I should be able to loop through the even indices and construct a new vector consisting of one element for each player.
evens <- seq(2, length(data2), by = 2)
data3 <- c()
for (x in evens) {
row1 <- data2[x-1]
row2 <- data2[x]
new_row <- paste(row1, row2)
data3 <- append(data3, new_row)
}
head(data3, 20)
## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4| ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [2] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7| MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [3] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12| MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
## [4] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1| MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |"
## [5] " 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17| MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |"
## [6] " 6 | HANSEN SONG |5.0 |W 34|D 29|L 11|W 35|D 10|W 27|W 21| OH | 15055204 / R: 1686 ->1687 |N:3 |W |B |W |B |B |W |B |"
## [7] " 7 | GARY DEE SWATHELL |5.0 |W 57|W 46|W 13|W 11|L 1|W 9|L 2| MI | 11146376 / R: 1649 ->1673 |N:3 |W |B |W |B |B |W |W |"
## [8] " 8 | EZEKIEL HOUGHTON |5.0 |W 3|W 32|L 14|L 9|W 47|W 28|W 19| MI | 15142253 / R: 1641P17->1657P24 |N:3 |B |W |B |W |B |W |W |"
## [9] " 9 | STEFANO LEE |5.0 |W 25|L 18|W 59|W 8|W 26|L 7|W 20| ON | 14954524 / R: 1411 ->1564 |N:2 |W |B |W |B |W |B |B |"
## [10] " 10 | ANVIT RAO |5.0 |D 16|L 19|W 55|W 31|D 6|W 25|W 18| MI | 14150362 / R: 1365 ->1544 |N:3 |W |W |B |B |W |B |W |"
## [11] " 11 | CAMERON WILLIAM MC LEMAN |4.5 |D 38|W 56|W 6|L 7|L 3|W 34|W 26| MI | 12581589 / R: 1712 ->1696 |N:3 |B |W |B |W |B |W |B |"
## [12] " 12 | KENNETH J TACK |4.5 |W 42|W 33|D 5|W 38|H |D 1|L 3| MI | 12681257 / R: 1663 ->1670 |N:3 |W |B |W |B | |W |B |"
## [13] " 13 | TORRANCE HENRY JR |4.5 |W 36|W 27|L 7|D 5|W 33|L 3|W 32| MI | 15082995 / R: 1666 ->1662 |N:3 |B |W |B |B |W |W |B |"
## [14] " 14 | BRADLEY SHAW |4.5 |W 54|W 44|W 8|L 1|D 27|L 5|W 31| MI | 10131499 / R: 1610 ->1618 |N:3 |W |B |W |W |B |B |W |"
## [15] " 15 | ZACHARY JAMES HOUGHTON |4.5 |D 19|L 16|W 30|L 22|W 54|W 33|W 38| MI | 15619130 / R: 1220P13->1416P20 |N:3 |B |B |W |W |B |B |W |"
## [16] " 16 | MIKE NIKITIN |4.0 |D 10|W 15|H |W 39|L 2|W 36|U | MI | 10295068 / R: 1604 ->1613 |N:3 |B |W | |B |W |B | |"
## [17] " 17 | RONALD GRZEGORCZYK |4.0 |W 48|W 41|L 26|L 2|W 23|W 22|L 5| MI | 10297702 / R: 1629 ->1610 |N:3 |W |B |W |B |W |B |W |"
## [18] " 18 | DAVID SUNDEEN |4.0 |W 47|W 9|L 1|W 32|L 19|W 38|L 10| MI | 11342094 / R: 1600 ->1600 |N:3 |B |W |B |W |B |W |B |"
## [19] " 19 | DIPANKAR ROY |4.0 |D 15|W 10|W 52|D 28|W 18|L 4|L 8| MI | 14862333 / R: 1564 ->1570 |N:3 |W |B |W |B |W |W |B |"
## [20] " 20 | JASON ZHENG |4.0 |L 40|W 49|W 23|W 41|W 28|L 2|L 9| MI | 14529060 / R: 1595 ->1569 |N:4 |W |B |W |B |W |B |W |"
I now have 64 rows, matching the number of players in the .txt data. Fortunately, despite some messiness with spaces, each field we care about is separated from the others by a single character: “|”. I can make a new list of vectors from these strings by splitting them on that character, recognizing that “|” is a special character in regex and requires two backslashes to operate properly.
data4 <- str_split(data3, pattern = '\\|')
head(data4)
## [[1]]
## [1] " 1 " " GARY HUA "
## [3] "6.0 " "W 39"
## [5] "W 21" "W 18"
## [7] "W 14" "W 7"
## [9] "D 12" "D 4"
## [11] " ON " " 15445895 / R: 1794 ->1817 "
## [13] "N:2 " "W "
## [15] "B " "W "
## [17] "B " "W "
## [19] "B " "W "
## [21] ""
##
## [[2]]
## [1] " 2 " " DAKSHESH DARURI "
## [3] "6.0 " "W 63"
## [5] "W 58" "L 4"
## [7] "W 17" "W 16"
## [9] "W 20" "W 7"
## [11] " MI " " 14598900 / R: 1553 ->1663 "
## [13] "N:2 " "B "
## [15] "W " "B "
## [17] "W " "B "
## [19] "W " "B "
## [21] ""
##
## [[3]]
## [1] " 3 " " ADITYA BAJAJ "
## [3] "6.0 " "L 8"
## [5] "W 61" "W 25"
## [7] "W 21" "W 11"
## [9] "W 13" "W 12"
## [11] " MI " " 14959604 / R: 1384 ->1640 "
## [13] "N:2 " "W "
## [15] "B " "W "
## [17] "B " "W "
## [19] "B " "W "
## [21] ""
##
## [[4]]
## [1] " 4 " " PATRICK H SCHILLING "
## [3] "5.5 " "W 23"
## [5] "D 28" "W 2"
## [7] "W 26" "D 5"
## [9] "W 19" "D 1"
## [11] " MI " " 12616049 / R: 1716 ->1744 "
## [13] "N:2 " "W "
## [15] "B " "W "
## [17] "B " "W "
## [19] "B " "B "
## [21] ""
##
## [[5]]
## [1] " 5 " " HANSHI ZUO "
## [3] "5.5 " "W 45"
## [5] "W 37" "D 12"
## [7] "D 13" "D 4"
## [9] "W 14" "W 17"
## [11] " MI " " 14601533 / R: 1655 ->1690 "
## [13] "N:2 " "B "
## [15] "W " "B "
## [17] "W " "B "
## [19] "W " "B "
## [21] ""
##
## [[6]]
## [1] " 6 " " HANSEN SONG "
## [3] "5.0 " "W 34"
## [5] "D 29" "L 11"
## [7] "W 35" "D 10"
## [9] "W 27" "W 21"
## [11] " OH " " 15055204 / R: 1686 ->1687 "
## [13] "N:3 " "W "
## [15] "B " "W "
## [17] "B " "B "
## [19] "W " "B "
## [21] ""
I now have 64 individual vectors that constitute the values needed for our final data frame. Even though the data is still quite messy, it’s consistent enough to put into a data frame and manipulate from there.
Across each entry, here is an index mapping of all relevant fields and what I need them to be:
data4[[x]][1] = the player ID, which I will need to construct a dictionary matching IDs to pre-tournament rating. Need to eliminate excess spaces. data4[[x]][2] = the player name, with 1 excess space to the left and many to the right. data4[[x]][3] = the player’s total points for the tournament (required as is, except for extra spaces) data4[[x]][4 through 10] = the results of the 7 rounds of the tournament. For the purpose of this project, the letters in the fields (W, L and D signifying a win, loss or draw) are not necessary. All that matters is the number after each letter, signifying which opponent the player faced in that round. These numbers will eventually allow me to sub in and calculate player ratings. data4[[x]][11] = player’s state, required as is except for extra spaces on both sides data4[[x]][12] = player’s USCF ID, pre-tournament rating and post-tournament rating. Only the pre-rating matters in this context, so I will extract it based on its position after “R:” but before “->”
All other values in the dataset are irrelevant for our purposes. At a later date, I would be interested in exploring the degree to which a player playing white or black was predictive of victory in this tournament, but it will have to wait for now!
I’ll make a new list of player stats with only the rows I care about.
data5 <- c()
for (x in seq(1, length(data4))) {
#make it a list, otherwise they are appended in one long chain
new_vec <- list(data4[[x]][c(1:12)])
data5 <- append(data5, new_vec)
}
head(data5)
## [[1]]
## [1] " 1 " " GARY HUA "
## [3] "6.0 " "W 39"
## [5] "W 21" "W 18"
## [7] "W 14" "W 7"
## [9] "D 12" "D 4"
## [11] " ON " " 15445895 / R: 1794 ->1817 "
##
## [[2]]
## [1] " 2 " " DAKSHESH DARURI "
## [3] "6.0 " "W 63"
## [5] "W 58" "L 4"
## [7] "W 17" "W 16"
## [9] "W 20" "W 7"
## [11] " MI " " 14598900 / R: 1553 ->1663 "
##
## [[3]]
## [1] " 3 " " ADITYA BAJAJ "
## [3] "6.0 " "L 8"
## [5] "W 61" "W 25"
## [7] "W 21" "W 11"
## [9] "W 13" "W 12"
## [11] " MI " " 14959604 / R: 1384 ->1640 "
##
## [[4]]
## [1] " 4 " " PATRICK H SCHILLING "
## [3] "5.5 " "W 23"
## [5] "D 28" "W 2"
## [7] "W 26" "D 5"
## [9] "W 19" "D 1"
## [11] " MI " " 12616049 / R: 1716 ->1744 "
##
## [[5]]
## [1] " 5 " " HANSHI ZUO "
## [3] "5.5 " "W 45"
## [5] "W 37" "D 12"
## [7] "D 13" "D 4"
## [9] "W 14" "W 17"
## [11] " MI " " 14601533 / R: 1655 ->1690 "
##
## [[6]]
## [1] " 6 " " HANSEN SONG "
## [3] "5.0 " "W 34"
## [5] "D 29" "L 11"
## [7] "W 35" "D 10"
## [9] "W 27" "W 21"
## [11] " OH " " 15055204 / R: 1686 ->1687 "
At this point I could consider cleaning the data (removing excess spaces, narrowing values to only what’s needed, making sure each variable is the correct data type), but that might more easily be accomplished once the data is in a data frame and the columns can be adjusted and transformed. So at this stage I’ll create a data frame, albeit a messy one.
df <- data.frame(data5)
#this interprets each player as a column rather than a row, so I need to transpose the data frame.
df <- t(df)
#make sure it's considered a data frame rather than a collection of atomic vectors
df <- as.data.frame(df)
head(df)
## V1
## c......1......GARY.HUA............................6.0......W..39... 1
## c......2......DAKSHESH.DARURI.....................6.0......W..63... 2
## c......3......ADITYA.BAJAJ........................6.0......L...8... 3
## c......4......PATRICK.H.SCHILLING.................5.5......W..23... 4
## c......5......HANSHI.ZUO..........................5.5......W..45... 5
## c......6......HANSEN.SONG.........................5.0......W..34... 6
## V2
## c......1......GARY.HUA............................6.0......W..39... GARY HUA
## c......2......DAKSHESH.DARURI.....................6.0......W..63... DAKSHESH DARURI
## c......3......ADITYA.BAJAJ........................6.0......L...8... ADITYA BAJAJ
## c......4......PATRICK.H.SCHILLING.................5.5......W..23... PATRICK H SCHILLING
## c......5......HANSHI.ZUO..........................5.5......W..45... HANSHI ZUO
## c......6......HANSEN.SONG.........................5.0......W..34... HANSEN SONG
## V3 V4
## c......1......GARY.HUA............................6.0......W..39... 6.0 W 39
## c......2......DAKSHESH.DARURI.....................6.0......W..63... 6.0 W 63
## c......3......ADITYA.BAJAJ........................6.0......L...8... 6.0 L 8
## c......4......PATRICK.H.SCHILLING.................5.5......W..23... 5.5 W 23
## c......5......HANSHI.ZUO..........................5.5......W..45... 5.5 W 45
## c......6......HANSEN.SONG.........................5.0......W..34... 5.0 W 34
## V5 V6
## c......1......GARY.HUA............................6.0......W..39... W 21 W 18
## c......2......DAKSHESH.DARURI.....................6.0......W..63... W 58 L 4
## c......3......ADITYA.BAJAJ........................6.0......L...8... W 61 W 25
## c......4......PATRICK.H.SCHILLING.................5.5......W..23... D 28 W 2
## c......5......HANSHI.ZUO..........................5.5......W..45... W 37 D 12
## c......6......HANSEN.SONG.........................5.0......W..34... D 29 L 11
## V7 V8
## c......1......GARY.HUA............................6.0......W..39... W 14 W 7
## c......2......DAKSHESH.DARURI.....................6.0......W..63... W 17 W 16
## c......3......ADITYA.BAJAJ........................6.0......L...8... W 21 W 11
## c......4......PATRICK.H.SCHILLING.................5.5......W..23... W 26 D 5
## c......5......HANSHI.ZUO..........................5.5......W..45... D 13 D 4
## c......6......HANSEN.SONG.........................5.0......W..34... W 35 D 10
## V9 V10
## c......1......GARY.HUA............................6.0......W..39... D 12 D 4
## c......2......DAKSHESH.DARURI.....................6.0......W..63... W 20 W 7
## c......3......ADITYA.BAJAJ........................6.0......L...8... W 13 W 12
## c......4......PATRICK.H.SCHILLING.................5.5......W..23... W 19 D 1
## c......5......HANSHI.ZUO..........................5.5......W..45... W 14 W 17
## c......6......HANSEN.SONG.........................5.0......W..34... W 27 W 21
## V11
## c......1......GARY.HUA............................6.0......W..39... ON
## c......2......DAKSHESH.DARURI.....................6.0......W..63... MI
## c......3......ADITYA.BAJAJ........................6.0......L...8... MI
## c......4......PATRICK.H.SCHILLING.................5.5......W..23... MI
## c......5......HANSHI.ZUO..........................5.5......W..45... MI
## c......6......HANSEN.SONG.........................5.0......W..34... OH
## V12
## c......1......GARY.HUA............................6.0......W..39... 15445895 / R: 1794 ->1817
## c......2......DAKSHESH.DARURI.....................6.0......W..63... 14598900 / R: 1553 ->1663
## c......3......ADITYA.BAJAJ........................6.0......L...8... 14959604 / R: 1384 ->1640
## c......4......PATRICK.H.SCHILLING.................5.5......W..23... 12616049 / R: 1716 ->1744
## c......5......HANSHI.ZUO..........................5.5......W..45... 14601533 / R: 1655 ->1690
## c......6......HANSEN.SONG.........................5.0......W..34... 15055204 / R: 1686 ->1687
Now will set column names–even though they are not formatted properly, its’ a reminder of what I’m aspiring to for each column.
colnames(df) <- c(
"player_id",
"player_name",
"total_points",
"rd1_opp_id",
"rd2_opp_id",
"rd3_opp_id",
"rd4_opp_id",
"rd5_opp_id",
"rd6_opp_id",
"rd7_opp_id",
"player_state",
"player_pre_rating"
)
#I see there is also a strange compiled index/rowname column, which I'll eliminate now
rownames(df) <- NULL
head(df)
## player_id player_name total_points rd1_opp_id
## 1 1 GARY HUA 6.0 W 39
## 2 2 DAKSHESH DARURI 6.0 W 63
## 3 3 ADITYA BAJAJ 6.0 L 8
## 4 4 PATRICK H SCHILLING 5.5 W 23
## 5 5 HANSHI ZUO 5.5 W 45
## 6 6 HANSEN SONG 5.0 W 34
## rd2_opp_id rd3_opp_id rd4_opp_id rd5_opp_id rd6_opp_id rd7_opp_id
## 1 W 21 W 18 W 14 W 7 D 12 D 4
## 2 W 58 L 4 W 17 W 16 W 20 W 7
## 3 W 61 W 25 W 21 W 11 W 13 W 12
## 4 D 28 W 2 W 26 D 5 W 19 D 1
## 5 W 37 D 12 D 13 D 4 W 14 W 17
## 6 D 29 L 11 W 35 D 10 W 27 W 21
## player_state player_pre_rating
## 1 ON 15445895 / R: 1794 ->1817
## 2 MI 14598900 / R: 1553 ->1663
## 3 MI 14959604 / R: 1384 ->1640
## 4 MI 12616049 / R: 1716 ->1744
## 5 MI 14601533 / R: 1655 ->1690
## 6 OH 15055204 / R: 1686 ->1687
I will need to conduct the following transformations on my columns.
player_id: extract number, set as integer player_name: remove excess spaces total_points: extract number, set as float ALL rd[x]_opp_id columns: extract number, set as integer. Ensure cells with no digit characters are interpreted as NULL for future averaging. player_state: remove excess spaces player_pre_rating: extract number after “R:” and before “->”, as well as removing excess spacing. As a first step, any occurrance of “P” followed by a number should be removed, as we are ignoring that aspect. For consistency’s sake they could be replaced with ” “.
#player_id: extract number, set as integer
df$player_id <- as.integer(str_extract(df$player_id, '[0-9]+'))
#player_name: remove excess spaces
df$player_name <- trimws(df$player_name)
#total_points: extract number, set as double
df$total_points <- as.double(trimws(df$total_points))
#ALL rd[x]_opp_id columns: extract number, set as integer. Ensure cells with no digit characters are interpreted as NULL for future averaging.
#not sure why this doesnt work but whatever
#for (i in colnames(df)[4:10]) {
#df$(i) <- as.integer(str_extract(df$(i), '[0-9]+'))
#}
df$rd1_opp_id <- as.integer(str_extract(df$rd1_opp_id , '[0-9]+'))
df$rd2_opp_id <- as.integer(str_extract(df$rd2_opp_id , '[0-9]+'))
df$rd3_opp_id <- as.integer(str_extract(df$rd3_opp_id , '[0-9]+'))
df$rd4_opp_id <- as.integer(str_extract(df$rd4_opp_id , '[0-9]+'))
df$rd5_opp_id <- as.integer(str_extract(df$rd5_opp_id , '[0-9]+'))
df$rd6_opp_id <- as.integer(str_extract(df$rd6_opp_id , '[0-9]+'))
df$rd7_opp_id <- as.integer(str_extract(df$rd7_opp_id , '[0-9]+'))
#player_state: remove excess spaces
df$player_state <- trimws(df$player_state)
#player_pre_rating: extract number after "R: " and before "->". I should be able to do this by removing all spaces, and then finding the number (of any number of digits) immediately following "R:". We are ignoring the "P" number that appears in some cases, and restricting the regex to digits should account for this.
df$player_pre_rating <- str_extract(str_remove_all(df$player_pre_rating, ' '), pattern = '(?<=R:)[0-9]+')
df$player_pre_rating <- as.integer(df$player_pre_rating)
head(df)
## player_id player_name total_points rd1_opp_id rd2_opp_id rd3_opp_id
## 1 1 GARY HUA 6.0 39 21 18
## 2 2 DAKSHESH DARURI 6.0 63 58 4
## 3 3 ADITYA BAJAJ 6.0 8 61 25
## 4 4 PATRICK H SCHILLING 5.5 23 28 2
## 5 5 HANSHI ZUO 5.5 45 37 12
## 6 6 HANSEN SONG 5.0 34 29 11
## rd4_opp_id rd5_opp_id rd6_opp_id rd7_opp_id player_state player_pre_rating
## 1 14 7 12 4 ON 1794
## 2 17 16 20 7 MI 1553
## 3 21 11 13 12 MI 1384
## 4 26 5 19 1 MI 1716
## 5 13 4 14 17 MI 1655
## 6 35 10 27 21 OH 1686
Next, I need to make new columns that indicate the pre-rating for each opponent in each round, these values serving as the foundation for a ‘mean opponent rating’ calculation. Fortunately, the format of the data frame lends itself to a solution.
Vectors can be called by an index or series of indices. Since the player_id’s mirror the data frame’s index, I can call a column of opponent player_id’s as the indices needed to fetch the appropriate pre-rating. For example, the round 1 opponent for the first player, Gary Hua, has the player id 39. Player 39 is Joel R Hendon, whose pre-rating was 1439. If I call df$player_pre_rating[39], I get 1439. The same holds if I call the entire column as an index, which I can do across each round.
df$rd1_opp_pre_rating <- df$player_pre_rating[df$rd1_opp_id]
df$rd2_opp_pre_rating <- df$player_pre_rating[df$rd2_opp_id]
df$rd3_opp_pre_rating <- df$player_pre_rating[df$rd3_opp_id]
df$rd4_opp_pre_rating <- df$player_pre_rating[df$rd4_opp_id]
df$rd5_opp_pre_rating <- df$player_pre_rating[df$rd5_opp_id]
df$rd6_opp_pre_rating <- df$player_pre_rating[df$rd6_opp_id]
df$rd7_opp_pre_rating <- df$player_pre_rating[df$rd7_opp_id]
#spot checks across random players, opponents and ratings show the mapping worked as expected
head(df)
## player_id player_name total_points rd1_opp_id rd2_opp_id rd3_opp_id
## 1 1 GARY HUA 6.0 39 21 18
## 2 2 DAKSHESH DARURI 6.0 63 58 4
## 3 3 ADITYA BAJAJ 6.0 8 61 25
## 4 4 PATRICK H SCHILLING 5.5 23 28 2
## 5 5 HANSHI ZUO 5.5 45 37 12
## 6 6 HANSEN SONG 5.0 34 29 11
## rd4_opp_id rd5_opp_id rd6_opp_id rd7_opp_id player_state player_pre_rating
## 1 14 7 12 4 ON 1794
## 2 17 16 20 7 MI 1553
## 3 21 11 13 12 MI 1384
## 4 26 5 19 1 MI 1716
## 5 13 4 14 17 MI 1655
## 6 35 10 27 21 OH 1686
## rd1_opp_pre_rating rd2_opp_pre_rating rd3_opp_pre_rating rd4_opp_pre_rating
## 1 1436 1563 1600 1610
## 2 1175 917 1716 1629
## 3 1641 955 1745 1563
## 4 1363 1507 1553 1579
## 5 1242 980 1663 1666
## 6 1399 1602 1712 1438
## rd5_opp_pre_rating rd6_opp_pre_rating rd7_opp_pre_rating
## 1 1649 1663 1716
## 2 1604 1595 1649
## 3 1712 1666 1663
## 4 1655 1564 1794
## 5 1716 1610 1629
## 6 1365 1552 1563
Now I just need to average the opponent ratings for each player in a new column.
#Setting na.rm to TRUE ensures NAs are dropped from the calculation rather than resulting in an NA for the final value.
df$opp_mean_pre_rating <- rowMeans(df[13:19], na.rm = TRUE)
head(df)
## player_id player_name total_points rd1_opp_id rd2_opp_id rd3_opp_id
## 1 1 GARY HUA 6.0 39 21 18
## 2 2 DAKSHESH DARURI 6.0 63 58 4
## 3 3 ADITYA BAJAJ 6.0 8 61 25
## 4 4 PATRICK H SCHILLING 5.5 23 28 2
## 5 5 HANSHI ZUO 5.5 45 37 12
## 6 6 HANSEN SONG 5.0 34 29 11
## rd4_opp_id rd5_opp_id rd6_opp_id rd7_opp_id player_state player_pre_rating
## 1 14 7 12 4 ON 1794
## 2 17 16 20 7 MI 1553
## 3 21 11 13 12 MI 1384
## 4 26 5 19 1 MI 1716
## 5 13 4 14 17 MI 1655
## 6 35 10 27 21 OH 1686
## rd1_opp_pre_rating rd2_opp_pre_rating rd3_opp_pre_rating rd4_opp_pre_rating
## 1 1436 1563 1600 1610
## 2 1175 917 1716 1629
## 3 1641 955 1745 1563
## 4 1363 1507 1553 1579
## 5 1242 980 1663 1666
## 6 1399 1602 1712 1438
## rd5_opp_pre_rating rd6_opp_pre_rating rd7_opp_pre_rating opp_mean_pre_rating
## 1 1649 1663 1716 1605.286
## 2 1604 1595 1649 1469.286
## 3 1712 1666 1663 1563.571
## 4 1655 1564 1794 1573.571
## 5 1716 1610 1629 1500.857
## 6 1365 1552 1563 1518.714
With the mean calculated, I can select only the columns that are necessary.
df_final <- df[c('player_name', 'player_state',
'total_points','player_pre_rating',
'opp_mean_pre_rating')]
df_final
## player_name player_state total_points player_pre_rating
## 1 GARY HUA ON 6.0 1794
## 2 DAKSHESH DARURI MI 6.0 1553
## 3 ADITYA BAJAJ MI 6.0 1384
## 4 PATRICK H SCHILLING MI 5.5 1716
## 5 HANSHI ZUO MI 5.5 1655
## 6 HANSEN SONG OH 5.0 1686
## 7 GARY DEE SWATHELL MI 5.0 1649
## 8 EZEKIEL HOUGHTON MI 5.0 1641
## 9 STEFANO LEE ON 5.0 1411
## 10 ANVIT RAO MI 5.0 1365
## 11 CAMERON WILLIAM MC LEMAN MI 4.5 1712
## 12 KENNETH J TACK MI 4.5 1663
## 13 TORRANCE HENRY JR MI 4.5 1666
## 14 BRADLEY SHAW MI 4.5 1610
## 15 ZACHARY JAMES HOUGHTON MI 4.5 1220
## 16 MIKE NIKITIN MI 4.0 1604
## 17 RONALD GRZEGORCZYK MI 4.0 1629
## 18 DAVID SUNDEEN MI 4.0 1600
## 19 DIPANKAR ROY MI 4.0 1564
## 20 JASON ZHENG MI 4.0 1595
## 21 DINH DANG BUI ON 4.0 1563
## 22 EUGENE L MCCLURE MI 4.0 1555
## 23 ALAN BUI ON 4.0 1363
## 24 MICHAEL R ALDRICH MI 4.0 1229
## 25 LOREN SCHWIEBERT MI 3.5 1745
## 26 MAX ZHU ON 3.5 1579
## 27 GAURAV GIDWANI MI 3.5 1552
## 28 SOFIA ADINA STANESCU-BELLU MI 3.5 1507
## 29 CHIEDOZIE OKORIE MI 3.5 1602
## 30 GEORGE AVERY JONES ON 3.5 1522
## 31 RISHI SHETTY MI 3.5 1494
## 32 JOSHUA PHILIP MATHEWS ON 3.5 1441
## 33 JADE GE MI 3.5 1449
## 34 MICHAEL JEFFERY THOMAS MI 3.5 1399
## 35 JOSHUA DAVID LEE MI 3.5 1438
## 36 SIDDHARTH JHA MI 3.5 1355
## 37 AMIYATOSH PWNANANDAM MI 3.5 980
## 38 BRIAN LIU MI 3.0 1423
## 39 JOEL R HENDON MI 3.0 1436
## 40 FOREST ZHANG MI 3.0 1348
## 41 KYLE WILLIAM MURPHY MI 3.0 1403
## 42 JARED GE MI 3.0 1332
## 43 ROBERT GLEN VASEY MI 3.0 1283
## 44 JUSTIN D SCHILLING MI 3.0 1199
## 45 DEREK YAN MI 3.0 1242
## 46 JACOB ALEXANDER LAVALLEY MI 3.0 377
## 47 ERIC WRIGHT MI 2.5 1362
## 48 DANIEL KHAIN MI 2.5 1382
## 49 MICHAEL J MARTIN MI 2.5 1291
## 50 SHIVAM JHA MI 2.5 1056
## 51 TEJAS AYYAGARI MI 2.5 1011
## 52 ETHAN GUO MI 2.5 935
## 53 JOSE C YBARRA MI 2.0 1393
## 54 LARRY HODGE MI 2.0 1270
## 55 ALEX KONG MI 2.0 1186
## 56 MARISA RICCI MI 2.0 1153
## 57 MICHAEL LU MI 2.0 1092
## 58 VIRAJ MOHILE MI 2.0 917
## 59 SEAN M MC CORMICK MI 2.0 853
## 60 JULIA SHEN MI 1.5 967
## 61 JEZZEL FARKAS ON 1.5 955
## 62 ASHWIN BALAJI MI 1.0 1530
## 63 THOMAS JOSEPH HOSMER MI 1.0 1175
## 64 BEN LI MI 1.0 1163
## opp_mean_pre_rating
## 1 1605.286
## 2 1469.286
## 3 1563.571
## 4 1573.571
## 5 1500.857
## 6 1518.714
## 7 1372.143
## 8 1468.429
## 9 1523.143
## 10 1554.143
## 11 1467.571
## 12 1506.167
## 13 1497.857
## 14 1515.000
## 15 1483.857
## 16 1385.800
## 17 1498.571
## 18 1480.000
## 19 1426.286
## 20 1410.857
## 21 1470.429
## 22 1300.333
## 23 1213.857
## 24 1357.000
## 25 1363.286
## 26 1506.857
## 27 1221.667
## 28 1522.143
## 29 1313.500
## 30 1144.143
## 31 1259.857
## 32 1378.714
## 33 1276.857
## 34 1375.286
## 35 1149.714
## 36 1388.167
## 37 1384.800
## 38 1539.167
## 39 1429.571
## 40 1390.571
## 41 1248.500
## 42 1149.857
## 43 1106.571
## 44 1327.000
## 45 1152.000
## 46 1357.714
## 47 1392.000
## 48 1355.800
## 49 1285.800
## 50 1296.000
## 51 1356.143
## 52 1494.571
## 53 1345.333
## 54 1206.167
## 55 1406.000
## 56 1414.400
## 57 1363.000
## 58 1391.000
## 59 1319.000
## 60 1330.200
## 61 1327.286
## 62 1186.000
## 63 1350.200
## 64 1263.000
Satisfied with the result, I can now write it into a .csv for further analysis. Some day :)
#Ror presentation's sakem the mean opponent rating should be rounded to the nearest integer
df_final$opp_mean_pre_rating <- round(df_final$opp_mean_pre_rating)
write.csv(df_final, '../data/data607_project1_chess_FINAL.csv')