Use the given text file which contains the results from a chess tournament to generate .CSV files that can be used to load into a database for further analysis.
Each player in the tournament is listed along with the results from the 7 rounds of the tournament.
-----------------------------------------------------------------------------------------
Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|
Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
-----------------------------------------------------------------------------------------
1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
-----------------------------------------------------------------------------------------
2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|
MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |
-----------------------------------------------------------------------------------------
3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|
MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |
-----------------------------------------------------------------------------------------
4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|
MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |
-----------------------------------------------------------------------------------------
5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|
MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |
-----------------------------------------------------------------------------------------
library(tidyverse)
infoFile = 'https://raw.githubusercontent.com/dab31415/DATA607/main/Projects/Project_1/TournamentInfo.txt'
raw_tournament <-read.table(file = infoFile,
sep = '|',
strip.white = TRUE,
fill = TRUE)
head(raw_tournament,7)
## V1
## 1 -----------------------------------------------------------------------------------------
## 2 Pair
## 3 Num
## 4 -----------------------------------------------------------------------------------------
## 5 1
## 6 ON
## 7 -----------------------------------------------------------------------------------------
## V2 V3 V4 V5 V6 V7 V8 V9 V10
## 1
## 2 Player Name Total Round Round Round Round Round Round Round
## 3 USCF ID / Rtg (Pre->Post) Pts 1 2 3 4 5 6 7
## 4
## 5 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7 D 12 D 4
## 6 15445895 / R: 1794 ->1817 N:2 W B W B W B W
## 7
## V11
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## 7 NA
The first four rows of data contain multiple header and row separators. The last column is created because each row of the text file ends with a column separator. The data file also contains row separators between each player.
tournament <- raw_tournament[-c(1:4),-11] %>%
filter(grepl('[A-Z0-9]',V1))
names(tournament)[1] = 'num_state'
names(tournament)[2] = 'name_rating'
names(tournament)[3] = 'points'
names(tournament)[4:10] = paste('R', 1:7, sep = '')
The first row for each player contains their player number in the first field which is also used in the round data to match players with their opponent. Filter on numeric first column and select the player’s name and points earned in the tournament.
players <- as_tibble(tournament %>%
filter(grepl('[0-9]',num_state)) %>%
transmute(
playerNum = as.numeric(num_state),
playerName = name_rating,
points = as.numeric(points),
)) %>%
rowid_to_column('id')
head(players,7)
## # A tibble: 7 x 4
## id playerNum playerName points
## <int> <dbl> <chr> <dbl>
## 1 1 1 GARY HUA 6
## 2 2 2 DAKSHESH DARURI 6
## 3 3 3 ADITYA BAJAJ 6
## 4 4 4 PATRICK H SCHILLING 5.5
## 5 5 5 HANSHI ZUO 5.5
## 6 6 6 HANSEN SONG 5
## 7 7 7 GARY DEE SWATHELL 5
The second row of player data contains the player’s state, USCF identification number, and their ratings both before and after tournament completion.
ratings <- as_tibble(tournament %>%
filter(grepl('[A-Z]',num_state)) %>%
separate(name_rating, into = c('USCF_ID','rawRatings'), sep = '/') %>%
separate(rawRatings, into = c('rawpreRating','rawpostRating'), sep = '->') %>%
transmute(
state = num_state,
USCF_ID = as.numeric(USCF_ID),
preRating = as.numeric(str_extract(rawpreRating,'\\d{3,4}')),
postRating = as.numeric(str_extract(rawpostRating,'\\d{3,4}'))
)) %>%
rowid_to_column('id')
head(ratings,7)
## # A tibble: 7 x 5
## id state USCF_ID preRating postRating
## <int> <chr> <dbl> <dbl> <dbl>
## 1 1 ON 15445895 1794 1817
## 2 2 MI 14598900 1553 1663
## 3 3 MI 14959604 1384 1640
## 4 4 MI 12616049 1716 1744
## 5 5 MI 14601533 1655 1690
## 6 6 OH 15055204 1686 1687
## 7 7 MI 11146376 1649 1673
We avoid using player number to merge the two data frames because we’d be dependent on the raw text file being sorted and numbered in order. By implementing a row number, we ensure that consecutive rows in the data file are merged together.
playerDetails <- players %>%
merge(ratings,by = 'id') %>%
select(playerNum, USCF_ID, playerName, state, preRating, points, postRating)
head(playerDetails,7)
## playerNum USCF_ID playerName state preRating points postRating
## 1 1 15445895 GARY HUA ON 1794 6.0 1817
## 2 2 14598900 DAKSHESH DARURI MI 1553 6.0 1663
## 3 3 14959604 ADITYA BAJAJ MI 1384 6.0 1640
## 4 4 12616049 PATRICK H SCHILLING MI 1716 5.5 1744
## 5 5 14601533 HANSHI ZUO MI 1655 5.5 1690
## 6 6 15055204 HANSEN SONG OH 1686 5.0 1687
## 7 7 11146376 GARY DEE SWATHELL MI 1649 5.0 1673
The first row of player data contains the results of the matches played in each round. We generate a data frame of all players and the match results by round.
matchResults <- as_tibble(tournament %>%
filter(grepl('[0-9]',num_state)) %>%
gather(roundNum,roundResult,R1:R7) %>%
transmute(
playerNum = as.numeric(num_state),
roundNum = as.numeric(str_extract(roundNum,'\\d+')),
oppNum = as.numeric(str_extract(roundResult,'\\d+')),
result = str_extract(roundResult,'[A-Z]')
))
head(matchResults,7)
## # A tibble: 7 x 4
## playerNum roundNum oppNum result
## <dbl> <dbl> <dbl> <chr>
## 1 1 1 39 W
## 2 2 1 63 W
## 3 3 1 8 L
## 4 4 1 23 W
## 5 5 1 45 W
## 6 6 1 34 W
## 7 7 1 57 W
matchResults %>%
count(result)
## # A tibble: 7 x 2
## result n
## <chr> <int>
## 1 B 7
## 2 D 58
## 3 H 16
## 4 L 175
## 5 U 16
## 6 W 175
## 7 X 1
Adding the opponent’s pre-tournament rating to the match results data frame will make it easier to calculate the average opponents’ rating and determine the expected match results.
I had some issues joining player details with match results on columns of different names which is why I used a select command to rename the ‘playerNum’ column to ‘oppNum’.
matchResults <- matchResults %>%
left_join(playerDetails %>% select(oppNum = playerNum, oppRating = preRating),by = 'oppNum')
head(matchResults,7)
## # A tibble: 7 x 5
## playerNum roundNum oppNum result oppRating
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1 1 39 W 1436
## 2 2 1 63 W 1175
## 3 3 1 8 L 1641
## 4 4 1 23 W 1363
## 5 5 1 45 W 1242
## 6 6 1 34 W 1399
## 7 7 1 57 W 1092
For each match we join to player details to determine the opponents pre-tournament rating, then calculate the mean grouped by player. The calculation of average opponents’ rating accounts for players that played less than 7 matches.
playerDetails <- playerDetails %>%
inner_join(matchResults %>%
group_by(playerNum) %>%
summarize(avgOppRating = as.integer(mean(oppRating, na.rm = TRUE))),by = 'playerNum')
head(playerDetails,7)
## playerNum USCF_ID playerName state preRating points postRating
## 1 1 15445895 GARY HUA ON 1794 6.0 1817
## 2 2 14598900 DAKSHESH DARURI MI 1553 6.0 1663
## 3 3 14959604 ADITYA BAJAJ MI 1384 6.0 1640
## 4 4 12616049 PATRICK H SCHILLING MI 1716 5.5 1744
## 5 5 14601533 HANSHI ZUO MI 1655 5.5 1690
## 6 6 15055204 HANSEN SONG OH 1686 5.0 1687
## 7 7 11146376 GARY DEE SWATHELL MI 1649 5.0 1673
## avgOppRating
## 1 1605
## 2 1469
## 3 1563
## 4 1573
## 5 1500
## 6 1518
## 7 1372
Generate a csv of the players and another for the results of the tournament. These files can then be used to load into a database.
local_path = 'C:\\Users\\dab31\\OneDrive\\Documents\\GitHub\\DATA607\\Projects\\Project_1\\'
playerDetails %>%
write_csv(paste(local_path, 'playersDetails.csv', sep = ''))
matchResults %>%
write_csv(paste(local_path,'matchResults.csv', sep = ''))