For this project we were given a text file with chess tournament results where the information has some structure. The task was to generate a .CSV file containing the name, pre-rating score of an individual player, their total points and the average of all their opponents pre-ratings. The tournament results are stored in the text file called tournamentinfo.
Guidelines for this project
As of right now the data looks like:
The final format we want is:
| Player’s Name | Player’s State | Total # of points | Player’s Pre Rating | Avg Pre-Chess of Opponents |
|---|---|---|---|---|
| Gary Hua | ON | 6.0 | 1794 | 1605 |
file = "https://raw.githubusercontent.com/moiyajosephs/Data607-Project1/main/tournamentinfo.txt"
tm= read.delim(file,header=FALSE,sep="|", dec = ".")
The steps to clean the data will be as described below:
Data Cleaning Steps
Each players information is on two separate rows enclosed between the –. To start, remove any row in the data frame that starts with –. This is not needed for the data to be separated any longer.
tm = tm %>%
filter(!grepl('^-', V1))
head(tm) %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover"))
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 |
|---|---|---|---|---|---|---|---|---|---|---|
| Pair | Player Name | Total | Round | Round | Round | Round | Round | Round | Round | NA |
| Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | NA |
| 1 | GARY HUA | 6.0 | W 39 | W 21 | W 18 | W 14 | W 7 | D 12 | D 4 | NA |
| ON | 15445895 / R: 1794 ->1817 | N:2 | W | B | W | B | W | B | W | NA |
| 2 | DAKSHESH DARURI | 6.0 | W 63 | W 58 | L 4 | W 17 | W 16 | W 20 | W 7 | NA |
| MI | 14598900 / R: 1553 ->1663 | N:2 | B | W | B | W | B | W | B | NA |
Next remove the two rows that contain header information in the start of the text file, I will provide actual column names.
tm <- tm[-c(1:2),-11]
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 1 | GARY HUA | 6.0 | W 39 | W 21 | W 18 | W 14 | W 7 | D 12 | D 4 |
| 4 | ON | 15445895 / R: 1794 ->1817 | N:2 | W | B | W | B | W | B | W |
| 5 | 2 | DAKSHESH DARURI | 6.0 | W 63 | W 58 | L 4 | W 17 | W 16 | W 20 | W 7 |
| 6 | MI | 14598900 / R: 1553 ->1663 | N:2 | B | W | B | W | B | W | B |
| 7 | 3 | ADITYA BAJAJ | 6.0 | L 8 | W 61 | W 25 | W 21 | W 11 | W 13 | W 12 |
| 8 | MI | 14959604 / R: 1384 ->1640 | N:2 | W | B | W | B | W | B | W |
Split by each row to individual data frames after skipping every two, that way I can append the second row to the first row. Tm1 represents the first set of rows while Tm2 represents the rest.
tm1 <- tm[row.names(tm) %in% seq(1,132,2),] #first part of the players data
tm2 <- tm[row.names(tm) %in% seq(2,132,2),] #second part of the players data
Name the columns of each divided data frame so it will be clear when we combine them. There is the players number, their name, total points and seven rounds total.
names(tm1) <- c("player_num", "player_name", "total points", "round1", "round2", "round3","round4","round5","round6","round7")
head(tm2) %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover"))
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | ON | 15445895 / R: 1794 ->1817 | N:2 | W | B | W | B | W | B | W |
| 6 | MI | 14598900 / R: 1553 ->1663 | N:2 | B | W | B | W | B | W | B |
| 8 | MI | 14959604 / R: 1384 ->1640 | N:2 | W | B | W | B | W | B | W |
| 10 | MI | 12616049 / R: 1716 ->1744 | N:2 | W | B | W | B | W | B | B |
| 12 | MI | 14601533 / R: 1655 ->1690 | N:2 | B | W | B | W | B | W | B |
| 14 | OH | 15055204 / R: 1686 ->1687 | N:3 | W | B | W | B | B | W | B |
The second part of the players information contains the state the player is from, their pre-rating score, points, and some additional round information.
names(tm2)<- c("state","pre_rating", "pts","round1", "round2", "round3","round4","round5","round6","round7")
head(tm2) %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover"))
| state | pre_rating | pts | round1 | round2 | round3 | round4 | round5 | round6 | round7 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | ON | 15445895 / R: 1794 ->1817 | N:2 | W | B | W | B | W | B | W |
| 6 | MI | 14598900 / R: 1553 ->1663 | N:2 | B | W | B | W | B | W | B |
| 8 | MI | 14959604 / R: 1384 ->1640 | N:2 | W | B | W | B | W | B | W |
| 10 | MI | 12616049 / R: 1716 ->1744 | N:2 | W | B | W | B | W | B | B |
| 12 | MI | 14601533 / R: 1655 ->1690 | N:2 | B | W | B | W | B | W | B |
| 14 | OH | 15055204 / R: 1686 ->1687 | N:3 | W | B | W | B | B | W | B |
To get the pre score number may take more cleaning. Currently it has additional and unnecessary characters. We want everything remove everything before the R: but before the ->. remove the n:2 columns also.
I cleaned the second data frame step by step.
First split by the R.
tm2$`pre_rating`<-sub(".*R:","",tm2$`pre_rating`)
| state | pre_rating | pts | round1 | round2 | round3 | round4 | round5 | round6 | round7 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | ON | 1794 ->1817 | N:2 | W | B | W | B | W | B | W |
| 6 | MI | 1553 ->1663 | N:2 | B | W | B | W | B | W | B |
| 8 | MI | 1384 ->1640 | N:2 | W | B | W | B | W | B | W |
| 10 | MI | 1716 ->1744 | N:2 | W | B | W | B | W | B | B |
| 12 | MI | 1655 ->1690 | N:2 | B | W | B | W | B | W | B |
| 14 | OH | 1686 ->1687 | N:3 | W | B | W | B | B | W | B |
Next remove after the “->”
tm2$`pre_rating`<- sub("->.*","",tm2$`pre_rating`)
| state | pre_rating | pts | round1 | round2 | round3 | round4 | round5 | round6 | round7 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | ON | 1794 | N:2 | W | B | W | B | W | B | W |
| 6 | MI | 1553 | N:2 | B | W | B | W | B | W | B |
| 8 | MI | 1384 | N:2 | W | B | W | B | W | B | W |
| 10 | MI | 1716 | N:2 | W | B | W | B | W | B | B |
| 12 | MI | 1655 | N:2 | B | W | B | W | B | W | B |
| 14 | OH | 1686 | N:3 | W | B | W | B | B | W | B |
Keep anything before the P.
tm2$`pre_rating`<- sub("P.*","",tm2$`pre_rating`)
| state | pre_rating | pts | round1 | round2 | round3 | round4 | round5 | round6 | round7 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | ON | 1794 | N:2 | W | B | W | B | W | B | W |
| 6 | MI | 1553 | N:2 | B | W | B | W | B | W | B |
| 8 | MI | 1384 | N:2 | W | B | W | B | W | B | W |
| 10 | MI | 1716 | N:2 | W | B | W | B | W | B | B |
| 12 | MI | 1655 | N:2 | B | W | B | W | B | W | B |
| 14 | OH | 1686 | N:3 | W | B | W | B | B | W | B |
Realized I do not need anything passed pts. Columns 3 to 10 should be removed.
tm2<- tm2[,1:2]
| state | pre_rating | |
|---|---|---|
| 4 | ON | 1794 |
| 6 | MI | 1553 |
| 8 | MI | 1384 |
| 10 | MI | 1716 |
| 12 | MI | 1655 |
| 14 | OH | 1686 |
Now to combine.
To append the two data frames use c bind. This will be saved into tm3.
tm3 <- cbind(tm1,tm2)
| player_num | player_name | total points | round1 | round2 | round3 | round4 | round5 | round6 | round7 | state | pre_rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 1 | GARY HUA | 6.0 | W 39 | W 21 | W 18 | W 14 | W 7 | D 12 | D 4 | ON | 1794 |
| 5 | 2 | DAKSHESH DARURI | 6.0 | W 63 | W 58 | L 4 | W 17 | W 16 | W 20 | W 7 | MI | 1553 |
| 7 | 3 | ADITYA BAJAJ | 6.0 | L 8 | W 61 | W 25 | W 21 | W 11 | W 13 | W 12 | MI | 1384 |
| 9 | 4 | PATRICK H SCHILLING | 5.5 | W 23 | D 28 | W 2 | W 26 | D 5 | W 19 | D 1 | MI | 1716 |
| 11 | 5 | HANSHI ZUO | 5.5 | W 45 | W 37 | D 12 | D 13 | D 4 | W 14 | W 17 | MI | 1655 |
| 13 | 6 | HANSEN SONG | 5.0 | W 34 | D 29 | L 11 | W 35 | D 10 | W 27 | W 21 | OH | 1686 |
Now to get to the average of each players opponents, I needed to find the opponents for each person. This is specified in rounds 1 through 7. Currently the data has unnecessary values B,H, and U, the extra letters before the opponent. I need to do a little bit more data wrangling in order to get the opponents pre-ratings and each opponent.
player_info <- tm3 %>% select(c(player_num, round1:round7,pre_rating,state, `total points`)) %>% melt(id.var=c("player_num","pre_rating","state", "total points"),value.name = "round") %>% mutate(opponent = as.numeric(str_extract(round," .*"))) %>% select(c(-round)) %>% mutate(pre_rating= as.numeric(pre_rating))
head(player_info)
## player_num pre_rating state total points variable opponent
## 1 1 1794 ON 6.0 round1 39
## 2 2 1553 MI 6.0 round1 63
## 3 3 1384 MI 6.0 round1 8
## 4 4 1716 MI 5.5 round1 23
## 5 5 1655 MI 5.5 round1 45
## 6 6 1686 OH 5.0 round1 34
I will need to know each players opponents pre-rating as well so this will be its own table with the columns renamed to avoid ambiguous error.
opponent_scores <- tm3 %>% select(player_num, pre_rating) %>% mutate(player_num= as.numeric(player_num)) %>% mutate(pre_rating = as.numeric(pre_rating))
names(opponent_scores)[names(opponent_scores)=="player_num"] = "opponent2"
names(opponent_scores)[names(opponent_scores)=="pre_rating"] = "o_ratings"
head(opponent_scores) %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover"))
| opponent2 | o_ratings | |
|---|---|---|
| 3 | 1 | 1794 |
| 5 | 2 | 1553 |
| 7 | 3 | 1384 |
| 9 | 4 | 1716 |
| 11 | 5 | 1655 |
| 13 | 6 | 1686 |
Now join the two data frames using sql, by selecting and grouping by the player num and averaging the opponent ratings.
To get the opponent average, I have to select the player num from player_info table group by it and average the opponent ratings for each respective player. I achieved this using SQL.
opponent_avg <- sqldf('select player_num, round(avg(o_ratings),0 ) as opponent from player_info inner join opponent_scores on opponent = opponent2 group by player_num')
datatable(opponent_avg, extensions= 'Scroller', options=list(pageLength=5))
Now we need to combine our final data frame with all the information so it can look like our final tournament table. With this table you can easily search for each opponent by name. It even allows you to sort by any column.
tournament <- tm3 %>% full_join(opponent_avg, by=c("player_num")) %>% select(c(-(round1:round7)))
datatable(tournament, extensions= 'Scroller', options=list(pageLength=5))
The final step is to save to a tournament csv which would be the correct schema.
write_csv(tournament, "./tournament.csv")
The resulting csv can be found on my GitHub: Tournament