1 Overview

For this project we were given a text file with chess tournament results where the information has some structure. The task was to generate a .CSV file containing the name, pre-rating score of an individual player, their total points and the average of all their opponents pre-ratings. The tournament results are stored in the text file called tournamentinfo.

Guidelines for this project

  • The rounds that end with whats detailed below will not count:
    • B(ye)
    • H(alf point)
    • U(nplayed)
  • Instead we will include:
    • W(in)
    • L(ose)
    • D(raw)

As of right now the data looks like: Tournament

The final format we want is:

Player’s Name Player’s State Total # of points Player’s Pre Rating Avg Pre-Chess of Opponents
Gary Hua ON 6.0 1794 1605

1.1 Read in the text file

file = "https://raw.githubusercontent.com/moiyajosephs/Data607-Project1/main/tournamentinfo.txt"
tm= read.delim(file,header=FALSE,sep="|", dec = ".")

2 Data cleaning

The steps to clean the data will be as described below:

Data Cleaning Steps

2.1 Initial data frame

Each players information is on two separate rows enclosed between the –. To start, remove any row in the data frame that starts with –. This is not needed for the data to be separated any longer.

tm = tm %>% 
  filter(!grepl('^-', V1))
head(tm) %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover"))
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
Pair Player Name Total Round Round Round Round Round Round Round NA
Num USCF ID / Rtg (Pre->Post) Pts 1 2 3 4 5 6 7 NA
1 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7 D 12 D 4 NA
ON 15445895 / R: 1794 ->1817 N:2 W B W B W B W NA
2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17 W 16 W 20 W 7 NA
MI 14598900 / R: 1553 ->1663 N:2 B W B W B W B NA

Next remove the two rows that contain header information in the start of the text file, I will provide actual column names.

tm <- tm[-c(1:2),-11]
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
3 1 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7 D 12 D 4
4 ON 15445895 / R: 1794 ->1817 N:2 W B W B W B W
5 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17 W 16 W 20 W 7
6 MI 14598900 / R: 1553 ->1663 N:2 B W B W B W B
7 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25 W 21 W 11 W 13 W 12
8 MI 14959604 / R: 1384 ->1640 N:2 W B W B W B W

2.1.1 Divide the data

Split by each row to individual data frames after skipping every two, that way I can append the second row to the first row. Tm1 represents the first set of rows while Tm2 represents the rest.

tm1 <- tm[row.names(tm) %in% seq(1,132,2),] #first part of the players data
tm2 <- tm[row.names(tm) %in% seq(2,132,2),] #second part of the players data

Name the columns of each divided data frame so it will be clear when we combine them. There is the players number, their name, total points and seven rounds total.

names(tm1) <- c("player_num", "player_name", "total points", "round1", "round2", "round3","round4","round5","round6","round7")
head(tm2) %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover"))
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
4 ON 15445895 / R: 1794 ->1817 N:2 W B W B W B W
6 MI 14598900 / R: 1553 ->1663 N:2 B W B W B W B
8 MI 14959604 / R: 1384 ->1640 N:2 W B W B W B W
10 MI 12616049 / R: 1716 ->1744 N:2 W B W B W B B
12 MI 14601533 / R: 1655 ->1690 N:2 B W B W B W B
14 OH 15055204 / R: 1686 ->1687 N:3 W B W B B W B

The second part of the players information contains the state the player is from, their pre-rating score, points, and some additional round information.

names(tm2)<- c("state","pre_rating", "pts","round1", "round2", "round3","round4","round5","round6","round7")
head(tm2) %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover"))
state pre_rating pts round1 round2 round3 round4 round5 round6 round7
4 ON 15445895 / R: 1794 ->1817 N:2 W B W B W B W
6 MI 14598900 / R: 1553 ->1663 N:2 B W B W B W B
8 MI 14959604 / R: 1384 ->1640 N:2 W B W B W B W
10 MI 12616049 / R: 1716 ->1744 N:2 W B W B W B B
12 MI 14601533 / R: 1655 ->1690 N:2 B W B W B W B
14 OH 15055204 / R: 1686 ->1687 N:3 W B W B B W B

To get the pre score number may take more cleaning. Currently it has additional and unnecessary characters. We want everything remove everything before the R: but before the ->. remove the n:2 columns also.

2.1.2 Clean the individual data frames

I cleaned the second data frame step by step.

First split by the R.

tm2$`pre_rating`<-sub(".*R:","",tm2$`pre_rating`)
state pre_rating pts round1 round2 round3 round4 round5 round6 round7
4 ON 1794 ->1817 N:2 W B W B W B W
6 MI 1553 ->1663 N:2 B W B W B W B
8 MI 1384 ->1640 N:2 W B W B W B W
10 MI 1716 ->1744 N:2 W B W B W B B
12 MI 1655 ->1690 N:2 B W B W B W B
14 OH 1686 ->1687 N:3 W B W B B W B

Next remove after the “->”

tm2$`pre_rating`<- sub("->.*","",tm2$`pre_rating`)
state pre_rating pts round1 round2 round3 round4 round5 round6 round7
4 ON 1794 N:2 W B W B W B W
6 MI 1553 N:2 B W B W B W B
8 MI 1384 N:2 W B W B W B W
10 MI 1716 N:2 W B W B W B B
12 MI 1655 N:2 B W B W B W B
14 OH 1686 N:3 W B W B B W B

Keep anything before the P.

tm2$`pre_rating`<- sub("P.*","",tm2$`pre_rating`)
state pre_rating pts round1 round2 round3 round4 round5 round6 round7
4 ON 1794 N:2 W B W B W B W
6 MI 1553 N:2 B W B W B W B
8 MI 1384 N:2 W B W B W B W
10 MI 1716 N:2 W B W B W B B
12 MI 1655 N:2 B W B W B W B
14 OH 1686 N:3 W B W B B W B

Realized I do not need anything passed pts. Columns 3 to 10 should be removed.

tm2<- tm2[,1:2]
state pre_rating
4 ON 1794
6 MI 1553
8 MI 1384
10 MI 1716
12 MI 1655
14 OH 1686

Now to combine.

2.1.3 Combine the data frames

To append the two data frames use c bind. This will be saved into tm3.

tm3 <- cbind(tm1,tm2)
player_num player_name total points round1 round2 round3 round4 round5 round6 round7 state pre_rating
3 1 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7 D 12 D 4 ON 1794
5 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17 W 16 W 20 W 7 MI 1553
7 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25 W 21 W 11 W 13 W 12 MI 1384
9 4 PATRICK H SCHILLING 5.5 W 23 D 28 W 2 W 26 D 5 W 19 D 1 MI 1716
11 5 HANSHI ZUO 5.5 W 45 W 37 D 12 D 13 D 4 W 14 W 17 MI 1655
13 6 HANSEN SONG 5.0 W 34 D 29 L 11 W 35 D 10 W 27 W 21 OH 1686

3 Calculations

Now to get to the average of each players opponents, I needed to find the opponents for each person. This is specified in rounds 1 through 7. Currently the data has unnecessary values B,H, and U, the extra letters before the opponent. I need to do a little bit more data wrangling in order to get the opponents pre-ratings and each opponent.

3.1 Find the average of the Opponents

player_info <- tm3 %>% select(c(player_num, round1:round7,pre_rating,state, `total points`)) %>% melt(id.var=c("player_num","pre_rating","state", "total points"),value.name = "round") %>% mutate(opponent = as.numeric(str_extract(round," .*"))) %>% select(c(-round)) %>% mutate(pre_rating= as.numeric(pre_rating)) 
head(player_info)
##   player_num pre_rating  state total points variable opponent
## 1         1        1794    ON         6.0     round1       39
## 2         2        1553    MI         6.0     round1       63
## 3         3        1384    MI         6.0     round1        8
## 4         4        1716    MI         5.5     round1       23
## 5         5        1655    MI         5.5     round1       45
## 6         6        1686    OH         5.0     round1       34

3.1.1 Opponent table

I will need to know each players opponents pre-rating as well so this will be its own table with the columns renamed to avoid ambiguous error.

opponent_scores <- tm3 %>% select(player_num, pre_rating) %>% mutate(player_num= as.numeric(player_num)) %>% mutate(pre_rating = as.numeric(pre_rating))
names(opponent_scores)[names(opponent_scores)=="player_num"] = "opponent2"
names(opponent_scores)[names(opponent_scores)=="pre_rating"] = "o_ratings"
head(opponent_scores) %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover"))
opponent2 o_ratings
3 1 1794
5 2 1553
7 3 1384
9 4 1716
11 5 1655
13 6 1686

Now join the two data frames using sql, by selecting and grouping by the player num and averaging the opponent ratings.

3.1.2 Opponent Average

To get the opponent average, I have to select the player num from player_info table group by it and average the opponent ratings for each respective player. I achieved this using SQL.

opponent_avg <- sqldf('select player_num, round(avg(o_ratings),0 ) as opponent from player_info inner join opponent_scores on opponent = opponent2 group by player_num')
datatable(opponent_avg, extensions= 'Scroller', options=list(pageLength=5))

Now we need to combine our final data frame with all the information so it can look like our final tournament table. With this table you can easily search for each opponent by name. It even allows you to sort by any column.

tournament <- tm3 %>% full_join(opponent_avg, by=c("player_num")) %>% select(c(-(round1:round7)))
datatable(tournament, extensions= 'Scroller', options=list(pageLength=5))

4 Final Step

The final step is to save to a tournament csv which would be the correct schema.

write_csv(tournament, "./tournament.csv")

The resulting csv can be found on my GitHub: Tournament