1 Overview

For this project we were given a text file with chess tournament results where the information has some structure. The task was to generate a .CSV file containing the name, pre-rating score of an individual player, their total points and the average of all their opponents pre-ratings. The tournament results are stored in the text file called tournamentinfo.

Guidelines for this project

The rounds that end with whats detailed below will not count:
- B(ye)
- H(alf point)
- U(nplayed)
Instead we will include:
- W(in)
- L(ose)
- D(raw)

As of right now the data looks like: Tournament

The final format we want is:

Player’s Name	Player’s State	Total # of points	Player’s Pre Rating	Avg Pre-Chess of Opponents
Gary Hua	ON	6.0	1794	1605

1.1 Read in the text file

file = "https://raw.githubusercontent.com/moiyajosephs/Data607-Project1/main/tournamentinfo.txt"
tm= read.delim(file,header=FALSE,sep="|", dec = ".")

2 Data cleaning

The steps to clean the data will be as described below:

Data Cleaning Steps

2.1 Initial data frame

Each players information is on two separate rows enclosed between the –. To start, remove any row in the data frame that starts with –. This is not needed for the data to be separated any longer.

tm = tm %>% 
  filter(!grepl('^-', V1))
head(tm) %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover"))

V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11
Pair	Player Name	Total	Round	Round	Round	Round	Round	Round	Round	NA
Num	USCF ID / Rtg (Pre->Post)	Pts	1	2	3	4	5	6	7	NA
1	GARY HUA	6.0	W 39	W 21	W 18	W 14	W 7	D 12	D 4	NA
ON	15445895 / R: 1794 ->1817	N:2	W	B	W	B	W	B	W	NA
2	DAKSHESH DARURI	6.0	W 63	W 58	L 4	W 17	W 16	W 20	W 7	NA
MI	14598900 / R: 1553 ->1663	N:2	B	W	B	W	B	W	B	NA

Next remove the two rows that contain header information in the start of the text file, I will provide actual column names.

tm <- tm[-c(1:2),-11]

	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10
3	1	GARY HUA	6.0	W 39	W 21	W 18	W 14	W 7	D 12	D 4
4	ON	15445895 / R: 1794 ->1817	N:2	W	B	W	B	W	B	W
5	2	DAKSHESH DARURI	6.0	W 63	W 58	L 4	W 17	W 16	W 20	W 7
6	MI	14598900 / R: 1553 ->1663	N:2	B	W	B	W	B	W	B
7	3	ADITYA BAJAJ	6.0	L 8	W 61	W 25	W 21	W 11	W 13	W 12
8	MI	14959604 / R: 1384 ->1640	N:2	W	B	W	B	W	B	W

2.1.1 Divide the data

Split by each row to individual data frames after skipping every two, that way I can append the second row to the first row. Tm1 represents the first set of rows while Tm2 represents the rest.

tm1 <- tm[row.names(tm) %in% seq(1,132,2),] #first part of the players data
tm2 <- tm[row.names(tm) %in% seq(2,132,2),] #second part of the players data

Name the columns of each divided data frame so it will be clear when we combine them. There is the players number, their name, total points and seven rounds total.

names(tm1) <- c("player_num", "player_name", "total points", "round1", "round2", "round3","round4","round5","round6","round7")
head(tm2) %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover"))

	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10
4	ON	15445895 / R: 1794 ->1817	N:2	W	B	W	B	W	B	W
6	MI	14598900 / R: 1553 ->1663	N:2	B	W	B	W	B	W	B
8	MI	14959604 / R: 1384 ->1640	N:2	W	B	W	B	W	B	W
10	MI	12616049 / R: 1716 ->1744	N:2	W	B	W	B	W	B	B
12	MI	14601533 / R: 1655 ->1690	N:2	B	W	B	W	B	W	B
14	OH	15055204 / R: 1686 ->1687	N:3	W	B	W	B	B	W	B

The second part of the players information contains the state the player is from, their pre-rating score, points, and some additional round information.

names(tm2)<- c("state","pre_rating", "pts","round1", "round2", "round3","round4","round5","round6","round7")
head(tm2) %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover"))

	state	pre_rating	pts	round1	round2	round3	round4	round5	round6	round7
4	ON	15445895 / R: 1794 ->1817	N:2	W	B	W	B	W	B	W
6	MI	14598900 / R: 1553 ->1663	N:2	B	W	B	W	B	W	B
8	MI	14959604 / R: 1384 ->1640	N:2	W	B	W	B	W	B	W
10	MI	12616049 / R: 1716 ->1744	N:2	W	B	W	B	W	B	B
12	MI	14601533 / R: 1655 ->1690	N:2	B	W	B	W	B	W	B
14	OH	15055204 / R: 1686 ->1687	N:3	W	B	W	B	B	W	B

To get the pre score number may take more cleaning. Currently it has additional and unnecessary characters. We want everything remove everything before the R: but before the ->. remove the n:2 columns also.

2.1.2 Clean the individual data frames

I cleaned the second data frame step by step.

First split by the R.

tm2$`pre_rating`<-sub(".*R:","",tm2$`pre_rating`)

	state	pre_rating	pts	round1	round2	round3	round4	round5	round6	round7
4	ON	1794 ->1817	N:2	W	B	W	B	W	B	W
6	MI	1553 ->1663	N:2	B	W	B	W	B	W	B
8	MI	1384 ->1640	N:2	W	B	W	B	W	B	W
10	MI	1716 ->1744	N:2	W	B	W	B	W	B	B
12	MI	1655 ->1690	N:2	B	W	B	W	B	W	B
14	OH	1686 ->1687	N:3	W	B	W	B	B	W	B

Next remove after the “->”

tm2$`pre_rating`<- sub("->.*","",tm2$`pre_rating`)

	state	pre_rating	pts	round1	round2	round3	round4	round5	round6	round7
4	ON	1794	N:2	W	B	W	B	W	B	W
6	MI	1553	N:2	B	W	B	W	B	W	B
8	MI	1384	N:2	W	B	W	B	W	B	W
10	MI	1716	N:2	W	B	W	B	W	B	B
12	MI	1655	N:2	B	W	B	W	B	W	B
14	OH	1686	N:3	W	B	W	B	B	W	B

Keep anything before the P.

tm2$`pre_rating`<- sub("P.*","",tm2$`pre_rating`)

	state	pre_rating	pts	round1	round2	round3	round4	round5	round6	round7
4	ON	1794	N:2	W	B	W	B	W	B	W
6	MI	1553	N:2	B	W	B	W	B	W	B
8	MI	1384	N:2	W	B	W	B	W	B	W
10	MI	1716	N:2	W	B	W	B	W	B	B
12	MI	1655	N:2	B	W	B	W	B	W	B
14	OH	1686	N:3	W	B	W	B	B	W	B

Realized I do not need anything passed pts. Columns 3 to 10 should be removed.

tm2<- tm2[,1:2]

	state	pre_rating
4	ON	1794
6	MI	1553
8	MI	1384
10	MI	1716
12	MI	1655
14	OH	1686

Now to combine.

2.1.3 Combine the data frames

To append the two data frames use c bind. This will be saved into tm3.

tm3 <- cbind(tm1,tm2)

	player_num	player_name	total points	round1	round2	round3	round4	round5	round6	round7	state	pre_rating
3	1	GARY HUA	6.0	W 39	W 21	W 18	W 14	W 7	D 12	D 4	ON	1794
5	2	DAKSHESH DARURI	6.0	W 63	W 58	L 4	W 17	W 16	W 20	W 7	MI	1553
7	3	ADITYA BAJAJ	6.0	L 8	W 61	W 25	W 21	W 11	W 13	W 12	MI	1384
9	4	PATRICK H SCHILLING	5.5	W 23	D 28	W 2	W 26	D 5	W 19	D 1	MI	1716
11	5	HANSHI ZUO	5.5	W 45	W 37	D 12	D 13	D 4	W 14	W 17	MI	1655
13	6	HANSEN SONG	5.0	W 34	D 29	L 11	W 35	D 10	W 27	W 21	OH	1686

3 Calculations

Now to get to the average of each players opponents, I needed to find the opponents for each person. This is specified in rounds 1 through 7. Currently the data has unnecessary values B,H, and U, the extra letters before the opponent. I need to do a little bit more data wrangling in order to get the opponents pre-ratings and each opponent.

3.1 Find the average of the Opponents

player_info <- tm3 %>% select(c(player_num, round1:round7,pre_rating,state, `total points`)) %>% melt(id.var=c("player_num","pre_rating","state", "total points"),value.name = "round") %>% mutate(opponent = as.numeric(str_extract(round," .*"))) %>% select(c(-round)) %>% mutate(pre_rating= as.numeric(pre_rating)) 
head(player_info)

##   player_num pre_rating  state total points variable opponent
## 1         1        1794    ON         6.0     round1       39
## 2         2        1553    MI         6.0     round1       63
## 3         3        1384    MI         6.0     round1        8
## 4         4        1716    MI         5.5     round1       23
## 5         5        1655    MI         5.5     round1       45
## 6         6        1686    OH         5.0     round1       34

3.1.1 Opponent table

I will need to know each players opponents pre-rating as well so this will be its own table with the columns renamed to avoid ambiguous error.

opponent_scores <- tm3 %>% select(player_num, pre_rating) %>% mutate(player_num= as.numeric(player_num)) %>% mutate(pre_rating = as.numeric(pre_rating))
names(opponent_scores)[names(opponent_scores)=="player_num"] = "opponent2"
names(opponent_scores)[names(opponent_scores)=="pre_rating"] = "o_ratings"
head(opponent_scores) %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover"))

	opponent2	o_ratings
3	1	1794
5	2	1553
7	3	1384
9	4	1716
11	5	1655
13	6	1686

Now join the two data frames using sql, by selecting and grouping by the player num and averaging the opponent ratings.

3.1.2 Opponent Average

To get the opponent average, I have to select the player num from player_info table group by it and average the opponent ratings for each respective player. I achieved this using SQL.

opponent_avg <- sqldf('select player_num, round(avg(o_ratings),0 ) as opponent from player_info inner join opponent_scores on opponent = opponent2 group by player_num')
datatable(opponent_avg, extensions= 'Scroller', options=list(pageLength=5))

Now we need to combine our final data frame with all the information so it can look like our final tournament table. With this table you can easily search for each opponent by name. It even allows you to sort by any column.

tournament <- tm3 %>% full_join(opponent_avg, by=c("player_num")) %>% select(c(-(round1:round7)))
datatable(tournament, extensions= 'Scroller', options=list(pageLength=5))

4 Final Step

The final step is to save to a tournament csv which would be the correct schema.

write_csv(tournament, "./tournament.csv")

The resulting csv can be found on my GitHub: Tournament

Project 1

Moiya Josephs

2/15/2022