The goal of this assignment is to read a text file containing the results of a chess tournament and generate a CSV file from that data. The average pre-tournament rating of each player’s opponents must also be calculated and included.
The first few lines of the text file in question look like this:
[Image: Chess Tournament Text File]
At first glance we can see that records are separated by rows of dashes, that each record spans two lines, and that fields within each line are separated by pipe characters (|).
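The first step is to load the stringr library and read the file into R. Passing a separator that does not occur in the data (here, an asterisk) keeps read.table from splitting the lines, so each line arrives as a single field.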
library(stringr)
chess <- read.table("tournamentinfo.txt",header=FALSE, sep="*")
head(chess)
## V1
## 1 -----------------------------------------------------------------------------------------
## 2 Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|
## 3 Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
## 4 -----------------------------------------------------------------------------------------
## 5 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
## 6 ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
We can see that each line of the table has been imported as one long string.
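As an aside, base R's readLines also returns one string per line, without going through a one-column table. The sketch below is an alternative, not the approach this walkthrough uses; it writes two made-up lines to a temporary file (a hypothetical stand-in for tournamentinfo.txt) so that it can run on its own.

```r
# Hypothetical stand-in for the tournament file: two illustrative lines
tmp <- tempfile(fileext = ".txt")
writeLines(c("-----", "    1 | GARY HUA |6.0  |"), tmp)

# readLines returns a character vector with one element per line
raw_lines <- readLines(tmp)
class(raw_lines)   # "character"
length(raw_lines)  # 2
```

With this approach the values are plain character strings rather than factor levels.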
Next, we will remove the header lines.
chess <- chess[-c(1,2,3,4),]
head(chess)
## [1] 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
## [2] ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
## [3] -----------------------------------------------------------------------------------------
## [4] 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|
## [5] MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |
## [6] -----------------------------------------------------------------------------------------
## 131 Levels: 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4| ...
To separate the lines containing data from the rows of dashes, we will assign every third line to a new variable, starting from line 1 for the first line of each record and from line 2 for the second.
chess_line1 <- chess[seq(1,length(chess), 3)]
head(chess_line1)
## [1] 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
## [2] 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|
## [3] 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|
## [4] 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|
## [5] 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|
## [6] 6 | HANSEN SONG |5.0 |W 34|D 29|L 11|W 35|D 10|W 27|W 21|
## 131 Levels: 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4| ...
chess_line2 <- chess[seq(2,length(chess), 3)]
head(chess_line2)
## [1] ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
## [2] MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |
## [3] MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |
## [4] MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |
## [5] MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |
## [6] OH | 15055204 / R: 1686 ->1687 |N:3 |W |B |W |B |B |W |B |
## 131 Levels: 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4| ...
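The every-third-line selection can be seen more easily on a small example. This is a minimal sketch on a made-up vector (the names x, first_lines, and second_lines are illustrative only) that mimics the repeating three-line layout of the file.

```r
# Toy vector mimicking the layout: record line 1, record line 2, dashes
x <- c("a1", "a2", "---", "b1", "b2", "---")

# seq(start, end, by = 3) yields every third position from 'start'
first_lines  <- x[seq(1, length(x), by = 3)]  # "a1" "b1"
second_lines <- x[seq(2, length(x), by = 3)]  # "a2" "b2"
```

The dashed rows sit at positions 3, 6, and so on, so neither sequence ever selects them.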
We now need to split the data lines into individual fields. Because the fields are separated by pipes, we will use the str_split function from the stringr library with the pipe as the delimiter. The pipe is escaped as \\| because the pattern is a regular expression, in which an unescaped | means alternation.
split1 <- str_split(chess_line1,"\\|")
split1[[1]]
## [1] " 1 "
## [2] " GARY HUA "
## [3] "6.0 "
## [4] "W 39"
## [5] "W 21"
## [6] "W 18"
## [7] "W 14"
## [8] "W 7"
## [9] "D 12"
## [10] "D 4"
## [11] ""
split2 <- str_split(chess_line2,"\\|")
split2[[1]]
## [1] " ON "
## [2] " 15445895 / R: 1794 ->1817 "
## [3] "N:2 "
## [4] "W "
## [5] "B "
## [6] "W "
## [7] "B "
## [8] "W "
## [9] "B "
## [10] "W "
## [11] ""
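For comparison, the same kind of split can be sketched in base R. This is an assumption-labeled alternative, not the method used above: strsplit with fixed = TRUE treats the pipe literally, so no escaping is needed, and trimws removes the padding. The sample line is illustrative.

```r
# Illustrative line in the same pipe-delimited layout
line <- "    1 | GARY HUA |6.0  |W  39|"

# fixed = TRUE matches "|" literally instead of as a regex operator
fields <- strsplit(line, "|", fixed = TRUE)[[1]]
fields <- trimws(fields)
fields  # "1" "GARY HUA" "6.0" "W  39"
```

Unlike str_split, strsplit does not return a trailing empty string when the line ends with the delimiter.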
The above displays only the first record. We can see the data we need, but some cleanup is required. Before that, we will use for loops to collect the fields we want into vectors.
name <- NULL
state <- NULL
total_pts <- NULL
pre_rating <- NULL
round_1 <- NULL
round_2 <- NULL
round_3 <- NULL
round_4 <- NULL
round_5 <- NULL
round_6 <- NULL
round_7 <- NULL
for(i in split1){
name <- c(name,i[2])
total_pts <- c(total_pts,i[3])
round_1 <- c(round_1,i[4])
round_2 <- c(round_2,i[5])
round_3 <- c(round_3,i[6])
round_4 <- c(round_4,i[7])
round_5 <- c(round_5,i[8])
round_6 <- c(round_6,i[9])
round_7 <- c(round_7,i[10])
}
for(i in split2){
state <- c(state,i[1])
pre_rating <- c(pre_rating,i[2])
}
head(name)
## [1] " GARY HUA " " DAKSHESH DARURI "
## [3] " ADITYA BAJAJ " " PATRICK H SCHILLING "
## [5] " HANSHI ZUO " " HANSEN SONG "
head(state)
## [1] " ON " " MI " " MI " " MI " " MI " " OH "
head(total_pts)
## [1] "6.0 " "6.0 " "6.0 " "5.5 " "5.5 " "5.0 "
head(pre_rating)
## [1] " 15445895 / R: 1794 ->1817 " " 14598900 / R: 1553 ->1663 "
## [3] " 14959604 / R: 1384 ->1640 " " 12616049 / R: 1716 ->1744 "
## [5] " 14601533 / R: 1655 ->1690 " " 15055204 / R: 1686 ->1687 "
head(round_1)
## [1] "W 39" "W 63" "L 8" "W 23" "W 45" "W 34"
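Growing vectors inside a for loop works, but the same extraction can be written as one call per field. The following is a minimal sketch using vapply on a toy list; the name records is a hypothetical stand-in for split1.

```r
# Toy stand-in for split1: a list with one character vector per player
records <- list(
  c(" 1 ", " GARY HUA ", "6.0 "),
  c(" 2 ", " DAKSHESH DARURI ", "6.0 ")
)

# Pull the second field (the name) out of every record in one call;
# character(1) declares that each result is a single string
name <- vapply(records, function(rec) rec[2], character(1))
name  # " GARY HUA " " DAKSHESH DARURI "
```

Changing the index from 2 to another position would extract a different field in the same way.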
We now have separate vectors containing the name, state, total points, and pre-rating of each player. We also have vectors containing, for each of the seven rounds, the result and the pair number of the opponent each player faced. Some cleanup is still needed to leave us with only the data we want, in the correct data types.
We will trim whitespace from the names and states with str_trim, convert total points to numeric, and use str_extract on the “round” fields to keep only the opponent number, as we do not need the win/loss letter.
name <- str_trim(name)
state <- str_trim(state)
total_pts <- as.numeric(total_pts)
round_1 <- as.numeric(str_extract(round_1,"(\\d)+"))
round_2 <- as.numeric(str_extract(round_2,"(\\d)+"))
round_3 <- as.numeric(str_extract(round_3,"(\\d)+"))
round_4 <- as.numeric(str_extract(round_4,"(\\d)+"))
round_5 <- as.numeric(str_extract(round_5,"(\\d)+"))
round_6 <- as.numeric(str_extract(round_6,"(\\d)+"))
round_7 <- as.numeric(str_extract(round_7,"(\\d)+"))
head(name)
## [1] "GARY HUA" "DAKSHESH DARURI" "ADITYA BAJAJ"
## [4] "PATRICK H SCHILLING" "HANSHI ZUO" "HANSEN SONG"
head(state)
## [1] "ON" "MI" "MI" "MI" "MI" "OH"
head(total_pts)
## [1] 6.0 6.0 6.0 5.5 5.5 5.0
head(round_1)
## [1] 39 63 8 23 45 34
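It is worth seeing what this step does to a round entry that has a letter but no opponent number. The sketch below uses base R's gsub instead of str_extract so that it runs without packages; the sample values are illustrative, and the last one stands for a round with no opponent.

```r
# Illustrative round fields; the third has no opponent number
rounds <- c("W  39", "L   8", "H    ")

# Drop every non-digit character, leaving "" where there were no digits
digits <- gsub("[^0-9]", "", rounds)

# as.numeric turns the empty string into NA
opponent <- as.numeric(digits)
opponent  # 39 8 NA
```

str_extract behaves comparably here: it returns NA when the pattern does not match.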
The pre-rating also uses str_extract, but in two passes: the first isolates the correct digits from the string (the ratings vary in length, leading spaces, and trailing characters), and the second strips any leftover non-digit characters.
pre_rating <- str_extract(pre_rating,"[ ]([0-9]{3})[ ]|[:][ ]([0-9]{4})|([0-9]{3})[P]|([0-9]{4})[P]")
pre_rating <- as.numeric(str_extract(pre_rating,"(\\d)+"))
head(pre_rating)
## [1] 1794 1553 1384 1716 1655 1686
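A possible single-pass alternative is to anchor on the "R:" label and capture the digits that follow it. This is a sketch under assumptions, not the author's method: it uses base R's sub, and the second sample string (including its ID) is made up to illustrate a three-digit rating followed by a "P" suffix.

```r
# Illustrative inputs: a four-digit rating, and a three-digit rating
# with extra leading space and a trailing "P" suffix (made-up values)
raw <- c(" 15445895 / R: 1794   ->1817     ",
         " 12345678 / R:  955P11-> 979P18  ")

# Capture the first run of digits after "R:" and discard everything else
pre <- as.numeric(sub(".*R:\\s*(\\d+).*", "\\1", raw))
pre  # 1794 955
```

Anchoring on the label avoids enumerating the separate spacing and suffix cases.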
Now we have most of the elements needed to build our final dataframe.
df <- data.frame(name,state,total_pts,pre_rating)
head(df)
## name state total_pts pre_rating
## 1 GARY HUA ON 6.0 1794
## 2 DAKSHESH DARURI MI 6.0 1553
## 3 ADITYA BAJAJ MI 6.0 1384
## 4 PATRICK H SCHILLING MI 5.5 1716
## 5 HANSHI ZUO MI 5.5 1655
## 6 HANSEN SONG OH 5.0 1686
The final step (before generating the CSV) is to calculate the average pre-rating of each player’s opponents.
Because the “round” vectors contain each opponent’s pair number, and players appear in df in pair-number order, we can use those numbers as row indices to look up each opponent’s pre-rating.
round_1_opponent_rating <- df$pre_rating[round_1]
round_2_opponent_rating <- df$pre_rating[round_2]
round_3_opponent_rating <- df$pre_rating[round_3]
round_4_opponent_rating <- df$pre_rating[round_4]
round_5_opponent_rating <- df$pre_rating[round_5]
round_6_opponent_rating <- df$pre_rating[round_6]
round_7_opponent_rating <- df$pre_rating[round_7]
head(round_1_opponent_rating)
## [1] 1436 1175 1641 1363 1242 1399
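The lookup relies on R's vector indexing: indexing a vector by a vector of positions returns the elements at those positions, and an NA position returns NA. A minimal sketch with made-up opponent numbers:

```r
# Ratings of players 1, 2, 3 (position in the vector = pair number)
pre_rating <- c(1794, 1553, 1384)

# Opponent faced by each player in some round; NA means no game played
opponent <- c(3, NA, 1)

# Element i is the rating of player i's opponent
pre_rating[opponent]  # 1384 NA 1794
```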
Now we can build a dataframe containing the pre-ratings of each player’s opponents.
average_df <- data.frame(round_1_opponent_rating,
                         round_2_opponent_rating,
                         round_3_opponent_rating,
                         round_4_opponent_rating,
                         round_5_opponent_rating,
                         round_6_opponent_rating,
                         round_7_opponent_rating)
head(average_df)
## round_1_opponent_rating round_2_opponent_rating round_3_opponent_rating
## 1 1436 1563 1600
## 2 1175 917 1716
## 3 1641 955 1745
## 4 1363 1507 1553
## 5 1242 980 1663
## 6 1399 1602 1712
## round_4_opponent_rating round_5_opponent_rating round_6_opponent_rating
## 1 1610 1649 1663
## 2 1629 1604 1595
## 3 1563 1712 1666
## 4 1579 1655 1564
## 5 1666 1716 1610
## 6 1438 1365 1552
## round_7_opponent_rating
## 1 1716
## 2 1649
## 3 1663
## 4 1794
## 5 1629
## 6 1563
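The per-round lookups and the dataframe construction could also be collapsed into one step by holding the opponent numbers in a matrix. The following is a hedged sketch on toy data with three players and two rounds; the names opponents and opponent_ratings are hypothetical.

```r
pre_rating <- c(1794, 1553, 1384)

# One row per player, one column per round; values are opponent pair numbers
opponents <- cbind(round_1 = c(2, 1, NA),
                   round_2 = c(3, NA, 1))

# Look up all ratings at once, then restore the players-by-rounds shape
opponent_ratings <- matrix(pre_rating[as.vector(opponents)],
                           nrow = nrow(opponents),
                           dimnames = dimnames(opponents))
opponent_ratings
```

Because as.vector and matrix both work column by column, each rating lands in the same cell as the opponent number it came from.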
Some players did not play every round, and so have no opponent (NA) for those rounds. Calling rowMeans on average_df with na.rm=TRUE averages each row over only the rounds actually played, and we store the result as a new column in our final df.
df$opponent_average <- rowMeans(average_df,na.rm=TRUE)
head(df)
## name state total_pts pre_rating opponent_average
## 1 GARY HUA ON 6.0 1794 1605.286
## 2 DAKSHESH DARURI MI 6.0 1553 1469.286
## 3 ADITYA BAJAJ MI 6.0 1384 1563.571
## 4 PATRICK H SCHILLING MI 5.5 1716 1573.571
## 5 HANSHI ZUO MI 5.5 1655 1500.857
## 6 HANSEN SONG OH 5.0 1686 1518.714
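The effect of na.rm is easiest to see on a small case. In this sketch (toy values, hypothetical name ratings), the second player has no opponent in round 1.

```r
ratings <- data.frame(round_1 = c(1500, NA),
                      round_2 = c(1700, 1600))

# Without na.rm, any NA in a row makes that row's mean NA
without_na_rm <- rowMeans(ratings)                # 1600 NA

# With na.rm = TRUE, the mean is taken over the non-missing values only
with_na_rm <- rowMeans(ratings, na.rm = TRUE)     # 1600 1600
```

The divisor is the number of rounds actually played, not a fixed seven.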
With the final dataframe complete, we can generate the CSV.
write.csv(df,"chess_tournament.csv")
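By default, write.csv also writes the row names as an extra leading column. If that column is not wanted, one option is row.names = FALSE. This is a sketch on a two-row toy dataframe written to a temporary file, not a change to the call above.

```r
# Toy dataframe with two of the columns from the final result
df <- data.frame(name = c("GARY HUA", "DAKSHESH DARURI"),
                 pre_rating = c(1794, 1553))

# Write without the row-name column, then read back to check the header
out <- tempfile(fileext = ".csv")
write.csv(df, out, row.names = FALSE)
cols <- names(read.csv(out))
cols  # "name" "pre_rating"
```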