Overview
The objective of this project is to import and clean a set of data about chess tournament player performance. The output will be a csv which contains the following information:
| Players.Name | Players.State | Total.Number.of.Points | Players.Pre_Rating | Average.Pre.Chess.Rating.of.Opponents |
|---|---|---|---|---|
| Gary Hua | ON | 6 | 1794 | 1605 |
Retrieve Data
The data can be found in GitHub as a text file. This will be imported into R using the read_table() function.
## Parsed with column specification:
## cols(
## `-----------------------------------------------------------------------------------------` = col_character()
## )
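The import call itself is not shown in this report. As a minimal sketch of the same step, the example below uses base R's readLines() instead of read_table(), because it returns one string per line without treating the first dashed line as a column header. The sample lines, the temporary file, and the name `chess_raw` are assumptions made for illustration; the real file is the one hosted on GitHub.

```r
# A few lines in the layout of the raw file, written to a temporary file
# so the example is self-contained (illustrative sample, not the full data).
raw_sample <- c(
  "-----------------------------------------",
  " Pair | Player Name |Total|Round|",
  " Num  | USCF ID / Rtg (Pre->Post) | Pts |  1  |",
  "-----------------------------------------",
  "    1 | GARY HUA |6.0  |W  39|",
  "   ON | 15445895 / R: 1794   ->1817 |N:2  |W    |",
  "-----------------------------------------"
)
path <- tempfile(fileext = ".txt")
writeLines(raw_sample, path)

# readLines() returns a character vector with one element per line
chess_raw <- readLines(path)
length(chess_raw)
```

Working from a plain character vector keeps the later stringr calls simple, since they operate element by element on strings.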
Clean Data
Below we clean the chess tournament data that was imported from the text file. Entire dashed lines separate the records, and each record spans two rows, so we need to merge the information from each pair of rows into a single line. We will also rename columns and remove those that won’t be needed.
## 1 "Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|"
## 2 "Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |"
## 3 "-----------------------------------------------------------------------------------------"
## 4 "1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## 5 "ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## 6 "-----------------------------------------------------------------------------------------"
#Remove all lines that are only dashes
chess <- chess[!str_detect(chess,"^\\s*-+\\s*$")]
#Strip the remaining dashes, so "->" becomes ">" (the rating split below relies on this)
chess <- str_replace_all(chess,"-","")
#Combine the lines that have been separated
line_numbers <- c(1:length(chess))
odd_lines_numbers <- line_numbers[line_numbers%%2 == 1]
even_line_numbers <- line_numbers[line_numbers%%2 == 0]
chess_data <- str_c(chess[odd_lines_numbers],chess[even_line_numbers])
#Additional cleaning & renaming columns
chess_data <- str_replace_all(chess_data,"[\\|]",",")
chess_data <- str_split(chess_data,",")
chess_df <- as.data.frame(matrix(unlist(chess_data), nrow=65, byrow=TRUE))
#Subset & rename only columns that are needed
chess_df_subset <- chess_df[,c(1:13)]
chess_df_subset <- chess_df_subset[-1,]
colnames(chess_df_subset) <- c("Player_ID","PlayerName","Total Points","Opponent1","Opponent2","Opponent3","Opponent4","Opponent5","Opponent6","Opponent7","State","USCF ID / Rtg (Pre>Post)","Points_Total")
#Clean up data types
chess_df_subset$Player_ID <- as.numeric(as.character(chess_df_subset$Player_ID))
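To make the row-merging step easier to follow, here is a minimal sketch on a small sample. It assumes the data is a plain character vector with one element per line; the names `sample_lines`, `kept`, `merged`, and `fields` are illustrative and do not appear in the code above.

```r
library(stringr)

# Two players in the raw layout, abbreviated to two rounds each
sample_lines <- c(
  "-----------------------------",
  "1 | GARY HUA |6.0 |W 39|W 21|",
  "ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |",
  "-----------------------------",
  "2 | DAKSHESH DARURI |6.0 |W 63|W 58|",
  "MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |",
  "-----------------------------"
)

# Drop separator rows that consist only of dashes
kept <- sample_lines[!str_detect(sample_lines, "^-+$")]

# Odd rows hold name, points, and rounds; even rows hold state and ratings
idx <- seq_along(kept)
merged <- str_c(kept[idx %% 2 == 1], kept[idx %% 2 == 0])

# Split each merged record on "|" and trim the padding around each field
fields <- str_split(merged, fixed("|"))
fields <- lapply(fields, str_trim)

fields[[1]][2]  # player name of the first record
fields[[1]][6]  # state of the first record
```

Using seq_along() rather than 1:length() avoids the surprising `c(1, 0)` index that 1:length() produces when the vector is empty.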
Player’s Pre & Post Tournament Ratings
In the raw data, each player has a pre-tournament rating as well as a post-tournament rating. The first number that follows the “R:” is the pre-tournament player rating, and the number after the “>” is the post-tournament rating.
| | Player_ID | USCF ID / Rtg (Pre>Post) |
|---|---|---|
| 2 | 1 | 15445895 / R: 1794 >1817 |
| 3 | 2 | 14598900 / R: 1553 >1663 |
| 4 | 3 | 14959604 / R: 1384 >1640 |
| 5 | 4 | 12616049 / R: 1716 >1744 |
| 6 | 5 | 14601533 / R: 1655 >1690 |
| 7 | 6 | 15055204 / R: 1686 >1687 |
The information is all contained in a single string and must be parsed out using regular expressions.
chess_avg_player <- chess_df_subset[,c('Player_ID','USCF ID / Rtg (Pre>Post)')]
chess_avg_player <- chess_avg_player %>% separate(`USCF ID / Rtg (Pre>Post)`,c("USCF_ID","Rtg(Pre>Post)"),"/")
chess_avg_player <- chess_avg_player %>% separate('Rtg(Pre>Post)',c("Rating_Pre","Rating_Post"),">")
#Clean up player pre ratings
chess_avg_player$Rating_Pre <- str_replace_all(chess_avg_player$Rating_Pre,"[\\s+?R:\\s+]","")
chess_avg_player$Rating_Pre <- str_replace_all(chess_avg_player$Rating_Pre, "(P\\d+)$","")
chess_avg_player$Rating_Pre <- as.numeric(as.character(chess_avg_player$Rating_Pre))
#Clean up player post ratings
chess_avg_player$Rating_Post <- str_replace_all(chess_avg_player$Rating_Post,"[\\s+]","")
chess_avg_player$Rating_Post <- str_replace_all(chess_avg_player$Rating_Post,"(P\\d+)$","")
chess_avg_player$Rating_Post <- as.numeric(as.character(chess_avg_player$Rating_Post))
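As an alternative to the separate-then-clean approach above, both ratings can be captured with a single pattern. This is only a sketch: it uses str_match() from stringr, the names `rating_strings` and `parts` are illustrative, and the third sample string is made up to show a trailing P-number suffix of the kind the code above strips.

```r
library(stringr)

rating_strings <- c(
  "15445895 / R: 1794 >1817",
  "14598900 / R: 1553 >1663",
  "00000000 / R: 1641P17 >1657P24"  # made-up entry with P suffixes
)

# Group 1: digits after "R:"; group 2: digits after ">".
# An optional "P<digits>" suffix after the first rating is skipped.
parts <- str_match(rating_strings, "R:\\s*(\\d+)(?:P\\d+)?\\s*>\\s*(\\d+)")

# Column 1 of the result is the full match; columns 2 and 3 are the groups
rating_pre  <- as.numeric(parts[, 2])
rating_post <- as.numeric(parts[, 3])
```

Capturing only the digits means no follow-up cleanup of whitespace or the “R:” prefix is needed, at the cost of a slightly denser pattern.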
Average Opponent Ratings
We then group each player’s opponent ratings and average them.
| Player_ID | Avg_Opponent_Pre_Rating |
|---|---|
| 1 | 1605.286 |
| 2 | 1469.286 |
| 3 | 1563.571 |
| 4 | 1573.571 |
| 5 | 1500.857 |
| 6 | 1518.714 |
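The code that produced this table is not shown, so the following is a sketch of one way to compute it in base R, not the report's actual method. The toy `players` and `games` data frames and their values are made up; the result is named `ratings` on the assumption that it has the same shape as the `ratings` object joined later.

```r
# Opponent IDs come from round cells such as "W 39"; keep only the digits.
# A cell without an opponent number (hypothetical "H" here) becomes NA.
round_cells <- c("W 39", "D 12", "H")
opponent_ids <- as.numeric(gsub("\\D", "", round_cells))

# Toy data: three players who each played the other two (made-up ratings)
players <- data.frame(
  Player_ID  = c(1, 2, 3),
  Rating_Pre = c(1800, 1600, 1400)
)
games <- data.frame(
  Player_ID   = c(1, 1, 2, 2, 3, 3),
  Opponent_ID = c(2, 3, 1, 3, 1, 2)
)

# Look up each opponent's pre-tournament rating by ID
games$Opponent_Pre_Rating <-
  players$Rating_Pre[match(games$Opponent_ID, players$Player_ID)]

# Average the opponent ratings per player
ratings <- aggregate(Opponent_Pre_Rating ~ Player_ID, data = games, FUN = mean)
names(ratings)[2] <- "Avg_Opponent_Pre_Rating"
```

With the formula interface, aggregate() drops rows whose rating is NA by default, so rounds without an opponent would not count toward the average.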
Combining Everything Together
individual_player <- left_join(chess_df_subset[,c(1,2,3,11)],chess_avg_player[,c(1,3)],by = c("Player_ID" = "Player_ID"))
final <- left_join(individual_player,ratings,by = c("Player_ID" = "Player_ID"))
final <- final[,c(2,4,3,5,6)]
Now that we have the individual player information:
| Player_ID | PlayerName | Total Points | State | Rating_Pre |
|---|---|---|---|---|
| 1 | GARY HUA | 6.0 | ON | 1794 |
| 2 | DAKSHESH DARURI | 6.0 | MI | 1553 |
| 3 | ADITYA BAJAJ | 6.0 | MI | 1384 |
| 4 | PATRICK H SCHILLING | 5.5 | MI | 1716 |
| 5 | HANSHI ZUO | 5.5 | MI | 1655 |
| 6 | HANSEN SONG | 5.0 | OH | 1686 |
Along with the average opponent pre tournament scores:
| Player_ID | Avg_Opponent_Pre_Rating |
|---|---|
| 1 | 1605.286 |
| 2 | 1469.286 |
| 3 | 1563.571 |
| 4 | 1573.571 |
| 5 | 1500.857 |
| 6 | 1518.714 |
We can combine them by joining on Player_ID to get our final results, then export the data to a CSV file, “Chess_Tournament_Results.csv”.
| PlayerName | State | Total Points | Rating_Pre | Avg_Opponent_Pre_Rating |
|---|---|---|---|---|
| GARY HUA | ON | 6.0 | 1794 | 1605.286 |
| DAKSHESH DARURI | MI | 6.0 | 1553 | 1469.286 |
| ADITYA BAJAJ | MI | 6.0 | 1384 | 1563.571 |
| PATRICK H SCHILLING | MI | 5.5 | 1716 | 1573.571 |
| HANSHI ZUO | MI | 5.5 | 1655 | 1500.857 |
| HANSEN SONG | OH | 5.0 | 1686 | 1518.714 |
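The export call is not shown either. A minimal sketch using base R's write.csv() follows; `final_sample` is a two-row stand-in for the `final` data frame, with `Total_Points` spelled with an underscore so the column name survives a round trip, and the file is written to a temporary directory here rather than the working directory.

```r
# Stand-in for the `final` data frame (first two rows of the table above)
final_sample <- data.frame(
  PlayerName = c("GARY HUA", "DAKSHESH DARURI"),
  State = c("ON", "MI"),
  Total_Points = c(6.0, 6.0),
  Rating_Pre = c(1794, 1553),
  Avg_Opponent_Pre_Rating = c(1605.286, 1469.286)
)

# row.names = FALSE keeps the row index out of the file
out_path <- file.path(tempdir(), "Chess_Tournament_Results.csv")
write.csv(final_sample, out_path, row.names = FALSE)

# Read the file back to confirm what was written
check <- read.csv(out_path)
```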