Project1

Overview

The objective of this project is to import and clean a data set describing player performance in a chess tournament. The output is a CSV file containing the following information for each player:

Players.Name	Players.State	Total.Number.of.Points	Players.Pre_Rating	Average.Pre.Chess.Rating.of.Opponents
Gary Hua	ON	6	1794	1605

Retrieve Data

The data can be found in GitHub as a text file. This will be imported into R using the read_table() function.

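library(tidyverse) #provides read_table(), the str_*() functions, separate(), pivot_longer() and the joins
file_path <- "https://raw.githubusercontent.com/devinteran/DATA607---Project-1/master/tournamentinfo.txt"
chess_raw <- read_table(file_path)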

## Parsed with column specification:
## cols(
##   `-----------------------------------------------------------------------------------------` = col_character()
## )

Clean Data

Below we clean the chess tournament data imported from the text file. Entire lines of dashes separate the records, and each player’s record spans two rows, so we need to remove the dashed lines and merge each pair of rows into a single line. We also drop the columns that are not needed and rename the rest.
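As a minimal sketch of the merging idea, the snippet below works on four lines copied from the raw file instead of the full download. It uses base R only (`grepl`, `paste0`, `strsplit`, `trimws`) in place of the stringr calls used in this report, and the variable names (`raw_lines`, `records`, `fields`) are illustrative.

```r
# Four lines as they appear after import: a dashed separator, the two rows
# that together describe one player, and another separator.
raw_lines <- c(
  "-----------------------------------------------------------------------------------------",
  "1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|",
  "ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |",
  "-----------------------------------------------------------------------------------------"
)

# Drop lines made up only of dashes.
data_lines <- raw_lines[!grepl("^-+$", raw_lines)]

# Odd positions hold the first row of each player, even positions the second.
first_rows  <- data_lines[c(TRUE, FALSE)]
second_rows <- data_lines[c(FALSE, TRUE)]
records <- paste0(first_rows, second_rows)

# Split the merged record on "|" and trim the padding around every field.
fields <- trimws(strsplit(records, "|", fixed = TRUE)[[1]])

# Fields 1, 2, 3 and 11 are the player number, name, total points and state.
print(fields[c(1, 2, 3, 11)])
```

Unlike the report's code, this sketch only drops lines that consist entirely of dashes, so the "->" between the two ratings is left intact.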

chess <- unlist(chess_raw)
head(chess)

##  -----------------------------------------------------------------------------------------1 
##  "Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round|" 
##  -----------------------------------------------------------------------------------------2 
##  "Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  |" 
##  -----------------------------------------------------------------------------------------3 
## "-----------------------------------------------------------------------------------------" 
##  -----------------------------------------------------------------------------------------4 
##     "1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
##  -----------------------------------------------------------------------------------------5 
##    "ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
##  -----------------------------------------------------------------------------------------6 
## "-----------------------------------------------------------------------------------------"

#Remove all dashes; lines that were only dashes become empty strings and are dropped
#(this also turns the "->" between the two ratings into ">")
chess <- str_replace_all(chess,"-","")
chess <- chess[chess != ""]

#Combine the lines that have been separated
line_numbers <- seq_along(chess)
odd_lines_numbers <- line_numbers[line_numbers%%2 == 1]
even_line_numbers <- line_numbers[line_numbers%%2 == 0]
chess_data <- str_c(chess[odd_lines_numbers],chess[even_line_numbers])

#Additional cleaning & renaming columns
chess_data <- str_replace_all(chess_data,"[\\|]",",")
chess_data <- str_split(chess_data,",")
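#One row per merged record (the header plus one per player) instead of a hard-coded count
chess_df <- as.data.frame(matrix(unlist(chess_data), nrow=length(chess_data), byrow=TRUE))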

#Subset & rename only columns that are needed
chess_df_subset <- chess_df[,c(1:13)]
chess_df_subset <- chess_df_subset[-1,]
colnames(chess_df_subset) <- c("Player_ID","PlayerName","Total Points","Opponent1","Opponent2","Opponent3","Opponent4","Opponent5","Opponent6","Opponent7","State","USCF ID / Rtg (Pre>Post)","Points_Total")

#Clean up data types
chess_df_subset$Player_ID <- as.numeric(as.character(chess_df_subset$Player_ID))

Player’s Pre & Post Tournament Ratings

In the raw data, each player has a pre-tournament rating and a post-tournament rating. The first number after “R:” is the pre-tournament rating, and the number after “>” is the post-tournament rating.

	Player_ID	USCF ID / Rtg (Pre>Post)
2	1	15445895 / R: 1794 >1817
3	2	14598900 / R: 1553 >1663
4	3	14959604 / R: 1384 >1640
5	4	12616049 / R: 1716 >1744
6	5	14601533 / R: 1655 >1690
7	6	15055204 / R: 1686 >1687

The information is all contained in a single string and must be parsed out using regular expressions.
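To illustrate the kind of pattern involved, here is a small base R sketch that pulls both ratings out of a rating string in one step with capture groups. It is an alternative to the separate-then-clean approach used in this report, not a description of it. The first string is copied from the raw file (where the arrow is still "->"); the second, including its USCF ID and its "P17"/"P24" suffixes, is a hypothetical example of a rating followed by a "P" and digits.

```r
rating_strings <- c(
  "15445895 / R: 1794   ->1817",     # from the raw file
  "12345678 / R: 1641P17->1657P24"   # hypothetical, with "P<digits>" suffixes
)

# Capture the digits right after "R:" (pre) and right after "->" (post).
# Anything between the pre-rating and the arrow, such as "P17", is skipped.
pattern <- "R:\\s*(\\d+).*->\\s*(\\d+)"
parts <- regmatches(rating_strings, regexec(pattern, rating_strings, perl = TRUE))

# Element 1 of each match is the full match; 2 and 3 are the capture groups.
rating_pre  <- as.numeric(sapply(parts, `[`, 2))
rating_post <- as.numeric(sapply(parts, `[`, 3))
print(data.frame(rating_pre, rating_post))
```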

chess_avg_player <- chess_df_subset[,c('Player_ID','USCF ID / Rtg (Pre>Post)')]
chess_avg_player <- chess_avg_player %>% separate(`USCF ID / Rtg (Pre>Post)`,c("USCF_ID","Rtg(Pre>Post)"),"/")
chess_avg_player <- chess_avg_player %>% separate('Rtg(Pre>Post)',c("Rating_Pre","Rating_Post"),">")

#Clean up player pre ratings: drop whitespace and the "R:" label, then any trailing "P<digits>" suffix
chess_avg_player$Rating_Pre <- str_replace_all(chess_avg_player$Rating_Pre,"[\\sR:]","")
chess_avg_player$Rating_Pre <- str_replace_all(chess_avg_player$Rating_Pre,"P\\d+$","")
chess_avg_player$Rating_Pre <- as.numeric(chess_avg_player$Rating_Pre)

#Clean up player post ratings: drop whitespace, then any trailing "P<digits>" suffix
chess_avg_player$Rating_Post <- str_replace_all(chess_avg_player$Rating_Post,"\\s+","")
chess_avg_player$Rating_Post <- str_replace_all(chess_avg_player$Rating_Post,"P\\d+$","")
chess_avg_player$Rating_Post <- as.numeric(chess_avg_player$Rating_Post)

Clean Opponent Match Information

We must remove information that doesn’t pertain to our analysis. The goal is to tidy the data so that, for each player, we can see which opponent they faced in each round. We pivot the data into a long format so that joining with the player data later is easier.
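The wide-to-long reshaping can be sketched on a toy table in base R. This is only an illustration of the idea, assuming two round columns instead of seven. The first row uses values from the raw data, the second row is hypothetical and includes a round with no opponent, and the names `wide`, `round_cols` and `long` are made up for the example.

```r
# Toy wide table: one row per player, one column per round.
wide <- data.frame(
  Player_ID = c(1, 2),
  Opponent1 = c("W  39", "W  63"),
  Opponent2 = c("W  21", "H    "),   # "H    " has no opponent number
  stringsAsFactors = FALSE
)

round_cols <- c("Opponent1", "Opponent2")

# Keep only the digits, so "W  39" becomes "39" and "H    " becomes "".
wide[round_cols] <- lapply(wide[round_cols], function(x) gsub("[^0-9]", "", x))

# Stack the round columns: one row per player and round.
# as.numeric("") is NA, which marks a round without an opponent.
long <- data.frame(
  Player_ID   = rep(wide$Player_ID, times = length(round_cols)),
  Opponent_no = rep(round_cols, each = nrow(wide)),
  Opponent_ID = as.numeric(unlist(wide[round_cols], use.names = FALSE)),
  stringsAsFactors = FALSE
)
long <- long[order(long$Player_ID, long$Opponent_no), ]
print(long)
```

In the long table every row is a single (player, round) pair, which is the shape a join on the opponent's ID needs.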

#Subset data to get opponent player IDs
chess_opponents <- chess_df_subset[,c(1,4:11)]

#Strip out the result codes (W, L, D, B, H, U, X) and whitespace, leaving only the opponent ID
opponent_cols <- c("Opponent1","Opponent2","Opponent3","Opponent4","Opponent5","Opponent6","Opponent7")
for (col in opponent_cols) {
  chess_opponents[[col]] <- str_replace_all(chess_opponents[[col]],"[\\sWLDBHUX]","")
}

chess_opponents_long <- chess_opponents %>% pivot_longer(cols = starts_with("Opponent"),names_to="Opponent_no",values_to="Opponent_ID")
chess_opponents_long$Opponent_ID <- as.numeric(as.character(chess_opponents_long$Opponent_ID))

library(knitr) #kable()
library(kableExtra) #kable_styling()
kable(head(chess_opponents_long)) %>%
  kable_styling(bootstrap_options = c("striped","hover"),full_width=F)

Player_ID	State	Opponent_no	Opponent_ID
1	ON	Opponent1	39
1	ON	Opponent2	21
1	ON	Opponent3	18
1	ON	Opponent4	14
1	ON	Opponent5	7
1	ON	Opponent6	12

Average Opponent Ratings

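We then join each opponent’s pre-tournament rating to the long table, group the rows by player, and average the opponents’ ratings. Rounds with no opponent have no match in the join and are left out of the average.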
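The join-then-average step can be checked by hand on a very small example. The sketch below uses base R (`merge` and `aggregate`) instead of the dplyr verbs in this report. The three pre-ratings are those of players 1 to 3 shown earlier, while the pairings are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
# Pre-tournament ratings of players 1 to 3.
players <- data.frame(
  Player_ID  = c(1, 2, 3),
  Rating_Pre = c(1794, 1553, 1384)
)

# Hypothetical pairings in long format; NA marks a round with no opponent.
pairings <- data.frame(
  Player_ID   = c(1, 1, 2, 2, 3, 3),
  Opponent_ID = c(2, 3, 1, NA, 1, 2)
)

# Inner join: attach each opponent's pre-rating. The NA row has no matching
# Player_ID in `players`, so it is dropped and does not affect the mean.
combined <- merge(pairings, players,
                  by.x = "Opponent_ID", by.y = "Player_ID")

# Average the opponents' pre-ratings per player.
ratings_demo <- aggregate(Rating_Pre ~ Player_ID, data = combined, FUN = mean)
names(ratings_demo)[2] <- "Avg_Opponent_Pre_Rating"
print(ratings_demo)
```

Player 1 faced players 2 and 3 here, so the average is (1553 + 1384) / 2 = 1468.5. Player 2 has only one rated opponent, so the average is simply that opponent's rating, 1794.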

combined <- inner_join(chess_opponents_long,chess_avg_player,by= c("Opponent_ID" = "Player_ID"))
ratings <- combined %>% group_by(Player_ID) %>% summarise(Avg_Opponent_Pre_Rating = mean(Rating_Pre))
ratings$Player_ID <- as.numeric(as.character(ratings$Player_ID))

kable(head(ratings)) %>%
  kable_styling(bootstrap_options = c("striped","hover"),full_width=F)

Player_ID	Avg_Opponent_Pre_Rating
1	1605.286
2	1469.286
3	1563.571
4	1573.571
5	1500.857
6	1518.714

Combining Everything Together

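#Player_ID, PlayerName, Total Points and State, plus each player's own pre-rating
individual_player <- left_join(chess_df_subset[,c(1,2,3,11)],chess_avg_player[,c(1,3)],by = "Player_ID")
#Add the average opponent pre-rating
final <- left_join(individual_player,ratings,by = "Player_ID")
#Reorder to PlayerName, State, Total Points, Rating_Pre, Avg_Opponent_Pre_Rating
final <- final[,c(2,4,3,5,6)]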

Now that we have the individual player information:

Player_ID	PlayerName	Total Points	State	Rating_Pre
1	GARY HUA	6.0	ON	1794
2	DAKSHESH DARURI	6.0	MI	1553
3	ADITYA BAJAJ	6.0	MI	1384
4	PATRICK H SCHILLING	5.5	MI	1716
5	HANSHI ZUO	5.5	MI	1655
6	HANSEN SONG	5.0	OH	1686

Along with the average pre-tournament rating of each player’s opponents:

Player_ID	Avg_Opponent_Pre_Rating
1	1605.286
2	1469.286
3	1563.571
4	1573.571
5	1500.857
6	1518.714

We can combine them, joining on Player_ID, to get the final results and export them to the CSV file “Chess_Tournament_Results.csv”.

PlayerName	State	Total Points	Rating_Pre	Avg_Opponent_Pre_Rating
GARY HUA	ON	6.0	1794	1605.286
DAKSHESH DARURI	MI	6.0	1553	1469.286
ADITYA BAJAJ	MI	6.0	1384	1563.571
PATRICK H SCHILLING	MI	5.5	1716	1573.571
HANSHI ZUO	MI	5.5	1655	1500.857
HANSEN SONG	OH	5.0	1686	1518.714

#row.names = FALSE keeps R's automatic row numbers out of the file
write.csv(final,'Chess_Tournament_Results.csv',row.names = FALSE)
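The sample output in the Overview shows the opponent average as a whole number (1605), whereas the table above keeps three decimals (1605.286). If a rounded value is wanted in the file, one possible approach is sketched below on the first two result rows. It writes to a temporary file instead of the report's CSV, and the data frame name `final_demo` and the column name `Total_Points` are assumptions made for the example.

```r
# First two rows of the final table shown above.
final_demo <- data.frame(
  PlayerName = c("GARY HUA", "DAKSHESH DARURI"),
  State = c("ON", "MI"),
  Total_Points = c(6.0, 6.0),
  Rating_Pre = c(1794, 1553),
  Avg_Opponent_Pre_Rating = c(1605.286, 1469.286)
)

# Round the average to a whole number.
final_demo$Avg_Opponent_Pre_Rating <- round(final_demo$Avg_Opponent_Pre_Rating)

# Write without the automatic row-name column, then read the file back
# to confirm the columns and values that ended up on disk.
out_path <- tempfile(fileext = ".csv")
write.csv(final_demo, out_path, row.names = FALSE)
check <- read.csv(out_path)
print(check)
```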
