Given a text file containing chess tournament results in the following format, we import the results into R and tidy the data into a structured format. As shown below, the data is semi-structured and must be tidied.
#import tournament data from my github repository
tournamentraw <- read_lines(file="https://raw.githubusercontent.com/mkivenson/Data-Acquisition-and-Management/master/Project%201/tournamentinfo.txt")
head(tournamentraw, n=15)
## [1] "-----------------------------------------------------------------------------------------"
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [4] "-----------------------------------------------------------------------------------------"
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [7] "-----------------------------------------------------------------------------------------"
## [8] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [9] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [10] "-----------------------------------------------------------------------------------------"
## [11] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [12] " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
## [13] "-----------------------------------------------------------------------------------------"
## [14] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|"
## [15] " MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |"
In order to structure the data, the following packages are used.
In this section, we remove all text separators and separate strings into columns using text length. This data is still poorly structured and includes white space.
#remove all text separators
tournamentraw <- str_remove_all(tournamentraw,pattern = "(-){2,}|\\|")
#separate string into columns
tournamentraw <- read_fwf(tournamentraw,fwf_widths(c(5,33,6,5,5,5,5,5,5,5), c("pair","playername","total","round1","round2","round3","round4","round5","round6","round7")))
tournamentraw <- tournamentraw[-c(0:2),]
datatable(as.data.frame(tournamentraw))
We create a new dataframe for each of the 64 players and 6 fields that are required. The fields of interest are:
tournament <- data.frame(matrix(vector(), 64, 6,
dimnames=list(c(), c("number",
"name",
"state",
"totalpoints",
"prerating",
"opponent"))),
stringsAsFactors=F)
We can use various regular expressions to populate player information fields. In order to avoid issues that may be caused by NA values, the function na.omit was used to exclude null values.
#Player’s Number
tournament$number <- grep("([0-9])",
tournamentraw$pair,
value = TRUE)
#Player’s Name
tournament$name <- grep("([A-Z]+\\s){1,}",
tournamentraw$playername,
value = TRUE)
#Player’s State
tournament$state <- grep("([A-Z])",
tournamentraw$pair,
value = TRUE)
#Total Number of Points
tournament$totalpoints <- as.numeric(grep("(\\d.\\d)",
tournamentraw$total,
value = TRUE))
#Player’s Pre-rating
tournament$prerating <- na.omit(str_extract(tournamentraw$playername,
"\\d+(?=\\s*->)|\\d+(?=P\\d{2}\\s*->)"))
datatable(tournament)
While the previous vectors looked at only player informtion from one column at a time, the average of the opponent ratings require more effort. We extract ratings from seven rounds, match them to the correct opponents, and apply a mean function.
#Extract opponent numbers from each round
odd_rows <- seq(1,128,2)
matches <- data.frame(matrix(vector(), 64, 9,
dimnames=list(c(),
c("person","prerating","r1","r2","r3","r4","r5","r6","r7"))),
stringsAsFactors=F)
#Create a new dataframe for opponent data
matches$person <- tournament$number
matches$prerating <- tournament$prerating
matches$r1 <- str_extract(tournamentraw$round1[odd_rows], "\\d{1,2}")
matches$r2 <- str_extract(tournamentraw$round2[odd_rows], "\\d{1,2}")
matches$r3 <- str_extract(tournamentraw$round3[odd_rows], "\\d{1,2}")
matches$r4 <- str_extract(tournamentraw$round4[odd_rows], "\\d{1,2}")
matches$r5 <- str_extract(tournamentraw$round5[odd_rows], "\\d{1,2}")
matches$r6 <- str_extract(tournamentraw$round6[odd_rows], "\\d{1,2}")
matches$r7 <- str_extract(tournamentraw$round7[odd_rows], "\\d{1,2}")
#Expand matches data, then join matches and tournament data to return opponent pre-ratings
matches <- gather(matches,"round","number",c(r1,r2,r3,r4,r5,r6,r7))
matches <- left_join(x = matches, y = tournament, by = "number")
matches <- subset(matches, select=c(person,number,prerating.y))
matches$prerating.y <- as.numeric(matches$prerating.y)
#Create a function that calculates mean and ignores NA values
meanfunction <- function(x) {
y <- mean(x,na.rm=TRUE)
return(y)
}
#Add the average opponent pre-ratings to the tournament table
tournament$opponent <- aggregate(matches$prerating.y,list(matches$person), FUN = meanfunction)[,2]
datatable(tournament)
#write.csv
write.csv(tournament, file = 'tournamentoutput.csv', row.names = FALSE)
Is there a relationship between pre-ratings of players and total points?
For the plot below, we create a new variable: the pre-rating ratio. This ratio determines the pre-rating of a player to the average pre-ratings of their opponents. Ratios close to 1 indicate that players and their opponents are generally evenly matched, while ratios below 1 indicate opponents are more skilled and ratios above 1 indicate a player is more skilled than their opponents. Box plots were created for the pre-rating ratio, grouped by total points.
According to this plot, the total points attained and the ratio of pre-ratings have a positive relationship, except at the lower and higher point values.
# create ratio variable
tournament$ratio <- as.numeric(tournament$prerating) / as.numeric(tournament$opponent)
# create scatter plot
ggplot(tournament, aes(totalpoints, ratio, fill = totalpoints)) +
geom_boxplot(aes(group = totalpoints)) +
geom_smooth(method = "loess", se=FALSE, color="blue", aes(group=1)) +
scale_fill_gradient(low="red", high="green") +
ggtitle("Distribution of Pre-Rating Ratios by Total Points") +
xlab("Total Points") +
ylab("Ratio of Pre-Ratings")