Project 1

Chess Tournament Text Conversion

library(tidyverse)

This project uses chess tournament data. The main objective is to take the raw text file provided and convert it into a spreadsheet. Along the way, certain variables must be extracted and used in calculations that summarize each player’s results.

Directions

In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:

Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents

For the first player, the information would be: Gary Hua, ON, 6.0, 1794, 1605

1605 was calculated by summing the pre-tournament ratings of the player’s opponents (1436, 1563, 1600, 1610, 1649, 1663, and 1716) and dividing by the total number of games played.
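As a quick check of that arithmetic, the short sketch below reproduces the figure for the first player. The variable names are illustrative, and rounding to a whole number is an assumption made to match the value quoted in the directions.

```r
# Pre-tournament ratings of the first player's seven opponents (from the directions)
opponent_ratings <- c(1436, 1563, 1600, 1610, 1649, 1663, 1716)

# Sum the ratings and divide by the number of games played
average_rating <- sum(opponent_ratings) / length(opponent_ratings)

# Round to the nearest whole rating point (assumed, to match the quoted 1605)
rounded_average <- round(average_rating)
print(rounded_average)  # 1605
```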

Background

This data comes from a chess tournament and contains ratings from the United States Chess Federation. It has columns for the player’s name, their pair number, their total points in the tournament, and rounds one through seven. It also shows the state (or province, if in Canada) the player is from and their wins, losses, draws, and other textual information. Every player has a unique pair number, and each round column records the pair number of the opponent that player faced in that round. There are 64 players total.

Methods and Analysis

Data was posted to GitHub for ease of access and reproducibility. To better manage this project, the workload was broken into three steps. First, the raw text was transformed into a usable format. Second, the player calculations were performed. Lastly, the information was synthesized into a data frame. The final export of the output as a spreadsheet (.csv) should be adapted by the user so that the file is written where they need it.

Transforming Text Data

The following steps were taken to transform the data into a format usable for calculations:

Create two variables
- One for raw text
- One for the strings
Remove the hyphens surrounding each player’s data
Remove unnecessary text from each player’s information
Split the text into separate strings, one per field
Store the results as a new data frame

Though the raw text variable was overwritten, it was used as a reference at the end of these transformations. To get the strings mostly clean, the for loop in the chunk below joins each player’s two lines of text into one string, which is then split on the comma delimiters. This is the chunk for step 1.

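# Read the raw tournament text, one element per line
tournamentinfo <- read_lines("https://raw.githubusercontent.com/palmorezm/msdsdata607/master/tournamentinfo.txt")
# Drop the rows of hyphens that separate the players
hyphens <- str_detect(tournamentinfo, '^[-]{2,}$')
tournamentinfo <- tournamentinfo[!hyphens]
# Strip the result codes (W, L, B, D) that directly follow a pipe,
# so that the same letters inside player names are left intact
tournamentinfo <- str_remove_all(tournamentinfo, "(?<=\\|)[WLBD](?=\\s)")
# Use commas as the field delimiter
tournamentinfo <- str_replace_all(tournamentinfo, "[|/]", ",")
# Join each player's two lines into a single string
empty <- c("")
for (i in seq(1, length(tournamentinfo) - 1, by = 2)) {
  empty <- c(empty, paste(tournamentinfo[i], tournamentinfo[i + 1], sep = ""))
}
# Split on commas and bind the pieces into a data frame
TournamentResults <- as.data.frame(do.call(rbind, strsplit(empty, ",")), stringsAsFactors = FALSE)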

## Warning in (function (..., deparse.level = 1) : number of columns of result is
## not a multiple of vector length (arg 3)
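As an illustration of the join-and-split step, the sketch below runs the same idea on a single made-up record in base R. The two sample lines and the player name are hypothetical and only imitate the layout of the file. The lookbehind pattern is one assumed way to strip result codes without touching letters inside names.

```r
# Two made-up lines laid out like one player's record (hypothetical data)
record <- c(
  "    1 | JANE DOE                        |6.0  |W  39|W  21|D   4|",
  "   ON | 12345678 / R: 1794   ->1817     |N:2  |W    |B    |W    |"
)

# Remove a result code only when it directly follows a pipe
record <- gsub("(?<=\\|)[WLBD](?=\\s)", "", record, perl = TRUE)

# Turn pipes and slashes into commas so there is one delimiter
record <- gsub("[|/]", ",", record)

# Join the two lines, split on commas, and trim the padding
fields <- trimws(strsplit(paste(record, collapse = ""), ",")[[1]])
print(fields)
```

In this sketch the name survives untouched, the round cells are left holding only the opponents’ pair numbers, and the state appears as the first field that came from the second line.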

Player Calculations

From the directions, the average pre-rating of each player’s opponents must be calculated. This was done by looking up the pair number of each opponent and replacing it with that opponent’s pre-rating. For every player, the round columns then contained the pre-ratings of the opponents they played.

However, the data was not clean enough to perform this calculation outright; the pre-ratings first had to be isolated as numbers. A new vector of pre-ratings was created by first extracting the text between the characters “R:” and “->”. Characters were then stripped away until only the numerical pre-rating of each player remained.

# str_match_all(tournamentinfo, "R:\\s*(.*?)\\s*->")
Preratings <- str_extract_all(tournamentinfo, "R:\\d*(.*?)\\d*->", simplify = TRUE)
Preratings <- str_remove_all(Preratings, "R:")
Preratings <- str_remove_all(Preratings, "->")
Preratings <- str_remove_all(Preratings, "P\\d{2}")
Preratings <- str_remove_all(Preratings, "P\\d")
Preratings <- str_match_all(Preratings, "\\d+")
Preratings <- str_extract_all(Preratings, "\\d+", simplify = TRUE)

## Warning in stri_extract_all_regex(string, pattern, simplify = simplify, :
## argument is not an atomic vector; coercing

stuff <- unlist(Preratings)
stuff <- as.numeric(as.character(stuff))
is.numeric(stuff)

## [1] TRUE

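# Lines without a rating come through as 0; mark them as missing directly
# instead of subtracting 1, which would lower every rating by one point
stuff[which(stuff == 0)] <- NA
stuff <- na.omit(stuff)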
Preratings <- stuff 
Preratings <- as.data.frame(Preratings)
Preratings <- as.numeric(as.character(unlist(Preratings[[1]])))
Preratings <- as.data.frame(Preratings)
Preratings <- na.omit(Preratings)
# view(Preratings)
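The same extraction can also be sketched in a more compact form with base R. This is an alternative under stated assumptions, not the method used above: the sample lines are made up, and the pattern assumes each pre-rating is the first run of digits after “R:”, so a provisional suffix such as “P17” is simply never captured.

```r
# Made-up lines: two rating lines and one line without a rating (hypothetical data)
sample_lines <- c(
  "   ON , 12345678 , R: 1794   ->1817     ,N:2",
  "   MI , 87654321 , R: 1641P17->1657P24  ,N:3",
  "    1 , JANE DOE                        ,6.0"
)

# Keep only the matches of "R:" followed by optional spaces and digits;
# lines without a match are dropped by regmatches()
rating_text <- regmatches(sample_lines, regexpr("R:\\s*\\d+", sample_lines, perl = TRUE))

# Strip the "R:" prefix and convert what is left to numbers
pre_ratings <- as.numeric(sub("R:\\s*", "", rating_text, perl = TRUE))
print(pre_ratings)
```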

With a data frame containing numerical vectors of each player’s pre-rating, the data frame was then combined with the tournament results to make the players’ average opponent rating calculations possible.

Synthesis

When the text was split into strings, the top row with the column names was kept as a reference to the raw text file. It was removed before combining the data frames to prevent any mismatch between a player and their own or their opponents’ pre-ratings.

TournamentResults <- TournamentResults[2:65, ]
TournamentResults <- cbind.data.frame(TournamentResults, Preratings)

As mentioned, each round contained the unique pair number of the player’s opponent. For example, Gary Hua, the first player, was paired against player 39, Joel R. Hendon, in his first round. This pattern continued for all rounds. Before the pair numbers were used as a lookup reference, every round column was converted to a numeric vector. The results were placed in a subset of the original data frame TournamentResults.

Round1 <- as.numeric(as.character(unlist(TournamentResults$V4)))
Round2 <- as.numeric(as.character(unlist(TournamentResults$V5)))
Round3 <- as.numeric(as.character(unlist(TournamentResults$V6)))
Round4 <- as.numeric(as.character(unlist(TournamentResults$V7)))
Round5 <- as.numeric(as.character(unlist(TournamentResults$V8)))
Round6 <- as.numeric(as.character(unlist(TournamentResults$V9)))
Round7 <- as.numeric(as.character(unlist(TournamentResults$V10)))
TournamentResults <- subset(TournamentResults, select = c(
  "V1",
  "V2",
  "V3",
  "V11",
  "Preratings")
  )
TournamentResults <- cbind(TournamentResults, Round1, Round2, Round3, Round4, Round5, Round6, Round7)

# Assigning new column names
colnames(TournamentResults) <- c(
  "Pair",
  "Name",
  "Total",
  "State",
  "Preratings",
  "Round1",
  "Round2",
  "Round3",
  "Round4",
  "Round5",
  "Round6",
  "Round7"
  )

New column names were then assigned to each variable to better represent the data frame. A sample of the results can be seen here:

TournamentResults[1:6, 1:4]

This result is not quite finalized. It contains only the information needed for the average opponent calculation, plus the other fields kept for export in the final spreadsheet.

In the calculation, two loops were used. The first replaces each opponent’s pair number in the round columns with that opponent’s pre-rating. The second averages those pre-ratings across each player’s row, skipping rounds with no opponent, to give the mean opponent rating.

for (i in 1:64) {
  TournamentResults$Round1[i] <- TournamentResults$Preratings[TournamentResults$Round1[i]]
  TournamentResults$Round2[i] <- TournamentResults$Preratings[TournamentResults$Round2[i]]
  TournamentResults$Round3[i] <- TournamentResults$Preratings[TournamentResults$Round3[i]]
  TournamentResults$Round4[i] <- TournamentResults$Preratings[TournamentResults$Round4[i]]
  TournamentResults$Round5[i] <- TournamentResults$Preratings[TournamentResults$Round5[i]]
  TournamentResults$Round6[i] <- TournamentResults$Preratings[TournamentResults$Round6[i]]
  TournamentResults$Round7[i] <- TournamentResults$Preratings[TournamentResults$Round7[i]]
}
for (i in 1:64) {
  TournamentResults$Means[i] <- rowMeans(TournamentResults[i, 6:12], na.rm = TRUE)
}
TournamentResults$MeanOpponentRating <- TournamentResults$Means
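The lookup-then-average idea can also be written without explicit loops. The sketch below is a minimal vectorized variant on a made-up four-player, three-round table; the data frame `results` and its values are hypothetical, and it assumes, as above, that pair numbers match row positions.

```r
# Made-up table: pair numbers, pre-ratings, and opponents' pair numbers per round
results <- data.frame(
  Pair       = 1:4,
  Preratings = c(1800, 1500, 1650, 1400),
  Round1     = c(2, 1, 4, 3),
  Round2     = c(3, 4, 1, 2),
  Round3     = c(NA, 3, 2, NA)  # NA marks a round with no opponent
)
round_cols <- c("Round1", "Round2", "Round3")

# Replace each opponent's pair number with that opponent's pre-rating;
# indexing by NA yields NA, so unplayed rounds stay missing
opponent_ratings <- sapply(results[round_cols], function(pairs) results$Preratings[pairs])

# Average across each row, ignoring the unplayed rounds
results$MeanOpponentRating <- rowMeans(opponent_ratings, na.rm = TRUE)
print(results[, c("Pair", "MeanOpponentRating")])
```

Because the lookup always reads from the unchanged Preratings column, the order in which rounds are processed does not matter.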

Lastly, only the columns relevant to the project were selected, including the average opponent rating in the column titled MeanOpponentRating.

TournamentResults <- subset(TournamentResults, select = c(
  "Pair",
  "Name",
  "Total",
  "State",
  "Preratings",
  "MeanOpponentRating"
))

Conclusion

After the analysis we were able to create a spreadsheet with the following variables for each player:

Name	State	Total	Pre-rating	Avg-Rate
Gary Hua	ON	6.0	1794	1605

The first five players and their statistics are shown below. The data frame contains each player’s pair number, name, total points, state, pre-rating, and the average pre-rating of their opponents.

# data frame with first 5 players stats selected
TournamentResults[1:5,1:6]

The data frame can be converted to a spreadsheet (specifically a “.csv”) with the following chunk.

# Converting to a .csv from data frame
write.csv(TournamentResults, "C:\\Users\\Owner\\Desktop\\TournamentResults.csv", row.names = FALSE)

Currently, the data is exported to my desktop. The path is specific to this machine, but the file can be written to any folder by adjusting the file path. Alternatively, the same data frame could be used to export the data to a SQL database.
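One way to avoid a machine-specific path is to build it from a directory that exists on any system. The sketch below is only an illustration: the one-row data frame is made up, and writing to the session’s temporary directory is an assumption chosen so the example runs anywhere; a project folder could be substituted.

```r
# Made-up one-row data frame standing in for the final results (hypothetical data)
results <- data.frame(
  Name  = "Jane Doe",
  State = "ON",
  Total = 6.0,
  stringsAsFactors = FALSE
)

# Build the output path from a directory that exists on every machine
out_path <- file.path(tempdir(), "TournamentResults.csv")

# Write the file without row names, then read it back to confirm the contents
write.csv(results, out_path, row.names = FALSE)
check <- read.csv(out_path, stringsAsFactors = FALSE)
print(check)
```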