CUNY MSDS Data 607 Project #1
Overview
In this project, there was a text file with chess tournament results where the information has some structure. This R Markdown generates a .CSV file with the following information for all of the players:
| Player’s Name | Player’s State | Total Number of Points | Player’s Pre-Rating | Average Pre-Chess Rating of Opponents |
|---|
The Challenge
For this project, I made it my challenge to only utilize functions found in the very useful stringr package since I don’t work with text file often. I wanted to learn about this package, and this project was a great way to do this.
library(stringr)The Text
The text file was stored on my GitHub. The following retrieves this specific text file:
theURL <- "https://raw.githubusercontent.com/greeneyefirefly/Data607/master/Projects/Project%201/playerdata.txt"
data <- file(theURL, open="r")
playerresult <- readLines(data)
# A preview of the chess tournament text file
head(playerresult)## [1] "-----------------------------------------------------------------------------------------"
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [4] "-----------------------------------------------------------------------------------------"
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
Cleaning the text
1st: Separators
There are a few formats of the layout in this text file that needs to be removed because it is difficult to read in R for calculations. These include:
- The repeated dashes at the end of every two rows to separate each players information about the chess tournament.
- The pipelines and forward slashes acting as separators for the variables and scores, respectively.
Therefore, these bits of layout were remove/replace in order to allow R to read the text file easily.
# Identifying where the dashes are located
dash <- str_detect(playerresult, '^[-]{2,}$')
# Remove these rows so that there is nothing separating one player from the other
playerresult <- playerresult[!dash == "TRUE"]
# Remove/Replace the unnecessary indications of win, draw or lose, pipelines and forward slashes
## removed W, D, & L
playerresult <- str_remove_all(playerresult, "[WDL]")
## replace pipelines and slashes with commas so it can later be transfromed into a dataframe
playerresult <- str_replace_all(playerresult, "[|/]",",") 2nd: Combine
Moreover, these separators needed to be firstly removed in order to combine the two rows of player’s information into one row. This function transforms the two rows into one.
# Combine the two rows for each player
fnew <- c("")
for (i in seq(1, length(playerresult)-1, by = 2)){
fnew <- c(fnew, paste(playerresult[i], playerresult[i+1], sep = "", collapse = NULL))
}Text to Data Frame
Transformations
Now that the text has been formatted with information that is workable, a data frame can be easily created with the comma separator which was added. Therefore in these next few steps, the data frame is put through some rigid visual and data class transformations.
# Creating the dataframe
ChessTourn <- as.data.frame(do.call(rbind, strsplit(fnew, ",")), stringsAsFactors = FALSE)
# Adding the column names which are in the 1st row, and removing the name row from the dataframe
names(ChessTourn) <- unlist(ChessTourn[1,])
ChessTourn = ChessTourn[-1,]
# Renaming and removing some columns
colnames(ChessTourn)[11] <- c("State")
colnames(ChessTourn)[4:10] <- c("P1","P2","P3","P4","P5","P6","P7") # The opponents' number
rownames(ChessTourn) <- 1:nrow(ChessTourn)
ChessTourn[12] <- list(NULL) # Removing the USCFI numbers as they are not needed
colnames(ChessTourn)[12] <- c("PreRating")
ChessTourn[c(1,13:ncol(ChessTourn))] <- list(NULL) # Removing the other unnecessary columns
# Keeping the pre-rating scores for calculations later
ChessTourn$PreRating <- str_sub(ChessTourn$PreRating, 5, 8)
# Converting to number for calculations later
ChessTourn[c(2:9,11)] <- sapply((ChessTourn)[c(2:9,11)], as.character)
ChessTourn[c(2:9,11)] <- sapply((ChessTourn)[c(2:9,11)], as.numeric)
# Removing spaces from players name and States
ChessTourn[c(1,10)] <- sapply(as.vector((ChessTourn)[c(1,10)]), str_trim)
# Change NA values to zero for calculations later
ChessTourn[is.na(ChessTourn)] <- 0 Pièce De Résistance
After all these transformation, the data frame now looks like:
head(ChessTourn)## Player Name Total P1 P2 P3 P4 P5 P6 P7 State
## 1 GARY HUA 6.0 39 21 18 14 7 12 4 ON
## 2 AKSHESH ARURI 6.0 63 58 4 17 16 20 7 MI
## 3 AITYA BAJAJ 6.0 8 61 25 21 11 13 12 MI
## 4 PATRICK H SCHIING 5.5 23 28 2 26 5 19 1 MI
## 5 HANSHI ZUO 5.5 45 37 12 13 4 14 17 MI
## 6 HANSEN SONG 5.0 34 29 11 35 10 27 21 OH
## PreRating
## 1 1794
## 2 1553
## 3 1384
## 4 1716
## 5 1655
## 6 1686
BUT it is not over as yet!
Calculations
The pre-rating average for each player can be determined based on the their opponent’s pre-rating. The average rating per player is found by using the pre-tournament opponents’ ratings and dividing by the total number of games played.
# Initializing some vectors for calcuations
TotalRating <- numeric(0)
NumOppt <- numeric(0)
PreRatingAvg <- vector()
# Pre-rating average
for (i in 1:length(ChessTourn$PreRating)){
players <- as.numeric(as.vector(ChessTourn[i,3:9]))
TotalRating <- sum(ChessTourn[players, "PreRating"])
NumOppt <- sum(ChessTourn[i,c(3:9)]!=0)
PreRatingAvg[i] <- round(TotalRating / NumOppt, digits = 0)
}
ChessTourn$PreRatingAvg <- PreRatingAvgFinal Output
The final output is a csv file in the format that was needed.
ChessTourn <- ChessTourn[c(1,10,2,11,12)]
head(ChessTourn)## Player Name State Total PreRating PreRatingAvg
## 1 GARY HUA ON 6.0 1794 1605
## 2 AKSHESH ARURI MI 6.0 1553 1469
## 3 AITYA BAJAJ MI 6.0 1384 1564
## 4 PATRICK H SCHIING MI 5.5 1716 1574
## 5 HANSHI ZUO MI 5.5 1655 1501
## 6 HANSEN SONG OH 5.0 1686 1519
write.csv(ChessTourn, file="ChessTournament.csv")Works Cited
- Wickham, H. (2017). R: Package
stringr. Retrieved February 12, 2019, from https://cran.r-project.org/web/packages/stringr/stringr.pdf