DATA 607 Project 1


Group Name: SPARK

Github

All the project work and graphics are availble under github Project Page.


Problem Statement

In this project, we’re given a text file with chess tournament results where the information has some structure.The task is to generate a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:

Player’s Name Player’s State Total Number of Points Player’s Pre-Rating Average Pre-Chess Rating of Opponents
GARY HUA ON 6.0 1794 1605

Data in Text File
The chess tournament text file is written in a format that is not legible in R as it is. This file needs to be restructured in order to find the average pre-tournament score.

Data in Text File


Summary of the Approaches

Each contributor decided to work on a solution in order to identify the different ways this task can be solved. However, there were key steps we took which were similar.

Extract

  1. Loading the specific R packages we felt that could solve this problem easily.

Transform

  1. The different methods of cleaning and formatting with text file

Publish

  1. Lastly, building and exporting the final CSV

Solution


The Text File

The chess tournament text file is written in a format that is not legible in R as it is. This file needs to be restructured in order to find the average pre-tournament score.

## [1] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
## [2] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
## [3] "-----------------------------------------------------------------------------------------" 
## [4] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
## [5] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
## [6] "-----------------------------------------------------------------------------------------"

Final CSV Output

Once all the rigorous formatting was carried out, the final, workable data frame now has the format below.


The R Package

stringr was the main package needed to solve this problem, while many other package were utilized to perfrom data wrangling (dplyr , tidyverse) and create a presentable R markdown page rmdformats .

library(stringr)
library(tidyverse)
library(dplyr)

Key Methods Used To Restructure the .txt file into the Dataframe


Pattern Replacement Methods

Check Code

These are the different steps we took to identify and replace specific characters from the text file in order to convert the text into a data frame.

Method 1

** Using read_lines **

# ----------------------------------------------------------#
# Read the File from Local and Skip the 1st row "----------"
# ----------------------------------------------------------#

cd2 <-read_lines("tournamentinfo.txt",skip = 1)

# ----------------------------------------------------------#
# Replace all "------" by NA
# Read only data apart from header
# ----------------------------------------------------------#

cd2_data <- unlist(str_replace_all(cd2,"(-{89})","NA"))[-c(1:3)]
cd2_head <- unlist(str_replace_all(cd2,"(-{89})","NA"))[c(1:2)]

# ----------------------------------------------------------#
# Extracting common patterns data in dataframe , and remvoing 
# leading or trailing sapces.
# ----------------------------------------------------------#
tempFull_Data <-  trimws(unlist(str_extract_all(cd2_data,"[[:alpha:]-? ?]{2,}|[\\d.?\\d? +]{2,}|[\\d +]{1,2}|[\\w:?\\d? +]{2,}",simplify = TRUE)))
tempFull_Data <- as.data.frame(tempFull_Data, stringsAsFactors=F)

# ----------------------------------------------------------#
# Removing All NA for 1st column , As it belongs to 3rd row 
# in text file with Seprator which was repalced by NA.
# ----------------------------------------------------------#
tempFull_Data <- tempFull_Data[which(!tempFull_Data$V1=="NA"),]

# ----------------------------------------------------------#
# -----------Final Data to work for next steps.
# ----------------------------------------------------------#
tempFull_Data_merged <- tempFull_Data

Method 2

** Using read.delim **

# ----------------------------------------------------------#
# create a dataframe called chess
# extract the data from column 1 and column 2
# ----------------------------------------------------------#

tmp <- read.delim("tournamentinfo.txt", header=FALSE, sep="|", skip = 4 ,stringsAsFactors = FALSE)
chess <- data.frame(Name = tmp$V2[c(TRUE, FALSE, FALSE)], State = tmp$V1[c(FALSE, TRUE, FALSE)], Points = round(as.numeric(tmp$V3[c(TRUE, FALSE, FALSE)]), 1), Rating = integer(64), opponents_averageRating = integer(64))
chess

# ----------------------------------------------------------#
# extract the pre-rating score using regular expressions
# ----------------------------------------------------------#

tmp_rate <- str_extract(tmp$V2[c(FALSE, TRUE, FALSE)], "R:[:blank:]+[:digit:]+")
tmp_rate
chess$Rating <- as.numeric(str_extract(tmp_rate, "[:digit:]+"))

# ----------------------------------------------------------#
# include all the necessary columns in the ratings dataframe
# ----------------------------------------------------------#

ratings <- data.frame(playnum = str_trim(tmp$V1[c(TRUE, FALSE, FALSE)]), rate = chess$Rating, R1 = as.numeric(str_extract(tmp$V4[c(TRUE, FALSE, FALSE)], "[:digit:]+")), R2 = as.numeric(str_extract(tmp$V5[c(TRUE, FALSE, FALSE)], "[:digit:]+")), R3 = as.numeric(str_extract(tmp$V6[c(TRUE, FALSE, FALSE)], "[:digit:]+")), R4 = as.numeric(str_extract(tmp$V7[c(TRUE, FALSE, FALSE)], "[:digit:]+")), R5 =  as.numeric(str_extract(tmp$V8[c(TRUE, FALSE, FALSE)], "[:digit:]+")), R6 =  as.numeric(str_extract(tmp$V9[c(TRUE, FALSE, FALSE)], "[:digit:]+")), R7 =  as.numeric(str_extract(tmp$V10[c(TRUE, FALSE, FALSE)], "[:digit:]+")), avg_rate = numeric(64))

Method 3

** Using github url **

theURL <- "https://raw.githubusercontent.com/greeneyefirefly/Data607/master/Projects/Project%201/playerdata.txt"
data <- file(theURL, open="r")
playerresult <- readLines(data)

# ----------------------------------------------------------#
# Identifying where the dashes are located 
# ----------------------------------------------------------#

dash <- str_detect(playerresult, '^[-]{2,}$') 

# ----------------------------------------------------------#
# Remove these rows so that there is nothing separating one player from the other
# ----------------------------------------------------------#

playerresult <- playerresult[!dash == "TRUE"]

# ----------------------------------------------------------#
# Remove/Replace the unnecessary indications of win, draw or lose, pipelines and forward slashes
## removed W, D, & L
# ----------------------------------------------------------#

playerresult <- str_remove_all(playerresult, "[WDL]") 

# ----------------------------------------------------------#
## replace pipelines and slashes with commas so it can later be transfromed into a dataframe
# ----------------------------------------------------------#

playerresult <- str_replace_all(playerresult, "[|/]",",")   

# ----------------------------------------------------------#
# Combine the two rows for each player
# ----------------------------------------------------------#

fnew <- c("") 
for (i in seq(1, length(playerresult)-1, by = 2)){
   fnew <- c(fnew, paste(playerresult[i], playerresult[i+1], sep = "", collapse = NULL))
}
# Creating the dataframe
ChessTourn <- as.data.frame(do.call(rbind, strsplit(fnew, ",")), stringsAsFactors = FALSE)

Rigorous Cleaning of the Data Frame

Check Code

Method 1

** Creating two dataframes and binding the necessary data using string functions and regular expressions **

# ----------------------------------------------------------#
# Create two empty data frames which can hold first and second 
# rows respectively (2nd row belongs to player in 1st row)
# ----------------------------------------------------------#

tempFull_Data_first <- data.frame()
tempFull_Data_second <- data.frame()
#Iterator over full dataset and load the first and second data frames
for(rowNum in 1:nrow(tempFull_Data)) {
  if(rowNum %% 2 == 0) {
    tempFull_Data_second <- rbind(tempFull_Data_second, tempFull_Data[rowNum,])
  } else {
    tempFull_Data_first <- rbind(tempFull_Data_first, tempFull_Data[rowNum,])
  }
}

# ----------------------------------------------------------#
# Set colnames for 2nd dataframe so that it doesn't conflict
# with first dataframe i.e., from V20 to V33
# ----------------------------------------------------------#
columnNames1 <- c()
for (i in 20:33) {
  columnNames1 <- c(columnNames1, paste("V", i, sep = ""))
}
colnames(tempFull_Data_second) <- columnNames1

# ----------------------------------------------------------#
# Merge both the dataframes using cbind, 
# Final Table with all info in 1 row.
# ----------------------------------------------------------#

tempFull_Data_merged <- cbind(tempFull_Data_first, tempFull_Data_second)

# ----------------------------------------------------------#
# ---Create Subset of data to only inlcude Chess Round data
# ----------------------------------------------------------#

chessRound  <- tempFull_Data_merged[,c(2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,22,23)]

# ----------------------------------------------------------#
# Column Name R1O, R2O...R7O are indicator of Opponent of respective rounds
# ----------------------------------------------------------#
names(chessRound) <- c("PID","PName","Total","R1","R1O","R2","R2O","R3","R3O","R4","R4O","R5","R5O","R6","R6O","R7","R7O","State","Pre","Post")

# ----------------------------------------------------------#
# Convert column "Pre" to Number, by reading only 3 or 4 digits only.
# ----------------------------------------------------------#

chessRound[,18:20] <- mutate( chessRound[,18:20], 
                  Pre = ifelse( str_detect(Pre,"P\\d{2}$"), 
                                trimws(str_extract(Pre," \\d{3,4}")), 
                                trimws(str_extract(Pre," \\d{3,4}")) ))


# ----------------------------------------------------------#
# Convert all Rounds Column 
# (R1,R2,R3,R4,R5,R6,R7,R1O,R2O,R3O,R4O,R5O,R6O,R7O)to NA 
# if they have  B H U . Only keep W L ANF D
# ----------------------------------------------------------#

chessRound_Temp <- mutate_at( chessRound[,1:17], vars(R1,R2,R3,R4,R5,R6,R7,R1O,R2O,R3O,R4O,R5O,R6O,R7O),
                        list(~ ifelse( str_detect(.,"B|H|U"), NA, .)))

# ----------------------------------------------------------#
# Shift all NA to end of the Row, so that we can read 
# all the data one after the other .
# ----------------------------------------------------------#
# Here I am grouping the data into non-NA set and NA , 
# for each row and then using Tranpose to get it in right format. 
# ----------------------------------------------------------#
chessRound_Temp <- t(apply(chessRound_Temp, 1,  function(x) c(x[!is.na(x)], x[is.na(x)])))
chessRound_Temp <- as.data.frame(chessRound_Temp,stringsAsFactors=F)
Final_Chess_Data <- data.frame(chessRound_Temp[,],chessRound[,18:20])
 
# ----------------------------------------------------------#
# -----------------    Rename Columns ----------------------#
# ----------------------------------------------------------#
names(Final_Chess_Data) <- c("PID","PName","Total","R1","R1O","R2","R2O","R3","R3O","R4","R4O","R5","R5O","R6","R6O","R7","R7O","State","Pre","Post")

Method 3

** Step by Step process: Adding and removing the data rows, spaces and columns needed **

# ----------------------------------------------------------#
# Adding the column names which are in the 1st row, 
# and removing the name row from the dataframe
# ----------------------------------------------------------#

names(ChessTourn) <- unlist(ChessTourn[1,])  
ChessTourn = ChessTourn[-1,]
# ----------------------------------------------------------#
# Renaming and removing some columns
# ----------------------------------------------------------#

colnames(ChessTourn)[11] <- c("State")
colnames(ChessTourn)[4:10] <- c("P1","P2","P3","P4","P5","P6","P7") # The opponents' number
rownames(ChessTourn) <- 1:nrow(ChessTourn)
ChessTourn[12] <- list(NULL) # Removing the USCFI numbers as they are not needed
colnames(ChessTourn)[12] <- c("PreRating")
ChessTourn[c(1,13:ncol(ChessTourn))] <- list(NULL) # Removing the other unnecessary columns

# ----------------------------------------------------------#
# Keeping the pre-rating scores for calculations later
# ----------------------------------------------------------#

ChessTourn$PreRating <- str_sub(ChessTourn$PreRating, 5, 8)

# ----------------------------------------------------------#
# Converting to number for calculations later
# ----------------------------------------------------------#

ChessTourn[c(2:9,11)] <- sapply((ChessTourn)[c(2:9,11)], as.character) 
ChessTourn[c(2:9,11)] <- sapply((ChessTourn)[c(2:9,11)], as.numeric)  

# ----------------------------------------------------------#
# Removing spaces from players name and States
# ----------------------------------------------------------#

ChessTourn[c(1,10)] <- sapply(as.vector((ChessTourn)[c(1,10)]), str_trim) 

# ----------------------------------------------------------#
# Change NA values to zero for calculations later
# ----------------------------------------------------------#
ChessTourn[is.na(ChessTourn)] <- 0  

Pre-tournament opponents’ ratings

At this point, each method had successfully cleaned and formatted the data frame in order to correctly determine the pre-tournament opponents’ rating. This was done using for-loops for the calculation of the mean.

Check Code


Final Exporting of the CSV

The appropriate columns were selected and ordered as indicated in the beginning. Each method exported the data frame as a csv file using write.csv(). These csv file can then be stored as a SQL database.

### Connect to export data into CSV
write.csv(Final_Chess_Data[,c(2,18,3,19,21)], file = "ChessData.csv")

Works Cited

  1. Wickham, H. (2017). R: Package stringr. Retrieved February 12, 2019, from https://cran.r-project.org/web/packages/stringr/stringr.pdf

22 February, 2019