Project One

Notice about graphic presentation.

All the tables, save the last, are made using df_print and paged in the YAML. These tables will automatically scroll off into the sunset. So, if you feel data is missing click on the traingle in the upper right corner and it will give you asscess to the hidden columns.

Also, I used code_folding with hdden in the YAML as well, so the code can be rolled up to better read through data, graphics and tables. If you are missing code you think should be here, just click on the Code button and if you want to clear it out to see the tables and text, roll it back up by clicking Hide

Reading in the text file

Using the fixed width method, I read this file directly from the Github into R, skipping the first four rows of headings and dashes after loading the needed packages

require(stringr)
require(DT)

url ="https://raw.githubusercontent.com/bpoulin-CUNY/Data607/master/chess.txt"
data<-read.fwf(url, c(6,35,6,6,6,6,6,6,6,6), skip = 4)
data[] <- (lapply(data[1:10], FUN = function(x) (gsub("\\|", "", x))) )
data

The file seems to have come in properly, but much grooming is still necessary. It is time to remove the pipe symbols “|” as well as the rows with nothing but dashes. Then create a counter to interate throgh rows the first and second rows of each player while skipping the third, then moving 3 rows forward and repeating this process to align all of the first players information on a single row of the data frame.

To facilitate this process I creadted an empty data frame of the needed dimensions and then used rbind to attach the temporary row to the permanent frame from inside the loop. Once siftedData, I subsetted to keep only the necessary columns and saved them into the data frame refindData.

data[] <- (lapply(data[1:10], FUN = function(x) (gsub("\\|", "", x))) )
rownames(data)<-NULL
iter<-1
siftedData<-data.frame(matrix(ncol = 20, nrow = 0))

while (iter<191){
  temp = data.frame(cbind(as.vector(data[iter,]), as.vector(data[iter+1,])))
  siftedData = rbind(siftedData, temp)
  iter= iter +3
}
refinedData <- siftedData[,c("V1", "V2","V3","V4", "V5","V6","V7", "V8","V9","V10", "V1.1","V2.1")]

siftedData # with only necessary rows properly aligned

refinedData # with only necessary columns

Sorting out the unnecessary columns here the remaining columns are only what is needed to form the final data frame or do calculations for it.

Because only the integers from columns V4 through V1.1 are needed to create the indexes for the oppositions pre-contest averages, the letters and spaces were removed. To do this, I created a function using regular expressions and stringr and passed in into lappy columns 4 through 10. The resulting data is clean of letter characters.

numberExtractor<- function(x){
  temp =vector('numeric')
  for(i in 1:length(x)){dig=str_extract(x[i], "[[:digit:]]{1,3}"); temp=c(temp,dig)}
  return(as.numeric(temp))}
rownames(refinedData)<-NULL
refinedData[,4:10] <- lapply(refinedData[,4:10], numberExtractor)
refinedData

Prematch Scores

In order to show the players pre-match scores as well as cross-calculate the average pre-match scores of each players opponents, it was necessary to pull the four numbers following the “R:” in column twelve. I accomplished this with the following function called preExtractor using regular expressions. I then applied it to each row in column twelve using lapply

With all the crude transformation taken care of and the aggregating phase about to begin, adding useful variable names also made sense.

preExtractor<- function(x){code=as.numeric(str_extract(x,"(?<=R: \\D)\\d{3}|(?<=R:\\D)\\d{4}"))
return(code)}

refinedData[,12] <- unlist(lapply(refinedData[,12], preExtractor))
colnames(refinedData)<- c("Player_ID", "Name", "Total_Points", "Round_1","Round_2","Round_3","Round_4","Round_5","Round_6","Round_7", "State", "Prerating")
refinedData$Total_Points <-as.numeric(refinedData$Total_Points)
refinedData

Replacing Opponent Player_ID’s with Pre-match scores

The easiest way to create the pre-match average of opponents for each player was to first create a cross table with the Player_ID’s and that players Pre_Match Score, and use it in an SQL-like fashion as a key and value. With this table and a bit of grooming to replace NA values, in refinedData, with a non-score integer (allowing the iterator to act directly on the player ID instead of creating a more complicated loop) and converting the Player_ID’s in the SubstitutionFrame to numerics so the match the numbers in the main table the scores can easily be dropped into the table using for loops and ifelse statements to replace NA values where they previously were.

substitutionFrame <- refinedData[,c("Player_ID","Prerating")]
substitutionFrame$Player_ID <- as.numeric(substitutionFrame$Player_ID)

refinedData[is.na(refinedData)] <- 2900
refinedData$Prerating[refinedData$Prerating==2900] <- NA
for(i in 1:length(refinedData$Round_1)){  refinedData$Round_1[i] = ifelse(i==2900 , NA ,substitutionFrame$Prerating[substitutionFrame$Player_ID==refinedData$Round_1[i]])}

for(i in 1:length(refinedData$Round_2)){  refinedData$Round_2[i] = ifelse(i==2900 , NA ,substitutionFrame$Prerating[substitutionFrame$Player_ID==refinedData$Round_2[i]])}

for(i in 1:length(refinedData$Round_3)){  refinedData$Round_3[i] = ifelse(i==2900 , NA ,substitutionFrame$Prerating[substitutionFrame$Player_ID==refinedData$Round_3[i]])}

for(i in 1:length(refinedData$Round_4)){  refinedData$Round_4[i] = ifelse(i==2900 , NA ,substitutionFrame$Prerating[substitutionFrame$Player_ID==refinedData$Round_4[i]])}

for(i in 1:length(refinedData$Round_5)){  refinedData$Round_5[i] = ifelse(i==2900 , NA ,substitutionFrame$Prerating[substitutionFrame$Player_ID==refinedData$Round_5[i]])}

for(i in 1:length(refinedData$Round_6)){  refinedData$Round_6[i] = ifelse(i==2900 , NA ,substitutionFrame$Prerating[substitutionFrame$Player_ID==refinedData$Round_6[i]])}

for(i in 1:length(refinedData$Round_7)){  refinedData$Round_7[i] = ifelse(i==2900 , NA ,substitutionFrame$Prerating[substitutionFrame$Player_ID==refinedData$Round_7[i]])}

refinedData

Creating Opponent Averages

With the opposition scores inline with the player by round, it is a simple matter of extracting a vector of values for each player, taking their mean, and saving them into a column called Opponent_Average which has place-holder NA’s.

refinedData$Opponent_Average <- NULL

for(r in 1:nrow(refinedData)){
  vect<-as.vector(unlist(refinedData[r, 4:10])) # unlisting is needed to use mean across columns
  refinedData$Opponent_Average[r]=round(mean(vect, na.rm = TRUE),0) #ignoring NA values and rounding to integers
  }

refinedData

Create the final Data Frame with the Needed Variables in the Right Order

Creating the final table is simply a matter of subsetting using titles in the appropriate order for the final file.

Final_Chess <- refinedData[ ,c("Name", "State", "Total_Points", "Prerating", "Opponent_Average")]

DIsplaying the final table using “DT” library

This interactive JavaScript table allows you to sort by column, search by word, phrase or numeral yet is compact due to the pagination. The fields without values are the DT way of presenting NA values.

#you will need to set the working directory to save the csv locally

#setwd("")
write.csv(siftedData, "final.csv")

That is a pretty basic handling of a fixed width text file with mixed character and numeric data variables.