Loading Necessary Libraries
Retrieving the Data
Initial Cleanup
Subsetting the Data Using Logical Vectors
Extracting Player Names
Extracting State Information
Extracting Total Points
Extracting Pre Scores
Extracting Opponent Numbers
Calculating Average Score
Creating the Final Data Frame
Writing the Final Output

For this project I will use the stringr package for string matching and extraction.

Loading Necessary Libraries

library(stringr)

Retrieving the Data

Data for this project will be retrieved from GitHub.

project_url = url("https://raw.githubusercontent.com/diegomdiaz/IS607/master/Project%201/tournamentinfo.txt")

project_data = readLines(project_url, n = 198)

## Warning in readLines(project_url, n = 198): incomplete final line found
## on 'https://raw.githubusercontent.com/diegomdiaz/IS607/master/Project%201/
## tournamentinfo.txt'
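The warning above only means that the file does not end with a newline character; the last line is still read. As a minimal sketch, assuming a local temporary file as a stand-in for the remote one (the file contents and the names tmp and demo_lines are made up for illustration), the same situation can be reproduced and the warning silenced with the warn argument of readLines:

```r
# Hypothetical stand-in for the remote file: a temp file whose last line
# has no trailing newline, which is what triggers the warning.
tmp <- tempfile(fileext = ".txt")
cat("line one\nline two", file = tmp)

# warn = FALSE suppresses the "incomplete final line" warning;
# the lines themselves are read in full either way.
demo_lines <- readLines(tmp, warn = FALSE)
print(demo_lines)

unlink(tmp)
```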

Initial Cleanup

Keeping only the lines that contain word characters (letters or digits). This gets rid of the dashed separator lines, which are not useful in this analysis.

mydata <- unlist(str_extract_all(project_data, "\\W{1,2}.+\\w+"))

#Removing the first two rows of data (the column headers) since they are not needed.
mydata <- mydata[-c(1:2)]
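To see what the cleanup pattern does, here is a small sketch on a few lines in the layout of the tournament file. The lines are shortened copies for illustration, and sample_lines and kept are names I made up. The dashed lines contain no word characters, so the pattern finds nothing in them and unlist drops them.

```r
library(stringr)

# A few lines in the layout of the tournament file (shortened for illustration).
sample_lines <- c(
  "-----------------------------------------",
  "    1 | GARY HUA                        |6.0  |W  39|",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |",
  "-----------------------------------------"
)

# Each match runs from the leading non-word characters to the last word
# character on the line, so trailing pipes and spaces are trimmed as well.
kept <- unlist(str_extract_all(sample_lines, "\\W{1,2}.+\\w+"))
print(kept)
```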

Subsetting the Data Using Logical Vectors

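At this point, I thought it would be easier to split the data using a logical vector and then do my matching. Each player takes up two consecutive rows, so the odd rows hold the names and the even rows hold the IDs.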

mydata_names <- mydata [c(TRUE, FALSE)]

mydata_ids <- mydata [c(FALSE, TRUE)]

head(mydata_names)

## [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4"
## [2] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7"
## [3] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12"
## [4] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1"
## [5] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17"
## [6] "    6 | HANSEN SONG                     |5.0  |W  34|D  29|L  11|W  35|D  10|W  27|W  21"

head(mydata_ids)

## [1] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W"
## [2] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B"
## [3] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W"
## [4] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B"
## [5] "   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B"
## [6] "   OH | 15055204 / R: 1686   ->1687     |N:3  |W    |B    |W    |B    |B    |W    |B"
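The split works because R recycles a short logical index along the whole vector. A minimal sketch with a toy vector (rows, name_rows and id_rows are made-up names, not part of the project data):

```r
# Toy vector standing in for the alternating name/ID rows.
rows <- c("name 1", "id 1", "name 2", "id 2", "name 3", "id 3")

# The length-2 logical index is recycled along the whole vector, so
# c(TRUE, FALSE) keeps positions 1, 3, 5 and c(FALSE, TRUE) keeps 2, 4, 6.
name_rows <- rows[c(TRUE, FALSE)]
id_rows   <- rows[c(FALSE, TRUE)]

print(name_rows)
print(id_rows)
```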

Extracting Player Names

Extracting the names of the chess players using the str_extract_all function.

mydata_names1 <- unlist(str_extract_all(mydata_names, "(\\b[[:upper:]-]+\\b\\s)+(\\b[[:upper:]-]+\\b){1}"))

#taking a peek at my new vector.
head(mydata_names1)

## [1] "GARY HUA"            "DAKSHESH DARURI"     "ADITYA BAJAJ"       
## [4] "PATRICK H SCHILLING" "HANSHI ZUO"          "HANSEN SONG"
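As a quick check of the name pattern, the sketch below runs it on one row from the data and on one made-up row with a hyphenated surname (the second row and the name name_rows_demo are hypothetical). The pattern asks for one or more upper case words each followed by a space, then a final upper case word, so middle initials and hyphens are kept while the single result letters are skipped.

```r
library(stringr)

name_rows_demo <- c(
  "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1",
  "   99 | ANNA SMITH-JONES                |4.0  |W  12|L   3"  # made-up row
)

name_pattern <- "(\\b[[:upper:]-]+\\b\\s)+(\\b[[:upper:]-]+\\b){1}"
demo_names <- unlist(str_extract_all(name_rows_demo, name_pattern))
print(demo_names)
```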

Extracting State Information

Extracting the State information for each player using the str_extract_all function.

mydata_state <- unlist(str_extract_all(mydata_ids, "(\\b[[:upper:]]{2}\\b)"))

#taking a peek at my new vector.
head(mydata_state)

## [1] "ON" "MI" "MI" "MI" "MI" "OH"
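The state pattern relies on the state being the only stand-alone pair of upper case letters on an ID row; the other letters (R, N, W, B) stand alone as single characters. A small sketch on two rows taken from the output above (id_rows_demo and demo_states are made-up names):

```r
library(stringr)

id_rows_demo <- c(
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W",
  "   OH | 15055204 / R: 1686   ->1687     |N:3  |W    |B    |W    |B    |B    |W    |B"
)

# \\b on both sides means exactly two upper case letters forming a whole word.
demo_states <- unlist(str_extract_all(id_rows_demo, "\\b[[:upper:]]{2}\\b"))
print(demo_states)
```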

Extracting Total Points

Extracting Total Point information for each player using the str_extract_all function.

mydata_total <- unlist(str_extract_all(mydata_names, "[[:digit:]]\\.[[:digit:]]"))

#taking a peek at my new vector.
head(mydata_total)

## [1] "6.0" "6.0" "6.0" "5.5" "5.5" "5.0"
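Note that the extracted totals are character strings, as the quotes in the output show. If numeric totals are wanted, they could be converted with as.numeric. A minimal sketch on two rows from the data (total_rows_demo, demo_total and demo_total_num are made-up names, and the conversion is an optional extra, not a step taken in this project):

```r
library(stringr)

total_rows_demo <- c(
  "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4",
  "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1"
)

# One digit, a literal dot, one digit: only the total points column fits.
demo_total <- unlist(str_extract_all(total_rows_demo, "[[:digit:]]\\.[[:digit:]]"))

# Optional conversion from character to numeric.
demo_total_num <- as.numeric(demo_total)
print(demo_total_num)
```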

Extracting Pre Scores

Extracting the Pre Score for each player in two steps using the str_extract_all function.

#Looking to match a pattern where we have R: followed by spaces and digits. This is followed by a quantifier to match it at least one time, but not more than five times. No value should exceed this limit.

pre_score <- unlist(str_extract_all(mydata_ids, "R:[ [:digit:]]{1,5}"))

#Taking the previously created vector and matching only digits. This should help me get rid of characters I do not need and any extra spaces.

#Unlisting the result and converting the pre score to numeric.

pre_score <- as.numeric(unlist(str_extract_all(pre_score, "[[:digit:]]+")))
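The two steps can be traced on a couple of rows. In the sketch below the first row comes from the data, while the second is a made-up row with a letter suffix after the rating, included only to show that the first pattern stops at the first character that is neither a space nor a digit. The names rating_rows_demo, step1 and demo_pre are hypothetical.

```r
library(stringr)

rating_rows_demo <- c(
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W",
  "   MI | 00000000 / R: 1641P17->1657P24  |N:3  |W"  # made-up row
)

# Step 1: "R:" plus up to five spaces or digits.
step1 <- unlist(str_extract_all(rating_rows_demo, "R:[ [:digit:]]{1,5}"))
print(step1)

# Step 2: keep the run of digits and convert it to numeric.
demo_pre <- as.numeric(str_extract(step1, "[[:digit:]]+"))
print(demo_pre)
```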

Extracting Opponent Numbers

In this section, I need to extract the opponent numbers. As in the previous example, I find it more convenient to break this task into two steps.

#In the first step, I want to extract the run of spaces and digits that follows each single upper case result letter. The last part of the pattern is quantified to occur at least two times, but not more than three times.

opponent_id <- str_extract_all(mydata_names, "\\b [[:upper:] ][ [:digit:]]{2,3}")

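#In the prior step, I was able to get part of the output I needed, but it had extra spaces and characters that were not needed. In this step, I only match on digits to get rid of these extra characters. The matching is applied to each player's element of the list, so the list does not have to be coerced to a character vector first.

opponent_id <- lapply(opponent_id, function(x) unlist(str_extract_all(x, "[[:digit:]]{1,2}")))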

#I was new to the apply family of functions, but once I read about them I saw they were what I needed for this task. In this next step, I am using lapply to convert every element in the list to numeric. This is needed for downstream analysis.

opponent_id <- lapply(opponent_id, as.numeric)

head(opponent_id)

## [[1]]
## [1] 39 21 18 14  7 12  4
## 
## [[2]]
## [1] 63 58  4 17 16 20  7
## 
## [[3]]
## [1]  8 61 25 21 11 13 12
## 
## [[4]]
## [1] 23 28  2 26  5 19  1
## 
## [[5]]
## [1] 45 37 12 13  4 14 17
## 
## [[6]]
## [1] 34 29 11 35 10 27 21
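For comparison, here is a sketch of the same two-step idea with a slightly different pattern from the one above: it grabs each round cell (a pipe, one upper case letter, spaces, optional digits) and then keeps the digits. This is an alternative I am assuming would fit the file layout, not the pattern used in this project. The second row is made up to show rounds without an opponent, and round_rows_demo, cells and demo_opponents are hypothetical names.

```r
library(stringr)

round_rows_demo <- c(
  "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4",
  "   99 | SAMPLE PLAYER                   |2.0  |W  10|H    |L   3|U    |B    |D   8|L   2"  # made-up row
)

# Step 1: one cell per round; the total points cell is skipped because
# a digit, not a letter, follows its pipe.
cells <- str_extract_all(round_rows_demo, "\\|[[:upper:]] +[[:digit:]]*")

# Step 2: keep only the digits, per player, and convert to numeric.
# Cells without an opponent number contribute nothing.
demo_opponents <- lapply(cells, function(x) {
  as.numeric(unlist(str_extract_all(x, "[[:digit:]]+")))
})
print(demo_opponents)
```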

Calculating Average Score

This is the fun part! I have to confess that it took me some time to figure it out since I was not familiar with these techniques in R.

# Creating a lookup function that will return the pre score based on the "i" index. This works because each player's ID number is the same as their row position in pre_score.
lookup <- function(i) pre_score[i]

#After creating the lookup function above, I then take advantage of the lapply function to match/replace player index ID and pre_scores. These values are then stored in the score list.

score <- lapply(opponent_id, lookup)

#At this point, I now have the pre_score values and can, once again, use lapply to calculate the average. For this, I need the sum of each of the score elements and their length. Once I have those values, I can take the average and store it in the average_score vector.

pre_score_total <- unlist(lapply(score, sum))
pre_score_length <- unlist(lapply(score, length))
average_score <- round((pre_score_total)/(pre_score_length), 0)

#reviewing the average score, to see if at least the first row matches sample output. 
head(average_score)

## [1] 1605 1469 1564 1574 1501 1519

#It does!
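The lookup-and-average idea can be checked by hand on a toy example. All values below are made up, and toy_pre_score, toy_opponents, toy_scores and toy_average are hypothetical names. Indexing a ratings vector with a vector of opponent IDs returns those opponents' ratings, and the mean of each set is the same as the sum divided by the length.

```r
# Made-up ratings for five players; the position in the vector is the player ID.
toy_pre_score <- c(1500, 1600, 1700, 1800, 1900)

# Made-up opponent lists for three players.
toy_opponents <- list(c(2, 3), c(1, 4, 5), c(1))

# Indexing by a vector of IDs returns the matching ratings.
toy_scores <- lapply(toy_opponents, function(ids) toy_pre_score[ids])

# mean() per player is equivalent to sum / length.
toy_average <- round(sapply(toy_scores, mean), 0)
print(toy_average)  # 1650 1733 1500
```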

At last I have all the fields I need to create the requested output. This is done in the step below. I also gave the columns better looking names and added a column for the ID. This is simply a numeric vector with 64 elements. I also made sure to set the stringsAsFactors parameter to FALSE.

Creating the Final Data Frame

chess_df <- data.frame("ID" = 1:64, "Name"=mydata_names1, "State"=mydata_state, "Total Points"=mydata_total, "Pre-Rating"=pre_score,"Average Rating"=average_score, stringsAsFactors = FALSE)

head(chess_df)

##   ID                Name State Total.Points Pre.Rating Average.Rating
## 1  1            GARY HUA    ON          6.0       1794           1605
## 2  2     DAKSHESH DARURI    MI          6.0       1553           1469
## 3  3        ADITYA BAJAJ    MI          6.0       1384           1564
## 4  4 PATRICK H SCHILLING    MI          5.5       1716           1574
## 5  5          HANSHI ZUO    MI          5.5       1655           1501
## 6  6         HANSEN SONG    OH          5.0       1686           1519
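The column names in the output contain dots because data.frame repairs names that are not syntactically valid, such as ones with spaces or hyphens. A small sketch of this behaviour, and of the check.names argument that turns it off (df_default and df_exact are made-up names, and keeping the exact names is an option, not something this project requires):

```r
# Default behaviour: spaces and hyphens in names are replaced by dots.
df_default <- data.frame("Total Points" = "6.0", "Pre-Rating" = 1794,
                         stringsAsFactors = FALSE)
print(names(df_default))  # "Total.Points" "Pre.Rating"

# check.names = FALSE keeps the names exactly as typed.
df_exact <- data.frame("Total Points" = "6.0", "Pre-Rating" = 1794,
                       check.names = FALSE, stringsAsFactors = FALSE)
print(names(df_exact))    # "Total Points" "Pre-Rating"
```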

Writing the Final Output

In the last step, I will write the data frame to a CSV file. This file will be written to my local working directory.

write.csv(chess_df, file = "tournament_structure.csv")
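By default write.csv also writes the row names as an unnamed first column. Since the data frame already has an ID column, one option is to switch that off with row.names = FALSE. A minimal sketch using a temporary file and a toy data frame (toy_df, out_file and round_trip are made-up names):

```r
# Toy data frame with the same kind of columns as the final output.
toy_df <- data.frame(ID = 1:2,
                     Name = c("GARY HUA", "DAKSHESH DARURI"),
                     stringsAsFactors = FALSE)

# Write without row names, then read the file back to check its shape.
out_file <- tempfile(fileext = ".csv")
write.csv(toy_df, file = out_file, row.names = FALSE)
round_trip <- read.csv(out_file, stringsAsFactors = FALSE)
print(round_trip)

unlink(out_file)
```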

Project 1

Diego Diaz

September 27, 2015