For this project I will use the stringr package for string matching and extraction.

Loading Necessary Libraries

library(stringr)

Retrieving the Data

Data for this project will be retrieved from GitHub.

project_url = url("https://raw.githubusercontent.com/diegomdiaz/IS607/master/Project%201/tournamentinfo.txt")

project_data = readLines(project_url, n = 198)
## Warning in readLines(project_url, n = 198): incomplete final line found
## on 'https://raw.githubusercontent.com/diegomdiaz/IS607/master/Project%201/
## tournamentinfo.txt'

Initial Cleanup

Keeping only the lines that contain word characters (letters or digits). This gets rid of the dashed separator lines, which are not useful in this analysis.

mydata <- unlist(str_extract_all(project_data, "\\W{1,2}.+\\w+"))
# Removing the first two rows, which hold the column headers and are not needed.
mydata <- mydata[-c(1:2)]
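To see what this filter does, here is a small sketch on made-up lines in the same layout as the tournament file. The sample_lines vector is my own stand-in, not the real data.

```r
library(stringr)

# Hypothetical sample in the same layout as the tournament file
sample_lines <- c(
  "-----------------------------------------",
  "    1 | GARY HUA                        |6.0  |W  39|",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |",
  "-----------------------------------------"
)

# Lines made only of dashes contain no word characters, so they yield no match
kept <- unlist(str_extract_all(sample_lines, "\\W{1,2}.+\\w+"))

# Two of the four lines survive
length(kept)
```

Because the pattern has to end on a word character, each kept line also loses its trailing "|".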

Subsetting the Data Using Logical Vectors

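Each player occupies two consecutive rows: one with the name and results, and one with the state and ratings. At this point, I thought it would be easier to split the data with logical vectors and then do my matching on each half.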

mydata_names <- mydata[c(TRUE, FALSE)]

mydata_ids <- mydata[c(FALSE, TRUE)]

head(mydata_names)
## [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4"
## [2] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7"
## [3] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12"
## [4] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1"
## [5] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17"
## [6] "    6 | HANSEN SONG                     |5.0  |W  34|D  29|L  11|W  35|D  10|W  27|W  21"
head(mydata_ids)
## [1] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W"
## [2] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B"
## [3] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W"
## [4] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B"
## [5] "   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B"
## [6] "   OH | 15055204 / R: 1686   ->1687     |N:3  |W    |B    |W    |B    |B    |W    |B"
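The split works because R recycles a short logical index along the whole vector. A minimal sketch, using a made-up rows vector in place of mydata:

```r
# Hypothetical stand-in for the cleaned lines: name row, ID row, name row, ...
rows <- c("name 1", "id 1", "name 2", "id 2", "name 3", "id 3")

# c(TRUE, FALSE) is recycled along the vector, keeping positions 1, 3, 5
name_rows <- rows[c(TRUE, FALSE)]

# c(FALSE, TRUE) is recycled the same way, keeping positions 2, 4, 6
id_rows <- rows[c(FALSE, TRUE)]

# Both halves should have the same length if the rows really alternate
stopifnot(length(name_rows) == length(id_rows))
```

This only holds if the rows alternate strictly, so the length check is a cheap safeguard.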

Extracting Player Names

Extracting the names of the chess players using the str_extract_all function.

mydata_names1 <- unlist(str_extract_all(mydata_names, "(\\b[[:upper:]-]+\\b\\s)+(\\b[[:upper:]-]+\\b){1}"))

# Taking a peek at my new vector.
head(mydata_names1)
## [1] "GARY HUA"            "DAKSHESH DARURI"     "ADITYA BAJAJ"       
## [4] "PATRICK H SCHILLING" "HANSHI ZUO"          "HANSEN SONG"
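To check how the name pattern treats a middle initial, here is a sketch on a single row. The name_row string is a shortened copy of the fourth row shown earlier, and name_pattern is just a variable name I introduce here.

```r
library(stringr)

# Shortened copy of one name row, for illustration only
name_row <- "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2"

# One or more upper-case words each followed by whitespace, then a final upper-case word
name_pattern <- "(\\b[[:upper:]-]+\\b\\s)+(\\b[[:upper:]-]+\\b){1}"

# str_extract returns the first match per string
player_name <- str_extract(name_row, name_pattern)
player_name
```

The single result letters (W, D, L) are not picked up because each is followed by spaces and digits rather than another upper-case word.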

Extracting State Information

Extracting the State information for each player using the str_extract_all function.

mydata_state <- unlist(str_extract_all(mydata_ids, "(\\b[[:upper:]]{2}\\b)"))

# Taking a peek at my new vector.
head(mydata_state)
## [1] "ON" "MI" "MI" "MI" "MI" "OH"

Extracting Total Points

Extracting the total points for each player using the str_extract_all function.

mydata_total <- unlist(str_extract_all(mydata_names, "[[:digit:]]\\.[[:digit:]]"))

# Taking a peek at my new vector.
head(mydata_total)
## [1] "6.0" "6.0" "6.0" "5.5" "5.5" "5.0"

Extracting Pre Scores

Extracting the Pre Score for each player in two steps using the str_extract_all function.

# Looking to match "R:" followed by one to five characters that are each a space or a digit. Five characters cover the padding plus a rating of up to four digits, and no value should exceed this limit.

pre_score <- unlist(str_extract_all(mydata_ids, "R:[ [:digit:]]{1,5}"))

# Taking the previously created vector and matching only digits. This gets rid of the "R:" prefix and the spaces, which I do not need.

# Unlisting the matches and converting the pre score to numeric.

pre_score <- as.numeric(unlist(str_extract_all(pre_score, "[[:digit:]]+")))
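Here is a sketch of the same two steps on made-up ID rows, including a provisional rating with a "P" suffix and a three-digit rating. The id_rows values (IDs and ratings in the second and third rows) are invented for illustration.

```r
library(stringr)

# Hypothetical ID rows: a regular rating, a provisional rating ("P17"),
# and a three-digit rating padded with an extra space
id_rows <- c(
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |",
  "   MI | 12345678 / R: 1641P17->1657P24  |N:3  |W    |",
  "   MI | 87654321 / R:  955P11->1079P21  |N:4  |B    |"
)

# Step 1: "R:" plus up to five spaces or digits; the match stops at the "P"
raw <- str_extract(id_rows, "R:[ [:digit:]]{1,5}")

# Step 2: keep only the digits and convert to numeric
ratings <- as.numeric(str_extract(raw, "[[:digit:]]+"))
ratings
```

Since each row has exactly one "R:" field, str_extract (first match per string) is enough here and returns a plain character vector, so no unlist is needed.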

Extracting Opponent Numbers

In this section, I need to extract the opponent numbers. As in the previous example, I find it more convenient to break this task into two steps.

# In the first step, I extract runs that begin at a word boundary with a space, continue with an upper-case letter or a space, and end with two or three characters that are each a space or a digit. This picks up the padded opponent number that follows each result letter.

opponent_id <- str_extract_all(mydata_names, "\\b [[:upper:] ][ [:digit:]]{2,3}")

# The prior step gets part of the output I need, but with extra spaces and some matches that contain no digits at all. In this step, I match only digits within each player's matches to get rid of these extras.

opponent_id <- lapply(opponent_id, function(x) unlist(str_extract_all(x, "[[:digit:]]{1,2}")))

# I was new to the apply family of functions, but once I read about them they were what I needed for this task. In this next step, I use lapply to convert every element in the list to numeric. This is needed for downstream analysis.

opponent_id <- lapply(opponent_id, as.numeric)

head(opponent_id)
## [[1]]
## [1] 39 21 18 14  7 12  4
## 
## [[2]]
## [1] 63 58  4 17 16 20  7
## 
## [[3]]
## [1]  8 61 25 21 11 13 12
## 
## [[4]]
## [1] 23 28  2 26  5 19  1
## 
## [[5]]
## [1] 45 37 12 13  4 14 17
## 
## [[6]]
## [1] 34 29 11 35 10 27 21
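As an alternative to the two-step approach above, the opponent numbers could probably also be pulled out in one pass with a capture group. This is only a sketch, not what I ran for the results shown; the name_rows values, including the "SOME PLAYER" row with a bye, are made up.

```r
library(stringr)

# Hypothetical name rows; the second player has a bye ("H") and an unplayed round ("U")
name_rows <- c(
  "    1 | GARY HUA                        |6.0  |W  39|W  21|D   4",
  "   62 | SOME PLAYER                     |1.0  |W  12|H    |U    "
)

# A pipe, one result letter, some whitespace, then the captured opponent number
matches <- str_match_all(name_rows, "\\|[A-Z]\\s+([[:digit:]]+)")

# Column 2 of each match matrix holds the capture group
opponents <- lapply(matches, function(m) as.numeric(m[, 2]))
opponents
```

Rounds without an opponent number simply produce no match, so the second player ends up with a single opponent.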

Calculating Average Score

This is the fun part! I have to confess that it took me some time to figure it out since I was not familiar with these techniques in R.

# Creating a lookup function that returns the pre-rating at index i. Player numbers match row positions in pre_score, so an opponent number can be used directly as an index.
lookup <- function(i) { return(pre_score[i]) }

# After creating the lookup function above, I take advantage of lapply to replace each opponent number with that opponent's pre-rating. These values are then stored in the score list.

score <- lapply(opponent_id, lookup)

# At this point, I have the pre_score values and can, once again, use lapply to calculate the average. For this, I need the sum of each score element and its length. Once I have those values, I can compute the average and store it in the average_score vector.

pre_score_total <- unlist(lapply(score, sum))
pre_score_length <- unlist(lapply(score, length))
average_score <- round((pre_score_total)/(pre_score_length), 0)

#reviewing the average score, to see if at least the first row matches sample output. 
head(average_score)
## [1] 1605 1469 1564 1574 1501 1519
#It does!
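The lookup relies on plain vector indexing: indexing a ratings vector by a vector of player numbers returns those players' ratings. A compact sketch of the same idea, using sapply and mean in place of the sum and length steps above, on a made-up four-player event (pre and opp are hypothetical):

```r
# Hypothetical pre-ratings for a four-player event; position = player number
pre <- c(1500, 1600, 1700, 1800)

# Hypothetical opponents faced by each player; players 2 and 3 each missed a round
opp <- list(c(2, 3, 4), c(1, 4), c(1, 4), c(2, 3, 1))

# pre[ids] picks out the opponents' ratings, and mean() averages only the games played
avg <- sapply(opp, function(ids) round(mean(pre[ids]), 0))
avg
# 1700 1650 1650 1600
```

Dividing by the number of games actually played, rather than the number of rounds, is what keeps byes from dragging the average down.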

At last I have all the fields I need to create the requested output, which is done in the step below. I also gave the columns better-looking names and added an ID column, which is simply the integers 1 to 64. I also set stringsAsFactors = FALSE so that the text columns stay as character vectors.

Creating the Final Data Frame

chess_df <- data.frame("ID" = 1:64, "Name"=mydata_names1, "State"=mydata_state, "Total Points"=mydata_total, "Pre-Rating"=pre_score,"Average Rating"=average_score, stringsAsFactors = FALSE)

head(chess_df)
##   ID                Name State Total.Points Pre.Rating Average.Rating
## 1  1            GARY HUA    ON          6.0       1794           1605
## 2  2     DAKSHESH DARURI    MI          6.0       1553           1469
## 3  3        ADITYA BAJAJ    MI          6.0       1384           1564
## 4  4 PATRICK H SCHILLING    MI          5.5       1716           1574
## 5  5          HANSHI ZUO    MI          5.5       1655           1501
## 6  6         HANSEN SONG    OH          5.0       1686           1519
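The dots in the column names above come from data.frame making names syntactically valid by default. A small sketch of the difference, on a one-row stand-in (df_default and df_exact are names I made up for this example):

```r
# By default data.frame() makes column names syntactically valid,
# turning spaces and hyphens into dots
df_default <- data.frame("Total Points" = "6.0", "Pre-Rating" = 1794,
                         stringsAsFactors = FALSE)
names(df_default)
# "Total.Points" "Pre.Rating"

# check.names = FALSE keeps the names exactly as written
df_exact <- data.frame("Total Points" = "6.0", "Pre-Rating" = 1794,
                       check.names = FALSE, stringsAsFactors = FALSE)
names(df_exact)
# "Total Points" "Pre-Rating"
```

With check.names = FALSE the columns have to be accessed with backticks, for example df_exact$`Total Points`, so the dotted names may be the more convenient choice for further analysis.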

Writing the Final Output

In the last step, I will write the data frame to a CSV file. This file will be written to my local working directory.

write.csv(chess_df, file = "tournament_structure.csv")
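By default write.csv also writes the row names as an extra unnamed first column, which here would duplicate the ID column. If that is not wanted, a variant along these lines should work; demo_df and out_file are stand-ins I introduce for the sketch, and tempfile() is used only so it does not touch the working directory.

```r
# Small stand-in for chess_df
demo_df <- data.frame(ID = 1:2, Name = c("GARY HUA", "DAKSHESH DARURI"),
                      stringsAsFactors = FALSE)

# Temporary path, used here only for illustration
out_file <- tempfile(fileext = ".csv")

# row.names = FALSE leaves out the extra row-number column
write.csv(demo_df, file = out_file, row.names = FALSE)

# Reading the file back shows only the two original columns
csv_names <- names(read.csv(out_file, stringsAsFactors = FALSE))
csv_names
# "ID" "Name"
```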