For this project I will use the stringr package for string matching and extraction.
library(stringr)
Data for this project will be retrieved from GitHub.
project_url <- url("https://raw.githubusercontent.com/diegomdiaz/IS607/master/Project%201/tournamentinfo.txt")
project_data <- readLines(project_url, n = 198)
## Warning in readLines(project_url, n = 198): incomplete final line found
## on 'https://raw.githubusercontent.com/diegomdiaz/IS607/master/Project%201/
## tournamentinfo.txt'
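The warning above only means that the file does not end with a newline character; all lines are still read. As a minimal sketch (the demo file and its contents are made up for illustration), readLines accepts warn = FALSE to silence it:

```r
# Hypothetical demo file whose last line has no trailing newline.
demo_file <- tempfile(fileext = ".txt")
cat("first line\nsecond line", file = demo_file)

# warn = FALSE suppresses the "incomplete final line" warning;
# the lines themselves are read in full either way.
demo_lines <- readLines(demo_file, warn = FALSE)
print(demo_lines)
```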
Extracting only the lines that contain word characters (letters and digits). This gets rid of the dashed separator lines, which are not useful in this analysis.
mydata <- unlist(str_extract_all(project_data, "\\W{1,2}.+\\w+"))
#Removing the first two rows of data since they are not needed.
mydata <- mydata [-c(1:2)]
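The filtering idea above can be checked on a few hand-written lines. This is only a sketch in base R, and it uses a different technique from the str_extract_all call: it keeps whole lines that contain at least one word character instead of extracting the matched part. The sample lines and their spacing are my own assumptions about the file layout.

```r
# Hypothetical sample lines: one dashed separator and two data lines.
sample_lines <- c(
  "-----------------------------------------",
  "    1 | GARY HUA                        |6.0  |W  39|",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |"
)

# Keep only lines with at least one word character (letter, digit or underscore).
kept_lines <- sample_lines[grepl("\\w", sample_lines)]
print(length(kept_lines))
```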
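At this point, I thought it would be easier to split the data using logical vectors and then do my matching on each half. Each player takes up two consecutive lines, so the index c(TRUE, FALSE) is recycled to pick the name lines and c(FALSE, TRUE) picks the ID lines.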
mydata_names <- mydata [c(TRUE, FALSE)]
mydata_ids <- mydata [c(FALSE, TRUE)]
head(mydata_names)
## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4"
## [2] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7"
## [3] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12"
## [4] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1"
## [5] " 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17"
## [6] " 6 | HANSEN SONG |5.0 |W 34|D 29|L 11|W 35|D 10|W 27|W 21"
head(mydata_ids)
## [1] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W"
## [2] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B"
## [3] " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W"
## [4] " MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B"
## [5] " MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B"
## [6] " OH | 15055204 / R: 1686 ->1687 |N:3 |W |B |W |B |B |W |B"
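The split above relies on R recycling a short logical index across the whole vector. A small sketch with made-up values shows the effect:

```r
# Hypothetical vector where name lines and ID lines alternate.
alternating <- c("name 1", "id 1", "name 2", "id 2", "name 3", "id 3")

# c(TRUE, FALSE) is recycled to TRUE, FALSE, TRUE, FALSE, ... (odd positions).
odd_lines <- alternating[c(TRUE, FALSE)]

# c(FALSE, TRUE) is recycled to FALSE, TRUE, FALSE, TRUE, ... (even positions).
even_lines <- alternating[c(FALSE, TRUE)]

print(odd_lines)
print(even_lines)
```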
Extracting the names of the chess players using the str_extract_all function.
mydata_names1 <- unlist(str_extract_all(mydata_names, "(\\b[[:upper:]-]+\\b\\s)+(\\b[[:upper:]-]+\\b){1}"))
#taking a peek at my new vector.
head(mydata_names1)
## [1] "GARY HUA" "DAKSHESH DARURI" "ADITYA BAJAJ"
## [4] "PATRICK H SCHILLING" "HANSHI ZUO" "HANSEN SONG"
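To see why the name pattern skips the round results, it can be tried on a couple of rows in base R. This is a sketch only: the sample rows are typed by hand (their exact spacing is an assumption), and regexpr with perl = TRUE stands in for str_extract_all.

```r
# Same pattern as above: one or more upper-case words followed by a space,
# then a final upper-case word.
name_pattern <- "(\\b[[:upper:]-]+\\b\\s)+(\\b[[:upper:]-]+\\b){1}"

# Hypothetical sample rows in the layout of the name lines.
sample_rows <- c(
  "    1 | GARY HUA                        |6.0  |W  39|W  21|",
  "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|"
)

# regexpr returns the first match per row; regmatches pulls out the text.
found_names <- regmatches(sample_rows, regexpr(name_pattern, sample_rows, perl = TRUE))
print(found_names)
```

A lone "W" or "D" never matches because it is not followed by a second upper-case word.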
Extracting the State information for each player using the str_extract_all function.
mydata_state <- unlist(str_extract_all(mydata_ids, "(\\b[[:upper:]]{2}\\b)"))
#taking a peek at my new vector.
head(mydata_state)
## [1] "ON" "MI" "MI" "MI" "MI" "OH"
Extracting Total Point information for each player using the str_extract_all function.
mydata_total <- unlist(str_extract_all(mydata_names, "[[:digit:]]\\.[[:digit:]]"))
#taking a peek at my new vector.
head(mydata_total)
## [1] "6.0" "6.0" "6.0" "5.5" "5.5" "5.0"
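Since the fields on each line are separated by "|", an alternative to regular expressions is to split on the delimiter and pick fields by position. The following is a base R sketch on one hand-typed row (the spacing is assumed), not the method used above. It also converts the total points to numeric, whereas mydata_total is a character vector.

```r
# Hypothetical sample row in the layout of the name lines.
name_row <- "    1 | GARY HUA                        |6.0  |W  39|W  21|"

# Split on the literal pipe and trim the padding around each field.
fields <- trimws(strsplit(name_row, "|", fixed = TRUE)[[1]])

player_name <- fields[2]
total_points <- as.numeric(fields[3])

print(player_name)
print(total_points)
```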
Extracting the Pre Score for each player in two steps using the str_extract_all function.
#Looking to match a pattern where we have R: followed by spaces and digits. This is followed by a quantifier to match the space-or-digit class at least one time, but not more than five times. No value should exceed this limit.
pre_score <- unlist(str_extract_all(mydata_ids, "R:[ [:digit:]]{1,5}"))
#Taking the previously created vector and matching only digits. This should help me get rid of characters I do not need and any other spaces.
#Converting the pre score to numeric and unlisting it.
pre_score <- unlist(as.numeric(str_extract_all(pre_score, "[[:digit:]]+")))
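The two steps above can also be folded into one with a capture group. This is a base R sketch under the assumption that each ID line contains exactly one "R:" marker; the sample rows are typed by hand.

```r
# Hypothetical sample rows in the layout of the ID lines.
id_rows <- c(
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |",
  "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |"
)

# Replace the whole line with the first run of digits that follows "R:".
pre_rating <- as.numeric(sub(".*R: *([0-9]+).*", "\\1", id_rows))
print(pre_rating)
```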
In this section, I need to extract the opponent numbers. As in the previous example, I find it more convenient to break this task into two steps.
#In the first step, I want to extract the structure where a single upper case letter is followed by a space and a digit. The last part is quantified to occur at least two times, but not more than three times.
opponent_id <- str_extract_all(mydata_names, "\\b [[:upper:] ][ [:digit:]]{2,3}")
#In the prior step, I was able to get part of the output I needed, but it had extra spaces and characters that were not needed. In this step, I only match on digits to get rid of these extra characters.
opponent_id <- str_extract_all(opponent_id, "[[:digit:]]{1,2}")
#I was new to the apply family of functions, but once I read about them it was what I needed for this task. In this next step, I am using the lapply to convert every element in the list to a numeric. This is needed for downstream analysis.
opponent_id <- lapply(opponent_id, as.numeric)
head(opponent_id)
## [[1]]
## [1] 39 21 18 14 7 12 4
##
## [[2]]
## [1] 63 58 4 17 16 20 7
##
## [[3]]
## [1] 8 61 25 21 11 13 12
##
## [[4]]
## [1] 23 28 2 26 5 19 1
##
## [[5]]
## [1] 45 37 12 13 4 14 17
##
## [[6]]
## [1] 34 29 11 35 10 27 21
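As a cross-check of the first list element, here is a base R sketch that pulls the opponent numbers out of a single row. The row is typed by hand with assumed spacing, and the pattern (an upper-case letter, spaces, then digits) is my own simplification rather than the one used above.

```r
# Hypothetical sample row for player 1.
name_row <- "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"

# Find every "result letter, spaces, opponent number" chunk in the row.
rounds <- regmatches(name_row, gregexpr("[[:upper:]] +[0-9]+", name_row))[[1]]

# Strip everything that is not a digit and convert to numeric.
opponents <- as.numeric(gsub("[^0-9]", "", rounds))
print(opponents)
```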
This is the fun part! I have to confess that it took me some time to figure it out since I was not familiar with these techniques in R.
# Creating a lookup function that will return the pre-score based on the "i" index. The index corresponds to the row number, which is also the player's ID, so it should work in our example.
lookup <- function (i){return (pre_score[i])}
#After creating the lookup function above, I then take advantage of the lapply function to match/replace player index ID and pre_scores. These values are then stored in the score list.
score <- lapply(opponent_id, lookup)
#At this point, I now have the pre_score values and can, once again, use lapply to calculate the average. For this, I need the sum of each of the score elements and their length. Once I have those values, I can then store the average in the average_score vector.
pre_score_total <- unlist(lapply(score, sum))
pre_score_length <- unlist(lapply(score, length))
average_score <- round((pre_score_total)/(pre_score_length), 0)
#reviewing the average score, to see if at least the first row matches sample output.
head(average_score)
## [1] 1605 1469 1564 1574 1501 1519
#It does!
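The lookup, sum and length steps can be collapsed into a single call, because indexing a vector with a vector of positions returns all the matching values at once. A small sketch with made-up ratings and opponents (not the tournament data):

```r
# Hypothetical pre-ratings for four players, indexed by player ID.
toy_pre_rating <- c(1500, 1600, 1700, 1400)

# Hypothetical opponent IDs faced by each of the four players.
toy_opponents <- list(c(2, 3), c(1, 4), c(1, 2, 4), c(3))

# For each player, look up the opponents' ratings and average them.
toy_average <- sapply(toy_opponents, function(ids) round(mean(toy_pre_rating[ids])))
print(toy_average)
```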
At last I have all the fields I need to create the requested output. This is done in the step below. I also gave the columns more readable names and added a column for the ID. This is simply a numeric vector with 64 elements. I also made sure to set stringsAsFactors = FALSE.
chess_df <- data.frame("ID" = 1:64, "Name"=mydata_names1, "State"=mydata_state, "Total Points"=mydata_total, "Pre-Rating"=pre_score,"Average Rating"=average_score, stringsAsFactors = FALSE)
head(chess_df)
## ID Name State Total.Points Pre.Rating Average.Rating
## 1 1 GARY HUA ON 6.0 1794 1605
## 2 2 DAKSHESH DARURI MI 6.0 1553 1469
## 3 3 ADITYA BAJAJ MI 6.0 1384 1564
## 4 4 PATRICK H SCHILLING MI 5.5 1716 1574
## 5 5 HANSHI ZUO MI 5.5 1655 1501
## 6 6 HANSEN SONG OH 5.0 1686 1519
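The dots in Total.Points, Pre.Rating and Average.Rating appear because data.frame makes column names syntactically valid by default. If the original labels are preferred, a sketch with toy values shows that check.names = FALSE keeps them as typed:

```r
# Toy values only; check.names = FALSE stops data.frame from
# rewriting "Total Points" as Total.Points.
toy_df <- data.frame(
  "Total Points" = c(6.0, 5.5),
  "Pre-Rating" = c(1794, 1716),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
print(names(toy_df))
```

Columns named this way have to be accessed with toy_df[["Total Points"]] or backticks, which is the trade-off for the nicer labels.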
In the last step, I will write the data frame to a csv file. This file will be written to my local working directory.
write.csv(chess_df, file = "tournament_structure.csv")
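By default write.csv also writes the row names as an extra unnamed first column. Since the data frame already has an ID column, one option is to turn that off. A minimal sketch with a toy data frame and a temporary file (both made up for illustration):

```r
# Toy data frame standing in for chess_df.
toy_df <- data.frame(ID = 1:2, Name = c("A PLAYER", "B PLAYER"), stringsAsFactors = FALSE)

# Write without row names, then read the file back to check the columns.
out_file <- tempfile(fileext = ".csv")
write.csv(toy_df, file = out_file, row.names = FALSE)
round_trip <- read.csv(out_file, stringsAsFactors = FALSE)
print(names(round_trip))
```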