Main Data Processing
Read txt file
warn = FALSE: suppresses the warning that readLines() gives when the file does not end with a newline
file<-readLines("tournamentinfo.txt", warn = FALSE)
head(file)
## [1] "-----------------------------------------------------------------------------------------"
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [4] "-----------------------------------------------------------------------------------------"
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
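To make the effect of warn = FALSE concrete, here is a minimal sketch on a throwaway file. The temporary file and its two lines are hypothetical stand-ins, not part of the tournament data; the only assumption is that the file is written without a trailing newline, which is what triggers the warning.

```r
# Hypothetical two-line file written WITHOUT a final newline.
tmp <- tempfile(fileext = ".txt")
cat("first line\nsecond line", file = tmp)

# With the default warn = TRUE, readLines() warns about the incomplete last line.
warned <- tryCatch({
  readLines(tmp)
  FALSE
}, warning = function(w) TRUE)

# With warn = FALSE the same content is returned and no warning is raised.
lines <- readLines(tmp, warn = FALSE)

warned
lines
```

The content read is identical either way; the argument only silences the message.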
Data Re-organizing
As the output shows, the file contains separator lines made of dashes, and we don’t need the header lines either. Each player’s record sits on two consecutive lines, so we build the sequences of line numbers we actually want (lines 5, 6, 8, 9, 11, 12, and so on).
line1 <- seq(5, length(file), by = 3)  # first line of each player record
line2 <- seq(6, length(file), by = 3)  # second line of each player record
head(line1)
## [1] 5 8 11 14 17 20
head(line2)
## [1] 6 9 12 15 18 21
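The indexing idea can be checked on a miniature stand-in for the file. The vector `toy` below is hypothetical; it only assumes the same layout as the real file: four header lines, then repeating groups of two data lines followed by a dashed separator.

```r
# Hypothetical miniature of the file layout.
toy <- c("----", "header A", "header B", "----",
         "p1 line1", "p1 line2", "----",
         "p2 line1", "p2 line2", "----")

idx1 <- seq(5, length(toy), by = 3)  # first line of each record
idx2 <- seq(6, length(toy), by = 3)  # second line of each record

toy[idx1]
toy[idx2]

# Sanity check: both index vectors should describe the same number of records.
stopifnot(length(idx1) == length(idx2))
```

If the two lengths ever differ, the file probably ends mid-record or has an unexpected extra line, which is worth checking before extracting any fields.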
Each data entry is split across two lines: line1 indexes the lines that contain [pair num], [player name], [total] and [rounds], while line2 indexes the lines that contain [state], [USCF ID / Rtg (pre->post)] and [letter result].
head(file[line1])
## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [2] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [3] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [4] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|"
## [5] " 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|"
## [6] " 6 | HANSEN SONG |5.0 |W 34|D 29|L 11|W 35|D 10|W 27|W 21|"
head(file[line2])
## [1] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [2] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [3] " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
## [4] " MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |"
## [5] " MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |"
## [6] " OH | 15055204 / R: 1686 ->1687 |N:3 |W |B |W |B |B |W |B |"
Data Extracting
We extract the player names from the line1 lines read above.
([|]).*?\1: a “|”, followed by as few characters (including spaces) as possible, ending at the next “|”
library(stringr)  # provides str_extract(), str_replace_all() and str_trim()
name <- str_extract(file[line1], "([|]).*?\\1")
head(name)
## [1] "| GARY HUA |" "| DAKSHESH DARURI |"
## [3] "| ADITYA BAJAJ |" "| PATRICK H SCHILLING |"
## [5] "| HANSHI ZUO |" "| HANSEN SONG |"
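The `?` after `.*` is what keeps the match from running past the name. A small sketch of the difference, assuming a sample line whose spacing is made up for illustration (the real file pads the columns more widely):

```r
library(stringr)

# Hypothetical sample line; the column padding is an assumption.
sample_line <- "    1 | GARY HUA   |6.0  |W  39|W  21|"

# Lazy: stops at the first "|" that closes the match.
lazy <- str_extract(sample_line, "([|]).*?\\1")

# Greedy: runs on to the last "|" in the line.
greedy <- str_extract(sample_line, "([|]).*\\1")

lazy
greedy
```

The lazy version returns only the name field with its two pipes, while the greedy version swallows every remaining column.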
The extracted strings still include the “|” delimiters, which we don’t need, so I replace every “|” with an empty string.
name<-str_replace_all(name, "[|]", "")
head(name)
## [1] " GARY HUA " " DAKSHESH DARURI "
## [3] " ADITYA BAJAJ " " PATRICK H SCHILLING "
## [5] " HANSHI ZUO " " HANSEN SONG "
After that, the strings still have extra spaces at the beginning and the end. For tidiness, I trim the whitespace from both sides. (This step is optional if the spaces don’t bother you.)
name<-str_trim(name)
head(name)
## [1] "GARY HUA" "DAKSHESH DARURI" "ADITYA BAJAJ"
## [4] "PATRICK H SCHILLING" "HANSHI ZUO" "HANSEN SONG"
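As an aside, the extract, replace and trim steps can be folded into one call by using a capture group. This is an alternative sketch, not the approach used above: it relies on stringr's `str_match()`, and the sample lines are hypothetical stand-ins with made-up spacing.

```r
library(stringr)

# Hypothetical sample lines in the same shape as file[line1].
sample_lines <- c("    1 | GARY HUA                        |6.0  |W  39|",
                  "    4 | PATRICK H SCHILLING             |5.5  |W  23|")

# The group (.*?) captures the text between the first two pipes;
# the \\s* on each side absorbs the padding so no trimming is needed.
m <- str_match(sample_lines, "[|]\\s*(.*?)\\s*[|]")

# Column 1 is the full match, column 2 is the first capture group.
name_alt <- m[, 2]
name_alt
```

The three-step version is easier to debug one stage at a time; the capture-group version is shorter once the pattern is trusted.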
The same procedure is used to extract the state, total points and pre-ratings.
state<-str_trim(str_replace_all(str_extract(file[line2], ".{3}[|]"), "[|]", ""))
head(state)
## [1] "ON" "MI" "MI" "MI" "MI" "OH"
total_pts<-str_extract(file[line1], "\\d+\\.\\d+")
head(total_pts)
## [1] "6.0" "6.0" "6.0" "5.5" "5.5" "5.0"
pre_rating <- str_trim(str_replace_all(str_extract(file[line2], ":.\\d*.+?[-]"), ":|[-]|P\\d+", ""))
head(pre_rating)
## [1] "1794" "1553" "1384" "1716" "1655" "1686"
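The pre-rating pattern is the densest of the three, so here is a step-by-step sketch of what each stage returns. Both sample lines are assumptions: the first mimics the layout of the real file with made-up spacing, and the second is a hypothetical entry with a rating suffix of the form `P17`, which is the case the `P\\d+` part of the replacement pattern appears to target.

```r
library(stringr)

# Hypothetical sample lines; the ID in the second one is invented.
sample_lines <- c("   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |",
                  "   MI | 00000000 / R: 1641P17->1657P24  |N:3  |B    |")

# Step 1: grab everything from the ":" up to the first "-".
raw <- str_extract(sample_lines, ":.\\d*.+?[-]")

# Step 2: drop the ":", the "-", and any "P<digits>" suffix.
cleaned <- str_replace_all(raw, ":|[-]|P\\d+", "")

# Step 3: trim the leftover padding.
pre_rating_demo <- str_trim(cleaned)

raw
pre_rating_demo
```

Step 1 yields the rating together with its surrounding punctuation, and steps 2 and 3 leave only the digits, still as character strings.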
Data Reformation
Create the data frame from the vectors we just extracted above.
tournament<-data.frame(name, state, total_pts, pre_rating)
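Note that the extracted values print with quotes, so they are character strings rather than numbers. If arithmetic on the points or ratings is needed later, one possible sketch is to convert the columns while building the data frame. The three-row vectors below are copied from the outputs above purely for illustration, and `tournament_typed` is a hypothetical name.

```r
# Small stand-in vectors taken from the first three rows shown above.
name <- c("GARY HUA", "DAKSHESH DARURI", "ADITYA BAJAJ")
state <- c("ON", "MI", "MI")
total_pts <- c("6.0", "6.0", "6.0")
pre_rating <- c("1794", "1553", "1384")

tournament_typed <- data.frame(
  name, state,
  total_pts = as.numeric(total_pts),    # "6.0"  -> 6
  pre_rating = as.integer(pre_rating),  # "1794" -> 1794L
  stringsAsFactors = FALSE              # keep name/state as character in R < 4.0
)

str(tournament_typed)
mean(tournament_typed$pre_rating)
```

Setting `stringsAsFactors = FALSE` explicitly is harmless on current R and avoids factor columns on versions before 4.0, where calling `as.numeric()` on a factor would return level codes instead of the values.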