Abstract

Data mining involves taking data from sources and organizing them to be more available for downstream processes. This might take the form of text files, webpages, or even raw byte data. My particular domain of expertise for the past two years involved data mining the genome. That being said, the form of data can be described in the Chomsky hierarchy. General regex is able to datamine tables because it is Chomsky Type-2: context free data.

Datasets

We will be using an organized table of chess tournament rankings. Our pipeline will datamine this text file so that we can make a csv and add it to a database.

path = "./data/"
out = "./output/"
file = "tournamentinfo.txt"
url = "https://raw.githubusercontent.com/Anthogonyst/607-Acquisition/master/data/tournamentinfo.txt"

### Prefer local files, otherwise read from web
rawData = paste0(path, file) %>%
  ifelse(file.exists(.), ., url) %>%
    readLines(.)

Data pipeline

Since our data is a table, we use a series of regex and create a large table. Essentially, the table is split into observations where a character denotes row and column.

### Imported cleaning function that I wrote another place
.Fix <- function(x) { stringr::str_sub(x, 1, sapply(x, stringr::str_length)) }

sep = "############"

### Grabs the data and makes a data frame
tourneyTable = rawData %>%
  .Fix(.) %>% 
    seqinr::c2s(.) %>%
      gsub("---------+", sep, .) %>%
        stringr::str_split(., sep) %>%
          unlist(., FALSE) %>%
            .[. != ""] %>%
              sapply(., stringr::str_split, "\\|") %>%
                lapply(., trimws)

head(tourneyTable, 2)

## $` Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round|  Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | `
##  [1] "Pair"                      "Player Name"              
##  [3] "Total"                     "Round"                    
##  [5] "Round"                     "Round"                    
##  [7] "Round"                     "Round"                    
##  [9] "Round"                     "Round"                    
## [11] "Num"                       "USCF ID / Rtg (Pre->Post)"
## [13] "Pts"                       "1"                        
## [15] "2"                         "3"                        
## [17] "4"                         "5"                        
## [19] "6"                         "7"                        
## [21] ""                         
## 
## $`    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |`
##  [1] "1"                           "GARY HUA"                   
##  [3] "6.0"                         "W  39"                      
##  [5] "W  21"                       "W  18"                      
##  [7] "W  14"                       "W   7"                      
##  [9] "D  12"                       "D   4"                      
## [11] "ON"                          "15445895 / R: 1794   ->1817"
## [13] "N:2"                         "W"                          
## [15] "B"                           "W"                          
## [17] "B"                           "W"                          
## [19] "B"                           "W"                          
## [21] ""

Creating base data

Alot of data is immediately available and grabbed from our master table. It is organized based on specifications.

### Pulls from the table where appropriate to get some general stats
eloRankings = tourneyTable %>%
  as.data.frame(., col.names = length(tourneyTable)) %>%
    .[, -1] %>%
      { data.frame(
        Pair = unlist(.[1, ]),
        Player = unlist(.[2, ]),
        State = unlist(.[11, ]),
        TotalPoints = unlist(.[3, ]),
        PreElo = gsub(".*:\\s*(\\d+).*", "\\1", unlist(.[12, ]))
      ) } %>%
        magrittr::set_rownames(NULL)

head(eloRankings)

##   Pair              Player State TotalPoints PreElo
## 1    1            GARY HUA    ON         6.0   1794
## 2    2     DAKSHESH DARURI    MI         6.0   1553
## 3    3        ADITYA BAJAJ    MI         6.0   1384
## 4    4 PATRICK H SCHILLING    MI         5.5   1716
## 5    5          HANSHI ZUO    MI         5.5   1655
## 6    6         HANSEN SONG    OH         5.0   1686

Further data mining

Generally these data mining pipelines are tailored to the individual file. There’s some general pedagogies that can be followed but the input determines the output. This particular part is hardcoded and wouldn’t reach production but might setup your initial database.

### Uses previous dictionary to compare elo rankings on raw table and sum them
opposition = tourneyTable %>%
  lapply(function(x) { 
    x[4:10] %>% 
      gsub("\\D", "", .) %>% 
        .[grepl("\\d+", .)] %>%
          sapply(., function(y) { 
            eloRankings[[y, 5]]
          }) %>%
            as.numeric(.) %>%
              { round(sum(.) / length(.)) } %>%
                c(x[2], .)
  }) %>%
    { as.data.frame(do.call(rbind, .)) } %>%
      magrittr::set_rownames(NULL) %>%
        magrittr::set_colnames(c("Player", "AverageOpponent")) %>%
          { .[.$Player != "Player Name", ] }

head(opposition)

##                Player AverageOpponent
## 2            GARY HUA            1605
## 3     DAKSHESH DARURI            1469
## 4        ADITYA BAJAJ            1564
## 5 PATRICK H SCHILLING            1574
## 6          HANSHI ZUO            1501
## 7         HANSEN SONG            1519

Exporting to database

The final tables are joined on a primary key through left joins. This clean data can now produce a csv or even be inserted into a SQL database.

### Does a left join to merge all of the data and write as csv
dplyr::left_join(eloRankings, opposition, by = "Player") %>%
  dplyr::select(-Pair) %>%
    { write.csv(paste0(out, "chessElo.csv"), row.names = FALSE, quote = FALSE) ; . } %>%
      head(.)

## x
## ./output/chessElo.csv

##                Player State TotalPoints PreElo AverageOpponent
## 1            GARY HUA    ON         6.0   1794            1605
## 2     DAKSHESH DARURI    MI         6.0   1553            1469
## 3        ADITYA BAJAJ    MI         6.0   1384            1564
## 4 PATRICK H SCHILLING    MI         5.5   1716            1574
## 5          HANSHI ZUO    MI         5.5   1655            1501
## 6         HANSEN SONG    OH         5.0   1686            1519

Data Scrapping