The library load in

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(RMariaDB)
library(DBI)

Data Exploration

Looking at the data so far, its clear to me that it is a fixed width deliminated. However, I decided not to take the time to fully flesh out every pattern, so I decided to simply handle post processing.

Each datum is made up of two rows

the first row is: Number | Name | Total Points | Score

The second row is Location | UUID | Rating Start -> Rating End

So we will read in our data at this point, taking into account the alternating pattern described above. In addition, we will need to remove the various headers as they are unused.

fileName="chess_scores.txt"
conn <- file(fileName,open="r")
line <-readLines(conn)
test <- list()
for (i in 4:(length(line))){
  if (!startsWith(line[i], '-')) {
    test[i] = (line[i]) 
    i = i + 1
  }
}
non_null_names <- which(!sapply(test, is.null))
result = read_delim(paste(test[non_null_names], collapse = "\n"), delim='|', col_names=FALSE)
## Rows: 128 Columns: 11
## -- Column specification --------------------------------------------------------
## Delimiter: "|"
## chr (10): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
## lgl  (1): X11
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
result
close(conn)

Looking at the data above, it is clear that the pattern we discovered above read in properly. With the first read in complete, we’re going to separate the two different data structures (each contained within one row). Now we are going to spread out the data, and examine the two created dataframes to see what post processing we need to do.

row_odd <- seq_len(nrow(result)) %% 2
result[row_odd == 0,]
head(result[row_odd != 0,][2])

Working with the split Datum

First we are going to work on the dataframe associated to the odd rows, which contains the player ID, Pre_Ranking, Post_Ranking, and if they played white or black for a certain game. I opted to keep the data with the X[integer] flags as it clearly enables me to keep track of columns that I am currently manipulating.

In terms of operations in ths step, we will split off Player_ID and Post_Ranking, using the ‘->’ as a separator to search for. Next, we will split off Player_ID again with Pre_Ranking. At the end of this process, we have 3 columns, Player_Id, Post_Ranking, Pre_Ranking.

The next step is to add in a column where we can merge this data to the other rows we’ve plucked from the dataset. Because we haven’t adjusted the ordering of the rows in the dataframe, or removed rows, we can simply pull the Player_Ids from the Player_data Column, and add them to the Player_Demo Table.

Player_data <- separate(result[row_odd == 0,], 2, c("Player_ID", "Post_Ranking"), sep = '->')
Player_data <- separate(Player_data, 2, c("Player_ID", "Pre_Ranking"), sep = ' / R: ')
Player_Demo <- result[row_odd != 0,]
Player_Demo$Player_ID <- Player_data$Player_ID
Player_Demo
Player_data

Ripping out Game Stats

From there, we’re going to remove the values associated to the round played, the player played, and if it is a win or loss. The main issue with this is the fact that this code is hard-coded to assume an up to 7 round system. If more rounds are played, it will not recognize them.

Data_Finished <- merge(Player_data, Player_Demo, by="Player_ID")

integer_data <- as.integer(unlist(str_extract_all(Data_Finished$Pre_Ranking, "^(\\d+)" )))
 
#Data_Finished$Pre_Ranking <- as.integer(Data_Finished$Pre_Ranking)
Data_Finished <- rename(Data_Finished, Person_id  = X1.y)
Data_Finished <- separate(Data_Finished, X4.y, c("R1_Result", "R1_Player_id"))
Data_Finished <- separate(Data_Finished, X5.y, c("R2_Result", "R2_Player_id"))
Data_Finished <- separate(Data_Finished, X6.y, c("R3_Result", "R3_Player_id"))
Data_Finished <- separate(Data_Finished, X7.y, c("R4_Result", "R4_Player_id"))
Data_Finished <- separate(Data_Finished, X8.y, c("R5_Result", "R5_Player_id"))
Data_Finished <- separate(Data_Finished, X9.y, c("R6_Result", "R6_Player_id"))
Data_Finished <- separate(Data_Finished, X10.y, c("R7_Result", "R7_Player_id"))

#Merges to make pretty names
Data_Finished <- rename(Data_Finished,  Name = X2)
Data_Finished <- rename(Data_Finished, Points = X3.y)
Data_Finished <- rename(Data_Finished, State = X1.x)

Data_Finished

Create Lookup table

Now is the fun part of creating a look up table of Player’s pre-ranking scores, to calculate the average score of each player’s opponents. Due to the complexity of this, we’re first going to hard code it for Gary Hua, player number 58.

First things first, we’re going to create a table containing the Person_id and the Pre_Ranking. Then, we’re going to get a list of all the player’s Gary Hua played. Then, we’re going to get a vector of all of the values. From there, we need to recognize that not all items in the Pre-Ranking column are integers, as some are provisional ranks as indicated by the P. For the purposes of this exercise, I am assuming that all Provisional ranks shall be treated as valid to calculate the average.

dfLookup <- Data_Finished[,c("Person_id","Pre_Ranking")]
dfLookup <- dfLookup[order(dfLookup$Person_id),]
dfLookup
Iterated_list <- as.list(Data_Finished[1,c("R1_Player_id","R2_Player_id", "R3_Player_id", "R4_Player_id", "R5_Player_id", "R6_Player_id", "R7_Player_id")])
total <- c()
for (item in Iterated_list){
  print(item)
  print(dfLookup[as.integer(dfLookup$Person_id) == item, c("Pre_Ranking")])
  total <- append(total, dfLookup[as.integer(dfLookup$Person_id) == item, c("Pre_Ranking")])
}
## [1] "54"
## [1] "1270   "
## [1] "44"
## [1] "1199   "
## [1] "8"
## [1] "1641P17"
## [1] "1"
## [1] "1794   "
## [1] "27"
## [1] "1552   "
## [1] "5"
## [1] "1655   "
## [1] "31"
## [1] "1494   "
print((as.integer((str_extract_all(total, "^(\\d+)" )))))
## [1] 1270 1199 1641 1794 1552 1655 1494
print(as.integer(mean(as.integer((str_extract_all(total, "^(\\d+)" ))))))
## [1] 1515

Let’s Make the code modular

First let’s deal with the crud in the lookup table. We will have to regex the first number found in the cell, that way it will clear out the PXX if it is present. First, lets test that. Also spot check a few entries with different formatting needs just to ensure that it works as intended

dfLookup <- Data_Finished[,c("Person_id","Pre_Ranking")]
for (df_lookup_row in 1:NROW(dfLookup)){
 print(dfLookup[df_lookup_row, "Person_id"])
 value <- dfLookup[df_lookup_row, "Pre_Ranking"]
 print(str_extract(value, "[:digit:]+"))
}
## [1] "   14 "
## [1] "1610"
## [1] "   16 "
## [1] "1604"
## [1] "   17 "
## [1] "1629"
## [1] "    7 "
## [1] "1649"
## [1] "   18 "
## [1] "1600"
## [1] "   22 "
## [1] "1555"
## [1] "   25 "
## [1] "1745"
## [1] "   49 "
## [1] "1291"
## [1] "   47 "
## [1] "1362"
## [1] "   30 "
## [1] "1522"
## [1] "   53 "
## [1] "1393"
## [1] "   11 "
## [1] "1712"
## [1] "    4 "
## [1] "1716"
## [1] "   12 "
## [1] "1663"
## [1] "   54 "
## [1] "1270"
## [1] "   59 "
## [1] "853"
## [1] "   39 "
## [1] "1436"
## [1] "   24 "
## [1] "1229"
## [1] "   32 "
## [1] "1441"
## [1] "   43 "
## [1] "1283"
## [1] "   10 "
## [1] "1365"
## [1] "   48 "
## [1] "1382"
## [1] "   42 "
## [1] "1332"
## [1] "   27 "
## [1] "1552"
## [1] "   20 "
## [1] "1595"
## [1] "   60 "
## [1] "967"
## [1] "    2 "
## [1] "1553"
## [1] "   35 "
## [1] "1438"
## [1] "    5 "
## [1] "1655"
## [1] "   56 "
## [1] "1153"
## [1] "   33 "
## [1] "1449"
## [1] "   58 "
## [1] "917"
## [1] "   36 "
## [1] "1355"
## [1] "   50 "
## [1] "1056"
## [1] "   19 "
## [1] "1564"
## [1] "   28 "
## [1] "1507"
## [1] "   40 "
## [1] "1348"
## [1] "   52 "
## [1] "935"
## [1] "    9 "
## [1] "1411"
## [1] "    3 "
## [1] "1384"
## [1] "   64 "
## [1] "1163"
## [1] "   23 "
## [1] "1363"
## [1] "   34 "
## [1] "1399"
## [1] "    6 "
## [1] "1686"
## [1] "   63 "
## [1] "1175"
## [1] "   13 "
## [1] "1666"
## [1] "   38 "
## [1] "1423"
## [1] "   57 "
## [1] "1092"
## [1] "   26 "
## [1] "1579"
## [1] "   31 "
## [1] "1494"
## [1] "    8 "
## [1] "1641"
## [1] "   51 "
## [1] "1011"
## [1] "   62 "
## [1] "1530"
## [1] "   29 "
## [1] "1602"
## [1] "   44 "
## [1] "1199"
## [1] "   45 "
## [1] "1242"
## [1] "   55 "
## [1] "1186"
## [1] "    1 "
## [1] "1794"
## [1] "   37 "
## [1] "980"
## [1] "   46 "
## [1] "377"
## [1] "   21 "
## [1] "1563"
## [1] "   15 "
## [1] "1220"
## [1] "   41 "
## [1] "1403"
## [1] "   61 "
## [1] "955"
dfLookup

Final Augment

Now that we have the code to clear the crud from that column, we will proceed to modularize the code, making it iterate through every row.

dfLookup <- Data_Finished[,c("Person_id","Pre_Ranking")]
for (df_lookup_row in 1:NROW(dfLookup)){
  
}
dfLookup <- dfLookup[order(dfLookup$Person_id),]
Data_Finished$Opponent_Avg <- 0
for (row_number in 1:NROW(Data_Finished)){
  Iterated_list <- as.list(Data_Finished[row_number,c("R1_Player_id","R2_Player_id", "R3_Player_id", "R4_Player_id", "R5_Player_id", "R6_Player_id", "R7_Player_id")])
  total <- c()
  for (item in Iterated_list){
    total <- append(total, str_extract(dfLookup[as.integer(dfLookup$Person_id) == item, c("Pre_Ranking")], "[:digit:]+"))
  }
  Data_Finished$Opponent_Avg[row_number] <- (as.integer(mean(as.integer(total), na.rm = TRUE)))
}

And last but not least let’s view the output:

Data_Finished 

Finalizing the Data

The preferred output is ‘Gary Hua, ON, 6.0, 1794, 1605’ So lets start structuring our data. We will make heavy use of the paste() function as it allows for easy stitching of data into the fileconnection.

fileConn<-file("Data_Output.csv")
writeLines(paste(str_trim(Data_Finished$Name, side = "both"), str_trim(Data_Finished$State, side = "both"), str_trim(Data_Finished$Points, side = "both"), str_trim(Data_Finished$Pre_Ranking, side = "both"),str_trim(Data_Finished$Opponent_Avg, side = "both"), sep = ', '), fileConn)
close(fileConn)
print(paste(str_trim(Data_Finished$Name, side = "both"), str_trim(Data_Finished$State, side = "both"), str_trim(Data_Finished$Points, side = "both"), str_trim(Data_Finished$Pre_Ranking, side = "both"),str_trim(Data_Finished$Opponent_Avg, side = "both"), sep = ', '))
##  [1] "BRADLEY SHAW, MI, 4.5, 1610, 1515"              
##  [2] "MIKE NIKITIN, MI, 4.0, 1604, 1385"              
##  [3] "RONALD GRZEGORCZYK, MI, 4.0, 1629, 1498"        
##  [4] "GARY DEE SWATHELL, MI, 5.0, 1649, 1372"         
##  [5] "DAVID SUNDEEN, MI, 4.0, 1600, 1480"             
##  [6] "EUGENE L MCCLURE, MI, 4.0, 1555, 1300"          
##  [7] "LOREN SCHWIEBERT, MI, 3.5, 1745, 1363"          
##  [8] "MICHAEL J MARTIN, MI, 2.5, 1291P12, 1285"       
##  [9] "ERIC WRIGHT, MI, 2.5, 1362, 1392"               
## [10] "GEORGE AVERY JONES, ON, 3.5, 1522, 1144"        
## [11] "JOSE C YBARRA, MI, 2.0, 1393, 1345"             
## [12] "CAMERON WILLIAM MC LEMAN, MI, 4.5, 1712, 1467"  
## [13] "PATRICK H SCHILLING, MI, 5.5, 1716, 1573"       
## [14] "KENNETH J TACK, MI, 4.5, 1663, 1506"            
## [15] "LARRY HODGE, MI, 2.0, 1270, 1206"               
## [16] "SEAN M MC CORMICK, MI, 2.0, 853, 1319"          
## [17] "JOEL R HENDON, MI, 3.0, 1436P23, 1429"          
## [18] "MICHAEL R ALDRICH, MI, 4.0, 1229, 1357"         
## [19] "JOSHUA PHILIP MATHEWS, ON, 3.5, 1441, 1378"     
## [20] "ROBERT GLEN VASEY, MI, 3.0, 1283, 1106"         
## [21] "ANVIT RAO, MI, 5.0, 1365, 1554"                 
## [22] "DANIEL KHAIN, MI, 2.5, 1382, 1355"              
## [23] "JARED GE, MI, 3.0, 1332, 1149"                  
## [24] "GAURAV GIDWANI, MI, 3.5, 1552, 1221"            
## [25] "JASON ZHENG, MI, 4.0, 1595, 1410"               
## [26] "JULIA SHEN, MI, 1.5, 967, 1330"                 
## [27] "DAKSHESH DARURI, MI, 6.0, 1553, 1469"           
## [28] "JOSHUA DAVID LEE, MI, 3.5, 1438, 1149"          
## [29] "HANSHI ZUO, MI, 5.5, 1655, 1500"                
## [30] "MARISA RICCI, MI, 2.0, 1153, 1414"              
## [31] "JADE GE, MI, 3.5, 1449, 1276"                   
## [32] "VIRAJ MOHILE, MI, 2.0, 917, 1391"               
## [33] "SIDDHARTH JHA, MI, 3.5, 1355, 1388"             
## [34] "SHIVAM JHA, MI, 2.5, 1056, 1296"                
## [35] "DIPANKAR ROY, MI, 4.0, 1564, 1426"              
## [36] "SOFIA ADINA STANESCU-BELLU, MI, 3.5, 1507, 1522"
## [37] "FOREST ZHANG, MI, 3.0, 1348, 1390"              
## [38] "ETHAN GUO, MI, 2.5, 935, 1494"                  
## [39] "STEFANO LEE, ON, 5.0, 1411, 1523"               
## [40] "ADITYA BAJAJ, MI, 6.0, 1384, 1563"              
## [41] "BEN LI, MI, 1.0, 1163, 1263"                    
## [42] "ALAN BUI, ON, 4.0, 1363, 1213"                  
## [43] "MICHAEL JEFFERY THOMAS, MI, 3.5, 1399, 1375"    
## [44] "HANSEN SONG, OH, 5.0, 1686, 1518"               
## [45] "THOMAS JOSEPH HOSMER, MI, 1.0, 1175, 1350"      
## [46] "TORRANCE HENRY JR, MI, 4.5, 1666, 1497"         
## [47] "BRIAN LIU, MI, 3.0, 1423, 1539"                 
## [48] "MICHAEL LU, MI, 2.0, 1092, 1363"                
## [49] "MAX ZHU, ON, 3.5, 1579, 1506"                   
## [50] "RISHI SHETTY, MI, 3.5, 1494, 1259"              
## [51] "EZEKIEL HOUGHTON, MI, 5.0, 1641P17, 1468"       
## [52] "TEJAS AYYAGARI, MI, 2.5, 1011, 1356"            
## [53] "ASHWIN BALAJI, MI, 1.0, 1530, 1186"             
## [54] "CHIEDOZIE OKORIE, MI, 3.5, 1602P6, 1313"        
## [55] "JUSTIN D SCHILLING, MI, 3.0, 1199, 1327"        
## [56] "DEREK YAN, MI, 3.0, 1242, 1152"                 
## [57] "ALEX KONG, MI, 2.0, 1186, 1406"                 
## [58] "GARY HUA, ON, 6.0, 1794, 1605"                  
## [59] "AMIYATOSH PWNANANDAM, MI, 3.5, 980P12, 1384"    
## [60] "JACOB ALEXANDER LAVALLEY, MI, 3.0, 377P3, 1357" 
## [61] "DINH DANG BUI, ON, 4.0, 1563P22, 1470"          
## [62] "ZACHARY JAMES HOUGHTON, MI, 4.5, 1220P13, 1483" 
## [63] "KYLE WILLIAM MURPHY, MI, 3.0, 1403P5, 1248"     
## [64] "JEZZEL FARKAS, ON, 1.5, 955P11, 1327"

The Conclusion

All in all, I disliked this assignment’s data set. I thought the structure of the chess score’s set made it unusually miserable to work with, due to both the fixed width delimination and compound data per cell.

Instead I would propose that the data collection & storage changes

Ideally I would love to see a multi_database layout akin to

Match_DB:

Match_ID|Comp_1|Comp_2|Color_1|Color_2|Points_1|Points_2|Epoch_Date

Where Comp_1 & Comp_2 in Player_DB$ID

Player_DB:

ID | State | NAME_ARRAY | Rank-Triggered updates from RANK_DB

Rank_DB

ID | Is_Provision | Provisional_Games | Rank | Epoch_Date

This would 100% simplify the process significantly as the structure would allow for significantly easier querying of the data, coupled with a higher degree of accuracy in terms of matches. In fact, you could calculate a real time score of each person simply based on augmentation of the tables. In addition, getting latest results is as simple as just grabbing the first result under each Epoch_Date.

All in all, this could simple be resolved by making easier structures, especially since storage is cheap.