Project-1

Loading Data and Packages

First, load the data. Here, the readLines function is used. This function reads the .txt file into a character vector, where each line of the file becomes an element of the vector.

theURL <- "https://raw.githubusercontent.com/Tyllis/Data607/master/tournamentinfo.txt"
raw.data <- readLines(theURL)

## Warning in readLines(theURL): incomplete final line found on 'https://
## raw.githubusercontent.com/Tyllis/Data607/master/tournamentinfo.txt'

Let’s take a look at the import result:

head(raw.data)

## [1] "-----------------------------------------------------------------------------------------" 
## [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
## [4] "-----------------------------------------------------------------------------------------" 
## [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
## [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"

tail(raw.data)

## [1] "   63 | THOMAS JOSEPH HOSMER            |1.0  |L   2|L  48|D  49|L  43|L  45|H    |U    |"
## [2] "   MI | 15057092 / R: 1175   ->1125     |     |W    |B    |W    |B    |B    |     |     |"
## [3] "-----------------------------------------------------------------------------------------"
## [4] "   64 | BEN LI                          |1.0  |L  22|D  30|L  31|D  49|L  46|L  42|L  54|"
## [5] "   MI | 15006561 / R: 1163   ->1112     |     |B    |W    |W    |B    |W    |B    |B    |"
## [6] "-----------------------------------------------------------------------------------------"

The import can be verified by checking the length of the vector. The text file lists 64 players, and each player occupies 3 lines: two lines of player information followed by a line of dashes “-”. The file also begins with 4 header lines: two dash lines and two lines of column names. The vector should therefore contain 64x3+4 elements.

length(raw.data) == 64*3+4

## [1] TRUE

So the load is done correctly.

This project uses the stringr package extensively.

library(stringr)

Vector Manipulation

A number of steps are needed to turn the vector into a data.frame object usable for analysis.

Step 1. Remove the title elements

Here, the four header elements containing the titles are removed. We only need the elements containing player and game information.

vec1 <- raw.data[5:length(raw.data)]
head(vec1)

## [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
## [2] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [3] "-----------------------------------------------------------------------------------------"
## [4] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
## [5] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [6] "-----------------------------------------------------------------------------------------"

Step 2. Remove dash line elements

In this step, the elements containing just dash lines “-” are removed.

remove.value <- str_detect(vec1, pattern = "[-]{2,}")
vec2 <- vec1[!remove.value]
head(vec2)

## [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
## [2] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [3] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
## [4] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [5] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
## [6] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"

To verify that all dash lines were removed, note that vec1 contains one dash line per player, so 64 dash lines in total. vec2 should therefore be 64 elements shorter than vec1.

length(vec2) == length(vec1) - 64

## [1] TRUE
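
As a small illustration of the pattern used in this step, here is a sketch on three shortened lines modeled on the file. It uses base R's grepl as a stand-in for str_detect (an assumption made only so the example runs without stringr). The pattern requires at least two consecutive dashes, so the single dash in "->" on a rating line does not count as a match.

```r
# Shortened lines: one player line, one rating line, one separator line.
toy <- c("    1 | GARY HUA   |6.0  |W  39|",
         "   ON | 15445895 / R: 1794   ->1817 |N:2  |W    |",
         "---------------------------------")

# TRUE where a line contains two or more consecutive dashes.
is_dash <- grepl("[-]{2,}", toy)
is_dash

# Keep only the lines that are not separators.
kept <- toy[!is_dash]
length(kept)
```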

Step 3. Combining elements

Now we have a vector that contains just player and game information. But each player’s information is captured in two elements. We want to combine the two elements into one.

This can be done by splitting the vector into two, one containing the odd-numbered elements and the other the even-numbered elements. The two vectors are then joined element by element using the str_c function.

split1 <- vec2[seq(1, length(vec2), 2)] 
split2 <- vec2[seq(2, length(vec2), 2)] 
vec3 <- str_c(split1, split2)

The result should be a 64-element vector, where each element is a player.

length(vec3)

## [1] 64
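
The odd/even split can be sketched on a toy vector with two made-up players. Base R's paste0 is used here in place of str_c (an assumption for portability); both join their arguments element by element.

```r
# Toy vector: two "lines" per player, in file order (values are made up).
toy <- c("1 | ANN |", " ON | 1500 |", "2 | BOB |", " MI | 1400 |")

odd  <- toy[seq(1, length(toy), by = 2)]  # first line of each player
even <- toy[seq(2, length(toy), by = 2)]  # second line of each player

# paste0 is vectorised: element i of odd is joined to element i of even.
combined <- paste0(odd, even)
combined
```

This only works because every player occupies exactly two consecutive elements; a missing line anywhere would shift all later pairs.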

Sampling a few elements at random to get a feel for what the vector looks like:

vec3[sample(1:length(vec3), 10)]

##  [1] "   42 | JARED GE                        |3.0  |L  12|L  50|L  57|D  60|D  61|W  64|W  56|   MI | 14462326 / R: 1332   ->1256     |     |B    |W    |B    |B    |W    |W    |B    |"
##  [2] "   15 | ZACHARY JAMES HOUGHTON          |4.5  |D  19|L  16|W  30|L  22|W  54|W  33|W  38|   MI | 15619130 / R: 1220P13->1416P20  |N:3  |B    |B    |W    |W    |B    |B    |W    |"
##  [3] "   27 | GAURAV GIDWANI                  |3.5  |W  51|L  13|W  46|W  37|D  14|L   6|U    |   MI | 14476567 / R: 1552   ->1539     |N:4  |W    |B    |W    |B    |W    |B    |     |"
##  [4] "   13 | TORRANCE HENRY JR               |4.5  |W  36|W  27|L   7|D   5|W  33|L   3|W  32|   MI | 15082995 / R: 1666   ->1662     |N:3  |B    |W    |B    |B    |W    |W    |B    |"
##  [5] "   48 | DANIEL KHAIN                    |2.5  |L  17|W  63|H    |D  52|H    |L  29|L  35|   MI | 14369165 / R: 1382   ->1335     |     |B    |W    |     |B    |     |W    |B    |"
##  [6] "   45 | DEREK YAN                       |3.0  |L   5|L  51|D  60|L  56|W  63|D  55|W  58|   MI | 15372807 / R: 1242   ->1191     |     |W    |B    |W    |B    |W    |B    |W    |"
##  [7] "   20 | JASON ZHENG                     |4.0  |L  40|W  49|W  23|W  41|W  28|L   2|L   9|   MI | 14529060 / R: 1595   ->1569     |N:4  |W    |B    |W    |B    |W    |B    |W    |"
##  [8] "   44 | JUSTIN D SCHILLING              |3.0  |B    |L  14|L  32|W  53|L  39|L  24|W  59|   MI | 15323504 / R: 1199   ->1199     |     |     |W    |B    |B    |W    |B    |W    |"
##  [9] "   46 | JACOB ALEXANDER LAVALLEY        |3.0  |W  35|L   7|L  27|L  50|W  64|W  43|L  23|   MI | 15490981 / R:  377P3 ->1076P10  |     |B    |W    |B    |W    |B    |W    |W    |"
## [10] "   14 | BRADLEY SHAW                    |4.5  |W  54|W  44|W   8|L   1|D  27|L   5|W  31|   MI | 10131499 / R: 1610   ->1618     |N:3  |W    |B    |W    |W    |B    |B    |W    |"

Step 4. Splitting elements

Recognizing that each column in the text table is separated by “|”, we can use str_split to split the elements of the vector into a list, where each list element is a vector that in turn contains one player’s data.

lst4 <- str_split(vec3, pattern = "[|]")
lst4[1]

## [[1]]
##  [1] "    1 "                           
##  [2] " GARY HUA                        "
##  [3] "6.0  "                            
##  [4] "W  39"                            
##  [5] "W  21"                            
##  [6] "W  18"                            
##  [7] "W  14"                            
##  [8] "W   7"                            
##  [9] "D  12"                            
## [10] "D   4"                            
## [11] "   ON "                           
## [12] " 15445895 / R: 1794   ->1817     "
## [13] "N:2  "                            
## [14] "W    "                            
## [15] "B    "                            
## [16] "W    "                            
## [17] "B    "                            
## [18] "W    "                            
## [19] "B    "                            
## [20] "W    "                            
## [21] ""

length(lst4) == 64

## [1] TRUE
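
For comparison, the same kind of split can be sketched with base R's strsplit on one shortened record. One difference worth knowing: strsplit drops the empty string after a trailing delimiter, whereas the str_split output above keeps it as the final "" element, so the two approaches differ by one element per record.

```r
# One shortened record in the same "|"-delimited layout.
rec <- "    1 | GARY HUA |6.0  |W  39|"

# fixed = TRUE treats "|" literally instead of as regex alternation.
fields <- strsplit(rec, "|", fixed = TRUE)[[1]]
fields

# Four fields: the empty piece after the last "|" is not returned.
length(fields)
```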

Step 5. Trim spaces

In this step, the list is flattened into a single vector again, and leading and trailing spaces are trimmed off the string elements.

vec5 <- str_trim(unlist(lst4))

Verifying the result by looking at the first and last player.

vec5[1:21]

##  [1] "1"                           "GARY HUA"                   
##  [3] "6.0"                         "W  39"                      
##  [5] "W  21"                       "W  18"                      
##  [7] "W  14"                       "W   7"                      
##  [9] "D  12"                       "D   4"                      
## [11] "ON"                          "15445895 / R: 1794   ->1817"
## [13] "N:2"                         "W"                          
## [15] "B"                           "W"                          
## [17] "B"                           "W"                          
## [19] "B"                           "W"                          
## [21] ""

vec5[(length(vec5)-20):length(vec5)]

##  [1] "64"                          "BEN LI"                     
##  [3] "1.0"                         "L  22"                      
##  [5] "D  30"                       "L  31"                      
##  [7] "D  49"                       "L  46"                      
##  [9] "L  42"                       "L  54"                      
## [11] "MI"                          "15006561 / R: 1163   ->1112"
## [13] ""                            "B"                          
## [15] "W"                           "W"                          
## [17] "B"                           "W"                          
## [19] "B"                           "B"                          
## [21] ""

So each player’s data is now captured in 21 string elements, the last of which is the empty string left after the final “|”. We should have 64x21 elements in the new vector.

length(vec5) == 64*21

## [1] TRUE
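
A minimal sketch of the flatten-and-trim step on a toy list of two shortened records, using base R's trimws as an assumed stand-in for str_trim. Only leading and trailing whitespace is removed, which is why elements such as "W  39" keep their internal spaces.

```r
# Toy list: one character vector per player (shortened to three fields).
lst <- list(c("    1 ", " GARY HUA  ", "6.0  "),
            c("    2 ", " DAKSHESH DARURI ", "6.0  "))

# unlist concatenates the vectors in order; trimws strips outer whitespace.
flat <- trimws(unlist(lst))
flat

# Two players times three fields each.
length(flat) == 2 * 3
```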

Step 6. Extract the ratings

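In this step, each player’s pre-tournament rating is extracted using the str_extract_all function. Then, str_detect is used to flag the elements that contain the rating information (those holding the 8-digit USCF ID), so that the extracted ratings can be written back into the vector in place of those elements. This relies on each flagged element yielding exactly one match, so that ratings and ratings_pos line up one-to-one.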

ratings <- unlist(str_extract_all(vec5, pattern = "R:  ?[0-9]{3,4}"))
ratings_pos <- str_detect(vec5, pattern = "[0-9]{8} ")
vec6 <- vec5
vec6[ratings_pos] <- ratings
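
To see what the rating pattern captures, here is a sketch on two ID/rating fields taken from the output above, one regular and one provisional rating. It uses base R's regexpr and regmatches rather than str_extract_all (an assumption for portability). The optional second space in the pattern is what allows three-digit ratings such as 377 to match.

```r
# One regular rating and one provisional rating (with a "P" suffix).
toy <- c("15445895 / R: 1794   ->1817",
         "15490981 / R:  377P3 ->1076P10")

# regexpr locates the first match in each string; regmatches extracts it.
m <- regexpr("R:  ?[0-9]{3,4}", toy)
pre <- regmatches(toy, m)
pre

# Drop the "R:" prefix and convert; as.numeric ignores the leading spaces.
pre_num <- as.numeric(sub("R:", "", pre, fixed = TRUE))
pre_num
```

The provisional suffix and the post-tournament rating after "->" are both left out of the match.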

Step 7. Remove Letters

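In this step, the result letters “W”, “L”, and “D” are removed from the string elements containing match information, so that only opponent numbers are left. The “R:” prefix is also removed from the rating information. Other result codes, such as “H” and “U”, are left as they are. Note that the pattern is not anchored to the start of the string, so a “W”, “L”, or “D” followed by a space is also removed from inside a player name (for example “DANIEL KHAIN”), which appears to be why player 61 is shown as “JEZZEFARKAS” in the output further down.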

vec7 <- str_replace_all(vec6, pattern = "[WLD] |R: ", replacement = "")

Taking a look at the result:

vec7[1:42]

##  [1] "1"               "GARY HUA"        "6.0"            
##  [4] " 39"             " 21"             " 18"            
##  [7] " 14"             "  7"             " 12"            
## [10] "  4"             "ON"              "1794"           
## [13] "N:2"             "W"               "B"              
## [16] "W"               "B"               "W"              
## [19] "B"               "W"               ""               
## [22] "2"               "DAKSHESH DARURI" "6.0"            
## [25] " 63"             " 58"             "  4"            
## [28] " 17"             " 16"             " 20"            
## [31] "  7"             "MI"              "1553"           
## [34] "N:2"             "B"               "W"              
## [37] "B"               "W"               "B"              
## [40] "W"               "B"               ""

vec7[(length(vec7)-20):length(vec7)]

##  [1] "64"     "BEN LI" "1.0"    " 22"    " 30"    " 31"    " 49"   
##  [8] " 46"    " 42"    " 54"    "MI"     "1163"   ""       "B"     
## [15] "W"      "W"      "B"      "W"      "B"      "B"      ""

length(vec7) == 64*21

## [1] TRUE
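
One possible way to keep player names intact is to anchor the pattern to the start of the string. The sketch below is not part of the original analysis; it compares the two patterns on a few elements using base R's gsub and sub (assumed stand-ins for str_replace_all and str_replace).

```r
# Two match results, one player name from the data, and a bye code.
toy <- c("W  39", "D  12", "DANIEL KHAIN", "H")

# Unanchored, as in the step above: "L " inside the name is removed too.
unanchored <- gsub("[WLD] ", "", toy)
unanchored

# Anchored with ^: only a result letter at the start of the element is removed.
anchored <- sub("^[WLD] ", "", toy)
anchored
```

The match elements come out the same either way; only the name differs. An anchored pattern would still alter a name that begins with a lone "W", "L", or "D" initial, so this is a sketch rather than a complete fix.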

Creating data.frame Object

Now we can create the data.frame object.

Each type of data recurs every 21 elements in the vector, so the seq function can be used to pick out all elements of the same type.
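
The stride-based indexing can be sketched on a toy vector with 3 fields per player instead of 21 (values are made up). The same selection can also be expressed by filling a matrix row by row, which is shown only as an equivalent view, not as the method used in this project.

```r
# Toy flat vector: 2 players x 3 fields (number, name, points).
flat <- c("1", "ANN", "6.0", "2", "BOB", "5.5")
n_fields <- 3

# Names sit at position 2 and then every n_fields elements after that.
names_by_seq <- flat[seq(2, length(flat), by = n_fields)]
names_by_seq

# Equivalent view: one row per player, one column per field.
m <- matrix(flat, ncol = n_fields, byrow = TRUE)
names_by_col <- m[, 2]
names_by_col
```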

First, we create vectors containing the information we need:

player_num <- as.numeric(vec7[seq(1, length(vec7), 21)])
player_name <- vec7[seq(2, length(vec7), 21)]
player_state <- vec7[seq(11, length(vec7), 21)]
points <- as.numeric(vec7[seq(3, length(vec7), 21)])
prerating <- as.numeric(vec7[seq(12, length(vec7), 21)])
round1 <- as.numeric(vec7[seq(4, length(vec7), 21)])

## Warning: NAs introduced by coercion

round2 <- as.numeric(vec7[seq(5, length(vec7), 21)])

## Warning: NAs introduced by coercion

round3 <- as.numeric(vec7[seq(6, length(vec7), 21)])

## Warning: NAs introduced by coercion

round4 <- as.numeric(vec7[seq(7, length(vec7), 21)])

## Warning: NAs introduced by coercion

round5 <- as.numeric(vec7[seq(8, length(vec7), 21)])

## Warning: NAs introduced by coercion

round6 <- as.numeric(vec7[seq(9, length(vec7), 21)])

## Warning: NAs introduced by coercion

round7 <- as.numeric(vec7[seq(10, length(vec7), 21)])

## Warning: NAs introduced by coercion

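When as.numeric is applied to a string that cannot be parsed as a number, it returns NA and emits the warning shown above. Here this happens for rounds with no opponent, where the element holds a code such as “H” or “U” rather than a player number.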
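
The coercion behaviour can be seen on a handful of elements like those in the round columns. In this sketch suppressWarnings is used only to keep the example quiet; whether to silence the warning in a real analysis is a judgment call.

```r
# Two opponent numbers (with leading spaces) and two bye/unplayed codes.
toy <- c(" 39", "  7", "H", "U")

# Leading spaces are accepted; the letter codes cannot be parsed and become NA.
vals <- suppressWarnings(as.numeric(toy))
vals

# Flag the rounds without an opponent.
is.na(vals)
```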

We now create the data.frame object:

tour.data <- data.frame(player_num, player_name, player_state, points, prerating, round1, round2, round3, round4, round5, round6, round7)
str(tour.data)

## 'data.frame':    64 obs. of  12 variables:
##  $ player_num  : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ player_name : Factor w/ 64 levels "ADITYA BAJAJ",..: 24 12 1 51 28 27 23 21 59 5 ...
##  $ player_state: Factor w/ 3 levels "MI","OH","ON": 3 1 1 1 1 2 1 1 3 1 ...
##  $ points      : num  6 6 6 5.5 5.5 5 5 5 5 5 ...
##  $ prerating   : num  1794 1553 1384 1716 1655 ...
##  $ round1      : num  39 63 8 23 45 34 57 3 25 16 ...
##  $ round2      : num  21 58 61 28 37 29 46 32 18 19 ...
##  $ round3      : num  18 4 25 2 12 11 13 14 59 55 ...
##  $ round4      : num  14 17 21 26 13 35 11 9 8 31 ...
##  $ round5      : num  7 16 11 5 4 10 1 47 26 6 ...
##  $ round6      : num  12 20 13 19 14 27 9 28 7 25 ...
##  $ round7      : num  4 7 12 1 17 21 2 19 20 18 ...

head(tour.data)

##   player_num         player_name player_state points prerating round1
## 1          1            GARY HUA           ON    6.0      1794     39
## 2          2     DAKSHESH DARURI           MI    6.0      1553     63
## 3          3        ADITYA BAJAJ           MI    6.0      1384      8
## 4          4 PATRICK H SCHILLING           MI    5.5      1716     23
## 5          5          HANSHI ZUO           MI    5.5      1655     45
## 6          6         HANSEN SONG           OH    5.0      1686     34
##   round2 round3 round4 round5 round6 round7
## 1     21     18     14      7     12      4
## 2     58      4     17     16     20      7
## 3     61     25     21     11     13     12
## 4     28      2     26      5     19      1
## 5     37     12     13      4     14     17
## 6     29     11     35     10     27     21

tail(tour.data)

##    player_num          player_name player_state points prerating round1
## 59         59    SEAN M MC CORMICK           MI    2.0       853     41
## 60         60           JULIA SHEN           MI    1.5       967     33
## 61         61          JEZZEFARKAS           ON    1.5       955     32
## 62         62        ASHWIN BALAJI           MI    1.0      1530     55
## 63         63 THOMAS JOSEPH HOSMER           MI    1.0      1175      2
## 64         64               BEN LI           MI    1.0      1163     22
##    round2 round3 round4 round5 round6 round7
## 59     NA      9     40     43     54     44
## 60     34     45     42     24     NA     NA
## 61      3     54     47     42     30     37
## 62     NA     NA     NA     NA     NA     NA
## 63     48     49     43     45     NA     NA
## 64     30     31     49     46     42     54

Manipulating data.frame Object

The round1 through round7 columns hold opponent player numbers. Because player_num matches the row number, these values can be used directly as indices into the prerating column, which replaces each opponent number with that opponent’s pre-tournament rating. An NA opponent stays NA.

tour.data$round1 <- tour.data$prerating[tour.data$round1]
tour.data$round2 <- tour.data$prerating[tour.data$round2]
tour.data$round3 <- tour.data$prerating[tour.data$round3]
tour.data$round4 <- tour.data$prerating[tour.data$round4]
tour.data$round5 <- tour.data$prerating[tour.data$round5]
tour.data$round6 <- tour.data$prerating[tour.data$round6]
tour.data$round7 <- tour.data$prerating[tour.data$round7]
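
The lookup can be sketched on a toy data frame with three players and two rounds; the opponent pairings are made up. The loop is just a compact, hypothetical alternative to writing one assignment per round column.

```r
# Row i holds player i, so an opponent number doubles as a row index.
toy <- data.frame(prerating = c(1794, 1553, 1384),
                  round1    = c(2, 1, NA),   # player 3 had no opponent
                  round2    = c(3, NA, 1))   # player 2 had no opponent

# Replace each opponent number with that opponent's pre-rating.
for (col in c("round1", "round2")) {
  toy[[col]] <- toy$prerating[toy[[col]]]
}
toy
```

Indexing with NA returns NA, so rounds without an opponent carry through as missing values.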

Now we can calculate each player’s average opponent pre-rating using the rowMeans function. Setting na.rm = TRUE averages only over the rounds that were actually played.

avgrating <- rowMeans(tour.data[,6:12], na.rm = TRUE)
round(avgrating)

##  [1] 1605 1469 1564 1574 1501 1519 1372 1468 1523 1554 1468 1506 1498 1515
## [15] 1484 1386 1499 1480 1426 1411 1470 1300 1214 1357 1363 1507 1222 1522
## [29] 1314 1144 1260 1379 1277 1375 1150 1388 1385 1539 1430 1391 1248 1150
## [43] 1107 1327 1152 1358 1392 1356 1286 1296 1356 1495 1345 1206 1406 1414
## [57] 1363 1391 1319 1330 1327 1186 1350 1263

tour.data$avgrating <- round(avgrating)
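
The effect of na.rm = TRUE can be seen on a toy data frame with made-up opponent ratings. Without it, a single missing round makes the whole row mean NA.

```r
# Two players, two rounds; the second player missed round 2.
toy <- data.frame(round1 = c(1553, 1794),
                  round2 = c(1385, NA))

# Default: any NA in a row gives NA for that row.
with_na <- rowMeans(toy)
with_na

# na.rm = TRUE: the mean is taken over the non-missing values only.
avg <- rowMeans(toy, na.rm = TRUE)
avg
```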

Exporting .csv

Extract the columns needed for output.

result <- tour.data[, c("player_name", "player_state", "points", "prerating", "avgrating")]
head(result)

##           player_name player_state points prerating avgrating
## 1            GARY HUA           ON    6.0      1794      1605
## 2     DAKSHESH DARURI           MI    6.0      1553      1469
## 3        ADITYA BAJAJ           MI    6.0      1384      1564
## 4 PATRICK H SCHILLING           MI    5.5      1716      1574
## 5          HANSHI ZUO           MI    5.5      1655      1501
## 6         HANSEN SONG           OH    5.0      1686      1519

tail(result)

##             player_name player_state points prerating avgrating
## 59    SEAN M MC CORMICK           MI    2.0       853      1319
## 60           JULIA SHEN           MI    1.5       967      1330
## 61          JEZZEFARKAS           ON    1.5       955      1327
## 62        ASHWIN BALAJI           MI    1.0      1530      1186
## 63 THOMAS JOSEPH HOSMER           MI    1.0      1175      1350
## 64               BEN LI           MI    1.0      1163      1263

Use write.csv to create the .csv file.

write.csv(result, file = "proj-1-result.csv")
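
By default write.csv also writes the row names as an unnamed first column. If that column is not wanted, row.names = FALSE leaves it out. The sketch below uses a two-row toy data frame and a temporary file, both chosen only for illustration.

```r
# Toy result table and a throwaway output path.
toy <- data.frame(player_name = c("GARY HUA", "BEN LI"),
                  points      = c(6.0, 1.0))
out <- tempfile(fileext = ".csv")

# row.names = FALSE omits the leading row-number column.
write.csv(toy, file = out, row.names = FALSE)

# Read the file back to confirm only the two data columns were written.
back <- read.csv(out)
names(back)
```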

The exported .csv can be found here: https://raw.githubusercontent.com/Tyllis/Data607/master/proj-1-result.csv