DATA 607 Project 1

The goal of this project is to take data seemingly complex and structure and to transform it into a structured data set that can be used for downstream analysis.

The raw data can be found here https://raw.githubusercontent.com/vindication09/DATA607_Project1/master/RawChessData We will be working with an ELO chess cross table. We want to transform the information from this cross table into a data frame that contains the following information: player number, player name, state, total points, pre ratings, and avg opponent pre ratings.

My overall strategy is to get each column as its own data frame, and then merge them into a big data frame by appending all together.

I copy and pased the data from the URL into a text editor and saved it as a .txt file. I will be reading in my data from that directory. Once it is read in, we will examine the first 10 rows to get a sense of the structure and patterns.

ELOsheet <- read.csv(paste0("~/Desktop/DATA Science SPS/DATA 607/Week 3/ELOsheet.txt"))
head(ELOsheet, 10)

##    X.........................................................................................
## 1   Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| 
## 2   Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | 
## 3   -----------------------------------------------------------------------------------------
## 4       1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|
## 5      ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |
## 6   -----------------------------------------------------------------------------------------
## 7       2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|
## 8      MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |
## 9   -----------------------------------------------------------------------------------------
## 10      3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|

Right away, I notice that the first two rows contain just headings. I can remove them. I will keep the line because the line appears every third row. When I do subsets later on, I can use this fact to manage my subsets. We then examine the first 10 rows and make a note of the new structure.

The lines have two rows between them. I refer to the information between lines a s asubsection. The first subsection is between rows 2 and 3. The second subsection is between rows 5 and 6. Since there is an obvious patter, I can use the sequence function to split the data into two subsets to get the information I need.

ELOsheet2<-ELOsheet[-c(1:2),]
head(ELOsheet2, 10)

##  [1] -----------------------------------------------------------------------------------------
##  [2]     1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|
##  [3]    ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |
##  [4] -----------------------------------------------------------------------------------------
##  [5]     2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|
##  [6]    MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |
##  [7] -----------------------------------------------------------------------------------------
##  [8]     3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|
##  [9]    MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |
## [10] -----------------------------------------------------------------------------------------
## 131 Levels:     1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4| ...

How do I determine my subsets? If I want information on the player number, name, total points, and opponents, I want to take the 2nd row of every subsection.

If I want information on State, pre rating, and post rating I want the third row of each subset.

Information on sequenc could be found here http://www.endmemo.com/program/R/seq.php

#to grab the 1st row of each subsection, I want to skip the first row, grab the second, skip the third and 4th then repeat 
ELOsubsheet1<-ELOsheet2[seq(2, length(ELOsheet2), 3)]
head(ELOsubsheet1)

## [1]     1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|
## [2]     2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|
## [3]     3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|
## [4]     4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|
## [5]     5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|
## [6]     6 | HANSEN SONG                     |5.0  |W  34|D  29|L  11|W  35|D  10|W  27|W  21|
## 131 Levels:     1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4| ...

#to grab the 2nd row of each subsection, I need to skip row 1, 2 grab 3, skip 4 then repeat  
ELOsubsheet2<-ELOsheet2[seq(3, length(ELOsheet2), 3)]
head(ELOsubsheet2)

## [1]    ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |
## [2]    MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |
## [3]    MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |
## [4]    MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |
## [5]    MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |
## [6]    OH | 15055204 / R: 1686   ->1687     |N:3  |W    |B    |W    |B    |B    |W    |B    |
## 131 Levels:     1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4| ...

With our two subsheets created, we can use both of them to exctract the required information. We need to use our knowledge of regular expressions and the stringr library.

library(stringr)

I used the following cheet sheet to figuire out the regular expressions. It is very handy for these cases because it includes a breakdown of each pattern with description http://www.cbs.dtu.dk/courses/27610/regular-expressions-cheat-sheet-v2.pdf

Note that on some occasions, you will need to use the replace function in order to eliminate some of the “junk”

#use subsheet1
#i noticed that names are all upper case, we can use this to our advantage 
ELOname <- unlist(str_extract_all(ELOsubsheet1, "\\| [[:upper:]- ]{4,} \\|"))
ELOname <- str_replace_all(ELOname, pattern = "(\\| )|([[:space:]]{1,}\\|)", replacement = "")
df.ELOname <- data.frame(ELOname)
df.ELOname

##                       ELOname
## 1                    GARY HUA
## 2             DAKSHESH DARURI
## 3                ADITYA BAJAJ
## 4         PATRICK H SCHILLING
## 5                  HANSHI ZUO
## 6                 HANSEN SONG
## 7           GARY DEE SWATHELL
## 8            EZEKIEL HOUGHTON
## 9                 STEFANO LEE
## 10                  ANVIT RAO
## 11   CAMERON WILLIAM MC LEMAN
## 12             KENNETH J TACK
## 13          TORRANCE HENRY JR
## 14               BRADLEY SHAW
## 15     ZACHARY JAMES HOUGHTON
## 16               MIKE NIKITIN
## 17         RONALD GRZEGORCZYK
## 18              DAVID SUNDEEN
## 19               DIPANKAR ROY
## 20                JASON ZHENG
## 21              DINH DANG BUI
## 22           EUGENE L MCCLURE
## 23                   ALAN BUI
## 24          MICHAEL R ALDRICH
## 25           LOREN SCHWIEBERT
## 26                    MAX ZHU
## 27             GAURAV GIDWANI
## 28 SOFIA ADINA STANESCU-BELLU
## 29           CHIEDOZIE OKORIE
## 30         GEORGE AVERY JONES
## 31               RISHI SHETTY
## 32      JOSHUA PHILIP MATHEWS
## 33                    JADE GE
## 34     MICHAEL JEFFERY THOMAS
## 35           JOSHUA DAVID LEE
## 36              SIDDHARTH JHA
## 37       AMIYATOSH PWNANANDAM
## 38                  BRIAN LIU
## 39              JOEL R HENDON
## 40               FOREST ZHANG
## 41        KYLE WILLIAM MURPHY
## 42                   JARED GE
## 43          ROBERT GLEN VASEY
## 44         JUSTIN D SCHILLING
## 45                  DEREK YAN
## 46   JACOB ALEXANDER LAVALLEY
## 47                ERIC WRIGHT
## 48               DANIEL KHAIN
## 49           MICHAEL J MARTIN
## 50                 SHIVAM JHA
## 51             TEJAS AYYAGARI
## 52                  ETHAN GUO
## 53              JOSE C YBARRA
## 54                LARRY HODGE
## 55                  ALEX KONG
## 56               MARISA RICCI
## 57                 MICHAEL LU
## 58               VIRAJ MOHILE
## 59          SEAN M MC CORMICK
## 60                 JULIA SHEN
## 61              JEZZEL FARKAS
## 62              ASHWIN BALAJI
## 63       THOMAS JOSEPH HOSMER
## 64                     BEN LI

#use subsheet 2
#the next thing we need to extract are the States the players come from 
#I notice that states are upper case but also have abbreviations. There is no pipe before states 
#This is something I can use to my advantage in order to extract them. 
ELOstate <- unlist(str_extract_all(ELOsubsheet2, "\\ [[:space:]]{1,}[[A-Z]]{2} \\|"))
ELOstate <- str_replace_all(ELOstate, pattern = "(\\|[[:space:]]{1,})|([[:space:]]{1,}\\|)", replacement = "")
df.ELOstate <- data.frame(ELOstate)
df.ELOstate

##    ELOstate
## 1        ON
## 2        MI
## 3        MI
## 4        MI
## 5        MI
## 6        OH
## 7        MI
## 8        MI
## 9        ON
## 10       MI
## 11       MI
## 12       MI
## 13       MI
## 14       MI
## 15       MI
## 16       MI
## 17       MI
## 18       MI
## 19       MI
## 20       MI
## 21       ON
## 22       MI
## 23       ON
## 24       MI
## 25       MI
## 26       ON
## 27       MI
## 28       MI
## 29       MI
## 30       ON
## 31       MI
## 32       ON
## 33       MI
## 34       MI
## 35       MI
## 36       MI
## 37       MI
## 38       MI
## 39       MI
## 40       MI
## 41       MI
## 42       MI
## 43       MI
## 44       MI
## 45       MI
## 46       MI
## 47       MI
## 48       MI
## 49       MI
## 50       MI
## 51       MI
## 52       MI
## 53       MI
## 54       MI
## 55       MI
## 56       MI
## 57       MI
## 58       MI
## 59       MI
## 60       MI
## 61       ON
## 62       MI
## 63       MI
## 64       MI

#use subsheet 1
#The next item on the list is to extract is the total number of points 
#I noticed that points are in the form n.n. The are also between pipes
#Lets take n.n skip the space before the righthand pipe 
ELOtotalpoints <- unlist(str_extract_all(ELOsubsheet1, "\\|[[:digit:].[:digit:]]{3}[[:space:]]{1,}\\|"))
ELOtotalpoints <- str_replace_all(ELOtotalpoints, pattern = "(\\|)|([[:space:]]{1,}\\|)", replacement = "")
df.ELOtotalpoints <- data.frame(as.numeric(ELOtotalpoints))
df.ELOtotalpoints

##    as.numeric.ELOtotalpoints.
## 1                         6.0
## 2                         6.0
## 3                         6.0
## 4                         5.5
## 5                         5.5
## 6                         5.0
## 7                         5.0
## 8                         5.0
## 9                         5.0
## 10                        5.0
## 11                        4.5
## 12                        4.5
## 13                        4.5
## 14                        4.5
## 15                        4.5
## 16                        4.0
## 17                        4.0
## 18                        4.0
## 19                        4.0
## 20                        4.0
## 21                        4.0
## 22                        4.0
## 23                        4.0
## 24                        4.0
## 25                        3.5
## 26                        3.5
## 27                        3.5
## 28                        3.5
## 29                        3.5
## 30                        3.5
## 31                        3.5
## 32                        3.5
## 33                        3.5
## 34                        3.5
## 35                        3.5
## 36                        3.5
## 37                        3.5
## 38                        3.0
## 39                        3.0
## 40                        3.0
## 41                        3.0
## 42                        3.0
## 43                        3.0
## 44                        3.0
## 45                        3.0
## 46                        3.0
## 47                        2.5
## 48                        2.5
## 49                        2.5
## 50                        2.5
## 51                        2.5
## 52                        2.5
## 53                        2.0
## 54                        2.0
## 55                        2.0
## 56                        2.0
## 57                        2.0
## 58                        2.0
## 59                        2.0
## 60                        1.5
## 61                        1.5
## 62                        1.0
## 63                        1.0
## 64                        1.0

#use subsheet 2
#The next item on the list that needs to be extracted is the players pre-rating 
#the pre rating is to the right of R: and to the left of  spaces and arrow ->
#this can be used to our advantage to extract the pre rating 
ELOprerating <- unlist(str_extract_all(ELOsubsheet2, "[:] [[:alnum:] ]{2,9}\\-\\>"))
ELOprerating <- str_replace_all(ELOprerating, pattern = "(\\: )|(\\s{1,}\\-\\>)|([O-Q]\\d{1,2})|(\\-\\>)", replacement = "")
ELOprerating <- as.numeric(ELOprerating)
df.ELOprerating<-data.frame(as.numeric(ELOprerating))
df.ELOprerating

##    as.numeric.ELOprerating.
## 1                      1794
## 2                      1553
## 3                      1384
## 4                      1716
## 5                      1655
## 6                      1686
## 7                      1649
## 8                      1641
## 9                      1411
## 10                     1365
## 11                     1712
## 12                     1663
## 13                     1666
## 14                     1610
## 15                     1220
## 16                     1604
## 17                     1629
## 18                     1600
## 19                     1564
## 20                     1595
## 21                     1563
## 22                     1555
## 23                     1363
## 24                     1229
## 25                     1745
## 26                     1579
## 27                     1552
## 28                     1507
## 29                     1602
## 30                     1522
## 31                     1494
## 32                     1441
## 33                     1449
## 34                     1399
## 35                     1438
## 36                     1355
## 37                      980
## 38                     1423
## 39                     1436
## 40                     1348
## 41                     1403
## 42                     1332
## 43                     1283
## 44                     1199
## 45                     1242
## 46                      377
## 47                     1362
## 48                     1382
## 49                     1291
## 50                     1056
## 51                     1011
## 52                      935
## 53                     1393
## 54                     1270
## 55                     1186
## 56                     1153
## 57                     1092
## 58                      917
## 59                      853
## 60                      967
## 61                      955
## 62                     1530
## 63                     1175
## 64                     1163

How did we do so far? Lets make a partial data frame to make sure we have collected the information in the right place using the correct data types.

#computing the average opponent pre rating requires more manipulation of the original data frame 
partialcsv<-data.frame(df.ELOname, df.ELOstate, df.ELOtotalpoints, ELOprerating)
partialcsv

##                       ELOname ELOstate as.numeric.ELOtotalpoints.
## 1                    GARY HUA       ON                        6.0
## 2             DAKSHESH DARURI       MI                        6.0
## 3                ADITYA BAJAJ       MI                        6.0
## 4         PATRICK H SCHILLING       MI                        5.5
## 5                  HANSHI ZUO       MI                        5.5
## 6                 HANSEN SONG       OH                        5.0
## 7           GARY DEE SWATHELL       MI                        5.0
## 8            EZEKIEL HOUGHTON       MI                        5.0
## 9                 STEFANO LEE       ON                        5.0
## 10                  ANVIT RAO       MI                        5.0
## 11   CAMERON WILLIAM MC LEMAN       MI                        4.5
## 12             KENNETH J TACK       MI                        4.5
## 13          TORRANCE HENRY JR       MI                        4.5
## 14               BRADLEY SHAW       MI                        4.5
## 15     ZACHARY JAMES HOUGHTON       MI                        4.5
## 16               MIKE NIKITIN       MI                        4.0
## 17         RONALD GRZEGORCZYK       MI                        4.0
## 18              DAVID SUNDEEN       MI                        4.0
## 19               DIPANKAR ROY       MI                        4.0
## 20                JASON ZHENG       MI                        4.0
## 21              DINH DANG BUI       ON                        4.0
## 22           EUGENE L MCCLURE       MI                        4.0
## 23                   ALAN BUI       ON                        4.0
## 24          MICHAEL R ALDRICH       MI                        4.0
## 25           LOREN SCHWIEBERT       MI                        3.5
## 26                    MAX ZHU       ON                        3.5
## 27             GAURAV GIDWANI       MI                        3.5
## 28 SOFIA ADINA STANESCU-BELLU       MI                        3.5
## 29           CHIEDOZIE OKORIE       MI                        3.5
## 30         GEORGE AVERY JONES       ON                        3.5
## 31               RISHI SHETTY       MI                        3.5
## 32      JOSHUA PHILIP MATHEWS       ON                        3.5
## 33                    JADE GE       MI                        3.5
## 34     MICHAEL JEFFERY THOMAS       MI                        3.5
## 35           JOSHUA DAVID LEE       MI                        3.5
## 36              SIDDHARTH JHA       MI                        3.5
## 37       AMIYATOSH PWNANANDAM       MI                        3.5
## 38                  BRIAN LIU       MI                        3.0
## 39              JOEL R HENDON       MI                        3.0
## 40               FOREST ZHANG       MI                        3.0
## 41        KYLE WILLIAM MURPHY       MI                        3.0
## 42                   JARED GE       MI                        3.0
## 43          ROBERT GLEN VASEY       MI                        3.0
## 44         JUSTIN D SCHILLING       MI                        3.0
## 45                  DEREK YAN       MI                        3.0
## 46   JACOB ALEXANDER LAVALLEY       MI                        3.0
## 47                ERIC WRIGHT       MI                        2.5
## 48               DANIEL KHAIN       MI                        2.5
## 49           MICHAEL J MARTIN       MI                        2.5
## 50                 SHIVAM JHA       MI                        2.5
## 51             TEJAS AYYAGARI       MI                        2.5
## 52                  ETHAN GUO       MI                        2.5
## 53              JOSE C YBARRA       MI                        2.0
## 54                LARRY HODGE       MI                        2.0
## 55                  ALEX KONG       MI                        2.0
## 56               MARISA RICCI       MI                        2.0
## 57                 MICHAEL LU       MI                        2.0
## 58               VIRAJ MOHILE       MI                        2.0
## 59          SEAN M MC CORMICK       MI                        2.0
## 60                 JULIA SHEN       MI                        1.5
## 61              JEZZEL FARKAS       ON                        1.5
## 62              ASHWIN BALAJI       MI                        1.0
## 63       THOMAS JOSEPH HOSMER       MI                        1.0
## 64                     BEN LI       MI                        1.0
##    ELOprerating
## 1          1794
## 2          1553
## 3          1384
## 4          1716
## 5          1655
## 6          1686
## 7          1649
## 8          1641
## 9          1411
## 10         1365
## 11         1712
## 12         1663
## 13         1666
## 14         1610
## 15         1220
## 16         1604
## 17         1629
## 18         1600
## 19         1564
## 20         1595
## 21         1563
## 22         1555
## 23         1363
## 24         1229
## 25         1745
## 26         1579
## 27         1552
## 28         1507
## 29         1602
## 30         1522
## 31         1494
## 32         1441
## 33         1449
## 34         1399
## 35         1438
## 36         1355
## 37          980
## 38         1423
## 39         1436
## 40         1348
## 41         1403
## 42         1332
## 43         1283
## 44         1199
## 45         1242
## 46          377
## 47         1362
## 48         1382
## 49         1291
## 50         1056
## 51         1011
## 52          935
## 53         1393
## 54         1270
## 55         1186
## 56         1153
## 57         1092
## 58          917
## 59          853
## 60          967
## 61          955
## 62         1530
## 63         1175
## 64         1163

We have 4 of the 5 columns needed. Computing the average opponent pre rating is going to be the most difficult part of this operation.

I had toyed with the idea of using melt to flatten and make 10 additional columns replacing opponent number with score and then making an 11th column for the average but I could not figure out the syntax. I also toyed with using sqldf and take advantage of the fact tha player to opponent is a one to many relationship.

I settled with trying to write a loop to compute the average pre rating, however I am eager to see if there are solutions using the methods I mentioned above.

Before writing my loop, I need information regarding the player number and the opponent number. I will use similar reg expression to extract those pieces of information.

#use subsheet1
#We need to compute the average pre rating for opponents by player 
#I first need to extract the opponents into their own data frame
#We can extract digits using d and add + to keep going till it hits pipe
ELOopponent<-unlist(str_extract_all(ELOsubsheet1, "\\d+\\|" ), "\\d+")
ELOopponent<-str_replace_all(ELOopponent, pattern="\\|", replace="")
ELOopponent<-as.integer(ELOopponent)
df.ELOopponent<-data.frame(as.integer(ELOopponent))
df.ELOopponent

##     as.integer.ELOopponent.
## 1                        39
## 2                        21
## 3                        18
## 4                        14
## 5                         7
## 6                        12
## 7                         4
## 8                        63
## 9                        58
## 10                        4
## 11                       17
## 12                       16
## 13                       20
## 14                        7
## 15                        8
## 16                       61
## 17                       25
## 18                       21
## 19                       11
## 20                       13
## 21                       12
## 22                       23
## 23                       28
## 24                        2
## 25                       26
## 26                        5
## 27                       19
## 28                        1
## 29                       45
## 30                       37
## 31                       12
## 32                       13
## 33                        4
## 34                       14
## 35                       17
## 36                       34
## 37                       29
## 38                       11
## 39                       35
## 40                       10
## 41                       27
## 42                       21
## 43                       57
## 44                       46
## 45                       13
## 46                       11
## 47                        1
## 48                        9
## 49                        2
## 50                        3
## 51                       32
## 52                       14
## 53                        9
## 54                       47
## 55                       28
## 56                       19
## 57                       25
## 58                       18
## 59                       59
## 60                        8
## 61                       26
## 62                        7
## 63                       20
## 64                       16
## 65                       19
## 66                       55
## 67                       31
## 68                        6
## 69                       25
## 70                       18
## 71                       38
## 72                       56
## 73                        6
## 74                        7
## 75                        3
## 76                       34
## 77                       26
## 78                       42
## 79                       33
## 80                        5
## 81                       38
## 82                        1
## 83                        3
## 84                       36
## 85                       27
## 86                        7
## 87                        5
## 88                       33
## 89                        3
## 90                       32
## 91                       54
## 92                       44
## 93                        8
## 94                        1
## 95                       27
## 96                        5
## 97                       31
## 98                       19
## 99                       16
## 100                      30
## 101                      22
## 102                      54
## 103                      33
## 104                      38
## 105                      10
## 106                      15
## 107                      39
## 108                       2
## 109                      36
## 110                      48
## 111                      41
## 112                      26
## 113                       2
## 114                      23
## 115                      22
## 116                       5
## 117                      47
## 118                       9
## 119                       1
## 120                      32
## 121                      19
## 122                      38
## 123                      10
## 124                      15
## 125                      10
## 126                      52
## 127                      28
## 128                      18
## 129                       4
## 130                       8
## 131                      40
## 132                      49
## 133                      23
## 134                      41
## 135                      28
## 136                       2
## 137                       9
## 138                      43
## 139                       1
## 140                      47
## 141                       3
## 142                      40
## 143                      39
## 144                       6
## 145                      64
## 146                      52
## 147                      28
## 148                      15
## 149                      17
## 150                      40
## 151                       4
## 152                      43
## 153                      20
## 154                      58
## 155                      17
## 156                      37
## 157                      46
## 158                      28
## 159                      47
## 160                      43
## 161                      25
## 162                      60
## 163                      44
## 164                      39
## 165                       9
## 166                      53
## 167                       3
## 168                      24
## 169                      34
## 170                      10
## 171                      47
## 172                      49
## 173                      40
## 174                      17
## 175                       4
## 176                       9
## 177                      32
## 178                      11
## 179                      51
## 180                      13
## 181                      46
## 182                      37
## 183                      14
## 184                       6
## 185                      24
## 186                       4
## 187                      22
## 188                      19
## 189                      20
## 190                       8
## 191                      36
## 192                      50
## 193                       6
## 194                      38
## 195                      34
## 196                      52
## 197                      48
## 198                      52
## 199                      64
## 200                      15
## 201                      55
## 202                      31
## 203                      61
## 204                      50
## 205                      58
## 206                      55
## 207                      64
## 208                      10
## 209                      30
## 210                      50
## 211                      14
## 212                      61
## 213                       8
## 214                      44
## 215                      18
## 216                      51
## 217                      26
## 218                      13
## 219                      60
## 220                      12
## 221                      50
## 222                      36
## 223                      13
## 224                      15
## 225                      51
## 226                       6
## 227                      60
## 228                      37
## 229                      29
## 230                      25
## 231                      11
## 232                      52
## 233                      46
## 234                      38
## 235                      56
## 236                       6
## 237                      57
## 238                      52
## 239                      48
## 240                      13
## 241                      57
## 242                      51
## 243                      33
## 244                      16
## 245                      28
## 246                       5
## 247                      34
## 248                      27
## 249                      23
## 250                      61
## 251                      11
## 252                      35
## 253                      29
## 254                      12
## 255                      18
## 256                      15
## 257                       1
## 258                      54
## 259                      40
## 260                      16
## 261                      44
## 262                      21
## 263                      24
## 264                      20
## 265                      26
## 266                      39
## 267                      59
## 268                      21
## 269                      56
## 270                      22
## 271                      59
## 272                      17
## 273                      58
## 274                      20
## 275                      12
## 276                      50
## 277                      57
## 278                      60
## 279                      61
## 280                      64
## 281                      56
## 282                      21
## 283                      23
## 284                      24
## 285                      63
## 286                      59
## 287                      46
## 288                      55
## 289                      14
## 290                      32
## 291                      53
## 292                      39
## 293                      24
## 294                      59
## 295                       5
## 296                      51
## 297                      60
## 298                      56
## 299                      63
## 300                      55
## 301                      58
## 302                      35
## 303                       7
## 304                      27
## 305                      50
## 306                      64
## 307                      43
## 308                      23
## 309                      18
## 310                      24
## 311                      21
## 312                      61
## 313                       8
## 314                      51
## 315                      25
## 316                      17
## 317                      63
## 318                      52
## 319                      29
## 320                      35
## 321                      26
## 322                      20
## 323                      63
## 324                      64
## 325                      58
## 326                      29
## 327                      42
## 328                      33
## 329                      46
## 330                      31
## 331                      30
## 332                      27
## 333                      45
## 334                      36
## 335                      57
## 336                      32
## 337                      47
## 338                      33
## 339                      30
## 340                      22
## 341                      19
## 342                      48
## 343                      29
## 344                      35
## 345                      34
## 346                      25
## 347                      44
## 348                      57
## 349                      14
## 350                      39
## 351                      61
## 352                      15
## 353                      59
## 354                      64
## 355                      62
## 356                      31
## 357                      10
## 358                      30
## 359                      45
## 360                      43
## 361                      11
## 362                      35
## 363                      45
## 364                      40
## 365                      42
## 366                       7
## 367                      36
## 368                      42
## 369                      51
## 370                      35
## 371                      53
## 372                      31
## 373                       2
## 374                      41
## 375                      23
## 376                      49
## 377                      45
## 378                      41
## 379                       9
## 380                      40
## 381                      43
## 382                      54
## 383                      44
## 384                      33
## 385                      34
## 386                      45
## 387                      42
## 388                      24
## 389                      32
## 390                       3
## 391                      54
## 392                      47
## 393                      42
## 394                      30
## 395                      37
## 396                      55
## 397                       2
## 398                      48
## 399                      49
## 400                      43
## 401                      45
## 402                      22
## 403                      30
## 404                      31
## 405                      49
## 406                      46
## 407                      42
## 408                      54

#use subsheet 1
#the last piece we need in order to compute the average are the player numbers
#players are ordered 1-64. 
ELOplayer<-as.integer(str_extract(ELOsubsheet1, "\\d+"))
head(ELOplayer, 64)

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
## [47] 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64

df.ELOplayer<-data.frame(as.integer(ELOplayer))
df.ELOplayer

##    as.integer.ELOplayer.
## 1                      1
## 2                      2
## 3                      3
## 4                      4
## 5                      5
## 6                      6
## 7                      7
## 8                      8
## 9                      9
## 10                    10
## 11                    11
## 12                    12
## 13                    13
## 14                    14
## 15                    15
## 16                    16
## 17                    17
## 18                    18
## 19                    19
## 20                    20
## 21                    21
## 22                    22
## 23                    23
## 24                    24
## 25                    25
## 26                    26
## 27                    27
## 28                    28
## 29                    29
## 30                    30
## 31                    31
## 32                    32
## 33                    33
## 34                    34
## 35                    35
## 36                    36
## 37                    37
## 38                    38
## 39                    39
## 40                    40
## 41                    41
## 42                    42
## 43                    43
## 44                    44
## 45                    45
## 46                    46
## 47                    47
## 48                    48
## 49                    49
## 50                    50
## 51                    51
## 52                    52
## 53                    53
## 54                    54
## 55                    55
## 56                    56
## 57                    57
## 58                    58
## 59                    59
## 60                    60
## 61                    61
## 62                    62
## 63                    63
## 64                    64

For player number, there was no “junk” to replace since the player number is right at the edge.

I want my loop to do the following: Scan subsheet 1 and match the player number to each of its opponents. Then to take the pre rating for each opponent and find the mean of their associated pre ratings as a single column.

This is where I am stuck. My loop is not working properly as it is only taking the first opponent of each player. I would greatly appcreciate any feedback on how to best complete this portion of the project. In the meantime, I want to get more familiar with more advanced libraries that could do this more efficiently. A co-worker of mine suggested the purrr library but it is to new to me to figure out how to use it.

#use subsheet 1
#How do we compute the average opponent player rating? 
#the loop then fetches the ratings for each opponent and divides by number of rounds 
avg_ELOopp_rating<-length(ELOsubsheet1)
#loop
for (i in 1: length(ELOsubsheet1))
{
  avg_ELOopp_rating[i]<-mean(ELOprerating[as.numeric(unlist(ELOopponent[ELOplayer[i]]))])
}

df.avg_ELOopp_rating<-data.frame(as.numeric(avg_ELOopp_rating))
df.avg_ELOopp_rating

##    as.numeric.avg_ELOopp_rating.
## 1                           1436
## 2                           1563
## 3                           1600
## 4                           1610
## 5                           1649
## 6                           1663
## 7                           1716
## 8                           1175
## 9                            917
## 10                          1716
## 11                          1629
## 12                          1604
## 13                          1595
## 14                          1649
## 15                          1641
## 16                           955
## 17                          1745
## 18                          1563
## 19                          1712
## 20                          1666
## 21                          1663
## 22                          1363
## 23                          1507
## 24                          1553
## 25                          1579
## 26                          1655
## 27                          1564
## 28                          1794
## 29                          1242
## 30                           980
## 31                          1663
## 32                          1666
## 33                          1716
## 34                          1610
## 35                          1629
## 36                          1399
## 37                          1602
## 38                          1712
## 39                          1438
## 40                          1365
## 41                          1552
## 42                          1563
## 43                          1092
## 44                           377
## 45                          1666
## 46                          1712
## 47                          1794
## 48                          1411
## 49                          1553
## 50                          1384
## 51                          1441
## 52                          1610
## 53                          1411
## 54                          1362
## 55                          1507
## 56                          1564
## 57                          1745
## 58                          1600
## 59                           853
## 60                          1641
## 61                          1579
## 62                          1649
## 63                          1595
## 64                          1604

Proceeding with the assumption that my average was correct, we can put together our final data frame.

#Put together in a data frame 
csv<-data.frame(df.ELOplayer, df.ELOname, df.ELOstate, df.ELOtotalpoints, ELOprerating, df.avg_ELOopp_rating)
csv

##    as.integer.ELOplayer.                    ELOname ELOstate
## 1                      1                   GARY HUA       ON
## 2                      2            DAKSHESH DARURI       MI
## 3                      3               ADITYA BAJAJ       MI
## 4                      4        PATRICK H SCHILLING       MI
## 5                      5                 HANSHI ZUO       MI
## 6                      6                HANSEN SONG       OH
## 7                      7          GARY DEE SWATHELL       MI
## 8                      8           EZEKIEL HOUGHTON       MI
## 9                      9                STEFANO LEE       ON
## 10                    10                  ANVIT RAO       MI
## 11                    11   CAMERON WILLIAM MC LEMAN       MI
## 12                    12             KENNETH J TACK       MI
## 13                    13          TORRANCE HENRY JR       MI
## 14                    14               BRADLEY SHAW       MI
## 15                    15     ZACHARY JAMES HOUGHTON       MI
## 16                    16               MIKE NIKITIN       MI
## 17                    17         RONALD GRZEGORCZYK       MI
## 18                    18              DAVID SUNDEEN       MI
## 19                    19               DIPANKAR ROY       MI
## 20                    20                JASON ZHENG       MI
## 21                    21              DINH DANG BUI       ON
## 22                    22           EUGENE L MCCLURE       MI
## 23                    23                   ALAN BUI       ON
## 24                    24          MICHAEL R ALDRICH       MI
## 25                    25           LOREN SCHWIEBERT       MI
## 26                    26                    MAX ZHU       ON
## 27                    27             GAURAV GIDWANI       MI
## 28                    28 SOFIA ADINA STANESCU-BELLU       MI
## 29                    29           CHIEDOZIE OKORIE       MI
## 30                    30         GEORGE AVERY JONES       ON
## 31                    31               RISHI SHETTY       MI
## 32                    32      JOSHUA PHILIP MATHEWS       ON
## 33                    33                    JADE GE       MI
## 34                    34     MICHAEL JEFFERY THOMAS       MI
## 35                    35           JOSHUA DAVID LEE       MI
## 36                    36              SIDDHARTH JHA       MI
## 37                    37       AMIYATOSH PWNANANDAM       MI
## 38                    38                  BRIAN LIU       MI
## 39                    39              JOEL R HENDON       MI
## 40                    40               FOREST ZHANG       MI
## 41                    41        KYLE WILLIAM MURPHY       MI
## 42                    42                   JARED GE       MI
## 43                    43          ROBERT GLEN VASEY       MI
## 44                    44         JUSTIN D SCHILLING       MI
## 45                    45                  DEREK YAN       MI
## 46                    46   JACOB ALEXANDER LAVALLEY       MI
## 47                    47                ERIC WRIGHT       MI
## 48                    48               DANIEL KHAIN       MI
## 49                    49           MICHAEL J MARTIN       MI
## 50                    50                 SHIVAM JHA       MI
## 51                    51             TEJAS AYYAGARI       MI
## 52                    52                  ETHAN GUO       MI
## 53                    53              JOSE C YBARRA       MI
## 54                    54                LARRY HODGE       MI
## 55                    55                  ALEX KONG       MI
## 56                    56               MARISA RICCI       MI
## 57                    57                 MICHAEL LU       MI
## 58                    58               VIRAJ MOHILE       MI
## 59                    59          SEAN M MC CORMICK       MI
## 60                    60                 JULIA SHEN       MI
## 61                    61              JEZZEL FARKAS       ON
## 62                    62              ASHWIN BALAJI       MI
## 63                    63       THOMAS JOSEPH HOSMER       MI
## 64                    64                     BEN LI       MI
##    as.numeric.ELOtotalpoints. ELOprerating as.numeric.avg_ELOopp_rating.
## 1                         6.0         1794                          1436
## 2                         6.0         1553                          1563
## 3                         6.0         1384                          1600
## 4                         5.5         1716                          1610
## 5                         5.5         1655                          1649
## 6                         5.0         1686                          1663
## 7                         5.0         1649                          1716
## 8                         5.0         1641                          1175
## 9                         5.0         1411                           917
## 10                        5.0         1365                          1716
## 11                        4.5         1712                          1629
## 12                        4.5         1663                          1604
## 13                        4.5         1666                          1595
## 14                        4.5         1610                          1649
## 15                        4.5         1220                          1641
## 16                        4.0         1604                           955
## 17                        4.0         1629                          1745
## 18                        4.0         1600                          1563
## 19                        4.0         1564                          1712
## 20                        4.0         1595                          1666
## 21                        4.0         1563                          1663
## 22                        4.0         1555                          1363
## 23                        4.0         1363                          1507
## 24                        4.0         1229                          1553
## 25                        3.5         1745                          1579
## 26                        3.5         1579                          1655
## 27                        3.5         1552                          1564
## 28                        3.5         1507                          1794
## 29                        3.5         1602                          1242
## 30                        3.5         1522                           980
## 31                        3.5         1494                          1663
## 32                        3.5         1441                          1666
## 33                        3.5         1449                          1716
## 34                        3.5         1399                          1610
## 35                        3.5         1438                          1629
## 36                        3.5         1355                          1399
## 37                        3.5          980                          1602
## 38                        3.0         1423                          1712
## 39                        3.0         1436                          1438
## 40                        3.0         1348                          1365
## 41                        3.0         1403                          1552
## 42                        3.0         1332                          1563
## 43                        3.0         1283                          1092
## 44                        3.0         1199                           377
## 45                        3.0         1242                          1666
## 46                        3.0          377                          1712
## 47                        2.5         1362                          1794
## 48                        2.5         1382                          1411
## 49                        2.5         1291                          1553
## 50                        2.5         1056                          1384
## 51                        2.5         1011                          1441
## 52                        2.5          935                          1610
## 53                        2.0         1393                          1411
## 54                        2.0         1270                          1362
## 55                        2.0         1186                          1507
## 56                        2.0         1153                          1564
## 57                        2.0         1092                          1745
## 58                        2.0          917                          1600
## 59                        2.0          853                           853
## 60                        1.5          967                          1641
## 61                        1.5          955                          1579
## 62                        1.0         1530                          1649
## 63                        1.0         1175                          1595
## 64                        1.0         1163                          1604

The names do not look good. I can change them to make them more presentable and easier to understand.

#use a better naming convention 
colnames(csv)[colnames(csv)=="as.integer.ELOplayer."]<-"PlayerNumber"
colnames(csv)[colnames(csv)=="ELOname"]<-"Name"
colnames(csv)[colnames(csv)=="ELOstate"]<-"State"
colnames(csv)[colnames(csv)=="as.numeric.ELOtotalpoints."]<-"TotalPoints"
colnames(csv)[colnames(csv)=="ELOprerating"]<-"PreRating"
colnames(csv)[colnames(csv)=="as.numeric.avg_ELOopp_rating."]<-"AvgOppPreRating"
csv

##    PlayerNumber                       Name State TotalPoints PreRating
## 1             1                   GARY HUA    ON         6.0      1794
## 2             2            DAKSHESH DARURI    MI         6.0      1553
## 3             3               ADITYA BAJAJ    MI         6.0      1384
## 4             4        PATRICK H SCHILLING    MI         5.5      1716
## 5             5                 HANSHI ZUO    MI         5.5      1655
## 6             6                HANSEN SONG    OH         5.0      1686
## 7             7          GARY DEE SWATHELL    MI         5.0      1649
## 8             8           EZEKIEL HOUGHTON    MI         5.0      1641
## 9             9                STEFANO LEE    ON         5.0      1411
## 10           10                  ANVIT RAO    MI         5.0      1365
## 11           11   CAMERON WILLIAM MC LEMAN    MI         4.5      1712
## 12           12             KENNETH J TACK    MI         4.5      1663
## 13           13          TORRANCE HENRY JR    MI         4.5      1666
## 14           14               BRADLEY SHAW    MI         4.5      1610
## 15           15     ZACHARY JAMES HOUGHTON    MI         4.5      1220
## 16           16               MIKE NIKITIN    MI         4.0      1604
## 17           17         RONALD GRZEGORCZYK    MI         4.0      1629
## 18           18              DAVID SUNDEEN    MI         4.0      1600
## 19           19               DIPANKAR ROY    MI         4.0      1564
## 20           20                JASON ZHENG    MI         4.0      1595
## 21           21              DINH DANG BUI    ON         4.0      1563
## 22           22           EUGENE L MCCLURE    MI         4.0      1555
## 23           23                   ALAN BUI    ON         4.0      1363
## 24           24          MICHAEL R ALDRICH    MI         4.0      1229
## 25           25           LOREN SCHWIEBERT    MI         3.5      1745
## 26           26                    MAX ZHU    ON         3.5      1579
## 27           27             GAURAV GIDWANI    MI         3.5      1552
## 28           28 SOFIA ADINA STANESCU-BELLU    MI         3.5      1507
## 29           29           CHIEDOZIE OKORIE    MI         3.5      1602
## 30           30         GEORGE AVERY JONES    ON         3.5      1522
## 31           31               RISHI SHETTY    MI         3.5      1494
## 32           32      JOSHUA PHILIP MATHEWS    ON         3.5      1441
## 33           33                    JADE GE    MI         3.5      1449
## 34           34     MICHAEL JEFFERY THOMAS    MI         3.5      1399
## 35           35           JOSHUA DAVID LEE    MI         3.5      1438
## 36           36              SIDDHARTH JHA    MI         3.5      1355
## 37           37       AMIYATOSH PWNANANDAM    MI         3.5       980
## 38           38                  BRIAN LIU    MI         3.0      1423
## 39           39              JOEL R HENDON    MI         3.0      1436
## 40           40               FOREST ZHANG    MI         3.0      1348
## 41           41        KYLE WILLIAM MURPHY    MI         3.0      1403
## 42           42                   JARED GE    MI         3.0      1332
## 43           43          ROBERT GLEN VASEY    MI         3.0      1283
## 44           44         JUSTIN D SCHILLING    MI         3.0      1199
## 45           45                  DEREK YAN    MI         3.0      1242
## 46           46   JACOB ALEXANDER LAVALLEY    MI         3.0       377
## 47           47                ERIC WRIGHT    MI         2.5      1362
## 48           48               DANIEL KHAIN    MI         2.5      1382
## 49           49           MICHAEL J MARTIN    MI         2.5      1291
## 50           50                 SHIVAM JHA    MI         2.5      1056
## 51           51             TEJAS AYYAGARI    MI         2.5      1011
## 52           52                  ETHAN GUO    MI         2.5       935
## 53           53              JOSE C YBARRA    MI         2.0      1393
## 54           54                LARRY HODGE    MI         2.0      1270
## 55           55                  ALEX KONG    MI         2.0      1186
## 56           56               MARISA RICCI    MI         2.0      1153
## 57           57                 MICHAEL LU    MI         2.0      1092
## 58           58               VIRAJ MOHILE    MI         2.0       917
## 59           59          SEAN M MC CORMICK    MI         2.0       853
## 60           60                 JULIA SHEN    MI         1.5       967
## 61           61              JEZZEL FARKAS    ON         1.5       955
## 62           62              ASHWIN BALAJI    MI         1.0      1530
## 63           63       THOMAS JOSEPH HOSMER    MI         1.0      1175
## 64           64                     BEN LI    MI         1.0      1163
##    AvgOppPreRating
## 1             1436
## 2             1563
## 3             1600
## 4             1610
## 5             1649
## 6             1663
## 7             1716
## 8             1175
## 9              917
## 10            1716
## 11            1629
## 12            1604
## 13            1595
## 14            1649
## 15            1641
## 16             955
## 17            1745
## 18            1563
## 19            1712
## 20            1666
## 21            1663
## 22            1363
## 23            1507
## 24            1553
## 25            1579
## 26            1655
## 27            1564
## 28            1794
## 29            1242
## 30             980
## 31            1663
## 32            1666
## 33            1716
## 34            1610
## 35            1629
## 36            1399
## 37            1602
## 38            1712
## 39            1438
## 40            1365
## 41            1552
## 42            1563
## 43            1092
## 44             377
## 45            1666
## 46            1712
## 47            1794
## 48            1411
## 49            1553
## 50            1384
## 51            1441
## 52            1610
## 53            1411
## 54            1362
## 55            1507
## 56            1564
## 57            1745
## 58            1600
## 59             853
## 60            1641
## 61            1579
## 62            1649
## 63            1595
## 64            1604

We finally export a new csv to a destination on your local machine. You will need to change the path.

#export to a CSV
write.csv(csv, "~/Desktop/DATA Science SPS/DATA 607/Week 3/chesstorunament.csv")

DATA 607 Project 1

Vinicio Haro

2/25/2018