The goal of this project is to take data seemingly complex and structure and to transform it into a structured data set that can be used for downstream analysis.
The raw data can be found here https://raw.githubusercontent.com/vindication09/DATA607_Project1/master/RawChessData We will be working with an ELO chess cross table. We want to transform the information from this cross table into a data frame that contains the following information: player number, player name, state, total points, pre ratings, and avg opponent pre ratings.
My overall strategy is to get each column as its own data frame, and then merge them into a big data frame by appending all together.
I copy and pased the data from the URL into a text editor and saved it as a .txt file. I will be reading in my data from that directory. Once it is read in, we will examine the first 10 rows to get a sense of the structure and patterns.
ELOsheet <- read.csv(paste0("~/Desktop/DATA Science SPS/DATA 607/Week 3/ELOsheet.txt"))
head(ELOsheet, 10)
## X.........................................................................................
## 1 Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|
## 2 Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
## 3 -----------------------------------------------------------------------------------------
## 4 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
## 5 ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
## 6 -----------------------------------------------------------------------------------------
## 7 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|
## 8 MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |
## 9 -----------------------------------------------------------------------------------------
## 10 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|
Right away, I notice that the first two rows contain just headings. I can remove them. I will keep the line because the line appears every third row. When I do subsets later on, I can use this fact to manage my subsets. We then examine the first 10 rows and make a note of the new structure.
The lines have two rows between them. I refer to the information between lines a s asubsection. The first subsection is between rows 2 and 3. The second subsection is between rows 5 and 6. Since there is an obvious patter, I can use the sequence function to split the data into two subsets to get the information I need.
ELOsheet2<-ELOsheet[-c(1:2),]
head(ELOsheet2, 10)
## [1] -----------------------------------------------------------------------------------------
## [2] 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
## [3] ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
## [4] -----------------------------------------------------------------------------------------
## [5] 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|
## [6] MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |
## [7] -----------------------------------------------------------------------------------------
## [8] 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|
## [9] MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |
## [10] -----------------------------------------------------------------------------------------
## 131 Levels: 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4| ...
How do I determine my subsets? If I want information on the player number, name, total points, and opponents, I want to take the 2nd row of every subsection.
If I want information on State, pre rating, and post rating I want the third row of each subset.
Information on sequenc could be found here http://www.endmemo.com/program/R/seq.php
#to grab the 1st row of each subsection, I want to skip the first row, grab the second, skip the third and 4th then repeat
ELOsubsheet1<-ELOsheet2[seq(2, length(ELOsheet2), 3)]
head(ELOsubsheet1)
## [1] 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
## [2] 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|
## [3] 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|
## [4] 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|
## [5] 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|
## [6] 6 | HANSEN SONG |5.0 |W 34|D 29|L 11|W 35|D 10|W 27|W 21|
## 131 Levels: 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4| ...
#to grab the 2nd row of each subsection, I need to skip row 1, 2 grab 3, skip 4 then repeat
ELOsubsheet2<-ELOsheet2[seq(3, length(ELOsheet2), 3)]
head(ELOsubsheet2)
## [1] ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
## [2] MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |
## [3] MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |
## [4] MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |
## [5] MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |
## [6] OH | 15055204 / R: 1686 ->1687 |N:3 |W |B |W |B |B |W |B |
## 131 Levels: 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4| ...
With our two subsheets created, we can use both of them to exctract the required information. We need to use our knowledge of regular expressions and the stringr library.
library(stringr)
I used the following cheet sheet to figuire out the regular expressions. It is very handy for these cases because it includes a breakdown of each pattern with description http://www.cbs.dtu.dk/courses/27610/regular-expressions-cheat-sheet-v2.pdf
Note that on some occasions, you will need to use the replace function in order to eliminate some of the “junk”
#use subsheet1
#i noticed that names are all upper case, we can use this to our advantage
ELOname <- unlist(str_extract_all(ELOsubsheet1, "\\| [[:upper:]- ]{4,} \\|"))
ELOname <- str_replace_all(ELOname, pattern = "(\\| )|([[:space:]]{1,}\\|)", replacement = "")
df.ELOname <- data.frame(ELOname)
df.ELOname
## ELOname
## 1 GARY HUA
## 2 DAKSHESH DARURI
## 3 ADITYA BAJAJ
## 4 PATRICK H SCHILLING
## 5 HANSHI ZUO
## 6 HANSEN SONG
## 7 GARY DEE SWATHELL
## 8 EZEKIEL HOUGHTON
## 9 STEFANO LEE
## 10 ANVIT RAO
## 11 CAMERON WILLIAM MC LEMAN
## 12 KENNETH J TACK
## 13 TORRANCE HENRY JR
## 14 BRADLEY SHAW
## 15 ZACHARY JAMES HOUGHTON
## 16 MIKE NIKITIN
## 17 RONALD GRZEGORCZYK
## 18 DAVID SUNDEEN
## 19 DIPANKAR ROY
## 20 JASON ZHENG
## 21 DINH DANG BUI
## 22 EUGENE L MCCLURE
## 23 ALAN BUI
## 24 MICHAEL R ALDRICH
## 25 LOREN SCHWIEBERT
## 26 MAX ZHU
## 27 GAURAV GIDWANI
## 28 SOFIA ADINA STANESCU-BELLU
## 29 CHIEDOZIE OKORIE
## 30 GEORGE AVERY JONES
## 31 RISHI SHETTY
## 32 JOSHUA PHILIP MATHEWS
## 33 JADE GE
## 34 MICHAEL JEFFERY THOMAS
## 35 JOSHUA DAVID LEE
## 36 SIDDHARTH JHA
## 37 AMIYATOSH PWNANANDAM
## 38 BRIAN LIU
## 39 JOEL R HENDON
## 40 FOREST ZHANG
## 41 KYLE WILLIAM MURPHY
## 42 JARED GE
## 43 ROBERT GLEN VASEY
## 44 JUSTIN D SCHILLING
## 45 DEREK YAN
## 46 JACOB ALEXANDER LAVALLEY
## 47 ERIC WRIGHT
## 48 DANIEL KHAIN
## 49 MICHAEL J MARTIN
## 50 SHIVAM JHA
## 51 TEJAS AYYAGARI
## 52 ETHAN GUO
## 53 JOSE C YBARRA
## 54 LARRY HODGE
## 55 ALEX KONG
## 56 MARISA RICCI
## 57 MICHAEL LU
## 58 VIRAJ MOHILE
## 59 SEAN M MC CORMICK
## 60 JULIA SHEN
## 61 JEZZEL FARKAS
## 62 ASHWIN BALAJI
## 63 THOMAS JOSEPH HOSMER
## 64 BEN LI
#use subsheet 2
#the next thing we need to extract are the States the players come from
#I notice that states are upper case but also have abbreviations. There is no pipe before states
#This is something I can use to my advantage in order to extract them.
ELOstate <- unlist(str_extract_all(ELOsubsheet2, "\\ [[:space:]]{1,}[[A-Z]]{2} \\|"))
ELOstate <- str_replace_all(ELOstate, pattern = "(\\|[[:space:]]{1,})|([[:space:]]{1,}\\|)", replacement = "")
df.ELOstate <- data.frame(ELOstate)
df.ELOstate
## ELOstate
## 1 ON
## 2 MI
## 3 MI
## 4 MI
## 5 MI
## 6 OH
## 7 MI
## 8 MI
## 9 ON
## 10 MI
## 11 MI
## 12 MI
## 13 MI
## 14 MI
## 15 MI
## 16 MI
## 17 MI
## 18 MI
## 19 MI
## 20 MI
## 21 ON
## 22 MI
## 23 ON
## 24 MI
## 25 MI
## 26 ON
## 27 MI
## 28 MI
## 29 MI
## 30 ON
## 31 MI
## 32 ON
## 33 MI
## 34 MI
## 35 MI
## 36 MI
## 37 MI
## 38 MI
## 39 MI
## 40 MI
## 41 MI
## 42 MI
## 43 MI
## 44 MI
## 45 MI
## 46 MI
## 47 MI
## 48 MI
## 49 MI
## 50 MI
## 51 MI
## 52 MI
## 53 MI
## 54 MI
## 55 MI
## 56 MI
## 57 MI
## 58 MI
## 59 MI
## 60 MI
## 61 ON
## 62 MI
## 63 MI
## 64 MI
#use subsheet 1
#The next item on the list is to extract is the total number of points
#I noticed that points are in the form n.n. The are also between pipes
#Lets take n.n skip the space before the righthand pipe
ELOtotalpoints <- unlist(str_extract_all(ELOsubsheet1, "\\|[[:digit:].[:digit:]]{3}[[:space:]]{1,}\\|"))
ELOtotalpoints <- str_replace_all(ELOtotalpoints, pattern = "(\\|)|([[:space:]]{1,}\\|)", replacement = "")
df.ELOtotalpoints <- data.frame(as.numeric(ELOtotalpoints))
df.ELOtotalpoints
## as.numeric.ELOtotalpoints.
## 1 6.0
## 2 6.0
## 3 6.0
## 4 5.5
## 5 5.5
## 6 5.0
## 7 5.0
## 8 5.0
## 9 5.0
## 10 5.0
## 11 4.5
## 12 4.5
## 13 4.5
## 14 4.5
## 15 4.5
## 16 4.0
## 17 4.0
## 18 4.0
## 19 4.0
## 20 4.0
## 21 4.0
## 22 4.0
## 23 4.0
## 24 4.0
## 25 3.5
## 26 3.5
## 27 3.5
## 28 3.5
## 29 3.5
## 30 3.5
## 31 3.5
## 32 3.5
## 33 3.5
## 34 3.5
## 35 3.5
## 36 3.5
## 37 3.5
## 38 3.0
## 39 3.0
## 40 3.0
## 41 3.0
## 42 3.0
## 43 3.0
## 44 3.0
## 45 3.0
## 46 3.0
## 47 2.5
## 48 2.5
## 49 2.5
## 50 2.5
## 51 2.5
## 52 2.5
## 53 2.0
## 54 2.0
## 55 2.0
## 56 2.0
## 57 2.0
## 58 2.0
## 59 2.0
## 60 1.5
## 61 1.5
## 62 1.0
## 63 1.0
## 64 1.0
#use subsheet 2
#The next item on the list that needs to be extracted is the players pre-rating
#the pre rating is to the right of R: and to the left of spaces and arrow ->
#this can be used to our advantage to extract the pre rating
ELOprerating <- unlist(str_extract_all(ELOsubsheet2, "[:] [[:alnum:] ]{2,9}\\-\\>"))
ELOprerating <- str_replace_all(ELOprerating, pattern = "(\\: )|(\\s{1,}\\-\\>)|([O-Q]\\d{1,2})|(\\-\\>)", replacement = "")
ELOprerating <- as.numeric(ELOprerating)
df.ELOprerating<-data.frame(as.numeric(ELOprerating))
df.ELOprerating
## as.numeric.ELOprerating.
## 1 1794
## 2 1553
## 3 1384
## 4 1716
## 5 1655
## 6 1686
## 7 1649
## 8 1641
## 9 1411
## 10 1365
## 11 1712
## 12 1663
## 13 1666
## 14 1610
## 15 1220
## 16 1604
## 17 1629
## 18 1600
## 19 1564
## 20 1595
## 21 1563
## 22 1555
## 23 1363
## 24 1229
## 25 1745
## 26 1579
## 27 1552
## 28 1507
## 29 1602
## 30 1522
## 31 1494
## 32 1441
## 33 1449
## 34 1399
## 35 1438
## 36 1355
## 37 980
## 38 1423
## 39 1436
## 40 1348
## 41 1403
## 42 1332
## 43 1283
## 44 1199
## 45 1242
## 46 377
## 47 1362
## 48 1382
## 49 1291
## 50 1056
## 51 1011
## 52 935
## 53 1393
## 54 1270
## 55 1186
## 56 1153
## 57 1092
## 58 917
## 59 853
## 60 967
## 61 955
## 62 1530
## 63 1175
## 64 1163
How did we do so far? Lets make a partial data frame to make sure we have collected the information in the right place using the correct data types.
#computing the average opponent pre rating requires more manipulation of the original data frame
partialcsv<-data.frame(df.ELOname, df.ELOstate, df.ELOtotalpoints, ELOprerating)
partialcsv
## ELOname ELOstate as.numeric.ELOtotalpoints.
## 1 GARY HUA ON 6.0
## 2 DAKSHESH DARURI MI 6.0
## 3 ADITYA BAJAJ MI 6.0
## 4 PATRICK H SCHILLING MI 5.5
## 5 HANSHI ZUO MI 5.5
## 6 HANSEN SONG OH 5.0
## 7 GARY DEE SWATHELL MI 5.0
## 8 EZEKIEL HOUGHTON MI 5.0
## 9 STEFANO LEE ON 5.0
## 10 ANVIT RAO MI 5.0
## 11 CAMERON WILLIAM MC LEMAN MI 4.5
## 12 KENNETH J TACK MI 4.5
## 13 TORRANCE HENRY JR MI 4.5
## 14 BRADLEY SHAW MI 4.5
## 15 ZACHARY JAMES HOUGHTON MI 4.5
## 16 MIKE NIKITIN MI 4.0
## 17 RONALD GRZEGORCZYK MI 4.0
## 18 DAVID SUNDEEN MI 4.0
## 19 DIPANKAR ROY MI 4.0
## 20 JASON ZHENG MI 4.0
## 21 DINH DANG BUI ON 4.0
## 22 EUGENE L MCCLURE MI 4.0
## 23 ALAN BUI ON 4.0
## 24 MICHAEL R ALDRICH MI 4.0
## 25 LOREN SCHWIEBERT MI 3.5
## 26 MAX ZHU ON 3.5
## 27 GAURAV GIDWANI MI 3.5
## 28 SOFIA ADINA STANESCU-BELLU MI 3.5
## 29 CHIEDOZIE OKORIE MI 3.5
## 30 GEORGE AVERY JONES ON 3.5
## 31 RISHI SHETTY MI 3.5
## 32 JOSHUA PHILIP MATHEWS ON 3.5
## 33 JADE GE MI 3.5
## 34 MICHAEL JEFFERY THOMAS MI 3.5
## 35 JOSHUA DAVID LEE MI 3.5
## 36 SIDDHARTH JHA MI 3.5
## 37 AMIYATOSH PWNANANDAM MI 3.5
## 38 BRIAN LIU MI 3.0
## 39 JOEL R HENDON MI 3.0
## 40 FOREST ZHANG MI 3.0
## 41 KYLE WILLIAM MURPHY MI 3.0
## 42 JARED GE MI 3.0
## 43 ROBERT GLEN VASEY MI 3.0
## 44 JUSTIN D SCHILLING MI 3.0
## 45 DEREK YAN MI 3.0
## 46 JACOB ALEXANDER LAVALLEY MI 3.0
## 47 ERIC WRIGHT MI 2.5
## 48 DANIEL KHAIN MI 2.5
## 49 MICHAEL J MARTIN MI 2.5
## 50 SHIVAM JHA MI 2.5
## 51 TEJAS AYYAGARI MI 2.5
## 52 ETHAN GUO MI 2.5
## 53 JOSE C YBARRA MI 2.0
## 54 LARRY HODGE MI 2.0
## 55 ALEX KONG MI 2.0
## 56 MARISA RICCI MI 2.0
## 57 MICHAEL LU MI 2.0
## 58 VIRAJ MOHILE MI 2.0
## 59 SEAN M MC CORMICK MI 2.0
## 60 JULIA SHEN MI 1.5
## 61 JEZZEL FARKAS ON 1.5
## 62 ASHWIN BALAJI MI 1.0
## 63 THOMAS JOSEPH HOSMER MI 1.0
## 64 BEN LI MI 1.0
## ELOprerating
## 1 1794
## 2 1553
## 3 1384
## 4 1716
## 5 1655
## 6 1686
## 7 1649
## 8 1641
## 9 1411
## 10 1365
## 11 1712
## 12 1663
## 13 1666
## 14 1610
## 15 1220
## 16 1604
## 17 1629
## 18 1600
## 19 1564
## 20 1595
## 21 1563
## 22 1555
## 23 1363
## 24 1229
## 25 1745
## 26 1579
## 27 1552
## 28 1507
## 29 1602
## 30 1522
## 31 1494
## 32 1441
## 33 1449
## 34 1399
## 35 1438
## 36 1355
## 37 980
## 38 1423
## 39 1436
## 40 1348
## 41 1403
## 42 1332
## 43 1283
## 44 1199
## 45 1242
## 46 377
## 47 1362
## 48 1382
## 49 1291
## 50 1056
## 51 1011
## 52 935
## 53 1393
## 54 1270
## 55 1186
## 56 1153
## 57 1092
## 58 917
## 59 853
## 60 967
## 61 955
## 62 1530
## 63 1175
## 64 1163
We have 4 of the 5 columns needed. Computing the average opponent pre rating is going to be the most difficult part of this operation.
I had toyed with the idea of using melt to flatten and make 10 additional columns replacing opponent number with score and then making an 11th column for the average but I could not figure out the syntax. I also toyed with using sqldf and take advantage of the fact tha player to opponent is a one to many relationship.
I settled with trying to write a loop to compute the average pre rating, however I am eager to see if there are solutions using the methods I mentioned above.
Before writing my loop, I need information regarding the player number and the opponent number. I will use similar reg expression to extract those pieces of information.
#use subsheet1
#We need to compute the average pre rating for opponents by player
#I first need to extract the opponents into their own data frame
#We can extract digits using d and add + to keep going till it hits pipe
ELOopponent<-unlist(str_extract_all(ELOsubsheet1, "\\d+\\|" ), "\\d+")
ELOopponent<-str_replace_all(ELOopponent, pattern="\\|", replace="")
ELOopponent<-as.integer(ELOopponent)
df.ELOopponent<-data.frame(as.integer(ELOopponent))
df.ELOopponent
## as.integer.ELOopponent.
## 1 39
## 2 21
## 3 18
## 4 14
## 5 7
## 6 12
## 7 4
## 8 63
## 9 58
## 10 4
## 11 17
## 12 16
## 13 20
## 14 7
## 15 8
## 16 61
## 17 25
## 18 21
## 19 11
## 20 13
## 21 12
## 22 23
## 23 28
## 24 2
## 25 26
## 26 5
## 27 19
## 28 1
## 29 45
## 30 37
## 31 12
## 32 13
## 33 4
## 34 14
## 35 17
## 36 34
## 37 29
## 38 11
## 39 35
## 40 10
## 41 27
## 42 21
## 43 57
## 44 46
## 45 13
## 46 11
## 47 1
## 48 9
## 49 2
## 50 3
## 51 32
## 52 14
## 53 9
## 54 47
## 55 28
## 56 19
## 57 25
## 58 18
## 59 59
## 60 8
## 61 26
## 62 7
## 63 20
## 64 16
## 65 19
## 66 55
## 67 31
## 68 6
## 69 25
## 70 18
## 71 38
## 72 56
## 73 6
## 74 7
## 75 3
## 76 34
## 77 26
## 78 42
## 79 33
## 80 5
## 81 38
## 82 1
## 83 3
## 84 36
## 85 27
## 86 7
## 87 5
## 88 33
## 89 3
## 90 32
## 91 54
## 92 44
## 93 8
## 94 1
## 95 27
## 96 5
## 97 31
## 98 19
## 99 16
## 100 30
## 101 22
## 102 54
## 103 33
## 104 38
## 105 10
## 106 15
## 107 39
## 108 2
## 109 36
## 110 48
## 111 41
## 112 26
## 113 2
## 114 23
## 115 22
## 116 5
## 117 47
## 118 9
## 119 1
## 120 32
## 121 19
## 122 38
## 123 10
## 124 15
## 125 10
## 126 52
## 127 28
## 128 18
## 129 4
## 130 8
## 131 40
## 132 49
## 133 23
## 134 41
## 135 28
## 136 2
## 137 9
## 138 43
## 139 1
## 140 47
## 141 3
## 142 40
## 143 39
## 144 6
## 145 64
## 146 52
## 147 28
## 148 15
## 149 17
## 150 40
## 151 4
## 152 43
## 153 20
## 154 58
## 155 17
## 156 37
## 157 46
## 158 28
## 159 47
## 160 43
## 161 25
## 162 60
## 163 44
## 164 39
## 165 9
## 166 53
## 167 3
## 168 24
## 169 34
## 170 10
## 171 47
## 172 49
## 173 40
## 174 17
## 175 4
## 176 9
## 177 32
## 178 11
## 179 51
## 180 13
## 181 46
## 182 37
## 183 14
## 184 6
## 185 24
## 186 4
## 187 22
## 188 19
## 189 20
## 190 8
## 191 36
## 192 50
## 193 6
## 194 38
## 195 34
## 196 52
## 197 48
## 198 52
## 199 64
## 200 15
## 201 55
## 202 31
## 203 61
## 204 50
## 205 58
## 206 55
## 207 64
## 208 10
## 209 30
## 210 50
## 211 14
## 212 61
## 213 8
## 214 44
## 215 18
## 216 51
## 217 26
## 218 13
## 219 60
## 220 12
## 221 50
## 222 36
## 223 13
## 224 15
## 225 51
## 226 6
## 227 60
## 228 37
## 229 29
## 230 25
## 231 11
## 232 52
## 233 46
## 234 38
## 235 56
## 236 6
## 237 57
## 238 52
## 239 48
## 240 13
## 241 57
## 242 51
## 243 33
## 244 16
## 245 28
## 246 5
## 247 34
## 248 27
## 249 23
## 250 61
## 251 11
## 252 35
## 253 29
## 254 12
## 255 18
## 256 15
## 257 1
## 258 54
## 259 40
## 260 16
## 261 44
## 262 21
## 263 24
## 264 20
## 265 26
## 266 39
## 267 59
## 268 21
## 269 56
## 270 22
## 271 59
## 272 17
## 273 58
## 274 20
## 275 12
## 276 50
## 277 57
## 278 60
## 279 61
## 280 64
## 281 56
## 282 21
## 283 23
## 284 24
## 285 63
## 286 59
## 287 46
## 288 55
## 289 14
## 290 32
## 291 53
## 292 39
## 293 24
## 294 59
## 295 5
## 296 51
## 297 60
## 298 56
## 299 63
## 300 55
## 301 58
## 302 35
## 303 7
## 304 27
## 305 50
## 306 64
## 307 43
## 308 23
## 309 18
## 310 24
## 311 21
## 312 61
## 313 8
## 314 51
## 315 25
## 316 17
## 317 63
## 318 52
## 319 29
## 320 35
## 321 26
## 322 20
## 323 63
## 324 64
## 325 58
## 326 29
## 327 42
## 328 33
## 329 46
## 330 31
## 331 30
## 332 27
## 333 45
## 334 36
## 335 57
## 336 32
## 337 47
## 338 33
## 339 30
## 340 22
## 341 19
## 342 48
## 343 29
## 344 35
## 345 34
## 346 25
## 347 44
## 348 57
## 349 14
## 350 39
## 351 61
## 352 15
## 353 59
## 354 64
## 355 62
## 356 31
## 357 10
## 358 30
## 359 45
## 360 43
## 361 11
## 362 35
## 363 45
## 364 40
## 365 42
## 366 7
## 367 36
## 368 42
## 369 51
## 370 35
## 371 53
## 372 31
## 373 2
## 374 41
## 375 23
## 376 49
## 377 45
## 378 41
## 379 9
## 380 40
## 381 43
## 382 54
## 383 44
## 384 33
## 385 34
## 386 45
## 387 42
## 388 24
## 389 32
## 390 3
## 391 54
## 392 47
## 393 42
## 394 30
## 395 37
## 396 55
## 397 2
## 398 48
## 399 49
## 400 43
## 401 45
## 402 22
## 403 30
## 404 31
## 405 49
## 406 46
## 407 42
## 408 54
#use subsheet 1
#the last piece we need in order to compute the average are the player numbers
#players are ordered 1-64.
ELOplayer<-as.integer(str_extract(ELOsubsheet1, "\\d+"))
head(ELOplayer, 64)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
## [47] 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
df.ELOplayer<-data.frame(as.integer(ELOplayer))
df.ELOplayer
## as.integer.ELOplayer.
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
## 10 10
## 11 11
## 12 12
## 13 13
## 14 14
## 15 15
## 16 16
## 17 17
## 18 18
## 19 19
## 20 20
## 21 21
## 22 22
## 23 23
## 24 24
## 25 25
## 26 26
## 27 27
## 28 28
## 29 29
## 30 30
## 31 31
## 32 32
## 33 33
## 34 34
## 35 35
## 36 36
## 37 37
## 38 38
## 39 39
## 40 40
## 41 41
## 42 42
## 43 43
## 44 44
## 45 45
## 46 46
## 47 47
## 48 48
## 49 49
## 50 50
## 51 51
## 52 52
## 53 53
## 54 54
## 55 55
## 56 56
## 57 57
## 58 58
## 59 59
## 60 60
## 61 61
## 62 62
## 63 63
## 64 64
For player number, there was no “junk” to replace since the player number is right at the edge.
I want my loop to do the following: Scan subsheet 1 and match the player number to each of its opponents. Then to take the pre rating for each opponent and find the mean of their associated pre ratings as a single column.
This is where I am stuck. My loop is not working properly as it is only taking the first opponent of each player. I would greatly appcreciate any feedback on how to best complete this portion of the project. In the meantime, I want to get more familiar with more advanced libraries that could do this more efficiently. A co-worker of mine suggested the purrr library but it is to new to me to figure out how to use it.
#use subsheet 1
#How do we compute the average opponent player rating?
#the loop then fetches the ratings for each opponent and divides by number of rounds
avg_ELOopp_rating<-length(ELOsubsheet1)
#loop
for (i in 1: length(ELOsubsheet1))
{
avg_ELOopp_rating[i]<-mean(ELOprerating[as.numeric(unlist(ELOopponent[ELOplayer[i]]))])
}
df.avg_ELOopp_rating<-data.frame(as.numeric(avg_ELOopp_rating))
df.avg_ELOopp_rating
## as.numeric.avg_ELOopp_rating.
## 1 1436
## 2 1563
## 3 1600
## 4 1610
## 5 1649
## 6 1663
## 7 1716
## 8 1175
## 9 917
## 10 1716
## 11 1629
## 12 1604
## 13 1595
## 14 1649
## 15 1641
## 16 955
## 17 1745
## 18 1563
## 19 1712
## 20 1666
## 21 1663
## 22 1363
## 23 1507
## 24 1553
## 25 1579
## 26 1655
## 27 1564
## 28 1794
## 29 1242
## 30 980
## 31 1663
## 32 1666
## 33 1716
## 34 1610
## 35 1629
## 36 1399
## 37 1602
## 38 1712
## 39 1438
## 40 1365
## 41 1552
## 42 1563
## 43 1092
## 44 377
## 45 1666
## 46 1712
## 47 1794
## 48 1411
## 49 1553
## 50 1384
## 51 1441
## 52 1610
## 53 1411
## 54 1362
## 55 1507
## 56 1564
## 57 1745
## 58 1600
## 59 853
## 60 1641
## 61 1579
## 62 1649
## 63 1595
## 64 1604
Proceeding with the assumption that my average was correct, we can put together our final data frame.
#Put together in a data frame
csv<-data.frame(df.ELOplayer, df.ELOname, df.ELOstate, df.ELOtotalpoints, ELOprerating, df.avg_ELOopp_rating)
csv
## as.integer.ELOplayer. ELOname ELOstate
## 1 1 GARY HUA ON
## 2 2 DAKSHESH DARURI MI
## 3 3 ADITYA BAJAJ MI
## 4 4 PATRICK H SCHILLING MI
## 5 5 HANSHI ZUO MI
## 6 6 HANSEN SONG OH
## 7 7 GARY DEE SWATHELL MI
## 8 8 EZEKIEL HOUGHTON MI
## 9 9 STEFANO LEE ON
## 10 10 ANVIT RAO MI
## 11 11 CAMERON WILLIAM MC LEMAN MI
## 12 12 KENNETH J TACK MI
## 13 13 TORRANCE HENRY JR MI
## 14 14 BRADLEY SHAW MI
## 15 15 ZACHARY JAMES HOUGHTON MI
## 16 16 MIKE NIKITIN MI
## 17 17 RONALD GRZEGORCZYK MI
## 18 18 DAVID SUNDEEN MI
## 19 19 DIPANKAR ROY MI
## 20 20 JASON ZHENG MI
## 21 21 DINH DANG BUI ON
## 22 22 EUGENE L MCCLURE MI
## 23 23 ALAN BUI ON
## 24 24 MICHAEL R ALDRICH MI
## 25 25 LOREN SCHWIEBERT MI
## 26 26 MAX ZHU ON
## 27 27 GAURAV GIDWANI MI
## 28 28 SOFIA ADINA STANESCU-BELLU MI
## 29 29 CHIEDOZIE OKORIE MI
## 30 30 GEORGE AVERY JONES ON
## 31 31 RISHI SHETTY MI
## 32 32 JOSHUA PHILIP MATHEWS ON
## 33 33 JADE GE MI
## 34 34 MICHAEL JEFFERY THOMAS MI
## 35 35 JOSHUA DAVID LEE MI
## 36 36 SIDDHARTH JHA MI
## 37 37 AMIYATOSH PWNANANDAM MI
## 38 38 BRIAN LIU MI
## 39 39 JOEL R HENDON MI
## 40 40 FOREST ZHANG MI
## 41 41 KYLE WILLIAM MURPHY MI
## 42 42 JARED GE MI
## 43 43 ROBERT GLEN VASEY MI
## 44 44 JUSTIN D SCHILLING MI
## 45 45 DEREK YAN MI
## 46 46 JACOB ALEXANDER LAVALLEY MI
## 47 47 ERIC WRIGHT MI
## 48 48 DANIEL KHAIN MI
## 49 49 MICHAEL J MARTIN MI
## 50 50 SHIVAM JHA MI
## 51 51 TEJAS AYYAGARI MI
## 52 52 ETHAN GUO MI
## 53 53 JOSE C YBARRA MI
## 54 54 LARRY HODGE MI
## 55 55 ALEX KONG MI
## 56 56 MARISA RICCI MI
## 57 57 MICHAEL LU MI
## 58 58 VIRAJ MOHILE MI
## 59 59 SEAN M MC CORMICK MI
## 60 60 JULIA SHEN MI
## 61 61 JEZZEL FARKAS ON
## 62 62 ASHWIN BALAJI MI
## 63 63 THOMAS JOSEPH HOSMER MI
## 64 64 BEN LI MI
## as.numeric.ELOtotalpoints. ELOprerating as.numeric.avg_ELOopp_rating.
## 1 6.0 1794 1436
## 2 6.0 1553 1563
## 3 6.0 1384 1600
## 4 5.5 1716 1610
## 5 5.5 1655 1649
## 6 5.0 1686 1663
## 7 5.0 1649 1716
## 8 5.0 1641 1175
## 9 5.0 1411 917
## 10 5.0 1365 1716
## 11 4.5 1712 1629
## 12 4.5 1663 1604
## 13 4.5 1666 1595
## 14 4.5 1610 1649
## 15 4.5 1220 1641
## 16 4.0 1604 955
## 17 4.0 1629 1745
## 18 4.0 1600 1563
## 19 4.0 1564 1712
## 20 4.0 1595 1666
## 21 4.0 1563 1663
## 22 4.0 1555 1363
## 23 4.0 1363 1507
## 24 4.0 1229 1553
## 25 3.5 1745 1579
## 26 3.5 1579 1655
## 27 3.5 1552 1564
## 28 3.5 1507 1794
## 29 3.5 1602 1242
## 30 3.5 1522 980
## 31 3.5 1494 1663
## 32 3.5 1441 1666
## 33 3.5 1449 1716
## 34 3.5 1399 1610
## 35 3.5 1438 1629
## 36 3.5 1355 1399
## 37 3.5 980 1602
## 38 3.0 1423 1712
## 39 3.0 1436 1438
## 40 3.0 1348 1365
## 41 3.0 1403 1552
## 42 3.0 1332 1563
## 43 3.0 1283 1092
## 44 3.0 1199 377
## 45 3.0 1242 1666
## 46 3.0 377 1712
## 47 2.5 1362 1794
## 48 2.5 1382 1411
## 49 2.5 1291 1553
## 50 2.5 1056 1384
## 51 2.5 1011 1441
## 52 2.5 935 1610
## 53 2.0 1393 1411
## 54 2.0 1270 1362
## 55 2.0 1186 1507
## 56 2.0 1153 1564
## 57 2.0 1092 1745
## 58 2.0 917 1600
## 59 2.0 853 853
## 60 1.5 967 1641
## 61 1.5 955 1579
## 62 1.0 1530 1649
## 63 1.0 1175 1595
## 64 1.0 1163 1604
The names do not look good. I can change them to make them more presentable and easier to understand.
#use a better naming convention
colnames(csv)[colnames(csv)=="as.integer.ELOplayer."]<-"PlayerNumber"
colnames(csv)[colnames(csv)=="ELOname"]<-"Name"
colnames(csv)[colnames(csv)=="ELOstate"]<-"State"
colnames(csv)[colnames(csv)=="as.numeric.ELOtotalpoints."]<-"TotalPoints"
colnames(csv)[colnames(csv)=="ELOprerating"]<-"PreRating"
colnames(csv)[colnames(csv)=="as.numeric.avg_ELOopp_rating."]<-"AvgOppPreRating"
csv
## PlayerNumber Name State TotalPoints PreRating
## 1 1 GARY HUA ON 6.0 1794
## 2 2 DAKSHESH DARURI MI 6.0 1553
## 3 3 ADITYA BAJAJ MI 6.0 1384
## 4 4 PATRICK H SCHILLING MI 5.5 1716
## 5 5 HANSHI ZUO MI 5.5 1655
## 6 6 HANSEN SONG OH 5.0 1686
## 7 7 GARY DEE SWATHELL MI 5.0 1649
## 8 8 EZEKIEL HOUGHTON MI 5.0 1641
## 9 9 STEFANO LEE ON 5.0 1411
## 10 10 ANVIT RAO MI 5.0 1365
## 11 11 CAMERON WILLIAM MC LEMAN MI 4.5 1712
## 12 12 KENNETH J TACK MI 4.5 1663
## 13 13 TORRANCE HENRY JR MI 4.5 1666
## 14 14 BRADLEY SHAW MI 4.5 1610
## 15 15 ZACHARY JAMES HOUGHTON MI 4.5 1220
## 16 16 MIKE NIKITIN MI 4.0 1604
## 17 17 RONALD GRZEGORCZYK MI 4.0 1629
## 18 18 DAVID SUNDEEN MI 4.0 1600
## 19 19 DIPANKAR ROY MI 4.0 1564
## 20 20 JASON ZHENG MI 4.0 1595
## 21 21 DINH DANG BUI ON 4.0 1563
## 22 22 EUGENE L MCCLURE MI 4.0 1555
## 23 23 ALAN BUI ON 4.0 1363
## 24 24 MICHAEL R ALDRICH MI 4.0 1229
## 25 25 LOREN SCHWIEBERT MI 3.5 1745
## 26 26 MAX ZHU ON 3.5 1579
## 27 27 GAURAV GIDWANI MI 3.5 1552
## 28 28 SOFIA ADINA STANESCU-BELLU MI 3.5 1507
## 29 29 CHIEDOZIE OKORIE MI 3.5 1602
## 30 30 GEORGE AVERY JONES ON 3.5 1522
## 31 31 RISHI SHETTY MI 3.5 1494
## 32 32 JOSHUA PHILIP MATHEWS ON 3.5 1441
## 33 33 JADE GE MI 3.5 1449
## 34 34 MICHAEL JEFFERY THOMAS MI 3.5 1399
## 35 35 JOSHUA DAVID LEE MI 3.5 1438
## 36 36 SIDDHARTH JHA MI 3.5 1355
## 37 37 AMIYATOSH PWNANANDAM MI 3.5 980
## 38 38 BRIAN LIU MI 3.0 1423
## 39 39 JOEL R HENDON MI 3.0 1436
## 40 40 FOREST ZHANG MI 3.0 1348
## 41 41 KYLE WILLIAM MURPHY MI 3.0 1403
## 42 42 JARED GE MI 3.0 1332
## 43 43 ROBERT GLEN VASEY MI 3.0 1283
## 44 44 JUSTIN D SCHILLING MI 3.0 1199
## 45 45 DEREK YAN MI 3.0 1242
## 46 46 JACOB ALEXANDER LAVALLEY MI 3.0 377
## 47 47 ERIC WRIGHT MI 2.5 1362
## 48 48 DANIEL KHAIN MI 2.5 1382
## 49 49 MICHAEL J MARTIN MI 2.5 1291
## 50 50 SHIVAM JHA MI 2.5 1056
## 51 51 TEJAS AYYAGARI MI 2.5 1011
## 52 52 ETHAN GUO MI 2.5 935
## 53 53 JOSE C YBARRA MI 2.0 1393
## 54 54 LARRY HODGE MI 2.0 1270
## 55 55 ALEX KONG MI 2.0 1186
## 56 56 MARISA RICCI MI 2.0 1153
## 57 57 MICHAEL LU MI 2.0 1092
## 58 58 VIRAJ MOHILE MI 2.0 917
## 59 59 SEAN M MC CORMICK MI 2.0 853
## 60 60 JULIA SHEN MI 1.5 967
## 61 61 JEZZEL FARKAS ON 1.5 955
## 62 62 ASHWIN BALAJI MI 1.0 1530
## 63 63 THOMAS JOSEPH HOSMER MI 1.0 1175
## 64 64 BEN LI MI 1.0 1163
## AvgOppPreRating
## 1 1436
## 2 1563
## 3 1600
## 4 1610
## 5 1649
## 6 1663
## 7 1716
## 8 1175
## 9 917
## 10 1716
## 11 1629
## 12 1604
## 13 1595
## 14 1649
## 15 1641
## 16 955
## 17 1745
## 18 1563
## 19 1712
## 20 1666
## 21 1663
## 22 1363
## 23 1507
## 24 1553
## 25 1579
## 26 1655
## 27 1564
## 28 1794
## 29 1242
## 30 980
## 31 1663
## 32 1666
## 33 1716
## 34 1610
## 35 1629
## 36 1399
## 37 1602
## 38 1712
## 39 1438
## 40 1365
## 41 1552
## 42 1563
## 43 1092
## 44 377
## 45 1666
## 46 1712
## 47 1794
## 48 1411
## 49 1553
## 50 1384
## 51 1441
## 52 1610
## 53 1411
## 54 1362
## 55 1507
## 56 1564
## 57 1745
## 58 1600
## 59 853
## 60 1641
## 61 1579
## 62 1649
## 63 1595
## 64 1604
We finally export a new csv to a destination on your local machine. You will need to change the path.
#export to a CSV
write.csv(csv, "~/Desktop/DATA Science SPS/DATA 607/Week 3/chesstorunament.csv")