Removing Special Characters in R from a text file

Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players: Player’s Name Player’s State Total Number of Points Player’s Pre-Rating and Average Pre Chess Rating of Opponents

For the first player, the information would be:Gary Hua, ON, 6.0,1794, 1605 1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.

Below are different block of commands I try to use to create a table and manipulate special characters from the text file; however I found much easier work with a vectors. Those command also removes special characters from a text file.

“tournament.res <- read.table(”https://raw.githubusercontent.com/doradu8030/Data607/master/tournamentinfo.txt“, header=FALSE, sep=”“) tournament.res <- readLines(”tournamentinfo.txt“) str_replace_all(tournament.res,”[[:punct:]]“,” “) tournament.res<- tournament.res[-(1:4)] tournament.res<-setdiff(tournament.res,lines) tournament.res<-str_replace_all(tournament.res,” {2}“,”“) tournament.res<- str_replace_all(tournament.res,”([/])“,”“) gsub(”[^[:alnum:][:blank:]+?&/\-]“,”“, tournament.res) gsub(”[[:punct:]]“,” “, tournament.res) regex (”[[:punct:]]“, tournament.res)”"

I use the command read.delim which removes most of the special characters from the text file

tournament.res <- read.delim("https://raw.githubusercontent.com/doradu8030/Data607/master/tournamentinfo.txt", header=FALSE, sep="|", skip = 4, stringsAsFactors = FALSE)
head(tournament.res)

##                                                                                          V1
## 1                                                                                        1 
## 2                                                                                       ON 
## 3 -----------------------------------------------------------------------------------------
## 4                                                                                        2 
## 5                                                                                       MI 
## 6 -----------------------------------------------------------------------------------------
##                                  V2    V3    V4    V5    V6    V7    V8
## 1  GARY HUA                         6.0   W  39 W  21 W  18 W  14 W   7
## 2  15445895 / R: 1794   ->1817      N:2   W     B     W     B     W    
## 3                                                                      
## 4  DAKSHESH DARURI                  6.0   W  63 W  58 L   4 W  17 W  16
## 5  14598900 / R: 1553   ->1663      N:2   B     W     B     W     B    
## 6                                                                      
##      V9   V10 V11
## 1 D  12 D   4  NA
## 2 B     W      NA
## 3              NA
## 4 W  20 W   7  NA
## 5 W     B      NA
## 6              NA

After several attempts I created only one data frame with all fields needed.

tournament.rounds <- data.frame(
            playername = tournament.res$V2[c(TRUE, FALSE, FALSE)],
            playerstate = tournament.res$V1[c(FALSE, TRUE, FALSE)],
            totnumpoints = round(as.numeric(tournament.res$V3[c(TRUE, FALSE, FALSE)]), 1),
            round1 = as.numeric(str_extract(tournament.res$V4[c(TRUE, FALSE, FALSE)],regex("[:digit:]+"))), 
            round2 = as.numeric(str_extract(tournament.res$V5[c(TRUE, FALSE, FALSE)],regex("[:digit:]+"))),
            round3 = as.numeric(str_extract(tournament.res$V6[c(TRUE, FALSE, FALSE)],regex("[:digit:]+"))),
            round4 = as.numeric(str_extract(tournament.res$V7[c(TRUE, FALSE, FALSE)],regex("[:digit:]+"))),
            round5 = as.numeric(str_extract(tournament.res$V8[c(TRUE, FALSE, FALSE)],regex("[:digit:]+"))),
            round6 = as.numeric(str_extract(tournament.res$V9[c(TRUE, FALSE, FALSE)],regex("[:digit:]+"))),
            round7 = as.numeric(str_extract(tournament.res$V10[c(TRUE, FALSE, FALSE)],regex("[:digit:]+"))),
            prerating = integer(64),
            avgrateoppon = integer(64))
            # the definicion of interger only allows the value 64 which corresponde             to the number of rows in the data set

How the instruction c(TRUE,FALSE,FALSE) included above works; See the original file

head(tournament.res)

##                                                                                          V1
## 1                                                                                        1 
## 2                                                                                       ON 
## 3 -----------------------------------------------------------------------------------------
## 4                                                                                        2 
## 5                                                                                       MI 
## 6 -----------------------------------------------------------------------------------------
##                                  V2    V3    V4    V5    V6    V7    V8
## 1  GARY HUA                         6.0   W  39 W  21 W  18 W  14 W   7
## 2  15445895 / R: 1794   ->1817      N:2   W     B     W     B     W    
## 3                                                                      
## 4  DAKSHESH DARURI                  6.0   W  63 W  58 L   4 W  17 W  16
## 5  14598900 / R: 1553   ->1663      N:2   B     W     B     W     B    
## 6                                                                      
##      V9   V10 V11
## 1 D  12 D   4  NA
## 2 B     W      NA
## 3              NA
## 4 W  20 W   7  NA
## 5 W     B      NA
## 6              NA

Note that the information needed to be extrated is found every third row; row 1,4,7 and so on. see the changes when I added logical values to the c vector

playername = tournament.res$V2[c(TRUE)]
head(playername)

## [1] " GARY HUA                        " " 15445895 / R: 1794   ->1817     "
## [3] ""                                  " DAKSHESH DARURI                 "
## [5] " 14598900 / R: 1553   ->1663     " ""

playername = tournament.res$V2[c(TRUE, FALSE)]
head(playername)

## [1] " GARY HUA                        " ""                                 
## [3] " 14598900 / R: 1553   ->1663     " " ADITYA BAJAJ                    "
## [5] ""                                  " 12616049 / R: 1716   ->1744     "

playername = tournament.res$V2[c(TRUE, FALSE, TRUE)]
head(playername)

## [1] " GARY HUA                        " ""                                 
## [3] " DAKSHESH DARURI                 " ""                                 
## [5] " ADITYA BAJAJ                    " ""

Now I will convert as numeric the column “prerating”“; at firt extract the character Ex.”R: 1553" and assig it to a variable then R allows youto convert this variable as numeric and extracted as numeric as well

test <- str_extract(tournament.res$V2[c(FALSE, TRUE, FALSE)], "R:[:blank:]+[:digit:]+")
tournament.rounds$prerating <- as.numeric(str_extract(test, "[:digit:]+"))
head(test)

## [1] "R: 1794" "R: 1553" "R: 1384" "R: 1716" "R: 1655" "R: 1686"

head(tournament.rounds$prerating)

## [1] 1794 1553 1384 1716 1655 1686

The final output file is;

tournament.rounds

##                           playername playerstate totnumpoints round1
## 1   GARY HUA                                 ON           6.0     39
## 2   DAKSHESH DARURI                          MI           6.0     63
## 3   ADITYA BAJAJ                             MI           6.0      8
## 4   PATRICK H SCHILLING                      MI           5.5     23
## 5   HANSHI ZUO                               MI           5.5     45
## 6   HANSEN SONG                              OH           5.0     34
## 7   GARY DEE SWATHELL                        MI           5.0     57
## 8   EZEKIEL HOUGHTON                         MI           5.0      3
## 9   STEFANO LEE                              ON           5.0     25
## 10  ANVIT RAO                                MI           5.0     16
## 11  CAMERON WILLIAM MC LEMAN                 MI           4.5     38
## 12  KENNETH J TACK                           MI           4.5     42
## 13  TORRANCE HENRY JR                        MI           4.5     36
## 14  BRADLEY SHAW                             MI           4.5     54
## 15  ZACHARY JAMES HOUGHTON                   MI           4.5     19
## 16  MIKE NIKITIN                             MI           4.0     10
## 17  RONALD GRZEGORCZYK                       MI           4.0     48
## 18  DAVID SUNDEEN                            MI           4.0     47
## 19  DIPANKAR ROY                             MI           4.0     15
## 20  JASON ZHENG                              MI           4.0     40
## 21  DINH DANG BUI                            ON           4.0     43
## 22  EUGENE L MCCLURE                         MI           4.0     64
## 23  ALAN BUI                                 ON           4.0      4
## 24  MICHAEL R ALDRICH                        MI           4.0     28
## 25  LOREN SCHWIEBERT                         MI           3.5      9
## 26  MAX ZHU                                  ON           3.5     49
## 27  GAURAV GIDWANI                           MI           3.5     51
## 28  SOFIA ADINA STANESCU-BELLU               MI           3.5     24
## 29  CHIEDOZIE OKORIE                         MI           3.5     50
## 30  GEORGE AVERY JONES                       ON           3.5     52
## 31  RISHI SHETTY                             MI           3.5     58
## 32  JOSHUA PHILIP MATHEWS                    ON           3.5     61
## 33  JADE GE                                  MI           3.5     60
## 34  MICHAEL JEFFERY THOMAS                   MI           3.5      6
## 35  JOSHUA DAVID LEE                         MI           3.5     46
## 36  SIDDHARTH JHA                            MI           3.5     13
## 37  AMIYATOSH PWNANANDAM                     MI           3.5     NA
## 38  BRIAN LIU                                MI           3.0     11
## 39  JOEL R HENDON                            MI           3.0      1
## 40  FOREST ZHANG                             MI           3.0     20
## 41  KYLE WILLIAM MURPHY                      MI           3.0     59
## 42  JARED GE                                 MI           3.0     12
## 43  ROBERT GLEN VASEY                        MI           3.0     21
## 44  JUSTIN D SCHILLING                       MI           3.0     NA
## 45  DEREK YAN                                MI           3.0      5
## 46  JACOB ALEXANDER LAVALLEY                 MI           3.0     35
## 47  ERIC WRIGHT                              MI           2.5     18
## 48  DANIEL KHAIN                             MI           2.5     17
## 49  MICHAEL J MARTIN                         MI           2.5     26
## 50  SHIVAM JHA                               MI           2.5     29
## 51  TEJAS AYYAGARI                           MI           2.5     27
## 52  ETHAN GUO                                MI           2.5     30
## 53  JOSE C YBARRA                            MI           2.0     NA
## 54  LARRY HODGE                              MI           2.0     14
## 55  ALEX KONG                                MI           2.0     62
## 56  MARISA RICCI                             MI           2.0     NA
## 57  MICHAEL LU                               MI           2.0      7
## 58  VIRAJ MOHILE                             MI           2.0     31
## 59  SEAN M MC CORMICK                        MI           2.0     41
## 60  JULIA SHEN                               MI           1.5     33
## 61  JEZZEL FARKAS                            ON           1.5     32
## 62  ASHWIN BALAJI                            MI           1.0     55
## 63  THOMAS JOSEPH HOSMER                     MI           1.0      2
## 64  BEN LI                                   MI           1.0     22
##    round2 round3 round4 round5 round6 round7 prerating avgrateoppon
## 1      21     18     14      7     12      4      1794            0
## 2      58      4     17     16     20      7      1553            0
## 3      61     25     21     11     13     12      1384            0
## 4      28      2     26      5     19      1      1716            0
## 5      37     12     13      4     14     17      1655            0
## 6      29     11     35     10     27     21      1686            0
## 7      46     13     11      1      9      2      1649            0
## 8      32     14      9     47     28     19      1641            0
## 9      18     59      8     26      7     20      1411            0
## 10     19     55     31      6     25     18      1365            0
## 11     56      6      7      3     34     26      1712            0
## 12     33      5     38     NA      1      3      1663            0
## 13     27      7      5     33      3     32      1666            0
## 14     44      8      1     27      5     31      1610            0
## 15     16     30     22     54     33     38      1220            0
## 16     15     NA     39      2     36     NA      1604            0
## 17     41     26      2     23     22      5      1629            0
## 18      9      1     32     19     38     10      1600            0
## 19     10     52     28     18      4      8      1564            0
## 20     49     23     41     28      2      9      1595            0
## 21      1     47      3     40     39      6      1563            0
## 22     52     28     15     NA     17     40      1555            0
## 23     43     20     58     17     37     46      1363            0
## 24     47     43     25     60     44     39      1229            0
## 25     53      3     24     34     10     47      1745            0
## 26     40     17      4      9     32     11      1579            0
## 27     13     46     37     14      6     NA      1552            0
## 28      4     22     19     20      8     36      1507            0
## 29      6     38     34     52     48     NA      1602            0
## 30     64     15     55     31     61     50      1522            0
## 31     55     64     10     30     50     14      1494            0
## 32      8     44     18     51     26     13      1441            0
## 33     12     50     36     13     15     51      1449            0
## 34     60     37     29     25     11     52      1399            0
## 35     38     56      6     57     52     48      1438            0
## 36     57     51     33     NA     16     28      1355            0
## 37      5     34     27     NA     23     61       980            0
## 38     35     29     12     NA     18     15      1423            0
## 39     54     40     16     44     21     24      1436            0
## 40     26     39     59     21     56     22      1348            0
## 41     17     58     20     NA     NA     NA      1403            0
## 42     50     57     60     61     64     56      1332            0
## 43     23     24     63     59     46     55      1283            0
## 44     14     32     53     39     24     59      1199            0
## 45     51     60     56     63     55     58      1242            0
## 46      7     27     50     64     43     23       377            0
## 47     24     21     61      8     51     25      1362            0
## 48     63     NA     52     NA     29     35      1382            0
## 49     20     63     64     58     NA     NA      1291            0
## 50     42     33     46     NA     31     30      1056            0
## 51     45     36     57     32     47     33      1011            0
## 52     22     19     48     29     35     34       935            0
## 53     25     NA     44     NA     57     NA      1393            0
## 54     39     61     NA     15     59     64      1270            0
## 55     31     10     30     NA     45     43      1186            0
## 56     11     35     45     NA     40     42      1153            0
## 57     36     42     51     35     53     NA      1092            0
## 58      2     41     23     49     NA     45       917            0
## 59     NA      9     40     43     54     44       853            0
## 60     34     45     42     24     NA     NA       967            0
## 61      3     54     47     42     30     37       955            0
## 62     NA     NA     NA     NA     NA     NA      1530            0
## 63     48     49     43     45     NA     NA      1175            0
## 64     30     31     49     46     42     54      1163            0

for (i in 1:64) {
       tournament.rounds$round1[i] <- tournament.rounds$prerating[tournament.rounds$round1[i]]
       tournament.rounds$round2[i] <- tournament.rounds$prerating[tournament.rounds$round2[i]]
       tournament.rounds$round3[i] <- tournament.rounds$prerating[tournament.rounds$round3[i]]
       tournament.rounds$round4[i] <- tournament.rounds$prerating[tournament.rounds$round4[i]]
       tournament.rounds$round5[i] <- tournament.rounds$prerating[tournament.rounds$round5[i]]
       tournament.rounds$round6[i] <- tournament.rounds$prerating[tournament.rounds$round6[i]]
       tournament.rounds$round7[i] <- tournament.rounds$prerating[tournament.rounds$round7[i]]
}

  for (i in 1:64) {
       tournament.rounds$avgrateoppon[i] <- rowSums(tournament.rounds[i, 4:10], na.rm = TRUE) %/% 7
  }

foutput <- tournament.rounds[c(1,2,3,11,12)]
names(foutput) = c("PlayerName","PlayerState","NumofPoints","Pre-Rating","AvgRatingOp")
foutput

##                           PlayerName PlayerState NumofPoints Pre-Rating
## 1   GARY HUA                                 ON          6.0       1794
## 2   DAKSHESH DARURI                          MI          6.0       1553
## 3   ADITYA BAJAJ                             MI          6.0       1384
## 4   PATRICK H SCHILLING                      MI          5.5       1716
## 5   HANSHI ZUO                               MI          5.5       1655
## 6   HANSEN SONG                              OH          5.0       1686
## 7   GARY DEE SWATHELL                        MI          5.0       1649
## 8   EZEKIEL HOUGHTON                         MI          5.0       1641
## 9   STEFANO LEE                              ON          5.0       1411
## 10  ANVIT RAO                                MI          5.0       1365
## 11  CAMERON WILLIAM MC LEMAN                 MI          4.5       1712
## 12  KENNETH J TACK                           MI          4.5       1663
## 13  TORRANCE HENRY JR                        MI          4.5       1666
## 14  BRADLEY SHAW                             MI          4.5       1610
## 15  ZACHARY JAMES HOUGHTON                   MI          4.5       1220
## 16  MIKE NIKITIN                             MI          4.0       1604
## 17  RONALD GRZEGORCZYK                       MI          4.0       1629
## 18  DAVID SUNDEEN                            MI          4.0       1600
## 19  DIPANKAR ROY                             MI          4.0       1564
## 20  JASON ZHENG                              MI          4.0       1595
## 21  DINH DANG BUI                            ON          4.0       1563
## 22  EUGENE L MCCLURE                         MI          4.0       1555
## 23  ALAN BUI                                 ON          4.0       1363
## 24  MICHAEL R ALDRICH                        MI          4.0       1229
## 25  LOREN SCHWIEBERT                         MI          3.5       1745
## 26  MAX ZHU                                  ON          3.5       1579
## 27  GAURAV GIDWANI                           MI          3.5       1552
## 28  SOFIA ADINA STANESCU-BELLU               MI          3.5       1507
## 29  CHIEDOZIE OKORIE                         MI          3.5       1602
## 30  GEORGE AVERY JONES                       ON          3.5       1522
## 31  RISHI SHETTY                             MI          3.5       1494
## 32  JOSHUA PHILIP MATHEWS                    ON          3.5       1441
## 33  JADE GE                                  MI          3.5       1449
## 34  MICHAEL JEFFERY THOMAS                   MI          3.5       1399
## 35  JOSHUA DAVID LEE                         MI          3.5       1438
## 36  SIDDHARTH JHA                            MI          3.5       1355
## 37  AMIYATOSH PWNANANDAM                     MI          3.5        980
## 38  BRIAN LIU                                MI          3.0       1423
## 39  JOEL R HENDON                            MI          3.0       1436
## 40  FOREST ZHANG                             MI          3.0       1348
## 41  KYLE WILLIAM MURPHY                      MI          3.0       1403
## 42  JARED GE                                 MI          3.0       1332
## 43  ROBERT GLEN VASEY                        MI          3.0       1283
## 44  JUSTIN D SCHILLING                       MI          3.0       1199
## 45  DEREK YAN                                MI          3.0       1242
## 46  JACOB ALEXANDER LAVALLEY                 MI          3.0        377
## 47  ERIC WRIGHT                              MI          2.5       1362
## 48  DANIEL KHAIN                             MI          2.5       1382
## 49  MICHAEL J MARTIN                         MI          2.5       1291
## 50  SHIVAM JHA                               MI          2.5       1056
## 51  TEJAS AYYAGARI                           MI          2.5       1011
## 52  ETHAN GUO                                MI          2.5        935
## 53  JOSE C YBARRA                            MI          2.0       1393
## 54  LARRY HODGE                              MI          2.0       1270
## 55  ALEX KONG                                MI          2.0       1186
## 56  MARISA RICCI                             MI          2.0       1153
## 57  MICHAEL LU                               MI          2.0       1092
## 58  VIRAJ MOHILE                             MI          2.0        917
## 59  SEAN M MC CORMICK                        MI          2.0        853
## 60  JULIA SHEN                               MI          1.5        967
## 61  JEZZEL FARKAS                            ON          1.5        955
## 62  ASHWIN BALAJI                            MI          1.0       1530
## 63  THOMAS JOSEPH HOSMER                     MI          1.0       1175
## 64  BEN LI                                   MI          1.0       1163
##    AvgRatingOp
## 1         1605
## 2         1469
## 3         1563
## 4         1573
## 5         1500
## 6         1518
## 7         1372
## 8         1468
## 9         1523
## 10        1554
## 11        1467
## 12        1291
## 13        1497
## 14        1515
## 15        1483
## 16         989
## 17        1498
## 18        1480
## 19        1426
## 20        1410
## 21        1470
## 22        1114
## 23        1213
## 24        1357
## 25        1363
## 26        1506
## 27        1047
## 28        1522
## 29        1125
## 30        1144
## 31        1259
## 32        1378
## 33        1276
## 34        1375
## 35        1149
## 36        1189
## 37         989
## 38        1319
## 39        1429
## 40        1390
## 41         713
## 42        1149
## 43        1106
## 44        1137
## 45        1152
## 46        1357
## 47        1392
## 48         968
## 49         918
## 50        1110
## 51        1356
## 52        1494
## 53         576
## 54        1033
## 55        1205
## 56        1010
## 57        1168
## 58        1192
## 59        1130
## 60         950
## 61        1327
## 62         169
## 63         964
## 64        1263

Then after all information needed is cleaned up, the final file could be save as csv file that could be exported or copy into a postgreSQL or MySQL. Note the R allows to work with dots (.) or underscore (-) in between of a variable names Ex.player.state However, postgreSQL will present this message error;

   player.state in posgreSQL
   ERROR:  syntax error at or near "."
   LINE 5:   player.name character(20) NOT NULL,

If the final file will be exported to postgreSQL don’t use those separators, postgreSQL wouldn’t recognized (.),(-) as part of a variable name.

write.csv(foutput, file = "C:/Users/Dora/CUNY/DATA607-Data Acquisition/Project1/tournamentfinal.csv")

ProjecT1-DATA607

Durley Torres-Marin

September 23, 2017

Removing Special Characters in R from a text file