Data 607 - Project 1 - Text File Wrangling

In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:

Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents

For the first player, the information would be: Gary Hua, ON, 6.0, 1794, 1605

It Begins Rpubs link: http://rpubs.com/jefflittlejohn/Data_607_Project_1

Github link: https://github.com/littlejohnjeff/DATA607_Fall2018/blob/master/Project%201%20-%20Text%20File%20Wrangling%20-%20Jeff%20Littlejohn.Rmd

Let’s load some packages that we’ll need to read and parse the data.

library(tidyverse)

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Warning: package 'tibble' was built under R version 3.3.3

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

library(readr)
library(stringr)

Read the tournament info. After much trial and error, read.delim seems liked the best option. Attempts at using read_delim from the tidyverse led to many problems.

#Reading txt file with pipe delimiter. Tried to skip column 11, created due to line-pipe, but it did not work. Will handle with subsets.
chesst <- read.delim("https://raw.githubusercontent.com/littlejohnjeff/DATA607_Fall2018/master/tournamentinfo.txt", header = FALSE, sep = "|")

head(chesst)

##                                                                                          V1
## 1 -----------------------------------------------------------------------------------------
## 2                                                                                     Pair 
## 3                                                                                     Num  
## 4 -----------------------------------------------------------------------------------------
## 5                                                                                        1 
## 6                                                                                       ON 
##                                  V2    V3    V4    V5    V6    V7    V8
## 1                                                                      
## 2  Player Name                      Total Round Round Round Round Round
## 3  USCF ID / Rtg (Pre->Post)         Pts    1     2     3     4     5  
## 4                                                                      
## 5  GARY HUA                         6.0   W  39 W  21 W  18 W  14 W   7
## 6  15445895 / R: 1794   ->1817      N:2   W     B     W     B     W    
##      V9   V10 V11
## 1              NA
## 2 Round Round  NA
## 3   6     7    NA
## 4              NA
## 5 D  12 D   4  NA
## 6 B     W      NA

Considered pulling out rows 2 and 3 to use as column names, but the inconsistent standards and whitespace likely make it more trouble than it would b worth. Instead, let’s manually create some column names - under the understanding that we will be merging records into columns and splitting the UCSF ID/Rtg (Pre>Post) column into two attributes.

chess_headers <- c("PairNum","PlayerName","TotalPts","R1Result","R2Result","R3Result","R4Result","R5Result","R6Result","R7Result","PlayerState","UCSF_ID","What","R1Color","R2Color","R3Color","R4Color","R5Color","R6Color","R7Color")

Now for the wrangling.

#Pull out relevant data, dropping 11th column, which was created
#Create sequences starting at 5, going up by 3 - 194 and 195 are the final rows from the initial import
seq1 <- seq(from = 5,to = 194, by = 3)
seq2 <- seq(from = 6, to = 195, by = 3)
#subset the read txt file using the sequencing to skip the "----" columns and turn the two rows per player into one wider record
chesst_data <- cbind(chesst[seq1,1:10],chesst[seq2,1:10])
colnames(chesst_data) <- chess_headers
chesst_data$PairNum <- as.numeric(as.character(chesst_data$PairNum))
head(chesst_data)

##    PairNum                        PlayerName TotalPts R1Result R2Result
## 5        1  GARY HUA                            6.0      W  39    W  21
## 8        2  DAKSHESH DARURI                     6.0      W  63    W  58
## 11       3  ADITYA BAJAJ                        6.0      L   8    W  61
## 14       4  PATRICK H SCHILLING                 5.5      W  23    D  28
## 17       5  HANSHI ZUO                          5.5      W  45    W  37
## 20       6  HANSEN SONG                         5.0      W  34    D  29
##    R3Result R4Result R5Result R6Result R7Result PlayerState
## 5     W  18    W  14    W   7    D  12    D   4         ON 
## 8     L   4    W  17    W  16    W  20    W   7         MI 
## 11    W  25    W  21    W  11    W  13    W  12         MI 
## 14    W   2    W  26    D   5    W  19    D   1         MI 
## 17    D  12    D  13    D   4    W  14    W  17         MI 
## 20    L  11    W  35    D  10    W  27    W  21         OH 
##                              UCSF_ID  What R1Color R2Color R3Color R4Color
## 5   15445895 / R: 1794   ->1817      N:2     W       B       W       B    
## 8   14598900 / R: 1553   ->1663      N:2     B       W       B       W    
## 11  14959604 / R: 1384   ->1640      N:2     W       B       W       B    
## 14  12616049 / R: 1716   ->1744      N:2     W       B       W       B    
## 17  14601533 / R: 1655   ->1690      N:2     B       W       B       W    
## 20  15055204 / R: 1686   ->1687      N:3     W       B       W       B    
##    R5Color R6Color R7Color
## 5    W       B       W    
## 8    B       W       B    
## 11   W       B       W    
## 14   W       B       B    
## 17   B       W       B    
## 20   B       W       B

Pull out the pre-rating from the crazy UCSF ID/pre-post rating column that violates every known rule and principle ever devised. And let’s extract the opponent ID from the R_Result columns to enable the calculation of the average opponent rating.

#Extract the pre-rating as numeric type
chesst_data$Prerating <- as.numeric(substring(chesst_data$UCSF_ID,16,19))
#Pull the opponent IDs (match to PairNum like foreign key) as numeric as well
chesst_data$R1Opp <- as.numeric(substring(chesst_data$R1Result,3))
chesst_data$R2Opp <- as.numeric(substring(chesst_data$R2Result,3))
chesst_data$R3Opp <- as.numeric(substring(chesst_data$R3Result,3))
chesst_data$R4Opp <- as.numeric(substring(chesst_data$R4Result,3))
chesst_data$R5Opp <- as.numeric(substring(chesst_data$R5Result,3))
chesst_data$R6Opp <- as.numeric(substring(chesst_data$R6Result,3))
chesst_data$R7Opp <- as.numeric(substring(chesst_data$R7Result,3))
head(chesst_data)

##    PairNum                        PlayerName TotalPts R1Result R2Result
## 5        1  GARY HUA                            6.0      W  39    W  21
## 8        2  DAKSHESH DARURI                     6.0      W  63    W  58
## 11       3  ADITYA BAJAJ                        6.0      L   8    W  61
## 14       4  PATRICK H SCHILLING                 5.5      W  23    D  28
## 17       5  HANSHI ZUO                          5.5      W  45    W  37
## 20       6  HANSEN SONG                         5.0      W  34    D  29
##    R3Result R4Result R5Result R6Result R7Result PlayerState
## 5     W  18    W  14    W   7    D  12    D   4         ON 
## 8     L   4    W  17    W  16    W  20    W   7         MI 
## 11    W  25    W  21    W  11    W  13    W  12         MI 
## 14    W   2    W  26    D   5    W  19    D   1         MI 
## 17    D  12    D  13    D   4    W  14    W  17         MI 
## 20    L  11    W  35    D  10    W  27    W  21         OH 
##                              UCSF_ID  What R1Color R2Color R3Color R4Color
## 5   15445895 / R: 1794   ->1817      N:2     W       B       W       B    
## 8   14598900 / R: 1553   ->1663      N:2     B       W       B       W    
## 11  14959604 / R: 1384   ->1640      N:2     W       B       W       B    
## 14  12616049 / R: 1716   ->1744      N:2     W       B       W       B    
## 17  14601533 / R: 1655   ->1690      N:2     B       W       B       W    
## 20  15055204 / R: 1686   ->1687      N:3     W       B       W       B    
##    R5Color R6Color R7Color Prerating R1Opp R2Opp R3Opp R4Opp R5Opp R6Opp
## 5    W       B       W          1794    39    21    18    14     7    12
## 8    B       W       B          1553    63    58     4    17    16    20
## 11   W       B       W          1384     8    61    25    21    11    13
## 14   W       B       B          1716    23    28     2    26     5    19
## 17   B       W       B          1655    45    37    12    13     4    14
## 20   B       W       B          1686    34    29    11    35    10    27
##    R7Opp
## 5      4
## 8      7
## 11    12
## 14     1
## 17    17
## 20    21

Let’s subset the dataframe to limit the columns that will be used for the csv output - note that the opponent PairNums won’t be used, but we need them to merge the rankings and then derive a mean for the desired end point.

#Subset for columns that will be used for csv output
chesst_prep <- subset(chesst_data,select=c(as.numeric(PairNum),PlayerName,PlayerState,TotalPts,Prerating,R1Opp,R2Opp,R3Opp,R4Opp,R5Opp,R6Opp,R7Opp))
head(chesst_prep)

##    PairNum                        PlayerName PlayerState TotalPts
## 5        1  GARY HUA                                 ON     6.0  
## 8        2  DAKSHESH DARURI                          MI     6.0  
## 11       3  ADITYA BAJAJ                             MI     6.0  
## 14       4  PATRICK H SCHILLING                      MI     5.5  
## 17       5  HANSHI ZUO                               MI     5.5  
## 20       6  HANSEN SONG                              OH     5.0  
##    Prerating R1Opp R2Opp R3Opp R4Opp R5Opp R6Opp R7Opp
## 5       1794    39    21    18    14     7    12     4
## 8       1553    63    58     4    17    16    20     7
## 11      1384     8    61    25    21    11    13    12
## 14      1716    23    28     2    26     5    19     1
## 17      1655    45    37    12    13     4    14    17
## 20      1686    34    29    11    35    10    27    21

#Merge round 1 opponent rating
#If I was a better programmer, I would write this once and loop for opponents 2-7
chesst_prep1 <- merge(x = chesst_prep, y = chesst_prep[,c("PairNum","Prerating")], by.x=c("R1Opp"), by.y=c("PairNum"),suffix = c("", "_1"), all.x = TRUE)
chesst_prep2 <- merge(x = chesst_prep1, y = chesst_prep1[,c("PairNum","Prerating")], by.x=c("R2Opp"), by.y=c("PairNum"),suffix = c("", "_2"), all.x = TRUE)
chesst_prep3 <- merge(x = chesst_prep2, y = chesst_prep2[,c("PairNum","Prerating")], by.x=c("R3Opp"), by.y=c("PairNum"),suffix = c("", "_3"), all.x = TRUE)
chesst_prep4 <- merge(x = chesst_prep3, y = chesst_prep3[,c("PairNum","Prerating")], by.x=c("R4Opp"), by.y=c("PairNum"),suffix = c("", "_4"), all.x = TRUE)
chesst_prep5 <- merge(x = chesst_prep4, y = chesst_prep4[,c("PairNum","Prerating")], by.x=c("R5Opp"), by.y=c("PairNum"),suffix = c("", "_5"), all.x = TRUE)
chesst_prep6 <- merge(x = chesst_prep5, y = chesst_prep5[,c("PairNum","Prerating")], by.x=c("R6Opp"), by.y=c("PairNum"),suffix = c("", "_6"), all.x = TRUE)
chesst_prep7 <- merge(x = chesst_prep6, y = chesst_prep6[,c("PairNum","Prerating")], by.x=c("R7Opp"), by.y=c("PairNum"),suffix = c("", "_7"), all.x = TRUE)
head(chesst_prep7)

##   R7Opp R6Opp R5Opp R4Opp R3Opp R2Opp R1Opp PairNum
## 1     1    19     5    26     2    28    23       4
## 2     2     9     1    11    13    46    57       7
## 3     3     1    NA    38     5    33    42      12
## 4     4    12     7    14    18    21    39       1
## 5     5    22    23     2    26    41    48      17
## 6     6    39    40     3    47     1    43      21
##                          PlayerName PlayerState TotalPts Prerating
## 1  PATRICK H SCHILLING                      MI     5.5        1716
## 2  GARY DEE SWATHELL                        MI     5.0        1649
## 3  KENNETH J TACK                           MI     4.5        1663
## 4  GARY HUA                                 ON     6.0        1794
## 5  RONALD GRZEGORCZYK                       MI     4.0        1629
## 6  DINH DANG BUI                            ON     4.0        1563
##   Prerating_1 Prerating_2 Prerating_3 Prerating_4 Prerating_5 Prerating_6
## 1        1363        1507        1553        1579        1655        1564
## 2        1092         377        1666        1712        1794        1411
## 3        1332        1449        1655        1423          NA        1794
## 4        1436        1563        1600        1610        1649        1663
## 5        1382        1403        1579        1553        1363        1555
## 6        1283        1794        1362        1384        1348        1436
##   Prerating_7
## 1        1794
## 2        1553
## 3        1384
## 4        1716
## 5        1655
## 6        1686

Calculate opponent average rating using rowmeans.

#Get the average opponent rating, handling the NA matches
AvgOppRating <- rowMeans(chesst_prep7[13:19],na.rm = TRUE)
chesst_prep7$AvgOppRating <- as.integer(AvgOppRating)
chesst_prep7$Prerating <- as.integer(chesst_prep7$Prerating)

Subset a dataframe that matches our output.

chesst_final <- subset(chesst_prep7,select=c("PlayerName","PlayerState","TotalPts","Prerating","AvgOppRating"))
chesst_final

##                           PlayerName PlayerState TotalPts Prerating
## 1   PATRICK H SCHILLING                      MI     5.5        1716
## 2   GARY DEE SWATHELL                        MI     5.0        1649
## 3   KENNETH J TACK                           MI     4.5        1663
## 4   GARY HUA                                 ON     6.0        1794
## 5   RONALD GRZEGORCZYK                       MI     4.0        1629
## 6   DINH DANG BUI                            ON     4.0        1563
## 7   DAKSHESH DARURI                          MI     6.0        1553
## 8   DIPANKAR ROY                             MI     4.0        1564
## 9   JASON ZHENG                              MI     4.0        1595
## 10  DAVID SUNDEEN                            MI     4.0        1600
## 11  MAX ZHU                                  ON     3.5        1579
## 12  ADITYA BAJAJ                             MI     6.0        1384
## 13  JOSHUA PHILIP MATHEWS                    ON     3.5        1441
## 14  RISHI SHETTY                             MI     3.5        1494
## 15  BRIAN LIU                                MI     3.0        1423
## 16  HANSHI ZUO                               MI     5.5        1655
## 17  ANVIT RAO                                MI     5.0        1365
## 18  EZEKIEL HOUGHTON                         MI     5.0        1641
## 19  STEFANO LEE                              ON     5.0        1411
## 20  HANSEN SONG                              OH     5.0        1686
## 21  FOREST ZHANG                             MI     3.0        1348
## 22  JACOB ALEXANDER LAVALLEY                 MI     3.0         377
## 23  JOEL R HENDON                            MI     3.0        1436
## 24  ERIC WRIGHT                              MI     2.5        1362
## 25  CAMERON WILLIAM MC LEMAN                 MI     4.5        1712
## 26  SIDDHARTH JHA                            MI     3.5        1355
## 27  SHIVAM JHA                               MI     2.5        1056
## 28  BRADLEY SHAW                             MI     4.5        1610
## 29  TORRANCE HENRY JR                        MI     4.5        1666
## 30  TEJAS AYYAGARI                           MI     2.5        1011
## 31  ETHAN GUO                                MI     2.5         935
## 32  DANIEL KHAIN                             MI     2.5        1382
## 33  SOFIA ADINA STANESCU-BELLU               MI     3.5        1507
## 34  JEZZEL FARKAS                            ON     1.5         955
## 35  ZACHARY JAMES HOUGHTON                   MI     4.5        1220
## 36  MICHAEL R ALDRICH                        MI     4.0        1229
## 37  EUGENE L MCCLURE                         MI     4.0        1555
## 38  MARISA RICCI                             MI     2.0        1153
## 39  ALEX KONG                                MI     2.0        1186
## 40  SEAN M MC CORMICK                        MI     2.0         853
## 41  VIRAJ MOHILE                             MI     2.0         917
## 42  ALAN BUI                                 ON     4.0        1363
## 43  LOREN SCHWIEBERT                         MI     3.5        1745
## 44  JOSHUA DAVID LEE                         MI     3.5        1438
## 45  GEORGE AVERY JONES                       ON     3.5        1522
## 46  JADE GE                                  MI     3.5        1449
## 47  MICHAEL JEFFERY THOMAS                   MI     3.5        1399
## 48  BEN LI                                   MI     1.0        1163
## 49  ROBERT GLEN VASEY                        MI     3.0        1283
## 50  JARED GE                                 MI     3.0        1332
## 51  DEREK YAN                                MI     3.0        1242
## 52  JUSTIN D SCHILLING                       MI     3.0        1199
## 53  AMIYATOSH PWNANANDAM                     MI     3.5         980
## 54  LARRY HODGE                              MI     2.0        1270
## 55  ASHWIN BALAJI                            MI     1.0        1530
## 56  GAURAV GIDWANI                           MI     3.5        1552
## 57  JOSE C YBARRA                            MI     2.0        1393
## 58  CHIEDOZIE OKORIE                         MI     3.5        1602
## 59  KYLE WILLIAM MURPHY                      MI     3.0        1403
## 60  MIKE NIKITIN                             MI     4.0        1604
## 61  JULIA SHEN                               MI     1.5         967
## 62  MICHAEL LU                               MI     2.0        1092
## 63  THOMAS JOSEPH HOSMER                     MI     1.0        1175
## 64  MICHAEL J MARTIN                         MI     2.5        1291
##    AvgOppRating
## 1          1573
## 2          1372
## 3          1506
## 4          1605
## 5          1498
## 6          1470
## 7          1469
## 8          1426
## 9          1410
## 10         1480
## 11         1506
## 12         1563
## 13         1378
## 14         1259
## 15         1539
## 16         1500
## 17         1554
## 18         1468
## 19         1523
## 20         1518
## 21         1390
## 22         1357
## 23         1429
## 24         1392
## 25         1467
## 26         1388
## 27         1296
## 28         1515
## 29         1497
## 30         1356
## 31         1494
## 32         1355
## 33         1522
## 34         1327
## 35         1483
## 36         1357
## 37         1300
## 38         1414
## 39         1406
## 40         1319
## 41         1391
## 42         1213
## 43         1363
## 44         1149
## 45         1144
## 46         1276
## 47         1375
## 48         1263
## 49         1106
## 50         1149
## 51         1152
## 52         1327
## 53         1384
## 54         1206
## 55         1186
## 56         1221
## 57         1345
## 58         1313
## 59         1248
## 60         1385
## 61         1330
## 62         1363
## 63         1350
## 64         1285

Output the csv

write.csv(chesst_final, file = "Data_607_Lab_1_Chess_Data_Wrangling_Littlejohn.csv")

Data 607 - Project 1 - Text File Wrangling

Jeff Littlejohn

September 22, 2018