1. question

In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:

Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents

For the first player, the information would be:

Gary Hua, ON, 6.0, 1794, 1605

1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.

2. methodology

In the most general strokes, my approach to this project was comprised of three parts:

2-1. data loading

2-2. regex matching

2-3. referencing and joining

In “data loading”, I tried several different approaches, including using the “scan” and “readLines” methods, but settled for “read.delim” to handle the delimiter explicitly.

In “regex matching”, I used regex to match relevant values while preserving empty values, which were necessary for calculating number of games each player played.

In “referencing and joining”, I first recreated the original data structure as a dataframe as an intermediate frame of reference. Then I extracted a subset to simplify mapping and validate additional column values before producing the desired output and writing to file.

2-1. data loading

# read pipe delimited text file in [table format|https://www.rdocumentation.org/packages/psych/versions/1.8.12/topics/read.file], skipping the 1st line
raw_txt_a = read.delim("tournamentinfo.txt", sep = "|", skip = 1) # read txt file
head(raw_txt_a)

##                                                                                        Pair
## 1                                                                                     Num  
## 2 -----------------------------------------------------------------------------------------
## 3                                                                                        1 
## 4                                                                                       ON 
## 5 -----------------------------------------------------------------------------------------
## 6                                                                                        2 
##                         Player.Name Total Round Round.1 Round.2 Round.3
## 1  USCF ID / Rtg (Pre->Post)         Pts    1       2       3       4  
## 2                                                                      
## 3  GARY HUA                         6.0   W  39   W  21   W  18   W  14
## 4  15445895 / R: 1794   ->1817      N:2   W       B       W       B    
## 5                                                                      
## 6  DAKSHESH DARURI                  6.0   W  63   W  58   L   4   W  17
##   Round.4 Round.5 Round.6  X
## 1     5       6       7   NA
## 2                         NA
## 3   W   7   D  12   D   4 NA
## 4   W       B       W     NA
## 5                         NA
## 6   W  16   W  20   W   7 NA

# read existing headers
names(raw_txt_a)

##  [1] "Pair"        "Player.Name" "Total"       "Round"       "Round.1"    
##  [6] "Round.2"     "Round.3"     "Round.4"     "Round.5"     "Round.6"    
## [11] "X"

# read first two lines
head(raw_txt_a, n = 2)

##                                                                                        Pair
## 1                                                                                     Num  
## 2 -----------------------------------------------------------------------------------------
##                         Player.Name Total Round Round.1 Round.2 Round.3
## 1  USCF ID / Rtg (Pre->Post)         Pts    1       2       3       4  
## 2                                                                      
##   Round.4 Round.5 Round.6  X
## 1     5       6       7   NA
## 2                         NA

# rename headers
h_txt <- rename(raw_txt_a, pair_num = Pair, player_name = Player.Name, total_points = Total, round_1 = Round, round_2 = Round.1, round_3 = Round.2, round_4 = Round.3, round_5 = Round.4, round_6 = Round.5, round_7 = Round.6) 
head(h_txt)

##                                                                                    pair_num
## 1                                                                                     Num  
## 2 -----------------------------------------------------------------------------------------
## 3                                                                                        1 
## 4                                                                                       ON 
## 5 -----------------------------------------------------------------------------------------
## 6                                                                                        2 
##                         player_name total_points round_1 round_2 round_3
## 1  USCF ID / Rtg (Pre->Post)                Pts      1       2       3  
## 2                                                                       
## 3  GARY HUA                                6.0     W  39   W  21   W  18
## 4  15445895 / R: 1794   ->1817             N:2     W       B       W    
## 5                                                                       
## 6  DAKSHESH DARURI                         6.0     W  63   W  58   L   4
##   round_4 round_5 round_6 round_7  X
## 1     4       5       6       7   NA
## 2                                 NA
## 3   W  14   W   7   D  12   D   4 NA
## 4   B       W       B       W     NA
## 5                                 NA
## 6   W  17   W  16   W  20   W   7 NA

# remove first row
rh_txt <- h_txt[-c(1), ]
head(rh_txt)

##                                                                                    pair_num
## 2 -----------------------------------------------------------------------------------------
## 3                                                                                        1 
## 4                                                                                       ON 
## 5 -----------------------------------------------------------------------------------------
## 6                                                                                        2 
## 7                                                                                       MI 
##                         player_name total_points round_1 round_2 round_3
## 2                                                                       
## 3  GARY HUA                                6.0     W  39   W  21   W  18
## 4  15445895 / R: 1794   ->1817             N:2     W       B       W    
## 5                                                                       
## 6  DAKSHESH DARURI                         6.0     W  63   W  58   L   4
## 7  14598900 / R: 1553   ->1663             N:2     B       W       B    
##   round_4 round_5 round_6 round_7  X
## 2                                 NA
## 3   W  14   W   7   D  12   D   4 NA
## 4   B       W       B       W     NA
## 5                                 NA
## 6   W  17   W  16   W  20   W   7 NA
## 7   W       B       W       B     NA

# match rows with dashes
head(str_subset(rh_txt$pair_num, "[-]"))

## [1] "-----------------------------------------------------------------------------------------"
## [2] "-----------------------------------------------------------------------------------------"
## [3] "-----------------------------------------------------------------------------------------"
## [4] "-----------------------------------------------------------------------------------------"
## [5] "-----------------------------------------------------------------------------------------"
## [6] "-----------------------------------------------------------------------------------------"

# remove rows with dashes
drh_txt <- subset(rh_txt, pair_num!=str_subset(rh_txt$pair_num, "[-]"))

## Warning in `!=.default`(pair_num, str_subset(rh_txt$pair_num, "[-]")):
## longer object length is not a multiple of shorter object length

## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length

head(drh_txt)

##    pair_num                       player_name total_points round_1 round_2
## 3        1   GARY HUA                                6.0     W  39   W  21
## 4       ON   15445895 / R: 1794   ->1817             N:2     W       B    
## 6        2   DAKSHESH DARURI                         6.0     W  63   W  58
## 7       MI   14598900 / R: 1553   ->1663             N:2     B       W    
## 9        3   ADITYA BAJAJ                            6.0     L   8   W  61
## 10      MI   14959604 / R: 1384   ->1640             N:2     W       B    
##    round_3 round_4 round_5 round_6 round_7  X
## 3    W  18   W  14   W   7   D  12   D   4 NA
## 4    W       B       W       B       W     NA
## 6    L   4   W  17   W  16   W  20   W   7 NA
## 7    B       W       B       W       B     NA
## 9    W  25   W  21   W  11   W  13   W  12 NA
## 10   W       B       W       B       W     NA

2-2. regex matching

 # match player number
player_numbers <- trimws(str_subset(drh_txt$pair_num, "[[:digit:]]")) # do I need to change to int?

# check for completeness against expected count
length(player_numbers)

## [1] 64

# match player name
names <- trimws(str_extract_all(drh_txt$player_name, "^[A-Z\\- ]+$", simplify = TRUE)) 

# remove empty strings
names <- names[names != ""]

# check for completeness against expected count
length(names)

## [1] 64

# match player state
player_state <- trimws(str_subset(drh_txt$pair_num, "[[:alpha:]]"))

# check for completeness against expected count
length(player_state)

## [1] 64

# match total points
total_points <- trimws(str_subset(drh_txt$total_points, "\\d(\\.\\d)"))

# check for completeness against expected count
length(total_points)

## [1] 64

# match player pre-rating
# " 15445895 / R: 1794   ->1817" 
pre_rating <- trimws(unlist(str_extract_all(drh_txt$player_name, "[:space:]\\d{3,4}", simplify = TRUE))[,2])

# remove empty strings
pre_rating <- pre_rating[pre_rating != ""]

# check for completeness against expected count
length(pre_rating)

## [1] 64

2-3. referencing and joining

# create matrix from relevant data
player_tbl <- cbind(player_numbers, player_state, names, pre_rating, total_points) # create matrix

# convert matrix to dataframe
player_tbl <- data.frame(player_tbl)
head(player_tbl)

##   player_numbers player_state               names pre_rating total_points
## 1              1           ON            GARY HUA       1794          6.0
## 2              2           MI     DAKSHESH DARURI       1553          6.0
## 3              3           MI        ADITYA BAJAJ       1384          6.0
## 4              4           MI PATRICK H SCHILLING       1716          5.5
## 5              5           MI          HANSHI ZUO       1655          5.5
## 6              6           OH         HANSEN SONG       1686          5.0

# replicate gameplay data
games <- subset(drh_txt, select = round_1:round_7)

# create matrix excluding irrelevant rows
games <- games[seq(1, nrow(games), 2), ] 

# extract player numbers from matrix and transpose
games <- t(apply(games, 1, function(x) str_extract_all(x, "[0-9]{1,2}", simplify = TRUE)))

# store matrix as dataframe
games_tbl <- data.frame(games)

# reorder dataframe by row number
rownames(games_tbl) <- seq(length = nrow(games_tbl))

# optionally, rename headers for ease of reference
games_tbl <- rename(games_tbl, round_1 = X1, round_2 = X2, round_3 = X3, round_4 = X4, round_5 = X5, round_6 = X6, round_7 = X7) 

# produce cleaned version of original tournamentinfo table for reference
tournament_info <- merge(x = player_tbl, y = games_tbl, by = 0)
head(tournament_info)

##   Row.names player_numbers player_state                    names
## 1         1              1           ON                 GARY HUA
## 2        10             10           MI                ANVIT RAO
## 3        11             11           MI CAMERON WILLIAM MC LEMAN
## 4        12             12           MI           KENNETH J TACK
## 5        13             13           MI        TORRANCE HENRY JR
## 6        14             14           MI             BRADLEY SHAW
##   pre_rating total_points round_1 round_2 round_3 round_4 round_5 round_6
## 1       1794          6.0      39      21      18      14       7      12
## 2       1365          5.0      16      19      55      31       6      25
## 3       1712          4.5      38      56       6       7       3      34
## 4       1663          4.5      42      33       5      38               1
## 5       1666          4.5      36      27       7       5      33       3
## 6       1610          4.5      54      44       8       1      27       5
##   round_7
## 1       4
## 2      18
## 3      26
## 4       3
## 5      32
## 6      31

# replace player numbers in games columns with relevant pre-rating scores

# create key, value list pair
key <- c(as.character(tournament_info$player_numbers))
value <- c(as.character(tournament_info$pre_rating))
# create map from keys and values 
map <- setNames(value, key)

# map new values to dataframe
games_tbl[] <- map[unlist(games_tbl)]

# convert new values to integer class
games_tbl[, ] <- lapply(games_tbl[, ], as.integer)

# sum total score per row while accounting for missing values
games_tbl$total_score <- rowSums(games_tbl[1:7], na.rm = TRUE)

# count number of games played by discounting missing values
games_tbl$n_games <- rowSums(!is.na(games_tbl[1:7])) 

# store calculation of average rating per row in the new column
games_tbl[, "opp_avg_score"] <- round(games_tbl[, "total_score"] / games_tbl[, "n_games"])
head(games_tbl)

##   round_1 round_2 round_3 round_4 round_5 round_6 round_7 total_score
## 1    1436    1555    1564    1220    1530    1666    1716       10687
## 2     967    1186    1716    1600    1629    1563    1530       10191
## 3    1175     853    1579    1555    1663    1610    1666       10101
## 4    1229    1602    1595    1552    1291    1553    1365       10187
## 5    1199    1649    1666    1610    1716    1220    1600       10660
## 6    1438    1384    1663    1355    1712    1507    1555       10614
##   n_games opp_avg_score
## 1       7          1527
## 2       7          1456
## 3       7          1443
## 4       7          1455
## 5       7          1523
## 6       7          1516

3. solution

# create desired output

# store outputs in desired classes as vectors
player_num <- as.vector(tournament_info$player_numbers)
player_nom <- as.vector(tournament_info$names)
player_state <- as.vector(tournament_info$player_state)
total_points <- as.vector(tournament_info$total_points)
pre_rating <- as.integer(as.vector(tournament_info$pre_rating))
opponent_rating <- as.vector(games_tbl$opp_avg_score)

# create desired dataframe from output vectors
output_tbl <- data.frame(player_num, player_nom, player_state, total_points, pre_rating, opponent_rating)
output_tbl

##    player_num                 player_nom player_state total_points
## 1           1                   GARY HUA           ON          6.0
## 2          10                  ANVIT RAO           MI          5.0
## 3          11   CAMERON WILLIAM MC LEMAN           MI          4.5
## 4          12             KENNETH J TACK           MI          4.5
## 5          13          TORRANCE HENRY JR           MI          4.5
## 6          14               BRADLEY SHAW           MI          4.5
## 7          15     ZACHARY JAMES HOUGHTON           MI          4.5
## 8          16               MIKE NIKITIN           MI          4.0
## 9          17         RONALD GRZEGORCZYK           MI          4.0
## 10         18              DAVID SUNDEEN           MI          4.0
## 11         19               DIPANKAR ROY           MI          4.0
## 12          2            DAKSHESH DARURI           MI          6.0
## 13         20                JASON ZHENG           MI          4.0
## 14         21              DINH DANG BUI           ON          4.0
## 15         22           EUGENE L MCCLURE           MI          4.0
## 16         23                   ALAN BUI           ON          4.0
## 17         24          MICHAEL R ALDRICH           MI          4.0
## 18         25           LOREN SCHWIEBERT           MI          3.5
## 19         26                    MAX ZHU           ON          3.5
## 20         27             GAURAV GIDWANI           MI          3.5
## 21         28 SOFIA ADINA STANESCU-BELLU           MI          3.5
## 22         29           CHIEDOZIE OKORIE           MI          3.5
## 23          3               ADITYA BAJAJ           MI          6.0
## 24         30         GEORGE AVERY JONES           ON          3.5
## 25         31               RISHI SHETTY           MI          3.5
## 26         32      JOSHUA PHILIP MATHEWS           ON          3.5
## 27         33                    JADE GE           MI          3.5
## 28         34     MICHAEL JEFFERY THOMAS           MI          3.5
## 29         35           JOSHUA DAVID LEE           MI          3.5
## 30         36              SIDDHARTH JHA           MI          3.5
## 31         37       AMIYATOSH PWNANANDAM           MI          3.5
## 32         38                  BRIAN LIU           MI          3.0
## 33         39              JOEL R HENDON           MI          3.0
## 34          4        PATRICK H SCHILLING           MI          5.5
## 35         40               FOREST ZHANG           MI          3.0
## 36         41        KYLE WILLIAM MURPHY           MI          3.0
## 37         42                   JARED GE           MI          3.0
## 38         43          ROBERT GLEN VASEY           MI          3.0
## 39         44         JUSTIN D SCHILLING           MI          3.0
## 40         45                  DEREK YAN           MI          3.0
## 41         46   JACOB ALEXANDER LAVALLEY           MI          3.0
## 42         47                ERIC WRIGHT           MI          2.5
## 43         48               DANIEL KHAIN           MI          2.5
## 44         49           MICHAEL J MARTIN           MI          2.5
## 45          5                 HANSHI ZUO           MI          5.5
## 46         50                 SHIVAM JHA           MI          2.5
## 47         51             TEJAS AYYAGARI           MI          2.5
## 48         52                  ETHAN GUO           MI          2.5
## 49         53              JOSE C YBARRA           MI          2.0
## 50         54                LARRY HODGE           MI          2.0
## 51         55                  ALEX KONG           MI          2.0
## 52         56               MARISA RICCI           MI          2.0
## 53         57                 MICHAEL LU           MI          2.0
## 54         58               VIRAJ MOHILE           MI          2.0
## 55         59          SEAN M MC CORMICK           MI          2.0
## 56          6                HANSEN SONG           OH          5.0
## 57         60                 JULIA SHEN           MI          1.5
## 58         61              JEZZEL FARKAS           ON          1.5
## 59         62              ASHWIN BALAJI           MI          1.0
## 60         63       THOMAS JOSEPH HOSMER           MI          1.0
## 61         64                     BEN LI           MI          1.0
## 62          7          GARY DEE SWATHELL           MI          5.0
## 63          8           EZEKIEL HOUGHTON           MI          5.0
## 64          9                STEFANO LEE           ON          5.0
##    pre_rating opponent_rating
## 1        1794            1527
## 2        1365            1456
## 3        1712            1443
## 4        1663            1455
## 5        1666            1523
## 6        1610            1516
## 7        1220            1415
## 8        1604            1269
## 9        1629            1445
## 10       1600            1464
## 11       1564            1426
## 12       1553            1447
## 13       1595            1383
## 14       1563            1336
## 15       1555            1399
## 16       1363            1559
## 17       1229            1399
## 18       1745            1292
## 19       1579            1483
## 20       1552            1389
## 21       1507            1203
## 22       1602            1416
## 23       1384            1463
## 24       1522            1262
## 25       1494            1338
## 26       1441            1474
## 27       1449            1380
## 28       1399            1442
## 29       1438            1396
## 30       1355            1273
## 31        980            1374
## 32       1423            1350
## 33       1436            1355
## 34       1716            1328
## 35       1348            1233
## 36       1403            1480
## 37       1332            1415
## 38       1283            1576
## 39       1199            1460
## 40       1242            1437
## 41        377            1555
## 42       1362            1219
## 43       1382            1326
## 44       1291            1458
## 45       1655            1135
## 46       1056            1359
## 47       1011            1361
## 48        935            1415
## 49       1393            1402
## 50       1270            1441
## 51       1186            1169
## 52       1153            1421
## 53       1092            1667
## 54        917            1288
## 55        853            1516
## 56       1686            1498
## 57        967            1351
## 58        955            1435
## 59       1530            1367
## 60       1175            1529
## 61       1163            1251
## 62       1649            1737
## 63       1641            1487
## 64       1411            1313

# write dataframe to CSV
write.table(output_tbl, file = "tournamentinfo.csv", row.names=FALSE,sep = ",")

data_607-week_03

Nicholas Chung

09/22/2019

1. question

In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:

Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents

For the first player, the information would be:

Gary Hua, ON, 6.0, 1794, 1605

1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.

2. methodology

In the most general strokes, my approach to this project was comprised of three parts:

2-1. data loading

2-2. regex matching

2-3. referencing and joining

In “data loading”, I tried several different approaches, including using the “scan” and “readLines” methods, but settled for “read.delim” to handle the delimiter explicitly.

In “regex matching”, I used regex to match relevant values while preserving empty values, which were necessary for calculating number of games each player played.

In “referencing and joining”, I first recreated the original data structure as a dataframe as an intermediate frame of reference. Then I extracted a subset to simplify mapping and validate additional column values before producing the desired output and writing to file.

2-1. data loading

2-2. regex matching

2-3. referencing and joining

3. solution