1. question
In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:
Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents
For the first player, the information would be:
Gary Hua, ON, 6.0, 1794, 1605
1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.
2. methodology
In the most general strokes, my approach to this project was comprised of three parts:
2-1. data loading
2-2. regex matching
2-3. referencing and joining
In “data loading”, I tried several different approaches, including using the “scan” and “readLines” methods, but settled for “read.delim” to handle the delimiter explicitly.
In “regex matching”, I used regex to match relevant values while preserving empty values, which were necessary for calculating number of games each player played.
In “referencing and joining”, I first recreated the original data structure as a dataframe as an intermediate frame of reference. Then I extracted a subset to simplify mapping and validate additional column values before producing the desired output and writing to file.
2-1. data loading
# read pipe delimited text file in [table format|https://www.rdocumentation.org/packages/psych/versions/1.8.12/topics/read.file], skipping the 1st line
raw_txt_a = read.delim("tournamentinfo.txt", sep = "|", skip = 1) # read txt file
head(raw_txt_a)
## Pair
## 1 Num
## 2 -----------------------------------------------------------------------------------------
## 3 1
## 4 ON
## 5 -----------------------------------------------------------------------------------------
## 6 2
## Player.Name Total Round Round.1 Round.2 Round.3
## 1 USCF ID / Rtg (Pre->Post) Pts 1 2 3 4
## 2
## 3 GARY HUA 6.0 W 39 W 21 W 18 W 14
## 4 15445895 / R: 1794 ->1817 N:2 W B W B
## 5
## 6 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17
## Round.4 Round.5 Round.6 X
## 1 5 6 7 NA
## 2 NA
## 3 W 7 D 12 D 4 NA
## 4 W B W NA
## 5 NA
## 6 W 16 W 20 W 7 NA
# read existing headers
names(raw_txt_a)
## [1] "Pair" "Player.Name" "Total" "Round" "Round.1"
## [6] "Round.2" "Round.3" "Round.4" "Round.5" "Round.6"
## [11] "X"
# read first two lines
head(raw_txt_a, n = 2)
## Pair
## 1 Num
## 2 -----------------------------------------------------------------------------------------
## Player.Name Total Round Round.1 Round.2 Round.3
## 1 USCF ID / Rtg (Pre->Post) Pts 1 2 3 4
## 2
## Round.4 Round.5 Round.6 X
## 1 5 6 7 NA
## 2 NA
# rename headers
h_txt <- rename(raw_txt_a, pair_num = Pair, player_name = Player.Name, total_points = Total, round_1 = Round, round_2 = Round.1, round_3 = Round.2, round_4 = Round.3, round_5 = Round.4, round_6 = Round.5, round_7 = Round.6)
head(h_txt)
## pair_num
## 1 Num
## 2 -----------------------------------------------------------------------------------------
## 3 1
## 4 ON
## 5 -----------------------------------------------------------------------------------------
## 6 2
## player_name total_points round_1 round_2 round_3
## 1 USCF ID / Rtg (Pre->Post) Pts 1 2 3
## 2
## 3 GARY HUA 6.0 W 39 W 21 W 18
## 4 15445895 / R: 1794 ->1817 N:2 W B W
## 5
## 6 DAKSHESH DARURI 6.0 W 63 W 58 L 4
## round_4 round_5 round_6 round_7 X
## 1 4 5 6 7 NA
## 2 NA
## 3 W 14 W 7 D 12 D 4 NA
## 4 B W B W NA
## 5 NA
## 6 W 17 W 16 W 20 W 7 NA
# remove first row
rh_txt <- h_txt[-c(1), ]
head(rh_txt)
## pair_num
## 2 -----------------------------------------------------------------------------------------
## 3 1
## 4 ON
## 5 -----------------------------------------------------------------------------------------
## 6 2
## 7 MI
## player_name total_points round_1 round_2 round_3
## 2
## 3 GARY HUA 6.0 W 39 W 21 W 18
## 4 15445895 / R: 1794 ->1817 N:2 W B W
## 5
## 6 DAKSHESH DARURI 6.0 W 63 W 58 L 4
## 7 14598900 / R: 1553 ->1663 N:2 B W B
## round_4 round_5 round_6 round_7 X
## 2 NA
## 3 W 14 W 7 D 12 D 4 NA
## 4 B W B W NA
## 5 NA
## 6 W 17 W 16 W 20 W 7 NA
## 7 W B W B NA
# match rows with dashes
head(str_subset(rh_txt$pair_num, "[-]"))
## [1] "-----------------------------------------------------------------------------------------"
## [2] "-----------------------------------------------------------------------------------------"
## [3] "-----------------------------------------------------------------------------------------"
## [4] "-----------------------------------------------------------------------------------------"
## [5] "-----------------------------------------------------------------------------------------"
## [6] "-----------------------------------------------------------------------------------------"
# remove rows with dashes
drh_txt <- subset(rh_txt, pair_num!=str_subset(rh_txt$pair_num, "[-]"))
## Warning in `!=.default`(pair_num, str_subset(rh_txt$pair_num, "[-]")):
## longer object length is not a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
head(drh_txt)
## pair_num player_name total_points round_1 round_2
## 3 1 GARY HUA 6.0 W 39 W 21
## 4 ON 15445895 / R: 1794 ->1817 N:2 W B
## 6 2 DAKSHESH DARURI 6.0 W 63 W 58
## 7 MI 14598900 / R: 1553 ->1663 N:2 B W
## 9 3 ADITYA BAJAJ 6.0 L 8 W 61
## 10 MI 14959604 / R: 1384 ->1640 N:2 W B
## round_3 round_4 round_5 round_6 round_7 X
## 3 W 18 W 14 W 7 D 12 D 4 NA
## 4 W B W B W NA
## 6 L 4 W 17 W 16 W 20 W 7 NA
## 7 B W B W B NA
## 9 W 25 W 21 W 11 W 13 W 12 NA
## 10 W B W B W NA
2-2. regex matching
# match player number
player_numbers <- trimws(str_subset(drh_txt$pair_num, "[[:digit:]]")) # do I need to change to int?
# check for completeness against expected count
length(player_numbers)
## [1] 64
# match player name
names <- trimws(str_extract_all(drh_txt$player_name, "^[A-Z\\- ]+$", simplify = TRUE))
# remove empty strings
names <- names[names != ""]
# check for completeness against expected count
length(names)
## [1] 64
# match player state
player_state <- trimws(str_subset(drh_txt$pair_num, "[[:alpha:]]"))
# check for completeness against expected count
length(player_state)
## [1] 64
# match total points
total_points <- trimws(str_subset(drh_txt$total_points, "\\d(\\.\\d)"))
# check for completeness against expected count
length(total_points)
## [1] 64
# match player pre-rating
# " 15445895 / R: 1794 ->1817"
pre_rating <- trimws(unlist(str_extract_all(drh_txt$player_name, "[:space:]\\d{3,4}", simplify = TRUE))[,2])
# remove empty strings
pre_rating <- pre_rating[pre_rating != ""]
# check for completeness against expected count
length(pre_rating)
## [1] 64
2-3. referencing and joining
# create matrix from relevant data
player_tbl <- cbind(player_numbers, player_state, names, pre_rating, total_points) # create matrix
# convert matrix to dataframe
player_tbl <- data.frame(player_tbl)
head(player_tbl)
## player_numbers player_state names pre_rating total_points
## 1 1 ON GARY HUA 1794 6.0
## 2 2 MI DAKSHESH DARURI 1553 6.0
## 3 3 MI ADITYA BAJAJ 1384 6.0
## 4 4 MI PATRICK H SCHILLING 1716 5.5
## 5 5 MI HANSHI ZUO 1655 5.5
## 6 6 OH HANSEN SONG 1686 5.0
# replicate gameplay data
games <- subset(drh_txt, select = round_1:round_7)
# create matrix excluding irrelevant rows
games <- games[seq(1, nrow(games), 2), ]
# extract player numbers from matrix and transpose
games <- t(apply(games, 1, function(x) str_extract_all(x, "[0-9]{1,2}", simplify = TRUE)))
# store matrix as dataframe
games_tbl <- data.frame(games)
# reorder dataframe by row number
rownames(games_tbl) <- seq(length = nrow(games_tbl))
# optionally, rename headers for ease of reference
games_tbl <- rename(games_tbl, round_1 = X1, round_2 = X2, round_3 = X3, round_4 = X4, round_5 = X5, round_6 = X6, round_7 = X7)
# produce cleaned version of original tournamentinfo table for reference
tournament_info <- merge(x = player_tbl, y = games_tbl, by = 0)
head(tournament_info)
## Row.names player_numbers player_state names
## 1 1 1 ON GARY HUA
## 2 10 10 MI ANVIT RAO
## 3 11 11 MI CAMERON WILLIAM MC LEMAN
## 4 12 12 MI KENNETH J TACK
## 5 13 13 MI TORRANCE HENRY JR
## 6 14 14 MI BRADLEY SHAW
## pre_rating total_points round_1 round_2 round_3 round_4 round_5 round_6
## 1 1794 6.0 39 21 18 14 7 12
## 2 1365 5.0 16 19 55 31 6 25
## 3 1712 4.5 38 56 6 7 3 34
## 4 1663 4.5 42 33 5 38 1
## 5 1666 4.5 36 27 7 5 33 3
## 6 1610 4.5 54 44 8 1 27 5
## round_7
## 1 4
## 2 18
## 3 26
## 4 3
## 5 32
## 6 31
# replace player numbers in games columns with relevant pre-rating scores
# create key, value list pair
key <- c(as.character(tournament_info$player_numbers))
value <- c(as.character(tournament_info$pre_rating))
# create map from keys and values
map <- setNames(value, key)
# map new values to dataframe
games_tbl[] <- map[unlist(games_tbl)]
# convert new values to integer class
games_tbl[, ] <- lapply(games_tbl[, ], as.integer)
# sum total score per row while accounting for missing values
games_tbl$total_score <- rowSums(games_tbl[1:7], na.rm = TRUE)
# count number of games played by discounting missing values
games_tbl$n_games <- rowSums(!is.na(games_tbl[1:7]))
# store calculation of average rating per row in the new column
games_tbl[, "opp_avg_score"] <- round(games_tbl[, "total_score"] / games_tbl[, "n_games"])
head(games_tbl)
## round_1 round_2 round_3 round_4 round_5 round_6 round_7 total_score
## 1 1436 1555 1564 1220 1530 1666 1716 10687
## 2 967 1186 1716 1600 1629 1563 1530 10191
## 3 1175 853 1579 1555 1663 1610 1666 10101
## 4 1229 1602 1595 1552 1291 1553 1365 10187
## 5 1199 1649 1666 1610 1716 1220 1600 10660
## 6 1438 1384 1663 1355 1712 1507 1555 10614
## n_games opp_avg_score
## 1 7 1527
## 2 7 1456
## 3 7 1443
## 4 7 1455
## 5 7 1523
## 6 7 1516
3. solution
# create desired output
# store outputs in desired classes as vectors
player_num <- as.vector(tournament_info$player_numbers)
player_nom <- as.vector(tournament_info$names)
player_state <- as.vector(tournament_info$player_state)
total_points <- as.vector(tournament_info$total_points)
pre_rating <- as.integer(as.vector(tournament_info$pre_rating))
opponent_rating <- as.vector(games_tbl$opp_avg_score)
# create desired dataframe from output vectors
output_tbl <- data.frame(player_num, player_nom, player_state, total_points, pre_rating, opponent_rating)
output_tbl
## player_num player_nom player_state total_points
## 1 1 GARY HUA ON 6.0
## 2 10 ANVIT RAO MI 5.0
## 3 11 CAMERON WILLIAM MC LEMAN MI 4.5
## 4 12 KENNETH J TACK MI 4.5
## 5 13 TORRANCE HENRY JR MI 4.5
## 6 14 BRADLEY SHAW MI 4.5
## 7 15 ZACHARY JAMES HOUGHTON MI 4.5
## 8 16 MIKE NIKITIN MI 4.0
## 9 17 RONALD GRZEGORCZYK MI 4.0
## 10 18 DAVID SUNDEEN MI 4.0
## 11 19 DIPANKAR ROY MI 4.0
## 12 2 DAKSHESH DARURI MI 6.0
## 13 20 JASON ZHENG MI 4.0
## 14 21 DINH DANG BUI ON 4.0
## 15 22 EUGENE L MCCLURE MI 4.0
## 16 23 ALAN BUI ON 4.0
## 17 24 MICHAEL R ALDRICH MI 4.0
## 18 25 LOREN SCHWIEBERT MI 3.5
## 19 26 MAX ZHU ON 3.5
## 20 27 GAURAV GIDWANI MI 3.5
## 21 28 SOFIA ADINA STANESCU-BELLU MI 3.5
## 22 29 CHIEDOZIE OKORIE MI 3.5
## 23 3 ADITYA BAJAJ MI 6.0
## 24 30 GEORGE AVERY JONES ON 3.5
## 25 31 RISHI SHETTY MI 3.5
## 26 32 JOSHUA PHILIP MATHEWS ON 3.5
## 27 33 JADE GE MI 3.5
## 28 34 MICHAEL JEFFERY THOMAS MI 3.5
## 29 35 JOSHUA DAVID LEE MI 3.5
## 30 36 SIDDHARTH JHA MI 3.5
## 31 37 AMIYATOSH PWNANANDAM MI 3.5
## 32 38 BRIAN LIU MI 3.0
## 33 39 JOEL R HENDON MI 3.0
## 34 4 PATRICK H SCHILLING MI 5.5
## 35 40 FOREST ZHANG MI 3.0
## 36 41 KYLE WILLIAM MURPHY MI 3.0
## 37 42 JARED GE MI 3.0
## 38 43 ROBERT GLEN VASEY MI 3.0
## 39 44 JUSTIN D SCHILLING MI 3.0
## 40 45 DEREK YAN MI 3.0
## 41 46 JACOB ALEXANDER LAVALLEY MI 3.0
## 42 47 ERIC WRIGHT MI 2.5
## 43 48 DANIEL KHAIN MI 2.5
## 44 49 MICHAEL J MARTIN MI 2.5
## 45 5 HANSHI ZUO MI 5.5
## 46 50 SHIVAM JHA MI 2.5
## 47 51 TEJAS AYYAGARI MI 2.5
## 48 52 ETHAN GUO MI 2.5
## 49 53 JOSE C YBARRA MI 2.0
## 50 54 LARRY HODGE MI 2.0
## 51 55 ALEX KONG MI 2.0
## 52 56 MARISA RICCI MI 2.0
## 53 57 MICHAEL LU MI 2.0
## 54 58 VIRAJ MOHILE MI 2.0
## 55 59 SEAN M MC CORMICK MI 2.0
## 56 6 HANSEN SONG OH 5.0
## 57 60 JULIA SHEN MI 1.5
## 58 61 JEZZEL FARKAS ON 1.5
## 59 62 ASHWIN BALAJI MI 1.0
## 60 63 THOMAS JOSEPH HOSMER MI 1.0
## 61 64 BEN LI MI 1.0
## 62 7 GARY DEE SWATHELL MI 5.0
## 63 8 EZEKIEL HOUGHTON MI 5.0
## 64 9 STEFANO LEE ON 5.0
## pre_rating opponent_rating
## 1 1794 1527
## 2 1365 1456
## 3 1712 1443
## 4 1663 1455
## 5 1666 1523
## 6 1610 1516
## 7 1220 1415
## 8 1604 1269
## 9 1629 1445
## 10 1600 1464
## 11 1564 1426
## 12 1553 1447
## 13 1595 1383
## 14 1563 1336
## 15 1555 1399
## 16 1363 1559
## 17 1229 1399
## 18 1745 1292
## 19 1579 1483
## 20 1552 1389
## 21 1507 1203
## 22 1602 1416
## 23 1384 1463
## 24 1522 1262
## 25 1494 1338
## 26 1441 1474
## 27 1449 1380
## 28 1399 1442
## 29 1438 1396
## 30 1355 1273
## 31 980 1374
## 32 1423 1350
## 33 1436 1355
## 34 1716 1328
## 35 1348 1233
## 36 1403 1480
## 37 1332 1415
## 38 1283 1576
## 39 1199 1460
## 40 1242 1437
## 41 377 1555
## 42 1362 1219
## 43 1382 1326
## 44 1291 1458
## 45 1655 1135
## 46 1056 1359
## 47 1011 1361
## 48 935 1415
## 49 1393 1402
## 50 1270 1441
## 51 1186 1169
## 52 1153 1421
## 53 1092 1667
## 54 917 1288
## 55 853 1516
## 56 1686 1498
## 57 967 1351
## 58 955 1435
## 59 1530 1367
## 60 1175 1529
## 61 1163 1251
## 62 1649 1737
## 63 1641 1487
## 64 1411 1313
# write dataframe to CSV
write.table(output_tbl, file = "tournamentinfo.csv", row.names=FALSE,sep = ",")