Collaborators: Dan Rosenfeld, Jered Ataky, and Rick Sughrue
The collaborators listed above worked together on Project 1.
In this project, we’re given a text file of chess tournament results in a semi-structured layout. Our job is to create an R Markdown file that generates a .csv file (which could, for example, be imported into a SQL database) with the following information for every player:
Player Name | State | Total Points | Pre-Rating | Avg Opp Pre-Rating
For the first player, the information would be: Gary Hua, ON, 6.0, 1794, 1605
1605 is the sum of the opponents’ pre-tournament ratings (1436, 1563, 1600, 1610, 1649, 1663, and 1716) divided by the total number of games played (7), rounded to the nearest whole number.
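As a quick check of that arithmetic, the short sketch below recomputes the average from the seven ratings quoted above. The variable names are illustrative only.

```r
# Pre-tournament ratings of Gary Hua's seven opponents (taken from the text above)
opp_ratings <- c(1436, 1563, 1600, 1610, 1649, 1663, 1716)

# Sum divided by games played; equivalent to mean(opp_ratings)
avg <- sum(opp_ratings) / length(opp_ratings)

# Round to the nearest whole rating point
avg_rounded <- round(avg)
print(avg_rounded)
```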
Import the data from the text file.
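#Load stringr for the str_* functions used below
library(stringr)
#The rows are "|"-delimited, so sep = "," keeps each line intact in a single column
chess_input <- read.table(url("https://raw.githubusercontent.com/Magnus-PS/CUNY-SPS-DATA-607/Project-1/tournamentinfo.txt"), sep = ",")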
#head(chess_input)
From the output of head(), we can see that the header has been included and that the data is not yet organized. Before we can output the data in the desired form, we have to wrangle the ‘raw’ data set and pull out only the essential variables.
Remove the header, the dashed separator lines (“- - -”), and excess white space.
#Remove header
chess_input <- chess_input[-c(1:4),]
#Exclude every 3rd row (of dashed lines); start at 3, since index 0 is silently ignored
i <- seq(3, length(chess_input), 3)
chess_input <- chess_input[-i]
#Compress white space for readability
chess_input <- str_replace_all(chess_input, "\\s+", " ")
#head(chess_input)
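The two cleaning steps can be tried on a small stand-in vector. This is a minimal sketch using toy strings rather than the real file, and it uses base R's gsub() in place of str_replace_all() so that it runs without extra packages.

```r
# Toy stand-in for the file body: two "players", each followed by a dashed line
rows <- c("  1 |  GARY   HUA    |6.0  |",
          "  ON |  R: 1794   |",
          "---------",
          "  2 |  SOME   PLAYER |5.0  |",
          "  MI |  R: 1500   |",
          "---------")

# Every 3rd row is a separator; derive the index from the length instead of hard-coding it
dash_idx <- seq(3, length(rows), by = 3)
kept <- rows[-dash_idx]

# Collapse runs of white space into a single space
squeezed <- gsub("\\s+", " ", kept)
print(squeezed)
```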
The result of our processed data is alternating rows of values where the values themselves are separated by “|”s.
Odd rows: ID | Player Name | Total Pts | W/L/D Opp# …
Even rows: State | USCF ID | Rtg Pre -> Post | …
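To see this layout concretely, the sketch below splits one odd row and one even row on the "|" delimiter. The sample rows are illustrative: the name, state, points, pre-rating and opponent IDs for the first player come from this write-up, while the W/D result letters, the USCF ID, the post-rating and the trailing fields are placeholders assumed for the example.

```r
# Illustrative rows shaped like the cleaned data (placeholder values noted above)
odd_row  <- " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
even_row <- " ON | 12345678 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"

# Split on the literal "|" and trim the padding around each field
split_fields <- function(row) trimws(strsplit(row, "|", fixed = TRUE)[[1]])

odd_fields  <- split_fields(odd_row)
even_fields <- split_fields(even_row)

print(odd_fields[1:3])   # ID, player name, total points
print(even_fields[1:2])  # state, USCF ID / ratings
```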
Split the rows by position (odd vs. even) to simplify later extraction.
#Derive the row indices from the data rather than hard-coding 128
o_rows <- seq(1, length(chess_input), 2)
e_rows <- seq(2, length(chess_input), 2)
#chess_input is already a character vector, so plain indexing is enough
o_chess <- chess_input[o_rows]
e_chess <- chess_input[e_rows]
#head(o_chess)
#head(e_chess)
We now have 2 character vectors (o_chess and e_chess), each holding one element per player. We can now use regular expressions to extract the essential components from each vector.
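One alternative way to split alternating rows, shown here only as a sketch on toy data, is to index with a logical pattern that R recycles along the vector. This avoids computing the row numbers at all.

```r
# Toy stand-in for the cleaned rows: line 1 and line 2 for each of three players
rows <- c("p1 line1", "p1 line2",
          "p2 line1", "p2 line2",
          "p3 line1", "p3 line2")

# c(TRUE, FALSE) is recycled, so it keeps positions 1, 3, 5, ...
first_lines  <- rows[c(TRUE, FALSE)]
# c(FALSE, TRUE) keeps positions 2, 4, 6, ...
second_lines <- rows[c(FALSE, TRUE)]

print(first_lines)
print(second_lines)
```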
Extract essential components from each vector: Player Name | State | Total Points | Pre-Rating
and later Avg Opp Pre-Rating.
#Extract player IDs:
id <- unlist(str_extract_all(o_chess, "\\d{1,2} \\| "))
id <- unlist(str_extract_all(id, "\\d{1,2}"))
#head(id)
#Extract player names (note: this pattern captures at most three words, so longer names are cut off):
name <- unlist(str_extract_all(o_chess, "\\w+ ?\\w+ \\w+"))
#head(name)
#Extract player point total (the dot is escaped so it only matches a literal "."):
pt_total <- unlist(str_extract_all(o_chess, "\\d\\.\\d"))
#head(pt_total)
#Extract player state:
state <- unlist(str_extract_all(e_chess, " \\w{2} \\| "))
state <- unlist(str_extract_all(state, "\\w{2}"))
#head(state)
#Extract player pre-rating (the digits that follow "R:"):
prerating <- unlist(str_extract_all(e_chess, "(R:\\s*)(\\d+)"))
prerating <- as.numeric(unlist(str_extract_all(prerating, "(\\d+)")))
#prerating
At this point, we’ve extracted each player’s ID, name, point total, state, and prerating. Prerating has been stored as a double so that it can be averaged later.
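The extraction patterns can be exercised on a single pair of rows. The sketch below uses base R (regexpr() and regmatches()) instead of stringr so that it runs without extra packages. The sample rows are illustrative, with the USCF ID, post-rating and result letters as placeholders, and "ANNA MARIA DE LUCA" is a hypothetical four-word name. It also shows one possible way to keep names longer than three words, which the three-word pattern cuts off (compare "CAMERON WILLIAM MC" in the final table): take the whole second "|"-delimited field.

```r
# Illustrative rows (placeholder values as described above)
odd_row  <- " 11 | ANNA MARIA DE LUCA |4.5 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
even_row <- " ON | 12345678 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"

# Helper (hypothetical name): return the first match of a Perl-style pattern
first_match <- function(pattern, x) regmatches(x, regexpr(pattern, x, perl = TRUE))

pts   <- first_match("\\d\\.\\d", odd_row)                 # digit, literal dot, digit
state <- first_match("[A-Z]{2}(?= \\|)", even_row)         # two capitals before " |"
pre   <- as.numeric(first_match("(?<=R: )\\d+", even_row)) # digits right after "R: "

# The three-word pattern stops after the third word
truncated <- first_match("\\w+ ?\\w+ \\w+", odd_row)

# Taking the whole name field keeps every word
full_name <- trimws(strsplit(odd_row, "|", fixed = TRUE)[[1]][2])

print(c(pts, state, pre, truncated, full_name))
```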
We would like to combine the variables we’ve extracted into one table, but before we can do so we have one more variable to compute: the average opponent pre-rating.
Compute the Avg Opp Pre-Rating.
We’re going to build a matrix of opponent ID#s, replace these numbers with their corresponding prerating and then take the average across each row to compute the Avg Opp Pre-Rating.
#Extract each row from the point total onward, replace the results that have no opponent (B, H, U and X) with " 0", and then extract the opponent ID#s.
opp_ids <- unlist(str_extract_all(o_chess, "\\|[0-9].*"))
opp_ids <- str_replace_all(opp_ids, "[BHUX]", " 0")
opp_ids <- str_extract_all(opp_ids, "\\s\\d{1,2}")
#Convert into a 64 x 7 numeric matrix: one row per player, one column per round.
opp_ids_m <- matrix(as.numeric(unlist(opp_ids)), byrow = TRUE, nrow = length(opp_ids))
#Replace each 0 (no opponent that round) with NA, then replace every remaining id# with its corresponding prerating.
opp_ids_m[opp_ids_m == 0] <- NA
opp_ids_m[] <- prerating[opp_ids_m]
#Note: the lookup overwrites the id#s in place, so rebuild opp_ids_m from opp_ids before running it a second time.
#Take the average value per row while omitting missing values.
avg_opp_rating <- round(rowMeans(opp_ids_m, na.rm = TRUE))
avg_opp_rating
## [1] 1605 1469 1564 1574 1501 1519 1372 1468 1523 1554 1468 1506 1498 1515 1484
## [16] 1386 1499 1480 1426 1411 1470 1300 1214 1357 1363 1507 1222 1522 1314 1144
## [31] 1260 1379 1277 1375 1150 1388 1385 1539 1430 1391 1248 1150 1107 1327 1152
## [46] 1358 1392 1356 1286 1296 1356 1495 1345 1206 1406 1414 1363 1391 1319 1330
## [61] 1327 1186 1350 1263
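The lookup-and-average idea is easier to follow on a small case. The sketch below uses four hypothetical players with made-up ratings and opponent IDs, where 0 again marks a round with no opponent. It writes the ratings to a separate matrix rather than overwriting the IDs, so the step can be rerun safely.

```r
# Hypothetical pre-ratings for players 1 to 4 (not taken from the tournament file)
prerating <- c(1500, 1400, 1600, 1300)

# Hypothetical opponent IDs, one row per player and one column per round; 0 = no opponent
opp_ids <- matrix(c(2, 3, 0,
                    1, 4, 3,
                    1, 2, 4,
                    2, 0, 3), nrow = 4, byrow = TRUE)

# Mark missing opponents as NA so they drop out of the average
opp_ids[opp_ids == 0] <- NA

# Index the rating vector with the whole ID matrix, then restore the matrix shape
opp_ratings <- matrix(prerating[opp_ids], nrow = nrow(opp_ids))

# Row-wise mean over the rounds actually played
avg_opp <- round(rowMeans(opp_ratings, na.rm = TRUE))
print(avg_opp)
```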
At this point, we have extracted all the essential components and all we have to do is combine and export them to .csv.
Combine desired components into dataframe and export to .csv.
#Combine desired components into dataframe
combined <- data.frame(name, state, pt_total, prerating, avg_opp_rating)
colnames(combined) <- c("Player Name", "Player's State", "Point Total", "Pre Rating", "Avg Opponent Rating")
combined
##    Player Name              Player's State Point Total Pre Rating Avg Opponent Rating
## 1  GARY HUA                 ON             6.0         1794       1605
## 2  DAKSHESH DARURI          MI             6.0         1553       1469
## 3  ADITYA BAJAJ             MI             6.0         1384       1564
## 4  PATRICK H SCHILLING      MI             5.5         1716       1574
## 5  HANSHI ZUO               MI             5.5         1655       1501
## 6  HANSEN SONG              OH             5.0         1686       1519
## 7  GARY DEE SWATHELL        MI             5.0         1649       1372
## 8  EZEKIEL HOUGHTON         MI             5.0         1641       1468
## 9  STEFANO LEE              ON             5.0         1411       1523
## 10 ANVIT RAO                MI             5.0         1365       1554
## 11 CAMERON WILLIAM MC       MI             4.5         1712       1468
## 12 KENNETH J TACK           MI             4.5         1663       1506
## 13 TORRANCE HENRY JR        MI             4.5         1666       1498
## 14 BRADLEY SHAW             MI             4.5         1610       1515
## 15 ZACHARY JAMES HOUGHTON   MI             4.5         1220       1484
## 16 MIKE NIKITIN             MI             4.0         1604       1386
## 17 RONALD GRZEGORCZYK       MI             4.0         1629       1499
## 18 DAVID SUNDEEN            MI             4.0         1600       1480
## 19 DIPANKAR ROY             MI             4.0         1564       1426
## 20 JASON ZHENG              MI             4.0         1595       1411
## 21 DINH DANG BUI            ON             4.0         1563       1470
## 22 EUGENE L MCCLURE         MI             4.0         1555       1300
## 23 ALAN BUI                 ON             4.0         1363       1214
## 24 MICHAEL R ALDRICH        MI             4.0         1229       1357
## 25 LOREN SCHWIEBERT         MI             3.5         1745       1363
## 26 MAX ZHU                  ON             3.5         1579       1507
## 27 GAURAV GIDWANI           MI             3.5         1552       1222
## 28 SOFIA ADINA STANESCU     MI             3.5         1507       1522
## 29 CHIEDOZIE OKORIE         MI             3.5         1602       1314
## 30 GEORGE AVERY JONES       ON             3.5         1522       1144
## 31 RISHI SHETTY             MI             3.5         1494       1260
## 32 JOSHUA PHILIP MATHEWS    ON             3.5         1441       1379
## 33 JADE GE                  MI             3.5         1449       1277
## 34 MICHAEL JEFFERY THOMAS   MI             3.5         1399       1375
## 35 JOSHUA DAVID LEE         MI             3.5         1438       1150
## 36 SIDDHARTH JHA            MI             3.5         1355       1388
## 37 AMIYATOSH PWNANANDAM     MI             3.5         980        1385
## 38 BRIAN LIU                MI             3.0         1423       1539
## 39 JOEL R HENDON            MI             3.0         1436       1430
## 40 FOREST ZHANG             MI             3.0         1348       1391
## 41 KYLE WILLIAM MURPHY      MI             3.0         1403       1248
## 42 JARED GE                 MI             3.0         1332       1150
## 43 ROBERT GLEN VASEY        MI             3.0         1283       1107
## 44 JUSTIN D SCHILLING       MI             3.0         1199       1327
## 45 DEREK YAN                MI             3.0         1242       1152
## 46 JACOB ALEXANDER LAVALLEY MI             3.0         377        1358
## 47 ERIC WRIGHT              MI             2.5         1362       1392
## 48 DANIEL KHAIN             MI             2.5         1382       1356
## 49 MICHAEL J MARTIN         MI             2.5         1291       1286
## 50 SHIVAM JHA               MI             2.5         1056       1296
## 51 TEJAS AYYAGARI           MI             2.5         1011       1356
## 52 ETHAN GUO                MI             2.5         935        1495
## 53 JOSE C YBARRA            MI             2.0         1393       1345
## 54 LARRY HODGE              MI             2.0         1270       1206
## 55 ALEX KONG                MI             2.0         1186       1406
## 56 MARISA RICCI             MI             2.0         1153       1414
## 57 MICHAEL LU               MI             2.0         1092       1363
## 58 VIRAJ MOHILE             MI             2.0         917        1391
## 59 SEAN M MC                MI             2.0         853        1319
## 60 JULIA SHEN               MI             1.5         967        1330
## 61 JEZZEL FARKAS            ON             1.5         955        1327
## 62 ASHWIN BALAJI            MI             1.0         1530       1186
## 63 THOMAS JOSEPH HOSMER     MI             1.0         1175       1350
## 64 BEN LI                   MI             1.0         1163       1263
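The final export step could look like the sketch below. It is a minimal illustration, not the authors' code: the output file name (chess_ratings.csv) and the use of a temporary directory are assumptions, and the two-row data frame simply reuses the first two rows of the table above so that the example is self-contained.

```r
# Stand-in for the combined data frame, using the first two rows shown above
combined_demo <- data.frame(
  "Player Name"         = c("GARY HUA", "DAKSHESH DARURI"),
  "Player's State"      = c("ON", "MI"),
  "Point Total"         = c("6.0", "6.0"),
  "Pre Rating"          = c(1794, 1553),
  "Avg Opponent Rating" = c(1605, 1469),
  check.names = FALSE   # keep the column names with spaces as written
)

# Hypothetical output location; swap in any path you prefer
out_file <- file.path(tempdir(), "chess_ratings.csv")

# row.names = FALSE keeps R's row index out of the file
write.csv(combined_demo, out_file, row.names = FALSE)

# Read the file back to confirm the export round-trips
check <- read.csv(out_file, check.names = FALSE)
print(check)
```

With the real data, the same write.csv() call would take combined in place of combined_demo.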