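We are asked to parse a text file of chess tournament results and extract, for each player, the name, state, total number of points, pre-tournament rating, and the average pre-tournament rating of the opponents played.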
The results are collected in a data frame, which is then exported to a .csv file. Regular expressions were used to turn the unstructured text into structured data for further analysis. Only games that ended in a win, loss, or draw were counted toward the average opponent pre-rating; other round outcomes, such as withdrawals and half points, were left out.
Each player's relevant information, such as name and state, was extracted and stored in vectors. The vectors were then combined into a data frame called test_df.
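# Packages the code in this report relies on:
# readr (read_file), stringr (str_*), tibble (add_column), knitr (kable).
library(readr)
library(stringr)
library(tibble)
library(knitr)
urlfile <- 'https://raw.githubusercontent.com/peterphung2043/DATA-606---Project-1/main/tournamentinfo.txt'
raw_text <- read_file(url(urlfile))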
## Split the raw text into one string per player; each record spans two lines.
each_player <- str_extract_all(raw_text, '\\r\\n[\\s]+[:digit:]*\\s[|].*\\r\\n[\\s]+[:alpha:]{2}\\s[|].*')[[1]]
players <- vector()
states <- vector()
pre_ratings <- vector()
total_num_pts <- vector()
for (i in seq_along(each_player)) {
  test_player <- each_player[i]
  # Name: one or more words (optionally hyphenated) after the first "| ".
  name_with_ending_space <- str_extract(test_player, "(?<=[|]\\s)([:alpha:]+\\s{1}|[:alpha:]+\\-[:alpha:]+\\s{1})+")
  # Drop the trailing space that the pattern captures.
  players[i] <- substr(name_with_ending_space, 1, nchar(name_with_ending_space) - 1)
  # State: two uppercase letters at the start of the record's second line.
  states[i] <- str_extract(test_player, '(?<=\\s\\s)[:upper:]{2}(?=\\s[|])')
  # Pre-rating: the digits that follow "R:".
  pre_ratings[i] <- as.integer(str_match(test_player, 'R:\\s+(\\d+)')[1, 2])
  # Total points: the decimal number right after a "|" (kept as a string).
  total_num_pts[i] <- str_extract(test_player, '(?<=[|])\\d+\\.\\d+')
}
test_df <- data.frame("players" = players, "states" = states, "total_num_pts" = total_num_pts,
                      "pre_ratings" = pre_ratings)
kable(test_df)
| players | states | total_num_pts | pre_ratings |
|---|---|---|---|
| GARY HUA | ON | 6.0 | 1794 |
| DAKSHESH DARURI | MI | 6.0 | 1553 |
| ADITYA BAJAJ | MI | 6.0 | 1384 |
| PATRICK H SCHILLING | MI | 5.5 | 1716 |
| HANSHI ZUO | MI | 5.5 | 1655 |
| HANSEN SONG | OH | 5.0 | 1686 |
| GARY DEE SWATHELL | MI | 5.0 | 1649 |
| EZEKIEL HOUGHTON | MI | 5.0 | 1641 |
| STEFANO LEE | ON | 5.0 | 1411 |
| ANVIT RAO | MI | 5.0 | 1365 |
| CAMERON WILLIAM MC LEMAN | MI | 4.5 | 1712 |
| KENNETH J TACK | MI | 4.5 | 1663 |
| TORRANCE HENRY JR | MI | 4.5 | 1666 |
| BRADLEY SHAW | MI | 4.5 | 1610 |
| ZACHARY JAMES HOUGHTON | MI | 4.5 | 1220 |
| MIKE NIKITIN | MI | 4.0 | 1604 |
| RONALD GRZEGORCZYK | MI | 4.0 | 1629 |
| DAVID SUNDEEN | MI | 4.0 | 1600 |
| DIPANKAR ROY | MI | 4.0 | 1564 |
| JASON ZHENG | MI | 4.0 | 1595 |
| DINH DANG BUI | ON | 4.0 | 1563 |
| EUGENE L MCCLURE | MI | 4.0 | 1555 |
| ALAN BUI | ON | 4.0 | 1363 |
| MICHAEL R ALDRICH | MI | 4.0 | 1229 |
| LOREN SCHWIEBERT | MI | 3.5 | 1745 |
| MAX ZHU | ON | 3.5 | 1579 |
| GAURAV GIDWANI | MI | 3.5 | 1552 |
| SOFIA ADINA STANESCU-BELLU | MI | 3.5 | 1507 |
| CHIEDOZIE OKORIE | MI | 3.5 | 1602 |
| GEORGE AVERY JONES | ON | 3.5 | 1522 |
| RISHI SHETTY | MI | 3.5 | 1494 |
| JOSHUA PHILIP MATHEWS | ON | 3.5 | 1441 |
| JADE GE | MI | 3.5 | 1449 |
| MICHAEL JEFFERY THOMAS | MI | 3.5 | 1399 |
| JOSHUA DAVID LEE | MI | 3.5 | 1438 |
| SIDDHARTH JHA | MI | 3.5 | 1355 |
| AMIYATOSH PWNANANDAM | MI | 3.5 | 980 |
| BRIAN LIU | MI | 3.0 | 1423 |
| JOEL R HENDON | MI | 3.0 | 1436 |
| FOREST ZHANG | MI | 3.0 | 1348 |
| KYLE WILLIAM MURPHY | MI | 3.0 | 1403 |
| JARED GE | MI | 3.0 | 1332 |
| ROBERT GLEN VASEY | MI | 3.0 | 1283 |
| JUSTIN D SCHILLING | MI | 3.0 | 1199 |
| DEREK YAN | MI | 3.0 | 1242 |
| JACOB ALEXANDER LAVALLEY | MI | 3.0 | 377 |
| ERIC WRIGHT | MI | 2.5 | 1362 |
| DANIEL KHAIN | MI | 2.5 | 1382 |
| MICHAEL J MARTIN | MI | 2.5 | 1291 |
| SHIVAM JHA | MI | 2.5 | 1056 |
| TEJAS AYYAGARI | MI | 2.5 | 1011 |
| ETHAN GUO | MI | 2.5 | 935 |
| JOSE C YBARRA | MI | 2.0 | 1393 |
| LARRY HODGE | MI | 2.0 | 1270 |
| ALEX KONG | MI | 2.0 | 1186 |
| MARISA RICCI | MI | 2.0 | 1153 |
| MICHAEL LU | MI | 2.0 | 1092 |
| VIRAJ MOHILE | MI | 2.0 | 917 |
| SEAN M MC CORMICK | MI | 2.0 | 853 |
| JULIA SHEN | MI | 1.5 | 967 |
| JEZZEL FARKAS | ON | 1.5 | 955 |
| ASHWIN BALAJI | MI | 1.0 | 1530 |
| THOMAS JOSEPH HOSMER | MI | 1.0 | 1175 |
| BEN LI | MI | 1.0 | 1163 |
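
To make the patterns easier to check in isolation, the sketch below applies them to two hand-written records rather than the downloaded file. It is a vectorised variant, not the loop used in this report: `str_extract()` and `str_match()` accept a whole character vector, so no explicit iteration is needed. The names, states, points, and pre-ratings are taken from the first two rows of the table above; the ID numbers and post-ratings are made-up placeholders, and `records`, `name_pattern`, and `demo_df` are illustrative names.

```r
library(stringr)

# Two records shaped like entries in the tournament file. The 8-digit IDs and
# the values after "->" are placeholders, not real data.
records <- c(
  paste0("\r\n    1 | GARY HUA                        |6.0  |",
         "\r\n   ON | 11111111 / R: 1794   ->1800     |N:2  |"),
  paste0("\r\n    2 | DAKSHESH DARURI                 |6.0  |",
         "\r\n   MI | 22222222 / R: 1553   ->1560     |N:2  |")
)

# Same name pattern as in the report: words (optionally hyphenated), each
# followed by one space, starting right after "| ".
name_pattern <- "(?<=[|]\\s)([:alpha:]+\\s{1}|[:alpha:]+\\-[:alpha:]+\\s{1})+"

demo_df <- data.frame(
  # str_trim() removes the trailing space captured by the pattern.
  players = str_trim(str_extract(records, name_pattern)),
  states = str_extract(records, "(?<=\\s\\s)[:upper:]{2}(?=\\s[|])"),
  total_num_pts = str_extract(records, "(?<=[|])\\d+\\.\\d+"),
  # Column 2 of the str_match() result holds the first capture group.
  pre_ratings = as.integer(str_match(records, "R:\\s+(\\d+)")[, 2])
)

print(demo_df)
```

If the patterns behave as intended, the two printed rows should agree with the first two rows of the table.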
Next, the average pre-tournament rating of each player's opponents was calculated and added to the test_df data frame. The code and the resulting data frame are shown below.
# Look up the pre-ratings of the given opponents and return their rounded mean.
# Pairing numbers double as row indices in test_df.
avg_pre_chess_func <- function(wld_vector){
  aggregated_pre_ratings <- vector()
  for (i in seq_along(wld_vector)){
    aggregated_pre_ratings[i] <- test_df$pre_ratings[wld_vector[i]]
  }
  return(round(mean(aggregated_pre_ratings)))
}
avg_pre_chess_ratings <- vector()
for (i in seq_along(each_player)){
  test_player <- each_player[i]
  # Keep only rounds recorded as W, L, or D followed by an opponent number.
  raw_opponents <- str_extract_all(test_player, '[|](W|L|D)\\s*[:digit:]+')[[1]]
  # Strip the result letter, leaving the opponent's pairing number.
  wlds <- as.integer(str_extract(raw_opponents, '[:digit:]+'))
  avg_pre_chess_ratings[i] <- avg_pre_chess_func(wlds)
}
test_df <- add_column(test_df, avg_pre_chess_ratings = avg_pre_chess_ratings)
kable(test_df)
| players | states | total_num_pts | pre_ratings | avg_pre_chess_ratings |
|---|---|---|---|---|
| GARY HUA | ON | 6.0 | 1794 | 1605 |
| DAKSHESH DARURI | MI | 6.0 | 1553 | 1469 |
| ADITYA BAJAJ | MI | 6.0 | 1384 | 1564 |
| PATRICK H SCHILLING | MI | 5.5 | 1716 | 1574 |
| HANSHI ZUO | MI | 5.5 | 1655 | 1501 |
| HANSEN SONG | OH | 5.0 | 1686 | 1519 |
| GARY DEE SWATHELL | MI | 5.0 | 1649 | 1372 |
| EZEKIEL HOUGHTON | MI | 5.0 | 1641 | 1468 |
| STEFANO LEE | ON | 5.0 | 1411 | 1523 |
| ANVIT RAO | MI | 5.0 | 1365 | 1554 |
| CAMERON WILLIAM MC LEMAN | MI | 4.5 | 1712 | 1468 |
| KENNETH J TACK | MI | 4.5 | 1663 | 1506 |
| TORRANCE HENRY JR | MI | 4.5 | 1666 | 1498 |
| BRADLEY SHAW | MI | 4.5 | 1610 | 1515 |
| ZACHARY JAMES HOUGHTON | MI | 4.5 | 1220 | 1484 |
| MIKE NIKITIN | MI | 4.0 | 1604 | 1386 |
| RONALD GRZEGORCZYK | MI | 4.0 | 1629 | 1499 |
| DAVID SUNDEEN | MI | 4.0 | 1600 | 1480 |
| DIPANKAR ROY | MI | 4.0 | 1564 | 1426 |
| JASON ZHENG | MI | 4.0 | 1595 | 1411 |
| DINH DANG BUI | ON | 4.0 | 1563 | 1470 |
| EUGENE L MCCLURE | MI | 4.0 | 1555 | 1300 |
| ALAN BUI | ON | 4.0 | 1363 | 1214 |
| MICHAEL R ALDRICH | MI | 4.0 | 1229 | 1357 |
| LOREN SCHWIEBERT | MI | 3.5 | 1745 | 1363 |
| MAX ZHU | ON | 3.5 | 1579 | 1507 |
| GAURAV GIDWANI | MI | 3.5 | 1552 | 1222 |
| SOFIA ADINA STANESCU-BELLU | MI | 3.5 | 1507 | 1522 |
| CHIEDOZIE OKORIE | MI | 3.5 | 1602 | 1314 |
| GEORGE AVERY JONES | ON | 3.5 | 1522 | 1144 |
| RISHI SHETTY | MI | 3.5 | 1494 | 1260 |
| JOSHUA PHILIP MATHEWS | ON | 3.5 | 1441 | 1379 |
| JADE GE | MI | 3.5 | 1449 | 1277 |
| MICHAEL JEFFERY THOMAS | MI | 3.5 | 1399 | 1375 |
| JOSHUA DAVID LEE | MI | 3.5 | 1438 | 1150 |
| SIDDHARTH JHA | MI | 3.5 | 1355 | 1388 |
| AMIYATOSH PWNANANDAM | MI | 3.5 | 980 | 1385 |
| BRIAN LIU | MI | 3.0 | 1423 | 1539 |
| JOEL R HENDON | MI | 3.0 | 1436 | 1430 |
| FOREST ZHANG | MI | 3.0 | 1348 | 1391 |
| KYLE WILLIAM MURPHY | MI | 3.0 | 1403 | 1248 |
| JARED GE | MI | 3.0 | 1332 | 1150 |
| ROBERT GLEN VASEY | MI | 3.0 | 1283 | 1107 |
| JUSTIN D SCHILLING | MI | 3.0 | 1199 | 1327 |
| DEREK YAN | MI | 3.0 | 1242 | 1152 |
| JACOB ALEXANDER LAVALLEY | MI | 3.0 | 377 | 1358 |
| ERIC WRIGHT | MI | 2.5 | 1362 | 1392 |
| DANIEL KHAIN | MI | 2.5 | 1382 | 1356 |
| MICHAEL J MARTIN | MI | 2.5 | 1291 | 1286 |
| SHIVAM JHA | MI | 2.5 | 1056 | 1296 |
| TEJAS AYYAGARI | MI | 2.5 | 1011 | 1356 |
| ETHAN GUO | MI | 2.5 | 935 | 1495 |
| JOSE C YBARRA | MI | 2.0 | 1393 | 1345 |
| LARRY HODGE | MI | 2.0 | 1270 | 1206 |
| ALEX KONG | MI | 2.0 | 1186 | 1406 |
| MARISA RICCI | MI | 2.0 | 1153 | 1414 |
| MICHAEL LU | MI | 2.0 | 1092 | 1363 |
| VIRAJ MOHILE | MI | 2.0 | 917 | 1391 |
| SEAN M MC CORMICK | MI | 2.0 | 853 | 1319 |
| JULIA SHEN | MI | 1.5 | 967 | 1330 |
| JEZZEL FARKAS | ON | 1.5 | 955 | 1327 |
| ASHWIN BALAJI | MI | 1.0 | 1530 | 1186 |
| THOMAS JOSEPH HOSMER | MI | 1.0 | 1175 | 1350 |
| BEN LI | MI | 1.0 | 1163 | 1263 |
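
The effect of restricting the pattern to W, L, and D can be seen on a small made-up example. The sketch below assumes a hypothetical four-player event (`toy_ratings`) and a hypothetical round string (`toy_rounds`) in which the last two rounds were not played; neither comes from the tournament file. Entries without an opponent number never match, so they drop out of the average.

```r
library(stringr)

# Hypothetical pre-ratings; the position in the vector is the pairing number.
toy_ratings <- c(1500L, 1600L, 1400L, 1700L)

# Hypothetical rounds for player 1: three played games, then two entries
# (H and U) that carry no opponent number.
toy_rounds <- "|W   2|L   3|D   4|H    |U    |"

# Same pattern as in the report: a result letter W, L, or D, optional spaces,
# then the opponent's pairing number.
played <- str_extract_all(toy_rounds, "[|](W|L|D)\\s*[:digit:]+")[[1]]
opponent_ids <- as.integer(str_extract(played, "[:digit:]+"))

# Vectorised lookup: index the ratings by all opponent numbers at once.
avg_opponent_rating <- round(mean(toy_ratings[opponent_ids]))

print(opponent_ids)
print(avg_opponent_rating)
```

Here the average is taken over opponents 2, 3, and 4 only, so the divisor is the number of games actually played rather than the number of rounds.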
The code below exports the test_df data frame to a file named tournament.csv in the current working directory.
write.csv(test_df, "tournament.csv", row.names = FALSE)
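One way to confirm the export is to read the file back and compare it with the data frame. The sketch below does this with a two-row stand-in (`demo_df`, built from the first two rows of the final table) and a temporary file, so it does not touch tournament.csv; both names are illustrative.

```r
# Stand-in for test_df, using the first two rows of the final table.
demo_df <- data.frame(
  players = c("GARY HUA", "DAKSHESH DARURI"),
  states = c("ON", "MI"),
  total_num_pts = c("6.0", "6.0"),
  pre_ratings = c(1794L, 1553L),
  avg_pre_chess_ratings = c(1605, 1469)
)

# Write to a temporary file instead of the working directory.
csv_path <- tempfile(fileext = ".csv")
write.csv(demo_df, csv_path, row.names = FALSE)

# Read the file back and inspect the column names and types.
round_trip <- read.csv(csv_path)
str(round_trip)
```

With `row.names = FALSE` the file holds only the five data columns. Note that `read.csv()` infers column types on import, so `total_num_pts`, stored as text in the data frame, will typically come back as a number.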
This project was good hands-on practice with regular expressions. The chess data is a clear example of unstructured data and an interesting data set in its own right. A future project could extend the analysis to account for withdrawals and half points.