In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:
Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents

For the first player, the information would be: Gary Hua, ON, 6.0, 1794, 1605

1605 was calculated by summing the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.

If you have questions about the meaning of the data or the results, please post them on the discussion forum. Data science, like chess, is a game of back and forth…

The chess rating system (invented by a physics professor named Arpad Elo) has been used in many other contexts, including assessing relative strength of employment candidates by human resource departments.
You may substitute another text file (or set of text files, or data scraped from web pages) of similar or greater complexity, and create your own assignment and solution. You may work in a small team. All of your code should be in an R Markdown file (and published to rpubs.com), with your data accessible to the person running the script.
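The 1605 figure in the example can be checked directly. This is only a quick arithmetic check using the seven opponent ratings quoted in the assignment text; the variable names are my own.

```r
# Pre-tournament ratings of Gary Hua's seven opponents, as listed in the assignment text.
opponent_ratings <- c(1436, 1563, 1600, 1610, 1649, 1663, 1716)

# Sum of ratings divided by the number of games played, rounded to a whole rating point.
avg_opponent_rating <- round(sum(opponent_ratings) / length(opponent_ratings))
avg_opponent_rating  # 1605
```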
To start, I saved the file as a .txt document. Then, I imported the raw data into R.
## [1] "-----------------------------------------------------------------------------------------"
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [4] "-----------------------------------------------------------------------------------------"
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
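The import call itself is not shown above. A minimal sketch of one way to do it with `readLines()`, assuming the object is called `tournamentinfo` as in the later code; the file name is hypothetical, and a stand-in file is written first (using two lines from the output above, with whitespace collapsed) so the sketch runs on its own.

```r
# Hypothetical file name; the path actually used is not shown in this write-up.
path <- file.path(tempdir(), "tournamentinfo.txt")

# Stand-in content so the sketch is runnable without the real file.
writeLines(c(
  " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|",
  " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
), path)

# readLines() returns one character string per line of the file.
tournamentinfo <- readLines(path, warn = FALSE)
head(tournamentinfo)
```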
Next, I used the stringr package to extract the necessary data with regular expressions.
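library(stringr)
# Caveat: "\\w+ \\w+" matches exactly two words separated by one space, so longer
# names are cut short ("PATRICK H") and a four-word name yields two matches
# ("CAMERON WILLIAM" and "MC LEMAN"), which shifts every later name down a row.
names <- unlist(str_extract_all(tournamentinfo, "\\w+ \\w+"))
# The first two matches come from the header lines; keep the next 64.
player_names <- data.frame(Names = names[3:66])
head(player_names)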
## Names
## 1 GARY HUA
## 2 DAKSHESH DARURI
## 3 ADITYA BAJAJ
## 4 PATRICK H
## 5 HANSHI ZUO
## 6 HANSEN SONG
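The two-word pattern truncates longer names (for example `PATRICK H`) and can match twice inside one name. A minimal sketch of an alternative, assuming every player row starts with the pair number followed by a pipe, as in the `GARY HUA` line shown earlier: capture the whole field between the first two pipes. The third sample line is invented purely to exercise a four-word name.

```r
library(stringr)

# Sample lines: the first two come from the output above (whitespace collapsed);
# the third is an invented player used only for illustration.
lines <- c(
  " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|",
  " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |",
  " 2 | ANNA MARIA DE SILVA |5.0 |W 1|L 3|W 4|W 5|W 6|W 7|W 8|"
)

# For comparison: the two-word pattern splits the four-word name into two matches.
two_word <- unlist(str_extract_all(lines[3], "\\w+ \\w+"))

# Anchored pattern: pair number, pipe, then everything up to the next pipe.
m <- str_match(lines, "^\\s*\\d+\\s*\\|\\s*([^|]+?)\\s*\\|")

# Rows that are not player rows give NA and are dropped.
player_names <- data.frame(Names = m[!is.na(m[, 1]), 2], stringsAsFactors = FALSE)
player_names
```

Because each player row yields exactly one match, the names stay aligned with the pair numbers without a hand-picked index range such as `3:66`.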
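# Hard-coded to the three states/provinces that appear in this file;
# str_trim() removes the padding spaces captured by the pattern.
State <- str_trim(unlist(str_extract_all(tournamentinfo, "( MI )|( OH )|( ON )")))
player_state <- data.frame(State)
head(player_state)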
## State
## 1 ON
## 2 MI
## 3 MI
## 4 MI
## 5 MI
## 6 OH
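The pattern above only works for the three codes listed in it. A hedged sketch of a more general version, assuming the second line of every player record starts with a two-letter code followed by a pipe, as the `ON` line in the raw output does:

```r
library(stringr)

# Two sample lines taken from the raw output above (whitespace collapsed).
lines <- c(
  " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|",
  " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
)

# Capture any two capital letters at the start of a line that are followed by a pipe.
codes <- str_match(lines, "^\\s*([A-Z]{2})\\s*\\|")[, 2]

# Lines that are not state lines give NA and are dropped.
State <- codes[!is.na(codes)]
State
```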
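# Grab everything from the colon after "R" up to the arrow, e.g. ": 1794 ->".
prerate <- unlist(str_extract_all(tournamentinfo, "\\:{1} *\\d.+?\\->"))
# Strip the colon, the arrow, and any provisional suffix ("P" plus one or two digits).
prerate1 <- str_remove_all(prerate, "\\:")
prerate2 <- str_remove_all(prerate1, "\\->")
prerate3 <- str_remove_all(prerate2, "P[[:digit:]]{1,2}")
# Trim the leftover padding and store the ratings as integers rather than text.
prerate3 <- as.integer(str_trim(prerate3))
Pre_Rating <- data.frame(prerate3)
head(Pre_Rating)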
## prerate3
## 1 1794
## 2 1553
## 3 1384
## 4 1716
## 5 1655
## 6 1686
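The four cleaning steps can also be collapsed into one capture group. A minimal sketch, assuming the pre-rating is always the first run of digits after `R:`; the second sample line, including its ID and ratings, is invented to show a provisional `P` suffix.

```r
library(stringr)

lines <- c(
  # Taken from the raw output above (whitespace collapsed).
  " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |",
  # Invented line with a provisional rating suffix, for illustration only.
  " MI | 10000000 / R: 1641P17->1657P24 |N:3 |B |W |B |W |B |W |B |"
)

# \\d+ stops at the first non-digit, so "P17" and the arrow never enter the capture.
Pre_Rating <- as.integer(str_match(lines, "R:\\s*(\\d+)")[, 2])
Pre_Rating
```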
dim(player_names)
## [1] 64 1
dim(player_state)
## [1] 64 1
dim(Pre_Rating)
## [1] 64 1
Finally, I combined the extracted columns into a data frame and wrote it to a .csv file.
# Bind the three single-column data frames side by side.
project <- cbind(player_names, player_state, Pre_Rating)
project
## Names State prerate3
## 1 GARY HUA ON 1794
## 2 DAKSHESH DARURI MI 1553
## 3 ADITYA BAJAJ MI 1384
## 4 PATRICK H MI 1716
## 5 HANSHI ZUO MI 1655
## 6 HANSEN SONG OH 1686
## 7 GARY DEE MI 1649
## 8 EZEKIEL HOUGHTON MI 1641
## 9 STEFANO LEE ON 1411
## 10 ANVIT RAO MI 1365
## 11 CAMERON WILLIAM MI 1712
## 12 MC LEMAN MI 1663
## 13 KENNETH J MI 1666
## 14 TORRANCE HENRY MI 1610
## 15 BRADLEY SHAW MI 1220
## 16 ZACHARY JAMES MI 1604
## 17 MIKE NIKITIN MI 1629
## 18 RONALD GRZEGORCZYK MI 1600
## 19 DAVID SUNDEEN MI 1564
## 20 DIPANKAR ROY MI 1595
## 21 JASON ZHENG ON 1563
## 22 DINH DANG MI 1555
## 23 EUGENE L ON 1363
## 24 ALAN BUI MI 1229
## 25 MICHAEL R MI 1745
## 26 LOREN SCHWIEBERT ON 1579
## 27 MAX ZHU MI 1552
## 28 GAURAV GIDWANI MI 1507
## 29 SOFIA ADINA MI 1602
## 30 CHIEDOZIE OKORIE ON 1522
## 31 GEORGE AVERY MI 1494
## 32 RISHI SHETTY ON 1441
## 33 JOSHUA PHILIP MI 1449
## 34 JADE GE MI 1399
## 35 MICHAEL JEFFERY MI 1438
## 36 JOSHUA DAVID MI 1355
## 37 SIDDHARTH JHA MI 980
## 38 AMIYATOSH PWNANANDAM MI 1423
## 39 BRIAN LIU MI 1436
## 40 JOEL R MI 1348
## 41 FOREST ZHANG MI 1403
## 42 KYLE WILLIAM MI 1332
## 43 JARED GE MI 1283
## 44 ROBERT GLEN MI 1199
## 45 JUSTIN D MI 1242
## 46 DEREK YAN MI 377
## 47 JACOB ALEXANDER MI 1362
## 48 ERIC WRIGHT MI 1382
## 49 DANIEL KHAIN MI 1291
## 50 MICHAEL J MI 1056
## 51 SHIVAM JHA MI 1011
## 52 TEJAS AYYAGARI MI 935
## 53 ETHAN GUO MI 1393
## 54 JOSE C MI 1270
## 55 LARRY HODGE MI 1186
## 56 ALEX KONG MI 1153
## 57 MARISA RICCI MI 1092
## 58 MICHAEL LU MI 917
## 59 VIRAJ MOHILE MI 853
## 60 SEAN M MI 967
## 61 MC CORMICK ON 955
## 62 JULIA SHEN MI 1530
## 63 JEZZEL FARKAS MI 1175
## 64 ASHWIN BALAJI MI 1163
# write.csv() always writes the header row and ignores col.names;
# row.names = FALSE keeps the row numbers out of the file.
write.csv(project, file = "DATA607_Wk4Proj.csv", row.names = FALSE)
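The data frame above holds three of the five required columns. Below is a hedged sketch of how total points and the average opponent pre-rating could be derived. It uses an invented three-player table in the same two-lines-per-player layout; every name, ID and rating in it is made up, and rounds without an opponent number are assumed to be skipped.

```r
library(stringr)

# Invented example data; not taken from the tournament file.
lines <- c(
  " 1 | PLAYER A |2.0 |W 2|W 3|",
  " MI | 10000001 / R: 1500 ->1510 |N:2 |W |B |",
  " 2 | PLAYER B |1.0 |L 1|W 3|",
  " MI | 10000002 / R: 1400P12->1405P14 |N:2 |B |W |",
  " 3 | PLAYER C |0.0 |L 1|L 2|",
  " OH | 10000003 / R: 1300 ->1290 |N:2 |W |B |"
)

# Split each record into its two lines.
row1 <- lines[str_detect(lines, "^\\s*\\d+\\s*\\|")]      # pair number, name, points, rounds
row2 <- lines[str_detect(lines, "^\\s*[A-Z]{2}\\s*\\|")]  # state, ID, ratings

# Total points: the decimal number sitting alone between two pipes.
points <- as.numeric(str_match(row1, "\\|\\s*(\\d+\\.\\d)\\s*\\|")[, 2])

# Pre-rating: first run of digits after "R:".
pre_rating <- as.integer(str_match(row2, "R:\\s*(\\d+)")[, 2])

# Opponent pair numbers: digits that follow a result letter directly after a pipe.
opponents <- lapply(
  str_match_all(row1, "\\|[A-Z]\\s+(\\d+)"),
  function(m) as.integer(m[, 2])
)

# Look up each opponent's pre-rating by pair number and average per player.
avg_opp <- sapply(opponents, function(ids) round(mean(pre_rating[ids])))

data.frame(points, pre_rating, avg_opp)
```

Indexing `pre_rating` by pair number only works if the ratings are stored in pair-number order, which is how the extraction above returns them.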