Instructions

In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:

Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents

For the first player, the information would be: Gary Hua, ON, 6.0, 1794, 1605

1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.

If you have questions about the meaning of the data or the results, please post them on the discussion forum. Data science, like chess, is a game of back and forth…

The chess rating system (invented by a Minnesota statistician named Arpad Elo) has been used in many other contexts, including assessing relative strength of employment candidates by human resource departments.
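As a quick check of the Gary Hua example, the average can be reproduced from the seven opponent ratings listed in the instructions. This is only a sketch of the arithmetic; the variable names are illustrative and not part of the original assignment.

```r
# Pre-tournament ratings of Gary Hua's seven opponents, as listed in the instructions.
opponent_ratings <- c(1436, 1563, 1600, 1610, 1649, 1663, 1716)

# Divide the sum by the number of games played (7 here).
avg <- sum(opponent_ratings) / length(opponent_ratings)

# Rounded to the nearest whole rating point.
avg_rounded <- round(avg)
print(avg_rounded)  # 1605
```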

You may substitute another text file (or set of text files, or data scraped from web pages) of similar or greater complexity, and create your own assignment and solution. You may work in a small team. All of your code should be in an R Markdown file (and published to rpubs.com), with your data accessible to the person running the script.

Solution

To start, I saved the tournament results as a .txt file and imported the raw lines into R as the character vector tournamentinfo. The first six lines are shown below.

## [1] "-----------------------------------------------------------------------------------------" 
## [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
## [4] "-----------------------------------------------------------------------------------------" 
## [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
## [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
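The import call itself is not shown, so the following is a minimal sketch of one way to do it with base R's readLines(). The temporary file stands in for the saved .txt document, and the 89-character separator line is a placeholder; both are assumptions made for illustration.

```r
# A temp file stands in for the saved .txt document (hypothetical path).
path <- tempfile(fileext = ".txt")
writeLines(c(
  strrep("-", 89),  # placeholder separator line
  "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
), path)

# readLines() returns one character string per line of the file.
tournamentinfo <- readLines(path)
head(tournamentinfo)
```

Reading the file line by line keeps the fixed-width layout intact, which is what the regular expressions in the next step rely on.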

Next, I used the stringr package to extract the necessary data with regular expressions.

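library(stringr)  # provides str_extract_all() and str_remove_all()
# Match pairs of words separated by a single space. Caveat: names longer than two
# words are truncated (e.g. "PATRICK H"), and a four-word name yields two matches.
names <- unlist(str_extract_all(tournamentinfo, "\\w+ \\w+"))
# The first two matches come from the header rows, so keep matches 3 to 66.
player_names <- data.frame(names[3:66])
names(player_names)[1] <- "Names"
head(player_names)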
##             Names
## 1        GARY HUA
## 2 DAKSHESH DARURI
## 3    ADITYA BAJAJ
## 4       PATRICK H
## 5      HANSHI ZUO
## 6     HANSEN SONG
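Because the two-word pattern truncates longer names (see "PATRICK H" in the output), an alternative is to split each player row on the "|" delimiter and keep the whole second field. The sketch below uses base R only and is not the method used in this write-up; the second player row is a made-up example included to show a four-word name.

```r
# Two rows from the file plus one hypothetical player (pair number 2).
sample_lines <- c(
  "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |",
  "    2 | ALICE B MC EXAMPLE              |5.5  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|",  # hypothetical row
  "   MI | 12345678 / R: 1500P12->1520     |N:2  |B    |W    |B    |W    |B    |W    |B    |"   # hypothetical row
)

# Player rows start with a pair number followed by "|".
player_rows <- sample_lines[grepl("^\\s*\\d+\\s*\\|", sample_lines)]

# Split on the literal "|" and take the second field, trimming the padding.
fields <- strsplit(player_rows, "|", fixed = TRUE)
full_names <- trimws(sapply(fields, `[`, 2))
print(full_names)
```

Selecting rows by their leading pair number also removes the need for the hard-coded 3:66 index.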
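# Match one of the three state/province codes in this file, including the
# surrounding spaces (the pattern is hard-coded to MI, OH, and ON).
State <- unlist(str_extract_all(tournamentinfo, "( MI )|( OH )|( ON )"))
player_state <- data.frame(State)
head(player_state)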
##   State
## 1   ON 
## 2   MI 
## 3   MI 
## 4   MI 
## 5   MI 
## 6   OH
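The state pattern above lists the three codes explicitly. A more general sketch, assuming every rating row begins with a two-letter uppercase code followed by "|", captures whatever code is present and drops the padding. It uses base R only, and the last sample row is hypothetical.

```r
sample_lines <- c(
  " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| ",
  " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | ",
  "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |",
  "   MI | 12345678 / R: 1500P12->1520     |N:2  |B    |W    |B    |W    |B    |W    |B    |"  # hypothetical row
)

# Rating rows: optional padding, exactly two uppercase letters, then "|".
rating_rows <- sample_lines[grepl("^\\s*[A-Z]{2}\\s*\\|", sample_lines)]

# Keep only the captured two-letter code.
states <- sub("^\\s*([A-Z]{2})\\s*\\|.*$", "\\1", rating_rows)
print(states)
```

The header rows are skipped because "Pair" and "Num" are not exactly two uppercase letters.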
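# Grab the text from the colon after "R" up to the "->" arrow, e.g. ": 1794   ->".
prerate <- unlist(str_extract_all(tournamentinfo, "\\:{1} *\\d.+?\\->"))
# Strip the colon, the arrow, and any suffix of "P" plus one or two digits.
prerate1 <- str_remove_all(prerate, "\\:")
prerate2 <- str_remove_all(prerate1, "\\->")
prerate3 <- str_remove_all(prerate2, "P[[:digit:]]{1,2}")
# Note: the values remain character strings padded with spaces.
Pre_Rating <- data.frame(prerate3)
head(Pre_Rating)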
##   prerate3
## 1  1794   
## 2  1553   
## 3  1384   
## 4  1716   
## 5  1655   
## 6  1686
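The pre-ratings above are still padded character strings, which explains the uneven alignment in the printed column. As a sketch of an alternative, the digits that follow "R:" can be matched directly and converted to integers in one pass. This uses base R only, and the second sample row is hypothetical.

```r
rating_rows <- c(
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |",
  "   MI | 12345678 / R: 1500P12->1520     |N:2  |B    |W    |B    |W    |B    |W    |B    |"  # hypothetical row
)

# Match "R:" plus the digits that follow; a "P.." suffix is left out of the match.
matched <- regmatches(rating_rows, regexpr("R:\\s*\\d+", rating_rows))

# Drop the "R:" prefix and convert to integer.
pre_rating <- as.integer(sub("R:\\s*", "", matched))
print(pre_rating)
```

Storing the ratings as integers means they can be averaged later without further cleaning.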
dim(player_names)
## [1] 64  1
dim(player_state)
## [1] 64  1
dim(Pre_Rating)
## [1] 64  1

Finally, I combined the extracted columns into a single data frame and wrote it to a .csv file.

project <- data.frame(player_names, player_state, Pre_Rating)  # bind the three one-column data frames side by side
project
##                   Names State prerate3
## 1              GARY HUA   ON   1794   
## 2       DAKSHESH DARURI   MI   1553   
## 3          ADITYA BAJAJ   MI   1384   
## 4             PATRICK H   MI   1716   
## 5            HANSHI ZUO   MI   1655   
## 6           HANSEN SONG   OH   1686   
## 7              GARY DEE   MI   1649   
## 8      EZEKIEL HOUGHTON   MI      1641
## 9           STEFANO LEE   ON   1411   
## 10            ANVIT RAO   MI   1365   
## 11      CAMERON WILLIAM   MI   1712   
## 12             MC LEMAN   MI   1663   
## 13            KENNETH J   MI   1666   
## 14       TORRANCE HENRY   MI   1610   
## 15         BRADLEY SHAW   MI      1220
## 16        ZACHARY JAMES   MI   1604   
## 17         MIKE NIKITIN   MI   1629   
## 18   RONALD GRZEGORCZYK   MI   1600   
## 19        DAVID SUNDEEN   MI   1564   
## 20         DIPANKAR ROY   MI   1595   
## 21          JASON ZHENG   ON      1563
## 22            DINH DANG   MI   1555   
## 23             EUGENE L   ON   1363   
## 24             ALAN BUI   MI   1229   
## 25            MICHAEL R   MI   1745   
## 26     LOREN SCHWIEBERT   ON   1579   
## 27              MAX ZHU   MI   1552   
## 28       GAURAV GIDWANI   MI   1507   
## 29          SOFIA ADINA   MI     1602 
## 30     CHIEDOZIE OKORIE   ON   1522   
## 31         GEORGE AVERY   MI   1494   
## 32         RISHI SHETTY   ON   1441   
## 33        JOSHUA PHILIP   MI   1449   
## 34              JADE GE   MI   1399   
## 35      MICHAEL JEFFERY   MI   1438   
## 36         JOSHUA DAVID   MI   1355   
## 37        SIDDHARTH JHA   MI       980
## 38 AMIYATOSH PWNANANDAM   MI   1423   
## 39            BRIAN LIU   MI      1436
## 40               JOEL R   MI   1348   
## 41         FOREST ZHANG   MI     1403 
## 42         KYLE WILLIAM   MI   1332   
## 43             JARED GE   MI   1283   
## 44          ROBERT GLEN   MI   1199   
## 45             JUSTIN D   MI   1242   
## 46            DEREK YAN   MI      377 
## 47      JACOB ALEXANDER   MI   1362   
## 48          ERIC WRIGHT   MI   1382   
## 49         DANIEL KHAIN   MI      1291
## 50            MICHAEL J   MI   1056   
## 51           SHIVAM JHA   MI   1011   
## 52       TEJAS AYYAGARI   MI    935   
## 53            ETHAN GUO   MI   1393   
## 54               JOSE C   MI   1270   
## 55          LARRY HODGE   MI   1186   
## 56            ALEX KONG   MI   1153   
## 57         MARISA RICCI   MI   1092   
## 58           MICHAEL LU   MI    917   
## 59         VIRAJ MOHILE   MI    853   
## 60               SEAN M   MI    967   
## 61           MC CORMICK   ON       955
## 62           JULIA SHEN   MI   1530   
## 63        JEZZEL FARKAS   MI   1175   
## 64        ASHWIN BALAJI   MI   1163
write.csv(project, file = "DATA607_Wk4Proj.csv")  # write.csv() always writes the header row, so col.names is not needed
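The instructions also ask for each player's total points and the average pre-rating of their opponents, which the table above does not include. The sketch below shows one possible way to compute them with base R. It runs on a made-up four-player, two-round table in the same layout, and it assumes that pair numbers run 1..n in file order and that each rating row directly follows its player row.

```r
# Hypothetical mini tournament in the same layout as the real file.
lines <- c(
  "    1 | PLAYER ONE                      |1.5  |W   2|D   3|",
  "   MI | 10000001 / R: 1600   ->1610     |N:2  |W    |B    |",
  "    2 | PLAYER TWO                      |1.0  |L   1|W   4|",
  "   MI | 10000002 / R: 1450P5 ->1455     |N:2  |B    |W    |",
  "    3 | PLAYER THREE                    |1.5  |W   4|D   1|",
  "   OH | 10000003 / R: 1300   ->1310     |N:2  |W    |B    |",
  "    4 | PLAYER FOUR                     |0.0  |L   3|L   2|",
  "   ON | 10000004 / R: 1100   ->1090     |N:2  |B    |W    |"
)

# Player rows start with a pair number; the rating row follows each one.
is_player   <- grepl("^\\s*\\d+\\s*\\|", lines)
player_rows <- lines[is_player]
rating_rows <- lines[which(is_player) + 1]

# Field 3 of a player row holds the total points.
fields <- strsplit(player_rows, "|", fixed = TRUE)
points <- as.numeric(trimws(sapply(fields, `[`, 3)))

# Pre-rating: the digits after "R:".
matched    <- regmatches(rating_rows, regexpr("R:\\s*\\d+", rating_rows))
pre_rating <- as.integer(sub("R:\\s*", "", matched))

# Fields 4 onward are rounds; rounds without an opponent number are dropped.
opponents <- lapply(fields, function(f) {
  rounds <- f[-(1:3)]
  as.integer(regmatches(rounds, regexpr("\\d+", rounds)))
})

# Average the opponents' pre-ratings over the games actually played.
avg_opp <- sapply(opponents, function(ids) round(mean(pre_rating[ids])))

result <- data.frame(Points = points, Pre_Rating = pre_rating, Avg_Opp_Pre_Rating = avg_opp)
print(result)
```

Looking up opponents by pair number only works if no player rows are missing or reordered, so the row count is worth checking before relying on the result.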