In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players: Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents.

For the first player, the information would be: Gary Hua, ON, 6.0, 1794, 1605

1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.

If you have questions about the meaning of the data or the results, please post them on the discussion forum. Data science, like chess, is a game of back and forth…

The chess rating system (invented by a Minnesota statistician named Arpad Elo) has been used in many other contexts, including assessing relative strength of employment candidates by human resource departments.

You may substitute another text file (or set of text files, or data scraped from web pages) of similar or greater complexity, and create your own assignment and solution. You may work in a small team. All of your code should be in an R Markdown file (and published to rpubs.com), with your data accessible for the person running the script.
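The worked example for Gary Hua can be checked directly. This is a minimal sketch that assumes only the seven opponent pre-ratings quoted in the assignment text; the variable names are illustrative.

```r
# Pre-tournament ratings of Gary Hua's seven opponents, as listed in the assignment
opponent_ratings <- c(1436, 1563, 1600, 1610, 1649, 1663, 1716)

# Sum the ratings and divide by the number of games played
avg_opponent_rating <- sum(opponent_ratings) / length(opponent_ratings)

round(avg_opponent_rating)  # 1605
```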

Import Data

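data<-read.csv("https://raw.githubusercontent.com/nancunjie4560/Data607/master/Project1/tournamentinfo.txt")
# With the default header = TRUE, the first line of the file is taken as the column name (X......).
# Read the file with header = FALSE so that every line is kept as data.

data<-read.csv("https://raw.githubusercontent.com/nancunjie4560/Data607/master/Project1/tournamentinfo.txt",header = FALSE)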

summary(data)
##       V1           
##  Length:196        
##  Class :character  
##  Mode  :character
head(data)
##                                                                                           V1
## 1  -----------------------------------------------------------------------------------------
## 2  Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| 
## 3  Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | 
## 4  -----------------------------------------------------------------------------------------
## 5      1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|
## 6     ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |

The summary shows 196 observations in the dataset. However, head() shows that each player's result spans three rows: one with the name and round results, one with the state and ratings, and a separator line of dashes. Rows 1 to 4 are the header of the file and need to be removed.

data<-data[-c(1:4),]
head(data) # rows 1 to 4 are removed
## [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
## [2] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [3] "-----------------------------------------------------------------------------------------"
## [4] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
## [5] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [6] "-----------------------------------------------------------------------------------------"
tail(data)
## [1] "   63 | THOMAS JOSEPH HOSMER            |1.0  |L   2|L  48|D  49|L  43|L  45|H    |U    |"
## [2] "   MI | 15057092 / R: 1175   ->1125     |     |W    |B    |W    |B    |B    |     |     |"
## [3] "-----------------------------------------------------------------------------------------"
## [4] "   64 | BEN LI                          |1.0  |L  22|D  30|L  31|D  49|L  46|L  42|L  54|"
## [5] "   MI | 15006561 / R: 1163   ->1112     |     |B    |W    |W    |B    |W    |B    |B    |"
## [6] "-----------------------------------------------------------------------------------------"

The players’ names are located in rows 1+3n (n = 0, 1, 2, ...). For example, Dakshesh is in the fourth row, which is 1+3x1=4 when n=1. The ratings are located in rows 2+3n. Extract the name rows and the rating rows separately.

players<-data[seq(1, length(data),3)]
ratings<-data[seq(2, length(data),3)]

This gives 64 player rows and 64 rating rows (the 192 remaining lines divided by 3).
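The row arithmetic can be tried on a small sample before applying it to the whole file. The sketch below uses only the six lines shown in the head(data) output; the names `sample_lines`, `name_rows`, `rating_rows` and `separator_rows` are illustrative and not part of the original script.

```r
# Six lines copied from the head(data) output: two players, three lines each
sample_lines <- c(
  "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|",
  "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |",
  "-----------------------------------------------------------------------------------------",
  "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|",
  "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |",
  "-----------------------------------------------------------------------------------------"
)

# Rows 1, 4, ... hold names; rows 2, 5, ... hold state and ratings; rows 3, 6, ... are separators
name_rows      <- sample_lines[seq(1, length(sample_lines), 3)]
rating_rows    <- sample_lines[seq(2, length(sample_lines), 3)]
separator_rows <- sample_lines[seq(3, length(sample_lines), 3)]

# Sanity checks: the line count is a multiple of 3 and every third line contains only dashes
stopifnot(length(sample_lines) %% 3 == 0)
stopifnot(all(grepl("^-+$", separator_rows)))

length(name_rows)  # 2
```

Checks like these fail early if the file ever contains a stray line that shifts the three-row pattern.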

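library(stringr)
# pair number: the first run of digits on the name row
number <- as.integer(str_extract(players, "\\d+"))
# name: two or three words; a name with four or more words is cut off after the third
player_name <- str_trim(str_extract(players, "(\\w+\\s){2,3}"))
# state: the first word on the rating row
state <- str_extract(ratings,"\\w+")
# total points: the first decimal number on the name row
points <- as.numeric(str_extract(players, "\\d+\\.\\d+"))
# pre-rating: the first 3- or 4-digit number delimited by non-digits (this skips the 8-digit USCF ID)
rating <- as.integer(str_extract(str_extract(ratings, "[^\\d]\\d{3,4}[^\\d]"), "\\d+"))
# opponents: every number that is directly followed by "|"
Opponents <- str_extract_all(players, "\\d+(?=\\|)")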
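Because the fields are separated by "|", an alternative to pattern matching is to split each name row on the delimiter, which keeps names of any length intact. The following is a minimal sketch in base R, not the method used in this report. The helper name `parse_player_row` is hypothetical, the first row is copied from the data above, and the second row is made up to show a four-word name and two rounds without an opponent.

```r
# Hypothetical helper: split one name row on "|" and pick fields by position
parse_player_row <- function(row) {
  fields <- trimws(strsplit(row, "|", fixed = TRUE)[[1]])
  # Fields 4 onward are the rounds, e.g. "W  39"; entries such as "H" or "U" have no opponent number
  opponents <- as.integer(gsub("[^0-9]", "", fields[-(1:3)]))
  list(
    number    = as.integer(fields[1]),
    name      = fields[2],
    points    = as.numeric(fields[3]),
    opponents = opponents[!is.na(opponents)]
  )
}

# Row copied from the tournament data
real_row <- "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
# Made-up row (not in the data) with a four-word name
made_up_row <- "   99 | ANNA MARIA VAN DYKE             |1.0  |L   2|L  48|D  49|L  43|L  45|H    |U    |"

parse_player_row(real_row)$opponents   # 39 21 18 14 7 12 4
parse_player_row(made_up_row)$name     # "ANNA MARIA VAN DYKE"
```

Splitting by position relies on the column layout staying fixed, so it trades regex fragility for a dependence on the field order.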

Find out the Average Opponents Ratings

Opponent_rating <- numeric(length(data) / 3)
for (i in seq_along(Opponent_rating)) {
  Opponent_rating[i] <- mean(rating[as.numeric(unlist(Opponents[number[i]]))])  # mean pre-rating of player i's opponents
}
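The loop looks up each opponent's pre-rating by pair number and averages the values, so players who sat out a round are averaged over fewer games. Here is a small sketch of the same lookup on made-up data, written with sapply() instead of an explicit loop; the four ratings and the opponent lists are hypothetical.

```r
# Hypothetical pre-ratings for four players, indexed by pair number
toy_rating <- c(1500, 1400, 1300, 1200)

# Hypothetical opponents (pair numbers) for each player;
# players 2 and 3 each sat out one round, so they have only two games
toy_opponents <- list(c(2, 3, 4), c(1, 4), c(4, 1), c(3, 2, 1))

# For each player, index the rating vector by the opponents' pair numbers and average
toy_avg <- sapply(toy_opponents, function(opp) mean(toy_rating[opp]))

toy_avg  # 1300 1350 1350 1400
```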

Create a Data Frame

library(data.table)
frame <- data.frame(number, player_name, state, points, rating, Opponent_rating)
data.table(frame)
##     number              player_name state points rating Opponent_rating
##  1:      1                 GARY HUA    ON    6.0   1794        1605.286
##  2:      2          DAKSHESH DARURI    MI    6.0   1553        1469.286
##  3:      3             ADITYA BAJAJ    MI    6.0   1384        1563.571
##  4:      4      PATRICK H SCHILLING    MI    5.5   1716        1573.571
##  5:      5               HANSHI ZUO    MI    5.5   1655        1500.857
##  6:      6              HANSEN SONG    OH    5.0   1686        1518.714
##  7:      7        GARY DEE SWATHELL    MI    5.0   1649        1372.143
##  8:      8         EZEKIEL HOUGHTON    MI    5.0   1641        1468.429
##  9:      9              STEFANO LEE    ON    5.0   1411        1523.143
## 10:     10                ANVIT RAO    MI    5.0   1365        1554.143
## 11:     11       CAMERON WILLIAM MC    MI    4.5   1712        1467.571
## 12:     12           KENNETH J TACK    MI    4.5   1663        1506.167
## 13:     13        TORRANCE HENRY JR    MI    4.5   1666        1497.857
## 14:     14             BRADLEY SHAW    MI    4.5   1610        1515.000
## 15:     15   ZACHARY JAMES HOUGHTON    MI    4.5   1220        1483.857
## 16:     16             MIKE NIKITIN    MI    4.0   1604        1385.800
## 17:     17       RONALD GRZEGORCZYK    MI    4.0   1629        1498.571
## 18:     18            DAVID SUNDEEN    MI    4.0   1600        1480.000
## 19:     19             DIPANKAR ROY    MI    4.0   1564        1426.286
## 20:     20              JASON ZHENG    MI    4.0   1595        1410.857
## 21:     21            DINH DANG BUI    ON    4.0   1563        1470.429
## 22:     22         EUGENE L MCCLURE    MI    4.0   1555        1300.333
## 23:     23                 ALAN BUI    ON    4.0   1363        1213.857
## 24:     24        MICHAEL R ALDRICH    MI    4.0   1229        1357.000
## 25:     25         LOREN SCHWIEBERT    MI    3.5   1745        1363.286
## 26:     26                  MAX ZHU    ON    3.5   1579        1506.857
## 27:     27           GAURAV GIDWANI    MI    3.5   1552        1221.667
## 28:     28              SOFIA ADINA    MI    3.5   1507        1522.143
## 29:     29         CHIEDOZIE OKORIE    MI    3.5   1602        1313.500
## 30:     30       GEORGE AVERY JONES    ON    3.5   1522        1144.143
## 31:     31             RISHI SHETTY    MI    3.5   1494        1259.857
## 32:     32    JOSHUA PHILIP MATHEWS    ON    3.5   1441        1378.714
## 33:     33                  JADE GE    MI    3.5   1449        1276.857
## 34:     34   MICHAEL JEFFERY THOMAS    MI    3.5   1399        1375.286
## 35:     35         JOSHUA DAVID LEE    MI    3.5   1438        1149.714
## 36:     36            SIDDHARTH JHA    MI    3.5   1355        1388.167
## 37:     37     AMIYATOSH PWNANANDAM    MI    3.5    980        1384.800
## 38:     38                BRIAN LIU    MI    3.0   1423        1539.167
## 39:     39            JOEL R HENDON    MI    3.0   1436        1429.571
## 40:     40             FOREST ZHANG    MI    3.0   1348        1390.571
## 41:     41      KYLE WILLIAM MURPHY    MI    3.0   1403        1248.500
## 42:     42                 JARED GE    MI    3.0   1332        1149.857
## 43:     43        ROBERT GLEN VASEY    MI    3.0   1283        1106.571
## 44:     44       JUSTIN D SCHILLING    MI    3.0   1199        1327.000
## 45:     45                DEREK YAN    MI    3.0   1242        1152.000
## 46:     46 JACOB ALEXANDER LAVALLEY    MI    3.0    377        1357.714
## 47:     47              ERIC WRIGHT    MI    2.5   1362        1392.000
## 48:     48             DANIEL KHAIN    MI    2.5   1382        1355.800
## 49:     49         MICHAEL J MARTIN    MI    2.5   1291        1285.800
## 50:     50               SHIVAM JHA    MI    2.5   1056        1296.000
## 51:     51           TEJAS AYYAGARI    MI    2.5   1011        1356.143
## 52:     52                ETHAN GUO    MI    2.5    935        1494.571
## 53:     53            JOSE C YBARRA    MI    2.0   1393        1345.333
## 54:     54              LARRY HODGE    MI    2.0   1270        1206.167
## 55:     55                ALEX KONG    MI    2.0   1186        1406.000
## 56:     56             MARISA RICCI    MI    2.0   1153        1414.400
## 57:     57               MICHAEL LU    MI    2.0   1092        1363.000
## 58:     58             VIRAJ MOHILE    MI    2.0    917        1391.000
## 59:     59                SEAN M MC    MI    2.0    853        1319.000
## 60:     60               JULIA SHEN    MI    1.5    967        1330.200
## 61:     61            JEZZEL FARKAS    ON    1.5    955        1327.286
## 62:     62            ASHWIN BALAJI    MI    1.0   1530        1186.000
## 63:     63     THOMAS JOSEPH HOSMER    MI    1.0   1175        1350.200
## 64:     64                   BEN LI    MI    1.0   1163        1263.000
##     number              player_name state points rating Opponent_rating

Plot

library(ggplot2)

ggplot(data=frame, aes(x=Opponent_rating,y=points))+
  geom_point()

From the plot above, the average opponent rating and the points scored appear to be positively correlated: players who scored more points tended to face higher-rated opponents.
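A correlation coefficient gives a rough numeric check of what the plot suggests. The sketch below is illustrative only: it uses eight rows copied from the table above (players 1, 10, 20, 30, 40, 50, 60 and 64) rather than the full data frame, so its value should not be read as the result for the whole tournament. The variable names are assumptions.

```r
# Points and average opponent rating for players 1, 10, 20, 30, 40, 50, 60 and 64,
# copied from the printed table
subset_points <- c(6.0, 5.0, 4.0, 3.5, 3.0, 2.5, 1.5, 1.0)
subset_opponent_rating <- c(1605.286, 1554.143, 1410.857, 1144.143,
                            1390.571, 1296.000, 1330.200, 1263.000)

# Pearson correlation between average opponent rating and points
r <- cor(subset_opponent_rating, subset_points)

r > 0  # TRUE for this subset
```

With the full data, the equivalent call would be cor(frame$Opponent_rating, frame$points).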

Export to CSV file

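# write.csv() writes comma-separated values; row.names = FALSE leaves out the row index column
write.csv(frame, file = "chess.csv", row.names = FALSE)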
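write.csv() produces comma-separated output, whereas write.table() with its defaults separates fields with spaces and adds row names. The following is a small round-trip sketch on a two-row data frame (values copied from the first two rows of the table above), written to a temporary file; `toy_frame`, `csv_path` and `read_back` are illustrative names.

```r
# Two rows copied from the table above
toy_frame <- data.frame(
  player_name     = c("GARY HUA", "DAKSHESH DARURI"),
  state           = c("ON", "MI"),
  points          = c(6.0, 6.0),
  rating          = c(1794, 1553),
  Opponent_rating = c(1605.286, 1469.286)
)

# Write to a temporary file so nothing is left in the working directory
csv_path <- tempfile(fileext = ".csv")
write.csv(toy_frame, file = csv_path, row.names = FALSE)

# Read the file back to confirm that the columns survive the round trip
read_back <- read.csv(csv_path)
names(read_back)
```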