In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players: Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents For the first player, the information would be: Gary Hua, ON, 6.0, 1794, 1605
1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.
If you have questions about the meaning of the data or the results, please post them on the discussion forum. Data science, like chess, is a game of back and forth…
The chess rating system (invented by a Minnesota statistician named Arpad Elo) has been used in many other contexts, including assessing relative strength of employment candidates by human resource departments.
All of your code should be in an R markdown file (and published to rpubs.com); with your data accessible for the person running the script.
The objective of this project is import the text file and present it in a sophisticated way as the file is given in text form and is very unstructured
# Loading the necessary libraries
library(stringr)
library(knitr)
# Reading the file
chess <- read.delim("chessinfo.txt", header= FALSE, stringsAsFactors=FALSE)
head(chess)
## V1
## 1 -----------------------------------------------------------------------------------------
## 2 Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|
## 3 Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
## 4 -----------------------------------------------------------------------------------------
## 5 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
## 6 ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |
chess2 <- chess[-c(1:4),]
chess2 %>% head()
## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [2] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [3] "-----------------------------------------------------------------------------------------"
## [4] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [5] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [6] "-----------------------------------------------------------------------------------------"
# Check the file if it is character
class(chess2)
## [1] "character"
# There are lines after every two rows which needs to be eliminated. Let's do that
name <- chess2[seq(from=1, to = length(chess2), by=3)]
ratings <- chess2[seq(from=2, to = length(chess2), by=3)]
# Extracting relevant data
new_id <- as.integer(str_extract(name, "\\d+"))
new_name <- str_trim(str_extract(name, "(\\w+\\s){2,3}"))
new_point <- as.numeric(str_extract(name, "\\d.\\d"))
new_rate <- as.integer(str_extract(str_extract(ratings, "\\D\\d{3,4}\\D"), "\\d+"))
new_state <- str_extract(ratings, "\\w\\w")
new_oppid <- str_extract_all(str_extract_all(name, "\\d+\\|"), "\\d+")
## Warning in stri_extract_all_regex(string, pattern, simplify = simplify, :
## argument is not an atomic vector; coercing
# Now let's convert the lists into df
op1 <- sapply(new_oppid, length)
max <- seq_len(max(op1))
op2 <- t(sapply(new_oppid, "[", i=max))
df2 <- as.data.frame(op2)
# Now let's convert these with ratings
v <- data.frame(new_id, new_rate)
df3 <- df2
df3[] <- v$new_rate[match(unlist(df3), v$new_id)]
head(df3)
## V1 V2 V3 V4 V5 V6 V7
## 1 1436 1563 1600 1610 1649 1663 1716
## 2 1175 917 1716 1629 1604 1595 1649
## 3 1641 955 1745 1563 1712 1666 1663
## 4 1363 1507 1553 1579 1655 1564 1794
## 5 1242 980 1663 1666 1716 1610 1629
## 6 1399 1602 1712 1438 1365 1552 1563
# Now let's calculate the average to print the final table afterwards
df3$opponentavg <- round(apply(df3,1,mean, na.rm=TRUE))
head(df3)
## V1 V2 V3 V4 V5 V6 V7 opponentavg
## 1 1436 1563 1600 1610 1649 1663 1716 1605
## 2 1175 917 1716 1629 1604 1595 1649 1469
## 3 1641 955 1745 1563 1712 1666 1663 1564
## 4 1363 1507 1553 1579 1655 1564 1794 1574
## 5 1242 980 1663 1666 1716 1610 1629 1501
## 6 1399 1602 1712 1438 1365 1552 1563 1519
# We are done with data cleaning, let's print the final table
new_table <- data.frame(new_id, new_name, new_state, new_point, new_rate, df3$opponentavg)
# Table above is finalized but let's rename the column's name to make it more sophisticated
colnames(new_table) <- c("ID", "Player_Name", "State", "Point", "PreRating", "OpponentRating")
kable(new_table)
ID | Player_Name | State | Point | PreRating | OpponentRating |
---|---|---|---|---|---|
1 | GARY HUA | ON | 6.0 | 1794 | 1605 |
2 | DAKSHESH DARURI | MI | 6.0 | 1553 | 1469 |
3 | ADITYA BAJAJ | MI | 6.0 | 1384 | 1564 |
4 | PATRICK H SCHILLING | MI | 5.5 | 1716 | 1574 |
5 | HANSHI ZUO | MI | 5.5 | 1655 | 1501 |
6 | HANSEN SONG | OH | 5.0 | 1686 | 1519 |
7 | GARY DEE SWATHELL | MI | 5.0 | 1649 | 1372 |
8 | EZEKIEL HOUGHTON | MI | 5.0 | 1641 | 1468 |
9 | STEFANO LEE | ON | 5.0 | 1411 | 1523 |
10 | ANVIT RAO | MI | 5.0 | 1365 | 1554 |
11 | CAMERON WILLIAM MC | MI | 4.5 | 1712 | 1468 |
12 | KENNETH J TACK | MI | 4.5 | 1663 | 1506 |
13 | TORRANCE HENRY JR | MI | 4.5 | 1666 | 1498 |
14 | BRADLEY SHAW | MI | 4.5 | 1610 | 1515 |
15 | ZACHARY JAMES HOUGHTON | MI | 4.5 | 1220 | 1484 |
16 | MIKE NIKITIN | MI | 4.0 | 1604 | 1386 |
17 | RONALD GRZEGORCZYK | MI | 4.0 | 1629 | 1499 |
18 | DAVID SUNDEEN | MI | 4.0 | 1600 | 1480 |
19 | DIPANKAR ROY | MI | 4.0 | 1564 | 1426 |
20 | JASON ZHENG | MI | 4.0 | 1595 | 1411 |
21 | DINH DANG BUI | ON | 4.0 | 1563 | 1470 |
22 | EUGENE L MCCLURE | MI | 4.0 | 1555 | 1300 |
23 | ALAN BUI | ON | 4.0 | 1363 | 1214 |
24 | MICHAEL R ALDRICH | MI | 4.0 | 1229 | 1357 |
25 | LOREN SCHWIEBERT | MI | 3.5 | 1745 | 1363 |
26 | MAX ZHU | ON | 3.5 | 1579 | 1507 |
27 | GAURAV GIDWANI | MI | 3.5 | 1552 | 1222 |
28 | SOFIA ADINA | MI | 3.5 | 1507 | 1522 |
29 | CHIEDOZIE OKORIE | MI | 3.5 | 1602 | 1314 |
30 | GEORGE AVERY JONES | ON | 3.5 | 1522 | 1144 |
31 | RISHI SHETTY | MI | 3.5 | 1494 | 1260 |
32 | JOSHUA PHILIP MATHEWS | ON | 3.5 | 1441 | 1379 |
33 | JADE GE | MI | 3.5 | 1449 | 1277 |
34 | MICHAEL JEFFERY THOMAS | MI | 3.5 | 1399 | 1375 |
35 | JOSHUA DAVID LEE | MI | 3.5 | 1438 | 1150 |
36 | SIDDHARTH JHA | MI | 3.5 | 1355 | 1388 |
37 | AMIYATOSH PWNANANDAM | MI | 3.5 | 980 | 1385 |
38 | BRIAN LIU | MI | 3.0 | 1423 | 1539 |
39 | JOEL R HENDON | MI | 3.0 | 1436 | 1430 |
40 | FOREST ZHANG | MI | 3.0 | 1348 | 1391 |
41 | KYLE WILLIAM MURPHY | MI | 3.0 | 1403 | 1248 |
42 | JARED GE | MI | 3.0 | 1332 | 1150 |
43 | ROBERT GLEN VASEY | MI | 3.0 | 1283 | 1107 |
44 | JUSTIN D SCHILLING | MI | 3.0 | 1199 | 1327 |
45 | DEREK YAN | MI | 3.0 | 1242 | 1152 |
46 | JACOB ALEXANDER LAVALLEY | MI | 3.0 | 377 | 1358 |
47 | ERIC WRIGHT | MI | 2.5 | 1362 | 1392 |
48 | DANIEL KHAIN | MI | 2.5 | 1382 | 1356 |
49 | MICHAEL J MARTIN | MI | 2.5 | 1291 | 1286 |
50 | SHIVAM JHA | MI | 2.5 | 1056 | 1296 |
51 | TEJAS AYYAGARI | MI | 2.5 | 1011 | 1356 |
52 | ETHAN GUO | MI | 2.5 | 935 | 1495 |
53 | JOSE C YBARRA | MI | 2.0 | 1393 | 1345 |
54 | LARRY HODGE | MI | 2.0 | 1270 | 1206 |
55 | ALEX KONG | MI | 2.0 | 1186 | 1406 |
56 | MARISA RICCI | MI | 2.0 | 1153 | 1414 |
57 | MICHAEL LU | MI | 2.0 | 1092 | 1363 |
58 | VIRAJ MOHILE | MI | 2.0 | 917 | 1391 |
59 | SEAN M MC | MI | 2.0 | 853 | 1319 |
60 | JULIA SHEN | MI | 1.5 | 967 | 1330 |
61 | JEZZEL FARKAS | ON | 1.5 | 955 | 1327 |
62 | ASHWIN BALAJI | MI | 1.0 | 1530 | 1186 |
63 | THOMAS JOSEPH HOSMER | MI | 1.0 | 1175 | 1350 |
64 | BEN LI | MI | 1.0 | 1163 | 1263 |
# Convert the file into csv
write.table(new_table, file = "chessratings.csv", row.names=FALSE,sep = ",")
Player’s Pre Ratings compared to Opponent’s Average Ratings
There seems to be a slight positive association between each player’s ratings and the opponent’s average pre rating. However, the R square is too low for any significant association.
reg1<-lm(PreRating~ OpponentRating, data=new_table)
summary(reg1)
##
## Call:
## lm(formula = PreRating ~ OpponentRating, data = new_table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -988.58 -137.73 50.47 200.56 376.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 514.7559 371.8226 1.384 0.171
## OpponentRating 0.6265 0.2687 2.332 0.023 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256.6 on 62 degrees of freedom
## Multiple R-squared: 0.08062, Adjusted R-squared: 0.06579
## F-statistic: 5.437 on 1 and 62 DF, p-value: 0.02298
plot(new_table$`PreRating`, new_table$`OpponentRating`, xlim=c(800,1800), ylim=c(1000,1800), main="PreRating vs Oppononent avg Rating", xlab="Player PreRating", ylab="Opponent Average PreRating", abline(reg1))
Pre Ratings statistics by State
We can see that highest number of players in the dataset are from MI and the lowest number from OH. Players from OH have a much higher mean Pre-ratings score compared to the other two states.
boxplot(PreRating~State, data = new_table, xlab = "State",
ylab = "Preratings by State", main = "Statewise Average Player Pre-ratings")