Introduction

In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players: Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents For the first player, the information would be: Gary Hua, ON, 6.0, 1794, 1605

1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.

If you have questions about the meaning of the data or the results, please post them on the discussion forum. Data science, like chess, is a game of back and forth…

The chess rating system (invented by a Minnesota statistician named Arpad Elo) has been used in many other contexts, including assessing relative strength of employment candidates by human resource departments.

All of your code should be in an R markdown file (and published to rpubs.com); with your data accessible for the person running the script.

Objective

The objective of this project is import the text file and present it in a sophisticated way as the file is given in text form and is very unstructured

# Loading the necessary libraries
library(stringr)
library(knitr)

# Reading the file
chess <- read.delim("chessinfo.txt", header= FALSE, stringsAsFactors=FALSE)
head(chess)

##                                                                                           V1
## 1  -----------------------------------------------------------------------------------------
## 2  Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| 
## 3  Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | 
## 4  -----------------------------------------------------------------------------------------
## 5      1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|
## 6     ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |

Data cleaning Process

chess2  <- chess[-c(1:4),]
chess2 %>% head()

## [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
## [2] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |"
## [3] "-----------------------------------------------------------------------------------------"
## [4] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
## [5] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |"
## [6] "-----------------------------------------------------------------------------------------"

#  Check the file if it is character
class(chess2)

## [1] "character"

# There are lines after every two rows which needs to be eliminated. Let's do that 

name <- chess2[seq(from=1, to = length(chess2), by=3)]
ratings <- chess2[seq(from=2, to = length(chess2), by=3)]

# Extracting relevant data
new_id <- as.integer(str_extract(name, "\\d+"))
new_name <- str_trim(str_extract(name, "(\\w+\\s){2,3}"))
new_point <- as.numeric(str_extract(name, "\\d.\\d"))
new_rate <- as.integer(str_extract(str_extract(ratings, "\\D\\d{3,4}\\D"), "\\d+"))
new_state <- str_extract(ratings, "\\w\\w")
new_oppid <- str_extract_all(str_extract_all(name, "\\d+\\|"), "\\d+")

## Warning in stri_extract_all_regex(string, pattern, simplify = simplify, :
## argument is not an atomic vector; coercing

# Now let's convert the lists into df
op1 <- sapply(new_oppid, length)
max <- seq_len(max(op1))
op2 <- t(sapply(new_oppid, "[", i=max))
df2 <- as.data.frame(op2)


# Now let's convert these with ratings

v <- data.frame(new_id, new_rate)
df3 <- df2
df3[] <- v$new_rate[match(unlist(df3), v$new_id)]
head(df3)

##     V1   V2   V3   V4   V5   V6   V7
## 1 1436 1563 1600 1610 1649 1663 1716
## 2 1175  917 1716 1629 1604 1595 1649
## 3 1641  955 1745 1563 1712 1666 1663
## 4 1363 1507 1553 1579 1655 1564 1794
## 5 1242  980 1663 1666 1716 1610 1629
## 6 1399 1602 1712 1438 1365 1552 1563

# Now let's calculate the average to print the final table afterwards

df3$opponentavg <- round(apply(df3,1,mean, na.rm=TRUE))
head(df3)

##     V1   V2   V3   V4   V5   V6   V7 opponentavg
## 1 1436 1563 1600 1610 1649 1663 1716        1605
## 2 1175  917 1716 1629 1604 1595 1649        1469
## 3 1641  955 1745 1563 1712 1666 1663        1564
## 4 1363 1507 1553 1579 1655 1564 1794        1574
## 5 1242  980 1663 1666 1716 1610 1629        1501
## 6 1399 1602 1712 1438 1365 1552 1563        1519

# We are done with data cleaning, let's print the final table 
new_table <- data.frame(new_id, new_name, new_state, new_point, new_rate, df3$opponentavg)

# Table above is finalized but let's rename the column's name to make it more sophisticated

colnames(new_table) <- c("ID", "Player_Name", "State", "Point", "PreRating", "OpponentRating")
kable(new_table)

ID	Player_Name	State	Point	PreRating	OpponentRating
1	GARY HUA	ON	6.0	1794	1605
2	DAKSHESH DARURI	MI	6.0	1553	1469
3	ADITYA BAJAJ	MI	6.0	1384	1564
4	PATRICK H SCHILLING	MI	5.5	1716	1574
5	HANSHI ZUO	MI	5.5	1655	1501
6	HANSEN SONG	OH	5.0	1686	1519
7	GARY DEE SWATHELL	MI	5.0	1649	1372
8	EZEKIEL HOUGHTON	MI	5.0	1641	1468
9	STEFANO LEE	ON	5.0	1411	1523
10	ANVIT RAO	MI	5.0	1365	1554
11	CAMERON WILLIAM MC	MI	4.5	1712	1468
12	KENNETH J TACK	MI	4.5	1663	1506
13	TORRANCE HENRY JR	MI	4.5	1666	1498
14	BRADLEY SHAW	MI	4.5	1610	1515
15	ZACHARY JAMES HOUGHTON	MI	4.5	1220	1484
16	MIKE NIKITIN	MI	4.0	1604	1386
17	RONALD GRZEGORCZYK	MI	4.0	1629	1499
18	DAVID SUNDEEN	MI	4.0	1600	1480
19	DIPANKAR ROY	MI	4.0	1564	1426
20	JASON ZHENG	MI	4.0	1595	1411
21	DINH DANG BUI	ON	4.0	1563	1470
22	EUGENE L MCCLURE	MI	4.0	1555	1300
23	ALAN BUI	ON	4.0	1363	1214
24	MICHAEL R ALDRICH	MI	4.0	1229	1357
25	LOREN SCHWIEBERT	MI	3.5	1745	1363
26	MAX ZHU	ON	3.5	1579	1507
27	GAURAV GIDWANI	MI	3.5	1552	1222
28	SOFIA ADINA	MI	3.5	1507	1522
29	CHIEDOZIE OKORIE	MI	3.5	1602	1314
30	GEORGE AVERY JONES	ON	3.5	1522	1144
31	RISHI SHETTY	MI	3.5	1494	1260
32	JOSHUA PHILIP MATHEWS	ON	3.5	1441	1379
33	JADE GE	MI	3.5	1449	1277
34	MICHAEL JEFFERY THOMAS	MI	3.5	1399	1375
35	JOSHUA DAVID LEE	MI	3.5	1438	1150
36	SIDDHARTH JHA	MI	3.5	1355	1388
37	AMIYATOSH PWNANANDAM	MI	3.5	980	1385
38	BRIAN LIU	MI	3.0	1423	1539
39	JOEL R HENDON	MI	3.0	1436	1430
40	FOREST ZHANG	MI	3.0	1348	1391
41	KYLE WILLIAM MURPHY	MI	3.0	1403	1248
42	JARED GE	MI	3.0	1332	1150
43	ROBERT GLEN VASEY	MI	3.0	1283	1107
44	JUSTIN D SCHILLING	MI	3.0	1199	1327
45	DEREK YAN	MI	3.0	1242	1152
46	JACOB ALEXANDER LAVALLEY	MI	3.0	377	1358
47	ERIC WRIGHT	MI	2.5	1362	1392
48	DANIEL KHAIN	MI	2.5	1382	1356
49	MICHAEL J MARTIN	MI	2.5	1291	1286
50	SHIVAM JHA	MI	2.5	1056	1296
51	TEJAS AYYAGARI	MI	2.5	1011	1356
52	ETHAN GUO	MI	2.5	935	1495
53	JOSE C YBARRA	MI	2.0	1393	1345
54	LARRY HODGE	MI	2.0	1270	1206
55	ALEX KONG	MI	2.0	1186	1406
56	MARISA RICCI	MI	2.0	1153	1414
57	MICHAEL LU	MI	2.0	1092	1363
58	VIRAJ MOHILE	MI	2.0	917	1391
59	SEAN M MC	MI	2.0	853	1319
60	JULIA SHEN	MI	1.5	967	1330
61	JEZZEL FARKAS	ON	1.5	955	1327
62	ASHWIN BALAJI	MI	1.0	1530	1186
63	THOMAS JOSEPH HOSMER	MI	1.0	1175	1350
64	BEN LI	MI	1.0	1163	1263

# Convert the file into csv

write.table(new_table, file = "chessratings.csv", row.names=FALSE,sep = ",")

Analysis and Visualization

Player’s Pre Ratings compared to Opponent’s Average Ratings

There seems to be a slight positive association between each player’s ratings and the opponent’s average pre rating. However, the R square is too low for any significant association.

reg1<-lm(PreRating~ OpponentRating, data=new_table)
summary(reg1)

## 
## Call:
## lm(formula = PreRating ~ OpponentRating, data = new_table)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -988.58 -137.73   50.47  200.56  376.29 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    514.7559   371.8226   1.384    0.171  
## OpponentRating   0.6265     0.2687   2.332    0.023 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 256.6 on 62 degrees of freedom
## Multiple R-squared:  0.08062,    Adjusted R-squared:  0.06579 
## F-statistic: 5.437 on 1 and 62 DF,  p-value: 0.02298

plot(new_table$`PreRating`, new_table$`OpponentRating`, xlim=c(800,1800), ylim=c(1000,1800), main="PreRating vs Oppononent avg Rating", xlab="Player PreRating", ylab="Opponent Average PreRating", abline(reg1))

Pre Ratings statistics by State

We can see that highest number of players in the dataset are from MI and the lowest number from OH. Players from OH have a much higher mean Pre-ratings score compared to the other two states.

boxplot(PreRating~State, data = new_table, xlab = "State",
   ylab = "Preratings by State", main = "Statewise Average Player Pre-ratings")

Data 607 - Project 1

Habib Khan and Farhana Zahir

September 21, 2019

Introduction

Objective

Data cleaning Process

Analysis and Visualization