CUNY MSDS DATA 607 Project 1

Task:

In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:

Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents For the first player, the information would be:

Gary Hua, ON, 6.0, 1794, 1605

1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played

Libraries:

library(RCurl)
library(knitr)
library(kableExtra)
library(tidyverse)
library(stringr)

Access the chess data from github:

my.data <- readLines("https://cdn.rawgit.com/nschettini/CUNY-MSDS-DATA-607/5ccd9c39/tournamentinfo.txt", warn = F)
head(my.data, 22)

##  [1] "-----------------------------------------------------------------------------------------" 
##  [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
##  [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
##  [4] "-----------------------------------------------------------------------------------------" 
##  [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
##  [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
##  [7] "-----------------------------------------------------------------------------------------" 
##  [8] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
##  [9] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
## [10] "-----------------------------------------------------------------------------------------" 
## [11] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|" 
## [12] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
## [13] "-----------------------------------------------------------------------------------------" 
## [14] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|" 
## [15] "   MI | 12616049 / R: 1716   ->1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |" 
## [16] "-----------------------------------------------------------------------------------------" 
## [17] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|" 
## [18] "   MI | 14601533 / R: 1655   ->1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
## [19] "-----------------------------------------------------------------------------------------" 
## [20] "    6 | HANSEN SONG                     |5.0  |W  34|D  29|L  11|W  35|D  10|W  27|W  21|" 
## [21] "   OH | 15055204 / R: 1686   ->1687     |N:3  |W    |B    |W    |B    |B    |W    |B    |" 
## [22] "-----------------------------------------------------------------------------------------"

Looking at the data from the provided .txt file, you can notice that a lot of it isn’t needed for our task. We can clean this data by the following:

Running the following code: seq(from, to, by) will allow to pull only the rows that have data that is needed. This will make it easier to manipulate with regular expressions to pull out information from the data.

#starts at column 5, increases to the length of the data, by 3 (# of rows till next useful data needed - name, pts)
data1 <- c(seq(5, length(my.data),3))
#starts at column 6, increases to the length of the data, by 3 (# of rows till next useful data needed - state, rank)
data2 <- c(seq(6, length(my.data),3))

Manipulate the data:

Expressions used to pull out necessary data (name, state, total_points, pre_raiting):

#regex to pull out only the name from data1
name <- str_replace_all(str_extract(my.data[data1],"([|]).+?\\1"),"[|]","")
#pull out state from data2
state <- str_extract(my.data[data2], "[A-Z]{2}" )
#pull out total points from data 1 - I noticed points is the only # as a float
total_points <- str_extract(my.data[data1], "\\d.\\d")
#pull out pre raiting.  Had to pull out [R: score] then replace the "R: " with a blank character.
pre_raiting1 <- as.integer(str_replace_all(str_extract(my.data[data2], "R: \\s?\\d{3,4}"), "R:\\s", ""))
#pull out player number
player_num <- as.integer(str_extract(my.data[data1], "\\d+"))
#used https://regexr.com/ to copy/paste the text file.  This made it possible to play around with combinations of regex in real time

Place the extracted data into a data.frame, then into a kable table:

df1 <- data.frame(name, state, player_num, total_points, pre_raiting1)

kable(head(df1, 20), "html", escape = F) %>%
  kable_styling("striped", full_width = F, font_size = 15) %>%
  column_spec(1:2, bold = T)

name	state	player_num	total_points	pre_raiting1
GARY HUA	ON	1	6.0	1794
DAKSHESH DARURI	MI	2	6.0	1553
ADITYA BAJAJ	MI	3	6.0	1384
PATRICK H SCHILLING	MI	4	5.5	1716
HANSHI ZUO	MI	5	5.5	1655
HANSEN SONG	OH	6	5.0	1686
GARY DEE SWATHELL	MI	7	5.0	1649
EZEKIEL HOUGHTON	MI	8	5.0	1641
STEFANO LEE	ON	9	5.0	1411
ANVIT RAO	MI	10	5.0	1365
CAMERON WILLIAM MC LEMAN	MI	11	4.5	1712
KENNETH J TACK	MI	12	4.5	1663
TORRANCE HENRY JR	MI	13	4.5	1666
BRADLEY SHAW	MI	14	4.5	1610
ZACHARY JAMES HOUGHTON	MI	15	4.5	1220
MIKE NIKITIN	MI	16	4.0	1604
RONALD GRZEGORCZYK	MI	17	4.0	1629
DAVID SUNDEEN	MI	18	4.0	1600
DIPANKAR ROY	MI	19	4.0	1564
JASON ZHENG	MI	20	4.0	1595

Manipulating data to find opponets scores:

First pulled out “opponent” numbers.

opponent1 <- str_extract_all(my.data[data1], "\\d+\\|")
opponent <- str_extract_all(opponent1,"\\d+")
head(opponent)

## [[1]]
## [1] "39" "21" "18" "14" "7"  "12" "4" 
## 
## [[2]]
## [1] "63" "58" "4"  "17" "16" "20" "7" 
## 
## [[3]]
## [1] "8"  "61" "25" "21" "11" "13" "12"
## 
## [[4]]
## [1] "23" "28" "2"  "26" "5"  "19" "1" 
## 
## [[5]]
## [1] "45" "37" "12" "13" "4"  "14" "17"
## 
## [[6]]
## [1] "34" "29" "11" "35" "10" "27" "21"

Created loop to calculate the opponents pre raiting score

opponent_preraiting <- numeric(length(data1))

for (i in 1:(length(data1))){
  opponent_preraiting[i] <- mean(pre_raiting1[as.numeric(unlist(opponent[i]))])
}

opponent_preraiting <- round(opponent_preraiting,0)

df2 <- data.frame(name, state, player_num, total_points, pre_raiting1, opponent_preraiting)



kable(df2, "html", escape = F) %>%
  kable_styling("striped", full_width = F, font_size = 15) %>%
  column_spec(1:2, bold = T)

name	state	player_num	total_points	pre_raiting1	opponent_preraiting
GARY HUA	ON	1	6.0	1794	1605
DAKSHESH DARURI	MI	2	6.0	1553	1469
ADITYA BAJAJ	MI	3	6.0	1384	1564
PATRICK H SCHILLING	MI	4	5.5	1716	1574
HANSHI ZUO	MI	5	5.5	1655	1501
HANSEN SONG	OH	6	5.0	1686	1519
GARY DEE SWATHELL	MI	7	5.0	1649	1372
EZEKIEL HOUGHTON	MI	8	5.0	1641	1468
STEFANO LEE	ON	9	5.0	1411	1523
ANVIT RAO	MI	10	5.0	1365	1554
CAMERON WILLIAM MC LEMAN	MI	11	4.5	1712	1468
KENNETH J TACK	MI	12	4.5	1663	1506
TORRANCE HENRY JR	MI	13	4.5	1666	1498
BRADLEY SHAW	MI	14	4.5	1610	1515
ZACHARY JAMES HOUGHTON	MI	15	4.5	1220	1484
MIKE NIKITIN	MI	16	4.0	1604	1386
RONALD GRZEGORCZYK	MI	17	4.0	1629	1499
DAVID SUNDEEN	MI	18	4.0	1600	1480
DIPANKAR ROY	MI	19	4.0	1564	1426
JASON ZHENG	MI	20	4.0	1595	1411
DINH DANG BUI	ON	21	4.0	1563	1470
EUGENE L MCCLURE	MI	22	4.0	1555	1300
ALAN BUI	ON	23	4.0	1363	1214
MICHAEL R ALDRICH	MI	24	4.0	1229	1357
LOREN SCHWIEBERT	MI	25	3.5	1745	1363
MAX ZHU	ON	26	3.5	1579	1507
GAURAV GIDWANI	MI	27	3.5	1552	1222
SOFIA ADINA STANESCU-BELLU	MI	28	3.5	1507	1522
CHIEDOZIE OKORIE	MI	29	3.5	1602	1314
GEORGE AVERY JONES	ON	30	3.5	1522	1144
RISHI SHETTY	MI	31	3.5	1494	1260
JOSHUA PHILIP MATHEWS	ON	32	3.5	1441	1379
JADE GE	MI	33	3.5	1449	1277
MICHAEL JEFFERY THOMAS	MI	34	3.5	1399	1375
JOSHUA DAVID LEE	MI	35	3.5	1438	1150
SIDDHARTH JHA	MI	36	3.5	1355	1388
AMIYATOSH PWNANANDAM	MI	37	3.5	980	1385
BRIAN LIU	MI	38	3.0	1423	1539
JOEL R HENDON	MI	39	3.0	1436	1430
FOREST ZHANG	MI	40	3.0	1348	1391
KYLE WILLIAM MURPHY	MI	41	3.0	1403	1248
JARED GE	MI	42	3.0	1332	1150
ROBERT GLEN VASEY	MI	43	3.0	1283	1107
JUSTIN D SCHILLING	MI	44	3.0	1199	1327
DEREK YAN	MI	45	3.0	1242	1152
JACOB ALEXANDER LAVALLEY	MI	46	3.0	377	1358
ERIC WRIGHT	MI	47	2.5	1362	1392
DANIEL KHAIN	MI	48	2.5	1382	1356
MICHAEL J MARTIN	MI	49	2.5	1291	1286
SHIVAM JHA	MI	50	2.5	1056	1296
TEJAS AYYAGARI	MI	51	2.5	1011	1356
ETHAN GUO	MI	52	2.5	935	1495
JOSE C YBARRA	MI	53	2.0	1393	1345
LARRY HODGE	MI	54	2.0	1270	1206
ALEX KONG	MI	55	2.0	1186	1406
MARISA RICCI	MI	56	2.0	1153	1414
MICHAEL LU	MI	57	2.0	1092	1363
VIRAJ MOHILE	MI	58	2.0	917	1391
SEAN M MC CORMICK	MI	59	2.0	853	1319
JULIA SHEN	MI	60	1.5	967	1330
JEZZEL FARKAS	ON	61	1.5	955	1327
ASHWIN BALAJI	MI	62	1.0	1530	1186
THOMAS JOSEPH HOSMER	MI	63	1.0	1175	1350
BEN LI	MI	64	1.0	1163	1263

Save to a .CSV file

write.table(df2, file = "C:/Users/nicsc/Documents/DATA_607_Project1.csv", sep = ",", col.names = T)

ggplot(df2, aes(opponent_preraiting, pre_raiting1)) + geom_point(aes(color = total_points)) +
  xlab("Opponet's Preraiting") +
  ylab("Player's Preraiting") +
  geom_smooth() +
  geom_vline(xintercept=mean(opponent_preraiting), col='red') +
    geom_vline(xintercept=median(opponent_preraiting), col='orange')

## `geom_smooth()` using method = 'loess'