Task:
In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:
Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents For the first player, the information would be:
Gary Hua, ON, 6.0, 1794, 1605
1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played
Libraries:
library(RCurl)
library(knitr)
library(kableExtra)
library(tidyverse)
library(stringr)
Access the chess data from github:
my.data <- readLines("https://cdn.rawgit.com/nschettini/CUNY-MSDS-DATA-607/5ccd9c39/tournamentinfo.txt", warn = F)
head(my.data, 22)
## [1] "-----------------------------------------------------------------------------------------"
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [4] "-----------------------------------------------------------------------------------------"
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [7] "-----------------------------------------------------------------------------------------"
## [8] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [9] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [10] "-----------------------------------------------------------------------------------------"
## [11] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [12] " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
## [13] "-----------------------------------------------------------------------------------------"
## [14] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|"
## [15] " MI | 12616049 / R: 1716 ->1744 |N:2 |W |B |W |B |W |B |B |"
## [16] "-----------------------------------------------------------------------------------------"
## [17] " 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|"
## [18] " MI | 14601533 / R: 1655 ->1690 |N:2 |B |W |B |W |B |W |B |"
## [19] "-----------------------------------------------------------------------------------------"
## [20] " 6 | HANSEN SONG |5.0 |W 34|D 29|L 11|W 35|D 10|W 27|W 21|"
## [21] " OH | 15055204 / R: 1686 ->1687 |N:3 |W |B |W |B |B |W |B |"
## [22] "-----------------------------------------------------------------------------------------"
Looking at the data from the provided .txt file, you can notice that a lot of it isn’t needed for our task. We can clean this data by the following:
Running the following code: seq(from, to, by) will allow to pull only the rows that have data that is needed. This will make it easier to manipulate with regular expressions to pull out information from the data.
#starts at column 5, increases to the length of the data, by 3 (# of rows till next useful data needed - name, pts)
data1 <- c(seq(5, length(my.data),3))
#starts at column 6, increases to the length of the data, by 3 (# of rows till next useful data needed - state, rank)
data2 <- c(seq(6, length(my.data),3))
Manipulate the data:
Expressions used to pull out necessary data (name, state, total_points, pre_raiting):
#regex to pull out only the name from data1
name <- str_replace_all(str_extract(my.data[data1],"([|]).+?\\1"),"[|]","")
#pull out state from data2
state <- str_extract(my.data[data2], "[A-Z]{2}" )
#pull out total points from data 1 - I noticed points is the only # as a float
total_points <- str_extract(my.data[data1], "\\d.\\d")
#pull out pre raiting. Had to pull out [R: score] then replace the "R: " with a blank character.
pre_raiting1 <- as.integer(str_replace_all(str_extract(my.data[data2], "R: \\s?\\d{3,4}"), "R:\\s", ""))
#pull out player number
player_num <- as.integer(str_extract(my.data[data1], "\\d+"))
#used https://regexr.com/ to copy/paste the text file. This made it possible to play around with combinations of regex in real time
Place the extracted data into a data.frame, then into a kable table:
df1 <- data.frame(name, state, player_num, total_points, pre_raiting1)
kable(head(df1, 20), "html", escape = F) %>%
kable_styling("striped", full_width = F, font_size = 15) %>%
column_spec(1:2, bold = T)
name | state | player_num | total_points | pre_raiting1 |
---|---|---|---|---|
GARY HUA | ON | 1 | 6.0 | 1794 |
DAKSHESH DARURI | MI | 2 | 6.0 | 1553 |
ADITYA BAJAJ | MI | 3 | 6.0 | 1384 |
PATRICK H SCHILLING | MI | 4 | 5.5 | 1716 |
HANSHI ZUO | MI | 5 | 5.5 | 1655 |
HANSEN SONG | OH | 6 | 5.0 | 1686 |
GARY DEE SWATHELL | MI | 7 | 5.0 | 1649 |
EZEKIEL HOUGHTON | MI | 8 | 5.0 | 1641 |
STEFANO LEE | ON | 9 | 5.0 | 1411 |
ANVIT RAO | MI | 10 | 5.0 | 1365 |
CAMERON WILLIAM MC LEMAN | MI | 11 | 4.5 | 1712 |
KENNETH J TACK | MI | 12 | 4.5 | 1663 |
TORRANCE HENRY JR | MI | 13 | 4.5 | 1666 |
BRADLEY SHAW | MI | 14 | 4.5 | 1610 |
ZACHARY JAMES HOUGHTON | MI | 15 | 4.5 | 1220 |
MIKE NIKITIN | MI | 16 | 4.0 | 1604 |
RONALD GRZEGORCZYK | MI | 17 | 4.0 | 1629 |
DAVID SUNDEEN | MI | 18 | 4.0 | 1600 |
DIPANKAR ROY | MI | 19 | 4.0 | 1564 |
JASON ZHENG | MI | 20 | 4.0 | 1595 |
Manipulating data to find opponets scores:
First pulled out “opponent” numbers.
opponent1 <- str_extract_all(my.data[data1], "\\d+\\|")
opponent <- str_extract_all(opponent1,"\\d+")
head(opponent)
## [[1]]
## [1] "39" "21" "18" "14" "7" "12" "4"
##
## [[2]]
## [1] "63" "58" "4" "17" "16" "20" "7"
##
## [[3]]
## [1] "8" "61" "25" "21" "11" "13" "12"
##
## [[4]]
## [1] "23" "28" "2" "26" "5" "19" "1"
##
## [[5]]
## [1] "45" "37" "12" "13" "4" "14" "17"
##
## [[6]]
## [1] "34" "29" "11" "35" "10" "27" "21"
Created loop to calculate the opponents pre raiting score
opponent_preraiting <- numeric(length(data1))
for (i in 1:(length(data1))){
opponent_preraiting[i] <- mean(pre_raiting1[as.numeric(unlist(opponent[i]))])
}
opponent_preraiting <- round(opponent_preraiting,0)
df2 <- data.frame(name, state, player_num, total_points, pre_raiting1, opponent_preraiting)
kable(df2, "html", escape = F) %>%
kable_styling("striped", full_width = F, font_size = 15) %>%
column_spec(1:2, bold = T)
name | state | player_num | total_points | pre_raiting1 | opponent_preraiting |
---|---|---|---|---|---|
GARY HUA | ON | 1 | 6.0 | 1794 | 1605 |
DAKSHESH DARURI | MI | 2 | 6.0 | 1553 | 1469 |
ADITYA BAJAJ | MI | 3 | 6.0 | 1384 | 1564 |
PATRICK H SCHILLING | MI | 4 | 5.5 | 1716 | 1574 |
HANSHI ZUO | MI | 5 | 5.5 | 1655 | 1501 |
HANSEN SONG | OH | 6 | 5.0 | 1686 | 1519 |
GARY DEE SWATHELL | MI | 7 | 5.0 | 1649 | 1372 |
EZEKIEL HOUGHTON | MI | 8 | 5.0 | 1641 | 1468 |
STEFANO LEE | ON | 9 | 5.0 | 1411 | 1523 |
ANVIT RAO | MI | 10 | 5.0 | 1365 | 1554 |
CAMERON WILLIAM MC LEMAN | MI | 11 | 4.5 | 1712 | 1468 |
KENNETH J TACK | MI | 12 | 4.5 | 1663 | 1506 |
TORRANCE HENRY JR | MI | 13 | 4.5 | 1666 | 1498 |
BRADLEY SHAW | MI | 14 | 4.5 | 1610 | 1515 |
ZACHARY JAMES HOUGHTON | MI | 15 | 4.5 | 1220 | 1484 |
MIKE NIKITIN | MI | 16 | 4.0 | 1604 | 1386 |
RONALD GRZEGORCZYK | MI | 17 | 4.0 | 1629 | 1499 |
DAVID SUNDEEN | MI | 18 | 4.0 | 1600 | 1480 |
DIPANKAR ROY | MI | 19 | 4.0 | 1564 | 1426 |
JASON ZHENG | MI | 20 | 4.0 | 1595 | 1411 |
DINH DANG BUI | ON | 21 | 4.0 | 1563 | 1470 |
EUGENE L MCCLURE | MI | 22 | 4.0 | 1555 | 1300 |
ALAN BUI | ON | 23 | 4.0 | 1363 | 1214 |
MICHAEL R ALDRICH | MI | 24 | 4.0 | 1229 | 1357 |
LOREN SCHWIEBERT | MI | 25 | 3.5 | 1745 | 1363 |
MAX ZHU | ON | 26 | 3.5 | 1579 | 1507 |
GAURAV GIDWANI | MI | 27 | 3.5 | 1552 | 1222 |
SOFIA ADINA STANESCU-BELLU | MI | 28 | 3.5 | 1507 | 1522 |
CHIEDOZIE OKORIE | MI | 29 | 3.5 | 1602 | 1314 |
GEORGE AVERY JONES | ON | 30 | 3.5 | 1522 | 1144 |
RISHI SHETTY | MI | 31 | 3.5 | 1494 | 1260 |
JOSHUA PHILIP MATHEWS | ON | 32 | 3.5 | 1441 | 1379 |
JADE GE | MI | 33 | 3.5 | 1449 | 1277 |
MICHAEL JEFFERY THOMAS | MI | 34 | 3.5 | 1399 | 1375 |
JOSHUA DAVID LEE | MI | 35 | 3.5 | 1438 | 1150 |
SIDDHARTH JHA | MI | 36 | 3.5 | 1355 | 1388 |
AMIYATOSH PWNANANDAM | MI | 37 | 3.5 | 980 | 1385 |
BRIAN LIU | MI | 38 | 3.0 | 1423 | 1539 |
JOEL R HENDON | MI | 39 | 3.0 | 1436 | 1430 |
FOREST ZHANG | MI | 40 | 3.0 | 1348 | 1391 |
KYLE WILLIAM MURPHY | MI | 41 | 3.0 | 1403 | 1248 |
JARED GE | MI | 42 | 3.0 | 1332 | 1150 |
ROBERT GLEN VASEY | MI | 43 | 3.0 | 1283 | 1107 |
JUSTIN D SCHILLING | MI | 44 | 3.0 | 1199 | 1327 |
DEREK YAN | MI | 45 | 3.0 | 1242 | 1152 |
JACOB ALEXANDER LAVALLEY | MI | 46 | 3.0 | 377 | 1358 |
ERIC WRIGHT | MI | 47 | 2.5 | 1362 | 1392 |
DANIEL KHAIN | MI | 48 | 2.5 | 1382 | 1356 |
MICHAEL J MARTIN | MI | 49 | 2.5 | 1291 | 1286 |
SHIVAM JHA | MI | 50 | 2.5 | 1056 | 1296 |
TEJAS AYYAGARI | MI | 51 | 2.5 | 1011 | 1356 |
ETHAN GUO | MI | 52 | 2.5 | 935 | 1495 |
JOSE C YBARRA | MI | 53 | 2.0 | 1393 | 1345 |
LARRY HODGE | MI | 54 | 2.0 | 1270 | 1206 |
ALEX KONG | MI | 55 | 2.0 | 1186 | 1406 |
MARISA RICCI | MI | 56 | 2.0 | 1153 | 1414 |
MICHAEL LU | MI | 57 | 2.0 | 1092 | 1363 |
VIRAJ MOHILE | MI | 58 | 2.0 | 917 | 1391 |
SEAN M MC CORMICK | MI | 59 | 2.0 | 853 | 1319 |
JULIA SHEN | MI | 60 | 1.5 | 967 | 1330 |
JEZZEL FARKAS | ON | 61 | 1.5 | 955 | 1327 |
ASHWIN BALAJI | MI | 62 | 1.0 | 1530 | 1186 |
THOMAS JOSEPH HOSMER | MI | 63 | 1.0 | 1175 | 1350 |
BEN LI | MI | 64 | 1.0 | 1163 | 1263 |
Save to a .CSV file
write.table(df2, file = "C:/Users/nicsc/Documents/DATA_607_Project1.csv", sep = ",", col.names = T)
ggplot(df2, aes(opponent_preraiting, pre_raiting1)) + geom_point(aes(color = total_points)) +
xlab("Opponet's Preraiting") +
ylab("Player's Preraiting") +
geom_smooth() +
geom_vline(xintercept=mean(opponent_preraiting), col='red') +
geom_vline(xintercept=median(opponent_preraiting), col='orange')
## `geom_smooth()` using method = 'loess'