Project Goal: Extract each chess player's Player Name, State Name and Pre Rating, compute a New Rating as the average Pre Rating of the opponents against whom the player had a Win, Loss or Draw, and write the extracted dataset to a csv file for easy upload to any SQL DB.

The project started by uploading tournamentinfo.txt to GitHub so that anyone can fetch the data from the same location. Once the file was accessible, the work proceeded in these steps:
- Extraction of data: analysis of the file showed that each player's record follows two different line patterns, captured here as dataLine1 and dataLine2.
- With all the records in dataLine1 and dataLine2, playerName, stateName and totalPoints were extracted.
- The remaining challenge was to get the Pre Rating without the "P" suffix so that it can be used to calculate the New Rating. This was done by replacing each P followed by digits (P[0-9]) with a blank; the remaining string data was then converted to numeric for further calculation.

library(stringr)
tournament_info <- readLines("https://raw.githubusercontent.com/ksanju0/IS607/master/tournamentinfo.txt")
## Warning in readLines("https://raw.githubusercontent.com/ksanju0/IS607/
## master/tournamentinfo.txt"): incomplete final line found on 'https://
## raw.githubusercontent.com/ksanju0/IS607/master/tournamentinfo.txt'
dataLine1 = unlist(str_extract_all(tournament_info,"^[[:blank:]]+\\d{1,2}.+"))
dataLine2=unlist(str_extract_all(tournament_info,"^[[:blank:]]+[A-Z]{2}.+"))

playerName=unlist(str_extract_all(dataLine1,"(\\b[[:upper:]-]+\\b\\s)+(\\b[[:upper:]-]+\\b){1}"))
stateName=unlist(str_extract_all(dataLine2,"[[:upper:]]{2}" ))
totalPoints=as.numeric(unlist(str_extract_all(dataLine1,"\\d\\.\\d")))

cleanpreRatingData<-str_replace_all(dataLine2,pattern="[P]\\d{1,}"," ")


line21 <- str_extract_all(cleanpreRatingData,"([R(:)][[:blank:]]+\\d{3,}+)")
preRating<-as.numeric(str_extract_all(line21,"\\d{3,}"))
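
To make the extraction steps above concrete, here is a minimal sketch that runs similar patterns on a single two-line record. The record itself (JANE DOE, the ID number and the ratings) is hypothetical and hand-typed in the layout the file is assumed to have; the rating pattern is slightly simplified compared with the code above.

```r
library(stringr)

# Hypothetical two-line record in the assumed file layout (not from the real file)
sample_lines <- c(
  "    8 | JANE DOE                        |5.0  |W   3|W  32|L  14|L   9|W  47|W  28|W  19|",
  "   MI | 12345678 / R: 1641P17->1657P24  |N:3  |B    |W    |B    |W    |B    |W    |W    |"
)

# Line type 1 starts with a player number, line type 2 with a two-letter state
line1 <- unlist(str_extract_all(sample_lines, "^[[:blank:]]+\\d{1,2}.+"))
line2 <- unlist(str_extract_all(sample_lines, "^[[:blank:]]+[A-Z]{2}.+"))

name   <- str_extract(line1, "(\\b[[:upper:]-]+\\b\\s)+(\\b[[:upper:]-]+\\b){1}")
state  <- str_extract(line2, "[[:upper:]]{2}")
points <- as.numeric(str_extract(line1, "\\d\\.\\d"))

# Drop every "P" plus digits, then read the number that follows "R:"
clean  <- str_replace_all(line2, "P\\d+", " ")
rating <- as.numeric(str_extract(str_extract(clean, "R:[[:blank:]]+\\d{3,}"), "\\d{3,}"))

data.frame(name, state, points, rating)
```

Removing the "P" suffix first matters because otherwise the digits after "P" could be picked up as a second number when the rating is read.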

The next step is to find the numbers of the opponents against whom each player had a Win, Loss or Draw.

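# Each played game looks like "W  39": result letter, blanks, opponent number
Opponents1<- str_extract_all(dataLine1,"[WLD][[:blank:]]+\\d{1,2}")
# Pull the opponent number out of every game, one vector per player
OpponentsData<-lapply(Opponents1, str_extract, pattern="\\d{1,2}")
opponents <- lapply(OpponentsData, as.numeric)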
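
As a small illustration of the opponent extraction, the sketch below applies the same game pattern to one hypothetical player line (JOHN SMITH and his results are made up). It assumes that rounds without a played game carry a letter other than W, L or D and no opponent number, so they are skipped.

```r
library(stringr)

# Hypothetical player line; "B" and "H" rounds have no opponent number
line <- "   37 | JOHN SMITH                      |3.5  |B     |L   5|D  45|D   8|W  30|L  34|H    |"

# Keep only games with a W, L or D result followed by an opponent number
games <- str_extract_all(line, "[WLD][[:blank:]]+\\d{1,2}")[[1]]

# Strip the result letter and blanks, leaving the opponent numbers
opp <- as.numeric(str_extract(games, "\\d{1,2}"))
opp
```

Only five of the seven rounds yield an opponent number here, so the later average would be taken over five opponents for this player.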

We now have all the data required to calculate the New Rating. Before that, the data is bound together in a data frame (PlayerDF) for easy access and reference during the calculation, which is done by the function ave_newRate. This function takes the opponent numbers of one player and returns the average Pre Rating of those opponents, that is, of everyone the player had a Win, Loss or Draw against; applying it to every player gives newRate. The result is then rounded to a whole number and converted back to numeric so that its format matches the Pre Rating. Finally, newRate is bound into the data frame (PlayerDF) as well, which can be written out in any format for future use; here it is written in csv format.
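
The averaging step described above can be sketched on toy numbers. The ratings and opponent lists below are hypothetical, and the sketch uses a vectorised mean() with round() in place of the loop and sprintf() used in this report, so it is an alternative formulation rather than the code of this project.

```r
# Hypothetical pre-ratings for four players (position i = player number i)
pre <- c(1500, 1600, 1400, 1700)

# Hypothetical opponent numbers faced by each player
opp <- list(c(2, 3), c(1, 4), 1, c(1, 2, 3))

# Average the pre-ratings of each player's opponents, then round to whole numbers
avg_opp  <- vapply(opp, function(idx) mean(pre[idx]), numeric(1))
new_rate <- round(avg_opp)
new_rate
```

For example, player 1 faced players 2 and 3, so the value is (1600 + 1400) / 2 = 1500. Indexing by opponent number only works if row i of the data really belongs to player number i, which is the assumption the report's own function makes as well.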

PlayerDF<- data.frame(playerName,stateName,totalPoints,preRating)
# Average pre-rating of the opponents whose row numbers are given in x
ave_newRate <- function(x){
  Newrating<-0
  totOpponents<-length(x)
  for (i in x){
    Newrating<-Newrating+PlayerDF[i,"preRating"]
  }
  return(Newrating/totOpponents)
}

newRate <- unlist(lapply(opponents, ave_newRate))
newRate <- as.numeric(sprintf("%1.0f",newRate))

PlayerDF<- data.frame(playerName,stateName,totalPoints,preRating,newRate)
PlayerDF
##                    playerName stateName totalPoints preRating newRate
## 1                    GARY HUA        ON         6.0      1794    1605
## 2             DAKSHESH DARURI        MI         6.0      1553    1469
## 3                ADITYA BAJAJ        MI         6.0      1384    1564
## 4         PATRICK H SCHILLING        MI         5.5      1716    1574
## 5                  HANSHI ZUO        MI         5.5      1655    1501
## 6                 HANSEN SONG        OH         5.0      1686    1519
## 7           GARY DEE SWATHELL        MI         5.0      1649    1372
## 8            EZEKIEL HOUGHTON        MI         5.0      1641    1468
## 9                 STEFANO LEE        ON         5.0      1411    1523
## 10                  ANVIT RAO        MI         5.0      1365    1554
## 11   CAMERON WILLIAM MC LEMAN        MI         4.5      1712    1468
## 12             KENNETH J TACK        MI         4.5      1663    1506
## 13          TORRANCE HENRY JR        MI         4.5      1666    1498
## 14               BRADLEY SHAW        MI         4.5      1610    1515
## 15     ZACHARY JAMES HOUGHTON        MI         4.5      1220    1484
## 16               MIKE NIKITIN        MI         4.0      1604    1386
## 17         RONALD GRZEGORCZYK        MI         4.0      1629    1499
## 18              DAVID SUNDEEN        MI         4.0      1600    1480
## 19               DIPANKAR ROY        MI         4.0      1564    1426
## 20                JASON ZHENG        MI         4.0      1595    1411
## 21              DINH DANG BUI        ON         4.0      1563    1470
## 22           EUGENE L MCCLURE        MI         4.0      1555    1300
## 23                   ALAN BUI        ON         4.0      1363    1214
## 24          MICHAEL R ALDRICH        MI         4.0      1229    1357
## 25           LOREN SCHWIEBERT        MI         3.5      1745    1363
## 26                    MAX ZHU        ON         3.5      1579    1507
## 27             GAURAV GIDWANI        MI         3.5      1552    1222
## 28 SOFIA ADINA STANESCU-BELLU        MI         3.5      1507    1522
## 29           CHIEDOZIE OKORIE        MI         3.5      1602    1314
## 30         GEORGE AVERY JONES        ON         3.5      1522    1144
## 31               RISHI SHETTY        MI         3.5      1494    1260
## 32      JOSHUA PHILIP MATHEWS        ON         3.5      1441    1379
## 33                    JADE GE        MI         3.5      1449    1277
## 34     MICHAEL JEFFERY THOMAS        MI         3.5      1399    1375
## 35           JOSHUA DAVID LEE        MI         3.5      1438    1150
## 36              SIDDHARTH JHA        MI         3.5      1355    1388
## 37       AMIYATOSH PWNANANDAM        MI         3.5       980    1385
## 38                  BRIAN LIU        MI         3.0      1423    1539
## 39              JOEL R HENDON        MI         3.0      1436    1430
## 40               FOREST ZHANG        MI         3.0      1348    1391
## 41        KYLE WILLIAM MURPHY        MI         3.0      1403    1248
## 42                   JARED GE        MI         3.0      1332    1150
## 43          ROBERT GLEN VASEY        MI         3.0      1283    1107
## 44         JUSTIN D SCHILLING        MI         3.0      1199    1327
## 45                  DEREK YAN        MI         3.0      1242    1152
## 46   JACOB ALEXANDER LAVALLEY        MI         3.0       377    1358
## 47                ERIC WRIGHT        MI         2.5      1362    1392
## 48               DANIEL KHAIN        MI         2.5      1382    1356
## 49           MICHAEL J MARTIN        MI         2.5      1291    1286
## 50                 SHIVAM JHA        MI         2.5      1056    1296
## 51             TEJAS AYYAGARI        MI         2.5      1011    1356
## 52                  ETHAN GUO        MI         2.5       935    1495
## 53              JOSE C YBARRA        MI         2.0      1393    1345
## 54                LARRY HODGE        MI         2.0      1270    1206
## 55                  ALEX KONG        MI         2.0      1186    1406
## 56               MARISA RICCI        MI         2.0      1153    1414
## 57                 MICHAEL LU        MI         2.0      1092    1363
## 58               VIRAJ MOHILE        MI         2.0       917    1391
## 59          SEAN M MC CORMICK        MI         2.0       853    1319
## 60                 JULIA SHEN        MI         1.5       967    1330
## 61              JEZZEL FARKAS        ON         1.5       955    1327
## 62              ASHWIN BALAJI        MI         1.0      1530    1186
## 63       THOMAS JOSEPH HOSMER        MI         1.0      1175    1350
## 64                     BEN LI        MI         1.0      1163    1263
write.csv(PlayerDF, file ="playerDF.csv")
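
Since the csv file is meant for upload to a SQL DB, one detail may be worth checking: by default write.csv() also writes the row names as an extra, unnamed first column. A minimal sketch, using a hypothetical two-row data frame and a temporary file, shows how row.names = FALSE leaves only the real columns:

```r
# Hypothetical stand-in for PlayerDF
demo_df <- data.frame(playerName = c("JANE DOE", "JOHN SMITH"),
                      stateName  = c("MI", "ON"),
                      preRating  = c(1641, 1500))

# Write without the row-name column, then read the file back to inspect it
csv_path <- tempfile(fileext = ".csv")
write.csv(demo_df, file = csv_path, row.names = FALSE)
round_trip <- read.csv(csv_path)
names(round_trip)
```

Whether the extra column is wanted depends on the target table; some import tools would treat it as an additional field.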

Further analysis looks at the distribution of the New Rating. A histogram with an overlaid normal density curve shows the overall shape, and a normal Q-Q plot is used to further check how closely the data follows a normal distribution.

par(mfrow=c(1,2))
newRateMean <- mean(PlayerDF$newRate)
newRateSD <- sd(PlayerDF$newRate)
hist(PlayerDF$newRate,probability=TRUE)
x <- 800:1900
y <- dnorm(x = x, mean = newRateMean, sd = newRateSD)
lines(x = x, y = y, col = "blue")

qqnorm(PlayerDF$newRate)
qqline(PlayerDF$newRate)

summary(PlayerDF$newRate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1107    1310    1382    1379    1481    1605

Conclusion: The histogram shows a unimodal distribution whose mean (1379) is almost equal to its median (1382), which is consistent with an approximately normal distribution. The normal Q-Q plot reinforces this: most of the data lies around the mean/median and is densest between 1310 and 1379. The upper tail contains a few high values, with the maximum at 1605 and a few more data points beyond 1500.
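
As a rough cross-check of the extreme values, one common convention is Tukey's 1.5 x IQR rule. It is not part of the analysis above, and the sketch below uses only the rounded quartiles printed by summary(), so the fences are approximate.

```r
# Rounded values as printed by summary(PlayerDF$newRate)
q1 <- 1310
q3 <- 1481
observed_min <- 1107
observed_max <- 1605

# Fences under the 1.5 x IQR convention
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
c(lower = lower_fence, upper = upper_fence)

# TRUE when both extremes lie inside the fences
in_range <- observed_min >= lower_fence && observed_max <= upper_fence
in_range
```

With these values the fences come out at 1053.5 and 1737.5, so both the minimum and the maximum fall inside them. Whether 1605 counts as an outlier therefore depends on the convention chosen.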