In this project, you’re given a text file with chess tournament results where the information has some structure. Your job is to create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:
Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents
For the first player, the information would be: Gary Hua, ON, 6.0, 1794, 1605
It Begins Rpubs link: http://rpubs.com/jefflittlejohn/Data_607_Project_1
Let’s load some packages that we’ll need to read and parse the data.
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'tibble' was built under R version 3.3.3
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(readr)
library(stringr)
Read the tournament info. After much trial and error, read.delim seems liked the best option. Attempts at using read_delim from the tidyverse led to many problems.
#Reading txt file with pipe delimiter. Tried to skip column 11, created due to line-pipe, but it did not work. Will handle with subsets.
chesst <- read.delim("https://raw.githubusercontent.com/littlejohnjeff/DATA607_Fall2018/master/tournamentinfo.txt", header = FALSE, sep = "|")
head(chesst)
## V1
## 1 -----------------------------------------------------------------------------------------
## 2 Pair
## 3 Num
## 4 -----------------------------------------------------------------------------------------
## 5 1
## 6 ON
## V2 V3 V4 V5 V6 V7 V8
## 1
## 2 Player Name Total Round Round Round Round Round
## 3 USCF ID / Rtg (Pre->Post) Pts 1 2 3 4 5
## 4
## 5 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7
## 6 15445895 / R: 1794 ->1817 N:2 W B W B W
## V9 V10 V11
## 1 NA
## 2 Round Round NA
## 3 6 7 NA
## 4 NA
## 5 D 12 D 4 NA
## 6 B W NA
Considered pulling out rows 2 and 3 to use as column names, but the inconsistent standards and whitespace likely make it more trouble than it would b worth. Instead, let’s manually create some column names - under the understanding that we will be merging records into columns and splitting the UCSF ID/Rtg (Pre>Post) column into two attributes.
chess_headers <- c("PairNum","PlayerName","TotalPts","R1Result","R2Result","R3Result","R4Result","R5Result","R6Result","R7Result","PlayerState","UCSF_ID","What","R1Color","R2Color","R3Color","R4Color","R5Color","R6Color","R7Color")
Now for the wrangling.
#Pull out relevant data, dropping 11th column, which was created
#Create sequences starting at 5, going up by 3 - 194 and 195 are the final rows from the initial import
seq1 <- seq(from = 5,to = 194, by = 3)
seq2 <- seq(from = 6, to = 195, by = 3)
#subset the read txt file using the sequencing to skip the "----" columns and turn the two rows per player into one wider record
chesst_data <- cbind(chesst[seq1,1:10],chesst[seq2,1:10])
colnames(chesst_data) <- chess_headers
chesst_data$PairNum <- as.numeric(as.character(chesst_data$PairNum))
head(chesst_data)
## PairNum PlayerName TotalPts R1Result R2Result
## 5 1 GARY HUA 6.0 W 39 W 21
## 8 2 DAKSHESH DARURI 6.0 W 63 W 58
## 11 3 ADITYA BAJAJ 6.0 L 8 W 61
## 14 4 PATRICK H SCHILLING 5.5 W 23 D 28
## 17 5 HANSHI ZUO 5.5 W 45 W 37
## 20 6 HANSEN SONG 5.0 W 34 D 29
## R3Result R4Result R5Result R6Result R7Result PlayerState
## 5 W 18 W 14 W 7 D 12 D 4 ON
## 8 L 4 W 17 W 16 W 20 W 7 MI
## 11 W 25 W 21 W 11 W 13 W 12 MI
## 14 W 2 W 26 D 5 W 19 D 1 MI
## 17 D 12 D 13 D 4 W 14 W 17 MI
## 20 L 11 W 35 D 10 W 27 W 21 OH
## UCSF_ID What R1Color R2Color R3Color R4Color
## 5 15445895 / R: 1794 ->1817 N:2 W B W B
## 8 14598900 / R: 1553 ->1663 N:2 B W B W
## 11 14959604 / R: 1384 ->1640 N:2 W B W B
## 14 12616049 / R: 1716 ->1744 N:2 W B W B
## 17 14601533 / R: 1655 ->1690 N:2 B W B W
## 20 15055204 / R: 1686 ->1687 N:3 W B W B
## R5Color R6Color R7Color
## 5 W B W
## 8 B W B
## 11 W B W
## 14 W B B
## 17 B W B
## 20 B W B
Pull out the pre-rating from the crazy UCSF ID/pre-post rating column that violates every known rule and principle ever devised. And let’s extract the opponent ID from the R_Result columns to enable the calculation of the average opponent rating.
#Extract the pre-rating as numeric type
chesst_data$Prerating <- as.numeric(substring(chesst_data$UCSF_ID,16,19))
#Pull the opponent IDs (match to PairNum like foreign key) as numeric as well
chesst_data$R1Opp <- as.numeric(substring(chesst_data$R1Result,3))
chesst_data$R2Opp <- as.numeric(substring(chesst_data$R2Result,3))
chesst_data$R3Opp <- as.numeric(substring(chesst_data$R3Result,3))
chesst_data$R4Opp <- as.numeric(substring(chesst_data$R4Result,3))
chesst_data$R5Opp <- as.numeric(substring(chesst_data$R5Result,3))
chesst_data$R6Opp <- as.numeric(substring(chesst_data$R6Result,3))
chesst_data$R7Opp <- as.numeric(substring(chesst_data$R7Result,3))
head(chesst_data)
## PairNum PlayerName TotalPts R1Result R2Result
## 5 1 GARY HUA 6.0 W 39 W 21
## 8 2 DAKSHESH DARURI 6.0 W 63 W 58
## 11 3 ADITYA BAJAJ 6.0 L 8 W 61
## 14 4 PATRICK H SCHILLING 5.5 W 23 D 28
## 17 5 HANSHI ZUO 5.5 W 45 W 37
## 20 6 HANSEN SONG 5.0 W 34 D 29
## R3Result R4Result R5Result R6Result R7Result PlayerState
## 5 W 18 W 14 W 7 D 12 D 4 ON
## 8 L 4 W 17 W 16 W 20 W 7 MI
## 11 W 25 W 21 W 11 W 13 W 12 MI
## 14 W 2 W 26 D 5 W 19 D 1 MI
## 17 D 12 D 13 D 4 W 14 W 17 MI
## 20 L 11 W 35 D 10 W 27 W 21 OH
## UCSF_ID What R1Color R2Color R3Color R4Color
## 5 15445895 / R: 1794 ->1817 N:2 W B W B
## 8 14598900 / R: 1553 ->1663 N:2 B W B W
## 11 14959604 / R: 1384 ->1640 N:2 W B W B
## 14 12616049 / R: 1716 ->1744 N:2 W B W B
## 17 14601533 / R: 1655 ->1690 N:2 B W B W
## 20 15055204 / R: 1686 ->1687 N:3 W B W B
## R5Color R6Color R7Color Prerating R1Opp R2Opp R3Opp R4Opp R5Opp R6Opp
## 5 W B W 1794 39 21 18 14 7 12
## 8 B W B 1553 63 58 4 17 16 20
## 11 W B W 1384 8 61 25 21 11 13
## 14 W B B 1716 23 28 2 26 5 19
## 17 B W B 1655 45 37 12 13 4 14
## 20 B W B 1686 34 29 11 35 10 27
## R7Opp
## 5 4
## 8 7
## 11 12
## 14 1
## 17 17
## 20 21
Let’s subset the dataframe to limit the columns that will be used for the csv output - note that the opponent PairNums won’t be used, but we need them to merge the rankings and then derive a mean for the desired end point.
#Subset for columns that will be used for csv output
chesst_prep <- subset(chesst_data,select=c(as.numeric(PairNum),PlayerName,PlayerState,TotalPts,Prerating,R1Opp,R2Opp,R3Opp,R4Opp,R5Opp,R6Opp,R7Opp))
head(chesst_prep)
## PairNum PlayerName PlayerState TotalPts
## 5 1 GARY HUA ON 6.0
## 8 2 DAKSHESH DARURI MI 6.0
## 11 3 ADITYA BAJAJ MI 6.0
## 14 4 PATRICK H SCHILLING MI 5.5
## 17 5 HANSHI ZUO MI 5.5
## 20 6 HANSEN SONG OH 5.0
## Prerating R1Opp R2Opp R3Opp R4Opp R5Opp R6Opp R7Opp
## 5 1794 39 21 18 14 7 12 4
## 8 1553 63 58 4 17 16 20 7
## 11 1384 8 61 25 21 11 13 12
## 14 1716 23 28 2 26 5 19 1
## 17 1655 45 37 12 13 4 14 17
## 20 1686 34 29 11 35 10 27 21
#Merge round 1 opponent rating
#If I was a better programmer, I would write this once and loop for opponents 2-7
chesst_prep1 <- merge(x = chesst_prep, y = chesst_prep[,c("PairNum","Prerating")], by.x=c("R1Opp"), by.y=c("PairNum"),suffix = c("", "_1"), all.x = TRUE)
chesst_prep2 <- merge(x = chesst_prep1, y = chesst_prep1[,c("PairNum","Prerating")], by.x=c("R2Opp"), by.y=c("PairNum"),suffix = c("", "_2"), all.x = TRUE)
chesst_prep3 <- merge(x = chesst_prep2, y = chesst_prep2[,c("PairNum","Prerating")], by.x=c("R3Opp"), by.y=c("PairNum"),suffix = c("", "_3"), all.x = TRUE)
chesst_prep4 <- merge(x = chesst_prep3, y = chesst_prep3[,c("PairNum","Prerating")], by.x=c("R4Opp"), by.y=c("PairNum"),suffix = c("", "_4"), all.x = TRUE)
chesst_prep5 <- merge(x = chesst_prep4, y = chesst_prep4[,c("PairNum","Prerating")], by.x=c("R5Opp"), by.y=c("PairNum"),suffix = c("", "_5"), all.x = TRUE)
chesst_prep6 <- merge(x = chesst_prep5, y = chesst_prep5[,c("PairNum","Prerating")], by.x=c("R6Opp"), by.y=c("PairNum"),suffix = c("", "_6"), all.x = TRUE)
chesst_prep7 <- merge(x = chesst_prep6, y = chesst_prep6[,c("PairNum","Prerating")], by.x=c("R7Opp"), by.y=c("PairNum"),suffix = c("", "_7"), all.x = TRUE)
head(chesst_prep7)
## R7Opp R6Opp R5Opp R4Opp R3Opp R2Opp R1Opp PairNum
## 1 1 19 5 26 2 28 23 4
## 2 2 9 1 11 13 46 57 7
## 3 3 1 NA 38 5 33 42 12
## 4 4 12 7 14 18 21 39 1
## 5 5 22 23 2 26 41 48 17
## 6 6 39 40 3 47 1 43 21
## PlayerName PlayerState TotalPts Prerating
## 1 PATRICK H SCHILLING MI 5.5 1716
## 2 GARY DEE SWATHELL MI 5.0 1649
## 3 KENNETH J TACK MI 4.5 1663
## 4 GARY HUA ON 6.0 1794
## 5 RONALD GRZEGORCZYK MI 4.0 1629
## 6 DINH DANG BUI ON 4.0 1563
## Prerating_1 Prerating_2 Prerating_3 Prerating_4 Prerating_5 Prerating_6
## 1 1363 1507 1553 1579 1655 1564
## 2 1092 377 1666 1712 1794 1411
## 3 1332 1449 1655 1423 NA 1794
## 4 1436 1563 1600 1610 1649 1663
## 5 1382 1403 1579 1553 1363 1555
## 6 1283 1794 1362 1384 1348 1436
## Prerating_7
## 1 1794
## 2 1553
## 3 1384
## 4 1716
## 5 1655
## 6 1686
Calculate opponent average rating using rowmeans.
#Get the average opponent rating, handling the NA matches
AvgOppRating <- rowMeans(chesst_prep7[13:19],na.rm = TRUE)
chesst_prep7$AvgOppRating <- as.integer(AvgOppRating)
chesst_prep7$Prerating <- as.integer(chesst_prep7$Prerating)
Subset a dataframe that matches our output.
chesst_final <- subset(chesst_prep7,select=c("PlayerName","PlayerState","TotalPts","Prerating","AvgOppRating"))
chesst_final
## PlayerName PlayerState TotalPts Prerating
## 1 PATRICK H SCHILLING MI 5.5 1716
## 2 GARY DEE SWATHELL MI 5.0 1649
## 3 KENNETH J TACK MI 4.5 1663
## 4 GARY HUA ON 6.0 1794
## 5 RONALD GRZEGORCZYK MI 4.0 1629
## 6 DINH DANG BUI ON 4.0 1563
## 7 DAKSHESH DARURI MI 6.0 1553
## 8 DIPANKAR ROY MI 4.0 1564
## 9 JASON ZHENG MI 4.0 1595
## 10 DAVID SUNDEEN MI 4.0 1600
## 11 MAX ZHU ON 3.5 1579
## 12 ADITYA BAJAJ MI 6.0 1384
## 13 JOSHUA PHILIP MATHEWS ON 3.5 1441
## 14 RISHI SHETTY MI 3.5 1494
## 15 BRIAN LIU MI 3.0 1423
## 16 HANSHI ZUO MI 5.5 1655
## 17 ANVIT RAO MI 5.0 1365
## 18 EZEKIEL HOUGHTON MI 5.0 1641
## 19 STEFANO LEE ON 5.0 1411
## 20 HANSEN SONG OH 5.0 1686
## 21 FOREST ZHANG MI 3.0 1348
## 22 JACOB ALEXANDER LAVALLEY MI 3.0 377
## 23 JOEL R HENDON MI 3.0 1436
## 24 ERIC WRIGHT MI 2.5 1362
## 25 CAMERON WILLIAM MC LEMAN MI 4.5 1712
## 26 SIDDHARTH JHA MI 3.5 1355
## 27 SHIVAM JHA MI 2.5 1056
## 28 BRADLEY SHAW MI 4.5 1610
## 29 TORRANCE HENRY JR MI 4.5 1666
## 30 TEJAS AYYAGARI MI 2.5 1011
## 31 ETHAN GUO MI 2.5 935
## 32 DANIEL KHAIN MI 2.5 1382
## 33 SOFIA ADINA STANESCU-BELLU MI 3.5 1507
## 34 JEZZEL FARKAS ON 1.5 955
## 35 ZACHARY JAMES HOUGHTON MI 4.5 1220
## 36 MICHAEL R ALDRICH MI 4.0 1229
## 37 EUGENE L MCCLURE MI 4.0 1555
## 38 MARISA RICCI MI 2.0 1153
## 39 ALEX KONG MI 2.0 1186
## 40 SEAN M MC CORMICK MI 2.0 853
## 41 VIRAJ MOHILE MI 2.0 917
## 42 ALAN BUI ON 4.0 1363
## 43 LOREN SCHWIEBERT MI 3.5 1745
## 44 JOSHUA DAVID LEE MI 3.5 1438
## 45 GEORGE AVERY JONES ON 3.5 1522
## 46 JADE GE MI 3.5 1449
## 47 MICHAEL JEFFERY THOMAS MI 3.5 1399
## 48 BEN LI MI 1.0 1163
## 49 ROBERT GLEN VASEY MI 3.0 1283
## 50 JARED GE MI 3.0 1332
## 51 DEREK YAN MI 3.0 1242
## 52 JUSTIN D SCHILLING MI 3.0 1199
## 53 AMIYATOSH PWNANANDAM MI 3.5 980
## 54 LARRY HODGE MI 2.0 1270
## 55 ASHWIN BALAJI MI 1.0 1530
## 56 GAURAV GIDWANI MI 3.5 1552
## 57 JOSE C YBARRA MI 2.0 1393
## 58 CHIEDOZIE OKORIE MI 3.5 1602
## 59 KYLE WILLIAM MURPHY MI 3.0 1403
## 60 MIKE NIKITIN MI 4.0 1604
## 61 JULIA SHEN MI 1.5 967
## 62 MICHAEL LU MI 2.0 1092
## 63 THOMAS JOSEPH HOSMER MI 1.0 1175
## 64 MICHAEL J MARTIN MI 2.5 1291
## AvgOppRating
## 1 1573
## 2 1372
## 3 1506
## 4 1605
## 5 1498
## 6 1470
## 7 1469
## 8 1426
## 9 1410
## 10 1480
## 11 1506
## 12 1563
## 13 1378
## 14 1259
## 15 1539
## 16 1500
## 17 1554
## 18 1468
## 19 1523
## 20 1518
## 21 1390
## 22 1357
## 23 1429
## 24 1392
## 25 1467
## 26 1388
## 27 1296
## 28 1515
## 29 1497
## 30 1356
## 31 1494
## 32 1355
## 33 1522
## 34 1327
## 35 1483
## 36 1357
## 37 1300
## 38 1414
## 39 1406
## 40 1319
## 41 1391
## 42 1213
## 43 1363
## 44 1149
## 45 1144
## 46 1276
## 47 1375
## 48 1263
## 49 1106
## 50 1149
## 51 1152
## 52 1327
## 53 1384
## 54 1206
## 55 1186
## 56 1221
## 57 1345
## 58 1313
## 59 1248
## 60 1385
## 61 1330
## 62 1363
## 63 1350
## 64 1285
Output the csv
write.csv(chesst_final, file = "Data_607_Lab_1_Chess_Data_Wrangling_Littlejohn.csv")