Using a data set that contains the results a chess tournament, our job is to create an R Markdown file that generates a .CSV files with the following information for all players:
Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents
The following is an example of what the data should look like:
Gary Hua, ON, 6.0, 1794, 1605
library(stringr)
library(knitr)
data <- 'https://raw.githubusercontent.com/curiostegui/CUNY-SPS/main/Data%20607/Project%201/Chess_data.txt'
chess <- read.delim(file=data,skip=1,sep = "|", header = TRUE)
We can see in the data below, that the data set needs extensive cleaning. Our first step is to remove rows of unnecessary data such as the dashed lines seen in row 2 and 5.
head(chess)
## Pair
## 1 Num
## 2 -----------------------------------------------------------------------------------------
## 3 1
## 4 ON
## 5 -----------------------------------------------------------------------------------------
## 6 2
## Player.Name Total Round Round.1 Round.2 Round.3 Round.4
## 1 USCF ID / Rtg (Pre->Post) Pts 1 2 3 4 5
## 2
## 3 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7
## 4 15445895 / R: 1794 ->1817 N:2 W B W B W
## 5
## 6 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17 W 16
## Round.5 Round.6 X
## 1 6 7 NA
## 2 NA
## 3 D 12 D 4 NA
## 4 B W NA
## 5 NA
## 6 W 20 W 7 NA
I decided to use the seq to extract the rows of information we are interested and split them into two sets. Then I renamed the columns for convenience. I continued my data cleaning by removing columns that have no information of interest.
Name <- chess[seq(3,length(chess[,1]),3),]
info <- chess[seq(4,length(chess[,1]),3),]
names(Name) <- c('Pair','Player.Name','Total','Round.1','Round.2','Round.3','Round.4','Round.5','Round.6','Round.7','X')
names(info) <- c('State','Rating','Total','Round.1_Result','Round.2_Result','Round.3_Result','Round.4_Result','Round.5_Result','Round.6_Result','Round.7_Result','X')
Name <- subset(Name, select = -c(X))
info <- subset(info, select = -c(Total,Round.1_Result,Round.2_Result,Round.3_Result,Round.4_Result,Round.5_Result,Round.6_Result,Round.7_Result,X))
I extracted each player’s pre-rating. There were some null values in the dataset. I was unable to troubleshoot the regular expression syntax to include the values missed. So I decided to locate any NULLs and plug in the missing pre-rating values.
info$Rating <- as.integer(str_extract(info$Rating,"\\D\\d{3,4}\\D\\W"))
## Warning: NAs introduced by coercion
which(is.na(info$Rating))
## [1] 8 15 21 29 37 39 41 46 49 61
info$Rating[8] <- 1641
info$Rating[15] <- 1220
info$Rating[21] <- 1563
info$Rating[29] <- 1602
info$Rating[37] <- 980
info$Rating[39] <- 1436
info$Rating[41] <- 1403
info$Rating[46] <- 377
info$Rating[49] <- 1291
info$Rating[61] <- 955
I took the same approach when extracting the player IDs in columns Round.1 - Round.7
Name$Round.1 <- as.integer(str_extract(Name$Round.1,"\\d{1,2}"))
Name$Round.2 <- as.integer(str_extract(Name$Round.2,"\\d{1,2}"))
Name$Round.3 <- as.integer(str_extract(Name$Round.3,"\\d{1,2}"))
Name$Round.4 <- as.integer(str_extract(Name$Round.4,"\\d{1,2}"))
Name$Round.5 <- as.integer(str_extract(Name$Round.5,"\\d{1,2}"))
Name$Round.6 <- as.integer(str_extract(Name$Round.6,"\\d{1,2}"))
Name$Round.7 <- as.integer(str_extract(Name$Round.7,"\\d{1,2}"))
After joining the data, I decided to find the average rating of opponent players. First, I substituted the player IDs underneath the Round columns and replaced them with the player’s corresponding pre-rating score. Then I calculated the average from columns Round.1-7.
Chess_clean <- cbind(Name,info)
Chess_clean$Round.1 <- Chess_clean$Rating[Chess_clean$Round.1]
Chess_clean$Round.2 <- Chess_clean$Rating[Chess_clean$Round.2]
Chess_clean$Round.3 <- Chess_clean$Rating[Chess_clean$Round.3]
Chess_clean$Round.4 <- Chess_clean$Rating[Chess_clean$Round.4]
Chess_clean$Round.5 <- Chess_clean$Rating[Chess_clean$Round.5]
Chess_clean$Round.6 <- Chess_clean$Rating[Chess_clean$Round.6]
Chess_clean$Round.7 <- Chess_clean$Rating[Chess_clean$Round.7]
Chess_clean$Opp_Average <- rowMeans(Chess_clean[,4:10],na.rm=T)
Chess_clean$Opp_Average <- round(Chess_clean$Opp_Average)
Lastly, I created the final data frame which has only the columns of interest: Player Name, Player State, Total Points, Player Pre-Rating and Average Opponent Pre-Rating
final_df <- data.frame(
Player.Name <- Chess_clean$Player.Name,
Player.State <- Chess_clean$State,
Total.Points <- Chess_clean$Total,
Player.PreRating <- Chess_clean$Rating,
Opp.Rating.Avg <- Chess_clean$Opp_Average
)
colnames(final_df) <- c("Player.Name","Player.State","Total.Points","Player.PreRating","Opp.Rating.Avg")
kable(final_df)
| Player.Name | Player.State | Total.Points | Player.PreRating | Opp.Rating.Avg |
|---|---|---|---|---|
| GARY HUA | ON | 6.0 | 1794 | 1605 |
| DAKSHESH DARURI | MI | 6.0 | 1553 | 1469 |
| ADITYA BAJAJ | MI | 6.0 | 1384 | 1564 |
| PATRICK H SCHILLING | MI | 5.5 | 1716 | 1574 |
| HANSHI ZUO | MI | 5.5 | 1655 | 1501 |
| HANSEN SONG | OH | 5.0 | 1686 | 1519 |
| GARY DEE SWATHELL | MI | 5.0 | 1649 | 1372 |
| EZEKIEL HOUGHTON | MI | 5.0 | 1641 | 1468 |
| STEFANO LEE | ON | 5.0 | 1411 | 1523 |
| ANVIT RAO | MI | 5.0 | 1365 | 1554 |
| CAMERON WILLIAM MC LEMAN | MI | 4.5 | 1712 | 1468 |
| KENNETH J TACK | MI | 4.5 | 1663 | 1506 |
| TORRANCE HENRY JR | MI | 4.5 | 1666 | 1498 |
| BRADLEY SHAW | MI | 4.5 | 1610 | 1515 |
| ZACHARY JAMES HOUGHTON | MI | 4.5 | 1220 | 1484 |
| MIKE NIKITIN | MI | 4.0 | 1604 | 1386 |
| RONALD GRZEGORCZYK | MI | 4.0 | 1629 | 1499 |
| DAVID SUNDEEN | MI | 4.0 | 1600 | 1480 |
| DIPANKAR ROY | MI | 4.0 | 1564 | 1426 |
| JASON ZHENG | MI | 4.0 | 1595 | 1411 |
| DINH DANG BUI | ON | 4.0 | 1563 | 1470 |
| EUGENE L MCCLURE | MI | 4.0 | 1555 | 1300 |
| ALAN BUI | ON | 4.0 | 1363 | 1214 |
| MICHAEL R ALDRICH | MI | 4.0 | 1229 | 1357 |
| LOREN SCHWIEBERT | MI | 3.5 | 1745 | 1363 |
| MAX ZHU | ON | 3.5 | 1579 | 1507 |
| GAURAV GIDWANI | MI | 3.5 | 1552 | 1222 |
| SOFIA ADINA STANESCU-BELLU | MI | 3.5 | 1507 | 1522 |
| CHIEDOZIE OKORIE | MI | 3.5 | 1602 | 1314 |
| GEORGE AVERY JONES | ON | 3.5 | 1522 | 1144 |
| RISHI SHETTY | MI | 3.5 | 1494 | 1260 |
| JOSHUA PHILIP MATHEWS | ON | 3.5 | 1441 | 1379 |
| JADE GE | MI | 3.5 | 1449 | 1277 |
| MICHAEL JEFFERY THOMAS | MI | 3.5 | 1399 | 1375 |
| JOSHUA DAVID LEE | MI | 3.5 | 1438 | 1150 |
| SIDDHARTH JHA | MI | 3.5 | 1355 | 1388 |
| AMIYATOSH PWNANANDAM | MI | 3.5 | 980 | 1385 |
| BRIAN LIU | MI | 3.0 | 1423 | 1539 |
| JOEL R HENDON | MI | 3.0 | 1436 | 1430 |
| FOREST ZHANG | MI | 3.0 | 1348 | 1391 |
| KYLE WILLIAM MURPHY | MI | 3.0 | 1403 | 1248 |
| JARED GE | MI | 3.0 | 1332 | 1150 |
| ROBERT GLEN VASEY | MI | 3.0 | 1283 | 1107 |
| JUSTIN D SCHILLING | MI | 3.0 | 1199 | 1327 |
| DEREK YAN | MI | 3.0 | 1242 | 1152 |
| JACOB ALEXANDER LAVALLEY | MI | 3.0 | 377 | 1358 |
| ERIC WRIGHT | MI | 2.5 | 1362 | 1392 |
| DANIEL KHAIN | MI | 2.5 | 1382 | 1356 |
| MICHAEL J MARTIN | MI | 2.5 | 1291 | 1286 |
| SHIVAM JHA | MI | 2.5 | 1056 | 1296 |
| TEJAS AYYAGARI | MI | 2.5 | 1011 | 1356 |
| ETHAN GUO | MI | 2.5 | 935 | 1495 |
| JOSE C YBARRA | MI | 2.0 | 1393 | 1345 |
| LARRY HODGE | MI | 2.0 | 1270 | 1206 |
| ALEX KONG | MI | 2.0 | 1186 | 1406 |
| MARISA RICCI | MI | 2.0 | 1153 | 1414 |
| MICHAEL LU | MI | 2.0 | 1092 | 1363 |
| VIRAJ MOHILE | MI | 2.0 | 917 | 1391 |
| SEAN M MC CORMICK | MI | 2.0 | 853 | 1319 |
| JULIA SHEN | MI | 1.5 | 967 | 1330 |
| JEZZEL FARKAS | ON | 1.5 | 955 | 1327 |
| ASHWIN BALAJI | MI | 1.0 | 1530 | 1186 |
| THOMAS JOSEPH HOSMER | MI | 1.0 | 1175 | 1350 |
| BEN LI | MI | 1.0 | 1163 | 1263 |
write.csv(final_df, file = "C:\\Users\\urios\\Downloads\\data.csv")