The first step is to read the txt file into R as a table, and save it
chess <- read.table("https://raw.githubusercontent.com/Mattr5541/DATA-607/main/Project%201/Chess.txt",
header = T, sep = "|", skip = 1, fill = T, quote = "")
glimpse(chess)
## Rows: 194
## Columns: 11
## $ Pair <chr> " Num ", "-----------------------------------------------…
## $ Player.Name <chr> " USCF ID / Rtg (Pre->Post) ", "", " GARY HUA …
## $ Total <chr> " Pts ", "", "6.0 ", "N:2 ", "", "6.0 ", "N:2 ", "", "…
## $ Round <chr> " 1 ", "", "W 39", "W ", "", "W 63", "B ", "", "…
## $ Round.1 <chr> " 2 ", "", "W 21", "B ", "", "W 58", "W ", "", "…
## $ Round.2 <chr> " 3 ", "", "W 18", "W ", "", "L 4", "B ", "", "…
## $ Round.3 <chr> " 4 ", "", "W 14", "B ", "", "W 17", "W ", "", "…
## $ Round.4 <chr> " 5 ", "", "W 7", "W ", "", "W 16", "B ", "", "…
## $ Round.5 <chr> " 6 ", "", "D 12", "B ", "", "W 20", "W ", "", "…
## $ Round.6 <chr> " 7 ", "", "D 4", "W ", "", "W 7", "B ", "", "…
## $ X <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
Since the resulting dataframe is less than interpretable, the next step will be to clean the dataframe by removing any extraneous lines, characters, and columns. I started by removing all hyphens, cutting out some empty columns, and then by merging columns and rows where appropriate. This was accomplished by making a grouping variable called “merge” that groups every two together; I then created a new dataframe called chess_clean where all instances of “merge” that matched would be grouped into one row, and then, of course, I dropped the merge variable. Finally, I cleaned up any trailing spaces that were present in the observations
chess <- data_frame(chess)
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## ℹ Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
chess <- subset(chess, Pair != '-----------------------------------------------------------------------------------------')
chess <- chess[-1,]
chess <- chess[-11]
chess$merge <- rep(1:(nrow(chess) / 2), each = 2)
chess_clean <- chess %>% group_by(merge) %>% summarize_all(~paste(., collapse = "")) %>%
ungroup() %>% select(-merge)
chess_clean <- chess_clean %>% mutate_all(trimws)
chess_clean$Pair <- trimws(chess_clean$Pair)
chess_clean$Player.Name <- trimws(chess_clean$Player.Name)
I then separated the now-cleaned chess dataset where appropriate by using regular expressions and dplyr’s separate function. This took quite a bit of trial and error to properly parse out the correct values, primarily due to the many uneven spaces throughout the observations
chess_sep <- chess_clean %>% separate(Pair, c('Pair', 'Player_State'))
chess_sep <- chess_sep %>% separate(Player.Name, c('Player.Name', 'Rating'), sep = " / R: ")
chess_sep$Player.Name <- gsub("[[0-9]+", "", chess_sep$Player.Name)
chess_sep$Player.Name <- trimws(chess_sep$Player.Name)
chess_sep$Rating <- gsub("^[P].+|>.+", "", chess_sep$Rating)
chess_sep$Rating <- gsub("P\\d*|[- ]", "", chess_sep$Rating)
chess_sep$Total <- gsub("N:\\d+", "", chess_sep$Total)
I then converted the dataframe to a long format in order to more easily match the opponents’ ratings with each player (essentially, I wanted to convert the rounds into a grouping variable so I could match the opponents with the “Rating” column, and eventually, the “Pair” column). After that, I created a new dataframe consisting of the Pair IDs, and renamed “Pair” to “Opponent” and “Rating” to “Opponent_Rating.” I then merged this into the chess dataframe, in order to match each “player with their corresponding opponents’ ratings
chess_long <- chess_sep %>% gather("Round", "Opponent", 6:12)
chess_long$Opponent <- gsub("[A-Za-z]", "", chess_long$Opponent)
chess_long$Opponent <- as.numeric(chess_long$Opponent)
Ratings_sep <- chess_long %>% select(Opponent_Rating = Rating, Opponent = Pair)
chess_long <- chess_long %>% arrange(Opponent)
Ratings_sep <- Ratings_sep %>% arrange(Opponent)
chess_merge <- merge(chess_long, Ratings_sep, by = "Opponent") %>% distinct()
##Just to fix the overall layout of the players
chess_merge <- chess_merge %>% arrange(Pair)
Finally, I set the dataframe back to a wide format, calculated the row averages for every round, in order to determine opponent averages, and performed some last-minute cleaning procedures (removing unnecessary columns/renaming columns)
chess_wide <- chess_merge %>% pivot_wider(id_cols = c(Pair, Player.Name, Player_State, Total, Rating), names_from = Round, values_from = Opponent_Rating)
chess_wide$Pair <- as.numeric(chess_wide$Pair)
Rounds <- chess_wide[,c(6:12)]
Rounds <- Rounds %>% mutate_at(1:7, as.numeric)
Rounds$Opponent_Average <- rowMeans(Rounds, na.rm = T)
Rounds$Opponent_Average <- round(Rounds$Opponent_Average, digits = 0)
chess_wide$Opponent_Average <- Rounds$Opponent_Average
chess_wide <- chess_wide %>% select(-c(1, 6:12))
chess_wide$Total <- as.numeric(chess_wide$Total)
kable(chess_wide)
| Player.Name | Player_State | Total | Rating | Opponent_Average |
|---|---|---|---|---|
| GARY HUA | ON | 6.0 | 1794 | 1605 |
| ANVIT RAO | MI | 5.0 | 1365 | 1554 |
| CAMERON WILLIAM MC LEMAN | MI | 4.5 | 1712 | 1468 |
| KENNETH J TACK | MI | 4.5 | 1663 | 1506 |
| TORRANCE HENRY JR | MI | 4.5 | 1666 | 1498 |
| BRADLEY SHAW | MI | 4.5 | 1610 | 1515 |
| ZACHARY JAMES HOUGHTON | MI | 4.5 | 1220 | 1484 |
| MIKE NIKITIN | MI | 4.0 | 1604 | 1386 |
| RONALD GRZEGORCZYK | MI | 4.0 | 1629 | 1499 |
| DAVID SUNDEEN | MI | 4.0 | 1600 | 1480 |
| DIPANKAR ROY | MI | 4.0 | 1564 | 1426 |
| DAKSHESH DARURI | MI | 6.0 | 1553 | 1469 |
| JASON ZHENG | MI | 4.0 | 1595 | 1411 |
| DINH DANG BUI | ON | 4.0 | 1563 | 1470 |
| EUGENE L MCCLURE | MI | 4.0 | 1555 | 1300 |
| ALAN BUI | ON | 4.0 | 1363 | 1214 |
| MICHAEL R ALDRICH | MI | 4.0 | 1229 | 1357 |
| LOREN SCHWIEBERT | MI | 3.5 | 1745 | 1363 |
| MAX ZHU | ON | 3.5 | 1579 | 1507 |
| GAURAV GIDWANI | MI | 3.5 | 1552 | 1222 |
| SOFIA ADINA STANESCU-BELLU | MI | 3.5 | 1507 | 1522 |
| CHIEDOZIE OKORIE | MI | 3.5 | 1602 | 1314 |
| ADITYA BAJAJ | MI | 6.0 | 1384 | 1564 |
| GEORGE AVERY JONES | ON | 3.5 | 1522 | 1144 |
| RISHI SHETTY | MI | 3.5 | 1494 | 1260 |
| JOSHUA PHILIP MATHEWS | ON | 3.5 | 1441 | 1379 |
| JADE GE | MI | 3.5 | 1449 | 1277 |
| MICHAEL JEFFERY THOMAS | MI | 3.5 | 1399 | 1375 |
| JOSHUA DAVID LEE | MI | 3.5 | 1438 | 1150 |
| SIDDHARTH JHA | MI | 3.5 | 1355 | 1388 |
| AMIYATOSH PWNANANDAM | MI | 3.5 | 980 | 1385 |
| BRIAN LIU | MI | 3.0 | 1423 | 1539 |
| JOEL R HENDON | MI | 3.0 | 1436 | 1430 |
| PATRICK H SCHILLING | MI | 5.5 | 1716 | 1574 |
| FOREST ZHANG | MI | 3.0 | 1348 | 1391 |
| KYLE WILLIAM MURPHY | MI | 3.0 | 1403 | 1248 |
| JARED GE | MI | 3.0 | 1332 | 1150 |
| ROBERT GLEN VASEY | MI | 3.0 | 1283 | 1107 |
| JUSTIN D SCHILLING | MI | 3.0 | 1199 | 1327 |
| DEREK YAN | MI | 3.0 | 1242 | 1152 |
| JACOB ALEXANDER LAVALLEY | MI | 3.0 | 377 | 1358 |
| ERIC WRIGHT | MI | 2.5 | 1362 | 1392 |
| DANIEL KHAIN | MI | 2.5 | 1382 | 1356 |
| MICHAEL J MARTIN | MI | 2.5 | 1291 | 1286 |
| HANSHI ZUO | MI | 5.5 | 1655 | 1501 |
| SHIVAM JHA | MI | 2.5 | 1056 | 1296 |
| TEJAS AYYAGARI | MI | 2.5 | 1011 | 1356 |
| ETHAN GUO | MI | 2.5 | 935 | 1495 |
| JOSE C YBARRA | MI | 2.0 | 1393 | 1345 |
| LARRY HODGE | MI | 2.0 | 1270 | 1206 |
| ALEX KONG | MI | 2.0 | 1186 | 1406 |
| MARISA RICCI | MI | 2.0 | 1153 | 1414 |
| MICHAEL LU | MI | 2.0 | 1092 | 1363 |
| VIRAJ MOHILE | MI | 2.0 | 917 | 1391 |
| SEAN M MC CORMICK | MI | 2.0 | 853 | 1319 |
| HANSEN SONG | OH | 5.0 | 1686 | 1519 |
| JULIA SHEN | MI | 1.5 | 967 | 1330 |
| JEZZEL FARKAS | ON | 1.5 | 955 | 1327 |
| ASHWIN BALAJI | MI | 1.0 | 1530 | 1186 |
| THOMAS JOSEPH HOSMER | MI | 1.0 | 1175 | 1350 |
| BEN LI | MI | 1.0 | 1163 | 1263 |
| GARY DEE SWATHELL | MI | 5.0 | 1649 | 1372 |
| EZEKIEL HOUGHTON | MI | 5.0 | 1641 | 1468 |
| STEFANO LEE | ON | 5.0 | 1411 | 1523 |
Now that everything is set up, it’s time to make a little visual demonstration for some of the values
chess_avg <- chess_wide %>% select(Player_State, Total)
Avg_by_State <- chess_avg %>%
group_by(Player_State) %>%
summarise_at(vars(Total), list(Total = mean))
ggplot(Avg_by_State, aes(x = Player_State, y = Total)) +
geom_bar(stat = "identity") + labs(title = "Average Score by State", x = "State", y = "Score")
The graph above demonstrates that Ohio had the highest average
score when compared to ON (which I am assuming is Ontario?) and
Michigan. However, that is not entirely meaningful, since Ohio only had
one player. The more meaningful comparison would be that Ontario(?) had
a higher average player score than Michigan
##Saving as a CSV
The final step would be to save the cleaned and modified dataframe as a CSV file
chess_csv <- chess_wide
write.csv(chess_csv, "chess.csv")