In this project, we take chess tournament results data from a text file and prepare it for export as .csv
My strategy for this project was to identify the individual steps I needed to take to transform the data from the original text file.
The steps I identified to complete:
1. Removing |’s and -’s
2. Isolating the player names and ratings into their own columns
3. Extracting the second row of data for each observation and adding
them as additional columns to each observation.
4. Calculating the opposing player rating by match in order to aggregate
an average.
The first step was to read the data into R. I tried several methods: first, table.read(), then read.tsv(), then I tried to read each individual line of the file into a new file. I ultimately discovered I could use read.csv, and through the arguments sep, header, skip, and colnames, I could achieve several of my cleaning goals in one line of code:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
url <- 'https://raw.githubusercontent.com/josh1den/DATA-607/Projects/tournamentinfo.txt'
cols = c('player_num','name','total_pts','round_1','round_2','round_3','round_4',
'round_5','round_6','round_7','NA')
df <- read.csv(url, sep="|", header=FALSE, skip=3, col.names=cols)
head(df)
## player_num
## 1 -----------------------------------------------------------------------------------------
## 2 1
## 3 ON
## 4 -----------------------------------------------------------------------------------------
## 5 2
## 6 MI
## name total_pts round_1 round_2 round_3 round_4
## 1
## 2 GARY HUA 6.0 W 39 W 21 W 18 W 14
## 3 15445895 / R: 1794 ->1817 N:2 W B W B
## 4
## 5 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17
## 6 14598900 / R: 1553 ->1663 N:2 B W B W
## round_5 round_6 round_7 NA.
## 1 NA
## 2 W 7 D 12 D 4 NA
## 3 W B W NA
## 4 NA
## 5 W 16 W 20 W 7 NA
## 6 B W B NA
The next step was to remove the rows with dashes.
# remove rows with dashes
dashes <- "-----------------------------------------------------------------------------------------"
df <- filter(df, player_num != dashes)
head(df)
## player_num name total_pts round_1 round_2
## 1 1 GARY HUA 6.0 W 39 W 21
## 2 ON 15445895 / R: 1794 ->1817 N:2 W B
## 3 2 DAKSHESH DARURI 6.0 W 63 W 58
## 4 MI 14598900 / R: 1553 ->1663 N:2 B W
## 5 3 ADITYA BAJAJ 6.0 L 8 W 61
## 6 MI 14959604 / R: 1384 ->1640 N:2 W B
## round_3 round_4 round_5 round_6 round_7 NA.
## 1 W 18 W 14 W 7 D 12 D 4 NA
## 2 W B W B W NA
## 3 L 4 W 17 W 16 W 20 W 7 NA
## 4 B W B W B NA
## 5 W 25 W 21 W 11 W 13 W 12 NA
## 6 W B W B W NA
Next, I decided to create two new dataframes which I would later merge to one. The first dataframe (df1) would contain the odd number rows, and the second dataframe (df2) would contain the even number rows.
# extract every first row
df1 <- df %>%
filter(row_number() %% 2 == 1)
# extract every second row
df2 <- df %>%
filter(row_number() %% 2 != 1)
Next I used regex to extract the player rating from the name column as a new column: rating. The player ratings all followed a : followed by two or three spaces, and were digits of a minimum length of 3, maximum length of 4.
df2 <- df2 %>%
mutate(rating = as.numeric(str_extract(name, "(?<=:\\s|:\\s\\s)\\d{3,4}")))
head(df2)
## player_num name total_pts round_1 round_2
## 1 ON 15445895 / R: 1794 ->1817 N:2 W B
## 2 MI 14598900 / R: 1553 ->1663 N:2 B W
## 3 MI 14959604 / R: 1384 ->1640 N:2 W B
## 4 MI 12616049 / R: 1716 ->1744 N:2 W B
## 5 MI 14601533 / R: 1655 ->1690 N:2 B W
## 6 OH 15055204 / R: 1686 ->1687 N:3 W B
## round_3 round_4 round_5 round_6 round_7 NA. rating
## 1 W B W B W NA 1794
## 2 B W B W B NA 1553
## 3 W B W B W NA 1384
## 4 W B W B B NA 1716
## 5 B W B W B NA 1655
## 6 W B B W B NA 1686
Now I could rename the player_num column to its appropriate name, state, and by subsetting, remove the columns I don’t need.
# rename and drop unnecessary columns
df2 <- rename(df2, state=player_num)
df2 <- df2[c("state","rating")]
head(df2)
## state rating
## 1 ON 1794
## 2 MI 1553
## 3 MI 1384
## 4 MI 1716
## 5 MI 1655
## 6 OH 1686
Now I’m ready to merge the dataframes.
# merge the dataframes
df_new <- merge(df1, df2, by.x = 0, by.y = 0)
head(df_new)
## Row.names player_num name total_pts round_1
## 1 1 1 GARY HUA 6.0 W 39
## 2 10 10 ANVIT RAO 5.0 D 16
## 3 11 11 CAMERON WILLIAM MC LEMAN 4.5 D 38
## 4 12 12 KENNETH J TACK 4.5 W 42
## 5 13 13 TORRANCE HENRY JR 4.5 W 36
## 6 14 14 BRADLEY SHAW 4.5 W 54
## round_2 round_3 round_4 round_5 round_6 round_7 NA. state rating
## 1 W 21 W 18 W 14 W 7 D 12 D 4 NA ON 1794
## 2 L 19 W 55 W 31 D 6 W 25 W 18 NA MI 1365
## 3 W 56 W 6 L 7 L 3 W 34 W 26 NA MI 1712
## 4 W 33 D 5 W 38 H D 1 L 3 NA MI 1663
## 5 W 27 L 7 D 5 W 33 L 3 W 32 NA MI 1666
## 6 W 44 W 8 L 1 D 27 L 5 W 31 NA MI 1610
Steps 1, 2, and 3 are now basically complete. However, there is a little bit of tidying up to attend to.
# remove undesired columns
cols_to_remove <- c("Row.names","NA.")
df_new <- select(df_new, -cols_to_remove)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(cols_to_remove)` instead of `cols_to_remove` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
head(df_new)
## player_num name total_pts round_1 round_2
## 1 1 GARY HUA 6.0 W 39 W 21
## 2 10 ANVIT RAO 5.0 D 16 L 19
## 3 11 CAMERON WILLIAM MC LEMAN 4.5 D 38 W 56
## 4 12 KENNETH J TACK 4.5 W 42 W 33
## 5 13 TORRANCE HENRY JR 4.5 W 36 W 27
## 6 14 BRADLEY SHAW 4.5 W 54 W 44
## round_3 round_4 round_5 round_6 round_7 state rating
## 1 W 18 W 14 W 7 D 12 D 4 ON 1794
## 2 W 55 W 31 D 6 W 25 W 18 MI 1365
## 3 W 6 L 7 L 3 W 34 W 26 MI 1712
## 4 D 5 W 38 H D 1 L 3 MI 1663
## 5 L 7 D 5 W 33 L 3 W 32 MI 1666
## 6 W 8 L 1 D 27 L 5 W 31 MI 1610
# the name column appears to have whitespace, let's trim that
df_new$name <- str_trim(df_new$name, side = c("both"))
head(df_new)
## player_num name total_pts round_1 round_2 round_3 round_4
## 1 1 GARY HUA 6.0 W 39 W 21 W 18 W 14
## 2 10 ANVIT RAO 5.0 D 16 L 19 W 55 W 31
## 3 11 CAMERON WILLIAM MC LEMAN 4.5 D 38 W 56 W 6 L 7
## 4 12 KENNETH J TACK 4.5 W 42 W 33 D 5 W 38
## 5 13 TORRANCE HENRY JR 4.5 W 36 W 27 L 7 D 5
## 6 14 BRADLEY SHAW 4.5 W 54 W 44 W 8 L 1
## round_5 round_6 round_7 state rating
## 1 W 7 D 12 D 4 ON 1794
## 2 D 6 W 25 W 18 MI 1365
## 3 L 3 W 34 W 26 MI 1712
## 4 H D 1 L 3 MI 1663
## 5 W 33 L 3 W 32 MI 1666
## 6 D 27 L 5 W 31 MI 1610
# player_num column needs to be numeric
df_new$player_num <- as.numeric(df_new$player_num)
head(df_new)
## player_num name total_pts round_1 round_2 round_3 round_4
## 1 1 GARY HUA 6.0 W 39 W 21 W 18 W 14
## 2 10 ANVIT RAO 5.0 D 16 L 19 W 55 W 31
## 3 11 CAMERON WILLIAM MC LEMAN 4.5 D 38 W 56 W 6 L 7
## 4 12 KENNETH J TACK 4.5 W 42 W 33 D 5 W 38
## 5 13 TORRANCE HENRY JR 4.5 W 36 W 27 L 7 D 5
## 6 14 BRADLEY SHAW 4.5 W 54 W 44 W 8 L 1
## round_5 round_6 round_7 state rating
## 1 W 7 D 12 D 4 ON 1794
## 2 D 6 W 25 W 18 MI 1365
## 3 L 3 W 34 W 26 MI 1712
## 4 H D 1 L 3 MI 1663
## 5 W 33 L 3 W 32 MI 1666
## 6 D 27 L 5 W 31 MI 1610
Now to complete the final step: calculating the average opposing player ratings.
First I start by extracting the player IDs from the round columns.
# reduce round columns to only the player player_ids
round_cols <- c("round_1","round_2","round_3","round_4", "round_5", "round_6", "round_7")
for (col in round_cols) {
df_new[[col]] <- as.numeric(str_extract_all(df_new[[col]], "(?<=\\s)\\d{1,2}", simplify=TRUE))
}
head(df_new)
## player_num name total_pts round_1 round_2 round_3 round_4
## 1 1 GARY HUA 6.0 39 21 18 14
## 2 10 ANVIT RAO 5.0 16 19 55 31
## 3 11 CAMERON WILLIAM MC LEMAN 4.5 38 56 6 7
## 4 12 KENNETH J TACK 4.5 42 33 5 38
## 5 13 TORRANCE HENRY JR 4.5 36 27 7 5
## 6 14 BRADLEY SHAW 4.5 54 44 8 1
## round_5 round_6 round_7 state rating
## 1 7 12 4 ON 1794
## 2 6 25 18 MI 1365
## 3 3 34 26 MI 1712
## 4 NA 1 3 MI 1663
## 5 33 3 32 MI 1666
## 6 27 5 31 MI 1610
Next, I iterate through the round columns, replacing the player IDs with the player rating corresponding to the player ID.
# loop through the rounds replacing the opposing player id with the opposing player's pre-rating
for (row in 1:nrow(df_new)) {
for (col in round_cols) {
rating <- df_new[row, col]
if(!is.na(rating)) {
df_new[row, col] <- df_new$rating[df_new$player_num == rating]
}
}
}
head(df_new)
## player_num name total_pts round_1 round_2 round_3 round_4
## 1 1 GARY HUA 6.0 1436 1563 1600 1610
## 2 10 ANVIT RAO 5.0 1604 1564 1186 1494
## 3 11 CAMERON WILLIAM MC LEMAN 4.5 1423 1153 1686 1649
## 4 12 KENNETH J TACK 4.5 1332 1449 1655 1423
## 5 13 TORRANCE HENRY JR 4.5 1355 1552 1649 1655
## 6 14 BRADLEY SHAW 4.5 1270 1199 1641 1794
## round_5 round_6 round_7 state rating
## 1 1649 1663 1716 ON 1794
## 2 1686 1745 1600 MI 1365
## 3 1384 1399 1579 MI 1712
## 4 NA 1794 1384 MI 1663
## 5 1449 1384 1441 MI 1666
## 6 1552 1655 1494 MI 1610
Now that I have columns with each opposing player rating, I can create a new columns with the average of the values in the round columns.
## calculate avg opponent rating
df_new$avg_opposing_rating <- round(rowMeans(df_new[round_cols], na.rm=TRUE))
head(df_new)
## player_num name total_pts round_1 round_2 round_3 round_4
## 1 1 GARY HUA 6.0 1436 1563 1600 1610
## 2 10 ANVIT RAO 5.0 1604 1564 1186 1494
## 3 11 CAMERON WILLIAM MC LEMAN 4.5 1423 1153 1686 1649
## 4 12 KENNETH J TACK 4.5 1332 1449 1655 1423
## 5 13 TORRANCE HENRY JR 4.5 1355 1552 1649 1655
## 6 14 BRADLEY SHAW 4.5 1270 1199 1641 1794
## round_5 round_6 round_7 state rating avg_opposing_rating
## 1 1649 1663 1716 ON 1794 1605
## 2 1686 1745 1600 MI 1365 1554
## 3 1384 1399 1579 MI 1712 1468
## 4 NA 1794 1384 MI 1663 1506
## 5 1449 1384 1441 MI 1666 1498
## 6 1552 1655 1494 MI 1610 1515
From here, it’s simply a matter of selecting the desired columns for my output file.
# select the desired output
final <- select(df_new, c(2, 11, 3, 12, 13))
head(final)
## name state total_pts rating avg_opposing_rating
## 1 GARY HUA ON 6.0 1794 1605
## 2 ANVIT RAO MI 5.0 1365 1554
## 3 CAMERON WILLIAM MC LEMAN MI 4.5 1712 1468
## 4 KENNETH J TACK MI 4.5 1663 1506
## 5 TORRANCE HENRY JR MI 4.5 1666 1498
## 6 BRADLEY SHAW MI 4.5 1610 1515
Writing the file to csv:
path <- "/Users/joshiden/Documents/Classes/CUNY SPS/Fall 2022/DATA 607/Projects/Project 1.csv"
write.csv(final, path)
Thanks!