For this project, we will be reading in chess tournament results from a .TXT file. We will be cleaning messy data, saving it to a data frame and finding the average pre-rating of opponents for each player. We will then export our clean data frame as a .CSV file. For each player the CSV file will include the following information:
The Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents.
We will use the following libraries:
First let’s read in our data from Github and use the readLines() function. This function will read everything in our .txt file into one character column, but no worry, because we will clean and data in the column and eventually save it to a data frame.
url <-("https://raw.githubusercontent.com/WendyR20/DATA-607-Project-1/refs/heads/main/tournamentinfo.txt")
lines <- readLines(url)
## Warning in readLines(url): incomplete final line found on
## 'https://raw.githubusercontent.com/WendyR20/DATA-607-Project-1/refs/heads/main/tournamentinfo.txt'
head(lines,12)
## [1] "-----------------------------------------------------------------------------------------"
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [4] "-----------------------------------------------------------------------------------------"
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [7] "-----------------------------------------------------------------------------------------"
## [8] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [9] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [10] "-----------------------------------------------------------------------------------------"
## [11] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [12] " MI | 14959604 / R: 1384 ->1640 |N:2 |W |B |W |B |W |B |W |"
Let’s also get a glimpse of the lines column
glimpse(lines)
## chr [1:196] "-----------------------------------------------------------------------------------------" ...
Before we extract the information we need an save it as a data frame we need to first clean the data a bit.
Let’s start with getting rid of the dashes that litter the data:
#replacing every dash with emptiness
lines <- gsub("-", "", lines)
#deleting every empty line
lines <- lines[nzchar(lines)]
Now as we’ve seen the data has a header for our purposes we need to delete it.
#deleting the header
lines <- lines[-c(1,2)]
Since each player’s information is split into two lines, let’s create blocks to house each player’s two lines of data.
#let's group every two lines
blocks <- split(lines, ceiling(seq_along(lines)/2))
head(blocks)
## $`1`
## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [2] " ON | 15445895 / R: 1794 >1817 |N:2 |W |B |W |B |W |B |W |"
##
## $`2`
## [1] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [2] " MI | 14598900 / R: 1553 >1663 |N:2 |B |W |B |W |B |W |B |"
##
## $`3`
## [1] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [2] " MI | 14959604 / R: 1384 >1640 |N:2 |W |B |W |B |W |B |W |"
##
## $`4`
## [1] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|"
## [2] " MI | 12616049 / R: 1716 >1744 |N:2 |W |B |W |B |W |B |B |"
##
## $`5`
## [1] " 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|"
## [2] " MI | 14601533 / R: 1655 >1690 |N:2 |B |W |B |W |B |W |B |"
##
## $`6`
## [1] " 6 | HANSEN SONG |5.0 |W 34|D 29|L 11|W 35|D 10|W 27|W 21|"
## [2] " OH | 15055204 / R: 1686 >1687 |N:3 |W |B |W |B |B |W |B |"
Now we can create a function can extract player information from the split blocks.
player_info <- function(block) {
player_line1 <- block[1]
player_line2 <- block[2]
#extracting each round result and the opponent number
rounds <- str_extract_all(player_line1, "[WLDB]\\s*\\d+")[[1]]
#first time using a tibble
tibble::tibble(
Player_Number = str_extract(player_line1, "^\\s*\\d+"), #extracting the number at the new line
State = str_extract(player_line2,"^\\s*\\w+" ), # extracting the first word at the new line
Name = str_trim(str_extract(player_line1, "(?<=\\| ).*?(?=\\|)")),
Total_Points = str_extract(player_line1, "\\d+\\.\\d" ),
Pre_Rating = str_match(player_line2, "R:\\s+(\\d+)")[,2],
#giving each round result it's own column
!!!setNames(as.list(rounds), paste0("Round", seq_along(rounds))),
)
}
Let’s bind all the player informatin together.
# bind all players
player_data <- bind_rows(lapply(blocks,player_info))
head(player_data)
## # A tibble: 6 × 12
## Player_Number State Name Total_Points Pre_Rating Round1 Round2 Round3 Round4
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 " 1" " O… GARY… 6.0 1794 W 39 W 21 W 18 W 14
## 2 " 2" " M… DAKS… 6.0 1553 W 63 W 58 L 4 W 17
## 3 " 3" " M… ADIT… 6.0 1384 L 8 W 61 W 25 W 21
## 4 " 4" " M… PATR… 5.5 1716 W 23 D 28 W 2 W 26
## 5 " 5" " M… HANS… 5.5 1655 W 45 W 37 D 12 D 13
## 6 " 6" " O… HANS… 5.0 1686 W 34 D 29 L 11 W 35
## # ℹ 3 more variables: Round5 <chr>, Round6 <chr>, Round7 <chr>
Let’s convert the tibble to a data frame because it’s more familiar. Though, it’s great to try new things!
Chess_Players <- as.data.frame(player_data)
For the purpose of this project we will not be including the results for each round (i.e. whether the player won, loss, etc.) so let’s extract the numbers from the rounds columns.
Chess_Players <- Chess_Players %>%
mutate(across(c(Round1, Round2, Round3, Round4, Round5, Round6,Round7),
~ as.numeric(str_remove_all(., "[^0-9]"))))
Now let’s check our data.
#Let's take a look at our data so far
head(Chess_Players)
## Player_Number State Name Total_Points Pre_Rating Round1 Round2
## 1 1 ON GARY HUA 6.0 1794 39 21
## 2 2 MI DAKSHESH DARURI 6.0 1553 63 58
## 3 3 MI ADITYA BAJAJ 6.0 1384 8 61
## 4 4 MI PATRICK H SCHILLING 5.5 1716 23 28
## 5 5 MI HANSHI ZUO 5.5 1655 45 37
## 6 6 OH HANSEN SONG 5.0 1686 34 29
## Round3 Round4 Round5 Round6 Round7
## 1 18 14 7 12 4
## 2 4 17 16 20 7
## 3 25 21 11 13 12
## 4 2 26 5 19 1
## 5 12 13 4 14 17
## 6 11 35 10 27 21
glimpse(Chess_Players)
## Rows: 64
## Columns: 12
## $ Player_Number <chr> " 1", " 2", " 3", " 4", " 5", " 6", " …
## $ State <chr> " ON", " MI", " MI", " MI", " MI", " OH", " …
## $ Name <chr> "GARY HUA", "DAKSHESH DARURI", "ADITYA BAJAJ", "PATRICK …
## $ Total_Points <chr> "6.0", "6.0", "6.0", "5.5", "5.5", "5.0", "5.0", "5.0", …
## $ Pre_Rating <chr> "1794", "1553", "1384", "1716", "1655", "1686", "1649", …
## $ Round1 <dbl> 39, 63, 8, 23, 45, 34, 57, 3, 25, 16, 38, 42, 36, 54, 19…
## $ Round2 <dbl> 21, 58, 61, 28, 37, 29, 46, 32, 18, 19, 56, 33, 27, 44, …
## $ Round3 <dbl> 18, 4, 25, 2, 12, 11, 13, 14, 59, 55, 6, 5, 7, 8, 30, 39…
## $ Round4 <dbl> 14, 17, 21, 26, 13, 35, 11, 9, 8, 31, 7, 38, 5, 1, 22, 2…
## $ Round5 <dbl> 7, 16, 11, 5, 4, 10, 1, 47, 26, 6, 3, 1, 33, 27, 54, 36,…
## $ Round6 <dbl> 12, 20, 13, 19, 14, 27, 9, 28, 7, 25, 34, 3, 3, 5, 33, N…
## $ Round7 <dbl> 4, 7, 12, 1, 17, 21, 2, 19, 20, 18, 26, NA, 32, 31, 38, …
We first need to change the pre-rating column to an integer since it will do calculations with this column later on.
Chess_Players$Pre_Rating <- as.integer(Chess_Players$Pre_Rating)
Now let’s replace any null values in the Round Columns with zero.
Chess_Players <- Chess_Players %>%
mutate(across(c(Round1, Round2, Round3, Round4, Round5, Round6,Round7),
~replace_na(., 0)))
glimpse(Chess_Players)
## Rows: 64
## Columns: 12
## $ Player_Number <chr> " 1", " 2", " 3", " 4", " 5", " 6", " …
## $ State <chr> " ON", " MI", " MI", " MI", " MI", " OH", " …
## $ Name <chr> "GARY HUA", "DAKSHESH DARURI", "ADITYA BAJAJ", "PATRICK …
## $ Total_Points <chr> "6.0", "6.0", "6.0", "5.5", "5.5", "5.0", "5.0", "5.0", …
## $ Pre_Rating <int> 1794, 1553, 1384, 1716, 1655, 1686, 1649, 1641, 1411, 13…
## $ Round1 <dbl> 39, 63, 8, 23, 45, 34, 57, 3, 25, 16, 38, 42, 36, 54, 19…
## $ Round2 <dbl> 21, 58, 61, 28, 37, 29, 46, 32, 18, 19, 56, 33, 27, 44, …
## $ Round3 <dbl> 18, 4, 25, 2, 12, 11, 13, 14, 59, 55, 6, 5, 7, 8, 30, 39…
## $ Round4 <dbl> 14, 17, 21, 26, 13, 35, 11, 9, 8, 31, 7, 38, 5, 1, 22, 2…
## $ Round5 <dbl> 7, 16, 11, 5, 4, 10, 1, 47, 26, 6, 3, 1, 33, 27, 54, 36,…
## $ Round6 <dbl> 12, 20, 13, 19, 14, 27, 9, 28, 7, 25, 34, 3, 3, 5, 33, 0…
## $ Round7 <dbl> 4, 7, 12, 1, 17, 21, 2, 19, 20, 18, 26, 0, 32, 31, 38, 0…
Let’s pivot the data frame so each round has it’s own row instead of it’s own column for each player.
Chess_Players_Pivot <- Chess_Players %>%
pivot_longer(cols = starts_with("Round"),
names_to = "Round_Column",
values_to = "Opp_Number") %>%
filter(!is.na(Opp_Number))
glimpse(Chess_Players_Pivot)
## Rows: 448
## Columns: 7
## $ Player_Number <chr> " 1", " 1", " 1", " 1", " 1", " 1", " …
## $ State <chr> " ON", " ON", " ON", " ON", " ON", " ON", " …
## $ Name <chr> "GARY HUA", "GARY HUA", "GARY HUA", "GARY HUA", "GARY HU…
## $ Total_Points <chr> "6.0", "6.0", "6.0", "6.0", "6.0", "6.0", "6.0", "6.0", …
## $ Pre_Rating <int> 1794, 1794, 1794, 1794, 1794, 1794, 1794, 1553, 1553, 15…
## $ Round_Column <chr> "Round1", "Round2", "Round3", "Round4", "Round5", "Round…
## $ Opp_Number <dbl> 39, 21, 18, 14, 7, 12, 4, 63, 58, 4, 17, 16, 20, 7, 8, 6…
We will need to join the original Chess_Players data frame to the Chess_Players_Pivot2 data frame. Thus, the player_number column must be a double type in both data frames. We want to join these two tables so our Opponent’s pre-ratings have their own column in our Chess_Players_Pivot2 data frame.
Chess_Players$Player_Number <- as.double(Chess_Players$Player_Number)
Chess_Players_Pivot2 <- Chess_Players_Pivot %>%
left_join(
Chess_Players %>%
select(Player_Number, Pre_Rating),
by = c("Opp_Number" = "Player_Number")
)
Since we joined the Chess_Players data frame to the Chess_Players_Pivot2 data frame we have two pre-ratings columns. Let’s rename both so we can tell them apart for the sake of our calculations.
names(Chess_Players_Pivot2)
## [1] "Player_Number" "State" "Name" "Total_Points"
## [5] "Pre_Rating.x" "Round_Column" "Opp_Number" "Pre_Rating.y"
Chess_Players_Pivot2 <- Chess_Players_Pivot2 %>%
rename(Player_PreRating = Pre_Rating.x,
Opp_PreRating = Pre_Rating.y)
Now, finally we can find each player’s opponents average pre-rating.
Chess_Players_Avg <- Chess_Players_Pivot2 %>%
group_by(Player_Number) %>%
summarise(
Avg_Opp_PreRating = mean(Opp_PreRating, na.rm = TRUE)
)
head(Chess_Players_Avg)
## # A tibble: 6 × 2
## Player_Number Avg_Opp_PreRating
## <chr> <dbl>
## 1 " 1" 1605.
## 2 " 2" 1469.
## 3 " 3" 1564.
## 4 " 4" 1574.
## 5 " 5" 1501.
## 6 " 6" 1519.
Let’s round our Opponent Pre Rating Column.
#Let's round our Opp Pre Rating Column
Chess_Players_Avg$Avg_Opp_PreRating <- round(Chess_Players_Avg$Avg_Opp_PreRating,
digits = 0)
We will need to join the original Chess_Players data frame to the Chess_Players_Avg data frame where our average of opponents pre-ratings for each player exists. Thus, the player_number column must be a double type in both data frames.
Let’s first take a look at our two data frames.
glimpse(Chess_Players)
## Rows: 64
## Columns: 12
## $ Player_Number <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ State <chr> " ON", " MI", " MI", " MI", " MI", " OH", " …
## $ Name <chr> "GARY HUA", "DAKSHESH DARURI", "ADITYA BAJAJ", "PATRICK …
## $ Total_Points <chr> "6.0", "6.0", "6.0", "5.5", "5.5", "5.0", "5.0", "5.0", …
## $ Pre_Rating <int> 1794, 1553, 1384, 1716, 1655, 1686, 1649, 1641, 1411, 13…
## $ Round1 <dbl> 39, 63, 8, 23, 45, 34, 57, 3, 25, 16, 38, 42, 36, 54, 19…
## $ Round2 <dbl> 21, 58, 61, 28, 37, 29, 46, 32, 18, 19, 56, 33, 27, 44, …
## $ Round3 <dbl> 18, 4, 25, 2, 12, 11, 13, 14, 59, 55, 6, 5, 7, 8, 30, 39…
## $ Round4 <dbl> 14, 17, 21, 26, 13, 35, 11, 9, 8, 31, 7, 38, 5, 1, 22, 2…
## $ Round5 <dbl> 7, 16, 11, 5, 4, 10, 1, 47, 26, 6, 3, 1, 33, 27, 54, 36,…
## $ Round6 <dbl> 12, 20, 13, 19, 14, 27, 9, 28, 7, 25, 34, 3, 3, 5, 33, 0…
## $ Round7 <dbl> 4, 7, 12, 1, 17, 21, 2, 19, 20, 18, 26, 0, 32, 31, 38, 0…
glimpse(Chess_Players_Avg)
## Rows: 64
## Columns: 2
## $ Player_Number <chr> " 1", " 2", " 3", " 4", " 5", " 6"…
## $ Avg_Opp_PreRating <dbl> 1605, 1469, 1564, 1574, 1501, 1519, 1372, 1468, 1523…
Before we can join the two data frames we need to change the Player Number column type in the Chess Players Avg data frame.
Chess_Players_Avg$Player_Number <- as.double(Chess_Players_Avg$Player_Number)
Chess_Players <- Chess_Players %>%
left_join(Chess_Players_Avg, by = "Player_Number")
Let’s take another look at our Chess_Players data frame to ensure our join worked:
glimpse(Chess_Players)
## Rows: 64
## Columns: 13
## $ Player_Number <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ State <chr> " ON", " MI", " MI", " MI", " MI", " OH"…
## $ Name <chr> "GARY HUA", "DAKSHESH DARURI", "ADITYA BAJAJ", "PATR…
## $ Total_Points <chr> "6.0", "6.0", "6.0", "5.5", "5.5", "5.0", "5.0", "5.…
## $ Pre_Rating <int> 1794, 1553, 1384, 1716, 1655, 1686, 1649, 1641, 1411…
## $ Round1 <dbl> 39, 63, 8, 23, 45, 34, 57, 3, 25, 16, 38, 42, 36, 54…
## $ Round2 <dbl> 21, 58, 61, 28, 37, 29, 46, 32, 18, 19, 56, 33, 27, …
## $ Round3 <dbl> 18, 4, 25, 2, 12, 11, 13, 14, 59, 55, 6, 5, 7, 8, 30…
## $ Round4 <dbl> 14, 17, 21, 26, 13, 35, 11, 9, 8, 31, 7, 38, 5, 1, 2…
## $ Round5 <dbl> 7, 16, 11, 5, 4, 10, 1, 47, 26, 6, 3, 1, 33, 27, 54,…
## $ Round6 <dbl> 12, 20, 13, 19, 14, 27, 9, 28, 7, 25, 34, 3, 3, 5, 3…
## $ Round7 <dbl> 4, 7, 12, 1, 17, 21, 2, 19, 20, 18, 26, 0, 32, 31, 3…
## $ Avg_Opp_PreRating <dbl> 1605, 1469, 1564, 1574, 1501, 1519, 1372, 1468, 1523…
head(Chess_Players)
## Player_Number State Name Total_Points Pre_Rating Round1 Round2
## 1 1 ON GARY HUA 6.0 1794 39 21
## 2 2 MI DAKSHESH DARURI 6.0 1553 63 58
## 3 3 MI ADITYA BAJAJ 6.0 1384 8 61
## 4 4 MI PATRICK H SCHILLING 5.5 1716 23 28
## 5 5 MI HANSHI ZUO 5.5 1655 45 37
## 6 6 OH HANSEN SONG 5.0 1686 34 29
## Round3 Round4 Round5 Round6 Round7 Avg_Opp_PreRating
## 1 18 14 7 12 4 1605
## 2 4 17 16 20 7 1469
## 3 25 21 11 13 12 1564
## 4 2 26 5 19 1 1574
## 5 12 13 4 14 17 1501
## 6 11 35 10 27 21 1519
Okay, great now we can move on!
Before we export our data to csv format let’s create a subset data frame that does not include the rounds.
Chess_Player_Final <- Chess_Players %>%
select(-Player_Number,-Round1, -Round2, -Round3, -Round4, -Round5, -Round6,-Round7)
Now we can export our data frame and save it as a csv on our local machine.
write.csv(Chess_Player_Final, "Chess_Player.csv", row.names = FALSE)
And we are finito!