Data 607: Project 1

Introduction

For this project, we will be reading in chess tournament results from a .TXT file. We will be cleaning messy data, saving it to a data frame and finding the average pre-rating of opponents for each player. We will then export our clean data frame as a .CSV file. For each player the CSV file will include the following information:

The Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating, and Average Pre Chess Rating of Opponents.

We will use the following libraries:

The tidyverse library

First let’s read in our data from Github and use the readLines() function. This function will read everything in our .txt file into one character column, but no worry, because we will clean and data in the column and eventually save it to a data frame.

url <-("https://raw.githubusercontent.com/WendyR20/DATA-607-Project-1/refs/heads/main/tournamentinfo.txt")
lines <- readLines(url)

## Warning in readLines(url): incomplete final line found on
## 'https://raw.githubusercontent.com/WendyR20/DATA-607-Project-1/refs/heads/main/tournamentinfo.txt'

head(lines,12)

##  [1] "-----------------------------------------------------------------------------------------" 
##  [2] " Pair | Player Name                     |Total|Round|Round|Round|Round|Round|Round|Round| "
##  [3] " Num  | USCF ID / Rtg (Pre->Post)       | Pts |  1  |  2  |  3  |  4  |  5  |  6  |  7  | "
##  [4] "-----------------------------------------------------------------------------------------" 
##  [5] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|" 
##  [6] "   ON | 15445895 / R: 1794   ->1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
##  [7] "-----------------------------------------------------------------------------------------" 
##  [8] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|" 
##  [9] "   MI | 14598900 / R: 1553   ->1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
## [10] "-----------------------------------------------------------------------------------------" 
## [11] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|" 
## [12] "   MI | 14959604 / R: 1384   ->1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |"

Let’s also get a glimpse of the lines column

glimpse(lines)

##  chr [1:196] "-----------------------------------------------------------------------------------------" ...

Cleaning the Data

Before we extract the information we need an save it as a data frame we need to first clean the data a bit.

Let’s start with getting rid of the dashes that litter the data:

#replacing every dash with emptiness
lines <- gsub("-", "", lines)

#deleting every empty line
lines <- lines[nzchar(lines)]

Now as we’ve seen the data has a header for our purposes we need to delete it.

#deleting the header
lines <- lines[-c(1,2)]

Since each player’s information is split into two lines, let’s create blocks to house each player’s two lines of data.

#let's group every two lines 
blocks <- split(lines, ceiling(seq_along(lines)/2))

head(blocks)

## $`1`
## [1] "    1 | GARY HUA                        |6.0  |W  39|W  21|W  18|W  14|W   7|D  12|D   4|"
## [2] "   ON | 15445895 / R: 1794   >1817     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
## 
## $`2`
## [1] "    2 | DAKSHESH DARURI                 |6.0  |W  63|W  58|L   4|W  17|W  16|W  20|W   7|"
## [2] "   MI | 14598900 / R: 1553   >1663     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
## 
## $`3`
## [1] "    3 | ADITYA BAJAJ                    |6.0  |L   8|W  61|W  25|W  21|W  11|W  13|W  12|"
## [2] "   MI | 14959604 / R: 1384   >1640     |N:2  |W    |B    |W    |B    |W    |B    |W    |" 
## 
## $`4`
## [1] "    4 | PATRICK H SCHILLING             |5.5  |W  23|D  28|W   2|W  26|D   5|W  19|D   1|"
## [2] "   MI | 12616049 / R: 1716   >1744     |N:2  |W    |B    |W    |B    |W    |B    |B    |" 
## 
## $`5`
## [1] "    5 | HANSHI ZUO                      |5.5  |W  45|W  37|D  12|D  13|D   4|W  14|W  17|"
## [2] "   MI | 14601533 / R: 1655   >1690     |N:2  |B    |W    |B    |W    |B    |W    |B    |" 
## 
## $`6`
## [1] "    6 | HANSEN SONG                     |5.0  |W  34|D  29|L  11|W  35|D  10|W  27|W  21|"
## [2] "   OH | 15055204 / R: 1686   >1687     |N:3  |W    |B    |W    |B    |B    |W    |B    |"

Now we can create a function can extract player information from the split blocks.

player_info <- function(block) {
  
  player_line1 <- block[1]
  player_line2 <- block[2]
  
  
  #extracting each round result and the opponent number
  rounds <- str_extract_all(player_line1, "[WLDB]\\s*\\d+")[[1]] 
  
  #first time using a tibble 
  tibble::tibble(
    
    Player_Number = str_extract(player_line1, "^\\s*\\d+"), #extracting the number at the new line
    State = str_extract(player_line2,"^\\s*\\w+" ), # extracting the first word at the new line
    Name = str_trim(str_extract(player_line1, "(?<=\\| ).*?(?=\\|)")),
    Total_Points = str_extract(player_line1, "\\d+\\.\\d"  ), 
    Pre_Rating = str_match(player_line2, "R:\\s+(\\d+)")[,2],
    
    #giving each round result it's own column
    !!!setNames(as.list(rounds), paste0("Round", seq_along(rounds))),
  )
  
}

Let’s bind all the player informatin together.

# bind all players
player_data <- bind_rows(lapply(blocks,player_info))

head(player_data)

## # A tibble: 6 × 12
##   Player_Number State  Name  Total_Points Pre_Rating Round1 Round2 Round3 Round4
##   <chr>         <chr>  <chr> <chr>        <chr>      <chr>  <chr>  <chr>  <chr> 
## 1 "    1"       "   O… GARY… 6.0          1794       W  39  W  21  W  18  W  14 
## 2 "    2"       "   M… DAKS… 6.0          1553       W  63  W  58  L   4  W  17 
## 3 "    3"       "   M… ADIT… 6.0          1384       L   8  W  61  W  25  W  21 
## 4 "    4"       "   M… PATR… 5.5          1716       W  23  D  28  W   2  W  26 
## 5 "    5"       "   M… HANS… 5.5          1655       W  45  W  37  D  12  D  13 
## 6 "    6"       "   O… HANS… 5.0          1686       W  34  D  29  L  11  W  35 
## # ℹ 3 more variables: Round5 <chr>, Round6 <chr>, Round7 <chr>

Let’s convert the tibble to a data frame because it’s more familiar. Though, it’s great to try new things!

Chess_Players <- as.data.frame(player_data)

For the purpose of this project we will not be including the results for each round (i.e. whether the player won, loss, etc.) so let’s extract the numbers from the rounds columns.

Chess_Players <- Chess_Players %>%
  mutate(across(c(Round1, Round2, Round3, Round4, Round5, Round6,Round7),
                ~ as.numeric(str_remove_all(., "[^0-9]"))))

Now let’s check our data.

#Let's take a look at our data so far
head(Chess_Players)

##   Player_Number State                Name Total_Points Pre_Rating Round1 Round2
## 1             1    ON            GARY HUA          6.0       1794     39     21
## 2             2    MI     DAKSHESH DARURI          6.0       1553     63     58
## 3             3    MI        ADITYA BAJAJ          6.0       1384      8     61
## 4             4    MI PATRICK H SCHILLING          5.5       1716     23     28
## 5             5    MI          HANSHI ZUO          5.5       1655     45     37
## 6             6    OH         HANSEN SONG          5.0       1686     34     29
##   Round3 Round4 Round5 Round6 Round7
## 1     18     14      7     12      4
## 2      4     17     16     20      7
## 3     25     21     11     13     12
## 4      2     26      5     19      1
## 5     12     13      4     14     17
## 6     11     35     10     27     21

glimpse(Chess_Players)

## Rows: 64
## Columns: 12
## $ Player_Number <chr> "    1", "    2", "    3", "    4", "    5", "    6", " …
## $ State         <chr> "   ON", "   MI", "   MI", "   MI", "   MI", "   OH", " …
## $ Name          <chr> "GARY HUA", "DAKSHESH DARURI", "ADITYA BAJAJ", "PATRICK …
## $ Total_Points  <chr> "6.0", "6.0", "6.0", "5.5", "5.5", "5.0", "5.0", "5.0", …
## $ Pre_Rating    <chr> "1794", "1553", "1384", "1716", "1655", "1686", "1649", …
## $ Round1        <dbl> 39, 63, 8, 23, 45, 34, 57, 3, 25, 16, 38, 42, 36, 54, 19…
## $ Round2        <dbl> 21, 58, 61, 28, 37, 29, 46, 32, 18, 19, 56, 33, 27, 44, …
## $ Round3        <dbl> 18, 4, 25, 2, 12, 11, 13, 14, 59, 55, 6, 5, 7, 8, 30, 39…
## $ Round4        <dbl> 14, 17, 21, 26, 13, 35, 11, 9, 8, 31, 7, 38, 5, 1, 22, 2…
## $ Round5        <dbl> 7, 16, 11, 5, 4, 10, 1, 47, 26, 6, 3, 1, 33, 27, 54, 36,…
## $ Round6        <dbl> 12, 20, 13, 19, 14, 27, 9, 28, 7, 25, 34, 3, 3, 5, 33, N…
## $ Round7        <dbl> 4, 7, 12, 1, 17, 21, 2, 19, 20, 18, 26, NA, 32, 31, 38, …

Finding the Average of each Player’s Opponent’s Pre-ratings

We first need to change the pre-rating column to an integer since it will do calculations with this column later on.

Chess_Players$Pre_Rating <- as.integer(Chess_Players$Pre_Rating)

Now let’s replace any null values in the Round Columns with zero.

Chess_Players <- Chess_Players %>%
  mutate(across(c(Round1, Round2, Round3, Round4, Round5, Round6,Round7),
                ~replace_na(., 0)))

glimpse(Chess_Players)

## Rows: 64
## Columns: 12
## $ Player_Number <chr> "    1", "    2", "    3", "    4", "    5", "    6", " …
## $ State         <chr> "   ON", "   MI", "   MI", "   MI", "   MI", "   OH", " …
## $ Name          <chr> "GARY HUA", "DAKSHESH DARURI", "ADITYA BAJAJ", "PATRICK …
## $ Total_Points  <chr> "6.0", "6.0", "6.0", "5.5", "5.5", "5.0", "5.0", "5.0", …
## $ Pre_Rating    <int> 1794, 1553, 1384, 1716, 1655, 1686, 1649, 1641, 1411, 13…
## $ Round1        <dbl> 39, 63, 8, 23, 45, 34, 57, 3, 25, 16, 38, 42, 36, 54, 19…
## $ Round2        <dbl> 21, 58, 61, 28, 37, 29, 46, 32, 18, 19, 56, 33, 27, 44, …
## $ Round3        <dbl> 18, 4, 25, 2, 12, 11, 13, 14, 59, 55, 6, 5, 7, 8, 30, 39…
## $ Round4        <dbl> 14, 17, 21, 26, 13, 35, 11, 9, 8, 31, 7, 38, 5, 1, 22, 2…
## $ Round5        <dbl> 7, 16, 11, 5, 4, 10, 1, 47, 26, 6, 3, 1, 33, 27, 54, 36,…
## $ Round6        <dbl> 12, 20, 13, 19, 14, 27, 9, 28, 7, 25, 34, 3, 3, 5, 33, 0…
## $ Round7        <dbl> 4, 7, 12, 1, 17, 21, 2, 19, 20, 18, 26, 0, 32, 31, 38, 0…

Let’s pivot the data frame so each round has it’s own row instead of it’s own column for each player.

Chess_Players_Pivot <- Chess_Players %>%
  pivot_longer(cols = starts_with("Round"),
               names_to = "Round_Column",
               values_to = "Opp_Number") %>%
  filter(!is.na(Opp_Number))

glimpse(Chess_Players_Pivot)

## Rows: 448
## Columns: 7
## $ Player_Number <chr> "    1", "    1", "    1", "    1", "    1", "    1", " …
## $ State         <chr> "   ON", "   ON", "   ON", "   ON", "   ON", "   ON", " …
## $ Name          <chr> "GARY HUA", "GARY HUA", "GARY HUA", "GARY HUA", "GARY HU…
## $ Total_Points  <chr> "6.0", "6.0", "6.0", "6.0", "6.0", "6.0", "6.0", "6.0", …
## $ Pre_Rating    <int> 1794, 1794, 1794, 1794, 1794, 1794, 1794, 1553, 1553, 15…
## $ Round_Column  <chr> "Round1", "Round2", "Round3", "Round4", "Round5", "Round…
## $ Opp_Number    <dbl> 39, 21, 18, 14, 7, 12, 4, 63, 58, 4, 17, 16, 20, 7, 8, 6…

We will need to join the original Chess_Players data frame to the Chess_Players_Pivot2 data frame. Thus, the player_number column must be a double type in both data frames. We want to join these two tables so our Opponent’s pre-ratings have their own column in our Chess_Players_Pivot2 data frame.

Chess_Players$Player_Number <- as.double(Chess_Players$Player_Number)

Chess_Players_Pivot2 <- Chess_Players_Pivot %>%
  left_join(
    Chess_Players %>% 
      select(Player_Number, Pre_Rating),
    by = c("Opp_Number" = "Player_Number")
  )

Since we joined the Chess_Players data frame to the Chess_Players_Pivot2 data frame we have two pre-ratings columns. Let’s rename both so we can tell them apart for the sake of our calculations.

names(Chess_Players_Pivot2)

## [1] "Player_Number" "State"         "Name"          "Total_Points" 
## [5] "Pre_Rating.x"  "Round_Column"  "Opp_Number"    "Pre_Rating.y"

Chess_Players_Pivot2 <- Chess_Players_Pivot2 %>%
  rename(Player_PreRating = Pre_Rating.x,
         Opp_PreRating = Pre_Rating.y)

Now, finally we can find each player’s opponents average pre-rating.

Chess_Players_Avg <- Chess_Players_Pivot2 %>%
  group_by(Player_Number) %>%
  summarise(
    Avg_Opp_PreRating = mean(Opp_PreRating, na.rm = TRUE)
  )

head(Chess_Players_Avg)

## # A tibble: 6 × 2
##   Player_Number Avg_Opp_PreRating
##   <chr>                     <dbl>
## 1 "    1"                   1605.
## 2 "    2"                   1469.
## 3 "    3"                   1564.
## 4 "    4"                   1574.
## 5 "    5"                   1501.
## 6 "    6"                   1519.

Let’s round our Opponent Pre Rating Column.

#Let's round our Opp Pre Rating Column
Chess_Players_Avg$Avg_Opp_PreRating <- round(Chess_Players_Avg$Avg_Opp_PreRating,
                                             digits = 0)

Join Our Opponent Pre-Ratings Column to our original Data Frame

We will need to join the original Chess_Players data frame to the Chess_Players_Avg data frame where our average of opponents pre-ratings for each player exists. Thus, the player_number column must be a double type in both data frames.

Let’s first take a look at our two data frames.

glimpse(Chess_Players)

## Rows: 64
## Columns: 12
## $ Player_Number <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ State         <chr> "   ON", "   MI", "   MI", "   MI", "   MI", "   OH", " …
## $ Name          <chr> "GARY HUA", "DAKSHESH DARURI", "ADITYA BAJAJ", "PATRICK …
## $ Total_Points  <chr> "6.0", "6.0", "6.0", "5.5", "5.5", "5.0", "5.0", "5.0", …
## $ Pre_Rating    <int> 1794, 1553, 1384, 1716, 1655, 1686, 1649, 1641, 1411, 13…
## $ Round1        <dbl> 39, 63, 8, 23, 45, 34, 57, 3, 25, 16, 38, 42, 36, 54, 19…
## $ Round2        <dbl> 21, 58, 61, 28, 37, 29, 46, 32, 18, 19, 56, 33, 27, 44, …
## $ Round3        <dbl> 18, 4, 25, 2, 12, 11, 13, 14, 59, 55, 6, 5, 7, 8, 30, 39…
## $ Round4        <dbl> 14, 17, 21, 26, 13, 35, 11, 9, 8, 31, 7, 38, 5, 1, 22, 2…
## $ Round5        <dbl> 7, 16, 11, 5, 4, 10, 1, 47, 26, 6, 3, 1, 33, 27, 54, 36,…
## $ Round6        <dbl> 12, 20, 13, 19, 14, 27, 9, 28, 7, 25, 34, 3, 3, 5, 33, 0…
## $ Round7        <dbl> 4, 7, 12, 1, 17, 21, 2, 19, 20, 18, 26, 0, 32, 31, 38, 0…

glimpse(Chess_Players_Avg)

## Rows: 64
## Columns: 2
## $ Player_Number     <chr> "    1", "    2", "    3", "    4", "    5", "    6"…
## $ Avg_Opp_PreRating <dbl> 1605, 1469, 1564, 1574, 1501, 1519, 1372, 1468, 1523…

Before we can join the two data frames we need to change the Player Number column type in the Chess Players Avg data frame.

Chess_Players_Avg$Player_Number <- as.double(Chess_Players_Avg$Player_Number)

Chess_Players <- Chess_Players %>%
  left_join(Chess_Players_Avg, by = "Player_Number")

Let’s take another look at our Chess_Players data frame to ensure our join worked:

glimpse(Chess_Players)

## Rows: 64
## Columns: 13
## $ Player_Number     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ State             <chr> "   ON", "   MI", "   MI", "   MI", "   MI", "   OH"…
## $ Name              <chr> "GARY HUA", "DAKSHESH DARURI", "ADITYA BAJAJ", "PATR…
## $ Total_Points      <chr> "6.0", "6.0", "6.0", "5.5", "5.5", "5.0", "5.0", "5.…
## $ Pre_Rating        <int> 1794, 1553, 1384, 1716, 1655, 1686, 1649, 1641, 1411…
## $ Round1            <dbl> 39, 63, 8, 23, 45, 34, 57, 3, 25, 16, 38, 42, 36, 54…
## $ Round2            <dbl> 21, 58, 61, 28, 37, 29, 46, 32, 18, 19, 56, 33, 27, …
## $ Round3            <dbl> 18, 4, 25, 2, 12, 11, 13, 14, 59, 55, 6, 5, 7, 8, 30…
## $ Round4            <dbl> 14, 17, 21, 26, 13, 35, 11, 9, 8, 31, 7, 38, 5, 1, 2…
## $ Round5            <dbl> 7, 16, 11, 5, 4, 10, 1, 47, 26, 6, 3, 1, 33, 27, 54,…
## $ Round6            <dbl> 12, 20, 13, 19, 14, 27, 9, 28, 7, 25, 34, 3, 3, 5, 3…
## $ Round7            <dbl> 4, 7, 12, 1, 17, 21, 2, 19, 20, 18, 26, 0, 32, 31, 3…
## $ Avg_Opp_PreRating <dbl> 1605, 1469, 1564, 1574, 1501, 1519, 1372, 1468, 1523…

head(Chess_Players)

##   Player_Number State                Name Total_Points Pre_Rating Round1 Round2
## 1             1    ON            GARY HUA          6.0       1794     39     21
## 2             2    MI     DAKSHESH DARURI          6.0       1553     63     58
## 3             3    MI        ADITYA BAJAJ          6.0       1384      8     61
## 4             4    MI PATRICK H SCHILLING          5.5       1716     23     28
## 5             5    MI          HANSHI ZUO          5.5       1655     45     37
## 6             6    OH         HANSEN SONG          5.0       1686     34     29
##   Round3 Round4 Round5 Round6 Round7 Avg_Opp_PreRating
## 1     18     14      7     12      4              1605
## 2      4     17     16     20      7              1469
## 3     25     21     11     13     12              1564
## 4      2     26      5     19      1              1574
## 5     12     13      4     14     17              1501
## 6     11     35     10     27     21              1519

Okay, great now we can move on!

Subset and Export to CSV

Before we export our data to csv format let’s create a subset data frame that does not include the rounds.

Chess_Player_Final <- Chess_Players %>%
  select(-Player_Number,-Round1, -Round2, -Round3, -Round4, -Round5, -Round6,-Round7)

Now we can export our data frame and save it as a csv on our local machine.

write.csv(Chess_Player_Final, "Chess_Player.csv", row.names = FALSE)

And we are finito!