DATA 607 – Project 01

In this project, I was given a text file with chess tournament results where the information had some structure. My goal was to create an R Markdown file that generates a CSV file with the following information for all players:

Player’s Name
Player’s State
Total Number of Points
Player’s Pre-Rating
Average Pre Chess Rating of Opponents

My process for this project was to read in tournament data, tidy it, aggregate the opponent pre-scores, and return a CSV file.

Here is a GitHub repository with links to the original text file and the CSV file I created.

1. Read in the file.

I saved the tournament file to a folder on my desktop. Then, I read in the file, set the separator to be a |, and filled in the empty spaces in the data.

df <- read.table("C:/Users/Kavya/Desktop/Education/MS Data Science/DATA 607 (Data Acquisition and Management)/Projects/Project 01/tournamentinfo.txt",sep = "|", fill = TRUE)

2. Examine the data.

I first renamed the columns in the raw dataframe to be more descriptive. Then, I took a look at the data.

names(df) <- c("Pair_Num", "Player_Name", "Total_Pts", "Round_1", "Round_2", "Round_3", "Round_4", "Round_5", "Round_6", "Round_7", "NA")

head(df)

##                                                                                    Pair_Num
## 1 -----------------------------------------------------------------------------------------
## 2                                                                                     Pair 
## 3                                                                                     Num  
## 4 -----------------------------------------------------------------------------------------
## 5                                                                                        1 
## 6                                                                                       ON 
##                         Player_Name Total_Pts Round_1 Round_2 Round_3
## 1                                                                    
## 2  Player Name                          Total   Round   Round   Round
## 3  USCF ID / Rtg (Pre->Post)             Pts      1       2       3  
## 4                                                                    
## 5  GARY HUA                             6.0     W  39   W  21   W  18
## 6  15445895 / R: 1794   ->1817          N:2     W       B       W    
##   Round_4 Round_5 Round_6 Round_7 NA
## 1                                 NA
## 2   Round   Round   Round   Round NA
## 3     4       5       6       7   NA
## 4                                 NA
## 5   W  14   W   7   D  12   D   4 NA
## 6   B       W       B       W     NA

I noticed right away that the raw tournament data was not tidy. To be tidy – that is, organized and easily manipulated by R – a dataframe needs to meet the following criteria:
- Each variable in the data set should be placed in its own column.
- Each observation should be placed in its own row.
- Each value should placed in its own cell.
The raw data did not meet these criteria. For example, the column I titled “Player Name” had 5 different pieces of information within it – player first name, player last name, player ID, pre-score, and post-score.
Before I could determine the average pre-score for the opponents of each player, I needed to (1) separate each player’s information, (2) identify their opponents, and (3) identify the opponents’ pre-scores. This meant the data needed to be tidier.

3. Reshape the data.

To begin the tidying process, I first filtered the dataframe to only show distinct rows (which got rid of the horizontal dashed lines from the txt file) and removed the first three rows (which got rid of the leftover headers).

df <- distinct(df)

df <- df[ -c(1:3), ]

The original file had each player’s outcome per round in seven separate columns. I used the gather function from dplyr to collapse these seven columns into two: Round and Outcome (of the round).

df2 <- df %>% gather(Round, Outcome, Round_1, Round_2, Round_3, Round_4, Round_5, Round_6, Round_7)

## Warning: attributes are not identical across measure variables;
## they will be dropped

head(df2)

##   Pair_Num                       Player_Name Total_Pts NA   Round Outcome
## 1       1   GARY HUA                             6.0   NA Round_1   W  39
## 2      ON   15445895 / R: 1794   ->1817          N:2   NA Round_1   W    
## 3       2   DAKSHESH DARURI                      6.0   NA Round_1   W  63
## 4      MI   14598900 / R: 1553   ->1663          N:2   NA Round_1   B    
## 5       3   ADITYA BAJAJ                         6.0   NA Round_1   L   8
## 6      MI   14959604 / R: 1384   ->1640          N:2   NA Round_1   W

4. Separate the data – Part 1.

Like I mentioned above, the first few columns of this dataframe held many key pieces of information: state, pair number, player name, player ID, pre-score, and post-score.
I noticed the data was stored in a pattern – every other row held the same type of information. I used grep to extract the information into separate columns, with the idea that they would stay in the same order when extracted.
Then, I created a new dataframe called df3 that held these extracted columns.
In retrospect, I should not have used grep to extract the columns in this step, since that meant I had to merge them into a dataframe solely by position (as opposed to using keys, which are agnostic to position and much less prone to error).

state <- grep("[[:alpha:]]+", df2$Pair_Num, value=T)

pair_number <- grep("[[:digit:]]+", df2$Pair_Num, value=T)

id <- grep("[[:digit:]]+", df2$Player_Name, value=T)

Player_Name <- grep("[[:alpha:]]+[[:space:]][[:alpha:]]+", df2$Player_Name, value=T)

df3 <- data.frame(state, pair_number, id, Player_Name)

head(df3)

##    state pair_number                                id
## 1    ON           1   15445895 / R: 1794   ->1817     
## 2    MI           2   14598900 / R: 1553   ->1663     
## 3    MI           3   14959604 / R: 1384   ->1640     
## 4    MI           4   12616049 / R: 1716   ->1744     
## 5    MI           5   14601533 / R: 1655   ->1690     
## 6    OH           6   15055204 / R: 1686   ->1687     
##                         Player_Name
## 1  GARY HUA                        
## 2  DAKSHESH DARURI                 
## 3  ADITYA BAJAJ                    
## 4  PATRICK H SCHILLING             
## 5  HANSHI ZUO                      
## 6  HANSEN SONG

5. Join the separated data to the main dataframe.

Then, I left-joined df2 (which held the original data) with df3 (which held the grep extracted data) along a common “Player Name” column. This added the separated columns to the main dataframe.

joined <- distinct(left_join(df3, df2, by="Player_Name"))

## Warning: Column `Player_Name` joining factors with different levels,
## coercing to character vector

joined <- joined[ , -2 ]

head(joined)

##    state                                id
## 1    ON   15445895 / R: 1794   ->1817     
## 2    ON   15445895 / R: 1794   ->1817     
## 3    ON   15445895 / R: 1794   ->1817     
## 4    ON   15445895 / R: 1794   ->1817     
## 5    ON   15445895 / R: 1794   ->1817     
## 6    ON   15445895 / R: 1794   ->1817     
##                         Player_Name Pair_Num Total_Pts NA   Round Outcome
## 1  GARY HUA                               1      6.0   NA Round_1   W  39
## 2  GARY HUA                               1      6.0   NA Round_2   W  21
## 3  GARY HUA                               1      6.0   NA Round_3   W  18
## 4  GARY HUA                               1      6.0   NA Round_4   W  14
## 5  GARY HUA                               1      6.0   NA Round_5   W   7
## 6  GARY HUA                               1      6.0   NA Round_6   D  12

6. Separate the data – Part 2.

In this step, I applied a series of further separations to the main dataframe using regular expressions to pull out the player’s Outcome, Opponent, ID, Pre-Score, and Post-Score.

Separated the Outcome column into Outcome and Opponent, and id into id and score.

joined1 <- joined %>%
  separate(Outcome, c("Outcome", "Opponent"), sep = "[[:space:]]+") %>%
  separate(id, c("id", "score"), sep = "[\\/]")

Separated the score column roughly into pre- and post-score.

joined2 <- joined1 %>%
  separate(score, c("Pre-Score", "Post-Score"), sep = "[\\-]")

Separated pre-score into “R” and the score number.

joined3 <- joined2 %>%
  separate(`Pre-Score`, c("Pre-Score2", "Pre-Score"), sep = "[\\:]")

Removed the combination of letters and numbers after the pre-scores that had them.

joined4 <- joined3 %>%
  separate(`Pre-Score`, c("Pre-Score", "Pre-Score2"), sep = "[[:alpha:]]+")

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 378 rows [1,
## 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

Separated post-score into the angle brace and the number.

joined5 <- joined4 %>%
  separate(`Post-Score`, c("Post-Score2", "Post-Score"), sep = "[\\>]")

Removed the combination of letters and numbers after the post-scores that had them.

joined6 <- joined5 %>%
  separate(`Post-Score`, c("Post-Score", "Post-Score2"), sep = "[[:alpha:]]+")

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 392 rows [1,
## 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

Removed the unnecessary columns to make the final join.

final_join <- joined6[ , c(-5,-6) ]

head(final_join)

##    state         id Pre-Score Post-Score                       Player_Name
## 1    ON   15445895    1794     1817       GARY HUA                        
## 2    ON   15445895    1794     1817       GARY HUA                        
## 3    ON   15445895    1794     1817       GARY HUA                        
## 4    ON   15445895    1794     1817       GARY HUA                        
## 5    ON   15445895    1794     1817       GARY HUA                        
## 6    ON   15445895    1794     1817       GARY HUA                        
##   Pair_Num Total_Pts NA   Round Outcome Opponent
## 1       1      6.0   NA Round_1       W       39
## 2       1      6.0   NA Round_2       W       21
## 3       1      6.0   NA Round_3       W       18
## 4       1      6.0   NA Round_4       W       14
## 5       1      6.0   NA Round_5       W        7
## 6       1      6.0   NA Round_6       D       12

7. Create a reference dataframe.

I decided to create a separate dataframe to hold pair number and pre-score as a reference. I called this dataframe opponent_info.

opponent_info <- distinct((data.frame(final_join$`Pre-Score`, final_join$Pair_Num)))

names(opponent_info) <- c("Opponent_Score", "Opponent")

head(opponent_info)

##   Opponent_Score Opponent
## 1        1794          1 
## 2        1553          2 
## 3        1384          3 
## 4        1716          4 
## 5        1655          5 
## 6        1686          6

8. Make the variables numeric.

My next goal was to create a variable in my main dataframe (now called final_join) that held the pre-score of the opponent of each round.
To do this, I planned to join the opponent_info dataframe with final_join. The two dataframes had the Opponent column in common.
However, this was harder than I thought – when I used a full_join along the Opponent column, the pre-scores did not match up correctly. It took a lot of trial and error to find out why.
Back in Step 5, I got a warning that the columns I used to join dataframes had different levels, which meant R had to coerce them to a character vector. I didn’t understand what that meant at the time and moved on, but that is what prevented R from recognizing how the Opponent columns matched.
I then made sure R recognized the columns with numbers in both dataframes as numeric.

final_join$Pair_Num <- as.numeric(final_join$Pair_Num)

final_join <- mutate(final_join, Pair_Num = Pair_Num - 1)

## Warning: package 'bindrcpp' was built under R version 3.4.3

final_join$Opponent <- as.numeric(final_join$Opponent)

opponent_info$Opponent <- as.numeric(opponent_info$Opponent)

opponent_info <- mutate(opponent_info, Opponent = Opponent - 1)

9. Add the opponent’s pre-score to each round.

After making the columns numeric, I was finally able to join the two datasets and add the opponent’s pre-score to each round.
I renamed the columns of this newly-merged dataset to be more consistent.

merged <- full_join(final_join, opponent_info[, c("Opponent", "Opponent_Score")], by="Opponent")

names(merged) <- c("Player_Name", "Player_ID", "Pre_Score", "Post_Score", "Player_State", "Pair_Num", "Total_Pts", "NA", "Round", "Outcome", "Opponent", "Opponent_Score")

head(merged)

##   Player_Name  Player_ID Pre_Score Post_Score
## 1         ON   15445895    1794     1817     
## 2         ON   15445895    1794     1817     
## 3         ON   15445895    1794     1817     
## 4         ON   15445895    1794     1817     
## 5         ON   15445895    1794     1817     
## 6         ON   15445895    1794     1817     
##                        Player_State Pair_Num Total_Pts NA   Round Outcome
## 1  GARY HUA                                1     6.0   NA Round_1       W
## 2  GARY HUA                                1     6.0   NA Round_2       W
## 3  GARY HUA                                1     6.0   NA Round_3       W
## 4  GARY HUA                                1     6.0   NA Round_4       W
## 5  GARY HUA                                1     6.0   NA Round_5       W
## 6  GARY HUA                                1     6.0   NA Round_6       D
##   Opponent Opponent_Score
## 1       39           1436
## 2       21           1563
## 3       18        1600   
## 4       14        1610   
## 5        7        1649   
## 6       12        1663

10. Aggregate the opponent’s pre-scores.

After adding Opponent_Score to each round, I could then get an average opponent pre-score for each player by using the aggregate function.
However, it turned out that the Opponent_Score variable was a factor, which meant it could not be aggregated. I had to first coerce it into a character and then a numeric.

merged$Opponent_Score <- as.character(merged$Opponent_Score)

merged$Opponent_Score <- as.numeric(merged$Opponent_Score)

Once that step was complete, I was able to create a separate dataframe called aggregated with the mean opponent pre-score for each pair number (that is, each player).

aggregated <- aggregate(merged[ , 12], list(merged$Pair_Num), mean)

head(aggregated)

##   Group.1        x
## 1       1 1605.286
## 2       2 1469.286
## 3       3 1563.571
## 4       4 1573.571
## 5       5 1500.857
## 6       6 1518.714

11. Create a final dataframe.

Finally, I had the pieces to put together a dataframe that contained all of the information requested in the project’s prompt. I called this dataframe final, and renamed the columns to include the average opponent score.

final <- distinct(data.frame(merged$Pair_Num, merged$Player_Name, merged$Player_State, merged$Total_Pts, merged$Pre_Score))

final <- merge(final, aggregated, by.x = "merged.Pair_Num", by.y = 1, all.x = TRUE)

names(final) <- c("Pair_Num", "Player_Name", "Player_State", "Total_Pts", "Pre_Score", "Avg_Opponent_Score")

head(final)

##   Pair_Num Player_Name                      Player_State Total_Pts
## 1        1         ON   GARY HUA                             6.0  
## 2        2         MI   DAKSHESH DARURI                      6.0  
## 3        3         MI   ADITYA BAJAJ                         6.0  
## 4        4         MI   PATRICK H SCHILLING                  5.5  
## 5        5         MI   HANSHI ZUO                           5.5  
## 6        6         OH   HANSEN SONG                          5.0  
##   Pre_Score Avg_Opponent_Score
## 1   1794              1605.286
## 2   1553              1469.286
## 3   1384              1563.571
## 4   1716              1573.571
## 5   1655              1500.857
## 6   1686              1518.714

12. Write the dataframe into a CSV file.

As a last step, I wrote the database to a file in my working directory called “tournamentinfo.csv”.

write.csv(final, "tournamentinfo.csv")