Project Overview

This project is to process a text file of chess tournament results as shown below. The tournament results follow some structure and below is the glimpse of data from the given file.

The goal of this project is to generate a .CSV file (which could for example be imported into a SQL database) with the following information of the players mentioned in the chess tournament:

Player’s Name
Player’s State
Total Number of Points
Player’s Pre-Rating
Average Pre Chess Rating of Opponents

For the first player listed aboved, Gary Hua, his information to be eventually calculated and exported as:

Player’s Name	Player’s State	Total Number of Points	Player’s Pre-Rating	Average Pre Chess Rating of Opponents
Gary Hua	ON	6.0	1794	1605

Setup

The following R packages are required for this project:

tidyr
dplyr
stringr
sqldf

Understanding the File Format

Looking at the file, we can see that the file has a header with two rows:

The file has a fixed structure having 2 rows for a single player. First line contains Player’s Number, Player’s Name, Total Points and Round 1-7 results. Each round results has game result longwith opponent player number.

Second line contains Player’s State, USCF ID, Pre Rating and Post Rating.

Note: The combination of two lines forms a complete record for a single player.

Data Processing

First load the txt using the read delim method with sep=‘|’ and get the data loaded in dataframe. Use select function from tidyr package to select the dataframe and limit the columns we need. Using Filter function of dplyr package I filter out the rows not needed, and then construct 2 different data frames from our original data frame. One containing the players playing details having (Player name, Total Points, Matches played) and an additional fields names player id as primary key. Second dataframe contains the remaining information for that player e.g. (State, USCF-ID, prerating, postrating again player_id as primaey key.

# Get the data from tournamentinfo.txt
theLink <- "https://raw.githubusercontent.com/amit-kapoor/data607/master/tournamentinfo.txt"
chess_df <- read.delim(file = theLink, header = FALSE, sep="|")
glimpse(chess_df)

## Observations: 196
## Variables: 11
## $ V1  <fct> ------------------------------------------------------------…
## $ V2  <fct> ,  Player Name                     ,  USCF ID / Rtg (Pre->Po…
## $ V3  <fct> , Total,  Pts , , 6.0  , N:2  , , 6.0  , N:2  , , 6.0  , N:2…
## $ V4  <fct> , Round,   1  , , W  39, W    , , W  63, B    , , L   8, W  …
## $ V5  <fct> , Round,   2  , , W  21, B    , , W  58, W    , , W  61, B  …
## $ V6  <fct> , Round,   3  , , W  18, W    , , L   4, B    , , W  25, W  …
## $ V7  <fct> , Round,   4  , , W  14, B    , , W  17, W    , , W  21, B  …
## $ V8  <fct> , Round,   5  , , W   7, W    , , W  16, B    , , W  11, W  …
## $ V9  <fct> , Round,   6  , , D  12, B    , , W  20, W    , , W  13, B  …
## $ V10 <fct> , Round,   7  , , D   4, W    , , W   7, B    , , W  12, W  …
## $ V11 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

# filter out the rows which are not needed
my_df <- select(chess_df, 1:10)

my_df <- filter(my_df, !grepl(pattern = "[-]+", V1)) %>%
  filter(row_number() > 2)

head(my_df)

##       V1                                V2    V3    V4    V5    V6    V7
## 1     1   GARY HUA                         6.0   W  39 W  21 W  18 W  14
## 2    ON   15445895 / R: 1794   ->1817      N:2   W     B     W     B    
## 3     2   DAKSHESH DARURI                  6.0   W  63 W  58 L   4 W  17
## 4    MI   14598900 / R: 1553   ->1663      N:2   B     W     B     W    
## 5     3   ADITYA BAJAJ                     6.0   L   8 W  61 W  25 W  21
## 6    MI   14959604 / R: 1384   ->1640      N:2   W     B     W     B    
##      V8    V9   V10
## 1 W   7 D  12 D   4
## 2 W     B     W    
## 3 W  16 W  20 W   7
## 4 B     W     B    
## 5 W  11 W  13 W  12
## 6 W     B     W

# create the first data frame with the first row of every player details
player_details_df <- my_df %>% 
  filter(grepl(pattern = "[[:digit:]]", V1)) %>% 
  mutate(player_id = row_number()) 
head(player_details_df)

##       V1                                V2    V3    V4    V5    V6    V7
## 1     1   GARY HUA                         6.0   W  39 W  21 W  18 W  14
## 2     2   DAKSHESH DARURI                  6.0   W  63 W  58 L   4 W  17
## 3     3   ADITYA BAJAJ                     6.0   L   8 W  61 W  25 W  21
## 4     4   PATRICK H SCHILLING              5.5   W  23 D  28 W   2 W  26
## 5     5   HANSHI ZUO                       5.5   W  45 W  37 D  12 D  13
## 6     6   HANSEN SONG                      5.0   W  34 D  29 L  11 W  35
##      V8    V9   V10 player_id
## 1 W   7 D  12 D   4         1
## 2 W  16 W  20 W   7         2
## 3 W  11 W  13 W  12         3
## 4 D   5 W  19 D   1         4
## 5 D   4 W  14 W  17         5
## 6 D  10 W  27 W  21         6

# create the second data frame with the second row of every player details
player_ratings_df <- my_df %>% 
  filter(grepl(pattern = "[[:alpha:]]", V1)) %>% 
  mutate(player_id = row_number()) 

head(player_ratings_df)

##       V1                                V2    V3    V4    V5    V6    V7
## 1    ON   15445895 / R: 1794   ->1817      N:2   W     B     W     B    
## 2    MI   14598900 / R: 1553   ->1663      N:2   B     W     B     W    
## 3    MI   14959604 / R: 1384   ->1640      N:2   W     B     W     B    
## 4    MI   12616049 / R: 1716   ->1744      N:2   W     B     W     B    
## 5    MI   14601533 / R: 1655   ->1690      N:2   B     W     B     W    
## 6    OH   15055204 / R: 1686   ->1687      N:3   W     B     W     B    
##      V8    V9   V10 player_id
## 1 W     B     W             1
## 2 B     W     B             2
## 3 W     B     W             3
## 4 W     B     B             4
## 5 B     W     B             5
## 6 B     W     B             6

Next step is to seperate the joint values e.g. to separate V2 column in two columns USCF-ID and Rating and then again seperating out PRE-RATING, POST-RATING for every player. Then I rearrange the two data frames by using select and having player_id as first column in both the new dataframes. Then used the merge function of dplyr to merge both the dataframes into one dataframe. Next step I use select, mutate and str_extract functions to clean the value of PRERATING for each row and then rename columns.

# to separate V2 column in two columns USCF-ID and Rating 
my_rating <- separate(player_ratings_df, V2, c("USCF-ID", "Rating"), sep="R: ")
head(my_rating)

##       V1      USCF-ID             Rating    V3    V4    V5    V6    V7
## 1    ON   15445895 /  1794   ->1817      N:2   W     B     W     B    
## 2    MI   14598900 /  1553   ->1663      N:2   B     W     B     W    
## 3    MI   14959604 /  1384   ->1640      N:2   W     B     W     B    
## 4    MI   12616049 /  1716   ->1744      N:2   W     B     W     B    
## 5    MI   14601533 /  1655   ->1690      N:2   B     W     B     W    
## 6    OH   15055204 /  1686   ->1687      N:3   W     B     W     B    
##      V8    V9   V10 player_id
## 1 W     B     W             1
## 2 B     W     B             2
## 3 W     B     W             3
## 4 W     B     B             4
## 5 B     W     B             5
## 6 B     W     B             6

# separate column Rating in two columns PRE-RATING, POST-RATING 
my_rating <- separate(my_rating, Rating, c("PRERATING", "POSTRATING"), sep="->")
head(my_rating)

##       V1      USCF-ID PRERATING POSTRATING    V3    V4    V5    V6    V7
## 1    ON   15445895 /    1794     1817      N:2   W     B     W     B    
## 2    MI   14598900 /    1553     1663      N:2   B     W     B     W    
## 3    MI   14959604 /    1384     1640      N:2   W     B     W     B    
## 4    MI   12616049 /    1716     1744      N:2   W     B     W     B    
## 5    MI   14601533 /    1655     1690      N:2   B     W     B     W    
## 6    OH   15055204 /    1686     1687      N:3   W     B     W     B    
##      V8    V9   V10 player_id
## 1 W     B     W             1
## 2 B     W     B             2
## 3 W     B     W             3
## 4 W     B     B             4
## 5 B     W     B             5
## 6 B     W     B             6

# select player id and all other columns from player_details_df
player_details_df <- select(player_details_df, player_id, everything())
head(player_details_df)

##   player_id     V1                                V2    V3    V4    V5
## 1         1     1   GARY HUA                         6.0   W  39 W  21
## 2         2     2   DAKSHESH DARURI                  6.0   W  63 W  58
## 3         3     3   ADITYA BAJAJ                     6.0   L   8 W  61
## 4         4     4   PATRICK H SCHILLING              5.5   W  23 D  28
## 5         5     5   HANSHI ZUO                       5.5   W  45 W  37
## 6         6     6   HANSEN SONG                      5.0   W  34 D  29
##      V6    V7    V8    V9   V10
## 1 W  18 W  14 W   7 D  12 D   4
## 2 L   4 W  17 W  16 W  20 W   7
## 3 W  25 W  21 W  11 W  13 W  12
## 4 W   2 W  26 D   5 W  19 D   1
## 5 D  12 D  13 D   4 W  14 W  17
## 6 L  11 W  35 D  10 W  27 W  21

# now select player_id, state, pre rating and post rating from my_rating
player_ratings_df <- select(my_rating, player_id, V1, PRERATING, POSTRATING)
head(player_ratings_df)

##   player_id     V1 PRERATING POSTRATING
## 1         1    ON    1794     1817     
## 2         2    MI    1553     1663     
## 3         3    MI    1384     1640     
## 4         4    MI    1716     1744     
## 5         5    MI    1655     1690     
## 6         6    OH    1686     1687

# merge player_details_df and player_ratings_df by player_id
player_records <- merge(player_details_df, player_ratings_df, by="player_id")
head(player_records)

##   player_id   V1.x                                V2    V3    V4    V5
## 1         1     1   GARY HUA                         6.0   W  39 W  21
## 2         2     2   DAKSHESH DARURI                  6.0   W  63 W  58
## 3         3     3   ADITYA BAJAJ                     6.0   L   8 W  61
## 4         4     4   PATRICK H SCHILLING              5.5   W  23 D  28
## 5         5     5   HANSHI ZUO                       5.5   W  45 W  37
## 6         6     6   HANSEN SONG                      5.0   W  34 D  29
##      V6    V7    V8    V9   V10   V1.y PRERATING POSTRATING
## 1 W  18 W  14 W   7 D  12 D   4    ON    1794     1817     
## 2 L   4 W  17 W  16 W  20 W   7    MI    1553     1663     
## 3 W  25 W  21 W  11 W  13 W  12    MI    1384     1640     
## 4 W   2 W  26 D   5 W  19 D   1    MI    1716     1744     
## 5 D  12 D  13 D   4 W  14 W  17    MI    1655     1690     
## 6 L  11 W  35 D  10 W  27 W  21    OH    1686     1687

# drop second column and POSTRATING
## Also from PRERATING column, extract the number before alphabet if any
player_records <- select(player_records, 1:13, -2) %>% 
  mutate(PRERATING = str_extract(PRERATING, "[:digit:]+"))
head(player_records)

##   player_id                                V2    V3    V4    V5    V6
## 1         1  GARY HUA                         6.0   W  39 W  21 W  18
## 2         2  DAKSHESH DARURI                  6.0   W  63 W  58 L   4
## 3         3  ADITYA BAJAJ                     6.0   L   8 W  61 W  25
## 4         4  PATRICK H SCHILLING              5.5   W  23 D  28 W   2
## 5         5  HANSHI ZUO                       5.5   W  45 W  37 D  12
## 6         6  HANSEN SONG                      5.0   W  34 D  29 L  11
##      V7    V8    V9   V10   V1.y PRERATING
## 1 W  14 W   7 D  12 D   4    ON       1794
## 2 W  17 W  16 W  20 W   7    MI       1553
## 3 W  21 W  11 W  13 W  12    MI       1384
## 4 W  26 D   5 W  19 D   1    MI       1716
## 5 D  13 D   4 W  14 W  17    MI       1655
## 6 W  35 D  10 W  27 W  21    OH       1686

# rename columns
colnames(player_records) <- c("PLAYER_ID", "PLAYER_NAME", "TOTAL_POINTS", "MATCH1", "MATCH2", "MATCH3", "MATCH4", "MATCH5", "MATCH6", "MATCH7", "STATE", "PRE_RATING")

head(player_records)

##   PLAYER_ID                       PLAYER_NAME TOTAL_POINTS MATCH1 MATCH2
## 1         1  GARY HUA                                6.0    W  39  W  21
## 2         2  DAKSHESH DARURI                         6.0    W  63  W  58
## 3         3  ADITYA BAJAJ                            6.0    L   8  W  61
## 4         4  PATRICK H SCHILLING                     5.5    W  23  D  28
## 5         5  HANSHI ZUO                              5.5    W  45  W  37
## 6         6  HANSEN SONG                             5.0    W  34  D  29
##   MATCH3 MATCH4 MATCH5 MATCH6 MATCH7  STATE PRE_RATING
## 1  W  18  W  14  W   7  D  12  D   4    ON        1794
## 2  L   4  W  17  W  16  W  20  W   7    MI        1553
## 3  W  25  W  21  W  11  W  13  W  12    MI        1384
## 4  W   2  W  26  D   5  W  19  D   1    MI        1716
## 5  D  12  D  13  D   4  W  14  W  17    MI        1655
## 6  L  11  W  35  D  10  W  27  W  21    OH        1686

str(player_records)

## 'data.frame':    64 obs. of  12 variables:
##  $ PLAYER_ID   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ PLAYER_NAME : Factor w/ 131 levels ""," 10131499 / R: 1610   ->1618     ",..: 89 77 66 116 93 92 88 86 125 70 ...
##  $ TOTAL_POINTS: Factor w/ 18 levels "","     "," Pts ",..: 14 14 14 13 13 12 12 12 12 12 ...
##  $ MATCH1      : Factor w/ 67 levels "","     ","  1  ",..: 51 66 18 43 54 48 62 41 45 8 ...
##  $ MATCH2      : Factor w/ 69 levels "","     ","  2  ",..: 47 66 68 8 53 9 60 50 27 28 ...
##  $ MATCH3      : Factor w/ 68 levels "","     ","  3  ",..: 47 14 50 42 6 18 45 19 67 64 ...
##  $ MATCH4      : Factor w/ 69 levels "","     ","  4  ",..: 47 49 50 52 6 56 46 23 45 54 ...
##  $ MATCH5      : Factor w/ 61 levels "","     ","  5  ",..: 40 42 41 6 5 8 16 51 45 7 ...
##  $ MATCH6      : Factor w/ 66 levels "","     ","  6  ",..: 6 47 44 46 45 50 43 51 21 49 ...
##  $ MATCH7      : Factor w/ 61 levels "","     ","  7  ",..: 6 37 38 5 39 43 9 41 42 40 ...
##  $ STATE       : Factor w/ 70 levels "    1 ","    2 ",..: 67 65 65 65 65 66 65 65 67 65 ...
##  $ PRE_RATING  : chr  "1794" "1553" "1384" "1716" ...

In the next few steps I used the gather function to expand the dataframe in number of rows by having multiple rows for one player id. Then arranged the rows by player_id and then used seperate function to fork a single column into multple column(s). In this scenario we broke each MATCH_RESULT as two columns RESULT and OPPONENT_ID and in last using transform I coonvert the data type of few columns.

# convert column in multiple rows using gather function
# in this case, multiple rows for given player created from  
# Match1 to Match7 and their respective values

player_records1 <- gather(player_records, "MATCHES", "MATCH_RESULT", 4:10)

## Warning: attributes are not identical across measure variables;
## they will be dropped

head(player_records1)

##   PLAYER_ID                       PLAYER_NAME TOTAL_POINTS  STATE
## 1         1  GARY HUA                                6.0      ON 
## 2         2  DAKSHESH DARURI                         6.0      MI 
## 3         3  ADITYA BAJAJ                            6.0      MI 
## 4         4  PATRICK H SCHILLING                     5.5      MI 
## 5         5  HANSHI ZUO                              5.5      MI 
## 6         6  HANSEN SONG                             5.0      OH 
##   PRE_RATING MATCHES MATCH_RESULT
## 1       1794  MATCH1        W  39
## 2       1553  MATCH1        W  63
## 3       1384  MATCH1        L   8
## 4       1716  MATCH1        W  23
## 5       1655  MATCH1        W  45
## 6       1686  MATCH1        W  34

# arrange records by player_id
player_records1 <- arrange(player_records1, PLAYER_ID)
head(player_records1)

##   PLAYER_ID                       PLAYER_NAME TOTAL_POINTS  STATE
## 1         1  GARY HUA                                6.0      ON 
## 2         1  GARY HUA                                6.0      ON 
## 3         1  GARY HUA                                6.0      ON 
## 4         1  GARY HUA                                6.0      ON 
## 5         1  GARY HUA                                6.0      ON 
## 6         1  GARY HUA                                6.0      ON 
##   PRE_RATING MATCHES MATCH_RESULT
## 1       1794  MATCH1        W  39
## 2       1794  MATCH2        W  21
## 3       1794  MATCH3        W  18
## 4       1794  MATCH4        W  14
## 5       1794  MATCH5        W   7
## 6       1794  MATCH6        D  12

# separate MATCH_RESULT as two columns RESULT and OPPONENT_ID
player_records1 <- separate(player_records1, MATCH_RESULT, c("RESULT", "OPPONENT_ID"))
head(player_records1)

##   PLAYER_ID                       PLAYER_NAME TOTAL_POINTS  STATE
## 1         1  GARY HUA                                6.0      ON 
## 2         1  GARY HUA                                6.0      ON 
## 3         1  GARY HUA                                6.0      ON 
## 4         1  GARY HUA                                6.0      ON 
## 5         1  GARY HUA                                6.0      ON 
## 6         1  GARY HUA                                6.0      ON 
##   PRE_RATING MATCHES RESULT OPPONENT_ID
## 1       1794  MATCH1      W          39
## 2       1794  MATCH2      W          21
## 3       1794  MATCH3      W          18
## 4       1794  MATCH4      W          14
## 5       1794  MATCH5      W           7
## 6       1794  MATCH6      D          12

# transform OPPONENT_ID and PRE_RATING as integer
player_records1 <- transform(player_records1, 
                            OPPONENT_ID = as.integer(OPPONENT_ID), 
                            PRE_RATING = as.integer(PRE_RATING))

head(player_records1)

##   PLAYER_ID                       PLAYER_NAME TOTAL_POINTS  STATE
## 1         1  GARY HUA                                6.0      ON 
## 2         1  GARY HUA                                6.0      ON 
## 3         1  GARY HUA                                6.0      ON 
## 4         1  GARY HUA                                6.0      ON 
## 5         1  GARY HUA                                6.0      ON 
## 6         1  GARY HUA                                6.0      ON 
##   PRE_RATING MATCHES RESULT OPPONENT_ID
## 1       1794  MATCH1      W          39
## 2       1794  MATCH2      W          21
## 3       1794  MATCH3      W          18
## 4       1794  MATCH4      W          14
## 5       1794  MATCH5      W           7
## 6       1794  MATCH6      D          12

At last I use sqldf library to run sql queries against our sqldf dataframe to have new dataframe constructed from our datafranme; with the details player_id and average opponent rating having group by on player_id. It gives one row per player_id with average of all the opponent pre ratings. Using the merge function then I merge the original dataframe with the one using sqldf.

# get opponent rating
opp_prerating_df <- sqldf("SELECT A.PLAYER_ID, (SELECT PRE_RATING FROM player_records1 B WHERE B.PLAYER_ID = A.OPPONENT_ID) OPPONENT_RATING FROM player_records1 A")
glimpse(opp_prerating_df)

## Observations: 448
## Variables: 2
## $ PLAYER_ID       <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, …
## $ OPPONENT_RATING <int> 1436, 1563, 1600, 1610, 1649, 1663, 1716, 1175, …

# compute average oppnent rating and round off the results group by player_id
opp_prerating_df <- sqldf("SELECT A.PLAYER_ID, ROUND(AVG(OPPONENT_RATING)) AVG_OPPONENT_RATING FROM opp_prerating_df A GROUP BY PLAYER_ID")
head(opp_prerating_df)

##   PLAYER_ID AVG_OPPONENT_RATING
## 1         1                1605
## 2         2                1469
## 3         3                1564
## 4         4                1574
## 5         5                1501
## 6         6                1519

# merge avg opponent rating with original data frame
merged_player_records <- merge(player_records, opp_prerating_df, by="PLAYER_ID")
head(merged_player_records)

##   PLAYER_ID                       PLAYER_NAME TOTAL_POINTS MATCH1 MATCH2
## 1         1  GARY HUA                                6.0    W  39  W  21
## 2         2  DAKSHESH DARURI                         6.0    W  63  W  58
## 3         3  ADITYA BAJAJ                            6.0    L   8  W  61
## 4         4  PATRICK H SCHILLING                     5.5    W  23  D  28
## 5         5  HANSHI ZUO                              5.5    W  45  W  37
## 6         6  HANSEN SONG                             5.0    W  34  D  29
##   MATCH3 MATCH4 MATCH5 MATCH6 MATCH7  STATE PRE_RATING AVG_OPPONENT_RATING
## 1  W  18  W  14  W   7  D  12  D   4    ON        1794                1605
## 2  L   4  W  17  W  16  W  20  W   7    MI        1553                1469
## 3  W  25  W  21  W  11  W  13  W  12    MI        1384                1564
## 4  W   2  W  26  D   5  W  19  D   1    MI        1716                1574
## 5  D  12  D  13  D   4  W  14  W  17    MI        1655                1501
## 6  L  11  W  35  D  10  W  27  W  21    OH        1686                1519

Final Output

To get the final output results and format, first I select all desired columns. Then I swap couple of columns and renamed all as per output needed. Finally I write the output in a .607-Project1.csv file.

# select desired coluns
desired_player_records <- select(merged_player_records, 2,3,11,12,13)
head(desired_player_records)

##                         PLAYER_NAME TOTAL_POINTS  STATE PRE_RATING
## 1  GARY HUA                                6.0      ON        1794
## 2  DAKSHESH DARURI                         6.0      MI        1553
## 3  ADITYA BAJAJ                            6.0      MI        1384
## 4  PATRICK H SCHILLING                     5.5      MI        1716
## 5  HANSHI ZUO                              5.5      MI        1655
## 6  HANSEN SONG                             5.0      OH        1686
##   AVG_OPPONENT_RATING
## 1                1605
## 2                1469
## 3                1564
## 4                1574
## 5                1501
## 6                1519

# swap columns as per desired end results
# state as second column and total_points as third
desired_player_records <- desired_player_records[, c(1,3,2,4,5)]

# rename columns as desired
colnames(desired_player_records) <- c("Players Name", "Player's State", "Total Number of Points", "Player's Pre-Rating", "Average Pre Chess Rating of Opponents")

head(desired_player_records)

##                        Players Name Player's State Total Number of Points
## 1  GARY HUA                                    ON                   6.0  
## 2  DAKSHESH DARURI                             MI                   6.0  
## 3  ADITYA BAJAJ                                MI                   6.0  
## 4  PATRICK H SCHILLING                         MI                   5.5  
## 5  HANSHI ZUO                                  MI                   5.5  
## 6  HANSEN SONG                                 OH                   5.0  
##   Player's Pre-Rating Average Pre Chess Rating of Opponents
## 1                1794                                  1605
## 2                1553                                  1469
## 3                1384                                  1564
## 4                1716                                  1574
## 5                1655                                  1501
## 6                1686                                  1519

# write to csv as 607-Project1.csv
write.csv(desired_player_records, "607-Project1.csv", row.names = F)

Summary

This exercise really helped me to understand and use various R packages e.g. dplyr, tidyr, stringr and sqldf using which I manipulated the dataframes and achieved the target.

Data 607 - Assignment 4