This project is to process a text file of chess tournament results as shown below. The tournament results follow some structure and below is the glimpse of data from the given file.
The goal of this project is to generate a .CSV file (which could for example be imported into a SQL database) with the following information of the players mentioned in the chess tournament:
For the first player listed aboved, Gary Hua, his information to be eventually calculated and exported as:
Player’s Name | Player’s State | Total Number of Points | Player’s Pre-Rating | Average Pre Chess Rating of Opponents |
---|---|---|---|---|
Gary Hua | ON | 6.0 | 1794 | 1605 |
The following R packages are required for this project:
Looking at the file, we can see that the file has a header with two rows:
The file has a fixed structure having 2 rows for a single player. First line contains Player’s Number, Player’s Name, Total Points and Round 1-7 results. Each round results has game result longwith opponent player number.
Second line contains Player’s State, USCF ID, Pre Rating and Post Rating.
Note: The combination of two lines forms a complete record for a single player.
First load the txt using the read delim method with sep=‘|’ and get the data loaded in dataframe. Use select function from tidyr package to select the dataframe and limit the columns we need. Using Filter function of dplyr package I filter out the rows not needed, and then construct 2 different data frames from our original data frame. One containing the players playing details having (Player name, Total Points, Matches played) and an additional fields names player id as primary key. Second dataframe contains the remaining information for that player e.g. (State, USCF-ID, prerating, postrating again player_id as primaey key.
# Get the data from tournamentinfo.txt
theLink <- "https://raw.githubusercontent.com/amit-kapoor/data607/master/tournamentinfo.txt"
chess_df <- read.delim(file = theLink, header = FALSE, sep="|")
glimpse(chess_df)
## Observations: 196
## Variables: 11
## $ V1 <fct> ------------------------------------------------------------…
## $ V2 <fct> , Player Name , USCF ID / Rtg (Pre->Po…
## $ V3 <fct> , Total, Pts , , 6.0 , N:2 , , 6.0 , N:2 , , 6.0 , N:2…
## $ V4 <fct> , Round, 1 , , W 39, W , , W 63, B , , L 8, W …
## $ V5 <fct> , Round, 2 , , W 21, B , , W 58, W , , W 61, B …
## $ V6 <fct> , Round, 3 , , W 18, W , , L 4, B , , W 25, W …
## $ V7 <fct> , Round, 4 , , W 14, B , , W 17, W , , W 21, B …
## $ V8 <fct> , Round, 5 , , W 7, W , , W 16, B , , W 11, W …
## $ V9 <fct> , Round, 6 , , D 12, B , , W 20, W , , W 13, B …
## $ V10 <fct> , Round, 7 , , D 4, W , , W 7, B , , W 12, W …
## $ V11 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
# filter out the rows which are not needed
my_df <- select(chess_df, 1:10)
my_df <- filter(my_df, !grepl(pattern = "[-]+", V1)) %>%
filter(row_number() > 2)
head(my_df)
## V1 V2 V3 V4 V5 V6 V7
## 1 1 GARY HUA 6.0 W 39 W 21 W 18 W 14
## 2 ON 15445895 / R: 1794 ->1817 N:2 W B W B
## 3 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17
## 4 MI 14598900 / R: 1553 ->1663 N:2 B W B W
## 5 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25 W 21
## 6 MI 14959604 / R: 1384 ->1640 N:2 W B W B
## V8 V9 V10
## 1 W 7 D 12 D 4
## 2 W B W
## 3 W 16 W 20 W 7
## 4 B W B
## 5 W 11 W 13 W 12
## 6 W B W
# create the first data frame with the first row of every player details
player_details_df <- my_df %>%
filter(grepl(pattern = "[[:digit:]]", V1)) %>%
mutate(player_id = row_number())
head(player_details_df)
## V1 V2 V3 V4 V5 V6 V7
## 1 1 GARY HUA 6.0 W 39 W 21 W 18 W 14
## 2 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17
## 3 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25 W 21
## 4 4 PATRICK H SCHILLING 5.5 W 23 D 28 W 2 W 26
## 5 5 HANSHI ZUO 5.5 W 45 W 37 D 12 D 13
## 6 6 HANSEN SONG 5.0 W 34 D 29 L 11 W 35
## V8 V9 V10 player_id
## 1 W 7 D 12 D 4 1
## 2 W 16 W 20 W 7 2
## 3 W 11 W 13 W 12 3
## 4 D 5 W 19 D 1 4
## 5 D 4 W 14 W 17 5
## 6 D 10 W 27 W 21 6
# create the second data frame with the second row of every player details
player_ratings_df <- my_df %>%
filter(grepl(pattern = "[[:alpha:]]", V1)) %>%
mutate(player_id = row_number())
head(player_ratings_df)
## V1 V2 V3 V4 V5 V6 V7
## 1 ON 15445895 / R: 1794 ->1817 N:2 W B W B
## 2 MI 14598900 / R: 1553 ->1663 N:2 B W B W
## 3 MI 14959604 / R: 1384 ->1640 N:2 W B W B
## 4 MI 12616049 / R: 1716 ->1744 N:2 W B W B
## 5 MI 14601533 / R: 1655 ->1690 N:2 B W B W
## 6 OH 15055204 / R: 1686 ->1687 N:3 W B W B
## V8 V9 V10 player_id
## 1 W B W 1
## 2 B W B 2
## 3 W B W 3
## 4 W B B 4
## 5 B W B 5
## 6 B W B 6
Next step is to seperate the joint values e.g. to separate V2 column in two columns USCF-ID and Rating and then again seperating out PRE-RATING, POST-RATING for every player. Then I rearrange the two data frames by using select and having player_id as first column in both the new dataframes. Then used the merge function of dplyr to merge both the dataframes into one dataframe. Next step I use select, mutate and str_extract functions to clean the value of PRERATING for each row and then rename columns.
# to separate V2 column in two columns USCF-ID and Rating
my_rating <- separate(player_ratings_df, V2, c("USCF-ID", "Rating"), sep="R: ")
head(my_rating)
## V1 USCF-ID Rating V3 V4 V5 V6 V7
## 1 ON 15445895 / 1794 ->1817 N:2 W B W B
## 2 MI 14598900 / 1553 ->1663 N:2 B W B W
## 3 MI 14959604 / 1384 ->1640 N:2 W B W B
## 4 MI 12616049 / 1716 ->1744 N:2 W B W B
## 5 MI 14601533 / 1655 ->1690 N:2 B W B W
## 6 OH 15055204 / 1686 ->1687 N:3 W B W B
## V8 V9 V10 player_id
## 1 W B W 1
## 2 B W B 2
## 3 W B W 3
## 4 W B B 4
## 5 B W B 5
## 6 B W B 6
# separate column Rating in two columns PRE-RATING, POST-RATING
my_rating <- separate(my_rating, Rating, c("PRERATING", "POSTRATING"), sep="->")
head(my_rating)
## V1 USCF-ID PRERATING POSTRATING V3 V4 V5 V6 V7
## 1 ON 15445895 / 1794 1817 N:2 W B W B
## 2 MI 14598900 / 1553 1663 N:2 B W B W
## 3 MI 14959604 / 1384 1640 N:2 W B W B
## 4 MI 12616049 / 1716 1744 N:2 W B W B
## 5 MI 14601533 / 1655 1690 N:2 B W B W
## 6 OH 15055204 / 1686 1687 N:3 W B W B
## V8 V9 V10 player_id
## 1 W B W 1
## 2 B W B 2
## 3 W B W 3
## 4 W B B 4
## 5 B W B 5
## 6 B W B 6
# select player id and all other columns from player_details_df
player_details_df <- select(player_details_df, player_id, everything())
head(player_details_df)
## player_id V1 V2 V3 V4 V5
## 1 1 1 GARY HUA 6.0 W 39 W 21
## 2 2 2 DAKSHESH DARURI 6.0 W 63 W 58
## 3 3 3 ADITYA BAJAJ 6.0 L 8 W 61
## 4 4 4 PATRICK H SCHILLING 5.5 W 23 D 28
## 5 5 5 HANSHI ZUO 5.5 W 45 W 37
## 6 6 6 HANSEN SONG 5.0 W 34 D 29
## V6 V7 V8 V9 V10
## 1 W 18 W 14 W 7 D 12 D 4
## 2 L 4 W 17 W 16 W 20 W 7
## 3 W 25 W 21 W 11 W 13 W 12
## 4 W 2 W 26 D 5 W 19 D 1
## 5 D 12 D 13 D 4 W 14 W 17
## 6 L 11 W 35 D 10 W 27 W 21
# now select player_id, state, pre rating and post rating from my_rating
player_ratings_df <- select(my_rating, player_id, V1, PRERATING, POSTRATING)
head(player_ratings_df)
## player_id V1 PRERATING POSTRATING
## 1 1 ON 1794 1817
## 2 2 MI 1553 1663
## 3 3 MI 1384 1640
## 4 4 MI 1716 1744
## 5 5 MI 1655 1690
## 6 6 OH 1686 1687
# merge player_details_df and player_ratings_df by player_id
player_records <- merge(player_details_df, player_ratings_df, by="player_id")
head(player_records)
## player_id V1.x V2 V3 V4 V5
## 1 1 1 GARY HUA 6.0 W 39 W 21
## 2 2 2 DAKSHESH DARURI 6.0 W 63 W 58
## 3 3 3 ADITYA BAJAJ 6.0 L 8 W 61
## 4 4 4 PATRICK H SCHILLING 5.5 W 23 D 28
## 5 5 5 HANSHI ZUO 5.5 W 45 W 37
## 6 6 6 HANSEN SONG 5.0 W 34 D 29
## V6 V7 V8 V9 V10 V1.y PRERATING POSTRATING
## 1 W 18 W 14 W 7 D 12 D 4 ON 1794 1817
## 2 L 4 W 17 W 16 W 20 W 7 MI 1553 1663
## 3 W 25 W 21 W 11 W 13 W 12 MI 1384 1640
## 4 W 2 W 26 D 5 W 19 D 1 MI 1716 1744
## 5 D 12 D 13 D 4 W 14 W 17 MI 1655 1690
## 6 L 11 W 35 D 10 W 27 W 21 OH 1686 1687
# drop second column and POSTRATING
## Also from PRERATING column, extract the number before alphabet if any
player_records <- select(player_records, 1:13, -2) %>%
mutate(PRERATING = str_extract(PRERATING, "[:digit:]+"))
head(player_records)
## player_id V2 V3 V4 V5 V6
## 1 1 GARY HUA 6.0 W 39 W 21 W 18
## 2 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4
## 3 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25
## 4 4 PATRICK H SCHILLING 5.5 W 23 D 28 W 2
## 5 5 HANSHI ZUO 5.5 W 45 W 37 D 12
## 6 6 HANSEN SONG 5.0 W 34 D 29 L 11
## V7 V8 V9 V10 V1.y PRERATING
## 1 W 14 W 7 D 12 D 4 ON 1794
## 2 W 17 W 16 W 20 W 7 MI 1553
## 3 W 21 W 11 W 13 W 12 MI 1384
## 4 W 26 D 5 W 19 D 1 MI 1716
## 5 D 13 D 4 W 14 W 17 MI 1655
## 6 W 35 D 10 W 27 W 21 OH 1686
# rename columns
colnames(player_records) <- c("PLAYER_ID", "PLAYER_NAME", "TOTAL_POINTS", "MATCH1", "MATCH2", "MATCH3", "MATCH4", "MATCH5", "MATCH6", "MATCH7", "STATE", "PRE_RATING")
head(player_records)
## PLAYER_ID PLAYER_NAME TOTAL_POINTS MATCH1 MATCH2
## 1 1 GARY HUA 6.0 W 39 W 21
## 2 2 DAKSHESH DARURI 6.0 W 63 W 58
## 3 3 ADITYA BAJAJ 6.0 L 8 W 61
## 4 4 PATRICK H SCHILLING 5.5 W 23 D 28
## 5 5 HANSHI ZUO 5.5 W 45 W 37
## 6 6 HANSEN SONG 5.0 W 34 D 29
## MATCH3 MATCH4 MATCH5 MATCH6 MATCH7 STATE PRE_RATING
## 1 W 18 W 14 W 7 D 12 D 4 ON 1794
## 2 L 4 W 17 W 16 W 20 W 7 MI 1553
## 3 W 25 W 21 W 11 W 13 W 12 MI 1384
## 4 W 2 W 26 D 5 W 19 D 1 MI 1716
## 5 D 12 D 13 D 4 W 14 W 17 MI 1655
## 6 L 11 W 35 D 10 W 27 W 21 OH 1686
str(player_records)
## 'data.frame': 64 obs. of 12 variables:
## $ PLAYER_ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ PLAYER_NAME : Factor w/ 131 levels ""," 10131499 / R: 1610 ->1618 ",..: 89 77 66 116 93 92 88 86 125 70 ...
## $ TOTAL_POINTS: Factor w/ 18 levels ""," "," Pts ",..: 14 14 14 13 13 12 12 12 12 12 ...
## $ MATCH1 : Factor w/ 67 levels ""," "," 1 ",..: 51 66 18 43 54 48 62 41 45 8 ...
## $ MATCH2 : Factor w/ 69 levels ""," "," 2 ",..: 47 66 68 8 53 9 60 50 27 28 ...
## $ MATCH3 : Factor w/ 68 levels ""," "," 3 ",..: 47 14 50 42 6 18 45 19 67 64 ...
## $ MATCH4 : Factor w/ 69 levels ""," "," 4 ",..: 47 49 50 52 6 56 46 23 45 54 ...
## $ MATCH5 : Factor w/ 61 levels ""," "," 5 ",..: 40 42 41 6 5 8 16 51 45 7 ...
## $ MATCH6 : Factor w/ 66 levels ""," "," 6 ",..: 6 47 44 46 45 50 43 51 21 49 ...
## $ MATCH7 : Factor w/ 61 levels ""," "," 7 ",..: 6 37 38 5 39 43 9 41 42 40 ...
## $ STATE : Factor w/ 70 levels " 1 "," 2 ",..: 67 65 65 65 65 66 65 65 67 65 ...
## $ PRE_RATING : chr "1794" "1553" "1384" "1716" ...
In the next few steps I used the gather function to expand the dataframe in number of rows by having multiple rows for one player id. Then arranged the rows by player_id and then used seperate function to fork a single column into multple column(s). In this scenario we broke each MATCH_RESULT as two columns RESULT and OPPONENT_ID and in last using transform I coonvert the data type of few columns.
# convert column in multiple rows using gather function
# in this case, multiple rows for given player created from
# Match1 to Match7 and their respective values
player_records1 <- gather(player_records, "MATCHES", "MATCH_RESULT", 4:10)
## Warning: attributes are not identical across measure variables;
## they will be dropped
head(player_records1)
## PLAYER_ID PLAYER_NAME TOTAL_POINTS STATE
## 1 1 GARY HUA 6.0 ON
## 2 2 DAKSHESH DARURI 6.0 MI
## 3 3 ADITYA BAJAJ 6.0 MI
## 4 4 PATRICK H SCHILLING 5.5 MI
## 5 5 HANSHI ZUO 5.5 MI
## 6 6 HANSEN SONG 5.0 OH
## PRE_RATING MATCHES MATCH_RESULT
## 1 1794 MATCH1 W 39
## 2 1553 MATCH1 W 63
## 3 1384 MATCH1 L 8
## 4 1716 MATCH1 W 23
## 5 1655 MATCH1 W 45
## 6 1686 MATCH1 W 34
# arrange records by player_id
player_records1 <- arrange(player_records1, PLAYER_ID)
head(player_records1)
## PLAYER_ID PLAYER_NAME TOTAL_POINTS STATE
## 1 1 GARY HUA 6.0 ON
## 2 1 GARY HUA 6.0 ON
## 3 1 GARY HUA 6.0 ON
## 4 1 GARY HUA 6.0 ON
## 5 1 GARY HUA 6.0 ON
## 6 1 GARY HUA 6.0 ON
## PRE_RATING MATCHES MATCH_RESULT
## 1 1794 MATCH1 W 39
## 2 1794 MATCH2 W 21
## 3 1794 MATCH3 W 18
## 4 1794 MATCH4 W 14
## 5 1794 MATCH5 W 7
## 6 1794 MATCH6 D 12
# separate MATCH_RESULT as two columns RESULT and OPPONENT_ID
player_records1 <- separate(player_records1, MATCH_RESULT, c("RESULT", "OPPONENT_ID"))
head(player_records1)
## PLAYER_ID PLAYER_NAME TOTAL_POINTS STATE
## 1 1 GARY HUA 6.0 ON
## 2 1 GARY HUA 6.0 ON
## 3 1 GARY HUA 6.0 ON
## 4 1 GARY HUA 6.0 ON
## 5 1 GARY HUA 6.0 ON
## 6 1 GARY HUA 6.0 ON
## PRE_RATING MATCHES RESULT OPPONENT_ID
## 1 1794 MATCH1 W 39
## 2 1794 MATCH2 W 21
## 3 1794 MATCH3 W 18
## 4 1794 MATCH4 W 14
## 5 1794 MATCH5 W 7
## 6 1794 MATCH6 D 12
# transform OPPONENT_ID and PRE_RATING as integer
player_records1 <- transform(player_records1,
OPPONENT_ID = as.integer(OPPONENT_ID),
PRE_RATING = as.integer(PRE_RATING))
head(player_records1)
## PLAYER_ID PLAYER_NAME TOTAL_POINTS STATE
## 1 1 GARY HUA 6.0 ON
## 2 1 GARY HUA 6.0 ON
## 3 1 GARY HUA 6.0 ON
## 4 1 GARY HUA 6.0 ON
## 5 1 GARY HUA 6.0 ON
## 6 1 GARY HUA 6.0 ON
## PRE_RATING MATCHES RESULT OPPONENT_ID
## 1 1794 MATCH1 W 39
## 2 1794 MATCH2 W 21
## 3 1794 MATCH3 W 18
## 4 1794 MATCH4 W 14
## 5 1794 MATCH5 W 7
## 6 1794 MATCH6 D 12
At last I use sqldf library to run sql queries against our sqldf dataframe to have new dataframe constructed from our datafranme; with the details player_id and average opponent rating having group by on player_id. It gives one row per player_id with average of all the opponent pre ratings. Using the merge function then I merge the original dataframe with the one using sqldf.
# get opponent rating
opp_prerating_df <- sqldf("SELECT A.PLAYER_ID, (SELECT PRE_RATING FROM player_records1 B WHERE B.PLAYER_ID = A.OPPONENT_ID) OPPONENT_RATING FROM player_records1 A")
glimpse(opp_prerating_df)
## Observations: 448
## Variables: 2
## $ PLAYER_ID <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, …
## $ OPPONENT_RATING <int> 1436, 1563, 1600, 1610, 1649, 1663, 1716, 1175, …
# compute average oppnent rating and round off the results group by player_id
opp_prerating_df <- sqldf("SELECT A.PLAYER_ID, ROUND(AVG(OPPONENT_RATING)) AVG_OPPONENT_RATING FROM opp_prerating_df A GROUP BY PLAYER_ID")
head(opp_prerating_df)
## PLAYER_ID AVG_OPPONENT_RATING
## 1 1 1605
## 2 2 1469
## 3 3 1564
## 4 4 1574
## 5 5 1501
## 6 6 1519
# merge avg opponent rating with original data frame
merged_player_records <- merge(player_records, opp_prerating_df, by="PLAYER_ID")
head(merged_player_records)
## PLAYER_ID PLAYER_NAME TOTAL_POINTS MATCH1 MATCH2
## 1 1 GARY HUA 6.0 W 39 W 21
## 2 2 DAKSHESH DARURI 6.0 W 63 W 58
## 3 3 ADITYA BAJAJ 6.0 L 8 W 61
## 4 4 PATRICK H SCHILLING 5.5 W 23 D 28
## 5 5 HANSHI ZUO 5.5 W 45 W 37
## 6 6 HANSEN SONG 5.0 W 34 D 29
## MATCH3 MATCH4 MATCH5 MATCH6 MATCH7 STATE PRE_RATING AVG_OPPONENT_RATING
## 1 W 18 W 14 W 7 D 12 D 4 ON 1794 1605
## 2 L 4 W 17 W 16 W 20 W 7 MI 1553 1469
## 3 W 25 W 21 W 11 W 13 W 12 MI 1384 1564
## 4 W 2 W 26 D 5 W 19 D 1 MI 1716 1574
## 5 D 12 D 13 D 4 W 14 W 17 MI 1655 1501
## 6 L 11 W 35 D 10 W 27 W 21 OH 1686 1519
To get the final output results and format, first I select all desired columns. Then I swap couple of columns and renamed all as per output needed. Finally I write the output in a .607-Project1.csv file.
# select desired coluns
desired_player_records <- select(merged_player_records, 2,3,11,12,13)
head(desired_player_records)
## PLAYER_NAME TOTAL_POINTS STATE PRE_RATING
## 1 GARY HUA 6.0 ON 1794
## 2 DAKSHESH DARURI 6.0 MI 1553
## 3 ADITYA BAJAJ 6.0 MI 1384
## 4 PATRICK H SCHILLING 5.5 MI 1716
## 5 HANSHI ZUO 5.5 MI 1655
## 6 HANSEN SONG 5.0 OH 1686
## AVG_OPPONENT_RATING
## 1 1605
## 2 1469
## 3 1564
## 4 1574
## 5 1501
## 6 1519
# swap columns as per desired end results
# state as second column and total_points as third
desired_player_records <- desired_player_records[, c(1,3,2,4,5)]
# rename columns as desired
colnames(desired_player_records) <- c("Players Name", "Player's State", "Total Number of Points", "Player's Pre-Rating", "Average Pre Chess Rating of Opponents")
head(desired_player_records)
## Players Name Player's State Total Number of Points
## 1 GARY HUA ON 6.0
## 2 DAKSHESH DARURI MI 6.0
## 3 ADITYA BAJAJ MI 6.0
## 4 PATRICK H SCHILLING MI 5.5
## 5 HANSHI ZUO MI 5.5
## 6 HANSEN SONG OH 5.0
## Player's Pre-Rating Average Pre Chess Rating of Opponents
## 1 1794 1605
## 2 1553 1469
## 3 1384 1564
## 4 1716 1574
## 5 1655 1501
## 6 1686 1519
# write to csv as 607-Project1.csv
write.csv(desired_player_records, "607-Project1.csv", row.names = F)
This exercise really helped me to understand and use various R packages e.g. dplyr, tidyr, stringr and sqldf using which I manipulated the dataframes and achieved the target.