We will pull in the Chess Tournament Results table from the text file to begin manipulating the structure to create the proposed CSV file to be imported into a SQL database. Once loaded we can view the structure of the results and plan a way to extract the data for our CSV file. According to the Project 1 Assignment directions, we will need the following data elements: Player’s Name, Player’s State, Total Number of Points, Player’s Pre-Rating and Average Pre Chess Rating of Opponents.
## [1] "-----------------------------------------------------------------------------------------"
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [4] "-----------------------------------------------------------------------------------------"
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [7] "-----------------------------------------------------------------------------------------"
## [8] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [9] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [10] "-----------------------------------------------------------------------------------------"
This results table appears to be very complex in structure. The headers take up two lines, and the data is not easily discernible from the header information.
Looking at the first entry (Gary Hua), in the first column we have the players number above what appears to be the state abbreviation. In the second column the players name, USCF ID number and pre- and post ratings are listed. The third column contains the total points and another abbreviation and number with unexplained significance. The fourth through tenth columns contain information on the round, whether the person won, lost, or draw, and against which opponents.
There are some issues that we need to immediately address. The dashed lines need to be removed, which are created with hyphens.
## [1] ""
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num | USCF ID / Rtg (Pre>Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [4] ""
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 >1817 |N:2 |W |B |W |B |W |B |W |"
## [7] ""
## [8] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [9] " MI | 14598900 / R: 1553 >1663 |N:2 |B |W |B |W |B |W |B |"
## [10] ""
However, after running this code we see that it changes the symbol that they were using to show an increase in ratings from -> to just >. The -> also looks exactly the same as the assignment operator, so changing that first before removing the hyphens is probably a better idea.
## [1] ""
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num | USCF ID / Rtg (Pre=>Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [4] ""
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 =>1817 |N:2 |W |B |W |B |W |B |W |"
## [7] ""
## [8] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [9] " MI | 14598900 / R: 1553 =>1663 |N:2 |B |W |B |W |B |W |B |"
## [10] ""
We can now remove the first four (4) rows, since they contain the old headers and empty lines.
## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [2] " ON | 15445895 / R: 1794 =>1817 |N:2 |W |B |W |B |W |B |W |"
## [3] ""
## [4] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [5] " MI | 14598900 / R: 1553 =>1663 |N:2 |B |W |B |W |B |W |B |"
## [6] ""
## [7] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [8] " MI | 14959604 / R: 1384 =>1640 |N:2 |W |B |W |B |W |B |W |"
## [9] ""
## [10] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|"
The table is now structured in a way that we can remove every third line. and compress to two lines per chess player entry. This is done by applying a logical vector tho the data lines through the length of data.
## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [2] " ON | 15445895 / R: 1794 =>1817 |N:2 |W |B |W |B |W |B |W |"
## [3] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [4] " MI | 14598900 / R: 1553 =>1663 |N:2 |B |W |B |W |B |W |B |"
## [5] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
Now we can subset the data similar to the way we removed the empty row. We can subset using applying a logical vector to the data for the first and second row through the length of data.
## [1] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [2] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [3] " 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|"
## [4] " 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|"
## [5] " 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|"
## [1] " ON | 15445895 / R: 1794 =>1817 |N:2 |W |B |W |B |W |B |W |"
## [2] " MI | 14598900 / R: 1553 =>1663 |N:2 |B |W |B |W |B |W |B |"
## [3] " MI | 14959604 / R: 1384 =>1640 |N:2 |W |B |W |B |W |B |W |"
## [4] " MI | 12616049 / R: 1716 =>1744 |N:2 |W |B |W |B |W |B |B |"
## [5] " MI | 14601533 / R: 1655 =>1690 |N:2 |B |W |B |W |B |W |B |"
Now we can start the process of extracting the data into columns. We can see most data elements are separated by “|”, so we can use that to parse the columns. Let’s start with the lines that have the have the participants name. We can parse out player number, player name, total points and the results for each match.
## player_number player_name total_points round1 round2
## 1 1 GARY HUA 6.0 W 39 W 21
## 2 2 DAKSHESH DARURI 6.0 W 63 W 58
## 3 3 ADITYA BAJAJ 6.0 L 8 W 61
## 4 4 PATRICK H SCHILLING 5.5 W 23 D 28
## 5 5 HANSHI ZUO 5.5 W 45 W 37
## round3 round4 round5 round6 round7
## 1 W 18 W 14 W 7 D 12 D 4
## 2 L 4 W 17 W 16 W 20 W 7
## 3 W 25 W 21 W 11 W 13 W 12
## 4 W 2 W 26 D 5 W 19 D 1
## 5 D 12 D 13 D 4 W 14 W 17
The second line is a little more tricky. While we have the player state, the pre- and post- ratings are mixed in a string with the USCF id. There are a number of columns we simply won’t need for assignment. We split the strings based upon “|” as a delimiter, drop the unnecessary columns, and then split the USCF_id and player ratings using “:” and “=>”, respectively. We can then bind the columns into a single frame.
## player_state pre_rating post_rating
## 1 ON 1794 1817
## 2 MI 1553 1663
## 3 MI 1384 1640
## 4 MI 1716 1744
## 5 MI 1655 1690
Once we have parsed each of the columns to obtain our required data from both the first and second rows, we can now combine the two data frames into a single frame with the required data to continue with calculating the average pre-chess ratings of the opponents for each player.
## player_number player_name total_points round1 round2
## 1 1 GARY HUA 6.0 W 39 W 21
## 2 2 DAKSHESH DARURI 6.0 W 63 W 58
## 3 3 ADITYA BAJAJ 6.0 L 8 W 61
## 4 4 PATRICK H SCHILLING 5.5 W 23 D 28
## 5 5 HANSHI ZUO 5.5 W 45 W 37
## round3 round4 round5 round6 round7 player_state pre_rating post_rating
## 1 W 18 W 14 W 7 D 12 D 4 ON 1794 1817
## 2 L 4 W 17 W 16 W 20 W 7 MI 1553 1663
## 3 W 25 W 21 W 11 W 13 W 12 MI 1384 1640
## 4 W 2 W 26 D 5 W 19 D 1 MI 1716 1744
## 5 D 12 D 13 D 4 W 14 W 17 MI 1655 1690
No we can pull out opponent numbers from each of the matches, and then rename the column to indicate that it is the opponent number now represented.
## player_number player_name total_points opp_rnd1
## 1 1 GARY HUA 6.0 39
## 2 2 DAKSHESH DARURI 6.0 63
## 3 3 ADITYA BAJAJ 6.0 8
## 4 4 PATRICK H SCHILLING 5.5 23
## 5 5 HANSHI ZUO 5.5 45
## opp_rnd2 opp_rnd3 opp_rnd4 opp_rnd5 opp_rnd6 opp_rnd7 player_state pre_rating
## 1 21 18 14 7 12 4 ON 1794
## 2 58 4 17 16 20 7 MI 1553
## 3 61 25 21 11 13 12 MI 1384
## 4 28 2 26 5 19 1 MI 1716
## 5 37 12 13 4 14 17 MI 1655
## post_rating
## 1 1817
## 2 1663
## 3 1640
## 4 1744
## 5 1690
Here we do a little housecleaning on the data frame. We will eliminate any letter characters or numbers from the character strings in the pre- and post-rating columns, as well as any white space before any of the strings. This will leave the remaining characters that are the rating themselves We will also replace missing data with “NA” in the strings that are missing.
## player_number player_name total_points opp_rnd1
## 1 1 GARY HUA 6.0 39
## 2 2 DAKSHESH DARURI 6.0 63
## 3 3 ADITYA BAJAJ 6.0 8
## 4 4 PATRICK H SCHILLING 5.5 23
## 5 5 HANSHI ZUO 5.5 45
## opp_rnd2 opp_rnd3 opp_rnd4 opp_rnd5 opp_rnd6 opp_rnd7 player_state pre_rating
## 1 21 18 14 7 12 4 ON 1794
## 2 58 4 17 16 20 7 MI 1553
## 3 61 25 21 11 13 12 MI 1384
## 4 28 2 26 5 19 1 MI 1716
## 5 37 12 13 4 14 17 MI 1655
## post_rating
## 1 1817
## 2 1663
## 3 1640
## 4 1744
## 5 1690
In this section we will take the opponent numbers for each round and replace it with their assigned pre-chess rating. This was done by extracting out the individual player numbers and ratings, and then using that as a reference frame for the numbers assigned to the opponents in each round. A new column is created for each round now containing the associated pre-chess rating to the opponent number in the round column (i.e. opp_rndtable1). While I am sure there is a much quicker way of coding this (for loop or function), I was not able to learn that competently enough for this assignment.
## player_number player_name total_points opp_rnd1
## 1 1 GARY HUA 6.0 39
## 2 2 DAKSHESH DARURI 6.0 63
## 3 3 ADITYA BAJAJ 6.0 8
## 4 4 PATRICK H SCHILLING 5.5 23
## 5 5 HANSHI ZUO 5.5 45
## opp_rnd2 opp_rnd3 opp_rnd4 opp_rnd5 opp_rnd6 opp_rnd7 player_state pre_rating
## 1 21 18 14 7 12 4 ON 1794
## 2 58 4 17 16 20 7 MI 1553
## 3 61 25 21 11 13 12 MI 1384
## 4 28 2 26 5 19 1 MI 1716
## 5 37 12 13 4 14 17 MI 1655
## post_rating rnd1_rating rnd2_rating rnd3_rating rnd4_rating rnd5_rating
## 1 1817 1436 1563 1600 1610 1649
## 2 1663 1175 917 1716 1629 1604
## 3 1640 1641 955 1745 1563 1712
## 4 1744 1363 1507 1553 1579 1655
## 5 1690 1242 980 1663 1666 1716
## rnd6_rating rnd7_rating
## 1 1663 1716
## 2 1595 1649
## 3 1666 1663
## 4 1564 1794
## 5 1610 1629
In this section we simply average the pre-chess ratings of all faced by a specific player and place that in a column at the end. We will take each of the columns, that are currently characters, and treat them as numeric for the calculation. We are also doing additional housekeeping by cleaning up the data frame by extracting out the requested columns in the order requested. The final column will contain the Average Pre-Chess Rating of Opponents faced by a participant, rounded to the nearest whole number.
## player_name player_state total_points pre_rating
## 1 GARY HUA ON 6.0 1794
## 2 DAKSHESH DARURI MI 6.0 1553
## 3 ADITYA BAJAJ MI 6.0 1384
## 4 PATRICK H SCHILLING MI 5.5 1716
## 5 HANSHI ZUO MI 5.5 1655
## avg_pre_chess_rating_opp
## 1 1605
## 2 1469
## 3 1564
## 4 1574
## 5 1501
In this section we prepare the file for export as CSV for possible import into a SQL database.
write.csv(tdr_average_final, "C:/Users/para2/Documents/GitHub/DATA607/Project_1/Project_1_Conrardy.csv", row.names=FALSE)
print('CSV file written successfully to location')
## [1] "CSV file written successfully to location"