Overview
For this project, I was given a text file that included the results from a chess tournament, however the data is saved as a .txt file and broken up into a table that is easy to keep records by hand, not necessarily for data processing. Here I will sort through the data and filter out superfluous information and be left with a table that includes our players names, their state/origin, the total number of points they earned during the tournament, their pre-rating, and finally the average pre-rating of their opponents. Once the table is completed, I will save it as a .csv file.
In order to start off, I must import the data, which I have stored on github, as well as choose the packages I will be working with.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data_url <-
"https://raw.githubusercontent.com/scrummett/DATA607/refs/heads/main/tournamentinfo.txt"
raw_data <- read_fwf(data_url,
fwf_widths(c(8, 33, 6, 6, 6, 6, 6, 6, 6, 6)),
skip = 4)
## Rows: 192 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
##
## chr (10): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Here I have read the .txt file in “fixed width format”, as if I were to read it in with “readLines” I would have a much harder time creating columns, which I will soon turn into our variables. Fortunately, with “read_fwf”, I can create different columns at exact spots in the file. I have also skipped the first four rows, as I won’t need their column names in our table.
Data Transformation
head(raw_data)
## # A tibble: 6 × 10
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 | GARY HUA … 6.0 … W 3… W 2… W 1… W 1… W … D 1… D …
## 2 ON | 15445895 / R: 1794 … N:2 … W … B … W … B … W … B … W …
## 3 -------- --------------------… ----… ----… ----… ----… ----… ----… ----… ----…
## 4 2 | DAKSHESH DARURI … 6.0 … W 6… W 5… L … W 1… W 1… W 2… W …
## 5 MI | 14598900 / R: 1553 … N:2 … B … W … B … W … B … W … B …
## 6 -------- --------------------… ----… ----… ----… ----… ----… ----… ----… ----…
Now that I have our data in columns, I can begin to clean them up by getting rid of the extraneous things separating our values such as rows of “—” and “|” separating columns.
raw_data <- raw_data |>
filter(X1 != "--------") |>
mutate_all(~ str_replace_all(as.character(.), c("\\|" = "")))
Now I have only rows of information I need, and less punctuation that is not helpful to us. However, information for an individual player is stored across two separate rows each, therefore I must create a table with one row per player.
raw_data_id <- raw_data |>
filter(!is.na(as.numeric(X1)))
## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `!is.na(as.numeric(X1))`.
## Caused by warning:
## ! NAs introduced by coercion
raw_data_state <- raw_data |>
filter(is.na(as.numeric(X1)))
## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `is.na(as.numeric(X1))`.
## Caused by warning:
## ! NAs introduced by coercion
raw_data_id <- raw_data_id |>
mutate(join_id = 1:n())
raw_data_state <- raw_data_state |>
mutate(join_id = 1:n())
raw_data <- full_join(raw_data_id, raw_data_state, by = "join_id")
head(raw_data)
## # A tibble: 6 × 21
## X1.x X2.x X3.x X4.x X5.x X6.x X7.x X8.x X9.x X10.x join_id X1.y
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int> <chr>
## 1 "1 " "GARY HUA… "6.0… W 39 W 21 W 18 W 14 W 7 D 12 D 4 1 "ON "
## 2 "2 " "DAKSHESH… "6.0… W 63 W 58 L 4 W 17 W 16 W 20 W 7 2 "MI "
## 3 "3 " "ADITYA B… "6.0… L 8 W 61 W 25 W 21 W 11 W 13 W 12 3 "MI "
## 4 "4 " "PATRICK … "5.5… W 23 D 28 W 2 W 26 D 5 W 19 D 1 4 "MI "
## 5 "5 " "HANSHI Z… "5.5… W 45 W 37 D 12 D 13 D 4 W 14 W 17 5 "MI "
## 6 "6 " "HANSEN S… "5.0… W 34 D 29 L 11 W 35 D 10 W 27 W 21 6 "OH "
## # ℹ 9 more variables: X2.y <chr>, X3.y <chr>, X4.y <chr>, X5.y <chr>,
## # X6.y <chr>, X7.y <chr>, X8.y <chr>, X9.y <chr>, X10.y <chr>
Here I have separated our data out into two tables by looking at their first column. I saw that they either contained letters or numbers. These rows were separated by determining if the values in the first column could be considered “numeric” or not. I created the table “raw_data_num” for those that were numeric and “raw_data_state” for those that weren’t. After that, I assigned each row in both tables a “join_id” number that would correspond to the other table. I then joined the two rows to create one table with every player’s information in one row.
However, this data is still obfuscated by column names that don’t make sense. Next, I can clean that up, as well as the data in those columns.
raw_data <- raw_data |>
rename(
id = X1.x,
name = X2.x,
total_points = X3.x,
opponent_1 = X4.x,
opponent_2 = X5.x,
opponent_3 = X6.x,
opponent_4 = X7.x,
opponent_5 = X8.x,
opponent_6 = X9.x,
opponent_7 = X10.x,
state = X1.y,
pre_rating = X2.y)
Next I can select only the columns that will be important to us going further.
raw_data <- raw_data |>
select(id,
name,
total_points,
opponent_1,
opponent_2,
opponent_3,
opponent_4,
opponent_5,
opponent_6,
opponent_7,
state,
pre_rating)
These are the columns that will either be in our final table, or will help us calculate the averages for our final table.
I have win/loss/draw information in our table still which I can get rid of too.
raw_data <- raw_data |>
mutate(across(starts_with("opponent_"), ~ gsub("[A-Za-z ]", "", .)))
head(raw_data)
## # A tibble: 6 × 12
## id name total_points opponent_1 opponent_2 opponent_3 opponent_4
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "1 " "GARY HUA … "6.0 " 39 21 18 14
## 2 "2 " "DAKSHESH DARU… "6.0 " 63 58 4 17
## 3 "3 " "ADITYA BAJAJ … "6.0 " 8 61 25 21
## 4 "4 " "PATRICK H SCH… "5.5 " 23 28 2 26
## 5 "5 " "HANSHI ZUO … "5.5 " 45 37 12 13
## 6 "6 " "HANSEN SONG … "5.0 " 34 29 11 35
## # ℹ 5 more variables: opponent_5 <chr>, opponent_6 <chr>, opponent_7 <chr>,
## # state <chr>, pre_rating <chr>
Here I looked across every column that began with “opponent_” for observations that contained any letters followed by a space, written as “[A-Za-z ]”. If letters were found, I replaced them with a value of nothing, written as ““.
Next, I can get rid of the extra information in the “pre_rating” column.
raw_data <- raw_data |>
mutate(pre_rating = parse_number(str_extract(pre_rating, " \\d+")))
With this code, I take every observation within the prevariable column and search for any instance of a space, followed by any number of digits, however they can only be digits. This prevents us from taking any other numbers within the observations. Additionally, it takes what is left, a space and a number, and only keeps the numbers, changing the column type to “dbl” as well.
One last transformation that needs to be done is changing some of our columns to be numerical, not character columns.
chess_data <- raw_data |>
mutate_at(vars(id,
total_points,
opponent_1,
opponent_2,
opponent_3,
opponent_4,
opponent_5,
opponent_6,
opponent_7,
pre_rating), as.numeric)
head(chess_data)
## # A tibble: 6 × 12
## id name total_points opponent_1 opponent_2 opponent_3 opponent_4
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 "GARY HUA … 6 39 21 18 14
## 2 2 "DAKSHESH DARU… 6 63 58 4 17
## 3 3 "ADITYA BAJAJ … 6 8 61 25 21
## 4 4 "PATRICK H SCH… 5.5 23 28 2 26
## 5 5 "HANSHI ZUO … 5.5 45 37 12 13
## 6 6 "HANSEN SONG … 5 34 29 11 35
## # ℹ 5 more variables: opponent_5 <dbl>, opponent_6 <dbl>, opponent_7 <dbl>,
## # state <chr>, pre_rating <dbl>
Now that these are numeric, I can begin to find the average rating of opponents for each player.
This data is very wide however, so it must be tidied to more easily perform the next steps.
long_chess_data <- chess_data |>
pivot_longer(
cols = starts_with("opponent_"),
names_to = "round",
values_to = "opponent_id")
head(long_chess_data)
## # A tibble: 6 × 7
## id name total_points state pre_rating round opponent_id
## <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl>
## 1 1 "GARY HUA … 6 "ON " 1794 oppo… 39
## 2 1 "GARY HUA … 6 "ON " 1794 oppo… 21
## 3 1 "GARY HUA … 6 "ON " 1794 oppo… 18
## 4 1 "GARY HUA … 6 "ON " 1794 oppo… 14
## 5 1 "GARY HUA … 6 "ON " 1794 oppo… 7
## 6 1 "GARY HUA … 6 "ON " 1794 oppo… 12
We have created a new single column for opponent_id, and now have a column of which round that opponent was played. We can clean that column up as well.
long_chess_data <- long_chess_data |>
mutate(round = str_extract(round, "\\d+"))
long_chess_data <- long_chess_data |>
mutate(round = as.numeric(round))
head(long_chess_data)
## # A tibble: 6 × 7
## id name total_points state pre_rating round opponent_id
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 "GARY HUA … 6 "ON " 1794 1 39
## 2 1 "GARY HUA … 6 "ON " 1794 2 21
## 3 1 "GARY HUA … 6 "ON " 1794 3 18
## 4 1 "GARY HUA … 6 "ON " 1794 4 14
## 5 1 "GARY HUA … 6 "ON " 1794 5 7
## 6 1 "GARY HUA … 6 "ON " 1794 6 12
The column for rounds is now a single digit and is numerical.
Now with the data long I can assign the values of the opponents rankings more easily.
long_chess_data <- long_chess_data |>
mutate(
opponent_rating = pre_rating[match(opponent_id, id)])
Here I have created a column with the ranking of the opponent played. I have done so with match(), first matching where the opponent_id and id values match. Then the pre_rating value is retrieved from the same row as id and inserted into the new column. From here I can find the average opponent pre-rating.
opp_avg_data <- long_chess_data |>
group_by(name) |>
summarise(avg_opp_rating = mean(opponent_rating, na.rm = TRUE))
Here I have created a new table with two columns, the name of the player and their corresponding average opponents rating.
long_chess_data <- long_chess_data |>
select(name, total_points, state, pre_rating) |>
distinct(name, .keep_all = TRUE)
final_chess_data <- full_join(long_chess_data, opp_avg_data, by = "name")
final_chess_data <- final_chess_data |>
relocate(state, .before = total_points)
head(final_chess_data)
## # A tibble: 6 × 5
## name state total_points pre_rating avg_opp_rating
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 "GARY HUA … "ON " 6 1794 1605.
## 2 "DAKSHESH DARURI … "MI " 6 1553 1469.
## 3 "ADITYA BAJAJ … "MI " 6 1384 1564.
## 4 "PATRICK H SCHILLING … "MI " 5.5 1716 1574.
## 5 "HANSHI ZUO … "MI " 5.5 1655 1501.
## 6 "HANSEN SONG … "OH " 5 1686 1519.
Next is to make a table that can be joined to the opp_avg_data. First we filter out all variables that we no longer need, keeping name, state, total points, and pre-rating, then finding the distinct rows by name. After this, I joined by name and rearranged two columns to create our final table.
write.csv(final_chess_data, "C:/Users/crumm/Documents/finalchessdata.csv", row.names = FALSE)
Lastly, I wrote a .csv file which I will store on github.
Conclusion
While this project was asking for a fairly simple task, taking data in a messy, unfriendly form and cleaning it up into something more manageable, I used many different methods to get to my final answer. First, I had to import the data from a file type that I have not had much experience with, followed by a long clean up involving REGEX and joins. Additionally, I had to create a version of the data that was much tidier, using pivot_longer to accomplish this. From there I could more easily find the averages of opponent pre-ratings. This, to me felt like a culmination of the first few weeks of basic data manipulation and handling. It seems like there are many ways to get to the final table too - I found myself doing this two separate (but similar) ways. First, I made the table with wide, untidy data. Once I had that done I could see more easily how tidying my data would make manipulating the data much quicker.