DATA607 Project 1

Data Cleaning

The first step was to read the data into R. I tried several methods: first, table.read(), then read.tsv(), then I tried to read each individual line of the file into a new file. I ultimately discovered I could use read.csv, and through the arguments sep, header, skip, and colnames, I could achieve several of my cleaning goals in one line of code:

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

url <- 'https://raw.githubusercontent.com/josh1den/DATA-607/Projects/tournamentinfo.txt'
cols = c('player_num','name','total_pts','round_1','round_2','round_3','round_4',
         'round_5','round_6','round_7','NA')
df <- read.csv(url, sep="|", header=FALSE, skip=3, col.names=cols)
head(df)

##                                                                                  player_num
## 1 -----------------------------------------------------------------------------------------
## 2                                                                                        1 
## 3                                                                                       ON 
## 4 -----------------------------------------------------------------------------------------
## 5                                                                                        2 
## 6                                                                                       MI 
##                                name total_pts round_1 round_2 round_3 round_4
## 1                                                                            
## 2  GARY HUA                             6.0     W  39   W  21   W  18   W  14
## 3  15445895 / R: 1794   ->1817          N:2     W       B       W       B    
## 4                                                                            
## 5  DAKSHESH DARURI                      6.0     W  63   W  58   L   4   W  17
## 6  14598900 / R: 1553   ->1663          N:2     B       W       B       W    
##   round_5 round_6 round_7 NA.
## 1                          NA
## 2   W   7   D  12   D   4  NA
## 3   W       B       W      NA
## 4                          NA
## 5   W  16   W  20   W   7  NA
## 6   B       W       B      NA

The next step was to remove the rows with dashes.

# remove rows with dashes
dashes <- "-----------------------------------------------------------------------------------------"
df <- filter(df, player_num != dashes)
head(df)

##   player_num                              name total_pts round_1 round_2
## 1         1   GARY HUA                             6.0     W  39   W  21
## 2        ON   15445895 / R: 1794   ->1817          N:2     W       B    
## 3         2   DAKSHESH DARURI                      6.0     W  63   W  58
## 4        MI   14598900 / R: 1553   ->1663          N:2     B       W    
## 5         3   ADITYA BAJAJ                         6.0     L   8   W  61
## 6        MI   14959604 / R: 1384   ->1640          N:2     W       B    
##   round_3 round_4 round_5 round_6 round_7 NA.
## 1   W  18   W  14   W   7   D  12   D   4  NA
## 2   W       B       W       B       W      NA
## 3   L   4   W  17   W  16   W  20   W   7  NA
## 4   B       W       B       W       B      NA
## 5   W  25   W  21   W  11   W  13   W  12  NA
## 6   W       B       W       B       W      NA

Next, I decided to create two new dataframes which I would later merge to one. The first dataframe (df1) would contain the odd number rows, and the second dataframe (df2) would contain the even number rows.

# extract every first row
df1 <- df %>%
  filter(row_number() %% 2 == 1)

# extract every second row
df2 <- df %>%
  filter(row_number() %% 2 != 1)

Next I used regex to extract the player rating from the name column as a new column: rating. The player ratings all followed a : followed by two or three spaces, and were digits of a minimum length of 3, maximum length of 4.

df2 <- df2 %>%
  mutate(rating = as.numeric(str_extract(name, "(?<=:\\s|:\\s\\s)\\d{3,4}")))
head(df2)

##   player_num                              name total_pts round_1 round_2
## 1        ON   15445895 / R: 1794   ->1817          N:2     W       B    
## 2        MI   14598900 / R: 1553   ->1663          N:2     B       W    
## 3        MI   14959604 / R: 1384   ->1640          N:2     W       B    
## 4        MI   12616049 / R: 1716   ->1744          N:2     W       B    
## 5        MI   14601533 / R: 1655   ->1690          N:2     B       W    
## 6        OH   15055204 / R: 1686   ->1687          N:3     W       B    
##   round_3 round_4 round_5 round_6 round_7 NA. rating
## 1   W       B       W       B       W      NA   1794
## 2   B       W       B       W       B      NA   1553
## 3   W       B       W       B       W      NA   1384
## 4   W       B       W       B       B      NA   1716
## 5   B       W       B       W       B      NA   1655
## 6   W       B       B       W       B      NA   1686

Now I could rename the player_num column to its appropriate name, state, and by subsetting, remove the columns I don’t need.

# rename and drop unnecessary columns
df2 <- rename(df2, state=player_num)
df2 <- df2[c("state","rating")]
head(df2)

##    state rating
## 1    ON    1794
## 2    MI    1553
## 3    MI    1384
## 4    MI    1716
## 5    MI    1655
## 6    OH    1686

Now I’m ready to merge the dataframes.

# merge the dataframes
df_new <- merge(df1, df2, by.x = 0, by.y = 0)
head(df_new)

##   Row.names player_num                              name total_pts round_1
## 1         1         1   GARY HUA                             6.0     W  39
## 2        10        10   ANVIT RAO                            5.0     D  16
## 3        11        11   CAMERON WILLIAM MC LEMAN             4.5     D  38
## 4        12        12   KENNETH J TACK                       4.5     W  42
## 5        13        13   TORRANCE HENRY JR                    4.5     W  36
## 6        14        14   BRADLEY SHAW                         4.5     W  54
##   round_2 round_3 round_4 round_5 round_6 round_7 NA.  state rating
## 1   W  21   W  18   W  14   W   7   D  12   D   4  NA    ON    1794
## 2   L  19   W  55   W  31   D   6   W  25   W  18  NA    MI    1365
## 3   W  56   W   6   L   7   L   3   W  34   W  26  NA    MI    1712
## 4   W  33   D   5   W  38   H       D   1   L   3  NA    MI    1663
## 5   W  27   L   7   D   5   W  33   L   3   W  32  NA    MI    1666
## 6   W  44   W   8   L   1   D  27   L   5   W  31  NA    MI    1610

Steps 1, 2, and 3 are now basically complete. However, there is a little bit of tidying up to attend to.

# remove undesired columns
cols_to_remove <- c("Row.names","NA.")
df_new <- select(df_new, -cols_to_remove)

## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(cols_to_remove)` instead of `cols_to_remove` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.

head(df_new)

##   player_num                              name total_pts round_1 round_2
## 1         1   GARY HUA                             6.0     W  39   W  21
## 2        10   ANVIT RAO                            5.0     D  16   L  19
## 3        11   CAMERON WILLIAM MC LEMAN             4.5     D  38   W  56
## 4        12   KENNETH J TACK                       4.5     W  42   W  33
## 5        13   TORRANCE HENRY JR                    4.5     W  36   W  27
## 6        14   BRADLEY SHAW                         4.5     W  54   W  44
##   round_3 round_4 round_5 round_6 round_7  state rating
## 1   W  18   W  14   W   7   D  12   D   4    ON    1794
## 2   W  55   W  31   D   6   W  25   W  18    MI    1365
## 3   W   6   L   7   L   3   W  34   W  26    MI    1712
## 4   D   5   W  38   H       D   1   L   3    MI    1663
## 5   L   7   D   5   W  33   L   3   W  32    MI    1666
## 6   W   8   L   1   D  27   L   5   W  31    MI    1610

# the name column appears to have whitespace, let's trim that
df_new$name <- str_trim(df_new$name, side = c("both"))
head(df_new)

##   player_num                     name total_pts round_1 round_2 round_3 round_4
## 1         1                  GARY HUA     6.0     W  39   W  21   W  18   W  14
## 2        10                 ANVIT RAO     5.0     D  16   L  19   W  55   W  31
## 3        11  CAMERON WILLIAM MC LEMAN     4.5     D  38   W  56   W   6   L   7
## 4        12            KENNETH J TACK     4.5     W  42   W  33   D   5   W  38
## 5        13         TORRANCE HENRY JR     4.5     W  36   W  27   L   7   D   5
## 6        14              BRADLEY SHAW     4.5     W  54   W  44   W   8   L   1
##   round_5 round_6 round_7  state rating
## 1   W   7   D  12   D   4    ON    1794
## 2   D   6   W  25   W  18    MI    1365
## 3   L   3   W  34   W  26    MI    1712
## 4   H       D   1   L   3    MI    1663
## 5   W  33   L   3   W  32    MI    1666
## 6   D  27   L   5   W  31    MI    1610

# player_num column needs to be numeric
df_new$player_num <- as.numeric(df_new$player_num)
head(df_new)

##   player_num                     name total_pts round_1 round_2 round_3 round_4
## 1          1                 GARY HUA     6.0     W  39   W  21   W  18   W  14
## 2         10                ANVIT RAO     5.0     D  16   L  19   W  55   W  31
## 3         11 CAMERON WILLIAM MC LEMAN     4.5     D  38   W  56   W   6   L   7
## 4         12           KENNETH J TACK     4.5     W  42   W  33   D   5   W  38
## 5         13        TORRANCE HENRY JR     4.5     W  36   W  27   L   7   D   5
## 6         14             BRADLEY SHAW     4.5     W  54   W  44   W   8   L   1
##   round_5 round_6 round_7  state rating
## 1   W   7   D  12   D   4    ON    1794
## 2   D   6   W  25   W  18    MI    1365
## 3   L   3   W  34   W  26    MI    1712
## 4   H       D   1   L   3    MI    1663
## 5   W  33   L   3   W  32    MI    1666
## 6   D  27   L   5   W  31    MI    1610

Now to complete the final step: calculating the average opposing player ratings.

First I start by extracting the player IDs from the round columns.

# reduce round columns to only the player player_ids
round_cols <- c("round_1","round_2","round_3","round_4", "round_5", "round_6", "round_7")

for (col in round_cols) {
  df_new[[col]] <- as.numeric(str_extract_all(df_new[[col]], "(?<=\\s)\\d{1,2}", simplify=TRUE)) 
}

head(df_new)

##   player_num                     name total_pts round_1 round_2 round_3 round_4
## 1          1                 GARY HUA     6.0        39      21      18      14
## 2         10                ANVIT RAO     5.0        16      19      55      31
## 3         11 CAMERON WILLIAM MC LEMAN     4.5        38      56       6       7
## 4         12           KENNETH J TACK     4.5        42      33       5      38
## 5         13        TORRANCE HENRY JR     4.5        36      27       7       5
## 6         14             BRADLEY SHAW     4.5        54      44       8       1
##   round_5 round_6 round_7  state rating
## 1       7      12       4    ON    1794
## 2       6      25      18    MI    1365
## 3       3      34      26    MI    1712
## 4      NA       1       3    MI    1663
## 5      33       3      32    MI    1666
## 6      27       5      31    MI    1610

Next, I iterate through the round columns, replacing the player IDs with the player rating corresponding to the player ID.

# loop through the rounds replacing the opposing player id with the opposing player's pre-rating
for (row in 1:nrow(df_new)) {
  for (col in round_cols) {
    rating <- df_new[row, col]
    
    if(!is.na(rating)) {
      df_new[row, col] <- df_new$rating[df_new$player_num == rating]
    }
  }
}

head(df_new)

##   player_num                     name total_pts round_1 round_2 round_3 round_4
## 1          1                 GARY HUA     6.0      1436    1563    1600    1610
## 2         10                ANVIT RAO     5.0      1604    1564    1186    1494
## 3         11 CAMERON WILLIAM MC LEMAN     4.5      1423    1153    1686    1649
## 4         12           KENNETH J TACK     4.5      1332    1449    1655    1423
## 5         13        TORRANCE HENRY JR     4.5      1355    1552    1649    1655
## 6         14             BRADLEY SHAW     4.5      1270    1199    1641    1794
##   round_5 round_6 round_7  state rating
## 1    1649    1663    1716    ON    1794
## 2    1686    1745    1600    MI    1365
## 3    1384    1399    1579    MI    1712
## 4      NA    1794    1384    MI    1663
## 5    1449    1384    1441    MI    1666
## 6    1552    1655    1494    MI    1610

Now that I have columns with each opposing player rating, I can create a new columns with the average of the values in the round columns.

## calculate avg opponent rating
df_new$avg_opposing_rating <- round(rowMeans(df_new[round_cols], na.rm=TRUE))

head(df_new)

##   player_num                     name total_pts round_1 round_2 round_3 round_4
## 1          1                 GARY HUA     6.0      1436    1563    1600    1610
## 2         10                ANVIT RAO     5.0      1604    1564    1186    1494
## 3         11 CAMERON WILLIAM MC LEMAN     4.5      1423    1153    1686    1649
## 4         12           KENNETH J TACK     4.5      1332    1449    1655    1423
## 5         13        TORRANCE HENRY JR     4.5      1355    1552    1649    1655
## 6         14             BRADLEY SHAW     4.5      1270    1199    1641    1794
##   round_5 round_6 round_7  state rating avg_opposing_rating
## 1    1649    1663    1716    ON    1794                1605
## 2    1686    1745    1600    MI    1365                1554
## 3    1384    1399    1579    MI    1712                1468
## 4      NA    1794    1384    MI    1663                1506
## 5    1449    1384    1441    MI    1666                1498
## 6    1552    1655    1494    MI    1610                1515

From here, it’s simply a matter of selecting the desired columns for my output file.

# select the desired output
final <- select(df_new, c(2, 11, 3, 12, 13))
head(final)

##                       name  state total_pts rating avg_opposing_rating
## 1                 GARY HUA    ON      6.0     1794                1605
## 2                ANVIT RAO    MI      5.0     1365                1554
## 3 CAMERON WILLIAM MC LEMAN    MI      4.5     1712                1468
## 4           KENNETH J TACK    MI      4.5     1663                1506
## 5        TORRANCE HENRY JR    MI      4.5     1666                1498
## 6             BRADLEY SHAW    MI      4.5     1610                1515

Writing the file to csv:

path <- "/Users/joshiden/Documents/Classes/CUNY SPS/Fall 2022/DATA 607/Projects/Project 1.csv"
write.csv(final, path)

Thanks!

DATA607 Project 1

Josh Iden

2022-09-25

Introduction

Strategy

Data Cleaning