DATA 607 - Project 1

Objective: Create an R Markdown file that generates a .CSV file (that could for example be imported into a SQL database) with the following information for all of the players:

Player’s Name
Player’s State
Total Number of Points
Player’s Pre-Rating
Average Pre Chess Rating of Opponents

For the first player, the information would be:
Gary Hua, ON, 6.0, 1794, 1605

1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.

Step 1: Import the raw data and prepare working data tables

To start, let’s load the raw data from the tournamentinfo.txt file, which I saved on GitHub. Data fields in the file are pipe-delimited, so let’s use “read_delim”.

library(tidyverse)

url <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Proj1/master/tournamentinfo.txt"

# load pipe-delimited data as raw dataframe
raw <- read_delim("tournamentinfo.txt", "|", col_names = FALSE, comment = "--", trim_ws = TRUE)
raw

## # A tibble: 130 x 11
##    X1    X2          X3    X4    X5    X6    X7    X8    X9    X10   X11  
##    <chr> <chr>       <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 Pair  Player Name Total Round Round Round Round Round Round Round <NA> 
##  2 Num   USCF ID / ~ Pts   1     2     3     4     5     6     7     <NA> 
##  3 1     GARY HUA    6.0   W  39 W  21 W  18 W  14 W   7 D  12 D   4 <NA> 
##  4 ON    15445895 /~ N:2   W     B     W     B     W     B     W     <NA> 
##  5 2     DAKSHESH D~ 6.0   W  63 W  58 L   4 W  17 W  16 W  20 W   7 <NA> 
##  6 MI    14598900 /~ N:2   B     W     B     W     B     W     B     <NA> 
##  7 3     ADITYA BAJ~ 6.0   L   8 W  61 W  25 W  21 W  11 W  13 W  12 <NA> 
##  8 MI    14959604 /~ N:2   W     B     W     B     W     B     W     <NA> 
##  9 4     PATRICK H ~ 5.5   W  23 D  28 W   2 W  26 D   5 W  19 D   1 <NA> 
## 10 MI    12616049 /~ N:2   W     B     W     B     W     B     B     <NA> 
## # ... with 120 more rows

Let’s remove the last column, since it only contains NA’s.

# check whether column X11 is all NA, and then remove the column
sum(is.na(raw$X11)) == length(raw$X11)

## [1] TRUE

raw <- select(raw, -X11)

Reviewing the raw data, it seems that we can simplify the analysis by separating the data into two intermediate data tables:
- raw1, which contains the player number, name, total points and opponent numbers for all matches played
- raw2, which contains the player state, pre-rating and post-rating

# create first table containing player number, name, points and opponents played
raw1 <- raw %>% filter(str_detect(X1, "[:digit:]+")) 
# create column headers
colnames(raw1) <- c("Num", "Name", "Pts", str_c("Rnd", 1:7))
raw1

## # A tibble: 64 x 10
##    Num   Name              Pts   Rnd1  Rnd2  Rnd3  Rnd4  Rnd5  Rnd6  Rnd7 
##    <chr> <chr>             <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 1     GARY HUA          6.0   W  39 W  21 W  18 W  14 W   7 D  12 D   4
##  2 2     DAKSHESH DARURI   6.0   W  63 W  58 L   4 W  17 W  16 W  20 W   7
##  3 3     ADITYA BAJAJ      6.0   L   8 W  61 W  25 W  21 W  11 W  13 W  12
##  4 4     PATRICK H SCHILL~ 5.5   W  23 D  28 W   2 W  26 D   5 W  19 D   1
##  5 5     HANSHI ZUO        5.5   W  45 W  37 D  12 D  13 D   4 W  14 W  17
##  6 6     HANSEN SONG       5.0   W  34 D  29 L  11 W  35 D  10 W  27 W  21
##  7 7     GARY DEE SWATHELL 5.0   W  57 W  46 W  13 W  11 L   1 W   9 L   2
##  8 8     EZEKIEL HOUGHTON  5.0   W   3 W  32 L  14 L   9 W  47 W  28 W  19
##  9 9     STEFANO LEE       5.0   W  25 L  18 W  59 W   8 W  26 L   7 W  20
## 10 10    ANVIT RAO         5.0   D  16 L  19 W  55 W  31 D   6 W  25 W  18
## # ... with 54 more rows

# create second table containing player state and ratings
raw2 <- raw %>% filter(str_detect(X1, "[:upper:]{2}"))
# create column headers
colnames(raw2) <- raw[2, ]
raw2

## # A tibble: 64 x 10
##    Num   `USCF ID / Rtg (~ Pts   `1`   `2`   `3`   `4`   `5`   `6`   `7`  
##    <chr> <chr>             <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 ON    15445895 / R: 17~ N:2   W     B     W     B     W     B     W    
##  2 MI    14598900 / R: 15~ N:2   B     W     B     W     B     W     B    
##  3 MI    14959604 / R: 13~ N:2   W     B     W     B     W     B     W    
##  4 MI    12616049 / R: 17~ N:2   W     B     W     B     W     B     B    
##  5 MI    14601533 / R: 16~ N:2   B     W     B     W     B     W     B    
##  6 OH    15055204 / R: 16~ N:3   W     B     W     B     B     W     B    
##  7 MI    11146376 / R: 16~ N:3   W     B     W     B     B     W     W    
##  8 MI    15142253 / R: 16~ N:3   B     W     B     W     B     W     W    
##  9 ON    14954524 / R: 14~ N:2   W     B     W     B     W     B     B    
## 10 MI    14150362 / R: 13~ N:3   W     W     B     B     W     B     W    
## # ... with 54 more rows

Step 2: Extract variables that are directly available

Some variables can be extracted directly from the working data tables. For instance, the player numbers, names, states and total points are all stored in columns that can be assigned to variables with the correct data types.

# extract player number, name, state and total points from raw1 and raw2 tables, and assign correct data types
num <- as.integer(raw1[[1]])        
name <- as.character(raw1[[2]])     
state <- as.character(raw2[[1]])    
points <- as.numeric(raw1[[3]])

The pre- and post-tournament player ratings can be extracted using the stringr functions and regular expressions.

# extract pre-rating
pre_rating_pat <- "R:[:space:]*[:digit:]+"
pre_rating <- raw2[[2]] %>% str_extract(pre_rating_pat) %>%
    str_extract("[:digit:]+") %>% as.integer()
# extract post-rating                          
post_rating_pat <- "->[:space:]*[:digit:]+"
post_rating <- raw2[[2]] %>% str_extract(post_rating_pat) %>% 
    str_extract("[:digit:]+") %>% as.integer()

Step 3: Calculate variables that are not directly available

In order to calculate, for each player, the average pre-tournament ratings of all opponents played, we will need to extract and / or compute the following variables for each player:
- Player numbers for all opponents played
- Player pre-tournament ratings for all opponents played
- Number of matches played
- Average rating of opponents played

First let’s extract the player numbers for all opponents played, and store in the opp1 table. At the same time, we can sum up the number of matches played for each player. Note that we need to be careful with the NA’s when computing the number of matches.

# create a table of opponent player numbers, and calc the number of matches played
opp1 <- raw1 %>% select(-Num, -Name, -Pts)      # remove unneeded columns
n_matches <- vector("integer", nrow(opp1))      # initialize
for (j in 1:7){
    opp1[[j]] <- opp1[[j]] %>% str_extract("[:digit:]+") %>% as.integer()
    n_matches <- n_matches + !is.na(opp1[[j]])
}
opp1

## # A tibble: 64 x 7
##     Rnd1  Rnd2  Rnd3  Rnd4  Rnd5  Rnd6  Rnd7
##    <int> <int> <int> <int> <int> <int> <int>
##  1    39    21    18    14     7    12     4
##  2    63    58     4    17    16    20     7
##  3     8    61    25    21    11    13    12
##  4    23    28     2    26     5    19     1
##  5    45    37    12    13     4    14    17
##  6    34    29    11    35    10    27    21
##  7    57    46    13    11     1     9     2
##  8     3    32    14     9    47    28    19
##  9    25    18    59     8    26     7    20
## 10    16    19    55    31     6    25    18
## # ... with 54 more rows

# need to be careful with NA's
tail(opp1, 10)

## # A tibble: 10 x 7
##     Rnd1  Rnd2  Rnd3  Rnd4  Rnd5  Rnd6  Rnd7
##    <int> <int> <int> <int> <int> <int> <int>
##  1    62    31    10    30    NA    45    43
##  2    NA    11    35    45    NA    40    42
##  3     7    36    42    51    35    53    NA
##  4    31     2    41    23    49    NA    45
##  5    41    NA     9    40    43    54    44
##  6    33    34    45    42    24    NA    NA
##  7    32     3    54    47    42    30    37
##  8    55    NA    NA    NA    NA    NA    NA
##  9     2    48    49    43    45    NA    NA
## 10    22    30    31    49    46    42    54

Next let’s look-up the opponent ratings using the opponent player numbers in opp1 and the player ratings stored in the variable pre_rating above; then store the results in the opp2 table. In this step, we can also compute the average opponent rating for all matches played. As before, we need to be careful and exclude the NA’s when computing the average opponent rating.

# create a table of opponent player ratings, and calc the average rating
opp2 <- opp1                            # initialize
sum <- vector("numeric", nrow(opp1))    # initialize
for (j in 1:7){
    opp2[[j]] <- pre_rating[opp1[[j]]]
    sum <- sum + ifelse(is.na(opp2[[j]]), 0, opp2[[j]])
}
avg_rating <- sum / n_matches
opp2

## # A tibble: 64 x 7
##     Rnd1  Rnd2  Rnd3  Rnd4  Rnd5  Rnd6  Rnd7
##    <int> <int> <int> <int> <int> <int> <int>
##  1  1436  1563  1600  1610  1649  1663  1716
##  2  1175   917  1716  1629  1604  1595  1649
##  3  1641   955  1745  1563  1712  1666  1663
##  4  1363  1507  1553  1579  1655  1564  1794
##  5  1242   980  1663  1666  1716  1610  1629
##  6  1399  1602  1712  1438  1365  1552  1563
##  7  1092   377  1666  1712  1794  1411  1553
##  8  1384  1441  1610  1411  1362  1507  1564
##  9  1745  1600   853  1641  1579  1649  1595
## 10  1604  1564  1186  1494  1686  1745  1600
## # ... with 54 more rows

# check NA's
tail(opp2, 10)

## # A tibble: 10 x 7
##     Rnd1  Rnd2  Rnd3  Rnd4  Rnd5  Rnd6  Rnd7
##    <int> <int> <int> <int> <int> <int> <int>
##  1  1530  1494  1365  1522    NA  1242  1283
##  2    NA  1712  1438  1242    NA  1348  1332
##  3  1649  1355  1332  1011  1438  1393    NA
##  4  1494  1553  1403  1363  1291    NA  1242
##  5  1403    NA  1411  1348  1283  1270  1199
##  6  1449  1399  1242  1332  1229    NA    NA
##  7  1441  1384  1270  1362  1332  1522   980
##  8  1186    NA    NA    NA    NA    NA    NA
##  9  1553  1382  1291  1283  1242    NA    NA
## 10  1555  1522  1494  1291   377  1332  1270

Step 4: Generate the final dataframe and export to a CSV file

Now that we have all the required fields (plus a few extras), we can create the final dataframe and review.

final_df <- list(Player = num, Name = name, State = state, Tot_Pts = points, Pre_Rtg = pre_rating, 
                 Avg_Opp_Rtg = avg_rating, N_Matches = n_matches, Post_Rtg = post_rating) %>% as_tibble()
final_df

## # A tibble: 64 x 8
##    Player Name        State Tot_Pts Pre_Rtg Avg_Opp_Rtg N_Matches Post_Rtg
##     <int> <chr>       <chr>   <dbl>   <int>       <dbl>     <int>    <int>
##  1      1 GARY HUA    ON        6      1794       1605.         7     1817
##  2      2 DAKSHESH D~ MI        6      1553       1469.         7     1663
##  3      3 ADITYA BAJ~ MI        6      1384       1564.         7     1640
##  4      4 PATRICK H ~ MI        5.5    1716       1574.         7     1744
##  5      5 HANSHI ZUO  MI        5.5    1655       1501.         7     1690
##  6      6 HANSEN SONG OH        5      1686       1519.         7     1687
##  7      7 GARY DEE S~ MI        5      1649       1372.         7     1673
##  8      8 EZEKIEL HO~ MI        5      1641       1468.         7     1657
##  9      9 STEFANO LEE ON        5      1411       1523.         7     1564
## 10     10 ANVIT RAO   MI        5      1365       1554.         7     1544
## # ... with 54 more rows

# check NA's
tail(final_df)

## # A tibble: 6 x 8
##   Player Name         State Tot_Pts Pre_Rtg Avg_Opp_Rtg N_Matches Post_Rtg
##    <int> <chr>        <chr>   <dbl>   <int>       <dbl>     <int>    <int>
## 1     59 SEAN M MC C~ MI        2       853       1319          6      878
## 2     60 JULIA SHEN   MI        1.5     967       1330.         5      984
## 3     61 JEZZEL FARK~ ON        1.5     955       1327.         7      979
## 4     62 ASHWIN BALA~ MI        1      1530       1186          1     1535
## 5     63 THOMAS JOSE~ MI        1      1175       1350.         5     1125
## 6     64 BEN LI       MI        1      1163       1263          7     1112

View(final_df)

The dataframe looks reasonable, so let’s save it as a CSV file.

write_csv(final_df, "project1.csv")

Step 5: Do some preliminary analysis on the data

Now that we have the final dataframe, we can do some preliminary analysis and see what relationships are apparent in the data.

To start, it appears from the state distribution that this was a tournament in or around Michigan; there are also a few players from Ontario and Ohio.

ggplot(final_df) + geom_bar(aes(x = State, fill = State))

The average rating of the players is around 1400; furthermore, the ratings distribution has a left skew toward lower ratings.

ggplot(final_df) + geom_histogram(aes(x = Pre_Rtg, fill = State), binwidth = 100)

summary(final_df$Pre_Rtg)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     377    1227    1407    1378    1583    1794

Most players competed in 7 matches, although some played fewer matches.

ggplot(final_df) + geom_histogram(aes(x = N_Matches, fill = State))

Most competitive players (those having pre-tournament ratings > 1500) played 6 or 7 matches.

ggplot(final_df) + geom_histogram(aes(x = Pre_Rtg, fill = N_Matches), binwidth = 100) + 
    facet_wrap(~ N_Matches)

On average, higher pre-tournament ratings are associated with higher tournament scores.

ggplot(final_df, aes(x = Pre_Rtg, y = Tot_Pts)) + geom_point(aes(col = State)) + 
    geom_smooth(method="lm", se=FALSE)

Finally, ratings for most players did not move dramatically. However it appears that lower-rated players (ratings < 1200) experienced the largest percentage moves in their scores from pre- to post-tournament.

ggplot(final_df, aes(x = Pre_Rtg, y = Post_Rtg)) + geom_point(aes(col = State)) + geom_smooth(method="lm") + 
    geom_abline()

In summary, several observations can be made from a quick analysis of the dataset:
- The tournament likely took place in Michigan, as the vast majority of players are from the state.
- Player ratings average roughly 1400, with a substantial number of players rated below 1000 (left skew).
- Most players, especially the most competitive players (having pre-tournament ratings > 1500), played 6-7 matches.
- On average, stronger players had stronger tournment scores.
- Overall, the impact of the tournament on pre- to post-tournament ratings was not dramatic, although there were larger exceptions especially for lower-ranked players (ratings < 1200).