For the first player, the information would be:
Gary Hua, ON, 6.0, 1794, 1605
1605 was calculated by using the pre-tournament opponents’ ratings of 1436, 1563, 1600, 1610, 1649, 1663, 1716, and dividing by the total number of games played.
To start, let’s load the raw data from the tournamentinfo.txt file, which I saved on GitHub. Data fields in the file are pipe-delimited, so let’s use “read_delim”.
library(tidyverse)
url <- "https://raw.githubusercontent.com/kecbenson/DATA_607_Proj1/master/tournamentinfo.txt"
# load pipe-delimited data as raw dataframe
raw <- read_delim("tournamentinfo.txt", "|", col_names = FALSE, comment = "--", trim_ws = TRUE)
raw
## # A tibble: 130 x 11
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Pair Player Name Total Round Round Round Round Round Round Round <NA>
## 2 Num USCF ID / ~ Pts 1 2 3 4 5 6 7 <NA>
## 3 1 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7 D 12 D 4 <NA>
## 4 ON 15445895 /~ N:2 W B W B W B W <NA>
## 5 2 DAKSHESH D~ 6.0 W 63 W 58 L 4 W 17 W 16 W 20 W 7 <NA>
## 6 MI 14598900 /~ N:2 B W B W B W B <NA>
## 7 3 ADITYA BAJ~ 6.0 L 8 W 61 W 25 W 21 W 11 W 13 W 12 <NA>
## 8 MI 14959604 /~ N:2 W B W B W B W <NA>
## 9 4 PATRICK H ~ 5.5 W 23 D 28 W 2 W 26 D 5 W 19 D 1 <NA>
## 10 MI 12616049 /~ N:2 W B W B W B B <NA>
## # ... with 120 more rows
Let’s remove the last column, since it only contains NA’s.
# check whether column X11 is all NA, and then remove the column
sum(is.na(raw$X11)) == length(raw$X11)
## [1] TRUE
raw <- select(raw, -X11)
Reviewing the raw data, it seems that we can simplify the analysis by separating the data into two intermediate data tables:
- raw1, which contains the player number, name, total points and opponent numbers for all matches played
- raw2, which contains the player state, pre-rating and post-rating
# create first table containing player number, name, points and opponents played
raw1 <- raw %>% filter(str_detect(X1, "[:digit:]+"))
# create column headers
colnames(raw1) <- c("Num", "Name", "Pts", str_c("Rnd", 1:7))
raw1
## # A tibble: 64 x 10
## Num Name Pts Rnd1 Rnd2 Rnd3 Rnd4 Rnd5 Rnd6 Rnd7
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 GARY HUA 6.0 W 39 W 21 W 18 W 14 W 7 D 12 D 4
## 2 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4 W 17 W 16 W 20 W 7
## 3 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25 W 21 W 11 W 13 W 12
## 4 4 PATRICK H SCHILL~ 5.5 W 23 D 28 W 2 W 26 D 5 W 19 D 1
## 5 5 HANSHI ZUO 5.5 W 45 W 37 D 12 D 13 D 4 W 14 W 17
## 6 6 HANSEN SONG 5.0 W 34 D 29 L 11 W 35 D 10 W 27 W 21
## 7 7 GARY DEE SWATHELL 5.0 W 57 W 46 W 13 W 11 L 1 W 9 L 2
## 8 8 EZEKIEL HOUGHTON 5.0 W 3 W 32 L 14 L 9 W 47 W 28 W 19
## 9 9 STEFANO LEE 5.0 W 25 L 18 W 59 W 8 W 26 L 7 W 20
## 10 10 ANVIT RAO 5.0 D 16 L 19 W 55 W 31 D 6 W 25 W 18
## # ... with 54 more rows
# create second table containing player state and ratings
raw2 <- raw %>% filter(str_detect(X1, "[:upper:]{2}"))
# create column headers
colnames(raw2) <- raw[2, ]
raw2
## # A tibble: 64 x 10
## Num `USCF ID / Rtg (~ Pts `1` `2` `3` `4` `5` `6` `7`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ON 15445895 / R: 17~ N:2 W B W B W B W
## 2 MI 14598900 / R: 15~ N:2 B W B W B W B
## 3 MI 14959604 / R: 13~ N:2 W B W B W B W
## 4 MI 12616049 / R: 17~ N:2 W B W B W B B
## 5 MI 14601533 / R: 16~ N:2 B W B W B W B
## 6 OH 15055204 / R: 16~ N:3 W B W B B W B
## 7 MI 11146376 / R: 16~ N:3 W B W B B W W
## 8 MI 15142253 / R: 16~ N:3 B W B W B W W
## 9 ON 14954524 / R: 14~ N:2 W B W B W B B
## 10 MI 14150362 / R: 13~ N:3 W W B B W B W
## # ... with 54 more rows
Some variables can be extracted directly from the working data tables. For instance, the player numbers, names, states and total points are all stored in columns that can be assigned to variables with the correct data types.
# extract player number, name, state and total points from raw1 and raw2 tables, and assign correct data types
num <- as.integer(raw1[[1]])
name <- as.character(raw1[[2]])
state <- as.character(raw2[[1]])
points <- as.numeric(raw1[[3]])
The pre- and post-tournament player ratings can be extracted using the stringr functions and regular expressions.
# extract pre-rating
pre_rating_pat <- "R:[:space:]*[:digit:]+"
pre_rating <- raw2[[2]] %>% str_extract(pre_rating_pat) %>%
str_extract("[:digit:]+") %>% as.integer()
# extract post-rating
post_rating_pat <- "->[:space:]*[:digit:]+"
post_rating <- raw2[[2]] %>% str_extract(post_rating_pat) %>%
str_extract("[:digit:]+") %>% as.integer()
In order to calculate, for each player, the average pre-tournament ratings of all opponents played, we will need to extract and / or compute the following variables for each player:
- Player numbers for all opponents played
- Player pre-tournament ratings for all opponents played
- Number of matches played
- Average rating of opponents played
First let’s extract the player numbers for all opponents played, and store in the opp1 table. At the same time, we can sum up the number of matches played for each player. Note that we need to be careful with the NA’s when computing the number of matches.
# create a table of opponent player numbers, and calc the number of matches played
opp1 <- raw1 %>% select(-Num, -Name, -Pts) # remove unneeded columns
n_matches <- vector("integer", nrow(opp1)) # initialize
for (j in 1:7){
opp1[[j]] <- opp1[[j]] %>% str_extract("[:digit:]+") %>% as.integer()
n_matches <- n_matches + !is.na(opp1[[j]])
}
opp1
## # A tibble: 64 x 7
## Rnd1 Rnd2 Rnd3 Rnd4 Rnd5 Rnd6 Rnd7
## <int> <int> <int> <int> <int> <int> <int>
## 1 39 21 18 14 7 12 4
## 2 63 58 4 17 16 20 7
## 3 8 61 25 21 11 13 12
## 4 23 28 2 26 5 19 1
## 5 45 37 12 13 4 14 17
## 6 34 29 11 35 10 27 21
## 7 57 46 13 11 1 9 2
## 8 3 32 14 9 47 28 19
## 9 25 18 59 8 26 7 20
## 10 16 19 55 31 6 25 18
## # ... with 54 more rows
# need to be careful with NA's
tail(opp1, 10)
## # A tibble: 10 x 7
## Rnd1 Rnd2 Rnd3 Rnd4 Rnd5 Rnd6 Rnd7
## <int> <int> <int> <int> <int> <int> <int>
## 1 62 31 10 30 NA 45 43
## 2 NA 11 35 45 NA 40 42
## 3 7 36 42 51 35 53 NA
## 4 31 2 41 23 49 NA 45
## 5 41 NA 9 40 43 54 44
## 6 33 34 45 42 24 NA NA
## 7 32 3 54 47 42 30 37
## 8 55 NA NA NA NA NA NA
## 9 2 48 49 43 45 NA NA
## 10 22 30 31 49 46 42 54
Next let’s look-up the opponent ratings using the opponent player numbers in opp1 and the player ratings stored in the variable pre_rating above; then store the results in the opp2 table. In this step, we can also compute the average opponent rating for all matches played. As before, we need to be careful and exclude the NA’s when computing the average opponent rating.
# create a table of opponent player ratings, and calc the average rating
opp2 <- opp1 # initialize
sum <- vector("numeric", nrow(opp1)) # initialize
for (j in 1:7){
opp2[[j]] <- pre_rating[opp1[[j]]]
sum <- sum + ifelse(is.na(opp2[[j]]), 0, opp2[[j]])
}
avg_rating <- sum / n_matches
opp2
## # A tibble: 64 x 7
## Rnd1 Rnd2 Rnd3 Rnd4 Rnd5 Rnd6 Rnd7
## <int> <int> <int> <int> <int> <int> <int>
## 1 1436 1563 1600 1610 1649 1663 1716
## 2 1175 917 1716 1629 1604 1595 1649
## 3 1641 955 1745 1563 1712 1666 1663
## 4 1363 1507 1553 1579 1655 1564 1794
## 5 1242 980 1663 1666 1716 1610 1629
## 6 1399 1602 1712 1438 1365 1552 1563
## 7 1092 377 1666 1712 1794 1411 1553
## 8 1384 1441 1610 1411 1362 1507 1564
## 9 1745 1600 853 1641 1579 1649 1595
## 10 1604 1564 1186 1494 1686 1745 1600
## # ... with 54 more rows
# check NA's
tail(opp2, 10)
## # A tibble: 10 x 7
## Rnd1 Rnd2 Rnd3 Rnd4 Rnd5 Rnd6 Rnd7
## <int> <int> <int> <int> <int> <int> <int>
## 1 1530 1494 1365 1522 NA 1242 1283
## 2 NA 1712 1438 1242 NA 1348 1332
## 3 1649 1355 1332 1011 1438 1393 NA
## 4 1494 1553 1403 1363 1291 NA 1242
## 5 1403 NA 1411 1348 1283 1270 1199
## 6 1449 1399 1242 1332 1229 NA NA
## 7 1441 1384 1270 1362 1332 1522 980
## 8 1186 NA NA NA NA NA NA
## 9 1553 1382 1291 1283 1242 NA NA
## 10 1555 1522 1494 1291 377 1332 1270
Now that we have all the required fields (plus a few extras), we can create the final dataframe and review.
final_df <- list(Player = num, Name = name, State = state, Tot_Pts = points, Pre_Rtg = pre_rating,
Avg_Opp_Rtg = avg_rating, N_Matches = n_matches, Post_Rtg = post_rating) %>% as_tibble()
final_df
## # A tibble: 64 x 8
## Player Name State Tot_Pts Pre_Rtg Avg_Opp_Rtg N_Matches Post_Rtg
## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int>
## 1 1 GARY HUA ON 6 1794 1605. 7 1817
## 2 2 DAKSHESH D~ MI 6 1553 1469. 7 1663
## 3 3 ADITYA BAJ~ MI 6 1384 1564. 7 1640
## 4 4 PATRICK H ~ MI 5.5 1716 1574. 7 1744
## 5 5 HANSHI ZUO MI 5.5 1655 1501. 7 1690
## 6 6 HANSEN SONG OH 5 1686 1519. 7 1687
## 7 7 GARY DEE S~ MI 5 1649 1372. 7 1673
## 8 8 EZEKIEL HO~ MI 5 1641 1468. 7 1657
## 9 9 STEFANO LEE ON 5 1411 1523. 7 1564
## 10 10 ANVIT RAO MI 5 1365 1554. 7 1544
## # ... with 54 more rows
# check NA's
tail(final_df)
## # A tibble: 6 x 8
## Player Name State Tot_Pts Pre_Rtg Avg_Opp_Rtg N_Matches Post_Rtg
## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int>
## 1 59 SEAN M MC C~ MI 2 853 1319 6 878
## 2 60 JULIA SHEN MI 1.5 967 1330. 5 984
## 3 61 JEZZEL FARK~ ON 1.5 955 1327. 7 979
## 4 62 ASHWIN BALA~ MI 1 1530 1186 1 1535
## 5 63 THOMAS JOSE~ MI 1 1175 1350. 5 1125
## 6 64 BEN LI MI 1 1163 1263 7 1112
View(final_df)
The dataframe looks reasonable, so let’s save it as a CSV file.
write_csv(final_df, "project1.csv")
Now that we have the final dataframe, we can do some preliminary analysis and see what relationships are apparent in the data.
To start, it appears from the state distribution that this was a tournament in or around Michigan; there are also a few players from Ontario and Ohio.
ggplot(final_df) + geom_bar(aes(x = State, fill = State))
The average rating of the players is around 1400; furthermore, the ratings distribution has a left skew toward lower ratings.
ggplot(final_df) + geom_histogram(aes(x = Pre_Rtg, fill = State), binwidth = 100)
summary(final_df$Pre_Rtg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 377 1227 1407 1378 1583 1794
Most players competed in 7 matches, although some played fewer matches.
ggplot(final_df) + geom_histogram(aes(x = N_Matches, fill = State))
Most competitive players (those having pre-tournament ratings > 1500) played 6 or 7 matches.
ggplot(final_df) + geom_histogram(aes(x = Pre_Rtg, fill = N_Matches), binwidth = 100) +
facet_wrap(~ N_Matches)
On average, higher pre-tournament ratings are associated with higher tournament scores.
ggplot(final_df, aes(x = Pre_Rtg, y = Tot_Pts)) + geom_point(aes(col = State)) +
geom_smooth(method="lm", se=FALSE)
Finally, ratings for most players did not move dramatically. However it appears that lower-rated players (ratings < 1200) experienced the largest percentage moves in their scores from pre- to post-tournament.
ggplot(final_df, aes(x = Pre_Rtg, y = Post_Rtg)) + geom_point(aes(col = State)) + geom_smooth(method="lm") +
geom_abline()
In summary, several observations can be made from a quick analysis of the dataset:
- The tournament likely took place in Michigan, as the vast majority of players are from the state.
- Player ratings average roughly 1400, with a substantial number of players rated below 1000 (left skew).
- Most players, especially the most competitive players (having pre-tournament ratings > 1500), played 6-7 matches.
- On average, stronger players had stronger tournment scores.
- Overall, the impact of the tournament on pre- to post-tournament ratings was not dramatic, although there were larger exceptions especially for lower-ranked players (ratings < 1200).