title: “Data 607 - Project 1”
author: “Sufian”
date: “9/18/2019”
output: html_document
Rpubs Links:
http://www.rpubs.com/ssufian/530973
GitHub Links:
https://github.com/ssufian/Data_607
In this project, the task is to transform a text file with chess tournament results into a clean tidy data
format and store the results as a .CSV file with the following information for all of the players:
Player’s Name
Player’s State
Total Number of Points
Player’s Pre-Rating
Average Pre Chess Rating of Opponents
# Load necessary libraries
library(stringr)
library(DT)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
url <- 'https://raw.githubusercontent.com/ssufian/Data_607/master/tournamentinfo.txt'
mydata <- read.table(file =url, header=F , stringsAsFactors =FALSE, sep = ",")
mydata[1:10,]
## [1] "-----------------------------------------------------------------------------------------"
## [2] " Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| "
## [3] " Num | USCF ID / Rtg (Pre->Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | "
## [4] "-----------------------------------------------------------------------------------------"
## [5] " 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|"
## [6] " ON | 15445895 / R: 1794 ->1817 |N:2 |W |B |W |B |W |B |W |"
## [7] "-----------------------------------------------------------------------------------------"
## [8] " 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|"
## [9] " MI | 14598900 / R: 1553 ->1663 |N:2 |B |W |B |W |B |W |B |"
## [10] "-----------------------------------------------------------------------------------------"
library (stringr)
newdata <- data.frame(str_replace_all(mydata$V1,"-",""))
newdata[1:10,]
## [1]
## [2] Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|
## [3] Num | USCF ID / Rtg (Pre>Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
## [4]
## [5] 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
## [6] ON | 15445895 / R: 1794 >1817 |N:2 |W |B |W |B |W |B |W |
## [7]
## [8] 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|
## [9] MI | 14598900 / R: 1553 >1663 |N:2 |B |W |B |W |B |W |B |
## [10]
## 131 Levels: ...
#take only populated rows that do not contain emptly lines
newdata <- as.data.frame(newdata[!apply(newdata == "", 1, all),])
newdata[1:10,]
## [1] Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|
## [2] Num | USCF ID / Rtg (Pre>Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
## [3] 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
## [4] ON | 15445895 / R: 1794 >1817 |N:2 |W |B |W |B |W |B |W |
## [5] 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|
## [6] MI | 14598900 / R: 1553 >1663 |N:2 |B |W |B |W |B |W |B |
## [7] 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|
## [8] MI | 14959604 / R: 1384 >1640 |N:2 |W |B |W |B |W |B |W |
## [9] 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|
## [10] MI | 12616049 / R: 1716 >1744 |N:2 |W |B |W |B |W |B |B |
## 131 Levels: ...
odd <- newdata[seq(1,nrow(newdata),2),] # only odd rows
even <- newdata[seq(2,nrow(newdata),2),] # only even rows
head(odd) #check out my odd rows
## [1] Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round|
## [2] 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4|
## [3] 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7|
## [4] 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12|
## [5] 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1|
## [6] 5 | HANSHI ZUO |5.5 |W 45|W 37|D 12|D 13|D 4|W 14|W 17|
## 131 Levels: ...
head(even) #check out my even rows
## [1] Num | USCF ID / Rtg (Pre>Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
## [2] ON | 15445895 / R: 1794 >1817 |N:2 |W |B |W |B |W |B |W |
## [3] MI | 14598900 / R: 1553 >1663 |N:2 |B |W |B |W |B |W |B |
## [4] MI | 14959604 / R: 1384 >1640 |N:2 |W |B |W |B |W |B |W |
## [5] MI | 12616049 / R: 1716 >1744 |N:2 |W |B |W |B |W |B |B |
## [6] MI | 14601533 / R: 1655 >1690 |N:2 |B |W |B |W |B |W |B |
## 131 Levels: ...
Merging rows into columns into a temp table; called new_table
#Merging the odd & the even rows into a newly created empty place holder table
new_table <- data.frame(c())
new_table <- data.frame(paste(odd,even))
new_table[1:5,]
## [1] Pair | Player Name |Total|Round|Round|Round|Round|Round|Round|Round| Num | USCF ID / Rtg (Pre>Post) | Pts | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
## [2] 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4| ON | 15445895 / R: 1794 >1817 |N:2 |W |B |W |B |W |B |W |
## [3] 2 | DAKSHESH DARURI |6.0 |W 63|W 58|L 4|W 17|W 16|W 20|W 7| MI | 14598900 / R: 1553 >1663 |N:2 |B |W |B |W |B |W |B |
## [4] 3 | ADITYA BAJAJ |6.0 |L 8|W 61|W 25|W 21|W 11|W 13|W 12| MI | 14959604 / R: 1384 >1640 |N:2 |W |B |W |B |W |B |W |
## [5] 4 | PATRICK H SCHILLING |5.5 |W 23|D 28|W 2|W 26|D 5|W 19|D 1| MI | 12616049 / R: 1716 >1744 |N:2 |W |B |W |B |W |B |B |
## 65 Levels: 1 | GARY HUA |6.0 |W 39|W 21|W 18|W 14|W 7|D 12|D 4| ON | 15445895 / R: 1794 >1817 |N:2 |W |B |W |B |W |B |W | ...
Creating headers in the merge table while eliminating the pipes (|) deliminators and deleting first rows
since the new headers were created; like starting from a clean slate of a new table
# Create headers
Headers <- c("Pair","Player Name","Total","Round 1","Round 2","Round 3", "Round 4","Round 5","Round 6","Round 7","State","USCF ID / Rtg (Pre->Post)","Pts","1","2","3","4","5","6","7","Avg Pre Rating")
# Eliminating | and Separating into columns
df <- separate(data = new_table, col=paste.odd..even., into = Headers, sep = "\\|")
# HOusekeeping: Delete row containing all columns names
df <- df[-c(1), ]
df[1:5,]
## Pair Player Name Total Round 1 Round 2 Round 3
## 2 1 GARY HUA 6.0 W 39 W 21 W 18
## 3 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4
## 4 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25
## 5 4 PATRICK H SCHILLING 5.5 W 23 D 28 W 2
## 6 5 HANSHI ZUO 5.5 W 45 W 37 D 12
## Round 4 Round 5 Round 6 Round 7 State USCF ID / Rtg (Pre->Post)
## 2 W 14 W 7 D 12 D 4 ON 15445895 / R: 1794 >1817
## 3 W 17 W 16 W 20 W 7 MI 14598900 / R: 1553 >1663
## 4 W 21 W 11 W 13 W 12 MI 14959604 / R: 1384 >1640
## 5 W 26 D 5 W 19 D 1 MI 12616049 / R: 1716 >1744
## 6 D 13 D 4 W 14 W 17 MI 14601533 / R: 1655 >1690
## Pts 1 2 3 4 5 6 7 Avg Pre Rating
## 2 N:2 W B W B W B W
## 3 N:2 B W B W B W B
## 4 N:2 W B W B W B W
## 5 N:2 W B W B W B B
## 6 N:2 B W B W B W B
# Extracting numerical values from "USCF ID / Rtg (Pre->Post)" column for calculating averages later
temp <- str_extract_all(df$`USCF ID / Rtg (Pre->Post)`,"\\b\\d{1,}")
temp <- data.frame(as.character(temp)) #making into dataframe
head(temp)
## as.character.temp.
## 1 c("15445895", "1794", "1817")
## 2 c("14598900", "1553", "1663")
## 3 c("14959604", "1384", "1640")
## 4 c("12616049", "1716", "1744")
## 5 c("14601533", "1655", "1690")
## 6 c("15055204", "1686", "1687")
# breaking out the columns; extracting the 3 numericals
# Split the data frame: 1 column -> 3 different columns
temp <- separate(data = temp, col = as.character.temp., into = c("col1","col2","col3"), sep = ",")
head(temp)
## col1 col2 col3
## 1 c("15445895" "1794" "1817")
## 2 c("14598900" "1553" "1663")
## 3 c("14959604" "1384" "1640")
## 4 c("12616049" "1716" "1744")
## 5 c("14601533" "1655" "1690")
## 6 c("15055204" "1686" "1687")
# Temporary column vectors
ID <- str_extract_all(temp$col1,"[[:digit:]]{1,}")
Pre_rate <- str_extract_all(temp$col2,"[[:digit:]]{1,}")
Post_rate <- str_extract_all(temp$col3,"[[:digit:]]{1,}")
#converting ID, pre rating & post rating columns into numerics for calculation purposes
df$`USCF ID` <- as.numeric(ID)
df$`Pre Rating` <- as.numeric(Pre_rate)
df$`Post Rating` <- as.numeric(Post_rate)
head(df)
## Pair Player Name Total Round 1 Round 2 Round 3
## 2 1 GARY HUA 6.0 W 39 W 21 W 18
## 3 2 DAKSHESH DARURI 6.0 W 63 W 58 L 4
## 4 3 ADITYA BAJAJ 6.0 L 8 W 61 W 25
## 5 4 PATRICK H SCHILLING 5.5 W 23 D 28 W 2
## 6 5 HANSHI ZUO 5.5 W 45 W 37 D 12
## 7 6 HANSEN SONG 5.0 W 34 D 29 L 11
## Round 4 Round 5 Round 6 Round 7 State USCF ID / Rtg (Pre->Post)
## 2 W 14 W 7 D 12 D 4 ON 15445895 / R: 1794 >1817
## 3 W 17 W 16 W 20 W 7 MI 14598900 / R: 1553 >1663
## 4 W 21 W 11 W 13 W 12 MI 14959604 / R: 1384 >1640
## 5 W 26 D 5 W 19 D 1 MI 12616049 / R: 1716 >1744
## 6 D 13 D 4 W 14 W 17 MI 14601533 / R: 1655 >1690
## 7 W 35 D 10 W 27 W 21 OH 15055204 / R: 1686 >1687
## Pts 1 2 3 4 5 6 7 Avg Pre Rating USCF ID
## 2 N:2 W B W B W B W 15445895
## 3 N:2 B W B W B W B 14598900
## 4 N:2 W B W B W B W 14959604
## 5 N:2 W B W B W B B 12616049
## 6 N:2 B W B W B W B 14601533
## 7 N:3 W B W B B W B 15055204
## Pre Rating Post Rating
## 2 1794 1817
## 3 1553 1663
## 4 1384 1640
## 5 1716 1744
## 6 1655 1690
## 7 1686 1687
df_new <- subset(df, select = c(1,2,11,22,3,23,24,21))
head(df_new)
## Pair Player Name State USCF ID Total
## 2 1 GARY HUA ON 15445895 6.0
## 3 2 DAKSHESH DARURI MI 14598900 6.0
## 4 3 ADITYA BAJAJ MI 14959604 6.0
## 5 4 PATRICK H SCHILLING MI 12616049 5.5
## 6 5 HANSHI ZUO MI 14601533 5.5
## 7 6 HANSEN SONG OH 15055204 5.0
## Pre Rating Post Rating Avg Pre Rating
## 2 1794 1817
## 3 1553 1663
## 4 1384 1640
## 5 1716 1744
## 6 1655 1690
## 7 1686 1687
class(head(df_new$'Pre Rating')) # check class type
## [1] "numeric"
class(head(df_new$`Avg Pre Rating`)) # check class type
## [1] "character"
# converting characters in Avg. Pre Rating columns to numeric
df_new$`Avg Pre Rating` <- 0
head(df_new)
## Pair Player Name State USCF ID Total
## 2 1 GARY HUA ON 15445895 6.0
## 3 2 DAKSHESH DARURI MI 14598900 6.0
## 4 3 ADITYA BAJAJ MI 14959604 6.0
## 5 4 PATRICK H SCHILLING MI 12616049 5.5
## 6 5 HANSHI ZUO MI 14601533 5.5
## 7 6 HANSEN SONG OH 15055204 5.0
## Pre Rating Post Rating Avg Pre Rating
## 2 1794 1817 0
## 3 1553 1663 0
## 4 1384 1640 0
## 5 1716 1744 0
## 6 1655 1690 0
## 7 1686 1687 0
df_new[39,] #double checking on a single Player
## Pair Player Name State USCF ID Total
## 40 39 JOEL R HENDON MI 12923035 3.0
## Pre Rating Post Rating Avg Pre Rating
## 40 1436 1413 0
# Extracting numerical values from rounds played (opponents against)
opp1 <- data.frame(as.numeric(str_extract_all(df$`Round 1`,"[[:digit:]]{1,}")))
opp2 <- data.frame(as.numeric(str_extract_all(df$`Round 2`,"[[:digit:]]{1,}")))
opp3 <- data.frame(as.numeric(str_extract_all(df$`Round 3`,"[[:digit:]]{1,}")))
opp4 <- data.frame(as.numeric(str_extract_all(df$`Round 4`,"[[:digit:]]{1,}")))
opp5 <- data.frame(as.numeric(str_extract_all(df$`Round 5`,"[[:digit:]]{1,}")))
opp6 <- data.frame(as.numeric(str_extract_all(df$`Round 6`,"[[:digit:]]{1,}")))
opp7 <- data.frame(as.numeric(str_extract_all(df$`Round 7`,"[[:digit:]]{1,}")))
# build an Opponents data frame
opponents <- cbind(df$Pair, df$`Player Name`, opp1, opp2, opp3, opp4, opp5, opp6, opp7)
names(opponents) <- c("Pair","Player Name","Opp 1","Opp 2","Opp 3","Opp 4","Opp 5","Opp 6","Opp 7")
head(opponents) # Opponents only Dataframe
## Pair Player Name Opp 1 Opp 2 Opp 3 Opp 4 Opp 5
## 1 1 GARY HUA 39 21 18 14 7
## 2 2 DAKSHESH DARURI 63 58 4 17 16
## 3 3 ADITYA BAJAJ 8 61 25 21 11
## 4 4 PATRICK H SCHILLING 23 28 2 26 5
## 5 5 HANSHI ZUO 45 37 12 13 4
## 6 6 HANSEN SONG 34 29 11 35 10
## Opp 6 Opp 7
## 1 12 4
## 2 20 7
## 3 13 12
## 4 19 1
## 5 14 17
## 6 27 21
# Detecting any NA replaced and by 0.
opponents[is.na(opponents)] <- 0
head(opponents)
## Pair Player Name Opp 1 Opp 2 Opp 3 Opp 4 Opp 5
## 1 1 GARY HUA 39 21 18 14 7
## 2 2 DAKSHESH DARURI 63 58 4 17 16
## 3 3 ADITYA BAJAJ 8 61 25 21 11
## 4 4 PATRICK H SCHILLING 23 28 2 26 5
## 5 5 HANSHI ZUO 45 37 12 13 4
## 6 6 HANSEN SONG 34 29 11 35 10
## Opp 6 Opp 7
## 1 12 4
## 2 20 7
## 3 13 12
## 4 19 1
## 5 14 17
## 6 27 21
opponents[is.na(opponents)] <- 0 # NA replacement by zero
count = 0
# Finding number of games played
for(i in 1:nrow(opponents)){
for(j in 3:9){
if(opponents[i,j]>0){
count = count + 1
}
opponents$No_of_Games_Played[i] <- sum(count)
}
count = 0
}
head(opponents)
## Pair Player Name Opp 1 Opp 2 Opp 3 Opp 4 Opp 5
## 1 1 GARY HUA 39 21 18 14 7
## 2 2 DAKSHESH DARURI 63 58 4 17 16
## 3 3 ADITYA BAJAJ 8 61 25 21 11
## 4 4 PATRICK H SCHILLING 23 28 2 26 5
## 5 5 HANSHI ZUO 45 37 12 13 4
## 6 6 HANSEN SONG 34 29 11 35 10
## Opp 6 Opp 7 No_of_Games_Played
## 1 12 4 7
## 2 20 7 7
## 3 13 12 7
## 4 19 1 7
## 5 14 17 7
## 6 27 21 7
#Resetting Avg Pre Rating column to zero before calculating their averages
df_new$`Avg Pre Rating` <- 0
# Compute Average Pre-Rating for each player
for (k in 3:9){
for (j in 1:nrow(df_new)){
for (i in 1:nrow(df_new)){
if (as.numeric(opponents[j,k]) == as.numeric(df_new$Pair[i])){
df_new$`Avg Pre Rating`[j] <- as.numeric(df_new$`Avg Pre Rating`[j]) + as.numeric(df_new$`Pre Rating`[i])
}
}
}
}
# Each player's average based on the number of played games
df_new$`Avg Pre Rating` <- round(as.numeric(df_new$`Avg Pre Rating`) / opponents$No_of_Games_Played,0)
df_new[1:10,]
## Pair Player Name State USCF ID Total
## 2 1 GARY HUA ON 15445895 6.0
## 3 2 DAKSHESH DARURI MI 14598900 6.0
## 4 3 ADITYA BAJAJ MI 14959604 6.0
## 5 4 PATRICK H SCHILLING MI 12616049 5.5
## 6 5 HANSHI ZUO MI 14601533 5.5
## 7 6 HANSEN SONG OH 15055204 5.0
## 8 7 GARY DEE SWATHELL MI 11146376 5.0
## 9 8 EZEKIEL HOUGHTON MI 15142253 5.0
## 10 9 STEFANO LEE ON 14954524 5.0
## 11 10 ANVIT RAO MI 14150362 5.0
## Pre Rating Post Rating Avg Pre Rating
## 2 1794 1817 1605
## 3 1553 1663 1469
## 4 1384 1640 1564
## 5 1716 1744 1574
## 6 1655 1690 1501
## 7 1686 1687 1519
## 8 1649 1673 1372
## 9 1641 1657 1468
## 10 1411 1564 1523
## 11 1365 1544 1554
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
require(ggthemes)
## Loading required package: ggthemes
# Scatter plot to Check if there's any reasonable trend?
p1 <- ggplot(df_new, aes(df_new$'Pre Rating', y = df_new$'Post Rating')) + geom_point(aes(color=State))+theme_economist()
p1 <- ggplotly(p1)
p1
fit <- lm(df_new$'Avg Pre Rating' ~ df_new$'Pre Rating', data = df_new)
p3 <- ggplot(df_new, aes(df_new$'Pre Rating', y = df_new$'Avg Pre Rating')) +
geom_point(aes(color=df_new$Total)) + theme_economist() +
stat_smooth(method = "lm", col = "red") +
ggtitle(paste("Adj R2 = ",signif(summary(fit)$adj.r.squared, 5),";
Intercept =",signif(fit$coef[[1]],5 ),";
Slope =",signif(fit$coef[[2]], 5),";
P =",signif(summary(fit)$coef[2,4], 5)))
p3 <- ggplotly(p3)
p3
By States: There appears to have a positive correlation between player’s pre tournament ratings vs.
opponents’ post tournament ratings. By states, they appear high-ranking players or very good
players were retaining their ranking after the tournament; Does this mean that good players
were playing lower ranked players and therefore were able to maintain their high status? That is a
question we shall seek to answer with other type of charts, if possible.
The 2nd chart showed a comparison between the player’s own pre-rating vs. their opponents’ average
pre-rating. The color corresponds to the total number of points scored in the tournament.
All observations above the red diagonal line represent players who had a higher pre-rating than
the average of their opponents’ pre tournament ratings, and the points below the line is the
reverse.
Possible Conclusion from Scatterplots?
chart 1 showed players across states had maintained their ranking status post-tournament;
alluding to speculation that high-ranking players might be playing against lower-ranking players.
Chart 2 gives suggest that its almost half of highly ranked players were matched against lower ranked
players. This is visually discerning by the Red regression line which is almost flat as evidenced by
small slope of 0.128, see 2nd scatter plot.
#histogram to confirm that first scatter plot (chart 1) was misleading
hist(df_new$`Avg Pre Rating`, main="Avg Opponents Pre Tournament Ratings", xlab="opponents' Avg pre rating")
hist(df_new$`Pre Rating`, main="Players' Own Pre Tournament Ratings", xlab="Players' own Pre rating")
hist(df_new$`Post Rating`, main="Players' Own Post Tournament Ratings", xlab="Players' own Post rating")
The first histogram which is normally distributed, showed a lower avg. pre-tournament ranking for
the opponents of around 1350. The 2nd histogram, showed a left-skewed distribution with a
higher pre-tournament ranking of around 1500. The 3rd histogram also showed a left-skwed
distribution with a higher post-tournament ranking, albeit, being more spread out.
The 2 histogram charts, suggested that given, the distributions of the opponents’s and the players’
ranking, its is more likely than not that they were match up against lower-ranked players and thus
able to retained their post-tournament status fairly well.
ggplot(df_new, aes(df$'State',df$`Pre Rating`)) +geom_boxplot()
ggplot(df_new, aes(df$'State',df$`Post Rating`)) +geom_boxplot()
Perhaps, Box plots does a better job in explaining the rankings of players before and after the
tournament.
By States again: Except for Ohio which had only 1 entry (therefore no distribution), both Michigan
and Ontario players’ had approximately the same median pre-tournment rankings vs. post tournment
rankings. Also, they had a smaller pre-tournament spread while their post-tournament spread
increased.
#Creating a data table
datatable(df_new, rownames=FALSE)
#Writing to CSV file
write.table(df_new,file="Sufian_Suwarman_Project1.csv",row.names=FALSE,col.names=TRUE,sep=",")