In this article, I would like to analyze the ATHLETICS dataset. This data set contains the results of the 25 competitors in the women's heptathlon at the 1988 Olympic Games in Seoul.
# Packages used in this analysis
library(tibble); library(corrplot); library(GGally); library(factoextra)
input_df <- tibble(read.csv("heptathlon.csv"))
input_df <- add_column(input_df, Player_ID = 1:25, .before = 1)
head(input_df)
## # A tibble: 6 x 9
## Player_ID hurdles highjump shot run200m longjump javelin run800m score
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 12.7 1.86 15.8 22.6 7.27 45.7 129. 7291
## 2 2 12.8 1.8 16.2 23.6 6.71 42.6 126. 6897
## 3 3 13.2 1.83 14.2 23.1 6.68 44.5 124. 6858
## 4 4 13.6 1.8 15.2 23.9 6.25 42.8 132. 6540
## 5 5 13.5 1.74 14.8 23.9 6.32 47.5 128. 6540
## 6 6 13.8 1.83 13.5 24.6 6.33 42.8 126. 6411
The data table has 25 observations (each row holds the results of one female athlete). Each athlete competed in 7 different events (the independent variables) and received a total score (the dependent variable) based on her performance in those 7 events.
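As a quick sanity check on this structure (a minimal base-R sketch), we can confirm the dimensions and verify that there are no missing values before going further.
dim(input_df)            # expect 25 rows and 9 columns
colSums(is.na(input_df)) # expect 0 for every column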
str(input_df)
## tibble [25 x 9] (S3: tbl_df/tbl/data.frame)
## $ Player_ID: int [1:25] 1 2 3 4 5 6 7 8 9 10 ...
## $ hurdles : num [1:25] 12.7 12.8 13.2 13.6 13.5 ...
## $ highjump : num [1:25] 1.86 1.8 1.83 1.8 1.74 1.83 1.8 1.8 1.83 1.77 ...
## $ shot : num [1:25] 15.8 16.2 14.2 15.2 14.8 ...
## $ run200m : num [1:25] 22.6 23.6 23.1 23.9 23.9 ...
## $ longjump : num [1:25] 7.27 6.71 6.68 6.25 6.32 6.33 6.37 6.47 6.11 6.28 ...
## $ javelin : num [1:25] 45.7 42.6 44.5 42.8 47.5 ...
## $ run800m : num [1:25] 129 126 124 132 128 ...
## $ score : int [1:25] 7291 6897 6858 6540 6540 6411 6351 6297 6252 6252 ...
All the 7 events are as follows:
a. hurdles : 100 m hurdles time (seconds)
b. highjump : high jump (metres)
c. shot : shot put (metres)
d. run200m : 200 m run time (seconds)
e. longjump : long jump (metres)
f. javelin : javelin throw (metres)
g. run800m : 800 m run time (seconds)
Finally, each player receives a total score that depends on her performance in each event:
h. score : total score (points).
summary(input_df[,2:9])
## hurdles highjump shot run200m
## Min. :12.69 Min. :1.500 Min. :10.00 Min. :22.56
## 1st Qu.:13.47 1st Qu.:1.770 1st Qu.:12.32 1st Qu.:23.92
## Median :13.75 Median :1.800 Median :12.88 Median :24.83
## Mean :13.84 Mean :1.782 Mean :13.12 Mean :24.65
## 3rd Qu.:14.07 3rd Qu.:1.830 3rd Qu.:14.20 3rd Qu.:25.23
## Max. :16.42 Max. :1.860 Max. :16.23 Max. :26.61
## longjump javelin run800m score
## Min. :4.880 Min. :35.68 Min. :124.2 Min. :4566
## 1st Qu.:6.050 1st Qu.:39.06 1st Qu.:132.2 1st Qu.:5746
## Median :6.250 Median :40.28 Median :134.7 Median :6137
## Mean :6.152 Mean :41.48 Mean :136.1 Mean :6091
## 3rd Qu.:6.370 3rd Qu.:44.54 3rd Qu.:138.5 3rd Qu.:6351
## Max. :7.270 Max. :47.50 Max. :163.4 Max. :7291
# Histogram of one column, annotated with its mean and standard deviation
plot_hist <- function(colu, colu_name) {
  hist(colu,
       main = paste("Distribution of", colu_name),
       sub  = paste("MEAN =", round(mean(colu), digits = 2),
                    "\nSD =", round(sd(colu), digits = 2)),
       xlab = " ")
}

# Plot the 7 events and the score in a 2 x 4 grid
par(mfrow = c(2, 4))
list_of_col_names <- names(input_df)
for (i in 2:9) {
  plot_hist(input_df[[i]], list_of_col_names[i])
}
The numeric variables follow quite different distributions and are measured on very different scales, so we must normalize (standardize) them to put them on a uniform scale.
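As a minimal sketch of what this standardization means (prcomp() will later do the same thing internally via scale. = TRUE), scale() centres each event column and divides it by its standard deviation:
# Standardize the 7 event columns to mean 0 and standard deviation 1
scaled_events <- scale(input_df[, 2:8])
round(colMeans(scaled_events), 10) # means are (numerically) zero
apply(scaled_events, 2, sd)        # standard deviations are all 1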
cor_mat <- cor(input_df[,2:9])
cor_mat
## hurdles highjump shot run200m longjump
## hurdles 1.000000000 -0.811402536 -0.6513347 0.7737205 -0.91213362
## highjump -0.811402536 1.000000000 0.4407861 -0.4876637 0.78244227
## shot -0.651334688 0.440786140 1.0000000 -0.6826704 0.74307300
## run200m 0.773720543 -0.487663685 -0.6826704 1.0000000 -0.81720530
## longjump -0.912133617 0.782442273 0.7430730 -0.8172053 1.00000000
## javelin -0.007762549 0.002153016 0.2689888 -0.3330427 0.06710841
## run800m 0.779257110 -0.591162823 -0.4196196 0.6168101 -0.69951116
## score -0.923198458 0.767358719 0.7996987 -0.8648825 0.95043678
## javelin run800m score
## hurdles -0.007762549 0.77925711 -0.9231985
## highjump 0.002153016 -0.59116282 0.7673587
## shot 0.268988837 -0.41961957 0.7996987
## run200m -0.333042722 0.61681006 -0.8648825
## longjump 0.067108409 -0.69951116 0.9504368
## javelin 1.000000000 0.02004909 0.2531466
## run800m 0.020049088 1.00000000 -0.7727757
## score 0.253146604 -0.77277571 1.0000000
corrplot(cor_mat, is.corr = FALSE) # visualize the correlation matrix
ggpairs(input_df[, 2:9])           # pairwise scatter plots, densities and correlation coefficients
From the above correlation analysis we can observe very clearly that the output variable score is highly correlated with almost all of the input variables (except javelin).
This means the players deliver consistent results across events (a player who is good at the long jump tends to do well in the other events as well, except javelin). A few players do exceptionally well in the javelin alone but not in the other events.
Because of the high correlation between the variables, we can apply PCA and reduce the number of input variables used to predict the output (score).
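Before moving on to PCA, a quick numeric check of the claim about javelin, using the cor_mat computed above:
# Absolute correlation of each event with the total score, lowest first
sort(abs(cor_mat["score", 1:7]))
The javelin correlation stands out as the only weak one, which matches what the pair plots suggest.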
# PCA on the 7 event columns (score excluded); scale. = TRUE standardizes each column first
pca <- prcomp(input_df[, 2:8], scale. = TRUE)
pca
## Standard deviations (1, .., p=7):
## [1] 2.1119364 1.0928497 0.7218131 0.6761411 0.4952441 0.2701029 0.2213617
##
## Rotation (n x k) = (7 x 7):
## PC1 PC2 PC3 PC4 PC5 PC6
## hurdles 0.4528710 -0.15792058 -0.04514996 0.02653873 -0.09494792 -0.78334101
## highjump -0.3771992 0.24807386 0.36777902 -0.67999172 -0.01879888 -0.09939981
## shot -0.3630725 -0.28940743 -0.67618919 -0.12431725 -0.51165201 0.05085983
## run200m 0.4078950 0.26038545 0.08359211 -0.36106580 -0.64983404 0.02495639
## longjump -0.4562318 0.05587394 -0.13931653 -0.11129249 0.18429810 -0.59020972
## javelin -0.0754090 -0.84169212 0.47156016 -0.12079924 -0.13510669 0.02724076
## run800m 0.3749594 -0.22448984 -0.39585671 -0.60341130 0.50432116 0.15555520
## PC7
## hurdles -0.38024707
## highjump -0.43393114
## shot -0.21762491
## run200m 0.45338483
## longjump 0.61206388
## javelin 0.17294667
## run800m 0.09830963
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.1119 1.0928 0.72181 0.67614 0.49524 0.27010 0.2214
## Proportion of Variance 0.6372 0.1706 0.07443 0.06531 0.03504 0.01042 0.0070
## Cumulative Proportion 0.6372 0.8078 0.88223 0.94754 0.98258 0.99300 1.0000
We can observe that the first 4 principal components explain 94.75 % of the total variation in the input variables. Let us go through the eigenvalue chart to get an in-depth view of the eigenvalues and the corresponding variance explained by each principal component.
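These proportions can be reproduced by hand from pca$sdev: each eigenvalue is the squared standard deviation of a component, and dividing the running total by their sum (7, because the variables were standardized) gives the cumulative variance explained. A minimal check:
# Eigenvalues and cumulative proportion of variance, computed manually
eigenvalues <- pca$sdev^2
round(cumsum(eigenvalues) / sum(eigenvalues), 4) # should match the Cumulative Proportion row above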
plot(pca)
To get more clarity on the variances (i.e., the eigenvalues), let us visualize the same chart with the factoextra package.
get_eigenvalue(pca)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.46027516 63.7182165 63.71822
## Dim.2 1.19432056 17.0617222 80.77994
## Dim.3 0.52101413 7.4430590 88.22300
## Dim.4 0.45716683 6.5309546 94.75395
## Dim.5 0.24526674 3.5038106 98.25776
## Dim.6 0.07295558 1.0422226 99.29999
## Dim.7 0.04900101 0.7000144 100.00000
# Sanity check: the eigenvalues sum to 7 (the number of standardized variables) and the variance percentages sum to 100
colSums(get_eigenvalue(pca))
## eigenvalue variance.percent
## 7.0000 100.0000
## cumulative.variance.percent
## 625.0329
fviz_eig(pca ,
choice = "eigenvalue",
addlabels = TRUE)
Based on this, it seems better to go with 4 principal components. But before that, let us look at the correlation between the input variables and the set of principal components.
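For standardized inputs, the correlation between a variable and a principal component is just the loading multiplied by that component's standard deviation; equivalently, we can correlate the original columns with the scores in pca$x. A small sketch before looking at the plots:
# Correlation between each event and the principal components
round(cor(input_df[, 2:8], pca$x), 2)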
biplot(pca)
fviz_pca_var(pca ,
col.var = "black")
### Evaluating the importance of the input variables (cos2 values)
var <- get_pca_var(pca)
var
## Principal Component Analysis Results for variables
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for the variables"
## 2 "$cor" "Correlations between variables and dimensions"
## 3 "$cos2" "Cos2 for the variables"
## 4 "$contrib" "contributions of the variables"
var$cos2
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## hurdles 0.91476758 0.029785053 0.001062097 0.0003219845 2.211106e-03
## highjump 0.63460465 0.073499252 0.070473107 0.2113887890 8.667676e-05
## shot 0.58796078 0.100032303 0.238224240 0.0070654116 6.420783e-02
## run200m 0.74209329 0.080975631 0.003640660 0.0596001581 1.035723e-01
## longjump 0.92839513 0.003728546 0.010112412 0.0056624769 8.330678e-03
## javelin 0.02536343 0.846111164 0.115857382 0.0066711870 4.477054e-03
## run800m 0.62709031 0.060188609 0.081644235 0.1664568182 6.238110e-02
## Dim.6 Dim.7
## hurdles 4.476723e-02 0.0070849493
## highjump 7.208246e-04 0.0092267051
## shot 1.887158e-04 0.0023207172
## run200m 4.543828e-05 0.0100725393
## longjump 2.541390e-02 0.0183568643
## javelin 5.413736e-05 0.0014656472
## run800m 1.765337e-03 0.0004735841
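Because the inputs were standardized, these cos2 values are simply the squared coordinates (i.e., the squared variable-to-component correlations), and for each variable they sum to 1 across the 7 dimensions. A quick consistency check on the var object:
# cos2 should equal the squared coordinates, and each row should sum to (roughly) 1
all.equal(var$cos2, var$coord^2)
rowSums(var$cos2)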
fviz_cos2(pca , choice = "var" , axes = 1:4)
fviz_pca_var(pca,
col.var = "cos2",
repel = TRUE
)
head(pca$x)
## PC1 PC2 PC3 PC4 PC5 PC6
## [1,] -4.121448 -1.24240435 -0.3699131 -0.02300174 0.4260062 -0.33932922
## [2,] -2.882186 -0.52372600 -0.8974147 0.47545176 -0.7030659 0.23808730
## [3,] -2.649634 -0.67876243 0.4591767 0.67962860 0.1055252 -0.23919071
## [4,] -1.343351 -0.69228324 -0.5952704 0.14067052 -0.4539282 0.09180564
## [5,] -1.359026 -1.75316563 0.1507013 0.83595001 -0.6871948 0.12630397
## [6,] -1.043847 0.07940725 0.6745305 0.20557253 -0.7379335 -0.35578939
## PC7
## [1,] 0.3479213
## [2,] 0.1440158
## [3,] -0.1296478
## [4,] -0.4865780
## [5,] 0.2394820
## [6,] -0.1034143
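The scores in pca$x are nothing more than the standardized event results multiplied by the rotation (loadings) matrix, so they can be reproduced by hand. A minimal check, assuming the same pca object as above:
# Rebuild the PC scores manually: standardize the events, then project onto the rotation matrix
scores_manual <- scale(as.matrix(input_df[, 2:8])) %*% pca$rotation
all.equal(unname(scores_manual), unname(pca$x)) # should be TRUE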
# Keep only the first 4 principal components, plus the player id and the observed score
transfomed_df <- tibble(player_id = input_df$Player_ID,
                        x1 = pca$x[, 1],
                        x2 = pca$x[, 2],
                        x3 = pca$x[, 3],
                        x4 = pca$x[, 4],
                        score = input_df$score)
transfomed_df
## # A tibble: 25 x 6
## player_id x1 x2 x3 x4 score
## <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 -4.12 -1.24 -0.370 -0.0230 7291
## 2 2 -2.88 -0.524 -0.897 0.475 6897
## 3 3 -2.65 -0.679 0.459 0.680 6858
## 4 4 -1.34 -0.692 -0.595 0.141 6540
## 5 5 -1.36 -1.75 0.151 0.836 6540
## 6 6 -1.04 0.0794 0.675 0.206 6411
## 7 7 -1.10 0.324 0.0734 0.486 6351
## 8 8 -0.923 0.807 -0.812 0.0302 6297
## 9 9 -0.530 -0.146 -0.161 -0.616 6252
## 10 10 -0.760 0.526 -0.183 0.668 6252
## # ... with 15 more rows
# Baseline model: ordinary least squares on all 7 original events
linear_model <- lm(formula = score ~ .,
                   data = input_df[, 2:9])
summary(linear_model)
##
## Call:
## lm(formula = score ~ ., data = input_df[, 2:9])
##
## Residuals:
## Min 1Q Median 3Q Max
## -159.048 -7.442 0.258 19.353 68.714
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5178.382 988.166 5.240 6.65e-05 ***
## hurdles -86.760 44.880 -1.933 0.070051 .
## highjump 1187.875 289.251 4.107 0.000736 ***
## shot 86.703 11.484 7.550 7.95e-07 ***
## run200m -82.561 25.307 -3.262 0.004587 **
## longjump 331.925 73.278 4.530 0.000296 ***
## javelin 20.772 3.640 5.706 2.57e-05 ***
## run800m -14.773 1.948 -7.585 7.49e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.82 on 17 degrees of freedom
## Multiple R-squared: 0.995, Adjusted R-squared: 0.9929
## F-statistic: 482.1 on 7 and 17 DF, p-value: < 2.2e-16
# Same regression, but on the first 4 principal components instead of the raw events
transformed_lm <- lm(formula = score ~ .,
                     data = transfomed_df[, 2:6])
summary(transformed_lm)
##
## Call:
## lm(formula = score ~ ., data = transfomed_df[, 2:6])
##
## Residuals:
## Min 1Q Median 3Q Max
## -216.347 -10.875 4.637 27.069 58.811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6090.600 11.202 543.689 < 2e-16 ***
## x1 -266.774 5.414 -49.278 < 2e-16 ***
## x2 -50.917 10.462 -4.867 9.34e-05 ***
## x3 4.066 15.840 0.257 0.8
## x4 -4.339 16.910 -0.257 0.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56.01 on 20 degrees of freedom
## Multiple R-squared: 0.9919, Adjusted R-squared: 0.9903
## F-statistic: 613 on 4 and 20 DF, p-value: < 2.2e-16
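As a rough wrap-up (an in-sample sketch only, since both models are fitted and evaluated on the same 25 rows), we can compare the fit of the full 7-variable model with the 4-component model:
# In-sample root-mean-square error of the two models
rmse <- function(model) sqrt(mean(residuals(model)^2))
c(full_model = rmse(linear_model), pca_model = rmse(transformed_lm))
Judging from the summaries above, the PCA model gives up very little fit (adjusted R-squared 0.9903 versus 0.9929), and only the first two components (x1 and x2) are clearly significant, so the dimensionality reduction costs almost nothing here.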