INTRODUCTION :-

In this article, I would like to analyze the ATHLETICS Dataset.This data set contains the results for the 25 competitors of the 1988 Olympics women’s heptathlon in Seoul.

EDA of input data :-

input_df <- tibble(read.csv("heptathlon.csv"))
input_df <-add_column(input_df ,Player_ID = c(1:25) , .before = 1)
head(input_df)
## # A tibble: 6 x 9
##   Player_ID hurdles highjump  shot run200m longjump javelin run800m score
##       <int>   <dbl>    <dbl> <dbl>   <dbl>    <dbl>   <dbl>   <dbl> <int>
## 1         1    12.7     1.86  15.8    22.6     7.27    45.7    129.  7291
## 2         2    12.8     1.8   16.2    23.6     6.71    42.6    126.  6897
## 3         3    13.2     1.83  14.2    23.1     6.68    44.5    124.  6858
## 4         4    13.6     1.8   15.2    23.9     6.25    42.8    132.  6540
## 5         5    13.5     1.74  14.8    23.9     6.32    47.5    128.  6540
## 6         6    13.8     1.83  13.5    24.6     6.33    42.8    126.  6411

The data table has 25observations (each row has the results of one women player). And each player was observed in 7 different types of games (independent variables) and given score (dependent variable) based on scores in 7 games.

str(input_df)
## tibble [25 x 9] (S3: tbl_df/tbl/data.frame)
##  $ Player_ID: int [1:25] 1 2 3 4 5 6 7 8 9 10 ...
##  $ hurdles  : num [1:25] 12.7 12.8 13.2 13.6 13.5 ...
##  $ highjump : num [1:25] 1.86 1.8 1.83 1.8 1.74 1.83 1.8 1.8 1.83 1.77 ...
##  $ shot     : num [1:25] 15.8 16.2 14.2 15.2 14.8 ...
##  $ run200m  : num [1:25] 22.6 23.6 23.1 23.9 23.9 ...
##  $ longjump : num [1:25] 7.27 6.71 6.68 6.25 6.32 6.33 6.37 6.47 6.11 6.28 ...
##  $ javelin  : num [1:25] 45.7 42.6 44.5 42.8 47.5 ...
##  $ run800m  : num [1:25] 129 126 124 132 128 ...
##  $ score    : int [1:25] 7291 6897 6858 6540 6540 6411 6351 6297 6252 6252 ...

All the 7 different games are as follows.

  1. hurdles : results 100m hurdles.
  2. highjump : results high jump.
  3. shot : results shot.
  4. run200m : results 200m race.
  5. longjump : results long jump.
  6. javelin : results javelin.
  7. run800m : results 800m race.

Finally , each player will be total score (depend upon performance in each game)

a.score : total score.

summary(input_df[,2:9])
##     hurdles         highjump          shot          run200m     
##  Min.   :12.69   Min.   :1.500   Min.   :10.00   Min.   :22.56  
##  1st Qu.:13.47   1st Qu.:1.770   1st Qu.:12.32   1st Qu.:23.92  
##  Median :13.75   Median :1.800   Median :12.88   Median :24.83  
##  Mean   :13.84   Mean   :1.782   Mean   :13.12   Mean   :24.65  
##  3rd Qu.:14.07   3rd Qu.:1.830   3rd Qu.:14.20   3rd Qu.:25.23  
##  Max.   :16.42   Max.   :1.860   Max.   :16.23   Max.   :26.61  
##     longjump        javelin         run800m          score     
##  Min.   :4.880   Min.   :35.68   Min.   :124.2   Min.   :4566  
##  1st Qu.:6.050   1st Qu.:39.06   1st Qu.:132.2   1st Qu.:5746  
##  Median :6.250   Median :40.28   Median :134.7   Median :6137  
##  Mean   :6.152   Mean   :41.48   Mean   :136.1   Mean   :6091  
##  3rd Qu.:6.370   3rd Qu.:44.54   3rd Qu.:138.5   3rd Qu.:6351  
##  Max.   :7.270   Max.   :47.50   Max.   :163.4   Max.   :7291
plot_hist <- function(colu , colu_name)
{
hist(colu,
     main = paste("Distribution of " , colu_name),
     sub = paste("MEAN =  " , round(mean(colu),digits = 2) , "\nSD =  " , round(sd(colu),digits = 2)),
     xlab = " "
     )
}
par(mfrow=c(2,4))

list_of_col_names = names(input_df)

for (i in 2:9){
  plot_hist(input_df[[i]] , list_of_col_names[i])
}

All the numerical variables are in different types of distributions. We must do Normalization to be uniform.

Analyzing co-releation between the variables :-

cor_mat <- cor(input_df[,2:9])
cor_mat
##               hurdles     highjump       shot    run200m    longjump
## hurdles   1.000000000 -0.811402536 -0.6513347  0.7737205 -0.91213362
## highjump -0.811402536  1.000000000  0.4407861 -0.4876637  0.78244227
## shot     -0.651334688  0.440786140  1.0000000 -0.6826704  0.74307300
## run200m   0.773720543 -0.487663685 -0.6826704  1.0000000 -0.81720530
## longjump -0.912133617  0.782442273  0.7430730 -0.8172053  1.00000000
## javelin  -0.007762549  0.002153016  0.2689888 -0.3330427  0.06710841
## run800m   0.779257110 -0.591162823 -0.4196196  0.6168101 -0.69951116
## score    -0.923198458  0.767358719  0.7996987 -0.8648825  0.95043678
##               javelin     run800m      score
## hurdles  -0.007762549  0.77925711 -0.9231985
## highjump  0.002153016 -0.59116282  0.7673587
## shot      0.268988837 -0.41961957  0.7996987
## run200m  -0.333042722  0.61681006 -0.8648825
## longjump  0.067108409 -0.69951116  0.9504368
## javelin   1.000000000  0.02004909  0.2531466
## run800m   0.020049088  1.00000000 -0.7727757
## score     0.253146604 -0.77277571  1.0000000
corrplot(cor_mat , is.corr = FALSE )

ggpairs(input_df[,2:9])

From the above, co-relation analysis we can observe very clearly that the output variable score is in high co-relation with almost all the variables ( except javelin ).

It means , the players are giving consistence results in all games ( Good player in long jump is doing good in other games as well except javelin). There are few players, who are doing exceptionally good in only javelin but not in all other games.

Because of the high co-releation between all the varaibles, we can apply PCA and reduce the number of input variables to predict the output (score).

Applying PCA :-

pca <- prcomp(input_df[,2:8] , scale. = TRUE)
pca
## Standard deviations (1, .., p=7):
## [1] 2.1119364 1.0928497 0.7218131 0.6761411 0.4952441 0.2701029 0.2213617
## 
## Rotation (n x k) = (7 x 7):
##                 PC1         PC2         PC3         PC4         PC5         PC6
## hurdles   0.4528710 -0.15792058 -0.04514996  0.02653873 -0.09494792 -0.78334101
## highjump -0.3771992  0.24807386  0.36777902 -0.67999172 -0.01879888 -0.09939981
## shot     -0.3630725 -0.28940743 -0.67618919 -0.12431725 -0.51165201  0.05085983
## run200m   0.4078950  0.26038545  0.08359211 -0.36106580 -0.64983404  0.02495639
## longjump -0.4562318  0.05587394 -0.13931653 -0.11129249  0.18429810 -0.59020972
## javelin  -0.0754090 -0.84169212  0.47156016 -0.12079924 -0.13510669  0.02724076
## run800m   0.3749594 -0.22448984 -0.39585671 -0.60341130  0.50432116  0.15555520
##                  PC7
## hurdles  -0.38024707
## highjump -0.43393114
## shot     -0.21762491
## run200m   0.45338483
## longjump  0.61206388
## javelin   0.17294667
## run800m   0.09830963

Summary of PCA :-

summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6    PC7
## Standard deviation     2.1119 1.0928 0.72181 0.67614 0.49524 0.27010 0.2214
## Proportion of Variance 0.6372 0.1706 0.07443 0.06531 0.03504 0.01042 0.0070
## Cumulative Proportion  0.6372 0.8078 0.88223 0.94754 0.98258 0.99300 1.0000

We can observe that, by the first 4 principal components we can explain 94.75 % of total variation in the input variables. Let us gothrough the Eigenvalue chart, to get indepth evaluation of Eien calues and corresponding variance explained by each principal component.

plot(pca)

To get more clarity on the Variances (or) Eigen values, let us visulize the same chart with Factoextra package.

get_eigenvalue(pca)
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.46027516       63.7182165                    63.71822
## Dim.2 1.19432056       17.0617222                    80.77994
## Dim.3 0.52101413        7.4430590                    88.22300
## Dim.4 0.45716683        6.5309546                    94.75395
## Dim.5 0.24526674        3.5038106                    98.25776
## Dim.6 0.07295558        1.0422226                    99.29999
## Dim.7 0.04900101        0.7000144                   100.00000
colSums(get_eigenvalue(pca))
##                  eigenvalue            variance.percent 
##                      7.0000                    100.0000 
## cumulative.variance.percent 
##                    625.0329
fviz_eig(pca ,
         choice = "eigenvalue",
         addlabels = TRUE)

I assume, it is better to go with 4 Principal components. But, before that, let us see the co-releation between input variables to set of principal components.

BIPLOT :-

biplot(pca)

fviz_pca_var(pca ,
                col.var = "black")

### evaluation the importance of input variables (cos2 values):-

var <- get_pca_var(pca)
var
## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description                                    
## 1 "$coord"   "Coordinates for the variables"                
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"                       
## 4 "$contrib" "contributions of the variables"
var$cos2
##               Dim.1       Dim.2       Dim.3        Dim.4        Dim.5
## hurdles  0.91476758 0.029785053 0.001062097 0.0003219845 2.211106e-03
## highjump 0.63460465 0.073499252 0.070473107 0.2113887890 8.667676e-05
## shot     0.58796078 0.100032303 0.238224240 0.0070654116 6.420783e-02
## run200m  0.74209329 0.080975631 0.003640660 0.0596001581 1.035723e-01
## longjump 0.92839513 0.003728546 0.010112412 0.0056624769 8.330678e-03
## javelin  0.02536343 0.846111164 0.115857382 0.0066711870 4.477054e-03
## run800m  0.62709031 0.060188609 0.081644235 0.1664568182 6.238110e-02
##                 Dim.6        Dim.7
## hurdles  4.476723e-02 0.0070849493
## highjump 7.208246e-04 0.0092267051
## shot     1.887158e-04 0.0023207172
## run200m  4.543828e-05 0.0100725393
## longjump 2.541390e-02 0.0183568643
## javelin  5.413736e-05 0.0014656472
## run800m  1.765337e-03 0.0004735841
fviz_cos2(pca , choice = "var" , axes = 1:4)

fviz_pca_var(pca,
             col.var = "cos2",
             repel = TRUE
)

head(pca$x)
##            PC1         PC2        PC3         PC4        PC5         PC6
## [1,] -4.121448 -1.24240435 -0.3699131 -0.02300174  0.4260062 -0.33932922
## [2,] -2.882186 -0.52372600 -0.8974147  0.47545176 -0.7030659  0.23808730
## [3,] -2.649634 -0.67876243  0.4591767  0.67962860  0.1055252 -0.23919071
## [4,] -1.343351 -0.69228324 -0.5952704  0.14067052 -0.4539282  0.09180564
## [5,] -1.359026 -1.75316563  0.1507013  0.83595001 -0.6871948  0.12630397
## [6,] -1.043847  0.07940725  0.6745305  0.20557253 -0.7379335 -0.35578939
##             PC7
## [1,]  0.3479213
## [2,]  0.1440158
## [3,] -0.1296478
## [4,] -0.4865780
## [5,]  0.2394820
## [6,] -0.1034143
transfomed_df <- tibble( player_id = input_df$Player_ID,
                         x1 = pca$x[,1],
                         x2 = pca$x[,2],
                         x3 = pca$x[,3],
                         x4 = pca$x[,4],
                         score = input_df$score)
transfomed_df
## # A tibble: 25 x 6
##    player_id     x1      x2      x3      x4 score
##        <int>  <dbl>   <dbl>   <dbl>   <dbl> <int>
##  1         1 -4.12  -1.24   -0.370  -0.0230  7291
##  2         2 -2.88  -0.524  -0.897   0.475   6897
##  3         3 -2.65  -0.679   0.459   0.680   6858
##  4         4 -1.34  -0.692  -0.595   0.141   6540
##  5         5 -1.36  -1.75    0.151   0.836   6540
##  6         6 -1.04   0.0794  0.675   0.206   6411
##  7         7 -1.10   0.324   0.0734  0.486   6351
##  8         8 -0.923  0.807  -0.812   0.0302  6297
##  9         9 -0.530 -0.146  -0.161  -0.616   6252
## 10        10 -0.760  0.526  -0.183   0.668   6252
## # ... with 15 more rows

Fitting Linear Regression Model :-

Fitting on input data :-
liner_model <- lm(formula = score~. ,
                data = input_df[,2:9])
summary(liner_model)
## 
## Call:
## lm(formula = score ~ ., data = input_df[, 2:9])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -159.048   -7.442    0.258   19.353   68.714 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5178.382    988.166   5.240 6.65e-05 ***
## hurdles      -86.760     44.880  -1.933 0.070051 .  
## highjump    1187.875    289.251   4.107 0.000736 ***
## shot          86.703     11.484   7.550 7.95e-07 ***
## run200m      -82.561     25.307  -3.262 0.004587 ** 
## longjump     331.925     73.278   4.530 0.000296 ***
## javelin       20.772      3.640   5.706 2.57e-05 ***
## run800m      -14.773      1.948  -7.585 7.49e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.82 on 17 degrees of freedom
## Multiple R-squared:  0.995,  Adjusted R-squared:  0.9929 
## F-statistic: 482.1 on 7 and 17 DF,  p-value: < 2.2e-16
Fitting on transformed data (PC = 4) :-
transformed_lm <- lm(formula = score~. ,
                data = transfomed_df[,2:6])
summary(transformed_lm)
## 
## Call:
## lm(formula = score ~ ., data = transfomed_df[, 2:6])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -216.347  -10.875    4.637   27.069   58.811 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6090.600     11.202 543.689  < 2e-16 ***
## x1          -266.774      5.414 -49.278  < 2e-16 ***
## x2           -50.917     10.462  -4.867 9.34e-05 ***
## x3             4.066     15.840   0.257      0.8    
## x4            -4.339     16.910  -0.257      0.8    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56.01 on 20 degrees of freedom
## Multiple R-squared:  0.9919, Adjusted R-squared:  0.9903 
## F-statistic:   613 on 4 and 20 DF,  p-value: < 2.2e-16