0. Contents

This report contains seven parts:

1. Basic Visualisation of data - We have 50 readings on four continuous variables and one categorical variable, gender, with 25 male and 25 female individuals.
2. Checking normality, detecting outliers - Both populations can be treated as multivariate normal after removing one outlier from the female set.
3. Confidence intervals and ellipsoids - We report confidence ellipsoids and one-at-a-time, simultaneous and Bonferroni confidence intervals for males and females.
4. Profile Analysis - We plot the profiles of the two populations and test for parallel profiles, equal levels and flatness.
5. Principal Component Analysis - We perform PCA on the four continuous variables; the first two components explain 92.6% of the variation.
6. Factor Analysis - We calculate loadings, uniquenesses and factor scores for a two-factor model.
7. Discrimination and Classification - We perform LDA and QDA, evaluating each on the whole data set and by Lachenbruch’s hold-out method. Because the covariance matrices of the two populations are unequal, QDA performs slightly better.

We now load the required packages before beginning the first part: MVN, MASS, biotools, profileR, psych, car and ggbiplot.

1. Basic Visualisation

We load the data and show the first six rows.

##   Resting (L/min) Resting (ml/kg/min) Max (L/min) Max (ml/kg/min) Gender
## 1            0.34                3.71        2.87           30.87   male
## 2            0.39                5.08        3.38           43.85   male
## 3            0.48                5.13        4.13           44.51   male
## 4            0.31                3.95        3.60           46.00   male
## 5            0.36                5.51        3.11           47.02   male
## 6            0.33                4.07        3.95           48.50   male

We look at the structure of the data and at the scatter matrix.

## 'data.frame':    50 obs. of  5 variables:
##  $ Resting (L/min)    : num  0.34 0.39 0.48 0.31 0.36 0.33 0.43 0.48 0.21 0.32 ...
##  $ Resting (ml/kg/min): num  3.71 5.08 5.13 3.95 5.51 4.07 4.77 6.69 3.71 4.35 ...
##  $ Max (L/min)        : num  2.87 3.38 4.13 3.6 3.11 3.95 4.39 3.5 2.82 3.59 ...
##  $ Max (ml/kg/min)    : num  30.9 43.9 44.5 46 47 ...
##  $ Gender             : chr  "male" "male" "male" "male" ...

We look at the density estimates to get an idea of the difference between the two populations.

## [1] "Resting (L/min)"
##  [1] 0.34 0.39 0.48 0.31 0.36 0.33 0.43 0.48 0.21 0.32 0.54 0.32 0.40 0.31 0.44
## [16] 0.32 0.50 0.36 0.48 0.40 0.42 0.55 0.50 0.34 0.40 0.29 0.28 0.31 0.30 0.28
## [31] 0.11 0.25 0.26 0.39 0.37 0.31 0.35 0.29 0.33 0.18 0.28 0.44 0.22 0.34 0.30
## [46] 0.31 0.27 0.66 0.37 0.35

## Warning in rug(dataset[26:50, 2], col = "red"): some values will be clipped
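
The density comparison described above can be sketched in base R as follows. This is only an illustration: the vector `resting` and the grouping `gender` are simulated stand-ins for the real data (assumed to hold 25 males followed by 25 females), and the colours are arbitrary.

```r
# Hypothetical stand-in for the real data: 25 males followed by 25 females.
set.seed(1)
resting <- c(rnorm(25, mean = 0.40, sd = 0.08),   # rows 1-25: males
             rnorm(25, mean = 0.31, sd = 0.09))   # rows 26-50: females
gender  <- rep(c("male", "female"), each = 25)

# Kernel density estimate for each group on a common range.
rng      <- range(resting) + c(-0.2, 0.2)
d_male   <- density(resting[gender == "male"],   from = rng[1], to = rng[2])
d_female <- density(resting[gender == "female"], from = rng[1], to = rng[2])

# Overlay the two curves and mark the raw observations with rugs.
pdf(NULL)  # discard the graphics output; remove this line to see the plot
plot(d_male, col = "blue", main = "Resting (L/min)",
     xlim = rng, ylim = c(0, max(d_male$y, d_female$y)))
lines(d_female, col = "red")
rug(resting[gender == "male"],   col = "blue")
rug(resting[gender == "female"], col = "red")
invisible(dev.off())
```

Passing the same variable to `rug` that was passed to `density`, and setting `xlim` to cover both groups, keeps every rug mark inside the plotting region.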

2. Normality

To check normality we use the MVN package, which provides three multivariate tests (Mardia, Henze-Zirkler and Royston) in addition to the univariate Shapiro-Wilk test. We test the two populations separately.
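
The univariate part of these checks is the Shapiro-Wilk test, which is also available in base R as `shapiro.test`. The sketch below applies it to every variable within each group. The data frame `dat` and its column names are hypothetical, simulated placeholders, not the report's data.

```r
set.seed(2)
# Hypothetical data frame with the same layout as the report's data set:
# four numeric columns followed by a Gender column.
dat <- data.frame(v1 = rnorm(50), v2 = rnorm(50),
                  v3 = rnorm(50), v4 = rnorm(50),
                  Gender = rep(c("male", "female"), each = 25))

# Shapiro-Wilk p-value for every variable within each group.
# Rows are variables, columns are groups.
pvals <- sapply(split(dat[1:4], dat$Gender),
                function(g) sapply(g, function(x) shapiro.test(x)$p.value))
round(pvals, 4)
```

A variable is treated as compatible with normality when its p-value exceeds the chosen level (0.05 in the output of this report).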

For males, all four variables pass the univariate Shapiro-Wilk test. The sample also passes the Mardia and Royston multivariate tests but fails the Henze-Zirkler test (p = 0.037). Results below.

## $multivariateNormality
##            Test        HZ    p value MVN
## 1 Henze-Zirkler 0.9180322 0.03699581  NO
## 
## $univariateNormality
##           Test            Variable Statistic   p value Normality
## 1 Shapiro-Wilk   Resting (L/min)      0.9612    0.4396    YES   
## 2 Shapiro-Wilk Resting (ml/kg/min)    0.9672    0.5753    YES   
## 3 Shapiro-Wilk     Max (L/min)        0.9316    0.0948    YES   
## 4 Shapiro-Wilk   Max (ml/kg/min)      0.9396    0.1450    YES   
## 
## $Descriptives
##                      n    Mean    Std.Dev Median   Min   Max  25th  75th
## Resting (L/min)     25  0.3972 0.08438602   0.40  0.21  0.55  0.33  0.48
## Resting (ml/kg/min) 25  5.3296 1.06966303   5.13  3.71  7.89  4.58  6.04
## Max (L/min)         25  3.6876 0.67518689   3.56  2.82  5.23  3.11  4.00
## Max (ml/kg/min)     25 49.4204 7.43317871  48.92 30.87 63.30 46.23 55.08
##                            Skew   Kurtosis
## Resting (L/min)      0.02883229 -0.7707939
## Resting (ml/kg/min)  0.33539338 -0.6496398
## Max (L/min)          0.68995043 -0.5116999
## Max (ml/kg/min)     -0.62720924  0.4864554
## $multivariateNormality
##              Test          Statistic           p value Result
## 1 Mardia Skewness   24.7357672005174 0.211727532300499    YES
## 2 Mardia Kurtosis -0.545630313026931 0.585320083024567    YES
## 3             MVN               <NA>              <NA>    YES
## $multivariateNormality
##      Test        H   p value MVN
## 1 Royston 6.275151 0.1749181 YES

For females, only two of the four variables pass the univariate Shapiro-Wilk test. The sample passes the Mardia and Henze-Zirkler multivariate tests, the latter only narrowly (p = 0.060), but fails the Royston test. Results below.

## $multivariateNormality
##            Test        HZ   p value MVN
## 1 Henze-Zirkler 0.8897817 0.0597649 YES
## 
## $univariateNormality
##           Test            Variable Statistic   p value Normality
## 1 Shapiro-Wilk   Resting (L/min)      0.8723    0.0040    NO    
## 2 Shapiro-Wilk Resting (ml/kg/min)    0.8401    0.0009    NO    
## 3 Shapiro-Wilk     Max (L/min)        0.9704    0.6330    YES   
## 4 Shapiro-Wilk   Max (ml/kg/min)      0.9501    0.2334    YES   
## 
## $Descriptives
##                      n       Mean    Std.Dev Median   Min   Max    25th    75th
## Resting (L/min)     26  0.3169231 0.09813335  0.305  0.11  0.66  0.2800  0.3500
## Resting (ml/kg/min) 26  5.1557692 1.63805537  5.070  1.74 11.05  4.5550  5.5425
## Max (L/min)         26  2.3346154 0.35424546  2.320  1.71  3.06  2.0475  2.5075
## Max (ml/kg/min)     26 37.9365385 4.85478975 38.085 28.97 51.80 35.2575 39.4300
##                          Skew   Kurtosis
## Resting (L/min)     1.2400422  3.8401198
## Resting (ml/kg/min) 1.4206652  4.5251063
## Max (L/min)         0.2901591 -0.6692302
## Max (ml/kg/min)     0.5957739  0.8991398
## $multivariateNormality
##              Test        Statistic            p value Result
## 1 Mardia Skewness 30.4959073335655 0.0622068480637173    YES
## 2 Mardia Kurtosis 1.21646295329905  0.223808615267926    YES
## 3             MVN             <NA>               <NA>    YES
## $multivariateNormality
##      Test        H     p value MVN
## 1 Royston 22.56494 8.12886e-05  NO

We check for outliers in the data using the Mahalanobis distance and find that the 48th reading is a strong candidate for an outlier (at the 99% level).

## [1] 48
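
A minimal sketch of this kind of check, assuming the usual chi-square cutoff for squared Mahalanobis distances, is given below. The matrix `X` is simulated, and an outlier is planted in row 48 purely for illustration. The exact call used for the report is not shown, so this should not be read as the original code.

```r
set.seed(3)
# Hypothetical 50 x 4 data matrix with one planted outlier in row 48.
X <- matrix(rnorm(200), nrow = 50, ncol = 4)
X[48, ] <- c(6, 6, -6, 6)

# Squared Mahalanobis distance of each row from the sample mean.
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# Under multivariate normality d2 is approximately chi-square with p = 4
# degrees of freedom, so rows beyond the 99% quantile are flagged.
cutoff   <- qchisq(0.99, df = ncol(X))
outliers <- which(d2 > cutoff)
outliers
```

Because the outlier itself inflates the sample mean and covariance, a robust estimate of location and scatter can be substituted when several outliers are suspected.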

So we remove the 48th reading from the female population and check normality again. All four variables now pass the univariate Shapiro-Wilk test, and the sample passes all three multivariate normality tests.

## $multivariateNormality
##              Test         Statistic            p value Result
## 1 Mardia Skewness  30.3724577468973 0.0640391265398032    YES
## 2 Mardia Kurtosis 0.358992219117516  0.719600910083357    YES
## 3             MVN              <NA>               <NA>    YES
## 
## $univariateNormality
##           Test            Variable Statistic   p value Normality
## 1 Shapiro-Wilk   Resting (L/min)      0.9560    0.3642    YES   
## 2 Shapiro-Wilk Resting (ml/kg/min)    0.9487    0.2542    YES   
## 3 Shapiro-Wilk     Max (L/min)        0.9622    0.4838    YES   
## 4 Shapiro-Wilk   Max (ml/kg/min)      0.9483    0.2491    YES   
## 
## $Descriptives
##                      n       Mean    Std.Dev Median   Min   Max    25th    75th
## Resting (L/min)     24  0.2991667 0.06870964  0.300  0.11  0.44  0.2775  0.3425
## Resting (ml/kg/min) 24  4.9341667 1.15774487  5.070  1.74  7.32  4.5275  5.4275
## Max (L/min)         24  2.3150000 0.35460940  2.315  1.71  3.06  2.0175  2.5025
## Max (ml/kg/min)     24 38.1054167 4.92021825 38.085 28.97 51.80 35.6325 39.5800
##                           Skew   Kurtosis
## Resting (L/min)     -0.5865727  0.8504798
## Resting (ml/kg/min) -0.5767599  1.0202208
## Max (L/min)          0.3657211 -0.5682945
## Max (ml/kg/min)      0.5771586  0.8545083
## $multivariateNormality
##      Test        H   p value MVN
## 1 Royston 6.361645 0.1422686 YES
## $multivariateNormality
##            Test        HZ   p value MVN
## 1 Henze-Zirkler 0.8187303 0.1449586 YES

So now our data can be considered multivariate normal.

We do not transform the data, as they are sufficiently normal after removing the 48th reading. A Box-Cox transformation does not improve normality, so we omit it from the report.

3. Confidence intervals and ellipsoids

Confidence Ellipsoid

Males

Females

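One-at-a-time Confidence Intervals

In each block of output in the rest of this part, the first line is the vector of sample means, the second line is a single variance and the table gives the interval limits. All four intervals in a block have the same half-width, which suggests that this one variance (for males it equals the variance of Max (ml/kg/min) shown in Part 7) was applied to every variable, so the limits for the first three variables should be read with caution.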

Males

## [1]  0.3972  5.3296  3.6876 49.4204
## [1] 55.25215
##   lower_limit upper_limit
## 1  -4.1887339    4.983134
## 2   0.7436661    9.915534
## 3  -0.8983339    8.273534
## 4  44.8344661   54.006334

Females

## [1]  0.3136  5.1788  2.3152 38.1548
## [1] 23.26083
##   lower_limit upper_limit
## 1  -2.6619399     3.28914
## 2   2.2032601     8.15434
## 3  -0.6603399     5.29074
## 4  35.1792601    41.13034

Simultaneous Confidence Intervals

Males

## [1]  0.3972  5.3296  3.6876 49.4204
## [1] 55.25215
##   lower_limit upper_limit
## 1 -5.00073619    5.795136
## 2 -0.06833619   10.727536
## 3 -1.71033619    9.085536
## 4 44.02246381   54.818336

Females

## [1]  0.3136  5.1788  2.3152 38.1548
## [1] 23.26083
##   lower_limit upper_limit
## 1     -3.1888      3.8160
## 2      1.6764      8.6812
## 3     -1.1872      5.8176
## 4     34.6524     41.6572

Bonferroni Confidence Intervals

Males

## [1]  0.3972  5.3296  3.6876 49.4204
## [1] 55.25215
##   lower_limit upper_limit
## 1   -2.281192    3.075592
## 2    2.651208    8.007992
## 3    1.009208    6.365992
## 4   46.742008   52.098792

Females

## [1]  0.3136  5.1788  2.3152 38.1548
## [1] 23.26083
##   lower_limit upper_limit
## 1  -1.4242494    2.051449
## 2   3.4409506    6.916649
## 3   0.5773506    4.053049
## 4  36.4169506   39.892649
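
The three kinds of interval in this part differ only in the critical multiplier applied to each variable's standard error. The sketch below uses the standard textbook forms (t quantile, Bonferroni-adjusted t quantile, and the T² multiplier based on the F distribution) with a 95% level. The data matrix `X` is simulated and the level `alpha` is an assumption, so the numbers will not match the report.

```r
set.seed(4)
# Hypothetical sample: n = 25 observations on p = 4 variables.
mu  <- c(0.4, 5.3, 3.7, 49.4)
sdv <- c(0.08, 1.1, 0.7, 7.4)
X <- matrix(rnorm(100, mean = mu, sd = sdv), ncol = 4, byrow = TRUE)

n <- nrow(X); p <- ncol(X); alpha <- 0.05
xbar <- colMeans(X)
se   <- sqrt(diag(cov(X)) / n)   # each variable uses its own variance

# Critical multipliers for the three kinds of interval.
crit <- c(
  one_at_a_time = qt(1 - alpha / 2, df = n - 1),
  bonferroni    = qt(1 - alpha / (2 * p), df = n - 1),
  simultaneous  = sqrt(p * (n - 1) / (n - p) * qf(1 - alpha, p, n - p))
)

# One table of limits per method.
intervals <- lapply(crit, function(k)
  data.frame(lower_limit = xbar - k * se, upper_limit = xbar + k * se))
intervals
```

With these formulas the one-at-a-time intervals are the narrowest and the T² simultaneous intervals the widest, with the Bonferroni intervals in between.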

4. Profile analysis

We run the profile analysis function ‘pbg’ and plot the profiles. Group 1 is male and group 0 is female.

Display the sample means.

## 
## Data Summary:
##          0       1
## v1  0.3136  0.3972
## v2  5.1788  5.3296
## v3  2.3152  3.6876
## v4 38.1548 49.4204

Display the test outputs.

## Call:
## pbg(data = dataset[1:4], group = c(rep(1, 25), rep(0, 25)), profile.plot = TRUE)
## 
## Hypothesis Tests:
## $`Ho: Profiles are parallel`
##   Multivariate.Test Statistic Approx.F num.df den.df     p.value
## 1             Wilks 0.3728944 25.78644      3     46 6.17633e-10
## 2            Pillai 0.6271056 25.78644      3     46 6.17633e-10
## 3  Hotelling-Lawley 1.6817242 25.78644      3     46 6.17633e-10
## 4               Roy 1.6817242 25.78644      3     46 6.17633e-10
## 
## $`Ho: Profiles have equal levels`
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## group        1  129.4   129.4   40.47 7.01e-08 ***
## Residuals   48  153.6     3.2                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## $`Ho: Profiles are flat`
##          F df1 df2      p-value
## 1 875.8325   3  46 1.436427e-40

All three null hypotheses are rejected: the profiles are not parallel, do not have equal levels and are not flat.
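
For two groups, the parallelism test can be written as a Hotelling T² test on successive differences of the variables. The sketch below follows the standard textbook construction on simulated data (`X1` and `X2` are hypothetical). It is meant to show where the 3 and 46 degrees of freedom come from, not to reproduce the internals of ‘pbg’.

```r
set.seed(5)
# Hypothetical data: two groups of 25 observations on p = 4 variables.
p  <- 4
X1 <- matrix(rnorm(25 * p), ncol = p)              # group 1
X2 <- matrix(rnorm(25 * p, mean = 0.5), ncol = p)  # group 2
n1 <- nrow(X1); n2 <- nrow(X2)

# Contrast matrix of successive differences (v2 - v1, v3 - v2, v4 - v3).
C <- diff(diag(p))

# Pooled covariance matrix and difference of the mean vectors.
Sp <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)
d  <- colMeans(X1) - colMeans(X2)

# Hotelling's T^2 for H0: C (mu1 - mu2) = 0, i.e. parallel profiles.
Cd <- C %*% d
T2 <- (n1 * n2 / (n1 + n2)) * drop(t(Cd) %*% solve(C %*% Sp %*% t(C), Cd))

# Exact F transformation with (p - 1, n1 + n2 - p) degrees of freedom.
df1   <- p - 1
df2   <- n1 + n2 - p
Fstat <- T2 * df2 / ((n1 + n2 - 2) * df1)
pval  <- pf(Fstat, df1, df2, lower.tail = FALSE)
c(F = Fstat, df1 = df1, df2 = df2, p.value = pval)
```

The same relation links the statistics printed above: the Hotelling-Lawley value is T² divided by n1 + n2 - 2, and multiplying it by 46/3 gives the reported approximate F.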

5. Principal Component Analysis

We perform PCA on the four continuous variables using ‘prcomp’ function and check results.

## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.5698 1.1136 0.53994 0.06261
## Proportion of Variance 0.6161 0.3100 0.07288 0.00098
## Cumulative Proportion  0.6161 0.9261 0.99902 1.00000

We see that only two of the standard deviations are greater than one, so two PCs are important; together they explain 92.6% of the variation.
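
The squared standard deviations in the summary sum to 4, which is consistent with PCA on standardised variables. A sketch of that setup and of the "standard deviation greater than one" rule is given below; the matrix `X` is simulated and the variable names are placeholders.

```r
set.seed(6)
# Hypothetical 50 x 4 data built from two correlated pairs of variables.
z <- matrix(rnorm(100), ncol = 2)
X <- cbind(z[, 1], z[, 1], z[, 2], z[, 2]) +
  matrix(rnorm(200, sd = 0.3), ncol = 4)
colnames(X) <- paste0("v", 1:4)

# PCA on standardised variables (equivalently, on the correlation matrix).
pca <- prcomp(X, center = TRUE, scale. = TRUE)
summary(pca)

# Eigenvalues are the squared standard deviations; they sum to p = 4.
eig     <- pca$sdev^2
keep    <- sum(eig > 1)            # retain components with eigenvalue > 1
cum_var <- cumsum(eig) / sum(eig)
c(components_kept = keep, variance_explained = cum_var[keep])
```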

We can plot the scree plot as well.

These are the eigenvectors (loadings).

##                           PC1        PC2        PC3        PC4
## Resting (L/min)     0.5535272  0.3738597 -0.4912732  0.5590055
## Resting (ml/kg/min) 0.4184354  0.6475264  0.4040129 -0.4923363
## Max (L/min)         0.5131163 -0.4885243 -0.4302058 -0.5594449
## Max (ml/kg/min)     0.5052041 -0.4497583  0.6405835  0.3635095

Now we see the correlation between the first two PCs and the original variables.

##          r1         r2
## 1 0.8689537  0.4163337
## 2 0.6568801  0.7210915
## 3 0.8055147 -0.5440253
## 4 0.7930937 -0.5008550

We can see that the first PC has large positive correlations with all four variables but the second one has positive correlation with the resting oxygen consumption rates and negative with the max oxygen consumption rates.
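
For standardised variables, the correlation between a variable and a PC is the corresponding loading multiplied by that PC's standard deviation. The short check below uses only the loadings and standard deviations printed in this part and reproduces the correlation table up to rounding.

```r
# Standard deviations and loadings of the first two PCs as printed above.
sdev <- c(1.5698, 1.1136)
loadings <- rbind(
  "Resting (L/min)"     = c(0.5535272,  0.3738597),
  "Resting (ml/kg/min)" = c(0.4184354,  0.6475264),
  "Max (L/min)"         = c(0.5131163, -0.4885243),
  "Max (ml/kg/min)"     = c(0.5052041, -0.4497583)
)
colnames(loadings) <- c("PC1", "PC2")

# cor(variable i, PC j) = loading[i, j] * sdev[j] for standardised data.
r <- sweep(loadings, 2, sdev, `*`)
round(r, 4)
```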

Next we look at the biplot. The arrows corresponding to the variables confirm the correlation findings above. Each individual’s scores on the first and second PCs are also plotted on this plane, and we indicate the known grouping of the individuals into male and female.

We can see from the plot that the 48th reading indeed looks like an outlier.

Next we perform PCA for the males and females separately.

For males.

## Importance of components:
##                           PC1    PC2    PC3     PC4
## Standard deviation     1.5607 1.0027 0.7458 0.05353
## Proportion of Variance 0.6089 0.2513 0.1390 0.00072
## Cumulative Proportion  0.6089 0.8602 0.9993 1.00000

For females.

## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.4130 1.2657 0.63237 0.03967
## Proportion of Variance 0.4992 0.4005 0.09997 0.00039
## Cumulative Proportion  0.4992 0.8996 0.99961 1.00000

6. Factor Analysis

We try the two-factor model, as the PCA revealed two eigenvalues greater than 1. We use the maximum likelihood method because the iterative principal component method produces warnings. We present the loadings without rotation first.

## 
## Loadings:
##                     ML2   ML1  
## Resting (L/min)     0.473 0.808
## Resting (ml/kg/min)       0.997
## Max (L/min)         0.995      
## Max (ml/kg/min)     0.815 0.220
## 
##                  ML2   ML1
## SS loadings    1.877 1.702
## Proportion Var 0.469 0.426
## Cumulative Var 0.469 0.895

We see the first factor loads positively on all variables whereas the second factor loads negatively on Resting rates and positively on Max rates.

Next we fit the orthogonal model again with Varimax rotation and see the factor loadings.

## 
## Loadings:
##                     ML2   ML1  
## Resting (L/min)     0.473 0.808
## Resting (ml/kg/min)       0.997
## Max (L/min)         0.995      
## Max (ml/kg/min)     0.815 0.220
## 
##                  ML2   ML1
## SS loadings    1.877 1.702
## Proportion Var 0.469 0.426
## Cumulative Var 0.469 0.895

We see that the first factor loads mainly on the 3rd and 4th variables (and also on the 1st), whereas the second factor loads on the 1st and 2nd variables. So we see a separation between Resting and Max oxygen consumption rates.

We also calculate the uniquenesses as usual.

##     Resting (L/min) Resting (ml/kg/min)         Max (L/min)     Max (ml/kg/min) 
##         0.123581232         0.004970114         0.004907360         0.288003402
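
For a factor model fitted to standardised variables, each uniqueness is one minus the communality, that is, one minus the sum of squared loadings in that row. The check below uses only the two variables for which both loadings are printed above (the blank entries are suppressed small loadings and are not guessed here). It agrees with the printed uniquenesses to within the rounding of the loadings.

```r
# Loadings of the two variables for which both entries are printed above.
L <- rbind(
  "Resting (L/min)" = c(ML2 = 0.473, ML1 = 0.808),
  "Max (ml/kg/min)" = c(ML2 = 0.815, ML1 = 0.220)
)

# Communality = row sum of squared loadings; uniqueness = 1 - communality.
communality <- rowSums(L^2)
uniqueness  <- 1 - communality
round(uniqueness, 3)
```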

Next we calculate the factor scores.

##          ML2         ML1
## 1 -0.0713375 -1.08817771
## 2  0.4428130 -0.12371747
## 3  1.2986834 -0.08484044
## 4  0.7544542 -0.94732054
## 5  0.1116231  0.17846286
## 6  1.1477198 -0.86731054

7. Discrimination and Classification

First we check the covariance matrices of the two groups, male and female.

##                     Resting (L/min) Resting (ml/kg/min) Max (L/min)
## Resting (L/min)          0.00712100           0.0700030  0.03144717
## Resting (ml/kg/min)      0.07000300           1.1441790  0.14767817
## Max (L/min)              0.03144717           0.1476782  0.45587733
## Max (ml/kg/min)          0.15058033           3.4309085  3.30812183
##                     Max (ml/kg/min)
## Resting (L/min)           0.1505803
## Resting (ml/kg/min)       3.4309085
## Max (L/min)               3.3081218
## Max (ml/kg/min)          55.2521457
##                     Resting (L/min) Resting (ml/kg/min)  Max (L/min)
## Resting (L/min)         0.009630154          0.14593446  0.005678769
## Resting (ml/kg/min)     0.145934462          2.68322538 -0.049495692
## Max (L/min)             0.005678769         -0.04949569  0.125489846
## Max (ml/kg/min)         0.009708923          1.36016477  0.944044615
##                     Max (ml/kg/min)
## Resting (L/min)         0.009708923
## Resting (ml/kg/min)     1.360164769
## Max (L/min)             0.944044615
## Max (ml/kg/min)        23.568983538

They look different. We check again after removing the outlier, i.e. the 48th reading.

##                     Resting (L/min) Resting (ml/kg/min) Max (L/min)
## Resting (L/min)          0.00712100           0.0700030  0.03144717
## Resting (ml/kg/min)      0.07000300           1.1441790  0.14767817
## Max (L/min)              0.03144717           0.1476782  0.45587733
## Max (ml/kg/min)          0.15058033           3.4309085  3.30812183
##                     Max (ml/kg/min)
## Resting (L/min)           0.1505803
## Resting (ml/kg/min)       3.4309085
## Max (L/min)               3.3081218
## Max (ml/kg/min)          55.2521457
##                     Resting (L/min) Resting (ml/kg/min)  Max (L/min)
## Resting (L/min)         0.004721014          0.06867754  0.004273913
## Resting (ml/kg/min)     0.068677536          1.34037319 -0.042439130
## Max (L/min)             0.004273913         -0.04243913  0.125747826
## Max (ml/kg/min)         0.012456884          1.02122862  1.145636957
##                     Max (ml/kg/min)
## Resting (L/min)          0.01245688
## Resting (ml/kg/min)      1.02122862
## Max (L/min)              1.14563696
## Max (ml/kg/min)         24.20854764

To be sure, we use Box’s M test for equality of covariance matrices. This test is sensitive to departures from multivariate normality, so we use the data after deleting the outlier (the 48th reading).

## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  dataset[-48, 1:4]
## Chi-Sq (approx.) = 47.6, df = 10, p-value = 7.343e-07
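
The statistic behind this output can be sketched in base R using the standard chi-square approximation to Box's M. The function `box_m` below is a hypothetical helper written for illustration, not the biotools implementation, and the data are simulated. With p = 4 variables and two groups it gives the same 10 degrees of freedom as above.

```r
# Hypothetical helper: Box's M test with the chi-square approximation.
box_m <- function(X, group) {
  group <- factor(group)
  p <- ncol(X); g <- nlevels(group); N <- nrow(X)
  n <- as.vector(table(group))                      # group sizes
  S <- lapply(split(as.data.frame(X), group), cov)  # per-group covariances
  # Pooled covariance matrix.
  Sp <- Reduce(`+`, Map(function(s, ni) (ni - 1) * s, S, n)) / (N - g)
  logdet <- function(m) as.numeric(determinant(m, logarithm = TRUE)$modulus)
  M <- (N - g) * logdet(Sp) - sum((n - 1) * sapply(S, logdet))
  # Small-sample correction factor.
  corr <- (2 * p^2 + 3 * p - 1) / (6 * (p + 1) * (g - 1)) *
    (sum(1 / (n - 1)) - 1 / (N - g))
  chi2 <- M * (1 - corr)
  df   <- p * (p + 1) * (g - 1) / 2
  c(chi2 = chi2, df = df, p.value = pchisq(chi2, df, lower.tail = FALSE))
}

set.seed(7)
# Simulated groups of 25 and 24 rows with clearly different variances.
X   <- rbind(matrix(rnorm(100), ncol = 4),
             matrix(rnorm(96, sd = 3), ncol = 4))
grp <- rep(c("a", "b"), times = c(25, 24))
res <- box_m(X, grp)
res
```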

The test rejects equality of the covariance matrices (p < 0.001), suggesting that LDA may not be suitable and that QDA should be preferred. However, we do both.
The prior probabilities are 0.5 and 0.5, as the two groups, male and female, are evenly split in the data. The two costs of misclassification are also assumed equal, because there is no reason to suppose otherwise.

LDA

We perform LDA on the entire data set and use it to predict the entire data set. We check the confusion matrix and calculate the APER.

##                
## dataset.predict female male
##          female     23    1
##          male        2   24
## [1] 0.94

The confusion matrix shows that 3 of the 50 readings are misclassified, so the APER is 6% and the success rate is 94%.

We can plot it also.

Next we perform LDA again, this time with Lachenbruch’s hold-out procedure, and again print the confusion matrix and success rate. The success rate drops to 90%.

##         
##          female male
##   female     23    3
##   male        2   22
## [1] 0.9
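
Both evaluations can be sketched with `lda` from MASS: resubstitution predicts the rows the model was fitted on, while `CV = TRUE` returns leave-one-out (hold-out) predictions. The data frame `dat` and its columns are simulated placeholders, so the error rates will differ from those reported.

```r
library(MASS)

set.seed(8)
# Hypothetical two-group data with 25 observations per group.
dat <- data.frame(
  x1 = c(rnorm(25, 0), rnorm(25, 2)),
  x2 = c(rnorm(25, 0), rnorm(25, 2)),
  Gender = factor(rep(c("female", "male"), each = 25))
)

# Resubstitution: fit on all rows, then predict the same rows.
fit  <- lda(Gender ~ x1 + x2, data = dat, prior = c(0.5, 0.5))
pred <- predict(fit, dat)$class
conf_resub <- table(predicted = pred, actual = dat$Gender)
aper <- 1 - sum(diag(conf_resub)) / nrow(dat)   # apparent error rate

# Lachenbruch's hold-out: CV = TRUE gives leave-one-out predictions.
cv <- lda(Gender ~ x1 + x2, data = dat, prior = c(0.5, 0.5), CV = TRUE)
conf_cv <- table(predicted = cv$class, actual = dat$Gender)
err_cv  <- 1 - sum(diag(conf_cv)) / nrow(dat)

c(APER = aper, holdout_error = err_cv)
```

The hold-out error is usually the more honest estimate, because each observation is classified by a rule that was built without it.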

QDA

We repeat the above using QDA, but delete the 48th data point for the sake of normality.

##  [1] male   male   male   male   male   male   male   male   female male  
## [11] male   male   male   male   male   male   male   male   male   male  
## [21] male   male   male   male   male   female female female female female
## [31] female female female female female female female female male   female
## [41] female female female female female female female female female
## Levels: female male
##                
## dataset.predict female male
##          female     23    1
##          male        1   24
## [1] 0.9591837

The confusion matrix gives a success rate of 96%, better than LDA. With Lachenbruch’s hold-out procedure the success rate is 92%.

##         
##          female male
##   female     22    2
##   male        2   23
## [1] 0.9183673
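
The QDA evaluation follows the same pattern using `qda` from MASS. The sketch below uses simulated groups of 25 and 24 rows with unequal spreads (to mimic the removal of one reading). The data frame `dat` and its columns are hypothetical.

```r
library(MASS)

set.seed(9)
# Hypothetical data: 25 males and 24 females with unequal covariances.
dat <- data.frame(
  x1 = c(rnorm(25, 0, 1), rnorm(24, 2, 3)),
  x2 = c(rnorm(25, 0, 1), rnorm(24, 2, 0.5)),
  Gender = factor(rep(c("male", "female"), times = c(25, 24)))
)

# Resubstitution success rate.
fit  <- qda(Gender ~ x1 + x2, data = dat, prior = c(0.5, 0.5))
pred <- predict(fit, dat)$class
conf_resub <- table(predicted = pred, actual = dat$Gender)
acc_resub  <- sum(diag(conf_resub)) / nrow(dat)

# Hold-out (leave-one-out) success rate.
cv <- qda(Gender ~ x1 + x2, data = dat, prior = c(0.5, 0.5), CV = TRUE)
conf_cv <- table(predicted = cv$class, actual = dat$Gender)
acc_cv  <- sum(diag(conf_cv)) / nrow(dat)

c(resubstitution = acc_resub, holdout = acc_cv)
```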