Testing with fewer number of H. tripartitus specimens

Basic starting materials

Load Packages

Here, I load the packages that will be used later in this Rmd file.

Read Data

Here, I read in the data that is used in this file. This is the wing venation coordinate data, which was created after a generalized procrustes analysis (GPA) was run, and it was annoated in Excel to add columns for location and species.

Also, in this section I drop all of the unidentified specimens and a large portion of Halictus tripartitus specimens. One of the questions we had was whether or not having one species so vastly overrepresented would skew the data. Here, there are equal numbers of H. tripartitus and H. ligatus, with 37 specimens each. There are only 5 H. farinosus and 4 H. rubicundus. A future question could be to run these analyses with only four of each specimen.

#read in the procrustes data for all groups combined
##note: this spreadsheet was created outside of R, in Excel, by compiling the 
##5 csv files from above, then comparing to CCBER gbif Halictus data to add columns
##for location, and sex, based on the specimen's catalog number
all_CSVdata <- read.csv("../../../../charliethrift/my_project/halictus_coord_data.csv")
data31dec <- all_CSVdata #assign the data to a new variable

#manipulate the data
##drop some of the tripartitus specimens
data31dec <- data31dec[c(1:83), c(1:24)] #drop all unidentified and some tripartitus
plot(data31dec$species, main="Number of specimens for each species") #visualize the numbers of each species

PCA and Plotting Biplots

Here, I run a principal component analysis (PCA) with the coordinate data. Next, I find the Eigenvalues, and summarize the PCA to show the proportion of explained variance for each principal component. In this case, there are four Eigenvalues greater than 1.

Then, I visualize the resulting PCA with multiple different biplots. These show the distributions of the four species across various axes, with 1 and 2, 2 and 3, and 1 and 3. Axes titles are on each biplot, along with their relative percentage of explained variation.

data31dec.pca <- prcomp(data31dec[,c(6:23)], center = TRUE, scale. = TRUE) #run pca

#How many eigenvalues are greater than 1?
pc31 <- data31dec.pca #reassign pca output to variable
ev31 <- pc31$sdev^2 #square the standard deviaion column
ev31 #print the output. These are the eigenvalues
##  [1] 6.650905e+00 4.649456e+00 2.486525e+00 1.098183e+00 8.166959e-01
##  [6] 5.357134e-01 4.191032e-01 3.610464e-01 2.684377e-01 2.057353e-01
## [11] 1.615989e-01 1.405949e-01 1.263561e-01 4.792687e-02 3.036479e-02
## [16] 1.357296e-03 1.025029e-15 2.501554e-16
summary(data31dec.pca) #summary of the PCA output. This shows each component and it's proportion of explained variance
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.5789 2.1563 1.5769 1.04794 0.90371 0.73192 0.64738
## Proportion of Variance 0.3695 0.2583 0.1381 0.06101 0.04537 0.02976 0.02328
## Cumulative Proportion  0.3695 0.6278 0.7659 0.82695 0.87232 0.90208 0.92537
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.60087 0.51811 0.45358 0.40199 0.37496 0.35547 0.21892
## Proportion of Variance 0.02006 0.01491 0.01143 0.00898 0.00781 0.00702 0.00266
## Cumulative Proportion  0.94542 0.96034 0.97177 0.98074 0.98856 0.99558 0.99824
##                           PC15    PC16      PC17      PC18
## Standard deviation     0.17425 0.03684 3.202e-08 1.582e-08
## Proportion of Variance 0.00169 0.00008 0.000e+00 0.000e+00
## Cumulative Proportion  0.99992 1.00000 1.000e+00 1.000e+00
ggbiplot(data31dec.pca,
  ellipse=TRUE,  
  labels=rownames(data31dec$species), 
  var.axes = FALSE, 
  alpha = 0.75,
  groups=data31dec$species)+
  ggtitle("Halictus bees pca (by species)")+ 
  theme_minimal()+
  theme(legend.position = "bottom")+
  scale_colour_manual(name="species", values= c("midnightblue", "goldenrod",
                                                "olivedrab", "cornflowerblue")) 

ggbiplot(data31dec.pca,
  ellipse=TRUE,  
  choices = 2:3, #AXES 2 and 3
  labels=rownames(data31dec$species), 
  var.axes = FALSE, 
  alpha = 0.75,
  groups=data31dec$species)+
  ggtitle("Halictus bees pca (by species)")+ 
  theme_minimal()+
  theme(legend.position = "bottom")+
  scale_colour_manual(name="species", values= c("midnightblue", "goldenrod",
                                                "olivedrab", "cornflowerblue"))

ggbiplot(data31dec.pca,
  ellipse=TRUE,  
  choices = c(1,3), #AXES 1 and 3
  labels=rownames(data31dec$species), 
  var.axes = FALSE, 
  alpha = 0.75,
  groups=data31dec$species)+
  ggtitle("Halictus bees pca (by species)")+ 
  theme_minimal()+
  theme(legend.position = "bottom")+
  scale_colour_manual(name="species", values= c("midnightblue", "goldenrod",
                                                "olivedrab", "cornflowerblue"))

Calculating the Mahalanobis Distance

The Mahalanobis Distance is the distance between centroids of different groups. Here, it is being used to quantify the distance between the centroids of each species, and attempt to demonstrate that each group (species) is significantly different.

Calculating Mahalanobis Distance for individual species

In this first chunk of code, I calculate the Mahalanobis distances of each species individually. This can be useful in finding outliers.

##in this first section, calculate Mahalanobis distances for just H. tripartitus

data_tri <- subset(data31dec, species == "tripartitus") #subset for one species
df_tri <- data.frame("1x" = c(data_tri$coords.1.X),
                 "1y" = c(data_tri$coords.1.Y),
                 "2x" = c(data_tri$coords.2.X),
                 "2y" = c(data_tri$coords.2.Y),
                 "3x" = c(data_tri$coords.3.X),
                 "3y" = c(data_tri$coords.3.Y),
                 "4x" = c(data_tri$coords.4.X),
                 "4y" = c(data_tri$coords.4.Y))
mahalanobis(df_tri, colMeans(df_tri), cov(df_tri))
##  [1]  7.506330  4.890008  4.768245  8.626552  6.789839 10.880780  7.669493
##  [8]  8.827057  3.796735  5.911865 14.768868  4.363702 11.904302 12.057424
## [15]  5.923140 10.635114  5.484355  6.178994 17.520963 12.825308  5.267835
## [22] 10.448189  5.102018  8.761533  7.031358  7.320576  9.799867  3.902118
## [29]  7.466337  5.279924  7.941903  8.007271  4.237518  6.980213  7.641518
## [36]  7.910951  3.571796
D2_tri <- mahalanobis(df_tri, colMeans(df_tri), cov(df_tri))
df_tri$mahal <- mahalanobis(df_tri, colMeans(df_tri), cov(df_tri))
df_tri$p <- pchisq(df_tri$mahal, df=3, lower.tail=FALSE)
df_tri
##          X1x        X1y       X2x         X2y       X3x       X3y        X4x
## 1  0.3282624 -0.2546946 0.3359610 -0.06647976 0.1522543 0.1194981 0.11479703
## 2  0.3532232 -0.2512949 0.3471683 -0.07215020 0.1413144 0.1130226 0.11870453
## 3  0.3477460 -0.2505306 0.3474606 -0.06457094 0.1401620 0.1210712 0.10844814
## 4  0.3362406 -0.2390162 0.3571751 -0.07760787 0.1413765 0.1223869 0.10878333
## 5  0.3265859 -0.2440126 0.3425447 -0.06784464 0.1517270 0.1285215 0.12182990
## 6  0.3430901 -0.2503855 0.3380952 -0.06389916 0.1397008 0.1174078 0.10966534
## 7  0.3507780 -0.2545985 0.3422989 -0.06746132 0.1394682 0.1189172 0.11990439
## 8  0.3498234 -0.2627552 0.3393620 -0.08124156 0.1328066 0.1209698 0.10667735
## 9  0.3357520 -0.2560567 0.3424408 -0.07378704 0.1423764 0.1191059 0.11585471
## 10 0.3343582 -0.2561430 0.3475591 -0.07383785 0.1405986 0.1318678 0.10594901
## 11 0.3667817 -0.2474274 0.3444861 -0.06984594 0.1449875 0.1137762 0.11079006
## 12 0.3375299 -0.2500759 0.3511115 -0.07637392 0.1378136 0.1178150 0.10913486
## 13 0.3369228 -0.2577899 0.3424774 -0.07885573 0.1282445 0.1299238 0.09622912
## 14 0.3514514 -0.2524421 0.3512904 -0.07084590 0.1510305 0.1113280 0.09937673
## 15 0.3384778 -0.2471052 0.3570409 -0.07372561 0.1460211 0.1155094 0.11605880
## 16 0.3278319 -0.2586977 0.3405950 -0.07028280 0.1539180 0.1303506 0.10548366
## 17 0.3269700 -0.2499592 0.3413758 -0.06899060 0.1550838 0.1243322 0.11297110
## 18 0.3308020 -0.2548937 0.3373529 -0.06907781 0.1439188 0.1290812 0.12066567
## 19 0.3330977 -0.2469988 0.3346642 -0.07703741 0.1494872 0.1100195 0.12152724
## 20 0.3297562 -0.2305525 0.3588426 -0.07342140 0.1524817 0.1205569 0.11932371
## 21 0.3487096 -0.2508476 0.3432036 -0.06933416 0.1476595 0.1206561 0.10725735
## 22 0.3548658 -0.2546778 0.3573827 -0.07879597 0.1373407 0.1091848 0.10853890
## 23 0.3490985 -0.2511497 0.3555138 -0.07393865 0.1353622 0.1258032 0.10322252
## 24 0.3424238 -0.2548425 0.3410028 -0.07568246 0.1309728 0.1298917 0.09644036
## 25 0.3501431 -0.2517707 0.3441710 -0.06994254 0.1422350 0.1162141 0.10937520
## 26 0.3426917 -0.2487740 0.3539381 -0.06330591 0.1459534 0.1197106 0.11382779
## 27 0.3500903 -0.2627408 0.3454546 -0.07407319 0.1317697 0.1233730 0.09747761
## 28 0.3428526 -0.2597990 0.3402671 -0.06942464 0.1435422 0.1256354 0.10939862
## 29 0.3336931 -0.2499521 0.3402392 -0.06327859 0.1461607 0.1232568 0.10621127
## 30 0.3365859 -0.2585594 0.3351642 -0.06929794 0.1411369 0.1220780 0.10580243
## 31 0.3357505 -0.2533722 0.3378794 -0.06407535 0.1397446 0.1238343 0.11892150
## 32 0.3190968 -0.2470819 0.3426371 -0.06564601 0.1495733 0.1285911 0.12426549
## 33 0.3460153 -0.2572123 0.3459085 -0.07166835 0.1435112 0.1233982 0.09874490
## 34 0.3165116 -0.2480381 0.3459115 -0.06815119 0.1514910 0.1268577 0.11977219
## 35 0.3387249 -0.2540752 0.3456845 -0.07840134 0.1281957 0.1268983 0.10100547
## 36 0.3384945 -0.2346914 0.3575630 -0.06738241 0.1486836 0.1207324 0.12258600
## 37 0.3500409 -0.2498178 0.3532464 -0.07672427 0.1359892 0.1147446 0.10988875
##          X4y     mahal            p
## 1  0.2180719  7.506330 0.0573960388
## 2  0.2038859  4.890008 0.1800307444
## 3  0.2087407  4.768245 0.1895754873
## 4  0.2154815  8.626552 0.0346910857
## 5  0.2068089  6.789839 0.0789067024
## 6  0.2064586 10.880780 0.0123883263
## 7  0.2018084  7.669493 0.0533597093
## 8  0.2078336  8.827057 0.0316808534
## 9  0.2072551  3.796735 0.2842660779
## 10 0.2177386  5.911865 0.1159778102
## 11 0.2028917 14.768868 0.0020252090
## 12 0.2134033  4.363702 0.2247747621
## 13 0.2238126 11.904302 0.0077182748
## 14 0.2097414 12.057424 0.0071890176
## 15 0.2065568  5.923140 0.1154101031
## 16 0.2163247 10.635114 0.0138718016
## 17 0.2096869  5.484355 0.1395773690
## 18 0.2096932  6.178994 0.1032191780
## 19 0.2060268 17.520963 0.0005521251
## 20 0.2044362 12.825308 0.0050300274
## 21 0.2134025  5.267835 0.1532032191
## 22 0.2079953 10.448189 0.0151165088
## 23 0.2113329  5.102018 0.1644774789
## 24 0.2113628  8.761533 0.0326353455
## 25 0.2172144  7.031358 0.0709049756
## 26 0.2104965  7.320576 0.0623523346
## 27 0.2043982  9.799867 0.0203462374
## 28 0.2130544  3.902118 0.2722294400
## 29 0.2173085  7.466337 0.0584297350
## 30 0.2091082  5.279924 0.1524104201
## 31 0.2150893  7.941903 0.0472277810
## 32 0.2181875  8.007271 0.0458616819
## 33 0.2153460  4.237518 0.2369323369
## 34 0.2109926  6.980213 0.0725311304
## 35 0.2099348  7.641518 0.0540316056
## 36 0.2071918  7.910951 0.0478883735
## 37 0.2091375  3.571796 0.3115691363
plot(df_tri$mahal)

plot(density(D2_tri, bw = 0.5),
     main="Squared Mahalanobis distances (H. tripartitus), n=100, p=3") ; rug(D2_tri)

qqplot(qchisq(ppoints(100), df = 3), D2_tri,
       main = expression("Q-Q plot of Mahalanobis" * ~D^2 *
                         " vs. quantiles of" * ~ chi[3]^2))
abline(0, 1, col = 'gray')

#Now, calculating the Mahalanobis distance for H. ligatus
ligatus <- subset(data31dec, species == "ligatus")
ligatus_m <- data.frame("1x" = c(ligatus$coords.1.X),
                 "1y" = c(ligatus$coords.1.Y),
                 "2x" = c(ligatus$coords.2.X),
                 "2y" = c(ligatus$coords.2.Y),
                 "3x" = c(ligatus$coords.3.X),
                 "3y" = c(ligatus$coords.3.Y),
                 "4x" = c(ligatus$coords.4.X),
                 "4y" = c(ligatus$coords.4.Y),
                 "5x" = c(ligatus$coords.5.X),
                 "5y" = c(ligatus$coords.5.Y))
lig_D2 <- mahalanobis(ligatus_m, colMeans(ligatus_m), cov(ligatus_m))
plot(density(lig_D2, bw = 0.5),
     main="Squared Mahalanobis distances (H. ligatus), n=100, p=3") ; rug(lig_D2)

qqplot(qchisq(ppoints(100), df = 3), lig_D2,
       main = expression("Q-Q plot of Mahalanobis" * ~D^2 *
                         " vs. quantiles of" * ~ chi[3]^2))
abline(0, 1, col = 'gray')

lig_D2
##  [1] 10.179974 12.728666 12.763265  3.479276  9.800355 10.754483  4.264868
##  [8]  3.851849  5.400079  9.495387 12.275793 20.104606 17.326589  5.111329
## [15]  6.580271 10.411752  3.310573  8.517619  4.273423 11.490903 11.727760
## [22]  7.410153 10.518373 13.291983  7.453283  6.231800  6.426344  8.287128
## [29] 17.274881  7.637883  6.325030 10.535250 18.051224 13.671471 13.188268
## [36] 10.453935  9.394174
#Now, trying with rubicundus
rubicundus <- subset(data31dec, species == "rubicundus")
rubicundus_m <- data.frame("1x" = c(rubicundus$coords.1.X),
                 "1y" = c(rubicundus$coords.1.Y),
                 "2x" = c(rubicundus$coords.2.X)) #note: can not calculate the Mahalobis distance if there are more factors than records. Here: 4 specimens, so 3 factors(coordinates)
Sx <- cov(rubicundus_m)
rub_D2 <- mahalanobis(rubicundus_m, colMeans(rubicundus_m), Sx)
plot(density(rub_D2, bw = 0.5),
     main="Squared Mahalanobis distances (H. rubicundus), n=100, p=3") ; rug(rub_D2)

qqplot(qchisq(ppoints(100), df = 3), rub_D2,
       main = expression("Q-Q plot of Mahalanobis" * ~D^2 *
                         " vs. quantiles of" * ~ chi[3]^2))
abline(0, 1, col = 'gray')

rub_D2
## [1] 2.25 2.25 2.25 2.25

Calculating Mahalanobis Distances across species

Here, I calculate and compare Mahalanobis Distances across the four species. First, a new matrix is made with only 3 of the 18 coordinate columns from the GPA output. Only 3 can be used because a mahalanobis test can only be run with \(n-1\) predictors, when \(n\) is the number of specimens per group. When groups are species, and there are only 4 H. rubicundus specimens, there can only be 3 predictors.

#generating mahalanobis distances with GPA output
df <- data.frame("1x" = c(data31dec$coords.1.X),
                 "1y" = c(data31dec$coords.1.Y),
                 "2x" = c(data31dec$coords.2.X))
D2 <- mahalanobis(df, colMeans(df), cov(df))
df$mahal <- mahalanobis(df, colMeans(df), cov(df))
df$p <- pchisq(df$mahal, df=3, lower.tail=FALSE)
df
##          X1x        X1y       X2x      mahal           p
## 1  0.3154535 -0.2512801 0.3844496  6.2502548 0.100049670
## 2  0.3251570 -0.2464167 0.3913553  6.3691190 0.094969525
## 3  0.3267691 -0.2489586 0.3904002  6.6658108 0.083347761
## 4  0.3180556 -0.2490913 0.3896816  7.0523977 0.070246298
## 5  0.3126086 -0.2520825 0.3884492  8.3513776 0.039281451
## 6  0.3146967 -0.2646927 0.3498299  3.5514317 0.314152608
## 7  0.3243696 -0.2624141 0.3291201  3.2527183 0.354277779
## 8  0.3270782 -0.2515200 0.3449047  0.9369426 0.816504785
## 9  0.3227187 -0.2585763 0.3451408  1.2403818 0.743336269
## 10 0.3345095 -0.2553075 0.3659497  1.8008607 0.614747658
## 11 0.3325569 -0.2496330 0.3674250  0.7996366 0.849553946
## 12 0.3241794 -0.2540538 0.3587760  0.7267002 0.866906546
## 13 0.3248948 -0.2582592 0.3555563  1.0580164 0.787217524
## 14 0.3283828 -0.2581865 0.3455763  0.6810124 0.877660675
## 15 0.3166052 -0.2624426 0.3496975  2.6594961 0.447154444
## 16 0.3326764 -0.2531998 0.3500025  0.1083022 0.990822843
## 17 0.3102636 -0.2587453 0.3393815  4.8045586 0.186680622
## 18 0.3040418 -0.2563512 0.3430453  6.9294134 0.074181964
## 19 0.3338929 -0.2534127 0.3623028  0.7340468 0.865168062
## 20 0.3249702 -0.2536657 0.3626195  0.9263077 0.819074967
## 21 0.3196687 -0.2560182 0.3527807  1.3222991 0.723844869
## 22 0.3286380 -0.2521113 0.3585833  0.2383390 0.971174890
## 23 0.3280527 -0.2501051 0.3529945  0.2820868 0.963361491
## 24 0.3328367 -0.2560285 0.3628922  1.3434755 0.718836172
## 25 0.3277996 -0.2486318 0.3456414  1.2363438 0.744301408
## 26 0.3288572 -0.2537471 0.3592814  0.4215958 0.935746055
## 27 0.3263178 -0.2588651 0.3542199  1.0043558 0.800197992
## 28 0.3263227 -0.2547168 0.3509105  0.4164725 0.936819037
## 29 0.3384560 -0.2410906 0.3918390  6.0290448 0.110205649
## 30 0.3249715 -0.2555635 0.3415156  1.2601992 0.738605158
## 31 0.3306029 -0.2482764 0.3506783  0.4326827 0.933411256
## 32 0.3239474 -0.2577151 0.3593631  1.3594726 0.715061454
## 33 0.3277331 -0.2515794 0.3507111  0.3289932 0.954486524
## 34 0.3172586 -0.2248965 0.3690016 11.7742314 0.008197954
## 35 0.3359769 -0.2576593 0.3579646  1.3422850 0.719117402
## 36 0.3392727 -0.2552947 0.3575501  1.0874394 0.780107097
## 37 0.3280512 -0.2605483 0.3608886  2.2444195 0.523252183
## 38 0.3264594 -0.2573960 0.3585506  1.0549923 0.787948701
## 39 0.3344122 -0.2320353 0.3685472  3.5989431 0.308154438
## 40 0.3343986 -0.2257955 0.3599793  8.3747522 0.038869522
## 41 0.3123067 -0.2599113 0.3627368  4.0659856 0.254426470
## 42 0.3267752 -0.2542349 0.3672720  1.5633374 0.667731095
## 43 0.3528220 -0.2220932 0.3894537  9.7851775 0.020483307
## 44 0.3398019 -0.2220109 0.3949384  9.2480975 0.026167841
## 45 0.3572043 -0.2170664 0.3903176 13.0320140 0.004567882
## 46 0.3450145 -0.2223839 0.3964227  9.7103657 0.021195587
## 47 0.3282624 -0.2546946 0.3359610  1.8583283 0.602325039
## 48 0.3532232 -0.2512949 0.3471683  3.0529944 0.383526441
## 49 0.3477460 -0.2505306 0.3474606  1.6563085 0.646690686
## 50 0.3362406 -0.2390162 0.3571751  1.7147417 0.633661363
## 51 0.3265859 -0.2440126 0.3425447  3.4488730 0.327453906
## 52 0.3430901 -0.2503855 0.3380952  2.0046482 0.571442517
## 53 0.3507780 -0.2545985 0.3422989  2.8747803 0.411337218
## 54 0.3498234 -0.2627552 0.3393620  4.4691463 0.215058935
## 55 0.3357520 -0.2560567 0.3424408  0.6908995 0.875342356
## 56 0.3343582 -0.2561430 0.3475591  0.3847186 0.943380875
## 57 0.3667817 -0.2474274 0.3444861  8.1048734 0.043893290
## 58 0.3375299 -0.2500759 0.3511115  0.1991012 0.977734262
## 59 0.3369228 -0.2577899 0.3424774  0.8975525 0.826018405
## 60 0.3514514 -0.2524421 0.3512904  2.6857514 0.442654127
## 61 0.3384778 -0.2471052 0.3570409  0.2559564 0.968087241
## 62 0.3278319 -0.2586977 0.3405950  1.0860099 0.780452369
## 63 0.3269700 -0.2499592 0.3413758  1.7866966 0.617834168
## 64 0.3308020 -0.2548937 0.3373529  1.3810325 0.709986791
## 65 0.3330977 -0.2469988 0.3346642  3.5404130 0.315558389
## 66 0.3297562 -0.2305525 0.3588426  6.1572080 0.104207247
## 67 0.3487096 -0.2508476 0.3432036  2.1506173 0.541740605
## 68 0.3548658 -0.2546778 0.3573827  4.8537980 0.182821193
## 69 0.3490985 -0.2511497 0.3555138  2.0659745 0.558826964
## 70 0.3424238 -0.2548425 0.3410028  1.3394713 0.719782254
## 71 0.3501431 -0.2517707 0.3441710  2.4015467 0.493346766
## 72 0.3426917 -0.2487740 0.3539381  0.6314177 0.889203834
## 73 0.3500903 -0.2627408 0.3454546  4.7914429 0.187721401
## 74 0.3428526 -0.2597990 0.3402671  2.0403775 0.564068629
## 75 0.3336931 -0.2499521 0.3402392  1.3974824 0.706124978
## 76 0.3365859 -0.2585594 0.3351642  1.6403900 0.650266731
## 77 0.3357505 -0.2533722 0.3378794  1.3051201 0.727917613
## 78 0.3190968 -0.2470819 0.3426371  3.9941505 0.262096458
## 79 0.3460153 -0.2572123 0.3459085  2.0478202 0.562540990
## 80 0.3165116 -0.2480381 0.3459115  3.7375458 0.291234650
## 81 0.3387249 -0.2540752 0.3456845  0.5665825 0.904038090
## 82 0.3384945 -0.2346914 0.3575630  3.2650497 0.352536768
## 83 0.3500409 -0.2498178 0.3532464  2.0649923 0.559027466
plot(df$mahal)

plot(density(D2, bw = 0.5),
     main="Squared Mahalanobis distances, n=100, p=3") ; rug(D2)

qqplot(qchisq(ppoints(100), df = 3), D2,
       main = expression("Q-Q plot of Mahalanobis" * ~D^2 *
                         " vs. quantiles of" * ~ chi[3]^2))
abline(0, 1, col = 'gray')

groups1 <- data31dec$species

#pairwise mahalanobis for all four species, with just the three coordinates of data
Mahala1 <- pairwise.mahalanobis(df, groups1, digits = 3)
## Warning in pairwise.mahalanobis(df, groups1, digits = 3): group unknown is empty
D1 <- sqrt(Mahala1$distance)
m1 <- D1
rownames(m1) <- list("farinosus", "ligatus", "rubicundus", "tripartitus")
colnames(m1) <- list("farinosus", "ligatus", "rubicundus", "tripartitus")
m1
##             farinosus    ligatus rubicundus tripartitus
## farinosus     0.00000 14.7873529   11.81418  14.3149057
## ligatus      14.78735  0.0000000   26.59635   0.4814642
## rubicundus   11.81418 26.5963466    0.00000  26.1254029
## tripartitus  14.31491  0.4814642   26.12540   0.0000000
#Now, pairwise with just tripartitus and ligatus since they have enough specimens to use all 18 coordinates
data_trip_lig <- subset(data31dec, species == "tripartitus"
                        | species == "ligatus")
groups_trip_lig <- data_trip_lig$species
df_trip_lig <- data.frame("1x" = c(data_trip_lig$coords.1.X),
                 "1y" = c(data_trip_lig$coords.1.Y),
                 "2x" = c(data_trip_lig$coords.2.X),
                 "2y" = c(data_trip_lig$coords.2.Y),
                 "3x" = c(data_trip_lig$coords.3.X),
                 "3y" = c(data_trip_lig$coords.3.Y),
                 "4x" = c(data_trip_lig$coords.4.X),
                 "4y" = c(data_trip_lig$coords.4.Y),
                 "5x" = c(data_trip_lig$coords.5.X),
                 "5y" = c(data_trip_lig$coords.5.Y),
                 "6x" = c(data_trip_lig$coords.6.X),
                 "6y" = c(data_trip_lig$coords.6.Y),
                 "7x" = c(data_trip_lig$coords.7.X),
                 "7y" = c(data_trip_lig$coords.7.Y),
                 "8x" = c(data_trip_lig$coords.8.X),
                 "8y" = c(data_trip_lig$coords.8.Y))
Mahala_trip_lig <- pairwise.mahalanobis(df_trip_lig, groups_trip_lig, digits = 3)
## Warning in pairwise.mahalanobis(df_trip_lig, groups_trip_lig, digits = 3):
## groups farinosus rubicundus unknown are empty
D_trip_lig <- sqrt(Mahala_trip_lig$distance)
rownames(D_trip_lig) <- list("ligatus", "tripartitus")
colnames(D_trip_lig) <- list("ligatus", "tripartitus")
D_trip_lig 
##               ligatus tripartitus
## ligatus     0.0000000   0.1235779
## tripartitus 0.1235779   0.0000000
New method of calculating Mahalanobis Distance: PCA instead of GPA

Here, I try to calculate the Mahalanobis distance using the results of the PCA output instead of the GPA as I did in the prior code chunk. I’m not sure if this is viable, since the pairwise mahalanobis is being run on the $x column of data for the PCA.

## Pairwise mahalanobis distances between the four species
## This uses the PCA output instead of the GPA output
#### I don't think that's right though
data31dec.pca <- prcomp(data31dec[,c(6:23)], center = TRUE, scale. = TRUE) #creates the PCA
newgroups <- data31dec$species #make grouping variable for subsequent pairwise.mahalanobis
Mahala2 <- pairwise.mahalanobis(data31dec.pca$x, newgroups, digits = 5) #NOTE: this is only using the $x output of the PCA, not the whole thing. Possibly incorrect, especially compared to the GPA output equivalent from earlier.
## Warning in pairwise.mahalanobis(data31dec.pca$x, newgroups, digits = 5): group
## unknown is empty
D <- sqrt(Mahala2$distance) #assign the $distance output from the pairwise.mahalanobis to a variable
rownames(D) <- list("farinosus", "ligatus", "rubicundus", "tripartitus") #give species names to matrix
colnames(D) <- list("farinosus", "ligatus", "rubicundus", "tripartitus") #give species names to matrix
D #print this new matrix of mahalanobis distances
##             farinosus  ligatus rubicundus tripartitus
## farinosus    0.000000 4.762392   8.332613    6.412909
## ligatus      4.762392 0.000000   7.699758    4.870414
## rubicundus   8.332613 7.699758   0.000000    9.486401
## tripartitus  6.412909 4.870414   9.486401    0.000000

Building a UPGMA Tree

Here, I build two different UPGMA dendrograms. First is the tree made with the Mahalanobis distances among the four species using the output of the Generalized Procrustes Analysis (GPA). Second is the tree made with the Mahalanobis distances among the four species using the output of the Principal Component Analysis (PCA). Note that these distances came from the $x field of the PCA data, and not the complete thing, which may be incorrect (see prior code chunk).

As shown, the trees are very similar. Placement of the four species and two nodes is identical, but the distances of branches are not.

#building tree with GPA output mahalanobis
tre <- nj(m1) #neighbor joining method tree (m1 is the GPA version output)
class(tre) #what class of tree (it will by phylogenetic)
## [1] "phylo"
tre <- ladderize(tre) #making the steps
tre #describes the tree
## 
## Phylogenetic tree with 4 tips and 2 internal nodes.
## 
## Tip labels:
##   farinosus, ligatus, rubicundus, tripartitus
## 
## Unrooted; includes branch lengths.
plot(tre, cex = 1) #plot the tree. cex refers to text size
title("Tree: Built with Mahalanobis Distances of GPA output")

#building tree with PCA output mahalanobis
tre <- nj(D) #neighbor joining method tree (D is the PCA version output)
class(tre) #what class of tree (it will by phylogenetic)
## [1] "phylo"
tre <- ladderize(tre) #making the steps
tre #describes the tree
## 
## Phylogenetic tree with 4 tips and 2 internal nodes.
## 
## Tip labels:
##   farinosus, ligatus, rubicundus, tripartitus
## 
## Unrooted; includes branch lengths.
plot(tre, cex = 1) #plot the tree. cex refers to text size
title("Tree: Built with Mahalanobis Distances of PCA output")