Title: Genome size / chromosome counts for all specimens with chromosome number estimations

Name: Tammy L. Elliott

Date: March 7, 2019

R version 3.5.1

Genome sizes in holocentrics

The above figure from Bures and Zedek (2014) shows the negative relationship between nuclear DNA content and chromosome number in holokinetics based on Carex data. This relationship has been shown to be significant in the Cyperaceae by Lipnerova et al. 2013.

the important point to consider at this point is the negative relationship between chromosome number and DNA content: generally, individuals with higher chromosome numbers have less DNA content in the Cyperaceae
for the purpose of this analytical exercise, I will not be considering holokinetic drive, which is a possible process that explains this pattern.

This figure shows the relationships between genome size and chromosome number in the genus Carex for all species studied by Lipnerova et al. 2013. The grey arrows show the putative effects of particular mechanisms responsible for the evolution of genome size and chromosome number. Closely related species are represented by the same symbol. The three main grades (ploidy) levels are separated with dash lines.

as shown in the figure above, the relationship between chromosome number and genome size becomes more complex in Carex polyploids

The following analyses are for the South African Schoenus

Note: Some of these specimens used in the following analyses have NOT been sequenced for GBS

for the following analyses, the mean of genome size and chromosome count estimations have been calculated and used for modelling
the final specimens used in the analyses are based on their chromosome counts: the counts must have a standard error of the mean value less than one. This eliminates those specimens with counts with high variance due to poor preparation or contamination
the mean of chromosome counts was calculated, since there were as many as possible counts made from each specimen
all specimens with either NA values for genome size or chromosome count mean have also been removed from the analyses

Histogram of SD and SEM to show outliers

This histogram shows the standard deviation and standard error of the mean for chromosome counts. I used this information to decide to eliminate samples with SD values greater than four and SEM values greater than 1.

for the SEM data, the break above one seems like a reasonable cut-off point
for the SD data, the break above four seems like a reasonable cut-off point

Examine differences among species with high standard error of the mean and high standard deviations values

since no species differ in the below analyses, it does not matter which of SD or SEM is used in the following analyses

(diff.gs.chromo.SEM.SD<-setdiff(gs.chromo.high.SEM$Collection_number, gs.chromo.high.SD$Collection_number ))

## character(0)

(diff.gs.chromo.SD.SEM<-setdiff(gs.chromo.high.SD$Collection_number ,gs.chromo.high.SEM$Collection_number))

## character(0)

Correlation on raw (not transformed) data

this is a Pearson’s correlation between genome size and chromosome count means

##             Count_mean Genome_size
## Count_mean        1.00        0.53
## Genome_size       0.53        1.00

Diagnostic plots for genome size and chromosome count means

Genome sizes

the log transformation gives a distribution that is most normal

Chromosome counts

Additional diagnostic plots for normality

none of the transformations seem to produce a normally distributed y-variable

Verify normality using the Shapiro test

Based on these results, none of the transformations lead to a normally-distributed y- or x-variable

(shapiro.test(gs.chromo.high.SEM$Genome_size))

## 
##  Shapiro-Wilk normality test
## 
## data:  gs.chromo.high.SEM$Genome_size
## W = 0.89993, p-value = 0.0004183

(shapiro.test(sqrt(gs.chromo.high.SEM$Genome_size)))

## 
##  Shapiro-Wilk normality test
## 
## data:  sqrt(gs.chromo.high.SEM$Genome_size)
## W = 0.92336, p-value = 0.002797

(shapiro.test((gs.chromo.high.SEM$Genome_size)^(1/3)))

## 
##  Shapiro-Wilk normality test
## 
## data:  (gs.chromo.high.SEM$Genome_size)^(1/3)
## W = 0.92822, p-value = 0.004255

(shapiro.test(log(gs.chromo.high.SEM$Genome_size)))

## 
##  Shapiro-Wilk normality test
## 
## data:  log(gs.chromo.high.SEM$Genome_size)
## W = 0.93221, p-value = 0.006058

(shapiro.test(gs.chromo.high.SEM$Count_mean))

## 
##  Shapiro-Wilk normality test
## 
## data:  gs.chromo.high.SEM$Count_mean
## W = 0.85583, p-value = 1.895e-05

(shapiro.test(log(gs.chromo.high.SEM$Count_mean)))

## 
##  Shapiro-Wilk normality test
## 
## data:  log(gs.chromo.high.SEM$Count_mean)
## W = 0.90542, p-value = 0.0006411

Create linear models

I have chromosome count as independent variable (y) and genome size as dependent variable (x)
the log transformation of the y-variable (genome size) gets it to be closer to normally-distributed
I show the models here for future reference

gs.chromo.lm<-lm(gs.chromo.high.SEM$Genome_size~gs.chromo.high.SEM$Count_mean)
sqrt.chromo.lm<-lm(sqrt(gs.chromo.high.SEM$Genome_size)~gs.chromo.high.SEM$Count_mean)
cbrt.chromo.lm<-lm(((gs.chromo.high.SEM$Genome_size)^(1/3))~gs.chromo.high.SEM$Count_mean)
log.gs.chromo.lm<-lm(log(gs.chromo.high.SEM$Genome_size)~gs.chromo.high.SEM$Count_mean)
log.log.gs.chromo.lm<-lm(log(gs.chromo.high.SEM$Genome_size)~log(gs.chromo.high.SEM$Count_mean))

Examine residuals for models with and without data transformations for genome size

the first figure is for the model without transformed data, the second figure is the square route transformed data, the third figure (bottom left) is for the cube root transformed data, and the fourth figure (bottom right) is for the log transformed data

Examine residuals for model with untransformed data and other model with both x and y variables tranformed

the first figure is based on the raw data
the second figure has had both genome sizes and chromosome count means log transformed

Display results for three models

all models are significant with decent fits
although none of the models fully meet the assumptions of normality and homoscedasticity, I think that it is most appropriate to report the log transformed genome size (y-variable) results, since that model improves normality and is relatively easy to interpret.

Model with untransformed y- and x-variables

## 
## Call:
## lm(formula = gs.chromo.high.SEM$Genome_size ~ gs.chromo.high.SEM$Count_mean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3890.5  -737.3  -326.2  1215.4  4544.2 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    2311.25     600.86   3.847 0.000346 ***
## gs.chromo.high.SEM$Count_mean    83.41      19.10   4.367 6.51e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1566 on 49 degrees of freedom
## Multiple R-squared:  0.2802, Adjusted R-squared:  0.2655 
## F-statistic: 19.07 on 1 and 49 DF,  p-value: 6.507e-05

Model with log transformed y-variable

## 
## Call:
## lm(formula = log(gs.chromo.high.SEM$Genome_size) ~ gs.chromo.high.SEM$Count_mean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.89042 -0.15071 -0.00145  0.25325  0.65286 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   7.906064   0.125336  63.079  < 2e-16 ***
## gs.chromo.high.SEM$Count_mean 0.016761   0.003984   4.207  0.00011 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3266 on 49 degrees of freedom
## Multiple R-squared:  0.2653, Adjusted R-squared:  0.2503 
## F-statistic:  17.7 on 1 and 49 DF,  p-value: 0.0001099

Model with log transformed x- and y-variables

## 
## Call:
## lm(formula = log(gs.chromo.high.SEM$Genome_size) ~ log(gs.chromo.high.SEM$Count_mean))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.91619 -0.12172 -0.01116  0.20373  0.65100 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)                          6.3619     0.3948  16.116  < 2e-16
## log(gs.chromo.high.SEM$Count_mean)   0.6151     0.1186   5.186 4.08e-06
##                                       
## (Intercept)                        ***
## log(gs.chromo.high.SEM$Count_mean) ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3062 on 49 degrees of freedom
## Multiple R-squared:  0.3544, Adjusted R-squared:  0.3412 
## F-statistic: 26.89 on 1 and 49 DF,  p-value: 4.075e-06

Examine variance and normality of model residuals

from the variation and qq normality plots, none of the transformations improve the distribution of the variance of the residuals or the normality
since data that has not been transformed is easiest to interpret, I will interpret my results from the LM based on data that has not been transformed

Homoscedasticity of residuals

Normality of residuals

Plot linear models

the figure to the left is based on the data that has not been transformed
the genome size data has been transformed in the figure to the right
the plot on the left-hand side (based on the data that has not been transformed) is based on the LM model that has been chosen to interpret results

Principal components and linear discriminant analyses

In the following analyses, I compare the results of fuzzy clustering Principal Components Analyses results to those clusters obtained from linear discriminant analyses

Principal components analysis

for this analysis, I ordinate the mean of genome size and chromosome number

Clustering

the following analyses are based on fuzzy clustering using the fanny function in R

Examine the Calinski-Harabasc index to define optimal number of clusters

based on the Calinski-Harabasc index scores, I have decided to create separate analyses for 4, 5 and 6 clusters
I prefer the 4 cluster analysis, as that does not break up the dense cluster in the left-hand side of the principal component ordinations

Plot PC ordinations beside clustering results

the four cluster analysis is on the top right-hand side
the five cluster analysis is in the middle of the right-hand side of figures
the six cluster analysis is on the bottom of the right-hand side of figures

Specimens corresponding to clusters

see which specimens correspond to each of four clusters

Cluster #1

this cluster includes specimens with genome sizes between 1630 and 3850.6 MBp, and chromosome counts between 17 and 26.7
I had been referring to these as the diploidized taxa

##     Collection_number Genome_size Count_mean
## 27            MM_7591      3719.7       18.9
## 36         TE2016_192      1630.0       18.0
## 37         TE2016_193      3297.0       20.5
## 45         TE2016_202      3373.5       19.9
## 46         TE2016_203      3319.5       19.0
## 47         TE2016_204      3443.0       21.6
## 53         TE2016_213      3335.0       21.4
## 56         TE2016_226      3010.0       18.5
## 59         TE2016_229      3279.5       18.9
## 69         TE2016_241      3230.2       19.2
## 72         TE2016_244      3412.3       19.8
## 73         TE2016_245      3431.9       20.8
## 74         TE2016_247      3207.7       19.8
## 80         TE2016_253      3229.2       20.0
## 87         TE2016_262      3327.8       18.8
## 93         TE2016_269      3850.6       20.0
## 103        TE2016_282      3580.9       22.0
## 104        TE2016_283      3564.9       21.3
## 122        TE2016_302      3541.5       19.0
## 166        TE2016_347      3235.0       24.5
## 176        TE2016_357      3204.0       26.7
## 181        TE2016_362      2770.0       17.0
## 189        TE2016_383      3348.2       23.5
## 196        TE2016_390      3182.8       21.4

Cluster #2

this cluster includes specimens with genome sizes between 4846.8 and 6113.5 MBp, and chromosome counts between 16 and 35.6

##     Collection_number Genome_size Count_mean
## 28            MM_7626      4846.8       24.0
## 39         TE2016_196      5156.5       35.6
## 50         TE2016_207      5807.5       24.5
## 58         TE2016_228      4892.0       29.0
## 161        TE2016_341      5703.0       20.8
## 175        TE2016_356      4891.0       24.5
## 178        TE2016_359      5304.0       31.7
## 179        TE2016_360      4925.0       30.0
## 198        TE2016_392      6113.5       16.0

Cluster #3

this cluster includes specimens with genome sizes between 5380 and 10192 MBp, and chromosome counts between 34 and 52
these appear to be polyploid taxa with large genome sizes

##     Collection_number Genome_size Count_mean
## 33         TE2016_182      8396.5       52.0
## 35         TE2016_191      6814.0       40.1
## 40         TE2016_197      7544.0       39.8
## 41         TE2016_198      6760.5       38.0
## 49         TE2016_206      5380.0       40.7
## 61         TE2016_231      6607.5       38.4
## 82         TE2016_255      5905.1       39.0
## 90         TE2016_265      6888.9       38.8
## 100        TE2016_278      7211.4       34.0
## 120        TE2016_300      7052.9       40.0
## 162        TE2016_342     10192.0       40.0
## 177        TE2016_358      6879.0       39.5
## 180        TE2016_361      7152.0       40.0
## 185        TE2016_371      7366.5       38.4
## 188        TE2016_382      6136.9       34.7

Cluster #4

this cluster includes specimens with genome sizes between 2558 and 3749.7 MBp, and chromosome counts between 49.6 and 63.6
these are polyploid taxa with small genome sizes - all three specimens are new species (S. albovaginatus and S. bracteosus).

##     Collection_number Genome_size Count_mean
## 190        TE2016_384      2558.0       49.6
## 206        TE2016_400      3709.4       60.5
## 210        TE2016_404      3749.7       63.6

Linear discriminant analysis

the LDA classifier assumes that each class comes from a normal distribution with a class-specific mean vector and a common variance
I have already looked at the normality plots above, neither genome size or chromosome number is normally distributed, but log transforming genome size does slightly improve normality
QDA (quadratic discriminant analysis) is preferred when there is a very large number of predictors - which is not my case.
since log transforming helps meet the assumption of equal variance, I will conduct a LDA with log transformed genome size data.
since neither the assumption of equal variance or normally-distributed data are fully met, care must be taken in interpreting the following results

Examine variance

Genome size

genome size per cluster does not meet the assumption of equal variance

## 
##  Bartlett test of homogeneity of variances
## 
## data:  gs.chromo.clust.df$Genome_size by gs.chromo.clust.df[, 6]
## Bartlett's K-squared = 18.16, df = 3, p-value = 0.0004077

Chromosome counts

chromosome count per cluster does not meet the assumption of equal variance

## 
##  Bartlett test of homogeneity of variances
## 
## data:  gs.chromo.clust.df$Count_mean by gs.chromo.clust.df[, 6]
## Bartlett's K-squared = 16.365, df = 3, p-value = 0.0009546

Genome size: log transformed

log transforming genome size does meet the assumption of equal variance for the four cluster analysis

## 
##  Bartlett test of homogeneity of variances
## 
## data:  log(gs.chromo.clust.df$Genome_size) by gs.chromo.clust.df[, 6]
## Bartlett's K-squared = 4.0109, df = 3, p-value = 0.2603

Chromosome counts: log transformed

however, log transforming chromosome number does not help meet the assumption of equal variance

## 
##  Bartlett test of homogeneity of variances
## 
## data:  log(gs.chromo.clust.df$Count_mean) by gs.chromo.clust.df[, 6]
## Bartlett's K-squared = 13.574, df = 3, p-value = 0.003547

Linear discriminant analysis model

code with results for my LDA with log-transformed genome size

(gs.chromo.lda <- lda(gs.chromo.clust.df[,6] ~ log(gs.chromo.clust.df$Genome_size) + gs.chromo.clust.df$Count_mean))

## Call:
## lda(gs.chromo.clust.df[, 6] ~ log(gs.chromo.clust.df$Genome_size) + 
##     gs.chromo.clust.df$Count_mean)
## 
## Prior probabilities of groups:
##          1          2          3          4 
## 0.47058824 0.17647059 0.29411765 0.05882353 
## 
## Group means:
##   log(gs.chromo.clust.df$Genome_size) gs.chromo.clust.df$Count_mean
## 1                            8.082655                      20.43750
## 2                            8.570756                      26.23333
## 3                            8.855295                      39.56000
## 4                            8.098346                      57.90000
## 
## Coefficients of linear discriminants:
##                                           LD1        LD2
## log(gs.chromo.clust.df$Genome_size) 2.9060881 -6.1733577
## gs.chromo.clust.df$Count_mean       0.2085208  0.1577215
## 
## Proportion of trace:
##    LD1    LD2 
## 0.7198 0.2802

gs.chromo.lda.pred<- predict(gs.chromo.lda)

# redo with jacknifing
gs.chromo.lda.CV<- lda(gs.chromo.clust.df[,6] ~ log(gs.chromo.clust.df$Genome_size) + gs.chromo.clust.df$Count_mean, CV=TRUE)

Compare accuracy of plain model and jack-knived model

##    
##      1  2  3  4
##   1 24  0  0  0
##   2  0  8  0  0
##   3  0  0 16  0
##   4  0  0  0  3

Overall classification accuracy

it is defined as the fraction of instances that are correctly classified
for this analysis, I compare the LDA clustering results to those obtained from clustering of the PCA results
overall accuracy is 1, therefore accuracy per cluster = 1

## [1] 1

Per-class Precision, Recall, and F-1

in order to assess the performance with respect to every class in the data set, I compute common per-class metrics such as precision, recall, and the F-1 score.
these metrics are particularly useful when the class labels are not uniformly distributed (most instances belong to one class, for example). In such cases, accuracy could be misleading as one could predict the dominant class most of the time and still achieve a relatively high overall accuracy but very low precision or recall for other classes.
precision is defined as the fraction of correct predictions for a certain class,
recall is the fraction of instances of a class that were correctly predicted.
the F-1 score is also commonly reported. It is defined as the harmonic mean (or a weighted average) of precision and recall.
for this analysis, I calculate these values by comparing the LDA clustering results to those obtained from clustering of the PCA results

##   Precision Recall F1
## 1         1      1  1
## 2         1      1  1
## 3         1      1  1
## 4         1      1  1

Plot linear discriminants with predicted values

the first plot is a PCA - the colours of the dots are based on the clustering analyses
the plot to the left is the linear discriminant analysis plot