Multivariate Analysis of Silkworm Hybrid Phenotypes Across Two Locations in the Philippines With Mean Value Traits

INTRODUCTION

Unraveling the Phenotypic Tapestry of Silkworm Hybrids: A Comprehensive Multi-Trait, Multi-Location Analysis: This study embarks on a rigorous exploration of phenotypic diversity within a population of silkworm (Bombyx mori) hybrids, reared across two distinct locations in the Philippines. By employing a robust statistical and multivariate framework, we aim to transcend mere description and delve into the intricate interplay of genetic and environmental factors that orchestrate the expression of key traits critical for both fundamental research and applied breeding programs.

Delving into the Data: A Multifaceted Approach

Our investigation is structured around four key analytical pillars:

1. Data Exploration & Assumption Validation: We begin by meticulously examining the distributional properties of each phenotypic trait across the hybrid strains. This involves constructing histograms and density plots to visualize the shape of the distributions, assess central tendency and variability, and identify potential outliers that might warrant further investigation or specialized statistical treatment. Recognizing that many statistical tests rely on specific assumptions about the data, we rigorously assess the normality of the trait distributions using the Shapiro-Wilk test and evaluate the homogeneity of variance across strains using Levene’s test. These assessments guide our choice of subsequent statistical tests and ensure the validity of our inferences.

2. ANOVA & Non-Parametric Testing: Discerning Strain Differences: To discern statistically significant differences in trait means among the silkworm hybrids, we employ Analysis of Variance (ANOVA), a powerful parametric test that compares means across multiple groups. However, the validity of ANOVA hinges on the fulfillment of the normality and homogeneity of variance assumptions. In cases where these assumptions are violated, we strategically employ log transformations to potentially rectify deviations from normality or homoscedasticity. If transformations fail to remedy the violations, we adopt the non-parametric Kruskal-Wallis test, which compares medians across groups based on ranks, making it suitable for data that doesn’t conform to the assumptions of ANOVA. To pinpoint specific strain pairs exhibiting significant differences, we employ post-hoc tests, namely Tukey’s Honestly Significant Difference (HSD) test for ANOVA and Dunn’s test for Kruskal-Wallis. These tests allow for pairwise comparisons between strains while controlling for the increased risk of Type I errors (false positives) associated with multiple testing.

3. Multivariate Analysis: Unveiling Hidden Patterns: To move beyond univariate analysis and explore the interrelationships among multiple traits, we harness the power of multivariate techniques. Principal Component Analysis (PCA) is employed to unveil the underlying structure of phenotypic variation, effectively reducing the dimensionality of the data and identifying key patterns. By scrutinizing scree plots and biplots, we ascertain the contribution of each trait to the principal components and visualize the relationships between individuals (silkworms) and variables (traits). Furthermore, we apply hierarchical clustering and k-means clustering on the principal component scores to classify silkworms into distinct groups based on their phenotypic similarities. This clustering analysis can reveal natural groupings of strains with similar trait profiles, potentially suggesting shared genetic backgrounds or responses to environmental factors.

4. Correlation Analysis: Deciphering Trait Interdependencies: To quantify and visualize the relationships between traits, we compute and visualize both Pearson and Spearman correlation matrices. The Pearson correlation coefficient captures linear relationships between traits, while the Spearman correlation coefficient assesses monotonic relationships, accommodating both linear and non-linear associations. Scrutinizing correlograms allows us to discern clusters of highly correlated traits and identify potential multicollinearity, which can inform subsequent analyses. To rigorously assess the strength and significance of associations between specific trait pairs, we conduct individual correlation tests (Pearson and Spearman). Furthermore, we generate scatterplot matrices to visualize all pairwise relationships between traits, aiding in the identification of patterns, outliers, and potential non-linear associations that might not be captured by simple correlation coefficients. Finally, we explore trait correlations within the context of strain variation using mixed-effects models, enabling us to discern whether correlations are consistent across strains or if they vary depending on the genetic background. This can provide insights into the genetic and environmental factors that influence trait co-variation.

Bridging the Gap: Genotype to Phenotype: By integrating these diverse analytical approaches, this study strives to transcend mere description and provide a comprehensive portrait of the phenotypic landscape within a population of silkworm hybrids. The insights gleaned from this investigation will not only deepen our understanding of the genetic architecture underlying silkworm traits but also furnish invaluable information for targeted breeding efforts to enhance silk production and other economically important characteristics. Moreover, this study contributes to the broader field of evolutionary biology by shedding light on the interplay between genetic variation and phenotypic diversity in a model organism with profound implications for both scientific inquiry and industrial applications.

Data Loading and Pre-processing

# Read the data with subsamples
data_subsamples <- read_excel("Hybrid.xls")

# Convert 'Strain' to a factor in the subsample data
data_subsamples$Strain <- as.factor(data_subsamples$Strain)

# Convert 'Location' to a factor in the subsample data
data_subsamples$Location <- as.factor(data_subsamples$Location)

# List of traits in the subsample dataset
traits_subsamples <- c("Single_Cocoon_Weight", "Cocoon_Shell_Weight", "Cocoon_Shell_Percentage")

# Remove rows with missing values for any of the traits in the subsample dataset
data_subsamples_complete <- data_subsamples %>% filter(complete.cases(data_subsamples[, traits_subsamples]))

# Load the data without subsamples
data_no_subsamples <- read_excel("Hybrid_new.xls")

# Convert 'Strain' to a factor in the no subsample data
data_no_subsamples$Strain <- as.factor(data_no_subsamples$Strain)

# Convert 'Location' to a factor in the no subsample data
data_no_subsamples$Location <- as.factor(data_no_subsamples$Location)

# --- Merge the two datasets ---

# Aggregate the data with subsamples by Strain and Location using 'aggregate()'
data_agg <- aggregate(cbind(Single_Cocoon_Weight, Cocoon_Shell_Weight, Cocoon_Shell_Percentage) ~ Strain + Location, 
                      data = data_subsamples_complete, 
                      FUN = mean)


# Merge the aggregated data with the data without subsamples
combined_data <- merge(data_agg, data_no_subsamples, by = c("Strain", "Location"))

# Reorder columns to place 'Replicate' after 'Strain'
combined_data <- combined_data %>% dplyr::select(Strain, Replicate, everything())  # Use dplyr::select

# List of all traits
all_traits <- c(traits_subsamples, "Hatching_Rate", "Larval_Weight", "Live_Pupa_Percentage")

Data Exploration and Assumption Checking

We begin by exploring the distributions of each trait and checking the assumptions of normality and homogeneity of variance, crucial for subsequent analyses.

Data Exploration and Assumption Checking

# Loop through each trait
for (trait in all_traits) {
  cat("\n\n--- Analyzing", trait, "---\n")
  
  # Use the appropriate dataframe for each trait
  if (trait %in% traits_subsamples) {
    # Use data_subsamples_complete for traits with subsamples
    data_trait <- data_subsamples_complete
  } else {
    # Use combined_data for traits without subsamples
    data_trait <- combined_data
  }
  
  # Fit a temporary linear mixed-effects model to check assumptions
  temp_model <- lm(as.formula(paste(trait, "~ Strain + Location")), data = data_trait)
  
  # Boxplot with outliers, filled by Location
  cat("\nBoxplot for", trait, ":\n")
  boxplot_outliers <- ggplot(data_trait, aes(x = Strain, y = .data[[trait]], fill = Location)) +
    geom_boxplot(outlier.shape = 16, outlier.size = 2) + 
    labs(title = paste("Boxplot of", trait, "by Strain and Location"),
         x = "Strain", y = trait) +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) 
  print(boxplot_outliers) 
  
  # Outlier Detection and Treatment (using Tukey's fences on residuals)
  resid_trait <- residuals(temp_model) 
  Q1 <- quantile(resid_trait, 0.25)
  Q3 <- quantile(resid_trait, 0.75)
  IQR <- Q3 - Q1
  
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  
  outliers <- data_trait %>%  
    mutate(residual = resid_trait) %>% 
    filter(residual < lower_bound | residual > upper_bound) %>% 
    dplyr::select(Strain, Replicate, Location, all_of(trait)) 
  
  if (nrow(outliers) > 0) {
    cat("\nOutliers for", trait, "(based on residuals):\n")
    print(outliers)
  } else {
    cat("\nNo outliers found for", trait, "\n")
  }
  
  # Normality test (Shapiro-Wilk) on residuals
  normality_test <- shapiro.test(resid_trait)
  cat("\nNormality Test (Shapiro-Wilk) on Residuals:\n")
  print(normality_test)
  
  # Homogeneity of variance test (Levene's)
  variance_test <- leveneTest(resid_trait ~ data_trait$Strain * data_trait$Location)  
  cat("\nHomogeneity of Variance Test (Levene's):\n")
  print(variance_test)
  
  # Distribution plots 
  cat("\nDistribution Plots for", trait, ":\n")
  
  # Combined distribution plot 
  p1 <- ggplot(data_trait, aes(x = .data[[trait]], fill = Location)) +  
    geom_density(alpha = 0.5) + 
    labs(title = paste("Distribution of", trait, "(All Strains)"),
         x = trait, y = "Density") +
    theme_bw()
  
  # Distribution plots by strain and location
  p2 <- ggplot(data_trait, aes(x = .data[[trait]], fill = Location)) +  
    geom_density(alpha = 0.5) + 
    labs(x = trait, y = "Density") +
    facet_wrap(~ Strain, ncol = 6) +  
    theme_bw() +
    theme(strip.text = element_text(size = 8)) 
  
  # Arrange and print the plots
  print(ggarrange(p1, p2, nrow = 2, heights = c(1, 3)))  
}

## 
## 
## --- Analyzing Single_Cocoon_Weight ---
## 
## Boxplot for Single_Cocoon_Weight :

## 
## Outliers for Single_Cocoon_Weight (based on residuals):
## # A tibble: 8 × 4
##   Strain Replicate Location Single_Cocoon_Weight
##   <fct>      <dbl> <fct>                   <dbl>
## 1 SW 101         2 BALUBAL                 2.27 
## 2 SW 102         3 BALUBAL                 2.29 
## 3 SW 107         3 BALUBAL                 2.30 
## 4 SW 109         1 BALUBAL                 0.359
## 5 SW 110         2 BALUBAL                 2.32 
## 6 SW 111         2 BALUBAL                 2.8  
## 7 SW 115         2 BALUBAL                 0.823
## 8 SW 115         3 BALUBAL                 0.708
## 
## Normality Test (Shapiro-Wilk) on Residuals:
## 
##  Shapiro-Wilk normality test
## 
## data:  resid_trait
## W = 0.98536, p-value = 5.174e-08
## 
## 
## Homogeneity of Variance Test (Levene's):
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value   Pr(>F)    
## group  31  2.0195 0.000899 ***
##       896                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Distribution Plots for Single_Cocoon_Weight :

## 
## 
## --- Analyzing Cocoon_Shell_Weight ---
## 
## Boxplot for Cocoon_Shell_Weight :

## 
## Outliers for Cocoon_Shell_Weight (based on residuals):
## # A tibble: 23 × 4
##    Strain Replicate Location Cocoon_Shell_Weight
##    <fct>      <dbl> <fct>                  <dbl>
##  1 SW 104         2 TCMO                   0.77 
##  2 SW 111         2 TCMO                   0.39 
##  3 SW 111         2 TCMO                   0.4  
##  4 SW 111         2 TCMO                   0.39 
##  5 SW 113         3 TCMO                   0.414
##  6 SW 101         2 BALUBAL                0.468
##  7 SW 101         2 BALUBAL                0.187
##  8 SW 101         2 BALUBAL                0.21 
##  9 SW 102         3 BALUBAL                0.443
## 10 SW 103         3 BALUBAL                0.213
## # ℹ 13 more rows
## 
## Normality Test (Shapiro-Wilk) on Residuals:
## 
##  Shapiro-Wilk normality test
## 
## data:  resid_trait
## W = 0.92283, p-value < 2.2e-16
## 
## 
## Homogeneity of Variance Test (Levene's):
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value   Pr(>F)    
## group  31  2.3735 4.36e-05 ***
##       896                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Distribution Plots for Cocoon_Shell_Weight :

## 
## 
## --- Analyzing Cocoon_Shell_Percentage ---
## 
## Boxplot for Cocoon_Shell_Percentage :

## 
## Outliers for Cocoon_Shell_Percentage (based on residuals):
## # A tibble: 13 × 4
##    Strain Replicate Location Cocoon_Shell_Percentage
##    <fct>      <dbl> <fct>                      <dbl>
##  1 SW 104         2 TCMO                        65.8
##  2 SW 106         2 TCMO                        28.9
##  3 SW 111         2 TCMO                        29.1
##  4 SW 113         2 TCMO                        28.2
##  5 SW 115         2 TCMO                        27.6
##  6 SW 103         1 BALUBAL                     33.4
##  7 SW 103         3 BALUBAL                     11.7
##  8 SW 109         1 BALUBAL                     85.5
##  9 SW 111         2 BALUBAL                     13.1
## 10 SW 114         1 BALUBAL                     12.3
## 11 SW 114         2 BALUBAL                     12.9
## 12 SW 115         1 BALUBAL                     10.7
## 13 SW 115         2 BALUBAL                     30.4
## 
## Normality Test (Shapiro-Wilk) on Residuals:
## 
##  Shapiro-Wilk normality test
## 
## data:  resid_trait
## W = 0.64692, p-value < 2.2e-16
## 
## 
## Homogeneity of Variance Test (Levene's):
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group  31  0.9305 0.5772
##       896               
## 
## Distribution Plots for Cocoon_Shell_Percentage :

## 
## 
## --- Analyzing Hatching_Rate ---
## 
## Boxplot for Hatching_Rate :

## 
## Outliers for Hatching_Rate (based on residuals):
##   Strain Replicate Location Hatching_Rate
## 1 SW 107         3  BALUBAL      91.86992
## 2 SW 108         3  BALUBAL      92.15686
## 
## Normality Test (Shapiro-Wilk) on Residuals:
## 
##  Shapiro-Wilk normality test
## 
## data:  resid_trait
## W = 0.98544, p-value = 0.3701
## 
## 
## Homogeneity of Variance Test (Levene's):
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group 31  0.5917 0.9445
##       64               
## 
## Distribution Plots for Hatching_Rate :

## 
## 
## --- Analyzing Larval_Weight ---
## 
## Boxplot for Larval_Weight :

## 
## Outliers for Larval_Weight (based on residuals):
##   Strain Replicate Location Larval_Weight
## 1 SW 101         2     TCMO      32.95233
## 2 SW 108         3  BALUBAL      26.34000
## 
## Normality Test (Shapiro-Wilk) on Residuals:
## 
##  Shapiro-Wilk normality test
## 
## data:  resid_trait
## W = 0.98383, p-value = 0.2864
## 
## 
## Homogeneity of Variance Test (Levene's):
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group 31  0.6008 0.9391
##       64               
## 
## Distribution Plots for Larval_Weight :

## 
## 
## --- Analyzing Live_Pupa_Percentage ---
## 
## Boxplot for Live_Pupa_Percentage :

## 
## Outliers for Live_Pupa_Percentage (based on residuals):
##   Strain Replicate Location Live_Pupa_Percentage
## 1 SW 109         2  BALUBAL             72.61905
## 2 SW 113         1  BALUBAL             72.22222
## 
## Normality Test (Shapiro-Wilk) on Residuals:
## 
##  Shapiro-Wilk normality test
## 
## data:  resid_trait
## W = 0.97226, p-value = 0.03928
## 
## 
## Homogeneity of Variance Test (Levene's):
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group 31  0.9019 0.6155
##       64               
## 
## Distribution Plots for Live_Pupa_Percentage :

Technical Discussion Points and Considerations:

Boxplots: 1. Central Tendency and Spread: Examine the medians (horizontal lines within the boxes) to compare the typical values of each trait across strains. Observe the interquartile ranges (IQR, the height of the boxes) to gauge the spread of the data within each strain. 2. Outliers: Identify any points plotted individually outside the whiskers. These outliers might indicate unusual observations or potential errors in data collection. Consider their impact on 3. subsequent analyses and whether they should be handled (e.g., removal, winsorization) if they are deemed to be influential. 4. Strain Differences: Visually compare the boxplots across strains to get a preliminary sense of which traits might show significant differences between strains.

Normality Tests (Shapiro-Wilk): 1. P-values: Assess the p-values for each trait. If p < 0.05, it suggests a significant departure from normality. This might necessitate data transformation or the use of non-parametric tests. 2. Visual Inspection of Q-Q Plots: Complement the p-values by examining the Q-Q plots generated by the check_model function. If the points deviate substantially from the straight line, it further supports the evidence of non-normality.

Homogeneity of Variance Tests (Levene’s): 1. P-values: Evaluate the p-values for each trait. If p < 0.05, it suggests significant heterogeneity of variance (unequal variances) across strains. This might also necessitate data transformation or the use of alternative statistical tests. 2. Visual Inspection of Residuals vs. Fitted Plots: Examine the Residuals vs. Fitted plots from check_model. If the spread of the residuals is not roughly constant across the range of fitted values, it further supports the evidence of heteroscedasticity.

Distribution Plots: 1. Shape of Distributions: Observe the shape of the density curves for each trait, both for all strains combined and for individual strains. Look for deviations from a normal bell-shaped curve (e.g., skewness, multiple peaks). 2. Strain Comparisons: Compare the density curves across strains to visually assess differences in central tendency (mean or median) and spread (variance or interquartile range).

ANOVA, Transformation, and Non-Parametric Testing for All Traits (Combined Locations)

# Loop through each trait
for (trait in all_traits) {
  cat("\n\n--- Analyzing", trait, "---\n")
  
  # Use the appropriate dataframe for each trait
  if (trait %in% traits_subsamples) {
    # Use data_subsamples_complete for traits with subsamples
    data_trait <- data_subsamples_complete
  } else {
    # Use combined_data for traits without subsamples
    data_trait <- combined_data
  }
  
  # Fit the linear mixed-effects model (LMM) with 'Location' and interaction
  lmer_model <- lm(as.formula(paste(trait, "~ Strain + Location")), data = data_trait)
  
  # Check for non-normality or heteroscedasticity
  normality_p_value <- shapiro.test(residuals(lmer_model))$p.value
  levene_result <- leveneTest(residuals(lmer_model) ~ data_trait$Strain * data_trait$Location)
  homogeneity_p_value <- levene_result$"Pr(>F)"[1] 
  
  # Create means dataframe for CLDs and boxplot (outside the conditional blocks)
  means <- aggregate(as.formula(paste(trait, "~ Strain")), data = data_trait, FUN = mean)
  colnames(means)[colnames(means) == trait] <- paste0("Mean_", trait)
  
  if (normality_p_value >= 0.05 & homogeneity_p_value >= 0.05) {
    # Perform ANOVA directly on original data if both assumptions are met
    cat("\n**Performing ANOVA on original data:**\n")
    anova_result <- anova(lmer_model)
    print(anova_result)
    
    # Calculate and print eta-squared for the LMM
    eta_squared <- performance::r2(lmer_model)$Marginal$R2
    cat("Eta-squared (Marginal R-squared):", eta_squared, "\n")
    
    # Tukey's HSD if ANOVA is significant
    if (anova_result$`Pr(>F)`[1] < 0.05) {
      tukey_result <- emmeans(lmer_model, pairwise ~ Strain | Location, adjust = "tukey")
      print(tukey_result)
      
      # Extract CLD from Tukey's results
      cld_data <- as.data.frame(tukey_result$contrasts) %>%
        mutate(sig = ifelse(p.value < 0.05, "*", " ")) %>%
        dplyr::select(contrast, estimate, SE, df, t.ratio, p.value, sig)
      
      # Create CLDs using 'multcompLetters4' 
      cld_letters <- multcompLetters4(tukey_result$emmeans, tukey_result$contrasts)
      
      # Print the compact letter display
      cat("\n**Compact Letter Displays (CLDs):**\n")
      print(cld_letters$LetterMatrix)
      
      # Boxplot with CLDs, faceted by Location
      # Ensure complete cases for plotting
      data_complete_plot <- data_trait %>% filter(!is.na(.data[[trait]]))
      
      # Base boxplot
      p <- ggplot(data_complete_plot, aes(x = Strain, y = .data[[trait]], fill = Location)) + 
        geom_boxplot() +
        labs(title = paste("Boxplot of", trait, "by Strain and Location (ANOVA with Tukey HSD)"),
             x = "Strain", y = trait) +
        theme_bw() +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        facet_wrap(~ Location)  # Facet by Location
      
      # Add labels from cld_letters
      for (loc in levels(data_trait$Location)) {  # Use data_trait$Location here
        cld_data_loc <- cld_letters$LetterMatrix[, loc] %>% 
          data.frame(Letters = .) %>% 
          rownames_to_column("Strain") %>%
          mutate(y_pos = aggregate(as.formula(paste(trait, "~ Strain")), data = data_trait %>% filter(Location == loc), FUN = mean)[[2]] + 0.1)
        
        p <- p + geom_text(data = cld_data_loc, 
                           aes(x = Strain, y = y_pos, label = Letters),
                           vjust = -0.5, size = 3)
      }
      
      # Print the plot
      print(p)
      
    } else {
      cat("\nANOVA not significant. No post-hoc test needed.\n")
    }
    
  } else {
    # Attempt log transformation if at least one assumption is not met
    cat("\n**Attempting log transformation...**\n")
    transformed_data <- data_trait
    transformed_data[[trait]] <- log1p(transformed_data[[trait]])
    transformed_model <- lmer(as.formula(paste(trait, "~ Strain * Location + (1 | Replicate)")), data = transformed_data)
    check_model(transformed_model, transformed_data)  # Pass transformed_data to check_model
    
    # Re-check assumptions after transformation
    normality_p_value_transformed <- shapiro.test(residuals(transformed_model))$p.value
    levene_result_transformed <- leveneTest(residuals(transformed_model) ~ transformed_data$Strain * transformed_data$Location)
    homogeneity_p_value_transformed <- levene_result_transformed$"Pr(>F)"[1] 
    
    # Print the results of assumption checks after transformation
    cat("\nAssumption Checks after Transformation:\n")
    cat("  Normality (Shapiro-Wilk) p-value:", normality_p_value_transformed, "\n")
    cat("  Homogeneity of Variance (Levene's) p-value:", homogeneity_p_value_transformed, "\n")
    
    if (normality_p_value_transformed >= 0.05 & homogeneity_p_value_transformed >= 0.05) { 
      # Use ANOVA on transformed data if both assumptions are now met
      cat("\n**Performing ANOVA on transformed data:**\n")
      anova_result <- anova(transformed_model)
      print(anova_result)
      
      # Tukey's HSD if ANOVA is significant
      if (anova_result$`Pr(>F)`[1] < 0.05) {
        tukey_result <- emmeans(transformed_model, pairwise ~ Strain | Location, adjust = "tukey")
        print(tukey_result)
        
        # Extract CLD from Tukey's results
        cld_data <- as.data.frame(tukey_result$contrasts) %>%
          mutate(sig = ifelse(p.value < 0.05, "*", " ")) %>%
          dplyr::select(contrast, estimate, SE, df, t.ratio, p.value, sig)
        
        # Create CLDs using 'multcompLetters4' 
        cld_letters <- multcompLetters4(tukey_result$emmeans, tukey_result$contrasts)
        
        # Print the compact letter display
        cat("\n**Compact Letter Displays (CLDs):**\n")
        print(cld_letters$LetterMatrix)
        
        # Boxplot with CLDs, faceted by Location
        # Ensure complete cases for plotting
        data_complete_plot <- transformed_data %>% filter(!is.na(.data[[trait]]))
        
        # Base boxplot
        p <- ggplot(data_complete_plot, aes(x = Strain, y = .data[[trait]], fill = Location)) + 
          geom_boxplot() +
          labs(title = paste("Boxplot of", trait, "by Strain and Location (ANOVA with Tukey HSD)"),
               x = "Strain", y = trait) +
          theme_bw() +
          theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
          facet_wrap(~ Location)  # Facet by Location
        
        # Add labels from cld_letters
        for (loc in levels(data_complete$Location)) {
          cld_data_loc <- cld_letters$LetterMatrix[, loc] %>% 
            data.frame(Letters = .) %>% 
            rownames_to_column("Strain") %>%
            mutate(y_pos = aggregate(as.formula(paste(trait, "~ Strain")), data = transformed_data %>% filter(Location == loc), FUN = mean)[[2]] + 0.1)
          
          p <- p + geom_text(data = cld_data_loc, 
                             aes(x = Strain, y = y_pos, label = Letters),
                             vjust = -0.5, size = 3)
        }
        
        # Print the plot
        print(p)
        
      } else {
        cat("\nANOVA not significant. No post-hoc test needed.\n")
      }
      
    } else {
      # Use Kruskal-Wallis if assumptions are still violated after transformation
      cat("\n**Performing Kruskal-Wallis test:**\n")
      kruskal_result <- kruskal.test(as.formula(paste(trait, "~ Strain")), data = data_trait)  
      print(kruskal_result)
      
      # Calculate and print epsilon-squared (effect size for Kruskal-Wallis)
      k <- length(unique(data_trait$Strain)) # Number of groups
      n <- nrow(data_trait) # Total number of observations
      epsilon_squared <- (kruskal_result$statistic - (k - 1)) / (n - k)
      cat("Epsilon-squared:", epsilon_squared, "\n")
      
      # Dunn's Test if Kruskal-Wallis is significant
      if (kruskal_result$p.value < 0.05) {
        cat("\n**Dunn's Test (Post-hoc with Holm correction):**\n")
        dunn_result <- dunnTest(as.formula(paste(trait, "~ Strain")), data = data_trait, method="holm") 
        print(dunn_result)
        
        # Get unique strain names
        strains <- unique(data_trait$Strain)
        
        # Initialize a vector to store group letters
        group_letters <- rep("", length(strains))
        names(group_letters) <- strains
        
        # Assign group letters based on adjusted p-values (example threshold of 0.05)
        for (i in 1:nrow(dunn_result$res)) {
          comparison <- strsplit(dunn_result$res$Comparison[i], " - ")
          strain1 <- comparison[[1]][1]
          strain2 <- comparison[[1]][2]
          
          if (dunn_result$res$P.adj[i] < 0.05) {
            # Strains are significantly different, assign different letters
            if (any(group_letters[strain1] == "") && any(group_letters[strain2] == "")) {
              # Both strains have no group yet
              group_letters[strain1] <- "a"
              group_letters[strain2] <- "b"
            } else if (any(group_letters[strain1] == "")) {
              # Only strain1 has no group
              group_letters[strain1] <- setdiff(letters, group_letters[strain2])[1]
            } else if (any(group_letters[strain2] == "")) {
              # Only strain2 has no group
              group_letters[strain2] <- setdiff(letters, group_letters[strain1])[1]
            } else {
              # Both strains have groups, ensure they are different
              if (group_letters[strain1] == group_letters[strain2]) {
                group_letters[strain2] <- setdiff(letters, group_letters[strain1])[1]
              }
            }
          } else {
            # Strains are not significantly different
            if (any(group_letters[strain1] == "") && any(group_letters[strain2] == "")) {
              # Both strains have no group yet
              group_letters[strain1] <- "a"
              group_letters[strain2] <- "a"
            } else if (any(group_letters[strain1] == "")) {
              # Only strain1 has no group
              group_letters[strain1] <- group_letters[strain2]
            } else if (any(group_letters[strain2] == "")) {
              # Only strain2 has no group
              group_letters[strain2] <- group_letters[strain1]
            }
          }
        }
        
        # Create the cld_data dataframe
        cld_data <- data.frame(Strain = strains,
                               Trait = means[[2]],
                               .group = group_letters)
        
        # Rename the trait column
        colnames(cld_data)[colnames(cld_data) == "Trait"] <- trait
        
        cat("\n**Compact Letter Displays (CLDs - Holm):**\n")
        print(cld_data)
        
        # Boxplot of data with group labels
        plotting_data <- merge(data_trait, cld_data, by = "Strain")  # Use data_trait for merging
        
        # Use 'plotting_data' in ggplot to access the trait and .group columns
        ggplot(plotting_data, aes(x = Strain, y = .data[[trait]], fill = .group)) + 
          geom_boxplot() +
          labs(title = paste("Boxplot of", trait, "by Strain (Kruskal-Wallis with Dunn's Test)"),
               x = "Strain", y = trait) +
          theme_bw() +
          theme(axis.text.x = element_text(angle = 45, hjust = 1))
      } else {
        cat("\nKruskal-Wallis not significant. No post-hoc test needed.\n")
      }
    }
  }
}

## 
## 
## --- Analyzing Single_Cocoon_Weight ---
## 
## **Attempting log transformation...**
## 
## Assumption Checks after Transformation:
##   Normality (Shapiro-Wilk) p-value: 1.449919e-10 
##   Homogeneity of Variance (Levene's) p-value: 0.01517713 
## 
## **Performing Kruskal-Wallis test:**
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Single_Cocoon_Weight by Strain
## Kruskal-Wallis chi-squared = 51.435, df = 15, p-value = 7e-06
## 
## Epsilon-squared: 0.03995115 
## 
## **Dunn's Test (Post-hoc with Holm correction):**

## Dunn (1964) Kruskal-Wallis multiple comparison

##   p-values adjusted with the Holm method.

##          Comparison           Z      P.unadj        P.adj
## 1   SW 101 - SW 102  1.72201660 8.506651e-02 1.0000000000
## 2   SW 101 - SW 103  1.31188029 1.895605e-01 1.0000000000
## 3   SW 102 - SW 103 -0.46411450 6.425657e-01 1.0000000000
## 4   SW 101 - SW 104  3.57289381 3.530580e-04 0.0395424918
## 5   SW 102 - SW 104  1.73468765 8.279615e-02 1.0000000000
## 6   SW 103 - SW 104  2.28146404 2.252100e-02 1.0000000000
## 7   SW 101 - SW 105  2.69628915 7.011677e-03 0.7151910833
## 8   SW 102 - SW 105  0.85637203 3.917920e-01 1.0000000000
## 9   SW 103 - SW 105  1.37862842 1.680094e-01 1.0000000000
## 10  SW 104 - SW 105 -0.93368133 3.504683e-01 1.0000000000
## 11  SW 101 - SW 106  1.03196001 3.020909e-01 1.0000000000
## 12  SW 102 - SW 106 -0.73808134 4.604650e-01 1.0000000000
## 13  SW 103 - SW 106 -0.28426540 7.762070e-01 1.0000000000
## 14  SW 104 - SW 106 -2.56845769 1.021522e-02 1.0000000000
## 15  SW 105 - SW 106 -1.66867425 9.518195e-02 1.0000000000
## 16  SW 101 - SW 107 -0.30873655 7.575219e-01 1.0000000000
## 17  SW 102 - SW 107 -2.01638535 4.375970e-02 1.0000000000
## 18  SW 103 - SW 107 -1.61931689 1.053791e-01 1.0000000000
## 19  SW 104 - SW 107 -3.87339590 1.073293e-04 0.0125575239
## 20  SW 105 - SW 107 -3.00372575 2.666956e-03 0.2880312599
## 21  SW 106 - SW 107 -1.34069656 1.800190e-01 1.0000000000
## 22  SW 101 - SW 108 -0.07567349 9.396789e-01 1.0000000000
## 23  SW 102 - SW 108 -1.73717793 8.235577e-02 1.0000000000
## 24  SW 103 - SW 108 -1.33991654 1.802725e-01 1.0000000000
## 25  SW 104 - SW 108 -3.52321059 4.263524e-04 0.0473251209
## 26  SW 105 - SW 108 -2.67436684 7.487050e-03 0.7561920704
## 27  SW 106 - SW 108 -1.07009503 2.845765e-01 1.0000000000
## 28  SW 107 - SW 108  0.22183250 8.244443e-01 1.0000000000
## 29  SW 101 - SW 109 -0.56842946 5.697434e-01 1.0000000000
## 30  SW 102 - SW 109 -2.26399282 2.357456e-02 1.0000000000
## 31  SW 103 - SW 109 -1.87791635 6.039261e-02 1.0000000000
## 32  SW 104 - SW 109 -4.12616241 3.688668e-05 0.0043526284
## 33  SW 105 - SW 109 -3.26232521 1.105023e-03 0.1204475030
## 34  SW 106 - SW 109 -1.60038946 1.095122e-01 1.0000000000
## 35  SW 107 - SW 109 -0.25969291 7.951007e-01 1.0000000000
## 36  SW 108 - SW 109 -0.47207884 6.368705e-01 1.0000000000
## 37  SW 101 - SW 110 -0.20893523 8.344988e-01 1.0000000000
## 38  SW 102 - SW 110 -1.87980234 6.013502e-02 1.0000000000
## 39  SW 103 - SW 110 -1.48527451 1.374711e-01 1.0000000000
## 40  SW 104 - SW 110 -3.68607099 2.277428e-04 0.0261904273
## 41  SW 105 - SW 110 -2.83305722 4.610513e-03 0.4887143988
## 42  SW 106 - SW 110 -1.21337135 2.249879e-01 1.0000000000
## 43  SW 107 - SW 110  0.09156686 9.270422e-01 1.0000000000
## 44  SW 108 - SW 110 -0.12792055 8.982119e-01 1.0000000000
## 45  SW 109 - SW 110  0.34433337 7.305956e-01 1.0000000000
## 46  SW 101 - SW 111 -0.89879289 3.687630e-01 1.0000000000
## 47  SW 102 - SW 111 -2.57898200 9.909195e-03 0.9909195130
## 48  SW 103 - SW 111 -2.20688878 2.732183e-02 1.0000000000
## 49  SW 104 - SW 111 -4.44771457 8.678875e-06 0.0010327861
## 50  SW 105 - SW 111 -3.59129763 3.290356e-04 0.0371810247
## 51  SW 106 - SW 111 -1.93075290 5.351362e-02 1.0000000000
## 52  SW 107 - SW 111 -0.59005634 5.551529e-01 1.0000000000
## 53  SW 108 - SW 111 -0.79042502 4.292796e-01 1.0000000000
## 54  SW 109 - SW 111 -0.33036343 7.411254e-01 1.0000000000
## 55  SW 110 - SW 111 -0.66588553 5.054843e-01 1.0000000000
## 56  SW 101 - SW 112 -1.08628265 2.773540e-01 1.0000000000
## 57  SW 102 - SW 112 -2.75774647 5.820132e-03 0.6111138451
## 58  SW 103 - SW 112 -2.39358911 1.668443e-02 1.0000000000
## 59  SW 104 - SW 112 -4.63020370 3.653062e-06 0.0004383674
## 60  SW 105 - SW 112 -3.77799796 1.580942e-04 0.0183389242
## 61  SW 106 - SW 112 -2.11824266 3.415452e-02 1.0000000000
## 62  SW 107 - SW 112 -0.77754610 4.368366e-01 1.0000000000
## 63  SW 108 - SW 112 -0.97109467 3.315011e-01 1.0000000000
## 64  SW 109 - SW 112 -0.51785320 6.045607e-01 1.0000000000
## 65  SW 110 - SW 112 -0.84837466 3.962293e-01 1.0000000000
## 66  SW 111 - SW 112 -0.18748976 8.512766e-01 1.0000000000
## 67  SW 101 - SW 113  0.27552992 7.829091e-01 1.0000000000
## 68  SW 102 - SW 113 -1.45930913 1.444800e-01 1.0000000000
## 69  SW 103 - SW 113 -1.03751051 2.994980e-01 1.0000000000
## 70  SW 104 - SW 113 -3.30471268 9.507381e-04 0.1045811922
## 71  SW 105 - SW 113 -2.42191936 1.543878e-02 1.0000000000
## 72  SW 106 - SW 113 -0.75643009 4.493914e-01 1.0000000000
## 73  SW 107 - SW 113  0.58426647 5.590410e-01 1.0000000000
## 74  SW 108 - SW 113  0.34118076 7.329675e-01 1.0000000000
## 75  SW 109 - SW 113  0.84395937 3.986921e-01 1.0000000000
## 76  SW 110 - SW 113  0.47711636 6.332793e-01 1.0000000000
## 77  SW 111 - SW 113  1.17432281 2.402658e-01 1.0000000000
## 78  SW 112 - SW 113  1.36181257 1.732570e-01 1.0000000000
## 79  SW 101 - SW 114 -0.04785161 9.618345e-01 1.0000000000
## 80  SW 102 - SW 114 -1.76764132 7.712089e-02 1.0000000000
## 81  SW 103 - SW 114 -1.35953042 1.739786e-01 1.0000000000
## 82  SW 104 - SW 114 -3.61946914 2.952080e-04 0.0336537128
## 83  SW 105 - SW 114 -2.74393928 6.070677e-03 0.6313503685
## 84  SW 106 - SW 114 -1.07981162 2.802261e-01 1.0000000000
## 85  SW 107 - SW 114  0.26088494 7.941812e-01 1.0000000000
## 86  SW 108 - SW 114  0.02956252 9.764160e-01 1.0000000000
## 87  SW 109 - SW 114  0.52057785 6.026609e-01 1.0000000000
## 88  SW 110 - SW 114  0.16235990 8.710224e-01 1.0000000000
## 89  SW 111 - SW 114  0.85094128 3.948020e-01 1.0000000000
## 90  SW 112 - SW 114  1.03843104 2.990694e-01 1.0000000000
## 91  SW 113 - SW 114 -0.32338153 7.464063e-01 1.0000000000
## 92  SW 101 - SW 115  1.15984812 2.461106e-01 1.0000000000
## 93  SW 102 - SW 115 -0.61614481 5.377990e-01 1.0000000000
## 94  SW 103 - SW 115 -0.15691577 8.753112e-01 1.0000000000
## 95  SW 104 - SW 115 -2.44398054 1.452621e-02 1.0000000000
## 96  SW 105 - SW 115 -1.54132462 1.232378e-01 1.0000000000
## 97  SW 106 - SW 115  0.12788811 8.982375e-01 1.0000000000
## 98  SW 107 - SW 115  1.46858467 1.419455e-01 1.0000000000
## 99  SW 108 - SW 115  1.19333110 2.327397e-01 1.0000000000
## 100 SW 109 - SW 115  1.72827758 8.393847e-02 1.0000000000
## 101 SW 110 - SW 115  1.33784850 1.809458e-01 1.0000000000
## 102 SW 111 - SW 115  2.05864101 3.952864e-02 1.0000000000
## 103 SW 112 - SW 115  2.24613077 2.469563e-02 1.0000000000
## 104 SW 113 - SW 115  0.88431820 3.765245e-01 1.0000000000
## 105 SW 114 - SW 115  1.20769973 2.271628e-01 1.0000000000
## 106 SW 101 - SW 116  1.82721632 6.766725e-02 1.0000000000
## 107 SW 102 - SW 116  0.02016580 9.839111e-01 0.9839111085
## 108 SW 103 - SW 116  0.50764245 6.117041e-01 1.0000000000
## 109 SW 104 - SW 116 -1.79441204 7.274743e-02 1.0000000000
## 110 SW 105 - SW 116 -0.87676641 3.806135e-01 1.0000000000
## 111 SW 106 - SW 116  0.79525631 4.264644e-01 1.0000000000
## 112 SW 107 - SW 116  2.13595287 3.268326e-02 1.0000000000
## 113 SW 108 - SW 116  1.83642318 6.629509e-02 1.0000000000
## 114 SW 109 - SW 116  2.39564577 1.659112e-02 1.0000000000
## 115 SW 110 - SW 116  1.98741700 4.687621e-02 1.0000000000
## 116 SW 111 - SW 116  2.72600921 6.410521e-03 0.6602836639
## 117 SW 112 - SW 116  2.91349897 3.574030e-03 0.3824212260
## 118 SW 113 - SW 116  1.55168640 1.207373e-01 1.0000000000
## 119 SW 114 - SW 116  1.87506793 6.078338e-02 1.0000000000
## 120 SW 115 - SW 116  0.66736820 5.045370e-01 1.0000000000
## 
## **Compact Letter Displays (CLDs - Holm):**
##        Strain Single_Cocoon_Weight .group
## SW 101 SW 101             1.483800      a
## SW 102 SW 102             1.404540      a
## SW 103 SW 103             1.419119      a
## SW 104 SW 104             1.315407      b
## SW 105 SW 105             1.363203      a
## SW 106 SW 106             1.431433      a
## SW 107 SW 107             1.501850      a
## SW 108 SW 108             1.465827      a
## SW 109 SW 109             1.496533      a
## SW 110 SW 110             1.484426      a
## SW 111 SW 111             1.532417      b
## SW 112 SW 112             1.532417      b
## SW 113 SW 113             1.459350      a
## SW 114 SW 114             1.473400      a
## SW 115 SW 115             1.420700      a
## SW 116 SW 116             1.402300      a
## 
## 
## --- Analyzing Cocoon_Shell_Weight ---
## 
## **Attempting log transformation...**
## 
## Assumption Checks after Transformation:
##   Normality (Shapiro-Wilk) p-value: 2.21599e-18 
##   Homogeneity of Variance (Levene's) p-value: 8.501889e-05 
## 
## **Performing Kruskal-Wallis test:**
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Cocoon_Shell_Weight by Strain
## Kruskal-Wallis chi-squared = 36.722, df = 15, p-value = 0.00139
## 
## Epsilon-squared: 0.02381752 
## 
## **Dunn's Test (Post-hoc with Holm correction):**

## Dunn (1964) Kruskal-Wallis multiple comparison
##   p-values adjusted with the Holm method.

##          Comparison            Z      P.unadj      P.adj
## 1   SW 101 - SW 102  2.047025256 0.0406556048 1.00000000
## 2   SW 101 - SW 103 -0.587136437 0.5571120778 1.00000000
## 3   SW 102 - SW 103 -2.599211805 0.0093438101 1.00000000
## 4   SW 101 - SW 104  1.723660739 0.0847691065 1.00000000
## 5   SW 102 - SW 104 -0.349819939 0.7264738344 1.00000000
## 6   SW 103 - SW 104  2.288385944 0.0221150555 1.00000000
## 7   SW 101 - SW 105  0.680251852 0.4963450056 1.00000000
## 8   SW 102 - SW 105 -1.390342645 0.1644248541 1.00000000
## 9   SW 103 - SW 105  1.262096456 0.2069140802 1.00000000
## 10  SW 104 - SW 105 -1.054527881 0.2916413027 1.00000000
## 11  SW 101 - SW 106  0.048705804 0.9611537507 1.00000000
## 12  SW 102 - SW 106 -2.000586095 0.0454370133 1.00000000
## 13  SW 103 - SW 106  0.635637163 0.5250129416 1.00000000
## 14  SW 104 - SW 106 -1.676253991 0.0936884527 1.00000000
## 15  SW 105 - SW 106 -0.631751126 0.5275495144 1.00000000
## 16  SW 101 - SW 107 -0.550069041 0.5822720199 1.00000000
## 17  SW 102 - SW 107 -2.571495509 0.0101260329 1.00000000
## 18  SW 103 - SW 107  0.039383486 0.9685846459 1.00000000
## 19  SW 104 - SW 107 -2.259058629 0.0238797381 1.00000000
## 20  SW 105 - SW 107 -1.228004803 0.2194451651 1.00000000
## 21  SW 106 - SW 107 -0.598774845 0.5493230390 1.00000000
## 22  SW 101 - SW 108  0.055997743 0.9553436051 1.00000000
## 23  SW 102 - SW 108 -1.925432728 0.0541752445 1.00000000
## 24  SW 103 - SW 108  0.621727410 0.5341211247 1.00000000
## 25  SW 104 - SW 108 -1.609485045 0.1075103226 1.00000000
## 26  SW 105 - SW 108 -0.599925201 0.5485560865 1.00000000
## 27  SW 106 - SW 108  0.009063656 0.9927683475 0.99276835
## 28  SW 107 - SW 108  0.586057534 0.5578368520 1.00000000
## 29  SW 101 - SW 109  0.413488431 0.6792487968 1.00000000
## 30  SW 102 - SW 109 -1.652779506 0.0983757464 1.00000000
## 31  SW 103 - SW 109  0.998883856 0.3178509575 1.00000000
## 32  SW 104 - SW 109 -1.321200654 0.1864344683 1.00000000
## 33  SW 105 - SW 109 -0.268504432 0.7883110636 1.00000000
## 34  SW 106 - SW 109  0.364782627 0.7152736741 1.00000000
## 35  SW 107 - SW 109  0.963557472 0.3352678369 1.00000000
## 36  SW 108 - SW 109  0.342449679 0.7320125075 1.00000000
## 37  SW 101 - SW 110  0.246942377 0.8049528194 1.00000000
## 38  SW 102 - SW 110 -1.761196199 0.0782052012 1.00000000
## 39  SW 103 - SW 110  0.817560074 0.4136084331 1.00000000
## 40  SW 104 - SW 110 -1.439327018 0.1500578913 1.00000000
## 41  SW 105 - SW 110 -0.416297988 0.6771919645 1.00000000
## 42  SW 106 - SW 110  0.199535629 0.8418437757 1.00000000
## 43  SW 107 - SW 110  0.782340267 0.4340146283 1.00000000
## 44  SW 108 - SW 110  0.183801244 0.8541693799 1.00000000
## 45  SW 109 - SW 110 -0.155517708 0.8764131946 1.00000000
## 46  SW 101 - SW 111 -1.559948117 0.1187721423 1.00000000
## 47  SW 102 - SW 111 -3.534377427 0.0004087370 0.04863970
## 48  SW 103 - SW 111 -0.966243449 0.3339223764 1.00000000
## 49  SW 104 - SW 111 -3.242002742 0.0011869288 0.13887067
## 50  SW 105 - SW 111 -2.233631737 0.0255073139 1.00000000
## 51  SW 106 - SW 111 -1.608653921 0.1076920368 1.00000000
## 52  SW 107 - SW 111 -1.009879076 0.3125532286 1.00000000
## 53  SW 108 - SW 111 -1.559201362 0.1189487286 1.00000000
## 54  SW 109 - SW 111 -1.973436548 0.0484458470 1.00000000
## 55  SW 110 - SW 111 -1.765284380 0.0775159834 1.00000000
## 56  SW 101 - SW 112 -1.018904977 0.3082480809 1.00000000
## 57  SW 102 - SW 112 -3.018513034 0.0025401847 0.28958106
## 58  SW 103 - SW 112 -0.427478394 0.6690309186 1.00000000
## 59  SW 104 - SW 112 -2.715390019 0.0066197760 0.73479513
## 60  SW 105 - SW 112 -1.694866683 0.0901007141 1.00000000
## 61  SW 106 - SW 112 -1.067610780 0.2856961188 1.00000000
## 62  SW 107 - SW 112 -0.468835935 0.6391869117 1.00000000
## 63  SW 108 - SW 112 -1.037839146 0.2993449486 1.00000000
## 64  SW 109 - SW 112 -1.432393408 0.1520312641 1.00000000
## 65  SW 110 - SW 112 -1.238671657 0.2154671192 1.00000000
## 66  SW 111 - SW 112  0.541043141 0.5884778478 1.00000000
## 67  SW 101 - SW 113 -1.180860289 0.2376582291 1.00000000
## 68  SW 102 - SW 113 -3.172931365 0.0015090820 0.17505351
## 69  SW 103 - SW 113 -0.588751786 0.5560277928 1.00000000
## 70  SW 104 - SW 113 -2.873025745 0.0040656098 0.45534830
## 71  SW 105 - SW 113 -1.856140075 0.0634335949 1.00000000
## 72  SW 106 - SW 113 -1.229566092 0.2188596339 1.00000000
## 73  SW 107 - SW 113 -0.630791247 0.5281770270 1.00000000
## 74  SW 108 - SW 113 -1.193903190 0.2325158396 1.00000000
## 75  SW 109 - SW 113 -1.594348720 0.1108579512 1.00000000
## 76  SW 110 - SW 113 -1.396307382 0.1626219517 1.00000000
## 77  SW 111 - SW 113  0.379087828 0.7046226444 1.00000000
## 78  SW 112 - SW 113 -0.161955312 0.8713410449 1.00000000
## 79  SW 101 - SW 114  0.031846102 0.9745947809 1.00000000
## 80  SW 102 - SW 114 -2.016661189 0.0437308827 1.00000000
## 81  SW 103 - SW 114  0.618848450 0.5360162011 1.00000000
## 82  SW 104 - SW 114 -1.692664020 0.0905194377 1.00000000
## 83  SW 105 - SW 114 -0.648539839 0.5166358524 1.00000000
## 84  SW 106 - SW 114 -0.016859701 0.9865485419 1.00000000
## 85  SW 107 - SW 114  0.581915144 0.5606238395 1.00000000
## 86  SW 108 - SW 114 -0.025310071 0.9798076410 1.00000000
## 87  SW 109 - SW 114 -0.381642329 0.7027266827 1.00000000
## 88  SW 110 - SW 114 -0.215945657 0.8290301073 1.00000000
## 89  SW 111 - SW 114  1.591794220 0.1114309494 1.00000000
## 90  SW 112 - SW 114  1.050751079 0.2933729296 1.00000000
## 91  SW 113 - SW 114  1.212706391 0.2252420954 1.00000000
## 92  SW 101 - SW 115  2.077149256 0.0377877856 1.00000000
## 93  SW 102 - SW 115 -0.066541148 0.9469469986 1.00000000
## 94  SW 103 - SW 115  2.655539763 0.0079181598 0.87099758
## 95  SW 104 - SW 115  0.298087886 0.7656360872 1.00000000
## 96  SW 105 - SW 115  1.388151474 0.1650909252 1.00000000
## 97  SW 106 - SW 115  2.028443452 0.0425150091 1.00000000
## 98  SW 107 - SW 115  2.627218297 0.0086086088 0.93833836
## 99  SW 108 - SW 115  1.945593364 0.0517036048 1.00000000
## 100 SW 109 - SW 115  1.663660825 0.0961802292 1.00000000
## 101 SW 110 - SW 115  1.774806248 0.0759298843 1.00000000
## 102 SW 111 - SW 115  3.637097373 0.0002757277 0.03308733
## 103 SW 112 - SW 115  3.096054233 0.0019611449 0.22553166
## 104 SW 113 - SW 115  3.258009545 0.0011219663 0.13239203
## 105 SW 114 - SW 115  2.045303154 0.0408249800 1.00000000
## 106 SW 101 - SW 116  1.388115404 0.1651019066 1.00000000
## 107 SW 102 - SW 116 -0.723509149 0.4693671354 1.00000000
## 108 SW 103 - SW 116  1.969407119 0.0489063589 1.00000000
## 109 SW 104 - SW 116 -0.372568418 0.7094696820 1.00000000
## 110 SW 105 - SW 116  0.702018830 0.4826674204 1.00000000
## 111 SW 106 - SW 116  1.339409601 0.1804373656 1.00000000
## 112 SW 107 - SW 116  1.938184446 0.0526007237 1.00000000
## 113 SW 108 - SW 116  1.281623730 0.1999746716 1.00000000
## 114 SW 109 - SW 116  0.974626973 0.3297453288 1.00000000
## 115 SW 110 - SW 116  1.104149944 0.2695280990 1.00000000
## 116 SW 111 - SW 116  2.948063522 0.0031977135 0.36134163
## 117 SW 112 - SW 116  2.407020381 0.0160832723 1.00000000
## 118 SW 113 - SW 116  2.568975693 0.0101999612 1.00000000
## 119 SW 114 - SW 116  1.356269302 0.1750134990 1.00000000
## 120 SW 115 - SW 116 -0.689033852 0.4908019650 1.00000000
## 
## **Compact Letter Displays (CLDs - Holm):**
##        Strain Cocoon_Shell_Weight .group
## SW 101 SW 101           0.2937500      a
## SW 102 SW 102           0.2738000      a
## SW 103 SW 103           0.2980847      a
## SW 104 SW 104           0.2824630      a
## SW 105 SW 105           0.2858644      a
## SW 106 SW 106           0.2898333      a
## SW 107 SW 107           0.3016167      a
## SW 108 SW 108           0.2937115      a
## SW 109 SW 109           0.2868167      a
## SW 110 SW 110           0.2900185      a
## SW 111 SW 111           0.3048333      b
## SW 112 SW 112           0.3018033      a
## SW 113 SW 113           0.3026000      a
## SW 114 SW 114           0.2910000      a
## SW 115 SW 115           0.2727167      a
## SW 116 SW 116           0.2795000      a
## 
## 
## --- Analyzing Cocoon_Shell_Percentage ---
## 
## **Attempting log transformation...**

## boundary (singular) fit: see help('isSingular')

## 
## Assumption Checks after Transformation:
##   Normality (Shapiro-Wilk) p-value: 3.532175e-24 
##   Homogeneity of Variance (Levene's) p-value: 0.4632928 
## 
## **Performing Kruskal-Wallis test:**
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Cocoon_Shell_Percentage by Strain
## Kruskal-Wallis chi-squared = 45.229, df = 15, p-value = 7.046e-05
## 
## Epsilon-squared: 0.03314579 
## 
## **Dunn's Test (Post-hoc with Holm correction):**

## Dunn (1964) Kruskal-Wallis multiple comparison
##   p-values adjusted with the Holm method.

##          Comparison           Z      P.unadj       P.adj
## 1   SW 101 - SW 102  0.79234032 4.281623e-01 1.000000000
## 2   SW 101 - SW 103 -2.49137857 1.272485e-02 1.000000000
## 3   SW 102 - SW 103 -3.16564982 1.547370e-03 0.173305388
## 4   SW 101 - SW 104 -2.22753971 2.591123e-02 1.000000000
## 5   SW 102 - SW 104 -2.90203420 3.707481e-03 0.404115384
## 6   SW 103 - SW 104  0.20681503 8.361543e-01 1.000000000
## 7   SW 101 - SW 105 -2.26530087 2.349422e-02 1.000000000
## 8   SW 102 - SW 105 -2.95001080 3.177628e-03 0.349539099
## 9   SW 103 - SW 105  0.22513374 8.218752e-01 1.000000000
## 10  SW 104 - SW 105  0.01328153 9.894032e-01 1.000000000
## 11  SW 101 - SW 106 -0.82999180 4.065434e-01 1.000000000
## 12  SW 102 - SW 106 -1.58370645 1.132605e-01 1.000000000
## 13  SW 103 - SW 106  1.66488149 9.593640e-02 1.000000000
## 14  SW 104 - SW 106  1.41968501 1.556994e-01 1.000000000
## 15  SW 105 - SW 106  1.43880379 1.502061e-01 1.000000000
## 16  SW 101 - SW 107 -0.75727811 4.488833e-01 1.000000000
## 17  SW 102 - SW 107 -1.51437667 1.299303e-01 1.000000000
## 18  SW 103 - SW 107  1.73728901 8.233617e-02 1.000000000
## 19  SW 104 - SW 107  1.49045932 1.361035e-01 1.000000000
## 20  SW 105 - SW 107  1.51121131 1.307346e-01 1.000000000
## 21  SW 106 - SW 107  0.07271368 9.420340e-01 1.000000000
## 22  SW 101 - SW 108 -0.22926627 8.186620e-01 1.000000000
## 23  SW 102 - SW 108 -0.98531935 3.244672e-01 1.000000000
## 24  SW 103 - SW 108  2.17310382 2.977250e-02 1.000000000
## 25  SW 104 - SW 108  1.92698513 5.398148e-02 1.000000000
## 26  SW 105 - SW 108  1.95518449 5.056130e-02 1.000000000
## 27  SW 106 - SW 108  0.57053384 5.683157e-01 1.000000000
## 28  SW 107 - SW 108  0.50046518 6.167476e-01 1.000000000
## 29  SW 101 - SW 109  1.49463234 1.350104e-01 1.000000000
## 30  SW 102 - SW 109  0.63273570 5.269063e-01 1.000000000
## 31  SW 103 - SW 109  3.97971770 6.899715e-05 0.008279658
## 32  SW 104 - SW 109  3.68230800 2.311319e-04 0.027042436
## 33  SW 105 - SW 109  3.75364000 1.742851e-04 0.020739930
## 34  SW 106 - SW 109  2.32462414 2.009207e-02 1.000000000
## 35  SW 107 - SW 109  2.25191045 2.432793e-02 1.000000000
## 36  SW 108 - SW 109  1.66953003 9.501238e-02 1.000000000
## 37  SW 101 - SW 110  0.59017261 5.550749e-01 1.000000000
## 38  SW 102 - SW 110 -0.20900053 8.344478e-01 1.000000000
## 39  SW 103 - SW 110  3.01328393 2.584370e-03 0.286865070
## 40  SW 104 - SW 110  2.74636625 6.025945e-03 0.644776134
## 41  SW 105 - SW 110  2.79318738 5.219145e-03 0.563667661
## 42  SW 106 - SW 110  1.39802730 1.621049e-01 1.000000000
## 43  SW 107 - SW 110  1.32725300 1.844250e-01 1.000000000
## 44  SW 108 - SW 110  0.79334862 4.275747e-01 1.000000000
## 45  SW 109 - SW 110 -0.86459569 3.872607e-01 1.000000000
## 46  SW 101 - SW 111 -0.79235778 4.281521e-01 1.000000000
## 47  SW 102 - SW 111 -1.54782382 1.216647e-01 1.000000000
## 48  SW 103 - SW 111  1.70235705 8.868846e-02 1.000000000
## 49  SW 104 - SW 111  1.45631527 1.453055e-01 1.000000000
## 50  SW 105 - SW 111  1.47627935 1.398689e-01 1.000000000
## 51  SW 106 - SW 111  0.03763401 9.699795e-01 1.000000000
## 52  SW 107 - SW 111 -0.03507967 9.720162e-01 1.000000000
## 53  SW 108 - SW 111 -0.53426880 5.931556e-01 1.000000000
## 54  SW 109 - SW 111 -2.28699012 2.219640e-02 1.000000000
## 55  SW 110 - SW 111 -1.36139704 1.733883e-01 1.000000000
## 56  SW 101 - SW 112  0.14985490 8.808791e-01 1.000000000
## 57  SW 102 - SW 112 -0.64945928 5.160416e-01 1.000000000
## 58  SW 103 - SW 112  2.64060250 8.275875e-03 0.868966911
## 59  SW 104 - SW 112  2.37339776 1.762527e-02 1.000000000
## 60  SW 105 - SW 112  2.41452480 1.575575e-02 1.000000000
## 61  SW 106 - SW 112  0.97984670 3.271618e-01 1.000000000
## 62  SW 107 - SW 112  0.90713302 3.643365e-01 1.000000000
## 63  SW 108 - SW 112  0.37367007 7.086498e-01 1.000000000
## 64  SW 109 - SW 112 -1.34477744 1.786971e-01 1.000000000
## 65  SW 110 - SW 112 -0.44431456 6.568152e-01 1.000000000
## 66  SW 111 - SW 112  0.94221268 3.460838e-01 1.000000000
## 67  SW 101 - SW 113 -1.96378035 4.955557e-02 1.000000000
## 68  SW 102 - SW 113 -2.66473142 7.704983e-03 0.816728226
## 69  SW 103 - SW 113  0.53586681 5.920506e-01 1.000000000
## 70  SW 104 - SW 113  0.31613627 7.518991e-01 1.000000000
## 71  SW 105 - SW 113  0.30978911 7.567213e-01 1.000000000
## 72  SW 106 - SW 113 -1.13378855 2.568833e-01 1.000000000
## 73  SW 107 - SW 113 -1.20650224 2.276239e-01 1.000000000
## 74  SW 108 - SW 113 -1.66307983 9.629646e-02 1.000000000
## 75  SW 109 - SW 113 -3.45841269 5.433684e-04 0.062487361
## 76  SW 110 - SW 113 -2.50157604 1.236419e-02 1.000000000
## 77  SW 111 - SW 113 -1.17142257 2.414290e-01 1.000000000
## 78  SW 112 - SW 113 -2.11363525 3.454643e-02 1.000000000
## 79  SW 101 - SW 114 -0.33478947 7.377839e-01 1.000000000
## 80  SW 102 - SW 114 -1.11154955 2.663319e-01 1.000000000
## 81  SW 103 - SW 114  2.15799875 3.092793e-02 1.000000000
## 82  SW 104 - SW 114  1.90167957 5.721306e-02 1.000000000
## 83  SW 105 - SW 114  1.93192105 5.336925e-02 1.000000000
## 84  SW 106 - SW 114  0.49520233 6.204573e-01 1.000000000
## 85  SW 107 - SW 114  0.42248864 6.726684e-01 1.000000000
## 86  SW 108 - SW 114 -0.09334494 9.256295e-01 1.000000000
## 87  SW 109 - SW 114 -1.82942181 6.733644e-02 1.000000000
## 88  SW 110 - SW 114 -0.91603275 3.596497e-01 1.000000000
## 89  SW 111 - SW 114  0.45756831 6.472626e-01 1.000000000
## 90  SW 112 - SW 114 -0.48464437 6.279286e-01 1.000000000
## 91  SW 113 - SW 114  1.62899088 1.033149e-01 1.000000000
## 92  SW 101 - SW 115  1.21688991 2.236461e-01 1.000000000
## 93  SW 102 - SW 115  0.36791868 7.129339e-01 1.000000000
## 94  SW 103 - SW 115  3.70314471 2.129433e-04 0.025127313
## 95  SW 104 - SW 115  3.41197337 6.449441e-04 0.073523632
## 96  SW 105 - SW 115  3.47706701 5.069311e-04 0.058804008
## 97  SW 106 - SW 115  2.04688170 4.066970e-02 1.000000000
## 98  SW 107 - SW 115  1.97416802 4.836264e-02 1.000000000
## 99  SW 108 - SW 115  1.40189072 1.609479e-01 1.000000000
## 100 SW 109 - SW 115 -0.27774243 7.812101e-01 1.000000000
## 101 SW 110 - SW 115  0.59426105 5.523375e-01 1.000000000
## 102 SW 111 - SW 115  2.00924769 4.451087e-02 1.000000000
## 103 SW 112 - SW 115  1.06703501 2.859560e-01 1.000000000
## 104 SW 113 - SW 115  3.18067026 1.469348e-03 0.166036291
## 105 SW 114 - SW 115  1.55167938 1.207390e-01 1.000000000
## 106 SW 101 - SW 116 -0.32286919 7.467943e-01 1.000000000
## 107 SW 102 - SW 116 -1.10018402 2.712520e-01 1.000000000
## 108 SW 103 - SW 116  2.16986884 3.001678e-02 1.000000000
## 109 SW 104 - SW 116  1.91328191 5.571197e-02 1.000000000
## 110 SW 105 - SW 116  1.94379113 5.192064e-02 1.000000000
## 111 SW 106 - SW 116  0.50712261 6.120688e-01 1.000000000
## 112 SW 107 - SW 116  0.43440892 6.639915e-01 1.000000000
## 113 SW 108 - SW 116 -0.08185827 9.347594e-01 1.000000000
## 114 SW 109 - SW 116 -1.81750153 6.914035e-02 1.000000000
## 115 SW 110 - SW 116 -0.90443040 3.657672e-01 1.000000000
## 116 SW 111 - SW 116  0.46948859 6.387204e-01 1.000000000
## 117 SW 112 - SW 116 -0.47272409 6.364100e-01 1.000000000
## 118 SW 113 - SW 116  1.64091116 1.008159e-01 1.000000000
## 119 SW 114 - SW 116  0.01192028 9.904892e-01 0.990489221
## 120 SW 115 - SW 116 -1.53975910 1.236191e-01 1.000000000
## 
## **Compact Letter Displays (CLDs - Holm):**
##        Strain Cocoon_Shell_Percentage .group
## SW 101 SW 101                19.90443      a
## SW 102 SW 102                19.53279      a
## SW 103 SW 103                21.18908      a
## SW 104 SW 104                21.71367      a
## SW 105 SW 105                21.13654      a
## SW 106 SW 106                20.38933      a
## SW 107 SW 107                20.20448      a
## SW 108 SW 108                20.12382      a
## SW 109 SW 109                20.20702      b
## SW 110 SW 110                19.70253      a
## SW 111 SW 111                20.22120      a
## SW 112 SW 112                19.83908      a
## SW 113 SW 113                20.87511      a
## SW 114 SW 114                19.92945      a
## SW 115 SW 115                19.42583      b
## SW 116 SW 116                20.07143      a
## 
## 
## --- Analyzing Hatching_Rate ---
## 
## **Performing ANOVA on original data:**
## Analysis of Variance Table
## 
## Response: Hatching_Rate
##           Df  Sum Sq Mean Sq F value  Pr(>F)  
## Strain    15  59.064  3.9376  1.6219 0.08659 .
## Location   1   8.520  8.5204  3.5096 0.06471 .
## Residuals 79 191.790  2.4277                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Eta-squared (Marginal R-squared): 
## 
## ANOVA not significant. No post-hoc test needed.
## 
## 
## --- Analyzing Larval_Weight ---
## 
## **Performing ANOVA on original data:**
## Analysis of Variance Table
## 
## Response: Larval_Weight
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Strain    15 155.39   10.36  1.6854   0.07097 .  
## Location   1 598.31  598.31 97.3423 2.003e-15 ***
## Residuals 79 485.57    6.15                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Eta-squared (Marginal R-squared): 
## 
## ANOVA not significant. No post-hoc test needed.
## 
## 
## --- Analyzing Live_Pupa_Percentage ---
## 
## **Attempting log transformation...**

## boundary (singular) fit: see help('isSingular')

## 
## Assumption Checks after Transformation:
##   Normality (Shapiro-Wilk) p-value: 5.605831e-07 
##   Homogeneity of Variance (Levene's) p-value: 0.185054 
## 
## **Performing Kruskal-Wallis test:**
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Live_Pupa_Percentage by Strain
## Kruskal-Wallis chi-squared = 8.2778, df = 15, p-value = 0.9122
## 
## Epsilon-squared: -0.08402715 
## 
## Kruskal-Wallis not significant. No post-hoc test needed.

Multivariate Analysis

We’ll now explore the relationships among the traits and strains using multivariate techniques.

1. Principal Component Analysis (PCA): PCA helps us understand the main sources of variation in the data and visualize the relationships between individuals (silkworms) and variables (traits).

Multivariate Analysis (Combined Locations)

# Subset the data for multivariate analysis
# Include all traits except "Larval_Weight" since it has missing data for one location
traits_multivariate <- c("Single_Cocoon_Weight", "Cocoon_Shell_Weight", "Larval_Weight",
                         "Cocoon_Shell_Percentage", "Live_Pupa_Percentage", "Hatching_Rate")


# Filter combined_data to include only the 16 hybrid strains
hybrid_strains <- unique(combined_data$Strain)  # Get the unique hybrid strains
data_hybrids <- combined_data %>% filter(Strain %in% hybrid_strains)


# Reshape the data to wide format for location comparison
data_wide <- data_hybrids %>%
  pivot_wider(id_cols = Strain, 
              names_from = Location, 
              values_from = all_of(traits_multivariate),
              names_sep = "_",  # Separate location names in column names with "_"
              values_fn = mean)  # Use the mean value for each trait in each location


# 1. Principal Component Analysis (PCA)

# Perform PCA (scaling the data is generally recommended)
pca_result <- PCA(data_wide[, -1], scale.unit = TRUE, graph = FALSE)  # Exclude the 'Strain' column

# Scree plot
fviz_eig(pca_result, addlabels = TRUE, ylim = c(0, 80)) +  # Adjust ylim as needed
  labs(x = "Principal Component", y = "Percentage of Explained Variance")

# Individuals plot (colored by Strain)
fviz_pca_ind(pca_result, repel = TRUE, 
             col.ind = data_wide$Strain,
             palette = colorRampPalette(brewer.pal(12, "Paired"))(16),# Use a larger color palette
             addEllipses = TRUE, ellipse.type = "confidence",
             legend.title = "Strain", 
             title = "PCA - Individuals Plot (Hybrids Only, Colored by Strain)")

# Variables plot
fviz_pca_var(pca_result, repel = TRUE, 
             col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             title = "PCA - Variables Plot")

# Biplot with both individuals and variables
fviz_pca_biplot(pca_result, repel = TRUE, 
                col.ind = data_wide$Strain, 
                palette = colorRampPalette(brewer.pal(12, "Paired"))(16),# Use a larger color palette
                addEllipses = TRUE, ellipse.type = "confidence",
                legend.title = "Strain", 
                title = "PCA Biplot (Hybrids Only, Colored by Strain)")

# 2. Hierarchical Clustering

# Extract the principal components
pca_coords <- pca_result$ind$coord

# Perform hierarchical clustering (using first few PCs - adjust as needed)
hc_result <- hclust(dist(pca_coords[, 1:2]), method = "ward.D2") 

# Visualize the dendrogram
plot(hc_result, hang = -1, main = "Hierarchical Clustering on PCs", 
     xlab = "", sub = "")

# 3. K-means Clustering

# Determine optimal number of clusters (Elbow method) using data_hybrids
set.seed(123) 
wss <- fviz_nbclust(data_hybrids[, traits_multivariate], kmeans, method = "wss")  # Use data_hybrids

# Perform k-means clustering 
k <- 3  # Adjust based on the elbow plot or other criteria
kmeans_result <- kmeans(pca_coords, centers = k, nstart = 25)

# Visualize clusters
fviz_cluster(kmeans_result, data = pca_coords, geom = "point", ellipse.type = "convex", 
             ggtheme = theme_minimal())

# 4. Factor Analysis

# Load necessary library
library(psych) 

# Determine the appropriate number of factors (e.g., using scree plot or parallel analysis)
# For this example, let's assume you've decided to extract 2 factors
nfactors <- 4

# Perform Factor Analysis (determine the number of factors first)
fa_model <- fa(data_hybrids[, traits_multivariate], nfactors = 4, rotate = "varimax")  # Adjust nfactors based on your analysis

# Print the results
print(fa_model, sort = TRUE)

## Factor Analysis using method =  minres
## Call: fa(r = data_hybrids[, traits_multivariate], nfactors = 4, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                         item   MR1   MR2   MR4   MR3   h2     u2 com
## Single_Cocoon_Weight       1  0.96  0.01  0.25 -0.01 0.98 0.0212 1.1
## Cocoon_Shell_Weight        2  0.82  0.52  0.23  0.03 1.00 0.0049 1.9
## Live_Pupa_Percentage       5 -0.67 -0.40 -0.25  0.01 0.67 0.3258 1.9
## Cocoon_Shell_Percentage    4  0.20  0.95  0.14  0.05 0.97 0.0334 1.1
## Larval_Weight              3  0.40  0.20  0.88  0.18 1.00 0.0050 1.6
## Hatching_Rate              6 -0.01  0.02  0.08  0.65 0.43 0.5687 1.0
## 
##                        MR1  MR2  MR4  MR3
## SS loadings           2.24 1.37 0.97 0.46
## Proportion Var        0.37 0.23 0.16 0.08
## Cumulative Var        0.37 0.60 0.76 0.84
## Proportion Explained  0.44 0.27 0.19 0.09
## Cumulative Proportion 0.44 0.72 0.91 1.00
## 
## Mean item complexity =  1.5
## Test of the hypothesis that 4 factors are sufficient.
## 
## df null model =  15  with the objective function =  5.41 with Chi Square =  498.7
## df of  the model are -3  and the objective function was  0 
## 
## The root mean square of the residuals (RMSR) is  0 
## The df corrected root mean square of the residuals is  NA 
## 
## The harmonic n.obs is  96 with the empirical chi square  0  with prob <  NA 
## The total n.obs was  96  with Likelihood Chi Square =  0  with prob <  NA 
## 
## Tucker Lewis Index of factoring reliability =  1.032
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    MR1  MR2  MR4   MR3
## Correlation of (regression) scores with factors   0.99 0.99 0.98  0.66
## Multiple R square of scores with factors          0.98 0.97 0.96  0.44
## Minimum correlation of possible factor scores     0.97 0.94 0.93 -0.13

# Visualize factor loadings
fa.diagram(fa_model, simple = FALSE)

# 1. Multidimensional Scaling (MDS)

# Calculate a distance matrix (using Euclidean distance as an example)
dist_matrix <- dist(data_hybrids[, traits_multivariate], method = "euclidean")

# Perform classical MDS
mds_result <- cmdscale(dist_matrix, k = 2)  # k = number of dimensions

# Convert to dataframe for plotting
mds_df <- as.data.frame(mds_result)
colnames(mds_df) <- c("MDS1", "MDS2")
mds_df$Strain <- data_hybrids$Strain
mds_df$Location <- data_hybrids$Location

# Visualize MDS results
ggplot(mds_df, aes(x = MDS1, y = MDS2, color = Strain, shape = Location)) + 
  geom_point(size = 3) +
  labs(title = "MDS Plot of Silkworm Strains") +
  theme_bw()

# 2. Principal Coordinate Analysis (PCoA)

# Perform PCoA (using the same distance matrix)
pcoa_result <- cmdscale(dist_matrix, k = 2, eig = TRUE)  # eig = TRUE to get eigenvalues

# Convert to dataframe for plotting
pcoa_df <- as.data.frame(pcoa_result$points)
colnames(pcoa_df) <- c("PCoA1", "PCoA2")
pcoa_df$Strain <- data_hybrids$Strain
pcoa_df$Location <- data_hybrids$Location

# Visualize PCoA results
ggplot(pcoa_df, aes(x = PCoA1, y = PCoA2, color = Strain, shape = Location)) + 
  geom_point(size = 3) +
  labs(title = "PCoA Plot of Silkworm Strains") +
  theme_bw()

# 3. Non-metric Multidimensional Scaling (NMDS)

# Perform NMDS
nmds_result <- metaMDS(data_hybrids[, traits_multivariate], k = 2)  # k = number of dimensions

## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.0309942 
## Run 1 stress 0.03620601 
## Run 2 stress 0.03569862 
## Run 3 stress 0.0418888 
## Run 4 stress 0.041855 
## Run 5 stress 0.04107385 
## Run 6 stress 0.03712024 
## Run 7 stress 0.04048451 
## Run 8 stress 0.03620584 
## Run 9 stress 0.03724261 
## Run 10 stress 0.03304121 
## Run 11 stress 0.03960414 
## Run 12 stress 0.03858935 
## Run 13 stress 0.04412229 
## Run 14 stress 0.04147472 
## Run 15 stress 0.04014125 
## Run 16 stress 0.03299432 
## Run 17 stress 0.03099445 
## ... Procrustes: rmse 4.890779e-05  max resid 0.0004212855 
## ... Similar to previous best
## Run 18 stress 0.03620604 
## Run 19 stress 0.03839966 
## Run 20 stress 0.04480203 
## *** Best solution repeated 1 times

# Visualize NMDS results (using 'stressplot' to assess goodness-of-fit)
stressplot(nmds_result)

# Plot NMDS results with strain and location information
ordiplot(nmds_result, type = "n")  # Create an empty plot
points(nmds_result, col = as.numeric(data_hybrids$Strain), pch = as.numeric(data_hybrids$Location))
ordihull(nmds_result, groups = data_hybrids$Strain, draw = "polygon", col = "grey90")  # Add convex hulls for strains

# 6. Discriminant Analysis

# Load necessary library
library(MASS)

# Perform Linear Discriminant Analysis
lda_model <- lda(Strain ~ ., data = data_hybrids[, c(traits_multivariate, "Strain")])  # Use data_hybrids

# Print the results
print(lda_model)

## Call:
## lda(Strain ~ ., data = data_hybrids[, c(traits_multivariate, 
##     "Strain")])
## 
## Prior probabilities of groups:
## SW 101 SW 102 SW 103 SW 104 SW 105 SW 106 SW 107 SW 108 SW 109 SW 110 SW 111 
## 0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 0.0625 
## SW 112 SW 113 SW 114 SW 115 SW 116 
## 0.0625 0.0625 0.0625 0.0625 0.0625 
## 
## Group means:
##        Single_Cocoon_Weight Cocoon_Shell_Weight Larval_Weight
## SW 101             1.483800           0.2937500      29.43856
## SW 102             1.433683           0.2817167      28.82072
## SW 103             1.419659           0.2983770      28.98178
## SW 104             1.328917           0.2838542      29.02983
## SW 105             1.364420           0.2863379      28.13722
## SW 106             1.431433           0.2898333      31.15083
## SW 107             1.501850           0.3016167      28.59839
## SW 108             1.479129           0.2995985      29.56856
## SW 109             1.496533           0.2868167      30.23294
## SW 110             1.487437           0.2914792      29.30011
## SW 111             1.532417           0.3048333      32.25856
## SW 112             1.532417           0.3018033      32.28139
## SW 113             1.459350           0.3026000      31.27800
## SW 114             1.473400           0.2910000      31.38083
## SW 115             1.420700           0.2727167      30.67006
## SW 116             1.402300           0.2795000      30.81650
##        Cocoon_Shell_Percentage Live_Pupa_Percentage Hatching_Rate
## SW 101                19.90443             67.90551      97.70825
## SW 102                19.69722             60.93521      97.07138
## SW 103                21.20281             65.48781      97.05508
## SW 104                21.59885             68.26253      96.97524
## SW 105                21.15305             64.27633      96.52421
## SW 106                20.38933             56.39149      97.05680
## SW 107                20.20448             58.76280      95.38672
## SW 108                20.32851             47.36697      95.98854
## SW 109                20.20702             55.66281      94.85372
## SW 110                19.77045             52.02084      96.51966
## SW 111                20.22120             61.29672      96.17741
## SW 112                19.83908             63.69991      95.84273
## SW 113                20.87511             71.05085      96.53398
## SW 114                19.92945             75.21564      95.70890
## SW 115                19.42583             67.40499      97.23525
## SW 116                20.07143             69.04902      97.64132
## 
## Coefficients of linear discriminants:
##                                   LD1           LD2          LD3          LD4
## Single_Cocoon_Weight    -45.662383413   28.75622882 -6.421795663   1.39797864
## Cocoon_Shell_Weight     193.042705416 -197.11676454 -4.769458117 -37.14424759
## Larval_Weight            -0.050401573   -0.09756293 -0.004739677   0.33472202
## Cocoon_Shell_Percentage  -1.867591352    2.80706699 -0.992320812   0.34293096
## Live_Pupa_Percentage     -0.001484765   -0.03305863 -0.047030577   0.01490563
## Hatching_Rate             0.363148961   -0.01672590  0.300879586   0.06963535
##                                  LD5         LD6
## Single_Cocoon_Weight    -20.21647918  2.05199921
## Cocoon_Shell_Weight      65.87357128 -5.34658662
## Larval_Weight             0.15211049  0.07033239
## Cocoon_Shell_Percentage  -1.10394258  0.24364051
## Live_Pupa_Percentage     -0.02089249 -0.01680327
## Hatching_Rate            -0.47004763  0.12758767
## 
## Proportion of trace:
##    LD1    LD2    LD3    LD4    LD5    LD6 
## 0.4994 0.2530 0.1413 0.0746 0.0259 0.0057

# Predict group membership for new data (optional)
# new_data <- data.frame( ... )  # Replace ... with your new data
# predictions <- predict(lda_model, newdata = new_data)

# Visualize LDA results

# Scatterplot of discriminant scores
lda_scores <- predict(lda_model)$x

# Create a data frame with the discriminant scores, Strain, and Location information
lda_df <- data.frame(lda_scores, Strain = data_hybrids$Strain, Location = data_hybrids$Location)

# Plot the discriminant scores, with a custom color palette for better distinction
ggplot(lda_df, aes(x = LD1, y = LD2, color = Strain, shape = Location)) + 
  geom_point(size = 3) + # Increase the 'size' argument 
  labs(title = "Scatterplot of Discriminant Scores", x = "LD1", y = "LD2") +
  theme_bw() +
  scale_color_manual(values = rainbow(nlevels(data_hybrids$Strain))) +  # Use a rainbow color palette
  guides(color = guide_legend(ncol = 1))  # Display Strain legend in 1 column

Technical Discussion Points and Considerations:

Scree Plot: 1. Examine the proportion of variance explained by each principal component (PC). 2. Look for the “elbow” point where the explained variance starts to level off, indicating the number of PCs to retain for further analysis.

Individuals Plot: 1. Observe how the individual silkworms are distributed in the PCA space based on their trait values. 2. Look for clusters or patterns that might indicate similarities or differences between strains. 3. Consider the implications of the positioning of individuals relative to the PC axes and the ellipses representing strain confidence intervals.

Variables Plot: 1. Interpret the contribution of each trait to the principal components. 2. Arrows pointing in the same direction suggest positive correlations between traits, while arrows pointing in opposite directions suggest negative correlations. 3. The length of the arrow represents the strength of the contribution.

2. Hierarchical Clustering on Principal Components: We’ll perform hierarchical clustering on the first few principal components to group similar individuals together.

Technical Discussion Points and Considerations:

Examine the branching pattern of the dendrogram to understand how individuals are grouped based on their similarity in the PCA space.
The height of the branches represents the dissimilarity between clusters.
Look for clusters that correspond to specific strains or groups of strains, providing insights into phenotypic relationships.

3. K-means Clustering: K-means clustering will partition the individuals into a predefined number of clusters based on their principal component scores.

Technical Discussion Points and Considerations:

Elbow Plot: 1. Use the elbow plot to help determine the optimal number of clusters (k) for k-means clustering. 2. Look for the point where the within-cluster sum of squares (WSS) starts to level off, indicating diminishing returns from adding more clusters.

Cluster Plot: 1. Observe how the individuals are partitioned into clusters in the PCA space. 2. Interpret the meaning of each cluster in terms of the phenotypic traits and strains. 3. Consider the separation between clusters and the tightness within clusters.

Correlation Analysis

We will explore the relationships between the phenotypic traits using both Pearson and Spearman correlation coefficients.

Correlation Analysis (Combined Locations)

# 1. Correlation Matrices and Correlograms

# Calculate Pearson and Spearman correlation matrices
cor_pearson <- cor(combined_data[, all_traits], method = "pearson")  # Use all_traits
cor_spearman <- cor(combined_data[, all_traits], method = "spearman")  # Use all_traits

# Visualize correlation matrices using heatmaps
heatmap(cor_pearson, Rowv = NA, Colv = NA, 
        col = colorRampPalette(c("blue", "white", "red"))(20), 
        scale = "none", margins = c(5, 10), cexRow = 0.8, cexCol = 0.8)

heatmap(cor_spearman, Rowv = NA, Colv = NA, 
        col = colorRampPalette(c("blue", "white", "red"))(20), 
        scale = "none", margins = c(5, 10), cexRow = 0.8, cexCol = 0.8)

# Visualize correlograms
corrplot(cor_pearson, method = "color", type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45, sig.level = 0.05, 
         title = "Pearson Correlation", mar = c(0, 0, 2, 0))

corrplot(cor_spearman, method = "color", type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45, sig.level = 0.05, 
         title = "Spearman Correlation", mar = c(0, 0, 2, 0))

# 2. Individual Correlation Tests and Scatterplots

# Function to perform correlation test and plot
plot_correlation <- function(trait1, trait2, data, method = "pearson") {
  cor_test <- cor.test(data[[trait1]], data[[trait2]], method = method)
  p_value <- cor_test$p.value
  cor_coef <- cor_test$estimate
  
  ggplot(data, aes_string(x = trait1, y = trait2, color = "Location")) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    labs(title = paste("Correlation between", trait1, "and", trait2),
         subtitle = paste(method, "correlation:", round(cor_coef, 3), ", p-value:", round(p_value, 5))) +
    facet_wrap(~ Location)  # Facet by Location
}

# Example usage (replace with your actual trait names)
plot_correlation("Single_Cocoon_Weight", "Cocoon_Shell_Weight", combined_data, method = "pearson")

## `geom_smooth()` using formula = 'y ~ x'

plot_correlation("Single_Cocoon_Weight", "Cocoon_Shell_Weight", combined_data, method = "spearman")

## `geom_smooth()` using formula = 'y ~ x'

# 3. Scatterplot Matrix

# Scatterplot matrix with Pearson correlations
ggpairs(combined_data[, c(all_traits, "Location")],  # Include 'Location' in the subset
        aes(color = Location, alpha = 0.5), 
        upper = list(continuous = wrap("cor", method = "pearson", size = 3))) + 
  theme(axis.text = element_text(size = 8)) + 
  theme(plot.margin = margin(5, 10, 5, 5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Scatterplot matrix with Spearman correlations
ggpairs(combined_data[, c(all_traits, "Location")],  # Include 'Location' in the subset
        aes(color = Location, alpha = 0.5), 
        upper = list(continuous = wrap("cor", method = "spearman", size = 3))) + 
  theme(axis.text = element_text(size = 8)) + 
  theme(plot.margin = margin(5, 10, 5, 5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# 4. Partial Correlation Analysis

# Get the list of traits for partial correlation analysis
traits_partial <- c("Single_Cocoon_Weight", "Cocoon_Shell_Weight", "Larval_Weight", 
                    "Cocoon_Shell_Percentage", "Hatching_Rate", "Live_Pupa_Percentage")

# Generate all combinations of 3 traits
trait_combinations <- combn(traits_partial, 3, simplify = FALSE)

# Loop through each location
for (loc in levels(combined_data$Location)) {  # Use combined_data$Location
  cat("\n\n--- Analyzing Partial Correlations for", loc, "---\n")
  
  # Subset data by location first
  data_loc <- combined_data %>% filter(Location == loc)
  
  # Loop through each combination of traits
  for (combination in trait_combinations) {
    trait1 <- combination[1]
    trait2 <- combination[2]
    trait3 <- combination[3]  # Controlling variable
    
    # Check for missing values in the selected traits within the location subset
    if (any(is.na(data_loc[[trait1]])) || any(is.na(data_loc[[trait2]])) || any(is.na(data_loc[[trait3]]))) {
      cat("\n\nMissing values found for", trait1, ",", trait2, ", or", trait3, "in", loc, ". Imputing missing values.\n")
      
      # Impute missing values using the mean for each trait within the location
      data_loc <- data_loc %>%
        mutate(across(c(trait1, trait2, trait3), ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))
    }
    
    # Calculate partial correlation using 'psych' package
    partial_cor <- partial.r(data_loc[, c(trait1, trait2, trait3)])  # Calculate partial correlation matrix
    
    # Extract the relevant partial correlation coefficient
    partial_cor_coef <- partial_cor[trait1, trait2]
    
    # Perform correlation test to get p-value
    cor_test_result <- cor.test(data_loc[[trait1]], data_loc[[trait2]], method = "pearson")  # Use appropriate method
    p_value <- cor_test_result$p.value
    
    # Print the results
    cat("\n\nPartial Correlation between", trait1, "and", trait2, 
        "controlling for", trait3, "in", loc, ":\n")
    cat("  Correlation Coefficient:", partial_cor_coef, "\n") 
    cat("  P-value:", p_value, "\n")  # Print the p-value
  }
}

## 
## 
## --- Analyzing Partial Correlations for BALUBAL ---
## 
## 
## Partial Correlation between Single_Cocoon_Weight and Cocoon_Shell_Weight controlling for Larval_Weight in BALUBAL :
##   Correlation Coefficient: 0.7023777 
##   P-value: 3.15263e-08 
## 
## 
## Partial Correlation between Single_Cocoon_Weight and Cocoon_Shell_Weight controlling for Cocoon_Shell_Percentage in BALUBAL :
##   Correlation Coefficient: 0.8801142 
##   P-value: 3.15263e-08 
## 
## 
## Partial Correlation between Single_Cocoon_Weight and Cocoon_Shell_Weight controlling for Hatching_Rate in BALUBAL :
##   Correlation Coefficient: 0.6820058 
##   P-value: 3.15263e-08 
## 
## 
## Partial Correlation between Single_Cocoon_Weight and Cocoon_Shell_Weight controlling for Live_Pupa_Percentage in BALUBAL :
##   Correlation Coefficient: 0.6859562 
##   P-value: 3.15263e-08 
## 
## 
## Partial Correlation between Single_Cocoon_Weight and Larval_Weight controlling for Cocoon_Shell_Percentage in BALUBAL :
##   Correlation Coefficient: -0.03890379 
##   P-value: 0.8983348 
## 
## 
## Partial Correlation between Single_Cocoon_Weight and Larval_Weight controlling for Hatching_Rate in BALUBAL :
##   Correlation Coefficient: 0.02922613 
##   P-value: 0.8983348 
## 
## 
## Partial Correlation between Single_Cocoon_Weight and Larval_Weight controlling for Live_Pupa_Percentage in BALUBAL :
##   Correlation Coefficient: 0.03335183 
##   P-value: 0.8983348 
## 
## 
## Partial Correlation between Single_Cocoon_Weight and Cocoon_Shell_Percentage controlling for Hatching_Rate in BALUBAL :
##   Correlation Coefficient: -0.5263022 
##   P-value: 0.000309444 
## 
## 
## Partial Correlation between Single_Cocoon_Weight and Cocoon_Shell_Percentage controlling for Live_Pupa_Percentage in BALUBAL :
##   Correlation Coefficient: -0.5077487 
##   P-value: 0.000309444 
## 
## 
## Partial Correlation between Single_Cocoon_Weight and Hatching_Rate controlling for Live_Pupa_Percentage in BALUBAL :
##   Correlation Coefficient: -0.1604772 
##   P-value: 0.1388483 
## 
## 
## Partial Correlation between Cocoon_Shell_Weight and Larval_Weight controlling for Cocoon_Shell_Percentage in BALUBAL :
##   Correlation Coefficient: -0.114652 
##   P-value: 0.4273099 
## 
## 
## Partial Correlation between Cocoon_Shell_Weight and Larval_Weight controlling for Hatching_Rate in BALUBAL :
##   Correlation Coefficient: -0.06250522 
##   P-value: 0.4273099 
## 
## 
## Partial Correlation between Cocoon_Shell_Weight and Larval_Weight controlling for Live_Pupa_Percentage in BALUBAL :
##   Correlation Coefficient: -0.05151845 
##   P-value: 0.4273099 
## 
## 
## Partial Correlation between Cocoon_Shell_Weight and Cocoon_Shell_Percentage controlling for Hatching_Rate in BALUBAL :
##   Correlation Coefficient: 0.1029392 
##   P-value: 0.4305926 
## 
## 
## Partial Correlation between Cocoon_Shell_Weight and Cocoon_Shell_Percentage controlling for Live_Pupa_Percentage in BALUBAL :
##   Correlation Coefficient: 0.1207397 
##   P-value: 0.4305926 
## 
## 
## Partial Correlation between Cocoon_Shell_Weight and Hatching_Rate controlling for Live_Pupa_Percentage in BALUBAL :
##   Correlation Coefficient: -0.1944656 
##   P-value: 0.06232077 
## 
## 
## Partial Correlation between Larval_Weight and Cocoon_Shell_Percentage controlling for Hatching_Rate in BALUBAL :
##   Correlation Coefficient: -0.01603855 
##   P-value: 0.8416214 
## 
## 
## Partial Correlation between Larval_Weight and Cocoon_Shell_Percentage controlling for Live_Pupa_Percentage in BALUBAL :
##   Correlation Coefficient: -0.03087665 
##   P-value: 0.8416214 
## 
## 
## Partial Correlation between Larval_Weight and Hatching_Rate controlling for Live_Pupa_Percentage in BALUBAL :
##   Correlation Coefficient: 0.1303172 
##   P-value: 0.1406125 
## 
## 
## Partial Correlation between Cocoon_Shell_Percentage and Hatching_Rate controlling for Live_Pupa_Percentage in BALUBAL :
##   Correlation Coefficient: -0.07004473 
##   P-value: 0.6616764 
## 
## 
## --- Analyzing Partial Correlations for TCMO ---
## 
## 
## Partial Correlation between Single_Cocoon_Weight and Cocoon_Shell_Weight controlling for Larval_Weight in TCMO :
##   Correlation Coefficient: 0.5127208 
##   P-value: 1.959491e-05 
## 
## 
## Partial Correlation between Single_Cocoon_Weight and Cocoon_Shell_Weight controlling for Cocoon_Shell_Percentage in TCMO :
##   Correlation Coefficient: 0.9967735 
##   P-value: 1.959491e-05 
## 
## 
## Partial Correlation between Single_Cocoon_Weight and Cocoon_Shell_Weight controlling for Hatching_Rate in TCMO :
##   Correlation Coefficient: 0.5626946 
##   P-value: 1.959491e-05 
## 
## 
## Partial Correlation between Single_Cocoon_Weight and Cocoon_Shell_Weight controlling for Live_Pupa_Percentage in TCMO :
##   Correlation Coefficient: 0.540071 
##   P-value: 1.959491e-05 
## 
## 
## Partial Correlation between Single_Cocoon_Weight and Larval_Weight controlling for Cocoon_Shell_Percentage in TCMO :
##   Correlation Coefficient: 0.4529217 
##   P-value: 0.001023493 
## 
## 
## Partial Correlation between Single_Cocoon_Weight and Larval_Weight controlling for Hatching_Rate in TCMO :
##   Correlation Coefficient: 0.455567 
##   P-value: 0.001023493 
## 
## 
## Partial Correlation between Single_Cocoon_Weight and Larval_Weight controlling for Live_Pupa_Percentage in TCMO :
##   Correlation Coefficient: 0.4546899 
##   P-value: 0.001023493 
## 
## 
## Partial Correlation between Single_Cocoon_Weight and Cocoon_Shell_Percentage controlling for Hatching_Rate in TCMO :
##   Correlation Coefficient: -0.3907053 
##   P-value: 0.008486627 
## 
## 
## Partial Correlation between Single_Cocoon_Weight and Cocoon_Shell_Percentage controlling for Live_Pupa_Percentage in TCMO :
##   Correlation Coefficient: -0.3746729 
##   P-value: 0.008486627 
## 
## 
## Partial Correlation between Single_Cocoon_Weight and Hatching_Rate controlling for Live_Pupa_Percentage in TCMO :
##   Correlation Coefficient: -0.05502732 
##   P-value: 0.3154037 
## 
## 
## Partial Correlation between Cocoon_Shell_Weight and Larval_Weight controlling for Cocoon_Shell_Percentage in TCMO :
##   Correlation Coefficient: 0.4396895 
##   P-value: 0.03382506 
## 
## 
## Partial Correlation between Cocoon_Shell_Weight and Larval_Weight controlling for Hatching_Rate in TCMO :
##   Correlation Coefficient: 0.3005679 
##   P-value: 0.03382506 
## 
## 
## Partial Correlation between Cocoon_Shell_Weight and Larval_Weight controlling for Live_Pupa_Percentage in TCMO :
##   Correlation Coefficient: 0.2932895 
##   P-value: 0.03382506 
## 
## 
## Partial Correlation between Cocoon_Shell_Weight and Cocoon_Shell_Percentage controlling for Hatching_Rate in TCMO :
##   Correlation Coefficient: 0.5384771 
##   P-value: 7.42543e-05 
## 
## 
## Partial Correlation between Cocoon_Shell_Weight and Cocoon_Shell_Percentage controlling for Live_Pupa_Percentage in TCMO :
##   Correlation Coefficient: 0.5752916 
##   P-value: 7.42543e-05 
## 
## 
## Partial Correlation between Cocoon_Shell_Weight and Hatching_Rate controlling for Live_Pupa_Percentage in TCMO :
##   Correlation Coefficient: -0.1275126 
##   P-value: 0.1982872 
## 
## 
## Partial Correlation between Larval_Weight and Cocoon_Shell_Percentage controlling for Hatching_Rate in TCMO :
##   Correlation Coefficient: -0.1173128 
##   P-value: 0.4466857 
## 
## 
## Partial Correlation between Larval_Weight and Cocoon_Shell_Percentage controlling for Live_Pupa_Percentage in TCMO :
##   Correlation Coefficient: -0.1062923 
##   P-value: 0.4466857 
## 
## 
## Partial Correlation between Larval_Weight and Hatching_Rate controlling for Live_Pupa_Percentage in TCMO :
##   Correlation Coefficient: -0.03882875 
##   P-value: 0.656431 
## 
## 
## Partial Correlation between Cocoon_Shell_Percentage and Hatching_Rate controlling for Live_Pupa_Percentage in TCMO :
##   Correlation Coefficient: -0.09105911 
##   P-value: 0.6554186

# 5. Correlation with Location

# Calculate ANOVA for each trait with Location as a factor and store the results
anova_results <- list()
for (trait in all_traits) {  # Loop through all traits
  anova_result <- aov(as.formula(paste(trait, "~ Location")), data = combined_data)
  anova_results[[trait]] <- anova_result # Store the entire ANOVA table
  print(summary(anova_result))
}

##             Df Sum Sq Mean Sq F value Pr(>F)    
## Location     1 0.5895  0.5895   103.6 <2e-16 ***
## Residuals   94 0.5351  0.0057                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##             Df Sum Sq Mean Sq F value Pr(>F)    
## Location     1 0.0610 0.06100   271.7 <2e-16 ***
## Residuals   94 0.0211 0.00022                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Location     1  51.73   51.73   61.39 7.07e-12 ***
## Residuals   94  79.20    0.84                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##             Df Sum Sq Mean Sq F value Pr(>F)  
## Location     1   8.52   8.520   3.193 0.0772 .
## Residuals   94 250.85   2.669                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Location     1  598.3   598.3   87.75 4.07e-15 ***
## Residuals   94  641.0     6.8                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##             Df Sum Sq Mean Sq F value Pr(>F)    
## Location     1  66639   66639   343.8 <2e-16 ***
## Residuals   94  18220     194                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Create a list to store the boxplots
boxplots <- list()

# Generate boxplots for each trait with ANOVA results
for (trait in all_traits) {  # Loop through all traits
  
  # Extract p-value from ANOVA result
  p_value <- summary(anova_results[[trait]])[[1]][["Pr(>F)"]][1]
  
  # Format p-value for display
  p_value_label <- if (p_value < 0.001) {
    "< 0.001"  # Display as "< 0.001" if very small
  } else {
    paste("=", round(p_value, 3))  # Otherwise, round to 3 decimal places
  }
  
  # Create boxplot with p-value annotation
  boxplots[[trait]] <- ggplot(combined_data, aes(x = Location, y = .data[[trait]])) + 
    geom_boxplot() +
    labs(title = paste("Boxplot of", trait, "by Location"),
         x = "Location", y = trait) +
    theme_bw() +
    annotate("text", x = 1.5, y = max(combined_data[[trait]]),  
             label = paste("ANOVA p-value:", p_value_label))  # Use the formatted p-value label
}

# Arrange all boxplots in a single figure
ggarrange(plotlist = boxplots, ncol = 2, nrow = 3)  # Adjust ncol and nrow as needed

## Advanced Correlation Analysis (Combined Locations)

# 1. Distance Correlation

library(energy)  # Load the 'energy' package

# Calculate distance correlation between all pairs of traits
dist_cor_matrix <- matrix(nrow = length(all_traits), ncol = length(all_traits))
rownames(dist_cor_matrix) <- all_traits
colnames(dist_cor_matrix) <- all_traits

for (i in 1:length(all_traits)) {
  for (j in i:length(all_traits)) {
    trait1 <- all_traits[i]
    trait2 <- all_traits[j]
    dist_cor <- dcor(combined_data[[trait1]], combined_data[[trait2]])
    dist_cor_matrix[i, j] <- dist_cor_matrix[j, i] <- dist_cor
  }
}

# Print the distance correlation matrix
print(dist_cor_matrix)

##                         Single_Cocoon_Weight Cocoon_Shell_Weight
## Single_Cocoon_Weight               1.0000000           0.8441943
## Cocoon_Shell_Weight                0.8441943           1.0000000
## Cocoon_Shell_Percentage            0.4568003           0.7403779
## Hatching_Rate                      0.1329776           0.1771922
## Larval_Weight                      0.5845634           0.6584816
## Live_Pupa_Percentage               0.7107905           0.8266545
##                         Cocoon_Shell_Percentage Hatching_Rate Larval_Weight
## Single_Cocoon_Weight                  0.4568003     0.1329776     0.5845634
## Cocoon_Shell_Weight                   0.7403779     0.1771922     0.6584816
## Cocoon_Shell_Percentage               1.0000000     0.1509995     0.4720629
## Hatching_Rate                         0.1509995     1.0000000     0.2280294
## Larval_Weight                         0.4720629     0.2280294     1.0000000
## Live_Pupa_Percentage                  0.6059396     0.1627910     0.6082860
##                         Live_Pupa_Percentage
## Single_Cocoon_Weight               0.7107905
## Cocoon_Shell_Weight                0.8266545
## Cocoon_Shell_Percentage            0.6059396
## Hatching_Rate                      0.1627910
## Larval_Weight                      0.6082860
## Live_Pupa_Percentage               1.0000000

library(heatmaply)

## Loading required package: plotly
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:MASS':
## 
##     select
## 
## The following object is masked from 'package:Hmisc':
## 
##     subplot
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
## 
## Loading required package: viridis
## Loading required package: viridisLite
## Registered S3 method overwritten by 'dendextend':
##   method     from 
##   rev.hclust vegan
## Registered S3 method overwritten by 'seriation':
##   method         from 
##   reorder.hclust vegan
## 
## ======================
## Welcome to heatmaply version 1.5.0
## 
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
## 
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## You may ask questions at stackoverflow, use the r and heatmaply tags: 
##   https://stackoverflow.com/questions/tagged/heatmaply
## ======================

heatmaply(dist_cor_matrix, 
          col = colorRampPalette(c("white", "blue"))(20), 
          main = "Distance Correlation Matrix")

# 2. Canonical Correlation Analysis (CCA)

# Define the two sets of variables (adjust based on your traits)
set1 <- combined_data[, c("Single_Cocoon_Weight", "Cocoon_Shell_Weight")]
set2 <- combined_data[, c("Cocoon_Shell_Percentage", "Live_Pupa_Percentage")]

# Perform CCA 
cca_result <- cancor(set1, set2)

# Print and visualize the results
print(cca_result)

## $cor
## [1] 0.9580569 0.7017367
## 
## $xcoef
##                           [,1]      [,2]
## Single_Cocoon_Weight -1.106622 -1.372460
## Cocoon_Shell_Weight   6.177256  2.101308
## 
## $ycoef
##                                  [,1]        [,2]
## Cocoon_Shell_Percentage  0.0803646770 0.067002971
## Live_Pupa_Percentage    -0.0004624349 0.004083874
## 
## $xcenter
## Single_Cocoon_Weight  Cocoon_Shell_Weight 
##            1.4529653            0.2916146 
## 
## $ycenter
## Cocoon_Shell_Percentage    Live_Pupa_Percentage 
##                20.30114                62.79934

# Display the canonical correlations
cat("\nCanonical Correlations:\n")

## 
## Canonical Correlations:

print(cca_result$cor)

## [1] 0.9580569 0.7017367

# Calculate the canonical variates
cv1 <- as.matrix(set1) %*% cca_result$xcoef[, 1]  # First canonical variate for set1
cv2 <- as.matrix(set2) %*% cca_result$ycoef[, 1]  # First canonical variate for set2

# Visualize the canonical variates
plot(cv1, cv2,   # Plot the canonical variates
     xlab = "Canonical Variate 1 (Set 1)", 
     ylab = "Canonical Variate 2 (Set 2)",
     main = "Canonical Correlation Analysis")

# 3. Copula-Based Correlation

# Load necessary library
library(copula)

## 
## Attaching package: 'copula'
## 
## The following object is masked from 'package:lubridate':
## 
##     interval

# Select two traits for copula analysis (replace with your actual trait names)
trait1 <- "Single_Cocoon_Weight"
trait2 <- "Cocoon_Shell_Weight"

# Create a copula object (using Gaussian copula as an example)
copula_data <- cbind(combined_data[[trait1]], combined_data[[trait2]])
copula_model <- normalCopula(dim = 2)

# Transform the data to pseudo-observations using pobs()
copula_data <- pobs(cbind(combined_data[[trait1]], combined_data[[trait2]]))

# Fit the copula model to the pseudo-observations
fit_result <- fitCopula(copula_model, data = copula_data, method = "ml")  # Use maximum likelihood estimation

# Extract the estimated correlation parameter
copula_cor <- coef(fit_result)[1]

# Print the results
cat("\nCopula-Based Correlation between", trait1, "and", trait2, ":\n")

## 
## Copula-Based Correlation between Single_Cocoon_Weight and Cocoon_Shell_Weight :

cat("  Correlation Coefficient:", copula_cor, "\n")

##   Correlation Coefficient: 0.852632

# You can further analyze and visualize the copula model using functions like 'contourplot' or 'persp' from the 'copula' package

1. Correlation Matrices and Correlograms:

Technical Discussion Points and Considerations:

Correlation Matrices (Pearson and Spearman): 1. Correlation Coefficients: Examine the strength and direction of the correlations between traits. 2. P-values: Assess the statistical significance of the correlations. 3. Comparison: Compare the Pearson and Spearman correlations to see if there are any notable differences, which might indicate non-linear relationships between traits.

Correlograms: 1. Visual Patterns: Observe the overall patterns in the correlograms. 2. Clusters: Identify clusters of traits that are highly correlated with each other. 3. Potential Multicollinearity: Be mindful of strong correlations between multiple traits, as this might indicate multicollinearity, which can affect the interpretation of regression or other multivariate analyses.

2. Individual Correlation Tests and Scatterplots: We will visualize and test the correlations between specific pairs of traits.

Technical Discussion Points and Considerations:

Scatterplots: Visually inspect the scatterplots to assess the linearity or monotonicity of the relationships between trait pairs.
Correlation Coefficients and P-values: Interpret the correlation coefficients and p-values to quantify the strength and significance of the relationships.

3. Scatterplot Matrix: Visualize all pairwise relationships between traits using a scatterplot matrix.

Technical Discussion Points and Considerations:

Overall Patterns: Gain a comprehensive overview of the pairwise relationships between all traits.
Outliers: Identify any potential outliers that might be influencing the correlations.
Non-linear Relationships: Look for any patterns in the scatterplots that suggest non-linear relationships between traits.

4. Heatmap of Trait Values by Strain:

CONCLUSION

This comprehensive analysis has provided insights into the distribution, relationships, and differences in silkworm phenotypic traits across various strains. We’ve addressed issues of non-normality and heteroscedasticity using transformations and non-parametric tests where appropriate. Multivariate techniques like PCA and clustering helped uncover underlying patterns in the data, while correlation analysis revealed the interdependencies between traits. These findings can guide further research and breeding efforts in silkworms.

Remember:

Interpret the results in the context of your specific research questions and biological knowledge. Consider exploring additional analyses or visualizations to gain further insights into your data. Clearly document your findings and conclusions, highlighting the key takeaways and their implications.

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Multivariate Analysis of Silkworm Hybrid Phenotypes Across Two Locations in the Philippines With Mean Value Traits

Jeshurun Asher Tarun

2024-12-08

INTRODUCTION

Data Loading and Pre-processing

Data Exploration and Assumption Checking

ANOVA, Transformation, and Non-Parametric Testing for All Traits (Combined Locations)

Multivariate Analysis

Multivariate Analysis (Combined Locations)

Correlation Analysis

Correlation Analysis (Combined Locations)

CONCLUSION