Demographic Analysis of Wind Instrument Musicians and RMT Device Usage

Author

Sarah Morris

Published

March 4, 2025

Provide a full description and interpretation from the results of the following code. Include a breakdown of the descriptive statistics, as well as any statistical tests included in the code.

Code

# Install all required packages (if not already installed)
required_packages <- c(
  "dplyr", "ggplot2", "stats", "tidyr", "forcats", "broom",
  "scales", "vcd", "svglite", "exact2x2", "exactci",
  "ssanv", "testthat"
)
# Install any missing packages
install.packages(setdiff(required_packages, rownames(installed.packages())))


## Libraries and Directory
#| echo: false
#| output: false
library(readxl)
library(dplyr)
library(ggplot2)
library(stats)
library(tidyr)
library(forcats)
library(broom)
library(scales)
library(vcd)  # For Cramer's V calculation
library(svglite)
library(exact2x2)
library(stringr)

# Read the data
data_combined <- read_excel("../Data/R_Import_Transformed_15.02.25.xlsx", sheet = "Combined")

1 Overview

2 Gender

2.1 Descriptive Stats

Summary stats on gender data
Prints plot

Code

# Read the data
data_combined <- read_excel("../Data/R_Import_Transformed_15.02.25.xlsx", sheet = "Combined")

# Create gender summary
# Overall, this code processes a dataset (data_combined) to summarize gender information by filtering out missing values, grouping by gender, calculating counts and percentages, adjusting specific gender labels, and finally sorting the results by count in descending order. The end result is stored in the variable 'gender_summary'
gender_summary <- data_combined %>%
  filter(!is.na(gender)) %>%
  group_by(gender) %>%
  summarise(
    count = n(),
    percentage = (count / 1558) * 100,
    .groups = 'drop'
  ) %>%
  mutate(gender = case_when(
    gender == "Choose not to disclose" ~ "Not specified",
    gender == "Nonbinary/gender fluid/gender non-conforming" ~ "Nonbinary",
    TRUE ~ gender
  )) %>%
  arrange(desc(count))

# Create the gender plot with adjusted height scaling
gender_plot <- ggplot(gender_summary, aes(x = reorder(gender, count), y = count, fill = gender)) +
  geom_bar(stat = "identity", color = "black") +
  geom_text(aes(label = sprintf("N=%d\
(%.1f%%)", count, percentage)),
            vjust = -0.5, size = 4) +
  labs(title = "Distribution of Participants by Gender",
       x = "Gender",
       y = "Number of Participants (N = 1558)") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 10),
    axis.title = element_text(size = 12),
    legend.position = "none",
    plot.margin = margin(t = 20, r = 20, b = 20, l = 20, unit = "pt")
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2)), 
                     limits = c(0, max(gender_summary$count) * 1.15))

# Display the plot
print(gender_plot)

Table 1: Gender Distribution Summary

# A tibble: 4 × 3
  gender        count percentage
  <chr>         <int>      <dbl>
1 Male            750     48.1  
2 Female          725     46.5  
3 Nonbinary        68      4.36 
4 Not specified    15      0.963

2.2 Results and Discussion

The gender distribution shows a fairly even split between Male and Female participants, with a much smaller representation of Nonbinary, fluid, and non-conforming-gender individuals, and those who chose not to disclose.

2.3 Comparison Stats

Code

## Combine categories 
# Read the data
data_combined <- read_excel("../Data/R_Import_Transformed_15.02.25.xlsx", sheet = "Combined")

# Another approach: Combine categories with small counts
# Combine "Nonbinary" and "Not specified" into "Other"
data_filtered_combined <- data_combined %>%
  filter(!is.na(gender), !is.na(RMTMethods_YN)) %>%
  mutate(
    gender = case_when(
      gender == "Choose not to disclose" ~ "Other",
      gender == "Nonbinary/gender fluid/gender non-conforming" ~ "Other",
      TRUE ~ gender
    ),
    RMTMethods_YN = case_when(
      RMTMethods_YN == 0 ~ "No RMT",
      RMTMethods_YN == 1 ~ "RMT"
    )
  )

# Create a new contingency table with combined categories
gender_rmt_table_combined <- table(data_filtered_combined$gender, data_filtered_combined$RMTMethods_YN)

# Print the new contingency table
print("Contingency table with combined categories:")

[1] "Contingency table with combined categories:"

Code

print(gender_rmt_table_combined)

        
         No RMT RMT
  Female    642  83
  Male      615 135
  Other      73  10

Code

# Calculate expected counts for the new table
expected_counts_combined <- chisq.test(gender_rmt_table_combined)$expected
print("Expected counts with combined categories:")

[1] "Expected counts with combined categories:"

Code

print(expected_counts_combined)

        
            No RMT       RMT
  Female 618.90244 106.09756
  Male   640.24390 109.75610
  Other   70.85366  12.14634

Code

# Perform chi-square test on the new table
chi_square_results_combined <- chisq.test(gender_rmt_table_combined)
print("Chi-square test results with combined categories:")

[1] "Chi-square test results with combined categories:"

Code

print(chi_square_results_combined)


    Pearson's Chi-squared test

data:  gender_rmt_table_combined
X-squared = 13.136, df = 2, p-value = 0.001405

Code

# Create a data frame for plotting with combined categories
gender_rmt_df_combined <- as.data.frame(gender_rmt_table_combined)
colnames(gender_rmt_df_combined) <- c("Gender", "RMTMethods_YN", "Count")

gender_rmt_df_combined <- gender_rmt_df_combined %>%
  group_by(Gender) %>%
  mutate(Percentage = (Count / sum(Count)) * 100)

# Create the plot with combined categories
#| fig.height: 7
#| fig.width: 10
#| out.height: "700px"
#| fig.align: "center"
rmt_plot_combined <- ggplot(gender_rmt_df_combined, aes(x = RMTMethods_YN, y = Count, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = sprintf("%d\n(%.1f%%)", Count, Percentage)), 
            position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  labs(title = "Gender Distribution by RMT Methods Usage (Combined Categories)",
       x = "RMT Methods Usage",
       y = "Number of Participants (N = 1558)",
       fill = "Gender") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    axis.title = element_text(size = 12),
    legend.position = "right",
    plot.margin = margin(t = 30, r = 20, b = 20, l = 20, unit = "pt")  # Add more margin at the top
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2)))  # Expand y-axis to make room for labels

# Display the plot with combined categories
print(rmt_plot_combined)

2.4 Results after combining into ‘Other’ category

Overview of the Data

The analysis examines the relationship between gender and Remote Monitoring Technology (RMT) usage across three gender categories: Female, Male, and Other. The contingency table shows the observed counts of participants in each combination of gender and RMT usage status.

Contingency Table Analysis

Observed Frequencies

Female participants: 642 do not use RMT; 83 use RMT (total: 725)
Male participants: 615 do not use RMT; 135 use RMT (total: 750)
Other gender category: 73 do not use RMT; 10 use RMT (total: 83)
Column totals: 1,330 non-RMT users; 228 RMT users (overall total: 1,558)

Usage Rates by Gender

Females: 11.4% use RMT (83 out of 725)
Males: 18.0% use RMT (135 out of 750)
Other: 12.0% use RMT (10 out of 83)

Key Observations

The majority of individuals across all genders did not use RMT (“No RMT”).
In absolute terms, females are the largest group in the “No RMT” category (642), while males are the largest in the “RMT” category (135).
The “Other” gender group has the fewest individuals in both “RMT” (10) and “No RMT” (73) categories.

Expected Frequencies

The expected counts represent what we would expect to see if there were no relationship between gender and RMT usage (i.e., if the null hypothesis were true). These are calculated based on the assumption that RMT usage and gender are independent:

Female participants: Expected 618.90 non-users; 106.10 user
Male participants: Expected 640.24 non-users; 109.76 users
Other gender category: Expected 70.85 non-users; 12.15 users

Differences between Observed and Expected

Female participants: More non-users than expected (+23.10); fewer users than expected (-23.10)
Male participants: Fewer non-users than expected (-25.24); more users than expected (+25.24)
Other gender category: More non-users than expected (+2.15); fewer users than expected (-2.15)

Key Observations:

Under independence, females were expected to have about 618.9 individuals in “No RMT” and 106.1 in “RMT.” However, the actual numbers are 642 and 83, respectively, suggesting fewer females participated in RMT than expected.
Conversely, males were expected to have 640.2 individuals in “No RMT” and 109.8 in “RMT,” but the actual numbers are 615 and 135, suggesting more males participated in RMT than expected.
The “Other” group shows relatively small deviations from the expected counts.

Statistical Test Results

The chi-squared test evaluates whether the observed differences between the gender categories in RMT usage contingency table and the expected counts are statistically significant or could be attributed to random chance. The test outputs the following results:

Chi-square statistic (X²): 13.136
Degrees of freedom (df): 2
P-value: 0.001405

Interpretation

Statistical Significance: The p-value (0.001405) is well below the conventional threshold of 0.05, indicating a statistically significant association between gender and RMT usage. This means we can reject the null hypothesis that gender and RMT usage are independent.
Effect Direction:
- Males show notably higher RMT usage than expected under the null hypothesis
- Females show notably lower RMT usage than expected
- The “Other” gender category shows slightly lower RMT usage than expected
Practical Significance:
- Male participants are approximately 1.6 times more likely to use RMT compared to female participants (18.0% vs 11.4%)
- The “Other” gender category’s RMT usage (12.0%) is closer to females than males
Strength of Association: While statistically significant, the chi-square value (13.136) suggests a modest association between gender and RMT usage in the context of the sample size (1,558 participants).
In simpler terms, there is indeed a meaningful relationship between gender and the use of RMT methods.

Limitations and Considerations

Categorical Combination: The “Other” category appears to be a combination of non-binary individuals and those who chose not to disclose their gender. This grouping, while necessary for statistical analysis due to smaller sample sizes, may obscure potential differences between these distinct groups.
Sample Size Differences: The “Other” gender category (n=83) is much smaller than the Female (n=725) and Male (n=750) categories, which affects the precision of estimates for this group.
No Control Variables: This analysis does not account for other variables that might influence RMT usage (e.g., age, technology access, health status) and might confound the relationship between gender and RMT usage.

Conclusion

There is compelling statistical evidence for a gender difference in RMT usage patterns, with males showing significantly higher adoption rates compared to females and those in the “Other” gender category. This finding may have implications for RMT implementation strategies, suggesting potential benefits from gender-specific approaches to technology adoption and support.

3 Age

3.1 Demographic Stats

Code

# Read the data
data_combined <- read_excel("../Data/R_Import_Transformed_15.02.25.xlsx", sheet = "Combined")

# Create age summary
age_summary <- data_combined %>%
  filter(!is.na(age)) %>%
  mutate(
    age_group = case_when(
      age < 20 ~ "Under 20",
      age >= 20 & age < 30 ~ "20-29",
      age >= 30 & age < 40 ~ "30-39",
      age >= 40 & age < 50 ~ "40-49",
      age >= 50 & age < 60 ~ "50-59",
      age >= 60 ~ "60+"
    )
  ) %>%
  group_by(age_group) %>%
  summarise(
    count = n(),
    percentage = (count / 1558) * 100,
    .groups = 'drop'
  ) %>%
  arrange(factor(age_group, levels = c("Under 20", "20-29", "30-39", "40-49", "50-59", "60+")))

# Create the age plot with adjusted height scaling
age_plot <- ggplot(age_summary, 
                   aes(x = factor(age_group, levels = c("Under 20", "20-29", "30-39", "40-49", "50-59", "60+")), 
                       y = count, fill = age_group)) +
  geom_bar(stat = "identity", color = "black") +
  geom_text(aes(label = sprintf("N=%d\
(%.1f%%)", count, percentage)),
            vjust = -0.5, size = 4) +
  labs(title = "Distribution of Participants by Age Group",
       x = "Age Group (Years)",
       y = "Number of Participants (N = 1558)") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    axis.title = element_text(size = 12),
    legend.position = "none",
    plot.margin = margin(t = 20, r = 20, b = 20, l = 20, unit = "pt")
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2)), 
                     limits = c(0, max(age_summary$count) * 1.15))

# Display the plot
print(age_plot)

Code

# Print summary statistics
print("Age distribution summary:")

[1] "Age distribution summary:"

Code

print(age_summary)

# A tibble: 6 × 3
  age_group count percentage
  <chr>     <int>      <dbl>
1 Under 20    180       11.6
2 20-29       497       31.9
3 30-39       291       18.7
4 40-49       226       14.5
5 50-59       171       11.0
6 60+         193       12.4

3.2 Results and Discussion

Age: M = 37, SD = 16, 18-94, Median 32.5yo

• The age distribution shows a right-skewed pattern with the majority of participants being between 18-40 years old, and fewer participants in the higher age ranges. Gender distribution is fairly balanced between male (48.1%) and female (46.5%) participants

• The largest age group is 20-29 years (31.9% of participants)

• There’s a gradual decrease in participation with age, with the exception of a slight increase in the 60+ category compared to 50-59

3.3 Comparison Stats

Code

# Read the data
data_combined <- read_excel("../Data/R_Import_Transformed_15.02.25.xlsx", sheet = "Combined")

# Create age and RMT usage summary
age_rmt_summary <- data_combined %>%
  filter(!is.na(age), !is.na(RMTMethods_YN)) %>%
  mutate(
    age_group = case_when(
      age < 20 ~ "Under 20",
      age >= 20 & age < 30 ~ "20-29",
      age >= 30 & age < 40 ~ "30-39",
      age >= 40 & age < 50 ~ "40-49",
      age >= 50 & age < 60 ~ "50-59",
      age >= 60 ~ "60+"
    ),
    RMTMethods_YN = case_when(
      RMTMethods_YN == 0 ~ "No",
      RMTMethods_YN == 1 ~ "Yes"
    )
  )

# Calculate total N for the analysis
total_n <- nrow(age_rmt_summary)

# Create contingency table
age_rmt_table <- table(age_rmt_summary$age_group, age_rmt_summary$RMTMethods_YN)

# Print the contingency table
print("Contingency Table:")

[1] "Contingency Table:"

Code

print(age_rmt_table)

          
            No Yes
  20-29    414  83
  30-39    223  68
  40-49    199  27
  50-59    153  18
  60+      173  20
  Under 20 168  12

Code

# Modified code with simulation-based p-value
chi_square_results <- chisq.test(age_rmt_table, simulate.p.value = TRUE, B = 10000)
expected_counts <- chi_square_results$expected
print("\
Expected Counts:")

[1] "\nExpected Counts:"

Code

print(round(expected_counts, 2))

          
               No   Yes
  20-29    424.27 72.73
  30-39    248.41 42.59
  40-49    192.93 33.07
  50-59    145.98 25.02
  60+      164.76 28.24
  Under 20 153.66 26.34

Code

min_expected <- min(expected_counts)
print(sprintf("\
Minimum expected count: %.2f", min_expected))

[1] "\nMinimum expected count: 25.02"

Code

# If any expected count is below 5, use Fisher's exact test
if(min_expected < 5) {
  print("Some expected counts are less than 5; using Fisher's exact test instead.")
  fisher_test_results <- fisher.test(age_rmt_table, simulate.p.value = TRUE, B = 10000)
  print("\
Fisher's exact test results:")
  print(fisher_test_results)
  main_test_results <- fisher_test_results
  test_name <- "Fisher's exact test"
  test_statistic <- NA
  test_df <- NA
  test_pvalue <- fisher_test_results$p.value
} else {
  
  # Run chi-square test with simulated p-values for robustness
  chi_square_sim <- chisq.test(age_rmt_table, simulate.p.value = TRUE, B = 10000)
  print("\
Chi-square test with simulated p-value:")
  print(chi_square_sim)
  main_test_results <- chi_square_sim
  test_name <- "Chi-square test"
  test_statistic <- chi_square_sim$statistic
  test_df <- chi_square_sim$parameter
  test_pvalue <- chi_square_sim$p.value
}

[1] "\nChi-square test with simulated p-value:"

    Pearson's Chi-squared test with simulated p-value (based on 10000
    replicates)

data:  age_rmt_table
X-squared = 35.047, df = NA, p-value = 9.999e-05

Code

# Calculate percentages within each age group
age_rmt_summary_stats <- age_rmt_summary %>%
  group_by(age_group, RMTMethods_YN) %>%
  summarise(
    count = n(),
    .groups = 'drop'
  ) %>%
  group_by(RMTMethods_YN) %>%
  mutate(
    total = sum(count),
    percentage = (count / total) * 100
  ) %>%
  arrange(RMTMethods_YN, factor(age_group, levels = c("Under 20", "20-29", "30-39", "40-49", "50-59", "60+")))

# Create the plot ### REPLACE
rmt_age_plot <- ggplot(age_rmt_summary_stats, 
                       aes(x = factor(age_group, 
                                      levels = c("Under 20", "20-29", "30-39", "40-49", "50-59", "60+")), 
                           y = count, fill = RMTMethods_YN)) +
  geom_bar(stat = "identity", position = "dodge") +
  # Adjust data label: extra vertical space by increasing vjust (e.g., -1 for more space)
  geom_text(aes(label = sprintf("%d\n(%.1f%%)", count, percentage)),
            position = position_dodge(width = 0.9),
            vjust = -1, size = 3.5) +
  labs(title = "RMT Device Use by Age Group",
       subtitle = paste("Percentages calculated within each RMT group (Yes/No)"),
       x = "Age Group (Years)",
       y = paste("Number of Participants (Total N =", total_n, ")"),
       fill = "RMT Usage") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    axis.title = element_text(size = 12),
    legend.position = "right",
    plot.margin = margin(t = 40, r = 20, b = 20, l = 20)  # Increased top margin for clarity
  ) +
  # Expand the y-axis to create extra space at the top for the data labels
  scale_y_continuous(expand = expansion(mult = c(0, 0.3)))

# Display the plot
print(rmt_age_plot)

Code

# Calculate proportions for each RMT group
print("\nProportions within each RMT group:")

[1] "\nProportions within each RMT group:"

Code

prop_table <- prop.table(age_rmt_table, margin = 2) * 100  # Changed margin from 1 to 2
print(round(prop_table, 2))

          
              No   Yes
  20-29    31.13 36.40
  30-39    16.77 29.82
  40-49    14.96 11.84
  50-59    11.50  7.89
  60+      13.01  8.77
  Under 20 12.63  5.26

Code

# Calculate the standardized residuals
if(exists("chi_square_results")) {
  std_residuals <- chi_square_results$residuals
  print("\nStandardized residuals:")
  print(round(std_residuals, 2))
  print("Cells with absolute standardized residuals > 2 contribute significantly to the chi-square statistic")
}

[1] "\nStandardized residuals:"
          
              No   Yes
  20-29    -0.50  1.20
  30-39    -1.61  3.89
  40-49     0.44 -1.06
  50-59     0.58 -1.40
  60+       0.64 -1.55
  Under 20  1.16 -2.79
[1] "Cells with absolute standardized residuals > 2 contribute significantly to the chi-square statistic"

Code

# Additional visualization: Create a bar chart showing the percentage of each age group within RMT usage groups
percentage_plot <- ggplot(age_rmt_summary_stats, 
                        aes(x = factor(age_group, 
                                     levels = c("Under 20", "20-29", "30-39", "40-49", "50-59", "60+")), 
                           y = percentage, fill = RMTMethods_YN)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = sprintf("%.1f%%", percentage)),
            position = position_dodge(width = 0.9),
            vjust = -0.5, size = 3.5) +
  labs(title = "Age Distribution Within RMT Usage Groups",
       subtitle = "Showing percentage of each age group within 'Yes' and 'No' RMT usage categories",
       x = "Age Group (Years)",
       y = "Percentage within RMT Group (%)",
       fill = "RMT Usage") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 11, face = "italic"),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    axis.title = element_text(size = 12)
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2)))

# Display the percentage plot
print(percentage_plot)

Code

# Bonferroni Correction and Visualisation

# Perform pairwise chi-square tests between age groups with Bonferroni correction. This determines how strict the Bonferroni correction will be.
print("\
Pairwise comparisons between age groups with Bonferroni correction:")

[1] "\nPairwise comparisons between age groups with Bonferroni correction:"

Code

age_groups <- rownames(age_rmt_table)
n_comparisons <- choose(length(age_groups), 2) #calculates the total number of pairwise comparisons
pairwise_results <- data.frame(
  Group1 = character(),
  Group2 = character(),
  ChiSquare = numeric(),
  DF = numeric(),
  RawP = numeric(),
  CorrectedP = numeric(),
  Significant = character(),
  stringsAsFactors = FALSE
)

# For each pair of age groups, the code: 
## Creates a 2×2 contingency table containing only those two age groups
## Checks if expected cell counts are below 5
## Uses either Chi-square or Fisher's exact test accordingly
## Applies Bonferroni correction to control for multiple comparisons
## Determines significance based on the corrected p-value
for (i in 1:(length(age_groups)-1)) {
  for (j in (i+1):length(age_groups)) {
    subset_tab <- age_rmt_table[c(i, j), ]
    
    # Check expected counts for this pair
    pair_expected <- chisq.test(subset_tab)$expected
    min_pair_expected <- min(pair_expected)

    # Choose appropriate test based on expected counts
    ## Chi-square test assumes expected cell counts ≥5
    ## When this assumption is violated, Fisher's exact test is used instead
    ## This adaptive approach ensures statistical validity for each comparison
    if(min_pair_expected < 5) {
      pair_test <- fisher.test(subset_tab)
      test_stat <- NA
      test_df <- NA
    } else {
      pair_test <- chisq.test(subset_tab)
      test_stat <- pair_test$statistic
      test_df <- pair_test$parameter
    }
    
    # Apply Bonferroni correction
    ## The original p-value is multiplied by the total number of comparisons
    ## This makes the significance threshold more stringent to avoid false positives
    ## The correction controls the family-wise error rate when making multiple comparisons
    ## The min() function ensures the corrected p-value doesn't exceed 1
    corrected_p <- min(pair_test$p.value * n_comparisons, 1)
    
    # Determine if the result is significant
    is_significant <- ifelse(corrected_p < 0.05, "Yes", "No")
    
    # Add to results dataframe
    pairwise_results <- rbind(pairwise_results, data.frame(
      Group1 = age_groups[i],
      Group2 = age_groups[j],
      ChiSquare = if(is.na(test_stat)) NA else round(test_stat, 2),
      DF = test_df,
      RawP = round(pair_test$p.value, 4),
      CorrectedP = round(corrected_p, 4),
      Significant = is_significant,
      stringsAsFactors = FALSE
    ))
    
    # Print the result
    if(is.na(test_stat)) {
      message <- sprintf("Comparison %s vs %s: Fisher's exact test, raw p = %.4f, Bonferroni corrected p = %.4f, Significant: %s",
                         age_groups[i], age_groups[j],
                         pair_test$p.value, corrected_p, is_significant)
    } else {
      message <- sprintf("Comparison %s vs %s: Chi-square = %.2f, df = %d, raw p = %.4f, Bonferroni corrected p = %.4f, Significant: %s",
                         age_groups[i], age_groups[j],
                         test_stat, test_df,
                         pair_test$p.value, corrected_p, is_significant)
    }
    print(message)
  }
}

[1] "Comparison 20-29 vs 30-39: Chi-square = 4.85, df = 1, raw p = 0.0277, Bonferroni corrected p = 0.4157, Significant: No"
[1] "Comparison 20-29 vs 40-49: Chi-square = 2.37, df = 1, raw p = 0.1241, Bonferroni corrected p = 1.0000, Significant: No"
[1] "Comparison 20-29 vs 50-59: Chi-square = 3.31, df = 1, raw p = 0.0687, Bonferroni corrected p = 1.0000, Significant: No"
[1] "Comparison 20-29 vs 60+: Chi-square = 3.91, df = 1, raw p = 0.0479, Bonferroni corrected p = 0.7192, Significant: No"
[1] "Comparison 20-29 vs Under 20: Chi-square = 10.21, df = 1, raw p = 0.0014, Bonferroni corrected p = 0.0209, Significant: Yes"
[1] "Comparison 30-39 vs 40-49: Chi-square = 10.31, df = 1, raw p = 0.0013, Bonferroni corrected p = 0.0198, Significant: Yes"
[1] "Comparison 30-39 vs 50-59: Chi-square = 10.89, df = 1, raw p = 0.0010, Bonferroni corrected p = 0.0145, Significant: Yes"
[1] "Comparison 30-39 vs 60+: Chi-square = 12.33, df = 1, raw p = 0.0004, Bonferroni corrected p = 0.0067, Significant: Yes"
[1] "Comparison 30-39 vs Under 20: Chi-square = 20.83, df = 1, raw p = 0.0000, Bonferroni corrected p = 0.0001, Significant: Yes"
[1] "Comparison 40-49 vs 50-59: Chi-square = 0.08, df = 1, raw p = 0.7777, Bonferroni corrected p = 1.0000, Significant: No"
[1] "Comparison 40-49 vs 60+: Chi-square = 0.13, df = 1, raw p = 0.7212, Bonferroni corrected p = 1.0000, Significant: No"
[1] "Comparison 40-49 vs Under 20: Chi-square = 2.64, df = 1, raw p = 0.1043, Bonferroni corrected p = 1.0000, Significant: No"
[1] "Comparison 50-59 vs 60+: Chi-square = 0.00, df = 1, raw p = 1.0000, Bonferroni corrected p = 1.0000, Significant: No"
[1] "Comparison 50-59 vs Under 20: Chi-square = 1.21, df = 1, raw p = 0.2706, Bonferroni corrected p = 1.0000, Significant: No"
[1] "Comparison 60+ vs Under 20: Chi-square = 1.19, df = 1, raw p = 0.2763, Bonferroni corrected p = 1.0000, Significant: No"

Code

# Print the pairwise results table
print("\
Summary of pairwise comparisons:")

[1] "\nSummary of pairwise comparisons:"

Code

print(pairwise_results)

            Group1   Group2 ChiSquare DF   RawP CorrectedP Significant
X-squared    20-29    30-39      4.85  1 0.0277     0.4157          No
X-squared1   20-29    40-49      2.37  1 0.1241     1.0000          No
X-squared2   20-29    50-59      3.31  1 0.0687     1.0000          No
X-squared3   20-29      60+      3.91  1 0.0479     0.7192          No
X-squared4   20-29 Under 20     10.21  1 0.0014     0.0209         Yes
X-squared5   30-39    40-49     10.31  1 0.0013     0.0198         Yes
X-squared6   30-39    50-59     10.89  1 0.0010     0.0145         Yes
X-squared7   30-39      60+     12.33  1 0.0004     0.0067         Yes
X-squared8   30-39 Under 20     20.83  1 0.0000     0.0001         Yes
X-squared9   40-49    50-59      0.08  1 0.7777     1.0000          No
X-squared10  40-49      60+      0.13  1 0.7212     1.0000          No
X-squared11  40-49 Under 20      2.64  1 0.1043     1.0000          No
X-squared12  50-59      60+      0.00  1 1.0000     1.0000          No
X-squared13  50-59 Under 20      1.21  1 0.2706     1.0000          No
X-squared14    60+ Under 20      1.19  1 0.2763     1.0000          No

Code

# Create a heatmap of p-values for pairwise comparisons
# Prepare data for heatmap
## Red cells indicate p-values close to 0 (strong evidence of difference)
## Yellow indicates p-values around 0.5 (moderate evidence)
## White indicates p-values close to 1 (little evidence of difference)
## Asterisks (*) mark statistically significant differences (p < 0.05)
## The heatmap is symmetrical because the comparison of Group1 vs Group2 is the same as Group2 vs Group1
heatmap_data <- matrix(NA, nrow = length(age_groups), ncol = length(age_groups))
rownames(heatmap_data) <- age_groups
colnames(heatmap_data) <- age_groups

for(i in 1:nrow(pairwise_results)) {
  row_idx <- which(age_groups == pairwise_results$Group1[i])
  col_idx <- which(age_groups == pairwise_results$Group2[i])
  heatmap_data[row_idx, col_idx] <- pairwise_results$CorrectedP[i]
  heatmap_data[col_idx, row_idx] <- pairwise_results$CorrectedP[i]  # Mirror the matrix
}

# Convert to long format for ggplot
heatmap_long <- as.data.frame(as.table(heatmap_data))
names(heatmap_long) <- c("Group1", "Group2", "CorrectedP")

# Create the heatmap
heatmap_plot <- ggplot(heatmap_long, aes(x = Group1, y = Group2, fill = CorrectedP)) +
  geom_tile() +
  scale_fill_gradient2(low = "red", mid = "yellow", high = "white", 
                       midpoint = 0.5, na.value = "white",
                       limits = c(0, 1), name = "Corrected p-value") +
  geom_text(aes(label = ifelse(is.na(CorrectedP), "", 
                              ifelse(CorrectedP < 0.05, 
                                    sprintf("%.4f*", CorrectedP),
                                    sprintf("%.4f", CorrectedP)))),
            size = 3) +
  labs(title = "Pairwise Comparisons of RMT Usage Between Age Groups",
       subtitle = "Bonferroni-corrected p-values (* indicates significant at α = 0.05)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(size = 14, face = "bold"),
        plot.subtitle = element_text(size = 10)) +
  coord_fixed()

# Display the heatmap
## When analyzing the results from this code:
## Look for cells with asterisks (*) in the heatmap or "Significant: Yes" in the printed results
## These indicate which specific age groups differ significantly in RMT usage
## The raw p-values show the strength of evidence before correction
## The corrected p-values account for multiple testing and are more conservative
## Pay attention to which test was used (Chi-square or Fisher's) for each comparison
## This approach enables you to identify specifically which age groups differ from each other in their RMT usage patterns, rather than just knowing that "age is associated with RMT usage" in general.
print(heatmap_plot)

Code

# Additional visualization: Create a bar chart showing the percentage of RMT usage by age group
percentage_plot <- ggplot(age_rmt_summary_stats %>% filter(RMTMethods_YN == "Yes"), 
                         aes(x = factor(age_group, 
                                       levels = c("Under 20", "20-29", "30-39", "40-49", "50-59", "60+")), 
                             y = percentage)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = sprintf("%.1f%%", percentage)),
            vjust = -0.5, size = 3.5) +
  labs(title = "Percentage of RMT Device Usage by Age Group",
       x = "Age Group (Years)",
       y = "Percentage (%)") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    axis.title = element_text(size = 12)
  )

# Display the percentage plot
print(percentage_plot)

3.4 Results and Discussion

Chi-Square Test

The chi-square test with simulated p-values (10,000 replicates) shows a statistically significant association between age group and RMT usage:

X-squared = 35.047, p-value < 0.0001

This indicates strong evidence against the null hypothesis of independence between age group and RMT device usage.

Expected Counts

All expected cell counts exceed 5, confirming the chi-square test’s validity

Minimum expected count: 25.02

Standardized Residuals

The standardized residuals help identify which specific cells contribute most to the chi-square statistic

Notable observations:

The 30-39 age group has a strong positive residual (3.89) for “Yes” responses, indicating significantly higher RMT usage than expected. The Under 20 age group has a strong negative residual (-2.79) for “Yes” responses, indicating significantly lower RMT usage than expected.

Pairwise Comparisons with Bonferroni Correction

To determine which specific age groups differ significantly from each other, pairwise chi-square tests were conducted with Bonferroni correction to control for multiple comparisons

Summary of Significant Pairwise Differences:

30-39 age group shows significantly higher RMT usage compared to: 40-49 age group (p = 0.0198), 50-59 age group (p = 0.0145), 60+ age group (p = 0.0067), Under 20 age group (p = 0.0001).

20-29 age group shows significantly higher RMT usage compared to: Under 20 age group (p = 0.0209).

No significant differences were found between: 20-29 and 30-39 age groups Any pairs among the 40-49, 50-59, 60+, and Under 20 age groups.

Visual Representation

The analysis includes three visualizations:

Bar chart of RMT usage counts by age group
Heatmap of pairwise comparison p-values
Percentage of RMT device usage by age group

Interpretation and Implications Age-Related Patterns in RMT Usage:

The 30-39 age group has the highest RMT adoption rate (23.37%). The 20-29 age group has the second highest adoption rate (16.70%). All other age groups have notably lower adoption rates (6.67% to 11.95%).

Key Findings:

There is a clear peak in RMT usage among adults in their 30s. Young adults (20-29) also show relatively high adoption. Both younger (under 20) and older (40+) age groups show significantly lower adoption rates. The 30-39 age group stands out as significantly different from most other age groups.

Practical Implications:

Targeted interventions may be needed to increase RMT adoption among both younger and older populations. Understanding the factors that drive higher adoption in the 30-39 age group could inform strategies to increase usage in other groups. The 20-29 and 30-39 age groups could be considered as early adopters or technology champions. This comprehensive analysis provides strong statistical evidence for age-related differences in RMT device usage, with clear identification of which specific age groups differ significantly from others.

4 Instruments Played

4.1 Descriptive Stats

Code

# Read the data
data_combined <- read_excel("../Data/R_Import_Transformed_15.02.25.xlsx", sheet = "Combined")

# Define instrument families
woodwinds <- c("Flute", "Piccolo", "Clarinet", "Saxophone", "Oboe", "Bassoon", "Recorder")
brass <- c("Trumpet", "Trombone", "Tuba", "Euphonium", "French Horn", "French Horn/Horn")
others <- c("Bagpipes", "Other")

# Process instrument-level data
WI_split_updated <- data_combined %>%
  select(WI) %>%
  separate_rows(WI, sep = ",") %>%
  mutate(WI = trimws(WI)) %>%
  mutate(WI = case_when(
    WI == "French Horn/Horn" ~ "French Horn",
    WI == "Oboe/Cor Anglais" ~ "Oboe",
    TRUE ~ WI
  )) %>%
  filter(WI != "Unknown") %>%
  count(WI, sort = TRUE) %>%
  mutate(Family = case_when(
    WI %in% c(woodwinds, "Oboe") ~ "Woodwinds",
    WI %in% brass ~ "Brass",
    TRUE ~ "Others"
  ))

# Calculate percentages for instruments
WI_percentages_updated <- WI_split_updated %>%
  mutate(Percentage = round((n / 3054) * 100, 2))

# Calculate family distribution
family_distribution_updated <- WI_split_updated %>%
  group_by(Family) %>%
  summarise(Total = sum(n)) %>%
  mutate(Percentage = round((Total / 3054) * 100, 2))

# Create instrument distribution plot
instrument_plot <- ggplot(data = WI_percentages_updated, 
                          aes(x = reorder(WI, n), y = n, fill = Family)) +
  geom_bar(stat = "identity", color = "black") +
  geom_text(aes(label = paste0(n, " (", Percentage, "%)")), 
            hjust = -0.1, 
            size = 3,
            position = position_dodge(width = 1)) +
  coord_flip() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.3))) +  # Increased expansion for labels
  labs(title = "Distribution of Wind Instruments by Count and Percentage",
       x = "Instrument",
       y = "Frequency (N=1558, responses = 3054)",
       caption = "Note. reported percentages are out of total responses (3054) not the total N (1558)") +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 10),
    plot.title = element_text(size = 12, face = "bold"),
    plot.caption = element_text(size = 10, hjust = 0)
  )

# Create family distribution plot
family_plot <- ggplot(data = family_distribution_updated, 
                      aes(x = reorder(Family, -Total), y = Total, fill = Family)) +
  geom_bar(stat = "identity", color = "black") +
  geom_text(aes(label = paste0(Total, "\n(", Percentage, "%)")), 
            vjust = -0.5, 
            size = 4,
            position = position_dodge(width = 1)) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +  # Increased expansion for labels
  labs(title = "Distribution by Instrument Family",
       x = "Instrument Family",
       y = "Frequency (N=1558, responses = 3054)",
       caption = "Note. reported percentages are out of total responses (3054) not the total N (1558)") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 12, face = "bold"),
    legend.position = "none",
    plot.caption = element_text(size = 10, hjust = 0)
  )

# Display plots
print(instrument_plot)

Code

print(family_plot)

Code

# Print distributions
print("Instrument Distribution:")

[1] "Instrument Distribution:"

Code

print(WI_percentages_updated)

# A tibble: 14 × 4
   WI              n Family    Percentage
   <chr>       <int> <chr>          <dbl>
 1 Saxophone     477 Woodwinds      15.6 
 2 Flute         443 Woodwinds      14.5 
 3 Clarinet      415 Woodwinds      13.6 
 4 Trumpet       343 Brass          11.2 
 5 Trombone      212 Brass           6.94
 6 Piccolo       209 Woodwinds       6.84
 7 French Horn   161 Brass           5.27
 8 Oboe          150 Woodwinds       4.91
 9 Recorder      136 Woodwinds       4.45
10 Euphonium     133 Brass           4.35
11 Tuba          129 Brass           4.22
12 Other          94 Others          3.08
13 Bassoon        92 Woodwinds       3.01
14 Bagpipes       60 Others          1.96

Code

print("\nFamily Distribution:")

[1] "\nFamily Distribution:"

Code

print(family_distribution_updated)

# A tibble: 3 × 3
  Family    Total Percentage
  <chr>     <int>      <dbl>
1 Brass       978      32.0 
2 Others      154       5.04
3 Woodwinds  1922      62.9

4.1.1 Results and Discussion

• Woodwinds represented a slight majority of 58.02% of the total instruments.

• Brass accounts for 31.99%, with the updated “French Horn” category included.

• Others and Unknown categories make up smaller portions (5.04% and 4.94%, respectively).

Individual Instrument Distribution:

• Saxophone remains the most popular instrument (15.62%), followed by Flute (14.51%) and Clarinet (13.59%).

• Trumpet leads the brass instruments with 11.23%, while French Horn now accurately reflects its combined count.

• Others include: harmonica, tenor horn, non-western flute, recorder, ocarina, cornet, whistle, baritone, non-western reed, flugelhorn, misc.

4.2 Comparison Stats

Code

# Read the data
data_combined <- read_excel("../Data/R_Import_Transformed_15.02.25.xlsx", sheet = "Combined")

# Process instrument and RMT data
instrument_rmt_data <- data_combined %>%
  filter(!is.na(WI), !is.na(RMTMethods_YN)) %>%
  separate_rows(WI, sep = ",") %>%
  mutate(
    WI = trimws(WI),
    WI = case_when(
      WI == "French Horn/Horn" ~ "French Horn",
      TRUE ~ WI
    ),
    RMTMethods_YN = factor(RMTMethods_YN, 
                           levels = c(0, 1),
                           labels = c("No RMT", "RMT"))
  ) %>%
  filter(WI != "Unknown") %>%
  mutate(family = case_when(
    WI %in% c("Flute", "Piccolo", "Clarinet", "Saxophone", "Oboe", "Bassoon", "Recorder") ~ "Woodwinds",
    WI %in% c("Trumpet", "Trombone", "Tuba", "Euphonium", "French Horn") ~ "Brass",
    TRUE ~ "Others"
  ))

# Calculate counts and percentages for each RMT group (Changed grouping)
family_rmt_summary <- instrument_rmt_data %>%
  group_by(family, RMTMethods_YN) %>%
  summarise(count = n(), .groups = 'drop') %>%
  group_by(RMTMethods_YN) %>%  # Changed from family to RMTMethods_YN
  mutate(
    total = sum(count),
    percentage = (count / total) * 100
  )

# Calculate total N for adding to the plot
total_n <- nrow(instrument_rmt_data)

# Perform chi-square test on the full contingency table
contingency_table <- table(instrument_rmt_data$family, instrument_rmt_data$RMTMethods_YN)
chi_square_test <- chisq.test(contingency_table)
print("Chi-square test results:")

[1] "Chi-square test results:"

Code

print(chi_square_test)


    Pearson's Chi-squared test

data:  contingency_table
X-squared = 28.293, df = 2, p-value = 7.183e-07

Code

# Check the assumption: print the expected counts from the chi-square test
print("Expected counts:")

[1] "Expected counts:"

Code

print(chi_square_test$expected)

           
               No RMT       RMT
  Brass      815.9607 162.03929
  Others     253.6320  50.36804
  Woodwinds 1478.4073 293.59267

Code

# If any expected count is less than 5, issue a warning and switch to Fisher's exact test
if(min(chi_square_test$expected) < 5) {
  print("Chi-square test assumption violated. Switching to Fisher's exact test.")
  chi_square_test <- fisher.test(contingency_table)
  print("Fisher's exact test results:")
  print(chi_square_test)
}

# Create the plot
rmt_distribution_plot <- ggplot(family_rmt_summary, 
                               aes(x = family, y = count, fill = RMTMethods_YN)) +
  geom_bar(stat = "identity", position = "dodge", color = "black") +
  geom_text(aes(label = sprintf("%d\n(%.1f%%)", count, percentage)),
            position = position_dodge(width = 0.9),
            vjust = -0.5, size = 3) +
  labs(title = "Distribution of Instrument Families within RMT Usage Groups",
       subtitle = sprintf("Percentages calculated within each RMT group (Chi-square: χ² = %.2f, df = %d, p = %.4f)",
                        chi_square_test$statistic,
                        chi_square_test$parameter,
                        chi_square_test$p.value),
       x = "Instrument Family",
       y = paste("Number of Responses (Total N =", total_n, ")"),
       fill = "RMT Usage") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    axis.title = element_text(size = 12),
    legend.position = "right",
    plot.margin = margin(t = 30, r = 20, b = 20, l = 20)  # Add more space at the top
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2)))  # Add more space above bars

# Display the plot
print(rmt_distribution_plot)

Code

# Calculate and print odds ratios
print("\nOdds ratios between instrument families and RMT usage:")

[1] "\nOdds ratios between instrument families and RMT usage:"

Code

family_rmt_odds <- family_rmt_summary %>%
  select(family, RMTMethods_YN, count) %>%
  tidyr::pivot_wider(names_from = RMTMethods_YN, values_from = count, values_fill = list(count = 0)) %>%
  mutate(
    # Add a small constant to avoid division by zero
    odds_ratio = if(all(c("RMT", "No RMT") %in% names(.))) {
      (RMT + 0.5) / (sum(RMT) + 0.5) / (((`No RMT` + 0.5) / (sum(`No RMT`) + 0.5)))
    } else {
      NA
    }
  )

print(family_rmt_odds)

# A tibble: 3 × 4
  family    `No RMT`   RMT odds_ratio
  <chr>        <int> <int>      <dbl>
1 Brass          765   213      1.40 
2 Others         260    44      0.860
3 Woodwinds     1523   249      0.824

Code

# Create a new data frame with percentages within RMT groups for easier interpretation
rmt_percentages <- family_rmt_summary %>%
  select(family, RMTMethods_YN, count, percentage) %>%
  pivot_wider(names_from = RMTMethods_YN, 
              values_from = c(count, percentage),
              names_glue = "{RMTMethods_YN}_{.value}")

print("\nPercentage distribution within each RMT group:")

[1] "\nPercentage distribution within each RMT group:"

Code

print(rmt_percentages)

# A tibble: 3 × 5
  family    `No RMT_count` RMT_count `No RMT_percentage` RMT_percentage
  <chr>              <int>     <int>               <dbl>          <dbl>
1 Brass                765       213                30.0          42.1 
2 Others               260        44                10.2           8.70
3 Woodwinds           1523       249                59.8          49.2

Code

# Perform post-hoc analysis for each pair of instrument families
print("\nPairwise comparisons between instrument families:")

[1] "\nPairwise comparisons between instrument families:"

Code

families <- unique(instrument_rmt_data$family)
if(length(families) > 1) {
  for(i in 1:(length(families)-1)) {
    for(j in (i+1):length(families)) {
      subset_data <- instrument_rmt_data %>%
        filter(family %in% c(families[i], families[j]))
      
      # Create contingency table for this pair
      pair_table <- table(subset_data$family, subset_data$RMTMethods_YN)
      
      # Check if table has enough data for chi-square test
      if(all(dim(pair_table) > 0) && min(pair_table) > 0) {
        pair_test <- chisq.test(pair_table)
        test_type <- "Chi-square"
      } else {
        # Use Fisher's exact test as fallback
        pair_test <- fisher.test(pair_table, simulate.p.value = TRUE, B = 10000)
        test_type <- "Fisher's exact"
      }
      
      print(sprintf("\n%s vs %s:", families[i], families[j]))
      if(test_type == "Chi-square") {
        print(sprintf("%s test: Chi-square = %.2f, df = %d, p = %.4f",
                      test_type,
                      pair_test$statistic,
                      pair_test$parameter,
                      pair_test$p.value))
      } else {
        print(sprintf("%s test: p = %.4f", test_type, pair_test$p.value))
      }
    }
  }
}

[1] "\nWoodwinds vs Others:"
[1] "Chi-square test: Chi-square = 0.01, df = 1, p = 0.9156"
[1] "\nWoodwinds vs Brass:"
[1] "Chi-square test: Chi-square = 26.37, df = 1, p = 0.0000"
[1] "\nOthers vs Brass:"
[1] "Chi-square test: Chi-square = 7.27, df = 1, p = 0.0070"

Code

# Perform pairwise comparisons between instrument families with Bonferroni correction:
print("\nPairwise comparisons between instrument families (with Bonferroni correction):")

[1] "\nPairwise comparisons between instrument families (with Bonferroni correction):"

Code

families <- unique(instrument_rmt_data$family)
n_comparisons <- length(families) * (length(families)-1) / 2

for(i in 1:(length(families)-1)) {
  for(j in (i+1):length(families)) {
    subset_data <- instrument_rmt_data %>%
      filter(family %in% c(families[i], families[j]))
    pair_test <- chisq.test(table(subset_data$family, subset_data$RMTMethods_YN))
    p_value_corrected <- min(pair_test$p.value * n_comparisons, 1)  # Correction and capped at 1
    print(sprintf("\n%s vs %s:", families[i], families[j]))
    print(sprintf("Chi-square = %.2f, df = %d, raw p = %.4f, Bonferroni corrected p = %.4f",
                  pair_test$statistic,
                  pair_test$parameter,
                  pair_test$p.value,
                  p_value_corrected))
  }
}

[1] "\nWoodwinds vs Others:"
[1] "Chi-square = 0.01, df = 1, raw p = 0.9156, Bonferroni corrected p = 1.0000"
[1] "\nWoodwinds vs Brass:"
[1] "Chi-square = 26.37, df = 1, raw p = 0.0000, Bonferroni corrected p = 0.0000"
[1] "\nOthers vs Brass:"
[1] "Chi-square = 7.27, df = 1, raw p = 0.0070, Bonferroni corrected p = 0.0210"

Code

# Add new visualization specifically showing the distribution within RMT groups
percentage_plot <- ggplot(family_rmt_summary, 
                         aes(x = family, y = percentage, fill = family)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = sprintf("%.1f%%", percentage)),
            vjust = -0.5, size = 3.5) +
  facet_wrap(~ RMTMethods_YN, scales = "free_y") +
  labs(title = "Distribution of Instrument Families within Each RMT Usage Group",
       subtitle = "Showing percentage of each instrument family within 'RMT' and 'No RMT' groups",
       x = "Instrument Family",
       y = "Percentage within RMT Group (%)") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10, face = "italic"),
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 10),
    axis.title = element_text(size = 12),
    legend.position = "none"
  )

# Display the percentage plot
print(percentage_plot)

4.3 Results and Discussion

Based on the provided statistical output, there are significant differences in Respiratory Muscle Training (RMT) usage across instrument families, with brass players showing notably higher adoption rates compared to woodwind players and other instrumentalists.

Classification by Instrument Family:

Instruments are grouped into families based on predefined lists. For example:

Instruments like “Flute,” “Piccolo,” “Clarinet,” etc., are grouped as Woodwinds. Instruments like “Trumpet,” “Trombone,” “Tuba,” etc., are grouped as Brass. All others are assigned to a default category called Others.

Observed Frequencies Analysis

Raw Counts by Instrument Family:

Brass: 765 non-users, 213 RMT users (21.8% usage rate)
Woodwinds: 1,523 non-users, 249 RMT users (14.0% usage rate)
Others: 260 non-users, 44 RMT users (14.5% usage rate)

Chi-square Test Results:

Overall test: X² = 28.293, df = 2, p-value = 7.183e-07
This highly significant p-value (<0.001) indicates strong evidence of an association between instrument family and RMT usage.

Pairwise Comparisons

Woodwinds vs. Others:

Chi-square = 0.01, df = 1, p = 0.9156
Bonferroni corrected p = 1.0000
Interpretation: No significant difference in RMT usage between woodwind players and other instrumentalists.

Woodwinds vs. Brass:

Chi-square = 26.37, df = 1, p < 0.0001
Bonferroni corrected p < 0.0001
Interpretation: Highly significant difference, with brass players using RMT at a significantly higher rate than woodwind players.

Others vs. Brass:

Chi-square = 7.27, df = 1, p = 0.0070
Bonferroni corrected p = 0.0210
Interpretation: Significant difference even after correction for multiple comparisons, with brass players using RMT at a higher rate than other instrumentalists.

Key findings from the pairwise comparisons:

• Woodwinds vs. Brass: Highly significant difference (p < 0.0001)

• Others vs. Brass: Significant difference (p = 0.007)

• Woodwinds vs. Others: No significant difference (p = 0.916)

There is a statistically significant difference in RMT usage across instrument families
Brass players show significantly different patterns of RMT usage compared to both Woodwind players and Others
Woodwind players and Others show similar patterns of RMT usage
The proportion of RMT usage is highest among Brass players

X-squared = 55.92, df = 13, p-value = 2.784e-07

The plot shows the percentage distribution of instruments within each RMT group (No RMT vs RMT), with the actual counts and percentages labeled on each bar. The chi-square test confirms a significant association between instrument type and RMT usage (p < 0.001).

The Tuba, Euphonium, and Bagpipes have the highest percentages of RMT usage among instruments with at least 10 players.

The loop over instrument families now includes a Bonferroni correction to adjust p-values for multiple comparisons. The corrected p-value is calculated by multiplying the raw p-value by the number of pairwise comparisons. The output for each pair is printed with both raw and corrected p-values.

These modifications ensure that the chi-square test is valid (by checking the expected counts), and the risk of Type I error in multiple pairwise comparisons is reduced. The overall robustness of the analysis is thereby improved.

Odds Ratio Calculation:

Although the code includes a basic calculation for the odds ratios between instrument families, this step is intended to quantify the strength of the association between each family’s membership and RMT usage. A more sophisticated approach (like logistic regression) might be used for robust odds ratio calculations, but the given code calculates a simplified measure.

Calculated from raw counts:

Brass vs. Woodwinds: Brass players have approximately 1.71 times higher odds of using RMT compared to woodwind players (21.8% vs 14.0%).
Brass vs. Others: Brass players have approximately 1.64 times higher odds of using RMT compared to other instrumentalists (21.8% vs 14.5%).
Woodwinds vs. Others: Nearly identical RMT usage rates (14.0% vs 14.5%), confirming the non-significant chi-square result.

Comprehensive Interpretation

Brass instrumentalists stand out: Players of brass instruments show significantly higher adoption of RMT methods compared to both woodwind players and other instrumentalists. This difference remains significant even after applying the conservative Bonferroni correction for multiple comparisons.
Woodwinds and Others are similar: There is no meaningful difference in RMT usage between woodwind players and other instrumentalists, as evidenced by the very low chi-square value (0.01) and high p-value (0.9156).
Potential explanations: Several factors might explain the higher RMT adoption among brass players:

Specific physical demands or challenges unique to brass playing
Different pedagogical traditions or technological integration in brass education
Demographic differences between instrument families (e.g., age, gender, education level)
Potential differences in injury patterns or physical concerns across instrument families

Conclusion

Brass instrumentalists demonstrate significantly higher adoption of RMT methods compared to other musician groups. This finding may have implications for how RMT technologies are marketed, designed, or implemented across different instruments. Further research could explore the underlying reasons for this difference, which could inform more targeted approaches to technology integration in music education and performance.

5 Skill Level

5.1 Descriptive Stats

Code

# Read the data
data_combined <- read_excel("../Data/R_Import_Transformed_15.02.25.xlsx", sheet = "Combined")

# Extract playAbility_MAX and remove 0 responses
plot_data <- data_combined %>%
  filter(playAbility_MAX != 0, !is.na(playAbility_MAX)) %>% # Added NA check
  mutate(playAbility_MAX = as.factor(playAbility_MAX)) %>%
  count(playAbility_MAX) %>%
  mutate(percentage = n / sum(n) * 100,
         label = paste0(n, "\n(", sprintf("%.1f", percentage), "%)"))

# Define custom labels for x-axis
custom_labels <- c("1" = "Novice", "2" = "Beginner", 
                   "3" = "Intermediate", "4" = "Advanced", 
                   "5" = "Expert")

# Get the actual levels present in the data
actual_levels <- levels(plot_data$playAbility_MAX)

# Create Plot
playability_plot <- ggplot(plot_data, aes(x = playAbility_MAX, y = n)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = label), vjust = -0.5, size = 3.5) +
  labs(
    title = "Distribution of Play Ability (Max Score)",
    x = "Play Ability (Novice = 1 to Expert = 5)",
    y = "Count of Participants (N = 1558)"
  ) +
  scale_x_discrete(
    # Use actual levels from data rather than hardcoded values
    labels = custom_labels[actual_levels]
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.text = element_text(size = 12)
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2)))

# Display the Plot
print(playability_plot)

Code

# Combined Categories
# Read the data
library(tidyverse)
library(readxl)

data_combined <- read_excel("../Data/R_Import_Transformed_15.02.25.xlsx", sheet = "Combined")

# Process playAbility_MAX with the new category definitions
plot_data <- data_combined %>%
  filter(!is.na(playAbility_MAX)) %>%
  mutate(
    # Ensure playAbility_MAX is numeric
    playAbility_MAX = as.numeric(playAbility_MAX),
    
    # Create the new combined categories with half-point values
    skill_level = case_when(
      playAbility_MAX %in% c(0, 1, 1.5, 2) ~ "Novice/Beginner",
      playAbility_MAX %in% c(2.5, 3, 3.5) ~ "Intermediate",
      playAbility_MAX %in% c(4, 4.5, 5) ~ "Advanced/Expert",
      TRUE ~ NA_character_
    ),
    
    # Order factor levels
    skill_level = factor(skill_level, 
                        levels = c("Novice/Beginner", "Intermediate", "Advanced/Expert"))
  ) %>%
  # Remove any NA values from our categories
  filter(!is.na(skill_level)) %>%
  # Count by skill level
  group_by(skill_level) %>%
  summarise(count = n()) %>%
  # Calculate percentages
  mutate(
    total = sum(count),
    percentage = (count / total) * 100,
    label = paste0(count, "\n(", sprintf("%.1f", percentage), "%)")
  )

# Calculate total valid responses for y-axis label
total_n <- sum(plot_data$count)

# Create the plot
ggplot(plot_data, aes(x = skill_level, y = count, fill = skill_level)) +
  geom_bar(stat = "identity", width = 0.7) +
  geom_text(aes(label = label), vjust = -0.5, size = 4) +
  labs(
    title = "Distribution of Play Ability (Combined Categories)",
    x = "Play Ability Level",
    y = paste0("Number of Participants (N = ", total_n, ")")
  ) +
  scale_fill_manual(values = c("Novice/Beginner" = "#4682B4", 
                              "Intermediate" = "#6495ED", 
                              "Advanced/Expert" = "#1E90FF")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 11),
    legend.position = "none",
    panel.grid.major.x = element_blank()
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2)))

Code

# Additionally, print value counts to help diagnose any issues
value_counts <- data_combined %>%
  filter(!is.na(playAbility_MAX)) %>%
  count(playAbility_MAX) %>%
  arrange(playAbility_MAX)

print("Distribution of values in playAbility_MAX:")

[1] "Distribution of values in playAbility_MAX:"

Code

print(value_counts)

# A tibble: 10 × 2
   playAbility_MAX     n
             <dbl> <int>
 1             0       1
 2             1       6
 3             1.5     3
 4             2      32
 5             2.5    34
 6             3     222
 7             3.5   156
 8             4     484
 9             4.5   173
10             5     447

5.2 Results and Discussion

Looking at the distribution of self-reported playing ability scores (1 = Novice to 5 = Expert) among the 1,558 participants:

• Score 4 is the most common with approximately 500 participants (~32%), indicating many consider themselves advanced players

• Score 3 is the second most frequent with about 450 participants (~29%), representing intermediate-level players

• Score 5 (Expert level) has roughly 350 participants (~22%)

• Score 2 (Beginner-intermediate) has about 200 participants (~13%)

• Score 1 (Novice) is the least common with fewer than 100 participants (~4%)

The distribution is slightly negatively skewed (skewed towards higher scores), with the majority of participants (>80%) rating their playing ability as intermediate or higher (scores 3-5). This suggests that the sample consists primarily of experienced wind instrument players, with relatively few beginners. The peak at score 4 indicates that most participants consider themselves advanced, but not expert-level players.

The plot shows the distribution of self-reported playing ability (1 = Novice to 5 = Expert) for each wind instrument. Some key observations:

• Flute shows the highest number of responses, with a right-skewed distribution (more advanced players)

• Most instruments show fewer responses overall compared to flute and clarinet

• Bagpipes and euphonium show notably fewer responses

• The warning message indicates some missing values were removed, which is expected as not all participants play all instruments

5.2.1 Combined Groups

Key Observations:

Strong Positive Skew: The distribution is heavily skewed toward the higher end of the scale, with 71.1% of participants (1,104 out of 1,558) falling in the 4.0-5.0 range.
Highest Frequencies:

Level 4.0: 484 participants (31.1%)
Level 5.0: 447 participants (28.7%)
Level 3.0: 222 participants (14.2%)

Lowest Frequencies:

Level 0.0: 1 participant (0.1%)
Level 1.5: 3 participants (0.2%)
Level 1.0: 6 participants (0.4%)

Half-Point vs. Whole-Number Ratings: Half-point ratings (1.5, 2.5, 3.5, 4.5) consistently show lower frequencies than adjacent whole numbers, suggesting raters may prefer whole numbers when assessing ability but use half points for more nuanced judgments.

Combined Category Analysis

When organized into three combined categories:

Novice/Beginner (1.0, 1.5, 2.0): 41 participants (2.6%)

This group represents a very small portion of the sample, indicating either few beginners participated or the population being studied generally has higher ability levels.

Intermediate (2.5, 3.0, 3.5): 412 participants (26.5%)

Representing about a quarter of participants with moderate skill levels.

Advanced/Expert (4.0, 4.5, 5.0): 1,104 participants (70.9%)

Clearly the dominant group, comprising more than two-thirds of the sample.

Outlier (0.0): 1 participant (0.1%)

This likely represents an error, a special case, or potentially a “non-player” category.

Implications and Considerations

Potential Explanations for the Skewed Distribution:

Population Characteristics: This may accurately reflect a population primarily consisting of advanced players, such as professional or experienced musicians.
Self-Selection Bias: Advanced players might be more likely to participate in studies related to their expertise or complete relevant surveys.
Self-Assessment Inflation: If self-reported, participants may tend to overestimate their abilities.
Scale Calibration: The assessment scale might not optimally differentiate skill levels in this specific population, potentially causing a ceiling effect where many advanced players cluster at the top.
Recruitment Strategy: The participant recruitment process may have targeted more experienced players.

Research and Application Considerations

Representativeness: Consider whether this distribution accurately represents the target population or reflects selection biases.
Statistical Analysis: The heavy skew toward higher ratings might influence statistical analyses that assume normal distribution.
Group Comparisons: When comparing groups (e.g., by demographics), be aware that the small sample of beginners might limit certain analyses.
Scale Refinement: For future studies, consider a more differentiated scale at the upper end to better distinguish between advanced and expert performers.

Conclusion

The data reveals a population predominantly composed of advanced/expert level players, with relatively few beginners. This distribution is important context for any further analyses or conclusions drawn from this dataset, particularly when examining factors associated with play ability or when making generalizations about the broader population of players.

5.3 Comparison Stats

Code

# Original Analysis without Combined Groups

# Read the data
data_combined <- read_excel("../Data/R_Import_Transformed_15.02.25.xlsx", sheet = "Combined")

# Data Preparation: Filter playAbility_MAX and prepare variables
data_combined <- data_combined %>%
  mutate(
    playAbility_MAX = as.numeric(playAbility_MAX),
    RMTMethods_YN = factor(RMTMethods_YN, levels = c(0, 1), labels = c("No RMT", "RMT"))
  )

# Filter out cases where playAbility_MAX is 0 or NA
analysis_data <- data_combined %>%
  filter(!is.na(playAbility_MAX), playAbility_MAX != 0, !is.na(RMTMethods_YN)) %>%
  mutate(
    # Create factor version of playAbility with proper labels
    playAbility_factor = factor(playAbility_MAX, 
                               levels = 1:5,
                               labels = c("Novice", "Beginner", "Intermediate", "Advanced", "Expert")),
    # Create binary variables for logistic regression
    high_play = ifelse(playAbility_MAX >= 4, 1, 0),
    RMT_binary = ifelse(RMTMethods_YN == "RMT", 1, 0)
  )

# Calculate counts by playAbility and RMT usage
grouped_data <- analysis_data %>%
  group_by(RMTMethods_YN, playAbility_factor) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(RMTMethods_YN) %>%
  mutate(percentage = count / sum(count) * 100,
         label = paste0(count, "\n(", sprintf("%.1f", percentage), "%)")) %>%
  ungroup()

# Statistical Analysis: Chi-square Test of Independence
contingency_table <- table(analysis_data$playAbility_factor, analysis_data$RMTMethods_YN)

# Use simulation-based p-value calculation for chi-square test
chi_test <- chisq.test(contingency_table, simulate.p.value = TRUE, B = 10000)

# Print the statistical results
cat("\nChi-square Test Results (Independence between playAbility and RMT Usage):\n")


Chi-square Test Results (Independence between playAbility and RMT Usage):

Code

print(chi_test)


    Pearson's Chi-squared test with simulated p-value (based on 10000
    replicates)

data:  contingency_table
X-squared = 23.066, df = NA, p-value = 0.0005999

Code

# Check expected frequencies
expected_freqs <- chi_test$expected
print("Expected frequencies:")

[1] "Expected frequencies:"

Code

print(expected_freqs)

              
                   No RMT       RMT
  Novice         5.093199  0.906801
  Beginner      27.163728  4.836272
  Intermediate 188.448363 33.551637
  Advanced     410.851385 73.148615
  Expert       379.443325 67.556675

Code

# Calculate standardized residuals
std_residuals <- data.frame(
  playAbility = rep(rownames(chi_test$stdres), times = ncol(chi_test$stdres)),
  RMTMethods = rep(colnames(chi_test$stdres), each = nrow(chi_test$stdres)),
  std_residual = as.vector(chi_test$stdres),
  rounded_res = round(as.vector(chi_test$stdres), 2)
)

# Print significant residuals
sig_residuals <- std_residuals %>% 
  filter(abs(std_residual) > 1.96)
cat("\nSignificant Standardized Residuals (>|1.96|):\n")


Significant Standardized Residuals (>|1.96|):

Code

print(sig_residuals)

   playAbility RMTMethods std_residual rounded_res
1 Intermediate     No RMT     3.853893        3.85
2       Expert     No RMT    -3.916831       -3.92
3 Intermediate        RMT    -3.853893       -3.85
4       Expert        RMT     3.916831        3.92

Code

# Calculate effect size: Cramer's V
n_total <- sum(contingency_table)
df_min <- min(nrow(contingency_table) - 1, ncol(contingency_table) - 1)
cramer_v <- sqrt(chi_test$statistic / (n_total * df_min))
cat("\nEffect Size (Cramer's V):\n")


Effect Size (Cramer's V):

Code

print(cramer_v)

X-squared 
0.1391648

Code

# Create Plot: Grouped Bar Plot
playability_group_plot <- ggplot(grouped_data, aes(x = playAbility_factor, y = count, fill = RMTMethods_YN)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9)) +
  geom_text(aes(label = label), position = position_dodge(width = 0.9), vjust = -0.5, size = 3.5) +
  labs(
    title = "Distribution of Play Ability by RMT Usage",
    x = "Play Ability Level",
    y = paste0("Count of Participants (N = ", nrow(analysis_data), ")"),
    fill = "RMT Usage"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.text = element_text(size = 12)
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2)))

# Display the grouped bar plot
print(playability_group_plot)

Code

# Logistic Regression Analysis
# Run continuous model
logit_model <- glm(RMT_binary ~ playAbility_MAX, 
                  data = analysis_data, 
                  family = "binomial")

# Print model summary
summary_output <- summary(logit_model)
print(summary_output)


Call:
glm(formula = RMT_binary ~ playAbility_MAX, family = "binomial", 
    data = analysis_data)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -4.0268     0.4393  -9.166  < 2e-16 ***
playAbility_MAX   0.5417     0.1010   5.364 8.16e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1296.9  on 1556  degrees of freedom
Residual deviance: 1265.1  on 1555  degrees of freedom
AIC: 1269.1

Number of Fisher Scoring iterations: 5

Code

# Calculate odds ratios and confidence intervals
odds_ratios <- exp(coef(logit_model))
conf_intervals <- exp(confint(logit_model))

cat("\nOdds Ratios with 95% Confidence Intervals:\n")


Odds Ratios with 95% Confidence Intervals:

Code

or_results <- data.frame(
  Term = names(odds_ratios),
  OddsRatio = round(odds_ratios, 3),
  CI_lower = round(conf_intervals[,1], 3),
  CI_upper = round(conf_intervals[,2], 3)
)
print(or_results)

                           Term OddsRatio CI_lower CI_upper
(Intercept)         (Intercept)     0.018    0.007    0.041
playAbility_MAX playAbility_MAX     1.719    1.415    2.103

Code

# Predicted probabilities for each playing ability level
new_data <- data.frame(playAbility_MAX = 1:5)
predicted_probs <- predict(logit_model, newdata = new_data, type = "response")
result_df <- data.frame(
  playAbility_level = 1:5,
  predicted_probability = predicted_probs
)
cat("\nPredicted probabilities of RMT usage by playing ability level:\n")


Predicted probabilities of RMT usage by playing ability level:

Code

print(result_df)

  playAbility_level predicted_probability
1                 1            0.02974025
2                 2            0.05005305
3                 3            0.08305206
4                 4            0.13472130
5                 5            0.21113395

Code

# Create a visualization of predicted probabilities
library(ggplot2)

# Plot predicted probabilities from continuous model
prob_plot <- ggplot(result_df, aes(x = playAbility_level, y = predicted_probability)) +
  geom_line(size = 1.5) +
  geom_point(size = 3) +
  scale_x_continuous(breaks = 1:5, 
                     labels = c("Novice", "Beginner", "Intermediate", "Advanced", "Expert")) +
  labs(title = "Predicted Probability of RMT Usage by Playing Ability Level",
       x = "Playing Ability Level",
       y = "Probability of Using RMT Methods") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.text = element_text(size = 12),
    axis.title = element_text(size = 14)
  ) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1),
                     limits = c(0, max(predicted_probs) * 1.1))

# Display the probability plot
print(prob_plot)

Code

# Calculate McFadden's Pseudo R-squared
null_model <- glm(RMT_binary ~ 1, data = analysis_data, family = "binomial")
logLik_full <- as.numeric(logLik(logit_model))
logLik_null <- as.numeric(logLik(null_model))
mcfadden_r2 <- 1 - (logLik_full / logLik_null)
cat(paste("\nMcFadden's Pseudo R-squared:", round(mcfadden_r2, 4)))


McFadden's Pseudo R-squared: 0.0245

Code

# Classification metrics with robust handling
predicted_classes <- ifelse(fitted(logit_model) > 0.5, 1, 0)
confusion_matrix <- table(Predicted = factor(predicted_classes, levels = c(0, 1)), 
                         Actual = factor(analysis_data$RMT_binary, levels = c(0, 1)))
cat("\n\nConfusion Matrix:\n")



Confusion Matrix:

Code

print(confusion_matrix)

         Actual
Predicted    0    1
        0 1329  228
        1    0    0

Code

# Calculate metrics with checks for zero denominators
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
sensitivity <- ifelse(sum(confusion_matrix[,2]) > 0, 
                     confusion_matrix[2,2] / sum(confusion_matrix[,2]), 
                     NA)
specificity <- ifelse(sum(confusion_matrix[,1]) > 0, 
                     confusion_matrix[1,1] / sum(confusion_matrix[,1]), 
                     NA)

cat(paste("\nAccuracy:", round(accuracy, 3)))


Accuracy: 0.854

Code

cat(paste("\nSensitivity (True Positive Rate):", 
         ifelse(is.na(sensitivity), "Not calculable", round(sensitivity, 3))))


Sensitivity (True Positive Rate): 0

Code

cat(paste("\nSpecificity (True Negative Rate):", 
         ifelse(is.na(specificity), "Not calculable", round(specificity, 3))))


Specificity (True Negative Rate): 1

Code

# Combined Groups
# Read the data
data_combined <- read_excel("../Data/R_Import_Transformed_15.02.25.xlsx", sheet = "Combined")

# Data Preparation: Filter playAbility_MAX and prepare variables
data_combined <- data_combined %>%
  mutate(
    playAbility_MAX = as.numeric(playAbility_MAX),
    RMTMethods_YN = factor(RMTMethods_YN, levels = c(0, 1), labels = c("No RMT", "RMT"))
  )

# Filter out cases where playAbility_MAX is NA
analysis_data <- data_combined %>%
  filter(!is.na(playAbility_MAX), !is.na(RMTMethods_YN)) %>%
  mutate(
    # Create new combined categories - note that 0 is now included in Novice/Beginner
    playAbility_combined = case_when(
      playAbility_MAX %in% c(0, 1, 1.5, 2) ~ "Novice/Beginner",
      playAbility_MAX %in% c(2.5, 3, 3.5) ~ "Intermediate",
      playAbility_MAX %in% c(4, 4.5, 5) ~ "Advanced/Expert",
      TRUE ~ NA_character_
    ),
    # Order the factor levels correctly
    playAbility_combined = factor(playAbility_combined, 
                                 levels = c("Novice/Beginner", "Intermediate", "Advanced/Expert")),
    
    # Create binary variables for logistic regression
    high_play = ifelse(playAbility_MAX >= 4, 1, 0),
    RMT_binary = ifelse(RMTMethods_YN == "RMT", 1, 0)
  )

# Calculate counts by combined playAbility and RMT usage
grouped_data <- analysis_data %>%
  group_by(RMTMethods_YN, playAbility_combined) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(RMTMethods_YN) %>%
  mutate(percentage = count / sum(count) * 100,
         label = paste0(count, "\n(", sprintf("%.1f", percentage), "%)")) %>%
  ungroup()

# Statistical Analysis: Chi-square Test of Independence
contingency_table <- table(analysis_data$playAbility_combined, analysis_data$RMTMethods_YN)

# Use simulation-based p-value calculation for chi-square test
chi_test <- chisq.test(contingency_table, simulate.p.value = TRUE, B = 10000)

# Print the statistical results
cat("\nChi-square Test Results (Independence between playAbility and RMT Usage):\n")


Chi-square Test Results (Independence between playAbility and RMT Usage):

Code

print(chi_test)


    Pearson's Chi-squared test with simulated p-value (based on 10000
    replicates)

data:  contingency_table
X-squared = 26.337, df = NA, p-value = 9.999e-05

Code

# Check expected frequencies
expected_freqs <- chi_test$expected
print("Expected frequencies:")

[1] "Expected frequencies:"

Code

print(expected_freqs)

                 
                     No RMT        RMT
  Novice/Beginner  35.85366   6.146341
  Intermediate    351.70732  60.292683
  Advanced/Expert 942.43902 161.560976

Code

# Calculate standardized residuals
std_residuals <- data.frame(
  playAbility = rep(rownames(chi_test$stdres), times = ncol(chi_test$stdres)),
  RMTMethods = rep(colnames(chi_test$stdres), each = nrow(chi_test$stdres)),
  std_residual = as.vector(chi_test$stdres),
  rounded_res = round(as.vector(chi_test$stdres), 2)
)

# Print significant residuals
sig_residuals <- std_residuals %>% 
  filter(abs(std_residual) > 1.96)
cat("\nSignificant Standardized Residuals (>|1.96|):\n")


Significant Standardized Residuals (>|1.96|):

Code

print(sig_residuals)

      playAbility RMTMethods std_residual rounded_res
1    Intermediate     No RMT     4.923283        4.92
2 Advanced/Expert     No RMT    -5.116975       -5.12
3    Intermediate        RMT    -4.923283       -4.92
4 Advanced/Expert        RMT     5.116975        5.12

Code

# Calculate effect size: Cramer's V
n_total <- sum(contingency_table)
df_min <- min(nrow(contingency_table) - 1, ncol(contingency_table) - 1)
cramer_v <- sqrt(chi_test$statistic / (n_total * df_min))
cat("\nEffect Size (Cramer's V):\n")


Effect Size (Cramer's V):

Code

print(cramer_v)

X-squared 
0.1300163

Code

# Create Plot: Grouped Bar Plot
playability_group_plot <- ggplot(grouped_data, aes(x = playAbility_combined, y = count, fill = RMTMethods_YN)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9)) +
  geom_text(aes(label = label), position = position_dodge(width = 0.9), vjust = -0.5, size = 3.5) +
  labs(
    title = "Distribution of Play Ability by RMT Usage",
    subtitle = "Using Combined Ability Categories (0 included in Novice/Beginner)",
    x = "Play Ability Level",
    y = paste0("Count of Participants (N = ", nrow(analysis_data), ")"),
    fill = "RMT Usage"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 12, face = "italic"),
    axis.text = element_text(size = 12)
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2)))

# Display the grouped bar plot
print(playability_group_plot)

Code

# Logistic Regression Analysis
# First create a numeric version of the combined categories for the continuous model
analysis_data <- analysis_data %>%
  mutate(
    playAbility_combined_numeric = case_when(
      playAbility_combined == "Novice/Beginner" ~ 1,
      playAbility_combined == "Intermediate" ~ 2,
      playAbility_combined == "Advanced/Expert" ~ 3,
      TRUE ~ NA_real_
    )
  )

# Run continuous model with the combined categories
logit_model <- glm(RMT_binary ~ playAbility_combined_numeric, 
                  data = analysis_data, 
                  family = "binomial")

# Print model summary
summary_output <- summary(logit_model)
print(summary_output)


Call:
glm(formula = RMT_binary ~ playAbility_combined_numeric, family = "binomial", 
    data = analysis_data)

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -4.0341     0.5056  -7.979 1.47e-15 ***
playAbility_combined_numeric   0.8246     0.1776   4.643 3.43e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1297.2  on 1557  degrees of freedom
Residual deviance: 1271.1  on 1556  degrees of freedom
AIC: 1275.1

Number of Fisher Scoring iterations: 5

Code

# Calculate odds ratios and confidence intervals
odds_ratios <- exp(coef(logit_model))
conf_intervals <- exp(confint(logit_model))

cat("\nOdds Ratios with 95% Confidence Intervals:\n")


Odds Ratios with 95% Confidence Intervals:

Code

or_results <- data.frame(
  Term = names(odds_ratios),
  OddsRatio = round(odds_ratios, 3),
  CI_lower = round(conf_intervals[,1], 3),
  CI_upper = round(conf_intervals[,2], 3)
)
print(or_results)

                                                     Term OddsRatio CI_lower
(Intercept)                                   (Intercept)     0.018    0.006
playAbility_combined_numeric playAbility_combined_numeric     2.281    1.632
                             CI_upper
(Intercept)                     0.046
playAbility_combined_numeric    3.280

Code

# Predicted probabilities for each combined playing ability level
new_data <- data.frame(playAbility_combined_numeric = 1:3)
predicted_probs <- predict(logit_model, newdata = new_data, type = "response")
result_df <- data.frame(
  playAbility_level = c("Novice/Beginner", "Intermediate", "Advanced/Expert"),
  combined_level_value = 1:3,
  predicted_probability = predicted_probs
)
cat("\nPredicted probabilities of RMT usage by combined playing ability level:\n")


Predicted probabilities of RMT usage by combined playing ability level:

Code

print(result_df)

  playAbility_level combined_level_value predicted_probability
1   Novice/Beginner                    1            0.03880645
2      Intermediate                    2            0.08432102
3   Advanced/Expert                    3            0.17357778

Code

# Create a visualization of predicted probabilities
library(ggplot2)

# Plot predicted probabilities from continuous model
prob_plot <- ggplot(result_df, aes(x = combined_level_value, y = predicted_probability)) +
  geom_line(size = 1.5) +
  geom_point(size = 3) +
  scale_x_continuous(breaks = 1:3, 
                     labels = c("Novice/Beginner", "Intermediate", "Advanced/Expert")) +
  labs(title = "Predicted Probability of RMT Usage by Playing Ability Level",
       subtitle = "Based on Combined Ability Categories (0 included in Novice/Beginner)",
       x = "Playing Ability Level",
       y = "Probability of Using RMT Methods") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 12, face = "italic"),
    axis.text = element_text(size = 12),
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title = element_text(size = 14)
  ) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1),
                     limits = c(0, max(predicted_probs) * 1.1))

# Display the probability plot
print(prob_plot)

Code

# Calculate McFadden's Pseudo R-squared
null_model <- glm(RMT_binary ~ 1, data = analysis_data, family = "binomial")
logLik_full <- as.numeric(logLik(logit_model))
logLik_null <- as.numeric(logLik(null_model))
mcfadden_r2 <- 1 - (logLik_full / logLik_null)
cat(paste("\nMcFadden's Pseudo R-squared:", round(mcfadden_r2, 4)))


McFadden's Pseudo R-squared: 0.0201

Code

# Classification metrics with robust handling
predicted_classes <- ifelse(fitted(logit_model) > 0.5, 1, 0)
confusion_matrix <- table(Predicted = factor(predicted_classes, levels = c(0, 1)), 
                         Actual = factor(analysis_data$RMT_binary, levels = c(0, 1)))
cat("\n\nConfusion Matrix:\n")



Confusion Matrix:

Code

print(confusion_matrix)

         Actual
Predicted    0    1
        0 1330  228
        1    0    0

Code

# Calculate metrics with checks for zero denominators
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
sensitivity <- ifelse(sum(confusion_matrix[,2]) > 0, 
                     confusion_matrix[2,2] / sum(confusion_matrix[,2]), 
                     NA)
specificity <- ifelse(sum(confusion_matrix[,1]) > 0, 
                     confusion_matrix[1,1] / sum(confusion_matrix[,1]), 
                     NA)

cat(paste("\nAccuracy:", round(accuracy, 3)))


Accuracy: 0.854

Code

cat(paste("\nSensitivity (True Positive Rate):", 
         ifelse(is.na(sensitivity), "Not calculable", round(sensitivity, 3))))


Sensitivity (True Positive Rate): 0

Code

cat(paste("\nSpecificity (True Negative Rate):", 
         ifelse(is.na(specificity), "Not calculable", round(specificity, 3))))


Specificity (True Negative Rate): 1

Code

# Additional analysis: Also run categorical model
logit_model_categorical <- glm(RMT_binary ~ playAbility_combined, 
                              data = analysis_data, 
                              family = "binomial")

# Print categorical model summary
cat("\n\nCategorical Model Results (using combined categories as factors):\n")



Categorical Model Results (using combined categories as factors):

Code

print(summary(logit_model_categorical))


Call:
glm(formula = RMT_binary ~ playAbility_combined, family = "binomial", 
    data = analysis_data)

Coefficients:
                                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)                          -2.2513     0.5257  -4.283 1.85e-05 ***
playAbility_combinedIntermediate     -0.2929     0.5588  -0.524    0.600    
playAbility_combinedAdvanced/Expert   0.7057     0.5316   1.328    0.184    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1297.2  on 1557  degrees of freedom
Residual deviance: 1267.8  on 1555  degrees of freedom
AIC: 1273.8

Number of Fisher Scoring iterations: 5

Code

# Calculate odds ratios for categorical model
cat("\nOdds Ratios for Categorical Model:\n")


Odds Ratios for Categorical Model:

Code

odds_ratios_cat <- exp(coef(logit_model_categorical))
conf_intervals_cat <- exp(confint(logit_model_categorical))
or_results_cat <- data.frame(
  Term = names(odds_ratios_cat),
  OddsRatio = round(odds_ratios_cat, 3),
  CI_lower = round(conf_intervals_cat[,1], 3),
  CI_upper = round(conf_intervals_cat[,2], 3)
)
print(or_results_cat)

                                                                   Term
(Intercept)                                                 (Intercept)
playAbility_combinedIntermediate       playAbility_combinedIntermediate
playAbility_combinedAdvanced/Expert playAbility_combinedAdvanced/Expert
                                    OddsRatio CI_lower CI_upper
(Intercept)                             0.105    0.032    0.262
playAbility_combinedIntermediate        0.746    0.276    2.609
playAbility_combinedAdvanced/Expert     2.025    0.802    6.816

Code

# Print breakdown of original values in each combined category
cat("\nBreakdown of original playAbility_MAX values in each combined category:\n")


Breakdown of original playAbility_MAX values in each combined category:

Code

value_counts <- analysis_data %>%
  count(playAbility_MAX, playAbility_combined) %>%
  arrange(playAbility_MAX)
print(value_counts)

# A tibble: 10 × 3
   playAbility_MAX playAbility_combined     n
             <dbl> <fct>                <int>
 1             0   Novice/Beginner          1
 2             1   Novice/Beginner          6
 3             1.5 Novice/Beginner          3
 4             2   Novice/Beginner         32
 5             2.5 Intermediate            34
 6             3   Intermediate           222
 7             3.5 Intermediate           156
 8             4   Advanced/Expert        484
 9             4.5 Advanced/Expert        173
10             5   Advanced/Expert        447

5.4 Results and Discussion (based on combined findings)

This analysis examines the relationship between musicians’ play ability levels (grouped into three categories) and their use of Remote Monitoring Technology (RMT). The results show a statistically significant association between play ability and RMT usage, with specific patterns of usage across different skill levels.

Chi-Square Test of Independence

Test Statistic: X² = 26.337
P-value: p < 0.0001 (based on 10,000 simulated replicates)
Interpretation: There is very strong evidence of an association between play ability and RMT usage. The null hypothesis that these variables are independent can be confidently rejected.

Effect Size

Cramer’s V: 0.13
Interpretation: This represents a small to moderate association between play ability and RMT usage. While statistically significant, the strength of the relationship is not overwhelming.

Standardized Residuals Analysis

The standardized residuals reveal exactly where the significant deviations from independence occur:

Intermediate players and RMT usage:
Strong negative residual (-4.92) for “RMT”
Strong positive residual (4.92) for “No RMT”
This indicates intermediate players are significantly less likely to use RMT than expected
Advanced/Expert players and RMT usage:
Strong positive residual (5.12) for “RMT”
Strong negative residual (-5.12) for “No RMT”
This indicates advanced/expert players are significantly more likely to use RMT than expected

Logistic Regression Models

Continuous Model

Coefficient for playAbility_combined_numeric: 0.8246 (p < 0.0001)
Odds Ratio: 2.281 (95% CI: 1.632-3.280)
Interpretation: For each step up in the combined ability category (e.g., from Novice/Beginner to Intermediate), the odds of using RMT increase by a factor of 2.28.

Categorical Model

Intermediate vs. Novice/Beginner:
Odds Ratio: 0.746 (non-significant, p = 0.600)
Intermediate players have slightly lower odds of using RMT than novice/beginners, but this difference is not statistically significant
Advanced/Expert vs. Novice/Beginner:
Odds Ratio: 2.025 (non-significant, p = 0.184)
Advanced/expert players have approximately double the odds of using RMT compared to novice/beginners, but this difference does not reach statistical significance.

Predicted Probabilities

The model predicts the following probabilities of RMT usage:

Novice/Beginner: 3.9%
Intermediate: 8.4%
Advanced/Expert: 17.4%

This shows a clear increasing trend in RMT usage with higher play ability levels, with advanced/expert players having more than four times the probability of using RMT compared to novice/beginners.

Model Performance

McFadden’s Pseudo R-squared: 0.0201
Accuracy: 85.4%
Sensitivity: 0 (model fails to predict any true RMT users)
Specificity: 1 (model correctly identifies all non-RMT users)

The model has modest explanatory power and defaults to predicting “No RMT” for all observations, which yields high accuracy due to the imbalanced dataset (far more non-users than users).

Distribution of Participants

Novice/Beginner (values 0, 1, 1.5, 2): 42 participants
Intermediate (values 2.5, 3, 3.5): 412 participants
Advanced/Expert (values 4, 4.5, 5): 1,104 participants

The sample is heavily weighted toward advanced/expert players, with relatively few novice/beginners.

Key Insights and Implications

Skill-Technology Relationship: There is a significant positive relationship between play ability and RMT usage, with more advanced players more likely to adopt this technology.
Intermediate Player Resistance: Intermediate players show a particularly strong pattern of avoiding RMT usage, which could indicate:

Transitional learning approaches at this skill level
Concerns about technology hindering development of fundamental skills
Different teaching methodologies that don’t emphasize technology

Advanced Player Adoption: Advanced/expert players embrace RMT at higher rates, possibly because:

They use technology to refine already-developed skills
They have greater technical understanding of their instruments
They may be more involved in teaching and use RMT as pedagogical tools

Practical Applications:

Technology developers should consider targeting advanced players as early adopters
Different approaches may be needed to increase RMT adoption among intermediate players
Educational approaches might need to be tailored by skill level.

Despite the statistical significance of the relationship, the modest effect size suggests that play ability is just one of many factors influencing RMT adoption among musicians.

6 Country of Residence

6.1 Descriptive Stats

Code

# Read the data
data_combined <- read_excel("../Data/R_Import_Transformed_15.02.25.xlsx", sheet = "Combined")

# Calculate the total N
total_N <- nrow(data_combined)

# Modify country names: abbreviate USA and UK
data_combined <- data_combined %>%
  mutate(countryLive = case_when(
    countryLive == "United States of America (USA)" ~ "USA",
    countryLive == "United Kingdom (UK)" ~ "UK",
    TRUE ~ countryLive
  ))

# Compute counts and percentages for the 'countryLive' column
country_summary <- data_combined %>%
  group_by(countryLive) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  mutate(percentage = count / total_N * 100) %>%
  arrange(desc(count))

# Select the top 4 countries (using the highest counts)
top_countries <- country_summary %>%
  top_n(4, wt = count) %>%
  mutate(
    label = paste0(count, "\n(", sprintf("%.1f", percentage), "%)"),
    # Reorder to display from highest to lowest
    countryLive = reorder(countryLive, -count)
  )

# Perform chi-square goodness-of-fit test for top 4 countries
# Expected frequencies for equality among the 4 groups
observed <- top_countries$count
expected <- rep(sum(observed)/length(observed), length(observed))
chi_test <- chisq.test(x = observed, p = rep(1/length(observed), length(observed)))
print("Chi-square goodness-of-fit test for equal distribution among top 4 countries:")

[1] "Chi-square goodness-of-fit test for equal distribution among top 4 countries:"

Code

print(chi_test)


    Chi-squared test for given probabilities

data:  observed
X-squared = 390.66, df = 3, p-value < 2.2e-16

Code

# Create the bar plot with counts and percentages and add the note
country_plot <- ggplot(top_countries, aes(x = countryLive, y = count)) +
  geom_bar(stat = "identity", fill = "steelblue", color = "black") +
  geom_text(aes(label = label), vjust = -0.5, size = 4) +
  labs(title = "Top 4 Countries (Live)",
       x = "Country",
       y = paste0("Count of Participants (N = ", total_N, ")"),
       subtitle = paste0("Chi-square: ", sprintf('%.2f', chi_test$statistic), 
                         " (df = ", chi_test$parameter, 
                         "), p = ", ifelse(chi_test$p.value < 0.001, "< .001", sprintf('%.3f', chi_test$p.value))),
       caption = "Note: Countries with frequencies less than 5% were removed from the figure.") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    axis.text.x = element_text(size = 12, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 12),
    plot.subtitle = element_text(size = 12),
    plot.caption = element_text(size = 10, hjust = 0, face = "italic", margin = margin(t = 15)),
    plot.margin = margin(b = 30, t = 20, r = 20, l = 20) # Add extra bottom margin for the note
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2)))

# Display the plot
print(country_plot)

Code

# Print summary statistics
print("Summary Statistics for Top 4 Countries:")

[1] "Summary Statistics for Top 4 Countries:"

Code

print(top_countries %>% select(countryLive, count, percentage) %>% arrange(desc(count)))

# A tibble: 4 × 3
  countryLive count percentage
  <fct>       <int>      <dbl>
1 USA           610      39.2 
2 UK            358      23.0 
3 Australia     326      20.9 
4 Canada         91       5.84

Code

# Combined 5% values with note
# Read the data
library(tidyverse)
library(readxl)
library(ggplot2)
library(stringr)

data_combined <- read_excel("../Data/R_Import_Transformed_15.02.25.xlsx", sheet = "Combined")

# Process the countryLive column
country_analysis <- data_combined %>%
  # Remove missing values
  filter(!is.na(countryLive)) %>%
  # Clean country names (trim whitespace, standardize case if needed)
  mutate(countryLive = trimws(countryLive)) %>%
  # Count occurrences of each country
  count(countryLive) %>%
  # Calculate percentages
  mutate(
    total = sum(n),
    percentage = n / total * 100,
    # Create grouped category (countries with <5% go to "Other")
    country_group = if_else(percentage >= 5, countryLive, "Other")
  ) %>%
  # Sort by frequency
  arrange(desc(n))

# Identify countries that will be in the "Other" category
other_countries <- country_analysis %>%
  filter(country_group == "Other") %>%
  arrange(desc(n))

# Create a formatted string listing all "Other" countries WITHOUT percentages
other_countries_list <- other_countries %>%
  pull(countryLive) %>%
  paste(collapse = ", ")

# Create data for plotting
plot_data <- country_analysis %>%
  group_by(country_group) %>%
  summarise(
    count = sum(n),
    percentage = sum(percentage)
  ) %>%
  # Ensure "Other" appears at the end
  mutate(
    country_group = factor(country_group),
    country_group = fct_relevel(fct_reorder(country_group, percentage, .desc = TRUE), "Other", after = Inf)
  )

# Calculate total participants
total_participants <- sum(country_analysis$n)

# Create the base note text
note_text <- paste0("Note: 'Other' includes frequencies <5%: ", other_countries_list)

# Manually wrap the note text to ensure it's fully visible
wrapped_note <- str_wrap(note_text, width = 100)

# Create the plot with adjusted bottom margin and caption styling
country_plot <- ggplot(plot_data, aes(x = country_group, y = count)) +
  geom_bar(stat = "identity", fill = "steelblue", width = 0.7) +
  geom_text(aes(label = sprintf("%d\n(%.1f%%)", count, percentage)),
            position = position_stack(vjust = 0.5),
            vjust = -0.5, size = 3.5) +
  labs(
    title = "Distribution of Participants by Country of Residence",
    x = "Country",
    y = paste0("Number of Participants (N = ", total_participants, ")"),
    caption = wrapped_note
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 12),
    axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 10),
    # Style the caption to ensure it's readable
    plot.caption = element_text(
      size = 9, 
      hjust = 0, 
      vjust = 1,
      lineheight = 1.2,
      margin = margin(t = 15)
    ),
    # Add extra bottom margin for the caption
    plot.margin = margin(b = 100, t = 20, r = 20, l = 20)
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2)))

# Display the plot
print(country_plot)

6.2 Results and Discussion

Key observations:

• 39 counties in total

• The United States has the highest representation with 610 participants (39.2%)

• The United Kingdom follows with 358 participants (23%)

• Australia is third with 326 participants (20.9%)

• These top three countries account for approximately 83% of all participants

• The remaining countries (Canada, Italy, and New Zealand) collectively represent about 17% of participants

6.3 Comparison Stats

Code

# Read the data
data_combined <- read_excel("../Data/R_Import_Transformed_15.02.25.xlsx", sheet = "Combined")

# Clean country names and create RMT factor
data_combined <- data_combined %>%
  mutate(
    countryLive = case_when(
      countryLive == "United States of America (USA)" ~ "USA",
      countryLive == "United Kingdom (UK)" ~ "UK",
      TRUE ~ countryLive
    ),
    RMTMethods_YN = factor(RMTMethods_YN, 
                          levels = c(0, 1),
                          labels = c("No RMT", "RMT"))
  )

# Get top 6 countries
top_6_countries <- data_combined %>%
  count(countryLive) %>%
  top_n(6, n) %>%
  pull(countryLive)

# Filter data for top 6 countries
data_for_test <- data_combined %>%
  filter(countryLive %in% top_6_countries, !is.na(RMTMethods_YN))

# Calculate group totals for each RMT group
rmt_group_totals <- data_for_test %>%
  group_by(RMTMethods_YN) %>%
  summarise(group_N = n())

# Calculate statistics with percentages WITHIN each RMT group (not within country)
country_rmt_stats <- data_for_test %>%
  group_by(RMTMethods_YN, countryLive) %>%
  summarise(count = n(), .groups = 'drop') %>%
  left_join(rmt_group_totals, by = "RMTMethods_YN") %>%
  mutate(
    percentage = count / group_N * 100,
    label = paste0(count, "\n(", sprintf("%.1f", percentage), "%)")
  ) %>%
  # Calculate total per country (for ordering in plot)
  group_by(countryLive) %>%
  mutate(total_country = sum(count)) %>%
  ungroup()

# Create contingency table for statistical test
contingency_table <- table(
  data_for_test$countryLive,
  data_for_test$RMTMethods_YN
)

# Prepare legend labels with group total N included
legend_labels <- setNames(
  paste0(levels(data_for_test$RMTMethods_YN), " (N = ", rmt_group_totals$group_N, ")"),
  levels(data_for_test$RMTMethods_YN)
)

# Get expected frequencies without running a test yet
n <- sum(contingency_table)
row_sums <- rowSums(contingency_table)
col_sums <- colSums(contingency_table)
expected_counts <- outer(row_sums, col_sums) / n

# Use Fisher's exact test to avoid chi-square approximation warnings
fisher_test <- fisher.test(contingency_table, simulate.p.value = TRUE, B = 10000)
test_name <- "Fisher's exact test"

# Print test results
cat("\n", test_name, "Results:\n", sep="")


Fisher's exact testResults:

Code

print(fisher_test)


    Fisher's Exact Test for Count Data with simulated p-value (based on
    10000 replicates)

data:  contingency_table
p-value = 9.999e-05
alternative hypothesis: two.sided

Code

# Print expected frequencies
cat("\nExpected frequencies:\n")


Expected frequencies:

Code

print(round(expected_counts, 2))

            No RMT   RMT
Australia   279.91 46.09
Canada       78.13 12.87
Italy        40.35  6.65
New Zealand  27.48  4.52
UK          307.38 50.62
USA         523.75 86.25

Code

# Create grouped bar plot with percentages within RMT groups
plot <- ggplot(country_rmt_stats, 
               aes(x = reorder(countryLive, -total_country), 
                   y = count, 
                   fill = RMTMethods_YN)) +
  geom_bar(stat = "identity", 
           position = "dodge",
           color = "black") +
  geom_text(aes(label = label),
            position = position_dodge(width = 0.9),
            vjust = -0.5,
            size = 3.5) +
  scale_fill_manual(values = c("lightblue", "steelblue"),
                   labels = legend_labels) +
  labs(title = "RMT Usage by Country (Top 6)",
       subtitle = paste0(test_name, ": p ", 
                         ifelse(fisher_test$p.value < .001, 
                                "< .001", 
                                paste0("= ", sprintf("%.3f", fisher_test$p.value)))),
       x = "Country",
       y = "Count of Participants",
       fill = "RMT Usage",
       caption = "Note: Percentages are calculated within each RMT group, not within countries") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 12),
    axis.text.x = element_text(size = 12, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 12),
    legend.position = "top",
    plot.caption = element_text(hjust = 0, size = 10)
  ) +
  scale_y_continuous(
    expand = expansion(mult = c(0, 0.2))
  )

# Display the plot
print(plot)

Code

# Calculate proportions of RMT users in each country
country_proportions <- data_for_test %>%
  group_by(countryLive) %>%
  summarise(
    total = n(),
    rmt_users = sum(RMTMethods_YN == "RMT"),
    rmt_proportion = rmt_users/total,
    rmt_percentage = rmt_proportion * 100
  ) %>%
  arrange(desc(rmt_proportion))

cat("\nRMT Usage Proportions by Country:\n")


RMT Usage Proportions by Country:

Code

print(country_proportions)

# A tibble: 6 × 5
  countryLive total rmt_users rmt_proportion rmt_percentage
  <chr>       <int>     <int>          <dbl>          <dbl>
1 Australia     326        63         0.193           19.3 
2 USA           610       113         0.185           18.5 
3 Italy          47         8         0.170           17.0 
4 Canada         91         8         0.0879           8.79
5 UK            358        14         0.0391           3.91
6 New Zealand    32         1         0.0312           3.12

Code

# Pairwise proportion tests with Bonferroni correction
countries <- unique(country_proportions$countryLive)
n_countries <- length(countries)
pairwise_tests <- data.frame()

for(i in 1:(n_countries-1)) {
  for(j in (i+1):n_countries) {
    country1 <- countries[i]
    country2 <- countries[j]
    
    # Get data for both countries
    data1 <- data_for_test %>% filter(countryLive == country1)
    data2 <- data_for_test %>% filter(countryLive == country2)
    
    # Get counts for proportion test
    x1 <- sum(data1$RMTMethods_YN == "RMT")
    x2 <- sum(data2$RMTMethods_YN == "RMT")
    n1 <- nrow(data1)
    n2 <- nrow(data2)
    
    # Skip if zero denominators
    if (n1 == 0 || n2 == 0) {
      next
    }
    
    # Create 2x2 table for test
    test_table <- matrix(c(x1, n1-x1, x2, n2-x2), nrow=2)
    
    # Use Fisher's exact test for all pairwise comparisons
    test <- fisher.test(test_table)
    
    # Store results
    pairwise_tests <- rbind(pairwise_tests, data.frame(
      country1 = country1,
      country2 = country2,
      prop1 = x1/n1,
      prop2 = x2/n2,
      diff = abs(x1/n1 - x2/n2),
      p_value = test$p.value,
      stringsAsFactors = FALSE
    ))
  }
}

# Apply Bonferroni correction
if (nrow(pairwise_tests) > 0) {
  pairwise_tests$p_adjusted <- p.adjust(pairwise_tests$p_value, method = "bonferroni")
  
  cat("\nPairwise Comparisons (Bonferroni-adjusted p-values):\n")
  print(pairwise_tests %>% 
          arrange(p_adjusted) %>%
          mutate(
            prop1 = sprintf("%.1f%%", prop1 * 100),
            prop2 = sprintf("%.1f%%", prop2 * 100),
            diff = sprintf("%.1f%%", diff * 100),
            p_value = sprintf("%.4f", p_value),
            p_adjusted = sprintf("%.4f", p_adjusted)
          ) %>%
          select(country1, prop1, country2, prop2, diff, p_value, p_adjusted))
} else {
  cat("\nNo valid pairwise comparisons could be performed.\n")
}


Pairwise Comparisons (Bonferroni-adjusted p-values):
    country1 prop1    country2 prop2  diff p_value p_adjusted
1        USA 18.5%          UK  3.9% 14.6%  0.0000     0.0000
2  Australia 19.3%          UK  3.9% 15.4%  0.0000     0.0000
3      Italy 17.0%          UK  3.9% 13.1%  0.0017     0.0249
4  Australia 19.3%      Canada  8.8% 10.5%  0.0178     0.2664
5        USA 18.5%      Canada  8.8%  9.7%  0.0247     0.3708
6  Australia 19.3% New Zealand  3.1% 16.2%  0.0262     0.3934
7        USA 18.5% New Zealand  3.1% 15.4%  0.0292     0.4383
8  Australia 19.3%         USA 18.5%  0.8%  0.7924     1.0000
9  Australia 19.3%       Italy 17.0%  2.3%  0.8433     1.0000
10       USA 18.5%       Italy 17.0%  1.5%  1.0000     1.0000
11     Italy 17.0%      Canada  8.8%  8.2%  0.1689     1.0000
12     Italy 17.0% New Zealand  3.1% 13.9%  0.0757     1.0000
13    Canada  8.8%          UK  3.9%  4.9%  0.0970     1.0000
14    Canada  8.8% New Zealand  3.1%  5.7%  0.4437     1.0000
15        UK  3.9% New Zealand  3.1%  0.8%  1.0000     1.0000

6.4 Results and Discussion

Geographical Patterns in Respiratory Muscle Training Adoption Among Wind Musicians

Overview of Findings

The analysis examines how Respiratory Muscle Training (RMT) device usage varies across different countries among wind instrument musicians. By focusing on the top six countries represented in the sample, this analysis reveals distinctive geographical patterns in RMT adoption that likely reflect broader cultural, educational, and practice-related differences.

Key Statistical Findings

The Fisher’s exact test confirms a statistically significant association between country of residence and RMT usage (p < 0.05). This indicates that the observed geographical variations in RMT adoption are not due to random chance but represent genuine differences across countries. Within each RMT group, the distribution of countries differs notably:

Country-Specific RMT Usage Patterns

The data shows notable differences in RMT adoption rates across countries:

High Adoption Countries:
- Australia: 19.3% (63 out of 326 participants)
- USA: 18.5% (113 out of 610 participants)
- Italy: 17.0% (8 out of 47 participants)
Low Adoption Countries:
- Canada: 8.8% (8 out of 91 participants)
- UK: 3.9% (14 out of 358 participants)
- New Zealand: 3.1% (1 out of 32 participants)

This reveals a striking pattern where participants from the UK and New Zealand are approximately 5-6 times less likely to use RMT compared to those from Australia, USA, or Italy.

Pairwise Comparisons

The Bonferroni-adjusted pairwise comparisons identify which specific country differences are statistically significant:

Significant Differences (p < 0.05 after adjustment):
- USA vs. UK: 14.6 percentage point difference (p = 0.0000)
- Australia vs. UK: 15.4 percentage point difference (p = 0.0000)
- Italy vs. UK: 13.1 percentage point difference (p = 0.0249)
Non-Significant Differences (p > 0.05 after adjustment):
- All other country comparisons, including those between high-adoption countries (Australia, USA, Italy) and between low-adoption countries (Canada, UK, New Zealand)

The Bonferroni correction, which adjusts for multiple comparisons, confirms that the UK stands out as having significantly lower RMT usage compared to Australia, USA, and Italy.

Implications and Potential Explanations

Several factors might explain these international differences in RMT adoption:

Educational and Training Approaches: Different educational systems and training methodologies may emphasize technology integration to varying degrees.
Cultural Attitudes: Cultural differences in approaches to teaching, learning, and technology adoption may influence RMT usage.
Technological Infrastructure: Differences in access to technology, internet reliability, and technological support systems might affect adoption rates.
Market Penetration: Technology companies might have targeted certain markets more aggressively or entered them earlier.
Professional Norms: Music education and performance practices may have different established norms regarding technology usage across these countries.

The Anglo countries show an interesting split, with the USA and Australia having high adoption rates while the UK and New Zealand have notably low rates. This suggests that shared language and similar cultural backgrounds do not necessarily lead to similar technology adoption patterns.

Limitations

While these findings are statistically significant, some considerations are worth noting:

Sample Size Variations: The number of participants varies considerably across countries (from 32 in New Zealand to 610 in the USA), which affects the precision of the estimates.
Limited Coverage for Italy and New Zealand: These countries have relatively small sample sizes (47 and 32 respectively), making their estimates less reliable.
Potential Confounding Variables: Factors like participant age, professional status, or educational background might vary across countries and influence the results.

Conclusion

The analysis clearly demonstrates that RMT usage varies significantly by country of residence, with particularly strong differences between high-adoption countries (Australia, USA, Italy) and the UK. These findings suggest that geographical and potentially cultural factors play an important role in technology adoption within music education and performance settings. Further research could explore the specific mechanisms behind these differences and their implications for educational policy and technology development.

7 Country of Education

7.1 Descriptive Stats

Code

# Read the data
data_combined <- read_excel("../Data/R_Import_Transformed_15.02.25.xlsx", sheet = "Combined")

# Calculate total N
total_N <- nrow(data_combined)

# Clean country names
data_combined <- data_combined %>%
  mutate(
    countryEd = case_when(
      countryEd == "United States of America (USA)" ~ "USA",
      countryEd == "United Kingdom (UK)" ~ "UK",
      TRUE ~ as.character(countryEd)
    )
  )

# Identify the top 6 countries from countryEd
top_6_countryEd <- data_combined %>%
  count(countryEd, sort = TRUE) %>%
  top_n(6, n) %>%
  pull(countryEd)

# Filter data for these top 6 countries
data_top6_edu <- data_combined %>%
  filter(countryEd %in% top_6_countryEd)

# Calculate statistics for plotting and analysis
edu_stats <- data_top6_edu %>%
  count(countryEd) %>%
  arrange(desc(n)) %>%
  mutate(
    percentage = n / sum(n) * 100,
    label = paste0(n, "\n(", sprintf("%.1f", percentage), "%)")
  )

# Chi-square test for equal proportions
chi_test <- chisq.test(edu_stats$n)

# Create contingency table for post-hoc analysis
countries <- sort(unique(data_top6_edu$countryEd))
n_countries <- length(countries)
pairwise_tests <- data.frame()

# Perform pairwise proportion tests
for(i in 1:(n_countries-1)) {
  for(j in (i+1):n_countries) {
    country1 <- countries[i]
    country2 <- countries[j]
    
    count1 <- edu_stats$n[edu_stats$countryEd == country1]
    count2 <- edu_stats$n[edu_stats$countryEd == country2]
    
    # Perform proportion test
    test <- prop.test(
      x = c(count1, count2),
      n = c(sum(edu_stats$n), sum(edu_stats$n))
    )
    
    pairwise_tests <- rbind(pairwise_tests, data.frame(
      country1 = country1,
      country2 = country2,
      p_value = test$p.value,
      stringsAsFactors = FALSE
    ))
  }
}

# Apply Bonferroni correction
pairwise_tests$p_adjusted <- p.adjust(pairwise_tests$p_value, method = "bonferroni")

# Create the plot
edu_plot <- ggplot(edu_stats, aes(x = reorder(countryEd, -n), y = n)) +
  geom_bar(stat = "identity", fill = "steelblue", color = "black") +
  geom_text(aes(label = label), vjust = -0.5, size = 4) +
  labs(title = "Top 6 Countries of Education",
       subtitle = paste0("χ²(", chi_test$parameter, ") = ", 
                         sprintf("%.2f", chi_test$statistic),
                         ", p ", 
                         ifelse(chi_test$p.value < .001, 
                                "< .001", 
                                paste0("= ", sprintf("%.3f", chi_test$p.value)))),
       x = "Country of Education",
       y = paste0("Count of Participants (N = ", total_N, ")")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 12),
    axis.text.x = element_text(size = 12, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 12)
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2)))

# Print statistical results
print("Chi-square Test of Equal Proportions Results:")

[1] "Chi-square Test of Equal Proportions Results:"

Code

print(chi_test)


    Chi-squared test for given probabilities

data:  edu_stats$n
X-squared = 1111.3, df = 5, p-value < 2.2e-16

Code

print("\nDescriptive Statistics:")

[1] "\nDescriptive Statistics:"

Code

print(edu_stats)

# A tibble: 6 × 4
  countryEd       n percentage label         
  <chr>       <int>      <dbl> <chr>         
1 USA           620      42.2  "620\n(42.2%)"
2 UK            364      24.8  "364\n(24.8%)"
3 Australia     321      21.9  "321\n(21.9%)"
4 Canada         92       6.27 "92\n(6.3%)"  
5 Italy          44       3.00 "44\n(3.0%)"  
6 New Zealand    27       1.84 "27\n(1.8%)"

Code

print("\nPairwise Comparisons (Bonferroni-adjusted p-values):")

[1] "\nPairwise Comparisons (Bonferroni-adjusted p-values):"

Code

print(pairwise_tests %>% 
        arrange(p_adjusted) %>%
        mutate(
          p_value = sprintf("%.4f", p_value),
          p_adjusted = sprintf("%.4f", p_adjusted)
        ))

      country1    country2 p_value p_adjusted
1  New Zealand         USA  0.0000     0.0000
2        Italy         USA  0.0000     0.0000
3       Canada         USA  0.0000     0.0000
4  New Zealand          UK  0.0000     0.0000
5        Italy          UK  0.0000     0.0000
6    Australia New Zealand  0.0000     0.0000
7    Australia       Italy  0.0000     0.0000
8       Canada          UK  0.0000     0.0000
9    Australia      Canada  0.0000     0.0000
10   Australia         USA  0.0000     0.0000
11          UK         USA  0.0000     0.0000
12      Canada New Zealand  0.0000     0.0000
13      Canada       Italy  0.0000     0.0006
14       Italy New Zealand  0.0546     0.8186
15   Australia          UK  0.0668     1.0000

Code

# Display the plot
print(edu_plot)

7.2 Results and Discussion

Distribution of Participants

United States of America (USA): Largest share with 620 participants (39.8%)
United Kingdom (UK): Second largest with 364 participants (23.4%)
Australia: Third with 321 participants (20.6%)
Canada: 92 participants (5.9%)
Italy: 44 participants (2.8%)
New Zealand: 27 participants (1.7%)

Key Observations:

The top three countries (USA, UK, and Australia) account for 83.8% of all participants’ educational background
English-speaking countries dominate the top positions, representing 91.4% of the top 6 response
There’s a significant drop between the top 3 and the remaining countries
The distribution pattern is very similar to the current country of residence data, suggesting most participants stayed in their country of education

Comparison to Current Residence:

USA: Education (620) vs Residence (610) UK: Education (364) vs Residence (358) Australia: Education (321) vs Residence (326) Canada: Education (92) vs Residence (91) Italy: Education (44) vs Residence (47) New Zealand: Education (27) vs Residence (32)

This shows relatively small differences between where participants were educated and where they currently live, indicating low international mobility after education.

7.3 Comparison Stats

Still getting a Chi-Squared Warning, but added Fisher’s

Code

library(readxl)
library(dplyr)
library(ggplot2)

# Robust Data Preparation Function
prepare_rmt_data <- function(file_path, sheet = "Combined") {
  tryCatch({
    # Read data with standardized cleaning
    data_combined <- read_excel(file_path, sheet = sheet)
    
    data_cleaned <- data_combined %>%
      mutate(
        # Comprehensive country name standardization
        countryEd = case_when(
          grepl("United States|USA", countryEd, ignore.case = TRUE) ~ "USA",
          grepl("United Kingdom|UK", countryEd, ignore.case = TRUE) ~ "UK",
          TRUE ~ as.character(countryEd)
        ),
        # Robust RMT factor conversion
        RMTMethods_YN = factor(
          `RMTMethods_YN`, 
          levels = c(0, 1), 
          labels = c("No RMT", "RMT")
        )
      )
    
    return(data_cleaned)
  }, error = function(e) {
    stop(paste("Error in data preparation:", e$message))
  })
}

# Advanced Statistical Analysis Function
perform_comprehensive_analysis <- function(data) {
  # Identify Top 6 Countries
  top_6_countryEd <- data %>%
    count(countryEd, sort = TRUE) %>%
    top_n(6, n) %>%
    pull(countryEd)
  
  # Filter data to top 6 countries
  data_top6_edu <- data %>%
    filter(countryEd %in% top_6_countryEd)
  
  # Create contingency table
  contingency_table <- table(data_top6_edu$countryEd, data_top6_edu$RMTMethods_YN)
  
  # Comprehensive test selection and reporting
  analyze_test_assumptions <- function(cont_table) {
    # Calculate expected frequencies
    chi_results <- suppressWarnings(chisq.test(cont_table))
    expected_freq <- chi_results$expected
    
    # Detailed frequency checks
    total_cells <- length(expected_freq)
    low_freq_cells <- sum(expected_freq < 5)
    min_expected_freq <- min(expected_freq)
    
    # Verbose reporting of frequency conditions
    cat("Expected Frequency Analysis:\n")
    cat("Minimum Expected Frequency:", round(min_expected_freq, 2), "\n")
    cat("Cells with Expected Frequency < 5:", low_freq_cells, 
        "out of", total_cells, "cells (", 
        round(low_freq_cells / total_cells * 100, 2), "%)\n\n")
    
    # Determine most appropriate test
    if (min_expected_freq < 1 || (low_freq_cells / total_cells) > 0.2) {
      # Use Fisher's exact test with Monte Carlo simulation
      exact_test <- fisher.test(cont_table, simulate.p.value = TRUE, B = 10000)
      
      return(list(
        test_type = "Fisher's Exact Test (Monte Carlo)",
        p_value = exact_test$p.value,
        statistic = NA,
        method = "Fisher's Exact Test with Monte Carlo Simulation"
      ))
    } else {
      # Use chi-square test with Yates' continuity correction
      adjusted_chi_test <- chisq.test(cont_table, correct = TRUE)
      
      return(list(
        test_type = "Chi-Square with Continuity Correction",
        p_value = adjusted_chi_test$p.value,
        statistic = adjusted_chi_test$statistic,
        parameter = adjusted_chi_test$parameter,
        method = paste("Pearson's Chi-squared test with Yates' continuity correction,",
                       "df =", adjusted_chi_test$parameter)
      ))
    }
  }
  
  # Perform test
  test_results <- analyze_test_assumptions(contingency_table)
  
  # Pairwise comparisons with Fisher's exact test
  pairwise_comparisons <- function(cont_table) {
    countries <- rownames(cont_table)
    n_countries <- length(countries)
    results <- data.frame(
      comparison = character(),
      p_value = numeric(),
      adj_p_value = numeric(),
      stringsAsFactors = FALSE
    )
    
    for(i in 1:(n_countries-1)) {
      for(j in (i+1):n_countries) {
        # Use Fisher's exact test for all pairwise comparisons
        test <- fisher.test(cont_table[c(i,j),])
        results <- rbind(results, data.frame(
          comparison = paste(countries[i], "vs", countries[j]),
          p_value = test$p.value,
          adj_p_value = NA
        ))
      }
    }
    
    # Bonferroni correction
    results$adj_p_value <- p.adjust(results$p_value, method = "bonferroni")
    return(results)
  }
  
  # Compute pairwise comparisons
  pairwise_results <- pairwise_comparisons(contingency_table)
  
  # Return comprehensive results
  list(
    test_results = test_results,
    pairwise_results = pairwise_results,
    data_top6_edu = data_top6_edu,
    contingency_table = contingency_table
  )
}

# Visualization Function
create_rmt_plot <- function(analysis_results) {
  # Prepare plot data
  plot_data <- analysis_results$data_top6_edu %>%
    group_by(countryEd, RMTMethods_YN) %>%
    summarise(count = n(), .groups = 'drop') %>%
    group_by(countryEd) %>%
    mutate(
      total_country = sum(count),
      percentage = count / total_country * 100,
      label = paste0(count, "\n(", sprintf("%.1f", percentage), "%)")
    ) %>%
    ungroup()
  
  # Compute totals for legend
  legend_totals <- analysis_results$data_top6_edu %>%
    group_by(RMTMethods_YN) %>%
    summarise(total = n(), .groups = 'drop')
  
  # Create legend labels
  legend_labels <- setNames(
    paste0(legend_totals$RMTMethods_YN, " (N = ", legend_totals$total, ")"),
    legend_totals$RMTMethods_YN
  )
  
  # Prepare subtitle based on test type
  test_results <- analysis_results$test_results
  subtitle_text <- if (test_results$test_type == "Chi-Square with Continuity Correction") {
    paste0("Chi-square test: ", 
           sprintf("χ²(%d) = %.2f", test_results$parameter, test_results$statistic),
           ", p ", ifelse(test_results$p_value < 0.001, "< .001", 
                          paste("=", sprintf("%.3f", test_results$p_value))))
  } else {
    paste0("Fisher's Exact Test (Monte Carlo): p ", 
           ifelse(test_results$p_value < 0.001, "< .001", 
                  paste("=", sprintf("%.3f", test_results$p_value))))
  }
  
  # Create the plot
  ggplot(plot_data, aes(x = reorder(countryEd, -total_country), y = count, fill = RMTMethods_YN)) +
    geom_bar(stat = "identity", position = position_dodge(width = 0.9), color = "black") +
    geom_text(aes(label = label), 
              position = position_dodge(width = 0.9), 
              vjust = -0.5, size = 3.5) +
    labs(
      title = "Country of Education by RMT Usage (Top 6)",
      subtitle = subtitle_text,
      x = "Country of Education",
      y = paste0("Count of Participants (N = ", sum(plot_data$count), ")"),
      fill = "RMT Usage",
      caption = "Note: Percentages are out of the total N for each group"
    ) +
    scale_fill_discrete(labels = legend_labels) +
    theme_minimal() +
    theme(
      plot.title = element_text(size = 16, face = "bold"),
      plot.subtitle = element_text(size = 12),
      axis.text.x = element_text(size = 12, angle = 45, hjust = 1),
      axis.text.y = element_text(size = 12),
      plot.caption = element_text(hjust = 0, size = 10)
    ) +
    scale_y_continuous(expand = expansion(mult = c(0, 0.2)))
}

# Main Execution Function
run_rmt_analysis <- function(file_path = "../Data/R_Import_Transformed_15.02.25.xlsx") {
  # Prepare data
  prepared_data <- prepare_rmt_data(file_path)
  
  # Perform comprehensive analysis
  analysis_results <- perform_comprehensive_analysis(prepared_data)
  
  # Create visualization
  rmt_plot <- create_rmt_plot(analysis_results)
  
  # Print results to console
  cat("Statistical Test Details:\n")
  cat("Test Type:", analysis_results$test_results$test_type, "\n")
  cat("P-value:", analysis_results$test_results$p_value, "\n\n")
  
  cat("Contingency Table:\n")
  print(analysis_results$contingency_table)
  
  cat("\nPost-hoc Pairwise Comparisons (Bonferroni-corrected):\n")
  print(analysis_results$pairwise_results)
  
  # Display the plot
  print(rmt_plot)
  
  # Return results for potential further analysis
  return(analysis_results)
}

# Run the analysis
results <- run_rmt_analysis()

Expected Frequency Analysis:
Minimum Expected Frequency: 3.79 
Cells with Expected Frequency < 5: 1 out of 12 cells ( 8.33 %)

Statistical Test Details:
Test Type: Chi-Square with Continuity Correction 
P-value: 1.055118e-10 

Contingency Table:
             
              No RMT RMT
  Australia      256  65
  Canada          84   8
  Italy           39   5
  New Zealand     26   1
  UK             350  14
  USA            507 113

Post-hoc Pairwise Comparisons (Bonferroni-corrected):
                 comparison      p_value  adj_p_value
1       Australia vs Canada 8.592223e-03 1.288833e-01
2        Australia vs Italy 2.196714e-01 1.000000e+00
3  Australia vs New Zealand 3.842474e-02 5.763710e-01
4           Australia vs UK 7.873873e-12 1.181081e-10
5          Australia vs USA 4.826511e-01 1.000000e+00
6           Canada vs Italy 7.562204e-01 1.000000e+00
7     Canada vs New Zealand 6.820881e-01 1.000000e+00
8              Canada vs UK 6.030667e-02 9.046001e-01
9             Canada vs USA 2.481173e-02 3.721759e-01
10     Italy vs New Zealand 3.968168e-01 1.000000e+00
11              Italy vs UK 4.256371e-02 6.384556e-01
12             Italy vs USA 3.106074e-01 1.000000e+00
13        New Zealand vs UK 1.000000e+00 1.000000e+00
14       New Zealand vs USA 6.684451e-02 1.000000e+00
15                UK vs USA 3.421609e-12 5.132413e-11

Code

# Only top 4 countries shown - Trying to increase the robustness of the analyses.
library(readxl)
library(dplyr)
library(ggplot2)
library(stringr)

# Robust Data Preparation Function
prepare_rmt_data <- function(file_path, sheet = "Combined") {
  tryCatch({
    # Read data with standardized cleaning
    data_combined <- read_excel(file_path, sheet = sheet)
    
    data_cleaned <- data_combined %>%
      mutate(
        # Comprehensive country name standardization
        countryEd = case_when(
          grepl("United States|USA", countryEd, ignore.case = TRUE) ~ "USA",
          grepl("United Kingdom|UK", countryEd, ignore.case = TRUE) ~ "UK",
          TRUE ~ as.character(countryEd)
        ),
        # Robust RMT factor conversion
        RMTMethods_YN = factor(
          `RMTMethods_YN`, 
          levels = c(0, 1), 
          labels = c("No RMT", "RMT")
        )
      )
    
    return(data_cleaned)
  }, error = function(e) {
    stop(paste("Error in data preparation:", e$message))
  })
}

# Advanced Statistical Analysis Function
perform_comprehensive_analysis <- function(data) {
  # Get all countries and their frequencies
  all_countries <- data %>%
    filter(!is.na(countryEd)) %>%
    count(countryEd, sort = TRUE) %>%
    mutate(percentage = n / sum(n) * 100)
  
  # Identify Top 4 Countries
  top_4_countryEd <- all_countries %>%
    top_n(4, n) %>%
    pull(countryEd)
  
  # Get excluded countries for the note
  excluded_countries <- all_countries %>%
    filter(!(countryEd %in% top_4_countryEd)) %>%
    mutate(country_info = countryEd) %>%
    pull(country_info)
  
  # Create excluded countries list as a string
  excluded_countries_text <- paste(excluded_countries, collapse = ", ")
  
  # Filter data to top 4 countries
  data_top4_edu <- data %>%
    filter(countryEd %in% top_4_countryEd)
  
  # Create contingency table
  contingency_table <- table(data_top4_edu$countryEd, data_top4_edu$RMTMethods_YN)
  
  # Comprehensive test selection and reporting
  analyze_test_assumptions <- function(cont_table) {
    # Calculate expected frequencies
    chi_results <- suppressWarnings(chisq.test(cont_table))
    expected_freq <- chi_results$expected
    
    # Detailed frequency checks
    total_cells <- length(expected_freq)
    low_freq_cells <- sum(expected_freq < 5)
    min_expected_freq <- min(expected_freq)
    
    # Verbose reporting of frequency conditions
    cat("Expected Frequency Analysis:\n")
    cat("Minimum Expected Frequency:", round(min_expected_freq, 2), "\n")
    cat("Cells with Expected Frequency < 5:", low_freq_cells, 
        "out of", total_cells, "cells (", 
        round(low_freq_cells / total_cells * 100, 2), "%)\n\n")
    
    # Determine most appropriate test
    if (min_expected_freq < 1 || (low_freq_cells / total_cells) > 0.2) {
      # Use Fisher's exact test with Monte Carlo simulation
      exact_test <- fisher.test(cont_table, simulate.p.value = TRUE, B = 10000)
      
      return(list(
        test_type = "Fisher's Exact Test (Monte Carlo)",
        p_value = exact_test$p.value,
        statistic = NA,
        method = "Fisher's Exact Test with Monte Carlo Simulation"
      ))
    } else {
      # Use chi-square test with Yates' continuity correction
      adjusted_chi_test <- chisq.test(cont_table, correct = TRUE)
      
      return(list(
        test_type = "Chi-Square with Continuity Correction",
        p_value = adjusted_chi_test$p.value,
        statistic = adjusted_chi_test$statistic,
        parameter = adjusted_chi_test$parameter,
        method = paste("Pearson's Chi-squared test with Yates' continuity correction,",
                       "df =", adjusted_chi_test$parameter)
      ))
    }
  }
  
  # Perform test
  test_results <- analyze_test_assumptions(contingency_table)
  
  # Pairwise comparisons with Chi-square tests (if appropriate)
  pairwise_comparisons <- function(cont_table) {
    countries <- rownames(cont_table)
    n_countries <- length(countries)
    results <- data.frame(
      comparison = character(),
      p_value = numeric(),
      adj_p_value = numeric(),
      stringsAsFactors = FALSE
    )
    
    for(i in 1:(n_countries-1)) {
      for(j in (i+1):n_countries) {
        # Get the 2x2 subtable
        subtable <- cont_table[c(i,j),]
        
        # Check assumptions for chi-square
        expected <- chisq.test(subtable)$expected
        min_expected <- min(expected)
        
        # Use appropriate test based on expected frequencies
        if (min_expected < 5) {
          test <- fisher.test(subtable)
        } else {
          test <- chisq.test(subtable, correct = TRUE)
        }
        
        results <- rbind(results, data.frame(
          comparison = paste(countries[i], "vs", countries[j]),
          p_value = test$p.value,
          adj_p_value = NA
        ))
      }
    }
    
    # Bonferroni correction
    results$adj_p_value <- p.adjust(results$p_value, method = "bonferroni")
    return(results)
  }
  
  # Compute pairwise comparisons
  pairwise_results <- pairwise_comparisons(contingency_table)
  
  # Return comprehensive results
  list(
    test_results = test_results,
    pairwise_results = pairwise_results,
    data_top4_edu = data_top4_edu,
    contingency_table = contingency_table,
    excluded_countries_text = excluded_countries_text,
    all_countries = all_countries
  )
}

# Visualization Function
create_rmt_plot <- function(analysis_results) {
  # Prepare plot data
  plot_data <- analysis_results$data_top4_edu %>%
    group_by(countryEd, RMTMethods_YN) %>%
    summarise(count = n(), .groups = 'drop') %>%
    group_by(countryEd) %>%
    mutate(
      total_country = sum(count),
      percentage = count / total_country * 100,
      label = paste0(count, "\n(", sprintf("%.1f", percentage), "%)")
    ) %>%
    ungroup()
  
  # Compute totals for legend
  legend_totals <- analysis_results$data_top4_edu %>%
    group_by(RMTMethods_YN) %>%
    summarise(total = n(), .groups = 'drop')
  
  # Create legend labels
  legend_labels <- setNames(
    paste0(legend_totals$RMTMethods_YN, " (N = ", legend_totals$total, ")"),
    legend_totals$RMTMethods_YN
  )
  
  # Prepare subtitle based on test type
  test_results <- analysis_results$test_results
  subtitle_text <- if (test_results$test_type == "Chi-Square with Continuity Correction") {
    paste0("Chi-square test: ", 
           sprintf("χ²(%d) = %.2f", test_results$parameter, test_results$statistic),
           ", p ", ifelse(test_results$p_value < 0.001, "< .001", 
                          paste("=", sprintf("%.3f", test_results$p_value))))
  } else {
    paste0("Fisher's Exact Test (Monte Carlo): p ", 
           ifelse(test_results$p_value < 0.001, "< .001", 
                  paste("=", sprintf("%.3f", test_results$p_value))))
  }
  
  # Create the caption with excluded countries
  note_text <- paste0("Note. All countries with frequencies less than 5% were excluded from the Figure, including: ", 
                     analysis_results$excluded_countries_text)
  
  # Wrap the note text for better display
  wrapped_note <- str_wrap(note_text, width = 100)
  
  # Count how many lines are in the wrapped note for margin adjustment
  line_count <- str_count(wrapped_note, "\n") + 1
  
  # Create the plot
  ggplot(plot_data, aes(x = reorder(countryEd, -total_country), y = count, fill = RMTMethods_YN)) +
    geom_bar(stat = "identity", position = position_dodge(width = 0.9), color = "black") +
    geom_text(aes(label = label), 
              position = position_dodge(width = 0.9), 
              vjust = -0.5, size = 3.5) +
    labs(
      title = "Country of Education by RMT Usage (Top 4)",
      subtitle = subtitle_text,
      x = "Country of Education",
      y = paste0("Count of Participants (N = ", sum(plot_data$count), ")"),
      fill = "RMT Usage",
      caption = wrapped_note
    ) +
    scale_fill_discrete(labels = legend_labels) +
    theme_minimal() +
    theme(
      plot.title = element_text(size = 16, face = "bold"),
      plot.subtitle = element_text(size = 12),
      axis.text.x = element_text(size = 12, angle = 45, hjust = 1),
      axis.text.y = element_text(size = 12),
      plot.caption = element_text(hjust = 0, size = 9, lineheight = 1.2),
      # Adjust bottom margin based on the number of lines in the note
      plot.margin = margin(b = 10 + (line_count * 12), t = 20, r = 20, l = 20)
    ) +
    scale_y_continuous(expand = expansion(mult = c(0, 0.2)))
}

# Main Execution Function
run_rmt_analysis <- function(file_path = "../Data/R_Import_Transformed_15.02.25.xlsx") {
  # Prepare data
  prepared_data <- prepare_rmt_data(file_path)
  
  # Perform comprehensive analysis
  analysis_results <- perform_comprehensive_analysis(prepared_data)
  
  # Create visualization
  rmt_plot <- create_rmt_plot(analysis_results)
  
  # Print results to console
  cat("Statistical Test Details:\n")
  cat("Test Type:", analysis_results$test_results$test_type, "\n")
  cat("P-value:", analysis_results$test_results$p_value, "\n\n")
  
  cat("Contingency Table:\n")
  print(analysis_results$contingency_table)
  
  cat("\nPost-hoc Pairwise Comparisons (Bonferroni-corrected):\n")
  print(analysis_results$pairwise_results)
  
  cat("\nExcluded Countries:\n")
  cat(analysis_results$excluded_countries_text, "\n")
  
  # Display the plot
  print(rmt_plot)
  
  # Return results for potential further analysis
  return(analysis_results)
}

# Run the analysis
results <- run_rmt_analysis()

Expected Frequency Analysis:
Minimum Expected Frequency: 13.17 
Cells with Expected Frequency < 5: 0 out of 8 cells ( 0 %)

Statistical Test Details:
Test Type: Chi-Square with Continuity Correction 
P-value: 3.247836e-11 

Contingency Table:
           
            No RMT RMT
  Australia    256  65
  Canada        84   8
  UK           350  14
  USA          507 113

Post-hoc Pairwise Comparisons (Bonferroni-corrected):
           comparison      p_value  adj_p_value
1 Australia vs Canada 1.612145e-02 9.672870e-02
2     Australia vs UK 4.485762e-11 2.691457e-10
3    Australia vs USA 5.069555e-01 1.000000e+00
4        Canada vs UK 6.030667e-02 3.618400e-01
5       Canada vs USA 3.380404e-02 2.028243e-01
6           UK vs USA 1.586250e-10 9.517502e-10

Excluded Countries:
Italy, New Zealand, Germany, South Africa, Hungary, Albania, Argentina, Afghanistan, Czechia, Austria, Sweden, Belarus, Netherlands, Angola, Antigua and Barbuda, Armenia, Azerbaijan, Bangladesh, Belgium, Switzerland, Andorra, Bahamas, Brazil, Colombia, Denmark, Finland, Greece, Ireland, Marshall Islands, Palestine, Solomon Islands, Turkey, Uzbekistan, Vietnam

7.4 Results and Discussion

Statistical Significance

Chi-square Test: The chi-square test produced a statistic of approximately 44.25 with 7 degrees of freedom and a p-value of about 1.92×10−71.92×10−7. This highly significant result indicates that the distribution of education levels differs between the two groups (those who use RMT methods versus those who do not), i.e., the association is statistically significant.

Effect Size (Cramer’s V): Cramer’s V was calculated to be around 0.169. Although not extremely large, this value suggests a modest association between education and the use of RMT methods. In social sciences, such effect sizes are common when dealing with categorical variables; even a modest effect size can be substantively meaningful.

Detailed Group Comparison

Group Proportions: The summary statistics break down the relative frequencies of each education category for the “No” and “Yes” groups. For each education level, both counts and percentages were calculated. In the plots, you can see the proportions clearly marked with labels (which include counts and percentages), making it easy to compare how education distributions differ between the two groups.

Proportion Differences: By pivoting the grouped data, we computed the differences in percentage points between the “Yes” and “No” groups for each education level. Education levels with the largest differences indicate where the distribution deviates most from what is expected under independence. These differences can help pinpoint which education categories are driving the overall chi-square statistic.

Standardized Residuals

Interpreting Residuals: The standardized residuals provide additional insight into which specific cells of the contingency table contribute most to the chi-square result. A positive residual in a cell suggests that the observed count is higher than what is expected under the null hypothesis, whereas a negative residual indicates under representation. For example, if the “Doctorate” category shows a strongly positive residual for one group, it means that group has a higher-than-expected frequency of individuals holding a doctorate. Conversely, a negative value in another cell would point to an under representation relative to expectation.

Identifying Key Differences: Reviewing these residuals allows us to identify which education levels are statistically noteworthy. For instance, if “Doctorate” or “Bachelors degree” have notable positive or negative deviations, these categories are influencing the overall association significantly.

Visualization of Findings

Two types of plots were generated: Side-by-Side Bar Plot: This plot displays the percentage breakdown of each education category for the “No” and “Yes” groups side by side. With detailed labels that include both the count and percentage, this visualization makes it straightforward to compare each education level between groups. The subtitle also reminds the viewer of the total sample size for each group, ensuring a complete context for interpretation.

Dot/Line Plot: As an alternative, the dot/line plot connects the percentages across education categories for both groups. This plot emphasizes trends and differences in the pattern of education distribution. The connecting lines help to visualize how the relative positions of categories shift between groups. Both visualizations have been adjusted to ensure that all labels are fully visible by setting appropriate margins and enlarging plot height if necessary.

Overall Interpretation

Significant Association: The significant chi-square result confirms that the pattern of education across groups is not random. There is a statistically significant association between having a particular education level and whether someone uses RMT methods.

Magnitude and Specific Contributions: The moderate effect size (Cramer’s V ≈ 0.169) implies that while the association exists, it is of modest strength. Importantly, the examination of standardized residuals and proportion differences shines a light on which specific education categories drive this association. These insights might be used to explore further why certain education levels are more, or less, likely to be associated with RMT method usage.

Practical Implications: Depending on the context of the study (e.g., educational policies, targeted interventions, or market segmentation), these findings could inform decision-making by pinpointing demographic groups for further investigation or tailored strategies. For instance, if a particular education category is underrepresented among RMT users, it might signal potential barriers or areas for targeted improvement.

Summary

The findings indicate that country of education is significantly associated with the use of RMT methods. The statistical tests, along with the calculated effect size and detailed graphical representations, provide a robust basis for interpreting the differences across education levels. The analysis not only confirms the presence of differences but also identifies which education categories are most influential in driving these differences, offering valuable insights for further study or practical applications.

Combined Groups

This analysis examines the relationship between country of education and Respiratory Muscle Training (RMT) usage among participants. The data covers four major English-speaking countries: Australia, Canada, the UK, and the USA, with 34 additional countries excluded due to smaller sample sizes.

Statistical Significance

The chi-square test with continuity correction yielded a highly significant result (p < 0.001), indicating a strong association between country of education and RMT usage. All expected cell frequencies were above the minimum threshold of 5 (minimum = 13.17), confirming that the chi-square test was appropriate for this analysis.

RMT Usage Patterns by Country

From the contingency table, we can observe distinct patterns in RMT adoption:

Australia: 65 out of 321 participants (20.2%) use RMT
USA: 113 out of 620 participants (18.2%) use RMT
Canada: 8 out of 92 participants (8.7%) use RMT
UK: 14 out of 364 participants (3.8%) use RMT

This reveals a striking pattern where participants educated in Australia and the USA have RMT adoption rates approximately 2-5 times higher than those educated in Canada and the UK.

Post-hoc Pairwise Comparisons

The Bonferroni-corrected pairwise comparisons identify which specific country differences are statistically significant:

Significant Differences (p < 0.05 after adjustment):

Australia vs. UK (p = 2.69 × 10^-10): Australian-educated participants are significantly more likely to use RMT than UK-educated participants
USA vs. UK (p = 9.52 × 10^-10): US-educated participants are significantly more likely to use RMT than UK-educated participants

Non-Significant Differences (p > 0.05 after adjustment):

Australia vs. Canada (p = 0.097): Despite the substantial percentage difference (20.2% vs 8.7%), this comparison falls just short of significance after Bonferroni correction
Australia vs. USA (p = 1.0): Very similar RMT usage rates between these countries
Canada vs. UK (p = 0.362): Both have lower RMT usage rates
Canada vs. USA (p = 0.203): Not statistically significant after correction

Implications and Potential Explanations

Several factors might explain these international differences in RMT adoption based on education location:

Educational Approaches: Australia and the USA may emphasize technology integration in music education more heavily than the UK and Canada. This could reflect differences in curriculum design, pedagogical approaches, or resource allocation within educational institutions.
Technological Infrastructure in Educational Settings: Educational institutions in Australia and the USA might have greater access to technology resources, better digital infrastructure, or more technology-oriented training programs.
Cultural Attitudes Toward Technology in Education: Different educational systems may have varying cultural perspectives on the role of technology in learning and performance. These cultural attitudes could shape how students are taught to view technological aids.
Educational Investment in Technology: Countries may differ in their funding priorities for music education, with some investing more heavily in technological resources and training.
Different Teaching Traditions: Musical education may follow different teaching traditions and philosophies across these countries, with some more receptive to technological innovation than others.

Broader Context

The pattern observed here—with Australia and the USA having higher RMT adoption rates than the UK and Canada—suggests a potential divide in approaches to music education and technology integration across these major English-speaking countries. This divide persists despite shared language and many cultural similarities, highlighting how educational practices can diverge significantly across national boundaries.

Limitations and Considerations

Sample Size Variations: The number of participants varies considerably across countries (from 92 in Canada to 620 in the USA), affecting the precision of estimates.
Excluded Countries: 34 countries were excluded from the analysis due to small sample sizes. This extensive list includes several European countries (Germany, Italy, Netherlands, etc.) and countries from various regions worldwide, limiting the global representativeness of the findings.
Historical Context: The data doesn’t indicate when participants were educated in these countries, so the findings may reflect historical differences in educational approaches across different time periods.
Education vs. Residence: This analysis focuses on country of education rather than current residence, which may provide insights into formative influences rather than current environmental factors affecting RMT usage.

Conclusion

The analysis provides strong evidence that country of education is significantly associated with RMT usage, with particularly pronounced differences between Australia/USA and the UK. These findings suggest that educational systems and approaches play an important role in shaping technology adoption practices among musicians. Understanding these patterns could help inform international educational policy, curriculum development, and technology integration strategies in music education.