1 Question 1: Simulation Study — Robustness of the One-Sample T Test

We evaluate the Type I error rate and coverage probability of the one-sample \(T\) test under both normal and non-normal distributions, across sample sizes \(n \in \{10, 50, 200\}\) with \(S = 1000\) replications.

1.1 Simulation Function

set.seed(123)

S <- 1000

simulate_t_test <- function(n, S = 1000, dist = "normal", mu0 = 0) {
  
  reject <- numeric(S)   # indicator: H0 rejected at the 5% level
  cover  <- numeric(S)   # indicator: 95% CI covers the true mean
  
  for (s in 1:S) {
    
    # Draw the sample; true_mu records the population mean of each distribution
    if (dist == "normal") {
      x <- rnorm(n, mean = 0, sd = 1)
      true_mu <- 0
    }
    
    if (dist == "chisq2") {
      x <- rchisq(n, df = 2)
      true_mu <- 2
    }
    
    if (dist == "chisq10") {
      x <- rchisq(n, df = 10)
      true_mu <- 10
    }
    
    # Two-sided one-sample t test at the 5% level, with a 95% CI
    test <- t.test(x, mu = mu0, conf.level = 0.95)
    
    reject[s] <- (test$p.value < 0.05)
    
    # Does the 95% confidence interval cover the true mean?
    ci <- test$conf.int
    cover[s] <- (ci[1] <= true_mu & true_mu <= ci[2])
  }
  
  return(c(Type_I_Error = mean(reject),
           Coverage = mean(cover)))
}

1.2 (a)–(c): Normal Distribution \(N(0,1)\)

We draw from \(N(0,1)\) and test \(H_0: \mu = 0\) vs \(H_a: \mu \neq 0\) at \(\alpha = 0.05\).

n_values <- c(10, 50, 200)

normal_results <- sapply(n_values, function(n) {
  simulate_t_test(n = n, S = S, dist = "normal", mu0 = 0)
})

normal_results <- t(normal_results)
rownames(normal_results) <- paste0("n = ", n_values)

cat("Results under Normal Distribution N(0,1)\n")
## Results under Normal Distribution N(0,1)
print(normal_results)
##         Type_I_Error Coverage
## n = 10         0.045    0.955
## n = 50         0.036    0.964
## n = 200        0.051    0.949

1.3 (d): Observations for Normal Distribution

Under the normal distribution, the \(T\) test is exact. We observe:

  • The Type I error rate is close to the nominal level \(\alpha = 0.05\) for all sample sizes; the Monte Carlo check below quantifies how close we should expect it to be.
  • The coverage probability is close to the nominal 95% across all \(n\). Because \(\mu_0\) equals the true mean, the duality between tests and confidence intervals implies Coverage \(= 1 -\) Type I error exactly, which the table confirms.
  • Both results are expected, since the normality assumption of the \(T\) test holds exactly here.
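
How close should "close" be? With \(S = 1000\) replications, an estimated rejection rate at the nominal level has Monte Carlo standard error \(\sqrt{0.05 \times 0.95 / 1000} \approx 0.007\), so deviations of up to roughly \(\pm 0.014\) (two standard errors) from 0.05 are consistent with simulation noise alone; all three observed rates fall within or at the edge of that band. A quick check:

# Monte Carlo standard error of an estimated rejection rate when the
# true rate is the nominal alpha = 0.05 and S = 1000
mc_se <- sqrt(0.05 * 0.95 / 1000)
round(c(SE = mc_se, low = 0.05 - 2 * mc_se, high = 0.05 + 2 * mc_se), 4)
# SE ~ 0.0069, two-SE band ~ (0.0362, 0.0638)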

1.4 (e): Chi-Square Distribution with \(df = 2\)

The \(\chi^2_2\) distribution is heavily right-skewed (skewness \(= \sqrt{8/2} = 2\)). We test \(H_0: \mu = 2\).
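
As a quick sanity check of the skewness values quoted here and in the next subsection (the skewness of \(\chi^2_k\) is \(\sqrt{8/k}\)):

sqrt(8 / c(2, 10))  # 2.000 for df = 2, about 0.894 for df = 10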

chisq2_results <- sapply(n_values, function(n) {
  simulate_t_test(n = n, S = S, dist = "chisq2", mu0 = 2)
})

chisq2_results <- t(chisq2_results)
rownames(chisq2_results) <- paste0("n = ", n_values)

cat("Results under Chi-Square df = 2\n")
## Results under Chi-Square df = 2
print(chisq2_results)
##         Type_I_Error Coverage
## n = 10         0.108    0.892
## n = 50         0.061    0.939
## n = 200        0.058    0.942

1.5 (e continued): Chi-Square Distribution with \(df = 10\)

The \(\chi^2_{10}\) distribution is moderately right-skewed (skewness \(= \sqrt{8/10} \approx 0.89\)). We test \(H_0: \mu = 10\).

chisq10_results <- sapply(n_values, function(n) {
  simulate_t_test(n = n, S = S, dist = "chisq10", mu0 = 10)
})

chisq10_results <- t(chisq10_results)
rownames(chisq10_results) <- paste0("n = ", n_values)

cat("Results under Chi-Square df = 10\n")
## Results under Chi-Square df = 10
print(chisq10_results)
##         Type_I_Error Coverage
## n = 10         0.063    0.937
## n = 50         0.054    0.946
## n = 200        0.049    0.951

1.6 Observations for Chi-Square Distributions

  • \(\chi^2_2\) (heavily skewed): At \(n = 10\) the Type I error rate is inflated to 0.108 and coverage drops to 89.2%, because the \(t\)-based approximation is poor for highly skewed data in small samples. As \(n\) grows, the test recovers (error rates 0.061 and 0.058 at \(n = 50\) and \(n = 200\)) as the CLT takes effect.
  • \(\chi^2_{10}\) (moderately skewed): The departures from nominal levels are milder, since the distribution is closer to symmetric; even at \(n = 10\), the error rate (0.063) and coverage (93.7%) are reasonably close to nominal.
  • General conclusion: The one-sample \(T\) test is robust to non-normality for moderate-to-large sample sizes, but can be unreliable for small samples drawn from heavily skewed distributions; the combined table sketched below makes the comparison explicit.
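
For a side-by-side view, the three result matrices can be stacked into one table; a small sketch, assuming the objects computed above are still in the workspace:

# Combine all settings into a single data frame for direct comparison
all_results <- data.frame(
  dist = rep(c("normal", "chisq2", "chisq10"), each = 3),
  n    = rep(n_values, times = 3),
  rbind(normal_results, chisq2_results, chisq10_results),
  row.names = NULL
)
print(all_results)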

2 Question 2: Permutation Test for Spearman Rank Correlation

We implement a permutation test for the Spearman rank correlation and compare the resulting p-value with that from cor.test.

2.1 Permutation Test Function

set.seed(123)

library(MASS)  # for mvrnorm()

spearman_perm_test <- function(x, y, B = 5000) {
  
  # Observed statistic
  obs_stat <- cor(x, y, method = "spearman")
  
  perm_stats <- numeric(B)
  
  for (b in 1:B) {
    y_perm <- sample(y)  # permute Y
    perm_stats[b] <- cor(x, y_perm, method = "spearman")
  }
  
  # Two-sided p-value
  p_value <- mean(abs(perm_stats) >= abs(obs_stat))
  
  return(list(
    statistic = obs_stat,
    p_value = p_value
  ))
}
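
One caveat about the p-value computed above: it can be exactly zero if no permuted statistic reaches the observed one. A standard finite-sample adjustment (Davison and Hinkley, 1997) counts the observed statistic as one of the permutations; a sketch of that variant, not used above so the printed results stay reproducible:

# Permutation p-value that includes the observed statistic in the
# reference set, guaranteeing p >= 1 / (B + 1)
perm_p_adjusted <- function(perm_stats, obs_stat) {
  (1 + sum(abs(perm_stats) >= abs(obs_stat))) / (1 + length(perm_stats))
}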

2.2 Example 1: Bivariate Normal

Samples are drawn from a bivariate normal distribution with \(\rho = 0.5\).

mu <- c(0, 0)
Sigma <- matrix(c(1, 0.5,
                  0.5, 1), 2, 2)

data1 <- mvrnorm(10, mu, Sigma)

x1 <- data1[,1]
y1 <- data1[,2]

# Permutation test
perm1 <- spearman_perm_test(x1, y1)

# cor.test comparison
test1 <- cor.test(x1, y1, method = "spearman")

cat("==== Bivariate Normal Data ====\n")
## ==== Bivariate Normal Data ====
cat("Spearman rho (observed):", perm1$statistic, "\n")
## Spearman rho (observed): 0.5151515
cat("Permutation p-value:", perm1$p_value, "\n")
## Permutation p-value: 0.1306
cat("cor.test p-value:", test1$p.value, "\n")
## cor.test p-value: 0.1328231

2.3 Example 2: Lognormal Data

Samples are lognormal (exponentiated bivariate normal), which introduces heavy right skew.

data2 <- exp(mvrnorm(10, mu, Sigma))  # a fresh bivariate normal draw, then exponentiated

x2 <- data2[,1]
y2 <- data2[,2]

# Permutation test
perm2 <- spearman_perm_test(x2, y2)

# cor.test comparison
test2 <- cor.test(x2, y2, method = "spearman")

cat("==== Lognormal Data ====\n")
## ==== Lognormal Data ====
cat("Spearman rho (observed):", perm2$statistic, "\n")
## Spearman rho (observed): 0.3090909
cat("Permutation p-value:", perm2$p_value, "\n")
## Permutation p-value: 0.3842
cat("cor.test p-value:", test2$p.value, "\n")
## cor.test p-value: 0.387055

2.4 Discussion

  • The permutation p-values (0.1306 and 0.3842) closely match the cor.test p-values (0.1328 and 0.3871), supporting the permutation implementation.
  • The permutation test is distribution-free and does not rely on asymptotic approximations, making it particularly suitable for small samples or non-standard distributions such as the lognormal case.
  • Since Spearman’s \(\rho\) depends only on ranks, it is invariant to monotone transformations: exponentiating a given sample would leave \(\rho\) unchanged (see the check below). Note that Example 2 uses a fresh draw rather than the exponentiated Example 1 data, which is why its observed \(\rho\) differs. The permutation test, built on the same rank statistic, inherits this invariance.
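
A quick check of the rank-invariance claim on the Example 1 sample:

# exp() is monotone, so ranks -- and hence Spearman's rho -- are unchanged
cor(x1, y1, method = "spearman")
cor(exp(x1), exp(y1), method = "spearman")  # identical to the line above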

3 Question 3: Bootstrap Analysis of the scor Dataset

The scor dataset in the bootstrap package contains test scores for 88 students across five subjects: mechanics (mec) and vectors (vec), taken closed book, and algebra (alg), analysis (ana), and statistics (sta), taken open book.

library(bootstrap)  # provides the scor data
library(boot)       # boot() and boot.ci()

data(scor)
X <- scor
n <- nrow(X)

3.1 (a) Scatterplot Matrix and Correlation Matrix

pairs(X, main = "Scatterplot Matrix of Test Scores")

cor_matrix <- cor(X)
cat("Sample Correlation Matrix:\n")
## Sample Correlation Matrix:
print(round(cor_matrix, 4))
##        mec    vec    alg    ana    sta
## mec 1.0000 0.5534 0.5468 0.4094 0.3891
## vec 0.5534 1.0000 0.6096 0.4851 0.4364
## alg 0.5468 0.6096 1.0000 0.7108 0.6647
## ana 0.4094 0.4851 0.7108 1.0000 0.6072
## sta 0.3891 0.4364 0.6647 0.6072 1.0000

Observations: The scatterplots and correlation matrix reveal that the open-book tests (algebra, analysis, statistics) tend to be more strongly correlated with each other than with the closed-book tests (mechanics, vectors). This suggests the two groups of tests measure somewhat different latent abilities.
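
A quick numeric check of this pattern, averaging the relevant blocks of the printed correlation matrix:

closed <- c("mec", "vec")
open   <- c("alg", "ana", "sta")

cor_matrix["mec", "vec"]                                  # within closed-book: 0.5534
mean(cor_matrix[open, open][upper.tri(matrix(0, 3, 3))])  # within open-book: ~0.66
mean(cor_matrix[closed, open])                            # between groups: ~0.48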

3.2 (b) Bootstrap Estimates for Selected Correlations

We estimate bias and standard error for \(\hat{\rho}_{12}\) (mec, vec), \(\hat{\rho}_{34}\) (alg, ana), \(\hat{\rho}_{35}\) (alg, sta), and \(\hat{\rho}_{45}\) (ana, sta) using \(B = 2000\) bootstrap resamples (passed as R = 2000 to boot()).

cor_stat <- function(data, indices) {
  d <- data[indices, ]
  
  c(
    rho12 = cor(d$mec, d$vec),
    rho34 = cor(d$alg, d$ana),
    rho35 = cor(d$alg, d$sta),
    rho45 = cor(d$ana, d$sta)
  )
}

boot_cor <- boot(X, statistic = cor_stat, R = 2000)

orig_estimates <- boot_cor$t0
bias_estimates <- colMeans(boot_cor$t) - orig_estimates
se_estimates <- apply(boot_cor$t, 2, sd)

results_cor <- data.frame(
  Estimate = round(orig_estimates, 4),
  Bias = round(bias_estimates, 4),
  Std_Error = round(se_estimates, 4)
)

cat("Bootstrap Estimates for Correlations:\n")
## Bootstrap Estimates for Correlations:
print(results_cor)
##       Estimate    Bias Std_Error
## rho12   0.5534 -0.0052    0.0770
## rho34   0.7108  0.0001    0.0490
## rho35   0.6647 -0.0021    0.0622
## rho45   0.6072  0.0001    0.0678

Observations: The bootstrap biases are small relative to the estimates, suggesting the sample correlations are approximately unbiased. The standard errors quantify the sampling variability of each correlation estimate.
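
Given these bias estimates, a bias-corrected point estimate is simply the original estimate minus the estimated bootstrap bias:

# Bias-corrected correlation estimates (original minus estimated bias)
round(orig_estimates - bias_estimates, 4)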

3.3 (c) PCA — Proportion of Variance Explained by First Component

The proportion of variance explained by the first principal component is:

\[\theta = \frac{\lambda_1}{\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 + \lambda_5}\]

3.3.1 (i) Sample Estimate

S_hat <- cov(X)
eigvals_hat <- eigen(S_hat)$values  # eigen() returns eigenvalues in decreasing order
theta_hat <- eigvals_hat[1] / sum(eigvals_hat)

cat("Sample Estimate theta_hat:", round(theta_hat, 4), "\n")
## Sample Estimate theta_hat: 0.6191

3.3.2 (ii) Bootstrap Bias and Standard Error

theta_stat <- function(data, indices) {
  d <- data[indices, ]
  
  S <- cov(d)
  eigvals <- eigen(S)$values
  
  theta <- eigvals[1] / sum(eigvals)
  return(theta)
}

boot_theta <- boot(X, statistic = theta_stat, R = 2000)

theta_bias <- mean(boot_theta$t) - boot_theta$t0
theta_se <- sd(boot_theta$t)

cat("Bootstrap Bias for theta:", round(theta_bias, 4), "\n")
## Bootstrap Bias for theta: 0.001
cat("Bootstrap Std Error for theta:", round(theta_se, 4), "\n")
## Bootstrap Std Error for theta: 0.0477

3.3.3 (iii) 95% Confidence Intervals

perc_ci <- boot.ci(boot_theta, type = "perc")
bca_ci <- boot.ci(boot_theta, type = "bca")

cat("95% Percentile CI for theta:\n")
## 95% Percentile CI for theta:
print(perc_ci)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot_theta, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   ( 0.5199,  0.7070 )  
## Calculations and Intervals on Original Scale
cat("\n95% BCa CI for theta:\n")
## 
## 95% BCa CI for theta:
print(bca_ci)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot_theta, type = "bca")
## 
## Intervals : 
## Level       BCa          
## 95%   ( 0.5172,  0.7059 )  
## Calculations and Intervals on Original Scale

Observations: The BCa interval adjusts for both median bias and skewness in the bootstrap distribution of \(\hat{\theta}\). Here it, (0.517, 0.706), nearly coincides with the percentile interval, (0.520, 0.707), indicating that the bootstrap distribution is close to symmetric and nearly unbiased. Both intervals give a plausible range for the true proportion of variance explained by the first principal component.
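
For intuition about what BCa adjusts, its bias-correction constant \(z_0\) can be read off the bootstrap replicates; a minimal sketch, assuming boot_theta from above:

# z0 = qnorm(proportion of bootstrap replicates below the sample estimate);
# z0 near zero means little median bias, so BCa and percentile CIs agree
z0 <- qnorm(mean(boot_theta$t < boot_theta$t0))
z0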