Q2

Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) both can be used to reduce the data dimensionality. Use the mortality data in Taiwan area to demonstrate how these two methods work. The data in the years 1982-2008 are used as the “training” (in-sample) data and the years 2009-2013 are used as the “testing” (out-sample) data. You only need to perform one set of data, according to your gender.


Q3 (Not yet)

  1. Write a small program to perform the “Permutation test” and test your result on the correlation of DDT vs. eggshell thickness in class, and the following data:

Check your answer with other correlation tests, such as regular Pearson and Spearman correlation coefficients.

X 585 1002 472 493 408 690 291
Y 0.1 0.2 0.5 1 1.5 2 3

Solution:

data <- data.frame(x=c(585,1002,472,493,408,690,291),
                   y=c(0.1,0.2,0.5,1,1.5,2,3))
x <- c(585,1002,472,493,408,690,291)
y <- c(0.1,0.2,0.5,1,1.5,2,3)
all.comb <- permutations(n = 7,r = 7,v = y)
result <- x%*%t(all.comb)
obs.value <- sum(x*y)
sum(obs.value>result)/length(result)
## [1] 0.07738
sum(obs.value<result)/length(result)
## [1] 0.9224
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk unnamed-chunk-5 - Computing Pearson and Spearman correlation coefficients.

cor.test(x,y,method='pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = -1.508, df = 5, p-value = 0.1919
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9234  0.3349
## sample estimates:
##     cor 
## -0.5592
cor.test(x,y,method='spearman')
## 
##  Spearman's rank correlation rho
## 
## data:  x and y
## S = 86, p-value = 0.2357
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##     rho 
## -0.5357
  1. Simulate a set of two correlated normal distribution variables, with zero mean and variance 1. Let the correlation coefficient be 0.2 and 0.8. (Use Cholesky!) Then convert the data back to Uniform(0,1) and record only the first decimal number. (亦即只取小數第一位,0至9的整數) Suppose the sample size is 10. Apply the permutation test, Pearson and Spearman correlation coefficients, and records the p-values of these three methods. (10,000 simulation runs)
x1 <- rnorm(10)
x2 <- rnorm(10)
cho1 <- chol(matrix(c(1,0.2,0.2,1),2,2))

Q4

Using simulation to construct critical values of the Mann-Whitney-Wilcoxon test in the case that, where and are the number of observations in two populations. (Note: The number of replications shall be at least 10,000.)

Solution:

MWW <- function(x1,x2){
  n1 <- length(x1)
  n2 <- length(x2)
  r1 <- rank(c(x1,x2))[1:n1]
  r2 <- rank(c(x2,x1))[1:n2]
  Sum_r1 <- sum(r1)
  Sum_r2 <- sum(r2)
  max_r1_r2 <- max(Sum_r1,Sum_r2)
  U <- (n1*n2) + (n2*(n2+1)/2) - max_r1_r2
  mean <- n1*n2/2
  sigma <- sqrt(n1*n2*(n1+n2+1)/12)
  Z <- (U-mean)/sigma
  return(Z)
}
temp <- NULL
for(i in 1:10000){
  temp <- c(temp,MWW(rnorm(10),rnorm(10)))
}
temp %>% data.frame(temp) %>% 
  ggplot(aes(x=temp))+geom_histogram(binwidth=0.3)+theme_wsj()+
  theme(axis.title=element_text(size=14))+labs(title='10000 times of replications for MW',x='Critical value',y='Frequency')

plot of chunk unnamed-chunk-9


Q5

To compare teaching, twenty schoolchildren were divided into two groups: ten taught by conventional methods and ten taught by an entirely new approach. The following are the test results:

Conventional 65 79 90 75 61 85 98 80 97 75
News 90 98 73 79 84 81 98 90 83 88

Are the two teaching methods equivalent in result? You need to use permutation test, (parametric and non-parametric) bootstrap, and parametric test, and then compare their differences in testing.

Solution

data <- data.frame(Conventional=c(65,79,90,75,61,85,98,80,97,75),
                   News=c(90,98,73,79,84,81,98,90,83,88))
perm <- permtest(data$Conventional,data$News)
perm
##              N          t.obs   t-Dist:P(>t) PermDist:P(>t)          F.obs 
##      1.848e+05     -1.267e+00      8.894e-01      8.855e-01      2.373e+00 
##   F-Dist:P(>F) PermDist:P(>F) 
##      1.070e-01      9.464e-02
Conv <- data$Conventional
News <- data$News
temp <- NULL
temp1 <- NULL
for(i in 1:1000){
  x <- sample(Conv,10,replace=T)
  x1 <- sample(News,10,replace=T)
  temp <- c(temp,median(x))
  temp1 <- c(temp1,median(x1))
}
std <- sqrt(var(temp))
std1 <- sqrt(var(temp1))
c(median(Conv)-1.96*std,median(Conv)+1.96*std)
## [1] 69.9 89.1
c(median(News)-1.96*std1,median(News)+1.96*std1)
## [1] 79.75 92.25
data1 <- data %>% gather(Group,score,Conventional:News)  
t.test(score~Group,data1)
## 
##  Welch Two Sample t-test
## 
## data:  score by Group
## t = -1.267, df = 15.44, p-value = 0.2239
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -15.799   3.999
## sample estimates:
## mean in group Conventional         mean in group News 
##                       80.5                       86.4

Q6