統計計算與模擬第四次作業

Q2

Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) both can be used to reduce the data dimensionality. Use the mortality data in Taiwan area to demonstrate how these two methods work. The data in the years 1982-2008 are used as the “training” (in-sample) data and the years 2009-2013 are used as the “testing” (out-sample) data. You only need to perform one set of data, according to your gender.

Q3 (Not yet)

Write a small program to perform the “Permutation test” and test your result on the correlation of DDT vs. eggshell thickness in class, and the following data:

Check your answer with other correlation tests, such as regular Pearson and Spearman correlation coefficients.

X	585	1002	472	493	408	690	291
Y	0.1	0.2	0.5	1	1.5	2	3

Solution:

Creating vectors.

data <- data.frame(x=c(585,1002,472,493,408,690,291),
                   y=c(0.1,0.2,0.5,1,1.5,2,3))
x <- c(585,1002,472,493,408,690,291)
y <- c(0.1,0.2,0.5,1,1.5,2,3)

List all combinations of y by gtools packages
List all possibility of S=∑(x*y) by result

all.comb <- permutations(n = 7,r = 7,v = y)
result <- x%*%t(all.comb)

Computing p-value : 0.077 -> do not reject h0, there are no positive cor.
Computing p-value : 0.922 -> do not reject h0, there are no negative cor.
It means that the correlation is zero.

obs.value <- sum(x*y)
sum(obs.value>result)/length(result)

## [1] 0.07738

sum(obs.value<result)/length(result)

## [1] 0.9224

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk unnamed-chunk-5 - Computing Pearson and Spearman correlation coefficients.

No difference to the result of permutation test.

cor.test(x,y,method='pearson')

## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = -1.508, df = 5, p-value = 0.1919
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9234  0.3349
## sample estimates:
##     cor 
## -0.5592

cor.test(x,y,method='spearman')

## 
##  Spearman's rank correlation rho
## 
## data:  x and y
## S = 86, p-value = 0.2357
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##     rho 
## -0.5357

Simulate a set of two correlated normal distribution variables, with zero mean and variance 1. Let the correlation coefficient be 0.2 and 0.8. (Use Cholesky!) Then convert the data back to Uniform(0,1) and record only the first decimal number. (亦即只取小數第一位，0至9的整數) Suppose the sample size is 10. Apply the permutation test, Pearson and Spearman correlation coefficients, and records the p-values of these three methods. (10,000 simulation runs)

gererating two rv. from N(0,1)

x1 <- rnorm(10)
x2 <- rnorm(10)
cho1 <- chol(matrix(c(1,0.2,0.2,1),2,2))

Q4

Using simulation to construct critical values of the Mann-Whitney-Wilcoxon test in the case that, where and are the number of observations in two populations. (Note: The number of replications shall be at least 10,000.)

Solution:

At first construct a function to produce critical value

MWW <- function(x1,x2){
  n1 <- length(x1)
  n2 <- length(x2)
  r1 <- rank(c(x1,x2))[1:n1]
  r2 <- rank(c(x2,x1))[1:n2]
  Sum_r1 <- sum(r1)
  Sum_r2 <- sum(r2)
  max_r1_r2 <- max(Sum_r1,Sum_r2)
  U <- (n1*n2) + (n2*(n2+1)/2) - max_r1_r2
  mean <- n1*n2/2
  sigma <- sqrt(n1*n2*(n1+n2+1)/12)
  Z <- (U-mean)/sigma
  return(Z)
}

Take n1=n2=10, replication=10000

temp <- NULL
for(i in 1:10000){
  temp <- c(temp,MWW(rnorm(10),rnorm(10)))
}
temp %>% data.frame(temp) %>% 
  ggplot(aes(x=temp))+geom_histogram(binwidth=0.3)+theme_wsj()+
  theme(axis.title=element_text(size=14))+labs(title='10000 times of replications for MW',x='Critical value',y='Frequency')

plot of chunk unnamed-chunk-9

Q5

To compare teaching, twenty schoolchildren were divided into two groups: ten taught by conventional methods and ten taught by an entirely new approach. The following are the test results:

Conventional	65	79	90	75	61	85	98	80	97	75
News	90	98	73	79	84	81	98	90	83	88

Are the two teaching methods equivalent in result? You need to use permutation test, (parametric and non-parametric) bootstrap, and parametric test, and then compare their differences in testing.

Solution

Permutation test
- Do not reject Ho, they are no diffence between News and conventional method.

data <- data.frame(Conventional=c(65,79,90,75,61,85,98,80,97,75),
                   News=c(90,98,73,79,84,81,98,90,83,88))
perm <- permtest(data$Conventional,data$News)
perm

##              N          t.obs   t-Dist:P(>t) PermDist:P(>t)          F.obs 
##      1.848e+05     -1.267e+00      8.894e-01      8.855e-01      2.373e+00 
##   F-Dist:P(>F) PermDist:P(>F) 
##      1.070e-01      9.464e-02

Bootstrap
- We can see that 95%CI for News group does not contain the median of Conventional group:79.5
- So, score of News group is significant great than conventional group.

Conv <- data$Conventional
News <- data$News
temp <- NULL
temp1 <- NULL
for(i in 1:1000){
  x <- sample(Conv,10,replace=T)
  x1 <- sample(News,10,replace=T)
  temp <- c(temp,median(x))
  temp1 <- c(temp1,median(x1))
}
std <- sqrt(var(temp))
std1 <- sqrt(var(temp1))
c(median(Conv)-1.96*std,median(Conv)+1.96*std)

## [1] 69.9 89.1

c(median(News)-1.96*std1,median(News)+1.96*std1)

## [1] 79.75 92.25

Parametric test : Using t test
- The result is that we can’t reject difference in means is not equal to 0

data1 <- data %>% gather(Group,score,Conventional:News)  
t.test(score~Group,data1)

## 
##  Welch Two Sample t-test
## 
## data:  score by Group
## t = -1.267, df = 15.44, p-value = 0.2239
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -15.799   3.999
## sample estimates:
## mean in group Conventional         mean in group News 
##                       80.5                       86.4

統計計算與模擬第四次作業

103354020楊崇甫 103354021林敬昇

2015年5月4日

Q2

Q3 (Not yet)

Solution:

Q4

Solution:

Q5

Solution

Q6