Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) both can be used to reduce the data dimensionality. Use the mortality data in Taiwan area to demonstrate how these two methods work. The data in the years 1982-2008 are used as the “training” (in-sample) data and the years 2009-2013 are used as the “testing” (out-sample) data. You only need to perform one set of data, according to your gender.
Check your answer with other correlation tests, such as regular Pearson and Spearman correlation coefficients.
| X | 585 | 1002 | 472 | 493 | 408 | 690 | 291 |
|---|---|---|---|---|---|---|---|
| Y | 0.1 | 0.2 | 0.5 | 1 | 1.5 | 2 | 3 |
data <- data.frame(x=c(585,1002,472,493,408,690,291),
y=c(0.1,0.2,0.5,1,1.5,2,3))
x <- c(585,1002,472,493,408,690,291)
y <- c(0.1,0.2,0.5,1,1.5,2,3)
List all combinations of y by gtools packages
List all possibility of S=∑(x*y) by result
all.comb <- permutations(n = 7,r = 7,v = y)
result <- x%*%t(all.comb)
0.077 -> do not reject h0, there are no positive cor.0.922 -> do not reject h0, there are no negative cor.obs.value <- sum(x*y)
sum(obs.value>result)/length(result)
## [1] 0.07738
sum(obs.value<result)/length(result)
## [1] 0.9224
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
- Computing Pearson and Spearman correlation coefficients.
cor.test(x,y,method='pearson')
##
## Pearson's product-moment correlation
##
## data: x and y
## t = -1.508, df = 5, p-value = 0.1919
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9234 0.3349
## sample estimates:
## cor
## -0.5592
cor.test(x,y,method='spearman')
##
## Spearman's rank correlation rho
##
## data: x and y
## S = 86, p-value = 0.2357
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.5357
x1 <- rnorm(10)
x2 <- rnorm(10)
cho1 <- chol(matrix(c(1,0.2,0.2,1),2,2))
Using simulation to construct critical values of the Mann-Whitney-Wilcoxon test in the case that, where and are the number of observations in two populations. (Note: The number of replications shall be at least 10,000.)
MWW <- function(x1,x2){
n1 <- length(x1)
n2 <- length(x2)
r1 <- rank(c(x1,x2))[1:n1]
r2 <- rank(c(x2,x1))[1:n2]
Sum_r1 <- sum(r1)
Sum_r2 <- sum(r2)
max_r1_r2 <- max(Sum_r1,Sum_r2)
U <- (n1*n2) + (n2*(n2+1)/2) - max_r1_r2
mean <- n1*n2/2
sigma <- sqrt(n1*n2*(n1+n2+1)/12)
Z <- (U-mean)/sigma
return(Z)
}
temp <- NULL
for(i in 1:10000){
temp <- c(temp,MWW(rnorm(10),rnorm(10)))
}
temp %>% data.frame(temp) %>%
ggplot(aes(x=temp))+geom_histogram(binwidth=0.3)+theme_wsj()+
theme(axis.title=element_text(size=14))+labs(title='10000 times of replications for MW',x='Critical value',y='Frequency')
To compare teaching, twenty schoolchildren were divided into two groups: ten taught by conventional methods and ten taught by an entirely new approach. The following are the test results:
| Conventional | 65 | 79 | 90 | 75 | 61 | 85 | 98 | 80 | 97 | 75 |
|---|---|---|---|---|---|---|---|---|---|---|
| News | 90 | 98 | 73 | 79 | 84 | 81 | 98 | 90 | 83 | 88 |
Are the two teaching methods equivalent in result? You need to use permutation test, (parametric and non-parametric) bootstrap, and parametric test, and then compare their differences in testing.
data <- data.frame(Conventional=c(65,79,90,75,61,85,98,80,97,75),
News=c(90,98,73,79,84,81,98,90,83,88))
perm <- permtest(data$Conventional,data$News)
perm
## N t.obs t-Dist:P(>t) PermDist:P(>t) F.obs
## 1.848e+05 -1.267e+00 8.894e-01 8.855e-01 2.373e+00
## F-Dist:P(>F) PermDist:P(>F)
## 1.070e-01 9.464e-02
79.5Conv <- data$Conventional
News <- data$News
temp <- NULL
temp1 <- NULL
for(i in 1:1000){
x <- sample(Conv,10,replace=T)
x1 <- sample(News,10,replace=T)
temp <- c(temp,median(x))
temp1 <- c(temp1,median(x1))
}
std <- sqrt(var(temp))
std1 <- sqrt(var(temp1))
c(median(Conv)-1.96*std,median(Conv)+1.96*std)
## [1] 69.9 89.1
c(median(News)-1.96*std1,median(News)+1.96*std1)
## [1] 79.75 92.25
data1 <- data %>% gather(Group,score,Conventional:News)
t.test(score~Group,data1)
##
## Welch Two Sample t-test
##
## data: score by Group
## t = -1.267, df = 15.44, p-value = 0.2239
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -15.799 3.999
## sample estimates:
## mean in group Conventional mean in group News
## 80.5 86.4