Given the following data, use orthogonalization methods such as the Cholesky or QR decomposition to perform the regression analysis, including the parameter estimates and their standard errors via the sweep operator. Note that, to solve the linear system, you may use Gaussian elimination or the function “lm” or “glm” in R.
##By the Cholesky decomposition, A = LL'; substituting into the normal equations X'X(Bhat) = X'y gives LL'(Bhat) = X'y, and Bhat is obtained by solving the two triangular systems##
x1=c(rep(1300,6),rep(1200,6),rep(1100,4))
x2=c(7.5,9,11,13.5,17,23,5.3,7.5,11,13.5,17,23,5.3,7.5,11,17)
x3=c(0.012,0.012,0.0115,0.0130,0.0135,0.012,0.04,0.038,0.032,0.026,0.034,0.041,0.084,0.098,0.092,0.086)
x4=x1*x1
x5=x2*x2
x6=x3*x3
x7=x1*x3
x8=x1*x2
x9=x2*x3
y=matrix(c(49,50.2,50.5,48.5,47.5,44.5,28,31.5,34.5,35,38,38.5,15,17,20.5,29.5))
X=matrix(c(rep(1,16),x1,x2,x3,x4,x5,x6,x7,x8,x9),ncol=10)
A = t(X) %*% X                            # cross-product matrix X'X
L = chol(A)                               # upper-triangular factor: A = t(L) %*% L
B = solve(L, solve(t(L), t(X) %*% y))     # forward solve, then backward solve
B
## [,1]
## [1,] -3.617228e+03
## [2,] 5.324332e+00
## [3,] 1.924396e+01
## [4,] 1.376632e+04
## [5,] -1.926707e-03
## [6,] -3.034201e-02
## [7,] -1.158168e+04
## [8,] -1.057731e+01
## [9,] -1.414447e-02
## [10,] -2.103478e+01
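The problem also asks for the standard errors via the sweep operator; the output above gives the point estimates only. A minimal sketch (an assumed implementation, reusing X, y and A from above): sweeping the first p pivots of the augmented matrix [X'X X'y; y'X y'y] leaves -(X'X)^{-1} in the upper-left block, the estimates in the last column, and the residual sum of squares in the corner, from which the standard errors follow.
sweep_op = function(M, k){
  d = M[k, k]
  B = M - outer(M[, k], M[k, ])/d         # M[i,j] - M[i,k]*M[k,j]/d for all i, j
  B[k, ]  = M[k, ]/d
  B[, k]  = M[, k]/d
  B[k, k] = -1/d
  B
}
n = nrow(X); p = ncol(X)
M = rbind(cbind(A, t(X) %*% y), cbind(t(y) %*% X, t(y) %*% y))
for (k in 1:p) M = sweep_op(M, k)
Bhat = M[1:p, p + 1]                      # regression coefficients
RSS  = M[p + 1, p + 1]                    # residual sum of squares
se   = sqrt(-diag(M)[1:p] * RSS/(n - p))  # standard errors of the estimates
cbind(Bhat, se)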
Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) can both be used to reduce data dimensionality. Use the mortality data for the Taiwan area to demonstrate how these two methods work. The years 1982-2008 are used as the “training” (in-sample) data and the years 2009-2013 as the “testing” (out-of-sample) data. You only need to analyze one data set, according to your gender.
#SVD
a = read.csv("/Users/chenyupei/Dropbox/統研碩一/統計計算與模擬/week10/comp1032assg4/female.csv")
a1 = a[,-1]
a1 = as.matrix(a1)
loga=log(a1)
b1=svd(loga)
b11=b1$d
b11[1]/sum(b11)
## [1] 0.9322284
s1=b1$u[,1]
summary(lm(s1~c(1982:2013)))
##
## Call:
## lm(formula = s1 ~ c(1982:2013))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0021868 -0.0009498 -0.0000722 0.0007810 0.0032661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.497e+00 4.631e-02 32.33 <2e-16 ***
## c(1982:2013) -8.379e-04 2.319e-05 -36.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001211 on 30 degrees of freedom
## Multiple R-squared: 0.9775, Adjusted R-squared: 0.9768
## F-statistic: 1306 on 1 and 30 DF, p-value: < 2.2e-16
#PCA
b2=princomp(loga)
summary(b2)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 1.255360 0.22484883 0.19751530 0.17047446
## Proportion of Variance 0.893211 0.02865491 0.02211155 0.01647162
## Cumulative Proportion 0.893211 0.92186590 0.94397745 0.96044908
## Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 0.16307066 0.13213087 0.097551631 0.068854045
## Proportion of Variance 0.01507195 0.00989524 0.005393701 0.002687056
## Cumulative Proportion 0.97552102 0.98541626 0.990809965 0.993497021
## Comp.9 Comp.10 Comp.11 Comp.12
## Standard deviation 0.060376035 0.048478682 0.0416434764 0.0358836951
## Proportion of Variance 0.002066079 0.001332046 0.0009829054 0.0007298138
## Cumulative Proportion 0.995563100 0.996895146 0.9978780518 0.9986078656
## Comp.13 Comp.14 Comp.15 Comp.16
## Standard deviation 0.0297964650 0.0224101546 0.0201708289 0.0168011308
## Proportion of Variance 0.0005032077 0.0002846476 0.0002306032 0.0001599907
## Cumulative Proportion 0.9991110733 0.9993957209 0.9996263240 0.9997863147
## Comp.17 Comp.18
## Standard deviation 0.0153551347 0.0118841610
## Proportion of Variance 0.0001336365 0.0000800488
## Cumulative Proportion 0.9999199512 1.0000000000
s2=b2$scores[,1]
summary(lm(s2~c(1982:2013)))
##
## Call:
## lm(formula = s2 ~ c(1982:2013))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.43725 -0.09977 0.00720 0.12696 0.28727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.692e+02 6.496e+00 -41.45 <2e-16 ***
## c(1982:2013) 1.348e-01 3.252e-03 41.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1699 on 30 degrees of freedom
## Multiple R-squared: 0.9828, Adjusted R-squared: 0.9823
## F-statistic: 1718 on 1 and 30 DF, p-value: < 2.2e-16
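The assignment specifies 1982-2008 as the training data and 2009-2013 as the testing data, while the regressions above use all 32 years. A minimal sketch of that split (an assumed approach, using a rank-1 SVD reconstruction with a linearly extrapolated first component, and assuming the rows of loga are the years 1982-2013 in order):
train = loga[1:27, ]                               # training years 1982-2008
test  = loga[28:32, ]                              # testing years 2009-2013
sv = svd(train)
d1 = sv$d[1]
v1 = sv$v[, 1]                                     # first right singular vector (age pattern)
year_tr = 1982:2008
fit = lm(sv$u[, 1] ~ year_tr)                      # linear trend in the first left singular vector
u1_new = coef(fit)[1] + coef(fit)[2]*(2009:2013)   # extrapolated year effect
pred_log = d1 * outer(u1_new, v1)                  # rank-1 forecast of the log mortality rates
sqrt(mean((pred_log - test)^2))                    # out-of-sample RMSE on the log scale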
Write a small program to perform the “permutation test”, and apply it to the correlation of DDT vs. eggshell thickness given in class as well as to the following data:
x 585 1002 472 493 408 690 291
y 0.1 0.2 0.5 1.0 1.5 2.0 3.0
Check your answer with other correlation tests, such as regular Pearson and Spearman correlation coefficients.
For the data:
x 585 1002 472 493 408 690 291
y 0.1 0.2 0.5 1.0 1.5 2.0 3.0
H0: the two variables are not related
H1: the two variables are related
x = c(585,1002,472,493,408,690,291)
y = c(0.1,0.2,0.5,1,1.5,2,3)
##permutation test##
permutation = function(x, y){
  nx = length(x)
  ny = length(y)
  sx = sum(x^2) - nx*(mean(x)^2)          # corrected sum of squares of x
  sy = sum(y^2) - ny*(mean(y)^2)          # corrected sum of squares of y
  t_stat = function(x, y){
    (mean(x) - mean(y))/sqrt((sx + sy)/(nx + ny - 2)*(1/nx + 1/ny))
  }
  t = t_stat(x, y)                        # pooled two-sample t statistic
  t_pvalue = 1 - pt(t, nx + ny - 2)       # one-sided p-value from the t distribution
  cat("p-value = ", t_pvalue)
}
permutation(x,y)
## p-value = 1.624924e-05
Since the p-value < 0.05, we reject the null hypothesis; there is sufficient evidence that the two variables are related.
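Note that the permutation() function above compares the two sample means with a pooled t statistic rather than permuting the pairing of x and y. A minimal sketch of a permutation test on the Pearson correlation itself (an assumed implementation; B = 10000 permutations is an arbitrary choice):
perm_cor_test = function(x, y, B = 10000){
  r_obs  = cor(x, y)                           # observed Pearson correlation
  r_perm = replicate(B, cor(x, sample(y)))     # correlations under random re-pairing of y
  mean(abs(r_perm) >= abs(r_obs))              # two-sided permutation p-value
}
perm_cor_test(x, y)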
##Pearson correlation coefficient & Spearman correlation coefficient##
cor.test(x, y, method="pearson")
##
## Pearson's product-moment correlation
##
## data: x and y
## t = -1.5083, df = 5, p-value = 0.1919
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9234039 0.3348768
## sample estimates:
## cor
## -0.5592018
cor.test(x, y, method="spearman")
##
## Spearman's rank correlation rho
##
## data: x and y
## S = 86, p-value = 0.2357
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.5357143
Using the Pearson correlation coefficient test, the p-value = 0.1919 > 0.05, so there is not sufficient evidence that the two variables are related.
Using the Spearman correlation coefficient test, the p-value = 0.2357 > 0.05, so there is not sufficient evidence that the two variables are related.
For the other data:
DDT = 65 98 117 122 130
Egg Shell Thickness = 0.52 0.53 0.49 0.49 0.37
ddt=c(65,98,117,122,130)
egg=c(0.52,0.53,0.49,0.49,0.37)
permutation(ddt,egg)
## p-value = 8.399283e-06
cor.test(ddt, egg, method="pearson")
##
## Pearson's product-moment correlation
##
## data: ddt and egg
## t = -1.5343, df = 3, p-value = 0.2225
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9749731 0.5281884
## sample estimates:
## cor
## -0.6630711
cor.test(ddt, egg, method="spearman")
##
## Spearman's rank correlation rho
##
## data: ddt and egg
## S = 37.4416, p-value = 0.05385
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.8720816
Using the permutation test, since the p-value < 0.05, we reject the null hypothesis; there is sufficient evidence that the two variables are related.
Using the Pearson correlation coefficient test, the p-value = 0.2225 > 0.05, so there is not sufficient evidence that the two variables are related.
Using the Spearman correlation coefficient test, the p-value = 0.05385 > 0.05, so there is not sufficient evidence that the two variables are related.
Simulate a set of two correlated normal variables, each with zero mean and variance 1. Let the correlation coefficient be 0.2 and 0.8. (Use Cholesky!) Then convert the data back to Uniform(0,1) and record only the first decimal digit (i.e., an integer from 0 to 9). Suppose the sample size is 10. Apply the permutation test and the Pearson and Spearman correlation coefficients, and record the p-values of these three methods. (10,000 simulation runs)
x1 = rnorm(10)
x2 = rnorm(10)
cho1 = chol(matrix(c(1,0.2,0.2,1),2,2))   # Cholesky factor of the correlation-0.2 matrix
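The chunk above stops after forming the Cholesky factor. A minimal sketch of one complete run and of the 10,000-run simulation (an assumed implementation; the B = 1000 permutations inside each run and the 5% level used for the rejection rates are arbitrary choices, and the permutation test is the correlation permutation test sketched earlier):
one_run = function(rho, n = 10, B = 1000){
  C = chol(matrix(c(1, rho, rho, 1), 2, 2))          # upper-triangular Cholesky factor
  z = matrix(rnorm(2*n), n, 2) %*% C                 # correlated N(0,1) pairs
  u = floor(pnorm(z) * 10)                           # first decimal digit of Uniform(0,1): 0..9
  x = u[, 1]; y = u[, 2]
  p_perm  = mean(abs(replicate(B, cor(x, sample(y)))) >= abs(cor(x, y)))
  p_pear  = cor.test(x, y, method = "pearson")$p.value
  p_spear = suppressWarnings(cor.test(x, y, method = "spearman"))$p.value
  c(perm = p_perm, pearson = p_pear, spearman = p_spear)
}
res_02 = t(replicate(10000, one_run(0.2)))           # 10,000 runs, correlation 0.2
res_08 = t(replicate(10000, one_run(0.8)))           # 10,000 runs, correlation 0.8
colMeans(res_02 < 0.05)                              # rejection rates of the three tests
colMeans(res_08 < 0.05)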
Use simulation to construct critical values of the Mann-Whitney-Wilcoxon test in the case that 2 <= n1, n2 <= 10, where n1 and n2 are the numbers of observations in the two populations. (Note: the number of replications shall be at least 10,000.)
library(dplyr)
library(ggplot2)
##critical values of the Mann-Whitney-Wilcoxon test##
m_w_w = function(x, y){
  n1 = length(x)
  n2 = length(y)
  r1 = rank(c(x, y))[1:n1]              # ranks of sample 1 in the pooled data
  r2 = rank(c(x, y))[-(1:n1)]           # ranks of sample 2 in the pooled data
  w1 = sum(r1)
  w2 = sum(r2)
  u1 = n1*n2 + n1*(n1+1)/2 - w1
  u2 = n1*n2 + n2*(n2+1)/2 - w2
  u = min(u1, u2)                       # Mann-Whitney U statistic
  eu = n1*n2/2                          # mean of U under H0
  varu = n1*n2*(n1+n2+1)/12             # variance of U under H0
  z = (u - eu)/sqrt(varu)               # standardized statistic
  return(z)
}
##n1 = n2 = 10, with 10,000 replications##
critical = NULL
for (i in 1:10000){
  critical = c(critical, m_w_w(rnorm(10), rnorm(10)))
}
data.frame(critical = critical) %>%
  ggplot(aes(x = critical)) +
  geom_histogram(binwidth = 0.3) +
  theme(axis.title = element_text(size = 10)) +
  labs(title = '10,000 replications of the Mann-Whitney-Wilcoxon statistic', x = 'Critical value', y = 'Frequency')
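The histogram shows the simulated distribution for n1 = n2 = 10 only; the problem asks for critical values over the whole grid 2 <= n1, n2 <= 10. A minimal sketch (an assumed implementation, taking the empirical lower 5% quantile of the simulated statistic as the critical value; the significance level is an assumption):
crit_table = matrix(NA, 9, 9, dimnames = list(n1 = 2:10, n2 = 2:10))
for (n1 in 2:10){
  for (n2 in 2:10){
    z = replicate(10000, m_w_w(rnorm(n1), rnorm(n2)))
    crit_table[n1 - 1, n2 - 1] = quantile(z, 0.05)   # simulated lower-tail 5% critical value
  }
}
round(crit_table, 3)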
To compare teaching, twenty schoolchildren were divided into two groups: ten taught by conventional methods and ten taught by an entirely new approach. The following are the test results:
Conventional 65 79 90 75 61 85 98 80 97 75
New 90 98 73 79 84 81 98 90 83 88
Are the two teaching methods equivalent in result? You need to use the permutation test, the (parametric and non-parametric) bootstrap, and a parametric test, and then compare their differences.
##permutation test##
Conventional = c(65,79,90,75,61,85,98,80,97,75)
New = c(90,98,73,79,84,81,98,90,83,88)
permtest(Conventional,New)
## N t.obs t-Dist:P(>t) PermDist:P(>t) F.obs
## 1.847560e+05 -1.267231e+00 8.893884e-01 8.855355e-01 2.372925e+00
## F-Dist:P(>F) PermDist:P(>F)
## 1.070328e-01 9.463833e-02
Using the permutation test, since the p-value = 0.8894 > 0.05, we do not reject the null hypothesis; there is not sufficient evidence that the two teaching methods differ.
##nonparametric bootstrap of the group medians##
t1 = NULL
t2 = NULL
for (i in 1:10000){
  x1 = sample(Conventional, 10, replace = TRUE)   # resample each group with replacement
  x2 = sample(New, 10, replace = TRUE)
  t1 = c(t1, median(x1))
  t2 = c(t2, median(x2))
}
std1 = sqrt(var(t1))
std2 = sqrt(var(t2))
med1 = median(Conventional)
med2 = median(New)
c(med1,med2)
## [1] 79.5 86.0
c(med1-1.96*std1,med1+1.96*std1)
## [1] 70.05761 88.94239
c(med2-1.96*std2,med2+1.96*std2)
## [1] 79.60671 92.39329
From the 95% confidence interval for the New method, we can see that it does not contain the Conventional method's median of 79.5, so there is sufficient evidence that the new teaching method gives better results.
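The problem also asks for a parametric bootstrap. A minimal sketch (an assumed implementation, resampling each group from a normal distribution fitted to the observed scores and reusing med1 and med2 from above):
t1p = NULL
t2p = NULL
for (i in 1:10000){
  x1 = rnorm(10, mean(Conventional), sd(Conventional))   # parametric resample, Conventional
  x2 = rnorm(10, mean(New), sd(New))                     # parametric resample, New
  t1p = c(t1p, median(x1))
  t2p = c(t2p, median(x2))
}
c(med1 - 1.96*sd(t1p), med1 + 1.96*sd(t1p))   # parametric-bootstrap 95% CI, Conventional median
c(med2 - 1.96*sd(t2p), med2 + 1.96*sd(t2p))   # parametric-bootstrap 95% CI, New median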
##t-test##
t.test(Conventional, New, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: Conventional and New
## t = -1.2672, df = 15.442, p-value = 0.2239
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -15.798987 3.998987
## sample estimates:
## mean of x mean of y
## 80.5 86.4
H0 : mu1 = mu2
H1 : mu1 != mu2
From the t-test, since the p-value = 0.2239 > 0.05, we do not reject H0; there is not sufficient evidence that the two teaching methods differ.
The block bootstrap can be used for prediction with dependent data. Use the data “Telephone price.txt”, which gives the telephone cost between New York and San Francisco in 1915-1996, and compare the predictions from the block bootstrap with those from an AR-type model. As a check, you can leave the final 10 observations as “testing” data.
price = c(319.29,296.83,234.35,192.35,149.36,128.95,144.35,154.12,151.41,151.11,147.37,134.48,100.17,82.26,82.26,84.4,92.55,103.18,108.77,105.24,102.68,96.04,70.88,72.23,73.27,53.96,42.53,38.44,36.21,35.59,28.27,20.04,17.52,16.26,16.42,16.26,15.07,14.75,14.64,14.56,14.62,14.4,13.91,13.54,12.98,11.89,11.78,11.65,11.51,11.36,11.06,9.65,9.28,7.78,7.26,5.56,5.21,5.05,5.11,4.6,3.95,3.57,3.36,1.3,1.3,2.6,2.72,2.63,2.61,2.57,1.75,1.63,1.39,1.34,1.29,1.26,1.25,1.26,1.29,1.39,1.37,0.9)
price1 = price[-(73:82)]                          # drop the final 10 observations (testing data)
log_price = log(price1)
delta_price = log_price[2:72] - log_price[1:71]   # first differences of the log prices
##1000 block-bootstrap simulations##
t6 = NULL
for (i in 1:1000){
  t3 = sample(1:62, 1)                  # random start of a block of 10 consecutive differences
  t4 = delta_price[t3:(t3 + 9)]
  t5 = NULL
  a1 = log_price[72]                    # last in-sample log price
  for (j in 1:10){
    a1 = a1 + t4[j]                     # accumulate the block differences to build a forecast path
    t5 = c(t5, a1)
  }
  t6 = rbind(t6, t5)
}
##95% prediction intervals from the bootstrap paths##
ci = NULL
for (i in 1:10){
  med = median(t6[, i])
  s = sd(t6[, i])
  ci = rbind(ci, c(med - 1.96*s, med, med + 1.96*s))
}
##fit an AR(2) model to the training log prices##
ar2 = arima(log_price,order=c(2,0,0))
##95% prediction intervals from the AR(2) model##
ar2_predict = predict(ar2,n.ahead=10)$pred
ar2_std = predict(ar2,n.ahead=10)$se
ar2_upi = ar2_predict + 1.96 * ar2_std
ar2_lpi = ar2_predict - 1.96 * ar2_std
##plot##
table6 = matrix(NA,nrow = 82,ncol = 5)
table6[1:82,1] = log(price)
table6[73:82,2:4] = ci
table6[73:82,5] = ar2_predict
colnames(table6) = c("log_price","LPI_bootstrap","Median_bootstrap","UPI_bootstrap","ARIMA(2,0,0)")
ts.plot(table6,lty = c(1,2,1,5,4),col=c("black","blue","black","blue","red"),main = "Block Bootstrap in Telephone price")
legend(50,5,c("LPI_bootstrap","Median_bootstrap","UPI_bootstrap","ARIMA(2,0,0)"),lty = c(2,1,5,4),col=c("blue","black","blue","red"),cex = 0.8)
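As a check against the 10 held-out observations, the two predictions can also be compared on the testing data, for example by the root mean squared error on the log scale (an assumed comparison, not part of the original report):
test_log = log(price[73:82])                                     # the 10 held-out log prices
rmse_boot = sqrt(mean((ci[, 2] - test_log)^2))                   # block-bootstrap median path
rmse_ar2  = sqrt(mean((as.numeric(ar2_predict) - test_log)^2))   # AR(2) forecasts
c(block_bootstrap = rmse_boot, AR2 = rmse_ar2)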