The corresoponding packages are already loaded. Now Load the data into R using the commands
gap.raw <- read.csv('gap.csv')
gap <- gap.raw
gap[,3:14]<- log(gap.raw[,3:14])
Note that for GDP per capita, it is best to work with log(GDP) when doing statistical analysis, as the values vary over several orders of magnitude between countries. For ease of plotting, it may be useful to split the data into two data frames, one containing GDP per capita, and the other life expectancy data
gdp <- exp(gap[,3:14])
years <- seq(1952, 2007,5)
colnames(gdp) <- years
rownames(gdp) <- gap[,2]
lifeExp <- gap[,15:26]
colnames(lifeExp) <- years
rownames(lifeExp) <- gap[,2]
Since countries or continents are not specified, we should consider the change of all the countries over the past 70 years. So gdp and life expectancy are taken average to see their changes.
gdp_mean <- colMeans(gdp)
lifeExp_mean <- colMeans(lifeExp)
plot(years,gdp_mean,xlab = "years")
plot(years,lifeExp_mean,xlab = "years")
From the plots we can see that both gdp and life expectancy have been increasing steadily for the past 70 years on average worldwide.
Since the varaibles are measuring similar entities,PCR using S should be enough.
pca_gdp <- prcomp(gdp)
pca_lifeExp <- prcomp(lifeExp)
summary(pca_gdp)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 2.977e+04 1.411e+04 4.386e+03 1631.9705 1.321e+03
## Proportion of Variance 7.976e-01 1.791e-01 1.732e-02 0.0024 1.570e-03
## Cumulative Proportion 7.976e-01 9.768e-01 9.941e-01 0.9965 9.980e-01
## PC6 PC7 PC8 PC9 PC10
## Standard deviation 895.62687 735.23148 541.92310 512.29981 390.94375
## Proportion of Variance 0.00072 0.00049 0.00026 0.00024 0.00014
## Cumulative Proportion 0.99876 0.99925 0.99951 0.99975 0.99989
## PC11 PC12
## Standard deviation 310.89089 174.92261
## Proportion of Variance 0.00009 0.00003
## Cumulative Proportion 0.99997 1.00000
summary(pca_lifeExp)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 38.6350 9.83955 4.50873 2.52549 1.39085 1.27733 1.01896
## Proportion of Variance 0.9203 0.05969 0.01253 0.00393 0.00119 0.00101 0.00064
## Cumulative Proportion 0.9203 0.97994 0.99247 0.99641 0.99760 0.99860 0.99924
## PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.79691 0.48377 0.43319 0.33836 0.23486
## Proportion of Variance 0.00039 0.00014 0.00012 0.00007 0.00003
## Cumulative Proportion 0.99964 0.99978 0.99990 0.99997 1.00000
plot(pca_gdp)
plot(pca_lifeExp)
From the outputs of summary, we can see from the second rows that how much the principal components of gdp and life expectancy explain. For gdp, the proportions are 79.76%,17.91%,…; For life expectancy, the proportions are 92.03%,5.969%,…
For both cases, we should consider the first two principal components.Because they all explain over 95% which is enough to explain the variation.
pca_gdp$rotation[,1:3]
## PC1 PC2 PC3
## 1952 0.2353456 0.42731132 -0.26028080
## 1957 0.2578710 0.43359999 -0.21726232
## 1962 0.2463384 0.32565937 -0.06460943
## 1967 0.2478510 0.22651812 0.14640562
## 1972 0.3226684 0.31205792 0.15316427
## 1977 0.2692097 0.03960993 0.49996163
## 1982 0.2405693 -0.12731538 0.51102935
## 1987 0.2570970 -0.19517312 0.27158722
## 1992 0.2848019 -0.20793971 0.05488671
## 1997 0.3197645 -0.24277083 -0.16052462
## 2002 0.3397718 -0.32093010 -0.26194321
## 2007 0.3957711 -0.33705789 -0.39206569
pca_lifeExp$rotation[,1:3]
## PC1 PC2 PC3
## 1952 0.2995501 0.34902226 -0.33190727
## 1957 0.3041007 0.32308013 -0.23604180
## 1962 0.3024818 0.29646127 -0.16795890
## 1967 0.2967688 0.23213961 -0.03516382
## 1972 0.2898377 0.16609597 0.13016468
## 1977 0.2850260 0.10290633 0.28373569
## 1982 0.2751025 0.02102816 0.33368816
## 1987 0.2681294 -0.06709016 0.38555489
## 1992 0.2777391 -0.22660990 0.42086149
## 1997 0.2835186 -0.35942503 0.04474640
## 2002 0.2937106 -0.44973534 -0.32991736
## 2007 0.2856875 -0.45398620 -0.39906629
gdp: The first PC gives increasing weight to 1952-2007, which means that 1952 explains the least variation and 2007 explains the most variation. And the values are all positive thus high marks on 1952-2007 will have a higher value on y1. High value represents a higher gdp level. The second PC represents a contrast between 1952-1977 and 1982-2007. For instance, a large negative value for y2 implies the country did much better in 1987-2007 than 1952-1977, and a large positive value implies the opposite. Low value represents they made progress in gdp during 1982-2007 compared to 1952-1977.
life expentancy: The first PC gives approximately equal weigt to 1952-2007, which means that they all explain equal variation. And the values are all positive thus high marks on 1952-2007 will have a higher value on y1. High value represents a higher life expectancy level. The second PC represents a contrast between 1952-1982 and 1987-2007. Similar with the second PC of gdp. low value represents they made progress in life expectancy during 1987-2007 compared with 1952-1982
gap$gdp_PC1 <- pca_gdp$x[,1]
gap$gdp_PC2 <- pca_gdp$x[,2]
gap$gdp_PC3 <- pca_gdp$x[,3]
p1 <- ggplot(gap,aes(gdp_PC1,gdp_PC2,label = country,color = continent))
p1+ geom_point()+geom_text()
p2 <- ggplot(gap,aes(gdp_PC1,gdp_PC3,label = country,color = continent))
p2+ geom_point()+geom_text()
p3 <- ggplot(gap,aes(gdp_PC2,gdp_PC3,label = country,color = continent))
p3+ geom_point()+geom_text()
From the fisrt plot, we can see that Kuwait has oddly large PC1 and PC2,which means that it generally has high gdp but it has been going down quickly approximately since 1980s. We can also observe that Saudi Arabia has high PC3 scores. From the 3rd loading, we guess it has rather high level of gdp in the middle parts of these 70 years but has relatively low gdp at both ends.
I do not know history very well about Kuwait but I guess it is running out of petroleum. Saudi Arabia might suffer from economic recession approximately in 1990s.
gap$lifeExp_PC1 <- pca_lifeExp$x[,1]
gap$lifeExp_PC2 <- pca_lifeExp$x[,2]
gap$lifeExp_PC3 <- pca_lifeExp$x[,3]
p4 <- ggplot(gap,aes(lifeExp_PC1,lifeExp_PC2,label = country,color = continent))
p4+ geom_point()+geom_text()
p5 <- ggplot(gap,aes(lifeExp_PC1,lifeExp_PC3,label = country,color = continent))
p5+ geom_point()+geom_text()
p6 <- ggplot(gap,aes(lifeExp_PC2,lifeExp_PC3,label = country,color = continent))
p6+ geom_point()+geom_text()
Zimbabwe has high PC3 and PC2 but low PC1, which means that (from loadings) it had high life expectancy around 1970s but did bad in other years. I guess maybe because of war after 1970s.
Rwanda has low PC1 and PC3 values but high PC2,which means their life expectancy is relatively low generally but had its highlight in the early stage of these 70 years. Also, I guess it has something to do with wars.
combined_data <- as.matrix(gap[,3:26])
D <- as.matrix(dist(combined_data,upper = T,diag = T))
mds <- cmdscale(D)
p7 <- ggplot(gap,aes(mds[,1],mds[,2],color = continent))
p7 +geom_point()
This plot is same with the plot of the second PC score vs the first PC score. It does not suprise us because accroding to lecture notes using SVD of column centred matrix X we can easily get that in this case of 2-dimension, it is easy to see that classical MDS is the same as PCA.
Asia_index <- (gap[,1] == "Asia")
Europe_index <- (gap[,1] == "Europe")
Asia_gdp_lifeExp <- gap[Asia_index,c(14,26)]
Europe_gap_lifeExp <- gap[Europe_index,c(14,26)]
Asia_1952 <- gap[Asia_index,c(3,15)]
Europe_1952 <- gap[Europe_index,c(3,15)]
Now we have two samples: Asian gdp and life expectancy, European gdp and life expectancy.
Let x1,…,x33 be the gdp and lifeExp of Asia 2007 and y1,…,y30 be the gdp and lifeExp of Europe 2007. Let mu1 and mu2 be the population means for Asia 2007 and Europe 2007 repectively. Our hypotheis are H0: mu1 = mu2 vs H1: mu1 not equal to mu2 It is reasonable to assume that x1,…,x33 follow 2-dimensional MVN with mean mu1 and variance sigma1, y1,…,y30 follow 2-dimensional MVN with mean mu2 and variance sigma2.
HotellingsT2(Asia_gdp_lifeExp,Europe_gap_lifeExp)
##
## Hotelling's two sample T2-test
##
## data: Asia_gdp_lifeExp and Europe_gap_lifeExp
## T.2 = 12.681, df1 = 2, df2 = 60, p-value = 2.55e-05
## alternative hypothesis: true location difference is not equal to c(0,0)
HotellingsT2(Asia_1952,Europe_1952)
##
## Hotelling's two sample T2-test
##
## data: Asia_1952 and Europe_1952
## T.2 = 39.347, df1 = 2, df2 = 60, p-value = 1.21e-11
## alternative hypothesis: true location difference is not equal to c(0,0)
From the output, we can see that the test statistic 12.681 is quite large and p-value is small. Therefore, we can reject the null hypothesis at 5% level. There was a significantly difference between the mean log(GDP) and life expentancy of Asia and European countries in 2007. From the second Hotelling test, the test statistic 39.347 was even bigger. Thus, the continents were more dissimilar in 1952.
set.seed(5)
test.ind <- sample(142,size = 47)
gap.test <- gap.raw[test.ind,c(1,3:26)]
gap.train <- gap.raw[-test.ind,c(1,3:26)]
lda1 <- lda(continent~.,gap.train)
test.ldapred <- predict(lda1,gap.test)
(accuracy <- sum(test.ldapred$class == gap.test$continent))/47
## [1] 0.7021277
Since we are predicting continent using log(GDP) and lifeExp, we can discard the column of countries and then make predictions.
From the output we can see that the predicted accuracy is 0.70.
par(pty = "s")
gap1 <- gap.raw[,c(1,3:26)]
gap1$continent <- factor(gap1$continent)
lda2 <- lda(continent~.,gap1)
plot(lda2,col = as.integer(gap1$continent)+1,abbrev=1)
There are multiple plots of each pair of LD. We only pay attention to the plot of LD1 and LD2 is the 2-dimensional LDA projection. The two plots are similar expect that LDA did better gob separating the populations.
##Clustering
gap2 <- gap.raw[,c(3:26)]
fviz_nbclust(gap2,kmeans,method = "wss")
set.seed(6)
(gap.k <- kmeans(gap2,centers = 4,nstart = 25))
## K-means clustering with 4 clusters of sizes 1, 36, 78, 27
##
## Cluster means:
## gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
## 1 108382.353 113523.133 95458.112 80894.883 109347.867
## 2 3533.191 4304.229 5001.277 6279.562 7706.466
## 3 1184.869 1295.062 1411.868 1594.969 1803.396
## 4 7444.156 8926.881 10571.688 12863.409 16070.601
## gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 gdpPercap_1992 gdpPercap_1997
## 1 59265.477 31354.036 28118.430 34932.920 40300.620
## 2 9208.509 9290.909 9429.554 8971.628 10076.218
## 3 1947.249 2049.339 2030.146 2013.139 2179.727
## 4 18363.421 20074.401 22073.961 23836.518 26583.027
## gdpPercap_2002 gdpPercap_2007 lifeExp_1952 lifeExp_1957 lifeExp_1962
## 1 35110.11 47306.990 55.56500 58.03300 60.47000
## 2 10817.79 13287.942 54.52628 57.69869 60.11186
## 3 2348.14 2811.331 40.60476 43.00456 45.18617
## 4 29652.92 33837.534 65.94444 67.57441 69.01833
## lifeExp_1967 lifeExp_1972 lifeExp_1977 lifeExp_1982 lifeExp_1987 lifeExp_1992
## 1 64.62400 67.71200 69.34300 71.30900 74.17400 75.19000
## 2 62.12850 64.20281 66.14847 67.83942 69.43611 70.24906
## 3 47.56035 49.78947 51.85239 54.14127 56.04433 56.96547
## 4 70.19852 71.23470 72.73289 74.11730 75.21700 76.41870
## lifeExp_1997 lifeExp_2002 lifeExp_2007
## 1 76.15600 76.90400 77.58800
## 2 71.05228 71.63039 72.60031
## 3 57.73618 58.29614 59.88586
## 4 77.57867 78.74007 79.73178
##
## Clustering vector:
## [1] 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 2 3 3 3 3 2 3 3 3 3 3
## [38] 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 2 3 2 4 2 3 2 2 3 2 3 3 3 3 2 2 3 2 3 2 2 2
## [75] 4 2 2 3 4 3 3 3 4 3 3 2 2 4 4 3 3 2 1 2 2 3 3 3 2 3 3 4 4 3 3 2 3 3 3 3 3
## [112] 4 4 3 2 2 2 4 4 4 4 4 2 4 4 4 2 4 4 2 2 2 2 2 4 4 4 4 3 4 4 4
##
## Within cluster sum of squares by cluster:
## [1] 0 5432255272 1927759742 7075566202
## (between_SS / total_SS = 90.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
fviz_cluster(gap.k,data = gap2,geom = "point",)
From the elbow plot we can see that the optimal choice is k=4. Also, by implementing different clusters,the overlap is the least when k=4. It is worth noting that there is an outlier in the data which forms a cluster only with itself in it.
gap.scaled <- gap.raw
gap.scaled[,3:26] <- scale(gap[,3:26])
gap3 <- as.matrix(gap.scaled[,3:26])
D <- dist(gap3,method = "euclidean",diag = TRUE)
D.sl <- hclust(D,method = "single")
cutree(D.sl,k=4)
## [1] 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
D.cl <- hclust(D,method = "complete")
cutree(D.cl,k=4)
## [1] 1 2 2 1 2 2 2 2 2 3 2 1 2 2 1 2 2 2 1 2 3 2 2 3 3 2 1 2 2 2 3 1 1 2 1 2 2
## [38] 1 2 3 2 2 2 1 2 1 2 3 1 2 2 3 4 1 1 4 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 4 1
## [75] 4 1 4 2 4 2 2 3 4 3 3 1 1 4 4 1 1 1 4 1 1 3 3 2 1 3 1 1 4 1 1 1 1 3 1 2 1
## [112] 4 4 1 1 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 1 1 4 4 4 4 4 1 4 4 4
D.ga <- hclust(D,method = "average")
cutree(D.ga,k=4)
## [1] 1 2 2 1 2 2 2 2 2 2 2 1 2 1 1 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 3 1 2 1 2 2
## [38] 3 2 1 2 2 2 1 2 1 2 2 1 2 2 2 3 1 3 3 3 3 3 3 1 3 1 1 2 1 3 3 1 3 3 1 3 3
## [75] 3 3 3 2 3 2 2 1 3 2 1 1 1 3 3 1 1 3 4 3 3 1 2 2 1 1 1 1 3 1 1 3 1 1 1 2 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3
plot(D.cl)
plot(D.ga)
By comparison, LA and Cl did a great job sepearating the populations which Sl did not.
When I used 4 clusters, hierarchical clustering and K-means both separated the data nicely. Also, they both put the outlier in a different cluster.
There are 5 continents in the data and the optimal cluster number is 4 so far. Maybe because some countries in different continents share similar lifeExp and GDP data, they are not separated well.
Since I don not know how to labe the points in the plots by their continents, my best guess is the countries do not naturally cluster by their continent since their are 5 continents and the optimal number of clusters is 4.
Our aim is to predict the 2007 lifeExp using GDP values. Since the covaraites are the GDP values from 1952 to 2007, they have relatively high correlation. Therefore, ridge regression is the most applicable method here when there are many correlated covariates.
set.seed(100)
gap4.train <- as.matrix(gap.raw[1:95,3:14])
gap4.test <- as.matrix(gap.raw[96:142,3:14])
y.train <- gap.raw[1:95,26]
y.test <- gap.raw[96:142,26]
lambdas <- 10^seq(3,-4,by = -.1)
gap_cv <- cv.glmnet(gap4.train,y.train,alpha = 0,lambda = lambdas)
(optimal.lambda <- gap_cv$lambda.min)
## [1] 1.584893
By observing the cross-validation predictive test, the optimal value of lambda is about 1.58. The following is my ridge regression model.
gap.ridge <- glmnet(gap4.train,y.train,alpha = 0,lambda = 1.58)
predict.y <- predict(gap.ridge,gap4.test)
Now R^2 is usually used to estimate the accuracy.
SSE <- sum((predict.y - y.test)^2)
SST <- sum((y.test - mean(y.test))^2)
(R_square <- 1 - SST / SSE)
## [1] 0.6005922
The R-squared value is 0.60 which stands for its accuracy.