Coursework

Preparations

The corresoponding packages are already loaded. Now Load the data into R using the commands

gap.raw <- read.csv('gap.csv')
gap <- gap.raw
gap[,3:14]<- log(gap.raw[,3:14])

Note that for GDP per capita, it is best to work with log(GDP) when doing statistical analysis, as the values vary over several orders of magnitude between countries. For ease of plotting, it may be useful to split the data into two data frames, one containing GDP per capita, and the other life expectancy data

gdp <- exp(gap[,3:14])
years <- seq(1952, 2007,5)
colnames(gdp) <- years
rownames(gdp) <- gap[,2]
lifeExp <- gap[,15:26]
colnames(lifeExp) <- years
rownames(lifeExp) <- gap[,2]

Basic exploratory data analysis plots

Since countries or continents are not specified, we should consider the change of all the countries over the past 70 years. So gdp and life expectancy are taken average to see their changes.

gdp_mean <- colMeans(gdp)
lifeExp_mean <- colMeans(lifeExp)
plot(years,gdp_mean,xlab = "years")

plot(years,lifeExp_mean,xlab = "years")

From the plots we can see that both gdp and life expectancy have been increasing steadily for the past 70 years on average worldwide.

Princial component analysis

Since the varaibles are measuring similar entities,PCR using S should be enough.

pca_gdp <- prcomp(gdp)
pca_lifeExp <- prcomp(lifeExp)

summary(pca_gdp)

## Importance of components:
##                              PC1       PC2       PC3       PC4       PC5
## Standard deviation     2.977e+04 1.411e+04 4.386e+03 1631.9705 1.321e+03
## Proportion of Variance 7.976e-01 1.791e-01 1.732e-02    0.0024 1.570e-03
## Cumulative Proportion  7.976e-01 9.768e-01 9.941e-01    0.9965 9.980e-01
##                              PC6       PC7       PC8       PC9      PC10
## Standard deviation     895.62687 735.23148 541.92310 512.29981 390.94375
## Proportion of Variance   0.00072   0.00049   0.00026   0.00024   0.00014
## Cumulative Proportion    0.99876   0.99925   0.99951   0.99975   0.99989
##                             PC11      PC12
## Standard deviation     310.89089 174.92261
## Proportion of Variance   0.00009   0.00003
## Cumulative Proportion    0.99997   1.00000

summary(pca_lifeExp)

## Importance of components:
##                            PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     38.6350 9.83955 4.50873 2.52549 1.39085 1.27733 1.01896
## Proportion of Variance  0.9203 0.05969 0.01253 0.00393 0.00119 0.00101 0.00064
## Cumulative Proportion   0.9203 0.97994 0.99247 0.99641 0.99760 0.99860 0.99924
##                            PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.79691 0.48377 0.43319 0.33836 0.23486
## Proportion of Variance 0.00039 0.00014 0.00012 0.00007 0.00003
## Cumulative Proportion  0.99964 0.99978 0.99990 0.99997 1.00000

plot(pca_gdp)

plot(pca_lifeExp)

From the outputs of summary, we can see from the second rows that how much the principal components of gdp and life expectancy explain. For gdp, the proportions are 79.76%,17.91%,…; For life expectancy, the proportions are 92.03%,5.969%,…

For both cases, we should consider the first two principal components.Because they all explain over 95% which is enough to explain the variation.

pca_gdp$rotation[,1:3]

##            PC1         PC2         PC3
## 1952 0.2353456  0.42731132 -0.26028080
## 1957 0.2578710  0.43359999 -0.21726232
## 1962 0.2463384  0.32565937 -0.06460943
## 1967 0.2478510  0.22651812  0.14640562
## 1972 0.3226684  0.31205792  0.15316427
## 1977 0.2692097  0.03960993  0.49996163
## 1982 0.2405693 -0.12731538  0.51102935
## 1987 0.2570970 -0.19517312  0.27158722
## 1992 0.2848019 -0.20793971  0.05488671
## 1997 0.3197645 -0.24277083 -0.16052462
## 2002 0.3397718 -0.32093010 -0.26194321
## 2007 0.3957711 -0.33705789 -0.39206569

pca_lifeExp$rotation[,1:3]

##            PC1         PC2         PC3
## 1952 0.2995501  0.34902226 -0.33190727
## 1957 0.3041007  0.32308013 -0.23604180
## 1962 0.3024818  0.29646127 -0.16795890
## 1967 0.2967688  0.23213961 -0.03516382
## 1972 0.2898377  0.16609597  0.13016468
## 1977 0.2850260  0.10290633  0.28373569
## 1982 0.2751025  0.02102816  0.33368816
## 1987 0.2681294 -0.06709016  0.38555489
## 1992 0.2777391 -0.22660990  0.42086149
## 1997 0.2835186 -0.35942503  0.04474640
## 2002 0.2937106 -0.44973534 -0.32991736
## 2007 0.2856875 -0.45398620 -0.39906629

gdp: The first PC gives increasing weight to 1952-2007, which means that 1952 explains the least variation and 2007 explains the most variation. And the values are all positive thus high marks on 1952-2007 will have a higher value on y1. High value represents a higher gdp level. The second PC represents a contrast between 1952-1977 and 1982-2007. For instance, a large negative value for y2 implies the country did much better in 1987-2007 than 1952-1977, and a large positive value implies the opposite. Low value represents they made progress in gdp during 1982-2007 compared to 1952-1977.

life expentancy: The first PC gives approximately equal weigt to 1952-2007, which means that they all explain equal variation. And the values are all positive thus high marks on 1952-2007 will have a higher value on y1. High value represents a higher life expectancy level. The second PC represents a contrast between 1952-1982 and 1987-2007. Similar with the second PC of gdp. low value represents they made progress in life expectancy during 1987-2007 compared with 1952-1982

gap$gdp_PC1 <- pca_gdp$x[,1]
gap$gdp_PC2 <- pca_gdp$x[,2]
gap$gdp_PC3 <- pca_gdp$x[,3]
p1 <- ggplot(gap,aes(gdp_PC1,gdp_PC2,label = country,color = continent))
p1+ geom_point()+geom_text()

p2 <- ggplot(gap,aes(gdp_PC1,gdp_PC3,label = country,color = continent))
p2+ geom_point()+geom_text()

p3 <- ggplot(gap,aes(gdp_PC2,gdp_PC3,label = country,color = continent))
p3+ geom_point()+geom_text()

From the fisrt plot, we can see that Kuwait has oddly large PC1 and PC2,which means that it generally has high gdp but it has been going down quickly approximately since 1980s. We can also observe that Saudi Arabia has high PC3 scores. From the 3rd loading, we guess it has rather high level of gdp in the middle parts of these 70 years but has relatively low gdp at both ends.

I do not know history very well about Kuwait but I guess it is running out of petroleum. Saudi Arabia might suffer from economic recession approximately in 1990s.

gap$lifeExp_PC1 <- pca_lifeExp$x[,1]
gap$lifeExp_PC2 <- pca_lifeExp$x[,2]
gap$lifeExp_PC3 <- pca_lifeExp$x[,3]
p4 <- ggplot(gap,aes(lifeExp_PC1,lifeExp_PC2,label = country,color = continent))
p4+ geom_point()+geom_text()

p5 <- ggplot(gap,aes(lifeExp_PC1,lifeExp_PC3,label = country,color = continent))
p5+ geom_point()+geom_text()

p6 <- ggplot(gap,aes(lifeExp_PC2,lifeExp_PC3,label = country,color = continent))
p6+ geom_point()+geom_text()

Zimbabwe has high PC3 and PC2 but low PC1, which means that (from loadings) it had high life expectancy around 1970s but did bad in other years. I guess maybe because of war after 1970s.

Rwanda has low PC1 and PC3 values but high PC2,which means their life expectancy is relatively low generally but had its highlight in the early stage of these 70 years. Also, I guess it has something to do with wars.

Multidimensional scaling

combined_data <- as.matrix(gap[,3:26])
D <- as.matrix(dist(combined_data,upper = T,diag = T))
mds <- cmdscale(D)
p7 <- ggplot(gap,aes(mds[,1],mds[,2],color = continent))
p7 +geom_point()

This plot is same with the plot of the second PC score vs the first PC score. It does not suprise us because accroding to lecture notes using SVD of column centred matrix X we can easily get that in this case of 2-dimension, it is easy to see that classical MDS is the same as PCA.

Hypothesis testing

Asia_index <- (gap[,1] == "Asia")
Europe_index <- (gap[,1] == "Europe")
Asia_gdp_lifeExp <- gap[Asia_index,c(14,26)]
Europe_gap_lifeExp <- gap[Europe_index,c(14,26)]
Asia_1952 <- gap[Asia_index,c(3,15)]
Europe_1952 <- gap[Europe_index,c(3,15)]

Now we have two samples: Asian gdp and life expectancy, European gdp and life expectancy.

Let x1,…,x33 be the gdp and lifeExp of Asia 2007 and y1,…,y30 be the gdp and lifeExp of Europe 2007. Let mu1 and mu2 be the population means for Asia 2007 and Europe 2007 repectively. Our hypotheis are H0: mu1 = mu2 vs H1: mu1 not equal to mu2 It is reasonable to assume that x1,…,x33 follow 2-dimensional MVN with mean mu1 and variance sigma1, y1,…,y30 follow 2-dimensional MVN with mean mu2 and variance sigma2.

HotellingsT2(Asia_gdp_lifeExp,Europe_gap_lifeExp)

## 
##  Hotelling's two sample T2-test
## 
## data:  Asia_gdp_lifeExp and Europe_gap_lifeExp
## T.2 = 12.681, df1 = 2, df2 = 60, p-value = 2.55e-05
## alternative hypothesis: true location difference is not equal to c(0,0)

HotellingsT2(Asia_1952,Europe_1952)

## 
##  Hotelling's two sample T2-test
## 
## data:  Asia_1952 and Europe_1952
## T.2 = 39.347, df1 = 2, df2 = 60, p-value = 1.21e-11
## alternative hypothesis: true location difference is not equal to c(0,0)

From the output, we can see that the test statistic 12.681 is quite large and p-value is small. Therefore, we can reject the null hypothesis at 5% level. There was a significantly difference between the mean log(GDP) and life expentancy of Asia and European countries in 2007. From the second Hotelling test, the test statistic 39.347 was even bigger. Thus, the continents were more dissimilar in 1952.

Linear discriminant analysis

set.seed(5)
test.ind <- sample(142,size = 47)
gap.test <- gap.raw[test.ind,c(1,3:26)]
gap.train <- gap.raw[-test.ind,c(1,3:26)]

lda1 <- lda(continent~.,gap.train)
test.ldapred <- predict(lda1,gap.test)
(accuracy <- sum(test.ldapred$class == gap.test$continent))/47

## [1] 0.7021277

Since we are predicting continent using log(GDP) and lifeExp, we can discard the column of countries and then make predictions.

From the output we can see that the predicted accuracy is 0.70.

par(pty = "s")
gap1 <- gap.raw[,c(1,3:26)]
gap1$continent <- factor(gap1$continent)
lda2 <- lda(continent~.,gap1)
plot(lda2,col = as.integer(gap1$continent)+1,abbrev=1)

There are multiple plots of each pair of LD. We only pay attention to the plot of LD1 and LD2 is the 2-dimensional LDA projection. The two plots are similar expect that LDA did better gob separating the populations.

##Clustering

gap2 <- gap.raw[,c(3:26)]
fviz_nbclust(gap2,kmeans,method = "wss")

set.seed(6)
(gap.k <- kmeans(gap2,centers = 4,nstart = 25))

## K-means clustering with 4 clusters of sizes 1, 36, 78, 27
## 
## Cluster means:
##   gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967 gdpPercap_1972
## 1     108382.353     113523.133      95458.112      80894.883     109347.867
## 2       3533.191       4304.229       5001.277       6279.562       7706.466
## 3       1184.869       1295.062       1411.868       1594.969       1803.396
## 4       7444.156       8926.881      10571.688      12863.409      16070.601
##   gdpPercap_1977 gdpPercap_1982 gdpPercap_1987 gdpPercap_1992 gdpPercap_1997
## 1      59265.477      31354.036      28118.430      34932.920      40300.620
## 2       9208.509       9290.909       9429.554       8971.628      10076.218
## 3       1947.249       2049.339       2030.146       2013.139       2179.727
## 4      18363.421      20074.401      22073.961      23836.518      26583.027
##   gdpPercap_2002 gdpPercap_2007 lifeExp_1952 lifeExp_1957 lifeExp_1962
## 1       35110.11      47306.990     55.56500     58.03300     60.47000
## 2       10817.79      13287.942     54.52628     57.69869     60.11186
## 3        2348.14       2811.331     40.60476     43.00456     45.18617
## 4       29652.92      33837.534     65.94444     67.57441     69.01833
##   lifeExp_1967 lifeExp_1972 lifeExp_1977 lifeExp_1982 lifeExp_1987 lifeExp_1992
## 1     64.62400     67.71200     69.34300     71.30900     74.17400     75.19000
## 2     62.12850     64.20281     66.14847     67.83942     69.43611     70.24906
## 3     47.56035     49.78947     51.85239     54.14127     56.04433     56.96547
## 4     70.19852     71.23470     72.73289     74.11730     75.21700     76.41870
##   lifeExp_1997 lifeExp_2002 lifeExp_2007
## 1     76.15600     76.90400     77.58800
## 2     71.05228     71.63039     72.60031
## 3     57.73618     58.29614     59.88586
## 4     77.57867     78.74007     79.73178
## 
## Clustering vector:
##   [1] 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 2 3 3 3 3 2 3 3 3 3 3
##  [38] 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 2 3 2 4 2 3 2 2 3 2 3 3 3 3 2 2 3 2 3 2 2 2
##  [75] 4 2 2 3 4 3 3 3 4 3 3 2 2 4 4 3 3 2 1 2 2 3 3 3 2 3 3 4 4 3 3 2 3 3 3 3 3
## [112] 4 4 3 2 2 2 4 4 4 4 4 2 4 4 4 2 4 4 2 2 2 2 2 4 4 4 4 3 4 4 4
## 
## Within cluster sum of squares by cluster:
## [1]          0 5432255272 1927759742 7075566202
##  (between_SS / total_SS =  90.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

fviz_cluster(gap.k,data = gap2,geom = "point",)

From the elbow plot we can see that the optimal choice is k=4. Also, by implementing different clusters,the overlap is the least when k=4. It is worth noting that there is an outlier in the data which forms a cluster only with itself in it.

gap.scaled <- gap.raw
gap.scaled[,3:26] <- scale(gap[,3:26])
gap3 <- as.matrix(gap.scaled[,3:26])
D <- dist(gap3,method = "euclidean",diag = TRUE)
D.sl <- hclust(D,method = "single")
cutree(D.sl,k=4)

##   [1] 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

D.cl <- hclust(D,method = "complete")
cutree(D.cl,k=4)

##   [1] 1 2 2 1 2 2 2 2 2 3 2 1 2 2 1 2 2 2 1 2 3 2 2 3 3 2 1 2 2 2 3 1 1 2 1 2 2
##  [38] 1 2 3 2 2 2 1 2 1 2 3 1 2 2 3 4 1 1 4 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 4 1
##  [75] 4 1 4 2 4 2 2 3 4 3 3 1 1 4 4 1 1 1 4 1 1 3 3 2 1 3 1 1 4 1 1 1 1 3 1 2 1
## [112] 4 4 1 1 4 4 4 4 4 4 4 4 4 4 4 1 4 4 4 4 1 1 4 4 4 4 4 1 4 4 4

D.ga <- hclust(D,method = "average")
cutree(D.ga,k=4)

##   [1] 1 2 2 1 2 2 2 2 2 2 2 1 2 1 1 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 3 1 2 1 2 2
##  [38] 3 2 1 2 2 2 1 2 1 2 2 1 2 2 2 3 1 3 3 3 3 3 3 1 3 1 1 2 1 3 3 1 3 3 1 3 3
##  [75] 3 3 3 2 3 2 2 1 3 2 1 1 1 3 3 1 1 3 4 3 3 1 2 2 1 1 1 1 3 1 1 3 1 1 1 2 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3

plot(D.cl)

plot(D.ga)

By comparison, LA and Cl did a great job sepearating the populations which Sl did not.

When I used 4 clusters, hierarchical clustering and K-means both separated the data nicely. Also, they both put the outlier in a different cluster.

There are 5 continents in the data and the optimal cluster number is 4 so far. Maybe because some countries in different continents share similar lifeExp and GDP data, they are not separated well.

Since I don not know how to labe the points in the plots by their continents, my best guess is the countries do not naturally cluster by their continent since their are 5 continents and the optimal number of clusters is 4.

Linear regression

Our aim is to predict the 2007 lifeExp using GDP values. Since the covaraites are the GDP values from 1952 to 2007, they have relatively high correlation. Therefore, ridge regression is the most applicable method here when there are many correlated covariates.

set.seed(100)
gap4.train <- as.matrix(gap.raw[1:95,3:14])
gap4.test <- as.matrix(gap.raw[96:142,3:14])
y.train <- gap.raw[1:95,26]
y.test <- gap.raw[96:142,26]
lambdas <- 10^seq(3,-4,by = -.1)
gap_cv <- cv.glmnet(gap4.train,y.train,alpha = 0,lambda = lambdas)
(optimal.lambda <- gap_cv$lambda.min)

## [1] 1.584893

By observing the cross-validation predictive test, the optimal value of lambda is about 1.58. The following is my ridge regression model.

gap.ridge <- glmnet(gap4.train,y.train,alpha = 0,lambda = 1.58)
predict.y <- predict(gap.ridge,gap4.test)

Now R^2 is usually used to estimate the accuracy.

SSE <- sum((predict.y - y.test)^2)
SST <- sum((y.test - mean(y.test))^2)
(R_square <- 1 - SST / SSE)

## [1] 0.6005922

The R-squared value is 0.60 which stands for its accuracy.