- ASK : What the problem(s) we need to solve ?
- RESEARCH : What data do we need and how do we get it ?
- MODEL : Which method(s) is appropriate to use ?
- VALIDATE : Do the model and assumptions work as expected ?
- TEST : How does the model generalize to real world data ?
- INTEPRETE : How can we use the conclusion in the real world ?
As the data is already given, what we can do is following :
Can be non-parameter as we make no assumption about the functional form of predicted variables – in fact, we don’t even know the response variables.) Supervised or Unsupervised ? As we have no idea about the ‘groups’ or classes’ of variables as in case of classification, we might opt to choose unsupervised method: PCA & Clustering
In this document we follow 4 steps :
require(c('magrittr','reshaped2', 'ggplot2', 'devtools',
'Rcpp', 'dplyr', 'tidyr'))
## c("Loading required package: c", "Loading required package: magrittr", "Loading required package: reshaped2", "Loading required package: ggplot2", "Loading required package: devtools", "Loading required package: Rcpp", "Loading required package: dplyr", "Loading required package: tidyr")
library(reshape2)
library(magrittr) # Package for pipeline %>%
library(ggplot2)
library(devtools)
library(Rcpp)
library(FactoMineR)
# library(rCharts)
# family="sans-serif"
Examine the contents of the CSV file.
Budget <- read.csv("Budget.txt", header=FALSE, stringsAsFactors=FALSE)
colnames(Budget) = c("AN", "PVP", "AGR", "CMI", "TRA", "LOG",
"EDU", "ACS", "ANC", "DEF", "DET", "DIV")
Budget
## AN PVP AGR CMI TRA LOG EDU ACS ANC DEF DET DIV
## 1 1872 18.0 0.5 0.1 6.7 0.5 2.1 2.0 0.0 26.4 41.5 2.1
## 2 1880 14.1 0.8 0.1 15.3 1.9 3.7 0.5 0.0 29.8 31.3 2.5
## 3 1890 13.6 0.7 0.7 6.8 0.6 7.1 0.7 0.0 33.8 34.4 1.7
## 4 1900 14.3 1.7 1.7 6.9 1.2 7.4 0.8 0.0 37.7 26.2 2.2
## 5 1903 10.3 1.5 0.4 9.3 0.6 8.5 0.9 0.0 38.4 27.2 3.0
## 6 1906 13.4 1.4 0.5 8.1 0.7 8.6 1.8 0.0 38.5 25.3 1.9
## 7 1909 13.5 1.1 0.5 9.0 0.6 9.0 3.4 0.0 36.8 23.5 2.6
## 8 1912 12.9 1.4 0.3 9.4 0.6 9.3 4.3 0.0 41.1 19.4 1.3
## 9 1920 12.3 0.3 0.1 11.9 2.4 3.7 1.7 1.9 42.4 23.1 0.2
## 10 1923 7.6 1.2 3.2 5.1 0.6 5.6 1.8 10.0 29.0 35.0 0.9
## 11 1926 10.5 0.3 0.4 4.5 1.8 6.6 2.1 10.1 19.9 41.6 2.3
## 12 1929 10.0 0.6 0.6 9.0 1.0 8.1 3.2 11.8 28.0 25.8 2.0
## 13 1932 10.6 0.8 0.3 8.9 3.0 10.0 6.4 13.4 27.4 19.2 0.0
## 14 1935 8.8 2.6 1.4 7.8 1.4 12.4 6.2 11.3 29.3 18.5 0.4
## 15 1938 10.1 1.1 1.2 5.9 1.4 9.5 6.0 5.9 40.7 18.2 0.0
## 16 1947 15.6 1.6 10.1 11.4 7.6 8.8 4.8 3.4 32.2 4.6 0.0
## 17 1950 11.2 1.3 16.5 12.4 15.8 8.1 4.9 3.4 20.7 4.2 1.5
## 18 1953 12.9 1.5 7.0 7.9 12.1 8.1 5.3 3.9 36.1 5.2 0.0
## 19 1956 10.9 5.3 9.7 7.6 9.6 9.4 8.5 4.6 28.2 6.2 0.0
## 20 1959 13.1 4.4 7.3 5.7 9.8 12.5 8.0 5.0 26.7 7.5 0.0
## 21 1962 12.8 4.7 7.5 6.6 6.8 15.7 9.7 5.3 24.5 6.4 0.1
## 22 1965 12.4 4.3 8.4 9.1 6.0 19.5 10.6 4.7 19.8 3.5 1.8
## 23 1968 11.4 6.0 9.5 5.9 5.0 21.1 10.7 4.2 20.0 4.4 1.9
## 24 1971 12.8 2.8 7.1 8.5 4.0 23.8 11.3 3.7 18.8 7.2 0.0
What is the format of the first line? What does it represent? The first line is a list of variable names (11, AN is for individual names).
What is the format of the following lines? What do they represent? The other lines are the observations (1872 : 1971); each observation is a list of percentage of the budget allocated to each service.
str(Budget)
## 'data.frame': 24 obs. of 12 variables:
## $ AN : int 1872 1880 1890 1900 1903 1906 1909 1912 1920 1923 ...
## $ PVP: num 18 14.1 13.6 14.3 10.3 13.4 13.5 12.9 12.3 7.6 ...
## $ AGR: num 0.5 0.8 0.7 1.7 1.5 1.4 1.1 1.4 0.3 1.2 ...
## $ CMI: num 0.1 0.1 0.7 1.7 0.4 0.5 0.5 0.3 0.1 3.2 ...
## $ TRA: num 6.7 15.3 6.8 6.9 9.3 8.1 9 9.4 11.9 5.1 ...
## $ LOG: num 0.5 1.9 0.6 1.2 0.6 0.7 0.6 0.6 2.4 0.6 ...
## $ EDU: num 2.1 3.7 7.1 7.4 8.5 8.6 9 9.3 3.7 5.6 ...
## $ ACS: num 2 0.5 0.7 0.8 0.9 1.8 3.4 4.3 1.7 1.8 ...
## $ ANC: num 0 0 0 0 0 0 0 0 1.9 10 ...
## $ DEF: num 26.4 29.8 33.8 37.7 38.4 38.5 36.8 41.1 42.4 29 ...
## $ DET: num 41.5 31.3 34.4 26.2 27.2 25.3 23.5 19.4 23.1 35 ...
## $ DIV: num 2.1 2.5 1.7 2.2 3 1.9 2.6 1.3 0.2 0.9 ...
summary(Budget)
## AN PVP AGR CMI
## Min. :1872 Min. : 7.60 Min. :0.300 Min. : 0.100
## 1st Qu.:1908 1st Qu.:10.57 1st Qu.:0.800 1st Qu.: 0.400
## Median :1930 Median :12.60 Median :1.400 Median : 1.300
## Mean :1929 Mean :12.21 Mean :1.996 Mean : 3.942
## 3rd Qu.:1954 3rd Qu.:13.43 3rd Qu.:2.650 3rd Qu.: 7.350
## Max. :1971 Max. :18.00 Max. :6.000 Max. :16.500
## TRA LOG EDU ACS
## Min. : 4.500 Min. : 0.500 Min. : 2.100 Min. : 0.500
## 1st Qu.: 6.675 1st Qu.: 0.675 1st Qu.: 7.325 1st Qu.: 1.800
## Median : 8.000 Median : 1.850 Median : 8.700 Median : 4.550
## Mean : 8.321 Mean : 3.958 Mean : 9.942 Mean : 4.817
## 3rd Qu.: 9.150 3rd Qu.: 6.200 3rd Qu.:10.600 3rd Qu.: 6.800
## Max. :15.300 Max. :15.800 Max. :23.800 Max. :11.300
## ANC DEF DET DIV
## Min. : 0.000 Min. :18.80 Min. : 3.50 Min. :0.000
## 1st Qu.: 0.000 1st Qu.:25.93 1st Qu.: 6.35 1st Qu.:0.000
## Median : 3.800 Median :29.15 Median :19.30 Median :1.400
## Mean : 4.275 Mean :30.26 Mean :19.14 Mean :1.183
## 3rd Qu.: 5.450 3rd Qu.:37.02 3rd Qu.:26.45 3rd Qu.:2.025
## Max. :13.400 Max. :42.40 Max. :41.60 Max. :3.000
The difference between the mean and median (eg. CMI, LOG) indicates that there are a few cases/rows with extreme values that are elevating the mean
Distribution, location and spread of the represented data.
boxplot(Budget[,-1], ylab = 'Percentage of France Budget', xlab = 'Budget Category')
In the above plot, the share of all spending categories is mostly well bellow 20%, while for defense (DEF) and dept refund (DET), it is higher (showing war time and high governmental spending). The box is much higher for DEF and DET than for the others. This means that DEF and DET have larger variability than the others. The three largest observations in education (EDU) are shown as boxplot outliers, i.e. they are more than 1.5 times the inter-quartile range (hight of the box) larger than the 3rd quartile.
Here we consider each variable separately. We consider the histograms as a probability density rather than a frequency. We fit empirical kernel density estimator curves, which give a more-or less smoothed continuous approximation to the histogram. What evidence does the summary give that the distribution is somewhat symmetric? We can observe the following:
par(mfrow=c(3,4),family ='sans')
for (i in 2:length(names(Budget))){
hist((Budget)[,i],main=names(Budget)[i],prob=TRUE,xlab='% of the total Budget')
lines(density(Budget[,i]), col="red")
}
c_d<-names(Budget)
par(mfrow=c(3,4),family ='sans')
for (i in 2:length(names(Budget))){
qqnorm((Budget)[,i], main=paste(c_d[i],'\n', 'vs','\n', 'Normal dist.'), ylab = c_d[i])
qqline((Budget)[,i], col= "green")
}
A heat map is like a table that uses colors instead of numbers to represent the values for the cells, using Pearson Correlation
data(attitude)
head(attitude)
## rating complaints privileges learning raises critical advance
## 1 43 51 30 39 61 92 45
## 2 63 64 51 54 63 73 47
## 3 71 70 68 69 76 86 48
## 4 61 63 45 47 54 84 35
## 5 81 78 56 66 71 83 47
## 6 43 55 49 44 54 49 34
head(Budget[,-1])
## PVP AGR CMI TRA LOG EDU ACS ANC DEF DET DIV
## 1 18.0 0.5 0.1 6.7 0.5 2.1 2.0 0 26.4 41.5 2.1
## 2 14.1 0.8 0.1 15.3 1.9 3.7 0.5 0 29.8 31.3 2.5
## 3 13.6 0.7 0.7 6.8 0.6 7.1 0.7 0 33.8 34.4 1.7
## 4 14.3 1.7 1.7 6.9 1.2 7.4 0.8 0 37.7 26.2 2.2
## 5 10.3 1.5 0.4 9.3 0.6 8.5 0.9 0 38.4 27.2 3.0
## 6 13.4 1.4 0.5 8.1 0.7 8.6 1.8 0 38.5 25.3 1.9
library(ggplot2)
library(reshape2)
qplot(x=Var1, y=Var2, data=melt(cor(Budget[,-1], method = 'pearson')), fill=-value, geom="tile")
From the heatmap table, the correlation between variables is noticeable for ACS (Social action) & AGR (Agriculture), and ACS & EDU (Education)
par(mfrow = c(3,4))
for (i in 2:(ncol(Budget))) {
tb = ts(Budget[,i])
plot(tb, ylab = colnames(Budget[i]))
}
With only simple chart in time series, we can see the upward trend for Education (EDU) and Social Action (ACS) spending (as percentage of budget) over time, even during the economics slow down 1929 - 1932 and the 2 world wars
A principal component analysis of the data can be applied using the ACP function of the FactoMineR library (omitting the AN variable).
Step 1: We examine the eigenvalues to determine how many principal components should be considered:
budget = PCA(Budget[,-1], ncp = 2, graph = FALSE)
summary(budget)
##
## Call:
## PCA(X = Budget[, -1], ncp = 2, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 4.972 2.051 1.290 0.993 0.708 0.558
## % of var. 45.203 18.642 11.729 9.028 6.440 5.074
## Cumulative % of var. 45.203 63.845 75.574 84.602 91.042 96.116
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.204 0.125 0.063 0.035 0.000
## % of var. 1.857 1.138 0.571 0.318 0.000
## Cumulative % of var. 97.973 99.111 99.682 100.000 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr cos2
## 1 | 4.253 | -2.900 7.049 0.465 | 1.024 2.132 0.058 |
## 2 | 4.121 | -2.767 6.416 0.451 | 2.012 8.226 0.238 |
## 3 | 2.727 | -2.416 4.893 0.785 | 0.224 0.102 0.007 |
## 4 | 2.645 | -2.057 3.545 0.605 | 0.755 1.158 0.081 |
## 5 | 3.090 | -2.338 4.581 0.573 | 0.167 0.057 0.003 |
## 6 | 2.356 | -1.985 3.302 0.710 | 0.626 0.796 0.071 |
## 7 | 2.460 | -1.907 3.048 0.601 | 0.812 1.339 0.109 |
## 8 | 2.244 | -1.431 1.715 0.407 | 0.768 1.198 0.117 |
## 9 | 3.227 | -2.139 3.833 0.439 | 0.957 1.859 0.088 |
## 10 | 3.494 | -1.144 1.096 0.107 | -2.883 16.887 0.681 |
##
## Variables (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2
## PVP | -0.173 0.604 0.030 | 0.740 26.686 0.547 |
## AGR | 0.818 13.471 0.670 | 0.005 0.001 0.000 |
## CMI | 0.833 13.957 0.694 | 0.341 5.679 0.116 |
## TRA | -0.137 0.377 0.019 | 0.631 19.402 0.398 |
## LOG | 0.722 10.473 0.521 | 0.398 7.714 0.158 |
## EDU | 0.787 12.449 0.619 | -0.137 0.915 0.019 |
## ACS | 0.933 17.515 0.871 | -0.101 0.496 0.010 |
## ANC | 0.289 1.679 0.083 | -0.807 31.783 0.652 |
## DEF | -0.612 7.538 0.375 | 0.216 2.282 0.047 |
## DET | -0.889 15.890 0.790 | -0.301 4.428 0.091 |
If we take all of these eigenvalues and add them up we get the total variance of 11. The proportion of variation explained by each eigenvalue is given in the third column. For example, 4.972362e+00 divided by the 11 equals 0.4520329, or, about 45% of the variation is explained by this first eigenvalue. The cumulative percentage explained is obtained by adding the successive proportions of variation explained to obtain the running total. Therefore, about 64% of the variation is explained by the first two eigenvalues together.
The first four principal components explain 85% of the variation. This is an acceptably large percentage.
A Scree Plot may serve as another indicator of how many eigenvalues to consider. With the eigenvalues ordered from largest to the smallest, a scree plot is the plot of λi versus i. The number of component is determined at the point, beyond which the remaining eigenvalues are all relatively small and of comparable size.
plot(budget[[1]]$`cumulative percentage of variance`, type = 'b',
ylab = 'Variances', xlab = "Number of Components")
How many components?
Step 2: To interpret each component, we must compute the correlations between the original data for each variable and each principal component.
plot(budget, choix='var', title='PCA graph: Variances', axes=c(1:2))
The function dimdesc() allows to see which variables the axes are the most linked to: which variables are the most correlated to each axis (here a correlation value above 0.5 is deemed important).
First Principal Component Analysis - Dim1
dimdesc(budget, axes = 1)
## $Dim.1
## $Dim.1$quanti
## correlation p.value
## ACS 0.9332323 2.968369e-11
## CMI 0.8330641 4.369668e-07
## AGR 0.8184263 1.024506e-06
## EDU 0.7867728 5.121852e-06
## LOG 0.7216397 6.888506e-05
## DIV -0.5483342 5.532800e-03
## DEF -0.6122032 1.474803e-03
## DET -0.8888848 6.516706e-09
The first principal component is strongly correlated with 8 of the original variables. The first principal component increases with increasing ACS, CMI, AGR, EDU and LOG scores and with decreasing DIV, DEF and DET. This suggests that the five first criteria vary together. Furthermore, we see that the first principal component correlates most strongly with the ACS.
Second Principal Component Analysis - Dim2
dimdesc(budget, axes = 2)
## $Dim.2
## $Dim.2$quanti
## correlation p.value
## PVP 0.7397530 3.608892e-05
## TRA 0.6307605 9.514279e-04
## ANC -0.8073139 1.863188e-06
The second principal component increases with increasing PVP and TRA and with decreasing ANC.
plot(budget, choix='ind', title='PCA graph: Individuals', axes=c(1:2))
We may conclude that : 1968, 1971, 1956, 1962, 1965 and 1959 have a very high value for the first principal component and we would expect these years to have high values for the ACS, CMI, AGR, EDU and LOG and low values for DIV, DEF and DET.
Whereas if you look at 1872, 1880, 1890, 1920, 1903, 1906, 1900, 1909 we would expect to have high values for DIV, DEF and DET and low values for ACS, CMI, AGR, EDU and LOG.
1923, 1926, 1935, 1932, 1929 have a low value for the second component. So we would expect that these years to have low values for PVP and TRA. And conversely if you were to look at 1947 and 1950, these years would have high values for PVP and TRA.
2 component solution
Is 3rd component useful or not? It seems not very useful as we look at the following plot
plot(PCA(Budget[,-1], ncp = 3, graph = FALSE), choix='ind',
title='PCA graph: Individuals', axes=c(1,3))
km5=kmeans(Budget[,-1],4,nstart=50)
km4=kmeans(Budget[,-1],4,nstart=50)
km3=kmeans(Budget[,-1],3,nstart=50)
km2=kmeans(Budget[,-1],2,nstart=50)
par(mfrow = c(2,2))
plot(Budget[,1], km2$cluster, xlab = 'Year', ylab = '2 Clusters')
plot(Budget[,1], km3$cluster, xlab = 'Year', ylab = '3 Clusters')
plot(Budget[,1], km4$cluster, xlab = 'Year', ylab = '4 Clusters')
plot(Budget[,1], km5$cluster, xlab = 'Year', ylab = '5 Clusters')
From the plots, we can see K-means (with number of clusters K = 3) produced the same group of years as PCA (with the first component)
# Create a matrix with sd of each variables
N<-matrix(rep(apply(Budget[,-1],2,sd),24), nrow=24)
head(N)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 2.238267 4.585603 4.271841 3.482087 7.466733 1.047841 1.681221
## [2,] 1.681221 2.520866 5.335600 4.244203 12.455972 2.238267 4.585603
## [3,] 4.585603 4.271841 3.482087 7.466733 1.047841 1.681221 2.520866
## [4,] 2.520866 5.335600 4.244203 12.455972 2.238267 4.585603 4.271841
## [5,] 4.271841 3.482087 7.466733 1.047841 1.681221 2.520866 5.335600
## [6,] 5.335600 4.244203 12.455972 2.238267 4.585603 4.271841 3.482087
## [,8] [,9] [,10] [,11]
## [1,] 2.520866 5.335600 4.244203 12.455972
## [2,] 4.271841 3.482087 7.466733 1.047841
## [3,] 5.335600 4.244203 12.455972 2.238267
## [4,] 3.482087 7.466733 1.047841 1.681221
## [5,] 4.244203 12.455972 2.238267 4.585603
## [6,] 7.466733 1.047841 1.681221 2.520866
# Adjusted for sd with 1/n
N<- (24-1)/23*(1/N)
# Create a matrix with mean of each variables
M<-matrix(rep(apply(Budget[,-1],2,mean),24),nrow=24)
# Centered data
Mprime<-Budget[,-1]-M
# Normalized
M_n<-Mprime*N
# Hierarchical ascending clustering
my_H_clust<-hclust(dist(M_n), method ="ward.D2")
my_H_clust1 = hclust(dist(M_n), method='complete')
# Plot data
my_H_clust
##
## Call:
## hclust(d = dist(M_n), method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 24
par(mfrow=c(1,1))
plot(my_H_clust, labels = Budget[,1])
plot(my_H_clust1, labels = Budget[,1])