Data science cycle control cycle

  1. ASK : What the problem(s) we need to solve ?
  2. RESEARCH : What data do we need and how do we get it ?
  3. MODEL : Which method(s) is appropriate to use ?
  4. VALIDATE : Do the model and assumptions work as expected ?
  5. TEST : How does the model generalize to real world data ?
  6. INTEPRETE : How can we use the conclusion in the real world ?

As the data is already given, what we can do is following :

Can be non-parameter as we make no assumption about the functional form of predicted variables – in fact, we don’t even know the response variables.) Supervised or Unsupervised ? As we have no idea about the ‘groups’ or classes’ of variables as in case of classification, we might opt to choose unsupervised method: PCA & Clustering

In this document we follow 4 steps :

Descriptive Analysis

1. Importing necessary packages

require(c('magrittr','reshaped2', 'ggplot2', 'devtools', 
          'Rcpp', 'dplyr', 'tidyr'))
## c("Loading required package: c", "Loading required package: magrittr", "Loading required package: reshaped2", "Loading required package: ggplot2", "Loading required package: devtools", "Loading required package: Rcpp", "Loading required package: dplyr", "Loading required package: tidyr")
library(reshape2)
library(magrittr)  # Package for pipeline %>%
library(ggplot2)
library(devtools)
library(Rcpp)
library(FactoMineR)

# library(rCharts)
# family="sans-serif"

2. Data structure

Examine the contents of the CSV file.

Budget <- read.csv("Budget.txt", header=FALSE, stringsAsFactors=FALSE)
colnames(Budget) = c("AN", "PVP", "AGR", "CMI", "TRA", "LOG", 
                     "EDU", "ACS", "ANC", "DEF", "DET", "DIV")
Budget
##      AN  PVP AGR  CMI  TRA  LOG  EDU  ACS  ANC  DEF  DET DIV
## 1  1872 18.0 0.5  0.1  6.7  0.5  2.1  2.0  0.0 26.4 41.5 2.1
## 2  1880 14.1 0.8  0.1 15.3  1.9  3.7  0.5  0.0 29.8 31.3 2.5
## 3  1890 13.6 0.7  0.7  6.8  0.6  7.1  0.7  0.0 33.8 34.4 1.7
## 4  1900 14.3 1.7  1.7  6.9  1.2  7.4  0.8  0.0 37.7 26.2 2.2
## 5  1903 10.3 1.5  0.4  9.3  0.6  8.5  0.9  0.0 38.4 27.2 3.0
## 6  1906 13.4 1.4  0.5  8.1  0.7  8.6  1.8  0.0 38.5 25.3 1.9
## 7  1909 13.5 1.1  0.5  9.0  0.6  9.0  3.4  0.0 36.8 23.5 2.6
## 8  1912 12.9 1.4  0.3  9.4  0.6  9.3  4.3  0.0 41.1 19.4 1.3
## 9  1920 12.3 0.3  0.1 11.9  2.4  3.7  1.7  1.9 42.4 23.1 0.2
## 10 1923  7.6 1.2  3.2  5.1  0.6  5.6  1.8 10.0 29.0 35.0 0.9
## 11 1926 10.5 0.3  0.4  4.5  1.8  6.6  2.1 10.1 19.9 41.6 2.3
## 12 1929 10.0 0.6  0.6  9.0  1.0  8.1  3.2 11.8 28.0 25.8 2.0
## 13 1932 10.6 0.8  0.3  8.9  3.0 10.0  6.4 13.4 27.4 19.2 0.0
## 14 1935  8.8 2.6  1.4  7.8  1.4 12.4  6.2 11.3 29.3 18.5 0.4
## 15 1938 10.1 1.1  1.2  5.9  1.4  9.5  6.0  5.9 40.7 18.2 0.0
## 16 1947 15.6 1.6 10.1 11.4  7.6  8.8  4.8  3.4 32.2  4.6 0.0
## 17 1950 11.2 1.3 16.5 12.4 15.8  8.1  4.9  3.4 20.7  4.2 1.5
## 18 1953 12.9 1.5  7.0  7.9 12.1  8.1  5.3  3.9 36.1  5.2 0.0
## 19 1956 10.9 5.3  9.7  7.6  9.6  9.4  8.5  4.6 28.2  6.2 0.0
## 20 1959 13.1 4.4  7.3  5.7  9.8 12.5  8.0  5.0 26.7  7.5 0.0
## 21 1962 12.8 4.7  7.5  6.6  6.8 15.7  9.7  5.3 24.5  6.4 0.1
## 22 1965 12.4 4.3  8.4  9.1  6.0 19.5 10.6  4.7 19.8  3.5 1.8
## 23 1968 11.4 6.0  9.5  5.9  5.0 21.1 10.7  4.2 20.0  4.4 1.9
## 24 1971 12.8 2.8  7.1  8.5  4.0 23.8 11.3  3.7 18.8  7.2 0.0

What is the format of the first line? What does it represent? The first line is a list of variable names (11, AN is for individual names).

  • AN : year
  • PVP : authorities
  • AGR : agriculture
  • CMI : trades and companies
  • TRA : work
  • LOG : accommodations
  • EDU : education
  • ACS : social action
  • ANC : veterans
  • DEF : defense
  • DET : debt refund
  • DIV : various

What is the format of the following lines? What do they represent? The other lines are the observations (1872 : 1971); each observation is a list of percentage of the budget allocated to each service.

3. Summary

str(Budget)
## 'data.frame':    24 obs. of  12 variables:
##  $ AN : int  1872 1880 1890 1900 1903 1906 1909 1912 1920 1923 ...
##  $ PVP: num  18 14.1 13.6 14.3 10.3 13.4 13.5 12.9 12.3 7.6 ...
##  $ AGR: num  0.5 0.8 0.7 1.7 1.5 1.4 1.1 1.4 0.3 1.2 ...
##  $ CMI: num  0.1 0.1 0.7 1.7 0.4 0.5 0.5 0.3 0.1 3.2 ...
##  $ TRA: num  6.7 15.3 6.8 6.9 9.3 8.1 9 9.4 11.9 5.1 ...
##  $ LOG: num  0.5 1.9 0.6 1.2 0.6 0.7 0.6 0.6 2.4 0.6 ...
##  $ EDU: num  2.1 3.7 7.1 7.4 8.5 8.6 9 9.3 3.7 5.6 ...
##  $ ACS: num  2 0.5 0.7 0.8 0.9 1.8 3.4 4.3 1.7 1.8 ...
##  $ ANC: num  0 0 0 0 0 0 0 0 1.9 10 ...
##  $ DEF: num  26.4 29.8 33.8 37.7 38.4 38.5 36.8 41.1 42.4 29 ...
##  $ DET: num  41.5 31.3 34.4 26.2 27.2 25.3 23.5 19.4 23.1 35 ...
##  $ DIV: num  2.1 2.5 1.7 2.2 3 1.9 2.6 1.3 0.2 0.9 ...
summary(Budget)
##        AN            PVP             AGR             CMI        
##  Min.   :1872   Min.   : 7.60   Min.   :0.300   Min.   : 0.100  
##  1st Qu.:1908   1st Qu.:10.57   1st Qu.:0.800   1st Qu.: 0.400  
##  Median :1930   Median :12.60   Median :1.400   Median : 1.300  
##  Mean   :1929   Mean   :12.21   Mean   :1.996   Mean   : 3.942  
##  3rd Qu.:1954   3rd Qu.:13.43   3rd Qu.:2.650   3rd Qu.: 7.350  
##  Max.   :1971   Max.   :18.00   Max.   :6.000   Max.   :16.500  
##       TRA              LOG              EDU              ACS        
##  Min.   : 4.500   Min.   : 0.500   Min.   : 2.100   Min.   : 0.500  
##  1st Qu.: 6.675   1st Qu.: 0.675   1st Qu.: 7.325   1st Qu.: 1.800  
##  Median : 8.000   Median : 1.850   Median : 8.700   Median : 4.550  
##  Mean   : 8.321   Mean   : 3.958   Mean   : 9.942   Mean   : 4.817  
##  3rd Qu.: 9.150   3rd Qu.: 6.200   3rd Qu.:10.600   3rd Qu.: 6.800  
##  Max.   :15.300   Max.   :15.800   Max.   :23.800   Max.   :11.300  
##       ANC              DEF             DET             DIV       
##  Min.   : 0.000   Min.   :18.80   Min.   : 3.50   Min.   :0.000  
##  1st Qu.: 0.000   1st Qu.:25.93   1st Qu.: 6.35   1st Qu.:0.000  
##  Median : 3.800   Median :29.15   Median :19.30   Median :1.400  
##  Mean   : 4.275   Mean   :30.26   Mean   :19.14   Mean   :1.183  
##  3rd Qu.: 5.450   3rd Qu.:37.02   3rd Qu.:26.45   3rd Qu.:2.025  
##  Max.   :13.400   Max.   :42.40   Max.   :41.60   Max.   :3.000

The difference between the mean and median (eg. CMI, LOG) indicates that there are a few cases/rows with extreme values that are elevating the mean

4. Box Plot

Distribution, location and spread of the represented data.

boxplot(Budget[,-1], ylab = 'Percentage of France Budget', xlab = 'Budget Category')

In the above plot, the share of all spending categories is mostly well bellow 20%, while for defense (DEF) and dept refund (DET), it is higher (showing war time and high governmental spending). The box is much higher for DEF and DET than for the others. This means that DEF and DET have larger variability than the others. The three largest observations in education (EDU) are shown as boxplot outliers, i.e. they are more than 1.5 times the inter-quartile range (hight of the box) larger than the 3rd quartile.

5. Histograms

Here we consider each variable separately. We consider the histograms as a probability density rather than a frequency. We fit empirical kernel density estimator curves, which give a more-or less smoothed continuous approximation to the histogram. What evidence does the summary give that the distribution is somewhat symmetric? We can observe the following:

  • Unimodal
    • Skewed left : AGR, TRA, LOG, EDU, ANC, CMI
    • Symetric : PVP, DEF
  • Bimodal : DET, DIV
par(mfrow=c(3,4),family ='sans')
for (i in 2:length(names(Budget))){
  hist((Budget)[,i],main=names(Budget)[i],prob=TRUE,xlab='% of the total Budget')
  lines(density(Budget[,i]), col="red")
}

6. Q-Q Plot

c_d<-names(Budget)
par(mfrow=c(3,4),family ='sans')
for (i in 2:length(names(Budget))){
  qqnorm((Budget)[,i], main=paste(c_d[i],'\n', 'vs','\n', 'Normal dist.'), ylab = c_d[i])
  qqline((Budget)[,i], col= "green")
}

7. Heat Map

A heat map is like a table that uses colors instead of numbers to represent the values for the cells, using Pearson Correlation

data(attitude)
head(attitude)
##   rating complaints privileges learning raises critical advance
## 1     43         51         30       39     61       92      45
## 2     63         64         51       54     63       73      47
## 3     71         70         68       69     76       86      48
## 4     61         63         45       47     54       84      35
## 5     81         78         56       66     71       83      47
## 6     43         55         49       44     54       49      34
head(Budget[,-1])
##    PVP AGR CMI  TRA LOG EDU ACS ANC  DEF  DET DIV
## 1 18.0 0.5 0.1  6.7 0.5 2.1 2.0   0 26.4 41.5 2.1
## 2 14.1 0.8 0.1 15.3 1.9 3.7 0.5   0 29.8 31.3 2.5
## 3 13.6 0.7 0.7  6.8 0.6 7.1 0.7   0 33.8 34.4 1.7
## 4 14.3 1.7 1.7  6.9 1.2 7.4 0.8   0 37.7 26.2 2.2
## 5 10.3 1.5 0.4  9.3 0.6 8.5 0.9   0 38.4 27.2 3.0
## 6 13.4 1.4 0.5  8.1 0.7 8.6 1.8   0 38.5 25.3 1.9
library(ggplot2)
library(reshape2)
qplot(x=Var1, y=Var2, data=melt(cor(Budget[,-1], method = 'pearson')), fill=-value, geom="tile")

From the heatmap table, the correlation between variables is noticeable for ACS (Social action) & AGR (Agriculture), and ACS & EDU (Education)

8. Evolution overtime

par(mfrow = c(3,4))
for (i in 2:(ncol(Budget))) {
    tb = ts(Budget[,i])
    plot(tb, ylab = colnames(Budget[i]))
}

With only simple chart in time series, we can see the upward trend for Education (EDU) and Social Action (ACS) spending (as percentage of budget) over time, even during the economics slow down 1929 - 1932 and the 2 world wars

Modeling

1. PCA

A principal component analysis of the data can be applied using the ACP function of the FactoMineR library (omitting the AN variable).

Step 1: We examine the eigenvalues to determine how many principal components should be considered:

budget = PCA(Budget[,-1], ncp = 2, graph = FALSE)
summary(budget)
## 
## Call:
## PCA(X = Budget[, -1], ncp = 2, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               4.972   2.051   1.290   0.993   0.708   0.558
## % of var.             45.203  18.642  11.729   9.028   6.440   5.074
## Cumulative % of var.  45.203  63.845  75.574  84.602  91.042  96.116
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.204   0.125   0.063   0.035   0.000
## % of var.              1.857   1.138   0.571   0.318   0.000
## Cumulative % of var.  97.973  99.111  99.682 100.000 100.000
## 
## Individuals (the 10 first)
##         Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## 1   |  4.253 | -2.900  7.049  0.465 |  1.024  2.132  0.058 |
## 2   |  4.121 | -2.767  6.416  0.451 |  2.012  8.226  0.238 |
## 3   |  2.727 | -2.416  4.893  0.785 |  0.224  0.102  0.007 |
## 4   |  2.645 | -2.057  3.545  0.605 |  0.755  1.158  0.081 |
## 5   |  3.090 | -2.338  4.581  0.573 |  0.167  0.057  0.003 |
## 6   |  2.356 | -1.985  3.302  0.710 |  0.626  0.796  0.071 |
## 7   |  2.460 | -1.907  3.048  0.601 |  0.812  1.339  0.109 |
## 8   |  2.244 | -1.431  1.715  0.407 |  0.768  1.198  0.117 |
## 9   |  3.227 | -2.139  3.833  0.439 |  0.957  1.859  0.088 |
## 10  |  3.494 | -1.144  1.096  0.107 | -2.883 16.887  0.681 |
## 
## Variables (the 10 first)
##        Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## PVP | -0.173  0.604  0.030 |  0.740 26.686  0.547 |
## AGR |  0.818 13.471  0.670 |  0.005  0.001  0.000 |
## CMI |  0.833 13.957  0.694 |  0.341  5.679  0.116 |
## TRA | -0.137  0.377  0.019 |  0.631 19.402  0.398 |
## LOG |  0.722 10.473  0.521 |  0.398  7.714  0.158 |
## EDU |  0.787 12.449  0.619 | -0.137  0.915  0.019 |
## ACS |  0.933 17.515  0.871 | -0.101  0.496  0.010 |
## ANC |  0.289  1.679  0.083 | -0.807 31.783  0.652 |
## DEF | -0.612  7.538  0.375 |  0.216  2.282  0.047 |
## DET | -0.889 15.890  0.790 | -0.301  4.428  0.091 |

If we take all of these eigenvalues and add them up we get the total variance of 11. The proportion of variation explained by each eigenvalue is given in the third column. For example, 4.972362e+00 divided by the 11 equals 0.4520329, or, about 45% of the variation is explained by this first eigenvalue. The cumulative percentage explained is obtained by adding the successive proportions of variation explained to obtain the running total. Therefore, about 64% of the variation is explained by the first two eigenvalues together.

The first four principal components explain 85% of the variation. This is an acceptably large percentage.

A Scree Plot may serve as another indicator of how many eigenvalues to consider. With the eigenvalues ordered from largest to the smallest, a scree plot is the plot of λi versus i. The number of component is determined at the point, beyond which the remaining eigenvalues are all relatively small and of comparable size.

plot(budget[[1]]$`cumulative percentage of variance`, type = 'b', 
     ylab = 'Variances', xlab = "Number of Components")

How many components?

  • Kaiser criterion: 3 components
    • Greater than 80%: 4 components
    • Greater than 90%: 5 components

Step 2: To interpret each component, we must compute the correlations between the original data for each variable and each principal component.

plot(budget, choix='var', title='PCA graph: Variances', axes=c(1:2))

The function dimdesc() allows to see which variables the axes are the most linked to: which variables are the most correlated to each axis (here a correlation value above 0.5 is deemed important).

First Principal Component Analysis - Dim1

dimdesc(budget, axes = 1)
## $Dim.1
## $Dim.1$quanti
##     correlation      p.value
## ACS   0.9332323 2.968369e-11
## CMI   0.8330641 4.369668e-07
## AGR   0.8184263 1.024506e-06
## EDU   0.7867728 5.121852e-06
## LOG   0.7216397 6.888506e-05
## DIV  -0.5483342 5.532800e-03
## DEF  -0.6122032 1.474803e-03
## DET  -0.8888848 6.516706e-09
The first principal component is strongly correlated with 8 of the original variables. The first principal component increases with increasing ACS, CMI, AGR, EDU and LOG scores and with decreasing DIV, DEF and DET. This suggests that the five first criteria vary together. Furthermore, we see that the first principal component correlates most strongly with the ACS.

Second Principal Component Analysis - Dim2

dimdesc(budget, axes = 2)
## $Dim.2
## $Dim.2$quanti
##     correlation      p.value
## PVP   0.7397530 3.608892e-05
## TRA   0.6307605 9.514279e-04
## ANC  -0.8073139 1.863188e-06
The second principal component increases with increasing PVP and TRA and with decreasing ANC.
plot(budget, choix='ind', title='PCA graph: Individuals', axes=c(1:2))

We may conclude that : 1968, 1971, 1956, 1962, 1965 and 1959 have a very high value for the first principal component and we would expect these years to have high values for the ACS, CMI, AGR, EDU and LOG and low values for DIV, DEF and DET.
Whereas if you look at 1872, 1880, 1890, 1920, 1903, 1906, 1900, 1909 we would expect to have high values for DIV, DEF and DET and low values for ACS, CMI, AGR, EDU and LOG.
1923, 1926, 1935, 1932, 1929 have a low value for the second component. So we would expect that these years to have low values for PVP and TRA. And conversely if you were to look at 1947 and 1950, these years would have high values for PVP and TRA.

2 component solution

  • Component 1: (eg 1872 and 1968)
    • Education
    • Social action
  • Component 2: (eg 1923 and 1947)
    • Authorities
    • Work
    • Trades and companies

Is 3rd component useful or not? It seems not very useful as we look at the following plot

plot(PCA(Budget[,-1], ncp = 3, graph = FALSE), choix='ind', 
     title='PCA graph: Individuals', axes=c(1,3))

2. Clustering

km5=kmeans(Budget[,-1],4,nstart=50)
km4=kmeans(Budget[,-1],4,nstart=50)
km3=kmeans(Budget[,-1],3,nstart=50)
km2=kmeans(Budget[,-1],2,nstart=50)

par(mfrow = c(2,2))
plot(Budget[,1], km2$cluster, xlab = 'Year', ylab = '2 Clusters')
plot(Budget[,1], km3$cluster, xlab = 'Year', ylab = '3 Clusters')
plot(Budget[,1], km4$cluster, xlab = 'Year', ylab = '4 Clusters')
plot(Budget[,1], km5$cluster, xlab = 'Year', ylab = '5 Clusters')

From the plots, we can see K-means (with number of clusters K = 3) produced the same group of years as PCA (with the first component)

3. Hierachiacal Clustering

# Create a matrix with sd of each variables
N<-matrix(rep(apply(Budget[,-1],2,sd),24), nrow=24)
head(N)
##          [,1]     [,2]      [,3]      [,4]      [,5]     [,6]     [,7]
## [1,] 2.238267 4.585603  4.271841  3.482087  7.466733 1.047841 1.681221
## [2,] 1.681221 2.520866  5.335600  4.244203 12.455972 2.238267 4.585603
## [3,] 4.585603 4.271841  3.482087  7.466733  1.047841 1.681221 2.520866
## [4,] 2.520866 5.335600  4.244203 12.455972  2.238267 4.585603 4.271841
## [5,] 4.271841 3.482087  7.466733  1.047841  1.681221 2.520866 5.335600
## [6,] 5.335600 4.244203 12.455972  2.238267  4.585603 4.271841 3.482087
##          [,8]      [,9]     [,10]     [,11]
## [1,] 2.520866  5.335600  4.244203 12.455972
## [2,] 4.271841  3.482087  7.466733  1.047841
## [3,] 5.335600  4.244203 12.455972  2.238267
## [4,] 3.482087  7.466733  1.047841  1.681221
## [5,] 4.244203 12.455972  2.238267  4.585603
## [6,] 7.466733  1.047841  1.681221  2.520866
# Adjusted for sd with 1/n
N<- (24-1)/23*(1/N)

# Create a matrix with mean of each variables
M<-matrix(rep(apply(Budget[,-1],2,mean),24),nrow=24)

# Centered data
Mprime<-Budget[,-1]-M

# Normalized
M_n<-Mprime*N

# Hierarchical ascending clustering 
my_H_clust<-hclust(dist(M_n), method ="ward.D2")
my_H_clust1 = hclust(dist(M_n), method='complete')
# Plot data
my_H_clust
## 
## Call:
## hclust(d = dist(M_n), method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 24
par(mfrow=c(1,1))
plot(my_H_clust, labels = Budget[,1])

plot(my_H_clust1, labels = Budget[,1])