Data Analysis

Principal component analysis (PCA) is a statistical method for reducing dimensionality. It is mostly used on high-dimensional datasets with many variables, and its main goal is to create a simpler representation of such a dataset. After this step the data is represented by components that capture the majority of the variance of the original dataset; commonly, the first few components explain the largest part of the variance.

This project is based on the Breast Cancer Coimbra Data Set (http://archive.ics.uci.edu/ml//datasets/Breast+Cancer+Coimbra). The dataset consists of 116 observations, of which 64 patients have breast cancer and 52 belong to a control group. The main purpose of this data is to build a predictive model, but this project focuses on PCA: checking whether it can help to get better results and provide insights about the data. The dataset consists of 10 variables: Age, BMI, Glucose, Insulin, HOMA, Leptin, Adiponectin, Resistin, MCP.1 and Classification (1 = healthy controls, 2 = patients with cancer).
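The analysis assumes the following setup; the packages listed reflect the functions used later in the report, and the file name dataR2.csv (the file distributed on the UCI page) is an assumption here.

library(dplyr)        # select
library(GGally)       # ggpairs
library(corrplot)     # corrplot
library(factoextra)   # fviz_eig, fviz_pca_var, fviz_contrib, get_pca_var, get_pca_ind
library(ggbiplot)     # PCA biplot (available from GitHub: vqv/ggbiplot)
library(rgl)          # plot3d, text3d, lines3d
library(gridExtra)    # grid.arrange
library(caret)        # train, trainControl, resamples, confusionMatrix

data <- read.csv("dataR2.csv")   # assumed file name of the UCI dataset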

# Drop the target column; df contains only the nine numeric predictors
df <- select(data, -Classification)

To begin, let's take a look at the data and its basic statistics.

head(df,3)
##   Age      BMI Glucose Insulin      HOMA  Leptin Adiponectin Resistin
## 1  48 23.50000      70   2.707 0.4674087  8.8071    9.702400  7.99585
## 2  83 20.69049      92   3.115 0.7068973  8.8438    5.429285  4.06405
## 3  82 23.12467      91   4.498 1.0096511 17.9393   22.432040  9.27715
##     MCP.1
## 1 417.114
## 2 468.786
## 3 554.697
summary(df)
##       Age            BMI           Glucose          Insulin      
##  Min.   :24.0   Min.   :18.37   Min.   : 60.00   Min.   : 2.432  
##  1st Qu.:45.0   1st Qu.:22.97   1st Qu.: 85.75   1st Qu.: 4.359  
##  Median :56.0   Median :27.66   Median : 92.00   Median : 5.925  
##  Mean   :57.3   Mean   :27.58   Mean   : 97.79   Mean   :10.012  
##  3rd Qu.:71.0   3rd Qu.:31.24   3rd Qu.:102.00   3rd Qu.:11.189  
##  Max.   :89.0   Max.   :38.58   Max.   :201.00   Max.   :58.460  
##       HOMA             Leptin        Adiponectin        Resistin     
##  Min.   : 0.4674   Min.   : 4.311   Min.   : 1.656   Min.   : 3.210  
##  1st Qu.: 0.9180   1st Qu.:12.314   1st Qu.: 5.474   1st Qu.: 6.882  
##  Median : 1.3809   Median :20.271   Median : 8.353   Median :10.828  
##  Mean   : 2.6950   Mean   :26.615   Mean   :10.181   Mean   :14.726  
##  3rd Qu.: 2.8578   3rd Qu.:37.378   3rd Qu.:11.816   3rd Qu.:17.755  
##  Max.   :25.0503   Max.   :90.280   Max.   :38.040   Max.   :82.100  
##      MCP.1        
##  Min.   :  45.84  
##  1st Qu.: 269.98  
##  Median : 471.32  
##  Mean   : 534.65  
##  3rd Qu.: 700.09  
##  Max.   :1698.44

To deepen our knowledge of the data, a plot of pairs of variables is very useful. On this plot we can see distributions, correlations and also boxplots showing the differences between patients with and without breast cancer. The plot reveals an age difference between the two groups, which may indicate that the control group was not perfectly chosen. Moreover, most of the indicators tend to be higher for the patients with cancer.

ggpairs(data)

To get a better view of the correlations, a correlation plot was produced. The strong correlations between Glucose, Insulin and HOMA are now clearly visible (HOMA is calculated from fasting glucose and insulin, which explains this block).

corr_df = cor(df,method='pearson')
corrplot(corr_df)

Principal Component Analysis (PCA)

How to choose the number of components

Kaiser’s stopping rule is one method for deciding which components should be kept: components with an eigenvalue higher than 1 are retained. It is closely related to the scree test, in which eigenvalues are plotted on the vertical axis and components on the horizontal axis, ordered from the largest to the smallest; based on the elbow rule, the number of components is chosen at the point where the line of eigenvalues levels off. Another approach is to look at the percentage of variance explained; typically enough components are kept to explain about 70-90% of the total variance.
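As a minimal illustration of these rules (a sketch only, assuming the data frame df defined above), the eigenvalues and the cumulative share of variance can be computed directly from the correlation matrix:

# Eigenvalues of the correlation matrix and the cumulative variance they explain
eig <- eigen(cor(df))$values
data.frame(component    = seq_along(eig),
           eigenvalue   = round(eig, 3),
           cum_variance = round(cumsum(eig) / sum(eig), 3))
# Kaiser's rule keeps the components with eigenvalue greater than 1
sum(eig > 1)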

# PCA on the standardised variables (centred and scaled to unit variance)
pca <- prcomp(df, center=TRUE, scale.=TRUE)
# Eigenvalues of the correlation matrix, used for Kaiser's rule
eigen(cor(df))$values
## [1] 3.05853968 1.52220124 1.16754381 1.10569811 0.72254235 0.65727917
## [7] 0.44155050 0.29262393 0.03202122
fviz_eig(pca, choice='eigenvalue')

Based on Kaiser’s rule, 4 components should be chosen, because their eigenvalues are higher than 1. The scree plot gives the same result.

fviz_eig(pca)

summary(pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.7489 1.2338 1.0805 1.0515 0.85002 0.81073 0.66449
## Proportion of Variance 0.3398 0.1691 0.1297 0.1229 0.08028 0.07303 0.04906
## Cumulative Proportion  0.3398 0.5090 0.6387 0.7615 0.84184 0.91487 0.96393
##                            PC8     PC9
## Standard deviation     0.54095 0.17894
## Proportion of Variance 0.03251 0.00356
## Cumulative Proportion  0.99644 1.00000

If we look at the plot of components and the variance they explain, in my opinion we should choose 4 or 5 components. The fifth component does not explain much variance on its own, but 4 components explain only 76% of the total variance. Therefore, the fifth component will also be taken into consideration in the analysis.
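These two figures can also be read off directly from the importance matrix returned by summary() (about 76% after four components and 84% after five, as shown above):

# Cumulative proportion of variance explained by the first 4 and 5 components
summary(pca)$importance["Cumulative Proportion", 4:5]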

Analysis of components

fviz_pca_var(pca, col.var = "steelblue")

pca$rotation[,1:5]
##                    PC1         PC2         PC3          PC4        PC5
## Age          0.1245739  0.06626166 -0.20670086  0.821387749  0.2529717
## BMI          0.2604297  0.49933879  0.42573480 -0.070921590 -0.2324569
## Glucose      0.4390227 -0.18594179 -0.13088534  0.125615405  0.1995197
## Insulin      0.4439787 -0.38631334  0.09371342 -0.059771821 -0.2976113
## HOMA         0.4928517 -0.37472788 -0.01219108 -0.005644021 -0.1391521
## Leptin       0.3314935  0.23364275  0.58320150  0.058353947  0.2878059
## Adiponectin -0.1726096 -0.48059810  0.28212343 -0.276866882  0.5292842
## Resistin     0.2817413  0.30361887 -0.28892346 -0.302703643  0.5975961
## MCP.1        0.2546307  0.21044940 -0.49675794 -0.359469872 -0.1188700

Based on the plot alone, a few components are clearly visible. The first consists of HOMA, Insulin and Glucose. The second is probably driven mainly by BMI and Adiponectin, and the third is basically Age. To see which variables contribute most to components 4 and 5, a plot of the contribution of variables to the components is needed.

var <- get_pca_var(pca)
a<-fviz_contrib(pca, "var",axes = 1)
b<-fviz_contrib(pca, "var",axes = 2)
c<-fviz_contrib(pca, "var",axes = 3)
d<-fviz_contrib(pca, "var",axes = 4)
e<-fviz_contrib(pca, "var",axes = 5)
grid.arrange(a,b,c,d,e,top='Contribution to the Principal Components')

As was said before, the first three components consist mainly of:

  • HOMA, Insulin, Glucose
  • BMI, Adiponectin, but also Insulin and HOMA
  • Age

And the fourth and fifth components (see the numeric contributions below):

  • Leptin, MCP.1, BMI
  • Resistin, Adiponectin
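The same information can be read numerically from the contribution matrix returned by get_pca_var() (values are the percentage contribution of each variable to each component):

# Contribution (in %) of each variable to the first five principal components
round(var$contrib[, 1:5], 1)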

Advanced visualisations

The results of the PCA can also be shown on a biplot with the classes distinguished (1 = healthy controls, 2 = patients with cancer).

ggbiplot(pca, obs.scale=1, var.scale=1, groups=as.factor(data$Classification), ellipse=TRUE, circle = TRUE)

In this case, the observations of group 2 are more spread out and the variance in this group is larger than in group 1. Importantly, observations on the right side of the plot have higher values of HOMA, Insulin and Glucose (PC1 consists mainly of these variables), so these variables may be good indicators of having cancer.
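A simple numeric check of this observation (an added sketch, not part of the original analysis) is to compare the average PC1 score of the two groups:

# Mean PC1 score for healthy controls (1) and cancer patients (2)
tapply(pca$x[, 1], data$Classification, mean)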

Interactive 3D plots are also available in R. Analysing the third dimension on this plot shows that, in both groups, some patients have extreme values of Leptin and MCP.1 (PC3). (3D plot source: https://planspacedotorg.wordpress.com/2013/02/03/pca-3d-visualization-and-clustering-in-r/)

plot3d(pca$x[,1:3], col = factor(data$Classification))
text3d(pca$rotation[,1:3], texts = rownames(pca$rotation), col='blue')

coords <- NULL
# Build segments from the origin to each variable's loadings so they can be
# drawn as lines in the 3D plot
for (i in 1:nrow(pca$rotation)) {
  coords <- rbind(coords, rbind(c(0,0,0), pca$rotation[i,1:3]))
}
lines3d(coords, col="blue", lwd=4)


SVM on PCA

As was said at the beginning, the main purpose of this dataset is to build a model that predicts whether or not a patient has cancer; such a model could also help with earlier detection of possible cancer. Because of that, a model based on the whole dataset will be compared with models based on 4 and 5 components.

The model used will be a support vector machine (SVM), a supervised method for classification and regression. The algorithm finds the hyperplane that separates the classes in the best possible way, i.e. with the largest margin between them. In this case, two classes will be predicted (1 = healthy controls, 2 = patients with cancer).
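As a small illustration of this idea (a sketch only, using the e1071 package rather than the caret models built below), an SVM with a linear kernel can be fitted on the first two principal components and its decision regions plotted:

library(e1071)   # assumed available; the models below use caret instead
# Toy illustration of a separating boundary on the first two principal components
toy <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2],
                  Classification = as.factor(data$Classification))
svm_toy <- svm(Classification ~ ., data = toy, kernel = "linear")
plot(svm_toy, toy)   # shaded regions correspond to the two predicted classes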

The dataset will be split 80/20 into train and test sets.

smp_size <- floor(0.8 * nrow(data))
set.seed(123)

# Model data based on all nine (standardised) variables
data0 <- as.data.frame(scale(df))
data0$Classification <- as.factor(data$Classification)
train_ind <- sample(seq_len(nrow(data0)), size = smp_size)
data_train <- data0[train_ind, ]
data_test <- data0[-train_ind, ]

# Model data based on the first four principal components
data1 <- as.data.frame(get_pca_ind(pca)$coord)
data1 <- data1[c(1:4)]
data1$Classification <- as.factor(data$Classification)
train_ind <- sample(seq_len(nrow(data)), size = smp_size)
data_train_4PC <- data1[train_ind, ]
data_test_4PC <- data1[-train_ind, ]

# Model data based on the first five principal components
data2 <- as.data.frame(get_pca_ind(pca)$coord)
data2 <- data2[c(1:5)]
data2$Classification <- as.factor(data$Classification)
train_ind <- sample(seq_len(nrow(data)), size = smp_size)
data_train_5PC <- data2[train_ind, ]
data_test_5PC <- data2[-train_ind, ]

The models will be trained on the training set with k-fold cross-validation, where k = 3. The performance will be measured by the accuracy of each model, and the best model will be used to predict on the test set.

# define training control: 3-fold cross-validation
train_control <- trainControl(method="repeatedcv", number=3)
# train an SVM with a polynomial kernel on the full data and on the 4- and 5-component scores
model <- train(Classification~., data=data_train, trControl=train_control, method="svmPoly")
model_4PC <- train(Classification~., data=data_train_4PC, trControl=train_control, method="svmPoly")
model_5PC <- train(Classification~., data=data_train_5PC, trControl=train_control, method="svmPoly")

# summarize results
results = resamples(list(SVM_data=model, SVM_data_4PC=model_4PC, SVM_data_5PC=model_5PC))
bwplot(results, metric="Accuracy")

The model based on all the variables performs better than the models based on 4 and 5 components, so the prediction on the test set will be carried out with this model.

predictions <- predict(model, data_test[1:9])
conf_max = confusionMatrix(table(predictions, data_test$Classification))
print(conf_max$table)
##            
## predictions  1  2
##           1  7  3
##           2  2 12
print(conf_max$overall['Accuracy']) 
##  Accuracy 
## 0.7916667

The results on the test set are fairly satisfying: out of 24 observations, 19 were classified correctly, which gives an accuracy of about 79%.
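The accuracy can be verified directly from the confusion matrix computed above:

# Correct classifications lie on the diagonal of the confusion matrix: 7 + 12 = 19 of 24
sum(diag(conf_max$table)) / sum(conf_max$table)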

Conclusions

In this project, PCA was applied to a rather small dataset. The analysis helps to gain a better understanding of the data and the dependencies between variables. The final predictive model was built on the original variables, but this is largely because the dataset is not high-dimensional; it is always worth checking whether such an analysis can improve the model. The project covered how to use PCA in a classification problem, so the same approach can be applied to problems where the dataset contains many more variables.