Principal component analysis (PCA) is a statistical method for dimensionality reduction. It is mostly used on high-dimensional datasets with many variables, and its main goal is to create a simpler representation of such a dataset. After this step the data is represented by components that capture the majority of the variance of the original dataset; commonly the first few components explain the largest part of it.
This project is based on the Breast Cancer Coimbra Data Set (http://archive.ics.uci.edu/ml//datasets/Breast+Cancer+Coimbra). The dataset consists of 116 observations: 64 patients with breast cancer and 52 patients forming a control group. The main purpose of this data is to build a predictive model, but this project focuses mainly on PCA, on checking whether it can help to get better results, and on gaining insights about the data. The dataset consists of 10 variables: Age, BMI, Glucose, Insulin, HOMA, Leptin, Adiponectin, Resistin, MCP.1 and Classification (1 = healthy controls, 2 = patients with cancer).
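Before the analysis, the data has to be loaded and the packages providing the functions used below attached. A minimal setup could look like the sketch below; the file name dataR2.csv and the exact package list are assumptions based on the functions called later in this report.
library(dplyr)      # select()
library(GGally)     # ggpairs()
library(corrplot)   # corrplot()
library(factoextra) # fviz_eig(), fviz_contrib(), get_pca_ind(), ...
library(gridExtra)  # grid.arrange()
library(ggbiplot)   # ggbiplot()
library(rgl)        # plot3d(), text3d(), lines3d()
library(caret)      # train(), trainControl(), resamples(), confusionMatrix()
data <- read.csv("dataR2.csv")  # assumed file name of the UCI download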
# keep only the 9 numeric predictors, dropping the target column
df <- select(data, -Classification)
At the beginning, let's take a look at the data to see some basic statistics.
head(df,3)
## Age BMI Glucose Insulin HOMA Leptin Adiponectin Resistin
## 1 48 23.50000 70 2.707 0.4674087 8.8071 9.702400 7.99585
## 2 83 20.69049 92 3.115 0.7068973 8.8438 5.429285 4.06405
## 3 82 23.12467 91 4.498 1.0096511 17.9393 22.432040 9.27715
## MCP.1
## 1 417.114
## 2 468.786
## 3 554.697
summary(df)
## Age BMI Glucose Insulin
## Min. :24.0 Min. :18.37 Min. : 60.00 Min. : 2.432
## 1st Qu.:45.0 1st Qu.:22.97 1st Qu.: 85.75 1st Qu.: 4.359
## Median :56.0 Median :27.66 Median : 92.00 Median : 5.925
## Mean :57.3 Mean :27.58 Mean : 97.79 Mean :10.012
## 3rd Qu.:71.0 3rd Qu.:31.24 3rd Qu.:102.00 3rd Qu.:11.189
## Max. :89.0 Max. :38.58 Max. :201.00 Max. :58.460
## HOMA Leptin Adiponectin Resistin
## Min. : 0.4674 Min. : 4.311 Min. : 1.656 Min. : 3.210
## 1st Qu.: 0.9180 1st Qu.:12.314 1st Qu.: 5.474 1st Qu.: 6.882
## Median : 1.3809 Median :20.271 Median : 8.353 Median :10.828
## Mean : 2.6950 Mean :26.615 Mean :10.181 Mean :14.726
## 3rd Qu.: 2.8578 3rd Qu.:37.378 3rd Qu.:11.816 3rd Qu.:17.755
## Max. :25.0503 Max. :90.280 Max. :38.040 Max. :82.100
## MCP.1
## Min. : 45.84
## 1st Qu.: 269.98
## Median : 471.32
## Mean : 534.65
## 3rd Qu.: 700.09
## Max. :1698.44
To deepen the knowledge about the data, a plot of variable pairs is very useful. On this plot we can see distributions, correlations and also boxplots showing the differences between patients with and without breast cancer. There is a visible age difference between the two groups, which may indicate that the control group is not perfectly chosen. Moreover, the indicators are generally higher for patients with cancer.
ggpairs(data)
To get a better view of the correlations, a correlation plot was created. The correlation between Glucose, HOMA and Leptin is now clearly visible.
corr_df <- cor(df, method = 'pearson')
corrplot(corr_df)
Kaiser’s stopping rule is a method to decide how many components should be retained: components with an eigenvalue higher than 1 are kept. It is closely related to the scree test, in which eigenvalues are plotted on the vertical axis and components on the horizontal axis. The components are ordered from the largest eigenvalue to the smallest, and based on the elbow rule we choose the number of components: the point where the line of eigenvalues levels off suggests how many to keep. Another approach is to look at the percentage of variance explained; it is usually considered good when the retained components explain about 70-90% of it.
pca <- prcomp(df, center = TRUE, scale. = TRUE)  # standardize the variables before PCA
eigen(cor(df))$values
## [1] 3.05853968 1.52220124 1.16754381 1.10569811 0.72254235 0.65727917
## [7] 0.44155050 0.29262393 0.03202122
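These eigenvalues can be converted into shares of explained variance by dividing each of them by their sum (which equals the number of variables), giving the quantity behind the 70-90% guideline mentioned above; a quick sketch:
ev <- eigen(cor(df))$values
round(ev / sum(ev), 3)          # share of variance explained by each component
round(cumsum(ev) / sum(ev), 3)  # cumulative share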
fviz_eig(pca, choice='eigenvalue')
Based on Kaiser’s rule, 4 components should be chosen, because their eigenvalues are higher than 1. The scree plot gives the same result.
fviz_eig(pca)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7489 1.2338 1.0805 1.0515 0.85002 0.81073 0.66449
## Proportion of Variance 0.3398 0.1691 0.1297 0.1229 0.08028 0.07303 0.04906
## Cumulative Proportion 0.3398 0.5090 0.6387 0.7615 0.84184 0.91487 0.96393
## PC8 PC9
## Standard deviation 0.54095 0.17894
## Proportion of Variance 0.03251 0.00356
## Cumulative Proportion 0.99644 1.00000
If we look at the plot of the components and the variance they explain, in my opinion we should choose 4 or 5 components. The fifth component does not explain a lot of variance on its own, but 4 components explain only 76% of the variance. So, in the analysis, the fifth component will also be taken into consideration.
fviz_pca_var(pca, col.var = "steelblue")
pca$rotation[,1:5]
## PC1 PC2 PC3 PC4 PC5
## Age 0.1245739 0.06626166 -0.20670086 0.821387749 0.2529717
## BMI 0.2604297 0.49933879 0.42573480 -0.070921590 -0.2324569
## Glucose 0.4390227 -0.18594179 -0.13088534 0.125615405 0.1995197
## Insulin 0.4439787 -0.38631334 0.09371342 -0.059771821 -0.2976113
## HOMA 0.4928517 -0.37472788 -0.01219108 -0.005644021 -0.1391521
## Leptin 0.3314935 0.23364275 0.58320150 0.058353947 0.2878059
## Adiponectin -0.1726096 -0.48059810 0.28212343 -0.276866882 0.5292842
## Resistin 0.2817413 0.30361887 -0.28892346 -0.302703643 0.5975961
## MCP.1 0.2546307 0.21044940 -0.49675794 -0.359469872 -0.1188700
Based on the plot and the loadings above, a few components are already clearly interpretable. The first one consists of HOMA, Insulin and Glucose, the second is mainly BMI and Adiponectin, and Age dominates one of the later components. To distinguish which variables are the main part of components 3, 4 and 5, plots of the contributions of the variables to the components are needed.
var <- get_pca_var(pca)
a <- fviz_contrib(pca, "var", axes = 1)
b <- fviz_contrib(pca, "var", axes = 2)
c <- fviz_contrib(pca, "var", axes = 3)
d <- fviz_contrib(pca, "var", axes = 4)
e <- fviz_contrib(pca, "var", axes = 5)
grid.arrange(a,b,c,d,e,top='Contribution to the Principal Components')
As the contribution plots confirm, the first component consists mainly of HOMA, Insulin and Glucose, the second of BMI and Adiponectin, and the third of Leptin and MCP.1.
The fourth component is dominated by Age, and the fifth mainly by Resistin and Adiponectin.
The results of the PCA can also be shown on a biplot with a distinction between the classes (1 = healthy controls, 2 = patients with cancer).
ggbiplot(pca, obs.scale=1, var.scale=1, groups=as.factor(data$Classification), ellipse=TRUE, circle = TRUE)
In this case, the observations of group 2 are more spread out, and the variance in this group is larger than in group 1. Importantly, observations on the right side of the plot have higher values of HOMA, Insulin and Glucose (PC1 consists mainly of these variables), so those variables can be good indicators of having cancer.
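As a quick additional check of this interpretation, the group medians of the PC1 variables can be compared directly:
# median of Glucose, Insulin and HOMA in each class (1 = controls, 2 = patients)
aggregate(df[, c("Glucose", "Insulin", "HOMA")],
          by = list(Classification = data$Classification),
          FUN = median)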
Interactive 3D plots are also available in R. Looking at three dimensions shows that, in both groups, some patients have extreme values of Leptin and MCP.1 (PC3). (3D plot source: https://planspacedotorg.wordpress.com/2013/02/03/pca-3d-visualization-and-clustering-in-r/)
plot3d(pca$x[,1:3], col = factor(data$Classification))  # observations in PC1-PC3 space, coloured by class
text3d(pca$rotation[,1:3], texts = rownames(pca$rotation), col = 'blue')  # variable labels at their loadings
coords <- NULL
for (i in 1:nrow(pca$rotation)) {
  # build line segments from the origin to each variable's loading vector
  coords <- rbind(coords, rbind(c(0,0,0), pca$rotation[i,1:3]))
}
lines3d(coords, col = "blue", lwd = 4)
As stated at the beginning, the main purpose of this dataset is to build a model that predicts whether a patient has cancer or not. It could also be used for faster detection of a possible cancer. Because of that, a model based on the whole dataset will be compared with models based on 4 and 5 components.
The model that will be used is the support vector machine (SVM). It is a supervised method for classification and regression. The algorithm finds the hyperplane that separates the classes in the best possible way, dividing the space into the classes we want to predict. In this case, two classes will be predicted (1 = healthy controls, 2 = patients with cancer).
The dataset will be split in 80/20 proportions into training and test sets.
smp_size <- floor(0.8 * nrow(data))
set.seed(123)

# Model 1: all 9 original variables (standardized), target as a factor
data0 <- as.data.frame(scale(df))
data0$Classification <- as.factor(data$Classification)
train_ind <- sample(seq_len(nrow(data0)), size = smp_size)
data_train <- data0[train_ind, ]
data_test <- data0[-train_ind, ]

# Model 2: first 4 principal components
data1 <- as.data.frame(get_pca_ind(pca)$coord)
data1 <- data1[, 1:4]
data1$Classification <- as.factor(data$Classification)
train_ind <- sample(seq_len(nrow(data)), size = smp_size)
data_train_4PC <- data1[train_ind, ]
data_test_4PC <- data1[-train_ind, ]

# Model 3: first 5 principal components
data2 <- as.data.frame(get_pca_ind(pca)$coord)
data2 <- data2[, 1:5]
data2$Classification <- as.factor(data$Classification)
train_ind <- sample(seq_len(nrow(data)), size = smp_size)
data_train_5PC <- data2[train_ind, ]
data_test_5PC <- data2[-train_ind, ]
The models will be trained on the training set with k-fold cross-validation, where k = 3. The results will be measured by the accuracy of each model, and the best model will be used to predict on the test set.
# define training control
train_control <- trainControl(method="repeatedcv", number=3)
# train the model
model <- train(Classification~., data=data_train, trControl=train_control, method="svmPoly")
model_4PC <- train(Classification~., data=data_train_4PC, trControl=train_control, method="svmPoly")
model_5PC <- train(Classification~., data=data_train_5PC, trControl=train_control, method="svmPoly")
# summarize results
results = resamples(list(SVM_data=model, SVM_data_4PC=model_4PC, SVM_data_5PC=model_5PC))
bwplot(results, metric="Accuracy")
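The boxplot can be complemented with the numeric summary of the resampled accuracies:
summary(results)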
The model based on all the variables works better than the models based on 4 and 5 components, so the prediction on the test set will be conducted with this model.
predictions <- predict(model, data_test[1:9])
conf_max = confusionMatrix(table(predictions, data_test$Classification))
print(conf_max$table)
##
## predictions 1 2
## 1 7 3
## 2 2 12
print(conf_max$overall['Accuracy'])
## Accuracy
## 0.7916667
The results on the test set are rather satisfying: out of 24 observations, 19 were classified correctly, which gives an accuracy of 79%.
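Since this is a medical classification problem, sensitivity and specificity are also worth a look; they are already available in the confusion matrix object computed above:
conf_max$byClass[c("Sensitivity", "Specificity")]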
In this project, PCA was performed on a rather small dataset. The analysis helps to get a better understanding of the data and the dependencies between variables. The final predictive model was built on the original data, but that is mostly a consequence of the dataset not being high-dimensional; it is always good to check whether such an analysis can improve the model. This project covered how to use PCA in a classification problem, so the same approach can be applied to problems in which the dataset consists of many more variables.