This article focuses on applying Principal Component Analysis (PCA) to reduce the dimension of a dataset, so the underlying mathematics is not explained in detail here.
If you work with data regularly, you will sometimes face so many predictor variables that computation becomes heavy. Suppose you are asked to predict whether an employee in your company will resign, and the variables include satisfaction level, number of projects, average monthly hours, time spent at the company, and so on. With that many predictors, training the model takes a long time.
In that case, you should reduce the dimension to lighten the computation. Dimensionality reduction techniques fall into two groups: feature elimination and feature extraction.
Feature elimination means selecting the variables that influence your prediction and discarding the variables that contribute nothing to it. In the employee resignation case, for example, you keep only the variables that influence whether an employee resigns.
Generally, you choose the variables based on your domain expertise about employee resignation. You can also use statistical techniques such as variance filtering, Spearman correlation, or ANOVA. This article does not cover feature elimination further, since we want to focus on one feature extraction method.
Feature extraction is a technique in which you create new variables from your existing variables. In the employee resignation example, suppose we have 10 predictor variables. In feature extraction, we create 10 new variables from the 10 given variables and then keep only the most informative ones. One technique for doing this is Principal Component Analysis (PCA).
Principal Component Analysis (PCA) is a statistical method that reduces the dimension of the data by building new, uncorrelated variables (principal components) from the predictors and dropping the components that carry the least information, measured as the variance they explain. PCA looks only at the predictors, not at the target we want to predict, \(\hat{y}\).
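To make the idea concrete, here is a minimal sketch in base R using prcomp(). The data is simulated and the 90% cut-off is only an illustrative choice; the names toy, n_keep and reduced are made up for this example and do not come from the datasets used later.

```r
# Simulated data (an assumption for illustration): six predictors driven by
# only two underlying factors, so the columns are strongly correlated.
set.seed(42)
n  <- 200
f1 <- rnorm(n)
f2 <- rnorm(n)
noise <- function() rnorm(n, sd = 0.1)
toy <- data.frame(
  a1 = f1 + noise(), a2 = f1 + noise(), a3 = f1 + noise(),
  b1 = f2 + noise(), b2 = f2 + noise(), b3 = f2 + noise()
)

# Center and scale the predictors, then compute the principal components
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of the total variance explained by each component, and the running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Keep the smallest number of components that reaches 90% of the variance
n_keep  <- which(cum_var >= 0.90)[1]
reduced <- pca$x[, seq_len(n_keep), drop = FALSE]

n_keep        # number of components kept
dim(reduced)  # rows and columns of the reduced data
```

Because the six columns are built from two factors, two components are enough to pass the 90% mark here, so the six predictors shrink to two new variables.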
So when should you use PCA instead of another method? [1]
In this article, we apply Principal Component Analysis to two datasets, the Online Shoppers Intention dataset and the Breast Cancer dataset. The aim is to compare how effective PCA is on data whose variables are weakly correlated with one another and on data whose variables are strongly correlated. Let us start with the Online Shoppers Intention dataset.
We will explore PCA on data with weakly correlated variables and on data with strongly correlated variables, starting with the weakly correlated case.
In this use case, we use the Online Shoppers Intention dataset, downloaded from Kaggle. The data consists of various information related to customer behavior on online shopping websites. Suppose we want to predict whether a customer will generate revenue for our business.
We will create two models here: the first uses PCA on its predictors, and the second skips PCA in the data preprocessing.
Load the required libraries.
# data wrangling
library(tidyverse)
library(GGally)
# data preprocessing
library(recipes)
# modelling
library(rsample)
library(caret)
# measure time consumption
library(tictoc)
Load the shopper intention dataset into our environment.
The data looks like this:
## Observations: 12,330
## Variables: 18
## $ Administrative <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,...
## $ Administrative_Duration <dbl> 0, 0, -1, 0, 0, 0, -1, -1, 0, 0, 0, 0,...
## $ Informational <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Informational_Duration <dbl> 0, 0, -1, 0, 0, 0, -1, -1, 0, 0, 0, 0,...
## $ ProductRelated <dbl> 1, 2, 1, 2, 10, 19, 1, 1, 2, 3, 3, 16,...
## $ ProductRelated_Duration <dbl> 0.000000, 64.000000, -1.000000, 2.6666...
## $ BounceRates <dbl> 0.200000000, 0.000000000, 0.200000000,...
## $ ExitRates <dbl> 0.200000000, 0.100000000, 0.200000000,...
## $ PageValues <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ SpecialDay <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.0...
## $ Month <chr> "Feb", "Feb", "Feb", "Feb", "Feb", "Fe...
## $ OperatingSystems <fct> 1, 2, 4, 3, 3, 2, 2, 1, 2, 2, 1, 1, 1,...
## $ Browser <fct> 1, 2, 1, 2, 3, 2, 4, 2, 2, 4, 1, 1, 1,...
## $ Region <fct> 1, 1, 9, 2, 1, 1, 3, 1, 2, 1, 3, 4, 1,...
## $ TrafficType <dbl> 1, 2, 3, 4, 4, 3, 3, 5, 3, 2, 3, 3, 3,...
## $ VisitorType <chr> "Returning_Visitor", "Returning_Visito...
## $ Weekend <fct> FALSE, FALSE, FALSE, FALSE, TRUE, FALS...
## $ Revenue <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FAL...
The dataset has 12,330 observations and 18 variables, so we have 17 predictor variables and 1 target variable to predict. Here is the description of the variables in the data:
Administrative = Administrative Value
Administrative_Duration = Duration in Administrative Page
Informational = Informational Value
Informational_Duration = Duration in Informational Page
ProductRelated = Product Related Value
ProductRelated_Duration = Duration in Product Related Page
BounceRates = percentage of visitors who enter the site from that page and then leave (“bounce”) without triggering any other requests to the analytics server during that session
ExitRates = Exit rate of a web page
PageValues = Page values of each web page
SpecialDay = Special days like Valentine's Day, etc.
Month = Month of the year
OperatingSystems = Operating system used
Browser = Browser used
Region = Region of the user
TrafficType = Traffic Type
VisitorType = Types of Visitor
Weekend = Weekend or not
Revenue = Revenue will be generated or not
Based on the descriptions, our variables appear to have the correct data types. Next, we check the correlation between the numerical predictor variables using a visualization from ggcorr.
Several variables are correlated with one another, but the correlations are not very high. Now, let us do the cross validation to split the data into train and test sets.
Cross Validation
Let us split the data into 80% to be our training dataset and 20% to be our testing dataset.
RNGkind(sample.kind = "Rounding")
set.seed(417)
splitted <- initial_split(data = shopper_intention, prop = 0.8, strata = "Revenue")
Now, let us check the proportion of our target variable, Revenue, in the train dataset.
##
## FALSE TRUE
## 0.8452103 0.1547897
Based on the proportion of our target variable, only about 15.5% of the visitors to the website purchase any goods and thus generate revenue for the shop. The target variable is therefore imbalanced.
Then, let us check whether there are missing values in each variable.
## Administrative Administrative_Duration Informational
## 14 14 14
## Informational_Duration ProductRelated ProductRelated_Duration
## 14 14 14
## BounceRates ExitRates PageValues
## 14 14 0
## SpecialDay Month OperatingSystems
## 0 0 0
## Browser Region TrafficType
## 0 0 0
## VisitorType Weekend Revenue
## 0 0 0
Based on the output above, our data has several missing values (NA), but with at most 14 missing values per variable the share is far below 5% of the observations. Hence, we can remove the rows with NA in our preprocessing step.
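The checks described above (class proportion, missing values per column, share of incomplete rows) can be sketched in base R as follows. The small data frame toy and its column names are invented for illustration; with the real data you would pass the shopper data frame instead.

```r
# Hypothetical toy data with a few missing values (not the real dataset)
toy <- data.frame(
  duration = c(12.5, NA, 30, 8, 0, 41),
  pages    = c(1, 2, NA, 4, 5, 6),
  revenue  = factor(c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE))
)

# Proportion of each class in the target variable
class_share <- prop.table(table(toy$revenue))

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Share of rows that contain at least one missing value
share_incomplete <- mean(!complete.cases(toy))

# Drop the incomplete rows only if that share is small
clean <- na.omit(toy)

class_share
na_per_column
share_incomplete
nrow(clean)
```

In this toy example a third of the rows are incomplete, which would be too many to drop blindly; in the shopper data the share is tiny, which is why removing the rows is reasonable there.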
Here, we do several preprocessing steps using the recipe() function from the recipes package. We store all of our preprocessing, including the PCA step, in step_*() functions. The PCA syntax here is step_pca(all_numeric(), threshold = 0.90): we use the numeric variables only and keep the components that account for 90% of the cumulative variance of the data, hence the threshold is set to 0.90.
rec <- recipe(Revenue ~ ., training(splitted)) %>%
  step_naomit(all_predictors()) %>% # remove observations that have NA (missing values)
  step_nzv(all_predictors()) %>% # remove near-zero-variance variables
  step_upsample(Revenue, ratio = 1, seed = 100) %>% # balance the target classes (in recent recipes versions this step lives in the themis package)
  step_center(all_numeric()) %>% # give every numeric predictor a mean of 0
  step_scale(all_numeric()) %>% # give every numeric predictor a standard deviation of 1
  step_pca(all_numeric(), threshold = 0.90) %>% # keep the PCs that cover 90% of the variance
  prep() # estimate the recipe on the training data
Now, let us peek at our train dataset after the preprocessing is applied.
In the train dataset above, we have 1 target variable, 6 categorical predictors and 6 new PCs (the components that cover 90% of the variance) that will be used to train our model.
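The code that turns a prepared recipe into train and test data is not shown in this article. A common pattern, sketched here under the assumption that the recipes package is installed, is juice() for the processed training set and bake() for new data. The example uses the built-in iris data, and the names iris_train, rec_demo, train_demo and test_demo are illustrative only.

```r
library(recipes)

# Simple 80/20 split of the built-in iris data (illustration only)
set.seed(100)
idx <- sample(nrow(iris), size = 0.8 * nrow(iris))
iris_train <- iris[idx, ]
iris_test  <- iris[-idx, ]

# Same kind of pipeline as above: center, scale, then PCA at a 90% threshold
rec_demo <- recipe(Species ~ ., data = iris_train) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  step_pca(all_numeric(), threshold = 0.90) %>%
  prep()

# Processed training data: the steps were estimated on this set
train_demo <- juice(rec_demo)

# Processed test data: the estimated steps are applied, nothing is re-estimated
test_demo <- bake(rec_demo, new_data = iris_test)

names(train_demo)
names(test_demo)
```

The important design point is that centering, scaling and the PCA rotation are learned from the training data only and then reused on the test data, so both sets end up with the same PC columns.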
In our first model, which uses PCA in the data preprocessing, we build a random forest model with 5-fold cross-validation repeated 3 times to predict whether a visitor to our website will generate revenue. We also use the tic() and toc() functions to measure the time elapsed while running the random forest model.
RNGkind(sample.kind = "Rounding")
set.seed(100)
tic()
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model <- train(Revenue ~ ., data = train, method = "rf", trControl = ctrl)
toc()
After running the model, the time consumed to build the model is 1608.41 seconds, or about 27 minutes.
Then, we use the model to predict the test dataset.
Now, let us check the accuracy of the model using a confusion matrix.
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 1954 171
## TRUE 128 210
##
## Accuracy : 0.8786
## 95% CI : (0.8651, 0.8912)
## No Information Rate : 0.8453
## P-Value [Acc > NIR] : 1.425e-06
##
## Kappa : 0.5134
##
## Mcnemar's Test P-Value : 0.01514
##
## Sensitivity : 0.55118
## Specificity : 0.93852
## Pos Pred Value : 0.62130
## Neg Pred Value : 0.91953
## Prevalence : 0.15469
## Detection Rate : 0.08526
## Detection Prevalence : 0.13723
## Balanced Accuracy : 0.74485
##
## 'Positive' Class : TRUE
##
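The headline statistics in the output above can be recomputed by hand from the four cells of the confusion matrix, which helps when reading them. This is a small base R sketch; the counts are copied from the matrix above and the variable names are only for illustration.

```r
# Cell counts copied from the confusion matrix above (positive class: TRUE)
tp <- 210   # predicted TRUE,  actually TRUE
fn <- 171   # predicted FALSE, actually TRUE
fp <- 128   # predicted TRUE,  actually FALSE
tn <- 1954  # predicted FALSE, actually FALSE

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of all correct predictions
sensitivity <- tp / (tp + fn)                   # share of actual TRUE cases found
specificity <- tn / (tn + fp)                   # share of actual FALSE cases found
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced_accuracy), 4)
# accuracy 0.8786, sensitivity 0.5512, specificity 0.9385, balanced 0.7449
```

Note that the accuracy of about 0.88 hides a sensitivity of only about 0.55: the model finds a little more than half of the sessions that actually generate revenue.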
Now, we want to compare the model that uses PCA in preprocessing with a model that uses the same preprocessing but without PCA. Let us make the recipe first.
rec2 <- recipe(Revenue ~ ., training(splitted)) %>%
  step_naomit(all_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_upsample(Revenue, ratio = 1, seed = 100) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  prep()
In this case, we use 16 predictors, because the numeric variables are not compressed into principal components. Now, apply the same random forest algorithm with exactly the same model tuning to compare the time consumed and the accuracy of the model.
RNGkind(sample.kind = "Rounding")
set.seed(100)
tic()
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model2 <- train(Revenue ~ ., data = train2, method = "rf", trControl = ctrl)
toc()
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 1949 143
## TRUE 133 238
##
## Accuracy : 0.8879
## 95% CI : (0.8748, 0.9001)
## No Information Rate : 0.8453
## P-Value [Acc > NIR] : 6.578e-10
##
## Kappa : 0.5669
##
## Mcnemar's Test P-Value : 0.588
##
## Sensitivity : 0.62467
## Specificity : 0.93612
## Pos Pred Value : 0.64151
## Neg Pred Value : 0.93164
## Prevalence : 0.15469
## Detection Rate : 0.09663
## Detection Prevalence : 0.15063
## Balanced Accuracy : 0.78040
##
## 'Positive' Class : TRUE
##
Result:
- The online shopper data has only a few variables that are correlated with one another.
- The two models above (with and without PCA) have similar accuracy (0.879 with PCA, 0.888 without PCA).
- Training takes 1608.41 seconds with PCA and 1936.95 seconds without PCA, so using PCA saves 328.54 seconds, or about 5.5 minutes.
Now, what happens if we have more numeric predictors and stronger correlation?
In this section, we will use the breast cancer dataset. Suppose we want to predict whether a patient is diagnosed with malignant or benign cancer. The predictor variables are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The data itself can be downloaded from the UCI Machine Learning Repository.
Here, we will create two models: the first uses PCA on its predictors, and the second skips PCA in the data preprocessing.
Now, let us take a look at our data.
## Observations: 569
## Variables: 33
## $ id <dbl> 842302, 842517, 84300903, 84348301, 84...
## $ diagnosis <chr> "M", "M", "M", "M", "M", "M", "M", "M"...
## $ radius_mean <dbl> 17.990, 20.570, 19.690, 11.420, 20.290...
## $ texture_mean <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15....
## $ perimeter_mean <dbl> 122.80, 132.90, 130.00, 77.58, 135.10,...
## $ area_mean <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0,...
## $ smoothness_mean <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0....
## $ compactness_mean <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0....
## $ concavity_mean <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0....
## $ `concave points_mean` <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0....
## $ symmetry_mean <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809...
## $ fractal_dimension_mean <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0....
## $ radius_se <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572...
## $ texture_se <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813...
## $ perimeter_se <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.2...
## $ area_se <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27...
## $ smoothness_se <dbl> 0.006399, 0.005225, 0.006150, 0.009110...
## $ compactness_se <dbl> 0.049040, 0.013080, 0.040060, 0.074580...
## $ concavity_se <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0....
## $ `concave points_se` <dbl> 0.015870, 0.013400, 0.020580, 0.018670...
## $ symmetry_se <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0....
## $ fractal_dimension_se <dbl> 0.006193, 0.003532, 0.004571, 0.009208...
## $ radius_worst <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15....
## $ texture_worst <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23....
## $ perimeter_worst <dbl> 184.60, 158.80, 152.50, 98.87, 152.20,...
## $ area_worst <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0,...
## $ smoothness_worst <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374...
## $ compactness_worst <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050...
## $ concavity_worst <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0....
## $ `concave points_worst` <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0....
## $ symmetry_worst <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364...
## $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0....
## $ X33 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
The dataset has 569 observations and 33 variables (32 predictors and 1 response variable). The variables are described below:
id = ID number
diagnosis = diagnosis (M = malignant, B = benign)
Ten real-valued features are computed for each cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension.
The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
The id and X33 variables do not help us predict the diagnosis of a cancer patient, so let us remove them from the data.
## diagnosis radius_mean texture_mean
## 0 0 0
## perimeter_mean area_mean smoothness_mean
## 0 0 0
## compactness_mean concavity_mean concave points_mean
## 0 0 0
## symmetry_mean fractal_dimension_mean radius_se
## 0 0 0
## texture_se perimeter_se area_se
## 0 0 0
## smoothness_se compactness_se concavity_se
## 0 0 0
## concave points_se symmetry_se fractal_dimension_se
## 0 0 0
## radius_worst texture_worst perimeter_worst
## 0 0 0
## area_worst smoothness_worst compactness_worst
## 0 0 0
## concavity_worst concave points_worst symmetry_worst
## 0 0 0
## fractal_dimension_worst
## 0
Now, let us check the correlation between the variables to confirm that they are more highly correlated with one another than in the online shopper data.
From the visualization above, the variables are more highly correlated with one another than in the online shopper data.
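A correlation plot is easy to read but hard to compare across datasets. One possible numeric companion, sketched here in base R, is the mean absolute pairwise correlation of the numeric columns. The helper mean_abs_cor() and the simulated data frames weak and strong are hypothetical names for this illustration, not part of the original analysis.

```r
# Mean absolute correlation over all distinct pairs of numeric columns
mean_abs_cor <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  cm  <- cor(num, use = "pairwise.complete.obs")
  mean(abs(cm[upper.tri(cm)]))
}

# Simulated example: independent columns versus columns sharing one factor
set.seed(1)
n <- 500
weak <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
base <- rnorm(n)
strong <- data.frame(
  x1 = base + rnorm(n, sd = 0.2),
  x2 = base + rnorm(n, sd = 0.2),
  x3 = base + rnorm(n, sd = 0.2)
)

mean_abs_cor(weak)    # close to 0
mean_abs_cor(strong)  # close to 1
```

Applied to the numeric columns of the two datasets in this article, a helper like this would give one number per dataset to back up what the plots suggest.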
RNGkind(sample.kind = "Rounding")
set.seed(100)
idx <- initial_split(cancer, prop = 0.8,strata = "diagnosis")
cancer_train <- training(idx)
cancer_test <- testing(idx)
Using the breast cancer dataset, we first build a model with PCA in the preprocessing. Again, we keep 90% of the variance of the data.
rec_cancer_pca <- recipe(diagnosis ~ ., cancer_train) %>%
  step_naomit(all_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  step_pca(all_numeric(), threshold = 0.9) %>%
  prep()
After applying PCA to the breast cancer dataset, here is the number of variables that we will be using.
From the table above, we use 7 PCs instead of 30 predictor variables. Now let us train the model on the data.
RNGkind(sample.kind = "Rounding")
set.seed(100)
tic()
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model_cancer_pca <- train(diagnosis ~ ., data = cancer_train_pca, method = "rf", trControl = ctrl)
toc() # stop the timer started by tic()
Training with PCA takes 4.88 seconds. Next, we can predict the test dataset with model_cancer_pca.
Now, let us check the performance of our model using a confusion matrix.
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 70 3
## M 1 39
##
## Accuracy : 0.9646
## 95% CI : (0.9118, 0.9903)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9235
##
## Mcnemar's Test P-Value : 0.6171
##
## Sensitivity : 0.9286
## Specificity : 0.9859
## Pos Pred Value : 0.9750
## Neg Pred Value : 0.9589
## Prevalence : 0.3717
## Detection Rate : 0.3451
## Detection Prevalence : 0.3540
## Balanced Accuracy : 0.9572
##
## 'Positive' Class : M
##
The accuracy of the model on the test data when using PCA is 0.96. Next, we build a model without PCA for comparison.
In this part, we want to classify the breast cancer patient diagnosis without PCA in the preprocessing step. Let us create a recipe for it.
rec_cancer <- recipe(diagnosis ~ ., cancer_train) %>%
  step_naomit(all_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  prep()
Here, we want to create a model using the same algorithm and specification to be compared with the previous model.
tic()
set.seed(100)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model_cancer <- train(diagnosis ~ ., data = cancer_train, method = "rf", trControl = ctrl)
toc() # stop the timer started by tic()
Training without PCA takes 11.21 seconds, so the model that uses PCA (4.88 seconds) trains about 2.3 times faster.
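To put the timings of both use cases on the same scale, the speed-up can be expressed as the ratio of training time without PCA to training time with PCA. The numbers below are the elapsed times reported in this article; the helper speedup() is just an illustrative name.

```r
# Elapsed training times in seconds, as reported in this article
time_shopper <- c(with_pca = 1608.41, without_pca = 1936.95)
time_cancer  <- c(with_pca = 4.88,    without_pca = 11.21)

# Ratio above 1 means the model with PCA trained faster
speedup <- function(t) unname(t["without_pca"] / t["with_pca"])

round(speedup(time_shopper), 2)  # 1.2
round(speedup(time_cancer), 2)   # 2.3
```

The relative gain is clearly larger on the breast cancer data, where PCA replaces 30 correlated predictors with 7 components. Keep in mind that these are single timing runs, so the exact ratios will vary from machine to machine.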
How about the accuracy of the model? Is the accuracy greater when we do not use PCA? Let us check it using the confusion matrix below.
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 71 5
## M 0 37
##
## Accuracy : 0.9558
## 95% CI : (0.8998, 0.9855)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9029
##
## Mcnemar's Test P-Value : 0.07364
##
## Sensitivity : 0.8810
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9342
## Prevalence : 0.3717
## Detection Rate : 0.3274
## Detection Prevalence : 0.3274
## Balanced Accuracy : 0.9405
##
## 'Positive' Class : M
##
As it turns out, based on the confusion matrix above, the accuracy without PCA (0.956) is slightly lower than with PCA (0.965), a difference of one test observation out of 113. Hence, in this example PCA works well on high-dimensional data with highly correlated variables [2].
Result:
Principal Component Analysis (PCA) is very useful for speeding up computation by reducing the dimensionality of the data. In addition, when you have high-dimensional data whose variables are highly correlated with one another, PCA can improve the accuracy of a classification model. Unfortunately, using PCA makes your machine learning model less interpretable. Also, PCA is only applicable when your dataset contains more than one numerical variable whose dimension you want to reduce.