This article focuses on applying Principal Component Analysis (PCA) to dimensionality reduction, so the mathematical formulas are not explained in detail here.

If you work with data regularly, you will sometimes face so many predictor variables that the computation becomes heavy. Say you are asked to predict whether an employee in your company will resign, and the variables are the level of satisfaction at work, number of projects, average monthly hours, time spent at the company, and so on. With that many predictors, training the model takes a long time.

In that case, you should reduce the dimension to make the computation lighter. Dimensionality reduction techniques fall into two groups:

Feature Elimination
Feature Extraction

Feature Elimination

Feature elimination means selecting the variables that influence your prediction and discarding the variables that contribute nothing to it. In the employee resignation case, for example, you keep only the variables that influence resignation.

Generally, you choose the variables based on your domain expertise with employee resignation. You can also use statistical techniques for this, such as variance, Spearman correlation, ANOVA, etc. This article does not cover the feature elimination methods, since we want to focus on one of the feature extraction methods.

Feature Extraction

Feature extraction is a technique in which you create new variables from your existing variables. In the employee resignation example, say we have 10 predictor variables for predicting whether an employee will resign. In feature extraction, we create 10 new variables from the 10 given variables and keep only the most informative of them. One technique for doing this is Principal Component Analysis (PCA).

Principal Component Analysis

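Principal Component Analysis (PCA) is a statistical method that reduces the dimension of the data by extracting new variables (principal components) from the predictors and leaving out the components that carry the least information. Information here is measured as variance in the predictors, not by the relation to the value we predict, \(\hat{y}\).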
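
To make the idea concrete, here is a minimal sketch using base R's prcomp(). It is only an illustration: the built-in iris data stands in for a real dataset, and the object names are chosen for this example.

```r
# Illustration only: the built-in iris data stands in for a real dataset.
x <- iris[, 1:4]                      # four numeric predictors

# Center and scale the predictors, then rotate them onto the principal components.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# PCA returns as many components as there are input variables ...
ncol(pca$x)

# ... but the share of variance drops from one component to the next,
# so the first few components carry most of the information.
var_share <- pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(var_share), 3)
```

Dropping the last components is what reduces the dimension: the first columns of pca$x replace the original predictors.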

So, when should you use PCA instead of other methods?¹

When you want to reduce the number of dimensions/variables, but you do not care which variables are removed entirely
When you want to ensure your variables are not correlated with one another
When you are comfortable with making your predictor variables less interpretable
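
The last two points can be seen in a small sketch. It assumes the built-in mtcars data as a stand-in for correlated predictors; the column selection is arbitrary.

```r
# Illustration only: four correlated columns of the built-in mtcars data.
x <- mtcars[, c("mpg", "disp", "hp", "wt")]

# The original variables are clearly correlated with one another.
round(cor(x), 2)

pca <- prcomp(x, center = TRUE, scale. = TRUE)

# The principal component scores are uncorrelated:
# the correlation matrix is the identity (up to rounding error).
round(cor(pca$x), 2)

# Each component is a weighted mix of all original variables,
# which is why the new predictors are harder to interpret.
round(pca$rotation, 2)
```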

In this article, we apply Principal Component Analysis to two datasets, the Online Shoppers Intention dataset and the Breast Cancer dataset. The aim is to compare how powerful PCA is on data whose variables are weakly correlated with one another and on data whose variables are more strongly correlated. Let us start with the Online Shoppers Intention dataset.

Applying PCA on Online Shopper Intention Dataset

We will explore PCA on data with weakly correlated variables and on data with strongly correlated variables. We start with the weakly correlated case.

In this use case, we use the Online Shoppers Intention dataset, downloaded from Kaggle. The data consists of various information related to customer behavior on online shopping websites. Say we want to predict whether a customer will generate revenue for our business.

We will create two models here: the first uses PCA on the predictors, and the second uses the same preprocessing without PCA.

Load the libraries needed.

# data wrangling
library(tidyverse)
library(GGally)

# data preprocessing
library(recipes)

# modelling
library(rsample)
library(caret)

# measure time consumption
library(tictoc)

Load the shopper intention dataset into our environment.

shopper_intention <- read_csv("data_input/online_shoppers_intention.csv")

The data looks like this:

glimpse(shopper_intention)

## Observations: 12,330
## Variables: 18
## $ Administrative          <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,...
## $ Administrative_Duration <dbl> 0, 0, -1, 0, 0, 0, -1, -1, 0, 0, 0, 0,...
## $ Informational           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Informational_Duration  <dbl> 0, 0, -1, 0, 0, 0, -1, -1, 0, 0, 0, 0,...
## $ ProductRelated          <dbl> 1, 2, 1, 2, 10, 19, 1, 1, 2, 3, 3, 16,...
## $ ProductRelated_Duration <dbl> 0.000000, 64.000000, -1.000000, 2.6666...
## $ BounceRates             <dbl> 0.200000000, 0.000000000, 0.200000000,...
## $ ExitRates               <dbl> 0.200000000, 0.100000000, 0.200000000,...
## $ PageValues              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ SpecialDay              <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.0...
## $ Month                   <chr> "Feb", "Feb", "Feb", "Feb", "Feb", "Fe...
## $ OperatingSystems        <fct> 1, 2, 4, 3, 3, 2, 2, 1, 2, 2, 1, 1, 1,...
## $ Browser                 <fct> 1, 2, 1, 2, 3, 2, 4, 2, 2, 4, 1, 1, 1,...
## $ Region                  <fct> 1, 1, 9, 2, 1, 1, 3, 1, 2, 1, 3, 4, 1,...
## $ TrafficType             <dbl> 1, 2, 3, 4, 4, 3, 3, 5, 3, 2, 3, 3, 3,...
## $ VisitorType             <chr> "Returning_Visitor", "Returning_Visito...
## $ Weekend                 <fct> FALSE, FALSE, FALSE, FALSE, TRUE, FALS...
## $ Revenue                 <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FAL...

The dataset has 12,330 observations and 18 variables, so we have 17 predictor variables and 1 target variable to predict. Here is the description of the variables in the data:

Administrative = Administrative Value
Administrative_Duration = Duration in Administrative Page
Informational = Informational Value
Informational_Duration = Duration in Informational Page
ProductRelated = Product Related Value
ProductRelated_Duration = Duration in Product Related Page
BounceRates = percentage of visitors who enter the site from that page and then leave (“bounce”) without triggering any other requests to the analytics server during that session.
ExitRates = Exit rate of a web page
PageValues = Page values of each web page
SpecialDay = Special days like Valentine's Day etc.
Month = Month of the year
OperatingSystems = Operating system used
Browser = Browser used
Region = Region of the user
TrafficType = Traffic Type
VisitorType = Types of Visitor
Weekend = Weekend or not
Revenue = Revenue will be generated or not

Based on the descriptions, our variables appear to have the correct data types. Next, we check the correlation between the numerical predictor variables using a visualization from ggcorr.

ggcorr(select_if(shopper_intention, is.numeric), 
       label = TRUE, 
       hjust = 1, 
       layout.exp = 3)

Several variables are correlated with one another, but the correlations are not very high. Now, let us split the data into train and test sets.

Cross Validation

Let us split the data into 80% to be our training dataset and 20% to be our testing dataset.

RNGkind(sample.kind = "Rounding")
set.seed(417)
splitted <- initial_split(data = shopper_intention, prop = 0.8, strata = "Revenue")

Now, let us check the proportion of our target variable in the train dataset, that is Revenue.

prop.table(table(training(splitted)$Revenue))

## 
##     FALSE      TRUE 
## 0.8452103 0.1547897

Based on the proportion of our target variable, only about 15.5% of the visitors to the website purchased any goods and thus generated revenue for the shop. The target variable is therefore imbalanced.

Then, let us check whether there are any missing values in each variable.

colSums(is.na(shopper_intention))

##          Administrative Administrative_Duration           Informational 
##                      14                      14                      14 
##  Informational_Duration          ProductRelated ProductRelated_Duration 
##                      14                      14                      14 
##             BounceRates               ExitRates              PageValues 
##                      14                      14                       0 
##              SpecialDay                   Month        OperatingSystems 
##                       0                       0                       0 
##                 Browser                  Region             TrafficType 
##                       0                       0                       0 
##             VisitorType                 Weekend                 Revenue 
##                       0                       0                       0

Based on the output above, our data has several missing values (NA), but they affect far less than 5% of our data (14 missing values per affected column out of 12,330 rows, about 0.1%). Hence, we can remove the rows with NA in our preprocessing step.
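
The share of missing data can be worked out from the counts in the output above. The sketch below only redoes that arithmetic; the worst case assumes that the missing values of the 8 affected columns never fall on the same row, which is an assumption and not something the output tells us.

```r
n_rows          <- 12330   # observations in the dataset
na_per_column   <- 14      # missing values in each affected column
n_cols_affected <- 8       # columns that show 14 NAs in the output above

# Share of rows with NA in a single column, in percent (about 0.11).
pct_per_column <- 100 * na_per_column / n_rows

# Worst case: no row is missing in more than one column (about 0.91).
pct_worst_case <- 100 * na_per_column * n_cols_affected / n_rows

round(c(per_column = pct_per_column, worst_case = pct_worst_case), 2)
```

Even in the worst case, less than 1% of the rows would be dropped.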

The Revenue on Online Website Prediction with PCA

Here, we do several preprocessing steps using the recipe() function from the recipes package. We store all of our preprocessing in step_*() functions, including the PCA step. The syntax of the PCA step is step_pca(all_numeric(), threshold = 0.90): we use the numeric variables only and keep enough components to cover 90% of the cumulative variance of the data, hence the threshold is set to 0.90.

rec <- recipe(Revenue~., training(splitted)) %>% 
  step_naomit(all_predictors()) %>% # remove observations that have NA (missing values)
  step_nzv(all_predictors()) %>% # remove near-zero-variance variables
  step_upsample(Revenue, ratio = 1, seed = 100) %>% # balance the target classes (newer releases provide this step in the themis package)
  step_center(all_numeric()) %>% # center the numeric predictors to mean 0
  step_scale(all_numeric()) %>% # scale the numeric predictors to sd 1
  step_pca(all_numeric(), threshold = 0.90) %>% # keep the PCs that cover 90% of the variance
  prep() # estimate the recipe on the training data

train <- juice(rec)
test <- bake(rec, testing(splitted))
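
To illustrate what balancing the target with ratio = 1 means, here is a minimal base R sketch on a toy data frame. It shows the general idea of upsampling (drawing extra minority rows with replacement until both classes are equally frequent), not the exact internals of step_upsample(); the toy data and all object names are made up for this example.

```r
set.seed(100)

# Toy data: 16 sessions without revenue, 4 with revenue.
toy <- data.frame(
  x       = rnorm(20),
  Revenue = factor(rep(c("FALSE", "TRUE"), times = c(16, 4)))
)

counts   <- table(toy$Revenue)
minority <- names(which.min(counts))
n_extra  <- max(counts) - min(counts)   # rows needed to reach a 1:1 ratio

# Draw the extra minority rows with replacement and append them.
minority_rows <- which(toy$Revenue == minority)
extra_rows    <- sample(minority_rows, n_extra, replace = TRUE)
balanced      <- toy[c(seq_len(nrow(toy)), extra_rows), ]

table(balanced$Revenue)   # both classes now have 16 rows
```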

Now, take a peek at our train dataset after the preprocessing is applied.

head(train)

In the train dataset above, we have 1 target variable, 6 categorical predictors and 6 new PCs (the components covering 90% of the variance) that will be used to train our model.
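
How a 90% variance threshold translates into a number of components can be sketched with base R. This mirrors the idea behind the threshold argument rather than the recipes implementation itself, and it uses the built-in mtcars data as a stand-in, so the number of components it finds says nothing about the shopper data.

```r
# Illustration only: all 11 numeric columns of the built-in mtcars data.
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Cumulative share of variance explained by the first k components.
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
round(cum_var, 3)

# Smallest number of components whose cumulative variance reaches 90%.
n_comp <- which(cum_var >= 0.90)[1]

# Keep only those components as the new predictors.
scores <- pca$x[, seq_len(n_comp), drop = FALSE]
dim(scores)
```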

For our first model, which uses PCA in the preprocessing, we build a random forest model with 5-fold cross-validation repeated 3 times to predict whether a visitor to our website will generate revenue. We also use the tic() and toc() functions to measure the time elapsed while training the random forest model.

RNGkind(sample.kind = "Rounding")
set.seed(100)
tic()
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model <- train(Revenue ~ ., data = train, method = "rf", trControl = ctrl)
toc()

After running the model, the time consumed to build it is 1608.41 seconds, or around 27 minutes.

Then, we use the model to predict the test dataset.

prediction_pca <- predict(model, test)

Now, let's check the accuracy of the model using a confusion matrix.

confusionMatrix(prediction_pca, test$Revenue, positive = "TRUE")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE  1954  171
##      TRUE    128  210
##                                           
##                Accuracy : 0.8786          
##                  95% CI : (0.8651, 0.8912)
##     No Information Rate : 0.8453          
##     P-Value [Acc > NIR] : 1.425e-06       
##                                           
##                   Kappa : 0.5134          
##                                           
##  Mcnemar's Test P-Value : 0.01514         
##                                           
##             Sensitivity : 0.55118         
##             Specificity : 0.93852         
##          Pos Pred Value : 0.62130         
##          Neg Pred Value : 0.91953         
##              Prevalence : 0.15469         
##          Detection Rate : 0.08526         
##    Detection Prevalence : 0.13723         
##       Balanced Accuracy : 0.74485         
##                                           
##        'Positive' Class : TRUE            
##
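
The headline statistics in this output follow directly from the four cell counts. As a worked check, the sketch below recomputes them by hand; cm_metrics() is a helper written for this example and is not part of caret.

```r
# Hypothetical helper: basic metrics from the four cells of a 2x2 confusion matrix.
cm_metrics <- function(tp, fp, fn, tn) {
  sensitivity <- tp / (tp + fn)   # share of actual positives that are detected
  specificity <- tn / (tn + fp)   # share of actual negatives that are detected
  c(
    accuracy          = (tp + tn) / (tp + fp + fn + tn),
    sensitivity       = sensitivity,
    specificity       = specificity,
    pos_pred_value    = tp / (tp + fp),
    balanced_accuracy = (sensitivity + specificity) / 2
  )
}

# Cell counts from the matrix above, with "TRUE" as the positive class.
m_pca <- cm_metrics(tp = 210, fp = 128, fn = 171, tn = 1954)
round(m_pca, 4)
```

The rounded values agree with the Accuracy, Sensitivity, Specificity, Pos Pred Value and Balanced Accuracy lines reported above. The model is right in about 88% of cases, but it finds only about 55% of the sessions that actually generate revenue.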

The Revenue on Online Website Prediction without PCA

Now, we want to compare the model that uses PCA in preprocessing with a model that uses the same preprocessing but without PCA. Let us make the recipe first.

rec2 <- recipe(Revenue~., training(splitted)) %>% 
  step_naomit(all_predictors()) %>% 
  step_nzv(all_predictors()) %>% 
  step_upsample(Revenue, ratio = 1, seed = 100) %>% 
  step_center(all_numeric()) %>% 
  step_scale(all_numeric()) %>% 
  prep()

train2 <- juice(rec2)
test2 <- bake(rec2, testing(splitted))

head(train2)

In this case no PCA step is applied, so the numeric predictors are kept as they are instead of being replaced by principal components. Now, apply the same random forest algorithm with exactly the same model tuning to compare the time consumed and the accuracy of the model.

RNGkind(sample.kind = "Rounding")
set.seed(100)
tic()
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model2 <- train(Revenue ~ ., data = train2, method = "rf", trControl = ctrl)
toc()

prediction <- predict(model2, test2)

confusionMatrix(prediction, test2$Revenue, positive = "TRUE")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE  1949  143
##      TRUE    133  238
##                                           
##                Accuracy : 0.8879          
##                  95% CI : (0.8748, 0.9001)
##     No Information Rate : 0.8453          
##     P-Value [Acc > NIR] : 6.578e-10       
##                                           
##                   Kappa : 0.5669          
##                                           
##  Mcnemar's Test P-Value : 0.588           
##                                           
##             Sensitivity : 0.62467         
##             Specificity : 0.93612         
##          Pos Pred Value : 0.64151         
##          Neg Pred Value : 0.93164         
##              Prevalence : 0.15469         
##          Detection Rate : 0.09663         
##    Detection Prevalence : 0.15063         
##       Balanced Accuracy : 0.78040         
##                                           
##        'Positive' Class : TRUE            
##

Result:
- The online shopper data has only a few variables that are correlated with one another.
- The two models above (with and without PCA) have similar accuracy (with PCA 0.879, without PCA 0.888).
- Training takes 1608.41 seconds with PCA and 1936.95 seconds without PCA, so PCA saves 328.54 seconds, or about 5.5 minutes.
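
The time comparison is simple arithmetic on the two elapsed times reported above; the short sketch below spells it out. The variable names are chosen for this example.

```r
time_pca    <- 1608.41   # seconds elapsed, model with PCA
time_no_pca <- 1936.95   # seconds elapsed, model without PCA

saved_sec <- time_no_pca - time_pca
round(saved_sec, 2)                       # 328.54 seconds saved
round(saved_sec / 60, 1)                  # about 5.5 minutes
round(100 * saved_sec / time_no_pca, 1)   # about 17% of the original training time
```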

Now, what happens if we have more numeric predictors and stronger correlation?

Applying PCA in Breast Cancer Dataset

In this section, we use the breast cancer dataset. Say we want to predict whether a patient is diagnosed with malignant or benign cancer. The predictor variables are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The data can be downloaded from the UCI Machine Learning Repository.

Here, we again create two models: the first uses PCA on the predictors, and the second uses the same preprocessing without PCA.

cancer <- read_csv("data_input/breast-cancer-wisconsin-data/data.csv")

Now, let us take a look at our data.

glimpse(cancer)

## Observations: 569
## Variables: 33
## $ id                      <dbl> 842302, 842517, 84300903, 84348301, 84...
## $ diagnosis               <chr> "M", "M", "M", "M", "M", "M", "M", "M"...
## $ radius_mean             <dbl> 17.990, 20.570, 19.690, 11.420, 20.290...
## $ texture_mean            <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15....
## $ perimeter_mean          <dbl> 122.80, 132.90, 130.00, 77.58, 135.10,...
## $ area_mean               <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0,...
## $ smoothness_mean         <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0....
## $ compactness_mean        <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0....
## $ concavity_mean          <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0....
## $ `concave points_mean`   <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0....
## $ symmetry_mean           <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809...
## $ fractal_dimension_mean  <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0....
## $ radius_se               <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572...
## $ texture_se              <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813...
## $ perimeter_se            <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.2...
## $ area_se                 <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27...
## $ smoothness_se           <dbl> 0.006399, 0.005225, 0.006150, 0.009110...
## $ compactness_se          <dbl> 0.049040, 0.013080, 0.040060, 0.074580...
## $ concavity_se            <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0....
## $ `concave points_se`     <dbl> 0.015870, 0.013400, 0.020580, 0.018670...
## $ symmetry_se             <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0....
## $ fractal_dimension_se    <dbl> 0.006193, 0.003532, 0.004571, 0.009208...
## $ radius_worst            <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15....
## $ texture_worst           <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23....
## $ perimeter_worst         <dbl> 184.60, 158.80, 152.50, 98.87, 152.20,...
## $ area_worst              <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0,...
## $ smoothness_worst        <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374...
## $ compactness_worst       <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050...
## $ concavity_worst         <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0....
## $ `concave points_worst`  <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0....
## $ symmetry_worst          <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364...
## $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0....
## $ X33                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

The dataset has 569 observations and 33 variables: the response variable diagnosis, an id column, 30 numeric features, and an empty column X33. The variables are described below:

id = ID number
diagnosis = (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

radius (mean of distances from center to points on the perimeter)
texture (standard deviation of gray-scale values)
perimeter
area
smoothness (local variation in radius lengths)
compactness (perimeter^2 / area - 1.0)
concavity (severity of concave portions of the contour)
concave points (number of concave portions of the contour)
symmetry
fractal dimension (“coastline approximation” - 1)

The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.

The id and X33 variables do not help us predict the diagnosis of a cancer patient, so let us remove them from the data.

cancer <- cancer %>% 
  select(-c(X33, id))

colSums(is.na(cancer))

##               diagnosis             radius_mean            texture_mean 
##                       0                       0                       0 
##          perimeter_mean               area_mean         smoothness_mean 
##                       0                       0                       0 
##        compactness_mean          concavity_mean     concave points_mean 
##                       0                       0                       0 
##           symmetry_mean  fractal_dimension_mean               radius_se 
##                       0                       0                       0 
##              texture_se            perimeter_se                 area_se 
##                       0                       0                       0 
##           smoothness_se          compactness_se            concavity_se 
##                       0                       0                       0 
##       concave points_se             symmetry_se    fractal_dimension_se 
##                       0                       0                       0 
##            radius_worst           texture_worst         perimeter_worst 
##                       0                       0                       0 
##              area_worst        smoothness_worst       compactness_worst 
##                       0                       0                       0 
##         concavity_worst    concave points_worst          symmetry_worst 
##                       0                       0                       0 
## fractal_dimension_worst 
##                       0

Now, let us check the correlation between the variables to confirm that they are more highly correlated with one another than those in the online shopper data.

ggcorr(cancer, label = TRUE, hjust = 1, label_size = 2, layout.exp = 6)

From the visualization above, the variables are more highly correlated with one another than in the online shopper data.
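
A correlation plot can also be summarised as a single number, which makes two datasets easier to compare. The sketch below computes the share of variable pairs with a strong correlation; it uses the built-in mtcars data as a stand-in, and the 0.8 cutoff is an arbitrary choice for this example.

```r
# Illustration only: the built-in mtcars data stands in for the numeric predictors.
corr <- cor(mtcars)

# Each pair of variables appears once in the upper triangle of the matrix.
pair_corr <- corr[upper.tri(corr)]
length(pair_corr)   # number of distinct variable pairs

# Share of pairs whose absolute correlation exceeds the cutoff.
cutoff <- 0.8
mean(abs(pair_corr) > cutoff)
```

Running the same lines on the numeric columns of both datasets would give a rough, comparable measure of how strongly their predictors are correlated.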

RNGkind(sample.kind = "Rounding")
set.seed(100)
idx <- initial_split(cancer, prop = 0.8,strata = "diagnosis")
cancer_train <- training(idx)
cancer_test <- testing(idx)

The Breast Cancer Prediction with PCA

Using the breast cancer dataset, we first build a model with PCA in the preprocessing. As before, we keep 90% of the variance of the data.

rec_cancer_pca <- recipe(diagnosis~., cancer_train) %>% 
  step_naomit(all_predictors()) %>% 
  step_nzv(all_predictors()) %>%  
  step_center(all_numeric()) %>%  
  step_scale(all_numeric()) %>%  
  step_pca(all_numeric(), threshold = 0.9) %>%  
  prep()

cancer_train_pca <- juice(rec_cancer_pca)
cancer_test_pca <- bake(rec_cancer_pca, cancer_test)

After applying PCA to the breast cancer dataset, here are the variables that we will be using.

head(cancer_train_pca)

From the table above, we use 7 PCs instead of 30 predictor variables. Now let's train the model on this data.

RNGkind(sample.kind = "Rounding")
set.seed(100)
tic()
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model_cancer_pca <- train(diagnosis ~ ., data = cancer_train_pca, method = "rf", trControl = ctrl)

toc()

Training with PCA takes 4.88 seconds. Next, we predict the test dataset with model_cancer_pca.

pred_cancer_pca <- predict(model_cancer_pca, cancer_test_pca)

Now, let us evaluate the model using a confusion matrix.

confusionMatrix(pred_cancer_pca, cancer_test_pca$diagnosis, positive = "M")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 70  3
##          M  1 39
##                                           
##                Accuracy : 0.9646          
##                  95% CI : (0.9118, 0.9903)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9235          
##                                           
##  Mcnemar's Test P-Value : 0.6171          
##                                           
##             Sensitivity : 0.9286          
##             Specificity : 0.9859          
##          Pos Pred Value : 0.9750          
##          Neg Pred Value : 0.9589          
##              Prevalence : 0.3717          
##          Detection Rate : 0.3451          
##    Detection Prevalence : 0.3540          
##       Balanced Accuracy : 0.9572          
##                                           
##        'Positive' Class : M               
##

The accuracy of the model on the test data when using PCA is 0.96. Next, we build a model that does not use PCA for comparison.

The Breast Cancer Prediction without PCA

In this part, we classify the breast cancer diagnosis without PCA in the preprocessing step. Let us create a recipe for it.

rec_cancer <- recipe(diagnosis~., cancer_train) %>% 
  step_naomit(all_predictors()) %>% 
  step_nzv(all_predictors()) %>% 
  step_center(all_numeric()) %>% 
  step_scale(all_numeric()) %>% 
  prep()

cancer_train <- juice(rec_cancer)
cancer_test <- bake(rec_cancer, cancer_test)

Here, we create a model using the same algorithm and specification so it can be compared with the previous model.

tic()
set.seed(100)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model_cancer <- train(diagnosis ~ ., data = cancer_train, method = "rf", trControl = ctrl)

toc()

Training without PCA takes 11.21 seconds, which means the model that uses PCA in preprocessing (4.88 seconds) trains about 2.3x faster.

pred_cancer <- predict(model_cancer, cancer_test)

How about the accuracy of the model? Is the accuracy greater when we do not use PCA? Let us check with the confusion matrix below.

confusionMatrix(pred_cancer, cancer_test$diagnosis, positive = "M")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 71  5
##          M  0 37
##                                           
##                Accuracy : 0.9558          
##                  95% CI : (0.8998, 0.9855)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9029          
##                                           
##  Mcnemar's Test P-Value : 0.07364         
##                                           
##             Sensitivity : 0.8810          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9342          
##              Prevalence : 0.3717          
##          Detection Rate : 0.3274          
##    Detection Prevalence : 0.3274          
##       Balanced Accuracy : 0.9405          
##                                           
##        'Positive' Class : M               
##

It turns out that, based on the confusion matrix above, the accuracy without PCA (0.956) is slightly lower than with PCA (0.965), a difference of one test observation out of 113. Hence, PCA works well on data that is high-dimensional and has highly correlated variables².

Result:

The breast cancer dataset has many variables that are correlated with one another.
The two models above (with and without PCA) have similar accuracy (with PCA 0.965, without PCA 0.956).
Training takes 4.88 seconds with PCA and 11.21 seconds without PCA, so PCA saves 6.33 seconds; the computation with PCA is more than 2x faster than without it.
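
As a worked check of these numbers, the sketch below recomputes the speed-up from the two elapsed times and the accuracies from the cell counts of the two confusion matrices above. The variable names are chosen for this example.

```r
# Training times reported above, in seconds.
time_pca    <- 4.88
time_no_pca <- 11.21
round(time_no_pca - time_pca, 2)   # 6.33 seconds saved
round(time_no_pca / time_pca, 2)   # about 2.3 times faster with PCA

# Correct predictions are the diagonal cells of each confusion matrix.
n_test         <- 70 + 3 + 1 + 39   # 113 test observations
correct_pca    <- 70 + 39           # with PCA
correct_no_pca <- 71 + 37           # without PCA

round(correct_pca / n_test, 4)      # 0.9646
round(correct_no_pca / n_test, 4)   # 0.9558
correct_pca - correct_no_pca        # the gap is one test observation
```

Because the test set holds only 113 observations, the accuracy gap comes down to a single prediction, so the clearer gain on this dataset is the training time.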

Conclusion

Principal Component Analysis (PCA) is very useful for speeding up computation by reducing the dimensionality of the data. In addition, when you have high-dimensional data with highly correlated variables, PCA can maintain or even slightly improve the accuracy of a classification model. Unfortunately, using PCA makes your machine learning model less interpretable. Also, PCA is only applicable when your dataset contains more than one numerical variable whose dimension you want to reduce.

Reference:

When to use KNN and when to use clustering

Time Efficiency and Accuracy Improvement using PCA

Yaumil Sitta

April 13, 2020