PCA Analysis and K-Means Clustering Explained

Musthofa Syarifudin

June 5, 2021

Introduction

Description

This notebook is a guide to unsupervised learning, using Principal Component Analysis and K-Means Clustering to analyze the Kaggle competition Tabular Playground Series - June 2021.

Structure

Here is the structure of the notebook:

  1. Introduction to Unsupervised Learning
  2. Introduction to K-Means Clustering
  3. Analyzing Kaggle Data with K-Means Clustering
  4. Introduction to PCA
  5. PCA Implementation
  6. Verdict

Introduction to Unsupervised Learning

Unsupervised learning is a type of algorithm that learns patterns from untagged (unlabelled) data. The hope is that, through mimicry, the machine is forced to build a compact internal representation of its world and can then generate new content from it. In short, unsupervised learning refers to algorithms that learn the structure of data without a labelling process. It is used in many applications such as dimensionality reduction, feature extraction, anomaly detection, clustering, and structure prediction; related ideas of learning without explicit labels also appear in more advanced settings such as reinforcement learning, where a program learns from its own past experience (self-learning), as in some OpenAI projects.

Two of the main methods used in unsupervised learning are principal component analysis and cluster analysis. Cluster analysis is used in unsupervised learning to group, or segment, datasets with shared attributes in order to extrapolate algorithmic relationships. It is a branch of machine learning that groups data that has not been labelled, classified, or categorized. Instead of responding to feedback, cluster analysis identifies commonalities in the data and reacts based on the presence or absence of such commonalities in each new piece of data. This approach also helps detect anomalous data points that do not fit into any group.

K-Means Clustering

Introduction to K-Means Clustering

K-means clustering is a method of vector quantization that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (the cluster center or centroid), which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
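
As a small illustration of the idea (a toy example, not part of the competition data), kmeans() can be run on two obvious blobs of points and will recover one centroid per blob:

# toy example: two Gaussian blobs, clustered with k = 2
set.seed(1)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))
toy_km <- kmeans(toy, centers = 2)
toy_km$centers        # one prototype (centroid) per cluster
table(toy_km$cluster) # number of points assigned to each cluster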

Analyzing Kaggle Data with K-Means Clustering

In this notebook we will try to use K-Means Clustering to analyze the data from the Kaggle Tabular Playground competition.

Preprocessing Data
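
The code in this notebook assumes a few packages are already loaded; a minimal setup, inferred from the functions used later (glimpse(), the pipe, inspect_num(), inspect_cat(), ggplot()), might look like this:

# packages used throughout the notebook
library(dplyr)      # glimpse(), select(), mutate(), %>%
library(inspectdf)  # inspect_num(), inspect_cat(), show_plot()
library(ggplot2)    # scatter plot of the principal components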

read the train dataset

data <- read.csv("train.csv")
rmarkdown::paged_table(glimpse(data))
#> Rows: 200,000
#> Columns: 77
#> $ id         <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1~
#> $ feature_0  <int> 0, 0, 0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0, 3,~
#> $ feature_1  <int> 0, 0, 0, 0, 0, 15, 1, 3, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 1~
#> $ feature_2  <int> 6, 0, 0, 7, 0, 0, 2, 5, 0, 0, 0, 0, 28, 1, 0, 0, 1, 0, 0, 1~
#> $ feature_3  <int> 1, 0, 0, 0, 0, 0, 1, 0, 35, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 1~
#> $ feature_4  <int> 0, 0, 0, 1, 0, 0, 0, 0, 6, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0,~
#> $ feature_5  <int> 0, 0, 1, 5, 0, 1, 2, 1, 2, 0, 0, 0, 3, 1, 0, 1, 0, 0, 0, 1,~
#> $ feature_6  <int> 0, 0, 0, 2, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0,~
#> $ feature_7  <int> 0, 0, 3, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 1,~
#> $ feature_8  <int> 7, 0, 0, 0, 0, 0, 0, 10, 3, 0, 0, 3, 2, 0, 2, 7, 16, 1, 10,~
#> $ feature_9  <int> 0, 0, 0, 1, 0, 2, 2, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 2, 0,~
#> $ feature_10 <int> 0, 0, 1, 2, 0, 0, 1, 0, 6, 1, 0, 0, 7, 0, 0, 0, 0, 3, 0, 1,~
#> $ feature_11 <int> 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 2, 0, 0, 1, 2, 0, 0, 0, 0,~
#> $ feature_12 <int> 3, 1, 0, 5, 0, 0, 5, 0, 0, 0, 0, 2, 0, 4, 0, 2, 9, 4, 0, 2,~
#> $ feature_13 <int> 0, 0, 0, 0, 0, 1, 0, 3, 0, 2, 1, 0, 1, 16, 0, 0, 0, 1, 0, 0~
#> $ feature_14 <int> 1, 0, 0, 0, 0, 0, 0, 6, 7, 0, 0, 0, 2, 1, 0, 2, 0, 0, 0, 1,~
#> $ feature_15 <int> 0, 0, 0, 4, 0, 1, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0,~
#> $ feature_16 <int> 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 2, 1, 3, 5, 0, 0, 0, 2,~
#> $ feature_17 <int> 3, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,~
#> $ feature_18 <int> 3, 0, 0, 22, 0, 0, 3, 0, 0, 0, 0, 11, 1, 5, 0, 3, 0, 0, 1, ~
#> $ feature_19 <int> 1, 0, 5, 2, 0, 7, 12, 8, 5, 6, 3, 9, 7, 1, 3, 9, 2, 4, 2, 3~
#> $ feature_20 <int> 0, 0, 4, 1, 1, 1, 4, 4, 1, 3, 2, 0, 3, 0, 0, 0, 1, 0, 0, 0,~
#> $ feature_21 <int> 2, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 3, 0,~
#> $ feature_22 <int> 0, 0, 0, 0, 0, 0, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0~
#> $ feature_23 <int> 0, 0, 0, 0, 0, 0, 3, 5, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,~
#> $ feature_24 <int> 0, 0, 0, 0, 0, 2, 0, 1, 3, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,~
#> $ feature_25 <int> 0, 0, 0, 3, 1, 0, 1, 5, 4, 7, 0, 0, 7, 0, 3, 3, 0, 0, 0, 3,~
#> $ feature_26 <int> 0, 0, 0, 0, 0, 2, 3, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,~
#> $ feature_27 <int> 0, 0, 0, 37, 0, 0, 1, 0, 1, 0, 3, 1, 4, 0, 0, 0, 0, 0, 0, 0~
#> $ feature_28 <int> 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 7, 1, 0, 2, 1, 1, 0, 1,~
#> $ feature_29 <int> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 5,~
#> $ feature_30 <int> 0, 0, 0, 3, 0, 1, 3, 8, 0, 1, 0, 0, 2, 0, 1, 0, 0, 0, 0, 1,~
#> $ feature_31 <int> 1, 0, 0, 13, 0, 0, 2, 3, 0, 0, 0, 0, 4, 2, 0, 2, 0, 1, 0, 2~
#> $ feature_32 <int> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 5, 1, 0, 1, 0, 0, 0, 0,~
#> $ feature_33 <int> 0, 0, 0, 10, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1~
#> $ feature_34 <int> 0, 0, 2, 0, 0, 0, 0, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,~
#> $ feature_35 <int> 0, 1, 0, 3, 0, 1, 0, 4, 31, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0~
#> $ feature_36 <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,~
#> $ feature_37 <int> 11, 0, 5, 1, 2, 0, 0, 2, 0, 0, 0, 11, 1, 3, 0, 11, 0, 0, 0,~
#> $ feature_38 <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 6, 12~
#> $ feature_39 <int> 0, 0, 5, 7, 5, 10, 0, 3, 0, 0, 6, 0, 0, 1, 1, 0, 3, 5, 0, 3~
#> $ feature_40 <int> 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 1,~
#> $ feature_41 <int> 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 2, 1,~
#> $ feature_42 <int> 0, 0, 0, 2, 0, 2, 0, 0, 8, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,~
#> $ feature_43 <int> 9, 2, 0, 0, 0, 0, 7, 7, 0, 0, 0, 5, 2, 1, 1, 1, 2, 1, 2, 3,~
#> $ feature_44 <int> 0, 0, 0, 1, 0, 0, 0, 1, 2, 0, 0, 2, 1, 1, 0, 2, 0, 0, 0, 1,~
#> $ feature_45 <int> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,~
#> $ feature_46 <int> 0, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 1,~
#> $ feature_47 <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0,~
#> $ feature_48 <int> 0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 1, 0, 2, 2, 0, 0, 0, 0, 0, 14~
#> $ feature_49 <int> 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,~
#> $ feature_50 <int> 3, 0, 7, 0, 0, 0, 0, 3, 0, 0, 0, 1, 8, 1, 2, 0, 0, 1, 4, 2,~
#> $ feature_51 <int> 0, 0, 0, 10, 0, 0, 1, 42, 0, 1, 0, 0, 3, 2, 0, 0, 1, 1, 0, ~
#> $ feature_52 <int> 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 0, 1, 4, 0, 0, 0, 0, 0, 0, 0,~
#> $ feature_53 <int> 3, 0, 1, 0, 0, 0, 3, 4, 0, 0, 0, 4, 3, 3, 0, 0, 1, 0, 0, 6,~
#> $ feature_54 <int> 0, 0, 0, 25, 3, 1, 36, 6, 1, 4, 6, 18, 39, 2, 3, 5, 0, 2, 0~
#> $ feature_55 <int> 0, 0, 3, 1, 0, 0, 0, 16, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ feature_56 <int> 0, 0, 4, 0, 0, 2, 0, 6, 3, 4, 0, 4, 24, 4, 0, 1, 0, 0, 2, 2~
#> $ feature_57 <int> 0, 0, 0, 1, 1, 0, 0, 0, 7, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,~
#> $ feature_58 <int> 0, 0, 0, 2, 0, 2, 0, 1, 1, 1, 0, 3, 1, 9, 1, 0, 0, 0, 0, 1,~
#> $ feature_59 <int> 0, 0, 1, 0, 0, 0, 5, 1, 2, 0, 0, 2, 0, 0, 2, 32, 0, 0, 0, 0~
#> $ feature_60 <int> 0, 0, 3, 2, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 0, 4, 3, 0, 0, 0,~
#> $ feature_61 <int> 1, 0, 0, 0, 0, 3, 0, 1, 0, 0, 0, 4, 2, 2, 1, 0, 0, 0, 0, 0,~
#> $ feature_62 <int> 1, 0, 2, 7, 0, 2, 0, 1, 1, 0, 1, 1, 5, 0, 0, 0, 0, 0, 0, 2,~
#> $ feature_63 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
#> $ feature_64 <int> 0, 0, 0, 0, 0, 1, 0, 0, 6, 1, 1, 0, 0, 0, 6, 0, 0, 0, 0, 2,~
#> $ feature_65 <int> 3, 0, 8, 0, 0, 0, 0, 1, 0, 0, 0, 7, 6, 3, 0, 7, 0, 0, 0, 0,~
#> $ feature_66 <int> 0, 2, 0, 0, 0, 0, 1, 8, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0,~
#> $ feature_67 <int> 0, 0, 0, 4, 0, 0, 2, 0, 37, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 1~
#> $ feature_68 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 1, 0, 2, 0, 1, 1, 0,~
#> $ feature_69 <int> 0, 0, 0, 2, 0, 0, 2, 0, 5, 0, 0, 0, 0, 1, 3, 54, 0, 0, 0, 0~
#> $ feature_70 <int> 0, 0, 1, 2, 0, 0, 0, 0, 4, 0, 0, 0, 1, 40, 0, 0, 0, 1, 0, 0~
#> $ feature_71 <int> 0, 0, 0, 0, 0, 0, 1, 2, 1, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 4,~
#> $ feature_72 <int> 2, 0, 0, 4, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 1, 2, 1, 0, 0,~
#> $ feature_73 <int> 0, 1, 0, 3, 0, 0, 0, 60, 0, 10, 3, 0, 6, 0, 0, 6, 0, 0, 0, ~
#> $ feature_74 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0,~
#> $ target     <chr> "Class_6", "Class_6", "Class_2", "Class_8", "Class_2", "Cla~

The train data has 200,000 samples and 77 columns. The dataset is synthetic, but based on a real dataset and generated using a CTGAN. The original dataset deals with predicting the category of an eCommerce product given various attributes about the listing. Although the features are anonymized, they have properties relating to real-world features.

read the sample submission dataset

sample_sub <- read.csv("sample_submission.csv")
glimpse(sample_sub)
#> Rows: 100,000
#> Columns: 10
#> $ id      <int> 200000, 200001, 200002, 200003, 200004, 200005, 200006, 200007~
#> $ Class_1 <dbl> 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111~
#> $ Class_2 <dbl> 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111~
#> $ Class_3 <dbl> 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111~
#> $ Class_4 <dbl> 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111~
#> $ Class_5 <dbl> 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111~
#> $ Class_6 <dbl> 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111~
#> $ Class_7 <dbl> 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111~
#> $ Class_8 <dbl> 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111~
#> $ Class_9 <dbl> 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111~

To be able to submit to Kaggle, we need to predict the probability of each class and include the test id number.
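
As a minimal sketch (using only what is shown above), a valid submission file could be written by reusing the uniform baseline probabilities; in a real solution the Class_1 to Class_9 columns would be replaced by the predicted probabilities of a model:

# write a baseline submission with uniform class probabilities
submission <- sample_sub
write.csv(submission, "submission.csv", row.names = FALSE)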

check for NA values in each column

colSums(is.na(data))
#>         id  feature_0  feature_1  feature_2  feature_3  feature_4  feature_5 
#>          0          0          0          0          0          0          0 
#>  feature_6  feature_7  feature_8  feature_9 feature_10 feature_11 feature_12 
#>          0          0          0          0          0          0          0 
#> feature_13 feature_14 feature_15 feature_16 feature_17 feature_18 feature_19 
#>          0          0          0          0          0          0          0 
#> feature_20 feature_21 feature_22 feature_23 feature_24 feature_25 feature_26 
#>          0          0          0          0          0          0          0 
#> feature_27 feature_28 feature_29 feature_30 feature_31 feature_32 feature_33 
#>          0          0          0          0          0          0          0 
#> feature_34 feature_35 feature_36 feature_37 feature_38 feature_39 feature_40 
#>          0          0          0          0          0          0          0 
#> feature_41 feature_42 feature_43 feature_44 feature_45 feature_46 feature_47 
#>          0          0          0          0          0          0          0 
#> feature_48 feature_49 feature_50 feature_51 feature_52 feature_53 feature_54 
#>          0          0          0          0          0          0          0 
#> feature_55 feature_56 feature_57 feature_58 feature_59 feature_60 feature_61 
#>          0          0          0          0          0          0          0 
#> feature_62 feature_63 feature_64 feature_65 feature_66 feature_67 feature_68 
#>          0          0          0          0          0          0          0 
#> feature_69 feature_70 feature_71 feature_72 feature_73 feature_74     target 
#>          0          0          0          0          0          0          0

The train data contains zero NA values across all features.

We need to remove the id column because it carries no additional information about the target.

data_clean <- data[, -1] %>% mutate(target = as.factor(target))
rmarkdown::paged_table(head(data_clean))

we check the summary of the train data

summary(data_clean)
#>    feature_0         feature_1        feature_2        feature_3     
#>  Min.   : 0.0000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
#>  1st Qu.: 0.0000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000  
#>  Median : 0.0000   Median : 0.000   Median : 0.000   Median : 0.000  
#>  Mean   : 0.9727   Mean   : 1.168   Mean   : 2.219   Mean   : 2.297  
#>  3rd Qu.: 1.0000   3rd Qu.: 1.000   3rd Qu.: 1.000   3rd Qu.: 1.000  
#>  Max.   :61.0000   Max.   :51.000   Max.   :64.000   Max.   :70.000  
#>                                                                      
#>    feature_4         feature_5        feature_6        feature_7      
#>  Min.   : 0.0000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.0000  
#>  1st Qu.: 0.0000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.0000  
#>  Median : 0.0000   Median : 0.000   Median : 0.000   Median : 0.0000  
#>  Mean   : 0.7935   Mean   : 1.431   Mean   : 1.011   Mean   : 0.6731  
#>  3rd Qu.: 0.0000   3rd Qu.: 1.000   3rd Qu.: 0.000   3rd Qu.: 0.0000  
#>  Max.   :38.0000   Max.   :76.000   Max.   :43.000   Max.   :30.0000  
#>                                                                       
#>    feature_8        feature_9       feature_10       feature_11    
#>  Min.   : 0.000   Min.   : 0.00   Min.   : 0.000   Min.   : 0.000  
#>  1st Qu.: 0.000   1st Qu.: 0.00   1st Qu.: 0.000   1st Qu.: 0.000  
#>  Median : 0.000   Median : 0.00   Median : 0.000   Median : 0.000  
#>  Mean   : 1.944   Mean   : 1.72   Mean   : 1.423   Mean   : 0.981  
#>  3rd Qu.: 2.000   3rd Qu.: 1.00   3rd Qu.: 1.000   3rd Qu.: 0.000  
#>  Max.   :38.000   Max.   :72.00   Max.   :33.000   Max.   :46.000  
#>                                                                    
#>    feature_12       feature_13       feature_14       feature_15     
#>  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   :  0.000  
#>  1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.:  0.000  
#>  Median : 1.000   Median : 0.000   Median : 0.000   Median :  0.000  
#>  Mean   : 2.445   Mean   : 1.078   Mean   : 1.406   Mean   :  1.413  
#>  3rd Qu.: 3.000   3rd Qu.: 1.000   3rd Qu.: 2.000   3rd Qu.:  0.000  
#>  Max.   :37.000   Max.   :43.000   Max.   :32.000   Max.   :121.000  
#>                                                                      
#>    feature_16      feature_17        feature_18       feature_19     
#>  Min.   : 0.00   Min.   : 0.0000   Min.   : 0.000   Min.   :  0.000  
#>  1st Qu.: 0.00   1st Qu.: 0.0000   1st Qu.: 0.000   1st Qu.:  0.000  
#>  Median : 0.00   Median : 0.0000   Median : 1.000   Median :  2.000  
#>  Mean   : 1.39   Mean   : 0.3177   Mean   : 1.657   Mean   :  6.187  
#>  3rd Qu.: 1.00   3rd Qu.: 0.0000   3rd Qu.: 2.000   3rd Qu.:  7.000  
#>  Max.   :27.00   Max.   :14.0000   Max.   :22.000   Max.   :263.000  
#>                                                                      
#>    feature_20       feature_21       feature_22        feature_23    
#>  Min.   : 0.000   Min.   : 0.000   Min.   :  0.000   Min.   : 0.000  
#>  1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.:  0.000   1st Qu.: 0.000  
#>  Median : 1.000   Median : 0.000   Median :  0.000   Median : 0.000  
#>  Mean   : 1.439   Mean   : 1.031   Mean   :  1.466   Mean   : 0.572  
#>  3rd Qu.: 2.000   3rd Qu.: 1.000   3rd Qu.:  0.000   3rd Qu.: 0.000  
#>  Max.   :30.000   Max.   :33.000   Max.   :123.000   Max.   :22.000  
#>                                                                      
#>    feature_24       feature_25        feature_26        feature_27     
#>  Min.   : 0.000   Min.   :  0.000   Min.   : 0.0000   Min.   : 0.0000  
#>  1st Qu.: 0.000   1st Qu.:  0.000   1st Qu.: 0.0000   1st Qu.: 0.0000  
#>  Median : 0.000   Median :  0.000   Median : 0.0000   Median : 0.0000  
#>  Mean   : 1.061   Mean   :  2.349   Mean   : 0.7745   Mean   : 0.7893  
#>  3rd Qu.: 0.000   3rd Qu.:  2.000   3rd Qu.: 1.0000   3rd Qu.: 1.0000  
#>  Max.   :69.000   Max.   :149.000   Max.   :24.0000   Max.   :84.0000  
#>                                                                        
#>    feature_28        feature_29       feature_30        feature_31    
#>  Min.   :  0.000   Min.   : 0.000   Min.   : 0.0000   Min.   : 0.000  
#>  1st Qu.:  0.000   1st Qu.: 0.000   1st Qu.: 0.0000   1st Qu.: 0.000  
#>  Median :  0.000   Median : 0.000   Median : 0.0000   Median : 1.000  
#>  Mean   :  2.326   Mean   : 1.582   Mean   : 0.5988   Mean   : 1.857  
#>  3rd Qu.:  1.000   3rd Qu.: 1.000   3rd Qu.: 1.0000   3rd Qu.: 2.000  
#>  Max.   :105.000   Max.   :84.000   Max.   :22.0000   Max.   :39.000  
#>                                                                       
#>    feature_32       feature_33       feature_34        feature_35    
#>  Min.   : 0.000   Min.   : 0.000   Min.   : 0.0000   Min.   : 0.000  
#>  1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.0000   1st Qu.: 0.000  
#>  Median : 0.000   Median : 0.000   Median : 0.0000   Median : 0.000  
#>  Mean   : 1.516   Mean   : 1.557   Mean   : 0.6811   Mean   : 1.162  
#>  3rd Qu.: 0.000   3rd Qu.: 2.000   3rd Qu.: 1.0000   3rd Qu.: 1.000  
#>  Max.   :78.000   Max.   :41.000   Max.   :36.0000   Max.   :41.000  
#>                                                                      
#>    feature_36        feature_37     feature_38       feature_39    
#>  Min.   : 0.0000   Min.   : 0.0   Min.   : 0.000   Min.   : 0.000  
#>  1st Qu.: 0.0000   1st Qu.: 0.0   1st Qu.: 0.000   1st Qu.: 0.000  
#>  Median : 0.0000   Median : 0.0   Median : 0.000   Median : 1.000  
#>  Mean   : 0.6654   Mean   : 1.5   Mean   : 1.276   Mean   : 2.333  
#>  3rd Qu.: 0.0000   3rd Qu.: 2.0   3rd Qu.: 0.000   3rd Qu.: 3.000  
#>  Max.   :42.0000   Max.   :34.0   Max.   :41.000   Max.   :49.000  
#>                                                                    
#>    feature_40       feature_41       feature_42        feature_43    
#>  Min.   : 0.000   Min.   : 0.000   Min.   : 0.0000   Min.   : 0.000  
#>  1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.0000   1st Qu.: 0.000  
#>  Median : 0.000   Median : 0.000   Median : 0.0000   Median : 2.000  
#>  Mean   : 1.255   Mean   : 1.159   Mean   : 0.8346   Mean   : 4.473  
#>  3rd Qu.: 1.000   3rd Qu.: 1.000   3rd Qu.: 1.0000   3rd Qu.: 5.000  
#>  Max.   :81.000   Max.   :73.000   Max.   :53.0000   Max.   :63.000  
#>                                                                      
#>    feature_44        feature_45        feature_46        feature_47     
#>  Min.   : 0.0000   Min.   : 0.0000   Min.   :  0.000   Min.   : 0.0000  
#>  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.:  0.000   1st Qu.: 0.0000  
#>  Median : 0.0000   Median : 0.0000   Median :  0.000   Median : 0.0000  
#>  Mean   : 0.8903   Mean   : 0.6909   Mean   :  2.414   Mean   : 0.9691  
#>  3rd Qu.: 1.0000   3rd Qu.: 1.0000   3rd Qu.:  1.000   3rd Qu.: 0.0000  
#>  Max.   :27.0000   Max.   :30.0000   Max.   :117.000   Max.   :97.0000  
#>                                                                         
#>    feature_48       feature_49        feature_50       feature_51    
#>  Min.   : 0.000   Min.   : 0.0000   Min.   : 0.000   Min.   : 0.000  
#>  1st Qu.: 0.000   1st Qu.: 0.0000   1st Qu.: 0.000   1st Qu.: 0.000  
#>  Median : 0.000   Median : 0.0000   Median : 1.000   Median : 0.000  
#>  Mean   : 1.527   Mean   : 0.4796   Mean   : 2.275   Mean   : 1.617  
#>  3rd Qu.: 1.000   3rd Qu.: 0.0000   3rd Qu.: 2.000   3rd Qu.: 1.000  
#>  Max.   :40.000   Max.   :38.0000   Max.   :56.000   Max.   :73.000  
#>                                                                      
#>    feature_52        feature_53       feature_54        feature_55    
#>  Min.   : 0.0000   Min.   : 0.000   Min.   :  0.000   Min.   : 0.000  
#>  1st Qu.: 0.0000   1st Qu.: 0.000   1st Qu.:  0.000   1st Qu.: 0.000  
#>  Median : 0.0000   Median : 0.000   Median :  2.000   Median : 0.000  
#>  Mean   : 0.6226   Mean   : 1.354   Mean   :  6.008   Mean   : 2.493  
#>  3rd Qu.: 1.0000   3rd Qu.: 2.000   3rd Qu.:  7.000   3rd Qu.: 1.000  
#>  Max.   :38.0000   Max.   :36.000   Max.   :104.000   Max.   :76.000  
#>                                                                       
#>    feature_56       feature_57        feature_58        feature_59     
#>  Min.   : 0.000   Min.   : 0.0000   Min.   : 0.0000   Min.   :  0.000  
#>  1st Qu.: 0.000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.:  0.000  
#>  Median : 1.000   Median : 0.0000   Median : 0.0000   Median :  0.000  
#>  Mean   : 2.118   Mean   : 0.5667   Mean   : 0.9271   Mean   :  1.344  
#>  3rd Qu.: 3.000   3rd Qu.: 0.0000   3rd Qu.: 1.0000   3rd Qu.:  1.000  
#>  Max.   :46.000   Max.   :31.0000   Max.   :30.0000   Max.   :352.000  
#>                                                                        
#>    feature_60        feature_61       feature_62        feature_63    
#>  Min.   :  0.000   Min.   : 0.000   Min.   :  0.000   Min.   : 0.000  
#>  1st Qu.:  0.000   1st Qu.: 0.000   1st Qu.:  0.000   1st Qu.: 0.000  
#>  Median :  0.000   Median : 0.000   Median :  0.000   Median : 0.000  
#>  Mean   :  1.667   Mean   : 1.287   Mean   :  2.764   Mean   : 1.455  
#>  3rd Qu.:  1.000   3rd Qu.: 1.000   3rd Qu.:  2.000   3rd Qu.: 0.000  
#>  Max.   :231.000   Max.   :80.000   Max.   :102.000   Max.   :80.000  
#>                                                                       
#>    feature_64        feature_65       feature_66        feature_67    
#>  Min.   : 0.0000   Min.   : 0.000   Min.   : 0.0000   Min.   : 0.000  
#>  1st Qu.: 0.0000   1st Qu.: 0.000   1st Qu.: 0.0000   1st Qu.: 0.000  
#>  Median : 0.0000   Median : 0.000   Median : 0.0000   Median : 0.000  
#>  Mean   : 0.6969   Mean   : 1.798   Mean   : 0.5087   Mean   : 1.827  
#>  3rd Qu.: 0.0000   3rd Qu.: 1.000   3rd Qu.: 0.0000   3rd Qu.: 1.000  
#>  Max.   :25.0000   Max.   :54.000   Max.   :24.0000   Max.   :79.000  
#>                                                                       
#>    feature_68        feature_69       feature_70       feature_71     
#>  Min.   : 0.0000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.0000  
#>  1st Qu.: 0.0000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.0000  
#>  Median : 0.0000   Median : 0.000   Median : 0.000   Median : 0.0000  
#>  Mean   : 0.9104   Mean   : 1.604   Mean   : 1.219   Mean   : 0.8069  
#>  3rd Qu.: 1.0000   3rd Qu.: 2.000   3rd Qu.: 1.000   3rd Qu.: 1.0000  
#>  Max.   :55.0000   Max.   :65.000   Max.   :67.000   Max.   :30.0000  
#>                                                                       
#>    feature_72       feature_73       feature_74         target     
#>  Min.   : 0.000   Min.   :  0.00   Min.   : 0.000   Class_6:51811  
#>  1st Qu.: 0.000   1st Qu.:  0.00   1st Qu.: 0.000   Class_8:51763  
#>  Median : 0.000   Median :  0.00   Median : 0.000   Class_9:25542  
#>  Mean   : 1.283   Mean   :  2.94   Mean   : 0.632   Class_2:24431  
#>  3rd Qu.: 1.000   3rd Qu.:  1.00   3rd Qu.: 0.000   Class_3:14798  
#>  Max.   :61.000   Max.   :130.00   Max.   :52.000   Class_7:14769  
#>                                                     (Other):16886

Data Visualization

we can also check the class distribution

summary(data_clean$target) %>% barplot()

table(data_clean$target) %>% prop.table() * 100
#> 
#> Class_1 Class_2 Class_3 Class_4 Class_5 Class_6 Class_7 Class_8 Class_9 
#>  4.5590 12.2155  7.3990  2.3520  1.5320 25.9055  7.3845 25.8815 12.7710

It seems the classes are imbalanced. Some classes have many samples, such as Class_6 with about 26% of the data, while others have few, such as Class_5 with only about 1.5%.

We can plot the feature distributions, but since the data has 75 feature columns we only show the first 8.

data_clean[, 1:8] %>% 
  inspect_num() %>% 
  show_plot()

Almost all features have a similar distribution: the most frequent value is 0 in every feature.

we can also check the class distribution of the target column with inspect_cat()

data_clean %>% 
  inspect_cat() %>% 
  show_plot()

We see that Class_6 has the most data, about 26% of the samples, more than any other class.

K-Means Clustering

First, we need to scale our data. K-means is distance-based, so scaling puts every feature on a comparable scale and keeps features with large ranges or outliers from dominating the distance calculation.

scale_data <- scale(data_clean %>% select(-target))
rmarkdown::paged_table(head(as.data.frame(scale_data)))
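
As a quick sanity check (an extra step, not required), the scaled columns should now have mean approximately 0 and standard deviation approximately 1:

# check the first few scaled columns: mean ~ 0, sd ~ 1
round(colMeans(scale_data[, 1:5]), 3)
round(apply(scale_data[, 1:5], 2, sd), 3)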

We set the seed to 99 for reproducibility. Here we pick 4 as the number of clusters and let K-means minimize the within-cluster sum of squares; the K-means algorithm is guaranteed to converge, but not to a global optimum.

set.seed(99)
# k-means with 4 clusters
data_km <- kmeans(scale_data, 4)
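
Because the result is only a local optimum, a common remedy (not used in the rest of this notebook) is to run K-means from several random starts and keep the best solution; kmeans() supports this through the nstart argument, at the cost of extra run time:

# re-run the algorithm from 5 random starts and keep the best result
data_km_multi <- kmeans(scale_data, centers = 4, nstart = 5)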

we can check how many iterations K-means needed to converge

data_km$iter
#> [1] 4

check the size of each cluster

data_km$size %>% prop.table()
#> [1] 0.013565 0.216290 0.009550 0.760595

Here we can see the proportion of observations in each cluster; every observation is assigned to exactly one of the four clusters.

the cluster centers, which we can use for profiling each cluster

data_km$centers[, "feature_74"]
#>           1           2           3           4 
#>  0.42569388  0.29070327  0.36637920 -0.09485951

We can see that every feature has a center value in each of the clusters.
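
As a small profiling sketch (one possible way to read the centers, not the only one), we can look up which scaled feature has the largest center value in each cluster:

# for each cluster, the feature with the largest (scaled) center value
apply(data_km$centers, 1, function(center) names(which.max(center)))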

Goodness of Fit K-Means Clustering

The objective of K-means clustering is to minimize the within-cluster sum of squares. We can see how K-means with k = 4 performs using:

  1. Within Sum of Squares : the sum of squared distances between each data point and the center of its cluster
  2. Between Sum of Squares : the weighted sum of squared distances between each centroid and the global mean, where the weight is the number of samples in the cluster
  3. Total Sum of Squares : the sum of squared distances between each data point and the global mean (these quantities are related by totss = tot.withinss + betweenss, as checked in the snippet below)
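
We can check the relationship totss = tot.withinss + betweenss directly on the fitted object (the two lines below should print the same number):

# total sum of squares equals within-cluster plus between-cluster parts
data_km$totss
data_km$tot.withinss + data_km$betweenss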

Within Sum of Squares

(data_km$withinss)
#> [1]  628765.4 8763236.5  446362.9 4043800.3

Between Sum of Squares

(data_km$betweenss)
#> [1] 1117760

Total Within Sum of Squares

(data_km$tot.withinss)
#> [1] 13882165

Ratio of Between Sum of Squares and Total Within Sum of Squares

(data_km$betweenss) / (data_km$tot.withinss)
#> [1] 0.08051769

A higher ratio means the clusters are better separated relative to their internal spread. Note that the denominator above is tot.withinss; the more common diagnostic is betweenss / totss, which lies between 0 and 1, with values close to 1 meaning each cluster is very distinct. Either way, our low score suggests that the selected k = 4 does not represent the data well enough.
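
For reference, the bounded version of the diagnostic mentioned above divides by the total sum of squares instead:

# proportion of the total variation explained by the clustering (0 to 1)
data_km$betweenss / data_km$totss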

K Optimization Technique

To choose k we can use the elbow method: run K-means for increasing values of k, record the total within-cluster sum of squares, and look for the point where adding more clusters stops reducing it substantially.

wss <- function(data, maxCluster = 20) {
    # within sum of squares for k = 1 is just the total sum of squares
    SSw <- vector()
    SSw[1] <- (nrow(data) - 1) * sum(apply(data, 2, var))
    for (i in 2:maxCluster) {
        SSw[i] <- sum(kmeans(data, centers = i)$withinss)
    }
    plot(1:maxCluster, SSw, type = "o",
         xlab = "Number of Clusters",
         ylab = "Within groups sum of squares", pch = 19)
}
wss(scale_data)

From the graph we can see a clear pattern: the larger k is, the lower the K-means loss. Let's try k = 9.

data_km_update <- kmeans(scale_data, 9)
(data_km_update$betweenss) / (data_km_update$tot.withinss)
#> [1] 0.1274018

Let's try using k = 65.

data_km_update_1 <- kmeans(scale_data, 65)
(data_km_update_1$betweenss) / (data_km_update_1$tot.withinss)
#> [1] 0.5574481

Using higher values of k makes the ratio of between to total within sum of squares better, but with K-means we also have to consider the problem we are trying to solve; more clusters are not automatically more useful or interpretable.

Introduction to PCA

PCA, or Principal Component Analysis, is an unsupervised learning algorithm used for dimensionality reduction. PCA is defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. PCA is usually applied to high-dimensional data to extract useful features; with PCA a model can often learn from better features than it would without it. PCA can also be used to visualize high-dimensional data, reduce multicollinearity in linear regression, discover patterns, and identify features that are highly correlated with one another.
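
To make the definition concrete, here is a small toy sketch (synthetic data, not the competition data) showing that the variances returned by prcomp are the eigenvalues of the correlation matrix of the data:

# toy example: PCA via prcomp vs. eigendecomposition of the correlation matrix
set.seed(1)
toy <- matrix(rnorm(200), ncol = 2) %*% matrix(c(1, 0.8, 0.8, 1), ncol = 2)
eig <- eigen(cov(scale(toy)))   # eigenvalues of the correlation matrix
pca_toy <- prcomp(toy, scale. = TRUE)
eig$values                       # variance captured by each component
pca_toy$sdev^2                   # should match the eigenvalues above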

When you should use PCA:

  1. When you have high-dimensional data and want to eliminate unnecessary features
  2. When you want your features to be independent of one another
  3. When you are okay with the result of PCA being less interpretable
  4. When the cost of computation is high and you want to make it faster

PCA Implementation

Because we have 75 features, we want to keep only the useful information and leave out the rest. To do that, we apply PCA to our data.

PCA Analysis

We create the PCA object using prcomp, passing the numeric features of the train data and setting scale = TRUE.

pca_data <- prcomp(data_clean %>% select_if(is.numeric), scale = T)

Because the data has so many features, this takes quite a while to finish. The full PCA object is also too large to print here, so next we analyze its main components.

There are 3 main components in the PCA result:

  1. Standard Deviation : the square root of the variance that each principal component captures; the squared value is the eigenvalue. The higher the eigenvalue, the more information that component captures.
(pca_data$sdev)^2
#>  [1] 6.4578343 1.3249831 1.0705558 1.0019038 0.9810737 0.9782173 0.9744343
#>  [8] 0.9719042 0.9695561 0.9670591 0.9646226 0.9642496 0.9617351 0.9586698
#> [15] 0.9571705 0.9550799 0.9545572 0.9519402 0.9511018 0.9494240 0.9481461
#> [22] 0.9474360 0.9432548 0.9422025 0.9409537 0.9383206 0.9369489 0.9363129
#> [29] 0.9350561 0.9339832 0.9310155 0.9298138 0.9286055 0.9283237 0.9272273
#> [36] 0.9246102 0.9228263 0.9212213 0.9200961 0.9177734 0.9164805 0.9153621
#> [43] 0.9135290 0.9108558 0.9093121 0.9078848 0.9071317 0.9056605 0.9037233
#> [50] 0.9028659 0.9014026 0.9005281 0.8985981 0.8979466 0.8973640 0.8938088
#> [57] 0.8920329 0.8906461 0.8887879 0.8863003 0.8849590 0.8839932 0.8828073
#> [64] 0.8801404 0.8782843 0.8759305 0.8748665 0.8704880 0.8668830 0.8656817
#> [71] 0.8632780 0.8610589 0.8600497 0.8511445 0.8400139

we can also plot the variance of each PC

plot(pca_data)

From the plot we see that PC1 captures the most information by far, and from PC2 onward the captured information decreases.

  2. Eigenvectors (rotation) : the rotation matrix used to project each data point onto the principal components. It tells us how much each feature in the data contributes to each PC.
rmarkdown::paged_table(data.frame(head(pca_data$rotation)))

From the data frame we can see how much each feature contributes to each PC; most of the loadings for the train features are of similar magnitude, around 0.1.
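
If we want to see which original features drive a given component, one simple sketch is to sort the absolute loadings, for example for PC1:

# features with the largest absolute loading on the first principal component
head(sort(abs(pca_data$rotation[, "PC1"]), decreasing = TRUE))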

  3. Projection values (x) : the projected coordinates of every row of the data on the principal components, which we use as the new representation of the data.
rmarkdown::paged_table(data.frame(head(pca_data$x)))

This is our new train data, but it still has 75 columns because we have not yet performed the dimensionality reduction. So let's move on to the next section.

Dimensional Reduction With PCA

To reduce dimensionality, we need to decide how much information we want to retain, or equivalently how many components we want to keep. For that we can use the cumulative proportion of variance, which tells us how much information is captured by the first n PCs.

summary(pca_data)
#> Importance of components:
#>                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
#> Standard deviation     2.5412 1.15108 1.03468 1.00095 0.99049 0.98905 0.98713
#> Proportion of Variance 0.0861 0.01767 0.01427 0.01336 0.01308 0.01304 0.01299
#> Cumulative Proportion  0.0861 0.10377 0.11804 0.13140 0.14448 0.15753 0.17052
#>                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
#> Standard deviation     0.98585 0.98466 0.98339 0.98215 0.98196 0.98068 0.97912
#> Proportion of Variance 0.01296 0.01293 0.01289 0.01286 0.01286 0.01282 0.01278
#> Cumulative Proportion  0.18348 0.19641 0.20930 0.22216 0.23502 0.24784 0.26062
#>                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
#> Standard deviation     0.97835 0.97728 0.97701 0.97567 0.97524 0.97438 0.97373
#> Proportion of Variance 0.01276 0.01273 0.01273 0.01269 0.01268 0.01266 0.01264
#> Cumulative Proportion  0.27339 0.28612 0.29885 0.31154 0.32422 0.33688 0.34952
#>                           PC22    PC23    PC24    PC25    PC26    PC27    PC28
#> Standard deviation     0.97336 0.97121 0.97067 0.97003 0.96867 0.96796 0.96763
#> Proportion of Variance 0.01263 0.01258 0.01256 0.01255 0.01251 0.01249 0.01248
#> Cumulative Proportion  0.36216 0.37473 0.38729 0.39984 0.41235 0.42484 0.43733
#>                           PC29    PC30    PC31   PC32    PC33    PC34    PC35
#> Standard deviation     0.96698 0.96643 0.96489 0.9643 0.96364 0.96350 0.96293
#> Proportion of Variance 0.01247 0.01245 0.01241 0.0124 0.01238 0.01238 0.01236
#> Cumulative Proportion  0.44980 0.46225 0.47466 0.4871 0.49944 0.51182 0.52418
#>                           PC36   PC37    PC38    PC39    PC40    PC41   PC42
#> Standard deviation     0.96157 0.9606 0.95980 0.95922 0.95800 0.95733 0.9567
#> Proportion of Variance 0.01233 0.0123 0.01228 0.01227 0.01224 0.01222 0.0122
#> Cumulative Proportion  0.53651 0.5488 0.56110 0.57337 0.58560 0.59782 0.6100
#>                           PC43    PC44    PC45    PC46   PC47    PC48    PC49
#> Standard deviation     0.95579 0.95439 0.95358 0.95283 0.9524 0.95166 0.95064
#> Proportion of Variance 0.01218 0.01214 0.01212 0.01211 0.0121 0.01208 0.01205
#> Cumulative Proportion  0.62221 0.63435 0.64648 0.65858 0.6707 0.68275 0.69480
#>                           PC50    PC51    PC52    PC53    PC54    PC55    PC56
#> Standard deviation     0.95019 0.94942 0.94896 0.94794 0.94760 0.94729 0.94541
#> Proportion of Variance 0.01204 0.01202 0.01201 0.01198 0.01197 0.01196 0.01192
#> Cumulative Proportion  0.70684 0.71886 0.73087 0.74285 0.75482 0.76678 0.77870
#>                           PC57    PC58    PC59    PC60   PC61    PC62    PC63
#> Standard deviation     0.94447 0.94374 0.94276 0.94144 0.9407 0.94021 0.93958
#> Proportion of Variance 0.01189 0.01188 0.01185 0.01182 0.0118 0.01179 0.01177
#> Cumulative Proportion  0.79060 0.80247 0.81432 0.82614 0.8379 0.84972 0.86150
#>                           PC64    PC65    PC66    PC67    PC68    PC69    PC70
#> Standard deviation     0.93816 0.93717 0.93591 0.93534 0.93300 0.93107 0.93042
#> Proportion of Variance 0.01174 0.01171 0.01168 0.01166 0.01161 0.01156 0.01154
#> Cumulative Proportion  0.87323 0.88494 0.89662 0.90829 0.91989 0.93145 0.94299
#>                           PC71    PC72    PC73    PC74   PC75
#> Standard deviation     0.92913 0.92793 0.92739 0.92257 0.9165
#> Proportion of Variance 0.01151 0.01148 0.01147 0.01135 0.0112
#> Cumulative Proportion  0.95450 0.96598 0.97745 0.98880 1.0000

We can see that the variance is spread almost evenly across the PCs, which means that to capture 100% of the information we would need all of them. Let's say we want to capture only 80% of the variance in the initial data; from the cumulative proportion we need PC1 through PC58.
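
We can also find this cutoff programmatically instead of reading it off the table; the result should agree with the Cumulative Proportion row above:

# smallest number of components whose cumulative variance reaches 80%
cum_var <- cumsum(pca_data$sdev^2) / sum(pca_data$sdev^2)
which(cum_var >= 0.80)[1]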

data_train_pca <- data.frame(pca_data$x[, 1:58])
data_train_pca$label <- data$target
rmarkdown::paged_table(head(data_train_pca))

We can also visualize how PC1 and PC2 separate the classes.

ggplot(as.data.frame(pca_data$x), aes(x = PC1, y = PC2, col = data$target)) + geom_point()

Verdict

After analyzing this Kaggle dataset, we have seen that the K-means clustering algorithm can be used to group the data into k clusters, which can help spot patterns in the dataset, and that PCA can be used to extract the most important features.