Introduction
Description
This notebook is a guide to unsupervised learning, using Principal Component Analysis (PCA) and K-Means Clustering to analyze the Kaggle competition Tabular Playground Series - June 2021.
Structure
Here is the structure of the notebook:
- Introduction to Unsupervised Learning
- Introduction to K-Means Clustering
- Analyzing Kaggle Data with K-Means Clustering
- Introduction to PCA
- PCA Implementation
- Verdict
Introduction to Unsupervised Learning
Unsupervised learning is a family of algorithms that learns patterns from untagged/unlabelled data. The hope is that, through mimicry, the machine is forced to build a compact internal representation of its world. In short, unsupervised learning algorithms learn the structure of the data without needing a labelling process. They are used in many applications, such as dimensionality reduction, feature extraction, anomaly detection, clustering, and structure prediction; the related idea of a program learning from its own past experience, without explicit labels, also appears in more advanced settings such as reinforcement learning (for example, some OpenAI projects).
Two of the main methods used in unsupervised learning are principal component analysis and cluster analysis. Cluster analysis groups, or segments, data that has not been labelled, classified, or categorized into sets with shared attributes in order to extrapolate algorithmic relationships. Instead of responding to feedback, cluster analysis identifies commonalities in the data and reacts based on the presence or absence of such commonalities in each new piece of data. This approach also helps detect anomalous data points that do not fit into any group.
K-Means Clustering
Introduction to K-Means Clustering
k-means clustering is a method of vector quantization that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or centroid), which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
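As a quick illustration before turning to the competition data, here is a minimal sketch on R's built-in iris data (purely illustrative, not part of this analysis): each observation gets assigned to the cluster with the nearest centroid, and the centroids act as prototypes.
set.seed(1)
toy_km <- kmeans(scale(iris[, 1:4]), centers = 3)  # partition the scaled data into 3 clusters
table(toy_km$cluster)   # every observation belongs to exactly one cluster
toy_km$centers          # the cluster centroids, i.e. the prototypes of each cluster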
Analyzing Kaggle Data with K-Means Clustering
In this notebook we will try to use K-Means Clustering to analyze the data from the Kaggle competition Tabular Playground Series.
Preprocessing Data
Read the train dataset:
# packages used throughout this notebook
library(dplyr)      # glimpse(), %>%, mutate(), select()
library(inspectdf)  # inspect_num(), inspect_cat(), show_plot()
library(ggplot2)    # ggplot() for the PCA scatter plot

data <- read.csv("train.csv")
rmarkdown::paged_table((glimpse(data)))#> Rows: 200,000
#> Columns: 77
#> $ id <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1~
#> $ feature_0 <int> 0, 0, 0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0, 3,~
#> $ feature_1 <int> 0, 0, 0, 0, 0, 15, 1, 3, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 1~
#> $ feature_2 <int> 6, 0, 0, 7, 0, 0, 2, 5, 0, 0, 0, 0, 28, 1, 0, 0, 1, 0, 0, 1~
#> $ feature_3 <int> 1, 0, 0, 0, 0, 0, 1, 0, 35, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 1~
#> $ feature_4 <int> 0, 0, 0, 1, 0, 0, 0, 0, 6, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0,~
#> $ feature_5 <int> 0, 0, 1, 5, 0, 1, 2, 1, 2, 0, 0, 0, 3, 1, 0, 1, 0, 0, 0, 1,~
#> $ feature_6 <int> 0, 0, 0, 2, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0,~
#> $ feature_7 <int> 0, 0, 3, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 1,~
#> $ feature_8 <int> 7, 0, 0, 0, 0, 0, 0, 10, 3, 0, 0, 3, 2, 0, 2, 7, 16, 1, 10,~
#> $ feature_9 <int> 0, 0, 0, 1, 0, 2, 2, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 2, 0,~
#> $ feature_10 <int> 0, 0, 1, 2, 0, 0, 1, 0, 6, 1, 0, 0, 7, 0, 0, 0, 0, 3, 0, 1,~
#> $ feature_11 <int> 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 2, 0, 0, 1, 2, 0, 0, 0, 0,~
#> $ feature_12 <int> 3, 1, 0, 5, 0, 0, 5, 0, 0, 0, 0, 2, 0, 4, 0, 2, 9, 4, 0, 2,~
#> $ feature_13 <int> 0, 0, 0, 0, 0, 1, 0, 3, 0, 2, 1, 0, 1, 16, 0, 0, 0, 1, 0, 0~
#> $ feature_14 <int> 1, 0, 0, 0, 0, 0, 0, 6, 7, 0, 0, 0, 2, 1, 0, 2, 0, 0, 0, 1,~
#> $ feature_15 <int> 0, 0, 0, 4, 0, 1, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0,~
#> $ feature_16 <int> 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 2, 1, 3, 5, 0, 0, 0, 2,~
#> $ feature_17 <int> 3, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,~
#> $ feature_18 <int> 3, 0, 0, 22, 0, 0, 3, 0, 0, 0, 0, 11, 1, 5, 0, 3, 0, 0, 1, ~
#> $ feature_19 <int> 1, 0, 5, 2, 0, 7, 12, 8, 5, 6, 3, 9, 7, 1, 3, 9, 2, 4, 2, 3~
#> $ feature_20 <int> 0, 0, 4, 1, 1, 1, 4, 4, 1, 3, 2, 0, 3, 0, 0, 0, 1, 0, 0, 0,~
#> $ feature_21 <int> 2, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 3, 0,~
#> $ feature_22 <int> 0, 0, 0, 0, 0, 0, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0~
#> $ feature_23 <int> 0, 0, 0, 0, 0, 0, 3, 5, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,~
#> $ feature_24 <int> 0, 0, 0, 0, 0, 2, 0, 1, 3, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,~
#> $ feature_25 <int> 0, 0, 0, 3, 1, 0, 1, 5, 4, 7, 0, 0, 7, 0, 3, 3, 0, 0, 0, 3,~
#> $ feature_26 <int> 0, 0, 0, 0, 0, 2, 3, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,~
#> $ feature_27 <int> 0, 0, 0, 37, 0, 0, 1, 0, 1, 0, 3, 1, 4, 0, 0, 0, 0, 0, 0, 0~
#> $ feature_28 <int> 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 7, 1, 0, 2, 1, 1, 0, 1,~
#> $ feature_29 <int> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 5,~
#> $ feature_30 <int> 0, 0, 0, 3, 0, 1, 3, 8, 0, 1, 0, 0, 2, 0, 1, 0, 0, 0, 0, 1,~
#> $ feature_31 <int> 1, 0, 0, 13, 0, 0, 2, 3, 0, 0, 0, 0, 4, 2, 0, 2, 0, 1, 0, 2~
#> $ feature_32 <int> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 5, 1, 0, 1, 0, 0, 0, 0,~
#> $ feature_33 <int> 0, 0, 0, 10, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1~
#> $ feature_34 <int> 0, 0, 2, 0, 0, 0, 0, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,~
#> $ feature_35 <int> 0, 1, 0, 3, 0, 1, 0, 4, 31, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0~
#> $ feature_36 <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,~
#> $ feature_37 <int> 11, 0, 5, 1, 2, 0, 0, 2, 0, 0, 0, 11, 1, 3, 0, 11, 0, 0, 0,~
#> $ feature_38 <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 6, 12~
#> $ feature_39 <int> 0, 0, 5, 7, 5, 10, 0, 3, 0, 0, 6, 0, 0, 1, 1, 0, 3, 5, 0, 3~
#> $ feature_40 <int> 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 1,~
#> $ feature_41 <int> 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 2, 1,~
#> $ feature_42 <int> 0, 0, 0, 2, 0, 2, 0, 0, 8, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,~
#> $ feature_43 <int> 9, 2, 0, 0, 0, 0, 7, 7, 0, 0, 0, 5, 2, 1, 1, 1, 2, 1, 2, 3,~
#> $ feature_44 <int> 0, 0, 0, 1, 0, 0, 0, 1, 2, 0, 0, 2, 1, 1, 0, 2, 0, 0, 0, 1,~
#> $ feature_45 <int> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,~
#> $ feature_46 <int> 0, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 1,~
#> $ feature_47 <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0,~
#> $ feature_48 <int> 0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 1, 0, 2, 2, 0, 0, 0, 0, 0, 14~
#> $ feature_49 <int> 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,~
#> $ feature_50 <int> 3, 0, 7, 0, 0, 0, 0, 3, 0, 0, 0, 1, 8, 1, 2, 0, 0, 1, 4, 2,~
#> $ feature_51 <int> 0, 0, 0, 10, 0, 0, 1, 42, 0, 1, 0, 0, 3, 2, 0, 0, 1, 1, 0, ~
#> $ feature_52 <int> 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 0, 1, 4, 0, 0, 0, 0, 0, 0, 0,~
#> $ feature_53 <int> 3, 0, 1, 0, 0, 0, 3, 4, 0, 0, 0, 4, 3, 3, 0, 0, 1, 0, 0, 6,~
#> $ feature_54 <int> 0, 0, 0, 25, 3, 1, 36, 6, 1, 4, 6, 18, 39, 2, 3, 5, 0, 2, 0~
#> $ feature_55 <int> 0, 0, 3, 1, 0, 0, 0, 16, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ feature_56 <int> 0, 0, 4, 0, 0, 2, 0, 6, 3, 4, 0, 4, 24, 4, 0, 1, 0, 0, 2, 2~
#> $ feature_57 <int> 0, 0, 0, 1, 1, 0, 0, 0, 7, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,~
#> $ feature_58 <int> 0, 0, 0, 2, 0, 2, 0, 1, 1, 1, 0, 3, 1, 9, 1, 0, 0, 0, 0, 1,~
#> $ feature_59 <int> 0, 0, 1, 0, 0, 0, 5, 1, 2, 0, 0, 2, 0, 0, 2, 32, 0, 0, 0, 0~
#> $ feature_60 <int> 0, 0, 3, 2, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 0, 4, 3, 0, 0, 0,~
#> $ feature_61 <int> 1, 0, 0, 0, 0, 3, 0, 1, 0, 0, 0, 4, 2, 2, 1, 0, 0, 0, 0, 0,~
#> $ feature_62 <int> 1, 0, 2, 7, 0, 2, 0, 1, 1, 0, 1, 1, 5, 0, 0, 0, 0, 0, 0, 2,~
#> $ feature_63 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
#> $ feature_64 <int> 0, 0, 0, 0, 0, 1, 0, 0, 6, 1, 1, 0, 0, 0, 6, 0, 0, 0, 0, 2,~
#> $ feature_65 <int> 3, 0, 8, 0, 0, 0, 0, 1, 0, 0, 0, 7, 6, 3, 0, 7, 0, 0, 0, 0,~
#> $ feature_66 <int> 0, 2, 0, 0, 0, 0, 1, 8, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0,~
#> $ feature_67 <int> 0, 0, 0, 4, 0, 0, 2, 0, 37, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 1~
#> $ feature_68 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 1, 0, 2, 0, 1, 1, 0,~
#> $ feature_69 <int> 0, 0, 0, 2, 0, 0, 2, 0, 5, 0, 0, 0, 0, 1, 3, 54, 0, 0, 0, 0~
#> $ feature_70 <int> 0, 0, 1, 2, 0, 0, 0, 0, 4, 0, 0, 0, 1, 40, 0, 0, 0, 1, 0, 0~
#> $ feature_71 <int> 0, 0, 0, 0, 0, 0, 1, 2, 1, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 4,~
#> $ feature_72 <int> 2, 0, 0, 4, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 1, 2, 1, 0, 0,~
#> $ feature_73 <int> 0, 1, 0, 3, 0, 0, 0, 60, 0, 10, 3, 0, 6, 0, 0, 6, 0, 0, 0, ~
#> $ feature_74 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0,~
#> $ target <chr> "Class_6", "Class_6", "Class_2", "Class_8", "Class_2", "Cla~
The train data has 200,000 samples and 77 columns. The dataset is synthetic, but based on a real dataset and generated using a CTGAN. The original dataset deals with predicting the category of an eCommerce product given various attributes about the listing. Although the features are anonymized, they have properties relating to real-world features.
Read the sample submission dataset:
sample_sub <- read.csv("sample_submission.csv")
glimpse(sample_sub)#> Rows: 100,000
#> Columns: 10
#> $ id <int> 200000, 200001, 200002, 200003, 200004, 200005, 200006, 200007~
#> $ Class_1 <dbl> 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111~
#> $ Class_2 <dbl> 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111~
#> $ Class_3 <dbl> 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111~
#> $ Class_4 <dbl> 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111~
#> $ Class_5 <dbl> 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111~
#> $ Class_6 <dbl> 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111~
#> $ Class_7 <dbl> 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111~
#> $ Class_8 <dbl> 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111~
#> $ Class_9 <dbl> 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111, 0.1111~
To be able to submit to Kaggle, we need to predict the probability of each class and include the test id number.
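For illustration, here is a minimal sketch (assuming sample_sub loaded above; the file name baseline_submission.csv is made up) that builds a baseline submission with a uniform probability of 1/9 for every class, matching the sample file:
# one probability column per class, one row per test id
uniform_probs <- matrix(1/9, nrow = nrow(sample_sub), ncol = 9,
                        dimnames = list(NULL, paste0("Class_", 1:9)))
baseline_sub <- data.frame(id = sample_sub$id, uniform_probs)
write.csv(baseline_sub, "baseline_submission.csv", row.names = FALSE)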
Check for NA values in each column:
colSums(is.na(data))#> id feature_0 feature_1 feature_2 feature_3 feature_4 feature_5
#> 0 0 0 0 0 0 0
#> feature_6 feature_7 feature_8 feature_9 feature_10 feature_11 feature_12
#> 0 0 0 0 0 0 0
#> feature_13 feature_14 feature_15 feature_16 feature_17 feature_18 feature_19
#> 0 0 0 0 0 0 0
#> feature_20 feature_21 feature_22 feature_23 feature_24 feature_25 feature_26
#> 0 0 0 0 0 0 0
#> feature_27 feature_28 feature_29 feature_30 feature_31 feature_32 feature_33
#> 0 0 0 0 0 0 0
#> feature_34 feature_35 feature_36 feature_37 feature_38 feature_39 feature_40
#> 0 0 0 0 0 0 0
#> feature_41 feature_42 feature_43 feature_44 feature_45 feature_46 feature_47
#> 0 0 0 0 0 0 0
#> feature_48 feature_49 feature_50 feature_51 feature_52 feature_53 feature_54
#> 0 0 0 0 0 0 0
#> feature_55 feature_56 feature_57 feature_58 feature_59 feature_60 feature_61
#> 0 0 0 0 0 0 0
#> feature_62 feature_63 feature_64 feature_65 feature_66 feature_67 feature_68
#> 0 0 0 0 0 0 0
#> feature_69 feature_70 feature_71 feature_72 feature_73 feature_74 target
#> 0 0 0 0 0 0 0
The train data contains zero NA values across all features.
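A quicker overall check is anyNA(), which scans the whole data frame at once; given the zero counts above it should return FALSE.
anyNA(data)  # TRUE if any value in the data frame is missing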
We need to remove the id column, because it does not carry any additional information about the target.
data_clean <- data[, -1] %>% mutate(target = as.factor(target))
rmarkdown::paged_table(head(data_clean))
We check the summary of the train data:
summary(data_clean)#> feature_0 feature_1 feature_2 feature_3
#> Min. : 0.0000 Min. : 0.000 Min. : 0.000 Min. : 0.000
#> 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
#> Median : 0.0000 Median : 0.000 Median : 0.000 Median : 0.000
#> Mean : 0.9727 Mean : 1.168 Mean : 2.219 Mean : 2.297
#> 3rd Qu.: 1.0000 3rd Qu.: 1.000 3rd Qu.: 1.000 3rd Qu.: 1.000
#> Max. :61.0000 Max. :51.000 Max. :64.000 Max. :70.000
#>
#> feature_4 feature_5 feature_6 feature_7
#> Min. : 0.0000 Min. : 0.000 Min. : 0.000 Min. : 0.0000
#> 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.0000
#> Median : 0.0000 Median : 0.000 Median : 0.000 Median : 0.0000
#> Mean : 0.7935 Mean : 1.431 Mean : 1.011 Mean : 0.6731
#> 3rd Qu.: 0.0000 3rd Qu.: 1.000 3rd Qu.: 0.000 3rd Qu.: 0.0000
#> Max. :38.0000 Max. :76.000 Max. :43.000 Max. :30.0000
#>
#> feature_8 feature_9 feature_10 feature_11
#> Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.000
#> 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.000
#> Median : 0.000 Median : 0.00 Median : 0.000 Median : 0.000
#> Mean : 1.944 Mean : 1.72 Mean : 1.423 Mean : 0.981
#> 3rd Qu.: 2.000 3rd Qu.: 1.00 3rd Qu.: 1.000 3rd Qu.: 0.000
#> Max. :38.000 Max. :72.00 Max. :33.000 Max. :46.000
#>
#> feature_12 feature_13 feature_14 feature_15
#> Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
#> 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
#> Median : 1.000 Median : 0.000 Median : 0.000 Median : 0.000
#> Mean : 2.445 Mean : 1.078 Mean : 1.406 Mean : 1.413
#> 3rd Qu.: 3.000 3rd Qu.: 1.000 3rd Qu.: 2.000 3rd Qu.: 0.000
#> Max. :37.000 Max. :43.000 Max. :32.000 Max. :121.000
#>
#> feature_16 feature_17 feature_18 feature_19
#> Min. : 0.00 Min. : 0.0000 Min. : 0.000 Min. : 0.000
#> 1st Qu.: 0.00 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.000
#> Median : 0.00 Median : 0.0000 Median : 1.000 Median : 2.000
#> Mean : 1.39 Mean : 0.3177 Mean : 1.657 Mean : 6.187
#> 3rd Qu.: 1.00 3rd Qu.: 0.0000 3rd Qu.: 2.000 3rd Qu.: 7.000
#> Max. :27.00 Max. :14.0000 Max. :22.000 Max. :263.000
#>
#> feature_20 feature_21 feature_22 feature_23
#> Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
#> 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
#> Median : 1.000 Median : 0.000 Median : 0.000 Median : 0.000
#> Mean : 1.439 Mean : 1.031 Mean : 1.466 Mean : 0.572
#> 3rd Qu.: 2.000 3rd Qu.: 1.000 3rd Qu.: 0.000 3rd Qu.: 0.000
#> Max. :30.000 Max. :33.000 Max. :123.000 Max. :22.000
#>
#> feature_24 feature_25 feature_26 feature_27
#> Min. : 0.000 Min. : 0.000 Min. : 0.0000 Min. : 0.0000
#> 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.0000
#> Median : 0.000 Median : 0.000 Median : 0.0000 Median : 0.0000
#> Mean : 1.061 Mean : 2.349 Mean : 0.7745 Mean : 0.7893
#> 3rd Qu.: 0.000 3rd Qu.: 2.000 3rd Qu.: 1.0000 3rd Qu.: 1.0000
#> Max. :69.000 Max. :149.000 Max. :24.0000 Max. :84.0000
#>
#> feature_28 feature_29 feature_30 feature_31
#> Min. : 0.000 Min. : 0.000 Min. : 0.0000 Min. : 0.000
#> 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.000
#> Median : 0.000 Median : 0.000 Median : 0.0000 Median : 1.000
#> Mean : 2.326 Mean : 1.582 Mean : 0.5988 Mean : 1.857
#> 3rd Qu.: 1.000 3rd Qu.: 1.000 3rd Qu.: 1.0000 3rd Qu.: 2.000
#> Max. :105.000 Max. :84.000 Max. :22.0000 Max. :39.000
#>
#> feature_32 feature_33 feature_34 feature_35
#> Min. : 0.000 Min. : 0.000 Min. : 0.0000 Min. : 0.000
#> 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.000
#> Median : 0.000 Median : 0.000 Median : 0.0000 Median : 0.000
#> Mean : 1.516 Mean : 1.557 Mean : 0.6811 Mean : 1.162
#> 3rd Qu.: 0.000 3rd Qu.: 2.000 3rd Qu.: 1.0000 3rd Qu.: 1.000
#> Max. :78.000 Max. :41.000 Max. :36.0000 Max. :41.000
#>
#> feature_36 feature_37 feature_38 feature_39
#> Min. : 0.0000 Min. : 0.0 Min. : 0.000 Min. : 0.000
#> 1st Qu.: 0.0000 1st Qu.: 0.0 1st Qu.: 0.000 1st Qu.: 0.000
#> Median : 0.0000 Median : 0.0 Median : 0.000 Median : 1.000
#> Mean : 0.6654 Mean : 1.5 Mean : 1.276 Mean : 2.333
#> 3rd Qu.: 0.0000 3rd Qu.: 2.0 3rd Qu.: 0.000 3rd Qu.: 3.000
#> Max. :42.0000 Max. :34.0 Max. :41.000 Max. :49.000
#>
#> feature_40 feature_41 feature_42 feature_43
#> Min. : 0.000 Min. : 0.000 Min. : 0.0000 Min. : 0.000
#> 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.000
#> Median : 0.000 Median : 0.000 Median : 0.0000 Median : 2.000
#> Mean : 1.255 Mean : 1.159 Mean : 0.8346 Mean : 4.473
#> 3rd Qu.: 1.000 3rd Qu.: 1.000 3rd Qu.: 1.0000 3rd Qu.: 5.000
#> Max. :81.000 Max. :73.000 Max. :53.0000 Max. :63.000
#>
#> feature_44 feature_45 feature_46 feature_47
#> Min. : 0.0000 Min. : 0.0000 Min. : 0.000 Min. : 0.0000
#> 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.0000
#> Median : 0.0000 Median : 0.0000 Median : 0.000 Median : 0.0000
#> Mean : 0.8903 Mean : 0.6909 Mean : 2.414 Mean : 0.9691
#> 3rd Qu.: 1.0000 3rd Qu.: 1.0000 3rd Qu.: 1.000 3rd Qu.: 0.0000
#> Max. :27.0000 Max. :30.0000 Max. :117.000 Max. :97.0000
#>
#> feature_48 feature_49 feature_50 feature_51
#> Min. : 0.000 Min. : 0.0000 Min. : 0.000 Min. : 0.000
#> 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.000
#> Median : 0.000 Median : 0.0000 Median : 1.000 Median : 0.000
#> Mean : 1.527 Mean : 0.4796 Mean : 2.275 Mean : 1.617
#> 3rd Qu.: 1.000 3rd Qu.: 0.0000 3rd Qu.: 2.000 3rd Qu.: 1.000
#> Max. :40.000 Max. :38.0000 Max. :56.000 Max. :73.000
#>
#> feature_52 feature_53 feature_54 feature_55
#> Min. : 0.0000 Min. : 0.000 Min. : 0.000 Min. : 0.000
#> 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
#> Median : 0.0000 Median : 0.000 Median : 2.000 Median : 0.000
#> Mean : 0.6226 Mean : 1.354 Mean : 6.008 Mean : 2.493
#> 3rd Qu.: 1.0000 3rd Qu.: 2.000 3rd Qu.: 7.000 3rd Qu.: 1.000
#> Max. :38.0000 Max. :36.000 Max. :104.000 Max. :76.000
#>
#> feature_56 feature_57 feature_58 feature_59
#> Min. : 0.000 Min. : 0.0000 Min. : 0.0000 Min. : 0.000
#> 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.000
#> Median : 1.000 Median : 0.0000 Median : 0.0000 Median : 0.000
#> Mean : 2.118 Mean : 0.5667 Mean : 0.9271 Mean : 1.344
#> 3rd Qu.: 3.000 3rd Qu.: 0.0000 3rd Qu.: 1.0000 3rd Qu.: 1.000
#> Max. :46.000 Max. :31.0000 Max. :30.0000 Max. :352.000
#>
#> feature_60 feature_61 feature_62 feature_63
#> Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
#> 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
#> Median : 0.000 Median : 0.000 Median : 0.000 Median : 0.000
#> Mean : 1.667 Mean : 1.287 Mean : 2.764 Mean : 1.455
#> 3rd Qu.: 1.000 3rd Qu.: 1.000 3rd Qu.: 2.000 3rd Qu.: 0.000
#> Max. :231.000 Max. :80.000 Max. :102.000 Max. :80.000
#>
#> feature_64 feature_65 feature_66 feature_67
#> Min. : 0.0000 Min. : 0.000 Min. : 0.0000 Min. : 0.000
#> 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.000
#> Median : 0.0000 Median : 0.000 Median : 0.0000 Median : 0.000
#> Mean : 0.6969 Mean : 1.798 Mean : 0.5087 Mean : 1.827
#> 3rd Qu.: 0.0000 3rd Qu.: 1.000 3rd Qu.: 0.0000 3rd Qu.: 1.000
#> Max. :25.0000 Max. :54.000 Max. :24.0000 Max. :79.000
#>
#> feature_68 feature_69 feature_70 feature_71
#> Min. : 0.0000 Min. : 0.000 Min. : 0.000 Min. : 0.0000
#> 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.0000
#> Median : 0.0000 Median : 0.000 Median : 0.000 Median : 0.0000
#> Mean : 0.9104 Mean : 1.604 Mean : 1.219 Mean : 0.8069
#> 3rd Qu.: 1.0000 3rd Qu.: 2.000 3rd Qu.: 1.000 3rd Qu.: 1.0000
#> Max. :55.0000 Max. :65.000 Max. :67.000 Max. :30.0000
#>
#> feature_72 feature_73 feature_74 target
#> Min. : 0.000 Min. : 0.00 Min. : 0.000 Class_6:51811
#> 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000 Class_8:51763
#> Median : 0.000 Median : 0.00 Median : 0.000 Class_9:25542
#> Mean : 1.283 Mean : 2.94 Mean : 0.632 Class_2:24431
#> 3rd Qu.: 1.000 3rd Qu.: 1.00 3rd Qu.: 0.000 Class_3:14798
#> Max. :61.000 Max. :130.00 Max. :52.000 Class_7:14769
#> (Other):16886
Data Visualization
We can also check the class distribution:
summary(data_clean$target) %>% barplot()
table(data_clean$target) %>% prop.table() * 100
#>
#> Class_1 Class_2 Class_3 Class_4 Class_5 Class_6 Class_7 Class_8 Class_9
#> 4.5590 12.2155 7.3990 2.3520 1.5320 25.9055 7.3845 25.8815 12.7710
It seems the classes are imbalanced: some classes have many samples, like Class_6 with about 26% of the data, while others have very few, like Class_5 with only about 1.5%.
We can plot the feature distributions, but since the data has 75 numeric features we only plot the first 8 columns:
data_clean[, 1:8] %>%
inspect_num() %>%
show_plot()
Almost all of the features have a similar distribution: the most frequent value is 0 in every feature.
We can also inspect the distribution of the categorical target column:
data_clean %>%
inspect_cat() %>%
show_plot()
We now know that Class_6 has the most data, about 26%, more than any other class.
K-Means Clustering
First, we need to scale our data: with scaled data every feature contributes on a comparable scale, so the distance calculation is not dominated by features with large values or outliers.
scale_data <- scale(data_clean %>% select(-target))
rmarkdown::paged_table(head(as.data.frame(scale_data)))
We set the seed to 99 for reproducibility. Here we pick 4 as the number of clusters and let k-means try to minimize the within-cluster sum of squares; the k-means algorithm is guaranteed to converge, but not necessarily to a global optimum.
set.seed(99)
# k-means with 4 clusters
data_km <- kmeans(scale_data, 4)
We can check how many iterations k-means needed to converge:
data_km$iter#> [1] 4
Check the relative size of each cluster:
data_km$size %>% prop.table()#> [1] 0.013565 0.216290 0.009550 0.760595
Here we see the proportion of the data falling into each cluster; every observation is assigned to exactly one of the four clusters.
Cluster centers, used for profiling the clusters:
data_km$centers[, "feature_74"]#> 1 2 3 4
#> 0.42569388 0.29070327 0.36637920 -0.09485951
We can see that every feature has a center value in each of the clusters.
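To make the profiling more concrete, here is a small sketch (assuming data_km from above) that lists, for each cluster, the five features with the largest scaled center values, i.e. the features that most characterize that cluster:
apply(data_km$centers, 1, function(center) {
  names(sort(center, decreasing = TRUE))[1:5]  # top 5 features per cluster
})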
Goodness of Fit K-Means Clustering
The objective of k-means clustering is to minimize the within-cluster sum of squares. We can see how k-means with k = 4 performs using the following quantities (a short sketch recomputing them by hand follows the list):
- Within Sum of Squares (withinss): the sum of squared distances from each data point to its own cluster center
- Between Sum of Squares (betweenss): the weighted sum of squared distances from each centroid to the global mean, weighted by the number of samples in each cluster
- Total Within Sum of Squares (tot.withinss): the sum of the within sum of squares over all clusters
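As a sanity check, these quantities can be recomputed by hand from the cluster assignments; a minimal sketch, assuming scale_data and data_km from the chunks above:
# within sum of squares: squared distance of each observation to its own centroid
wss_manual <- sapply(1:4, function(k) {
  members <- scale_data[data_km$cluster == k, , drop = FALSE]
  sum(sweep(members, 2, data_km$centers[k, ])^2)
})
# between sum of squares: size-weighted squared distance of each centroid to the global mean
global_mean <- colMeans(scale_data)
bss_manual  <- sum(data_km$size * rowSums(sweep(data_km$centers, 2, global_mean)^2))
all.equal(unname(wss_manual), unname(data_km$withinss))  # expected TRUE up to tolerance
all.equal(bss_manual, data_km$betweenss)                 # expected TRUE up to tolerance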
Within Sum of Squares
(data_km$withinss)#> [1] 628765.4 8763236.5 446362.9 4043800.3
Between Sum of Squares
(data_km$betweenss)#> [1] 1117760
Total Within Sum of Squares
(data_km$tot.withinss)#> [1] 13882165
Ratio of Between Sum of Squares to Total Within Sum of Squares
(data_km$betweenss) / (data_km$tot.withinss)#> [1] 0.08051769
The higher this ratio, the more the separation between clusters outweighs the spread within them, meaning each cluster is distinct and unique. Our low score suggests that the selected k simply does not represent the data well enough.
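For reference, another commonly reported version of this diagnostic divides the between sum of squares by the total sum of squares (the totss element returned by kmeans(), equal to tot.withinss plus betweenss), which is bounded between 0 and 1:
data_km$betweenss / data_km$totss  # in [0, 1]; higher means better separated clusters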
K Optimization Technique
wss <- function(data, maxCluster = 20) {
  # within sum of squares for k = 1 equals the total sum of squares of the data
  SSw <- vector()
  SSw[1] <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:maxCluster) {
    SSw[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:maxCluster, SSw, type = "o", xlab = "Number of Clusters", ylab = "Within groups sum of squares", pch = 19)
}
wss(scale_data)
From the graph we can see the pattern that the larger the number of clusters k, the lower the total within-cluster sum of squares. Let's try k = 9.
data_km_update <- kmeans(scale_data, 9)
(data_km_update$betweenss) / (data_km_update$tot.withinss)#> [1] 0.1274018
Let's also try k = 65:
data_km_update_1 <- kmeans(scale_data, 65)
(data_km_update_1$betweenss) / (data_km_update_1$tot.withinss)#> [1] 0.5574481
By using higher k values, the ratio of the between sum of squares to the total within sum of squares gets better, but when choosing k we also need to consider the use case we are trying to solve; here the target actually has 9 classes.
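To connect the clustering back to that use case, here is a small sketch (assuming data_km_update with k = 9 and data_clean from above) that cross-tabulates cluster assignments against the true target classes:
cluster_vs_class <- table(cluster = data_km_update$cluster, class = data_clean$target)
round(prop.table(cluster_vs_class, margin = 1), 2)  # share of each class within every cluster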
Introduction to PCA
PCA, or Principal Component Analysis, is an unsupervised learning algorithm used for dimensionality reduction. PCA is defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. PCA is usually applied to high-dimensional data to extract useful features; with PCA a model can often learn better than without it. PCA can also be used for visualizing high-dimensional data, reducing multicollinearity in linear regression, pattern discovery, and identifying features that are highly correlated with each other.
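As a small illustration (toy data only; the names toy, pca_toy, and eig_toy are made up for this sketch), the variances captured by prcomp() on scaled data are exactly the eigenvalues of the correlation matrix, ordered from largest to smallest:
set.seed(42)
toy <- data.frame(x = rnorm(500), y = rnorm(500))
toy$z <- toy$x + 0.5 * toy$y + rnorm(500, sd = 0.1)  # a correlated third column

pca_toy <- prcomp(toy, scale = TRUE)
eig_toy <- eigen(cor(toy))

all.equal(eig_toy$values, unname(pca_toy$sdev^2))  # expected TRUE
cumsum(pca_toy$sdev^2) / sum(pca_toy$sdev^2)       # cumulative proportion of variance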
When you should use PCA:
- When you have high-dimensional data and want to eliminate unnecessary features
- When you want your features to be independent of each other
- When you are okay with the result of PCA being less interpretable
- When the cost of computation is high and you want to make it faster
PCA Implementation
Because we have 75 numeric features, we want to keep only the useful information and leave out the rest. To do that we apply PCA to our data.
PCA Analysis
We create the PCA object using prcomp(), passing the numeric features of the train data and setting scale to TRUE.
pca_data <- prcomp(data_clean %>% select_if(is.numeric), scale = T)
Because the data has so many features, this takes quite a while to finish. We also do not print the full PCA object here for lack of space. Next we are going to analyze the PCA results.
There are 3 main components of the PCA output:
- Standard Deviation: the square root of the variance captured by each principal component; the squared value is also called the eigenvalue. The higher the eigenvalue, the more information that component captures.
(pca_data$sdev)^2#> [1] 6.4578343 1.3249831 1.0705558 1.0019038 0.9810737 0.9782173 0.9744343
#> [8] 0.9719042 0.9695561 0.9670591 0.9646226 0.9642496 0.9617351 0.9586698
#> [15] 0.9571705 0.9550799 0.9545572 0.9519402 0.9511018 0.9494240 0.9481461
#> [22] 0.9474360 0.9432548 0.9422025 0.9409537 0.9383206 0.9369489 0.9363129
#> [29] 0.9350561 0.9339832 0.9310155 0.9298138 0.9286055 0.9283237 0.9272273
#> [36] 0.9246102 0.9228263 0.9212213 0.9200961 0.9177734 0.9164805 0.9153621
#> [43] 0.9135290 0.9108558 0.9093121 0.9078848 0.9071317 0.9056605 0.9037233
#> [50] 0.9028659 0.9014026 0.9005281 0.8985981 0.8979466 0.8973640 0.8938088
#> [57] 0.8920329 0.8906461 0.8887879 0.8863003 0.8849590 0.8839932 0.8828073
#> [64] 0.8801404 0.8782843 0.8759305 0.8748665 0.8704880 0.8668830 0.8656817
#> [71] 0.8632780 0.8610589 0.8600497 0.8511445 0.8400139
We can also plot the variance of each principal component:
plot(pca_data)
From the plot we can see that PC1 captures the most information, and from PC2 onward the captured variance decreases.
- Eigenvectors (rotation): the rotation matrix used to project each data point onto the principal components. It tells us how much each feature of the data contributes to each PC.
rmarkdown::paged_table(data.frame(head(pca_data$rotation)))
From this data frame we can see how much each feature contributes to each PC; for most features the loadings are of similar magnitude, around 0.1.
- Projected Values (x): the projection of every row of the data onto the principal components, used as the new features.
rmarkdown::paged_table(data.frame(head(pca_data$x)))
This is our new train data, but it still has 75 columns because we have not yet done any dimensionality reduction. So let's move to the next section.
Dimensional Reduction With PCA
To reduce dimensionality we need to decide how much information we want to keep, i.e. how many components to use. For this we can use the cumulative proportion of variance, which tells us how much information has been captured up to each PC.
summary(pca_data)#> Importance of components:
#> PC1 PC2 PC3 PC4 PC5 PC6 PC7
#> Standard deviation 2.5412 1.15108 1.03468 1.00095 0.99049 0.98905 0.98713
#> Proportion of Variance 0.0861 0.01767 0.01427 0.01336 0.01308 0.01304 0.01299
#> Cumulative Proportion 0.0861 0.10377 0.11804 0.13140 0.14448 0.15753 0.17052
#> PC8 PC9 PC10 PC11 PC12 PC13 PC14
#> Standard deviation 0.98585 0.98466 0.98339 0.98215 0.98196 0.98068 0.97912
#> Proportion of Variance 0.01296 0.01293 0.01289 0.01286 0.01286 0.01282 0.01278
#> Cumulative Proportion 0.18348 0.19641 0.20930 0.22216 0.23502 0.24784 0.26062
#> PC15 PC16 PC17 PC18 PC19 PC20 PC21
#> Standard deviation 0.97835 0.97728 0.97701 0.97567 0.97524 0.97438 0.97373
#> Proportion of Variance 0.01276 0.01273 0.01273 0.01269 0.01268 0.01266 0.01264
#> Cumulative Proportion 0.27339 0.28612 0.29885 0.31154 0.32422 0.33688 0.34952
#> PC22 PC23 PC24 PC25 PC26 PC27 PC28
#> Standard deviation 0.97336 0.97121 0.97067 0.97003 0.96867 0.96796 0.96763
#> Proportion of Variance 0.01263 0.01258 0.01256 0.01255 0.01251 0.01249 0.01248
#> Cumulative Proportion 0.36216 0.37473 0.38729 0.39984 0.41235 0.42484 0.43733
#> PC29 PC30 PC31 PC32 PC33 PC34 PC35
#> Standard deviation 0.96698 0.96643 0.96489 0.9643 0.96364 0.96350 0.96293
#> Proportion of Variance 0.01247 0.01245 0.01241 0.0124 0.01238 0.01238 0.01236
#> Cumulative Proportion 0.44980 0.46225 0.47466 0.4871 0.49944 0.51182 0.52418
#> PC36 PC37 PC38 PC39 PC40 PC41 PC42
#> Standard deviation 0.96157 0.9606 0.95980 0.95922 0.95800 0.95733 0.9567
#> Proportion of Variance 0.01233 0.0123 0.01228 0.01227 0.01224 0.01222 0.0122
#> Cumulative Proportion 0.53651 0.5488 0.56110 0.57337 0.58560 0.59782 0.6100
#> PC43 PC44 PC45 PC46 PC47 PC48 PC49
#> Standard deviation 0.95579 0.95439 0.95358 0.95283 0.9524 0.95166 0.95064
#> Proportion of Variance 0.01218 0.01214 0.01212 0.01211 0.0121 0.01208 0.01205
#> Cumulative Proportion 0.62221 0.63435 0.64648 0.65858 0.6707 0.68275 0.69480
#> PC50 PC51 PC52 PC53 PC54 PC55 PC56
#> Standard deviation 0.95019 0.94942 0.94896 0.94794 0.94760 0.94729 0.94541
#> Proportion of Variance 0.01204 0.01202 0.01201 0.01198 0.01197 0.01196 0.01192
#> Cumulative Proportion 0.70684 0.71886 0.73087 0.74285 0.75482 0.76678 0.77870
#> PC57 PC58 PC59 PC60 PC61 PC62 PC63
#> Standard deviation 0.94447 0.94374 0.94276 0.94144 0.9407 0.94021 0.93958
#> Proportion of Variance 0.01189 0.01188 0.01185 0.01182 0.0118 0.01179 0.01177
#> Cumulative Proportion 0.79060 0.80247 0.81432 0.82614 0.8379 0.84972 0.86150
#> PC64 PC65 PC66 PC67 PC68 PC69 PC70
#> Standard deviation 0.93816 0.93717 0.93591 0.93534 0.93300 0.93107 0.93042
#> Proportion of Variance 0.01174 0.01171 0.01168 0.01166 0.01161 0.01156 0.01154
#> Cumulative Proportion 0.87323 0.88494 0.89662 0.90829 0.91989 0.93145 0.94299
#> PC71 PC72 PC73 PC74 PC75
#> Standard deviation 0.92913 0.92793 0.92739 0.92257 0.9165
#> Proportion of Variance 0.01151 0.01148 0.01147 0.01135 0.0112
#> Cumulative Proportion 0.95450 0.96598 0.97745 0.98880 1.0000
We can see that the variance is spread almost evenly across the PCs, which means that to capture 100% of the information we would need all of them. Let's say we only want to capture 80% of the variance of the original data; for that we need PC1 to PC58.
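We can also find this cutoff programmatically instead of reading it off the summary table; a short sketch assuming pca_data from above:
var_explained <- pca_data$sdev^2 / sum(pca_data$sdev^2)
cum_var <- cumsum(var_explained)
which(cum_var >= 0.80)[1]  # index of the first PC whose cumulative variance reaches 80%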
data_train_pca <- data.frame(pca_data$x[, 1:58])
data_train_pca$label <- data$target
rmarkdown::paged_table(head(data_train_pca))
We can also visualize how PC1 and PC2 separate the classes:
ggplot(as.data.frame(pca_data$x), aes(x = PC1, y = PC2, col = data$target)) + geom_point()
Verdict
After analyzing this Kaggle dataset, we know that we can use the k-means clustering algorithm to group the data into k clusters, which can help spot patterns in the dataset, and that we can use PCA to extract the most important information while reducing the number of features.