K-MEANS Clustering

DOEUN

2020-04-07

K-MEANS

WINE

## Observations: 178
## Variables: 14
## $ Cultivar             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Alcohol              <dbl> 14.23, 13.20, 13.16, 14.37, 13.24, 14.20, 14.3...
## $ Malic.acid           <dbl> 1.71, 1.78, 2.36, 1.95, 2.59, 1.76, 1.87, 2.15...
## $ Ash                  <dbl> 2.43, 2.14, 2.67, 2.50, 2.87, 2.45, 2.45, 2.61...
## $ Alcalinity           <dbl> 15.6, 11.2, 18.6, 16.8, 21.0, 15.2, 14.6, 17.6...
## $ Magnesium            <int> 127, 100, 101, 113, 118, 112, 96, 121, 97, 98,...
## $ Total.phenols        <dbl> 2.80, 2.65, 2.80, 3.85, 2.80, 3.27, 2.50, 2.60...
## $ Flavanoid            <dbl> 3.06, 2.76, 3.24, 3.49, 2.69, 3.39, 2.52, 2.51...
## $ nonflavanoid.phenols <dbl> 0.28, 0.26, 0.30, 0.24, 0.39, 0.34, 0.30, 0.31...
## $ Proanthocyanin       <dbl> 2.29, 1.28, 2.81, 2.18, 1.82, 1.97, 1.98, 1.25...
## $ Color.intensityy     <dbl> 5.64, 4.38, 5.68, 7.80, 4.32, 6.75, 5.25, 5.05...
## $ Hue                  <dbl> 1.04, 1.05, 1.03, 0.86, 1.04, 1.05, 1.02, 1.06...
## $ diulted.wines        <dbl> 3.92, 3.40, 3.17, 3.45, 2.93, 2.85, 3.58, 3.58...
## $ Proline              <int> 1065, 1050, 1185, 1480, 735, 1450, 1290, 1295,...
## K-means clustering with 3 clusters of sizes 47, 69, 62
## 
## Cluster means:
##    Alcohol Malic.acid      Ash Alcalinity Magnesium Total.phenols Flavanoid
## 1 13.80447   1.883404 2.426170   17.02340 105.51064      2.867234  3.014255
## 2 12.51667   2.494203 2.288551   20.82319  92.34783      2.070725  1.758406
## 3 12.92984   2.504032 2.408065   19.89032 103.59677      2.111129  1.584032
##   nonflavanoid.phenols Proanthocyanin Color.intensityy       Hue diulted.wines
## 1            0.2853191       1.910426         5.702553 1.0782979      3.114043
## 2            0.3901449       1.451884         4.086957 0.9411594      2.490725
## 3            0.3883871       1.503387         5.650323 0.8839677      2.365484
##     Proline
## 1 1195.1489
## 2  458.2319
## 3  728.3387
## 
## Clustering vector:
##   [1] 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 1 1 3 3 1 1 3 1 1 1 1 1 1 3 3
##  [38] 1 1 3 3 1 1 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 2 3 2 2 3 2 2 3 3 3 2 2 1
##  [75] 3 2 2 2 3 2 2 3 3 2 2 2 2 2 3 3 2 2 2 2 2 3 3 2 3 2 3 2 2 2 3 2 2 2 2 3 2
## [112] 2 3 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 3 2 2 3 3 3 3 2 2 2 3 3 2 2 3 3 2 3
## [149] 3 2 2 2 2 3 3 3 2 3 3 3 2 3 2 3 3 2 3 3 3 3 2 2 3 3 3 3 3 2
## 
## Within cluster sum of squares by cluster:
## [1] 1360950.5  443166.7  566572.5
##  (between_SS / total_SS =  86.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

## [1] 47 69 62

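Finding an appropriate number of clusters. Methods: FitKMeans (Hartigan's rule) and the gap statistic.

Cross-table comparing the actual groups with the three clusters assigned in wine3n25. The column totals (47, 69, 62) match the cluster sizes reported above.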

##    
##      1  2  3
##   1 46  0 13
##   2  1 50 20
##   3  0 19 29
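
As a rough way to summarize the cross-table in one number, the sketch below computes cluster purity: the share of wines that fall in the majority group of their cluster. It is written in Python with the standard library only, and it assumes that rows are the actual Cultivar and columns are the k-means clusters; the report does not label the axes, but the column totals agree with the cluster sizes.

```python
# Cross-table copied from the report.
# Assumption: rows = actual Cultivar (1..3), columns = k-means cluster (1..3).
cross_table = [
    [46, 0, 13],   # Cultivar 1
    [1, 50, 20],   # Cultivar 2
    [0, 19, 29],   # Cultivar 3
]

n_clusters = len(cross_table[0])
n_total = sum(sum(row) for row in cross_table)

# Column totals: how many wines each cluster received.
cluster_sizes = [sum(row[j] for row in cross_table) for j in range(n_clusters)]

# For each cluster, the count of its most common cultivar.
majority_counts = [max(row[j] for row in cross_table) for j in range(n_clusters)]

# Purity: fraction of wines that belong to the majority cultivar of their cluster.
purity = sum(majority_counts) / n_total

print("cluster sizes:", cluster_sizes)
print("purity:", round(purity, 3))
```

Purity only measures how mixed each cluster is. It does not require cluster numbers to line up with cultivar numbers, which suits k-means because its cluster labels are arbitrary.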

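Gap statistic: compares the within-cluster dissimilarity of the data with that of data obtained from bootstrap samples.

We check the difference between the observed and the expected values. This is possible here because we already know the data is split into three groups.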

##        logW   E.logW       gap     SE.sim
## 1  9.655294 9.939941 0.2846472 0.03514345
## 2  8.987942 9.255181 0.2672389 0.03329494
## 3  8.617563 8.870678 0.2531152 0.02869671
## 4  8.370194 8.587259 0.2170650 0.03116535
## 5  8.193144 8.387801 0.1946568 0.02820239
## 6  7.979259 8.236368 0.2571090 0.02988772
## 7  7.819287 8.098896 0.2796087 0.03104365
## 8  7.685612 7.991956 0.3063439 0.02629217
## 9  7.591487 7.899753 0.3082661 0.02329127
## 10 7.496676 7.820809 0.3241326 0.02353644
## 11 7.398811 7.752443 0.3536316 0.02285021
## 12 7.340516 7.692275 0.3517596 0.02365140
## 13 7.269456 7.640784 0.3713283 0.02412040
## 14 7.224292 7.593502 0.3692103 0.02428222
## 15 7.157981 7.551405 0.3934239 0.02679342
## 16 7.104300 7.511463 0.4071627 0.02767741
## 17 7.054116 7.475877 0.4217612 0.02774118
## 18 7.006179 7.439694 0.4335148 0.02724484
## 19 6.971455 7.406621 0.4351667 0.02809914
## 20 6.932463 7.374723 0.4422595 0.02842890

Based on the gap statistic, five clusters looks reasonable: the gap reaches its minimum at k = 5 (0.195).
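
The gap table can be read with more than one rule, so a small cross-check may help. The sketch below (Python, standard library only) copies the table values, confirms that the gap column equals E.logW minus logW, and then applies two rules: the minimum-gap reading used in this report, and the criterion from Tibshirani et al., which picks the smallest k with gap(k) >= gap(k+1) - SE(k+1). The function name `tibshirani_k` is introduced here for illustration and is not part of the original analysis.

```python
# (logW, E.logW, gap, SE.sim) for k = 1..20, copied from the table in the report.
rows = [
    (9.655294, 9.939941, 0.2846472, 0.03514345),
    (8.987942, 9.255181, 0.2672389, 0.03329494),
    (8.617563, 8.870678, 0.2531152, 0.02869671),
    (8.370194, 8.587259, 0.2170650, 0.03116535),
    (8.193144, 8.387801, 0.1946568, 0.02820239),
    (7.979259, 8.236368, 0.2571090, 0.02988772),
    (7.819287, 8.098896, 0.2796087, 0.03104365),
    (7.685612, 7.991956, 0.3063439, 0.02629217),
    (7.591487, 7.899753, 0.3082661, 0.02329127),
    (7.496676, 7.820809, 0.3241326, 0.02353644),
    (7.398811, 7.752443, 0.3536316, 0.02285021),
    (7.340516, 7.692275, 0.3517596, 0.02365140),
    (7.269456, 7.640784, 0.3713283, 0.02412040),
    (7.224292, 7.593502, 0.3692103, 0.02428222),
    (7.157981, 7.551405, 0.3934239, 0.02679342),
    (7.104300, 7.511463, 0.4071627, 0.02767741),
    (7.054116, 7.475877, 0.4217612, 0.02774118),
    (7.006179, 7.439694, 0.4335148, 0.02724484),
    (6.971455, 7.406621, 0.4351667, 0.02809914),
    (6.932463, 7.374723, 0.4422595, 0.02842890),
]

gaps = [r[2] for r in rows]
ses = [r[3] for r in rows]

# The gap column should equal E.logW - logW up to rounding in the printout.
gap_matches = all(abs((e_logw - logw) - gap) < 1e-5
                  for logw, e_logw, gap, _ in rows)

# Reading used in the report: the k with the smallest gap.
k_min_gap = gaps.index(min(gaps)) + 1


def tibshirani_k(gaps, ses):
    """Smallest k with gap(k) >= gap(k+1) - SE(k+1); falls back to the largest k."""
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - ses[i + 1]:
            return i + 1  # list index i corresponds to k = i + 1
    return len(gaps)


k_tibs = tibshirani_k(gaps, ses)

print("gap column consistent:", gap_matches)
print("minimum-gap k:", k_min_gap)
print("Tibshirani-criterion k:", k_tibs)
```

With the values in the table, the minimum-gap reading returns 5 and the Tibshirani criterion returns 1, so the choice of rule matters for this data and the gap statistic alone may not settle the number of clusters.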

Market Segmentation - KMEANS

InvoiceNo  StockCode  Description                          Quantity
541811     85099B     JUMBO BAG RED RETROSPOT                     9
536415     22632      HAND WARMER RED RETROSPOT                   3
578314     22697      GREEN REGENCY TEACUP AND SAUCER            24
545162     85014A     BLACK/BLUE POLKADOT UMBRELLA                6
575325     84580      MOUSE TOY WITH PINK T-SHIRT                 1
567853     23290      SPACEBOY CHILDRENS BOWL                     4
546542     22077      6 RIBBONS RUSTIC CHARM                      2
541871     37342      POLKADOT COFFEE CUP & SAUCER PINK           2
566399     22561      WOODEN SCHOOL COLOURING SET                12
547365     85123A     WHITE HANGING HEART T-LIGHT HOLDER         12
##   InvoiceNo   StockCode Description    Quantity InvoiceDate   UnitPrice 
##           0           0           0           0           0           0 
##  CustomerID     Country 
##      135080           0

## # A tibble: 12 x 3
##    time_mon profit_mon profit100
##    <ord>         <dbl>     <dbl>
##  1 1           691365.       691
##  2 2           523632.       524
##  3 3           717639.       718
##  4 4           537809.       538
##  5 5           770536.       771
##  6 6           761740.       762
##  7 7           719221.       719
##  8 8           759138.       759
##  9 9          1058590.      1059
## 10 10         1154979.      1155
## 11 11         1509496.      1509
## 12 12         1462539.      1463
## # A tibble: 20 x 3
##    Description                           sales levels                           
##    <fct>                                 <dbl> <fct>                            
##  1 "DOTCOM POSTAGE"                    206249. "DOTCOM POSTAGE"                 
##  2 "REGENCY CAKESTAND 3 TIER"          174485. "REGENCY CAKESTAND 3 TIER"       
##  3 "PAPER CRAFT , LITTLE BIRDIE"       168470. "PAPER CRAFT , LITTLE BIRDIE"    
##  4 "WHITE HANGING HEART T-LIGHT HOLDE~ 106293. "WHITE HANGING HEART T-LIGHT HOL~
##  5 "PARTY BUNTING"                      99504. "PARTY BUNTING"                  
##  6 "JUMBO BAG RED RETROSPOT"            94340. "JUMBO BAG RED RETROSPOT"        
##  7 "MEDIUM CERAMIC TOP STORAGE JAR"     81701. "MEDIUM CERAMIC TOP STORAGE JAR" 
##  8 "Manual"                             78113. "Manual"                         
##  9 "POSTAGE"                            78102. "POSTAGE"                        
## 10 "RABBIT NIGHT LIGHT"                 66965. "RABBIT NIGHT LIGHT"             
## 11 "PAPER CHAIN KIT 50'S CHRISTMAS "    64952. "PAPER CHAIN KIT 50'S CHRISTMAS "
## 12 "ASSORTED COLOUR BIRD ORNAMENT"      59095. "ASSORTED COLOUR BIRD ORNAMENT"  
## 13 "CHILLI LIGHTS"                      54118. "CHILLI LIGHTS"                  
## 14 "SPOTTY BUNTING"                     42548. "SPOTTY BUNTING"                 
## 15 "JUMBO BAG PINK POLKADOT"            42436. "JUMBO BAG PINK POLKADOT"        
## 16 "BLACK RECORD COVER FRAME"           40652. "BLACK RECORD COVER FRAME"       
## 17 "PICNIC BASKET WICKER 60 PIECES"     39620. "PICNIC BASKET WICKER 60 PIECES" 
## 18 "DOORMAT KEEP CALM AND COME IN"      38167. "DOORMAT KEEP CALM AND COME IN"  
## 19 "SET OF 3 CAKE TINS PANTRY DESIGN "  38158. "SET OF 3 CAKE TINS PANTRY DESIG~
## 20 "JAM MAKING SET WITH JARS"           37129. "JAM MAKING SET WITH JARS"

Customer Segments

Outlier removal

Finding the optimal K

Hartigan plot

Elbow method: look for the point where the SS / total_SS ratio starts to change only slightly.

Clustering into five groups looks appropriate.

##  [1] 12861.000  8044.794  4787.072  3550.839  2815.099  2411.752  2114.481
##  [8]  1858.031  1704.857  1525.015  1417.678  1294.715  1227.693  1150.527
## [15]  1082.664
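
To make the elbow reading less dependent on eyeballing a plot, the sketch below (Python, standard library only) turns the 15 values printed above into two ratios per k: the share of the k = 1 value that remains, and the relative drop from the previous k. It assumes the values are the total within-cluster sums of squares for k = 1 to 15, which the output itself does not label.

```python
# Values copied from the report output.
# Assumption: total within-cluster sum of squares for k = 1..15.
wss = [12861.000, 8044.794, 4787.072, 3550.839, 2815.099, 2411.752, 2114.481,
       1858.031, 1704.857, 1525.015, 1417.678, 1294.715, 1227.693, 1150.527,
       1082.664]

# Share of the k = 1 value still unexplained at each k.
remaining = [w / wss[0] for w in wss]

# Relative drop when going from k-1 to k clusters (defined for k = 2..15).
rel_drop = [(wss[i - 1] - wss[i]) / wss[i - 1] for i in range(1, len(wss))]

for k, drop in enumerate(rel_drop, start=2):
    print(f"k={k:2d}  remaining={remaining[k - 1]:.3f}  drop={drop:.3f}")
```

The elbow is the k after which the drop column stays small. Where exactly it flattens is still a judgment call, so the printed ratios are best read alongside the Hartigan plot rather than as a replacement for it.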

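Group 1 has an average of about 14,100 and contains 2,628 customers (classified as VIP).

Group 3 pays about 3,786 on average. These are prospective customers who could move up to VIP, so they are set as the target.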

## Classes 'tbl_df', 'tbl' and 'data.frame':    4 obs. of  4 variables:
##  $ K_group: chr  "Group 4" "Group 2" "Group 1" "Group 3"
##  $ profit : num  14100 3786 965 439
##  $ receny : num  38 47 70 273
##  $ n      : num  323 268 53 26

Random_Forest

## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Group 1 Group 2 Group 3 Group 4
##    Group 1     525       2       1       0
##    Group 2       0     103       1       0
##    Group 3       1       0     205       0
##    Group 4       0       1       0      17
## 
## Overall Statistics
##                                           
##                Accuracy : 0.993           
##                  95% CI : (0.9848, 0.9974)
##     No Information Rate : 0.6145          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9872          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Group 1 Class: Group 2 Class: Group 3
## Sensitivity                  0.9981         0.9717         0.9903
## Specificity                  0.9909         0.9987         0.9985
## Pos Pred Value               0.9943         0.9904         0.9951
## Neg Pred Value               0.9970         0.9960         0.9969
## Prevalence                   0.6145         0.1238         0.2418
## Detection Rate               0.6133         0.1203         0.2395
## Detection Prevalence         0.6168         0.1215         0.2407
## Balanced Accuracy            0.9945         0.9852         0.9944
##                      Class: Group 4
## Sensitivity                 1.00000
## Specificity                 0.99881
## Pos Pred Value              0.94444
## Neg Pred Value              1.00000
## Prevalence                  0.01986
## Detection Rate              0.01986
## Detection Prevalence        0.02103
## Balanced Accuracy           0.99940
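
The headline statistics can be recomputed from the confusion matrix alone. The sketch below (Python, standard library only) copies the matrix from the output above, with rows as predictions and columns as the reference, and derives accuracy, the no-information rate, per-class sensitivity, and Cohen's kappa. It is a hand check of the printed numbers, not the code that produced them.

```python
labels = ["Group 1", "Group 2", "Group 3", "Group 4"]

# Confusion matrix copied from the report: rows = prediction, columns = reference.
cm = [
    [525, 2, 1, 0],
    [0, 103, 1, 0],
    [1, 0, 205, 0],
    [0, 1, 0, 17],
]

k = len(labels)
n = sum(sum(row) for row in cm)

row_tot = [sum(row) for row in cm]                            # predicted counts
col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # reference counts

# Accuracy: share of cases on the diagonal.
accuracy = sum(cm[i][i] for i in range(k)) / n

# No-information rate: accuracy of always predicting the largest reference class.
no_info_rate = max(col_tot) / n

# Sensitivity per class: correct predictions divided by reference count.
sensitivity = [cm[j][j] / col_tot[j] for j in range(k)]

# Cohen's kappa: agreement beyond what the marginals would give by chance.
expected = sum(r * c for r, c in zip(row_tot, col_tot)) / n ** 2
kappa = (accuracy - expected) / (1 - expected)

print("n:", n)
print("accuracy:", round(accuracy, 3))
print("no-information rate:", round(no_info_rate, 4))
print("kappa:", round(kappa, 4))
for name, s in zip(labels, sensitivity):
    print(name, "sensitivity:", round(s, 4))
```

Because Group 1 makes up about 61% of the reference labels, kappa and the per-class sensitivities are more informative here than accuracy on its own.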