May 5, 2019

Cluster Analysis with k-means

Overview

This is an RStudio Shiny application developed as part of the final project for the Developing Data Products course in the Coursera Data Science Specialization.

The application performs a cluster analysis with k-means and is available at the link clusteranalysiskmeans. The code is in the GitHub repository here.

This analysis is an example of clustering customers from a wholesale customer database.

You can download the data I’m using from the UCI Machine Learning Repository here, where you can also find the attribute information.

Focus on Data

Summary statistics for the data:

##  Channel Region      Fresh             Milk          Grocery     
##  1:298   1: 77   Min.   :     3   Min.   :   55   Min.   :    3  
##  2:142   2: 47   1st Qu.:  3128   1st Qu.: 1533   1st Qu.: 2153  
##          3:316   Median :  8504   Median : 3627   Median : 4756  
##                  Mean   : 12000   Mean   : 5796   Mean   : 7951  
##                  3rd Qu.: 16934   3rd Qu.: 7190   3rd Qu.:10656  
##                  Max.   :112151   Max.   :73498   Max.   :92780  
##      Frozen        Detergents_Paper    Delicassen     
##  Min.   :   25.0   Min.   :    3.0   Min.   :    3.0  
##  1st Qu.:  742.2   1st Qu.:  256.8   1st Qu.:  408.2  
##  Median : 1526.0   Median :  816.5   Median :  965.5  
##  Mean   : 3071.9   Mean   : 2881.5   Mean   : 1524.9  
##  3rd Qu.: 3554.2   3rd Qu.: 3922.0   3rd Qu.: 1820.2  
##  Max.   :60869.0   Max.   :40827.0   Max.   :47943.0

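The analysis uses only the numeric variables (Fresh, Milk, Grocery, Frozen, Detergents_Paper and Delicassen); Channel and Region are categorical and are left out.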
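As a minimal sketch of this step, assuming the data has been read into a data frame with Channel and Region stored as factors (as the summary above suggests), the numeric columns can be kept with `sapply` and `is.numeric`. The data frame name `wholesale` and the toy rows below are made up for illustration; they are not rows from the real dataset.

```r
# Toy data frame mimicking part of the column layout of the wholesale data.
# The values are invented for illustration; they are not real rows.
wholesale <- data.frame(
  Channel = factor(c(1, 2, 1)),
  Region  = factor(c(3, 3, 1)),
  Fresh   = c(5000, 12000, 800),
  Milk    = c(1500, 9000, 300),
  Grocery = c(2000, 7500, 450)
)

# Keep only the numeric columns; factors such as Channel and Region are dropped.
numeric_vars <- wholesale[, sapply(wholesale, is.numeric), drop = FALSE]
names(numeric_vars)
```

Using `drop = FALSE` keeps the result a data frame even if only one numeric column remains.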

Functionality

In this app you can find:

  • A graph of the total within-cluster sum of squares for different numbers of clusters (k), to help choose a suitable value
  • The number of rows considered for the analysis
  • A heatmap to help characterize the clusters
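The graph of the total within-cluster sum of squares could be produced along the following lines. This is only a sketch, not the app's actual code: it runs on synthetic data rather than the wholesale dataset, and the range of k and the `nstart` value are assumptions chosen to keep the example fast.

```r
set.seed(42)  # k-means uses random starts; fix the seed for reproducibility

# Synthetic stand-in for the numeric variables: three well-separated groups.
# This is NOT the wholesale data, only an illustration.
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# Total within-cluster sum of squares for each candidate k.
k_values <- 2:10
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

# Draw the plot only in an interactive session.
# The "elbow" of the curve suggests a reasonable number of clusters.
if (interactive()) {
  plot(k_values, wss, type = "b",
       xlab = "Number of clusters (k)",
       ylab = "Total within-cluster sum of squares")
}
```

The curve always tends to decrease as k grows, so the point of interest is where the decrease flattens out, not the minimum.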

Available Interactions

  • Choose the number of outliers to remove (default = 5)
  • Choose the range of values of k (default = from 2 to 20)
  • Choose whether to rescale the variables (default = TRUE)
  • Choose the number of clusters for the heatmap (default = 5)
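To illustrate how these controls might map onto an analysis, here is a hedged sketch. The function `cluster_customers` and its parameter names are hypothetical, and the outlier rule (dropping the rows farthest from the column means on standardized variables) is an assumption, since the slides do not say how the app defines outliers. The data is synthetic.

```r
set.seed(1)

# Hypothetical helper: the parameters mirror the app's controls,
# but the outlier rule is an assumption, not the app's actual logic.
cluster_customers <- function(data, n_outliers = 5, rescale = TRUE, k = 5) {
  x <- as.matrix(data)
  if (n_outliers > 0) {
    # Assumed rule: drop the rows farthest from the column means
    # (Euclidean distance on standardized variables).
    z <- scale(x)
    dist_from_center <- sqrt(rowSums(z^2))
    drop <- order(dist_from_center, decreasing = TRUE)[seq_len(n_outliers)]
    x <- x[-drop, , drop = FALSE]
  }
  if (rescale) x <- scale(x)  # center and scale each column
  km <- kmeans(x, centers = k, nstart = 10)
  list(n_rows = nrow(x), fit = km)
}

# Synthetic skewed data standing in for some of the spending variables.
toy <- data.frame(
  Fresh   = rlnorm(200, meanlog = 9),
  Milk    = rlnorm(200, meanlog = 8),
  Grocery = rlnorm(200, meanlog = 8.5)
)

res <- cluster_customers(toy)
res$n_rows  # rows left after outlier removal

# Heatmap of the cluster centers, drawn only in an interactive session.
if (interactive()) {
  heatmap(res$fit$centers, Rowv = NA, Colv = NA, scale = "none")
}
```

Rescaling matters here because k-means relies on Euclidean distance, so without it the variables with the largest spending ranges would dominate the clustering.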