Unsupervised_lbb

R Markdown

Using either of the two unsupervised learning algorithms you’ve learned, produce a simple R Markdown document that demonstrates either clustering or dimensionality reduction on either the wholesale.csv or the nyc dataset.

Explain your choice of parameters (how you choose k for k-means clustering, or how many dimensions you retain for PCA) from the original data. What business utility does the unsupervised model you’ve developed offer? The R Markdown document should be no longer than 4 paragraphs and contain one or two visualizations.

# Libraries used
pacman::p_load(cluster, factoextra, tidyverse)

Wholesale data:

# Read the data
wholesale <- read.csv("wholesale.csv")
str(wholesale)

## 'data.frame':    440 obs. of  8 variables:
##  $ Channel         : int  2 2 2 1 2 2 2 2 1 2 ...
##  $ Region          : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Fresh           : int  12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
##  $ Milk            : int  9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
##  $ Grocery         : int  7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
##  $ Frozen          : int  214 1762 2405 6404 3915 666 480 1669 425 1159 ...
##  $ Detergents_Paper: int  2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
##  $ Delicassen      : int  1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...

# Data preparation
  ## Check each column for NA values
wholesale %>% 
  is.na() %>% 
  colSums()

##          Channel           Region            Fresh             Milk 
##                0                0                0                0 
##          Grocery           Frozen Detergents_Paper       Delicassen 
##                0                0                0                0

  ## Drop Channel and Region, then scale the remaining variables
wholescale <- wholesale %>% 
  select(-c(Channel,Region)) %>% 
  scale()

# Biplot of the first two principal components
# (wholescale is already centered and scaled, so prcomp() needs no extra arguments)
wholescale %>% 
  prcomp() %>% 
  biplot()

# Determine the optimal number of clusters with the elbow method (total within-cluster sum of squares).
set.seed(100)
fviz_nbclust(wholescale, kmeans, method = "wss")

# Alternatively, use the average silhouette method, which measures the quality of a clustering.
set.seed(100)
fviz_nbclust(wholescale, kmeans, method = "silhouette")

# K-means clustering and plotting
set.seed(100)
wholescale %>% 
  kmeans(centers = 5) %>%  
  fviz_cluster(data = wholescale, ggtheme = theme_bw(), main = "Clustering plot")

The biplot comes from PCA, which is one of many ways to analyse the structure of a correlation matrix. By construction, the first principal axis is the one that maximizes the variance (reflected by its eigenvalue) when the data are projected onto a line; the second is orthogonal to it and maximizes the remaining variance. This is why the first two axes yield the best two-dimensional approximation of the original variable space when it is projected onto a plane.
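The link between eigenvalues and retained variance can be checked numerically. The sketch below is only an illustration: `variance_explained()` is a hypothetical helper, not part of the report, and it runs on the built-in `USArrests` data as a stand-in so that it works without wholesale.csv. In this report the input would be `wholescale`.

```r
# Proportion of variance captured by each principal component.
# `x` is assumed to be a numeric matrix that is already centered and scaled,
# like `wholescale` in this report.
variance_explained <- function(x) {
  pca <- prcomp(x)        # data already scaled, so no scale. = TRUE needed
  eig <- pca$sdev^2       # eigenvalues = variances of the components
  data.frame(
    component  = paste0("PC", seq_along(eig)),
    eigenvalue = eig,
    proportion = eig / sum(eig),
    cumulative = cumsum(eig) / sum(eig)
  )
}

# Stand-in data (assumption): replace `demo` with `wholescale`.
demo <- scale(USArrests)
ve <- variance_explained(demo)
print(ve)
```

The `cumulative` column is one way to decide how many dimensions to retain: keep components until the cumulative share reaches a threshold you are comfortable with.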

Principal components are just linear combinations of the original variables. Therefore, plotting individual factor scores (defined as Xu, where X is the scaled data matrix and u is the vector of loadings of a principal component) can help to highlight groups of homogeneous individuals, or to interpret an individual’s overall score across all variables at once. In other words, it summarizes each individual’s location with respect to their values on the p variables, or a combination thereof.
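To make the Xu definition concrete, the following sketch multiplies a scaled data matrix by the loadings and compares the result with the scores that `prcomp()` returns. It uses the built-in `USArrests` data as a stand-in (an assumption for illustration); the same check should hold for `wholescale`.

```r
# Stand-in data (assumption): replace `demo` with `wholescale`.
demo <- scale(USArrests)
pca  <- prcomp(demo)

# Loadings of the first principal component (the vector u).
u1 <- pca$rotation[, 1]

# Factor scores on PC1 computed by hand as X %*% u.
scores_pc1 <- demo %*% u1

# prcomp() stores the same scores in pca$x.
same_pc1 <- isTRUE(all.equal(as.numeric(scores_pc1), as.numeric(pca$x[, 1])))

# The same identity holds for all components at once.
max_diff <- max(abs(demo %*% pca$rotation - pca$x))

print(same_pc1)
print(max_diff)
```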

Here it is worth noting that both variables and individuals are shown on the same diagram (this is called a biplot), which helps to interpret the factorial axes while looking at individuals’ location.

The plot shows:

- Where each observation in the wholesale data is positioned in terms of PC1 and PC2, represented by text labels
- The loading of each variable on PC1 and PC2, represented by the red arrows

In clustering, we want observations in the same group to be similar. Because there is no response variable, this is an unsupervised method: it seeks relationships between the n observations without being trained on a response. To choose k for k-means clustering, I tested for the optimal number of clusters using two popular methods:

1. Elbow method: the results suggest that 5 is the optimal number of clusters, as that is where the curve appears to bend (the elbow).
2. Silhouette method: a high average silhouette width indicates a good clustering. The results show that 2 clusters maximize the average silhouette width, with 5 clusters coming in as the second-best choice.
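The two criteria can also be computed by hand, which makes it easier to read the exact numbers behind the plots. This is a rough sketch rather than what `fviz_nbclust()` does internally: `choose_k()` is a hypothetical helper, `nstart = 25` is an assumed setting, and the built-in `USArrests` data stands in for `wholescale`.

```r
library(cluster)  # silhouette()

# Total within-cluster sum of squares and average silhouette width for k = 1..k_max.
# `x` is assumed to be a scaled numeric matrix such as `wholescale`.
choose_k <- function(x, k_max = 10, seed = 100) {
  d <- dist(x)  # Euclidean distances, needed for the silhouette
  out <- lapply(1:k_max, function(k) {
    set.seed(seed)
    fit <- kmeans(x, centers = k, nstart = 25)
    # The silhouette is undefined for a single cluster.
    sil <- if (k == 1) NA_real_ else mean(silhouette(fit$cluster, d)[, "sil_width"])
    data.frame(k = k, wss = fit$tot.withinss, avg_silhouette = sil)
  })
  do.call(rbind, out)
}

# Stand-in data (assumption): replace `demo` with `wholescale`.
demo <- scale(USArrests)
res <- choose_k(demo, k_max = 8)
print(res)

# k with the highest average silhouette width (NA for k = 1 is ignored).
best_sil_k <- res$k[which.max(res$avg_silhouette)]
print(best_sil_k)
```

Reading the `wss` column for the point where the decrease levels off gives the elbow, while `best_sil_k` gives the silhouette choice; when the two disagree, as they do in this report, the final k is a judgment call.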

The elbow method points to 5 clusters and the silhouette method ranks 5 as its second-best choice, so I perform the final analysis and extract the results using 5 clusters.
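One possible way to extract the results is to summarise each cluster in the original units, which is usually easier to interpret than scaled values. The sketch below is an assumption about how that could look, not part of the original analysis: `profile_clusters()` is a hypothetical helper, and the built-in `USArrests` data stands in for the wholesale spending columns.

```r
# Mean of every variable per cluster, in the original (unscaled) units.
# `raw` is the unscaled data frame, `scaled` the matrix used for clustering.
profile_clusters <- function(raw, scaled, k, seed = 100) {
  set.seed(seed)
  fit <- kmeans(scaled, centers = k, nstart = 25)
  means <- aggregate(raw, by = list(cluster = fit$cluster), FUN = mean)
  means$size <- fit$size  # number of observations in clusters 1..k
  means
}

# Stand-in data (assumption): in this report, `raw` would be the wholesale
# data without Channel and Region, and `scaled` would be `wholescale`.
demo_raw <- USArrests
prof <- profile_clusters(demo_raw, scale(demo_raw), k = 5)
print(prof)
```

For the wholesale data, a table like this would show the average spend per product category in each of the 5 clusters, together with how many observations fall into each one.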

Unsupervised_lbb_shelloren

Shelloren

September 10, 2018
