INTRODUCTION
I will demonstrate this using an unsupervised machine-learning technique, the k-means clustering algorithm. Consider a supermarket mall owner who has some basic data about his customers: customer ID, age, gender, annual income and spending score. The spending score is a value assigned to each customer based on defined parameters such as customer behavior and purchasing data.
This data helps to understand the customers and to predict target customers, so that the marketing team can create strategies and plan accordingly.
Cleaning the Dataset
Now I will check whether there are any missing values and, if so, remove them, as they may interfere with the data structure and cause inconsistencies in the results.
# Count the missing values (NAs) across the whole data frame
sum(is.na(mallCust))
## [1] 0
Since the result shows that there are no observations with NAs, I will go ahead with the dataset as it is.
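When the total is not zero, it helps to know which columns hold the NAs before dropping rows. The sketch below assumes a small made-up data frame called demo (it is not the mall dataset) whose column names follow the ones used in the plots in this article.

```r
# Hypothetical stand-in for mallCust, with two NAs planted on purpose
demo <- data.frame(
  CustomerID    = 1:5,
  Gender        = c("Male", "Female", NA, "Female", "Male"),
  Age           = c(19, 21, 20, NA, 31),
  Income        = c(15, 15, 16, 16, 17),
  SpendingScore = c(39, 81, 6, 77, 40)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(demo))
print(na_per_column)

# Keep only the rows that have no missing value in any column
complete_rows <- demo[complete.cases(demo), ]
print(nrow(complete_rows))
```

Dropping rows is only one option; with many NAs, imputing values may be a better choice than losing observations.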
Statistics and Exploratory Data Analysis
Basic statistical analysis helps one to properly understand the dataset being analyzed. I run an analysis to find out basic information about the data, such as the measures of location (mode, median and mean) and the minimum and maximum observations for every variable.
Using ggplot for visualisation and exploratory data analysis.
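As a rough illustration of these measures, the sketch below uses a made-up vector of ages (not the mall data). Base R's summary() reports the minimum, median, mean and maximum, but R has no built-in function for the statistical mode, so stat_mode() is a hypothetical helper written for this example.

```r
# Hypothetical helper: most frequent value of a numeric vector
# (on ties, the smallest of the tied values is returned)
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Made-up ages, for illustration only
ages <- c(19, 21, 20, 23, 31, 22, 35, 23, 64, 30)

# Minimum, quartiles, median, mean and maximum in one call
print(summary(ages))

# The mode needs the helper defined above
print(stat_mode(ages))
```

On a data frame, summary(mallCust) would give the same overview for every column at once.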
library(ggplot2)
library(cowplot)
Age - Gender variable
ageHist <- ggplot(mallCust, aes(Age, fill=Gender)) +
geom_histogram(bins = 20) +
ggtitle("Distribution of data per Age and Gender")
ageBox <- ggplot(mallCust, aes(x=Gender, y=Age, fill = Gender)) +
geom_boxplot() +
ggtitle("Boxplot of Age per Gender")
plot_grid(ageHist, ageBox)

Gender - Income variable
incomeHist <- ggplot(mallCust, aes(Income, fill=Gender)) +
geom_density(alpha = 0.5) +
ggtitle("Density of income per Gender")
incomeBox <- ggplot(mallCust, aes(x=Gender, y=Income, fill = Gender)) +
geom_boxplot() +
ggtitle("Boxplot of Income per Gender")
plot_grid(incomeHist, incomeBox)

Gender - Spending Score
spendHist <- ggplot(mallCust, aes(SpendingScore, fill=Gender)) +
geom_dotplot(binwidth = 10) +
ggtitle("Dot plot of Spending Score per Gender")
spendBox <- ggplot(mallCust, aes(x=Gender, y=SpendingScore, fill = Gender)) +
geom_boxplot() +
ggtitle("Boxplot of Spending Score per Gender")
plot_grid(spendHist, spendBox)

How does the variable ‘Spending Score’ vary in relation to Gender, Age and Income?
This section is dedicated to analysing the variations between the variable ‘spending score’ and other corresponding variables in order to determine the existing relationships.
The diagram below indicates that the spending score is highest among the 23 to 32 age group, peaking at around 28, and lowest among 45 to 60 year olds. The spending score starts increasing again for customers aged 60 and over.
ggplot(mallCust, aes(x = Age, y=SpendingScore)) +
geom_smooth() +
geom_point(aes(size=SpendingScore, colour = Gender)) +
scale_radius(range=c(1, 5))+
geom_vline(xintercept = c(23, 28, 32, 45,60, 70)) +
ggtitle("Spending Score v Age")

Although the spending score for women is higher than that of men in each bracket, the mode (the most frequent score) for men is higher than that of women.
ggplot(mallCust, aes(x = Gender, y=SpendingScore)) +
geom_violin(scale = "area", aes(fill = Gender)) +
ggtitle("Spending Score v Gender") +
geom_hline(yintercept = c(20, 40, 60, 80, 90)) +
stat_sum()

High spending scores occur for incomes below 40K and between 70K and 110K. The highest concentration of observations (income versus spending score) falls in the 40K to 70K range.
ggplot(mallCust, aes(x = Income, y=SpendingScore)) +
geom_hex() +
geom_smooth() +
geom_vline(xintercept = c(40, 70, 110)) +
ggtitle("Spending Score v Income") +
scale_fill_gradientn(colours = topo.colors(6, alpha = 0.8))

As the customer's income increases, the variance of the spending score also increases.
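A statement about variance can be checked numerically by splitting income into brackets and computing the variance of the spending score within each one. The sketch below is only an illustration: the income and score values are invented, and the bracket limits (40 and 70) are borrowed from the vertical lines in the plot above.

```r
# Invented values, for illustration only (income in thousands)
income <- c(15, 20, 30, 45, 55, 65, 75, 90, 105)
score  <- c(40, 45, 50, 48, 50, 52, 10, 90, 50)

# Assign each income to a bracket; intervals are closed on the right,
# so an income of exactly 40 falls in the first bracket
bracket <- cut(income,
               breaks = c(0, 40, 70, Inf),
               labels = c("<=40K", "40-70K", ">70K"))

# Variance of the spending score inside each income bracket
var_by_bracket <- tapply(score, bracket, var)
print(var_by_bracket)
```

Running the same two steps on mallCust$Income and mallCust$SpendingScore would show how the spread really changes across the income range.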
Clustering
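Clustering the data, first with 2 centres and then with 5.
cluster_Data <- function(k, title){
  # Run k-means on columns 3 to 5 (assumed here to be Age, Income and SpendingScore)
  km <- kmeans(mallCust[,3:5], centers = k)
  # Cluster membership as a factor, so it can be mapped to a discrete colour scale
  cluster <- factor(km$cluster)
  # Colour is mapped inside aes(); set outside aes() it is treated as a literal colour
  agePlot <- ggplot(mallCust, aes(Age, SpendingScore, colour = cluster)) +
    geom_point(size=3) +
    scale_color_discrete(name = "Cluster") +
    ggtitle(title)
  incomePlot <- ggplot(mallCust, aes(Income, SpendingScore, colour = cluster)) +
    geom_point(size=3) +
    scale_color_discrete(name = "Cluster") +
    ggtitle(title)
  # Show both views of the same clustering side by side
  plot_grid(agePlot, incomePlot)
}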
# Cluster with 2 centers
cluster_Data(2, "k-means clustering results with k= 2")

# Cluster with 5 centers
cluster_Data(5, "k-means clustering results with k= 5")
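Because kmeans() starts from randomly chosen centers, the clusters above can change from one run to the next. The sketch below shows one way to make a run repeatable. It is not the code used for the plots above: it fixes the random seed, tries several random starts through the nstart argument, and standardises the columns with scale() so that no variable dominates the distance because of its units. The demo data frame is simulated, with two well-separated groups.

```r
# Fix the random number generator so the run is repeatable
set.seed(42)

# Simulated data with two clearly separated groups (not the mall dataset)
demo <- data.frame(
  Age           = c(rnorm(20, 25, 3), rnorm(20, 55, 3)),
  Income        = c(rnorm(20, 30, 5), rnorm(20, 90, 5)),
  SpendingScore = c(rnorm(20, 80, 5), rnorm(20, 20, 5))
)

# Standardise each column to mean 0 and standard deviation 1
scaled <- scale(demo)

# nstart = 25 runs k-means from 25 random starts and keeps the best result
km <- kmeans(scaled, centers = 2, nstart = 25)

# Number of observations assigned to each cluster
print(table(km$cluster))
```

Whether to scale is a judgment call: it changes the distances k-means works with, so the resulting clusters may differ from those found on the raw columns.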

Identifying the best value for centers using internal validation methods - data structure based metrics.
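One common internal metric is the total within-cluster sum of squares, which kmeans() returns as tot.withinss. The sketch below is a minimal example of the so-called elbow method, not necessarily the metric used in the rest of this analysis: it computes that value for k from 1 to 6 on simulated data containing three groups. The demo data frame and its values are assumptions made for this example.

```r
set.seed(42)

# Simulated data with three well-separated groups (not the mall dataset)
demo <- data.frame(
  Income        = c(rnorm(30, 20, 3), rnorm(30, 60, 3), rnorm(30, 100, 3)),
  SpendingScore = c(rnorm(30, 80, 3), rnorm(30, 50, 3), rnorm(30, 20, 3))
)

# Total within-cluster sum of squares for each candidate number of centers
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(demo, centers = k, nstart = 25)$tot.withinss
})

# The drop should flatten sharply once k reaches the real number of groups
print(data.frame(k = k_values, wss = round(wss, 1)))
```

Plotting the result with plot(k_values, wss, type = "b") makes the bend easier to see. The value of k at the bend is a candidate for the number of centers; on real data the bend is often less clear than in this simulated case.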