Welcome to our journey through the data behind our marketing campaign, where we delve into customer behavior and preferences using R. Our mission extends beyond mere data manipulation; we aim to craft narratives from numbers, turning statistical outputs into strategic insights. By employing tools like dplyr for data wrangling and ggplot2 and plotly for visual storytelling, we’re set to explore the depths of our data. Through advanced techniques such as PCA for simplifying our data landscape and clustering methods to uncover hidden patterns, our goal is to illuminate the path for data-driven decisions in marketing.
Clustering is a method of unsupervised learning, a type of machine learning in which the algorithm seeks to identify inherent groupings within the data without the guidance of a known outcome variable. In the context of our marketing data, clustering aims to group customers into segments based on similarities in their features. The primary statistical considerations in clustering include:
Algorithm Selection: Different algorithms approach the clustering problem in different ways. I’ve used KMeans and hierarchical clustering (a toy sketch of both follows this list):
KMeans: This algorithm partitions the data into K distinct, non-overlapping subsets (clusters) by minimizing the within-cluster variance (or inertia). The goal is to assign data points to clusters such that the total sum of squared distances from each point to the centroid of its cluster is minimized.
Hierarchical Clustering: This method builds a hierarchy of clusters either in a bottom-up approach (agglomerative) or a top-down approach (divisive). Agglomerative hierarchical clustering starts with each data point as a separate cluster and merges them step by step based on some linkage criteria (like minimum distance).
Number of Clusters: Determining the optimal number of clusters is crucial. Techniques like the Elbow Method (in my script with fviz_nbclust) help estimate the appropriate number of clusters by looking at the percentage of variance explained as a function of the number of clusters.
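To make these mechanics concrete, here is a minimal sketch of both algorithms on synthetic data (not our marketing set); the calls mirror the ones used later in the analysis.
# Toy illustration: two Gaussian blobs, clustered two ways
set.seed(42)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2))
# KMeans: minimizes the total within-cluster sum of squares
km <- kmeans(toy, centers = 2, nstart = 25)
km$tot.withinss # the quantity k-means tries to minimize
# Agglomerative clustering: build the full merge tree, then cut it
hc_toy <- hclust(dist(toy), method = "single") # "single" = minimum-distance linkage
table(cutree(hc_toy, k = 2), km$cluster) # compare the two assignments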
Dimensionality reduction refers to the process of reducing the number of variables under consideration; approaches divide into feature selection and feature extraction.
Principal Component Analysis (PCA): PCA is used for dimensionality reduction in this analysis. It is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The key aspects of PCA include (illustrated in the sketch after this list):
Variance Maximization: PCA aims to capture as much of the variability in the original high-dimensional space as possible in the lower-dimensional representation.
Orthogonality: Each principal component is orthogonal (i.e., uncorrelated) to the others, ensuring that they capture different aspects or features of the data.
Dimensionality Reduction: By reducing the number of dimensions (variables), PCA simplifies the complexity of the data, making it easier to analyze and visualize. This is especially useful in large datasets where multicollinearity (high correlation among variables) might be a concern.
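As a quick, self-contained illustration of these three properties (synthetic data, not our marketing set):
# Toy PCA: two correlated variables plus independent noise
set.seed(1)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.3) # strongly correlated with x1
x3 <- rnorm(200)                # independent noise
toy_pca <- prcomp(cbind(x1, x2, x3), center = TRUE, scale. = TRUE)
summary(toy_pca)         # proportion of variance captured by each component
round(cor(toy_pca$x), 3) # orthogonality: off-diagonal correlations are ~0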
Before diving into the technicalities, let’s outline our voyage. We anticipate uncovering distinct groups within our customer base, revealing not just who our customers are but how they interact with our offerings. These insights will not only validate our marketing intuition but also challenge our assumptions, paving the way for refined, personalized marketing strategies.
# **Loading Data** {.tabset .tabset-fade .tabset-pills}
First, we need to load some libraries and import our data. I'm loading readr, part of the tidyverse suite of data-manipulation tools in R, along with a few helper packages, and then reading the CSV file "marketing_campaign.csv" into a data frame called data (here with base R's read.csv; readr's read_csv would work equally well).
# Load libraries
library(knitr)     # dynamic report generation
library(gridExtra) # user-level functions for working with "grid" graphics
library(lubridate) # work with dates and times
library(readr)     # tidyverse data import
library(ggplot2)   # grammar-of-graphics plotting
library(reshape2)  # for melting the data
# Read the data
data <- read.csv("marketing_campaign.csv", sep = ",")
print(paste("Number of datapoints:", nrow(data)))
head(data)
str(data)
summary(data)
colSums(is.na(data))
This block prints the number of rows (data points), displays the first few rows, shows the structure (types of columns and a few entries), provides a summary (basic statistical measures), and counts missing values in each column of the dataset.
data <- na.omit(data)
print(paste("The total number of data-points after removing the rows with missing values are:", nrow(data)))
Here, I’m removing rows with missing (NA) values and then printing the number of data points after this removal.
data$Dt_Customer <- dmy(data$Dt_Customer)
newest_date <- max(data$Dt_Customer)
oldest_date <- min(data$Dt_Customer)
print(paste("The newest customer's enrolment date in the records:", newest_date))
print(paste("The oldest customer's enrolment date in the records:", oldest_date))
The lubridate library is used to convert the Dt_Customer column into Date format. I then find and print the newest and oldest customer enrollment dates.
data$Customer_For <- as.numeric(newest_date - data$Dt_Customer)
This calculates the number of days each customer has been enrolled by subtracting their enrollment date from the newest enrollment date.
# Count of each category in 'Marital_Status'
cat_marital_status <- table(data$Marital_Status)
print("Total categories in the feature Marital_Status:")
## [1] "Total categories in the feature Marital_Status:"
print(cat_marital_status)
##
## Absurd Alone Divorced Married Single Together Widow YOLO
## 2 3 232 857 471 573 76 2
# Count of each category in 'Education'
cat_education <- table(data$Education)
print("Total categories in the feature Education:")
## [1] "Total categories in the feature Education:"
print(cat_education)
##
## 2n Cycle Basic Graduation Master PhD
## 200 54 1116 365 481
# Age of customer today
data$Age <- 2021 - data$Year_Birth
# Total spendings on various items
data$Spent <- data$MntWines + data$MntFruits + data$MntMeatProducts + data$MntFishProducts + data$MntSweetProducts + data$MntGoldProds
# Deriving living situation by marital status
data$Living_With <- data$Marital_Status
data$Living_With[data$Marital_Status %in% c("Married", "Together")] <- "Partner"
data$Living_With[data$Marital_Status %in% c("Absurd", "Widow", "YOLO", "Divorced", "Single")] <- "Alone"
# Feature indicating total children living in the household
data$Children <- data$Kidhome + data$Teenhome
# Feature for total members in the household
data$Family_Size <- ifelse(data$Living_With == "Alone", 1, 2) + data$Children
# Feature pertaining parenthood
data$Is_Parent <- ifelse(data$Children > 0, 1, 0)
# Segmenting education levels into three groups
data$Education[data$Education %in% c("Basic", "2n Cycle")] <- "Undergraduate"
data$Education[data$Education == "Graduation"] <- "Graduate"
data$Education[data$Education %in% c("Master", "PhD")] <- "Postgraduate"
# Renaming columns for clarity
colnames(data)[colnames(data) == "MntWines"] <- "Wines"
colnames(data)[colnames(data) == "MntFruits"] <- "Fruits"
colnames(data)[colnames(data) == "MntMeatProducts"] <- "Meat"
colnames(data)[colnames(data) == "MntFishProducts"] <- "Fish"
colnames(data)[colnames(data) == "MntSweetProducts"] <- "Sweets"
colnames(data)[colnames(data) == "MntGoldProds"] <- "Gold"
# Dropping some of the redundant features
data <- data[, !(names(data) %in% c("Marital_Status", "Dt_Customer", "Z_CostContact", "Z_Revenue", "Year_Birth", "ID"))]
# Get a summary of the dataset
summary(data)
## Education Income Kidhome Teenhome
## Length:2216 Min. : 1730 Min. :0.0000 Min. :0.0000
## Class :character 1st Qu.: 35303 1st Qu.:0.0000 1st Qu.:0.0000
## Mode :character Median : 51382 Median :0.0000 Median :0.0000
## Mean : 52247 Mean :0.4418 Mean :0.5054
## 3rd Qu.: 68522 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :666666 Max. :2.0000 Max. :2.0000
## Recency Wines Fruits Meat
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.:24.00 1st Qu.: 24.0 1st Qu.: 2.00 1st Qu.: 16.0
## Median :49.00 Median : 174.5 Median : 8.00 Median : 68.0
## Mean :49.01 Mean : 305.1 Mean : 26.36 Mean : 167.0
## 3rd Qu.:74.00 3rd Qu.: 505.0 3rd Qu.: 33.00 3rd Qu.: 232.2
## Max. :99.00 Max. :1493.0 Max. :199.00 Max. :1725.0
## Fish Sweets Gold NumDealsPurchases
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000
## Median : 12.00 Median : 8.00 Median : 24.50 Median : 2.000
## Mean : 37.64 Mean : 27.03 Mean : 43.97 Mean : 2.324
## 3rd Qu.: 50.00 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 3.000
## Max. :259.00 Max. :262.00 Max. :321.00 Max. :15.000
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 3.000 1st Qu.: 3.000
## Median : 4.000 Median : 2.000 Median : 5.000 Median : 6.000
## Mean : 4.085 Mean : 2.671 Mean : 5.801 Mean : 5.319
## 3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 8.000 3rd Qu.: 7.000
## Max. :27.000 Max. :28.000 Max. :13.000 Max. :20.000
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.07356 Mean :0.07401 Mean :0.0731 Mean :0.06408
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.00000
## AcceptedCmp2 Complain Response Customer_For
## Min. :0.00000 Min. :0.000000 Min. :0.0000 Min. : 0.0
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:180.0
## Median :0.00000 Median :0.000000 Median :0.0000 Median :355.5
## Mean :0.01354 Mean :0.009477 Mean :0.1503 Mean :353.5
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:529.0
## Max. :1.00000 Max. :1.000000 Max. :1.0000 Max. :699.0
## Age Spent Living_With Children
## Min. : 25.00 Min. : 5.0 Length:2216 Min. :0.0000
## 1st Qu.: 44.00 1st Qu.: 69.0 Class :character 1st Qu.:0.0000
## Median : 51.00 Median : 396.5 Mode :character Median :1.0000
## Mean : 52.18 Mean : 607.1 Mean :0.9472
## 3rd Qu.: 62.00 3rd Qu.:1048.0 3rd Qu.:1.0000
## Max. :128.00 Max. :2525.0 Max. :3.0000
## Family_Size Is_Parent
## Min. :1.000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:0.0000
## Median :3.000 Median :1.0000
## Mean :2.593 Mean :0.7144
## 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :5.000 Max. :1.0000
This section includes a variety of data transformations: deriving new features like age, total spendings, living situation, family size, and parenthood status; segmenting education levels; renaming columns for clarity; and dropping redundant features.
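For comparison, here is a sketch of the same derivations written with dplyr (mentioned in the introduction but not otherwise used here). It assumes the raw column names, so it re-reads the file into a hypothetical raw data frame rather than reusing the renamed data.
# Hypothetical dplyr version of part of the feature engineering (raw columns assumed)
library(dplyr)
raw <- read.csv("marketing_campaign.csv", sep = ",")
raw <- raw %>%
  mutate(Age       = 2021 - Year_Birth,
         Spent     = MntWines + MntFruits + MntMeatProducts +
                     MntFishProducts + MntSweetProducts + MntGoldProds,
         Children  = Kidhome + Teenhome,
         Is_Parent = as.integer(Children > 0))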
options(repos = c(CRAN = "https://cloud.r-project.org"))
# Install GGally only if it is not already available
if (!requireNamespace("GGally", quietly = TRUE)) install.packages("GGally")
# Load the packages
library(ggplot2)
library(GGally)
# Setting up color preferences
pal <- c("#682F2F", "#F3AB60")
# Ensure 'Is_Parent' is a factor
data$Is_Parent <- as.factor(data$Is_Parent)
# Select features to plot
to_plot <- data[, c("Income", "Recency", "Customer_For", "Age", "Spent", "Is_Parent")]
# Creating the pair plot
ggpairs(to_plot, aes(color = Is_Parent),
upper = list(continuous = "points", combo = "box"),
lower = list(continuous = "points", combo = "facetdensity"),
diag = list(continuous = "barDiag")) +
scale_colour_manual(values = pal)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Here, I load ggplot2 and GGally for data visualization. The subsequent code creates pair plots, heatmaps, and various other plots to explore the relationships and distributions in the data.
This is a pair plot (also known as a scatterplot matrix): a matrix of scatterplots used to understand the pairwise relationships between different variables in the data.
Diagonal: The diagonal of the matrix (from top left to bottom right) shows the distribution of each variable: histograms for continuous variables and bar charts for categorical variables such as Is_Parent. These distribution plots allow us to see the spread of each variable, identify modes, and look for signs of skewness.
Lower Triangle: The plots below the diagonal are scatterplots, each representing the relationship between a pair of variables. For example, the scatterplot at the intersection of the Income row and the Recency column shows the relationship between these two variables. These plots can help identify correlations or potential patterns of association between variables.
Upper Triangle: The plots above the diagonal typically mirror the lower triangle but may include additional layers of information or different types of visual representations, like density plots or correlation coefficients. However, in my plot, the upper triangle seems to replicate the lower one without additional information.
Color and Size Coding: The points in the scatterplots may be color-coded or size-coded (or both) to represent additional variables. In my plot, color indicates whether a customer is a parent (Is_Parent), with one color for ‘1’ and another for ‘0’.
Interpretation: By examining the scatterplots, we can look for trends, clusters, or outliers. For example, if there’s a linear trend in the scatterplots, this suggests a positive or negative correlation. If the points form distinct groups, this might indicate that the data clusters naturally along those dimensions.
Outliers: Any points that fall far away from the main cloud of points in the scatterplots are outliers. Outliers can be of particular interest because they may represent errors, unusual cases, or important variations in the data.
Box Plots: On the far right are box plots of each continuous variable against the categorical variable Is_Parent. These show the distribution’s central tendency, dispersion, and skewness, and they can also highlight outliers (shown as individual points outside the whiskers of the box plot).
# Removing outliers
data <- subset(data, Age < 90)
data <- subset(data, Income < 600000)
# Print the total number of data-points after removing outliers
print(paste("The total number of data-points after removing the outliers are:", nrow(data)))
## [1] "The total number of data-points after removing the outliers are: 2212"
# Select only numeric columns for the correlation matrix
numeric_data <- data[sapply(data, is.numeric)]
# Compute the correlation matrix
corrmat <- cor(numeric_data, use = "complete.obs")
# Melt the correlation matrix
melted_corr <- melt(corrmat)
# Create the heatmap
ggplot(melted_corr, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1), space = "Lab", name="Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1),
axis.text.y = element_text(size = 12)) +
labs(x = "", y = "", title = "Correlation Matrix", subtitle = "Heatmap Representation")
# Get list of categorical variables
object_cols <- names(data)[sapply(data, function(x) class(x) %in% c("factor", "character"))]
# Print the categorical variables
print("Categorical variables in the dataset:")
## [1] "Categorical variables in the dataset:"
print(object_cols)
## [1] "Education" "Living_With" "Is_Parent"
# Label Encoding the categorical variables
for (col in object_cols) {
data[[col]] <- as.numeric(as.factor(data[[col]]))
}
print("All features are now numerical")
## [1] "All features are now numerical"
# Creating a copy of data
ds <- data
# Dropping specified columns
cols_del <- c('AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Complain', 'Response')
ds <- ds[, !(names(ds) %in% cols_del)]
# Scaling the data
scaled_ds <- as.data.frame(scale(ds))
print("All features are now scaled")
## [1] "All features are now scaled"
# Display the first few rows of the scaled data frame
print("Dataframe to be used for further modelling:")
## [1] "Dataframe to be used for further modelling:"
head(scaled_ds)
## Education Income Kidhome Teenhome Recency Wines Fruits
## 1 -0.8933842 0.2870400 -0.8225675 -0.9294885 0.3102831 0.9774386 1.5516896
## 2 -0.8933842 -0.2608231 1.0397860 0.9078918 -0.3807274 -0.8724207 -0.6373172
## 3 -0.8933842 0.9129900 -0.8225675 -0.9294885 -0.7953337 0.3578543 0.5704107
## 4 -0.8933842 -1.1758481 1.0397860 -0.9294885 -0.7953337 -0.8724207 -0.5618342
## 5 0.5715275 0.2942401 1.0397860 -0.9294885 1.5541019 -0.3921688 0.4194448
## 6 0.5715275 0.4902705 -0.8225675 0.9078918 -1.1408389 0.6365191 0.3942838
## Meat Fish Sweets Gold NumDealsPurchases
## 1 1.6899111 2.4529173 1.483377108 0.85238280 0.3509506
## 2 -0.7180674 -0.6508565 -0.633875295 -0.73347657 -0.1686630
## 3 -0.1785018 1.3392102 -0.147150604 -0.03724563 -0.6882765
## 4 -0.6556383 -0.5047966 -0.585202826 -0.75281632 -0.1686630
## 5 -0.2186348 0.1524732 -0.001133197 -0.55941883 1.3901777
## 6 -0.3078192 -0.6873715 0.363910321 -0.57875858 -0.1686630
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## 1 1.4265420 2.5030413 -0.5556886 0.6920240
## 2 -1.1261653 -0.5712104 -1.1708955 -0.1325152
## 3 1.4265420 -0.2296269 1.2899319 -0.5447849
## 4 -0.7614928 -0.9127940 -0.5556886 0.2797544
## 5 0.3325246 0.1119566 0.0595182 -0.1325152
## 6 0.6971970 0.4535402 1.2899319 0.2797544
## Customer_For Age Spent Living_With Children Family_Size
## 1 1.5273754 1.0181218 1.6758664 -1.3492980 -1.26431204 -1.7579612
## 2 -1.1887425 1.2744970 -0.9630792 -1.3492980 1.40425497 0.4489685
## 3 -0.2060017 0.3344545 0.2800468 0.7407911 -1.26431204 -0.6544963
## 4 -1.0603442 -1.2892552 -0.9199266 0.7407911 0.06997147 0.4489685
## 5 -0.9516995 -1.0328800 -0.3074921 0.7407911 0.06997147 0.4489685
## 6 -0.2998312 0.1635377 0.1804639 0.7407911 0.06997147 0.4489685
## Is_Parent
## 1 -1.5807814
## 2 0.6323126
## 3 -1.5807814
## 4 0.6323126
## 5 0.6323126
## 6 0.6323126
I’m removing outliers based on age and income criteria and then scaling the data. This is important for many statistical analyses and machine learning algorithms.
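As a quick sanity check on the scaling step (assuming scaled_ds from above), every column should now have mean approximately 0 and standard deviation 1:
# Sanity check: scale() standardizes each column to mean 0 and sd 1
round(colMeans(scaled_ds)[1:3], 10)     # approximately zero
round(apply(scaled_ds, 2, sd)[1:3], 10) # exactly one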
# stats is attached by default in R; loaded explicitly here for clarity
library(stats)
# Performing PCA to reduce dimensions to 3
pca_result <- prcomp(scaled_ds, center = TRUE, scale. = TRUE, rank. = 3)
# Extracting the principal components
PCA_ds <- data.frame(pca_result$x[, 1:3])
colnames(PCA_ds) <- c("col1", "col2", "col3")
# Describing the principal components
print(summary(PCA_ds))
## col1 col2 col3
## Min. :-7.4512 Min. :-4.1938 Min. :-6.74893
## 1st Qu.:-2.3858 1st Qu.:-1.3236 1st Qu.:-0.86378
## Median : 0.7814 Median :-0.1737 Median : 0.05083
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 2.5389 3rd Qu.: 1.2346 3rd Qu.: 0.85352
## Max. : 5.9768 Max. : 6.1668 Max. : 3.62443
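Note that summary(PCA_ds) describes the distribution of the component scores; the proportion of variance each component retains comes from the prcomp object itself:
# Variance explained by the three retained components
summary(pca_result)$importance[, 1:3]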
# Load the scatterplot3d library
library(scatterplot3d)
# Extracting columns for the 3D plot
x <- PCA_ds$col1
y <- PCA_ds$col2
z <- PCA_ds$col3
# Creating a 3D scatter plot
scatterplot3d(x, y, z, color = "maroon", main = "A 3D Projection Of Data In The Reduced Dimension", pch = 16)
PCA is used to reduce the dimensionality of the dataset while retaining most of the important variance. I’m running PCA and then creating visualizations to understand the reduced dimensions.
install.packages("factoextra")
# Load necessary libraries
library(factoextra)
library(cluster)
# Elbow method to determine the optimal number of clusters
fviz_nbclust(PCA_ds, kmeans, method = "wss", k.max = 10) +
geom_vline(xintercept = 4, linetype = 2) +
labs(subtitle = "Elbow Method")
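The elbow at k = 4 is somewhat subjective; a complementary diagnostic (not part of the original pipeline) is the average silhouette width, which factoextra also supports:
# Alternative diagnostic: average silhouette width per candidate k
fviz_nbclust(PCA_ds, kmeans, method = "silhouette", k.max = 10) +
  labs(subtitle = "Silhouette Method")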
# Performing agglomerative (hierarchical) clustering
d <- dist(PCA_ds)            # Compute the Euclidean distance matrix
hc <- hclust(d)              # Build the merge tree (default: complete linkage)
yhat_AC <- cutree(hc, k = 4) # Cut the tree into 4 clusters
# Adding cluster labels to the PCA dataset
PCA_ds$Clusters <- yhat_AC
# Adding the Clusters feature to the original dataframe
data$Clusters <- yhat_AC
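Although the final labels come from the hierarchical solution, a k-means fit on the same PCA scores (with k = 4, per the elbow plot) can be cross-tabulated against it as a rough stability check; this comparison is a sketch, not part of the original pipeline.
# Sketch: compare hierarchical labels with a k-means solution on the same scores
set.seed(123)
km4 <- kmeans(PCA_ds[, c("col1", "col2", "col3")], centers = 4, nstart = 25)
table(Hierarchical = yhat_AC, KMeans = km4$cluster)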
# Load the necessary library
library(scatterplot3d)
# Extracting columns for the 3D plot
x <- PCA_ds$col1
y <- PCA_ds$col2
z <- PCA_ds$col3
clusters <- PCA_ds$Clusters
# Define colors for clusters
colors <- rainbow(length(unique(clusters)))
# Creating a 3D scatter plot with clusters
scatterplot3d(x, y, z, color = colors[clusters], pch = 19, main = "The Plot Of The Clusters")
# Load the ggplot2 library
library(ggplot2)
# Define the color palette
pal <- c("#682F2F", "#B9C0C9", "#9F8A78", "#F3AB60")
# Creating a count plot of the clusters
ggplot(data, aes(x = as.factor(Clusters), fill = as.factor(Clusters))) +
geom_bar() +
scale_fill_manual(values = pal) +
labs(title = "Distribution Of The Clusters", x = "Cluster", y = "Count") +
theme_minimal()
# Load the ggplot2 library
library(ggplot2)
# Define the color palette
pal <- c("#682F2F", "#B9C0C9", "#9F8A78", "#F3AB60")
# Creating a scatter plot of Clusters based on Income and Spent
ggplot(data, aes(x = Spent, y = Income, color = as.factor(Clusters))) +
geom_point() + # Create scatter points
scale_color_manual(values = pal) + # Use specified color palette
labs(title = "Cluster's Profile Based On Income And Spending",
x = "Spent",
y = "Income") +
theme_minimal() +
theme(legend.title = element_blank()) # Hide the legend title
Various plots are created to visualize the clusters formed by the clustering algorithms. This includes 3D scatter plots and distribution plots, providing insights into how the data segments into different clusters.
# Load the necessary library
library(ggplot2)
# Define the color palette
pal <- c("#682F2F", "#B9C0C9", "#9F8A78", "#F3AB60")
# Combined jittered-point and violin plot (a rough analogue of a swarm/boxen plot)
ggplot(data, aes(x = as.factor(Clusters), y = Spent)) +
geom_violin(aes(fill = as.factor(Clusters)), color = NA, scale = "width") + # Violin layer first
geom_point(position = position_jitter(width = 0.2), color = "#CBEDDD", alpha = 0.5) + # Jittered points on top
scale_fill_manual(values = pal) + # Use specified color palette
labs(x = "Clusters", y = "Spent") +
theme_minimal()
# Load necessary library
library(ggplot2)
# Creating a feature for the sum of accepted promotions
data$Total_Promos <- data$AcceptedCmp1 + data$AcceptedCmp2 + data$AcceptedCmp3 + data$AcceptedCmp4 + data$AcceptedCmp5
# Define the color palette
pal <- c("#682F2F", "#B9C0C9", "#9F8A78", "#F3AB60")
# Plotting count of total campaign accepted
ggplot(data, aes(x = as.factor(Total_Promos), fill = as.factor(Clusters))) +
geom_bar(position = "dodge") +
scale_fill_manual(values = pal) +
labs(title = "Count Of Promotion Accepted",
x = "Number Of Total Accepted Promotions",
y = "Count") +
theme_minimal()
# Load the ggplot2 library
library(ggplot2)
# Define the color palette
pal <- c("#682F2F", "#B9C0C9", "#9F8A78", "#F3AB60")
# Box plot of the number of deals purchased per cluster
ggplot(data, aes(x = as.factor(Clusters), y = NumDealsPurchases, fill = as.factor(Clusters))) +
geom_boxplot() +
scale_fill_manual(values = pal) +
labs(title = "Number of Deals Purchased", x = "Clusters", y = "NumDealsPurchases") +
theme_minimal()
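To complement the plots, a compact numeric profile of each cluster can be computed directly; this aggregate call is a base-R sketch over a few key features.
# Sketch: mean profile of each cluster on a few key features
aggregate(cbind(Income, Spent, NumDealsPurchases) ~ Clusters, data = data, FUN = mean)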
## Simplifying Results
Upon applying PCA and clustering, we discovered fascinating patterns:
PCA simplified our complex dataset, highlighting the most influential factors driving customer behavior. Clustering revealed distinct customer segments, from ‘High-Value Patrons’ who thrive on exclusivity to ‘Economical Shoppers’ seeking value. These findings translate into a richer understanding of our diverse customer landscape.
Each cluster tells a story. High-Value Patrons might represent our brand loyalists, suggesting the potential for premium, targeted campaigns. The Economical Shoppers segment, on the other hand, highlights an opportunity to curate value-focused offerings, enhancing their engagement and satisfaction.
# 'Personal' holds the customer-profile variables to plot against spending
Personal <- c("Age", "Customer_For", "Family_Size") # illustrative choice of profile columns
# Define a color palette, ensuring there are enough colors for each cluster
pal <- rainbow(length(unique(data$Clusters)))
# Loop through each variable in 'Personal'
for (i in Personal) {
data_subset <- na.omit(data[, c(i, "Spent", "Clusters")])
# Convert Clusters to a factor if it isn't already
data_subset$Clusters <- as.factor(data_subset$Clusters)
# Updated plot code using modern ggplot2 syntax
p <- ggplot(data_subset, aes(x = .data[[i]], y = Spent, color = as.factor(Clusters))) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", se = FALSE) +
scale_color_manual(values = pal) +
labs(title = paste("Plot of", i, "vs Spent")) +
theme_minimal()
print(p)
}
Our investigation has methodically dissected customer data, revealing distinct segments with unique purchasing patterns and preferences. Through the application of PCA, we distilled the essence of our dataset, identifying key drivers behind customer behavior. Subsequent clustering divided our customers into clearly defined groups, each with its distinct characteristics:
Segment A (High-Value Customers): Showcases robust spending across a variety of products, indicating deep engagement with our brand.
Segment B (Economical Shoppers): Demonstrates a keen eye for value, favoring cost-effective purchases.
Segment C (Periodic Buyers): Exhibits sporadic purchasing habits, indicating untapped potential for increased engagement.
Segment D (Category Enthusiasts): Prefers specific product categories, suggesting opportunities for targeted marketing.
Correlation with Project Goals: Each insight feeds directly into our initial objective of understanding customer dynamics to refine our marketing strategies. The identification of these segments validates our approach and sets the stage for targeted marketing initiatives.
For High-Value Customers (Segment A): Initiate exclusive campaigns. Craft personalized marketing campaigns that cater to their preferences, enhancing brand loyalty and encouraging premium purchases.
For Economical Shoppers (Segment B): Launch value promotions. Develop targeted offers that highlight the value and quality of our products, appealing to their desire for cost-effectiveness.
For Periodic Buyers (Segment C): Enhance engagement. Utilize data insights to create personalized communication strategies, aiming to convert occasional purchases into regular buying patterns.
For Category Enthusiasts (Segment D): Customize product offers. Leverage their interest in specific categories by offering specialized products, bundles, or exclusive previews.