Welcome to our journey through the data behind our marketing campaign, where we delve into customer behavior and preferences using R. Our mission extends beyond mere data manipulation; we aim to craft narratives from numbers, turning statistical outputs into strategic insights. By employing tools like dplyr for data wrangling and ggplot2 and plotly for visual storytelling, we’re set to explore the depths of our data. Through advanced techniques such as PCA for simplifying our data landscape and clustering methods to uncover hidden patterns, our goal is to illuminate the path for data-driven decisions in marketing.
Clustering is a method of unsupervised learning, a type of machine learning in which the algorithm seeks to identify inherent groupings within the data without the guidance of a known outcome variable. In the context of our marketing data, clustering aims to group customers into segments based on similarities in their features. The primary statistical considerations in clustering include:
Algorithm Selection: Different algorithms approach the clustering problem in different ways. I’ve used KMeans and hierarchical clustering (a toy sketch of both follows this list):
KMeans: This algorithm partitions the data into K distinct, non-overlapping subsets (clusters) by minimizing the within-cluster variance (or inertia). The goal is to assign data points to clusters such that the total sum of squared distances from each point to the centroid of its cluster is minimized.
Hierarchical Clustering: This method builds a hierarchy of clusters either in a bottom-up approach (agglomerative) or a top-down approach (divisive). Agglomerative hierarchical clustering starts with each data point as a separate cluster and merges them step by step based on some linkage criteria (like minimum distance).
Number of Clusters: Determining the optimal number of clusters is crucial. Techniques like the Elbow Method (in my script with fviz_nbclust) help estimate the appropriate number of clusters by looking at the percentage of variance explained as a function of the number of clusters.
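To make these mechanics concrete, here is a minimal sketch of both algorithms on synthetic data (not our marketing set); the calls mirror the ones used later in the analysis.
# Toy illustration: two Gaussian blobs, clustered two ways
set.seed(42)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2))
# KMeans: minimizes the total within-cluster sum of squares
km <- kmeans(toy, centers = 2, nstart = 25)
km$tot.withinss # the quantity k-means tries to minimize
# Agglomerative clustering: build the full merge tree, then cut it
hc_toy <- hclust(dist(toy), method = "single") # "single" = minimum-distance linkage
table(cutree(hc_toy, k = 2), km$cluster) # compare the two assignments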
Dimensionality reduction refers to the process of reducing the number of variables under consideration; approaches divide into feature selection and feature extraction.
Principal Component Analysis (PCA): PCA is used for dimensionality reduction in this analysis. It is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The key aspects of PCA include (illustrated in the sketch after this list):
Variance Maximization: PCA aims to capture as much of the variability in the original high-dimensional space as possible in the lower-dimensional representation.
Orthogonality: Each principal component is orthogonal (i.e., uncorrelated) to the others, ensuring that they capture different aspects or features of the data.
Dimensionality Reduction: By reducing the number of dimensions (variables), PCA simplifies the complexity of the data, making it easier to analyze and visualize. This is especially useful in large datasets where multicollinearity (high correlation among variables) might be a concern.
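As a quick, self-contained illustration of these three properties (synthetic data, not our marketing set):
# Toy PCA: two correlated variables plus independent noise
set.seed(1)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.3) # strongly correlated with x1
x3 <- rnorm(200)                # independent noise
toy_pca <- prcomp(cbind(x1, x2, x3), center = TRUE, scale. = TRUE)
summary(toy_pca)         # proportion of variance captured by each component
round(cor(toy_pca$x), 3) # orthogonality: off-diagonal correlations are ~0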
Before diving into the technicalities, let’s outline our voyage. We anticipate uncovering distinct groups within our customer base, revealing not just who our customers are but how they interact with our offerings. These insights will not only validate our marketing intuition but also challenge our assumptions, paving the way for refined, personalized marketing strategies.
# **Loading Data** {.tabset .tabset-fade .tabset-pills}
First, we need to load some libraries and import our data. I'm loading readr, part of the tidyverse suite of data-manipulation tools in R, along with a few helper packages, and then reading the CSV file "marketing_campaign.csv" into a data frame called data (here with base R's read.csv; readr's read_csv would work equally well).
# Load libraries
library(knitr)     # dynamic report generation
library(gridExtra) # user-level functions for working with "grid" graphics
library(lubridate) # work with dates and times
library(readr)     # tidyverse data import
library(ggplot2)   # grammar-of-graphics plotting
library(reshape2)  # for melting the data
# Read the data
data <- read.csv("marketing_campaign.csv", sep = ",")
print(paste("Number of datapoints:", nrow(data)))
head(data)
str(data)
summary(data)
colSums(is.na(data))
This block prints the number of rows (data points), displays the first few rows, shows the structure (types of columns and a few entries), provides a summary (basic statistical measures), and counts missing values in each column of the dataset.
data <- na.omit(data)
print(paste("The total number of data-points after removing the rows with missing values are:", nrow(data)))
Here, I’m removing rows with missing (NA) values and then printing the number of data points after this removal.
data$Dt_Customer <- dmy(data$Dt_Customer)
newest_date <- max(data$Dt_Customer)
oldest_date <- min(data$Dt_Customer)
print(paste("The newest customer's enrolment date in the records:", newest_date))
print(paste("The oldest customer's enrolment date in the records:", oldest_date))
The lubridate library is used to convert the Dt_Customer column into Date format. I then find and print the newest and oldest customer enrollment dates.
data$Customer_For <- as.numeric(newest_date - data$Dt_Customer)
This calculates the number of days each customer has been enrolled by subtracting their enrollment date from the newest enrollment date.
# Count of each category in 'Marital_Status'
cat_marital_status <- table(data$Marital_Status)
print("Total categories in the feature Marital_Status:")
## [1] "Total categories in the feature Marital_Status:"
print(cat_marital_status)
##
## Absurd Alone Divorced Married Single Together Widow YOLO
## 2 3 232 857 471 573 76 2
# Count of each category in 'Education'
cat_education <- table(data$Education)
print("Total categories in the feature Education:")
## [1] "Total categories in the feature Education:"
print(cat_education)
##
## 2n Cycle Basic Graduation Master PhD
## 200 54 1116 365 481
# Age of customer today
data$Age <- 2021 - data$Year_Birth
# Total spendings on various items
data$Spent <- data$MntWines + data$MntFruits + data$MntMeatProducts + data$MntFishProducts + data$MntSweetProducts + data$MntGoldProds
# Deriving living situation by marital status
data$Living_With <- data$Marital_Status
data$Living_With[data$Marital_Status %in% c("Married", "Together")] <- "Partner"
data$Living_With[data$Marital_Status %in% c("Absurd", "Widow", "YOLO", "Divorced", "Single")] <- "Alone"
# Feature indicating total children living in the household
data$Children <- data$Kidhome + data$Teenhome
# Feature for total members in the household
data$Family_Size <- ifelse(data$Living_With == "Alone", 1, 2) + data$Children
# Feature pertaining parenthood
data$Is_Parent <- ifelse(data$Children > 0, 1, 0)
# Segmenting education levels into three groups
data$Education[data$Education %in% c("Basic", "2n Cycle")] <- "Undergraduate"
data$Education[data$Education == "Graduation"] <- "Graduate"
data$Education[data$Education %in% c("Master", "PhD")] <- "Postgraduate"
# Renaming columns for clarity
colnames(data)[colnames(data) == "MntWines"] <- "Wines"
colnames(data)[colnames(data) == "MntFruits"] <- "Fruits"
colnames(data)[colnames(data) == "MntMeatProducts"] <- "Meat"
colnames(data)[colnames(data) == "MntFishProducts"] <- "Fish"
colnames(data)[colnames(data) == "MntSweetProducts"] <- "Sweets"
colnames(data)[colnames(data) == "MntGoldProds"] <- "Gold"
# Dropping some of the redundant features
data <- data[, !(names(data) %in% c("Marital_Status", "Dt_Customer", "Z_CostContact", "Z_Revenue", "Year_Birth", "ID"))]
# Get a summary of the dataset
summary(data)
## Education Income Kidhome Teenhome
## Length:2216 Min. : 1730 Min. :0.0000 Min. :0.0000
## Class :character 1st Qu.: 35303 1st Qu.:0.0000 1st Qu.:0.0000
## Mode :character Median : 51382 Median :0.0000 Median :0.0000
## Mean : 52247 Mean :0.4418 Mean :0.5054
## 3rd Qu.: 68522 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :666666 Max. :2.0000 Max. :2.0000
## Recency Wines Fruits Meat
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.:24.00 1st Qu.: 24.0 1st Qu.: 2.00 1st Qu.: 16.0
## Median :49.00 Median : 174.5 Median : 8.00 Median : 68.0
## Mean :49.01 Mean : 305.1 Mean : 26.36 Mean : 167.0
## 3rd Qu.:74.00 3rd Qu.: 505.0 3rd Qu.: 33.00 3rd Qu.: 232.2
## Max. :99.00 Max. :1493.0 Max. :199.00 Max. :1725.0
## Fish Sweets Gold NumDealsPurchases
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000
## Median : 12.00 Median : 8.00 Median : 24.50 Median : 2.000
## Mean : 37.64 Mean : 27.03 Mean : 43.97 Mean : 2.324
## 3rd Qu.: 50.00 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 3.000
## Max. :259.00 Max. :262.00 Max. :321.00 Max. :15.000
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 3.000 1st Qu.: 3.000
## Median : 4.000 Median : 2.000 Median : 5.000 Median : 6.000
## Mean : 4.085 Mean : 2.671 Mean : 5.801 Mean : 5.319
## 3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 8.000 3rd Qu.: 7.000
## Max. :27.000 Max. :28.000 Max. :13.000 Max. :20.000
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.07356 Mean :0.07401 Mean :0.0731 Mean :0.06408
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.00000
## AcceptedCmp2 Complain Response Customer_For
## Min. :0.00000 Min. :0.000000 Min. :0.0000 Min. : 0.0
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:180.0
## Median :0.00000 Median :0.000000 Median :0.0000 Median :355.5
## Mean :0.01354 Mean :0.009477 Mean :0.1503 Mean :353.5
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:529.0
## Max. :1.00000 Max. :1.000000 Max. :1.0000 Max. :699.0
## Age Spent Living_With Children
## Min. : 25.00 Min. : 5.0 Length:2216 Min. :0.0000
## 1st Qu.: 44.00 1st Qu.: 69.0 Class :character 1st Qu.:0.0000
## Median : 51.00 Median : 396.5 Mode :character Median :1.0000
## Mean : 52.18 Mean : 607.1 Mean :0.9472
## 3rd Qu.: 62.00 3rd Qu.:1048.0 3rd Qu.:1.0000
## Max. :128.00 Max. :2525.0 Max. :3.0000
## Family_Size Is_Parent
## Min. :1.000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:0.0000
## Median :3.000 Median :1.0000
## Mean :2.593 Mean :0.7144
## 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :5.000 Max. :1.0000
This section includes a variety of data transformations: deriving new features like age, total spendings, living situation, family size, and parenthood status; segmenting education levels; renaming columns for clarity; and dropping redundant features.
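For comparison, here is a sketch of the same derivations written with dplyr (mentioned in the introduction but not otherwise used here). It assumes the raw column names, so it re-reads the file into a hypothetical raw data frame rather than reusing the renamed data.
# Hypothetical dplyr version of part of the feature engineering (raw columns assumed)
library(dplyr)
raw <- read.csv("marketing_campaign.csv", sep = ",")
raw <- raw %>%
  mutate(Age       = 2021 - Year_Birth,
         Spent     = MntWines + MntFruits + MntMeatProducts +
                     MntFishProducts + MntSweetProducts + MntGoldProds,
         Children  = Kidhome + Teenhome,
         Is_Parent = as.integer(Children > 0))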
options(repos = c(CRAN = "https://cloud.r-project.org"))
# Install GGally only if it is not already available
if (!requireNamespace("GGally", quietly = TRUE)) install.packages("GGally")
# Load the packages
library(ggplot2)
library(GGally)
# Setting up color preferences
pal <- c("#682F2F", "#F3AB60")
# Ensure 'Is_Parent' is a factor
data$Is_Parent <- as.factor(data$Is_Parent)
# Select features to plot
to_plot <- data[, c("Income", "Recency", "Customer_For", "Age", "Spent", "Is_Parent")]
# Creating the pair plot
ggpairs(to_plot, aes(color = Is_Parent),
upper = list(continuous = "points", combo = "box"),
lower = list(continuous = "points", combo = "facetdensity"),
diag = list(continuous = "barDiag")) +
scale_colour_manual(values = pal)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Here, I load ggplot2 and GGally for data visualization. The subsequent code creates pair plots, heatmaps, and various other plots to explore the relationships and distributions in the data.
This is a pair plot (also known as a scatterplot matrix): a matrix of scatterplots used to understand the pairwise relationships between different variables in the data.
Diagonal: The diagonal of the matrix (from top left to bottom right) shows the distribution of each variable: histograms for continuous variables and bar charts for categorical variables such as Is_Parent. These distribution plots allow us to see the spread of each variable, identify modes, and look for signs of skewness.
Lower Triangle: The plots below the diagonal are scatterplots, each representing the relationship between a pair of variables. For example, the scatterplot at the intersection of the Income row and the Recency column shows the relationship between these two variables. These plots can help identify correlations or potential patterns of association between variables.
Upper Triangle: The plots above the diagonal typically mirror the lower triangle but may include additional layers of information or different types of visual representations, like density plots or correlation coefficients. However, in my plot, the upper triangle seems to replicate the lower one without additional information.
Color and Size Coding: The points in the scatterplots may be color-coded or size-coded (or both) to represent additional variables. In my plot, color indicates whether a customer is a parent (Is_Parent), with one color for ‘1’ and another for ‘0’.
Interpretation: By examining the scatterplots, we can look for trends, clusters, or outliers. For example, if there’s a linear trend in the scatterplots, this suggests a positive or negative correlation. If the points form distinct groups, this might indicate that the data clusters naturally along those dimensions.
Outliers: Any points that fall far away from the main cloud of points in the scatterplots are outliers. Outliers can be of particular interest because they may represent errors, unusual cases, or important variations in the data.
Box Plots: On the far right are box plots of each continuous variable against the categorical variable Is_Parent. These show the distribution’s central tendency, dispersion, and skewness, and they can also highlight outliers (shown as individual points outside the whiskers of the box plot).
# Removing outliers
data <- subset(data, Age < 90)
data <- subset(data, Income < 600000)
# Print the total number of data-points after removing outliers
print(paste("The total number of data-points after removing the outliers are:", nrow(data)))
## [1] "The total number of data-points after removing the outliers are: 2212"
# Select only numeric columns for the correlation matrix
numeric_data <- data[sapply(data, is.numeric)]
# Compute the correlation matrix
corrmat <- cor(numeric_data, use = "complete.obs")
# Melt the correlation matrix
melted_corr <- melt(corrmat)
# Create the heatmap
ggplot(melted_corr, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1), space = "Lab", name="Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1),
axis.text.y = element_text(size = 12)) +
labs(x = "", y = "", title = "Correlation Matrix", subtitle = "Heatmap Representation")
# Get list of categorical variables
object_cols <- names(data)[sapply(data, function(x) class(x) %in% c("factor", "character"))]
# Print the categorical variables
print("Categorical variables in the dataset:")
## [1] "Categorical variables in the dataset:"
print(object_cols)
## [1] "Education" "Living_With" "Is_Parent"
# Label Encoding the categorical variables
for (col in object_cols) {
data[[col]] <- as.numeric(as.factor(data[[col]]))
}
print("All features are now numerical")
## [1] "All features are now numerical"
# Creating a copy of data
ds <- data
# Dropping specified columns
cols_del <- c('AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Complain', 'Response')
ds <- ds[, !(names(ds) %in% cols_del)]
# Scaling the data
scaled_ds <- as.data.frame(scale(ds))
print("All features are now scaled")
## [1] "All features are now scaled"
# Display the first few rows of the scaled data frame
print("Dataframe to be used for further modelling:")
## [1] "Dataframe to be used for further modelling:"
head(scaled_ds)
## Education Income Kidhome Teenhome Recency Wines Fruits
## 1 -0.8933842 0.2870400 -0.8225675 -0.9294885 0.3102831 0.9774386 1.5516896
## 2 -0.8933842 -0.2608231 1.0397860 0.9078918 -0.3807274 -0.8724207 -0.6373172
## 3 -0.8933842 0.9129900 -0.8225675 -0.9294885 -0.7953337 0.3578543 0.5704107
## 4 -0.8933842 -1.1758481 1.0397860 -0.9294885 -0.7953337 -0.8724207 -0.5618342
## 5 0.5715275 0.2942401 1.0397860 -0.9294885 1.5541019 -0.3921688 0.4194448
## 6 0.5715275 0.4902705 -0.8225675 0.9078918 -1.1408389 0.6365191 0.3942838
## Meat Fish Sweets Gold NumDealsPurchases
## 1 1.6899111 2.4529173 1.483377108 0.85238280 0.3509506
## 2 -0.7180674 -0.6508565 -0.633875295 -0.73347657 -0.1686630
## 3 -0.1785018 1.3392102 -0.147150604 -0.03724563 -0.6882765
## 4 -0.6556383 -0.5047966 -0.585202826 -0.75281632 -0.1686630
## 5 -0.2186348 0.1524732 -0.001133197 -0.55941883 1.3901777
## 6 -0.3078192 -0.6873715 0.363910321 -0.57875858 -0.1686630
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## 1 1.4265420 2.5030413 -0.5556886 0.6920240
## 2 -1.1261653 -0.5712104 -1.1708955 -0.1325152
## 3 1.4265420 -0.2296269 1.2899319 -0.5447849
## 4 -0.7614928 -0.9127940 -0.5556886 0.2797544
## 5 0.3325246 0.1119566 0.0595182 -0.1325152
## 6 0.6971970 0.4535402 1.2899319 0.2797544
## Customer_For Age Spent Living_With Children Family_Size
## 1 1.5273754 1.0181218 1.6758664 -1.3492980 -1.26431204 -1.7579612
## 2 -1.1887425 1.2744970 -0.9630792 -1.3492980 1.40425497 0.4489685
## 3 -0.2060017 0.3344545 0.2800468 0.7407911 -1.26431204 -0.6544963
## 4 -1.0603442 -1.2892552 -0.9199266 0.7407911 0.06997147 0.4489685
## 5 -0.9516995 -1.0328800 -0.3074921 0.7407911 0.06997147 0.4489685
## 6 -0.2998312 0.1635377 0.1804639 0.7407911 0.06997147 0.4489685
## Is_Parent
## 1 -1.5807814
## 2 0.6323126
## 3 -1.5807814
## 4 0.6323126
## 5 0.6323126
## 6 0.6323126
I’m removing outliers based on age and income criteria and then scaling the data. This is important for many statistical analyses and machine learning algorithms.
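As a quick sanity check on the scaling step (assuming scaled_ds from above), every column should now have mean approximately 0 and standard deviation 1:
# Sanity check: scale() standardizes each column to mean 0 and sd 1
round(colMeans(scaled_ds)[1:3], 10)     # approximately zero
round(apply(scaled_ds, 2, sd)[1:3], 10) # exactly one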
# stats is attached by default in R; loaded explicitly here for clarity
library(stats)
# Performing PCA to reduce dimensions to 3
pca_result <- prcomp(scaled_ds, center = TRUE, scale. = TRUE, rank. = 3)
# Extracting the principal components
PCA_ds <- data.frame(pca_result$x[, 1:3])
colnames(PCA_ds) <- c("col1", "col2", "col3")
# Describing the principal components
print(summary(PCA_ds))
## col1 col2 col3
## Min. :-7.4512 Min. :-4.1938 Min. :-6.74893
## 1st Qu.:-2.3858 1st Qu.:-1.3236 1st Qu.:-0.86378
## Median : 0.7814 Median :-0.1737 Median : 0.05083
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 2.5389 3rd Qu.: 1.2346 3rd Qu.: 0.85352
## Max. : 5.9768 Max. : 6.1668 Max. : 3.62443
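Note that summary(PCA_ds) describes the distribution of the component scores; the proportion of variance each component retains comes from the prcomp object itself:
# Variance explained by the three retained components
summary(pca_result)$importance[, 1:3]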
# Load the scatterplot3d library
library(scatterplot3d)
# Extracting columns for the 3D plot
x <- PCA_ds$col1
y <- PCA_ds$col2
z <- PCA_ds$col3
# Creating a 3D scatter plot
scatterplot3d(x, y, z, color = "maroon", main = "A 3D Projection Of Data In The Reduced Dimension", pch = 16)
PCA is used to reduce the dimensionality of the dataset while retaining most of the important variance. I’m running PCA and then creating visualizations to understand the reduced dimensions.
install.packages("factoextra")
# Load necessary libraries
library(factoextra)
library(cluster)
# Elbow method to determine the optimal number of clusters
fviz_nbclust(PCA_ds, kmeans, method = "wss", k.max = 10) +
geom_vline(xintercept = 4, linetype = 2) +
labs(subtitle = "Elbow Method")
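The elbow at k = 4 is somewhat subjective; a complementary diagnostic (not part of the original pipeline) is the average silhouette width, which factoextra also supports:
# Alternative diagnostic: average silhouette width per candidate k
fviz_nbclust(PCA_ds, kmeans, method = "silhouette", k.max = 10) +
  labs(subtitle = "Silhouette Method")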
# Performing agglomerative (hierarchical) clustering
d <- dist(PCA_ds)            # Compute the Euclidean distance matrix
hc <- hclust(d)              # Build the merge tree (default: complete linkage)
yhat_AC <- cutree(hc, k = 4) # Cut the tree into 4 clusters
# Adding cluster labels to the PCA dataset
PCA_ds$Clusters <- yhat_AC
# Adding the Clusters feature to the original dataframe
data$Clusters <- yhat_AC
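Although the final labels come from the hierarchical solution, a k-means fit on the same PCA scores (with k = 4, per the elbow plot) can be cross-tabulated against it as a rough stability check; this comparison is a sketch, not part of the original pipeline.
# Sketch: compare hierarchical labels with a k-means solution on the same scores
set.seed(123)
km4 <- kmeans(PCA_ds[, c("col1", "col2", "col3")], centers = 4, nstart = 25)
table(Hierarchical = yhat_AC, KMeans = km4$cluster)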
# Load the necessary library
library(scatterplot3d)
# Extracting columns for the 3D plot
x <- PCA_ds$col1
y <- PCA_ds$col2
z <- PCA_ds$col3
clusters <- PCA_ds$Clusters
# Define colors for clusters
colors <- rainbow(length(unique(clusters)))
# Creating a 3D scatter plot with clusters
scatterplot3d(x, y, z, color = colors[clusters], pch = 19, main = "The Plot Of The Clusters")
# Load the ggplot2 library
library(ggplot2)
# Define the color palette
pal <- c("#682F2F", "#B9C0C9", "#9F8A78", "#F3AB60")
# Creating a count plot of the clusters
ggplot(data, aes(x = as.factor(Clusters), fill = as.factor(Clusters))) +
geom_bar() +
scale_fill_manual(values = pal) +
labs(title = "Distribution Of The Clusters", x = "Cluster", y = "Count") +
theme_minimal()
# Load the ggplot2 library
library(ggplot2)
# Define the color palette
pal <- c("#682F2F", "#B9C0C9", "#9F8A78", "#F3AB60")
# Creating a scatter plot of Clusters based on Income and Spent
ggplot(data, aes(x = Spent, y = Income, color = as.factor(Clusters))) +
geom_point() + # Create scatter points
scale_color_manual(values = pal) + # Use specified color palette
labs(title = "Cluster's Profile Based On Income And Spending",
x = "Spent",
y = "Income") +
theme_minimal() +
theme(legend.title = element_blank()) # Hide the legend title
Various plots are created to visualize the clusters formed by the clustering algorithms. This includes 3D scatter plots and distribution plots, providing insights into how the data segments into different clusters.
# Load the necessary library
library(ggplot2)
# Define the color palette
pal <- c("#682F2F", "#B9C0C9", "#9F8A78", "#F3AB60")
# Combined jittered-point and violin plot (a rough analogue of a swarm/boxen plot)
ggplot(data, aes(x = as.factor(Clusters), y = Spent)) +
geom_violin(aes(fill = as.factor(Clusters)), color = NA, scale = "width") + # Violin layer first
geom_point(position = position_jitter(width = 0.2), color = "#CBEDDD", alpha = 0.5) + # Jittered points on top
scale_fill_manual(values = pal) + # Use specified color palette
labs(x = "Clusters", y = "Spent") +
theme_minimal()
# Load necessary library
library(ggplot2)
# Creating a feature for the sum of accepted promotions
data$Total_Promos <- data$AcceptedCmp1 + data$AcceptedCmp2 + data$AcceptedCmp3 + data$AcceptedCmp4 + data$AcceptedCmp5
# Define the color palette
pal <- c("#682F2F", "#B9C0C9", "#9F8A78", "#F3AB60")
# Plotting count of total campaign accepted
ggplot(data, aes(x = as.factor(Total_Promos), fill = as.factor(Clusters))) +
geom_bar(position = "dodge") +
scale_fill_manual(values = pal) +
labs(title = "Count Of Promotion Accepted",
x = "Number Of Total Accepted Promotions",
y = "Count") +
theme_minimal()
# Load the ggplot2 library
library(ggplot2)
# Define the color palette
pal <- c("#682F2F", "#B9C0C9", "#9F8A78", "#F3AB60")
# Box plot of the number of deals purchased per cluster
ggplot(data, aes(x = as.factor(Clusters), y = NumDealsPurchases, fill = as.factor(Clusters))) +
geom_boxplot() +
scale_fill_manual(values = pal) +
labs(title = "Number of Deals Purchased", x = "Clusters", y = "NumDealsPurchases") +
theme_minimal()
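To complement the plots, a compact numeric profile of each cluster can be computed directly; this aggregate call is a base-R sketch over a few key features.
# Sketch: mean profile of each cluster on a few key features
aggregate(cbind(Income, Spent, NumDealsPurchases) ~ Clusters, data = data, FUN = mean)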
## Simplifying Results
Upon applying PCA and clustering, we discovered fascinating patterns:
PCA simplified our complex dataset, highlighting the most influential factors driving customer behavior. Clustering revealed distinct customer segments, from ‘High-Value Patrons’ who thrive on exclusivity to ‘Economical Shoppers’ seeking value. These findings translate into a richer understanding of our diverse customer landscape.
Each cluster tells a story. High-Value Patrons might represent our brand loyalists, suggesting the potential for premium, targeted campaigns. The Economical Shoppers segment, on the other hand, highlights an opportunity to curate value-focused offerings, enhancing their engagement and satisfaction.
# 'Personal' holds the customer-profile variables to plot against spending
Personal <- c("Age", "Customer_For", "Family_Size") # illustrative choice of profile columns
# Define a color palette, ensuring there are enough colors for each cluster
pal <- rainbow(length(unique(data$Clusters)))
# Loop through each variable in 'Personal'
for (i in Personal) {
data_subset <- na.omit(data[, c(i, "Spent", "Clusters")])
# Convert Clusters to a factor if it isn't already
data_subset$Clusters <- as.factor(data_subset$Clusters)
# Updated plot code using modern ggplot2 syntax
p <- ggplot(data_subset, aes(x = .data[[i]], y = Spent, color = as.factor(Clusters))) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", se = FALSE) +
scale_color_manual(values = pal) +
labs(title = paste("Plot of", i, "vs Spent")) +
theme_minimal()
print(p)
}
Our investigation has methodically dissected customer data, revealing distinct segments with unique purchasing patterns and preferences. Through the application of PCA, we distilled the essence of our dataset, identifying key drivers behind customer behavior. Subsequent clustering divided our customers into clearly defined groups, each with its distinct characteristics:
Segment A (High-Value Customers): Showcases robust spending across a variety of products, indicating deep engagement with our brand.
Segment B (Economical Shoppers): Demonstrates a keen eye for value, favoring cost-effective purchases.
Segment C (Periodic Buyers): Exhibits sporadic purchasing habits, indicating untapped potential for increased engagement.
Segment D (Category Enthusiasts): Prefers specific product categories, suggesting opportunities for targeted marketing.
Correlation with Project Goals: Each insight feeds directly into our initial objective of understanding customer dynamics to refine our marketing strategies. The identification of these segments validates our approach and sets the stage for targeted marketing initiatives.
For High-Value Customers (Segment A): Initiate exclusive campaigns. Craft personalized marketing campaigns that cater to their preferences, enhancing brand loyalty and encouraging premium purchases.
For Economical Shoppers (Segment B): Launch value promotions. Develop targeted offers that highlight the value and quality of our products, appealing to their desire for cost-effectiveness.
For Periodic Buyers (Segment C): Enhance engagement. Utilize data insights to create personalized communication strategies, aiming to convert occasional purchases into regular buying patterns.
For Category Enthusiasts (Segment D): Customize product offers. Leverage their interest in specific categories by offering specialized products, bundles, or exclusive previews.