Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business better understand its customers and makes it easier to tailor products to the specific needs, behaviors, and concerns of different types of customers.
Customer personality analysis helps a business adapt its product to its target customers across different customer segments. For example, instead of spending money marketing a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market it only to that segment.
=======================================================
The first code cell sets up the environment: it checks whether the required packages are installed, installs any that are missing, and loads the necessary libraries so that the subsequent code runs without errors caused by missing dependencies.
# Install necessary packages if they are not already installed
if (!require("data.table")) install.packages("data.table")
Loading required package: data.table
Warning: package ‘data.table’ was built under R version 4.3.2
Registered S3 method overwritten by 'data.table':
method from
print.data.table
data.table 1.14.8 using 10 threads (see ?getDTthreads). Latest news: r-datatable.com
if (!require("dplyr")) install.packages("dplyr")
Loading required package: dplyr
Warning: package ‘dplyr’ was built under R version 4.3.2
Attaching package: ‘dplyr’
The following objects are masked from ‘package:data.table’:
between, first, last
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
# Load the necessary libraries
library(data.table)
library(dplyr)
This cell loads the dataset from a CSV file that uses a tab character ("\t") as the separator, then displays the first few rows and the structure of the dataframe for a preliminary inspection of the data.
# Load the dataframe, separator here is the tabulation "\t"
df <- fread('marketing_campaign.csv', sep = '\t')
# View the first few rows of the dataframe
head(df)
# Display information about the dataframe
str(df)
Classes ‘data.table’ and 'data.frame': 2240 obs. of 29 variables:
$ ID : int 5524 2174 4141 6182 5324 7446 965 6177 4855 5899 ...
$ Year_Birth : int 1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
$ Education : chr "Graduation" "Graduation" "Graduation" "Graduation" ...
$ Marital_Status : chr "Single" "Single" "Together" "Together" ...
$ Income : int 58138 46344 71613 26646 58293 62513 55635 33454 30351 5648 ...
$ Kidhome : int 0 1 0 1 1 0 0 1 1 1 ...
$ Teenhome : int 0 1 0 0 0 1 1 0 0 1 ...
$ Dt_Customer : chr "04-09-2012" "08-03-2014" "21-08-2013" "10-02-2014" ...
$ Recency : int 58 38 26 26 94 16 34 32 19 68 ...
$ MntWines : int 635 11 426 11 173 520 235 76 14 28 ...
$ MntFruits : int 88 1 49 4 43 42 65 10 0 0 ...
$ MntMeatProducts : int 546 6 127 20 118 98 164 56 24 6 ...
$ MntFishProducts : int 172 2 111 10 46 0 50 3 3 1 ...
$ MntSweetProducts : int 88 1 21 3 27 42 49 1 3 1 ...
$ MntGoldProds : int 88 6 42 5 15 14 27 23 2 13 ...
$ NumDealsPurchases : int 3 2 1 2 5 2 4 2 1 1 ...
$ NumWebPurchases : int 8 1 8 2 5 6 7 4 3 1 ...
$ NumCatalogPurchases: int 10 1 2 0 3 4 3 0 0 0 ...
$ NumStorePurchases : int 4 2 10 4 6 10 7 4 2 0 ...
$ NumWebVisitsMonth : int 7 5 4 6 5 6 6 8 9 20 ...
$ AcceptedCmp3 : int 0 0 0 0 0 0 0 0 0 1 ...
$ AcceptedCmp4 : int 0 0 0 0 0 0 0 0 0 0 ...
$ AcceptedCmp5 : int 0 0 0 0 0 0 0 0 0 0 ...
$ AcceptedCmp1 : int 0 0 0 0 0 0 0 0 0 0 ...
$ AcceptedCmp2 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Complain : int 0 0 0 0 0 0 0 0 0 0 ...
$ Z_CostContact : int 3 3 3 3 3 3 3 3 3 3 ...
$ Z_Revenue : int 11 11 11 11 11 11 11 11 11 11 ...
$ Response : int 1 0 0 0 0 0 0 0 1 0 ...
- attr(*, ".internal.selfref")=<externalptr>
The next cell defines and calls a helper that reports, for each column, the number of missing values as a percentage of the non-missing values. This is critical for understanding data completeness and for deciding whether data cleaning or imputation is needed.
# Define a function that reports, per column, the missing values
# expressed as a percentage of the non-missing values
per_missing <- function(df) {
  missing_values <- sapply(df, function(x) sum(is.na(x)))
  non_missing_values <- sapply(df, function(x) sum(!is.na(x)))
  per_miss <- missing_values / non_missing_values * 100
  return(per_miss)
}
# Print the percentage of missing values for the first 10 columns
print(head(per_missing(df), 10))
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines
0.000000 0.000000 0.000000 0.000000 1.083032 0.000000 0.000000 0.000000 0.000000 0.000000
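Note that per_missing() reports missing counts relative to the non-missing counts. As a quick side check (not used further in the analysis), the share of missing values relative to the total number of rows can be obtained with a single line of base R:
# Share of missing values per column, relative to the total row count
round(colMeans(is.na(df)) * 100, 2)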
Here, missing values in the ‘Income’ column are imputed with the median income; the median is chosen because it is less sensitive to outliers than the mean. The cell then counts the unique values in ‘Z_CostContact’ and ‘Z_Revenue’; since each of these columns holds a single constant value, they are dropped as they carry no information for the analysis.
# Fill the missing (NA) values in 'Income' with the median
df$Income[is.na(df$Income)] <- median(df$Income, na.rm = TRUE)
# Print the percentage of missing values for the first 10 columns
print(head(per_missing(df), 10))
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines
0 0 0 0 0 0 0 0 0 0
# Get the value counts for Z_CostContact and Z_Revenue
Z_CostContact_values <- table(df$Z_CostContact)
Z_Revenue_values <- table(df$Z_Revenue)
# Print the value counts
print(Z_CostContact_values)
3
2240
print(Z_Revenue_values)
11
2240
# Drop the 'Z_CostContact' and 'Z_Revenue' columns since they are of no use for the segmentation
df <- df %>% select(-c(Z_CostContact, Z_Revenue))
This function call generates bar plots for each categorical variable in the dataset, providing a visual frequency distribution which aids in understanding the data composition.
# Load necessary libraries for data manipulation and plotting
if (!require("ggplot2")) install.packages("ggplot2")
Loading required package: ggplot2
Warning: package ‘ggplot2’ was built under R version 4.3.2
library(ggplot2)
unique_cat_var <- function(df) {
  # Identify the character columns, excluding the date column
  col <- names(df)[sapply(df, is.character) & names(df) != 'Dt_Customer']
  # Print the columns that will be plotted
  print(paste("Plotting for columns:", paste(col, collapse = ", ")))
  for (i in col) {
    # Create the plot (tidy evaluation instead of the deprecated aes_string())
    p <- ggplot(data = df, aes(x = .data[[i]])) +
      geom_bar(aes(y = after_stat(count / sum(count) * 100))) +
      ylab("Frequency of Occurrence (%)") +
      xlab(i) +
      theme_minimal() +
      ggtitle(paste("Distribution of", i))
    # Print the plot
    print(p)
  }
}
# Call the function
unique_cat_var(df)
[1] "Plotting for columns: Education, Marital_Status"
The ‘Dt_Customer’ column is converted from character to Date format to facilitate any operations that require date arithmetic.
# Converting 'Dt_Customer' to Date format
df$Dt_Customer <- as.Date(df$Dt_Customer, format="%d-%m-%Y")
# Display the first few rows of the dataframe
head(df)
These visualizations are used to detect outliers and understand the distribution and relationships between different numerical variables.
# Load the ggplot2 library
if (!require("ggplot2")) install.packages("ggplot2")
library(ggplot2)
# Create the scatter plot
ggplot(df, aes(x = Year_Birth, y = Income)) +
geom_point() + # Add points
geom_vline(xintercept = 1910, linetype = "solid") + # Vertical line at x = 1910
geom_hline(yintercept = 150000, color = "red", linetype = "solid") + # Horizontal line at y = 150000
theme_minimal() # Optional: Adds a minimalistic theme
# Load necessary libraries
if (!require("ggplot2")) install.packages("ggplot2")
if (!require("dplyr")) install.packages("dplyr")
library(ggplot2)
library(dplyr)
# Boxplot for 'Income'
ggplot(df, aes(y = Income)) +
geom_boxplot() +
coord_flip() +
theme_minimal()
# Boxplot for 'Year_Birth'
ggplot(df, aes(y = Year_Birth)) +
geom_boxplot() +
coord_flip() +
theme_minimal()
Outliers are removed from the data based on domain knowledge, which can improve the performance of clustering algorithms by reducing noise.
# Dropping the outliers by setting a cap on Age and Income
df <- df %>% filter(Year_Birth > 1910, Income < 150000)
# Print the total number of data points after removing the outliers
print(paste("The total number of data-points after removing the outliers are:", nrow(df)))
[1] "The total number of data-points after removing the outliers are: 2229"
# Assuming df$Dt_Customer is already converted to Date type in R
# Print the maximum date in 'Dt_Customer'
print(max(df$Dt_Customer))
[1] "2014-06-29"
# Print a specific date
print(as.Date("2014-10-04"))
[1] "2014-10-04"
This cell includes comprehensive feature engineering, which involves creating new features that may better capture the underlying patterns and relationships for the clustering task.
# Load necessary packages
if (!require("dplyr")) install.packages("dplyr")
if (!require("lubridate")) install.packages("lubridate")
Loading required package: lubridate
Warning: package ‘lubridate’ was built under R version 4.3.2
Attaching package: ‘lubridate’
The following objects are masked from ‘package:data.table’:
hour, isoweek, mday, minute, month, quarter, second, wday, week, yday, year
The following objects are masked from ‘package:base’:
date, intersect, setdiff, union
library(dplyr)
library(lubridate)
# Create the Age attribute
current_year <- year(max(df$Dt_Customer, na.rm = TRUE))
df <- df %>% mutate(Age = current_year - Year_Birth)
# Create the Total_spent attribute
df <- df %>% mutate(Total_spent = MntWines + MntFruits + MntMeatProducts + MntFishProducts + MntSweetProducts + MntGoldProds)
# Create the Eldership attribute: days between a customer's enrolment date and the most recent enrolment date
last_date <- max(df$Dt_Customer, na.rm = TRUE)
df <- df %>% mutate(Eldership = as.numeric(difftime(last_date, Dt_Customer, units = "days")))
# Remap Marital Status and Education attributes
df$Marital_Status <- ifelse(df$Marital_Status %in% c('Divorced', 'Single', 'Absurd', 'Widow', 'YOLO'), 'Alone', 'In couple')
df$Education <- ifelse(df$Education %in% c('Basic', '2n Cycle'), 'Undergraduate', ifelse(df$Education %in% c('Graduation', 'Master', 'PhD'), 'Postgraduate', df$Education))
# Create Children and Has_Child Attributes
df <- df %>% mutate(Children = Kidhome + Teenhome, Has_child = ifelse(Children > 0, 1, 0))
# Create reduction accepted
df <- df %>% mutate(Promo_Accepted = AcceptedCmp3 + AcceptedCmp4 + AcceptedCmp5 + AcceptedCmp1 + AcceptedCmp2 + Response)
# Display the first few rows of the dataframe
head(df)
Standardization: $z_{ij} = (x_{ij} - \mu_j) / \sigma_j$
Covariance Matrix: $\Sigma = \frac{1}{n-1} X^{\top} X$ (for the standardized data $X$)
Eigen Decomposition: $\Sigma v_i = \lambda_i v_i$
Explained Variance: $\mathrm{var}_i = \lambda_i / \sum_j \lambda_j \times 100$
Cumulative Explained Variance: $\mathrm{cum}_k = \sum_{i=1}^{k} \mathrm{var}_i$
Data Projection: $Y = X V$, where the columns of $V$ are the eigenvectors
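To connect these formulas with the R code further down, here is a minimal sketch, assuming X_scaled is the standardized numeric dataset built later in this notebook; it shows that prcomp reproduces an explicit eigen decomposition of the covariance matrix:
# Covariance matrix of the standardized data
S <- cov(X_scaled)
# Explicit eigen decomposition of the covariance matrix
eig <- eigen(S)
# prcomp's squared standard deviations should match these eigenvalues (up to numerical tolerance)
pca_check <- prcomp(X_scaled, scale. = FALSE)
all.equal(eig$values, unname(pca_check$sdev^2))
# Projecting the standardized data onto the eigenvectors reproduces pca_check$x (up to column signs)
scores <- as.matrix(X_scaled) %*% eig$vectors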
Eigenvalues and eigenvectors:
pca$sdev^2 gives the eigenvalues, which are the variances of the principal components.
pca$rotation contains the eigenvectors; each column is the eigenvector of the corresponding principal component.
Explained variance:
var_exp <- (eig_vals / tot) * 100 computes the percentage of variance explained by each principal component. It is derived from the eigenvalues, which measure the variance along each component.
Cumulative explained variance:
cum_var_exp <- cumsum(var_exp) is the running total of the explained variance; it shows how many principal components are needed to capture most of the variability in the data.
Data projection (not shown explicitly, but implied by the use of prcomp):
pca$x holds the projection of the original data onto the new principal-component axes.
Standardization (mentioned but not repeated, because the data is already scaled):
the scale. = FALSE argument in prcomp(X_scaled, scale. = FALSE) indicates that the data has been pre-scaled, so prcomp should not scale it again.
Visualization, correlation matrix of the PCA components:
cor(PCA_ds) computes the correlation matrix of the principal components; although PCA components are theoretically orthogonal, the heatmap is used to confirm that they are indeed uncorrelated.
# Ensure the dplyr package is installed and loaded
if (!require("dplyr")) install.packages("dplyr")
library(dplyr)
# Copy the dataframe and drop the specified columns
X <- df %>%
select(-c(Dt_Customer, ID, Year_Birth, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5, AcceptedCmp1, AcceptedCmp2, Response, Complain))
# Load necessary libraries
if (!require("dplyr")) install.packages("dplyr")
library(dplyr)
# Convert data.table to dataframe if necessary
X <- as.data.frame(X)
# Select only categorical variables
cat_variables <- names(X)[sapply(X, is.character)]
# Convert categorical variables to factors
X[cat_variables] <- lapply(X[cat_variables], factor)
# Convert factors to numeric (integer encoding)
X[cat_variables] <- lapply(X[cat_variables], as.integer)
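Before scaling, it can be worth confirming that every column is now numeric; a quick optional check:
# Confirm that every column in X is numeric after the integer encoding
all(sapply(X, is.numeric))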
Standardization makes the data compatible for clustering, and the correlation heatmap visualizes the relationships between different variables, which can inform feature selection.
# Load necessary libraries
if (!require("ggplot2")) install.packages("ggplot2")
if (!require("reshape2")) install.packages("reshape2")
Loading required package: reshape2
Warning: package ‘reshape2’ was built under R version 4.3.2
Attaching package: ‘reshape2’
The following objects are masked from ‘package:data.table’:
dcast, melt
library(ggplot2)
library(reshape2)
# Standardize the dataset
X_scaled <- as.data.frame(scale(X))
# Calculate the correlation matrix
corr <- cor(X_scaled, use = "complete.obs")
# Melt the correlation matrix for ggplot2
corr_melted <- melt(corr)
# Plot the heatmap with adjusted label sizes
ggplot(corr_melted, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 1, size = 7, hjust = 1), # Reduced size here
axis.text.y = element_text(size = 7)) + # Reduced size here
coord_fixed()
# Perform PCA
pca <- prcomp(X_scaled, scale. = FALSE) # scale. = FALSE since data is already scaled
# Eigenvalues (squared singular values)
eig_vals <- pca$sdev^2
# Eigenvectors
eig_vecs <- pca$rotation
# Calculation of Explained Variance from the eigenvalues
tot <- sum(eig_vals)
var_exp <- (eig_vals / tot) * 100 # Individual explained variance
cum_var_exp <- cumsum(var_exp) # Cumulative explained variance
# Display results
print("Individual explained variance:")
[1] "Individual explained variance:"
print(var_exp)
[1] 3.563896e+01 1.155031e+01 6.640557e+00 5.315775e+00 4.581099e+00 4.424258e+00 4.300946e+00 3.740996e+00 3.328853e+00 2.986258e+00 2.632113e+00 2.498659e+00 1.974972e+00 1.907933e+00 1.734079e+00
[16] 1.655578e+00 1.386549e+00 1.260457e+00 9.650458e-01 8.523071e-01 6.243027e-01 1.829674e-29 5.492255e-30
print("Cumulative explained variance:")
[1] "Cumulative explained variance:"
print(cum_var_exp)
[1] 35.63896 47.18926 53.82982 59.14560 63.72669 68.15095 72.45190 76.19289 79.52175 82.50801 85.14012 87.63878 89.61375 91.52168 93.25576 94.91134 96.29789 97.55834 98.52339 99.37570
[21] 100.00000 100.00000 100.00000
Principal Component Analysis (PCA) is conducted to reduce dimensionality while retaining most of the variability in the data. This can improve clustering results by focusing on the most informative aspects of the data.
# Load necessary libraries
if (!require("ggplot2")) install.packages("ggplot2")
library(ggplot2)
# Create a data frame for plotting
num_components <- length(var_exp)
df_pca <- data.frame(
Component = 1:num_components,
Individual = var_exp,
Cumulative = cum_var_exp
)
# Create the plot and assign it to an object so it can be printed explicitly
ggplot_object <- ggplot(df_pca, aes(x = Component)) +
geom_bar(aes(y = Individual), stat = "identity", fill = "green", alpha = 0.3333) +
geom_step(aes(y = Cumulative), direction = "mid") +
geom_vline(xintercept = 12, linetype = "dashed") +
geom_hline(yintercept = 90, color = "red", linetype = "dashed") +
labs(x = "Principal components", y = "Explained variance ratio") +
ggtitle("PCA Explained Variance") +
theme_minimal() +
theme(axis.title = element_text(size = 12),
axis.text = element_text(size = 10)) +
scale_x_continuous(breaks = 1:num_components) +
scale_y_continuous(limits = c(0, 100), expand = c(0, 0)) +
geom_text(aes(y = Cumulative, label = ifelse(Cumulative >= 90 & Cumulative < 91, "90% Variance", "")),
vjust = -0.5, hjust = -0.1, size = 3) +
guides(fill = guide_legend(title = "Variance"))
# Display the plot
print(ggplot_object)
# The stats package is part of base R, so no installation is needed
library(stats)
# Performing PCA; prcomp() has no n.components argument,
# so the first 13 principal components are selected from the scores afterwards
pca <- prcomp(X_scaled, center = TRUE, scale. = FALSE)
# Create a new dataset with the first 13 PCA components
PCA_ds <- as.data.frame(pca$x[, 1:13])
names(PCA_ds) <- paste("PCA", 1:13, sep = "")
# Describing the PCA dataset
summary(PCA_ds)
PCA1 PCA2 PCA3 PCA4 PCA5 PCA6 PCA7 PCA8 PCA9 PCA10
Min. :-8.0115 Min. :-5.6331 Min. :-5.643417 Min. :-5.10986 Min. :-3.40925 Min. :-2.90823 Min. :-2.1721 Min. :-3.9927 Min. :-4.23279 Min. :-5.58260
1st Qu.:-2.3517 1st Qu.:-1.2219 1st Qu.:-0.845908 1st Qu.:-0.53992 1st Qu.:-0.66812 1st Qu.:-0.70017 1st Qu.:-0.7953 1st Qu.:-0.3618 1st Qu.:-0.56668 1st Qu.:-0.37981
Median : 0.8391 Median : 0.1812 Median : 0.001538 Median :-0.00627 Median :-0.04246 Median : 0.01233 Median :-0.1122 Median : 0.1852 Median :-0.03753 Median : 0.04629
Mean : 0.0000 Mean : 0.0000 Mean : 0.000000 Mean : 0.00000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
3rd Qu.: 2.6001 3rd Qu.: 1.3322 3rd Qu.: 0.861202 3rd Qu.: 0.55008 3rd Qu.: 0.64841 3rd Qu.: 0.71222 3rd Qu.: 0.7363 3rd Qu.: 0.6353 3rd Qu.: 0.57465 3rd Qu.: 0.46616
Max. : 5.7445 Max. : 4.0049 Max. : 3.366752 Max. : 4.86567 Max. : 3.81089 Max. : 2.79845 Max. : 3.3368 Max. : 2.1438 Max. : 2.83915 Max. : 3.09653
PCA11 PCA12 PCA13
Min. :-5.90052 Min. :-6.73088 Min. :-3.23855
1st Qu.:-0.47450 1st Qu.:-0.36964 1st Qu.:-0.39814
Median : 0.05563 Median : 0.01958 Median :-0.02591
Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
3rd Qu.: 0.51645 3rd Qu.: 0.35428 3rd Qu.: 0.41004
Max. : 2.45427 Max. : 7.34334 Max. : 4.76153
# Load necessary libraries
if (!require("ggplot2")) install.packages("ggplot2")
if (!require("reshape2")) install.packages("reshape2")
library(ggplot2)
library(reshape2)
# Calculate the correlation matrix for PCA components
corr <- cor(PCA_ds)
# Melt the correlation matrix for ggplot2
corr_melted <- melt(corr)
# Plot the heatmap
ggplot(corr_melted, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 10, hjust = 1),
axis.text.y = element_text(size = 10)) +
coord_fixed()
Determining the optimal number of clusters using the Elbow method: the Elbow method plots the total within-cluster sum of squares against each candidate number of clusters. We look for the number of clusters at which the decrease begins to level off (the “elbow point”).
if (!require("factoextra")) install.packages("factoextra")
Loading required package: factoextra
Warning: package ‘factoextra’ was built under R version 4.3.2
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(factoextra)
# Determining the optimal number of clusters using the Elbow method
set.seed(123) # Set seed for reproducibility
fviz_nbclust(PCA_ds, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2) # Adjust the xintercept as needed
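If factoextra is not available, the same elbow curve can be built by hand from the total within-cluster sum of squares returned by kmeans; a minimal sketch using the PCA_ds dataset from above:
# Total within-cluster sum of squares for k = 1 to 10
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(PCA_ds, centers = k, nstart = 10)$tot.withinss)
# Plot the elbow curve with base graphics
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")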
The K-means clustering algorithm is applied to the data, and the resulting cluster assignments are appended to the dataframe. This is a critical step in segmenting the data into distinct groups.
# Perform K-means clustering
set.seed(123) # Setting a seed for reproducibility
kmeans_result <- kmeans(X_scaled, centers = 4)
# Add the cluster assignments to your original dataframe
X$Cluster <- as.factor(kmeans_result$cluster)
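A quick sanity check on the segmentation, before profiling the clusters, is to look at how many customers fall into each one:
# Number of customers assigned to each cluster
table(X$Cluster)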
A scatter plot is created to visualize how the clusters differ in terms of customer income and total spending, providing insights into the customer segmentation.
# Create a scatter plot
ggplot(X, aes(x = Total_spent, y = Income, color = Cluster)) +
geom_point() +
labs(title = "Cluster's Profile Based On Income And Spending", x = "Total Spent", y = "Income") +
theme_minimal() +
scale_color_brewer(palette = "Set1") +
theme(legend.title = element_blank())
This visualization shows how promotional strategies are received by different customer segments, which can guide marketing efforts.
# Load the ggplot2 package
if (!require("ggplot2")) install.packages("ggplot2")
library(ggplot2)
# Create a count plot for Promo_Accepted grouped by Cluster
ggplot(X, aes(x = Promo_Accepted, fill = Cluster)) +
geom_bar(position = "dodge") +
labs(title = "Count Of Promotion Accepted", x = "Number Of Total Accepted Promotions", y = "Count") +
theme_minimal() +
scale_fill_brewer(palette = "Set1") +
theme(legend.title = element_blank())
The final visualization shows the average spending in different product categories for each cluster, indicating the preferences and behaviors within each customer segment.
# Define the product categories
products <- c('MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds')
# Calculate the mean values for each cluster
values <- aggregate(. ~ Cluster, data = X[, c("Cluster", products)], FUN = mean)
# Load necessary libraries
if (!require("ggplot2")) install.packages("ggplot2")
if (!require("reshape2")) install.packages("reshape2")
library(ggplot2)
library(reshape2)
# Reshape the data for plotting
values_melted <- melt(values, id.vars = 'Cluster')
# Create bar plots
ggplot(values_melted, aes(x = Cluster, y = value, fill = Cluster)) +
geom_bar(stat = "identity", position = position_dodge()) +
facet_wrap(~ variable, scales = "free", ncol = 3) +
labs(y = "Mean Value", x = "Cluster") +
theme_minimal() +
scale_fill_brewer(palette = "Set1") +
theme(legend.position = "none")