knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
data <- read.table("./customer_data.csv", header = TRUE, sep = ",")
head(data)
## id age gender income education region loyalty_status purchase_frequency
## 1 1 27 Male 40682 Bachelor East Gold frequent
## 2 2 29 Male 15317 Masters West Regular rare
## 3 3 37 Male 38849 Bachelor West Silver rare
## 4 4 30 Male 11568 HighSchool South Regular frequent
## 5 5 31 Female 46952 College North Regular occasional
## 6 6 38 Male 7347 Bachelor South Silver occasional
## purchase_amount product_category promotion_usage satisfaction_score
## 1 18249 Books 0 6
## 2 4557 Clothing 1 6
## 3 11822 Clothing 0 6
## 4 4098 Food 0 7
## 5 19685 Clothing 1 5
## 6 2822 Electronics 0 5
Each row in the dataset represents a single customer who shops at the given store.
The dataset contains 100000 observations, each corresponding to a unique customer.
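The row count and the uniqueness of customers can be checked directly. The following is a minimal sketch, assuming the `id` column identifies a customer; `toy` is a made-up stand-in for the real data, not part of the original analysis.

```r
# Hypothetical stand-in for the real dataset (illustrative values only)
toy <- data.frame(id = 1:6, age = c(27, 29, 37, 30, 31, 38))

n_rows  <- nrow(toy)                 # number of observations
n_dupes <- sum(duplicated(toy$id))   # repeated ids would mean repeated customers

cat("Rows:", n_rows, "| duplicated ids:", n_dupes, "\n")
```

On the real data, the same two lines applied to `data` should report 100000 rows and, if every customer is unique, zero duplicated ids.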
Variable | Type | Description | Unit |
---|---|---|---|
age | Numerical (Ratio) | The age of the customer. | Years |
gender | Categorical (Nominal) | The gender of the customer. Possible values: Male, Female. | N/A |
income | Numerical (Ratio) | The annual income of the customer. | Currency (e.g., USD, EUR) |
education | Categorical (Ordinal) | The education level of the customer. Possible values: HighSchool, College, Bachelor, Masters. | N/A |
region | Categorical (Nominal) | The region where the customer resides. Possible values: North, South, East, West. | N/A |
loyalty_status | Categorical (Ordinal) | The loyalty status of the customer. Possible values: Regular, Silver, Gold. | N/A |
purchase_frequency | Categorical (Ordinal) | The frequency of purchases made by the customer. Possible values: rare, occasional, frequent. | N/A |
purchase_amount | Numerical (Ratio) | The amount spent by the customer in each purchase. | Currency (e.g., USD, EUR) |
product_category | Categorical (Nominal) | The category of the purchased product. Multiple values possible. | N/A |
promotion_usage | Categorical (Nominal) | Indicates whether the customer used promotional offers (0 for No, 1 for Yes). | Binary (0 or 1) |
satisfaction_score | Numerical (Interval) | The satisfaction score of the customer. | Scale (0 to 10) |
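The possible values listed in the table can be compared against what is actually stored in the data. The following is a minimal sketch using a made-up data frame (`toy`) with two of the categorical columns; the helper names are illustrative.

```r
# Hypothetical stand-in with two categorical columns (illustrative values only)
toy <- data.frame(
  gender = c("Male", "Female", "Male"),
  purchase_frequency = c("frequent", "rare", "occasional"),
  stringsAsFactors = FALSE
)

# Keep only the text columns, then list the sorted distinct values of each
is_text <- vapply(toy, is.character, logical(1))
distinct_values <- lapply(toy[is_text], function(x) sort(unique(x)))

print(distinct_values)
```

Running the same check on the real data makes differences in spelling or capitalization between the documentation and the stored values easy to spot.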
The dataset is obtained from Kaggle: Customer Purchases Behaviour Dataset
I noticed some inconsistencies in the dataset, such as customers as young as 12 years old having an annual income. Although this may simply reflect how the data were defined, it doesn’t seem meaningful. Therefore, I will remove all rows for customers under 18.
library(dplyr)
data <- data %>% filter(age >= 18)
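Before dropping rows, it can help to count how many are affected. The following is a minimal sketch in base R on a made-up data frame (`toy`); it mirrors the `age >= 18` condition used above but is not the original code.

```r
# Hypothetical stand-in for the real dataset (illustrative values only)
toy <- data.frame(
  age    = c(12, 17, 18, 25, 40),
  income = c(8000, 9500, 12000, 30000, 45000)
)

n_minor    <- sum(toy$age < 18)      # rows that the filter would remove
toy_adults <- toy[toy$age >= 18, ]   # base R equivalent of filter(age >= 18)

cat("Removed", n_minor, "of", nrow(toy), "rows\n")
```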
library(dplyr)
library(tidyr)
data <- data %>% drop_na()
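Since `drop_na()` without arguments removes every row that has a missing value in any column, counting the missing values first shows what the call will discard. The following is a minimal sketch on a made-up data frame (`toy`).

```r
# Hypothetical stand-in with a few missing values (illustrative values only)
toy <- data.frame(
  age    = c(27, NA, 37),
  region = c("East", "West", NA),
  income = c(40682, 15317, 38849)
)

na_per_column <- colSums(is.na(toy))        # missing values in each column
n_incomplete  <- sum(!complete.cases(toy))  # rows with at least one NA

print(na_per_column)
cat("Rows with at least one NA:", n_incomplete, "\n")
```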
I chose to rename only one variable since the others seemed appropriate. I changed purchase_amount to money_spent_purchase to better reflect that it represents the amount of money spent per purchase.
library(dplyr)
library(tidyr)
data <- data %>% rename(money_spent_purchase = purchase_amount)
I converted the variable promotion_usage into a factor, as its possible values (0 and 1) correspond to No and Yes, as stated in the dataset’s description.
data$promotion_usage_factor <- factor(data$promotion_usage,
levels = c(0, 1),
labels = c("No", "Yes"))
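A cross-tabulation of the original codes against the new labels is a quick way to confirm the recoding. The following is a minimal sketch on a made-up vector; only the `factor()` call mirrors the code above.

```r
# Hypothetical 0/1 codes (illustrative values only)
promotion_usage <- c(0, 1, 0, 0, 1)

promotion_usage_factor <- factor(promotion_usage,
                                 levels = c(0, 1),
                                 labels = c("No", "Yes"))

# Every 0 should land in "No" and every 1 in "Yes"
check <- table(original = promotion_usage, recoded = promotion_usage_factor)
print(check)
```

Any non-zero count off the diagonal would point to a mislabelled level.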
I created several data frames based on specific conditions, such as gender (female and male) and annual income within a given range.
dataFemale <- data[data$gender == "Female", ]
dataMale <- data[data$gender == "Male", ]
dataLowerIncome <- data[data$income >= 5000 & data$income <= 10000, ]
dataHigherIncome <- data[data$income >= 40000 & data$income <= 50000, ]
library(pastecs)
summary(data[, -c(1, 3, 5, 6, 7, 8, 10, 13)])
## age income money_spent_purchase promotion_usage
## Min. :18.00 Min. : 5000 Min. : 1118 Min. :0.0000
## 1st Qu.:27.00 1st Qu.:16269 1st Qu.: 5583 1st Qu.:0.0000
## Median :30.00 Median :27583 Median : 9450 Median :0.0000
## Mean :30.04 Mean :27512 Mean : 9634 Mean :0.3008
## 3rd Qu.:33.00 3rd Qu.:38742 3rd Qu.:13349 3rd Qu.:1.0000
## Max. :49.00 Max. :50000 Max. :26204 Max. :1.0000
## satisfaction_score
## Min. : 0.00
## 1st Qu.: 4.00
## Median : 5.00
## Mean : 5.01
## 3rd Qu.: 6.00
## Max. :10.00
I also decided to compute the standard deviation and the coefficient of variation, as they are not included in the summary function.
sd_income <- round(sd(data$income), 2)
sd_satisfaction_score <- round(sd(data$satisfaction_score), 2)
cv_income <- round((sd_income / mean(data$income)) * 100, 2)
cv_satisfaction_score <- round((sd_satisfaction_score / mean(data$satisfaction_score)) * 100, 2)
print(paste("Income:", sd_income, ",", cv_income, sep = " "))
## [1] "Income: 12996.06 , 47.24"
print(paste("Satisfaction Score:", sd_satisfaction_score, ",", cv_satisfaction_score, sep = " "))
## [1] "Satisfaction Score: 1.04 , 20.76"
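The coefficient of variation is the standard deviation divided by the mean, expressed here as a percentage. The following is a minimal sketch of a helper (`cv_percent` is a name introduced for illustration), applied to the six income values shown by `head(data)`. Unlike the code above, it does not round the standard deviation before dividing, so the last digit can differ slightly.

```r
# Coefficient of variation in percent: sd relative to the mean
cv_percent <- function(x) {
  sd(x) / mean(x) * 100
}

# The six income values printed by head(data)
income_head <- c(40682, 15317, 38849, 11568, 46952, 7347)

cv_original <- cv_percent(income_head)
cv_rescaled <- cv_percent(income_head / 1000)  # same incomes, in thousands

# The sd changes with the unit of measurement; the CV does not
cat("sd:", sd(income_head), "vs", sd(income_head / 1000), "\n")
cat("CV:", round(cv_original, 2), "vs", round(cv_rescaled, 2), "\n")
```

Because the CV is unaffected by the unit of measurement, it is the appropriate quantity for comparing the spread of variables recorded on different scales.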
The standard deviation of income is 12996.06, while for satisfaction_score it is 1.04. Because the two variables are measured on very different scales, their standard deviations are not directly comparable, so the coefficient of variation is the more informative measure. With income having a CV of 47.24 and satisfaction_score a CV of 20.76, we can confirm that income exhibits greater variability, meaning its values are more dispersed relative to the mean.
Additionally, for the non-numerical variables, I decided to compute the mode.
library(modeest)
mode_gender <- mlv(data$gender)
mode_education <- mlv(data$education)
mode_region <- mlv(data$region)
mode_loyalty_status <- mlv(data$loyalty_status)
mode_purchase_frequency <- mlv(data$purchase_frequency)
mode_product_category <- mlv(data$product_category)
mode_promotion_usage_factor <- mlv(data$promotion_usage_factor)
print(paste("Gender:", mode_gender))
## [1] "Gender: Female"
print(paste("Education:", mode_education))
## [1] "Education: College"
print(paste("Region:", mode_region))
## [1] "Region: East"
print(paste("Loyalty Status:", mode_loyalty_status))
## [1] "Loyalty Status: Regular"
print(paste("Purchase Frequency:", mode_purchase_frequency))
## [1] "Purchase Frequency: rare"
print(paste("Product Category:", mode_product_category))
## [1] "Product Category: Electronics"
print(paste("Promotion Usage:", mode_promotion_usage_factor))
## [1] "Promotion Usage: No"
The most common gender is female, and the most common education level is College. The most frequent region is East, and the most common loyalty status is Regular. In terms of purchasing behavior, customers most often make purchases infrequently (“rare”), and the most popular product category is Electronics. Most customers didn’t use a promotional offer: the mean of promotion_usage is 0.3008, so roughly 30% did.
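The mode only identifies the most common value; it does not say how large its share is. The following is a minimal sketch in base R, on a made-up vector, that reports both the mode and its proportion. The variable names are illustrative, and `which.max()` returns the first value in case of a tie.

```r
# Hypothetical categorical values (illustrative only)
gender_toy <- c("Female", "Male", "Female", "Female", "Male")

freq       <- table(gender_toy)          # counts per category
mode_value <- names(freq)[which.max(freq)]
mode_share <- max(freq) / sum(freq)      # proportion held by the mode

cat("Mode:", mode_value, "| share:", mode_share, "\n")
```

A share above 0.5 is what justifies calling the modal category a majority; for variables with several categories the mode can have a much smaller share.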
For this section, while I recognize there are many possible analyses, I chose to avoid repetition and ensure that I do not duplicate graphs.
# ggplot Barplot
library(ggplot2)
ggplot(data, aes(x = product_category)) +
geom_bar(colour = "slateblue4", fill = "slateblue1") +
geom_text(aes(label = after_stat(count)), stat = "count", vjust = 1.5, col = "white")
The most popular category, as previously noted, is Electronics, with a total of 30027 customers, while Beauty is the least common, with only 5035.
# Base R Histogram
hist(data$age,
main = "Distribution of age in customers",
xlab = "Age",
ylab = "Frequency",
breaks = seq(from = 0, to = 50, by = 5),
col = "darksalmon")
The majority of customers are clustered between the ages of 25 and 30, with no noticeable outliers.
# ggplot Histogram
library(ggplot2)
ggplot(data, aes(x = satisfaction_score)) +
geom_histogram(binwidth = 1, colour = "black", fill = "azure") +
facet_wrap(~region, ncol = 1) +
ylab("Frequency") +
theme_minimal()
We can observe that the satisfaction score remains consistent across regions. For all four regions, the score follows a symmetric and unimodal distribution, appearing approximately normal, with a peak at a score of 5.
# ggplot Histogram
library(ggplot2)
ggplot(data, aes(x = age, fill = gender)) +
geom_histogram(position = position_dodge(width = 3), binwidth = 5, colour = "black") +
facet_wrap(~education, ncol = 1) +
ylab("Frequency") +
labs(fill = "Gender") + scale_fill_manual(values = c("Male" = "darkcyan", "Female" = "deeppink2"))
As observed earlier, most customers are around 30 years old for both genders. All distributions are symmetrical, unimodal, and appear similar across genders. The panels also show that the least common education level is a master’s degree.
Since the numerical variables have very different ranges, plotting all the boxplots in a single graph would make readability difficult. Therefore, I will separate them into different plots.
# Base R Boxplots
boxplot(data[, c(4, 9)], col = "brown1")
# Base R Boxplots
boxplot(data[, c(2, 12)], col = "brown1")
In both examples, some data points fall beyond the whiskers, that is, more than 1.5 times the interquartile range below the first quartile or above the third quartile. However, since they all appear reasonable, I will not remove them.
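The points drawn individually on a boxplot can also be listed numerically. The following is a minimal sketch using `boxplot.stats()` from base R on a made-up vector of scores; it applies the same default 1.5 × IQR rule that `boxplot()` uses.

```r
# Hypothetical satisfaction scores (illustrative values only)
scores <- c(4, 5, 5, 6, 5, 4, 6, 5, 0, 10)

stats    <- boxplot.stats(scores)  # default coef = 1.5
outliers <- stats$out              # values lying beyond the whiskers

cat("Five-number summary:", stats$stats, "\n")
cat("Flagged values:", outliers, "\n")
```

Inspecting the flagged values this way supports the decision to keep them when they are plausible, as is the case for scores of 0 and 10 on a 0 to 10 scale.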
# ggplot Boxplot
ggplot(data, aes(x = promotion_usage_factor, y = satisfaction_score)) +
geom_boxplot(color = "darkblue", fill = "cornflowerblue") +
xlab("Usage of Promotional Offers")
The satisfaction score distribution appears similar for both options, suggesting no apparent association between the use of promotional offers and satisfaction.
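The visual comparison can be backed by group-wise summaries. The following is a minimal sketch on a made-up data frame (`toy`) that uses the same column names as above; the values are invented for illustration.

```r
# Hypothetical stand-in for the real dataset (illustrative values only)
toy <- data.frame(
  promotion_usage_factor = factor(c("No", "No", "No", "Yes", "Yes", "Yes")),
  satisfaction_score     = c(4, 5, 6, 5, 5, 5)
)

# Median and mean satisfaction score within each group
group_median <- aggregate(satisfaction_score ~ promotion_usage_factor,
                          data = toy, FUN = median)
group_mean   <- tapply(toy$satisfaction_score,
                       toy$promotion_usage_factor, mean)

print(group_median)
print(group_mean)
```

Similar medians and means across groups are consistent with what the boxplot shows, although they describe an association rather than a causal effect.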
# ggplot Boxplot
ggplot(data, aes(x = product_category, y = satisfaction_score, fill = loyalty_status)) +
geom_boxplot() +
scale_fill_brewer(palette = "YlOrRd") +
xlab("Product") +
labs(fill = "loyalty_status")
Once again, across all categories and loyalty statuses, the distributions appear similar, with a median of around 5. This indicates that for every category and loyalty status, at least half of the customers gave a score of 5 or below, and at least half gave a score of 5 or above.
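Because the scores are whole numbers, many observations can sit exactly on the median, so the shares on either side are not necessarily 50% each. The following is a minimal sketch on a made-up vector of scores.

```r
# Hypothetical integer scores (illustrative values only)
scores <- c(3, 4, 5, 5, 5, 5, 6, 6, 7)

med               <- median(scores)
share_at_or_below <- mean(scores <= med)  # proportion scoring <= median
share_at_or_above <- mean(scores >= med)  # proportion scoring >= median
share_above       <- mean(scores > med)   # proportion strictly above

cat("Median:", med, "\n")
cat("At or below:", share_at_or_below,
    "| at or above:", share_at_or_above,
    "| strictly above:", share_above, "\n")
```

In this made-up example the share strictly above the median is well under one half, even though the median is 5.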
# ggplot Scatterplot
library(ggplot2)
ggplot(data, aes(x = money_spent_purchase, y = income)) +
geom_point(color = "pink") +
labs(x = "Amount of Money Spent", y = "Annual Income") +
geom_smooth(method = "lm", color = "deeppink4", se = FALSE)
The plot reveals a positive correlation between income and money_spent_purchase, suggesting that individuals with higher incomes tend to spend more. This aligns with logical expectations.
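The strength of the relationship can be quantified with the Pearson correlation coefficient and the slope of the fitted line. The following is a minimal sketch using only the six rows shown by `head(data)`, so the numbers it prints are not the result for the full dataset.

```r
# Income and purchase amount for the six rows printed by head(data)
income_head <- c(40682, 15317, 38849, 11568, 46952, 7347)
spent_head  <- c(18249, 4557, 11822, 4098, 19685, 2822)

r     <- cor(spent_head, income_head)   # Pearson correlation
fit   <- lm(income_head ~ spent_head)   # same model as geom_smooth(method = "lm")
slope <- coef(fit)[["spent_head"]]

cat("Correlation:", round(r, 3), "| slope:", round(slope, 3), "\n")
```

On the full data, the equivalent call would be `cor(data$money_spent_purchase, data$income)`.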
# Base R Scatterplot
library(car)
scatterplot(age ~ money_spent_purchase | gender,
ylab = "Age",
xlab = "Amount of Money Spent",
smooth = FALSE,
data = dataLowerIncome,
col = "transparent",
regLine = list(lty = 1, col = c("darkcyan", "orchid1"))
)
For customers classified as lower income, the correlation between age and money_spent_purchase is virtually nonexistent and is consistent across both genders.
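A per-group correlation puts a number on what the two regression lines show. The following is a minimal sketch on a made-up data frame (`toy`) in which the two groups are deliberately given opposite trends, so that the grouping is visible in the output; the real data would be expected to give values near zero for both genders.

```r
# Hypothetical stand-in with deliberately opposite trends per group
toy <- data.frame(
  gender               = c("Female", "Female", "Female", "Male", "Male", "Male"),
  age                  = c(25, 30, 35, 25, 30, 35),
  money_spent_purchase = c(100, 200, 300, 300, 200, 100)
)

# Split the rows by gender and compute one correlation per group
cor_by_gender <- sapply(split(toy, toy$gender),
                        function(d) cor(d$age, d$money_spent_purchase))

print(cor_by_gender)
```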
For this visualization, I decided to use a random sample of 200 observations from my data, as including all the observations would compromise readability (I tried it initially).
# Base R Scatterplot Matrix
library(car)
data200 <- data[sample(nrow(data), 200), ]
scatterplotMatrix(data200[, c(2, 4, 9)],
smooth = FALSE,
col = "firebrick")
As observed earlier, income and money spent on purchases are positively correlated. For the remaining pairs, however, no clear correlation is apparent.
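Because `sample()` draws different rows on every run, the scatterplot matrix changes each time the document is knitted. The following is a minimal sketch of making the draw reproducible with `set.seed()`; the seed value 123 is an arbitrary assumption, and `toy` is a made-up stand-in for the real data.

```r
# Hypothetical stand-in for the real dataset (illustrative values only)
toy <- data.frame(id = 1:1000, age = rep(18:49, length.out = 1000))

set.seed(123)  # arbitrary seed (assumption); any fixed integer works
sample_a <- toy[sample(nrow(toy), 200), ]

set.seed(123)  # resetting the seed reproduces exactly the same draw
sample_b <- toy[sample(nrow(toy), 200), ]

cat("Same rows in both draws:", identical(sample_a, sample_b), "\n")
```

Placing a single `set.seed()` call before the sampling step keeps the plot and its interpretation in sync across runs.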