Homework Assignment 1: Descriptive Statistics

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

data <- read.table("./customer_data.csv", header = TRUE, sep = ",")

head(data)

##   id age gender income  education region loyalty_status purchase_frequency
## 1  1  27   Male  40682   Bachelor   East           Gold           frequent
## 2  2  29   Male  15317    Masters   West        Regular               rare
## 3  3  37   Male  38849   Bachelor   West         Silver               rare
## 4  4  30   Male  11568 HighSchool  South        Regular           frequent
## 5  5  31 Female  46952    College  North        Regular         occasional
## 6  6  38   Male   7347   Bachelor  South         Silver         occasional
##   purchase_amount product_category promotion_usage satisfaction_score
## 1           18249            Books               0                  6
## 2            4557         Clothing               1                  6
## 3           11822         Clothing               0                  6
## 4            4098             Food               0                  7
## 5           19685         Clothing               1                  5
## 6            2822      Electronics               0                  5

Data Description

Unit of Observation

Each row in the dataset represents a single customer who shops at the given store.

Sample Size

The dataset contains 100000 observations, each corresponding to a unique customer.
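The sample size and the uniqueness of customers can be checked directly in R. The following is a minimal sketch on a small invented data frame (`toy` and its values are assumptions for illustration, only the `id` column name comes from the dataset); the same two calls could be applied to the real data.

```r
# Toy stand-in for the customer data; the values are invented for illustration.
toy <- data.frame(id = 1:5, age = c(27, 29, 37, 30, 31))

n_obs <- nrow(toy)                # number of observations (rows)
n_dup <- sum(duplicated(toy$id))  # number of repeated ids; 0 means every customer is unique

n_obs
n_dup
```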

Definition of Variables

| Variable | Type | Description | Unit |
|---|---|---|---|
| age | Numerical (Ratio) | The age of the customer. | Years |
| gender | Categorical (Nominal) | The gender of the customer. Possible values: `Male`, `Female`. | N/A |
| income | Numerical (Ratio) | The annual income of the customer. | Currency (e.g., USD, EUR) |
| education | Categorical (Ordinal) | The education level of the customer. Possible values: `HighSchool`, `College`, `Bachelor`, `Masters`. | N/A |
| region | Categorical (Nominal) | The region where the customer resides. Possible values: `North`, `South`, `East`, `West`. | N/A |
| loyalty_status | Categorical (Ordinal) | The loyalty status of the customer. Possible values: `Regular`, `Silver`, `Gold`. | N/A |
| purchase_frequency | Categorical (Ordinal) | The frequency of purchases made by the customer. Possible values: `rare`, `occasional`, `frequent`. | N/A |
| purchase_amount | Numerical (Ratio) | The amount spent by the customer per purchase. | Currency (e.g., USD, EUR) |
| product_category | Categorical (Nominal) | The category of the purchased product. Multiple values possible. | N/A |
| promotion_usage | Categorical (Nominal) | Indicates whether the customer used promotional offers (`0` for No, `1` for Yes). | Binary (0 or 1) |
| satisfaction_score | Numerical (Interval) | The satisfaction score of the customer. | Scale (0 to 10) |

Source of Data

The dataset is obtained from Kaggle:
Customer Purchases Behaviour Dataset

Data Manipulation

The dataset contains some implausible records, such as customers as young as 12 years old reporting an annual income. Although this may simply reflect how the data were defined, it does not seem meaningful. Therefore, I will remove all rows for customers under 18.

library(dplyr)

data <- data %>% filter(age >= 18)
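It is useful to record how many rows a filter like this removes. A minimal base R sketch, assuming an invented `toy` data frame rather than the real file (the names `toy`, `toy_adults`, and `n_removed` are hypothetical):

```r
# Toy ages, invented for illustration; two of them are under 18.
toy <- data.frame(age = c(12, 17, 18, 30, 45))

n_before   <- nrow(toy)
toy_adults <- toy[toy$age >= 18, , drop = FALSE]  # keep customers aged 18 or older
n_removed  <- n_before - nrow(toy_adults)         # rows dropped by the filter

n_removed
```

Comparing the row count before and after the filter makes the effect of the cleaning step explicit.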

library(tidyr)

# Drop every row that has a missing value in any column
data <- data %>% drop_na()
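Before dropping incomplete rows, it can help to see where the missing values are. The sketch below uses base R on an invented data frame (`toy` and its values are assumptions); `complete.cases()` is used here as a base R counterpart to `drop_na()`.

```r
# Toy data with missing values, invented for illustration.
toy <- data.frame(
  age    = c(25, NA, 40),
  income = c(10000, 20000, NA),
  region = c("East", "West", "North")
)

na_per_column <- colSums(is.na(toy))          # count of NA values in each column
toy_complete  <- toy[complete.cases(toy), ]   # keep only rows without any NA

na_per_column
nrow(toy_complete)
```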

I chose to rename only one variable since the others seemed appropriate. I changed purchase_amount to money_spent_purchase to better reflect that it represents the amount of money spent per purchase.

# rename() takes new_name = old_name
data <- data %>% rename(money_spent_purchase = purchase_amount)

I created a factor version of promotion_usage, stored in the new column promotion_usage_factor, as its possible values (0 and 1) correspond to No and Yes, as stated in the dataset’s description. The original numeric column is kept.

data$promotion_usage_factor <- factor(data$promotion_usage,
                                      levels = c(0, 1),
                                      labels = c("No", "Yes"))
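One way to confirm that the labels were attached to the right codes is to cross-tabulate the original values against the new factor. A minimal sketch on an invented vector (the five values below are assumptions, not taken from the dataset):

```r
# Invented 0/1 codes for illustration.
promotion_usage <- c(0, 1, 0, 0, 1)

promotion_usage_factor <- factor(promotion_usage,
                                 levels = c(0, 1),
                                 labels = c("No", "Yes"))

# Rows are the original codes, columns are the new labels;
# all counts should sit on the 0/No and 1/Yes cells.
check <- table(promotion_usage, promotion_usage_factor)
check
```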

I created several data frames based on specific conditions, such as gender (female and male) and annual income within a given range.

dataFemale <- data[data$gender == "Female", ]
dataMale <- data[data$gender == "Male", ]

dataLowerIncome <- data[data$income >= 5000 & data$income <= 10000, ]
dataHigherIncome <- data[data$income >= 40000 & data$income <= 50000, ]

Descriptive Statistics

library(pastecs)

summary(data[, -c(1, 3, 5, 6, 7, 8, 10, 13)])

##       age            income      money_spent_purchase promotion_usage 
##  Min.   :18.00   Min.   : 5000   Min.   : 1118        Min.   :0.0000  
##  1st Qu.:27.00   1st Qu.:16269   1st Qu.: 5583        1st Qu.:0.0000  
##  Median :30.00   Median :27583   Median : 9450        Median :0.0000  
##  Mean   :30.04   Mean   :27512   Mean   : 9634        Mean   :0.3008  
##  3rd Qu.:33.00   3rd Qu.:38742   3rd Qu.:13349        3rd Qu.:1.0000  
##  Max.   :49.00   Max.   :50000   Max.   :26204        Max.   :1.0000  
##  satisfaction_score
##  Min.   : 0.00     
##  1st Qu.: 4.00     
##  Median : 5.00     
##  Mean   : 5.01     
##  3rd Qu.: 6.00     
##  Max.   :10.00

Explanation of the sample statistics

Min: The smallest value in each column. For example, the youngest person in the dataset is 18 years old.
1st Qu: The value below which 25% of the data falls. For example, 25% of the individuals have an annual income of 16269 currency units or less.
Median: The value below which 50% of the data falls. For example, 50% of the individuals spent 9450 currency units or less per purchase.
Mean: The sum of all values divided by the number of observations. For example, the mean of promotion_usage is 0.3008; because the variable is coded 0/1, this means that about 30.08% of customers used a promotional offer.
3rd Qu: The value below which 75% of the data falls. For example, 75% of the individuals have a satisfaction score of 6 or less.
Max: The largest value in each column. For example, the oldest person in the dataset is 49 years old.
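The quartiles and the interpretation of the mean of a 0/1 variable can be illustrated on small vectors. This is a sketch with invented values (`scores` and `used_promo` are hypothetical names and data); `quantile()` is called with its default interpolation method.

```r
# Invented satisfaction scores for illustration.
scores <- c(3, 4, 4, 5, 5, 5, 6, 6, 7)

# First quartile, median, and third quartile
q <- quantile(scores, probs = c(0.25, 0.5, 0.75))
q

# Invented 0/1 indicator: 3 of 10 customers used a promotion.
used_promo <- c(0, 0, 1, 0, 1, 0, 0, 0, 1, 0)

# The mean of a 0/1 variable is the proportion of ones.
promo_share <- mean(used_promo)
promo_share
```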

I also decided to compute the standard deviation and the coefficient of variation, as they are not included in the summary function.

sd_income <- round(sd(data$income), 2)
sd_satisfaction_score <- round(sd(data$satisfaction_score), 2)

cv_income <- round((sd_income / mean(data$income)) * 100, 2)
cv_satisfaction_score <- round((sd_satisfaction_score / mean(data$satisfaction_score)) * 100, 2)

print(paste("Income:", sd_income, ",", cv_income, sep = " "))

## [1] "Income: 12996.06 , 47.24"

print(paste("Satisfaction Score:", sd_satisfaction_score, ",", cv_satisfaction_score, sep = " "))

## [1] "Satisfaction Score: 1.04 , 20.76"

Standard Deviation: The standard deviation for income is 12996.06, while for satisfaction_score it is 1.04. In absolute terms income is far more dispersed, but the two variables are measured in different units, so their standard deviations cannot be compared directly.
Coefficient of Variation: Because the standard deviations are not directly comparable, the coefficient of variation (CV), which expresses the standard deviation as a percentage of the mean, is used to compare relative variability. Income has a CV of 47.24% and satisfaction_score a CV of 20.76%, which confirms that income is also more variable relative to its mean.
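The difference between the two measures can be seen on a small example. The sketch below defines a hypothetical helper `cv_percent()` (not part of any package) and applies it to two invented vectors that differ only in scale.

```r
# Hypothetical helper: coefficient of variation in percent.
cv_percent <- function(x) sd(x) / mean(x) * 100

# Two invented vectors; b is simply a multiplied by 100.
a <- c(10, 20, 30)
b <- c(1000, 2000, 3000)

sd(a)          # standard deviation depends on the scale of the data
sd(b)
cv_percent(a)  # the CV is unit-free, so it is the same for both vectors
cv_percent(b)
```

The standard deviation of `b` is 100 times that of `a`, yet both have the same CV, which is why the CV is the appropriate measure when variables are on different scales.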

Additionally, for the non-numerical variables, I decided to compute the mode.

library(modeest)

mode_gender <- mlv(data$gender)
mode_education <- mlv(data$education)
mode_region <- mlv(data$region)
mode_loyalty_status <- mlv(data$loyalty_status)
mode_purchase_frequency <- mlv(data$purchase_frequency)
mode_product_category <- mlv(data$product_category)
mode_promotion_usage_factor <- mlv(data$promotion_usage_factor)

print(paste("Gender:", mode_gender))

## [1] "Gender: Female"

print(paste("Education:", mode_education))

## [1] "Education: College"

print(paste("Region:", mode_region))

## [1] "Region: East"

print(paste("Loyalty Status:", mode_loyalty_status))

## [1] "Loyalty Status: Regular"

print(paste("Purchase Frequency:", mode_purchase_frequency))

## [1] "Purchase Frequency: rare"

print(paste("Product Category:", mode_product_category))

## [1] "Product Category: Electronics"

print(paste("Promotion Usage:", mode_promotion_usage_factor))

## [1] "Promotion Usage: No"

The most common gender is female, and the most common education level is College. The most frequent region is East, and the most common loyalty status is Regular. In terms of purchasing behavior, the most common purchase frequency is “rare”, and the most popular product category is Electronics. Most customers didn’t use a promotional offer. Note that for variables with more than two categories, the mode is the most frequent value but does not necessarily represent a majority of customers.
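A frequency table shows the mode together with its share of the observations, which makes clear whether the most common category is also a majority. A minimal base R sketch on invented values (the `education` vector below is an assumption, not the real column):

```r
# Invented education levels for illustration.
education <- c("College", "College", "Bachelor", "HighSchool",
               "Masters", "College", "Bachelor", "HighSchool")

freq <- table(education)                      # count per category
mode_value <- names(freq)[which.max(freq)]    # most frequent category
mode_share <- max(freq) / length(education)   # its share of all observations

mode_value
mode_share
```

In this toy example the mode accounts for fewer than half of the observations, so it is the most common level without being a majority.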

Graphs

Many analyses are possible in this section; I chose a selection of graphs that avoids repeating the same plot.

# ggplot Barplot
library(ggplot2)

ggplot(data, aes(x = product_category)) + 
  geom_bar(colour = "slateblue4", fill = "slateblue1") + 
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = 1.5, col = "white")

The most popular category, as previously noted, is Electronics, with a total of 30027 customers, while Beauty is the least bought, with only 5035.

# Base R Histogram
hist(data$age, 
     main = "Distribution of age in customers", 
     xlab = "Age", 
     ylab = "Frequency", 
     breaks = seq(from = 0, to = 50, by = 5),
     col = "darksalmon")

The majority of customers are clustered between the ages of 25 and 30, with no noticeable outliers.

# ggplot Histogram
library(ggplot2)

ggplot(data, aes(x = satisfaction_score)) + 
  geom_histogram(binwidth = 1, colour = "black", fill = "azure") +
  facet_wrap(~region, ncol = 1) + 
  ylab("Frequency") + 
  theme_minimal()

We can observe that the satisfaction score remains consistent across regions. For all four regions, the score follows a symmetric and unimodal distribution, appearing approximately normal, with a peak at a score of 5.

# ggplot Histogram
library(ggplot2)

ggplot(data, aes(x = age, fill = gender)) +
  geom_histogram(position = position_dodge(width = 3), binwidth = 5, colour = "black") +
  facet_wrap(~education, ncol = 1) + 
  ylab("Frequency") +
  labs(fill = "Gender") + scale_fill_manual(values = c("Male" = "darkcyan", "Female" = "deeppink2"))

As observed earlier, most customers are around 30 years old for both genders. All distributions are symmetrical, unimodal, and appear similar across genders. The least common education level is a master’s degree.

Since the numerical variables have very different ranges, plotting all the boxplots in a single graph would make readability difficult. Therefore, I will separate them into different plots.

# Base R Boxplots
boxplot(data[, c(4, 9)], col = "brown1")

# Base R Boxplots
boxplot(data[, c(2, 12)], col = "brown1")

In both plots, some data points fall beyond the whiskers, that is, more than 1.5 times the interquartile range away from the first or third quartile. However, since they all appear reasonable, I will not remove them.
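The points that a boxplot draws individually can also be listed numerically. The sketch below uses `boxplot.stats()` from base R (package grDevices) on an invented vector; `x` and its values are assumptions for illustration.

```r
# Invented values with one clearly extreme observation.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# $out holds the points lying beyond the whiskers,
# using the default coefficient of 1.5 times the hinge spread.
out <- boxplot.stats(x)$out
out
```

Listing the flagged values makes it easier to judge whether they are plausible before deciding to keep or remove them.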

# ggplot Boxplot
ggplot(data, aes(x = promotion_usage_factor, y = satisfaction_score)) +
  geom_boxplot(color = "darkblue", fill = "cornflowerblue") +
  xlab("Usage of Promotional Offers")

The satisfaction score distribution appears similar for both groups, suggesting no apparent association between the use of promotional offers and satisfaction.
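The visual comparison can be backed by group-wise summaries. A minimal sketch with `tapply()` on an invented data frame (`toy` and its values are assumptions; the column names follow the ones used above):

```r
# Invented data: same score distribution in both groups.
toy <- data.frame(
  promotion_usage_factor = factor(c("No", "No", "No", "Yes", "Yes", "Yes")),
  satisfaction_score     = c(4, 5, 6, 4, 5, 6)
)

# Median satisfaction score within each promotion group
medians <- tapply(toy$satisfaction_score, toy$promotion_usage_factor, median)
medians
```

Equal medians describe an absence of association in the sample; they do not by themselves establish that promotions have no causal effect.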

# ggplot Boxplot
ggplot(data, aes(x = product_category, y = satisfaction_score, fill = loyalty_status)) +
  geom_boxplot() +
  # The ColorBrewer palette name is "YlOrRd" (lowercase L)
  scale_fill_brewer(palette = "YlOrRd") + 
  xlab("Product") + 
  labs(fill = "Loyalty Status")

Once again, across all categories and loyalty statuses, the distributions appear similar, with a median of around 5. This indicates that for every category and loyalty status, at least half of the customers gave a score of 5 or below, and at least half gave a score of 5 or above.

# ggplot Scatterplot
library(ggplot2)

ggplot(data, aes(x = money_spent_purchase, y = income)) + 
  geom_point(color = "pink") + 
  labs(x = "Amount of Money Spent", y = "Annual Income") +
  geom_smooth(method = "lm", color = "deeppink4", se = FALSE)

The plot reveals a positive correlation between income and money_spent_purchase, suggesting that individuals with higher incomes tend to spend more. This aligns with logical expectations.
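The strength of a linear relationship like this one can be quantified with the Pearson correlation coefficient. The sketch below uses invented values (`income` and `money_spent` here are toy vectors, not the real columns) to show the call.

```r
# Invented values for illustration: spending rises with income.
income      <- c(10000, 20000, 30000, 40000, 50000)
money_spent <- c(3000, 7500, 9000, 14000, 17000)

# Pearson correlation; values close to 1 indicate a strong positive linear relationship.
r <- cor(money_spent, income)
r
```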

# Base R Scatterplot
library(car)

scatterplot(age ~ money_spent_purchase | gender, 
            ylab = "Age", 
            xlab = "Amount of Money Spent", 
            smooth = FALSE, 
            data = dataLowerIncome,
            col = "transparent", 
            regLine = list(lty = 1, col = c("darkcyan", "orchid1"))
)

For customers classified as lower income, the correlation between age and money_spent_purchase is virtually nonexistent and is consistent across both genders.

For this visualization, I decided to use a random sample of 200 observations from my data, as including all the observations would compromise readability (I tried it initially).

# Base R Scatterplot Matrix
library(car)

set.seed(123)  # arbitrary seed so that the random sample is reproducible
data200 <- data[sample(nrow(data), 200), ]

scatterplotMatrix(data200[, c(2, 4, 9)], 
                  smooth = FALSE, 
                  col = "firebrick")

As observed earlier, income and money spent on purchases are positively correlated. However, for the remaining pairs, no clear correlation is apparent.
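A correlation matrix is the numerical counterpart of a scatterplot matrix. A minimal sketch on an invented data frame (`toy` and its values are assumptions; the column names follow the ones used in this report):

```r
# Invented data for illustration.
toy <- data.frame(
  age                  = c(25, 32, 41, 29, 36),
  income               = c(10000, 20000, 30000, 40000, 50000),
  money_spent_purchase = c(3000, 7500, 9000, 14000, 17000)
)

# Pairwise Pearson correlations, rounded for readability
cor_matrix <- round(cor(toy), 2)
cor_matrix
```

Each off-diagonal entry corresponds to one panel of the scatterplot matrix, so weak and strong relationships can be compared on the same scale.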