Importing and cleaning the data

Let’s import and display our dataset.

customers <- read.csv("Mall_Customers.csv", header = TRUE)
head(customers)

We use read.csv() to load our dataset, Mall_Customers.csv, into a data frame, and head() to display its first six rows.

Variables and definitions:

Variable name    Type                    Description
Customer ID      Categorical (Nominal)   Unique identifier for each customer
Gender           Categorical (Nominal)   Customer’s gender
Age              Numerical (Ratio)       Customer’s age in years
Annual income    Numerical (Ratio)       Customer’s annual income in thousands of USD
Spending score   Numerical (Interval)    Score based on customer spending habits (1–100 scale)

Source of the data

The data was obtained from Kaggle.

Link: Mall Customer Segmentation Data

Data manipulation I

Renaming columns

The original column names are somewhat awkward (e.g., “Annual.Income..k..”), so we use names(customers)[...] <- ... to give them more readable titles.

names(customers)[names(customers) == "CustomerID"] <- "Customer ID"
names(customers)[names(customers) == "Annual.Income..k.."] <- "Annual Income (USD in thousands)"
names(customers)[names(customers) == "Spending.Score..1.100."] <- "Spending Score (1-100)"
colnames(customers)
## [1] "Customer ID"                      "Gender"                          
## [3] "Age"                              "Annual Income (USD in thousands)"
## [5] "Spending Score (1-100)"
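The awkward names come from read.csv() itself: by default (check.names = TRUE) it passes the header through make.names(), which replaces characters that are not valid in R names with dots. The sketch below illustrates this on a tiny inline CSV. The header text and the values are assumptions made up for the illustration, not a copy of the real file.

```r
# Toy CSV (hypothetical header and made-up values, for illustration only)
csv_text <- "CustomerID,Annual Income (k$),Spending Score (1-100)
1,15,39
2,15,81"

# Default behaviour: headers are converted to syntactically valid R names
default_names <- names(read.csv(text = csv_text))
print(default_names)

# check.names = FALSE keeps the headers exactly as written in the file
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))
print(raw_names)
```

Names that contain spaces or parentheses are easier to read in tables and plots, but they must be wrapped in backticks whenever they are referenced in code.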

Examining the structure

Let’s examine the structure of our dataset with str(), which shows how each column is stored (numeric, factor, character, etc.).

str(customers)
## 'data.frame':    200 obs. of  5 variables:
##  $ Customer ID                     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender                          : chr  "Male" "Male" "Female" "Female" ...
##  $ Age                             : int  19 21 20 23 31 22 35 23 64 30 ...
##  $ Annual Income (USD in thousands): int  15 15 16 16 17 17 18 18 19 19 ...
##  $ Spending Score (1-100)          : int  39 81 6 77 40 76 6 94 3 72 ...

Let’s define Gender as a factor with two levels, “Male” and “Female”. This tells R that Gender is categorical, which is useful for summary statistics and plotting.

customers$Gender <- factor(customers$Gender, levels = c("Male", "Female"))
str(customers)
## 'data.frame':    200 obs. of  5 variables:
##  $ Customer ID                     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender                          : Factor w/ 2 levels "Male","Female": 1 1 2 2 2 2 2 2 1 2 ...
##  $ Age                             : int  19 21 20 23 31 22 35 23 64 30 ...
##  $ Annual Income (USD in thousands): int  15 15 16 16 17 17 18 18 19 19 ...
##  $ Spending Score (1-100)          : int  39 81 6 77 40 76 6 94 3 72 ...
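To see what the conversion changes, here is a minimal sketch on a made-up vector (the values are hypothetical, not taken from the dataset). It compares how summary() treats a character vector and a factor, and shows the integer codes that back the factor.

```r
# Hypothetical toy vector standing in for the Gender column
gender_chr <- c("Male", "Female", "Female", "Male", "Female")

# For a character vector, summary() only reports length, class and mode
print(summary(gender_chr))

# The order given in `levels` fixes the order used in tables and plots
gender_fct <- factor(gender_chr, levels = c("Male", "Female"))
print(levels(gender_fct))

# For a factor, summary() counts the observations in each level
counts <- summary(gender_fct)
print(counts)

# Internally each value is stored as the integer index of its level
codes <- as.integer(gender_fct)
print(codes)
```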

Checking and removing missing data

sum(is.na(customers))
## [1] 0

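We see that there are no NAs in our dataset, but for the sake of practice let’s clean the data anyway. na.omit() drops every row that contains at least one NA, so here the row count stays the same.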

cat("Number of rows before removing NAs:", nrow(customers), "\n")
## Number of rows before removing NAs: 200
customers <- na.omit(customers)
cat("Number of rows after removing NAs:", nrow(customers), "\n")
## Number of rows after removing NAs: 200
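Because the real dataset has no missing values, the effect of na.omit() is not visible above. The following sketch uses a small made-up data frame (hypothetical column names and values) to show how missing values can be counted per column and how na.omit() removes incomplete rows.

```r
# Hypothetical toy data with two missing values
toy <- data.frame(
  Age    = c(19, NA, 20, 23),
  Income = c(15, 15, NA, 16),
  Score  = c(39, 81, 6, 77)
)

# Total number of NAs, and NAs per column
total_na <- sum(is.na(toy))
na_per_column <- colSums(is.na(toy))
print(total_na)
print(na_per_column)

# na.omit() keeps only the rows without any NA
toy_clean <- na.omit(toy)
cat("Rows before:", nrow(toy), " rows after:", nrow(toy_clean), "\n")
```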

Descriptive statistics

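We load the psych package to use its describe() function, which returns statistics such as the mean, standard deviation, median, minimum, and maximum. We drop the first two columns (Customer ID and Gender) because they are not numeric measurements.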

library(psych)
describe(customers[ , -c(1,2)])

Let’s also use stat.desc() from pastecs, which offers a wide range of descriptive statistics as well.

library(pastecs)
round(stat.desc(customers[ , -c(1,2)]), 2)

summary(customers[ , -1])
##     Gender         Age        Annual Income (USD in thousands)
##  Male  : 88   Min.   :18.00   Min.   : 15.00                  
##  Female:112   1st Qu.:28.75   1st Qu.: 41.50                  
##               Median :36.00   Median : 61.50                  
##               Mean   :38.85   Mean   : 60.56                  
##               3rd Qu.:49.00   3rd Qu.: 78.00                  
##               Max.   :70.00   Max.   :137.00                  
##  Spending Score (1-100)
##  Min.   : 1.00         
##  1st Qu.:34.75         
##  Median :50.00         
##  Mean   :50.20         
##  3rd Qu.:73.00         
##  Max.   :99.00

The base R function summary(), used above, gives the minimum, 1st quartile, median, mean, 3rd quartile, and maximum for each numeric variable. We keep Gender to see how many customers fall into each level (88 males and 112 females).
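If neither psych nor pastecs is available, a similar table can be put together in base R. This is a minimal sketch on made-up data (the data frame and its values are hypothetical). It applies one small function to every column with sapply().

```r
# Hypothetical toy data standing in for the numeric columns
toy <- data.frame(
  Age   = c(19, 21, 20, 23, 31),
  Score = c(39, 81, 6, 77, 40)
)

# One row per statistic, one column per variable
basic_stats <- sapply(toy, function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), min = min(x), max = max(x))
})

print(round(basic_stats, 2))
```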

tapply(customers$`Annual Income (USD in thousands)`, customers$Gender, summary)
## $Male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   45.50   62.50   62.23   78.00  137.00 
## 
## $Female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   39.75   60.00   59.25   77.25  126.00

Here, we compute summary statistics for annual income grouped by Gender.

This lets us compare incomes between male and female customers quickly: on average, male customers have a slightly higher annual income than female customers (62.23 vs 59.25, in thousands of USD).
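When only one statistic per group is needed, tapply() can be given mean directly, and aggregate() returns the same result as a data frame. The sketch below uses a made-up data frame (hypothetical names and values), so its numbers do not correspond to the real dataset.

```r
# Hypothetical toy data: income in thousands of USD by gender
toy <- data.frame(
  Gender = factor(c("Male", "Male", "Female", "Female", "Female"),
                  levels = c("Male", "Female")),
  Income = c(15, 21, 16, 17, 18)
)

# Named vector with one mean per factor level
group_means <- tapply(toy$Income, toy$Gender, mean)
print(group_means)

# Same computation, returned as a data frame via the formula interface
group_table <- aggregate(Income ~ Gender, data = toy, FUN = mean)
print(group_table)
```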

Data manipulation II

Let’s create a new categorical variable called IncomeCategory based on annual income.

library(dplyr)
customers <- customers %>%
  mutate(IncomeCategory = ifelse(`Annual Income (USD in thousands)` < 30, "Low", 
                          ifelse(`Annual Income (USD in thousands)` <= 90, "Medium", "High")))

customers$IncomeCategory <- as.factor(customers$IncomeCategory)

table(customers$IncomeCategory)
## 
##   High    Low Medium 
##     22     30    148
prop.table(table(customers$IncomeCategory))
## 
##   High    Low Medium 
##   0.11   0.15   0.74

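Here, customers are labelled “Low” (below $30k), “Medium” (from $30k to $90k inclusive), or “High” (above $90k). Because as.factor() sorts levels alphabetically, the tables list the categories in the order High, Low, Medium.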

We can see that most of the customers (74%) fall into the “Medium” income category.
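The categories appear as High, Low, Medium because as.factor() orders levels alphabetically. If a Low, Medium, High order is preferred in tables and plots, the levels can be set explicitly. The sketch below applies the same thresholds to a made-up income vector (the values are hypothetical and chosen to hit the boundaries).

```r
# Hypothetical incomes in thousands of USD, including the boundary values
income <- c(15, 29, 30, 60, 90, 91, 137)

# Same rule as above: < 30 is Low, 30 to 90 inclusive is Medium, > 90 is High
category <- ifelse(income < 30, "Low",
            ifelse(income <= 90, "Medium", "High"))

# Explicit levels give a meaningful order instead of an alphabetical one
category <- factor(category, levels = c("Low", "Medium", "High"))

counts <- table(category)
print(counts)
print(round(prop.table(counts), 2))
```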

Visualizations

Let’s visualize the distribution of customers’ ages with a histogram.

library(ggplot2)
ggplot(customers, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "#9FE2BF", color = "black") +
  labs(
    title = "Distribution of customer age",
    x     = "Age",
    y     = "Count"
  ) +
  theme_minimal()

ggplot(customers, aes(x = Gender, y = `Spending Score (1-100)`, fill = Gender)) +
  geom_boxplot() +
  labs(
    title = "Spending scores by gender",
    x     = "Gender",
    y     = "Spending score (1-100)"
  ) +
  theme_minimal() +
  scale_fill_manual(values = c("Female" = "#DE3163", "Male" = "#6495ED")) +
  theme(legend.position = "none")  # removes the legend

ggplot(customers, aes(x = IncomeCategory, fill = IncomeCategory)) +
  geom_bar() +
  labs(
    title = "Number of customers by income category",
    x     = "Income category",
    y     = "Count"
  ) +
  theme_minimal() +
  scale_fill_manual(values = c("High" = "#4ba934", "Low" = "#FF5733", "Medium" = "#6495ED")) +
  theme(legend.position = "none")  # removes the legend

ggplot(customers, aes(x = `Annual Income (USD in thousands)`, y = `Spending Score (1-100)`, color = Gender)) +
  geom_point(size = 2, alpha = 0.8) +
  labs(
    x     = "Annual income (k$)",
    y     = "Spending score (1-100)",
    color = "Gender"
  ) +
  theme_minimal()

To check whether annual income and spending score are linearly related, let’s calculate the Pearson correlation between the two variables.

cor(customers$`Annual Income (USD in thousands)`,
    customers$`Spending Score (1-100)`,
    method = "pearson")
## [1] 0.009902848

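We observe a coefficient of about 0.01, which indicates practically no linear relationship between annual income and spending score. Note that Pearson’s correlation measures only linear association, so a value near zero does not rule out other kinds of structure in the data.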
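A Pearson coefficient near zero should be read with some care, since it only measures linear association. The following sketch, built on made-up vectors that are unrelated to the dataset, contrasts a perfectly linear relationship with a perfectly quadratic one.

```r
# Hypothetical toy data, symmetric around zero
x <- -3:3

# Perfect linear dependence: correlation is 1
y_linear <- 2 * x + 1
r_linear <- cor(x, y_linear, method = "pearson")
print(r_linear)

# Perfect quadratic dependence: y is fully determined by x,
# yet the linear correlation is zero
y_quadratic <- x^2
r_quadratic <- cor(x, y_quadratic, method = "pearson")
print(r_quadratic)
```

This is why it helps to look at a scatter plot alongside the coefficient rather than relying on the number alone.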