Let’s import and display our dataset.
customers <- read.csv("Mall_Customers.csv", header = TRUE)
head(customers)
We are using read.csv() to load our dataset,
Mall_Customers.csv.
Our dataset contains 200 observations of 5 variables. It provides some basic data about the shop customers like Customer ID, age, gender, annual income and spending score.
The unit of observation in our case is one individual customer.
The sample size is 200.
Variables and definitions:
| Variable name | Type | Description |
|---|---|---|
| Customer ID | Categorical (Nominal) | Unique identifier for each customer |
| Gender | Categorical (Nominal) | Customer’s gender |
| Age | Numerical (Ratio) | Customer’s age in years |
| Annual income | Numerical (Ratio) | Customer’s annual income in thousands of USD |
| Spending score | Numerical (Interval) | Score based on customer spending habits (1–100 scale) |
The original column names are somewhat awkward (e.g.,
“Annual.Income..k..”), therefore we use
names(customers)[...] <- ... to give them more readable
titles.
names(customers)[names(customers) == "CustomerID"] <- "Customer ID"
names(customers)[names(customers) == "Annual.Income..k.."] <- "Annual Income (USD in thousands)"
names(customers)[names(customers) == "Spending.Score..1.100."] <- "Spending Score (1-100)"
colnames(customers)
## [1] "Customer ID" "Gender"
## [3] "Age" "Annual Income (USD in thousands)"
## [5] "Spending Score (1-100)"
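The awkward names come from how read.csv() sanitizes headers. Here is a minimal sketch of that behavior; the raw header strings below are assumptions made for illustration, not taken from the file itself.

```r
# Hypothetical raw headers, assumed for illustration; check your own CSV.
raw_headers <- c("CustomerID", "Gender", "Age",
                 "Annual Income (k$)", "Spending Score (1-100)")

# With check.names = TRUE (the default), read.csv() passes headers through
# make.names(), which replaces characters that are invalid in R names with dots.
sanitized <- make.names(raw_headers)
print(sanitized)

# Reading with check.names = FALSE keeps the headers exactly as written.
csv_text <- "CustomerID,Annual Income (k$)\n1,15\n2,16"
toy <- read.csv(text = csv_text, check.names = FALSE)
print(names(toy))
```

Keeping the original headers avoids the renaming step, but such names are non-syntactic and always need backticks when referenced in code.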
Let’s examine the structure of our dataset with
str(), which shows how each column is stored
(numeric, factor, character, etc.).
str(customers)
## 'data.frame': 200 obs. of 5 variables:
## $ Customer ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : chr "Male" "Male" "Female" "Female" ...
## $ Age : int 19 21 20 23 31 22 35 23 64 30 ...
## $ Annual Income (USD in thousands): int 15 15 16 16 17 17 18 18 19 19 ...
## $ Spending Score (1-100) : int 39 81 6 77 40 76 6 94 3 72 ...
Let’s define Gender as a factor with two levels, “Male”
and “Female”, so that R treats “Gender” as
categorical, which is useful for summary statistics and
plotting.
customers$Gender <- factor(customers$Gender, levels = c("Male", "Female"))
str(customers)
## 'data.frame': 200 obs. of 5 variables:
## $ Customer ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : Factor w/ 2 levels "Male","Female": 1 1 2 2 2 2 2 2 1 2 ...
## $ Age : int 19 21 20 23 31 22 35 23 64 30 ...
## $ Annual Income (USD in thousands): int 15 15 16 16 17 17 18 18 19 19 ...
## $ Spending Score (1-100) : int 39 81 6 77 40 76 6 94 3 72 ...
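One thing to watch when passing explicit levels to factor(): any value that does not match a level exactly becomes NA. The sketch below uses a made-up vector (the misspelled "female" is a hypothetical example, not something found in our data) to show why it is worth checking for NAs after the conversion.

```r
# Toy vector with one inconsistent spelling (hypothetical, for illustration)
gender_raw <- c("Male", "Female", "female", "Male")

# Values not listed in `levels` are silently converted to NA
gender <- factor(gender_raw, levels = c("Male", "Female"))

print(levels(gender))
print(table(gender, useNA = "ifany"))

# Count how many values were lost in the conversion
n_lost <- sum(is.na(gender))
print(n_lost)
```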
sum(is.na(customers))
## [1] 0
We see that there are no NAs in our dataset, but for the sake of practice let’s run the cleaning step anyway.
cat("Number of rows before removing NAs:", nrow(customers), "\n")
## Number of rows before removing NAs: 200
customers <- na.omit(customers)
cat("Number of rows after removing NAs:", nrow(customers), "\n")
## Number of rows after removing NAs: 200
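When a dataset does contain missing values, it often helps to see where they are before dropping rows. This is a small sketch on a toy data frame (the column names and values are invented for illustration) using colSums() and complete.cases() from base R.

```r
# Toy data frame with missing values (invented for illustration)
toy <- data.frame(
  age    = c(19, NA, 35),
  income = c(15, 16, NA),
  score  = c(39, 81, 6)
)

# Number of NAs in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Keep only rows without any NA; for a data frame this selects
# the same rows as na.omit()
complete_rows <- toy[complete.cases(toy), ]
print(nrow(complete_rows))
```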
We load the psych package to use its
describe() function, which returns statistics such as mean, SD,
median, min, and max.
library(psych)
describe(customers[ , -c(1,2)])
The average age of customers is 38.9 years old.
Mean annual income is 60.6 thousand USD.
From the skew column we can say that the distributions of both
age and annual income are skewed to the right.
The median income is 61.5 thousand USD, meaning that half of the customers earn at most 61.5k USD and half earn more.
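To make the skew statistic less of a black box, here is a rough sketch of a moment-based skewness computed by hand. The helper skew_moment() is our own hypothetical function, and psych may apply a slightly different small-sample adjustment, so the exact numbers need not match describe(). The toy vector is invented for illustration.

```r
# Hypothetical helper: moment-based skewness (third central moment
# divided by the second central moment raised to the power 1.5)
skew_moment <- function(v) {
  v <- v[!is.na(v)]
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

# Toy ages with a long right tail (invented for illustration)
x <- c(18, 20, 22, 25, 30, 70)

skew_x <- skew_moment(x)
print(skew_x)              # positive for a right-skewed sample

# In this toy sample the long right tail also pulls the mean above the median
print(mean(x) > median(x))
```

A positive value points to a longer right tail and a negative value to a longer left tail. Comparing the mean with the median is only a rough companion check and does not always agree with the sign of the skewness.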
library(pastecs)
round(stat.desc(customers[ , -c(1,2)]), 2)
Let’s use stat.desc() from pastecs, which
also offers a wide range of descriptive stats.
summary(customers[ , -1])
## Gender Age Annual Income (USD in thousands)
## Male : 88 Min. :18.00 Min. : 15.00
## Female:112 1st Qu.:28.75 1st Qu.: 41.50
## Median :36.00 Median : 61.50
## Mean :38.85 Mean : 60.56
## 3rd Qu.:49.00 3rd Qu.: 78.00
## Max. :70.00 Max. :137.00
## Spending Score (1-100)
## Min. : 1.00
## 1st Qu.:34.75
## Median :50.00
## Mean :50.20
## 3rd Qu.:73.00
## Max. :99.00
We could also use the base R function summary(), which
gives the min, 1st quartile, median, mean, 3rd quartile, and max for each
numeric variable. We keep Gender to see how many rows fall into
each factor level (88 males and 112 females).
tapply(customers$`Annual Income (USD in thousands)`, customers$Gender, summary)
## $Male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 45.50 62.50 62.23 78.00 137.00
##
## $Female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.00 39.75 60.00 59.25 77.25 126.00
Here, we compute summary statistics for annual income grouped
by Gender.
This allows us to compare incomes between males and females quickly. On average, male customers have a slightly higher annual income than female customers (62.23 vs 59.25 thousand USD).
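A gap in group means can be due to sampling noise, so it can be worth checking it more formally. The sketch below uses a toy data frame (values invented for illustration) with aggregate() for the group means and a Welch two-sample t-test from base R. It is one possible check, not part of the original analysis.

```r
# Toy data (invented for illustration)
toy <- data.frame(
  gender = factor(rep(c("Male", "Female"), each = 4),
                  levels = c("Male", "Female")),
  income = c(60, 64, 70, 54, 58, 60, 66, 52)
)

# Mean income per group, an alternative to tapply()
group_means <- aggregate(income ~ gender, data = toy, FUN = mean)
print(group_means)

# Welch two-sample t-test for a difference in mean income between groups
tt <- t.test(income ~ gender, data = toy)
print(tt$p.value)
```

On the real data, the same call with the backticked column name in the formula would indicate whether the 62.23 vs 59.25 gap is distinguishable from chance.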
Let’s create a new categorical variable called
IncomeCategory based on annual income.
library(dplyr)
customers <- customers %>%
mutate(IncomeCategory = ifelse(`Annual Income (USD in thousands)` < 30, "Low",
ifelse(`Annual Income (USD in thousands)` <= 90, "Medium", "High")))
customers$IncomeCategory <- as.factor(customers$IncomeCategory)
table(customers$IncomeCategory)
##
## High Low Medium
## 22 30 148
prop.table(table(customers$IncomeCategory))
##
## High Low Medium
## 0.11 0.15 0.74
Here, we categorize income as “Low” (below $30k), “Medium” ($30k to $90k inclusive), and “High” (above $90k).
We can see that most of the customers (74%) fall into the “Medium” income category.
Let’s visualize the distribution of customers’ age with a histogram.
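Note that as.factor() orders the levels alphabetically (High, Low, Medium), which is why the tables above are not in a natural order. As an alternative sketch, cut() can bin the values and set the level order in one step. The breakpoint of 29 below assumes incomes are whole numbers, as they are stored in this dataset, and the toy vector is invented for illustration.

```r
# Toy incomes in thousands of USD (invented for illustration)
income <- c(15, 29, 30, 90, 91, 137)

# cut() uses right-closed intervals by default: (-Inf, 29], (29, 90], (90, Inf].
# Assumption: incomes are integers, so (29, 90] is the same as 30 to 90 inclusive.
income_category <- cut(
  income,
  breaks = c(-Inf, 29, 90, Inf),
  labels = c("Low", "Medium", "High")
)

print(income_category)
print(table(income_category))   # levels appear in Low, Medium, High order
```

The same ordering can also be obtained after the nested ifelse() by calling factor() with levels = c("Low", "Medium", "High").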
library(ggplot2)
ggplot(customers, aes(x = Age)) +
geom_histogram(binwidth = 5, fill = "#9FE2BF", color = "black") +
labs(
title = "Distribution of customer age",
x = "Age",
y = "Count"
) +
theme_minimal()
ggplot(customers, aes(x = Gender, y = `Spending Score (1-100)`, fill = Gender)) +
geom_boxplot() +
labs(
title = "Spending scores by gender",
x = "Gender",
y = "Spending score (1-100)"
) +
theme_minimal() +
scale_fill_manual(values = c("Female" = "#DE3163", "Male" = "#6495ED")) +
theme(legend.position = "none") #removes the legend
Looking at the center line of each box, we can say that both groups have a similar median spending score.
The spread of the male group seems to be slightly larger than that of the female group, judging by the box height (the interquartile range).
There are no outliers in either group.
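What we read off the boxplot can also be checked numerically. The sketch below uses a toy data frame (values invented for illustration) to compute the interquartile range and the points beyond 1.5 times the box height per group. boxplot.stats() from base R computes its hinges slightly differently from ggplot2, so borderline points may be classified differently.

```r
# Toy spending scores (invented for illustration)
toy <- data.frame(
  gender = rep(c("Male", "Female"), each = 6),
  score  = c(1, 20, 40, 60, 80, 99, 30, 40, 50, 55, 60, 70)
)

# Interquartile range per group (corresponds to the box height)
iqr_by_gender <- tapply(toy$score, toy$gender, IQR)
print(iqr_by_gender)

# Points flagged as outliers by the 1.5 * IQR rule, per group
outliers_by_gender <- tapply(toy$score, toy$gender,
                             function(s) boxplot.stats(s)$out)
print(outliers_by_gender)
```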
ggplot(customers, aes(x = IncomeCategory, fill = IncomeCategory)) +
geom_bar() +
labs(
title = "Number of customers by income category",
x = "Income category",
y = "Count"
) +
theme_minimal() +
scale_fill_manual(values = c("High" = "#4ba934", "Low" = "#FF5733", "Medium" = "#6495ED")) +
theme(legend.position = "none") #removes the legend
ggplot(customers, aes(x = `Annual Income (USD in thousands)`, y = `Spending Score (1-100)`, color = Gender)) +
geom_point(size = 2, alpha = 0.8) +
labs(
x = "Annual income (k$)",
y = "Spending score (1-100)",
color = "Gender"
) +
theme_minimal()
In our dataset we see no clear linear relationship between how much a customer earns (horizontal axis) and their spending score (vertical axis).
Furthermore, male and female customers appear to be similarly distributed across the income–spending space.
To support this point, let’s calculate the correlation between the two variables.
cor(customers$`Annual Income (USD in thousands)`,
customers$`Spending Score (1-100)`,
method = "pearson")
## [1] 0.009902848
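The coefficient is very close to zero (about 0.01), which supports our observation that there is no linear relationship between annual income and spending score.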
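Keep in mind that Pearson's coefficient only measures linear association, so a value near zero does not rule out other kinds of structure. The toy example below (invented for illustration, unrelated to our data) shows two variables that are perfectly dependent and yet have zero Pearson correlation. cor.test() additionally reports a confidence interval and a p-value.

```r
# y is fully determined by x, but the relationship is not linear
x <- -3:3
y <- x^2

r <- cor(x, y, method = "pearson")
print(r)   # essentially zero despite the exact dependence

# cor.test() adds a confidence interval and a p-value for the coefficient
ct <- cor.test(x, y, method = "pearson")
print(ct$conf.int)
```

For the customer data, this is a reason to read the scatter plot together with the coefficient rather than relying on the number alone.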