Home Assignment 1

Importing and cleaning the data

Let’s import and display our dataset.

customers <- read.csv("Mall_Customers.csv", header = TRUE)
head(customers)

We are using read.csv() to load our dataset, Mall_Customers.csv.

Our dataset contains 200 observations of 5 variables. It provides basic data about the shop's customers: customer ID, gender, age, annual income, and spending score.
The unit of observation in our case is one individual customer.
The sample size is 200.

Variables and definitions:

Variable name	Type	Description
Customer ID	Categorical (Nominal)	Unique identifier for each customer
Gender	Categorical (Nominal)	Customer’s gender
Age	Numerical (Ratio)	Customer’s age in years
Annual income	Numerical (Ratio)	Customer’s annual income in thousands of USD
Spending score	Numerical (Interval)	Score based on customer spending habits (1–100 scale)

Source of the data

The data was obtained from Kaggle.

Link: Mall Customer Segmentation Data

Data manipulation I

Renaming columns

The original column names are somewhat awkward (e.g., “Annual.Income..k..”), therefore we use names(customers)[...] <- ... to give them more readable titles.

names(customers)[names(customers) == "CustomerID"] <- "Customer ID"
names(customers)[names(customers) == "Annual.Income..k.."] <- "Annual Income (USD in thousands)"
names(customers)[names(customers) == "Spending.Score..1.100."] <- "Spending Score (1-100)"
colnames(customers)

## [1] "Customer ID"                      "Gender"                          
## [3] "Age"                              "Annual Income (USD in thousands)"
## [5] "Spending Score (1-100)"
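
The awkward original names come from read.csv(), which by default (check.names = TRUE) passes the column headers through make.names() to turn them into syntactically valid R names. A minimal sketch of that step, assuming the raw CSV headers were "Annual Income (k$)" and "Spending Score (1-100)" (an assumption inferred from the mangled names, not checked against the file):

```r
# Assumed raw headers (hypothetical; inferred from the mangled names above)
raw_headers <- c("Annual Income (k$)", "Spending Score (1-100)")

# make.names() replaces every character that is not valid in an R name with "."
mangled <- make.names(raw_headers)
print(mangled)
```

Reading the file with check.names = FALSE would keep the headers unchanged, at the cost of needing backticks around them from the start.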

Examining the structure

Let’s examine the structure of our dataset with str(), which shows how each column is stored (numeric, factor, character, etc.).

str(customers)

## 'data.frame':    200 obs. of  5 variables:
##  $ Customer ID                     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender                          : chr  "Male" "Male" "Female" "Female" ...
##  $ Age                             : int  19 21 20 23 31 22 35 23 64 30 ...
##  $ Annual Income (USD in thousands): int  15 15 16 16 17 17 18 18 19 19 ...
##  $ Spending Score (1-100)          : int  39 81 6 77 40 76 6 94 3 72 ...

Let’s define Gender as a factor with two levels, “Male” and “Female”. This tells R that Gender is categorical, which is useful for summary statistics and plotting.

customers$Gender <- factor(customers$Gender, levels = c("Male", "Female"))
str(customers)

## 'data.frame':    200 obs. of  5 variables:
##  $ Customer ID                     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender                          : Factor w/ 2 levels "Male","Female": 1 1 2 2 2 2 2 2 1 2 ...
##  $ Age                             : int  19 21 20 23 31 22 35 23 64 30 ...
##  $ Annual Income (USD in thousands): int  15 15 16 16 17 17 18 18 19 19 ...
##  $ Spending Score (1-100)          : int  39 81 6 77 40 76 6 94 3 72 ...
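
To illustrate what the conversion changes, here is a small sketch on a made-up vector (gender_chr is hypothetical and not part of the dataset). The factor stores integer codes that follow the order given in levels, which is why str() prints 1 for "Male" and 2 for "Female":

```r
# Hypothetical toy vector, used only for illustration
gender_chr <- c("Male", "Male", "Female", "Female", "Female")

# Explicit levels fix the order: "Male" becomes code 1, "Female" code 2
gender_fct <- factor(gender_chr, levels = c("Male", "Female"))

print(levels(gender_fct))      # level labels in the order we specified
print(as.integer(gender_fct))  # underlying integer codes
print(table(gender_fct))       # counts per level
```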

Checking and removing missing data

sum(is.na(customers))

## [1] 0

We see that there are no NAs in our dataset, but for the sake of practice let’s run the cleaning step anyway.

cat("Number of rows before removing NAs:", nrow(customers), "\n")

## Number of rows before removing NAs: 200

customers <- na.omit(customers)
cat("Number of rows after removing NAs:", nrow(customers), "\n")

## Number of rows after removing NAs: 200
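
Because our data has no missing values, the row count does not change. The following sketch shows what the same steps would do on a made-up data frame that does contain NAs (toy and its values are hypothetical):

```r
# Hypothetical toy data with two missing values
toy <- data.frame(
  Age    = c(19, NA, 20, 23),
  Income = c(15, 15, NA, 16)
)

# Count missing values per column rather than for the whole data frame
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# na.omit() drops every row that contains at least one NA
toy_clean <- na.omit(toy)
cat("Rows before:", nrow(toy), " Rows after:", nrow(toy_clean), "\n")
```

Counting NAs per column first shows which variable is responsible before any rows are dropped.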

Descriptive statistics

We load the psych package to use its describe() function, which returns statistics such as the mean, SD, median, min, and max.

library(psych)
describe(customers[ , -c(1,2)])

The average age of customers is 38.9 years old.
Mean annual income is 60.6 thousand USD.
The skew values indicate that the distributions of both age and annual income are skewed to the right.
The median income is 61.5 thousand USD, meaning that about half of the customers earn at most 61.5k USD and about half earn more.
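
To make the skew statistic less of a black box, here is a minimal sketch of a moment-based skewness on made-up vectors. The helper moment_skewness() is hypothetical, and psych may apply a slightly different small-sample correction, so the values need not match describe() exactly; only the sign is interpreted here:

```r
# Hypothetical helper: third central moment divided by (second central moment)^1.5
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

symmetric    <- c(1, 2, 3, 4, 5)         # made-up, perfectly symmetric
right_skewed <- c(1, 2, 2, 3, 3, 3, 10)  # made-up, long right tail

print(moment_skewness(symmetric))     # zero for a symmetric sample
print(moment_skewness(right_skewed))  # positive: tail extends to the right
```

A positive value means the right tail is longer, which is the sense in which "skewed to the right" is used above.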

library(pastecs)
round(stat.desc(customers[ , -c(1,2)]), 2)

Here we use stat.desc() from the pastecs package, which also offers a wide range of descriptive statistics.

summary(customers[ , -1])

##     Gender         Age        Annual Income (USD in thousands)
##  Male  : 88   Min.   :18.00   Min.   : 15.00                  
##  Female:112   1st Qu.:28.75   1st Qu.: 41.50                  
##               Median :36.00   Median : 61.50                  
##               Mean   :38.85   Mean   : 60.56                  
##               3rd Qu.:49.00   3rd Qu.: 78.00                  
##               Max.   :70.00   Max.   :137.00                  
##  Spending Score (1-100)
##  Min.   : 1.00         
##  1st Qu.:34.75         
##  Median :50.00         
##  Mean   :50.20         
##  3rd Qu.:73.00         
##  Max.   :99.00

We could also use the base R function summary(), which gives the min, 1st quartile, median, mean, 3rd quartile, and max for each numeric variable. We keep Gender to see how many rows fall into each factor level (88 males and 112 females).

tapply(customers$`Annual Income (USD in thousands)`, customers$Gender, summary)

## $Male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   45.50   62.50   62.23   78.00  137.00 
## 
## $Female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   39.75   60.00   59.25   77.25  126.00

Here, we compute summary statistics for annual income grouped by Gender.

This lets us quickly compare incomes between male and female customers: on average, male customers have a slightly higher annual income than female customers (62.23 vs. 59.25 thousand USD).
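
The same kind of grouped comparison can also be written with aggregate(), which returns a data frame instead of a list. A minimal sketch on made-up numbers (toy_income and its values are hypothetical, not taken from the dataset):

```r
# Hypothetical toy data, used only to show the mechanics
toy_income <- data.frame(
  gender = c("Male", "Male", "Female", "Female"),
  income = c(60, 64, 58, 60)
)

# One row per group, with the mean income of that group
group_means <- aggregate(income ~ gender, data = toy_income, FUN = mean)
print(group_means)

# Difference in group means (Male minus Female)
mean_gap <- group_means$income[group_means$gender == "Male"] -
  group_means$income[group_means$gender == "Female"]
print(mean_gap)
```

A difference in sample means on its own does not say whether the gap is statistically meaningful.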

Data manipulation II

Let’s create a new categorical variable called IncomeCategory based on annual income.

library(dplyr)
customers <- customers %>%
  mutate(IncomeCategory = ifelse(`Annual Income (USD in thousands)` < 30, "Low", 
                          ifelse(`Annual Income (USD in thousands)` <= 90, "Medium", "High")))

customers$IncomeCategory <- as.factor(customers$IncomeCategory)

table(customers$IncomeCategory)

## 
##   High    Low Medium 
##     22     30    148

prop.table(table(customers$IncomeCategory))

## 
##   High    Low Medium 
##   0.11   0.15   0.74

Here, we categorize income as “Low” (below $30k), “Medium” (from $30k to $90k inclusive), and “High” (above $90k).

We can see that most of the customers (74%) fall into the “Medium” income category.
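
Note that as.factor() orders the levels alphabetically (High, Low, Medium), which is why the tables print in that order. A small sketch of one way to get the logical order instead, using made-up incomes (income_k is hypothetical) and the same cut-offs as above:

```r
# Hypothetical incomes in thousands of USD, chosen to hit the boundaries
income_k <- c(15, 29, 30, 61, 90, 91, 137)

# Same rule as above: < 30 is Low, 30 to 90 inclusive is Medium, > 90 is High
income_category <- ifelse(income_k < 30, "Low",
                   ifelse(income_k <= 90, "Medium", "High"))

# Explicit levels give the order Low < Medium < High in tables and plots
income_category <- factor(income_category, levels = c("Low", "Medium", "High"))
print(table(income_category))
```

With the levels set this way, tables and bar charts list the categories from lowest to highest income.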

Visualizations

Let’s visualize the distribution of customers’ age through the histogram.

library(ggplot2)
ggplot(customers, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "#9FE2BF", color = "black") +
  labs(
    title = "Distribution of customer age",
    x     = "Age",
    y     = "Count"
  ) +
  theme_minimal()

We can see that the distribution is slightly positively skewed (a longer tail to the right): many customers are under 40, though older customers are also well represented in our dataset.

ggplot(customers, aes(x = Gender, y = `Spending Score (1-100)`, fill = Gender)) +
  geom_boxplot() +
  labs(
    title = "Spending scores by gender",
    x     = "Gender",
    y     = "Spending score (1-100)"
  ) +
  theme_minimal() +
  scale_fill_manual(values = c("Female" = "#DE3163", "Male" = "#6495ED")) +
  theme(legend.position = "none")  # removes the legend

Looking at the center line for each group, we can say that both groups tend to have a similar median spending score.
Judging by the box height, the interquartile range of the male group seems slightly larger than that of the female group.
Neither group shows any outliers.
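
A boxplot flags a point as an outlier when it lies more than 1.5 times the box height beyond the box. The sketch below applies that rule with base R's boxplot.stats() to made-up scores (scores and with_outlier are hypothetical). boxplot.stats() computes the box edges slightly differently from ggplot2, so borderline cases may differ between the two:

```r
# Hypothetical spending scores with no extreme values
scores <- c(1, 20, 35, 50, 50, 60, 73, 99)

# $out holds the points beyond 1.5 box heights from the box edges
outliers_none <- boxplot.stats(scores)$out
print(outliers_none)

# Adding one extreme (impossible on a 1-100 scale) value for illustration
with_outlier <- c(scores, 300)
outliers_one <- boxplot.stats(with_outlier)$out
print(outliers_one)
```

Running the same check per gender on the real data would confirm numerically what the plot shows.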

ggplot(customers, aes(x = IncomeCategory, fill = IncomeCategory)) +
  geom_bar() +
  labs(
    title = "Number of customers by income category",
    x     = "Income category",
    y     = "Count"
  ) +
  theme_minimal() +
  scale_fill_manual(values = c("High" = "#4ba934", "Low" = "#FF5733", "Medium" = "#6495ED")) +
  theme(legend.position = "none")  # removes the legend

We can see that “Medium” has the tallest bar, meaning that, as mentioned above, most of the customers earn between $30k and $90k annually.

ggplot(customers, aes(x = `Annual Income (USD in thousands)`, y = `Spending Score (1-100)`, color = Gender)) +
  geom_point(size = 2, alpha = 0.8) +
  labs(
    x     = "Annual income (k$)",
    y     = "Spending score (1-100)",
    color = "Gender"
  ) +
  theme_minimal()

In our dataset we see no clear linear relationship between a customer's annual income (horizontal axis) and spending score (vertical axis).
Furthermore, male and female customers do not appear to be distributed differently across the income–spending space.

To support this point, let’s calculate the correlation between the two variables.

cor(customers$`Annual Income (USD in thousands)`,
    customers$`Spending Score (1-100)`,
    method = "pearson")

## [1] 0.009902848

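The coefficient is very close to zero (about 0.01), which supports our observation that annual income and spending score have no linear relationship in this dataset.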
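
One caveat when reading this number: Pearson's coefficient measures only linear association, so a value near zero does not by itself rule out other kinds of structure. A minimal sketch with made-up data (x and y are hypothetical) in which y is fully determined by x, yet the coefficient is zero:

```r
# Hypothetical data: y depends perfectly on x, but not linearly
x <- -3:3
y <- x^2

# Pearson correlation is zero because the relationship is symmetric around x = 0
r <- cor(x, y, method = "pearson")
print(r)
```

This is why the scatterplot above is worth inspecting alongside the coefficient.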