Data wrangling

Upload dataset

df <- read_csv("../Datasets/Blackwell_Demographic_Data.csv")

## Parsed with column specification:
## cols(
##   `in-store` = col_integer(),
##   age = col_integer(),
##   items = col_integer(),
##   amount = col_double(),
##   region = col_integer()
## )

Information of the dataset

This dataset is a data frame of 80000 observations and 5 variables. The variables are in-store, age, items, amount, region:

In-store: The “in-store” column indicates where each transaction was made online (0), or in-store (1).
Age: The “age” column indicates the age of the customer who made the transaction.
Items: The “items” column tracks the number of items the customer purchased.
Amount: The “amount” column records the amount of money spent on the transaction.
Region: The “region” column indicates in which of the four regions the purchase was made (1 = East, 2 = West, 3 = South, 4 = Central).

Some information about the company: Blackwell has been a successful electronics retailer for over 40 years, with over 30 stores in the Southeast. A little over a year ago we launched our eCommerce website.

Total amount: 6.687357610^{7}$.

Questions to answer about the dataset:

Do customers in different regions spend more per transaction? Which regions spend the most/least?
Are there differences in the age of customers between regions? If so, can we predict the age of a customer in a region based on other demographic data?
We need to investigate Martin’s hypothesis: Is there any correlation between age of a customer and if the transaction was made online or in the store? Do any other factors predict if a customer will buy online or in our stores?
Finally, is there a relationship between number of items purchased and amount spent?

More information about the dataset:

There is 0 missing values.

Summary of the dataset:

knitr::kable (summary(df))

`in-store </th>`	`age </th>`	`items </th>`	`amount </th>`	`region </th>`
Min. :0.0000	Min. :18.00	Min. :1.000	Min. : 5.0	Min. :1.000
1st Qu.:0.0000	1st Qu.:33.00	1st Qu.:3.000	1st Qu.: 285.1	1st Qu.:2.000
Median :0.0000	Median :45.00	Median :4.000	Median : 582.3	Median :3.000
Mean :0.4561	Mean :45.76	Mean :4.505	Mean : 835.9	Mean :2.675
3rd Qu.:1.0000	3rd Qu.:56.00	3rd Qu.:6.000	3rd Qu.:1233.7	3rd Qu.:4.000
Max. :1.0000	Max. :85.00	Max. :8.000	Max. :3000.0	Max. :4.000

Notes:

In store, items and region are a factors with different levels –> change that.
Which is the distribution of age for all the dataset? And for each region?
Distribution of amount by region and for all the dataset.

Formatting the data

Factorizing variables:

df$`in-store` <- factor(df$`in-store`, levels = c(0,1), labels = c("Online","In-store"))
df$items <- factor(df$items)
df$region <- factor(df$region, levels = c(1,2,3,4), labels = c("East","West","South","Central"))

Now the information is easier to understand:

kable (summary(df))

`in-store </th>`	`age </th>`	`items </th>`	`amount </th>`	`region </th>`
Online :43516	Min. :18.00	4 :11596	Min. : 5.0	East :16000
In-store:36484	1st Qu.:33.00	6 :11522	1st Qu.: 285.1	West :20000
NA	Median :45.00	3 :11487	Median : 582.3	South :18000
NA	Mean :45.76	7 :11378	Mean : 835.9	Central:26000
NA	3rd Qu.:56.00	2 :11290	3rd Qu.:1233.7	NA
NA	Max. :85.00	5 :11238	Max. :3000.0	NA
NA	NA	(Other):11489	NA	NA

Dataset visualizations

Analysis of the amount by region

df %>%
  group_by(region) %>%
  summarise(TotalAmount = sum(amount)) %>%
  arrange(desc(TotalAmount)) %>%
  print() %>%
  ggplot(aes(region,TotalAmount, fill = region))+
    geom_bar(stat = "identity")

## # A tibble: 4 x 2
##   region  TotalAmount
##   <fct>         <dbl>
## 1 Central   33385352.
## 2 South     16523454.
## 3 East      11922584.
## 4 West       5042186.

Distribution of the amount by region

df %>%
  ggplot(aes(amount, color =  region))+
    geom_freqpoly(binwidth = 50)

West region amount range:

kable(df %>%
  filter(region == "West") %>%
  filter(amount == max(amount) | amount == min(amount)) %>%
  select(amount)  %>%
  arrange(desc(amount)))

amount
499.94
5.00

East region amount range:

amount
1999.80
50.05

South region amount range:

amount
3000.00
50.13

Central region amount range:

amount
3000.0
50.6

Analysis of the amount by in-store

df %>%
  group_by(`in-store`) %>%
  summarise(TotalAmount = sum(amount)) %>%
  arrange(desc(TotalAmount)) %>%
  print() %>%
  ggplot(aes(`in-store`,TotalAmount, fill = `in-store`))+
    geom_bar(stat = "identity")

## # A tibble: 2 x 2
##   `in-store` TotalAmount
##   <fct>            <dbl>
## 1 Online       38531657.
## 2 In-store     28341919.

Amount related to in-store by region

df %>%
  group_by(`in-store`,region) %>%
  summarise(TotalAmount = sum(amount)) %>%
  arrange(desc(TotalAmount)) %>%
  print() %>%
  ggplot(aes(`in-store`,TotalAmount, fill = `in-store`))+
    geom_bar(stat = "identity") +
    facet_wrap(~region, ncol = 2, nrow = 2)

## # A tibble: 7 x 3
## # Groups:   in-store [2]
##   `in-store` region  TotalAmount
##   <fct>      <fct>         <dbl>
## 1 Online     Central   20067064.
## 2 In-store   Central   13318288.
## 3 Online     South     10793540.
## 4 In-store   East       9293717.
## 5 In-store   South      5729913.
## 6 Online     West       5042186.
## 7 Online     East       2628867.

Conclusion:

West region only buys online.
Central sells are 60% Online and 40% In-store.
East sells are 22% Online and 78% In-store.
South sells are are 65% Online and 35% In-store.

Analysis of the amount related to the number of items purchased

df %>%
  group_by(items,region) %>%
  summarise(TotalAmount = sum(amount)) %>%
  arrange(desc(TotalAmount)) %>%
  ggplot(aes(items,TotalAmount, fill = items))+
    geom_bar(stat = "identity")+
    facet_wrap(~region, ncol = 2)

There is no relevant relation between itemps and total amount.

Relation between age and amount

Distribution of the population by regions

df %>%
  ggplot(aes(age)) +
    geom_histogram(bins = 30, color = "blue", fill = "white") +
    facet_wrap(~region, ncol = 2)

Central population analysis:

kable(df %>%
  filter(region == "Central") %>%
  filter(age == max(age) | age == min(age)) %>%
  select(age)  %>%
  unique() %>%
  arrange(desc(age)))

age
63
18

East population analysis:

age
74
19

South population analysis

age
74
18

West population analysis

age
85
28

Amount by population

To make data more understand, we will discretize the age by different groups:

Clustering the age

df$age <- cut(df$age, breaks = 4)
summary(df$age)

## (17.9,34.8] (34.8,51.5] (51.5,68.2] (68.2,85.1] 
##       22454       29723       19848        7975

Showing the amount by population groups

df %>%
  group_by(region, age) %>%
  summarise(TotalAmount = sum(amount)) %>%
  arrange(desc(TotalAmount)) %>%
  print() %>%
  ggplot(aes(age, TotalAmount)) +
    geom_bar(stat = "identity", aes(fill = age), show.legend = TRUE, width = .5) +
    facet_wrap(~region, ncol = 2)

## # A tibble: 15 x 3
## # Groups:   region [4]
##    region  age         TotalAmount
##    <fct>   <fct>             <dbl>
##  1 Central (34.8,51.5]   13904361.
##  2 Central (17.9,34.8]   13081734.
##  3 Central (51.5,68.2]    6399258.
##  4 South   (34.8,51.5]    6047671.
##  5 South   (17.9,34.8]    5161244.
##  6 East    (34.8,51.5]    4990490.
##  7 South   (51.5,68.2]    4697042.
##  8 East    (17.9,34.8]    4128797.
##  9 East    (51.5,68.2]    2298870.
## 10 West    (51.5,68.2]    1529149.
## 11 West    (68.2,85.1]    1471742.
## 12 West    (34.8,51.5]    1461438.
## 13 South   (68.2,85.1]     617497.
## 14 West    (17.9,34.8]     579857.
## 15 East    (68.2,85.1]     504427.

Creation of customer profiles to predict the amount spend

Using decision trees:

library(rpart)
library(rpart.plot)
library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

training.ids <- createDataPartition(df$amount, 
                                    p = 0.7,
                                    list = F)
mod <- rpart(amount ~., 
             data = df[training.ids,],
             method = "anova",
             control = rpart.control(minsplit = 20, cp = 0.01))

prp(mod, type = 2, extra = 100, nn = TRUE, 
    fallen.leaves = TRUE, faclen = 4, varlen = 8, 
    shadow.col = "gray", roundint=FALSE)

Investigate customer buying patterns

Joan Claverol

7 de agosto de 2018