We are going to perform exploratory data analysis to a marketing campaign dataset of 2,240 customers of a wholesaler company.

So we first have a glimpse at the dataset, and to the structure of our dataset -using the R STR command and checking for missing values:

tabla1 <- as.data.frame(head(marketing_data, 10))
rmarkdown::paged_table(tabla1)
tabla2 <- as.data.frame(str(marketing_data))
## spec_tbl_df[,28] [2,240 x 28] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ID                 : num [1:2240] 1826 1 10476 1386 5371 ...
##  $ Year_Birth         : num [1:2240] 1970 1961 1958 1967 1989 ...
##  $ Education          : chr [1:2240] "Graduation" "Graduation" "Graduation" "Graduation" ...
##  $ Marital_Status     : chr [1:2240] "Divorced" "Single" "Married" "Together" ...
##  $ Income             : num [1:2240] 84835 57091 67267 32474 21474 ...
##  $ Kidhome            : num [1:2240] 0 0 0 1 1 0 0 0 0 0 ...
##  $ Teenhome           : num [1:2240] 0 0 1 1 0 0 0 1 1 1 ...
##  $ Dt_Customer        : Date[1:2240], format: "2014-06-16" "2014-06-15" ...
##  $ Recency            : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
##  $ MntWines           : num [1:2240] 189 464 134 10 6 336 769 78 384 384 ...
##  $ MntFruits          : num [1:2240] 104 5 11 0 16 130 80 0 0 0 ...
##  $ MntMeatProducts    : num [1:2240] 379 64 59 1 24 411 252 11 102 102 ...
##  $ MntFishProducts    : num [1:2240] 111 7 15 0 11 240 15 0 21 21 ...
##  $ MntSweetProducts   : num [1:2240] 189 0 2 0 0 32 34 0 32 32 ...
##  $ MntGoldProds       : num [1:2240] 218 37 30 0 34 43 65 7 5 5 ...
##  $ NumDealsPurchases  : num [1:2240] 1 1 1 1 2 1 1 1 3 3 ...
##  $ NumWebPurchases    : num [1:2240] 4 7 3 1 3 4 10 2 6 6 ...
##  $ NumCatalogPurchases: num [1:2240] 4 3 2 0 1 7 10 1 2 2 ...
##  $ NumStorePurchases  : num [1:2240] 6 7 5 2 2 5 7 3 9 9 ...
##  $ NumWebVisitsMonth  : num [1:2240] 1 5 2 7 7 2 6 5 4 4 ...
##  $ AcceptedCmp3       : num [1:2240] 0 0 0 0 1 0 1 0 0 0 ...
##  $ AcceptedCmp4       : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp5       : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp1       : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp2       : num [1:2240] 0 1 0 0 0 0 0 0 0 0 ...
##  $ Response           : num [1:2240] 1 1 0 0 1 1 1 0 0 0 ...
##  $ Complain           : num [1:2240] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Country            : chr [1:2240] "Spain" "Canada" "USA" "Australia" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ID = col_double(),
##   ..   Year_Birth = col_double(),
##   ..   Education = col_character(),
##   ..   Marital_Status = col_character(),
##   ..   Income = col_double(),
##   ..   Kidhome = col_double(),
##   ..   Teenhome = col_double(),
##   ..   Dt_Customer = col_date(format = "%Y-%m-%d"),
##   ..   Recency = col_double(),
##   ..   MntWines = col_double(),
##   ..   MntFruits = col_double(),
##   ..   MntMeatProducts = col_double(),
##   ..   MntFishProducts = col_double(),
##   ..   MntSweetProducts = col_double(),
##   ..   MntGoldProds = col_double(),
##   ..   NumDealsPurchases = col_double(),
##   ..   NumWebPurchases = col_double(),
##   ..   NumCatalogPurchases = col_double(),
##   ..   NumStorePurchases = col_double(),
##   ..   NumWebVisitsMonth = col_double(),
##   ..   AcceptedCmp3 = col_double(),
##   ..   AcceptedCmp4 = col_double(),
##   ..   AcceptedCmp5 = col_double(),
##   ..   AcceptedCmp1 = col_double(),
##   ..   AcceptedCmp2 = col_double(),
##   ..   Response = col_double(),
##   ..   Complain = col_double(),
##   ..   Country = col_character()
##   .. )
rmarkdown::paged_table(tabla2)

Each record represents one customer and contains fields that can be classified as follows:

Customer Profile

ID: Customer’s unique identifier
Year_Birth: Customer’s birth year
Education: Customer’s education level
Marital_Status: Customer’s marital status
Income: Customer’s yearly household income
Kidhome: Number of children in customer’s household
Teenhome: Number of teenagers in customer’s household
Dt_Customer: Date of customer’s enrollment with the company
Recency: Number of days since customer’s last purchase
Complain: 1 if customer complained in the last 2 years, 0 otherwise
Country: Customer’s location

Product Preferences

MntWines: Amount spent on wine in the last 2 years
MntFruits: Amount spent on fruits in the last 2 years
MntMeatProducts: Amount spent on meat in the last 2 years
MntFishProducts: Amount spent on fish in the last 2 years
MntSweetProducts: Amount spent on sweets in the last 2 years
MntGoldProds: Amount spent on gold in the last 2 years

Channel Performance

NumWebPurchases: Number of purchases made through the company’s web site
NumCatalogPurchases: Number of purchases made using a catalogue
NumStorePurchases: Number of purchases made directly in stores
NumWebVisitsMonth: Number of visits to company’s web site in the last month

Campaign Success

NumDealsPurchases: Number of purchases made with a discount
AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
Response: 1 if customer accepted the offer in the last campaign, 0 otherwise

sapply(marketing_data, function(x) sum(is.na(x) ) ) 
##                  ID          Year_Birth           Education      Marital_Status 
##                   0                   0                   0                   0 
##              Income             Kidhome            Teenhome         Dt_Customer 
##                  24                   0                   0                   0 
##             Recency            MntWines           MntFruits     MntMeatProducts 
##                   0                   0                   0                   0 
##     MntFishProducts    MntSweetProducts        MntGoldProds   NumDealsPurchases 
##                   0                   0                   0                   0 
##     NumWebPurchases NumCatalogPurchases   NumStorePurchases   NumWebVisitsMonth 
##                   0                   0                   0                   0 
##        AcceptedCmp3        AcceptedCmp4        AcceptedCmp5        AcceptedCmp1 
##                   0                   0                   0                   0 
##        AcceptedCmp2            Response            Complain             Country 
##                   0                   0                   0                   0
# highlight <- color_tile("yellow","yellow")
# formattable(tabla3, list( area(col = 5, row = 1 ) ~ highlight ))

   

And so firstly we see that:

   

summary(marketing_data)
##        ID          Year_Birth    Education         Marital_Status    
##  Min.   :    0   Min.   :1893   Length:2240        Length:2240       
##  1st Qu.: 2828   1st Qu.:1959   Class :character   Class :character  
##  Median : 5458   Median :1970   Mode  :character   Mode  :character  
##  Mean   : 5592   Mean   :1969                                        
##  3rd Qu.: 8428   3rd Qu.:1977                                        
##  Max.   :11191   Max.   :1996                                        
##                                                                      
##      Income          Kidhome          Teenhome       Dt_Customer        
##  Min.   :  1730   Min.   :0.0000   Min.   :0.0000   Min.   :2012-07-30  
##  1st Qu.: 35303   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:2013-01-16  
##  Median : 51382   Median :0.0000   Median :0.0000   Median :2013-07-08  
##  Mean   : 52247   Mean   :0.4442   Mean   :0.5062   Mean   :2013-07-10  
##  3rd Qu.: 68522   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:2013-12-30  
##  Max.   :666666   Max.   :2.0000   Max.   :2.0000   Max.   :2014-06-29  
##  NA's   :24                                                             
##     Recency         MntWines         MntFruits     MntMeatProducts 
##  Min.   : 0.00   Min.   :   0.00   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.:24.00   1st Qu.:  23.75   1st Qu.:  1.0   1st Qu.:  16.0  
##  Median :49.00   Median : 173.50   Median :  8.0   Median :  67.0  
##  Mean   :49.11   Mean   : 303.94   Mean   : 26.3   Mean   : 166.9  
##  3rd Qu.:74.00   3rd Qu.: 504.25   3rd Qu.: 33.0   3rd Qu.: 232.0  
##  Max.   :99.00   Max.   :1493.00   Max.   :199.0   Max.   :1725.0  
##                                                                    
##  MntFishProducts  MntSweetProducts  MntGoldProds    NumDealsPurchases
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.000   
##  1st Qu.:  3.00   1st Qu.:  1.00   1st Qu.:  9.00   1st Qu.: 1.000   
##  Median : 12.00   Median :  8.00   Median : 24.00   Median : 2.000   
##  Mean   : 37.53   Mean   : 27.06   Mean   : 44.02   Mean   : 2.325   
##  3rd Qu.: 50.00   3rd Qu.: 33.00   3rd Qu.: 56.00   3rd Qu.: 3.000   
##  Max.   :259.00   Max.   :263.00   Max.   :362.00   Max.   :15.000   
##                                                                      
##  NumWebPurchases  NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
##  Min.   : 0.000   Min.   : 0.000      Min.   : 0.00     Min.   : 0.000   
##  1st Qu.: 2.000   1st Qu.: 0.000      1st Qu.: 3.00     1st Qu.: 3.000   
##  Median : 4.000   Median : 2.000      Median : 5.00     Median : 6.000   
##  Mean   : 4.085   Mean   : 2.662      Mean   : 5.79     Mean   : 5.317   
##  3rd Qu.: 6.000   3rd Qu.: 4.000      3rd Qu.: 8.00     3rd Qu.: 7.000   
##  Max.   :27.000   Max.   :28.000      Max.   :13.00     Max.   :20.000   
##                                                                          
##   AcceptedCmp3      AcceptedCmp4      AcceptedCmp5      AcceptedCmp1    
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.07277   Mean   :0.07455   Mean   :0.07277   Mean   :0.06429  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##                                                                         
##   AcceptedCmp2        Response         Complain          Country         
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.000000   Length:2240       
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.000000   Class :character  
##  Median :0.00000   Median :0.0000   Median :0.000000   Mode  :character  
##  Mean   :0.01339   Mean   :0.1491   Mean   :0.009375                     
##  3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.000000                     
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.000000                     
## 

   

INCOME has a couple of abnormally low values.
YEAR_BIRTH has values <1900 –that is certainly incorrect.

 

Let’s do an initial plot to see the distribution of each variable:

   

   

Obviously, this plot is not a good one. There are different orders of magnitude between several variables.

But it certainly shows that INCOME is the most disperse variable

   

    There is a high value which is an extreme outlier

The value is 666666

We remove it and draw the histogram again:

 

marketing_data <- marketing_data %>% filter(Income < 200000) 
hist(marketing_data$Income, col = 'blue', main = "Histogram of INCOME variable", xlab = "INCOME in AUD")

hist(marketing_data$Year_Birth, col = 'red', main = "Histogram of YEAR of BIRTH variable", xlab = "Decade of Birth")

 

As for YEAR of BIRTH there are 3 outliers that are wrong -since there can’t be customers aged >100 years old.

 

So we plot without those:

   

   

INCOME has a non-Normal distribution –as expected

We reinstate the 3 rows that we took out -we cannot impute using AVERAGE (since the distribution of this variable is not normal) -we will impute using the mode:

getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))] }

getmode(marketing_data$Year_Birth)
## [1] 1976
### The mode is '1976'

marketing_data<- marketing_data %>% mutate(Year_Birth = replace(Year_Birth, Year_Birth <1901, 1976))

   

Now we look at the distribution of the Total Nr. of Purchases and TOTAL PURCHASED AMOUNT variables:

### DISTRIBUTION of 'TOTAL NR of PURCHASES' variables  ####
Nr_ventas <- melt(marketing_data[ , c(16:19)])
ggplot(Nr_ventas,aes(x = value)) + facet_wrap(~variable,scales = "free_x") +  geom_histogram() + ggtitle("HISTOGRAMS of variables containing the TOTAL # of PURCHASES")

### DISTRIBUTION of 'AMOUNT' variables  #####
monto_ventas <- melt(marketing_data[ , c(10:15)])
ggplot(monto_ventas,aes(x = value)) + facet_wrap(~variable,scales = "free_x") +  geom_histogram() + ggtitle("DISTRIBUTION - 'TOTAL AMOUNT SPENT per PURCHASE'")

 

Which products are performing best?

tabla5 <- as.data.frame(colSums(marketing_data[ , 10:15]))
highlight <- color_tile("yellow","yellow")
formattable(tabla5, list(
  area(col = 1, row = 1 ) ~ highlight,
  area(col = 1, row = 3 ) ~ highlight
))
colSums(marketing_data[, 10:15])
MntWines 676074
MntFruits 58391
MntMeatProducts 370045
MntFishProducts 83397
MntSweetProducts 59895
MntGoldProds 97415

MEAT and WINES are the best performers

     

Which Marketing Campaing was more succesful?

     

The last 3 campaigns were the most successful ones.  

CUSTOMERS’ PROFILE

marketing_play2 <- marketing_data %>% select(ID,Education,Marital_Status, Country)
### If we want to display all 3 variables at once, we have to modify the dataset from "wide" to "long"
marketing_long <- gather(marketing_play2, key="measure", value="value", -ID)
head(marketing_long,10)
## # A tibble: 10 x 3
##       ID measure   value     
##    <dbl> <chr>     <chr>     
##  1  1826 Education Graduation
##  2     1 Education Graduation
##  3 10476 Education Graduation
##  4  1386 Education Graduation
##  5  5371 Education Graduation
##  6  7348 Education PhD       
##  7  4073 Education 2n Cycle  
##  8  1991 Education Graduation
##  9  4047 Education PhD       
## 10  9477 Education PhD
ggplot(marketing_long, aes(x = value, fill = ID)) + geom_bar(fill='green') + facet_wrap(~ measure, scales = "free_x") + theme(axis.text.x = element_text(angle = 90, face = "bold"))

      For the INCOME variable, we need to plot differently since this is a integer variable (not a categorial one):

## [1] "The average income of the customers is AUD"
## [2] "51969.8613995485"

## [1] "We can't use the mean because the distribution of income is not normal"

   

CORRELATIONS

Is there a correlation between the customers’ INCOME and AMOUNT PURCHASED?

 

ggplot(marketing_data,aes(x = Income, y = MntWines))  +  geom_point()

ggplot(marketing_data,aes(x = Income, y = MntFruits))  +  geom_point()

ggplot(marketing_data,aes(x = Income, y = MntMeatProducts))  +  geom_point()

ggplot(marketing_data,aes(x = Income, y = MntGoldProds))  +  geom_point()

ggplot(marketing_data,aes(x = Income, y = MntSweetProducts))  +  geom_point()

 

There seems to be positive correlation between INCOME and AMOUNT PURCHASED

Another way of visualizing it -there appears to be an exponential correlation:

marketing_play <- marketing_data %>% filter(Income < 100000)
ggplot(data = marketing_play, mapping = aes(x = Income, y = MntWines)) + geom_boxplot(mapping = aes(group = cut_width(Income, 2000)))

 

Fitting an exponential model between INCOME and any one of the AMOUNT PURCHASED channels produces a not very solid model:

## 
## Call:
## lm(formula = log(MntWines) ~ Income, data = marketing_play2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.6505  -0.6433   0.0968   0.7998   3.0364 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 1.18109733 0.06655073   17.75 <0.0000000000000002 ***
## Income      0.00006662 0.00000118   56.46 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.186 on 2200 degrees of freedom
## Multiple R-squared:  0.5916, Adjusted R-squared:  0.5915 
## F-statistic:  3187 on 1 and 2200 DF,  p-value: < 0.00000000000000022

With only 59% of explained of values explained by it.