MarketingData EDA

marketDf <-  read.csv("C:/Users/PC/Documents/marketing_data.csv")

str(marketDf)

## 'data.frame':    2240 obs. of  28 variables:
##  $ ï..ID              : int  1826 1 10476 1386 5371 7348 4073 1991 4047 9477 ...
##  $ Year_Birth         : int  1970 1961 1958 1967 1989 1958 1954 1967 1954 1954 ...
##  $ Education          : Factor w/ 5 levels "2n Cycle","Basic",..: 3 3 3 3 3 5 1 3 5 5 ...
##  $ Marital_Status     : Factor w/ 8 levels "Absurd","Alone",..: 3 5 4 6 5 5 4 6 4 4 ...
##  $ Income             : Factor w/ 1975 levels "","$1,730.00 ",..: 1891 1141 1447 395 122 1579 1332 790 1388 1388 ...
##  $ Kidhome            : int  0 0 0 1 1 0 0 0 0 0 ...
##  $ Teenhome           : int  0 0 1 1 0 0 0 1 1 1 ...
##  $ Dt_Customer        : Factor w/ 663 levels "1/1/13","1/1/14",..: 473 471 410 406 398 298 40 18 6 6 ...
##  $ Recency            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ MntWines           : int  189 464 134 10 6 336 769 78 384 384 ...
##  $ MntFruits          : int  104 5 11 0 16 130 80 0 0 0 ...
##  $ MntMeatProducts    : int  379 64 59 1 24 411 252 11 102 102 ...
##  $ MntFishProducts    : int  111 7 15 0 11 240 15 0 21 21 ...
##  $ MntSweetProducts   : int  189 0 2 0 0 32 34 0 32 32 ...
##  $ MntGoldProds       : int  218 37 30 0 34 43 65 7 5 5 ...
##  $ NumDealsPurchases  : int  1 1 1 1 2 1 1 1 3 3 ...
##  $ NumWebPurchases    : int  4 7 3 1 3 4 10 2 6 6 ...
##  $ NumCatalogPurchases: int  4 3 2 0 1 7 10 1 2 2 ...
##  $ NumStorePurchases  : int  6 7 5 2 2 5 7 3 9 9 ...
##  $ NumWebVisitsMonth  : int  1 5 2 7 7 2 6 5 4 4 ...
##  $ AcceptedCmp3       : int  0 0 0 0 1 0 1 0 0 0 ...
##  $ AcceptedCmp4       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp5       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp1       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp2       : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ Response           : int  1 1 0 0 1 1 1 0 0 0 ...
##  $ Complain           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Country            : Factor w/ 8 levels "AUS","CA","GER",..: 7 2 8 1 7 7 3 7 8 4 ...

Data Cleaning

colSums(is.na(marketDf)) ## This result is Inaccurate and as we will see later some actual NA values are been masked.

##               ï..ID          Year_Birth           Education      Marital_Status 
##                   0                   0                   0                   0 
##              Income             Kidhome            Teenhome         Dt_Customer 
##                   0                   0                   0                   0 
##             Recency            MntWines           MntFruits     MntMeatProducts 
##                   0                   0                   0                   0 
##     MntFishProducts    MntSweetProducts        MntGoldProds   NumDealsPurchases 
##                   0                   0                   0                   0 
##     NumWebPurchases NumCatalogPurchases   NumStorePurchases   NumWebVisitsMonth 
##                   0                   0                   0                   0 
##        AcceptedCmp3        AcceptedCmp4        AcceptedCmp5        AcceptedCmp1 
##                   0                   0                   0                   0 
##        AcceptedCmp2            Response            Complain             Country 
##                   0                   0                   0                   0

In the following chunk we are going to: 1. Make a Copy 1. Drop the id column 1. Strip the Income Variable of the $ and reClassify it. 1. Reclassify Dt_Customer. 1. Create a new Variable Age from Year_Birth variable. 1. Drop Missing Values

## Make a copy
df <- marketDf

#1
## drop Id
# marketDf <- marketDf[,!(names(marketDf) %in% "ï..ID")]
df = df[-c(1)]


#2
## Reclassify Income and strip the "$" sign
df[, "Income"] <- as.numeric(df[, "Income"] %>%
                               gsub("[$,]", "",.))


#3
## Reclassify Dt_Customer to a date format
library(lubridate)
df$Dt_Customer <- as.Date(parse_date_time(df$Dt_Customer, "mdy"))


#4
## Create Age Variable from YOB
df <- df %>% mutate(df, Age = 2020 - df$Year_Birth)
str(df)

## 'data.frame':    2240 obs. of  28 variables:
##  $ Year_Birth         : int  1970 1961 1958 1967 1989 1958 1954 1967 1954 1954 ...
##  $ Education          : Factor w/ 5 levels "2n Cycle","Basic",..: 3 3 3 3 3 5 1 3 5 5 ...
##  $ Marital_Status     : Factor w/ 8 levels "Absurd","Alone",..: 3 5 4 6 5 5 4 6 4 4 ...
##  $ Income             : num  84835 57091 67267 32474 21474 ...
##  $ Kidhome            : int  0 0 0 1 1 0 0 0 0 0 ...
##  $ Teenhome           : int  0 0 1 1 0 0 0 1 1 1 ...
##  $ Dt_Customer        : Date, format: "2014-06-16" "2014-06-15" ...
##  $ Recency            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ MntWines           : int  189 464 134 10 6 336 769 78 384 384 ...
##  $ MntFruits          : int  104 5 11 0 16 130 80 0 0 0 ...
##  $ MntMeatProducts    : int  379 64 59 1 24 411 252 11 102 102 ...
##  $ MntFishProducts    : int  111 7 15 0 11 240 15 0 21 21 ...
##  $ MntSweetProducts   : int  189 0 2 0 0 32 34 0 32 32 ...
##  $ MntGoldProds       : int  218 37 30 0 34 43 65 7 5 5 ...
##  $ NumDealsPurchases  : int  1 1 1 1 2 1 1 1 3 3 ...
##  $ NumWebPurchases    : int  4 7 3 1 3 4 10 2 6 6 ...
##  $ NumCatalogPurchases: int  4 3 2 0 1 7 10 1 2 2 ...
##  $ NumStorePurchases  : int  6 7 5 2 2 5 7 3 9 9 ...
##  $ NumWebVisitsMonth  : int  1 5 2 7 7 2 6 5 4 4 ...
##  $ AcceptedCmp3       : int  0 0 0 0 1 0 1 0 0 0 ...
##  $ AcceptedCmp4       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp5       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp1       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp2       : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ Response           : int  1 1 0 0 1 1 1 0 0 0 ...
##  $ Complain           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Country            : Factor w/ 8 levels "AUS","CA","GER",..: 7 2 8 1 7 7 3 7 8 4 ...
##  $ Age                : num  50 59 62 53 31 62 66 53 66 66 ...

#5
## Check again for missing values
na_var <- colnames(df)[colSums(is.na(df)) > 0]
cat(na_var, "contains missing data")

## Income contains missing data

## Drop Missing Income Data
df <- df[!(is.na(df$Income)),]

Later on the Analysis Age had some spectacular outlier, the following distribution shows age going well ahead of 100 years. They are a small and quite outragoeus outlier, we will remove it.

df %>%
  ggplot(aes(x = Age)) +
  geom_histogram(color = "white") +
  theme_minimal()

summary(df$Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   24.00   43.00   50.00   51.18   61.00  127.00

df[(df$Age > 100),]$Age

## [1] 127 121 120

Now we can see the more clearly the outliers in Age especially @ ages127, 121, 120, let us remove them.

#Remove datapoints where Age>100
paste0("Number of Datapoints to remove for Age>100: ",count(df%>%filter(.,Age>100)))

## [1] "Number of Datapoints to remove for Age>100: 3"

df <- df[!(df$Age > 100),]

Phew, our data should be sparkling now. However, we can still make our data not just clean but also Neat, let us merge repetitive data together in Feature Engineering.

Feature Engineering

## Total Amount Spent
df<- df %>% 
     mutate(Total_amt_spent = MntWines+ MntFruits+ MntMeatProducts+ MntFishProducts+ MntSweetProducts+ MntGoldProds) 

summary(df$Total_amt_spent)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       5      69     397     607    1048    2525

## Total Campaigns Accepted
df <- df %>%
      mutate(Num_campaigns_Accepted = round((AcceptedCmp1+ AcceptedCmp2 +AcceptedCmp3+ AcceptedCmp4+ AcceptedCmp5)/4))

summary(df$Num_campaigns_Accepted)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.02485 0.00000 1.00000

## Total Number of Purchases
df <- df %>% 
      mutate(Num_purchases= NumDealsPurchases+ NumCatalogPurchases+ NumStorePurchases+ NumWebPurchases)

summary(df$Num_purchases)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    8.00   15.00   14.89   21.00   44.00

## Total Number of Dependents
df <- df %>% 
      mutate(Num_dependents = Kidhome + Teenhome)

summary(df$Num_dependents)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.9476  1.0000  3.0000

Now having Merged what we need, let us drop the parent variables.

## Drop Irrelevant Columns
drp <- c("Kidhome","Teenhome","Dt_Customer","Recency","MntWines" ,"MntFruits","MntMeatProducts","MntFishProducts","MntSweetProducts" ,"MntGoldProds","NumDealsPurchases","NumWebPurchases","NumCatalogPurchases","NumStorePurchases","NumWebVisitsMonth","AcceptedCmp3", "AcceptedCmp4" ,"AcceptedCmp5","AcceptedCmp1","AcceptedCmp2", "Year_Birth")

df <- df[,!(names(df) %in% drp)]

## Move Total_Amt_Spent to the tail
df <- df %>% select(-Total_amt_spent, Total_amt_spent)
str(df)

## 'data.frame':    2213 obs. of  11 variables:
##  $ Education             : Factor w/ 5 levels "2n Cycle","Basic",..: 3 3 3 3 3 5 1 3 5 5 ...
##  $ Marital_Status        : Factor w/ 8 levels "Absurd","Alone",..: 3 5 4 6 5 5 4 6 4 4 ...
##  $ Income                : num  84835 57091 67267 32474 21474 ...
##  $ Response              : int  1 1 0 0 1 1 1 0 0 0 ...
##  $ Complain              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Country               : Factor w/ 8 levels "AUS","CA","GER",..: 7 2 8 1 7 7 3 7 8 4 ...
##  $ Age                   : num  50 59 62 53 31 62 66 53 66 66 ...
##  $ Num_campaigns_Accepted: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Num_purchases         : int  15 18 11 4 8 17 28 7 20 20 ...
##  $ Num_dependents        : int  0 0 1 2 1 0 0 1 1 1 ...
##  $ Total_amt_spent       : int  1190 577 251 11 91 1192 1215 96 544 544 ...

As far as Cleaning the data, it is now tidy enough for Visualisations and managing

Top Distributions

Outcome Variable

df %>%
  ggplot(aes(x = Num_purchases)) +
  geom_histogram(color = "white", bins = 15) +
  labs(title = "Participants by Total Purchases",
       x = "Total Amount Spent") +
  theme_minimal()

library(scales)

#2
## for loop to reclassify factor Variables
for(var in names(df)){
  if(length(unique(df[[var]])) <= 2){
    ## Assign the Boolean Values
    df[,var] <- ifelse(df[,var]==0, "False", "True")
    ## Convert to Factpr
    df[,var] <- as.factor(df[,var])
  }
}


## NumericVars
factorVars <- select_if(df, is.factor)
numericVars <- select_if(df, is.numeric)

## function to tidy up to a summary table
tidy_eval_arrange <- function(.data, var) {
  plotdata <- .data %>%
    count({{var}}) %>%
    mutate(pct = n / sum(n),
           pctlabel = paste0(round(pct * 100), "%")) 
  
  print(plotdata)
}

tidy_eval_arrange(df, Response)

##   Response    n       pct pctlabel
## 1    False 1880 0.8495255      85%
## 2     True  333 0.1504745      15%

## function to Plot Graphs for Factor Variables
auto_factor_plot <- function(.data) {
  ## Type of Variable
  nm <- names(.data)
  
## loop to plot through
  for (i in seq_along(nm)) {
    plot <- .data %>%
              ggplot(aes_string(x = nm[i])) + 
                    geom_histogram(alpha = .5,fill = "black", stat = "count") +
                    theme_minimal() + 
                    ggtitle ("Percentage of Market Data by Categorical Variables") +
                    coord_flip() +
                    theme_minimal()
    
    print(plot)
    # print(plot)
  }
    
  
  
}

auto_factor_plot(factorVars)

Numeric Variables

Correlation

BiVariate Relationship Numeric x Numeric Outcome: Correlation Analysis

# calulate the correlations
library(kableExtra)
df_num <- select_if(df, is.numeric)

crr <- cor(df_num, use="complete.obs")
kable(round(crr,2))

	Income	Age	Num_purchases	Num_dependents	Total_amt_spent
Income	1.00	0.16	0.57	-0.29	0.67
Age	0.16	1.00	0.18	0.09	0.12
Num_purchases	0.57	0.18	1.00	-0.25	0.76
Num_dependents	-0.29	0.09	-0.25	1.00	-0.50
Total_amt_spent	0.67	0.12	0.76	-0.50	1.00

One thing we can already see if that those with more Dependents will tend to spend less. Let us visualize this relationship using a p-value.

crr %>%
  corrplot(method = "color", type = "lower", tl.col = "black", tl.srt = 45,
           p.mat = cor.mtest(crr)$p,
           insig = "p-value", sig.level = -1)

MultiVariate Relationship {Factor & Numeric} x Numeric Outcome: Linear Regression Analysis

## Dummify Factor Variables
linR <- lm(Num_purchases ~ ., data = df)

summary(linR)

## 
## Call:
## lm(formula = Num_purchases ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.857  -3.219  -0.455   3.382  16.914 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 4.565e+00  3.446e+00   1.325   0.1854    
## EducationBasic             -1.672e+00  7.490e-01  -2.233   0.0257 *  
## EducationGraduation        -5.190e-01  3.725e-01  -1.393   0.1637    
## EducationMaster            -7.313e-01  4.284e-01  -1.707   0.0879 .  
## EducationPhD               -5.458e-01  4.139e-01  -1.319   0.1874    
## Marital_StatusAlone         1.220e+00  4.384e+00   0.278   0.7808    
## Marital_StatusDivorced      1.756e-01  3.413e+00   0.051   0.9590    
## Marital_StatusMarried       2.519e-01  3.403e+00   0.074   0.9410    
## Marital_StatusSingle       -3.536e-01  3.405e+00  -0.104   0.9173    
## Marital_StatusTogether     -1.018e-01  3.405e+00  -0.030   0.9761    
## Marital_StatusWidow         2.983e-01  3.446e+00   0.087   0.9310    
## Marital_StatusYOLO          6.056e+00  4.805e+00   1.260   0.2076    
## Income                      2.682e-05  5.546e-06   4.836 1.42e-06 ***
## ResponseTrue               -3.244e-01  3.115e-01  -1.041   0.2979    
## ComplainTrue                6.173e-01  1.080e+00   0.572   0.5675    
## CountryCA                   2.271e-01  4.945e-01   0.459   0.6461    
## CountryGER                 -1.834e-02  5.975e-01  -0.031   0.9755    
## CountryIND                  2.981e-01  5.633e-01   0.529   0.5967    
## CountryME                   1.389e+00  2.801e+00   0.496   0.6201    
## CountrySA                   1.748e-01  4.760e-01   0.367   0.7134    
## CountrySP                  -2.903e-04  4.229e-01  -0.001   0.9995    
## CountryUS                   1.275e+00  6.108e-01   2.087   0.0370 *  
## Age                         3.702e-02  9.320e-03   3.972 7.35e-05 ***
## Num_campaigns_AcceptedTrue -3.958e+00  7.072e-01  -5.598 2.45e-08 ***
## Num_dependents              1.518e+00  1.616e-01   9.392  < 2e-16 ***
## Total_amt_spent             1.006e-02  2.602e-04  38.676  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.786 on 2187 degrees of freedom
## Multiple R-squared:  0.615,  Adjusted R-squared:  0.6106 
## F-statistic: 139.7 on 25 and 2187 DF,  p-value: < 2.2e-16

Using P-values, we can observe variables that hold Significant Relationships with the Outcome Variable. p-val < 0.05 indicates a significant relationship.

#Retrieve all the p-values
pvaluedf <- data.frame(summary(linR)$coefficients[,c('Pr(>|t|)', 'Estimate')])
colnames(pvaluedf) <- c('pvalue', 'coefficient')
pvaluedf$variables <- rownames(pvaluedf)

#Plot the variables and their significance 
pvaluedf %>% 
  ggplot(aes(x = reorder(variables, pvalue),y = pvalue)) +
    geom_col() +
    geom_hline(yintercept = 0.05, color = "white") + ## level of Significance
    geom_text(aes(x = 2, y = 0.10),label = "5%", color = "darkred") +
    scale_x_discrete(name = "Predictors") +
    ggtitle("X-Variables and their P-Values") +
    coord_flip() +
    theme_minimal()

Using this Logic, from our graph US downwards hold signifiant relationships with the Number of Purchase made in the market.

Notice: The inspiration for this notebook, is another notebook on this subject-matter, a far illustrative and masterful work by Oi Zhen, you may want to check it out here Kaggle Marketing Analytics.