Part 2: Exploratory Data Analysis

# create column - Dependents and campaigns_accepted
df$Dependents <- df$Kidhome + df$Teenhome

df$campaigns_accepted <- df$AcceptedCmp1 + df$AcceptedCmp2 + df$AcceptedCmp3 +
  df$AcceptedCmp4 + df$AcceptedCmp5 + df$Response

2.1 Descriptive Statistical Analysis

  • Create simple average, max, mean, sd, iqr of store purchases across Country and Marital_Status
summarise_store_purchases <- df %>%
  group_by(Country, Marital_Status) %>%
  summarise(count = n(),
            max_store_purchases = max(NumStorePurchases),
            min_store_purchases = min(NumStorePurchases),
            mean_store_purchases = mean(NumStorePurchases),
            median_store_purchases = median(NumStorePurchases))
## `summarise()` has grouped output by 'Country'. You can override using the `.groups` argument.
summarise_store_purchases
## # A tibble: 37 × 7
## # Groups:   Country [8]
##    Country Marital_Status count max_store_purchases min_store_purchases
##    <chr>   <chr>          <int>               <dbl>               <dbl>
##  1 AUS     Divorced          20                  11                   2
##  2 AUS     Married           66                  13                   2
##  3 AUS     Single            19                  11                   2
##  4 AUS     Together          37                  13                   2
##  5 AUS     Widow              5                  10                   3
##  6 CA      Divorced          27                  12                   2
##  7 CA      Married          105                  13                   0
##  8 CA      Single            61                  13                   2
##  9 CA      Together          64                  13                   1
## 10 CA      Widow              6                  11                   3
## # … with 27 more rows, and 2 more variables: mean_store_purchases <dbl>,
## #   median_store_purchases <dbl>
# Sort summarise_store_purchases in descending order
summarise_store_purchases %>% arrange(desc(mean_store_purchases))
## # A tibble: 37 × 7
## # Groups:   Country [8]
##    Country Marital_Status count max_store_purchases min_store_purchases
##    <chr>   <chr>          <int>               <dbl>               <dbl>
##  1 GER     Widow              3                  13                   6
##  2 US      Divorced          16                  13                   3
##  3 ME      Together           1                   7                   7
##  4 SP      Widow             43                  13                   2
##  5 GER     Single            18                  12                   2
##  6 IND     Widow              4                   7                   6
##  7 SA      Divorced          47                  13                   0
##  8 GER     Together          32                  13                   2
##  9 SA      Single            69                  12                   2
## 10 US      Together          23                  12                   0
## # … with 27 more rows, and 2 more variables: mean_store_purchases <dbl>,
## #   median_store_purchases <dbl>

  • Look at the plot, we spotted the highest mean store purchases came from GER (Widow Group)
  • The second highest mean store purchases came from the US(Divorced Group)
  • There is no store purchases from Widow, Married, Divorced at ME country. Our team need to find out the reasons why those groups do not purchase in store in Me
  • For now, if we build marketing campaigns in ME, we should not focus on Widow,Married, or Divorced group at ME
# Similarly, create plots for Country, Dependents, and Store purchase

summarise_store_purchases <- df %>%
  group_by(Country, Dependents) %>%
  summarise(count = n(),
            max_store_purchases = max(NumStorePurchases),
            min_store_purchases = min(NumStorePurchases),
            mean_store_purchases = mean(NumStorePurchases),
            median_store_purchases = median(NumStorePurchases))
## `summarise()` has grouped output by 'Country'. You can override using the `.groups` argument.
summarise_store_purchases
## # A tibble: 30 × 7
## # Groups:   Country [8]
##    Country Dependents count max_store_purcha… min_store_purcha… mean_store_purc…
##    <chr>        <dbl> <int>             <dbl>             <dbl>            <dbl>
##  1 AUS              0    37                13                 2             7.22
##  2 AUS              1    76                12                 2             5.32
##  3 AUS              2    29                12                 2             4.90
##  4 AUS              3     5                 5                 2             3.4 
##  5 CA               0    75                13                 0             7.69
##  6 CA               1   129                13                 1             5.56
##  7 CA               2    50                12                 2             4.42
##  8 CA               3     9                 7                 2             3.44
##  9 GER              0    39                13                 2             7.79
## 10 GER              1    57                13                 2             5.33
## # … with 20 more rows, and 1 more variable: median_store_purchases <dbl>

  • The plot shows that the most store purchases came from those who has no dependents.
  • Base on this info, if we create a marketing campaign with a tight budget to increase purchases in stores, we need to focus on customers who have no dependents.
  • The second group we can focus on is those who has 1 dependent.
  • If we run a marketing campaign in the US, dependents don’t have major effect on store_purchases

2.2 Correlation Statistics (Pearson correlation) between numeric variables

  • Let’s calculate correlation for multiple variables: Education, Marital_Status, Dependents, Total spent, Recency, MntGOldProds, NumDealsPurchases, NumWebPurchases, NumCatalogPurchases # NumWebVisitMonth # Campaign_accepted to find out what causes the number of store purchases?
numerics_df <- df %>%
  select(Income, Education_new, Marital_Status_new, Dependents, total_purchases,
         total_spent, MntGoldProds, NumDealsPurchases,NumWebPurchases,
         NumWebVisitsMonth, NumStorePurchases,NumCatalogPurchases, MntFishProducts,MntWines,MntFruits, MntMeatProducts, MntSweetProducts)


df_core <- cor(numerics_df, method = "pearson", use = "pairwise.complete.obs" )
df_core
##                          Income Education_new Marital_Status_new  Dependents
## Income               1.00000000   0.122150429       0.0381146423 -0.34571878
## Education_new        0.12215043   1.000000000       0.0339473661  0.05226278
## Marital_Status_new   0.03811464   0.033947366       1.0000000000  0.04185307
## Dependents          -0.34571878   0.052262777       0.0418530707  1.00000000
## total_purchases      0.66951165   0.074573063       0.0459013011 -0.25192652
## total_spent          0.79231663   0.060488302       0.0165774121 -0.50199010
## MntGoldProds         0.38694480  -0.125185286       0.0258094578 -0.26844736
## NumDealsPurchases   -0.11007770   0.027435840       0.0337068448  0.43608770
## NumWebPurchases      0.45892080   0.066974617       0.0550668164 -0.15120386
## NumWebVisitsMonth   -0.65026215  -0.031867927      -0.0003336462  0.41717754
## NumStorePurchases    0.63095340   0.059127465       0.0220052098 -0.32450202
## NumCatalogPurchases  0.69589927   0.048917149       0.0220461269 -0.44500377
## MntFishProducts      0.51947652  -0.112295317       0.0101984151 -0.42781975
## MntWines             0.68725600   0.169304803       0.0380889318 -0.35511573
## MntFruits            0.50763749  -0.103586935       0.0088991656 -0.39634912
## MntMeatProducts      0.69180624   0.001589597      -0.0266421414 -0.50553772
## MntSweetProducts     0.52353259  -0.104146835       0.0210512773 -0.39064093
##                     total_purchases total_spent MntGoldProds NumDealsPurchases
## Income                   0.66951165  0.79231663   0.38694480      -0.110077697
## Education_new            0.07457306  0.06048830  -0.12518529       0.027435840
## Marital_Status_new       0.04590130  0.01657741   0.02580946       0.033706845
## Dependents              -0.25192652 -0.50199010  -0.26844736       0.436087698
## total_purchases          1.00000000  0.75637378   0.49230172       0.359501671
## total_spent              0.75637378  1.00000000   0.52565381      -0.067295498
## MntGoldProds             0.49230172  0.52565381   1.00000000       0.051937450
## NumDealsPurchases        0.35950167 -0.06729550   0.05193745       1.000000000
## NumWebPurchases          0.78426010  0.52969159   0.40825701       0.240349944
## NumWebVisitsMonth       -0.31357726 -0.49815284  -0.24504689       0.347549925
## NumStorePurchases        0.82197402  0.67541666   0.38983191       0.064744839
## NumCatalogPurchases      0.73582359  0.77961819   0.44021212      -0.013367648
## MntFishProducts          0.46718539  0.64073416   0.42479800      -0.144539582
## MntWines                 0.71308310  0.89264706   0.38968743       0.007707586
## MntFruits                0.45390831  0.61272078   0.39355365      -0.135078230
## MntMeatProducts          0.56433900  0.84506738   0.35588063      -0.122559486
## MntSweetProducts         0.46972073  0.60688839   0.35648891      -0.122387135
##                     NumWebPurchases NumWebVisitsMonth NumStorePurchases
## Income                   0.45892080     -0.6502621489        0.63095340
## Education_new            0.06697462     -0.0318679269        0.05912747
## Marital_Status_new       0.05506682     -0.0003336462        0.02200521
## Dependents              -0.15120386      0.4171775379       -0.32450202
## total_purchases          0.78426010     -0.3135772552        0.82197402
## total_spent              0.52969159     -0.4981528444        0.67541666
## MntGoldProds             0.40825701     -0.2450468936        0.38983191
## NumDealsPurchases        0.24034994      0.3475499247        0.06474484
## NumWebPurchases          1.00000000     -0.0516650842        0.51611710
## NumWebVisitsMonth       -0.05166508      1.0000000000       -0.43256382
## NumStorePurchases        0.51611710     -0.4325638205        1.00000000
## NumCatalogPurchases      0.38656822     -0.5215183888        0.51711055
## MntFishProducts          0.29995381     -0.4456204464        0.45638679
## MntWines                 0.55432773     -0.3203952600        0.64000634
## MntFruits                0.30179434     -0.4177338554        0.45899289
## MntMeatProducts          0.30666955     -0.5389296680        0.48501857
## MntSweetProducts         0.33313048     -0.4220980814        0.45449378
##                     NumCatalogPurchases MntFishProducts     MntWines
## Income                       0.69589927      0.51947652  0.687255998
## Education_new                0.04891715     -0.11229532  0.169304803
## Marital_Status_new           0.02204613      0.01019842  0.038088932
## Dependents                  -0.44500377     -0.42781975 -0.355115733
## total_purchases              0.73582359      0.46718539  0.713083105
## total_spent                  0.77961819      0.64073416  0.892647055
## MntGoldProds                 0.44021212      0.42479800  0.389687435
## NumDealsPurchases           -0.01336765     -0.14453958  0.007707586
## NumWebPurchases              0.38656822      0.29995381  0.554327729
## NumWebVisitsMonth           -0.52151839     -0.44562045 -0.320395260
## NumStorePurchases            0.51711055      0.45638679  0.640006335
## NumCatalogPurchases          1.00000000      0.53127805  0.633121957
## MntFishProducts              0.53127805      1.00000000  0.395177256
## MntWines                     0.63312196      0.39517726  1.000000000
## MntFruits                    0.48566814      0.59363389  0.386087558
## MntMeatProducts              0.73318515      0.57163746  0.566807669
## MntSweetProducts             0.49437494      0.58394261  0.389401050
##                        MntFruits MntMeatProducts MntSweetProducts
## Income               0.507637492     0.691806241       0.52353259
## Education_new       -0.103586935     0.001589597      -0.10414684
## Marital_Status_new   0.008899166    -0.026642141       0.02105128
## Dependents          -0.396349117    -0.505537723      -0.39064093
## total_purchases      0.453908314     0.564338995       0.46972073
## total_spent          0.612720781     0.845067382       0.60688839
## MntGoldProds         0.393553648     0.355880632       0.35648891
## NumDealsPurchases   -0.135078230    -0.122559486      -0.12238714
## NumWebPurchases      0.301794343     0.306669555       0.33313048
## NumWebVisitsMonth   -0.417733855    -0.538929668      -0.42209808
## NumStorePurchases    0.458992889     0.485018566       0.45449378
## NumCatalogPurchases  0.485668141     0.733185154       0.49437494
## MntFishProducts      0.593633886     0.571637461       0.58394261
## MntWines             0.386087558     0.566807669       0.38940105
## MntFruits            1.000000000     0.546835312       0.57087092
## MntMeatProducts      0.546835312     1.000000000       0.53450352
## MntSweetProducts     0.570870923     0.534503520       1.00000000

It’s hard to understand the table, so I create a map to visualise correlation coefficient among variable

## Warning in ind1:ind2: numerical expression has 2 elements: only the first used
## Warning in ind1:ind2: numerical expression has 3 elements: only the first used

  • The plot illustrates relationship between:
  • NumStorePurchases and total_purchases: 0.82 (Strong)
  • NumStorePurchases and total_spent : 0.68 (Medium)
  • NumStorePurchases and Income : 0.63 (Medium)
  • total_purchases and MntGoldProds : 0.49 (Medium)
  • NumStorePurchases and dependents : -0.31 (Weak)
  • NumStorePurchases and NumWebVisitsMonth : -0.43 (Medium)

We need to check if these correlation coefficients are significant by calculating the P-value using cor.test() function

# 1. NumStorePurchases and total_spent
df %>%
  cor.test(~ NumStorePurchases + total_spent, data = .)
## 
##  Pearson's product-moment correlation
## 
## data:  NumStorePurchases and total_spent
## t = 42.959, df = 2200, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6520412 0.6975093
## sample estimates:
##       cor 
## 0.6754167
# 2. NumStorePurchases and Income
df %>%
  cor.test(~ NumStorePurchases + Income, data = .)
## 
##  Pearson's product-moment correlation
## 
## data:  NumStorePurchases and Income
## t = 38.146, df = 2200, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6051305 0.6554501
## sample estimates:
##       cor 
## 0.6309534
# 3. NumStorePurchases and total_purchases
df %>%
  cor.test(~ NumStorePurchases + total_purchases, data = .)
## 
##  Pearson's product-moment correlation
## 
## data:  NumStorePurchases and total_purchases
## t = 67.695, df = 2200, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8079432 0.8350733
## sample estimates:
##      cor 
## 0.821974
# 4. total_purchases and MntGoldProds
df %>%
  cor.test(~ MntGoldProds + total_purchases, data = .)
## 
##  Pearson's product-moment correlation
## 
## data:  MntGoldProds and total_purchases
## t = 26.528, df = 2200, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4599893 0.5233119
## sample estimates:
##       cor 
## 0.4923017
# 5. NumStorePurchases and dependents
df %>%
  cor.test(~ NumStorePurchases + Dependents, data = .)
## 
##  Pearson's product-moment correlation
## 
## data:  NumStorePurchases and Dependents
## t = -16.091, df = 2200, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3613753 -0.2866153
## sample estimates:
##       cor 
## -0.324502
# 6. NumStorePurchases and NumWebVisitsMonth
df %>%
  cor.test(~ NumStorePurchases + NumWebVisitsMonth, data = .)
## 
##  Pearson's product-moment correlation
## 
## data:  NumStorePurchases and NumWebVisitsMonth
## t = -22.503, df = 2200, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4659169 -0.3979832
## sample estimates:
##        cor 
## -0.4325638

Notice that the all P-values above (< 2.2e -16) are very small, much smaller than 0.001. So we can be certain that there is a correlation between these variables

# We found relationships between these variables, let's visualise it to explore further

ggplot(df, aes(x = Income, y = NumStorePurchases)) +
  geom_point () + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

  • Look at the plot, digital marketing campaigns should target customers who have an income between $50000 and $90000
ggplot(df, aes(x = NumWebVisitsMonth, y = NumStorePurchases)) +
  geom_point () + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Negative linear relationship

  • The plot shows that the more web visited per month, the lower purchases at stores.
  • It means if we create digital marketing campaigns, we should exclude those who visited the website more than 9 times per month, so we can save marketing budget
ggplot(df, aes(x = Dependents, y = NumStorePurchases)) +
  geom_point () + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

  • The more dependents customer have, the lower chance they purchase in store.
  • We will create a further test to calculate the correlation coefficient between these variables
ggplot(df, aes(x = MntGoldProds, y = NumStorePurchases)) +
  geom_point () + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

  • Customer with MntGoldProds lower than $100 brought the most store purchases

Question 1: What causes store purchases?

  • Answers: The causes are Income, Dependents, total_spent, total_purchases, MtnGoldProds,and NumWebvisitsMonth

Question 2: How to increase the company revenue?

  • Answer: To do it, we need to increase customer’s spend which has a positive relationship with the number of store purchases.
  • In short, to get more revenue, we need more store purchases.
  • To rise the store purchases, we can focus on increasing the number of MntGoldProds (<$100) by analysing customers ’digital behaviours, channels, and interests. Then build marketing campaigns to target audience who is similar to our current MntGoldProds customer (< $100)

Next, to develop effective marketing campaigns to increase store purchases, our team should:

  • Target customers have less than 1 dependent
  • The income between $50000 and $90000
  • The campaigns should exclude those who visited the website more than 9 times

Question 3: Does the US fare significantly better than the rest of the world in term of total purchases?

# explore what country has the most purchases
total_purchase_by_country <- df %>%
  group_by(Country) %>%
  summarise(sum = sum(total_purchases))

arrange(total_purchase_by_country, desc(sum))
## # A tibble: 8 × 2
##   Country   sum
##   <chr>   <dbl>
## 1 SP      15934
## 2 SA       5092
## 3 CA       4046
## 4 AUS      2156
## 5 IND      2092
## 6 US       1743
## 7 GER      1708
## 8 ME         59
  • Answer: No, the US doesn’t. SP has the highest number of purchases 15992 while US has 1743 purchases.

Question 4: The supervisor insists that people who buy gold are more conservative. Therefore, people who spent an above average amount on gold in the last 2 years would have more in store purchases. Justify or refute this statement using an appropriate statistical test.

mean_MntGoldProds <- mean(df$MntGoldProds)
mean_MntGoldProds
## [1] 43.98592
ggplot(df, aes(x = MntGoldProds, y = NumStorePurchases)) +
  geom_point () + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

  • We can see the positive relationship between NumStorePurchases and MntGoldProds as above. So those who spend more on gold will have a higher chance to spend more in store.

  • The plot shows that the most store purchases came from those spent on gold between $44 and $100

Question 6: Is there a significant relationship between geographical regional and success of a campaign?

unique(df$Country)
## [1] "SP"  "CA"  "US"  "AUS" "GER" "IND" "SA"  "ME"
# change categorical values to numeric values
df$Country_numeric <- df$Country
df$Country_numeric[df$Country_numeric == "SP"] <- 1
df$Country_numeric[df$Country_numeric == "CA"] <- 2
df$Country_numeric[df$Country_numeric == "US"] <- 3
df$Country_numeric[df$Country_numeric == "AUS"] <- 4
df$Country_numeric[df$Country_numeric == "GER"] <- 5
df$Country_numeric[df$Country_numeric == "IND"] <- 6
df$Country_numeric[df$Country_numeric == "SA"] <- 7
df$Country_numeric[df$Country_numeric == "ME"] <- 8

df$Country_numeric <- as.numeric(df$Country_numeric)

df_country_camp <- df %>%
  select(Country_numeric,campaigns_accepted)

core_country_camp <- cor(df_country_camp, method = "pearson", use = "pairwise.complete.obs" )

core_country_camp
##                    Country_numeric campaigns_accepted
## Country_numeric         1.00000000        -0.03617805
## campaigns_accepted     -0.03617805         1.00000000
df %>%
  cor.test(~ Country_numeric + campaigns_accepted, data = .)
## 
##  Pearson's product-moment correlation
## 
## data:  Country_numeric and campaigns_accepted
## t = -1.698, df = 2200, p-value = 0.08965
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.077832181  0.005602167
## sample estimates:
##         cor 
## -0.03617805
  • Answer question 6: there is no relationship between geographical regional and success of a campaign as the correlation coefficient is only 0.036