Marketing Analysis - Part 2 - Exploratory Data Analysis

Part 2: Exploratory Data Analysis

Question 1: "What causes store purchases? Are there relationships among variables (Income, Education, Marital_Status, Country, NumStorePurchases,etc.)
Question 2: How to increase the company revenue?
Question 3: Question 3: Does the US fare significantly better than the rest of the world in term of total purchases?
Question 4: The supervisor insists that people who buy gold are more conservative. Therefore, people who spent an above average amount on gold in the last 2 years would have more in store purchases.Justify or refute this statement using an appropriate statistical test.
Question 5: Fish has Omega 3 fatty acids which are good for the brain. Accordingly do “Married PhD candidates” have a significant relation with the amount spend on fish products? What factors are significantly related to amount spend on fish?
Question 6: Is there a significant relationship between geographical regional and success of a campaign?

# create column - Dependents and campaigns_accepted
df$Dependents <- df$Kidhome + df$Teenhome

df$campaigns_accepted <- df$AcceptedCmp1 + df$AcceptedCmp2 + df$AcceptedCmp3 +
  df$AcceptedCmp4 + df$AcceptedCmp5 + df$Response

2.1 Descriptive Statistical Analysis

Create simple average, max, mean, sd, iqr of store purchases across Country and Marital_Status

summarise_store_purchases <- df %>%
  group_by(Country, Marital_Status) %>%
  summarise(count = n(),
            max_store_purchases = max(NumStorePurchases),
            min_store_purchases = min(NumStorePurchases),
            mean_store_purchases = mean(NumStorePurchases),
            median_store_purchases = median(NumStorePurchases))

## `summarise()` has grouped output by 'Country'. You can override using the `.groups` argument.

summarise_store_purchases

## # A tibble: 37 × 7
## # Groups:   Country [8]
##    Country Marital_Status count max_store_purchases min_store_purchases
##    <chr>   <chr>          <int>               <dbl>               <dbl>
##  1 AUS     Divorced          20                  11                   2
##  2 AUS     Married           66                  13                   2
##  3 AUS     Single            19                  11                   2
##  4 AUS     Together          37                  13                   2
##  5 AUS     Widow              5                  10                   3
##  6 CA      Divorced          27                  12                   2
##  7 CA      Married          105                  13                   0
##  8 CA      Single            61                  13                   2
##  9 CA      Together          64                  13                   1
## 10 CA      Widow              6                  11                   3
## # … with 27 more rows, and 2 more variables: mean_store_purchases <dbl>,
## #   median_store_purchases <dbl>

# Sort summarise_store_purchases in descending order
summarise_store_purchases %>% arrange(desc(mean_store_purchases))

## # A tibble: 37 × 7
## # Groups:   Country [8]
##    Country Marital_Status count max_store_purchases min_store_purchases
##    <chr>   <chr>          <int>               <dbl>               <dbl>
##  1 GER     Widow              3                  13                   6
##  2 US      Divorced          16                  13                   3
##  3 ME      Together           1                   7                   7
##  4 SP      Widow             43                  13                   2
##  5 GER     Single            18                  12                   2
##  6 IND     Widow              4                   7                   6
##  7 SA      Divorced          47                  13                   0
##  8 GER     Together          32                  13                   2
##  9 SA      Single            69                  12                   2
## 10 US      Together          23                  12                   0
## # … with 27 more rows, and 2 more variables: mean_store_purchases <dbl>,
## #   median_store_purchases <dbl>

Look at the plot, we spotted the highest mean store purchases came from GER (Widow Group)
The second highest mean store purchases came from the US(Divorced Group)
There is no store purchases from Widow, Married, Divorced at ME country. Our team need to find out the reasons why those groups do not purchase in store in Me
For now, if we build marketing campaigns in ME, we should not focus on Widow,Married, or Divorced group at ME

# Similarly, create plots for Country, Dependents, and Store purchase

summarise_store_purchases <- df %>%
  group_by(Country, Dependents) %>%
  summarise(count = n(),
            max_store_purchases = max(NumStorePurchases),
            min_store_purchases = min(NumStorePurchases),
            mean_store_purchases = mean(NumStorePurchases),
            median_store_purchases = median(NumStorePurchases))

## `summarise()` has grouped output by 'Country'. You can override using the `.groups` argument.

summarise_store_purchases

## # A tibble: 30 × 7
## # Groups:   Country [8]
##    Country Dependents count max_store_purcha… min_store_purcha… mean_store_purc…
##    <chr>        <dbl> <int>             <dbl>             <dbl>            <dbl>
##  1 AUS              0    37                13                 2             7.22
##  2 AUS              1    76                12                 2             5.32
##  3 AUS              2    29                12                 2             4.90
##  4 AUS              3     5                 5                 2             3.4 
##  5 CA               0    75                13                 0             7.69
##  6 CA               1   129                13                 1             5.56
##  7 CA               2    50                12                 2             4.42
##  8 CA               3     9                 7                 2             3.44
##  9 GER              0    39                13                 2             7.79
## 10 GER              1    57                13                 2             5.33
## # … with 20 more rows, and 1 more variable: median_store_purchases <dbl>

The plot shows that the most store purchases came from those who has no dependents.
Base on this info, if we create a marketing campaign with a tight budget to increase purchases in stores, we need to focus on customers who have no dependents.
The second group we can focus on is those who has 1 dependent.
If we run a marketing campaign in the US, dependents don’t have major effect on store_purchases

2.2 Correlation Statistics (Pearson correlation) between numeric variables

Let’s calculate correlation for multiple variables: Education, Marital_Status, Dependents, Total spent, Recency, MntGOldProds, NumDealsPurchases, NumWebPurchases, NumCatalogPurchases # NumWebVisitMonth # Campaign_accepted to find out what causes the number of store purchases?

numerics_df <- df %>%
  select(Income, Education_new, Marital_Status_new, Dependents, total_purchases,
         total_spent, MntGoldProds, NumDealsPurchases,NumWebPurchases,
         NumWebVisitsMonth, NumStorePurchases,NumCatalogPurchases, MntFishProducts,MntWines,MntFruits, MntMeatProducts, MntSweetProducts)


df_core <- cor(numerics_df, method = "pearson", use = "pairwise.complete.obs" )
df_core

##                          Income Education_new Marital_Status_new  Dependents
## Income               1.00000000   0.122150429       0.0381146423 -0.34571878
## Education_new        0.12215043   1.000000000       0.0339473661  0.05226278
## Marital_Status_new   0.03811464   0.033947366       1.0000000000  0.04185307
## Dependents          -0.34571878   0.052262777       0.0418530707  1.00000000
## total_purchases      0.66951165   0.074573063       0.0459013011 -0.25192652
## total_spent          0.79231663   0.060488302       0.0165774121 -0.50199010
## MntGoldProds         0.38694480  -0.125185286       0.0258094578 -0.26844736
## NumDealsPurchases   -0.11007770   0.027435840       0.0337068448  0.43608770
## NumWebPurchases      0.45892080   0.066974617       0.0550668164 -0.15120386
## NumWebVisitsMonth   -0.65026215  -0.031867927      -0.0003336462  0.41717754
## NumStorePurchases    0.63095340   0.059127465       0.0220052098 -0.32450202
## NumCatalogPurchases  0.69589927   0.048917149       0.0220461269 -0.44500377
## MntFishProducts      0.51947652  -0.112295317       0.0101984151 -0.42781975
## MntWines             0.68725600   0.169304803       0.0380889318 -0.35511573
## MntFruits            0.50763749  -0.103586935       0.0088991656 -0.39634912
## MntMeatProducts      0.69180624   0.001589597      -0.0266421414 -0.50553772
## MntSweetProducts     0.52353259  -0.104146835       0.0210512773 -0.39064093
##                     total_purchases total_spent MntGoldProds NumDealsPurchases
## Income                   0.66951165  0.79231663   0.38694480      -0.110077697
## Education_new            0.07457306  0.06048830  -0.12518529       0.027435840
## Marital_Status_new       0.04590130  0.01657741   0.02580946       0.033706845
## Dependents              -0.25192652 -0.50199010  -0.26844736       0.436087698
## total_purchases          1.00000000  0.75637378   0.49230172       0.359501671
## total_spent              0.75637378  1.00000000   0.52565381      -0.067295498
## MntGoldProds             0.49230172  0.52565381   1.00000000       0.051937450
## NumDealsPurchases        0.35950167 -0.06729550   0.05193745       1.000000000
## NumWebPurchases          0.78426010  0.52969159   0.40825701       0.240349944
## NumWebVisitsMonth       -0.31357726 -0.49815284  -0.24504689       0.347549925
## NumStorePurchases        0.82197402  0.67541666   0.38983191       0.064744839
## NumCatalogPurchases      0.73582359  0.77961819   0.44021212      -0.013367648
## MntFishProducts          0.46718539  0.64073416   0.42479800      -0.144539582
## MntWines                 0.71308310  0.89264706   0.38968743       0.007707586
## MntFruits                0.45390831  0.61272078   0.39355365      -0.135078230
## MntMeatProducts          0.56433900  0.84506738   0.35588063      -0.122559486
## MntSweetProducts         0.46972073  0.60688839   0.35648891      -0.122387135
##                     NumWebPurchases NumWebVisitsMonth NumStorePurchases
## Income                   0.45892080     -0.6502621489        0.63095340
## Education_new            0.06697462     -0.0318679269        0.05912747
## Marital_Status_new       0.05506682     -0.0003336462        0.02200521
## Dependents              -0.15120386      0.4171775379       -0.32450202
## total_purchases          0.78426010     -0.3135772552        0.82197402
## total_spent              0.52969159     -0.4981528444        0.67541666
## MntGoldProds             0.40825701     -0.2450468936        0.38983191
## NumDealsPurchases        0.24034994      0.3475499247        0.06474484
## NumWebPurchases          1.00000000     -0.0516650842        0.51611710
## NumWebVisitsMonth       -0.05166508      1.0000000000       -0.43256382
## NumStorePurchases        0.51611710     -0.4325638205        1.00000000
## NumCatalogPurchases      0.38656822     -0.5215183888        0.51711055
## MntFishProducts          0.29995381     -0.4456204464        0.45638679
## MntWines                 0.55432773     -0.3203952600        0.64000634
## MntFruits                0.30179434     -0.4177338554        0.45899289
## MntMeatProducts          0.30666955     -0.5389296680        0.48501857
## MntSweetProducts         0.33313048     -0.4220980814        0.45449378
##                     NumCatalogPurchases MntFishProducts     MntWines
## Income                       0.69589927      0.51947652  0.687255998
## Education_new                0.04891715     -0.11229532  0.169304803
## Marital_Status_new           0.02204613      0.01019842  0.038088932
## Dependents                  -0.44500377     -0.42781975 -0.355115733
## total_purchases              0.73582359      0.46718539  0.713083105
## total_spent                  0.77961819      0.64073416  0.892647055
## MntGoldProds                 0.44021212      0.42479800  0.389687435
## NumDealsPurchases           -0.01336765     -0.14453958  0.007707586
## NumWebPurchases              0.38656822      0.29995381  0.554327729
## NumWebVisitsMonth           -0.52151839     -0.44562045 -0.320395260
## NumStorePurchases            0.51711055      0.45638679  0.640006335
## NumCatalogPurchases          1.00000000      0.53127805  0.633121957
## MntFishProducts              0.53127805      1.00000000  0.395177256
## MntWines                     0.63312196      0.39517726  1.000000000
## MntFruits                    0.48566814      0.59363389  0.386087558
## MntMeatProducts              0.73318515      0.57163746  0.566807669
## MntSweetProducts             0.49437494      0.58394261  0.389401050
##                        MntFruits MntMeatProducts MntSweetProducts
## Income               0.507637492     0.691806241       0.52353259
## Education_new       -0.103586935     0.001589597      -0.10414684
## Marital_Status_new   0.008899166    -0.026642141       0.02105128
## Dependents          -0.396349117    -0.505537723      -0.39064093
## total_purchases      0.453908314     0.564338995       0.46972073
## total_spent          0.612720781     0.845067382       0.60688839
## MntGoldProds         0.393553648     0.355880632       0.35648891
## NumDealsPurchases   -0.135078230    -0.122559486      -0.12238714
## NumWebPurchases      0.301794343     0.306669555       0.33313048
## NumWebVisitsMonth   -0.417733855    -0.538929668      -0.42209808
## NumStorePurchases    0.458992889     0.485018566       0.45449378
## NumCatalogPurchases  0.485668141     0.733185154       0.49437494
## MntFishProducts      0.593633886     0.571637461       0.58394261
## MntWines             0.386087558     0.566807669       0.38940105
## MntFruits            1.000000000     0.546835312       0.57087092
## MntMeatProducts      0.546835312     1.000000000       0.53450352
## MntSweetProducts     0.570870923     0.534503520       1.00000000

It’s hard to understand the table, so I create a map to visualise correlation coefficient among variable

## Warning in ind1:ind2: numerical expression has 2 elements: only the first used

## Warning in ind1:ind2: numerical expression has 3 elements: only the first used

The plot illustrates relationship between:
NumStorePurchases and total_purchases: 0.82 (Strong)
NumStorePurchases and total_spent : 0.68 (Medium)
NumStorePurchases and Income : 0.63 (Medium)
total_purchases and MntGoldProds : 0.49 (Medium)
NumStorePurchases and dependents : -0.31 (Weak)
NumStorePurchases and NumWebVisitsMonth : -0.43 (Medium)

We need to check if these correlation coefficients are significant by calculating the P-value using cor.test() function

# 1. NumStorePurchases and total_spent
df %>%
  cor.test(~ NumStorePurchases + total_spent, data = .)

## 
##  Pearson's product-moment correlation
## 
## data:  NumStorePurchases and total_spent
## t = 42.959, df = 2200, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6520412 0.6975093
## sample estimates:
##       cor 
## 0.6754167

# 2. NumStorePurchases and Income
df %>%
  cor.test(~ NumStorePurchases + Income, data = .)

## 
##  Pearson's product-moment correlation
## 
## data:  NumStorePurchases and Income
## t = 38.146, df = 2200, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6051305 0.6554501
## sample estimates:
##       cor 
## 0.6309534

# 3. NumStorePurchases and total_purchases
df %>%
  cor.test(~ NumStorePurchases + total_purchases, data = .)

## 
##  Pearson's product-moment correlation
## 
## data:  NumStorePurchases and total_purchases
## t = 67.695, df = 2200, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8079432 0.8350733
## sample estimates:
##      cor 
## 0.821974

# 4. total_purchases and MntGoldProds
df %>%
  cor.test(~ MntGoldProds + total_purchases, data = .)

## 
##  Pearson's product-moment correlation
## 
## data:  MntGoldProds and total_purchases
## t = 26.528, df = 2200, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4599893 0.5233119
## sample estimates:
##       cor 
## 0.4923017

# 5. NumStorePurchases and dependents
df %>%
  cor.test(~ NumStorePurchases + Dependents, data = .)

## 
##  Pearson's product-moment correlation
## 
## data:  NumStorePurchases and Dependents
## t = -16.091, df = 2200, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3613753 -0.2866153
## sample estimates:
##       cor 
## -0.324502

# 6. NumStorePurchases and NumWebVisitsMonth
df %>%
  cor.test(~ NumStorePurchases + NumWebVisitsMonth, data = .)

## 
##  Pearson's product-moment correlation
## 
## data:  NumStorePurchases and NumWebVisitsMonth
## t = -22.503, df = 2200, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4659169 -0.3979832
## sample estimates:
##        cor 
## -0.4325638

Notice that the all P-values above (< 2.2e -16) are very small, much smaller than 0.001. So we can be certain that there is a correlation between these variables

# We found relationships between these variables, let's visualise it to explore further

ggplot(df, aes(x = Income, y = NumStorePurchases)) +
  geom_point () + geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

Look at the plot, digital marketing campaigns should target customers who have an income between $50000 and $90000

ggplot(df, aes(x = NumWebVisitsMonth, y = NumStorePurchases)) +
  geom_point () + geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

Negative linear relationship

The plot shows that the more web visited per month, the lower purchases at stores.
It means if we create digital marketing campaigns, we should exclude those who visited the website more than 9 times per month, so we can save marketing budget

ggplot(df, aes(x = Dependents, y = NumStorePurchases)) +
  geom_point () + geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

The more dependents customer have, the lower chance they purchase in store.
We will create a further test to calculate the correlation coefficient between these variables

ggplot(df, aes(x = MntGoldProds, y = NumStorePurchases)) +
  geom_point () + geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

Customer with MntGoldProds lower than $100 brought the most store purchases

Question 1: What causes store purchases?

Answers: The causes are Income, Dependents, total_spent, total_purchases, MtnGoldProds,and NumWebvisitsMonth

Question 2: How to increase the company revenue?

Answer: To do it, we need to increase customer’s spend which has a positive relationship with the number of store purchases.
In short, to get more revenue, we need more store purchases.
To rise the store purchases, we can focus on increasing the number of MntGoldProds (<$100) by analysing customers ’digital behaviours, channels, and interests. Then build marketing campaigns to target audience who is similar to our current MntGoldProds customer (< $100)

Next, to develop effective marketing campaigns to increase store purchases, our team should:

Target customers have less than 1 dependent
The income between $50000 and $90000
The campaigns should exclude those who visited the website more than 9 times

Question 3: Does the US fare significantly better than the rest of the world in term of total purchases?

# explore what country has the most purchases
total_purchase_by_country <- df %>%
  group_by(Country) %>%
  summarise(sum = sum(total_purchases))

arrange(total_purchase_by_country, desc(sum))

## # A tibble: 8 × 2
##   Country   sum
##   <chr>   <dbl>
## 1 SP      15934
## 2 SA       5092
## 3 CA       4046
## 4 AUS      2156
## 5 IND      2092
## 6 US       1743
## 7 GER      1708
## 8 ME         59

Answer: No, the US doesn’t. SP has the highest number of purchases 15992 while US has 1743 purchases.

Question 4: The supervisor insists that people who buy gold are more conservative. Therefore, people who spent an above average amount on gold in the last 2 years would have more in store purchases. Justify or refute this statement using an appropriate statistical test.

mean_MntGoldProds <- mean(df$MntGoldProds)
mean_MntGoldProds

## [1] 43.98592

ggplot(df, aes(x = MntGoldProds, y = NumStorePurchases)) +
  geom_point () + geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

We can see the positive relationship between NumStorePurchases and MntGoldProds as above. So those who spend more on gold will have a higher chance to spend more in store.
The plot shows that the most store purchases came from those spent on gold between $44 and $100

Question 5: Fish has Omega 3 fatty acids which are good for the brain. Accordingly do “Married PhD candidates” have a significant relation with the amount spend on fish products? What factors are significantly related to amount spend on fish?

Answer: Based on the plot above, Education and Marital Status have very low correlation coefficient on MntFishProducts (0.11 and 0.01), so “Married PhD Candidates” have no relation with the amount spent on fish products.

df %>%
  cor.test(~ MntFishProducts + MntFruits, data = .)

## 
##  Pearson's product-moment correlation
## 
## data:  MntFishProducts and MntFruits
## t = 34.6, df = 2200, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5658947 0.6200307
## sample estimates:
##       cor 
## 0.5936339

df %>%
  cor.test(~ MntFishProducts + total_spent, data = .)

## 
##  Pearson's product-moment correlation
## 
## data:  MntFishProducts and total_spent
## t = 39.144, df = 2200, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6154343 0.6647151
## sample estimates:
##       cor 
## 0.6407342

df %>%
  cor.test(~ MntFishProducts + MntMeatProducts, data = .)

## 
##  Pearson's product-moment correlation
## 
## data:  MntFishProducts and MntMeatProducts
## t = 32.678, df = 2200, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5428275 0.5991036
## sample estimates:
##       cor 
## 0.5716375

df %>%
  cor.test(~ MntFishProducts + NumStorePurchases, data = .)

## 
##  Pearson's product-moment correlation
## 
## data:  MntFishProducts and NumStorePurchases
## t = 24.058, df = 2200, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4226729 0.4888393
## sample estimates:
##       cor 
## 0.4563868

Above P-values are much smaller than 0.01, so MntFruits, MntMeatProducts, total_spent, NumStorePurchases are related to amount spend on fish (0.59, 0.57, 0.64, 0.46)
Interpretation: Customers who spend more on fish are those who spend more on fruits, meat, and prefer to buy products in stores.
Action: if we create digital marketing campaigns, fish, meat, and fruits should put together in the same carousel ad.

Question 6: Is there a significant relationship between geographical regional and success of a campaign?

unique(df$Country)

## [1] "SP"  "CA"  "US"  "AUS" "GER" "IND" "SA"  "ME"

# change categorical values to numeric values
df$Country_numeric <- df$Country
df$Country_numeric[df$Country_numeric == "SP"] <- 1
df$Country_numeric[df$Country_numeric == "CA"] <- 2
df$Country_numeric[df$Country_numeric == "US"] <- 3
df$Country_numeric[df$Country_numeric == "AUS"] <- 4
df$Country_numeric[df$Country_numeric == "GER"] <- 5
df$Country_numeric[df$Country_numeric == "IND"] <- 6
df$Country_numeric[df$Country_numeric == "SA"] <- 7
df$Country_numeric[df$Country_numeric == "ME"] <- 8

df$Country_numeric <- as.numeric(df$Country_numeric)

df_country_camp <- df %>%
  select(Country_numeric,campaigns_accepted)

core_country_camp <- cor(df_country_camp, method = "pearson", use = "pairwise.complete.obs" )

core_country_camp

##                    Country_numeric campaigns_accepted
## Country_numeric         1.00000000        -0.03617805
## campaigns_accepted     -0.03617805         1.00000000

df %>%
  cor.test(~ Country_numeric + campaigns_accepted, data = .)

## 
##  Pearson's product-moment correlation
## 
## data:  Country_numeric and campaigns_accepted
## t = -1.698, df = 2200, p-value = 0.08965
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.077832181  0.005602167
## sample estimates:
##         cor 
## -0.03617805

Answer question 6: there is no relationship between geographical regional and success of a campaign as the correlation coefficient is only 0.036

Marketing Analysis - Part 2 - Exploratory Data Analysis

Truc Nguyen

10/01/2022

Part 2: Exploratory Data Analysis

2.1 Descriptive Statistical Analysis

2.2 Correlation Statistics (Pearson correlation) between numeric variables

It’s hard to understand the table, so I create a map to visualise correlation coefficient among variable

We need to check if these correlation coefficients are significant by calculating the P-value using cor.test() function

Notice that the all P-values above (< 2.2e -16) are very small, much smaller than 0.001. So we can be certain that there is a correlation between these variables

Negative linear relationship

Question 1: What causes store purchases?

Question 2: How to increase the company revenue?

Next, to develop effective marketing campaigns to increase store purchases, our team should:

Question 3: Does the US fare significantly better than the rest of the world in term of total purchases?

Question 4: The supervisor insists that people who buy gold are more conservative. Therefore, people who spent an above average amount on gold in the last 2 years would have more in store purchases. Justify or refute this statement using an appropriate statistical test.

Question 6: Is there a significant relationship between geographical regional and success of a campaign?