Question 1: "What causes store purchases? Are there relationships among variables (Income, Education, Marital_Status, Country, NumStorePurchases,etc.)
Question 2: How to increase the company revenue?
Question 3: Question 3: Does the US fare significantly better than the rest of the world in term of total purchases?
Question 4: The supervisor insists that people who buy gold are more conservative. Therefore, people who spent an above average amount on gold in the last 2 years would have more in store purchases.Justify or refute this statement using an appropriate statistical test.
Question 5: Fish has Omega 3 fatty acids which are good for the brain. Accordingly do “Married PhD candidates” have a significant relation with the amount spend on fish products? What factors are significantly related to amount spend on fish?
Question 6: Is there a significant relationship between geographical regional and success of a campaign?
# create column - Dependents and campaigns_accepted
df$Dependents <- df$Kidhome + df$Teenhome
df$campaigns_accepted <- df$AcceptedCmp1 + df$AcceptedCmp2 + df$AcceptedCmp3 +
df$AcceptedCmp4 + df$AcceptedCmp5 + df$Response
summarise_store_purchases <- df %>%
group_by(Country, Marital_Status) %>%
summarise(count = n(),
max_store_purchases = max(NumStorePurchases),
min_store_purchases = min(NumStorePurchases),
mean_store_purchases = mean(NumStorePurchases),
median_store_purchases = median(NumStorePurchases))
## `summarise()` has grouped output by 'Country'. You can override using the `.groups` argument.
summarise_store_purchases
## # A tibble: 37 × 7
## # Groups: Country [8]
## Country Marital_Status count max_store_purchases min_store_purchases
## <chr> <chr> <int> <dbl> <dbl>
## 1 AUS Divorced 20 11 2
## 2 AUS Married 66 13 2
## 3 AUS Single 19 11 2
## 4 AUS Together 37 13 2
## 5 AUS Widow 5 10 3
## 6 CA Divorced 27 12 2
## 7 CA Married 105 13 0
## 8 CA Single 61 13 2
## 9 CA Together 64 13 1
## 10 CA Widow 6 11 3
## # … with 27 more rows, and 2 more variables: mean_store_purchases <dbl>,
## # median_store_purchases <dbl>
# Sort summarise_store_purchases in descending order
summarise_store_purchases %>% arrange(desc(mean_store_purchases))
## # A tibble: 37 × 7
## # Groups: Country [8]
## Country Marital_Status count max_store_purchases min_store_purchases
## <chr> <chr> <int> <dbl> <dbl>
## 1 GER Widow 3 13 6
## 2 US Divorced 16 13 3
## 3 ME Together 1 7 7
## 4 SP Widow 43 13 2
## 5 GER Single 18 12 2
## 6 IND Widow 4 7 6
## 7 SA Divorced 47 13 0
## 8 GER Together 32 13 2
## 9 SA Single 69 12 2
## 10 US Together 23 12 0
## # … with 27 more rows, and 2 more variables: mean_store_purchases <dbl>,
## # median_store_purchases <dbl>
# Similarly, create plots for Country, Dependents, and Store purchase
summarise_store_purchases <- df %>%
group_by(Country, Dependents) %>%
summarise(count = n(),
max_store_purchases = max(NumStorePurchases),
min_store_purchases = min(NumStorePurchases),
mean_store_purchases = mean(NumStorePurchases),
median_store_purchases = median(NumStorePurchases))
## `summarise()` has grouped output by 'Country'. You can override using the `.groups` argument.
summarise_store_purchases
## # A tibble: 30 × 7
## # Groups: Country [8]
## Country Dependents count max_store_purcha… min_store_purcha… mean_store_purc…
## <chr> <dbl> <int> <dbl> <dbl> <dbl>
## 1 AUS 0 37 13 2 7.22
## 2 AUS 1 76 12 2 5.32
## 3 AUS 2 29 12 2 4.90
## 4 AUS 3 5 5 2 3.4
## 5 CA 0 75 13 0 7.69
## 6 CA 1 129 13 1 5.56
## 7 CA 2 50 12 2 4.42
## 8 CA 3 9 7 2 3.44
## 9 GER 0 39 13 2 7.79
## 10 GER 1 57 13 2 5.33
## # … with 20 more rows, and 1 more variable: median_store_purchases <dbl>
numerics_df <- df %>%
select(Income, Education_new, Marital_Status_new, Dependents, total_purchases,
total_spent, MntGoldProds, NumDealsPurchases,NumWebPurchases,
NumWebVisitsMonth, NumStorePurchases,NumCatalogPurchases, MntFishProducts,MntWines,MntFruits, MntMeatProducts, MntSweetProducts)
df_core <- cor(numerics_df, method = "pearson", use = "pairwise.complete.obs" )
df_core
## Income Education_new Marital_Status_new Dependents
## Income 1.00000000 0.122150429 0.0381146423 -0.34571878
## Education_new 0.12215043 1.000000000 0.0339473661 0.05226278
## Marital_Status_new 0.03811464 0.033947366 1.0000000000 0.04185307
## Dependents -0.34571878 0.052262777 0.0418530707 1.00000000
## total_purchases 0.66951165 0.074573063 0.0459013011 -0.25192652
## total_spent 0.79231663 0.060488302 0.0165774121 -0.50199010
## MntGoldProds 0.38694480 -0.125185286 0.0258094578 -0.26844736
## NumDealsPurchases -0.11007770 0.027435840 0.0337068448 0.43608770
## NumWebPurchases 0.45892080 0.066974617 0.0550668164 -0.15120386
## NumWebVisitsMonth -0.65026215 -0.031867927 -0.0003336462 0.41717754
## NumStorePurchases 0.63095340 0.059127465 0.0220052098 -0.32450202
## NumCatalogPurchases 0.69589927 0.048917149 0.0220461269 -0.44500377
## MntFishProducts 0.51947652 -0.112295317 0.0101984151 -0.42781975
## MntWines 0.68725600 0.169304803 0.0380889318 -0.35511573
## MntFruits 0.50763749 -0.103586935 0.0088991656 -0.39634912
## MntMeatProducts 0.69180624 0.001589597 -0.0266421414 -0.50553772
## MntSweetProducts 0.52353259 -0.104146835 0.0210512773 -0.39064093
## total_purchases total_spent MntGoldProds NumDealsPurchases
## Income 0.66951165 0.79231663 0.38694480 -0.110077697
## Education_new 0.07457306 0.06048830 -0.12518529 0.027435840
## Marital_Status_new 0.04590130 0.01657741 0.02580946 0.033706845
## Dependents -0.25192652 -0.50199010 -0.26844736 0.436087698
## total_purchases 1.00000000 0.75637378 0.49230172 0.359501671
## total_spent 0.75637378 1.00000000 0.52565381 -0.067295498
## MntGoldProds 0.49230172 0.52565381 1.00000000 0.051937450
## NumDealsPurchases 0.35950167 -0.06729550 0.05193745 1.000000000
## NumWebPurchases 0.78426010 0.52969159 0.40825701 0.240349944
## NumWebVisitsMonth -0.31357726 -0.49815284 -0.24504689 0.347549925
## NumStorePurchases 0.82197402 0.67541666 0.38983191 0.064744839
## NumCatalogPurchases 0.73582359 0.77961819 0.44021212 -0.013367648
## MntFishProducts 0.46718539 0.64073416 0.42479800 -0.144539582
## MntWines 0.71308310 0.89264706 0.38968743 0.007707586
## MntFruits 0.45390831 0.61272078 0.39355365 -0.135078230
## MntMeatProducts 0.56433900 0.84506738 0.35588063 -0.122559486
## MntSweetProducts 0.46972073 0.60688839 0.35648891 -0.122387135
## NumWebPurchases NumWebVisitsMonth NumStorePurchases
## Income 0.45892080 -0.6502621489 0.63095340
## Education_new 0.06697462 -0.0318679269 0.05912747
## Marital_Status_new 0.05506682 -0.0003336462 0.02200521
## Dependents -0.15120386 0.4171775379 -0.32450202
## total_purchases 0.78426010 -0.3135772552 0.82197402
## total_spent 0.52969159 -0.4981528444 0.67541666
## MntGoldProds 0.40825701 -0.2450468936 0.38983191
## NumDealsPurchases 0.24034994 0.3475499247 0.06474484
## NumWebPurchases 1.00000000 -0.0516650842 0.51611710
## NumWebVisitsMonth -0.05166508 1.0000000000 -0.43256382
## NumStorePurchases 0.51611710 -0.4325638205 1.00000000
## NumCatalogPurchases 0.38656822 -0.5215183888 0.51711055
## MntFishProducts 0.29995381 -0.4456204464 0.45638679
## MntWines 0.55432773 -0.3203952600 0.64000634
## MntFruits 0.30179434 -0.4177338554 0.45899289
## MntMeatProducts 0.30666955 -0.5389296680 0.48501857
## MntSweetProducts 0.33313048 -0.4220980814 0.45449378
## NumCatalogPurchases MntFishProducts MntWines
## Income 0.69589927 0.51947652 0.687255998
## Education_new 0.04891715 -0.11229532 0.169304803
## Marital_Status_new 0.02204613 0.01019842 0.038088932
## Dependents -0.44500377 -0.42781975 -0.355115733
## total_purchases 0.73582359 0.46718539 0.713083105
## total_spent 0.77961819 0.64073416 0.892647055
## MntGoldProds 0.44021212 0.42479800 0.389687435
## NumDealsPurchases -0.01336765 -0.14453958 0.007707586
## NumWebPurchases 0.38656822 0.29995381 0.554327729
## NumWebVisitsMonth -0.52151839 -0.44562045 -0.320395260
## NumStorePurchases 0.51711055 0.45638679 0.640006335
## NumCatalogPurchases 1.00000000 0.53127805 0.633121957
## MntFishProducts 0.53127805 1.00000000 0.395177256
## MntWines 0.63312196 0.39517726 1.000000000
## MntFruits 0.48566814 0.59363389 0.386087558
## MntMeatProducts 0.73318515 0.57163746 0.566807669
## MntSweetProducts 0.49437494 0.58394261 0.389401050
## MntFruits MntMeatProducts MntSweetProducts
## Income 0.507637492 0.691806241 0.52353259
## Education_new -0.103586935 0.001589597 -0.10414684
## Marital_Status_new 0.008899166 -0.026642141 0.02105128
## Dependents -0.396349117 -0.505537723 -0.39064093
## total_purchases 0.453908314 0.564338995 0.46972073
## total_spent 0.612720781 0.845067382 0.60688839
## MntGoldProds 0.393553648 0.355880632 0.35648891
## NumDealsPurchases -0.135078230 -0.122559486 -0.12238714
## NumWebPurchases 0.301794343 0.306669555 0.33313048
## NumWebVisitsMonth -0.417733855 -0.538929668 -0.42209808
## NumStorePurchases 0.458992889 0.485018566 0.45449378
## NumCatalogPurchases 0.485668141 0.733185154 0.49437494
## MntFishProducts 0.593633886 0.571637461 0.58394261
## MntWines 0.386087558 0.566807669 0.38940105
## MntFruits 1.000000000 0.546835312 0.57087092
## MntMeatProducts 0.546835312 1.000000000 0.53450352
## MntSweetProducts 0.570870923 0.534503520 1.00000000
## Warning in ind1:ind2: numerical expression has 2 elements: only the first used
## Warning in ind1:ind2: numerical expression has 3 elements: only the first used
# 1. NumStorePurchases and total_spent
df %>%
cor.test(~ NumStorePurchases + total_spent, data = .)
##
## Pearson's product-moment correlation
##
## data: NumStorePurchases and total_spent
## t = 42.959, df = 2200, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6520412 0.6975093
## sample estimates:
## cor
## 0.6754167
# 2. NumStorePurchases and Income
df %>%
cor.test(~ NumStorePurchases + Income, data = .)
##
## Pearson's product-moment correlation
##
## data: NumStorePurchases and Income
## t = 38.146, df = 2200, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6051305 0.6554501
## sample estimates:
## cor
## 0.6309534
# 3. NumStorePurchases and total_purchases
df %>%
cor.test(~ NumStorePurchases + total_purchases, data = .)
##
## Pearson's product-moment correlation
##
## data: NumStorePurchases and total_purchases
## t = 67.695, df = 2200, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8079432 0.8350733
## sample estimates:
## cor
## 0.821974
# 4. total_purchases and MntGoldProds
df %>%
cor.test(~ MntGoldProds + total_purchases, data = .)
##
## Pearson's product-moment correlation
##
## data: MntGoldProds and total_purchases
## t = 26.528, df = 2200, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4599893 0.5233119
## sample estimates:
## cor
## 0.4923017
# 5. NumStorePurchases and dependents
df %>%
cor.test(~ NumStorePurchases + Dependents, data = .)
##
## Pearson's product-moment correlation
##
## data: NumStorePurchases and Dependents
## t = -16.091, df = 2200, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3613753 -0.2866153
## sample estimates:
## cor
## -0.324502
# 6. NumStorePurchases and NumWebVisitsMonth
df %>%
cor.test(~ NumStorePurchases + NumWebVisitsMonth, data = .)
##
## Pearson's product-moment correlation
##
## data: NumStorePurchases and NumWebVisitsMonth
## t = -22.503, df = 2200, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4659169 -0.3979832
## sample estimates:
## cor
## -0.4325638
# We found relationships between these variables, let's visualise it to explore further
ggplot(df, aes(x = Income, y = NumStorePurchases)) +
geom_point () + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
ggplot(df, aes(x = NumWebVisitsMonth, y = NumStorePurchases)) +
geom_point () + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
ggplot(df, aes(x = Dependents, y = NumStorePurchases)) +
geom_point () + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
ggplot(df, aes(x = MntGoldProds, y = NumStorePurchases)) +
geom_point () + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
# explore what country has the most purchases
total_purchase_by_country <- df %>%
group_by(Country) %>%
summarise(sum = sum(total_purchases))
arrange(total_purchase_by_country, desc(sum))
## # A tibble: 8 × 2
## Country sum
## <chr> <dbl>
## 1 SP 15934
## 2 SA 5092
## 3 CA 4046
## 4 AUS 2156
## 5 IND 2092
## 6 US 1743
## 7 GER 1708
## 8 ME 59
mean_MntGoldProds <- mean(df$MntGoldProds)
mean_MntGoldProds
## [1] 43.98592
ggplot(df, aes(x = MntGoldProds, y = NumStorePurchases)) +
geom_point () + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
We can see the positive relationship between NumStorePurchases and MntGoldProds as above. So those who spend more on gold will have a higher chance to spend more in store.
The plot shows that the most store purchases came from those spent on gold between $44 and $100
unique(df$Country)
## [1] "SP" "CA" "US" "AUS" "GER" "IND" "SA" "ME"
# change categorical values to numeric values
df$Country_numeric <- df$Country
df$Country_numeric[df$Country_numeric == "SP"] <- 1
df$Country_numeric[df$Country_numeric == "CA"] <- 2
df$Country_numeric[df$Country_numeric == "US"] <- 3
df$Country_numeric[df$Country_numeric == "AUS"] <- 4
df$Country_numeric[df$Country_numeric == "GER"] <- 5
df$Country_numeric[df$Country_numeric == "IND"] <- 6
df$Country_numeric[df$Country_numeric == "SA"] <- 7
df$Country_numeric[df$Country_numeric == "ME"] <- 8
df$Country_numeric <- as.numeric(df$Country_numeric)
df_country_camp <- df %>%
select(Country_numeric,campaigns_accepted)
core_country_camp <- cor(df_country_camp, method = "pearson", use = "pairwise.complete.obs" )
core_country_camp
## Country_numeric campaigns_accepted
## Country_numeric 1.00000000 -0.03617805
## campaigns_accepted -0.03617805 1.00000000
df %>%
cor.test(~ Country_numeric + campaigns_accepted, data = .)
##
## Pearson's product-moment correlation
##
## data: Country_numeric and campaigns_accepted
## t = -1.698, df = 2200, p-value = 0.08965
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.077832181 0.005602167
## sample estimates:
## cor
## -0.03617805