GitHub copy
Research Question: Is there a noticeable difference in the amount spent on wines between customers who accepted the offer in the latest marketing campaign and those who did not?
Source: Kaggle.com. (2022). Classification: Customer Personality Analysis. Retrieved on January 7, 2024, from https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis
Import the data (CSV file):
cust_pers <- read.table("/Users/jakakranjc/Downloads/marketing_campaign.csv", header=TRUE, sep = "\t", dec = ".")
The first 6 rows:
head(cust_pers)
## ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer
## 1 5524 1957 Graduation Single 58138 0 0 04-09-2012
## 2 2174 1954 Graduation Single 46344 1 1 08-03-2014
## 3 4141 1965 Graduation Together 71613 0 0 21-08-2013
## 4 6182 1984 Graduation Together 26646 1 0 10-02-2014
## 5 5324 1981 PhD Married 58293 1 0 19-01-2014
## 6 7446 1967 Master Together 62513 0 1 09-09-2013
## Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 1 58 635 88 546 172 88
## 2 38 11 1 6 2 1
## 3 26 426 49 127 111 21
## 4 26 11 4 20 10 3
## 5 94 173 43 118 46 27
## 6 16 520 42 98 0 42
## MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## 1 88 3 8 10
## 2 6 2 1 1
## 3 42 1 8 2
## 4 5 2 2 0
## 5 15 5 5 3
## 6 14 2 6 4
## NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5
## 1 4 7 0 0 0
## 2 2 5 0 0 0
## 3 10 4 0 0 0
## 4 4 6 0 0 0
## 5 6 5 0 0 0
## 6 10 6 0 0 0
## AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
## 1 0 0 0 3 11 1
## 2 0 0 0 3 11 0
## 3 0 0 0 3 11 0
## 4 0 0 0 3 11 0
## 5 0 0 0 3 11 0
## 6 0 0 0 3 11 0
Remove unnecessary columns, rename the rest
cust_personal <- cust_pers[,-c(2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28)]
colnames(cust_personal) <- c("ID", "Znesek_vino", "Odgovor")
head(cust_personal)
## ID Znesek_vino Odgovor
## 1 5524 635 1
## 2 2174 11 0
## 3 4141 426 0
## 4 6182 11 0
## 5 5324 173 0
## 6 7446 520 0
Explanation of Variables:
ID_posameznika <- cust_personal$`ID`
Povprecen_znesek_zapravljen_na_vinu_zadnje_dve_leti <- cust_personal$`Znesek_vino`
Sodelovanje_v_kampanji <- cust_personal$`Odgovor`
Descriptive Statistics:
summary(cust_personal)
## ID Znesek_vino Odgovor
## Min. : 0 Min. : 0.00 Min. :0.0000
## 1st Qu.: 2828 1st Qu.: 23.75 1st Qu.:0.0000
## Median : 5458 Median : 173.50 Median :0.0000
## Mean : 5592 Mean : 303.94 Mean :0.1491
## 3rd Qu.: 8428 3rd Qu.: 504.25 3rd Qu.:0.0000
## Max. :11191 Max. :1493.00 Max. :1.0000
Some Explanations:
- The minimum value for ID is 0, and the maximum is
11191.
- The mean (average) amount spent on wine (Znesek_vino)
is 303.94 EUR, and the median is 173.5
EUR.
- The mean is higher than the median → this indicates positive
skewness.
- The average value for Odgovor = 1 is
0.1491, meaning 14.91% of responses
are equal to 1.
Perform a Welch t-test:
group_1 <- cust_personal$Znesek_vino[cust_personal$Odgovor == 1]
group_2 <- cust_personal$Znesek_vino[cust_personal$Odgovor == 0]
t_test_rezultat <- t.test(group_1, group_2)
print(t_test_rezultat)
##
## Welch Two Sample t-test
##
## data: group_1 and group_2
## t = 9.5748, df = 394.67, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 185.6345 281.5639
## sample estimates:
## mean of x mean of y
## 502.7036 269.1044
A Welch t-test was performed to compare the mean amounts spent on wine between those who participated in the last campaign and those who did not. The p-value was less than 0.05, which means we can reject the null hypothesis. There is a statistically significant difference between the average amounts spent on wine by customers who accepted the campaign and those who did not. Additionally, a 95% confidence interval further supports the ‘advantage’ or positive difference in the group that accepted the campaign.
Next, perform a non-parametric test - Wilcoxon rank-sum test:
group_1_wr <- cust_personal$Znesek_vino[cust_personal$Odgovor == 1]
group_2_wr <- cust_personal$Znesek_vino[cust_personal$Odgovor == 0]
wilcox_rank_vsota <- wilcox.test(group_1_wr, group_2_wr)
print(wilcox_rank_vsota)
##
## Wilcoxon rank sum test with continuity correction
##
## data: group_1_wr and group_2_wr
## W = 420994, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
The Wilcoxon rank-sum test was conducted to compare the distribution of average amounts spent on wine between the two groups (those who accepted the last campaign = 1, and those who did not = 0). The result provides evidence against the null hypothesis, indicating that there is a significant difference in the distribution of amounts. The p-value is less than 2.2e-16, strongly supporting the rejection of the null hypothesis. This confirms that the difference in distribution is not equal to 0, aligning with the previous t-test results.
Check for normal distribution using the Shapiro-Wilk test:
shapiro.test(cust_personal$Znesek_vino[cust_personal$Odgovor == 1])
##
## Shapiro-Wilk normality test
##
## data: cust_personal$Znesek_vino[cust_personal$Odgovor == 1]
## W = 0.91224, p-value = 4.87e-13
shapiro.test(cust_personal$Znesek_vino[cust_personal$Odgovor == 0])
##
## Shapiro-Wilk normality test
##
## data: cust_personal$Znesek_vino[cust_personal$Odgovor == 0]
## W = 0.82905, p-value < 2.2e-16
The Shapiro-Wilk test shows that the assumption of normality is violated for the average amount spent by individuals on wine over the past two years, based on the group they belong to.
Both tests support the rejection of the null hypothesis. However, due to the violation of the normality assumption, we rely on the non-parametric Wilcoxon rank-sum test.
PART 2
I decided to use multiple linear regression, with the data source
remaining the same.
Research Question: How does the amount spent on wine relate to income and the number of children at home?
summary(cust_pers)
## ID Year_Birth Education Marital_Status
## Min. : 0 Min. :1893 Length:2240 Length:2240
## 1st Qu.: 2828 1st Qu.:1959 Class :character Class :character
## Median : 5458 Median :1970 Mode :character Mode :character
## Mean : 5592 Mean :1969
## 3rd Qu.: 8428 3rd Qu.:1977
## Max. :11191 Max. :1996
##
## Income Kidhome Teenhome Dt_Customer
## Min. : 1730 Min. :0.0000 Min. :0.0000 Length:2240
## 1st Qu.: 35303 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Median : 51382 Median :0.0000 Median :0.0000 Mode :character
## Mean : 52247 Mean :0.4442 Mean :0.5062
## 3rd Qu.: 68522 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :666666 Max. :2.0000 Max. :2.0000
## NA's :24
## Recency MntWines MntFruits MntMeatProducts
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0
## 1st Qu.:24.00 1st Qu.: 23.75 1st Qu.: 1.0 1st Qu.: 16
## Median :49.00 Median : 173.50 Median : 8.0 Median : 67
## Mean :49.11 Mean : 303.94 Mean : 26.3 Mean : 167
## 3rd Qu.:74.00 3rd Qu.: 504.25 3rd Qu.: 33.0 3rd Qu.: 232
## Max. :99.00 Max. :1493.00 Max. :199.0 Max. :1725
##
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000
## Median : 12.00 Median : 8.00 Median : 24.00 Median : 2.000
## Mean : 37.53 Mean : 27.06 Mean : 44.02 Mean : 2.325
## 3rd Qu.: 50.00 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 3.000
## Max. :259.00 Max. :263.00 Max. :362.00 Max. :15.000
##
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 3.00 1st Qu.: 3.000
## Median : 4.000 Median : 2.000 Median : 5.00 Median : 6.000
## Mean : 4.085 Mean : 2.662 Mean : 5.79 Mean : 5.317
## 3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 8.00 3rd Qu.: 7.000
## Max. :27.000 Max. :28.000 Max. :13.00 Max. :20.000
##
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.07277 Mean :0.07455 Mean :0.07277 Mean :0.06429
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## AcceptedCmp2 Complain Z_CostContact Z_Revenue
## Min. :0.00000 Min. :0.000000 Min. :3 Min. :11
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:3 1st Qu.:11
## Median :0.00000 Median :0.000000 Median :3 Median :11
## Mean :0.01339 Mean :0.009375 Mean :3 Mean :11
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:3 3rd Qu.:11
## Max. :1.00000 Max. :1.000000 Max. :3 Max. :11
##
## Response
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1491
## 3rd Qu.:0.0000
## Max. :1.0000
##
cust_pers2 <- cust_pers[,-c(1,2,3,4,7,8,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29)]
colnames(cust_pers2) <- c("Prihodek", "Otroci_doma", "Znesek_vino")
summary(cust_pers2)
## Prihodek Otroci_doma Znesek_vino
## Min. : 1730 Min. :0.0000 Min. : 0.00
## 1st Qu.: 35303 1st Qu.:0.0000 1st Qu.: 23.75
## Median : 51382 Median :0.0000 Median : 173.50
## Mean : 52247 Mean :0.4442 Mean : 303.94
## 3rd Qu.: 68522 3rd Qu.:1.0000 3rd Qu.: 504.25
## Max. :666666 Max. :2.0000 Max. :1493.00
## NA's :24
are there NA values or how many of them are empty in Prihodek
sum(is.na(cust_pers2))
## [1] 24
All NAs in Prihodek get assigned median
cust_pers2$Prihodek[is.na(cust_pers2$Prihodek)] <- median(cust_pers2$Prihodek, na.rm = TRUE)
If a value in the “income” column is missing, the median income is filled in the cell to facilitate analysis.
Create a model for multiple linear regression:
fit <- lm(Znesek_vino ~ Prihodek + Otroci_doma, data = cust_pers2)
summary(fit)
##
## Call:
## lm(formula = Znesek_vino ~ Prihodek + Otroci_doma, data = cust_pers2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3877.5 -143.3 -44.9 87.7 1110.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.537e+01 1.634e+01 4.612 4.21e-06 ***
## Prihodek 6.004e-03 2.413e-04 24.882 < 2e-16 ***
## Otroci_doma -1.915e+02 1.122e+01 -17.067 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 258.7 on 2237 degrees of freedom
## Multiple R-squared: 0.4097, Adjusted R-squared: 0.4092
## F-statistic: 776.3 on 2 and 2237 DF, p-value: < 2.2e-16
The p-value is extremely small (2.2e-16), indicating that both income and the number of children at home are statistically significant predictors of the amount spent on wine. The model is statistically significant.
library(car)
## Loading required package: carData
Get VIF
vif(fit)
## Prihodek Otroci_doma
## 1.220856 1.220856
Values are under 5 and acceptable.
Check for coorelation
cor_matrix <- cor(cust_pers2)
print(cor_matrix)
## Prihodek Otroci_doma Znesek_vino
## Prihodek 1.0000000 -0.4253265 0.5769031
## Otroci_doma -0.4253265 1.0000000 -0.4962969
## Znesek_vino 0.5769031 -0.4962969 1.0000000
The correlation between the amount spent on wine and income is 0.57, which means it is moderately positive, indicating that as income increases, the amount spent on wine also increases. The correlation between the amount spent and the number of children at home is -0.49, meaning it is negative, and as the number of children at home increases, the amount spent on wine decreases.
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix, method = "color")
Perform the Breusch-Pagan test for the assumption of
homoscedasticity.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------------
## Response : Znesek_vino
## Variables: fitted values of Znesek_vino
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 6178.5399
## Prob > Chi2 = 0.0000
Null hypothesis (H₀): Variance is constant
(homoscedasticity).
Alternative hypothesis (H₁): Variance is not constant
(heteroscedasticity).
The p-value is very small and less than 0.0000, so we can reject H₀. The
variability of errors is not constant, indicating the presence of
heteroscedasticity.
Next, check the model fit using R² and adjusted R².
summary(fit)$r.squared
## [1] 0.4096862
summary(fit)$adj.r.squared
## [1] 0.4091584