R Project 2

GitHub copy

Research Question: Is there a noticeable difference in the amount spent on wines between customers who accepted the offer in the latest marketing campaign and those who did not?

Source: Kaggle.com. (2022). Classification: Customer Personality Analysis. Retrieved on January 7, 2024, from https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis

Import the data (CSV file):

cust_pers <- read.table("/Users/jakakranjc/Downloads/marketing_campaign.csv", header=TRUE, sep = "\t", dec = ".")

The first 6 rows:

head(cust_pers)

##     ID Year_Birth  Education Marital_Status Income Kidhome Teenhome Dt_Customer
## 1 5524       1957 Graduation         Single  58138       0        0  04-09-2012
## 2 2174       1954 Graduation         Single  46344       1        1  08-03-2014
## 3 4141       1965 Graduation       Together  71613       0        0  21-08-2013
## 4 6182       1984 Graduation       Together  26646       1        0  10-02-2014
## 5 5324       1981        PhD        Married  58293       1        0  19-01-2014
## 6 7446       1967     Master       Together  62513       0        1  09-09-2013
##   Recency MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 1      58      635        88             546             172               88
## 2      38       11         1               6               2                1
## 3      26      426        49             127             111               21
## 4      26       11         4              20              10                3
## 5      94      173        43             118              46               27
## 6      16      520        42              98               0               42
##   MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## 1           88                 3               8                  10
## 2            6                 2               1                   1
## 3           42                 1               8                   2
## 4            5                 2               2                   0
## 5           15                 5               5                   3
## 6           14                 2               6                   4
##   NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5
## 1                 4                 7            0            0            0
## 2                 2                 5            0            0            0
## 3                10                 4            0            0            0
## 4                 4                 6            0            0            0
## 5                 6                 5            0            0            0
## 6                10                 6            0            0            0
##   AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
## 1            0            0        0             3        11        1
## 2            0            0        0             3        11        0
## 3            0            0        0             3        11        0
## 4            0            0        0             3        11        0
## 5            0            0        0             3        11        0
## 6            0            0        0             3        11        0

Remove unnecessary columns, rename the rest

cust_personal <- cust_pers[,-c(2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28)]
colnames(cust_personal) <- c("ID", "Znesek_vino", "Odgovor")
head(cust_personal)

##     ID Znesek_vino Odgovor
## 1 5524         635       1
## 2 2174          11       0
## 3 4141         426       0
## 4 6182          11       0
## 5 5324         173       0
## 6 7446         520       0

Explanation of Variables:

ID: Unique identifier for each customer.
Znesek_vino: Amount (in EUR) spent by the customer on wine over the past two years.
Odgovor:
- 1 = The customer accepted the offer in the latest marketing campaign.
- 2 = The customer did not accept the offer.

ID_posameznika <- cust_personal$`ID`
Povprecen_znesek_zapravljen_na_vinu_zadnje_dve_leti <- cust_personal$`Znesek_vino`
Sodelovanje_v_kampanji <- cust_personal$`Odgovor`

Descriptive Statistics:

summary(cust_personal)

##        ID         Znesek_vino         Odgovor      
##  Min.   :    0   Min.   :   0.00   Min.   :0.0000  
##  1st Qu.: 2828   1st Qu.:  23.75   1st Qu.:0.0000  
##  Median : 5458   Median : 173.50   Median :0.0000  
##  Mean   : 5592   Mean   : 303.94   Mean   :0.1491  
##  3rd Qu.: 8428   3rd Qu.: 504.25   3rd Qu.:0.0000  
##  Max.   :11191   Max.   :1493.00   Max.   :1.0000

Some Explanations:
- The minimum value for ID is 0, and the maximum is 11191.
- The mean (average) amount spent on wine (Znesek_vino) is 303.94 EUR, and the median is 173.5 EUR.
- The mean is higher than the median → this indicates positive skewness.
- The average value for Odgovor = 1 is 0.1491, meaning 14.91% of responses are equal to 1.

Perform a Welch t-test:

group_1 <- cust_personal$Znesek_vino[cust_personal$Odgovor == 1]
group_2 <- cust_personal$Znesek_vino[cust_personal$Odgovor == 0]

t_test_rezultat <- t.test(group_1, group_2)
print(t_test_rezultat)

## 
##  Welch Two Sample t-test
## 
## data:  group_1 and group_2
## t = 9.5748, df = 394.67, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  185.6345 281.5639
## sample estimates:
## mean of x mean of y 
##  502.7036  269.1044

A Welch t-test was performed to compare the mean amounts spent on wine between those who participated in the last campaign and those who did not. The p-value was less than 0.05, which means we can reject the null hypothesis. There is a statistically significant difference between the average amounts spent on wine by customers who accepted the campaign and those who did not. Additionally, a 95% confidence interval further supports the ‘advantage’ or positive difference in the group that accepted the campaign.

Next, perform a non-parametric test - Wilcoxon rank-sum test:

group_1_wr <- cust_personal$Znesek_vino[cust_personal$Odgovor == 1]
group_2_wr <- cust_personal$Znesek_vino[cust_personal$Odgovor == 0]

wilcox_rank_vsota <- wilcox.test(group_1_wr, group_2_wr)
print(wilcox_rank_vsota)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  group_1_wr and group_2_wr
## W = 420994, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

The Wilcoxon rank-sum test was conducted to compare the distribution of average amounts spent on wine between the two groups (those who accepted the last campaign = 1, and those who did not = 0). The result provides evidence against the null hypothesis, indicating that there is a significant difference in the distribution of amounts. The p-value is less than 2.2e-16, strongly supporting the rejection of the null hypothesis. This confirms that the difference in distribution is not equal to 0, aligning with the previous t-test results.

Check for normal distribution using the Shapiro-Wilk test:

shapiro.test(cust_personal$Znesek_vino[cust_personal$Odgovor == 1])

## 
##  Shapiro-Wilk normality test
## 
## data:  cust_personal$Znesek_vino[cust_personal$Odgovor == 1]
## W = 0.91224, p-value = 4.87e-13

shapiro.test(cust_personal$Znesek_vino[cust_personal$Odgovor == 0])

## 
##  Shapiro-Wilk normality test
## 
## data:  cust_personal$Znesek_vino[cust_personal$Odgovor == 0]
## W = 0.82905, p-value < 2.2e-16

The Shapiro-Wilk test shows that the assumption of normality is violated for the average amount spent by individuals on wine over the past two years, based on the group they belong to.

Both tests support the rejection of the null hypothesis. However, due to the violation of the normality assumption, we rely on the non-parametric Wilcoxon rank-sum test.

PART 2
I decided to use multiple linear regression, with the data source remaining the same.

Research Question: How does the amount spent on wine relate to income and the number of children at home?

summary(cust_pers)

##        ID          Year_Birth    Education         Marital_Status    
##  Min.   :    0   Min.   :1893   Length:2240        Length:2240       
##  1st Qu.: 2828   1st Qu.:1959   Class :character   Class :character  
##  Median : 5458   Median :1970   Mode  :character   Mode  :character  
##  Mean   : 5592   Mean   :1969                                        
##  3rd Qu.: 8428   3rd Qu.:1977                                        
##  Max.   :11191   Max.   :1996                                        
##                                                                      
##      Income          Kidhome          Teenhome      Dt_Customer       
##  Min.   :  1730   Min.   :0.0000   Min.   :0.0000   Length:2240       
##  1st Qu.: 35303   1st Qu.:0.0000   1st Qu.:0.0000   Class :character  
##  Median : 51382   Median :0.0000   Median :0.0000   Mode  :character  
##  Mean   : 52247   Mean   :0.4442   Mean   :0.5062                     
##  3rd Qu.: 68522   3rd Qu.:1.0000   3rd Qu.:1.0000                     
##  Max.   :666666   Max.   :2.0000   Max.   :2.0000                     
##  NA's   :24                                                           
##     Recency         MntWines         MntFruits     MntMeatProducts
##  Min.   : 0.00   Min.   :   0.00   Min.   :  0.0   Min.   :   0   
##  1st Qu.:24.00   1st Qu.:  23.75   1st Qu.:  1.0   1st Qu.:  16   
##  Median :49.00   Median : 173.50   Median :  8.0   Median :  67   
##  Mean   :49.11   Mean   : 303.94   Mean   : 26.3   Mean   : 167   
##  3rd Qu.:74.00   3rd Qu.: 504.25   3rd Qu.: 33.0   3rd Qu.: 232   
##  Max.   :99.00   Max.   :1493.00   Max.   :199.0   Max.   :1725   
##                                                                   
##  MntFishProducts  MntSweetProducts  MntGoldProds    NumDealsPurchases
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.000   
##  1st Qu.:  3.00   1st Qu.:  1.00   1st Qu.:  9.00   1st Qu.: 1.000   
##  Median : 12.00   Median :  8.00   Median : 24.00   Median : 2.000   
##  Mean   : 37.53   Mean   : 27.06   Mean   : 44.02   Mean   : 2.325   
##  3rd Qu.: 50.00   3rd Qu.: 33.00   3rd Qu.: 56.00   3rd Qu.: 3.000   
##  Max.   :259.00   Max.   :263.00   Max.   :362.00   Max.   :15.000   
##                                                                      
##  NumWebPurchases  NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
##  Min.   : 0.000   Min.   : 0.000      Min.   : 0.00     Min.   : 0.000   
##  1st Qu.: 2.000   1st Qu.: 0.000      1st Qu.: 3.00     1st Qu.: 3.000   
##  Median : 4.000   Median : 2.000      Median : 5.00     Median : 6.000   
##  Mean   : 4.085   Mean   : 2.662      Mean   : 5.79     Mean   : 5.317   
##  3rd Qu.: 6.000   3rd Qu.: 4.000      3rd Qu.: 8.00     3rd Qu.: 7.000   
##  Max.   :27.000   Max.   :28.000      Max.   :13.00     Max.   :20.000   
##                                                                          
##   AcceptedCmp3      AcceptedCmp4      AcceptedCmp5      AcceptedCmp1    
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.07277   Mean   :0.07455   Mean   :0.07277   Mean   :0.06429  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##                                                                         
##   AcceptedCmp2        Complain        Z_CostContact   Z_Revenue 
##  Min.   :0.00000   Min.   :0.000000   Min.   :3     Min.   :11  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:3     1st Qu.:11  
##  Median :0.00000   Median :0.000000   Median :3     Median :11  
##  Mean   :0.01339   Mean   :0.009375   Mean   :3     Mean   :11  
##  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:3     3rd Qu.:11  
##  Max.   :1.00000   Max.   :1.000000   Max.   :3     Max.   :11  
##                                                                 
##     Response     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.1491  
##  3rd Qu.:0.0000  
##  Max.   :1.0000  
##

cust_pers2 <- cust_pers[,-c(1,2,3,4,7,8,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29)]
colnames(cust_pers2) <- c("Prihodek", "Otroci_doma", "Znesek_vino")
summary(cust_pers2)

##     Prihodek       Otroci_doma      Znesek_vino     
##  Min.   :  1730   Min.   :0.0000   Min.   :   0.00  
##  1st Qu.: 35303   1st Qu.:0.0000   1st Qu.:  23.75  
##  Median : 51382   Median :0.0000   Median : 173.50  
##  Mean   : 52247   Mean   :0.4442   Mean   : 303.94  
##  3rd Qu.: 68522   3rd Qu.:1.0000   3rd Qu.: 504.25  
##  Max.   :666666   Max.   :2.0000   Max.   :1493.00  
##  NA's   :24

are there NA values or how many of them are empty in Prihodek

sum(is.na(cust_pers2))

## [1] 24

All NAs in Prihodek get assigned median

cust_pers2$Prihodek[is.na(cust_pers2$Prihodek)] <- median(cust_pers2$Prihodek, na.rm = TRUE)

If a value in the “income” column is missing, the median income is filled in the cell to facilitate analysis.

Create a model for multiple linear regression:

fit <- lm(Znesek_vino ~ Prihodek + Otroci_doma, data = cust_pers2)

summary(fit)

## 
## Call:
## lm(formula = Znesek_vino ~ Prihodek + Otroci_doma, data = cust_pers2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3877.5  -143.3   -44.9    87.7  1110.2 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.537e+01  1.634e+01   4.612 4.21e-06 ***
## Prihodek     6.004e-03  2.413e-04  24.882  < 2e-16 ***
## Otroci_doma -1.915e+02  1.122e+01 -17.067  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 258.7 on 2237 degrees of freedom
## Multiple R-squared:  0.4097, Adjusted R-squared:  0.4092 
## F-statistic: 776.3 on 2 and 2237 DF,  p-value: < 2.2e-16

The p-value is extremely small (2.2e-16), indicating that both income and the number of children at home are statistically significant predictors of the amount spent on wine. The model is statistically significant.

library(car)

## Loading required package: carData

Get VIF

vif(fit)

##    Prihodek Otroci_doma 
##    1.220856    1.220856

Values are under 5 and acceptable.

Check for coorelation

cor_matrix <- cor(cust_pers2)
print(cor_matrix)

##               Prihodek Otroci_doma Znesek_vino
## Prihodek     1.0000000  -0.4253265   0.5769031
## Otroci_doma -0.4253265   1.0000000  -0.4962969
## Znesek_vino  0.5769031  -0.4962969   1.0000000

The correlation between the amount spent on wine and income is 0.57, which means it is moderately positive, indicating that as income increases, the amount spent on wine also increases. The correlation between the amount spent and the number of children at home is -0.49, meaning it is negative, and as the number of children at home increases, the amount spent on wine decreases.

library(corrplot)

## corrplot 0.92 loaded

corrplot(cor_matrix, method = "color")

Perform the Breusch-Pagan test for the assumption of homoscedasticity.

library(olsrr)

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:datasets':
## 
##     rivers

ols_test_breusch_pagan(fit)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                  Data                   
##  ---------------------------------------
##  Response : Znesek_vino 
##  Variables: fitted values of Znesek_vino 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    6178.5399 
##  Prob > Chi2   =    0.0000

Null hypothesis (H₀): Variance is constant (homoscedasticity).
Alternative hypothesis (H₁): Variance is not constant (heteroscedasticity).
The p-value is very small and less than 0.0000, so we can reject H₀. The variability of errors is not constant, indicating the presence of heteroscedasticity.

Next, check the model fit using R² and adjusted R².

summary(fit)$r.squared

## [1] 0.4096862

summary(fit)$adj.r.squared

## [1] 0.4091584