2024-11-10

Introduction

The aim of this analysis is to assess the effect of annual income on purchase amount using the Customer Purchasing Behaviors dataset from Kaggle (https://www.kaggle.com/datasets/hanaksoy/customer-purchasing-behaviors). It starts with a scatter plot of the two variables, then extends to simple and multiple linear regression to quantify the effect with and without adjustment for confounding factors.

Scatter plot of purchase amount against annual income

Interactive scatter plot of purchase amount against annual income by region

Wider confidence intervals (CI) in the East

Explanation for the wider confidence intervals (CI) in the East than in the other regions

Comparing the scatter plots across regions, I observed a wider confidence interval (CI) in the East than in the other regions, most visibly the West. This can be explained by inspecting the formula for the CI of a mean: \(CI = \bar{x} \pm z\,\frac{s}{\sqrt{n}}\), where \(\bar{x}\) is the sample mean, z is the critical value for the chosen confidence level (e.g. 1.96 for a two-sided 95% confidence interval), s is the sample standard deviation and n is the sample size. The sample size in the East is much smaller than in the West, which gives a larger value for the term \(z\,\frac{s}{\sqrt{n}}\) and therefore a wider CI for the East. The confidence band around a fitted regression line uses a related standard error that shrinks with n in the same way.
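The dependence of the margin on sample size can be checked numerically. The sketch below is only an illustration: the standard deviation and the two sample sizes are hypothetical values, not taken from the dataset, and `ci_margin` is a helper name introduced here.

```r
# Margin of error of a mean CI: z * s / sqrt(n)
ci_margin <- function(s, n, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)  # two-sided critical value, about 1.96 for 95%
  z * s / sqrt(n)
}

# Hypothetical values: same spread, different sample sizes
s <- 100
margin_small <- ci_margin(s, n = 25)   # stand-in for a small region
margin_large <- ci_margin(s, n = 100)  # stand-in for a large region

# With a quarter of the sample size, the margin doubles
margin_small / margin_large
```

Because the margin scales with \(1/\sqrt{n}\), halving the width of a CI takes roughly four times as many observations.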

Simple linear regression

A simple linear regression was used to assess how well annual income predicts purchase amount: \(Y = \beta_0+\beta_1X+\epsilon\), where Y is purchase_amount, X is annual income, \(\beta_0\) is the intercept, \(\beta_1\) is the coefficient for annual income and \(\epsilon\) is the random error.

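# Fit purchase amount on annual income by ordinary least squares
lm_res <- lm(purchase_amount ~ annual_income, data = dat_df)
# Coefficient table: estimate, standard error, t statistic and p-value
coef(summary(lm_res))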
##                    Estimate Std. Error   t value      Pr(>|t|)
## (Intercept)   -268.26390507 8.28181716 -32.39191  7.985522e-89
## annual_income    0.01208716 0.00014151  85.41557 1.783646e-179
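To make the fitted coefficients concrete, the sketch below plugs them into the model equation. The coefficients are copied from the output above; the income value of 50000 is a hypothetical input chosen for illustration, and `predict_purchase` is a helper name introduced here.

```r
# Coefficients copied from the simple regression output
b0 <- -268.26390507  # intercept
b1 <- 0.01208716     # change in purchase amount per unit of annual income

# Fitted value from Y = b0 + b1 * X
predict_purchase <- function(income) b0 + b1 * income

# Hypothetical income of 50000 (units as in the dataset)
predict_purchase(50000)                            # about 336.09
# Effect of an extra 1000 of income on the fitted purchase amount
predict_purchase(51000) - predict_purchase(50000)  # about 12.09
```

With the fitted model object available, `predict(lm_res, newdata = data.frame(annual_income = 50000))` should give the same fitted value.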

Multiple linear regression

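To assess the effect of annual income on purchase amount after adjusting for potential confounding factors, the following model was used: \(Y = \beta_0+\beta_1X_1+\beta_2X_2+\beta_3X_3+\beta_4X_4+\beta_5X_5+\epsilon\), where Y is purchase_amount, \(X_1\) is annual income, \(X_2\) is purchase frequency, \(X_3\) is loyalty score, \(X_4\) is region, \(X_5\) is age and \(\epsilon\) is the random error. Region is categorical, so in the fitted model it enters as three indicator variables (North, South and West, with East as the reference level) and \(\beta_4\) stands for their three coefficients.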

##                         Estimate   Std. Error     t value     Pr(>|t|)
## (Intercept)        -1.315901e+02 7.7761838284 -16.9221936 2.859069e-42
## annual_income      -2.854172e-04 0.0004048954  -0.7049158 4.815758e-01
## purchase_frequency  1.284403e+01 1.0304725973  12.4642095 1.379213e-27
## loyalty_score       3.379738e+01 2.8087244312  12.0329991 3.442501e-26
## regionNorth         5.467634e+00 4.2979203772   1.2721581 2.046016e-01
## regionSouth         4.224960e+00 4.3142649789   0.9793000 3.284607e-01
## regionWest          6.973048e+00 4.3713301779   1.5951777 1.120456e-01
## age                 2.179029e+00 0.4003687069   5.4425554 1.342838e-07
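The call that produced this table is not shown, so the following is a sketch of how such a model could be fitted with `lm`, not the author's code. It runs on a synthetic data frame (`sim_df`, with made-up values) that only mimics the column names seen in the output; the object names `sim_df` and `mlm_res` are assumptions.

```r
set.seed(1)
n <- 200

# Synthetic stand-in for the real data; all values are made up for illustration
sim_df <- data.frame(
  annual_income      = runif(n, 30000, 75000),
  purchase_frequency = sample(10:28, n, replace = TRUE),
  loyalty_score      = runif(n, 3, 10),
  region             = factor(rep(c("East", "North", "South", "West"), length.out = n)),
  age                = sample(22:55, n, replace = TRUE)
)
sim_df$purchase_amount <- 0.01 * sim_df$annual_income +
  5 * sim_df$purchase_frequency + rnorm(n, sd = 20)

# Multiple linear regression; lm() expands the factor `region` into
# indicator variables, using the first level (East) as the reference
mlm_res <- lm(
  purchase_amount ~ annual_income + purchase_frequency + loyalty_score + region + age,
  data = sim_df
)

# Same row layout as the coefficient table: intercept, three numeric
# predictors, three region indicators, age
rownames(coef(summary(mlm_res)))
```

The estimates from this synthetic fit are meaningless; only the structure of the call and of the coefficient table carries over.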

Percent of variance explained by each variable in the model

Discussion and Conclusion

In the simple linear regression, annual income is a statistically significant predictor of purchase amount (coefficient 0.0121, p ≈ 1.8e-179). However, after purchase frequency, loyalty score, region and age are added in a multiple linear regression, the coefficient for annual income is no longer significant (-0.00029, p ≈ 0.48). A possible reason for the difference between the models with and without adjustment is collinearity between annual income and the other predictors, which needs to be assessed further. Nevertheless, annual income explains the most variance (> 90%) of purchase amount among the independent variables.
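One common way to assess the collinearity mentioned above is the variance inflation factor (VIF): regress each predictor on the others and compute \(1/(1-R^2)\). The sketch below does this by hand on synthetic data in which loyalty score is constructed to track income closely; the data, the strength of that relationship and the helper name `vif_by_hand` are all assumptions for illustration, not results from the dataset.

```r
set.seed(42)
n <- 200

# Synthetic predictors: loyalty score is built to follow income, age is not
annual_income <- runif(n, 30000, 75000)
loyalty_score <- annual_income / 10000 + rnorm(n, sd = 0.3)
age           <- runif(n, 22, 55)

# VIF of one predictor: 1 / (1 - R^2) from regressing it on the others
vif_by_hand <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

# Large for income, because loyalty_score carries nearly the same information
vif_income <- vif_by_hand(annual_income, data.frame(loyalty_score, age))
# Close to 1 for age, which is unrelated to the other two
vif_age    <- vif_by_hand(age, data.frame(annual_income, loyalty_score))

c(income = vif_income, age = vif_age)
```

A VIF above roughly 5 to 10 is a common rule-of-thumb sign that a coefficient's standard error is inflated by collinearity, which would be consistent with a predictor losing significance once correlated variables enter the model. For a fitted model, the `vif()` function in the `car` package offers a ready-made alternative.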