1 Case study

The shop customer data set in this case study contains 2000 observations on 8 variables. The data set is available in the UCI machine learning data repository. R library {mlbench} has two versions of this data. The data set does not contain any missing values. Shop Customer Data is a detailed analysis of an imaginary shop's ideal customers, intended to help a business better understand its customers. The shop owner collects information about customers through membership cards.

We will perform a logistic regression to see whether annual income is an effective predictor of a customer's gender.

1.1 Data and Variable Description

There are 8 variables in the data set:

Customer ID: ID of the customer

Gender: Male or Female

Age: Age (in years)

Annual Income: Annual income of the customer

Spending Score: Score assigned by the shop, based on customer behavior and spending nature

Profession: The customer's profession

Work Experience: Work experience (in years)

Family Size: Number of people in the customer's household

1.2 Research Question

The purpose of this study is to discover the factors that are associated with whether a customer is male or female. We first make pairwise scatter plots to check for potential issues with the predictor variables.

We can see that Annual Income, Profession, and Work Experience are skewed to the left, while the other variables are unimodal. Next, we look at the frequency distributions of only the numerical predictor variables that are not unimodal, which are Annual Income and Work Experience.

Based on the above histograms, we will discretize Work Experience and Annual Income.
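The discretization step can be sketched as follows. This is a minimal Python illustration, not the code used for the analysis. The group labels "6-12 years", "12+years", "Middle class", and "Upper class" are taken from the model output later in this case study; the baseline labels, the boundary handling, and the income cut-offs (`lower_cut`, `upper_cut`) are assumptions made only for this example.

```python
def work_experience_group(years):
    """Assign years of work experience to a group.

    The baseline label "0-5 years" and the boundary handling
    (6 and 12 both fall in "6-12 years") are assumptions.
    """
    if years < 0:
        raise ValueError("years of experience cannot be negative")
    if years < 6:
        return "0-5 years"
    if years <= 12:
        return "6-12 years"
    return "12+years"


def income_group(income, lower_cut=50_000, upper_cut=120_000):
    """Assign annual income to a class using hypothetical cut-offs."""
    if income < lower_cut:
        return "Lower class"
    if income < upper_cut:
        return "Middle class"
    return "Upper class"


# Hypothetical customers: (work experience in years, annual income)
customers = [(1, 35_000), (8, 90_000), (15, 150_000)]
for years, income in customers:
    print(work_experience_group(years), "|", income_group(income))
```

Each grouped variable with three levels then enters the model as two dummy variables, with the first level serving as the baseline.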

There was not much correlation among the predictor variables. However, we will not drop any of these variables for the moment. Instead, we will use an automatic variable selection process to remove possibly redundant variables, since only a few of the variables will be forced into the final model. Since we are performing an association analysis at the moment, we will not perform any variable transformations for the time being.

1.3 Building The Multiple Logistic Regression Model

Based on the exploratory analysis, we first build the full model and the smallest model.

It is common in real-world applications that some practically important variables are always included in the final model regardless of their statistical significance. In the shopping customer data set, three variables will always be included in our final model: Annual Income, Work Experience, and Spending Score. We consider these practically important for assessing whether or not the customer is male.

Summary of inferential statistics of the full model
Term                         Estimate  Std. Error     z value    Pr(>|z|)
(Intercept)                -0.2211873   0.4186433  -0.5283431   0.5972612
CustomerID                  0.0000175   0.0000897   0.1956329   0.8448975
Age                         0.0003104   0.0016219   0.1913797   0.8482281
Annual.Income….            -0.0000022   0.0000017  -1.3372883   0.1811285
Spending.Score..1.100.     -0.0000626   0.0016428  -0.0381179   0.9695937
ProfessionArtist           -0.1972037   0.3531538  -0.5584074   0.5765662
ProfessionDoctor            0.0792619   0.3788839   0.2091984   0.8342934
ProfessionEngineer         -0.0085352   0.3750279  -0.0227589   0.9818426
ProfessionEntertainment     0.0227568   0.3678807   0.0618593   0.9506749
ProfessionExecutive         0.0224020   0.3797274   0.0589948   0.9529562
ProfessionHealthcare       -0.0157848   0.3599731  -0.0438499   0.9650241
ProfessionHomemaker        -0.3382848   0.4371807  -0.7737870   0.4390567
ProfessionLawyer           -0.1166701   0.3840983  -0.3037506   0.7613179
ProfessionMarketing        -0.2131749   0.4103341  -0.5195155   0.6034013
Work.Experience            -0.0067170   0.0291688  -0.2302811   0.8178733
Family.Size                -0.0013055   0.0236049  -0.0553074   0.9558936
Annual.catMiddle class     -0.1196332   0.2326503  -0.5142187   0.6070991
Annual.catUpper class       0.2046030   0.2793930   0.7323125   0.4639778
Work.cat6-12 years          0.1312982   0.2161967   0.6073092   0.5436457
Work.cat12+years            0.3733014   0.4720583   0.7907952   0.4290635

Summary of inferential statistics of the reduced model
Term                         Estimate  Std. Error     z value    Pr(>|z|)
(Intercept)                -0.4340921   0.1496388  -2.9009333   0.0037205
Annual.Income….             0.0000001   0.0000010   0.1284676   0.8977789
Work.Experience             0.0105621   0.0116352   0.9077747   0.3639972
Spending.Score..1.100.     -0.0000002   0.0016314  -0.0001295   0.9998967

Summary of inferential statistics of the final model
Term                         Estimate  Std. Error     z value    Pr(>|z|)
(Intercept)                -0.3034642   0.2122171  -1.4299703   0.1527255
Annual.Income….            -0.0000020   0.0000016  -1.2340340   0.2171902
Work.Experience             0.0105151   0.0116552   0.9021794   0.3669616
Spending.Score..1.100.     -0.0000849   0.0016336  -0.0519923   0.9585348
Annual.catMiddle class     -0.1239278   0.2243462  -0.5523955   0.5806774
Annual.catUpper class       0.1996485   0.2705176   0.7380239   0.4604999

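The z value and Pr(>|z|) columns in the summary tables follow from the estimate and its standard error through a Wald test. The sketch below, written in Python for illustration and assuming the usual two-sided test against a standard normal distribution, recomputes them for one row. The function name `wald_test` is hypothetical. Because the printed estimates are rounded, the result matches the table only approximately, and rows with very small coefficients (such as Annual Income) will not reproduce well from the rounded values.

```python
import math


def wald_test(estimate, std_error):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = estimate / std_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Work.Experience row of the final model (values as printed in the table).
z, p = wald_test(0.0105151, 0.0116552)
print(f"z = {z:.4f}, p = {p:.4f}")
```

A p-value this far above 0.05 is why Work Experience is described as statistically insignificant even though it is kept in the model.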
Comparison of global goodness-of-fit statistics
Model            Residual.Deviance  Null.Deviance       AIC
full.model                2690.907       2702.992  2730.907
reduced.model             2702.124       2702.992  2710.124
final.model               2697.641       2702.992  2709.641
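The AIC values in this comparison can be checked by hand. For a logistic regression on individual binary outcomes, the residual deviance equals minus twice the log-likelihood, so AIC is the residual deviance plus twice the number of estimated coefficients. The Python sketch below applies this under that assumption; the coefficient counts (20, 4, and 6) are read off the three summary tables above, and the function name is hypothetical.

```python
def aic_from_deviance(residual_deviance, n_coefficients):
    """AIC for a binary logistic model: deviance + 2 * (number of coefficients)."""
    return residual_deviance + 2 * n_coefficients


# (residual deviance, number of coefficients including the intercept)
models = {
    "full.model": (2690.907, 20),
    "reduced.model": (2702.124, 4),
    "final.model": (2697.641, 6),
}

aic = {name: aic_from_deviance(dev, k) for name, (dev, k) in models.items()}
best = min(aic, key=aic.get)

for name, value in aic.items():
    print(f"{name}: AIC = {value:.3f}")
print("lowest AIC:", best)
```

The full model has the smallest deviance but pays a penalty for its 20 coefficients, which is why the smaller final model is preferred by AIC.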

1.4 Final Model

In the exploratory analysis, we found no strong linear correlation among the predictor variables. After automatic variable selection, CustomerID, Age, Profession, Family Size, and the grouped Work Experience variable (Work.cat) were removed to make our final model. Annual Income, Work Experience, and Spending Score remain in the model, together with the grouped Annual Income variable (Annual.cat). Although none of the predictor variables is statistically significant, we still include them in our model because we consider them practically relevant to whether or not the customer is male.
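To make the fitted model concrete, the following Python sketch turns the final-model coefficients (as printed, so rounded) into a predicted probability through the inverse logit. It assumes the modeled outcome is "male", as the text states. The customer is hypothetical, and placing an income of 50,000 in the middle class group is an assumption, since the grouping cut-offs are not given here.

```python
import math

# Final-model coefficients as printed in the summary table (rounded).
coef = {
    "intercept": -0.3034642,
    "annual_income": -0.0000020,
    "work_experience": 0.0105151,
    "spending_score": -0.0000849,
    "middle_class": -0.1239278,
    "upper_class": 0.1996485,
}


def predicted_probability(income, experience, spending, income_class):
    """Inverse logit of the linear predictor; 'lower' is the baseline class."""
    eta = (coef["intercept"]
           + coef["annual_income"] * income
           + coef["work_experience"] * experience
           + coef["spending_score"] * spending)
    if income_class == "middle":
        eta += coef["middle_class"]
    elif income_class == "upper":
        eta += coef["upper_class"]
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical customer: income 50,000, 5 years of experience, score 50.
p = predicted_probability(50_000, 5, 50, "middle")
print(f"predicted probability of male: {p:.3f}")
```

Because all slope coefficients are close to zero, predicted probabilities vary little across customers, which is consistent with the weak predictive value of the model.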

Summary statistics with odds ratios
Term                         Estimate  Std. Error     z value    Pr(>|z|)  odds.ratio
(Intercept)                -0.3034642   0.2122171  -1.4299703   0.1527255   0.7382564
Annual.Income….            -0.0000020   0.0000016  -1.2340340   0.2171902   0.9999980
Work.Experience             0.0105151   0.0116552   0.9021794   0.3669616   1.0105706
Spending.Score..1.100.     -0.0000849   0.0016336  -0.0519923   0.9585348   0.9999151
Annual.catMiddle class     -0.1239278   0.2243462  -0.5523955   0.5806774   0.8834436
Annual.catUpper class       0.1996485   0.2705176   0.7380239   0.4604999   1.2209735

The Annual.cat variable has 3 categories, and the baseline category is lower class. From the above inferential table, the estimated odds of a customer being male do not rise steadily with income: the odds ratio for continuous Annual Income is essentially 1, and the two group effects point in opposite directions. For example, the odds ratio associated with the upper class group is 1.22, which means that, at the same level of Spending Score, Work Experience, and Annual Income, the odds of being male in the upper class group are 1.22 times those in the baseline lower class group. The same ratio is 0.88 when comparing the middle class group with the baseline group. Neither difference is statistically significant.
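The odds ratios are obtained by exponentiating the coefficient estimates. The Python sketch below reproduces the two Annual.cat odds ratios and adds approximate 95% Wald confidence intervals. The intervals are not part of the original output; they are an illustrative addition assuming a normal critical value of 1.96, and the function name is hypothetical.

```python
import math


def odds_ratio_with_ci(estimate, std_error, z_crit=1.96):
    """Return (odds ratio, lower bound, upper bound) on the odds-ratio scale."""
    odds_ratio = math.exp(estimate)
    lower = math.exp(estimate - z_crit * std_error)
    upper = math.exp(estimate + z_crit * std_error)
    return odds_ratio, lower, upper


# (estimate, standard error) as printed in the final-model table.
rows = {
    "Annual.catMiddle class": (-0.1239278, 0.2243462),
    "Annual.catUpper class": (0.1996485, 0.2705176),
}

for term, (est, se) in rows.items():
    ratio, lo, hi = odds_ratio_with_ci(est, se)
    print(f"{term}: OR = {ratio:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

An interval that contains 1 tells the same story as a large p-value: the data do not distinguish that group from the baseline.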

2 Summary and Conclusion

This study focused on the association between a set of predictor variables and whether a customer is male. The initial data set has 6 numerical variables and 2 categorical variables. After exploratory analysis, we regrouped two numerical variables, Annual Income and Work Experience, and created dummy variables for the resulting groups. These grouped variables were used in the model search process to find the model with the lowest AIC. Because we consider Annual Income, Work Experience, and Spending Score practically relevant to whether a customer is male, we included them in the final model regardless of their statistical significance. After the automatic variable selection process, we obtained the final model with 4 factors: Annual Income, Work Experience, Spending Score, and Annual.cat (2 dummy variables). None of these variables is statistically significant, so the final model is not effective overall for predicting a customer's gender. The data set appears to need more information about the customers so that additional predictor variables can be incorporated into a more effective model.