The shopping customer data set in this case study contains 2000 observations on 8 variables. The data set is available in the UCI machine learning data repository. R library {mlbench} has two versions of this data. The data set does not contain any missing values. Shop Customer Data is a detailed analysis of an imaginary shop’s ideal customers. It helps a business better understand its customers. The shop owner collects information about customers through membership cards.
We will perform a logistic regression to see whether Annual Income is an effective predictor of a customer’s gender.
There are 8 variables in the data set:
Customer ID: Id of the customers
Gender: Male or Female
Age: Age (in years)
Annual Income: How much money they make every year
Spending Score: Score assigned by the shop, based on customer behavior and spending nature
Profession: The customer’s occupation
Work Experience: Work experience (in years)
Family Size: number of people in their household
The purpose of this study is to discover the factors that determine whether a customer is male or female. We first make pairwise scatter plots to check for potential issues with the predictor variables.
We can see that Annual Income, Profession, and Work Experience are skewed to the left, while the other variables are unimodal. Next, we look at the frequency distributions of the numerical predictor variables that are not unimodal, namely Annual Income and Work Experience.
Based on the above histograms, we will discretize Work Experience and Annual Income.
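The discretization step can be sketched as follows. This is a minimal Python illustration, not the code used in the study: the level names "Middle class", "Upper class", "6-12 years", and "12+years" come from the model output, while the baseline label "Under 6 years", the income cut points, and the handling of values that fall exactly on a boundary are all assumptions.

```python
from bisect import bisect_right

# Level names as they appear in the model output. "Lower class" is the stated
# baseline; "Under 6 years" is an assumed name for the Work.cat baseline.
INCOME_LABELS = ["Lower class", "Middle class", "Upper class"]
WORK_LABELS = ["Under 6 years", "6-12 years", "12+years"]

# Hypothetical income cut points in dollars; the study does not state them.
INCOME_CUTS = [50_000, 120_000]
# Work-experience cut points in years, inferred from the level names.
WORK_CUTS = [6, 12]


def discretize(value, cuts, labels):
    """Return the label of the bin that `value` falls into.

    Bins are left-closed: a value equal to a cut point goes to the upper bin
    (an assumption, since the study does not say how boundaries are handled).
    """
    return labels[bisect_right(cuts, value)]


def income_category(income):
    """Map an annual income in dollars to an Annual.cat level."""
    return discretize(income, INCOME_CUTS, INCOME_LABELS)


def work_category(years):
    """Map years of work experience to a Work.cat level."""
    return discretize(years, WORK_CUTS, WORK_LABELS)


print(income_category(75_000), "|", work_category(3))
```

Each grouped variable with three levels then enters the model as two dummy variables, with the first level acting as the baseline.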
There was not much correlation among the predictor variables. We will not drop any of these variables for the moment; instead, we will use an automatic variable selection process to remove possibly redundant variables, with only a few variables forced into the final model. Since we are performing an association analysis at the moment, we will not perform any variable transformations for the time being.
Based on the exploratory analysis, we first build the full model and the smallest model.
It is common in real-world applications that some practically important variables are always included in the final model regardless of their statistical significance. In the shopping customer data set, three variables will always be included in our final model: Annual.Income, Work.Experience, and Spending.Score..1.100. We consider these practically important for determining whether or not a customer is male.
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -0.2211873 | 0.4186433 | -0.5283431 | 0.5972612 |
| CustomerID | 0.0000175 | 0.0000897 | 0.1956329 | 0.8448975 |
| Age | 0.0003104 | 0.0016219 | 0.1913797 | 0.8482281 |
| Annual.Income.... | -0.0000022 | 0.0000017 | -1.3372883 | 0.1811285 |
| Spending.Score..1.100. | -0.0000626 | 0.0016428 | -0.0381179 | 0.9695937 |
| ProfessionArtist | -0.1972037 | 0.3531538 | -0.5584074 | 0.5765662 |
| ProfessionDoctor | 0.0792619 | 0.3788839 | 0.2091984 | 0.8342934 |
| ProfessionEngineer | -0.0085352 | 0.3750279 | -0.0227589 | 0.9818426 |
| ProfessionEntertainment | 0.0227568 | 0.3678807 | 0.0618593 | 0.9506749 |
| ProfessionExecutive | 0.0224020 | 0.3797274 | 0.0589948 | 0.9529562 |
| ProfessionHealthcare | -0.0157848 | 0.3599731 | -0.0438499 | 0.9650241 |
| ProfessionHomemaker | -0.3382848 | 0.4371807 | -0.7737870 | 0.4390567 |
| ProfessionLawyer | -0.1166701 | 0.3840983 | -0.3037506 | 0.7613179 |
| ProfessionMarketing | -0.2131749 | 0.4103341 | -0.5195155 | 0.6034013 |
| Work.Experience | -0.0067170 | 0.0291688 | -0.2302811 | 0.8178733 |
| Family.Size | -0.0013055 | 0.0236049 | -0.0553074 | 0.9558936 |
| Annual.catMiddle class | -0.1196332 | 0.2326503 | -0.5142187 | 0.6070991 |
| Annual.catUpper class | 0.2046030 | 0.2793930 | 0.7323125 | 0.4639778 |
| Work.cat6-12 years | 0.1312982 | 0.2161967 | 0.6073092 | 0.5436457 |
| Work.cat12+years | 0.3733014 | 0.4720583 | 0.7907952 | 0.4290635 |

| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -0.4340921 | 0.1496388 | -2.9009333 | 0.0037205 |
| Annual.Income.... | 0.0000001 | 0.0000010 | 0.1284676 | 0.8977789 |
| Work.Experience | 0.0105621 | 0.0116352 | 0.9077747 | 0.3639972 |
| Spending.Score..1.100. | -0.0000002 | 0.0016314 | -0.0001295 | 0.9998967 |
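
The z value and p-value columns in these tables follow from the estimate and its standard error through a Wald test. A minimal Python sketch of that arithmetic, using the intercept row of the table above, is given here for illustration only; the function name `wald_test` is hypothetical and this is not the code that produced the tables.

```python
import math


def wald_test(estimate, std_error):
    """Return (z, two-sided p-value) for the null hypothesis coefficient = 0."""
    z = estimate / std_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Intercept row of the three-variable model above.
z, p = wald_test(-0.4340921, 0.1496388)
print(f"z = {z:.4f}, p = {p:.4f}")
```

Up to rounding of the inputs, this reproduces the tabulated z value and p-value for the intercept. The same calculation applied to the three predictors gives p-values far above 0.05.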

| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -0.3034642 | 0.2122171 | -1.4299703 | 0.1527255 |
| Annual.Income.... | -0.0000020 | 0.0000016 | -1.2340340 | 0.2171902 |
| Work.Experience | 0.0105151 | 0.0116552 | 0.9021794 | 0.3669616 |
| Spending.Score..1.100. | -0.0000849 | 0.0016336 | -0.0519923 | 0.9585348 |
| Annual.catMiddle class | -0.1239278 | 0.2243462 | -0.5523955 | 0.5806774 |
| Annual.catUpper class | 0.1996485 | 0.2705176 | 0.7380239 | 0.4604999 |
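
To show how a fitted logistic model of this form turns into a predicted probability, the sketch below applies the inverse logit to the linear predictor built from the estimates in the table above. It is a Python illustration under assumptions: the response is taken to be coded so that the model describes the probability of being male, as the text states, and the example customer is hypothetical.

```python
import math

# Estimates copied from the table above (rounded as printed).
COEF = {
    "intercept": -0.3034642,
    "annual_income": -0.0000020,
    "work_experience": 0.0105151,
    "spending_score": -0.0000849,
    "middle_class": -0.1239278,
    "upper_class": 0.1996485,
}


def predicted_probability(income, work_years, spending, income_class="lower"):
    """Inverse logit of the linear predictor for one customer.

    `income_class` is "lower" (baseline), "middle", or "upper".
    """
    eta = (COEF["intercept"]
           + COEF["annual_income"] * income
           + COEF["work_experience"] * work_years
           + COEF["spending_score"] * spending)
    if income_class == "middle":
        eta += COEF["middle_class"]
    elif income_class == "upper":
        eta += COEF["upper_class"]
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical customer: $50,000 income, 5 years of experience, score 50.
p_lower = predicted_probability(50_000, 5, 50)
print(f"Predicted probability (baseline income class): {p_lower:.3f}")
```

Because the printed income coefficient is rounded to seven decimals, predictions computed this way are only approximate.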

| | Residual deviance | Null deviance | AIC |
|---|---|---|---|
| full.model | 2690.907 | 2702.992 | 2730.907 |
| reduced.model | 2702.124 | 2702.992 | 2710.124 |
| final.model | 2697.641 | 2702.992 | 2709.641 |
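
The AIC column can be checked by hand: for a logistic regression on ungrouped binary data, AIC equals the residual deviance plus twice the number of estimated coefficients. The Python sketch below is illustrative only; the coefficient counts (20, 4, and 6, including the intercept) are read off the three coefficient tables, and the function name is hypothetical.

```python
def aic_from_deviance(residual_deviance, n_parameters):
    """AIC = residual deviance + 2 * (number of estimated coefficients).

    Assumes ungrouped binary data, where the residual deviance equals
    -2 times the maximized log-likelihood.
    """
    return residual_deviance + 2 * n_parameters


# (residual deviance, number of coefficients including the intercept)
models = {
    "full.model": (2690.907, 20),
    "reduced.model": (2702.124, 4),
    "final.model": (2697.641, 6),
}

aic = {name: aic_from_deviance(dev, k) for name, (dev, k) in models.items()}
best = min(aic, key=aic.get)  # model with the lowest AIC

for name, value in aic.items():
    print(f"{name}: AIC = {value:.3f}")
print("Lowest AIC:", best)
```

The full model has the smallest residual deviance, but the penalty for its 20 coefficients leaves it with the highest AIC of the three.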
In the exploratory analysis, we found that none of the variables are linearly correlated. After automatic variable selection, CustomerID, Age, Profession, Family.Size, and Work.cat were removed to obtain our final model. Annual Income, Work Experience, Spending Score, and Annual.cat remain in the model. Although none of the predictor variables is statistically significant, we keep the three forced variables in the model because we consider them practically important for determining whether or not a customer is male.
| | Estimate | Std. Error | z value | Pr(>\|z\|) | odds.ratio |
|---|---|---|---|---|---|
| (Intercept) | -0.3034642 | 0.2122171 | -1.4299703 | 0.1527255 | 0.7382564 |
| Annual.Income.... | -0.0000020 | 0.0000016 | -1.2340340 | 0.2171902 | 0.9999980 |
| Work.Experience | 0.0105151 | 0.0116552 | 0.9021794 | 0.3669616 | 1.0105706 |
| Spending.Score..1.100. | -0.0000849 | 0.0016336 | -0.0519923 | 0.9585348 | 0.9999151 |
| Annual.catMiddle class | -0.1239278 | 0.2243462 | -0.5523955 | 0.5806774 | 0.8834436 |
| Annual.catUpper class | 0.1996485 | 0.2705176 | 0.7380239 | 0.4604999 | 1.2209735 |
The Annual.cat variable has 3 categories, with lower class as the baseline category. The inferential table above shows that the estimated odds of a customer being male do not increase steadily with income. The odds ratio associated with the upper class group is 1.22, which means that, for the same Spending Score, Work Experience, and Annual Income, the odds of being male in the upper class group are 1.22 times those in the baseline lower class group. The same ratio is 0.88 when comparing the middle class group with the baseline group, and the odds ratio for Annual Income itself is slightly below 1. None of these odds ratios differs significantly from 1.
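Each odds ratio in the table is the exponential of the corresponding estimate. The short Python sketch below reproduces a few of them from the printed estimates; it is illustrative only, and the rescaling to a $10,000 income increase is an added, hypothetical example that does not appear in the study's output.

```python
import math

# Estimates copied from the final-model table (rounded as printed).
estimates = {
    "(Intercept)": -0.3034642,
    "Work.Experience": 0.0105151,
    "Annual.cat Middle class": -0.1239278,
    "Annual.cat Upper class": 0.1996485,
}

# Odds ratio = exp(coefficient on the log-odds scale).
odds_ratios = {name: math.exp(b) for name, b in estimates.items()}

# The income coefficient is per dollar, so its odds ratio is almost exactly 1.
# Rescaling to a hypothetical $10,000 increase makes it easier to read.
per_10k = math.exp(-0.0000020 * 10_000)

for name, value in odds_ratios.items():
    print(f"{name}: {value:.4f}")
print(f"Annual income, per $10,000: {per_10k:.4f}")
```

A ratio above 1 corresponds to higher estimated odds of being male than in the reference condition, and a ratio below 1 to lower odds.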
This study focused on the association between a set of predictor variables and whether the customer was male. The initial data set has 6 numerical variables and 2 categorical variables. After exploratory analysis, we decided to re-group two numerical variables, Annual Income and Work Experience, and created dummy variables for the resulting categories. These grouped variables were used in the model search process to find the model with the lowest AIC. Because we consider Annual Income, Work Experience, and Spending Score relevant for determining whether a customer is male, we included these factors in the final model regardless of their statistical significance. After the automatic variable selection process, we obtained a final model with 4 factors: Annual Income, Work Experience, Spending Score, and Annual.cat (2 dummy variables). None of these variables is statistically significant, so the final model is not effective overall for predicting a customer’s gender. The data set also seems to need more information on the customers so that more predictor variables can be incorporated into a more effective predictive model.