The shopping customer data set in this case study contains 2000 observations on 8 variables. The data set is available in the UCI machine learning data repository. R library {mlbench} has two versions of this data. The data set does not contain any missing values. Shop Customer Data is a detailed analysis of an imaginary shop’s ideal customers. It helps a business better understand its customers. The shop owner collects information about customers through membership cards.
We will perform a logistic regression to see whether Annual Income is an effective predictor of a customer’s gender.
There are 8 variables in the data set:
Customer ID: Id of the customers
Gender: Male or Female
Age: Age (in years)
Annual Income: How much money they make every year
Spending Score: Score assigned by the shop, based on customer behavior and spending nature
Profession: The customer’s occupation
Work Experience: Work experience (in years)
Family Size: Number of people in the household
The purpose of this study is to discover the factors that determine whether a customer is male or female. We first make pairwise scatter plots to check for potential issues with the predictor variables.
We can see that Annual Income, Profession, and Work Experience are skewed to the left, while the other variables are unimodal. Next, we look at the frequency distributions of only the numerical predictor variables that are not unimodal: Annual Income and Work Experience.
Based on the above histograms, we will discretize Work Experience and Annual Income.
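As a rough illustration of what discretizing a numeric variable involves, the following minimal Python sketch maps each value to a labeled interval. The function name `discretize`, the cut points, and the group labels are all hypothetical; the case study does not state which cut points were actually used.

```python
from bisect import bisect_right

def discretize(values, cut_points, labels):
    """Map each numeric value to the label of the interval it falls in.

    cut_points must be sorted in increasing order, and there must be exactly
    one more label than cut points. A value equal to a cut point is placed
    in the interval that starts at that cut point.
    """
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one more label than cut points")
    return [labels[bisect_right(cut_points, v)] for v in values]

# Made-up values and cut points, for illustration only.
work_experience = [0, 1, 4, 9, 17]
groups = discretize(work_experience, cut_points=[2, 8], labels=["0-1", "2-7", "8+"])
print(groups)
```

The same helper could be applied to Annual Income with a different set of cut points chosen from its histogram.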
A weak correlation is observed in all pairs of variables. We will not drop any of the variables. Instead, we will standardize these variables. Part of the correlation may be removed after they have been standardized.
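Standardizing a variable means subtracting its mean and dividing by its standard deviation. A minimal sketch in Python follows, assuming the sample standard deviation (n - 1 denominator) is used; the helper name `standardize` and the sample values are made up for illustration.

```python
from statistics import mean, stdev

def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    if s == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(x - m) / s for x in values]

# Made-up ages, for illustration only.
ages = [20, 30, 40, 50, 60]
z_ages = standardize(ages)
print([round(z, 3) for z in z_ages])
```

After this transformation each variable has mean 0 and standard deviation 1, which puts predictors measured in different units on a common scale.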
Since this is a predictive model, we don’t worry about the interpretation of the coefficients. The objective is to identify a model that has the best predictive performance.
We randomly split the data into two subsets. 70% of the data will be used as training data. The training data will be used to search for candidate models, validate them, and identify the final model using cross-validation. The remaining 30%, the hold-out sample, will be used to assess the performance of the final model.
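The random 70/30 split can be sketched as follows. This is only an illustration in Python: the function name `train_test_split`, the seed, and the stand-in row indices are assumptions, not the code used in the case study.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=123):
    """Randomly split rows into a training set and a hold-out set."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_train = int(round(train_fraction * len(rows)))
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test

# Stand-in for the 2000 observations: here each "row" is just its index.
rows = list(range(2000))
train, test = train_test_split(rows)
print(len(train), len(test))  # 1400 600
```

Every observation lands in exactly one of the two subsets, so the hold-out sample is never seen while the candidate models are being fitted and compared.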
We introduce full and reduced models to set the scope of the search for the final model. In this case study, we use the full model, the reduced model, and the model obtained by step-wise variable selection as the three candidate models.
Also, we use 0.5 as the common cut-off probability for all three models to define the predicted class.
5-fold Cross-Validation
Because our training data set is small, we apply 5-fold cross-validation to ensure that each validation set has enough cases of each gender. We average the prediction error over the 5 folds and see which model gives the smallest prediction error.
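The procedure can be sketched in Python as below. This is a minimal illustration, not the code behind the reported numbers: the logistic regression is fitted with plain gradient descent, the data are synthetic, and the names `fit_logistic`, `predict_proba`, and `cv_prediction_error` are hypothetical. Each candidate model is represented simply by the list of predictor columns it uses.

```python
import math
import random

def predict_proba(w, x):
    """P(y = 1 | x) under a logistic model with w = [intercept, slopes...]."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=200):
    """Fit logistic regression by gradient descent on the average log-loss."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j in range(p):
                grad[j + 1] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def cv_prediction_error(X, y, columns, k=5, cutoff=0.5, seed=1):
    """Average misclassification rate over k folds for a model using `columns`."""
    Xs = [[row[j] for j in columns] for row in X]
    idx = list(range(len(Xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint validation folds
    fold_errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        w = fit_logistic([Xs[i] for i in train], [y[i] for i in train])
        # A case is misclassified when the predicted class (probability above
        # the cut-off) disagrees with the observed class.
        wrong = sum((predict_proba(w, Xs[i]) > cutoff) != (y[i] == 1) for i in fold)
        fold_errors.append(wrong / len(fold))
    return sum(fold_errors) / k

# Synthetic data, for illustration only: the response depends on column 0.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [1 if row[0] + rng.gauss(0, 0.5) > 0 else 0 for row in X]

pe_full = cv_prediction_error(X, y, columns=[0, 1])
pe_reduced = cv_prediction_error(X, y, columns=[0])
print(pe_full, pe_reduced)
```

Using the same seed for every candidate means all models are evaluated on identical folds, so differences in prediction error reflect the models rather than the random partition.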
| PE1 | PE2 | PE3 |
|---|---|---|
| 0.53625 | 0.53625 | 0.55125 |
We can see that models 1 and 2 have the same prediction error (PE1 = PE2 = 0.53625). Because model 2 is simpler than model 1, we will choose model 2 as the final predictive model.
| x |
|---|
| 0.5625 |
We first estimate the TPR (true positive rate, sensitivity) and FPR (false positive rate, 1 - specificity) at each cut-off probability for each of the three candidate models.
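One way such a computation might look is sketched below in Python. The function name `roc_points`, the labels, and the predicted probabilities are made up for illustration; a case is predicted positive when its probability exceeds the cut-off.

```python
def roc_points(y_true, probs, cutoffs):
    """Return (cutoff, TPR, FPR) at each cut-off probability.

    TPR = true positives / all positives (sensitivity)
    FPR = false positives / all negatives (1 - specificity)
    """
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    points = []
    for c in cutoffs:
        tp = sum(1 for y, p in zip(y_true, probs) if p > c and y == 1)
        fp = sum(1 for y, p in zip(y_true, probs) if p > c and y == 0)
        points.append((c, tp / n_pos, fp / n_neg))
    return points

# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
probs = [0.9, 0.4, 0.6, 0.3, 0.7, 0.2]
points = roc_points(y_true, probs, [0.0, 0.25, 0.5, 0.75, 1.0])
for cutoff, tpr, fpr in points:
    print(f"cutoff={cutoff:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Plotting TPR against FPR over a fine grid of cut-offs traces out the ROC curve for a model; repeating this for each candidate model allows the curves to be compared on one plot.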
We can see from the ROC curve that candidate 1 seems to be a better fit than the full and reduced models. Based on the ROC curve, candidate 1 could be our best choice.
The case study focused on predicting a customer’s gender. We used three models as candidates and applied cross-validation and the ROC curve to find the final working model. Both cross-validation and the ROC curve gave the same result. However, this model is not very effective for predicting gender because many of the points have low sensitivity. Also, many of the points fall below our cut-off probability of 0.39.