1 Case study

The shop customer data set in this case study contains 2000 observations on 8 variables. The data set is available in the UCI machine learning data repository. R library {mlbench} has two versions of this data. The data set does not contain any missing values. Shop Customer Data is a detailed analysis of an imaginary shop’s ideal customers and helps a business better understand its customers. The shop owner obtains information about customers through membership cards.

We will perform a logistic regression to see whether Annual Income is an effective predictor of a customer’s gender.
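As a reminder of what a one-predictor logistic model computes, the following minimal Python sketch maps a linear predictor to a probability. The function names, the coding of Gender as 0/1, and the coefficient values are hypothetical placeholders, not estimates from this data set.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_prob(income, beta0, beta1):
    """P(Gender = 1 | income) under a logistic model with one predictor."""
    return logistic(beta0 + beta1 * income)

# Hypothetical coefficients, for illustration only (not fitted values).
beta0, beta1 = -0.2, 0.000002
p = predict_prob(100000, beta0, beta1)  # linear predictor is about 0 here
```

If the estimated slope is close to zero, the predicted probability barely changes with income, which is what an ineffective predictor looks like in this model.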

1.1 Data and Variable Description

There are 8 variables in the data set:

Customer ID: ID of the customer

Gender: Male or Female

Age: age (in years)

Annual Income: how much money the customer makes every year

Spending Score: score assigned by the shop, based on customer behavior and spending nature

Profession: the customer’s occupation

Work Experience: work experience (in years)

Family Size: number of people in the customer’s household

1.2 Research Question

The purpose of this study is to discover the factors that determine whether a customer is male or female. We first make pairwise scatter plots to check for potential issues with the predictor variables.

We can see that Annual Income, Profession, and Work Experience are skewed to the left, while the other variables are unimodal. Next, we look at the frequency distributions of the numerical predictor variables that are not unimodal, namely Annual Income and Work Experience.

Based on the above histograms, we will discretize Work Experience and Annual Income.
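Discretizing a numerical variable means replacing each value with the label of the interval it falls into. The Python sketch below shows one way to do this; the cut points, labels, and sample values are hypothetical, since the ones actually used are not stated here.

```python
from bisect import bisect_right

def discretize(value, cut_points, labels):
    """Return the label of the interval that `value` falls into.

    cut_points must be sorted, and there must be one more label than
    cut points. Intervals are left-closed: [c_i, c_{i+1}).
    """
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one more label than cut points")
    return labels[bisect_right(cut_points, value)]

# Hypothetical cut points and made-up values, for illustration only.
work_exp_cuts = [2, 6]
work_exp_labels = ["0-1", "2-5", "6+"]
work_experience = [0, 1, 2, 5, 6, 17]
work_exp_group = [
    discretize(v, work_exp_cuts, work_exp_labels) for v in work_experience
]
```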

A weak correlation is observed in all pairs of variables, so we will not drop any of them. Instead, we will standardize these variables; part of the correlation may be removed once they have been standardized.

1.3 Standardizing Numerical Variables

Because this is a predictive model, we do not need to worry about the interpretation of the coefficients. The objective is to identify the model with the best predictive performance.
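Standardizing a numerical variable means subtracting its mean and dividing by its standard deviation. A minimal Python sketch follows, assuming the sample standard deviation (divisor n - 1) is used; the function name and the example values are made up for illustration.

```python
from statistics import mean, stdev

def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)
    if s == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(x - m) / s for x in values]

ages = [20, 30, 40, 50, 60]  # made-up values, not taken from the data set
z_ages = standardize(ages)   # has mean 0 and standard deviation 1
```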

1.4 Data Split - Training and Testing Data

We randomly split the data into two subsets. 70% of the data will be used as training data: we use it to search for candidate models, validate them, and identify the final model by cross-validation. The remaining 30% is a hold-out sample used to assess the performance of the final model.
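The random 70/30 split can be sketched in Python as follows. The function name and the seed are arbitrary choices for illustration, and the split works on row indices, so it makes no assumption about how the data are stored.

```python
import random

def train_test_split(n_rows, train_frac=0.7, seed=123):
    """Randomly partition row indices 0..n_rows-1 into train and test sets."""
    rng = random.Random(seed)  # arbitrary seed, only for reproducibility
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = int(round(train_frac * n_rows))
    return indices[:n_train], indices[n_train:]

# 2000 observations, as in this data set
train_idx, test_idx = train_test_split(2000)
```

With 2000 observations this gives 1400 training rows and 600 hold-out rows, and no row appears in both subsets.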

1.5 Candidate Models

We introduce full and reduced models to set the scope of the search for the final model. In this case study, the three candidate models are the full model, the reduced model, and the model obtained by step-wise variable selection.

Also, we use 0.5 as the common cut-off probability for all three models to define the predicted outcome.

5-fold Cross-Validation

Because our training data set is small, we apply 5-fold cross-validation to ensure that each validation set has enough cases of each gender. We average the prediction errors over the 5 folds and see which model gives the smallest prediction error.
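The cross-validation procedure can be sketched in Python as below. This is an illustration under stated assumptions, not the code behind the reported results: the function names are hypothetical, the data are made up, and an intercept-only classifier stands in for the logistic regression candidates so that the example stays self-contained.

```python
import random

def k_fold_indices(n, k=5, seed=1):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_prediction_error(x, y, fit, predict_prob, k=5, cutoff=0.5):
    """Average misclassification rate over the k validation folds."""
    errors = []
    for fold in k_fold_indices(len(y), k):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        model = fit([x[i] for i in train], [y[i] for i in train])
        # A case is misclassified when the thresholded prediction
        # disagrees with the observed 0/1 label.
        wrong = sum(
            (predict_prob(model, x[i]) > cutoff) != bool(y[i]) for i in fold
        )
        errors.append(wrong / len(fold))
    return sum(errors) / k

# Stand-in model (assumption): predicts the training share of 1s for everyone.
def fit_intercept_only(x_train, y_train):
    return sum(y_train) / len(y_train)

def predict_intercept_only(model, x_row):
    return model

# Made-up data: 20 cases, 6 of them coded 1
x = [[v] for v in range(20)]
y = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
pe = cv_prediction_error(x, y, fit_intercept_only, predict_intercept_only)
```

Each candidate model would be passed through the same function with the same folds, so that the averaged errors are directly comparable.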

Average of prediction errors of candidate models
PE1	PE2	PE3
0.53625	0.53625	0.55125

We can see that models 1 and 2 have the same prediction error (PE1 = PE2). Because model 2 is simpler than model 1, we choose model 2 as the final predictive model.
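The selection rule used here (smallest prediction error, with ties broken in favor of the simpler model) can be written as a short Python sketch. The prediction errors are the ones reported in the table above; the complexity ranks are an assumption, since the text only states that model 2 is simpler than model 1.

```python
# Average prediction errors reported in the table above
prediction_error = {"model 1": 0.53625, "model 2": 0.53625, "model 3": 0.55125}

# Assumed complexity ranks (lower = simpler). Only "model 2 is simpler
# than model 1" comes from the text; the rank of model 3 is hypothetical.
complexity = {"model 1": 3, "model 2": 1, "model 3": 2}

def select_model(pe, complexity):
    """Pick the smallest prediction error; break ties by lower complexity."""
    return min(pe, key=lambda name: (pe[name], complexity[name]))

best = select_model(prediction_error, complexity)
```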

The actual accuracy of the final model
Accuracy
0.5625

1.6 Selecting the final model using ROC

We first estimate the TPR (true positive rate, sensitivity) and FPR (false positive rate, 1 - specificity) at each cut-off probability for each of the three candidate models using a custom function.
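A minimal Python sketch of such a computation is given below. It is not the function used for this report: the names, the grid of 101 cut-offs, and the example probabilities and labels are assumptions made for illustration. It presumes that both classes are present, otherwise the rates are undefined.

```python
def tpr_fpr(probs, labels, cutoff):
    """Sensitivity and 1 - specificity when predicting 1 if prob > cutoff."""
    tp = sum(p > cutoff and y == 1 for p, y in zip(probs, labels))
    fp = sum(p > cutoff and y == 0 for p, y in zip(probs, labels))
    pos = sum(labels)          # number of actual 1s
    neg = len(labels) - pos    # number of actual 0s
    return tp / pos, fp / neg

def roc_points(probs, labels, n_cutoffs=101):
    """(FPR, TPR) pairs over an evenly spaced grid of cut-offs in [0, 1]."""
    points = []
    for j in range(n_cutoffs):
        tpr, fpr = tpr_fpr(probs, labels, j / (n_cutoffs - 1))
        points.append((fpr, tpr))
    return sorted(points)

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up predicted probabilities and true 0/1 labels
probs = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
curve = roc_points(probs, labels)
area = auc(curve)
```

Plotting the (FPR, TPR) pairs for each candidate on the same axes gives the ROC curves, and the area under each curve summarizes them in a single number.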

We can see from the ROC curves that candidate 1 seems to be a better fit than the full and reduced models. Based on the ROC curves, candidate 1 could be our best choice.

1.7 Conclusion

The case study focused on predicting a customer’s gender. We used three candidate models and applied cross-validation and ROC curves to find the final working model. Both cross-validation and the ROC curve gave the same result. However, this model is not very effective for predicting gender because many of the points have a low sensitivity. Also, many of the points are below our cut-off probability of 0.39.

Logistics Distribution

Yuanqi Zhang

Homework