The shopping customer data set in this case study contains 2000 observations on 8 variables. The data set is available in the UCI machine learning data repository. The R library {mlbench} has two versions of this data. The data set does not contain any missing values. Shop Customer Data is a detailed analysis of an imaginary shop’s ideal customers. It helps a business better understand its customers. The owner of the shop collects information about customers through membership cards.
We will perform a logistic regression to see whether Annual Income is an effective predictor of a customer’s gender.
There are 8 variables in the data set:

- Customer ID: ID of the customer
- Gender: Male or Female
- Age: age (in years)
- Annual Income: how much money the customer makes every year
- Spending Score: score assigned by the shop, based on customer behavior and spending nature
- Profession: the customer’s job
- Work Experience: work experience (in years)
- Family Size: number of people in the customer’s household

We perform exploratory data analysis on the predictor variable to make sure that it is not extremely skewed.
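One common way to quantify the skewness check described above is the moment-based sample skewness. The following is a minimal sketch in Python, not the code used in this case study. The function name `sample_skewness` and both lists of income values are hypothetical and are not taken from the shop customer data.

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (0 for perfectly symmetric data)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical income values for illustration only (NOT the case-study data).
symmetric_incomes = [20_000, 40_000, 60_000, 80_000, 100_000]
right_skewed_incomes = [20_000, 22_000, 25_000, 30_000, 250_000]

skew_symmetric = sample_skewness(symmetric_incomes)
skew_right = sample_skewness(right_skewed_incomes)
print(f"symmetric: {skew_symmetric:.3f}, right-skewed: {skew_right:.3f}")
```

A value near zero suggests a roughly symmetric predictor, while a large positive value points to a long right tail that might call for a transformation.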
Because the simple logistic regression model has just one continuous predictor of a binary categorical response, we do not transform Annual Income and instead fit the logistic regression directly to the data.
| | Estimate | Std. Error | z value | Pr(>\|z\|) | 2.5 % | 97.5 % |
|---|---|---|---|---|---|---|
| (Intercept) | -0.3995660 | 0.1193339 | -3.3483029 | 0.0008131 | -0.6343165 | -0.1663458 |
| Annual.Income…. | 0.0000002 | 0.0000010 | 0.2102511 | 0.8334717 | -0.0000017 | 0.0000022 |
The p-value for Annual Income (0.83) is greater than 0.05, so we fail to reject the null hypothesis that its coefficient is zero: the data provide no evidence that a customer’s annual income is associated with his or her gender. The 95% confidence interval is [-0.0000017, 0.0000022]. Because zero lies inside the confidence interval, the variable is statistically insignificant.
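The z values, p-values, and approximate confidence limits in the coefficient table can be re-derived from the estimates and standard errors. The sketch below uses the Wald (normal approximation) method; the helper `wald_summary` is a hypothetical name introduced here. The limits reported in the table appear to come from R's profile-likelihood method, so small differences from the Wald limits are expected. The Annual Income inputs are the rounded values shown in the table, which is why the recomputed z value (0.2) differs slightly from the reported 0.2102511.

```python
import math


def wald_summary(estimate, std_error, z_crit=1.959964):
    """Return (z, two-sided p-value, (lower, upper)) under a normal approximation."""
    z = estimate / std_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    lower = estimate - z_crit * std_error
    upper = estimate + z_crit * std_error
    return z, p_value, (lower, upper)


# Estimates and standard errors copied from the coefficient table.
intercept = wald_summary(-0.3995660, 0.1193339)
income = wald_summary(0.0000002, 0.0000010)  # rounded table values

for name, (z, p, (lo, hi)) in [("(Intercept)", intercept), ("Annual Income", income)]:
    print(f"{name}: z={z:.4f}, p={p:.4f}, 95% Wald CI=[{lo:.7f}, {hi:.7f}]")
```

The Annual Income interval straddles zero, which is the same conclusion as the p-value: the coefficient is not statistically distinguishable from zero.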
Next, we convert the estimated coefficients to odds ratios. The odds ratio associated with Annual Income is 1.00, meaning that as Annual Income increases by one unit, the odds of being male are essentially unchanged (an increase of approximately 0%).
| | Estimate | Std. Error | z value | Pr(>\|z\|) | odds.ratio |
|---|---|---|---|---|---|
| (Intercept) | -0.3995660 | 0.1193339 | -3.3483029 | 0.0008131 | 0.670611 |
| Annual.Income…. | 0.0000002 | 0.0000010 | 0.2102511 | 0.8334717 | 1.000000 |
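The odds ratios in the table are the exponentials of the estimated coefficients. The following minimal sketch reproduces them from the reported estimates. The 10,000-unit increase is a hypothetical rescaling added only to show what the tiny per-unit effect would imply over a larger income change, and the Annual Income coefficient is the rounded table value, so the result is approximate.

```python
import math

# Coefficients copied from the table (the Annual Income value is rounded).
intercept_coef = -0.3995660
income_coef = 0.0000002

odds_ratio_intercept = math.exp(intercept_coef)
odds_ratio_income = math.exp(income_coef)

# Hypothetical rescaling: odds ratio for a 10,000-unit increase in Annual Income.
odds_ratio_per_10k = math.exp(income_coef * 10_000)
pct_change_per_10k = (odds_ratio_per_10k - 1) * 100

print(f"intercept odds: {odds_ratio_intercept:.6f}")
print(f"odds ratio per unit of income: {odds_ratio_income:.7f}")
print(f"odds change per 10,000 units: {pct_change_per_10k:.2f}%")
```

Even over a 10,000-unit increase, the implied change in the odds stays well under one percent, which is consistent with the insignificant coefficient.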
Global goodness-of-fit measures are summarized below.
| Residual deviance | Null deviance | AIC |
|---|---|---|
| 2703 | 2703 | 2707 |
Because these global goodness-of-fit measures are based on the likelihood function, they are mainly useful for comparing candidate models fitted to the same data. In this simple logistic regression we have no other candidate models with likelihoods on the same scale to compare against, so we will not interpret these goodness-of-fit measures.
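Although the measures are not interpreted here, the way the three reported numbers relate to one another can be checked with simple arithmetic. The sketch below assumes, as is standard for ungrouped binary data, that the residual deviance equals -2 times the log-likelihood, and that the model has two estimated parameters (the intercept and the Annual Income slope). The deviances are the rounded values from the table.

```python
# Rounded values copied from the goodness-of-fit table.
residual_deviance = 2703
null_deviance = 2703
n_parameters = 2  # assumption: intercept + Annual Income slope

# AIC = -2 * log-likelihood + 2 * (number of estimated parameters).
aic = residual_deviance + 2 * n_parameters

# Drop in deviance gained by adding Annual Income to the intercept-only model.
deviance_drop = null_deviance - residual_deviance

print(f"AIC = {aic}, deviance drop = {deviance_drop}")
```

At the table's rounding, adding Annual Income does not reduce the deviance at all, which agrees with the insignificant z test for its coefficient.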
The function below plots the S-shaped probability curve of the logistic model.
The left-hand plot in the above figure is nearly linear: the probability of being male increases only very slightly as Annual Income increases. The plot of the rate of change in the probability of being male is also an almost straight line, which indicates that the rate of change does not increase or decrease much throughout the domain. The likely reason for the lack of change in both graphs is that a customer’s annual income is not a strong predictor of whether that customer is male.
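The near-flat shape of both curves can be checked numerically from the fitted coefficients. The sketch below evaluates the logistic probability p(x) = 1 / (1 + exp(-(b0 + b1 x))) and its rate of change dp/dx = b1 p (1 - p). The function names and the income grid from 0 to 200,000 are assumptions chosen for illustration, not values taken from the data, and the slope is the rounded table value, so the numbers are approximate.

```python
import math

B0 = -0.3995660  # intercept from the coefficient table
B1 = 0.0000002   # Annual Income slope (rounded table value)


def prob_male(income):
    """Fitted probability of being male at a given annual income."""
    eta = B0 + B1 * income
    return 1.0 / (1.0 + math.exp(-eta))


def prob_slope(income):
    """Rate of change of the fitted probability: dp/dx = B1 * p * (1 - p)."""
    p = prob_male(income)
    return B1 * p * (1.0 - p)


# Assumed illustrative income grid (not the observed range of the data).
incomes = [0, 50_000, 100_000, 150_000, 200_000]
for x in incomes:
    print(f"{x:>7}  p={prob_male(x):.4f}  dp/dx={prob_slope(x):.3e}")

total_change = prob_male(incomes[-1]) - prob_male(incomes[0])
```

Across this assumed grid the fitted probability moves by only about one percentage point, so only a tiny, almost linear segment of the S curve is visible.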
This case study focuses on data analysis with the simple logistic regression model. It uses a real-world customer identification data set to show the procedure for fitting and interpreting a simple logistic regression model.