0.1 Case study

The shop customer data set in this case study contains 2000 observations on 8 variables. The data set is available in the UCI machine learning data repository. R library {mlbench} has two versions of this data. The data set does not contain any missing values. Shop Customer Data is a detailed analysis of an imaginary shop’s ideal customers, intended to help a business better understand its customers. The shop owner collects information about customers through membership cards.

We will perform a logistic regression to see whether Annual Income is an effective predictor of a customer’s gender.

0.2 Data and Variable Description

There are 8 variables in the data set:

Customer ID: ID of the customer

Gender: Male or Female

Age: Age (in years)

Annual Income: How much money they make every year

Spending Score: Score assigned by the shop, based on customer behavior and spending nature

Profession: the customer’s occupation

Work Experience: work experience (in years)

Family Size: number of people in their household

0.3 Simple Logistic Regression

We perform exploratory data analysis on the predictor variable to make sure that it isn’t extremely skewed.

Since the simple logistic regression contains just one continuous predictor variable for a binary categorical response, we will not transform Annual Income; we fit a logistic regression directly to the data.
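The fitting step described above could look roughly like the following sketch. The case study's data file is not shown here, so the code uses simulated stand-in data; the data frame `shop`, the column name `Annual.Income`, and the simulated distributions are all assumptions made for illustration, and the output will not reproduce the numbers reported in this case study.

```r
# Simulated stand-in data (hypothetical): NOT the case study's actual file.
set.seed(123)
n <- 2000
shop <- data.frame(
  Gender = factor(sample(c("Female", "Male"), n, replace = TRUE,
                         prob = c(0.6, 0.4))),
  Annual.Income = round(runif(n, min = 0, max = 190000))
)

# For a factor response, glm() models the probability of the second
# factor level, which is "Male" here (levels are sorted alphabetically).
fit <- glm(Gender ~ Annual.Income, family = binomial(link = "logit"),
           data = shop)

# Coefficient table: estimate, standard error, z value, p-value.
coef.table <- summary(fit)$coefficients

# 95% confidence intervals for the coefficients; suppressMessages() hides
# the profiling progress message that some R versions print.
ci <- suppressMessages(confint(fit))

# Combine both pieces into one summary table.
print(cbind(coef.table, ci))
```

Because the stand-in gender labels are drawn independently of income, the slope estimate here should also come out close to zero.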

Summary statistics of the regression coefficients
                  Estimate    Std. Error  z value     Pr(>|z|)   2.5 %       97.5 %
(Intercept)       -0.3995660  0.1193339   -3.3483029  0.0008131  -0.6343165  -0.1663458
Annual.Income….    0.0000002  0.0000010    0.2102511  0.8334717  -0.0000017   0.0000022

The p-value for Annual Income (0.83) is well above 0.05, so we have no evidence that a customer’s annual income is associated with his or her gender. The 95% confidence interval for the coefficient is [-0.0000017, 0.0000022]. Because zero lies inside this interval, the coefficient is not statistically significant at the 5% level.
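As a quick check on how the test statistics relate to one another, the z value and p-value can be recomputed from a reported estimate and standard error. The sketch below does this for the intercept row and also computes a Wald interval. The Wald interval is a swapped-in approximation: the intervals reported in the table appear to come from likelihood profiling, so the two should be close but need not agree exactly.

```r
# Intercept estimate and standard error as reported in the summary table.
est <- -0.3995660
se  <- 0.1193339

# Wald z statistic: estimate divided by its standard error.
z <- est / se

# Two-sided p-value from the standard normal distribution.
p <- 2 * pnorm(-abs(z))

# Approximate 95% Wald confidence interval (an assumption for illustration;
# not the profile-likelihood interval shown in the table).
wald.ci <- est + c(-1, 1) * qnorm(0.975) * se

print(c(z = z, p = p))
print(wald.ci)
```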

Next, we convert the estimated coefficients to odds ratios. The odds ratio associated with Annual Income is 1.00, meaning that as Annual Income increases by one unit, the odds of being male are essentially unchanged (an increase of about 0%).

Summary statistics with odds ratios
                  Estimate    Std. Error  z value     Pr(>|z|)   odds.ratio
(Intercept)       -0.3995660  0.1193339   -3.3483029  0.0008131  0.670611
Annual.Income….    0.0000002  0.0000010    0.2102511  0.8334717  1.000000
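The odds ratios in the table are the exponentiated coefficients. A minimal sketch of that conversion, using the rounded estimates reported above (the vector names and the 10,000-unit rescaling are illustrative choices, not part of the original analysis):

```r
# Reported coefficient estimates (rounded as shown in the table).
est <- c("(Intercept)" = -0.3995660, "Annual.Income" = 0.0000002)

# Odds ratio = exp(coefficient).
odds.ratio <- exp(est)
print(round(odds.ratio, 6))

# Percent change in the odds for a one-unit increase in income.
pct.change <- (odds.ratio[["Annual.Income"]] - 1) * 100
print(pct.change)

# A one-unit change in income is tiny, so the effect can be easier to read
# on a larger scale, e.g. per 10,000 units of income.
or.10000 <- exp(10000 * est[["Annual.Income"]])
print(or.10000)
```

Even per 10,000 units of income, the odds ratio computed from the rounded estimate stays very close to 1, which is consistent with the non-significant coefficient.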

Global goodness-of-fit measures are summarized below.

Residual deviance   Null deviance   AIC
2703                2703            2707

These global goodness-of-fit measures are based on the likelihood function. Because this simple logistic regression has no other candidate models with likelihoods on the same scale to compare against, we will not interpret them.
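For reference, these three measures can be extracted from a fitted `glm` object as in the sketch below. The data are simulated stand-ins (the variables `income` and `male` are hypothetical), so the values will differ from those reported. For a binary 0/1 response, AIC equals the residual deviance plus twice the number of estimated coefficients, which agrees with the reported figures: 2703 + 2 × 2 = 2707.

```r
# Simulated stand-in data (hypothetical): NOT the case study's data.
set.seed(2024)
n <- 500
income <- runif(n, min = 0, max = 190000)
male   <- rbinom(n, size = 1, prob = 0.4)

fit <- glm(male ~ income, family = binomial)

# Global goodness-of-fit measures.
gof <- c(Residual.Deviance = deviance(fit),
         Null.Deviance     = fit$null.deviance,
         AIC               = AIC(fit))
print(round(gof, 1))

# For ungrouped binary data: AIC = residual deviance + 2 * (number of
# estimated coefficients).
k <- length(coef(fit))
print(all.equal(AIC(fit), deviance(fit) + 2 * k))
```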

The function below plots the probability S-curve of the logistic model.

The left-hand plot in the above figure is nearly linear rather than S-shaped: the probability of being male increases very little as Annual Income increases. The plot of the rate of change in the probability of being male is also an almost straight line, which indicates that the rate of change varies very little throughout the domain. The likely cause of the lack of change in both graphs is that a customer’s annual income is not a strong predictor of whether that customer is male.
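The two curves discussed above can be approximated directly from the reported coefficients. This is only a sketch: the income grid from 0 to 200,000 is an assumed range, the coefficients are the rounded values from the summary table, and the plotting code is not the function used to produce the original figure.

```r
# Coefficients as reported in the summary table (rounded values).
b0 <- -0.3995660
b1 <- 0.0000002

# Hypothetical income grid; the actual range of the data is an assumption.
income <- seq(0, 200000, length.out = 201)

# Fitted probability of being male: inverse logit of the linear predictor.
prob <- plogis(b0 + b1 * income)

# Rate of change of the probability with respect to income:
# d/dx plogis(b0 + b1 * x) = b1 * p * (1 - p).
rate <- b1 * prob * (1 - prob)

# Side-by-side plots: probability curve (left), rate of change (right).
par(mfrow = c(1, 2))
plot(income, prob, type = "l", xlab = "Annual Income",
     ylab = "Probability of being male", main = "Probability curve")
plot(income, rate, type = "l", xlab = "Annual Income",
     ylab = "Rate of change", main = "Rate of change in probability")

# The probability moves by only about one percentage point over the grid.
print(range(prob))
```

With a slope this close to zero, the grid covers only a very short stretch of the logistic curve, which is why both plots look almost like straight lines.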

0.4 Conclusion

This case study focuses on a data analysis using the simple logistic regression model. It uses a real-world customer identification data set to show the procedure for fitting a simple logistic regression model.