rm(list = ls())
gc()
##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 536444 28.7    1197988   64         NA   669417 35.8
## Vcells 990072  7.6    8388608   64      16384  1851813 14.2
cat("\f")
# close the active graphics device, if one is open
if (!is.null(dev.list())) dev.off()
For this example, I will use marketing campaign data to examine whether a customer’s age and gender predict whether they give a positive response (yes) or a negative response (no) to a marketing campaign. The data come from Kaggle (https://www.kaggle.com/datasets/sujithmandala/marketing-campaign-positive-response-prediction?resource=download).
# set the working directory and read the data into a data frame
setwd("/Users/ginaocchipinti/Documents/ADEC 7310 Data Analytics/Week 6")
df <- read.csv("campaign_responses.csv", header = TRUE, stringsAsFactors = TRUE)
df
## customer_id age gender annual_income credit_score employed marital_status
## 1 1 35 Male 65000 720 Yes Married
## 2 2 28 Female 45000 680 No Single
## 3 3 42 Male 85000 750 Yes Married
## 4 4 31 Female 55000 710 Yes Single
## 5 5 47 Male 95000 790 Yes Married
## 6 6 25 Female 38000 630 No Single
## 7 7 39 Male 72000 740 Yes Married
## 8 8 33 Female 48000 670 Yes Single
## 9 9 51 Male 110000 820 Yes Married
## 10 10 27 Female 40000 620 No Single
## 11 11 44 Male 90000 780 Yes Married
## 12 12 30 Female 52000 690 Yes Single
## 13 13 36 Male 75000 730 Yes Married
## 14 14 29 Female 45000 660 No Single
## 15 15 49 Male 105000 800 Yes Married
## 16 16 26 Female 36000 610 No Single
## 17 17 41 Male 85000 760 Yes Married
## 18 18 32 Female 54000 700 Yes Single
## 19 19 37 Male 80000 740 Yes Married
## 20 20 34 Female 60000 720 Yes Single
## 21 21 43 Male 92000 770 Yes Married
## 22 22 28 Female 42000 640 No Single
## 23 23 38 Male 78000 750 Yes Married
## 24 24 31 Female 48000 680 Yes Single
## 25 25 45 Male 98000 790 Yes Married
## 26 26 27 Female 40000 630 No Single
## 27 27 40 Male 85000 760 Yes Married
## 28 28 35 Female 62000 710 Yes Single
## 29 29 46 Male 100000 800 Yes Married
## 30 30 29 Female 44000 650 No Single
## 31 31 42 Male 90000 780 Yes Married
## 32 32 33 Female 56000 690 Yes Single
## 33 33 39 Male 82000 750 Yes Married
## 34 34 30 Female 50000 670 Yes Single
## 35 35 48 Male 105000 810 Yes Married
## 36 36 25 Female 35000 600 No Single
## 37 37 41 Male 88000 770 Yes Married
## 38 38 34 Female 58000 700 Yes Single
## 39 39 43 Male 95000 780 Yes Married
## 40 40 28 Female 43000 640 No Single
## 41 41 37 Male 80000 750 Yes Married
## 42 42 32 Female 52000 680 Yes Single
## 43 43 45 Male 100000 800 Yes Married
## 44 44 30 Female 46000 660 No Single
## 45 45 40 Male 88000 770 Yes Married
## 46 46 36 Female 64000 720 Yes Single
## 47 47 47 Male 102000 790 Yes Married
## 48 48 26 Female 38000 620 No Single
## 49 49 42 Male 90000 760 Yes Married
## 50 50 33 Female 54000 690 Yes Single
## 51 51 39 Male 85000 750 Yes Married
## 52 52 31 Female 50000 680 Yes Single
## 53 53 46 Male 98000 800 Yes Married
## 54 54 28 Female 42000 630 No Single
## 55 55 41 Male 90000 770 Yes Married
## 56 56 34 Female 60000 710 Yes Single
## no_of_children responded
## 1 2 Yes
## 2 0 No
## 3 3 Yes
## 4 1 No
## 5 2 Yes
## 6 0 No
## 7 2 Yes
## 8 0 No
## 9 3 Yes
## 10 0 No
## 11 2 Yes
## 12 0 No
## 13 1 Yes
## 14 0 No
## 15 3 Yes
## 16 0 No
## 17 2 Yes
## 18 0 No
## 19 2 Yes
## 20 1 No
## 21 3 Yes
## 22 0 No
## 23 2 Yes
## 24 0 No
## 25 3 Yes
## 26 0 No
## 27 2 Yes
## 28 1 No
## 29 3 Yes
## 30 0 No
## 31 2 Yes
## 32 0 No
## 33 2 Yes
## 34 0 No
## 35 3 Yes
## 36 0 No
## 37 2 Yes
## 38 1 No
## 39 3 Yes
## 40 0 No
## 41 2 Yes
## 42 0 No
## 43 3 Yes
## 44 0 No
## 45 2 Yes
## 46 1 No
## 47 3 Yes
## 48 0 No
## 49 2 Yes
## 50 0 No
## 51 2 Yes
## 52 0 No
## 53 3 Yes
## 54 0 No
## 55 2 Yes
## 56 1 No
# fit a logistic regression of response on age and gender
# (bare column names in the formula so that data = df is actually used)
logit <- glm(responded ~ age + gender,
             data = df,
             family = "binomial")
summary(logit)
##
## Call:
## glm(formula = responded ~ age + gender, family = "binomial",
##     data = df)
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.657e+01  4.052e+05       0        1
## age          1.570e-11  1.324e+04       0        1
## genderMale   5.313e+01  1.860e+05       0        1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7.7632e+01 on 55 degrees of freedom
## Residual deviance: 3.2489e-10 on 53 degrees of freedom
## AIC: 6
##
## Number of Fisher Scoring iterations: 25
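Interpreting this regression, the coefficients are on the log-odds scale, not the probability scale. The intercept of -26.57 is the log-odds of responding “yes” for the baseline case (a female customer at age zero), which corresponds to a probability of essentially 0. Being male rather than female raises the log-odds of a positive response by 53.13, which pushes the predicted probability to essentially 1. Each additional year of age changes the log-odds by only 1.57e-11, which is effectively zero. However, the z values of 0 and p-values of 1 should not be read as evidence that age and gender have no effect. In the data printed above, every male customer responded “yes” and every female customer responded “no”, so gender predicts the outcome perfectly (complete separation). In that situation the maximum likelihood estimates do not exist: the fitting algorithm keeps pushing the gender coefficient toward infinity, which explains the 25 Fisher scoring iterations, the residual deviance of essentially zero, and the enormous standard errors. Because the z value is the estimate divided by its standard error, the inflated standard errors force z toward 0 and the p-value toward 1, so these tests are uninformative here. In general, a p-value is the probability of observing a test statistic at least as extreme as the one we got if the null hypothesis of no effect is true; it is not the probability that the null is true.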
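With output like this (huge standard errors, a residual deviance near zero, and 25 Fisher scoring iterations), it is worth checking whether a predictor perfectly separates the outcome. The block below is a minimal sketch of that check. It assumes only the first eight rows of the data printed above, retyped by hand so it runs without the CSV; the names `toy`, `sep_table`, and `is_separated` are my own and not part of the original analysis. It also converts the reported log-odds into probabilities with `plogis()`.

```r
# First eight rows of the printed data, retyped by hand (an illustrative subset)
toy <- data.frame(
  age       = c(35, 28, 42, 31, 47, 25, 39, 33),
  gender    = factor(c("Male", "Female", "Male", "Female",
                       "Male", "Female", "Male", "Female")),
  responded = factor(c("Yes", "No", "Yes", "No",
                       "Yes", "No", "Yes", "No"))
)

# Cross-tabulate the predictor against the outcome;
# empty off-diagonal cells point to complete separation
sep_table <- table(toy$gender, toy$responded)
print(sep_table)

# TRUE when every gender level maps to exactly one outcome level
is_separated <- all(rowSums(sep_table > 0) == 1)
print(is_separated)

# Convert the reported log-odds to probabilities (age term ignored, it is ~0)
p_female <- plogis(-26.57)          # baseline: female
p_male   <- plogis(-26.57 + 53.13)  # baseline plus the male coefficient
print(c(female = p_female, male = p_male))
```

On the full data the same check would be `table(df$gender, df$responded)`. When separation shows up, the usual options are to drop the separating predictor or to use a penalized method, but which one is appropriate depends on the goal of the analysis.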
You would want to use logistic regression rather than multiple linear regression when the dependent variable is binary (yes/no, for example) rather than continuous (GDP, for example). Logistic regression also does not require the outcome to be linearly related to the predictors or the residuals to be normally distributed; instead, it assumes that the log-odds of the outcome are linear in the predictors. Overall, logistic regression is better suited to a categorical outcome than to a quantitative one, while the predictors themselves can be either categorical or quantitative.
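To illustrate the difference, here is a rough sketch that fits both a linear model and a logistic model to a small binary outcome. The `x` and `y` values are made up for illustration and are not from the campaign data. The point is only that a linear model can predict values outside the 0 to 1 range, while the logistic model's predicted probabilities stay inside it.

```r
# Hypothetical data: a binary outcome that becomes more likely as x grows
toy <- data.frame(
  x = 1:10,
  y = c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)
)

# Linear probability model versus logistic regression on the same data
lm_fit    <- lm(y ~ x, data = toy)
logit_fit <- glm(y ~ x, data = toy, family = "binomial")

# Predict for a value of x well beyond the observed range
new_obs    <- data.frame(x = 20)
lin_pred   <- predict(lm_fit, newdata = new_obs)
logit_pred <- predict(logit_fit, newdata = new_obs, type = "response")

print(lin_pred)    # linear prediction, not constrained to [0, 1]
print(logit_pred)  # predicted probability, always between 0 and 1
```

Using `type = "response"` asks `predict()` for probabilities; the default for a `glm` object returns values on the log-odds scale.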
EXTRA
Overall, hospital bankruptcies are rare in the US, but we saw higher rates in states like West Virginia, Rhode Island, and Connecticut. For this dataset, several models were run and compared, including LogReg, Altman, and Ohlson. LogReg is a logistic regression that predicts a binary outcome such as bankrupt or not bankrupt, the Altman model is a well-known model that predicts bankruptcy from financial ratios, and the Ohlson model uses logistic regression to predict whether a company will go bankrupt.
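As a rough illustration of what "based on financial ratios" means, the sketch below computes the original 1968 Altman Z-score, which is a weighted sum of five ratios. This is an assumption on my part: the hospital study may use a different variant of the Altman model. The function name `altman_z` and every input figure are hypothetical.

```r
# Original 1968 Altman Z-score: a weighted sum of five financial ratios.
# The hospital study may use a different variant; inputs below are made up.
altman_z <- function(working_capital, retained_earnings, ebit,
                     market_value_equity, sales,
                     total_assets, total_liabilities) {
  x1 <- working_capital / total_assets           # liquidity
  x2 <- retained_earnings / total_assets         # cumulative profitability
  x3 <- ebit / total_assets                      # operating profitability
  x4 <- market_value_equity / total_liabilities  # leverage
  x5 <- sales / total_assets                     # asset turnover
  1.2 * x1 + 1.4 * x2 + 3.3 * x3 + 0.6 * x4 + 1.0 * x5
}

# Hypothetical firm (all amounts in the same currency unit)
z <- altman_z(working_capital = 20, retained_earnings = 30, ebit = 15,
              market_value_equity = 80, sales = 120,
              total_assets = 100, total_liabilities = 50)
print(z)
```

In the original model a higher score indicates lower bankruptcy risk, which is the opposite direction from a logistic model that outputs the probability of bankruptcy.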
REFLECTION
I definitely learned a lot about data analysis in this course, from loading data and basic exploratory data analysis to understanding summary statistics, especially for larger data sets. I learned how to test hypotheses about whether a procedure, technology, or other change affects a population. In reality, there is always some error because we cannot survey an entire population, but we can use sample data to make inferences about the population, which is helpful. Statistics can tell us the chance of getting sample values at least as extreme as ours if we assume our null hypothesis is true. Regression was also quite interesting: I learned how variables do or do not affect each other, and that there are conditions we need to keep in mind to run accurate, useful tests, like the Gauss-Markov (GM) assumptions.