rm(list = ls())
gc()
##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 536444 28.7    1197988   64         NA   669417 35.8
## Vcells 990072  7.6    8388608   64      16384  1851813 14.2
cat("\f")
# close the active graphics device, if one is open
if (dev.cur() > 1) dev.off()
  1. Implement the logistic regression on any dataset of your choice, and interpret your coefficients. Tell us why you should not run a multivariate regression.

For this example, I look at marketing campaign data to test whether a customer’s age and gender are associated with giving a positive response (yes) or a negative response (no) to a marketing campaign. The data are taken from Kaggle (https://www.kaggle.com/datasets/sujithmandala/marketing-campaign-positive-response-prediction?resource=download).

#set working directory and assign data to variable
setwd("/Users/ginaocchipinti/Documents/ADEC 7310  Data Analytics/Week 6")
df <- read.csv("campaign_responses.csv", header = TRUE, stringsAsFactors = TRUE)
df
##    customer_id age gender annual_income credit_score employed marital_status
## 1            1  35   Male         65000          720      Yes        Married
## 2            2  28 Female         45000          680       No         Single
## 3            3  42   Male         85000          750      Yes        Married
## 4            4  31 Female         55000          710      Yes         Single
## 5            5  47   Male         95000          790      Yes        Married
## 6            6  25 Female         38000          630       No         Single
## 7            7  39   Male         72000          740      Yes        Married
## 8            8  33 Female         48000          670      Yes         Single
## 9            9  51   Male        110000          820      Yes        Married
## 10          10  27 Female         40000          620       No         Single
## 11          11  44   Male         90000          780      Yes        Married
## 12          12  30 Female         52000          690      Yes         Single
## 13          13  36   Male         75000          730      Yes        Married
## 14          14  29 Female         45000          660       No         Single
## 15          15  49   Male        105000          800      Yes        Married
## 16          16  26 Female         36000          610       No         Single
## 17          17  41   Male         85000          760      Yes        Married
## 18          18  32 Female         54000          700      Yes         Single
## 19          19  37   Male         80000          740      Yes        Married
## 20          20  34 Female         60000          720      Yes         Single
## 21          21  43   Male         92000          770      Yes        Married
## 22          22  28 Female         42000          640       No         Single
## 23          23  38   Male         78000          750      Yes        Married
## 24          24  31 Female         48000          680      Yes         Single
## 25          25  45   Male         98000          790      Yes        Married
## 26          26  27 Female         40000          630       No         Single
## 27          27  40   Male         85000          760      Yes        Married
## 28          28  35 Female         62000          710      Yes         Single
## 29          29  46   Male        100000          800      Yes        Married
## 30          30  29 Female         44000          650       No         Single
## 31          31  42   Male         90000          780      Yes        Married
## 32          32  33 Female         56000          690      Yes         Single
## 33          33  39   Male         82000          750      Yes        Married
## 34          34  30 Female         50000          670      Yes         Single
## 35          35  48   Male        105000          810      Yes        Married
## 36          36  25 Female         35000          600       No         Single
## 37          37  41   Male         88000          770      Yes        Married
## 38          38  34 Female         58000          700      Yes         Single
## 39          39  43   Male         95000          780      Yes        Married
## 40          40  28 Female         43000          640       No         Single
## 41          41  37   Male         80000          750      Yes        Married
## 42          42  32 Female         52000          680      Yes         Single
## 43          43  45   Male        100000          800      Yes        Married
## 44          44  30 Female         46000          660       No         Single
## 45          45  40   Male         88000          770      Yes        Married
## 46          46  36 Female         64000          720      Yes         Single
## 47          47  47   Male        102000          790      Yes        Married
## 48          48  26 Female         38000          620       No         Single
## 49          49  42   Male         90000          760      Yes        Married
## 50          50  33 Female         54000          690      Yes         Single
## 51          51  39   Male         85000          750      Yes        Married
## 52          52  31 Female         50000          680      Yes         Single
## 53          53  46   Male         98000          800      Yes        Married
## 54          54  28 Female         42000          630       No         Single
## 55          55  41   Male         90000          770      Yes        Married
## 56          56  34 Female         60000          710      Yes         Single
##    no_of_children responded
## 1               2       Yes
## 2               0        No
## 3               3       Yes
## 4               1        No
## 5               2       Yes
## 6               0        No
## 7               2       Yes
## 8               0        No
## 9               3       Yes
## 10              0        No
## 11              2       Yes
## 12              0        No
## 13              1       Yes
## 14              0        No
## 15              3       Yes
## 16              0        No
## 17              2       Yes
## 18              0        No
## 19              2       Yes
## 20              1        No
## 21              3       Yes
## 22              0        No
## 23              2       Yes
## 24              0        No
## 25              3       Yes
## 26              0        No
## 27              2       Yes
## 28              1        No
## 29              3       Yes
## 30              0        No
## 31              2       Yes
## 32              0        No
## 33              2       Yes
## 34              0        No
## 35              3       Yes
## 36              0        No
## 37              2       Yes
## 38              1        No
## 39              3       Yes
## 40              0        No
## 41              2       Yes
## 42              0        No
## 43              3       Yes
## 44              0        No
## 45              2       Yes
## 46              1        No
## 47              3       Yes
## 48              0        No
## 49              2       Yes
## 50              0        No
## 51              2       Yes
## 52              0        No
## 53              3       Yes
## 54              0        No
## 55              2       Yes
## 56              1        No
# create a logistic regression model with our data

logit <- glm(responded ~ age + gender,
             data = df,
             family = "binomial")
summary(logit)
## 
## Call:
## glm(formula = responded ~ age + gender, family = "binomial", 
##     data = df)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.657e+01  4.052e+05       0        1
## age          1.570e-11  1.324e+04       0        1
## genderMale   5.313e+01  1.860e+05       0        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7.7632e+01  on 55  degrees of freedom
## Residual deviance: 3.2489e-10  on 53  degrees of freedom
## AIC: 6
## 
## Number of Fisher Scoring iterations: 25
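
The estimates that summary() reports for a binomial glm are on the log-odds scale, so it can help to convert them before reading them. The sketch below uses made-up coefficient values (b0 and b_age are illustrative names and numbers, not taken from the fitted model above) to show the two usual conversions.

```r
# Hypothetical log-odds coefficients, chosen only for illustration
b0    <- -4.0   # intercept: log-odds of "yes" when age is 0
b_age <-  0.1   # change in log-odds per one-year increase in age

# exp() turns a log-odds coefficient into an odds ratio
odds_ratio_age <- exp(b_age)

# plogis() is the inverse logit: it maps log-odds to a probability in (0, 1)
p_age_40 <- plogis(b0 + b_age * 40)

odds_ratio_age
p_age_40
```

For a fitted model, the same idea would be exp(coef(logit)) for odds ratios and predict(logit, type = "response") for probabilities.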

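Interpreting this regression, the coefficients are on the log-odds scale, not the probability scale. Taken at face value, the intercept of -26.57 is the log-odds of a “yes” for a female customer of age zero, the gender coefficient of 53.13 is the increase in log-odds for a male customer relative to a female customer of the same age, and the age coefficient of 1.57e-11 is the change in log-odds for each 1-year increase in age. These estimates should not be read literally, however. In this dataset every male customer responded “yes” and every female customer responded “no”, so gender predicts the outcome perfectly (complete separation). That is why the residual deviance is essentially zero (3.2489e-10), the standard errors are enormous (1.324e+04 to 4.052e+05), and the fit ran for 25 Fisher scoring iterations, which is the default maximum for glm. Under complete separation the maximum likelihood estimates are not finite, so the z values of 0 and p-values of 1 are an artifact of the inflated standard errors rather than evidence that age and gender have little to no effect. In general, a p-value is the probability of observing a z value at least as extreme as the one obtained if the null hypothesis of no effect is true; here it cannot be interpreted that way because the model did not reach stable estimates.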
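
One quick way to check whether a categorical predictor perfectly separates a binary outcome is to cross-tabulate the two. The sketch below uses a small made-up data frame (toy, not the campaign data) in which gender fully determines the response; the rule in the last step is a simple heuristic for a single categorical predictor, not a general separation test.

```r
# Toy data (made up) in which gender perfectly predicts the response
toy <- data.frame(
  gender    = factor(rep(c("Female", "Male"), each = 5)),
  responded = factor(rep(c("No", "Yes"), each = 5))
)

# Cross-tabulate the predictor against the outcome
tab <- table(toy$gender, toy$responded)
print(tab)

# Heuristic: the predictor separates the outcome when every row of the
# table has exactly one non-empty cell
separated <- all(rowSums(tab > 0) == 1)
separated
```

On the campaign data, the equivalent check would be table(df$gender, df$responded).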

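You would want to use logistic regression over multivariate (linear) regression when the dependent variable is discrete or binary (yes/no, for example) rather than continuous (GDP, for example). A linear regression fit to a 0/1 outcome can produce predicted values below 0 or above 1, which cannot be read as probabilities, and its residuals cannot be normally distributed with constant variance. Logistic regression does not require a linear relationship between the independent variables and the outcome itself, or normally distributed residuals; instead it assumes that the log-odds of the outcome are linear in the independent variables. Overall, logistic regression is better suited to a categorical outcome, and its predictors can be either categorical or quantitative.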
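
The difference can be seen by fitting both models to the same binary outcome. The sketch below uses simulated data (x, y, and the true slope of 2 are assumptions made up for illustration, not the campaign data) and compares predictions at a few values of x.

```r
set.seed(1)

# Simulated data: a binary outcome whose log-odds are linear in x
n <- 200
x <- runif(n, min = -3, max = 3)
y <- rbinom(n, size = 1, prob = plogis(2 * x))

# Linear regression on the 0/1 outcome versus logistic regression
lin_fit   <- lm(y ~ x)
logit_fit <- glm(y ~ x, family = binomial)

# Predictions at the edges and the middle of the x range
new_x <- data.frame(x = c(-3, 0, 3))
lin_pred   <- predict(lin_fit, newdata = new_x)                      # can fall outside [0, 1]
logit_pred <- predict(logit_fit, newdata = new_x, type = "response") # stays within [0, 1]

cbind(new_x, lin_pred, logit_pred)
```

The logistic predictions stay between 0 and 1 because they pass through the inverse logit, while nothing constrains the linear predictions to that range.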

EXTRA

Overall, hospital bankruptcies are low in the US, but we saw higher rates in states like West Virginia, Rhode Island, and Connecticut. Several models were run on this dataset and compared, including LogReg, Altman, and Ohlson. LogReg is a logistic regression that predicts a binary outcome such as bankrupt or not bankrupt, the Altman model is a well-known model that predicts bankruptcy from financial ratios, and the Ohlson model also uses logistic regression to predict whether a company will go bankrupt.

REFLECTION

I definitely learned a lot about data analysis in this course, from basic exploratory data analysis and loading data to understanding summary information and statistics, especially for larger data sets. I’ve learned interesting bits, like how to test a hypothesis about whether a procedure, technology, or other change affects a population. The reality is that there is always some error when we cannot survey an entire population, but we can use sample data to make inferences about the population, which is helpful. Statistics can tell us the chance of getting sample values at least as extreme as the ones we observed if we assume our null hypothesis is true. Regression was also interesting: I learned how variables do or don’t affect each other, and that there are conditions we need to keep in mind to run accurate, useful tests, like the Gauss-Markov (GM) assumptions.