This document summarizes logistic regression analyses of the 2011-2017 microtest data. Depending on the data you are working with and your comfort level with R, the usefulness of this document/project package will vary. At the very least, you can see some of the useful procedures and functions for cleaning data and how to run a logistic regression on a given dataset. Knowledge of dplyr, and of the pipe operator %>% in general, will help with readability; these are some of the most useful tools in R anyway, so if you don't use them already, get on it! If you have any questions, feel free to email me at -REDACTED-

The required packages for this analysis are in “requirements.R”. Please install all these packages before running the data cleaning or analysis portions of this project.
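For instance, the packages named explicitly in this write-up can be installed in one shot (see requirements.R for the complete list; this is just a convenience snippet, not a replacement for it):

# install the packages used directly in this document; requirements.R has the full list
install.packages(c("dplyr", "pander", "mice"))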

Data Preparation

The data has been cleaned in the data_cleanup.R file and combined into a single dataframe with the columns of interest. A complete explanation of the nuts and bolts of the data cleaning file would be very time consuming, but the general process is outlined below.

A simpler data cleaning workflow is presented in the data_cleanup_2017.Rmd file, which outlines some steps taken to clean just the 2017 survey results.

After running the data_cleanup.R file, there should be a dataframe called aggregate_df, which contains microtest records from 2011-2017. wkbk_list is the list whose elements are the dataframes for each year.
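If you want a quick sanity check that the cleaning script produced what you expect, something like the following works (the object names are the ones created by data_cleanup.R; everything else is standard base R):

source("data_cleanup.R")       # builds aggregate_df and wkbk_list
dim(aggregate_df)              # number of client records and columns
length(wkbk_list)              # one element per survey year
str(wkbk_list, max.level = 1)  # peek at the per-year dataframes without printing them in full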

Let's have a look at this dataframe:
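The printout below can be reproduced with something like the following (the exact formatting of the output may differ):

head(aggregate_df)  # first six rows of the combined 2011-2017 data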
ID program intake date gender minority loan_amount intake_biz intake_sales intake_draw intake_outemploy intake_hhincome intake_hhsize intake.pubass survey_cmplt survey_biz survey_sales survey_draw survey_outemploy survey_outincome survey_pubass year
4204239 EDU 1.1.2015 Female No 0 No 0 0 Yes 50000 2 No "Could not survey but we know that the client HAD a business during 2016" NA NA NA NA NA NA 2017
4236598 EDU 4.1.2015 Female No 0 Yes 0 0 NA NA NA NA "Client completed survey" Yes 18000 9000 No NA No 2017
4263395 EDU 6.7.2015 Female No 0 No 0 0 NA NA NA NA "Client completed survey" No NA NA Yes NA No 2017
4033269 EDU 6.3.2014 Female No 0 Yes 13500 3600 Yes 28323 5 Yes "Could not survey do not know business status in 2016" NA NA NA NA NA NA 2017
1719520 EDU 7.1.2013 Male No 0 No 0 0 Yes 13000 1 Yes "Client completed survey" Yes 5400 500 Yes 15000 No 2017
4115089 EDU 10.24.2014 Male No 0 Yes 0 0 No 45000 7 Yes "Client completed survey" Yes 27316 17150 No NA Yes 2017

Variables

We are interested in two response variables: whether or not someone moved off public assistance, and whether or not an individual started a business. As one can see from the head of the dataframe shown above, we have selected columns representing the individual's unique ID, the program(s) they were enrolled in, intake date, gender, minority status, loan amount, business status at intake and survey, public assistance status at intake and survey, sales/draw at intake and survey, outside work status and income, and year.

Let's create the response variables as columns. For public assistance, a client gets a 1 if they responded “Yes” at intake and “No” at survey. For business starts, a client gets a 1 if they responded “No” at intake and “Yes” at survey (and 0 otherwise; NAs are ignored). The following code creates new dataframes from aggregate_df by selecting the rows (clients) who were on public assistance at intake (or who did not have a business at intake) and then creating our response variable. These tasks are accomplished using the filter() function and the very useful mutate(variable_name = ifelse()) structure.

survcmplt_pa <- aggregate_df %>% filter(intake.pubass == "Yes") %>% # create a dataframe that is the result of keeping only rows where intake.pubass is "Yes"
  mutate(movedoff = ifelse(survey_pubass == "No", 1, 0)) #<- Read this as:  create a new column called movedoff.  If this particular row has a value of "No" in the survey_pubass column, movedoff gets a 1, otherwise, it gets a zero.

surv_biz_start <- aggregate_df %>% filter(intake_biz == "No") %>% # keep only clients with no business at intake
  mutate(biz_start = ifelse(survey_biz == "Yes", 1, 0)) # 1 if they had a business by the survey, 0 otherwise

surv_biz_start_LN <- aggregate_df %>% filter(intake_biz == "No") %>%
  mutate(biz_start = ifelse(survey_biz == "Yes", 1, 0)) %>%
  filter(program == "LN") # same as above, restricted to loan program clients

Analysis

Now for the easy part. We perform two logistic regressions, with the response variables “moved off public assistance?” and “started a business in the fiscal year?”. Our explanatory variables for both regressions include household income, household size, gender, minority status, loan amount, and outside employment status. When the change in public assistance status is our response, we also include business status at intake as an explanatory variable. When the change in business status is our response, we include public assistance status at intake as an explanatory variable.

The general form of the glm() procedure is similar to lm(). You specify your response and explanatory variables, and additionally a family of distributions and a link function (don't worry if you don't know what that means; the majority of people who run glm() don't either). For our purposes, we will be using the binomial family with a logit link function.

logmod_pa <- glm(movedoff ~ program + intake_hhincome + intake_hhsize + intake_outemploy + intake_biz + gender + minority + loan_amount, data = survcmplt_pa, family = binomial(link = "logit"))
#summary(logmod_pa)
pander(logmod_pa)
Fitting generalized (binomial/logit) linear model: movedoff ~ program + intake_hhincome + intake_hhsize + intake_outemploy + intake_biz + gender + minority + loan_amount

                      Estimate    Std. Error   z value    Pr(>|z|)
(Intercept)            0.9199      0.5172       1.779      0.07529
programIDA            -0.6534      0.4105      -1.592      0.1115
programLN             -0.8135      0.7287      -1.116      0.2642
intake_hhincome        2.382e-05   1.316e-05    1.81       0.07033
intake_hhsize         -0.3289      0.1149      -2.864      0.004187
intake_outemployYes    0.1309      0.4019       0.3256     0.7447
intake_bizYes         -0.3303      0.3881      -0.851      0.3948
genderMale             0.1868      0.4035       0.463      0.6433
genderOther          -15.53      882.7         -0.01759    0.986
minorityYes           -0.002818    0.4834      -0.00583    0.9953
loan_amount            5.958e-05   3.597e-05    1.657      0.0976
logmod_biz <- glm(biz_start ~ program + intake_hhincome + intake_hhsize + intake_outemploy + intake.pubass + gender + minority + loan_amount, data = surv_biz_start, family = binomial(link = "logit"))
#summary(logmod_biz)
pander(logmod_biz)
Fitting generalized (binomial/logit) linear model: biz_start ~ program + intake_hhincome + intake_hhsize + intake_outemploy + intake.pubass + gender + minority + loan_amount

                      Estimate    Std. Error   z value    Pr(>|z|)
(Intercept)           -0.2752      0.5151      -0.5344     0.5931
programIDA             0.05044     0.4291       0.1176     0.9064
programLN              2.083       1.086        1.917      0.05525
intake_hhincome       -2.412e-06   3.656e-06   -0.6598     0.5094
intake_hhsize          0.2535      0.1643       1.543      0.1229
intake_outemployYes    0.8935      0.4026       2.219      0.02646
intake.pubassYes      -0.1514      0.3513      -0.4309     0.6665
genderMale            -0.4106      0.3762      -1.091      0.2751
genderOther           13.87      882.7          0.01571    0.9875
minorityYes           -0.3396      0.4568      -0.7434     0.4572
loan_amount            3.098e-05   5.835e-05    0.5308     0.5955

Public Assistance Status Results

The coefficient estimates for the model with public assistance status as the response are significant (at the 10% level) for the intercept, household size, household income, and loan amount. All of these coefficients are positive except for household size. A positive coefficient indicates an associated increase in the odds of moving off of public assistance as that variable increases; the negative coefficient on household size indicates the opposite. It is good to check that this meets some sort of intuition.
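Since the coefficients are on the log-odds scale, it can help to exponentiate them and read them as odds ratios. A minimal sketch (logmod_pa comes from the code above; confint.default() gives quick Wald intervals rather than profile-likelihood ones):

# odds ratios and rough 95% confidence intervals for the public assistance model
exp(cbind(odds_ratio = coef(logmod_pa), confint.default(logmod_pa)))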

Business Start Results

Coefficient estimates for business starts are significant for inclusion in the loan program and for outside employment. The first should not be surprising: people seeking small business loans are far more likely to be starting businesses. As for the significant estimate for outside employment, one might suspect that individuals without outside employment are likely looking for regular employment to support themselves first rather than starting a risky small business venture.

Variable Selection

When interpreting coefficient estimates, one must be careful of highly correlated variables. Clearly, inclusion in the loan program and loan_amount should be highly correlated. We can perform a simple ANOVA to check whether mean loan amount differs across the program groups.

anova1 <- aov(loan_amount ~ program, data = surv_biz_start)
summary(anova1)
##              Df    Sum Sq   Mean Sq F value  Pr(>F)    
## program       2 3.358e+10 1.679e+10   38.61 5.6e-16 ***
## Residuals   376 1.635e+11 4.348e+08                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This shows a clear difference in mean loan amount between the program groups. We can also check whether the inclusion of the program variable affects the model by doing a likelihood ratio test. The procedure here is to compare the likelihood of observing our data under a model with the program variable against a model without it.

#fit a model which includes program
logmod_progyes <- glm(movedoff ~ program + intake_hhincome + intake_hhsize + intake_outemploy + intake_biz + gender + minority + loan_amount, data = survcmplt_pa, family = binomial(link = "logit"))

#fit a model without program
logmod_progno <- glm(movedoff ~ intake_hhincome + intake_hhsize + intake_outemploy + intake_biz + gender + minority + loan_amount, data = survcmplt_pa, family = binomial(link = "logit"))

#perform a likelihood ratio test between the two models
anova(logmod_progyes, logmod_progno, test = "LRT")
#Perform the same procedure above but with biz_start as the response
logmod_progyes <- glm(biz_start ~ program + intake_hhincome + intake_hhsize + intake_outemploy + intake.pubass + gender + minority + loan_amount, data = surv_biz_start, family = binomial(link = "logit"))

logmod_progno <- glm(biz_start ~ intake_hhincome + intake_hhsize + intake_outemploy + intake.pubass + gender + minority + loan_amount, data = surv_biz_start, family = binomial(link = "logit"))

anova(logmod_progyes, logmod_progno, test = "LRT")

In both cases, we see non-significant \(\chi^2\) statistics, indicating that the inclusion of the program variable does not produce an appreciable difference in model fit. We expected this, since our assumption was that loan_amount contains most of the information in program. Let's re-run the regressions without program.
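The tables below come from refits along these lines (the *_noprog names are just placeholders; note that the reduced public assistance model shown below also drops intake_biz):

#refit the reduced models and print them with pander
logmod_pa_noprog <- glm(movedoff ~ intake_hhincome + intake_hhsize + intake_outemploy + gender + minority + loan_amount, data = survcmplt_pa, family = binomial(link = "logit"))
pander(logmod_pa_noprog)

logmod_biz_noprog <- glm(biz_start ~ intake_hhincome + intake_hhsize + intake_outemploy + intake.pubass + gender + minority + loan_amount, data = surv_biz_start, family = binomial(link = "logit"))
pander(logmod_biz_noprog)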

Fitting generalized (binomial/logit) linear model: movedoff ~ intake_hhincome + intake_hhsize + intake_outemploy + gender + minority + loan_amount

                      Estimate    Std. Error   z value    Pr(>|z|)
(Intercept)            0.3558      0.4353       0.8174     0.4137
intake_hhincome        2.636e-05   1.341e-05    1.966      0.0493
intake_hhsize         -0.3557      0.1119      -3.179      0.001477
intake_outemployYes    0.198       0.3742       0.5292     0.5967
genderMale             0.2437      0.3922       0.6214     0.5343
genderOther          -15.03      882.7         -0.01702    0.9864
minorityYes            0.0548      0.4719       0.1161     0.9076
loan_amount            4.103e-05   2.327e-05    1.763      0.07784

Fitting generalized (binomial/logit) linear model: biz_start ~ intake_hhincome + intake_hhsize + intake_outemploy + intake.pubass + gender + minority + loan_amount

                      Estimate    Std. Error   z value    Pr(>|z|)
(Intercept)           -0.3046      0.5068      -0.6011     0.5478
intake_hhincome       -2.724e-06   3.598e-06   -0.7572     0.4489
intake_hhsize          0.2715      0.1633       1.663      0.09638
intake_outemployYes    0.9207      0.3828       2.405      0.01616
intake.pubassYes      -0.1777      0.3475      -0.5112     0.6092
genderMale            -0.2859      0.3649      -0.7834     0.4334
genderOther           13.88      882.7          0.01573    0.9875
minorityYes           -0.18        0.4405      -0.4087     0.6828
loan_amount            0.0001482   6.348e-05    2.334      0.0196

For moving off of public assistance, the estimates for household size, household income, and loan amount remain significant (household income now at the 5% level, loan amount at the 10% level). The signs of the parameter estimates also remain the same.

For business starts, we see the effect of removing program from the model. Loan amount is now significant; it has assumed the role of program in describing the variation in the data, since it carries much of the same information. We also see significant estimates for outside employment and (at the 10% level) household size, broadly consistent with our previous model.

Conclusions and Ideas for Further Analysis

In general, the most we can gather from this data is that economic variables such as household size and income tend to inform whether or not a client will move off public assistance in the survey period. Specifically, the higher your household income, the more likely you were to move off public assistance, and the larger your household, the less likely. Business starts, on the other hand, can largely be explained by whether or not you got a loan. Got a loan? You probably started a business. Didn't seek a loan? You probably didn't start a business; not too much of interest there. The outside employment variable is somewhat interesting, which again we attempted to explain by someone's increased willingness to assume more risk when they already have a source of income (or perhaps savings) to fall back on if things don't work out.

You Can Add/Remove Variables Easily!

Feel free to play around with variable selection and transformation: perhaps stepwise methods, log transformation of certain variables, or adding other variables of interest from Vistashare. I have included the ID column in the fully aggregated dataframe. These IDs correspond to the System Name ID field in Vistashare, which allows you to add columns/variables using a simple left join. In the code below, your_df would be a dataframe with the System_Name_ID column and the other columns that you wish to join to the full dataset.

aggregate_df <- aggregate_df %>% 
  left_join(your_df, by = c("ID" = "System_Name_ID")) # match rows on the Vistashare System Name ID

ADD VARIABLES! See if there's a difference in some outcome between Foundations I and Foundations II graduates, or whether there is a relationship between the response and net worth, etc. Play around with different response variables. See what informs % increase in gross sales, % increase in owner draw, and so on; there is tons of data in Vistashare to play around with!
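As one example of an alternative response, a percent change in sales can be built from the intake_sales and survey_sales columns already in aggregate_df (the column name sales_pct_change and the choice to drop zero-sales intakes are mine):

sales_growth <- aggregate_df %>%
  filter(intake_sales > 0, !is.na(survey_sales)) %>%                      # avoid dividing by zero or by NA
  mutate(sales_pct_change = (survey_sales - intake_sales) / intake_sales) # e.g. 0.25 means 25% growth since intake

You could then model sales_pct_change (or a log-transformed version of it) with lm() rather than glm().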

Missing Data

There is a lot of missing data here, and the logistic regression we ran removes observations that do not have complete data. Try using the mice package to fill in missing values and see if your results change, or track down columns that can give you more complete data on some variable (e.g., race can be found in multiple columns; combine them to form a single complete column).
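A minimal sketch of the mice workflow, assuming the character columns are converted to factors first (the variables below match the public assistance model; the number of imputations and the seed are arbitrary, and note that mice will also impute missing values of the response here):

library(mice)
library(dplyr)

# impute only the columns used in the public assistance model
pa_model_data <- survcmplt_pa %>%
  select(movedoff, intake_hhincome, intake_hhsize, intake_outemploy,
         intake_biz, gender, minority, loan_amount) %>%
  mutate(across(where(is.character), as.factor))  # mice works with factors, not raw character columns

imp <- mice(pa_model_data, m = 5, seed = 1234)    # 5 imputed datasets

# refit the logistic regression on each imputed dataset and pool the estimates
fits <- with(imp, glm(movedoff ~ intake_hhincome + intake_hhsize + intake_outemploy +
                        intake_biz + gender + minority + loan_amount,
                      family = binomial(link = "logit")))
summary(pool(fits))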