This document summarizes some logistic regression analysis of the 2011-2017 microtest data. Depending on the data you are working with and your comfort level with R, the usefulness of this document and project package will vary. At the very least, you can see some of the useful procedures and functions used to clean data, and how to run a logistic regression on a given dataset. Knowledge of dplyr, and of the pipe operator %>% in general, will make the code easier to follow; these are some of the most useful tools in R anyway, so if you don't use them already, get on it! If you have any questions, feel free to email me at -REDACTED-
The required packages for this analysis are in “requirements.R”. Please install all these packages before running the data cleaning or analysis portions of this project.
The data has been cleaned in the data_cleanup.R file and combined into a single dataframe with columns of interest. A complete explanation of the nuts and bolts of the data cleaning file would be very time consuming, but the general process is as follows:
A simpler data cleaning workflow is presented in the data_cleanup_2017.Rmd file, which outlines some steps taken to clean just the 2017 survey results.
After running the data_cleanup.R file, there should be a dataframe called aggregate_df, which contains microtest records from 2011-2017. wkbk_list is the list whose elements are the dataframes for each year.
Let's have a look at this dataframe:

| ID | program | intake date | gender | minority | loan_amount | intake_biz | intake_sales | intake_draw | intake_outemploy | intake_hhincome | intake_hhsize | intake.pubass | survey_cmplt | survey_biz | survey_sales | survey_draw | survey_outemploy | survey_outincome | survey_pubass | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4204239 | EDU | 1.1.2015 | Female | No | 0 | No | 0 | 0 | Yes | 50000 | 2 | No | Could not survey but we know that the client HAD a business during 2016 | NA | NA | NA | NA | NA | NA | 2017 |
| 4236598 | EDU | 4.1.2015 | Female | No | 0 | Yes | 0 | 0 | NA | NA | NA | NA | Client completed survey | Yes | 18000 | 9000 | No | NA | No | 2017 |
| 4263395 | EDU | 6.7.2015 | Female | No | 0 | No | 0 | 0 | NA | NA | NA | NA | Client completed survey | No | NA | NA | Yes | NA | No | 2017 |
| 4033269 | EDU | 6.3.2014 | Female | No | 0 | Yes | 13500 | 3600 | Yes | 28323 | 5 | Yes | Could not survey do not know business status in 2016 | NA | NA | NA | NA | NA | NA | 2017 |
| 1719520 | EDU | 7.1.2013 | Male | No | 0 | No | 0 | 0 | Yes | 13000 | 1 | Yes | Client completed survey | Yes | 5400 | 500 | Yes | 15000 | No | 2017 |
| 4115089 | EDU | 10.24.2014 | Male | No | 0 | Yes | 0 | 0 | No | 45000 | 7 | Yes | Client completed survey | Yes | 27316 | 17150 | No | NA | Yes | 2017 |
We are interested in two response variables: whether or not someone moved off public assistance, and whether or not an individual started a business. As one can see from the head of the dataframe shown above, we have selected columns representing the individual's unique ID, program(s) they were enrolled in, intake date, gender, minority status, loan amount, business status at intake and survey, public assistance status at intake and survey, sales/draw at intake and survey, outside work status and income, and year.
Let's create the response variables as columns. For public assistance, a client gets a 1 if they responded “Yes” at intake and “No” at survey. For business starts, a client gets a 1 if they responded “No” at intake and “Yes” at survey (and 0 otherwise; NAs are ignored). The following code creates new dataframes from aggregate_df by selecting the rows (clients) who were on public assistance at intake (or did not have a business at intake) and then creating our response variable. These tasks are accomplished using the filter() function and the very useful mutate(variable_name = ifelse()) structure.
survcmplt_pa <- aggregate_df %>%
  filter(intake.pubass == "Yes") %>% # keep only rows where intake.pubass is "Yes"
  mutate(movedoff = ifelse(survey_pubass == "No", 1, 0)) # new column movedoff: 1 if survey_pubass is "No", otherwise 0
surv_biz_start <- aggregate_df %>%
  filter(intake_biz == "No") %>% # keep only clients without a business at intake
  mutate(biz_start = ifelse(survey_biz == "Yes", 1, 0))
surv_biz_start_LN <- aggregate_df %>%
  filter(intake_biz == "No") %>%
  mutate(biz_start = ifelse(survey_biz == "Yes", 1, 0)) %>%
  filter(program == "LN") # same as above, restricted to program "LN"
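To see how the ifelse() recoding treats missing survey answers, here is a minimal sketch on a made-up data frame. The `toy` data and its values are illustrative assumptions, not rows from aggregate_df.

```r
library(dplyr)

# Hypothetical toy data: four clients with made-up answers.
toy <- data.frame(
  intake.pubass = c("Yes", "Yes", "Yes", "No"),
  survey_pubass = c("No", "Yes", NA, "No"),
  stringsAsFactors = FALSE
)

toy_pa <- toy %>%
  filter(intake.pubass == "Yes") %>%                      # keep clients on assistance at intake
  mutate(movedoff = ifelse(survey_pubass == "No", 1, 0))  # an NA in survey_pubass stays NA

toy_pa$movedoff
```

The third client has no survey answer, so movedoff is NA rather than 0. With the default na.action, glm() later drops such rows, which is the sense in which NAs are "ignored".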
Now for the easy part. We perform two logistic regressions with our response variables being “moved off public assistance?” and “started a business in fiscal year?”. Our explanatory variables for both regressions are program, household income, household size, gender, minority status, loan amount, and outside employment status. When public assistance status change is our response, we also include business status at intake as an explanatory variable. When business status change is our response, we include public assistance at intake as an explanatory variable.
The general form of the glm() procedure is similar to lm(). You specify your response and explanatory variables, plus a family of distributions and a link function (don't worry if you don't know what that means; the majority of people who run glm() don’t either). For our purposes, we will be using the binomial family with a logit link function.
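For readers curious about what the logit link actually does, the following sketch may help. It uses R's built-in mtcars data purely for illustration (an assumption made here; it has nothing to do with the microtest data) and shows that the model is linear on the log-odds scale, with plogis() mapping back to probabilities.

```r
# Illustration on built-in data: model the 0/1 variable `am` from car weight.
fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

# The linear predictor lives on the log-odds (logit) scale...
eta <- predict(fit, type = "link")

# ...and the inverse logit, plogis(), maps it back to a probability in (0, 1).
p_manual <- plogis(eta)
p_direct <- predict(fit, type = "response")

# The two sets of fitted probabilities agree up to numerical tolerance.
all.equal(unname(p_manual), unname(p_direct))
```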
logmod_pa <- glm(movedoff ~ program + intake_hhincome + intake_hhsize + intake_outemploy + intake_biz + gender + minority + loan_amount, data = survcmplt_pa, family = binomial(link = "logit"))
#summary(logmod_pa)
pander(logmod_pa)
|   | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 0.9199 | 0.5172 | 1.779 | 0.07529 |
| programIDA | -0.6534 | 0.4105 | -1.592 | 0.1115 |
| programLN | -0.8135 | 0.7287 | -1.116 | 0.2642 |
| intake_hhincome | 2.382e-05 | 1.316e-05 | 1.81 | 0.07033 |
| intake_hhsize | -0.3289 | 0.1149 | -2.864 | 0.004187 |
| intake_outemployYes | 0.1309 | 0.4019 | 0.3256 | 0.7447 |
| intake_bizYes | -0.3303 | 0.3881 | -0.851 | 0.3948 |
| genderMale | 0.1868 | 0.4035 | 0.463 | 0.6433 |
| genderOther | -15.53 | 882.7 | -0.01759 | 0.986 |
| minorityYes | -0.002818 | 0.4834 | -0.00583 | 0.9953 |
| loan_amount | 5.958e-05 | 3.597e-05 | 1.657 | 0.0976 |
logmod_biz <- glm(biz_start ~ program + intake_hhincome + intake_hhsize + intake_outemploy + intake.pubass + gender + minority + loan_amount, data = surv_biz_start, family = binomial(link = "logit"))
#summary(logmod_biz)
pander(logmod_biz)
|   | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -0.2752 | 0.5151 | -0.5344 | 0.5931 |
| programIDA | 0.05044 | 0.4291 | 0.1176 | 0.9064 |
| programLN | 2.083 | 1.086 | 1.917 | 0.05525 |
| intake_hhincome | -2.412e-06 | 3.656e-06 | -0.6598 | 0.5094 |
| intake_hhsize | 0.2535 | 0.1643 | 1.543 | 0.1229 |
| intake_outemployYes | 0.8935 | 0.4026 | 2.219 | 0.02646 |
| intake.pubassYes | -0.1514 | 0.3513 | -0.4309 | 0.6665 |
| genderMale | -0.4106 | 0.3762 | -1.091 | 0.2751 |
| genderOther | 13.87 | 882.7 | 0.01571 | 0.9875 |
| minorityYes | -0.3396 | 0.4568 | -0.7434 | 0.4572 |
| loan_amount | 3.098e-05 | 5.835e-05 | 0.5308 | 0.5955 |
Public Assistance Status Results
The coefficient estimates for the model with public assistance status as the response are significant at the 10% level for the intercept, household income, and loan_amount; household size is the only estimate significant at the 5% level (p ≈ 0.004). The coefficients are positive except for household size. A positive coefficient indicates an associated increase in the odds of moving off of public assistance as the variable increases, while the negative coefficient for household size indicates a decrease. It is good to check that this matches some sort of intuition.
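Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to sanity-check. The sketch below uses two estimates copied from the public assistance output above; the rescaling to $10,000 steps of income is a choice made here for readability, not part of the original analysis.

```r
# Estimates copied from the public assistance model output above.
b_hhsize   <- -0.3289     # per additional household member
b_hhincome <- 2.382e-05   # per additional dollar of household income

or_hhsize       <- exp(b_hhsize)            # odds ratio per extra household member
or_income_10000 <- exp(b_hhincome * 10000)  # odds ratio per extra $10,000 of income

round(c(hhsize = or_hhsize, income_per_10k = or_income_10000), 2)
```

These come out at roughly 0.72 and 1.27: each additional household member is associated with about 28% lower odds of moving off public assistance, and each additional $10,000 of household income with about 27% higher odds, holding the other variables fixed. With the fitted model in hand, exp(coef(logmod_pa)) gives the odds ratio for every term at once.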
Business Start Results
Coefficient estimates for business starts are significant for outside employment (p ≈ 0.026) and marginally significant for inclusion in the loan program (p ≈ 0.055). This should not be surprising: people seeking small business loans are far more likely to be starting businesses. As for the significant estimate for outside employment, one might suspect that individuals without outside employment are likely looking for regular employment to support themselves first rather than starting a risky small business venture.
When interpreting coefficient estimates, one must be careful of highly correlated variables. Clearly, inclusion in the loan program and loan_amount should be highly correlated. We can perform a simple ANOVA to check whether mean loan_amount is equal across the program groups.
anova1 <- aov(loan_amount ~ program, data = surv_biz_start)
summary(anova1)
## Df Sum Sq Mean Sq F value Pr(>F)
## program 2 3.358e+10 1.679e+10 38.61 5.6e-16 ***
## Residuals 376 1.635e+11 4.348e+08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This shows a clear difference in mean loan amount between the program groups. We can also see whether the inclusion of the program variable affects the model by doing a likelihood ratio test. The procedure here is to compare the likelihood of observing our data under a model with the program variable and under a model without it.
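As a rough sketch of what a likelihood ratio test computes, the example below works the statistic out by hand on R's built-in mtcars data (chosen here only for illustration, not the microtest data) and compares it with what anova() reports.

```r
# Illustration on built-in data (not the microtest data).
full    <- glm(am ~ wt + hp, data = mtcars, family = binomial)
reduced <- glm(am ~ wt,      data = mtcars, family = binomial)

# Test statistic: twice the difference in maximized log-likelihoods.
lr_stat <- as.numeric(2 * (logLik(full) - logLik(reduced)))

# Degrees of freedom: number of extra parameters in the full model.
lr_df <- attr(logLik(full), "df") - attr(logLik(reduced), "df")

# Under the null that the extra terms add nothing, lr_stat is approximately chi-squared.
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# anova() reports the same statistic (as the drop in deviance) and the same p-value.
lrt <- anova(reduced, full, test = "LRT")
```

A large p-value means the smaller model fits about as well as the larger one, so the dropped variable is not adding much.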
#fit a model which includes program
logmod_progyes <- glm(movedoff ~ program + intake_hhincome + intake_hhsize + intake_outemploy + intake_biz + gender + minority + loan_amount, data = survcmplt_pa, family = binomial(link = "logit"))
#fit a model without program
logmod_progno <- glm(movedoff ~ intake_hhincome + intake_hhsize + intake_outemploy + intake_biz + gender + minority + loan_amount, data = survcmplt_pa, family = binomial(link = "logit"))
#perform a likelihood ratio test between the two models
anova(logmod_progyes, logmod_progno, test = "LRT")
#Perform the same procedure above but with biz_start as the response
logmod_progyes <- glm(biz_start ~ program + intake_hhincome + intake_hhsize + intake_outemploy + intake.pubass + gender + minority + loan_amount, data = surv_biz_start, family = binomial(link = "logit"))
logmod_progno <- glm(biz_start ~ intake_hhincome + intake_hhsize + intake_outemploy + intake.pubass + gender + minority + loan_amount, data = surv_biz_start, family = binomial(link = "logit"))
anova(logmod_progyes, logmod_progno, test = "LRT")
In both cases, we see insignificant \(\chi^2\) statistics, indicating that including the program variable does not produce an appreciable improvement in model fit. We expected this, since our assumption was that loan_amount carries most of the information contained in program. Let’s re-run the regressions without program.
|   | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 0.3558 | 0.4353 | 0.8174 | 0.4137 |
| intake_hhincome | 2.636e-05 | 1.341e-05 | 1.966 | 0.0493 |
| intake_hhsize | -0.3557 | 0.1119 | -3.179 | 0.001477 |
| intake_outemployYes | 0.198 | 0.3742 | 0.5292 | 0.5967 |
| genderMale | 0.2437 | 0.3922 | 0.6214 | 0.5343 |
| genderOther | -15.03 | 882.7 | -0.01702 | 0.9864 |
| minorityYes | 0.0548 | 0.4719 | 0.1161 | 0.9076 |
| loan_amount | 4.103e-05 | 2.327e-05 | 1.763 | 0.07784 |

|   | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -0.3046 | 0.5068 | -0.6011 | 0.5478 |
| intake_hhincome | -2.724e-06 | 3.598e-06 | -0.7572 | 0.4489 |
| intake_hhsize | 0.2715 | 0.1633 | 1.663 | 0.09638 |
| intake_outemployYes | 0.9207 | 0.3828 | 2.405 | 0.01616 |
| intake.pubassYes | -0.1777 | 0.3475 | -0.5112 | 0.6092 |
| genderMale | -0.2859 | 0.3649 | -0.7834 | 0.4334 |
| genderOther | 13.88 | 882.7 | 0.01573 | 0.9875 |
| minorityYes | -0.18 | 0.4405 | -0.4087 | 0.6828 |
| loan_amount | 0.0001482 | 6.348e-05 | 2.334 | 0.0196 |
For moving off of public assistance, the estimates for household size (p ≈ 0.001) and household income (p ≈ 0.049) are significant at the 5% level, and loan amount (p ≈ 0.078) remains significant at the 10% level. The signs of the parameter estimates also remain the same.
For business starts, we see the effect of removing program from the model. Loan amount is now significant (p ≈ 0.02): it is assuming the role of program in describing the variation in the data, since it carries much of the same information. The estimate for outside employment remains significant (p ≈ 0.016), consistent with our previous model, and household size is now significant at the 10% level (p ≈ 0.096).
In general, the most we can gather from this data is that economic variables such as household size and income tend to inform whether or not a client will move off public assistance in the survey period. Specifically, the higher your household income, the more likely you moved off public assistance, and the larger your household, the less likely. Business starts can largely be explained by whether or not you got a loan. Got a loan? You probably started a business. Didn’t seek a loan? You probably didn’t start a business; not too much of interest there. The outside employment variable is somewhat interesting, which again we attempted to explain by someone's increased willingness to assume more risk when they already have a source of income (or perhaps savings) to fall back on if things don’t work out.
Feel free to play around with variable selection and transformation: perhaps stepwise methods, log transformation of certain variables, or adding other variables of interest from Vistashare. I have included the ID column in the fully aggregated dataframe. These IDs correspond to the System Name ID field in Vistashare, which allows you to add columns/variables using a simple left join. In the code below, your_df would be a dataframe with the System_Name_ID column and any other columns that you wish to join to the full dataset.
aggregate_df <- aggregate_df %>%
  left_join(your_df, by = c("ID" = "System_Name_ID"))
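Here is a small sketch of how that join behaves, using made-up stand-ins for both data frames. The `toy_aggregate` table and the `net_worth` column are hypothetical and only there to make the example runnable.

```r
library(dplyr)

# Hypothetical stand-ins for aggregate_df and your_df.
toy_aggregate <- data.frame(ID = c(101, 102, 103),
                            program = c("EDU", "LN", "IDA"))
your_df <- data.frame(System_Name_ID = c(101, 103),
                      net_worth = c(5000, 12000))

# Keep every row of the left table; match on ID = System_Name_ID.
joined <- toy_aggregate %>%
  left_join(your_df, by = c("ID" = "System_Name_ID"))

joined
```

Every client in the left table is kept, and clients with no match in your_df (ID 102 here) get NA in the new column. If your_df contains the same System_Name_ID more than once, the join will duplicate the matching rows, so it is worth checking for duplicate IDs first.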
ADD VARIABLES! See if there's a difference in some outcome between foundations I and foundations II graduates, whether there is a relationship between the response and net worth, etc. Play around with different response variables. See what informs % increases in gross sales, % increases in owner draw, etc. There is tons of data in Vistashare to play around with!
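As one possible starting point for a percent-change response, the sketch below computes a percent increase in sales on made-up numbers. The column names mirror intake_sales and survey_sales from the dataframe, but the values and the `pct_sales_change` name are assumptions for illustration.

```r
# Hypothetical sales figures for four clients (NA = not reported).
toy <- data.frame(
  intake_sales = c(10000, 0, 5000, NA),
  survey_sales = c(15000, 8000, 4000, 12000)
)

# Percent change is undefined when intake sales are zero, so return NA there.
toy$pct_sales_change <- with(
  toy,
  ifelse(intake_sales > 0,
         100 * (survey_sales - intake_sales) / intake_sales,
         NA)
)

toy$pct_sales_change
```

Many clients report zero sales at intake, so deciding how to treat those rows (dropping them, or modeling the dollar change instead) matters more than the arithmetic itself. A continuous response like this would also call for lm() rather than a logistic glm().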
There is a lot of missing data here, and the logistic regressions we ran remove observations that do not have complete data. Try using the mice package to fill in missing values and see if your results change, or track down columns that can give you more complete data on some variable (e.g. race can be found in multiple columns; combine them to form a single complete column).
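A minimal sketch of the imputation workflow with mice is shown below, assuming the package is installed. The data are simulated, and the variable names and the choice of five imputations are illustrative assumptions rather than recommendations for the microtest data.

```r
library(mice)

# Simulated stand-in data: household size, income (in $1000s), and a 0/1 outcome.
set.seed(42)
n <- 200
toy <- data.frame(
  hhsize   = rpois(n, 3) + 1,
  hhincome = rnorm(n, mean = 30, sd = 8)
)
toy$movedoff <- rbinom(n, 1, plogis(1 - 0.3 * toy$hhsize))

# Knock out 20% of the income values to mimic missing intake data.
toy$hhincome[sample(n, 40)] <- NA

# Create 5 imputed datasets, fit the same logistic model on each,
# then pool the estimates across imputations.
imp    <- mice(toy, m = 5, printFlag = FALSE, seed = 42)
fits   <- with(imp, glm(movedoff ~ hhsize + hhincome, family = binomial))
pooled <- summary(pool(fits))

pooled
```

Comparing the pooled estimates with those from the complete-case glm() fit is a quick way to see how sensitive the conclusions are to the missing data.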