DSC607: Logistic - Bifurcation on New York State Startups

Project Overview

During the summer of 2019, I conducted research for a local venture capitalist who wanted to explore the migration of talent to and from communities across the country. Census-defined MSAs and CBSAs served as the baseline data for that research project. For this project, though, I will examine the probability of startup success by location and gauge the predictive power of such location-based decisions as a new economic downturn is upon us. More pointedly, the question to be examined is:

Where should a new company start? More importantly, from a data-driven decision standpoint, is the likelihood of a successful exit, through an initial public offering (IPO) or an acquisition, stronger in New York City or anywhere in Upstate New York?

The hypothesis states that location in New York State and other key variables do not make a considerable difference in terms of predicting the success of a new company.

The choice was made to begin with logistic regression to understand whether either subset of the data (New York City or Upstate) is more favorable to startup companies in terms of probability of success. By running separate models, instead of one combined model, the results may be more conclusive, and a declarative statement can be made about whether location really matters.

These data were extracted from Crunchbase at 2019-05-10 10:36:37 +0000. The set contains 3,819 observations and 17 variables and includes companies that received funding from 1991 to 2017. Of these, 3,527 are companies in New York City that received at least one round of investment, and 292 are in Upstate New York, primarily in metro areas (MSAs) like Albany, Buffalo, Rochester, and Syracuse. The Upstate number has increased since 2017 as the creation of technology companies has accelerated along with the desire to invest in these new endeavors. However, the end date for this data set was set at 2017 because companies are rarely acquired or go public in less than three years.

Logistic Regression Modeling for Upstate New York Companies

df <- read.csv("C:/Users/bjorzech/Desktop/upstate_final.csv", stringsAsFactors = FALSE)
head(df)

##                                      name status funding rounds investors
## 1                Loan Servicing Solutions      0   10000      1         1
## 2                       Immco Diagnostics      0   10000      1         1
## 3                                 EMED Co      0   10000      1         3
## 4 BioMedical Technologies Solutions, Inc.      0  280000      2         1
## 5                               dotSyntax      0  500000      1         1
## 6                           Content Savvy      0  992250      3         1

tail(df)

##                      name status    funding rounds investors
## 287                 Ioxus      1   69000000      4         4
## 288               Rheonix      1   88724758      8         5
## 289 Kinex Pharmaceuticals      1  102149580      6         6
## 290           L100003 GCS      1  170000000      1         6
## 291               Chobani      1  750000000      1         7
## 292            Carestream      1 2400000000      1         9

For presentation purposes, only the head and tail of the data frame are shown. Additionally, the variables were converted to numeric, except status, which was converted to a factor, and name, which was not used in this model.

# status is the binary outcome, so it is stored as a factor; the predictors are numeric
df$status <- as.factor(df$status)
df$funding <- as.numeric(df$funding)
df$rounds <- as.numeric(df$rounds)
df$investors <- as.numeric(df$investors)
str(df)

## 'data.frame':    292 obs. of  5 variables:
##  $ name     : chr  "Loan Servicing Solutions" "Immco Diagnostics" "EMED Co" "BioMedical Technologies Solutions, Inc." ...
##  $ status   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ funding  : num  10000 10000 10000 280000 500000 ...
##  $ rounds   : num  1 1 1 2 1 3 1 1 1 1 ...
##  $ investors: num  1 1 3 1 1 1 1 1 1 1 ...

By using a Generalized Linear Model (GLM) for logistic regression, we aim to predict a binary outcome from a set of numeric predictor variables. In this case, the binary outcome is the dependent variable, status: either the company is still operating and receiving funding, or it has been acquired or has exited with an IPO. The outcome is therefore dichotomous, and the independent variables funding (level), investors, and rounds (of investment) were used.
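Written out, the model fitted in the next step takes the standard logistic form shown here. In this sketch, p denotes the probability that status equals 1 (with a two-level factor response, R's glm treats the second level as the modeled outcome), and the β symbols are notation introduced for illustration rather than taken from the original analysis.

```latex
\log\!\left(\frac{p}{1-p}\right)
  = \beta_0 + \beta_1\,\text{funding} + \beta_2\,\text{rounds} + \beta_3\,\text{investors},
\qquad
p = \Pr(\text{status} = 1)
```

Each coefficient is therefore a change in the log-odds of status = 1 for a one-unit increase in the corresponding predictor, holding the others fixed.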

logit <- glm(status ~ funding + rounds + investors,
             data = df, family = binomial(link = "logit"))
summary(logit)

## 
## Call:
## glm(formula = status ~ funding + rounds + investors, family = binomial(link = "logit"), 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4055   0.3388   0.3389   0.3991   1.3140  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.389e+00  4.143e-01   8.179 2.85e-16 ***
## funding      1.277e-09  2.871e-09   0.445    0.656    
## rounds      -3.387e-01  1.490e-01  -2.274    0.023 *  
## investors   -2.217e-01  1.617e-01  -1.371    0.170    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 165.91  on 291  degrees of freedom
## Residual deviance: 156.88  on 288  degrees of freedom
## AIC: 164.88
## 
## Number of Fisher Scoring iterations: 6
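Because the coefficients above are on the log-odds scale, they can be easier to read as odds ratios. The following is a minimal sketch that simply re-uses the estimates and standard errors printed in the summary; the object names (est, se, odds_ratio, wald_ci) are introduced here for illustration, and the 95% Wald interval is an approximation, not part of the original analysis.

```r
# Estimates and standard errors copied from the summary(logit) output above
# (Upstate New York model); nothing is re-estimated here.
est <- c(funding = 1.277e-09, rounds = -3.387e-01, investors = -2.217e-01)
se  <- c(funding = 2.871e-09, rounds = 1.490e-01,  investors = 1.617e-01)

# exp() converts a log-odds coefficient into an odds ratio
# for a one-unit increase in the predictor.
odds_ratio <- exp(est)

# Approximate 95% Wald intervals, transformed to the odds-ratio scale.
wald_ci <- cbind(lower = exp(est - 1.96 * se),
                 upper = exp(est + 1.96 * se))

# funding is recorded in dollars, so its per-dollar effect is tiny by
# construction; rescale to "per additional $1 million" for readability.
or_funding_per_million <- exp(est[["funding"]] * 1e6)

print(round(cbind(odds_ratio, wald_ci), 4))
print(or_funding_per_million)
```

Read this way, each additional funding round multiplies the odds of status = 1 by roughly 0.71, and the interval for rounds stays below 1 while the interval for investors spans 1, which matches the significance stars in the table.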

The R package “caret” is loaded to test the predictive power of the logistic regression model through 10-fold cross-validation. The results that follow are analyzed in the final research paper.

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

fold <- trainControl(method = "cv", number = 10)
fold

## $method
## [1] "cv"
## 
## $number
## [1] 10
## 
## $repeats
## [1] NA
## 
## $search
## [1] "grid"
## 
## $p
## [1] 0.75
## 
## $initialWindow
## NULL
## 
## $horizon
## [1] 1
## 
## $fixedWindow
## [1] TRUE
## 
## $skip
## [1] 0
## 
## $verboseIter
## [1] FALSE
## 
## $returnData
## [1] TRUE
## 
## $returnResamp
## [1] "final"
## 
## $savePredictions
## [1] FALSE
## 
## $classProbs
## [1] FALSE
## 
## $summaryFunction
## function (data, lev = NULL, model = NULL) 
## {
##     if (is.character(data$obs)) 
##         data$obs <- factor(data$obs, levels = lev)
##     postResample(data[, "pred"], data[, "obs"])
## }
## <bytecode: 0x000000001cfba9c0>
## <environment: namespace:caret>
## 
## $selectionFunction
## [1] "best"
## 
## $preProcOptions
## $preProcOptions$thresh
## [1] 0.95
## 
## $preProcOptions$ICAcomp
## [1] 3
## 
## $preProcOptions$k
## [1] 5
## 
## $preProcOptions$freqCut
## [1] 19
## 
## $preProcOptions$uniqueCut
## [1] 10
## 
## $preProcOptions$cutoff
## [1] 0.9
## 
## 
## $sampling
## NULL
## 
## $index
## NULL
## 
## $indexOut
## NULL
## 
## $indexFinal
## NULL
## 
## $timingSamps
## [1] 0
## 
## $predictionBounds
## [1] FALSE FALSE
## 
## $seeds
## [1] NA
## 
## $adaptive
## $adaptive$min
## [1] 5
## 
## $adaptive$alpha
## [1] 0.05
## 
## $adaptive$method
## [1] "gls"
## 
## $adaptive$complete
## [1] TRUE
## 
## 
## $trim
## [1] FALSE
## 
## $allowParallel
## [1] TRUE
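The control object above requests 10-fold cross-validation. As a rough illustration of what that procedure does, the sketch below carries it out by hand in base R. It runs on simulated stand-in data (the column names mirror this report, the values do not), the 0.5 classification cut-off is an assumption of the sketch, and caret's own fold assignment may differ in detail.

```r
set.seed(607)  # arbitrary seed so the sketch is reproducible

# SIMULATED stand-in data: not the Crunchbase file used in this report.
n <- 300
sim <- data.frame(
  rounds    = rpois(n, 2) + 1,
  investors = rpois(n, 1) + 1
)
sim$status <- factor(rbinom(n, 1, plogis(2.5 - 0.3 * sim$rounds)),
                     levels = c(0, 1))

k <- 10
# Assign every row to one of k folds at random (30 rows per fold here).
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_accuracy <- sapply(seq_len(k), function(i) {
  train_set <- sim[fold_id != i, ]   # fit on the other k - 1 folds
  test_set  <- sim[fold_id == i, ]   # evaluate on the held-out fold
  fit  <- glm(status ~ rounds + investors, data = train_set,
              family = binomial())
  prob <- predict(fit, newdata = test_set, type = "response")
  pred <- ifelse(prob > 0.5, "1", "0")   # assumed 0.5 cut-off
  mean(pred == as.character(test_set$status))
})

# The cross-validated accuracy is the average over the held-out folds.
cv_accuracy <- mean(fold_accuracy)
print(cv_accuracy)
```

Averaging over held-out folds gives an estimate of out-of-sample accuracy, which is the quantity the caret output reports in its Accuracy column.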

logit.cv <- train(status ~ funding + rounds + investors,
                  data = df,
                  trControl = fold,
                  method = "glm",
                  family = binomial())
logit.cv

## Generalized Linear Model 
## 
## 292 samples
##   3 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 263, 263, 263, 263, 262, 263, ... 
## Resampling results:
## 
##   Accuracy   Kappa       
##   0.9145977  -0.005263158
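High accuracy paired with a Kappa near zero is the pattern that appears when a classifier rarely departs from the majority class, since Kappa measures agreement beyond what chance alone would produce. The sketch below shows the arithmetic on HYPOTHETICAL counts chosen only for illustration; they are not taken from the Crunchbase data, and cohen_kappa is a helper written for this example.

```r
# Cohen's kappa from a confusion matrix (rows = predicted, columns = observed).
cohen_kappa <- function(cm) {
  n <- sum(cm)
  p_observed <- sum(diag(cm)) / n                     # plain accuracy
  p_expected <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
  (p_observed - p_expected) / (1 - p_expected)
}

# HYPOTHETICAL example: 100 companies, 91 of them with status 1,
# and a classifier that predicts "1" for every company.
cm <- matrix(c(0,  0,
               9, 91),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("0", "1"),
                             observed  = c("0", "1")))

accuracy <- sum(diag(cm)) / sum(cm)
kappa    <- cohen_kappa(cm)

print(accuracy)  # 0.91
print(kappa)     # 0
```

In this constructed case the accuracy equals the share of the majority class, so the model adds nothing beyond always guessing "1". Whether the cross-validated model above behaves this way would need to be checked against its own confusion matrix.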

Logistic Regression Modeling for New York City Companies

The second model was then run on the New York City-specific data set. Please note that the following steps are identical to those for the previous Upstate New York model. The primary difference is the number of companies, which is substantially larger in this data set (3,527 versus 292).

df <- read.csv("C:/Users/bjorzech/Desktop/NYC_final.csv", stringsAsFactors = FALSE)
head(df)

##                          name status  funding rounds investors
## 1         BratPackStyle, LLC.      1    10000      1         4
## 2                     waywire      0  1750000      1         1
## 3                         x+1      1 45000000      4         1
## 4 10,000PublicRelations, Inc.      1  6000000      1         3
## 5   10000 Secure Technologies      0  3400000      1         1
## 6                    1010data      0 35000000      1         1

tail(df)

##        name status funding rounds investors
## 3522   Zula      1 4000000      3         1
## 3523  Zumur      1  700000      1         5
## 3524  zurvu      0 1200000      1         3
## 3525   Zuse      1   10000      1         1
## 3526 Zuznow      1  650000      1         1
## 3527   Zype      1 3300000      2         5

df$status <- as.factor(df$status)
df$funding <- as.numeric(df$funding)
df$rounds <- as.numeric(df$rounds)
df$investors <- as.numeric(df$investors)
str(df)

## 'data.frame':    3527 obs. of  5 variables:
##  $ name     : chr  "BratPackStyle, LLC." "waywire" "x+1" "10,000PublicRelations, Inc." ...
##  $ status   : Factor w/ 2 levels "0","1": 2 1 2 2 1 1 2 1 2 2 ...
##  $ funding  : num  1.00e+04 1.75e+06 4.50e+07 6.00e+06 3.40e+06 3.50e+07 1.00e+04 1.70e+06 1.17e+08 2.80e+06 ...
##  $ rounds   : num  1 1 4 1 1 1 1 2 5 2 ...
##  $ investors: num  4 1 1 3 1 1 2 1 1 1 ...

logit <- glm(status ~ funding + rounds + investors,
             data = df, family = binomial(link = "logit"))

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(logit)

## 
## Call:
## glm(formula = status ~ funding + rounds + investors, family = binomial(link = "logit"), 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4554   0.4308   0.4884   0.5242   1.8713  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.082e+00  1.185e-01  17.573  < 2e-16 ***
## funding     -2.875e-09  8.526e-10  -3.372 0.000745 ***
## rounds      -1.486e-01  3.309e-02  -4.491 7.09e-06 ***
## investors    1.322e-01  3.838e-02   3.446 0.000569 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2595.3  on 3526  degrees of freedom
## Residual deviance: 2523.5  on 3523  degrees of freedom
## AIC: 2531.5
## 
## Number of Fisher Scoring iterations: 7
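One way to compare the two fits is through the drop from null to residual deviance, which gives a likelihood-ratio check of whether the three predictors jointly improve on an intercept-only model. The sketch below uses only the deviances and degrees of freedom printed in the two summaries in this report; the data frame and its column names are introduced here for illustration, and this test is an additional reading rather than part of the original analysis.

```r
# Deviances and degrees of freedom copied from the two summary() outputs.
models <- data.frame(
  model     = c("Upstate", "NYC"),
  null_dev  = c(165.91, 2595.3),
  resid_dev = c(156.88, 2523.5),
  null_df   = c(291, 3526),
  resid_df  = c(288, 3523)
)

# Likelihood-ratio statistic: reduction in deviance due to the predictors.
models$chisq <- models$null_dev - models$resid_dev
models$df    <- models$null_df - models$resid_df

# Upper-tail chi-squared probability for that reduction.
models$p_value <- pchisq(models$chisq, df = models$df, lower.tail = FALSE)

# Share of the null deviance removed by the predictors.
models$dev_explained <- models$chisq / models$null_dev

print(models)
```

Both reductions are statistically detectable at the 5% level, yet in each model the predictors remove only a small share of the null deviance (under 6%), which is worth keeping in mind when reading the cross-validated accuracy figures.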

library(caret)

fold <- trainControl(method = "cv", number = 10)

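Please note that the warning “fitted probabilities numerically 0 or 1 occurred” was taken into consideration. It indicates that, for at least one company, the fitted probability is numerically indistinguishable from 0 or 1. The decision was made to proceed despite the warning because the model also considers the “rounds” and “investors” variables; based on the documentation consulted, the warning can be set aside when it does not conflict with the variables or the purpose of the model.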
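For readers unfamiliar with this warning, the sketch below reproduces it on SIMULATED data in which a single predictor separates the outcome perfectly, which is one common trigger. It also shows one way to count how many fitted probabilities are affected. The data, the 1e-8 tolerance, and the variable names are assumptions of the sketch; it does not establish what causes the warning in the New York City data.

```r
# SIMULATED data, not the Crunchbase file: every x above 10 has y = 1,
# so x separates the two outcomes perfectly.
x <- 1:20
y <- as.integer(x > 10)

# Fit the model while recording any warnings instead of printing them.
msgs <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, family = binomial()),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))  # keep the warning text
    invokeRestart("muffleWarning")         # continue without printing it
  }
)
print(msgs)

# Count fitted probabilities that are practically 0 or 1
# (the 1e-8 tolerance is an arbitrary choice for this sketch).
p_hat <- fitted(fit)
n_extreme <- sum(p_hat < 1e-8 | p_hat > 1 - 1e-8)
print(n_extreme)
```

Applied to a fitted model such as logit, the same count on fitted(logit) would show whether the warning concerns a handful of extreme observations or a large part of the data.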

logit.cv <- train(status ~ funding + rounds + investors,
                  data = df,
                  trControl = fold,
                  method = "glm",
                  family = binomial())

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

(The same warning was printed ten times in total.)

logit.cv

## Generalized Linear Model 
## 
## 3527 samples
##    3 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3174, 3174, 3175, 3173, 3174, 3175, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8803492  0.0222645

The overall findings and analysis can be found in the final research paper for DSC607: Data Mining.

Brett Orzechowski

6/28/2020
