During the summer of 2019, I conducted research for a local venture capitalist who wanted to explore the migration of talent to and from communities across the country. Census-defined metropolitan statistical areas (MSAs) and core-based statistical areas (CBSAs) served as the baseline data for that research project. For this project, though, I will examine the probability of startup success by location and gauge the predictive power of such location-based decisions as a new economic downturn is upon us. More pointedly, the question to be examined is:
Where should a new company start? More importantly, from a data-driven standpoint, is the likelihood of a successful exit, through an initial public offering (IPO) or an acquisition, greater in New York City or in Upstate New York?
The hypothesis states that location in New York State and other key variables do not make a considerable difference in terms of predicting the success of a new company.
The choice was made to begin with logistic regression to understand whether either subset of the data (New York City or Upstate) proved to be more favorable to startup companies in terms of probability of success. Running a separate model for each subset, instead of one combined model, may make the results more conclusive and allow a definitive statement about whether location really matters.
These data were extracted from Crunchbase on 2019-05-10 at 10:36:37 +0000. The set contains 3,819 observations and 17 variables and includes companies that received funding from 1991 to 2017. Of these, 3,527 are New York City companies that received at least one round of investment, and 292 are located in Upstate New York, primarily in metro areas (MSAs) like Albany, Buffalo, Rochester, and Syracuse. That number has increased since 2017 as the creation of technology companies has accelerated along with the desire to invest in these new endeavors. However, the decision was made to end this data set at 2017, simply because companies are rarely acquired or taken public in less than three years.
df <- read.csv("C:/Users/bjorzech/Desktop/upstate_final.csv",stringsAsFactors = FALSE)
head(df)
## name status funding rounds investors
## 1 Loan Servicing Solutions 0 10000 1 1
## 2 Immco Diagnostics 0 10000 1 1
## 3 EMED Co 0 10000 1 3
## 4 BioMedical Technologies Solutions, Inc. 0 280000 2 1
## 5 dotSyntax 0 500000 1 1
## 6 Content Savvy 0 992250 3 1
tail(df)
## name status funding rounds investors
## 287 Ioxus 1 69000000 4 4
## 288 Rheonix 1 88724758 8 5
## 289 Kinex Pharmaceuticals 1 102149580 6 6
## 290 L100003 GCS 1 170000000 1 6
## 291 Chobani 1 750000000 1 7
## 292 Carestream 1 2400000000 1 9
For presentation purposes, only the head and tail of the data frame are shown. Additionally, the variables were converted to numeric, except status, which was converted to a factor, and name, which is not used in the model.
df$status <- as.factor(df$status)
df$funding <- as.numeric(df$funding)
df$rounds <- as.numeric(df$rounds)
df$investors <- as.numeric(df$investors)
str(df)
## 'data.frame': 292 obs. of 5 variables:
## $ name : chr "Loan Servicing Solutions" "Immco Diagnostics" "EMED Co" "BioMedical Technologies Solutions, Inc." ...
## $ status : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ funding : num 10000 10000 10000 280000 500000 ...
## $ rounds : num 1 1 1 2 1 3 1 1 1 1 ...
## $ investors: num 1 1 3 1 1 1 1 1 1 1 ...
By using a Generalized Linear Model (GLM) for logistic regression, we aim to predict a binary outcome from a set of numeric variables. In this case, the binary outcome is the dependent variable, status: either the company is operating and receiving funding, or it has been acquired or has exited with an IPO. The outcome is therefore dichotomous, and the independent variables used were funding (level), investors, and rounds (of investment).
logit <- glm(status ~ funding + rounds + investors,
data=df, family=binomial(link = "logit"))
summary(logit)
##
## Call:
## glm(formula = status ~ funding + rounds + investors, family = binomial(link = "logit"),
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4055 0.3388 0.3389 0.3991 1.3140
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.389e+00 4.143e-01 8.179 2.85e-16 ***
## funding 1.277e-09 2.871e-09 0.445 0.656
## rounds -3.387e-01 1.490e-01 -2.274 0.023 *
## investors -2.217e-01 1.617e-01 -1.371 0.170
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 165.91 on 291 degrees of freedom
## Residual deviance: 156.88 on 288 degrees of freedom
## AIC: 164.88
##
## Number of Fisher Scoring iterations: 6
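The coefficients in the summary are on the log-odds scale, which is hard to read directly. As a minimal sketch, the Upstate estimates reported above can be exponentiated into odds ratios. The vector name `upstate_coef` is introduced here for illustration only, and the values are typed in from the printed output rather than taken from the fitted object.

```r
# Estimates copied from the Upstate summary output above (log-odds scale).
upstate_coef <- c(
  "(Intercept)" = 3.389,
  funding       = 1.277e-09,
  rounds        = -0.3387,
  investors     = -0.2217
)

# Exponentiating a logit coefficient gives the multiplicative change in the
# odds of status = 1 for a one-unit increase in that predictor.
odds_ratios <- exp(upstate_coef)
print(round(odds_ratios, 4))

# plogis() is the inverse logit; applied to the intercept it gives the fitted
# probability of status = 1 when every predictor equals zero.
baseline_prob <- plogis(upstate_coef[["(Intercept)"]])
print(round(baseline_prob, 4))
```

With the fitted model still in memory, `exp(coef(logit))` would give the same odds ratios at full precision.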
The R package “caret” is loaded to test the predictive power of the logistic regression model with 10-fold cross-validation. The results that follow are analyzed in the final research paper.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
fold <- trainControl(method = "cv", number = 10)
fold
## $method
## [1] "cv"
##
## $number
## [1] 10
##
## $repeats
## [1] NA
##
## $search
## [1] "grid"
##
## $p
## [1] 0.75
##
## $initialWindow
## NULL
##
## $horizon
## [1] 1
##
## $fixedWindow
## [1] TRUE
##
## $skip
## [1] 0
##
## $verboseIter
## [1] FALSE
##
## $returnData
## [1] TRUE
##
## $returnResamp
## [1] "final"
##
## $savePredictions
## [1] FALSE
##
## $classProbs
## [1] FALSE
##
## $summaryFunction
## function (data, lev = NULL, model = NULL)
## {
## if (is.character(data$obs))
## data$obs <- factor(data$obs, levels = lev)
## postResample(data[, "pred"], data[, "obs"])
## }
## <bytecode: 0x000000001cfba9c0>
## <environment: namespace:caret>
##
## $selectionFunction
## [1] "best"
##
## $preProcOptions
## $preProcOptions$thresh
## [1] 0.95
##
## $preProcOptions$ICAcomp
## [1] 3
##
## $preProcOptions$k
## [1] 5
##
## $preProcOptions$freqCut
## [1] 19
##
## $preProcOptions$uniqueCut
## [1] 10
##
## $preProcOptions$cutoff
## [1] 0.9
##
##
## $sampling
## NULL
##
## $index
## NULL
##
## $indexOut
## NULL
##
## $indexFinal
## NULL
##
## $timingSamps
## [1] 0
##
## $predictionBounds
## [1] FALSE FALSE
##
## $seeds
## [1] NA
##
## $adaptive
## $adaptive$min
## [1] 5
##
## $adaptive$alpha
## [1] 0.05
##
## $adaptive$method
## [1] "gls"
##
## $adaptive$complete
## [1] TRUE
##
##
## $trim
## [1] FALSE
##
## $allowParallel
## [1] TRUE
logit.cv <- train(status ~ funding + rounds + investors,
data=df,
trControl = fold,
method = "glm",
family=binomial())
logit.cv
## Generalized Linear Model
##
## 292 samples
## 3 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 263, 263, 263, 263, 262, 263, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9145977 -0.005263158
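High accuracy paired with a Kappa near zero is the pattern that appears when one class dominates and a classifier almost always predicts that class. The sketch below illustrates the arithmetic with made-up labels (91 of 100 cases in class "1"); the vectors `obs` and `pred` are hypothetical and are not the cross-validation predictions from the model above.

```r
# Hypothetical labels: 100 companies, 91 in class "1" and 9 in class "0".
obs <- factor(c(rep("1", 91), rep("0", 9)), levels = c("0", "1"))

# A trivial classifier that always predicts the majority class.
pred <- factor(rep("1", 100), levels = c("0", "1"))

# Confusion matrix: rows are predictions, columns are observed classes.
tab <- table(pred, obs)
n <- sum(tab)

# Observed agreement (accuracy) and the agreement expected by chance,
# computed from the row and column totals.
accuracy <- sum(diag(tab)) / n
expected <- sum(rowSums(tab) * colSums(tab)) / n^2

# Cohen's kappa: agreement beyond chance, scaled by the maximum possible.
kappa_stat <- (accuracy - expected) / (1 - expected)

print(c(accuracy = accuracy, kappa = kappa_stat))
```

In this toy case the accuracy equals the majority-class share while Kappa is zero, which is why accuracy alone can overstate predictive power on imbalanced data.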
The second model was then run on the New York City-specific data set. Please note that the following steps are identical to those for the previous Upstate New York model. The primary difference is the number of companies, which is considerably larger in this data set.
df <- read.csv("C:/Users/bjorzech/Desktop/NYC_final.csv",stringsAsFactors = FALSE)
head(df)
## name status funding rounds investors
## 1 BratPackStyle, LLC. 1 10000 1 4
## 2 waywire 0 1750000 1 1
## 3 x+1 1 45000000 4 1
## 4 10,000PublicRelations, Inc. 1 6000000 1 3
## 5 10000 Secure Technologies 0 3400000 1 1
## 6 1010data 0 35000000 1 1
tail(df)
## name status funding rounds investors
## 3522 Zula 1 4000000 3 1
## 3523 Zumur 1 700000 1 5
## 3524 zurvu 0 1200000 1 3
## 3525 Zuse 1 10000 1 1
## 3526 Zuznow 1 650000 1 1
## 3527 Zype 1 3300000 2 5
df$status <- as.factor(df$status)
df$funding <- as.numeric(df$funding)
df$rounds <- as.numeric(df$rounds)
df$investors <- as.numeric(df$investors)
str(df)
## 'data.frame': 3527 obs. of 5 variables:
## $ name : chr "BratPackStyle, LLC." "waywire" "x+1" "10,000PublicRelations, Inc." ...
## $ status : Factor w/ 2 levels "0","1": 2 1 2 2 1 1 2 1 2 2 ...
## $ funding : num 1.00e+04 1.75e+06 4.50e+07 6.00e+06 3.40e+06 3.50e+07 1.00e+04 1.70e+06 1.17e+08 2.80e+06 ...
## $ rounds : num 1 1 4 1 1 1 1 2 5 2 ...
## $ investors: num 4 1 1 3 1 1 2 1 1 1 ...
logit <- glm(status ~ funding + rounds + investors,
data=df, family=binomial(link = "logit"))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logit)
##
## Call:
## glm(formula = status ~ funding + rounds + investors, family = binomial(link = "logit"),
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4554 0.4308 0.4884 0.5242 1.8713
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.082e+00 1.185e-01 17.573 < 2e-16 ***
## funding -2.875e-09 8.526e-10 -3.372 0.000745 ***
## rounds -1.486e-01 3.309e-02 -4.491 7.09e-06 ***
## investors 1.322e-01 3.838e-02 3.446 0.000569 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2595.3 on 3526 degrees of freedom
## Residual deviance: 2523.5 on 3523 degrees of freedom
## AIC: 2531.5
##
## Number of Fisher Scoring iterations: 7
library(caret)
fold <- trainControl(method = "cv", number = 10)
Please note that the warnings were taken into consideration, and the decision was made to ignore them simply because the model considered the “rounds” and “investors” variables. According to the research and documentation consulted, such warnings can be ignored if there is no conflict between the variables and the purpose of the model.
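The warning `fitted probabilities numerically 0 or 1 occurred` indicates that some fitted probabilities are indistinguishable from 0 or 1 at machine precision. One way to see which observations are involved is sketched below on simulated data. The `sim` data frame, its values, the seed, and the `eps` tolerance are all assumptions made for illustration and do not reproduce the Crunchbase data.

```r
set.seed(607)  # arbitrary seed so the simulation is repeatable

# Simulated stand-in for the real data: skewed funding plus a few huge outliers.
n <- 500
sim <- data.frame(
  funding = c(rlnorm(n - 5, meanlog = 14, sdlog = 2), rep(5e10, 5)),
  rounds  = rpois(n, lambda = 2) + 1
)
sim$status <- factor(c(rbinom(n - 5, size = 1, prob = 0.85), rep(0, 5)))

# Fit the same kind of model; warnings are suppressed only to keep output short.
fit <- suppressWarnings(
  glm(status ~ funding + rounds, data = sim, family = binomial(link = "logit"))
)

# Flag observations whose fitted probability is within a small tolerance
# of 0 or 1.
p <- fitted(fit)
eps <- 10 * .Machine$double.eps
extreme <- which(p < eps | p > 1 - eps)

print(range(p))
print(sim[extreme, ])
```

Applied to the real model, the same check on `fitted(logit)` would show whether the warning traces back to a handful of companies with very large funding values.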
logit.cv <- train(status ~ funding + rounds + investors,
data=df,
trControl = fold,
method = "glm",
family=binomial())
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
logit.cv
## Generalized Linear Model
##
## 3527 samples
## 3 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 3174, 3174, 3175, 3173, 3174, 3175, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8803492 0.0222645
The overall findings and analysis can be found in the final research paper for DSC607: Data Mining.