DSC607: KNN - Bifurcation on New York State Startups NYC

Project Overview

During the summer of 2019, I conducted research for a local venture capitalist who wanted to explore the migration of talent to and from communities across the country. Census-defined metropolitan statistical areas (MSAs) and core-based statistical areas (CBSAs) served as the baseline data for that research project. For this project, though, I will examine the likelihood of startup success by location and gauge the predictive power of such location-based decisions as a new economic downturn is upon us. More pointedly, the question to be examined is:

Where should a new company start? More importantly, from a data-driven standpoint, is the likelihood of a successful initial public offering (IPO) or acquisition stronger in New York City or anywhere in Upstate New York?

The hypothesis states that location in New York State and other key variables do not make a considerable difference in predicting the success of a new company. For classification purposes, and to understand whether certain key characteristics (variables) associated with traditional investing strengthen a company’s chance of success, the choice was made to use k-nearest neighbors (k-NN) instead of a decision tree. k-NN assigns a class label to each test instance by finding the training examples that are relatively similar to it in their attributes, which makes the approach more flexible (Tan et al., 2019, p. 208). Originally, the idea was to use a decision tree to gauge whether changes in IPO or acquisition would matter, but for investors preparing for a next round of investment, k-NN may offer more insight through testing.
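To make the nearest-neighbor idea concrete, the following is a minimal sketch of the vote that k-NN performs, written by hand rather than with a package. The data values and the helper name `knn_by_hand` are hypothetical and chosen only for illustration; they are not drawn from the Crunchbase set.

```r
# Toy training set (hypothetical values, not Crunchbase data)
train <- data.frame(
  funding_rounds = c(1, 1, 2, 5, 6, 4),
  investors      = c(1, 2, 1, 6, 7, 5)
)
train_status <- factor(c("operating", "operating", "operating",
                         "ipo_acquired", "ipo_acquired", "ipo_acquired"))

# Classify one new point by majority vote among its k nearest neighbors
# (hypothetical helper, for illustration only)
knn_by_hand <- function(train, labels, new_point, k = 3) {
  # Euclidean distance from new_point to every training row
  diffs <- sweep(as.matrix(train), 2, as.numeric(new_point))
  dists <- sqrt(rowSums(diffs^2))
  # Indices of the k closest training rows
  nearest <- order(dists)[seq_len(k)]
  # Count the labels among those neighbors and return the most common one
  votes <- table(labels[nearest])
  names(votes)[which.max(votes)]
}

new_company <- c(funding_rounds = 5, investors = 5)
pred <- knn_by_hand(train, train_status, new_company, k = 3)
print(pred)
```

Because the vote depends on raw distances, a variable measured in dollars would dominate one measured in rounds, which is why the features are rescaled before modeling.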

By running separate models instead of one combined model, the results may be more conclusive, and a declarative statement can be made about whether location really matters. To do this, this document applies k-NN to New York City companies, while another R Markdown document analyzes Upstate New York.

These data were extracted from Crunchbase at 2019-05-10 10:36:37 +0000. The set contains 3,819 observations and 17 variables and includes companies that received funding from 1991 to 2017. This includes 3,527 companies in New York City that received at least one round of investment and 292 in Upstate New York, primarily in metro areas (MSAs) like Albany, Buffalo, Rochester, and Syracuse. That number has increased since 2017 as the creation of technology companies has accelerated along with the desire to invest in these new endeavors. However, the decision was made to end the data set at 2017, simply because companies are rarely acquired or go public in less than three years.

k-NN Modeling for New York City Companies

df <- read.csv("C:/Users/bjorzech/Desktop/NYC_final1.csv",stringsAsFactors = FALSE)
head(df)

##         status funding_total_usd funding_rounds investors
## 1    operating             10000              1         1
## 2    operating             10000              1         1
## 3 ipo_acquired             10000              1         1
## 4    operating             10000              1         1
## 5    operating             10000              1         1
## 6 ipo_acquired             10000              1         1

tail(df)

##            status funding_total_usd funding_rounds investors
## 3522    operating         743000000              4         1
## 3523 ipo_acquired        1000000000              1         2
## 3524    operating        1002784331              6         1
## 3525 ipo_acquired        1200000000              1         1
## 3526 ipo_acquired        4715000000              2         1
## 3527 ipo_acquired       30079503000              5         1

For presentation purposes, only the head and tail of the data frame are shown. The variables were converted to numeric, except status, which was converted to a factor, and name, which did not factor into this model. For pre-processing purposes, the decision was also made to standardize and normalize the data. The following steps show the results of these methods.

Structure

str(df)

## 'data.frame':    3527 obs. of  4 variables:
##  $ status           : chr  "operating" "operating" "ipo_acquired" "operating" ...
##  $ funding_total_usd: num  10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 ...
##  $ funding_rounds   : int  1 1 1 1 1 1 2 1 1 1 ...
##  $ investors        : int  1 1 1 1 1 1 1 1 1 1 ...

shuffle_index <- sample(1:nrow(df))
head(shuffle_index)

## [1] 1691 2392 2150 1212 2377  184

df<- df[shuffle_index, ]
head(df)

##         status funding_total_usd funding_rounds investors
## 1691 operating           1200000              1         1
## 2392 operating           4000000              1         7
## 2150 operating           2599984              2         1
## 1212 operating            418750              2         1
## 2377 operating           4000000              3         1
## 184  operating             10000              1         1

df$funding_rounds <- as.numeric(df$funding_rounds)
df$investors <- as.numeric(df$investors)
df$funding_total_usd <- as.numeric(df$funding_total_usd)
df$status <- as.factor(df$status)
rounds <- df$funding_rounds
investors <- df$investors
funding <- df$funding_total_usd
status <- df$status
str(df)

## 'data.frame':    3527 obs. of  4 variables:
##  $ status           : Factor w/ 2 levels "ipo_acquired",..: 2 2 2 2 2 2 2 2 1 2 ...
##  $ funding_total_usd: num  1200000 4000000 2599984 418750 4000000 ...
##  $ funding_rounds   : num  1 1 2 2 3 1 1 2 2 3 ...
##  $ investors        : num  1 7 1 1 1 1 1 1 1 1 ...

Normalize

# Min-max normalization: rescale x to the range [0, 1]
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
df1 <- as.data.frame(lapply(df[2:4], normalize))
head(df1)

##   funding_total_usd funding_rounds investors
## 1      3.956184e-05     0.00000000      0.00
## 2      1.326485e-04     0.00000000      0.75
## 3      8.610464e-05     0.08333333      0.00
## 4      1.358899e-05     0.08333333      0.00
## 5      1.326485e-04     0.16666667      0.00
## 6      0.000000e+00     0.00000000      0.00

num.vars <- sapply(df, is.numeric)
df[num.vars] <- lapply(df[num.vars], scale)
myvars <- c("funding_total_usd", "investors", "funding_rounds")
df.subset <- df[myvars]
summary(df.subset)

##  funding_total_usd.V1    investors.V1      funding_rounds.V1 
##  Min.   :-0.04320     Min.   :-0.697622   Min.   :-0.637409  
##  1st Qu.:-0.04299     1st Qu.:-0.697622   1st Qu.:-0.637409  
##  Median :-0.04060     Median :-0.697622   Median :-0.637409  
##  Mean   : 0.00000     Mean   : 0.000000   Mean   : 0.000000  
##  3rd Qu.:-0.02962     3rd Qu.: 0.541931   3rd Qu.: 0.073578  
##  Max.   :58.39928     Max.   : 4.260589   Max.   : 7.894441
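The two rescalings above behave differently when one value is far larger than the rest, as the maximum of 58.4 standard deviations for funding_total_usd suggests. The sketch below compares them on a small hypothetical vector; the figures are illustrative assumptions, not rows from the data set.

```r
# Hypothetical funding totals in USD (illustrative only)
funding <- c(10000, 250000, 1200000, 4000000, 1000000000)

# Min-max normalization: rescales values to the interval [0, 1]
min_max <- (funding - min(funding)) / (max(funding) - min(funding))

# Z-score standardization: centers at mean 0 with standard deviation 1
z_score <- as.numeric(scale(funding))

# With one extreme value, most min-max results are squeezed near 0
print(round(min_max, 6))
print(round(z_score, 4))
```

In the code shown in this document, only the z-scored columns are carried into df.subset; the min-max result in df1 is displayed but not used afterward.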

Pre-processing and the necessary steps to arrive at these results can be found in the R Markdown documentation (for both Upstate New York and New York City), but it is worth noting in this paper that the training/testing distribution remained at roughly 80/20 (2,877 training and 650 test observations).

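# Seeds the random tie-breaking in knn(); note that the sample() shuffle
# shown earlier precedes this call
set.seed(123)
# Rows are already shuffled, so the first 650 serve as the test set
test <- 1:650
# Features for the training and test sets
train.df <- df.subset[-test,]
test.df <- df.subset[test,]
# Matching class labels (status) for the training and test sets
train.def <- df$status[-test]
test.def <- df$status[test]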
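For readers who want to reproduce a split of this kind, one option is to set the seed before any random call and draw the test rows directly. The sketch below does this on a hypothetical stand-in data frame named `toy`; the column values, the 20% share, and all object names are assumptions for illustration and do not replace the split used above.

```r
set.seed(123)  # set BEFORE any random call so the draw can be reproduced

# Hypothetical stand-in for the company data (not the Crunchbase file)
toy <- data.frame(
  funding_rounds = rpois(100, 2) + 1,
  investors      = rpois(100, 1) + 1,
  status         = factor(sample(c("operating", "ipo_acquired"), 100,
                                 replace = TRUE, prob = c(0.85, 0.15)))
)

n_test   <- round(0.2 * nrow(toy))     # hold out 20% of the rows
test_idx <- sample(nrow(toy), n_test)  # random row numbers for the test set

# Features and labels for each partition
train_x <- toy[-test_idx, c("funding_rounds", "investors")]
test_x  <- toy[test_idx,  c("funding_rounds", "investors")]
train_y <- toy$status[-test_idx]
test_y  <- toy$status[test_idx]

print(c(train = nrow(train_x), test = nrow(test_x)))
```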

For consistency, k is set to 1, 5, and 10, and the number of companies in each outcome is outlined below in the respective tables.

Results

library(class)
knn.1 <-  knn(train.df, test.df, train.def, k=1)
knn.5 <-  knn(train.df, test.df, train.def, k=5)
knn.10 <-  knn(train.df, test.df, train.def, k=10)
sum(test.def == knn.1)  # correct predictions out of 650 test cases

## [1] 539

sum(test.def == knn.5)

## [1] 557

sum(test.def == knn.10)

## [1] 565

## [1] 565

Similar to the earlier assignment, to best test the models, a stratified cross-validation with another test of k at 1, 5, and 10 is used so that positive and negative instances are sampled proportionally within each partition (Tan et al., 2019, p. 167). Additionally, similar to earlier models, increasing k increases the number of correct classifications (539, 557, and 565 of 650). However, the tables below show that the gain comes from predicting operating status more often, so the predictive power is not as conclusive for companies that reached an IPO or acquisition.
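A stratified cross-validation of the kind described can be sketched as follows. This is a minimal illustration on simulated data, assuming five folds; the objects `x`, `y`, `fold`, `n_folds`, and `cv_accuracy` are hypothetical names, and the number of folds is kept separate from the number of neighbors k to avoid confusing the two.

```r
library(class)
set.seed(123)

# Hypothetical two-feature data with an imbalanced outcome
# (simulated, not the Crunchbase file)
n <- 200
y <- factor(rep(c("ipo_acquired", "operating"), times = c(40, 160)))
x <- data.frame(
  f1 = rnorm(n, mean = ifelse(y == "ipo_acquired", 1, 0)),
  f2 = rnorm(n, mean = ifelse(y == "ipo_acquired", 1, 0))
)

# Assign fold ids within each class so every fold keeps the class ratio
n_folds <- 5
fold <- integer(n)
for (cls in levels(y)) {
  idx <- which(y == cls)
  fold[idx] <- sample(rep(seq_len(n_folds), length.out = length(idx)))
}

# Mean held-out accuracy across folds for a given number of neighbors k
cv_accuracy <- function(k) {
  acc <- sapply(seq_len(n_folds), function(f) {
    pred <- knn(x[fold != f, ], x[fold == f, ], y[fold != f], k = k)
    mean(pred == y[fold == f])
  })
  mean(acc)
}

results <- sapply(c(1, 5, 10), cv_accuracy)
names(results) <- c("k=1", "k=5", "k=10")
print(round(results, 3))
```

Each fold here holds 8 ipo_acquired and 32 operating cases, so every held-out partition preserves the 20/80 class ratio of the simulated data.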

Cross-Validation

table(knn.1 ,test.def)

##               test.def
## knn.1          ipo_acquired operating
##   ipo_acquired           14        38
##   operating              73       525

table(knn.5 ,test.def)

##               test.def
## knn.5          ipo_acquired operating
##   ipo_acquired            4        10
##   operating              83       553

table(knn.10 ,test.def)

##               test.def
## knn.10         ipo_acquired operating
##   ipo_acquired            2         0
##   operating              85       563
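The three tables above can be summarized per class. The sketch below re-enters the counts exactly as printed and computes overall accuracy and the share of each actual class that was predicted correctly (recall); the helper names `make_cm` and `summarize_cm` are hypothetical. For comparison, labeling every test company as operating would get 563 of 650 correct.

```r
# Confusion tables copied from the output above
# (rows = predicted class, columns = actual class)
make_cm <- function(v) {
  matrix(v, nrow = 2, byrow = TRUE,
         dimnames = list(predicted = c("ipo_acquired", "operating"),
                         actual    = c("ipo_acquired", "operating")))
}
cms <- list(k1  = make_cm(c(14, 38, 73, 525)),
            k5  = make_cm(c(4, 10, 83, 553)),
            k10 = make_cm(c(2, 0, 85, 563)))

# Accuracy plus recall for each actual class
summarize_cm <- function(cm) {
  c(accuracy         = sum(diag(cm)) / sum(cm),
    recall_ipo_acq   = cm["ipo_acquired", "ipo_acquired"] /
                       sum(cm[, "ipo_acquired"]),
    recall_operating = cm["operating", "operating"] /
                       sum(cm[, "operating"]))
}

metrics  <- t(sapply(cms, summarize_cm))
baseline <- 563 / 650  # accuracy of always predicting "operating"

print(round(metrics, 3))
print(round(baseline, 3))
```

Read this way, the rise in accuracy with k coincides with fewer ipo_acquired companies being identified (14, 4, and 2 of 87), which may matter more to an investor than the overall count.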

The most interesting conclusion comes from the Pearson correlations plotted with the “psych” package when comparing the groups. Even though the “psych” package in R focuses on psychometric applications that emphasize techniques for dimension reduction, including factor analysis, cluster analysis, and principal components analysis, it is also applicable to maximum likelihood factor analysis (Revelle, 2018).