DSC607: KNN - Bifurcation on New York State Startups Upstate

Project Overview

During the summer of 2019, I conducted research for a local venture capitalist who wanted to explore the migration of talent to and from communities across the country. Census-defined MSAs and CBSAs served as the baseline data for that research project. For this project, though, I examine the likelihood that a startup succeeds and gauge the predictive power of location-based decisions as a new economic downturn is upon us. More pointedly, the question examined is:

Where should a new company start and, more importantly from a data-driven decision standpoint, is the likelihood of reaching an initial public offering (IPO) or an acquisition stronger in New York City or at any point in Upstate New York?

The hypothesis states that location in New York State and other key variables do not make a considerable difference in terms of predicting the success of a new company. For classification purposes, and to understand whether certain key characteristics (or variables) associated with traditional investing lend themselves to strengthening a company’s success, the choice was made to use k-nearest neighbor (k-NN) instead of a decision tree. k-NN assigns a class label to each test instance by finding the training examples that are most similar to the attributes of that instance, which makes the approach more flexible (Tan et al., 2019, p. 208). Originally, the idea was to use a decision tree to gauge whether changes in IPO or acquisition would matter, but for investors preparing for a next round of investment, k-NN may offer more insight through testing.
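To make the nearest-neighbor idea concrete, the following is a minimal base R sketch of a single k-NN prediction. It is illustrative only: the function name knn_predict_one, the toy feature values, and the labels are all made up for this example, and it is not the implementation used by the class package later in this paper.

```r
# Toy k-NN classifier in base R (illustrative; not the class::knn implementation).
# train: numeric matrix of training rows; labels: factor of class labels;
# query: numeric vector with one value per column of train.
knn_predict_one <- function(train, labels, query, k = 3) {
  # Euclidean distance from the query to every training row
  dists <- sqrt(rowSums(sweep(train, 2, query)^2))
  # Labels of the k closest training rows
  nearest <- labels[order(dists)[seq_len(k)]]
  # Majority vote; ties resolve to the first level in table order
  votes <- table(nearest)
  names(votes)[which.max(votes)]
}

# Made-up, already-scaled feature values for six hypothetical companies
toy_train <- matrix(c(0.1, 0.1,
                      0.2, 0.1,
                      0.1, 0.3,
                      0.9, 0.8,
                      0.8, 0.9,
                      0.7, 0.7),
                    ncol = 2, byrow = TRUE)
toy_labels <- factor(c("operating", "operating", "operating",
                       "ipo_acquired", "ipo_acquired", "ipo_acquired"))

toy_pred_low  <- knn_predict_one(toy_train, toy_labels, c(0.15, 0.2), k = 3)
toy_pred_high <- knn_predict_one(toy_train, toy_labels, c(0.85, 0.8), k = 3)
```

In this toy setup the first query sits among the three "operating" rows and the second among the three "ipo_acquired" rows, so each prediction follows its three nearest neighbors.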

By running separate models, instead of one combined model, the results may be more conclusive and a declarative statement can be made about whether location really matters. To do this, this document applies k-NN to Upstate New York companies, while another R Markdown document analyzes New York City.

These data were extracted from Crunchbase on 2019-05-10 at 10:36:37 +0000. The set contains 3,819 observations and 17 variables and includes companies that received funding from 1991 to 2017. This includes 3,527 companies in New York City that received at least one round of investment and 292 in any area of Upstate New York, primarily in metro areas (MSAs) like Albany, Buffalo, Rochester, and Syracuse, although that number has increased since 2017 as the creation of technology companies has accelerated along with the desire to invest in these new endeavors. However, the decision was made to end this data set at 2017, simply because companies are rarely acquired or taken public in less than three years.

k-NN Modeling for Upstate New York Companies

df <- read.csv("C:/Users/bjorzech/Desktop/upstate_final1.csv",stringsAsFactors = FALSE)
head (df)

##      status funding_total_usd funding_rounds investors
## 1 operating             10000              1         1
## 2 operating             10000              1         1
## 3 operating             10000              1         1
## 4 operating             10000              1         1
## 5 operating             10000              1         1
## 6 operating             10000              1         1

tail (df)

##        status funding_total_usd funding_rounds investors
## 287 operating         175000000              1         6
## 288 operating             10000              1         7
## 289 operating             10000              1         7
## 290 operating             10000              1         7
## 291 operating         750000000              1         7
## 292 operating         240000000              1         9

For presentation purposes, only the head and tail of the data frame are shown. The variables were converted to numeric, except status, which was converted to a factor, and name, which did not factor into this model. For pre-processing purposes, the decision was also made to normalize and standardize the data: in the code below, the min-max normalized copy (df1) is shown for reference, while the z-score standardized columns (df.subset) are the ones passed to the model. The following steps show the results of these methods.

Structure

str(df)

## 'data.frame':    292 obs. of  4 variables:
##  $ status           : chr  "operating" "operating" "operating" "operating" ...
##  $ funding_total_usd: int  10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 ...
##  $ funding_rounds   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ investors        : int  1 1 1 1 1 1 1 1 1 1 ...

# Shuffle the row order; no seed is set before this call, so the order differs between runs
shuffle_index <- sample(1:nrow(df))
head(shuffle_index)

## [1] 107 195 136 247 151  73

df<- df[shuffle_index, ]
head(df)

##        status funding_total_usd funding_rounds investors
## 107 operating            300000              1         1
## 195 operating           4000000              1         1
## 136 operating            800000              1         1
## 247 operating           5500000              1         3
## 151 operating           1047500              2         1
## 73  operating             45000              1         1

df$funding_rounds <- as.numeric(df$funding_rounds)
df$investors <- as.numeric(df$investors)
df$funding_total_usd <- as.numeric(df$funding_total_usd)
df$status <- as.factor(df$status)
rounds <- df$funding_rounds
investors <- df$investors
funding <- df$funding_total_usd
status <- df$status
str(df)

## 'data.frame':    292 obs. of  4 variables:
##  $ status           : Factor w/ 2 levels "ipo_acquired",..: 2 2 2 2 2 2 2 2 1 2 ...
##  $ funding_total_usd: num  300000 4000000 800000 5500000 1047500 ...
##  $ funding_rounds   : num  1 1 1 1 2 1 1 1 1 1 ...
##  $ investors        : num  1 1 1 3 1 1 1 1 4 2 ...

Normalize

normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x))) }
df1 <- as.data.frame(lapply(df[2:4], normalize))
head(df1)

##   funding_total_usd funding_rounds investors
## 1      3.866718e-04      0.0000000      0.00
## 2      5.320071e-03      0.0000000      0.00
## 3      1.053347e-03      0.0000000      0.00
## 4      7.320098e-03      0.0000000      0.25
## 5      1.383352e-03      0.1428571      0.00
## 6      4.666729e-05      0.0000000      0.00

num.vars <- sapply(df, is.numeric)
df[num.vars] <- lapply(df[num.vars], scale)
myvars <- c("funding_total_usd", "investors", "funding_rounds")
df.subset <- df[myvars]
summary(df.subset)

##  funding_total_usd.V1    investors.V1      funding_rounds.V1 
##  Min.   :-0.186973    Min.   :-0.474889   Min.   :-0.493775  
##  1st Qu.:-0.186943    1st Qu.:-0.474889   1st Qu.:-0.493775  
##  Median :-0.172230    Median :-0.474889   Median :-0.493775  
##  Mean   : 0.000000    Mean   : 0.000000   Mean   : 0.000000  
##  3rd Qu.:-0.105832    3rd Qu.: 0.304143   3rd Qu.: 0.430471  
##  Max.   :15.064834    Max.   : 5.757367   Max.   : 5.975946
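The two rescaling approaches used above can be compared on a small vector. The sketch below uses made-up funding values (toy_funding is hypothetical, not drawn from the Crunchbase extract). It also shows one possible way to keep the result of scale() as a plain numeric vector: scale() returns a one-column matrix, which is why the summary above labels the columns with a ".V1" suffix.

```r
# Made-up funding totals in USD (illustrative only, not taken from the data set)
toy_funding <- c(10000, 300000, 800000, 4000000, 5500000)

# Min-max normalization: rescales values to the [0, 1] interval
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
toy_minmax <- min_max(toy_funding)

# z-score standardization: mean 0, standard deviation 1.
# scale() returns a one-column matrix; as.numeric() flattens it to a plain vector.
toy_z <- as.numeric(scale(toy_funding))

range(toy_minmax)  # 0 and 1
mean(toy_z)        # 0, up to floating-point error
sd(toy_z)          # 1, up to floating-point error
```

Either rescaling keeps a variable with a very large range, such as total funding, from dominating the distance calculation in k-NN.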

Pre-processing and the necessary steps to arrive at these results can be found in the R Markdown documentation (for both Upstate New York and New York City), but it’s worth noting in this paper that the training/testing split was kept at roughly 80/20: the first 56 of the 292 shuffled rows form the test set and the remaining 236 form the training set.

set.seed(123) 
test <- 1:56
train.df <- df.subset[-test,]
test.df <- df.subset[test,]
train.def <- df$status[-test]
test.def <- df$status[test]
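The split above takes the first 56 rows of the shuffled data frame as the test set. As a possible alternative, the sketch below draws roughly 20% of each class at random so that the test set keeps the same class proportions as the full data. It is a sketch under stated assumptions: toy_status, test_idx, and train_idx are hypothetical names, and the labels are made up rather than taken from the Upstate data.

```r
set.seed(123)  # fix the random draw so the split is reproducible

# Made-up, imbalanced class labels (illustrative only)
toy_status <- factor(c(rep("ipo_acquired", 20), rep("operating", 80)))

# For each class, sample about 20% of its row indices for the test set
test_idx <- unlist(
  lapply(split(seq_along(toy_status), toy_status), function(idx) {
    idx[sample.int(length(idx), round(0.2 * length(idx)))]
  }),
  use.names = FALSE
)
train_idx <- setdiff(seq_along(toy_status), test_idx)

table(toy_status[test_idx])   # 4 ipo_acquired, 16 operating
table(toy_status[train_idx])  # 16 ipo_acquired, 64 operating
```

Sampling within each class matters most when one class is rare, since a simple random draw can leave the test set with very few positive cases.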

For consistency, k is set at 1, 5, and 10. The number of test companies classified correctly at each value of k is reported below, and the breakdown by class is outlined within the respective table.

Results

library(class)
knn.1 <-  knn(train.df, test.df, train.def, k=1)
knn.5 <-  knn(train.df, test.df, train.def, k=5)
knn.10 <-  knn(train.df, test.df, train.def, k=10)
sum(test.def == knn.1)  # correct predictions out of the 56 test companies

## [1] 49

sum(test.def == knn.5)  # correct predictions out of the 56 test companies

## [1] 52

sum(test.def == knn.10)  # correct predictions out of the 56 test companies

## [1] 52

Similar to the earlier assignment, to best test the models, a stratified cross-validation with k again set at 1, 5, and 10 is used to sample the positive and negative instances in each partition (Tan et al., 2019, p. 167). As in earlier models, increasing k raises the number of correct classifications on the 56-company test set, from 49 (87.5%) at k = 1 to 52 (92.9%) at k = 5 and k = 10. However, the predictive power is less conclusive than those figures suggest: at k = 5 and k = 10, every test company is classified as remaining in operating status, so none of the four ipo_acquired companies is identified.
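One way a stratified cross-validation of this kind could be set up is sketched below. This is an assumption-laden illustration rather than the procedure from the R Markdown documentation: the data are synthetic (toy_x and toy_y are generated with rnorm), the number of folds (n_folds = 5) is an arbitrary choice, and cv_accuracy is a hypothetical helper. Only knn() comes from the class package used elsewhere in this paper.

```r
library(class)  # provides knn()

set.seed(123)

# Synthetic two-class data standing in for the scaled predictors (not real data)
n_pos <- 30
n_neg <- 90
toy_x <- rbind(matrix(rnorm(n_pos * 2, mean = 1.5), ncol = 2),
               matrix(rnorm(n_neg * 2, mean = 0),   ncol = 2))
toy_y <- factor(c(rep("ipo_acquired", n_pos), rep("operating", n_neg)))

# Assign fold numbers within each class so every fold keeps the class ratio
n_folds <- 5
fold_id <- integer(length(toy_y))
for (cls in levels(toy_y)) {
  rows <- which(toy_y == cls)
  fold_id[rows] <- sample(rep_len(seq_len(n_folds), length(rows)))
}

# Mean held-out accuracy across folds for a given number of neighbors k
cv_accuracy <- function(k) {
  mean(sapply(seq_len(n_folds), function(f) {
    held_out <- fold_id == f
    pred <- knn(toy_x[!held_out, , drop = FALSE],
                toy_x[held_out, , drop = FALSE],
                toy_y[!held_out], k = k)
    mean(pred == toy_y[held_out])
  }))
}

cv_results <- sapply(c(1, 5, 10), cv_accuracy)
names(cv_results) <- c("k=1", "k=5", "k=10")
cv_results
```

Note that the number of folds and the number of neighbors are separate settings, even though both are often written with the letter k.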

Cross-Validation

table(knn.1 ,test.def)

##               test.def
## knn.1          ipo_acquired operating
##   ipo_acquired            0         3
##   operating               4        49

table(knn.5 ,test.def)

##               test.def
## knn.5          ipo_acquired operating
##   ipo_acquired            0         0
##   operating               4        52

table(knn.10 ,test.def)

##               test.def
## knn.10         ipo_acquired operating
##   ipo_acquired            0         0
##   operating               4        52
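Because the test set is so imbalanced, overall accuracy alone can hide how each class is handled. The sketch below recomputes accuracy, sensitivity, and specificity from the counts in the k = 1 table above. Treating ipo_acquired as the positive class is an assumption made for this illustration, and the object name cm is hypothetical.

```r
# Counts copied from the k = 1 table above:
# rows are predictions, columns are the actual status
cm <- matrix(c(0, 4, 3, 49), nrow = 2,
             dimnames = list(predicted = c("ipo_acquired", "operating"),
                             actual    = c("ipo_acquired", "operating")))

# Share of all 56 test companies classified correctly
accuracy <- sum(diag(cm)) / sum(cm)                      # 49 / 56 = 0.875

# Share of actual ipo_acquired companies that were predicted as such
# (assumes ipo_acquired is the positive class)
sensitivity <- cm["ipo_acquired", "ipo_acquired"] /
  sum(cm[, "ipo_acquired"])                              # 0 / 4 = 0

# Share of actual operating companies that were predicted as such
specificity <- cm["operating", "operating"] /
  sum(cm[, "operating"])                                 # 49 / 52, about 0.942
```

The same calculation applied to the k = 5 and k = 10 tables gives a sensitivity of 0 as well, since neither model predicts ipo_acquired for any test company.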

The most interesting conclusion from plotting with the “psych” package can be found in the Pearson correlations when comparing the groups. Although the “psych” package in R focuses on psychometric applications, with an emphasis on dimension-reduction techniques including factor analysis, cluster analysis, and principal components analysis, it is also applicable for maximum likelihood factor analysis (Revelle, 2018).