Project Overview

During the summer of 2019, I conducted research for a local venture capitalist who wanted to explore the migration of talent to and from communities across the country. Census-defined MSAs and CBSAs served as the baseline data for that research project. For this project, though, I will examine the probability of startup success and gauge the predictive power of such location-based decisions as a new economic downturn takes hold. More pointedly, the questions to be examined include:

Where should a new company start? More importantly, from a data-driven standpoint, is the likelihood of a successful exit, whether an initial public offering (IPO) or an acquisition, stronger in New York City or in Upstate New York?

The hypothesis states that location within New York State and other key variables do not make a considerable difference in predicting the success of a new company. For more geographically targeted investment, are there clustering patterns that reveal shared characteristics and, as a result, point toward eventual success? The models initially proposed focused on region, but after additional unsupervised algorithms were examined, namely K-means and hierarchical clustering, this focus changed following the initial submission of the project overview. Although the data set includes a “region” variable, it also includes “city,” which will not be used in the early models. This is important to note because conventional wisdom holds that five or six MSAs attract most companies and capital (New York, Silicon Valley, Boston, Seattle, L.A., and Austin). No particular outcome is expected, and the results will most likely not show a considerable difference, although outliers may exist.

By running separate models for each region, rather than one combined model, the results may be more conclusive and a declarative statement can be made about whether location really matters. To do this, K-means is applied first to Upstate New York companies and then to New York City-based companies.

These data were extracted from Crunchbase at 2019-05-10 10:36:37 +0000. The set contains 3,819 observations and 17 variables and covers companies that received funding from 1991 to 2017: 3,527 in New York City that received at least one round of investment, and 292 in Upstate New York, primarily in metro areas (MSAs) such as Albany, Buffalo, Rochester, and Syracuse. The Upstate count has increased since 2017 as the creation of technology companies has accelerated, along with the desire to invest in these new endeavors. However, the data set was capped at 2017 simply because companies are rarely acquired or taken public within three years of founding.

K-means Modeling for Upstate New York Companies

df <- read.csv("C:/Users/bjorzech/Desktop/upstate_final.csv",stringsAsFactors = FALSE)
head(df)
##                                      name status funding rounds investors
## 1                Loan Servicing Solutions      0   10000      1         1
## 2                       Immco Diagnostics      0   10000      1         1
## 3                                 EMED Co      0   10000      1         3
## 4 BioMedical Technologies Solutions, Inc.      0  280000      2         1
## 5                               dotSyntax      0  500000      1         1
## 6                           Content Savvy      0  992250      3         1
tail(df)
##                      name status    funding rounds investors
## 287                 Ioxus      1   69000000      4         4
## 288               Rheonix      1   88724758      8         5
## 289 Kinex Pharmaceuticals      1  102149580      6         6
## 290           L100003 GCS      1  170000000      1         6
## 291               Chobani      1  750000000      1         7
## 292            Carestream      1 2400000000      1         9

For presentation purposes, only the head and tail of the data frame are shown. The variables were converted to numeric except name, which was converted to a factor. In this model, status was also converted to numeric, even though it is binary in nature. Additionally, for pre-processing purposes, the decision was made to standardize and normalize the data: the independent variable funding, given its considerably larger scale, would otherwise sway a distance-based algorithm. The following steps show the results of these methods.

# Convert the measurement columns to numeric and company names to a factor
df$status <- as.numeric(df$status)
df$funding <- as.numeric(df$funding)
df$rounds <- as.numeric(df$rounds)
df$investors <- as.numeric(df$investors)
df$name <- as.factor(df$name)
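
The conversions above change column types only; they do not rescale anything. A minimal sketch of the standardization step described, using base R's scale() on a separate copy (note that the cluster output shown below is computed from the unscaled columns), might look like this:

df_std <- df                                        # work on a copy; keep the original intact
df_std[, c("funding", "rounds", "investors")] <-
  scale(df[, c("funding", "rounds", "investors")])  # center to mean 0, scale to unit variance
summary(df_std$funding)                             # the rescaled funding variable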

After pre-processing, the decision was made to use the variables rounds and investors for this clustering model.

# Cluster on the two chosen variables: funding rounds and investor count
x1 <- cbind(df$rounds, df$investors)
head(x1)
##      [,1] [,2]
## [1,]    1    1
## [2,]    1    1
## [3,]    1    3
## [4,]    2    1
## [5,]    1    1
## [6,]    3    1
tail(x1)
##        [,1] [,2]
## [287,]    4    4
## [288,]    8    5
## [289,]    6    6
## [290,]    1    6
## [291,]    1    7
## [292,]    1    9
kcluster1 <- kmeans(x1, centers = 3)
kcluster1$centers
##       [,1]     [,2]
## 1 2.684211 1.526316
## 2 1.000000 1.235897
## 3 2.333333 5.380952
table(df$investors, kcluster1$cluster)
##    
##       1   2   3
##   1  51 163   0
##   2  10  18   0
##   3  15  14   0
##   4   0   0   8
##   5   0   0   4
##   6   0   0   4
##   7   0   0   4
##   9   0   0   1
table(df$rounds, kcluster1$cluster)
##    
##       1   2   3
##   1   0 195  14
##   2  42   0   1
##   3  24   0   0
##   4   5   0   2
##   5   2   0   1
##   6   3   0   2
##   8   0   0   1
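
The number of clusters is fixed at three throughout this report. As a sanity check that is not part of the original analysis, the total within-cluster sum of squares can be computed for a range of candidate k values and inspected for an elbow:

wss <- sapply(1:8, function(k) kmeans(x1, centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")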

Additionally, even though the “mclust” package offers visualizations similar to those of the “animation” package and incorporates all variables in its display (including “name,” which is unnecessary), the most conclusive findings appear in the bottom-right panel of the plot, which pairs the number of investors with the number of rounds. The relationship there is roughly linear, and the model-based approach, grounded in Bayesian principles (Scrucca et al., 2016), gives the panel a regression-like appearance while also supporting dimension reduction for analysis. What remained conclusive in a previous deliverable also appeared here, with both the extended Upstate New York data set used in this section and the New York City model below.

library(mclust)
## Package 'mclust' version 5.4.6
## Type 'citation("mclust")' for citing this R package in publications.
fit <- Mclust(df)  # model-based clustering on all columns, including 'name'
plot(fit)
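
Because “name” contributes nothing to the mixture model, a possible refinement (a sketch only, not the fit reported here) is to restrict Mclust to the two variables of interest and let its BIC criterion select the model:

fit2 <- Mclust(df[, c("rounds", "investors")])  # BIC-driven selection of model and cluster count
summary(fit2)                                   # reports the chosen model and number of components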

x2 <- cbind(df$rounds, df$investors)
kcluster2 <- kmeans(x2, centers = 3)
h_kcluster1 <- hclust(dist(x1), method = "ave")
h_kcluster1
## 
## Call:
## hclust(d = dist(x1), method = "ave")
## 
## Cluster method   : average 
## Distance         : euclidean 
## Number of objects: 292

Further, dendrograms were created with three and then four clusters each. A dendrogram displays both the cluster-subcluster relationships and the order in which the clusters were merged (Tan et al., 2019, p. 554). The hierarchical algorithm applied here is agglomerative: it starts with individual points as clusters and successively merges the two closest clusters until only one remains (Tan et al., 2019, p. 555). The findings are displayed in two steps because a sample, as opposed to a full dendrogram of all the data, is easier to analyze and interpret. In this model, a random sample of 50 companies is used.

plot(h_kcluster1, hang = -1)
rect.hclust(h_kcluster1, k=3)

index_partial <- sample(nrow(x1), 50)  # random, unseeded sample of 50 companies
sample <- x1[index_partial, ]
h_kcluster1 <- hclust(dist(sample), method = "ave")
h_kcluster1
## 
## Call:
## hclust(d = dist(sample), method = "ave")
## 
## Cluster method   : average 
## Distance         : euclidean 
## Number of objects: 50
plot(h_kcluster1, hang = -1)
rect.hclust(h_kcluster1, k=4)
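
The rectangles drawn by rect.hclust() are purely visual. If cluster memberships are needed for further analysis, they can be extracted by cutting the tree into the same number of groups, for example:

groups <- cutree(h_kcluster1, k = 4)  # label each of the 50 sampled companies with its cluster
table(groups)                         # cluster sizes within the sample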

K-means Modeling for New York City Companies

df <- read.csv("C:/Users/bjorzech/Desktop/NYC_final.csv",stringsAsFactors = FALSE)
head(df)
##                          name status  funding rounds investors
## 1         BratPackStyle, LLC.      1    10000      1         4
## 2                     waywire      0  1750000      1         1
## 3                         x+1      1 45000000      4         1
## 4 10,000PublicRelations, Inc.      1  6000000      1         3
## 5   10000 Secure Technologies      0  3400000      1         1
## 6                    1010data      0 35000000      1         1
tail(df)
##        name status funding rounds investors
## 3522   Zula      1 4000000      3         1
## 3523  Zumur      1  700000      1         5
## 3524  zurvu      0 1200000      1         3
## 3525   Zuse      1   10000      1         1
## 3526 Zuznow      1  650000      1         1
## 3527   Zype      1 3300000      2         5

For comparison purposes, the same K-means methodology is applied to the New York City-specific data set.

df$status <- as.numeric(df$status)
df$funding <- as.numeric(df$funding)
df$rounds <- as.numeric(df$rounds)
df$investors <- as.numeric(df$investors)
df$name <- as.factor(df$name)
x1 <- cbind(df$rounds, df$investors)
head(x1)
##      [,1] [,2]
## [1,]    1    4
## [2,]    1    1
## [3,]    4    1
## [4,]    1    3
## [5,]    1    1
## [6,]    1    1
tail(x1)
##         [,1] [,2]
## [3522,]    3    1
## [3523,]    1    5
## [3524,]    1    3
## [3525,]    1    1
## [3526,]    1    1
## [3527,]    2    5
kcluster1 <- kmeans(x1, centers = 3)
kcluster1$centers
##       [,1]     [,2]
## 1 4.084525 1.442133
## 2 1.252205 4.018519
## 3 1.310345 1.127463
table(df$investors, kcluster1$cluster)
##    
##        1    2    3
##   1  592    0 1417
##   2   47    0  207
##   3  110  642    0
##   4    7  135    0
##   5   13  211    0
##   6    0   32    0
##   7    0   79    0
##   8    0   24    0
##   9    0   11    0
table(df$rounds, kcluster1$cluster)
##     
##         1    2    3
##   1     0  889 1120
##   2     0  215  504
##   3   360   20    0
##   4   197    9    0
##   5    99    1    0
##   6    59    0    0
##   7    32    0    0
##   8    12    0    0
##   9     4    0    0
##   10    1    0    0
##   11    3    0    0
##   12    1    0    0
##   13    1    0    0
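
Since the hypothesis ultimately concerns company success, a natural extension, not included in the original output, is to cross-tabulate the binary status variable against the cluster assignments:

table(df$status, kcluster1$cluster)  # status (0/1) counts within each K-means cluster
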
library(mclust)
fit <- Mclust(df)
plot(fit)

x2 <- cbind(df$rounds, df$investors)
kcluster2 <- kmeans(x2, centers = 3)
h_kcluster1 <- hclust(dist(x1), method = "ave")
h_kcluster1
## 
## Call:
## hclust(d = dist(x1), method = "ave")
## 
## Cluster method   : average 
## Distance         : euclidean 
## Number of objects: 3527
plot(h_kcluster1, hang = -1)
rect.hclust(h_kcluster1, k=3)

As before, the findings are displayed in two steps because a sample, as opposed to a full dendrogram of all the data, is easier to analyze and interpret. In this model, a random sample of 75 companies is used.

index_partial <- sample(nrow(x1), 75)
sample <- x1[index_partial, ]
h_kcluster1 <- hclust(dist(sample), method = "ave")
h_kcluster1
## 
## Call:
## hclust(d = dist(sample), method = "ave")
## 
## Cluster method   : average 
## Distance         : euclidean 
## Number of objects: 75
plot(h_kcluster1, hang = -1)
rect.hclust(h_kcluster1, k=4)

The overall findings and analysis can be found in the final research paper for DSC607: Data Mining.