Project Overview

During the summer of 2019, I conducted research for a local venture capitalist who wanted to explore the migration of talent to and from communities across the country. Census-defined MSAs and CBSAs served as the baseline data for this research project.

A continuing theme has been the bifurcation of the New York State startup ecosystem: the gravitational pull and lure of larger investment potential in New York City, countered by enclaves across Upstate New York. Although the same data set has been used for previous deliverables, it is examined here through different clustering algorithms. There is no repetition in exercises or findings; rather, it has been interesting to explore different methods and techniques on a single data set, since a number of models and algorithms can be applied without duplication.

To close the programming portion of the course before the final project, this deliverable approaches the data through two unsupervised learning algorithms: K-means and principal component analysis (PCA). K-means is a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters (K), each represented by a centroid (Tan et al., 2019, p. 533). The results are also conveyed through hierarchical clustering, a collection of closely related techniques that repeatedly merge the two closest clusters until a single, all-encompassing cluster remains (Tan et al., 2019, p. 533). PCA is a linear algebra technique for continuous attributes that finds new attributes that 1) are linear combinations of the original attributes, 2) are orthogonal (perpendicular) to each other, and 3) capture the maximum amount of variation in the data (Tan et al., 2019, p. 58). The data set, the algorithm runs, and the search for an effective plot sequence in R are all displayed below.

Finally, it is worth noting that the exercise is clearly marked to show the shift from K-means to PCA, in order to delineate the differences in implementing an appropriate unsupervised learning algorithm.

These data were extracted from Crunchbase on 2019-05-10 at 10:36:37 +0000.

K-means

Select a Data Set/Load Into R

df <- read.csv("C:/Users/bjorzech/Desktop/upstate_test.csv",stringsAsFactors = FALSE)
head(df)
##                   name status funding rounds investors
## 1          24PageBooks      1   50000      1         1
## 2      Accipiter Radar      1 4700000      1         1
## 3 AccuMED Technologies      1   10000      1         1
## 4         ACV Auctions      1 1000000      1         1
## 5              Adenios      1  810000      1         1
## 6      Adirondack East      1   25000      1         1
tail(df)
##                      name status   funding rounds investors
## 287            Crystal IS      0  15237785      6         6
## 288           L100003 GCS      1 170000000      6         7
## 289          NanoDynamics      0  16125150      6         7
## 290   Syracuse University      0   6000000      6         7
## 291        The Bucket BBQ      0     10000      6         7
## 292 Kinex Pharmaceuticals      1 102149580      8         9

These data were extracted from a larger data set to create a sample of 292 records, each representing a unique startup company that received some form of investment within a 10-year period (an extension of previous data samples). Each record carries five attributes: name, investment rounds, number of investors, status (either IPO/acquired or operating), and level of funding.

Further, the attributes are a mix of nominal (categorical) and interval (numeric) types; the binary status variable can also be construed as ordinal (Tan et al., 2019, p. 30). For this exercise, four of the five variables were converted to numeric in R while the variable "name" was converted to a factor. This is just a pre-processing step, as only two variables will be used in the K-means portion of this deliverable. It is also worth noting that, for presentation purposes, only the head and tail of the 292-company sample are shown rather than the full data set, since the number of variables and the length of company names would make the R Markdown output run long.

df$status <- as.numeric(df$status)
df$funding <- as.numeric(df$funding)
df$rounds <- as.numeric(df$rounds)
df$investors <- as.numeric(df$investors)
df$name <- as.factor(df$name)

Perform Cluster Analysis of the Data

K-means is one of the oldest and most widely used clustering algorithms; its purpose is to find clusters, not to predict labels. It defines a prototype in terms of a centroid, usually the mean of a group of points, and is typically applied to objects in a continuous n-dimensional space (Tan et al., 2019, p. 535). For this exercise, instead of examining funding levels against the assumed profitability of a company (binary: IPO/acquisition or operating), the variables "rounds of investment" and "number of investors" are used to demonstrate the appropriate use of this unsupervised algorithm. These two features are used to identify the K clusters. As above, since 292 records are being examined, only the head and tail of the data set are shown for presentation purposes. The full data set can be provided, if needed.

x1 <- cbind(df$rounds, df$investors)
head(x1)
##      [,1] [,2]
## [1,]    1    1
## [2,]    1    1
## [3,]    1    1
## [4,]    1    1
## [5,]    1    1
## [6,]    1    1
tail(x1)
##        [,1] [,2]
## [287,]    6    6
## [288,]    6    7
## [289,]    6    7
## [290,]    6    7
## [291,]    6    7
## [292,]    8    9

The following two steps show the cluster assignment and a comparison of the clusters. The K-means run is followed by two tables with different outputs: number of investors and number of rounds. These simple functions lend some clarity to the density of investors relative to rounds of investment. In the first table, companies with one or two investors fall disproportionately into a single cluster, those with three to five investors form a second grouping, and the counts thin as more investors back a company, leaving the handful of heavily backed companies in a cluster of their own. A similar pattern develops in the table for number of rounds: first-round companies sit disproportionately in one cluster, rounds two and three still carry sizable counts, and the totals thin as a company takes on more rounds of investment.

Also, the choice of three clusters seems appropriate given the size of the data set and the overall objective of the deliverable, which is to show differences in clustering for comparison purposes rather than for labeling.
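
To support that choice, the following is a minimal elbow-method sketch, not part of the original run: it computes the total within-cluster sum of squares for a range of K values on the x1 matrix built above (the seed and nstart values are illustrative assumptions). A pronounced bend around K = 3 would corroborate the selection.

set.seed(42)  # illustrative seed, for reproducible centroids
# total within-cluster sum of squares for K = 1..6
wss <- sapply(1:6, function(k) kmeans(x1, centers = k, nstart = 10)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")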

kcluster1 <- kmeans(x1, centers = 3)
kcluster1$center
##       [,1]     [,2]
## 1 1.151639 1.131148
## 2 5.888889 6.777778
## 3 2.897436 3.410256
table(df$investors, kcluster1$cluster)
##    
##       1   2   3
##   1 214   0   0
##   2  28   0   0
##   3   2   0  27
##   4   0   0   8
##   5   0   0   4
##   6   0   4   0
##   7   0   4   0
##   9   0   1   0
table(df$rounds, kcluster1$cluster)
##    
##       1   2   3
##   1 209   0   1
##   2  33   0   9
##   3   2   0  22
##   4   0   0   7
##   5   0   3   0
##   6   0   5   0
##   8   0   1   0

The R package "animation" was then loaded to show the results, but a version error occurred with each run. The R package "mclust" therefore seemed an appropriate substitute for plotting the clustering: it fits Gaussian mixture models with a range of covariance structures and provides displays for visualizing the estimation results (Scrucca et al., 2016). The following progression shows the changes in clustering when evaluating K-means from a plotting perspective.

library(mclust)
## Package 'mclust' version 5.4.6
## Type 'citation("mclust")' for citing this R package in publications.
fit <- Mclust(df)
plot(fit)

Even though the "mclust" package offers visualizations similar to "animation" and includes every variable in the display (including "name," which is unnecessary), the most conclusive findings sit in the bottom-right panels pairing number of investors with rounds. The results are close to linear and also illustrate dimension reduction for analysis. The package's model selection rests on Bayesian principles (Scrucca et al., 2016), and the pairing of these two variables takes on a regression-like appearance.
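
As a check on that display, the following is a hedged sketch, not part of the original run, that refits the mixture model on the numeric attributes only, dropping the unnecessary "name" factor; the column selection is the only assumption.

# refit on the four numeric attributes, excluding the company name
fit2 <- Mclust(df[, c("status", "funding", "rounds", "investors")])
plot(fit2, what = "classification")  # pairs display without the name variable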

Running Hierarchical Clustering

Although the graphical representation above offers some insight, plotting these data via a hierarchical clustering algorithm is also appropriate: these techniques are variations of a single approach, starting with individual points as clusters and successively merging the two closest clusters until only one remains (Tan et al., 2019, p. 555). The decision to display the findings in two steps is deliberate, as a sample, rather than a full dendrogram of all the data, is easier to analyze and more conclusive. A dendrogram displays both the cluster-subcluster relationships and the order in which the clusters were merged (Tan et al., 2019, p. 554).

The progression, and the gain in clarity, is visible across both the full and sample representations. The choice of three clusters is also appropriate, as the sample dendrogram shows three clearly marked clusters.

h_kcluster1 <- hclust(dist(x1), method = "average")
h_kcluster1
## 
## Call:
## hclust(d = dist(x1), method = "average")
## 
## Cluster method   : average 
## Distance         : euclidean 
## Number of objects: 292
plot(h_kcluster1, hang = -1)
rect.hclust(h_kcluster1, k = 3)

index_partial <- sample(nrow(x1), 50)
x1_sample <- x1[index_partial, ]
h_kcluster2 <- hclust(dist(x1_sample), method = "average")
h_kcluster2
## 
## Call:
## hclust(d = dist(x1_sample), method = "average")
## 
## Cluster method   : average 
## Distance         : euclidean 
## Number of objects: 50
plot(h_kcluster2, hang = -1)
rect.hclust(h_kcluster2, k = 3)
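
Cluster memberships can also be read off the tree directly; the following is a minimal sketch that cuts the full dendrogram built earlier (h_kcluster1) into the same three boxed groups and tabulates their sizes.

groups <- cutree(h_kcluster1, k = 3)  # cut the full tree into three clusters
table(groups)  # cluster sizes across all 292 companies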

Interpret and Discuss the Results

As an extension of the above analysis, specifically the table findings, clustering, and hierarchical approach, further discussion is needed to address the algorithm's strengths and weaknesses, which are consistent with the text (Tan et al., 2019, p. 565). In terms of creating a taxonomy, the hierarchical approach seems most appropriate for this data set, and presumably for the final deliverable, because it simply produces better-quality clusters. Although the cluster plot and its findings are consistent with typical investment trends, the taxonomy offers more definitive evidence of which points align with user-defined parameters. After all, K-means is best used to find clusters, not labels. To close this section, it is also worth noting that since the sample hierarchy appears more conclusive, noisy data can be addressed to some degree by first partially clustering the data using K-means (Tan et al., 2019, p. 565); a sketch of that hybrid approach follows.
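
The following is a sketch of that hybrid approach, under stated assumptions: K-means first summarizes the points into a modest number of centroids (eight here is an illustrative choice, and it must not exceed the number of distinct rounds/investors pairs), and hierarchical clustering is then run on the centroids rather than the raw points.

set.seed(42)  # illustrative seed
pre <- kmeans(x1, centers = 8, nstart = 10)  # partial clustering to absorb noise
h_hybrid <- hclust(dist(pre$centers), method = "average")
plot(h_hybrid, hang = -1)
rect.hclust(h_hybrid, k = 3)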

Principal Component Analysis (PCA)

Select a Data Set/Load Into R

Similar to the approach above, the variables were converted to numeric in R, except for the company name. This is important to note because these variables will be incorporated into the final project, whereas in previous deliverables they were not included.

As before, for presentation purposes only the head and tail of the 292-company sample are shown rather than the full data set, given the number of variables and the length of company names. The full data set can be provided, if needed.

pca1 <- read.csv("C:/Users/bjorzech/Desktop/upstate_test.csv",stringsAsFactors = FALSE)
head(pca1)
##                   name status funding rounds investors
## 1          24PageBooks      1   50000      1         1
## 2      Accipiter Radar      1 4700000      1         1
## 3 AccuMED Technologies      1   10000      1         1
## 4         ACV Auctions      1 1000000      1         1
## 5              Adenios      1  810000      1         1
## 6      Adirondack East      1   25000      1         1
tail(pca1)
##                      name status   funding rounds investors
## 287            Crystal IS      0  15237785      6         6
## 288           L100003 GCS      1 170000000      6         7
## 289          NanoDynamics      0  16125150      6         7
## 290   Syracuse University      0   6000000      6         7
## 291        The Bucket BBQ      0     10000      6         7
## 292 Kinex Pharmaceuticals      1 102149580      8         9
str(pca1)
## 'data.frame':    292 obs. of  5 variables:
##  $ name     : chr  "24PageBooks" "Accipiter Radar" "AccuMED Technologies" "ACV Auctions" ...
##  $ status   : int  1 1 1 1 1 1 1 1 1 0 ...
##  $ funding  : num  5.0e+04 4.7e+06 1.0e+04 1.0e+06 8.1e+05 2.5e+04 2.0e+04 1.4e+07 8.5e+06 1.0e+04 ...
##  $ rounds   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ investors: int  1 1 1 1 1 1 1 1 1 1 ...
pca1$status <- as.numeric(pca1$status)
pca1$funding <- as.numeric(pca1$funding)
pca1$rounds <- as.numeric(pca1$rounds)
pca1$investors <- as.numeric(pca1$investors)
pca1$name <- as.factor(pca1$name)
str(pca1)
## 'data.frame':    292 obs. of  5 variables:
##  $ name     : Factor w/ 292 levels "1stGig.com","22nd Century Group",..: 3 4 5 7 8 9 10 11 12 13 ...
##  $ status   : num  1 1 1 1 1 1 1 1 1 0 ...
##  $ funding  : num  5.0e+04 4.7e+06 1.0e+04 1.0e+06 8.1e+05 2.5e+04 2.0e+04 1.4e+07 8.5e+06 1.0e+04 ...
##  $ rounds   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ investors: num  1 1 1 1 1 1 1 1 1 1 ...
summary(pca1)
##                    name         status          funding             rounds     
##  1stGig.com          :  1   Min.   :0.0000   Min.   :1.00e+04   Min.   :1.000  
##  22nd Century Group  :  1   1st Qu.:1.0000   1st Qu.:1.15e+04   1st Qu.:1.000  
##  24PageBooks         :  1   Median :1.0000   Median :7.35e+05   Median :1.000  
##  Accipiter Radar     :  1   Mean   :0.9178   Mean   :1.66e+07   Mean   :1.531  
##  AccuMED Technologies:  1   3rd Qu.:1.0000   3rd Qu.:4.00e+06   3rd Qu.:2.000  
##  Action Audio Apps   :  1   Max.   :1.0000   Max.   :2.40e+09   Max.   :8.000  
##  (Other)             :286                                                      
##    investors   
##  Min.   :1.00  
##  1st Qu.:1.00  
##  Median :1.00  
##  Mean   :1.61  
##  3rd Qu.:2.00  
##  Max.   :9.00  
## 

Further, it is important to recognize the summary findings and the discrepancy in scale between the variables. Findings can be inconclusive when the variable with the largest values skews the results; in this data set, the overall level of funding sits on a vastly different scale than the binary status variable (IPO/acquisition or operating) or count variables like number of rounds or investors. The pre-processing step of normalizing the data is therefore essential: if different variables are to be used together for clustering, such a transformation is often necessary to avoid having a variable with large values dominate the results of the analysis (Tan et al., 2019, p. 71).

The following processes in R were conducted to normalize these data.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(tidyverse)
## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --
## v tibble  3.0.1     v dplyr   0.8.5
## v tidyr   1.0.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## v purrr   0.3.4
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x purrr::lift()   masks caret::lift()
## x purrr::map()    masks mclust::map()
preproc1 <- preProcess(pca1[, 2:5], method = c("center", "scale"))
pca <- predict(preproc1, pca1[, 2:5])
summary(pca)
##      status           funding             rounds          investors      
##  Min.   :-3.3359   Min.   :-0.11229   Min.   :-0.4906   Min.   :-0.4749  
##  1st Qu.: 0.2987   1st Qu.:-0.11228   1st Qu.:-0.4906   1st Qu.:-0.4749  
##  Median : 0.2987   Median :-0.10738   Median :-0.4906   Median :-0.4749  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2987   3rd Qu.:-0.08528   3rd Qu.: 0.4336   3rd Qu.: 0.3041  
##  Max.   : 0.2987   Max.   :16.13412   Max.   : 5.9785   Max.   : 5.7574
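
For reference, the same center-and-scale transformation can be reproduced without caret; the following is a minimal base R sketch, assuming pca1 as loaded above, and its summary should mirror the caret result shown here.

pca_base <- as.data.frame(scale(pca1[, 2:5]))  # center to mean 0, scale to sd 1
summary(pca_base)  # should match the preProcess output above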

Perform Cluster Analysis of the Data

Now that the data are normalized, the correlation matrix, fit, and loadings below are summarized with key findings, similar to the tables displayed in the K-means model. Four variables enter the analysis: status, funding, rounds, and investors. In the correlation matrix, the strongest relationship is between "rounds" and "investors" (roughly 0.94), an interesting finding where past models signaled funding level as the strongest. For the fit, the proportion of variance is the most telling indicator: the first component alone explains about half the variance, with the standard deviations serving only as a starting point. Overall, the loadings are broadly consistent with the fit, which will be important when these findings are plotted.

pca_na_omit <- drop_na(pca)
pca.active <- pca_na_omit[, 1:4]
cor(pca.active)
##                status    funding      rounds  investors
## status     1.00000000 0.02896838 -0.18769201 -0.2176660
## funding    0.02896838 1.00000000  0.01630857  0.1776160
## rounds    -0.18769201 0.01630857  1.00000000  0.9438836
## investors -0.21766597 0.17761602  0.94388364  1.0000000
fit <- princomp(na.omit(pca.active), cor = TRUE)
summary(fit)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4
## Standard deviation     1.4280068 1.0160685 0.9417057 0.20394066
## Proportion of Variance 0.5098008 0.2580988 0.2217024 0.01039795
## Cumulative Proportion  0.5098008 0.7678996 0.9896021 1.00000000
loadings(fit)
## 
## Loadings:
##           Comp.1 Comp.2 Comp.3 Comp.4
## status     0.261  0.493  0.829       
## funding   -0.120  0.865 -0.472 -0.121
## rounds    -0.670         0.263 -0.692
## investors -0.684         0.142  0.711
## 
##                Comp.1 Comp.2 Comp.3 Comp.4
## SS loadings      1.00   1.00   1.00   1.00
## Proportion Var   0.25   0.25   0.25   0.25
## Cumulative Var   0.25   0.50   0.75   1.00

When plotting the findings, the first visualization suggests that one principal component dominates the data: there is a slight drop from components 2 to 3 before a full drop at component 4.

Further, after a long and frustrating search for an appropriate way to display the percentage of variance, a scree plot seemed the most appropriate for PCA; it is a retention plot for component and factor selection (Cattell, 1966). In this model, the first dimension, consistent with the components above, shows 51% explained variance: not overwhelming, but decisive in comparison to the other dimensions.

plot(fit, type = "lines")

library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(corrplot)
## corrplot 0.84 loaded
pca_prcomp <- prcomp(na.omit(pca.active), scale. = TRUE)
fviz_eig(pca_prcomp, addlabels = TRUE, ylim = c(0, 60))

What follows is a series of findings and plots based on the principal component (prcomp) function in R. Similar to the scree plot above, these add context and depth to a PCA. Eigenvalues, for instance, are a special set of scalars associated with a linear system of equations (Gregory, 1953). When running these models, it is interesting to note that the quality of representation (cos2) is highest for investors and rounds on the first dimension, while funding dominates the second, rather than funding and status leading overall. Even though the level of funding still scores well in representation, it is not at the same level as investors and rounds.

eig.val <- get_eigenvalue(pca_prcomp)
eig.val
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.03920333        50.980083                    50.98008
## Dim.2 1.03239524        25.809881                    76.78996
## Dim.3 0.88680963        22.170241                    98.96021
## Dim.4 0.04159179         1.039795                   100.00000
res.var <- get_pca_var(pca_prcomp)
res.var$coord
##                Dim.1       Dim.2      Dim.3        Dim.4
## status    -0.3727385  0.50112452  0.7809635 -0.006022309
## funding    0.1716029  0.87874913 -0.4446852  0.024646657
## rounds     0.9573207 -0.04695291  0.2477936  0.141176997
## investors  0.9769130  0.08285436  0.1332632 -0.144972831
res.var$contrib
##               Dim.1      Dim.2     Dim.3       Dim.4
## status     6.813150 24.3245775 68.775072  0.08720038
## funding    1.444071 74.7969391 22.298466  1.46052304
## rounds    44.942200  0.2135399  6.923881 47.92037920
## investors 46.800578  0.6649435  2.002581 50.53189738
res.var$cos2
##                Dim.1       Dim.2      Dim.3        Dim.4
## status    0.13893399 0.251125781 0.60990396 0.0000362682
## funding   0.02944755 0.772200041 0.19774495 0.0006074577
## rounds    0.91646283 0.002204576 0.06140164 0.0199309444
## investors 0.95435896 0.006864845 0.01775908 0.0210171217

The following plots show different representations of the analysis above. For clarity, the scree plot remains the most conclusive, while the subsequent plots offer more dimensionality. The plot marked "Variables - PCA" gives another perspective beyond the scree plot, and the "Individuals - PCA" plot, although complete in its representation of all 292 companies, is only legible on larger screens; it is nevertheless conclusive in showing, by color, which clusters are appropriate. Although these representations are holistic, the scree or variables plots seem most appropriate for presentation, but the additional plots are worth including.

corrplot(res.var$cos2, is.corr=FALSE)

fviz_cos2(pca_prcomp, choice = "var", axes = 1:2)

fviz_pca_var(pca_prcomp, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE
)

fviz_pca_ind(pca_prcomp,
             col.ind = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE
)

Interpret and Discuss the Results

By running a PCA on Upstate New York startup companies, the above results and plots at times build on findings produced by the K-means algorithm, but they also create new attributes that are linear combinations of the original attributes, as defined by Tan et al. (2019). This was fascinating simply because a conscious decision was made to analyze the variables "rounds" and "investors" with K-means, with satisfactory results, while PCA created new dimensions from the same attributes and found stronger results by capturing the maximum amount of variation in the data (Tan et al., 2019, p. 58). This could well be due to consistency in the data, i.e., similar numbers of investors or rounds, as opposed to the extreme variance within the funding variable. Although conclusive, a larger data set with more variance may produce a different result in the future; one natural follow-up, sketched below, is to cluster on the principal component scores themselves.
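
The following is a sketch of that follow-up, under the assumption that the K = 3 choice from the K-means section still applies: the scores on the first two principal components are themselves clustered, combining the two techniques in one view.

set.seed(42)  # illustrative seed
scores12 <- pca_prcomp$x[, 1:2]  # scores on the first two components
k_on_pcs <- kmeans(scores12, centers = 3, nstart = 10)
plot(scores12, col = k_on_pcs$cluster, pch = 19,
     xlab = "Dim.1", ylab = "Dim.2")
points(k_on_pcs$centers, pch = 4, cex = 2)  # mark the centroids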

References

Cattell, R. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2), 245–276. https://doi.org/10.1207/s15327906mbr0102_10

Gregory, R. T. (1953). Mathematical Tables and Other Aids to Computation, 7(44), 215–220.

Scrucca, L., Fop, M., Murphy, T. B., & Raftery, A. E. (2016). mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8(1), 289–317. https://doi.org/10.32614/RJ-2016-021

Tan, P.-N., Steinbach, M., Karpatne, A., & Kumar, V. (2019). Introduction to data mining. New York, NY: Pearson Education, Inc.