Project Overview

During the summer of 2019, I conducted research for a local venture capitalist who wanted to explore the migration of talent to and from communities across the country. Census-defined MSAs and CBSAs served as the baseline data for this research project. Last week’s assignment focused on Boulder, Colorado, and the likelihood of profitability. This week we’ll examine another sample from the dataset that is more regional in scope and again centers on classification.

This assignment applies two classification methods, decision trees (DT) and k-Nearest Neighbor (KNN), to companies that have received some level of investment across upstate New York, specifically in four major regions: Buffalo, Rochester, Syracuse, and Albany.

In general, funders realize full investment potential during an acquisition or initial public offering (IPO). Depending on the terms of the agreement, some investors also settle for an equity stake and earn when the company earns – or becomes profitable. Certain regions of the country lend themselves to higher probability levels for acquisition or IPOs, and the initial hypothesis focused on a diminished likelihood of these deals in the established upstate New York region.

For this exploration, though, I wanted to examine the probability of either an acquisition or IPO based on a number of variables for classification purposes – level of funding, number of investors, number of investment rounds, and whether the company is still operating and accepting investment or has been acquired or exercised an IPO.

These data were extracted from Crunchbase at 2019-05-10 10:36:37 +0000. The file loaded below contains 168 observations and five variables.

Select a Data Set/Load Into R

df <- read.csv("C:/Users/bjorzech/Desktop/DSC607_Upstate_W3.csv",stringsAsFactors = FALSE)
(df)
##     Company.Number status funding_total_usd funding_rounds investors
## 1               53      0              5000              1         1
## 2               14      0             10000              1         1
## 3               16      0             10000              1         1
## 4               20      0             10000              1         1
## 5               28      0             10000              1         1
## 6               32      1             10000              1         1
## 7               40      0             10000              1         1
## 8               42      0             10000              1         1
## 9               47      1             10000              1         1
## 10              49      1             10000              1         1
## 11              50      1             10000              1         1
## 12              52      0             10000              1         1
## 13              75      1             10000              1         1
## 14             109      1             10000              1         1
## 15             113      0             10000              1         1
## 16             120      0             10000              1         1
## 17             123      1             10000              1         1
## 18             124      1             10000              1         1
## 19             134      0             10000              1         1
## 20             141      0             10000              1         1
## 21             149      1             10000              1         1
## 22             152      1             10000              1         1
## 23             160      1             10000              1         1
## 24             131      0             18000              1         1
## 25              10      0             45000              1         1
## 26              60      0             50000              1         1
## 27             145      0            150000              1         1
## 28              38      0            300000              1         1
## 29              82      0            505000              1         1
## 30              72      0            600000              1         1
## 31              76      0            750000              1         1
## 32              15      1           1000000              1         1
## 33             108      0           1090000              1         1
## 34              71      0           1200000              1         1
## 35             156      0           1500000              1         1
## 36               5      0           4000000              1         1
## 37             151      0           4000000              1         1
## 38             103      0           5500000              1         1
## 39              51      0           6000000              1         1
## 40             162      1           8300000              1         1
## 41             168      0             10000              2         1
## 42               6      0           3000000              2         1
## 43             130      0           6000000              2         1
## 44              79      0          12862048              6         1
## 45              70      0              5000              1         2
## 46               1      0             10000              1         2
## 47             105      0             50000              1         2
## 48               4      1            100000              1         2
## 49              67      0            100000              1         2
## 50              89      0            150000              1         2
## 51              43      1            200000              1         2
## 52              57      0            227979              1         2
## 53              65      0            250000              1         2
## 54              12      0            300000              1         2
## 55               7      0            423000              1         2
## 56             147      0            750000              1         2
## 57             166      0            750000              1         2
## 58               2      0           1000000              1         2
## 59             146      1           1000000              1         2
## 60             115      0           1300000              1         2
## 61              64      1           1500000              1         2
## 62             142      0           1500000              1         2
## 63              94      0           5200000              1         2
## 64              63      1          15000000              1         2
## 65             153      0           1450000              2         2
## 66             164      1           2450000              2         2
## 67             112      0           2666404              2         2
## 68              41      0          17000000              2         2
## 69             119      0            125000              3         2
## 70              66      1            950000              3         2
## 71              96      0          26000000              3         2
## 72             122      0              5000              1         3
## 73              25      0              9000              1         3
## 74             129      0             18000              1         3
## 75             111      1             40000              1         3
## 76              18      0             50000              1         3
## 77              98      0             50000              1         3
## 78             138      1             50000              1         3
## 79              97      0             55000              1         3
## 80             125      1            182500              1         3
## 81              55      1            200000              1         3
## 82              17      0            215000              1         3
## 83              21      0            250000              1         3
## 84              36      1            450000              1         3
## 85              77      0            500000              1         3
## 86              80      0            500000              1         3
## 87             110      0            610000              1         3
## 88             132      1           1500004              1         3
## 89              29      0           1550000              1         3
## 90             101      0           3700000              1         3
## 91              91      0           4000000              1         3
## 92             154      1           4000000              1         3
## 93              62      0           6000000              1         3
## 94             139      0           6000000              1         3
## 95              54      0           7175000              1         3
## 96              87      0           9700000              1         3
## 97             155      1          10000000              1         3
## 98              58      0          30000000              1         3
## 99             106      1         170000000              1         3
## 100            114      1            269344              2         3
## 101            163      1            429997              2         3
## 102            116      1           2750000              2         3
## 103             85      1           4550000              2         3
## 104            102      0           6800000              2         3
## 105            161      0          14000000              2         3
## 106            150      0           2056426              3         3
## 107             84      0           4000000              3         3
## 108             44      0           7500000              3         3
## 109             48      1           8500000              3         3
## 110             35      0          26720000              4         3
## 111            144      0          69000000              4         3
## 112              8      1          66426557              6         3
## 113             26      0         102149580              6         3
## 114            143      1              2000              1         4
## 115            117      1             10000              1         4
## 116            126      0             15000              1         4
## 117            167      0            100000              1         4
## 118            121      0            500009              1         4
## 119             86      1            520000              1         4
## 120            107      1            783800              1         4
## 121             13      1           1000000              1         4
## 122             99      1           1500000              1         4
## 123              3      0           3000000              1         4
## 124             83      0           3600000              1         4
## 125             73      0           3725000              1         4
## 126             78      1           4200000              1         4
## 127            100      1           4500000              1         4
## 128             37      0           9517008              1         4
## 129             88      1          17190245              1         4
## 130             24      0          21000000              1         4
## 131             39      0         175000000              1         4
## 132            140      1            500000              2         4
## 133             22      0            905421              2         4
## 134              9      0           1600000              2         4
## 135             59      0           3097222              2         4
## 136             69      1           3750000              2         4
## 137            128      0           4018000              2         4
## 138             30      0          16125150              2         4
## 139             27      0          30800000              2         4
## 140            137      1             10000              3         4
## 141             19      0            270000              3         4
## 142             56      0            275000              3         4
## 143            148      1           2950030              3         4
## 144             31      0           5300000              3         4
## 145             92      1           5922024              4         4
## 146             81      0           8595974              4         4
## 147            157      1          56000000              4         4
## 148             90      1            500000              5         4
## 149             23      1             10000              1         5
## 150            158      1             10000              1         5
## 151            127      0             12000              1         5
## 152             11      1            437500              1         5
## 153             45      1           1000000              1         5
## 154            135      1           1000000              1         5
## 155            133      1           1785000              1         5
## 156            136      1           5240000              1         5
## 157             93      1          10000000              1         5
## 158             95      1          17000000              1         5
## 159             68      0        2400000000              1         5
## 160            118      1             30000              2         5
## 161             33      0           2000000              2         5
## 162             74      0           2400000              2         5
## 163            165      1           3972414              2         5
## 164             46      0           6411000              3         5
## 165             34      0          11307429              3         5
## 166            159      1          15237785              3         5
## 167            104      1          28800000              4         5
## 168             61      0          23466222              5         5

Run a DT on the Data and Variables of Your Choosing

To run an accurate DT, I approached the data from a preprocessing perspective and used dimensionality reduction, subscribing to the belief that many data mining algorithms work better when the dimensionality (the number of attributes in the data) is lower (Tan et al., 2019, p. 57). In this case, the 168 observations and five variables seemed appropriate. One of the five, the company number, appears in the basic summary but is not used in the DT analysis because it is an irrelevant feature (Tan et al., 2019, p. 58). Finally, the “status” attribute was transformed via binarization (Tan et al., 2019, p. 68). It lends itself nicely to classification and remains binary, with acquisition/IPO coded as “yes” (1) and still operating coded as “no” (0).
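The two preprocessing steps described above, binarizing the outcome and dropping the irrelevant identifier, can be sketched in a few lines of base R. This is only an illustration: the toy rows and the `status_label` text column are hypothetical, since the CSV used in this report already stores status as 0/1.

```r
# Toy stand-in for the Crunchbase extract. The values and the
# status_label column are hypothetical, for illustration only.
raw <- data.frame(
  Company.Number    = c(53, 14, 32),
  status_label      = c("operating", "operating", "acquired"),
  funding_total_usd = c(5000, 10000, 10000),
  funding_rounds    = c(1, 1, 1),
  investors         = c(1, 1, 1),
  stringsAsFactors  = FALSE
)

# Binarization: acquisition/IPO -> 1, still operating -> 0.
raw$status <- ifelse(raw$status_label %in% c("acquired", "ipo"), 1L, 0L)

# Feature selection: keep the target and the three predictors,
# dropping the identifier and the original text label.
model_df <- raw[, c("status", "funding_total_usd", "funding_rounds", "investors")]
str(model_df)
```

Dropping the identifier before modeling keeps a tree from splitting on a column that carries no information about the outcome.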

Structure

str(df)
## 'data.frame':    168 obs. of  5 variables:
##  $ Company.Number   : int  53 14 16 20 28 32 40 42 47 49 ...
##  $ status           : int  0 0 0 0 0 1 0 0 1 1 ...
##  $ funding_total_usd: num  5000 10000 10000 10000 10000 10000 10000 10000 10000 10000 ...
##  $ funding_rounds   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ investors        : int  1 1 1 1 1 1 1 1 1 1 ...

After running the structure overview the first time, I found that “funding_total_usd” is stored as a numeric (double) value, so it was converted to an integer for consistency with the other columns. One side effect is visible in the output below: the largest value, $2.4 billion (observation 159), exceeds R’s 32-bit integer limit of 2,147,483,647 and is coerced to NA. Additionally, “status” is read as an integer. It was converted to a factor because the DT model uses it as the dependent variable, while the other three variables (funding, rounds, and investors) serve as the independent attributes. These are the variables needed to answer the guiding question: “What is the probability of an upstate New York company being acquired or exercising an IPO based on funding, rounds, and number of investors?”

A tree classifier will allow us to distinguish between the two classes, operating or acquisition/IPO (Tan et al., 2019, p. 120). More specifically, the tree is grown in the manner of Hunt’s algorithm: the data are partitioned recursively, and at each step a splitting criterion selects the attribute test condition used to divide the records (Tan et al., 2019, p. 122).
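To make the idea of a splitting criterion concrete, the sketch below computes the Gini impurity of the root node and of the two groups produced by splitting on the number of investors. The class counts are derived from the node sizes and losses in the rpart output printed later in this report. Gini is rpart’s default criterion for classification trees, but the helper `gini()` is written here for illustration and is not part of rpart.

```r
# Gini impurity of a node, given its class counts.
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Class counts derived from the printed tree (n and loss per node).
root  <- c(operating = 104, acquired = 64)  # all 168 companies
left  <- c(operating = 78,  acquired = 35)  # investors <  3.5
right <- c(operating = 26,  acquired = 29)  # investors >= 3.5

# Weighted impurity of the children and the resulting reduction.
weighted_children <- (sum(left) * gini(left) + sum(right) * gini(right)) / sum(root)
gain <- gini(root) - weighted_children

round(c(root = gini(root), children = weighted_children, gain = gain), 4)
```

A positive gain means the children are, on average, purer than the parent. The tree-growing procedure compares candidate splits on this kind of measure and keeps the best one.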

df$funding_total_usd <- as.integer(df$funding_total_usd)
## Warning: NAs introduced by coercion to integer range
df$funding_total_usd
##   [1]      5000     10000     10000     10000     10000     10000     10000
##   [8]     10000     10000     10000     10000     10000     10000     10000
##  [15]     10000     10000     10000     10000     10000     10000     10000
##  [22]     10000     10000     18000     45000     50000    150000    300000
##  [29]    505000    600000    750000   1000000   1090000   1200000   1500000
##  [36]   4000000   4000000   5500000   6000000   8300000     10000   3000000
##  [43]   6000000  12862048      5000     10000     50000    100000    100000
##  [50]    150000    200000    227979    250000    300000    423000    750000
##  [57]    750000   1000000   1000000   1300000   1500000   1500000   5200000
##  [64]  15000000   1450000   2450000   2666404  17000000    125000    950000
##  [71]  26000000      5000      9000     18000     40000     50000     50000
##  [78]     50000     55000    182500    200000    215000    250000    450000
##  [85]    500000    500000    610000   1500004   1550000   3700000   4000000
##  [92]   4000000   6000000   6000000   7175000   9700000  10000000  30000000
##  [99] 170000000    269344    429997   2750000   4550000   6800000  14000000
## [106]   2056426   4000000   7500000   8500000  26720000  69000000  66426557
## [113] 102149580      2000     10000     15000    100000    500009    520000
## [120]    783800   1000000   1500000   3000000   3600000   3725000   4200000
## [127]   4500000   9517008  17190245  21000000 175000000    500000    905421
## [134]   1600000   3097222   3750000   4018000  16125150  30800000     10000
## [141]    270000    275000   2950030   5300000   5922024   8595974  56000000
## [148]    500000     10000     10000     12000    437500   1000000   1000000
## [155]   1785000   5240000  10000000  17000000        NA     30000   2000000
## [162]   2400000   3972414   6411000  11307429  15237785  28800000  23466222
funding <- df$funding_total_usd
funding
##   [1]      5000     10000     10000     10000     10000     10000     10000
##   [8]     10000     10000     10000     10000     10000     10000     10000
##  [15]     10000     10000     10000     10000     10000     10000     10000
##  [22]     10000     10000     18000     45000     50000    150000    300000
##  [29]    505000    600000    750000   1000000   1090000   1200000   1500000
##  [36]   4000000   4000000   5500000   6000000   8300000     10000   3000000
##  [43]   6000000  12862048      5000     10000     50000    100000    100000
##  [50]    150000    200000    227979    250000    300000    423000    750000
##  [57]    750000   1000000   1000000   1300000   1500000   1500000   5200000
##  [64]  15000000   1450000   2450000   2666404  17000000    125000    950000
##  [71]  26000000      5000      9000     18000     40000     50000     50000
##  [78]     50000     55000    182500    200000    215000    250000    450000
##  [85]    500000    500000    610000   1500004   1550000   3700000   4000000
##  [92]   4000000   6000000   6000000   7175000   9700000  10000000  30000000
##  [99] 170000000    269344    429997   2750000   4550000   6800000  14000000
## [106]   2056426   4000000   7500000   8500000  26720000  69000000  66426557
## [113] 102149580      2000     10000     15000    100000    500009    520000
## [120]    783800   1000000   1500000   3000000   3600000   3725000   4200000
## [127]   4500000   9517008  17190245  21000000 175000000    500000    905421
## [134]   1600000   3097222   3750000   4018000  16125150  30800000     10000
## [141]    270000    275000   2950030   5300000   5922024   8595974  56000000
## [148]    500000     10000     10000     12000    437500   1000000   1000000
## [155]   1785000   5240000  10000000  17000000        NA     30000   2000000
## [162]   2400000   3972414   6411000  11307429  15237785  28800000  23466222
df$status <- as.factor(df$status)
df$status
##   [1] 0 0 0 0 0 1 0 0 1 1 1 0 1 1 0 0 1 1 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0
##  [38] 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 0 0 1 0 0 0 0
##  [75] 1 0 0 1 0 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 1 1 1 0 0 0 0 0 1 0 0
## [112] 1 0 1 1 0 0 0 1 1 1 1 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1 1
## [149] 1 1 0 1 1 1 1 1 1 1 0 1 0 0 1 0 0 1 1 0
## Levels: 0 1
status <- df$status
status
##   [1] 0 0 0 0 0 1 0 0 1 1 1 0 1 1 0 0 1 1 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0
##  [38] 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 0 0 1 0 0 0 0
##  [75] 1 0 0 1 0 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 1 1 1 0 0 0 0 0 1 0 0
## [112] 1 0 1 1 0 0 0 1 1 1 1 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1 1
## [149] 1 1 0 1 1 1 1 1 1 1 0 1 0 0 1 0 0 1 1 0
## Levels: 0 1
investors <- df$investors
investors
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3
##  [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [149] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
rounds <- df$funding_rounds
rounds
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 2 2 2 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 3 3 3 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 4 4
## [112] 6 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 4 4 4 5
## [149] 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 3 3 3 4 5
str(df)
## 'data.frame':    168 obs. of  5 variables:
##  $ Company.Number   : int  53 14 16 20 28 32 40 42 47 49 ...
##  $ status           : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 2 2 ...
##  $ funding_total_usd: int  5000 10000 10000 10000 10000 10000 10000 10000 10000 10000 ...
##  $ funding_rounds   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ investors        : int  1 1 1 1 1 1 1 1 1 1 ...
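The coercion warning above comes from observation 159, whose funding of $2.4 billion is larger than the biggest value an R integer can hold, so `as.integer()` returns NA for it. The short sketch below reproduces the effect on two values taken from the data. One possible alternative, offered as a suggestion and not as what this report did, is to leave the column as a double, which tree models accept directly.

```r
# Largest value a 32-bit R integer can hold.
.Machine$integer.max

# Two funding totals from the data: the second is out of integer range.
x <- c(175000000, 2400000000)

# as.integer() warns and returns NA for the out-of-range value.
as_int <- suppressWarnings(as.integer(x))
is.na(as_int)

# Left as a double, both values are preserved.
anyNA(x)
```

Because the model in this report was fit after the conversion, one company entered the tree with a missing funding value.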

All attribute names were shortened for simplicity: funding, status, investors, and rounds.

library(rpart)

For this model, the rpart library is used. It implements the CART algorithm, and its name stands for Recursive Partitioning And Regression Trees.

treeAnalysis <- rpart(status ~ funding + rounds + investors , data = df)
treeAnalysis
## n= 168 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 168 64 0 (0.6190476 0.3809524)  
##    2) investors< 3.5 113 35 0 (0.6902655 0.3097345) *
##    3) investors>=3.5 55 26 1 (0.4727273 0.5272727)  
##      6) funding>=1550000 32 14 0 (0.5625000 0.4375000)  
##       12) funding< 3737500 9  2 0 (0.7777778 0.2222222) *
##       13) funding>=3737500 23 11 1 (0.4782609 0.5217391)  
##         26) funding>=6166512 15  6 0 (0.6000000 0.4000000) *
##         27) funding< 6166512 8  2 1 (0.2500000 0.7500000) *
##      7) funding< 1550000 23  8 1 (0.3478261 0.6521739) *

In the model, “status” is the factor and dependent attribute, while funding, investors, and rounds serve as the independent variables. The shortened names are used in the formula for simplicity.
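In the call above, `funding`, `rounds`, and `investors` are not columns of `df`. R finds them as the standalone vectors created earlier. A self-contained variant keeps every variable inside the data frame, which also makes it easy to score the fitted tree. The sketch below uses simulated data (the seed, the sample size, and the rule generating `status` are all hypothetical) and reports training accuracy, which is an optimistic estimate.

```r
library(rpart)

set.seed(607)  # arbitrary seed so the toy data are reproducible
n <- 200
toy <- data.frame(
  funding   = round(10^runif(n, 3, 8)),       # roughly $1K to $100M
  rounds    = sample(1:6, n, replace = TRUE),
  investors = sample(1:5, n, replace = TRUE)
)
# Hypothetical outcome rule: more investors, higher chance of an exit.
toy$status <- factor(rbinom(n, 1, prob = 0.15 + 0.1 * toy$investors))

# All variables live in `toy`, so the formula does not depend on
# objects in the global environment.
fit  <- rpart(status ~ funding + rounds + investors, data = toy, method = "class")
pred <- predict(fit, newdata = toy, type = "class")

# Confusion matrix and training accuracy.
conf_matrix <- table(actual = toy$status, predicted = pred)
accuracy    <- sum(diag(conf_matrix)) / sum(conf_matrix)
conf_matrix
accuracy
```

For an honest estimate of predictive value, the same steps would be applied to a held-out test set instead of the training rows.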

library(rpart.plot)

Finally, to visualize the DT, rpart.plot is incorporated.

rpart.plot(treeAnalysis, extra = 4)

Summary

To start, the two outcomes within the “status” dependent variable are displayed as binary values, with acquisition/IPO receiving a “yes” and a “1” and operating a “no” and a “0” (Tan et al., 2019, p. 122). Ultimately, we would like to classify whether an upstate New York company will be acquired or exercise an IPO. No node in this DT is pure, however. Every child node represents a partition of the data resulting from one of the k outcomes of the attribute test condition (Tan et al., 2019, p. 129), and each partition here still contains a mix of both classes.

In a top-down analysis, we’ll first focus on the top (root) node. Across all 168 companies, the probability of still operating, that is, of not having been acquired or gone public, is 62%, leaving 38% that have reached that status. The root is then split on the number of investors. For companies with fewer than four investors (113 of the 168), the probability of still operating rises to 69% and the probability of an acquisition or IPO falls to 31%. For companies with four or more investors (55 companies), the probability of an acquisition or IPO rises to 53%. This is how a DT classifies companies into different groups. Note that the fewer-than-four group is a terminal node: the tree does not split it further, and every company in it is classified as operating even though 31% were acquired or went public.

The next level of the tree is perhaps the most intriguing. Among companies with four or more investors, those with total funding below $1.55 million (23 companies) have a 65% probability of being acquired or going public. Those with $1.55 million or more (32 companies) have a lower probability, 44%, and this node continues to be split. Companies with between $1.55 million and roughly $3.7 million have only a 22% probability, while those at or above $3.7 million have a 52% probability and are split once more. The highest probability in the tree appears in this last split: companies with four or more investors and funding between roughly $3.7 million and $6.2 million have a 75% probability of being acquired or exiting with an IPO, although that leaf holds only eight companies. Above $6.2 million, the probability drops back to 40%.
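The percentages quoted in this summary can be recovered from the n and loss columns of the printed tree. In a classification tree, loss counts the observations in a node that do not belong to the node’s predicted class. The sketch below does this arithmetic for the five terminal nodes. The data frame is typed in by hand from the rpart output, and the column names are chosen here for illustration.

```r
# Terminal nodes copied from the printed rpart output.
leaves <- data.frame(
  node = c(2, 12, 26, 27, 7),
  rule = c("investors < 4",
           "investors >= 4, funding 1.55M to 3.74M",
           "investors >= 4, funding >= 6.17M",
           "investors >= 4, funding 3.74M to 6.17M",
           "investors >= 4, funding < 1.55M"),
  n    = c(113, 9, 15, 8, 23),
  loss = c(35, 2, 6, 2, 8),
  yval = c(0, 0, 0, 1, 1)   # predicted class of the leaf
)

# If the leaf predicts 1, the misclassified rows are the operating ones.
# If it predicts 0, the misclassified rows are the acquired/IPO ones.
leaves$p_acquired <- ifelse(leaves$yval == 1,
                            1 - leaves$loss / leaves$n,
                            leaves$loss / leaves$n)

# Share of all companies that land in each leaf.
leaves$share <- leaves$n / sum(leaves$n)

leaves[order(-leaves$p_acquired), c("rule", "n", "p_acquired", "share")]
```

Sorting by `p_acquired` shows that the most favorable leaf is also the smallest one, which is worth keeping in mind when reading the 75% figure.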

KNN on the Data and Variables of Your Choosing

As an alternative, for KNN I recoded two attributes of the dataset, funding and status, to gauge whether the classification was affected by a change in numerical values and label structure. In the preprocessing stage, rather than using the raw funding values, which were not normalized in the DT, I binned funding into intervals and assigned each interval a value from 1 to 10. The higher the assigned value, the more funding a company received. Additionally, instead of the binary 0/1 coding, “status,” which remains the target factor, is labeled either operating or acquired-ipo for each company. Building this from scratch, I wanted to see whether the model proved stronger in terms of overall predictive value. A new dataset was created and loaded for this purpose.
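The recoding described above can be sketched with `cut()` and `factor()`. The interval boundaries actually used to build the new file are not stated in this report, so the `breaks` vector below is purely hypothetical. It only shows one way raw funding could be mapped to levels 1 through 10.

```r
# A few raw funding totals from the original data.
funding <- c(5000, 10000, 250000, 1500000, 6000000, 175000000)

# Hypothetical interval boundaries (11 cut points give 10 levels).
breaks <- c(0, 1e4, 5e4, 1e5, 5e5, 1e6, 5e6, 1e7, 5e7, 1e8, Inf)

# labels = FALSE returns the interval index (1 to 10) instead of a factor.
funding_level <- cut(funding, breaks = breaks, labels = FALSE, right = TRUE)
funding_level

# Relabel the 0/1 target with descriptive class names.
status01 <- c(0, 0, 1, 0, 1, 0)   # toy values for illustration
status   <- factor(status01, levels = c(0, 1),
                   labels = c("operating", "acquired-ipo"))
table(status)
```

Binning compresses the very wide range of funding values onto a scale comparable to rounds and investors, which matters for a distance-based method such as KNN.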

df <- read.csv("C:/Users/bjorzech/Desktop/DSC607_Upstate_W3A.csv",stringsAsFactors = FALSE)
(df)
##     funding_total_usd funding_rounds investors       status
## 1                   1              1         2    operating
## 2                   1              1         2    operating
## 3                   1              1         4    operating
## 4                   1              1         2 acquired-ipo
## 5                   1              1         1    operating
## 6                   1              2         1    operating
## 7                   1              1         2    operating
## 8                   1              6         3 acquired-ipo
## 9                   1              2         4    operating
## 10                  1              1         1    operating
## 11                  1              1         5 acquired-ipo
## 12                  1              1         2    operating
## 13                  1              1         4 acquired-ipo
## 14                  1              1         1    operating
## 15                  1              1         1 acquired-ipo
## 16                  1              1         1    operating
## 17                  1              1         3    operating
## 18                  1              1         3    operating
## 19                  1              3         4    operating
## 20                  1              1         1    operating
## 21                  1              1         3    operating
## 22                  1              2         4    operating
## 23                  1              1         5 acquired-ipo
## 24                  1              1         4    operating
## 25                  1              1         3    operating
## 26                  1              6         3    operating
## 27                  1              2         4    operating
## 28                  1              1         1    operating
## 29                  1              1         3    operating
## 30                  1              2         4    operating
## 31                  1              3         4    operating
## 32                  1              1         1 acquired-ipo
## 33                  1              2         5    operating
## 34                  1              3         5    operating
## 35                  2              4         3    operating
## 36                  2              1         3 acquired-ipo
## 37                  2              1         4    operating
## 38                  2              1         1    operating
## 39                  2              1         4    operating
## 40                  2              1         1    operating
## 41                  2              2         2    operating
## 42                  2              1         1    operating
## 43                  2              1         2 acquired-ipo
## 44                  2              3         3    operating
## 45                  2              1         5 acquired-ipo
## 46                  2              3         5    operating
## 47                  3              1         1 acquired-ipo
## 48                  3              3         3 acquired-ipo
## 49                  3              1         1 acquired-ipo
## 50                  3              1         1 acquired-ipo
## 51                  3              1         1    operating
## 52                  3              1         1    operating
## 53                  3              1         1    operating
## 54                  3              1         3    operating
## 55                  3              1         3 acquired-ipo
## 56                  3              3         4    operating
## 57                  3              1         2    operating
## 58                  3              1         3    operating
## 59                  3              2         4    operating
## 60                  3              1         1    operating
## 61                  3              5         5    operating
## 62                  3              1         3    operating
## 63                  3              1         2 acquired-ipo
## 64                  3              1         2 acquired-ipo
## 65                  3              1         2    operating
## 66                  3              3         2 acquired-ipo
## 67                  3              1         2    operating
## 68                  3              1         5    operating
## 69                  3              2         4 acquired-ipo
## 70                  3              1         2    operating
## 71                  3              1         1    operating
## 72                  3              1         1    operating
## 73                  3              1         4    operating
## 74                  3              2         5    operating
## 75                  3              1         1 acquired-ipo
## 76                  3              1         1    operating
## 77                  3              1         3    operating
## 78                  3              1         4 acquired-ipo
## 79                  3              6         1    operating
## 80                  3              1         3    operating
## 81                  3              4         4    operating
## 82                  3              1         1    operating
## 83                  3              1         4    operating
## 84                  4              3         3    operating
## 85                  4              2         3 acquired-ipo
## 86                  4              1         4 acquired-ipo
## 87                  4              1         3    operating
## 88                  4              1         4 acquired-ipo
## 89                  4              1         2    operating
## 90                  4              5         4 acquired-ipo
## 91                  4              1         3    operating
## 92                  4              4         4 acquired-ipo
## 93                  4              1         5 acquired-ipo
## 94                  4              1         2    operating
## 95                  4              1         5 acquired-ipo
## 96                  4              3         2    operating
## 97                  4              1         3    operating
## 98                  4              1         3    operating
## 99                  4              1         4 acquired-ipo
## 100                 4              1         4 acquired-ipo
## 101                 4              1         3    operating
## 102                 5              2         3    operating
## 103                 5              1         1    operating
## 104                 5              4         5 acquired-ipo
## 105                 5              1         2    operating
## 106                 5              1         3 acquired-ipo
## 107                 5              1         4 acquired-ipo
## 108                 5              1         1    operating
## 109                 5              1         1 acquired-ipo
## 110                 5              1         3    operating
## 111                 5              1         3 acquired-ipo
## 112                 5              2         2    operating
## 113                 5              1         1    operating
## 114                 5              2         3 acquired-ipo
## 115                 5              1         2    operating
## 116                 5              2         3 acquired-ipo
## 117                 5              1         4 acquired-ipo
## 118                 5              2         5 acquired-ipo
## 119                 5              3         2    operating
## 120                 5              1         1    operating
## 121                 5              1         4    operating
## 122                 5              1         3    operating
## 123                 5              1         1 acquired-ipo
## 124                 5              1         1 acquired-ipo
## 125                 5              1         3 acquired-ipo
## 126                 6              1         4    operating
## 127                 6              1         5    operating
## 128                 6              2         4    operating
## 129                 6              1         3    operating
## 130                 6              2         1    operating
## 131                 6              1         1    operating
## 132                 6              1         3 acquired-ipo
## 133                 6              1         5 acquired-ipo
## 134                 6              1         1    operating
## 135                 6              1         5 acquired-ipo
## 136                 6              1         5 acquired-ipo
## 137                 6              3         4 acquired-ipo
## 138                 6              1         3 acquired-ipo
## 139                 6              1         3    operating
## 140                 6              2         4 acquired-ipo
## 141                 6              1         1    operating
## 142                 6              1         2    operating
## 143                 6              1         4 acquired-ipo
## 144                 7              4         3    operating
## 145                 7              1         1    operating
## 146                 7              1         2 acquired-ipo
## 147                 7              1         2    operating
## 148                 7              3         4 acquired-ipo
## 149                 7              1         1 acquired-ipo
## 150                 7              3         3    operating
## 151                 7              1         1    operating
## 152                 7              1         1 acquired-ipo
## 153                 7              2         2    operating
## 154                 7              1         3 acquired-ipo
## 155                 8              1         3 acquired-ipo
## 156                 8              1         1    operating
## 157                 8              4         4 acquired-ipo
## 158                 8              1         5 acquired-ipo
## 159                 8              3         5 acquired-ipo
## 160                 8              1         1 acquired-ipo
## 161                 8              2         3    operating
## 162                 8              1         1 acquired-ipo
## 163                 8              2         3 acquired-ipo
## 164                 8              2         2 acquired-ipo
## 165                 9              2         5 acquired-ipo
## 166                 9              1         2    operating
## 167                 9              1         4    operating
## 168                10              2         1    operating

Structure

str(df)
## 'data.frame':    168 obs. of  4 variables:
##  $ funding_total_usd: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ funding_rounds   : int  1 1 1 1 1 2 1 6 2 1 ...
##  $ investors        : int  2 2 4 2 1 1 2 3 4 1 ...
##  $ status           : chr  "operating" "operating" "operating" "acquired-ipo" ...

Similar to the DT, some preprocessing steps were taken to fit the KNN model. The variables “funding_rounds,” “investors,” and “funding_total_usd” were reclassified as numeric values, while “status,” initially stored as character, was converted to a factor and remained the target. Finally, a new structure was established in order to build a KNN model.

df$funding_rounds <- as.numeric(df$funding_rounds)
df$investors <- as.numeric(df$investors)
df$funding_total_usd <- as.numeric(df$funding_total_usd)
df$status <- as.factor(df$status)
rounds <- df$funding_rounds
investors <- df$investors
funding <- df$funding_total_usd
status <- df$status
str(df)
## 'data.frame':    168 obs. of  4 variables:
##  $ funding_total_usd: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ funding_rounds   : num  1 1 1 1 1 2 1 6 2 1 ...
##  $ investors        : num  2 2 4 2 1 1 2 3 4 1 ...
##  $ status           : Factor w/ 2 levels "acquired-ipo",..: 2 2 2 1 2 2 2 1 2 2 ...

Normalize

normalize <- function(x) {
  # Min-max rescaling to the [0, 1] range
  (x - min(x)) / (max(x) - min(x))
}
df1 <- as.data.frame(lapply(df[1:3], normalize))
head(df1)
##   funding_total_usd funding_rounds investors
## 1                 0            0.0      0.25
## 2                 0            0.0      0.25
## 3                 0            0.0      0.75
## 4                 0            0.0      0.25
## 5                 0            0.0      0.00
## 6                 0            0.2      0.00

Further, only the variables with numeric values are normalized. The “status” variable is excluded because it serves as the target (factor) variable here in R. With any predictive or classification algorithm that relies on distance, the data should be normalized so that no single variable dominates the distance calculation because of its scale (Tan et al., 2019, p. 211). The following model subscribes to the fourth characteristic of nearest neighbor classifiers (Tan et al., 2019, pp. 210-211). As we find, the decision boundaries of KNN classifiers also have high variability because they depend on the composition of training examples in the local neighborhood. Increasing the number of nearest neighbors, according to Tan et al. (2019), may reduce such variability. Ultimately, this makes sense here: a more robust data set exists, and this sample focuses exclusively on upstate New York companies.
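To make the rescaling concrete, here is a minimal sketch that contrasts min-max normalization with z-score standardization on a small vector. The values in `x` are hypothetical and only stand in for one numeric predictor; they are not taken from the Crunchbase data.

```r
# Hypothetical values standing in for a single numeric predictor
x <- c(1, 2, 3, 4, 10)

# Min-max normalization: smallest value maps to 0, largest to 1
min_max <- (x - min(x)) / (max(x) - min(x))

# Z-score standardization: result has mean 0 and standard deviation 1
z_score <- (x - mean(x)) / sd(x)

range(min_max)  # 0 1
```

Either rescaling puts the predictors on a comparable footing before distances are computed; they differ in how strongly an extreme value such as 10 compresses the remaining points.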

num.vars <- sapply(df, is.numeric)
df[num.vars] <- lapply(df[num.vars], scale)
myvars <- c("funding_total_usd", "investors", "funding_rounds")
df.subset <- df[myvars]
summary(df.subset)
##  funding_total_usd.V1     investors.V1      funding_rounds.V1 
##  Min.   :-1.3010259   Min.   :-1.2963726   Min.   :-0.534123  
##  1st Qu.:-0.8594656   1st Qu.:-1.2963726   1st Qu.:-0.534123  
##  Median : 0.0236550   Median : 0.1751855   Median :-0.534123  
##  Mean   : 0.0000000   Mean   : 0.0000000   Mean   : 0.000000  
##  3rd Qu.: 0.9067756   3rd Qu.: 0.9109645   3rd Qu.: 0.400592  
##  Max.   : 2.6730168   Max.   : 1.6467436   Max.   : 4.139452

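As a secondary step, the independent variables in this model are standardized with scale(), grouped in a subset, and summarized to confirm the rescaling: each variable now has a mean of 0. It is this standardized subset, rather than the min-max version stored in df1, that feeds the KNN models below.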

set.seed(123) 
test <- 1:56
train.df <- df.subset[-test,]
test.df <- df.subset[test,]
train.def <- df$status[-test]
test.def <- df$status[test]

Next, I randomized the data before splitting it into a training and a testing set. Note that the code above assigns rows 1 through 56 to the test set, so it relies on the rows already being in random order; set.seed() by itself does not shuffle them. Since the observations are randomized, the decision is made to split the sample by 2/3. The training set has 112 observations while the test set has 56. We are also operating on the premise, per the lecture, that the greater the value of K, the smoother the decision boundaries. This can lower variance but increase bias. Additionally, one of the basic requirements of a clean model set-up and selection is that the training and testing sets must be kept separate (Tan et al. 2019, p. 172).
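One way to make the randomization explicit is to draw the test indices with sample() rather than taking the first rows. The following is a minimal sketch on a hypothetical toy data frame (`toy`, `x`, and the simulated `status` column are assumptions for illustration, not the assignment's data):

```r
set.seed(123)

# Hypothetical toy data with the same number of rows as the sample (168)
n <- 168
toy <- data.frame(
  x = rnorm(n),
  status = factor(sample(c("acquired-ipo", "operating"), n, replace = TRUE))
)

# Draw one third of the row indices at random for the test set
test_idx <- sample(seq_len(n), size = round(n / 3))

train_toy <- toy[-test_idx, ]  # remaining two thirds
test_toy  <- toy[test_idx, ]   # randomly chosen third

c(train = nrow(train_toy), test = nrow(test_toy))  # 112 and 56
```

Because the indices are sampled, the split no longer depends on how the rows happen to be ordered in the source file.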

Results

library(class)
knn.1 <-  knn(train.df, test.df, train.def, k=1)
knn.5 <-  knn(train.df, test.df, train.def, k=5)
knn.10 <-  knn(train.df, test.df, train.def, k=10)
56 * sum(test.def == knn.1)/56 
## [1] 37

In the K=1 model, the expression returns a count rather than a percentage: 37 of the 56 test outcomes are classified correctly, or about 66.1%.

56 * sum(test.def == knn.5)/56
## [1] 41

In the K=5 model, 41 of the 56 test outcomes are classified correctly, or about 73.2%.

56 * sum(test.def == knn.10)/56
## [1] 43

In the K=10 model, 43 of the 56 test outcomes are classified correctly, or about 76.8%.
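The three counts above (37, 41, and 43 correct out of 56) convert to accuracy rates by dividing by the test-set size. A minimal sketch using those reported counts; the names `correct` and `n_test` are illustrative:

```r
# Correct classifications reported for each K, out of 56 test cases
correct <- c(k1 = 37, k5 = 41, k10 = 43)
n_test <- 56

# Accuracy as a percentage, rounded to one decimal place
accuracy_pct <- round(100 * correct / n_test, 1)
accuracy_pct  # k1 = 66.1, k5 = 73.2, k10 = 76.8
```

With the prediction vectors at hand, the same rate can be computed directly as `mean(test.def == knn.1)`, which gives the proportion of matches without the count step.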

Cross-Validation

table(knn.1 ,test.def)
##               test.def
## knn.1          acquired-ipo operating
##   acquired-ipo            0         4
##   operating              15        37
table(knn.5 ,test.def)
##               test.def
## knn.5          acquired-ipo operating
##   acquired-ipo            1         1
##   operating              14        40
table(knn.10 ,test.def)
##               test.def
## knn.10         acquired-ipo operating
##   acquired-ipo            4         2
##   operating              11        39

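To best test the models, I cross-tabulated the predicted against the actual status for K at 1, 5, and 10. Strictly speaking, these tables are confusion matrices on the single hold-out test set. A stratified cross-validation would instead partition the data into K folds that each preserve the proportion of positive and negative instances (Tan et al. 2019, p. 167).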
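For reference, a stratified cross-validation can be sketched in base R with the class package. This is only an illustration under stated assumptions: the data frame `toy`, its predictors `x1` and `x2`, the simulated `status` labels, and the choice of five folds are all hypothetical and do not reproduce the assignment's results.

```r
library(class)  # provides knn()

set.seed(123)

# Hypothetical toy data: two numeric predictors and an imbalanced target
n <- 120
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$status <- factor(ifelse(toy$x1 + rnorm(n) > 0.7,
                            "acquired-ipo", "operating"))

# Stratified folds: assign fold labels separately within each class,
# so every fold keeps roughly the same class proportions
n_folds <- 5
fold <- integer(n)
for (lvl in levels(toy$status)) {
  idx <- which(toy$status == lvl)
  fold[idx] <- sample(rep(seq_len(n_folds), length.out = length(idx)))
}

# Hold out each fold in turn, fit on the rest, record accuracy
predictors <- c("x1", "x2")
cv_accuracy <- sapply(seq_len(n_folds), function(f) {
  pred <- knn(train = toy[fold != f, predictors],
              test  = toy[fold == f, predictors],
              cl    = toy$status[fold != f],
              k     = 5)
  mean(pred == toy$status[fold == f])
})

mean(cv_accuracy)  # average accuracy across the five folds
```

Averaging over folds uses every observation for testing exactly once, which gives a steadier estimate than a single hold-out split on a sample of this size.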

For K=1, none of the four companies predicted for acquisition was actually acquired (0%), while 15 of the 52 companies predicted to remain operating were in fact acquired or went public (28.8%). For K=5, the probability improves but for a smaller sample: only two companies are predicted for acquisition, and one of them is correct (50%). Among the 54 companies predicted to remain operating, 14 were actually acquired (25.9%). Finally, for the K=10 test, among six companies predicted for acquisition or IPO, four are correct (66.7%). Among the 50 predicted to remain operating, 11 were actually acquired (22%).
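The per-class rates can be read off a confusion matrix directly. A minimal sketch that rebuilds the K=10 table from the counts reported above and computes precision and recall for the “acquired-ipo” class; the object names are illustrative:

```r
# K = 10 confusion matrix from the reported counts
# rows = predicted class, columns = actual class (filled column by column)
cm <- matrix(c(4, 11, 2, 39), nrow = 2,
             dimnames = list(predicted = c("acquired-ipo", "operating"),
                             actual    = c("acquired-ipo", "operating")))

# Precision: share of predicted acquisitions that were actual acquisitions
precision <- cm["acquired-ipo", "acquired-ipo"] / sum(cm["acquired-ipo", ])

# Recall: share of actual acquisitions that the model identified
recall <- cm["acquired-ipo", "acquired-ipo"] / sum(cm[, "acquired-ipo"])

# Overall accuracy: correct predictions on the diagonal over all cases
accuracy <- sum(diag(cm)) / sum(cm)

round(c(precision = precision, recall = recall, accuracy = accuracy), 3)
# precision 0.667, recall 0.267, accuracy 0.768
```

Reading precision and recall together shows that even at K=10 the model finds only 4 of the 15 actual acquisitions, although most of its acquisition predictions are correct.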

With this model, increasing K increases the classification success rate. However, the variability and predictive power are not as conclusive for those companies remaining in operating status.

Visualizations

Since the KNN models remain somewhat inconclusive and binary even after normalizing the data, I chose to visualize the findings via two methods that display the models holistically with all variables: the factor (dependent, status) and the three independent variables. In the first visualization, it is apparent that funding has the most variability but perhaps the greatest impact on “status” and whether an upstate New York company will reach acquisition or IPO level. Number of investors and rounds still have an impact, but they are not as strong as funding.

Additionally, in the second visualization, KNN principles are included to show the predictability across all variables and potential relationships. The bottom row signifies the relationship between “status” and the independent variables, with the red dot consistent with the above KNN results. However, the inclusion of histograms and correlation matrices is intentional to show there are other findings. For instance, the number of funding rounds has a stronger relationship to “status” than the number of investors does. It would be deemed statistically significant, but more analysis and data are needed. Conversely, the investor and round relationship is statistically insignificant. This is valid and consistent with industry standards.

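library(psych)
# Scatterplot matrix with histograms and pairwise correlations
pairs.panels(df[, -5],
             method = "pearson",    # correlation method
             hist.col = "#00AFBB",  # histogram fill color
             density = TRUE,        # overlay density curves
             ellipses = TRUE        # draw correlation ellipses
)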

In conclusion, more data are needed to create stronger algorithms for predictive value, but the foundation remains for further exploration. Between the two models, though, decision trees seem more appropriate in this analytical situation.

Reference

Tan, P.-N., Steinbach, M., Karpatne, A., & Kumar, V. (2019). Introduction to data mining. New York, NY: Pearson Education, Inc.