During the summer of 2019, I conducted research for a local venture capitalist who wanted to explore the migration of talent to and from communities across the country. Census-defined metropolitan statistical areas (MSAs) and core-based statistical areas (CBSAs) served as the baseline data for that project. While last week’s assignment focused on Boulder, Colorado, and the likelihood of profitability, this week we’ll examine another sample in the dataset, one more regional in scope but likewise centered on the concept of classification.
This examination and assignment, implementing classification methods in the form of decision tree (DT) and k-nearest neighbor (KNN) algorithms, focuses on companies receiving some level of investment across upstate New York, specifically in four major regions: Buffalo, Rochester, Syracuse, and Albany.
In general, funders realize full investment potential during an acquisition or initial public offering (IPO). Depending on the terms of the agreement, some investors also settle for an equity stake and earn when the company earns – or becomes profitable. Certain regions of the country lend themselves to higher probability levels for acquisition or IPOs, and the initial hypothesis focused on a diminished likelihood of these deals in the established upstate New York region.
For this exploration, though, I wanted to examine the probability of either an acquisition or IPO based on a number of variables for classification purposes – level of funding, number of investors, number of investment rounds, and whether the company is still operating and accepting investment or has been acquired or exercised an IPO.
This data was extracted from Crunchbase on 2019-05-10 at 10:36:37 UTC. It contains 168 observations and five variables.
df <- read.csv("C:/Users/bjorzech/Desktop/DSC607_Upstate_W3.csv",stringsAsFactors = FALSE)
(df)
## Company.Number status funding_total_usd funding_rounds investors
## 1 53 0 5000 1 1
## 2 14 0 10000 1 1
## 3 16 0 10000 1 1
## 4 20 0 10000 1 1
## 5 28 0 10000 1 1
## 6 32 1 10000 1 1
## 7 40 0 10000 1 1
## 8 42 0 10000 1 1
## 9 47 1 10000 1 1
## 10 49 1 10000 1 1
## 11 50 1 10000 1 1
## 12 52 0 10000 1 1
## 13 75 1 10000 1 1
## 14 109 1 10000 1 1
## 15 113 0 10000 1 1
## 16 120 0 10000 1 1
## 17 123 1 10000 1 1
## 18 124 1 10000 1 1
## 19 134 0 10000 1 1
## 20 141 0 10000 1 1
## 21 149 1 10000 1 1
## 22 152 1 10000 1 1
## 23 160 1 10000 1 1
## 24 131 0 18000 1 1
## 25 10 0 45000 1 1
## 26 60 0 50000 1 1
## 27 145 0 150000 1 1
## 28 38 0 300000 1 1
## 29 82 0 505000 1 1
## 30 72 0 600000 1 1
## 31 76 0 750000 1 1
## 32 15 1 1000000 1 1
## 33 108 0 1090000 1 1
## 34 71 0 1200000 1 1
## 35 156 0 1500000 1 1
## 36 5 0 4000000 1 1
## 37 151 0 4000000 1 1
## 38 103 0 5500000 1 1
## 39 51 0 6000000 1 1
## 40 162 1 8300000 1 1
## 41 168 0 10000 2 1
## 42 6 0 3000000 2 1
## 43 130 0 6000000 2 1
## 44 79 0 12862048 6 1
## 45 70 0 5000 1 2
## 46 1 0 10000 1 2
## 47 105 0 50000 1 2
## 48 4 1 100000 1 2
## 49 67 0 100000 1 2
## 50 89 0 150000 1 2
## 51 43 1 200000 1 2
## 52 57 0 227979 1 2
## 53 65 0 250000 1 2
## 54 12 0 300000 1 2
## 55 7 0 423000 1 2
## 56 147 0 750000 1 2
## 57 166 0 750000 1 2
## 58 2 0 1000000 1 2
## 59 146 1 1000000 1 2
## 60 115 0 1300000 1 2
## 61 64 1 1500000 1 2
## 62 142 0 1500000 1 2
## 63 94 0 5200000 1 2
## 64 63 1 15000000 1 2
## 65 153 0 1450000 2 2
## 66 164 1 2450000 2 2
## 67 112 0 2666404 2 2
## 68 41 0 17000000 2 2
## 69 119 0 125000 3 2
## 70 66 1 950000 3 2
## 71 96 0 26000000 3 2
## 72 122 0 5000 1 3
## 73 25 0 9000 1 3
## 74 129 0 18000 1 3
## 75 111 1 40000 1 3
## 76 18 0 50000 1 3
## 77 98 0 50000 1 3
## 78 138 1 50000 1 3
## 79 97 0 55000 1 3
## 80 125 1 182500 1 3
## 81 55 1 200000 1 3
## 82 17 0 215000 1 3
## 83 21 0 250000 1 3
## 84 36 1 450000 1 3
## 85 77 0 500000 1 3
## 86 80 0 500000 1 3
## 87 110 0 610000 1 3
## 88 132 1 1500004 1 3
## 89 29 0 1550000 1 3
## 90 101 0 3700000 1 3
## 91 91 0 4000000 1 3
## 92 154 1 4000000 1 3
## 93 62 0 6000000 1 3
## 94 139 0 6000000 1 3
## 95 54 0 7175000 1 3
## 96 87 0 9700000 1 3
## 97 155 1 10000000 1 3
## 98 58 0 30000000 1 3
## 99 106 1 170000000 1 3
## 100 114 1 269344 2 3
## 101 163 1 429997 2 3
## 102 116 1 2750000 2 3
## 103 85 1 4550000 2 3
## 104 102 0 6800000 2 3
## 105 161 0 14000000 2 3
## 106 150 0 2056426 3 3
## 107 84 0 4000000 3 3
## 108 44 0 7500000 3 3
## 109 48 1 8500000 3 3
## 110 35 0 26720000 4 3
## 111 144 0 69000000 4 3
## 112 8 1 66426557 6 3
## 113 26 0 102149580 6 3
## 114 143 1 2000 1 4
## 115 117 1 10000 1 4
## 116 126 0 15000 1 4
## 117 167 0 100000 1 4
## 118 121 0 500009 1 4
## 119 86 1 520000 1 4
## 120 107 1 783800 1 4
## 121 13 1 1000000 1 4
## 122 99 1 1500000 1 4
## 123 3 0 3000000 1 4
## 124 83 0 3600000 1 4
## 125 73 0 3725000 1 4
## 126 78 1 4200000 1 4
## 127 100 1 4500000 1 4
## 128 37 0 9517008 1 4
## 129 88 1 17190245 1 4
## 130 24 0 21000000 1 4
## 131 39 0 175000000 1 4
## 132 140 1 500000 2 4
## 133 22 0 905421 2 4
## 134 9 0 1600000 2 4
## 135 59 0 3097222 2 4
## 136 69 1 3750000 2 4
## 137 128 0 4018000 2 4
## 138 30 0 16125150 2 4
## 139 27 0 30800000 2 4
## 140 137 1 10000 3 4
## 141 19 0 270000 3 4
## 142 56 0 275000 3 4
## 143 148 1 2950030 3 4
## 144 31 0 5300000 3 4
## 145 92 1 5922024 4 4
## 146 81 0 8595974 4 4
## 147 157 1 56000000 4 4
## 148 90 1 500000 5 4
## 149 23 1 10000 1 5
## 150 158 1 10000 1 5
## 151 127 0 12000 1 5
## 152 11 1 437500 1 5
## 153 45 1 1000000 1 5
## 154 135 1 1000000 1 5
## 155 133 1 1785000 1 5
## 156 136 1 5240000 1 5
## 157 93 1 10000000 1 5
## 158 95 1 17000000 1 5
## 159 68 0 2400000000 1 5
## 160 118 1 30000 2 5
## 161 33 0 2000000 2 5
## 162 74 0 2400000 2 5
## 163 165 1 3972414 2 5
## 164 46 0 6411000 3 5
## 165 34 0 11307429 3 5
## 166 159 1 15237785 3 5
## 167 104 1 28800000 4 5
## 168 61 0 23466222 5 5
In order to run an accurate DT, I approached the data from a preprocessing perspective and applied dimensionality reduction, subscribing to the belief that many data mining algorithms work better when the dimensionality – the number of attributes in the data – is lower (Tan et al., 2019, p. 57). In this case, the 168 observations with four analysis variables seemed appropriate, with a fifth variable, the company number, included in the basic summary. The company number, however, will not be considered in the DT analysis; it is an irrelevant feature (Tan et al., 2019, p. 58). Finally, the “status” attribute was transformed via binarization (Tan et al., 2019, p. 68). This attribute lends itself nicely to classification and remains binary, with acquisition/IPO coded as “yes” (1) and still operating coded as “no” (0).
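As a note on method, the binarization itself was handled before the file above was loaded. A minimal, self-contained sketch of the idea follows; status_text is an assumed stand-in for a raw Crunchbase label column and is not a variable in DSC607_Upstate_W3.csv.
# Hypothetical sketch of the binarization step; status_text is illustrative
# only and does not appear in the loaded file.
status_text <- c("acquired", "operating", "ipo", "operating")
as.integer(status_text %in% c("acquired", "ipo"))  # yields 1 0 1 0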
str(df)
## 'data.frame': 168 obs. of 5 variables:
## $ Company.Number : int 53 14 16 20 28 32 40 42 47 49 ...
## $ status : int 0 0 0 0 0 1 0 0 1 1 ...
## $ funding_total_usd: num 5000 10000 10000 10000 10000 10000 10000 10000 10000 10000 ...
## $ funding_rounds : int 1 1 1 1 1 1 1 1 1 1 ...
## $ investors : int 1 1 1 1 1 1 1 1 1 1 ...
Further, the structure overview run the first time shows that the variable “funding_total_usd” is stored as a numeric (double) value; to remain consistent with the other count variables, it was converted to an integer. Additionally, “status” is read in as an integer. It was converted to a factor, since the DT model uses it as the dependent variable while the other three variables – funding, rounds, and investors – serve as the independent attributes. This is important to recognize, as these variables are needed to answer the hypothesis: what is the probability of an upstate New York company being acquired or exercising an IPO based on funding, rounds, and number of investors?
A tree classifier will allow us to distinguish between the two classes – operating or acquisition/IPO (Tan et al., 2019, p. 120). More specifically, the tree is grown in the spirit of Hunt’s algorithm, recursively splitting the records according to a splitting criterion, which is an attribute test condition (Tan et al., 2019, p. 122).
df$funding_total_usd <- as.integer(df$funding_total_usd)
## Warning: NAs introduced by coercion to integer range
df$funding_total_usd
## [1] 5000 10000 10000 10000 10000 10000 10000
## [8] 10000 10000 10000 10000 10000 10000 10000
## [15] 10000 10000 10000 10000 10000 10000 10000
## [22] 10000 10000 18000 45000 50000 150000 300000
## [29] 505000 600000 750000 1000000 1090000 1200000 1500000
## [36] 4000000 4000000 5500000 6000000 8300000 10000 3000000
## [43] 6000000 12862048 5000 10000 50000 100000 100000
## [50] 150000 200000 227979 250000 300000 423000 750000
## [57] 750000 1000000 1000000 1300000 1500000 1500000 5200000
## [64] 15000000 1450000 2450000 2666404 17000000 125000 950000
## [71] 26000000 5000 9000 18000 40000 50000 50000
## [78] 50000 55000 182500 200000 215000 250000 450000
## [85] 500000 500000 610000 1500004 1550000 3700000 4000000
## [92] 4000000 6000000 6000000 7175000 9700000 10000000 30000000
## [99] 170000000 269344 429997 2750000 4550000 6800000 14000000
## [106] 2056426 4000000 7500000 8500000 26720000 69000000 66426557
## [113] 102149580 2000 10000 15000 100000 500009 520000
## [120] 783800 1000000 1500000 3000000 3600000 3725000 4200000
## [127] 4500000 9517008 17190245 21000000 175000000 500000 905421
## [134] 1600000 3097222 3750000 4018000 16125150 30800000 10000
## [141] 270000 275000 2950030 5300000 5922024 8595974 56000000
## [148] 500000 10000 10000 12000 437500 1000000 1000000
## [155] 1785000 5240000 10000000 17000000 NA 30000 2000000
## [162] 2400000 3972414 6411000 11307429 15237785 28800000 23466222
funding <- df$funding_total_usd
funding
## [1] 5000 10000 10000 10000 10000 10000 10000
## [8] 10000 10000 10000 10000 10000 10000 10000
## [15] 10000 10000 10000 10000 10000 10000 10000
## [22] 10000 10000 18000 45000 50000 150000 300000
## [29] 505000 600000 750000 1000000 1090000 1200000 1500000
## [36] 4000000 4000000 5500000 6000000 8300000 10000 3000000
## [43] 6000000 12862048 5000 10000 50000 100000 100000
## [50] 150000 200000 227979 250000 300000 423000 750000
## [57] 750000 1000000 1000000 1300000 1500000 1500000 5200000
## [64] 15000000 1450000 2450000 2666404 17000000 125000 950000
## [71] 26000000 5000 9000 18000 40000 50000 50000
## [78] 50000 55000 182500 200000 215000 250000 450000
## [85] 500000 500000 610000 1500004 1550000 3700000 4000000
## [92] 4000000 6000000 6000000 7175000 9700000 10000000 30000000
## [99] 170000000 269344 429997 2750000 4550000 6800000 14000000
## [106] 2056426 4000000 7500000 8500000 26720000 69000000 66426557
## [113] 102149580 2000 10000 15000 100000 500009 520000
## [120] 783800 1000000 1500000 3000000 3600000 3725000 4200000
## [127] 4500000 9517008 17190245 21000000 175000000 500000 905421
## [134] 1600000 3097222 3750000 4018000 16125150 30800000 10000
## [141] 270000 275000 2950030 5300000 5922024 8595974 56000000
## [148] 500000 10000 10000 12000 437500 1000000 1000000
## [155] 1785000 5240000 10000000 17000000 NA 30000 2000000
## [162] 2400000 3972414 6411000 11307429 15237785 28800000 23466222
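Note the NA at position [159]: that company’s $2.4 billion in funding exceeds R’s 32-bit integer maximum (.Machine$integer.max, 2,147,483,647), which is what triggered the coercion warning above. A minimal sketch of one way to retain the value, re-reading the column as a double rather than coercing to integer:
# Doubles hold 2.4e9 exactly, so reading the column as numeric avoids the
# overflow that as.integer() produced above.
raw <- read.csv("C:/Users/bjorzech/Desktop/DSC607_Upstate_W3.csv",
                stringsAsFactors = FALSE)
funding_numeric <- as.numeric(raw$funding_total_usd)
sum(is.na(funding_numeric))  # 0 at this scale -- no values lost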
df$status <- as.factor(df$status)
df$status
## [1] 0 0 0 0 0 1 0 0 1 1 1 0 1 1 0 0 1 1 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## [38] 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 0 0 1 0 0 0 0
## [75] 1 0 0 1 0 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 1 1 1 0 0 0 0 0 1 0 0
## [112] 1 0 1 1 0 0 0 1 1 1 1 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1 1
## [149] 1 1 0 1 1 1 1 1 1 1 0 1 0 0 1 0 0 1 1 0
## Levels: 0 1
status <- df$status
status
## [1] 0 0 0 0 0 1 0 0 1 1 1 0 1 1 0 0 1 1 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## [38] 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 1 0 0 0 1 0 0 0 0
## [75] 1 0 0 1 0 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 1 1 1 0 0 0 0 0 1 0 0
## [112] 1 0 1 1 0 0 0 1 1 1 1 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1 1
## [149] 1 1 0 1 1 1 1 1 1 1 0 1 0 0 1 0 0 1 1 0
## Levels: 0 1
investors <- df$investors
investors
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3
## [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [149] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
rounds <- df$funding_rounds
rounds
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 2 2 2 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 3 3 3 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 4 4
## [112] 6 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 4 4 4 5
## [149] 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 3 3 3 4 5
str(df)
## 'data.frame': 168 obs. of 5 variables:
## $ Company.Number : int 53 14 16 20 28 32 40 42 47 49 ...
## $ status : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 2 2 ...
## $ funding_total_usd: int 5000 10000 10000 10000 10000 10000 10000 10000 10000 10000 ...
## $ funding_rounds : int 1 1 1 1 1 1 1 1 1 1 ...
## $ investors : int 1 1 1 1 1 1 1 1 1 1 ...
The attribute names were shortened (funding, rounds, investors) to simplify the model formula.
library(rpart)
For this model, the rpart package is used. Its name is an acronym for Recursive Partitioning And Regression Trees, and it implements a CART-style algorithm.
treeAnalysis <- rpart(status ~ funding + rounds + investors , data = df)
treeAnalysis
## n= 168
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 168 64 0 (0.6190476 0.3809524)
## 2) investors< 3.5 113 35 0 (0.6902655 0.3097345) *
## 3) investors>=3.5 55 26 1 (0.4727273 0.5272727)
## 6) funding>=1550000 32 14 0 (0.5625000 0.4375000)
## 12) funding< 3737500 9 2 0 (0.7777778 0.2222222) *
## 13) funding>=3737500 23 11 1 (0.4782609 0.5217391)
## 26) funding>=6166512 15 6 0 (0.6000000 0.4000000) *
## 27) funding< 6166512 8 2 1 (0.2500000 0.7500000) *
## 7) funding< 1550000 23 8 1 (0.3478261 0.6521739) *
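As a quick check of the impurity the splits are working against, the Gini index (rpart’s default splitting criterion) of the root node can be computed directly from the class proportions printed above:
# Gini impurity of the root node, using the (0.619, 0.381) proportions
# shown above; values near 0.5 indicate a highly mixed node.
gini <- function(p) 1 - sum(p^2)
gini(c(0.6190476, 0.3809524))  # ~0.472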
When running the model, “status” remains the factor and dependent attribute while funding, investors, and rounds serve as the independent variables, using the shortened names noted above.
library(rpart.plot)
Finally, to visualize the DT, rpart.plot is incorporated; the extra = 4 argument prints the probability of each class within every node.
rpart.plot(treeAnalysis, extra = 4)
To start, since operating and acquisition/IPO are the two classes of the “status” dependent variable, they are displayed as binary values, with acquisition/IPO coded as “yes” (1) and operating as “no” (0) (Tan et al., 2019, p. 122). Ultimately, we would like to classify the likelihood that an upstate New York company will be acquired or exercise an IPO. Note, however, that the child nodes throughout this DT remain impure to varying degrees, as every child node represents a partition of the data resulting from one of the k outcomes of the attribute test condition (Tan et al., 2019, p. 129).
In a top-down analysis, we’ll first focus on the root node. Here, the probability of a company not being acquired or receiving an IPO – yet still operating – is 62%, leaving 38% potentially earning acquisition/IPO status. The root is then divided by the number of investors: if a company has fewer than four investors, the probability of remaining in operating status rises to 69%, leaving a 31% chance of acquisition or IPO. If it has four or more investors, the probability of being acquired or receiving an IPO increases to 53%. This is how a DT classifies companies into different groups.
The lower branches of the DT are perhaps the most intriguing. Among companies with four or more investors, those with total funding under $1.55 million have a 65% chance of being acquired or receiving an IPO. Those with $1.55 million or more split further: under roughly $3.74 million, the acquisition/IPO probability falls to 22%, and above roughly $6.17 million it sits at 40%. The highest probability exists between those thresholds. If a company with four or more investors falls within the $3.74–$6.17 million funding band, there is a 75% probability that it will be acquired or exit with an IPO.
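To confirm that reading of the tree, the fitted model can be queried directly. The company below is hypothetical and purely illustrative:
# Hypothetical company: $5M funding, 2 rounds, 4 investors -- this lands in
# the $3.74M-$6.17M, four-or-more-investor leaf described above.
newco <- data.frame(funding = 5000000, rounds = 2, investors = 4)
predict(treeAnalysis, newdata = newco, type = "prob")  # ~0.25 / 0.75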
As an alternative with KNN, I reclassified the dataset, treating the funding attribute differently to gauge whether classifications were affected by a change in numerical values and binary structure. In the preprocessing stage, instead of using the raw funding values, which were not normalized in the DT, the data were classified into intervals and assigned a numerical value from 1 to 10 corresponding to the level of funding; the higher the assigned value, the more funding a company received. Additionally, instead of the binary 0/1 naming convention under “status,” which remained the target (factor), each company was labeled either operating or acquired-ipo. Building this from scratch, I wanted to see whether the model proved stronger in terms of overall predictive value. In the process, a new dataset was created and uploaded.
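The interval coding itself was done while building the new file, so the exact cut points are not reproduced here; the sketch below shows the idea with assumed breakpoints only.
# Assumed, illustrative breakpoints -- the actual 1-10 intervals behind
# DSC607_Upstate_W3A.csv are not documented in this report. (The NA funding
# value from the earlier integer overflow stays NA here.)
breaks <- c(0, 1e4, 5e4, 2.5e5, 1e6, 4e6, 1e7, 3e7, 1e8, 1e9, Inf)
funding_level <- cut(funding, breaks = breaks, labels = FALSE)
status_label  <- ifelse(status == "1", "acquired-ipo", "operating")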
df <- read.csv("C:/Users/bjorzech/Desktop/DSC607_Upstate_W3A.csv",stringsAsFactors = FALSE)
(df)
## funding_total_usd funding_rounds investors status
## 1 1 1 2 operating
## 2 1 1 2 operating
## 3 1 1 4 operating
## 4 1 1 2 acquired-ipo
## 5 1 1 1 operating
## 6 1 2 1 operating
## 7 1 1 2 operating
## 8 1 6 3 acquired-ipo
## 9 1 2 4 operating
## 10 1 1 1 operating
## 11 1 1 5 acquired-ipo
## 12 1 1 2 operating
## 13 1 1 4 acquired-ipo
## 14 1 1 1 operating
## 15 1 1 1 acquired-ipo
## 16 1 1 1 operating
## 17 1 1 3 operating
## 18 1 1 3 operating
## 19 1 3 4 operating
## 20 1 1 1 operating
## 21 1 1 3 operating
## 22 1 2 4 operating
## 23 1 1 5 acquired-ipo
## 24 1 1 4 operating
## 25 1 1 3 operating
## 26 1 6 3 operating
## 27 1 2 4 operating
## 28 1 1 1 operating
## 29 1 1 3 operating
## 30 1 2 4 operating
## 31 1 3 4 operating
## 32 1 1 1 acquired-ipo
## 33 1 2 5 operating
## 34 1 3 5 operating
## 35 2 4 3 operating
## 36 2 1 3 acquired-ipo
## 37 2 1 4 operating
## 38 2 1 1 operating
## 39 2 1 4 operating
## 40 2 1 1 operating
## 41 2 2 2 operating
## 42 2 1 1 operating
## 43 2 1 2 acquired-ipo
## 44 2 3 3 operating
## 45 2 1 5 acquired-ipo
## 46 2 3 5 operating
## 47 3 1 1 acquired-ipo
## 48 3 3 3 acquired-ipo
## 49 3 1 1 acquired-ipo
## 50 3 1 1 acquired-ipo
## 51 3 1 1 operating
## 52 3 1 1 operating
## 53 3 1 1 operating
## 54 3 1 3 operating
## 55 3 1 3 acquired-ipo
## 56 3 3 4 operating
## 57 3 1 2 operating
## 58 3 1 3 operating
## 59 3 2 4 operating
## 60 3 1 1 operating
## 61 3 5 5 operating
## 62 3 1 3 operating
## 63 3 1 2 acquired-ipo
## 64 3 1 2 acquired-ipo
## 65 3 1 2 operating
## 66 3 3 2 acquired-ipo
## 67 3 1 2 operating
## 68 3 1 5 operating
## 69 3 2 4 acquired-ipo
## 70 3 1 2 operating
## 71 3 1 1 operating
## 72 3 1 1 operating
## 73 3 1 4 operating
## 74 3 2 5 operating
## 75 3 1 1 acquired-ipo
## 76 3 1 1 operating
## 77 3 1 3 operating
## 78 3 1 4 acquired-ipo
## 79 3 6 1 operating
## 80 3 1 3 operating
## 81 3 4 4 operating
## 82 3 1 1 operating
## 83 3 1 4 operating
## 84 4 3 3 operating
## 85 4 2 3 acquired-ipo
## 86 4 1 4 acquired-ipo
## 87 4 1 3 operating
## 88 4 1 4 acquired-ipo
## 89 4 1 2 operating
## 90 4 5 4 acquired-ipo
## 91 4 1 3 operating
## 92 4 4 4 acquired-ipo
## 93 4 1 5 acquired-ipo
## 94 4 1 2 operating
## 95 4 1 5 acquired-ipo
## 96 4 3 2 operating
## 97 4 1 3 operating
## 98 4 1 3 operating
## 99 4 1 4 acquired-ipo
## 100 4 1 4 acquired-ipo
## 101 4 1 3 operating
## 102 5 2 3 operating
## 103 5 1 1 operating
## 104 5 4 5 acquired-ipo
## 105 5 1 2 operating
## 106 5 1 3 acquired-ipo
## 107 5 1 4 acquired-ipo
## 108 5 1 1 operating
## 109 5 1 1 acquired-ipo
## 110 5 1 3 operating
## 111 5 1 3 acquired-ipo
## 112 5 2 2 operating
## 113 5 1 1 operating
## 114 5 2 3 acquired-ipo
## 115 5 1 2 operating
## 116 5 2 3 acquired-ipo
## 117 5 1 4 acquired-ipo
## 118 5 2 5 acquired-ipo
## 119 5 3 2 operating
## 120 5 1 1 operating
## 121 5 1 4 operating
## 122 5 1 3 operating
## 123 5 1 1 acquired-ipo
## 124 5 1 1 acquired-ipo
## 125 5 1 3 acquired-ipo
## 126 6 1 4 operating
## 127 6 1 5 operating
## 128 6 2 4 operating
## 129 6 1 3 operating
## 130 6 2 1 operating
## 131 6 1 1 operating
## 132 6 1 3 acquired-ipo
## 133 6 1 5 acquired-ipo
## 134 6 1 1 operating
## 135 6 1 5 acquired-ipo
## 136 6 1 5 acquired-ipo
## 137 6 3 4 acquired-ipo
## 138 6 1 3 acquired-ipo
## 139 6 1 3 operating
## 140 6 2 4 acquired-ipo
## 141 6 1 1 operating
## 142 6 1 2 operating
## 143 6 1 4 acquired-ipo
## 144 7 4 3 operating
## 145 7 1 1 operating
## 146 7 1 2 acquired-ipo
## 147 7 1 2 operating
## 148 7 3 4 acquired-ipo
## 149 7 1 1 acquired-ipo
## 150 7 3 3 operating
## 151 7 1 1 operating
## 152 7 1 1 acquired-ipo
## 153 7 2 2 operating
## 154 7 1 3 acquired-ipo
## 155 8 1 3 acquired-ipo
## 156 8 1 1 operating
## 157 8 4 4 acquired-ipo
## 158 8 1 5 acquired-ipo
## 159 8 3 5 acquired-ipo
## 160 8 1 1 acquired-ipo
## 161 8 2 3 operating
## 162 8 1 1 acquired-ipo
## 163 8 2 3 acquired-ipo
## 164 8 2 2 acquired-ipo
## 165 9 2 5 acquired-ipo
## 166 9 1 2 operating
## 167 9 1 4 operating
## 168 10 2 1 operating
str(df)
## 'data.frame': 168 obs. of 4 variables:
## $ funding_total_usd: int 1 1 1 1 1 1 1 1 1 1 ...
## $ funding_rounds : int 1 1 1 1 1 2 1 6 2 1 ...
## $ investors : int 2 2 4 2 1 1 2 3 4 1 ...
## $ status : chr "operating" "operating" "operating" "acquired-ipo" ...
Similar to the DT, some preprocessing steps were taken to fit the KNN model. The variables “funding_rounds,” “investors,” and “funding_total_usd” were reclassified as numeric values, while “status,” initially read as a character value, was converted to a factor and remains the target. Finally, the structure was reviewed again in order to build the KNN model.
df$funding_rounds <- as.numeric(df$funding_rounds)
df$investors <- as.numeric(df$investors)
df$funding_total_usd <- as.numeric(df$funding_total_usd)
df$status <- as.factor(df$status)
rounds <- df$funding_rounds
investors <- df$investors
funding <- df$funding_total_usd
status <- df$status
str(df)
## 'data.frame': 168 obs. of 4 variables:
## $ funding_total_usd: num 1 1 1 1 1 1 1 1 1 1 ...
## $ funding_rounds : num 1 1 1 1 1 2 1 6 2 1 ...
## $ investors : num 2 2 4 2 1 1 2 3 4 1 ...
## $ status : Factor w/ 2 levels "acquired-ipo",..: 2 2 2 1 2 2 2 1 2 2 ...
# Min-max normalization: rescales a numeric vector to the [0, 1] interval
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
df1 <- as.data.frame(lapply(df[1:3], normalize))
head(df1)
## funding_total_usd funding_rounds investors
## 1 0 0.0 0.25
## 2 0 0.0 0.25
## 3 0 0.0 0.75
## 4 0 0.0 0.25
## 5 0 0.0 0.00
## 6 0 0.2 0.00
Further, the data are normalized, but only the numeric variables; the “status” variable is excluded because it serves as the target (factor) in R. With any predictive or classification algorithm that relies on distance, the data should be normalized (Tan et al., 2019, p. 211). The following model illustrates the fourth characteristic of nearest neighbor classifiers (Tan et al., 2019, pp. 210–211): the decision boundaries of KNN classifiers have high variability because they depend on the composition of training examples in the local neighborhood, and increasing the number of nearest neighbors, according to Tan et al. (2019), may reduce such variability. This is worth bearing in mind here, since the sample focuses exclusively on upstate New York companies and is relatively small.
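To make the distance argument concrete, compare the Euclidean distance between companies 1 and 3 on the raw coded scales versus the min-max normalized scales:
# Companies 1 and 3 differ only in investor count (2 vs. 4). Raw coding
# gives a distance of 2; after min-max normalization it shrinks to 0.5,
# keeping any single attribute from dominating the neighbor search.
dist(rbind(df[1, 1:3], df[3, 1:3]))  # raw coded values
dist(rbind(df1[1, ], df1[3, ]))      # normalized to [0, 1]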
num.vars <- sapply(df, is.numeric)
df[num.vars] <- lapply(df[num.vars], scale)
myvars <- c("funding_total_usd", "investors", "funding_rounds")
df.subset <- df[myvars]
summary(df.subset)
## funding_total_usd.V1 investors.V1 funding_rounds.V1
## Min. :-1.3010259 Min. :-1.2963726 Min. :-0.534123
## 1st Qu.:-0.8594656 1st Qu.:-1.2963726 1st Qu.:-0.534123
## Median : 0.0236550 Median : 0.1751855 Median :-0.534123
## Mean : 0.0000000 Mean : 0.0000000 Mean : 0.000000
## 3rd Qu.: 0.9067756 3rd Qu.: 0.9109645 3rd Qu.: 0.400592
## Max. : 2.6730168 Max. : 1.6467436 Max. : 4.139452
As a secondary step, the independent variables are standardized with scale(), producing z-scores with mean 0 and unit variance, as the summary confirms. These variables are grouped in a subset, and it is this standardized subset – rather than the min-max normalized df1 above – that the KNN models below use.
set.seed(123)
test <- 1:56
train.df <- df.subset[-test,]
test.df <- df.subset[test,]
train.def <- df$status[-test]
test.def <- df$status[test]
Next, the data are split into training and testing sets, with the test set taking roughly one-third of the sample: the training set holds 112 observations and the test set 56. Note that test <- 1:56 assigns the first 56 rows rather than a random draw; because the file is sorted by funding level, a randomized split (sketched below) would better satisfy the sampling assumption. We are also operating on the premise, per the lecture, that the greater the value of K, the smoother the decision boundaries; this can lower variance but increase bias. Additionally, one of the basic requirements of clean model setup and selection is that training and testing data must be kept separate (Tan et al., 2019, p. 172).
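For reference, a randomized split would look like the following sketch; sample() draws 56 row indices at random instead of taking the first 56. New names are used so the results below still reflect the original split.
# Randomized 1/3 test split (sketch only); the seed makes it reproducible.
set.seed(123)
rand.idx   <- sample(nrow(df.subset), 56)  # 56 random row numbers
rand.test  <- df.subset[rand.idx, ]        # randomized test set
rand.train <- df.subset[-rand.idx, ]       # remaining 112 for training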
library(class)
knn.1 <- knn(train.df, test.df, train.def, k=1)
knn.5 <- knn(train.df, test.df, train.def, k=5)
knn.10 <- knn(train.df, test.df, train.def, k=10)
56 * sum(test.def == knn.1)/56
## [1] 37
The K=1 model correctly classifies 37 of the 56 test outcomes (66.1%); note that the expression above simplifies to the raw count of correct classifications, not a percentage.
56 * sum(test.def == knn.5)/56
## [1] 41
The K=5 model correctly classifies 41 of 56 test outcomes (73.2%).
56 * sum(test.def == knn.10)/56
## [1] 43
The K=10 model correctly classifies 43 of 56 test outcomes (76.8%).
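Equivalently, the accuracies can be computed directly as proportions of the 56 test cases:
# Proportion of correct test-set classifications for each K
mean(test.def == knn.1)   # 37/56 ~ 0.661
mean(test.def == knn.5)   # 41/56 ~ 0.732
mean(test.def == knn.10)  # 43/56 ~ 0.768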
table(knn.1 ,test.def)
## test.def
## knn.1 acquired-ipo operating
## acquired-ipo 0 4
## operating 15 37
table(knn.5 ,test.def)
## test.def
## knn.5 acquired-ipo operating
## acquired-ipo 1 1
## operating 14 40
table(knn.10 ,test.def)
## test.def
## knn.10 acquired-ipo operating
## acquired-ipo 4 2
## operating 11 39
To further evaluate the models, I examined the confusion matrices above for K at 1, 5, and 10, which cross-tabulate the predicted against the actual positive and negative instances (Tan et al., 2019, p. 167).
For K=1, none of the four companies predicted for acquisition/IPO actually reached that status (0%), while 15 actual acquisitions were misclassified as operating against 37 correct operating predictions (40.5%). For K=5, the precision improves to 50%, but on only two predicted acquisitions; among the predicted operating companies, 14 actual acquisitions stand against 40 correct predictions (35%). Finally, for K=10, four of the six companies predicted for acquisition or IPO were correct (66.7%); among the predicted operating companies, the ratio drops to 28.2%, with 11 misclassified against 39 correct.
With this model, increasing K increases the classification success rate. However, the predictive power remains less conclusive for companies that stay in operating status.
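One way to quantify that trade-off is precision and recall for the acquired-ipo class; a short sketch using the K=10 table above:
# Rows of the table are predictions, columns are actual classes.
cm <- table(knn.10, test.def)
cm["acquired-ipo", "acquired-ipo"] / sum(cm["acquired-ipo", ])  # precision: 4/6
cm["acquired-ipo", "acquired-ipo"] / sum(cm[, "acquired-ipo"])  # recall: 4/15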
Since the KNN models remain somewhat inconclusive even after normalizing the data, I chose to visualize the findings via two methods to display the models holistically with all variables: the dependent factor (“status”) and the three independent attributes. In the first visualization, funding shows the most variability yet perhaps the greatest impact on “status” and on whether an upstate New York company will reach acquisition or IPO level. The number of investors and rounds still have an impact, but not as strong as funding’s.
Additionally, in the second visualization, KNN principles are included to show the predictability across all variables and their potential relationships. The bottom row shows the relationship between “status” and the independent variables, with the red dots consistent with the KNN results above. The histograms and correlation coefficients are included intentionally to surface other findings: for instance, the number of funding rounds has a stronger relationship to “status” than the number of investors does. It may prove statistically significant, but more analysis and data are needed. Conversely, the relationship between investors and rounds appears statistically insignificant, which is consistent with industry expectations.
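The code for the first visualization is not reproduced in this report; a minimal sketch of one way to generate a comparable view, assuming side-by-side boxplots of the standardized funding variable by status, follows.
# Assumed form of the first visualization: standardized funding by status.
boxplot(as.numeric(funding_total_usd) ~ status, data = df,
        xlab = "status", ylab = "standardized funding_total_usd")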
library(psych)
pairs.panels(df[, -5],  # df has four columns, so -5 drops nothing; all are plotted
             method = "pearson",
             hist.col = "#00AFBB",
             density = TRUE,
             ellipses = TRUE)
In conclusion, more data are needed to create stronger algorithms with predictive value, but the foundation remains for further exploration. Between the two models, though, decision trees seem more appropriate for this analytical situation.
Tan, P.-N., Steinbach, M., Karpatne, A., & Kumar, V. (2019). Introduction to data mining. New York, NY: Pearson Education, Inc.