This dataset is called “Startup Growth & Funding Trends”. It
contains data on many startups and the characteristics that may
determine whether each one is profitable.
Source of the dataset: Startup Growth & Funding Trends
Since this dataset will be used for Decision Trees and Logistic
Regression, it is necessary to determine the dependent and independent
variables.
Note that the descriptions were copied from the source.

| Dependent variable | Description |
|---|---|
| Profitable | A binary indicator (1 = Profitable, 0 = Not Profitable) |

| Independent variables | Description |
|---|---|
| Funding Rounds | The total number of funding rounds raised by the startup (1-5) |
| Funding Amount (M USD) | The total amount of funding received in millions of USD |
| Valuation (M USD) | The startup’s post-money valuation in millions of USD |
| Revenue (M USD) | The estimated annual revenue in millions of USD |
| Employees | The number of employees working in the startup (ranging from 5 to 5000) |
| Market Share (%) | The percentage of the market the startup has captured |
| Startup Age | The age of the startup. It will be equal to 2026 - Year Founded |

| Other variables (not used directly in the models) | Description |
|---|---|
| Startup Name | The name of the startup |
| Industry | The sector in which the startup operates (e.g., AI, FinTech, HealthTech) |
| Year Founded | The year when the startup was founded |
| Exit Status | The ownership status of the startup |
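# Packages used below: caTools (sample.split), rpart and rpart.plot (decision tree), ROCR (ROC curve and AUC)
library(caTools)
library(rpart)
library(rpart.plot)
library(ROCR)
# Before running the program, place startup_data.csv in the working directory
# (or import it via: File -> Import Dataset -> From Text (readr)...)
startup_data <- read.csv("startup_data.csv")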
# Rename columns (to save my sanity)
colnames(startup_data) <- c("Startup_Name", "Industry", "Funding_Rounds", "Funding_Amount", "Valuation", "Revenue", "Employees", "Market_Share", "Profitable", "Year_Founded", "Region", "Exit_Status")
#Startup Age variable will also be calculated
startup_data$Startup_Age <- 2026 - startup_data$Year_Founded
It is necessary to split the dataset into a training set (70%) and a test set (30%), so that the models can be evaluated on data they were not fitted to:
set.seed(136)
spl <- sample.split(startup_data$Profitable, SplitRatio = 0.7)
train <- subset(startup_data, spl==TRUE)
test <- subset(startup_data, spl==FALSE)
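
`sample.split` is given the outcome column so that both subsets keep roughly the same share of profitable startups. The idea can be sketched in base R as follows. This is only an illustration on made-up labels: the vector `y` and the helper `stratified_split` are hypothetical, and the sketch does not claim to reproduce the exact procedure inside caTools.

```r
# Toy labels (hypothetical): 60 non-profitable and 40 profitable cases
set.seed(136)
y <- rep(c(0, 1), times = c(60, 40))

# Hypothetical helper: marks `ratio` of the rows of each class as training rows
stratified_split <- function(y, ratio = 0.7) {
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)                       # positions of this class
    n_train <- round(ratio * length(idx))        # how many of them go to train
    in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  in_train
}

spl_demo <- stratified_split(y)

# Cross-tabulate class against split membership: each class is split 70/30
table(y, spl_demo)
```

Splitting each class separately is what keeps the class proportions of the training and test sets close to those of the full dataset.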
As previously mentioned, the dataset will be analyzed with two different methods: first with Decision Trees, then with Logistic Regression. The two results will be compared in the conclusion.
#Setting up the decision tree
StartupTree <- rpart(Profitable ~ Funding_Rounds + Funding_Amount + Valuation + Revenue + Employees + Market_Share + Startup_Age, data = train, method = "class", minbucket=25)
#Displaying the decision tree
prp(StartupTree)
summary(StartupTree)
## Call:
## rpart(formula = Profitable ~ Funding_Rounds + Funding_Amount +
## Valuation + Revenue + Employees + Market_Share + Startup_Age,
## data = train, method = "class", minbucket = 25)
## n= 350
##
## CP nsplit rel error xerror xstd
## 1 0.04966887 0 1.0000000 1.000000 0.06136264
## 2 0.02317881 4 0.7682119 1.092715 0.06184675
## 3 0.01000000 6 0.7218543 1.052980 0.06168851
##
## Variable importance
## Valuation Funding_Amount Market_Share Revenue Employees
## 38 28 12 11 8
## Startup_Age Funding_Rounds
## 2 1
##
## Node number 1: 350 observations, complexity param=0.04966887
## predicted class=0 expected loss=0.4314286 P(node) =1
## class counts: 199 151
## probabilities: 0.569 0.431
## left son=2 (36 obs) right son=3 (314 obs)
## Primary splits:
## Funding_Amount < 31.245 to the left, improve=6.868161, (0 missing)
## Market_Share < 5.365 to the left, improve=4.740918, (0 missing)
## Funding_Rounds < 4.5 to the left, improve=2.765714, (0 missing)
## Valuation < 1255.355 to the left, improve=2.731429, (0 missing)
## Revenue < 89.81 to the left, improve=2.203002, (0 missing)
## Surrogate splits:
## Valuation < 242.155 to the left, agree=0.969, adj=0.694, (0 split)
## Market_Share < 0.22 to the left, agree=0.900, adj=0.028, (0 split)
##
## Node number 2: 36 observations
## predicted class=0 expected loss=0.1388889 P(node) =0.1028571
## class counts: 31 5
## probabilities: 0.861 0.139
##
## Node number 3: 314 observations, complexity param=0.04966887
## predicted class=0 expected loss=0.4649682 P(node) =0.8971429
## class counts: 168 146
## probabilities: 0.535 0.465
## left son=6 (289 obs) right son=7 (25 obs)
## Primary splits:
## Valuation < 409.355 to the right, improve=6.097811, (0 missing)
## Market_Share < 8.145 to the left, improve=4.256508, (0 missing)
## Funding_Amount < 130.11 to the right, improve=4.244246, (0 missing)
## Revenue < 89.81 to the left, improve=1.777619, (0 missing)
## Funding_Rounds < 4.5 to the left, improve=1.529299, (0 missing)
## Surrogate splits:
## Funding_Amount < 42.54 to the right, agree=0.949, adj=0.36, (0 split)
## Revenue < 0.435 to the right, agree=0.927, adj=0.08, (0 split)
##
## Node number 6: 289 observations, complexity param=0.04966887
## predicted class=0 expected loss=0.4359862 P(node) =0.8257143
## class counts: 163 126
## probabilities: 0.564 0.436
## left son=12 (155 obs) right son=13 (134 obs)
## Primary splits:
## Market_Share < 5.365 to the left, improve=4.402552, (0 missing)
## Valuation < 1255.355 to the left, improve=2.969069, (0 missing)
## Funding_Amount < 258.905 to the left, improve=1.842587, (0 missing)
## Revenue < 89.81 to the left, improve=1.791264, (0 missing)
## Funding_Rounds < 4.5 to the left, improve=1.678135, (0 missing)
## Surrogate splits:
## Valuation < 1283.16 to the left, agree=0.606, adj=0.149, (0 split)
## Employees < 1690 to the left, agree=0.564, adj=0.060, (0 split)
## Startup_Age < 6.5 to the right, agree=0.557, adj=0.045, (0 split)
## Funding_Amount < 293.715 to the left, agree=0.554, adj=0.037, (0 split)
## Revenue < 1.665 to the right, agree=0.540, adj=0.007, (0 split)
##
## Node number 7: 25 observations
## predicted class=1 expected loss=0.2 P(node) =0.07142857
## class counts: 5 20
## probabilities: 0.200 0.800
##
## Node number 12: 155 observations, complexity param=0.02317881
## predicted class=0 expected loss=0.3548387 P(node) =0.4428571
## class counts: 100 55
## probabilities: 0.645 0.355
## left son=24 (76 obs) right son=25 (79 obs)
## Primary splits:
## Valuation < 1255.355 to the left, improve=4.1522860, (0 missing)
## Funding_Amount < 253.27 to the left, improve=2.3704090, (0 missing)
## Employees < 1755 to the right, improve=1.8503050, (0 missing)
## Revenue < 49.57 to the right, improve=1.7524500, (0 missing)
## Market_Share < 3.175 to the right, improve=0.8394941, (0 missing)
## Surrogate splits:
## Funding_Amount < 144.115 to the left, agree=0.742, adj=0.474, (0 split)
## Employees < 1125 to the left, agree=0.594, adj=0.171, (0 split)
## Startup_Age < 21.5 to the left, agree=0.574, adj=0.132, (0 split)
## Funding_Rounds < 2.5 to the right, agree=0.561, adj=0.105, (0 split)
## Revenue < 71.13 to the left, agree=0.542, adj=0.066, (0 split)
##
## Node number 13: 134 observations, complexity param=0.04966887
## predicted class=1 expected loss=0.4701493 P(node) =0.3828571
## class counts: 63 71
## probabilities: 0.470 0.530
## left son=26 (46 obs) right son=27 (88 obs)
## Primary splits:
## Revenue < 35.07 to the left, improve=3.599139, (0 missing)
## Funding_Amount < 129.885 to the right, improve=2.608360, (0 missing)
## Market_Share < 6.305 to the right, improve=2.222478, (0 missing)
## Startup_Age < 30.5 to the right, improve=1.360909, (0 missing)
## Funding_Rounds < 4.5 to the left, improve=1.265867, (0 missing)
## Surrogate splits:
## Funding_Amount < 287.21 to the right, agree=0.687, adj=0.087, (0 split)
## Market_Share < 9.365 to the right, agree=0.687, adj=0.087, (0 split)
## Startup_Age < 35.5 to the right, agree=0.664, adj=0.022, (0 split)
##
## Node number 24: 76 observations
## predicted class=0 expected loss=0.2368421 P(node) =0.2171429
## class counts: 58 18
## probabilities: 0.763 0.237
##
## Node number 25: 79 observations, complexity param=0.02317881
## predicted class=0 expected loss=0.4683544 P(node) =0.2257143
## class counts: 42 37
## probabilities: 0.532 0.468
## left son=50 (54 obs) right son=51 (25 obs)
## Primary splits:
## Employees < 1866 to the right, improve=2.1551050, (0 missing)
## Revenue < 50.97 to the right, improve=1.0058550, (0 missing)
## Market_Share < 2.19 to the right, improve=0.9136154, (0 missing)
## Funding_Amount < 182.865 to the right, improve=0.6369446, (0 missing)
## Valuation < 1951.51 to the left, improve=0.5515534, (0 missing)
## Surrogate splits:
## Market_Share < 0.565 to the right, agree=0.709, adj=0.08, (0 split)
## Startup_Age < 28.5 to the left, agree=0.696, adj=0.04, (0 split)
##
## Node number 26: 46 observations
## predicted class=0 expected loss=0.3695652 P(node) =0.1314286
## class counts: 29 17
## probabilities: 0.630 0.370
##
## Node number 27: 88 observations
## predicted class=1 expected loss=0.3863636 P(node) =0.2514286
## class counts: 34 54
## probabilities: 0.386 0.614
##
## Node number 50: 54 observations
## predicted class=0 expected loss=0.3888889 P(node) =0.1542857
## class counts: 33 21
## probabilities: 0.611 0.389
##
## Node number 51: 25 observations
## predicted class=1 expected loss=0.36 P(node) =0.07142857
## class counts: 9 16
## probabilities: 0.360 0.640
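
The CP table at the top of the summary is worth a closer look: `xerror` is the cross-validated error relative to a tree with no splits, and here it is lowest for the root-only tree. A minimal sketch of how that table can be used to choose a pruning level is given below. The numbers are copied from the output above, and the commented `prune()` call assumes the fitted `StartupTree` object is still in the session.

```r
# CP table copied from the summary output above
cptable <- matrix(c(0.04966887, 0, 1.0000000, 1.000000, 0.06136264,
                    0.02317881, 4, 0.7682119, 1.092715, 0.06184675,
                    0.01000000, 6, 0.7218543, 1.052980, 0.06168851),
                  ncol = 5, byrow = TRUE,
                  dimnames = list(NULL, c("CP", "nsplit", "rel error",
                                          "xerror", "xstd")))

# Row with the smallest cross-validated error
best <- which.min(cptable[, "xerror"])
cptable[best, c("CP", "nsplit", "xerror")]

# With the fitted model, the same selection could be applied as:
# pruned <- prune(StartupTree, cp = StartupTree$cptable[best, "CP"])
```

The training error (`rel error`) keeps falling as splits are added while `xerror` does not, which suggests the extra splits fit noise in the training set rather than a pattern that generalizes.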
At this stage, predictions can be made. First, the confusion matrix is calculated (rows are the actual values, columns are the predictions):
PredictCART <- predict(StartupTree, newdata=test, type='class')
table(test$Profitable, PredictCART)
## PredictCART
## 0 1
## 0 54 31
## 1 36 29
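Based on the above table, we can calculate the accuracy:
Accuracy = (54 + 29)/(54 + 29 + 31 + 36) = 0.5533
This is only slightly better than guessing randomly (0.5), and slightly below the 0.5667 (85/150) obtained by always predicting "Not Profitable".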
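
The figures can be recomputed directly from the four counts of the confusion matrix printed above. The sketch below does that by hand, and also reports the majority-class baseline, sensitivity and specificity; the variable names are chosen here for illustration.

```r
# Counts copied from the table above (rows = actual, columns = predicted)
cm <- matrix(c(54, 31,
               36, 29),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)        # share of correct predictions
baseline    <- max(rowSums(cm)) / sum(cm)     # always predicting the majority class
sensitivity <- cm["1", "1"] / sum(cm["1", ])  # profitable startups correctly identified
specificity <- cm["0", "0"] / sum(cm["0", ])  # non-profitable startups correctly identified

round(c(accuracy = accuracy, baseline = baseline,
        sensitivity = sensitivity, specificity = specificity), 4)
```

Comparing accuracy against the majority-class baseline, rather than against 0.5, is the fairer test when the classes are not evenly balanced.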
#Creating a prediction
PredictROC <- predict(StartupTree, newdata = test)
#Making a ROC graph
pred <- prediction(PredictROC[,2], test$Profitable)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE)
#Calculating AUC
auc1 <- as.numeric(performance(pred, "auc")@y.values)
auc1
## [1] 0.5161086
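
To interpret this number: the AUC equals the probability that a randomly chosen profitable startup receives a higher predicted score than a randomly chosen non-profitable one, with ties counted as one half. A value of 0.5 corresponds to a random ordering. The sketch below computes the AUC that way on made-up scores; the data and the helper `auc_pairwise` are hypothetical and only illustrate the definition.

```r
# Hypothetical labels and predicted scores, for illustration only
labels <- c(0, 0, 1, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8, 0.7, 0.9)

# Hypothetical helper: AUC as the share of (positive, negative) pairs
# in which the positive case has the higher score (ties count 0.5)
auc_pairwise <- function(scores, labels) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  cmp <- outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n))
  mean(cmp)
}

auc_demo <- auc_pairwise(scores, labels)
auc_demo  # 6 of the 9 pairs are ordered correctly
```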
Finally, the dataset will be analyzed with logistic regression:
# Clean all the NA values
train2 = na.omit(train)
test2 = na.omit(test)
# Creating logistic regression model
lr = glm(Profitable ~ Funding_Rounds + Funding_Amount + Valuation + Revenue + Employees + Market_Share + Startup_Age, data =train2, family = binomial)
# Displaying coefficients
summary(lr)
##
## Call:
## glm(formula = Profitable ~ Funding_Rounds + Funding_Amount +
## Valuation + Revenue + Employees + Market_Share + Startup_Age,
## family = binomial, data = train2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.014e-01 5.168e-01 -1.744 0.0811 .
## Funding_Rounds 5.929e-02 7.779e-02 0.762 0.4459
## Funding_Amount -1.899e-03 2.159e-03 -0.879 0.3791
## Valuation 2.843e-04 1.887e-04 1.506 0.1321
## Revenue 2.873e-03 3.748e-03 0.767 0.4433
## Employees 1.391e-05 8.080e-05 0.172 0.8633
## Market_Share 8.003e-02 4.026e-02 1.988 0.0468 *
## Startup_Age -1.303e-02 1.192e-02 -1.093 0.2743
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 478.60 on 349 degrees of freedom
## Residual deviance: 468.69 on 342 degrees of freedom
## AIC: 484.69
##
## Number of Fisher Scoring iterations: 4
Only the Market_Share variable is statistically significant at the 5% level (p = 0.0468), and only marginally so.
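
Logistic regression coefficients are on the log-odds scale, so exponentiating one gives an odds ratio. The sketch below does this for Market_Share, using the estimate and standard error copied from the summary above, and adds an approximate 95% Wald interval; the variable names are chosen here for illustration.

```r
# Estimate and standard error for Market_Share, copied from the summary above
beta <- 8.003e-02
se   <- 4.026e-02

# Multiplicative change in the odds of being profitable
# per one additional percentage point of market share
odds_ratio <- exp(beta)

# Approximate 95% Wald interval, transformed to the odds-ratio scale
ci_95 <- exp(beta + c(-1, 1) * qnorm(0.975) * se)

round(c(odds_ratio = odds_ratio, lower = ci_95[1], upper = ci_95[2]), 3)
```

The lower bound of the interval lies only just above 1, which matches the p-value being just below 0.05.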
Next, we evaluate how suitable this method is for prediction:
# Creating the prediction on test dataset
predictTest = predict(lr, type = "response", newdata = test2)
# Preparing ROCR graph
ROCRpred <- prediction(predictTest, test2$Profitable)
ROCRperf <- performance(ROCRpred, 'tpr', 'fpr')
#Displaying ROCR graph
plot(ROCRperf,colorize = TRUE)
#Calculating AUC
auc2 <- as.numeric(performance(ROCRpred, "auc")@y.values)
auc2
## [1] 0.6126697
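Comparing the results from the Decision Tree and Logistic Regression analyses, it is evident that Logistic Regression achieves the higher AUC on the test set (0.6127 against 0.5161 for the Decision Tree).
As a result, Logistic Regression is a better method for prediction than Decision Tree on this dataset, although both models separate profitable from non-profitable startups only weakly.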