Dataset Description

This dataset, “Startup Growth & Funding Trends”, contains records on a number of startups along with the characteristics that may determine whether each one is profitable.

Link to the source of the dataset: Startup Growth & Funding Trends

Variables

Since this dataset will be used for Decision Trees and Logistic Regression, the dependent and independent variables must be identified first.
Note that the variable descriptions below were copied from the source.

Dependent

Variable     Description
Profitable   A binary indicator (1 = Profitable, 0 = Not Profitable)

Independent

Variable                 Description
Funding Rounds           The total number of funding rounds raised by the startup (1-5)
Funding Amount (M USD)   The total amount of funding received, in millions of USD
Valuation (M USD)        The startup’s post-money valuation, in millions of USD
Revenue (M USD)          The estimated annual revenue, in millions of USD
Employees                The number of employees working in the startup (ranging from 5 to 5000)
Market Share (%)         The percentage of the market the startup has captured
Startup Age              The age of the startup, computed as 2026 - Year Founded

Other variables

Variable       Description
Startup Name   The name of the startup
Industry       The sector in which the startup operates (e.g., AI, FinTech, HealthTech)
Year Founded   The year when the startup was founded
Region         The geographic region where the startup is based
Exit Status    The ownership status of the startup

Dataset loading

# Load the required libraries
library(caTools)    # sample.split()
library(rpart)      # decision trees
library(rpart.plot) # prp()
library(ROCR)       # prediction(), performance()

# Alternatively, the dataset can be imported in RStudio via: File -> Import Dataset -> From Text (readr)...
startup_data <- read.csv("startup_data.csv")

# Rename columns (to save my sanity)
colnames(startup_data) <- c("Startup_Name", "Industry", "Funding_Rounds", "Funding_Amount",
                            "Valuation", "Revenue", "Employees", "Market_Share",
                            "Profitable", "Year_Founded", "Region", "Exit_Status")

# Derive the Startup Age variable
startup_data$Startup_Age <- 2026 - startup_data$Year_Founded
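
A quick sanity check (a minimal sketch; the exact column types depend on the source file) confirms the renaming and the derived column:

# Inspect the structure after renaming
str(startup_data)

# Class balance of the dependent variable
table(startup_data$Profitable)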

Splitting the dataset

The dataset must be split into training and test sets, which both methods will share:

  • Decision Tree
  • Logistic Regression

set.seed(136)
spl <- sample.split(startup_data$Profitable, SplitRatio = 0.7)
train <- subset(startup_data, spl == TRUE)
test <- subset(startup_data, spl == FALSE)
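
Because sample.split stratifies on the outcome variable, the share of profitable startups should be approximately equal in both subsets. A quick check (a sketch reusing the objects above):

# Class proportions should roughly match across train and test
prop.table(table(train$Profitable))
prop.table(table(test$Profitable))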

Decision Trees

As mentioned previously, the dataset will be analyzed with two different methods: first with Decision Trees, then with Logistic Regression. Both results will be compared in the conclusion.

Creating decision tree

#Setting up the decision tree (minbucket = 25 requires at least 25 observations per terminal node)
StartupTree <- rpart(Profitable ~ Funding_Rounds + Funding_Amount + Valuation + Revenue +
                       Employees + Market_Share + Startup_Age,
                     data = train, method = "class", minbucket = 25)

#Displaying the decision tree
prp(StartupTree)

summary(StartupTree)
## Call:
## rpart(formula = Profitable ~ Funding_Rounds + Funding_Amount + 
##     Valuation + Revenue + Employees + Market_Share + Startup_Age, 
##     data = train, method = "class", minbucket = 25)
##   n= 350 
## 
##           CP nsplit rel error   xerror       xstd
## 1 0.04966887      0 1.0000000 1.000000 0.06136264
## 2 0.02317881      4 0.7682119 1.092715 0.06184675
## 3 0.01000000      6 0.7218543 1.052980 0.06168851
## 
## Variable importance
##      Valuation Funding_Amount   Market_Share        Revenue      Employees 
##             38             28             12             11              8 
##    Startup_Age Funding_Rounds 
##              2              1 
## 
## Node number 1: 350 observations,    complexity param=0.04966887
##   predicted class=0  expected loss=0.4314286  P(node) =1
##     class counts:   199   151
##    probabilities: 0.569 0.431 
##   left son=2 (36 obs) right son=3 (314 obs)
##   Primary splits:
##       Funding_Amount < 31.245   to the left,  improve=6.868161, (0 missing)
##       Market_Share   < 5.365    to the left,  improve=4.740918, (0 missing)
##       Funding_Rounds < 4.5      to the left,  improve=2.765714, (0 missing)
##       Valuation      < 1255.355 to the left,  improve=2.731429, (0 missing)
##       Revenue        < 89.81    to the left,  improve=2.203002, (0 missing)
##   Surrogate splits:
##       Valuation    < 242.155  to the left,  agree=0.969, adj=0.694, (0 split)
##       Market_Share < 0.22     to the left,  agree=0.900, adj=0.028, (0 split)
## 
## Node number 2: 36 observations
##   predicted class=0  expected loss=0.1388889  P(node) =0.1028571
##     class counts:    31     5
##    probabilities: 0.861 0.139 
## 
## Node number 3: 314 observations,    complexity param=0.04966887
##   predicted class=0  expected loss=0.4649682  P(node) =0.8971429
##     class counts:   168   146
##    probabilities: 0.535 0.465 
##   left son=6 (289 obs) right son=7 (25 obs)
##   Primary splits:
##       Valuation      < 409.355  to the right, improve=6.097811, (0 missing)
##       Market_Share   < 8.145    to the left,  improve=4.256508, (0 missing)
##       Funding_Amount < 130.11   to the right, improve=4.244246, (0 missing)
##       Revenue        < 89.81    to the left,  improve=1.777619, (0 missing)
##       Funding_Rounds < 4.5      to the left,  improve=1.529299, (0 missing)
##   Surrogate splits:
##       Funding_Amount < 42.54    to the right, agree=0.949, adj=0.36, (0 split)
##       Revenue        < 0.435    to the right, agree=0.927, adj=0.08, (0 split)
## 
## Node number 6: 289 observations,    complexity param=0.04966887
##   predicted class=0  expected loss=0.4359862  P(node) =0.8257143
##     class counts:   163   126
##    probabilities: 0.564 0.436 
##   left son=12 (155 obs) right son=13 (134 obs)
##   Primary splits:
##       Market_Share   < 5.365    to the left,  improve=4.402552, (0 missing)
##       Valuation      < 1255.355 to the left,  improve=2.969069, (0 missing)
##       Funding_Amount < 258.905  to the left,  improve=1.842587, (0 missing)
##       Revenue        < 89.81    to the left,  improve=1.791264, (0 missing)
##       Funding_Rounds < 4.5      to the left,  improve=1.678135, (0 missing)
##   Surrogate splits:
##       Valuation      < 1283.16  to the left,  agree=0.606, adj=0.149, (0 split)
##       Employees      < 1690     to the left,  agree=0.564, adj=0.060, (0 split)
##       Startup_Age    < 6.5      to the right, agree=0.557, adj=0.045, (0 split)
##       Funding_Amount < 293.715  to the left,  agree=0.554, adj=0.037, (0 split)
##       Revenue        < 1.665    to the right, agree=0.540, adj=0.007, (0 split)
## 
## Node number 7: 25 observations
##   predicted class=1  expected loss=0.2  P(node) =0.07142857
##     class counts:     5    20
##    probabilities: 0.200 0.800 
## 
## Node number 12: 155 observations,    complexity param=0.02317881
##   predicted class=0  expected loss=0.3548387  P(node) =0.4428571
##     class counts:   100    55
##    probabilities: 0.645 0.355 
##   left son=24 (76 obs) right son=25 (79 obs)
##   Primary splits:
##       Valuation      < 1255.355 to the left,  improve=4.1522860, (0 missing)
##       Funding_Amount < 253.27   to the left,  improve=2.3704090, (0 missing)
##       Employees      < 1755     to the right, improve=1.8503050, (0 missing)
##       Revenue        < 49.57    to the right, improve=1.7524500, (0 missing)
##       Market_Share   < 3.175    to the right, improve=0.8394941, (0 missing)
##   Surrogate splits:
##       Funding_Amount < 144.115  to the left,  agree=0.742, adj=0.474, (0 split)
##       Employees      < 1125     to the left,  agree=0.594, adj=0.171, (0 split)
##       Startup_Age    < 21.5     to the left,  agree=0.574, adj=0.132, (0 split)
##       Funding_Rounds < 2.5      to the right, agree=0.561, adj=0.105, (0 split)
##       Revenue        < 71.13    to the left,  agree=0.542, adj=0.066, (0 split)
## 
## Node number 13: 134 observations,    complexity param=0.04966887
##   predicted class=1  expected loss=0.4701493  P(node) =0.3828571
##     class counts:    63    71
##    probabilities: 0.470 0.530 
##   left son=26 (46 obs) right son=27 (88 obs)
##   Primary splits:
##       Revenue        < 35.07    to the left,  improve=3.599139, (0 missing)
##       Funding_Amount < 129.885  to the right, improve=2.608360, (0 missing)
##       Market_Share   < 6.305    to the right, improve=2.222478, (0 missing)
##       Startup_Age    < 30.5     to the right, improve=1.360909, (0 missing)
##       Funding_Rounds < 4.5      to the left,  improve=1.265867, (0 missing)
##   Surrogate splits:
##       Funding_Amount < 287.21   to the right, agree=0.687, adj=0.087, (0 split)
##       Market_Share   < 9.365    to the right, agree=0.687, adj=0.087, (0 split)
##       Startup_Age    < 35.5     to the right, agree=0.664, adj=0.022, (0 split)
## 
## Node number 24: 76 observations
##   predicted class=0  expected loss=0.2368421  P(node) =0.2171429
##     class counts:    58    18
##    probabilities: 0.763 0.237 
## 
## Node number 25: 79 observations,    complexity param=0.02317881
##   predicted class=0  expected loss=0.4683544  P(node) =0.2257143
##     class counts:    42    37
##    probabilities: 0.532 0.468 
##   left son=50 (54 obs) right son=51 (25 obs)
##   Primary splits:
##       Employees      < 1866     to the right, improve=2.1551050, (0 missing)
##       Revenue        < 50.97    to the right, improve=1.0058550, (0 missing)
##       Market_Share   < 2.19     to the right, improve=0.9136154, (0 missing)
##       Funding_Amount < 182.865  to the right, improve=0.6369446, (0 missing)
##       Valuation      < 1951.51  to the left,  improve=0.5515534, (0 missing)
##   Surrogate splits:
##       Market_Share < 0.565    to the right, agree=0.709, adj=0.08, (0 split)
##       Startup_Age  < 28.5     to the left,  agree=0.696, adj=0.04, (0 split)
## 
## Node number 26: 46 observations
##   predicted class=0  expected loss=0.3695652  P(node) =0.1314286
##     class counts:    29    17
##    probabilities: 0.630 0.370 
## 
## Node number 27: 88 observations
##   predicted class=1  expected loss=0.3863636  P(node) =0.2514286
##     class counts:    34    54
##    probabilities: 0.386 0.614 
## 
## Node number 50: 54 observations
##   predicted class=0  expected loss=0.3888889  P(node) =0.1542857
##     class counts:    33    21
##    probabilities: 0.611 0.389 
## 
## Node number 51: 25 observations
##   predicted class=1  expected loss=0.36  P(node) =0.07142857
##     class counts:     9    16
##    probabilities: 0.360 0.640

Prediction

At this stage, predictions can be made on the test set. First, the confusion matrix is calculated:

PredictCART <- predict(StartupTree, newdata=test, type='class')
table(test$Profitable, PredictCART)
##    PredictCART
##      0  1
##   0 54 31
##   1 36 29

Based on the above table we can calculate the accuracy:

Accuracy = (54 + 29)/(54 + 31 + 36 + 29) = 83/150 ≈ 0.5533

This is only slightly better than random guessing (0.5), and actually below the baseline of always predicting the majority class (85/150 ≈ 0.5667).
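
The same figure can be computed directly in R (a minimal sketch reusing the objects above):

# Overall accuracy from the confusion matrix: (54 + 29)/150
tab <- table(test$Profitable, PredictCART)
sum(diag(tab)) / sum(tab)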

#Creating a prediction
PredictROC <- predict(StartupTree, newdata = test)

#Making a ROC graph
pred <- prediction(PredictROC[,2], test$Profitable)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE)

#Calculating AUC
auc1 <- as.numeric(performance(pred, "auc")@y.values)
auc1
## [1] 0.5161086
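
To read the optimal threshold more precisely than from the colorized curve, the cutoffs can be extracted from the ROCR performance object. A sketch, assuming Youden’s J statistic (TPR - FPR) as the selection criterion:

# Each cutoff with its true and false positive rates
cutoffs <- data.frame(cutoff = perf@alpha.values[[1]],
                      tpr    = perf@y.values[[1]],
                      fpr    = perf@x.values[[1]])

# Cutoff maximizing Youden's J
cutoffs[which.max(cutoffs$tpr - cutoffs$fpr), ]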

Result

  • Based on the colorized ROC curve, the best threshold appears to lie between 0.45 and 0.6
  • Since AUC = 0.5161086, the model’s predictions are almost the same as random guessing (AUC = 0.5); a pruning check is sketched below
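
One likely explanation is overfitting: in the CP table above, the cross-validated error (xerror) is lowest at zero splits, so cost-complexity pruning would collapse the tree to (or near) the root. A minimal sketch of the standard pruning recipe, which here mostly confirms that the splits do not generalize:

# Cross-validated error per complexity level
printcp(StartupTree)

# Prune back to the CP with the lowest cross-validated error
best_cp <- StartupTree$cptable[which.min(StartupTree$cptable[, "xerror"]), "CP"]
PrunedTree <- prune(StartupTree, cp = best_cp)
PrunedTree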

Logistic Regression

Finally, the dataset will be analyzed with logistic regression.

Creating the logistic regression model

# Remove rows containing NA values
train2 <- na.omit(train)
test2 <- na.omit(test)

# Creating the logistic regression model
lr <- glm(Profitable ~ Funding_Rounds + Funding_Amount + Valuation + Revenue +
            Employees + Market_Share + Startup_Age,
          data = train2, family = binomial)

# Displaying coefficients
summary(lr)
## 
## Call:
## glm(formula = Profitable ~ Funding_Rounds + Funding_Amount + 
##     Valuation + Revenue + Employees + Market_Share + Startup_Age, 
##     family = binomial, data = train2)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    -9.014e-01  5.168e-01  -1.744   0.0811 .
## Funding_Rounds  5.929e-02  7.779e-02   0.762   0.4459  
## Funding_Amount -1.899e-03  2.159e-03  -0.879   0.3791  
## Valuation       2.843e-04  1.887e-04   1.506   0.1321  
## Revenue         2.873e-03  3.748e-03   0.767   0.4433  
## Employees       1.391e-05  8.080e-05   0.172   0.8633  
## Market_Share    8.003e-02  4.026e-02   1.988   0.0468 *
## Startup_Age    -1.303e-02  1.192e-02  -1.093   0.2743  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 478.60  on 349  degrees of freedom
## Residual deviance: 468.69  on 342  degrees of freedom
## AIC: 484.69
## 
## Number of Fisher Scoring iterations: 4

Result

Only the Market_Share variable shows mild statistical significance (p = 0.0468); none of the other predictors is significant at the 5% level.
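
To make the Market_Share effect easier to interpret, the coefficients can be converted to odds ratios (a sketch; confint.default gives Wald confidence intervals):

# Odds ratios with 95% Wald confidence intervals
exp(cbind(OR = coef(lr), confint.default(lr)))

For example, exp(0.08003) ≈ 1.083, so each additional percentage point of market share is associated with roughly 8% higher odds of being profitable.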

Prediction

Next, we evaluate how suitable this model is for prediction:

# Creating the prediction on test dataset
predictTest = predict(lr, type = "response", newdata = test2)

# Preparing ROCR graph
ROCRpred <- prediction(predictTest, test2$Profitable)
ROCRperf <- performance(ROCRpred, 'tpr', 'fpr')

#Displaying ROCR graph
plot(ROCRperf, colorize = TRUE)

#Calculating AUC
auc2 <- as.numeric(performance(ROCRpred, "auc")@y.values)
auc2
## [1] 0.6126697
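
For a direct comparison with the decision tree’s confusion matrix, the same table can be produced for the logistic model (a sketch; the 0.5 cutoff is the conventional default, not a tuned threshold):

# Confusion matrix at a 0.5 probability cutoff
table(test2$Profitable, predictTest > 0.5)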

Result

  • Based on the colorized ROC curve, the best threshold appears to lie between 0.4 and 0.47
  • Since AUC = 0.6126697, the model’s predictions are somewhat more reliable than random guessing

Conclusion

Comparing the results from Decision Tree and Logistic Regression analysis, it is evident that:

  • The optimal threshold range for the DT is higher than the one for the LR (0.45-0.6 vs. 0.4-0.47). This means the DT requires a stricter filtering approach compared to the LR.
  • The AUC of the DT is lower than that of the LR (0.5161086 < 0.6126697). This means the LR is more reliable at prediction than the DT.

As a result, Logistic Regression is the better method for prediction on this dataset.