According to a recent study by the Bureau of Labor Statistics (BLS), about 50% of startups fail within 5 years of their inception and only about 35% survive for close to 10 years or more. In another study, Harvard Business School reported that about 75% of startups eventually fail.
Statistics indicate that the majority of startups eventually fail. Whether a new business becomes successful or eventually fails is linked to a host of factors: the ability to gauge market requirements, a steady cash inflow to sustain the company's needs, a team with strong technical expertise, the ability to withstand ever-growing competition, aggressive pricing of the product, the value of the core product, the business model (B2B or B2C), a sizable customer base, an advisory network to guide and mentor, company location, overcoming legal adversities and, most importantly, strong leadership from senior management that is passionate about the company's vision and mission.
This case study predicts whether a start-up succeeds or fails based on sample data obtained from CrowdAnalytix.
library(readr)
library(tidyverse)
cax_train <- read_csv("H:/Projects/Predicitve modeling/Crowdanalytics/Business Analytics/CAX_Startup_Train.csv")
cax_test <- read_csv("H:/Projects/Predicitve modeling/Crowdanalytics/Business Analytics/CAX_Startup_Test.csv")
cax <- bind_rows(cax_train, cax_test) # consolidated dataset
#sapply(cax, function(x) sum(is.na(x))) # no missing values
Get the numeric, character, integer and factor variables of the data frame
num_features <- which(sapply(cax, function(x) class(x)) == 'numeric') # numeric features
int_features <- which(sapply(cax, function(x) class(x)) == 'integer') # integer features
char_features <- which(sapply(cax, function(x) class(x)) == 'character') # categorical features
factor_features <- which(sapply(cax, function(x) class(x)) == 'factor') # no factor features
Subset numerical, character and integer variables
num <- cax[, num_features] # numeric variables
char <- cax[, char_features] # char variables
int <- cax[, int_features] # integer variables
### Categorical variables
library(miscset)
library(ggplot2)
list_char_vars <- char_features[-1] %>% names()
char_df <- as.data.frame(char[, -1])
list1 <- list_char_vars[1:12]
ggplotGrid(ncol = 4,
lapply(list1,
function(col) {
ggplot(char_df[,1:12], aes_string(col)) + geom_bar(fill = "blue", colour = "red")
}))
list2 <- list_char_vars[13:25]
ggplotGrid(ncol = 4,
lapply(list2,
function(col) {
ggplot(char_df[,13:25], aes_string(col)) + geom_bar(fill = "orange", colour = "red")
}))
The majority of the startups are based in the USA. Only a handful of companies have managed to raise funds or have a mobile presence. A large number of founders have previously worked at startups and carry significant work experience with global exposure. Most companies rely on neither crowdsourcing nor crowdfunding, and a large number are not big-data savvy.
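A quick way to put numbers behind these observations is to tabulate a few of the plotted columns; a small sketch using the same char data frame:
# share of startups by location, fund-raising status and mobile-app presence
prop.table(table(char$Company_Location))
prop.table(table(char$Company_raising_fund))
prop.table(table(char$Company_mobile_app))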
Key KPIs of startups
#numeric variables
num %>% as.data.frame() %>% gather() %>% ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(fill = "yellow", colour = "black") +
labs(title = "Key KPI's of start-ups")
int %>% as.data.frame() %>% select(-Founders_Popularity, -Dependent)%>% gather() %>% ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(fill = "pink", colour = "black") +
labs(title = "Key KPI's of start-ups")
Correlation matrix of numeric variables
# correlation matrix
library(ggcorrplot) # library to plot correlation plot
corr <- round(cor(num), 1) # correlation matrix of numerical variables
ggcorrplot(corr = corr, hc.order = TRUE, type = "lower", outline.color = "white") +
ggtitle(" Correlation heatmap")
Feature Engineering
Dummy coding of categorical features
# company location dummy variables
loc <- char %>% select(Company_Location) %>% mutate(loc1 = ifelse(.$Company_Location == 'USA', 1, ifelse(.$Company_Location == 'Europe', 0, NA)),
loc2 = ifelse(.$Company_Location == 'Europe', 1, 0))
loc[which(is.na(loc$loc1)), 'loc1'] <- 0
# company raising funds
funds <- char %>% select(Company_raising_fund) %>% mutate(fund = ifelse(.$Company_raising_fund == 'Yes', 1, 0))
# company industry count
ind.count <- char %>% select(Company_Industry_count) %>% mutate(ind_count_many = ifelse(.$Company_Industry_count == 'Many', 1,
ifelse(.$Company_Industry_count == 'Few', 0, NA)),
ind_count_few = ifelse(.$Company_Industry_count == 'Few', 1, 0))
ind.count[which(is.na(ind.count$ind_count_many)), 'ind_count_many'] <- 0
# Company mobile app
app <- char %>% select(Company_mobile_app) %>% mutate(mobile_app = ifelse(.$Company_mobile_app == 'Yes', 1, 0))
# Company top angel vc funding
vc.funding <- char %>% select(Company_top_Angel_VC_funding) %>% mutate(angel_vc_funding = ifelse(.$Company_top_Angel_VC_funding == 'Yes', 1, 0))
# Company Founders top company experience
founders.experience <- char %>% select(Founders_top_company_experience) %>% mutate(founders_experience = ifelse(.$Founders_top_company_experience == 'Yes', 1, 0))
# Founders previous company employee count
previous.employee.count <- char %>% select(Founders_previous_company_employee_count) %>%
mutate(previous_employee_count_large = ifelse(.$Founders_previous_company_employee_count == 'Large', 1,
ifelse(.$Founders_previous_company_employee_count == 'Medium', 0, NA)),
previous_employee_count_medium = ifelse(.$Founders_previous_company_employee_count == 'Medium', 1, 0))
previous.employee.count[which(is.na(previous.employee.count$previous_employee_count_large)), 'previous_employee_count_large'] <- 0
# Founders start up experience
founder.startup.exp <- char %>% select(Founders_startup_experience) %>%
mutate(founder_startup_exp = ifelse(.$Founders_startup_experience == 'Yes', 1, 0))
# Founders big five experience
f.bfive.exp <- char %>% select(Founders_big_5_experience) %>% mutate(bigfive_exp = ifelse(.$Founders_big_5_experience == 'Yes', 1, 0))
# Company business model
b.model <- char %>% select(Company_business_model) %>%
mutate(btob = ifelse(.$Company_business_model == 'B2B', 1,
ifelse(.$Company_business_model == 'B2C', 0, NA)),
btoc = ifelse(.$Company_business_model == 'B2C', 1, 0))
b.model[which(is.na(b.model$btob)), 'btob'] <- 0
# Founders experience
f.exp <- char %>% select(Founders_experience) %>%
mutate(f_exp_high = ifelse(.$Founders_experience == 'High', 1,
ifelse(.$Founders_experience == 'Medium', 0 , NA)),
f_exp_medium = ifelse(.$Founders_experience == 'Medium', 1, 0))
f.exp[which(is.na(f.exp$f_exp_high)), 'f_exp_high'] <- 0
# Founders global exposure
f.g.expo <- char %>% select(Founders_global_exposure) %>%
mutate(global_expo = ifelse(.$Founders_global_exposure == 'Yes', 1, 0))
# Founders industry exposure
f.i.expo <- char %>% select(Founders_Industry_exposure) %>%
mutate(indusexpo_high = ifelse(.$Founders_Industry_exposure == 'High', 1,
ifelse(.$Founders_Industry_exposure == 'Medium', 0, NA)),
indusexpo_medium = ifelse(.$Founders_Industry_exposure == 'Medium', 1, 0))
f.i.expo[which(is.na(f.i.expo$indusexpo_high)), "indusexpo_high"] <- 0
# Founders education
f.edu <- char %>% select(Founder_education) %>%
mutate(f.phd = ifelse(.$Founder_education == 'PhD', 1,
ifelse(.$Founder_education == 'Masters', 0, NA)),
f.masters = ifelse(.$Founder_education == 'Masters', 1, 0))
f.edu[which(is.na(f.edu$f.phd)), "f.phd"] <- 0
# Founders profile similarity
f.prof.sim <- char %>% select(Founders_profile_similarity) %>%
mutate(prof_sim_high = ifelse(.$Founders_profile_similarity == 'High', 1, 0),
prof_sim_medium = ifelse(.$Founders_profile_similarity == 'Medium', 1,
ifelse(.$Founders_profile_similarity == 'Low', 0, NA)),
prof_sim_low = ifelse(.$Founders_profile_similarity == 'Low', 1, 0))
f.prof.sim[which(is.na(f.prof.sim$prof_sim_medium)), "prof_sim_medium"] <- 0
# Founders publications
f.pub <- char %>% select(Founders_publications) %>%
mutate(Many_pub = ifelse(.$Founders_publications == 'Many', 1,
ifelse(.$Founders_publications == 'Few', 0, NA)),
Few_pub = ifelse(.$Founders_publications == 'Few', 1, 0))
f.pub[which(is.na(f.pub$Many_pub)), "Many_pub"] <- 0
# Incubation investor
incub.invstr <- char %>% select(Company_incubation_investor) %>%
mutate(incub_invstr = ifelse(.$Company_incubation_investor == 'Yes', 1,0))
# Company crowdfunding
crwd_fund <- char %>% select(Company_crowdfunding) %>%
mutate(crwd_funding = ifelse(.$Company_crowdfunding == 'Yes', 1, 0))
# Company crowdsourcing
crwd_source <- char %>% select(Company_crowdsourcing) %>%
mutate(crwd_sourcing = ifelse(.$Company_crowdsourcing == 'Yes', 1, 0))
# Company big data
C_bigdata <- char %>% select(Company_big_data) %>%
mutate(bigdata = ifelse(.$Company_big_data == 'Yes', 1, 0))
# Company product or service or both
prod.service <- char %>% select(Company_Product_or_service) %>%
mutate(productbased = ifelse(.$Company_Product_or_service == 'Product', 1,
ifelse(.$Company_Product_or_service == 'Service', 0 ,NA)),
servicebased = ifelse(.$Company_Product_or_service == 'Service', 1, 0))
prod.service[which(is.na(prod.service$productbased)), "productbased"] <- 0
# Company subscription offering
subscription <- char %>% select(Company_subscription_offering) %>%
mutate(com_subscription = ifelse(.$Company_subscription_offering == 'Yes', 1, 0))
# Founder highest degree type
highest.degree.type <- char %>% select(Founder_highest_degree_type) %>%
mutate(management = ifelse(.$Founder_highest_degree_type == 'Management', 1, 0),
science = ifelse(.$Founder_highest_degree_type == 'Science', 1,
ifelse(.$Founder_highest_degree_type == 'Technology', 0, NA)),
technology = ifelse(.$Founder_highest_degree_type == 'Technology', 1, 0))
highest.degree.type[which(is.na(highest.degree.type$science)), "science"] <- 0
# Company difficulty obtaining workforce
work.force <- char %>% select(Company_difficulty_obtaining_workforce) %>%
mutate(workforce_high = ifelse(.$Company_difficulty_obtaining_workforce == 'High', 1,
ifelse(.$Company_difficulty_obtaining_workforce == 'Medium', 0, NA)),
workforce_medium = ifelse(.$Company_difficulty_obtaining_workforce == 'Medium', 1, 0))
work.force[which(is.na(work.force$workforce_high)), "workforce_high"] <- 0
# Company founder patent
patent <- char %>% select(Company_Founder_Patent) %>%
mutate(patent_yes = ifelse(.$Company_Founder_Patent == 'Yes', 1, 0))
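As an aside, the long chain of ifelse calls above could be generated more compactly with model.matrix; a minimal sketch (the reference levels it picks are alphabetical, so the resulting columns would not match the manual coding exactly, and the dropped first column is assumed to be an ID):
# model.matrix builds one 0/1 indicator per non-reference factor level; [, -1] drops the intercept column
char_factors <- as.data.frame(lapply(char[, -1], as.factor)) # first column assumed to be an ID
dummies <- model.matrix(~ ., data = char_factors)[, -1]
head(dummies)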
Putting it all together after creating the dummy variables.
# Putting all together
finalset <- data.frame(num, int, patent_yes = patent$patent_yes,
workforce_high = work.force$workforce_high, workforce_medium = work.force$workforce_medium,
management_degree = highest.degree.type$management, science_degree = highest.degree.type$science,
technology_degree = highest.degree.type$technology, com_subscription = subscription$com_subscription,
productbased = prod.service$productbased, servicebased = prod.service$servicebased,
bigdata = C_bigdata$bigdata, crowdsourcing = crwd_source$crwd_sourcing,
crowdfunding = crwd_fund$crwd_funding, incubator_investor = incub.invstr$incub_invstr,
Manypublications = f.pub$Many_pub, fewpublications = f.pub$Few_pub,
prof_sim_high = f.prof.sim$prof_sim_high, prof_sim_medium = f.prof.sim$prof_sim_medium,
prof_sim_low = f.prof.sim$prof_sim_low, phd_degree = f.edu$f.phd, masters_degree = f.edu$f.masters,
high_exposure = f.i.expo$indusexpo_high, medium_exposure = f.i.expo$indusexpo_medium,
global_exposure = f.g.expo$global_expo, high_founder_exp = f.exp$f_exp_high,
medium_founder_experience = f.exp$f_exp_medium, btob = b.model$btob, btoc = b.model$btoc,
big_five_exp = f.bfive.exp$bigfive_exp, startup_exp = founder.startup.exp$founder_startup_exp,
previous_employee_count_large = previous.employee.count$previous_employee_count_large,
previous_employee_count_medium = previous.employee.count$previous_employee_count_medium,
founder_experience = founders.experience$founders_experience, vc_funding = vc.funding$angel_vc_funding,
mobile_app = app$mobile_app, industry_count_many = ind.count$ind_count_many,
industry_count_few = ind.count$ind_count_few, raising_funds = funds$fund, location_usa = loc$loc1,
location_europe = loc$loc2)
factor_vars <- lapply(finalset[, c(26:64)], function(x) as.factor(x)) %>% as.data.frame() # convert all the dummy variables above to factors
cleaneddata <- data.frame(num, int, factor_vars) # final cleaned data frame for use
cleaneddata$Founders_Popularity <- as.factor(cleaneddata$Founders_Popularity) # Founders_Popularity was missed above, so convert it to a factor here
Splitting the data into training, testing and validation sets
# train set and test set
train <- cleaneddata[(1:234),] # training set, to be further split into training and validation sets
train$Dependent <- as.factor(train$Dependent) # convert dependent variable to a factor
test <- cleaneddata[c(235:314),] # testing set
test[which(is.na(test$Dependent)), "Dependent" ] <- 0 # change missing values to 0
test$Dependent <- as.factor(test$Dependent) # convert dependent variable to a factor
# validation set
vset <- sample(1:nrow(train), nrow(train)* 0.70, replace = FALSE) # random sampling
trainset <- train[vset,] # training set
validationset <- train[-vset,] # validation set
The features that are likely to help us predict whether a start-up succeeds or fails are selected using the Boruta method, a wrapper built around random forest feature importance. Feature selection methods only suggest which features are likely to contribute to explaining variability in the response variable, and their output should be taken with a grain of salt.
Boruta Method
library(Boruta)
set.seed(1234)
boruta.train <- Boruta(Dependent~.,data = train, doTrace = 2) # train boruta model on the training set
# plot results from boruta feature selection method
plot(boruta.train, xlab = "", xaxt = "n")
lz<-lapply(1:ncol(boruta.train$ImpHistory),function(i)
boruta.train$ImpHistory[is.finite(boruta.train$ImpHistory[,i]),i])
names(lz) <- colnames(boruta.train$ImpHistory)
Labels <- sort(sapply(lz,median))
axis(side = 1,las=2,labels = names(Labels),
at = 1:ncol(boruta.train$ImpHistory), cex.axis = 0.7)
Final list of important features after feature selection using Boruta Method
final.boruta <- TentativeRoughFix(boruta.train) # resolve tentative attributes to confirmed or rejected
final_vars <- getSelectedAttributes(final.boruta, withTentative = FALSE) # final list of features
final_vars
## [1] "Company_avg_investment_time" "Company_senior_team_count"
## [3] "Company_repeat_investors_count" "Company_competitor_count"
## [5] "Company_analytics_score" "prof_sim_medium"
## [7] "btob" "btoc"
Subset important features from training, testing and validation sets
trset <- trainset[,final_vars] # training set
trset$dependent <- trainset$Dependent # append the dependent variable
valset <- validationset[,final_vars] # validation set
valset$dependent <- validationset$Dependent # append the dependent variable
tset <- test[,final_vars] # testing set
tset$dependent <- test$Dependent # append the dependent variable
library(h2o)
h2o.init()
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 5 hours 51 minutes
## H2O cluster version: 3.16.0.2
## H2O cluster version age: 4 months and 26 days !!!
## H2O cluster name: H2O_started_from_R_Karthik_aog658
## H2O cluster total nodes: 1
## H2O cluster total memory: 0.69 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Algos, AutoML, Core V3, Core V4
## R Version: R version 3.4.1 (2017-06-30)
train.h2o <- as.h2o(trset) # convert training set to an h2o data frame
validation.h2o <- as.h2o(valset) # convert validation set to h2o data frame
test.h2o <- as.h2o(tset) # convert testing set to h2o data frame
x.indep <- c(1:8) # independent variables
y.dep <- 9 # dependent variable
logistic <- h2o.glm(x = x.indep, y = y.dep, training_frame = train.h2o, family = "binomial", seed = 1234) # train a logistic regression model on the training set
Coefficients
logistic@model$coefficients_table
## Coefficients: glm coefficients
## names coefficients standardized_coefficients
## 1 Intercept -1.099159 -0.129992
## 2 prof_sim_medium.0 0.000000 0.000000
## 3 prof_sim_medium.1 0.000000 0.000000
## 4 btob.0 0.116989 0.116989
## 5 btob.1 -0.121404 -0.121404
## 6 btoc.0 -0.122533 -0.122533
## 7 btoc.1 0.119152 0.119152
## 8 Company_avg_investment_time 0.001500 0.026164
## 9 Company_senior_team_count 0.202464 0.511721
## 10 Company_repeat_investors_count 0.043306 0.062609
## 11 Company_competitor_count -0.131911 -0.456626
## 12 Company_analytics_score 0.283543 0.296333
Interpretation of the logistic function
log[p/(1-p)] = b0 + b1X1 + b2X2 + ... + bnXn, where p is the probability that the start-up succeeds.
logodds <- function(x){ # inverse logit: converts a log-odds value to a probability
exp(x)/(exp(x) + 1)
}
# company competitor count
# coefficient: -0.028137
logodds(- 0.028137)
## [1] 0.4929662
The transformed value is just below 0.5, mirroring the negative coefficient: each additional competitor lowers the odds that a company succeeds.
# Company_analytics_score coefficient: 0.320876
logodds(0.320876)
## [1] 0.5795377
The value is above 0.5, mirroring the positive coefficient: a higher analytics score increases the odds that a company succeeds.
# Company_senior_team_count coefficient: 0.246476
logodds(0.246476)
## [1] 0.5613089
The value is above 0.5, mirroring the positive coefficient: a larger senior team increases the odds that a company succeeds.
# btob.0 coefficient: 0.116618
logodds(0.116618)
## [1] 0.5291215
The value above 0.5 mirrors the positive coefficient for btob.0: companies whose business model is not limited to B2B have higher odds of success.
# btob.1 coefficient: -0.121676
logodds(-0.121676)
## [1] 0.4696185
The value below 0.5 mirrors the negative coefficient for btob.1: a purely B2B business model lowers the odds of success.
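For reference, the conventional way to read these coefficients is through the odds ratio: exp(b) gives the multiplicative change in the odds of success for a one-unit increase in the predictor. A minimal worked example using the competitor-count coefficient above:
# exp(b) is the odds ratio for a one-unit increase in the predictor
exp(-0.028137) # ~0.97, i.e. each additional competitor multiplies the odds of success by about 0.97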
Based on the available data, a startup is more likely to succeed if it has a higher analytics score, fewer competitors (a more unique position in the market), a larger and more experienced senior team, and a business model that serves both B2B and B2C customers.
h2o.performance(logistic, validation.h2o) # model performance on validation set
## H2OBinomialMetrics: glm
##
## MSE: 0.2121043
## RMSE: 0.4605478
## LogLoss: 0.6276378
## Mean Per-Class Error: 0.2654762
## AUC: 0.7468254
## Gini: 0.4936508
## R^2: 0.1514146
## Residual Deviance: 89.12457
## AIC: 109.1246
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 21 15 0.416667 =15/36
## 1 4 31 0.114286 =4/35
## Totals 25 46 0.267606 =19/71
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.387795 0.765432 45
## 2 max f2 0.332975 0.859375 50
## 3 max f0point5 0.449681 0.714286 34
## 4 max accuracy 0.391720 0.732394 43
## 5 max precision 0.662413 0.900000 9
## 6 max recall 0.054060 1.000000 67
## 7 max specificity 0.883836 0.972222 0
## 8 max absolute_mcc 0.387795 0.490968 45
## 9 max min_per_class_accuracy 0.449681 0.714286 34
## 10 max mean_per_class_accuracy 0.387795 0.734524 45
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Area under the curve
h2o.auc(h2o.performance(logistic, train.h2o));h2o.auc(h2o.performance(logistic, validation.h2o)) # AUC of the model on the training set and validation set
## [1] 0.7777778
## [1] 0.7468254
We made predictions on the test set using the logistic regression model we trained. The AUC on the test set is 0.67929, good for 40th position on the leaderboard. Not bad for a base model. We can further improve the model by adding important features that were not included in this base model.
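For completeness, here is a minimal sketch of how the test-set predictions could be generated and written out for submission; the ID column name and output file name are assumptions rather than details from the original notebook:
# predicted class and class probabilities for the test set
pred <- h2o.predict(logistic, test.h2o)
# assemble a submission file; CAX_ID and the file name are assumed, not taken from the original code
submission <- data.frame(CAX_ID = cax_test$CAX_ID, Dependent = as.data.frame(pred)$p1)
write.csv(submission, "CAX_Startup_Submission.csv", row.names = FALSE)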