According to a recent study by the Bureau of Labor Statistics (BLS), about 50% of startups fail within 5 years of inception and about 35% survive for close to 10 years or more. In another study, Harvard Business School reported that about 75% of startups eventually fail.

Statistics indicate that the majority of startups eventually fail. What separates a new business that becomes successful from one that eventually fails comes down to a host of factors: the ability to gauge market requirements, a steady cash inflow to sustain the company’s needs, a team with strong technical expertise, the ability to withstand ever-growing competition, aggressive pricing of the product, the value of the core product, the business model (B2B or B2C), a sizable customer base, an advisory network to guide and mentor, the company’s location, the ability to overcome legal adversities and, most importantly, strong leadership from senior management that is passionate about the company’s vision and mission.

This case study predicts whether a start-up succeeds or fails based on sample data obtained from CrowdAnalytix.

library(readr)
library(tidyverse)
cax_train <- read_csv("H:/Projects/Predicitve modeling/Crowdanalytics/Business Analytics/CAX_Startup_Train.csv")


cax_test <- read_csv("H:/Projects/Predicitve modeling/Crowdanalytics/Business Analytics/CAX_Startup_Test.csv")


cax <- bind_rows(cax_train, cax_test) # consolidated dataset


#sapply(cax, function(x) sum(is.na(x))) # no missing values

Get the numeric, character, integer and factor variables of the data frame

num_features <- which(sapply(cax, function(x) class(x)) == 'numeric') # numeric features


int_features <- which(sapply(cax, function(x) class(x)) == 'integer') # integer features


char_features <- which(sapply(cax, function(x) class(x)) == 'character') # categorical features


factor_features <- which(sapply(cax, function(x) class(x)) == 'factor') # no factor features

Subset numerical, character and integer variables

num <- cax[, num_features]  # numeric variables

char <- cax[, char_features]   # char variables

int <- cax[, int_features]  # integer variables

Exploratory Data Analysis

### Categorical variables

library(miscset)
library(ggplot2)



list_char_vars <- char_features[-1] %>% names() 

char_df <- as.data.frame(char[, -1])


list1 <- list_char_vars[1:12]

ggplotGrid(ncol = 4,
           lapply(list1,
                  function(col) {
                    ggplot(char_df[,1:12], aes_string(col)) + geom_bar(fill = "blue", colour = "red") 
                  }))

list2 <- list_char_vars[13:25]

ggplotGrid(ncol = 4,
           lapply(list2,
                  function(col) {
                    ggplot(char_df[,13:25], aes_string(col)) + geom_bar(fill = "orange", colour = "red") 
                  })) 

The majority of startups in the data are based in the USA. Only a handful of companies have managed to raise funds or have a mobile presence. A large number of founders have previously worked at startups and carry significant work experience with global exposure. Most companies use neither crowdsourcing nor crowdfunding, and a large number are not big-data savvy.
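These statements can be checked numerically with quick frequency counts; a small sketch using the column names from the dataset above:

# frequency counts behind the bar charts above
char %>% count(Company_Location, sort = TRUE)   # most companies are US-based
char %>% count(Company_raising_fund)            # few have raised funds
char %>% count(Company_mobile_app)              # few have a mobile app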

Key KPIs of startups

#numeric variables

num %>% as.data.frame() %>% gather() %>% ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(fill = "yellow", colour = "black") +
  labs(title =  "Key KPI's of start-ups")

int %>% as.data.frame() %>% select(-Founders_Popularity, -Dependent)%>% gather() %>% ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(fill = "pink", colour = "black") +
  labs(title =  "Key KPI's of start-ups")

Correlation matrix of numeric variables

# correlation matrix
library(ggcorrplot) # library to plot correlation plot

corr <- round(cor(num), 1) # correlation matrix of numerical variables

ggcorrplot(corr = corr, hc.order = TRUE, type = "lower", outline.color = "white") +
  ggtitle("Correlation heatmap")
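To read the strongest relationships off the matrix numerically rather than visually, the correlation matrix computed above can be reshaped into pairs; a small sketch, where the 0.5 cutoff is an arbitrary choice:

# list variable pairs with |correlation| >= 0.5
# (each pair appears twice because the matrix is symmetric)
corr_pairs <- as.data.frame(as.table(corr))
corr_pairs <- subset(corr_pairs, Var1 != Var2 & abs(Freq) >= 0.5)
corr_pairs[order(-abs(corr_pairs$Freq)), ]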

Data Preparation

Feature Engineering

Dummy coding of categorical features

# company location dummy variables
loc <- char %>% select(Company_Location) %>% mutate(loc1 = ifelse(.$Company_Location == 'USA', 1, ifelse(.$Company_Location == 'Europe', 0, NA)),
                                                  loc2 = ifelse(.$Company_Location == 'Europe', 1, 0))

loc[which(is.na(loc$loc1)), 'loc1'] <- 0



# company raising funds

funds <- char %>% select(Company_raising_fund) %>% mutate(fund = ifelse(.$Company_raising_fund == 'Yes', 1, 0))


# company industry count


ind.count <- char %>% select(Company_Industry_count) %>% mutate(ind_count_many = ifelse(.$Company_Industry_count == 'Many', 1,
                                                                                   ifelse(.$Company_Industry_count == 'Few', 0, NA)),
                                                                ind_count_few = ifelse(.$Company_Industry_count == 'Few', 1, 0))

ind.count[which(is.na(ind.count$ind_count_many)), 'ind_count_many'] <- 0


# Company mobile app

app <- char %>% select(Company_mobile_app) %>% mutate(mobile_app = ifelse(.$Company_mobile_app == 'Yes', 1, 0))


# Company top angel vc funding

vc.funding <- char %>% select(Company_top_Angel_VC_funding) %>% mutate(angel_vc_funding = ifelse(.$Company_top_Angel_VC_funding == 'Yes', 1, 0))


# Company Founders top company experience

founders.experience <- char %>% select(Founders_top_company_experience) %>% mutate(founders_experience = ifelse(.$Founders_top_company_experience == 'Yes', 1, 0))


# Founders previous company employee count

previous.employee.count <- char %>% select(Founders_previous_company_employee_count) %>% 
                           mutate(previous_employee_count_large = ifelse(.$Founders_previous_company_employee_count == 'Large', 1, 
                                                                   ifelse(.$Founders_previous_company_employee_count == 'Medium', 0, NA)),
                                 previous_employee_count_medium = ifelse(.$Founders_previous_company_employee_count == 'Medium', 1, 0))


previous.employee.count[which(is.na(previous.employee.count$previous_employee_count_large)), 'previous_employee_count_large'] <- 0



# Founders start up experience

founder.startup.exp <- char %>% select(Founders_startup_experience) %>% 
                                mutate(founder_startup_exp = ifelse(.$Founders_startup_experience == 'Yes', 1, 0))


# Founders big five experience

f.bfive.exp <-  char %>% select(Founders_big_5_experience) %>% mutate(bigfive_exp = ifelse(.$Founders_big_5_experience == 'Yes', 1, 0))


# Company business model

b.model <- char %>% select(Company_business_model) %>% 
                      mutate(btob = ifelse(.$Company_business_model == 'B2B', 1, 
                                           ifelse(.$Company_business_model == 'B2C', 0, NA)),
                             btoc = ifelse(.$Company_business_model == 'B2C', 1, 0))

b.model[which(is.na(b.model$btob)), 'btob'] <- 0



# Founders experience 

f.exp <- char %>% select(Founders_experience) %>% 
                  mutate(f_exp_high = ifelse(.$Founders_experience == 'High', 1, 
                                             ifelse(.$Founders_experience == 'Medium', 0 , NA)),
                         f_exp_medium = ifelse(.$Founders_experience == 'Medium', 1, 0))

f.exp[which(is.na(f.exp$f_exp_high)), 'f_exp_high'] <- 0
  

# Founders global exposure


f.g.expo <- char %>% select(Founders_global_exposure) %>% 
                   mutate(global_expo = ifelse(.$Founders_global_exposure == 'Yes', 1, 0))


# Founders industry exposure



f.i.expo <- char %>% select(Founders_Industry_exposure) %>% 
                     mutate(indusexpo_high = ifelse(.$Founders_Industry_exposure == 'High', 1,
                                                    ifelse(.$Founders_Industry_exposure == 'Medium', 0, NA)),
                            indusexpo_medium = ifelse(.$Founders_Industry_exposure == 'Medium', 1, 0))

f.i.expo[which(is.na(f.i.expo$indusexpo_high)), "indusexpo_high"] <- 0



# Founders education

f.edu <- char %>% select(Founder_education) %>% 
                  mutate(f.phd = ifelse(.$Founder_education == 'PhD', 1, 
                                        ifelse(.$Founder_education == 'Masters', 0, NA)),
                         f.masters = ifelse(.$Founder_education == 'Masters', 1, 0))

f.edu[which(is.na(f.edu$f.phd)), "f.phd"] <- 0



# Founders profile similarity


f.prof.sim <- char %>% select(Founders_profile_similarity) %>% 
                       mutate(prof_sim_high = ifelse(.$Founders_profile_similarity == 'High', 1, 0), 
                              prof_sim_medium = ifelse(.$Founders_profile_similarity == 'Medium', 1, 
                                                       ifelse(.$Founders_profile_similarity == 'Low', 0, NA)),
                              prof_sim_low = ifelse(.$Founders_profile_similarity == 'Low', 1, 0))


f.prof.sim[which(is.na(f.prof.sim$prof_sim_medium)), "prof_sim_medium"] <- 0




# Founders publications

f.pub <- char %>% select(Founders_publications) %>% 
                  mutate(Many_pub = ifelse(.$Founders_publications == 'Many', 1, 
                                           ifelse(.$Founders_publications == 'Few', 0, NA)),
                         Few_pub = ifelse(.$Founders_publications == 'Few', 1, 0))

f.pub[which(is.na(f.pub$Many_pub)), "Many_pub"] <- 0


# Incubation investor

incub.invstr <- char %>% select(Company_incubation_investor) %>% 
                    mutate(incub_invstr = ifelse(.$Company_incubation_investor == 'Yes', 1,0))



# Company crowdfunding

crwd_fund <- char %>% select(Company_crowdfunding) %>% 
                 mutate(crwd_funding = ifelse(.$Company_crowdfunding == 'Yes', 1, 0))


# Company crowdsourcing
crwd_source <- char %>% select(Company_crowdsourcing) %>% 
                 mutate(crwd_sourcing = ifelse(.$Company_crowdsourcing == 'Yes', 1, 0))



# Company big data
C_bigdata <- char %>% select(Company_big_data) %>% 
                mutate(bigdata = ifelse(.$Company_big_data == 'Yes', 1, 0))


# Company product or service or both

prod.service <- char %>% select(Company_Product_or_service) %>% 
                mutate(productbased = ifelse(.$Company_Product_or_service == 'Product', 1, 
                                             ifelse(.$Company_Product_or_service == 'Service', 0 ,NA)),
                       servicebased = ifelse(.$Company_Product_or_service == 'Service', 1, 0))


prod.service[which(is.na(prod.service$productbased)), "productbased"] <- 0


# Company  subscription offering

subscription <- char %>% select(Company_subscription_offering) %>% 
                 mutate(com_subscription = ifelse(.$Company_subscription_offering == 'Yes', 1, 0))


# Founder highest degree type


highest.degree.type <- char %>% select(Founder_highest_degree_type) %>% 
                mutate(management = ifelse(.$Founder_highest_degree_type == 'Management', 1, 0),
                       science = ifelse(.$Founder_highest_degree_type == 'Science', 1, 
                                        ifelse(.$Founder_highest_degree_type == 'Technology', 0, NA)),
                       technology = ifelse(.$Founder_highest_degree_type == 'Technology', 1, 0))

highest.degree.type[which(is.na(highest.degree.type$science)), "science"] <- 0

 
# Company difficulty obtaining workforce


work.force <- char %>% select(Company_difficulty_obtaining_workforce) %>% 
                   mutate(workforce_high = ifelse(.$Company_difficulty_obtaining_workforce == 'High', 1, 
                                                  ifelse(.$Company_difficulty_obtaining_workforce == 'Medium', 0, NA)),
                          workforce_medium = ifelse(.$Company_difficulty_obtaining_workforce == 'Medium', 1, 0))

work.force[which(is.na(work.force$workforce_high)), "workforce_high"] <- 0


# Company founder patent


patent <- char %>% select(Company_Founder_Patent) %>% 
                  mutate(patent_yes = ifelse(.$Company_Founder_Patent == 'Yes', 1, 0))
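The blocks above spell each indicator out by hand for readability. The same kind of 0/1 columns can also be generated in one step with base R's model.matrix(); the snippet below is only a sketch (it assumes the first column of char is an identifier and lets R pick reference levels alphabetically, rather than the levels chosen manually above):

# compact alternative: dummy-encode all categorical columns at once
char_fct <- as.data.frame(lapply(char[, -1], as.factor))  # drop the assumed ID column, make everything a factor
dummies  <- model.matrix(~ ., data = char_fct)[, -1]      # drop the intercept; one level per variable becomes the reference
dim(dummies)                                              # how many indicator columns were created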

Putting it all together after creating dummy variables.

# Putting all together 

finalset <- data.frame(num, int, patent_yes = patent$patent_yes,
                       workforce_high = work.force$workforce_high, workforce_medium = work.force$workforce_medium,
                       management_degree = highest.degree.type$management, science_degree = highest.degree.type$science, 
                       technology_degree = highest.degree.type$technology, com_subscription = subscription$com_subscription,
                       productbased = prod.service$productbased, servicebased = prod.service$servicebased,
                       bigdata = C_bigdata$bigdata, crowdsourcing = crwd_source$crwd_sourcing, 
                       crowdfunding = crwd_fund$crwd_funding, incubator_investor = incub.invstr$incub_invstr,
                       Manypublications = f.pub$Many_pub, fewpublications = f.pub$Few_pub,
                       prof_sim_high = f.prof.sim$prof_sim_high, prof_sim_medium = f.prof.sim$prof_sim_medium, 
                       prof_sim_low = f.prof.sim$prof_sim_low, phd_degree = f.edu$f.phd, masters_degree = f.edu$f.masters,
                       high_exposure = f.i.expo$indusexpo_high, medium_exposure = f.i.expo$indusexpo_medium,
                       global_exposure = f.g.expo$global_expo, high_founder_exp = f.exp$f_exp_high,
                       medium_founder_experience = f.exp$f_exp_medium, btob = b.model$btob, btoc = b.model$btoc,
                       big_five_exp = f.bfive.exp$bigfive_exp, startup_exp = founder.startup.exp$founder_startup_exp,
                       previous_employee_count_large = previous.employee.count$previous_employee_count_large,
                       previous_employee_count_medium = previous.employee.count$previous_employee_count_medium,
                       founder_experience = founders.experience$founders_experience, vc_funding = vc.funding$angel_vc_funding,
                       mobile_app = app$mobile_app, industry_count_many = ind.count$ind_count_many,
                       industry_count_few = ind.count$ind_count_few, raising_funds = funds$fund, location_usa = loc$loc1,
                       location_europe = loc$loc2) 



factor_vars <- sapply(finalset[, c(26:64)], function(x) as.factor(x)) %>% as.data.frame() # convert all the dummy variables above to factors

cleaneddata <- data.frame(num, int, factor_vars) # final cleaned data frame for modelling

cleaneddata$Founders_Popularity <- as.factor(cleaneddata$Founders_Popularity) # Founders_Popularity is also categorical, so convert it to a factor as well
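A quick sanity check that the types came out as intended, i.e. the numeric KPIs stayed numeric and the dummy variables became factors (sketch):

table(sapply(cleaneddata, class))   # count of columns by class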

Splitting the data into training, testing and validation sets

# train set and test set


train <- cleaneddata[1:234, ]  # training set, to be further split into training and validation sets

train$Dependent <- as.factor(train$Dependent)  # convert dependent variable to a factor


test <- cleaneddata[c(235:314),]  # testing set


test[which(is.na(test$Dependent)), "Dependent"] <- 0  # test-set labels are unknown; fill with 0 as a placeholder

test$Dependent <- as.factor(test$Dependent) # convert dependent variable to a factor

# validation set
vset <- sample(1:nrow(train), nrow(train)* 0.70, replace = FALSE) # random sampling

trainset <- train[vset,] # training set

validationset <- train[-vset,] # validation set
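Because the split is random, it is worth confirming that both classes of the dependent variable are reasonably represented in each subset (sketch):

table(trainset$Dependent)        # class balance in the training subset
table(validationset$Dependent)   # class balance in the validation subset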

Feature Selection

The features that are likely to help predict whether a startup succeeds or fails are selected using the Boruta method, a wrapper built around random forest variable importance. Feature selection methods only suggest which features are likely to contribute to explaining variability in the response variable, so the results should be taken with a grain of salt.

Boruta Method

library(Boruta) 

set.seed(1234)

boruta.train <- Boruta(Dependent~.,data = train, doTrace = 2) # train boruta model on the training set


# plot results from boruta feature selection method
plot(boruta.train, xlab = "", xaxt = "n")

lz<-lapply(1:ncol(boruta.train$ImpHistory),function(i)
  boruta.train$ImpHistory[is.finite(boruta.train$ImpHistory[,i]),i])

names(lz) <- colnames(boruta.train$ImpHistory)

Labels <- sort(sapply(lz,median))

axis(side = 1,las=2,labels = names(Labels),
     at = 1:ncol(boruta.train$ImpHistory), cex.axis = 0.7) 

Final list of important features after feature selection using the Boruta method

final.boruta <- TentativeRoughFix(boruta.train) # resolve the tentative attributes

final_vars <- getSelectedAttributes(final.boruta, withTentative = FALSE)  # final list of features

final_vars
## [1] "Company_avg_investment_time"    "Company_senior_team_count"     
## [3] "Company_repeat_investors_count" "Company_competitor_count"      
## [5] "Company_analytics_score"        "prof_sim_medium"               
## [7] "btob"                           "btoc"
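The importance scores behind this selection can be inspected with Boruta's attStats(), which reports each feature's mean, median and range of importance across iterations (sketch):

imp_stats <- attStats(final.boruta)              # importance statistics per feature
imp_stats[order(-imp_stats$meanImp), ][1:10, ]   # top 10 features by mean importance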

Model Building

Subset important features from training, testing and validation sets

trset <- trainset[, final_vars]  # training set

trset$dependent <- trainset$Dependent  # append the dependent variable

valset <- validationset[, final_vars]  # validation set

valset$dependent <- validationset$Dependent  # append the dependent variable

tset <- test[, final_vars]    # testing set

tset$dependent <- test$Dependent  # append the dependent variable

library(h2o)
h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         5 hours 51 minutes 
##     H2O cluster version:        3.16.0.2 
##     H2O cluster version age:    4 months and 26 days !!! 
##     H2O cluster name:           H2O_started_from_R_Karthik_aog658 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   0.69 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.4.1 (2017-06-30)
train.h2o <- as.h2o(trset) # convert training set to h2o data frame
validation.h2o <- as.h2o(valset) # convert validation set to h2o data frame
test.h2o <- as.h2o(tset) # convert testing set to h2o data frame
x.indep <- c(1:8) # independent variables

y.dep <- 9 # dependent variable

logistic <- h2o.glm(x = x.indep, y = y.dep, training_frame = train.h2o, family = "binomial", seed = 1234) # train a logistic regression model on the training set

Coefficients

logistic@model$coefficients_table
## Coefficients: glm coefficients
##                             names coefficients standardized_coefficients
## 1                       Intercept    -1.099159                 -0.129992
## 2               prof_sim_medium.0     0.000000                  0.000000
## 3               prof_sim_medium.1     0.000000                  0.000000
## 4                          btob.0     0.116989                  0.116989
## 5                          btob.1    -0.121404                 -0.121404
## 6                          btoc.0    -0.122533                 -0.122533
## 7                          btoc.1     0.119152                  0.119152
## 8     Company_avg_investment_time     0.001500                  0.026164
## 9       Company_senior_team_count     0.202464                  0.511721
## 10 Company_repeat_investors_count     0.043306                  0.062609
## 11       Company_competitor_count    -0.131911                 -0.456626
## 12        Company_analytics_score     0.283543                  0.296333

Interpretation of the logistic model

log[p/(1 - p)] = b0 + b1X1 + b2X2 + … + bnXn

where p is the predicted probability of success. Each coefficient bi is the change in the log-odds for a one-unit increase in Xi, so exp(bi) is the multiplicative change in the odds (the odds ratio).
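Since exp(bi) is the odds ratio, the coefficients table returned by H2O (shown above) can be turned into odds ratios directly; a small sketch using the same model object:

# odds ratios: multiplicative change in the odds of success per one-unit increase
coef_tab <- logistic@model$coefficients_table
data.frame(feature = coef_tab$names,
           odds_ratio = round(exp(coef_tab$coefficients), 3))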

logodds <- function(x){  # inverse logit: converts a log-odds value into a probability (exp(x) alone gives the odds ratio)
  
  exp(x)/(exp(x) + 1)
}
# company competitor count

# coefficient: -0.028137

logodds(- 0.028137)
## [1] 0.4929662

The negative coefficient means that each additional competitor lowers the odds of a startup succeeding: exp(-0.028) ≈ 0.97, i.e. roughly a 3% decrease in the odds per extra competitor. (The value printed above is the inverse logit of the coefficient, not the percentage change in the odds.)

#Company_analytics_score:0.320876

logodds(0.320876)
## [1] 0.5795377

Each one-unit increase in the company analytics score raises the odds of success: exp(0.32) ≈ 1.38, roughly a 38% increase in the odds.

#Company_senior_team_count:0.246476

logodds(0.246476)
## [1] 0.5613089

Each additional senior team member raises the odds of success: exp(0.25) ≈ 1.28, roughly a 28% increase in the odds.

#btob.0: 0.116618

logodds(0.116618)
## [1] 0.5291215

The positive coefficient for btob.0 means the odds of success are higher for companies whose business model is not limited to B2B (B2C or a mix of both): exp(0.117) ≈ 1.12, roughly 12% higher odds.

#btob.1: -0.121676

logodds(-0.121676)
## [1] 0.4696185

The negative coefficient for btob.1 means the odds of success are lower when the business model is B2B only: exp(-0.122) ≈ 0.89, roughly 11% lower odds.

Based on the available data, a startup is more likely to succeed if it has a higher analytics score, fewer competitors (a more unique position in the market), a larger senior team with experienced talent on board, and a business model that covers both B2B and B2C.

Model Evaluation

h2o.performance(logistic, validation.h2o)  # model performance on validation set
## H2OBinomialMetrics: glm
## 
## MSE:  0.2121043
## RMSE:  0.4605478
## LogLoss:  0.6276378
## Mean Per-Class Error:  0.2654762
## AUC:  0.7468254
## Gini:  0.4936508
## R^2:  0.1514146
## Residual Deviance:  89.12457
## AIC:  109.1246
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         0  1    Error    Rate
## 0      21 15 0.416667  =15/36
## 1       4 31 0.114286   =4/35
## Totals 25 46 0.267606  =19/71
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.387795 0.765432  45
## 2                       max f2  0.332975 0.859375  50
## 3                 max f0point5  0.449681 0.714286  34
## 4                 max accuracy  0.391720 0.732394  43
## 5                max precision  0.662413 0.900000   9
## 6                   max recall  0.054060 1.000000  67
## 7              max specificity  0.883836 0.972222   0
## 8             max absolute_mcc  0.387795 0.490968  45
## 9   max min_per_class_accuracy  0.449681 0.714286  34
## 10 max mean_per_class_accuracy  0.387795 0.734524  45
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

Area under the curve

h2o.auc(h2o.performance(logistic, train.h2o)); h2o.auc(h2o.performance(logistic, validation.h2o))  # AUC of the model on the training and validation sets
## [1] 0.7777778
## [1] 0.7468254

We then made predictions on the testing set using the trained logistic regression model. The resulting AUC on the testing set is 0.67929, placing the submission at 40th position on the leaderboard. Not bad for a base model. It could be improved further by engineering and adding important features that were left out of this base version.
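For completeness, the test-set predictions behind that leaderboard score would be produced along these lines; this is only a sketch, and the ID column (assumed to be the first column of the test file) and the submission file name are assumptions:

pred.h2o <- h2o.predict(logistic, test.h2o)   # predicted class and class probabilities
pred <- as.data.frame(pred.h2o)

submission <- data.frame(CAX_ID = cax_test[[1]],   # hypothetical ID column name
                         Dependent = pred$p1)      # predicted probability of success
write_csv(submission, "CAX_Startup_submission.csv")  # hypothetical file name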