According to a recent study by Bureau of Labor Statistics (BLS), about 50% of the startups fail within 5 years from their inception and about 35% survive for close to 10 years or more . In another study, Harvard Business School reported that about 75% of the startups eventually fail.

Statistics indicate that majority of the startups eventually fail. What separates a new business from becoming successful from the ones that would eventually fail is linked to a host of factors. Some of the fators are ability to gauge the market requirement, having a steady cash inflow to sustain the company needs, team with a strong technical expertise, ability to with stand ever growing competition from the competitors, aggressive pricing of the product, value of the core product, business model (either B2B or B2C), a sizable customer base, advisory network to guide and mentor, company location, overcoming legal adversaries and most importantly strong leadership from the senior management which is passionate about the company’s vision and mission.

This is a case study which predicts whether a start-up succeeds or fails based on the sample data obtained from CrowdAnalytixs.

library(readr)
library(tidyverse)
cax_train <- read_csv("H:/Projects/Predicitve modeling/Crowdanalytics/Business Analytics/CAX_Startup_Train.csv")


cax_test <- read_csv("H:/Projects/Predicitve modeling/Crowdanalytics/Business Analytics/CAX_Startup_Test.csv")


cax <- bind_rows(cax_train, cax_test) # conslidated dataset


#sapply(cax, function(x) sum(is.na(x))) # no missing values

Get the numeric , character, integer and factor variables of the data frame

num_features <- which(sapply(cax, function(x) class(x)) == 'numeric') # numeric features


int_features <- which(sapply(cax, function(x) class(x)) == 'integer') # integer features


char_features <- which(sapply(cax, function(x) class(x)) == 'character') # categorical features


factor_features <- which(sapply(cax, function(x) class(x) == 'factor')) # no factor features

Subset numerical, character and integer variables

num <- cax[, num_features,]  # Num variables

char <- cax[, char_features]   # char variables

int <- cax[, int_features]  # integer variables

Exploratory Data Analysis

### Categorical variables

library(miscset)
library(ggplot2)



list_char_vars <- char_features[-1] %>% names() 

char_df <- as.data.frame(char[, -1])


list1 <- list_char_vars[1:12]

ggplotGrid(ncol = 4,
           lapply(list1,
                  function(col) {
                    ggplot(char_df[,1:12], aes_string(col)) + geom_bar(fill = "blue", colour = "red") 
                  }))

list2 <- list_char_vars[13:25]

ggplotGrid(ncol = 4,
           lapply(list2,
                  function(col) {
                    ggplot(char_df[,13:25], aes_string(col)) + geom_bar(fill = "orange", colour = "red") 
                  })) 

Majority of the startups are based out from USA. Only a handful of companies have managed to raise funds and have mobile presence. A large number of founders have previously worked with startups before and carry significant work experience with global exposure. Majority of the company are neither crowd-sourcing or crowd-funding. A large number of companies are not big data savvy.

Key KPI’s of startup’s

#numeric variables

num %>% as.data.frame() %>% gather() %>% ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(fill = "yellow", colour = "black") +
  labs(title =  "Key KPI's of start-ups")

int %>% as.data.frame() %>% select(-Founders_Popularity, -Dependent)%>% gather() %>% ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(fill = "pink", colour = "black") +
  labs(title =  "Key KPI's of start-ups")

Correlation matrix of numeric variables

# corrrealtion matrix
library(ggcorrplot) # library to plot correlation plot

corr <- round(cor(num), 1) # correlation matrix of numerical variables

ggcorrplot(corr = corr, hc.order = TRUE, type = "lower", outline.color = "white") +
  ggtitle("           Correlation heatmap")

Data Preparation

Feature Engineering

Dummy coding of categorical features

# company location dummy variables
loc <- char %>% select(Company_Location) %>% mutate(loc1 = ifelse(.$Company_Location == 'USA', 1, ifelse(.$Company_Location == 'Europe', 0, NA)),
                                                  loc2 = ifelse(.$Company_Location == 'Europe', 1, 0))

loc[which(is.na(loc$loc1)), 'loc1'] <- 0



# company raising funds

funds <- char %>% select(Company_raising_fund) %>% mutate(fund = ifelse(.$Company_raising_fund == 'Yes', 1, 0))


# company industry count


ind.count <- char %>% select(Company_Industry_count) %>% mutate(ind_count_many = ifelse(.$Company_Industry_count == 'Many', 1,
                                                                                   ifelse(.$Company_Industry_count == 'Few', 0, NA)),
                                                                ind_count_few = ifelse(.$Company_Industry_count == 'Few', 1, 0))

ind.count[which(is.na(ind.count$ind_count_many)), 'ind_count_many'] <- 0


# Company mobile app

app <- char %>% select(Company_mobile_app) %>% mutate(mobile_app = ifelse(.$Company_mobile_app == 'Yes', 1, 0))


# Company top angel vc funding

vc.funding <- char %>% select(Company_top_Angel_VC_funding) %>% mutate(angel_vc_funding = ifelse(.$Company_top_Angel_VC_funding == 'Yes', 1, 0))


# Company Founders top company experience

founders.experience <- char %>% select(Founders_top_company_experience) %>% mutate(founders_experience = ifelse(.$Founders_top_company_experience == 'Yes', 1, 0))


# Founders previous company employee count

previous.employee.count <- char %>% select(Founders_previous_company_employee_count) %>% 
                           mutate(previous_employee_count_large = ifelse(.$Founders_previous_company_employee_count == 'Large', 1, 
                                                                   ifelse(.$Founders_previous_company_employee_count == 'Medium', 0, NA)),
                                 previous_employee_count_medium = ifelse(.$Founders_previous_company_employee_count == 'Medium', 1, 0))


previous.employee.count[which(is.na(previous.employee.count$previous_employee_count_large)), 'previous_employee_count_large'] <- 0



# Founders start up experience

founder.startup.exp <- char %>% select(Founders_startup_experience) %>% 
                                mutate(founder_startup_exp = ifelse(.$Founders_startup_experience == 'Yes', 1, 0))


# Founders big five exprience

f.bfive.exp <-  char %>% select(Founders_big_5_experience) %>% mutate(bigfive_exp = ifelse(.$Founders_big_5_experience == 'Yes', 1, 0))


# Company business model

b.model <- char %>% select(Company_business_model) %>% 
                      mutate(btob = ifelse(.$Company_business_model == 'B2C', 1, 
                                           ifelse(.$Company_business_model == 'B2B', 0, NA)),
                             btoc = ifelse(.$Company_business_model == 'B2B', 1, 0))

b.model[which(is.na(b.model$btob)), 'btob'] <- 0



# Founders experience 

f.exp <- char %>% select(Founders_experience) %>% 
                  mutate(f_exp_high = ifelse(.$Founders_experience == 'High', 1, 
                                             ifelse(.$Founders_experience == 'Medium', 0 , NA)),
                         f_exp_medium = ifelse(.$Founders_experience == 'Medium', 1, 0))

f.exp[which(is.na(f.exp$f_exp_high)), 'f_exp_high'] <- 0
  

# Founders global exposure


f.g.expo <- char %>% select(Founders_global_exposure) %>% 
                   mutate(global_expo = ifelse(.$Founders_global_exposure == 'Yes', 1, 0))


# Founders industry exposure



f.i.expo <- char %>% select(Founders_Industry_exposure) %>% 
                     mutate(indusexpo_high = ifelse(.$Founders_Industry_exposure == 'High', 1,
                                                    ifelse(.$Founders_Industry_exposure == 'Medium', 0, NA)),
                            indusexpo_medium = ifelse(.$Founders_Industry_exposure == 'Medium', 1, 0))

f.i.expo[which(is.na(f.i.expo$indusexpo_high)), "indusexpo_high"] <- 0



# Founders education

f.edu <- char %>% select(Founder_education) %>% 
                  mutate(f.phd = ifelse(.$Founder_education == 'PhD', 1, 
                                        ifelse(.$Founder_education == 'Masters', 0, NA)),
                         f.masters = ifelse(.$Founder_education == 'Masters', 1, 0))

f.edu[which(is.na(f.edu$f.phd)), "f.phd"] <- 0



# Founders profile similarity


f.prof.sim <- char %>% select(Founders_profile_similarity) %>% 
                       mutate(prof_sim_high = ifelse(.$Founders_profile_similarity == 'High', 1, 0), 
                              prof_sim_medium = ifelse(.$Founders_profile_similarity == 'Medium', 1, 
                                                       ifelse(.$Founders_profile_similarity == 'Low', 0, NA)),
                              prof_sim_low = ifelse(.$Founders_profile_similarity == 'Low', 1, 0))


f.prof.sim[which(is.na(f.prof.sim$prof_sim_medium)), "prof_sim_medium"] <- 0




# Founders publications

f.pub <- char %>% select(Founders_publications) %>% 
                  mutate(Many_pub = ifelse(.$Founders_publications == 'Many', 1, 
                                           ifelse(.$Founders_publications == 'Few', 0, NA)),
                         Few_pub = ifelse(.$Founders_publications == 'Few', 1, 0))

f.pub[which(is.na(f.pub$Many_pub)), "Many_pub"] <- 0


# Incubation investor

incub.invstr <- char %>% select(Company_incubation_investor) %>% 
                    mutate(incub_invstr = ifelse(.$Company_incubation_investor == 'Yes', 1,0))



# Company funding

crwd_fund <- char %>% select(Company_crowdfunding) %>% 
                 mutate(crwd_funding = ifelse(.$Company_crowdfunding == 'Yes', 1, 0))


# Company crowdsourcing
crwd_source <- char %>% select(Company_crowdsourcing) %>% 
                 mutate(crwd_sourcing = ifelse(.$Company_crowdsourcing == 'Yes', 1, 0))



# Company big data
C_bigdata <- char %>% select(Company_big_data) %>% 
                mutate(bigdata = ifelse(.$Company_big_data == 'Yes', 1, 0))


# Company product or service or both

prod.service <- char %>% select(Company_Product_or_service) %>% 
                mutate(productbased = ifelse(.$Company_Product_or_service == 'Product', 1, 
                                             ifelse(.$Company_Product_or_service == 'Service', 0 ,NA)),
                       servicebased = ifelse(.$Company_Product_or_service == 'Service', 1, 0))


prod.service[which(is.na(prod.service$productbased)), "productbased"] <- 0


# Company  subscription offering

subscription <- char %>% select(Company_subscription_offering) %>% 
                 mutate(com_subscription = ifelse(.$Company_subscription_offering == 'Yes', 1, 0))


# Founder highest degree type


highest.degree.type <- char %>% select(Founder_highest_degree_type) %>% 
                mutate(management = ifelse(.$Founder_highest_degree_type == 'Management', 1, 0),
                       science = ifelse(.$Founder_highest_degree_type == 'Science', 1, 
                                        ifelse(.$Founder_highest_degree_type == 'Technology', 0, NA)),
                       technology = ifelse(.$Founder_highest_degree_type == 'Technology', 1, 0))

highest.degree.type[which(is.na(highest.degree.type$science)), "science"] <- 0

 
# Company difficulty obtaining workforce


work.force <- char %>% select(Company_difficulty_obtaining_workforce) %>% 
                   mutate(workforce_high = ifelse(.$Company_difficulty_obtaining_workforce == 'High', 1, 
                                                  ifelse(.$Company_difficulty_obtaining_workforce == 'Medium', 0, NA)),
                          workforce_medium = ifelse(.$Company_difficulty_obtaining_workforce == 'Medium', 1, 0))

work.force[which(is.na(work.force$workforce_high)), "workforce_high"] <- 0


# Company founder patent


patent <- char %>% select(Company_Founder_Patent) %>% 
                  mutate(patent_yes = ifelse(.$Company_Founder_Patent == 'Yes', 1, 0))

Putting it all together after creating dummy varaibles.

# Putting all together 

finalset <- data.frame(num, int, patent_yes = patent$patent_yes,
                       workforce_high = work.force$workforce_high, workforce_medium = work.force$workforce_medium,
                       management_degree = highest.degree.type$management, science_degree = highest.degree.type$science, 
                       technology_degree = highest.degree.type$technology, com_subscription = subscription$com_subscription,
                       productbased = prod.service$productbased, servicebased = prod.service$servicebased,
                       bigdata = C_bigdata$bigdata, crowdsourcing = crwd_source$crwd_sourcing, 
                       crowdfunding = crwd_fund$crwd_funding, incubator_investor = incub.invstr$incub_invstr,
                       Manypublications = f.pub$Many_pub, fewpublications = f.pub$Few_pub,
                       prof_sim_high = f.prof.sim$prof_sim_high, prof_sim_medium = f.prof.sim$prof_sim_medium, 
                       prof_sim_low = f.prof.sim$prof_sim_low, phd_degree = f.edu$f.phd, masters_degree = f.edu$f.masters,
                       high_exposure = f.i.expo$indusexpo_high, medium_exposure = f.i.expo$indusexpo_medium,
                       global_exposure = f.g.expo$global_expo, high_founder_exp = f.exp$f_exp_high,
                       medium_founder_experience = f.exp$f_exp_medium, btob = b.model$btob, btoc = b.model$btoc,
                       big_five_exp = f.bfive.exp$bigfive_exp, startup_exp = founder.startup.exp$founder_startup_exp,
                       previous_employee_count_large = previous.employee.count$previous_employee_count_large,
                       previous_employee_count_medium = previous.employee.count$previous_employee_count_medium,
                       founder_experience = founders.experience$founders_experience, vc_funding = vc.funding$angel_vc_funding,
                       mobile_app = app$mobile_app, industry_count_many = ind.count$ind_count_many,
                       industry_count_few = ind.count$ind_count_few, raising_funds = funds$fund, location_usa = loc$loc1,
                       location_europe = loc$loc2) 



factor_vars <- sapply(finalset[, c(26:64)], function(x) as.factor(x)) %>% as.data.frame() # conver all the above variables to factor

cleaneddata <- data.frame(num, int, factor_vars) # final cleaned data frame fr use

cleaneddata$Founders_Popularity <- as.factor(cleaneddata$Founders_Popularity ) # oops, forgot to convert this to a factor before, hence doing it now

Splitting data in training, testing and validation sets

# train set and test set


train <- cleaneddata[(1:234),]  # training set to be further split in to training and validation set

train$Dependent <- as.factor(train$Dependent)  # convert dependent variable to a factor


test <- cleaneddata[c(235:314),]  # testing set


test[which(is.na(test$Dependent)), "Dependent" ] <- 0  # change missing values to 0

test$Dependent <- as.factor(test$Dependent) # convert dependent varaible to factor 

# validation set
vset <- sample(1:nrow(train), nrow(train)* 0.70, replace = FALSE) # random sampling

trainset <- train[vset,] # training set

validationset <- train[-vset,] # validation set

Feature Selection

The features that are likely to help up predcit whether a start up succeedsor fails are selected using Boruta method which is built on top of random forest feature selection method. Feature selection methods only suggest which features are likely to contribute in explaining variablity in response variable and should be taken with a grain of salt.

Boruta Method

library(Boruta) 

set.seed(1234)

boruta.train <- Boruta(Dependent~.,data = train, doTrace = 2) # train boruta model on the training set


# plot results from boruta feature selection method
plot(boruta.train, xlab = "", xaxt = "n")

lz<-lapply(1:ncol(boruta.train$ImpHistory),function(i)
  boruta.train$ImpHistory[is.finite(boruta.train$ImpHistory[,i]),i])

names(lz) <- colnames(boruta.train$ImpHistory)

Labels <- sort(sapply(lz,median))

axis(side = 1,las=2,labels = names(Labels),
     at = 1:ncol(boruta.train$ImpHistory), cex.axis = 0.7) 

Final list of important features after feature selection using Boruta Method

final.boruta <- TentativeRoughFix(boruta.train) # list of tentative important variables

final_vars <- getSelectedAttributes(final.boruta, withTentative = FALSE)  # final list of features

final_vars
## [1] "Company_avg_investment_time"    "Company_senior_team_count"     
## [3] "Company_repeat_investors_count" "Company_competitor_count"      
## [5] "Company_analytics_score"        "prof_sim_medium"               
## [7] "btob"                           "btoc"

Model Building

Subset important features from training, testing and validation sets

trset <- trainset[,final_vars]  # training set

trset$dependent <- trainset$Dependent  # convert dependent feature to to factor

valset <- validationset[,final_vars]  # validation set

valset$dependent <- validationset$Dependent  # convert dependent feature to to factor

tset <- test[,final_vars]    # testing set

tset$dependent <- test$Dependent  # convert dependent feature to to factor
library(h2o)
h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         1 hours 2 minutes 
##     H2O cluster version:        3.16.0.2 
##     H2O cluster version age:    4 months and 26 days !!! 
##     H2O cluster name:           H2O_started_from_R_Karthik_aog658 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   0.70 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.4.1 (2017-06-30)
train.h2o <- as.h2o(trset) # convert training set to h20 data frame
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
validation.h2o <- as.h2o(valset) # convert validation set to h2o data frame
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
test.h2o <- as.h2o(tset) # convert testing set to h2o data frame
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
x.indep <- c(1:8) # independent variables

y.dep <- 9 # dependent variable

logistic <- h2o.glm(x = x.indep, y = y.dep, training_frame = train.h2o  ,family = "binomial",
                    nfolds = 5) # train logisitic regression model on the training set
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

Coefficients

logistic@model$coefficients_table
## Coefficients: glm coefficients
##                             names coefficients standardized_coefficients
## 1                       Intercept    -1.264866                 -0.086485
## 2               prof_sim_medium.0     0.000000                  0.000000
## 3               prof_sim_medium.1     0.000000                  0.000000
## 4                          btob.0     0.212614                  0.212614
## 5                          btob.1    -0.216565                 -0.216565
## 6                          btoc.0     0.000000                  0.000000
## 7                          btoc.1     0.000000                  0.000000
## 8     Company_avg_investment_time     0.000000                  0.000000
## 9       Company_senior_team_count     0.269783                  0.683503
## 10 Company_repeat_investors_count     0.000000                  0.000000
## 11       Company_competitor_count    -0.034870                 -0.112072
## 12        Company_analytics_score     0.124186                  0.129698

Model Evaluation

h2o.performance(logistic, validation.h2o)  # model performance on validation set
## H2OBinomialMetrics: glm
## 
## MSE:  0.2174649
## RMSE:  0.4663314
## LogLoss:  0.6283217
## Mean Per-Class Error:  0.2924603
## AUC:  0.7190476
## Gini:  0.4380952
## R^2:  0.1299676
## Residual Deviance:  89.22169
## AIC:  101.2217
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         0  1    Error    Rate
## 0      17 19 0.527778  =19/36
## 1       2 33 0.057143   =2/35
## Totals 19 52 0.295775  =21/71
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.413985 0.758621  37
## 2                       max f2  0.341125 0.883838  43
## 3                 max f0point5  0.489286 0.692308  27
## 4                 max accuracy  0.489286 0.704225  27
## 5                max precision  0.884887 1.000000   0
## 6                   max recall  0.341125 1.000000  43
## 7              max specificity  0.884887 1.000000   0
## 8             max absolute_mcc  0.413985 0.468746  37
## 9   max min_per_class_accuracy  0.547866 0.685714  22
## 10 max mean_per_class_accuracy  0.413985 0.707540  37
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

Area under the curve

h2o.auc(h2o.performance(logistic, train.h2o));h2o.auc(h2o.performance(logistic, validation.h2o))  # AUC of the model on  training set and validaion set
## [1] 0.7585065
## [1] 0.7190476

The validation AUC is slighter lesser than the training AUC, indicating the model is not overfitting on the data. We can except the model perform good on the testing set. We used the testing set to make predictions using the logistic regressin model we trained. The AUC for testing set is 0.67929 AUC, sitting at 40th position on the leaderboard.Not bad for a base model. We can further improve the model by adding more important features which we hadn’t in our base model.