Investment strategies for start-up companies are largely based on intuition or past experience. As a result, investors rely primarily on the need being addressed, the background of the founders, the size of the target market, and the company's ability to scale after early success. The question we pose here is: can we perform a more rigorous analysis to identify relevant factors and score prospective start-ups on their potential to be successful? Such a model would allow investors to make more informed decisions and rely less on intuition.
The data used in this analysis has been obtained from the ‘Data’ tab of the ‘Business Analytics For Beginners Using R - Part I’ competition hosted by www.crowdanalytix.com.
library('ggplot2')
library('dplyr')
library('data.table')
library('mice')
library('caTools')
library('lubridate')
library('e1071')
library('randomForest')
library('reshape2')
setwd("/Users/mareksalamon/Desktop/Projects/R/Start-Up Analysis")
data <- read.csv('CAX_Startup_Data.csv', na.strings=c("","NA"))
data.dict <- read.csv('CAX_Startup_Data_Dictionary.csv')
Before cleaning the data, it is a good idea to get a sense of what has been made available: the breadth of information, the dimensions of the data, the data types, the unique responses for each variable, and whether there are missing values and, if so, how many.
Throughout this process, it is wise to keep the goal of the project in mind and to make sure that every action taken brings the analyst closer to it. The goal of this project is to build a model that predicts whether or not a startup will be successful based on the responses to the available variables. With this in mind, additional steps will be needed during data preparation, such as encoding categorical variables as dummy variables and evaluating certain variables on their potential relevance to the prediction of business success.
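For instance, a few quick one-liners (shown here purely as an illustration) cover most of these checks:
dim(data)                   # dimensions: observations x variables
table(sapply(data, class))  # data types present in the dataset
sum(is.na(data))            # total number of missing values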
Let’s take a look at the structure of the data.
str(data)
Our data consists of 116 variables and 472 observations. Many, if not all, of the variables are stored as ‘factors’. This is appropriate for categorical responses but not for numerical ones, so let’s convert each variable to its appropriate data type.
cols.notnumeric <- c(
"Company_Name", "Dependent.Company.Status", "year.of.founding", "Short.Description.of.company.profile", "Industry.of.company", "Focus.functions.of.company", "Investors", "Has.the.team.size.grown", "Est..Founding.Date", "Last.Funding.Date", "Country.of.company", "Continent.of.company", "Presence.of.a.top.angel.or.venture.fund.in.previous.round.of.investment", "Number.of..Sales.Support.material", "Worked.in.top.companies", "Average.size.of.companies.worked.for.in.the.past", "Have.been.part.of.startups.in.the.past.", "Have.been.part.of.successful.startups.in.the.past.", "Was.he.or.she.partner.in.Big.5.consulting.", "Consulting.experience.", "Product.or.service.company.", "Catering.to.product.service.across.verticals", "Focus.on.private.or.public.data.", "Focus.on.consumer.data.", "Focus.on.structured.or.unstructured.data", "Subscription.based.business", "Cloud.or.platform.based.serive.product.", "Local.or.global.player", "Linear.or.Non.linear.business.model", "Capital.intensive.business.e.g..e.commerce..Engineering.products.and.operations.can.also.cause.a.business.to.be.capital.intensive", "Number.of..of.Partners.of.company", "Crowdsourcing.based.business","Crowdfunding.based.business", "Machine.Learning.based.business", "Predictive.Analytics.business", "Speech.analytics.business", "Prescriptive.analytics.business", "Big.Data.Business", "Cross.Channel.Analytics..marketing.channels", "Owns.data.or.not...monetization.of.data..e.g..Factual", "Is.the.company.an.aggregator.market.place..e.g..Bluekai", "Online.or.offline.venture...physical.location.based.business.or.online.venture.", "B2C.or.B2B.venture.", "Top.forums.like..Tech.crunch..or..Venture.beat..talking.about.the.company.model...How.much.is.it.being.talked.about.", "Average.Years.of.experience.for.founder.and.co.founder", "Exposure.across.the.globe", "Breadth.of.experience.across.verticals", "Highest.education", "Specialization.of.highest.education", "Relevance.of.education.to.venture", "Relevance.of.experience.to.venture", "Degree.from.a.Tier.1.or.Tier.2.university.", "Experience.in.selling.and.building.products", "Top.management.similarity", "Number.of..of.Research.publications", "Team.Composition.score", "Dificulty.of.Obtaining.Work.force", "Pricing.Strategy", "Hyper.localisation", "Time.to.market.service.or.product", "Employee.benefits.and.salary.structures", "Long.term.relationship.with.other.founders", "Proprietary.or.patent.position..competitive.position.", "Barriers.of.entry.for.the.competitors", "Company.awards", "Controversial.history.of.founder.or.co.founder", "Legal.risk.and.intellectual.property", "Client.Reputation", "Technical.proficiencies.to.analyse.and.interpret.unstructured.data", "Solutions.offered", "Invested.through.global.incubation.competitions.", "Disruptiveness.of.technology", "Survival.through.recession..based.on.existence.of.the.company.through.recession.times", "Gartner.hype.cycle.stage", "Time.to.maturity.of.technology..in.years."
)
cols.numeric <- colnames(data)[!(colnames(data) %in% cols.notnumeric)]
data[cols.numeric] <- lapply(data[cols.numeric], as.character)
data[cols.numeric] <- lapply(data[cols.numeric], as.numeric)
data[cols.notnumeric] <- lapply(data[cols.notnumeric], as.character)
Let’s express ‘Est..Founding.Date’ and ‘Last.Funding.Date’ as ‘days ago’ relative to the current date.
data$Est..Founding.Date <- as.Date.character(data$Est..Founding.Date, format = "%m/%d/%Y", optional=TRUE)
data$Last.Funding.Date <- as.Date.character(data$Last.Funding.Date, format = "%m/%d/%Y", optional=TRUE)
data$Est..Founding.Date <- sapply(data$Est..Founding.Date, function(x){Sys.Date()-x})
data$Last.Funding.Date <- sapply(data$Last.Funding.Date, function(x){Sys.Date()-x})
cols.notnumeric <- cols.notnumeric[!(cols.notnumeric %in% c("Est..Founding.Date", "Last.Funding.Date"))]
cols.numeric <- append(cols.numeric, c("Est..Founding.Date", "Last.Funding.Date"))
Let’s take a look at the amount and density of missing (NA) values in the data.
(sum(is.na(data))/(nrow(data)*ncol(data)))*100 # Percent of total data that is a missing value
## [1] 8.200614
# Obtaining the proportion of missing values in each column with at least one missing value
missing.col <- (colSums(is.na(data))[colSums(is.na(data)) > 0]/nrow(data))*100
missing.col.names <- row.names(data.frame(missing.col))
missing.col.percent <- data.frame(missing.col)
missing.col.percent <- setorder(data.frame(missing.col), -missing.col)
head(missing.col.percent)
## missing.col
## Employees.count.MoM.change 43.43220
## Gartner.hype.cycle.stage 36.44068
## Time.to.maturity.of.technology..in.years. 36.44068
## Last.round.of.funding.received..in.milionUSD. 35.38136
## Employee.Count 35.16949
## Last.Funding.Amount 33.89831
We now have a table of every variable with at least one missing value, along with the percentage of that column that is missing. Above is a preview of the six sparsest columns, the sparsest of which is missing 43.4% of its information. Missing data are located in 50 of the 116 variables. Furthermore, there seems to be a pattern in which related variables exhibit equal proportions of missing data. Taking the ‘Percent_skill’ columns as an example, it is very likely that the person filling out the survey chose not to respond to these questions because the founder has no prior experience in any of the skills listed and therefore considered them not applicable.
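One quick way to see that pattern (an illustrative check only) is to group the sparse columns by their missing-value share:
# Variables sharing an identical proportion of missing values tend to be related
split(names(missing.col), round(missing.col, 2))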
In a real-world project, I would investigate the missing responses further by consulting the survey subjects and the survey creators in order to acquire more data. In this case, I have collected all of the data I can, which leaves the options of imputing the missing data and/or removing observations and variables that are missing ‘too much’ of it. Since the goal is a classification model, we want to manipulate the data in a way that maximizes the model’s predictive ability. We will therefore create one dataset that retains all of the variables and observations, and a second one in which observations and variables missing more than 30% of their data are removed. Once the models are built, we can compare the two datasets and keep the one that yields the higher accuracy.
Now, let’s create a second dataset with variables missing >30% of their data removed:
drop <- rownames(subset(missing.col.percent,missing.col>30))
data.new <- data[, !(names(data) %in% drop)]
Now let’s visualize the amount of missing data across each observation (row), as opposed to each variable. We’ll produce a histogram of the proportion of missing information in each observation.
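A sketch of that step, assuming the row-wise percentages are computed on the reduced dataset and stored in a data frame called ‘missing.row’ with a ‘value’ column (which is how they are referenced below):
# Percent of missing values in each observation of the reduced dataset
missing.row <- data.frame(value = rowMeans(is.na(data.new)) * 100)
# Histogram of missing information per observation
ggplot(missing.row, aes(x = value)) +
geom_histogram(binwidth = 5) +
ggtitle('Missing Data per Observation') +
labs(x = 'Percent of variables missing', y = 'Number of observations')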
Most of the observations are missing less than 30% of their data. Let’s remove from our new dataset all those missing more than 30%.
drop <- rownames(subset(missing.row,value>30))
data.new <- data.new[!(row.names(data.new) %in% drop), ]
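The two figures below give the share of the original variables and observations that were dropped; assuming they were computed directly from the dimensions of the two datasets, the calculation would look like this:
# Percent of original variables removed
(1 - ncol(data.new)/ncol(data)) * 100
# Percent of original observations removed
(1 - nrow(data.new)/nrow(data)) * 100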
## [1] 6.896552
## [1] 7.627119
~7% of the original variables were removed, along with ~8% of the original observations. Ideally we would have removed less than 5%, but these figures are acceptable. Now that we’ve removed the unwanted pieces of data, let’s impute the missing values that remain. First, let’s replace all NA values in the categorical columns with ‘No Info’.
data.is.char <- sapply(data, is.character)
char.data <- data[, data.is.char]
data.new.is.char <- sapply(data.new, is.character)
char.data.new <- data.new[, data.new.is.char]
data[, colnames(char.data)][is.na(data[colnames(char.data)])] <- 'No Info'
data.new[, colnames(char.data.new)][is.na(data.new[, colnames(char.data.new)])] <- 'No Info'
Next, let’s impute our numerical data via multiple imputation with the ‘mice’ package. Before we do that, we need to make sure that there is no collinearity between variables; otherwise the function will not work.
At this point, I have not yet split my data into train and test sets. By failing to do so before imputation, I may be compromising my results by leaking information about the future test set into the training set: the imputations that follow will be based on all of the data, the training and test observations alike. We will keep this in the back of our minds and tread lightly.
# Computing the correlation matrix
cormat <- cor(data[cols.numeric], use = "pairwise.complete.obs")
melted_cormat <- melt(cormat)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
ggtitle('Variable Correlation Heatmap') +
geom_tile() +
labs(fill='Pearson Score') +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
# Making an ordered dataframe of variables pairs and their correlations
cormat[lower.tri(cormat,diag=TRUE)] <- NA #Prepare to drop redundant
cormat <- as.data.frame(as.table(cormat)) #Turn into a 3-column table
cormat <- na.omit(cormat) #Get rid of the junk we flagged above
cormat <- cormat[order(-abs(cormat$Freq)),] #Sort by highest correlation (whether +ve or -ve)
It looks like ‘Last.Funding.Amount’ and ‘Last.round.of.funding.received..in.milionUSD’ are almost perfectly correlated (Pearson score = 0.98), as are ‘Team.size.all.employees’ and ‘Employees.per.year.of.company.existence’ (0.95), and ‘Age.of.company.in.years’ and ‘Est..Founding.Date’ (0.99). Because the variables in each pair are simply different ways of expressing the same thing, we can treat them as collinear. To fix this, one variable from each pair needs to be removed; we will discard the variable that is missing more information.
sum(is.na(data$Last.Funding.Amount))
## [1] 160
sum(is.na(data$Last.round.of.funding.received..in.milionUSD.)) # Drop
## [1] 167
sum(is.na(data$Team.size.all.employees))
## [1] 68
sum(is.na(data$Employees.per.year.of.company.existence)) # Drop
## [1] 128
sum(is.na(data$Age.of.company.in.years))
## [1] 59
sum(is.na(data$Est..Founding.Date)) # Drop
## [1] 109
data <- data[ , !(names(data) %in% c('Last.round.of.funding.received..in.milionUSD.', 'Employees.per.year.of.company.existence', 'Est..Founding.Date'))]
data.new <- data.new[ , !(names(data.new) %in% c('Last.round.of.funding.received..in.milionUSD.', 'Employees.per.year.of.company.existence', 'Est..Founding.Date'))]
Let’s proceed with the multiple imputation of the data using the ‘mice’ package.
set.seed(101)
data.imputed <- mice(data, method = 'cart', m=1, seed=101)
## Warning: Number of logged events: 73
data.imputed <- complete(data.imputed, 1)
set.seed(102)
data.new.imputed <- mice(data.new, method = 'cart', m=1, seed=102)
## Warning: Number of logged events: 70
data.new.imputed <- complete(data.new.imputed, 1)
Let’s confirm that our imputed datasets are free of missing values.
(sum(is.na(data.imputed))/(nrow(data.imputed)*ncol(data.imputed)))*100
## [1] 0
(sum(is.na(data.new.imputed))/(nrow(data.new.imputed)*ncol(data.new.imputed)))*100
## [1] 0
Now, we can begin producing our model. It is important to note that models often cannot accept non-numeric inputs directly; in the case of categorical variables, we would create a dummy variable for each level, thereby representing the categorical variable numerically. However, model functions in R are generally built with this in mind, so explicit creation of dummy variables is usually unnecessary.
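As a quick illustration of what that encoding looks like (using one categorical column from this dataset purely as an example), model.matrix() expands a factor into indicator columns:
# Example only: dummy (indicator) columns generated for one categorical variable
head(model.matrix(~ Local.or.global.player, data = data.imputed))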
There are several models we may choose from for classification: a Support Vector Machine (SVM), logistic regression, a random forest, a naive Bayes classifier, or a neural network. Neural networks typically require large amounts of data, thousands of observations at least, to produce robust models; for this reason, we will not use one here.
Since producing models in R is quick and seamless, we will fit each of the remaining four model types on each dataset. This results in eight models, from which we will choose the one that provides the highest accuracy.
Let’s make some final adjustments to our datasets.
# Turning categorical variables back into factors
data.imputed[!(colnames(data.imputed) %in% cols.numeric)] <- lapply(data.imputed[!(colnames(data.imputed) %in% cols.numeric)], as.factor)
data.new.imputed[!(colnames(data.new.imputed) %in% cols.numeric)] <- lapply(data.new.imputed[!(colnames(data.new.imputed) %in% cols.numeric)], as.factor)
Let’s code the company outcomes as 0s and 1s so that they are easier to feed into, and interpret from, the models to come: 1 = “Success”, 0 = “Failed”.
data.imputed$Dependent.Company.Status <- as.character(data.imputed$Dependent.Company.Status)
data.new.imputed$Dependent.Company.Status <- as.character(data.new.imputed$Dependent.Company.Status)
# Vectorised recode of the outcome: "Success" -> 1, everything else -> 0
data.imputed$Dependent.Company.Status <- ifelse(data.imputed$Dependent.Company.Status == "Success", 1, 0)
data.new.imputed$Dependent.Company.Status <- ifelse(data.new.imputed$Dependent.Company.Status == "Success", 1, 0)
Finally, before we move on to our model, let’s split our datasets using an 80-20 train-test split.
set.seed(123) # Splitting the original data
split.data <- sample.split(data.imputed$Dependent.Company.Status, SplitRatio = 0.8)
train.data <- subset(data.imputed, split.data==TRUE)
test.data <- subset(data.imputed, split.data==FALSE)
train.data[,sapply(train.data, is.numeric) & colnames(train.data) != "Dependent.Company.Status"] <- scale(train.data[,sapply(train.data, is.numeric) & colnames(train.data) != "Dependent.Company.Status"])
test.data[,sapply(test.data, is.numeric) & colnames(test.data) != "Dependent.Company.Status"] <- scale(test.data[,sapply(test.data, is.numeric) & colnames(test.data) != "Dependent.Company.Status"])
set.seed(124) # Splitting the 'new' data
split.data.new <- sample.split(data.new.imputed$Dependent.Company.Status, SplitRatio = 0.8)
train.data.new <- subset(data.new.imputed, split.data.new==TRUE)
test.data.new <- subset(data.new.imputed, split.data.new==FALSE)
train.data.new[,sapply(train.data.new, is.numeric) & colnames(train.data.new) != "Dependent.Company.Status"] <- scale(train.data.new[,sapply(train.data.new, is.numeric) & colnames(train.data.new) != "Dependent.Company.Status"])
test.data.new[,sapply(test.data.new, is.numeric) & colnames(test.data.new) != "Dependent.Company.Status"] <- scale(test.data.new[,sapply(test.data.new, is.numeric) & colnames(test.data.new) != "Dependent.Company.Status"])
cols.exclude.d1 <- c("Company_Name", "Short.Description.of.company.profile", "Industry.of.company", "Focus.functions.of.company", "Investors", "Country.of.company", "Specialization.of.highest.education", "year.of.founding", "Online.or.offline.venture...physical.location.based.business.or.online.venture.")
cols.exclude.nd1 <- c("Company_Name", "Short.Description.of.company.profile","Industry.of.company", "Focus.functions.of.company", "Investors", "Country.of.company", "Specialization.of.highest.education", "year.of.founding", "Online.or.offline.venture...physical.location.based.business.or.online.venture.")
classifier.1 <- glm(formula = Dependent.Company.Status ~., family = binomial, data = train.data[,!colnames(train.data) %in% cols.exclude.d1], maxit = 100)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
classifier.2 <- glm(formula = Dependent.Company.Status ~., family = binomial, data = train.data.new[,!colnames(train.data.new) %in% cols.exclude.nd1], maxit = 100)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Note that several variables were excluded from the models. Because the number of observations is limited relative to the number of variables, some categorical responses occur in the test set that the model was never trained on. These variables were left out of both models so that the models are equivalent in every respect except that one is trained on the dataset retaining all of its original observations and variables while the other is not.
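One way to spot such variables (a hypothetical check, not part of the original workflow) is to compare, for every factor column, the responses observed in the two splits:
# Hypothetical check: responses present in the test set but never seen in training
factor.cols <- names(Filter(is.factor, train.data))
unseen <- lapply(factor.cols, function(col) setdiff(unique(as.character(test.data[[col]])),
unique(as.character(train.data[[col]]))))
names(unseen) <- factor.cols
unseen[lengths(unseen) > 0]  # variables with unseen responses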
cols.exclude.d2 <- c("Company_Name", "Dependent.Company.Status", "Short.Description.of.company.profile", "Industry.of.company", "Focus.functions.of.company", "Investors", "Country.of.company", "Specialization.of.highest.education", "year.of.founding", "Online.or.offline.venture...physical.location.based.business.or.online.venture.")
cols.exclude.nd2 <- c("Company_Name", "Dependent.Company.Status", "Short.Description.of.company.profile", "Industry.of.company", "Focus.functions.of.company", "Investors", "Country.of.company", "Specialization.of.highest.education", "year.of.founding", "Online.or.offline.venture...physical.location.based.business.or.online.venture.")
prob.pred.1 <- predict(classifier.1, type = 'response', newdata = test.data[,!colnames(test.data) %in% cols.exclude.d2])
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
predictions.1 <- ifelse(prob.pred.1 > 0.5, 1, 0)
prob.pred.2 <- predict(classifier.2, type = 'response', newdata = test.data.new[,!colnames(test.data.new) %in% cols.exclude.nd2])
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
predictions.2 <- ifelse(prob.pred.2 > 0.5, 1, 0)
cm.1 <- table(test.data[,2], predictions.1)
cm.2 <- table(test.data.new[,2], predictions.2)
cm.1
## predictions.1
## 0 1
## 0 23 10
## 1 16 45
cm.2
## predictions.2
## 0 1
## 0 18 8
## 1 6 55
((23+45)/length(predictions.1))*100
## [1] 72.34043
((18+55)/length(predictions.2))*100
## [1] 83.90805
Our final models achieve accuracies of about 72%, for the dataset in which no variables or observations were removed, and 84% for the other. Although this is a rudimentary evaluation, it is not bad for a single run. There are ways to tweak the model and improve the accuracy, but they will not be explored here.
classifier.1 <- svm(formula = Dependent.Company.Status ~., type = 'C-classification', data = train.data[,!colnames(train.data) %in% cols.exclude.d1], kernel = 'linear')
classifier.2 <- svm(formula = Dependent.Company.Status ~., type = 'C-classification', data = train.data.new[,!colnames(train.data.new) %in% cols.exclude.nd1], kernel = 'linear')
prob.pred.1 <- predict(classifier.1, newdata = test.data[,!colnames(test.data) %in% cols.exclude.d2])
prob.pred.2 <- predict(classifier.2, newdata = test.data.new[,!colnames(test.data.new) %in% cols.exclude.nd2])
cm.1 <- table(test.data[,2], prob.pred.1)
cm.2 <- table(test.data.new[,2], prob.pred.2)
cm.1
## prob.pred.1
## 0 1
## 0 23 10
## 1 6 55
cm.2
## prob.pred.2
## 0 1
## 0 19 7
## 1 1 60
((23+55)/length(prob.pred.1))*100
## [1] 82.97872
((19+60)/length(prob.pred.2))*100
## [1] 90.8046
At first glance it looks like the SVMs are far superior to the logistic regressions, with accuracies of 83% and 91%; the higher accuracy goes to the model trained on the dataset from which sparse observations and variables were removed in the initial stages of this analysis.
There are other kernels one can use, yielding different types of SVMs with different performance characteristics. However, research suggests that it is wisest to start with the linear kernel: it is the fastest to train and test on, and any improvements from a higher-dimensional kernel are often marginal at best.
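As an example (not run as part of the comparison above), switching to a radial kernel is a one-line change:
# Example only: the same SVM with a radial (RBF) kernel instead of a linear one
classifier.rbf <- svm(formula = Dependent.Company.Status ~., type = 'C-classification', data = train.data.new[,!colnames(train.data.new) %in% cols.exclude.nd1], kernel = 'radial')
pred.rbf <- predict(classifier.rbf, newdata = test.data.new[,!colnames(test.data.new) %in% cols.exclude.nd2])
mean(as.character(pred.rbf) == as.character(test.data.new$Dependent.Company.Status)) * 100  # accuracy in percent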
train.data['Dependent.Company.Status'] <- lapply(train.data['Dependent.Company.Status'], as.factor)
test.data['Dependent.Company.Status'] <- lapply(test.data['Dependent.Company.Status'], as.factor)
train.data.new['Dependent.Company.Status'] <- lapply(train.data.new['Dependent.Company.Status'], as.factor)
test.data.new['Dependent.Company.Status'] <- lapply(test.data.new['Dependent.Company.Status'], as.factor)
classifier.1 <- naiveBayes(x = train.data[,!colnames(train.data) %in% cols.exclude.d2], y = train.data$Dependent.Company.Status)
classifier.2 <- naiveBayes(x = train.data.new[,!colnames(train.data.new) %in% cols.exclude.nd2], y = train.data.new$Dependent.Company.Status)
prob.pred.1 <- predict(classifier.1, newdata = test.data[,!colnames(test.data) %in% cols.exclude.d2])
prob.pred.2 <- predict(classifier.2, newdata = test.data.new[,!colnames(test.data.new) %in% cols.exclude.nd2])
cm.1 <- table(test.data[,2], prob.pred.1)
cm.2 <- table(test.data.new[,2], prob.pred.2)
cm.1
## prob.pred.1
## 0 1
## 0 23 10
## 1 2 59
cm.2
## prob.pred.2
## 0 1
## 0 18 8
## 1 3 58
((23+59)/length(prob.pred.1))*100
## [1] 87.23404
((18+58)/length(prob.pred.2))*100
## [1] 87.35632
The results here look like a compromise between the logistic regression and the SVM: accuracies of about 87%, with remarkably similar performance on both datasets.
classifier.1 <- randomForest(x = train.data[,!colnames(train.data) %in% cols.exclude.d2], y = train.data$Dependent.Company.Status, ntree = 10)
classifier.2 <- randomForest(x = train.data.new[,!colnames(train.data.new) %in% cols.exclude.nd2], y = train.data.new$Dependent.Company.Status, ntree = 10)
prob.pred.1 <- predict(classifier.1, newdata = test.data[,!colnames(test.data) %in% cols.exclude.d2])
prob.pred.2 <- predict(classifier.2, newdata = test.data.new[,!colnames(test.data.new) %in% cols.exclude.nd2])
cm.1 <- table(test.data[,2], prob.pred.1)
cm.2 <- table(test.data.new[,2], prob.pred.2)
cm.1
## prob.pred.1
## 0 1
## 0 24 9
## 1 1 60
cm.2
## prob.pred.2
## 0 1
## 0 20 6
## 1 1 60
((24+60)/length(prob.pred.1))*100
## [1] 89.3617
((20+60)/length(prob.pred.2))*100
## [1] 91.95402
Looks like we have a winner! … maybe. These are the best accuracies we have achieved so far, roughly 90%, and the two datasets perform almost equally well. Let’s take a look at what the random forest has determined to be the most significant indicators of a start-up’s success.
rf.imp <- setorder(data.frame(classifier.2$importance), -MeanDecreaseGini)
head(rf.imp, 10)
## MeanDecreaseGini
## Survival.through.recession..based.on.existence.of.the.company.through.recession.times 15.408475
## Focus.on.structured.or.unstructured.data 9.352296
## Last.Funding.Date 8.737716
## Invested.through.global.incubation.competitions. 8.486179
## Experience.in.selling.and.building.products 6.996102
## Disruptiveness.of.technology 6.264939
## Client.Reputation 5.565729
## Age.of.company.in.years 4.986637
## Technical.proficiencies.to.analyse.and.interpret.unstructured.data 4.861578
## Internet.Activity.Score 4.545325
We set out to build a model that would predict whether or not a start-up company will succeed and to determine which characteristics have the greatest impact on that success. In the end, the random forest model trained on the dataset with sparse observations and variables removed came out on top, with roughly 92% accuracy. The top three most important features, according to the random forest, are whether the company has survived an economic recession, whether it works with structured or unstructured data, and its last funding date. It seems that successful start-ups are capable of pushing through tough economic times. The type of data a company deals with may also play some role by indicating the industry it is involved in; I suspect this may be something of a confounding variable. On the other hand, it may indicate that the most successful start-ups are taking advantage of unstructured data in novel ways. Finally, the date of the most recent funding round may hint at how long the company has been in business: the longer a company is around, the more likely it is to ‘succeed’, as its prolonged existence serves as proof of some public need for it in the economy. Other notable features include the founder’s experience in selling and building products, the disruptiveness of the technology being sold, and the internet activity score.
This was a somewhat difficult dataset to deal with. The combination of a large number of independent variables and a relatively small number of observations made missing data more likely. Additionally, some variables had to be removed out of necessity because their responses were not evenly represented across the train and test datasets.
The results of our models varied, but all of them did a satisfactory job of predicting whether a company will fail or succeed based on the available information. However, I am afraid that I have inflated the results by failing to separate the original data into train and test sets before imputing it. By imputing first, bias was injected into the evaluation, because information from the training set was used to ‘predict’ the missing values in the test set (and vice versa). If done again, I would split the data before performing any other transformations. In addition, I would take a closer look at each variable individually and consider whether its data are likely to be missing at random; this would allow a more intelligent choice of placeholder values for the missing data.
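Sketched out, that leakage-free order might look like the following; it assumes a recent version of ‘mice’ whose ignore argument lets the imputation models be fit on the training rows only:
# Sketch: split first, then fit imputation models on the training rows only
set.seed(123)
split.first <- sample.split(data$Dependent.Company.Status, SplitRatio = 0.8)
imp <- mice(data, method = 'cart', m = 1, seed = 101, ignore = !split.first)
data.imp <- complete(imp, 1)
train.first <- subset(data.imp, split.first == TRUE)
test.first <- subset(data.imp, split.first == FALSE)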
Reflecting further on this project, there are several things I would do to improve the models as well as the analysis as a whole. In a real-world setting, I would spend much more time cleaning the data and making sure it is as representative of real-world observations as possible. To this end, I would collaborate with the researchers administering these surveys to determine how they could be improved in order to increase the response rate for each question, which would translate into less missing data. I would also strive to establish standardized response values so as to decrease the variety of responses for each variable while maximizing the information acquired. In addition, I would consider methods such as natural language processing to extract additional information from the more chaotic, free-form survey responses, such as the short description of the company’s profile. Finally, after deciding on a model type, perhaps random forest classification in this scenario, I would spend the rest of my time tuning the model’s parameters in order to extract as much predictive power from it as possible.
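As a first step in that direction, randomForest’s tuneRF() can search for a better ‘mtry’ value; a sketch (not run as part of this analysis):
# Sketch: tune the number of variables tried at each split (mtry)
set.seed(125)
tuned <- tuneRF(x = train.data.new[,!colnames(train.data.new) %in% cols.exclude.nd2], y = train.data.new$Dependent.Company.Status, ntreeTry = 500, stepFactor = 1.5, improve = 0.01)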