Every year, hundreds of thousands of students around the world graduate from colleges and universities. Many of them end up working outside the field they studied, lured by higher wages and benefits. The majority begin their careers by joining established businesses, usually as interns or in entry-level jobs. However, some graduates aim higher: they would rather employ others than be employed, they are self-starters, and they would rather create and develop a new idea, concept, or business than contribute to an already prosperous one. According to USA Today, in the years 2010-2013, 45% of business school graduates decided to set up their own companies, which is twice as many as in the years 2000-2009. This trend among college graduates translates into an increasing number of startups being founded each year. However, only a few become successful in the long run.
Each newly founded startup seeks to become successful in the market in which it operates. But what exactly does success mean for a startup? A startup is an institution that enables its founders to bring to market a unique, revolutionary idea or product that they believe could skyrocket. Most often, the main goal of a startup is either to become a well-established company or to attract an acquirer. According to Forbes, “a startup can graduate to a larger company by being acquired, opening more than one office, generating revenues greater than $20 million, or having more than 80 employees”. To achieve any of these milestones, a startup requires both financial resources and knowledgeable, skillful, and experienced staff. Several factors influence a startup’s success. Among the financial ones, the most important is how the startup acquires funds for its operations. Whether the money comes from personal savings, family and friends, a bank loan, angel investors, venture capital funds, or even crowdfunding, startup owners have to decide which option best suits their needs. Each funding alternative has its pros and cons: while borrowing money from family members is the best solution for some, for others it may strain relationships. Another factor is the choice of industry in which the startup will operate. Founding another IT or tech startup may not be the best idea because of intense competition, high barriers to entry, and legal regulations. A narrowly specialized service or product, on the other hand, may give a startup a competitive advantage over well-established companies. It is therefore worth noting that the intellectual and technological aspect plays a crucial role in the development of a startup. Even the best-funded startup may not succeed in the market if its service, product, or idea is poorly executed and lacks a scientific basis.
Behind each successful company there are skillful employees and smart managers who know how to take advantage of their subordinates’ technical know-how. Hence, the choice of employees and managers is as important as financial support at the early stage. Another factor that might affect a startup’s performance is location: where the startup is being developed and where it offers its products and services to the public. Rent, office accessibility, and equipment, among others, contribute to the costs of operations, which a fledgling company should keep as low as possible.
The analysis will be conducted on a dataset that originally comes from Crunchbase.com, a website that collects information about companies and their operations around the world. It contains data on over 190,000 startups founded up to the year 2013 and consists of 11 tables and 154 variables. The main table is called objects and gathers general information about each startup: name, date founded, date closed, location, industry, description of operations, etc. The other tables contain data about employees, offices, investments, funds obtained, milestones, and acquisitions. The dataset thus covers the main factors affecting startups’ performance. The analysis will consist of two approaches: descriptive and predictive. The first will focus on identifying trends among operating and closed startups and on determining how important each factor is to a startup’s success. For instance, it will be checked whether the number of funding rounds or the total amount of funds raised is related to whether a company was closed or is still operating. Using Tableau, I will visualize the relationships between the covariates and the target variable with bar plots, scatterplots, and maps that show which factors are associated with the operating status of startups and which appear to have no impact. The second will focus on training machine learning models such as logistic regression, decision trees, bagging, boosting, and random forests, among others, that will predict whether a startup will become successful or fail to meet expectations. In the analysis, a startup counts as successful if it meets any of three criteria: it was acquired by another entity, it was still operating at the end of 2013, or it went public on a stock market. Closure of a startup by the end of 2013 counts as failure. Unfortunately, the lack of access to recent data does not allow me to predict outcomes for the most recently founded startups.
Nevertheless, I believe that the model obtained through this analysis may greatly improve the decision-making process of current startup owners and contribute to a higher survival rate of startups.
There are many determinants of a successful startup, which makes it almost impossible to focus on all of them while setting up a business. Identifying the most influential ones may not only improve initial performance but also help reduce operational costs. The resulting surplus can later be used for further development of the startup. Moreover, recognizing contemporary trends among startups provides useful information: an owner may decide to follow a market tendency to reduce potential risks or, quite the contrary, set out in a new direction hoping for an early competitive advantage. Finally, analyzing the history of both successful and failed startups may lead to beneficial conclusions. Owners and managers will learn which choices should be avoided and which directions should be pursued. These three pieces of analysis put together should help shape a strategy that a rising startup can follow.
The operations of a newly founded startup affect several parties:
Startup’s founders/owners – they seek not only to revolutionize the market with their idea, product, or service, but also to make their business profitable. The success or failure of the startup directly affects them financially, as they tend to invest their personal savings in it. They support the startup both financially and intellectually as they manage day-to-day operations.
Investors – they seek a return on their investment, so they have a strong interest in the startup’s success. A successful startup means high revenues, an acquisition, or even going public (shares and dividends). They want to minimize risks and maximize gains. Often, investors are familiar with the startup’s industry and support its operations with their expertise.
Startup’s employees – besides financial benefits, specialists working for startups want to develop professionally by acquiring new skills, specializing in a certain field, or just contributing to a new invention. The success of a startup can boost their careers, even if they decide not to continue employment with the entity.
Competitors – depending on their size and market share, they may want to take advantage of a startup’s success, either by acquiring it, by purchasing its trademarks, copyrights, patents, etc., or simply by copying and implementing the startup’s product, service, or idea if it is not legally protected. They may also try to undermine the startup’s operations to reduce its market value or to obtain access to its product or service.
Customers – they tend to seek products or services that offer the highest quality for the lowest price (luxury products and services are an exception). Greater competition and more alternatives in the market usually result in the lower prices that customers look for. Startups’ success means that new products and services are introduced to the market, offering greater choice and new solutions.
Data was originally sourced from the Crunchbase website, which provides business information about private and public companies. Its content includes investment and funding information, founding members and individuals in leadership positions, mergers and acquisitions, news, and industry trends. Since obtaining data directly from crunchbase.com is costly, I acquired my dataset about startups from Kaggle. The dataset consists of 11 tables.
To better understand the dataset, I created a table in Excel that documents each column in each table and supports the data preparation process. It records the column name, data purpose, data type, data role, and consolidation, as well as a definition of the column followed by a commentary section that notes data characteristics or possible data analysis actions and guidelines.
Since my analysis is divided into two parts, descriptive and predictive, the data preparation process was also divided into two independent phases. The descriptive part of the project is conducted in Tableau through visualizations. The predictive part focuses on building models in RStudio. These tools require different data cleaning processes. In Tableau, the focus is on using the greatest possible number of covariates (dimensions) to describe the target variable (status: operating vs. closed), hence joining tables together to produce the minimum number of outputs is of crucial importance. On the other hand, developing predictive models such as logistic regression, decision trees, or random forests requires clean data (no missing values, and only numeric variables and categorical variables with a limited number of levels), so imputing missing values and generalizing categories is the main focus there.
Tableau Prep is an application designed to format and clean data before importing it into Tableau Desktop. Its main advantage is a user-friendly interface that makes cleaning data easy.
I began the cleaning process by importing my original data into Tableau Prep. Then I dragged the tables into the interface and started cleaning them. The figure below shows the entire diagram of how the tables were cleaned and joined together.
As can be seen, I ended up with several outputs. Unfortunately, the
structure of the dataset did not allow me to join all the tables
together. Most of the tables are in a one-to-many relation with the
objects table. This means that one row from the objects table can be
joined with several rows from the funding_rounds table, since one
startup can have several funding rounds. If I joined all the tables
together, I would end up with several million rows instead of a few
hundred thousand, because each startup's rows from one table would be
repeated for every matching row in the others. Such duplication could
greatly distort the results of later analysis. Therefore, I decided to
join the objects table with the other tables independently. Although I
cannot use all variables together, I can still focus on certain groups
of startups and analyze their specific features.
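The row multiplication caused by one-to-many joins can be illustrated with a minimal sketch. The three data frames below are made-up miniatures of the objects, funding_rounds, and milestones tables (the raised_usd and milestone columns are hypothetical), and base R merge stands in for the joins performed in Tableau Prep.

```r
# Hypothetical miniature versions of the objects, funding_rounds, and milestones tables
objects <- data.frame(id = c("c:1", "c:2"), name = c("Alpha", "Beta"))
funding_rounds <- data.frame(object_id = c("c:1", "c:1", "c:1", "c:2"),
                             raised_usd = c(1e5, 5e5, 2e6, 3e5))
milestones <- data.frame(object_id = c("c:1", "c:1", "c:2"),
                         milestone = c("launch", "1M users", "launch"))

# Joining objects to one child table at a time keeps one row per child record
obj_funding <- merge(objects, funding_rounds, by.x = "id", by.y = "object_id")
obj_milestones <- merge(objects, milestones, by.x = "id", by.y = "object_id")

# Joining both child tables at once pairs every funding round with every milestone
all_joined <- merge(obj_funding, milestones, by.x = "id", by.y = "object_id")

nrow(obj_funding)     # 4 rows: one per funding round
nrow(obj_milestones)  # 3 rows: one per milestone
nrow(all_joined)      # 7 rows: 3 rounds x 2 milestones for c:1, plus 1 x 1 for c:2

# Summing a funding column on the combined table counts some rounds twice
sum(obj_funding$raised_usd)  # 2.9 million
sum(all_joined$raised_usd)   # 5.5 million
```

With only two child tables the inflation is already visible in the totals, which is why keeping one output per child table is the safer design for aggregate charts.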
1. I removed created_at and updated_at.
2. I changed the id column into degree_id by adding “d:” followed by the id number.
3. […] degree_id by the person (object_id) obtaining the degree.
4. I renamed degree_id to number_of_degrees.
5. I joined the flow to the people table through object_id, including all the people table rows.
6. I removed the duplicate object_id field.
7. I removed created_at, updated_at, and id from the relationships table.
8. I created relationship_id, starting with “r:” and followed by a number.
9. I joined the relationships table to the people and degrees tables by the object_id field. The program output only those relationships that have people from the people table connected to them.
10. […] object_id.
11. I joined the objects table (filtered for “company” in entity_type) to my flow through relationship_object_id (flow) and id (objects table).
12. I removed the id field and changed the role of the state_code field to geographic.
13. I created an output of the flow called Relationship+people+degree+object.

1. I removed the created_at, updated_at, id, source_url, and source_description fields from the funding rounds table.
2. I changed funding_round_id to start with “fr:” followed by a number.
3. I joined the objects table (filtered for “company” in entity_type) to my flow through object_id (funding rounds table) and id (objects table).
4. I removed the id field and changed the role of state_code to geographic.
5. I created an output of the flow called Funding rounds+object.

1. In the milestones table, I removed the created_at, updated_at, id, source_url, and source_description fields.
2. I joined the objects table (filtered for “company” in entity_type) to my flow through object_id (milestones table) and id (objects table).
3. I removed the id field and changed the role of state_code to geographic.
4. I created an output of the flow called Milestone Prep.

1. In the acquisitions table, I removed the created_at, updated_at, id, source_url, and source_description fields.
2. I created the acquisition_id field, starting with “a:” followed by a number. I also filtered acquired_object_id and removed null values.
3. I joined the objects table (filtered for “company” in entity_type) to my flow through acquired_object_id (acquisitions table) and id (objects table).
4. […] status, entity_type, and id.
5. I created an output of the flow called acquisitions+object.

1. In the ipos table, I removed the created_at, updated_at, id, source_url, and source_description fields.
2. I filtered object_id to exclude null values.
3. I joined the objects table (filtered for “company” in entity_type) to my flow through object_id (ipos table) and id (objects table).
4. I removed ipo_id, closed_at, id, and entity_id from the joined flow.
5. I created an output of the flow called IPO+object.

1. In the offices table, I removed the created_at, updated_at, and id fields.
2. I created office_id, starting with “o:” followed by a number.
3. I joined the objects table (filtered for “company” in entity_type) to my flow through object_id (offices table) and id (objects table).
4. […] region, city, state_code, and country_code. I created a calculated field called office_type that assigns either a “Headquarters” or a “Branch office” label.
5. I created an output of the flow called Offices.

The following eight packages are used in this final project report:
# Below, the package tidyverse is loaded. If the user has not yet installed this package, he or she should additionally use the code install.packages("tidyverse")
library(tidyverse)
# Below, the package lubridate is loaded. If the user has not yet installed this package, he or she should additionally use the code install.packages("lubridate")
library(lubridate)
# Below, the package missForest is loaded. If the user has not yet installed this package, he or she should additionally use the code install.packages("missForest")
library(missForest)
# Below, the package rpart is loaded. If the user has not yet installed this package, he or she should additionally use the code install.packages("rpart")
library(rpart)
# Below, the package rpart.plot is loaded. If the user has not yet installed this package, he or she should additionally use the code install.packages("rpart.plot")
library(rpart.plot)
# Below, the package randomForest is loaded. If the user has not yet installed this package, he or she should additionally use the code install.packages("randomForest")
library(randomForest)
# Below, the package ipred is loaded. If the user has not yet installed this package, he or she should additionally use the code install.packages("ipred")
library(ipred)
# Below, the package adabag is loaded. If the user has not yet installed this package, he or she should additionally use the code install.packages("adabag")
library(adabag)
Note that while loading the packages above, message and warning were both set to FALSE. This suppressed the messages and warnings resulting from loading the eight packages. Also, echo was set to TRUE so that the reader can view the R code for loading the required packages.
Here, I explain the purpose of each package in the data analysis.
Loading the tidyverse package automatically loads several other
packages that will be helpful in the analysis. One of them is ggplot2,
which makes it possible to create visualizations of the data (i.e.,
graphs and plots). Others include dplyr and tidyr: dplyr is used to
manipulate the dataset, and tidyr to tidy the data. The missForest
package helps impute missing values by using the existing data to train
random forest models that predict which values should be imputed. The
lubridate package is used to transform date variables into numeric
columns. Finally, the packages ipred, rpart, randomForest, and adabag
are used to create the following models: bagging, decision tree, random
forest, and boosting, respectively. The rpart.plot package is used to
plot the resulting decision trees.
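The comments in the loading code ask the reader to install any package that is missing. As a minimal sketch of how that check could be automated, the snippet below lists which of the eight packages are not yet installed. The variable names are illustrative, and the install.packages call is left commented out so that nothing is installed without the reader's consent.

```r
# Packages loaded in this report
required <- c("tidyverse", "lubridate", "missForest", "rpart",
              "rpart.plot", "randomForest", "ipred", "adabag")

# requireNamespace() returns TRUE when a package is installed, without attaching it
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them from CRAN
}
```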
For my predictive analysis, I wanted to use multiple date variables
included in the objects table. However, as.Date without an explicit
format argument only recognizes ISO-style dates such as “YYYY-MM-DD”,
hence I formatted the date fields in Excel to obtain the required
layout. I then saved the table as a text file. To import the objects
table into RStudio I used the read.delim function, which reads
delimited text files (tab-delimited by default).
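How as.Date treats different date layouts can be checked with a small sketch. The date strings below are made up for illustration. Without a format argument, as.Date expects year-month-day input, while other layouts can be parsed in R directly by passing a format string, which is an alternative to reformatting the columns in Excel.

```r
# Hypothetical date strings in two layouts
iso_dates <- c("2010-05-01", "2012-11-30")
dmy_dates <- c("01-05-2010", "30-11-2012")

# Without a format argument, as.Date() tries "%Y-%m-%d" and then "%Y/%m/%d"
parsed_iso <- as.Date(iso_dates)

# Day-month-year strings need an explicit format to be read correctly
parsed_dmy <- as.Date(dmy_dates, format = "%d-%m-%Y")

all(parsed_iso == parsed_dmy)  # TRUE: both layouts yield the same dates
```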
# The code below fixes the random seed so that the R Markdown file produces the same random samples each time it is run.
set.seed(1234)
# Using the code below, I imported the objects text file into RStudio and named it object. Empty strings, single spaces, and "NA" are read as missing values.
object <- read.delim("/Users/macuser/Desktop/MSDA/Capstone Project/Startup dataset/objects_text.txt", na.strings=c(""," ","NA"))
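The effect of the na.strings argument can be seen on a tiny example. The sketch below parses a made-up tab-delimited string (passed through the text argument instead of a file path) that contains an empty field, a field holding a single space, and a literal NA. The column names are hypothetical.

```r
# Hypothetical tab-delimited content with an empty field, a blank field, and a literal NA
raw_text <- "id\tname\tcity\nc:1\tAlpha\t\nc:2\t \tBoston\nc:3\tGamma\tNA\n"

# text= lets read.delim() parse a string instead of a file path
demo <- read.delim(text = raw_text, na.strings = c("", " ", "NA"))

# Count the missing values per column after import
colSums(is.na(demo))
```

All three kinds of blank cell are turned into NA on import, so later steps such as is.na checks and imputation treat them uniformly.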
Then, I examined the number of rows in the data frame and filtered
its entity_type column to only include startups.
# Using the code below, I displayed the number of rows in the object data frame.
nrow(object) #462,651 rows
## [1] 462651
# Using the code below, I kept only the rows whose entity_type contains "Company" and named the result object1. Then I displayed the number of observations in the new data frame and the names of its variables.
object1<-object[grep("Company",object$entity_type),]
nrow(object1) #196,553 rows
## [1] 196553
variable.names(object1)
## [1] "id" "entity_type" "entity_id"
## [4] "parent_id" "name" "normalized_name"
## [7] "permalink" "category_code" "status"
## [10] "founded_at" "closed_at" "domain"
## [13] "homepage_url" "twitter_username" "logo_url"
## [16] "logo_width" "logo_height" "short_description"
## [19] "description" "overview" "tag_list"
## [22] "country_code" "state_code" "city"
## [25] "region" "first_investment_at" "last_investment_at"
## [28] "investment_rounds" "invested_companies" "first_funding_at"
## [31] "last_funding_at" "funding_rounds" "funding_total_usd"
## [34] "first_milestone_at" "last_milestone_at" "milestones"
## [37] "relationships" "created_by" "created_at"
## [40] "updated_at"
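Note that grep matches a pattern anywhere inside a string. Whether the entity_type column contains values that merely include the word "Company" is not known here, so the following is only a sketch on made-up values (the value "CompanyProduct" is hypothetical) showing how substring matching differs from exact matching.

```r
# Hypothetical entity_type values, including one that merely contains "Company"
entity_type <- c("Company", "Person", "FinancialOrg", "CompanyProduct", NA)

# grep() matches the pattern anywhere in the string
grep("Company", entity_type)        # positions 1 and 4

# Anchoring the pattern or comparing with == keeps exact matches only
grep("^Company$", entity_type)      # position 1
which(entity_type == "Company")     # position 1; which() drops the NA
```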
Then, I started converting my date columns into numeric columns by computing the time differences between several dates in the data frame, so that I could use these fields in my predictive analysis. I began by converting these columns to the Date class.
# Using the code below, I converted founded_at, closed_at, first_milestone_at, last_milestone_at, first_investment_at, last_investment_at, first_funding_at, and last_funding_at to the Date class.
object1$founded_at <-as.Date(object1$founded_at)
object1$closed_at <-as.Date(object1$closed_at)
object1$first_investment_at <-as.Date(object1$first_investment_at)
object1$last_investment_at <- as.Date(object1$last_investment_at)
object1$first_funding_at <-as.Date(object1$first_funding_at)
object1$last_funding_at <- as.Date(object1$last_funding_at)
object1$first_milestone_at <- as.Date(object1$first_milestone_at)
object1$last_milestone_at <- as.Date(object1$last_milestone_at)
Then, I started to calculate time differences between dates, so that I could use them as numeric variables in my predictive analysis.
# Using the code below, I calculated the time difference in days between founded_at and closed_at and called the new vector TimeDiff1. Then I imputed all the missing values with 999,999,999 so that the operating startups would have an advantage over the closed ones. Finally, I converted it to numeric and displayed the first 6 values.
TimeDiff1<-difftime(object1$closed_at,object1$founded_at,units ="days")
TimeDiff1[is.na(TimeDiff1)]<-999999999
TimeDiff1<-as.numeric(TimeDiff1)
head(TimeDiff1)
## [1] 1e+09 1e+09 1e+09 1e+09 1e+09 1e+09
Above, I calculated the time difference between the founding date and the closing date. Then, using similar code, I created 9 more variables containing time differences between date columns. However, instead of imputing missing values with a huge number, I imputed them with 0.
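#From founding to last investment (the code below uses last_investment_at, so TimeDiff2 is identical to TimeDiff9; the difference to the last milestone would use last_milestone_at instead)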
TimeDiff2<-difftime(object1$last_investment_at,object1$founded_at, units = "days")
TimeDiff2[is.na(TimeDiff2)]<-0
TimeDiff2<-as.numeric(TimeDiff2)
head(TimeDiff2)
## [1] 0 0 0 0 0 0
#From first to last investment
TimeDiff3<-difftime(object1$last_investment_at,object1$first_investment_at, units = "days")
TimeDiff3[is.na(TimeDiff3)]<-0
TimeDiff3<-as.numeric(TimeDiff3)
head(TimeDiff3)
## [1] 0 0 0 0 0 0
#From first to last funding
TimeDiff4<-difftime(object1$last_funding_at,object1$first_funding_at, units = "days")
TimeDiff4[is.na(TimeDiff4)]<-0
TimeDiff4<-as.numeric(TimeDiff4)
head(TimeDiff4)
## [1] 961 0 0 0 0 0
# From first to last milestone
TimeDiff5<-difftime(object1$last_milestone_at,object1$first_milestone_at, units = "days")
TimeDiff5[is.na(TimeDiff5)]<-0
TimeDiff5<-as.numeric(TimeDiff5)
head(TimeDiff5)
## [1] 1109 0 3156 0 0 0
# From founding to first investment
TimeDiff6<-difftime(object1$first_investment_at,object1$founded_at, units = "days")
TimeDiff6[is.na(TimeDiff6)]<-0
TimeDiff6<-as.numeric(TimeDiff6)
head(TimeDiff6)
## [1] 0 0 0 0 0 0
# From founding to first funding
TimeDiff7<-difftime(object1$first_funding_at,object1$founded_at, units = "days")
TimeDiff7[is.na(TimeDiff7)]<-0
TimeDiff7<-as.numeric(TimeDiff7)
head(TimeDiff7)
## [1] -16 0 0 0 0 0
#From founding to first milestone
TimeDiff8<-difftime(object1$first_milestone_at,object1$founded_at, units = "days")
TimeDiff8[is.na(TimeDiff8)]<-0
TimeDiff8<-as.numeric(TimeDiff8)
head(TimeDiff8)
## [1] 1784 0 0 0 0 0
#From founding to last investment
TimeDiff9<-difftime(object1$last_investment_at,object1$founded_at, units = "days")
TimeDiff9[is.na(TimeDiff9)]<-0
TimeDiff9<-as.numeric(TimeDiff9)
head(TimeDiff9)
## [1] 0 0 0 0 0 0
#From founding to last funding
TimeDiff10<-difftime(object1$last_funding_at,object1$founded_at, units = "days")
TimeDiff10[is.na(TimeDiff10)]<-0
TimeDiff10<-as.numeric(TimeDiff10)
head(TimeDiff10)
## [1] 945 0 0 0 0 0
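The ten calculations above repeat the same three steps: take the difference in days, replace missing values, and convert to numeric. As a sketch of how that pattern could be written once, the helper below (days_between is a hypothetical name, not part of the original code) wraps those steps. The dates are made up for illustration.

```r
# Hypothetical helper: days between two Date vectors, with NA replaced by `fill`
days_between <- function(later, earlier, fill = 0) {
  d <- as.numeric(difftime(later, earlier, units = "days"))
  d[is.na(d)] <- fill
  d
}

# Small made-up example in place of the object1 columns
founded <- as.Date(c("2005-01-01", "2007-06-15", NA))
closed  <- as.Date(c("2008-01-01", NA, "2010-03-01"))

days_between(closed, founded)                    # 1095 0 0
days_between(closed, founded, fill = 999999999)  # 1095 999999999 999999999
```

Naming each call after the two columns it compares also makes it harder to pass the wrong column by accident.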
Having calculated 10 new variables, named TimeDiff1 to TimeDiff10, I decided to delete the character columns in the data frame that could not be categorized and used in regression.
# Using the code below, I created a new data frame called object2 that does not contain character variables except for category_code, status, and country_code, and then displayed the names of the variables in the new data frame.
object2<-object1[,-c(1,2,3,4,5,6,7,10,11,12,13,14,15,18,19,20,21,23,24,25,26,27,30,31,34,35,38,39,40)]
variable.names(object2)
## [1] "category_code" "status" "logo_width"
## [4] "logo_height" "country_code" "investment_rounds"
## [7] "invested_companies" "funding_rounds" "funding_total_usd"
## [10] "milestones" "relationships"
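Dropping columns by position works as long as the column order never changes. A sketch of the same selection by name is shown below, using the eleven variable names from the output above. The one-row demo data frame is a made-up stand-in for object1.

```r
# Columns kept for modelling, taken from the variable.names(object2) output above
keep <- c("category_code", "status", "logo_width", "logo_height", "country_code",
          "investment_rounds", "invested_companies", "funding_rounds",
          "funding_total_usd", "milestones", "relationships")

# Hypothetical one-row stand-in for object1 with two extra columns to drop
demo <- data.frame(id = "c:1", name = "Alpha", category_code = "web",
                   status = "operating", logo_width = 100, logo_height = 50,
                   country_code = "USA", investment_rounds = 0,
                   invested_companies = 0, funding_rounds = 2,
                   funding_total_usd = 1e6, milestones = 1, relationships = 3)

# Selecting by name keeps exactly the listed columns, in the listed order
selected <- demo[, keep]
names(selected)
```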
The next step concerns the category_code, country_code, and status
variables. Using the replace function, I generalized the categories in
the category_code column into 8 new groups: leisure, bizsupport,
building, petcare, travel, health, IT, and other.
#Grouping categories in category_code
object2$category_code<-replace(object2$category_code,object2$category_code %in% c("games_video","photo_video","social","hospitality","sports","fashion","messaging","music"),"leisure")
object2$category_code<-replace(object2$category_code,object2$category_code %in%c("network_hosting","advertising","enterprise","consulting","analytics","public_relations","security","legal","finance"),"bizsupport")
object2$category_code<-replace(object2$category_code,object2$category_code %in% c("cleantech","manufacturing","semiconductor","automotive","real_estate","nanotech"),"building")
object2$category_code<-replace(object2$category_code,object2$category_code=="pets","petcare")
object2$category_code<-replace(object2$category_code,object2$category_code %in% c("travel","transportation"),"travel")
object2$category_code<-replace(object2$category_code,object2$category_code %in%c("health","medical","biotech"),"health")
object2$category_code<-replace(object2$category_code,object2$category_code %in% c("search","hardware","web","software"),"IT")
object2$category_code<-replace(object2$category_code,object2$category_code %in% c("other","mobile","design","education","ecommerce","news","government","nonprofit","local"),"other")
Then I converted the category_code variable into a factor using the
as.factor function, so that it could be used in training predictive
models.
#Using the code below, I change the data role of category_code to factor and display new categories.
object2$category_code<-as.factor(object2$category_code)
levels(object2$category_code)
## [1] "bizsupport" "building" "health" "IT" "leisure"
## [6] "other" "petcare" "travel"
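The same grouping can also be expressed as a lookup table, which keeps the mapping in one place. The sketch below is illustrative only: it lists a handful of the mappings used above (the remaining categories would be added the same way) and applies them to a made-up vector of category codes.

```r
# Named vector: original category -> broader group (only a few entries shown;
# the remaining categories would be listed the same way)
group_lookup <- c(games_video = "leisure", music = "leisure",
                  advertising = "bizsupport", finance = "bizsupport",
                  cleantech = "building", pets = "petcare",
                  transportation = "travel", biotech = "health",
                  software = "IT", web = "IT", ecommerce = "other")

# Made-up sample of category_code values, including a missing one
category_code <- c("web", "music", "biotech", NA, "pets", "software")

# Look up each code by name; unname() drops the lookup names and NA inputs stay NA
grouped <- factor(unname(group_lookup[category_code]))
table(grouped, useNA = "ifany")
```

A category that is absent from the lookup would come back as NA, which makes unmapped values easy to spot before modelling.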
The next step is to generalize the categories of the country_code variable. I decided to group them by continent.
#Grouping country_code by continents using replace function. The code below creates 6 new categories: Africa, Asia, Europe, North America, South America, and Other containing mostly Australia, New Zealand and smaller Oceanian island countries.
object2$country_code<- replace(object2$country_code, object2$country_code %in% c('AGO', 'BDI', 'BEN', 'BWA', 'CIV', 'CMR', 'DZA', 'EGY', 'ETH', 'GHA', 'GIN', 'KEN', 'LSO', 'MAR', 'MDG', 'MUS', 'NAM', 'NER','NGA', 'REU','RWA', 'SDN','SEN', 'SLE', 'SOM','SWZ', 'SYC', 'TUN', 'TZA', 'UGA', 'ZAF', 'ZMB', 'ZWE'), "Africa")
object2$country_code<- replace(object2$country_code, object2$country_code %in% c('AFG', 'ARE', 'BGD', 'BHR', 'BRN', 'CHN', 'HKG', 'IDN', 'IND', 'IOT', 'IRN', 'IRQ', 'ISR','JOR', 'JPN', 'KAZ', 'KGZ', 'KHM', 'KOR', 'KWT','LAO', 'LBN', 'LKA', 'MAC', 'MDV', 'MMR', 'MYS', 'NPL', 'OMN', 'PAK', 'PCN','PHL','PRK','PST', 'QAT', 'SAU', 'SGP','SYR', 'THA', 'TJK', 'TWN', 'UZB', 'VNM', 'YEM'), "Asia")
object2$country_code<- replace(object2$country_code, object2$country_code %in% c('AIA', 'ALB', 'AND', 'ARM', 'AUT', 'AZE', 'BEL', 'BGR','BIH', 'BLR', 'CHE', 'CYP', 'CZE', 'DEU', 'DNK','ESP', 'EST', 'FIN', 'FRA', 'GBR', 'GEO', 'GIB', 'GLB', 'GRC', 'HRV', 'HUN', 'IRL', 'ISL', 'ITA', 'LIE', 'LTU','LUX', 'LVA', 'MCO', 'MDA', 'MKD', 'MLT', 'NLD', 'NOR', 'POL', 'PRT', 'ROM', 'RUS', 'SMR', 'SVK', 'SVN','SWE', 'TUR', 'UKR'),"Europe")
object2$country_code<- replace(object2$country_code, object2$country_code %in% c('ATG', 'BHS','BLZ', 'BMU', 'BRB', 'CAN', 'CRI','CUB','CYM', 'DMA', 'GLP', 'GRD', 'GTM', 'HND', 'HTI', 'JAM', 'MEX', 'MTQ', 'PAN', 'PRI', 'SLV', 'UMI','USA', 'VGB', 'VIR'),"North America")
object2$country_code<- replace(object2$country_code, object2$country_code %in% c('ARG', 'BOL', 'BRA', 'CHL', 'COL', 'DOM', 'ECU', 'NIC', 'PER', 'PRY', 'SUR', 'TTO', 'URY','VEN', 'VCT'), "South America")
object2$country_code<- replace(object2$country_code, object2$country_code %in% c('ANT', 'ARA', 'AUS', 'CSS', 'FST', 'HMI','NCL', 'NFK','NRU', 'NZL'), "Other")
Then I change the data role of the country_code variable
to factor using as.factor function.
#Using the code below, I change data role of the `country_code` variable to factor and display its categories.
object2$country_code<-as.factor(object2$country_code)
levels(object2$country_code)
## [1] "Africa" "Asia" "Europe" "North America"
## [5] "Other" "South America"
I also changed the name of the variable to
continent.
#Change name of col 'country code' to 'continent'
colnames(object2)[5] <- 'continent'
I also decided to group the categories in the status variable, which is
going to be the target variable. In the case of an acquisition or an
initial public offering, startups are still operating, hence I switched
these two categories to “operating”.
#Switching "acquired" and "ipo" to "operating" in the `status` column using the replace function. Then changing its data role to factor.
object2$status<-replace(object2$status,object2$status=="acquired","operating")
object2$status<-replace(object2$status,object2$status=="ipo","operating")
object2$status<-as.factor(object2$status)
levels(object2$status)
## [1] "closed" "operating"
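After collapsing the target into two classes, it is worth checking how the classes are balanced. The sketch below does both steps on a made-up status vector. It uses ifelse as an alternative to the two replace calls, which is an assumption of this example rather than the method used above.

```r
# Made-up status values standing in for object2$status
status <- c("operating", "acquired", "closed", "ipo", "operating", "operating")

# Everything except "closed" is treated as still operating
status2 <- factor(ifelse(status == "closed", "closed", "operating"))

table(status2)                        # closed: 1, operating: 5
round(prop.table(table(status2)), 2)  # share of each class
```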
Since my dataset has a lot of missing data across all columns, deleting
rows with missing values would reduce it to only a few hundred rows.
Therefore, I decided to impute the missing values. Doing so with mean
and mode values could be highly inaccurate and could influence the
results of my analysis. Instead, I used the missForest package, which
imputes missing data automatically by training random forest models on
the existing variables. In this case, this way of dealing with blank
cells seems the most reasonable.
#Imputing missing values using random forests. The second line of code shows the out-of-bag imputation error (NRMSE for numeric variables, PFC for categorical ones). The third line assigns the data frame with imputed values to object3.
object2.imp <-missForest(object2)
object2.imp$OOBerror
## NRMSE PFC
## 0.0000000 0.4186082
object3<-object2.imp$ximp
As can be seen, the proportion of falsely classified entries (PFC) for
the categorical variables equals 41.9%, which is a substantial error.
This might affect the models built later, their results, and the
findings. Despite that, imputing missing values with the missForest
function seems to be the most reasonable approach, because deleting the
affected rows would leave too few observations to train the models on.
Hence, the high misclassification rate for the categorical variables
should be attributed to poorly collected data rather than to a flaw in
the imputation process.
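The PFC value reported by missForest is the proportion of falsely classified entries among the imputed categorical cells, estimated out-of-bag. As a rough illustration of what that ratio measures, the sketch below hides known values in a made-up factor, fills them with the most frequent level (a simple baseline, not the random forest imputation used above), and computes the same proportion. All names and data here are hypothetical.

```r
set.seed(1)

# Made-up categorical variable with three levels
truth <- factor(sample(c("IT", "health", "other"), 200, replace = TRUE,
                       prob = c(0.5, 0.3, 0.2)))

# Hide 25% of the values to mimic missing data
masked <- truth
hidden <- sample(length(truth), 50)
masked[hidden] <- NA

# Baseline imputation: fill every gap with the most frequent observed level
mode_level <- names(which.max(table(masked)))
imputed <- masked
imputed[hidden] <- mode_level

# PFC = proportion of falsely classified entries among the imputed cells
pfc <- mean(as.character(imputed[hidden]) != as.character(truth[hidden]))
pfc
```

Comparing an imputation method's PFC against such a mode baseline gives a sense of whether a value like 41.9% is an improvement over the naive approach.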
To complete the data cleaning process, I added the newly created
columns to the imputed data frame object3.
# Using the cbind function, I added the TimeDiff columns to the object3 data frame. Note that TimeDiff9 is not among the columns listed in the call below.
object3<-cbind(object3,TimeDiff1,TimeDiff2,TimeDiff3,TimeDiff4,TimeDiff5,TimeDiff6,TimeDiff7,TimeDiff8,TimeDiff10)
As previously stated, I began the analysis by exploring the dataset and looking for possible relationships between the covariates and my target variable, status (operating vs. closed). Using Tableau Desktop, I created several charts that helped assess the overall distribution of the data.
Based on the above graphs made in Tableau, I can draw a few conclusions. First of all, the dataset consists mostly of operating companies. Closed startups make up less than 2% of all data. This small share may imply either that startups are mostly successful, in which case I should look for the reasons behind failures rather than successes, or that the number of failed startups is actually higher but those cases were not recorded. Since I cannot verify the latter, I will proceed on the first assumption and look for the causes of startups’ failures. According to the maps, the biggest hubs for startups are the USA and Western Europe. These developed regions tend to offer better opportunities for founders: access to an educated and experienced workforce, a large pool of investors who may support early operations, well-developed infrastructure that helps to establish a company and promote it both locally and globally, and potential government incentives. In the USA, most startup headquarters are located in California, in New York State, or in other densely populated urban areas. Closed startups, however, are distributed fairly evenly across all regions, and no pattern could be established for them. The bar charts showing the number of new startups by founding date reveal an interesting pattern. The number of new startups increased almost exponentially in the years 2000-2011 and then dropped in 2012 and 2013. Perhaps the global economic crisis of 2008-2012 negatively influenced the startup market in the following years. The data on closed startups seem to support this hypothesis, since the greatest increase in startup failures was recorded in the years 2008-2013. Another interesting characteristic is that the majority of failed startups seem to be very small, employing 3 people on average.
Such small companies probably suffered from financial difficulties and “teething problems”. However, bigger companies were closed as well, some employing as many as 49 people. Startups tend to close mostly in the web, software, mobile, games_video, and e-commerce industries. All of these are rapidly developing fields in which tech giants like Google, Apple, Microsoft, or Facebook dominate the scene. Intense competition among those giants and smaller companies creates high barriers to entry, which tiny startups struggle to overcome. Since finances seem to be the main problem of early startups, I looked at the funding data; the resulting conclusion is that failed startups tend to obtain external funding at a much earlier stage than their successful counterparts. This may mean that successful founders build a larger financial buffer than those who failed.
I begin my predictive analysis by dividing my data into training and testing sets. To make the models as accurate as possible, I selected the great majority of rows (90%) to train them, while the remaining 10% of rows are used to test the models and predict my target variable.
# Using the sample function, I randomly selected 90% of the rows in the dataset for the training set and the rest for the testing set. I called my training set "train" and my testing set "test".
index<-sample(nrow(object3), nrow(object3)*0.9)
train<-object3[index,]
test<-object3[-index,]
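Because closed startups make up less than 2% of the rows, a purely random split can give the testing set slightly more or fewer closed cases from run to run. The following is a minimal sketch, not the split used in this report, of a seeded and stratified 90/10 split. The data frame `toy`, the seed value, and the helper `stratified_split` are hypothetical names introduced only for illustration.

```r
# Hypothetical toy data standing in for object3 (NOT the real dataset).
set.seed(42)  # arbitrary seed, chosen only to make the sketch repeatable
toy <- data.frame(
  status = factor(c(rep("closed", 20), rep("operating", 980))),
  funding_rounds = rpois(1000, 1)
)

# Sample 90% of the row indices within each status level separately,
# so both classes keep the same proportion in train and test.
stratified_split <- function(df, target, train_frac = 0.9) {
  rows_by_class <- split(seq_len(nrow(df)), df[[target]])
  train_rows <- unlist(lapply(rows_by_class, function(rows) {
    rows[sample.int(length(rows), floor(length(rows) * train_frac))]
  }))
  list(train = df[train_rows, ], test = df[-train_rows, ])
}

parts <- stratified_split(toy, "status")
table(parts$train$status)
table(parts$test$status)
```

With a fixed seed the same rows are selected on every run, and sampling within each class keeps the share of closed startups equal in both sets.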
To understand which variables matter most for determining whether a startup succeeds, I built a decision tree.
#Using the code below, I load the rpart and rpart.plot packages and create a decision tree by setting my target variable to status, which is predicted from all covariates in the object3 data frame. Since the target is binary, I set method to class and complexity to 0.0001. Then I predict results on the testing set and display them in a table. I also plot the decision tree itself.
library(rpart)
library(rpart.plot)
rpart0 <- rpart(formula = status ~ ., data = train, method = "class", cp = 0.0001)
pred0 <- predict(rpart0, test, type = "class")
head(pred0)
## 4 19 29 33 46 54
## operating operating operating operating operating operating
## Levels: closed operating
table(test$status, pred0, dnn = c("True", "Pred"))
## Pred
## True closed operating
## closed 197 65
## operating 7 19387
sum(test$status != pred0) #number misclassified
## [1] 72
#misclassification rate
sum(test$status != pred0)/nrow(test)
## [1] 0.003663004
prp(rpart0, extra = 1)
table(train$status, predict(rpart0, type = "class"), dnn = c("True", "Pred"))
## Pred
## True closed operating
## closed 1781 541
## operating 30 174545
Based on the decision tree above, the most important variable is
TimeDiff1, which is the time difference between the foundation
and closing dates. The first split, which is the most important one,
states that a startup still operates if its TimeDiff1 exceeds
5,000,000 days, that is, if it has operated for more than 13,000
years. This is impossible, since the earliest records in this dataset
are from the 1960s. Because TimeDiff1 is derived from the closing
date, it most likely encodes the outcome itself rather than explaining
it. The second split is made on the funding_rounds variable. The
decision tree misclassified only 72 rows of the testing data, but since
TimeDiff1 greatly distorts my results, I decided to delete
it from my model and create a new decision tree without it.
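One way to spot a variable like this before modelling is to summarise it separately for each class. The sketch below uses made-up numbers and assumes, purely for illustration, that operating startups carry a very large placeholder value in a variable derived from the closing date. Whether this is what happened with TimeDiff1 is not confirmed by the output shown here, and the names `toy` and `days_to_close` are hypothetical.

```r
# Hypothetical data: days_to_close mimics a variable derived from the closing date.
# Operating startups have no closing date, so we ASSUME a large placeholder was filled in.
toy <- data.frame(
  status = factor(c("closed", "closed", "closed",
                    "operating", "operating", "operating")),
  days_to_close = c(400, 1500, 900, 5e6, 5e6, 5e6)
)

# Minimum and maximum of the variable within each class.
ranges <- lapply(split(toy$days_to_close, toy$status), range)
print(ranges)

# If the class ranges do not overlap, a single threshold separates the classes
# perfectly, which is a warning sign that the variable encodes the target.
no_overlap <- ranges[["closed"]][2] < ranges[["operating"]][1]
no_overlap
```

A variable that separates the classes this cleanly is usually worth excluding from the model or inspecting further before it is used as a predictor.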
#Using the code below, I created a decision tree by setting my target variable to status, which is predicted from all covariates in the object3 data frame except TimeDiff1. Since the target is binary, I set method to class and complexity to 0.0002. Then I predicted results on the testing set and displayed them in a table. I also plot the decision tree itself.
rpart1 <- rpart(formula = status ~category_code+continent+funding_rounds+funding_total_usd+milestones+relationships+TimeDiff2+TimeDiff3+TimeDiff4+TimeDiff5+TimeDiff6+TimeDiff7+TimeDiff8+TimeDiff10, data = train, method = "class", cp=0.0002)
pred1 <- predict(rpart1, test, type = "class")
head(pred1)
## 4 19 29 33 46 54
## operating closed operating operating operating operating
## Levels: closed operating
table(test$status, pred1, dnn = c("True", "Pred"))
## Pred
## True closed operating
## closed 6 256
## operating 15 19379
sum(test$status != pred1) #number misclassified
## [1] 271
#misclassification rate
sum(test$status != pred1)/nrow(test)
## [1] 0.01378714
prp(rpart1, extra = 1)
table(train$status, predict(rpart1, type = "class"), dnn = c("True", "Pred"))
## Pred
## True closed operating
## closed 67 2255
## operating 35 174540
This time, the decision tree is more plausible. The first split was made
on the funding_rounds variable and states that if a startup
received no funding rounds, then it is predicted to operate; otherwise,
other variables have to be assessed. Other important variables used to
create this decision tree include TimeDiff10 (the time difference
between foundation and the last funding round obtained),
category_code, and funding_total_usd. This implies that the
industry, the amount of funds obtained, and the time when they were
obtained play a crucial role in determining a startup's success.
The decision tree misclassified 271 rows of the testing data, which is
more than the previous decision tree, but still only about 1.4% of all
test rows. Such a low rate suggests an accurate model, although the tree
correctly identified only 6 of the 262 closed startups in the testing
set.
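To see what the overall rate hides, the per-class results can be computed directly from the confusion table. The sketch below re-enters the four test-set counts printed above for rpart1 by hand; the names `cm`, `recall_closed`, and `recall_operating` are introduced here for illustration only.

```r
# Test-set confusion matrix for rpart1, copied from the table above
# (rows = true status, columns = predicted status). matrix() fills by column.
cm <- matrix(c(6, 15, 256, 19379), nrow = 2,
             dimnames = list(True = c("closed", "operating"),
                             Pred = c("closed", "operating")))

# Overall share of misclassified rows.
misclass_rate <- 1 - sum(diag(cm)) / sum(cm)

# Share of each true class that the tree labelled correctly.
recall_closed    <- cm["closed", "closed"] / sum(cm["closed", ])
recall_operating <- cm["operating", "operating"] / sum(cm["operating", ])

round(c(misclass = misclass_rate,
        recall_closed = recall_closed,
        recall_operating = recall_operating), 4)
```

The overall error is dominated by the operating class because it accounts for 19,394 of the 19,656 test rows, so it says little about how well closures are detected.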
Then I created a logistic regression model using the glm
function and the tidyverse package.
#Using the code below, I imported the tidyverse package and created a logistic regression model using all variables in the object3 dataset except TimeDiff1 to determine status. Then I predicted the status of startups in the testing set and displayed the results in a table.
library(tidyverse)
log_reg<-glm(status~category_code+continent+funding_rounds+funding_total_usd+milestones+relationships+TimeDiff2+TimeDiff3+TimeDiff4+TimeDiff5+TimeDiff6+TimeDiff7+TimeDiff8+TimeDiff10,data = train,family = binomial)
summary(log_reg)
##
## Call:
## glm(formula = status ~ category_code + continent + funding_rounds +
## funding_total_usd + milestones + relationships + TimeDiff2 +
## TimeDiff3 + TimeDiff4 + TimeDiff5 + TimeDiff6 + TimeDiff7 +
## TimeDiff8 + TimeDiff10, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.4001 0.0458 0.1130 0.1822 3.4831
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.014e+00 1.572e-01 44.630 < 2e-16 ***
## category_codebuilding -4.061e-01 1.124e-01 -3.613 0.000302 ***
## category_codehealth 2.065e-01 1.086e-01 1.902 0.057220 .
## category_codeIT -7.816e-01 6.594e-02 -11.853 < 2e-16 ***
## category_codeleisure -5.186e-01 8.310e-02 -6.240 4.36e-10 ***
## category_codeother -2.038e-01 7.633e-02 -2.670 0.007594 **
## category_codepetcare 4.608e+00 7.099e-01 6.491 8.54e-11 ***
## category_codetravel 1.534e-01 2.552e-01 0.601 0.547706
## continentAsia -2.110e+00 1.735e-01 -12.164 < 2e-16 ***
## continentEurope -2.405e+00 1.661e-01 -14.480 < 2e-16 ***
## continentNorth America -2.533e+00 1.626e-01 -15.575 < 2e-16 ***
## continentOther -7.814e-01 2.135e-01 -3.659 0.000253 ***
## continentSouth America -1.367e+00 2.491e-01 -5.487 4.09e-08 ***
## funding_rounds -1.006e+00 3.081e-02 -32.657 < 2e-16 ***
## funding_total_usd 2.099e-09 9.150e-10 2.294 0.021775 *
## milestones -2.275e-01 3.218e-02 -7.069 1.56e-12 ***
## relationships 7.480e-02 9.038e-03 8.276 < 2e-16 ***
## TimeDiff2 -6.200e+00 1.266e+02 -0.049 0.960953
## TimeDiff3 6.203e+00 1.266e+02 0.049 0.960939
## TimeDiff4 1.422e-03 1.774e-04 8.012 1.13e-15 ***
## TimeDiff5 -2.456e-06 4.320e-05 -0.057 0.954668
## TimeDiff6 6.201e+00 1.266e+02 0.049 0.960951
## TimeDiff7 -1.116e-03 1.734e-04 -6.436 1.23e-10 ***
## TimeDiff8 1.304e-04 2.222e-05 5.867 4.45e-09 ***
## TimeDiff10 1.110e-03 1.712e-04 6.486 8.84e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24736 on 176896 degrees of freedom
## Residual deviance: 20816 on 176872 degrees of freedom
## AIC: 20866
##
## Number of Fisher Scoring iterations: 25
log_predict<-predict(log_reg,newdata = test, type = "response")
table(test$status, (log_predict>0.9)*1 , dnn = c("Truth", "Predicted"))
## Predicted
## Truth 0 1
## closed 14 248
## operating 141 19253
The model summary shows that some of the variables have p-values
higher than 0.05, which indicates that those variables are statistically
insignificant in predicting the target variable. Among those variables
are TimeDiff2, TimeDiff3,
TimeDiff5, and TimeDiff6.
Using a probability threshold of 0.9, the model incorrectly predicted 389 rows of the testing set: 141 were predicted to be closed while they are still operating, and 248 were predicted to be operating while they were closed. This gives a misclassification rate of 1.98%, which is higher than the rate for the decision tree.
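For a factor response, glm with family = binomial treats the first level as failure and the remaining level as success, so with the levels closed and operating the predicted probabilities refer to "operating", and a 1 in the table above means "predicted operating". The sketch below illustrates how the threshold changes the labels on simulated data; the data, the seed, and the helper `to_label` are hypothetical and are not taken from object3.

```r
set.seed(1)  # arbitrary seed for a repeatable toy example
n <- 500
funding_rounds <- rpois(n, 1)

# Hypothetical relationship, loosely echoing the negative funding_rounds
# coefficient in the summary above.
p_operating <- plogis(3 - 1 * funding_rounds)
status <- factor(ifelse(runif(n) < p_operating, "operating", "closed"),
                 levels = c("closed", "operating"))
toy <- data.frame(status, funding_rounds)

fit <- glm(status ~ funding_rounds, data = toy, family = binomial)
prob_operating <- predict(fit, newdata = toy, type = "response")

# Map probabilities back to the factor labels instead of 0/1.
to_label <- function(prob, threshold) {
  factor(ifelse(prob > threshold, "operating", "closed"),
         levels = c("closed", "operating"))
}

# A higher threshold can only move predictions from "operating" to "closed".
table(True = toy$status, Pred = to_label(prob_operating, 0.5))
table(True = toy$status, Pred = to_label(prob_operating, 0.9))
```

Labelling the predictions with the factor levels makes the table directly comparable with the decision tree tables, and the choice of threshold trades missed closures against false alarms.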
I tried removing statistically insignificant variables to see if the model would improve.
#TimeDiff2, TimeDiff3, and TimeDiff6 have the highest p-values, so they are eliminated from the model and a new logistic regression model is run.
log_reg1<-glm(status~category_code+continent+funding_rounds+funding_total_usd+milestones+relationships+TimeDiff4+TimeDiff5+TimeDiff7+TimeDiff8+TimeDiff10,data = train,family = binomial)
summary(log_reg1)
##
## Call:
## glm(formula = status ~ category_code + continent + funding_rounds +
## funding_total_usd + milestones + relationships + TimeDiff4 +
## TimeDiff5 + TimeDiff7 + TimeDiff8 + TimeDiff10, family = binomial,
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.3999 0.0457 0.1155 0.1820 3.4892
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.015e+00 1.572e-01 44.628 < 2e-16 ***
## category_codebuilding -4.055e-01 1.124e-01 -3.608 0.000309 ***
## category_codehealth 2.063e-01 1.086e-01 1.899 0.057511 .
## category_codeIT -7.863e-01 6.593e-02 -11.926 < 2e-16 ***
## category_codeleisure -5.216e-01 8.309e-02 -6.278 3.43e-10 ***
## category_codeother -2.049e-01 7.632e-02 -2.685 0.007251 **
## category_codepetcare 4.602e+00 7.096e-01 6.486 8.83e-11 ***
## category_codetravel 1.497e-01 2.553e-01 0.586 0.557654
## continentAsia -2.105e+00 1.735e-01 -12.132 < 2e-16 ***
## continentEurope -2.400e+00 1.661e-01 -14.453 < 2e-16 ***
## continentNorth America -2.529e+00 1.626e-01 -15.553 < 2e-16 ***
## continentOther -7.801e-01 2.135e-01 -3.653 0.000259 ***
## continentSouth America -1.365e+00 2.491e-01 -5.479 4.27e-08 ***
## funding_rounds -1.008e+00 3.079e-02 -32.730 < 2e-16 ***
## funding_total_usd 2.129e-09 9.179e-10 2.319 0.020381 *
## milestones -2.280e-01 3.215e-02 -7.091 1.33e-12 ***
## relationships 7.557e-02 9.009e-03 8.388 < 2e-16 ***
## TimeDiff4 1.421e-03 1.774e-04 8.007 1.17e-15 ***
## TimeDiff5 1.179e-06 4.310e-05 0.027 0.978173
## TimeDiff7 -1.115e-03 1.734e-04 -6.429 1.28e-10 ***
## TimeDiff8 1.335e-04 2.207e-05 6.050 1.45e-09 ***
## TimeDiff10 1.109e-03 1.712e-04 6.479 9.22e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24736 on 176896 degrees of freedom
## Residual deviance: 20826 on 176875 degrees of freedom
## AIC: 20870
##
## Number of Fisher Scoring iterations: 11
log_predict1<-predict(log_reg1,newdata = test, type = "response")
table(test$status, (log_predict1>0.9)*1 , dnn = c("Truth", "Predicted"))
## Predicted
## Truth 0 1
## closed 14 248
## operating 141 19253
Although most of the variables now seem to be statistically significant, the number of misclassified rows has not changed. Hence, I tried again by removing more variables.
#TimeDiff5 has a p-value of 0.98, which means that it is statistically insignificant in predicting the target variable, so it is eliminated from the model. milestones and funding_total_usd are removed as well, although the summary above shows p-values below 0.05 for both, and a new logistic regression model is run.
log_reg2<-glm(status~category_code+continent+funding_rounds+relationships+TimeDiff4+TimeDiff7+TimeDiff8+TimeDiff10,data = train,family = binomial)
summary(log_reg2)
##
## Call:
## glm(formula = status ~ category_code + continent + funding_rounds +
## relationships + TimeDiff4 + TimeDiff7 + TimeDiff8 + TimeDiff10,
## family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.4220 0.0444 0.1165 0.1859 3.5135
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.872e+00 1.568e-01 43.831 < 2e-16 ***
## category_codebuilding -3.170e-01 1.120e-01 -2.830 0.00466 **
## category_codehealth 3.245e-01 1.078e-01 3.010 0.00261 **
## category_codeIT -7.952e-01 6.579e-02 -12.087 < 2e-16 ***
## category_codeleisure -5.233e-01 8.291e-02 -6.311 2.77e-10 ***
## category_codeother -2.038e-01 7.624e-02 -2.674 0.00751 **
## category_codepetcare 4.677e+00 7.095e-01 6.591 4.36e-11 ***
## category_codetravel 1.938e-01 2.562e-01 0.756 0.44936
## continentAsia -2.027e+00 1.736e-01 -11.676 < 2e-16 ***
## continentEurope -2.351e+00 1.663e-01 -14.140 < 2e-16 ***
## continentNorth America -2.488e+00 1.628e-01 -15.282 < 2e-16 ***
## continentOther -6.686e-01 2.134e-01 -3.132 0.00173 **
## continentSouth America -1.364e+00 2.494e-01 -5.467 4.57e-08 ***
## funding_rounds -1.024e+00 3.050e-02 -33.569 < 2e-16 ***
## relationships 5.070e-02 7.752e-03 6.540 6.15e-11 ***
## TimeDiff4 1.483e-03 1.767e-04 8.391 < 2e-16 ***
## TimeDiff7 -1.091e-03 1.730e-04 -6.308 2.83e-10 ***
## TimeDiff8 1.154e-04 2.242e-05 5.147 2.65e-07 ***
## TimeDiff10 1.102e-03 1.707e-04 6.454 1.09e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24736 on 176896 degrees of freedom
## Residual deviance: 20896 on 176878 degrees of freedom
## AIC: 20934
##
## Number of Fisher Scoring iterations: 11
log_predict2<-predict(log_reg2,newdata = test, type = "response")
table(test$status, (log_predict2>0.9)*1 , dnn = c("Truth", "Predicted"))
## Predicted
## Truth 0 1
## closed 13 249
## operating 126 19268
All the numeric variables are now statistically significant, and the number of misclassified rows was reduced to 375. However, this is still higher than the number of rows incorrectly predicted by the decision tree model, and the AIC rose from 20866 to 20934. Overall, reducing the number of variables in the logistic regression did not meaningfully improve the model.
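The AIC values of the three logistic regression models can be checked by hand from the printed summaries. For a binary response the residual deviance equals minus twice the log-likelihood, so AIC is the residual deviance plus twice the number of estimated coefficients. The sketch below only re-enters the rounded figures printed above; the data frame `models` and its column names are introduced for illustration.

```r
# Values copied from the three summary() outputs above.
models <- data.frame(
  model        = c("log_reg", "log_reg1", "log_reg2"),
  resid_dev    = c(20816, 20826, 20896),
  resid_df     = c(176872, 176875, 176878),
  reported_aic = c(20866, 20870, 20934)
)
null_df <- 176896  # degrees of freedom of the intercept-only model

# Number of estimated coefficients, including the intercept.
models$n_coef <- null_df - models$resid_df + 1

# AIC = residual deviance + 2 * number of coefficients (binary response).
models$aic_check <- models$resid_dev + 2 * models$n_coef
models
```

Each removal lowers the penalty term by only a few units while the deviance grows by more, which is why the AIC increases from the first model to the third.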
I decided to try other techniques to see if I could improve the model. I
installed the ipred package, which contains a bagging function.
Then, I created the model.
#Using the code below, I load the ipred package (and rpart, which provides rpart.control) and create a bagging model using status as my target variable and the rest of the variables from the train dataset except TimeDiff1 as covariates. I set the number of bootstrap samples to 50. For the individual trees I set minsplit = 2, so a node needs at least two observations before a split is attempted, and cp = 0, so the trees are grown without complexity pruning.
library(ipred)
library(rpart)
bag_model <- bagging(formula = status~category_code+continent+funding_rounds+funding_total_usd+milestones+relationships+TimeDiff2+TimeDiff3+TimeDiff4+TimeDiff5+TimeDiff6+TimeDiff7+TimeDiff8+TimeDiff10,
                     data = train,
                     nbagg = 50,
                     coob = TRUE,
                     control = rpart.control(minsplit = 2, cp = 0))
#50 bootstrap samples, with a tree model fit to each of them
#predict() on an ipred bagging model returns the predicted classes, so the confusion matrix is built with table()
bag_pred <- predict(bag_model, newdata = test)
table(bag_pred, test$status, dnn = c("Predicted Class", "Observed Class"))
## Observed Class
## Predicted Class closed operating
## closed 6 20
## operating 256 19374
The bagging model misclassified 276 startups from the testing dataset, which is 5 more than the decision tree.
Let’s see if a RandomForest model can better predict the target variable.
#Using the code below, I imported the randomForest package. Then, using status as my target variable and the rest of the variables from the train dataset (except TimeDiff1) as covariates, I created a RandomForest model with 100 trees.
library(randomForest)
rf_model <- randomForest(status ~category_code+continent+funding_rounds+funding_total_usd+milestones+relationships+TimeDiff2+TimeDiff3+TimeDiff4+TimeDiff5+TimeDiff6+TimeDiff7+TimeDiff8+TimeDiff10,
data = train,
importance = TRUE,
ntree = 100)
rf_pred <- predict(rf_model, test)
head(rf_pred)
## 4 19 29 33 46 54
## operating operating operating operating operating operating
## Levels: closed operating
table(test$status, rf_pred, dnn = c("True", "Pred"))
## Pred
## True closed operating
## closed 1 261
## operating 0 19394
sum(test$status != rf_pred)
## [1] 261
The RandomForest model generated the fewest misclassifications, 261 in
total. Although it is the most complicated of the models, it seemed to
perform the best. However, the differences in the number of
misclassifications are so small that they hardly influence predictions.
All of the models ended up with a misclassification rate of less than 2%.
Such results suggest that predicting the failure or success of a startup
is highly accurate. However, the results also show that the model
expected the status of all but one row of the test data to be
“operating”. This seems quite odd and suggests a flaw in the model.
Therefore, I decided to create another randomForest model and increased
the number of trees to 200.
#Using the code below, with status as my target variable and the rest of the variables from the train dataset (except TimeDiff1) as covariates, I created a RandomForest model with 200 trees.
#randomForest has no control argument, so the rpart.control() settings used for bagging are not passed here
rf_model1 <- randomForest(status ~category_code+continent+funding_rounds+funding_total_usd+milestones+relationships+TimeDiff2+TimeDiff3+TimeDiff4+TimeDiff5+TimeDiff6+TimeDiff7+TimeDiff8+TimeDiff10,
                          data = train,
                          importance = TRUE,
                          ntree = 200)
rf_pred1 <- predict(rf_model1, test)
head(rf_pred1)
## 4 19 29 33 46 54
## operating operating operating operating operating operating
## Levels: closed operating
table(test$status, rf_pred1, dnn = c("True", "Pred"))
## Pred
## True closed operating
## closed 0 262
## operating 0 19394
sum(test$status != rf_pred1)
## [1] 262
Increasing the number of trees did not improve the predictions. The number of misclassified observations rose from 261 to 262, and the model now predicted every startup in the testing set to be operating.
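The test-set results of all models can be put side by side and compared with a trivial rule that labels every startup "operating", which would misclassify exactly the 262 closed startups. The sketch below only re-enters counts from the confusion tables in this section; the data frame `results` and its column names are introduced for illustration.

```r
# Test-set counts copied from the confusion tables in this section.
n_test   <- 19656   # 262 closed + 19394 operating
n_closed <- 262

results <- data.frame(
  model         = c("rpart1", "log_reg", "log_reg2", "bagging", "rf_model", "rf_model1"),
  closed_found  = c(6, 14, 13, 6, 1, 0),          # closed startups predicted as closed
  misclassified = c(271, 389, 375, 276, 261, 262)
)

results$error_rate    <- round(results$misclassified / n_test, 4)
results$recall_closed <- round(results$closed_found / n_closed, 4)

# Positive values mean fewer errors than the "always operating" rule.
results$fewer_errors_than_baseline <- n_closed - results$misclassified

results[order(results$misclassified), ]
```

Read this way, only the first random forest improves on the trivial rule, and only by a single row, which supports the suspicion that the models mostly reproduce the class imbalance.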
When setting up a new business entity, startup founders have to consider several factors that will affect their companies. Many challenges await brand-new startups, and they can lead to unmanageable situations and eventually to closure or bankruptcy. Reducing the risk of a startup being closed is important not only for its founder and employees, but also for the people and entities that invested in it. Investors would gain no profit and lose their invested capital, while founders would lose their own resources and committed time, as well as the reputation that is important in acquiring new funds. By using data analysis to draw conclusions, this study aimed to help mitigate the risk posed to investors and owners when a startup goes bankrupt.
Because of the great risk involved in establishing a new startup, I performed data analysis to find relationships that can help founders avoid unnecessary risk in conducting their company’s operations. In particular, I analyzed a data set containing information about startups, obtained from the Kaggle website. The data set consists of 11 tables and 154 columns, which gather data about offices, employees, relationships, funds gathered, investments, acquisitions, and IPOs of almost 130,000 startups from all around the world. In analyzing this data set, I created scatter plots, tables, histograms, frequency graphs, maps, and other graphical displays to find relationships between status and various characteristics of startups. I also uncovered new information from the data set by creating new variables, splitting the data, and grouping the data. Startup owners can use these relationships and new insights to adjust their decisions about their company’s operations. To further help owners and potential investors develop strategic decisions, I used logistic regression, bagging, decision trees, and random forests to create models that predict whether a startup is likely to become successful and continue its operations.
By examining plots in Tableau, I drew the following conclusions:
Only 2% of all startups in the dataset were closed. This may imply that the problem is marginal; however, it is not known whether all closures and bankruptcies were registered in this dataset. In reality, the share of failing startups may be much higher, but obtaining data about closed companies that existed for a very short time may be impossible.
Startups tend to be founded in urban areas of California, the East Coast of the USA, and Central and Western Europe. The number of closed startups is also the highest in the same areas. This may simply reflect the larger number of startups there, but a greater concentration of startups in a region may also increase competition and lead to more closures.
The trend of founding new startups is increasing: each year the number of new startups grows. Although this growth slowed down during 2008-2012, that was most likely due to the economic crisis and the recession on global markets.
The number of closed startups increases in proportion to the number of startups overall; however, a significant rise in closures in 2010-2012 implies that the volatility of global markets tends to have a strong influence on the survival of startups.
Startups tend to be closed at a very early stage of operations, since the average time from foundation to closing is about 48 months, while the average number of people employed is 3. Closed startups also obtain external funding much earlier than their successful counterparts.
IT is the leading industry in terms of the number of closed startups. Huge competition and high entry barriers quickly verify whether startups’ products and services are really unique. Those that are poorly developed are prone to being closed.
To reduce the risk of investing, I created models that predict whether a startup will keep operating or will be closed. The least precise of all, the logistic regression model, wrongly predicted only 1.98% of observations, while the most precise, the random forest model, improved this rate to 1.33%. These low rates should be read with caution, however, since the models labelled nearly all startups as operating.
Based on this report, entrepreneurs can decide whether they want to take a riskier approach and, for instance, enter the IT industry convinced of the superiority of their product or service, or choose a safer path and develop their startup in a less competitive field. They should gather more funds from personal savings for the early stage of operations and look for external funds only when the product or service is ready to be deployed on the market. Otherwise, investors’ funds needed to cover marketing and manufacturing expenses would be spent in the research and development phase, leaving the startup without enough financing for the actual product or service release. Startup founders should focus on careful allocation of early funds and follow economic indicators that can warn them of unfavorable market conditions. When setting up a startup, founders should consider urban areas that may offer beneficial conditions to new businesses. Investors, on the other hand, can use the created models to mitigate the risk of losing capital while investing in promising companies. Choosing startups that have survived more than 2 years on the market greatly reduces the risk of a faulty investment. Moreover, the more developed a startup is, the less likely it is to go bankrupt. Hence, investors should consider startups employing more than 5 people and possessing more than one office.
A major limitation of my research is the completeness of the dataset. Many rows had plenty of missing values. Only one in four startups has data concerning investors and funding, and less than 10% contained data about startups’ offices. Because of that, my models were highly impacted by imputed data. Although I chose one of the most accurate ways of imputing missing values, which was based on random forest models built on the existing values, 47% of categorical variables had a high probability of being incorrect. Apart from the many missing values, the dataset seems to be biased, consisting mostly of operating startups. While according to the newest 2023 statistics 10% of startups fail within one year of foundation, only 2% failed in my dataset. Future research should focus mostly on obtaining better-quality and more recent data. This would help to improve the existing models and discover new trends. Performing text analysis and searching for patterns within the websites or articles included in the dataset could also improve the risk assessment. There are multiple avenues for further research related to this topic, and my predictive models provide a foundation to inform future development in this area. In so doing, I shed new light on insights and relationships which can spur further research and investigation.