Introduction

Explanation of Problem Statement

Every year, hundreds of thousands of students around the world graduate from colleges and universities. Many of them end up working outside the field they studied, lured by higher wages and benefits. The majority begin their careers by joining established businesses, usually as interns or in entry-level jobs. However, some graduates aim higher: they would rather employ others than be employed, and would rather create and develop a new idea, concept, or business than contribute to an already prosperous one. According to USA Today, in the years 2010-2013, 45% of business school graduates decided to set up their own companies, twice as many as in the years 2000-2009. This trend among college graduates translates into an increasing number of startups being founded each year. However, only a few become successful in the long run.

Business Problem

Each newly founded startup seeks to become successful in the market it operates in. But what exactly does success mean for a startup? A startup is an institution that enables its founders to bring to market a unique, revolutionary idea or product that they believe could skyrocket. Most often the main goal of a startup is either to become a well-established company or to attract potential acquirers. According to Forbes, “a startup can graduate to a larger company by being acquired, opening more than one office, generating revenues greater than $20 million, or having more than 80 employees”. To achieve any of these milestones, a startup requires both financial resources and knowledgeable, skillful, and experienced staff. Several factors influence a startup’s success. Among the financial requirements, the most important is choosing how the startup should acquire funds for its operations. Whether the money comes from personal savings, family and friends, a bank loan, angel investors, venture capital funds, or even crowdfunding, startup owners have to decide which option best suits their needs. Each funding alternative has its pros and cons: while for one founder borrowing money from family members is the best solution, for others it may cause relationship problems. Another factor is the choice of industry in which the startup will operate. Founding another IT or tech startup may not be the best idea due to intense competition, high entry barriers, and legal regulations. A narrow specialization of a service or product may instead give a startup a competitive advantage over well-established companies. It is therefore worth noting that the intellectual and technological aspect plays a crucial role in the development of a startup. Even the best-funded startup may not achieve market success if its service, product, or idea is poorly executed and lacks a scientific background.
Behind each successful company there are skillful employees and smart managers who know how to take advantage of their subordinates’ technical know-how. Hence, the choice of employees and managers is as important as financial support at the early stage. Another factor that might affect a startup’s performance is location: where the startup is being developed and where it offers its products and services to the public. Rent, office accessibility, and equipment, among others, contribute to the costs of operations, which for a fledgling company should be kept as low as possible.

Data and Methodology Used to Address the Problem Statement

The analysis will be conducted on a dataset that originally comes from Crunchbase.com, a website that collects information about companies and their operations around the world. It contains data about over 190,000 startups founded up to the year 2013 and consists of 11 tables and 154 variables. The main table is called objects and gathers general information about each startup: name, date founded, date closed, location, industry, description of operations, etc. The other tables contain data about employees, offices, investments, funds obtained, milestones, and acquisitions. The dataset thus covers the main factors affecting startups’ performance discussed above. The analysis will consist of two approaches: descriptive and predictive. The first will focus on identifying trends among operating and closed startups and determining the importance of individual factors for startups’ success. For instance, it will be checked whether the number of funding rounds or the total amount of funds raised is related to whether a company was closed or is still operating. Using Tableau, I will visualize the relationships between the covariates and the target variable by creating bar plots, scatterplots, and maps that show which factors are associated with the operating status of startups and which appear to have no impact. The second part will focus on training machine learning models such as logistic regression, decision trees, bagging, boosting, and random forests, among others, to predict whether a startup will become successful or fail to meet expectations. Success in the analysis is defined by three criteria: the startup was acquired by another entity, it was still operating at the end of 2013, or it went public on a stock market. Closure of a startup by the end of 2013 is treated as failure. Unfortunately, the lack of access to more recent data does not allow me to predict outcomes for the most recently founded startups.
Nevertheless, I believe that the model obtained through this analysis may improve the decision-making process of current startup owners and contribute to a higher survival rate of startups.
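To make the predictive approach more concrete, here is a minimal sketch of a logistic regression on a binary status target. It runs on simulated toy data, not on the Crunchbase records; the coefficients of the simulation, the 70/30 split, and the 0.5 cut-off are illustrative assumptions.

```r
# Toy data only: simulated startups, not the Crunchbase records
set.seed(1234)
n <- 200
toy <- data.frame(
  funding_rounds = rpois(n, lambda = 2),
  milestones     = rpois(n, lambda = 1)
)

# Simulated rule (an assumption): more rounds and milestones raise the odds of operating
p_operating <- plogis(-0.5 + 0.6 * toy$funding_rounds + 0.4 * toy$milestones)
toy$status <- factor(ifelse(runif(n) < p_operating, "operating", "closed"),
                     levels = c("closed", "operating"))

# Hold out 30% of the rows for evaluation
test_idx <- sample(seq_len(n), size = 0.3 * n)
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# glm() with a binomial family models the probability of the second level ("operating")
fit <- glm(status ~ funding_rounds + milestones, data = train, family = binomial)

# Predicted probabilities, class labels at a 0.5 cut-off, and a confusion matrix
prob <- predict(fit, newdata = test, type = "response")
pred <- factor(ifelse(prob > 0.5, "operating", "closed"), levels = levels(toy$status))
confusion <- table(actual = test$status, predicted = pred)
accuracy <- sum(diag(confusion)) / sum(confusion)
```

The same train/predict/evaluate pattern carries over to the tree-based models, with only the fitting function changing.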

Benefits and Significance of Proposed Work

There are many determinants of a successful startup, which makes it almost impossible to focus on all of them while setting up a business. Verifying the most influential components may not only improve initial performance, but also contribute to reduction of operational costs. The resulting surplus can later be used for further development of the startup. Moreover, recognizing contemporary trends among startups may result in a useful piece of information. An owner may decide to follow a market tendency to reduce potential risks or quite the contrary, set up in a new direction hoping for an early competitive advantage. Finally, analyzing the history of both successful and unprofitable startups may lead to beneficial conclusions. Owners and managers will learn which choices should be avoided and which directions should be pursued. These three pieces of analysis put together should generate a perfect strategy that any rising startup should follow.

Stakeholder Analysis

Operations of a newly-founded startup affect several contributing parties:

  • Startup’s founders/owners – they seek not only to revolutionize the market with their idea, product, or service, but also to make their business idea profitable. Success or failure of the startup directly affects them financially, as they tend to invest their personal savings in it. They support the startup both financially and intellectually as they manage day-to-day operations.

  • Investors – they seek a return on their investment, so the startup’s success is highly desirable to them. A successful startup means high revenues, acquisition, or even going public (shares and dividends). They want to minimize risks and maximize gains. Often, investors are familiar with the startup’s industry and support its operations with their expertise.

  • Startup’s employees – besides financial benefits, specialists working for startups want to develop professionally by acquiring new skills, specializing in a certain field, or just contributing to a new invention. Success of a startup can boost their careers, even if they decide not to continue employment with the entity.

  • Competitors – depending on the competitor’s size and market share, they may want to take advantage of the startup’s success, either by acquiring it, purchasing its trademarks, copyrights, patents, etc., or simply by copying and implementing the startup’s product, service, or idea if it is not legally protected. They may try to undermine the startup’s operations to reduce its market value or obtain access to its product/service.

  • Customers – they tend to seek products or services that offer the highest quality for the lowest price (luxury products/services are an exception). Greater competition and more alternatives in the market usually result in lower prices, which customers look for. Startups’ success means that new products/services are introduced to the market, offering greater choice and new solutions.

Data Preparation

Original Data Source

Data was originally sourced from the Crunchbase website, which focuses on providing business information about private and public companies. Its content includes investment and funding information, founding members and individuals in leadership positions, mergers and acquisitions, news, and industry trends. Since obtaining data directly from crunchbase.com is costly, I acquired my dataset about startups from Kaggle. The dataset consists of 11 tables:

  • objects - gathers fundamental information about startups (date founded/closed, name, status, address, as well as general information about funds acquired by the startup)
  • offices - contains addresses and coordinates of all offices that belong to each startup
  • people - contains names, birthplaces, and affiliations of founders and startups’ employees
  • relationship - tells us who worked for which startup, whether it is a current employment and how long it has lasted, and the name of the position held by the employee
  • milestones - tells us what kind of milestones each startup achieved and when, and gives the URL of the related news
  • ipos - contains data about startups which went public (date, funds obtained, etc.)
  • investment - provides data about relationships between investors and startups
  • funds - consists of data about investors and the funds they gathered to invest in startups
  • funding_rounds - contains details about funding rounds of startups: dates when the funding round was obtained, amount raised, currency, as well as the market value of the startup before and after receiving a funding round
  • degrees - contains data about college/university degrees obtained by founders and employees of startups: major, type of degree, name of institution, and graduation date
  • acquisitions - gathers data about startups which were acquired by other companies: date of acquisition, price of acquisition, company that acquired the startup, etc.

Data Understanding

For the purpose of better understanding of the dataset, I created a table in Excel that supports further data preparation process with information about each column in each table. It consists of column name, data purpose, data type, data role, its consolidation, as well as definition of a column followed by a commentary section, in which data characteristics or possible data analysis actions/guidelines are included.

Data Preparation

Since my analysis is divided into two parts, descriptive and predictive, the data preparation process was also divided into two independent phases. The descriptive part of the startup project is conducted in Tableau through visualizations. The predictive part focuses on building models using R-Studio. These tools require different data cleaning processes. In Tableau, the focus is on using the greatest possible number of covariates (dimensions) to describe the target variable (status: operating vs. closed), hence joining tables together to create a minimum number of outputs is of crucial importance. On the other hand, developing predictive models such as logistic regression, decision trees, or random forests requires clean data (no missing values, only numeric variables and categorical variables with a limited number of levels), therefore imputing missing values and generalizing categories is the main area of focus.
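The two modelling requirements named above (no missing values, categorical variables with few levels) can be checked before any cleaning starts. The following is a small sketch on a made-up data frame; the values are illustrative and only the column names echo the objects table.

```r
# Toy data frame with the kinds of problems described above (illustrative values)
toy <- data.frame(
  funding_rounds = c(1, NA, 3, 2, NA),
  category_code  = c("web", "software", NA, "biotech", "web"),
  status         = c("operating", "closed", "operating", "operating", "closed"),
  stringsAsFactors = FALSE
)

# Share of missing values in each column
missing_share <- colMeans(is.na(toy))

# Number of distinct non-missing levels in each character column
is_char  <- vapply(toy, is.character, logical(1))
n_levels <- vapply(toy[is_char], function(x) length(unique(na.omit(x))), integer(1))
```

Columns with a high `missing_share` are candidates for imputation, and character columns with a large `n_levels` are candidates for grouping into broader categories.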

Part 1: Tableau Prep

Tableau Prep is an application designed to format and clean data before importing it into Tableau Desktop. Its main advantage is a user-friendly interface that makes cleaning data easy.

I began the cleaning process by importing my original data into Tableau Prep.

Then I dragged the tables into the interface and started cleaning them. The figure below shows the entire diagram of how the tables were cleaned and joined together.

As can be seen, I ended up with several outputs. Unfortunately, the structure of the dataset did not allow me to join all the tables together. Most of the tables are in a one-to-many relation with the objects table. This means that one row from the objects table could be joined with several rows from the funding_rounds table, since one startup could have several funding rounds. If I joined all the tables together, I would end up with several million rows instead of a few hundred thousand, with each startup’s values repeated many times. Such duplication could greatly distort later analysis results. Therefore, I decided to join the objects table with the other tables independently. Even though I cannot use all variables together, I can still focus on certain groups of startups to analyze their specific features.
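The row multiplication caused by chaining one-to-many joins can be reproduced on a tiny example. The sketch below uses base R `merge()` on made-up tables; the ids and the columns `raised_amount_usd` and `milestone_code` are illustrative assumptions rather than a copy of the real schema.

```r
# Toy tables: two startups, with several funding rounds and milestones each
objects <- data.frame(id = c("c:1", "c:2"), name = c("Alpha", "Beta"))
funding_rounds <- data.frame(
  object_id = c("c:1", "c:1", "c:1", "c:2"),
  raised_amount_usd = c(1e6, 2e6, 5e6, 3e5)
)
milestones <- data.frame(
  object_id = c("c:1", "c:1", "c:2"),
  milestone_code = c("launch", "partnership", "launch")
)

# Chaining both one-to-many joins multiplies rows: 3 rounds x 2 milestones + 1 x 1
joined <- merge(objects, funding_rounds, by.x = "id", by.y = "object_id")
joined <- merge(joined, milestones, by.x = "id", by.y = "object_id")
inflated_total <- sum(joined$raised_amount_usd)  # each round counted once per milestone

# Aggregating to one row per startup before joining avoids the duplication
rounds_per_company <- aggregate(raised_amount_usd ~ object_id,
                                data = funding_rounds, FUN = sum)
safe <- merge(objects, rounds_per_company, by.x = "id", by.y = "object_id")
true_total <- sum(safe$raised_amount_usd)
```

Here `joined` has seven rows for two startups and `inflated_total` overstates the money raised, while `safe` keeps one row per startup.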

Degrees-People-Relationship-Objects Output

  1. I removed the variables created_at and updated_at.
  2. In Clean 1, I changed the id column into degree_id by adding “d:” followed by the id number.
  3. I created an output of the table before aggregating the degree cells, so that I could come back and analyze degrees by major, institution, or date of graduation.
  4. Since a one-to-many relation occurs (one person can obtain several degrees), I used an aggregation step to aggregate degree_id by the person (object_id) obtaining the degree.
  5. In the next clean step, I renamed degree_id to number_of_degrees.
  6. I created a right join with the people table through object_id, including all the people table rows.
  7. I removed the duplicate object_id field.
  8. I removed the fields created_at, updated_at, and id from the relationships table.
  9. Then I added a clean step, in which I created a relationship_id starting with “r:” followed by a number.
  10. I inner-joined the relationships table to the people and degrees tables by the object_id field. The program output only those relationships that have people from the people table connected to them.
  11. In the following clean step, I removed the duplicate of object_id.
  12. Then I inner-joined the objects table (filtered for “company” in entity_type) to my flow through relationship_object_id (flow) and id (objects table).
  13. In the following clean step, I removed the id field and changed the role of the state_code field to geographic.
  14. I created an output for my flow, which was called Relationship+people+degree+object.

Funding Rounds-Objects Output

  1. I removed the created_at, updated_at, id, source_url, and source_description fields from the funding rounds table.
  2. I added a clean step in which I changed funding_round_id to start with “fr:” followed by a number.
  3. Then I inner-joined the objects table (filtered for “company” in entity_type) to my flow through object_id (funding rounds table) and id (objects table).
  4. In the clean step, I removed the id field and changed the role of state_code to geographic.
  5. I created an output of this flow named Funding rounds+object.

Milestones-Object Output

  1. From the milestones table, I removed the created_at, updated_at, id, source_url, and source_description fields.
  2. Then I inner-joined the objects table (filtered for “company” in entity_type) to my flow through object_id (milestones table) and id (objects table).
  3. In the clean step, I removed the id field and changed the role of state_code to geographic.
  4. I created an output for the flow named Milestone Prep.

Acquisitions-Object Output

  1. From the acquisitions table, I removed the created_at, updated_at, id, source_url, and source_description fields.
  2. In the clean step, I created a new acquisition_id field starting with “a:” followed by a number. I also filtered acquired_object_id and removed null values.
  3. Then I inner-joined the objects table (filtered for “company” in entity_type) to my flow through acquired_object_id (acquisitions table) and id (objects table).
  4. Then I removed the fields status, entity_type, and id.
  5. I created an output called acquisitions+object.

IPO-Object Output

  1. From the ipos table, I removed the created_at, updated_at, id, source_url, and source_description fields.
  2. In the clean step, I filtered object_id to exclude null values.
  3. Then I inner-joined the objects table (filtered for “company” in entity_type) to my flow through object_id (ipos table) and id (objects table).
  4. I removed the fields ipo_id, closed_at, id, and entity_id from the joined flow.
  5. I created an output of the flow called IPO+object.

Offices Output

  1. From the offices table, I removed the created_at, updated_at, and id fields.
  2. In the clean step, I created a new office_id starting with “o:” followed by a number.
  3. Then I inner-joined the objects table (filtered for “company” in entity_type) to my flow through object_id (offices table) and id (objects table).
  4. In the following clean step, I removed duplicates of the fields region, city, state_code, and country_code. I created a calculated field called office_type that assigns either a “Headquarters” or a “Branch office” label.
  5. I created an output of the flow called Offices.
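For readers who prefer code to a visual flow, the prefix, aggregate, and join pattern from the Degrees-People-Relationship-Objects output can be approximated in base R. This is only a sketch on made-up rows: it assumes the aggregation in Tableau Prep was a count of degrees per person, and the `first_name` column and all values are illustrative.

```r
# Toy versions of the degrees and people tables (illustrative rows)
degrees <- data.frame(id = 1:4,
                      object_id = c("p:1", "p:1", "p:2", "p:3"),
                      stringsAsFactors = FALSE)
people <- data.frame(object_id = c("p:1", "p:2", "p:3", "p:4"),
                     first_name = c("Ann", "Bob", "Cy", "Di"),
                     stringsAsFactors = FALSE)

# Prefix the numeric id so it cannot be confused with ids from other tables
degrees$degree_id <- paste0("d:", degrees$id)

# Collapse the one-to-many relation: one row per person with a degree count
number_of_degrees <- aggregate(degree_id ~ object_id, data = degrees, FUN = length)
names(number_of_degrees)[2] <- "number_of_degrees"

# all.y = TRUE keeps every row of people, i.e. a right join
people_degrees <- merge(number_of_degrees, people, by = "object_id", all.y = TRUE)
```

People without any recorded degree (here `p:4`) are kept with a missing `number_of_degrees`, which mirrors the behaviour of a right join that includes all rows of the people table.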

Part 2: R-Studio and Predictive Analysis Preparation

R-packages used for Analysis

The following eight packages are used in this final project report:

  • tidyverse
  • adabag
  • lubridate
  • randomForest
  • missForest
  • ipred
  • rpart
  • rpart.plot

The code for loading these eight packages is shown directly below.
# Below, the package tidyverse is loaded. If the user has not yet installed this package, he or she should additionally use the code install.packages("tidyverse")
library(tidyverse)

# Below, the package lubridate is loaded. If the user has not yet installed this package, he or she should additionally use the code install.packages("lubridate")
library(lubridate)

# Below, the package missForest is loaded. If the user has not yet installed this package, he or she should additionally use the code install.packages("missForest")
library(missForest)

# Below, the package rpart is loaded. If the user has not yet installed this package, he or she should additionally use the code install.packages("rpart")
library(rpart)

# Below, the package rpart.plot is loaded. If the user has not yet installed this package, he or she should additionally use the code install.packages("rpart.plot")
library(rpart.plot)

# Below, the package randomForest is loaded. If the user has not yet installed this package, he or she should additionally use the code install.packages("randomForest")
library(randomForest)

# Below, the package ipred is loaded. If the user has not yet installed this package, he or she should additionally use the code install.packages("ipred")
library(ipred)

# Below, the package adabag is loaded. If the user has not yet installed this package, he or she should additionally use the code install.packages("adabag")
library(adabag)

Note that while loading the packages above, message and warning were both set to FALSE. This suppressed the messages and warnings resulting from loading the eight packages. Also, echo was set to TRUE to ensure that the reader is able to view the R code for loading the required packages.

Purpose of Each Package

Here, I explain the purpose of each package in my data analysis. Loading the tidyverse package automatically loads other packages that will be helpful in the analysis, including ggplot2, dplyr, and tidyr. The ggplot2 package allows me to create visualizations of the data (i.e., graphs and plots); I will use dplyr to manipulate the data set and tidyr to tidy it. The missForest package helps impute missing values by using the existing data to build random forest models that predict which values should be imputed. I will use the lubridate package to transform my date variables into numeric columns. Finally, the packages ipred, rpart, randomForest, and adabag will be used to create the following models: bagging, decision trees, random forests, and boosting. The rpart.plot package will be used to plot the resulting decision trees.

Importing original dataset

For my predictive analysis, I wanted to use multiple date variables included in the objects table. However, as.Date() without an explicit format argument only parses unambiguous formats such as “YYYY-MM-DD”, hence I reformatted the date fields in Excel to obtain the required output. I then saved the table as a text file. To import the objects table into R-Studio I used the read.delim function, which reads delimited text files.
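As a short illustration of the date handling, the sketch below parses the same made-up date from two layouts. Without a `format` argument, `as.Date()` tries the ISO-style layouts; a day-first string needs an explicit format, otherwise it is either rejected or misread.

```r
# ISO-style strings parse without a format argument
iso <- as.Date("2013-01-31")

# A day-first string needs an explicit format to be read correctly
dmy <- as.Date("31-01-2013", format = "%d-%m-%Y")

# Both layouts now describe the same calendar day
same_day <- identical(iso, dmy)

# Once parsed, dates can be subtracted to obtain a numeric difference in days
days <- as.numeric(difftime(iso, as.Date("2013-01-01"), units = "days"))
```

Passing `format` explicitly in R would be an alternative to reformatting the columns in Excel.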

#The code below fixes the random seed so that the R Markdown file produces exactly the same random samples each time it is run
set.seed(1234)
#Using the code below, I imported the objects text file into R-Studio and named it object. Empty strings, single spaces, and "NA" are read as missing values.
object <- read.delim("/Users/macuser/Desktop/MSDA/Capstone Project/Startup dataset/objects_text.txt", na.strings=c(""," ","NA"))

Then, I examined the number of rows in the data frame and filtered its entity_type column to only include startups.

# Using the code below, I displayed the number of rows in the object data frame.
nrow(object) #462,651 rows
## [1] 462651

Filtering object table to display only rows defining startups

#Using the code below, I filtered the values of the entity_type column to "Company" and named the result object1. Then I displayed the number of observations in the new data frame and the names of its variables.
object1<-object[grep("Company",object$entity_type),]
nrow(object1) #196,553 rows
## [1] 196553
variable.names(object1)
##  [1] "id"                  "entity_type"         "entity_id"          
##  [4] "parent_id"           "name"                "normalized_name"    
##  [7] "permalink"           "category_code"       "status"             
## [10] "founded_at"          "closed_at"           "domain"             
## [13] "homepage_url"        "twitter_username"    "logo_url"           
## [16] "logo_width"          "logo_height"         "short_description"  
## [19] "description"         "overview"            "tag_list"           
## [22] "country_code"        "state_code"          "city"               
## [25] "region"              "first_investment_at" "last_investment_at" 
## [28] "investment_rounds"   "invested_companies"  "first_funding_at"   
## [31] "last_funding_at"     "funding_rounds"      "funding_total_usd"  
## [34] "first_milestone_at"  "last_milestone_at"   "milestones"         
## [37] "relationships"       "created_by"          "created_at"         
## [40] "updated_at"

Switching date columns to numeric variables

Then, I started to convert the date columns into numeric variables by calculating time differences between several dates in my data frame, so that I could use these fields in my predictive analysis. I began by changing the data roles of these columns to Date.

# Using the code below, I changed the data roles of founded_at, closed_at, first_milestone_at, last_milestone_at, first_investment_at, last_investment_at, first_funding_at, last_funding_at to Date.
object1$founded_at <-as.Date(object1$founded_at)
object1$closed_at <-as.Date(object1$closed_at)
object1$first_investment_at <-as.Date(object1$first_investment_at)
object1$last_investment_at <- as.Date(object1$last_investment_at)
object1$first_funding_at <-as.Date(object1$first_funding_at)
object1$last_funding_at <- as.Date(object1$last_funding_at)
object1$first_milestone_at <- as.Date(object1$first_milestone_at)
object1$last_milestone_at <- as.Date(object1$last_milestone_at)

Then, I started to calculate time differences between dates, so that I could use them as numeric variables in my predictive analysis.

#Using the code below, I calculate the time difference in days between founded_at and closed_at and call the new column TimeDiff1. Then I impute all the missing values with 999,999,999 so that the operating startups would have an advantage over the closed ones. Finally, I change the data role to numeric and display the first six values.
TimeDiff1<-difftime(object1$closed_at,object1$founded_at,units ="days")
TimeDiff1[is.na(TimeDiff1)]<-999999999
TimeDiff1<-as.numeric(TimeDiff1)
head(TimeDiff1)
## [1] 1e+09 1e+09 1e+09 1e+09 1e+09 1e+09

Above, I calculated the time difference between foundation date and closing date. Then, using similar code I created 9 new calculations displaying time difference between date columns. However, instead of imputing missing values with huge numbers, I imputed them with 0.

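#From founding to last investment (this is the same calculation as TimeDiff9 below; a founding-to-last-milestone difference would use last_milestone_at instead)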
TimeDiff2<-difftime(object1$last_investment_at,object1$founded_at, units = "days")
TimeDiff2[is.na(TimeDiff2)]<-0
TimeDiff2<-as.numeric(TimeDiff2)
head(TimeDiff2)
## [1] 0 0 0 0 0 0
#From first to last investment
TimeDiff3<-difftime(object1$last_investment_at,object1$first_investment_at, units = "days")
TimeDiff3[is.na(TimeDiff3)]<-0
TimeDiff3<-as.numeric(TimeDiff3)
head(TimeDiff3)
## [1] 0 0 0 0 0 0
#From first to last funding
TimeDiff4<-difftime(object1$last_funding_at,object1$first_funding_at, units = "days")
TimeDiff4[is.na(TimeDiff4)]<-0
TimeDiff4<-as.numeric(TimeDiff4)
head(TimeDiff4)
## [1] 961   0   0   0   0   0
# From first to last milestone
TimeDiff5<-difftime(object1$last_milestone_at,object1$first_milestone_at, units = "days")
TimeDiff5[is.na(TimeDiff5)]<-0
TimeDiff5<-as.numeric(TimeDiff5)
head(TimeDiff5)
## [1] 1109    0 3156    0    0    0
# From founding to first investment
TimeDiff6<-difftime(object1$first_investment_at,object1$founded_at, units = "days")
TimeDiff6[is.na(TimeDiff6)]<-0
TimeDiff6<-as.numeric(TimeDiff6)
head(TimeDiff6)
## [1] 0 0 0 0 0 0
# From founding to first funding
TimeDiff7<-difftime(object1$first_funding_at,object1$founded_at, units = "days")
TimeDiff7[is.na(TimeDiff7)]<-0
TimeDiff7<-as.numeric(TimeDiff7)
head(TimeDiff7)
## [1] -16   0   0   0   0   0
#From founding to first milestone
TimeDiff8<-difftime(object1$first_milestone_at,object1$founded_at, units = "days")
TimeDiff8[is.na(TimeDiff8)]<-0
TimeDiff8<-as.numeric(TimeDiff8)
head(TimeDiff8)
## [1] 1784    0    0    0    0    0
#From founding to last investment
TimeDiff9<-difftime(object1$last_investment_at,object1$founded_at, units = "days")
TimeDiff9[is.na(TimeDiff9)]<-0
TimeDiff9<-as.numeric(TimeDiff9)
head(TimeDiff9)
## [1] 0 0 0 0 0 0
#From founding to last funding
TimeDiff10<-difftime(object1$last_funding_at,object1$founded_at, units = "days")
TimeDiff10[is.na(TimeDiff10)]<-0
TimeDiff10<-as.numeric(TimeDiff10)
head(TimeDiff10)
## [1] 945   0   0   0   0   0
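The calculations above all follow one pattern (difference in days, then a fill value for missing results), so they could be wrapped in a small helper. The function name `days_between` and the toy dates below are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: difference in days between two Date vectors,
# with missing results replaced by a fill value (0 by default)
days_between <- function(start, end, fill = 0) {
  d <- as.numeric(difftime(end, start, units = "days"))
  d[is.na(d)] <- fill
  d
}

# Toy dates (illustrative, not taken from the dataset)
toy <- data.frame(
  founded_at       = as.Date(c("2005-01-01", NA, "2010-06-15")),
  first_funding_at = as.Date(c("2005-01-31", "2008-03-01", NA))
)

founding_to_first_funding <- days_between(toy$founded_at, toy$first_funding_at)

# A 0 can mean "same day" or "date missing"; a flag keeps the two cases apart
funding_gap_missing <- is.na(toy$founded_at) | is.na(toy$first_funding_at)
```

With such a helper, each TimeDiff column becomes a single call, and a separate missingness flag preserves information that the fill value alone would hide.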
Removing non-factorable character variables

Having calculated 10 new columns, named TimeDiff1 to TimeDiff10, I decided to delete the character columns in the data frame that could not be categorized and used in regression.

# Using the code below, I create a new data frame called object2 that keeps only category_code, status, country_code, and the numeric variables (selected by name rather than by position), and then display the names of variables in the new data frame.
object2<-object1[,c("category_code","status","logo_width","logo_height","country_code","investment_rounds","invested_companies","funding_rounds","funding_total_usd","milestones","relationships")]
variable.names(object2)
##  [1] "category_code"      "status"             "logo_width"        
##  [4] "logo_height"        "country_code"       "investment_rounds" 
##  [7] "invested_companies" "funding_rounds"     "funding_total_usd" 
## [10] "milestones"         "relationships"
Generalizing categories in category_code, country_code, and status variables

Then, using the replace function, I generalized the categories in the category_code column into eight new groups: leisure, bizsupport, building, petcare, travel, health, IT, and other.

#Grouping categories in category_code
object2$category_code<-replace(object2$category_code,object2$category_code %in% c("games_video","photo_video","social","hospitality","sports","fashion","messaging","music"),"leisure")
object2$category_code<-replace(object2$category_code,object2$category_code %in%c("network_hosting","advertising","enterprise","consulting","analytics","public_relations","security","legal","finance"),"bizsupport")
object2$category_code<-replace(object2$category_code,object2$category_code %in% c("cleantech","manufacturing","semiconductor","automotive","real_estate","nanotech"),"building")
object2$category_code<-replace(object2$category_code,object2$category_code=="pets","petcare")
object2$category_code<-replace(object2$category_code,object2$category_code %in% c("travel","transportation"),"travel")
object2$category_code<-replace(object2$category_code,object2$category_code %in%c("health","medical","biotech"),"health")
object2$category_code<-replace(object2$category_code,object2$category_code %in% c("search","hardware","web","software"),"IT")
object2$category_code<-replace(object2$category_code,object2$category_code %in% c("other","mobile","design","education","ecommerce","news","government","nonprofit","local"),"other")

Then I changed the category_code variable into factor using as.factor function, so that it can be used in training predictive models.

#Using the code below, I change the data role of category_code to factor and display new categories.
object2$category_code<-as.factor(object2$category_code)
levels(object2$category_code)
## [1] "bizsupport" "building"   "health"     "IT"         "leisure"   
## [6] "other"      "petcare"    "travel"
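An alternative to a chain of replace() calls is a single lookup vector that maps each original code to its group. The sketch below covers only a handful of the codes listed above and treats any unmapped code as "other", which is an assumption made for illustration.

```r
# Partial lookup: original category code -> generalized group (subset for illustration)
category_groups <- c(
  games_video = "leisure", music = "leisure",
  biotech = "health", medical = "health",
  software = "IT", web = "IT"
)

# Toy input, including a code outside the lookup and a missing value
codes <- c("web", "biotech", "music", "ecommerce", NA)

# Index the lookup by name; codes that are not in it come back as NA
grouped <- unname(category_groups[codes])

# Assumption: unmapped but non-missing codes fall into "other"
grouped[is.na(grouped) & !is.na(codes)] <- "other"

grouped_factor <- factor(grouped)
```

Keeping the mapping in one object makes it easier to review which code ends up in which group, and genuinely missing codes stay missing for the later imputation step.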

The next step is to generalize the categories of the country_code variable. I decided to group them by continent.

#Grouping country_code by continents using replace function. The code below creates 6 new categories: Africa, Asia, Europe, North America, South America, and Other containing mostly Australia, New Zealand and smaller Oceanian island countries.
object2$country_code<- replace(object2$country_code, object2$country_code %in% c('AGO', 'BDI', 'BEN', 'BWA', 'CIV', 'CMR', 'DZA', 'EGY', 'ETH', 'GHA', 'GIN', 'KEN', 'LSO', 'MAR', 'MDG', 'MUS', 'NAM', 'NER','NGA', 'REU','RWA', 'SDN','SEN', 'SLE', 'SOM','SWZ', 'SYC', 'TUN', 'TZA', 'UGA', 'ZAF', 'ZMB', 'ZWE'), "Africa")
object2$country_code<- replace(object2$country_code, object2$country_code %in% c('AFG', 'ARE', 'BGD', 'BHR', 'BRN', 'CHN', 'HKG', 'IDN', 'IND', 'IOT', 'IRN', 'IRQ', 'ISR','JOR', 'JPN', 'KAZ', 'KGZ', 'KHM', 'KOR', 'KWT','LAO', 'LBN', 'LKA', 'MAC', 'MDV', 'MMR', 'MYS', 'NPL', 'OMN', 'PAK', 'PCN','PHL','PRK','PST', 'QAT', 'SAU', 'SGP','SYR', 'THA', 'TJK', 'TWN', 'UZB', 'VNM', 'YEM'), "Asia")
object2$country_code<- replace(object2$country_code, object2$country_code %in% c('AIA', 'ALB', 'AND', 'ARM', 'AUT', 'AZE', 'BEL', 'BGR','BIH', 'BLR', 'CHE', 'CYP', 'CZE', 'DEU', 'DNK','ESP', 'EST', 'FIN', 'FRA', 'GBR', 'GEO', 'GIB', 'GLB', 'GRC', 'HRV', 'HUN', 'IRL', 'ISL', 'ITA', 'LIE', 'LTU','LUX', 'LVA', 'MCO', 'MDA', 'MKD', 'MLT', 'NLD', 'NOR', 'POL', 'PRT', 'ROM', 'RUS', 'SMR', 'SVK', 'SVN','SWE', 'TUR', 'UKR'),"Europe")
object2$country_code<- replace(object2$country_code, object2$country_code %in% c('ATG', 'BHS','BLZ', 'BMU', 'BRB', 'CAN', 'CRI','CUB','CYM', 'DMA', 'GLP', 'GRD', 'GTM', 'HND', 'HTI', 'JAM', 'MEX', 'MTQ', 'PAN', 'PRI', 'SLV', 'UMI','USA', 'VGB', 'VIR'),"North America")
object2$country_code<- replace(object2$country_code, object2$country_code %in% c('ARG', 'BOL', 'BRA', 'CHL', 'COL', 'DOM', 'ECU', 'NIC', 'PER', 'PRY', 'SUR', 'TTO', 'URY','VEN', 'VCT'), "South America")
object2$country_code<- replace(object2$country_code, object2$country_code %in% c('ANT', 'ARA', 'AUS', 'CSS', 'FST', 'HMI','NCL', 'NFK','NRU', 'NZL'), "Other")

Then I change the data role of the country_code variable to factor using as.factor function.

#Using the code below, I change data role of the `country_code` variable to factor and display its categories.
object2$country_code<-as.factor(object2$country_code)
levels(object2$country_code)
## [1] "Africa"        "Asia"          "Europe"        "North America"
## [5] "Other"         "South America"

I also changed the name of the variable to continent.

#Change name of col 'country code' to 'continent'
colnames(object2)[5] <- 'continent'

I also decided to group the categories of the status variable, which is going to be the target variable. In case of acquisition or initial public offering, startups are still operating, hence I switch these two categories to “operating”.

#Switching "acquired" and "ipo" to "operating" in the `status` column using the replace function, then changing its data role to factor.
object2$status<-replace(object2$status,object2$status=="acquired","operating")
object2$status<-replace(object2$status,object2$status=="ipo","operating")
object2$status<-as.factor(object2$status)
levels(object2$status)
## [1] "closed"    "operating"

Imputing missing values

Since my dataset includes a lot of missing data across all columns, deleting rows with missing values would reduce it to only a few hundred rows. Therefore, I decided to impute missing values. Imputing with mean and mode values could be highly inaccurate and could influence the results of my analysis. Instead, I use the missForest package, which imputes missing data automatically by building random forest models on the observed values of the other variables. In this case, this way of dealing with blank cells seems to be the most reasonable.

#Imputing missing values using random Forest. Second line of code shows the misclassification error of imputing data. Third line of code assigns new data frame with imputed values to object3.
object2.imp <-missForest(object2)
object2.imp$OOBerror
##     NRMSE       PFC 
## 0.0000000 0.4186082
object3<-object2.imp$ximp

As can be seen, the estimated misclassification error for imputed categorical values (PFC) equals 41.9%, which is a significant error. This might affect the models created later, their results, and their findings. Despite that, imputing missing values with the missForest function seems to be the most reasonable approach, because deleting these rows would lead to underfitting of the models. Hence, the high misclassification rate for categorical variables should be perceived not as a flaw of the imputation process but rather as a consequence of poor data collection.
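For reference, the two figures reported in OOBerror can be reproduced by hand on a toy example. The sketch below assumes the usual definitions: PFC is the proportion of falsely classified categorical entries, and NRMSE is the root mean squared error normalised by the variance of the true values. All vectors are made up for illustration. In practice the true values are unknown, so missForest estimates these quantities from out-of-bag predictions.

```r
# Made-up "true" and imputed values for entries that were missing
true_cat    <- c("IT", "health", "IT", "travel", "other")
imputed_cat <- c("IT", "IT", "IT", "other", "other")

true_num    <- c(1, 2, 3, 4, 5)
imputed_num <- c(1, 2, 4, 4, 6)

# Proportion of falsely classified categorical entries (assumed PFC definition)
pfc <- mean(true_cat != imputed_cat)

# Normalised root mean squared error for numeric entries (assumed definition)
nrmse <- sqrt(mean((true_num - imputed_num)^2) / var(true_num))

print(pfc)    # 2 of the 5 categorical entries differ
print(nrmse)
```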

Adding TimeDiff1-10 columns to the object3 data frame

To complete the data cleaning process, I add the newly created columns to the imputed data frame object3.

# Using cbind function I added TimeDiff1-TimeDiff10 columns to the object3 data frame.
object3<-cbind(object3,TimeDiff1,TimeDiff2,TimeDiff3,TimeDiff4,TimeDiff5,TimeDiff6,TimeDiff7,TimeDiff8,TimeDiff10)

Modelling

Exploratory Analysis

As previously stated, I begin the analysis by exploring the dataset and looking for possible relationships between the covariates and my target variable, status (operating vs. closed). Using Tableau Desktop, I created several charts that helped assess the overall distribution of the data.

Based on the above graphs made in Tableau, I can draw a few conclusions. First of all, the dataset consists mostly of operating companies; closed startups add up to less than 2% of all data. This small number of closed startups may imply either that startups tend to be mostly successful, so that I should look for reasons for startup failures rather than successes, or that the number of bankrupt startups is higher but those instances were not recorded. Since I cannot assess that, I will focus on the first assumption and look for the causes of startup failures.

Judging by the maps, the biggest hubs for startups are the USA and Western Europe. These developed regions tend to offer better opportunities for startup founders: access to an educated and experienced workforce, a large pool of investors who may support early operations, well-developed infrastructure that helps to establish a company and promote it both locally and globally, and potential government incentives. In the USA, most startup headquarters are located either in California, in New York State, or in more densely populated urban areas. Closed startups, however, are distributed fairly evenly across all regions, and no pattern could be established for them.

Bar charts displaying the number of new startups by foundation date show an interesting occurrence. The number of new startups increased almost exponentially in the years 2000-2011 and then dropped in 2012 and 2013. Perhaps the global economic crisis of 2008-2012 negatively influenced the startup market in the following years. Data on closed startups seem to support this hypothesis, since the greatest increase in startup failures was recorded in the years 2008-2013. Another interesting characteristic is that the majority of failed startups seem to be very small institutions employing on average 3 people. Such small companies probably suffered from financial and “teething” problems. However, there were also bigger companies that got closed, some employing as many as 49 people.

Startups tend to close mostly in the web, software, mobile, games_video, and e-commerce industries. All of these are rapidly developing fields in which tech giants like Google, Apple, Microsoft, or Facebook dominate the scene. High competition among those giants and smaller companies imposes high entry barriers, which tiny startups struggle to overcome. Since finances seem to be the main problem of early startups, I looked over the funding data; the resulting conclusion is that failed startups tend to obtain external funding at a much earlier stage than their successful counterparts. This may mean that the financial buffer created by successful founders is larger than that of those who failed.

Predictive Analysis

Creating Training and Testing Sets

I begin my predictive analysis by dividing my data into training and testing sets. I wanted to create models that are as accurate as possible, therefore I selected the great majority of rows (90%) to train my models, while the remaining 10% of rows will be used to test the models and predict my target variable.

# Using sample function, I randomly selected 90% of rows in the dataset for training set and the rest of it for testing set. I called my training set "train" and my testing set "test".
index<-sample(nrow(object3), nrow(object3)*0.9)
train<-object3[index,]
test<-object3[-index,]
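The split above changes on every run because no seed is set, and with so few closed startups the share of closed rows in the testing set can vary from run to run. A minimal sketch of a seeded, stratified 90/10 split on toy data follows; the seed value and the names toy, toy_train, and toy_test are assumptions for illustration and not part of the original analysis.

```r
set.seed(42)  # arbitrary seed so that the split can be reproduced

# Toy data standing in for object3: 1000 rows, 2% of them closed
toy <- data.frame(
  status = factor(c(rep("closed", 20), rep("operating", 980))),
  x = rnorm(1000)
)

# Sample 90% of the row indices separately within each status level
idx_by_class <- split(seq_len(nrow(toy)), toy$status)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.9 * length(idx)))]
}), use.names = FALSE)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

print(table(toy_train$status))
print(table(toy_test$status))
```

Sampling within each class keeps the proportion of closed startups the same in both sets, which makes the test-set error rates easier to compare across runs.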

Decision Tree

Searching for answers, I decided to create a decision tree, which would help me understand which variables are most important for determining the success of a startup.

#Using the code below, I created a decision tree by setting my target variable to status, which was to be predicted with all covariates from the object3 data frame. Since the target is binary, I set method to "class" and the complexity parameter cp to 0.0001. Then I predicted results on the testing set and displayed them in a table. I also plotted the decision tree itself.
rpart0 <- rpart(formula = status ~., data = train, method = "class", cp=0.0001)
pred0 <- predict(rpart0, test, type = "class")
head(pred0)
##         4        19        29        33        46        54 
## operating operating operating operating operating operating 
## Levels: closed operating
table(test$status, pred0, dnn = c("True", "Pred"))
##            Pred
## True        closed operating
##   closed       197        65
##   operating      7     19387
sum(test$status != pred0) #number misclassified
## [1] 72
#misclassification rate
sum(test$status != pred0)/nrow(test)
## [1] 0.003663004
prp(rpart0, extra = 1)

table(train$status, predict(rpart0, type = "class"), dnn = c("True", "Pred"))
##            Pred
## True        closed operating
##   closed      1781       541
##   operating     30    174545

Based on the decision tree above, the most important variable is TimeDiff1, which is the time difference between the foundation and closing dates. The first split, which is also the most important one, states that a startup still operates if it lasted more than 5,000,000 days, meaning that it would have to operate for at least 13,000 years to carry on operations until today. This is impossible, since the earliest records in this dataset are from the 1960s. The second split is made on the funding_rounds variable. The decision tree misclassified only 72 rows of the testing data, but because TimeDiff1 is derived from the closing date, it most likely reveals the very status it is supposed to predict. Since TimeDiff1 greatly distorts my results, I decided to delete it from my model and create a new decision tree without it.

#Using the code below, I created a decision tree by setting my target variable to status, which was to be predicted with all covariates from the object3 data frame except TimeDiff1. Since the target is binary, I set method to "class" and the complexity parameter cp to 0.0002. Then I predicted results on the testing set and displayed them in a table. I also plotted the decision tree itself.
rpart1 <- rpart(formula = status ~category_code+continent+funding_rounds+funding_total_usd+milestones+relationships+TimeDiff2+TimeDiff3+TimeDiff4+TimeDiff5+TimeDiff6+TimeDiff7+TimeDiff8+TimeDiff10, data = train, method = "class", cp=0.0002)
pred1 <- predict(rpart1, test, type = "class")
head(pred1)
##         4        19        29        33        46        54 
## operating    closed operating operating operating operating 
## Levels: closed operating
table(test$status, pred1, dnn = c("True", "Pred"))
##            Pred
## True        closed operating
##   closed         6       256
##   operating     15     19379
sum(test$status != pred1) #number misclassified
## [1] 271
#misclassification rate
sum(test$status != pred1)/nrow(test)
## [1] 0.01378714
prp(rpart1, extra = 1)

table(train$status, predict(rpart1, type = "class"), dnn = c("True", "Pred"))
##            Pred
## True        closed operating
##   closed        67      2255
##   operating     35    174540

This time, the decision tree is more plausible. The first split is made on the funding_rounds variable and states that if a startup received no funding rounds, it would operate; otherwise, other variables have to be assessed. Other important variables used to create this decision tree include TimeDiff10 (the time difference between foundation and the last funding round obtained), category_code, and funding_total_usd. This implies that the industry and the amount of funds obtained, as well as the time when they were obtained, play a crucial role in determining a startup’s success. The decision tree misclassified 271 rows of the testing data, which is more than the previous decision tree; however, the misclassified rows add up to only about 1.4% of all testing rows. Such a small rate suggests an accurate model overall, although the table above shows that only 6 of the 262 closed startups in the testing set were classified correctly.

Logistic Regression

Then I created a logistic regression model using glm function and tidyverse package.

#Using the code below, I imported the tidyverse package and created a logistic regression model using all variables in the object3 dataset except for TimeDiff1 to determine status. Then I predicted the status of startups from the testing set and displayed the results in a table.
library(tidyverse)
log_reg<-glm(status~category_code+continent+funding_rounds+funding_total_usd+milestones+relationships+TimeDiff2+TimeDiff3+TimeDiff4+TimeDiff5+TimeDiff6+TimeDiff7+TimeDiff8+TimeDiff10,data = train,family = binomial)
summary(log_reg)
## 
## Call:
## glm(formula = status ~ category_code + continent + funding_rounds + 
##     funding_total_usd + milestones + relationships + TimeDiff2 + 
##     TimeDiff3 + TimeDiff4 + TimeDiff5 + TimeDiff6 + TimeDiff7 + 
##     TimeDiff8 + TimeDiff10, family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.4001   0.0458   0.1130   0.1822   3.4831  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             7.014e+00  1.572e-01  44.630  < 2e-16 ***
## category_codebuilding  -4.061e-01  1.124e-01  -3.613 0.000302 ***
## category_codehealth     2.065e-01  1.086e-01   1.902 0.057220 .  
## category_codeIT        -7.816e-01  6.594e-02 -11.853  < 2e-16 ***
## category_codeleisure   -5.186e-01  8.310e-02  -6.240 4.36e-10 ***
## category_codeother     -2.038e-01  7.633e-02  -2.670 0.007594 ** 
## category_codepetcare    4.608e+00  7.099e-01   6.491 8.54e-11 ***
## category_codetravel     1.534e-01  2.552e-01   0.601 0.547706    
## continentAsia          -2.110e+00  1.735e-01 -12.164  < 2e-16 ***
## continentEurope        -2.405e+00  1.661e-01 -14.480  < 2e-16 ***
## continentNorth America -2.533e+00  1.626e-01 -15.575  < 2e-16 ***
## continentOther         -7.814e-01  2.135e-01  -3.659 0.000253 ***
## continentSouth America -1.367e+00  2.491e-01  -5.487 4.09e-08 ***
## funding_rounds         -1.006e+00  3.081e-02 -32.657  < 2e-16 ***
## funding_total_usd       2.099e-09  9.150e-10   2.294 0.021775 *  
## milestones             -2.275e-01  3.218e-02  -7.069 1.56e-12 ***
## relationships           7.480e-02  9.038e-03   8.276  < 2e-16 ***
## TimeDiff2              -6.200e+00  1.266e+02  -0.049 0.960953    
## TimeDiff3               6.203e+00  1.266e+02   0.049 0.960939    
## TimeDiff4               1.422e-03  1.774e-04   8.012 1.13e-15 ***
## TimeDiff5              -2.456e-06  4.320e-05  -0.057 0.954668    
## TimeDiff6               6.201e+00  1.266e+02   0.049 0.960951    
## TimeDiff7              -1.116e-03  1.734e-04  -6.436 1.23e-10 ***
## TimeDiff8               1.304e-04  2.222e-05   5.867 4.45e-09 ***
## TimeDiff10              1.110e-03  1.712e-04   6.486 8.84e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24736  on 176896  degrees of freedom
## Residual deviance: 20816  on 176872  degrees of freedom
## AIC: 20866
## 
## Number of Fisher Scoring iterations: 25
log_predict<-predict(log_reg,newdata = test, type = "response")
table(test$status, (log_predict>0.9)*1 , dnn = c("Truth", "Predicted"))
##            Predicted
## Truth           0     1
##   closed       14   248
##   operating   141 19253

The model summary shows that some of the variables have p-values higher than 0.05, which indicates that they are statistically insignificant in predicting the target variable. Among those variables are TimeDiff2, TimeDiff3, TimeDiff5, and TimeDiff6.

The model incorrectly predicted 389 rows of the testing set: 141 were predicted to be closed while they are still operating, and 248 were predicted to be operating while they got closed. This gives a misclassification rate of 1.98%, which is higher than the one for the decision tree.
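When reading the table above, it helps to recall that for a factor response glm with family = binomial treats the first level as failure and the remaining levels as success, so the fitted probabilities refer to "operating" here and the column labelled 0 corresponds to startups predicted as closed. The sketch below illustrates this on simulated data and labels the thresholded predictions with the class names; the data and the names toy, fit, prob_operating, and pred_class are hypothetical.

```r
set.seed(1)  # arbitrary seed for a reproducible toy example

# Simulated data: larger x makes "operating" more likely
n <- 500
x <- rnorm(n)
p_operating <- plogis(2 + 1.5 * x)
status <- factor(ifelse(runif(n) < p_operating, "operating", "closed"),
                 levels = c("closed", "operating"))
toy <- data.frame(status = status, x = x)

# The first factor level ("closed") is treated as failure,
# so the model estimates P(status == "operating")
fit <- glm(status ~ x, data = toy, family = binomial)
prob_operating <- predict(fit, newdata = toy, type = "response")

# Apply the 0.9 cut-off used above and label the result with class names
pred_class <- factor(ifelse(prob_operating > 0.9, "operating", "closed"),
                     levels = levels(toy$status))

print(table(toy$status, pred_class, dnn = c("Truth", "Predicted")))
```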

I tried removing statistically insignificant variables to see if model would improve.

#TimeDiff2, TimeDiff3, and TimeDiff6 have the highest p-values, therefore they will be eliminated from the model and a new logistic regression model will be run (investment_rounds and invested_companies are likewise left out of the formula).
log_reg1<-glm(status~category_code+continent+funding_rounds+funding_total_usd+milestones+relationships+TimeDiff4+TimeDiff5+TimeDiff7+TimeDiff8+TimeDiff10,data = train,family = binomial)
summary(log_reg1)
## 
## Call:
## glm(formula = status ~ category_code + continent + funding_rounds + 
##     funding_total_usd + milestones + relationships + TimeDiff4 + 
##     TimeDiff5 + TimeDiff7 + TimeDiff8 + TimeDiff10, family = binomial, 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.3999   0.0457   0.1155   0.1820   3.4892  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             7.015e+00  1.572e-01  44.628  < 2e-16 ***
## category_codebuilding  -4.055e-01  1.124e-01  -3.608 0.000309 ***
## category_codehealth     2.063e-01  1.086e-01   1.899 0.057511 .  
## category_codeIT        -7.863e-01  6.593e-02 -11.926  < 2e-16 ***
## category_codeleisure   -5.216e-01  8.309e-02  -6.278 3.43e-10 ***
## category_codeother     -2.049e-01  7.632e-02  -2.685 0.007251 ** 
## category_codepetcare    4.602e+00  7.096e-01   6.486 8.83e-11 ***
## category_codetravel     1.497e-01  2.553e-01   0.586 0.557654    
## continentAsia          -2.105e+00  1.735e-01 -12.132  < 2e-16 ***
## continentEurope        -2.400e+00  1.661e-01 -14.453  < 2e-16 ***
## continentNorth America -2.529e+00  1.626e-01 -15.553  < 2e-16 ***
## continentOther         -7.801e-01  2.135e-01  -3.653 0.000259 ***
## continentSouth America -1.365e+00  2.491e-01  -5.479 4.27e-08 ***
## funding_rounds         -1.008e+00  3.079e-02 -32.730  < 2e-16 ***
## funding_total_usd       2.129e-09  9.179e-10   2.319 0.020381 *  
## milestones             -2.280e-01  3.215e-02  -7.091 1.33e-12 ***
## relationships           7.557e-02  9.009e-03   8.388  < 2e-16 ***
## TimeDiff4               1.421e-03  1.774e-04   8.007 1.17e-15 ***
## TimeDiff5               1.179e-06  4.310e-05   0.027 0.978173    
## TimeDiff7              -1.115e-03  1.734e-04  -6.429 1.28e-10 ***
## TimeDiff8               1.335e-04  2.207e-05   6.050 1.45e-09 ***
## TimeDiff10              1.109e-03  1.712e-04   6.479 9.22e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24736  on 176896  degrees of freedom
## Residual deviance: 20826  on 176875  degrees of freedom
## AIC: 20870
## 
## Number of Fisher Scoring iterations: 11
log_predict1<-predict(log_reg1,newdata = test, type = "response")
table(test$status, (log_predict1>0.9)*1 , dnn = c("Truth", "Predicted"))
##            Predicted
## Truth           0     1
##   closed       14   248
##   operating   141 19253

Although most of the variables now seem to be statistically significant, the number of misclassified rows has not changed. Hence, I tried again by removing more variables.

#TimeDiff5 has a p-value greater than 0.1, which means that it is statistically insignificant in predicting the target variable. It is eliminated from the model together with milestones and funding_total_usd (whose p-values in the summary above are below 0.05), and a new logistic regression model is run.
log_reg2<-glm(status~category_code+continent+funding_rounds+relationships+TimeDiff4+TimeDiff7+TimeDiff8+TimeDiff10,data = train,family = binomial)
summary(log_reg2)
## 
## Call:
## glm(formula = status ~ category_code + continent + funding_rounds + 
##     relationships + TimeDiff4 + TimeDiff7 + TimeDiff8 + TimeDiff10, 
##     family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.4220   0.0444   0.1165   0.1859   3.5135  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             6.872e+00  1.568e-01  43.831  < 2e-16 ***
## category_codebuilding  -3.170e-01  1.120e-01  -2.830  0.00466 ** 
## category_codehealth     3.245e-01  1.078e-01   3.010  0.00261 ** 
## category_codeIT        -7.952e-01  6.579e-02 -12.087  < 2e-16 ***
## category_codeleisure   -5.233e-01  8.291e-02  -6.311 2.77e-10 ***
## category_codeother     -2.038e-01  7.624e-02  -2.674  0.00751 ** 
## category_codepetcare    4.677e+00  7.095e-01   6.591 4.36e-11 ***
## category_codetravel     1.938e-01  2.562e-01   0.756  0.44936    
## continentAsia          -2.027e+00  1.736e-01 -11.676  < 2e-16 ***
## continentEurope        -2.351e+00  1.663e-01 -14.140  < 2e-16 ***
## continentNorth America -2.488e+00  1.628e-01 -15.282  < 2e-16 ***
## continentOther         -6.686e-01  2.134e-01  -3.132  0.00173 ** 
## continentSouth America -1.364e+00  2.494e-01  -5.467 4.57e-08 ***
## funding_rounds         -1.024e+00  3.050e-02 -33.569  < 2e-16 ***
## relationships           5.070e-02  7.752e-03   6.540 6.15e-11 ***
## TimeDiff4               1.483e-03  1.767e-04   8.391  < 2e-16 ***
## TimeDiff7              -1.091e-03  1.730e-04  -6.308 2.83e-10 ***
## TimeDiff8               1.154e-04  2.242e-05   5.147 2.65e-07 ***
## TimeDiff10              1.102e-03  1.707e-04   6.454 1.09e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24736  on 176896  degrees of freedom
## Residual deviance: 20896  on 176878  degrees of freedom
## AIC: 20934
## 
## Number of Fisher Scoring iterations: 11
log_predict2<-predict(log_reg2,newdata = test, type = "response")
table(test$status, (log_predict2>0.9)*1 , dnn = c("Truth", "Predicted"))
##            Predicted
## Truth           0     1
##   closed       13   249
##   operating   126 19268

All the numeric variables are statistically significant now, and the number of misclassified rows was reduced to 375. However, it is still higher than the number of rows incorrectly predicted by the decision tree model. Reducing the number of variables in the logistic regression did not improve the model substantially.
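The three logistic models are nested, so the residual deviances and AIC values printed in the summaries above can be compared directly. The sketch below does this arithmetic with the figures copied from those summaries; because the printed values are rounded, the resulting p-values are only approximate, and the column names are hypothetical.

```r
# Residual deviance, residual degrees of freedom and AIC copied from the
# three summaries above (rounded as printed)
fits <- data.frame(
  model    = c("log_reg", "log_reg1", "log_reg2"),
  deviance = c(20816, 20826, 20896),
  df_resid = c(176872, 176875, 176878),
  aic      = c(20866, 20870, 20934)
)

# Deviance (likelihood-ratio) test of each reduced model against the
# model it was reduced from
dev_diff <- diff(fits$deviance)   # increase in residual deviance
df_diff  <- diff(fits$df_resid)   # number of parameters removed
p_value  <- pchisq(dev_diff, df = df_diff, lower.tail = FALSE)

comparison <- data.frame(
  reduced  = fits$model[-1],
  full     = fits$model[-nrow(fits)],
  dev_diff = dev_diff,
  df_diff  = df_diff,
  p_value  = p_value
)
print(comparison)
```

With these rounded inputs the AIC rises at each step and the deviance increases by more than the removed parameters would account for by chance, which is in line with the observation that dropping variables did not help.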

Bagging

I decided to try other techniques to see if I could improve the model. I installed the ipred package, which contains the bagging function. Then, I created the model.

#Using the code below, I imported the ipred package and created a bagging model using status as my target variable and the rest of the variables from the train dataset, except for TimeDiff1, as covariates. I set the number of bootstrap samples to 50. For the individual decision trees, I set minsplit to 2 (the minimum number of observations a node must contain before a split is attempted) and the complexity parameter cp to 0, so that each tree is grown without pruning.
bag_model <- bagging(formula = status~category_code+continent+funding_rounds+funding_total_usd+milestones+relationships+TimeDiff2+TimeDiff3+TimeDiff4+TimeDiff5+TimeDiff6+TimeDiff7+TimeDiff8+TimeDiff10, 
                     data = train, 
                     nbagg = 50, 
                     coob = TRUE,
                     control = rpart.control(minsplit = 2, cp = 0)) 
#50 bootstrap samples, and we fit a tree model to each of these 50 bootstrap samples
bag_pred <- predict(bag_model, newdata = test)
bag_pred$confusion
##                Observed Class
## Predicted Class closed operating
##       closed         6        20
##       operating    256     19374

The bagging model misclassified 276 startups from the testing dataset, which is 5 more than the decision tree.

Random Forest

Let’s see if a random forest model can better predict the target variable.

#Using the code below, I imported the randomForest package. Then, using status as my target variable and the rest of the variables from the train dataset, except for TimeDiff1, as covariates, I created a random forest model with 100 trees.
rf_model <- randomForest(status ~category_code+continent+funding_rounds+funding_total_usd+milestones+relationships+TimeDiff2+TimeDiff3+TimeDiff4+TimeDiff5+TimeDiff6+TimeDiff7+TimeDiff8+TimeDiff10, 
                         data = train, 
                         importance = TRUE, 
                         ntree = 100)
rf_pred <- predict(rf_model, test)
head(rf_pred)
##         4        19        29        33        46        54 
## operating operating operating operating operating operating 
## Levels: closed operating
table(test$status, rf_pred, dnn = c("True", "Pred"))
##            Pred
## True        closed operating
##   closed         1       261
##   operating      0     19394
sum(test$status != rf_pred)
## [1] 261

The random forest model generated the fewest misclassifications, 261 in total. Although it is the most complicated of the models, it seemed to perform the best. However, the difference in the number of misclassifications is so small that it hardly influences the predictions. All of the models ended up with a misclassification rate of less than 2%. Such results suggest that the failure or success of a startup can be predicted with high accuracy. However, the results also show that the model expected the status of nearly all rows to be “operating”. This seems quite odd and suggests a flaw in the model. Therefore, I decided to create another random forest model and increased the number of trees to 200.

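#Using the code below, with status as my target variable and the rest of the variables from the train dataset, except for TimeDiff1, as covariates, I created a random forest model with 200 trees. Note that randomForest has no documented control argument, so the rpart.control settings passed below most likely have no effect.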
rf_model1 <- randomForest(status ~category_code+continent+funding_rounds+funding_total_usd+milestones+relationships+TimeDiff2+TimeDiff3+TimeDiff4+TimeDiff5+TimeDiff6+TimeDiff7+TimeDiff8+TimeDiff10, 
                         data = train, 
                         importance = TRUE, 
                         ntree = 200,
                         control = rpart.control(minsplit = 2, cp = 0))
rf_pred1 <- predict(rf_model1, test)
head(rf_pred1)
##         4        19        29        33        46        54 
## operating operating operating operating operating operating 
## Levels: closed operating
table(test$status, rf_pred1, dnn = c("True", "Pred"))
##            Pred
## True        closed operating
##   closed         0       262
##   operating      0     19394
sum(test$status != rf_pred1)
## [1] 262

Increasing the number of trees did not improve the predictions. The number of misclassified observations remained practically the same (262 compared to 261), and this time the model predicted all the startups in the testing set to be operating.
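Predicting a single class for every row is a typical symptom of strong class imbalance. One technique that is not used in this report, but is commonly tried in this situation, is to downsample the majority class in the training set before fitting. The sketch below shows only the resampling step in base R on toy data; toy_train and balanced_train are hypothetical names, and whether this would improve the models above is untested.

```r
set.seed(7)  # arbitrary seed for a reproducible toy example

# Toy training set with a similar kind of imbalance: 30 closed, 970 operating
toy_train <- data.frame(
  status = factor(c(rep("closed", 30), rep("operating", 970))),
  funding_rounds = rpois(1000, lambda = 1)
)

closed_idx    <- which(toy_train$status == "closed")
operating_idx <- which(toy_train$status == "operating")

# Keep every closed row and an equally sized random subset of operating rows
keep_operating <- sample(operating_idx, size = length(closed_idx))
balanced_train <- toy_train[c(closed_idx, keep_operating), ]

print(table(balanced_train$status))
```

A model fitted on such a balanced set would still need to be evaluated on the untouched testing set, so that the reported error reflects the real class proportions.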

Summary

The problem statement

When setting up a new business entity, startup founders have to consider several factors that will affect their companies. There are many challenges waiting for brand-new startups, which may lead to unmanageable situations and eventually to closure or bankruptcy. Reducing the risk of a startup getting closed is important not only for its founder and employees, but also for the people and entities that invested in it. Investors would lose their invested capital without gaining any profit. Founders, in turn, would lose their own resources and committed time, as well as reputation, which is important in acquiring new funds. By using data analysis to draw conclusions, this study aimed to help mitigate the risk posed to investors and owners in case of a startup going bankrupt.

Addressing the Problem Statement

Because of the great risk involved in establishing a new startup, I performed data analysis to find relationships that can help founders avoid unnecessary risk in conducting their company’s operations. In particular, I analyzed a data set containing information about startups, obtained from the Kaggle website. The data set consists of 11 tables and 154 columns, which gather data about offices, employees, relationships, funds gathered, investments, acquisitions, and IPOs of almost 130,000 startups from all around the world. In analyzing this data set, I created scatter plots, tables, histograms, frequency graphs, maps, and other graphical displays to find relationships between status and various characteristics of startups. I also uncovered new information from the data set by creating new variables, splitting the data, and grouping the data. Startup owners can use these relationships and new insights to adjust their decisions involving the company’s operations. To further help owners and potential investors develop strategic decisions, I used logistic regression, decision trees, bagging, and random forests to create models that predict whether a startup is likely to become successful and continue its operations.

Insights Provided by Analysis

By examining plots in Tableau, I drew the following conclusions:

  • Only 2% of all startups in the dataset got closed. This may imply that the problem is marginal; however, it is not known whether all closures and bankruptcies were registered in this dataset. In reality, the share of failing startups may be much higher, but obtaining data about closed companies that existed for a very short time may be impossible.

  • Startups tend to be founded in urban areas of California, the East Coast of the USA, and Central and Western Europe. In the same areas, the number of closed startups is also the highest. Although it may seem that there is no correlation, a greater number of startups in a region increases competition and leads to an even greater number of closures.

  • The trend of founding new startups is increasing: each year the number of new startups grows. Although this growth slowed down during 2008-2012, this happened mostly due to the economic crisis and recession on global markets.

  • The number of closed startups increases proportionately to the number of startups overall; however, a significant rise in closures in the years 2010-2012 implies that the volatility of global markets tends to have a high influence on the survival of startups.

  • Startups tend to get closed at a very early stage of operations, since the average time from foundation to closing is about 48 months, while the average number of people employed is 3. Closed startups also decide to obtain external funding much earlier than their successful counterparts.

  • IT is the leading industry in terms of the number of closed startups. Huge competition and high entry barriers quickly verify whether startups’ products and services are truly unique. Those that are poorly developed are prone to getting closed.

To reduce the risk of investing, I created models that turned out to predict, with a low overall error rate, whether a startup will be operating or will get closed. The least precise of all, the logistic regression model, wrongly predicted only 1.98% of observations in the testing set, while the most precise, the random forest model, improved this rate to 1.33%.
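Because closed startups make up only 262 of the 19,656 rows in the testing set, the overall misclassification rate says little about how well closures are detected. The sketch below recomputes the rate for each model from the test-set confusion tables reported earlier and adds the share of closed startups that each model identified; the column names are hypothetical.

```r
# Counts copied from the test-set confusion tables reported earlier
results <- data.frame(
  model           = c("rpart1", "log_reg", "log_reg2",
                      "bag_model", "rf_model", "rf_model1"),
  closed_correct  = c(6, 14, 13, 6, 1, 0),            # closed predicted as closed
  closed_missed   = c(256, 248, 249, 256, 261, 262),  # closed predicted as operating
  operating_wrong = c(15, 141, 126, 20, 0, 0)         # operating predicted as closed
)
n_test <- 19656

results$error_rate <- (results$closed_missed + results$operating_wrong) / n_test
results$closed_recall <- results$closed_correct /
  (results$closed_correct + results$closed_missed)

# Error rate of a trivial rule that always predicts "operating"
baseline_error <- 262 / n_test

print(results[, c("model", "error_rate", "closed_recall")], digits = 3)
print(baseline_error)
```

In this comparison only rf_model beats the trivial baseline, and by a single row, while no model identifies more than 14 of the 262 closed startups.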

Implications for stakeholders

Based on this report, entrepreneurs can decide whether they want to take a riskier approach and, for instance, enter the IT industry convinced of the superiority of their product or service, or choose a safer pathway and develop their startup in a less competitive field. They should gather more funds from personal savings for the early stage of operations and look for external funds when the product or service is ready to be deployed on the market. Otherwise, investors’ funds needed to cover marketing and manufacturing expenses would be used up in the research and development phase, leaving the startup without enough financing for the actual product or service release. Startup founders should focus on careful allocation of early funds and follow economic indicators that can warn them of unfavorable market conditions. When setting up a startup, founders should consider urban areas that may offer beneficial conditions to new businesses. Investors, on the other hand, can use the created models to mitigate the risk of losing capital while investing in promising companies. Choosing startups that have survived more than 2 years on the market greatly reduces the risk of a faulty investment. Moreover, the more developed a startup is, the less likely it is to go bankrupt. Hence, investors should consider startups employing more than 5 people and possessing more than one office.

Limitations and future research

A major limitation of my research is the completeness of the dataset. Many rows had plenty of missing values. Only one in four startups has data concerning investors and funding, and less than 10% contained data about startups’ offices. Because of that, my models were highly impacted by imputed data. Although I chose one of the most accurate ways of imputing missing values, which was based on random forest models built from existing values, the estimated misclassification error for imputed categorical values was about 42%. Apart from the many missing values, the dataset also seems to be biased, consisting mostly of operating startups. While according to the newest 2023 statistics 10% of startups fail within one year of foundation, only 2% failed in my dataset. Future research should focus mostly on obtaining better-quality and more recent data. This would help to improve the already created models and discover new trends. Performing text analysis and searching for patterns within websites or articles included in the dataset could also improve the risk assessment. Indeed, there are multiple avenues for further research related to this topic, and my predictive models provide a foundation to inform future development in this area. In so doing, I shed new light on insights and relationships which can spur further research and investigation.