Our main goal here is to walk through a business case, analyze it, and draw useful conclusions from it. Although the dataset here is a dummy one, it has the same issues as most real datasets.
So let us begin our journey.
First, we will read the data using the read_csv() function from the readr package, which is part of the tidyverse.
library(tidyverse)
startups <- read_csv("data/CAX_Startup_Data.csv")
So we have 472 observations of 116 variables. One of them is the response variable, which we will designate later for prediction, and the rest are candidate predictors.
Let us see what the first 5 rows of the data look like.
head(startups, 5)
## # A tibble: 5 x 116
## Company_Name `Dependent-Company S~ `year of foundin~ `Age of company in~
## <chr> <chr> <chr> <chr>
## 1 Company1 Success No Info No Info
## 2 Company2 Success 2011 3
## 3 Company3 Success 2011 3
## 4 Company4 Success 2009 5
## 5 Company5 Success 2010 4
## # ... with 112 more variables: `Internet Activity Score` <int>, `Short
## # Description of company profile` <chr>, `Industry of company` <chr>,
## # `Focus functions of company` <chr>, Investors <chr>, `Employee Count`
## # <int>, `Employees count MoM change` <int>, `Has the team size grown`
## # <chr>, `Est. Founding Date` <chr>, `Last Funding Date` <chr>, `Last
## # Funding Amount` <int>, `Country of company` <chr>, `Continent of
## # company` <chr>, `Number of Investors in Seed` <chr>, `Number of
## # Investors in Angel and or VC` <chr>, `Number of Co-founders` <int>,
## # `Number of of advisors` <int>, `Team size Senior leadership` <int>,
## # `Team size all employees` <chr>, `Presence of a top angel or venture
## # fund in previous round of investment` <chr>, `Number of of repeat
## # investors` <chr>, `Number of Sales Support material` <chr>, `Worked in
## # top companies` <chr>, `Average size of companies worked for in the
## # past` <chr>, `Have been part of startups in the past?` <chr>, `Have
## # been part of successful startups in the past?` <chr>, `Was he or she
## # partner in Big 5 consulting?` <chr>, `Consulting experience?` <chr>,
## # `Product or service company?` <chr>, `Catering to product/service
## # across verticals` <chr>, `Focus on private or public data?` <chr>,
## # `Focus on consumer data?` <chr>, `Focus on structured or unstructured
## # data` <chr>, `Subscription based business` <chr>, `Cloud or platform
## # based serive/product?` <chr>, `Local or global player` <chr>, `Linear
## # or Non-linear business model` <chr>, `Capital intensive business e.g.
## # e-commerce, Engineering products and operations can also cause a
## # business to be capital intensive` <chr>, `Number of of Partners of
## # company` <chr>, `Crowdsourcing based business` <chr>, `Crowdfunding
## # based business` <chr>, `Machine Learning based business` <chr>,
## # `Predictive Analytics business` <chr>, `Speech analytics business`
## # <chr>, `Prescriptive analytics business` <chr>, `Big Data Business`
## # <chr>, `Cross-Channel Analytics/ marketing channels` <chr>, `Owns data
## # or not? (monetization of data) e.g. Factual` <chr>, `Is the company an
## # aggregator/market place? e.g. Bluekai` <chr>, `Online or offline
## # venture - physical location based business or online venture?` <chr>,
## # `B2C or B2B venture?` <chr>, `Top forums like 'Tech crunch' or
## # 'Venture beat' talking about the company/model - How much is it being
## # talked about?` <chr>, `Average Years of experience for founder and co
## # founder` <chr>, `Exposure across the globe` <chr>, `Breadth of
## # experience across verticals` <chr>, `Highest education` <chr>, `Years
## # of education` <chr>, `Specialization of highest education` <chr>,
## # `Relevance of education to venture` <chr>, `Relevance of experience to
## # venture` <chr>, `Degree from a Tier 1 or Tier 2 university?` <chr>,
## # `Renowned in professional circle` <chr>, `Experience in selling and
## # building products` <chr>, `Experience in Fortune 100 organizations`
## # <chr>, `Experience in Fortune 500 organizations` <chr>, `Experience in
## # Fortune 1000 organizations` <chr>, `Top management similarity` <chr>,
## # `Number of Recognitions for Founders and Co-founders` <chr>, `Number
## # of of Research publications` <chr>, `Skills score` <chr>, `Team
## # Composition score` <chr>, `Dificulty of Obtaining Work force` <chr>,
## # `Pricing Strategy` <chr>, `Hyper localisation` <chr>, `Time to market
## # service or product` <chr>, `Employee benefits and salary structures`
## # <chr>, `Long term relationship with other founders` <chr>,
## # `Proprietary or patent position (competitive position)` <chr>,
## # `Barriers of entry for the competitors` <chr>, `Company awards` <chr>,
## # `Controversial history of founder or co founder` <chr>, `Legal risk
## # and intellectual property` <chr>, `Client Reputation` <chr>, `google
## # page rank of company website` <chr>, `Technical proficiencies to
## # analyse and interpret unstructured data` <chr>, `Solutions offered`
## # <chr>, `Invested through global incubation competitions?` <chr>,
## # `Industry trend in investing` <int>, `Disruptiveness of technology`
## # <chr>, `Number of Direct competitors` <chr>, `Employees per year of
## # company existence` <chr>, `Last round of funding received (in
## # milionUSD)` <chr>, `Survival through recession, based on existence of
## # the company through recession times` <chr>, `Time to 1st investment
## # (in months)` <chr>, `Avg time to investment - average across all
## # rounds, measured from previous investment` <chr>, `Gartner hype cycle
## # stage` <chr>, `Time to maturity of technology (in years)` <chr>,
## # Percent_skill_Entrepreneurship <chr>, Percent_skill_Operations <chr>,
## # Percent_skill_Engineering <chr>, ...
From that, we can make the following notes:
- Missing values are not only coded as NA; there are other values like No Info or just an empty string.
- Some variables appear to hold free text, such as Short Description of company profile, Specialization of highest education or Investors.
- Dependent-Company Status is the response variable.
So, the following are the steps we will conduct to clean our dataset.
set_missing <- function(x) {
  # Replace 'No Info' with NA
  x[x == 'No Info'] <- NA
  # Replace empty string with NA
  x[x == ''] <- NA
  return(x)
}
startups <- map_df(startups, set_missing)
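As an alternative to cleaning after the fact, the placeholders can be declared as missing at read time. The sketch below is only an illustration on made-up CSV text (the `company` and `year` columns are hypothetical), and it uses base R's read.csv() so that it runs without extra packages; readr's read_csv() accepts a similar `na` argument.

```r
# Toy CSV text standing in for the real file (hypothetical columns).
csv_text <- "company,year\nCompany1,No Info\nCompany2,2011\nCompany3,\n"

# Declare every placeholder that should be read as NA.
toy <- read.csv(
  text = csv_text,
  na.strings = c("NA", "No Info", ""),
  stringsAsFactors = FALSE
)

# The year column is parsed as a number because no stray text is left in it.
print(toy)
print(is.na(toy$year))
```

A side effect of this approach is that columns such as the year are parsed as numbers straight away, because the text placeholder no longer forces them to character.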
Construct a vector of the factor variables and convert them
factor_cols <- c(2, 12, 16:17, 24, 26:65, 67, 71, 73, 75:87, 89:91, 93, 97, 100:101)
startups[, factor_cols] <- map_df(startups[, factor_cols], toupper)
startups[, factor_cols] <- map_df(startups[, factor_cols], as.factor)
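The reason for upper-casing before calling as.factor() is that factor levels are case sensitive, so inconsistent spellings would otherwise become separate levels. A minimal sketch on a made-up vector (the `answers` values are hypothetical, not taken from the dataset):

```r
# Hypothetical answers with inconsistent capitalisation.
answers <- c("Yes", "YES", "no", "No", "yes")

# Without normalisation every spelling becomes its own level.
raw <- factor(answers)
print(nlevels(raw))

# Upper-casing first collapses the spellings into two levels.
clean <- factor(toupper(answers))
print(levels(clean))
```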
Let us look at the summary for each variable to make sure that everything is ok.
map(startups[, factor_cols], summary)
## $`Dependent-Company Status`
## FAILED SUCCESS
## 167 305
##
## $`Has the team size grown`
## NO YES NA's
## 266 155 51
##
## $`Country of company`
## ARGENTINA AUSTRIA AZERBAIJAN
## 2 2 2
## BELGIUM BULGARIA CANADA
## 5 3 3
## CZECH REPUBLIC DENMARK ESTONIA
## 1 3 1
## FINLAND FRANCE GERMANY
## 2 8 6
## INDIA ISRAEL ITALY
## 10 4 1
## RUSSIAN FEDERATION SINGAPORE SPAIN
## 1 1 5
## SWEDEN SWITZERLAND UNITED KINGDOM
## 1 2 33
## UNITED STATES NA's
## 305 71
##
## $`Continent of company`
## ASIA EUROPE NORTH AMERICA SOUTH AMERICA NA's
## 15 76 308 2 71
##
## $`Presence of a top angel or venture fund in previous round of investment`
## NO YES NA's
## 282 93 97
##
## $`Number of Sales Support material`
## HIGH LOW MEDIUM NOTHING NA's
## 73 150 120 81 48
##
## $`Worked in top companies`
## NO YES NA's
## 380 73 19
##
## $`Average size of companies worked for in the past`
## LARGE MEDIUM SMALL NA's
## 83 130 228 31
##
## $`Have been part of startups in the past?`
## NO YES NA's
## 154 298 20
##
## $`Have been part of successful startups in the past?`
## NO YES NA's
## 194 258 20
##
## $`Was he or she partner in Big 5 consulting?`
## NO YES NA's
## 428 24 20
##
## $`Consulting experience?`
## NO YES NA's
## 245 205 22
##
## $`Product or service company?`
## BOTH PRODUCT SERVICE NA's
## 24 207 231 10
##
## $`Catering to product/service across verticals`
## NO YES NA's
## 231 230 11
##
## $`Focus on private or public data?`
## BOTH NO PRIVATE PUBLIC NA's
## 68 113 162 120 9
##
## $`Focus on consumer data?`
## NO YES NA's
## 281 182 9
##
## $`Focus on structured or unstructured data`
## BOTH NO NOT APPLICABLE STRUCTURED UNSTRUCTURED
## 120 98 7 166 72
## NA's
## 9
##
## $`Subscription based business`
## NO YES NA's
## 192 267 13
##
## $`Cloud or platform based serive/product?`
## BOTH CLOUD NONE PLATFORM NA's
## 68 65 31 296 12
##
## $`Local or global player`
## GLOBAL LOCAL NA's
## 237 211 24
##
## $`Linear or Non-linear business model`
## LINEAR NON-LINEAR NA's
## 134 320 18
##
## $`Capital intensive business e.g. e-commerce, Engineering products and operations can also cause a business to be capital intensive`
## NO YES NA's
## 328 118 26
##
## $`Number of of Partners of company`
## FEW MANY NONE NA's
## 73 14 284 101
##
## $`Crowdsourcing based business`
## NO YES NA's
## 437 30 5
##
## $`Crowdfunding based business`
## NO YES NA's
## 445 22 5
##
## $`Machine Learning based business`
## NO YES NA's
## 337 129 6
##
## $`Predictive Analytics business`
## NO YES NA's
## 316 151 5
##
## $`Speech analytics business`
## NO YES NA's
## 436 31 5
##
## $`Prescriptive analytics business`
## NO YES NA's
## 321 116 35
##
## $`Big Data Business`
## NO YES NA's
## 251 216 5
##
## $`Cross-Channel Analytics/ marketing channels`
## NO YES NA's
## 395 72 5
##
## $`Owns data or not? (monetization of data) e.g. Factual`
## NO YES NA's
## 411 52 9
##
## $`Is the company an aggregator/market place? e.g. Bluekai`
## NO YES NA's
## 335 107 30
##
## $`Online or offline venture - physical location based business or online venture?`
## BOTH OFFLINE ONLINE NA's
## 9 47 410 6
##
## $`B2C or B2B venture?`
## B2B B2C NA's
## 307 162 3
##
## $`Top forums like 'Tech crunch' or 'Venture beat' talking about the company/model - How much is it being talked about?`
## HIGH LOW MEDIUM NONE NA's
## 28 243 78 43 80
##
## $`Average Years of experience for founder and co founder`
## HIGH LOW MEDIUM NA's
## 271 13 108 80
##
## $`Exposure across the globe`
## NO YES NA's
## 141 246 85
##
## $`Breadth of experience across verticals`
## HIGH LOW MEDIUM NA's
## 37 178 172 85
##
## $`Highest education`
## BACHELORS MASTERS PHD NA's
## 169 166 34 103
##
## $`Years of education`
## 18 21 25 NA's
## 169 166 34 103
##
## $`Specialization of highest education`
## BUSINESS
## 25
## MBA
## 23
## COMPUTER SCIENCE
## 22
## TECH
## 21
## ENGG
## 19
## MANAGEMENT
## 18
## MGMT
## 18
## ECONOMICS
## 16
## ARTS
## 12
## BUSINESS MANAGEMENT
## 10
## FINANCE
## 8
## CSE
## 7
## PHD
## 7
## TECHNOLOGY
## 7
## COMPUTERS
## 6
## ELECTRICAL
## 6
## LAW
## 6
## MARKETING
## 5
## COMPUTER SC
## 4
## GENERAL
## 4
## MANGEMENT
## 4
## BUSINESS ADMINISTRATION
## 3
## ELECTRICAL ENGINEERING
## 3
## MATHS
## 3
## MEDIA
## 3
## BUSSINESS
## 2
## COMMUNICATION
## 2
## COMPUTER
## 2
## DIPLOMA
## 2
## ECO
## 2
## HISTORY
## 2
## IT
## 2
## MGM
## 2
## NO
## 2
## PHYSICS
## 2
## POLITICAL SCIENCE
## 2
## ACCOUNTING
## 1
## AEROSPACE
## 1
## AEROSPACE ENGINEERING
## 1
## AQUATIC BIOLOGY
## 1
## ARTIFICIAL INTELLIGENCE AND ADVANCED TECHNLOGES
## 1
## ARTS AND CULTURE
## 1
## ARTS, ECONOMICS
## 1
## ARTS, MEDIA
## 1
## BACHELOR
## 1
## BIO ENGG
## 1
## BIOLOGY
## 1
## BIOMEDICAL AND MECHANICAL ENGINEERING
## 1
## BIOMEDICAL ENTERPRISE
## 1
## BIOMEDICAL INFORMATICS
## 1
## BSC
## 1
## BSEET
## 1
## BUSINESS STUDIES
## 1
## CHEMICAL ENGG
## 1
## CHEMICAL ENGINEERING
## 1
## COMMUNICTIONS
## 1
## COMPUTER SYSTEMS ENGINEERING
## 1
## CS
## 1
## DATA MINING
## 1
## DSIGN
## 1
## EARTH SCIENCES
## 1
## EAST ASIAN STUDIES AND ECONOMICS
## 1
## ECO.POLSC
## 1
## ECONOMICS AND AMERICAN LITERATURE
## 1
## ECONOMICS,COMPUTERS
## 1
## ELECTRICAL AND ELECTRONICS ENGINEERING\nAEROSPACE ENGINEERING
## 1
## ELECTRICAL ENG
## 1
## ELECTRICAL ENGG
## 1
## ENGINEERING
## 1
## ENGINEERING, MANAGEMENT
## 1
## ENGLISH
## 1
## ENTERTAINMENT
## 1
## ENTREPRENEURIAL
## 1
## ENTREPRENEURSHIP
## 1
## ENTREPRENEURSHIP & INNOVATION
## 1
## FINANCE, MANAGEMENT
## 1
## FINANCE, STARTEGY AND ENTERPRENEURSHIP
## 1
## FINE ARTS
## 1
## GENERAL MANGEMENT
## 1
## GEOLOGY
## 1
## HONORS PROGRAM, BUSINESS, FINANCE, INFORMATION
## 1
## HUMAN COMPUTER INTERACTION
## 1
## IMAGE PROCESSING
## 1
## INDUSTRI
## 1
## INDUSTRIAL ENGINEERING AND COMPUTER SCIENCE
## 1
## INFORMATION TECH
## 1
## INTERNATIONAL TRADE
## 1
## JD
## 1
## JOURNALISM
## 1
## LEARNING AND ORGANIZATIONAL CHANGE\nCOMPUTER SCIENCE
## 1
## LIT
## 1
## LITERATURE AND HISTORY
## 1
## MA
## 1
## MANAGEMENT AND INFORMATION TECHNOLOGY
## 1
## MANAGEMENT AND TECHNOLOGY
## 1
## MANAGEMENT, COMPUTER SCIENCE
## 1
## MARKETING, LOGISTICS & DISTRIBUTION
## 1
## MARKTING AND COMMUNICATON
## 1
## (Other)
## 27
## NA's
## 101
##
## $`Relevance of education to venture`
## NO YES NA's
## 96 281 95
##
## $`Relevance of experience to venture`
## NO YES NA's
## 84 301 87
##
## $`Degree from a Tier 1 or Tier 2 university?`
## BOTH NONE TIER_1 TIER_2 NA's
## 43 144 139 58 88
##
## $`Experience in selling and building products`
## HIGH LOW MEDIUM NONE NA's
## 127 82 147 34 82
##
## $`Top management similarity`
## HIGH LOW MEDIUM NONE NA's
## 55 47 88 199 83
##
## $`Number of of Research publications`
## FEW MANY NONE NA's
## 57 81 250 84
##
## $`Team Composition score`
## HIGH LOW MEDIUM NA's
## 82 185 121 84
##
## $`Dificulty of Obtaining Work force`
## HIGH LOW MEDIUM NA's
## 58 178 150 86
##
## $`Pricing Strategy`
## NO YES NA's
## 198 190 84
##
## $`Hyper localisation`
## NO YES NA's
## 330 60 82
##
## $`Time to market service or product`
## HIGH LOW MEDIUM NA's
## 17 253 119 83
##
## $`Employee benefits and salary structures`
## AVERAGE BAD GOOD VERY GOOD NA's
## 26 42 33 20 351
##
## $`Long term relationship with other founders`
## NO YES NA's
## 277 106 89
##
## $`Proprietary or patent position (competitive position)`
## NO YES NA's
## 294 92 86
##
## $`Barriers of entry for the competitors`
## NO YES
## 220 252
##
## $`Company awards`
## NO YES NA's
## 311 76 85
##
## $`Controversial history of founder or co founder`
## NO YES NA's
## 380 10 82
##
## $`Legal risk and intellectual property`
## NO YES NA's
## 329 58 85
##
## $`Client Reputation`
## HIGH LOW MEDIUM NA's
## 58 119 21 274
##
## $`Technical proficiencies to analyse and interpret unstructured data`
## NO YES NA's
## 216 173 83
##
## $`Solutions offered`
## NO YES NA's
## 169 218 85
##
## $`Invested through global incubation competitions?`
## NO YES NA's
## 285 51 136
##
## $`Disruptiveness of technology`
## HIGH LOW MEDIUM NA's
## 108 93 189 82
##
## $`Survival through recession, based on existence of the company through recession times`
## NO NOT APPLICABLE YES NA's
## 31 268 75 98
##
## $`Gartner hype cycle stage`
## PEAK PLATEAU SLOPE TRIGGER TROUGH NA's
## 77 85 24 67 47 172
##
## $`Time to maturity of technology (in years)`
## 0 TO 2 0 TO 5 2 TO 5 5 TO 10 NA's
## 77 1 180 42 172
Everything seems to be in order. One caveat, however: some variables describe the investors, owners or founders rather than the company itself. Strictly speaking, each person has his or her own qualifications and experience, but for the sake of simplicity we will treat these variables as describing all of these people as one unit.
Construct a vector of the numeric variables and convert them
numeric_cols <- c(3:5,10,11,18:23,25,61,66,68:70,72,74,88,92,94:96,98,99,102:116)
startups[, numeric_cols] <- map_df(startups[, numeric_cols], as.numeric)
Let us look at the summary for each variable to make sure that everything is ok.
map(startups[, numeric_cols], summary)
## $`year of founding`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1997 2008 2010 2009 2011 2013 59
##
## $`Age of company in years`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 3.000 4.000 4.605 6.000 17.000 59
##
## $`Internet Activity Score`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -725.0 -3.5 60.0 114.2 216.0 1535.0 65
##
## $`Employee Count`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 4.25 13.00 31.41 31.00 594.00 166
##
## $`Employees count MoM change`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -100.0 0.0 0.0 -1.3 6.0 50.0 205
##
## $`Number of Investors in Seed`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 1.546 2.000 24.000 49
##
## $`Number of Investors in Angel and or VC`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.5768 0.0000 9.0000 49
##
## $`Number of Co-founders`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 1.869 2.250 7.000
##
## $`Number of of advisors`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.017 1.000 13.000
##
## $`Team size Senior leadership`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.731 5.000 24.000
##
## $`Team size all employees`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 10.00 16.50 69.48 50.00 5000.00 68
##
## $`Number of of repeat investors`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.6065 1.0000 10.0000 40
##
## $`Years of education`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.000 2.000 1.634 2.000 3.000 103
##
## $`Renowned in professional circle`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 16.0 500.0 500.0 469.1 500.0 500.0 91
##
## $`Experience in Fortune 100 organizations`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.2692 1.0000 1.0000 82
##
## $`Experience in Fortune 500 organizations`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 0.259 1.000 1.000 82
##
## $`Experience in Fortune 1000 organizations`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 0.218 0.000 1.000 82
##
## $`Number of Recognitions for Founders and Co-founders`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.00 3.00 72.27 107.50 600.00 81
##
## $`Skills score`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 14.00 21.00 21.69 25.00 200.00 81
##
## $`google page rank of company website`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 483 209558 835064 2518863 2456174 22391670 154
##
## $`Industry trend in investing`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 2.00 3.00 2.89 3.00 5.00 82
##
## $`Number of Direct competitors`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 2.258 3.000 33.000 80
##
## $`Employees per year of company existence`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 3.30 8.30 18.44 15.25 833.30 128
##
## $`Last round of funding received (in milionUSD)`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.010 0.750 2.500 5.866 7.500 62.500 167
##
## $`Time to 1st investment (in months)`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 1.00 10.00 14.61 19.25 156.00 96
##
## $`Avg time to investment - average across all rounds, measured from previous investment`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 2.582 7.309 10.563 12.000 156.000 98
##
## $Percent_skill_Entrepreneurship
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 5.882 7.538 11.111 100.000 61
##
## $Percent_skill_Operations
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 2.385 3.452 50.000 61
##
## $Percent_skill_Engineering
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 9.804 18.632 28.665 100.000 61
##
## $Percent_skill_Marketing
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 5.556 11.001 14.286 76.471 61
##
## $Percent_skill_Leadership
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 2.870 5.556 40.000 61
##
## $`Percent_skill_Data Science`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 1.852 6.082 8.333 80.000 61
##
## $`Percent_skill_Business Strategy`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 8.333 10.981 18.382 50.000 61
##
## $`Percent_skill_Product Management`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 3.430 5.556 25.000 61
##
## $Percent_skill_Sales
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 3.357 5.556 33.333 61
##
## $Percent_skill_Domain
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 4.750 5.882 44.444 61
##
## $Percent_skill_Law
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.1995 0.0000 33.3333 61
##
## $Percent_skill_Consulting
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.4821 0.0000 20.0000 61
##
## $Percent_skill_Finance
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 1.592 0.000 78.571 61
##
## $Percent_skill_Investment
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 1.359 0.000 33.333 61
##
## $`Renown score`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 1.000 3.000 3.292 5.000 11.000 61
It seems that there are some outliers; we will find out more when we reach the EDA phase.
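One detail in the summaries is worth a closer look. Column 61 (Years of education) appears in both factor_cols and numeric_cols, and its numeric summary ranges from 1 to 3 even though the factor levels were 18, 21 and 25. That pattern is what as.numeric() produces when it is applied to a factor: it returns the internal level codes rather than the labels. A small sketch of the pitfall and one common workaround, using a made-up vector:

```r
# A factor whose labels look like numbers (made-up values).
years <- factor(c("18", "21", "25", "18", NA))

# as.numeric() on a factor returns the level codes, not the labels.
codes <- as.numeric(years)
print(codes)

# Converting to character first recovers the numbers behind the labels.
values <- as.numeric(as.character(years))
print(values)
```

If the actual number of years matters for modelling, converting via as.character() (or leaving the column out of factor_cols) would preserve it.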
Construct a vector of the date variables and convert them
library(lubridate)
date_cols <- c(13:14)
startups[, date_cols] <- map_df(startups[, date_cols], mdy)
Let us look at the summary for each variable to make sure that everything is ok.
map(startups[, date_cols], summary)
## $`Est. Founding Date`
## Min. 1st Qu. Median Mean 3rd Qu.
## "1997-06-01" "2008-01-01" "2010-02-05" "2009-08-05" "2011-06-16"
## Max. NA's
## "2013-07-01" "109"
##
## $`Last Funding Date`
## Min. 1st Qu. Median Mean 3rd Qu.
## "2004-04-01" "2010-10-03" "2012-08-06" "2011-12-12" "2013-07-11"
## Max. NA's
## "2014-04-08" "122"
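When checking the NA counts of converted dates, it helps to remember that strings which do not match the expected layout are turned into NA as well. The sketch below shows this with base R's as.Date() rather than lubridate's mdy(), and it assumes a month/day/year layout such as "6/1/1997"; the input values are made up.

```r
# Made-up inputs: one valid date, one malformed string, one missing value.
raw_dates <- c("6/1/1997", "not a date", NA)

# With an explicit format, strings that do not match become NA.
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")
print(parsed)

# Count how many entries ended up missing after parsing.
print(sum(is.na(parsed)))
```

Comparing the NA count before and after conversion is a quick way to tell genuinely missing dates from parsing failures.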
Now that the variables are in their proper data types, let us remove the predictors with 40% or more missing data
startups <- startups[colSums(is.na(startups))/nrow(startups) < .4]
dim(startups)
## [1] 472 113
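The filter works by computing the share of missing values per column and keeping the columns below the threshold. The same logic on a toy data frame (the columns `a`, `b` and `c` are made up), written with colMeans(), which is equivalent to dividing colSums() by the number of rows:

```r
# Toy data: a has no NAs, b has 60% NAs, c has 20% NAs.
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(NA, NA, NA, 4, 5),
  c = c(1, NA, 3, 4, 5)
)

# Share of missing values per column.
missing_share <- colMeans(is.na(toy))
print(missing_share)

# Keep only the columns with less than 40% missing values.
kept <- toy[, missing_share < 0.4, drop = FALSE]
print(names(kept))
```

Printing the missing shares before filtering also shows which columns are about to be dropped, which is useful for documenting the decision.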
It seems we got rid of 3 variables. Let us take another look at the data now.
head(startups)
## # A tibble: 6 x 113
## Company_Name `Dependent-Company S~ `year of foundin~ `Age of company in~
## <chr> <fct> <dbl> <dbl>
## 1 Company1 SUCCESS NA NA
## 2 Company2 SUCCESS 2011 3.00
## 3 Company3 SUCCESS 2011 3.00
## 4 Company4 SUCCESS 2009 5.00
## 5 Company5 SUCCESS 2010 4.00
## 6 Company6 SUCCESS 2010 4.00
## # ... with 109 more variables: `Internet Activity Score` <dbl>, `Short
## # Description of company profile` <chr>, `Industry of company` <chr>,
## # `Focus functions of company` <chr>, Investors <chr>, `Employee Count`
## # <dbl>, `Has the team size grown` <fct>, `Est. Founding Date` <date>,
## # `Last Funding Date` <date>, `Last Funding Amount` <int>, `Country of
## # company` <fct>, `Continent of company` <fct>, `Number of Investors in
## # Seed` <dbl>, `Number of Investors in Angel and or VC` <dbl>, `Number
## # of Co-founders` <dbl>, `Number of of advisors` <dbl>, `Team size
## # Senior leadership` <dbl>, `Team size all employees` <dbl>, `Presence
## # of a top angel or venture fund in previous round of investment` <fct>,
## # `Number of of repeat investors` <dbl>, `Number of Sales Support
## # material` <fct>, `Worked in top companies` <fct>, `Average size of
## # companies worked for in the past` <fct>, `Have been part of startups
## # in the past?` <fct>, `Have been part of successful startups in the
## # past?` <fct>, `Was he or she partner in Big 5 consulting?` <fct>,
## # `Consulting experience?` <fct>, `Product or service company?` <fct>,
## # `Catering to product/service across verticals` <fct>, `Focus on
## # private or public data?` <fct>, `Focus on consumer data?` <fct>,
## # `Focus on structured or unstructured data` <fct>, `Subscription based
## # business` <fct>, `Cloud or platform based serive/product?` <fct>,
## # `Local or global player` <fct>, `Linear or Non-linear business model`
## # <fct>, `Capital intensive business e.g. e-commerce, Engineering
## # products and operations can also cause a business to be capital
## # intensive` <fct>, `Number of of Partners of company` <fct>,
## # `Crowdsourcing based business` <fct>, `Crowdfunding based business`
## # <fct>, `Machine Learning based business` <fct>, `Predictive Analytics
## # business` <fct>, `Speech analytics business` <fct>, `Prescriptive
## # analytics business` <fct>, `Big Data Business` <fct>, `Cross-Channel
## # Analytics/ marketing channels` <fct>, `Owns data or not? (monetization
## # of data) e.g. Factual` <fct>, `Is the company an aggregator/market
## # place? e.g. Bluekai` <fct>, `Online or offline venture - physical
## # location based business or online venture?` <fct>, `B2C or B2B
## # venture?` <fct>, `Top forums like 'Tech crunch' or 'Venture beat'
## # talking about the company/model - How much is it being talked about?`
## # <fct>, `Average Years of experience for founder and co founder` <fct>,
## # `Exposure across the globe` <fct>, `Breadth of experience across
## # verticals` <fct>, `Highest education` <fct>, `Years of education`
## # <dbl>, `Specialization of highest education` <fct>, `Relevance of
## # education to venture` <fct>, `Relevance of experience to venture`
## # <fct>, `Degree from a Tier 1 or Tier 2 university?` <fct>, `Renowned
## # in professional circle` <dbl>, `Experience in selling and building
## # products` <fct>, `Experience in Fortune 100 organizations` <dbl>,
## # `Experience in Fortune 500 organizations` <dbl>, `Experience in
## # Fortune 1000 organizations` <dbl>, `Top management similarity` <fct>,
## # `Number of Recognitions for Founders and Co-founders` <dbl>, `Number
## # of of Research publications` <fct>, `Skills score` <dbl>, `Team
## # Composition score` <fct>, `Dificulty of Obtaining Work force` <fct>,
## # `Pricing Strategy` <fct>, `Hyper localisation` <fct>, `Time to market
## # service or product` <fct>, `Long term relationship with other
## # founders` <fct>, `Proprietary or patent position (competitive
## # position)` <fct>, `Barriers of entry for the competitors` <fct>,
## # `Company awards` <fct>, `Controversial history of founder or co
## # founder` <fct>, `Legal risk and intellectual property` <fct>, `google
## # page rank of company website` <dbl>, `Technical proficiencies to
## # analyse and interpret unstructured data` <fct>, `Solutions offered`
## # <fct>, `Invested through global incubation competitions?` <fct>,
## # `Industry trend in investing` <dbl>, `Disruptiveness of technology`
## # <fct>, `Number of Direct competitors` <dbl>, `Employees per year of
## # company existence` <dbl>, `Last round of funding received (in
## # milionUSD)` <dbl>, `Survival through recession, based on existence of
## # the company through recession times` <fct>, `Time to 1st investment
## # (in months)` <dbl>, `Avg time to investment - average across all
## # rounds, measured from previous investment` <dbl>, `Gartner hype cycle
## # stage` <fct>, `Time to maturity of technology (in years)` <fct>,
## # Percent_skill_Entrepreneurship <dbl>, Percent_skill_Operations <dbl>,
## # Percent_skill_Engineering <dbl>, Percent_skill_Marketing <dbl>,
## # Percent_skill_Leadership <dbl>, `Percent_skill_Data Science` <dbl>,
## # ...
Much better.
Textual columns are those that contain free text. What differentiates these from the categorical columns is that the number of unique values for the textual columns would be too big. Such columns are typically messy, and we will have to deal with them on a column-by-column basis as no single preprocessing procedure would suit them all.
Before we move on, let’s modify the column names by making them lowercase and replacing spaces with underscores. Note that this is more of a personal preference than a necessity.
colnames(startups) <- tolower(gsub(x = colnames(startups), pattern = ' ', replacement = '_', fixed = TRUE))
# the 'fixed' parameter is set as TRUE in order to match the pattern exactly.
# otherwise, the pattern will be interpreted as a regular expression instead, and an unexpected output may result.
head(colnames(startups))
## [1] "company_name"
## [2] "dependent-company_status"
## [3] "year_of_founding"
## [4] "age_of_company_in_years"
## [5] "internet_activity_score"
## [6] "short_description_of_company_profile"
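Note that a name such as dependent-company_status still contains a hyphen, so it has to be wrapped in backticks when used with `$` or inside formulas. If fully syntactic names are preferred, one possible variation is to replace every run of characters other than lowercase letters and digits. This is only a sketch on three of the original column names and is not part of the original walkthrough:

```r
# Three of the original column names.
old_names <- c("Dependent-Company Status", "year of founding", "Consulting experience?")

# Lowercase, then turn every run of other characters into one underscore.
new_names <- gsub("[^a-z0-9]+", "_", tolower(old_names))

# Drop underscores left at the start or end (e.g. from a trailing '?').
new_names <- gsub("^_+|_+$", "", new_names)
print(new_names)
```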
After a quick manual inspection (on the .csv file), we identified the following textual variables. We additionally set the contents of these columns to lowercase for easier processing later on.
# the columns containing text in them
textual_col_names <- c('industry_of_company',
'short_description_of_company_profile',
'focus_functions_of_company',
'investors')
# set contents of these columns to lowercase
startups[,textual_col_names] <- map_df(startups[,textual_col_names], tolower)
And now, a quick look at the industry_of_company column – which indicates the particular domain that a company is working in – shows us…
head(startups$industry_of_company, 10)
## [1] NA
## [2] "market research|marketing|crowdfunding"
## [3] "analytics|cloud computing|software development"
## [4] "mobile|analytics"
## [5] "analytics|marketing|enterprise software"
## [6] "food & beverages|hospitality"
## [7] "analytics"
## [8] "cloud computing|network / hosting / infrastructure"
## [9] "analytics|mobile|marketing"
## [10] "healthcare|pharmaceuticals|analytics"
From the output, it seems that it is a multiple-value variable with | as the separator.
Now, we wish to create a document-term matrix (DTM) from this column. The DTM is a matrix where the documents (i.e. records) and terms (e.g. words) are the rows and columns, respectively, and the cells contain the frequencies of the terms occurring in the documents (e.g. the cell DTM(i,j) would tell how many times term j occurred in document i). To generate the DTM, we shall use the quanteda package.
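To make the definition concrete, here is a minimal base R sketch that builds a tiny DTM by hand. The two documents are made up, and terms are simply split on spaces; it is meant to illustrate the structure of the matrix, not how quanteda builds it internally.

```r
# Two made-up documents.
docs <- c(doc1 = "big data analytics", doc2 = "data data mining")

# Split each document into terms and collect the vocabulary.
tokens <- strsplit(docs, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens, use.names = FALSE)))

# Count every vocabulary term in every document: rows are documents, columns are terms.
dtm <- t(vapply(
  tokens,
  function(tk) as.integer(table(factor(tk, levels = terms))),
  integer(length(terms))
))
colnames(dtm) <- terms
print(dtm)
```

Here the cell for doc2 and the term "data" holds 2, because that term occurs twice in the second document.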
library(quanteda)
# create the corpus object from the industry_of_company column
mycorpus <- corpus(startups$industry_of_company)
# generate the DTM using the formed corpus
# (recent quanteda releases expect dfm(tokens(mycorpus), tolower = FALSE) instead)
dfm_industry <- dfm(mycorpus,         # the corpus to generate the DTM from
                    tolower = FALSE)  # data is already lowercase
At this point, let’s just have a look at the DTM before moving on.
# observe the 'terms' of the DTM
colnames(dfm_industry)
## [1] "NA" "market" "research"
## [4] "|" "marketing" "crowdfunding"
## [7] "analytics" "cloud" "computing"
## [10] "software" "development" "mobile"
## [13] "enterprise" "food" "&"
## [16] "beverages" "hospitality" "network"
## [19] "/" "hosting" "infrastructure"
## [22] "healthcare" "pharmaceuticals" "media"
## [25] "finance" "music" "e-commerce"
## [28] "gaming" "advertising" "retail"
## [31] "security" "email" "human"
## [34] "resources" "(" "hr"
## [37] ")" "career" "job"
## [40] "search" "publishing" "education"
## [43] "energy" "deals" "entertainment"
## [46] "transportation" "social" "networking"
## [49] "real" "estate" "telecommunications"
## [52] "insurance" "cleantech" "space"
## [55] "travel" "classifieds" "government"
Unfortunately, the output is unsatisfactory. Some of the terms that are parsed by the quanteda package include |, /, &, ( and ) (improper terms, obviously). We also observe that industries such as cloud computing and software development have been broken down into their constituent words, which is also incorrect.
With some investigation of the quanteda package (specifically the quanteda::tokens() function), it seems that it is only able to split the terms over the spaces in the text. In other words, we cannot specify the | character as the separator for splitting the text into terms.
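For comparison, base R can split directly on the | character, which keeps multi-word industries such as cloud computing intact. The sketch below is an alternative illustration on three made-up values in the same format as the column; it is not the approach this walkthrough follows, and the `indicator` matrix is a hypothetical name.

```r
# Made-up values in the same format as industry_of_company.
industry <- c(NA, "analytics|cloud computing", "mobile|analytics")

# Split on the literal '|' (fixed = TRUE avoids regex interpretation).
parts <- strsplit(industry, "|", fixed = TRUE)

# Vocabulary of whole industries, ignoring missing values.
all_terms <- unlist(parts)
terms <- sort(unique(all_terms[!is.na(all_terms)]))

# One row per record, one column per industry, 1 if the industry is listed.
indicator <- t(vapply(
  parts,
  function(p) as.integer(terms %in% p),
  integer(length(terms))
))
colnames(indicator) <- terms
print(indicator)
```

The result has the same shape as a DTM, except that each term is a whole industry and a record with a missing value gets a row of zeros.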
As such, we will deal with this issue by performing the following actions:
- Replace the | character with spaces: helps the quanteda::dfm() function split the terms over the new spaces.
- Remove /, &, ( and ): makes life easier for the quanteda::dfm() function.
It may be a good idea to look at the cases on which we intend to perform the above actions. Let’s have a look.
# retrieve and view those industries that have '/' in them
startups$industry_of_company[ grepl(x = startups$industry_of_company, pattern = '/', fixed = TRUE) ]
## [1] "cloud computing|network / hosting / infrastructure"
## [2] "analytics|security|network / hosting / infrastructure"
## [3] "human resources (hr)|marketing|career / job search"
## [4] "analytics|network / hosting / infrastructure"
## [5] "human resources (hr)|enterprise software|career / job search|social networking|analytics"
## [6] "network / hosting / infrastructure|food & beverages|analytics"
## [7] "network / hosting / infrastructure"
## [8] "media|entertainment|analytics|network / hosting / infrastructure|publishing"
## [9] "network / hosting / infrastructure|enterprise software|software development|analytics"
## [10] "network / hosting / infrastructure"
## [11] "network / hosting / infrastructure|analytics"
## [12] "network / hosting / infrastructure|enterprise software"
## [13] "career / job search"
## [14] "network / hosting / infrastructure|publishing"
## [15] "classifieds|network / hosting / infrastructure"
## [16] "e-commerce|analytics|network / hosting / infrastructure"
## [17] "analytics|social networking|network / hosting / infrastructure"
## [18] "human resources (hr)|career / job search"
## [19] "human resources (hr)|enterprise software|career / job search|social networking|analytics"
## [20] "network / hosting / infrastructure|publishing"
## [21] "analytics|social networking|network / hosting / infrastructure"
## [22] "human resources (hr)|analytics|marketing|career / job search"
## [23] "network / hosting / infrastructure|telecommunications|enterprise software"
## [24] "network / hosting / infrastructure|marketing"
## [25] "human resources (hr)|career / job search"
## [26] "e-commerce|network / hosting / infrastructure"
# retrieve and view those industries that have '&' in them
startups$industry_of_company[ grepl(x = startups$industry_of_company, pattern = '&', fixed = TRUE) ]
## [1] "food & beverages|hospitality"
## [2] "analytics|food & beverages|social networking|mobile"
## [3] "network / hosting / infrastructure|food & beverages|analytics"
## [4] "e-commerce|food & beverages|mobile"
## [5] "food & beverages"
## [6] "e-commerce|food & beverages|mobile"
## [7] "healthcare|analytics|mobile|food & beverages"
## [8] "e-commerce|food & beverages"
# retrieve and view those industries that have '(' in them
startups$industry_of_company[ grepl(x = startups$industry_of_company, pattern = '(', fixed = TRUE) ]
## [1] "human resources (hr)|marketing|career / job search"
## [2] "human resources (hr)|enterprise software|career / job search|social networking|analytics"
## [3] "human resources (hr)|career / job search"
## [4] "human resources (hr)|enterprise software|career / job search|social networking|analytics"
## [5] "human resources (hr)|analytics|marketing|career / job search"
## [6] "human resources (hr)|career / job search"
From the above inspections, it seems that all occurrences of / are in the terms network / hosting / infrastructure and career / job search. As for & and (, their occurrences are in the terms food & beverages and human resources (hr), respectively. Accordingly, we perform the following modifications to the industry_of_company variable.
# remove all occurrences of ' (hr)', ' /', ' &'
startups$industry_of_company <- gsub(startups$industry_of_company, pattern=' (hr)', replacement='', fixed=TRUE)
startups$industry_of_company <- gsub(startups$industry_of_company, pattern=' /', replacement='', fixed=TRUE)
startups$industry_of_company <- gsub(startups$industry_of_company, pattern=' &', replacement='', fixed=TRUE)
# replace spaces with underscores to merge multi-word terms
startups$industry_of_company <- gsub(startups$industry_of_company, pattern=' ', replacement='_', fixed=TRUE)
# replace all occurrences of '|' with spaces to separate between terms
startups$industry_of_company <- gsub(startups$industry_of_company, pattern='|', replacement=' ', fixed=TRUE)
head(startups$industry_of_company, 10)
## [1] NA
## [2] "market_research marketing crowdfunding"
## [3] "analytics cloud_computing software_development"
## [4] "mobile analytics"
## [5] "analytics marketing enterprise_software"
## [6] "food_beverages hospitality"
## [7] "analytics"
## [8] "cloud_computing network_hosting_infrastructure"
## [9] "analytics mobile marketing"
## [10] "healthcare pharmaceuticals analytics"
Modifications have been applied successfully and as intended; i.e. multi-word terms are merged with underscores, and the terms are separated by spaces. Now, let’s generate the DTM and observe the obtained terms.
# create the corpus object from the industry_of_company column
mycorpus <- corpus(startups$industry_of_company)
# generate the DTM using the formed corpus
dfm_industry <- dfm(mycorpus, # the corpus to generate the DTM from
tolower = FALSE) # data is already lowercase
# observe the 'terms' of the DTM
colnames(dfm_industry)
## [1] "NA" "market_research"
## [3] "marketing" "crowdfunding"
## [5] "analytics" "cloud_computing"
## [7] "software_development" "mobile"
## [9] "enterprise_software" "food_beverages"
## [11] "hospitality" "network_hosting_infrastructure"
## [13] "healthcare" "pharmaceuticals"
## [15] "media" "finance"
## [17] "music" "e-commerce"
## [19] "gaming" "advertising"
## [21] "retail" "security"
## [23] "email" "human_resources"
## [25] "career_job_search" "publishing"
## [27] "education" "energy"
## [29] "deals" "entertainment"
## [31] "transportation" "social_networking"
## [33] "real_estate" "search"
## [35] "telecommunications" "insurance"
## [37] "cleantech" "space_travel"
## [39] "classifieds" "travel"
## [41] "government"
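As a cross-check on the term list above, the same cleaning steps can be replayed on a few raw strings with base R alone. This is only a sketch under assumptions: the helper name clean_industry() and the toy vector raw are hypothetical, and the sample strings are copied from the raw values printed earlier.

```r
# Hypothetical helper: replays the gsub() steps used above on a character vector
clean_industry <- function(x) {
  x <- gsub(" (hr)", "", x, fixed = TRUE)  # drop the ' (hr)' suffix
  x <- gsub(" /", "", x, fixed = TRUE)     # drop ' /' inside multi-word terms
  x <- gsub(" &", "", x, fixed = TRUE)     # drop ' &' inside multi-word terms
  x <- gsub(" ", "_", x, fixed = TRUE)     # merge multi-word terms
  gsub("|", " ", x, fixed = TRUE)          # '|' becomes the term separator
}

# Toy input copied from the raw values printed earlier
raw <- c("human resources (hr)|marketing|career / job search",
         "network / hosting / infrastructure|food & beverages|analytics",
         NA)

cleaned <- clean_industry(raw)

# Split on spaces and tabulate the terms; NA entries are dropped here
terms <- unlist(strsplit(cleaned[!is.na(cleaned)], " ", fixed = TRUE))
term_counts <- table(terms)
print(cleaned)
print(term_counts)
```

Unlike the DTM above, this sketch drops missing values instead of counting them under an "NA" term.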
At last, we can move on to the analysis of the industry_of_company variable.
Let’s create a new data frame object where we bind the industry’s DTM with the column of the response variable. Note that the DTM is currently a matrix-like dfm object, so we need to coerce it into a data.frame before binding.
# convert `dfm_industry` to data.frame and bind with response variable
startups_industry <- cbind(startups[,2], as.data.frame(dfm_industry))
glimpse(startups_industry)
## Observations: 472
## Variables: 42
## $ `dependent-company_status` <fct> SUCCESS, SUCCESS, SUCCESS, SUCC...
## $ `NA` <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ market_research <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ marketing <dbl> 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0...
## $ crowdfunding <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ analytics <dbl> 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1...
## $ cloud_computing <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0...
## $ software_development <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ mobile <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0...
## $ enterprise_software <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1...
## $ food_beverages <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ hospitality <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ network_hosting_infrastructure <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0...
## $ healthcare <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0...
## $ pharmaceuticals <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0...
## $ media <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ finance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ music <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `e-commerce` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ gaming <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ advertising <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ retail <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ security <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ email <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ human_resources <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ career_job_search <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ publishing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ education <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ energy <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ deals <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ entertainment <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ transportation <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ social_networking <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ real_estate <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ search <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ telecommunications <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ insurance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ cleantech <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ space_travel <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ classifieds <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ travel <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ government <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
The bind appears to be successful. Now, let’s view the frequency of each of the industries among the list of companies in the dataset.
# frequencies of successes/fails in each of the industries
industry_frequencies <-
startups_industry %>%
group_by(`dependent-company_status`) %>%
summarise_at(vars(-starts_with("dependent")), funs(sum)) %>%
t()
industry_frequencies
## [,1] [,2]
## dependent-company_status "FAILED" "SUCCESS"
## NA "69" "55"
## market_research "1" "7"
## marketing "11" "55"
## crowdfunding "0" "2"
## analytics " 18" "180"
## cloud_computing " 6" "13"
## software_development " 7" "11"
## mobile "17" "37"
## enterprise_software " 3" "27"
## food_beverages "5" "3"
## hospitality "0" "4"
## network_hosting_infrastructure " 9" "10"
## healthcare "5" "7"
## pharmaceuticals "0" "1"
## media " 9" "19"
## finance "2" "6"
## music "6" "2"
## e-commerce "17" "37"
## gaming "3" "4"
## advertising "11" "26"
## retail " 0" "15"
## security "0" "6"
## email "3" "4"
## human_resources "2" "4"
## career_job_search "2" "5"
## publishing "2" "6"
## education "3" "3"
## energy "4" "8"
## deals "0" "3"
## entertainment "7" "8"
## transportation "1" "1"
## social_networking " 8" "10"
## real_estate "0" "3"
## search "6" "5"
## telecommunications "3" "3"
## insurance "0" "1"
## cleantech "1" "4"
## space_travel "1" "0"
## classifieds "2" "0"
## travel "0" "1"
## government "0" "1"
Judging from the output, it seems that all the numerical values have been turned to character strings due to the dependent-company_status variable (1st row). Also note that the industries are the names of the rows. For reasons that will be apparent very shortly, we will want the industries to be an actual column in industry_frequencies. Moreover, we will have to remove the first row and then set the names of the columns as failed and success.
Also note that industry_frequencies is currently of type matrix (not data.frame).
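The coercion to character can be reproduced on a toy data frame. This is a minimal sketch with made-up column names and values (toy is hypothetical): transposing a data frame that holds one character column forces every cell into a string, and each numeric column is formatted to a common width first, which would explain padded values such as " 18" sitting next to "180".

```r
# Hypothetical toy data: one character column plus two numeric columns
toy <- data.frame(status    = c("FAILED", "SUCCESS"),
                  analytics = c(18, 180),
                  music     = c(6, 2),
                  stringsAsFactors = FALSE)

m <- t(toy)        # t() goes through as.matrix(), which needs a single type

print(class(m))    # a matrix, no longer a data.frame
print(typeof(m))   # every cell is now a character string
print(m)

# Padding is harmless for conversion: as.numeric() ignores surrounding blanks
analytics_counts <- as.numeric(m["analytics", ])
print(analytics_counts)
```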
# remove the 1st row containing the 'dependent-company_status'
industry_frequencies <- industry_frequencies[-1,]
# add the industries (currently the row names) as a column in 'industry_frequencies'
industry_frequencies <- cbind(rownames(industry_frequencies), industry_frequencies)
# remove the row names
rownames(industry_frequencies) <- NULL
# set the column names
colnames(industry_frequencies) <- c('industry','failed','success')
industry_frequencies
## industry failed success
## [1,] "NA" "69" "55"
## [2,] "market_research" "1" "7"
## [3,] "marketing" "11" "55"
## [4,] "crowdfunding" "0" "2"
## [5,] "analytics" " 18" "180"
## [6,] "cloud_computing" " 6" "13"
## [7,] "software_development" " 7" "11"
## [8,] "mobile" "17" "37"
## [9,] "enterprise_software" " 3" "27"
## [10,] "food_beverages" "5" "3"
## [11,] "hospitality" "0" "4"
## [12,] "network_hosting_infrastructure" " 9" "10"
## [13,] "healthcare" "5" "7"
## [14,] "pharmaceuticals" "0" "1"
## [15,] "media" " 9" "19"
## [16,] "finance" "2" "6"
## [17,] "music" "6" "2"
## [18,] "e-commerce" "17" "37"
## [19,] "gaming" "3" "4"
## [20,] "advertising" "11" "26"
## [21,] "retail" " 0" "15"
## [22,] "security" "0" "6"
## [23,] "email" "3" "4"
## [24,] "human_resources" "2" "4"
## [25,] "career_job_search" "2" "5"
## [26,] "publishing" "2" "6"
## [27,] "education" "3" "3"
## [28,] "energy" "4" "8"
## [29,] "deals" "0" "3"
## [30,] "entertainment" "7" "8"
## [31,] "transportation" "1" "1"
## [32,] "social_networking" " 8" "10"
## [33,] "real_estate" "0" "3"
## [34,] "search" "6" "5"
## [35,] "telecommunications" "3" "3"
## [36,] "insurance" "0" "1"
## [37,] "cleantech" "1" "4"
## [38,] "space_travel" "1" "0"
## [39,] "classifieds" "2" "0"
## [40,] "travel" "0" "1"
## [41,] "government" "0" "1"
Now, we have a three-column matrix containing the number of successful and failed companies in each of 41 different industries (counting the NA entry). We’ll be doing some visualizations using ggplot2, which expects the data to be in the form of a data frame, which is why we do the following:
# convert to data.frame
industry_frequencies <- as.data.frame(industry_frequencies)
industry_frequencies
## industry failed success
## 1 NA 69 55
## 2 market_research 1 7
## 3 marketing 11 55
## 4 crowdfunding 0 2
## 5 analytics 18 180
## 6 cloud_computing 6 13
## 7 software_development 7 11
## 8 mobile 17 37
## 9 enterprise_software 3 27
## 10 food_beverages 5 3
## 11 hospitality 0 4
## 12 network_hosting_infrastructure 9 10
## 13 healthcare 5 7
## 14 pharmaceuticals 0 1
## 15 media 9 19
## 16 finance 2 6
## 17 music 6 2
## 18 e-commerce 17 37
## 19 gaming 3 4
## 20 advertising 11 26
## 21 retail 0 15
## 22 security 0 6
## 23 email 3 4
## 24 human_resources 2 4
## 25 career_job_search 2 5
## 26 publishing 2 6
## 27 education 3 3
## 28 energy 4 8
## 29 deals 0 3
## 30 entertainment 7 8
## 31 transportation 1 1
## 32 social_networking 8 10
## 33 real_estate 0 3
## 34 search 6 5
## 35 telecommunications 3 3
## 36 insurance 0 1
## 37 cleantech 1 4
## 38 space_travel 1 0
## 39 classifieds 2 0
## 40 travel 0 1
## 41 government 0 1
Let’s do some visualizations, shall we?
ggplot(industry_frequencies, aes(x = industry, y = failed)) + # X: industry, Y: #failed companies
geom_bar(stat="identity") + # bar height = value of 'failed'
theme(axis.text.x = element_text(angle = 90, # rotate orientation of x-axis labels
vjust = 0.3, # top-align x-axis labels
hjust = 1)) # right-align x-axis labels
Excuse me!? What’s up with the y-axis? Why is the order of the labels all messed up? Let’s see what the data type of the failed column is.
# check type of 'failed' column
typeof(industry_frequencies$failed)
## [1] "integer"
Still, no reason for it to behave the way it did. I wonder… Let’s convert it to type double and see what happens.
# convert 'failed' column to type 'double'
industry_frequencies$failed <- as.double(industry_frequencies$failed)
ggplot(industry_frequencies, aes(x = industry, y = failed)) + # X: industry, Y: #failed companies
geom_bar(stat="identity") + # bar height = value of 'failed'
theme(axis.text.x = element_text(angle = 90, # rotate orientation of x-axis labels
vjust = 0.3, # top-align x-axis labels
hjust = 1)) # right-align x-axis labels
Well, what d’you know? It’s all fixed now, right? Nope! The above figure is actually identical to the one before it (with the exception of the y-axis scale). In addition, the scale of the y-axis is not supposed to end at 18. There is supposed to be a number with a value of 69 (the NA industry) and it is supposed to be dwarfing the other industries; the next biggest value in the failed companies is 18.
The NA industry serves no purpose here. Let’s try removing it and see what happens.
# remove the NA row
industry_frequencies <- industry_frequencies %>% filter(!is.na(industry))
industry_frequencies
## industry failed success
## 1 NA 17 55
## 2 market_research 9 7
## 3 marketing 10 55
## 4 crowdfunding 8 2
## 5 analytics 2 180
## 6 cloud_computing 4 13
## 7 software_development 5 11
## 8 mobile 11 37
## 9 enterprise_software 3 27
## 10 food_beverages 15 3
## 11 hospitality 8 4
## 12 network_hosting_infrastructure 7 10
## 13 healthcare 15 7
## 14 pharmaceuticals 8 1
## 15 media 7 19
## 16 finance 12 6
## 17 music 16 2
## 18 e-commerce 11 37
## 19 gaming 13 4
## 20 advertising 10 26
## 21 retail 1 15
## 22 security 8 6
## 23 email 13 4
## 24 human_resources 12 4
## 25 career_job_search 12 5
## 26 publishing 12 6
## 27 education 13 3
## 28 energy 14 8
## 29 deals 8 3
## 30 entertainment 18 8
## 31 transportation 9 1
## 32 social_networking 6 10
## 33 real_estate 8 3
## 34 search 16 5
## 35 telecommunications 13 3
## 36 insurance 8 1
## 37 cleantech 9 4
## 38 space_travel 9 0
## 39 classifieds 12 0
## 40 travel 8 1
## 41 government 8 1
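The NA row does not feel like leaving, eh? That is because the industry in that row is the literal string 'NA' rather than a missing value, so is.na() returns FALSE for it. Let’s try this then.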
# remove the NA row
industry_frequencies <- industry_frequencies %>% filter(!(industry == 'NA'))
industry_frequencies
## industry failed success
## 1 market_research 9 7
## 2 marketing 10 55
## 3 crowdfunding 8 2
## 4 analytics 2 180
## 5 cloud_computing 4 13
## 6 software_development 5 11
## 7 mobile 11 37
## 8 enterprise_software 3 27
## 9 food_beverages 15 3
## 10 hospitality 8 4
## 11 network_hosting_infrastructure 7 10
## 12 healthcare 15 7
## 13 pharmaceuticals 8 1
## 14 media 7 19
## 15 finance 12 6
## 16 music 16 2
## 17 e-commerce 11 37
## 18 gaming 13 4
## 19 advertising 10 26
## 20 retail 1 15
## 21 security 8 6
## 22 email 13 4
## 23 human_resources 12 4
## 24 career_job_search 12 5
## 25 publishing 12 6
## 26 education 13 3
## 27 energy 14 8
## 28 deals 8 3
## 29 entertainment 18 8
## 30 transportation 9 1
## 31 social_networking 6 10
## 32 real_estate 8 3
## 33 search 16 5
## 34 telecommunications 13 3
## 35 insurance 8 1
## 36 cleantech 9 4
## 37 space_travel 9 0
## 38 classifieds 12 0
## 39 travel 8 1
## 40 government 8 1
Well, the NA row is gone, but the data are all irreversibly damaged now. For whatever reason, the data now do not reflect those obtained from the DTM that we generated earlier from the industry_of_company variable. Let us quickly undo the mess we have made and start over.
# convert `dfm_industry` to data.frame and bind with response variable
startups_industry <- cbind(startups[,2], as.data.frame(dfm_industry))
# frequencies of successes/fails in each of the industries
industry_frequencies <-
startups_industry %>%
group_by(`dependent-company_status`) %>%
summarise_at(vars(-starts_with("dependent")), funs(sum)) %>%
t()
# remove the 1st row containing the 'dependent-company_status'
industry_frequencies <- industry_frequencies[-1,]
# add the industries (currently the row names) as a column in 'industry_frequencies'
industry_frequencies <- cbind(rownames(industry_frequencies), industry_frequencies)
# remove the row names
rownames(industry_frequencies) <- NULL
# set the column names
colnames(industry_frequencies) <- c('industry','failed','success')
# convert to data.frame
industry_frequencies <- as.data.frame(industry_frequencies)
# remove the NA row
industry_frequencies <- industry_frequencies %>% filter(!(industry == 'NA'))
industry_frequencies
## industry failed success
## 1 market_research 1 7
## 2 marketing 11 55
## 3 crowdfunding 0 2
## 4 analytics 18 180
## 5 cloud_computing 6 13
## 6 software_development 7 11
## 7 mobile 17 37
## 8 enterprise_software 3 27
## 9 food_beverages 5 3
## 10 hospitality 0 4
## 11 network_hosting_infrastructure 9 10
## 12 healthcare 5 7
## 13 pharmaceuticals 0 1
## 14 media 9 19
## 15 finance 2 6
## 16 music 6 2
## 17 e-commerce 17 37
## 18 gaming 3 4
## 19 advertising 11 26
## 20 retail 0 15
## 21 security 0 6
## 22 email 3 4
## 23 human_resources 2 4
## 24 career_job_search 2 5
## 25 publishing 2 6
## 26 education 3 3
## 27 energy 4 8
## 28 deals 0 3
## 29 entertainment 7 8
## 30 transportation 1 1
## 31 social_networking 8 10
## 32 real_estate 0 3
## 33 search 6 5
## 34 telecommunications 3 3
## 35 insurance 0 1
## 36 cleantech 1 4
## 37 space_travel 1 0
## 38 classifieds 2 0
## 39 travel 0 1
## 40 government 0 1
Please work now, okay?
ggplot(industry_frequencies, aes(x = industry, y = failed)) + # X: industry, Y: #failed companies
geom_bar(stat="identity") + # bar height = value of 'failed'
theme(axis.text.x = element_text(angle = 90, # rotate orientation of x-axis labels
vjust = 0.3, # top-align x-axis labels
hjust = 1)) # right-align x-axis labels
The messed-up y-axis again… converting to double… (not optimistic)
ggplot(industry_frequencies, aes(x = industry, y = as.double(failed))) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, # rotate orientation of x-axis labels
vjust = 0.3, # top-align x-axis labels
hjust = 1)) # right-align x-axis labels
No luck :(
After a short break, we stumbled upon something…
# arrange records in descending order of number of successful companies
industry_frequencies %>% arrange(desc(success))
## industry failed success
## 1 energy 4 8
## 2 entertainment 7 8
## 3 market_research 1 7
## 4 healthcare 5 7
## 5 finance 2 6
## 6 security 0 6
## 7 publishing 2 6
## 8 marketing 11 55
## 9 career_job_search 2 5
## 10 search 6 5
## 11 hospitality 0 4
## 12 gaming 3 4
## 13 email 3 4
## 14 human_resources 2 4
## 15 cleantech 1 4
## 16 mobile 17 37
## 17 e-commerce 17 37
## 18 food_beverages 5 3
## 19 education 3 3
## 20 deals 0 3
## 21 real_estate 0 3
## 22 telecommunications 3 3
## 23 enterprise_software 3 27
## 24 advertising 11 26
## 25 crowdfunding 0 2
## 26 music 6 2
## 27 media 9 19
## 28 analytics 18 180
## 29 retail 0 15
## 30 cloud_computing 6 13
## 31 software_development 7 11
## 32 network_hosting_infrastructure 9 10
## 33 social_networking 8 10
## 34 pharmaceuticals 0 1
## 35 transportation 1 1
## 36 insurance 0 1
## 37 travel 0 1
## 38 government 0 1
## 39 space_travel 1 0
## 40 classifieds 2 0
Very suspicious behaviour indeed. A closer look…
# sort the numbers of successful companies
sort(industry_frequencies$success)
## [1] 0 0 1 1 1 1 1 10 10 11 13 15 180 19 2 2 26
## [18] 27 3 3 3 3 3 37 37 4 4 4 4 4 5 5 55 6
## [35] 6 6 7 7 8 8
## Levels: 0 1 10 11 13 15 180 19 2 26 27 3 37 4 5 55 6 7 8
Suspicion confirmed. The numbers are not being sorted as numbers. They are treated like characters. The funny thing is what happens next.
# sort the numbers of successful companies AFTER converting to numerics
sort(as.numeric(industry_frequencies$success))
## [1] 1 1 2 2 2 2 2 3 3 4 5 6 7 8 9 9 10 11 12 12 12 12 12
## [24] 13 13 14 14 14 14 14 15 15 16 17 17 17 18 18 19 19
Some of the higher numbers (e.g. 55, 180) vanished, and we are left with these consecutive numbers. It did not take us much time to figure out what is wrong.
industry_frequencies$success %>%
as.character() %>% # convert to 'character'...
as.numeric() %>% # ...then to 'numeric'...
sort() # ...then sort
## [1] 0 0 1 1 1 1 1 2 2 3 3 3 3 3 4 4 4
## [18] 4 4 5 5 6 6 6 7 7 8 8 10 10 11 13 15 19
## [35] 26 27 37 37 55 180
Lesson learned the hard way: When you have a column that holds numbers and you want to coerce it to type numeric, you MUST make sure it is NOT of type factor. Calling as.numeric() directly on a factor returns its internal integer level codes, not the numbers you see printed. If the column is a factor, it MUST be converted to type character first, and THEN to type numeric. NEVER convert directly from type factor to type numeric.
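The pitfall is easy to reproduce in isolation. Here is a minimal sketch with a made-up factor (the values are chosen only for illustration): as.numeric() applied directly to a factor returns the internal level codes, while going through as.character() recovers the numbers that were printed.

```r
# Hypothetical factor holding numbers stored as labels
f <- factor(c("55", "180", "7", "7"))

print(levels(f))                        # levels are sorted as text, not as numbers

codes  <- as.numeric(f)                 # WRONG: position of each value in levels(f)
values <- as.numeric(as.character(f))   # RIGHT: the numbers behind the labels

# Equivalent idiom: convert each level once, then index by the codes
values_alt <- as.numeric(levels(f))[f]

print(codes)
print(values)
```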
To be specific, our grand mistake was at a previous step where the matrix, industry_frequencies, was converted into a data frame. The as.data.frame() function has a parameter called stringsAsFactors, and you typically want it set to FALSE whenever numbers stored as strings are involved. (Since R 4.0.0, FALSE is the default; in older versions, character columns were converted to factors by default.)
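A sketch of how the conversion could be done in one pass, assuming a small character matrix shaped like industry_frequencies (the matrix m below is made up): keep strings as strings when building the data frame, then convert only the count columns.

```r
# Hypothetical character matrix mimicking the layout of industry_frequencies
m <- cbind(industry = c("analytics", "music", "retail"),
           failed   = c(" 18", "6", " 0"),
           success  = c("180", "2", "15"))

# Keep text as character instead of factor (the default since R 4.0.0)
freq <- as.data.frame(m, stringsAsFactors = FALSE)

# Convert only the count columns; 'industry' stays character
count_cols <- c("failed", "success")
freq[count_cols] <- lapply(freq[count_cols], as.numeric)

str(freq)
print(sort(freq$success))
```

With the columns stored as numbers, sorting and plotting behave numerically instead of alphabetically.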
And so we continue. Shall we try this visualization thing, again?
# convert all columns (which are currently of type 'factor' now) to type 'character'
industry_frequencies <- map_df(industry_frequencies,as.character)
# convert the numeric columns to type 'numeric'
industry_frequencies[,c("failed","success")] <- map_df(industry_frequencies[,c("failed","success")],as.numeric)
ggplot(industry_frequencies, aes(x = industry, y = failed)) + # X: industry, Y: #failed companies
geom_bar(stat="identity") + # bar height = value of 'failed'
theme(axis.text.x = element_text(angle = 90, # rotate orientation of x-axis labels
vjust = 0.3, # top-align x-axis labels
hjust = 1)) # right-align x-axis labels
Yaaaaaaay! It worked! Now, we’re back in business! Let’s try something fancy now.
industry_frequencies %>%
arrange(success) %>% # arrange order of data by number of successful companies in each industry
ggplot(aes(x = industry, y = success)) + # X: industry, Y: #successful companies
geom_bar(stat="identity") + # bar height = value of 'success'
theme(axis.text.x = element_text(angle = 90, # rotate orientation of x-axis labels
vjust = 0.3, # top-align x-axis labels
hjust = 1)) # right-align x-axis labels
HEY! What gives? Why won’t the x-axis labels get sorted by frequency? With some internet searching, we found this resolution.
industry_frequencies %>%
transform(industry=reorder(industry, -success) ) %>% # order x-axis labels by '-success'
ggplot(aes(x = industry, y = success)) + # X: industry, Y: #successful companies
geom_bar(stat="identity") + # bar height = value of 'success'
theme(axis.text.x = element_text(angle = 90, # rotate orientation of x-axis labels
vjust = 0.3, # top-align x-axis labels
hjust = 1)) # right-align x-axis labels
That’s more like it. From the figure, one can quickly tell which industries have successful startups appearing in them and which do not.
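As for why arrange() had no effect on the axis: ggplot2 positions discrete x values by factor level order (alphabetical when the column is plain character), not by row order, and reorder() rewrites that level order. A base-R sketch on made-up values (toy is hypothetical):

```r
# Hypothetical toy data in the shape of industry_frequencies
toy <- data.frame(industry = c("music", "analytics", "retail"),
                  success  = c(2, 180, 15),
                  stringsAsFactors = FALSE)

# Sorting rows changes the row order only; a character column has no level order
sorted_rows <- toy[order(-toy$success), ]

# reorder() returns a factor whose levels are ranked by the second argument
by_success <- reorder(toy$industry, -toy$success)

print(sorted_rows$industry)
print(levels(by_success))
```

The values themselves stay in their original positions; only the level order, which the plot uses for the axis, changes.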
Let’s try something else now.
industry_frequencies %>%
gather("company_status", "n", 2:3) %>%
transform(industry=reorder(industry, -n) ) %>% # order x-axis labels by '-n' (mean of n per industry)
ggplot(aes(x = industry, y = n, fill = company_status)) + # X: industry, Y: n, fill: company_status
geom_bar(stat="identity", show.legend = FALSE) + # bar height = value of 'n', stacked by company_status
theme(axis.text.x = element_text(angle = 90, # rotate orientation of x-axis labels
vjust = 0.3, # top-align x-axis labels
hjust = 1)) # right-align x-axis labels
At last, a visualization that is pleasant to look at. From the figure, it seems that most startups actually succeed regardless of the industry. Why don’t we get some exact numbers? Below, we will create a new variable called success_percent which displays the percentage of successful companies per industry.
industry_frequencies %>%
mutate(success_percent = round(success / (success+failed) * 100, 1)) %>%
arrange(desc(success_percent), desc(success))
## # A tibble: 40 x 4
## industry failed success success_percent
## <chr> <dbl> <dbl> <dbl>
## 1 retail 0 15.0 100
## 2 security 0 6.00 100
## 3 hospitality 0 4.00 100
## 4 deals 0 3.00 100
## 5 real_estate 0 3.00 100
## 6 crowdfunding 0 2.00 100
## 7 pharmaceuticals 0 1.00 100
## 8 insurance 0 1.00 100
## 9 travel 0 1.00 100
## 10 government 0 1.00 100
## # ... with 30 more rows
Ahem… I want to see the ENTIRE thing, please?
industry_frequencies %>%
mutate(success_percent = round(success / (success+failed) * 100, 1)) %>%
arrange(desc(success_percent), desc(success)) %>%
as.data.frame()
## industry failed success success_percent
## 1 retail 0 15 100.0
## 2 security 0 6 100.0
## 3 hospitality 0 4 100.0
## 4 deals 0 3 100.0
## 5 real_estate 0 3 100.0
## 6 crowdfunding 0 2 100.0
## 7 pharmaceuticals 0 1 100.0
## 8 insurance 0 1 100.0
## 9 travel 0 1 100.0
## 10 government 0 1 100.0
## 11 analytics 18 180 90.9
## 12 enterprise_software 3 27 90.0
## 13 market_research 1 7 87.5
## 14 marketing 11 55 83.3
## 15 cleantech 1 4 80.0
## 16 finance 2 6 75.0
## 17 publishing 2 6 75.0
## 18 career_job_search 2 5 71.4
## 19 advertising 11 26 70.3
## 20 mobile 17 37 68.5
## 21 e-commerce 17 37 68.5
## 22 cloud_computing 6 13 68.4
## 23 media 9 19 67.9
## 24 energy 4 8 66.7
## 25 human_resources 2 4 66.7
## 26 software_development 7 11 61.1
## 27 healthcare 5 7 58.3
## 28 gaming 3 4 57.1
## 29 email 3 4 57.1
## 30 social_networking 8 10 55.6
## 31 entertainment 7 8 53.3
## 32 network_hosting_infrastructure 9 10 52.6
## 33 education 3 3 50.0
## 34 telecommunications 3 3 50.0
## 35 transportation 1 1 50.0
## 36 search 6 5 45.5
## 37 food_beverages 5 3 37.5
## 38 music 6 2 25.0
## 39 space_travel 1 0 0.0
## 40 classifieds 2 0 0.0
All endeavors in retail, security and hospitality seem to have ended in success. There are other industries that have flawlessly succeeded as well, but their numbers of companies are not big enough to be significant.
For the industries with success ratios < 100%, analytics seems to be wildly successful with a ~91% success ratio, followed by enterprise_software, marketing and advertising, all with success ratios > 70%. There are still other industries with success ratios > 70%, but (again) the number of companies in these industries is not big enough for the percentages to be significant.
On the other side of the spectrum, there exist industries that are wildly UNsuccessful such as space_travel and classifieds. Although the numbers of companies are too small to draw conclusions, we don’t need numbers to tell us that space_travel is a tough business to get into :). Other industries with more fails than successes are music, food_beverages and search.
Industries in the hit-or-miss region (~50% success ratio) include entertainment, network_hosting_infrastructure, education and telecommunications. Again, other industries within this hit-or-miss region do not have a large enough number of companies to constitute a trend and, as such, will be ignored.
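One way to make "not big enough to be significant" more concrete is an exact binomial confidence interval for each success ratio. This is a sketch and not part of the original analysis: the helper success_ci() is hypothetical, and the counts are copied from the table above. Since a company can be listed under several industries, the intervals are only a rough guide.

```r
# Hypothetical helper: exact (Clopper-Pearson) 95% interval for a success ratio
success_ci <- function(success, failed) {
  ci <- binom.test(success, success + failed)$conf.int
  c(lower = ci[1], upper = ci[2])
}

# Counts copied from the success_percent table
ci_analytics <- success_ci(180, 18)   # many companies
ci_retail    <- success_ci(15, 0)     # all successes, moderate count
ci_travel    <- success_ci(1, 0)      # a single company

print(round(rbind(analytics = ci_analytics,
                  retail    = ci_retail,
                  travel    = ci_travel), 3))
```

The fewer companies an industry has, the wider its interval, so a 100% ratio based on a single company says very little.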
It would be great if we could discover some kind of global indicator that can help distinguish between the successful and failed startups. Let’s see if we can find such an indicator!
To be continued…
So far, we do not know what to do with the short_description_of_company_profile variable.
head(startups$short_description_of_company_profile, 10)
## [1] "video distribution"
## [2] NA
## [3] "event data analytics api"
## [4] "the most advanced analytics for mobile"
## [5] "the location-based marketing platform"
## [6] "big data for foodservice"
## [7] NA
## [8] NA
## [9] "engagement engine"
## [10] "big data for clinical insight"
focus_functions_of_company is a multi-value variable that indicates the company’s focus.
head(startups$focus_functions_of_company, 10)
## [1] "operation" "marketing, sales" "operations"
## [4] "marketing & sales" "marketing & sales" "analytics"
## [7] "research" "computing" "marketing"
## [10] "research"
The following are some notes about this field:
* Values are separated by either , or &, and sometimes both.
* The character case is not standardized.
* Some functions are missing.
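The notes above suggest a normalisation step before any analysis. Here is a minimal sketch, assuming it is acceptable to treat both , and & as separators and to lowercase everything (split_functions() is a hypothetical helper, and the sample values are copied from the output above):

```r
# Hypothetical helper: split a focus-function string on ',' or '&'
split_functions <- function(x) {
  parts <- strsplit(tolower(x), "[,&]")   # either separator ends a value
  lapply(parts, trimws)                   # strip blanks left around each value
}

# Sample values taken from the head() output, plus a missing entry
focus <- c("operation", "marketing, sales", "marketing & sales", NA)

functions_list <- split_functions(focus)
print(functions_list)
```

Near-duplicates such as operation and operations would still need to be mapped to one spelling by hand.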
investors is a multi-value variable. Values are separated by |, and sometimes there are missing values.
head(startups$investors, 10)
## [1] "kpcb holdings|draper fisher jurvetson (dfj)|kleiner perkins caufield & byers|at&t|blueprint ventures|cisco|zone ventures"
## [2] NA
## [3] "techstars|streamlined ventures|amplify partners|rincon venture partners|pelion venture partners|500 startups|loren siebert|jason seats|xg ventures|george karidis|sam choi|morris wheeler|data collective|pejman nozad|ullas naik|dirk elmendorf|galvanize|pat matthews|paul kedrosky|matt ocko|cloud power capital|jared kopf|anne johnson|issac roth|george karutz|jim deters|zachary aarons|zack bogue"
## [4] "michael birch|max levchin|sequoia capital|keith rabois|andreessen horowitz|marc benioff|david sacks|y combinator|voyager capital"
## [5] "dfj frontier|draper nexus ventures|gil elbaz|auren hoffman|walter kortschak|mi ventures|brand ventures|daher capital|double m partners|gold hill capital|clark landry|draper associates|mi ventures llc|signia venture partners"
## [6] "pritzker group venture capital|excelerate labs|hyde park venture partners|chicago ventures|amicus capital|ideo|olive ventures|kd capital l.l.c"
## [7] "plug & play ventures|correlation ventures|crosslink capital|roham gharegozlou|start capital|naguib sawiris|pejman nozad|streamlined ventures"
## [8] "norwest venture partners|bessemer venture partners|atlas venture"
## [9] "promus ventures|softtech vc|costanoa venture capital|lee linden|chamath palihapitiya|raj de datta|tim kendall|omar siddiqui|david vivero|sequoia capital"
## [10] "khosla ventures"
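Since | is the only separator here and some investor names contain & (for example at&t), a fixed-string split on | looks like the safer choice. Here is a minimal sketch on shortened sample values (count_investors() is a hypothetical helper):

```r
# Hypothetical helper: number of investors listed per company (0 when missing)
count_investors <- function(x) {
  n <- lengths(strsplit(x, "|", fixed = TRUE))
  n[is.na(x)] <- 0L                      # strsplit() keeps NA as a single NA entry
  n
}

# Shortened sample values taken from the output above
investors <- c("kleiner perkins caufield & byers|at&t|cisco",
               NA,
               "khosla ventures")

investor_counts <- count_investors(investors)
print(investor_counts)
```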