Data Wrangling

First, we read the data using the read_csv function from the readr package, which is part of the tidyverse.

startups <- read_csv("data/CAX_Startup_Data.csv")

So we have 472 observations of 116 variables. One of them is the response variable, which we will designate later for prediction, and the rest are the predictors.

Let us see what the first 5 rows of the data look like:

head(startups, 5)

## # A tibble: 5 x 116
##   Company_Name `Dependent-Company S~ `year of foundin~ `Age of company in~
##   <chr>        <chr>                 <chr>             <chr>              
## 1 Company1     Success               No Info           No Info            
## 2 Company2     Success               2011              3                  
## 3 Company3     Success               2011              3                  
## 4 Company4     Success               2009              5                  
## 5 Company5     Success               2010              4                  
## # ... with 112 more variables: `Internet Activity Score` <int>, `Short
## #   Description of company profile` <chr>, `Industry of company` <chr>,
## #   `Focus functions of company` <chr>, Investors <chr>, `Employee Count`
## #   <int>, `Employees count MoM change` <int>, `Has the team size grown`
## #   <chr>, `Est. Founding Date` <chr>, `Last Funding Date` <chr>, `Last
## #   Funding Amount` <int>, `Country of company` <chr>, `Continent of
## #   company` <chr>, `Number of Investors in Seed` <chr>, `Number of
## #   Investors in Angel and or VC` <chr>, `Number of Co-founders` <int>,
## #   `Number of of advisors` <int>, `Team size Senior leadership` <int>,
## #   `Team size all employees` <chr>, `Presence of a top angel or venture
## #   fund in previous round of investment` <chr>, `Number of of repeat
## #   investors` <chr>, `Number of Sales Support material` <chr>, `Worked in
## #   top companies` <chr>, `Average size of companies worked for in the
## #   past` <chr>, `Have been part of startups in the past?` <chr>, `Have
## #   been part of successful startups in the past?` <chr>, `Was he or she
## #   partner in Big 5 consulting?` <chr>, `Consulting experience?` <chr>,
## #   `Product or service company?` <chr>, `Catering to product/service
## #   across verticals` <chr>, `Focus on private or public data?` <chr>,
## #   `Focus on consumer data?` <chr>, `Focus on structured or unstructured
## #   data` <chr>, `Subscription based business` <chr>, `Cloud or platform
## #   based serive/product?` <chr>, `Local or global player` <chr>, `Linear
## #   or Non-linear business model` <chr>, `Capital intensive business e.g.
## #   e-commerce, Engineering products and operations can also cause a
## #   business to be capital intensive` <chr>, `Number of of Partners of
## #   company` <chr>, `Crowdsourcing based business` <chr>, `Crowdfunding
## #   based business` <chr>, `Machine Learning based business` <chr>,
## #   `Predictive Analytics business` <chr>, `Speech analytics business`
## #   <chr>, `Prescriptive analytics business` <chr>, `Big Data Business`
## #   <chr>, `Cross-Channel Analytics/ marketing channels` <chr>, `Owns data
## #   or not? (monetization of data) e.g. Factual` <chr>, `Is the company an
## #   aggregator/market place? e.g. Bluekai` <chr>, `Online or offline
## #   venture - physical location based business or online venture?` <chr>,
## #   `B2C or B2B venture?` <chr>, `Top forums like 'Tech crunch' or
## #   'Venture beat' talking about the company/model - How much is it being
## #   talked about?` <chr>, `Average Years of experience for founder and co
## #   founder` <chr>, `Exposure across the globe` <chr>, `Breadth of
## #   experience across verticals` <chr>, `Highest education` <chr>, `Years
## #   of education` <chr>, `Specialization of highest education` <chr>,
## #   `Relevance of education to venture` <chr>, `Relevance of experience to
## #   venture` <chr>, `Degree from a Tier 1 or Tier 2 university?` <chr>,
## #   `Renowned in professional circle` <chr>, `Experience in selling and
## #   building products` <chr>, `Experience in Fortune 100 organizations`
## #   <chr>, `Experience in Fortune 500 organizations` <chr>, `Experience in
## #   Fortune 1000 organizations` <chr>, `Top management similarity` <chr>,
## #   `Number of Recognitions for Founders and Co-founders` <chr>, `Number
## #   of of Research publications` <chr>, `Skills score` <chr>, `Team
## #   Composition score` <chr>, `Dificulty of Obtaining Work force` <chr>,
## #   `Pricing Strategy` <chr>, `Hyper localisation` <chr>, `Time to market
## #   service or product` <chr>, `Employee benefits and salary structures`
## #   <chr>, `Long term relationship with other founders` <chr>,
## #   `Proprietary or patent position (competitive position)` <chr>,
## #   `Barriers of entry for the competitors` <chr>, `Company awards` <chr>,
## #   `Controversial history of founder or co founder` <chr>, `Legal risk
## #   and intellectual property` <chr>, `Client Reputation` <chr>, `google
## #   page rank of company website` <chr>, `Technical proficiencies to
## #   analyse and interpret unstructured data` <chr>, `Solutions offered`
## #   <chr>, `Invested through global incubation competitions?` <chr>,
## #   `Industry trend in investing` <int>, `Disruptiveness of technology`
## #   <chr>, `Number of Direct competitors` <chr>, `Employees per year of
## #   company existence` <chr>, `Last round of funding received (in
## #   milionUSD)` <chr>, `Survival through recession, based on existence of
## #   the company through recession times` <chr>, `Time to 1st investment
## #   (in months)` <chr>, `Avg time to investment - average across all
## #   rounds, measured from previous investment` <chr>, `Gartner hype cycle
## #   stage` <chr>, `Time to maturity of technology (in years)` <chr>,
## #   Percent_skill_Entrepreneurship <chr>, Percent_skill_Operations <chr>,
## #   Percent_skill_Engineering <chr>, ...

From this output, we can make the following observations:

There are many missing values.
Missing values are marked not only as NA but also with other values such as No Info or an empty string.
Many variables are read as character type and will probably need to be converted to their proper data types (dates, factors, numerics).
Some variables will clearly need special processing, such as Short Description of company profile, Specialization of highest education, and Investors.
There are issues in the textual columns (upper and lower case, different formats, etc.).
Dependent-Company Status is clearly the response variable.
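
To get a rough feel for how widespread the missing markers are, we can count them per column. The following is a minimal sketch on made-up toy data, not on the real startups table; the helper name count_missing_like and the toy values are assumptions used only for illustration.

```r
# Toy data standing in for a few columns of `startups` (values are made up).
toy <- data.frame(
  status  = c("Success", "Failed", "Success", "Success"),
  founded = c("No Info", "2011", "", NA),
  country = c("India", "No Info", "No Info", "Spain"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: treat NA, "No Info", and the empty string as missing.
count_missing_like <- function(x, markers = c("No Info", "")) {
  sum(is.na(x) | x %in% markers)
}

# One count per column.
missing_counts <- sapply(toy, count_missing_like)
print(missing_counts)
```

The same helper could be applied to the full data with sapply(startups, count_missing_like).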

So, we will take the following steps to clean our dataset:

Unify the way missing data are marked.
Convert the variables to their proper data types.
Remove columns with more than 40% missing data.
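
The third step can be sketched in base R as follows. This is only an illustration on toy data, assuming the missing values have already been unified to NA; the names toy, missing_share, and cleaned are hypothetical.

```r
# Toy data: column b is 60% missing, column c is 20% missing.
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(NA, NA, NA, 4, 5),
  c = c("x", NA, "y", "z", "w"),
  stringsAsFactors = FALSE
)

# Share of NA values in each column.
missing_share <- colMeans(is.na(toy))

# Keep only the columns whose missing share does not exceed 40%.
cleaned <- toy[, missing_share <= 0.40, drop = FALSE]
print(names(cleaned))
```

Using drop = FALSE keeps the result a data frame even if only one column survives the filter.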

Setup Missing Data

set_missing <- function(x) {
  # Replace 'No Info' with NA
  x[x == 'No Info'] <- NA
  # Replace empty string with NA
  x[x == ''] <- NA
  return(x)
}

startups <- map_df(startups, set_missing)
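
An alternative is to declare the missing markers when the file is read, so that the parser can also guess numeric types for columns that contain No Info. The sketch below uses base R's read.csv with its na.strings argument on a small inline CSV whose contents are made up; readr's read_csv accepts an analogous na argument.

```r
# A tiny inline CSV with the same kinds of missing markers (made-up rows).
csv_text <- "name,year,country
Company1,No Info,India
Company2,2011,
Company3,2009,No Info"

# Treat the empty string, "NA", and "No Info" as missing while parsing.
toy <- read.csv(
  text = csv_text,
  na.strings = c("", "NA", "No Info"),
  stringsAsFactors = FALSE
)

str(toy)
```

Because No Info is never seen as text, the year column is parsed as a number directly instead of as a character column.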

Correcting Variable Types

Factor Variables

Construct a vector with the indices of the factor variables, convert their values to upper case, and then convert them to factors.

factor_cols <- c(2, 12, 16:17, 24, 26:65, 67, 71, 73, 75:87, 89:91, 93, 97, 100:101)
startups[, factor_cols] <- map_df(startups[, factor_cols], toupper)
startups[, factor_cols] <- map_df(startups[, factor_cols], as.factor)
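
The reason for calling toupper before as.factor is that values differing only in case would otherwise become separate levels. A small sketch on a made-up vector (not taken from the data) illustrates this:

```r
# Made-up values with inconsistent casing and one missing entry.
raw <- c("Yes", "yes", "No", "NO", NA, "YES")

# Without normalisation every spelling becomes its own level.
f_raw <- as.factor(raw)

# toupper keeps NA as NA, so missing values survive the conversion.
f_clean <- as.factor(toupper(raw))

print(nlevels(f_raw))
print(summary(f_clean))
```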

Let us look at the summary for each variable to make sure that everything is ok.

map(startups[, factor_cols], summary)

## $`Dependent-Company Status`
##  FAILED SUCCESS 
##     167     305 
## 
## $`Has the team size grown`
##   NO  YES NA's 
##  266  155   51 
## 
## $`Country of company`
##          ARGENTINA            AUSTRIA         AZERBAIJAN 
##                  2                  2                  2 
##            BELGIUM           BULGARIA             CANADA 
##                  5                  3                  3 
##     CZECH REPUBLIC            DENMARK            ESTONIA 
##                  1                  3                  1 
##            FINLAND             FRANCE            GERMANY 
##                  2                  8                  6 
##              INDIA             ISRAEL              ITALY 
##                 10                  4                  1 
## RUSSIAN FEDERATION          SINGAPORE              SPAIN 
##                  1                  1                  5 
##             SWEDEN        SWITZERLAND     UNITED KINGDOM 
##                  1                  2                 33 
##      UNITED STATES               NA's 
##                305                 71 
## 
## $`Continent of company`
##          ASIA        EUROPE NORTH AMERICA SOUTH AMERICA          NA's 
##            15            76           308             2            71 
## 
## $`Presence of a top angel or venture fund in previous round of investment`
##   NO  YES NA's 
##  282   93   97 
## 
## $`Number of  Sales Support material`
##    HIGH     LOW  MEDIUM NOTHING    NA's 
##      73     150     120      81      48 
## 
## $`Worked in top companies`
##   NO  YES NA's 
##  380   73   19 
## 
## $`Average size of companies worked for in the past`
##  LARGE MEDIUM  SMALL   NA's 
##     83    130    228     31 
## 
## $`Have been part of startups in the past?`
##   NO  YES NA's 
##  154  298   20 
## 
## $`Have been part of successful startups in the past?`
##   NO  YES NA's 
##  194  258   20 
## 
## $`Was he or she partner in Big 5 consulting?`
##   NO  YES NA's 
##  428   24   20 
## 
## $`Consulting experience?`
##   NO  YES NA's 
##  245  205   22 
## 
## $`Product or service company?`
##    BOTH PRODUCT SERVICE    NA's 
##      24     207     231      10 
## 
## $`Catering to product/service across verticals`
##   NO  YES NA's 
##  231  230   11 
## 
## $`Focus on private or public data?`
##    BOTH      NO PRIVATE  PUBLIC    NA's 
##      68     113     162     120       9 
## 
## $`Focus on consumer data?`
##   NO  YES NA's 
##  281  182    9 
## 
## $`Focus on structured or unstructured data`
##           BOTH             NO NOT APPLICABLE     STRUCTURED   UNSTRUCTURED 
##            120             98              7            166             72 
##           NA's 
##              9 
## 
## $`Subscription based business`
##   NO  YES NA's 
##  192  267   13 
## 
## $`Cloud or platform based serive/product?`
##     BOTH    CLOUD     NONE PLATFORM     NA's 
##       68       65       31      296       12 
## 
## $`Local or global player`
## GLOBAL  LOCAL   NA's 
##    237    211     24 
## 
## $`Linear or Non-linear business model`
##     LINEAR NON-LINEAR       NA's 
##        134        320         18 
## 
## $`Capital intensive business e.g. e-commerce, Engineering products and operations can also cause a business to be capital intensive`
##   NO  YES NA's 
##  328  118   26 
## 
## $`Number of  of Partners of company`
##  FEW MANY NONE NA's 
##   73   14  284  101 
## 
## $`Crowdsourcing based business`
##   NO  YES NA's 
##  437   30    5 
## 
## $`Crowdfunding based business`
##   NO  YES NA's 
##  445   22    5 
## 
## $`Machine Learning based business`
##   NO  YES NA's 
##  337  129    6 
## 
## $`Predictive Analytics business`
##   NO  YES NA's 
##  316  151    5 
## 
## $`Speech analytics business`
##   NO  YES NA's 
##  436   31    5 
## 
## $`Prescriptive analytics business`
##   NO  YES NA's 
##  321  116   35 
## 
## $`Big Data Business`
##   NO  YES NA's 
##  251  216    5 
## 
## $`Cross-Channel Analytics/ marketing channels`
##   NO  YES NA's 
##  395   72    5 
## 
## $`Owns data or not? (monetization of data) e.g. Factual`
##   NO  YES NA's 
##  411   52    9 
## 
## $`Is the company an aggregator/market place? e.g. Bluekai`
##   NO  YES NA's 
##  335  107   30 
## 
## $`Online or offline venture - physical location based business or online venture?`
##    BOTH OFFLINE  ONLINE    NA's 
##       9      47     410       6 
## 
## $`B2C or B2B venture?`
##  B2B  B2C NA's 
##  307  162    3 
## 
## $`Top forums like 'Tech crunch' or 'Venture beat' talking about the company/model - How much is it being talked about?`
##   HIGH    LOW MEDIUM   NONE   NA's 
##     28    243     78     43     80 
## 
## $`Average Years of experience for founder and co founder`
##   HIGH    LOW MEDIUM   NA's 
##    271     13    108     80 
## 
## $`Exposure across the globe`
##   NO  YES NA's 
##  141  246   85 
## 
## $`Breadth of experience across verticals`
##   HIGH    LOW MEDIUM   NA's 
##     37    178    172     85 
## 
## $`Highest education`
## BACHELORS   MASTERS       PHD      NA's 
##       169       166        34       103 
## 
## $`Years of education`
##   18   21   25 NA's 
##  169  166   34  103 
## 
## $`Specialization of highest education`
##                                                      BUSINESS 
##                                                            25 
##                                                           MBA 
##                                                            23 
##                                              COMPUTER SCIENCE 
##                                                            22 
##                                                          TECH 
##                                                            21 
##                                                          ENGG 
##                                                            19 
##                                                    MANAGEMENT 
##                                                            18 
##                                                          MGMT 
##                                                            18 
##                                                     ECONOMICS 
##                                                            16 
##                                                          ARTS 
##                                                            12 
##                                           BUSINESS MANAGEMENT 
##                                                            10 
##                                                       FINANCE 
##                                                             8 
##                                                           CSE 
##                                                             7 
##                                                           PHD 
##                                                             7 
##                                                    TECHNOLOGY 
##                                                             7 
##                                                     COMPUTERS 
##                                                             6 
##                                                    ELECTRICAL 
##                                                             6 
##                                                           LAW 
##                                                             6 
##                                                     MARKETING 
##                                                             5 
##                                                   COMPUTER SC 
##                                                             4 
##                                                       GENERAL 
##                                                             4 
##                                                     MANGEMENT 
##                                                             4 
##                                       BUSINESS ADMINISTRATION 
##                                                             3 
##                                        ELECTRICAL ENGINEERING 
##                                                             3 
##                                                         MATHS 
##                                                             3 
##                                                         MEDIA 
##                                                             3 
##                                                     BUSSINESS 
##                                                             2 
##                                                 COMMUNICATION 
##                                                             2 
##                                                      COMPUTER 
##                                                             2 
##                                                       DIPLOMA 
##                                                             2 
##                                                           ECO 
##                                                             2 
##                                                       HISTORY 
##                                                             2 
##                                                            IT 
##                                                             2 
##                                                           MGM 
##                                                             2 
##                                                            NO 
##                                                             2 
##                                                       PHYSICS 
##                                                             2 
##                                             POLITICAL SCIENCE 
##                                                             2 
##                                                    ACCOUNTING 
##                                                             1 
##                                                     AEROSPACE 
##                                                             1 
##                                         AEROSPACE ENGINEERING 
##                                                             1 
##                                               AQUATIC BIOLOGY 
##                                                             1 
##               ARTIFICIAL INTELLIGENCE AND ADVANCED TECHNLOGES 
##                                                             1 
##                                              ARTS AND CULTURE 
##                                                             1 
##                                               ARTS, ECONOMICS 
##                                                             1 
##                                                   ARTS, MEDIA 
##                                                             1 
##                                                      BACHELOR 
##                                                             1 
##                                                      BIO ENGG 
##                                                             1 
##                                                       BIOLOGY 
##                                                             1 
##                         BIOMEDICAL AND MECHANICAL ENGINEERING 
##                                                             1 
##                                         BIOMEDICAL ENTERPRISE 
##                                                             1 
##                                        BIOMEDICAL INFORMATICS 
##                                                             1 
##                                                           BSC 
##                                                             1 
##                                                         BSEET 
##                                                             1 
##                                              BUSINESS STUDIES 
##                                                             1 
##                                                 CHEMICAL ENGG 
##                                                             1 
##                                          CHEMICAL ENGINEERING 
##                                                             1 
##                                                 COMMUNICTIONS 
##                                                             1 
##                                  COMPUTER SYSTEMS ENGINEERING 
##                                                             1 
##                                                            CS 
##                                                             1 
##                                                   DATA MINING 
##                                                             1 
##                                                         DSIGN 
##                                                             1 
##                                                EARTH SCIENCES 
##                                                             1 
##                              EAST ASIAN STUDIES AND ECONOMICS 
##                                                             1 
##                                                     ECO.POLSC 
##                                                             1 
##                             ECONOMICS AND AMERICAN LITERATURE 
##                                                             1 
##                                           ECONOMICS,COMPUTERS 
##                                                             1 
## ELECTRICAL AND ELECTRONICS ENGINEERING\nAEROSPACE ENGINEERING 
##                                                             1 
##                                                ELECTRICAL ENG 
##                                                             1 
##                                               ELECTRICAL ENGG 
##                                                             1 
##                                                   ENGINEERING 
##                                                             1 
##                                       ENGINEERING, MANAGEMENT 
##                                                             1 
##                                                       ENGLISH 
##                                                             1 
##                                                 ENTERTAINMENT 
##                                                             1 
##                                               ENTREPRENEURIAL 
##                                                             1 
##                                              ENTREPRENEURSHIP 
##                                                             1 
##                                 ENTREPRENEURSHIP & INNOVATION 
##                                                             1 
##                                           FINANCE, MANAGEMENT 
##                                                             1 
##                        FINANCE, STARTEGY AND ENTERPRENEURSHIP 
##                                                             1 
##                                                     FINE ARTS 
##                                                             1 
##                                             GENERAL MANGEMENT 
##                                                             1 
##                                                       GEOLOGY 
##                                                             1 
##               HONORS PROGRAM, BUSINESS, FINANCE, INFORMATION  
##                                                             1 
##                                    HUMAN COMPUTER INTERACTION 
##                                                             1 
##                                              IMAGE PROCESSING 
##                                                             1 
##                                                      INDUSTRI 
##                                                             1 
##                   INDUSTRIAL ENGINEERING AND COMPUTER SCIENCE 
##                                                             1 
##                                              INFORMATION TECH 
##                                                             1 
##                                           INTERNATIONAL TRADE 
##                                                             1 
##                                                            JD 
##                                                             1 
##                                                    JOURNALISM 
##                                                             1 
##          LEARNING AND ORGANIZATIONAL CHANGE\nCOMPUTER SCIENCE 
##                                                             1 
##                                                           LIT 
##                                                             1 
##                                        LITERATURE AND HISTORY 
##                                                             1 
##                                                            MA 
##                                                             1 
##                         MANAGEMENT AND INFORMATION TECHNOLOGY 
##                                                             1 
##                                     MANAGEMENT AND TECHNOLOGY 
##                                                             1 
##                                  MANAGEMENT, COMPUTER SCIENCE 
##                                                             1 
##                           MARKETING, LOGISTICS & DISTRIBUTION 
##                                                             1 
##                                     MARKTING AND COMMUNICATON 
##                                                             1 
##                                                       (Other) 
##                                                            27 
##                                                          NA's 
##                                                           101 
## 
## $`Relevance of education to venture`
##   NO  YES NA's 
##   96  281   95 
## 
## $`Relevance of experience to venture`
##   NO  YES NA's 
##   84  301   87 
## 
## $`Degree from a Tier 1 or Tier 2 university?`
##   BOTH   NONE TIER_1 TIER_2   NA's 
##     43    144    139     58     88 
## 
## $`Experience in selling and building products`
##   HIGH    LOW MEDIUM   NONE   NA's 
##    127     82    147     34     82 
## 
## $`Top management similarity`
##   HIGH    LOW MEDIUM   NONE   NA's 
##     55     47     88    199     83 
## 
## $`Number of  of Research publications`
##  FEW MANY NONE NA's 
##   57   81  250   84 
## 
## $`Team Composition score`
##   HIGH    LOW MEDIUM   NA's 
##     82    185    121     84 
## 
## $`Dificulty of Obtaining Work force`
##   HIGH    LOW MEDIUM   NA's 
##     58    178    150     86 
## 
## $`Pricing Strategy`
##   NO  YES NA's 
##  198  190   84 
## 
## $`Hyper localisation`
##   NO  YES NA's 
##  330   60   82 
## 
## $`Time to market service or product`
##   HIGH    LOW MEDIUM   NA's 
##     17    253    119     83 
## 
## $`Employee benefits and salary structures`
##   AVERAGE       BAD      GOOD VERY GOOD      NA's 
##        26        42        33        20       351 
## 
## $`Long term relationship with other founders`
##   NO  YES NA's 
##  277  106   89 
## 
## $`Proprietary or patent position (competitive position)`
##   NO  YES NA's 
##  294   92   86 
## 
## $`Barriers of entry for the competitors`
##  NO YES 
## 220 252 
## 
## $`Company awards`
##   NO  YES NA's 
##  311   76   85 
## 
## $`Controversial history of founder or co founder`
##   NO  YES NA's 
##  380   10   82 
## 
## $`Legal risk and intellectual property`
##   NO  YES NA's 
##  329   58   85 
## 
## $`Client Reputation`
##   HIGH    LOW MEDIUM   NA's 
##     58    119     21    274 
## 
## $`Technical proficiencies to analyse and interpret unstructured data`
##   NO  YES NA's 
##  216  173   83 
## 
## $`Solutions offered`
##   NO  YES NA's 
##  169  218   85 
## 
## $`Invested through global incubation competitions?`
##   NO  YES NA's 
##  285   51  136 
## 
## $`Disruptiveness of technology`
##   HIGH    LOW MEDIUM   NA's 
##    108     93    189     82 
## 
## $`Survival through recession, based on existence of the company through recession times`
##             NO NOT APPLICABLE            YES           NA's 
##             31            268             75             98 
## 
## $`Gartner hype cycle stage`
##    PEAK PLATEAU   SLOPE TRIGGER  TROUGH    NA's 
##      77      85      24      67      47     172 
## 
## $`Time to maturity of technology (in years)`
##  0 TO 2  0 TO 5  2 TO 5 5 TO 10    NA's 
##      77       1     180      42     172

It seems that everything is in order. However, note that some variables describe the investors, owners, or founders rather than the company itself. Strictly speaking, each person has his or her own qualifications and experience, but for the sake of simplicity we will treat these variables as describing all of these people together as one unit.
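
As noted earlier, Specialization of highest education needs special processing: the summary shows several spellings of what look like the same field (for example MANAGEMENT, MGMT, and MANGEMENT). One possible way to merge such levels in base R is sketched below on a toy factor. The mapping itself is an assumption made for illustration and would need to be reviewed against the real data.

```r
# Toy factor with a few of the spellings seen in the summary.
spec <- factor(c("MANAGEMENT", "MGMT", "MANGEMENT", "MGM", "ECONOMICS", "ECO", NA))

# Hypothetical mapping from a canonical name to its variant spellings.
variants <- list(
  MANAGEMENT = c("MGMT", "MANGEMENT", "MGM"),
  ECONOMICS  = "ECO"
)

# Renaming several levels to the same label merges them into one level.
for (canonical in names(variants)) {
  levels(spec)[levels(spec) %in% variants[[canonical]]] <- canonical
}

print(table(spec, useNA = "ifany"))
```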

Numeric Variables

Construct a vector with the indices of the numeric variables and convert them to numeric.

numeric_cols <- c(3:5,10,11,18:23,25,61,66,68:70,72,74,88,92,94:96,98,99,102:116)
startups[, numeric_cols] <- map_df(startups[, numeric_cols], as.numeric)
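
One caveat: index 61 is covered by 26:65 in factor_cols and is also listed in numeric_cols, so that column is already a factor when as.numeric reaches it. Calling as.numeric on a factor returns the integer level codes rather than the original values, which is consistent with Years of education ranging from 1 to 3 in the numeric summary even though its factor levels are 18, 21, and 25. A minimal sketch of the pitfall and the usual workaround, on a made-up vector:

```r
# A factor whose labels look like numbers (made-up values).
years <- factor(c("18", "21", "25", "18", NA))

# as.numeric on a factor gives the internal level codes.
codes <- as.numeric(years)

# Going through character first recovers the actual numbers.
values <- as.numeric(as.character(years))

print(codes)
print(values)
```

If the column is meant to be numeric, either leaving it out of factor_cols or converting via as.character first would avoid the problem.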

Let us look at the summary for each variable to make sure that everything is ok.

map(startups[, numeric_cols], summary)

## $`year of founding`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1997    2008    2010    2009    2011    2013      59 
## 
## $`Age of company in years`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   3.000   4.000   4.605   6.000  17.000      59 
## 
## $`Internet Activity Score`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -725.0    -3.5    60.0   114.2   216.0  1535.0      65 
## 
## $`Employee Count`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    4.25   13.00   31.41   31.00  594.00     166 
## 
## $`Employees count MoM change`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -100.0     0.0     0.0    -1.3     6.0    50.0     205 
## 
## $`Number of Investors in Seed`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   1.546   2.000  24.000      49 
## 
## $`Number of Investors in Angel and or VC`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.5768  0.0000  9.0000      49 
## 
## $`Number of Co-founders`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   1.869   2.250   7.000 
## 
## $`Number of of advisors`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.017   1.000  13.000 
## 
## $`Team size Senior leadership`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.731   5.000  24.000 
## 
## $`Team size all employees`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   10.00   16.50   69.48   50.00 5000.00      68 
## 
## $`Number of of repeat investors`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.6065  1.0000 10.0000      40 
## 
## $`Years of education`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.000   2.000   1.634   2.000   3.000     103 
## 
## $`Renowned in professional circle`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    16.0   500.0   500.0   469.1   500.0   500.0      91 
## 
## $`Experience in Fortune 100 organizations`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.2692  1.0000  1.0000      82 
## 
## $`Experience in Fortune 500 organizations`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   0.259   1.000   1.000      82 
## 
## $`Experience in Fortune 1000 organizations`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   0.218   0.000   1.000      82 
## 
## $`Number of Recognitions for Founders and Co-founders`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.00    3.00   72.27  107.50  600.00      81 
## 
## $`Skills score`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   14.00   21.00   21.69   25.00  200.00      81 
## 
## $`google page rank of company website`
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##      483   209558   835064  2518863  2456174 22391670      154 
## 
## $`Industry trend in investing`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    2.00    3.00    2.89    3.00    5.00      82 
## 
## $`Number of Direct competitors`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   2.258   3.000  33.000      80 
## 
## $`Employees per year of company existence`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    3.30    8.30   18.44   15.25  833.30     128 
## 
## $`Last round of funding received (in milionUSD)`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.010   0.750   2.500   5.866   7.500  62.500     167 
## 
## $`Time to 1st investment (in months)`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    1.00   10.00   14.61   19.25  156.00      96 
## 
## $`Avg time to investment - average across all rounds, measured from previous investment`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   2.582   7.309  10.563  12.000 156.000      98 
## 
## $Percent_skill_Entrepreneurship
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   5.882   7.538  11.111 100.000      61 
## 
## $Percent_skill_Operations
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   2.385   3.452  50.000      61 
## 
## $Percent_skill_Engineering
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   9.804  18.632  28.665 100.000      61 
## 
## $Percent_skill_Marketing
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   5.556  11.001  14.286  76.471      61 
## 
## $Percent_skill_Leadership
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   2.870   5.556  40.000      61 
## 
## $`Percent_skill_Data Science`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   1.852   6.082   8.333  80.000      61 
## 
## $`Percent_skill_Business Strategy`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   8.333  10.981  18.382  50.000      61 
## 
## $`Percent_skill_Product Management`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   3.430   5.556  25.000      61 
## 
## $Percent_skill_Sales
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   3.357   5.556  33.333      61 
## 
## $Percent_skill_Domain
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   4.750   5.882  44.444      61 
## 
## $Percent_skill_Law
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.1995  0.0000 33.3333      61 
## 
## $Percent_skill_Consulting
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.4821  0.0000 20.0000      61 
## 
## $Percent_skill_Finance
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   1.592   0.000  78.571      61 
## 
## $Percent_skill_Investment
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   1.359   0.000  33.333      61 
## 
## $`Renown score`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   1.000   3.000   3.292   5.000  11.000      61

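It seems that there are some outliers (for example, `Team size all employees` has a median of 16.5 but a maximum of 5000, and `Skills score` has a 3rd quartile of 25 but a maximum of 200). We will find out more when we reach the EDA phase.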

Date Variables

Construct a vector with the positions of the date variables and convert them

date_cols <- c(13:14)
startups[, date_cols] <- map_df(startups[, date_cols], mdy)

Let us look at the summary for each variable to make sure that everything is ok.

map(startups[, date_cols], summary)

## $`Est. Founding Date`
##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "1997-06-01" "2008-01-01" "2010-02-05" "2009-08-05" "2011-06-16" 
##         Max.         NA's 
## "2013-07-01"        "109" 
## 
## $`Last Funding Date`
##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "2004-04-01" "2010-10-03" "2012-08-06" "2011-12-12" "2013-07-11" 
##         Max.         NA's 
## "2014-04-08"        "122"
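Selecting the date columns by position (13:14) can pick up the wrong columns if the column order ever changes. Below is a minimal sketch of the same conversion keyed on column names. It uses a toy data frame and base R's as.Date() in place of lubridate's mdy(); the sample values, and the assumption that dates are stored as month/day/year strings, are illustrative and not taken from the CSV.

```r
# Toy stand-in for the two date columns; the values are made up for illustration.
toy <- data.frame(
  `Est. Founding Date` = c("6/1/1997", "No Info", "7/1/2013"),
  `Last Funding Date`  = c("4/1/2004", "4/8/2014", NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Refer to the date columns by name rather than by position.
date_col_names <- c("Est. Founding Date", "Last Funding Date")

# With an explicit format, unparseable entries such as "No Info" become NA.
toy[date_col_names] <- lapply(toy[date_col_names], as.Date, format = "%m/%d/%Y")

str(toy)
```

Entries that cannot be parsed end up as NA, which is consistent with the NA's counts reported in the summaries above.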

Remove polluted predictors

Now that the variables are in their proper data types, let us remove the predictors with 40% or more missing data

startups <- startups[colSums(is.na(startups))/nrow(startups) < .4]
dim(startups)

## [1] 472 113

It seems we got rid of 3 variables (116 down to 113). Let us take another look at the data now

head(startups)

## # A tibble: 6 x 113
##   Company_Name `Dependent-Company S~ `year of foundin~ `Age of company in~
##   <chr>        <fct>                             <dbl>               <dbl>
## 1 Company1     SUCCESS                              NA               NA   
## 2 Company2     SUCCESS                            2011                3.00
## 3 Company3     SUCCESS                            2011                3.00
## 4 Company4     SUCCESS                            2009                5.00
## 5 Company5     SUCCESS                            2010                4.00
## 6 Company6     SUCCESS                            2010                4.00
## # ... with 109 more variables: `Internet Activity Score` <dbl>, `Short
## #   Description of company profile` <chr>, `Industry of company` <chr>,
## #   `Focus functions of company` <chr>, Investors <chr>, `Employee Count`
## #   <dbl>, `Has the team size grown` <fct>, `Est. Founding Date` <date>,
## #   `Last Funding Date` <date>, `Last Funding Amount` <int>, `Country of
## #   company` <fct>, `Continent of company` <fct>, `Number of Investors in
## #   Seed` <dbl>, `Number of Investors in Angel and or VC` <dbl>, `Number
## #   of Co-founders` <dbl>, `Number of of advisors` <dbl>, `Team size
## #   Senior leadership` <dbl>, `Team size all employees` <dbl>, `Presence
## #   of a top angel or venture fund in previous round of investment` <fct>,
## #   `Number of of repeat investors` <dbl>, `Number of Sales Support
## #   material` <fct>, `Worked in top companies` <fct>, `Average size of
## #   companies worked for in the past` <fct>, `Have been part of startups
## #   in the past?` <fct>, `Have been part of successful startups in the
## #   past?` <fct>, `Was he or she partner in Big 5 consulting?` <fct>,
## #   `Consulting experience?` <fct>, `Product or service company?` <fct>,
## #   `Catering to product/service across verticals` <fct>, `Focus on
## #   private or public data?` <fct>, `Focus on consumer data?` <fct>,
## #   `Focus on structured or unstructured data` <fct>, `Subscription based
## #   business` <fct>, `Cloud or platform based serive/product?` <fct>,
## #   `Local or global player` <fct>, `Linear or Non-linear business model`
## #   <fct>, `Capital intensive business e.g. e-commerce, Engineering
## #   products and operations can also cause a business to be capital
## #   intensive` <fct>, `Number of of Partners of company` <fct>,
## #   `Crowdsourcing based business` <fct>, `Crowdfunding based business`
## #   <fct>, `Machine Learning based business` <fct>, `Predictive Analytics
## #   business` <fct>, `Speech analytics business` <fct>, `Prescriptive
## #   analytics business` <fct>, `Big Data Business` <fct>, `Cross-Channel
## #   Analytics/ marketing channels` <fct>, `Owns data or not? (monetization
## #   of data) e.g. Factual` <fct>, `Is the company an aggregator/market
## #   place? e.g. Bluekai` <fct>, `Online or offline venture - physical
## #   location based business or online venture?` <fct>, `B2C or B2B
## #   venture?` <fct>, `Top forums like 'Tech crunch' or 'Venture beat'
## #   talking about the company/model - How much is it being talked about?`
## #   <fct>, `Average Years of experience for founder and co founder` <fct>,
## #   `Exposure across the globe` <fct>, `Breadth of experience across
## #   verticals` <fct>, `Highest education` <fct>, `Years of education`
## #   <dbl>, `Specialization of highest education` <fct>, `Relevance of
## #   education to venture` <fct>, `Relevance of experience to venture`
## #   <fct>, `Degree from a Tier 1 or Tier 2 university?` <fct>, `Renowned
## #   in professional circle` <dbl>, `Experience in selling and building
## #   products` <fct>, `Experience in Fortune 100 organizations` <dbl>,
## #   `Experience in Fortune 500 organizations` <dbl>, `Experience in
## #   Fortune 1000 organizations` <dbl>, `Top management similarity` <fct>,
## #   `Number of Recognitions for Founders and Co-founders` <dbl>, `Number
## #   of of Research publications` <fct>, `Skills score` <dbl>, `Team
## #   Composition score` <fct>, `Dificulty of Obtaining Work force` <fct>,
## #   `Pricing Strategy` <fct>, `Hyper localisation` <fct>, `Time to market
## #   service or product` <fct>, `Long term relationship with other
## #   founders` <fct>, `Proprietary or patent position (competitive
## #   position)` <fct>, `Barriers of entry for the competitors` <fct>,
## #   `Company awards` <fct>, `Controversial history of founder or co
## #   founder` <fct>, `Legal risk and intellectual property` <fct>, `google
## #   page rank of company website` <dbl>, `Technical proficiencies to
## #   analyse and interpret unstructured data` <fct>, `Solutions offered`
## #   <fct>, `Invested through global incubation competitions?` <fct>,
## #   `Industry trend in investing` <dbl>, `Disruptiveness of technology`
## #   <fct>, `Number of Direct competitors` <dbl>, `Employees per year of
## #   company existence` <dbl>, `Last round of funding received (in
## #   milionUSD)` <dbl>, `Survival through recession, based on existence of
## #   the company through recession times` <fct>, `Time to 1st investment
## #   (in months)` <dbl>, `Avg time to investment - average across all
## #   rounds, measured from previous investment` <dbl>, `Gartner hype cycle
## #   stage` <fct>, `Time to maturity of technology (in years)` <fct>,
## #   Percent_skill_Entrepreneurship <dbl>, Percent_skill_Operations <dbl>,
## #   Percent_skill_Engineering <dbl>, Percent_skill_Marketing <dbl>,
## #   Percent_skill_Leadership <dbl>, `Percent_skill_Data Science` <dbl>,
## #   ...

Much better.
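The missing-data filter used above can be checked on a small example. The sketch below uses a toy data frame with made-up values and computes the fraction of missing values per column with colMeans(is.na(...)), which is equivalent to colSums(is.na(...)) / nrow(...); the object names are illustrative.

```r
# Toy data frame with made-up values.
toy <- data.frame(
  a = c(1, NA, 3, 4, 5),     # 20% missing
  b = c(NA, NA, 3, NA, 5),   # 60% missing
  c = c(1, 2, 3, 4, 5)       # nothing missing
)

# Fraction of missing values in each column.
missing_frac <- colMeans(is.na(toy))

# Keep only the columns below the 40% threshold.
keep <- missing_frac < 0.4
toy_filtered <- toy[, keep, drop = FALSE]

# Names of the columns that were dropped.
dropped <- names(toy)[!keep]
dropped
```

Inspecting the names of the dropped columns before overwriting the data makes it easier to confirm that nothing important was removed.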

Textual Variables

Textual columns are those that contain free text. What differentiates these from the categorical columns is that the number of unique values for the textual columns would be too big. Such columns are typically messy, and we will have to deal with them on a column-by-column basis as no single preprocessing procedure would suit them all.

Before we move on, let’s modify the column names by making them lowercase and replacing spaces with underscores. Note that this is more of a personal preference than a necessity.

colnames(startups) <- tolower(gsub(x = colnames(startups), pattern = ' ', replacement = '_', fixed = TRUE))
# the 'fixed' parameter is set as TRUE in order to match the pattern exactly. 
# otherwise, the pattern will be interpreted as a regular expression instead, and an unexpected output may result.

head(colnames(startups))

## [1] "company_name"                        
## [2] "dependent-company_status"            
## [3] "year_of_founding"                    
## [4] "age_of_company_in_years"             
## [5] "internet_activity_score"             
## [6] "short_description_of_company_profile"
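Names such as dependent-company_status still contain characters (here a hyphen) that require backtick quoting. A possible stricter variant, which is not used in the rest of this write-up, collapses every run of non-alphanumeric characters into a single underscore. The helper name clean_names is hypothetical.

```r
# Hypothetical helper: lowercase, then squash anything that is not a
# letter or digit into a single underscore.
clean_names <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)  # runs of other characters -> one underscore
  gsub("^_+|_+$", "", x)           # trim underscores at either end
}

cleaned <- clean_names(c("Dependent-Company Status",
                         "Est. Founding Date",
                         "Percent_skill_Data Science"))
cleaned
```

Adopting this would change names that later code relies on (for example `dependent-company_status`), so it is shown only as an option.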

After a quick manual inspection (on the .csv file), we identified the following textual variables. We additionally set the contents of these columns to lowercase for easier processing later on.

# the columns containing text in them
textual_col_names <- c('industry_of_company',
                       'short_description_of_company_profile',
                       'focus_functions_of_company',
                       'investors')

# set contents of these columns to lowercase
startups[,textual_col_names] <- map_df(startups[,textual_col_names], tolower)

Industry of the Company

And now, a quick look at the industry_of_company column – which indicates the particular domain that a company is working in – shows us…

head(startups$industry_of_company, 10)

##  [1] NA                                                  
##  [2] "market research|marketing|crowdfunding"            
##  [3] "analytics|cloud computing|software development"    
##  [4] "mobile|analytics"                                  
##  [5] "analytics|marketing|enterprise software"           
##  [6] "food & beverages|hospitality"                      
##  [7] "analytics"                                         
##  [8] "cloud computing|network / hosting / infrastructure"
##  [9] "analytics|mobile|marketing"                        
## [10] "healthcare|pharmaceuticals|analytics"

From the output, it seems that it is a multiple-value variable with | as the separator.

Now, we wish to create a document-term matrix (DTM) from this column. The DTM is a matrix where the documents (i.e. records) and terms (e.g. words) are the rows and columns, respectively, and the cells contain the frequencies of the terms occurring in the documents (e.g. the cell DTM(i,j) would tell how many times term j occurred in document i). To generate the DTM, we shall use the quanteda package.

# create the corpus object from the industry_of_company column
mycorpus <- corpus(startups$industry_of_company)

# generate the DTM using the formed corpus
dfm_industry <- dfm(mycorpus,               # the corpus to generate the DTM from
                    tolower = FALSE)        # data is already lowercase

At this point, let’s just have a look at the DTM before moving on.

# observe the 'terms' of the DTM
colnames(dfm_industry)

##  [1] "NA"                 "market"             "research"          
##  [4] "|"                  "marketing"          "crowdfunding"      
##  [7] "analytics"          "cloud"              "computing"         
## [10] "software"           "development"        "mobile"            
## [13] "enterprise"         "food"               "&"                 
## [16] "beverages"          "hospitality"        "network"           
## [19] "/"                  "hosting"            "infrastructure"    
## [22] "healthcare"         "pharmaceuticals"    "media"             
## [25] "finance"            "music"              "e-commerce"        
## [28] "gaming"             "advertising"        "retail"            
## [31] "security"           "email"              "human"             
## [34] "resources"          "("                  "hr"                
## [37] ")"                  "career"             "job"               
## [40] "search"             "publishing"         "education"         
## [43] "energy"             "deals"              "entertainment"     
## [46] "transportation"     "social"             "networking"        
## [49] "real"               "estate"             "telecommunications"
## [52] "insurance"          "cleantech"          "space"             
## [55] "travel"             "classifieds"        "government"

Unfortunately, the output is unsatisfactory. Some of the terms that are parsed by the quanteda package include |, /, &, ( and ) (improper terms, obviously). We also observe that industries such as cloud computing and software development have been broken down into their constituent words, which is also incorrect.

With some investigation of the quanteda package (specifically the quanteda::tokens() function), it seems that we cannot specify the | character as the separator for splitting the text into terms: the tokenizer splits on whitespace and, as the output above shows, treats punctuation marks as tokens of their own.

As such, we will deal with this issue by performing the following actions:

Replace spaces with underscores: merges each of the multi-word industries into a single term.
Replace occurrences of the | character with spaces: helps the quanteda::dfm() function split the terms over the new spaces.
Remove the special characters, /, &, ( and ): makes life easier for the quanteda::dfm() function.

It may be a good idea to look at the cases on which we intend to perform the above actions. Let’s have a look.

# retrieve and view those industries that have '/' in them
startups$industry_of_company[ grepl(x = startups$industry_of_company, pattern = '/', fixed = TRUE) ]

##  [1] "cloud computing|network / hosting / infrastructure"                                      
##  [2] "analytics|security|network / hosting / infrastructure"                                   
##  [3] "human resources (hr)|marketing|career / job search"                                      
##  [4] "analytics|network / hosting / infrastructure"                                            
##  [5] "human resources (hr)|enterprise software|career / job search|social networking|analytics"
##  [6] "network / hosting / infrastructure|food & beverages|analytics"                           
##  [7] "network / hosting / infrastructure"                                                      
##  [8] "media|entertainment|analytics|network / hosting / infrastructure|publishing"             
##  [9] "network / hosting / infrastructure|enterprise software|software development|analytics"   
## [10] "network / hosting / infrastructure"                                                      
## [11] "network / hosting / infrastructure|analytics"                                            
## [12] "network / hosting / infrastructure|enterprise software"                                  
## [13] "career / job search"                                                                     
## [14] "network / hosting / infrastructure|publishing"                                           
## [15] "classifieds|network / hosting / infrastructure"                                          
## [16] "e-commerce|analytics|network / hosting / infrastructure"                                 
## [17] "analytics|social networking|network / hosting / infrastructure"                          
## [18] "human resources (hr)|career / job search"                                                
## [19] "human resources (hr)|enterprise software|career / job search|social networking|analytics"
## [20] "network / hosting / infrastructure|publishing"                                           
## [21] "analytics|social networking|network / hosting / infrastructure"                          
## [22] "human resources (hr)|analytics|marketing|career / job search"                            
## [23] "network / hosting / infrastructure|telecommunications|enterprise software"               
## [24] "network / hosting / infrastructure|marketing"                                            
## [25] "human resources (hr)|career / job search"                                                
## [26] "e-commerce|network / hosting / infrastructure"

# retrieve and view those industries that have '&' in them
startups$industry_of_company[ grepl(x = startups$industry_of_company, pattern = '&', fixed = TRUE) ]

## [1] "food & beverages|hospitality"                                 
## [2] "analytics|food & beverages|social networking|mobile"          
## [3] "network / hosting / infrastructure|food & beverages|analytics"
## [4] "e-commerce|food & beverages|mobile"                           
## [5] "food & beverages"                                             
## [6] "e-commerce|food & beverages|mobile"                           
## [7] "healthcare|analytics|mobile|food & beverages"                 
## [8] "e-commerce|food & beverages"

# retrieve and view those industries that have '(' in them
startups$industry_of_company[ grepl(x = startups$industry_of_company, pattern = '(', fixed = TRUE) ]

## [1] "human resources (hr)|marketing|career / job search"                                      
## [2] "human resources (hr)|enterprise software|career / job search|social networking|analytics"
## [3] "human resources (hr)|career / job search"                                                
## [4] "human resources (hr)|enterprise software|career / job search|social networking|analytics"
## [5] "human resources (hr)|analytics|marketing|career / job search"                            
## [6] "human resources (hr)|career / job search"
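The inspections above all pass fixed = TRUE, and for these particular characters it matters. The sketch below, run on one of the values shown above, illustrates what happens with and without it; the variable names are illustrative.

```r
x <- "human resources (hr)|marketing|career / job search"

# With fixed = TRUE the pattern is matched as a literal string.
has_paren <- grepl("(", x, fixed = TRUE)

# Without it, "(" is read as the start of a regex group. A lone "(" is
# not a valid regular expression, so the call fails with an error.
regex_attempt <- tryCatch(
  suppressWarnings(grepl("(", x)),
  error = function(e) e
)

# "|" is the regex alternation operator, so it needs fixed = TRUE
# (or escaping, as in "\\|") to act as a literal separator.
parts_fixed   <- strsplit(x, "|", fixed = TRUE)[[1]]
parts_escaped <- strsplit(x, "\\|")[[1]]

has_paren
inherits(regex_attempt, "error")
parts_fixed
```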

From the above inspections, it seems that all occurrences of / are in the terms network / hosting / infrastructure and career / job search. As for & and (, their occurrences are in the terms food & beverages and human resources (hr), respectively. Accordingly, we perform the following modifications to the industry_of_company variable.

# remove all occurrences of ' (hr)', ' /', ' &'
startups$industry_of_company <- gsub(startups$industry_of_company, pattern=' (hr)', replacement='', fixed=TRUE)
startups$industry_of_company <- gsub(startups$industry_of_company, pattern=' /',    replacement='', fixed=TRUE)
startups$industry_of_company <- gsub(startups$industry_of_company, pattern=' &',    replacement='', fixed=TRUE)

# replace spaces with underscores to merge multi-word terms
startups$industry_of_company <- gsub(startups$industry_of_company, pattern=' ',     replacement='_', fixed=TRUE)

# replace all occurrences of '|' with spaces to separate between terms
startups$industry_of_company <- gsub(startups$industry_of_company, pattern='|',     replacement=' ', fixed=TRUE)

head(startups$industry_of_company, 10)

##  [1] NA                                              
##  [2] "market_research marketing crowdfunding"        
##  [3] "analytics cloud_computing software_development"
##  [4] "mobile analytics"                              
##  [5] "analytics marketing enterprise_software"       
##  [6] "food_beverages hospitality"                    
##  [7] "analytics"                                     
##  [8] "cloud_computing network_hosting_infrastructure"
##  [9] "analytics mobile marketing"                    
## [10] "healthcare pharmaceuticals analytics"
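For reuse, the five substitutions above could be wrapped in a small helper and checked against a few of the raw values shown earlier. This is only a sketch; the function name normalise_industries is hypothetical.

```r
# Hypothetical helper bundling the substitutions, in the same order.
normalise_industries <- function(x) {
  x <- gsub(" (hr)", "", x, fixed = TRUE)  # drop the "(hr)" suffix
  x <- gsub(" /",    "", x, fixed = TRUE)  # drop slashes inside a term
  x <- gsub(" &",    "", x, fixed = TRUE)  # drop ampersands inside a term
  x <- gsub(" ",    "_", x, fixed = TRUE)  # merge multi-word terms
  gsub("|", " ", x, fixed = TRUE)          # separate terms with spaces
}

raw <- c(NA,
         "cloud computing|network / hosting / infrastructure",
         "food & beverages|hospitality",
         "human resources (hr)|career / job search")

normalised <- normalise_industries(raw)
normalised
```

The order of the steps matters: the spaces must be turned into underscores before | is turned into a space, otherwise the term boundaries would be lost.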

Modifications have been applied successfully and as intended; i.e. multi-word terms are merged with underscores, and the terms are separated by spaces. Now, let’s generate the DTM and observe the obtained terms.

# create the corpus object from the industry_of_company column
mycorpus <- corpus(startups$industry_of_company)

# generate the DTM using the formed corpus
dfm_industry <- dfm(mycorpus,               # the corpus to generate the DTM from
                    tolower = FALSE)        # data is already lowercase

# observe the 'terms' of the DTM
colnames(dfm_industry)

##  [1] "NA"                             "market_research"               
##  [3] "marketing"                      "crowdfunding"                  
##  [5] "analytics"                      "cloud_computing"               
##  [7] "software_development"           "mobile"                        
##  [9] "enterprise_software"            "food_beverages"                
## [11] "hospitality"                    "network_hosting_infrastructure"
## [13] "healthcare"                     "pharmaceuticals"               
## [15] "media"                          "finance"                       
## [17] "music"                          "e-commerce"                    
## [19] "gaming"                         "advertising"                   
## [21] "retail"                         "security"                      
## [23] "email"                          "human_resources"               
## [25] "career_job_search"              "publishing"                    
## [27] "education"                      "energy"                        
## [29] "deals"                          "entertainment"                 
## [31] "transportation"                 "social_networking"             
## [33] "real_estate"                    "search"                        
## [35] "telecommunications"             "insurance"                     
## [37] "cleantech"                      "space_travel"                  
## [39] "classifieds"                    "travel"                        
## [41] "government"
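As a cross-check that does not rely on a tokenizer, a small term-count matrix can also be built in base R by splitting the raw strings on | directly. This is only a sketch on four of the raw values shown earlier, not the approach used in this analysis, and the variable names are illustrative.

```r
industries <- c(NA,
                "market research|marketing|crowdfunding",
                "analytics|cloud computing|software development",
                "mobile|analytics")

# Split each record on the literal "|" separator.
terms_per_doc <- strsplit(industries, "|", fixed = TRUE)

# Vocabulary: every distinct term, ignoring missing records.
vocab <- unique(unlist(terms_per_doc))
vocab <- vocab[!is.na(vocab)]

# One row per record, one column per term, cells hold term counts.
dtm <- t(vapply(
  terms_per_doc,
  function(terms) as.integer(table(factor(terms, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab

dtm
```

Unlike the quanteda result, a missing record here simply becomes a row of zeros rather than contributing an "NA" term.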

At last, we can move on to the analysis of the industry_of_company variable.

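Let’s create a new data frame object where we bind the industry’s DTM with the column of the response variable. Note that the DTM is currently a quanteda dfm object (a sparse matrix), so we need to coerce it into a data.frame before binding. Newer quanteda releases deprecate as.data.frame() for dfm objects in favour of convert(..., to = "data.frame"), so the call below may need adjusting depending on the installed version.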

# convert `dfm_industry` to data.frame and bind with response variable
startups_industry <- cbind(startups[,2], as.data.frame(dfm_industry))

glimpse(startups_industry)

## Observations: 472
## Variables: 42
## $ `dependent-company_status`     <fct> SUCCESS, SUCCESS, SUCCESS, SUCC...
## $ `NA`                           <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ market_research                <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ marketing                      <dbl> 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0...
## $ crowdfunding                   <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ analytics                      <dbl> 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1...
## $ cloud_computing                <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0...
## $ software_development           <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ mobile                         <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0...
## $ enterprise_software            <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1...
## $ food_beverages                 <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ hospitality                    <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ network_hosting_infrastructure <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0...
## $ healthcare                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0...
## $ pharmaceuticals                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0...
## $ media                          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ finance                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ music                          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `e-commerce`                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ gaming                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ advertising                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ retail                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ security                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ email                          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ human_resources                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ career_job_search              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ publishing                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ education                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ energy                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ deals                          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ entertainment                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ transportation                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ social_networking              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ real_estate                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ search                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ telecommunications             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ insurance                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ cleantech                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ space_travel                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ classifieds                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ travel                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ government                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

The bind appears to be successful. Now, let’s view the frequency of each of the industries among the list of companies in the dataset.

# frequencies of successes/fails in each of the industries
industry_frequencies <-
  startups_industry %>% 
  group_by(`dependent-company_status`) %>% 
  summarise_at(vars(-starts_with("dependent")), sum) %>%   # pass sum directly; funs() is deprecated in newer dplyr
  t()

industry_frequencies

##                                [,1]     [,2]     
## dependent-company_status       "FAILED" "SUCCESS"
## NA                             "69"     "55"     
## market_research                "1"      "7"      
## marketing                      "11"     "55"     
## crowdfunding                   "0"      "2"      
## analytics                      " 18"    "180"    
## cloud_computing                " 6"     "13"     
## software_development           " 7"     "11"     
## mobile                         "17"     "37"     
## enterprise_software            " 3"     "27"     
## food_beverages                 "5"      "3"      
## hospitality                    "0"      "4"      
## network_hosting_infrastructure " 9"     "10"     
## healthcare                     "5"      "7"      
## pharmaceuticals                "0"      "1"      
## media                          " 9"     "19"     
## finance                        "2"      "6"      
## music                          "6"      "2"      
## e-commerce                     "17"     "37"     
## gaming                         "3"      "4"      
## advertising                    "11"     "26"     
## retail                         " 0"     "15"     
## security                       "0"      "6"      
## email                          "3"      "4"      
## human_resources                "2"      "4"      
## career_job_search              "2"      "5"      
## publishing                     "2"      "6"      
## education                      "3"      "3"      
## energy                         "4"      "8"      
## deals                          "0"      "3"      
## entertainment                  "7"      "8"      
## transportation                 "1"      "1"      
## social_networking              " 8"     "10"     
## real_estate                    "0"      "3"      
## search                         "6"      "5"      
## telecommunications             "3"      "3"      
## insurance                      "0"      "1"      
## cleantech                      "1"      "4"      
## space_travel                   "1"      "0"      
## classifieds                    "2"      "0"      
## travel                         "0"      "1"      
## government                     "0"      "1"

Judging from the output, all the numerical values have been turned into character strings. This happens because t() converts the summarised tibble into a matrix, which can hold only one type, and the dependent-company_status column (now the 1st row) is not numeric. Also note that the industries are the names of the rows. For reasons that will be apparent very shortly, we will want the industries to be an actual column in industry_frequencies. Moreover, we will have to remove the first row and then set the names of the columns as failed and success.

Also note that industry_frequencies is currently of type matrix (not data.frame).

# remove the 1st row containing the 'dependent-company_status'
industry_frequencies <- industry_frequencies[-1,]

# add the industries (currently the row names) as a column in 'industry_frequencies'
industry_frequencies <- cbind(rownames(industry_frequencies), industry_frequencies)

# remove the row names
rownames(industry_frequencies) <- NULL

# set the column names
colnames(industry_frequencies) <- c('industry','failed','success')

industry_frequencies

##       industry                         failed success
##  [1,] "NA"                             "69"   "55"   
##  [2,] "market_research"                "1"    "7"    
##  [3,] "marketing"                      "11"   "55"   
##  [4,] "crowdfunding"                   "0"    "2"    
##  [5,] "analytics"                      " 18"  "180"  
##  [6,] "cloud_computing"                " 6"   "13"   
##  [7,] "software_development"           " 7"   "11"   
##  [8,] "mobile"                         "17"   "37"   
##  [9,] "enterprise_software"            " 3"   "27"   
## [10,] "food_beverages"                 "5"    "3"    
## [11,] "hospitality"                    "0"    "4"    
## [12,] "network_hosting_infrastructure" " 9"   "10"   
## [13,] "healthcare"                     "5"    "7"    
## [14,] "pharmaceuticals"                "0"    "1"    
## [15,] "media"                          " 9"   "19"   
## [16,] "finance"                        "2"    "6"    
## [17,] "music"                          "6"    "2"    
## [18,] "e-commerce"                     "17"   "37"   
## [19,] "gaming"                         "3"    "4"    
## [20,] "advertising"                    "11"   "26"   
## [21,] "retail"                         " 0"   "15"   
## [22,] "security"                       "0"    "6"    
## [23,] "email"                          "3"    "4"    
## [24,] "human_resources"                "2"    "4"    
## [25,] "career_job_search"              "2"    "5"    
## [26,] "publishing"                     "2"    "6"    
## [27,] "education"                      "3"    "3"    
## [28,] "energy"                         "4"    "8"    
## [29,] "deals"                          "0"    "3"    
## [30,] "entertainment"                  "7"    "8"    
## [31,] "transportation"                 "1"    "1"    
## [32,] "social_networking"              " 8"   "10"   
## [33,] "real_estate"                    "0"    "3"    
## [34,] "search"                         "6"    "5"    
## [35,] "telecommunications"             "3"    "3"    
## [36,] "insurance"                      "0"    "1"    
## [37,] "cleantech"                      "1"    "4"    
## [38,] "space_travel"                   "1"    "0"    
## [39,] "classifieds"                    "2"    "0"    
## [40,] "travel"                         "0"    "1"    
## [41,] "government"                     "0"    "1"

Now we have a three-column matrix containing the number of successful and failed companies in each of 41 different industries. We’ll be doing some visualizations with ggplot2, which expects the data in the form of a data frame, which is why we do the following:

# convert to data.frame
industry_frequencies <- as.data.frame(industry_frequencies)

industry_frequencies

##                          industry failed success
## 1                              NA     69      55
## 2                 market_research      1       7
## 3                       marketing     11      55
## 4                    crowdfunding      0       2
## 5                       analytics     18     180
## 6                 cloud_computing      6      13
## 7            software_development      7      11
## 8                          mobile     17      37
## 9             enterprise_software      3      27
## 10                 food_beverages      5       3
## 11                    hospitality      0       4
## 12 network_hosting_infrastructure      9      10
## 13                     healthcare      5       7
## 14                pharmaceuticals      0       1
## 15                          media      9      19
## 16                        finance      2       6
## 17                          music      6       2
## 18                     e-commerce     17      37
## 19                         gaming      3       4
## 20                    advertising     11      26
## 21                         retail      0      15
## 22                       security      0       6
## 23                          email      3       4
## 24                human_resources      2       4
## 25              career_job_search      2       5
## 26                     publishing      2       6
## 27                      education      3       3
## 28                         energy      4       8
## 29                          deals      0       3
## 30                  entertainment      7       8
## 31                 transportation      1       1
## 32              social_networking      8      10
## 33                    real_estate      0       3
## 34                         search      6       5
## 35             telecommunications      3       3
## 36                      insurance      0       1
## 37                      cleantech      1       4
## 38                   space_travel      1       0
## 39                    classifieds      2       0
## 40                         travel      0       1
## 41                     government      0       1

Let’s do some visualizations, shall we?

ggplot(industry_frequencies, aes(x = industry, y = failed)) + # X: industry, Y: #failed companies
  geom_bar(stat="identity") +                                 #
  theme(axis.text.x = element_text(angle = 90,                # rotate orientation of x-axis labels
                                   vjust = 0.3,               # top-align x-axis labels
                                   hjust = 1))                # right-align x-axis labels

Excuse me!? What’s up with the y-axis? Why is the order of the labels all messed up? Let’s see what the data type of the failed column is.

# check type of 'failed' column
typeof(industry_frequencies$failed)

## [1] "integer"

Still, no reason for it to behave the way it did. I wonder… Let’s convert it to type double and see what happens.

# convert 'failed' column to type 'double'
industry_frequencies$failed <- as.double(industry_frequencies$failed)

ggplot(industry_frequencies, aes(x = industry, y = failed)) + # X: industry, Y: #failed companies
  geom_bar(stat="identity") +                                 #
  theme(axis.text.x = element_text(angle = 90,                # rotate orientation of x-axis labels
                                   vjust = 0.3,               # top-align x-axis labels
                                   hjust = 1))                # right-align x-axis labels

Well, what d’you know? It’s all fixed now, right? Nope! The above figure is actually identical to the one before it (with the exception of the y-axis scale). In addition, the y-axis is not supposed to end at 18. The NA industry has 69 failed companies, so its bar should be dwarfing the other industries; the next biggest value among the failed companies is 18.

The NA industry serves no purpose here. Let’s try removing it and see what happens.

# remove the NA row
industry_frequencies <- industry_frequencies %>% filter(!is.na(industry))

industry_frequencies

##                          industry failed success
## 1                              NA     17      55
## 2                 market_research      9       7
## 3                       marketing     10      55
## 4                    crowdfunding      8       2
## 5                       analytics      2     180
## 6                 cloud_computing      4      13
## 7            software_development      5      11
## 8                          mobile     11      37
## 9             enterprise_software      3      27
## 10                 food_beverages     15       3
## 11                    hospitality      8       4
## 12 network_hosting_infrastructure      7      10
## 13                     healthcare     15       7
## 14                pharmaceuticals      8       1
## 15                          media      7      19
## 16                        finance     12       6
## 17                          music     16       2
## 18                     e-commerce     11      37
## 19                         gaming     13       4
## 20                    advertising     10      26
## 21                         retail      1      15
## 22                       security      8       6
## 23                          email     13       4
## 24                human_resources     12       4
## 25              career_job_search     12       5
## 26                     publishing     12       6
## 27                      education     13       3
## 28                         energy     14       8
## 29                          deals      8       3
## 30                  entertainment     18       8
## 31                 transportation      9       1
## 32              social_networking      6      10
## 33                    real_estate      8       3
## 34                         search     16       5
## 35             telecommunications     13       3
## 36                      insurance      8       1
## 37                      cleantech      9       4
## 38                   space_travel      9       0
## 39                    classifieds     12       0
## 40                         travel      8       1
## 41                     government      8       1

The NA row does not feel like leaving, eh? Let’s try this then.

# remove the NA row
industry_frequencies <- industry_frequencies %>% filter(!(industry == 'NA'))

industry_frequencies

##                          industry failed success
## 1                 market_research      9       7
## 2                       marketing     10      55
## 3                    crowdfunding      8       2
## 4                       analytics      2     180
## 5                 cloud_computing      4      13
## 6            software_development      5      11
## 7                          mobile     11      37
## 8             enterprise_software      3      27
## 9                  food_beverages     15       3
## 10                    hospitality      8       4
## 11 network_hosting_infrastructure      7      10
## 12                     healthcare     15       7
## 13                pharmaceuticals      8       1
## 14                          media      7      19
## 15                        finance     12       6
## 16                          music     16       2
## 17                     e-commerce     11      37
## 18                         gaming     13       4
## 19                    advertising     10      26
## 20                         retail      1      15
## 21                       security      8       6
## 22                          email     13       4
## 23                human_resources     12       4
## 24              career_job_search     12       5
## 25                     publishing     12       6
## 26                      education     13       3
## 27                         energy     14       8
## 28                          deals      8       3
## 29                  entertainment     18       8
## 30                 transportation      9       1
## 31              social_networking      6      10
## 32                    real_estate      8       3
## 33                         search     16       5
## 34             telecommunications     13       3
## 35                      insurance      8       1
## 36                      cleantech      9       4
## 37                   space_travel      9       0
## 38                    classifieds     12       0
## 39                         travel      8       1
## 40                     government      8       1
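
Why did the first attempt leave the row in place? The industry label came from a row name, so it is the literal string "NA" rather than a missing value, and is.na() does not match it. A small sketch with a made-up vector x:

```r
# "NA" in quotes is ordinary text; a bare NA is a missing value
x <- c("NA", NA, "analytics")

is.na(x)             # FALSE  TRUE FALSE  -> the string "NA" is not missing
x == "NA"            # TRUE    NA  FALSE  -> comparing against the text matches it

# keep everything except the literal "NA" label (real NAs are kept here)
x[is.na(x) | x != "NA"]
```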

Well, the NA row is gone, but the data are all irreversibly damaged now. For whatever reason, the data now do not reflect those obtained from the DTM that we generated earlier from the industry_of_company variable. Let us quickly undo the mess we have made and start over.

# convert `dfm_industry` to data.frame and bind with response variable
startups_industry <- cbind(startups[,2], as.data.frame(dfm_industry))

# frequencies of successes/fails in each of the industries
industry_frequencies <- 
  startups_industry %>% 
  group_by(`dependent-company_status`) %>% 
  summarise_at(vars(-starts_with("dependent")), sum) %>%   # funs() is deprecated in newer dplyr
  t()

# remove the 1st row containing the 'dependent-company_status'
industry_frequencies <- industry_frequencies[-1,]

# add the industries (currently the row names) as a column in 'industry_frequencies'
industry_frequencies <- cbind(rownames(industry_frequencies), industry_frequencies)

# remove the row names
rownames(industry_frequencies) <- NULL

# set the column names
colnames(industry_frequencies) <- c('industry','failed','success')

# convert to data.frame
industry_frequencies <- as.data.frame(industry_frequencies)

# remove the NA row
industry_frequencies <- industry_frequencies %>% filter(!(industry == 'NA'))

industry_frequencies

##                          industry failed success
## 1                 market_research      1       7
## 2                       marketing     11      55
## 3                    crowdfunding      0       2
## 4                       analytics     18     180
## 5                 cloud_computing      6      13
## 6            software_development      7      11
## 7                          mobile     17      37
## 8             enterprise_software      3      27
## 9                  food_beverages      5       3
## 10                    hospitality      0       4
## 11 network_hosting_infrastructure      9      10
## 12                     healthcare      5       7
## 13                pharmaceuticals      0       1
## 14                          media      9      19
## 15                        finance      2       6
## 16                          music      6       2
## 17                     e-commerce     17      37
## 18                         gaming      3       4
## 19                    advertising     11      26
## 20                         retail      0      15
## 21                       security      0       6
## 22                          email      3       4
## 23                human_resources      2       4
## 24              career_job_search      2       5
## 25                     publishing      2       6
## 26                      education      3       3
## 27                         energy      4       8
## 28                          deals      0       3
## 29                  entertainment      7       8
## 30                 transportation      1       1
## 31              social_networking      8      10
## 32                    real_estate      0       3
## 33                         search      6       5
## 34             telecommunications      3       3
## 35                      insurance      0       1
## 36                      cleantech      1       4
## 37                   space_travel      1       0
## 38                    classifieds      2       0
## 39                         travel      0       1
## 40                     government      0       1

Please work now, okay?

ggplot(industry_frequencies, aes(x = industry, y = failed)) + # X: industry, Y: #failed companies
  geom_bar(stat="identity") +                                 #
  theme(axis.text.x = element_text(angle = 90,                # rotate orientation of x-axis labels
                                   vjust = 0.3,               # top-align x-axis labels
                                   hjust = 1))                # right-align x-axis labels

The messed-up y-axis again… converting to double… (not optimistic)

ggplot(industry_frequencies, aes(x = industry, y = as.double(failed))) + 
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 90,                # rotate orientation of x-axis labels
                                   vjust = 0.3,               # top-align x-axis labels
                                   hjust = 1))                # right-align x-axis labels

No luck :(

After a short break, we stumbled upon something…

# arrange records in descending order of number of successful companies
industry_frequencies %>% arrange(desc(success))

##                          industry failed success
## 1                          energy      4       8
## 2                   entertainment      7       8
## 3                 market_research      1       7
## 4                      healthcare      5       7
## 5                         finance      2       6
## 6                        security      0       6
## 7                      publishing      2       6
## 8                       marketing     11      55
## 9               career_job_search      2       5
## 10                         search      6       5
## 11                    hospitality      0       4
## 12                         gaming      3       4
## 13                          email      3       4
## 14                human_resources      2       4
## 15                      cleantech      1       4
## 16                         mobile     17      37
## 17                     e-commerce     17      37
## 18                 food_beverages      5       3
## 19                      education      3       3
## 20                          deals      0       3
## 21                    real_estate      0       3
## 22             telecommunications      3       3
## 23            enterprise_software      3      27
## 24                    advertising     11      26
## 25                   crowdfunding      0       2
## 26                          music      6       2
## 27                          media      9      19
## 28                      analytics     18     180
## 29                         retail      0      15
## 30                cloud_computing      6      13
## 31           software_development      7      11
## 32 network_hosting_infrastructure      9      10
## 33              social_networking      8      10
## 34                pharmaceuticals      0       1
## 35                 transportation      1       1
## 36                      insurance      0       1
## 37                         travel      0       1
## 38                     government      0       1
## 39                   space_travel      1       0
## 40                    classifieds      2       0

Very suspicious behaviour indeed. A closer look…

# sort the numbers of successful companies
sort(industry_frequencies$success)

##  [1] 0   0   1   1   1   1   1   10  10  11  13  15  180 19  2   2   26 
## [18] 27  3   3   3   3   3   37  37  4   4   4   4   4   5   5   55  6  
## [35] 6   6   7   7   8   8  
## Levels: 0 1 10 11 13 15 180 19 2 26 27 3 37 4 5 55 6 7 8

Suspicion confirmed. The numbers are not being sorted as numbers. They are treated like characters. The funny thing is what happens next.

# sort the numbers of successful companies AFTER converting to numerics
sort(as.numeric(industry_frequencies$success))

##  [1]  1  1  2  2  2  2  2  3  3  4  5  6  7  8  9  9 10 11 12 12 12 12 12
## [24] 13 13 14 14 14 14 14 15 15 16 17 17 17 18 18 19 19

Some of the higher numbers (e.g. 55, 180) vanished, and we are left with consecutive numbers from 1 to 19, one for each of the 19 levels listed above. It did not take us much time to figure out what was wrong.

industry_frequencies$success %>% 
  as.character() %>%   # convert to 'character'...
  as.numeric() %>%     # ...then to 'numeric'...
  sort()               # ...then sort

##  [1]   0   0   1   1   1   1   1   2   2   3   3   3   3   3   4   4   4
## [18]   4   4   5   5   6   6   6   7   7   8   8  10  10  11  13  15  19
## [35]  26  27  37  37  55 180

Lesson learned the hard way: When you have a column that has number values and you want to coerce them into type numeric, you MUST make sure they are NOT of type factor. If they are of type factor, then they MUST be converted into type character first, and THEN into type numeric. NEVER directly convert from type factor to type numeric.
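
Here is the same pitfall in miniature, using a made-up factor f. A factor stores integer codes that point into its sorted levels, and as.numeric() returns those codes rather than the labels.

```r
# A factor whose labels look like numbers (made-up values)
f <- factor(c("55", "180", "7", "0"))

levels(f)                     # sorted as text: "0" "180" "55" "7"
as.numeric(f)                 # the integer codes: 3 2 4 1
as.numeric(as.character(f))   # the intended values: 55 180 7 0
as.numeric(levels(f))[f]      # same values, converting only the levels
```

The last form does the same job as going through as.character(), but converts each distinct level only once.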

To be specific, our grand mistake was at a previous step where the matrix, industry_frequencies, was converted into a data frame. The as.data.frame() function has this parameter called stringsAsFactors, and you typically want it set to FALSE whenever numerics are involved.
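
In hindsight, the conversion could have looked something like the sketch below. The matrix m is a made-up two-row stand-in for industry_frequencies. Note that since R 4.0.0 the default is already stringsAsFactors = FALSE, so the explicit argument mainly matters on older versions of R.

```r
# Small character matrix standing in for 'industry_frequencies' (made-up rows)
m <- cbind(industry = c("analytics", "retail"),
           failed   = c("18", "0"),
           success  = c("180", "15"))

# keep strings as strings instead of turning them into factors
df <- as.data.frame(m, stringsAsFactors = FALSE)

# convert the count columns from character to numeric
df$failed  <- as.numeric(df$failed)
df$success <- as.numeric(df$success)

str(df)
```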

And so we continue. Shall we try this visualization thing, again?

# convert all columns (which are currently of type 'factor' now) to type 'character'
industry_frequencies <- map_df(industry_frequencies,as.character)

# convert the numeric columns to type 'numeric'
industry_frequencies[,c("failed","success")] <- map_df(industry_frequencies[,c("failed","success")],as.numeric)

ggplot(industry_frequencies, aes(x = industry, y = failed)) + # X: industry, Y: #failed companies
  geom_bar(stat="identity") +                                 #
  theme(axis.text.x = element_text(angle = 90,                # rotate orientation of x-axis labels
                                   vjust = 0.3,               # top-align x-axis labels
                                   hjust = 1))                # right-align x-axis labels

Yaaaaaaay! It worked! Now, we’re back in business! Let’s try something fancy now.

industry_frequencies %>% 
  arrange(success) %>%     # arrange order of data by number of successful companies in each industry 
  ggplot(aes(x = industry, y = success)) +                    # X: industry, Y: #successful companies
  geom_bar(stat="identity") +                                 #
  theme(axis.text.x = element_text(angle = 90,                # rotate orientation of x-axis labels
                                   vjust = 0.3,               # top-align x-axis labels
                                   hjust = 1))                # right-align x-axis labels

HEY! What gives? Why won’t the x-axis labels get sorted by frequency? With some internet searching, we found this resolution.

industry_frequencies %>% 
  transform(industry=reorder(industry, -success) ) %>%        # order x-axis labels by '-success'
  ggplot(aes(x = industry, y = success)) +                    # X: industry, Y: #successful companies
  geom_bar(stat="identity") +                                 #
  theme(axis.text.x = element_text(angle = 90,                # rotate orientation of x-axis labels
                                   vjust = 0.3,               # top-align x-axis labels
                                   hjust = 1))                # right-align x-axis labels

That’s more like it. From the figure, one can quickly tell which industries have successful startups appearing in them and which do not.
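
The reason arrange() had no effect is that ggplot2 orders a discrete axis by factor levels (alphabetically for a character column), not by row order. reorder() rewrites the levels according to another variable. A base-R sketch with three made-up rows:

```r
# Three industries with their success counts (made-up subset)
industry <- c("music", "analytics", "retail")
success  <- c(2, 180, 15)

# default factor levels are alphabetical, whatever the row order is
levels(factor(industry))              # "analytics" "music" "retail"

# reorder() sorts the levels by ascending '-success', i.e. descending 'success'
levels(reorder(industry, -success))   # "analytics" "retail" "music"
```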

Let’s try something else now.

industry_frequencies %>% 
  gather("company_status", "n", 2:3) %>% 
  transform(industry=reorder(industry, -n) ) %>%              # order x-axis labels by '-n'
  ggplot(aes(x = industry, y = n, fill = company_status)) +   # X: industry, Y: n, fill: company_status
  geom_bar(stat="identity", show.legend = FALSE) +            #
  theme(axis.text.x = element_text(angle = 90,                # rotate orientation of x-axis labels
                                   vjust = 0.3,               # top-align x-axis labels
                                   hjust = 1))                # right-align x-axis labels

At last, a visualization that is pleasant to look at. From the figure, it seems that most startups actually succeed regardless of the industry. Why don’t we get some exact numbers? Below, we will create a new variable called success_percent which displays the percentage of successful companies per industry.

industry_frequencies %>% 
  mutate(success_percent = round(success / (success+failed) * 100, 1)) %>% 
  arrange(desc(success_percent), desc(success))

## # A tibble: 40 x 4
##    industry        failed success success_percent
##    <chr>            <dbl>   <dbl>           <dbl>
##  1 retail               0   15.0              100
##  2 security             0    6.00             100
##  3 hospitality          0    4.00             100
##  4 deals                0    3.00             100
##  5 real_estate          0    3.00             100
##  6 crowdfunding         0    2.00             100
##  7 pharmaceuticals      0    1.00             100
##  8 insurance            0    1.00             100
##  9 travel               0    1.00             100
## 10 government           0    1.00             100
## # ... with 30 more rows

Ahem… I want to see the ENTIRE thing, please?

industry_frequencies %>% 
  mutate(success_percent = round(success / (success+failed) * 100, 1)) %>% 
  arrange(desc(success_percent), desc(success)) %>% 
  as.data.frame()

##                          industry failed success success_percent
## 1                          retail      0      15           100.0
## 2                        security      0       6           100.0
## 3                     hospitality      0       4           100.0
## 4                           deals      0       3           100.0
## 5                     real_estate      0       3           100.0
## 6                    crowdfunding      0       2           100.0
## 7                 pharmaceuticals      0       1           100.0
## 8                       insurance      0       1           100.0
## 9                          travel      0       1           100.0
## 10                     government      0       1           100.0
## 11                      analytics     18     180            90.9
## 12            enterprise_software      3      27            90.0
## 13                market_research      1       7            87.5
## 14                      marketing     11      55            83.3
## 15                      cleantech      1       4            80.0
## 16                        finance      2       6            75.0
## 17                     publishing      2       6            75.0
## 18              career_job_search      2       5            71.4
## 19                    advertising     11      26            70.3
## 20                         mobile     17      37            68.5
## 21                     e-commerce     17      37            68.5
## 22                cloud_computing      6      13            68.4
## 23                          media      9      19            67.9
## 24                         energy      4       8            66.7
## 25                human_resources      2       4            66.7
## 26           software_development      7      11            61.1
## 27                     healthcare      5       7            58.3
## 28                         gaming      3       4            57.1
## 29                          email      3       4            57.1
## 30              social_networking      8      10            55.6
## 31                  entertainment      7       8            53.3
## 32 network_hosting_infrastructure      9      10            52.6
## 33                      education      3       3            50.0
## 34             telecommunications      3       3            50.0
## 35                 transportation      1       1            50.0
## 36                         search      6       5            45.5
## 37                 food_beverages      5       3            37.5
## 38                          music      6       2            25.0
## 39                   space_travel      1       0             0.0
## 40                    classifieds      2       0             0.0

All endeavors in retail, security and hospitality seem to have ended in success. Other industries have a flawless record as well, but their numbers of companies are too small to be significant.

For the industries with success ratios < 100%, analytics seems to be wildly successful with a ~91% success ratio, followed by enterprise_software, marketing and advertising, all with success ratios > 70%. Other industries also have success ratios > 70%, but (again) they contain too few companies for the percentages to be significant.

On the other side of the spectrum, there are industries that are wildly UNsuccessful, such as space_travel and classifieds. Although they contain too few companies to draw conclusions, we don’t need numbers to tell us that space_travel is a tough business to get into :). Other industries with more failures than successes are music, food_beverages and search.

Industries in the hit-or-miss region (~50% success ratio) include entertainment, network_hosting_infrastructure, education and telecommunications. Again, other industries within this hit-or-miss region do not have enough companies to constitute a trend and, as such, will be ignored.
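
One possible way to put numbers on "too few companies" is an exact binomial confidence interval for the success ratio. This is our own illustrative addition using base R's binom.test(); the analysis above does not compute it. The counts are taken from the table above.

```r
# 95% exact binomial confidence intervals for the success ratio
ci_retail    <- binom.test(x = 15,  n = 15)$conf.int    # 15 successes out of 15
ci_pharma    <- binom.test(x = 1,   n = 1)$conf.int     #  1 success  out of 1
ci_analytics <- binom.test(x = 180, n = 198)$conf.int   # 180 successes out of 198

# one row per industry: lower and upper bound of the interval
round(rbind(retail          = ci_retail,
            pharmaceuticals = ci_pharma,
            analytics       = ci_analytics), 2)
```

Both retail and pharmaceuticals show a 100% success ratio, but the interval for a single company is far wider than the one for fifteen, which is the sense in which the smaller industries are not conclusive.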

It would be great if we could discover some kind of global indicator that can help distinguish between the successful and failed startups. Let’s see if we can find such an indicator!

To be continued…

Short description of company profile

We have not yet decided what to do with this variable.

head(startups$short_description_of_company_profile, 10)

##  [1] "video distribution"                    
##  [2] NA                                      
##  [3] "event data analytics api"              
##  [4] "the most advanced analytics for mobile"
##  [5] "the location-based marketing platform" 
##  [6] "big data for foodservice"              
##  [7] NA                                      
##  [8] NA                                      
##  [9] "engagement engine"                     
## [10] "big data for clinical insight"

Focus functions of company

It is a multi-value variable that indicates the company’s focus.

head(startups$focus_functions_of_company, 10)

##  [1] "operation"         "marketing, sales"  "operations"       
##  [4] "marketing & sales" "marketing & sales" "analytics"        
##  [7] "research"          "computing"         "marketing"        
## [10] "research"

The following are some notes about this field:

* Values are separated by , or &, and sometimes both.
* The character case is not standardized.
* Some functions are missing.
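
A minimal sketch of how these issues might be handled in base R. The helper split_focus() is hypothetical (it is not part of the analysis above), and the sample values are taken from the output above plus one NA.

```r
# sample values from the output above, plus a missing value
focus <- c("operation", "marketing, sales", "Marketing & Sales", NA)

# hypothetical helper: lower-case, split on ',' or '&', trim whitespace;
# a missing value becomes an empty character vector
split_focus <- function(x) {
  parts <- strsplit(tolower(x), "[[:space:]]*[,&][[:space:]]*")
  lapply(parts, function(p) {
    if (all(is.na(p))) character(0) else trimws(p[nzchar(p)])
  })
}

focus_list <- split_focus(focus)
focus_list
```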

Investors

It is a multi-value variable. Values are separated by |, and some records have no value at all (NA).

head(startups$investors, 10)

##  [1] "kpcb holdings|draper fisher jurvetson (dfj)|kleiner perkins caufield & byers|at&t|blueprint ventures|cisco|zone ventures"                                                                                                                                                                                                                                                                                 
##  [2] NA                                                                                                                                                                                                                                                                                                                                                                                                         
##  [3] "techstars|streamlined ventures|amplify partners|rincon venture partners|pelion venture partners|500 startups|loren siebert|jason seats|xg ventures|george karidis|sam choi|morris wheeler|data collective|pejman nozad|ullas naik|dirk elmendorf|galvanize|pat matthews|paul kedrosky|matt ocko|cloud power capital|jared kopf|anne johnson|issac roth|george karutz|jim deters|zachary aarons|zack bogue"
##  [4] "michael birch|max levchin|sequoia capital|keith rabois|andreessen horowitz|marc benioff|david sacks|y combinator|voyager capital"                                                                                                                                                                                                                                                                         
##  [5] "dfj frontier|draper nexus ventures|gil elbaz|auren hoffman|walter kortschak|mi ventures|brand ventures|daher capital|double m partners|gold hill capital|clark landry|draper associates|mi ventures llc|signia venture partners"                                                                                                                                                                          
##  [6] "pritzker group venture capital|excelerate labs|hyde park venture partners|chicago ventures|amicus capital|ideo|olive ventures|kd capital l.l.c"                                                                                                                                                                                                                                                           
##  [7] "plug & play ventures|correlation ventures|crosslink capital|roham gharegozlou|start capital|naguib sawiris|pejman nozad|streamlined ventures"                                                                                                                                                                                                                                                             
##  [8] "norwest venture partners|bessemer venture partners|atlas venture"                                                                                                                                                                                                                                                                                                                                         
##  [9] "promus ventures|softtech vc|costanoa venture capital|lee linden|chamath palihapitiya|raj de datta|tim kendall|omar siddiqui|david vivero|sequoia capital"                                                                                                                                                                                                                                                 
## [10] "khosla ventures"
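
As a rough sketch of how this field could be unpacked, the values can be split on the literal | character. The vector investors below reuses two values from the output above plus one NA; the derived n_investors count is our own illustrative idea, not a step taken in the analysis.

```r
# two values from the output above, plus a missing value
investors <- c("khosla ventures",
               "norwest venture partners|bessemer venture partners|atlas venture",
               NA)

# fixed = TRUE treats "|" as a literal character, not as regex alternation
investor_list <- strsplit(investors, "|", fixed = TRUE)

# number of investors per company; keep NA where the field is missing
n_investors <- ifelse(is.na(investors), NA_integer_, lengths(investor_list))
n_investors   # 1 3 NA
```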

Startups Business Analytics

Mohammed Ali, Ali Ezzat

February 17, 2018
