Email

krnsy30@gmail.com

College

Indian Institute of Technology, Bombay (IIT-B)

Introduction.

The enthusiasm among Indian graduates, and among students still studying in engineering colleges and business schools, for building their own company from scratch has been at a peak in recent decades. As a country we have seen companies grow from small firms into huge MNCs such as Infosys and Reliance, yet our startup ecosystem has not been looked up to in the way that those of the United States and other Western countries have been. But with increasing investment in small startups and the Government of India (GOI) launching its StartupIndia campaign, the enthusiasm for startups among budding Indian entrepreneurs is here to stay. In this report we examine the extent of investment in these firms, how it has changed over the years in both amount and type, and how factors such as the industrial sector and the city in which the startup is based affect the investment amount.

About our Data.

The data comes from Kaggle.com, a site well known for the large number of datasets it makes publicly available on a wide range of topics. The dataset used for our study is available at:

https://www.kaggle.com/sudalairajkumar/indian-startup-funding/feed

Our dataset contains information on up to 2372 investment transactions. The column-wise description is as follows.

  1. Sr.No : Serial Number.
  2. Date : Year of Funding.
  3. StartupName : Name of the startup which got funded.
  4. IndustryVertical : Industry to which the startup belongs.
  5. SubVertical : Sub category of the industry type.
  6. CityLocation : City in which the startup is based.
  7. InvestorsName : Name of the investors involved in the funding round.
  8. AmountInUSD : Funding Amount in USD.
  9. Remarks : Other information if any.

An important point: many cells were empty, so we carefully sorted and cleaned the data to exclude the empty entries and analysed the remaining dataset. If any cell in a row is blank, the entire row (transaction) is ignored when performing the various tests and the regression analysis. We have bracketed the cities into Bangalore, Mumbai, New Delhi, Gurgaon, Pune, Hyderabad and Others. Similarly, we have bracketed the industry vertical sectors into Consumer Internet, Technology, ECommerce, Logistics, Education, Healthcare and Others. This helps our analysis, because analysing a large number of cities that each have very few startups would be very tedious.

Our action plan is to analyse the important variables: Year, City Location, Industry Vertical, Investment Type and Amount in USD. We analyse them by creating contingency tables, performing chi-square tests and correlation tests, and fitting a linear regression model.
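The cleaning and bracketing steps described above can be sketched roughly as follows. This is a minimal illustration on a made-up data frame, not the exact script used for this report: `raw.df`, `clean.df`, `major.cities` and `major.sectors` are hypothetical names, and the rows are invented.

```r
# Toy rows standing in for the Kaggle file (values are invented for illustration).
raw.df <- data.frame(
  CityLocation     = c("Bangalore", "Mumbai", "Jaipur", "", "Pune", "Chennai"),
  IndustryVertical = c("Technology", "Food", "Food", "Education", "", "Consumer Internet"),
  AmountInUSD      = c(1e6, 5e5, NA, 2e5, 3e5, 7e5),
  stringsAsFactors = FALSE
)

# Treat empty strings as missing values.
raw.df$CityLocation[raw.df$CityLocation == ""] <- NA
raw.df$IndustryVertical[raw.df$IndustryVertical == ""] <- NA

# Drop every row (transaction) that has at least one missing cell.
clean.df <- raw.df[complete.cases(raw.df), ]

# Keep the major categories and bracket everything else together.
major.cities  <- c("Bangalore", "Mumbai", "New Delhi", "Gurgaon", "Pune", "Hyderabad")
major.sectors <- c("Consumer Internet", "Technology", "ECommerce",
                   "Logistics", "Education", "Healthcare")

clean.df$CityLocation <- ifelse(clean.df$CityLocation %in% major.cities,
                                clean.df$CityLocation, "Others")
clean.df$IndustryVertical <- ifelse(clean.df$IndustryVertical %in% major.sectors,
                                    clean.df$IndustryVertical, "OtherSectors")

clean.df
```

Dropping whole rows keeps every test on the same set of transactions, at the cost of discarding deals whose amount or city was simply not reported.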

Linear Regression Model

Our dependent variable is the Amount of Investment in USD. The independent variables are Date, Industry Vertical, City Location and Investment Type. Our proposed linear regression model is therefore:

\[\text{Amt.Invested(USD)} = \alpha_0 + \alpha_1\,\text{Date} + \alpha_2\,\text{Ind.Vertical} + \alpha_3\,\text{City} + \alpha_4\,\text{Inv.type} + \epsilon\]

The result of the linear regression model:

                (Intercept)               |      Date 
                           -2.319851e+10  |              1.151216e+07 
               IndustryVerticalECommerce  | IndustryVerticalEducation 
                            1.810677e+07  |             -9.614772e+05 
              IndustryVerticalHealthcare  | IndustryVerticalLogistics 
                           -2.881645e+06  |             -1.154833e+06 
            IndustryVerticalOtherSectors  |IndustryVerticalTechnology 
                           -1.278081e+06  |             -4.390373e+06 
                     CityLocationGurgaon  |     CityLocationHyderabad 
                           -1.000115e+07  |             -1.148655e+07 
                      CityLocationMumbai  |     CityLocationNew Delhi 
                           -1.304183e+07  |             -9.596368e+06 
                      CityLocationOthers  |          CityLocationPune 
                           -1.371399e+07  |             -9.117334e+06 
            InvestmentTypePrivate Equity  |InvestmentTypeSeed Funding 
                            1.067842e+07  |             -5.383905e+06 
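Because R encodes each categorical predictor as dummy variables, the levels that do not appear in the table (Consumer Internet, Bangalore and Debt Funding) act as the baseline, and every listed coefficient is an offset from that baseline. As a rough worked example, the sketch below plugs the coefficients above into the model equation for one hypothetical deal: an ECommerce startup in Bangalore receiving Private Equity in 2017. The variable names are illustrative only.

```r
# Coefficients copied from the regression output above.
intercept    <- -2.319851e10
b.date       <-  1.151216e07
b.ecommerce  <-  1.810677e07   # offset relative to the Consumer Internet baseline
b.bangalore  <-  0             # Bangalore is the baseline city, so no offset
b.private.eq <-  1.067842e07   # offset relative to the Debt Funding baseline

# Fitted amount for the hypothetical deal described in the lead-in.
predicted.usd <- intercept + b.date * 2017 + b.ecommerce + b.bangalore + b.private.eq
predicted.usd
```

With these rounded coefficients the fitted amount comes to roughly 50.3 million USD. This illustrates how to read the coefficients and should not be taken as a reliable forecast.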

Key Findings

A total of about 18.3 billion US dollars of investment was made in Indian startups from 2015 to August 2017.

  1. Between 2016 and 2017, the number of funded education-based startups decreased sharply (in percentage terms compared with the previous year).
  2. Relative to Seed Funding, Private Equity based funding of startups is more common in 2017 than in 2016 (percentage-wise comparison).
  3. Bangalore gathers the most technology-based startup funding. Consumer Internet startups outnumber those of other sectors in Mumbai and in Delhi & NCR.
  4. ECommerce startups appear to receive much more Private Equity based investment than Seed Funding. Consumer Internet startups appear to receive more Seed Funding. Technology, too, receives more Private Equity based investment.
  5. Investment in 2017 up to August exceeded investment in the entire year 2016.
  6. Bangalore is clearly the winner when it comes to the amount of investment. Mumbai comes a distant second.
  7. The amount of Private Equity based investment is much larger than the amount of Seed Funding.

Through our chi-square tests, we found a relationship between the following pairs of variables (condition: p-value < 0.005):

  1. Year and Type of Investment (p-value of 0.000141)
  2. Industrial Sector (Industry Vertical) and Type of Investment (p-value of 0.004196)
  3. Investment Type and Amount of Investment (p-value < 2.2e-16)

Conclusion

More Private Equity based funding than Seed Funding is observed as the years pass, which suggests that more established firms, which no longer need Seed Funding, now exist in the marketplace. (Seed funding is primarily raised during the early stage of a venture, such as the idea or prototyping stage, and involves much less capital than private equity funding.) Technology-based startups also gather much more investment per startup firm, which is consistent with Bangalore having a much higher amount of investment than other cities.

Appendix

Descriptive Analysis of the dataset performed.

Contingency Tables

table(Data1.df$Date)
## 
## 2015 2016 2017 
##  936  993  443

This gives the total number of funding rounds that occurred in 2015, 2016 and 2017 (until August).

table(CleanData3.df$IndustryVertical)
## 
## Consumer Internet         ECommerce         Education        Healthcare 
##               459               147                15                15 
##         Logistics      OtherSectors        Technology 
##                16                28               189

The number of funded deals in each major industrial sector.

table(CleanData3.df$CityLocation)
## 
## Bangalore   Gurgaon Hyderabad    Mumbai New Delhi    Others      Pune 
##       267        97        35       181       130       122        37

The locations of the funded startups.

table(CleanData3.df$InvestmentType)
## 
##   Debt Funding Private Equity   Seed Funding 
##              1            476            392

The type of investment made by the investors. Now moving on to two-way contingency tables:

1. Year vs Industry Vertical

mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$IndustryVertical)
mytable
##                   CleanData3.df$IndustryVertical
## CleanData3.df$Date Consumer Internet ECommerce Education Healthcare
##               2016               303       104        14         10
##               2017               156        43         1          5
##                   CleanData3.df$IndustryVertical
## CleanData3.df$Date Logistics OtherSectors Technology
##               2016        10           23        120
##               2017         6            5         69

There has been a marked dip in funding for education-based startups compared with other sectors (14 deals in 2016 versus 1 in 2017).

2. Year vs City Location

mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$CityLocation)
mytable
##                   CleanData3.df$CityLocation
## CleanData3.df$Date Bangalore Gurgaon Hyderabad Mumbai New Delhi Others
##               2016       165      65        21    117       100     90
##               2017       102      32        14     64        30     32
##                   CleanData3.df$CityLocation
## CleanData3.df$Date Pune
##               2016   26
##               2017   11

The number of New Delhi based startups acquiring funds shows the largest difference (100 in 2016 versus 30 in 2017).

3. Year vs Investment Type

mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$InvestmentType)
mytable
##                   CleanData3.df$InvestmentType
## CleanData3.df$Date Debt Funding Private Equity Seed Funding
##               2016            0            293          291
##               2017            1            183          101

Relative to Seed Funding, Private Equity based investment is much more common in 2017 than in 2016. Debt Funding is far less common than the other two investment types.
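The percentage-wise comparison can be made explicit by converting the counts to row shares. The sketch below rebuilds the Year vs Investment Type table from the counts printed above (the object names `year.type` and `row.share` are illustrative) and uses `prop.table` with `margin = 1` so that each year sums to 100%.

```r
# Counts copied from the Year vs Investment Type table above.
year.type <- matrix(
  c(0, 293, 291,
    1, 183, 101),
  nrow = 2, byrow = TRUE,
  dimnames = list(Year = c("2016", "2017"),
                  InvestmentType = c("Debt Funding", "Private Equity", "Seed Funding"))
)

# Share of each investment type within a year, in percent.
row.share <- round(100 * prop.table(year.type, margin = 1), 1)
row.share
```

On these counts, Private Equity accounts for about 50.2% of deals in 2016 and about 64.2% in 2017.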

4. Industry Vertical vs City Location

mytable <- xtabs(~ CleanData3.df$IndustryVertical + CleanData3.df$CityLocation)
mytable
##                               CleanData3.df$CityLocation
## CleanData3.df$IndustryVertical Bangalore Gurgaon Hyderabad Mumbai
##              Consumer Internet       143      60        17    101
##              ECommerce                44      19         3     29
##              Education                 4       1         0      4
##              Healthcare                4       1         1      3
##              Logistics                 3       3         1      6
##              OtherSectors              4       2         2      7
##              Technology               65      11        11     31
##                               CleanData3.df$CityLocation
## CleanData3.df$IndustryVertical New Delhi Others Pune
##              Consumer Internet        71     55   12
##              ECommerce                28     21    3
##              Education                 2      4    0
##              Healthcare                3      2    1
##              Logistics                 2      1    0
##              OtherSectors              7      5    1
##              Technology               17     34   20

Clearly, Bangalore lives up to its reputation as the tech hub of India by gathering the most technology-based startup funding. Consumer Internet startups outnumber those of other sectors in Mumbai and in Delhi & NCR.

5. Industry Vertical vs Investment type

mytable <- xtabs(~ CleanData3.df$IndustryVertical + CleanData3.df$InvestmentType)
mytable
##                               CleanData3.df$InvestmentType
## CleanData3.df$IndustryVertical Debt Funding Private Equity Seed Funding
##              Consumer Internet            1            220          238
##              ECommerce                    0            101           46
##              Education                    0             10            5
##              Healthcare                   0             11            4
##              Logistics                    0             13            3
##              OtherSectors                 0             14           14
##              Technology                   0            107           82

ECommerce startups appear to receive much more Private Equity based investment than Seed Funding. Consumer Internet startups appear to receive more Seed Funding. Technology, too, receives more Private Equity based investment.

6. City Location vs Investment type

mytable <- xtabs(~ CleanData3.df$CityLocation + CleanData3.df$InvestmentType)
mytable
##                           CleanData3.df$InvestmentType
## CleanData3.df$CityLocation Debt Funding Private Equity Seed Funding
##                  Bangalore            0            157          110
##                  Gurgaon              0             59           38
##                  Hyderabad            0             15           20
##                  Mumbai               0            110           71
##                  New Delhi            0             55           75
##                  Others               1             61           60
##                  Pune                 0             19           18

Mumbai and Bangalore based startups receive predominantly Private Equity based investment. For Delhi, it is Seed Funding that is more dominant.

Plot of Amount of Investment

sum(cleanData.df$AmountInUSD)
## [1] 18347386476

So, based on the available data, about 18.35 billion USD has been invested in major startups in India. Looking at the distribution:

boxplot(cleanData.df$AmountInUSD, horizontal = TRUE, xlab = "Amount in USD", main = "Startup Investment plot")

Thus, the distribution of investment is heavily skewed. Some well-known startups have received very large investments from big firms such as Alibaba and SoftBank.

boxplot(CleanData3.df$AmountInUSD ~ CleanData3.df$Date, horizontal = TRUE, xlab= "Amount of Investment in USD", ylab = "Year", main = "Year Wise Investment Analysis")

Clearly, 2017 has more outliers than 2016.

sum(cleanData.df$AmountInUSD[cleanData.df$Date == '2017'])
## [1] 5846275500
sum(cleanData.df$AmountInUSD[cleanData.df$Date == '2016'])
## [1] 3828088608

Thus investment in 2017 up to August exceeded investment in the entire year 2016.
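To put the two yearly totals in proportion, the short sketch below reuses the sums printed above and computes the 2017-to-2016 ratio and each year's share of the overall total. The variable names are illustrative.

```r
# Totals copied from the sum() outputs above (USD).
total.2016 <- 3828088608
total.2017 <- 5846275500
total.all  <- 18347386476

# How many times larger the 2017 (January to August) total is than the 2016 total.
ratio <- total.2017 / total.2016
round(ratio, 2)

# Each year's share of the overall total, in percent.
share <- c("2016" = total.2016, "2017" = total.2017) / total.all
round(100 * share, 1)
```

The ratio comes to roughly 1.53, so the eight months of 2017 drew about one and a half times the full-year 2016 amount.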

boxplot(CleanData3.df$AmountInUSD ~ CleanData3.df$IndustryVertical,horizontal = FALSE, ylab= "Amount of Investment in USD", xlab = "Industry Sector", main = "Industrial Sector Wise Investment Analysis", boxwex = 0.6, names = c("Con Int", "Ecomm", "Edu", "HealthCare", "Logist.", "Others", "Tech"))

mytable <- xtabs(CleanData3.df$AmountInUSD ~ CleanData3.df$IndustryVertical)
plot(mytable, ylab = "Amount of Investment", xlab = "Industry Sector")

Thus ECommerce and Consumer Internet seem to rule the roost when it comes to massive funding, although these two sectors also have many more startups.

boxplot(CleanData3.df$AmountInUSD ~ CleanData3.df$CityLocation, horizontal = FALSE, ylab = "Amount of Investment in USD", xlab = "City", main = "City Wise Investment Analysis", boxwex = 0.6, names = c("Bang", "Gurgaon", "Hyd", "Mum", "N. Delhi", "Others", "Pune"))

mytable <- xtabs(CleanData3.df$AmountInUSD ~ CleanData3.df$CityLocation)
plot(mytable, ylab = "Amount of Investment", xlab = "City")

Here Bangalore takes a very significant lead in the amount of investment, and Mumbai comes a distant second. Bangalore does have more startups, but the outliers and a decent median value in the boxplot also contribute to its higher total.

boxplot(CleanData3.df$AmountInUSD ~ CleanData3.df$InvestmentType,horizontal = FALSE, ylab= "Amount of Investment in USD", xlab = "Investment Type", main = "Investment type vs Investment Analysis")

mytable <- xtabs(CleanData3.df$AmountInUSD ~ CleanData3.df$InvestmentType)
plot(mytable, ylab = "Amount of Investment", xlab = "Type of Investment")

We have a clear winner here in Private Equity. This is partly because much Private Equity based investment goes to startups that are already well established, whereas Seed Funding mainly occurs at the incubation stage.

Now we create a dataset restricted to our five variables for the various correlation tests and t-tests.

Correlation of numeric values.

cor(Testdata.df$Date, Testdata.df$AmountInUSD)
## [1] 0.09190294

A positive but weak correlation.
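How weak the correlation is can be gauged from its square, which is the share of variance in the amount that the year explains on its own. The sketch below also computes the usual t-statistic for a correlation coefficient. The sample size `n = 869` is an assumption inferred from the regression output in this appendix (853 residual degrees of freedom plus 16 estimated coefficients); the variable names are illustrative.

```r
r <- 0.09190294   # correlation reported above
n <- 869          # assumed number of rows in Testdata.df (inferred, see lead-in)

# Proportion of variance in AmountInUSD explained by Date alone.
r.squared <- r^2
r.squared

# t-statistic and two-sided p-value for H0: true correlation is zero.
t.stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p.value <- 2 * pt(-abs(t.stat), df = n - 2)
c(t = t.stat, p = p.value)
```

The squared correlation is below 0.01, so the year by itself accounts for less than 1% of the variation in the amount invested.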

Corrgram Plot

library(corrgram)
corrgram(Testdata.df, order=TRUE, lower.panel=panel.shade,
        upper.panel=panel.pie, text.panel=panel.txt,
        main="Corrgram of correlations between Startup funding variables")

Performing various tests

  1. Year vs Industry Vertical
mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$IndustryVertical)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 9.959, df = 6, p-value = 0.1264

The p-value (0.1264) is not low enough to reject the null hypothesis.

  2. Year vs City Location
mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$CityLocation)
chisq.test(mytable)
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 13.022, df = 6, p-value = 0.04269

A lower p-value (0.04269), but it is still above our 0.005 threshold, so we cannot reject the null hypothesis.

  3. Year vs Investment Type
mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$InvestmentType)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 17.733, df = 2, p-value = 0.000141

The p-value is below 0.005, which indicates a relationship between the year and the type of investment that predominantly occurred in that year.
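The test above can be reproduced directly from the published counts, and the same sketch shows where the approximation warning comes from: the Debt Funding column holds a single deal, so its expected counts are far below the usual rule of thumb of 5. As a hypothetical robustness check (not part of the original analysis), the sketch then drops that column and retests. The object names are illustrative.

```r
# Counts copied from the Year vs Investment Type contingency table.
year.type <- matrix(
  c(0, 293, 291,
    1, 183, 101),
  nrow = 2, byrow = TRUE,
  dimnames = list(Year = c("2016", "2017"),
                  InvestmentType = c("Debt Funding", "Private Equity", "Seed Funding"))
)

# suppressWarnings() hides the small-expected-count warning for this demo only.
full.test <- suppressWarnings(chisq.test(year.type))
full.test$statistic   # should match the X-squared reported above
full.test$expected    # the Debt Funding column has expected counts below 1

# Robustness check: remove the single Debt Funding deal and retest.
# For a 2x2 table chisq.test() applies Yates' continuity correction by default.
reduced.test <- chisq.test(year.type[, c("Private Equity", "Seed Funding")])
reduced.test$p.value
```

On these counts the reduced test should still fall below the 0.005 threshold, which suggests the association does not hinge on the one Debt Funding deal.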

  4. Industry Vertical vs City Location
mytable <- xtabs(~ CleanData3.df$IndustryVertical + CleanData3.df$CityLocation)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 60.908, df = 36, p-value = 0.005871

On the brink (p-value of 0.005871), but we still cannot safely reject the null hypothesis.

  5. Industry Vertical vs Investment type
mytable <- xtabs(~ CleanData3.df$IndustryVertical + CleanData3.df$InvestmentType)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 28.816, df = 12, p-value = 0.004196

The p-value is below 0.005, signifying that the industrial sector of a startup and the type of investment it receives are not independent: there is a relationship between them.

  6. City Location vs Investment type
mytable <- xtabs(~ CleanData3.df$CityLocation + CleanData3.df$InvestmentType)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 23.222, df = 12, p-value = 0.0259

Unable to reject the null hypothesis (the p-value of 0.0259 is above 0.005).

Comparing Amount of Investment with the other 4 factors.

  1. Industrial Sector
mytable <- xtabs(~ CleanData3.df$IndustryVertical + CleanData3.df$AmountInUSD)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 1495.2, df = 1416, p-value = 0.07019

The p-value (0.07019) is above 0.005, so we are unable to reject the null hypothesis.

  2. City in which Startup is based

mytable <- xtabs(~ CleanData3.df$AmountInUSD + CleanData3.df$CityLocation)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 1520.2, df = 1416, p-value = 0.0271

The p-value (0.0271) is above 0.005, so we are unable to reject the null hypothesis.

  3. Investment type

mytable <- xtabs(~ CleanData3.df$AmountInUSD + CleanData3.df$InvestmentType)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 1536.9, df = 472, p-value < 2.2e-16

A very low p-value, signifying that Investment Type and Amount of Investment are not independent: there is a relationship between them.

  4. Year of investment

mytable <- xtabs(~ CleanData3.df$AmountInUSD + CleanData3.df$Date)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 288.6, df = 236, p-value = 0.01097

A low p-value (0.01097), but not below the 0.005 threshold needed to reject the null hypothesis.
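Because AmountInUSD is numeric, `xtabs` treats every distinct amount as its own category in the four tests above. The degrees of freedom (1416, 1416, 472 and 236) are all multiples of 236, which implies 237 distinct amounts, most of them with very small cell counts, hence the approximation warnings. One alternative, which was not used in this report, is a rank-based Kruskal-Wallis test that compares the distribution of amounts across groups without binning. The sketch below runs it on simulated data only: `toy.df` and its log-normal parameters are invented for illustration and do not reproduce the real dataset.

```r
set.seed(1)  # make the simulated example reproducible

# Simulated, heavily skewed amounts for two investment types (invented values).
toy.df <- data.frame(
  InvestmentType = factor(rep(c("Private Equity", "Seed Funding"), each = 50)),
  AmountInUSD    = c(rlnorm(50, meanlog = 16, sdlog = 1.5),
                     rlnorm(50, meanlog = 13, sdlog = 1.0))
)

# Rank-based test of whether the amount distributions differ between groups.
kw <- kruskal.test(AmountInUSD ~ InvestmentType, data = toy.df)
kw$p.value
```

Applied to the real data, the same call with `CleanData3.df` in place of `toy.df` would test each factor against the amount while staying robust to the extreme outliers seen in the boxplots.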

Linear Regression

linfit <- lm(AmountInUSD ~ ., data = Testdata.df)
summary(linfit)
## 
## Call:
## lm(formula = AmountInUSD ~ ., data = Testdata.df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
##  -49799181  -15630087   -4457928    5583221 1349700819 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                  -2.320e+10  1.051e+10  -2.207  0.02755 * 
## Date                          1.151e+07  5.210e+06   2.210  0.02740 * 
## IndustryVerticalECommerce     1.811e+07  6.782e+06   2.670  0.00773 **
## IndustryVerticalEducation    -9.615e+05  1.862e+07  -0.052  0.95883   
## IndustryVerticalHealthcare   -2.882e+06  1.856e+07  -0.155  0.87667   
## IndustryVerticalLogistics    -1.155e+06  1.804e+07  -0.064  0.94896   
## IndustryVerticalOtherSectors -1.278e+06  1.380e+07  -0.093  0.92621   
## IndustryVerticalTechnology   -4.390e+06  6.249e+06  -0.703  0.48250   
## CityLocationGurgaon          -1.000e+07  8.404e+06  -1.190  0.23437   
## CityLocationHyderabad        -1.149e+07  1.273e+07  -0.902  0.36723   
## CityLocationMumbai           -1.304e+07  6.819e+06  -1.913  0.05614 . 
## CityLocationNew Delhi        -9.596e+06  7.659e+06  -1.253  0.21056   
## CityLocationOthers           -1.371e+07  7.781e+06  -1.762  0.07836 . 
## CityLocationPune             -9.117e+06  1.252e+07  -0.728  0.46652   
## InvestmentTypePrivate Equity  1.068e+07  7.097e+07   0.150  0.88043   
## InvestmentTypeSeed Funding   -5.384e+06  7.097e+07  -0.076  0.93955   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 70480000 on 853 degrees of freedom
## Multiple R-squared:  0.04088,    Adjusted R-squared:  0.02401 
## F-statistic: 2.424 on 15 and 853 DF,  p-value: 0.001844