Indian Institute of Technology, Bombay (IIT-B)
In recent decades, interest among Indian graduates and students in engineering colleges and business schools in building their own companies from scratch has reached a peak. Although India has seen companies grow from small firms into large multinationals such as Infosys and Reliance, its startup ecosystem has not been regarded as highly as those of the United States and other Western countries. However, with increasing investment in small startups and the Government of India (GOI) launching its StartupIndia campaign, the enthusiasm for startups among budding Indian entrepreneurs is here to stay. In this report, we examine the extent of investment in these firms, how it has changed over the years in both amount and type, and how factors such as the industry sector and the city in which a startup is based affect the investment amount.
The data come from Kaggle.com, a site well known for the large number of datasets on various topics that it makes available to the public. The dataset used for our study is available at:
https://www.kaggle.com/sudalairajkumar/indian-startup-funding/feed
Our dataset contains information on 2372 investment transactions. The column-wise description is as follows.
An important point: many cells were empty, so the data were carefully sorted and cleaned to exclude the empty entries before analysing the remaining dataset. If any cell in a row is blank, the entire row (transaction) is ignored in the tests and the regression analysis. We have bracketed the cities into Bangalore, Mumbai, New Delhi, Gurgaon, Pune, Hyderabad and Others. Similarly, we have bracketed the industry vertical sectors into Consumer Internet, Technology, ECommerce, Logistics, Education, Healthcare and Others. This grouping helps our analysis, because analysing a large number of cities that each have very few startups would be very tedious.
Our action plan is to analyse the important variables Year, City Location, Industry Vertical, Investment Type and Amount in USD. We analyse them by creating contingency tables, performing chi-square tests and correlation tests, and fitting a linear regression model.
Our dependent variable is the Amount of Investment in USD. The independent variables are Date, Industry Vertical, City Location and Investment Type. Thus our proposed linear regression model is, mathematically,
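The cleaning and bracketing steps described above could be sketched in R roughly as follows. This is a minimal illustration on an invented toy data frame: `raw.df`, `clean.df` and the `major` vector are hypothetical names, not taken from the original analysis, and only the column names `CityLocation` and `AmountInUSD` follow the dataset.

```r
# Toy data with invented values; blanks are represented as NA here.
raw.df <- data.frame(
  CityLocation = c("Bangalore", "Mumbai", "Jaipur", NA, "Pune", "Kochi"),
  AmountInUSD  = c(1e6, 5e5, NA, 2e6, 3e5, 1e5),
  stringsAsFactors = FALSE
)

# Drop every row (transaction) that has at least one missing cell.
clean.df <- raw.df[complete.cases(raw.df), ]

# Keep the major cities as they are and bracket everything else into "Others".
major <- c("Bangalore", "Mumbai", "New Delhi", "Gurgaon", "Pune", "Hyderabad")
clean.df$CityLocation <- ifelse(clean.df$CityLocation %in% major,
                                clean.df$CityLocation, "Others")
clean.df$CityLocation <- factor(clean.df$CityLocation)

table(clean.df$CityLocation)
```

If blank cells arrive as empty strings rather than `NA` when the CSV is read, passing `na.strings = ""` to `read.csv` is one way to make `complete.cases` catch them.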
\[\text{Amt. Invested (USD)} = \alpha_0 + \alpha_1\,\text{Date} + \alpha_2\,\text{Ind. Vertical} + \alpha_3\,\text{City} + \alpha_4\,\text{Inv. type} + \epsilon\]
The result of the linear regression model:
Term | Estimate
(Intercept) | -2.319851e+10
Date | 1.151216e+07
IndustryVerticalECommerce | 1.810677e+07
IndustryVerticalEducation | -9.614772e+05
IndustryVerticalHealthcare | -2.881645e+06
IndustryVerticalLogistics | -1.154833e+06
IndustryVerticalOtherSectors | -1.278081e+06
IndustryVerticalTechnology | -4.390373e+06
CityLocationGurgaon | -1.000115e+07
CityLocationHyderabad | -1.148655e+07
CityLocationMumbai | -1.304183e+07
CityLocationNew Delhi | -9.596368e+06
CityLocationOthers | -1.371399e+07
CityLocationPune | -9.117334e+06
InvestmentTypePrivate Equity | 1.067842e+07
InvestmentTypeSeed Funding | -5.383905e+06
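As a worked reading of these coefficients: R's `lm` encodes each categorical predictor with dummy variables, and the level that does not appear among the estimates (here Consumer Internet, Bangalore and Debt Funding, the alphabetically first levels) serves as the baseline. The sketch below plugs the reported estimates into the model equation for one hypothetical case, an ECommerce startup in Bangalore receiving Private Equity in 2017. The choice of case and the variable names are illustrative assumptions, and `Date` is assumed to be the numeric year.

```r
# Reported estimates that apply to the hypothetical case.
# Bangalore is the baseline city, so it contributes no extra term.
coefs <- c(
  intercept      = -2.319851e+10,
  date           =  1.151216e+07,
  ecommerce      =  1.810677e+07,  # IndustryVerticalECommerce
  private_equity =  1.067842e+07   # InvestmentTypePrivate Equity
)

year <- 2017  # assumed: Date enters the model as a numeric year

# Fitted amount = intercept + slope * year + dummy effects that are switched on.
predicted_usd <- unname(
  coefs["intercept"] + coefs["date"] * year +
    coefs["ecommerce"] + coefs["private_equity"]
)

predicted_usd  # roughly 50 million USD
```

Because the model explains only about 4% of the variance (see the R-squared in the regression summary), a point prediction like this is very imprecise and is best read as an illustration of how the terms combine.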
In total, about 18.3 billion US dollars of investment took place in Indian startups from 2015 to August 2017.
Through our chi-square tests, we found a relationship between the following pairs of variables (condition: p-value < 0.005):
1. Year and Type of Investment (p-value of 0.000141)
2. Industrial Sector (Industry Vertical) and Type of Investment (p-value of 0.004196)
3. Investment Type and Amount of Investment (p-value < 2.2e-16)
Private Equity funding is increasingly observed relative to Seed funding, which suggests that, as the years pass, more established firms that no longer need seed funding exist in the marketplace. (Seed funding is primarily raised during the early stage of a venture, such as the idea or prototyping stage, and is much smaller in capital amount than private equity funding.) Technology-based startups also appear to gather much more investment per startup, which is consistent with Bangalore having a considerably higher amount of investment than other cities.
table(Data1.df$Date)
##
## 2015 2016 2017
## 936 993 443
This gives the total number of funding transactions that occurred in 2015, 2016 and 2017 (until August).
table(CleanData3.df$IndustryVertical)
##
## Consumer Internet ECommerce Education Healthcare
## 459 147 15 15
## Logistics OtherSectors Technology
## 16 28 189
The number of funded startups in each major industry sector.
table(CleanData3.df$CityLocation)
##
## Bangalore Gurgaon Hyderabad Mumbai New Delhi Others Pune
## 267 97 35 181 130 122 37
The locations of the funded startups.
table(CleanData3.df$InvestmentType)
##
## Debt Funding Private Equity Seed Funding
## 1 476 392
The type of investment made by the investor. Now moving on to two-way contingency tables:
mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$IndustryVertical)
mytable
## CleanData3.df$IndustryVertical
## CleanData3.df$Date Consumer Internet ECommerce Education Healthcare
## 2016 303 104 14 10
## 2017 156 43 1 5
## CleanData3.df$IndustryVertical
## CleanData3.df$Date Logistics OtherSectors Technology
## 2016 10 23 120
## 2017 6 5 69
The number of funded Education-based startups dropped sharply (from 14 in 2016 to 1 in 2017) compared with other sectors.
mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$CityLocation)
mytable
## CleanData3.df$CityLocation
## CleanData3.df$Date Bangalore Gurgaon Hyderabad Mumbai New Delhi Others
## 2016 165 65 21 117 100 90
## 2017 102 32 14 64 30 32
## CleanData3.df$CityLocation
## CleanData3.df$Date Pune
## 2016 26
## 2017 11
The number of New Delhi-based startups acquiring funds shows the largest drop (from 100 in 2016 to 30 in 2017).
mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$InvestmentType)
mytable
## CleanData3.df$InvestmentType
## CleanData3.df$Date Debt Funding Private Equity Seed Funding
## 2016 0 293 291
## 2017 1 183 101
Relative to Seed Funding, Private Equity investment is much more common in 2017 (183 vs 101 deals) than in 2016 (293 vs 291 deals). Debt Funding is far less common than the other two investment types.
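Because the two years contain different numbers of deals, the shift is easier to see as within-year shares than as raw counts. A small sketch, re-entering the counts from the table above by hand (the object names `counts` and `shares` are illustrative):

```r
# Year x investment-type counts copied from the contingency table above.
counts <- matrix(
  c(0, 293, 291,
    1, 183, 101),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Year = c("2016", "2017"),
    InvestmentType = c("Debt Funding", "Private Equity", "Seed Funding")
  )
)

# margin = 1 divides each row by its row total, giving within-year shares.
shares <- prop.table(counts, margin = 1)

round(100 * shares, 1)  # percentages per year
```

With the original data frame, the same shares could be obtained by applying `prop.table(mytable, margin = 1)` to the `xtabs` result.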
mytable <- xtabs(~ CleanData3.df$IndustryVertical + CleanData3.df$CityLocation)
mytable
## CleanData3.df$CityLocation
## CleanData3.df$IndustryVertical Bangalore Gurgaon Hyderabad Mumbai
## Consumer Internet 143 60 17 101
## ECommerce 44 19 3 29
## Education 4 1 0 4
## Healthcare 4 1 1 3
## Logistics 3 3 1 6
## OtherSectors 4 2 2 7
## Technology 65 11 11 31
## CleanData3.df$CityLocation
## CleanData3.df$IndustryVertical New Delhi Others Pune
## Consumer Internet 71 55 12
## ECommerce 28 21 3
## Education 2 4 0
## Healthcare 3 2 1
## Logistics 2 1 0
## OtherSectors 7 5 1
## Technology 17 34 20
Clearly, Bangalore lives up to its reputation as the tech hub of India by attracting the largest number of technology-based startup fundings. Consumer Internet startups are more dominant than other startup types in Mumbai and Delhi & NCR.
mytable <- xtabs(~ CleanData3.df$IndustryVertical + CleanData3.df$InvestmentType)
mytable
## CleanData3.df$InvestmentType
## CleanData3.df$IndustryVertical Debt Funding Private Equity Seed Funding
## Consumer Internet 1 220 238
## ECommerce 0 101 46
## Education 0 10 5
## Healthcare 0 11 4
## Logistics 0 13 3
## OtherSectors 0 14 14
## Technology 0 107 82
ECommerce startups appear to receive much more Private Equity investment than Seed Funding. Consumer Internet startups appear to receive slightly more Seed Funding. Technology startups also receive more Private Equity investment.
mytable <- xtabs(~ CleanData3.df$CityLocation + CleanData3.df$InvestmentType)
mytable
## CleanData3.df$InvestmentType
## CleanData3.df$CityLocation Debt Funding Private Equity Seed Funding
## Bangalore 0 157 110
## Gurgaon 0 59 38
## Hyderabad 0 15 20
## Mumbai 0 110 71
## New Delhi 0 55 75
## Others 1 61 60
## Pune 0 19 18
Startups based in Mumbai and Bangalore receive predominantly Private Equity investment. For New Delhi, it is Seed Funding that dominates.
sum(cleanData.df$AmountInUSD)
## [1] 18347386476
So, considering the available data, about 18.35 billion USD has been invested in major startups in India. Looking at the distribution:
boxplot(cleanData.df$AmountInUSD, horizontal = TRUE, xlab = "Amount in USD", main = "Startup Investment plot")
Thus, the distribution of investment amounts is highly skewed. Some famous startups have received millions of dollars of investment from big firms such as Alibaba and SoftBank.
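One quick numerical check of this kind of skew is to compare the mean with the median and to look at the amounts on a log scale. The sketch below uses a handful of invented deal sizes (the vector `amounts` is hypothetical and is not the dataset) in which a single very large deal dominates the total.

```r
# Invented deal sizes in USD: many small deals plus one very large one.
amounts <- c(5e4, 1e5, 2e5, 5e5, 1e6, 2e6, 5e6, 1.4e9)

mean_amount   <- mean(amounts)    # pulled upwards by the single huge deal
median_amount <- median(amounts)  # robust to the outlier

c(mean = mean_amount, median = median_amount)

# On a log10 scale the deals are spread far more evenly.
summary(log10(amounts))
```

A log-scale plot, for example `boxplot(log10(amounts), horizontal = TRUE)`, would spread out the bulk of the deals that a raw-scale boxplot compresses near zero.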
boxplot(CleanData3.df$AmountInUSD ~ CleanData3.df$Date, horizontal = TRUE, xlab= "Amount of Investment in USD", ylab = "Year", main = "Year Wise Investment Analysis")
Clearly, 2017 has more outliers than 2016.
sum(cleanData.df$AmountInUSD[cleanData.df$Date == '2017'])
## [1] 5846275500
sum(cleanData.df$AmountInUSD[cleanData.df$Date == '2016'])
## [1] 3828088608
Thus, investment in 2017 up to August has been more than in the entire year of 2016.
boxplot(CleanData3.df$AmountInUSD ~ CleanData3.df$IndustryVertical,horizontal = FALSE, ylab= "Amount of Investment in USD", xlab = "Industry Sector", main = "Industrial Sector Wise Investment Analysis", boxwex = 0.6, names = c("Con Int", "Ecomm", "Edu", "HealthCare", "Logist.", "Others", "Tech"))
mytable <- xtabs(CleanData3.df$AmountInUSD ~ CleanData3.df$IndustryVertical)
plot(mytable, ylab = "Amount of Investment", xlab = "Industry Sector")
Thus ECommerce and Consumer Internet seem to rule the roost when it comes to massive funding, although the number of startups in these two sectors is also much larger.
boxplot(CleanData3.df$AmountInUSD ~ CleanData3.df$CityLocation,horizontal = FALSE, ylab= "Amount of Investment in USD", xlab = "City", main = "City Wise Investment Analysis", boxwex = 0.6, names = c("Bang", "Gurgaon", "Hyd", "Mum", "N. Delhi", "Others", "Pune"))
mytable <- xtabs(CleanData3.df$AmountInUSD ~ CleanData3.df$CityLocation)
plot(mytable, ylab = "Amount of Investment", xlab = "City")
Here Bangalore takes a very significant lead in amount of investment, and Mumbai comes a distant second. Although the number of startups is higher in Bangalore, the outliers and a decent median value in the boxplot also account for the higher total investment.
boxplot(CleanData3.df$AmountInUSD ~ CleanData3.df$InvestmentType,horizontal = FALSE, ylab= "Amount of Investment in USD", xlab = "Investment Type", main = "Investment type vs Investment Analysis")
mytable <- xtabs(CleanData3.df$AmountInUSD ~ CleanData3.df$InvestmentType)
plot(mytable, ylab = "Amount of Investment", xlab = "Type of Investment")
We have a clear winner here in Private Equity. This is partly because much Private Equity investment goes to startups that have already established themselves, whereas Seed Funding mainly occurs at the incubation stage.
Now we create a dataset restricted to our five variables for the correlation tests and t-tests.
cor(Testdata.df$Date, Testdata.df$AmountInUSD)
## [1] 0.09190294
A positive but weak correlation (r of about 0.09).
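Whether a correlation this small differs from zero can be checked with the usual t statistic for a Pearson correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below uses the reported r and assumes n = 869 observations, a value inferred from the regression output (853 residual degrees of freedom plus 16 coefficients) rather than stated directly in the report.

```r
r <- 0.09190294  # correlation reported above
n <- 869         # assumed sample size, inferred from the regression output

# t statistic for H0: true correlation is zero, on n - 2 degrees of freedom.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

c(t = t_stat, p = p_value)
```

With the data at hand, `cor.test(Testdata.df$Date, Testdata.df$AmountInUSD)` performs the same test directly; the resulting p-value can then be compared with whichever threshold is in use.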
library(corrgram)
corrgram(Testdata.df, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="Corrgram of correlations between Startup funding variables")
mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$IndustryVertical)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 9.959, df = 6, p-value = 0.1264
The p-value (0.1264) is not low enough to reject the null hypothesis.
mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$CityLocation)
chisq.test(mytable)
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 13.022, df = 6, p-value = 0.04269
A lower p-value (0.04269), but still above our 0.005 threshold, so we are unable to reject the null hypothesis.
mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$InvestmentType)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 17.733, df = 2, p-value = 0.000141
The p-value is below 0.005, which indicates that there is a relationship between the year and the type of investment that predominantly occurred in that year.
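To make the test above concrete, the statistic can be recomputed by hand from the Year by Investment Type counts shown earlier. The sketch re-enters those counts manually (the object names are illustrative) and also shows why R prints the "approximation may be incorrect" warning: the single Debt Funding deal leaves expected counts far below 5.

```r
# Observed counts copied from the Year x InvestmentType table.
counts <- matrix(
  c(0, 293, 291,
    1, 183, 101),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Year = c("2016", "2017"),
    InvestmentType = c("Debt Funding", "Private Equity", "Seed Funding")
  )
)

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson chi-square statistic and its degrees of freedom.
x2      <- sum((counts - expected)^2 / expected)
dof     <- (nrow(counts) - 1) * (ncol(counts) - 1)
p_value <- pchisq(x2, df = dof, lower.tail = FALSE)

c(X2 = x2, df = dof, p = p_value)  # matches X-squared = 17.733, df = 2

min(expected)  # well below 5, which is what triggers the warning
```

Two common ways to respond to the warning are to drop or merge the near-empty Debt Funding column, or to call `chisq.test(mytable, simulate.p.value = TRUE)` for a Monte Carlo p-value; either is a judgment call rather than something the original analysis did.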
mytable <- xtabs(~ CleanData3.df$IndustryVertical + CleanData3.df$CityLocation)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 60.908, df = 36, p-value = 0.005871
On the brink (p-value of 0.005871), but we still cannot safely reject the null hypothesis.
mytable <- xtabs(~ CleanData3.df$IndustryVertical + CleanData3.df$InvestmentType)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 28.816, df = 12, p-value = 0.004196
The p-value is below 0.005, signifying that the industrial sector of a startup and the investment type it receives are not independent; there is a relationship between them.
mytable <- xtabs(~ CleanData3.df$CityLocation + CleanData3.df$InvestmentType)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 23.222, df = 12, p-value = 0.0259
Unable to reject the null hypothesis at our 0.005 threshold (p-value of 0.0259).
1. Industry Vertical of the Startup
mytable <- xtabs(~ CleanData3.df$IndustryVertical + CleanData3.df$AmountInUSD)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 1495.2, df = 1416, p-value = 0.07019
The p-value (0.07019) is above our threshold, so we are unable to reject the null hypothesis.
2. City in which the Startup is based
mytable <- xtabs(~ CleanData3.df$AmountInUSD + CleanData3.df$CityLocation)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 1520.2, df = 1416, p-value = 0.0271
The p-value (0.0271) is above our 0.005 threshold, so we are unable to reject the null hypothesis.
3. Investment type
mytable <- xtabs(~ CleanData3.df$AmountInUSD + CleanData3.df$InvestmentType)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 1536.9, df = 472, p-value < 2.2e-16
A very low p-value, signifying that Investment Type and Amount of Investment are not independent; there is a relationship between them.
4. Year of investment
mytable <- xtabs(~ CleanData3.df$AmountInUSD + CleanData3.df$Date)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 288.6, df = 236, p-value = 0.01097
A low p-value (0.01097), but not low enough to reject the null hypothesis at our 0.005 threshold.
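The four tests above treat every distinct dollar amount as its own category, which produces very sparse tables (hence the large degrees of freedom and the repeated warnings). One alternative, which is not what the original analysis used, is a rank-based Kruskal-Wallis test that keeps the amount as a numeric variable. The sketch below runs it on simulated data: `toy.df` and all values in it are invented for illustration, and only the column names mirror the dataset.

```r
set.seed(1)  # make the simulated example reproducible

# Simulated, heavily skewed amounts for two investment types (invented values).
toy.df <- data.frame(
  InvestmentType = factor(rep(c("Private Equity", "Seed Funding"), each = 50)),
  AmountInUSD    = c(rlnorm(50, meanlog = 16, sdlog = 1.5),
                     rlnorm(50, meanlog = 13, sdlog = 1.0))
)

# Kruskal-Wallis compares the groups using ranks of the amounts,
# so extreme outliers do not dominate the result.
kw <- kruskal.test(AmountInUSD ~ InvestmentType, data = toy.df)

kw$statistic
kw$parameter  # degrees of freedom = number of groups - 1
kw$p.value
```

On the real data the analogous call would be `kruskal.test(AmountInUSD ~ InvestmentType, data = CleanData3.df)`, assuming `InvestmentType` is stored as a factor or character column.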
linfit <- lm(AmountInUSD ~ ., data = Testdata.df)
summary(linfit)
##
## Call:
## lm(formula = AmountInUSD ~ ., data = Testdata.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49799181 -15630087 -4457928 5583221 1349700819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.320e+10 1.051e+10 -2.207 0.02755 *
## Date 1.151e+07 5.210e+06 2.210 0.02740 *
## IndustryVerticalECommerce 1.811e+07 6.782e+06 2.670 0.00773 **
## IndustryVerticalEducation -9.615e+05 1.862e+07 -0.052 0.95883
## IndustryVerticalHealthcare -2.882e+06 1.856e+07 -0.155 0.87667
## IndustryVerticalLogistics -1.155e+06 1.804e+07 -0.064 0.94896
## IndustryVerticalOtherSectors -1.278e+06 1.380e+07 -0.093 0.92621
## IndustryVerticalTechnology -4.390e+06 6.249e+06 -0.703 0.48250
## CityLocationGurgaon -1.000e+07 8.404e+06 -1.190 0.23437
## CityLocationHyderabad -1.149e+07 1.273e+07 -0.902 0.36723
## CityLocationMumbai -1.304e+07 6.819e+06 -1.913 0.05614 .
## CityLocationNew Delhi -9.596e+06 7.659e+06 -1.253 0.21056
## CityLocationOthers -1.371e+07 7.781e+06 -1.762 0.07836 .
## CityLocationPune -9.117e+06 1.252e+07 -0.728 0.46652
## InvestmentTypePrivate Equity 1.068e+07 7.097e+07 0.150 0.88043
## InvestmentTypeSeed Funding -5.384e+06 7.097e+07 -0.076 0.93955
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 70480000 on 853 degrees of freedom
## Multiple R-squared: 0.04088, Adjusted R-squared: 0.02401
## F-statistic: 2.424 on 15 and 853 DF, p-value: 0.001844
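The last three lines of the summary are linked by simple formulas, which can be checked from the printed numbers alone. The sketch below recomputes the adjusted R-squared and the overall F-statistic from the reported multiple R-squared (0.04088), the 15 predictors and the 853 residual degrees of freedom; the variable names are illustrative.

```r
r2    <- 0.04088  # Multiple R-squared from the summary
k     <- 15       # number of predictors (numerator df of the F-statistic)
df_res <- 853     # residual degrees of freedom
n     <- df_res + k + 1  # implied number of observations

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res

# Overall F-statistic: explained variance per predictor
# relative to unexplained variance per residual degree of freedom.
f_stat  <- (r2 / k) / ((1 - r2) / df_res)
p_value <- pf(f_stat, df1 = k, df2 = df_res, lower.tail = FALSE)

c(n = n, adj_r2 = adj_r2, F = f_stat, p = p_value)
```

The recomputed values agree with the summary up to rounding. Read together, they say that the predictors are jointly significant but explain only about 4% of the variation in investment amount, with ECommerce and Date the only individual terms below the 0.05 level.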