setwd("C:/Users/karansy/Downloads/Documents/Pdf")
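# stringsAsFactors = TRUE keeps text columns as factors, as in the str() output further down (the default became FALSE in R 4.0.0)
Data1.df <- read.csv("startupfunding.csv", stringsAsFactors = TRUE)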
View(Data1.df)
First, we look at the dimensions of our dataset.
dim(Data1.df)
## [1] 2372 10
Thus we have 2372 rows and 10 columns describing startup funding rounds in 2015, 2016 and 2017.
summary(Data1.df)
##       SNo              Date         StartupName
##  Min.   :   0.0   Min.   :2015   Swiggy   :   7
##  1st Qu.: 592.8   1st Qu.:2015   UrbanClap:   6
##  Median :1185.5   Median :2016   Jugnoo   :   5
##  Mean   :1185.5   Mean   :2016   Medinfi  :   5
##  3rd Qu.:1778.2   3rd Qu.:2016   NoBroker :   5
##  Max.   :2371.0   Max.   :2017   Paytm    :   5
##                                  (Other)  :2339
##           IndustryVertical                   SubVertical      CityLocation
##  Consumer Internet:772                             : 936   Bangalore:627
##  Technology       :313     Online Pharmacy         :   9   Mumbai   :446
##  ECommerce        :230     Food Delivery Platform  :   8   New Delhi:381
##                   :171     Online lending platform :   5   Gurgaon  :240
##  Healthcare       : 31     ECommerce Marketplace   :   4            :179
##  Logistics        : 24     Online Learning Platform:   4   Pune     : 84
##  (Other)          :831     (Other)                 :1406   (Other)  :415
##                   InvestorsName         InvestmentType      AmountInUSD
##  Undisclosed Investors   :  33   Seed Funding  :1271            : 847
##  Undisclosed investors   :  27   Private Equity:1066   1,000,000: 130
##  Indian Angel Network    :  24   SeedFunding   :  30   500,000  :  91
##  Ratan Tata              :  24                 :   1   100,000  :  55
##  Kalaari Capital         :  16   Crowd funding :   1   2,000,000:  55
##  Group of Angel Investors:  15   Crowd Funding :   1   3,000,000:  50
##  (Other)                 :2233   (Other)       :   2   (Other)  :1144
##          Remarks
##              :1953
##  Series A    : 177
##  Series B    :  64
##  Pre-Series A:  37
##  Series C    :  28
##  Series D    :  11
##  (Other)     : 102
Our data has blank cells in the IndustryVertical, SubVertical, CityLocation and AmountInUSD columns. When analysing each variable we therefore ignore the rows in which it is empty. The Remarks column has 1953 empty cells out of 2372 and its remaining entries are highly varied, so we do not analyse the remarks at all.
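A quick way to see how many blank cells each column holds, before deciding what to drop, is to count the empty strings per column. The sketch below uses a small made-up data frame; `toy.df` and its values are illustrative and are not taken from startupfunding.csv.

```r
# Toy data frame standing in for Data1.df (values are made up for illustration)
toy.df <- data.frame(
  CityLocation = c("Bangalore", "", "Mumbai", ""),
  AmountInUSD  = c("1,000,000", "", "500,000", "2,000,000"),
  Remarks      = c("", "", "Series A", ""),
  stringsAsFactors = FALSE
)

# toy.df == "" is a logical matrix; colSums() counts the TRUE values per column
blank.counts <- colSums(toy.df == "")
print(blank.counts)
```

Columns with a large count, like Remarks here, are candidates for being dropped entirely rather than filtered row by row.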
Cleaning the data makes the analysis easier: we drop the Remarks column, replace the empty cells with NA, and remove the , characters from the AmountInUSD column so that it can be converted to numeric.
Data1.df$Remarks <- NULL
Data1.df[Data1.df == ""] <- NA
Data1.df$AmountInUSD <- as.numeric(gsub(",","",Data1.df$AmountInUSD))
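To see why the commas have to go before the conversion, here is a minimal sketch on a made-up character vector (`raw.amount` is a hypothetical stand-in for the AmountInUSD column):

```r
# Toy amounts as they appear in the raw file: comma-separated text, one blank
raw.amount <- c("1,000,000", "", "500,000", "16,000")

# Blank strings become NA, then commas are stripped before numeric conversion
raw.amount[raw.amount == ""] <- NA
clean.amount <- as.numeric(gsub(",", "", raw.amount))
print(clean.amount)

# Without stripping the commas, as.numeric() gives NA (with a warning)
unstripped <- suppressWarnings(as.numeric("1,000,000"))
print(unstripped)
```

The NA produced for the blank cell is what complete.cases() relies on later to filter rows.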
Let us describe the statistics of our significant variables.
library(psych)
Now looking at the major industry sectors that have been funded. First we look at the amount of investment in general, calling the describe function on the rows where the amount is known.
cleanData.df <- Data1.df[complete.cases(Data1.df$AmountInUSD), ]
describe(cleanData.df$AmountInUSD)
## vars n mean sd median trimmed mad min max
## X1 1 1525 12031073 64031175 1070000 3335938 1378818 16000 1.4e+09
## range skew kurtosis se
## X1 1399984000 15.94 309.36 1639670
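The mean (about 12.0 million USD) is far above the median (1.07 million USD), which is what a skew of 15.94 looks like in practice. As a rough illustration, the sketch below computes a moment-based skewness in base R on made-up amounts. `moment_skew` is a hypothetical helper, and the estimator used by `describe` may differ slightly from it.

```r
# Hypothetical helper: third central moment divided by (second central moment)^1.5
moment_skew <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Toy amounts: mostly small rounds plus one very large one (made-up values)
toy.amount <- c(1e5, 2e5, 5e5, 1e6, 1e6, 2e6, 5e6, 1.4e9)

print(c(mean = mean(toy.amount), median = median(toy.amount)))
print(moment_skew(toy.amount))         # positive: long right tail
print(moment_skew(log10(toy.amount)))  # smaller after a log transform
```

A single very large round is enough to pull the mean far above the median, so the median is the safer summary of a typical round.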
Creating contingency tables. Here we drop every row that contains an empty cell, which keeps the analysis clean. We are still left with a significant amount of data to analyse the trends.
CleanData1.df <- Data1.df[complete.cases(Data1.df), ]
View(CleanData1.df)
dim(CleanData1.df)
## [1] 869 9
Thus we have 869 rows with complete data. As the Date column of the summary below shows, these rows all fall in 2016 and 2017.
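Two different filters have been used so far: one keeps the rows with a known amount, the other keeps only the rows with no missing value at all. A minimal sketch of the difference, on a made-up data frame (`toy.df` is illustrative only):

```r
toy.df <- data.frame(
  StartupName  = c("A", "B", "C", "D"),
  CityLocation = c("Bangalore", NA, "Mumbai", "Pune"),
  AmountInUSD  = c(1e6, 5e5, NA, 2e6),
  stringsAsFactors = FALSE
)

# Rows with a known amount only (the filter used for cleanData.df)
amount.known <- toy.df[complete.cases(toy.df$AmountInUSD), ]

# Rows with no NA in any column (the filter used for CleanData1.df)
fully.complete <- toy.df[complete.cases(toy.df), ]

print(nrow(amount.known))
print(nrow(fully.complete))
```

The second filter is stricter, which is why CleanData1.df (869 rows) is smaller than cleanData.df (1525 rows).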
summary(CleanData1.df)
##       SNo            Date            StartupName
##  Min.   :   0   Min.   :2016   Swiggy       :  4
##  1st Qu.: 336   1st Qu.:2016   Byju’s       :  3
##  Median : 690   Median :2016   Capital Float:  3
##  Mean   : 697   Mean   :2016   Flipkart     :  3
##  3rd Qu.:1042   3rd Qu.:2017   Fynd         :  3
##  Max.   :1431   Max.   :2017   Koovs        :  3
##                                (Other)      :850
##           IndustryVertical                   SubVertical     CityLocation
##  Consumer Internet:459     ECommerce Marketplace   :  4   Bangalore:267
##  Technology       :189     Food Delivery Platform  :  4   Mumbai   :181
##  ECommerce        :147     Online lending platform :  4   New Delhi:130
##  Logistics        : 16     Online Pharmacy         :  4   Gurgaon  : 97
##  Education        : 15     Online Learning Platform:  3   Pune     : 37
##  Healthcare       : 15     Cab Aggregation App     :  2   Hyderabad: 35
##  (Other)          : 28     (Other)                 :848   (Other)  :122
##                InvestorsName        InvestmentType       AmountInUSD
##  Undisclosed Investors: 17   Private Equity:476   Min.   :1.800e+04
##  Undisclosed investors: 16   Seed Funding  :392   1st Qu.:3.500e+05
##  undisclosed investors: 11   Debt Funding  :  1   Median :1.000e+06
##  Kalaari Capital      :  9                 :  0   Mean   :1.113e+07
##  Brand Capital        :  7   Crowd funding :  0   3rd Qu.:5.000e+06
##  Indian Angel Network :  7   Crowd Funding :  0   Max.   :1.400e+09
##  (Other)              :802   (Other)       :  0
Bracketing the cities into Bangalore, Mumbai, New Delhi, Gurgaon, Pune, Hyderabad and Others. Similarly, bracketing the industry verticals into Consumer Internet, Technology, ECommerce, Logistics, Education, Healthcare and OtherSectors. This helps our analysis, since analysing a large number of cities with very few startups each would be tedious.
CleanData2.df <- CleanData1.df
str(CleanData2.df)
## 'data.frame': 869 obs. of 9 variables:
## $ SNo : int 0 3 4 5 6 7 8 9 10 13 ...
## $ Date : int 2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
## $ StartupName : Factor w/ 2001 levels "#Fame","121Policy",..: 1734 1965 291 161 455 431 914 1123 1713 292 ...
## $ IndustryVertical: Factor w/ 738 levels "","360-degree view creating platform",..: 695 101 101 101 695 159 159 159 101 101 ...
## $ SubVertical : Factor w/ 1364 levels "","3D printed experimental Human Liver tissue creator",..: 1118 294 490 1086 352 936 943 75 539 285 ...
## $ CityLocation : Factor w/ 72 levels "","Agra","Ahmedabad",..: 5 41 26 5 3 22 5 51 41 5 ...
## $ InvestorsName : Factor w/ 1886 levels "","1Crowd","1Crowd (through crowd funding)",..: 815 896 1088 1336 748 219 834 764 247 717 ...
## $ InvestmentType : Factor w/ 8 levels "","Crowd funding",..: 5 7 7 7 5 5 5 5 5 7 ...
## $ AmountInUSD : num 1.3e+06 5.0e+05 8.5e+05 1.0e+06 2.6e+06 2.0e+07 8.5e+06 1.2e+07 1.0e+06 1.0e+06 ...
CleanData2.df$CityLocation <- as.character(CleanData2.df$CityLocation)
major.cities <- c("Bangalore", "Mumbai", "New Delhi", "Gurgaon", "Pune", "Hyderabad")
CleanData2.df$CityLocation[!(CleanData2.df$CityLocation %in% major.cities)] <- "Others"
View(CleanData2.df)
table(CleanData2.df$CityLocation)
##
## Bangalore Gurgaon Hyderabad Mumbai New Delhi Others Pune
## 267 97 35 181 130 122 37
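The six cities above were picked by reading the summary output. If the list of major categories should instead follow from the data, one option is to keep the k most frequent values and relabel everything else. The sketch below is one way to do that in base R; `lump_top_k` is a hypothetical helper and the city vector is made up.

```r
# Hypothetical helper: keep the k most frequent values, relabel the rest
lump_top_k <- function(x, k, other = "Others") {
  x <- as.character(x)
  top <- names(sort(table(x), decreasing = TRUE))[seq_len(k)]
  x[!(x %in% top)] <- other
  x
}

# Made-up city column: two frequent cities and two rare ones
toy.city <- c("Bangalore", "Bangalore", "Bangalore", "Mumbai", "Mumbai",
              "Agra", "Kochi")
lumped <- table(lump_top_k(toy.city, k = 2))
print(lumped)
```

Note that ties at the cut-off are broken by the sort order, so the hand-picked list used above is the more transparent choice when the categories matter for interpretation.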
Now bracketing the industry vertical
CleanData3.df <- CleanData2.df
CleanData3.df$IndustryVertical <- as.character(CleanData3.df$IndustryVertical)
major.sectors <- c("Consumer Internet", "Technology", "ECommerce", "Logistics", "Education", "Healthcare")
CleanData3.df$IndustryVertical[!(CleanData3.df$IndustryVertical %in% major.sectors)] <- "OtherSectors"
View(CleanData3.df)
table(CleanData3.df$IndustryVertical)
##
## Consumer Internet ECommerce Education Healthcare
## 459 147 15 15
## Logistics OtherSectors Technology
## 16 28 189
dim(CleanData3.df)
## [1] 869 9
We also clean the type of investment; simply converting it from factor to character is sufficient, since this drops the unused factor levels.
CleanData3.df$InvestmentType <- as.character(CleanData3.df$InvestmentType)
So finally we have our dataset CleanData3.df, a subset of Data1.df obtained by keeping the complete rows and bracketing the categories that were too small to analyse individually. We move ahead with CleanData3.df.
table(Data1.df$Date)
##
## 2015 2016 2017
## 936 993 443
This tells us the total number of funding rounds recorded in 2015, 2016 and 2017 (till August).
table(CleanData3.df$IndustryVertical)
##
## Consumer Internet ECommerce Education Healthcare
## 459 147 15 15
## Logistics OtherSectors Technology
## 16 28 189
The number of funding rounds in each major industry sector.
table(CleanData3.df$CityLocation)
##
## Bangalore Gurgaon Hyderabad Mumbai New Delhi Others Pune
## 267 97 35 181 130 122 37
The locations of the funded startups.
table(CleanData3.df$InvestmentType)
##
## Debt Funding Private Equity Seed Funding
## 1 476 392
The type of investment made by the investors. Now moving on to two-way contingency tables:
mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$IndustryVertical)
mytable
## CleanData3.df$IndustryVertical
## CleanData3.df$Date Consumer Internet ECommerce Education Healthcare
## 2016 303 104 14 10
## 2017 156 43 1 5
## CleanData3.df$IndustryVertical
## CleanData3.df$Date Logistics OtherSectors Technology
## 2016 10 23 120
## 2017 6 5 69
There has been a dip in funding for Education startups, from 14 rounds in 2016 to 1 in 2017.
mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$CityLocation)
mytable
## CleanData3.df$CityLocation
## CleanData3.df$Date Bangalore Gurgaon Hyderabad Mumbai New Delhi Others
## 2016 165 65 21 117 100 90
## 2017 102 32 14 64 30 32
## CleanData3.df$CityLocation
## CleanData3.df$Date Pune
## 2016 26
## 2017 11
The number of funded New Delhi startups differs widely between the two years (100 in 2016 against 30 in 2017).
mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$InvestmentType)
mytable
## CleanData3.df$InvestmentType
## CleanData3.df$Date Debt Funding Private Equity Seed Funding
## 2016 0 293 291
## 2017 1 183 101
Relative to Seed Funding, Private Equity is more common in 2017 than in 2016: the two were almost level in 2016 (293 rounds against 291), while in 2017 Private Equity accounts for 183 rounds against 101. Debt Funding, with a single round, is negligible next to the other two investment types.
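Because 2017 only runs till August, raw counts are hard to compare across the two years. Row proportions make the shift easier to read. The sketch below rebuilds the table above from its printed counts (the object names are illustrative) and uses `prop.table` to get the share of each investment type within each year.

```r
# Counts copied from the Date x InvestmentType table above
type.by.year <- matrix(
  c(0, 293, 291,
    1, 183, 101),
  nrow = 2, byrow = TRUE,
  dimnames = list(Date = c("2016", "2017"),
                  InvestmentType = c("Debt Funding", "Private Equity", "Seed Funding"))
)

# Share of each investment type within each year (every row sums to 1)
type.share <- prop.table(type.by.year, margin = 1)
print(round(type.share, 3))
```

On the real data, `prop.table(mytable, margin = 1)` would give the same result directly from the xtabs object.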
mytable <- xtabs(~ CleanData3.df$IndustryVertical + CleanData3.df$CityLocation)
mytable
## CleanData3.df$CityLocation
## CleanData3.df$IndustryVertical Bangalore Gurgaon Hyderabad Mumbai
## Consumer Internet 143 60 17 101
## ECommerce 44 19 3 29
## Education 4 1 0 4
## Healthcare 4 1 1 3
## Logistics 3 3 1 6
## OtherSectors 4 2 2 7
## Technology 65 11 11 31
## CleanData3.df$CityLocation
## CleanData3.df$IndustryVertical New Delhi Others Pune
## Consumer Internet 71 55 12
## ECommerce 28 21 3
## Education 2 4 0
## Healthcare 3 2 1
## Logistics 2 1 0
## OtherSectors 7 5 1
## Technology 17 34 20
Clearly, Bangalore lives up to its reputation as the tech hub of India, with the largest number of funded Technology startups. Consumer Internet startups outnumber those of every other sector in Mumbai and in Delhi NCR.
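The statement about Bangalore is based on counts of funding rounds. It can be useful to also look at the Technology share within each city, since a city with few startups can still be heavily tech-focused. The sketch below uses the Technology row and the column totals of the table above (object names are illustrative).

```r
cities <- c("Bangalore", "Gurgaon", "Hyderabad", "Mumbai", "New Delhi", "Others", "Pune")

# Technology row and column totals copied from the contingency table above
tech.count <- setNames(c(65, 11, 11, 31, 17, 34, 20), cities)
city.total <- setNames(c(267, 97, 35, 181, 130, 122, 37), cities)

# Fraction of each city's funded startups that are Technology
tech.share <- tech.count / city.total
print(round(tech.share, 2))

most.tech.rounds <- names(which.max(tech.count))  # largest Technology count
highest.share    <- names(which.max(tech.share))  # largest Technology fraction
print(c(count = most.tech.rounds, share = highest.share))
```

Counts and shares answer different questions, so the two rankings need not agree.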
mytable <- xtabs(~ CleanData3.df$IndustryVertical + CleanData3.df$InvestmentType)
mytable
## CleanData3.df$InvestmentType
## CleanData3.df$IndustryVertical Debt Funding Private Equity Seed Funding
## Consumer Internet 1 220 238
## ECommerce 0 101 46
## Education 0 10 5
## Healthcare 0 11 4
## Logistics 0 13 3
## OtherSectors 0 14 14
## Technology 0 107 82
ECommerce startups appear to receive far more Private Equity investment than Seed Funding (101 rounds against 46).
mytable <- xtabs(~ CleanData3.df$CityLocation + CleanData3.df$InvestmentType)
mytable
## CleanData3.df$InvestmentType
## CleanData3.df$CityLocation Debt Funding Private Equity Seed Funding
## Bangalore 0 157 110
## Gurgaon 0 59 38
## Hyderabad 0 15 20
## Mumbai 0 110 71
## New Delhi 0 55 75
## Others 1 61 60
## Pune 0 19 18
Private Equity dominates among Mumbai and Bangalore startups. For New Delhi, it is Seed Funding that is more common.
Now we look at how the amount of investment (USD) relates to the other factors: 1. Year 2. Industry Vertical 3. City Location 4. Investment type.
sum(cleanData.df$AmountInUSD)
## [1] 18347386476
So, going by the rounds with a disclosed amount, about 18.35 billion USD has been invested in startups in India. Looking at the distribution:
boxplot(cleanData.df$AmountInUSD, horizontal = TRUE, xlab = "Amount in USD", main = "Startup Investment plot")
Thus, the investment is heavily skewed. Most rounds are small, while a few well-known startups have raised very large amounts from big firms like Alibaba, SoftBank etc.; the largest single round in the data is 1.4 billion USD.
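With data this skewed, a boxplot on the raw scale is mostly a thin box plus a long trail of outliers. One option, not used in the plots here, is a logarithmic axis. The sketch below uses simulated log-normal amounts rather than the real funding data, and compares how many points the 1.5 x IQR rule flags on each scale.

```r
set.seed(1)
# Simulated right-skewed amounts (log-normal); not the real funding data
toy.amount <- rlnorm(500, meanlog = log(1e6), sdlog = 2)

# Points flagged as outliers by the 1.5 * IQR rule on each scale
n.out.raw <- length(boxplot.stats(toy.amount)$out)
n.out.log <- length(boxplot.stats(log10(toy.amount))$out)
print(c(raw = n.out.raw, log10 = n.out.log))

# Same kind of plot as above, drawn on a logarithmic axis
boxplot(toy.amount, horizontal = TRUE, log = "x",
        xlab = "Amount in USD (log scale)", main = "Startup Investment plot")
```

On a log axis the box and whiskers become visible, at the cost of the axis no longer being linear in dollars.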
boxplot(CleanData3.df$AmountInUSD ~ CleanData3.df$Date, horizontal = TRUE, xlab= "Amount of Investment in USD", ylab = "Year", main = "Year Wise Investment Analysis")
Clearly, 2017 has more outliers than 2016.
sum(cleanData.df$AmountInUSD[cleanData.df$Date == 2017])
## [1] 5846275500
sum(cleanData.df$AmountInUSD[cleanData.df$Date == 2016])
## [1] 3828088608
Thus investment in 2017 up to August has already exceeded that of the entire year 2016 (about 5.85 billion against 3.83 billion USD).
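Instead of one `sum` call per year, the yearly totals can be computed in a single pass with `tapply`, alongside the number of rounds and the median, which help to tell "more money" apart from "more rounds". A minimal sketch on made-up data (`toy.df` is illustrative only):

```r
toy.df <- data.frame(
  Date        = c(2015, 2016, 2016, 2017, 2017, 2017),
  AmountInUSD = c(1e6, 2e6, 5e5, 1e7, 3e6, 2.5e5)
)

# Total, number of rounds, and median amount for every year
year.total  <- tapply(toy.df$AmountInUSD, toy.df$Date, sum)
year.count  <- tapply(toy.df$AmountInUSD, toy.df$Date, length)
year.median <- tapply(toy.df$AmountInUSD, toy.df$Date, median)
print(cbind(total = year.total, n = year.count, median = year.median))
```

The same pattern works for any grouping column, for example the industry vertical or the city.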
boxplot(CleanData3.df$AmountInUSD ~ CleanData3.df$IndustryVertical,horizontal = FALSE, ylab= "Amount of Investment in USD", xlab = "Industry Sector", main = "Industrial Sector Wise Investment Analysis", boxwex = 0.6, names = c("Con Int", "Ecomm", "Edu", "HealthCare", "Logist.", "Others", "Tech"))
mytable <- xtabs(CleanData3.df$AmountInUSD ~ CleanData3.df$IndustryVertical)
plot(mytable, ylab = "Amount of Investment", xlab = "Industry Sector")
Thus ECommerce and Consumer Internet seem to rule the roost when it comes to massive funding, although these two sectors also account for a large share of the funded startups.
boxplot(CleanData3.df$AmountInUSD ~ CleanData3.df$CityLocation,horizontal = FALSE, ylab= "Amount of Investment in USD", xlab = "City", main = "City Wise Investment Analysis", boxwex = 0.6, names = c("Bang", "Gurgaon", "Hyd", "Mum", "N. Delhi", "Others", "Pune"))
mytable <- xtabs(CleanData3.df$AmountInUSD ~ CleanData3.df$CityLocation)
plot(mytable, ylab = "Amount of Investment", xlab = "City")
Here Bangalore takes a very significant lead in the amount of investment, with Mumbai a distant second. Bangalore also has the largest number of startups, but the outliers and a decent median value in the boxplot show that its lead is not down to numbers alone.
boxplot(CleanData3.df$AmountInUSD ~ CleanData3.df$InvestmentType,horizontal = FALSE, ylab= "Amount of Investment in USD", xlab = "Investment Type", main = "Investment type vs Investment Analysis")
mytable <- xtabs(CleanData3.df$AmountInUSD ~ CleanData3.df$InvestmentType)
plot(mytable, ylab = "Amount of Investment", xlab = "Type of Investment")
We have a clear winner here in Private Equity. This is partly because much Private Equity investment goes to startups that are already well established, whereas Seed Funding mainly comes at the incubation stage.
Now we create a dataset restricted to our 5 variables for the correlation tests and t-tests.
Testdata.df <- CleanData3.df
Testdata.df$SNo <- NULL
Testdata.df$StartupName <- NULL
Testdata.df$SubVertical <- NULL
Testdata.df$InvestorsName <- NULL
View(Testdata.df)
cor(Testdata.df$Date, Testdata.df$AmountInUSD)
## [1] 0.09190294
A positive correlation, but a very weak one (r = 0.09).
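`cor` only returns the coefficient. To judge whether a correlation is statistically distinguishable from zero, `cor.test` also reports a p-value. Since a few very large amounts can dominate a Pearson correlation, a rank-based Spearman correlation is a common cross-check. The sketch below runs both on simulated data, not on the real funding data.

```r
set.seed(42)
# Simulated data: two years and right-skewed amounts (not the real funding data)
toy.year   <- sample(c(2016, 2017), size = 200, replace = TRUE)
toy.amount <- rlnorm(200, meanlog = log(1e6), sdlog = 2)

# Pearson correlation with a test of H0: correlation = 0
pearson <- cor.test(toy.year, toy.amount)

# Rank-based alternative, less sensitive to a few very large amounts;
# exact = FALSE because the year variable has many ties
spearman <- cor.test(toy.year, toy.amount, method = "spearman", exact = FALSE)

print(c(pearson = unname(pearson$estimate), spearman = unname(spearman$estimate)))
print(c(pearson = pearson$p.value, spearman = spearman$p.value))
```

On the real data the equivalent call would be `cor.test(Testdata.df$Date, Testdata.df$AmountInUSD)`.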
library(corrgram)
corrgram(Testdata.df, order=TRUE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="Corrgram of correlations between Startup funding variables")
mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$IndustryVertical)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 9.959, df = 6, p-value = 0.1264
The p-value (0.126) is well above any conventional threshold, so we cannot reject the null hypothesis that year and industry vertical are independent.
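The warning "Chi-squared approximation may be incorrect" appears when some expected cell counts are below 5. The sketch below rebuilds the Date x IndustryVertical table from the counts printed earlier (object names are illustrative), inspects the expected counts, and shows one possible remedy: a Monte Carlo p-value via `simulate.p.value`, which does not rely on the chi-squared approximation.

```r
# Counts copied from the Date x IndustryVertical table printed earlier
sector.by.year <- matrix(
  c(303, 104, 14, 10, 10, 23, 120,
    156,  43,  1,  5,  6,  5,  69),
  nrow = 2, byrow = TRUE,
  dimnames = list(Date = c("2016", "2017"),
                  IndustryVertical = c("Consumer Internet", "ECommerce", "Education",
                                       "Healthcare", "Logistics", "OtherSectors",
                                       "Technology"))
)

res <- suppressWarnings(chisq.test(sector.by.year))
print(round(res$expected, 1))  # expected counts under independence
n.small <- sum(res$expected < 5)
print(n.small)                 # number of cells with an expected count below 5

# Monte Carlo p-value based on 10000 simulated tables
set.seed(123)
sim <- chisq.test(sector.by.year, simulate.p.value = TRUE, B = 10000)
print(sim$p.value)
```

Merging sparse categories (here Education and Healthcare) into OtherSectors before testing is another way to avoid small expected counts.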
mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$CityLocation)
chisq.test(mytable)
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 13.022, df = 6, p-value = 0.04269
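The p-value (0.043) is below the conventional 0.05 level but above the stricter 0.005 threshold applied in this analysis, so we do not reject the null hypothesis of independence.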
mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$InvestmentType)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 17.733, df = 2, p-value = 0.000141
The p-value is below 0.005, which indicates an association between the year and the type of investment made in that year.
mytable <- xtabs(~ CleanData3.df$IndustryVertical + CleanData3.df$CityLocation)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 60.908, df = 36, p-value = 0.005871
The p-value (0.0059) is just above the 0.005 threshold: on the brink, but we still can't safely reject the null hypothesis.
mytable <- xtabs(~ CleanData3.df$IndustryVertical + CleanData3.df$InvestmentType)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 28.816, df = 12, p-value = 0.004196
The p-value is below 0.005, signifying that the industry sector of a startup and the type of investment it gets are not independent: there is a relationship between them.
mytable <- xtabs(~ CleanData3.df$CityLocation + CleanData3.df$InvestmentType)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 23.222, df = 12, p-value = 0.0259
The p-value (0.026) is above the 0.005 threshold, so we are unable to reject the null hypothesis.
1.Industry vertical of the startup
mytable <- xtabs(~ CleanData3.df$IndustryVertical + CleanData3.df$AmountInUSD)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 1495.2, df = 1416, p-value = 0.07019
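With a p-value of 0.070 we are unable to reject the null hypothesis. Note that xtabs treats every distinct amount as its own category here (df = 1416 = 6 x 236 implies 237 distinct amounts), so most cells are nearly empty and, as the warning says, the chi-squared approximation may be unreliable.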
2.City in which startup is based
mytable <- xtabs(~ CleanData3.df$AmountInUSD + CleanData3.df$CityLocation)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 1520.2, df = 1416, p-value = 0.0271
The p-value (0.027) is below 0.05 but above the 0.005 threshold used here, so we do not reject the null hypothesis.
3.Investment type
mytable <- xtabs(~ CleanData3.df$AmountInUSD + CleanData3.df$InvestmentType)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 1536.9, df = 472, p-value < 2.2e-16
A very low p-value, signifying that investment type and amount of investment are not independent: there is a relationship between them.
4.Year of investment
mytable <- xtabs(~ CleanData3.df$AmountInUSD + CleanData3.df$Date)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: mytable
## X-squared = 288.6, df = 236, p-value = 0.01097
A low p-value (0.011), but still above the 0.005 threshold, so not low enough to reject the null hypothesis.
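The four tests above feed AmountInUSD to xtabs as if it were a categorical variable. Because the amount is numeric, an alternative (not used in the analysis above) is to compare its distribution across groups directly, for example with a Kruskal-Wallis rank sum test, or with a one-way ANOVA on log amounts. The sketch below is only an illustration on simulated data; the group means chosen for the simulation are assumptions, not estimates from the funding data.

```r
set.seed(7)
# Simulated data (not the real funding data): Private Equity rounds are drawn
# from a distribution with a larger typical amount than Seed Funding rounds
toy.df <- data.frame(
  InvestmentType = rep(c("Seed Funding", "Private Equity"), each = 100),
  AmountInUSD    = c(rlnorm(100, meanlog = log(3e5), sdlog = 1),
                     rlnorm(100, meanlog = log(5e6), sdlog = 1.5))
)

# Kruskal-Wallis rank sum test: do the amounts differ across the groups?
kw <- kruskal.test(AmountInUSD ~ InvestmentType, data = toy.df)
print(kw)

# One-way ANOVA on log10 amounts as a parametric counterpart
print(summary(aov(log10(AmountInUSD) ~ InvestmentType, data = toy.df)))
```

On the real data the corresponding call would be `kruskal.test(AmountInUSD ~ InvestmentType, data = CleanData3.df)`, and likewise for CityLocation, IndustryVertical and Date.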