setwd("C:/Users/karansy/Downloads/Documents/Pdf")
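# stringsAsFactors = TRUE keeps text columns as factors, as in the summaries below
# (the default changed to FALSE in R 4.0)
Data1.df <- read.csv("startupfunding.csv", stringsAsFactors = TRUE)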
View(Data1.df)

First, we check the dimensions of our dataset.

dim(Data1.df)
## [1] 2372   10

Thus we have 2372 rows and 10 columns describing startup funding rounds in the years 2015, 2016 and 2017.

summary(Data1.df)
##       SNo              Date         StartupName  
##  Min.   :   0.0   Min.   :2015   Swiggy   :   7  
##  1st Qu.: 592.8   1st Qu.:2015   UrbanClap:   6  
##  Median :1185.5   Median :2016   Jugnoo   :   5  
##  Mean   :1185.5   Mean   :2016   Medinfi  :   5  
##  3rd Qu.:1778.2   3rd Qu.:2016   NoBroker :   5  
##  Max.   :2371.0   Max.   :2017   Paytm    :   5  
##                                  (Other)  :2339  
##           IndustryVertical                   SubVertical      CityLocation
##  Consumer Internet:772                             : 936   Bangalore:627  
##  Technology       :313     Online Pharmacy         :   9   Mumbai   :446  
##  ECommerce        :230     Food Delivery Platform  :   8   New Delhi:381  
##                   :171     Online lending platform :   5   Gurgaon  :240  
##  Healthcare       : 31     ECommerce Marketplace   :   4            :179  
##  Logistics        : 24     Online Learning Platform:   4   Pune     : 84  
##  (Other)          :831     (Other)                 :1406   (Other)  :415  
##                   InvestorsName         InvestmentType    AmountInUSD  
##  Undisclosed Investors   :  33   Seed Funding  :1271            : 847  
##  Undisclosed investors   :  27   Private Equity:1066   1,000,000: 130  
##  Indian Angel Network    :  24   SeedFunding   :  30   500,000  :  91  
##  Ratan Tata              :  24                 :   1   100,000  :  55  
##  Kalaari Capital         :  16   Crowd funding :   1   2,000,000:  55  
##  Group of Angel Investors:  15   Crowd Funding :   1   3,000,000:  50  
##  (Other)                 :2233   (Other)       :   2   (Other)  :1144  
##          Remarks    
##              :1953  
##  Series A    : 177  
##  Series B    :  64  
##  Pre-Series A:  37  
##  Series C    :  28  
##  Series D    :  11  
##  (Other)     : 102

Important Information

Our data has many blank cells in the IndustryVertical, SubVertical, CityLocation and AmountInUSD columns. When plotting an individual variable we therefore ignore the rows where that variable is empty. The Remarks column has 1953 blank cells out of 2372, and the remaining entries are free-form notes of many different kinds, so we do not analyse the remarks.
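A minimal sketch of how the blank cells can be counted per column before deciding what to drop. The data frame toy.df and its values are made up for illustration; on the real data the same two calls would be applied to Data1.df.

```r
# Toy stand-in for Data1.df (values are invented for illustration only)
toy.df <- data.frame(
  CityLocation = c("Bangalore", "", "Mumbai", ""),
  AmountInUSD  = c("1,000,000", "", "500,000", "2,000,000"),
  Remarks      = c("", "", "Series A", ""),
  stringsAsFactors = FALSE
)

# Count empty strings per column; na.rm guards against cells that are already NA
blank.counts <- colSums(toy.df == "", na.rm = TRUE)
print(blank.counts)

# Share of blank cells per column, as a percentage of all rows
blank.share <- round(100 * blank.counts / nrow(toy.df), 1)
print(blank.share)
```

A column that is mostly blank, like Remarks here, is a candidate for dropping entirely rather than filtering row by row.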

To clean the data, we drop the Remarks column, fill the blank cells with NA, and remove the , characters from the AmountInUSD column so that it can be converted to a pure numeric form. This makes the rest of the analysis more efficient.

Data1.df$Remarks <- NULL
Data1.df[Data1.df == ""] <- NA
Data1.df$AmountInUSD <- as.numeric(gsub(",","",Data1.df$AmountInUSD))
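The amount conversion above can be checked on a small vector. This is only a sketch with invented values; raw.amount and raw.factor are hypothetical names. It also shows why the gsub() step matters when the column is a factor: gsub() works on the character labels, whereas as.numeric() applied directly to a factor returns the internal level codes.

```r
# Invented amounts in the same "1,000,000" text format as the dataset
raw.amount <- c("1,000,000", "500,000", "", "16,000")

# Blank cells become NA, then the thousands separators are stripped
raw.amount[raw.amount == ""] <- NA
clean.amount <- as.numeric(gsub(",", "", raw.amount, fixed = TRUE))
print(clean.amount)

# The same call is safe on a factor, because gsub() uses the labels
raw.factor <- factor(c("1,000,000", "500,000"))
from.factor <- as.numeric(gsub(",", "", raw.factor, fixed = TRUE))
print(from.factor)

# Pitfall: this returns level codes (1, 2), not the amounts
codes <- as.numeric(raw.factor)
print(codes)
```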

Let us describe the statistics of our significant variables.

library(psych)

Before looking at the major industry sectors that have been funded, we first look at the amount of investment in general. We call the describe function on the rows where the amount is known.

cleanData.df <- Data1.df[complete.cases(Data1.df$AmountInUSD), ]
describe(cleanData.df$AmountInUSD)
##    vars    n     mean       sd  median trimmed     mad   min     max
## X1    1 1525 12031073 64031175 1070000 3335938 1378818 16000 1.4e+09
##         range  skew kurtosis      se
## X1 1399984000 15.94   309.36 1639670

Creating contingency tables. Here we drop every row that contains an empty cell, which keeps the analysis clean. We are still left with a significant amount of data to analyse the trends.

CleanData1.df <- Data1.df[complete.cases(Data1.df), ]
View(CleanData1.df)
dim(CleanData1.df)
## [1] 869   9

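Thus we have 869 rows with complete data. Note that no 2015 rows survive this filter (the minimum Date in the summary below is 2016), so the complete-case analysis covers 2016 and 2017 only.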

summary(CleanData1.df)
##       SNo            Date             StartupName 
##  Min.   :   0   Min.   :2016   Swiggy       :  4  
##  1st Qu.: 336   1st Qu.:2016   Byju’s       :  3  
##  Median : 690   Median :2016   Capital Float:  3  
##  Mean   : 697   Mean   :2016   Flipkart     :  3  
##  3rd Qu.:1042   3rd Qu.:2017   Fynd         :  3  
##  Max.   :1431   Max.   :2017   Koovs        :  3  
##                                (Other)      :850  
##           IndustryVertical                   SubVertical     CityLocation
##  Consumer Internet:459     ECommerce Marketplace   :  4   Bangalore:267  
##  Technology       :189     Food Delivery Platform  :  4   Mumbai   :181  
##  ECommerce        :147     Online lending platform :  4   New Delhi:130  
##  Logistics        : 16     Online Pharmacy         :  4   Gurgaon  : 97  
##  Education        : 15     Online Learning Platform:  3   Pune     : 37  
##  Healthcare       : 15     Cab Aggregation App     :  2   Hyderabad: 35  
##  (Other)          : 28     (Other)                 :848   (Other)  :122  
##                InvestorsName        InvestmentType  AmountInUSD       
##  Undisclosed Investors: 17   Private Equity:476    Min.   :1.800e+04  
##  Undisclosed investors: 16   Seed Funding  :392    1st Qu.:3.500e+05  
##  undisclosed investors: 11   Debt Funding  :  1    Median :1.000e+06  
##  Kalaari Capital      :  9                 :  0    Mean   :1.113e+07  
##  Brand Capital        :  7   Crowd funding :  0    3rd Qu.:5.000e+06  
##  Indian Angel Network :  7   Crowd Funding :  0    Max.   :1.400e+09  
##  (Other)              :802   (Other)       :  0

We bracket the cities into Bangalore, Mumbai, New Delhi, Gurgaon, Pune, Hyderabad and Others. Similarly, we bracket the industry verticals into Consumer Internet, Technology, ECommerce, Logistics, Education, Healthcare and OtherSectors. This helps our analysis, since analysing a large number of cities that each have very few startups would be tedious.

CleanData2.df <- CleanData1.df
str(CleanData2.df)
## 'data.frame':    869 obs. of  9 variables:
##  $ SNo             : int  0 3 4 5 6 7 8 9 10 13 ...
##  $ Date            : int  2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
##  $ StartupName     : Factor w/ 2001 levels "#Fame","121Policy",..: 1734 1965 291 161 455 431 914 1123 1713 292 ...
##  $ IndustryVertical: Factor w/ 738 levels "","360-degree view creating platform",..: 695 101 101 101 695 159 159 159 101 101 ...
##  $ SubVertical     : Factor w/ 1364 levels "","3D printed experimental Human Liver tissue creator",..: 1118 294 490 1086 352 936 943 75 539 285 ...
##  $ CityLocation    : Factor w/ 72 levels "","Agra","Ahmedabad",..: 5 41 26 5 3 22 5 51 41 5 ...
##  $ InvestorsName   : Factor w/ 1886 levels "","1Crowd","1Crowd (through crowd funding)",..: 815 896 1088 1336 748 219 834 764 247 717 ...
##  $ InvestmentType  : Factor w/ 8 levels "","Crowd funding",..: 5 7 7 7 5 5 5 5 5 7 ...
##  $ AmountInUSD     : num  1.3e+06 5.0e+05 8.5e+05 1.0e+06 2.6e+06 2.0e+07 8.5e+06 1.2e+07 1.0e+06 1.0e+06 ...
CleanData2.df$CityLocation <- as.character(CleanData2.df$CityLocation)
# Every city outside the six major ones is relabelled as "Others"
major.cities <- c("Bangalore", "Mumbai", "New Delhi", "Gurgaon", "Pune", "Hyderabad")
CleanData2.df$CityLocation[!(CleanData2.df$CityLocation %in% major.cities)] <- "Others"
View(CleanData2.df)
table(CleanData2.df$CityLocation)
## 
## Bangalore   Gurgaon Hyderabad    Mumbai New Delhi    Others      Pune 
##       267        97        35       181       130       122        37
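Since the same bracketing is applied to more than one column, it can be wrapped in a small helper. The function lump_others below is a hypothetical name, not part of the original analysis, and the city vector is invented for illustration.

```r
# Hypothetical helper: keep the listed categories, relabel everything else.
# NA values are left untouched rather than being folded into the "other" label.
lump_others <- function(x, keep, other = "Others") {
  x <- as.character(x)
  x[!is.na(x) & !(x %in% keep)] <- other
  x
}

# Invented example values
cities <- factor(c("Bangalore", "Chennai", "Mumbai", "Jaipur", "Pune", NA))
major.cities <- c("Bangalore", "Mumbai", "New Delhi", "Gurgaon", "Pune", "Hyderabad")

lumped <- lump_others(cities, major.cities)
print(table(lumped, useNA = "ifany"))
```

The same helper could be called with a list of sectors and other = "OtherSectors" for the industry vertical.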

Now bracketing the industry vertical.

CleanData3.df <- CleanData2.df
CleanData3.df$IndustryVertical <- as.character(CleanData3.df$IndustryVertical)
# Every sector outside the six major ones is relabelled as "OtherSectors"
major.sectors <- c("Consumer Internet", "Technology", "ECommerce", "Logistics", "Education", "Healthcare")
CleanData3.df$IndustryVertical[!(CleanData3.df$IndustryVertical %in% major.sectors)] <- "OtherSectors"
View(CleanData3.df)
table(CleanData3.df$IndustryVertical)
## 
## Consumer Internet         ECommerce         Education        Healthcare 
##               459               147                15                15 
##         Logistics      OtherSectors        Technology 
##                16                28               189
dim(CleanData3.df)
## [1] 869   9

We also clean the type of investment. Simply converting from factor to character is sufficient, since this discards the unused factor levels.

CleanData3.df$InvestmentType <- as.character(CleanData3.df$InvestmentType)

So finally we have our dataset CleanData3.df, a subset of Data1.df obtained by dropping incomplete rows and bracketing the categories that were too small to justify analysing individually. We move ahead with CleanData3.df.

Contingency Tables

table(Data1.df$Date)
## 
## 2015 2016 2017 
##  936  993  443

This tells us the total number of funding rounds that occurred in the years 2015, 2016 and 2017 (till August). Note that this table uses the full Data1.df rather than the cleaned subset.

table(CleanData3.df$IndustryVertical)
## 
## Consumer Internet         ECommerce         Education        Healthcare 
##               459               147                15                15 
##         Logistics      OtherSectors        Technology 
##                16                28               189

The number of funding rounds in each major industry sector.

table(CleanData3.df$CityLocation)
## 
## Bangalore   Gurgaon Hyderabad    Mumbai New Delhi    Others      Pune 
##       267        97        35       181       130       122        37

The locations of the funded startups.

table(CleanData3.df$InvestmentType)
## 
##   Debt Funding Private Equity   Seed Funding 
##              1            476            392

The type of investment made by the investors. Now moving on to two-way contingency tables:

1. Year vs Industry Vertical

mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$IndustryVertical)
mytable
##                   CleanData3.df$IndustryVertical
## CleanData3.df$Date Consumer Internet ECommerce Education Healthcare
##               2016               303       104        14         10
##               2017               156        43         1          5
##                   CleanData3.df$IndustryVertical
## CleanData3.df$Date Logistics OtherSectors Technology
##               2016        10           23        120
##               2017         6            5         69

Funding rounds for education startups dipped from 14 in 2016 to 1 in 2017, although the 2017 data runs only till August.

2. Year vs City Location

mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$CityLocation)
mytable
##                   CleanData3.df$CityLocation
## CleanData3.df$Date Bangalore Gurgaon Hyderabad Mumbai New Delhi Others
##               2016       165      65        21    117       100     90
##               2017       102      32        14     64        30     32
##                   CleanData3.df$CityLocation
## CleanData3.df$Date Pune
##               2016   26
##               2017   11

The number of New Delhi based startups acquiring funds differs sharply between the two years: 100 in 2016 against 30 in 2017.

3. Year vs Investment Type

mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$InvestmentType)
mytable
##                   CleanData3.df$InvestmentType
## CleanData3.df$Date Debt Funding Private Equity Seed Funding
##               2016            0            293          291
##               2017            1            183          101

Private Equity makes up a larger share of rounds in 2017 than in 2016: 183 of 285 rounds (about 64%) against 293 of 584 (about 50%), with the share of Seed Funding shrinking correspondingly. Debt Funding is far less common than the other two investment types, with a single round.
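Because the two years have very different totals, row percentages make the comparison easier to read than raw counts. The sketch below rebuilds the Year vs Investment Type table from the counts printed above and uses prop.table; with the real data, the same call could be applied to mytable directly.

```r
# Counts copied from the Year vs Investment Type table above
counts <- matrix(c(0, 293, 291,
                   1, 183, 101),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Year = c("2016", "2017"),
                                 InvestmentType = c("Debt Funding",
                                                    "Private Equity",
                                                    "Seed Funding")))

# margin = 1 normalises each row, so every year sums to 100%
row.share <- round(100 * prop.table(counts, margin = 1), 1)
print(row.share)
```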

4. Industry Vertical vs City Location

mytable <- xtabs(~ CleanData3.df$IndustryVertical + CleanData3.df$CityLocation)
mytable
##                               CleanData3.df$CityLocation
## CleanData3.df$IndustryVertical Bangalore Gurgaon Hyderabad Mumbai
##              Consumer Internet       143      60        17    101
##              ECommerce                44      19         3     29
##              Education                 4       1         0      4
##              Healthcare                4       1         1      3
##              Logistics                 3       3         1      6
##              OtherSectors              4       2         2      7
##              Technology               65      11        11     31
##                               CleanData3.df$CityLocation
## CleanData3.df$IndustryVertical New Delhi Others Pune
##              Consumer Internet        71     55   12
##              ECommerce                28     21    3
##              Education                 2      4    0
##              Healthcare                3      2    1
##              Logistics                 2      1    0
##              OtherSectors              7      5    1
##              Technology               17     34   20

Clearly, Bangalore lives up to its reputation as the tech hub of India, with the most funded technology startups (65 of 189). Consumer Internet startups dominate the other sectors in Mumbai (101 of 181) and in Delhi and the NCR (71 of 130 in New Delhi, 60 of 97 in Gurgaon).

5. Industry Vertical vs Investment type

mytable <- xtabs(~ CleanData3.df$IndustryVertical + CleanData3.df$InvestmentType)
mytable
##                               CleanData3.df$InvestmentType
## CleanData3.df$IndustryVertical Debt Funding Private Equity Seed Funding
##              Consumer Internet            1            220          238
##              ECommerce                    0            101           46
##              Education                    0             10            5
##              Healthcare                   0             11            4
##              Logistics                    0             13            3
##              OtherSectors                 0             14           14
##              Technology                   0            107           82

ECommerce startups show far more Private Equity rounds (101) than Seed Funding rounds (46).

6. City Location vs Investment type

mytable <- xtabs(~ CleanData3.df$CityLocation + CleanData3.df$InvestmentType)
mytable
##                           CleanData3.df$InvestmentType
## CleanData3.df$CityLocation Debt Funding Private Equity Seed Funding
##                  Bangalore            0            157          110
##                  Gurgaon              0             59           38
##                  Hyderabad            0             15           20
##                  Mumbai               0            110           71
##                  New Delhi            0             55           75
##                  Others               1             61           60
##                  Pune                 0             19           18

In Mumbai and Bangalore, Private Equity rounds outnumber Seed Funding rounds (110 to 71 and 157 to 110 respectively). In New Delhi, Seed Funding is more common (75 to 55).

Now we look at how the amount of investment (USD) relates to the other factors: 1. Year 2. Industry Vertical 3. City Location 4. Investment type.

Plot of Amount of Investment

sum(cleanData.df$AmountInUSD)
## [1] 18347386476

So, considering the available data, about 18.35 billion USD has been invested in startups in India across the 1525 rounds with a disclosed amount. Looking at the distribution:

boxplot(cleanData.df$AmountInUSD, horizontal = TRUE, xlab = "Amount in USD", main = "Startup Investment plot")

Thus, the investment is heavily skewed. Most rounds are small (the median is about 1.07 million USD), while a few famous startups have raised very large rounds, up to 1.4 billion USD, from big firms like Alibaba, SoftBank etc.
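With this much skew, a boxplot on the raw scale is mostly a line of outliers. One common option, shown here only as a sketch, is to plot the amounts on a log10 scale. The vector amount is synthetic (drawn from a log-normal distribution) and merely stands in for cleanData.df$AmountInUSD.

```r
# Synthetic, heavily right-skewed amounts (NOT the real funding data)
set.seed(42)
amount <- rlnorm(500, meanlog = log(1e6), sdlog = 2)

# In a right-skewed sample the mean sits well above the median
skew.check <- c(mean = mean(amount), median = median(amount))
print(skew.check)

# Boxplot of log10(amount): each unit on the axis is a factor of 10 in USD
boxplot(log10(amount), horizontal = TRUE,
        xlab = "log10(Amount in USD)",
        main = "Startup Investment plot (log scale)")
```

On the log scale the box and whiskers become visible, so the bulk of the rounds can be compared with the extreme ones.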

boxplot(CleanData3.df$AmountInUSD ~ CleanData3.df$Date, horizontal = TRUE, xlab= "Amount of Investment in USD", ylab = "Year", main = "Year Wise Investment Analysis")

Clearly, 2017 has more outliers than 2016.

sum(cleanData.df$AmountInUSD[cleanData.df$Date == 2017])
## [1] 5846275500
sum(cleanData.df$AmountInUSD[cleanData.df$Date == 2016])
## [1] 3828088608

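Thus investment in 2017 up to August (about 5.85 billion USD) already exceeds that of the entire year 2016 (about 3.83 billion USD). These sums use cleanData.df, that is, every round with a disclosed amount.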

boxplot(CleanData3.df$AmountInUSD ~ CleanData3.df$IndustryVertical,horizontal = FALSE, ylab= "Amount of Investment in USD", xlab = "Industry Sector", main = "Industrial Sector Wise Investment Analysis", boxwex = 0.6, names = c("Con Int", "Ecomm", "Edu", "HealthCare", "Logist.", "Others", "Tech"))

mytable <- xtabs(CleanData3.df$AmountInUSD ~ CleanData3.df$IndustryVertical)
plot(mytable, ylab = "Amount of Investment", xlab = "Industry Sector")

Thus ECommerce and Consumer Internet seem to rule the roost when it comes to massive funding, although these two sectors also account for far more funding rounds than the others.
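Since a sector's total depends on how many rounds it has, it can help to put the number of rounds, the total and the median side by side. The sketch below uses tapply on a made-up data frame; toy.df and its amounts are invented, and the real analysis would use CleanData3.df instead.

```r
# Invented rounds for two sectors (illustration only)
toy.df <- data.frame(
  IndustryVertical = c("ECommerce", "ECommerce", "ECommerce",
                       "Education", "Education"),
  AmountInUSD      = c(1e6, 2e6, 5e7, 4e6, 6e6)
)

# One summary per sector: number of rounds, total amount, median amount
deals <- tapply(toy.df$AmountInUSD, toy.df$IndustryVertical, length)
total <- tapply(toy.df$AmountInUSD, toy.df$IndustryVertical, sum)
med   <- tapply(toy.df$AmountInUSD, toy.df$IndustryVertical, median)

by.sector <- cbind(deals, total, median = med)
print(by.sector)
```

In this toy example ECommerce has the larger total only because of one very large round, while Education has the higher median.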

boxplot(CleanData3.df$AmountInUSD ~ CleanData3.df$CityLocation, horizontal = FALSE, ylab = "Amount of Investment in USD", xlab = "City", main = "City Wise Investment Analysis", boxwex = 0.6, names = c("Bang", "Gurgaon", "Hyd", "Mum", "N. Delhi", "Others", "Pune"))

mytable <- xtabs(CleanData3.df$AmountInUSD ~ CleanData3.df$CityLocation)
plot(mytable, ylab = "Amount of Investment", xlab = "City")

Here Bangalore takes a very significant lead in the amount of investment, and Mumbai comes a distant second. Bangalore also has the most funded startups, but the boxplot shows that its outliers and a decent median value contribute to the higher total as well.

boxplot(CleanData3.df$AmountInUSD ~ CleanData3.df$InvestmentType,horizontal = FALSE, ylab= "Amount of Investment in USD", xlab = "Investment Type", main = "Investment type vs Investment Analysis")

mytable <- xtabs(CleanData3.df$AmountInUSD ~ CleanData3.df$InvestmentType)
plot(mytable, ylab = "Amount of Investment", xlab = "Type of Investment")

We have a clear winner here in Private Equity. This is partly because much Private Equity investment goes to startups that have already established themselves, whereas Seed Funding mainly happens at the incubation stage.

Now we create a dataset restricted to our 5 variables for the correlation and hypothesis tests.

Testdata.df <- CleanData3.df
Testdata.df$SNo <- NULL
Testdata.df$StartupName <- NULL
Testdata.df$SubVertical <- NULL
Testdata.df$InvestorsName <- NULL
View(Testdata.df)

Correlation of the numeric variables.

cor(Testdata.df$Date, Testdata.df$AmountInUSD)
## [1] 0.09190294

A positive but very weak correlation (about 0.09).
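cor() returns only the coefficient. To judge whether such a weak correlation could be due to chance, cor.test() also reports a p-value, and the Spearman variant is less sensitive to the extreme amounts. The sketch below runs on synthetic data: year and amount are invented stand-ins for Testdata.df$Date and Testdata.df$AmountInUSD, so the numbers it prints say nothing about the real dataset.

```r
# Synthetic stand-in data (NOT the real funding data)
set.seed(1)
year   <- rep(c(2016, 2017), times = c(60, 40))
amount <- rlnorm(100, meanlog = log(1e6) + 0.3 * (year == 2017), sdlog = 2)

# Pearson correlation with a significance test
pearson <- cor.test(year, amount)

# Rank-based alternative; exact = FALSE because the years are heavily tied
spearman <- cor.test(year, amount, method = "spearman", exact = FALSE)

print(c(pearson = unname(pearson$estimate), p = pearson$p.value))
print(c(spearman = unname(spearman$estimate), p = spearman$p.value))
```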

Corrgram Plot

library(corrgram)
corrgram(Testdata.df, order=TRUE, lower.panel=panel.shade,
        upper.panel=panel.pie, text.panel=panel.txt,
        main="Corrgram of correlations between Startup funding variables")

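Performing chi-squared tests of independence. Throughout, we reject the null hypothesis of independence only when the p-value is below 0.005.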

  1. Year vs Industry Vertical
mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$IndustryVertical)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 9.959, df = 6, p-value = 0.1264

The p-value (0.126) is not low enough to reject the null hypothesis of independence.

  2. Year vs City Location
mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$CityLocation)
chisq.test(mytable)
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 13.022, df = 6, p-value = 0.04269

A lower p-value (0.043). It is below the conventional 0.05 level but not below the stricter 0.005 threshold used here, so we do not reject the null hypothesis.

  3. Year vs Investment Type
mytable <- xtabs(~ CleanData3.df$Date + CleanData3.df$InvestmentType)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 17.733, df = 2, p-value = 0.000141

The p-value is below 0.005, which indicates that the year and the type of investment made in that year are not independent.

  4. Industry Vertical vs City Location
mytable <- xtabs(~ CleanData3.df$IndustryVertical + CleanData3.df$CityLocation)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 60.908, df = 36, p-value = 0.005871

On the brink (p = 0.0059), but we still cannot safely reject the null hypothesis at the 0.005 threshold.

  5. Industry Vertical vs Investment type
mytable <- xtabs(~ CleanData3.df$IndustryVertical + CleanData3.df$InvestmentType)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 28.816, df = 12, p-value = 0.004196

The p-value is below 0.005, signifying that the industry sector of a startup and the investment type it receives are not independent.

  6. City Location vs Investment type
mytable <- xtabs(~ CleanData3.df$CityLocation + CleanData3.df$InvestmentType)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 23.222, df = 12, p-value = 0.0259

Unable to reject the null hypothesis at the 0.005 threshold (p = 0.026).
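Several of the tests above print the warning "Chi-squared approximation may be incorrect". R raises it when some expected cell counts are small. The sketch below rebuilds the Year vs Investment Type table from the counts printed earlier, inspects the expected counts, and reruns the test with a Monte Carlo p-value through the simulate.p.value argument of chisq.test. This is one possible way to handle the warning, not the method used in the original analysis.

```r
# Counts copied from the Year vs Investment Type table
counts <- matrix(c(0, 293, 291,
                   1, 183, 101),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Year = c("2016", "2017"),
                                 InvestmentType = c("Debt Funding",
                                                    "Private Equity",
                                                    "Seed Funding")))

# Standard test; the warning is suppressed here because we inspect its cause next
res <- suppressWarnings(chisq.test(counts))
print(round(res$expected, 2))   # the Debt Funding column has expected counts below 1

# Monte Carlo p-value: does not rely on the large-sample approximation
set.seed(123)
sim <- chisq.test(counts, simulate.p.value = TRUE, B = 10000)
print(sim$p.value)
```

Here the single Debt Funding round is what triggers the warning; dropping or merging such a sparse category is another option.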

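Comparing the amount of investment with the other 4 factors. These tests treat every distinct amount as its own category, which leaves most cells with very small expected counts, so the chi-squared approximation warnings below should be taken seriously.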

  1. Industrial Sector
mytable <- xtabs(~ CleanData3.df$IndustryVertical + CleanData3.df$AmountInUSD)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 1495.2, df = 1416, p-value = 0.07019

The p-value (0.07) is too high to reject the null hypothesis.

  2. City in which the startup is based

mytable <- xtabs(~ CleanData3.df$AmountInUSD + CleanData3.df$CityLocation)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 1520.2, df = 1416, p-value = 0.0271

The p-value (0.027) is below 0.05 but above the 0.005 threshold, so we do not reject the null hypothesis.

  3. Investment type

mytable <- xtabs(~ CleanData3.df$AmountInUSD + CleanData3.df$InvestmentType)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 1536.9, df = 472, p-value < 2.2e-16

A very low p-value, signifying that investment type and amount of investment are not independent.

  4. Year of investment

mytable <- xtabs(~ CleanData3.df$AmountInUSD + CleanData3.df$Date)
chisq.test(mytable)
## Warning in chisq.test(mytable): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  mytable
## X-squared = 288.6, df = 236, p-value = 0.01097

A low p-value (0.011), but not low enough to reject the null hypothesis at the 0.005 threshold.
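Because AmountInUSD is numeric, a test that compares the distribution of amounts between groups is usually a better fit than a chi-squared test on a table with one column per distinct amount. The sketch below uses the Kruskal-Wallis rank sum test (kruskal.test), which is a different technique from the one used above and is shown only as an alternative. The data frame toy.df is synthetic and the resulting numbers do not describe the real dataset.

```r
# Synthetic amounts for two investment types (NOT the real funding data)
set.seed(7)
toy.df <- data.frame(
  InvestmentType = rep(c("Private Equity", "Seed Funding"), each = 50),
  AmountInUSD    = c(rlnorm(50, meanlog = log(5e6), sdlog = 1.5),
                     rlnorm(50, meanlog = log(3e5), sdlog = 1))
)

# Group medians, for orientation
group.median <- tapply(toy.df$AmountInUSD, toy.df$InvestmentType, median)
print(group.median)

# Rank-based test: robust to the heavy skew and the extreme rounds
kw <- kruskal.test(AmountInUSD ~ InvestmentType, data = toy.df)
print(kw)
```

With the real data, the same call could be run with data = CleanData3.df and with IndustryVertical, CityLocation or Date on the right-hand side.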