Y Combinator is one of the most renowned startup accelerators in the world. Since its founding in 2005, Y Combinator has invested in more than 2,000 startups, including outstanding names such as Airbnb, Dropbox, Coinbase, and Reddit. The Y Combinator investment dataset, which contains information on startups that received funding from Y Combinator between 2005 and 2019, provides a valuable resource for exploring trends and patterns in startup funding.
In this project, we will explore the Y Combinator investment dataset and apply several data mining techniques to gain insights into the characteristics of successful Y Combinator startups. We will focus on three main techniques: clustering, dimension reduction, and association rules.
First, we will use clustering techniques to group startups based on their categories and total amount of money raised, with the goal of identifying groups of companies with similar funding patterns and business focuses.
Finally, we will apply association rule mining techniques to identify relationships between different features of the Y Combinator investment dataset, such as the relationship between the funding amount and the type of industry.
Overall, the aim of this project is to provide an analysis of the Y Combinator investment dataset and to gain insights into the characteristics of successful Y Combinator startups. By applying these data mining techniques, we hope to provide valuable insights for entrepreneurs, investors, and anyone interested in the startup ecosystem.
To do the analysis, the following packages will be needed.
packages <- c("tidyverse", "corrplot", "ggplot2", "zoo", "mice",
"patchwork", "tibble", "gridExtra", "ClusterR", "cluster", "flexclust", "clustertend", "ggthemes", "plotly", "jpeg", "dplyr", "arules", "arulesViz", "factoextra")
# Attach each package, installing it first if it is missing
for (package in packages) {
  if (!require(package, character.only = TRUE)) {
    install.packages(package)
    library(package, character.only = TRUE)
  }
}
The dataset contains information on startups that received funding from Y Combinator between 2005 and 2019. It covers over 2,000 startups and includes each startup’s name, category, investors, funding stage, funding type, funding amount, pre-money valuation, and more.
Some key features of the dataset include:
Funding amount: The total amount of money that the startup received from Y Combinator.
Funding stage: The stage of funding that the startup received (e.g., Seed, Early Stage Venture).
Category: The category or industry that the startup operates in (e.g., healthcare, e-commerce, education).
Funding type: The type of funding that the startup received (e.g., pre-seed, seed, series A, series B).
# Load the dataset
data <- read.csv("YC.csv")
# Show the column names
colnames(data)
## [1] "Name"
## [2] "Transaction.Name"
## [3] "Funding.Type"
## [4] "Money.Raised.Currency..in.USD."
## [5] "Announced.Date"
## [6] "Funding.Stage"
## [7] "Pre.Money.Valuation.Currency..in.USD."
## [8] "Description"
## [9] "Categories"
## [10] "Location"
## [11] "Website"
## [12] "Revenue.Range"
## [13] "Total.Funding.Amount.Currency..in.USD."
## [14] "Funding.Status"
## [15] "Number.of.Funding.Rounds"
## [16] "Lead.Investors"
## [17] "Investor.Names"
## [18] "Number.of.Investors"
## [19] "Number.of.Partner.Investors"
Let’s look at the first six rows to get a brief overview of the data.
head(data)
## Name Transaction.Name Funding.Type
## 1 Copia Seed Round - Copia Seed
## 2 Suiteness Series A - Suiteness Series A
## 3 Astranis Seed Round - Astranis Seed
## 4 Shield Bio Seed Round - Shield Bio Seed
## 5 Platzi Seed Round - Platzi Seed
## 6 Kisan Network Seed Round - Kisan Network Seed
## Money.Raised.Currency..in.USD. Announced.Date Funding.Stage
## 1 3100000 2016-03-22 Seed
## 2 5000000 2016-12-14 Early Stage Venture
## 3 120000 2016-03-22 Seed
## 4 4100000 2017-02-23 Seed
## 5 NA 2014-12-01 Seed
## 6 NA 2016-06-01 Seed
## Pre.Money.Valuation.Currency..in.USD.
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## Description
## 1 Copia is the next-generation technology platform for food waste management.
## 2 Suiteness is a free-to-join hotel booking website connecting hotel rooms and suites.
## 3 Astranis is building small, low-cost telecommunications satellites.
## 4 Shield Bio - using ultra-fast sequencing to prevent antibiotic resistance
## 5 Platzi is an effective online education platform that offers classes on marketing, learn coding, business, and design.
## 6 Kisan Network is an online marketplace for Indian agriculture.
## Categories
## 1 Analytics, Communities, Enterprise, Enterprise Software, Marketplace, SaaS, Sharing Economy, Sustainability, Waste Management
## 2 Family, Hospitality, Hotel, Leisure, Reservations, Travel
## 3 Aerospace, Internet, Telecommunications
## 4 Biotechnology, Genetics, Health Care, Health Diagnostics
## 5 Education, Edutainment, Recruiting, Training
## 6 Agriculture, AgTech, E-Commerce, Mobile
## Location
## 1 San Francisco, California, United States, North America
## 2 Oakland, California, United States, North America
## 3 San Francisco, California, United States, North America
## 4 San Jose, California, United States, North America
## 5 Mountain View, California, United States, North America
## 6 Gurgaon, Haryana, India, Asia
## Website Revenue.Range
## 1 https://www.GoCopia.com/ $1M to $10M
## 2 https://www.suiteness.com $1M to $10M
## 3 http://www.astranis.com/ $1M to $10M
## 4 http://shieldbio.com
## 5 https://platzi.com Less than $1M
## 6 http://www.kisannetwork.com
## Total.Funding.Amount.Currency..in.USD. Funding.Status
## 1 4580000 Seed
## 2 6000000 Early Stage Venture
## 3 13619998 Early Stage Venture
## 4 12100000 Early Stage Venture
## 5 16428315 Early Stage Venture
## 6 38300 Seed
## Number.of.Funding.Rounds Lead.Investors
## 1 2 Structure Capital
## 2 3 Bullpen Capital, Global Founders Capital
## 3 7
## 4 2 Andreessen Horowitz
## 5 5
## 6 3
## Investor.Names
## 1 8VC, Alps Investing Holdings LLC, Chivas Venture, Cynthia Ringo, David Pottruck, Emerson Collective, Eucalyptus Burlingame LLC, Jahan Ali, Jillian Manus, John Solomon, Jordan Kretchmer, Ken Tam, Lutetilla LLC, Lynett Capital, Maples Burlingame LLC, Mitchell Kapor, Moment Ventures, Nurzhas Makishev, Riggs Capital Partners, Steve Case, Structure Capital, Toyota USA, Y Combinator
## 2 AltaIR Capital, Bullpen Capital, David Hauser, FundersClub, Global Founders Capital, HVF Labs, Jared Ablon, Kima Ventures, MetaProp NYC, Muhsen Syed, Rocket Internet, Roland Tanner, SciFi VC, Scott Banister, Tilo Bonow, Y Combinator
## 3 ACE & Company, Fifty Years, Jaan Tallinn, Lars Rasmussen, Refactor Capital, S2 Capital, Samvit Ramadurgam, Wei Guo, Y Combinator
## 4 Andreessen Horowitz, Friále, Josh Buckley, Refactor Capital, SGH CAPITAL, Soma Capital, Y Combinator
## 5 500 Startups, Amasia, BoomStartup, Deepak Desai, Elies Campo, FundersClub, GE32 Capital, Graph Ventures, Josh Jones, Mind the Seed - MTS Fund, TA Ventures, Thomas Floracks, Y Combinator, Zillionize Angel
## 6 FundersClub, Venture Highway, Y Combinator
## Number.of.Investors Number.of.Partner.Investors
## 1 23 10
## 2 16 2
## 3 9 NA
## 4 7 6
## 5 14 NA
## 6 3 NA
In the analysis we will use only Name, Funding type, Money raised, Announced date, Categories, Location, Total funding amount, Number of funding rounds, and Investor names. Therefore, we need to subset our data frame:
data <- data[,c("Name", "Funding.Type","Money.Raised.Currency..in.USD.", "Announced.Date","Categories","Location","Total.Funding.Amount.Currency..in.USD.","Number.of.Funding.Rounds","Investor.Names")]
Let’s first analyze the funding type.
#Funding type
table(data$Funding.Type)
##
## Angel Convertible Note Corporate Round
## 30 10 1
## Debt Financing Funding Round Grant
## 2 11 8
## Pre-Seed Product Crowdfunding Seed
## 37 1 2188
## Series A Series B Series C
## 155 61 27
## Series D Series E Series F
## 5 3 2
## Venture - Series Unknown
## 44
As we can see, the dataset covers all kinds of funding rounds, so the amounts are not directly comparable. Since Y Combinator is an accelerator, let’s focus only on the Seed and Pre-Seed rounds.
# Keep only Seed and Pre-Seed rounds
data <- data[data$Funding.Type %in% c("Seed", "Pre-Seed"), ]
table(data$Funding.Type)
##
## Pre-Seed Seed
## 37 2188
Let’s explore the Money.Raised.Currency..in.USD. and Total.Funding.Amount.Currency..in.USD. columns, which will be used in the further analysis.
#checking the Money.Raised.Currency..in.USD.
options(scipen = 999)
summary(data$Money.Raised.Currency..in.USD.)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6000 120000 120000 788232 1000000 20000000 667
#Checking the Total.Funding.Amount.Currency..in.USD.
summary(data$Total.Funding.Amount.Currency..in.USD.)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 10000 150000 1637500 23234822 6620000 5268800000 441
As we can see we have 667 NA values for Money.Raised.Currency..in.USD. and 441 NA values for Total.Funding.Amount.Currency..in.USD. columns.
# Data Cleaning and Initial Feature Engineering
We will start with some initial feature engineering. First, we split the location into four separate columns (City, State, Country, Continent) and create a new column named “year” that contains the year in which the funding was announced.
# Create a new column 'year' based on Announced.Date
data <- data %>% mutate(year = lubridate::year(Announced.Date))
# Create new columns "City", "State", "Country", "Continent" based on Location
data <- data %>%
separate(Location, into = c("City", "State", "Country", "Continent"), sep = ", ", remove = FALSE)
## Warning: Expected 4 pieces. Missing pieces filled with `NA` in 56 rows [53, 55, 191,
## 297, 327, 646, 830, 834, 835, 837, 847, 848, 854, 855, 857, 858, 861, 862, 874,
## 884, ...].
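The warning above means that 56 locations have fewer than four comma-separated pieces. By default `separate()` fills the missing pieces with `NA` on the right, so for a short location the last piece present does not land in `Continent`. The following base R sketch shows one possible alternative. It assumes (this has not been checked against the dataset) that short locations omit the leading fields, and `split_location` is a hypothetical helper that is not part of the original analysis.

```r
# Hypothetical helper: split "City, State, Country, Continent" strings and
# pad with NA on the LEFT when fewer than four pieces are present.
split_location <- function(x, fields = c("City", "State", "Country", "Continent")) {
  pieces <- strsplit(x, ", ", fixed = TRUE)
  padded <- lapply(pieces, function(p) {
    p <- tail(p, length(fields))  # keep at most the last four pieces
    c(rep(NA_character_, length(fields) - length(p)), p)
  })
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(out) <- fields
  out
}

# Toy input (made-up rows, not taken from YC.csv)
toy_locations <- c("San Francisco, California, United States, North America",
                   "Singapore, Asia")
res <- split_location(toy_locations)
print(res)
```

In the second toy row the continent stays in `Continent` and the missing `City` and `State` become `NA`. Whether left or right padding is appropriate depends on how the short locations in the real data are structured.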
The Total.Funding.Amount.Currency..in.USD. and Money.Raised.Currency..in.USD. columns are crucial for the further analysis, so we can neither replace the missing values with zeros nor simply drop those rows. Instead, we need to fill them in. When Number.of.Funding.Rounds is equal to 1, a missing Total.Funding.Amount.Currency..in.USD. can be replaced with the value from Money.Raised.Currency..in.USD., because the single round accounts for all of the funding. In the other cases we use an imputation method. We first tested predictive mean matching, Bayesian linear regression, and logistic regression to see which method works best in each case. Based on this pre-analysis, we decided to use regression-based imputation for Money.Raised.Currency..in.USD. (the code below uses the "norm.predict" method of mice) and predictive mean matching for Total.Funding.Amount.Currency..in.USD., as the results were the most promising.
impute_data1 <- data[, c( "year","Money.Raised.Currency..in.USD.","Investor.Names","Location")]
# Perform regression imputation
imputed_data1 <- mice(impute_data1, method = "norm.predict")
##
## iter imp variable
## 1 1 Money.Raised.Currency..in.USD.
## 1 2 Money.Raised.Currency..in.USD.
## 1 3 Money.Raised.Currency..in.USD.
## 1 4 Money.Raised.Currency..in.USD.
## 1 5 Money.Raised.Currency..in.USD.
## 2 1 Money.Raised.Currency..in.USD.
## 2 2 Money.Raised.Currency..in.USD.
## 2 3 Money.Raised.Currency..in.USD.
## 2 4 Money.Raised.Currency..in.USD.
## 2 5 Money.Raised.Currency..in.USD.
## 3 1 Money.Raised.Currency..in.USD.
## 3 2 Money.Raised.Currency..in.USD.
## 3 3 Money.Raised.Currency..in.USD.
## 3 4 Money.Raised.Currency..in.USD.
## 3 5 Money.Raised.Currency..in.USD.
## 4 1 Money.Raised.Currency..in.USD.
## 4 2 Money.Raised.Currency..in.USD.
## 4 3 Money.Raised.Currency..in.USD.
## 4 4 Money.Raised.Currency..in.USD.
## 4 5 Money.Raised.Currency..in.USD.
## 5 1 Money.Raised.Currency..in.USD.
## 5 2 Money.Raised.Currency..in.USD.
## 5 3 Money.Raised.Currency..in.USD.
## 5 4 Money.Raised.Currency..in.USD.
## 5 5 Money.Raised.Currency..in.USD.
## Warning: Number of logged events: 2
imputed_values1 <- complete(imputed_data1)
# Replace NA values in original dataset with imputed values
data$Money.Raised.Currency..in.USD.[is.na(data$Money.Raised.Currency..in.USD.)] <- imputed_values1$Money.Raised.Currency..in.USD.[is.na(data$Money.Raised.Currency..in.USD.)]
summary(data$Money.Raised.Currency..in.USD.)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6000 120000 384051 752592 827710 20000000
impute_data2 <- data[, c("Total.Funding.Amount.Currency..in.USD.", "year","Number.of.Funding.Rounds","Investor.Names")]
# Replace NA with values from "Money.Raised.Currency..in.USD." if "Number.of.Funding.Rounds" is equal to 1
data$Total.Funding.Amount.Currency..in.USD.[is.na(data$Total.Funding.Amount.Currency..in.USD.) & data$Number.of.Funding.Rounds == 1] <- data$Money.Raised.Currency..in.USD.[is.na(data$Total.Funding.Amount.Currency..in.USD.) & data$Number.of.Funding.Rounds == 1]
# The norm method produced negative values here, so predictive mean matching (pmm) is used instead.
# Impute missing values using mice package
imputed_data2 <- mice(impute_data2, method = "pmm", m=4)
##
## iter imp variable
## 1 1 Total.Funding.Amount.Currency..in.USD.
## 1 2 Total.Funding.Amount.Currency..in.USD.
## 1 3 Total.Funding.Amount.Currency..in.USD.
## 1 4 Total.Funding.Amount.Currency..in.USD.
## 2 1 Total.Funding.Amount.Currency..in.USD.
## 2 2 Total.Funding.Amount.Currency..in.USD.
## 2 3 Total.Funding.Amount.Currency..in.USD.
## 2 4 Total.Funding.Amount.Currency..in.USD.
## 3 1 Total.Funding.Amount.Currency..in.USD.
## 3 2 Total.Funding.Amount.Currency..in.USD.
## 3 3 Total.Funding.Amount.Currency..in.USD.
## 3 4 Total.Funding.Amount.Currency..in.USD.
## 4 1 Total.Funding.Amount.Currency..in.USD.
## 4 2 Total.Funding.Amount.Currency..in.USD.
## 4 3 Total.Funding.Amount.Currency..in.USD.
## 4 4 Total.Funding.Amount.Currency..in.USD.
## 5 1 Total.Funding.Amount.Currency..in.USD.
## 5 2 Total.Funding.Amount.Currency..in.USD.
## 5 3 Total.Funding.Amount.Currency..in.USD.
## 5 4 Total.Funding.Amount.Currency..in.USD.
## Warning: Number of logged events: 1
imputed_values2 <- complete(imputed_data2)
# Replace remaining NA with imputed values
data$Total.Funding.Amount.Currency..in.USD.[is.na(data$Total.Funding.Amount.Currency..in.USD.)] <- imputed_values2$Total.Funding.Amount.Currency..in.USD.[is.na(data$Total.Funding.Amount.Currency..in.USD.)]
data$Total.Funding.Amount.Currency..in.USD. <- ifelse(data$Total.Funding.Amount.Currency..in.USD. < data$Money.Raised.Currency..in.USD., data$Money.Raised.Currency..in.USD., data$Total.Funding.Amount.Currency..in.USD.)
summary(data$Total.Funding.Amount.Currency..in.USD.)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10000 150000 908376 18993330 4700000 5268800000
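The single-round rule and the final consistency check used above can be illustrated on a toy data frame. This is a minimal sketch with made-up numbers and shortened column names (`raised`, `total`, `rounds` are placeholders, not columns of the dataset).

```r
# Toy data: total funding is missing for the first two startups
toy <- data.frame(
  raised = c(100, 200, 300),
  total  = c(NA, NA, 250),
  rounds = c(1, 3, 2)
)

# Rule 1: with a single funding round, total funding equals the money raised
single <- which(is.na(toy$total) & toy$rounds == 1)
toy$total[single] <- toy$raised[single]

# Rule 2: total funding cannot be lower than the money raised in one round
# (pmax keeps NA where the total is still missing)
toy$total <- pmax(toy$total, toy$raised)

print(toy)
```

After both steps the first row takes its total from the single round, the third row is lifted from 250 to 300, and the second row stays `NA`, which is the kind of value left for the model-based imputation.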
Let’s first summarize how many startup funding rounds were announced per year.
# Convert the Announced.Date column to a date format
data$Announced.Date <- as.Date(data$Announced.Date, format = "%Y-%m-%d")
# Count the number of startups announced per year
startup_count_per_year <- table(format(data$Announced.Date, "%Y"))
print(startup_count_per_year)
##
## 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
## 10 18 35 43 49 66 99 164 118 233 203 254 286 284 363
# create bar plot
ggplot(data = data.frame(Year = names(startup_count_per_year), Count = as.numeric(startup_count_per_year)),
aes(x = Year, y = Count, fill = Year)) +
geom_col() +
scale_fill_viridis_d() +
ggtitle("Number of Startups Announced per Year") +
xlab("Year") +
ylab("Count") +
theme_minimal()
# Aggregate the total money raised by year
money_raised_per_year <- data %>%
group_by(year) %>%
summarise(total_money_raised = sum(`Money.Raised.Currency..in.USD.`))
print(money_raised_per_year)
## # A tibble: 15 × 2
## year total_money_raised
## <dbl> <dbl>
## 1 2005 2562029.
## 2 2006 5802767.
## 3 2007 8732676.
## 4 2008 16080359.
## 5 2009 19576669.
## 6 2010 32847560.
## 7 2011 63621007.
## 8 2012 155529115.
## 9 2013 76273350.
## 10 2014 145391602.
## 11 2015 125109456.
## 12 2016 170194789.
## 13 2017 261995823.
## 14 2018 310073623.
## 15 2019 280726028.
# Plot total money raised by year
data %>%
group_by(year) %>%
summarise(total_money_raised = sum(`Money.Raised.Currency..in.USD.`)) %>%
ggplot(aes(x = year, y = total_money_raised, fill = factor(year))) +
geom_bar(stat = "identity", color = "black", alpha = 0.8) +
scale_fill_viridis_d() +
ggtitle("Total Money Raised by Startups Announced per Year") +
xlab("Year") +
ylab("Money Raised (USD)")
# select the top 10 countries for Asia + Oceania only
AsiaOceania_data <- data %>%
  filter(Continent %in% c("Asia", "Oceania")) %>%  # %in% matches either continent; == would recycle the vector
  group_by(Country) %>%
  summarize(Total_Money_Raised = sum(Money.Raised.Currency..in.USD.)) %>%
  top_n(10) %>%
  ungroup()
## Selecting by Total_Money_Raised
# create the bar chart
ggplot(AsiaOceania_data, aes(x = reorder(Country, Total_Money_Raised), y = Total_Money_Raised, fill = Country)) +
geom_bar(stat = "identity") +
geom_text(aes(label = scales::comma(Total_Money_Raised)),
position = position_stack(vjust = 1.0),
color = "black", size = 3) +
labs(title = "Total Money Raised by Top 10 Countries in Asia and Oceania",
x = "Country",
y = "Total Money Raised (in USD)",
fill = "") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")
# select the top 5 countries for Africa only
Africa_data <- data %>%
group_by(Continent) %>%
filter(Continent == "Africa") %>%
group_by(Country) %>%
summarize(Total_Money_Raised = sum(Money.Raised.Currency..in.USD.)) %>%
top_n(5) %>%
ungroup()
## Selecting by Total_Money_Raised
# create the bar chart
ggplot(Africa_data, aes(x = reorder(Country, Total_Money_Raised), y = Total_Money_Raised, fill = Country)) +
geom_bar(stat = "identity") +
geom_text(aes(label = scales::comma(Total_Money_Raised)),
position = position_stack(vjust = 1.0),
color = "black", size = 4) +
labs(title = "Total Money Raised by Top 5 Countries in Africa",
x = "Country",
y = "Total Money Raised (in USD)",
fill = "") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")
# select the top 3 countries for North America only
North_America_data <- data %>%
group_by(Continent) %>%
filter(Continent == "North America") %>%
group_by(Country) %>%
summarize(Total_Money_Raised = sum(Money.Raised.Currency..in.USD.)) %>%
top_n(3) %>%
ungroup()
## Selecting by Total_Money_Raised
# create the bar chart
ggplot(North_America_data, aes(x = reorder(Country, Total_Money_Raised), y = Total_Money_Raised, fill = Country)) +
geom_bar(stat = "identity") +
geom_text(aes(label = scales::comma(Total_Money_Raised)),
position = position_stack(vjust = 1.0),
color = "black", size = 5)+
labs(title = "Total Money Raised by Top 3 Countries in North America",
x = "Country",
y = "Total Money Raised (in USD)",
fill = "") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")
# select the top 5 countries for South America only
South_America_data <- data %>%
group_by(Continent) %>%
filter(Continent == "South America") %>%
group_by(Country) %>%
summarize(Total_Money_Raised = sum(Money.Raised.Currency..in.USD.)) %>%
top_n(5) %>%
ungroup()
## Selecting by Total_Money_Raised
# create the bar chart
ggplot(South_America_data, aes(x = reorder(Country, Total_Money_Raised), y = Total_Money_Raised, fill = Country)) +
geom_bar(stat = "identity") +
geom_text(aes(label = scales::comma(Total_Money_Raised)),
position = position_stack(vjust = 1.0),
color = "black", size = 5)+
labs(title = "Total Money Raised by Top 5 Countries in South America",
x = "Country",
y = "Total Money Raised (in USD)",
fill = "") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")
Continents_data <- data%>%
group_by(Continent) %>%
summarize(Total_Money_Raised = sum(Money.Raised.Currency..in.USD.)) %>%
ungroup()
ggplot(Continents_data, aes(x = reorder(Continent, Total_Money_Raised), y = Total_Money_Raised, fill = Continent)) +
geom_bar(stat = "identity") +
geom_text(aes(label = scales::comma(Total_Money_Raised)),
position = position_stack(vjust = 1.0),
color = "black", size = 3)+
labs(title = "Total Money Raised by Continent",
x = "Continent",
y = "Total Money Raised (in USD)",
fill = "") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")
In this analysis, ROI is a proxy computed as the ratio of total funding to the money raised in the Seed or Pre-Seed round, expressed as a percentage. It indicates how efficiently the companies in each country turned their initial round into further funding: a higher ROI means that the companies in that country attracted more total funding relative to the money initially raised. This could be due to various reasons, such as lower operating costs, higher profitability, or better business models, and it is a good signal to take a closer look at those countries in future rounds.
# Compute the ROI by country
Country_by_Roi <- data %>%
group_by(Country) %>%
summarise(ROI = (sum(Total.Funding.Amount.Currency..in.USD.)/sum(Money.Raised.Currency..in.USD.))*100) %>%
filter(!is.na(ROI))%>%
arrange(desc(ROI)) %>%
head(10)
ggplot(Country_by_Roi, aes(reorder(Country, ROI), ROI, fill = Country)) +
geom_bar(stat = "identity") +
xlab("Country") +
ylab("ROI (%)") +
ggtitle("Top 10 Countries with the Highest ROI") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
The output shows the top 10 countries with the highest ROI (Return on Investment). Portugal has the highest ROI of 75983%, followed by Hungary with 19598%, and Colombia with 8730%. The rest of the countries on the list have ROIs ranging from 7517% to 1633%.
Continent_by_Roi <- data %>%
group_by(Continent) %>%
summarise(ROI = (sum(Total.Funding.Amount.Currency..in.USD.)/sum(Money.Raised.Currency..in.USD.))*100) %>%
filter(!is.na(ROI))%>%
arrange(desc(ROI))
ggplot(Continent_by_Roi, aes(reorder(Continent, ROI), ROI, fill = Continent)) +
geom_bar(stat = "identity") +
xlab("Continent") +
ylab("ROI (%)") +
ggtitle("ROI by Continent") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
The output shows the ROI (Return on Investment) of companies grouped by continent. The continent with the highest ROI is South America, with an ROI of 4554%, followed by North America with 2712% and Oceania with 1674%. The rest of the continents on the list have ROIs ranging from 916% to 105%.
These numbers suggest that companies in South America and North America are generating higher returns on investment compared to other continents. The data suggest that these regions may be worth closer attention for future investment opportunities.
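To make the ROI figures easier to read, the ratio can be reproduced by hand on a toy table. The sketch below uses made-up numbers and placeholder names (`raised`, `total`, countries "A" and "B"); it only mirrors the formula sum(total funding) / sum(money raised) * 100 used in the code above.

```r
# Toy data: two startups in country A, one in country B
toy <- data.frame(
  Country = c("A", "A", "B"),
  raised  = c(100000, 150000, 120000),  # money raised in the seed round
  total   = c(500000, 250000, 120000)   # total funding to date
)

# Sum both columns per country, then take the ratio as a percentage
sums <- aggregate(cbind(raised, total) ~ Country, data = toy, FUN = sum)
sums$ROI <- sums$total / sums$raised * 100
print(sums)
```

Country A ends up at 300% and country B at 100%. A value of 100% therefore means that total funding equals the money raised in the initial round, so only values above 100% indicate follow-on funding.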
The startup market is characterized by high returns associated with high risk. In the data this translates into a few startups with very high total funding (Total.Funding.Amount.Currency..in.USD.) next to a large number of much smaller investments. Therefore, we first need to remove the outliers before clustering.
remove_outliers <- function(df, var) {
Q1 <- quantile(df[[var]], 0.25, na.rm = TRUE)
Q3 <- quantile(df[[var]], 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
df <- df[df[[var]] >= Q1 - 1.5 * IQR & df[[var]] <= Q3 + 1.5 * IQR,]
return(df)
}
# Remove outliers from Money.Raised.Currency..in.USD.
data <- remove_outliers(data, "Money.Raised.Currency..in.USD.")
# Remove outliers from Total.Funding.Amount.Currency..in.USD.
data <- remove_outliers(data, "Total.Funding.Amount.Currency..in.USD.")
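The function above applies the usual 1.5 * IQR rule. As a small worked example on made-up values (not the funding data), the fences can be computed step by step:

```r
x <- c(1, 2, 3, 4, 5, 100)

q1  <- quantile(x, 0.25, names = FALSE)  # 2.25
q3  <- quantile(x, 0.75, names = FALSE)  # 4.75
iqr <- q3 - q1                           # 2.5

lower <- q1 - 1.5 * iqr                  # -1.5
upper <- q3 + 1.5 * iqr                  # 8.5

kept <- x[x >= lower & x <= upper]
print(kept)                              # 100 falls above the upper fence and is dropped
```

Because funding amounts are strongly right-skewed, in practice this rule mostly trims the upper tail, which is where the very large rounds sit.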
Before clustering, we first need to normalize the data.
# Build the clustering data frame and standardize both variables
Clustering_Raised_Total <- data.frame(Money_Raised = data$Money.Raised.Currency..in.USD., Total_Funding = data$Total.Funding.Amount.Currency..in.USD.)
Clustering_Norm <- as.data.frame(scale(Clustering_Raised_Total))
# Original data
ggplot(Clustering_Raised_Total, aes(x=Money_Raised, y=Total_Funding)) +
geom_point() +
labs(title="Original data") +
theme_bw()
# Normalized data
ggplot(Clustering_Norm, aes(x=Money_Raised, y=Total_Funding)) +
geom_point() +
labs(title="Normalized data") +
theme_bw()
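The normalization done by `scale()` above is a z-score: each column is centered on its mean and divided by its standard deviation. A minimal check on made-up amounts:

```r
x <- c(120000, 500000, 1000000, 2000000)

z_manual <- (x - mean(x)) / sd(x)   # z-score computed by hand
z_scale  <- as.numeric(scale(x))    # same result from scale()

print(all.equal(z_manual, z_scale))
print(round(c(mean = mean(z_scale), sd = sd(z_scale)), 6))
```

After scaling, both variables have mean 0 and standard deviation 1, so neither dominates the Euclidean distances used by K-means and PAM merely because of its units.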
f1 <- fviz_nbclust(Clustering_Norm, FUNcluster = kmeans, method = "silhouette") +
ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(Clustering_Norm, FUNcluster = cluster::pam, method = "silhouette") +
ggtitle("Optimal number of clusters \n PAM")
grid.arrange(f1, f2, ncol=2)
For K-means we will use 5 clusters even though the graph suggests 10, because the average silhouette width is almost the same and we prefer fewer clusters.
For PAM clustering we will use 8 clusters according to the method.
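The silhouette criterion behind these plots can be sketched by hand. The example below is an illustration on synthetic data (three artificial blobs, not the YC data) and assumes the `cluster` package is installed; it computes the average silhouette width for several values of k, which is roughly what `fviz_nbclust(..., method = "silhouette")` summarizes.

```r
library(cluster)  # provides silhouette()

set.seed(42)
# Synthetic data: three well-separated blobs of 30 points each
blob_means <- c(0, 3, 6)
toy <- do.call(rbind, lapply(blob_means, function(m) {
  matrix(rnorm(60, mean = m, sd = 0.3), ncol = 2)
}))

d  <- dist(toy)
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(toy, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- ks

print(round(avg_sil, 2))
best_k <- ks[which.max(avg_sil)]
print(best_k)
```

When several values of k give nearly the same average width, as in the K-means plot above, choosing the smaller k is a reasonable way to keep the clusters interpretable.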
km5 <- eclust(Clustering_Norm, k = 5, FUNcluster = "kmeans", hc_metric = "euclidean", graph = FALSE)
# Note the spelling of ellipse.type; a misspelled argument is not applied
c2 <- fviz_cluster(km5, data = Clustering_Norm, ellipse.type = "convex", geom = c("point")) + ggtitle("K-means with 5 clusters")
s2 <- fviz_silhouette(km5)
## cluster size ave.sil.width
## 1 1 826 0.68
## 2 2 275 0.32
## 3 3 298 0.63
## 4 4 151 0.39
## 5 5 87 0.37
grid.arrange(c2, s2, ncol=2)
The K-means clustering suggests that the startups in the 4th cluster are probably the best in terms of ROI and investment risk, as the invested capital was the lowest but brought the highest returns. In further analysis, we could look for common ground between the startups in the 4th cluster and the other firms.
# Perform PAM clustering with 8 clusters
pam8 <- eclust(Clustering_Norm, k=8 , FUNcluster="pam", graph=F)
# Visualize the clustering results
c3 <- fviz_cluster(pam8, data=Clustering_Norm, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 8 clusters")
s3 <- fviz_silhouette(pam8)
## cluster size ave.sil.width
## 1 1 207 0.65
## 2 2 690 0.88
## 3 3 37 0.40
## 4 4 74 0.40
## 5 5 193 0.49
## 6 6 74 0.56
## 7 7 302 0.55
## 8 8 60 0.53
grid.arrange(c3, s3, ncol=2)
The PAM method gives even better results. The 3rd and 7th clusters bring the highest returns with the lowest invested value, that is, they contain the best investments (the highest valuations). Let’s dig deeper and see what this tells us about the countries invested in, starting with the data in a table.
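# Note: `==` against a length-2 vector recycles c(7, 3) element-wise, which triggers the warning below;
# `pam8$cluster %in% c(7, 3)` would flag every member of both clusters. The table shown was produced with `==`.
table(data$Country, pam8$cluster == c(7,3))
## Warning in pam8$cluster == c(7, 3): longer object length is not a multiple of
## shorter object length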
##
## FALSE TRUE
## Argentina 1 0
## Australia 4 1
## Bangladesh 2 0
## Brazil 5 0
## Canada 52 3
## Chile 1 0
## China 9 0
## Colombia 7 0
## Czech Republic 1 0
## Denmark 4 0
## Ecuador 1 0
## Egypt 4 0
## El Salvador 1 0
## Finland 1 0
## France 8 2
## Germany 6 0
## Ghana 2 0
## Hong Kong 5 1
## Iceland 1 0
## India 44 1
## Indonesia 6 0
## Iraq 1 0
## Ireland 1 1
## Israel 3 0
## Mexico 10 0
## Morocco 1 0
## Nigeria 12 0
## Panama 1 0
## Peru 2 0
## Philippines 0 1
## Poland 1 0
## Puerto Rico 1 0
## Senegal 1 0
## Singapore 11 0
## Slovenia 2 0
## South Africa 1 0
## South Korea 1 0
## Sweden 2 0
## Switzerland 1 0
## Tanzania 1 0
## The Netherlands 1 0
## Turkey 0 1
## United Kingdom 33 3
## United States 1180 137
## Uruguay 1 0
table(data$Continent, pam8$cluster)
##
## 1 2 3 4 5 6 7 8
## Africa 4 11 1 1 3 0 1 1
## Asia 16 37 4 1 13 6 2 6
## Europe 3 36 2 1 10 2 14 1
## North America 172 581 30 69 166 66 251 50
## Oceania 0 3 0 1 0 0 1 0
## South America 3 10 0 1 1 0 2 1
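The length warning shown for `pam8$cluster == c(7, 3)` is worth a closer look, because `==` compares element by element and recycles the shorter vector, whereas `%in%` tests membership. A small illustration with a made-up vector of cluster labels:

```r
cluster_id <- c(3, 7, 1, 7, 3)

# `==` recycles c(7, 3) to 7, 3, 7, 3, 7 and compares position by position
# (it also warns, because 5 is not a multiple of 2)
by_equal <- suppressWarnings(cluster_id == c(7, 3))
print(by_equal)   # FALSE FALSE FALSE FALSE FALSE

# `%in%` asks whether each label is one of 7 or 3
by_in <- cluster_id %in% c(7, 3)
print(by_in)      # TRUE TRUE FALSE TRUE TRUE
```

With `==`, whether a row is selected depends on its position in the data, so only part of clusters 3 and 7 is picked up; `%in%` selects all members of both clusters.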
library(dplyr)
# Filter the data for clusters 7 and 3, as they are the best options.
# Note: `==` recycles c(7, 3) element-wise (hence the warning below); `%in% c(7, 3)` would keep all members of both clusters.
cluster_data <- data %>% filter(pam8$cluster == c(7,3))
## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `pam8$cluster == c(7, 3)`.
## Caused by warning in `pam8$cluster == c(7, 3)`:
## ! longer object length is not a multiple of shorter object length
# Calculate the ratio of total money raised to number of startups for each country
ratio_by_country <- cluster_data %>% group_by(Country) %>%
summarize(ratio = sum(Money.Raised.Currency..in.USD.) / n()) %>%
arrange(desc(ratio))
# Create the bar chart
ggplot(ratio_by_country, aes(x = reorder(Country, ratio), y = ratio, fill = Country)) +
geom_col() +
geom_text(aes(label = scales::comma(round(ratio))), vjust = -0.5, size = 3) +
scale_y_continuous(labels = scales::comma_format()) +
labs(x = "Country", y = "Ratio of Total Money Raised to Number of Startups") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none")
Based on this analysis, Y Combinator could pay more attention to the United Kingdom, where the average money raised per startup in these clusters (827,710 USD) is almost twice as high as in the second-ranked United States (427,723 USD).
In this part, we treat the investments made by investors co-investing with Y Combinator as transactions. The goal is to understand the connections between investors and to help startup founders decide whom they should focus on and connect with to maximize the probability of further investment. We would also like to lay the groundwork for a recommendation system that helps investors understand their bias toward particular co-investors and explore strong connections among them.
# Load the dataset
df <- read.csv("YC.csv")
df <- df[,c("Name","Investor.Names","Announced.Date")]
# Remove duplicates based on the latest "Announced.Date"
df <- df %>%
arrange(desc(Announced.Date)) %>% # sort by Announced.Date in descending order
distinct(Name, .keep_all = TRUE) # keep only the first occurrence of each unique name
# Convert investor names to a list of character vectors, one per startup
inverstors_list <- strsplit(as.character(df$Investor.Names), ",\\s*")
# Filter out rows without values (same object name, so the filter actually takes effect)
inverstors_list <- Filter(length, inverstors_list)
# Sort and de-duplicate the investors within each startup
inverstors_list <- lapply(inverstors_list, function(x) {
  if (length(x) > 1) {
    x <- unique(sort(x))
  }
  x
})
# Get all unique investors
all_Inverstors <- unique(unlist(inverstors_list))
# Create an empty binary matrix
binary_matrix <- matrix(0, nrow = length(inverstors_list), ncol = length(all_Inverstors), dimnames = list(NULL, all_Inverstors))
# Fill in the binary matrix
for (i in seq_along(inverstors_list)) {
binary_matrix[i, inverstors_list[[i]]] <- 1
}
# Convert the binary matrix to a transaction object
transactions <- as(binary_matrix, "transactions")
# Find association rules
rules <- apriori(transactions, parameter = list(minlen=2, support = 0.004, confidence = 0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.004 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 8
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[2024 item(s), 2133 transaction(s)] done [0.00s].
## sorting and recoding items ... [91 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [129 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Rules with the highest values of support, confidence and lift have been displayed below.
# Sort by support
rules_by_support <- sort(rules, by = "support", decreasing = TRUE)
# Inspect rules by support
inspect(rules_by_support[1:10])
## lhs rhs support confidence coverage
## [1] {SV Angel} => {Y Combinator} 0.05250820 1 0.05250820
## [2] {FundersClub} => {Y Combinator} 0.05016409 1 0.05016409
## [3] {Paul Buchheit} => {Y Combinator} 0.02812940 1 0.02812940
## [4] {Zillionize Angel} => {Y Combinator} 0.02578528 1 0.02578528
## [5] {Andreessen Horowitz} => {Y Combinator} 0.02062822 1 0.02062822
## [6] {Alexis Ohanian} => {Y Combinator} 0.02062822 1 0.02062822
## [7] {Soma Capital} => {Y Combinator} 0.01828411 1 0.01828411
## [8] {500 Startups} => {Y Combinator} 0.01828411 1 0.01828411
## [9] {AltaIR Capital} => {Y Combinator} 0.01734646 1 0.01734646
## [10] {ACE & Company} => {Y Combinator} 0.01734646 1 0.01734646
## lift count
## [1] 1 112
## [2] 1 107
## [3] 1 60
## [4] 1 55
## [5] 1 44
## [6] 1 44
## [7] 1 39
## [8] 1 39
## [9] 1 37
## [10] 1 37
The output shows the top 10 association rules sorted by support, that is, by the share of transactions (startups) that contain every item in the rule. All ten rules have a single investor or investment firm on the left-hand side (LHS) and Y Combinator on the right-hand side (RHS).
The rule with the highest support (0.0525) is {SV Angel} => {Y Combinator}: 112 of the 2,133 startups, or 5.25%, list both SV Angel and Y Combinator as investors. FundersClub (107 startups) and Paul Buchheit (60 startups) follow. Note, however, that every one of these rules has a confidence of 1 and a lift of 1, and that support equals coverage. This is what happens when Y Combinator appears in every transaction, so these rules rank the most frequent co-investors rather than reveal a specific association with Y Combinator.
The remaining rules in the list have lower support, meaning that those investors backed fewer startups in the dataset. For genuine co-occurrence patterns between investors, the rules sorted by lift are more informative.
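To make the three measures concrete, the following sketch computes support, confidence, and lift by hand in base R. The matrix `toy`, the investor names, and the helper `rule_metrics` are hypothetical and serve only as an illustration; they are not part of the dataset or of the arules package. The example also shows why an investor that is present in every row always yields a lift of 1.

```r
# Hypothetical toy data: rows are startups, columns are investors (1 = invested).
toy <- matrix(
  c(1, 1, 1,
    1, 1, 1,
    1, 0, 0,
    1, 0, 0,
    1, 0, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(NULL, c("Accelerator", "Investor A", "Investor B"))
)

# Support, confidence and lift of the single-item rule {lhs} => {rhs}.
rule_metrics <- function(m, lhs, rhs) {
  support_lhs  <- mean(m[, lhs] == 1)                  # P(lhs)
  support_rhs  <- mean(m[, rhs] == 1)                  # P(rhs)
  support_both <- mean(m[, lhs] == 1 & m[, rhs] == 1)  # P(lhs and rhs)
  confidence   <- support_both / support_lhs           # P(rhs | lhs)
  c(support    = support_both,
    confidence = confidence,
    lift       = confidence / support_rhs)
}

# "Accelerator" is in every row, so confidence is 1 and lift is 1.
rule_metrics(toy, "Investor A", "Accelerator")

# "Investor B" is in only some rows, so lift can exceed 1.
rule_metrics(toy, "Investor A", "Investor B")
```

In the toy data, the second rule has a lift above 1 because Investor B appears in every startup backed by Investor A but in only 60% of all startups.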
# 'main' is not a control parameter of the graph method, so no title is passed
plot(rules[1:20,], method = "graph", measure = "support", shading = "lift")
# Sort by lift
rules_by_lift <- sort(rules, by = "lift", decreasing = TRUE)
# Inspect by lift
inspect(rules_by_lift[1:20])
## lhs rhs support confidence coverage lift count
## [1] {UpHonest Capital} => {Wei Guo} 0.004219409 0.6428571 0.006563526 59.61801 9
## [2] {Y Combinator,
## UpHonest Capital} => {Wei Guo} 0.004219409 0.6428571 0.006563526 59.61801 9
## [3] {SV Angel,
## Alexis Ohanian} => {Garry Tan} 0.005157056 0.5500000 0.009376465 58.65750 11
## [4] {Y Combinator,
## SV Angel,
## Alexis Ohanian} => {Garry Tan} 0.005157056 0.5500000 0.009376465 58.65750 11
## [5] {Garry Tan} => {Alexis Ohanian} 0.008907642 0.9500000 0.009376465 46.05341 19
## [6] {Y Combinator,
## Garry Tan} => {Alexis Ohanian} 0.008907642 0.9500000 0.009376465 46.05341 19
## [7] {SV Angel,
## Garry Tan} => {Alexis Ohanian} 0.005157056 0.9166667 0.005625879 44.43750 11
## [8] {Y Combinator,
## SV Angel,
## Garry Tan} => {Alexis Ohanian} 0.005157056 0.9166667 0.005625879 44.43750 11
## [9] {Yuri Milner} => {SV Angel} 0.004219409 0.9000000 0.004688233 17.14018 9
## [10] {Y Combinator,
## Yuri Milner} => {SV Angel} 0.004219409 0.9000000 0.004688233 17.14018 9
## [11] {Paul Buchheit,
## Alexis Ohanian} => {SV Angel} 0.004219409 0.8181818 0.005157056 15.58198 9
## [12] {Y Combinator,
## Paul Buchheit,
## Alexis Ohanian} => {SV Angel} 0.004219409 0.8181818 0.005157056 15.58198 9
## [13] {Tuesday Capital} => {SV Angel} 0.006094702 0.6500000 0.009376465 12.37902 13
## [14] {Y Combinator,
## Tuesday Capital} => {SV Angel} 0.006094702 0.6500000 0.009376465 12.37902 13
## [15] {Garry Tan} => {SV Angel} 0.005625879 0.6000000 0.009376465 11.42679 12
## [16] {Y Combinator,
## Garry Tan} => {SV Angel} 0.005625879 0.6000000 0.009376465 11.42679 12
## [17] {Alexis Ohanian,
## Garry Tan} => {SV Angel} 0.005157056 0.5789474 0.008907642 11.02585 11
## [18] {Y Combinator,
## Alexis Ohanian,
## Garry Tan} => {SV Angel} 0.005157056 0.5789474 0.008907642 11.02585 11
## [19] {Susa Ventures} => {SV Angel} 0.004219409 0.5625000 0.007501172 10.71261 9
## [20] {Y Combinator,
## Susa Ventures} => {SV Angel} 0.004219409 0.5625000 0.007501172 10.71261 9
library(arules)
# Sort by confidence
rules_by_confidence <- sort(rules, by = "confidence", decreasing = TRUE)
# Inspect by confidence
inspect(rules_by_confidence[1:20])
## lhs rhs support confidence
## [1] {S28 Capital} => {Y Combinator} 0.004219409 1
## [2] {AAF Management Ltd.} => {Y Combinator} 0.004219409 1
## [3] {Social Starts} => {Y Combinator} 0.004688233 1
## [4] {Streamlined Ventures} => {Y Combinator} 0.004219409 1
## [5] {Refactor Capital} => {Y Combinator} 0.004219409 1
## [6] {Draper Associates} => {Y Combinator} 0.004219409 1
## [7] {Lynett Capital} => {Y Combinator} 0.004219409 1
## [8] {StartX (Stanford-StartX Fund)} => {Y Combinator} 0.004688233 1
## [9] {Oyster Ventures} => {Y Combinator} 0.006563526 1
## [10] {Brainchild Holdings} => {Y Combinator} 0.004688233 1
## [11] {Kevin Moore} => {Y Combinator} 0.006094702 1
## [12] {Kevin Mahaffey} => {Y Combinator} 0.004688233 1
## [13] {Salesforce Ventures} => {Y Combinator} 0.005157056 1
## [14] {Uncork Capital} => {Y Combinator} 0.004688233 1
## [15] {CRCM Ventures} => {Y Combinator} 0.005625879 1
## [16] {AME Cloud Ventures} => {Y Combinator} 0.004688233 1
## [17] {Fifty Years} => {Y Combinator} 0.005157056 1
## [18] {Menlo Ventures} => {Y Combinator} 0.004688233 1
## [19] {Accel} => {Y Combinator} 0.008438819 1
## [20] {Bobby Goodlatte} => {Y Combinator} 0.004219409 1
## coverage lift count
## [1] 0.004219409 1 9
## [2] 0.004219409 1 9
## [3] 0.004688233 1 10
## [4] 0.004219409 1 9
## [5] 0.004219409 1 9
## [6] 0.004219409 1 9
## [7] 0.004219409 1 9
## [8] 0.004688233 1 10
## [9] 0.006563526 1 14
## [10] 0.004688233 1 10
## [11] 0.006094702 1 13
## [12] 0.004688233 1 10
## [13] 0.005157056 1 11
## [14] 0.004688233 1 10
## [15] 0.005625879 1 12
## [16] 0.004688233 1 10
## [17] 0.005157056 1 11
## [18] 0.004688233 1 10
## [19] 0.008438819 1 18
## [20] 0.004219409 1 9
The confidence values of the association rules indicate the likelihood of the right-hand side (rhs) occurring given the left-hand side (lhs) occurred. Specifically, a confidence of 1.0 means that the rhs always occurs when the lhs occurs, while a confidence of 0.5 means that the rhs occurs half of the time when the lhs occurs.
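For example, consider rule [1]: {S28 Capital} => {Y Combinator}. The confidence value is 1.0, indicating that every startup backed by S28 Capital was also backed by Y Combinator. Because Y Combinator is an investor in every startup in the dataset, all 20 rules shown have a confidence of 1.0 and a lift of 1, so a high confidence alone does not signal a meaningful association here.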
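Since an item that occurs in every transaction produces rules with a confidence of 1 and a lift of 1, one option is to drop that item before mining. The sketch below illustrates the idea on hypothetical toy baskets; the investor names other than Y Combinator, the object names, and the thresholds are assumptions chosen for the example and are not taken from the analysis above.

```r
library(arules)

# Hypothetical toy baskets: one character vector of investors per startup.
toy_investors <- list(
  c("Y Combinator", "Investor A", "Investor B"),
  c("Y Combinator", "Investor A", "Investor B"),
  c("Y Combinator", "Investor A"),
  c("Y Combinator", "Investor C"),
  c("Y Combinator")
)
toy_transactions <- as(toy_investors, "transactions")

# Keep every item except the one that appears in all transactions.
keep <- itemLabels(toy_transactions) != "Y Combinator"
toy_filtered <- toy_transactions[, keep]

# Mine rules on the filtered transactions (illustrative thresholds).
toy_rules <- apriori(
  toy_filtered,
  parameter = list(minlen = 2, support = 0.2, confidence = 0.5),
  control = list(verbose = FALSE)
)
inspect(sort(toy_rules, by = "lift"))
```

The same column subsetting could be applied to `transactions` before calling `apriori`, which would also remove the duplicated rules that differ only by having Y Combinator on the left-hand side.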
# 'main' is not a control parameter of the graph method, so no title is passed
plot(rules_by_lift[1:20,], method = "graph", measure = "lift", shading = "support")
Association rules can also serve as the basis of a simple recommendation system for investors. As an example, we look at the rules with Andreessen Horowitz on the left-hand side.
rules_Andreessen_Horowitz <- apriori(transactions, parameter=list(minlen=2, supp=0.001, conf = 0.05), appearance = list(default="rhs",lhs= c("Andreessen Horowitz")))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.05 0.1 1 none FALSE TRUE 5 0.001 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 2
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[2024 item(s), 2133 transaction(s)] done [0.00s].
## sorting and recoding items ... [405 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [18 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules_Andreessen_Horowitz <- subset(rules_Andreessen_Horowitz, lift != 1)
rules_Andreessen_Horowitz <- sort(rules_Andreessen_Horowitz, by = "support", decreasing = TRUE)
inspect(rules_Andreessen_Horowitz)
## lhs rhs support confidence
## [1] {Andreessen Horowitz} => {SV Angel} 0.008907642 0.43181818
## [2] {Andreessen Horowitz} => {General Catalyst} 0.003281763 0.15909091
## [3] {Andreessen Horowitz} => {Start Fund} 0.003281763 0.15909091
## [4] {Andreessen Horowitz} => {Ignition Partners} 0.001875293 0.09090909
## [5] {Andreessen Horowitz} => {Joshua Schachter} 0.001875293 0.09090909
## [6] {Andreessen Horowitz} => {Refactor Capital} 0.001406470 0.06818182
## [7] {Andreessen Horowitz} => {Salesforce Ventures} 0.001406470 0.06818182
## [8] {Andreessen Horowitz} => {Lerer Hippeau} 0.001406470 0.06818182
## [9] {Andreessen Horowitz} => {Signatures Capital} 0.001406470 0.06818182
## [10] {Andreessen Horowitz} => {Ashton Kutcher} 0.001406470 0.06818182
## [11] {Andreessen Horowitz} => {First Round Capital} 0.001406470 0.06818182
## [12] {Andreessen Horowitz} => {ACE & Company} 0.001406470 0.06818182
## [13] {Andreessen Horowitz} => {Khosla Ventures} 0.001406470 0.06818182
## [14] {Andreessen Horowitz} => {Data Collective DCVC} 0.001406470 0.06818182
## [15] {Andreessen Horowitz} => {Alexis Ohanian} 0.001406470 0.06818182
## [16] {Andreessen Horowitz} => {Paul Buchheit} 0.001406470 0.06818182
## [17] {Andreessen Horowitz} => {FundersClub} 0.001406470 0.06818182
## coverage lift count
## [1] 0.02062822 8.223823 19
## [2] 0.02062822 21.208807 7
## [3] 0.02062822 9.171376 7
## [4] 0.02062822 24.238636 4
## [5] 0.02062822 17.628099 4
## [6] 0.02062822 16.159091 3
## [7] 0.02062822 13.221074 3
## [8] 0.02062822 13.221074 3
## [9] 0.02062822 18.178977 3
## [10] 0.02062822 16.159091 3
## [11] 0.02062822 6.610537 3
## [12] 0.02062822 3.930590 3
## [13] 0.02062822 4.039773 3
## [14] 0.02062822 4.155195 3
## [15] 0.02062822 3.305269 3
## [16] 0.02062822 2.423864 3
## [17] 0.02062822 1.359176 3
# 'main' is not a control parameter of the graph method, so no title is passed
plot(rules_Andreessen_Horowitz[1:15,], method = "graph", measure = "support", shading = "lift")
Based on the association rule results, Andreessen Horowitz is most frequently paired with SV Angel: this rule has the highest support (0.0089, or 19 startups) and a lift of 8.22, meaning that SV Angel is about eight times more likely to appear among a startup's investors when Andreessen Horowitz is present than it is overall.
The next rules by support pair Andreessen Horowitz with General Catalyst, Start Fund, and Ignition Partners. Their lift values are high (21.2, 9.2, and 24.2), but they rest on only 7, 7, and 4 startups respectively, so they should be read with caution.
It is also important to note that the confidence values are low for all rules, ranging from 0.068 to 0.432, so the probability of the consequent occurring given the antecedent is not very high.
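The recommendation idea can be sketched without arules by computing the same measures directly from a binary startup-by-investor matrix. The function `recommend_coinvestors`, the toy matrix, and the fund names below are hypothetical, and keeping only co-investors with a lift above 1 is an assumption made for this illustration.

```r
# Hypothetical toy data: rows are startups, columns are investors (1 = invested).
toy <- matrix(
  c(1, 1, 1, 0,
    1, 1, 1, 0,
    1, 1, 0, 1,
    1, 0, 1, 0,
    1, 0, 0, 1,
    1, 0, 0, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(NULL, c("Accelerator", "Fund A", "Fund B", "Fund C"))
)

# Rank co-investors of `investor` by lift, keeping only those with lift > 1.
recommend_coinvestors <- function(m, investor, top_n = 3) {
  backed <- m[, investor] == 1                 # startups backed by `investor`
  others <- setdiff(colnames(m), investor)
  # P(other | investor) and P(other) for every other investor
  confidence <- colMeans(m[backed, others, drop = FALSE])
  base_rate  <- colMeans(m[, others, drop = FALSE])
  out <- data.frame(
    co_investor = others,
    count       = colSums(m[backed, others, drop = FALSE]),
    confidence  = confidence,
    lift        = confidence / base_rate,
    row.names   = NULL
  )
  out <- out[out$lift > 1 + 1e-9, ]            # drop lift <= 1 (e.g. the accelerator)
  out <- out[order(-out$lift), ]
  head(out, top_n)
}

recommend_coinvestors(toy, "Fund A")
```

In the toy data, only Fund B is returned for Fund A: the accelerator is dropped because its lift is exactly 1, and Fund C is dropped because it co-invests with Fund A less often than its overall frequency would suggest.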
In conclusion, our analysis of the Y Combinator investment dataset has shown that clustering and association rule mining techniques can be valuable tools for gaining insights into the characteristics of successful startups. By grouping startups based on funding patterns and business focuses, we were able to identify commonalities among successful companies. Additionally, by applying association rule mining techniques, we were able to identify relationships between different features of the dataset, such as the relationship between funding amount and industry type.
Overall, our analysis provides valuable insights for entrepreneurs and investors interested in the startup ecosystem. By understanding the characteristics of successful Y Combinator startups, entrepreneurs can tailor their business strategies to increase their chances of success, while investors can make more informed investment decisions. The Y Combinator investment dataset is a valuable resource for exploring trends and patterns in startup funding, and our analysis demonstrates the value of applying data mining techniques to gain insights into this important industry.