Introduction

Y Combinator is one of the most renowned startup accelerators in the world. Since its founding in 2005, Y Combinator has invested in more than 2,000 startups, including outstanding names such as Airbnb, Dropbox, Coinbase, and Reddit. The Y Combinator investment dataset, which contains information on startups that received funding from Y Combinator between 2005 and 2019, provides a valuable resource for exploring trends and patterns in startup funding.

In this project, we will explore the Y Combinator investment dataset and apply several data mining techniques to gain insights into the characteristics of successful Y Combinator startups. We will focus on three main techniques: clustering, dimension reduction, and association rules.

First, we will use clustering techniques to group startups based on their categories and total amount of money raised, with the goal of identifying groups of companies with similar funding patterns and business focuses.

Finally, we will apply association rule mining techniques to identify relationships between different features of the Y Combinator investment dataset, such as the relationship between the funding amount and the type of industry.

Overall, the aim of this project is to provide an analysis of the Y Combinator investment dataset and to gain insights into the characteristics of successful Y Combinator startups. By applying these data mining techniques, we hope to provide valuable insights for entrepreneurs, investors, and anyone interested in the startup ecosystem.

Loading packages

To do the analysis, the following packages will be needed.

packages <- c("tidyverse", "corrplot", "ggplot2", "zoo", "mice",
              "patchwork", "tibble", "gridExtra", "ClusterR", "cluster",
              "flexclust", "clustertend", "ggthemes", "plotly", "jpeg",
              "dplyr", "arules", "arulesViz", "factoextra")

# Install any package that is missing, then attach it
for (package in packages) {
  if (!require(package, character.only = TRUE)) {
    install.packages(package)
    library(package, character.only = TRUE)
  }
}

Loading dataset

The dataset contains information on startups that received funding from Y Combinator between 2005 and 2019. It includes data on over 2,000 startups, including information on the startup’s name, category, founding investors, funding stage, funding type, funding amount, pre-money valuation, post-money valuation, and more.

Some key features of the dataset include:

Funding amount: The total amount of money that the startup received from Y Combinator.

Funding stage: The stage of funding that the startup received (e.g., seed, early stage venture).

Category: The category or industry that the startup operates in (e.g., healthcare, e-commerce, education).

Funding type: The type of funding that the startup received (e.g., pre-seed, seed, series A, series B).

# Load the dataset
data <- read.csv("YC.csv")
# Show the column names
colnames(data)

##  [1] "Name"                                  
##  [2] "Transaction.Name"                      
##  [3] "Funding.Type"                          
##  [4] "Money.Raised.Currency..in.USD."        
##  [5] "Announced.Date"                        
##  [6] "Funding.Stage"                         
##  [7] "Pre.Money.Valuation.Currency..in.USD." 
##  [8] "Description"                           
##  [9] "Categories"                            
## [10] "Location"                              
## [11] "Website"                               
## [12] "Revenue.Range"                         
## [13] "Total.Funding.Amount.Currency..in.USD."
## [14] "Funding.Status"                        
## [15] "Number.of.Funding.Rounds"              
## [16] "Lead.Investors"                        
## [17] "Investor.Names"                        
## [18] "Number.of.Investors"                   
## [19] "Number.of.Partner.Investors"

Let’s look at the first six rows to get a brief overview of the data.

head(data)

##            Name           Transaction.Name Funding.Type
## 1         Copia         Seed Round - Copia         Seed
## 2     Suiteness       Series A - Suiteness     Series A
## 3      Astranis      Seed Round - Astranis         Seed
## 4    Shield Bio    Seed Round - Shield Bio         Seed
## 5        Platzi        Seed Round - Platzi         Seed
## 6 Kisan Network Seed Round - Kisan Network         Seed
##   Money.Raised.Currency..in.USD. Announced.Date       Funding.Stage
## 1                        3100000     2016-03-22                Seed
## 2                        5000000     2016-12-14 Early Stage Venture
## 3                         120000     2016-03-22                Seed
## 4                        4100000     2017-02-23                Seed
## 5                             NA     2014-12-01                Seed
## 6                             NA     2016-06-01                Seed
##   Pre.Money.Valuation.Currency..in.USD.
## 1                                    NA
## 2                                    NA
## 3                                    NA
## 4                                    NA
## 5                                    NA
## 6                                    NA
##                                                                                                              Description
## 1                                            Copia is the next-generation technology platform for food waste management.
## 2                                   Suiteness is a free-to-join hotel booking website connecting hotel rooms and suites.
## 3                                                    Astranis is building small, low-cost telecommunications satellites.
## 4                                              Shield Bio - using ultra-fast sequencing to prevent antibiotic resistance
## 5 Platzi is an effective online education platform that offers classes on marketing, learn coding, business, and design.
## 6                                                         Kisan Network is an online marketplace for Indian agriculture.
##                                                                                                                      Categories
## 1 Analytics, Communities, Enterprise, Enterprise Software, Marketplace, SaaS, Sharing Economy, Sustainability, Waste Management
## 2                                                                     Family, Hospitality, Hotel, Leisure, Reservations, Travel
## 3                                                                                       Aerospace, Internet, Telecommunications
## 4                                                                      Biotechnology, Genetics, Health Care, Health Diagnostics
## 5                                                                                  Education, Edutainment, Recruiting, Training
## 6                                                                                       Agriculture, AgTech, E-Commerce, Mobile
##                                                  Location
## 1 San Francisco, California, United States, North America
## 2       Oakland, California, United States, North America
## 3 San Francisco, California, United States, North America
## 4      San Jose, California, United States, North America
## 5 Mountain View, California, United States, North America
## 6                           Gurgaon, Haryana, India, Asia
##                       Website Revenue.Range
## 1    https://www.GoCopia.com/   $1M to $10M
## 2   https://www.suiteness.com   $1M to $10M
## 3    http://www.astranis.com/   $1M to $10M
## 4        http://shieldbio.com              
## 5          https://platzi.com Less than $1M
## 6 http://www.kisannetwork.com              
##   Total.Funding.Amount.Currency..in.USD.      Funding.Status
## 1                                4580000                Seed
## 2                                6000000 Early Stage Venture
## 3                               13619998 Early Stage Venture
## 4                               12100000 Early Stage Venture
## 5                               16428315 Early Stage Venture
## 6                                  38300                Seed
##   Number.of.Funding.Rounds                           Lead.Investors
## 1                        2                        Structure Capital
## 2                        3 Bullpen Capital, Global Founders Capital
## 3                        7                                         
## 4                        2                      Andreessen Horowitz
## 5                        5                                         
## 6                        3                                         
##                                                                                                                                                                                                                                                                                                                                                                                Investor.Names
## 1 8VC, Alps Investing Holdings LLC, Chivas Venture, Cynthia Ringo, David Pottruck, Emerson Collective, Eucalyptus Burlingame LLC, Jahan Ali, Jillian Manus, John Solomon, Jordan Kretchmer, Ken Tam, Lutetilla LLC, Lynett Capital, Maples Burlingame LLC, Mitchell Kapor, Moment Ventures, Nurzhas Makishev, Riggs Capital Partners, Steve Case, Structure Capital, Toyota USA, Y Combinator
## 2                                                                                                                                                    AltaIR Capital, Bullpen Capital, David Hauser, FundersClub, Global Founders Capital, HVF Labs, Jared Ablon, Kima Ventures, MetaProp NYC, Muhsen Syed, Rocket Internet, Roland Tanner, SciFi VC, Scott Banister, Tilo Bonow, Y Combinator
## 3                                                                                                                                                                                                                                                            ACE & Company, Fifty Years, Jaan Tallinn, Lars Rasmussen, Refactor Capital, S2 Capital, Samvit Ramadurgam, Wei Guo, Y Combinator
## 4                                                                                                                                                                                                                                                                                        Andreessen Horowitz, Friále, Josh Buckley, Refactor Capital, SGH CAPITAL, Soma Capital, Y Combinator
## 5                                                                                                                                                                                 500 Startups, Amasia, BoomStartup, Deepak Desai, Elies Campo, FundersClub, GE32 Capital, Graph Ventures, Josh Jones, Mind the Seed - MTS Fund, TA Ventures, Thomas Floracks, Y Combinator, Zillionize Angel
## 6                                                                                                                                                                                                                                                                                                                                                  FundersClub, Venture Highway, Y Combinator
##   Number.of.Investors Number.of.Partner.Investors
## 1                  23                          10
## 2                  16                           2
## 3                   9                          NA
## 4                   7                           6
## 5                  14                          NA
## 6                   3                          NA

Defining / selecting the necessary data and their source

In the analysis we will use only Name, Funding type, Money raised, Announced date, Categories, Location, Total funding amount, Number of funding rounds, and Investor names. Therefore we need to update our data frame.

data <- data[,c("Name", "Funding.Type","Money.Raised.Currency..in.USD.", "Announced.Date","Categories","Location","Total.Funding.Amount.Currency..in.USD.","Number.of.Funding.Rounds","Investor.Names")]

Let’s first analyze the funding type.

#Funding type
table(data$Funding.Type)

## 
##                    Angel         Convertible Note          Corporate Round 
##                       30                       10                        1 
##           Debt Financing            Funding Round                    Grant 
##                        2                       11                        8 
##                 Pre-Seed     Product Crowdfunding                     Seed 
##                       37                        1                     2188 
##                 Series A                 Series B                 Series C 
##                      155                       61                       27 
##                 Series D                 Series E                 Series F 
##                        5                        3                        2 
## Venture - Series Unknown 
##                       44

As we can see, the dataset mixes rounds of very different kinds, because values are shown for all funding rounds. As Y Combinator is an accelerator, let’s focus only on the Seed and Pre-Seed rounds.

# Keep only Seed and Pre-Seed rounds
data <- data[data$Funding.Type %in% c("Seed", "Pre-Seed"), ]
table(data$Funding.Type)

## 
## Pre-Seed     Seed 
##       37     2188

Let’s explore the Money.Raised.Currency..in.USD. and Total.Funding.Amount.Currency..in.USD. columns, which will be used in the further analysis.

#checking the Money.Raised.Currency..in.USD. 
options(scipen = 999)
summary(data$Money.Raised.Currency..in.USD.)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##     6000   120000   120000   788232  1000000 20000000      667

#Checking the Total.Funding.Amount.Currency..in.USD.
summary(data$Total.Funding.Amount.Currency..in.USD.)

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.       NA's 
##      10000     150000    1637500   23234822    6620000 5268800000        441

As we can see we have 667 NA values for Money.Raised.Currency..in.USD. and 441 NA values for Total.Funding.Amount.Currency..in.USD. columns.
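The same NA counts can be obtained for all columns at once instead of reading them off each summary. The following is a minimal sketch on a made-up data frame (the column names and values are illustrative only and are not taken from the dataset):

```r
# Toy data frame with made-up values; NA marks a missing amount
toy <- data.frame(
  Money_Raised  = c(120000, NA, 3100000, NA),
  Total_Funding = c(4580000, 38300, NA, 6000000),
  Rounds        = c(2, 3, 1, 3)
)

# Number of missing values in every column
na_counts <- colSums(is.na(toy))

# Share of missing values in every column, in percent
na_share <- round(colMeans(is.na(toy)) * 100, 1)

print(na_counts)
print(na_share)
```

Applied to the real data frame, the same two calls would report the counts for every column, not only the two funding columns.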

Data Cleaning and Initial Feature Engineering

We will start with some initial feature engineering. First we split the location into four separate columns (City, State, Country, Continent) and create a new column named “year” that contains the year in which the funding was announced.

# Create a new column 'year' based on Announced.Date
data <- data %>% mutate(year = lubridate::year(Announced.Date))


# Create a new columns "City", "State", "Country", "Continent" based on Location
data <- data %>%
  separate(Location, into = c("City", "State", "Country", "Continent"), sep = ", ", remove = FALSE)

## Warning: Expected 4 pieces. Missing pieces filled with `NA` in 56 rows [53, 55, 191,
## 297, 327, 646, 830, 834, 835, 837, 847, 848, 854, 855, 857, 858, 861, 862, 874,
## 884, ...].
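The warning above appears when a location string has fewer than four comma-separated pieces. A minimal base R sketch of how such rows can be spotted, and how the missing pieces end up as NA on the right, is shown here; the third location string is a hypothetical three-piece entry invented for illustration:

```r
# Two locations copied from the head() output and one hypothetical short entry
loc <- c(
  "San Francisco, California, United States, North America",
  "Gurgaon, Haryana, India, Asia",
  "Singapore, Central Region, Singapore"  # hypothetical: only three pieces
)

# Split on ", " and count the pieces per row
pieces   <- strsplit(loc, ", ", fixed = TRUE)
n_pieces <- lengths(pieces)

# Rows that would trigger the "Expected 4 pieces" warning
incomplete <- which(n_pieces < 4)

# Pad short rows with NA on the right so every row has four columns
padded <- t(sapply(pieces, function(p) c(p, rep(NA, 4 - length(p)))))
colnames(padded) <- c("City", "State", "Country", "Continent")

print(incomplete)
print(padded)
```

Note that padding on the right puts the value in the wrong column whenever the missing piece is not the last one, so the affected rows are worth inspecting by hand.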

Because Total.Funding.Amount.Currency..in.USD. and Money.Raised.Currency..in.USD. are crucial for the further analysis, we cannot replace their missing values with zeros or simply drop the rows, so the missing values need to be filled in. When Number.of.Funding.Rounds is equal to 1, a missing Total.Funding.Amount.Currency..in.USD. can be replaced with the value of Money.Raised.Currency..in.USD., since a single round accounts for the whole total. In all other cases we use imputation. We first tested predictive mean matching, Bayesian linear regression, and logistic regression to check which method works best in which case. Based on this pre-analysis we decided to use linear regression imputation (method "norm.predict" in mice) for Money.Raised.Currency..in.USD. and predictive mean matching (method "pmm") for Total.Funding.Amount.Currency..in.USD., as these gave the most promising results.

impute_data1 <- data[, c( "year","Money.Raised.Currency..in.USD.","Investor.Names","Location")]



# Perform regression imputation
imputed_data1 <- mice(impute_data1, method = "norm.predict")

## 
##  iter imp variable
##   1   1  Money.Raised.Currency..in.USD.
##   1   2  Money.Raised.Currency..in.USD.
##   1   3  Money.Raised.Currency..in.USD.
##   1   4  Money.Raised.Currency..in.USD.
##   1   5  Money.Raised.Currency..in.USD.
##   2   1  Money.Raised.Currency..in.USD.
##   2   2  Money.Raised.Currency..in.USD.
##   2   3  Money.Raised.Currency..in.USD.
##   2   4  Money.Raised.Currency..in.USD.
##   2   5  Money.Raised.Currency..in.USD.
##   3   1  Money.Raised.Currency..in.USD.
##   3   2  Money.Raised.Currency..in.USD.
##   3   3  Money.Raised.Currency..in.USD.
##   3   4  Money.Raised.Currency..in.USD.
##   3   5  Money.Raised.Currency..in.USD.
##   4   1  Money.Raised.Currency..in.USD.
##   4   2  Money.Raised.Currency..in.USD.
##   4   3  Money.Raised.Currency..in.USD.
##   4   4  Money.Raised.Currency..in.USD.
##   4   5  Money.Raised.Currency..in.USD.
##   5   1  Money.Raised.Currency..in.USD.
##   5   2  Money.Raised.Currency..in.USD.
##   5   3  Money.Raised.Currency..in.USD.
##   5   4  Money.Raised.Currency..in.USD.
##   5   5  Money.Raised.Currency..in.USD.

## Warning: Number of logged events: 2

imputed_values1 <- complete(imputed_data1)

# Replace NA values in original dataset with imputed values
data$Money.Raised.Currency..in.USD.[is.na(data$Money.Raised.Currency..in.USD.)] <- imputed_values1$Money.Raised.Currency..in.USD.[is.na(data$Money.Raised.Currency..in.USD.)]



summary(data$Money.Raised.Currency..in.USD.)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     6000   120000   384051   752592   827710 20000000

impute_data2 <- data[, c("Total.Funding.Amount.Currency..in.USD.", "year","Number.of.Funding.Rounds","Investor.Names")]


# Replace NA with values from "Money.Raised.Currency..in.USD." if "Number.of.Funding.Rounds" is equal to 1
data$Total.Funding.Amount.Currency..in.USD.[is.na(data$Total.Funding.Amount.Currency..in.USD.) & data$Number.of.Funding.Rounds == 1] <- data$Money.Raised.Currency..in.USD.[is.na(data$Total.Funding.Amount.Currency..in.USD.) & data$Number.of.Funding.Rounds == 1]

# Because the norm method produced negative values, a different method (pmm) is used here.

# Impute missing values using mice package
imputed_data2 <- mice(impute_data2, method = "pmm", m=4)

## 
##  iter imp variable
##   1   1  Total.Funding.Amount.Currency..in.USD.
##   1   2  Total.Funding.Amount.Currency..in.USD.
##   1   3  Total.Funding.Amount.Currency..in.USD.
##   1   4  Total.Funding.Amount.Currency..in.USD.
##   2   1  Total.Funding.Amount.Currency..in.USD.
##   2   2  Total.Funding.Amount.Currency..in.USD.
##   2   3  Total.Funding.Amount.Currency..in.USD.
##   2   4  Total.Funding.Amount.Currency..in.USD.
##   3   1  Total.Funding.Amount.Currency..in.USD.
##   3   2  Total.Funding.Amount.Currency..in.USD.
##   3   3  Total.Funding.Amount.Currency..in.USD.
##   3   4  Total.Funding.Amount.Currency..in.USD.
##   4   1  Total.Funding.Amount.Currency..in.USD.
##   4   2  Total.Funding.Amount.Currency..in.USD.
##   4   3  Total.Funding.Amount.Currency..in.USD.
##   4   4  Total.Funding.Amount.Currency..in.USD.
##   5   1  Total.Funding.Amount.Currency..in.USD.
##   5   2  Total.Funding.Amount.Currency..in.USD.
##   5   3  Total.Funding.Amount.Currency..in.USD.
##   5   4  Total.Funding.Amount.Currency..in.USD.

## Warning: Number of logged events: 1

imputed_values2 <- complete(imputed_data2)

# Replace remaining NA with imputed values
data$Total.Funding.Amount.Currency..in.USD.[is.na(data$Total.Funding.Amount.Currency..in.USD.)] <- imputed_values2$Total.Funding.Amount.Currency..in.USD.[is.na(data$Total.Funding.Amount.Currency..in.USD.)]


# The total funding cannot be lower than the money raised in a single round
data$Total.Funding.Amount.Currency..in.USD. <- ifelse(data$Total.Funding.Amount.Currency..in.USD. < data$Money.Raised.Currency..in.USD., data$Money.Raised.Currency..in.USD., data$Total.Funding.Amount.Currency..in.USD.)

summary(data$Total.Funding.Amount.Currency..in.USD.)

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
##      10000     150000     908376   18993330    4700000 5268800000
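The difference between regression-based imputation and predictive mean matching can be illustrated on a small made-up example. The sketch below is a simplified illustration in base R, not what mice does internally: mice's pmm also draws the regression parameters and picks randomly among several donors, whereas here the single closest donor is used. All values are invented.

```r
# Toy data: y grows with x; the last two y values are missing (made-up numbers)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 0),
  y = c(4, 16, 25, 34, 46, NA, NA)
)
obs <- !is.na(toy$y)

# Linear regression fitted on the observed rows only
fit      <- lm(y ~ x, data = toy[obs, ])
pred_all <- predict(fit, newdata = toy)

# (a) Regression imputation: plug in the predicted value directly
reg_imputed <- toy$y
reg_imputed[!obs] <- pred_all[!obs]

# (b) Simplified PMM: borrow the observed y of the donor
#     whose predicted value is closest to the prediction for the missing row
pmm_imputed <- toy$y
for (i in which(!obs)) {
  donor <- which(obs)[which.min(abs(pred_all[obs] - pred_all[i]))]
  pmm_imputed[i] <- toy$y[donor]
}

print(reg_imputed)  # the value imputed for x = 0 falls below zero
print(pmm_imputed)  # every imputed value is one of the observed y values
```

Because predictive mean matching only reuses observed values, it cannot produce negative amounts when all observed amounts are positive, which is the property that matters for a funding column.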

Exploratory Data Analysis

Let’s first summarize how many startups had their funding announced per year.

# Convert the Announced.Date column to a date format
data$Announced.Date <- as.Date(data$Announced.Date, format = "%Y-%m-%d")

# Count the number of startups announced per year
startup_count_per_year <- table(format(data$Announced.Date, "%Y"))

print(startup_count_per_year)

## 
## 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 
##   10   18   35   43   49   66   99  164  118  233  203  254  286  284  363

# create bar plot
ggplot(data = data.frame(Year = names(startup_count_per_year), Count = as.numeric(startup_count_per_year)), 
       aes(x = Year, y = Count, fill = Year)) + 
  geom_col() + 
  scale_fill_viridis_d() + 
  ggtitle("Number of Startups Announced per Year") + 
  xlab("Year") + 
  ylab("Count") + 
  theme_minimal()

Money raised per year

# Aggregate the total money raised by year
money_raised_per_year <- data %>%
  group_by(year) %>%
  summarise(total_money_raised = sum(`Money.Raised.Currency..in.USD.`))

print(money_raised_per_year)

## # A tibble: 15 × 2
##     year total_money_raised
##    <dbl>              <dbl>
##  1  2005           2562029.
##  2  2006           5802767.
##  3  2007           8732676.
##  4  2008          16080359.
##  5  2009          19576669.
##  6  2010          32847560.
##  7  2011          63621007.
##  8  2012         155529115.
##  9  2013          76273350.
## 10  2014         145391602.
## 11  2015         125109456.
## 12  2016         170194789.
## 13  2017         261995823.
## 14  2018         310073623.
## 15  2019         280726028.

# Plot total money raised by year
data %>% 
  group_by(year) %>%
  summarise(total_money_raised = sum(`Money.Raised.Currency..in.USD.`)) %>%
  ggplot(aes(x = year, y = total_money_raised, fill = factor(year))) +
  geom_bar(stat = "identity", color = "black", alpha = 0.8) +
  scale_fill_viridis_d() + 
  ggtitle("Total Money Raised by Startups Announced per Year") +
  xlab("Year") +
  ylab("Money Raised (USD)")

Money raised per location

Europe - money raised by top 6 countries

Money Raised by Top 10 Countries in Asia + Oceania

# select the top 10 countries for Asia + Oceania only
AsiaOceania_data <- data %>%
  filter(Continent %in% c("Asia", "Oceania")) %>%
  group_by(Country) %>% 
  summarize(Total_Money_Raised = sum(Money.Raised.Currency..in.USD.)) %>%
  top_n(10) %>%
  ungroup()

## Selecting by Total_Money_Raised

# create the bar chart
ggplot(AsiaOceania_data, aes(x = reorder(Country, Total_Money_Raised), y = Total_Money_Raised, fill = Country)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = scales::comma(Total_Money_Raised)), 
            position = position_stack(vjust = 1.0), 
            color = "black", size = 3) +
  labs(title = "Total Money Raised by Top 10 Countries in Asia and Oceania",
       x = "Country",
       y = "Total Money Raised (in USD)",
       fill = "") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

Money Raised by Top 5 Countries in Africa

# select the top 5 countries for Africa only
Africa_data <- data %>%
  group_by(Continent) %>%
  filter(Continent == "Africa") %>%
  group_by(Country) %>% 
  summarize(Total_Money_Raised = sum(Money.Raised.Currency..in.USD.)) %>%
  top_n(5) %>%
  ungroup()

## Selecting by Total_Money_Raised

# create the bar chart

ggplot(Africa_data, aes(x = reorder(Country, Total_Money_Raised), y = Total_Money_Raised, fill = Country)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = scales::comma(Total_Money_Raised)), 
            position = position_stack(vjust = 1.0), 
            color = "black", size = 4) +
  labs(title =  "Total Money Raised by Top 5 Countries in Africa",
       x = "Country",
       y = "Total Money Raised (in USD)",
       fill = "") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

Money Raised by Top 3 Countries in North America

# select the top 3 countries for North America only
North_America_data <- data %>%
  group_by(Continent) %>%
  filter(Continent == "North America") %>%
  group_by(Country) %>% 
  summarize(Total_Money_Raised = sum(Money.Raised.Currency..in.USD.)) %>%
  top_n(3) %>%
  ungroup()

## Selecting by Total_Money_Raised

# create the bar chart

ggplot(North_America_data, aes(x = reorder(Country, Total_Money_Raised), y = Total_Money_Raised, fill = Country)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = scales::comma(Total_Money_Raised)), 
            position = position_stack(vjust = 1.0), 
            color = "black", size = 5)+
  labs(title =  "Total Money Raised by Top 3 Countries in North America",
       x = "Country",
       y = "Total Money Raised (in USD)",
       fill = "") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

Money Raised by Top 5 Countries in South America

# select the top 5 countries for South America only
South_America_data <- data %>%
  group_by(Continent) %>%
  filter(Continent == "South America") %>%
  group_by(Country) %>% 
  summarize(Total_Money_Raised = sum(Money.Raised.Currency..in.USD.)) %>%
  top_n(5) %>%
  ungroup()

## Selecting by Total_Money_Raised

# create the bar chart

ggplot(South_America_data, aes(x = reorder(Country, Total_Money_Raised), y = Total_Money_Raised, fill = Country)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = scales::comma(Total_Money_Raised)), 
            position = position_stack(vjust = 1.0), 
            color = "black", size = 5)+
  labs(title =  "Total Money Raised by Top 5 Countries in South America",
       x = "Country",
       y = "Total Money Raised (in USD)",
       fill = "") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

Money raised by Continent

Continents_data <- data%>%
  group_by(Continent) %>%
  summarize(Total_Money_Raised = sum(Money.Raised.Currency..in.USD.)) %>%
  ungroup()


ggplot(Continents_data, aes(x = reorder(Continent, Total_Money_Raised), y = Total_Money_Raised, fill = Continent)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = scales::comma(Total_Money_Raised)), 
            position = position_stack(vjust = 1.0), 
            color = "black", size = 3)+
  labs(title =  "Total Money Raised by Continent",
       x = "Continent",
       y = "Total Money Raised (in USD)",
       fill = "") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

ROI by country

In this analysis, ROI is approximated as the ratio of total funding to the money raised in the Seed or Pre-Seed round, expressed as a percentage and aggregated by country. A higher value means that companies in that country went on to attract more total funding relative to the early money they raised. This could be due to various reasons, such as lower operating costs, higher profitability, or better business models, and it is a good reason to take a closer look at those countries in future rounds.
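The ratio used in this section (total funding divided by money raised, times 100, summed per country) can be checked by hand on a tiny example. The sketch below uses base R and made-up countries "A" and "B" with invented amounts:

```r
# Made-up funding rounds for two hypothetical countries
toy <- data.frame(
  Country       = c("A", "A", "B"),
  Money_Raised  = c(100000, 150000, 120000),
  Total_Funding = c(500000, 250000, 120000)
)

# Sum both amounts per country, then take the ratio in percent
sums <- aggregate(cbind(Money_Raised, Total_Funding) ~ Country,
                  data = toy, FUN = sum)
sums$ROI <- sums$Total_Funding / sums$Money_Raised * 100

print(sums)
```

Country "A" raised 250,000 and has 750,000 in total funding, so its ratio is 300%; country "B" has a single round equal to its total, so its ratio is 100%. A value of 100% is therefore the floor once total funding is forced to be at least the money raised.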

#Check what is the ROI by country 
Country_by_Roi <-  data %>%
  group_by(Country) %>%
  summarise(ROI = (sum(Total.Funding.Amount.Currency..in.USD.)/sum(Money.Raised.Currency..in.USD.))*100) %>%
  filter(!is.na(ROI))%>%
  arrange(desc(ROI)) %>%
  head(10)

  
ggplot(Country_by_Roi, aes(reorder(Country, ROI), ROI, fill = Country)) +
  geom_bar(stat = "identity") +
  xlab("Country") +
  ylab("ROI (%)") +
  ggtitle("Top 10 Countries with the Highest Rate") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

The output shows the top 10 countries with the highest ROI (Return on Investment). Portugal has the highest ROI at 75983%, followed by Hungary with 19598% and Colombia with 8730%. The rest of the countries on the list have ROIs ranging from 7517% down to 1633%.

Continent_by_Roi <-  data %>%
  group_by(Continent) %>%
  summarise(ROI = (sum(Total.Funding.Amount.Currency..in.USD.)/sum(Money.Raised.Currency..in.USD.))*100) %>%
  filter(!is.na(ROI))%>%
  arrange(desc(ROI))

  
ggplot(Continent_by_Roi, aes(reorder(Continent, ROI), ROI, fill = Continent)) +
  geom_bar(stat = "identity") +
  xlab("Continent") +
  ylab("ROI (%)") +
  ggtitle("ROI by Continent")  +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

The output shows the ROI (Return on Investment) of companies grouped by continent. The continent with the highest ROI is South America, with an ROI of 4554%, followed by North America with 2712% and Oceania with 1674%. The rest of the continents on the list have ROIs ranging from 916% to 105%.

These numbers suggest that companies in South America and North America are generating higher returns on investment compared to other continents. The data suggest that these regions may be worth closer attention for future investment opportunities.

Clustering

The startup market is characterized by high returns associated with high risk, which translates into just a few companies with a very high total funding (Total.Funding.Amount.Currency..in.USD.) and a large number of investments. We therefore first remove the outliers before clustering.

Removing outliers

remove_outliers <- function(df, var) {
  Q1 <- quantile(df[[var]], 0.25, na.rm = TRUE)
  Q3 <- quantile(df[[var]], 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  df <- df[df[[var]] >= Q1 - 1.5 * IQR & df[[var]] <= Q3 + 1.5 * IQR,]
  return(df)
}

# Remove outliers from Money.Raised.Currency..in.USD.
data <- remove_outliers(data, "Money.Raised.Currency..in.USD.")

# Remove outliers from Total.Funding.Amount.Currency..in.USD.
data <- remove_outliers(data, "Total.Funding.Amount.Currency..in.USD.")
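The rule applied by remove_outliers is the usual 1.5 × IQR fence. A minimal sketch on a made-up vector of amounts shows how the fences are derived and which value falls outside them (the numbers are illustrative, not taken from the dataset):

```r
# Made-up funding amounts with one extreme value
x <- c(120000, 150000, 120000, 1000000, 500000, 20000000)

# Quartiles and interquartile range
q   <- unname(quantile(x, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences: anything outside [lower, upper] is treated as an outlier
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

kept    <- x[x >= lower & x <= upper]
dropped <- x[x < lower | x > upper]

print(c(lower = lower, upper = upper))
print(dropped)
```

Since funding amounts are strongly right-skewed, the lower fence is typically negative and only the upper fence removes rows.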

Normalizing the data

To do the clustering we first need to normalize the data.

# Build the data frame used for clustering
Clustering_Raised_Total <- data.frame(Money_Raised = data$Money.Raised.Currency..in.USD., Total_Funding = data$Total.Funding.Amount.Currency..in.USD.)


Clustering_Norm <- as.data.frame(scale(Clustering_Raised_Total))

# Original data
ggplot(Clustering_Raised_Total, aes(x=Money_Raised, y=Total_Funding)) +
  geom_point() +
  labs(title="Original data") +
  theme_bw()

# Normalized data 
ggplot(Clustering_Norm, aes(x=Money_Raised, y=Total_Funding)) +
  geom_point() +
  labs(title="Normalized data") +
  theme_bw()
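The normalization performed by scale() is a z-score: each column has its mean subtracted and is divided by its standard deviation. A minimal sketch on made-up values confirms that the result matches the manual formula:

```r
# Made-up amounts on very different scales
toy <- data.frame(
  Money_Raised  = c(120000, 150000, 1000000),
  Total_Funding = c(150000, 900000, 4700000)
)

# Column-wise z-scores, as used for the clustering input
z <- as.data.frame(scale(toy))

# The same transformation written out by hand for one column
manual <- (toy$Money_Raised - mean(toy$Money_Raised)) / sd(toy$Money_Raised)

print(z)
print(all.equal(z$Money_Raised, manual))
```

After this step both columns have mean 0 and standard deviation 1, so neither variable dominates the Euclidean distances used by the clustering algorithms.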

Choosing the proper number of clusters

f1 <- fviz_nbclust(Clustering_Norm, FUNcluster = kmeans, method = "silhouette") + 
  ggtitle("Optimal number of clusters \n K-means")
f2 <- fviz_nbclust(Clustering_Norm, FUNcluster = cluster::pam, method = "silhouette") + 
  ggtitle("Optimal number of clusters \n PAM")

grid.arrange(f1, f2, ncol=2)

For K-means we will use 5 clusters even though the plot suggests 10: the average silhouette width is almost the same, and we prefer to have fewer clusters.

For PAM clustering we will use 8 clusters according to the method.
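The silhouette criterion behind these plots can also be computed by hand. The sketch below is an illustration on synthetic data with three well-separated groups, using kmeans from base R and silhouette from the cluster package; the data, the seed, and the range of k are assumptions made for the example:

```r
library(cluster)

# Synthetic data: three compact groups around (0,0), (3,3) and (6,6)
set.seed(42)
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2)
)
d <- dist(toy)

# Average silhouette width for k = 2..6
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(toy, centers = k, nstart = 20)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

# The k with the highest average silhouette width
best_k <- as.integer(names(avg_sil)[which.max(avg_sil)])

print(round(avg_sil, 3))
print(best_k)
```

Printing the whole vector rather than only the maximum makes it easy to see when several values of k score almost the same, which is the situation that motivates choosing a smaller k.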

K-means clustering presentation

km5 <- eclust(Clustering_Norm, k=5 , FUNcluster="kmeans", hc_metric="euclidean", graph=F)

c2 <- fviz_cluster(km5, data=Clustering_Norm, ellipse.type="convex", geom=c("point")) + ggtitle("K-means with 5 clusters")
s2 <- fviz_silhouette(km5)

##   cluster size ave.sil.width
## 1       1  826          0.68
## 2       2  275          0.32
## 3       3  298          0.63
## 4       4  151          0.39
## 5       5   87          0.37

grid.arrange(c2, s2, ncol=2)

The K-means clustering suggests that the startups in the 4th cluster are probably the best in terms of ROI and investment risk, as the invested capital was the lowest but brought the highest returns. In further analysis, we could look for common ground between the startups in the 4th cluster and the other firms.
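One way to back up this kind of reading is to profile each cluster on the original (unscaled) amounts. The following is a minimal sketch on made-up data with two obvious groups; the column names mirror the clustering data frame, but the values and the choice of two clusters are assumptions for illustration:

```r
# Made-up companies: three with low seed money and high total funding,
# three with high seed money and little follow-on funding
set.seed(1)
toy <- data.frame(
  Money_Raised  = c(120000, 125000, 118000, 900000, 950000, 1000000),
  Total_Funding = c(4000000, 4200000, 3900000, 1000000, 1100000, 950000)
)

# Cluster on the scaled data, as in the analysis
km <- kmeans(scale(toy), centers = 2, nstart = 10)

# Mean amounts per cluster on the original scale
profile <- aggregate(toy, by = list(cluster = km$cluster), FUN = mean)

# Total funding relative to money raised, per cluster
profile$ratio <- profile$Total_Funding / profile$Money_Raised

print(profile)
```

The cluster with the highest ratio is the one where little early money went together with large total funding.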

PAM clustering presentation

# Perform PAM clustering with 8 clusters
pam8 <- eclust(Clustering_Norm, k=8 , FUNcluster="pam", graph=F)

# Visualize the clustering results
c3 <- fviz_cluster(pam8, data=Clustering_Norm, ellipse.type="convex", geom=c("point")) + ggtitle("PAM with 8 clusters")
s3 <- fviz_silhouette(pam8)

##   cluster size ave.sil.width
## 1       1  207          0.65
## 2       2  690          0.88
## 3       3   37          0.40
## 4       4   74          0.40
## 5       5  193          0.49
## 6       6   74          0.56
## 7       7  302          0.55
## 8       8   60          0.53

grid.arrange(c3, s3, ncol=2)

The PAM method gives even better results: its size-weighted average silhouette width is about 0.68, compared with about 0.57 for K-Means. We can say that the 3rd and 7th clusters bring the highest returns with the lowest invested value. Let’s dig a bit deeper and see what we can learn about the countries behind these investments.
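One way to compare the two solutions is the overall average silhouette width, which can be recovered from the two `fviz_silhouette()` summaries above as a size-weighted mean. The numbers below are copied from those summaries; because the printed widths are rounded to two decimals, the results are approximate.

```r
# Cluster sizes and average silhouette widths from the K-means summary
km_size <- c(826, 275, 298, 151, 87)
km_sil  <- c(0.68, 0.32, 0.63, 0.39, 0.37)

# Cluster sizes and average silhouette widths from the PAM summary
pam_size <- c(207, 690, 37, 74, 193, 74, 302, 60)
pam_sil  <- c(0.65, 0.88, 0.40, 0.40, 0.49, 0.56, 0.55, 0.53)

km_avg  <- weighted.mean(km_sil, km_size)
pam_avg <- weighted.mean(pam_sil, pam_size)

round(km_avg, 2)   # 0.57
round(pam_avg, 2)  # 0.68
```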

Which country is the origin of the greatest investments?

According to the data, the 3rd and 7th clusters contain the best investments (the highest valuation). Let’s look at the data in a table.

table(data$Country, pam8$cluster == c(7,3))

## Warning in pam8$cluster == c(7, 3): longer object length is not a multiple of
## shorter object length

##                  
##                   FALSE TRUE
##   Argentina           1    0
##   Australia           4    1
##   Bangladesh          2    0
##   Brazil              5    0
##   Canada             52    3
##   Chile               1    0
##   China               9    0
##   Colombia            7    0
##   Czech Republic      1    0
##   Denmark             4    0
##   Ecuador             1    0
##   Egypt               4    0
##   El Salvador         1    0
##   Finland             1    0
##   France              8    2
##   Germany             6    0
##   Ghana               2    0
##   Hong Kong           5    1
##   Iceland             1    0
##   India              44    1
##   Indonesia           6    0
##   Iraq                1    0
##   Ireland             1    1
##   Israel              3    0
##   Mexico             10    0
##   Morocco             1    0
##   Nigeria            12    0
##   Panama              1    0
##   Peru                2    0
##   Philippines         0    1
##   Poland              1    0
##   Puerto Rico         1    0
##   Senegal             1    0
##   Singapore          11    0
##   Slovenia            2    0
##   South Africa        1    0
##   South Korea         1    0
##   Sweden              2    0
##   Switzerland         1    0
##   Tanzania            1    0
##   The Netherlands     1    0
##   Turkey              0    1
##   United Kingdom     33    3
##   United States    1180  137
##   Uruguay             1    0
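The warning printed before this table hints at how `==` behaves with a length-2 right-hand side: R recycles `c(7, 3)` along the cluster vector, so each startup is compared with only one of the two labels, depending on its position. If the intent is "cluster 7 or cluster 3", `%in%` is the usual way to express it. A minimal sketch on a made-up label vector (not the real `pam8$cluster`):

```r
# Hypothetical cluster labels for seven startups
cluster <- c(7, 3, 3, 7, 1, 7, 3)

# `==` recycles c(7, 3): element 1 is compared with 7, element 2 with 3,
# element 3 with 7 again, and so on (R also warns about the length mismatch)
by_equals <- suppressWarnings(cluster == c(7, 3))

# `%in%` asks, for every element, whether it equals 7 or 3
by_in <- cluster %in% c(7, 3)

sum(by_equals)  # 2
sum(by_in)      # 6
```

Applied to this analysis, that would mean `pam8$cluster %in% c(7, 3)`. The tables and chart in this section were produced with `==`, so their counts would likely change.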

table(data$Continent, pam8$cluster)

##                
##                   1   2   3   4   5   6   7   8
##   Africa          4  11   1   1   3   0   1   1
##   Asia           16  37   4   1  13   6   2   6
##   Europe          3  36   2   1  10   2  14   1
##   North America 172 581  30  69 166  66 251  50
##   Oceania         0   3   0   1   0   0   1   0
##   South America   3  10   0   1   1   0   2   1

library(dplyr)
# Filter the data for clusters 7 and 3, as they are the best options.
cluster_data <- data %>% filter(pam8$cluster == c(7,3))

## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `pam8$cluster == c(7, 3)`.
## Caused by warning in `pam8$cluster == c(7, 3)`:
## ! longer object length is not a multiple of shorter object length

# Calculate the ratio of total money raised to number of startups for each country
ratio_by_country <- cluster_data %>% group_by(Country) %>% 
  summarize(ratio = sum(Money.Raised.Currency..in.USD.) / n()) %>%
  arrange(desc(ratio))


# Create the bar chart
ggplot(ratio_by_country, aes(x = reorder(Country, ratio), y = ratio, fill = Country)) +
  geom_col() +
  geom_text(aes(label = scales::comma(round(ratio))), vjust = -0.5, size = 3) +
  scale_y_continuous(labels = scales::comma_format()) +
  labs(x = "Country", y = "Ratio of Total Money Raised to Number of Startups") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

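Based on this analysis, Y Combinator could pay more attention to the United Kingdom: the average money raised per startup there is almost twice as high as in the second-ranked United States (827,710 USD versus 427,723 USD). This comparison should be read with caution, since it rests on only 3 UK startups against 137 US startups in the country table above.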
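Since a per-country average can be driven by very few startups, it helps to report the group size next to it. The following is a minimal base R sketch on made-up rows; `toy` and its values are assumptions standing in for `cluster_data`.

```r
# Made-up rows standing in for `cluster_data` (values are illustrative only)
toy <- data.frame(
  Country = c("United Kingdom", "United Kingdom", "United States",
              "United States", "United States", "United States"),
  Money_Raised = c(900000, 700000, 400000, 450000, 500000, 350000)
)

# Average money raised per startup for each country
avg_by_country <- aggregate(Money_Raised ~ Country, data = toy, FUN = mean)

# Number of startups behind each average
avg_by_country$n_startups <-
  as.vector(table(toy$Country)[as.character(avg_by_country$Country)])

# Countries ordered from highest to lowest average
avg_by_country[order(-avg_by_country$Money_Raised), ]
```

An average based on a handful of startups is much less stable than one based on hundreds, so the `n_startups` column is worth showing alongside the ratio.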

Association Rules

In this part we treat the set of investors co-investing with Y Combinator in each startup as a transaction. The goal is to understand the connections between investment partners and to help startup founders decide which investors to focus on and connect with in order to maximize the probability of further investment. We would also like to lay the groundwork for a recommendation system that helps investors understand their bias toward particular co-investors and explore strong connections among them.

# Load the dataset
df <- read.csv("YC.csv")
df <- df[,c("Name","Investor.Names","Announced.Date")]
# Remove duplicates based on the latest "Announced.Date"
df <- df %>%
  arrange(desc(Announced.Date)) %>%   # sort by Announced.Date in descending order
  distinct(Name, .keep_all = TRUE)   # keep only the first occurrence of each unique name

# Convert the investor names to a list of character vectors (one per startup)
inverstors_list <- strsplit(as.character(df$Investor.Names), ",\\s*")


# Filter out rows without values
inverstors_list <- Filter(length, inverstors_list)

# Sort and de-duplicate the investors within each startup
inverstors_list <- lapply(inverstors_list, function(x) {
  if (length(x) > 1) {
    x <- unique(sort(x))
  }
  x
})


# Get all unique investors
all_Inverstors <- unique(unlist(inverstors_list))

# Create an empty binary matrix
binary_matrix <- matrix(0, nrow = length(inverstors_list), ncol = length(all_Inverstors), dimnames = list(NULL, all_Inverstors))

# Fill in the binary matrix
for (i in seq_along(inverstors_list)) {
  binary_matrix[i, inverstors_list[[i]]] <- 1
}

# Convert the binary matrix to a transaction object
transactions <- as(binary_matrix, "transactions")

# Find association rules
rules <- apriori(transactions, parameter = list(minlen=2, support = 0.004, confidence = 0.5))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.004      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 8 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[2024 item(s), 2133 transaction(s)] done [0.00s].
## sorting and recoding items ... [91 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [129 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
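The binary matrix built above is one route to a transactions object. As an alternative sketch, arules can also coerce a list of item vectors directly, provided each vector has no repeated items. The investor strings below are hypothetical and only illustrate the shape of the data.

```r
library(arules)

# Hypothetical investor strings, one per startup
investor_names <- c(
  "Y Combinator, SV Angel",
  "Y Combinator, FundersClub, SV Angel",
  "Y Combinator"
)

# Split into one character vector per startup and drop repeated names
investors <- lapply(strsplit(investor_names, ",\\s*"), unique)

# Coerce the list of item vectors directly to a transactions object
toy_transactions <- as(investors, "transactions")

length(toy_transactions)         # number of transactions
itemFrequency(toy_transactions)  # relative support of each investor
```

This avoids allocating a dense startups-by-investors matrix, which matters little for about 2,000 rows but can help with larger datasets.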

The rules with the highest values of support, confidence and lift are displayed below.

By support

# Sort by support
rules_by_support <- sort(rules, by = "support", decreasing = TRUE)

# Inspect rules by support
inspect(rules_by_support[1:10])

##      lhs                      rhs            support    confidence coverage  
## [1]  {SV Angel}            => {Y Combinator} 0.05250820 1          0.05250820
## [2]  {FundersClub}         => {Y Combinator} 0.05016409 1          0.05016409
## [3]  {Paul Buchheit}       => {Y Combinator} 0.02812940 1          0.02812940
## [4]  {Zillionize Angel}    => {Y Combinator} 0.02578528 1          0.02578528
## [5]  {Andreessen Horowitz} => {Y Combinator} 0.02062822 1          0.02062822
## [6]  {Alexis Ohanian}      => {Y Combinator} 0.02062822 1          0.02062822
## [7]  {Soma Capital}        => {Y Combinator} 0.01828411 1          0.01828411
## [8]  {500 Startups}        => {Y Combinator} 0.01828411 1          0.01828411
## [9]  {AltaIR Capital}      => {Y Combinator} 0.01734646 1          0.01734646
## [10] {ACE & Company}       => {Y Combinator} 0.01734646 1          0.01734646
##      lift count
## [1]  1    112  
## [2]  1    107  
## [3]  1     60  
## [4]  1     55  
## [5]  1     44  
## [6]  1     44  
## [7]  1     39  
## [8]  1     39  
## [9]  1     37  
## [10] 1     37

The output shows the top 10 association rules sorted by support, i.e. by how frequently each rule occurs in the dataset. Each rule has an investor or investment firm on the left-hand side (LHS) and Y Combinator on the right-hand side (RHS).

The rule with the highest support (0.0525) is {SV Angel} => {Y Combinator}, which means that 5.25% of the transactions in the dataset (112 of 2,133) involve both SV Angel and Y Combinator. The second and third rules also have comparatively high support, showing that FundersClub and Paul Buchheit frequently co-invest with Y Combinator. Note, however, that every rule in this list has a confidence of 1 and a lift of 1: Y Combinator appears in every transaction, so these rules describe how often each investor appears in the dataset rather than a specific affinity with Y Combinator.

The rest of the rules on the list have lower support values, meaning that those investors appear less often alongside Y Combinator. However, they still provide insights into the co-occurrence patterns between investors and investment firms in the startup ecosystem.
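The three measures reported by `inspect()` can be recomputed by hand, which makes their meaning concrete. The sketch below uses a made-up incidence matrix and a hypothetical helper `rule_metrics()`; it is not part of arules.

```r
# Toy incidence matrix: rows are startups, columns are investors (made-up data)
m <- matrix(
  c(1, 1, 1,
    1, 1, 1,
    1, 0, 0,
    1, 0, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("Y Combinator", "SV Angel", "FundersClub"))
)

# Hypothetical helper: support, confidence and lift of the rule {lhs} => {rhs}
rule_metrics <- function(m, lhs, rhs) {
  supp_lhs  <- mean(m[, lhs] == 1)
  supp_rhs  <- mean(m[, rhs] == 1)
  supp_both <- mean(m[, lhs] == 1 & m[, rhs] == 1)
  c(support    = supp_both,
    confidence = supp_both / supp_lhs,
    lift       = supp_both / (supp_lhs * supp_rhs))
}

# RHS present in every transaction: support 0.5, confidence 1, lift 1
yc_rule <- rule_metrics(m, "SV Angel", "Y Combinator")
yc_rule

# Two investors that always appear together: support 0.5, confidence 1, lift 2
co_rule <- rule_metrics(m, "SV Angel", "FundersClub")
co_rule
```

A lift of 1 means the left-hand side carries no information about the right-hand side, which is why rules between co-investors (lift well above 1) are the more informative ones.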

# Graph of the first 20 rules: node size shows support, colour shows lift
plot(rules[1:20,], method = "graph", measure = "support", shading = "lift")

By lift

# Sort by lift
rules_by_lift <- sort(rules, by = "lift", decreasing = TRUE)

# Inspect by lift
inspect(rules_by_lift[1:20])

##      lhs                   rhs                  support confidence    coverage     lift count
## [1]  {UpHonest Capital} => {Wei Guo}        0.004219409  0.6428571 0.006563526 59.61801     9
## [2]  {Y Combinator,                                                                          
##       UpHonest Capital} => {Wei Guo}        0.004219409  0.6428571 0.006563526 59.61801     9
## [3]  {SV Angel,                                                                              
##       Alexis Ohanian}   => {Garry Tan}      0.005157056  0.5500000 0.009376465 58.65750    11
## [4]  {Y Combinator,                                                                          
##       SV Angel,                                                                              
##       Alexis Ohanian}   => {Garry Tan}      0.005157056  0.5500000 0.009376465 58.65750    11
## [5]  {Garry Tan}        => {Alexis Ohanian} 0.008907642  0.9500000 0.009376465 46.05341    19
## [6]  {Y Combinator,                                                                          
##       Garry Tan}        => {Alexis Ohanian} 0.008907642  0.9500000 0.009376465 46.05341    19
## [7]  {SV Angel,                                                                              
##       Garry Tan}        => {Alexis Ohanian} 0.005157056  0.9166667 0.005625879 44.43750    11
## [8]  {Y Combinator,                                                                          
##       SV Angel,                                                                              
##       Garry Tan}        => {Alexis Ohanian} 0.005157056  0.9166667 0.005625879 44.43750    11
## [9]  {Yuri Milner}      => {SV Angel}       0.004219409  0.9000000 0.004688233 17.14018     9
## [10] {Y Combinator,                                                                          
##       Yuri Milner}      => {SV Angel}       0.004219409  0.9000000 0.004688233 17.14018     9
## [11] {Paul Buchheit,                                                                         
##       Alexis Ohanian}   => {SV Angel}       0.004219409  0.8181818 0.005157056 15.58198     9
## [12] {Y Combinator,                                                                          
##       Paul Buchheit,                                                                         
##       Alexis Ohanian}   => {SV Angel}       0.004219409  0.8181818 0.005157056 15.58198     9
## [13] {Tuesday Capital}  => {SV Angel}       0.006094702  0.6500000 0.009376465 12.37902    13
## [14] {Y Combinator,                                                                          
##       Tuesday Capital}  => {SV Angel}       0.006094702  0.6500000 0.009376465 12.37902    13
## [15] {Garry Tan}        => {SV Angel}       0.005625879  0.6000000 0.009376465 11.42679    12
## [16] {Y Combinator,                                                                          
##       Garry Tan}        => {SV Angel}       0.005625879  0.6000000 0.009376465 11.42679    12
## [17] {Alexis Ohanian,                                                                        
##       Garry Tan}        => {SV Angel}       0.005157056  0.5789474 0.008907642 11.02585    11
## [18] {Y Combinator,                                                                          
##       Alexis Ohanian,                                                                        
##       Garry Tan}        => {SV Angel}       0.005157056  0.5789474 0.008907642 11.02585    11
## [19] {Susa Ventures}    => {SV Angel}       0.004219409  0.5625000 0.007501172 10.71261     9
## [20] {Y Combinator,                                                                          
##       Susa Ventures}    => {SV Angel}       0.004219409  0.5625000 0.007501172 10.71261     9

By confidence

library(arules)

# Sort by confidence 
rules_by_confidence <- sort(rules, by = "confidence", decreasing = TRUE)

# Inspect by confidence
inspect(rules_by_confidence[1:20])

##      lhs                                rhs            support     confidence
## [1]  {S28 Capital}                   => {Y Combinator} 0.004219409 1         
## [2]  {AAF Management Ltd.}           => {Y Combinator} 0.004219409 1         
## [3]  {Social Starts}                 => {Y Combinator} 0.004688233 1         
## [4]  {Streamlined Ventures}          => {Y Combinator} 0.004219409 1         
## [5]  {Refactor Capital}              => {Y Combinator} 0.004219409 1         
## [6]  {Draper Associates}             => {Y Combinator} 0.004219409 1         
## [7]  {Lynett Capital}                => {Y Combinator} 0.004219409 1         
## [8]  {StartX (Stanford-StartX Fund)} => {Y Combinator} 0.004688233 1         
## [9]  {Oyster Ventures}               => {Y Combinator} 0.006563526 1         
## [10] {Brainchild Holdings}           => {Y Combinator} 0.004688233 1         
## [11] {Kevin Moore}                   => {Y Combinator} 0.006094702 1         
## [12] {Kevin Mahaffey}                => {Y Combinator} 0.004688233 1         
## [13] {Salesforce Ventures}           => {Y Combinator} 0.005157056 1         
## [14] {Uncork Capital}                => {Y Combinator} 0.004688233 1         
## [15] {CRCM Ventures}                 => {Y Combinator} 0.005625879 1         
## [16] {AME Cloud Ventures}            => {Y Combinator} 0.004688233 1         
## [17] {Fifty Years}                   => {Y Combinator} 0.005157056 1         
## [18] {Menlo Ventures}                => {Y Combinator} 0.004688233 1         
## [19] {Accel}                         => {Y Combinator} 0.008438819 1         
## [20] {Bobby Goodlatte}               => {Y Combinator} 0.004219409 1         
##      coverage    lift count
## [1]  0.004219409 1     9   
## [2]  0.004219409 1     9   
## [3]  0.004688233 1    10   
## [4]  0.004219409 1     9   
## [5]  0.004219409 1     9   
## [6]  0.004219409 1     9   
## [7]  0.004219409 1     9   
## [8]  0.004688233 1    10   
## [9]  0.006563526 1    14   
## [10] 0.004688233 1    10   
## [11] 0.006094702 1    13   
## [12] 0.004688233 1    10   
## [13] 0.005157056 1    11   
## [14] 0.004688233 1    10   
## [15] 0.005625879 1    12   
## [16] 0.004688233 1    10   
## [17] 0.005157056 1    11   
## [18] 0.004688233 1    10   
## [19] 0.008438819 1    18   
## [20] 0.004219409 1     9

The confidence values of the association rules indicate the likelihood of the right-hand side (rhs) occurring given the left-hand side (lhs) occurred. Specifically, a confidence of 1.0 means that the rhs always occurs when the lhs occurs, while a confidence of 0.5 means that the rhs occurs half of the time when the lhs occurs.

For example, consider rule [1]: {S28 Capital} => {Y Combinator}. The confidence value is 1.0, indicating that whenever S28 Capital invests in a startup, Y Combinator also invests in the same startup. This is expected here, since every startup in the dataset was funded by Y Combinator, which is also why the lift of these rules is 1.

# Graph of the 20 rules with the highest lift: node size shows lift, colour shows support
plot(rules_by_lift[1:20,], method = "graph", measure = "lift", shading = "support")

Finding associations for a chosen company

Using association rules, we can create a recommendation system for investors. To illustrate, let’s look at the case of Andreessen Horowitz.

rules_Andreessen_Horowitz <- apriori(transactions, parameter=list(minlen=2, supp=0.001, conf = 0.05), appearance = list(default="rhs",lhs= c("Andreessen Horowitz")))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.05    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 2 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[2024 item(s), 2133 transaction(s)] done [0.00s].
## sorting and recoding items ... [405 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [18 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules_Andreessen_Horowitz <- subset(rules_Andreessen_Horowitz, lift != 1)

rules_Andreessen_Horowitz <- sort(rules_Andreessen_Horowitz, by = "support", decreasing = TRUE)

inspect(rules_Andreessen_Horowitz)

##      lhs                      rhs                    support     confidence
## [1]  {Andreessen Horowitz} => {SV Angel}             0.008907642 0.43181818
## [2]  {Andreessen Horowitz} => {General Catalyst}     0.003281763 0.15909091
## [3]  {Andreessen Horowitz} => {Start Fund}           0.003281763 0.15909091
## [4]  {Andreessen Horowitz} => {Ignition Partners}    0.001875293 0.09090909
## [5]  {Andreessen Horowitz} => {Joshua Schachter}     0.001875293 0.09090909
## [6]  {Andreessen Horowitz} => {Refactor Capital}     0.001406470 0.06818182
## [7]  {Andreessen Horowitz} => {Salesforce Ventures}  0.001406470 0.06818182
## [8]  {Andreessen Horowitz} => {Lerer Hippeau}        0.001406470 0.06818182
## [9]  {Andreessen Horowitz} => {Signatures Capital}   0.001406470 0.06818182
## [10] {Andreessen Horowitz} => {Ashton Kutcher}       0.001406470 0.06818182
## [11] {Andreessen Horowitz} => {First Round Capital}  0.001406470 0.06818182
## [12] {Andreessen Horowitz} => {ACE & Company}        0.001406470 0.06818182
## [13] {Andreessen Horowitz} => {Khosla Ventures}      0.001406470 0.06818182
## [14] {Andreessen Horowitz} => {Data Collective DCVC} 0.001406470 0.06818182
## [15] {Andreessen Horowitz} => {Alexis Ohanian}       0.001406470 0.06818182
## [16] {Andreessen Horowitz} => {Paul Buchheit}        0.001406470 0.06818182
## [17] {Andreessen Horowitz} => {FundersClub}          0.001406470 0.06818182
##      coverage   lift      count
## [1]  0.02062822  8.223823 19   
## [2]  0.02062822 21.208807  7   
## [3]  0.02062822  9.171376  7   
## [4]  0.02062822 24.238636  4   
## [5]  0.02062822 17.628099  4   
## [6]  0.02062822 16.159091  3   
## [7]  0.02062822 13.221074  3   
## [8]  0.02062822 13.221074  3   
## [9]  0.02062822 18.178977  3   
## [10] 0.02062822 16.159091  3   
## [11] 0.02062822  6.610537  3   
## [12] 0.02062822  3.930590  3   
## [13] 0.02062822  4.039773  3   
## [14] 0.02062822  4.155195  3   
## [15] 0.02062822  3.305269  3   
## [16] 0.02062822  2.423864  3   
## [17] 0.02062822  1.359176  3

# Graph of the first 15 rules for Andreessen Horowitz (sorted by support)
plot(rules_Andreessen_Horowitz[1:15,], method = "graph", measure = "support", shading = "lift")

Based on the association rule results, it can be observed that Andreessen Horowitz is significantly associated with SV Angel, as it has the highest support value of 0.0089, and a high lift value of 8.22, indicating that the occurrence of SV Angel is more likely when Andreessen Horowitz is present.

Other rules with relatively high support values include Andreessen Horowitz with General Catalyst, Start Fund, and Ignition Partners. These rules also have high lift values, indicating strong associations between the antecedent and the consequent.

However, it is important to note that the confidence values for all rules are relatively low, ranging from 0.068 to 0.432, indicating that the probability of the consequent occurring given the antecedent is not very high.
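To turn such rules into a simple recommendation list, one option is to require a minimum number of joint deals and a minimum lift, then rank by confidence. The sketch below uses a few rows copied from the `inspect()` output above; the thresholds `min_count` and `min_lift` are hypothetical choices, not values from the analysis.

```r
# A few rows copied from the inspect() output for Andreessen Horowitz
co_investors <- data.frame(
  rhs        = c("SV Angel", "General Catalyst", "Start Fund",
                 "Ignition Partners", "FundersClub"),
  confidence = c(0.43181818, 0.15909091, 0.15909091, 0.09090909, 0.06818182),
  lift       = c(8.223823, 21.208807, 9.171376, 24.238636, 1.359176),
  count      = c(19, 7, 7, 4, 3)
)

# Hypothetical thresholds: some evidence (count) and a clear lift
min_count <- 4
min_lift  <- 2

recommended <- subset(co_investors, count >= min_count & lift >= min_lift)

# Rank by confidence, breaking ties by lift
recommended <- recommended[order(-recommended$confidence, -recommended$lift), ]
recommended
```

Filtering on `count` guards against rules such as those supported by only 3 startups, where a high lift may be an artifact of small numbers.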

Conclusion

In conclusion, our analysis of the Y Combinator investment dataset has shown that clustering and association rule mining techniques can be valuable tools for gaining insights into the characteristics of successful startups. By grouping startups based on funding patterns and business focuses, we were able to identify commonalities among successful companies. Additionally, by applying association rule mining techniques, we were able to identify relationships within the dataset, such as which investors tend to co-invest alongside Y Combinator and with each other.

Overall, our analysis provides valuable insights for entrepreneurs and investors interested in the startup ecosystem. By understanding the characteristics of successful Y Combinator startups, entrepreneurs can tailor their business strategies to increase their chances of success, while investors can make more informed investment decisions. The Y Combinator investment dataset is a valuable resource for exploring trends and patterns in startup funding, and our analysis demonstrates the value of applying data mining techniques to gain insights into this important industry.

Y Combinator Clustering and Association Rules

Szymon Karkoszka

2023-01-23