A unicorn company is a privately held company with a current valuation of over $1 billion USD. This dataset consists of unicorn companies and startups across the globe as of November 2021, including country of origin, sector, select investors, and valuation of each unicorn.
Note former unicorn companies that have since exited due to IPO or acquisitions are not included in this list.
You have been hired as a data scientist for a company that invests in start-ups. Your manager is interested in whether it is possible to predict whether a company reaches a valuation over 5 billion based on characteristics such as its country of origin, its category, and details about its investors.
Using the dataset provided, you have been asked to test whether such predictions are possible, and the confidence one can have in the results.
# import libraries
suppressPackageStartupMessages(library(tidyverse))
library(ggplot2)
# import datasets
companies <- read_csv("companies.csv")
## Rows: 1074 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): company, city, country, continent
## dbl (1): company_id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
industries <- read_csv("industries.csv")
## Rows: 1074 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): industry
## dbl (1): company_id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
funding <- read_csv("funding.csv")
## Rows: 1074 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): select_investors
## dbl (3): company_id, valuation, funding
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# view datasets
glimpse(companies)
## Rows: 1,074
## Columns: 5
## $ company_id <dbl> 189, 848, 556, 999, 396, 931, 364, 732, 906, 72, 983, 898, …
## $ company <chr> "Otto Bock HealthCare", "Matrixport", "Cloudinary", "PLACE"…
## $ city <chr> "Duderstadt", NA, "Santa Clara", "Bellingham", "New York", …
## $ country <chr> "Germany", "Singapore", "United States", "United States", "…
## $ continent <chr> "Europe", "Asia", "North America", "North America", "North …
glimpse(industries)
## Rows: 1,074
## Columns: 2
## $ company_id <dbl> 189, 848, 556, 999, 396, 931, 364, 732, 906, 72, 983, 898, …
## $ industry <chr> "Health", "Fintech", "Internet software & services", "Inter…
glimpse(funding)
## Rows: 1,074
## Columns: 4
## $ company_id <dbl> 189, 848, 556, 999, 396, 931, 364, 732, 906, 72, 983,…
## $ valuation <dbl> 4e+09, 1e+09, 2e+09, 1e+09, 2e+09, 1e+09, 2e+09, 1e+0…
## $ funding <dbl> 0.00e+00, 1.00e+08, 1.00e+08, 1.00e+08, 1.00e+08, 1.0…
## $ select_investors <chr> "EQT Partners", "\"Dragonfly Captial, Qiming Venture …
# join the three datasets together
unicorn <- left_join(companies, industries, by= "company_id") %>%
left_join(funding, by="company_id")
The dataset consists of 1024 rows with 9 columns
# check for missing values
unicorn %>%
is.na() %>%
colSums()
## company_id company city country
## 0 0 16 0
## continent industry valuation funding
## 0 0 0 0
## select_investors
## 1
# check for duplicate entries
unicorn %>%
distinct()
## # A tibble: 1,074 × 9
## company_id company city country continent industry valuation funding
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 189 Otto Bock Heal… Dude… Germany Europe Health 4e9 0
## 2 848 Matrixport <NA> Singap… Asia Fintech 1e9 1 e8
## 3 556 Cloudinary Sant… United… North Am… Interne… 2e9 1 e8
## 4 999 PLACE Bell… United… North Am… Interne… 1e9 1 e8
## 5 396 candy.com New … United… North Am… Fintech 2e9 1 e8
## 6 931 HAYDON Shan… China Asia Consume… 1e9 1 e8
## 7 364 eDaili Shan… China Asia E-comme… 2e9 1.01e8
## 8 732 CoinTracker San … United… North Am… Fintech 1e9 1.02e8
## 9 906 EcoFlow Shen… China Asia Hardware 1e9 1.05e8
## 10 72 DJI Innovations Shen… China Asia Hardware 8e9 1.05e8
## # ℹ 1,064 more rows
## # ℹ 1 more variable: select_investors <chr>
Most missing values are contained in the city column which is not useful for my analysis, so I’ll drop the city and company_id columns. The dataset also contains no duplicated entries,
unicorn <- unicorn %>%
select(-city, -company_id)
# describe the dataset
n_distinct(unicorn$company)
## [1] 1073
n_distinct(unicorn$country)
## [1] 46
n_distinct(unicorn$industry)
## [1] 15
This dataset contains records of 1073 unicorn companies from 46 countries of the world, spread out over 15 different industries.
The data has been prepared and is now ready for analysis.
unicorn_summary <- unicorn %>%
group_by(industry) %>%
summarise(total_ind= n()) %>%
arrange(-total_ind)
# create a bar chart
ggplot(unicorn_summary, aes(x = reorder(industry, -total_ind), y = total_ind)) +
geom_bar(stat = "identity", fill = "skyblue", color = "black") +
geom_label(aes(label = total_ind), vjust = 1.0) + # Provide labels using aes()
labs(title = "Unicorns in Each Industry",
x = "Industry",
y = "Number of companies") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1), panel.grid = element_blank(), axis.text.y = element_blank())
The Fintech industry which has experienced a massive boom in recent years unsurprisingly has the highest number of unicorn companies with 224, closely followed by the Internet software & services industry with 205 unicorn companies. This represents a significant lead over other startup companies in the unicorn space. When compared to traditional industries like Consumer & retail, travel, etc, the difference in the number of unicorn companies becomes apparent, highlighting a clear preference for investment and growth in these booming sectors.
country_sum <- unicorn %>%
group_by(continent) %>%
summarise(total_com = n()) %>%
arrange(-total_com) %>%
head(10)
ggplot(country_sum, aes(x = reorder(continent, -total_com), y = total_com)) +
geom_bar(stat = "identity", fill = "skyblue", color = "black") +
labs(title = "Total Unicorn Companies in Each Continent",
x = "",
y = "") +
theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1), panel.grid = element_blank())
The bar chart illustrates the distribution of unicorn companies across continents, revealing interesting insights into their prevalence in different regions.
Unsurprisingly, North America leads the way with the most unicorn companies (589), with majority of these companies domiciled in the United States. The vibrant tech ecosystem in North America particularly in regions like Silicon Valley, contributes significantly to this dominance.
Following North America, Asia and Europe exhibit substantial numbers of unicorn companies. Asia, driven by technology hubs in countries like China and India, secures the second position, closely followed by Europe.
In contrast, Africa is home to the fewest unicorn companies among the continents analyzed. While the tech landscape is growing in Africa, it currently faces challenges that impact the number of unicorn startups compared to other continents.
To analyze this dataset further, each company has separate investors and this column needs to be cleaned to ensure accuracy of analysis
# replace every instance of double quotes with an empty string
unicorn <- unicorn %>%
mutate(investor =
gsub("\"", "", unicorn$select_investors)) %>%
select(-select_investors)
# separate each investor to a different row
unicorn_clean <- unicorn %>%
mutate(investors = str_split(investor, ",")) %>%
unnest() %>%
select(-investor)
In this section, I analyze the valuation column in the dataset by calculating different descriptive statistic measures, aggregating by different variables, to highlight potential outliers and understand the dataset better.
# average market valuation
unicorn %>%
summarise(avg_valuation = mean(valuation))
## # A tibble: 1 × 1
## avg_valuation
## <dbl>
## 1 3455307263.
## average valuation for companies in each country
av_countr <- unicorn %>%
group_by(country) %>%
summarise(avg_country_valuation = mean(valuation)) %>%
arrange(-avg_country_valuation)
## estimating each continent's contribution to total valuation
continent_val <- unicorn %>%
group_by(continent) %>%
summarise(total_val = sum(valuation), pct_contribution = round(total_val/sum(unicorn$valuation)*100,1)) %>%
arrange(-pct_contribution)
## estimate each industry's valuation contribution
unicorn %>%
group_by(industry) %>%
summarise(total_val = sum(valuation), pct_contribution = round(total_val/sum(unicorn$valuation)*100,1)) %>%
arrange(-pct_contribution) %>%
head(5) %>% # plot a bubble chart
ggplot(aes(industry, y = pct_contribution, size = total_val)) + geom_point() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + labs(title = "Industry valuation and % contribution", x = "", y = "") + theme(
plot.background = element_rect(size = 1, color = "black"),
panel.background = element_rect(fill = "#EAEAEA"),
plot.title = element_text(size = 15),
axis.title = element_text(size = 7),
axis.text = element_text(size = 9),
legend.title = element_text(size = 10),
legend.text = element_text(size = 12)
)
As expected, the Fintech bubble is the biggest,
indicating its significant contribution to the valuation
landscape.Further exploration of investment opportunities in this sector
could be beneficial.
Internet software &
services, a relative oldhead in the unicorn space still looks
in good shape, and you can hardly go wrong by investing into this
space.
ggplot(unicorn, aes(continent, valuation)) + geom_boxplot(fill = "lightblue") + scale_y_log10() +theme_minimal() + labs(title = "Valuation by Continent", y = "Valuation (log scale)")
In this section, I want to understand if the valuation of a unicorn company is greatly affected by the number of investors the company has, and what the correlation is. Also, how the presence of some specific investors influences a company’s valuation.
unicorn_clean %>%
group_by(company, industry) %>%
mutate(count = n()) %>%
ungroup() %>%
group_by(count) %>%
summarise(average_valuation = mean(valuation)) %>%
arrange(desc(average_valuation)) %>%
ggplot(aes(x = factor(count), y = average_valuation, fill = factor(count))) +
geom_bar(stat = "identity") +
scale_x_discrete(name = "No of Investors") +
scale_y_continuous(name = "Average Valuation") +
ggtitle("Average Valuation by No of Investors") +
theme_minimal() + guides(fill = FALSE)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
Our analysis revealed that companies with 4 investors exhibited the highest average valuation among unicorn companies. This suggests a positive correlation between the number of investors and the overall valuation of a company. A more in-depth examination of these companies and their unique characteristics could provide insights into the factors driving such high valuations.
Following closely behind, companies with 3 investors displayed the second-highest average valuation. This indicates a substantial valuation impact even with a slightly lower number of investors.
Interestingly, companies with only 1 investor secured the third-highest average valuation. This counter-intuitive finding suggests that individual investors, under certain circumstances, may contribute significantly to a company’s valuation.
On the other hand, companies with 2 investors exhibited the lowest average valuation among the groups we analyzed. This prompts further investigation into potential reasons behind this lower valuation and whether it is influenced by specific industry dynamics or company characteristics.
inv <- unicorn_clean %>%
group_by(company, industry) %>%
mutate(count = n()) %>%
ungroup() %>%
group_by(count) %>%
summarise(average_valuation = mean(valuation)) %>%
arrange(desc(average_valuation))
cor(inv$count, inv$average_valuation)
## [1] 0.7839556
With a correlation of 0.78, this suggests a strong positive correlation between the number of investors and the expected average valuation of the company. It is important to note that correlation =/= causation.
# remove leading and trailing spaces in the investors column
unicorn_clean$investors <- stringr::str_trim(unicorn_clean$investors)
#
unicorn_clean %>%
group_by(investors) %>%
summarise(total = n()) %>%
arrange(-total) %>%
head(5)
## # A tibble: 5 × 2
## investors total
## <chr> <int>
## 1 Accel 60
## 2 Andreessen Horowitz 53
## 3 Tiger Global Management 53
## 4 Sequoia Capital China 48
## 5 Insight Partners 47
The top 3 biggest investors are: - Accel (60) - Andreeessen Horowitz (53) - Tiger Global Management (53)
Further analysis can be conducted to understand the nature of investments of the heavy hitters and how profitable they are.
In summary, our analysis of the unicorn companies dataset has unearthed key insights into the dynamics of this unique startup ecosystem:
These recommendations align with the initial scenario of predicting a company’s potential to reach a valuation over $5 billion based on characteristics such as its origin, industry, and investor details. By focusing on emerging sectors, strategic geographical locations, and understanding the impact of investor involvement, your company can enhance its ability to make informed and lucrative investment decisions in the dynamic unicorn startup landscape.
Project by: Ajanaku Ayomide image attribute: Image by rawpixel.com on Freepik