Final Project: What Factors Are Associated with Higher Valuation in SaaS Companies?
Author
Sinem
Approach
For this final project, I analyze which factors are associated with higher valuation in SaaS companies. Valuation is often used as a measure of business success, and understanding what relates to higher valuation can provide useful insights for entrepreneurs and investors. My main data source is a Kaggle dataset of the top 100 SaaS companies. As a second data source, I use country-level information from the REST Countries API. My workflow follows a standard data science process: load the data, clean the columns, transform variables into usable numeric formats, join a second data source, perform statistical analysis, create visualizations, and summarize the main conclusions.
Code Base
Load Libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(jsonlite)
Attaching package: 'jsonlite'
The following object is masked from 'package:purrr':
flatten
library(scales)
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
The money columns in this dataset are stored as text, such as $270B or $65.4M. I convert these values into numeric dollar amounts so they can be analyzed.
Warning: There were 15 warnings in `mutate()`.
The first warning was:
ℹ In argument: `total_funding_num = convert_money(total_funding)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 14 remaining warnings.
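A minimal sketch of this kind of conversion is shown here. It is not the `convert_money()` function used in this report; `parse_money()` is a hypothetical helper, and the suffix set (K, M, B, T) and the list of placeholder strings are assumptions. Marking placeholders such as "N/A" as missing before coercion is one way to avoid the warnings shown above.

```r
# Hypothetical helper (not the report's convert_money()): parse strings such as
# "$270B" or "$65.4M" into numeric dollar amounts.
parse_money <- function(x) {
  x <- toupper(trimws(as.character(x)))
  # Assumed placeholder strings; mark them missing before any coercion.
  x[x %in% c("", "N/A", "NA", "-")] <- NA_character_
  # Assumed suffix letters and their multipliers.
  multipliers <- c(K = 1e3, M = 1e6, B = 1e9, T = 1e12)
  # Drop dollar signs, thousands separators, and whitespace.
  cleaned <- gsub("[$,[:space:]]", "", x)
  # Whatever follows the leading digits is treated as the suffix.
  suffix <- sub("^[0-9.]+", "", cleaned)
  number <- suppressWarnings(as.numeric(sub("[A-Z]+$", "", cleaned)))
  # No suffix means the value is already in dollars; unknown suffixes give NA.
  scale <- ifelse(suffix == "", 1, multipliers[suffix])
  unname(number * scale)
}

parsed <- parse_money(c("$270B", "$65.4M", "N/A"))
print(parsed)
```

Inside a pipeline this could be applied with `mutate(total_funding_num = parse_money(total_funding))`, with rows that end up as `NA` handled in a later filtering step.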
Load Data Source 2: Country Data from REST Countries API
I use the REST Countries API to add basic country-level information, such as region and population, based on each company's headquarters location. This data is not central to the main analysis, but it adds additional context.
# A tibble: 10 × 4
company hq_country country_region country_population
<chr> <chr> <chr> <int>
1 Microsoft United States Americas 340110988
2 Salesforce United States Americas 340110988
3 Adobe United States Americas 340110988
4 Oracle United States Americas 340110988
5 Intuit United States Americas 340110988
6 ServiceNow United States Americas 340110988
7 Workday United States Americas 340110988
8 Zoom United States Americas 340110988
9 Shopify Canada Americas 41651653
10 Atlassian Australia Oceania 27536874
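The join that produces a table like the one above can be sketched as follows. This is an illustration only: the `countries` lookup stands in for the flattened API response and is filled with values copied from the table, and the names `companies`, `countries`, `country_name`, and `joined_sketch` are assumptions rather than the objects used in this report.

```r
library(dplyr)

# Three company rows copied from the table above.
companies <- data.frame(
  company = c("Microsoft", "Shopify", "Atlassian"),
  hq_country = c("United States", "Canada", "Australia"),
  stringsAsFactors = FALSE
)

# Stand-in for the country lookup built from the API response (assumed shape).
countries <- data.frame(
  country_name = c("United States", "Canada", "Australia"),
  country_region = c("Americas", "Americas", "Oceania"),
  country_population = c(340110988L, 41651653L, 27536874L),
  stringsAsFactors = FALSE
)

# A left join keeps every company, even if its country has no match.
joined_sketch <- left_join(
  companies,
  countries,
  by = c("hq_country" = "country_name")
)
print(joined_sketch)
```

A left join is a reasonable choice here because the country data is only context: a company whose country name does not match the lookup stays in the data with `NA` in the country columns instead of being dropped.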
ggplot(correlation_results, aes(x = reorder(variable, correlation), y = correlation)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Correlation Between Business Factors and Valuation",
    x = "Variable",
    y = "Correlation with Valuation"
  ) +
  theme_minimal()
Statistical Analysis 2: Linear Regression
I use a multiple linear regression model to see how ARR, funding, employees, company age, and G2 rating relate to valuation when they are considered together. I log-transform valuation, ARR, funding, and employee count because these variables are very large and spread out.
Call:
lm(formula = log_valuation ~ log_arr + log_funding + log_employees +
company_age + g2_rating, data = saas_model_data)
Residuals:
Min 1Q Median 3Q Max
-1.31269 -0.25480 0.05232 0.31620 1.60783
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.716832 2.395972 -0.717 0.4757
log_arr 0.666182 0.120731 5.518 4.00e-07 ***
log_funding 0.006495 0.040493 0.160 0.8730
log_employees 0.293577 0.154034 1.906 0.0602 .
company_age 0.003477 0.012576 0.276 0.7829
g2_rating 1.959936 0.373026 5.254 1.18e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5375 on 81 degrees of freedom
Multiple R-squared: 0.8591, Adjusted R-squared: 0.8504
F-statistic: 98.74 on 5 and 81 DF, p-value: < 2.2e-16
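Because both valuation and ARR enter the model as logs, the `log_arr` coefficient can be read as an elasticity. The short calculation below illustrates that reading using the estimate from the output above. It assumes both variables were transformed with the same logarithm, and it describes an association in this sample, not a causal effect. The variable names are for illustration only.

```r
# Coefficient on log_arr, copied from the regression output above.
beta_log_arr <- 0.666182

# In a log-log model, multiplying ARR by k multiplies the predicted
# valuation by k^beta, holding the other predictors fixed.
arr_multiplier <- 2
valuation_multiplier <- arr_multiplier^beta_log_arr

# Express the change as a percentage.
percent_change <- (valuation_multiplier - 1) * 100
print(round(percent_change, 1))
```

Under these assumptions, a company with twice the ARR of an otherwise similar company is predicted to have a valuation roughly 59% higher, not twice as high, since the coefficient is below 1.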
Graphic 2: ARR vs Valuation
ggplot(saas_joined, aes(x = arr_billions, y = valuation_billions)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10() +
  scale_y_log10() +
  labs(
    title = "ARR and Valuation Among Top SaaS Companies",
    x = "ARR in Billions of Dollars (log scale)",
    y = "Valuation in Billions of Dollars (log scale)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Graphic 3: Funding vs Valuation
ggplot(saas_joined, aes(x = funding_billions, y = valuation_billions)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10() +
  scale_y_log10() +
  labs(
    title = "Funding and Valuation Among Top SaaS Companies",
    x = "Total Funding in Billions of Dollars (log scale)",
    y = "Valuation in Billions of Dollars (log scale)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Extra Feature: Create a SaaS Efficiency Ranking
For the extra feature, I create a business efficiency metric. This ranks companies by how much valuation they created relative to their total funding. Because some older companies have very small recorded funding amounts, the funding efficiency ranking should be interpreted carefully. It is useful as an exploratory metric, but it may not fully represent all historical capital used by each company.
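One way such a ranking could be computed is sketched below. The data frame is a toy example with placeholder companies and made-up numbers, not values from the dataset, and `funding_efficiency` and `efficiency_rank` are assumed column names. Rows with missing or zero funding are dropped first, since the ratio is undefined or infinite for them.

```r
library(dplyr)

# Toy data with placeholder companies; the numbers are illustrative only.
toy_companies <- data.frame(
  company = c("Company A", "Company B", "Company C", "Company D"),
  valuation_billions = c(50, 10, 20, 5),
  funding_billions = c(1, 2, NA, 0),
  stringsAsFactors = FALSE
)

efficiency_ranking <- toy_companies %>%
  # Keep only rows where the ratio is well defined.
  filter(!is.na(funding_billions), funding_billions > 0) %>%
  # Valuation per dollar of recorded funding.
  mutate(funding_efficiency = valuation_billions / funding_billions) %>%
  arrange(desc(funding_efficiency)) %>%
  mutate(efficiency_rank = row_number())

print(efficiency_ranking)
```

In this toy example, Company A ranks first with a ratio of 50 only because its recorded funding is small, which is the same caveat that applies to older companies in the real data.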
ggplot(industry_summary, aes(x = reorder(industry, average_valuation_b), y = average_valuation_b)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top SaaS Industries by Average Valuation",
    x = "Industry",
    y = "Average Valuation in Billions of Dollars"
  ) +
  theme_minimal()
Conclusion
I looked at which factors are related to higher valuation in SaaS companies. Based on the correlation results, ARR had the strongest relationship with valuation, much stronger than any other variable. The number of employees also showed a strong relationship, while funding had almost no correlation with valuation in this dataset.
This was a bit surprising, because I expected funding to have a stronger relationship with valuation. However, the results suggest that revenue is a much stronger indicator of company value than how much funding a company has raised.
I also created a funding efficiency metric, which shows how much valuation a company generates relative to its funding. Some companies appear extremely efficient, but this should be interpreted carefully because older companies may not have complete funding data in the dataset.
Overall, this analysis shows that business performance, especially revenue, seems to matter more than funding when it comes to valuation.
Project Challenge
One challenge I faced in this project was that important columns such as funding, ARR, and valuation were stored as text instead of numbers. These values included symbols like $, M, B, and T, so I could not directly use them in calculations.
To fix this, I wrote a function to convert these values into numeric format. During this process, some values such as “N/A” were converted into missing values, which caused warnings in R. I handled this by filtering out incomplete rows before running the analysis.
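The filtering step described above might look like the following sketch. The data is a toy example with made-up values. `total_funding_num` matches the column named in the warning output, while `arr_num` and `model_rows` are assumed names.

```r
library(tidyr)

# Toy data: NA marks values that could not be converted (for example "N/A").
toy_converted <- data.frame(
  company = c("Company A", "Company B", "Company C"),
  total_funding_num = c(1e9, NA, 5e8),
  arr_num = c(2e8, 3e8, NA),
  stringsAsFactors = FALSE
)

# Keep only rows where every listed column has a value.
model_rows <- drop_na(toy_converted, total_funding_num, arr_num)

# Report how many rows were removed so the sample size stays visible.
rows_dropped <- nrow(toy_converted) - nrow(model_rows)
print(rows_dropped)
```

Listing the columns explicitly in `drop_na()` keeps rows that are only missing values in columns the model does not use.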