Final Project: What Factors Are Associated with Higher Valuation in SaaS Companies?

Author

Sinem

Approach

For this final project, I analyze what factors are associated with higher valuation in SaaS companies. Valuation is often used as a measure of business success, and understanding what relates to higher valuation can provide useful insights for entrepreneurs and investors. My main data source is a Kaggle dataset of the top 100 SaaS companies, and I use country-level information from the REST Countries API as a second data source. My workflow follows a standard data science process: load the data, clean the columns, transform variables into usable numeric formats, join a second data source, perform statistical analysis, create visualizations, and summarize the main conclusions.

Code Base

Load Libraries

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(jsonlite)


Attaching package: 'jsonlite'

The following object is masked from 'package:purrr':

    flatten

library(scales)


Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor

Load Data Source 1: SaaS Companies Dataset

raw_url <- "https://raw.githubusercontent.com/sinemkilicdere/Data607/refs/heads/main/FINAL/top_100_saas_companies_2025.csv"

saas_raw <- read_csv(raw_url, show_col_types = FALSE)

glimpse(saas_raw)

Rows: 100
Columns: 11
$ `Company Name`  <chr> "Microsoft", "Salesforce", "Adobe", "Oracle", "SAP", "…
$ `Founded Year`  <dbl> 1975, 1999, 1982, 1977, 1972, 1983, 2004, 2005, 2011, …
$ HQ              <chr> "Redmond, WA, USA", "San Francisco, CA, USA", "San Jos…
$ Industry        <chr> "Enterprise Software", "CRM", "Creative Software", "Da…
$ `Total Funding` <chr> "$1B", "$65.4M", "$2.5M", "$2K", "N/A", "$273M", "$82.…
$ ARR             <chr> "$270B", "$37.9B", "$19.4B", "$52.9B", "$32.5B", "$14.…
$ Valuation       <chr> "$3T", "$227.8B", "$240B", "$350B", "$215B", "$180B", …
$ Employees       <dbl> 221000, 75000, 29945, 143000, 107415, 18200, 20000, 18…
$ `Top Investors` <chr> "Bill Gates, Paul Allen", "Halsey Minor, Larry Ellison…
$ Product         <chr> "Azure, Office 365, Teams", "Sales Cloud, Service Clou…
$ `G2 Rating`     <dbl> 4.4, 4.3, 4.5, 4.0, 4.1, 4.4, 4.4, 4.2, 4.5, 4.4, 4.3,…

Clean and Transform the SaaS Dataset

The money columns in this dataset are stored as text, such as $270B or $65.4M. I convert these values into numeric dollar amounts so they can be analyzed.

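# Convert money strings such as "$270B" or "$65.4M" into numeric dollars.
convert_money <- function(x) {
  # Drop dollar signs and thousands separators
  x <- str_remove_all(x, "\\$|,")
  # Match the unit only as a trailing suffix, then scale the number
  case_when(
    str_detect(x, "T$") ~ as.numeric(str_remove(x, "T$")) * 1e12,
    str_detect(x, "B$") ~ as.numeric(str_remove(x, "B$")) * 1e9,
    str_detect(x, "M$") ~ as.numeric(str_remove(x, "M$")) * 1e6,
    str_detect(x, "K$") ~ as.numeric(str_remove(x, "K$")) * 1e3,
    TRUE ~ as.numeric(x)  # plain numbers; text such as "N/A" becomes NA
  )
}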

saas_clean <- saas_raw %>%
  rename(
    company = `Company Name`,
    founded_year = `Founded Year`,
    hq = HQ,
    industry = Industry,
    total_funding = `Total Funding`,
    arr = ARR,
    valuation = Valuation,
    employees = Employees,
    top_investors = `Top Investors`,
    product = Product,
    g2_rating = `G2 Rating`
  ) %>%
  mutate(
    total_funding_num = convert_money(total_funding),
    arr_num = convert_money(arr),
    valuation_num = convert_money(valuation),
    employees_num = as.numeric(str_remove_all(employees, ",")),
    company_age = 2025 - founded_year,
    hq_country = str_trim(str_extract(hq, "[^,]+$")),
    hq_country = if_else(hq_country == "USA", "United States", hq_country),
    valuation_billions = valuation_num / 1e9,
    arr_billions = arr_num / 1e9,
    funding_billions = total_funding_num / 1e9,
    revenue_per_employee = arr_num / employees_num,
    valuation_to_arr = valuation_num / arr_num,
    funding_efficiency = valuation_num / total_funding_num
  )

Warning: There were 15 warnings in `mutate()`.
The first warning was:
ℹ In argument: `total_funding_num = convert_money(total_funding)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 14 remaining warnings.

saas_clean %>%
  select(company, industry, hq_country, funding_billions, arr_billions, valuation_billions, employees_num, g2_rating) %>%
  head(10)

# A tibble: 10 × 8
   company  industry hq_country funding_billions arr_billions valuation_billions
   <chr>    <chr>    <chr>                 <dbl>        <dbl>              <dbl>
 1 Microso… Enterpr… United St…         1               270                3000 
 2 Salesfo… CRM      United St…         0.0654           37.9               228.
 3 Adobe    Creativ… United St…         0.0025           19.4               240 
 4 Oracle   Databas… United St…         0.000002         52.9               350 
 5 SAP      Enterpr… Germany           NA                32.5               215 
 6 Intuit   Financi… United St…         0.273            14.4               180 
 7 Service… IT Serv… United St…         0.0825            8.9               147 
 8 Workday  HR & Fi… United St…         0.250             7.3                65 
 9 Zoom     Video C… United St…         0.146             4.5                85 
10 Shopify  E-comme… Canada             0.122             7.1                95 
# ℹ 2 more variables: employees_num <dbl>, g2_rating <dbl>

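Some values are listed as “N/A” in the original dataset, so as.numeric() converts them to missing values (NA) during cleaning. The warnings do not mean that 15 values were lost: case_when() evaluates every branch for all rows, so as.numeric() is also attempted on strings that another branch ends up handling, and each such attempt can raise a warning.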
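
One way to avoid the coercion warnings is to split each string into its numeric part and its unit suffix before converting. The following is a minimal sketch, not the function used in this project; `parse_money()` and its regular expression are assumptions for illustration and only cover the formats shown in the dataset preview.

```r
library(stringr)

# Hypothetical alternative to convert_money(): parse the number and the
# unit suffix separately, so as.numeric() never sees a suffix.
parse_money <- function(x) {
  # Multipliers for the suffixes that appear in the dataset
  multipliers <- c(K = 1e3, M = 1e6, B = 1e9, T = 1e12)

  cleaned <- str_remove_all(x, "[$,]")

  # Column 2 is the numeric part, column 3 an optional K/M/B/T suffix.
  # Strings that do not fit the pattern (such as "N/A") give NA in both.
  parts <- str_match(cleaned, "^([0-9]*\\.?[0-9]+)([KMBT]?)$")

  value <- as.numeric(parts[, 2])
  suffix <- parts[, 3]

  # No suffix (or no match) means a multiplier of 1
  scale <- ifelse(is.na(suffix) | suffix == "", 1, multipliers[suffix])
  unname(value * scale)
}

parse_money(c("$270B", "$65.4M", "$2K", "$3T", "N/A", "1,500"))
```

With this approach, “N/A” still becomes NA, but without a warning, because the unmatched string is never passed to as.numeric().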

Check Missing Values

I check how many missing values exist in important columns.

saas_clean %>%
  summarize(
    missing_funding = sum(is.na(total_funding_num)),
    missing_arr = sum(is.na(arr_num)),
    missing_valuation = sum(is.na(valuation_num)),
    missing_employees = sum(is.na(employees_num)),
    missing_g2_rating = sum(is.na(g2_rating))
  )

# A tibble: 1 × 5
  missing_funding missing_arr missing_valuation missing_employees
            <int>       <int>             <int>             <int>
1               1           0                12                 0
# ℹ 1 more variable: missing_g2_rating <int>

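Before the analysis, I keep only rows with positive, non-missing values for funding, ARR, valuation, and employees. Funding must be present and positive because the efficiency calculation uses it as the denominator, and the log transformations used later require positive values. This leaves 87 of the 100 companies.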

saas_analysis <- saas_clean %>%
  filter(
    !is.na(total_funding_num),
    !is.na(arr_num),
    !is.na(valuation_num),
    !is.na(employees_num),
    total_funding_num > 0,
    arr_num > 0,
    valuation_num > 0,
    employees_num > 0
  )

Load Data Source 2: Country Data from REST Countries API

I use the REST Countries API to add basic country-level information, such as region and population, based on each company's headquarters location. This data is not central to the main analysis, but it adds context.

country_api <- "https://restcountries.com/v3.1/all?fields=name,population,region,cca3"

country_data <- fromJSON(country_api, flatten = TRUE) %>%
  as_tibble() %>%
  transmute(
    hq_country = name.common,
    country_region = region,
    country_population = population,
    country_code = cca3
  )

saas_joined <- saas_analysis %>%
  left_join(country_data, by = "hq_country")

saas_joined %>%
  select(company, hq_country, country_region, country_population) %>%
  head(10)

# A tibble: 10 × 4
   company    hq_country    country_region country_population
   <chr>      <chr>         <chr>                       <int>
 1 Microsoft  United States Americas                340110988
 2 Salesforce United States Americas                340110988
 3 Adobe      United States Americas                340110988
 4 Oracle     United States Americas                340110988
 5 Intuit     United States Americas                340110988
 6 ServiceNow United States Americas                340110988
 7 Workday    United States Americas                340110988
 8 Zoom       United States Americas                340110988
 9 Shopify    Canada        Americas                 41651653
10 Atlassian  Australia     Oceania                  27536874
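
A left join keeps every company even when its country name has no match in the API data, so unmatched names silently produce NA in the country columns. Below is a small sketch of how this could be checked with `anti_join()`; the toy tables and the "UK" mismatch are made up for illustration and are not taken from the dataset.

```r
library(dplyr)
library(tibble)

# Toy stand-ins for saas_analysis and country_data (hypothetical values)
companies <- tibble(
  company = c("Alpha", "Beta", "Gamma"),
  hq_country = c("United States", "UK", "Canada")
)
countries <- tibble(
  hq_country = c("United States", "United Kingdom", "Canada"),
  country_region = c("Americas", "Europe", "Americas")
)

# Country labels in `companies` that have no match in `countries`
unmatched <- companies %>%
  anti_join(countries, by = "hq_country") %>%
  count(hq_country, sort = TRUE)

unmatched

# One possible fix: recode the mismatched label before joining
companies_fixed <- companies %>%
  mutate(hq_country = if_else(hq_country == "UK", "United Kingdom", hq_country))

joined <- left_join(companies_fixed, countries, by = "hq_country")
joined
```

Running the same `anti_join()` on the real tables would list any headquarters labels that still need recoding, in the same way that "USA" is recoded to "United States" during cleaning.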

Descriptive Summary

saas_joined %>%
  summarize(
    number_of_companies = n(),
    average_valuation_b = mean(valuation_billions, na.rm = TRUE),
    median_valuation_b = median(valuation_billions, na.rm = TRUE),
    average_arr_b = mean(arr_billions, na.rm = TRUE),
    average_funding_b = mean(funding_billions, na.rm = TRUE),
    average_employees = mean(employees_num, na.rm = TRUE),
    average_g2_rating = mean(g2_rating, na.rm = TRUE)
  )

# A tibble: 1 × 7
  number_of_companies average_valuation_b median_valuation_b average_arr_b
                <int>               <dbl>              <dbl>         <dbl>
1                  87                65.2                9.5          6.20
# ℹ 3 more variables: average_funding_b <dbl>, average_employees <dbl>,
#   average_g2_rating <dbl>

Statistical Analysis 1: Correlation with Valuation

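This analysis computes the Pearson correlation between valuation and each numeric variable. The values are on the raw dollar scale, so a few very large companies, such as Microsoft, can dominate the results.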

correlation_results <- saas_joined %>%
  summarize(
    funding_correlation = cor(valuation_num, total_funding_num, use = "complete.obs"),
    arr_correlation = cor(valuation_num, arr_num, use = "complete.obs"),
    employees_correlation = cor(valuation_num, employees_num, use = "complete.obs"),
    g2_rating_correlation = cor(valuation_num, g2_rating, use = "complete.obs"),
    age_correlation = cor(valuation_num, company_age, use = "complete.obs")
  ) %>%
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "correlation"
  ) %>%
  arrange(desc(abs(correlation)))

correlation_results

# A tibble: 5 × 2
  variable              correlation
  <chr>                       <dbl>
1 arr_correlation            0.993 
2 employees_correlation      0.878 
3 age_correlation            0.589 
4 g2_rating_correlation     -0.0629
5 funding_correlation        0.0213
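
`cor()` computes the Pearson correlation by default, which is sensitive to extreme values. As a hedged sketch, the same relationship can be checked on a log scale or with Spearman's rank correlation; the two vectors below are small illustrative values, not the full dataset.

```r
# Illustrative values in billions of dollars (not the full dataset)
valuation <- c(3000, 350, 240, 95, 55, 9.5, 4)
arr <- c(270, 52.9, 19.4, 7.1, 3.5, 1.2, 0.6)

# Pearson correlation on the raw scale (driven by the largest company)
pearson_raw <- cor(valuation, arr)

# Pearson correlation after a log transformation
pearson_log <- cor(log(valuation), log(arr))

# Spearman correlation uses ranks only, so outliers carry less weight
spearman <- cor(valuation, arr, method = "spearman")

c(pearson_raw = pearson_raw, pearson_log = pearson_log, spearman = spearman)
```

If the three measures agree on the real data, the ranking of variables in the table is less likely to be an artifact of one or two very large companies.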

Graphic 1: Correlations with Valuation

ggplot(correlation_results, aes(x = reorder(variable, correlation), y = correlation)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Correlation Between Business Factors and Valuation",
    x = "Variable",
    y = "Correlation with Valuation"
  ) +
  theme_minimal()

Statistical Analysis 2: Linear Regression

I use a multiple linear regression model to see how ARR, funding, employees, company age, and G2 rating relate to valuation at the same time. I log-transform valuation, ARR, funding, and employees because these variables are very large and highly skewed.

saas_model_data <- saas_joined %>%
  mutate(
    log_valuation = log(valuation_num),
    log_arr = log(arr_num),
    log_funding = log(total_funding_num),
    log_employees = log(employees_num)
  )

valuation_model <- lm(
  log_valuation ~ log_arr + log_funding + log_employees + company_age + g2_rating,
  data = saas_model_data
)

summary(valuation_model)


Call:
lm(formula = log_valuation ~ log_arr + log_funding + log_employees + 
    company_age + g2_rating, data = saas_model_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.31269 -0.25480  0.05232  0.31620  1.60783 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -1.716832   2.395972  -0.717   0.4757    
log_arr        0.666182   0.120731   5.518 4.00e-07 ***
log_funding    0.006495   0.040493   0.160   0.8730    
log_employees  0.293577   0.154034   1.906   0.0602 .  
company_age    0.003477   0.012576   0.276   0.7829    
g2_rating      1.959936   0.373026   5.254 1.18e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5375 on 81 degrees of freedom
Multiple R-squared:  0.8591,    Adjusted R-squared:  0.8504 
F-statistic: 98.74 on 5 and 81 DF,  p-value: < 2.2e-16
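
Because valuation and ARR are both logged, the `log_arr` coefficient can be read as an elasticity, while the `g2_rating` coefficient (not logged) acts multiplicatively on valuation. The short calculation below uses the estimates printed above; it describes associations in this sample with the other predictors held fixed, not causal effects.

```r
# Coefficient estimates copied from the summary() output above
b_log_arr <- 0.666182
b_g2_rating <- 1.959936

# Relative difference in valuation associated with 10% higher ARR
arr_effect <- 1.10^b_log_arr - 1

# Relative difference in valuation associated with a G2 rating
# that is 0.1 points higher
g2_effect <- exp(b_g2_rating * 0.1) - 1

# Express both as percentages
round(100 * c(arr_10pct = arr_effect, g2_plus_0.1 = g2_effect), 1)
```

Under this model, 10% higher ARR is associated with roughly 6.6% higher valuation, and a 0.1-point higher G2 rating with roughly 22% higher valuation.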

Graphic 2: ARR vs Valuation

ggplot(saas_joined, aes(x = arr_billions, y = valuation_billions)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10() +
  scale_y_log10() +
  labs(
    title = "ARR and Valuation Among Top SaaS Companies",
    x = "ARR in Billions of Dollars (log scale)",
    y = "Valuation in Billions of Dollars (log scale)"
  ) +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

Graphic 3: Funding vs Valuation

ggplot(saas_joined, aes(x = funding_billions, y = valuation_billions)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10() +
  scale_y_log10() +
  labs(
    title = "Funding and Valuation Among Top SaaS Companies",
    x = "Total Funding in Billions of Dollars (log scale)",
    y = "Valuation in Billions of Dollars (log scale)"
  ) +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

Extra Feature: Create a SaaS Efficiency Ranking

For the extra feature, I create a business efficiency metric. This ranks companies by how much valuation they created relative to their total funding. Because some older companies have very small recorded funding amounts, the funding efficiency ranking should be interpreted carefully. It is useful as an exploratory metric, but it may not fully represent all historical capital used by each company.

efficiency_ranking <- saas_joined %>%
  arrange(desc(funding_efficiency)) %>%
  select(
    company,
    industry,
    valuation_billions,
    funding_billions,
    arr_billions,
    funding_efficiency,
    revenue_per_employee
  ) %>%
  head(10)

efficiency_ranking

# A tibble: 10 × 7
   company            industry  valuation_billions funding_billions arr_billions
   <chr>              <chr>                  <dbl>            <dbl>        <dbl>
 1 Oracle             Database…               350          0.000002         52.9
 2 Adobe              Creative…               240          0.0025           19.4
 3 Veeva Systems      Life Sci…                35          0.007             2.4
 4 Salesforce         CRM                     228.         0.0654           37.9
 5 Microsoft          Enterpri…              3000          1               270  
 6 ServiceNow         IT Servi…               147          0.0825            8.9
 7 Palo Alto Networks Cybersec…                95          0.0663            7.5
 8 Atlassian          Collabor…                55          0.06              3.5
 9 Shopify            E-commer…                95          0.122             7.1
10 Intuit             Financia…               180          0.273            14.4
# ℹ 2 more variables: funding_efficiency <dbl>, revenue_per_employee <dbl>
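
To see why the ranking is so sensitive to small recorded funding amounts, the ratio can be computed by hand for a few rows of the table above. This is only a sketch using the rounded values shown there.

```r
# Values in dollars, taken from the ranking table above
valuation <- c(Oracle = 350e9, Adobe = 240e9, Microsoft = 3000e9)
funding <- c(Oracle = 2e3, Adobe = 2.5e6, Microsoft = 1e9)

# Same definition as funding_efficiency: valuation / total funding
funding_efficiency <- valuation / funding
funding_efficiency

# The ratios span several orders of magnitude, so a log10 scale is
# easier to compare than the raw values
round(log10(funding_efficiency), 1)
```

Oracle's recorded $2K of funding gives a ratio of 175 million, while Microsoft's ratio is 3,000, which is why Oracle tops the ranking despite a much smaller valuation.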

Graphic 4: Top 10 Companies by Funding Efficiency

ggplot(efficiency_ranking, aes(x = reorder(company, funding_efficiency), y = funding_efficiency)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 10 SaaS Companies by Funding Efficiency",
    x = "Company",
    y = "Valuation / Total Funding"
  ) +
  theme_minimal()

Industry Level Summary

industry_summary <- saas_joined %>%
  group_by(industry) %>%
  summarize(
    company_count = n(),
    average_valuation_b = mean(valuation_billions, na.rm = TRUE),
    average_arr_b = mean(arr_billions, na.rm = TRUE),
    average_g2_rating = mean(g2_rating, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(average_valuation_b)) %>%
  head(10)

industry_summary

# A tibble: 10 × 5
   industry    company_count average_valuation_b average_arr_b average_g2_rating
   <chr>               <int>               <dbl>         <dbl>             <dbl>
 1 Enterprise…             1              3000           270                 4.4
 2 Database &…             1               350            52.9               4  
 3 Creative S…             1               240            19.4               4.5
 4 CRM                     1               228.           37.9               4.3
 5 Financial …             1               180            14.4               4.4
 6 IT Service…             1               147             8.9               4.4
 7 E-commerce              1                95             7.1               4.4
 8 Video Comm…             1                85             4.5               4.5
 9 Cybersecur…             2                82.5           5.3               4.6
10 Data Wareh…             1                75             2.8               4.4

Graphic 5: Top Industries by Average Valuation

ggplot(industry_summary, aes(x = reorder(industry, average_valuation_b), y = average_valuation_b)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top SaaS Industries by Average Valuation",
    x = "Industry",
    y = "Average Valuation in Billions of Dollars"
  ) +
  theme_minimal()

Conclusion

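I looked at what factors are related to higher valuation in SaaS companies. Based on the correlation results, ARR had by far the strongest relationship with valuation (r = 0.99). The number of employees also showed a strong relationship (r = 0.88), while funding had almost no correlation with valuation in this dataset (r = 0.02). In the regression on log-transformed values, ARR and G2 rating were the only predictors significant at the 5% level; funding and company age were not, and employees was only marginally significant (p = 0.06).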

This was a bit surprising, because I expected funding to be more strongly related to valuation. However, the results suggest that revenue is a much stronger indicator of company value than the amount of funding a company has raised.

I also created a funding efficiency metric, which shows how much valuation a company generates relative to its funding. Some companies appear extremely efficient, but this should be interpreted carefully because older companies may not have complete funding data in the dataset.

Overall, this analysis shows that business performance, especially revenue, seems to matter more than funding when it comes to valuation.

Project Challenge

One challenge I faced in this project was that important columns such as funding, ARR, and valuation were stored as text instead of numbers. These values included symbols like $, M, B, and T, so I could not directly use them in calculations.

To fix this, I wrote a function to convert these values into numeric format. During this process, values such as “N/A” were converted into missing values, and R issued coercion warnings. I handled the missing values by filtering out incomplete rows before running the analysis.