ECON 465 - Data Science Project: Final Report

Author

Silan Kilicarslan - Arda Cem Acar

Published

June 5, 2024

1. Economic Question and Motivation

Question: “Can startup success be predicted using founder characteristics, funding structure, and market indicators?”

Motivation: Startups are drivers of innovation and job creation, but they face high failure rates. For Venture Capital (VC) firms, predicting success is essential to reduce “False Positive” investments. Economically, we are interested in whether certain features act as credible market signals that reduce information asymmetry between founders and investors.

Code

# Load required packages for a reproducible pipeline
library(tidyverse)
library(readxl)
library(tidymodels)

# Set seed for reproducibility
set.seed(465)

2. Dataset Description

We use the Startup Funding and Outcome Dataset from Kaggle. Source: https://www.kaggle.com/datasets/dhrubangtalukdar/startup-funding-and-outcome-dataset

2.1 Data Import and Inspection

As per feedback, we include a glimpse() to show understanding of the data structure.

Code

# Importing using a relative file path
startup_data <- read_excel("Startup_funding.xlsx")

# Initial inspection
glimpse(startup_data)

Rows: 100,000
Columns: 11
$ funding_rounds           <dbl> 4, 1, 3, 3, 1, 2, 1, 1, 2, 2, 1, 0, 4, 2, 1, …
$ founder_experience_years <dbl> 13, 6, 5, 14, 17, 9, 15, 7, 24, 23, 24, 2, 20…
$ team_size                <dbl> 58, 221, 247, 229, 235, 184, 148, 31, 238, 17…
$ market_size_billion      <dbl> 48.225483, 31.532647, 4.969722, 3.084209, 13.…
$ product_traction_users   <dbl> 594843, 393020, 27636, 235376, 391765, 551576…
$ burn_rate_million        <dbl> 18.519211, 14.298149, 20.447567, 8.177417, 4.…
$ revenue_million          <dbl> 1483962.45, 862056.82, 97261.69, 1145785.44, …
$ investor_type            <chr> "tier2_vc", "tier2_vc", "none", "none", "none…
$ sector                   <chr> "Health", "Fintech", "SaaS", "Ecommerce", "He…
$ founder_background       <chr> "academic", "first_time", "first_time", "ex_b…
$ outcome                  <chr> "IPO", "Failure", "Failure", "Acquisition", "…

2.2 Data Cleaning

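We drop rows with missing values in the modeling columns, collapse outcome into a binary target (Failure versus Success, where Success covers every non-failure outcome such as IPO and Acquisition), and convert the target and the categorical predictors to factors. The row count is 100,000 both before and after cleaning, so drop_na() removed no rows.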

Code

# Handling missing values explicitly and converting to factors
startup_clean <- startup_data %>%
  drop_na(outcome, investor_type, sector, funding_rounds, team_size) %>%
  mutate(
    outcome = ifelse(outcome == "Failure", "Failure", "Success"),
    outcome = factor(outcome, levels = c("Failure", "Success")),
    investor_type = as.factor(investor_type),
    sector = as.factor(sector)
  )

# Verify cleaned data
glimpse(startup_clean)

Rows: 100,000
Columns: 11
$ funding_rounds           <dbl> 4, 1, 3, 3, 1, 2, 1, 1, 2, 2, 1, 0, 4, 2, 1, …
$ founder_experience_years <dbl> 13, 6, 5, 14, 17, 9, 15, 7, 24, 23, 24, 2, 20…
$ team_size                <dbl> 58, 221, 247, 229, 235, 184, 148, 31, 238, 17…
$ market_size_billion      <dbl> 48.225483, 31.532647, 4.969722, 3.084209, 13.…
$ product_traction_users   <dbl> 594843, 393020, 27636, 235376, 391765, 551576…
$ burn_rate_million        <dbl> 18.519211, 14.298149, 20.447567, 8.177417, 4.…
$ revenue_million          <dbl> 1483962.45, 862056.82, 97261.69, 1145785.44, …
$ investor_type            <fct> tier2_vc, tier2_vc, none, none, none, angel, …
$ sector                   <fct> Health, Fintech, SaaS, Ecommerce, Health, Eco…
$ founder_background       <chr> "academic", "first_time", "first_time", "ex_b…
$ outcome                  <fct> Success, Failure, Failure, Success, Success, …
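Because the row count does not change after drop_na(), it can be useful to report the number of missing values per column directly. The following is a minimal base R sketch on a made-up toy data frame (the object `toy` and its values are illustrative assumptions, not rows from the Kaggle file):

```r
# Toy data frame standing in for startup_data (values are made up)
toy <- data.frame(
  outcome        = c("IPO", "Failure", NA, "Acquisition"),
  funding_rounds = c(4, 1, 3, NA),
  team_size      = c(58, 221, 247, 229)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Number of rows that survive a complete-case filter on these columns
rows_kept <- sum(complete.cases(toy))
rows_kept
```

Running the same two calls on startup_data would show whether any column contains missing values at all.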

3. Probability Distribution Analysis

The outcome variable is binary. We analyze its distribution to understand the baseline success rate in our sample.

Code

# Frequency table
startup_clean %>% count(outcome) %>% mutate(proportion = n / sum(n))

# A tibble: 2 × 3
  outcome     n proportion
  <fct>   <int>      <dbl>
1 Failure 55610      0.556
2 Success 44390      0.444

Code

# Visualization
ggplot(startup_clean, aes(x = outcome, fill = outcome)) +
  geom_bar(color = "white") +
  labs(title = "Distribution of Startup Success", x = "Outcome", y = "Count") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal()

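Interpretation: The binary outcome can be modeled as a Bernoulli variable with an estimated success probability of about 0.444. Since the classes are relatively balanced, we can proceed with predictive modeling without complex re-sampling techniques. The 55.6% failure share also sets a benchmark: a classifier that always predicts Failure would reach an accuracy of about 0.556.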
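The class shares in the frequency table also give a reference point for the accuracy figures reported later. A minimal sketch, using only the two counts from the table above (the variable names are our own):

```r
# Class counts copied from the frequency table above
counts <- c(Failure = 55610, Success = 44390)

# Estimated Bernoulli parameter: share of successes
p_success <- counts[["Success"]] / sum(counts)

# Accuracy of a naive rule that always predicts the majority class
baseline_accuracy <- max(counts) / sum(counts)

round(c(p_success = p_success, baseline_accuracy = baseline_accuracy), 3)
```

Any model accuracy should be read against this majority-class baseline rather than against 0.5.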

4. Predictive Modeling

4.1 Data Splitting

We split the data into 80% training and 20% test sets, stratified by outcome so that both sets keep roughly the same class balance.

Code

set.seed(465)
startup_split <- initial_split(startup_clean, prop = 0.80, strata = outcome)
startup_train <- training(startup_split)
startup_test  <- testing(startup_split)

4.2 Predictor Justification

We selected funding_rounds, team_size, investor_type, and sector. Economically, funding_rounds represents repeated market validation. team_size acts as a proxy for operational capacity, investor_type is included as a potential certification signal from the backer, and sector controls for industry-specific macroeconomic risks.

4.3 Model Specifications (Logistic Regression)

We compare two specifications of the same model family as required.

Code

log_spec <- logistic_reg() %>% set_engine("glm")

# Model 1: Full Specification
wflow_full <- workflow() %>%
  add_model(log_spec) %>%
  add_formula(outcome ~ funding_rounds + team_size + investor_type + sector)

# Model 2: Reduced Specification (Numerical only)
wflow_red <- workflow() %>%
  add_model(log_spec) %>%
  add_formula(outcome ~ funding_rounds + team_size)

# Fitting
fit_full <- fit(wflow_full, data = startup_train)
fit_red  <- fit(wflow_red, data = startup_train)

4.4 Test Set Performance

Code

results_full <- predict(fit_full, startup_test) %>% 
  bind_cols(startup_test) %>% 
  metrics(truth = outcome, estimate = .pred_class)

results_red <- predict(fit_red, startup_test) %>% 
  bind_cols(startup_test) %>% 
  metrics(truth = outcome, estimate = .pred_class)

bind_rows(Full_Model = results_full, Reduced_Model = results_red, .id = "Model") %>%
  filter(.metric == "accuracy")

# A tibble: 2 × 4
  Model         .metric  .estimator .estimate
  <chr>         <chr>    <chr>          <dbl>
1 Full_Model    accuracy binary         0.617
2 Reduced_Model accuracy binary         0.617

The two specifications reach the same test accuracy to three decimals (0.617), so adding investor_type and sector yields no measurable gain in accuracy on this test set.
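Since both specifications report the same accuracy in the table above, a likelihood-ratio test on the nested models is one way to check whether the extra categorical terms improve fit at all. The sketch below runs on simulated data, not the Kaggle data: the sample size, the three sector labels, and the data-generating coefficients are assumptions chosen for illustration, and it uses base R glm() rather than the tidymodels workflow.

```r
set.seed(465)

# Simulated stand-in data (NOT the Kaggle data): the outcome depends only on
# the two numeric predictors; sector is pure noise by construction
n <- 2000
sim <- data.frame(
  funding_rounds = sample(0:5, n, replace = TRUE),
  team_size      = sample(5:250, n, replace = TRUE),
  sector         = factor(sample(c("Health", "Fintech", "SaaS"), n, replace = TRUE))
)
lin_pred <- -1.34 + 0.355 * sim$funding_rounds + 0.00235 * sim$team_size
sim$outcome <- rbinom(n, size = 1, prob = plogis(lin_pred))

# Reduced and full logistic regressions (the reduced model is nested in the full one)
fit_small <- glm(outcome ~ funding_rounds + team_size, data = sim, family = binomial)
fit_large <- glm(outcome ~ funding_rounds + team_size + sector, data = sim, family = binomial)

# Likelihood-ratio test: does adding sector reduce deviance beyond chance?
lrt <- anova(fit_small, fit_large, test = "Chisq")
lrt
```

On the real data, the same comparison could be run on the underlying glm objects, for example after extracting them from the fitted workflows.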

5. Results & Cross-Validation

5.1 Cross-Validation (Numeric Table)

To address Stage 2 feedback, we present a side-by-side numeric comparison between CV results and Test Set results for the Full Model.

Code

set.seed(465)
startup_folds <- vfold_cv(startup_train, v = 5, strata = outcome)

cv_results <- fit_resamples(wflow_full, resamples = startup_folds, metrics = metric_set(accuracy))

# Comparison Table
cv_table <- collect_metrics(cv_results) %>% select(.metric, mean) %>% rename(CV_Estimate = mean)
test_table <- results_full %>% filter(.metric == "accuracy") %>% select(.metric, .estimate) %>% rename(Test_Estimate = .estimate)

inner_join(cv_table, test_table, by = ".metric")

# A tibble: 1 × 3
  .metric  CV_Estimate Test_Estimate
  <chr>          <dbl>         <dbl>
1 accuracy       0.616         0.617

Interpretation: The CV accuracy (0.616) is virtually identical to the Test accuracy (0.617). This suggests that the model's performance is stable across resamples and gives no sign of overfitting.
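To judge how small the gap between the two estimates is, it can be compared with the sampling error of the test accuracy. This is a rough sketch under the assumption that the test set holds about 20,000 rows (20% of 100,000); the exact size produced by the stratified split may differ slightly.

```r
# Values taken from the comparison table above
cv_accuracy   <- 0.616
test_accuracy <- 0.617

# ASSUMPTION: about 20% of the 100,000 rows land in the test set
n_test <- 20000

# Binomial standard error of the test accuracy
se_test <- sqrt(test_accuracy * (1 - test_accuracy) / n_test)

# Gap between CV and test accuracy, expressed in standard errors
gap_in_se <- abs(test_accuracy - cv_accuracy) / se_test

round(c(se_test = se_test, gap_in_se = gap_in_se), 4)
```

Under this assumption the gap is well below one standard error, which is in line with the two estimates measuring the same underlying performance.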

5.2 Economic Interpretation

Code

tidy(fit_full) %>% filter(p.value < 0.05)

# A tibble: 3 × 5
  term           estimate std.error statistic   p.value
  <chr>             <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)    -1.34    0.0292        -45.8 0        
2 funding_rounds  0.355   0.00546        64.9 0        
3 team_size       0.00235 0.0000857      27.4 1.33e-165

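Findings: The positive coefficient for funding_rounds is consistent with Signaling Theory. Each additional round raises the log-odds of success by 0.355, which multiplies the odds by about exp(0.355) ≈ 1.43, holding the other predictors fixed. One reading is that each round signals the firm has passed another stage of professional due diligence. Only the intercept, funding_rounds, and team_size are significant at the 5% level; no investor_type or sector term passes the filter. Business-wise, VCs can use the predicted probabilities to set risk thresholds (e.g., only investing if \(P(Success) > 0.65\)).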
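The coefficients in the table can be turned into an odds ratio and a predicted probability by hand. The sketch below uses only the three reported estimates; the helper `p_hat` and the example startup (4 rounds, team of 58) are illustrative, and it assumes investor_type and sector sit at their reference levels because their coefficients are not shown in the filtered output.

```r
# Coefficients copied from the tidy() output above
b0       <- -1.34
b_rounds <- 0.355
b_team   <- 0.00235

# Odds ratio for one additional funding round
odds_ratio_round <- exp(b_rounds)

# Predicted success probability for a hypothetical startup.
# ASSUMPTION: investor_type and sector are at their reference levels, since
# their coefficients are not shown in the filtered table.
p_hat <- function(rounds, team) plogis(b0 + b_rounds * rounds + b_team * team)

p_example <- p_hat(rounds = 4, team = 58)

# Log-odds a startup needs to clear an illustrative 0.65 decision threshold
logit_needed <- qlogis(0.65)

round(c(odds_ratio = odds_ratio_round, p_example = p_example, logit_needed = logit_needed), 3)
```

In this illustration the example startup stays below the 0.65 threshold, which shows how such a cutoff would screen out firms with a moderate predicted probability.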

6. Conclusion

6.1 Limitations & Reproducibility

Limitations: 1. The data lacks macroeconomic variables like interest rates, which dictate VC liquidity. 2. The model does not use burn_rate_million, even though the dataset contains it and “burn rate” intensity is a major driver of failure.
Reproducibility: We used relative paths, fixed the global seed at 465, and explicitly used drop_na() for consistent cleaning.

6.2 Final Reflections

Improvement: If given more time, we would test interaction effects between sector and funding_rounds. The economic “value” of a funding round might be significantly higher in capital-intensive sectors like Biotech compared to SaaS.
New Economic Question: This analysis has inspired a new question for future research: “How do prevailing macroeconomic indicators, specifically central bank interest rates at the time of initial funding, affect the long-term survival probability of startups across different sectors?”
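The interaction idea under Improvement could be specified roughly as follows. This is a sketch on simulated data, not the Kaggle data: the two sector labels, the sample size, and the slopes are assumptions chosen so that a funding round matters more in one sector, and it uses base R glm() rather than the tidymodels workflow.

```r
set.seed(465)

# Simulated stand-in data (NOT the Kaggle data); the effect of a funding
# round is made larger in "Biotech" than in "SaaS" by construction
n <- 4000
sim <- data.frame(
  funding_rounds = sample(0:5, n, replace = TRUE),
  sector         = factor(sample(c("SaaS", "Biotech"), n, replace = TRUE),
                          levels = c("SaaS", "Biotech"))
)
slope <- ifelse(sim$sector == "Biotech", 0.6, 0.2)
sim$outcome <- rbinom(n, size = 1, prob = plogis(-1 + slope * sim$funding_rounds))

# funding_rounds * sector expands to both main effects plus their interaction
fit_int <- glm(outcome ~ funding_rounds * sector, data = sim, family = binomial)

# The interaction row estimates the extra log-odds per round in Biotech
coef(summary(fit_int))["funding_rounds:sectorBiotech", ]
```

In the report's pipeline, the same formula term could be added to the full specification, with the interaction coefficient read as the sector-specific difference in the value of a round.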

7. AI Use Log

Tool: ChatGPT / Gemini
Prompt: “My professor requested a numeric comparison table for CV vs Test set results in R tidymodels. How can I join these two outputs?”
Usage: I used the inner_join logic suggested by the AI to create the table in Section 5.1.
Verification: I manually compared the values in the final table with the individual outputs to ensure accuracy.