# Load required packages for a reproducible pipeline
library(tidyverse)
library(readxl)
library(tidymodels)
# Set seed for reproducibility
set.seed(465)
Question: “Can startup success be predicted using founder characteristics, funding structure, and market indicators?”
Motivation: Startups are drivers of innovation and job creation, but they face high failure rates. For Venture Capital (VC) firms, predicting success is essential to reduce “False Positive” investments. Economically, we are interested in whether certain features act as credible market signals that reduce information asymmetry between founders and investors.
We use the Startup Funding and Outcome Dataset from Kaggle. Source: https://www.kaggle.com/datasets/dhrubangtalukdar/startup-funding-and-outcome-dataset
As per feedback, we include a glimpse() to show understanding of the data structure.
# Importing using a relative file path
startup_data <- read_excel("Startup_funding.xlsx")
# Initial inspection
glimpse(startup_data)
Rows: 100,000
Columns: 11
$ funding_rounds <dbl> 4, 1, 3, 3, 1, 2, 1, 1, 2, 2, 1, 0, 4, 2, 1, …
$ founder_experience_years <dbl> 13, 6, 5, 14, 17, 9, 15, 7, 24, 23, 24, 2, 20…
$ team_size <dbl> 58, 221, 247, 229, 235, 184, 148, 31, 238, 17…
$ market_size_billion <dbl> 48.225483, 31.532647, 4.969722, 3.084209, 13.…
$ product_traction_users <dbl> 594843, 393020, 27636, 235376, 391765, 551576…
$ burn_rate_million <dbl> 18.519211, 14.298149, 20.447567, 8.177417, 4.…
$ revenue_million <dbl> 1483962.45, 862056.82, 97261.69, 1145785.44, …
$ investor_type <chr> "tier2_vc", "tier2_vc", "none", "none", "none…
$ sector <chr> "Health", "Fintech", "SaaS", "Ecommerce", "He…
$ founder_background <chr> "academic", "first_time", "first_time", "ex_b…
$ outcome <chr> "IPO", "Failure", "Failure", "Acquisition", "…
We explicitly handle missing values and format the target variable as a factor.
# Handling missing values explicitly and converting to factors
startup_clean <- startup_data %>%
  drop_na(outcome, investor_type, sector, funding_rounds, team_size) %>%
  mutate(
    outcome = ifelse(outcome == "Failure", "Failure", "Success"),
    outcome = factor(outcome, levels = c("Failure", "Success")),
    investor_type = as.factor(investor_type),
    sector = as.factor(sector)
  )
# Verify cleaned data
glimpse(startup_clean)
Rows: 100,000
Columns: 11
$ funding_rounds <dbl> 4, 1, 3, 3, 1, 2, 1, 1, 2, 2, 1, 0, 4, 2, 1, …
$ founder_experience_years <dbl> 13, 6, 5, 14, 17, 9, 15, 7, 24, 23, 24, 2, 20…
$ team_size <dbl> 58, 221, 247, 229, 235, 184, 148, 31, 238, 17…
$ market_size_billion <dbl> 48.225483, 31.532647, 4.969722, 3.084209, 13.…
$ product_traction_users <dbl> 594843, 393020, 27636, 235376, 391765, 551576…
$ burn_rate_million <dbl> 18.519211, 14.298149, 20.447567, 8.177417, 4.…
$ revenue_million <dbl> 1483962.45, 862056.82, 97261.69, 1145785.44, …
$ investor_type <fct> tier2_vc, tier2_vc, none, none, none, angel, …
$ sector <fct> Health, Fintech, SaaS, Ecommerce, Health, Eco…
$ founder_background <chr> "academic", "first_time", "first_time", "ex_b…
$ outcome <fct> Success, Failure, Failure, Success, Success, …
The outcome variable is binary. We analyze its distribution to understand the baseline success rate in our population.
# Frequency table
startup_clean %>% count(outcome) %>% mutate(proportion = n / sum(n))
# A tibble: 2 × 3
  outcome     n proportion
  <fct>   <int>      <dbl>
1 Failure 55610      0.556
2 Success 44390      0.444
# Visualization
ggplot(startup_clean, aes(x = outcome, fill = outcome)) +
  geom_bar(color = "white") +
  labs(title = "Distribution of Startup Success", x = "Outcome", y = "Count") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal()
Interpretation: The outcome is binary, so it can be treated as a Bernoulli variable with an empirical success rate of about 44%. Since the classes are relatively balanced, we can proceed with predictive modeling without complex re-sampling techniques.
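The class shares also define a simple benchmark: a classifier that always predicts the majority class. The sketch below computes that baseline from the counts in the frequency table above. The object names are illustrative and not part of the original pipeline.

```r
# Class counts copied from the frequency table above
counts <- c(Failure = 55610, Success = 44390)

# Share of each class in the cleaned data
proportions <- counts / sum(counts)

# Accuracy of a rule that always predicts the majority class
majority_class <- names(which.max(counts))
baseline_accuracy <- max(proportions)

round(proportions, 3)
majority_class
round(baseline_accuracy, 3)
```

Any fitted model should be judged against this baseline rather than against 50%, since always predicting "Failure" is already right for more than half of the startups.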
We split the data into 80% training and 20% test sets.
set.seed(465)
startup_split <- initial_split(startup_clean, prop = 0.80, strata = outcome)
startup_train <- training(startup_split)
startup_test <- testing(startup_split)
We selected funding_rounds, team_size, investor_type, and sector as predictors. Economically, funding_rounds represents repeated market validation, team_size acts as a proxy for operational capacity, and sector controls for industry-specific macroeconomic risks.
We compare two specifications of the same model family as required.
log_spec <- logistic_reg() %>% set_engine("glm")
# Model 1: Full Specification
wflow_full <- workflow() %>%
  add_model(log_spec) %>%
  add_formula(outcome ~ funding_rounds + team_size + investor_type + sector)
# Model 2: Reduced Specification (Numerical only)
wflow_red <- workflow() %>%
  add_model(log_spec) %>%
  add_formula(outcome ~ funding_rounds + team_size)
# Fitting
fit_full <- fit(wflow_full, data = startup_train)
fit_red <- fit(wflow_red, data = startup_train)
results_full <- predict(fit_full, startup_test) %>%
  bind_cols(startup_test) %>%
  metrics(truth = outcome, estimate = .pred_class)
results_red <- predict(fit_red, startup_test) %>%
  bind_cols(startup_test) %>%
  metrics(truth = outcome, estimate = .pred_class)
bind_rows(Full_Model = results_full, Reduced_Model = results_red, .id = "Model") %>%
  filter(.metric == "accuracy")
# A tibble: 2 × 4
  Model         .metric  .estimator .estimate
  <chr>         <chr>    <chr>          <dbl>
1 Full_Model    accuracy binary         0.617
2 Reduced_Model accuracy binary         0.617
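To three decimal places, the two specifications reach the same test accuracy (0.617). Any advantage the full model gains from incorporating sectoral nuances is therefore smaller than the rounding shown.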
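Accuracy at a 0.5 cutoff is a coarse way to compare nested specifications, so a likelihood-ratio test and AIC can serve as complementary checks. The sketch below runs on simulated stand-in data, not the Kaggle dataset. The data-generating values and object names are assumptions chosen only to make the example run.

```r
set.seed(465)

# Simulated stand-in data (hypothetical, NOT the Kaggle dataset)
n <- 2000
sim <- data.frame(
  funding_rounds = rpois(n, lambda = 2),
  team_size      = round(runif(n, min = 5, max = 250)),
  sector         = factor(sample(c("Fintech", "Health", "SaaS"), n, replace = TRUE))
)

# In this simulation the outcome depends on the numeric predictors only;
# sector is pure noise by construction
lin_pred <- -1.3 + 0.35 * sim$funding_rounds + 0.002 * sim$team_size
sim$outcome <- factor(
  ifelse(runif(n) < plogis(lin_pred), "Success", "Failure"),
  levels = c("Failure", "Success")
)

# Reduced and full specifications; the reduced model is nested in the full one
glm_red  <- glm(outcome ~ funding_rounds + team_size,
                data = sim, family = binomial)
glm_full <- glm(outcome ~ funding_rounds + team_size + sector,
                data = sim, family = binomial)

# Likelihood-ratio (deviance) test: does adding sector improve the fit?
lrt <- anova(glm_red, glm_full, test = "Chisq")
print(lrt)

# AIC penalises the extra parameters; lower is better
AIC(glm_red, glm_full)
```

On the real workflows, the underlying glm objects could presumably be pulled out with extract_fit_engine(fit_red) and extract_fit_engine(fit_full) and passed to anova() in the same way.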
To address Stage 2 feedback, we present a side-by-side numeric comparison between CV results and Test Set results for our best model (Full Model).
set.seed(465)
startup_folds <- vfold_cv(startup_train, v = 5, strata = outcome)
cv_results <- fit_resamples(wflow_full, resamples = startup_folds, metrics = metric_set(accuracy))
# Comparison Table
cv_table <- collect_metrics(cv_results) %>% select(.metric, mean) %>% rename(CV_Estimate = mean)
test_table <- results_full %>% filter(.metric == "accuracy") %>% select(.metric, .estimate) %>% rename(Test_Estimate = .estimate)
inner_join(cv_table, test_table, by = ".metric")
# A tibble: 1 × 3
  .metric  CV_Estimate Test_Estimate
  <chr>          <dbl>         <dbl>
1 accuracy       0.616         0.617
Interpretation: The CV accuracy (0.616) is virtually identical to the test accuracy (0.617). This suggests that the model's performance is stable across resamples and that it is not overfitting the training data.
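The CV-versus-test comparison is easier to read when the gap is set against the fold-to-fold spread. The sketch below reproduces the idea in base R on simulated stand-in data, not the Kaggle dataset. The helper name accuracy_on and the simulation parameters are assumptions for illustration, and the folds here are not stratified.

```r
set.seed(465)

# Simulated stand-in data (hypothetical, NOT the Kaggle dataset)
n <- 3000
sim <- data.frame(
  funding_rounds = rpois(n, lambda = 2),
  team_size      = round(runif(n, min = 5, max = 250))
)
lin_pred <- -1.3 + 0.35 * sim$funding_rounds + 0.002 * sim$team_size
sim$success <- as.integer(runif(n) < plogis(lin_pred))

# 80/20 hold-out split
test_idx <- sample(n, size = round(0.2 * n))
train <- sim[-test_idx, ]
test  <- sim[test_idx, ]

# Illustrative helper: fit on `fit_data`, return accuracy on `eval_data`
accuracy_on <- function(fit_data, eval_data) {
  model <- glm(success ~ funding_rounds + team_size,
               data = fit_data, family = binomial)
  prob  <- predict(model, newdata = eval_data, type = "response")
  mean(as.integer(prob > 0.5) == eval_data$success)
}

# 5-fold cross-validation on the training set only
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = nrow(train)))
fold_acc <- vapply(seq_len(k), function(i) {
  accuracy_on(train[fold_id != i, ], train[fold_id == i, ])
}, numeric(1))

cv_mean  <- mean(fold_acc)
cv_se    <- sd(fold_acc) / sqrt(k)
test_acc <- accuracy_on(train, test)

# Side-by-side comparison, with the standard error across folds
data.frame(CV_Estimate = cv_mean, CV_Std_Error = cv_se, Test_Estimate = test_acc)
```

In the tidymodels pipeline, collect_metrics(cv_results) already returns a std_err column, so the same check needs only one more selected column in cv_table.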
tidy(fit_full) %>% filter(p.value < 0.05)
# A tibble: 3 × 5
  term           estimate std.error statistic   p.value
  <chr>             <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)    -1.34    0.0292        -45.8 0
2 funding_rounds  0.355   0.00546        64.9 0
3 team_size       0.00235 0.0000857      27.4 1.33e-165
Findings: The positive coefficient for funding_rounds is consistent with Signaling Theory. Holding the other predictors fixed, each additional round raises the log-odds of success by 0.355, which multiplies the odds of success by exp(0.355), or about 1.43. This fits the view that a firm with more rounds has passed multiple stages of professional due diligence. Business-wise, VCs can use the predicted probabilities to set risk thresholds (e.g., only investing if \(P(Success) > 0.65\)).
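To make the log-odds interpretation and the threshold rule concrete, the sketch below converts the reported coefficients into odds ratios and predicted probabilities. It assumes a startup at the reference levels of investor_type and sector, because those coefficients are not shown in the filtered table. The helper predict_success and the example profiles (team size 100, 0 to 6 rounds) are hypothetical.

```r
# Point estimates copied from the coefficient table above
coefs <- c(intercept = -1.34, funding_rounds = 0.355, team_size = 0.00235)

# Odds ratios: multiplicative change in the odds of success
# per one-unit increase in each predictor
odds_ratios <- exp(coefs[c("funding_rounds", "team_size")])
round(odds_ratios, 3)

# Hypothetical helper: predicted P(Success) at the reference levels of
# investor_type and sector (assumption: their dummy terms contribute zero)
predict_success <- function(funding_rounds, team_size) {
  plogis(coefs[["intercept"]] +
           coefs[["funding_rounds"]] * funding_rounds +
           coefs[["team_size"]] * team_size)
}

# Example profiles: team of 100 with 0 to 6 funding rounds
rounds <- 0:6
p <- predict_success(funding_rounds = rounds, team_size = 100)

# Apply the illustrative investment rule P(Success) > 0.65
data.frame(funding_rounds = rounds,
           p_success = round(p, 3),
           invest = p > 0.65)
```

Because the logistic curve is nonlinear, the same extra round shifts the predicted probability by different amounts depending on the starting point, even though the odds ratio is constant.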
We fixed the random seed at 465, and explicitly used drop_na() for consistent cleaning.
A possible extension is an interaction between sector and funding_rounds. The economic “value” of a funding round might be significantly higher in capital-intensive sectors like Biotech compared to SaaS.
We used the inner_join logic suggested by the AI to create the table in Section 5.1.