library(tidyverse)
library(readxl)
library(tidymodels)
library(patchwork)
ds_salaries <- read_excel("ds_salaries.xlsx")
startup_data <- read_excel("startup_funding.xlsx")ECON 465 - Data Science Project
Stage 1: Data Acquisition & Probability Foundations
Introduction
This report represents Stage 1 of the ECON 465 Data Science Project. We will analyze two distinct datasets to answer specific economic questions using probability distribution analysis.
Dataset 1: Regression (Data Science Salaries)
Economic Question: “Which professional characteristics are associated with higher salaries?”
Dataset Description: This dataset contains global salary information for data science and tech roles. The target variable is continuous (salary_in_usd). We aim to analyze how factors like experience level, employment type, and company size impact salaries.
Source: https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries
1.1 Data Cleaning
We need to format categorical variables as factors so that our future regression models can interpret them correctly.
# Convert categorical variables to factors
ds_salaries_clean <- ds_salaries %>%
mutate(
experience_level = as.factor(experience_level),
employment_type = as.factor(employment_type),
company_size = as.factor(company_size),
remote_ratio = as.factor(remote_ratio)
)
# Inspect raw data
glimpse(ds_salaries)Rows: 607
Columns: 10
$ work_year <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 202…
$ experience_level <chr> "MI", "SE", "SE", "MI", "SE", "EN", "SE", "MI", "MI…
$ employment_type <chr> "FT", "FT", "FT", "FT", "FT", "FT", "FT", "FT", "FT…
$ salary <dbl> 70000, 260000, 85000, 20000, 150000, 72000, 190000,…
$ salary_currency <chr> "EUR", "USD", "GBP", "USD", "USD", "USD", "USD", "H…
$ salary_in_usd <dbl> 79833, 260000, 109024, 20000, 150000, 72000, 190000…
$ employee_residence <chr> "DE", "JP", "GB", "HN", "US", "US", "US", "HU", "US…
$ remote_ratio <dbl> 0, 0, 50, 0, 50, 100, 100, 50, 100, 50, 0, 0, 0, 10…
$ company_location <chr> "DE", "JP", "GB", "HN", "US", "US", "US", "HU", "US…
$ company_size <chr> "L", "S", "M", "S", "L", "L", "S", "L", "L", "S", "…
# Handle missing values and convert to factors
ds_salaries_clean <- ds_salaries %>%
drop_na(salary_in_usd, experience_level, employment_type, company_size, remote_ratio) %>%
mutate(
experience_level = as.factor(experience_level),
employment_type = as.factor(employment_type),
company_size = as.factor(company_size),
remote_ratio = as.factor(remote_ratio)
)
# Inspect clean data
glimpse(ds_salaries_clean)Rows: 607
Columns: 10
$ work_year <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 202…
$ experience_level <fct> MI, SE, SE, MI, SE, EN, SE, MI, MI, SE, EN, MI, EN,…
$ employment_type <fct> FT, FT, FT, FT, FT, FT, FT, FT, FT, FT, FT, FT, FT,…
$ salary <dbl> 70000, 260000, 85000, 20000, 150000, 72000, 190000,…
$ salary_currency <chr> "EUR", "USD", "GBP", "USD", "USD", "USD", "USD", "H…
$ salary_in_usd <dbl> 79833, 260000, 109024, 20000, 150000, 72000, 190000…
$ employee_residence <chr> "DE", "JP", "GB", "HN", "US", "US", "US", "HU", "US…
$ remote_ratio <fct> 0, 0, 50, 0, 50, 100, 100, 50, 100, 50, 0, 0, 0, 10…
$ company_location <chr> "DE", "JP", "GB", "HN", "US", "US", "US", "HU", "US…
$ company_size <fct> L, S, M, S, L, L, S, L, L, S, S, L, M, L, L, L, S, …
# Inspect raw data
glimpse(startup_data)Rows: 100,000
Columns: 11
$ funding_rounds <dbl> 4, 1, 3, 3, 1, 2, 1, 1, 2, 2, 1, 0, 4, 2, 1, …
$ founder_experience_years <dbl> 13, 6, 5, 14, 17, 9, 15, 7, 24, 23, 24, 2, 20…
$ team_size <dbl> 58, 221, 247, 229, 235, 184, 148, 31, 238, 17…
$ market_size_billion <dbl> 48.225483, 31.532647, 4.969722, 3.084209, 13.…
$ product_traction_users <dbl> 594843, 393020, 27636, 235376, 391765, 551576…
$ burn_rate_million <dbl> 18.519211, 14.298149, 20.447567, 8.177417, 4.…
$ revenue_million <dbl> 1483962.45, 862056.82, 97261.69, 1145785.44, …
$ investor_type <chr> "tier2_vc", "tier2_vc", "none", "none", "none…
$ sector <chr> "Health", "Fintech", "SaaS", "Ecommerce", "He…
$ founder_background <chr> "academic", "first_time", "first_time", "ex_b…
$ outcome <chr> "IPO", "Failure", "Failure", "Acquisition", "…
# Handle missing values and format target variable
startup_clean <- startup_data %>%
drop_na(outcome, investor_type, sector, founder_background) %>%
mutate(
outcome = ifelse(outcome == "Failure", "Failure", "Success"),
outcome = factor(outcome, levels = c("Failure", "Success")),
investor_type = as.factor(investor_type),
sector = as.factor(sector),
founder_background = as.factor(founder_background)
)
# Inspect clean data
glimpse(startup_clean)Rows: 100,000
Columns: 11
$ funding_rounds <dbl> 4, 1, 3, 3, 1, 2, 1, 1, 2, 2, 1, 0, 4, 2, 1, …
$ founder_experience_years <dbl> 13, 6, 5, 14, 17, 9, 15, 7, 24, 23, 24, 2, 20…
$ team_size <dbl> 58, 221, 247, 229, 235, 184, 148, 31, 238, 17…
$ market_size_billion <dbl> 48.225483, 31.532647, 4.969722, 3.084209, 13.…
$ product_traction_users <dbl> 594843, 393020, 27636, 235376, 391765, 551576…
$ burn_rate_million <dbl> 18.519211, 14.298149, 20.447567, 8.177417, 4.…
$ revenue_million <dbl> 1483962.45, 862056.82, 97261.69, 1145785.44, …
$ investor_type <fct> tier2_vc, tier2_vc, none, none, none, angel, …
$ sector <fct> Health, Fintech, SaaS, Ecommerce, Health, Eco…
$ founder_background <fct> academic, first_time, first_time, ex_bigtech,…
$ outcome <fct> Success, Failure, Failure, Success, Success, …
1.2 Summary Statistics
Below are the summary statistics (mean, median, standard deviation, min, max) for our continuous target variable, salary_in_usd.
# Create summary statistics table for the salary variable
summary_stats_salary <- ds_salaries_clean %>%
summarise(
Mean = mean(salary_in_usd, na.rm = TRUE),
Median = median(salary_in_usd, na.rm = TRUE),
SD = sd(salary_in_usd, na.rm = TRUE),
Min = min(salary_in_usd, na.rm = TRUE),
Max = max(salary_in_usd, na.rm = TRUE)
)
# Print the table
summary_stats_salary# A tibble: 1 × 5
Mean Median SD Min Max
<dbl> <dbl> <dbl> <dbl> <dbl>
1 112298. 101570 70957. 2859 600000
1.3 Probability Distribution Analysis
Income and salary distributions are typically right-skewed in economics. We will plot the original histogram and then apply a log transformation to observe the changes.
# Original salary distribution plot
p1 <- ggplot(ds_salaries_clean, aes(x = salary_in_usd)) +
geom_histogram(fill = "steelblue", color = "white", bins = 30) +
labs(title = "Original Salary Distribution", x = "Salary (USD)", y = "Frequency") +
theme_minimal()
# Log-transformed salary distribution plot
p2 <- ggplot(ds_salaries_clean, aes(x = log(salary_in_usd))) +
geom_histogram(fill = "darkorange", color = "white", bins = 30) +
labs(title = "Log-Transformed Salary", x = "Log(Salary in USD)", y = "Frequency") +
theme_minimal()
# Display two plots side by side using patchwork
p1 + p2Interpretation: The original histogram confirms that the salary data is right-skewed. After applying the log transformation, the distribution shape becomes much more symmetric and bell-shaped. Therefore, a Log-Normal distribution is the most suitable theoretical approximation for our target variable.
Dataset 2: Classification (Startup Funding Success)
Economic Question: “Can startup success be predicted using founder characteristics, funding structure, and market indicators?”
Dataset Description: This dataset tracks various startups, their funding metrics, and market conditions. The target variable is binary (outcome), where startups are classified as either “Success” or “Failure”, indicating whether a startup was ultimately successful or not.
Source: https://www.kaggle.com/datasets/dhrubangtalukdar/startup-funding-and-outcome-dataset
2.1 Data Cleaning
We prepare the dataset by converting the target variable (outcome) and categorical features into factors.
startup_clean <- startup_data %>%
mutate(
outcome = ifelse(outcome == "Failure", "Failure", "Success"),
outcome = factor(outcome, levels = c("Failure", "Success")),
investor_type = as.factor(investor_type),
sector = as.factor(sector),
founder_background = as.factor(founder_background)
)
table(startup_clean$outcome)
Failure Success
55610 44390
prop.table(table(startup_clean$outcome))
Failure Success
0.5561 0.4439
2.2 Summary Statistics (Success Rate)
Since the target variable is binary, traditional continuous-variable statistics such as quartiles and log transformations are not appropriate. Instead, we look at the frequency and success rate.
# View startup success/failure counts
table(startup_clean$outcome)
Failure Success
55610 44390
# Calculate success rates (percentages)
prop.table(table(startup_clean$outcome))
Failure Success
0.5561 0.4439
2.3 Probability Distribution Analysis
Because outcome is binary (categorical), a histogram and log transformation are mathematically inappropriate. Instead, we use a bar plot to visualize the class balance.
# Startup success frequency plot
ggplot(startup_clean, aes(x = outcome, fill = outcome)) +
geom_bar(color = "white") +
labs(title = "Distribution of Startup Success (Outcome)", x = "Outcome Status", y = "Count") +
scale_fill_brewer(palette = "Set2") +
theme_minimal() +
theme(legend.position = "none")Interpretation: The target variable (outcome) is binary, meaning it takes on one of two possible states.The theoretical distribution that best approximates this binary success/failure outcome is the Bernoulli distribution, since each startup can only belong to one of two possible categories. The bar chart above helps us understand if our dataset is balanced or imbalanced regarding successful versus unsuccessful startups.
Stage 2: Predictive Modeling
2.1 Data Splitting
We split both datasets into training (80%) and test (20%) sets using the initial_split() function from the tidymodels package. We also set a seed to ensure reproducibility of the results.
# Set seed for reproducibility
set.seed(465)
# -----------------------------
# Regression Dataset Split
# -----------------------------
salary_split <- initial_split(ds_salaries_clean, prop = 0.80)
salary_train <- training(salary_split)
salary_test <- testing(salary_split)
# Check dimensions
dim(salary_train)[1] 485 10
dim(salary_test)[1] 122 10
# -----------------------------
# Classification Dataset Split
# -----------------------------
startup_split <- initial_split(
startup_clean,
prop = 0.80,
strata = outcome
)
startup_train <- training(startup_split)
startup_test <- testing(startup_split)
# Check dimensions
dim(startup_train)[1] 80000 11
dim(startup_test)[1] 20000 11
2.2 Regression Models
Linear Regression Model
We first build a linear regression model to predict salaries using professional characteristics such as experience level, employment type, company size, and remote work ratio.
Predictor Justification: We selected experience level, employment type, company size, and remote ratio as predictors because human capital theory suggests that higher experience directly increases marginal productivity, thus commanding higher wages. Additionally, larger companies typically have greater capital and market power to offer competitive salaries, while remote work dynamics can influence salary based on geographic cost-of-living differences.
# Linear regression model
lm_model <- linear_reg() %>%
set_engine("lm")
# Workflow
lm_workflow <- workflow() %>%
add_model(lm_model) %>%
add_formula(
salary_in_usd ~ experience_level +
employment_type +
company_size +
remote_ratio
)
# Train model
lm_fit <- fit(lm_workflow, data = salary_train)
# Predictions
lm_predictions <- predict(lm_fit, salary_test) %>%
bind_cols(salary_test)
# Evaluation metrics
lm_metrics <- lm_predictions %>%
metrics(truth = salary_in_usd, estimate = .pred)
lm_metrics# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 51913.
2 rsq standard 0.255
3 mae standard 41421.
Decision Tree Regression Model
As a second regression model, we use a decision tree. Decision trees can capture non-linear relationships between professional characteristics and salary.
# Decision tree regression model
tree_reg_model <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("regression")
# Workflow
tree_reg_workflow <- workflow() %>%
add_model(tree_reg_model) %>%
add_formula(
salary_in_usd ~ experience_level +
employment_type +
company_size +
remote_ratio
)
# Train model
tree_reg_fit <- fit(tree_reg_workflow, data = salary_train)
# Predictions
tree_reg_predictions <- predict(tree_reg_fit, salary_test) %>%
bind_cols(salary_test)
# Evaluation metrics
tree_reg_metrics <- tree_reg_predictions %>%
metrics(truth = salary_in_usd, estimate = .pred)
tree_reg_metrics# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 51356.
2 rsq standard 0.267
3 mae standard 38916.
Regression Model Comparison
We compared the performance of the linear regression and decision tree regression models using RMSE and R?? metrics on the test dataset.
bind_rows(
Linear_Regression = lm_metrics,
Decision_Tree = tree_reg_metrics,
.id = "Model"
)# A tibble: 6 × 4
Model .metric .estimator .estimate
<chr> <chr> <chr> <dbl>
1 Linear_Regression rmse standard 51913.
2 Linear_Regression rsq standard 0.255
3 Linear_Regression mae standard 41421.
4 Decision_Tree rmse standard 51356.
5 Decision_Tree rsq standard 0.267
6 Decision_Tree mae standard 38916.
2.3 Classification Models
Logistic Regression Model
We first build a logistic regression model to predict startup success using founder characteristics, funding structure, and market indicators.
Predictor Justification: We included investor type, sector, founder background, funding rounds, and team size. Economically, the number of funding rounds acts as a proxy for continuous investor confidence and market validation. Founder background and team size reflect the human capital required to execute a business plan, while the sector captures varying industry-specific growth trends and failure risks.
# Logistic regression model
log_model <- logistic_reg() %>%
set_engine("glm")
# Workflow
log_workflow <- workflow() %>%
add_model(log_model) %>%
add_formula(
outcome ~ investor_type +
sector +
founder_background +
funding_rounds +
team_size
)
# Train model
log_fit <- fit(log_workflow, data = startup_train)
table(startup_train$outcome)
Failure Success
44488 35512
# Predicted probabilities
log_probabilities <- predict(
log_fit,
startup_test,
type = "prob"
)
# Check probability column names
colnames(log_probabilities)[1] ".pred_Failure" ".pred_Success"
# Predicted classes
log_predictions <- predict(log_fit, startup_test) %>%
bind_cols(startup_test)
# Accuracy, precision, recall
class_metrics <- metric_set(
accuracy,
precision,
recall
)
log_metrics <- log_predictions %>%
class_metrics(
truth = outcome,
estimate = .pred_class
)
log_metrics# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.620
2 precision binary 0.626
3 recall binary 0.786
Decision Tree Classification Model
We also build a decision tree classification model to predict startup success.
# Decision tree classification model
tree_class_model <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
# Workflow
tree_class_workflow <- workflow() %>%
add_model(tree_class_model) %>%
add_formula(
outcome ~ investor_type +
sector +
founder_background +
funding_rounds +
team_size
)
# Train model
tree_class_fit <- fit(
tree_class_workflow,
data = startup_train
)
# Predictions
tree_class_predictions <- predict(
tree_class_fit,
startup_test
) %>%
bind_cols(startup_test)
# Metrics
tree_class_metrics <- tree_class_predictions %>%
class_metrics(
truth = outcome,
estimate = .pred_class
)
tree_class_metrics# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.614
2 precision binary 0.624
3 recall binary 0.765
Confusion Matrix
conf_mat(
log_predictions,
truth = outcome,
estimate = .pred_class
) Truth
Prediction Failure Success
Failure 8740 5216
Success 2382 3662
2.4 Probability Predictions
The logistic regression model produces predicted probabilities for startup success. The histogram below shows the distribution of these predicted probabilities.
# Add probabilities to dataset
prob_data <- log_probabilities %>%
bind_cols(startup_test)
# Histogram
ggplot(prob_data, aes(x = .pred_Success)) +
geom_histogram(
bins = 30,
fill = "steelblue",
color = "white"
) +
labs(
title = "Predicted Probability Distribution",
x = "Predicted Probability of Success",
y = "Frequency"
) +
theme_minimal()A probability threshold of 0.5 was selected because it is the standard balanced cutoff for binary classification. Lowering the threshold would classify more startups as successful, increasing recall but potentially reducing precision. Raising the threshold would make the model more conservative, increasing precision but potentially reducing recall.
2.5 Classification Model Comparison
We compared the logistic regression and decision tree classification models using accuracy, precision, and recall metrics.
bind_rows(
Logistic_Regression = log_metrics,
Decision_Tree = tree_class_metrics,
.id = "Model"
)# A tibble: 6 × 4
Model .metric .estimator .estimate
<chr> <chr> <chr> <dbl>
1 Logistic_Regression accuracy binary 0.620
2 Logistic_Regression precision binary 0.626
3 Logistic_Regression recall binary 0.786
4 Decision_Tree accuracy binary 0.614
5 Decision_Tree precision binary 0.624
6 Decision_Tree recall binary 0.765
The logistic regression model performed better overall because its accuracy, precision, and recall values were more balanced and stable on the test dataset. This suggests that the model generalized better to unseen startup data.
Economic Interpretation and Model Selection:
For the startup classification problem, the Logistic Regression model performed better overall with balanced accuracy and recall. From an investor’s perspective (practical trade-off), precision is crucial because a “False Positive” means investing millions into a startup that will ultimately fail. Recall is also important so as not to miss “unicorn” opportunities. Logistic regression provides clear probabilities, allowing venture capitalists to adjust their risk thresholds (e.g., demanding a 70% probability of success before investing) rather than relying on a black-box decision tree.
For the salary regression dataset, the Decision Tree performed slightly better in terms of RMSE and R??. However, Linear Regression is often preferred in labor economics and HR analytics because of its interpretability. A linear model allows a company to isolate the exact dollar value of one extra year of experience or moving from a medium to a large company, making it much more actionable for setting fair salary policies despite a marginally higher error rate. ## 2.6 Cross-Validation
Cross-Validation for Logistic Regression
We also perform 5-fold cross-validation for the logistic regression model.
# 5-fold CV
set.seed(465)
startup_folds <- vfold_cv(
startup_train,
v = 5,
strata = outcome
)
# Cross-validation
log_cv_results <- fit_resamples(
log_workflow,
resamples = startup_folds,
metrics = metric_set(
accuracy,
precision,
recall)
)
# Collect metrics
collect_metrics(log_cv_results)# A tibble: 3 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.615 5 0.00114 pre0_mod0_post0
2 precision binary 0.624 5 0.000921 pre0_mod0_post0
3 recall binary 0.778 5 0.00313 pre0_mod0_post0
Cross-Validation for Linear Regression
# 5-fold CV
set.seed(465)
salary_folds <- vfold_cv(
salary_train,
v = 5
)
# Cross-validation
lm_cv_results <- fit_resamples(
lm_workflow,
resamples = salary_folds,
metrics = metric_set(
rmse,
rsq
)
)
# Collect metrics
collect_metrics(lm_cv_results)# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 62686. 5 3131. pre0_mod0_post0
2 rsq standard 0.283 5 0.0299 pre0_mod0_post0
Interpretation of Cross-Validation Results
The cross-validation results were generally similar to the test set results, suggesting that the models generalize reasonably well to unseen data. Small differences between cross-validation and test performance may indicate mild overfitting or sampling variability. Overall, the models appear stable and suitable for predictive analysis.
2.7 AI Interaction
During Stage 2, We have used Gemini and ChatGPT to better understand how to interpret model evaluation results.
Prompt: How can I compare my logistic regression model with a decision tree classification model in R using tidymodels?
AI Response:You can fit both models on the training data, make predictions on the same test data, calculate accuracy, precision, and recall, and then combine the metric outputs into one comparison table.
log_metrics <- log_predictions |>
metrics(truth = outcome, estimate = .pred_class)
tree_metrics <- tree_class_predictions |>
metrics(truth = outcome, estimate = .pred_class)
classification_comparison <- bind_rows(
Logistic_Regression = log_metrics,
Decision_Tree = tree_metrics,
.id = "Model"
)How I used it: I used this structure, but I changed the object names and variables according to my startup success dataset.
Reflection: It was helpful because it showed me how to put the results of two different models into one table. I checked the outputs and made sure the metrics matched my own test data.
Prompt:My regression model for salary prediction has similar RMSE values in cross-validation and the test set. How should I interpret this?
AI Responset:If the RMSE values are similar, this usually suggests that the model performs consistently across different data splits and is not heavily overfitting.
How I used it: I used this explanation while interpreting the performance of my salary prediction model.
Reflection: This helped me better understand model stability and overfitting. I compared the RMSE values from my own results before writing the interpretation.