ECON 465 - Data Science Project

Author

Silan Kilicarslan - Arda Cem Acar

Stage 1: Data Acquisition & Probability Foundations

Introduction

This report represents Stage 1 of the ECON 465 Data Science Project. We will analyze two distinct datasets to answer specific economic questions using probability distribution analysis.

library(tidyverse)
library(readxl)
library(tidymodels)
library(patchwork)

ds_salaries <- read_excel("ds_salaries.xlsx")
startup_data <- read_excel("startup_funding.xlsx")

Dataset 1: Regression (Data Science Salaries)

Economic Question: “Which professional characteristics are associated with higher salaries?”

Dataset Description: This dataset contains global salary information for data science and tech roles. The target variable is continuous (salary_in_usd). We aim to analyze how factors like experience level, employment type, and company size impact salaries.

Source: https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries

1.1 Data Cleaning

We need to format categorical variables as factors so that our future regression models can interpret them correctly.

# Convert categorical variables to factors
ds_salaries_clean <- ds_salaries %>%
  mutate(
    experience_level = as.factor(experience_level),
    employment_type = as.factor(employment_type),
    company_size = as.factor(company_size),
    remote_ratio = as.factor(remote_ratio)
  )

# Inspect raw data
glimpse(ds_salaries)
Rows: 607
Columns: 10
$ work_year          <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 202…
$ experience_level   <chr> "MI", "SE", "SE", "MI", "SE", "EN", "SE", "MI", "MI…
$ employment_type    <chr> "FT", "FT", "FT", "FT", "FT", "FT", "FT", "FT", "FT…
$ salary             <dbl> 70000, 260000, 85000, 20000, 150000, 72000, 190000,…
$ salary_currency    <chr> "EUR", "USD", "GBP", "USD", "USD", "USD", "USD", "H…
$ salary_in_usd      <dbl> 79833, 260000, 109024, 20000, 150000, 72000, 190000…
$ employee_residence <chr> "DE", "JP", "GB", "HN", "US", "US", "US", "HU", "US…
$ remote_ratio       <dbl> 0, 0, 50, 0, 50, 100, 100, 50, 100, 50, 0, 0, 0, 10…
$ company_location   <chr> "DE", "JP", "GB", "HN", "US", "US", "US", "HU", "US…
$ company_size       <chr> "L", "S", "M", "S", "L", "L", "S", "L", "L", "S", "…
# Handle missing values and convert to factors
ds_salaries_clean <- ds_salaries %>%
  drop_na(salary_in_usd, experience_level, employment_type, company_size, remote_ratio) %>%
  mutate(
    experience_level = as.factor(experience_level),
    employment_type = as.factor(employment_type),
    company_size = as.factor(company_size),
    remote_ratio = as.factor(remote_ratio)
  )

# Inspect clean data
glimpse(ds_salaries_clean)
Rows: 607
Columns: 10
$ work_year          <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 202…
$ experience_level   <fct> MI, SE, SE, MI, SE, EN, SE, MI, MI, SE, EN, MI, EN,…
$ employment_type    <fct> FT, FT, FT, FT, FT, FT, FT, FT, FT, FT, FT, FT, FT,…
$ salary             <dbl> 70000, 260000, 85000, 20000, 150000, 72000, 190000,…
$ salary_currency    <chr> "EUR", "USD", "GBP", "USD", "USD", "USD", "USD", "H…
$ salary_in_usd      <dbl> 79833, 260000, 109024, 20000, 150000, 72000, 190000…
$ employee_residence <chr> "DE", "JP", "GB", "HN", "US", "US", "US", "HU", "US…
$ remote_ratio       <fct> 0, 0, 50, 0, 50, 100, 100, 50, 100, 50, 0, 0, 0, 10…
$ company_location   <chr> "DE", "JP", "GB", "HN", "US", "US", "US", "HU", "US…
$ company_size       <fct> L, S, M, S, L, L, S, L, L, S, S, L, M, L, L, L, S, …
# Inspect raw data
glimpse(startup_data)
Rows: 100,000
Columns: 11
$ funding_rounds           <dbl> 4, 1, 3, 3, 1, 2, 1, 1, 2, 2, 1, 0, 4, 2, 1, …
$ founder_experience_years <dbl> 13, 6, 5, 14, 17, 9, 15, 7, 24, 23, 24, 2, 20…
$ team_size                <dbl> 58, 221, 247, 229, 235, 184, 148, 31, 238, 17…
$ market_size_billion      <dbl> 48.225483, 31.532647, 4.969722, 3.084209, 13.…
$ product_traction_users   <dbl> 594843, 393020, 27636, 235376, 391765, 551576…
$ burn_rate_million        <dbl> 18.519211, 14.298149, 20.447567, 8.177417, 4.…
$ revenue_million          <dbl> 1483962.45, 862056.82, 97261.69, 1145785.44, …
$ investor_type            <chr> "tier2_vc", "tier2_vc", "none", "none", "none…
$ sector                   <chr> "Health", "Fintech", "SaaS", "Ecommerce", "He…
$ founder_background       <chr> "academic", "first_time", "first_time", "ex_b…
$ outcome                  <chr> "IPO", "Failure", "Failure", "Acquisition", "…
# Handle missing values and format target variable
startup_clean <- startup_data %>%
  drop_na(outcome, investor_type, sector, founder_background) %>%
  mutate(
    outcome = ifelse(outcome == "Failure", "Failure", "Success"),
    outcome = factor(outcome, levels = c("Failure", "Success")),
    investor_type = as.factor(investor_type),
    sector = as.factor(sector),
    founder_background = as.factor(founder_background)
  )

# Inspect clean data
glimpse(startup_clean)
Rows: 100,000
Columns: 11
$ funding_rounds           <dbl> 4, 1, 3, 3, 1, 2, 1, 1, 2, 2, 1, 0, 4, 2, 1, …
$ founder_experience_years <dbl> 13, 6, 5, 14, 17, 9, 15, 7, 24, 23, 24, 2, 20…
$ team_size                <dbl> 58, 221, 247, 229, 235, 184, 148, 31, 238, 17…
$ market_size_billion      <dbl> 48.225483, 31.532647, 4.969722, 3.084209, 13.…
$ product_traction_users   <dbl> 594843, 393020, 27636, 235376, 391765, 551576…
$ burn_rate_million        <dbl> 18.519211, 14.298149, 20.447567, 8.177417, 4.…
$ revenue_million          <dbl> 1483962.45, 862056.82, 97261.69, 1145785.44, …
$ investor_type            <fct> tier2_vc, tier2_vc, none, none, none, angel, …
$ sector                   <fct> Health, Fintech, SaaS, Ecommerce, Health, Eco…
$ founder_background       <fct> academic, first_time, first_time, ex_bigtech,…
$ outcome                  <fct> Success, Failure, Failure, Success, Success, …

1.2 Summary Statistics

Below are the summary statistics (mean, median, standard deviation, min, max) for our continuous target variable, salary_in_usd.

# Create summary statistics table for the salary variable
summary_stats_salary <- ds_salaries_clean %>%
  summarise(
    Mean = mean(salary_in_usd, na.rm = TRUE),
    Median = median(salary_in_usd, na.rm = TRUE),
    SD = sd(salary_in_usd, na.rm = TRUE),
    Min = min(salary_in_usd, na.rm = TRUE),
    Max = max(salary_in_usd, na.rm = TRUE)
  )

# Print the table
summary_stats_salary
# A tibble: 1 × 5
     Mean Median     SD   Min    Max
    <dbl>  <dbl>  <dbl> <dbl>  <dbl>
1 112298. 101570 70957.  2859 600000

1.3 Probability Distribution Analysis

Income and salary distributions are typically right-skewed in economics. We will plot the original histogram and then apply a log transformation to observe the changes.

# Original salary distribution plot
p1 <- ggplot(ds_salaries_clean, aes(x = salary_in_usd)) +
  geom_histogram(fill = "steelblue", color = "white", bins = 30) +
  labs(title = "Original Salary Distribution", x = "Salary (USD)", y = "Frequency") +
  theme_minimal()

# Log-transformed salary distribution plot
p2 <- ggplot(ds_salaries_clean, aes(x = log(salary_in_usd))) +
  geom_histogram(fill = "darkorange", color = "white", bins = 30) +
  labs(title = "Log-Transformed Salary", x = "Log(Salary in USD)", y = "Frequency") +
  theme_minimal()

# Display two plots side by side using patchwork
p1 + p2

Interpretation: The original histogram confirms that the salary data is right-skewed. After applying the log transformation, the distribution shape becomes much more symmetric and bell-shaped. Therefore, a Log-Normal distribution is the most suitable theoretical approximation for our target variable.


Dataset 2: Classification (Startup Funding Success)

Economic Question: “Can startup success be predicted using founder characteristics, funding structure, and market indicators?”

Dataset Description: This dataset tracks various startups, their funding metrics, and market conditions. The target variable is binary (outcome), where startups are classified as either “Success” or “Failure”, indicating whether a startup was ultimately successful or not.

Source: https://www.kaggle.com/datasets/dhrubangtalukdar/startup-funding-and-outcome-dataset

2.1 Data Cleaning

We prepare the dataset by converting the target variable (outcome) and categorical features into factors.

startup_clean <- startup_data %>%
  mutate(
    outcome = ifelse(outcome == "Failure", "Failure", "Success"),
    outcome = factor(outcome, levels = c("Failure", "Success")),

    investor_type = as.factor(investor_type),
    sector = as.factor(sector),
    founder_background = as.factor(founder_background)
  )

table(startup_clean$outcome)

Failure Success 
  55610   44390 
prop.table(table(startup_clean$outcome))

Failure Success 
 0.5561  0.4439 

2.2 Summary Statistics (Success Rate)

Since the target variable is binary, traditional continuous-variable statistics such as quartiles and log transformations are not appropriate. Instead, we look at the frequency and success rate.

# View startup success/failure counts
table(startup_clean$outcome)

Failure Success 
  55610   44390 
# Calculate success rates (percentages)
prop.table(table(startup_clean$outcome))

Failure Success 
 0.5561  0.4439 

2.3 Probability Distribution Analysis

Because outcome is binary (categorical), a histogram and log transformation are mathematically inappropriate. Instead, we use a bar plot to visualize the class balance.

# Startup success frequency plot
ggplot(startup_clean, aes(x = outcome, fill = outcome)) +
  geom_bar(color = "white") +
  labs(title = "Distribution of Startup Success (Outcome)", x = "Outcome Status", y = "Count") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  theme(legend.position = "none")

Interpretation: The target variable (outcome) is binary, meaning it takes on one of two possible states.The theoretical distribution that best approximates this binary success/failure outcome is the Bernoulli distribution, since each startup can only belong to one of two possible categories. The bar chart above helps us understand if our dataset is balanced or imbalanced regarding successful versus unsuccessful startups.

Stage 2: Predictive Modeling

2.1 Data Splitting

We split both datasets into training (80%) and test (20%) sets using the initial_split() function from the tidymodels package. We also set a seed to ensure reproducibility of the results.

# Set seed for reproducibility
set.seed(465)

# -----------------------------
# Regression Dataset Split
# -----------------------------

salary_split <- initial_split(ds_salaries_clean, prop = 0.80)

salary_train <- training(salary_split)
salary_test  <- testing(salary_split)

# Check dimensions
dim(salary_train)
[1] 485  10
dim(salary_test)
[1] 122  10
# -----------------------------
# Classification Dataset Split
# -----------------------------

startup_split <- initial_split(
  startup_clean,
  prop = 0.80,
  strata = outcome
)

startup_train <- training(startup_split)
startup_test  <- testing(startup_split)

# Check dimensions
dim(startup_train)
[1] 80000    11
dim(startup_test)
[1] 20000    11

2.2 Regression Models

Linear Regression Model

We first build a linear regression model to predict salaries using professional characteristics such as experience level, employment type, company size, and remote work ratio.

Predictor Justification: We selected experience level, employment type, company size, and remote ratio as predictors because human capital theory suggests that higher experience directly increases marginal productivity, thus commanding higher wages. Additionally, larger companies typically have greater capital and market power to offer competitive salaries, while remote work dynamics can influence salary based on geographic cost-of-living differences.

# Linear regression model

lm_model <- linear_reg() %>%
  set_engine("lm")

# Workflow
lm_workflow <- workflow() %>%
  add_model(lm_model) %>%
  add_formula(
    salary_in_usd ~ experience_level +
      employment_type +
      company_size +
      remote_ratio
  )

# Train model
lm_fit <- fit(lm_workflow, data = salary_train)

# Predictions
lm_predictions <- predict(lm_fit, salary_test) %>%
  bind_cols(salary_test)

# Evaluation metrics
lm_metrics <- lm_predictions %>%
  metrics(truth = salary_in_usd, estimate = .pred)

lm_metrics
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard   51913.   
2 rsq     standard       0.255
3 mae     standard   41421.   

Decision Tree Regression Model

As a second regression model, we use a decision tree. Decision trees can capture non-linear relationships between professional characteristics and salary.

# Decision tree regression model

tree_reg_model <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")

# Workflow
tree_reg_workflow <- workflow() %>%
  add_model(tree_reg_model) %>%
  add_formula(
    salary_in_usd ~ experience_level +
      employment_type +
      company_size +
      remote_ratio
  )

# Train model
tree_reg_fit <- fit(tree_reg_workflow, data = salary_train)

# Predictions
tree_reg_predictions <- predict(tree_reg_fit, salary_test) %>%
  bind_cols(salary_test)

# Evaluation metrics
tree_reg_metrics <- tree_reg_predictions %>%
  metrics(truth = salary_in_usd, estimate = .pred)

tree_reg_metrics
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard   51356.   
2 rsq     standard       0.267
3 mae     standard   38916.   

Regression Model Comparison

We compared the performance of the linear regression and decision tree regression models using RMSE and R?? metrics on the test dataset.

bind_rows(
  Linear_Regression = lm_metrics,
  Decision_Tree = tree_reg_metrics,
  .id = "Model"
)
# A tibble: 6 × 4
  Model             .metric .estimator .estimate
  <chr>             <chr>   <chr>          <dbl>
1 Linear_Regression rmse    standard   51913.   
2 Linear_Regression rsq     standard       0.255
3 Linear_Regression mae     standard   41421.   
4 Decision_Tree     rmse    standard   51356.   
5 Decision_Tree     rsq     standard       0.267
6 Decision_Tree     mae     standard   38916.   

2.3 Classification Models

Logistic Regression Model

We first build a logistic regression model to predict startup success using founder characteristics, funding structure, and market indicators.

Predictor Justification: We included investor type, sector, founder background, funding rounds, and team size. Economically, the number of funding rounds acts as a proxy for continuous investor confidence and market validation. Founder background and team size reflect the human capital required to execute a business plan, while the sector captures varying industry-specific growth trends and failure risks.

# Logistic regression model
log_model <- logistic_reg() %>%
  set_engine("glm")

# Workflow
log_workflow <- workflow() %>%
  add_model(log_model) %>%
  add_formula(
    outcome ~ investor_type +
      sector +
      founder_background +
      funding_rounds +
      team_size
  )

# Train model
log_fit <- fit(log_workflow, data = startup_train)


table(startup_train$outcome)

Failure Success 
  44488   35512 
# Predicted probabilities
log_probabilities <- predict(
  log_fit,
  startup_test,
  type = "prob"
)

# Check probability column names
colnames(log_probabilities)
[1] ".pred_Failure" ".pred_Success"
# Predicted classes
log_predictions <- predict(log_fit, startup_test) %>%
  bind_cols(startup_test)

# Accuracy, precision, recall
class_metrics <- metric_set(
  accuracy,
  precision,
  recall
)

log_metrics <- log_predictions %>%
  class_metrics(
    truth = outcome,
    estimate = .pred_class
  )

log_metrics
# A tibble: 3 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 accuracy  binary         0.620
2 precision binary         0.626
3 recall    binary         0.786

Decision Tree Classification Model

We also build a decision tree classification model to predict startup success.

# Decision tree classification model
tree_class_model <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

# Workflow
tree_class_workflow <- workflow() %>%
  add_model(tree_class_model) %>%
  add_formula(
    outcome ~ investor_type +
      sector +
      founder_background +
      funding_rounds +
      team_size
  )

# Train model
tree_class_fit <- fit(
  tree_class_workflow,
  data = startup_train
)

# Predictions
tree_class_predictions <- predict(
  tree_class_fit,
  startup_test
) %>%
  bind_cols(startup_test)

# Metrics
tree_class_metrics <- tree_class_predictions %>%
  class_metrics(
    truth = outcome,
    estimate = .pred_class
  )

tree_class_metrics
# A tibble: 3 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 accuracy  binary         0.614
2 precision binary         0.624
3 recall    binary         0.765

Confusion Matrix

conf_mat(
  log_predictions,
  truth = outcome,
  estimate = .pred_class
)
          Truth
Prediction Failure Success
   Failure    8740    5216
   Success    2382    3662

2.4 Probability Predictions

The logistic regression model produces predicted probabilities for startup success. The histogram below shows the distribution of these predicted probabilities.

# Add probabilities to dataset
prob_data <- log_probabilities %>%
  bind_cols(startup_test)

# Histogram
ggplot(prob_data, aes(x = .pred_Success)) +
  geom_histogram(
    bins = 30,
    fill = "steelblue",
    color = "white"
  ) +
  labs(
    title = "Predicted Probability Distribution",
    x = "Predicted Probability of Success",
    y = "Frequency"
  ) +
  theme_minimal()

A probability threshold of 0.5 was selected because it is the standard balanced cutoff for binary classification. Lowering the threshold would classify more startups as successful, increasing recall but potentially reducing precision. Raising the threshold would make the model more conservative, increasing precision but potentially reducing recall.

2.5 Classification Model Comparison

We compared the logistic regression and decision tree classification models using accuracy, precision, and recall metrics.

bind_rows(
  Logistic_Regression = log_metrics,
  Decision_Tree = tree_class_metrics,
  .id = "Model"
)
# A tibble: 6 × 4
  Model               .metric   .estimator .estimate
  <chr>               <chr>     <chr>          <dbl>
1 Logistic_Regression accuracy  binary         0.620
2 Logistic_Regression precision binary         0.626
3 Logistic_Regression recall    binary         0.786
4 Decision_Tree       accuracy  binary         0.614
5 Decision_Tree       precision binary         0.624
6 Decision_Tree       recall    binary         0.765

The logistic regression model performed better overall because its accuracy, precision, and recall values were more balanced and stable on the test dataset. This suggests that the model generalized better to unseen startup data.

Economic Interpretation and Model Selection:

For the startup classification problem, the Logistic Regression model performed better overall with balanced accuracy and recall. From an investor’s perspective (practical trade-off), precision is crucial because a “False Positive” means investing millions into a startup that will ultimately fail. Recall is also important so as not to miss “unicorn” opportunities. Logistic regression provides clear probabilities, allowing venture capitalists to adjust their risk thresholds (e.g., demanding a 70% probability of success before investing) rather than relying on a black-box decision tree.

For the salary regression dataset, the Decision Tree performed slightly better in terms of RMSE and R??. However, Linear Regression is often preferred in labor economics and HR analytics because of its interpretability. A linear model allows a company to isolate the exact dollar value of one extra year of experience or moving from a medium to a large company, making it much more actionable for setting fair salary policies despite a marginally higher error rate. ## 2.6 Cross-Validation

Cross-Validation for Logistic Regression

We also perform 5-fold cross-validation for the logistic regression model.

# 5-fold CV

set.seed(465)

startup_folds <- vfold_cv(
  startup_train,
  v = 5,
  strata = outcome
)

# Cross-validation

log_cv_results <- fit_resamples(
  log_workflow,
  resamples = startup_folds,
 metrics = metric_set(
  accuracy,
  precision,
  recall)
)

# Collect metrics

collect_metrics(log_cv_results)
# A tibble: 3 × 6
  .metric   .estimator  mean     n  std_err .config        
  <chr>     <chr>      <dbl> <int>    <dbl> <chr>          
1 accuracy  binary     0.615     5 0.00114  pre0_mod0_post0
2 precision binary     0.624     5 0.000921 pre0_mod0_post0
3 recall    binary     0.778     5 0.00313  pre0_mod0_post0

Cross-Validation for Linear Regression

# 5-fold CV

set.seed(465)

salary_folds <- vfold_cv(
  salary_train,
  v = 5
)

# Cross-validation

lm_cv_results <- fit_resamples(
  lm_workflow,
  resamples = salary_folds,
  metrics = metric_set(
    rmse,
    rsq
  )
)

# Collect metrics

collect_metrics(lm_cv_results)
# A tibble: 2 × 6
  .metric .estimator      mean     n   std_err .config        
  <chr>   <chr>          <dbl> <int>     <dbl> <chr>          
1 rmse    standard   62686.        5 3131.     pre0_mod0_post0
2 rsq     standard       0.283     5    0.0299 pre0_mod0_post0

Interpretation of Cross-Validation Results

The cross-validation results were generally similar to the test set results, suggesting that the models generalize reasonably well to unseen data. Small differences between cross-validation and test performance may indicate mild overfitting or sampling variability. Overall, the models appear stable and suitable for predictive analysis.

2.7 AI Interaction

During Stage 2, We have used Gemini and ChatGPT to better understand how to interpret model evaluation results.

Prompt: How can I compare my logistic regression model with a decision tree classification model in R using tidymodels?

AI Response:You can fit both models on the training data, make predictions on the same test data, calculate accuracy, precision, and recall, and then combine the metric outputs into one comparison table.

log_metrics <- log_predictions |>
  metrics(truth = outcome, estimate = .pred_class)

tree_metrics <- tree_class_predictions |>
  metrics(truth = outcome, estimate = .pred_class)

classification_comparison <- bind_rows(
  Logistic_Regression = log_metrics,
  Decision_Tree = tree_metrics,
  .id = "Model"
)

How I used it: I used this structure, but I changed the object names and variables according to my startup success dataset.

Reflection: It was helpful because it showed me how to put the results of two different models into one table. I checked the outputs and made sure the metrics matched my own test data.

Prompt:My regression model for salary prediction has similar RMSE values in cross-validation and the test set. How should I interpret this?

AI Responset:If the RMSE values are similar, this usually suggests that the model performs consistently across different data splits and is not heavily overfitting.

How I used it: I used this explanation while interpreting the performance of my salary prediction model.

Reflection: This helped me better understand model stability and overfitting. I compared the RMSE values from my own results before writing the interpretation.