ECON 465 - Data Science Project: Stage 1

Authors

Arda Cem Acar, Silan Kilicarslan

Introduction

This report presents Stage 1 of the ECON 465 Data Science Project. We analyze two distinct datasets to answer specific economic questions using probability distribution analysis.

# 1. Load required packages
library(tidyverse)
library(readxl)
library(tidymodels)
library(patchwork)

# 2. Import Excel files into R
ds_salaries <- read_excel("ds_salaries.xlsx")
startup_data <- read_excel("Startup_funding.xlsx")
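
Immediately after importing, a quick structural check confirms that both files loaded with the expected columns and types; a minimal sketch using dplyr's glimpse(), assuming both Excel files sit in the working directory:

# Quick structural check of both imports
glimpse(ds_salaries)
glimpse(startup_data)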

Dataset 1: Regression (Data Science Salaries)

Economic Question: “Which professional characteristics are associated with higher salaries?”

Dataset Description: This dataset contains global salary information for data science and tech roles. The target variable is continuous (salary_in_usd). We aim to analyze how factors like experience level, employment type, and company size impact salaries.

Source: https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries

1.1 Data Cleaning

We need to format categorical variables as factors so that our future regression models can interpret them correctly.

# Convert categorical variables to factors
ds_salaries_clean <- ds_salaries %>%
  mutate(
    experience_level = as.factor(experience_level),
    employment_type = as.factor(employment_type),
    company_size = as.factor(company_size),
    remote_ratio = as.factor(remote_ratio)
  )
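
As a sanity check, the levels created by the conversion can be listed; a minimal sketch using tidyselect and purrr (both loaded with tidyverse):

# Verify the conversion by listing each factor's levels
ds_salaries_clean %>%
  select(where(is.factor)) %>%
  map(levels)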

1.2 Summary Statistics

Below are the summary statistics (mean, median, standard deviation, min, max) for our continuous target variable, salary_in_usd.

# Create summary statistics table for the salary variable
summary_stats_salary <- ds_salaries_clean %>%
  summarise(
    Mean = mean(salary_in_usd, na.rm = TRUE),
    Median = median(salary_in_usd, na.rm = TRUE),
    SD = sd(salary_in_usd, na.rm = TRUE),
    Min = min(salary_in_usd, na.rm = TRUE),
    Max = max(salary_in_usd, na.rm = TRUE)
  )

# Print the table
summary_stats_salary
# A tibble: 1 × 5
     Mean Median     SD   Min    Max
    <dbl>  <dbl>  <dbl> <dbl>  <dbl>
1 112298. 101570 70957.  2859 600000
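
Because the economic question asks which professional characteristics are associated with higher salaries, a grouped summary gives a useful first look; a minimal sketch grouping by experience level:

# Mean and median salary by experience level
ds_salaries_clean %>%
  group_by(experience_level) %>%
  summarise(
    Mean = mean(salary_in_usd, na.rm = TRUE),
    Median = median(salary_in_usd, na.rm = TRUE),
    N = n()
  ) %>%
  arrange(desc(Mean))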

1.3 Probability Distribution Analysis

Income and salary distributions are typically right-skewed in economics. We will plot the original histogram and then apply a log transformation to observe the changes.

# Original salary distribution plot
p1 <- ggplot(ds_salaries_clean, aes(x = salary_in_usd)) +
  geom_histogram(fill = "steelblue", color = "white", bins = 30) +
  labs(title = "Original Salary Distribution", x = "Salary (USD)", y = "Frequency") +
  theme_minimal()

# Log-transformed salary distribution plot
p2 <- ggplot(ds_salaries_clean, aes(x = log(salary_in_usd))) +
  geom_histogram(fill = "darkorange", color = "white", bins = 30) +
  labs(title = "Log-Transformed Salary", x = "Log(Salary in USD)", y = "Frequency") +
  theme_minimal()

# Display two plots side by side using patchwork
p1 + p2

Interpretation: The original histogram confirms that the salary data is right-skewed. After applying the log transformation, the distribution shape becomes much more symmetric and bell-shaped. Therefore, a Log-Normal distribution is the most suitable theoretical approximation for our target variable.
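
One informal way to check the log-normal claim is a normal Q-Q plot of the log-transformed salaries: if the points track the reference line, the log-normal approximation is reasonable. A minimal sketch using ggplot2:

# Q-Q plot of log(salary): points near the line support a log-normal approximation
ggplot(ds_salaries_clean, aes(sample = log(salary_in_usd))) +
  stat_qq(color = "steelblue") +
  stat_qq_line(color = "darkorange") +
  labs(title = "Q-Q Plot of Log Salaries",
       x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal()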


Dataset 2: Classification (Startup Funding Success)

Economic Question: “Can startup success be predicted using founder characteristics, funding structure, and market indicators?”

Dataset Description: This dataset tracks various startups, their funding metrics, and market conditions. The target variable is outcome, a categorical variable recording whether each startup ultimately failed, was acquired, or reached an IPO; for the classification task, an acquisition or IPO can be treated as a successful exit.

Source: https://www.kaggle.com/datasets/dhrubangtalukdar/startup-funding-and-outcome-dataset

2.1 Data Cleaning

We prepare the dataset by converting the target variable (outcome) and categorical features into factors.

# Convert outcome and other text columns to factors
startup_clean <- startup_data %>%
  mutate(
    outcome = as.factor(outcome),
    investor_type = as.factor(investor_type),
    sector = as.factor(sector),
    founder_background = as.factor(founder_background)
  )
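
Before summarising, it is also worth checking for missing values; a minimal sketch counting NAs in every column:

# Count missing values in each column
startup_clean %>%
  summarise(across(everything(), ~ sum(is.na(.x))))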

2.2 Summary Statistics (Outcome Frequencies and Success Rate)

Since the target variable is categorical rather than continuous, traditional statistics such as the mean and standard deviation are not meaningful. Instead, we look at the frequency of each outcome and the implied success rate.

# View counts of each outcome category
table(startup_clean$outcome)

Acquisition     Failure         IPO 
      42335       55610        2055 
# Calculate the proportion of startups in each outcome category
prop.table(table(startup_clean$outcome))

Acquisition     Failure         IPO 
    0.42335     0.55610     0.02055 

2.3 Probability Distribution Analysis

Because outcome is categorical, a histogram and log transformation are mathematically inappropriate. Instead, we use a bar plot to visualize the class balance.

# Startup success frequency plot
ggplot(startup_clean, aes(x = outcome, fill = outcome)) +
  geom_bar(color = "white") +
  labs(title = "Distribution of Startup Success (Outcome)", x = "Outcome Status", y = "Count") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  theme(legend.position = "none")

Interpretation: The target variable (outcome) takes one of three values: Failure, Acquisition, or IPO, so the variable itself follows a categorical (multinomial) distribution. If success is defined as reaching an acquisition or an IPO, the collapsed success/failure indicator is a binary variable best described by a Bernoulli distribution. The bar chart above shows that the classes are imbalanced: failures are the most common outcome and IPOs are rare (about 2% of startups). A sketch of the collapsing step is shown below.
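
If success is defined as reaching an acquisition or an IPO, the three outcome categories can be collapsed into a binary success indicator for later classification work. A minimal sketch, assuming that definition of success and using forcats::fct_collapse (loaded with tidyverse):

# Collapse the three outcome categories into a binary success/failure factor
startup_clean <- startup_clean %>%
  mutate(
    success = fct_collapse(
      outcome,
      Success = c("Acquisition", "IPO"),
      Failure = "Failure"
    )
  )

# Check the resulting class balance
prop.table(table(startup_clean$success))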