# 1. Load required packages
library(tidyverse)
library(readxl)
library(tidymodels)
library(patchwork)
# 2. Import Excel files into R
ds_salaries <- read_excel("ds_salaries.xlsx")
startup_data <- read_excel("Startup_funding.xlsx")
ECON 465 - Data Science Project: Stage 1
Introduction
This report represents Stage 1 of the ECON 465 Data Science Project. We will analyze two distinct datasets to answer specific economic questions using probability distribution analysis.
Dataset 1: Regression (Data Science Salaries)
Economic Question: “Which professional characteristics are associated with higher salaries?”
Dataset Description: This dataset contains global salary information for data science and tech roles. The target variable is continuous (salary_in_usd). We aim to analyze how factors such as experience level, employment type, and company size relate to salaries.
Source: https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries
1.1 Data Cleaning
We need to format categorical variables as factors so that our future regression models can interpret them correctly.
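To see why the conversion matters, the sketch below compares the design matrix R builds for a numerically coded category before and after it is turned into a factor. The toy data frame and its values are hypothetical and are not taken from ds_salaries; base R is used so the example runs on its own.

```r
# Hypothetical toy data; the values are illustrative only
toy <- data.frame(
  salary_in_usd = c(60000, 90000, 150000, 80000, 120000, 200000),
  remote_ratio  = c(0, 50, 100, 0, 50, 100)
)

# Left numeric, remote_ratio enters the model as a single slope column
x_numeric <- model.matrix(salary_in_usd ~ remote_ratio, data = toy)

# As a factor, it is expanded into one dummy column per non-reference level
toy$remote_ratio <- as.factor(toy$remote_ratio)
x_factor <- model.matrix(salary_in_usd ~ remote_ratio, data = toy)

colnames(x_numeric)  # "(Intercept)" "remote_ratio"
colnames(x_factor)   # "(Intercept)" "remote_ratio50" "remote_ratio100"
```

With the factor coding, each remote-work level gets its own coefficient relative to the reference level (0), instead of one linear effect per percentage point.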
# Convert categorical variables to factors
ds_salaries_clean <- ds_salaries %>%
  mutate(
    experience_level = as.factor(experience_level),
    employment_type = as.factor(employment_type),
    company_size = as.factor(company_size),
    remote_ratio = as.factor(remote_ratio)
  )
1.2 Summary Statistics
Below are the summary statistics (mean, median, standard deviation, min, max) for our continuous target variable, salary_in_usd.
# Create summary statistics table for the salary variable
summary_stats_salary <- ds_salaries_clean %>%
  summarise(
    Mean = mean(salary_in_usd, na.rm = TRUE),
    Median = median(salary_in_usd, na.rm = TRUE),
    SD = sd(salary_in_usd, na.rm = TRUE),
    Min = min(salary_in_usd, na.rm = TRUE),
    Max = max(salary_in_usd, na.rm = TRUE)
  )
# Print the table
summary_stats_salary
# A tibble: 1 × 5
     Mean Median     SD   Min    Max
    <dbl>  <dbl>  <dbl> <dbl>  <dbl>
1 112298. 101570 70957.  2859 600000
1.3 Probability Distribution Analysis
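Income and salary distributions are typically right-skewed in economics. The summary statistics already hint at this: the mean salary (about 112,298 USD) exceeds the median (101,570 USD). We will plot the original histogram and then apply a log transformation to observe how the shape changes.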
# Original salary distribution plot
p1 <- ggplot(ds_salaries_clean, aes(x = salary_in_usd)) +
  geom_histogram(fill = "steelblue", color = "white", bins = 30) +
  labs(title = "Original Salary Distribution", x = "Salary (USD)", y = "Frequency") +
  theme_minimal()
# Log-transformed salary distribution plot
p2 <- ggplot(ds_salaries_clean, aes(x = log(salary_in_usd))) +
  geom_histogram(fill = "darkorange", color = "white", bins = 30) +
  labs(title = "Log-Transformed Salary", x = "Log(Salary in USD)", y = "Frequency") +
  theme_minimal()
# Display two plots side by side using patchwork
p1 + p2
Interpretation: The original histogram shows that the salary data is right-skewed. After the log transformation, the distribution becomes much more symmetric and bell-shaped. A log-normal distribution is therefore a reasonable theoretical approximation for our target variable.
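A visual impression of symmetry can be backed up with a few numbers. The sketch below is a minimal illustration on simulated data, not on the actual salaries: the seed, sample size, meanlog, and sdlog are made-up values, and skewness() is a small helper defined here rather than a package function.

```r
set.seed(465)  # arbitrary seed so the simulation is reproducible

# Simulated stand-in for salary_in_usd; meanlog and sdlog are made-up values
sim_salary <- rlnorm(5000, meanlog = 11.5, sdlog = 0.6)

# Moment-based sample skewness: near 0 for a symmetric distribution
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(sim_salary)       # positive for right-skewed data
skew_log <- skewness(log(sim_salary))  # near zero if the logs are normal

# Log-normal parameters estimated from the logged values
meanlog_hat <- mean(log(sim_salary))
sdlog_hat   <- sd(log(sim_salary))

# Normal Q-Q plot of the logs: points near the line support log-normality
qqnorm(log(sim_salary))
qqline(log(sim_salary))
```

To run the same check on the real data, sim_salary would be replaced with ds_salaries_clean$salary_in_usd; the conclusions then depend on what that data actually shows.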
Dataset 2: Classification (Startup Funding Success)
Economic Question: “Can startup success be predicted using founder characteristics, funding structure, and market indicators?”
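Dataset Description: This dataset tracks various startups, their funding metrics, and market conditions. The target variable (outcome) records each startup's final status. As recorded it has three levels (Acquisition, Failure, IPO); grouping Acquisition and IPO as success yields a binary target indicating whether a startup was ultimately successful or not.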
Source: https://www.kaggle.com/datasets/dhrubangtalukdar/startup-funding-and-outcome-dataset
2.1 Data Cleaning
We prepare the dataset by converting the target variable (outcome) and categorical features into factors.
# Convert outcome and other text columns to factors
startup_clean <- startup_data %>%
  mutate(
    outcome = as.factor(outcome),
    investor_type = as.factor(investor_type),
    sector = as.factor(sector),
    founder_background = as.factor(founder_background)
  )
2.2 Summary Statistics (Success Rate)
Since the target variable is categorical, summary statistics such as the mean and standard deviation do not apply directly. Instead, we look at the class frequencies and their proportions.
# View startup outcome counts
table(startup_clean$outcome)

Acquisition     Failure         IPO
      42335       55610        2055
# Calculate outcome proportions (shares of all startups)
prop.table(table(startup_clean$outcome))

Acquisition     Failure         IPO
    0.42335     0.55610     0.02055
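The table reports three outcome levels rather than two, so a single success rate requires deciding which levels count as success. The sketch below assumes that Acquisition and IPO both count as success; that grouping is an assumption made for illustration, and the counts are copied from the output above.

```r
# Outcome counts as printed by table() in this section
outcome_counts <- c(Acquisition = 42335, Failure = 55610, IPO = 2055)

# Assumption: Acquisition and IPO count as success, Failure as failure
is_success <- names(outcome_counts) %in% c("Acquisition", "IPO")
binary_counts <- c(
  Success = sum(outcome_counts[is_success]),
  Failure = sum(outcome_counts[!is_success])
)

binary_counts               # Success = 44390, Failure = 55610
prop.table(binary_counts)   # Success = 0.4439, Failure = 0.5561
```

On the row-level data, the same grouping could be applied to the outcome column before modelling, for example with forcats::fct_collapse() inside mutate().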
2.3 Probability Distribution Analysis
Because outcome is categorical, a histogram and a log transformation do not apply. Instead, we use a bar plot to visualize the class balance.
# Startup success frequency plot
ggplot(startup_clean, aes(x = outcome, fill = outcome)) +
  geom_bar(color = "white") +
  labs(title = "Distribution of Startup Success (Outcome)", x = "Outcome Status", y = "Count") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  theme(legend.position = "none")
Interpretation: As recorded, outcome takes three values (Acquisition, Failure, IPO), so its natural theoretical distribution is the categorical distribution (a multinomial with a single trial). If Acquisition and IPO are grouped as success, the target becomes binary and follows a Bernoulli distribution with an estimated success probability of about 0.444 (0.42335 + 0.02055). The bar chart shows that the classes are imbalanced: failures make up about 56% of startups, while IPOs account for only about 2%.
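For a Bernoulli variable, the success probability p determines both the mean (p) and the variance (p(1 - p)), which is why a separate standard deviation adds little information. The sketch below illustrates this under the assumption that success means Acquisition or IPO; the counts come from the frequency table in Section 2.2, while the seed and number of draws are arbitrary.

```r
# Assumption: success = Acquisition or IPO, using the counts from Section 2.2
n_success <- 42335 + 2055
n_total   <- 42335 + 55610 + 2055
p_hat     <- n_success / n_total   # estimated success probability

# For a Bernoulli(p) variable the mean is p and the variance is p * (1 - p)
bern_var <- p_hat * (1 - p_hat)
bern_sd  <- sqrt(bern_var)

# Simulated Bernoulli draws (binomial with size = 1) should match both moments
set.seed(465)  # arbitrary seed for reproducibility
draws <- rbinom(100000, size = 1, prob = p_hat)

c(p_hat = p_hat, sim_mean = mean(draws), bern_sd = bern_sd, sim_sd = sd(draws))
```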