library(tidyverse)
library(janitor)
library(knitr)
library(scales)

ECON 465 - Data Science Project: Stage 1
Project Overview
This Stage 1 report presents two real-world economic datasets. The first dataset is used for a regression problem, where the goal is to predict a continuous outcome. The second dataset is used for a classification problem, where the goal is to predict a binary outcome.
The two datasets are:
- Regression dataset: Football player salary prediction
- Classification dataset: German credit risk prediction
For each dataset, this report includes a dataset description, economic question, data cleaning code, summary statistics, distribution analysis, log transformation where appropriate, and a proposed theoretical distribution.
Dataset 1: Regression — Football Player Salary Prediction
Dataset Description and Source
The first dataset is SalaryPrediction.csv. It contains information about professional football players, including their wage, age, club, league, nationality, position, appearances, and national team caps.
The dataset is relevant to economics because professional football is a labor market where wages are determined by player characteristics, experience, reputation, and performance-related factors.
Economic Question
Which player characteristics best predict professional football players’ wages?
Target Variable
The target variable is Wage, which is a continuous numerical variable. Therefore, this dataset is appropriate for a regression analysis.
Importing and Cleaning the Data
salary_raw <- read_csv("SalaryPrediction.csv")
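salary_clean <- salary_raw %>%
  # Standardize column names to snake_case
  clean_names() %>%
  mutate(
    # Convert wage to numeric, dropping any non-numeric characters around the number
    wage = parse_number(as.character(wage)),
    club = as.factor(club),
    league = as.factor(league),
    nation = as.factor(nation),
    position = as.factor(position)
  ) %>%
  # Keep only rows with complete numeric fields, then remove exact duplicate rows
  drop_na(wage, age, apps, caps) %>%
  distinct()
glimpse(salary_clean)

Rows: 3,842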
Columns: 8
$ wage <dbl> 46427000, 42125000, 34821000, 19959000, 19500000, 18810000, 1…
$ age <dbl> 23, 30, 35, 31, 31, 30, 29, 30, 27, 29, 31, 22, 32, 29, 31, 3…
$ club <fct> PSG, PSG, PSG, R. Madrid, Man UFC, R. Madrid, Inter, Liverpoo…
$ league <fct> Ligue 1 Uber Eats, Ligue 1 Uber Eats, Ligue 1 Uber Eats, La L…
$ nation <fct> FRA, BRA, ARG, BEL, ESP, AUT, BEL, EGY, ENG, FRA, BEL, NOR, G…
$ position <fct> Forward, Midfilder, Forward, Forward, Goalkeeper, Defender, F…
$ apps <dbl> 190, 324, 585, 443, 480, 371, 427, 367, 326, 287, 399, 159, 4…
$ caps <dbl> 57, 119, 162, 120, 45, 94, 102, 85, 77, 86, 91, 21, 105, 50, …
Dataset Requirements Check
salary_requirements <- tibble(
  requirement = c("Number of observations", "Number of variables", "Target variable type"),
  value = c(nrow(salary_clean), ncol(salary_clean), "Continuous numeric")
)
kable(salary_requirements)

| requirement | value |
|---|---|
| Number of observations | 3842 |
| Number of variables | 8 |
| Target variable type | Continuous numeric |
This dataset satisfies the project requirement of at least 500 observations and at least 5 variables.
Missing Values
salary_clean %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "missing_count") %>%
  kable()

| variable | missing_count |
|---|---|
| wage | 0 |
| age | 0 |
| club | 0 |
| league | 0 |
| nation | 0 |
| position | 0 |
| apps | 0 |
| caps | 0 |
Summary Statistics
salary_clean %>%
  summarise(
    mean_wage = mean(wage),
    median_wage = median(wage),
    sd_wage = sd(wage),
    min_wage = min(wage),
    q1_wage = quantile(wage, 0.25),
    q3_wage = quantile(wage, 0.75),
    max_wage = max(wage),
    mean_age = mean(age),
    mean_apps = mean(apps),
    mean_caps = mean(caps)
  ) %>%
  kable(digits = 2)

| mean_wage | median_wage | sd_wage | min_wage | q1_wage | q3_wage | max_wage | mean_age | mean_apps | mean_caps |
|---|---|---|---|---|---|---|---|---|---|
| 1390327 | 416000 | 2605882 | 1400 | 78000 | 1569500 | 46427000 | 24.22 | 142.43 | 9.08 |
Histogram of Original Wage
ggplot(salary_clean, aes(x = wage)) +
  geom_histogram(bins = 30, color = "white") +
  scale_x_continuous(labels = comma) +
  labs(
    title = "Distribution of Football Player Wages",
    x = "Wage",
    y = "Frequency"
  ) +
  theme_minimal()

Interpretation of Original Wage Distribution
The wage distribution is strongly right-skewed: the mean wage (1,390,327) is more than three times the median (416,000), and the maximum (46,427,000) lies far above the third quartile (1,569,500). Most professional football players earn relatively low wages, while a small number of elite players earn extremely high wages. This pattern is common in professional sports labor markets, where superstar players receive very large wage premiums.
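One way to put a number on the right skew described above is the moment-based sample skewness together with the mean-to-median ratio. The sketch below is illustrative only: `wage_sim` is synthetic data drawn from a log-normal distribution with made-up parameters, and `sample_skewness()` is a helper defined here, not a function used elsewhere in the report.

```r
# Synthetic stand-in for the wage column. ASSUMPTION: a log-normal shape with
# made-up parameters; these are not the real SalaryPrediction.csv values.
set.seed(465)
wage_sim <- rlnorm(3842, meanlog = 13, sdlog = 1.5)

# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- sample_skewness(wage_sim)
skew_log <- sample_skewness(log(wage_sim))
mean_to_median <- mean(wage_sim) / median(wage_sim)

cat("Skewness of raw values:", round(skew_raw, 2), "\n")
cat("Skewness of log values:", round(skew_log, 2), "\n")
cat("Mean / median ratio:   ", round(mean_to_median, 2), "\n")
```

On the real data, `wage_sim` would be replaced by `salary_clean$wage`. A skewness well above zero and a mean-to-median ratio well above one both point to a long right tail.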
Log Transformation
Because the wage distribution is highly skewed, a logarithmic transformation is applied.
salary_clean <- salary_clean %>%
  mutate(log_wage = log(wage))

Histogram of Log-Transformed Wage
ggplot(salary_clean, aes(x = log_wage)) +
  geom_histogram(bins = 30, color = "white") +
  labs(
    title = "Distribution of Log-Transformed Football Player Wages",
    x = "Log(Wage)",
    y = "Frequency"
  ) +
  theme_minimal()

Interpretation After Log Transformation
The log transformation reduces the influence of extreme wage values and makes the distribution more symmetric. Compared with the original wage distribution, the log-transformed wage distribution is closer to a normal distribution.
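The claim that the log-transformed values are closer to normal can be checked with a normal Q-Q comparison in addition to the histogram. The sketch below uses synthetic log-normal data as a stand-in for the wage column (an assumption, not the real data) and summarizes the Q-Q plot by the correlation between sample and theoretical normal quantiles; `qq_correlation()` is a hypothetical helper defined only for this illustration.

```r
# Synthetic stand-in for the wage column (ASSUMPTION: log-normal, made-up parameters).
set.seed(465)
wage_sim <- rlnorm(3842, meanlog = 13, sdlog = 1.5)

# Correlation between the sample quantiles and the matching theoretical
# normal quantiles. Values near 1 mean the Q-Q plot is close to a straight line.
qq_correlation <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)  # list with x = theoretical, y = sample quantiles
  cor(qq$x, qq$y)
}

qq_raw <- qq_correlation(wage_sim)
qq_log <- qq_correlation(log(wage_sim))

cat("Q-Q correlation, raw values:", round(qq_raw, 3), "\n")
cat("Q-Q correlation, log values:", round(qq_log, 3), "\n")
```

For a visual version on the real data, `qqnorm(salary_clean$log_wage)` followed by `qqline(salary_clean$log_wage)` draws the plot itself.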
Proposed Theoretical Distribution
The original wage variable appears to follow a log-normal distribution because it is positive, highly right-skewed, and contains a small number of very large values. After applying the logarithmic transformation, the transformed variable is closer to a normal distribution.
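A log-normal proposal can be made concrete by estimating its two parameters and comparing the fitted distribution with the data. The following is a minimal sketch on synthetic data (an assumption standing in for the wage column): the maximum-likelihood estimates are the mean and the n-denominator standard deviation of the logged values, and `ks.test()` gives a rough distance between the empirical and fitted distributions.

```r
# Synthetic stand-in for the wage column (ASSUMPTION: log-normal, made-up parameters).
set.seed(465)
wage_sim <- rlnorm(3842, meanlog = 13, sdlog = 1.5)

# Maximum-likelihood estimates of the log-normal parameters.
log_w <- log(wage_sim)
meanlog_hat <- mean(log_w)
sdlog_hat <- sqrt(mean((log_w - meanlog_hat)^2))  # MLE divides by n, not n - 1

# Kolmogorov-Smirnov distance between the data and the fitted log-normal.
# Because the parameters are estimated from the same data, the reported
# p-value is too optimistic; the statistic is best read descriptively.
ks <- ks.test(wage_sim, "plnorm", meanlog = meanlog_hat, sdlog = sdlog_hat)
ks_stat <- unname(ks$statistic)

cat("Estimated meanlog:", round(meanlog_hat, 3), "\n")
cat("Estimated sdlog:  ", round(sdlog_hat, 3), "\n")
cat("KS statistic:     ", round(ks_stat, 4), "\n")
```

Applied to `salary_clean$wage`, a small KS statistic would support the log-normal proposal, while a large one would suggest heavier or lighter tails than the log-normal allows.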
Potential Predictive Models for Later Stages
Two regression models that can be used in later stages are:
- Linear Regression
- Random Forest Regression
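As a rough preview of the first model listed above, the sketch below fits a linear regression of log wage on age, appearances, and caps, and evaluates it on a held-out set. Everything in it is an assumption for illustration: the `players` data frame is simulated, the coefficients used to generate it are invented, and the 80/20 split is only one possible choice.

```r
# Simulated player data. ASSUMPTION: invented relationships, not estimates
# from SalaryPrediction.csv.
set.seed(465)
n <- 1000
players <- data.frame(
  age  = sample(18:38, n, replace = TRUE),
  apps = rpois(n, lambda = 140),
  caps = rpois(n, lambda = 9)
)
players$log_wage <- 9 + 0.05 * players$age + 0.004 * players$apps +
  0.05 * players$caps + rnorm(n, sd = 1)

# Hold out 20% of the rows for evaluation.
train_idx <- sample(seq_len(n), size = 800)
train <- players[train_idx, ]
test  <- players[-train_idx, ]

# Linear regression on the log scale.
fit <- lm(log_wage ~ age + apps + caps, data = train)

# Out-of-sample root mean squared error, in log-wage units.
pred <- predict(fit, newdata = test)
rmse_log <- sqrt(mean((test$log_wage - pred)^2))

print(round(coef(fit), 3))
cat("Test RMSE (log scale):", round(rmse_log, 3), "\n")
```

A random forest could be evaluated on the same split with a package such as randomForest or ranger, which would allow a like-for-like comparison of test RMSE.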
Dataset 2: Classification — German Credit Risk
Dataset Description and Source
The second dataset is german_credit_data (1).csv. It contains borrower-level credit information, including age, sex, job type, housing status, saving accounts, checking account status, credit amount, loan duration, loan purpose, and credit risk.
The dataset is relevant to economics because credit risk prediction is an important issue in financial markets. Banks and lenders need to evaluate borrower characteristics to estimate the probability of default or risky repayment behavior.
Economic Question
Can borrower characteristics predict credit default risk?
Target Variable
The target variable is Risk, which has two categories: good and bad. Therefore, this dataset is appropriate for binary classification.
Importing and Cleaning the Data
credit_raw <- read_csv("german_credit_data (1).csv")
credit_clean <- credit_raw %>%
  clean_names() %>%
  select(-any_of("unnamed_0")) %>%
  mutate(
    sex = as.factor(sex),
    job = as.factor(job),
    housing = as.factor(housing),
    # Treat missing account information as its own "unknown" category
    saving_accounts = replace_na(saving_accounts, "unknown") %>% as.factor(),
    checking_account = replace_na(checking_account, "unknown") %>% as.factor(),
    purpose = as.factor(purpose),
    risk = as.factor(risk),
    # Numeric target: 1 = bad risk, 0 = good risk
    risk_binary = if_else(risk == "bad", 1, 0)
  ) %>%
  distinct()
glimpse(credit_clean)

Rows: 1,000
Columns: 12
$ x1 <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
$ age <dbl> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28, 25, 24, 22, 6…
$ sex <fct> male, female, male, male, male, male, male, male, mal…
$ job <fct> 2, 2, 1, 2, 2, 1, 2, 3, 1, 3, 2, 2, 2, 1, 2, 1, 2, 2,…
$ housing <fct> own, own, own, free, free, free, own, rent, own, own,…
$ saving_accounts <fct> unknown, little, little, little, little, unknown, qui…
$ checking_account <fct> little, moderate, unknown, little, little, unknown, u…
$ credit_amount <dbl> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 6948, 3059,…
$ duration <dbl> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 12, 48, 12, 24…
$ purpose <fct> radio/TV, radio/TV, education, furniture/equipment, c…
$ risk <fct> good, bad, good, good, bad, good, good, good, good, b…
$ risk_binary <dbl> 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0,…
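The glimpse output still lists an `x1` column whose values run 0, 1, 2, and so on. This suggests that the CSV's unnamed row-index column was cleaned to `x1` rather than `unnamed_0`, so the `select(-any_of("unnamed_0"))` step likely left it in place. Because every row then has a unique index, `distinct()` cannot detect duplicated borrowers. The sketch below illustrates the effect on a small made-up data frame (`credit_toy` is hypothetical), using base R so that it runs without extra packages.

```r
# Toy data: rows 1 and 4 describe the same borrower but carry different
# index values. ASSUMPTION: made-up rows, not taken from the real file.
credit_toy <- data.frame(
  x1 = 0:3,
  age = c(67, 22, 49, 67),
  credit_amount = c(1169, 5951, 2096, 1169),
  risk = c("good", "bad", "good", "good")
)

# With the index column present, every row looks unique.
rows_with_index <- nrow(unique(credit_toy))

# Drop whichever index-like column exists, then de-duplicate again.
keep_cols <- setdiff(names(credit_toy), c("x1", "unnamed_0"))
credit_no_index <- credit_toy[, keep_cols, drop = FALSE]
rows_without_index <- nrow(unique(credit_no_index))

cat("Unique rows with index:   ", rows_with_index, "\n")
cat("Unique rows without index:", rows_without_index, "\n")
```

In the tidyverse pipeline, the equivalent change would be `select(-any_of(c("x1", "unnamed_0")))` before `distinct()`. Whether `x1` really is a row index should be confirmed against the raw file before it is dropped.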
Dataset Requirements Check
credit_requirements <- tibble(
  requirement = c("Number of observations", "Number of variables", "Target variable type"),
  value = c(nrow(credit_clean), ncol(credit_clean), "Binary categorical")
)
kable(credit_requirements)

| requirement | value |
|---|---|
| Number of observations | 1000 |
| Number of variables | 12 |
| Target variable type | Binary categorical |
This dataset satisfies the project requirement of at least 500 observations and at least 5 variables.
Missing Values
credit_clean %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "missing_count") %>%
  kable()

| variable | missing_count |
|---|---|
| x1 | 0 |
| age | 0 |
| sex | 0 |
| job | 0 |
| housing | 0 |
| saving_accounts | 0 |
| checking_account | 0 |
| credit_amount | 0 |
| duration | 0 |
| purpose | 0 |
| risk | 0 |
| risk_binary | 0 |
Summary Statistics
credit_clean %>%
  summarise(
    mean_age = mean(age),
    median_age = median(age),
    sd_age = sd(age),
    mean_credit_amount = mean(credit_amount),
    median_credit_amount = median(credit_amount),
    sd_credit_amount = sd(credit_amount),
    mean_duration = mean(duration),
    median_duration = median(duration),
    default_rate = mean(risk_binary)
  ) %>%
  kable(digits = 2)

| mean_age | median_age | sd_age | mean_credit_amount | median_credit_amount | sd_credit_amount | mean_duration | median_duration | default_rate |
|---|---|---|---|---|---|---|---|---|
| 35.55 | 33 | 11.38 | 3271.26 | 2319.5 | 2822.74 | 20.9 | 18 | 0.3 |
Distribution of the Binary Target Variable
ggplot(credit_clean, aes(x = risk)) +
  geom_bar() +
  labs(
    title = "Distribution of Credit Risk Categories",
    x = "Risk Category",
    y = "Count"
  ) +
  theme_minimal()

Interpretation of Risk Distribution
The target variable has two outcomes, good credit risk and bad credit risk, so it is suited to classification models. According to the summary statistics, about 30% of borrowers are labeled bad (default_rate = 0.3), which makes the classes moderately imbalanced. Each borrower's label can be treated as a draw from a Bernoulli distribution with probability of bad risk p ≈ 0.3.
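Treating the risk label as a Bernoulli outcome means a single parameter, the probability of bad risk, describes the target. The sketch below estimates that probability and its uncertainty from counts. It assumes that the rounded default rate of 0.3 reported in the summary table corresponds to 300 bad-risk borrowers out of 1,000; the exact count should be read from the data.

```r
# ASSUMPTION: 300 bad-risk borrowers, inferred from the rounded rate 0.3 and n = 1000.
n_borrowers <- 1000
n_bad <- 300

# Point estimate of the Bernoulli parameter and the implied variance p(1 - p).
p_hat <- n_bad / n_borrowers
bernoulli_var <- p_hat * (1 - p_hat)

# Standard error of the estimated proportion.
se_p <- sqrt(bernoulli_var / n_borrowers)

# Exact (Clopper-Pearson) 95% confidence interval for p.
ci_exact <- binom.test(n_bad, n_borrowers)$conf.int

cat("Estimated p:", p_hat, "\n")
cat("Standard error:", round(se_p, 4), "\n")
cat("95% CI:", round(ci_exact[1], 3), "to", round(ci_exact[2], 3), "\n")
```

On the real data, `n_bad` would be `sum(credit_clean$risk_binary)` and `n_borrowers` would be `nrow(credit_clean)`.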
Probability Distribution Analysis of Credit Amount
The binary target variable cannot be meaningfully log-transformed because it only takes two values. Therefore, for the histogram and log-transformation part of the probability distribution analysis, the main continuous economic variable, credit amount, is also examined.
Histogram of Original Credit Amount
ggplot(credit_clean, aes(x = credit_amount)) +
  geom_histogram(bins = 30, color = "white") +
  scale_x_continuous(labels = comma) +
  labs(
    title = "Distribution of Credit Amount",
    x = "Credit Amount",
    y = "Frequency"
  ) +
  theme_minimal()

Log Transformation of Credit Amount
credit_clean <- credit_clean %>%
  mutate(log_credit_amount = log(credit_amount))

Histogram of Log-Transformed Credit Amount
ggplot(credit_clean, aes(x = log_credit_amount)) +
  geom_histogram(bins = 30, color = "white") +
  labs(
    title = "Distribution of Log-Transformed Credit Amount",
    x = "Log(Credit Amount)",
    y = "Frequency"
  ) +
  theme_minimal()

Interpretation After Log Transformation
The original credit amount distribution is right-skewed (mean 3,271.26 versus median 2,319.5), meaning that most borrowers request relatively small loans while a smaller number request very large loans. The log transformation reduces skewness and makes the credit amount distribution more symmetric.
Proposed Theoretical Distribution
The binary target variable, Risk, can be modeled as a Bernoulli random variable, since each observation has one of two possible outcomes: good or bad risk.
The continuous variable credit amount appears closer to a log-normal distribution, because it is positive and right-skewed. After the logarithmic transformation, it becomes closer to a normal distribution.
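The log-normal proposal for credit amount can be sanity-checked using only the summary statistics reported earlier. For a log-normal variable, the mean is exp(mu + sigma^2 / 2) and the median is exp(mu), so the mean-to-median ratio equals exp(sigma^2 / 2). The coefficient of variation is sqrt(exp(sigma^2) - 1). The sketch below backs out sigma from the reported mean and median and compares the implied coefficient of variation with the observed one. It is a rough consistency check, not a formal test.

```r
# Summary statistics copied from the credit summary table in this report.
mean_amount   <- 3271.26
median_amount <- 2319.5
sd_amount     <- 2822.74

# Under a log-normal model: mean / median = exp(sigma^2 / 2).
sigma_implied <- sqrt(2 * log(mean_amount / median_amount))

# Coefficient of variation implied by that sigma versus the observed one.
cv_implied  <- sqrt(exp(sigma_implied^2) - 1)
cv_observed <- sd_amount / mean_amount

cat("Implied sigma (log scale):", round(sigma_implied, 3), "\n")
cat("Implied CV: ", round(cv_implied, 3), "\n")
cat("Observed CV:", round(cv_observed, 3), "\n")
```

The implied sigma comes out near 0.83, and the implied coefficient of variation is somewhat above the observed one. That is consistent with a distribution that is roughly, though not exactly, log-normal.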
Potential Predictive Models for Later Stages
Two classification models that can be used in later stages are:
- Logistic Regression
- Random Forest Classification
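As a rough preview of the first classifier listed above, the sketch below fits a logistic regression of the binary risk indicator on log credit amount and loan duration. All of it is illustrative: the `borrowers` data frame is simulated, the coefficients that generate the outcome are invented, and the 0.5 classification threshold is an arbitrary default.

```r
# Simulated borrower data. ASSUMPTION: invented relationships, not estimates
# from the German credit file.
set.seed(465)
n <- 1000
borrowers <- data.frame(
  credit_amount = rlnorm(n, meanlog = 7.8, sdlog = 0.8),
  duration = sample(c(6, 12, 18, 24, 36, 48), n, replace = TRUE)
)
linear_pred <- -2 + 0.04 * borrowers$duration +
  0.3 * (log(borrowers$credit_amount) - 7.8)
borrowers$risk_binary <- rbinom(n, size = 1, prob = plogis(linear_pred))

# Logistic regression: models the log-odds of bad risk (risk_binary = 1).
fit <- glm(risk_binary ~ log(credit_amount) + duration,
           data = borrowers, family = binomial())

# Predicted probabilities of bad risk and a simple 0.5-threshold classifier.
p_bad <- predict(fit, type = "response")
pred_class <- as.integer(p_bad >= 0.5)

# In-sample accuracy versus always predicting the majority class.
accuracy <- mean(pred_class == borrowers$risk_binary)
baseline <- max(mean(borrowers$risk_binary), 1 - mean(borrowers$risk_binary))

print(round(coef(fit), 3))
cat("Accuracy:", round(accuracy, 3), " Majority baseline:", round(baseline, 3), "\n")
```

In-sample accuracy is optimistic, so a later stage would evaluate on held-out data. With roughly 30% bad-risk cases, accuracy should also be read next to the majority-class baseline, or replaced by metrics such as recall for the bad class.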
Conclusion
This Stage 1 report introduced two datasets suitable for predictive modeling. The football salary dataset is appropriate for regression because the target variable, wage, is continuous. The German credit dataset is appropriate for classification because the target variable, risk, is binary.
Both datasets satisfy the project requirements of having at least 500 observations and at least 5 variables. The distribution analysis also shows that wage and credit amount are right-skewed and benefit from logarithmic transformation. In later stages, predictive models can be developed and compared for both datasets.