ECON 465 - Data Science Project: Stage 1

Author: Kutay Polat

Published: May 8, 2026

library(tidyverse)
library(janitor)
library(knitr)
library(scales)

Project Overview

This Stage 1 report presents two real-world economic datasets. The first dataset is used for a regression problem, where the goal is to predict a continuous outcome. The second dataset is used for a classification problem, where the goal is to predict a binary outcome.

The two datasets are:

  1. Regression dataset: Football player salary prediction
  2. Classification dataset: German credit risk prediction

For each dataset, this report includes a dataset description, economic question, data cleaning code, summary statistics, distribution analysis, log transformation where appropriate, and a proposed theoretical distribution.

Dataset 1: Regression — Football Player Salary Prediction

Dataset Description and Source

The first dataset is SalaryPrediction.csv. It contains information about professional football players, including their wage, age, club, league, nationality, position, appearances, and national team caps.

The dataset is relevant to economics because professional football is a labor market where wages are determined by player characteristics, experience, reputation, and performance-related factors.

Economic Question

Which player characteristics best predict professional football players’ wages?

Target Variable

The target variable is Wage, which is a continuous numerical variable. Therefore, this dataset is appropriate for a regression analysis.

Importing and Cleaning the Data

salary_raw <- read_csv("SalaryPrediction.csv")

salary_clean <- salary_raw %>%
  clean_names() %>%
  mutate(
    wage = parse_number(as.character(wage)),
    club = as.factor(club),
    league = as.factor(league),
    nation = as.factor(nation),
    position = as.factor(position)
  ) %>%
  drop_na(wage, age, apps, caps) %>%
  distinct()

glimpse(salary_clean)
Rows: 3,842
Columns: 8
$ wage     <dbl> 46427000, 42125000, 34821000, 19959000, 19500000, 18810000, 1…
$ age      <dbl> 23, 30, 35, 31, 31, 30, 29, 30, 27, 29, 31, 22, 32, 29, 31, 3…
$ club     <fct> PSG, PSG, PSG, R. Madrid, Man UFC, R. Madrid, Inter, Liverpoo…
$ league   <fct> Ligue 1 Uber Eats, Ligue 1 Uber Eats, Ligue 1 Uber Eats, La L…
$ nation   <fct> FRA, BRA, ARG, BEL, ESP, AUT, BEL, EGY, ENG, FRA, BEL, NOR, G…
$ position <fct> Forward, Midfilder, Forward, Forward, Goalkeeper, Defender, F…
$ apps     <dbl> 190, 324, 585, 443, 480, 371, 427, 367, 326, 287, 399, 159, 4…
$ caps     <dbl> 57, 119, 162, 120, 45, 94, 102, 85, 77, 86, 91, 21, 105, 50, …
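
The cleaning step relies on parse_number() to turn the wage column into a numeric value. The raw file's formatting is not shown in this report, so the sketch below uses made-up wage strings (an assumption) to illustrate how the function behaves:

```r
library(readr)

# Hypothetical raw wage strings; the real file's formatting may differ.
raw_wage <- c("46,427,000", "$1,400", "416000")

# parse_number() drops non-numeric characters around the first number it finds
# and treats "," as a grouping mark under the default locale.
wage_num <- parse_number(raw_wage)
wage_num
```

Wrapping the column in as.character() first, as the cleaning code does, keeps the call valid even when read_csv() has already parsed the column as numeric.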

Dataset Requirements Check

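salary_requirements <- tibble(
  requirement = c("Number of observations", "Number of variables", "Target variable type"),
  # c() would silently coerce the counts to character; make the conversion explicit
  value = c(as.character(nrow(salary_clean)), as.character(ncol(salary_clean)), "Continuous numeric")
)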

kable(salary_requirements)
requirement             value
Number of observations  3842
Number of variables     8
Target variable type    Continuous numeric

This dataset satisfies the project requirement of at least 500 observations and at least 5 variables.
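
The same check can be made programmatic so that it fails loudly if a later cleaning step drops too many rows. This is a minimal sketch; meets_requirements() is a hypothetical helper, and the stand-in data frame only mimics the dimensions reported above:

```r
# Hypothetical helper: TRUE when a data frame meets the stated size thresholds.
meets_requirements <- function(df, min_rows = 500, min_cols = 5) {
  nrow(df) >= min_rows && ncol(df) >= min_cols
}

# Stand-in data frame with the same dimensions as salary_clean (3,842 x 8).
demo <- as.data.frame(matrix(0, nrow = 3842, ncol = 8))

meets_requirements(demo)
```

In the report itself, the call would be wrapped as stopifnot(meets_requirements(salary_clean)).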

Missing Values

salary_clean %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "missing_count") %>%
  kable()
variable missing_count
wage 0
age 0
club 0
league 0
nation 0
position 0
apps 0
caps 0

Summary Statistics

salary_clean %>%
  summarise(
    mean_wage = mean(wage),
    median_wage = median(wage),
    sd_wage = sd(wage),
    min_wage = min(wage),
    q1_wage = quantile(wage, 0.25),
    q3_wage = quantile(wage, 0.75),
    max_wage = max(wage),
    mean_age = mean(age),
    mean_apps = mean(apps),
    mean_caps = mean(caps)
  ) %>%
  kable(digits = 2)
mean_wage  median_wage  sd_wage  min_wage  q1_wage  q3_wage  max_wage  mean_age  mean_apps  mean_caps
1390327    416000       2605882  1400      78000    1569500  46427000  24.22     142.43     9.08

Histogram of Original Wage

ggplot(salary_clean, aes(x = wage)) +
  geom_histogram(bins = 30, color = "white") +
  scale_x_continuous(labels = comma) +
  labs(
    title = "Distribution of Football Player Wages",
    x = "Wage",
    y = "Frequency"
  ) +
  theme_minimal()

Interpretation of Original Wage Distribution

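The wage distribution is strongly right-skewed. The mean wage (1,390,327) is more than three times the median (416,000), and the maximum (46,427,000) is roughly 30 times the third quartile (1,569,500). Most professional football players earn relatively low wages, while a small number of elite players earn extremely high wages. This pattern is common in professional sports labor markets, where superstar players receive very large wage premiums.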
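
Right skew can also be quantified rather than read off the histogram. The sketch below defines a moment-based skewness function, sample_skewness() (a hypothetical helper, not part of the report's code), and applies it to simulated log-normal draws that stand in for the wage data:

```r
# Moment-based sample skewness: third central moment over the cubed standard deviation.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(465)
# Simulated stand-in for wages (NOT the real data): log-normal draws.
sim_wage <- rlnorm(5000, meanlog = 13, sdlog = 1.5)

sample_skewness(sim_wage)       # large and positive for right-skewed data
sample_skewness(log(sim_wage))  # near zero once logged
```

Applied to the real data, the call would be sample_skewness(salary_clean$wage).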

Log Transformation

Because the wage distribution is highly skewed, a logarithmic transformation is applied.

salary_clean <- salary_clean %>%
  mutate(log_wage = log(wage))

Histogram of Log-Transformed Wage

ggplot(salary_clean, aes(x = log_wage)) +
  geom_histogram(bins = 30, color = "white") +
  labs(
    title = "Distribution of Log-Transformed Football Player Wages",
    x = "Log(Wage)",
    y = "Frequency"
  ) +
  theme_minimal()

Interpretation After Log Transformation

The log transformation reduces the influence of extreme wage values and makes the distribution more symmetric. Compared with the original wage distribution, the log-transformed wage distribution is closer to a normal distribution.
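
A normal Q-Q plot is one way to check how close the log-transformed variable is to a normal distribution. The sketch below uses simulated values in place of log_wage (an assumption, so that the example runs on its own). Points lying near the reference line suggest approximate normality:

```r
library(ggplot2)

set.seed(465)
# Simulated stand-in for salary_clean$log_wage (NOT the real data).
demo <- data.frame(log_wage = rnorm(1000, mean = 13, sd = 1.5))

qq_plot <- ggplot(demo, aes(sample = log_wage)) +
  stat_qq() +        # sample quantiles against theoretical normal quantiles
  stat_qq_line() +   # reference line through the first and third quartiles
  labs(
    title = "Normal Q-Q Plot of Log(Wage)",
    x = "Theoretical Normal Quantiles",
    y = "Sample Quantiles"
  ) +
  theme_minimal()

qq_plot
```

With the real data, demo would be replaced by salary_clean.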

Proposed Theoretical Distribution

The original wage variable is plausibly approximated by a log-normal distribution: it is strictly positive, highly right-skewed, and has a long upper tail driven by a small number of very large values. By definition, a variable is log-normal when its logarithm is normally distributed, which is consistent with the log-transformed wage being closer to a normal distribution.
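
As a rough consistency check, the reported mean and median can be converted into log-normal parameters. For a log-normal variable with log-scale mean mu and log-scale standard deviation sigma, the median equals exp(mu) and the mean equals exp(mu + sigma^2 / 2). The sketch below matches these two quantities only; it is a back-of-the-envelope calculation, not a formal fit, and it assumes the log-normal model holds:

```r
# Summary statistics reported in the Summary Statistics table above.
mean_wage   <- 1390327
median_wage <- 416000

# Under a log-normal model: median = exp(mu), mean = exp(mu + sigma^2 / 2).
mu_hat    <- log(median_wage)
sigma_hat <- sqrt(2 * log(mean_wage / median_wage))

c(mu_hat = mu_hat, sigma_hat = sigma_hat)
```

A more direct estimate would use mean(log_wage) and sd(log_wage) from the data; a large gap between the two approaches would be a sign that the log-normal approximation is imperfect.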

Potential Predictive Models for Later Stages

Two regression models that can be used in later stages are:

  1. Linear Regression
  2. Random Forest Regression
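
As a preview of the first option, the sketch below fits a linear regression of log wage on three numeric predictors. The data are simulated and the coefficients are arbitrary (both are assumptions made only so the example runs on its own). A random forest is not shown because it requires an additional package.

```r
set.seed(465)
n <- 500

# Simulated stand-in data (NOT the real dataset); coefficients are arbitrary.
demo <- data.frame(
  age  = round(runif(n, 18, 38)),
  apps = rpois(n, 140),
  caps = rpois(n, 9)
)
demo$log_wage <- 10 + 0.05 * demo$age + 0.01 * demo$apps +
  0.03 * demo$caps + rnorm(n, sd = 1)

# Ordinary least squares on the log scale.
fit <- lm(log_wage ~ age + apps + caps, data = demo)
summary(fit)$coefficients
```

Because the outcome is on the log scale, each coefficient is read as an approximate proportional change in wage per unit change in the predictor.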

Dataset 2: Classification — German Credit Risk

Dataset Description and Source

The second dataset is german_credit_data (1).csv. It contains borrower-level credit information, including age, sex, job type, housing status, saving accounts, checking account status, credit amount, loan duration, loan purpose, and credit risk.

The dataset is relevant to economics because credit risk prediction is an important issue in financial markets. Banks and lenders need to evaluate borrower characteristics to estimate the probability of default or risky repayment behavior.

Economic Question

Can borrower characteristics predict credit default risk?

Target Variable

The target variable is Risk, which has two categories: good and bad. Therefore, this dataset is appropriate for binary classification.

Importing and Cleaning the Data

credit_raw <- read_csv("german_credit_data (1).csv")

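credit_clean <- credit_raw %>%
  clean_names() %>%
  # Intended to drop the unnamed row-index column. read_csv() names that column
  # `...1`, which clean_names() converts to `x1`, so "unnamed_0" matches nothing
  # and `x1` is still present in the output below.
  select(-any_of("unnamed_0")) %>%
  mutate(
    sex = as.factor(sex),
    job = as.factor(job),
    housing = as.factor(housing),
    # Missing account information is kept as its own "unknown" category
    saving_accounts = replace_na(saving_accounts, "unknown") %>% as.factor(),
    checking_account = replace_na(checking_account, "unknown") %>% as.factor(),
    purpose = as.factor(purpose),
    risk = as.factor(risk),
    # 1 = bad risk, 0 = good risk
    risk_binary = if_else(risk == "bad", 1, 0)
  ) %>%
  # Because `x1` is unique per row, distinct() cannot remove any rows here.
  distinct()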

glimpse(credit_clean)
Rows: 1,000
Columns: 12
$ x1               <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
$ age              <dbl> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28, 25, 24, 22, 6…
$ sex              <fct> male, female, male, male, male, male, male, male, mal…
$ job              <fct> 2, 2, 1, 2, 2, 1, 2, 3, 1, 3, 2, 2, 2, 1, 2, 1, 2, 2,…
$ housing          <fct> own, own, own, free, free, free, own, rent, own, own,…
$ saving_accounts  <fct> unknown, little, little, little, little, unknown, qui…
$ checking_account <fct> little, moderate, unknown, little, little, unknown, u…
$ credit_amount    <dbl> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 6948, 3059,…
$ duration         <dbl> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 12, 48, 12, 24…
$ purpose          <fct> radio/TV, radio/TV, education, furniture/equipment, c…
$ risk             <fct> good, bad, good, good, bad, good, good, good, good, b…
$ risk_binary      <dbl> 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0,…
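
The cleaning step recodes missing account information as an explicit "unknown" level instead of dropping those borrowers. A minimal sketch of that recoding, using made-up values (an assumption):

```r
library(tidyr)

# Hypothetical values; NA marks a borrower with no recorded savings account.
saving_accounts <- c("little", NA, "rich", NA, "moderate")

# Replace NA with the label "unknown", then convert to a factor.
saving_filled <- as.factor(replace_na(saving_accounts, "unknown"))

levels(saving_filled)
table(saving_filled)
```

This choice treats missingness as potentially informative. Whether that is appropriate depends on why the values are missing, which the dataset does not document.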

Dataset Requirements Check

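credit_requirements <- tibble(
  requirement = c("Number of observations", "Number of variables", "Target variable type"),
  # c() would silently coerce the counts to character; make the conversion explicit
  value = c(as.character(nrow(credit_clean)), as.character(ncol(credit_clean)), "Binary categorical")
)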

kable(credit_requirements)
requirement             value
Number of observations  1000
Number of variables     12
Target variable type    Binary categorical

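This dataset satisfies the project requirement of at least 500 observations and at least 5 variables. The count of 12 columns includes the row-index column x1 and the derived indicator risk_binary; excluding these leaves 10 original variables, which still meets the requirement.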

Missing Values

credit_clean %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "missing_count") %>%
  kable()
variable missing_count
x1 0
age 0
sex 0
job 0
housing 0
saving_accounts 0
checking_account 0
credit_amount 0
duration 0
purpose 0
risk 0
risk_binary 0

Summary Statistics

credit_clean %>%
  summarise(
    mean_age = mean(age),
    median_age = median(age),
    sd_age = sd(age),
    mean_credit_amount = mean(credit_amount),
    median_credit_amount = median(credit_amount),
    sd_credit_amount = sd(credit_amount),
    mean_duration = mean(duration),
    median_duration = median(duration),
    default_rate = mean(risk_binary)
  ) %>%
  kable(digits = 2)
mean_age  median_age  sd_age  mean_credit_amount  median_credit_amount  sd_credit_amount  mean_duration  median_duration  default_rate
35.55     33          11.38   3271.26             2319.5                2822.74           20.9           18               0.3

Distribution of the Binary Target Variable

ggplot(credit_clean, aes(x = risk)) +
  geom_bar() +
  labs(
    title = "Distribution of Credit Risk Categories",
    x = "Risk Category",
    y = "Count"
  ) +
  theme_minimal()

Interpretation of Risk Distribution

The target variable has two outcomes, good credit risk and bad credit risk, so it is suited to classification models. About 30% of borrowers are classified as bad risk (default_rate = 0.3 in the summary table), which means the classes are moderately imbalanced. Each borrower's outcome can be treated as a Bernoulli trial whose parameter is the probability of being a bad risk.
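
The precision of the estimated bad-risk share can be gauged with the usual Bernoulli standard error, sqrt(p(1 - p) / n). The sketch below plugs in the figures from the summary table; the value 0.30 is rounded to two decimals there, so the result is approximate:

```r
# Figures reported above: 1,000 borrowers, share of bad risk about 0.30 (rounded).
n     <- 1000
p_hat <- 0.30

# Standard error of a Bernoulli proportion.
se_p <- sqrt(p_hat * (1 - p_hat) / n)

# Normal-approximation 95% confidence interval.
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se_p

se_p
ci
```

The interval relies on the normal approximation, which is generally reasonable at this sample size and proportion.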

Probability Distribution Analysis of Credit Amount

The binary target variable cannot be meaningfully log-transformed because it takes only two values. The histogram and log-transformation steps of the distribution analysis are therefore applied to credit amount, the main continuous economic variable in this dataset.

Histogram of Original Credit Amount

ggplot(credit_clean, aes(x = credit_amount)) +
  geom_histogram(bins = 30, color = "white") +
  scale_x_continuous(labels = comma) +
  labs(
    title = "Distribution of Credit Amount",
    x = "Credit Amount",
    y = "Frequency"
  ) +
  theme_minimal()

Log Transformation of Credit Amount

credit_clean <- credit_clean %>%
  mutate(log_credit_amount = log(credit_amount))

Histogram of Log-Transformed Credit Amount

ggplot(credit_clean, aes(x = log_credit_amount)) +
  geom_histogram(bins = 30, color = "white") +
  labs(
    title = "Distribution of Log-Transformed Credit Amount",
    x = "Log(Credit Amount)",
    y = "Frequency"
  ) +
  theme_minimal()

Interpretation After Log Transformation

The original credit amount distribution is right-skewed: the mean (3,271.26) is well above the median (2,319.5), meaning that most borrowers request relatively small loans while a smaller number request very large loans. The log transformation reduces skewness and makes the credit amount distribution more symmetric.

Proposed Theoretical Distribution

The binary target variable, Risk, can be modeled as a Bernoulli distribution, since each observation has one of two possible outcomes: good or bad risk.

The continuous variable credit amount is plausibly approximated by a log-normal distribution, because it is strictly positive and right-skewed. After the logarithmic transformation, it becomes closer to a normal distribution.
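
One way to assess the log-normal proposal visually is to overlay a fitted log-normal density on the histogram. The sketch below uses simulated values in place of credit_amount (an assumption, with arbitrary parameters) and estimates the log-scale mean and standard deviation from the logged values:

```r
library(ggplot2)

set.seed(465)
# Simulated stand-in for credit_amount (NOT the real data).
demo <- data.frame(credit_amount = rlnorm(1000, meanlog = 7.8, sdlog = 0.8))

# Log-scale estimates: mean and standard deviation of log(credit_amount).
mu_hat    <- mean(log(demo$credit_amount))
sigma_hat <- sd(log(demo$credit_amount))

fit_plot <- ggplot(demo, aes(x = credit_amount)) +
  # Histogram on the density scale so it is comparable to the fitted curve.
  geom_histogram(aes(y = after_stat(density)), bins = 30, color = "white") +
  stat_function(fun = dlnorm, args = list(meanlog = mu_hat, sdlog = sigma_hat)) +
  labs(
    title = "Credit Amount with Fitted Log-Normal Density",
    x = "Credit Amount",
    y = "Density"
  ) +
  theme_minimal()

fit_plot
```

With the real data, demo would be replaced by credit_clean. Systematic gaps between the bars and the curve would indicate where the log-normal approximation falls short.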

Potential Predictive Models for Later Stages

Two classification models that can be used in later stages are:

  1. Logistic Regression
  2. Random Forest Classification
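
As a preview of the first option, the sketch below fits a logistic regression of the bad-risk indicator on two numeric predictors. The data are simulated and the coefficients are arbitrary (both are assumptions made only so the example runs on its own). A random forest is not shown because it requires an additional package.

```r
set.seed(465)
n <- 1000

# Simulated stand-in data (NOT the real dataset); coefficients are arbitrary.
demo <- data.frame(
  duration          = sample(6:60, n, replace = TRUE),
  log_credit_amount = rnorm(n, mean = 7.8, sd = 0.8)
)
lin_pred <- -3 + 0.04 * demo$duration + 0.1 * demo$log_credit_amount
demo$risk_binary <- rbinom(n, size = 1, prob = plogis(lin_pred))

# Logistic regression: models the log-odds of bad risk as a linear function.
fit <- glm(risk_binary ~ duration + log_credit_amount,
           data = demo, family = binomial)

# Predicted probability of bad risk for each simulated borrower.
pred_prob <- predict(fit, type = "response")
head(pred_prob)
```

The fitted probabilities can be turned into class predictions by applying a threshold, which would need to be chosen with the roughly 30% bad-risk share in mind.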

Conclusion

This Stage 1 report introduced two datasets suitable for predictive modeling. The football salary dataset is appropriate for regression because the target variable, wage, is continuous. The German credit dataset is appropriate for classification because the target variable, risk, is binary.

Both datasets satisfy the project requirements of having at least 500 observations and at least 5 variables. The distribution analysis also shows that wage and credit amount are right-skewed and benefit from logarithmic transformation. In later stages, predictive models can be developed and compared for both datasets.