ECON 465 - Stage 1: Data Proposal & Probability Analysis

Author

CEREN MURATSU

Published

May 10, 2025

library(tidyverse)
library(knitr)

set.seed(465)
theme_set(theme_minimal())

1 Project Overview

This report presents the Stage 1 data proposal and probability analysis for two real-world economic datasets. The first dataset is used for a regression problem because the target variable is continuous. The second dataset is used for a classification problem because the target variable is binary. Both datasets satisfy the project requirement of having at least 500 observations and at least 5 variables.

2 Dataset 1: Regression Dataset

2.1 Dataset Description and Source

The first dataset is the Medical Cost Personal Dataset, obtained from Kaggle.

Source URL: https://www.kaggle.com/datasets/mirichoi0218/insurance

This dataset includes individual-level information about medical insurance customers. The variables include age, sex, BMI, number of children, smoking status, region, and medical insurance charges. The target variable is charges, which measures the medical insurance cost for each individual.

This dataset is economically relevant because it is related to health economics and insurance markets. It allows us to examine how demographic and health-related factors are associated with individual insurance costs.

2.2 Economic Question

How are age, BMI, smoking status, sex, number of children, and region associated with individual medical insurance charges?

2.3 Data Import and Cleaning

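# Read the raw CSV from the project directory
insurance_raw <- read_csv("insurance.csv")

# Convert the categorical columns to factors for later modeling
insurance_clean <- insurance_raw |>
  mutate(
    sex = as.factor(sex),
    smoker = as.factor(smoker),
    region = as.factor(region)
  )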

insurance_dimensions <- tibble(
  dataset = "Medical Cost Personal Dataset",
  observations = nrow(insurance_clean),
  variables = ncol(insurance_clean)
)

kable(insurance_dimensions)

| dataset | observations | variables |
|:---|---:|---:|
| Medical Cost Personal Dataset | 1338 | 7 |

insurance_missing <- insurance_raw |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_values")

kable(insurance_missing)

| variable | missing_values |
|:---|---:|
| age | 0 |
| sex | 0 |
| bmi | 0 |
| children | 0 |
| smoker | 0 |
| region | 0 |
| charges | 0 |

The medical cost dataset has no missing values. Categorical variables such as sex, smoker, and region were converted into factor variables so that they can be used properly in later modeling.
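To illustrate why the factor conversion matters for later modeling, the sketch below shows how R expands factor columns into 0/1 dummy columns when it builds a design matrix. The data frame `toy` and its values are illustrative assumptions, not rows from insurance.csv.

```r
# Toy data for illustration only, NOT rows from insurance.csv
toy <- data.frame(
  smoker = factor(c("no", "yes", "no")),
  region = factor(c("northeast", "southwest", "southeast"))
)

# Factor levels are sorted alphabetically; the first level is the reference
levels(toy$smoker)

# model.matrix() creates one 0/1 column per non-reference level
X <- model.matrix(~ smoker + region, data = toy)
colnames(X)
```

A character column would be converted implicitly by most modeling functions, but an explicit factor makes the reference category visible and easy to change.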

2.4 Summary Statistics for Target Variable

insurance_summary <- insurance_clean |>
  summarise(
    observations = n(),
    mean = mean(charges),
    median = median(charges),
    standard_deviation = sd(charges),
    minimum = min(charges),
    q1 = quantile(charges, 0.25),
    q3 = quantile(charges, 0.75),
    maximum = max(charges)
  )

kable(insurance_summary, digits = 2)

| observations | mean | median | standard_deviation | minimum | q1 | q3 | maximum |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 1338 | 13270.42 | 9382.03 | 12110.01 | 1121.87 | 4740.29 | 16639.91 | 63770.43 |

2.5 Histogram of Original Target Variable

ggplot(insurance_clean, aes(x = charges)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Medical Insurance Charges",
    x = "Medical Insurance Charges",
    y = "Frequency"
  )

The original distribution of charges is right-skewed: the mean (13,270.42) lies well above the median (9,382.03). Most individuals have relatively low or moderate medical insurance charges, while a small number have very high charges (up to 63,770.43). This creates a long right tail.
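The visual impression of skewness can be backed by a number. The sketch below defines a hypothetical helper, `skew_summary()`, that is not part of the original analysis; it compares the mean with the median and computes the moment-based sample skewness. The vector `toy` is illustrative only.

```r
# Hypothetical helper (not part of the original report): summarises how
# asymmetric a numeric vector is.
skew_summary <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))  # population-style SD used in the moment formula
  c(
    mean = m,
    median = median(x),
    skewness = mean((x - m)^3) / s^3  # > 0 indicates a longer right tail
  )
}

# Illustrative toy values only, NOT taken from insurance.csv
toy <- c(1, 2, 2, 3, 3, 4, 20)
skew_summary(toy)
```

On the real data the call would be `skew_summary(insurance_clean$charges)`; a clearly positive skewness together with a mean above the median would support the right-skew reading.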

2.6 Histogram of Log-Transformed Target Variable

insurance_clean <- insurance_clean |>
  mutate(log_charges = log(charges))

ggplot(insurance_clean, aes(x = log_charges)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Log-Transformed Distribution of Medical Insurance Charges",
    x = "Log of Medical Insurance Charges",
    y = "Frequency"
  )

After the log transformation, the distribution is noticeably more symmetric than the original. This suggests that charges is better approximated by a log-normal distribution than by a normal distribution.
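If a log-normal model is adopted, its two parameters can be estimated from the mean and standard deviation of the logged values. The sketch below uses a hypothetical helper, `fit_lognormal()`, and simulated data with arbitrary parameters; neither comes from the original report.

```r
# Hypothetical helper: estimates the log-normal parameters from logged values.
fit_lognormal <- function(x) {
  stopifnot(all(x > 0))  # the log is only defined for positive values
  lx <- log(x)
  list(meanlog = mean(lx), sdlog = sd(lx))
}

# Simulated data for illustration only (meanlog = 9 and sdlog = 1 are arbitrary)
set.seed(465)
sim <- rlnorm(5000, meanlog = 9, sdlog = 1)
fit <- fit_lognormal(sim)

# Median implied by the fitted log-normal, on the original scale
implied_median <- exp(fit$meanlog)
```

On the real data the call would be `fit_lognormal(insurance_clean$charges)`, and the implied median could be compared with the sample median as a rough plausibility check.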

2.7 Proposed Theoretical Distribution

The target variable charges is a positive continuous variable and is right-skewed. Since the log-transformed version of charges is more symmetric, a log-normal distribution is a reasonable theoretical distribution for this target variable.
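For reference, the log-normal model can be written out as follows. Here Y stands for charges, and the symbols \mu and \sigma (the mean and standard deviation of \ln Y) are notation introduced for this sketch rather than taken from the report.

```latex
\ln(Y) \sim N(\mu, \sigma^2)
\quad\Longrightarrow\quad
f(y) = \frac{1}{y\,\sigma\sqrt{2\pi}}
       \exp\!\left(-\frac{(\ln y - \mu)^2}{2\sigma^2}\right), \qquad y > 0

\operatorname{Median}(Y) = e^{\mu}, \qquad
E[Y] = e^{\mu + \sigma^2/2}
```

Because E[Y] exceeds the median whenever \sigma > 0, the model implies a right-skewed variable, which is consistent with the sample mean of charges lying above the sample median.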

3 Dataset 2: Classification Dataset

3.1 Dataset Description and Source

The second dataset is the Loan Predication Dataset, obtained from Kaggle.

Source URL: https://www.kaggle.com/datasets/ninzaami/loan-predication

This dataset includes information about loan applicants, such as gender, marital status, number of dependents, education, self-employment status, income, loan amount, loan term, credit history, property area, and loan approval status. The target variable is Loan_Status, which indicates whether a loan application was approved or not.

This dataset is economically relevant because it is related to financial economics, credit markets, and loan approval decisions. It also includes several categorical control variables, such as Gender, Married, Dependents, Education, Self_Employed, and Property_Area.

3.2 Economic Question

Can loan approval status be predicted based on applicants’ financial and categorical characteristics?

3.3 Data Import and Cleaning

loan_raw <- read_csv("loan.csv")

loan_missing_before <- loan_raw |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_values")

kable(loan_missing_before)

| variable | missing_values |
|:---|---:|
| Loan_ID | 0 |
| Gender | 13 |
| Married | 3 |
| Dependents | 15 |
| Education | 0 |
| Self_Employed | 32 |
| ApplicantIncome | 0 |
| CoapplicantIncome | 0 |
| LoanAmount | 22 |
| Loan_Amount_Term | 14 |
| Credit_History | 50 |
| Property_Area | 0 |
| Loan_Status | 0 |

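Instead of dropping all rows with missing values, missing categorical values are replaced with Unknown, and missing values in the numeric variables LoanAmount and Loan_Amount_Term are replaced with the column median. Credit_History is coded 0/1, so it is treated as categorical and its missing values also become Unknown. This keeps all 614 observations, above the 500 required for the project.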
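The trade-off between dropping and imputing can be seen on a small example. The sketch below uses a toy tibble, `toy_loans`, whose values are illustrative and not rows from loan.csv.

```r
library(dplyr)
library(tidyr)

# Toy data for illustration only, NOT rows from loan.csv
toy_loans <- tibble(
  Gender = c("Male", NA, "Female", "Male"),
  LoanAmount = c(100, 200, NA, 400)
)

# Option 1: drop every row that has at least one missing value
rows_after_drop <- nrow(drop_na(toy_loans))

# Option 2: keep all rows and fill the gaps
toy_clean <- toy_loans |>
  mutate(
    Gender = replace_na(Gender, "Unknown"),
    LoanAmount = replace_na(LoanAmount, median(LoanAmount, na.rm = TRUE))
  )
rows_after_impute <- nrow(toy_clean)
```

Dropping loses every row with any gap, while imputing keeps the sample size at the cost of adding an Unknown category and pulling imputed amounts toward the median.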

loan_clean <- loan_raw |>
  select(-Loan_ID) |>  # row identifier, not a predictor
  mutate(
    # Categorical variables: flag missing values with an explicit "Unknown" level
    Gender = replace_na(Gender, "Unknown"),
    Married = replace_na(Married, "Unknown"),
    Dependents = replace_na(Dependents, "Unknown"),
    Self_Employed = replace_na(Self_Employed, "Unknown"),
    # Numeric variables: impute missing values with the column median
    LoanAmount = replace_na(LoanAmount, median(LoanAmount, na.rm = TRUE)),
    Loan_Amount_Term = replace_na(Loan_Amount_Term, median(Loan_Amount_Term, na.rm = TRUE)),
    # Credit_History is a 0/1 flag, so treat it as categorical with an "Unknown" level
    Credit_History = case_when(
      is.na(Credit_History) ~ "Unknown",
      Credit_History == 1 ~ "1",
      Credit_History == 0 ~ "0"
    ),
    # Convert all categorical variables to factors
    Gender = as.factor(Gender),
    Married = as.factor(Married),
    Dependents = as.factor(Dependents),
    Education = as.factor(Education),
    Self_Employed = as.factor(Self_Employed),
    Credit_History = as.factor(Credit_History),
    Property_Area = as.factor(Property_Area),
    Loan_Status = as.factor(Loan_Status),
    # Numeric 0/1 indicator of approval, used for summary statistics
    loan_status_num = if_else(Loan_Status == "Y", 1, 0)
  )

loan_dimensions <- tibble(
  dataset = "Loan Predication Dataset",
  observations = nrow(loan_clean),
  variables = ncol(loan_clean)
)

kable(loan_dimensions)

| dataset | observations | variables |
|:---|---:|---:|
| Loan Predication Dataset | 614 | 13 |

loan_missing_after <- loan_clean |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_values")

kable(loan_missing_after)

| variable | missing_values |
|:---|---:|
| Gender | 0 |
| Married | 0 |
| Dependents | 0 |
| Education | 0 |
| Self_Employed | 0 |
| ApplicantIncome | 0 |
| CoapplicantIncome | 0 |
| LoanAmount | 0 |
| Loan_Amount_Term | 0 |
| Credit_History | 0 |
| Property_Area | 0 |
| Loan_Status | 0 |
| loan_status_num | 0 |

3.4 Summary Statistics for Target Variable

The target variable Loan_Status is binary. For summary statistics, it is converted into a numeric indicator variable, where 1 means the loan was approved and 0 means the loan was not approved.

loan_status_table <- loan_clean |>
  count(Loan_Status) |>
  mutate(percentage = n / sum(n) * 100)

kable(loan_status_table, digits = 2)

| Loan_Status | n | percentage |
|:---|---:|---:|
| N | 192 | 31.27 |
| Y | 422 | 68.73 |

loan_summary <- loan_clean |>
  summarise(
    observations = n(),
    mean = mean(loan_status_num),
    median = median(loan_status_num),
    standard_deviation = sd(loan_status_num),
    minimum = min(loan_status_num),
    q1 = quantile(loan_status_num, 0.25),
    q3 = quantile(loan_status_num, 0.75),
    maximum = max(loan_status_num)
  )

kable(loan_summary, digits = 2)

| observations | mean | median | standard_deviation | minimum | q1 | q3 | maximum |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 614 | 0.69 | 1 | 0.46 | 0 | 0 | 1 | 1 |

The mean of loan_status_num represents the proportion of approved loans in the dataset.
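This relationship can be checked directly from the reported counts (422 approved, 192 not approved). The sketch below rebuilds the 0/1 indicator from those counts; the object names are illustrative and not part of the original code.

```r
# Rebuild the 0/1 indicator from the counts reported in the frequency table
approved <- 422
rejected <- 192
status <- c(rep(1, approved), rep(0, rejected))

p_hat <- mean(status)                            # mean of the indicator
p_hat_check <- approved / (approved + rejected)  # share of approved loans

# For a 0/1 variable the sample SD follows directly from the proportion
n <- length(status)
sd_formula <- sqrt(p_hat * (1 - p_hat) * n / (n - 1))

round(c(mean = p_hat, sd = sd(status), sd_formula = sd_formula), 2)
```

Rounded to two decimals, the mean is 0.69 and the standard deviation is 0.46, matching the summary table. For a binary variable the other summary statistics carry little extra information beyond this proportion.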

3.5 Histogram of Original Target Variable

ggplot(loan_clean, aes(x = loan_status_num)) +
  geom_histogram(binwidth = 1, boundary = -0.5) +
  scale_x_continuous(
    breaks = c(0, 1),
    labels = c("0 = Not Approved", "1 = Approved")
  ) +
  labs(
    title = "Distribution of Loan Approval Status",
    x = "Loan Status",
    y = "Frequency"
  )

The distribution of Loan_Status is binary. Therefore, the histogram has only two possible values: not approved and approved.

3.6 Log-Transformed Target Variable

A standard log transformation is not appropriate for a binary target variable because the value 0 is present and log(0) is undefined. For visualization only, the following graph uses log1p(loan_status_num), which computes log(1 + x). This transformation does not make the variable continuous or normally distributed. It only relabels the two values: 0 stays at 0 and 1 becomes log(2) ≈ 0.69, so the variable remains binary.
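A quick check of the two transformations on the only values the indicator can take is sketched below; the vector `x` is illustrative.

```r
x <- c(0, 1)  # the only two values a binary indicator can take

plain_log <- log(x)    # R returns -Inf for log(0), which cannot be plotted
shifted   <- log1p(x)  # log(1 + x): 0 stays 0, 1 becomes log(2)

# The transformed variable still has exactly two distinct values
n_distinct_values <- length(unique(shifted))
```

Since a monotone transformation of a two-valued variable is still two-valued, the histogram can only change the position of the second bar, not the shape of the distribution.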

loan_clean <- loan_clean |>
  mutate(log1p_loan_status = log1p(loan_status_num))

ggplot(loan_clean, aes(x = log1p_loan_status)) +
  geom_histogram(binwidth = 0.35, boundary = -0.05) +
  scale_x_continuous(
    breaks = c(0, log(2)),
    labels = c("log(1+0) = 0", "log(1+1) = 0.69")
  ) +
  labs(
    title = "Log1p-Transformed Loan Approval Status",
    x = "Log1p of Loan Status",
    y = "Frequency"
  )

3.7 Proposed Theoretical Distribution

The target variable Loan_Status is binary. At the individual level, it can be modeled using a Bernoulli distribution, where each observation takes the value 1 if the loan is approved and 0 if the loan is not approved. For the number of approved loans in a sample, the related theoretical distribution would be the binomial distribution.
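The two distributions can be written out as follows. The symbols Y_i (approval indicator for applicant i), p (approval probability), and n (number of applicants) are notation introduced for this sketch, and the binomial statement additionally assumes independent applications with a common p.

```latex
P(Y_i = y) = p^{y}(1 - p)^{1 - y}, \quad y \in \{0, 1\},
\qquad E[Y_i] = p, \qquad \operatorname{Var}(Y_i) = p(1 - p)

S = \sum_{i=1}^{n} Y_i \sim \operatorname{Binomial}(n, p),
\qquad P(S = k) = \binom{n}{k} p^{k} (1 - p)^{n - k}
```

With the sample counts reported above, the natural estimate is \hat{p} = 422/614 \approx 0.69.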

4 Conclusion

The Medical Cost Personal Dataset is suitable for the regression part of the project because the target variable charges is continuous and numeric. The Loan Predication Dataset is suitable for the classification part because the target variable Loan_Status is binary. Both datasets are economically relevant, satisfy the minimum observation and variable requirements, and can be used for later predictive modeling.

5 AI Use Log

| AI tool | Prompt / Task | How the output was used | Verification |
|:---|:---|:---|:---|
| ChatGPT | Help select two datasets and prepare the Stage 1 report structure. | The output was used to organize dataset descriptions, research questions, cleaning steps, and probability distribution analysis. | Dataset dimensions, variables, missing values, and target variables were checked using the uploaded CSV files. |
| ChatGPT | Help write R/Quarto code for importing, cleaning, summary statistics, and histograms. | The code was adapted into this Quarto document. | The code should be run in RStudio using the uploaded CSV files and checked for errors before submission. |