Econ465Project

Author

Azra Ozcirpan and Ozge Yilmaz

Econ 465 Project

Stage 1: Data Acquisition & Probability Foundations

Dataset Description & Source

The first dataset is the Family Income and Expenditure dataset from Kaggle. It contains information about household income, family size, education level, regional location, and different types of household expenditures in the Philippines. The target variable in this analysis is Total Household Income (renamed income during cleaning), which is a continuous numeric variable.

This dataset is relevant to economics because household income is closely related to living standards, education, and regional economic differences. Understanding the factors associated with household income can help economists and policymakers analyze inequality and regional development.

Source: Kaggle – Family Income and Expenditure Dataset

The dataset was downloaded as a CSV file and imported into R using read_csv().

The dataset contains 41,544 observations and 60 variables, including the target variable, so it comfortably exceeds 500 observations and 5 variables.

Economic Question

Do household size, education level, and regional location predict a family’s annual income?

Data Import & Cleaning

library(tidyverse)

Warning: package 'ggplot2' was built under R version 4.5.2

Warning: package 'purrr' was built under R version 4.5.2

Warning: package 'dplyr' was built under R version 4.5.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

income_raw <- read_csv("data/Family_Income_and_Expenditure.csv")

Rows: 41544 Columns: 60
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (15): Region, Main Source of Income, Household Head Sex, Household Head ...
dbl (45): Total Household Income, Total Food Expenditure, Agricultural House...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(income_raw)

Rows: 41,544
Columns: 60
$ `Total Household Income`                        <dbl> 480332, 198235, 82785,…
$ Region                                          <chr> "CAR", "CAR", "CAR", "…
$ `Total Food Expenditure`                        <dbl> 117848, 67766, 61609, …
$ `Main Source of Income`                         <chr> "Wage/Salaries", "Wage…
$ `Agricultural Household indicator`              <dbl> 0, 0, 1, 0, 0, 0, 0, 1…
$ `Bread and Cereals Expenditure`                 <dbl> 42140, 17329, 34182, 3…
$ `Total Rice Expenditure`                        <dbl> 38300, 13008, 32001, 2…
$ `Meat Expenditure`                              <dbl> 24676, 17434, 7783, 10…
$ `Total Fish and  marine products Expenditure`   <dbl> 16806, 11073, 2590, 10…
$ `Fruit Expenditure`                             <dbl> 3325, 2035, 1730, 690,…
$ `Vegetables Expenditure`                        <dbl> 13460, 7833, 3795, 788…
$ `Restaurant and hotels Expenditure`             <dbl> 3000, 2360, 4545, 6280…
$ `Alcoholic Beverages Expenditure`               <dbl> 0, 960, 270, 480, 1040…
$ `Tobacco Expenditure`                           <dbl> 0, 2132, 4525, 0, 0, 2…
$ `Clothing, Footwear and Other Wear Expenditure` <dbl> 4607, 8230, 2735, 1390…
$ `Housing and water Expenditure`                 <dbl> 63636, 41370, 14340, 1…
$ `Imputed House Rental Value`                    <dbl> 30000, 27000, 7200, 66…
$ `Medical Care Expenditure`                      <dbl> 3457, 3520, 70, 60, 14…
$ `Transportation Expenditure`                    <dbl> 4776, 12900, 324, 6840…
$ `Communication Expenditure`                     <dbl> 2880, 5700, 420, 660, …
$ `Education Expenditure`                         <dbl> 36200, 29300, 425, 300…
$ `Miscellaneous Goods and Services Expenditure`  <dbl> 34056, 9150, 6450, 376…
$ `Special Occasions Expenditure`                 <dbl> 7200, 1500, 500, 500, …
$ `Crop Farming and Gardening expenses`           <dbl> 19370, 0, 0, 15580, 18…
$ `Total Income from Entrepreneurial Acitivites`  <dbl> 44370, 0, 0, 15580, 75…
$ `Household Head Sex`                            <chr> "Female", "Male", "Mal…
$ `Household Head Age`                            <dbl> 49, 40, 39, 52, 65, 46…
$ `Household Head Marital Status`                 <chr> "Single", "Married", "…
$ `Household Head Highest Grade Completed`        <chr> "Teacher Training and …
$ `Household Head Job or Business Indicator`      <chr> "With Job/Business", "…
$ `Household Head Occupation`                     <chr> "General elementary ed…
$ `Household Head Class of Worker`                <chr> "Worked for government…
$ `Type of Household`                             <chr> "Extended Family", "Si…
$ `Total Number of Family members`                <dbl> 4, 3, 6, 3, 4, 4, 5, 5…
$ `Members with age less than 5 year old`         <dbl> 0, 0, 0, 0, 0, 0, 1, 1…
$ `Members with age 5 - 17 years old`             <dbl> 1, 1, 4, 3, 0, 0, 0, 1…
$ `Total number of family members employed`       <dbl> 1, 2, 3, 2, 2, 3, 1, 0…
$ `Type of Building/House`                        <chr> "Single house", "Singl…
$ `Type of Roof`                                  <chr> "Strong material(galva…
$ `Type of Walls`                                 <chr> "Strong", "Strong", "L…
$ `House Floor Area`                              <dbl> 80, 42, 35, 30, 54, 40…
$ `House Age`                                     <dbl> 75, 15, 12, 15, 16, 7,…
$ `Number of bedrooms`                            <dbl> 3, 2, 1, 1, 3, 2, 1, 2…
$ `Tenure Status`                                 <chr> "Own or owner-like pos…
$ `Toilet Facilities`                             <chr> "Water-sealed, sewer s…
$ Electricity                                     <dbl> 1, 1, 0, 1, 1, 1, 1, 1…
$ `Main Source of Water Supply`                   <chr> "Own use, faucet, comm…
$ `Number of Television`                          <dbl> 1, 1, 0, 1, 1, 1, 1, 1…
$ `Number of CD/VCD/DVD`                          <dbl> 1, 1, 0, 0, 0, 0, 0, 1…
$ `Number of Component/Stereo set`                <dbl> 0, 1, 0, 0, 0, 0, 1, 0…
$ `Number of Refrigerator/Freezer`                <dbl> 1, 0, 0, 0, 1, 0, 0, 0…
$ `Number of Washing Machine`                     <dbl> 1, 1, 0, 0, 0, 1, 0, 1…
$ `Number of Airconditioner`                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
$ `Number of Car, Jeep, Van`                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
$ `Number of Landline/wireless telephones`        <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
$ `Number of Cellular phone`                      <dbl> 2, 3, 0, 1, 3, 4, 2, 2…
$ `Number of Personal Computer`                   <dbl> 1, 1, 0, 0, 0, 0, 0, 0…
$ `Number of Stove with Oven/Gas Range`           <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
$ `Number of Motorized Banca`                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
$ `Number of Motorcycle/Tricycle`                 <dbl> 1, 2, 0, 0, 1, 1, 1, 1…

nrow(income_raw)

[1] 41544

This code loads the required package for data manipulation and imports the dataset into R using read_csv(). The dataset is stored as income_raw. The glimpse() function is used to quickly inspect the structure of the dataset, including variable types and sample values. The nrow() function returns the total number of observations (rows) in the dataset, helping to understand its size.

income_clean <- income_raw |>
   rename(
    income      = `Total Household Income`,
    family_size = `Total Number of Family members`,
    education   = `Household Head Highest Grade Completed`,
    region      = Region,
    food_exp    = `Total Food Expenditure`,
    head_age    = `Household Head Age`,
    floor_area  = `House Floor Area`
  ) |>
   select(income, family_size, education, region,
         food_exp, head_age, floor_area) |>
  mutate(
    education = factor(education),
    region    = factor(region)
  ) |>
  drop_na(income, family_size, education, region)

glimpse(income_clean)

Rows: 41,544
Columns: 7
$ income      <dbl> 480332, 198235, 82785, 107589, 189322, 152883, 198621, 134…
$ family_size <dbl> 4, 3, 6, 3, 4, 4, 5, 5, 2, 6, 4, 7, 7, 3, 2, 4, 5, 8, 4, 5…
$ education   <fct> "Teacher Training and Education Sciences Programs", "Trans…
$ region      <fct> CAR, CAR, CAR, CAR, CAR, CAR, CAR, CAR, CAR, CAR, CAR, CAR…
$ food_exp    <dbl> 117848, 67766, 61609, 78189, 94625, 73326, 104644, 95644, …
$ head_age    <dbl> 49, 40, 39, 52, 65, 46, 45, 33, 17, 53, 49, 35, 38, 53, 75…
$ floor_area  <dbl> 80, 42, 35, 30, 54, 40, 35, 35, 35, 70, 40, 35, 35, 50, 35…

This code creates a cleaned version of the dataset for analysis. First, variables are renamed into simpler and more readable names using rename(). Then, only the relevant variables are selected using select() to focus the dataset on the economic factors of interest. The categorical variables education and region are converted to factors using mutate(), so that R treats them as categories rather than free text in later models. Finally, observations with missing values in the key variables are removed using drop_na(). The glimpse() output confirms that the transformations were applied correctly; the cleaned dataset still has 41,544 rows, so none of the four key variables contained missing values.
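Before dropping rows, it can help to see how many missing values each key variable actually has. The following is a minimal sketch in base R on a small made-up data frame; toy_income and its values are illustrative assumptions and are not taken from the FIES file.

```r
# Toy data frame standing in for income_clean; values are invented for illustration.
toy_income <- data.frame(
  income      = c(480332, 198235, NA, 107589),
  family_size = c(4, 3, 6, NA),
  region      = c("CAR", "CAR", NA, "CAR")
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy_income))
print(na_counts)

# Rows that are complete in the key variables (the rows drop_na() would keep)
key_vars  <- c("income", "family_size", "region")
toy_clean <- toy_income[complete.cases(toy_income[, key_vars]), ]

# Number of rows removed by the missing-value filter
n_dropped <- nrow(toy_income) - nrow(toy_clean)
print(n_dropped)
```

Running the same two checks on income_raw would show directly whether drop_na() removes anything.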

Summary Statistics

income_stats <- income_clean |>
  summarise(
    Mean          = mean(income),
    Median        = median(income),
    Std_Deviation = sd(income),
    Q1            = quantile(income, 0.25),
    Q3            = quantile(income, 0.75),
    Minimum       = min(income),
    Maximum       = max(income)
  )

income_stats

# A tibble: 1 × 7
     Mean  Median Std_Deviation     Q1      Q3 Minimum  Maximum
    <dbl>   <dbl>         <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
1 247556. 164080.       286881. 104895 291138.   11285 11815988

Here income is total annual household income in Philippine pesos (PHP). The mean (about 247,556) is roughly 1.5 times the median (164,080), the standard deviation (286,881) exceeds the mean, and the maximum (about 11.8 million) is roughly 40 times the third quartile (291,138). These gaps indicate a strongly right-skewed distribution with a small number of very high-income households, which is consistent with substantial income inequality.
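The skewness suggested by these figures can be quantified with simple arithmetic on the printed statistics. The sketch below copies the rounded values from the income_stats output and computes three quick indicators; the variable names are illustrative, and Pearson's second skewness coefficient is used here only as a rough rule of thumb, not as part of the original analysis.

```r
# Figures copied from the income_stats output (rounded as printed)
inc_mean   <- 247556
inc_median <- 164080
inc_sd     <- 286881
inc_q1     <- 104895
inc_q3     <- 291138

# Mean-to-median ratio: values above 1 point to a right tail
mean_median_ratio <- inc_mean / inc_median

# Interquartile range: spread of the middle 50% of households
inc_iqr <- inc_q3 - inc_q1

# Pearson's second skewness coefficient: 3 * (mean - median) / sd
pearson_skew <- 3 * (inc_mean - inc_median) / inc_sd

print(round(c(ratio = mean_median_ratio, iqr = inc_iqr, skew = pearson_skew), 3))
```

A ratio well above 1 and a clearly positive coefficient both point in the same direction as the histogram in the next section.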

Histogram

income_clean |>
  ggplot(aes(x = income)) +
  geom_histogram(bins = 80, fill = "lightpink", color = "white") +
  scale_x_continuous(labels = scales::label_comma()) +
  labs(
    title    = "Distribution of Total Household Income",
    subtitle = "Original scale — Philippines FIES",
    x        = "Annual Household Income (PHP)",
    y        = "Number of Households"
  ) +
  theme_minimal()

The distribution is strongly right skewed. The vast majority of households are concentrated at lower income levels, while a long right tail extends toward very high values. This means the income variable is not normally distributed in its original form. The strong right skewness of the income variable may lead to non-normal residuals and heteroskedasticity in regression models. A log transformation helps stabilize the variance and produces a distribution that is closer to normality.

Log Transformation

income_clean <- income_clean |>
  mutate(log_income = log(income))

income_clean |>
  ggplot(aes(x = log_income)) +
  geom_histogram(bins = 60, fill = "lightyellow", color = "black") +
  labs(
    title    = "Distribution of log(Total Household Income)",
    subtitle = "Log-transformed scale — Philippines FIES",
    x        = "log(Annual Household Income)",
    y        = "Number of Households"
  ) +
  theme_minimal()

After the log transformation, the distribution becomes approximately bell shaped and symmetric, closely resembling a normal distribution. The extreme right skew present in the original variable has been substantially corrected. This makes the transformed variable much more appropriate as an outcome in a linear regression model.
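The effect of the log transformation on skewness can be illustrated numerically. The sketch below uses simulated log-normal draws rather than the FIES data; the seed, the parameters meanlog = 12 and sdlog = 0.9, and the helper sample_skewness() are all assumptions chosen for illustration.

```r
set.seed(465)  # arbitrary seed so the sketch is reproducible

# Simulated right-skewed "income": log-normal draws, NOT the FIES data
sim_income <- rlnorm(5000, meanlog = 12, sdlog = 0.9)

# Moment-based sample skewness (close to 0 for a symmetric distribution)
sample_skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- sample_skewness(sim_income)       # skewness on the original scale
skew_log <- sample_skewness(log(sim_income))  # skewness after taking logs

print(round(c(raw = skew_raw, log = skew_log), 2))
```

Applying sample_skewness() to income_clean$income and income_clean$log_income would give the corresponding figures for the real data.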

Theoretical Distribution

The original income variable is strictly positive and strongly right-skewed, and its logarithm is approximately normal. A variable whose logarithm is normally distributed is, by definition, log-normal. Therefore, a log-normal distribution is likely to be the best theoretical approximation for household income.
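One rough way to probe the log-normal reading is a moment match. For a log-normal variable the median equals exp(mu) and the mean equals exp(mu + sigma^2 / 2), so mu and sigma can be backed out from the printed mean and median, and the standard deviation they imply can be compared with the observed one. The sketch below does this with the rounded figures from the summary table; it is an informal consistency check under the log-normal assumption, not a formal goodness-of-fit test.

```r
# Figures copied from the income_stats output (rounded as printed)
inc_mean   <- 247556
inc_median <- 164080
inc_sd     <- 286881

# Log-normal identities: median = exp(mu), mean = exp(mu + sigma^2 / 2)
mu_hat     <- log(inc_median)
sigma2_hat <- 2 * log(inc_mean / inc_median)
sigma_hat  <- sqrt(sigma2_hat)

# Standard deviation implied by those parameters: mean * sqrt(exp(sigma^2) - 1)
implied_sd <- inc_mean * sqrt(exp(sigma2_hat) - 1)

print(round(c(mu = mu_hat, sigma = sigma_hat,
              implied_sd = implied_sd, observed_sd = inc_sd), 3))
```

The implied standard deviation lands within a few percent of the observed one, which is in line with the log-normal approximation.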

Stage 2: Classification (Bankruptcy Prediction)

Dataset Description & Source

Dataset: Polish Company Bankruptcy Dataset

Source: https://archive.ics.uci.edu/ml/datasets/Polish+companies+bankruptcy+data

The ARFF file was downloaded and imported into R using the readARFF() function.

This dataset contains financial ratios of firms over multiple years together with a binary bankruptcy indicator.

The dataset is economically relevant because bankruptcy prediction is an important issue in finance and economics. Financial ratios can help investors, banks, and policymakers identify financially risky firms and evaluate economic stability.

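After rows with missing values are removed, the dataset contains 3,194 observations and 65 variables (64 financial ratios plus the bankruptcy indicator), so it comfortably contains more than 500 observations and more than 5 variables.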

Economic Question

Can financial ratios predict whether a firm will go bankrupt?

Data Import & Cleaning

library(farff)
library(tidyverse)

d1 <- readARFF("data/1year.arff")

Parse with reader=readr : data/1year.arff

header: 0.006000; preproc: 0.010000; data: 0.018000; postproc: 0.000000; total: 0.034000

data_clean <- d1 %>% # Create cleaned dataset
  drop_na() %>%   # Remove missing values
  mutate(class = as.numeric(as.character(class))) %>%
  rename(is_bankrupt = class)

head(data_clean)

      Attr1   Attr2    Attr3  Attr4   Attr5    Attr6     Attr7   Attr8  Attr9
1  0.200550 0.37951 0.396410 2.0472  32.351 0.388250  0.249760 1.33050 1.1389
2  0.009020 0.63202 0.053735 1.1263 -37.842 0.000000  0.014434 0.58223 1.3332
3  0.266690 0.34994 0.611470 3.0243  43.087 0.559830  0.332070 1.85770 1.1268
4  0.067731 0.19885 0.081562 2.9576  90.606 0.212650  0.078063 4.02900 1.2570
5 -0.029182 0.21131 0.452640 7.5746  57.844 0.010387 -0.034653 3.73240 1.0241
6  0.028084 0.24231 0.432240 3.0128  47.935 0.021598  0.039729 3.10370 1.0125
   Attr10    Attr11    Attr12   Attr13    Attr14   Attr15   Attr16 Attr17
1 0.50494  0.249760  0.659800 0.166600  0.249760   497.42 0.733780 2.6349
2 0.36798  0.043162  0.033921 0.038938  0.014434  4443.70 0.082138 1.5822
3 0.65006  0.332070  1.099300 0.120470  0.332070   367.04 0.994440 2.8577
4 0.80115  0.078063  1.873600 0.310360  0.078063   926.03 0.394150 5.0290
5 0.78869 -0.034653 -0.503330 0.004191 -0.034653 23292.00 0.015671 4.7324
6 0.75206  0.039729  0.185000 0.044190  0.039729  1059.30 0.344580 4.1270
     Attr18    Attr19  Attr20  Attr21    Attr22    Attr23   Attr24  Attr25
1  0.249760  0.149420  43.370 1.24790  0.214020  0.119980 0.477060 0.50494
2  0.014434  0.010827  36.623 1.07520  0.030778  0.006766 0.000222 0.34828
3  0.332070  0.114960  38.183 1.05810  0.304710  0.092322 0.515860 0.65006
4  0.078063  0.309120  44.446 1.18480  0.053730  0.268200 0.246040 0.80115
5 -0.034653 -0.043861 105.350 0.99083 -0.050624 -0.036936 0.016014 0.78869
6  0.039729  0.021027  36.232 0.78628  0.054953  0.014863 0.034470 0.75206
    Attr26   Attr27   Attr28 Attr29    Attr30    Attr31  Attr32  Attr33
1 0.604110  1.45820 1.761500 5.9443  0.117880  0.149420  94.140  3.8772
2 0.073572  1.07140 0.103190 5.9479  0.474050  0.010827 142.090  2.6284
3 0.807590  1.18850 7.072800 3.9412  0.088635  0.114960  43.006  8.4871
4 0.342200  2.67440 0.093025 5.2684  0.780700  0.309120  75.693  4.8221
5 0.041561 -0.65622 0.945950 4.8827  0.119350 -0.043861  32.575 11.2050
6 0.296520  0.29448 1.224400 4.4729 -0.033048  0.021027  42.004  8.6896
    Attr34    Attr35  Attr36    Attr37  Attr38    Attr39   Attr40    Attr41
1  0.56393  0.214020 1.74100 593.27000 0.50591  0.128040 0.662950  0.051402
2  1.76970  0.240130 1.33320   2.75380 0.49344  0.180110 0.072677  0.308650
3  0.87075  0.304710 2.91610  12.77300 0.69793  0.105480 0.325250  0.035883
4  0.27021  0.053730 0.27897   0.58833 0.95834  0.212760 0.048375  0.120970
5 -0.23957 -0.050624 0.80603   2.05990 0.93115 -0.064076 3.250200 -0.548790
6  0.22679  0.054953 1.93410  16.67000 0.77962  0.029084 1.428600  0.080697
     Attr42  Attr43  Attr44   Attr45  Attr46  Attr47    Attr48    Attr49
1  0.128040 114.420  71.050  1.00970 1.52250  49.394  0.185300  0.110850
2  0.023085 122.740  86.122  0.06743 0.81192  43.654 -0.006701 -0.005026
3  0.105480 103.020  64.834  0.88253 2.02390  43.023  0.288790  0.099972
4  0.212760 175.190 130.740  2.20250 2.21950  55.868  0.053417  0.211520
5 -0.064076 137.540  32.193 -0.12797 4.26240 107.890 -0.088588 -0.112130
6  0.029084  65.719  29.487  0.14973 2.13940  36.686  0.011187  0.005921
   Attr50   Attr51   Attr52  Attr53 Attr54     Attr55   Attr56    Attr57
1 2.04200 0.378540 0.257920 2.24370 2.2480 3.4869e+05 0.121960  0.397180
2 0.75832 0.425540 0.380460 0.70666 0.9476 1.1263e+00 0.180110  0.024512
3 2.61060 0.302070 0.117830 7.51920 8.0728 5.3400e+03 0.112500  0.410250
4 0.61971 0.041664 0.207380 0.91375 1.0930 1.5132e+04 0.204440  0.084542
5 2.46790 0.068848 0.089246 1.64820 1.9460 3.4549e+04 0.023565 -0.037001
6 2.67010 0.214750 0.115080 2.13040 2.2085 1.2841e+04 0.012367  0.037342
   Attr58   Attr59  Attr60  Attr61  Attr62  Attr63   Attr64 is_bankrupt
1 0.87804 0.001924  8.4160  5.1372  82.658  4.4158  7.42770           0
2 0.84165 0.340940  9.9665  4.2382 116.500  3.1330  2.56030           0
3 0.88750 0.073630  9.5593  5.6298  38.168  9.5629 33.41300           0
4 0.79556 0.196190  8.2122  2.7917  60.218  6.0613  0.28803           0
5 0.97644 0.180630  3.4646 11.3380  31.807 11.4750  1.65110           0
6 0.98763 0.036647 10.0740 12.3780  41.485  8.7984  5.35230           0

nrow(data_clean)

[1] 3194

We use only the 1-year subset to obtain a cross-sectional dataset and avoid time-series dependence. Rows with a missing value in any of the 64 ratios are removed, because financial ratios must be complete for consistent economic interpretation; this leaves 3,194 firms.
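Because drop_na() is called without arguments, a row is discarded if any column is missing, including ratios that are not used later. The sketch below shows the difference between filtering on all columns and filtering only on the analysed columns; toy_ratios is a made-up table whose column names merely mimic the ARFF file.

```r
# Toy ratios table; column names mimic the ARFF file, values are invented.
toy_ratios <- data.frame(
  Attr1  = c(0.20, 0.01, 0.27, 0.07),
  Attr2  = c(0.38, 0.63, NA,   0.20),
  Attr37 = c(NA,   2.75, 12.8, NA),
  class  = c(0, 0, 0, 1)
)

# Rows complete on every column (what drop_na() with no arguments keeps)
n_all <- sum(complete.cases(toy_ratios))

# Rows complete on the columns actually analysed
used  <- c("Attr1", "Attr2", "class")
n_sub <- sum(complete.cases(toy_ratios[, used]))

print(c(all_columns = n_all, used_columns = n_sub))
```

Whether the stricter filter matters for the real file depends on how missingness is spread across the 64 ratios, which this toy example does not show.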

bankruptcy_data <- data_clean %>%
  select(
    is_bankrupt,
    Attr1,
    Attr2,
    Attr5,
    Attr7,
    Attr10
  )

bankruptcy_data <- bankruptcy_data %>%
  mutate(
    is_bankrupt = factor(is_bankrupt,
                         levels = c(0,1),
                         labels = c("No","Yes"))
  )

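These variables represent:

profitability (Attr1, net profit / total assets)
leverage (Attr2, total liabilities / total assets)
liquidity (Attr5, a liquidity measure expressed in days)
operating performance (Attr7, EBIT / total assets)
capital structure (Attr10, equity / total assets)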

Summary Statistics

bankruptcy_data %>%
  summarise(
    Bankruptcy_Rate = mean(is_bankrupt),
    Mean_Attr1 = mean(Attr1),
    Mean_Attr2 = mean(Attr2),
    Mean_Attr5 = mean(Attr5),
    Mean_Attr7 = mean(Attr7),
    Mean_Attr10 = mean(Attr10),
    Observations = n()
  )

Warning: There was 1 warning in `summarise()`.
ℹ In argument: `Bankruptcy_Rate = mean(is_bankrupt)`.
Caused by warning in `mean.default()`:
! argument is not numeric or logical: returning NA

  Bankruptcy_Rate Mean_Attr1 Mean_Attr2 Mean_Attr5 Mean_Attr7 Mean_Attr10
1              NA 0.08235786  0.5352492   192.1768 0.09801264   0.4386079
  Observations
1         3194
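The NA in Bankruptcy_Rate comes from calling mean() on a factor. The sketch below reproduces the behaviour on a few invented 0/1 outcomes and shows two ways to obtain the share of the "Yes" class; toy_status is a hypothetical stand-in for is_bankrupt.

```r
# Invented 0/1 outcomes, converted to a factor the same way as is_bankrupt
toy_status <- factor(c(0, 0, 1, 0, 0), levels = c(0, 1), labels = c("No", "Yes"))

# mean() on a factor warns and returns NA
rate_wrong <- suppressWarnings(mean(toy_status))

# Comparing to the "Yes" label gives a logical vector, whose mean is a share
rate_right <- mean(toy_status == "Yes")

# Shares of both classes at once
class_shares <- prop.table(table(toy_status))

print(rate_wrong)
print(rate_right)
print(class_shares)
```

In the summarise() call, the same fix would read Bankruptcy_Rate = mean(is_bankrupt == "Yes").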

## Summary Statistics for Attr1

bankruptcy_attr1_stats <- bankruptcy_data |>
  summarise(
    Mean = mean(Attr1, na.rm = TRUE),
    Median = median(Attr1, na.rm = TRUE),
    Std_Deviation = sd(Attr1, na.rm = TRUE),
    Q1 = quantile(Attr1, 0.25, na.rm = TRUE),
    Q3 = quantile(Attr1, 0.75, na.rm = TRUE),
    Minimum = min(Attr1, na.rm = TRUE),
    Maximum = max(Attr1, na.rm = TRUE)
  )

bankruptcy_attr1_stats

        Mean   Median Std_Deviation       Q1       Q3 Minimum Maximum
1 0.08235786 0.065399     0.1175964 0.019102 0.126595 -1.1533  1.5399

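Bankruptcy_Rate is reported as NA because is_bankrupt had already been converted to a factor, and mean() is not defined for factors; mean(is_bankrupt == "Yes") returns the share of bankrupt firms instead. The bankruptcy rate is relatively low, meaning the dataset is imbalanced. This reflects real-world economics because most firms survive while only a small fraction fail.

The summary statistics for Attr1 indicate substantial variation in profitability across firms. The middle half of firms lies between 0.019 and 0.127, while the minimum (-1.15) and maximum (1.54) fall far outside that range, which points to extreme observations in both tails. The mean (0.082) sits slightly above the median (0.065), a sign of mild right skew.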

Target Variable Distribution

bankruptcy_data %>% 
  ggplot(aes(x = is_bankrupt)) +
  geom_bar(fill = "cyan") +
  labs(
    title = "Bankruptcy Distribution",
    x = "Bankruptcy Status",
    y = "Count"
  ) +
  theme_minimal()

Since the bankruptcy outcome variable is binary (0 = non-bankrupt, 1 = bankrupt), a standard continuous histogram and logarithmic transformation are not economically meaningful for the target variable itself. Therefore, the distribution of an important continuous financial ratio (Attr1) is additionally examined to analyze skewness and distributional properties in the financial data.

Original Distribution

bankruptcy_data %>% 
  ggplot(aes(x = Attr1)) +
  geom_histogram(bins = 60, fill = "turquoise") +
  labs(
    title = "Original Distribution of Profitability (Attr1)",
    x = "Attr1",
    y = "Frequency"
  ) +
  theme_minimal()

The histogram of Attr1 is used to check skewness and extreme values in the financial data. Most firms cluster in a narrow band close to zero, with several extreme observations far from the bulk of the data, indicating substantial differences in profitability across firms.

Log Transformation

bankruptcy_data <- bankruptcy_data %>%
  mutate(log_attr1 = log(abs(Attr1) + 1))

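Because Attr1 contains negative profitability values, a standard logarithmic transformation cannot be applied directly. Therefore, log(abs(Attr1) + 1) is used to reduce the influence of extreme values while keeping all observations. This transformation works on the absolute value, so it measures the magnitude of profitability and treats a loss and a profit of the same size identically.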
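If the sign of profitability needs to be kept, one common alternative is a signed log transform, sign(x) * log(1 + |x|). The sketch below compares it with the transformation used above on a few Attr1 values taken from the head() output and the summary statistics; signed_log() is a hypothetical helper and is not part of the original analysis.

```r
# Signed log transform: keeps the sign of x while compressing large magnitudes
signed_log <- function(x) sign(x) * log1p(abs(x))

# Attr1 values from the printed rows, plus the reported minimum and maximum
attr1_example <- c(0.200550, 0.009020, -0.029182, -1.1533, 1.5399)

abs_version    <- log(abs(attr1_example) + 1)  # transformation used in the text
signed_version <- signed_log(attr1_example)    # hypothetical sign-preserving alternative

print(round(cbind(attr1      = attr1_example,
                  abs_log    = abs_version,
                  signed_log = signed_version), 4))
```

Both versions compress the extremes by the same amount; they differ only in whether losses stay negative after the transformation.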

Log Histogram

bankruptcy_data %>% 
  ggplot(aes(x = log_attr1)) +
  geom_histogram(bins = 60, fill = "darkblue") +
  labs(
    title = "Log-Transformed Profitability (Attr1)",
    x = "log(|Attr1| + 1)",
    y = "Frequency"
  ) +
  theme_minimal()

The log transformation reduces the influence of extreme outliers and compresses the scale of the variable. Because log(|Attr1| + 1) is never negative and ignores the sign of Attr1, the histogram describes the magnitude of profitability, and its shape should be read with that in mind.

Theoretical Distribution

The original Attr1 variable is concentrated near its median with extreme observations in the tails, suggesting a heavy-tailed distribution rather than a normal distribution. Attr1 takes both negative and positive values (from -1.15 to 1.54), so it cannot itself be log-normal, since a log-normal variable is strictly positive. A log-normal distribution may therefore provide a reasonable approximation only for the shifted magnitude |Attr1| + 1, to the extent that log(|Attr1| + 1) is approximately normal.