Purpose

The goal of this project is to predict whether a company’s profit will grow in the following year

# Load required libraries
library('httr')
library('readxl')
library('ggplot2')
library('corrplot')
## corrplot 0.95 loaded
library('dplyr')
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library('tidyverse')
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ lubridate 1.9.4     ✔ tibble    3.3.1
## ✔ purrr     1.2.1     ✔ tidyr     1.3.2
## ✔ readr     2.1.6
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library('viridis')
## Loading required package: viridisLite
library('caret')
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
## 
## The following object is masked from 'package:httr':
## 
##     progress
library('rpart')
library('rpart.plot')
library('randomForest')
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin

Data loading

Dataset reference: https://www.kaggle.com/datasets/mdmahfuzsumon/synthetic-finance-dataset-for-machine-learning

# load the data from kaggle
url <- "https://www.kaggle.com/api/v1/datasets/download/mdmahfuzsumon/synthetic-finance-dataset-for-machine-learning"

response <-GET(url,
               authenticate("Maryanne M", "KGAT_b77248cf46a1d424ed10f43537576b1e"),
               write_disk("finance_data.zip", overwrite = TRUE))
# print(response)
unzip("finance_data.zip")
finance_df <- read.csv("financial_200_features_dataset (1).csv")
head(finance_df)

Data inspection

##    company               year         revenue          domestic_sales     
##  Length:1000        Min.   :2016   Min.   :7.389e+08   Min.   :5.987e+08  
##  Class :character   1st Qu.:2018   1st Qu.:9.600e+09   1st Qu.:7.936e+09  
##  Mode  :character   Median :2020   Median :1.758e+10   Median :1.437e+10  
##                     Mean   :2020   Mean   :2.008e+10   Mean   :1.657e+10  
##                     3rd Qu.:2023   3rd Qu.:2.710e+10   3rd Qu.:2.245e+10  
##                     Max.   :2025   Max.   :9.001e+10   Max.   :7.866e+10  
##                                                                           
##   export_sales            cogs            gross_profit      
##  Min.   :8.509e+07   Min.   :3.902e+08   Min.   :3.335e+08  
##  1st Qu.:1.584e+09   1st Qu.:5.499e+09   1st Qu.:4.055e+09  
##  Median :2.879e+09   Median :1.009e+10   Median :7.301e+09  
##  Mean   :3.505e+09   Mean   :1.149e+10   Mean   :8.585e+09  
##  3rd Qu.:4.747e+09   3rd Qu.:1.554e+10   3rd Qu.:1.148e+10  
##  Max.   :1.887e+10   Max.   :4.975e+10   Max.   :4.026e+10  
##                                                             
##  operating_expense        ebit               ebitda         
##  Min.   :1.685e+08   Min.   :1.296e+08   Min.   :1.811e+08  
##  1st Qu.:2.294e+09   1st Qu.:1.692e+09   1st Qu.:2.310e+09  
##  Median :4.061e+09   Median :3.086e+09   Median :4.266e+09  
##  Mean   :4.810e+09   Mean   :3.775e+09   Mean   :5.077e+09  
##  3rd Qu.:6.565e+09   3rd Qu.:5.125e+09   3rd Qu.:6.862e+09  
##  Max.   :2.599e+10   Max.   :2.060e+10   Max.   :2.532e+10  
##                                                             
##    net_profit            assets              equity         
##  Min.   :9.516e+07   Min.   :1.147e+09   Min.   :7.421e+08  
##  1st Qu.:1.278e+09   1st Qu.:1.934e+10   1st Qu.:1.172e+10  
##  Median :2.306e+09   Median :3.650e+10   Median :2.133e+10  
##  Mean   :2.831e+09   Mean   :4.309e+10   Mean   :2.580e+10  
##  3rd Qu.:3.836e+09   3rd Qu.:5.822e+10   3rd Qu.:3.501e+10  
##  Max.   :1.631e+10   Max.   :1.958e+11   Max.   :1.239e+11  
##                                                             
##       debt                cash             inventory        
##  Min.   :4.052e+08   Min.   :7.867e+07   Min.   :8.139e+07  
##  1st Qu.:7.587e+09   1st Qu.:8.903e+08   1st Qu.:1.593e+09  
##  Median :1.403e+10   Median :1.625e+09   Median :2.884e+09  
##  Mean   :1.728e+10   Mean   :1.963e+09   Mean   :3.542e+09  
##  3rd Qu.:2.229e+10   3rd Qu.:2.615e+09   3rd Qu.:4.847e+09  
##  Max.   :9.759e+10   Max.   :9.706e+09   Max.   :1.826e+10  
##                                                             
##    receivable          employees        shares           operating_cf      
##  Min.   :7.304e+07   Min.   : 305   Min.   : 50690778   Min.   :1.239e+08  
##  1st Qu.:1.272e+09   1st Qu.:1398   1st Qu.:171141650   1st Qu.:1.788e+09  
##  Median :2.266e+09   Median :2722   Median :286444384   Median :3.197e+09  
##  Mean   :2.823e+09   Mean   :2676   Mean   :279673639   Mean   :3.844e+09  
##  3rd Qu.:3.735e+09   3rd Qu.:3929   3rd Qu.:391184969   3rd Qu.:5.171e+09  
##  Max.   :1.742e+10   Max.   :4997   Max.   :499282127   Max.   :1.898e+10  
##                                                                            
##   investing_cf         financing_cf        revenue_growth     domestic_share  
##  Min.   :-1.046e+10   Min.   :-5.585e+09   Min.   :-0.11970   Min.   :0.7500  
##  1st Qu.:-2.640e+09   1st Qu.:-6.924e+08   1st Qu.: 0.05586   1st Qu.:0.7873  
##  Median :-1.602e+09   Median : 1.109e+07   Median : 0.12440   Median :0.8255  
##  Mean   :-2.004e+09   Mean   :-7.136e+06   Mean   : 0.12613   Mean   :0.8255  
##  3rd Qu.:-8.777e+08   3rd Qu.: 7.232e+08   3rd Qu.: 0.18809   3rd Qu.:0.8635  
##  Max.   :-3.794e+07   Max.   : 6.354e+09   Max.   : 0.38287   Max.   :0.8999  
##                                            NA's   :100                        
##   export_share    revenue_per_employee revenue_per_asset revenue_per_equity
##  Min.   :0.1001   Min.   :   198047    Min.   :0.3574    Min.   :0.4811    
##  1st Qu.:0.1365   1st Qu.:  3515920    1st Qu.:0.4075    1st Qu.:0.6689    
##  Median :0.1745   Median :  6751117    Median :0.4665    Median :0.7971    
##  Mean   :0.1745   Mean   : 11930863    Mean   :0.4823    Mean   :0.8230    
##  3rd Qu.:0.2127   3rd Qu.: 13899831    3rd Qu.:0.5548    3rd Qu.:0.9424    
##  Max.   :0.2500   Max.   :157050970    Max.   :0.6665    Max.   :1.4334    
##                                                                            
##  revenue_per_share  gross_margin     ebit_margin      ebitda_margin   
##  Min.   :  1.832   Min.   :0.3500   Min.   :0.05446   Min.   :0.1151  
##  1st Qu.: 33.557   1st Qu.:0.3884   1st Qu.:0.14941   1st Qu.:0.2129  
##  Median : 64.958   Median :0.4284   Median :0.18701   Median :0.2509  
##  Mean   :101.250   Mean   :0.4269   Mean   :0.18761   Mean   :0.2522  
##  3rd Qu.:124.831   3rd Qu.:0.4648   3rd Qu.:0.22589   3rd Qu.:0.2921  
##  Max.   :920.742   Max.   :0.5000   Max.   :0.31408   Max.   :0.3851  
##                                                                       
##    net_margin      profit_growth      cogs_margin       opex_ratio    
##  Min.   :0.04258   Min.   :-0.7186   Min.   :0.5000   Min.   :0.1800  
##  1st Qu.:0.11109   1st Qu.:-0.1684   1st Qu.:0.5352   1st Qu.:0.2101  
##  Median :0.14116   Median : 0.1164   Median :0.5716   Median :0.2375  
##  Mean   :0.14082   Mean   : 0.2496   Mean   :0.5731   Mean   :0.2393  
##  3rd Qu.:0.16976   3rd Qu.: 0.5331   3rd Qu.:0.6116   3rd Qu.:0.2685  
##  Max.   :0.25627   Max.   : 3.8825   Max.   :0.6500   Max.   :0.3000  
##                    NA's   :100                                        
##  asset_turnover    equity_ratio      debt_ratio       cash_ratio     
##  Min.   :0.3574   Min.   :0.4506   Min.   :0.2501   Min.   :0.03533  
##  1st Qu.:0.4075   1st Qu.:0.5238   1st Qu.:0.3257   1st Qu.:0.08647  
##  Median :0.4665   Median :0.5986   Median :0.4014   Median :0.11616  
##  Mean   :0.4823   Mean   :0.5997   Mean   :0.4003   Mean   :0.12485  
##  3rd Qu.:0.5548   3rd Qu.:0.6743   3rd Qu.:0.4762   3rd Qu.:0.15428  
##  Max.   :0.6665   Max.   :0.7499   Max.   :0.5494   Max.   :0.32345  
##                                                                      
##  working_capital           fcf               ocf_margin     inventory_turnover
##  Min.   :-4.105e+09   Min.   :-2.398e+09   Min.   :0.0755   Min.   :2.015     
##  1st Qu.: 2.902e+08   1st Qu.: 6.120e+08   1st Qu.:0.1613   1st Qu.:2.721     
##  Median : 1.098e+09   Median : 1.391e+09   Median :0.1903   Median :3.270     
##  Mean   : 1.555e+09   Mean   : 1.840e+09   Mean   :0.1914   Mean   :3.504     
##  3rd Qu.: 2.477e+09   3rd Qu.: 2.535e+09   3rd Qu.:0.2207   3rd Qu.:4.175     
##  Max.   : 1.676e+10   Max.   : 1.164e+10   Max.   :0.3292   Max.   :6.326     
##                                                                               
##  receivable_turnover      eps          book_value_per_share   year_lag1   
##  Min.   : 5.009      Min.   :  0.288   Min.   :   1.839     Min.   :2016  
##  1st Qu.: 5.931      1st Qu.:  4.518   1st Qu.:  42.115     1st Qu.:2018  
##  Median : 7.111      Median :  9.007   Median :  80.711     Median :2020  
##  Mean   : 7.647      Mean   : 14.207   Mean   : 129.771     Mean   :2020  
##  3rd Qu.: 9.172      3rd Qu.: 16.982   3rd Qu.: 160.424     3rd Qu.:2022  
##  Max.   :12.477      Max.   :153.415   Max.   :1090.970     Max.   :2024  
##                                                             NA's   :100   
##    year_lag2    year_rolling_mean year_rolling_std year_momentum
##  Min.   :2016   Min.   :2017      Min.   :1        Min.   :1    
##  1st Qu.:2018   1st Qu.:2019      1st Qu.:1        1st Qu.:1    
##  Median :2020   Median :2020      Median :1        Median :1    
##  Mean   :2020   Mean   :2020      Mean   :1        Mean   :1    
##  3rd Qu.:2021   3rd Qu.:2022      3rd Qu.:1        3rd Qu.:1    
##  Max.   :2023   Max.   :2024      Max.   :1        Max.   :1    
##  NA's   :200    NA's   :200       NA's   :200      NA's   :100  
##   revenue_lag1        revenue_lag2       revenue_rolling_mean
##  Min.   :7.389e+08   Min.   :7.389e+08   Min.   :8.303e+08   
##  1st Qu.:9.247e+09   1st Qu.:8.748e+09   1st Qu.:9.865e+09   
##  Median :1.668e+10   Median :1.589e+10   Median :1.818e+10   
##  Mean   :1.868e+10   Mean   :1.737e+10   Mean   :1.972e+10   
##  3rd Qu.:2.536e+10   3rd Qu.:2.374e+10   3rd Qu.:2.681e+10   
##  Max.   :7.158e+10   Max.   :6.414e+10   Max.   :7.349e+10   
##  NA's   :100         NA's   :200         NA's   :200         
##  revenue_rolling_std revenue_momentum     domestic_sales_lag1
##  Min.   :1.300e+07   Min.   :-2.749e+09   Min.   :5.987e+08  
##  1st Qu.:1.033e+09   1st Qu.: 5.405e+08   1st Qu.:7.568e+09  
##  Median :2.005e+09   Median : 1.679e+09   Median :1.373e+10  
##  Mean   :2.550e+09   Mean   : 2.409e+09   Mean   :1.545e+10  
##  3rd Qu.:3.430e+09   3rd Qu.: 3.502e+09   3rd Qu.:2.118e+10  
##  Max.   :1.435e+10   Max.   : 2.370e+10   Max.   :5.716e+10  
##  NA's   :200         NA's   :100          NA's   :100        
##  domestic_sales_lag2 domestic_sales_rolling_mean domestic_sales_rolling_std
##  Min.   :5.987e+08   Min.   :7.050e+08           Min.   :3.218e+07         
##  1st Qu.:7.270e+09   1st Qu.:8.107e+09           1st Qu.:8.396e+08         
##  Median :1.301e+10   Median :1.491e+10           Median :1.707e+09         
##  Mean   :1.437e+10   Mean   :1.630e+10           Mean   :2.220e+09         
##  3rd Qu.:1.932e+10   3rd Qu.:2.207e+10           3rd Qu.:3.040e+09         
##  Max.   :5.637e+10   Max.   :6.244e+10           Max.   :1.420e+10         
##  NA's   :200         NA's   :200                 NA's   :200               
##  domestic_sales_momentum export_sales_lag1   export_sales_lag2  
##  Min.   :-5.192e+09      Min.   :8.509e+07   Min.   :8.509e+07  
##  1st Qu.: 2.946e+08      1st Qu.:1.453e+09   1st Qu.:1.399e+09  
##  Median : 1.279e+09      Median :2.713e+09   Median :2.543e+09  
##  Mean   : 1.961e+09      Mean   :3.227e+09   Mean   :3.005e+09  
##  3rd Qu.: 3.062e+09      3rd Qu.:4.308e+09   3rd Qu.:4.113e+09  
##  Max.   : 2.638e+10      Max.   :1.585e+10   Max.   :1.175e+10  
##  NA's   :100             NA's   :100         NA's   :200        
##  export_sales_rolling_mean export_sales_rolling_std export_sales_momentum
##  Min.   :1.254e+08         Min.   :1.311e+07        Min.   :-5.030e+09   
##  1st Qu.:1.680e+09         1st Qu.:3.169e+08        1st Qu.:-2.969e+08   
##  Median :3.006e+09         Median :6.756e+08        Median : 1.871e+08   
##  Mean   :3.420e+09         Mean   :8.947e+08        Mean   : 4.485e+08   
##  3rd Qu.:4.656e+09         3rd Qu.:1.227e+09        3rd Qu.: 1.019e+09   
##  Max.   :1.355e+10         Max.   :6.613e+09        Max.   : 1.050e+10   
##  NA's   :200               NA's   :200              NA's   :100          
##    cogs_lag1           cogs_lag2         cogs_rolling_mean  
##  Min.   :3.902e+08   Min.   :3.902e+08   Min.   :4.480e+08  
##  1st Qu.:5.285e+09   1st Qu.:5.056e+09   1st Qu.:5.651e+09  
##  Median :9.600e+09   Median :9.078e+09   Median :1.043e+10  
##  Mean   :1.068e+10   Mean   :9.956e+09   Mean   :1.129e+10  
##  3rd Qu.:1.471e+10   3rd Qu.:1.365e+10   3rd Qu.:1.537e+10  
##  Max.   :4.445e+10   Max.   :3.281e+10   Max.   :4.044e+10  
##  NA's   :100         NA's   :200         NA's   :200        
##  cogs_rolling_std    cogs_momentum        gross_profit_lag1  
##  Min.   :2.948e+07   Min.   :-4.948e+09   Min.   :3.335e+08  
##  1st Qu.:5.570e+08   1st Qu.: 1.094e+08   1st Qu.:3.862e+09  
##  Median :1.222e+09   Median : 8.920e+08   Median :6.969e+09  
##  Mean   :1.599e+09   Mean   : 1.391e+09   Mean   :7.992e+09  
##  3rd Qu.:2.252e+09   3rd Qu.: 2.240e+09   3rd Qu.:1.074e+10  
##  Max.   :9.597e+09   Max.   : 1.628e+10   Max.   :3.444e+10  
##  NA's   :200         NA's   :100          NA's   :100        
##  gross_profit_lag2   gross_profit_rolling_mean gross_profit_rolling_std
##  Min.   :3.335e+08   Min.   :3.776e+08         Min.   :1.210e+07       
##  1st Qu.:3.649e+09   1st Qu.:4.144e+09         1st Qu.:4.687e+08       
##  Median :6.698e+09   Median :7.586e+09         Median :9.506e+08       
##  Mean   :7.414e+09   Mean   :8.432e+09         Mean   :1.322e+09       
##  3rd Qu.:1.004e+10   3rd Qu.:1.131e+10         3rd Qu.:1.805e+09       
##  Max.   :3.133e+10   Max.   :3.481e+10         Max.   :7.904e+09       
##  NA's   :200         NA's   :200               NA's   :200             
##  gross_profit_momentum operating_expense_lag1 operating_expense_lag2
##  Min.   :-5.420e+09    Min.   :1.685e+08      Min.   :1.685e+08     
##  1st Qu.:-2.699e+07    1st Qu.:2.148e+09      1st Qu.:2.020e+09     
##  Median : 6.003e+08    Median :3.896e+09      Median :3.757e+09     
##  Mean   : 1.018e+09    Mean   :4.459e+09      Mean   :4.158e+09     
##  3rd Qu.: 1.754e+09    3rd Qu.:6.061e+09      3rd Qu.:5.596e+09     
##  Max.   : 1.157e+10    Max.   :1.952e+10      Max.   :1.720e+10     
##  NA's   :100           NA's   :100            NA's   :200           
##  operating_expense_rolling_mean operating_expense_rolling_std
##  Min.   :2.011e+08              Min.   :1.114e+07            
##  1st Qu.:2.368e+09              1st Qu.:3.005e+08            
##  Median :4.233e+09              Median :6.308e+08            
##  Mean   :4.717e+09              Mean   :8.616e+08            
##  3rd Qu.:6.358e+09              3rd Qu.:1.112e+09            
##  Max.   :1.755e+10              Max.   :7.371e+09            
##  NA's   :200                    NA's   :200                  
##  operating_expense_momentum   ebit_lag1           ebit_lag2        
##  Min.   :-5.658e+09         Min.   :1.296e+08   Min.   :1.296e+08  
##  1st Qu.:-1.142e+08         1st Qu.:1.630e+09   1st Qu.:1.570e+09  
##  Median : 3.438e+08         Median :2.958e+09   Median :2.771e+09  
##  Mean   : 5.894e+08         Mean   :3.533e+09   Mean   :3.256e+09  
##  3rd Qu.: 1.135e+09         3rd Qu.:4.807e+09   3rd Qu.:4.329e+09  
##  Max.   : 1.169e+10         Max.   :2.060e+10   Max.   :1.897e+10  
##  NA's   :100                NA's   :100         NA's   :200        
##  ebit_rolling_mean   ebit_rolling_std    ebit_momentum       
##  Min.   :1.638e+08   Min.   :2.323e+06   Min.   :-1.016e+10  
##  1st Qu.:1.791e+09   1st Qu.:3.674e+08   1st Qu.:-4.407e+08  
##  Median :3.292e+09   Median :7.657e+08   Median : 2.369e+08  
##  Mean   :3.715e+09   Mean   :1.058e+09   Mean   : 4.290e+08  
##  3rd Qu.:5.009e+09   3rd Qu.:1.436e+09   3rd Qu.: 1.165e+09  
##  Max.   :1.726e+10   Max.   :6.964e+09   Max.   : 1.092e+10  
##  NA's   :200         NA's   :200         NA's   :100         
##   ebitda_lag1         ebitda_lag2        ebitda_rolling_mean
##  Min.   :1.811e+08   Min.   :1.811e+08   Min.   :2.201e+08  
##  1st Qu.:2.218e+09   1st Qu.:2.131e+09   1st Qu.:2.417e+09  
##  Median :4.094e+09   Median :3.818e+09   Median :4.465e+09  
##  Mean   :4.747e+09   Mean   :4.385e+09   Mean   :4.996e+09  
##  3rd Qu.:6.281e+09   3rd Qu.:5.820e+09   3rd Qu.:6.745e+09  
##  Max.   :2.532e+10   Max.   :2.313e+10   Max.   :2.252e+10  
##  NA's   :100         NA's   :200         NA's   :200        
##  ebitda_rolling_std  ebitda_momentum      net_profit_lag1    
##  Min.   :4.259e+06   Min.   :-1.103e+10   Min.   :9.516e+07  
##  1st Qu.:3.956e+08   1st Qu.:-2.940e+08   1st Qu.:1.215e+09  
##  Median :8.176e+08   Median : 3.035e+08   Median :2.221e+09  
##  Mean   :1.132e+09   Mean   : 5.812e+08   Mean   :2.647e+09  
##  3rd Qu.:1.533e+09   3rd Qu.: 1.326e+09   3rd Qu.:3.537e+09  
##  Max.   :7.922e+09   Max.   : 1.163e+10   Max.   :1.631e+10  
##  NA's   :200         NA's   :100          NA's   :100        
##  net_profit_lag2     net_profit_rolling_mean net_profit_rolling_std
##  Min.   :9.516e+07   Min.   :1.160e+08       Min.   :2.433e+06     
##  1st Qu.:1.149e+09   1st Qu.:1.345e+09       1st Qu.:3.060e+08     
##  Median :2.072e+09   Median :2.469e+09       Median :5.646e+08     
##  Mean   :2.445e+09   Mean   :2.787e+09       Mean   :8.109e+08     
##  3rd Qu.:3.283e+09   3rd Qu.:3.823e+09       3rd Qu.:1.035e+09     
##  Max.   :1.464e+10   Max.   :1.284e+10       Max.   :5.716e+09     
##  NA's   :200         NA's   :200             NA's   :200           
##  net_profit_momentum   assets_lag1         assets_lag2       
##  Min.   :-9.503e+09   Min.   :1.147e+09   Min.   :1.147e+09  
##  1st Qu.:-3.468e+08   1st Qu.:1.830e+10   1st Qu.:1.757e+10  
##  Median : 1.647e+08   Median :3.503e+10   Median :3.308e+10  
##  Mean   : 3.235e+08   Mean   :4.025e+10   Mean   :3.734e+10  
##  3rd Qu.: 8.341e+08   3rd Qu.:5.496e+10   3rd Qu.:5.105e+10  
##  Max.   : 8.352e+09   Max.   :1.830e+11   Max.   :1.632e+11  
##  NA's   :100          NA's   :100         NA's   :200        
##  assets_rolling_mean assets_rolling_std  assets_momentum     
##  Min.   :1.454e+09   Min.   :5.328e+07   Min.   :-5.400e+10  
##  1st Qu.:2.092e+10   1st Qu.:2.807e+09   1st Qu.:-1.264e+09  
##  Median :3.843e+10   Median :6.326e+09   Median : 3.071e+09  
##  Mean   :4.243e+10   Mean   :8.220e+09   Mean   : 5.030e+09  
##  3rd Qu.:5.816e+10   3rd Qu.:1.114e+10   3rd Qu.: 1.077e+10  
##  Max.   :1.718e+11   Max.   :5.501e+10   Max.   : 9.013e+10  
##  NA's   :200         NA's   :200         NA's   :100         
##   equity_lag1         equity_lag2        equity_rolling_mean
##  Min.   :7.421e+08   Min.   :7.421e+08   Min.   :8.794e+08  
##  1st Qu.:1.125e+10   1st Qu.:1.062e+10   1st Qu.:1.262e+10  
##  Median :2.034e+10   Median :1.977e+10   Median :2.304e+10  
##  Mean   :2.425e+10   Mean   :2.263e+10   Mean   :2.553e+10  
##  3rd Qu.:3.313e+10   3rd Qu.:3.104e+10   3rd Qu.:3.458e+10  
##  Max.   :1.071e+11   Max.   :1.071e+11   Max.   :1.122e+11  
##  NA's   :100         NA's   :200         NA's   :200        
##  equity_rolling_std  equity_momentum        debt_lag1        
##  Min.   :3.098e+07   Min.   :-3.721e+10   Min.   :4.052e+08  
##  1st Qu.:2.304e+09   1st Qu.:-1.879e+09   1st Qu.:7.078e+09  
##  Median :4.614e+09   Median : 1.677e+09   Median :1.341e+10  
##  Mean   :5.886e+09   Mean   : 2.874e+09   Mean   :1.600e+10  
##  3rd Qu.:7.962e+09   3rd Qu.: 7.356e+09   3rd Qu.:2.119e+10  
##  Max.   :3.153e+10   Max.   : 5.538e+10   Max.   :9.759e+10  
##  NA's   :200         NA's   :100          NA's   :100        
##    debt_lag2         debt_rolling_mean   debt_rolling_std   
##  Min.   :4.052e+08   Min.   :5.364e+08   Min.   :6.126e+07  
##  1st Qu.:6.711e+09   1st Qu.:8.384e+09   1st Qu.:1.512e+09  
##  Median :1.264e+10   Median :1.487e+10   Median :3.392e+09  
##  Mean   :1.471e+10   Mean   :1.690e+10   Mean   :4.670e+09  
##  3rd Qu.:1.994e+10   3rd Qu.:2.284e+10   3rd Qu.:5.990e+09  
##  Max.   :7.230e+10   Max.   :6.712e+10   Max.   :3.177e+10  
##  NA's   :200         NA's   :200         NA's   :200        
##  debt_momentum          cash_lag1           cash_lag2        
##  Min.   :-4.430e+10   Min.   :7.867e+07   Min.   :7.867e+07  
##  1st Qu.:-1.442e+09   1st Qu.:8.518e+08   1st Qu.:7.932e+08  
##  Median : 1.005e+09   Median :1.535e+09   Median :1.423e+09  
##  Mean   : 2.157e+09   Mean   :1.844e+09   Mean   :1.707e+09  
##  3rd Qu.: 5.682e+09   3rd Qu.:2.447e+09   3rd Qu.:2.326e+09  
##  Max.   : 5.769e+10   Max.   :9.706e+09   Max.   :7.909e+09  
##  NA's   :100          NA's   :100         NA's   :200        
##  cash_rolling_mean   cash_rolling_std    cash_momentum       
##  Min.   :8.769e+07   Min.   :1.047e+07   Min.   :-4.833e+09  
##  1st Qu.:9.317e+08   1st Qu.:2.167e+08   1st Qu.:-2.700e+08  
##  Median :1.727e+09   Median :4.191e+08   Median : 1.292e+08  
##  Mean   :1.938e+09   Mean   :5.625e+08   Mean   : 2.190e+08  
##  3rd Qu.:2.573e+09   3rd Qu.:7.278e+08   3rd Qu.: 6.458e+08  
##  Max.   :7.724e+09   Max.   :3.713e+09   Max.   : 5.684e+09  
##  NA's   :200         NA's   :200         NA's   :100         
##  inventory_lag1      inventory_lag2      inventory_rolling_mean
##  Min.   :8.139e+07   Min.   :8.139e+07   Min.   :1.444e+08     
##  1st Qu.:1.544e+09   1st Qu.:1.465e+09   1st Qu.:1.671e+09     
##  Median :2.687e+09   Median :2.537e+09   Median :3.059e+09     
##  Mean   :3.291e+09   Mean   :3.063e+09   Mean   :3.484e+09     
##  3rd Qu.:4.488e+09   3rd Qu.:4.250e+09   3rd Qu.:4.811e+09     
##  Max.   :1.517e+10   Max.   :1.361e+10   Max.   :1.320e+10     
##  NA's   :100         NA's   :200         NA's   :200           
##  inventory_rolling_std inventory_momentum   receivable_lag1    
##  Min.   :1.876e+07     Min.   :-6.342e+09   Min.   :7.304e+07  
##  1st Qu.:3.602e+08     1st Qu.:-3.263e+08   1st Qu.:1.220e+09  
##  Median :6.999e+08     Median : 2.333e+08   Median :2.101e+09  
##  Mean   :8.808e+08     Mean   : 4.328e+08   Mean   :2.623e+09  
##  3rd Qu.:1.204e+09     3rd Qu.: 1.114e+09   3rd Qu.:3.562e+09  
##  Max.   :5.893e+09     Max.   : 9.903e+09   Max.   :1.427e+10  
##  NA's   :200           NA's   :100          NA's   :100        
##  receivable_lag2     receivable_rolling_mean receivable_rolling_std
##  Min.   :7.304e+07   Min.   :8.992e+07       Min.   :6.299e+06     
##  1st Qu.:1.163e+09   1st Qu.:1.352e+09       1st Qu.:2.396e+08     
##  Median :2.015e+09   Median :2.444e+09       Median :5.277e+08     
##  Mean   :2.429e+09   Mean   :2.773e+09       Mean   :7.137e+08     
##  3rd Qu.:3.307e+09   3rd Qu.:3.687e+09       3rd Qu.:9.968e+08     
##  Max.   :1.068e+10   Max.   :1.191e+10       Max.   :6.182e+09     
##  NA's   :200         NA's   :200             NA's   :200           
##  receivable_momentum  employees_lag1 employees_lag2 employees_rolling_mean
##  Min.   :-6.736e+09   Min.   : 305   Min.   : 305   Min.   : 789          
##  1st Qu.:-2.211e+08   1st Qu.:1396   1st Qu.:1392   1st Qu.:2054          
##  Median : 1.893e+08   Median :2733   Median :2738   Median :2643          
##  Mean   : 3.462e+08   Mean   :2687   Mean   :2692   Mean   :2647          
##  3rd Qu.: 8.703e+08   3rd Qu.:3938   3rd Qu.:3938   3rd Qu.:3240          
##  Max.   : 9.032e+09   Max.   :4997   Max.   :4997   Max.   :4799          
##  NA's   :100          NA's   :100    NA's   :200    NA's   :200           
##  employees_rolling_std employees_momentum  shares_lag1       
##  Min.   :   8.0        Min.   :-4411.00   Min.   : 50690778  
##  1st Qu.: 919.1        1st Qu.:-1507.75   1st Qu.:169835258  
##  Median :1281.6        Median :   31.50   Median :288341835  
##  Mean   :1289.0        Mean   :  -50.61   Mean   :280194173  
##  3rd Qu.:1698.9        3rd Qu.: 1439.50   3rd Qu.:392930827  
##  Max.   :2562.5        Max.   : 4502.00   Max.   :499282127  
##  NA's   :200           NA's   :100        NA's   :100        
##   shares_lag2        shares_rolling_mean shares_rolling_std 
##  Min.   : 52181752   Min.   : 70002088   Min.   :   979757  
##  1st Qu.:172664926   1st Qu.:229760300   1st Qu.: 78753490  
##  Median :286961198   Median :281812556   Median :118686117  
##  Mean   :281401466   Mean   :279592734   Mean   :118601504  
##  3rd Qu.:394419786   3rd Qu.:330473144   3rd Qu.:157807119  
##  Max.   :499282127   Max.   :470985140   Max.   :237662957  
##  NA's   :200         NA's   :200         NA's   :200        
##  shares_momentum      operating_cf_lag1   operating_cf_lag2  
##  Min.   :-432229983   Min.   :1.239e+08   Min.   :1.239e+08  
##  1st Qu.:-130024940   1st Qu.:1.680e+09   1st Qu.:1.613e+09  
##  Median :  -4018480   Median :3.026e+09   Median :2.853e+09  
##  Mean   :  -1438342   Mean   :3.587e+09   Mean   :3.315e+09  
##  3rd Qu.: 128116534   3rd Qu.:4.885e+09   3rd Qu.:4.363e+09  
##  Max.   : 433670256   Max.   :1.898e+10   Max.   :1.633e+10  
##  NA's   :100          NA's   :100         NA's   :200        
##  operating_cf_rolling_mean operating_cf_rolling_std operating_cf_momentum
##  Min.   :1.479e+08         Min.   :9.572e+06        Min.   :-9.930e+09   
##  1st Qu.:1.841e+09         1st Qu.:3.477e+08        1st Qu.:-2.678e+08   
##  Median :3.319e+09         Median :6.469e+08        Median : 2.474e+08   
##  Mean   :3.782e+09         Mean   :9.216e+08        Mean   : 4.500e+08   
##  3rd Qu.:5.154e+09         3rd Qu.:1.242e+09        3rd Qu.: 1.063e+09   
##  Max.   :1.614e+10         Max.   :5.660e+09        Max.   : 9.095e+09   
##  NA's   :200               NA's   :200              NA's   :100          
##  investing_cf_lag1    investing_cf_lag2    investing_cf_rolling_mean
##  Min.   :-9.722e+09   Min.   :-9.128e+09   Min.   :-9.455e+09       
##  1st Qu.:-2.512e+09   1st Qu.:-2.304e+09   1st Qu.:-2.638e+09       
##  Median :-1.516e+09   Median :-1.449e+09   Median :-1.724e+09       
##  Mean   :-1.866e+09   Mean   :-1.722e+09   Mean   :-1.969e+09       
##  3rd Qu.:-8.471e+08   3rd Qu.:-8.134e+08   3rd Qu.:-9.639e+08       
##  Max.   :-3.794e+07   Max.   :-3.794e+07   Max.   :-7.481e+07       
##  NA's   :100          NA's   :200          NA's   :200              
##  investing_cf_rolling_std investing_cf_momentum financing_cf_lag1   
##  Min.   :7.387e+06        Min.   :-5.111e+09    Min.   :-5.192e+09  
##  1st Qu.:2.212e+08        1st Qu.:-6.928e+08    1st Qu.:-6.405e+08  
##  Median :4.304e+08        Median :-1.052e+08    Median : 1.109e+07  
##  Mean   :5.672e+08        Mean   :-2.409e+08    Mean   : 2.565e+06  
##  3rd Qu.:7.653e+08        3rd Qu.: 3.087e+08    3rd Qu.: 6.555e+08  
##  Max.   :3.385e+09        Max.   : 3.534e+09    Max.   : 5.930e+09  
##  NA's   :200              NA's   :100           NA's   :100         
##  financing_cf_lag2    financing_cf_rolling_mean financing_cf_rolling_std
##  Min.   :-4.497e+09   Min.   :-3.695e+09        Min.   :2.881e+06       
##  1st Qu.:-6.055e+08   1st Qu.:-3.374e+08        1st Qu.:3.709e+08       
##  Median : 1.947e+07   Median : 1.446e+06        Median :7.672e+08       
##  Mean   : 3.023e+06   Mean   :-6.865e+06        Mean   :9.971e+08       
##  3rd Qu.: 6.052e+08   3rd Qu.: 3.618e+08        3rd Qu.:1.343e+09       
##  Max.   : 4.582e+09   Max.   : 5.243e+09        Max.   :5.221e+09       
##  NA's   :200          NA's   :200               NA's   :200             
##  financing_cf_momentum revenue_growth_lag1 revenue_growth_lag2
##  Min.   :-8.569e+09    Min.   :-0.11970    Min.   :-0.11970   
##  1st Qu.:-8.165e+08    1st Qu.: 0.05931    1st Qu.: 0.05668   
##  Median :-1.216e+07    Median : 0.12804    Median : 0.12558   
##  Mean   :-1.629e+07    Mean   : 0.12857    Mean   : 0.12684   
##  3rd Qu.: 8.590e+08    3rd Qu.: 0.18846    3rd Qu.: 0.18809   
##  Max.   : 8.065e+09    Max.   : 0.38287    Max.   : 0.38287   
##  NA's   :100           NA's   :200         NA's   :300        
##  revenue_growth_rolling_mean revenue_growth_rolling_std revenue_growth_momentum
##  Min.   :0.004631            Min.   :0.005021           Min.   :-0.398044      
##  1st Qu.:0.094238            1st Qu.:0.058974           1st Qu.:-0.117437      
##  Median :0.131864            Median :0.092919           Median :-0.005155      
##  Mean   :0.127803            Mean   :0.095023           Mean   :-0.002201      
##  3rd Qu.:0.162381            3rd Qu.:0.126197           3rd Qu.: 0.113302      
##  Max.   :0.254524            Max.   :0.228557           Max.   : 0.424383      
##  NA's   :300                 NA's   :300                NA's   :200            
##  domestic_share_lag1 domestic_share_lag2 domestic_share_rolling_mean
##  Min.   :0.7500      Min.   :0.7500      Min.   :0.7621             
##  1st Qu.:0.7886      1st Qu.:0.7886      1st Qu.:0.8082             
##  Median :0.8259      Median :0.8255      Median :0.8257             
##  Mean   :0.8261      Mean   :0.8259      Mean   :0.8261             
##  3rd Qu.:0.8647      3rd Qu.:0.8647      3rd Qu.:0.8444             
##  Max.   :0.8999      Max.   :0.8999      Max.   :0.8949             
##  NA's   :100         NA's   :200         NA's   :200                
##  domestic_share_rolling_std domestic_share_momentum export_share_lag1
##  Min.   :0.001101           Min.   :-0.146020       Min.   :0.1001   
##  1st Qu.:0.025001           1st Qu.:-0.042776       1st Qu.:0.1353   
##  Median :0.039182           Median : 0.000750       Median :0.1741   
##  Mean   :0.039348           Mean   :-0.000414       Mean   :0.1739   
##  3rd Qu.:0.053388           3rd Qu.: 0.044058       3rd Qu.:0.2114   
##  Max.   :0.081052           Max.   : 0.137871       Max.   :0.2500   
##  NA's   :200                NA's   :100             NA's   :100      
##  export_share_lag2 export_share_rolling_mean export_share_rolling_std
##  Min.   :0.1001    Min.   :0.1051            Min.   :0.001101        
##  1st Qu.:0.1353    1st Qu.:0.1556            1st Qu.:0.025001        
##  Median :0.1745    Median :0.1743            Median :0.039182        
##  Mean   :0.1741    Mean   :0.1739            Mean   :0.039348        
##  3rd Qu.:0.2114    3rd Qu.:0.1918            3rd Qu.:0.053388        
##  Max.   :0.2500    Max.   :0.2379            Max.   :0.081052        
##  NA's   :200       NA's   :200               NA's   :200             
##  export_share_momentum revenue_per_employee_lag1 revenue_per_employee_lag2
##  Min.   :-0.137871     Min.   :  198047          Min.   :  198047         
##  1st Qu.:-0.044058     1st Qu.: 3387392          1st Qu.: 3153488         
##  Median :-0.000750     Median : 6301592          Median : 5872301         
##  Mean   : 0.000414     Mean   :10833026          Mean   :10376992         
##  3rd Qu.: 0.042776     3rd Qu.:12981077          3rd Qu.:11966675         
##  Max.   : 0.146020     Max.   :82153197          Max.   :82153197         
##  NA's   :100           NA's   :100               NA's   :200              
##  revenue_per_employee_rolling_mean revenue_per_employee_rolling_std
##  Min.   :  262827                  Min.   :   30759                
##  1st Qu.: 4514881                  1st Qu.: 1827885                
##  Median : 8670465                  Median : 4478237                
##  Mean   :11710682                  Mean   : 7991243                
##  3rd Qu.:15377774                  3rd Qu.: 9862523                
##  Max.   :66963532                  Max.   :78035985                
##  NA's   :200                       NA's   :200                     
##  revenue_per_employee_momentum revenue_per_asset_lag1 revenue_per_asset_lag2
##  Min.   :-65775600             Min.   :0.3574         Min.   :0.3574        
##  1st Qu.: -2826681             1st Qu.:0.4067         1st Qu.:0.4075        
##  Median :   350106             Median :0.4640         Median :0.4646        
##  Mean   :  1794261             Mean   :0.4812         Mean   :0.4810        
##  3rd Qu.:  5149380             3rd Qu.:0.5548         3rd Qu.:0.5534        
##  Max.   :133456304             Max.   :0.6665         Max.   :0.6658        
##  NA's   :100                   NA's   :100            NA's   :200           
##  revenue_per_asset_rolling_mean revenue_per_asset_rolling_std
##  Min.   :0.3733                 Min.   :0.006425             
##  1st Qu.:0.4385                 1st Qu.:0.040866             
##  Median :0.4808                 Median :0.074250             
##  Mean   :0.4810                 Mean   :0.074524             
##  3rd Qu.:0.5157                 3rd Qu.:0.104103             
##  Max.   :0.6367                 Max.   :0.172142             
##  NA's   :200                    NA's   :200                  
##  revenue_per_asset_momentum revenue_per_equity_lag1 revenue_per_equity_lag2
##  Min.   :-0.291987          Min.   :0.4811          Min.   :0.4811         
##  1st Qu.:-0.073826          1st Qu.:0.6657          1st Qu.:0.6639         
##  Median : 0.001466          Median :0.7918          Median :0.7900         
##  Mean   : 0.000370          Mean   :0.8184          Mean   :0.8155         
##  3rd Qu.: 0.074934          3rd Qu.:0.9371          3rd Qu.:0.9367         
##  Max.   : 0.300186          Max.   :1.4334          Max.   :1.4334         
##  NA's   :100                NA's   :100             NA's   :200            
##  revenue_per_equity_rolling_mean revenue_per_equity_rolling_std
##  Min.   :0.5456                  Min.   :0.009297              
##  1st Qu.:0.7361                  1st Qu.:0.094488              
##  Median :0.8077                  Median :0.158698              
##  Mean   :0.8183                  Mean   :0.170094              
##  3rd Qu.:0.8905                  3rd Qu.:0.232338              
##  Max.   :1.3062                  Max.   :0.465799              
##  NA's   :200                     NA's   :200                   
##  revenue_per_equity_momentum revenue_per_share_lag1 revenue_per_share_lag2
##  Min.   :-0.782539           Min.   :  1.832        Min.   :  1.832       
##  1st Qu.:-0.165571           1st Qu.: 32.886        1st Qu.: 31.539       
##  Median : 0.005282           Median : 62.212        Median : 57.173       
##  Mean   : 0.002213           Mean   : 93.617        Mean   : 85.186       
##  3rd Qu.: 0.165738           3rd Qu.:117.915        3rd Qu.:108.250       
##  Max.   : 0.919153           Max.   :920.742        Max.   :920.742       
##  NA's   :100                 NA's   :100            NA's   :200           
##  revenue_per_share_rolling_mean revenue_per_share_rolling_std
##  Min.   :  3.45                 Min.   :  0.1449             
##  1st Qu.: 46.06                 1st Qu.: 13.6587             
##  Median : 78.74                 Median : 35.1846             
##  Mean   : 98.35                 Mean   : 58.2226             
##  3rd Qu.:133.07                 3rd Qu.: 75.7685             
##  Max.   :509.92                 Max.   :454.0048             
##  NA's   :200                    NA's   :200                  
##  revenue_per_share_momentum gross_margin_lag1 gross_margin_lag2
##  Min.   :-715.315           Min.   :0.3500    Min.   :0.3500   
##  1st Qu.: -22.643           1st Qu.:0.3894    1st Qu.:0.3882   
##  Median :   4.622           Median :0.4280    Median :0.4271   
##  Mean   :  12.796           Mean   :0.4270    Mean   :0.4264   
##  3rd Qu.:  44.955           3rd Qu.:0.4658    3rd Qu.:0.4656   
##  Max.   : 794.887           Max.   :0.5000    Max.   :0.5000   
##  NA's   :100                NA's   :100       NA's   :200      
##  gross_margin_rolling_mean gross_margin_rolling_std gross_margin_momentum
##  Min.   :0.3564            Min.   :0.001141         Min.   :-0.148800    
##  1st Qu.:0.4096            1st Qu.:0.027902         1st Qu.:-0.044549    
##  Median :0.4277            Median :0.041058         Median : 0.001636    
##  Mean   :0.4268            Mean   :0.040668         Mean   :-0.000465    
##  3rd Qu.:0.4440            3rd Qu.:0.053569         3rd Qu.: 0.045173    
##  Max.   :0.4929            Max.   :0.079875         Max.   : 0.137160    
##  NA's   :200               NA's   :200              NA's   :100          
##  ebit_margin_lag1  ebit_margin_lag2  ebit_margin_rolling_mean
##  Min.   :0.05446   Min.   :0.05683   Min.   :0.1071          
##  1st Qu.:0.15030   1st Qu.:0.14906   1st Qu.:0.1671          
##  Median :0.18773   Median :0.18802   Median :0.1865          
##  Mean   :0.18787   Mean   :0.18700   Mean   :0.1877          
##  3rd Qu.:0.22589   3rd Qu.:0.22558   3rd Qu.:0.2092          
##  Max.   :0.31408   Max.   :0.31408   Max.   :0.2832          
##  NA's   :100       NA's   :200       NA's   :200             
##  ebit_margin_rolling_std ebit_margin_momentum ebitda_margin_lag1
##  Min.   :0.00108         Min.   :-0.202350    Min.   :0.1151    
##  1st Qu.:0.02959         1st Qu.:-0.051292    1st Qu.:0.2136    
##  Median :0.04548         Median :-0.002749    Median :0.2517    
##  Mean   :0.04835         Mean   :-0.000257    Mean   :0.2525    
##  3rd Qu.:0.06567         3rd Qu.: 0.054456    3rd Qu.:0.2921    
##  Max.   :0.11997         Max.   : 0.218452    Max.   :0.3851    
##  NA's   :200             NA's   :100          NA's   :100       
##  ebitda_margin_lag2 ebitda_margin_rolling_mean ebitda_margin_rolling_std
##  Min.   :0.1151     Min.   :0.1682             Min.   :0.002424         
##  1st Qu.:0.2133     1st Qu.:0.2306             1st Qu.:0.030788         
##  Median :0.2516     Median :0.2511             Median :0.046050         
##  Mean   :0.2516     Mean   :0.2523             Mean   :0.048710         
##  3rd Qu.:0.2900     3rd Qu.:0.2736             3rd Qu.:0.066805         
##  Max.   :0.3851     Max.   :0.3526             Max.   :0.116791         
##  NA's   :200        NA's   :200                NA's   :200              
##  ebitda_margin_momentum net_margin_lag1   net_margin_lag2 
##  Min.   :-0.210979      Min.   :0.04258   Min.   :0.0447  
##  1st Qu.:-0.051267      1st Qu.:0.11137   1st Qu.:0.1108  
##  Median :-0.002032      Median :0.14123   Median :0.1410  
##  Mean   :-0.000440      Mean   :0.14098   Mean   :0.1405  
##  3rd Qu.: 0.053644      3rd Qu.:0.16976   3rd Qu.:0.1693  
##  Max.   : 0.204711      Max.   :0.25210   Max.   :0.2521  
##  NA's   :100            NA's   :100       NA's   :200     
##  net_margin_rolling_mean net_margin_rolling_std net_margin_momentum
##  Min.   :0.07437         Min.   :0.00033        Min.   :-0.153114  
##  1st Qu.:0.12499         1st Qu.:0.02215        1st Qu.:-0.038936  
##  Median :0.13952         Median :0.03632        Median :-0.002671  
##  Mean   :0.14105         Mean   :0.03738        Mean   :-0.000103  
##  3rd Qu.:0.15719         3rd Qu.:0.05066        3rd Qu.: 0.042611  
##  Max.   :0.20874         Max.   :0.09924        Max.   : 0.174952  
##  NA's   :200             NA's   :200            NA's   :100        
##  profit_growth_lag1 profit_growth_lag2 profit_growth_rolling_mean
##  Min.   :-0.7186    Min.   :-0.6952    Min.   :-0.21135          
##  1st Qu.:-0.1661    1st Qu.:-0.1716    1st Qu.: 0.07833          
##  Median : 0.1259    Median : 0.1323    Median : 0.22671          
##  Mean   : 0.2593    Mean   : 0.2501    Mean   : 0.25433          
##  3rd Qu.: 0.5251    3rd Qu.: 0.5216    3rd Qu.: 0.37465          
##  Max.   : 3.8825    Max.   : 3.8825    Max.   : 1.29336          
##  NA's   :200        NA's   :300        NA's   :300
str(finance_df)
## 'data.frame':    1000 obs. of  210 variables:
##  $ company                          : chr  "Company_1" "Company_1" "Company_1" "Company_1" ...
##  $ year                             : int  2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 ...
##  $ revenue                          : num  8.17e+09 8.26e+09 9.70e+09 1.19e+10 1.49e+10 ...
##  $ domestic_sales                   : num  6.86e+09 6.22e+09 8.65e+09 9.92e+09 1.15e+10 ...
##  $ export_sales                     : num  1.31e+09 2.04e+09 1.05e+09 1.96e+09 3.42e+09 ...
##  $ cogs                             : num  4.27e+09 4.78e+09 5.67e+09 5.99e+09 9.26e+09 ...
##  $ gross_profit                     : num  3.89e+09 3.48e+09 4.03e+09 5.88e+09 5.67e+09 ...
##  $ operating_expense                : num  1.62e+09 1.88e+09 2.20e+09 3.34e+09 2.82e+09 ...
##  $ ebit                             : num  2.27e+09 1.60e+09 1.84e+09 2.54e+09 2.85e+09 ...
##  $ ebitda                           : num  2.69e+09 2.02e+09 2.33e+09 3.30e+09 4.04e+09 ...
##  $ net_profit                       : num  1.87e+09 1.35e+09 1.28e+09 1.85e+09 2.29e+09 ...
##  $ assets                           : num  1.86e+10 1.49e+10 1.76e+10 3.21e+10 2.63e+10 ...
##  $ equity                           : num  1.23e+10 7.10e+09 1.15e+10 2.15e+10 1.19e+10 ...
##  $ debt                             : num  6.29e+09 7.78e+09 6.07e+09 1.07e+10 1.44e+10 ...
##  $ cash                             : num  4.25e+08 9.23e+08 1.08e+09 9.82e+08 1.96e+09 ...
##  $ inventory                        : num  2.00e+09 1.30e+09 2.18e+09 2.20e+09 3.08e+09 ...
##  $ receivable                       : num  1.47e+09 1.63e+09 9.78e+08 1.69e+09 2.50e+09 ...
##  $ employees                        : int  2733 1828 3452 637 4774 2990 1978 3602 3604 2007 ...
##  $ shares                           : int  294061375 249502978 87709731 220476398 368019332 96321408 73841962 186702141 173850419 362601305 ...
##  $ operating_cf                     : num  2.52e+09 1.94e+09 1.58e+09 2.68e+09 3.42e+09 ...
##  $ investing_cf                     : num  -9.12e+08 -9.74e+08 -1.22e+09 -1.02e+09 -1.72e+09 ...
##  $ financing_cf                     : num  1.82e+08 -8.17e+07 -1.45e+08 -5.20e+08 1.24e+09 ...
##  $ revenue_growth                   : num  NA 0.011 0.175 0.224 0.257 ...
##  $ domestic_share                   : num  0.84 0.753 0.891 0.835 0.771 ...
##  $ export_share                     : num  0.16 0.247 0.109 0.165 0.229 ...
##  $ revenue_per_employee             : num  2987781 4516026 2810452 18645123 3127314 ...
##  $ revenue_per_asset                : num  0.438 0.555 0.551 0.37 0.569 ...
##  $ revenue_per_equity               : num  0.662 1.163 0.842 0.553 1.259 ...
##  $ revenue_per_share                : num  27.8 33.1 110.6 53.9 40.6 ...
##  $ gross_margin                     : num  0.477 0.421 0.416 0.495 0.38 ...
##  $ ebit_margin                      : num  0.278 0.193 0.189 0.214 0.191 ...
##  $ ebitda_margin                    : num  0.33 0.245 0.24 0.278 0.27 ...
##  $ net_margin                       : num  0.229 0.163 0.132 0.156 0.153 ...
##  $ profit_growth                    : num  NA -0.2783 -0.0517 0.4511 0.2349 ...
##  $ cogs_margin                      : num  0.523 0.579 0.584 0.505 0.62 ...
##  $ opex_ratio                       : num  0.199 0.228 0.226 0.281 0.189 ...
##  $ asset_turnover                   : num  0.438 0.555 0.551 0.37 0.569 ...
##  $ equity_ratio                     : num  0.662 0.477 0.655 0.668 0.452 ...
##  $ debt_ratio                       : num  0.338 0.523 0.345 0.332 0.548 ...
##  $ cash_ratio                       : num  0.0676 0.1187 0.1774 0.0921 0.1364 ...
##  $ working_capital                  : num  1.85e+09 1.05e+09 9.66e+08 5.58e+08 2.76e+09 ...
##  $ fcf                              : num  1.60e+09 9.65e+08 3.60e+08 1.67e+09 1.70e+09 ...
##  $ ocf_margin                       : num  0.308 0.235 0.163 0.226 0.229 ...
##  $ inventory_turnover               : num  2.13 3.68 2.6 2.72 3.01 ...
##  $ receivable_turnover              : num  5.56 5.05 9.92 7.02 5.97 ...
##  $ eps                              : num  6.35 5.4 14.57 8.41 6.22 ...
##  $ book_value_per_share             : num  42 28.5 131.4 97.4 32.2 ...
##  $ year_lag1                        : num  NA 2016 2017 2018 2019 ...
##  $ year_lag2                        : num  NA NA 2016 2017 2018 ...
##  $ year_rolling_mean                : num  NA NA 2017 2018 2019 ...
##  $ year_rolling_std                 : num  NA NA 1 1 1 1 1 1 1 1 ...
##  $ year_momentum                    : num  NA 1 1 1 1 1 1 1 1 1 ...
##  $ revenue_lag1                     : num  NA 8.17e+09 8.26e+09 9.70e+09 1.19e+10 ...
##  $ revenue_lag2                     : num  NA NA 8.17e+09 8.26e+09 9.70e+09 ...
##  $ revenue_rolling_mean             : num  NA NA 8.71e+09 9.94e+09 1.22e+10 ...
##  $ revenue_rolling_std              : num  NA NA 8.62e+08 1.82e+09 2.63e+09 ...
##  $ revenue_momentum                 : num  NA 8.97e+07 1.45e+09 2.18e+09 3.05e+09 ...
##  $ domestic_sales_lag1              : num  NA 6.86e+09 6.22e+09 8.65e+09 9.92e+09 ...
##  $ domestic_sales_lag2              : num  NA NA 6.86e+09 6.22e+09 8.65e+09 ...
##  $ domestic_sales_rolling_mean      : num  NA NA 7.24e+09 8.26e+09 1.00e+10 ...
##  $ domestic_sales_rolling_std       : num  NA NA 1.26e+09 1.88e+09 1.44e+09 ...
##  $ domestic_sales_momentum          : num  NA -6.37e+08 2.43e+09 1.27e+09 1.59e+09 ...
##  $ export_sales_lag1                : num  NA 1.31e+09 2.04e+09 1.05e+09 1.96e+09 ...
##  $ export_sales_lag2                : num  NA NA 1.31e+09 2.04e+09 1.05e+09 ...
##  $ export_sales_rolling_mean        : num  NA NA 1.47e+09 1.68e+09 2.14e+09 ...
##  $ export_sales_rolling_std         : num  NA NA 5.09e+08 5.45e+08 1.19e+09 ...
##  $ export_sales_momentum            : num  NA 7.27e+08 -9.81e+08 9.04e+08 1.46e+09 ...
##  $ cogs_lag1                        : num  NA 4.27e+09 4.78e+09 5.67e+09 5.99e+09 ...
##  $ cogs_lag2                        : num  NA NA 4.27e+09 4.78e+09 5.67e+09 ...
##  $ cogs_rolling_mean                : num  NA NA 4.91e+09 5.48e+09 6.98e+09 ...
##  $ cogs_rolling_std                 : num  NA NA 7.07e+08 6.30e+08 1.99e+09 ...
##  $ cogs_momentum                    : num  NA 5.04e+08 8.93e+08 3.24e+08 3.27e+09 ...
##  $ gross_profit_lag1                : num  NA 3.89e+09 3.48e+09 4.03e+09 5.88e+09 ...
##  $ gross_profit_lag2                : num  NA NA 3.89e+09 3.48e+09 4.03e+09 ...
##  $ gross_profit_rolling_mean        : num  NA NA 3.80e+09 4.46e+09 5.19e+09 ...
##  $ gross_profit_rolling_std         : num  NA NA 2.88e+08 1.26e+09 1.01e+09 ...
##  $ gross_profit_momentum            : num  NA -4.14e+08 5.53e+08 1.85e+09 -2.14e+08 ...
##  $ operating_expense_lag1           : num  NA 1.62e+09 1.88e+09 2.20e+09 3.34e+09 ...
##  $ operating_expense_lag2           : num  NA NA 1.62e+09 1.88e+09 2.20e+09 ...
##  $ operating_expense_rolling_mean   : num  NA NA 1.90e+09 2.47e+09 2.78e+09 ...
##  $ operating_expense_rolling_std    : num  NA NA 2.87e+08 7.67e+08 5.73e+08 ...
##  $ operating_expense_momentum       : num  NA 2.59e+08 3.13e+08 1.14e+09 -5.17e+08 ...
##  $ ebit_lag1                        : num  NA 2.27e+09 1.60e+09 1.84e+09 2.54e+09 ...
##  $ ebit_lag2                        : num  NA NA 2.27e+09 1.60e+09 1.84e+09 ...
##  $ ebit_rolling_mean                : num  NA NA 1.90e+09 1.99e+09 2.41e+09 ...
##  $ ebit_rolling_std                 : num  NA NA 3.41e+08 4.93e+08 5.19e+08 ...
##  $ ebit_momentum                    : num  NA -6.73e+08 2.40e+08 7.08e+08 3.03e+08 ...
##  $ ebitda_lag1                      : num  NA 2.69e+09 2.02e+09 2.33e+09 3.30e+09 ...
##  $ ebitda_lag2                      : num  NA NA 2.69e+09 2.02e+09 2.33e+09 ...
##  $ ebitda_rolling_mean              : num  NA NA 2.35e+09 2.55e+09 3.22e+09 ...
##  $ ebitda_rolling_std               : num  NA NA 3.36e+08 6.68e+08 8.58e+08 ...
##  $ ebitda_momentum                  : num  NA -6.71e+08 3.06e+08 9.73e+08 7.38e+08 ...
##  $ net_profit_lag1                  : num  NA 1.87e+09 1.35e+09 1.28e+09 1.85e+09 ...
##  $ net_profit_lag2                  : num  NA NA 1.87e+09 1.35e+09 1.28e+09 ...
##  $ net_profit_rolling_mean          : num  NA NA 1.50e+09 1.49e+09 1.81e+09 ...
##  $ net_profit_rolling_std           : num  NA NA 3.22e+08 3.15e+08 5.08e+08 ...
##  $ net_profit_momentum              : num  NA -5.20e+08 -6.98e+07 5.77e+08 4.36e+08 ...
##  $ assets_lag1                      : num  NA 1.86e+10 1.49e+10 1.76e+10 3.21e+10 ...
##  $ assets_lag2                      : num  NA NA 1.86e+10 1.49e+10 1.76e+10 ...
##   [list output truncated]

Key findings

This dataset is a realistic corporate financial dataset that comprises of 100 companies over 10 years (2016- 2025) that include 210 engineered features from income statements, balance sheets and cash flow statements.

STRUCTURE

The dataset variables are all numerical except “Company” column which is chr.

The company column is organized in suc a way that every ten rows is data pertaining to the same company

There are missing values in the dataset but most importantly, the first line of the start for each company is empty so we have 100 empty rows in the dataset.

The dataset also contains lag columns, rolling mean, rolling standard deviation, and momentum

The target variable is profit_growth

Data Exploration

The goal is to visualize the distribution of the data by checking:

Here i will use the feature categories from which I’ll pick the most relevant to profit growth.

Also, they are all T-1 features.

For the efficiency features we will create a column ‘inventory_turnover_lag1’ by calculating cogs_lag1 / mean(inventory_lag1 & inventory_lag2) For the shareholder metrics, we will create a column ‘eps_lag1’ for year T-1 by net_profit_lag1 / shares_lag1

# efficiency features - inventory_turnover_lag1
finance_df <- finance_df%>% 
  group_by(company)%>%
  arrange(year)%>%
  mutate(
    # average inventory lag 1 using inventory lag1 and inventory lag2
    average_inventory_lag1 = (inventory_lag1 + inventory_lag2) / 2,
    # inventory turnover lag 1 for T-1 values only
    inventory_turnover_lag1 = cogs_lag1 / average_inventory_lag1
  ) %>%
  ungroup()

# shareholder metrics - eps_lag1
finance_df <- finance_df %>%
  group_by(company)%>%
  arrange(year)%>%
  mutate(
    eps_lag1 = net_profit_lag1 / shares_lag1 # earnings per share are calculated by dividing net profit by number of shares
  )%>%
  ungroup()

Histogram plot visualization

features_to_plot <- c("revenue_growth_lag1",
                      "ebit_lag1",
                      "operating_expense_lag1",
                      "revenue_per_asset_lag1",
                      "cash_lag1",
                      "debt_lag1",
                      "operating_cf_lag1",
                      "inventory_turnover_lag1",
                      "eps_lag1",
                      "revenue_growth_momentum"
                    )
for (feature in features_to_plot) {
  plot = ggplot(finance_df, aes_string(x = feature)) +
    geom_histogram(bins = 20, color = "red", fill = "steelblue", alpha = 0.7) +
    labs(title = paste("Distribution of", feature),
         x = feature,
         y = 'Count')+
    theme_minimal()
  print(plot)
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 200 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 100 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 100 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 100 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 100 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 100 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 100 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 200 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 100 rows containing non-finite outside the scale range
## (`stat_bin()`).

## Warning: Removed 200 rows containing non-finite outside the scale range
## (`stat_bin()`).

### Key findings

Revenue_growth_lag1, inventory_turnover_lag1, revenue_growth_momentum form a bell-shaped curve indicating normal distribution

What this means for my project

  • There is no need for transformation of these features

  • I am not expecting extreme outliers

  • Logistic regression would work well

Revenue_per_asset_lag1 is irregularly distributed among the ten companies with very high peaks at 0.4 to 0.45 then 0.5 to 0.55

This is how many dollars of revenue the company generates for every dollar invested in assets

Its not a bell shaped curve/ not normal distribution

The range is between 0.35 - 0.7, so for every dollar invested in the companies’ assets, the companies generate between 35 to 70 cents. It is a narrow range that indicates that some companies have high asset efficiency of around 50 cents per dollar and others have lower asset efficiency of about 40 cents per dollar.

What this means for my project

  • I am expecting outliers but not extreme outliers

  • Does not require transformation

  • More visualization by boxplots needed.

ebit_lag1, operating_expense_lag1, cash_lag1, debt_lag1, operating_cf_lag1, eps_lag1 features are right-skewed

What this means for my project

  • These features represent monetary values where most companies cluster at lower values, while few companies cluster at large values which is an expected observation.

  • The large values may distort the model, there is need for log transformation to handle the skewness.

  • Feature scaling is needed to standardize or bring the features to a comparable scale.

Data Cleaning

Check for missing values and patterns if any

## 
##  FALSE   TRUE 
## 185400  27600
##                           company                              year 
##                                 0                                 0 
##                           revenue                    domestic_sales 
##                                 0                                 0 
##                      export_sales                              cogs 
##                                 0                                 0 
##                      gross_profit                 operating_expense 
##                                 0                                 0 
##                              ebit                            ebitda 
##                                 0                                 0 
##                        net_profit                            assets 
##                                 0                                 0 
##                            equity                              debt 
##                                 0                                 0 
##                              cash                         inventory 
##                                 0                                 0 
##                        receivable                         employees 
##                                 0                                 0 
##                            shares                      operating_cf 
##                                 0                                 0 
##                      investing_cf                      financing_cf 
##                                 0                                 0 
##                    revenue_growth                    domestic_share 
##                               100                                 0 
##                      export_share              revenue_per_employee 
##                                 0                                 0 
##                 revenue_per_asset                revenue_per_equity 
##                                 0                                 0 
##                 revenue_per_share                      gross_margin 
##                                 0                                 0 
##                       ebit_margin                     ebitda_margin 
##                                 0                                 0 
##                        net_margin                     profit_growth 
##                                 0                               100 
##                       cogs_margin                        opex_ratio 
##                                 0                                 0 
##                    asset_turnover                      equity_ratio 
##                                 0                                 0 
##                        debt_ratio                        cash_ratio 
##                                 0                                 0 
##                   working_capital                               fcf 
##                                 0                                 0 
##                        ocf_margin                inventory_turnover 
##                                 0                                 0 
##               receivable_turnover                               eps 
##                                 0                                 0 
##              book_value_per_share                         year_lag1 
##                                 0                               100 
##                         year_lag2                 year_rolling_mean 
##                               200                               200 
##                  year_rolling_std                     year_momentum 
##                               200                               100 
##                      revenue_lag1                      revenue_lag2 
##                               100                               200 
##              revenue_rolling_mean               revenue_rolling_std 
##                               200                               200 
##                  revenue_momentum               domestic_sales_lag1 
##                               100                               100 
##               domestic_sales_lag2       domestic_sales_rolling_mean 
##                               200                               200 
##        domestic_sales_rolling_std           domestic_sales_momentum 
##                               200                               100 
##                 export_sales_lag1                 export_sales_lag2 
##                               100                               200 
##         export_sales_rolling_mean          export_sales_rolling_std 
##                               200                               200 
##             export_sales_momentum                         cogs_lag1 
##                               100                               100 
##                         cogs_lag2                 cogs_rolling_mean 
##                               200                               200 
##                  cogs_rolling_std                     cogs_momentum 
##                               200                               100 
##                 gross_profit_lag1                 gross_profit_lag2 
##                               100                               200 
##         gross_profit_rolling_mean          gross_profit_rolling_std 
##                               200                               200 
##             gross_profit_momentum            operating_expense_lag1 
##                               100                               100 
##            operating_expense_lag2    operating_expense_rolling_mean 
##                               200                               200 
##     operating_expense_rolling_std        operating_expense_momentum 
##                               200                               100 
##                         ebit_lag1                         ebit_lag2 
##                               100                               200 
##                 ebit_rolling_mean                  ebit_rolling_std 
##                               200                               200 
##                     ebit_momentum                       ebitda_lag1 
##                               100                               100 
##                       ebitda_lag2               ebitda_rolling_mean 
##                               200                               200 
##                ebitda_rolling_std                   ebitda_momentum 
##                               200                               100 
##                   net_profit_lag1                   net_profit_lag2 
##                               100                               200 
##           net_profit_rolling_mean            net_profit_rolling_std 
##                               200                               200 
##               net_profit_momentum                       assets_lag1 
##                               100                               100 
##                       assets_lag2               assets_rolling_mean 
##                               200                               200 
##                assets_rolling_std                   assets_momentum 
##                               200                               100 
##                       equity_lag1                       equity_lag2 
##                               100                               200 
##               equity_rolling_mean                equity_rolling_std 
##                               200                               200 
##                   equity_momentum                         debt_lag1 
##                               100                               100 
##                         debt_lag2                 debt_rolling_mean 
##                               200                               200 
##                  debt_rolling_std                     debt_momentum 
##                               200                               100 
##                         cash_lag1                         cash_lag2 
##                               100                               200 
##                 cash_rolling_mean                  cash_rolling_std 
##                               200                               200 
##                     cash_momentum                    inventory_lag1 
##                               100                               100 
##                    inventory_lag2            inventory_rolling_mean 
##                               200                               200 
##             inventory_rolling_std                inventory_momentum 
##                               200                               100 
##                   receivable_lag1                   receivable_lag2 
##                               100                               200 
##           receivable_rolling_mean            receivable_rolling_std 
##                               200                               200 
##               receivable_momentum                    employees_lag1 
##                               100                               100 
##                    employees_lag2            employees_rolling_mean 
##                               200                               200 
##             employees_rolling_std                employees_momentum 
##                               200                               100 
##                       shares_lag1                       shares_lag2 
##                               100                               200 
##               shares_rolling_mean                shares_rolling_std 
##                               200                               200 
##                   shares_momentum                 operating_cf_lag1 
##                               100                               100 
##                 operating_cf_lag2         operating_cf_rolling_mean 
##                               200                               200 
##          operating_cf_rolling_std             operating_cf_momentum 
##                               200                               100 
##                 investing_cf_lag1                 investing_cf_lag2 
##                               100                               200 
##         investing_cf_rolling_mean          investing_cf_rolling_std 
##                               200                               200 
##             investing_cf_momentum                 financing_cf_lag1 
##                               100                               100 
##                 financing_cf_lag2         financing_cf_rolling_mean 
##                               200                               200 
##          financing_cf_rolling_std             financing_cf_momentum 
##                               200                               100 
##               revenue_growth_lag1               revenue_growth_lag2 
##                               200                               300 
##       revenue_growth_rolling_mean        revenue_growth_rolling_std 
##                               300                               300 
##           revenue_growth_momentum               domestic_share_lag1 
##                               200                               100 
##               domestic_share_lag2       domestic_share_rolling_mean 
##                               200                               200 
##        domestic_share_rolling_std           domestic_share_momentum 
##                               200                               100 
##                 export_share_lag1                 export_share_lag2 
##                               100                               200 
##         export_share_rolling_mean          export_share_rolling_std 
##                               200                               200 
##             export_share_momentum         revenue_per_employee_lag1 
##                               100                               100 
##         revenue_per_employee_lag2 revenue_per_employee_rolling_mean 
##                               200                               200 
##  revenue_per_employee_rolling_std     revenue_per_employee_momentum 
##                               200                               100 
##            revenue_per_asset_lag1            revenue_per_asset_lag2 
##                               100                               200 
##    revenue_per_asset_rolling_mean     revenue_per_asset_rolling_std 
##                               200                               200 
##        revenue_per_asset_momentum           revenue_per_equity_lag1 
##                               100                               100 
##           revenue_per_equity_lag2   revenue_per_equity_rolling_mean 
##                               200                               200 
##    revenue_per_equity_rolling_std       revenue_per_equity_momentum 
##                               200                               100 
##            revenue_per_share_lag1            revenue_per_share_lag2 
##                               100                               200 
##    revenue_per_share_rolling_mean     revenue_per_share_rolling_std 
##                               200                               200 
##        revenue_per_share_momentum                 gross_margin_lag1 
##                               100                               100 
##                 gross_margin_lag2         gross_margin_rolling_mean 
##                               200                               200 
##          gross_margin_rolling_std             gross_margin_momentum 
##                               200                               100 
##                  ebit_margin_lag1                  ebit_margin_lag2 
##                               100                               200 
##          ebit_margin_rolling_mean           ebit_margin_rolling_std 
##                               200                               200 
##              ebit_margin_momentum                ebitda_margin_lag1 
##                               100                               100 
##                ebitda_margin_lag2        ebitda_margin_rolling_mean 
##                               200                               200 
##         ebitda_margin_rolling_std            ebitda_margin_momentum 
##                               200                               100 
##                   net_margin_lag1                   net_margin_lag2 
##                               100                               200 
##           net_margin_rolling_mean            net_margin_rolling_std 
##                               200                               200 
##               net_margin_momentum                profit_growth_lag1 
##                               100                               200 
##                profit_growth_lag2        profit_growth_rolling_mean 
##                               300                               300 
##            average_inventory_lag1           inventory_turnover_lag1 
##                               200                               200 
##                          eps_lag1 
##                               100

Key findings

The first row of the first year of every company has missing values from lag_1 to the last column which totals to 100 rows of missing values.

  • This is expected because lag, growth and momentum variables can only be calculated from the previous year performance hence the first observation will be blank.

The column year lag_2 has 200 missing values.

  • This is expected because there are missing lag, growth and momentum variables are missing from year 2016

PLAN

Removal of these rows

finance_df2 <- finance_df %>%
  filter(year!= 2016)
# verify
table(nrow(finance_df2))
## 
## 900 
##   1
##                           company                              year 
##                                 0                                 0 
##                           revenue                    domestic_sales 
##                                 0                                 0 
##                      export_sales                              cogs 
##                                 0                                 0 
##                      gross_profit                 operating_expense 
##                                 0                                 0 
##                              ebit                            ebitda 
##                                 0                                 0 
##                        net_profit                            assets 
##                                 0                                 0 
##                            equity                              debt 
##                                 0                                 0 
##                              cash                         inventory 
##                                 0                                 0 
##                        receivable                         employees 
##                                 0                                 0 
##                            shares                      operating_cf 
##                                 0                                 0 
##                      investing_cf                      financing_cf 
##                                 0                                 0 
##                    revenue_growth                    domestic_share 
##                                 0                                 0 
##                      export_share              revenue_per_employee 
##                                 0                                 0 
##                 revenue_per_asset                revenue_per_equity 
##                                 0                                 0 
##                 revenue_per_share                      gross_margin 
##                                 0                                 0 
##                       ebit_margin                     ebitda_margin 
##                                 0                                 0 
##                        net_margin                     profit_growth 
##                                 0                                 0 
##                       cogs_margin                        opex_ratio 
##                                 0                                 0 
##                    asset_turnover                      equity_ratio 
##                                 0                                 0 
##                        debt_ratio                        cash_ratio 
##                                 0                                 0 
##                   working_capital                               fcf 
##                                 0                                 0 
##                        ocf_margin                inventory_turnover 
##                                 0                                 0 
##               receivable_turnover                               eps 
##                                 0                                 0 
##              book_value_per_share                         year_lag1 
##                                 0                                 0 
##                         year_lag2                 year_rolling_mean 
##                               100                               100 
##                  year_rolling_std                     year_momentum 
##                               100                                 0 
##                      revenue_lag1                      revenue_lag2 
##                                 0                               100 
##              revenue_rolling_mean               revenue_rolling_std 
##                               100                               100 
##                  revenue_momentum               domestic_sales_lag1 
##                                 0                                 0 
##               domestic_sales_lag2       domestic_sales_rolling_mean 
##                               100                               100 
##        domestic_sales_rolling_std           domestic_sales_momentum 
##                               100                                 0 
##                 export_sales_lag1                 export_sales_lag2 
##                                 0                               100 
##         export_sales_rolling_mean          export_sales_rolling_std 
##                               100                               100 
##             export_sales_momentum                         cogs_lag1 
##                                 0                                 0 
##                         cogs_lag2                 cogs_rolling_mean 
##                               100                               100 
##                  cogs_rolling_std                     cogs_momentum 
##                               100                                 0 
##                 gross_profit_lag1                 gross_profit_lag2 
##                                 0                               100 
##         gross_profit_rolling_mean          gross_profit_rolling_std 
##                               100                               100 
##             gross_profit_momentum            operating_expense_lag1 
##                                 0                                 0 
##            operating_expense_lag2    operating_expense_rolling_mean 
##                               100                               100 
##     operating_expense_rolling_std        operating_expense_momentum 
##                               100                                 0 
##                         ebit_lag1                         ebit_lag2 
##                                 0                               100 
##                 ebit_rolling_mean                  ebit_rolling_std 
##                               100                               100 
##                     ebit_momentum                       ebitda_lag1 
##                                 0                                 0 
##                       ebitda_lag2               ebitda_rolling_mean 
##                               100                               100 
##                ebitda_rolling_std                   ebitda_momentum 
##                               100                                 0 
##                   net_profit_lag1                   net_profit_lag2 
##                                 0                               100 
##           net_profit_rolling_mean            net_profit_rolling_std 
##                               100                               100 
##               net_profit_momentum                       assets_lag1 
##                                 0                                 0 
##                       assets_lag2               assets_rolling_mean 
##                               100                               100 
##                assets_rolling_std                   assets_momentum 
##                               100                                 0 
##                       equity_lag1                       equity_lag2 
##                                 0                               100 
##               equity_rolling_mean                equity_rolling_std 
##                               100                               100 
##                   equity_momentum                         debt_lag1 
##                                 0                                 0 
##                         debt_lag2                 debt_rolling_mean 
##                               100                               100 
##                  debt_rolling_std                     debt_momentum 
##                               100                                 0 
##                         cash_lag1                         cash_lag2 
##                                 0                               100 
##                 cash_rolling_mean                  cash_rolling_std 
##                               100                               100 
##                     cash_momentum                    inventory_lag1 
##                                 0                                 0 
##                    inventory_lag2            inventory_rolling_mean 
##                               100                               100 
##             inventory_rolling_std                inventory_momentum 
##                               100                                 0 
##                   receivable_lag1                   receivable_lag2 
##                                 0                               100 
##           receivable_rolling_mean            receivable_rolling_std 
##                               100                               100 
##               receivable_momentum                    employees_lag1 
##                                 0                                 0 
##                    employees_lag2            employees_rolling_mean 
##                               100                               100 
##             employees_rolling_std                employees_momentum 
##                               100                                 0 
##                       shares_lag1                       shares_lag2 
##                                 0                               100 
##               shares_rolling_mean                shares_rolling_std 
##                               100                               100 
##                   shares_momentum                 operating_cf_lag1 
##                                 0                                 0 
##                 operating_cf_lag2         operating_cf_rolling_mean 
##                               100                               100 
##          operating_cf_rolling_std             operating_cf_momentum 
##                               100                                 0 
##                 investing_cf_lag1                 investing_cf_lag2 
##                                 0                               100 
##         investing_cf_rolling_mean          investing_cf_rolling_std 
##                               100                               100 
##             investing_cf_momentum                 financing_cf_lag1 
##                                 0                                 0 
##                 financing_cf_lag2         financing_cf_rolling_mean 
##                               100                               100 
##          financing_cf_rolling_std             financing_cf_momentum 
##                               100                                 0 
##               revenue_growth_lag1               revenue_growth_lag2 
##                               100                               200 
##       revenue_growth_rolling_mean        revenue_growth_rolling_std 
##                               200                               200 
##           revenue_growth_momentum               domestic_share_lag1 
##                               100                                 0 
##               domestic_share_lag2       domestic_share_rolling_mean 
##                               100                               100 
##        domestic_share_rolling_std           domestic_share_momentum 
##                               100                                 0 
##                 export_share_lag1                 export_share_lag2 
##                                 0                               100 
##         export_share_rolling_mean          export_share_rolling_std 
##                               100                               100 
##             export_share_momentum         revenue_per_employee_lag1 
##                                 0                                 0 
##         revenue_per_employee_lag2 revenue_per_employee_rolling_mean 
##                               100                               100 
##  revenue_per_employee_rolling_std     revenue_per_employee_momentum 
##                               100                                 0 
##            revenue_per_asset_lag1            revenue_per_asset_lag2 
##                                 0                               100 
##    revenue_per_asset_rolling_mean     revenue_per_asset_rolling_std 
##                               100                               100 
##        revenue_per_asset_momentum           revenue_per_equity_lag1 
##                                 0                                 0 
##           revenue_per_equity_lag2   revenue_per_equity_rolling_mean 
##                               100                               100 
##    revenue_per_equity_rolling_std       revenue_per_equity_momentum 
##                               100                                 0 
##            revenue_per_share_lag1            revenue_per_share_lag2 
##                                 0                               100 
##    revenue_per_share_rolling_mean     revenue_per_share_rolling_std 
##                               100                               100 
##        revenue_per_share_momentum                 gross_margin_lag1 
##                                 0                                 0 
##                 gross_margin_lag2         gross_margin_rolling_mean 
##                               100                               100 
##          gross_margin_rolling_std             gross_margin_momentum 
##                               100                                 0 
##                  ebit_margin_lag1                  ebit_margin_lag2 
##                                 0                               100 
##          ebit_margin_rolling_mean           ebit_margin_rolling_std 
##                               100                               100 
##              ebit_margin_momentum                ebitda_margin_lag1 
##                                 0                                 0 
##                ebitda_margin_lag2        ebitda_margin_rolling_mean 
##                               100                               100 
##         ebitda_margin_rolling_std            ebitda_margin_momentum 
##                               100                                 0 
##                   net_margin_lag1                   net_margin_lag2 
##                                 0                               100 
##           net_margin_rolling_mean            net_margin_rolling_std 
##                               100                               100 
##               net_margin_momentum                profit_growth_lag1 
##                                 0                               100 
##                profit_growth_lag2        profit_growth_rolling_mean 
##                               200                               200 
##            average_inventory_lag1           inventory_turnover_lag1 
##                               100                               100 
##                          eps_lag1 
##                                 0
## 
##   0 100 200 
## 110  98   5

Key findings

The year 2017 has missing values for ‘lag_2’, ‘rolling_mean’, ‘rolling_std’. The missing lag 2 values are because of of the missing previous lag 1 values of 2016.

  • The best solution for this case would be removal of the 2017 rows.The dataset will be samller but replacement of the missing values would be making up false data as lag 2 variables require lag 1 to be calculated.
# remove year 20167 for all the companies
finance_df2 <- finance_df2 %>%
  filter(year != 2017)

# verify the changes
nrow(finance_df2)
## [1] 800
# Confirm the changes using the selected features
features_to_plot <- c("revenue_growth_lag1",
                      "ebit_lag1",
                      "operating_expense_lag1",
                      "revenue_per_asset_lag1",
                      "cash_lag1",
                      "debt_lag1",
                      "operating_cf_lag1",
                      "inventory_turnover_lag1",
                      "eps_lag1",
                      "revenue_growth_momentum"
                    )
colSums(is.na(finance_df2[, features_to_plot]))
##     revenue_growth_lag1               ebit_lag1  operating_expense_lag1 
##                       0                       0                       0 
##  revenue_per_asset_lag1               cash_lag1               debt_lag1 
##                       0                       0                       0 
##       operating_cf_lag1 inventory_turnover_lag1                eps_lag1 
##                       0                       0                       0 
## revenue_growth_momentum 
##                       0
# check for missing values for the whole dataset
table(is.na(finance_df2))
## 
##  FALSE   TRUE 
## 169900    500
##                           company                              year 
##                                 0                                 0 
##                           revenue                    domestic_sales 
##                                 0                                 0 
##                      export_sales                              cogs 
##                                 0                                 0 
##                      gross_profit                 operating_expense 
##                                 0                                 0 
##                              ebit                            ebitda 
##                                 0                                 0 
##                        net_profit                            assets 
##                                 0                                 0 
##                            equity                              debt 
##                                 0                                 0 
##                              cash                         inventory 
##                                 0                                 0 
##                        receivable                         employees 
##                                 0                                 0 
##                            shares                      operating_cf 
##                                 0                                 0 
##                      investing_cf                      financing_cf 
##                                 0                                 0 
##                    revenue_growth                    domestic_share 
##                                 0                                 0 
##                      export_share              revenue_per_employee 
##                                 0                                 0 
##                 revenue_per_asset                revenue_per_equity 
##                                 0                                 0 
##                 revenue_per_share                      gross_margin 
##                                 0                                 0 
##                       ebit_margin                     ebitda_margin 
##                                 0                                 0 
##                        net_margin                     profit_growth 
##                                 0                                 0 
##                       cogs_margin                        opex_ratio 
##                                 0                                 0 
##                    asset_turnover                      equity_ratio 
##                                 0                                 0 
##                        debt_ratio                        cash_ratio 
##                                 0                                 0 
##                   working_capital                               fcf 
##                                 0                                 0 
##                        ocf_margin                inventory_turnover 
##                                 0                                 0 
##               receivable_turnover                               eps 
##                                 0                                 0 
##              book_value_per_share                         year_lag1 
##                                 0                                 0 
##                         year_lag2                 year_rolling_mean 
##                                 0                                 0 
##                  year_rolling_std                     year_momentum 
##                                 0                                 0 
##                      revenue_lag1                      revenue_lag2 
##                                 0                                 0 
##              revenue_rolling_mean               revenue_rolling_std 
##                                 0                                 0 
##                  revenue_momentum               domestic_sales_lag1 
##                                 0                                 0 
##               domestic_sales_lag2       domestic_sales_rolling_mean 
##                                 0                                 0 
##        domestic_sales_rolling_std           domestic_sales_momentum 
##                                 0                                 0 
##                 export_sales_lag1                 export_sales_lag2 
##                                 0                                 0 
##         export_sales_rolling_mean          export_sales_rolling_std 
##                                 0                                 0 
##             export_sales_momentum                         cogs_lag1 
##                                 0                                 0 
##                         cogs_lag2                 cogs_rolling_mean 
##                                 0                                 0 
##                  cogs_rolling_std                     cogs_momentum 
##                                 0                                 0 
##                 gross_profit_lag1                 gross_profit_lag2 
##                                 0                                 0 
##         gross_profit_rolling_mean          gross_profit_rolling_std 
##                                 0                                 0 
##             gross_profit_momentum            operating_expense_lag1 
##                                 0                                 0 
##            operating_expense_lag2    operating_expense_rolling_mean 
##                                 0                                 0 
##     operating_expense_rolling_std        operating_expense_momentum 
##                                 0                                 0 
##                         ebit_lag1                         ebit_lag2 
##                                 0                                 0 
##                 ebit_rolling_mean                  ebit_rolling_std 
##                                 0                                 0 
##                     ebit_momentum                       ebitda_lag1 
##                                 0                                 0 
##                       ebitda_lag2               ebitda_rolling_mean 
##                                 0                                 0 
##                ebitda_rolling_std                   ebitda_momentum 
##                                 0                                 0 
##                   net_profit_lag1                   net_profit_lag2 
##                                 0                                 0 
##           net_profit_rolling_mean            net_profit_rolling_std 
##                                 0                                 0 
##               net_profit_momentum                       assets_lag1 
##                                 0                                 0 
##                       assets_lag2               assets_rolling_mean 
##                                 0                                 0 
##                assets_rolling_std                   assets_momentum 
##                                 0                                 0 
##                       equity_lag1                       equity_lag2 
##                                 0                                 0 
##               equity_rolling_mean                equity_rolling_std 
##                                 0                                 0 
##                   equity_momentum                         debt_lag1 
##                                 0                                 0 
##                         debt_lag2                 debt_rolling_mean 
##                                 0                                 0 
##                  debt_rolling_std                     debt_momentum 
##                                 0                                 0 
##                         cash_lag1                         cash_lag2 
##                                 0                                 0 
##                 cash_rolling_mean                  cash_rolling_std 
##                                 0                                 0 
##                     cash_momentum                    inventory_lag1 
##                                 0                                 0 
##                    inventory_lag2            inventory_rolling_mean 
##                                 0                                 0 
##             inventory_rolling_std                inventory_momentum 
##                                 0                                 0 
##                   receivable_lag1                   receivable_lag2 
##                                 0                                 0 
##           receivable_rolling_mean            receivable_rolling_std 
##                                 0                                 0 
##               receivable_momentum                    employees_lag1 
##                                 0                                 0 
##                    employees_lag2            employees_rolling_mean 
##                                 0                                 0 
##             employees_rolling_std                employees_momentum 
##                                 0                                 0 
##                       shares_lag1                       shares_lag2 
##                                 0                                 0 
##               shares_rolling_mean                shares_rolling_std 
##                                 0                                 0 
##                   shares_momentum                 operating_cf_lag1 
##                                 0                                 0 
##                 operating_cf_lag2         operating_cf_rolling_mean 
##                                 0                                 0 
##          operating_cf_rolling_std             operating_cf_momentum 
##                                 0                                 0 
##                 investing_cf_lag1                 investing_cf_lag2 
##                                 0                                 0 
##         investing_cf_rolling_mean          investing_cf_rolling_std 
##                                 0                                 0 
##             investing_cf_momentum                 financing_cf_lag1 
##                                 0                                 0 
##                 financing_cf_lag2         financing_cf_rolling_mean 
##                                 0                                 0 
##          financing_cf_rolling_std             financing_cf_momentum 
##                                 0                                 0 
##               revenue_growth_lag1               revenue_growth_lag2 
##                                 0                               100 
##       revenue_growth_rolling_mean        revenue_growth_rolling_std 
##                               100                               100 
##           revenue_growth_momentum               domestic_share_lag1 
##                                 0                                 0 
##               domestic_share_lag2       domestic_share_rolling_mean 
##                                 0                                 0 
##        domestic_share_rolling_std           domestic_share_momentum 
##                                 0                                 0 
##                 export_share_lag1                 export_share_lag2 
##                                 0                                 0 
##         export_share_rolling_mean          export_share_rolling_std 
##                                 0                                 0 
##             export_share_momentum         revenue_per_employee_lag1 
##                                 0                                 0 
##         revenue_per_employee_lag2 revenue_per_employee_rolling_mean 
##                                 0                                 0 
##  revenue_per_employee_rolling_std     revenue_per_employee_momentum 
##                                 0                                 0 
##            revenue_per_asset_lag1            revenue_per_asset_lag2 
##                                 0                                 0 
##    revenue_per_asset_rolling_mean     revenue_per_asset_rolling_std 
##                                 0                                 0 
##        revenue_per_asset_momentum           revenue_per_equity_lag1 
##                                 0                                 0 
##           revenue_per_equity_lag2   revenue_per_equity_rolling_mean 
##                                 0                                 0 
##    revenue_per_equity_rolling_std       revenue_per_equity_momentum 
##                                 0                                 0 
##            revenue_per_share_lag1            revenue_per_share_lag2 
##                                 0                                 0 
##    revenue_per_share_rolling_mean     revenue_per_share_rolling_std 
##                                 0                                 0 
##        revenue_per_share_momentum                 gross_margin_lag1 
##                                 0                                 0 
##                 gross_margin_lag2         gross_margin_rolling_mean 
##                                 0                                 0 
##          gross_margin_rolling_std             gross_margin_momentum 
##                                 0                                 0 
##                  ebit_margin_lag1                  ebit_margin_lag2 
##                                 0                                 0 
##          ebit_margin_rolling_mean           ebit_margin_rolling_std 
##                                 0                                 0 
##              ebit_margin_momentum                ebitda_margin_lag1 
##                                 0                                 0 
##                ebitda_margin_lag2        ebitda_margin_rolling_mean 
##                                 0                                 0 
##         ebitda_margin_rolling_std            ebitda_margin_momentum 
##                                 0                                 0 
##                   net_margin_lag1                   net_margin_lag2 
##                                 0                                 0 
##           net_margin_rolling_mean            net_margin_rolling_std 
##                                 0                                 0 
##               net_margin_momentum                profit_growth_lag1 
##                                 0                                 0 
##                profit_growth_lag2        profit_growth_rolling_mean 
##                               100                               100 
##            average_inventory_lag1           inventory_turnover_lag1 
##                                 0                                 0 
##                          eps_lag1 
##                                 0

Key findings

Missing values in:

  • profit_growth_rolling_mean 100

  • profit_growth_lag2 100

  • revenue_growth_rolling_std 100

  • revenue_growth_rolling_mean 100

  • revenue_growth_lag2 100

The reason for these missing values are that lag 2 variables require lag 1 values to be calculated.

Besides the above list, there are no other missing values

Recap

  • Started with 1000 rows

  • Removed 100 rows for year 2016 due to a lack of lag variables that require prior year. this dataset starts from 2016.

  • Removed 100 rows for 2017 because of missing lag 2 variables that requires lag 1 variables from 2016 to be calculated.

  • Currently at 800 observations from 2018 to 2025

  • Identified 500 missing lag 2 Variables from profit_growth & revenue_growth. This is because they require lag 1 values from 2017 in order to be filled.

I will move forward with my data set as is because only the selected features will be used to build a model.

Correlation plot visualization

Visualize the selected features using a correlation plot to check for feature to feature relationship

corr_matrix <- cor(finance_df2[, features_to_plot], use = 'complete.obs')

# correlation plot
corrplot(corr_matrix,
         method = 'color',
         type = 'upper',
         tl.cex = 0.8,
         addCoef.col = 'black',
         number.cex = 0.7,
         tl.srt = 45,
         title = "Feature correlation heatmap",
         mar = c(0,0,2,1))

### Key findings

High correlation features

  • ebit_lag1 & cash_lag1 0.79

  • ebit_lag1 & debt-lag 1 0.80

  • ebit_lag1 & operating_cf_lag1 0.98 - very high

  • ebit_lag1 & operating_cf_lag1 0.78

  • operating_expense_lag1 & cash_lag1 0.86

  • operating_expense_lag1 & debt_lag1 0.87

  • operating_expense_lag1 & operating_cf_lag1 0.87

  • cash_lag1 & debt_lag1 0.78

  • cash_lag1 & operating_cf_lag1 0.82

  • debt_lag1 & operating_cf_lag1 0.82

  • revenue_growth_lag1 & revenue_growth_momentum -0.81 (negatively correlated)

The heat map shows high multicollinearity among absolute features with ebit_lag1 & operating_cf_lag1 the highest at 0.98 which could mean that these tow features hava alsmot similar information.

The identified high correlation features scale in billions and this is not unusual for larger companies.

The multicollinearity shows there is a strong negative correlation between revenue_growth_lag1 & revenue_growth_momentum at -0.81.

  • Companies with high revenue growth in T-1 have slower revenue momentum in year T which is financially intuitive

Create target variable

## 
##  no yes 
## 334 466
## 
##     no    yes 
## 0.4175 0.5825

Key findings

58.25% of companies have had profit growth and 41.75% of companies have not had profit growth.

The target variable is reasonably balanced ruling out class imbalance.

# create the final modeling dataset
model_features <- c("year",
                    "revenue_growth_lag1",
                    "ebit_lag1",
                    "operating_expense_lag1",
                    "revenue_per_asset_lag1",
                    "cash_lag1",
                    "debt_lag1",
                    "operating_cf_lag1",
                    "inventory_turnover_lag1",
                    "eps_lag1",
                    "revenue_growth_momentum",
                    "profit_growth_binary"
                    )
finance_model_df <- finance_df2%>%
  select(all_of(model_features))

# verify the final model dataset
colSums(is.na(finance_model_df))
##                    year     revenue_growth_lag1               ebit_lag1 
##                       0                       0                       0 
##  operating_expense_lag1  revenue_per_asset_lag1               cash_lag1 
##                       0                       0                       0 
##               debt_lag1       operating_cf_lag1 inventory_turnover_lag1 
##                       0                       0                       0 
##                eps_lag1 revenue_growth_momentum    profit_growth_binary 
##                       0                       0                       0
nrow(finance_model_df)
## [1] 800

Key findings

  • I have a data set from features that represent all the feature categories of the finance data set.

  • Visualize the final modeling dataset with boxplots to understand feature vs target relationship

Box plot visualization

Profit is affected by

→ Interest payments on debt → Tax obligations → Depreciation and amortization → One-time charges

Key findings

Revenue_growth_lag1 -> There is almost an overlap for companies that made profit and those that didn’t on the basis of revenue growth. This may not be a strong predictor for profit growth

  • I am not concerned about outliers for this feature hence no waarnt for log transformation.

ebit_lag1 (earnings before interest and taxes) -> companies that had higher earnings before interest did not make profit as much as those that had low earnings before interest in the previous year.

  • There is a slight overlap at the bottom but the ‘no’ group has a higher median.

  • So a company can have high EBIT but low profit growth because of high debt which leads to high interest payments hence high tax burdens or high investment costs therefore large depreciation costs.

  • These eat into profit after EBIT is calculated.

  • This feature shows moderate predictive potential but more outliers in the ‘No’ group warrant for log transformation for modeling.

Operating_expense_lag1 -> The boxes have significant overlap making this a weak predictor for profit growth.

  • However, the ‘yes’ group shows higher operating expenses which could be that companies that are making profit are investing more into operations aiming for profit growth.

  • There are more outliers in the ‘yes’ group which warrants for log transformation before modeling.

Revenue_per_asset -> Companies that did not make profit in the prior year made more revenue per asset

  • A company with no profit but high revenue per asset is likely efficient at generating sales but inefficient at controlling the costs.

  • I am not concerned about outliers for this feature hence no warrant for log transformation.

Cash_lag1 -> Immediate cash flow is slightly higher in companies that made profit than in companies that didn’t make profit.

  • The outliers in both the ‘yes’ and ‘no’ groups warrant for log transformation.

debt_lag1 -> debt is almost equal with companies that made profit in the prior year being slightly higher than companies that didn’t make profit. Not a very strong predictor of profit growth.

Operating cash flow of the previous year is higher in companies that didn’t make profit in the prior year. This could be due to high non-cash expenses and most likely that companies like these have high ebit and ebitda. This will be a strong predictor for the profit growth

Inventory turnover lag1 is low for companies that didn’t make profit, this could also be related to those with high operating cash flow but low profit.

earnings per share is almost equal in companies that made profit and those that didn’t. This will be a weak predictor for profit growth

  • The outliers in both the ‘yes’ and ‘no’ groups warrant for log transformation.

Revenue growth is slightly higher in companies that made profit. Sometimes companies can have revenue growth at profit loss because of efficiency improvements.

  • Based on the bar plot it is a moderate predictor for profit growth.

Verify selected features for modeling

# check the features for modeling
str(finance_model_df)
## tibble [800 × 12] (S3: tbl_df/tbl/data.frame)
##  $ year                   : int [1:800] 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
##  $ revenue_growth_lag1    : num [1:800] 0.011 0.1248 0.0829 0.2608 0.1424 ...
##  $ ebit_lag1              : num [1:800] 1.60e+09 2.56e+09 6.59e+09 5.43e+09 2.97e+09 ...
##  $ operating_expense_lag1 : num [1:800] 1.88e+09 5.06e+09 4.21e+09 4.68e+09 5.54e+09 ...
##  $ revenue_per_asset_lag1 : num [1:800] 0.555 0.564 0.383 0.581 0.448 ...
##  $ cash_lag1              : num [1:800] 9.23e+08 2.16e+09 2.11e+09 1.83e+09 1.73e+09 ...
##  $ debt_lag1              : num [1:800] 7.78e+09 1.24e+10 1.92e+10 1.26e+10 2.01e+10 ...
##  $ operating_cf_lag1      : num [1:800] 1.94e+09 2.96e+09 5.00e+09 5.52e+09 3.44e+09 ...
##  $ inventory_turnover_lag1: num [1:800] 2.89 5.05 2.96 4.25 3.36 ...
##  $ eps_lag1               : num [1:800] 5.4 7.11 9.36 42.01 5.77 ...
##  $ revenue_growth_momentum: num [1:800] 0.1642 -0.0389 0.1378 -0.2153 -0.1334 ...
##  $ profit_growth_binary   : chr [1:800] "no" "yes" "yes" "no" ...

Note

Year is included in the dataframe solely for temporal train/test splitting and will be excluded from features.

The goal of the project is to predict future profit growth using historical financial signals, hence the temporal splits is a better fit to ensure the model is trained on earlier years (2018 - 2022) to predict the future (2023 - 2025).

## tibble [800 × 12] (S3: tbl_df/tbl/data.frame)
##  $ year                   : int [1:800] 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
##  $ revenue_growth_lag1    : num [1:800] 0.011 0.1248 0.0829 0.2608 0.1424 ...
##  $ ebit_lag1              : num [1:800] 1.60e+09 2.56e+09 6.59e+09 5.43e+09 2.97e+09 ...
##  $ operating_expense_lag1 : num [1:800] 1.88e+09 5.06e+09 4.21e+09 4.68e+09 5.54e+09 ...
##  $ revenue_per_asset_lag1 : num [1:800] 0.555 0.564 0.383 0.581 0.448 ...
##  $ cash_lag1              : num [1:800] 9.23e+08 2.16e+09 2.11e+09 1.83e+09 1.73e+09 ...
##  $ debt_lag1              : num [1:800] 7.78e+09 1.24e+10 1.92e+10 1.26e+10 2.01e+10 ...
##  $ operating_cf_lag1      : num [1:800] 1.94e+09 2.96e+09 5.00e+09 5.52e+09 3.44e+09 ...
##  $ inventory_turnover_lag1: num [1:800] 2.89 5.05 2.96 4.25 3.36 ...
##  $ eps_lag1               : num [1:800] 5.4 7.11 9.36 42.01 5.77 ...
##  $ revenue_growth_momentum: num [1:800] 0.1642 -0.0389 0.1378 -0.2153 -0.1334 ...
##  $ profit_growth_binary   : Factor w/ 2 levels "no","yes": 1 2 2 1 2 2 2 1 2 1 ...

Data Splitting

train_data <- finance_model_df %>%
  filter(year <= 2022) %>%
  select(-year)
  
test_data <- finance_model_df %>%
  filter(year >2022) %>%
  select(-year)

prop.table(table(nrow(train_data)))
## 
## 500 
##   1
prop.table(table(nrow(test_data)))
## 
## 300 
##   1
dim(finance_model_df)
## [1] 800  12
# Check the target variable distribution
prop.table(table(train_data$profit_growth_binary))
## 
##    no   yes 
## 0.402 0.598
prop.table(table(test_data$profit_growth_binary))
## 
##        no       yes 
## 0.4433333 0.5566667

Key findings

The temporal train/test split yielded 500 observations for training data and 300 observations for test data totaling to 800 observations.

Class distribution is consistent with Yes class (59.8% train data & 55.7% test data) with a difference of 4.1% indicating that the split is representative for both categories.

Feature Scaling

# Check minimum value of every feature
sapply(train_data[, features], min)
##     revenue_growth_lag1               ebit_lag1  operating_expense_lag1 
##           -1.197031e-01            1.316671e+08            1.695674e+08 
##  revenue_per_asset_lag1               cash_lag1               debt_lag1 
##            3.574419e-01            7.866745e+07            4.051942e+08 
##       operating_cf_lag1 inventory_turnover_lag1                eps_lag1 
##            1.368744e+08            2.253622e+00            2.971642e-01 
## revenue_growth_momentum 
##           -3.980441e-01

Key findings

Features that have negative values

  • Revenue growth lag1
  • Revenue growth momentum

Features with only positive values

  • ebit lag 1
  • operating expense lag 1
  • revenue per asset
  • cash lag 1
  • debt lag 1
  • operating cash flow lag 1
  • inventory turnover lag 1
  • eps lag 1

Log transformation

features <- c(
  "revenue_growth_lag1",
  "ebit_lag1",
  "operating_expense_lag1",
  "revenue_per_asset_lag1",
  "cash_lag1",
  "debt_lag1",
  "operating_cf_lag1",
  "inventory_turnover_lag1",
  "eps_lag1",
  "revenue_growth_momentum"
  )
log_features <- c(
  "ebit_lag1",
  "operating_expense_lag1",
  "revenue_per_asset_lag1",
  "cash_lag1",
  "debt_lag1",
  "operating_cf_lag1",
  "inventory_turnover_lag1",
  "eps_lag1"
) # removed "Revenue growth lag1" and "Revenue growth momentum"


# apply log transformation in both train and test data
train_data[, log_features] <- log(train_data[, log_features])
test_data[, log_features] <- log(test_data[, log_features])

# fit scaler on the training features only
scaler <- preProcess(train_data[, features],
                     method = c('center', 'scale'))

# apply to both train and test splits
# predict in this case means "apply log transformation"
# it takes in each value ((value - mean) / std)
train_scaled <- predict(scaler, train_data)
test_scaled <- predict(scaler, test_data)

# verify
summary(train_scaled)
##  revenue_growth_lag1   ebit_lag1       operating_expense_lag1
##  Min.   :-2.45654    Min.   :-3.4320   Min.   :-3.6445       
##  1st Qu.:-0.74312    1st Qu.:-0.5067   1st Qu.:-0.5847       
##  Median :-0.01572    Median : 0.1925   Median : 0.2370       
##  Mean   : 0.00000    Mean   : 0.0000   Mean   : 0.0000       
##  3rd Qu.: 0.65906    3rd Qu.: 0.6865   3rd Qu.: 0.6800       
##  Max.   : 2.57325    Max.   : 1.7756   Max.   : 1.8490       
##  revenue_per_asset_lag1   cash_lag1         debt_lag1       operating_cf_lag1
##  Min.   :-1.60637       Min.   :-3.3026   Min.   :-3.9295   Min.   :-3.4973  
##  1st Qu.:-0.85487       1st Qu.:-0.5667   1st Qu.:-0.5433   1st Qu.:-0.5055  
##  Median :-0.08155       Median : 0.1758   Median : 0.1716   Median : 0.2031  
##  Mean   : 0.00000       Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.85799       3rd Qu.: 0.7307   3rd Qu.: 0.6934   3rd Qu.: 0.6904  
##  Max.   : 1.94749       Max.   : 1.8999   Max.   : 2.0684   Max.   : 1.8392  
##  inventory_turnover_lag1    eps_lag1       revenue_growth_momentum
##  Min.   :-2.18588        Min.   :-3.1266   Min.   :-2.494253      
##  1st Qu.:-0.73815        1st Qu.:-0.5482   1st Qu.:-0.760654      
##  Median :-0.06625        Median : 0.1133   Median : 0.005399      
##  Mean   : 0.00000        Mean   : 0.0000   Mean   : 0.000000      
##  3rd Qu.: 0.75157        3rd Qu.: 0.6202   3rd Qu.: 0.725376      
##  Max.   : 2.90808        Max.   : 2.2708   Max.   : 2.625465      
##  profit_growth_binary
##  no :201             
##  yes:299             
##                      
##                      
##                      
## 
saveRDS("train_scaled", file = "log-reg progress")

Logistic Regression Model

set.seed(123)

logistic_regression_model <- glm(profit_growth_binary~., data = train_scaled, family = 'binomial')
summary(logistic_regression_model)
## 
## Call:
## glm(formula = profit_growth_binary ~ ., family = "binomial", 
##     data = train_scaled)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              0.597325   0.112304   5.319 1.04e-07 ***
## revenue_growth_lag1      0.710617   0.191050   3.720 0.000200 ***
## ebit_lag1               -2.626469   0.734742  -3.575 0.000351 ***
## operating_expense_lag1   2.012753   0.416034   4.838 1.31e-06 ***
## revenue_per_asset_lag1   0.078142   0.129511   0.603 0.546269    
## cash_lag1               -0.009429   0.279648  -0.034 0.973104    
## debt_lag1                0.741293   0.381997   1.941 0.052310 .  
## operating_cf_lag1       -0.306802   0.801414  -0.383 0.701849    
## inventory_turnover_lag1  0.153245   0.113649   1.348 0.177527    
## eps_lag1                 0.154140   0.184479   0.836 0.403411    
## revenue_growth_momentum  0.989173   0.194081   5.097 3.46e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 673.81  on 499  degrees of freedom
## Residual deviance: 529.55  on 489  degrees of freedom
## AIC: 551.55
## 
## Number of Fisher Scoring iterations: 5

Key findings

Based on the logistic model, the most significant features are:

  • revenue_growth_lag1
  • ebit_lag1
  • operating_expense_lag1
  • revenue_growth_momentum

The boxplot finding were that ebit_lag1, operating_expense_lag1, revenue_growth_momentum were moderate predictors but the model shows that they are the most significant.

The model contradicts my boxplot findings of operating cash flow lag 1 as a strong predictor as it is not significant.

revenue_growth_lag1 , ebit_lag1 are positive coefficients which means that they increase profit growth while operating_expense_lag1 & revenue_growth_momentum decrease profit growth.

ebit_lag1 -> Companies with highet prior ebit are more likley to not have profit growth because of large depreciation due to high tax burdens, high investment costs and high interest payments.

revenue_growth_lag 1 -> companies with higher prior revenue growth are more likely to have profit growth.

revenue_growth_momentum -> higher growth momentum is accelerating revenue which means that companies have captured the market share and possibly scaling and this leads to profit growth.

operating_expense_lag1 -> High operating expenses can lead to profit growth because it could be that companies these are making profit but investing more into operations aiming for more profit growth.

# make predictions using the model on the test data
logistic_regression_model_predictions <- factor(
  ifelse(predict(logistic_regression_model,
                 newdata = test_scaled) > 0.5,
         "yes", "no"),
  levels = c("no", "yes")
)

# create a confusion matrix 
lr_confusion_mtx <-confusionMatrix(logistic_regression_model_predictions,test_scaled$profit_growth_binary, positive = "yes")
lr_confusion_mtx
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  101  58
##        yes  32 109
##                                           
##                Accuracy : 0.7             
##                  95% CI : (0.6447, 0.7513)
##     No Information Rate : 0.5567          
##     P-Value [Acc > NIR] : 2.482e-07       
##                                           
##                   Kappa : 0.4041          
##                                           
##  Mcnemar's Test P-Value : 0.008408        
##                                           
##             Sensitivity : 0.6527          
##             Specificity : 0.7594          
##          Pos Pred Value : 0.7730          
##          Neg Pred Value : 0.6352          
##              Prevalence : 0.5567          
##          Detection Rate : 0.3633          
##    Detection Prevalence : 0.4700          
##       Balanced Accuracy : 0.7060          
##                                           
##        'Positive' Class : yes             
## 
# extract the output
accuracy_lr <- lr_confusion_mtx$overall ['Accuracy']
true_pos_rate_lr <- lr_confusion_mtx$byClass ['Sensitivity']
true_neg_rate_lr <- lr_confusion_mtx$byClass ['Specificity']

Key findings

The model achieved an Accuracy: 70% significantly exceeding the ‘No’ information rate of 55.67%,

The model shows stronger Specificity: 76% than Sensitivity: 65% indicating that it is better at identifying companies that are not likely to have profit growth than those that will have profit growth which could indicate financial ditress signals rather than growth signals.

The model correctly identifies 65.27% of profit growing companies but maintains 76% accuracy in uidentifying non-growth companies which is meaningful information but not the goal of the project and what the investors would expect.

Decision tree Model

Decision trees do not require feature scaling or normality like logistic regression because they split data based on threshold rather that distance. Hence, i will apply the same train_scaled and test_scaled datasets on here for consistency.

The target variable remains as a factor with levels “yes” and “no”

Decision trees are suitable for small datasets but they are prone to over fitting so i have to be careful with that.

# set seed
set.seed(123)

# build decision tree model
decision_tree_model <- rpart(profit_growth_binary~., data = train_scaled, method = 'class')
                             
# visualize the model
rpart.plot(decision_tree_model, 
           type = 4, 
           extra = 101, 
           fallen.leaves = TRUE, 
           main = "Decision Tree For Profit Growth")

# make prediction using the decision tree
decision_tree_predictions <- predict(decision_tree_model, test_scaled, type = 'class')

# Create the confusion matrix
confusionMatrix(decision_tree_predictions, test_scaled$profit_growth_binary, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no   43  33
##        yes  90 134
##                                          
##                Accuracy : 0.59           
##                  95% CI : (0.532, 0.6462)
##     No Information Rate : 0.5567         
##     P-Value [Acc > NIR] : 0.1347         
##                                          
##                   Kappa : 0.1314         
##                                          
##  Mcnemar's Test P-Value : 4.433e-07      
##                                          
##             Sensitivity : 0.8024         
##             Specificity : 0.3233         
##          Pos Pred Value : 0.5982         
##          Neg Pred Value : 0.5658         
##              Prevalence : 0.5567         
##          Detection Rate : 0.4467         
##    Detection Prevalence : 0.7467         
##       Balanced Accuracy : 0.5629         
##                                          
##        'Positive' Class : yes            
## 

Key findings

The Decision Tree model achieves an accuracy of 59% , 11% lower than the logistic regression model.

It correctly predicts profit growth; 134 at 80.24% and correctly predicts no profit growth; 43 at 32.33%

This model has performed worse than the logistic regression model.

The Tree predicts “YES” for profit growth at 80.24% of the time while the actual yes rate is 55.67%, the tree is biased towards predicting yes.

The reason could have been that the tree grows too deep and learns specific patterns from training leading it to overpredicting ‘yes’.

The model memorized the training patterns leading to over prediction of ‘yes’.

increasing complexity of the data would have required adding more features which could expose the model to overfitting.

Solution: Try pruning to reduce the yes bias and see if the model improves.

# find optimal complexity parameter(cp) for the model
printcp(decision_tree_model)
## 
## Classification tree:
## rpart(formula = profit_growth_binary ~ ., data = train_scaled, 
##     method = "class")
## 
## Variables actually used in tree construction:
## [1] cash_lag1               ebit_lag1               inventory_turnover_lag1
## [4] operating_expense_lag1  revenue_growth_lag1     revenue_growth_momentum
## 
## Root node error: 201/500 = 0.402
## 
## n= 500 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.094527      0   1.00000 1.00000 0.054545
## 2 0.047264      1   0.90547 1.04975 0.054943
## 3 0.034826      5   0.71642 0.92040 0.053711
## 4 0.019900      7   0.64677 0.91542 0.053650
## 5 0.017413      8   0.62687 0.89055 0.053333
## 6 0.014925     11   0.57214 0.85572 0.052847
## 7 0.010000     12   0.55721 0.86070 0.052920
plotcp(decision_tree_model)

### Key findings

The plot shows complexity parameter on the x_axis and cross validation error on the y-axis which are also displayed on the table.

The dotted line is threshold for selection at 1std

The goal or best outcome would be to find the lowest xerror point

Based on the graph my lowest xerror is at cp = 0.016, tree size = 12 nodes and xerror = 0.855

The threshold = min xerror - xstd at min -> 0.855 + 0.0.052 = 0.907

Find tree that falls below 0.907

  • cp 0.017 with 8 nodes at 0.891

  • use cp = 0.017 for pruning

N/B: I used Claude to help me understand how pruining is done and how it works.

pruned_decision_tree <- prune(decision_tree_model, cp = 0.017)

# plot 
pruned_decision_tree_plot <- rpart.plot(pruned_decision_tree,
                                        type = 4, 
                                        extra = 101, 
                                        fallen.leaves = TRUE, 
                                        main = "Pruned Decision Tree")

print(pruned_decision_tree_plot)
## $obj
## n= 500 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 500 201 yes (0.40200000 0.59800000)  
##     2) ebit_lag1>=0.60842 147  64 no (0.56462585 0.43537415)  
##       4) revenue_growth_momentum< -0.2481026 59  16 no (0.72881356 0.27118644) *
##       5) revenue_growth_momentum>=-0.2481026 88  40 yes (0.45454545 0.54545455)  
##        10) operating_expense_lag1< 0.8448163 57  23 no (0.59649123 0.40350877)  
##          20) inventory_turnover_lag1>=0.3263827 14   2 no (0.85714286 0.14285714) *
##          21) inventory_turnover_lag1< 0.3263827 43  21 no (0.51162791 0.48837209)  
##            42) inventory_turnover_lag1< -0.2095408 34  13 no (0.61764706 0.38235294)  
##              84) revenue_growth_lag1< -0.07026225 26   7 no (0.73076923 0.26923077) *
##              85) revenue_growth_lag1>=-0.07026225 8   2 yes (0.25000000 0.75000000) *
##            43) inventory_turnover_lag1>=-0.2095408 9   1 yes (0.11111111 0.88888889) *
##        11) operating_expense_lag1>=0.8448163 31   6 yes (0.19354839 0.80645161) *
##     3) ebit_lag1< 0.60842 353 118 yes (0.33427762 0.66572238)  
##       6) operating_expense_lag1< -0.6334135 116  57 no (0.50862069 0.49137931)  
##        12) ebit_lag1>=-0.3837157 21   1 no (0.95238095 0.04761905) *
##        13) ebit_lag1< -0.3837157 95  39 yes (0.41052632 0.58947368)  
##          26) operating_expense_lag1< -1.404207 47  20 no (0.57446809 0.42553191)  
##            52) ebit_lag1>=-2.020906 24   5 no (0.79166667 0.20833333) *
##            53) ebit_lag1< -2.020906 23   8 yes (0.34782609 0.65217391)  
##             106) cash_lag1< -2.542958 10   3 no (0.70000000 0.30000000) *
##             107) cash_lag1>=-2.542958 13   1 yes (0.07692308 0.92307692) *
##          27) operating_expense_lag1>=-1.404207 48  12 yes (0.25000000 0.75000000) *
##       7) operating_expense_lag1>=-0.6334135 237  59 yes (0.24894515 0.75105485) *
## 
## $snipped.nodes
## NULL
## 
## $xlim
## [1] 0 1
## 
## $ylim
## [1] 0 1
## 
## $x
##  [1] 0.53053227 0.22714408 0.08182368 0.37246447 0.25518836 0.16340706
##  [7] 0.34696967 0.28578213 0.24499044 0.32657382 0.40815720 0.48974058
## [13] 0.83392047 0.68860007 0.57132396 0.80587618 0.71409488 0.65290734
## [19] 0.77528241 0.73449072 0.81607411 0.89765749 0.97924087
## 
## $y
##  [1] 0.9598391 0.8182741 0.0255096 0.6767090 0.5351439 0.0255096 0.3935788
##  [8] 0.2520137 0.0255096 0.0255096 0.0255096 0.0255096 0.8182741 0.6767090
## [15] 0.0255096 0.5351439 0.3935788 0.0255096 0.2520137 0.0255096 0.0255096
## [22] 0.0255096 0.0255096
## 
## $branch.x
##        [,1]      [,2]       [,3]      [,4]      [,5]      [,6]      [,7]
## x 0.5305323 0.2271441 0.08182368 0.3724645 0.2551884 0.1634071 0.3469697
##          NA 0.2271441 0.08182368 0.3724645 0.2551884 0.1634071 0.3469697
##          NA 0.5305323 0.22714408 0.2271441 0.3724645 0.2551884 0.2551884
##        [,8]      [,9]     [,10]     [,11]     [,12]     [,13]     [,14]
## x 0.2857821 0.2449904 0.3265738 0.4081572 0.4897406 0.8339205 0.6886001
##   0.2857821 0.2449904 0.3265738 0.4081572 0.4897406 0.8339205 0.6886001
##   0.3469697 0.2857821 0.2857821 0.3469697 0.3724645 0.5305323 0.8339205
##       [,15]     [,16]     [,17]     [,18]     [,19]     [,20]     [,21]
## x 0.5713240 0.8058762 0.7140949 0.6529073 0.7752824 0.7344907 0.8160741
##   0.5713240 0.8058762 0.7140949 0.6529073 0.7752824 0.7344907 0.8160741
##   0.6886001 0.6886001 0.8058762 0.7140949 0.7140949 0.7752824 0.7752824
##       [,22]     [,23]
## x 0.8976575 0.9792409
##   0.8976575 0.9792409
##   0.8058762 0.8339205
## 
## $branch.y
##        [,1]      [,2]       [,3]      [,4]      [,5]       [,6]      [,7]
## y 0.9996865 0.8581215 0.06535701 0.7165564 0.5749913 0.06535701 0.4334262
##          NA 0.9661116 0.82454647 0.8245465 0.6829814 0.54141631 0.5414163
##          NA 0.9661116 0.82454647 0.8245465 0.6829814 0.54141631 0.5414163
##        [,8]       [,9]      [,10]      [,11]      [,12]     [,13]     [,14]
## y 0.2918611 0.06535701 0.06535701 0.06535701 0.06535701 0.8581215 0.7165564
##   0.3998512 0.25828615 0.25828615 0.39985123 0.68298139 0.9661116 0.8245465
##   0.3998512 0.25828615 0.25828615 0.39985123 0.68298139 0.9661116 0.8245465
##        [,15]     [,16]     [,17]      [,18]     [,19]      [,20]      [,21]
## y 0.06535701 0.5749913 0.4334262 0.06535701 0.2918611 0.06535701 0.06535701
##   0.68298139 0.6829814 0.5414163 0.39985123 0.3998512 0.25828615 0.25828615
##   0.68298139 0.6829814 0.5414163 0.39985123 0.3998512 0.25828615 0.25828615
##        [,22]      [,23]
## y 0.06535701 0.06535701
##   0.54141631 0.82454647
##   0.54141631 0.82454647
## 
## $labs
##  [1] "yes\n201  299\n100%" "no\n83  64\n29%"     "no\n43  16\n12%"    
##  [4] "yes\n40  48\n18%"    "no\n34  23\n11%"     "no\n12  2\n3%"      
##  [7] "no\n22  21\n9%"      "no\n21  13\n7%"      "no\n19  7\n5%"      
## [10] "yes\n2  6\n2%"       "yes\n1  8\n2%"       "yes\n6  25\n6%"     
## [13] "yes\n118  235\n71%"  "no\n59  57\n23%"     "no\n20  1\n4%"      
## [16] "yes\n39  56\n19%"    "no\n27  20\n9%"      "no\n19  5\n5%"      
## [19] "yes\n8  15\n5%"      "no\n7  3\n2%"        "yes\n1  12\n3%"     
## [22] "yes\n12  36\n10%"    "yes\n59  178\n47%"  
## 
## $cex
## [1] 0.3875
## 
## $boxes
## $boxes$x1
##  [1] 0.50556185 0.20805117 0.06273077 0.35337157 0.23609546 0.14725291
##  [7] 0.32787676 0.26668922 0.22883629 0.31335843 0.39494181 0.47358644
## [13] 0.80895004 0.66950717 0.55516982 0.78678328 0.69500197 0.63675320
## [19] 0.75912827 0.72127534 0.79991996 0.87856458 0.95720920
## 
## $boxes$y1
##  [1]  0.932536560  0.790971480 -0.001792973  0.649406399  0.507841318
##  [6] -0.001792973  0.366276237  0.224711157 -0.001792973 -0.001792973
## [11] -0.001792973 -0.001792973  0.790971480  0.649406399 -0.001792973
## [16]  0.507841318  0.366276237 -0.001792973  0.224711157 -0.001792973
## [21] -0.001792973 -0.001792973 -0.001792973
## 
## $boxes$x2
##  [1] 0.5555027 0.2462370 0.1009166 0.3915574 0.2742813 0.1795612 0.3660626
##  [8] 0.3048750 0.2611446 0.3397892 0.4213726 0.5058947 0.8588909 0.7076930
## [15] 0.5874781 0.8249691 0.7331878 0.6690615 0.7914366 0.7477061 0.8322283
## [22] 0.9167504 1.0012725
## 
## $boxes$y2
##  [1] 0.99968654 0.85812146 0.06535701 0.71655638 0.57499130 0.06535701
##  [7] 0.43342622 0.29186114 0.06535701 0.06535701 0.06535701 0.06535701
## [13] 0.85812146 0.71655638 0.06535701 0.57499130 0.43342622 0.06535701
## [19] 0.29186114 0.06535701 0.06535701 0.06535701 0.06535701
## 
## 
## $split.labs
## [1] ""
## 
## $split.cex
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 
## $split.box
## $split.box$x1
##  [1]  0.183965209 -0.009353179           NA  0.176328981  0.081467023
##  [6]           NA  0.205168274  0.171717023           NA           NA
## [11]           NA           NA  0.607981049  0.526385450           NA
## [16]  0.636414611  0.615314442           NA  0.692927013           NA
## [21]           NA           NA           NA
## 
## $split.box$y1
##  [1] 0.8971149 0.7555498        NA 0.6139848 0.4724197        NA 0.3308546
##  [8] 0.1892895        NA        NA        NA        NA 0.7555498 0.6139848
## [15]        NA 0.4724197 0.3308546        NA 0.1892895        NA        NA
## [22]        NA        NA
## 
## $split.box$x2
##  [1] 0.2703229 0.1730005        NA 0.3340477 0.2453471        NA 0.3663960
##  [8] 0.3182639        NA        NA        NA        NA 0.7692191 0.6162625
## [15]        NA 0.7917751 0.6905002        NA 0.7760544        NA        NA
## [22]        NA        NA
## 
## $split.box$y2
##  [1] 0.9222046 0.7806395        NA 0.6390744 0.4975094        NA 0.3559443
##  [8] 0.2143792        NA        NA        NA        NA 0.7806395 0.6390744
## [15]        NA 0.4975094 0.3559443        NA 0.2143792        NA        NA
## [22]        NA        NA
# make predictions using the pruned decision tree
pruned_decision_tree_predictions <- predict(pruned_decision_tree, newdata = test_scaled, type = 'class')

# create a confusion matrix
tree_confusion_matrix <- confusionMatrix(pruned_decision_tree_predictions, test_scaled$profit_growth_binary, positive = 'yes')
tree_confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no   50  46
##        yes  83 121
##                                           
##                Accuracy : 0.57            
##                  95% CI : (0.5119, 0.6268)
##     No Information Rate : 0.5567          
##     P-Value [Acc > NIR] : 0.342800        
##                                           
##                   Kappa : 0.1034          
##                                           
##  Mcnemar's Test P-Value : 0.001526        
##                                           
##             Sensitivity : 0.7246          
##             Specificity : 0.3759          
##          Pos Pred Value : 0.5931          
##          Neg Pred Value : 0.5208          
##              Prevalence : 0.5567          
##          Detection Rate : 0.4033          
##    Detection Prevalence : 0.6800          
##       Balanced Accuracy : 0.5502          
##                                           
##        'Positive' Class : yes             
## 

Key Findings

The pruned decision tree achieves an accuracy: 57%, a TPR: 72% and a TNR: 38%.

It has a p value of 0.3428 which is not statistically significant.

The model has slightly improved Specificity: 38% by 5.26% and reduced the Yes prediction bias from 80.24% to 72.46% (134 to 121) but the model still under performs compared to the logistic regression model.

It could be that this dataset has more linear relationships than complex non-linear relationships which favors logistic regression over decision trees which are designed to capture non-linear patterns.

N/B : I used

Random Forest Model

set.seed(123)
# build the rf model
random_forest_model <- randomForest(profit_growth_binary~.,
                                    data = train_scaled,
                                    mtry = 3,
                                    ntree = 500,
                                    importance= TRUE)
# Make predictions using the RF model
randonm_forest_model_predictions <- predict(random_forest_model, newdata = test_scaled)

# create a confusion matrix
rf_confusion_matrix <- confusionMatrix(randonm_forest_model_predictions, test_scaled$profit_growth_binary,positive = 'yes')
rf_confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no   79  50
##        yes  54 117
##                                           
##                Accuracy : 0.6533          
##                  95% CI : (0.5965, 0.7071)
##     No Information Rate : 0.5567          
##     P-Value [Acc > NIR] : 0.0004124       
##                                           
##                   Kappa : 0.2955          
##                                           
##  Mcnemar's Test P-Value : 0.7686248       
##                                           
##             Sensitivity : 0.7006          
##             Specificity : 0.5940          
##          Pos Pred Value : 0.6842          
##          Neg Pred Value : 0.6124          
##              Prevalence : 0.5567          
##          Detection Rate : 0.3900          
##    Detection Prevalence : 0.5700          
##       Balanced Accuracy : 0.6473          
##                                           
##        'Positive' Class : yes             
## 
# extract the output
accuracy_rf <- rf_confusion_matrix$overall ['Accuracy']
true_pos_rate_rf <- rf_confusion_matrix$byClass ['Sensitivity']
true_neg_rate_rf <- rf_confusion_matrix$byClass ['Specificity']

Key findings

The model shows improvement with an accuracy: 65%, a TPR: 70% and a TNR: 59%.

SUMMARY OF THE MODEL OUTCOMES BEFORE ENSEMBLE

Logistic regression model:

  • Accuracy -> 70%
  • Sensitivity (TPR) -> 65.27%
  • Specificty (TNR) -> 75.94%
  • P-value -> 2.482e-07

Decision tree model:

  • Accuracy -> 59%
  • Sensitivity (TPR) -> 80.24%
  • Specificty (TNR) -> 32.33%
  • p-value -> 0.1347

pruned decision tree model:

  • Accuracy -> 57%
  • Sensitivity (TPR) -> 72.46%
  • Specificty (TNR) -> 37.59%
  • p-value -> 0.3428

Random forest model:

  • Accuracy -> 65.33%
  • Sensitivity (TPR) -> 70.06%
  • Specificty (TNR) -> 59.40%
  • p-value -> 0.0004124

Decision tree model has the highest Yes for profit growth at 80.24% but a p-value of 0.1347 which makes the model not significant.

Pruned decision tree model has an accuracy of 59% with a p vlaue of 0.3428 making it the worst performing model in the list.

logistic regression model and random forest model are the best performing with a difference in accuracy of about 5% with random forest having a higher yes prediction and logistic rergression having a higher no prediction for profit growth.

The goal of this project is to correctly predict companies that will make profit in the future hence if one comapny is incorrectly classifed as No for profit growth, an investor might miss an oppportunity to invest in this company. Hence, the best model so far is Random forest model with a TPR of 70.06% at the cost of lower TNR of 59.40%.

Based on the goal of the project, an investor would rather miss a falsely labeled non-growth company(false positive) than miss a growth opportunity (false negative). The logistic regression model is at a FPR of 54 and a FNR of 50

Ensemble Model

# get all predictions from the previous models
logistic_regression_pred <- predict(logistic_regression_model, newdata = test_scaled)
decision_tree_pred <- predict(pruned_decision_tree, newdata = test_scaled, type = 'class')
random_forest_pred <- predict(random_forest_model, newdata = test_scaled)

# create one df of the model predictions
ensemble_df <- data.frame(
  logistic_regression_pred = as.character(logistic_regression_pred),
  decision_tree_pred = as.character(decision_tree_pred),
  random_forest_pred = as.character(random_forest_pred)
)
# We want votes for all the models
# create a function for majority voting
majority_vote <- function(row) {
  votes <- table(row)
  return(names(votes)[which.max(votes)])
}
# apply majority voting to the dataframe
ensemble_predictions <- apply(ensemble_df, 1, majority_vote)

# convert back to factor
ensemble_predictions <- factor(ensemble_predictions, levels = c("no", "yes"))

# Create a confusion matrix
ensemble_conf_mtx <- confusionMatrix(ensemble_predictions, test_scaled$profit_growth_binary, positive = "yes")
ensemble_conf_mtx
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no   40  28
##        yes  43 101
##                                           
##                Accuracy : 0.6651          
##                  95% CI : (0.5972, 0.7283)
##     No Information Rate : 0.6085          
##     P-Value [Acc > NIR] : 0.05179         
##                                           
##                   Kappa : 0.2737          
##                                           
##  Mcnemar's Test P-Value : 0.09661         
##                                           
##             Sensitivity : 0.7829          
##             Specificity : 0.4819          
##          Pos Pred Value : 0.7014          
##          Neg Pred Value : 0.5882          
##              Prevalence : 0.6085          
##          Detection Rate : 0.4764          
##    Detection Prevalence : 0.6792          
##       Balanced Accuracy : 0.6324          
##                                           
##        'Positive' Class : yes             
## 
# extract the output
accuracy_ensemble <- ensemble_conf_mtx$overall ['Accuracy']
ensemble_tpr <- ensemble_conf_mtx$byClass ['Sensitivity']
ensemble_tnr <- ensemble_conf_mtx$byClass ['Specificity']

Key findings

The ensemble model achieved an accuracy: 67%, a TPR: 78% and a TNR: 48%.

SUMMARY OF MODELS AFTER ENSEMBLE

Logistic regression model:

  • Accuracy -> 70%
  • Sensitivity (TPR) -> 65.27%
  • Specificity (TNR) -> 75.94%
  • P-value -> 2.482e-07

Decision tree model:

  • Accuracy -> 59%
  • Sensitivity (TPR) -> 80.24%
  • Specificity (TNR) -> 32.33%
  • p-value -> 0.1347

pruned decision tree model:

  • Accuracy -> 57%
  • Sensitivity (TPR) -> 72.46%
  • Specificity (TNR) -> 37.59%
  • p-value -> 0.3428

Random forest model:

  • Accuracy -> 65.33%
  • Sensitivity(TPR) -> 70.06%
  • Specificity (TNR) -> 59.40%
  • p-value -> 0.0004124

Ensemble model

  • Accuracy -> 66%
  • Sensitivity (TPR) -> 78.12%
  • Specificity (TNR) -> 47.62%
  • p-value -> 0.0522

Prior to ensemble, the best performing model was the random forest model with 65.33% accuracy, TPR of 70.06% at the cost of lower TNR of 59.40%.

The ensemble model performs better with an increased accuracy of 0.71% compared to random forest model. It also has a yes prediction rate of more than 8.06% and a no prediction rate of less than 11.82%.

  • The reduction in TNR/ no prediction rate, could be that more non-growth companies have been labelled as profit growth companies and this could be a risk for investors looking into these companies.

It has a p-value of 0.0522 which is borderline significant and slightly better than random chance of profit growth prediction but is not strong enough to be considered definitively significant.

Based on the goal of the project; to predict profit growth in the future,the ensemble model is recommended as the primary model due to its high TPR: 78%, correctly identifying companies that will have profit growth.

Random forest model remains statistically significant and recommended if sensitivity is not prioritized over predictive reliability at a p value of 0.0004124 and TPR: 70%

N/B: In lieu of ROC curve visualizations, model discrimination performance is summarized through Sensitivity(TPR) and Specificity(TNR) metrics across all models which captures similar trade off between true positive and true negative rates at the chosen classification threshold.

Limitation

The small data size of only 800 observations limit model learning with 10 of the 200 observations representing all the feature categories of the dataset.