The goal of this project is to predict whether a company’s profit will grow in the following year
# Load required libraries
library('httr')
library('readxl')
library('ggplot2')
library('corrplot')
## corrplot 0.95 loaded
library('dplyr')
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library('tidyverse')
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ lubridate 1.9.4 ✔ tibble 3.3.1
## ✔ purrr 1.2.1 ✔ tidyr 1.3.2
## ✔ readr 2.1.6
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library('viridis')
## Loading required package: viridisLite
library('caret')
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
##
## The following object is masked from 'package:httr':
##
## progress
library('rpart')
library('rpart.plot')
library('randomForest')
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
Dataset reference: https://www.kaggle.com/datasets/mdmahfuzsumon/synthetic-finance-dataset-for-machine-learning
# load the data from kaggle
url <- "https://www.kaggle.com/api/v1/datasets/download/mdmahfuzsumon/synthetic-finance-dataset-for-machine-learning"
response <-GET(url,
authenticate("Maryanne M", "KGAT_b77248cf46a1d424ed10f43537576b1e"),
write_disk("finance_data.zip", overwrite = TRUE))
# print(response)
unzip("finance_data.zip")
finance_df <- read.csv("financial_200_features_dataset (1).csv")
head(finance_df)
## company year revenue domestic_sales
## Length:1000 Min. :2016 Min. :7.389e+08 Min. :5.987e+08
## Class :character 1st Qu.:2018 1st Qu.:9.600e+09 1st Qu.:7.936e+09
## Mode :character Median :2020 Median :1.758e+10 Median :1.437e+10
## Mean :2020 Mean :2.008e+10 Mean :1.657e+10
## 3rd Qu.:2023 3rd Qu.:2.710e+10 3rd Qu.:2.245e+10
## Max. :2025 Max. :9.001e+10 Max. :7.866e+10
##
## export_sales cogs gross_profit
## Min. :8.509e+07 Min. :3.902e+08 Min. :3.335e+08
## 1st Qu.:1.584e+09 1st Qu.:5.499e+09 1st Qu.:4.055e+09
## Median :2.879e+09 Median :1.009e+10 Median :7.301e+09
## Mean :3.505e+09 Mean :1.149e+10 Mean :8.585e+09
## 3rd Qu.:4.747e+09 3rd Qu.:1.554e+10 3rd Qu.:1.148e+10
## Max. :1.887e+10 Max. :4.975e+10 Max. :4.026e+10
##
## operating_expense ebit ebitda
## Min. :1.685e+08 Min. :1.296e+08 Min. :1.811e+08
## 1st Qu.:2.294e+09 1st Qu.:1.692e+09 1st Qu.:2.310e+09
## Median :4.061e+09 Median :3.086e+09 Median :4.266e+09
## Mean :4.810e+09 Mean :3.775e+09 Mean :5.077e+09
## 3rd Qu.:6.565e+09 3rd Qu.:5.125e+09 3rd Qu.:6.862e+09
## Max. :2.599e+10 Max. :2.060e+10 Max. :2.532e+10
##
## net_profit assets equity
## Min. :9.516e+07 Min. :1.147e+09 Min. :7.421e+08
## 1st Qu.:1.278e+09 1st Qu.:1.934e+10 1st Qu.:1.172e+10
## Median :2.306e+09 Median :3.650e+10 Median :2.133e+10
## Mean :2.831e+09 Mean :4.309e+10 Mean :2.580e+10
## 3rd Qu.:3.836e+09 3rd Qu.:5.822e+10 3rd Qu.:3.501e+10
## Max. :1.631e+10 Max. :1.958e+11 Max. :1.239e+11
##
## debt cash inventory
## Min. :4.052e+08 Min. :7.867e+07 Min. :8.139e+07
## 1st Qu.:7.587e+09 1st Qu.:8.903e+08 1st Qu.:1.593e+09
## Median :1.403e+10 Median :1.625e+09 Median :2.884e+09
## Mean :1.728e+10 Mean :1.963e+09 Mean :3.542e+09
## 3rd Qu.:2.229e+10 3rd Qu.:2.615e+09 3rd Qu.:4.847e+09
## Max. :9.759e+10 Max. :9.706e+09 Max. :1.826e+10
##
## receivable employees shares operating_cf
## Min. :7.304e+07 Min. : 305 Min. : 50690778 Min. :1.239e+08
## 1st Qu.:1.272e+09 1st Qu.:1398 1st Qu.:171141650 1st Qu.:1.788e+09
## Median :2.266e+09 Median :2722 Median :286444384 Median :3.197e+09
## Mean :2.823e+09 Mean :2676 Mean :279673639 Mean :3.844e+09
## 3rd Qu.:3.735e+09 3rd Qu.:3929 3rd Qu.:391184969 3rd Qu.:5.171e+09
## Max. :1.742e+10 Max. :4997 Max. :499282127 Max. :1.898e+10
##
## investing_cf financing_cf revenue_growth domestic_share
## Min. :-1.046e+10 Min. :-5.585e+09 Min. :-0.11970 Min. :0.7500
## 1st Qu.:-2.640e+09 1st Qu.:-6.924e+08 1st Qu.: 0.05586 1st Qu.:0.7873
## Median :-1.602e+09 Median : 1.109e+07 Median : 0.12440 Median :0.8255
## Mean :-2.004e+09 Mean :-7.136e+06 Mean : 0.12613 Mean :0.8255
## 3rd Qu.:-8.777e+08 3rd Qu.: 7.232e+08 3rd Qu.: 0.18809 3rd Qu.:0.8635
## Max. :-3.794e+07 Max. : 6.354e+09 Max. : 0.38287 Max. :0.8999
## NA's :100
## export_share revenue_per_employee revenue_per_asset revenue_per_equity
## Min. :0.1001 Min. : 198047 Min. :0.3574 Min. :0.4811
## 1st Qu.:0.1365 1st Qu.: 3515920 1st Qu.:0.4075 1st Qu.:0.6689
## Median :0.1745 Median : 6751117 Median :0.4665 Median :0.7971
## Mean :0.1745 Mean : 11930863 Mean :0.4823 Mean :0.8230
## 3rd Qu.:0.2127 3rd Qu.: 13899831 3rd Qu.:0.5548 3rd Qu.:0.9424
## Max. :0.2500 Max. :157050970 Max. :0.6665 Max. :1.4334
##
## revenue_per_share gross_margin ebit_margin ebitda_margin
## Min. : 1.832 Min. :0.3500 Min. :0.05446 Min. :0.1151
## 1st Qu.: 33.557 1st Qu.:0.3884 1st Qu.:0.14941 1st Qu.:0.2129
## Median : 64.958 Median :0.4284 Median :0.18701 Median :0.2509
## Mean :101.250 Mean :0.4269 Mean :0.18761 Mean :0.2522
## 3rd Qu.:124.831 3rd Qu.:0.4648 3rd Qu.:0.22589 3rd Qu.:0.2921
## Max. :920.742 Max. :0.5000 Max. :0.31408 Max. :0.3851
##
## net_margin profit_growth cogs_margin opex_ratio
## Min. :0.04258 Min. :-0.7186 Min. :0.5000 Min. :0.1800
## 1st Qu.:0.11109 1st Qu.:-0.1684 1st Qu.:0.5352 1st Qu.:0.2101
## Median :0.14116 Median : 0.1164 Median :0.5716 Median :0.2375
## Mean :0.14082 Mean : 0.2496 Mean :0.5731 Mean :0.2393
## 3rd Qu.:0.16976 3rd Qu.: 0.5331 3rd Qu.:0.6116 3rd Qu.:0.2685
## Max. :0.25627 Max. : 3.8825 Max. :0.6500 Max. :0.3000
## NA's :100
## asset_turnover equity_ratio debt_ratio cash_ratio
## Min. :0.3574 Min. :0.4506 Min. :0.2501 Min. :0.03533
## 1st Qu.:0.4075 1st Qu.:0.5238 1st Qu.:0.3257 1st Qu.:0.08647
## Median :0.4665 Median :0.5986 Median :0.4014 Median :0.11616
## Mean :0.4823 Mean :0.5997 Mean :0.4003 Mean :0.12485
## 3rd Qu.:0.5548 3rd Qu.:0.6743 3rd Qu.:0.4762 3rd Qu.:0.15428
## Max. :0.6665 Max. :0.7499 Max. :0.5494 Max. :0.32345
##
## working_capital fcf ocf_margin inventory_turnover
## Min. :-4.105e+09 Min. :-2.398e+09 Min. :0.0755 Min. :2.015
## 1st Qu.: 2.902e+08 1st Qu.: 6.120e+08 1st Qu.:0.1613 1st Qu.:2.721
## Median : 1.098e+09 Median : 1.391e+09 Median :0.1903 Median :3.270
## Mean : 1.555e+09 Mean : 1.840e+09 Mean :0.1914 Mean :3.504
## 3rd Qu.: 2.477e+09 3rd Qu.: 2.535e+09 3rd Qu.:0.2207 3rd Qu.:4.175
## Max. : 1.676e+10 Max. : 1.164e+10 Max. :0.3292 Max. :6.326
##
## receivable_turnover eps book_value_per_share year_lag1
## Min. : 5.009 Min. : 0.288 Min. : 1.839 Min. :2016
## 1st Qu.: 5.931 1st Qu.: 4.518 1st Qu.: 42.115 1st Qu.:2018
## Median : 7.111 Median : 9.007 Median : 80.711 Median :2020
## Mean : 7.647 Mean : 14.207 Mean : 129.771 Mean :2020
## 3rd Qu.: 9.172 3rd Qu.: 16.982 3rd Qu.: 160.424 3rd Qu.:2022
## Max. :12.477 Max. :153.415 Max. :1090.970 Max. :2024
## NA's :100
## year_lag2 year_rolling_mean year_rolling_std year_momentum
## Min. :2016 Min. :2017 Min. :1 Min. :1
## 1st Qu.:2018 1st Qu.:2019 1st Qu.:1 1st Qu.:1
## Median :2020 Median :2020 Median :1 Median :1
## Mean :2020 Mean :2020 Mean :1 Mean :1
## 3rd Qu.:2021 3rd Qu.:2022 3rd Qu.:1 3rd Qu.:1
## Max. :2023 Max. :2024 Max. :1 Max. :1
## NA's :200 NA's :200 NA's :200 NA's :100
## revenue_lag1 revenue_lag2 revenue_rolling_mean
## Min. :7.389e+08 Min. :7.389e+08 Min. :8.303e+08
## 1st Qu.:9.247e+09 1st Qu.:8.748e+09 1st Qu.:9.865e+09
## Median :1.668e+10 Median :1.589e+10 Median :1.818e+10
## Mean :1.868e+10 Mean :1.737e+10 Mean :1.972e+10
## 3rd Qu.:2.536e+10 3rd Qu.:2.374e+10 3rd Qu.:2.681e+10
## Max. :7.158e+10 Max. :6.414e+10 Max. :7.349e+10
## NA's :100 NA's :200 NA's :200
## revenue_rolling_std revenue_momentum domestic_sales_lag1
## Min. :1.300e+07 Min. :-2.749e+09 Min. :5.987e+08
## 1st Qu.:1.033e+09 1st Qu.: 5.405e+08 1st Qu.:7.568e+09
## Median :2.005e+09 Median : 1.679e+09 Median :1.373e+10
## Mean :2.550e+09 Mean : 2.409e+09 Mean :1.545e+10
## 3rd Qu.:3.430e+09 3rd Qu.: 3.502e+09 3rd Qu.:2.118e+10
## Max. :1.435e+10 Max. : 2.370e+10 Max. :5.716e+10
## NA's :200 NA's :100 NA's :100
## domestic_sales_lag2 domestic_sales_rolling_mean domestic_sales_rolling_std
## Min. :5.987e+08 Min. :7.050e+08 Min. :3.218e+07
## 1st Qu.:7.270e+09 1st Qu.:8.107e+09 1st Qu.:8.396e+08
## Median :1.301e+10 Median :1.491e+10 Median :1.707e+09
## Mean :1.437e+10 Mean :1.630e+10 Mean :2.220e+09
## 3rd Qu.:1.932e+10 3rd Qu.:2.207e+10 3rd Qu.:3.040e+09
## Max. :5.637e+10 Max. :6.244e+10 Max. :1.420e+10
## NA's :200 NA's :200 NA's :200
## domestic_sales_momentum export_sales_lag1 export_sales_lag2
## Min. :-5.192e+09 Min. :8.509e+07 Min. :8.509e+07
## 1st Qu.: 2.946e+08 1st Qu.:1.453e+09 1st Qu.:1.399e+09
## Median : 1.279e+09 Median :2.713e+09 Median :2.543e+09
## Mean : 1.961e+09 Mean :3.227e+09 Mean :3.005e+09
## 3rd Qu.: 3.062e+09 3rd Qu.:4.308e+09 3rd Qu.:4.113e+09
## Max. : 2.638e+10 Max. :1.585e+10 Max. :1.175e+10
## NA's :100 NA's :100 NA's :200
## export_sales_rolling_mean export_sales_rolling_std export_sales_momentum
## Min. :1.254e+08 Min. :1.311e+07 Min. :-5.030e+09
## 1st Qu.:1.680e+09 1st Qu.:3.169e+08 1st Qu.:-2.969e+08
## Median :3.006e+09 Median :6.756e+08 Median : 1.871e+08
## Mean :3.420e+09 Mean :8.947e+08 Mean : 4.485e+08
## 3rd Qu.:4.656e+09 3rd Qu.:1.227e+09 3rd Qu.: 1.019e+09
## Max. :1.355e+10 Max. :6.613e+09 Max. : 1.050e+10
## NA's :200 NA's :200 NA's :100
## cogs_lag1 cogs_lag2 cogs_rolling_mean
## Min. :3.902e+08 Min. :3.902e+08 Min. :4.480e+08
## 1st Qu.:5.285e+09 1st Qu.:5.056e+09 1st Qu.:5.651e+09
## Median :9.600e+09 Median :9.078e+09 Median :1.043e+10
## Mean :1.068e+10 Mean :9.956e+09 Mean :1.129e+10
## 3rd Qu.:1.471e+10 3rd Qu.:1.365e+10 3rd Qu.:1.537e+10
## Max. :4.445e+10 Max. :3.281e+10 Max. :4.044e+10
## NA's :100 NA's :200 NA's :200
## cogs_rolling_std cogs_momentum gross_profit_lag1
## Min. :2.948e+07 Min. :-4.948e+09 Min. :3.335e+08
## 1st Qu.:5.570e+08 1st Qu.: 1.094e+08 1st Qu.:3.862e+09
## Median :1.222e+09 Median : 8.920e+08 Median :6.969e+09
## Mean :1.599e+09 Mean : 1.391e+09 Mean :7.992e+09
## 3rd Qu.:2.252e+09 3rd Qu.: 2.240e+09 3rd Qu.:1.074e+10
## Max. :9.597e+09 Max. : 1.628e+10 Max. :3.444e+10
## NA's :200 NA's :100 NA's :100
## gross_profit_lag2 gross_profit_rolling_mean gross_profit_rolling_std
## Min. :3.335e+08 Min. :3.776e+08 Min. :1.210e+07
## 1st Qu.:3.649e+09 1st Qu.:4.144e+09 1st Qu.:4.687e+08
## Median :6.698e+09 Median :7.586e+09 Median :9.506e+08
## Mean :7.414e+09 Mean :8.432e+09 Mean :1.322e+09
## 3rd Qu.:1.004e+10 3rd Qu.:1.131e+10 3rd Qu.:1.805e+09
## Max. :3.133e+10 Max. :3.481e+10 Max. :7.904e+09
## NA's :200 NA's :200 NA's :200
## gross_profit_momentum operating_expense_lag1 operating_expense_lag2
## Min. :-5.420e+09 Min. :1.685e+08 Min. :1.685e+08
## 1st Qu.:-2.699e+07 1st Qu.:2.148e+09 1st Qu.:2.020e+09
## Median : 6.003e+08 Median :3.896e+09 Median :3.757e+09
## Mean : 1.018e+09 Mean :4.459e+09 Mean :4.158e+09
## 3rd Qu.: 1.754e+09 3rd Qu.:6.061e+09 3rd Qu.:5.596e+09
## Max. : 1.157e+10 Max. :1.952e+10 Max. :1.720e+10
## NA's :100 NA's :100 NA's :200
## operating_expense_rolling_mean operating_expense_rolling_std
## Min. :2.011e+08 Min. :1.114e+07
## 1st Qu.:2.368e+09 1st Qu.:3.005e+08
## Median :4.233e+09 Median :6.308e+08
## Mean :4.717e+09 Mean :8.616e+08
## 3rd Qu.:6.358e+09 3rd Qu.:1.112e+09
## Max. :1.755e+10 Max. :7.371e+09
## NA's :200 NA's :200
## operating_expense_momentum ebit_lag1 ebit_lag2
## Min. :-5.658e+09 Min. :1.296e+08 Min. :1.296e+08
## 1st Qu.:-1.142e+08 1st Qu.:1.630e+09 1st Qu.:1.570e+09
## Median : 3.438e+08 Median :2.958e+09 Median :2.771e+09
## Mean : 5.894e+08 Mean :3.533e+09 Mean :3.256e+09
## 3rd Qu.: 1.135e+09 3rd Qu.:4.807e+09 3rd Qu.:4.329e+09
## Max. : 1.169e+10 Max. :2.060e+10 Max. :1.897e+10
## NA's :100 NA's :100 NA's :200
## ebit_rolling_mean ebit_rolling_std ebit_momentum
## Min. :1.638e+08 Min. :2.323e+06 Min. :-1.016e+10
## 1st Qu.:1.791e+09 1st Qu.:3.674e+08 1st Qu.:-4.407e+08
## Median :3.292e+09 Median :7.657e+08 Median : 2.369e+08
## Mean :3.715e+09 Mean :1.058e+09 Mean : 4.290e+08
## 3rd Qu.:5.009e+09 3rd Qu.:1.436e+09 3rd Qu.: 1.165e+09
## Max. :1.726e+10 Max. :6.964e+09 Max. : 1.092e+10
## NA's :200 NA's :200 NA's :100
## ebitda_lag1 ebitda_lag2 ebitda_rolling_mean
## Min. :1.811e+08 Min. :1.811e+08 Min. :2.201e+08
## 1st Qu.:2.218e+09 1st Qu.:2.131e+09 1st Qu.:2.417e+09
## Median :4.094e+09 Median :3.818e+09 Median :4.465e+09
## Mean :4.747e+09 Mean :4.385e+09 Mean :4.996e+09
## 3rd Qu.:6.281e+09 3rd Qu.:5.820e+09 3rd Qu.:6.745e+09
## Max. :2.532e+10 Max. :2.313e+10 Max. :2.252e+10
## NA's :100 NA's :200 NA's :200
## ebitda_rolling_std ebitda_momentum net_profit_lag1
## Min. :4.259e+06 Min. :-1.103e+10 Min. :9.516e+07
## 1st Qu.:3.956e+08 1st Qu.:-2.940e+08 1st Qu.:1.215e+09
## Median :8.176e+08 Median : 3.035e+08 Median :2.221e+09
## Mean :1.132e+09 Mean : 5.812e+08 Mean :2.647e+09
## 3rd Qu.:1.533e+09 3rd Qu.: 1.326e+09 3rd Qu.:3.537e+09
## Max. :7.922e+09 Max. : 1.163e+10 Max. :1.631e+10
## NA's :200 NA's :100 NA's :100
## net_profit_lag2 net_profit_rolling_mean net_profit_rolling_std
## Min. :9.516e+07 Min. :1.160e+08 Min. :2.433e+06
## 1st Qu.:1.149e+09 1st Qu.:1.345e+09 1st Qu.:3.060e+08
## Median :2.072e+09 Median :2.469e+09 Median :5.646e+08
## Mean :2.445e+09 Mean :2.787e+09 Mean :8.109e+08
## 3rd Qu.:3.283e+09 3rd Qu.:3.823e+09 3rd Qu.:1.035e+09
## Max. :1.464e+10 Max. :1.284e+10 Max. :5.716e+09
## NA's :200 NA's :200 NA's :200
## net_profit_momentum assets_lag1 assets_lag2
## Min. :-9.503e+09 Min. :1.147e+09 Min. :1.147e+09
## 1st Qu.:-3.468e+08 1st Qu.:1.830e+10 1st Qu.:1.757e+10
## Median : 1.647e+08 Median :3.503e+10 Median :3.308e+10
## Mean : 3.235e+08 Mean :4.025e+10 Mean :3.734e+10
## 3rd Qu.: 8.341e+08 3rd Qu.:5.496e+10 3rd Qu.:5.105e+10
## Max. : 8.352e+09 Max. :1.830e+11 Max. :1.632e+11
## NA's :100 NA's :100 NA's :200
## assets_rolling_mean assets_rolling_std assets_momentum
## Min. :1.454e+09 Min. :5.328e+07 Min. :-5.400e+10
## 1st Qu.:2.092e+10 1st Qu.:2.807e+09 1st Qu.:-1.264e+09
## Median :3.843e+10 Median :6.326e+09 Median : 3.071e+09
## Mean :4.243e+10 Mean :8.220e+09 Mean : 5.030e+09
## 3rd Qu.:5.816e+10 3rd Qu.:1.114e+10 3rd Qu.: 1.077e+10
## Max. :1.718e+11 Max. :5.501e+10 Max. : 9.013e+10
## NA's :200 NA's :200 NA's :100
## equity_lag1 equity_lag2 equity_rolling_mean
## Min. :7.421e+08 Min. :7.421e+08 Min. :8.794e+08
## 1st Qu.:1.125e+10 1st Qu.:1.062e+10 1st Qu.:1.262e+10
## Median :2.034e+10 Median :1.977e+10 Median :2.304e+10
## Mean :2.425e+10 Mean :2.263e+10 Mean :2.553e+10
## 3rd Qu.:3.313e+10 3rd Qu.:3.104e+10 3rd Qu.:3.458e+10
## Max. :1.071e+11 Max. :1.071e+11 Max. :1.122e+11
## NA's :100 NA's :200 NA's :200
## equity_rolling_std equity_momentum debt_lag1
## Min. :3.098e+07 Min. :-3.721e+10 Min. :4.052e+08
## 1st Qu.:2.304e+09 1st Qu.:-1.879e+09 1st Qu.:7.078e+09
## Median :4.614e+09 Median : 1.677e+09 Median :1.341e+10
## Mean :5.886e+09 Mean : 2.874e+09 Mean :1.600e+10
## 3rd Qu.:7.962e+09 3rd Qu.: 7.356e+09 3rd Qu.:2.119e+10
## Max. :3.153e+10 Max. : 5.538e+10 Max. :9.759e+10
## NA's :200 NA's :100 NA's :100
## debt_lag2 debt_rolling_mean debt_rolling_std
## Min. :4.052e+08 Min. :5.364e+08 Min. :6.126e+07
## 1st Qu.:6.711e+09 1st Qu.:8.384e+09 1st Qu.:1.512e+09
## Median :1.264e+10 Median :1.487e+10 Median :3.392e+09
## Mean :1.471e+10 Mean :1.690e+10 Mean :4.670e+09
## 3rd Qu.:1.994e+10 3rd Qu.:2.284e+10 3rd Qu.:5.990e+09
## Max. :7.230e+10 Max. :6.712e+10 Max. :3.177e+10
## NA's :200 NA's :200 NA's :200
## debt_momentum cash_lag1 cash_lag2
## Min. :-4.430e+10 Min. :7.867e+07 Min. :7.867e+07
## 1st Qu.:-1.442e+09 1st Qu.:8.518e+08 1st Qu.:7.932e+08
## Median : 1.005e+09 Median :1.535e+09 Median :1.423e+09
## Mean : 2.157e+09 Mean :1.844e+09 Mean :1.707e+09
## 3rd Qu.: 5.682e+09 3rd Qu.:2.447e+09 3rd Qu.:2.326e+09
## Max. : 5.769e+10 Max. :9.706e+09 Max. :7.909e+09
## NA's :100 NA's :100 NA's :200
## cash_rolling_mean cash_rolling_std cash_momentum
## Min. :8.769e+07 Min. :1.047e+07 Min. :-4.833e+09
## 1st Qu.:9.317e+08 1st Qu.:2.167e+08 1st Qu.:-2.700e+08
## Median :1.727e+09 Median :4.191e+08 Median : 1.292e+08
## Mean :1.938e+09 Mean :5.625e+08 Mean : 2.190e+08
## 3rd Qu.:2.573e+09 3rd Qu.:7.278e+08 3rd Qu.: 6.458e+08
## Max. :7.724e+09 Max. :3.713e+09 Max. : 5.684e+09
## NA's :200 NA's :200 NA's :100
## inventory_lag1 inventory_lag2 inventory_rolling_mean
## Min. :8.139e+07 Min. :8.139e+07 Min. :1.444e+08
## 1st Qu.:1.544e+09 1st Qu.:1.465e+09 1st Qu.:1.671e+09
## Median :2.687e+09 Median :2.537e+09 Median :3.059e+09
## Mean :3.291e+09 Mean :3.063e+09 Mean :3.484e+09
## 3rd Qu.:4.488e+09 3rd Qu.:4.250e+09 3rd Qu.:4.811e+09
## Max. :1.517e+10 Max. :1.361e+10 Max. :1.320e+10
## NA's :100 NA's :200 NA's :200
## inventory_rolling_std inventory_momentum receivable_lag1
## Min. :1.876e+07 Min. :-6.342e+09 Min. :7.304e+07
## 1st Qu.:3.602e+08 1st Qu.:-3.263e+08 1st Qu.:1.220e+09
## Median :6.999e+08 Median : 2.333e+08 Median :2.101e+09
## Mean :8.808e+08 Mean : 4.328e+08 Mean :2.623e+09
## 3rd Qu.:1.204e+09 3rd Qu.: 1.114e+09 3rd Qu.:3.562e+09
## Max. :5.893e+09 Max. : 9.903e+09 Max. :1.427e+10
## NA's :200 NA's :100 NA's :100
## receivable_lag2 receivable_rolling_mean receivable_rolling_std
## Min. :7.304e+07 Min. :8.992e+07 Min. :6.299e+06
## 1st Qu.:1.163e+09 1st Qu.:1.352e+09 1st Qu.:2.396e+08
## Median :2.015e+09 Median :2.444e+09 Median :5.277e+08
## Mean :2.429e+09 Mean :2.773e+09 Mean :7.137e+08
## 3rd Qu.:3.307e+09 3rd Qu.:3.687e+09 3rd Qu.:9.968e+08
## Max. :1.068e+10 Max. :1.191e+10 Max. :6.182e+09
## NA's :200 NA's :200 NA's :200
## receivable_momentum employees_lag1 employees_lag2 employees_rolling_mean
## Min. :-6.736e+09 Min. : 305 Min. : 305 Min. : 789
## 1st Qu.:-2.211e+08 1st Qu.:1396 1st Qu.:1392 1st Qu.:2054
## Median : 1.893e+08 Median :2733 Median :2738 Median :2643
## Mean : 3.462e+08 Mean :2687 Mean :2692 Mean :2647
## 3rd Qu.: 8.703e+08 3rd Qu.:3938 3rd Qu.:3938 3rd Qu.:3240
## Max. : 9.032e+09 Max. :4997 Max. :4997 Max. :4799
## NA's :100 NA's :100 NA's :200 NA's :200
## employees_rolling_std employees_momentum shares_lag1
## Min. : 8.0 Min. :-4411.00 Min. : 50690778
## 1st Qu.: 919.1 1st Qu.:-1507.75 1st Qu.:169835258
## Median :1281.6 Median : 31.50 Median :288341835
## Mean :1289.0 Mean : -50.61 Mean :280194173
## 3rd Qu.:1698.9 3rd Qu.: 1439.50 3rd Qu.:392930827
## Max. :2562.5 Max. : 4502.00 Max. :499282127
## NA's :200 NA's :100 NA's :100
## shares_lag2 shares_rolling_mean shares_rolling_std
## Min. : 52181752 Min. : 70002088 Min. : 979757
## 1st Qu.:172664926 1st Qu.:229760300 1st Qu.: 78753490
## Median :286961198 Median :281812556 Median :118686117
## Mean :281401466 Mean :279592734 Mean :118601504
## 3rd Qu.:394419786 3rd Qu.:330473144 3rd Qu.:157807119
## Max. :499282127 Max. :470985140 Max. :237662957
## NA's :200 NA's :200 NA's :200
## shares_momentum operating_cf_lag1 operating_cf_lag2
## Min. :-432229983 Min. :1.239e+08 Min. :1.239e+08
## 1st Qu.:-130024940 1st Qu.:1.680e+09 1st Qu.:1.613e+09
## Median : -4018480 Median :3.026e+09 Median :2.853e+09
## Mean : -1438342 Mean :3.587e+09 Mean :3.315e+09
## 3rd Qu.: 128116534 3rd Qu.:4.885e+09 3rd Qu.:4.363e+09
## Max. : 433670256 Max. :1.898e+10 Max. :1.633e+10
## NA's :100 NA's :100 NA's :200
## operating_cf_rolling_mean operating_cf_rolling_std operating_cf_momentum
## Min. :1.479e+08 Min. :9.572e+06 Min. :-9.930e+09
## 1st Qu.:1.841e+09 1st Qu.:3.477e+08 1st Qu.:-2.678e+08
## Median :3.319e+09 Median :6.469e+08 Median : 2.474e+08
## Mean :3.782e+09 Mean :9.216e+08 Mean : 4.500e+08
## 3rd Qu.:5.154e+09 3rd Qu.:1.242e+09 3rd Qu.: 1.063e+09
## Max. :1.614e+10 Max. :5.660e+09 Max. : 9.095e+09
## NA's :200 NA's :200 NA's :100
## investing_cf_lag1 investing_cf_lag2 investing_cf_rolling_mean
## Min. :-9.722e+09 Min. :-9.128e+09 Min. :-9.455e+09
## 1st Qu.:-2.512e+09 1st Qu.:-2.304e+09 1st Qu.:-2.638e+09
## Median :-1.516e+09 Median :-1.449e+09 Median :-1.724e+09
## Mean :-1.866e+09 Mean :-1.722e+09 Mean :-1.969e+09
## 3rd Qu.:-8.471e+08 3rd Qu.:-8.134e+08 3rd Qu.:-9.639e+08
## Max. :-3.794e+07 Max. :-3.794e+07 Max. :-7.481e+07
## NA's :100 NA's :200 NA's :200
## investing_cf_rolling_std investing_cf_momentum financing_cf_lag1
## Min. :7.387e+06 Min. :-5.111e+09 Min. :-5.192e+09
## 1st Qu.:2.212e+08 1st Qu.:-6.928e+08 1st Qu.:-6.405e+08
## Median :4.304e+08 Median :-1.052e+08 Median : 1.109e+07
## Mean :5.672e+08 Mean :-2.409e+08 Mean : 2.565e+06
## 3rd Qu.:7.653e+08 3rd Qu.: 3.087e+08 3rd Qu.: 6.555e+08
## Max. :3.385e+09 Max. : 3.534e+09 Max. : 5.930e+09
## NA's :200 NA's :100 NA's :100
## financing_cf_lag2 financing_cf_rolling_mean financing_cf_rolling_std
## Min. :-4.497e+09 Min. :-3.695e+09 Min. :2.881e+06
## 1st Qu.:-6.055e+08 1st Qu.:-3.374e+08 1st Qu.:3.709e+08
## Median : 1.947e+07 Median : 1.446e+06 Median :7.672e+08
## Mean : 3.023e+06 Mean :-6.865e+06 Mean :9.971e+08
## 3rd Qu.: 6.052e+08 3rd Qu.: 3.618e+08 3rd Qu.:1.343e+09
## Max. : 4.582e+09 Max. : 5.243e+09 Max. :5.221e+09
## NA's :200 NA's :200 NA's :200
## financing_cf_momentum revenue_growth_lag1 revenue_growth_lag2
## Min. :-8.569e+09 Min. :-0.11970 Min. :-0.11970
## 1st Qu.:-8.165e+08 1st Qu.: 0.05931 1st Qu.: 0.05668
## Median :-1.216e+07 Median : 0.12804 Median : 0.12558
## Mean :-1.629e+07 Mean : 0.12857 Mean : 0.12684
## 3rd Qu.: 8.590e+08 3rd Qu.: 0.18846 3rd Qu.: 0.18809
## Max. : 8.065e+09 Max. : 0.38287 Max. : 0.38287
## NA's :100 NA's :200 NA's :300
## revenue_growth_rolling_mean revenue_growth_rolling_std revenue_growth_momentum
## Min. :0.004631 Min. :0.005021 Min. :-0.398044
## 1st Qu.:0.094238 1st Qu.:0.058974 1st Qu.:-0.117437
## Median :0.131864 Median :0.092919 Median :-0.005155
## Mean :0.127803 Mean :0.095023 Mean :-0.002201
## 3rd Qu.:0.162381 3rd Qu.:0.126197 3rd Qu.: 0.113302
## Max. :0.254524 Max. :0.228557 Max. : 0.424383
## NA's :300 NA's :300 NA's :200
## domestic_share_lag1 domestic_share_lag2 domestic_share_rolling_mean
## Min. :0.7500 Min. :0.7500 Min. :0.7621
## 1st Qu.:0.7886 1st Qu.:0.7886 1st Qu.:0.8082
## Median :0.8259 Median :0.8255 Median :0.8257
## Mean :0.8261 Mean :0.8259 Mean :0.8261
## 3rd Qu.:0.8647 3rd Qu.:0.8647 3rd Qu.:0.8444
## Max. :0.8999 Max. :0.8999 Max. :0.8949
## NA's :100 NA's :200 NA's :200
## domestic_share_rolling_std domestic_share_momentum export_share_lag1
## Min. :0.001101 Min. :-0.146020 Min. :0.1001
## 1st Qu.:0.025001 1st Qu.:-0.042776 1st Qu.:0.1353
## Median :0.039182 Median : 0.000750 Median :0.1741
## Mean :0.039348 Mean :-0.000414 Mean :0.1739
## 3rd Qu.:0.053388 3rd Qu.: 0.044058 3rd Qu.:0.2114
## Max. :0.081052 Max. : 0.137871 Max. :0.2500
## NA's :200 NA's :100 NA's :100
## export_share_lag2 export_share_rolling_mean export_share_rolling_std
## Min. :0.1001 Min. :0.1051 Min. :0.001101
## 1st Qu.:0.1353 1st Qu.:0.1556 1st Qu.:0.025001
## Median :0.1745 Median :0.1743 Median :0.039182
## Mean :0.1741 Mean :0.1739 Mean :0.039348
## 3rd Qu.:0.2114 3rd Qu.:0.1918 3rd Qu.:0.053388
## Max. :0.2500 Max. :0.2379 Max. :0.081052
## NA's :200 NA's :200 NA's :200
## export_share_momentum revenue_per_employee_lag1 revenue_per_employee_lag2
## Min. :-0.137871 Min. : 198047 Min. : 198047
## 1st Qu.:-0.044058 1st Qu.: 3387392 1st Qu.: 3153488
## Median :-0.000750 Median : 6301592 Median : 5872301
## Mean : 0.000414 Mean :10833026 Mean :10376992
## 3rd Qu.: 0.042776 3rd Qu.:12981077 3rd Qu.:11966675
## Max. : 0.146020 Max. :82153197 Max. :82153197
## NA's :100 NA's :100 NA's :200
## revenue_per_employee_rolling_mean revenue_per_employee_rolling_std
## Min. : 262827 Min. : 30759
## 1st Qu.: 4514881 1st Qu.: 1827885
## Median : 8670465 Median : 4478237
## Mean :11710682 Mean : 7991243
## 3rd Qu.:15377774 3rd Qu.: 9862523
## Max. :66963532 Max. :78035985
## NA's :200 NA's :200
## revenue_per_employee_momentum revenue_per_asset_lag1 revenue_per_asset_lag2
## Min. :-65775600 Min. :0.3574 Min. :0.3574
## 1st Qu.: -2826681 1st Qu.:0.4067 1st Qu.:0.4075
## Median : 350106 Median :0.4640 Median :0.4646
## Mean : 1794261 Mean :0.4812 Mean :0.4810
## 3rd Qu.: 5149380 3rd Qu.:0.5548 3rd Qu.:0.5534
## Max. :133456304 Max. :0.6665 Max. :0.6658
## NA's :100 NA's :100 NA's :200
## revenue_per_asset_rolling_mean revenue_per_asset_rolling_std
## Min. :0.3733 Min. :0.006425
## 1st Qu.:0.4385 1st Qu.:0.040866
## Median :0.4808 Median :0.074250
## Mean :0.4810 Mean :0.074524
## 3rd Qu.:0.5157 3rd Qu.:0.104103
## Max. :0.6367 Max. :0.172142
## NA's :200 NA's :200
## revenue_per_asset_momentum revenue_per_equity_lag1 revenue_per_equity_lag2
## Min. :-0.291987 Min. :0.4811 Min. :0.4811
## 1st Qu.:-0.073826 1st Qu.:0.6657 1st Qu.:0.6639
## Median : 0.001466 Median :0.7918 Median :0.7900
## Mean : 0.000370 Mean :0.8184 Mean :0.8155
## 3rd Qu.: 0.074934 3rd Qu.:0.9371 3rd Qu.:0.9367
## Max. : 0.300186 Max. :1.4334 Max. :1.4334
## NA's :100 NA's :100 NA's :200
## revenue_per_equity_rolling_mean revenue_per_equity_rolling_std
## Min. :0.5456 Min. :0.009297
## 1st Qu.:0.7361 1st Qu.:0.094488
## Median :0.8077 Median :0.158698
## Mean :0.8183 Mean :0.170094
## 3rd Qu.:0.8905 3rd Qu.:0.232338
## Max. :1.3062 Max. :0.465799
## NA's :200 NA's :200
## revenue_per_equity_momentum revenue_per_share_lag1 revenue_per_share_lag2
## Min. :-0.782539 Min. : 1.832 Min. : 1.832
## 1st Qu.:-0.165571 1st Qu.: 32.886 1st Qu.: 31.539
## Median : 0.005282 Median : 62.212 Median : 57.173
## Mean : 0.002213 Mean : 93.617 Mean : 85.186
## 3rd Qu.: 0.165738 3rd Qu.:117.915 3rd Qu.:108.250
## Max. : 0.919153 Max. :920.742 Max. :920.742
## NA's :100 NA's :100 NA's :200
## revenue_per_share_rolling_mean revenue_per_share_rolling_std
## Min. : 3.45 Min. : 0.1449
## 1st Qu.: 46.06 1st Qu.: 13.6587
## Median : 78.74 Median : 35.1846
## Mean : 98.35 Mean : 58.2226
## 3rd Qu.:133.07 3rd Qu.: 75.7685
## Max. :509.92 Max. :454.0048
## NA's :200 NA's :200
## revenue_per_share_momentum gross_margin_lag1 gross_margin_lag2
## Min. :-715.315 Min. :0.3500 Min. :0.3500
## 1st Qu.: -22.643 1st Qu.:0.3894 1st Qu.:0.3882
## Median : 4.622 Median :0.4280 Median :0.4271
## Mean : 12.796 Mean :0.4270 Mean :0.4264
## 3rd Qu.: 44.955 3rd Qu.:0.4658 3rd Qu.:0.4656
## Max. : 794.887 Max. :0.5000 Max. :0.5000
## NA's :100 NA's :100 NA's :200
## gross_margin_rolling_mean gross_margin_rolling_std gross_margin_momentum
## Min. :0.3564 Min. :0.001141 Min. :-0.148800
## 1st Qu.:0.4096 1st Qu.:0.027902 1st Qu.:-0.044549
## Median :0.4277 Median :0.041058 Median : 0.001636
## Mean :0.4268 Mean :0.040668 Mean :-0.000465
## 3rd Qu.:0.4440 3rd Qu.:0.053569 3rd Qu.: 0.045173
## Max. :0.4929 Max. :0.079875 Max. : 0.137160
## NA's :200 NA's :200 NA's :100
## ebit_margin_lag1 ebit_margin_lag2 ebit_margin_rolling_mean
## Min. :0.05446 Min. :0.05683 Min. :0.1071
## 1st Qu.:0.15030 1st Qu.:0.14906 1st Qu.:0.1671
## Median :0.18773 Median :0.18802 Median :0.1865
## Mean :0.18787 Mean :0.18700 Mean :0.1877
## 3rd Qu.:0.22589 3rd Qu.:0.22558 3rd Qu.:0.2092
## Max. :0.31408 Max. :0.31408 Max. :0.2832
## NA's :100 NA's :200 NA's :200
## ebit_margin_rolling_std ebit_margin_momentum ebitda_margin_lag1
## Min. :0.00108 Min. :-0.202350 Min. :0.1151
## 1st Qu.:0.02959 1st Qu.:-0.051292 1st Qu.:0.2136
## Median :0.04548 Median :-0.002749 Median :0.2517
## Mean :0.04835 Mean :-0.000257 Mean :0.2525
## 3rd Qu.:0.06567 3rd Qu.: 0.054456 3rd Qu.:0.2921
## Max. :0.11997 Max. : 0.218452 Max. :0.3851
## NA's :200 NA's :100 NA's :100
## ebitda_margin_lag2 ebitda_margin_rolling_mean ebitda_margin_rolling_std
## Min. :0.1151 Min. :0.1682 Min. :0.002424
## 1st Qu.:0.2133 1st Qu.:0.2306 1st Qu.:0.030788
## Median :0.2516 Median :0.2511 Median :0.046050
## Mean :0.2516 Mean :0.2523 Mean :0.048710
## 3rd Qu.:0.2900 3rd Qu.:0.2736 3rd Qu.:0.066805
## Max. :0.3851 Max. :0.3526 Max. :0.116791
## NA's :200 NA's :200 NA's :200
## ebitda_margin_momentum net_margin_lag1 net_margin_lag2
## Min. :-0.210979 Min. :0.04258 Min. :0.0447
## 1st Qu.:-0.051267 1st Qu.:0.11137 1st Qu.:0.1108
## Median :-0.002032 Median :0.14123 Median :0.1410
## Mean :-0.000440 Mean :0.14098 Mean :0.1405
## 3rd Qu.: 0.053644 3rd Qu.:0.16976 3rd Qu.:0.1693
## Max. : 0.204711 Max. :0.25210 Max. :0.2521
## NA's :100 NA's :100 NA's :200
## net_margin_rolling_mean net_margin_rolling_std net_margin_momentum
## Min. :0.07437 Min. :0.00033 Min. :-0.153114
## 1st Qu.:0.12499 1st Qu.:0.02215 1st Qu.:-0.038936
## Median :0.13952 Median :0.03632 Median :-0.002671
## Mean :0.14105 Mean :0.03738 Mean :-0.000103
## 3rd Qu.:0.15719 3rd Qu.:0.05066 3rd Qu.: 0.042611
## Max. :0.20874 Max. :0.09924 Max. : 0.174952
## NA's :200 NA's :200 NA's :100
## profit_growth_lag1 profit_growth_lag2 profit_growth_rolling_mean
## Min. :-0.7186 Min. :-0.6952 Min. :-0.21135
## 1st Qu.:-0.1661 1st Qu.:-0.1716 1st Qu.: 0.07833
## Median : 0.1259 Median : 0.1323 Median : 0.22671
## Mean : 0.2593 Mean : 0.2501 Mean : 0.25433
## 3rd Qu.: 0.5251 3rd Qu.: 0.5216 3rd Qu.: 0.37465
## Max. : 3.8825 Max. : 3.8825 Max. : 1.29336
## NA's :200 NA's :300 NA's :300
str(finance_df)
## 'data.frame': 1000 obs. of 210 variables:
## $ company : chr "Company_1" "Company_1" "Company_1" "Company_1" ...
## $ year : int 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 ...
## $ revenue : num 8.17e+09 8.26e+09 9.70e+09 1.19e+10 1.49e+10 ...
## $ domestic_sales : num 6.86e+09 6.22e+09 8.65e+09 9.92e+09 1.15e+10 ...
## $ export_sales : num 1.31e+09 2.04e+09 1.05e+09 1.96e+09 3.42e+09 ...
## $ cogs : num 4.27e+09 4.78e+09 5.67e+09 5.99e+09 9.26e+09 ...
## $ gross_profit : num 3.89e+09 3.48e+09 4.03e+09 5.88e+09 5.67e+09 ...
## $ operating_expense : num 1.62e+09 1.88e+09 2.20e+09 3.34e+09 2.82e+09 ...
## $ ebit : num 2.27e+09 1.60e+09 1.84e+09 2.54e+09 2.85e+09 ...
## $ ebitda : num 2.69e+09 2.02e+09 2.33e+09 3.30e+09 4.04e+09 ...
## $ net_profit : num 1.87e+09 1.35e+09 1.28e+09 1.85e+09 2.29e+09 ...
## $ assets : num 1.86e+10 1.49e+10 1.76e+10 3.21e+10 2.63e+10 ...
## $ equity : num 1.23e+10 7.10e+09 1.15e+10 2.15e+10 1.19e+10 ...
## $ debt : num 6.29e+09 7.78e+09 6.07e+09 1.07e+10 1.44e+10 ...
## $ cash : num 4.25e+08 9.23e+08 1.08e+09 9.82e+08 1.96e+09 ...
## $ inventory : num 2.00e+09 1.30e+09 2.18e+09 2.20e+09 3.08e+09 ...
## $ receivable : num 1.47e+09 1.63e+09 9.78e+08 1.69e+09 2.50e+09 ...
## $ employees : int 2733 1828 3452 637 4774 2990 1978 3602 3604 2007 ...
## $ shares : int 294061375 249502978 87709731 220476398 368019332 96321408 73841962 186702141 173850419 362601305 ...
## $ operating_cf : num 2.52e+09 1.94e+09 1.58e+09 2.68e+09 3.42e+09 ...
## $ investing_cf : num -9.12e+08 -9.74e+08 -1.22e+09 -1.02e+09 -1.72e+09 ...
## $ financing_cf : num 1.82e+08 -8.17e+07 -1.45e+08 -5.20e+08 1.24e+09 ...
## $ revenue_growth : num NA 0.011 0.175 0.224 0.257 ...
## $ domestic_share : num 0.84 0.753 0.891 0.835 0.771 ...
## $ export_share : num 0.16 0.247 0.109 0.165 0.229 ...
## $ revenue_per_employee : num 2987781 4516026 2810452 18645123 3127314 ...
## $ revenue_per_asset : num 0.438 0.555 0.551 0.37 0.569 ...
## $ revenue_per_equity : num 0.662 1.163 0.842 0.553 1.259 ...
## $ revenue_per_share : num 27.8 33.1 110.6 53.9 40.6 ...
## $ gross_margin : num 0.477 0.421 0.416 0.495 0.38 ...
## $ ebit_margin : num 0.278 0.193 0.189 0.214 0.191 ...
## $ ebitda_margin : num 0.33 0.245 0.24 0.278 0.27 ...
## $ net_margin : num 0.229 0.163 0.132 0.156 0.153 ...
## $ profit_growth : num NA -0.2783 -0.0517 0.4511 0.2349 ...
## $ cogs_margin : num 0.523 0.579 0.584 0.505 0.62 ...
## $ opex_ratio : num 0.199 0.228 0.226 0.281 0.189 ...
## $ asset_turnover : num 0.438 0.555 0.551 0.37 0.569 ...
## $ equity_ratio : num 0.662 0.477 0.655 0.668 0.452 ...
## $ debt_ratio : num 0.338 0.523 0.345 0.332 0.548 ...
## $ cash_ratio : num 0.0676 0.1187 0.1774 0.0921 0.1364 ...
## $ working_capital : num 1.85e+09 1.05e+09 9.66e+08 5.58e+08 2.76e+09 ...
## $ fcf : num 1.60e+09 9.65e+08 3.60e+08 1.67e+09 1.70e+09 ...
## $ ocf_margin : num 0.308 0.235 0.163 0.226 0.229 ...
## $ inventory_turnover : num 2.13 3.68 2.6 2.72 3.01 ...
## $ receivable_turnover : num 5.56 5.05 9.92 7.02 5.97 ...
## $ eps : num 6.35 5.4 14.57 8.41 6.22 ...
## $ book_value_per_share : num 42 28.5 131.4 97.4 32.2 ...
## $ year_lag1 : num NA 2016 2017 2018 2019 ...
## $ year_lag2 : num NA NA 2016 2017 2018 ...
## $ year_rolling_mean : num NA NA 2017 2018 2019 ...
## $ year_rolling_std : num NA NA 1 1 1 1 1 1 1 1 ...
## $ year_momentum : num NA 1 1 1 1 1 1 1 1 1 ...
## $ revenue_lag1 : num NA 8.17e+09 8.26e+09 9.70e+09 1.19e+10 ...
## $ revenue_lag2 : num NA NA 8.17e+09 8.26e+09 9.70e+09 ...
## $ revenue_rolling_mean : num NA NA 8.71e+09 9.94e+09 1.22e+10 ...
## $ revenue_rolling_std : num NA NA 8.62e+08 1.82e+09 2.63e+09 ...
## $ revenue_momentum : num NA 8.97e+07 1.45e+09 2.18e+09 3.05e+09 ...
## $ domestic_sales_lag1 : num NA 6.86e+09 6.22e+09 8.65e+09 9.92e+09 ...
## $ domestic_sales_lag2 : num NA NA 6.86e+09 6.22e+09 8.65e+09 ...
## $ domestic_sales_rolling_mean : num NA NA 7.24e+09 8.26e+09 1.00e+10 ...
## $ domestic_sales_rolling_std : num NA NA 1.26e+09 1.88e+09 1.44e+09 ...
## $ domestic_sales_momentum : num NA -6.37e+08 2.43e+09 1.27e+09 1.59e+09 ...
## $ export_sales_lag1 : num NA 1.31e+09 2.04e+09 1.05e+09 1.96e+09 ...
## $ export_sales_lag2 : num NA NA 1.31e+09 2.04e+09 1.05e+09 ...
## $ export_sales_rolling_mean : num NA NA 1.47e+09 1.68e+09 2.14e+09 ...
## $ export_sales_rolling_std : num NA NA 5.09e+08 5.45e+08 1.19e+09 ...
## $ export_sales_momentum : num NA 7.27e+08 -9.81e+08 9.04e+08 1.46e+09 ...
## $ cogs_lag1 : num NA 4.27e+09 4.78e+09 5.67e+09 5.99e+09 ...
## $ cogs_lag2 : num NA NA 4.27e+09 4.78e+09 5.67e+09 ...
## $ cogs_rolling_mean : num NA NA 4.91e+09 5.48e+09 6.98e+09 ...
## $ cogs_rolling_std : num NA NA 7.07e+08 6.30e+08 1.99e+09 ...
## $ cogs_momentum : num NA 5.04e+08 8.93e+08 3.24e+08 3.27e+09 ...
## $ gross_profit_lag1 : num NA 3.89e+09 3.48e+09 4.03e+09 5.88e+09 ...
## $ gross_profit_lag2 : num NA NA 3.89e+09 3.48e+09 4.03e+09 ...
## $ gross_profit_rolling_mean : num NA NA 3.80e+09 4.46e+09 5.19e+09 ...
## $ gross_profit_rolling_std : num NA NA 2.88e+08 1.26e+09 1.01e+09 ...
## $ gross_profit_momentum : num NA -4.14e+08 5.53e+08 1.85e+09 -2.14e+08 ...
## $ operating_expense_lag1 : num NA 1.62e+09 1.88e+09 2.20e+09 3.34e+09 ...
## $ operating_expense_lag2 : num NA NA 1.62e+09 1.88e+09 2.20e+09 ...
## $ operating_expense_rolling_mean : num NA NA 1.90e+09 2.47e+09 2.78e+09 ...
## $ operating_expense_rolling_std : num NA NA 2.87e+08 7.67e+08 5.73e+08 ...
## $ operating_expense_momentum : num NA 2.59e+08 3.13e+08 1.14e+09 -5.17e+08 ...
## $ ebit_lag1 : num NA 2.27e+09 1.60e+09 1.84e+09 2.54e+09 ...
## $ ebit_lag2 : num NA NA 2.27e+09 1.60e+09 1.84e+09 ...
## $ ebit_rolling_mean : num NA NA 1.90e+09 1.99e+09 2.41e+09 ...
## $ ebit_rolling_std : num NA NA 3.41e+08 4.93e+08 5.19e+08 ...
## $ ebit_momentum : num NA -6.73e+08 2.40e+08 7.08e+08 3.03e+08 ...
## $ ebitda_lag1 : num NA 2.69e+09 2.02e+09 2.33e+09 3.30e+09 ...
## $ ebitda_lag2 : num NA NA 2.69e+09 2.02e+09 2.33e+09 ...
## $ ebitda_rolling_mean : num NA NA 2.35e+09 2.55e+09 3.22e+09 ...
## $ ebitda_rolling_std : num NA NA 3.36e+08 6.68e+08 8.58e+08 ...
## $ ebitda_momentum : num NA -6.71e+08 3.06e+08 9.73e+08 7.38e+08 ...
## $ net_profit_lag1 : num NA 1.87e+09 1.35e+09 1.28e+09 1.85e+09 ...
## $ net_profit_lag2 : num NA NA 1.87e+09 1.35e+09 1.28e+09 ...
## $ net_profit_rolling_mean : num NA NA 1.50e+09 1.49e+09 1.81e+09 ...
## $ net_profit_rolling_std : num NA NA 3.22e+08 3.15e+08 5.08e+08 ...
## $ net_profit_momentum : num NA -5.20e+08 -6.98e+07 5.77e+08 4.36e+08 ...
## $ assets_lag1 : num NA 1.86e+10 1.49e+10 1.76e+10 3.21e+10 ...
## $ assets_lag2 : num NA NA 1.86e+10 1.49e+10 1.76e+10 ...
## [list output truncated]
This dataset is a realistic corporate financial dataset that comprises of 100 companies over 10 years (2016- 2025) that include 210 engineered features from income statements, balance sheets and cash flow statements.
STRUCTURE
The dataset variables are all numerical except “Company” column which is chr.
The company column is organized in suc a way that every ten rows is data pertaining to the same company
There are missing values in the dataset but most importantly, the first line of the start for each company is empty so we have 100 empty rows in the dataset.
The dataset also contains lag columns, rolling mean, rolling standard deviation, and momentum
The target variable is profit_growth
The goal is to visualize the distribution of the data by checking:
Here i will use the feature categories from which I’ll pick the most relevant to profit growth.
Also, they are all T-1 features.
For the efficiency features we will create a column ‘inventory_turnover_lag1’ by calculating cogs_lag1 / mean(inventory_lag1 & inventory_lag2) For the shareholder metrics, we will create a column ‘eps_lag1’ for year T-1 by net_profit_lag1 / shares_lag1
# efficiency features - inventory_turnover_lag1
finance_df <- finance_df%>%
group_by(company)%>%
arrange(year)%>%
mutate(
# average inventory lag 1 using inventory lag1 and inventory lag2
average_inventory_lag1 = (inventory_lag1 + inventory_lag2) / 2,
# inventory turnover lag 1 for T-1 values only
inventory_turnover_lag1 = cogs_lag1 / average_inventory_lag1
) %>%
ungroup()
# shareholder metrics - eps_lag1
finance_df <- finance_df %>%
group_by(company)%>%
arrange(year)%>%
mutate(
eps_lag1 = net_profit_lag1 / shares_lag1 # earnings per share are calculated by dividing net profit by number of shares
)%>%
ungroup()
features_to_plot <- c("revenue_growth_lag1",
"ebit_lag1",
"operating_expense_lag1",
"revenue_per_asset_lag1",
"cash_lag1",
"debt_lag1",
"operating_cf_lag1",
"inventory_turnover_lag1",
"eps_lag1",
"revenue_growth_momentum"
)
for (feature in features_to_plot) {
plot = ggplot(finance_df, aes_string(x = feature)) +
geom_histogram(bins = 20, color = "red", fill = "steelblue", alpha = 0.7) +
labs(title = paste("Distribution of", feature),
x = feature,
y = 'Count')+
theme_minimal()
print(plot)
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 200 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 100 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 100 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 100 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 100 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 100 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 100 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 200 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 100 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 200 rows containing non-finite outside the scale range
## (`stat_bin()`).
### Key findings
Revenue_growth_lag1, inventory_turnover_lag1, revenue_growth_momentum form a bell-shaped curve indicating normal distribution
What this means for my project
There is no need for transformation of these features
I am not expecting extreme outliers
Logistic regression would work well
Revenue_per_asset_lag1 is irregularly distributed among the ten companies with very high peaks at 0.4 to 0.45 then 0.5 to 0.55
This is how many dollars of revenue the company generates for every dollar invested in assets
Its not a bell shaped curve/ not normal distribution
The range is between 0.35 - 0.7, so for every dollar invested in the companies’ assets, the companies generate between 35 to 70 cents. It is a narrow range that indicates that some companies have high asset efficiency of around 50 cents per dollar and others have lower asset efficiency of about 40 cents per dollar.
What this means for my project
I am expecting outliers but not extreme outliers
Does not require transformation
More visualization by boxplots needed.
ebit_lag1, operating_expense_lag1, cash_lag1, debt_lag1, operating_cf_lag1, eps_lag1 features are right-skewed
What this means for my project
These features represent monetary values where most companies cluster at lower values, while few companies cluster at large values which is an expected observation.
The large values may distort the model, there is need for log transformation to handle the skewness.
Feature scaling is needed to standardize or bring the features to a comparable scale.
Check for missing values and patterns if any
##
## FALSE TRUE
## 185400 27600
## company year
## 0 0
## revenue domestic_sales
## 0 0
## export_sales cogs
## 0 0
## gross_profit operating_expense
## 0 0
## ebit ebitda
## 0 0
## net_profit assets
## 0 0
## equity debt
## 0 0
## cash inventory
## 0 0
## receivable employees
## 0 0
## shares operating_cf
## 0 0
## investing_cf financing_cf
## 0 0
## revenue_growth domestic_share
## 100 0
## export_share revenue_per_employee
## 0 0
## revenue_per_asset revenue_per_equity
## 0 0
## revenue_per_share gross_margin
## 0 0
## ebit_margin ebitda_margin
## 0 0
## net_margin profit_growth
## 0 100
## cogs_margin opex_ratio
## 0 0
## asset_turnover equity_ratio
## 0 0
## debt_ratio cash_ratio
## 0 0
## working_capital fcf
## 0 0
## ocf_margin inventory_turnover
## 0 0
## receivable_turnover eps
## 0 0
## book_value_per_share year_lag1
## 0 100
## year_lag2 year_rolling_mean
## 200 200
## year_rolling_std year_momentum
## 200 100
## revenue_lag1 revenue_lag2
## 100 200
## revenue_rolling_mean revenue_rolling_std
## 200 200
## revenue_momentum domestic_sales_lag1
## 100 100
## domestic_sales_lag2 domestic_sales_rolling_mean
## 200 200
## domestic_sales_rolling_std domestic_sales_momentum
## 200 100
## export_sales_lag1 export_sales_lag2
## 100 200
## export_sales_rolling_mean export_sales_rolling_std
## 200 200
## export_sales_momentum cogs_lag1
## 100 100
## cogs_lag2 cogs_rolling_mean
## 200 200
## cogs_rolling_std cogs_momentum
## 200 100
## gross_profit_lag1 gross_profit_lag2
## 100 200
## gross_profit_rolling_mean gross_profit_rolling_std
## 200 200
## gross_profit_momentum operating_expense_lag1
## 100 100
## operating_expense_lag2 operating_expense_rolling_mean
## 200 200
## operating_expense_rolling_std operating_expense_momentum
## 200 100
## ebit_lag1 ebit_lag2
## 100 200
## ebit_rolling_mean ebit_rolling_std
## 200 200
## ebit_momentum ebitda_lag1
## 100 100
## ebitda_lag2 ebitda_rolling_mean
## 200 200
## ebitda_rolling_std ebitda_momentum
## 200 100
## net_profit_lag1 net_profit_lag2
## 100 200
## net_profit_rolling_mean net_profit_rolling_std
## 200 200
## net_profit_momentum assets_lag1
## 100 100
## assets_lag2 assets_rolling_mean
## 200 200
## assets_rolling_std assets_momentum
## 200 100
## equity_lag1 equity_lag2
## 100 200
## equity_rolling_mean equity_rolling_std
## 200 200
## equity_momentum debt_lag1
## 100 100
## debt_lag2 debt_rolling_mean
## 200 200
## debt_rolling_std debt_momentum
## 200 100
## cash_lag1 cash_lag2
## 100 200
## cash_rolling_mean cash_rolling_std
## 200 200
## cash_momentum inventory_lag1
## 100 100
## inventory_lag2 inventory_rolling_mean
## 200 200
## inventory_rolling_std inventory_momentum
## 200 100
## receivable_lag1 receivable_lag2
## 100 200
## receivable_rolling_mean receivable_rolling_std
## 200 200
## receivable_momentum employees_lag1
## 100 100
## employees_lag2 employees_rolling_mean
## 200 200
## employees_rolling_std employees_momentum
## 200 100
## shares_lag1 shares_lag2
## 100 200
## shares_rolling_mean shares_rolling_std
## 200 200
## shares_momentum operating_cf_lag1
## 100 100
## operating_cf_lag2 operating_cf_rolling_mean
## 200 200
## operating_cf_rolling_std operating_cf_momentum
## 200 100
## investing_cf_lag1 investing_cf_lag2
## 100 200
## investing_cf_rolling_mean investing_cf_rolling_std
## 200 200
## investing_cf_momentum financing_cf_lag1
## 100 100
## financing_cf_lag2 financing_cf_rolling_mean
## 200 200
## financing_cf_rolling_std financing_cf_momentum
## 200 100
## revenue_growth_lag1 revenue_growth_lag2
## 200 300
## revenue_growth_rolling_mean revenue_growth_rolling_std
## 300 300
## revenue_growth_momentum domestic_share_lag1
## 200 100
## domestic_share_lag2 domestic_share_rolling_mean
## 200 200
## domestic_share_rolling_std domestic_share_momentum
## 200 100
## export_share_lag1 export_share_lag2
## 100 200
## export_share_rolling_mean export_share_rolling_std
## 200 200
## export_share_momentum revenue_per_employee_lag1
## 100 100
## revenue_per_employee_lag2 revenue_per_employee_rolling_mean
## 200 200
## revenue_per_employee_rolling_std revenue_per_employee_momentum
## 200 100
## revenue_per_asset_lag1 revenue_per_asset_lag2
## 100 200
## revenue_per_asset_rolling_mean revenue_per_asset_rolling_std
## 200 200
## revenue_per_asset_momentum revenue_per_equity_lag1
## 100 100
## revenue_per_equity_lag2 revenue_per_equity_rolling_mean
## 200 200
## revenue_per_equity_rolling_std revenue_per_equity_momentum
## 200 100
## revenue_per_share_lag1 revenue_per_share_lag2
## 100 200
## revenue_per_share_rolling_mean revenue_per_share_rolling_std
## 200 200
## revenue_per_share_momentum gross_margin_lag1
## 100 100
## gross_margin_lag2 gross_margin_rolling_mean
## 200 200
## gross_margin_rolling_std gross_margin_momentum
## 200 100
## ebit_margin_lag1 ebit_margin_lag2
## 100 200
## ebit_margin_rolling_mean ebit_margin_rolling_std
## 200 200
## ebit_margin_momentum ebitda_margin_lag1
## 100 100
## ebitda_margin_lag2 ebitda_margin_rolling_mean
## 200 200
## ebitda_margin_rolling_std ebitda_margin_momentum
## 200 100
## net_margin_lag1 net_margin_lag2
## 100 200
## net_margin_rolling_mean net_margin_rolling_std
## 200 200
## net_margin_momentum profit_growth_lag1
## 100 200
## profit_growth_lag2 profit_growth_rolling_mean
## 300 300
## average_inventory_lag1 inventory_turnover_lag1
## 200 200
## eps_lag1
## 100
The first row of the first year of every company has missing values from lag_1 to the last column which totals to 100 rows of missing values.
The column year lag_2 has 200 missing values.
PLAN
Removal of these rows
finance_df2 <- finance_df %>%
filter(year!= 2016)
# verify
table(nrow(finance_df2))
##
## 900
## 1
## company year
## 0 0
## revenue domestic_sales
## 0 0
## export_sales cogs
## 0 0
## gross_profit operating_expense
## 0 0
## ebit ebitda
## 0 0
## net_profit assets
## 0 0
## equity debt
## 0 0
## cash inventory
## 0 0
## receivable employees
## 0 0
## shares operating_cf
## 0 0
## investing_cf financing_cf
## 0 0
## revenue_growth domestic_share
## 0 0
## export_share revenue_per_employee
## 0 0
## revenue_per_asset revenue_per_equity
## 0 0
## revenue_per_share gross_margin
## 0 0
## ebit_margin ebitda_margin
## 0 0
## net_margin profit_growth
## 0 0
## cogs_margin opex_ratio
## 0 0
## asset_turnover equity_ratio
## 0 0
## debt_ratio cash_ratio
## 0 0
## working_capital fcf
## 0 0
## ocf_margin inventory_turnover
## 0 0
## receivable_turnover eps
## 0 0
## book_value_per_share year_lag1
## 0 0
## year_lag2 year_rolling_mean
## 100 100
## year_rolling_std year_momentum
## 100 0
## revenue_lag1 revenue_lag2
## 0 100
## revenue_rolling_mean revenue_rolling_std
## 100 100
## revenue_momentum domestic_sales_lag1
## 0 0
## domestic_sales_lag2 domestic_sales_rolling_mean
## 100 100
## domestic_sales_rolling_std domestic_sales_momentum
## 100 0
## export_sales_lag1 export_sales_lag2
## 0 100
## export_sales_rolling_mean export_sales_rolling_std
## 100 100
## export_sales_momentum cogs_lag1
## 0 0
## cogs_lag2 cogs_rolling_mean
## 100 100
## cogs_rolling_std cogs_momentum
## 100 0
## gross_profit_lag1 gross_profit_lag2
## 0 100
## gross_profit_rolling_mean gross_profit_rolling_std
## 100 100
## gross_profit_momentum operating_expense_lag1
## 0 0
## operating_expense_lag2 operating_expense_rolling_mean
## 100 100
## operating_expense_rolling_std operating_expense_momentum
## 100 0
## ebit_lag1 ebit_lag2
## 0 100
## ebit_rolling_mean ebit_rolling_std
## 100 100
## ebit_momentum ebitda_lag1
## 0 0
## ebitda_lag2 ebitda_rolling_mean
## 100 100
## ebitda_rolling_std ebitda_momentum
## 100 0
## net_profit_lag1 net_profit_lag2
## 0 100
## net_profit_rolling_mean net_profit_rolling_std
## 100 100
## net_profit_momentum assets_lag1
## 0 0
## assets_lag2 assets_rolling_mean
## 100 100
## assets_rolling_std assets_momentum
## 100 0
## equity_lag1 equity_lag2
## 0 100
## equity_rolling_mean equity_rolling_std
## 100 100
## equity_momentum debt_lag1
## 0 0
## debt_lag2 debt_rolling_mean
## 100 100
## debt_rolling_std debt_momentum
## 100 0
## cash_lag1 cash_lag2
## 0 100
## cash_rolling_mean cash_rolling_std
## 100 100
## cash_momentum inventory_lag1
## 0 0
## inventory_lag2 inventory_rolling_mean
## 100 100
## inventory_rolling_std inventory_momentum
## 100 0
## receivable_lag1 receivable_lag2
## 0 100
## receivable_rolling_mean receivable_rolling_std
## 100 100
## receivable_momentum employees_lag1
## 0 0
## employees_lag2 employees_rolling_mean
## 100 100
## employees_rolling_std employees_momentum
## 100 0
## shares_lag1 shares_lag2
## 0 100
## shares_rolling_mean shares_rolling_std
## 100 100
## shares_momentum operating_cf_lag1
## 0 0
## operating_cf_lag2 operating_cf_rolling_mean
## 100 100
## operating_cf_rolling_std operating_cf_momentum
## 100 0
## investing_cf_lag1 investing_cf_lag2
## 0 100
## investing_cf_rolling_mean investing_cf_rolling_std
## 100 100
## investing_cf_momentum financing_cf_lag1
## 0 0
## financing_cf_lag2 financing_cf_rolling_mean
## 100 100
## financing_cf_rolling_std financing_cf_momentum
## 100 0
## revenue_growth_lag1 revenue_growth_lag2
## 100 200
## revenue_growth_rolling_mean revenue_growth_rolling_std
## 200 200
## revenue_growth_momentum domestic_share_lag1
## 100 0
## domestic_share_lag2 domestic_share_rolling_mean
## 100 100
## domestic_share_rolling_std domestic_share_momentum
## 100 0
## export_share_lag1 export_share_lag2
## 0 100
## export_share_rolling_mean export_share_rolling_std
## 100 100
## export_share_momentum revenue_per_employee_lag1
## 0 0
## revenue_per_employee_lag2 revenue_per_employee_rolling_mean
## 100 100
## revenue_per_employee_rolling_std revenue_per_employee_momentum
## 100 0
## revenue_per_asset_lag1 revenue_per_asset_lag2
## 0 100
## revenue_per_asset_rolling_mean revenue_per_asset_rolling_std
## 100 100
## revenue_per_asset_momentum revenue_per_equity_lag1
## 0 0
## revenue_per_equity_lag2 revenue_per_equity_rolling_mean
## 100 100
## revenue_per_equity_rolling_std revenue_per_equity_momentum
## 100 0
## revenue_per_share_lag1 revenue_per_share_lag2
## 0 100
## revenue_per_share_rolling_mean revenue_per_share_rolling_std
## 100 100
## revenue_per_share_momentum gross_margin_lag1
## 0 0
## gross_margin_lag2 gross_margin_rolling_mean
## 100 100
## gross_margin_rolling_std gross_margin_momentum
## 100 0
## ebit_margin_lag1 ebit_margin_lag2
## 0 100
## ebit_margin_rolling_mean ebit_margin_rolling_std
## 100 100
## ebit_margin_momentum ebitda_margin_lag1
## 0 0
## ebitda_margin_lag2 ebitda_margin_rolling_mean
## 100 100
## ebitda_margin_rolling_std ebitda_margin_momentum
## 100 0
## net_margin_lag1 net_margin_lag2
## 0 100
## net_margin_rolling_mean net_margin_rolling_std
## 100 100
## net_margin_momentum profit_growth_lag1
## 0 100
## profit_growth_lag2 profit_growth_rolling_mean
## 200 200
## average_inventory_lag1 inventory_turnover_lag1
## 100 100
## eps_lag1
## 0
##
## 0 100 200
## 110 98 5
The year 2017 has missing values for ‘lag_2’, ‘rolling_mean’, ‘rolling_std’. The missing lag 2 values are because of of the missing previous lag 1 values of 2016.
# remove year 20167 for all the companies
finance_df2 <- finance_df2 %>%
filter(year != 2017)
# verify the changes
nrow(finance_df2)
## [1] 800
# Confirm the changes using the selected features
features_to_plot <- c("revenue_growth_lag1",
"ebit_lag1",
"operating_expense_lag1",
"revenue_per_asset_lag1",
"cash_lag1",
"debt_lag1",
"operating_cf_lag1",
"inventory_turnover_lag1",
"eps_lag1",
"revenue_growth_momentum"
)
colSums(is.na(finance_df2[, features_to_plot]))
## revenue_growth_lag1 ebit_lag1 operating_expense_lag1
## 0 0 0
## revenue_per_asset_lag1 cash_lag1 debt_lag1
## 0 0 0
## operating_cf_lag1 inventory_turnover_lag1 eps_lag1
## 0 0 0
## revenue_growth_momentum
## 0
# check for missing values for the whole dataset
table(is.na(finance_df2))
##
## FALSE TRUE
## 169900 500
## company year
## 0 0
## revenue domestic_sales
## 0 0
## export_sales cogs
## 0 0
## gross_profit operating_expense
## 0 0
## ebit ebitda
## 0 0
## net_profit assets
## 0 0
## equity debt
## 0 0
## cash inventory
## 0 0
## receivable employees
## 0 0
## shares operating_cf
## 0 0
## investing_cf financing_cf
## 0 0
## revenue_growth domestic_share
## 0 0
## export_share revenue_per_employee
## 0 0
## revenue_per_asset revenue_per_equity
## 0 0
## revenue_per_share gross_margin
## 0 0
## ebit_margin ebitda_margin
## 0 0
## net_margin profit_growth
## 0 0
## cogs_margin opex_ratio
## 0 0
## asset_turnover equity_ratio
## 0 0
## debt_ratio cash_ratio
## 0 0
## working_capital fcf
## 0 0
## ocf_margin inventory_turnover
## 0 0
## receivable_turnover eps
## 0 0
## book_value_per_share year_lag1
## 0 0
## year_lag2 year_rolling_mean
## 0 0
## year_rolling_std year_momentum
## 0 0
## revenue_lag1 revenue_lag2
## 0 0
## revenue_rolling_mean revenue_rolling_std
## 0 0
## revenue_momentum domestic_sales_lag1
## 0 0
## domestic_sales_lag2 domestic_sales_rolling_mean
## 0 0
## domestic_sales_rolling_std domestic_sales_momentum
## 0 0
## export_sales_lag1 export_sales_lag2
## 0 0
## export_sales_rolling_mean export_sales_rolling_std
## 0 0
## export_sales_momentum cogs_lag1
## 0 0
## cogs_lag2 cogs_rolling_mean
## 0 0
## cogs_rolling_std cogs_momentum
## 0 0
## gross_profit_lag1 gross_profit_lag2
## 0 0
## gross_profit_rolling_mean gross_profit_rolling_std
## 0 0
## gross_profit_momentum operating_expense_lag1
## 0 0
## operating_expense_lag2 operating_expense_rolling_mean
## 0 0
## operating_expense_rolling_std operating_expense_momentum
## 0 0
## ebit_lag1 ebit_lag2
## 0 0
## ebit_rolling_mean ebit_rolling_std
## 0 0
## ebit_momentum ebitda_lag1
## 0 0
## ebitda_lag2 ebitda_rolling_mean
## 0 0
## ebitda_rolling_std ebitda_momentum
## 0 0
## net_profit_lag1 net_profit_lag2
## 0 0
## net_profit_rolling_mean net_profit_rolling_std
## 0 0
## net_profit_momentum assets_lag1
## 0 0
## assets_lag2 assets_rolling_mean
## 0 0
## assets_rolling_std assets_momentum
## 0 0
## equity_lag1 equity_lag2
## 0 0
## equity_rolling_mean equity_rolling_std
## 0 0
## equity_momentum debt_lag1
## 0 0
## debt_lag2 debt_rolling_mean
## 0 0
## debt_rolling_std debt_momentum
## 0 0
## cash_lag1 cash_lag2
## 0 0
## cash_rolling_mean cash_rolling_std
## 0 0
## cash_momentum inventory_lag1
## 0 0
## inventory_lag2 inventory_rolling_mean
## 0 0
## inventory_rolling_std inventory_momentum
## 0 0
## receivable_lag1 receivable_lag2
## 0 0
## receivable_rolling_mean receivable_rolling_std
## 0 0
## receivable_momentum employees_lag1
## 0 0
## employees_lag2 employees_rolling_mean
## 0 0
## employees_rolling_std employees_momentum
## 0 0
## shares_lag1 shares_lag2
## 0 0
## shares_rolling_mean shares_rolling_std
## 0 0
## shares_momentum operating_cf_lag1
## 0 0
## operating_cf_lag2 operating_cf_rolling_mean
## 0 0
## operating_cf_rolling_std operating_cf_momentum
## 0 0
## investing_cf_lag1 investing_cf_lag2
## 0 0
## investing_cf_rolling_mean investing_cf_rolling_std
## 0 0
## investing_cf_momentum financing_cf_lag1
## 0 0
## financing_cf_lag2 financing_cf_rolling_mean
## 0 0
## financing_cf_rolling_std financing_cf_momentum
## 0 0
## revenue_growth_lag1 revenue_growth_lag2
## 0 100
## revenue_growth_rolling_mean revenue_growth_rolling_std
## 100 100
## revenue_growth_momentum domestic_share_lag1
## 0 0
## domestic_share_lag2 domestic_share_rolling_mean
## 0 0
## domestic_share_rolling_std domestic_share_momentum
## 0 0
## export_share_lag1 export_share_lag2
## 0 0
## export_share_rolling_mean export_share_rolling_std
## 0 0
## export_share_momentum revenue_per_employee_lag1
## 0 0
## revenue_per_employee_lag2 revenue_per_employee_rolling_mean
## 0 0
## revenue_per_employee_rolling_std revenue_per_employee_momentum
## 0 0
## revenue_per_asset_lag1 revenue_per_asset_lag2
## 0 0
## revenue_per_asset_rolling_mean revenue_per_asset_rolling_std
## 0 0
## revenue_per_asset_momentum revenue_per_equity_lag1
## 0 0
## revenue_per_equity_lag2 revenue_per_equity_rolling_mean
## 0 0
## revenue_per_equity_rolling_std revenue_per_equity_momentum
## 0 0
## revenue_per_share_lag1 revenue_per_share_lag2
## 0 0
## revenue_per_share_rolling_mean revenue_per_share_rolling_std
## 0 0
## revenue_per_share_momentum gross_margin_lag1
## 0 0
## gross_margin_lag2 gross_margin_rolling_mean
## 0 0
## gross_margin_rolling_std gross_margin_momentum
## 0 0
## ebit_margin_lag1 ebit_margin_lag2
## 0 0
## ebit_margin_rolling_mean ebit_margin_rolling_std
## 0 0
## ebit_margin_momentum ebitda_margin_lag1
## 0 0
## ebitda_margin_lag2 ebitda_margin_rolling_mean
## 0 0
## ebitda_margin_rolling_std ebitda_margin_momentum
## 0 0
## net_margin_lag1 net_margin_lag2
## 0 0
## net_margin_rolling_mean net_margin_rolling_std
## 0 0
## net_margin_momentum profit_growth_lag1
## 0 0
## profit_growth_lag2 profit_growth_rolling_mean
## 100 100
## average_inventory_lag1 inventory_turnover_lag1
## 0 0
## eps_lag1
## 0
Missing values in:
profit_growth_rolling_mean 100
profit_growth_lag2 100
revenue_growth_rolling_std 100
revenue_growth_rolling_mean 100
revenue_growth_lag2 100
The reason for these missing values are that lag 2 variables require lag 1 values to be calculated.
Besides the above list, there are no other missing values
Recap
Started with 1000 rows
Removed 100 rows for year 2016 due to a lack of lag variables that require prior year. this dataset starts from 2016.
Removed 100 rows for 2017 because of missing lag 2 variables that requires lag 1 variables from 2016 to be calculated.
Currently at 800 observations from 2018 to 2025
Identified 500 missing lag 2 Variables from profit_growth & revenue_growth. This is because they require lag 1 values from 2017 in order to be filled.
I will move forward with my data set as is because only the selected features will be used to build a model.
Visualize the selected features using a correlation plot to check for feature to feature relationship
corr_matrix <- cor(finance_df2[, features_to_plot], use = 'complete.obs')
# correlation plot
corrplot(corr_matrix,
method = 'color',
type = 'upper',
tl.cex = 0.8,
addCoef.col = 'black',
number.cex = 0.7,
tl.srt = 45,
title = "Feature correlation heatmap",
mar = c(0,0,2,1))
### Key findings
High correlation features
ebit_lag1 & cash_lag1 0.79
ebit_lag1 & debt-lag 1 0.80
ebit_lag1 & operating_cf_lag1 0.98 - very high
ebit_lag1 & operating_cf_lag1 0.78
operating_expense_lag1 & cash_lag1 0.86
operating_expense_lag1 & debt_lag1 0.87
operating_expense_lag1 & operating_cf_lag1 0.87
cash_lag1 & debt_lag1 0.78
cash_lag1 & operating_cf_lag1 0.82
debt_lag1 & operating_cf_lag1 0.82
revenue_growth_lag1 & revenue_growth_momentum -0.81 (negatively correlated)
The heat map shows high multicollinearity among absolute features with ebit_lag1 & operating_cf_lag1 the highest at 0.98 which could mean that these tow features hava alsmot similar information.
The identified high correlation features scale in billions and this is not unusual for larger companies.
The multicollinearity shows there is a strong negative correlation between revenue_growth_lag1 & revenue_growth_momentum at -0.81.
Create target variable
##
## no yes
## 334 466
##
## no yes
## 0.4175 0.5825
58.25% of companies have had profit growth and 41.75% of companies have not had profit growth.
The target variable is reasonably balanced ruling out class imbalance.
# create the final modeling dataset
model_features <- c("year",
"revenue_growth_lag1",
"ebit_lag1",
"operating_expense_lag1",
"revenue_per_asset_lag1",
"cash_lag1",
"debt_lag1",
"operating_cf_lag1",
"inventory_turnover_lag1",
"eps_lag1",
"revenue_growth_momentum",
"profit_growth_binary"
)
finance_model_df <- finance_df2%>%
select(all_of(model_features))
# verify the final model dataset
colSums(is.na(finance_model_df))
## year revenue_growth_lag1 ebit_lag1
## 0 0 0
## operating_expense_lag1 revenue_per_asset_lag1 cash_lag1
## 0 0 0
## debt_lag1 operating_cf_lag1 inventory_turnover_lag1
## 0 0 0
## eps_lag1 revenue_growth_momentum profit_growth_binary
## 0 0 0
nrow(finance_model_df)
## [1] 800
I have a data set from features that represent all the feature categories of the finance data set.
Visualize the final modeling dataset with boxplots to understand feature vs target relationship
Profit is affected by
→ Interest payments on debt → Tax obligations → Depreciation and amortization → One-time charges
Revenue_growth_lag1 -> There is almost an overlap for companies that made profit and those that didn’t on the basis of revenue growth. This may not be a strong predictor for profit growth
ebit_lag1 (earnings before interest and taxes) -> companies that had higher earnings before interest did not make profit as much as those that had low earnings before interest in the previous year.
There is a slight overlap at the bottom but the ‘no’ group has a higher median.
So a company can have high EBIT but low profit growth because of high debt which leads to high interest payments hence high tax burdens or high investment costs therefore large depreciation costs.
These eat into profit after EBIT is calculated.
This feature shows moderate predictive potential but more outliers in the ‘No’ group warrant for log transformation for modeling.
Operating_expense_lag1 -> The boxes have significant overlap making this a weak predictor for profit growth.
However, the ‘yes’ group shows higher operating expenses which could be that companies that are making profit are investing more into operations aiming for profit growth.
There are more outliers in the ‘yes’ group which warrants for log transformation before modeling.
Revenue_per_asset -> Companies that did not make profit in the prior year made more revenue per asset
A company with no profit but high revenue per asset is likely efficient at generating sales but inefficient at controlling the costs.
I am not concerned about outliers for this feature hence no warrant for log transformation.
Cash_lag1 -> Immediate cash flow is slightly higher in companies that made profit than in companies that didn’t make profit.
debt_lag1 -> debt is almost equal with companies that made profit in the prior year being slightly higher than companies that didn’t make profit. Not a very strong predictor of profit growth.
Operating cash flow of the previous year is higher in companies that didn’t make profit in the prior year. This could be due to high non-cash expenses and most likely that companies like these have high ebit and ebitda. This will be a strong predictor for the profit growth
Inventory turnover lag1 is low for companies that didn’t make profit, this could also be related to those with high operating cash flow but low profit.
earnings per share is almost equal in companies that made profit and those that didn’t. This will be a weak predictor for profit growth
Revenue growth is slightly higher in companies that made profit. Sometimes companies can have revenue growth at profit loss because of efficiency improvements.
Verify selected features for modeling
# check the features for modeling
str(finance_model_df)
## tibble [800 × 12] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:800] 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
## $ revenue_growth_lag1 : num [1:800] 0.011 0.1248 0.0829 0.2608 0.1424 ...
## $ ebit_lag1 : num [1:800] 1.60e+09 2.56e+09 6.59e+09 5.43e+09 2.97e+09 ...
## $ operating_expense_lag1 : num [1:800] 1.88e+09 5.06e+09 4.21e+09 4.68e+09 5.54e+09 ...
## $ revenue_per_asset_lag1 : num [1:800] 0.555 0.564 0.383 0.581 0.448 ...
## $ cash_lag1 : num [1:800] 9.23e+08 2.16e+09 2.11e+09 1.83e+09 1.73e+09 ...
## $ debt_lag1 : num [1:800] 7.78e+09 1.24e+10 1.92e+10 1.26e+10 2.01e+10 ...
## $ operating_cf_lag1 : num [1:800] 1.94e+09 2.96e+09 5.00e+09 5.52e+09 3.44e+09 ...
## $ inventory_turnover_lag1: num [1:800] 2.89 5.05 2.96 4.25 3.36 ...
## $ eps_lag1 : num [1:800] 5.4 7.11 9.36 42.01 5.77 ...
## $ revenue_growth_momentum: num [1:800] 0.1642 -0.0389 0.1378 -0.2153 -0.1334 ...
## $ profit_growth_binary : chr [1:800] "no" "yes" "yes" "no" ...
Year is included in the dataframe solely for temporal train/test splitting and will be excluded from features.
The goal of the project is to predict future profit growth using historical financial signals, hence the temporal splits is a better fit to ensure the model is trained on earlier years (2018 - 2022) to predict the future (2023 - 2025).
## tibble [800 × 12] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:800] 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
## $ revenue_growth_lag1 : num [1:800] 0.011 0.1248 0.0829 0.2608 0.1424 ...
## $ ebit_lag1 : num [1:800] 1.60e+09 2.56e+09 6.59e+09 5.43e+09 2.97e+09 ...
## $ operating_expense_lag1 : num [1:800] 1.88e+09 5.06e+09 4.21e+09 4.68e+09 5.54e+09 ...
## $ revenue_per_asset_lag1 : num [1:800] 0.555 0.564 0.383 0.581 0.448 ...
## $ cash_lag1 : num [1:800] 9.23e+08 2.16e+09 2.11e+09 1.83e+09 1.73e+09 ...
## $ debt_lag1 : num [1:800] 7.78e+09 1.24e+10 1.92e+10 1.26e+10 2.01e+10 ...
## $ operating_cf_lag1 : num [1:800] 1.94e+09 2.96e+09 5.00e+09 5.52e+09 3.44e+09 ...
## $ inventory_turnover_lag1: num [1:800] 2.89 5.05 2.96 4.25 3.36 ...
## $ eps_lag1 : num [1:800] 5.4 7.11 9.36 42.01 5.77 ...
## $ revenue_growth_momentum: num [1:800] 0.1642 -0.0389 0.1378 -0.2153 -0.1334 ...
## $ profit_growth_binary : Factor w/ 2 levels "no","yes": 1 2 2 1 2 2 2 1 2 1 ...
train_data <- finance_model_df %>%
filter(year <= 2022) %>%
select(-year)
test_data <- finance_model_df %>%
filter(year >2022) %>%
select(-year)
prop.table(table(nrow(train_data)))
##
## 500
## 1
prop.table(table(nrow(test_data)))
##
## 300
## 1
dim(finance_model_df)
## [1] 800 12
# Check the target variable distribution
prop.table(table(train_data$profit_growth_binary))
##
## no yes
## 0.402 0.598
prop.table(table(test_data$profit_growth_binary))
##
## no yes
## 0.4433333 0.5566667
The temporal train/test split yielded 500 observations for training data and 300 observations for test data totaling to 800 observations.
Class distribution is consistent with Yes class (59.8% train data & 55.7% test data) with a difference of 4.1% indicating that the split is representative for both categories.
# Check minimum value of every feature
sapply(train_data[, features], min)
## revenue_growth_lag1 ebit_lag1 operating_expense_lag1
## -1.197031e-01 1.316671e+08 1.695674e+08
## revenue_per_asset_lag1 cash_lag1 debt_lag1
## 3.574419e-01 7.866745e+07 4.051942e+08
## operating_cf_lag1 inventory_turnover_lag1 eps_lag1
## 1.368744e+08 2.253622e+00 2.971642e-01
## revenue_growth_momentum
## -3.980441e-01
Features that have negative values
Features with only positive values
features <- c(
"revenue_growth_lag1",
"ebit_lag1",
"operating_expense_lag1",
"revenue_per_asset_lag1",
"cash_lag1",
"debt_lag1",
"operating_cf_lag1",
"inventory_turnover_lag1",
"eps_lag1",
"revenue_growth_momentum"
)
log_features <- c(
"ebit_lag1",
"operating_expense_lag1",
"revenue_per_asset_lag1",
"cash_lag1",
"debt_lag1",
"operating_cf_lag1",
"inventory_turnover_lag1",
"eps_lag1"
) # removed "Revenue growth lag1" and "Revenue growth momentum"
# apply log transformation in both train and test data
train_data[, log_features] <- log(train_data[, log_features])
test_data[, log_features] <- log(test_data[, log_features])
# fit scaler on the training features only
scaler <- preProcess(train_data[, features],
method = c('center', 'scale'))
# apply to both train and test splits
# predict in this case means "apply log transformation"
# it takes in each value ((value - mean) / std)
train_scaled <- predict(scaler, train_data)
test_scaled <- predict(scaler, test_data)
# verify
summary(train_scaled)
## revenue_growth_lag1 ebit_lag1 operating_expense_lag1
## Min. :-2.45654 Min. :-3.4320 Min. :-3.6445
## 1st Qu.:-0.74312 1st Qu.:-0.5067 1st Qu.:-0.5847
## Median :-0.01572 Median : 0.1925 Median : 0.2370
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.65906 3rd Qu.: 0.6865 3rd Qu.: 0.6800
## Max. : 2.57325 Max. : 1.7756 Max. : 1.8490
## revenue_per_asset_lag1 cash_lag1 debt_lag1 operating_cf_lag1
## Min. :-1.60637 Min. :-3.3026 Min. :-3.9295 Min. :-3.4973
## 1st Qu.:-0.85487 1st Qu.:-0.5667 1st Qu.:-0.5433 1st Qu.:-0.5055
## Median :-0.08155 Median : 0.1758 Median : 0.1716 Median : 0.2031
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.85799 3rd Qu.: 0.7307 3rd Qu.: 0.6934 3rd Qu.: 0.6904
## Max. : 1.94749 Max. : 1.8999 Max. : 2.0684 Max. : 1.8392
## inventory_turnover_lag1 eps_lag1 revenue_growth_momentum
## Min. :-2.18588 Min. :-3.1266 Min. :-2.494253
## 1st Qu.:-0.73815 1st Qu.:-0.5482 1st Qu.:-0.760654
## Median :-0.06625 Median : 0.1133 Median : 0.005399
## Mean : 0.00000 Mean : 0.0000 Mean : 0.000000
## 3rd Qu.: 0.75157 3rd Qu.: 0.6202 3rd Qu.: 0.725376
## Max. : 2.90808 Max. : 2.2708 Max. : 2.625465
## profit_growth_binary
## no :201
## yes:299
##
##
##
##
saveRDS("train_scaled", file = "log-reg progress")
set.seed(123)
logistic_regression_model <- glm(profit_growth_binary~., data = train_scaled, family = 'binomial')
summary(logistic_regression_model)
##
## Call:
## glm(formula = profit_growth_binary ~ ., family = "binomial",
## data = train_scaled)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.597325 0.112304 5.319 1.04e-07 ***
## revenue_growth_lag1 0.710617 0.191050 3.720 0.000200 ***
## ebit_lag1 -2.626469 0.734742 -3.575 0.000351 ***
## operating_expense_lag1 2.012753 0.416034 4.838 1.31e-06 ***
## revenue_per_asset_lag1 0.078142 0.129511 0.603 0.546269
## cash_lag1 -0.009429 0.279648 -0.034 0.973104
## debt_lag1 0.741293 0.381997 1.941 0.052310 .
## operating_cf_lag1 -0.306802 0.801414 -0.383 0.701849
## inventory_turnover_lag1 0.153245 0.113649 1.348 0.177527
## eps_lag1 0.154140 0.184479 0.836 0.403411
## revenue_growth_momentum 0.989173 0.194081 5.097 3.46e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 673.81 on 499 degrees of freedom
## Residual deviance: 529.55 on 489 degrees of freedom
## AIC: 551.55
##
## Number of Fisher Scoring iterations: 5
Based on the logistic model, the most significant features are:
The boxplot finding were that ebit_lag1, operating_expense_lag1, revenue_growth_momentum were moderate predictors but the model shows that they are the most significant.
The model contradicts my boxplot findings of operating cash flow lag 1 as a strong predictor as it is not significant.
revenue_growth_lag1 , ebit_lag1 are positive coefficients which means that they increase profit growth while operating_expense_lag1 & revenue_growth_momentum decrease profit growth.
ebit_lag1 -> Companies with highet prior ebit are more likley to not have profit growth because of large depreciation due to high tax burdens, high investment costs and high interest payments.
revenue_growth_lag 1 -> companies with higher prior revenue growth are more likely to have profit growth.
revenue_growth_momentum -> higher growth momentum is accelerating revenue which means that companies have captured the market share and possibly scaling and this leads to profit growth.
operating_expense_lag1 -> High operating expenses can lead to profit growth because it could be that companies these are making profit but investing more into operations aiming for more profit growth.
# make predictions using the model on the test data
logistic_regression_model_predictions <- factor(
ifelse(predict(logistic_regression_model,
newdata = test_scaled) > 0.5,
"yes", "no"),
levels = c("no", "yes")
)
# create a confusion matrix
lr_confusion_mtx <-confusionMatrix(logistic_regression_model_predictions,test_scaled$profit_growth_binary, positive = "yes")
lr_confusion_mtx
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 101 58
## yes 32 109
##
## Accuracy : 0.7
## 95% CI : (0.6447, 0.7513)
## No Information Rate : 0.5567
## P-Value [Acc > NIR] : 2.482e-07
##
## Kappa : 0.4041
##
## Mcnemar's Test P-Value : 0.008408
##
## Sensitivity : 0.6527
## Specificity : 0.7594
## Pos Pred Value : 0.7730
## Neg Pred Value : 0.6352
## Prevalence : 0.5567
## Detection Rate : 0.3633
## Detection Prevalence : 0.4700
## Balanced Accuracy : 0.7060
##
## 'Positive' Class : yes
##
# extract the output
accuracy_lr <- lr_confusion_mtx$overall ['Accuracy']
true_pos_rate_lr <- lr_confusion_mtx$byClass ['Sensitivity']
true_neg_rate_lr <- lr_confusion_mtx$byClass ['Specificity']
The model achieved an Accuracy: 70% significantly exceeding the ‘No’ information rate of 55.67%,
The model shows stronger Specificity: 76% than Sensitivity: 65% indicating that it is better at identifying companies that are not likely to have profit growth than those that will have profit growth which could indicate financial ditress signals rather than growth signals.
The model correctly identifies 65.27% of profit growing companies but maintains 76% accuracy in uidentifying non-growth companies which is meaningful information but not the goal of the project and what the investors would expect.
Decision trees do not require feature scaling or normality like logistic regression because they split data based on threshold rather that distance. Hence, i will apply the same train_scaled and test_scaled datasets on here for consistency.
The target variable remains as a factor with levels “yes” and “no”
Decision trees are suitable for small datasets but they are prone to over fitting so i have to be careful with that.
# set seed
set.seed(123)
# build decision tree model
decision_tree_model <- rpart(profit_growth_binary~., data = train_scaled, method = 'class')
# visualize the model
rpart.plot(decision_tree_model,
type = 4,
extra = 101,
fallen.leaves = TRUE,
main = "Decision Tree For Profit Growth")
# make prediction using the decision tree
decision_tree_predictions <- predict(decision_tree_model, test_scaled, type = 'class')
# Create the confusion matrix
confusionMatrix(decision_tree_predictions, test_scaled$profit_growth_binary, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 43 33
## yes 90 134
##
## Accuracy : 0.59
## 95% CI : (0.532, 0.6462)
## No Information Rate : 0.5567
## P-Value [Acc > NIR] : 0.1347
##
## Kappa : 0.1314
##
## Mcnemar's Test P-Value : 4.433e-07
##
## Sensitivity : 0.8024
## Specificity : 0.3233
## Pos Pred Value : 0.5982
## Neg Pred Value : 0.5658
## Prevalence : 0.5567
## Detection Rate : 0.4467
## Detection Prevalence : 0.7467
## Balanced Accuracy : 0.5629
##
## 'Positive' Class : yes
##
The Decision Tree model achieves an accuracy of 59% , 11% lower than the logistic regression model.
It correctly predicts profit growth; 134 at 80.24% and correctly predicts no profit growth; 43 at 32.33%
This model has performed worse than the logistic regression model.
The Tree predicts “YES” for profit growth at 80.24% of the time while the actual yes rate is 55.67%, the tree is biased towards predicting yes.
The reason could have been that the tree grows too deep and learns specific patterns from training leading it to overpredicting ‘yes’.
The model memorized the training patterns leading to over prediction of ‘yes’.
increasing complexity of the data would have required adding more features which could expose the model to overfitting.
Solution: Try pruning to reduce the yes bias and see if the model improves.
# find optimal complexity parameter(cp) for the model
printcp(decision_tree_model)
##
## Classification tree:
## rpart(formula = profit_growth_binary ~ ., data = train_scaled,
## method = "class")
##
## Variables actually used in tree construction:
## [1] cash_lag1 ebit_lag1 inventory_turnover_lag1
## [4] operating_expense_lag1 revenue_growth_lag1 revenue_growth_momentum
##
## Root node error: 201/500 = 0.402
##
## n= 500
##
## CP nsplit rel error xerror xstd
## 1 0.094527 0 1.00000 1.00000 0.054545
## 2 0.047264 1 0.90547 1.04975 0.054943
## 3 0.034826 5 0.71642 0.92040 0.053711
## 4 0.019900 7 0.64677 0.91542 0.053650
## 5 0.017413 8 0.62687 0.89055 0.053333
## 6 0.014925 11 0.57214 0.85572 0.052847
## 7 0.010000 12 0.55721 0.86070 0.052920
plotcp(decision_tree_model)
### Key findings
The plot shows complexity parameter on the x_axis and cross validation error on the y-axis which are also displayed on the table.
The dotted line is threshold for selection at 1std
The goal or best outcome would be to find the lowest xerror point
Based on the graph my lowest xerror is at cp = 0.016, tree size = 12 nodes and xerror = 0.855
The threshold = min xerror - xstd at min -> 0.855 + 0.0.052 = 0.907
Find tree that falls below 0.907
cp 0.017 with 8 nodes at 0.891
use cp = 0.017 for pruning
N/B: I used Claude to help me understand how pruining is done and how it works.
pruned_decision_tree <- prune(decision_tree_model, cp = 0.017)
# plot
pruned_decision_tree_plot <- rpart.plot(pruned_decision_tree,
type = 4,
extra = 101,
fallen.leaves = TRUE,
main = "Pruned Decision Tree")
print(pruned_decision_tree_plot)
## $obj
## n= 500
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 500 201 yes (0.40200000 0.59800000)
## 2) ebit_lag1>=0.60842 147 64 no (0.56462585 0.43537415)
## 4) revenue_growth_momentum< -0.2481026 59 16 no (0.72881356 0.27118644) *
## 5) revenue_growth_momentum>=-0.2481026 88 40 yes (0.45454545 0.54545455)
## 10) operating_expense_lag1< 0.8448163 57 23 no (0.59649123 0.40350877)
## 20) inventory_turnover_lag1>=0.3263827 14 2 no (0.85714286 0.14285714) *
## 21) inventory_turnover_lag1< 0.3263827 43 21 no (0.51162791 0.48837209)
## 42) inventory_turnover_lag1< -0.2095408 34 13 no (0.61764706 0.38235294)
## 84) revenue_growth_lag1< -0.07026225 26 7 no (0.73076923 0.26923077) *
## 85) revenue_growth_lag1>=-0.07026225 8 2 yes (0.25000000 0.75000000) *
## 43) inventory_turnover_lag1>=-0.2095408 9 1 yes (0.11111111 0.88888889) *
## 11) operating_expense_lag1>=0.8448163 31 6 yes (0.19354839 0.80645161) *
## 3) ebit_lag1< 0.60842 353 118 yes (0.33427762 0.66572238)
## 6) operating_expense_lag1< -0.6334135 116 57 no (0.50862069 0.49137931)
## 12) ebit_lag1>=-0.3837157 21 1 no (0.95238095 0.04761905) *
## 13) ebit_lag1< -0.3837157 95 39 yes (0.41052632 0.58947368)
## 26) operating_expense_lag1< -1.404207 47 20 no (0.57446809 0.42553191)
## 52) ebit_lag1>=-2.020906 24 5 no (0.79166667 0.20833333) *
## 53) ebit_lag1< -2.020906 23 8 yes (0.34782609 0.65217391)
## 106) cash_lag1< -2.542958 10 3 no (0.70000000 0.30000000) *
## 107) cash_lag1>=-2.542958 13 1 yes (0.07692308 0.92307692) *
## 27) operating_expense_lag1>=-1.404207 48 12 yes (0.25000000 0.75000000) *
## 7) operating_expense_lag1>=-0.6334135 237 59 yes (0.24894515 0.75105485) *
##
## $snipped.nodes
## NULL
##
## $xlim
## [1] 0 1
##
## $ylim
## [1] 0 1
##
## $x
## [1] 0.53053227 0.22714408 0.08182368 0.37246447 0.25518836 0.16340706
## [7] 0.34696967 0.28578213 0.24499044 0.32657382 0.40815720 0.48974058
## [13] 0.83392047 0.68860007 0.57132396 0.80587618 0.71409488 0.65290734
## [19] 0.77528241 0.73449072 0.81607411 0.89765749 0.97924087
##
## $y
## [1] 0.9598391 0.8182741 0.0255096 0.6767090 0.5351439 0.0255096 0.3935788
## [8] 0.2520137 0.0255096 0.0255096 0.0255096 0.0255096 0.8182741 0.6767090
## [15] 0.0255096 0.5351439 0.3935788 0.0255096 0.2520137 0.0255096 0.0255096
## [22] 0.0255096 0.0255096
##
## $branch.x
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## x 0.5305323 0.2271441 0.08182368 0.3724645 0.2551884 0.1634071 0.3469697
## NA 0.2271441 0.08182368 0.3724645 0.2551884 0.1634071 0.3469697
## NA 0.5305323 0.22714408 0.2271441 0.3724645 0.2551884 0.2551884
## [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## x 0.2857821 0.2449904 0.3265738 0.4081572 0.4897406 0.8339205 0.6886001
## 0.2857821 0.2449904 0.3265738 0.4081572 0.4897406 0.8339205 0.6886001
## 0.3469697 0.2857821 0.2857821 0.3469697 0.3724645 0.5305323 0.8339205
## [,15] [,16] [,17] [,18] [,19] [,20] [,21]
## x 0.5713240 0.8058762 0.7140949 0.6529073 0.7752824 0.7344907 0.8160741
## 0.5713240 0.8058762 0.7140949 0.6529073 0.7752824 0.7344907 0.8160741
## 0.6886001 0.6886001 0.8058762 0.7140949 0.7140949 0.7752824 0.7752824
## [,22] [,23]
## x 0.8976575 0.9792409
## 0.8976575 0.9792409
## 0.8058762 0.8339205
##
## $branch.y
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## y 0.9996865 0.8581215 0.06535701 0.7165564 0.5749913 0.06535701 0.4334262
## NA 0.9661116 0.82454647 0.8245465 0.6829814 0.54141631 0.5414163
## NA 0.9661116 0.82454647 0.8245465 0.6829814 0.54141631 0.5414163
## [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## y 0.2918611 0.06535701 0.06535701 0.06535701 0.06535701 0.8581215 0.7165564
## 0.3998512 0.25828615 0.25828615 0.39985123 0.68298139 0.9661116 0.8245465
## 0.3998512 0.25828615 0.25828615 0.39985123 0.68298139 0.9661116 0.8245465
## [,15] [,16] [,17] [,18] [,19] [,20] [,21]
## y 0.06535701 0.5749913 0.4334262 0.06535701 0.2918611 0.06535701 0.06535701
## 0.68298139 0.6829814 0.5414163 0.39985123 0.3998512 0.25828615 0.25828615
## 0.68298139 0.6829814 0.5414163 0.39985123 0.3998512 0.25828615 0.25828615
## [,22] [,23]
## y 0.06535701 0.06535701
## 0.54141631 0.82454647
## 0.54141631 0.82454647
##
## $labs
## [1] "yes\n201 299\n100%" "no\n83 64\n29%" "no\n43 16\n12%"
## [4] "yes\n40 48\n18%" "no\n34 23\n11%" "no\n12 2\n3%"
## [7] "no\n22 21\n9%" "no\n21 13\n7%" "no\n19 7\n5%"
## [10] "yes\n2 6\n2%" "yes\n1 8\n2%" "yes\n6 25\n6%"
## [13] "yes\n118 235\n71%" "no\n59 57\n23%" "no\n20 1\n4%"
## [16] "yes\n39 56\n19%" "no\n27 20\n9%" "no\n19 5\n5%"
## [19] "yes\n8 15\n5%" "no\n7 3\n2%" "yes\n1 12\n3%"
## [22] "yes\n12 36\n10%" "yes\n59 178\n47%"
##
## $cex
## [1] 0.3875
##
## $boxes
## $boxes$x1
## [1] 0.50556185 0.20805117 0.06273077 0.35337157 0.23609546 0.14725291
## [7] 0.32787676 0.26668922 0.22883629 0.31335843 0.39494181 0.47358644
## [13] 0.80895004 0.66950717 0.55516982 0.78678328 0.69500197 0.63675320
## [19] 0.75912827 0.72127534 0.79991996 0.87856458 0.95720920
##
## $boxes$y1
## [1] 0.932536560 0.790971480 -0.001792973 0.649406399 0.507841318
## [6] -0.001792973 0.366276237 0.224711157 -0.001792973 -0.001792973
## [11] -0.001792973 -0.001792973 0.790971480 0.649406399 -0.001792973
## [16] 0.507841318 0.366276237 -0.001792973 0.224711157 -0.001792973
## [21] -0.001792973 -0.001792973 -0.001792973
##
## $boxes$x2
## [1] 0.5555027 0.2462370 0.1009166 0.3915574 0.2742813 0.1795612 0.3660626
## [8] 0.3048750 0.2611446 0.3397892 0.4213726 0.5058947 0.8588909 0.7076930
## [15] 0.5874781 0.8249691 0.7331878 0.6690615 0.7914366 0.7477061 0.8322283
## [22] 0.9167504 1.0012725
##
## $boxes$y2
## [1] 0.99968654 0.85812146 0.06535701 0.71655638 0.57499130 0.06535701
## [7] 0.43342622 0.29186114 0.06535701 0.06535701 0.06535701 0.06535701
## [13] 0.85812146 0.71655638 0.06535701 0.57499130 0.43342622 0.06535701
## [19] 0.29186114 0.06535701 0.06535701 0.06535701 0.06535701
##
##
## $split.labs
## [1] ""
##
## $split.cex
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## $split.box
## $split.box$x1
## [1] 0.183965209 -0.009353179 NA 0.176328981 0.081467023
## [6] NA 0.205168274 0.171717023 NA NA
## [11] NA NA 0.607981049 0.526385450 NA
## [16] 0.636414611 0.615314442 NA 0.692927013 NA
## [21] NA NA NA
##
## $split.box$y1
## [1] 0.8971149 0.7555498 NA 0.6139848 0.4724197 NA 0.3308546
## [8] 0.1892895 NA NA NA NA 0.7555498 0.6139848
## [15] NA 0.4724197 0.3308546 NA 0.1892895 NA NA
## [22] NA NA
##
## $split.box$x2
## [1] 0.2703229 0.1730005 NA 0.3340477 0.2453471 NA 0.3663960
## [8] 0.3182639 NA NA NA NA 0.7692191 0.6162625
## [15] NA 0.7917751 0.6905002 NA 0.7760544 NA NA
## [22] NA NA
##
## $split.box$y2
## [1] 0.9222046 0.7806395 NA 0.6390744 0.4975094 NA 0.3559443
## [8] 0.2143792 NA NA NA NA 0.7806395 0.6390744
## [15] NA 0.4975094 0.3559443 NA 0.2143792 NA NA
## [22] NA NA
# make predictions using the pruned decision tree
pruned_decision_tree_predictions <- predict(pruned_decision_tree, newdata = test_scaled, type = 'class')
# create a confusion matrix
tree_confusion_matrix <- confusionMatrix(pruned_decision_tree_predictions, test_scaled$profit_growth_binary, positive = 'yes')
tree_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 50 46
## yes 83 121
##
## Accuracy : 0.57
## 95% CI : (0.5119, 0.6268)
## No Information Rate : 0.5567
## P-Value [Acc > NIR] : 0.342800
##
## Kappa : 0.1034
##
## Mcnemar's Test P-Value : 0.001526
##
## Sensitivity : 0.7246
## Specificity : 0.3759
## Pos Pred Value : 0.5931
## Neg Pred Value : 0.5208
## Prevalence : 0.5567
## Detection Rate : 0.4033
## Detection Prevalence : 0.6800
## Balanced Accuracy : 0.5502
##
## 'Positive' Class : yes
##
The pruned decision tree achieves an accuracy: 57%, a TPR: 72% and a TNR: 38%.
It has a p value of 0.3428 which is not statistically significant.
The model has slightly improved Specificity: 38% by 5.26% and reduced the Yes prediction bias from 80.24% to 72.46% (134 to 121) but the model still under performs compared to the logistic regression model.
It could be that this dataset has more linear relationships than complex non-linear relationships which favors logistic regression over decision trees which are designed to capture non-linear patterns.
N/B : I used
Missing values were handled by removal of rows by year; 2016 & 2917, resulting in 800 observations.
Target variable is in factor
Does not require feature scaling
No structural changes to the dataset will be made; I will still be using train_scaled for consistency
Random forest builds 500 decision trees on random subsets of the data and combines their results through majority voting.
set.seed(123)
# build the rf model
random_forest_model <- randomForest(profit_growth_binary~.,
data = train_scaled,
mtry = 3,
ntree = 500,
importance= TRUE)
# Make predictions using the RF model
randonm_forest_model_predictions <- predict(random_forest_model, newdata = test_scaled)
# create a confusion matrix
rf_confusion_matrix <- confusionMatrix(randonm_forest_model_predictions, test_scaled$profit_growth_binary,positive = 'yes')
rf_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 79 50
## yes 54 117
##
## Accuracy : 0.6533
## 95% CI : (0.5965, 0.7071)
## No Information Rate : 0.5567
## P-Value [Acc > NIR] : 0.0004124
##
## Kappa : 0.2955
##
## Mcnemar's Test P-Value : 0.7686248
##
## Sensitivity : 0.7006
## Specificity : 0.5940
## Pos Pred Value : 0.6842
## Neg Pred Value : 0.6124
## Prevalence : 0.5567
## Detection Rate : 0.3900
## Detection Prevalence : 0.5700
## Balanced Accuracy : 0.6473
##
## 'Positive' Class : yes
##
# extract the output
accuracy_rf <- rf_confusion_matrix$overall ['Accuracy']
true_pos_rate_rf <- rf_confusion_matrix$byClass ['Sensitivity']
true_neg_rate_rf <- rf_confusion_matrix$byClass ['Specificity']
The model shows improvement with an accuracy: 65%, a TPR: 70% and a TNR: 59%.
SUMMARY OF THE MODEL OUTCOMES BEFORE ENSEMBLE
Logistic regression model:
Decision tree model:
pruned decision tree model:
Random forest model:
Decision tree model has the highest Yes for profit growth at 80.24% but a p-value of 0.1347 which makes the model not significant.
Pruned decision tree model has an accuracy of 59% with a p vlaue of 0.3428 making it the worst performing model in the list.
logistic regression model and random forest model are the best performing with a difference in accuracy of about 5% with random forest having a higher yes prediction and logistic rergression having a higher no prediction for profit growth.
The goal of this project is to correctly predict companies that will make profit in the future hence if one comapny is incorrectly classifed as No for profit growth, an investor might miss an oppportunity to invest in this company. Hence, the best model so far is Random forest model with a TPR of 70.06% at the cost of lower TNR of 59.40%.
Based on the goal of the project, an investor would rather miss a falsely labeled non-growth company(false positive) than miss a growth opportunity (false negative). The logistic regression model is at a FPR of 54 and a FNR of 50
# get all predictions from the previous models
logistic_regression_pred <- predict(logistic_regression_model, newdata = test_scaled)
decision_tree_pred <- predict(pruned_decision_tree, newdata = test_scaled, type = 'class')
random_forest_pred <- predict(random_forest_model, newdata = test_scaled)
# create one df of the model predictions
ensemble_df <- data.frame(
logistic_regression_pred = as.character(logistic_regression_pred),
decision_tree_pred = as.character(decision_tree_pred),
random_forest_pred = as.character(random_forest_pred)
)
# We want votes for all the models
# create a function for majority voting
majority_vote <- function(row) {
votes <- table(row)
return(names(votes)[which.max(votes)])
}
# apply majority voting to the dataframe
ensemble_predictions <- apply(ensemble_df, 1, majority_vote)
# convert back to factor
ensemble_predictions <- factor(ensemble_predictions, levels = c("no", "yes"))
# Create a confusion matrix
ensemble_conf_mtx <- confusionMatrix(ensemble_predictions, test_scaled$profit_growth_binary, positive = "yes")
ensemble_conf_mtx
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 40 28
## yes 43 101
##
## Accuracy : 0.6651
## 95% CI : (0.5972, 0.7283)
## No Information Rate : 0.6085
## P-Value [Acc > NIR] : 0.05179
##
## Kappa : 0.2737
##
## Mcnemar's Test P-Value : 0.09661
##
## Sensitivity : 0.7829
## Specificity : 0.4819
## Pos Pred Value : 0.7014
## Neg Pred Value : 0.5882
## Prevalence : 0.6085
## Detection Rate : 0.4764
## Detection Prevalence : 0.6792
## Balanced Accuracy : 0.6324
##
## 'Positive' Class : yes
##
# extract the output
accuracy_ensemble <- ensemble_conf_mtx$overall ['Accuracy']
ensemble_tpr <- ensemble_conf_mtx$byClass ['Sensitivity']
ensemble_tnr <- ensemble_conf_mtx$byClass ['Specificity']
The ensemble model achieved an accuracy: 67%, a TPR: 78% and a TNR: 48%.
SUMMARY OF MODELS AFTER ENSEMBLE
Logistic regression model:
Decision tree model:
pruned decision tree model:
Random forest model:
Ensemble model
Prior to ensemble, the best performing model was the random forest model with 65.33% accuracy, TPR of 70.06% at the cost of lower TNR of 59.40%.
The ensemble model performs better with an increased accuracy of 0.71% compared to random forest model. It also has a yes prediction rate of more than 8.06% and a no prediction rate of less than 11.82%.
It has a p-value of 0.0522 which is borderline significant and slightly better than random chance of profit growth prediction but is not strong enough to be considered definitively significant.
Based on the goal of the project; to predict profit growth in the future,the ensemble model is recommended as the primary model due to its high TPR: 78%, correctly identifying companies that will have profit growth.
Random forest model remains statistically significant and recommended if sensitivity is not prioritized over predictive reliability at a p value of 0.0004124 and TPR: 70%
N/B: In lieu of ROC curve visualizations, model discrimination performance is summarized through Sensitivity(TPR) and Specificity(TNR) metrics across all models which captures similar trade off between true positive and true negative rates at the chosen classification threshold.
Limitation
The small data size of only 800 observations limit model learning with 10 of the 200 observations representing all the feature categories of the dataset.