The project uses 19 predictor variables, belonging to 5 groups of financial indicators, taken from quarterly financial reports over the years. The target variable is derived from the historical closing prices of 364 companies (excluding banking, insurance, and securities firms) listed on the Ho Chi Minh City Stock Exchange (HOSE). The data set includes 14,000 observations and was scraped with open-source code from two sites, Vietstock and SSI.
The main purpose of this data is to build a stock price prediction model, but this project focuses on checking whether PCA can improve the results for that purpose.
The data can be accessed through this (Link)
First, we take a glance at the data:
## X quarter Cash Revenue Net_Profit Current_Asset Total_Asset
## 1 1 201012 20198710 200490099 24173915 285053921 645292775
## 2 2 201103 7851111 194825638 18891881 282385452 655170933
## 3 3 201106 19514488 203723874 10678199 290141612 737556219
## 4 4 201109 42730978 222849372 22109728 315854076 798982301
## 5 5 201112 32921082 289235247 25009656 294682022 816617975
## 6 6 201203 15684040 264734451 17842987 301858334 816200477
## 7 7 201206 24421088 216588621 11913986 298434419 753866364
## 8 8 201209 11298810 243785092 12434518 277753357 723444411
## 9 9 201212 103421045 284925069 21829635 433003092 900493790
## 10 10 201303 96503117 243931078 20724971 402140441 875957803
## Total_Liabilities Current_Liabilities EPS BVPS P.E P.B ROAA D.E
## 1 321125391 212877987 9178.16 32643 3.61 1.01 3.20 78.95
## 2 351413660 248664923 9670.53 30581 2.60 0.82 2.58 95.39
## 3 424620134 327506474 8859.20 31466 1.94 0.55 1.31 113.65
## 4 467725567 340529211 6528.89 33311 2.59 0.00 2.45 115.24
## 5 464685984 342394479 6613.61 35389 1.71 0.32 2.57 110.17
## 6 441936580 334544075 6467.35 37076 2.98 0.52 1.88 97.06
## 7 381081088 302692088 6617.57 36916 2.57 0.46 1.35 83.28
## 8 338256129 264191989 5842.68 38148 2.21 0.00 1.51 72.25
## 9 373901324 357290324 5254.60 26190 2.65 0.00 2.22 47.40
## 10 332835205 320273205 4455.28 27014 3.01 0.50 1.94 51.12
## stockname Current_Ratio Net_Margin Asset_Turnover Cash_Operating
## 1 AAA 1.3390484 0.12057411 0.3106963 2318281463
## 2 AAA 1.1356063 0.09696815 0.2973661 -2061253669
## 3 AAA 0.8859111 0.05241506 0.2762147 15030733056
## 4 AAA 0.9275389 0.09921378 0.2789165 19939110352
## 5 AAA 0.8606506 0.08646822 0.3541867 23224273045
## 6 AAA 0.9022977 0.06739957 0.3243498 17115407805
## 7 AAA 0.9859340 0.05500744 0.2873037 28883308741
## 8 AAA 1.0513315 0.05100606 0.3369783 40797111231
## 9 AAA 1.2119083 0.07661535 0.3164098 24324704959
## 10 AAA 1.2556169 0.08496241 0.2784735 857129565
## Cash_Investing Cash_Financing Close pct_change Label
## 1 -96349161561 54628645611 11722 -0.41483626 1
## 2 -43220616825 32936887965 8310 -0.29107661 1
## 3 -68993784513 65606793018 5598 -0.32635379 1
## 4 -22731586616 26028601459 4407 -0.21275456 1
## 5 -29175882088 -3867723554 3876 -0.12049013 1
## 6 -9051720933 -25300728507 5238 0.35139319 0
## 7 31799529545 -51931804401 5814 0.10996564 0
## 8 -10633646274 -43299729177 4446 -0.23529412 1
## 9 -27923357430 95739318154 4003 -0.09964013 1
## 10 -35425531423 27650473527 4362 0.08968274 0
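# Data preprocessing
library(dplyr)
# Drop the first two columns (row index X and quarter)
df_final <- df_final[, -c(1, 2)]
# Keep only the numeric financial indicators; drop the identifier and price-related columns
df_pca <- df_final %>%
  select(-c(stockname, Close, pct_change, Label)) %>%
  select_if(is.numeric)
# Remove rows with missing values, then standardize each column
df_pca <- df_pca[complete.cases(df_pca), ] %>% scale()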
Next, we compute and plot the correlation matrix. Several financial variables are highly correlated, such as Net profit, Total asset, Total liabilities, Current asset, and Current liabilities. The plot below shows the strong correlations among these variables:
library(corrplot)
# Pearson correlation matrix of the standardized indicators
corr_df <- cor(df_pca, method = "pearson")
corrplot(corr_df)
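To complement the plot, the highly correlated pairs can also be listed numerically. The sketch below is only an illustration on synthetic data: the `toy` data frame, the shared `size` factor, and the 0.8 cutoff are assumptions, not part of the original analysis. The same filtering step could be applied to `corr_df`.

```r
# Synthetic stand-in for the indicator matrix (hypothetical data, not the real set)
set.seed(42)
n <- 200
size <- rnorm(n)  # shared "company size" factor driving the balance-sheet items
toy <- data.frame(
  Total_Asset       = size + rnorm(n, sd = 0.2),
  Total_Liabilities = size + rnorm(n, sd = 0.2),
  Current_Asset     = size + rnorm(n, sd = 0.2),
  ROAA              = rnorm(n)  # generated independently of size
)

corr_toy <- cor(toy, method = "pearson")

# Keep each pair once (upper triangle) and filter by absolute correlation
idx <- which(upper.tri(corr_toy) & abs(corr_toy) > 0.8, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(corr_toy)[idx[, "row"]],
  var2 = colnames(corr_toy)[idx[, "col"]],
  r    = corr_toy[idx]
)
print(high_pairs)
```

In this toy setup only the three size-driven variables pair up; a block of mutually correlated variables like this is what PCA tends to compress into a single component.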
# df_pca is already standardized, so centering and scaling here leave it unchanged
pca <- prcomp(df_pca, center = TRUE, scale. = TRUE)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.4704 1.4608 1.18550 1.16874 1.05676 1.0064 0.99993
## Proportion of Variance 0.3212 0.1123 0.07397 0.07189 0.05878 0.0533 0.05262
## Cumulative Proportion 0.3212 0.4335 0.50749 0.57938 0.63816 0.6915 0.74408
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.99953 0.9530 0.85126 0.73221 0.70355 0.62266 0.54316
## Proportion of Variance 0.05258 0.0478 0.03814 0.02822 0.02605 0.02041 0.01553
## Cumulative Proportion 0.79667 0.8445 0.88260 0.91082 0.93687 0.95728 0.97281
## PC15 PC16 PC17 PC18 PC19
## Standard deviation 0.52191 0.36824 0.24739 0.19368 0.09988
## Proportion of Variance 0.01434 0.00714 0.00322 0.00197 0.00053
## Cumulative Proportion 0.98714 0.99428 0.99750 0.99947 1.00000
The PCA summary reports the standard deviation, proportion of variance, and cumulative proportion for each of the 19 principal components.
As we can see, PC1 captures the most variance: its proportion of variance is 32.12%. The cumulative proportion shows that the first six PCs explain approximately 69% of the variance. I will nevertheless keep 8 components, which capture ~80% of the variance, a safer margin for a critical application like stock price prediction.
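The figures in the summary can be re-derived from the standard deviations alone, which makes the choice of the number of components easy to check. The sketch below copies the standard deviations from the output above; the helper `n_components_for` and the Kaiser rule are illustrative additions, not part of the original analysis.

```r
# Standard deviations copied from the summary(pca) output above
sdev <- c(2.4704, 1.4608, 1.18550, 1.16874, 1.05676, 1.0064, 0.99993,
          0.99953, 0.9530, 0.85126, 0.73221, 0.70355, 0.62266, 0.54316,
          0.52191, 0.36824, 0.24739, 0.19368, 0.09988)

eigenvalues <- sdev^2                          # variance carried by each component
prop_var    <- eigenvalues / sum(eigenvalues)  # proportion of variance
cum_var     <- cumsum(prop_var)                # cumulative proportion

round(prop_var[1], 4)        # share of PC1
round(cum_var[c(6, 8)], 4)   # cumulative share after 6 and after 8 components

# Hypothetical helper: smallest number of components reaching a target share
n_components_for <- function(target) which(cum_var >= target)[1]
n_components_for(0.75)

# Kaiser rule (a common heuristic): keep components with eigenvalue above 1
sum(eigenvalues > 1)
```

With these numbers the Kaiser rule keeps 6 components and a 75% target keeps 8. Eight components reach 79.7% of the variance, so a strict 80% target would need a ninth.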
The scree plot gives the same result:
On the PCA variable plot, variables that lie close to each other are positively correlated. The variables with strong positive correlation (Net profit, Total asset, Total liabilities, Current asset, Current liabilities) are grouped together:
pca$rotation[,1:8]
## PC1 PC2 PC3 PC4
## Cash 0.350645231 -0.02093057 0.033068197 0.1360416236
## Revenue 0.307355170 -0.07175490 0.166376739 0.1354908277
## Net_Profit 0.278740598 -0.15340037 0.037114176 0.1788207805
## Current_Asset 0.384617649 0.04492040 -0.034583555 0.0233484063
## Total_Asset 0.389244789 0.07321336 -0.034452960 0.0076722577
## Total_Liabilities 0.375919707 0.10619960 -0.039902506 -0.0465242506
## Current_Liabilities 0.379701062 0.08024904 -0.021652700 -0.0101482281
## EPS 0.042251104 -0.58507657 -0.024395880 -0.1600820362
## BVPS 0.046727709 -0.40947939 -0.151286205 -0.1773421950
## P.E -0.001533694 0.02319209 -0.007400025 0.0161777040
## P.B 0.087805104 -0.19693334 0.478496800 -0.2640565784
## ROAA 0.017879638 -0.52710525 -0.038836780 -0.0301951480
## D.E 0.025933274 0.16298243 0.579870303 -0.3532971130
## Current_Ratio -0.026830832 -0.15332056 -0.345886247 0.1240189429
## Net_Margin 0.001352952 -0.02362594 -0.012042791 -0.0007542002
## Asset_Turnover -0.012957408 -0.20341905 0.269999232 -0.0880678555
## Cash_Operating 0.159237056 -0.08841097 0.137258815 0.4415798684
## Cash_Investing -0.249363689 -0.04857288 0.205210676 0.2570173890
## Cash_Financing 0.137304007 0.12826872 -0.348591358 -0.6255428841
## PC5 PC6 PC7 PC8
## Cash 0.010415756 -0.081692410 0.020202379 0.005307850
## Revenue -0.102060728 0.003975566 0.008953373 -0.007262747
## Net_Profit 0.052590978 0.072580249 -0.016872045 -0.004698167
## Current_Asset 0.009363075 -0.141213545 0.045490377 0.009297502
## Total_Asset 0.009851007 -0.141901296 0.044493418 0.009212473
## Total_Liabilities -0.002141383 -0.169549574 0.055445938 0.011105580
## Current_Liabilities -0.025861036 -0.133578999 0.046613318 0.007567646
## EPS -0.085207089 -0.112173357 -0.002894324 -0.020132790
## BVPS -0.241349122 -0.339147418 0.033389490 -0.021635228
## P.E 0.058691047 -0.033188124 0.007082180 -0.996826029
## P.B 0.460468558 0.037483053 -0.041389271 0.002243995
## ROAA 0.180126898 0.163839079 -0.009082442 0.001173577
## D.E 0.179588406 -0.041401392 0.005645266 0.014092246
## Current_Ratio 0.607584057 0.010986946 -0.023417751 0.050232608
## Net_Margin 0.039276633 0.252299289 0.960786642 -0.002006116
## Asset_Turnover -0.518382948 0.256425969 -0.001674236 -0.030891145
## Cash_Operating -0.007233729 0.501645178 -0.175206595 -0.020099176
## Cash_Investing 0.013144653 -0.557077335 0.172405352 0.026517185
## Cash_Financing -0.022823625 0.226261918 -0.053357266 -0.014690165
library(gridExtra)
library(factoextra)
# Variable-level PCA results (coordinates, cos2, contributions)
var <- get_pca_var(pca)
# Contribution of each variable to PC1 and PC2
a <- fviz_contrib(pca, "var", axes = 1, xtickslab.rt = 90)
b <- fviz_contrib(pca, "var", axes = 2, xtickslab.rt = 90)
# c <- fviz_contrib(pca, "var", axes = 3, xtickslab.rt = 90)
# d <- fviz_contrib(pca, "var", axes = 4, xtickslab.rt = 90)
# e <- fviz_contrib(pca, "var", axes = 5, xtickslab.rt = 90)
# f <- fviz_contrib(pca, "var", axes = 6, xtickslab.rt = 90)
# g <- fviz_contrib(pca, "var", axes = 7, xtickslab.rt = 90)
# h <- fviz_contrib(pca, "var", axes = 8, xtickslab.rt = 90)
grid.arrange(a, b, top = "Contribution to the first two Principal Components")
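The contribution plots can be cross-checked by hand. For a `prcomp` fit, the contribution of a variable to a component should equal its squared loading expressed as a percentage, and the reference line in the plot sits at the expected uniform contribution (100 divided by the number of variables). The sketch below applies this to the PC1 loadings copied from the `pca$rotation` output above; it reflects my understanding of the computation rather than a statement of the library internals.

```r
# PC1 loadings copied from the pca$rotation output above
pc1 <- c(Cash = 0.350645231, Revenue = 0.307355170, Net_Profit = 0.278740598,
         Current_Asset = 0.384617649, Total_Asset = 0.389244789,
         Total_Liabilities = 0.375919707, Current_Liabilities = 0.379701062,
         EPS = 0.042251104, BVPS = 0.046727709, P.E = -0.001533694,
         P.B = 0.087805104, ROAA = 0.017879638, D.E = 0.025933274,
         Current_Ratio = -0.026830832, Net_Margin = 0.001352952,
         Asset_Turnover = -0.012957408, Cash_Operating = 0.159237056,
         Cash_Investing = -0.249363689, Cash_Financing = 0.137304007)

# Contribution (%) of each variable to PC1: squared loading, normalized to 100
contrib <- 100 * pc1^2 / sum(pc1^2)

# Expected contribution if all 19 variables contributed equally
expected <- 100 / length(pc1)

# Variables contributing more than the uniform average, largest first
above <- sort(contrib[contrib > expected], decreasing = TRUE)
round(above, 2)
```

Eight variables exceed the uniform average of about 5.26%, led by Total_Asset at roughly 15%. PC1 is therefore essentially a company-size component built from the balance-sheet and income items.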
In conclusion, when the critical metrics are spread across many components, using 6-8 components is an efficient compromise, but the choice should be validated on the downstream task, such as stock price prediction. PCA is useful here, but it may not be the optimal choice.
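One possible way to run such a downstream check is to compare a classifier trained on the raw indicators with one trained on the leading principal components. The sketch below is a minimal illustration under stated assumptions: the data are synthetic, logistic regression stands in for whatever model the project ultimately uses, and `k = 3` is an arbitrary choice. The PCA is fitted on the training rows only and then applied to the test rows with `predict()`, so no test information leaks into the components.

```r
set.seed(1)
n <- 400; p <- 10
latent <- rnorm(n)
# Synthetic correlated features and a binary label (hypothetical data)
X <- sapply(1:p, function(j) latent + rnorm(n))
colnames(X) <- paste0("x", 1:p)
label <- rbinom(n, 1, plogis(1.5 * latent))

train <- sample(n, 300)  # indices of the training rows

# Fit PCA on the training rows only, then project both splits
pca_tr <- prcomp(X[train, ], center = TRUE, scale. = TRUE)
k <- 3  # number of components kept (arbitrary for this sketch)
scores_tr <- as.data.frame(pca_tr$x[, 1:k])
scores_te <- as.data.frame(predict(pca_tr, X[-train, ])[, 1:k])

# Fit a logistic regression and return test-set accuracy
accuracy <- function(train_df, test_df) {
  fit <- glm(y ~ ., data = cbind(train_df, y = label[train]), family = binomial)
  pred <- as.integer(predict(fit, newdata = test_df, type = "response") > 0.5)
  mean(pred == label[-train])
}

acc_raw <- accuracy(as.data.frame(X[train, ]), as.data.frame(X[-train, ]))
acc_pca <- accuracy(scores_tr, scores_te)
c(raw = acc_raw, pca = acc_pca)
```

If the PCA-based model matches the raw-feature model with far fewer inputs, the reduction is worthwhile; if accuracy drops noticeably, the discarded components carried predictive signal.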