Dataset Information

The project uses 19 variables drawn from 5 groups of financial indicators in quarterly financial reports, together with a target variable derived from historical closing prices, for 364 companies (excluding Banking, Insurance, and Securities) listed on the Ho Chi Minh City Stock Exchange (HOSE). The data set contains 14,000 observations and was scraped with open-source code from two sites, Vietstock and SSI.

The data were collected to build a stock price prediction model, but this project focuses on checking whether PCA can improve the results for that purpose.

The data can be accessed through this (Link)

Data Analysis

First, we take a glance at the data:

##     X quarter      Cash   Revenue Net_Profit Current_Asset Total_Asset
## 1   1  201012  20198710 200490099   24173915     285053921   645292775
## 2   2  201103   7851111 194825638   18891881     282385452   655170933
## 3   3  201106  19514488 203723874   10678199     290141612   737556219
## 4   4  201109  42730978 222849372   22109728     315854076   798982301
## 5   5  201112  32921082 289235247   25009656     294682022   816617975
## 6   6  201203  15684040 264734451   17842987     301858334   816200477
## 7   7  201206  24421088 216588621   11913986     298434419   753866364
## 8   8  201209  11298810 243785092   12434518     277753357   723444411
## 9   9  201212 103421045 284925069   21829635     433003092   900493790
## 10 10  201303  96503117 243931078   20724971     402140441   875957803
##    Total_Liabilities Current_Liabilities     EPS  BVPS  P.E  P.B ROAA    D.E
## 1          321125391           212877987 9178.16 32643 3.61 1.01 3.20  78.95
## 2          351413660           248664923 9670.53 30581 2.60 0.82 2.58  95.39
## 3          424620134           327506474 8859.20 31466 1.94 0.55 1.31 113.65
## 4          467725567           340529211 6528.89 33311 2.59 0.00 2.45 115.24
## 5          464685984           342394479 6613.61 35389 1.71 0.32 2.57 110.17
## 6          441936580           334544075 6467.35 37076 2.98 0.52 1.88  97.06
## 7          381081088           302692088 6617.57 36916 2.57 0.46 1.35  83.28
## 8          338256129           264191989 5842.68 38148 2.21 0.00 1.51  72.25
## 9          373901324           357290324 5254.60 26190 2.65 0.00 2.22  47.40
## 10         332835205           320273205 4455.28 27014 3.01 0.50 1.94  51.12
##    stockname Current_Ratio Net_Margin Asset_Turnover Cash_Operating
## 1        AAA     1.3390484 0.12057411      0.3106963     2318281463
## 2        AAA     1.1356063 0.09696815      0.2973661    -2061253669
## 3        AAA     0.8859111 0.05241506      0.2762147    15030733056
## 4        AAA     0.9275389 0.09921378      0.2789165    19939110352
## 5        AAA     0.8606506 0.08646822      0.3541867    23224273045
## 6        AAA     0.9022977 0.06739957      0.3243498    17115407805
## 7        AAA     0.9859340 0.05500744      0.2873037    28883308741
## 8        AAA     1.0513315 0.05100606      0.3369783    40797111231
## 9        AAA     1.2119083 0.07661535      0.3164098    24324704959
## 10       AAA     1.2556169 0.08496241      0.2784735      857129565
##    Cash_Investing Cash_Financing Close  pct_change Label
## 1    -96349161561    54628645611 11722 -0.41483626     1
## 2    -43220616825    32936887965  8310 -0.29107661     1
## 3    -68993784513    65606793018  5598 -0.32635379     1
## 4    -22731586616    26028601459  4407 -0.21275456     1
## 5    -29175882088    -3867723554  3876 -0.12049013     1
## 6     -9051720933   -25300728507  5238  0.35139319     0
## 7     31799529545   -51931804401  5814  0.10996564     0
## 8    -10633646274   -43299729177  4446 -0.23529412     1
## 9    -27923357430    95739318154  4003 -0.09964013     1
## 10   -35425531423    27650473527  4362  0.08968274     0
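# Data preprocessing
library(dplyr)  # provides %>%, select() and select_if()
# Drop the row index (X) and the quarter identifier
df_final <- df_final[, -c(1,2)]
# Keep the numeric financial indicators; exclude the ticker and the price-derived columns
df_pca <- df_final %>%
  select(-c(stockname, Close, pct_change, Label)) %>%
  select_if(is.numeric)
# Remove incomplete rows, then standardize each column (mean 0, sd 1)
df_pca <- df_pca[complete.cases(df_pca),] %>% scale()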
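
Because df_pca is already standardized by scale(), asking prcomp() to center and scale again should not change the result. The sketch below checks this on a small simulated data frame; the `toy` data and its values are made up for illustration and are not the project data.

```r
# Toy data standing in for the financial indicators (assumption: simulated, not the real data)
set.seed(123)
toy <- data.frame(
  Cash    = rnorm(100, mean = 50,  sd = 10),
  Revenue = rnorm(100, mean = 200, sd = 40),
  EPS     = rnorm(100, mean = 5,   sd = 2)
)

# Standardize by hand: each column gets mean 0 and standard deviation 1
toy_scaled <- scale(toy)

# PCA on pre-scaled data vs. PCA that centers and scales internally
pca_prescaled <- prcomp(toy_scaled, center = FALSE, scale. = FALSE)
pca_internal  <- prcomp(toy, center = TRUE, scale. = TRUE)

# The component standard deviations agree up to floating-point error
print(all.equal(pca_prescaled$sdev, pca_internal$sdev))
```

Either approach on its own is enough; doing both, as in this report, is redundant but harmless.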

Compute and plot the correlation matrix. Several financial variables are highly correlated, such as Net profit, Total asset, Total liabilities, Current asset, and Current liabilities. The plot below shows the strong correlations among these variables:

library(corrplot)  # provides corrplot()
corr_df <- cor(df_pca, method='pearson')
corrplot(corr_df)
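
A plot shows the pattern, but it can also help to list the highly correlated pairs explicitly. Here is a minimal sketch on simulated data; the `toy` data frame and the 0.8 cutoff are assumptions chosen for illustration, not values from this project.

```r
# Simulated data: two variables driven by a common "size" factor, one independent
set.seed(1)
n <- 200
size <- rnorm(n)
toy <- data.frame(
  Total_Asset       = size + rnorm(n, sd = 0.1),
  Total_Liabilities = size + rnorm(n, sd = 0.1),
  ROAA              = rnorm(n)
)

corr_toy <- cor(toy, method = "pearson")

# Look only at the upper triangle so each pair is reported once
idx <- which(upper.tri(corr_toy) & abs(corr_toy) > 0.8, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(corr_toy)[idx[, "row"]],
  var2 = colnames(corr_toy)[idx[, "col"]],
  r    = corr_toy[idx]
)
print(high_pairs)
```

On the real data, the same steps could be applied to corr_df instead of corr_toy.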

Principal Component Analysis (PCA)

Choose number of components

pca <- prcomp(df_pca, center=TRUE, scale.=TRUE)
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5    PC6     PC7
## Standard deviation     2.4704 1.4608 1.18550 1.16874 1.05676 1.0064 0.99993
## Proportion of Variance 0.3212 0.1123 0.07397 0.07189 0.05878 0.0533 0.05262
## Cumulative Proportion  0.3212 0.4335 0.50749 0.57938 0.63816 0.6915 0.74408
##                            PC8    PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.99953 0.9530 0.85126 0.73221 0.70355 0.62266 0.54316
## Proportion of Variance 0.05258 0.0478 0.03814 0.02822 0.02605 0.02041 0.01553
## Cumulative Proportion  0.79667 0.8445 0.88260 0.91082 0.93687 0.95728 0.97281
##                           PC15    PC16    PC17    PC18    PC19
## Standard deviation     0.52191 0.36824 0.24739 0.19368 0.09988
## Proportion of Variance 0.01434 0.00714 0.00322 0.00197 0.00053
## Cumulative Proportion  0.98714 0.99428 0.99750 0.99947 1.00000
  • The PCA summary reports the standard deviation, proportion of variance, and cumulative proportion for each of the 19 principal components.

  • PC1 captures the most variance: its proportion of variance is 32.12%. The cumulative proportion shows that the first six PCs explain approximately 69% of the variance. I will choose 8 components, which capture ~80% of the variance, a safer margin for a sensitive application like stock price prediction.
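
The choice of the number of components can be reproduced from the standard deviations alone. The sketch below hard-codes the values printed by summary(pca) above; the helper `n_for` is a hypothetical name introduced only for this example.

```r
# Standard deviations of the 19 components, copied from the summary(pca) output
sdev <- c(2.4704, 1.4608, 1.18550, 1.16874, 1.05676, 1.0064, 0.99993,
          0.99953, 0.9530, 0.85126, 0.73221, 0.70355, 0.62266, 0.54316,
          0.52191, 0.36824, 0.24739, 0.19368, 0.09988)

# The variance of each component is its squared standard deviation
prop_var <- sdev^2 / sum(sdev^2)
cum_var  <- cumsum(prop_var)

# Smallest number of components whose cumulative proportion reaches a threshold
n_for <- function(threshold) which(cum_var >= threshold)[1]

print(round(cum_var[c(6, 8)], 4))  # cumulative proportion at 6 and 8 components
print(n_for(0.70))                 # components needed to reach 70%
print(n_for(0.80))                 # components needed to reach 80%
print(sum(sdev^2 > 1))             # Kaiser rule: components with eigenvalue above 1
```

With these values, 8 components stop just short of 80%, so a strict 80% threshold would call for one more component, while the Kaiser rule points to fewer. The thresholds themselves are conventions rather than fixed rules.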

The scree plot leads to the same conclusion:

On the PCA variable plot, variables that lie close to each other are correlated. The variables with strong positive correlation (Net profit, Total asset, Total liabilities, Current asset, Current liabilities) are grouped together:

Contribution of individual variables

pca$rotation[,1:8]
##                              PC1         PC2          PC3           PC4
## Cash                 0.350645231 -0.02093057  0.033068197  0.1360416236
## Revenue              0.307355170 -0.07175490  0.166376739  0.1354908277
## Net_Profit           0.278740598 -0.15340037  0.037114176  0.1788207805
## Current_Asset        0.384617649  0.04492040 -0.034583555  0.0233484063
## Total_Asset          0.389244789  0.07321336 -0.034452960  0.0076722577
## Total_Liabilities    0.375919707  0.10619960 -0.039902506 -0.0465242506
## Current_Liabilities  0.379701062  0.08024904 -0.021652700 -0.0101482281
## EPS                  0.042251104 -0.58507657 -0.024395880 -0.1600820362
## BVPS                 0.046727709 -0.40947939 -0.151286205 -0.1773421950
## P.E                 -0.001533694  0.02319209 -0.007400025  0.0161777040
## P.B                  0.087805104 -0.19693334  0.478496800 -0.2640565784
## ROAA                 0.017879638 -0.52710525 -0.038836780 -0.0301951480
## D.E                  0.025933274  0.16298243  0.579870303 -0.3532971130
## Current_Ratio       -0.026830832 -0.15332056 -0.345886247  0.1240189429
## Net_Margin           0.001352952 -0.02362594 -0.012042791 -0.0007542002
## Asset_Turnover      -0.012957408 -0.20341905  0.269999232 -0.0880678555
## Cash_Operating       0.159237056 -0.08841097  0.137258815  0.4415798684
## Cash_Investing      -0.249363689 -0.04857288  0.205210676  0.2570173890
## Cash_Financing       0.137304007  0.12826872 -0.348591358 -0.6255428841
##                              PC5          PC6          PC7          PC8
## Cash                 0.010415756 -0.081692410  0.020202379  0.005307850
## Revenue             -0.102060728  0.003975566  0.008953373 -0.007262747
## Net_Profit           0.052590978  0.072580249 -0.016872045 -0.004698167
## Current_Asset        0.009363075 -0.141213545  0.045490377  0.009297502
## Total_Asset          0.009851007 -0.141901296  0.044493418  0.009212473
## Total_Liabilities   -0.002141383 -0.169549574  0.055445938  0.011105580
## Current_Liabilities -0.025861036 -0.133578999  0.046613318  0.007567646
## EPS                 -0.085207089 -0.112173357 -0.002894324 -0.020132790
## BVPS                -0.241349122 -0.339147418  0.033389490 -0.021635228
## P.E                  0.058691047 -0.033188124  0.007082180 -0.996826029
## P.B                  0.460468558  0.037483053 -0.041389271  0.002243995
## ROAA                 0.180126898  0.163839079 -0.009082442  0.001173577
## D.E                  0.179588406 -0.041401392  0.005645266  0.014092246
## Current_Ratio        0.607584057  0.010986946 -0.023417751  0.050232608
## Net_Margin           0.039276633  0.252299289  0.960786642 -0.002006116
## Asset_Turnover      -0.518382948  0.256425969 -0.001674236 -0.030891145
## Cash_Operating      -0.007233729  0.501645178 -0.175206595 -0.020099176
## Cash_Investing       0.013144653 -0.557077335  0.172405352  0.026517185
## Cash_Financing      -0.022823625  0.226261918 -0.053357266 -0.014690165
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
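library(factoextra)  # provides get_pca_var() and fviz_contrib()
# Variable-level PCA results (coordinates, cos2, contributions)
var <- get_pca_var(pca)
# Contribution (%) of each variable to PC1 and PC2
a <- fviz_contrib(pca, "var", axes=1, xtickslab.rt=90)
b <- fviz_contrib(pca, "var", axes=2, xtickslab.rt=90)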
#c<-fviz_contrib(pca, "var",axes = 3, xtickslab.rt=90)
#d<-fviz_contrib(pca, "var",axes = 4, xtickslab.rt=90)
#e<-fviz_contrib(pca, "var",axes = 5, xtickslab.rt=90)
#f<-fviz_contrib(pca, "var",axes = 6, xtickslab.rt=90)
#g<-fviz_contrib(pca, "var",axes = 7, xtickslab.rt=90)
#h<-fviz_contrib(pca, "var",axes = 8, xtickslab.rt=90)
grid.arrange(a,b,top='Contribution to the first two Principal Components')
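
The contributions can also be cross-checked by hand from the loadings. Each column of pca$rotation has unit length, so the squared loading of a variable, expressed as a percentage, gives its share of that component. The sketch below does this for PC1 using the loadings printed above; that this matches what fviz_contrib() plots for a single axis is my understanding of factoextra and should be treated as an assumption.

```r
# PC1 loadings copied from the pca$rotation output
pc1 <- c(Cash = 0.350645231, Revenue = 0.307355170, Net_Profit = 0.278740598,
         Current_Asset = 0.384617649, Total_Asset = 0.389244789,
         Total_Liabilities = 0.375919707, Current_Liabilities = 0.379701062,
         EPS = 0.042251104, BVPS = 0.046727709, P.E = -0.001533694,
         P.B = 0.087805104, ROAA = 0.017879638, D.E = 0.025933274,
         Current_Ratio = -0.026830832, Net_Margin = 0.001352952,
         Asset_Turnover = -0.012957408, Cash_Operating = 0.159237056,
         Cash_Investing = -0.249363689, Cash_Financing = 0.137304007)

# Squared loadings as percentages of the component
contrib_pc1 <- 100 * pc1^2 / sum(pc1^2)

# Reference level: the contribution each variable would have if all were equal
expected <- 100 / length(pc1)

print(round(sort(contrib_pc1, decreasing = TRUE), 2))
print(names(contrib_pc1)[contrib_pc1 > expected])
```

The variables above the reference level on PC1 are the size-related balance sheet and income items, which agrees with the correlation analysis earlier in the report.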

Conclusion:

In conclusion, when the critical metrics are spread across many components, using 6-8 components is efficient, but the choice should be validated on downstream tasks such as stock price prediction. PCA is useful, but it may not be the optimal choice here.
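
One way to run such a downstream check is to compare a simple classifier trained on the raw variables with one trained on the first few principal components. The sketch below uses simulated data and a logistic regression; the `toy` data, the label rule, the 70/30 split, and k = 2 are all assumptions for illustration, not the project's actual model or results.

```r
# Simulated indicators: three size-driven variables plus two independent ones
set.seed(42)
n <- 400
size <- rnorm(n)
toy <- data.frame(
  Total_Asset       = size + rnorm(n, sd = 0.2),
  Total_Liabilities = size + rnorm(n, sd = 0.2),
  Current_Asset     = size + rnorm(n, sd = 0.2),
  ROAA              = rnorm(n),
  EPS               = rnorm(n)
)
# Hypothetical binary label, loosely mimicking the Label column
toy_label <- as.integer(toy$ROAA + 0.5 * size + rnorm(n) < 0)

# Train/test split
train <- sample(n, round(0.7 * n))
test  <- setdiff(seq_len(n), train)

# Fit PCA on the training rows only, then project both sets onto the first k PCs
k <- 2
pca_toy  <- prcomp(toy[train, ], center = TRUE, scale. = TRUE)
pc_train <- as.data.frame(predict(pca_toy, toy[train, ])[, 1:k])
pc_test  <- as.data.frame(predict(pca_toy, toy[test, ])[, 1:k])

# Fit a logistic regression and return its accuracy on the test rows
accuracy <- function(x_train, x_test) {
  fit  <- glm(y ~ ., data = data.frame(y = toy_label[train], x_train),
              family = binomial)
  pred <- predict(fit, newdata = x_test, type = "response") > 0.5
  mean(pred == (toy_label[test] == 1))
}

acc_raw <- accuracy(toy[train, ], toy[test, ])
acc_pca <- accuracy(pc_train, pc_test)
print(c(raw = acc_raw, pca = acc_pca))
```

Fitting the PCA on the training rows only keeps information from the test rows out of the components. If the PCA-based accuracy is clearly lower than the raw one, that would suggest the retained components drop information the prediction task needs.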