CAFEF.VN FINANCE WEB SCRAPING

I. INTRODUCTION

I was really frustrate every time I want to analyze financial statements of a company because it’s a boring task. I have to look through cafef.vn website and copy the numers page by page. Each report takes me about an hour just to collect the number, and more time needed to organize them into right format.

As a result, I wonder if I can use R to get all of those value and organize them in appropriate formats, and the answer is YES. This solution help me collect any financial statements in a second. How amazing it is.

Final result is like:

scraping_finance(

  • company_stock, #company name on stock market, e.g. “fpt”, “vcb”
  • report_type, #“bsheet” (for balance sheet), “incsta” (for incomes statement), or “cashflow”
  • year_start,
  • year_end,
  • time_length, # “quarterly” or “annually”
  • format) # “web format” or “statistics”

II. SCRAPING WEB FOR FINANCIAL STATEMENTS

0. Preparation

Install required packages - “rvest” pakcage for web scraping

#install.packages("rvest")
library(rvest) # for web scraping

library(tidyverse) #for data analysis
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()         masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag()            masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

1. Create a list of page to scrape

I will create a funciton to get a list of page according to the input information.

#function
scraping_pages <- function(company_stock, #stock name of company, e.g "vcb", "fpt"
                           year_start, 
                           year_end,
                           report_type, #"bsheet"/ "incsta" (income statemanet)/ "cashflow"
                           time_length) #quarterly or annually
  {
  year <- c(year_start:year_end) #input year, however, you need to check it first
  
  if (time_length == "quarterly") {
    #web site link template for quarterly report
    list_of_page <- str_c("https://s.cafef.vn/bao-cao-tai-chinh/",company_stock,"/",report_type,"/",year,"/4/1/0/ket-qua-hoat-dong-kinh-doanh-.chn") 
  } else {
    #web site link template for annually report
    list_of_page <- str_c("https://s.cafef.vn/bao-cao-tai-chinh/",company_stock,"/",report_type,"/",year,"/0/0/0/bao-cao-tai-chinh-.chn")
  }
  
  return(list_of_page)
}

2. Scrape a web page

I will create a function that scrape a cafef.vn web page and give a table of financial information.

The function is specialized for the cafef.vn format. Function name pnl() mean profit and loss statement, because it’s originally created to scrape the profit and loss statement (income statement). This version is already updated to scrape through all of 3 kinds of financial statement:

  • balance sheet (if report_type == "bsheet")
  • income statement (if the report_type == "incsta")
  • cashflow (if the report_type == "cashflow")
# function
pnl <- function(link, time_length){
  page <- read_html(link)
  
  #--- get the raw data => pnl_raw
  pnl_raw <- page %>% 
  html_nodes(".b_r_c") %>% #syntax ".attribute"
  html_text()
  
  #--- transform data into a tibble => pnl_tib
  pnl_tib <- tibble(stat = gsub(",","",pnl_raw)[-1])
  
  #--- create a standard table => pnl_table
  #Create table format
  num_c <- 6
  num_r <- (length(pnl_raw) - 1)/6
  
  pnl_table <- tibble()
  for (r in 1:num_r) {
    for (c in 1:num_c) {
      pnl_table[r,c] <- pnl_tib[(r-1)*6+c,]
      }
  }
  pnl_table <- pnl_table[-6]   #eliminate blank column
  
  #--- add column name => pnl_standard
  #column name
  column_names <- c("stat","I","II","III","IV")
  colnames(pnl_table) = column_names
  
  #tranform chr column into num colum
  pnl_num <- pnl_table[-1] %>% 
    mutate_if(is.character,as.numeric)
  
  pnl_standard <- bind_cols(pnl_table[1],pnl_num)
  
  #--- take year
  #take year
  year <- strsplit(link,"/") #take year automatically
  year_2 <- subset(as.numeric(year[[1]]), as.numeric(year[[1]])>200) %>% 
      as.numeric()
    
  #CASE OF QUARTERLY & ANNUALLY
  if (time_length == "quarterly"){
     #--- add year & quarter => pnl_full
    year_3 <- tribble(
      ~stat, ~I, ~II, ~III, ~IV,
      "Quar",1,2,3,4,
      "Year",year_2,year_2,year_2,year_2
      )
    
    pnl_full <- bind_rows(pnl_standard,year_3)
  } else {
    #--- add year => pnl_full
    year_3 <- tribble(
      ~stat, ~I, ~II, ~III, ~IV,
      "Year",year_2 - 3,year_2 - 2,year_2 - 1,year_2) 
    
    pnl_full <- bind_rows(pnl_standard,year_3)
  }
 
  #result
  return (pnl_full)
}

3. Scrape through all the page in a function

Web page format

#function
scraping_finance <- function(company_stock,
                           report_type,
                           year_start,
                           year_end,
                           time_length,
                           format) {
  list_of_page <- scraping_pages(company_stock,
                                 year_start,
                                 year_end,
                                 report_type,
                                 time_length)
  
  #--- create the table => table_full
  #updating element table full
  if (time_length == "quarterly"){
    table_full <- pnl(list_of_page[1],time_length) #initiate with the 1st page in the list
    next_col <- -1 #if it's quarterly, we'll remove the 1st column and update the rest
    } else {
    table_full <- pnl(list_of_page[1],time_length)[,c(1,5)] #initiate with only column 1 and 5 of the 1st page in the list
    next_col <- 5 #if it's annually, we'll update only the 5th column
    }
  
  #complete the table
  for (n in 2:length(list_of_page)) {
    table_next <- pnl(list_of_page[n],time_length)
    table_full <- bind_cols(table_full,table_next[next_col])}
  
  #--- table format 
  if (format == "web format") {
    final_table <- table_full
  } else {
    #transpose table for data analysis
    statistic <- table_full[,-1]
    statistic_names <- table_full[[1]] #get the names as a list
    
    table_t <- t(statistic)
    colnames(table_t) = statistic_names
    
    final_table <- as_tibble(table_t)
  }
  
  #--- update table name
  if (format == "web format") {
     last_row <- nrow(final_table)
     colnames(final_table) = unlist(final_table[last_row,])
  }
 
  return (final_table)
}

4. Get final result

Example:

fpt_bsheet_2008_2023_annually_web_format <- scraping_finance("fpt",
                                             "bsheet",
                                             2008,2023,
                                             "quarterly",
                                             "web format")

fpt_bsheet_2008_2023_annually_web_format
## # A tibble: 121 × 65
##    Year    `2008`   `2008`   `2008`   `2008`   `2009`   `2009`   `2009`   `2009`
##    <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1 TÀI … NA       NA       NA       NA       NA       NA       NA       NA      
##  2 A- T…  4.66e12  4.66e12  4.66e12  4.66e12  7.68e12  7.68e12  7.68e12  7.68e12
##  3 I. T…  1.24e12  1.24e12  1.24e12  1.24e12  2.31e12  2.31e12  2.31e12  2.31e12
##  4 1. T…  1.09e12  1.09e12  1.09e12  1.09e12  1.49e12  1.49e12  1.49e12  1.49e12
##  5 2. C…  1.54e11  1.54e11  1.54e11  1.54e11  8.19e11  8.19e11  8.19e11  8.19e11
##  6 II. … NA       NA       NA       NA        6.20e11  6.20e11  6.20e11  6.20e11
##  7 1. C… NA       NA       NA       NA       NA       NA       NA       NA      
##  8 2. D… NA       NA       NA       NA       NA       NA       NA       NA      
##  9 3. Đ… NA       NA       NA       NA       NA       NA       NA       NA      
## 10 III.…  1.99e12  1.99e12  1.99e12  1.99e12  2.55e12  2.55e12  2.55e12  2.55e12
## # ℹ 111 more rows
## # ℹ 56 more variables: `2010` <dbl>, `2010` <dbl>, `2010` <dbl>, `2010` <dbl>,
## #   `2011` <dbl>, `2011` <dbl>, `2011` <dbl>, `2011` <dbl>, `2012` <dbl>,
## #   `2012` <dbl>, `2012` <dbl>, `2012` <dbl>, `2013` <dbl>, `2013` <dbl>,
## #   `2013` <dbl>, `2013` <dbl>, `2014` <dbl>, `2014` <dbl>, `2014` <dbl>,
## #   `2014` <dbl>, `2015` <dbl>, `2015` <dbl>, `2015` <dbl>, `2015` <dbl>,
## #   `2016` <dbl>, `2016` <dbl>, `2016` <dbl>, `2016` <dbl>, `2017` <dbl>, …