I was really frustrate every time I want to analyze financial statements of a company because it’s a boring task. I have to look through cafef.vn website and copy the numers page by page. Each report takes me about an hour just to collect the number, and more time needed to organize them into right format.
As a result, I wonder if I can use R to get all of those value and organize them in appropriate formats, and the answer is YES. This solution help me collect any financial statements in a second. How amazing it is.
Final result is like:
scraping_finance(
Install required packages - “rvest” pakcage for web scraping
#install.packages("rvest")
library(rvest) # for web scraping
library(tidyverse) #for data analysis
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
I will create a funciton to get a list of page according to the input information.
#function
scraping_pages <- function(company_stock, #stock name of company, e.g "vcb", "fpt"
year_start,
year_end,
report_type, #"bsheet"/ "incsta" (income statemanet)/ "cashflow"
time_length) #quarterly or annually
{
year <- c(year_start:year_end) #input year, however, you need to check it first
if (time_length == "quarterly") {
#web site link template for quarterly report
list_of_page <- str_c("https://s.cafef.vn/bao-cao-tai-chinh/",company_stock,"/",report_type,"/",year,"/4/1/0/ket-qua-hoat-dong-kinh-doanh-.chn")
} else {
#web site link template for annually report
list_of_page <- str_c("https://s.cafef.vn/bao-cao-tai-chinh/",company_stock,"/",report_type,"/",year,"/0/0/0/bao-cao-tai-chinh-.chn")
}
return(list_of_page)
}
I will create a function that scrape a cafef.vn web page and give a table of financial information.
The function is specialized for the cafef.vn format. Function name
pnl() mean profit and loss statement, because it’s
originally created to scrape the profit and loss statement (income
statement). This version is already updated to scrape through all of 3
kinds of financial statement:
report_type == "bsheet")report_type == "incsta")report_type == "cashflow")# function
pnl <- function(link, time_length){
page <- read_html(link)
#--- get the raw data => pnl_raw
pnl_raw <- page %>%
html_nodes(".b_r_c") %>% #syntax ".attribute"
html_text()
#--- transform data into a tibble => pnl_tib
pnl_tib <- tibble(stat = gsub(",","",pnl_raw)[-1])
#--- create a standard table => pnl_table
#Create table format
num_c <- 6
num_r <- (length(pnl_raw) - 1)/6
pnl_table <- tibble()
for (r in 1:num_r) {
for (c in 1:num_c) {
pnl_table[r,c] <- pnl_tib[(r-1)*6+c,]
}
}
pnl_table <- pnl_table[-6] #eliminate blank column
#--- add column name => pnl_standard
#column name
column_names <- c("stat","I","II","III","IV")
colnames(pnl_table) = column_names
#tranform chr column into num colum
pnl_num <- pnl_table[-1] %>%
mutate_if(is.character,as.numeric)
pnl_standard <- bind_cols(pnl_table[1],pnl_num)
#--- take year
#take year
year <- strsplit(link,"/") #take year automatically
year_2 <- subset(as.numeric(year[[1]]), as.numeric(year[[1]])>200) %>%
as.numeric()
#CASE OF QUARTERLY & ANNUALLY
if (time_length == "quarterly"){
#--- add year & quarter => pnl_full
year_3 <- tribble(
~stat, ~I, ~II, ~III, ~IV,
"Quar",1,2,3,4,
"Year",year_2,year_2,year_2,year_2
)
pnl_full <- bind_rows(pnl_standard,year_3)
} else {
#--- add year => pnl_full
year_3 <- tribble(
~stat, ~I, ~II, ~III, ~IV,
"Year",year_2 - 3,year_2 - 2,year_2 - 1,year_2)
pnl_full <- bind_rows(pnl_standard,year_3)
}
#result
return (pnl_full)
}
Web page format
#function
scraping_finance <- function(company_stock,
report_type,
year_start,
year_end,
time_length,
format) {
list_of_page <- scraping_pages(company_stock,
year_start,
year_end,
report_type,
time_length)
#--- create the table => table_full
#updating element table full
if (time_length == "quarterly"){
table_full <- pnl(list_of_page[1],time_length) #initiate with the 1st page in the list
next_col <- -1 #if it's quarterly, we'll remove the 1st column and update the rest
} else {
table_full <- pnl(list_of_page[1],time_length)[,c(1,5)] #initiate with only column 1 and 5 of the 1st page in the list
next_col <- 5 #if it's annually, we'll update only the 5th column
}
#complete the table
for (n in 2:length(list_of_page)) {
table_next <- pnl(list_of_page[n],time_length)
table_full <- bind_cols(table_full,table_next[next_col])}
#--- table format
if (format == "web format") {
final_table <- table_full
} else {
#transpose table for data analysis
statistic <- table_full[,-1]
statistic_names <- table_full[[1]] #get the names as a list
table_t <- t(statistic)
colnames(table_t) = statistic_names
final_table <- as_tibble(table_t)
}
#--- update table name
if (format == "web format") {
last_row <- nrow(final_table)
colnames(final_table) = unlist(final_table[last_row,])
}
return (final_table)
}
Example:
fpt_bsheet_2008_2023_annually_web_format <- scraping_finance("fpt",
"bsheet",
2008,2023,
"quarterly",
"web format")
fpt_bsheet_2008_2023_annually_web_format
## # A tibble: 121 × 65
## Year `2008` `2008` `2008` `2008` `2009` `2009` `2009` `2009`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 TÀI … NA NA NA NA NA NA NA NA
## 2 A- T… 4.66e12 4.66e12 4.66e12 4.66e12 7.68e12 7.68e12 7.68e12 7.68e12
## 3 I. T… 1.24e12 1.24e12 1.24e12 1.24e12 2.31e12 2.31e12 2.31e12 2.31e12
## 4 1. T… 1.09e12 1.09e12 1.09e12 1.09e12 1.49e12 1.49e12 1.49e12 1.49e12
## 5 2. C… 1.54e11 1.54e11 1.54e11 1.54e11 8.19e11 8.19e11 8.19e11 8.19e11
## 6 II. … NA NA NA NA 6.20e11 6.20e11 6.20e11 6.20e11
## 7 1. C… NA NA NA NA NA NA NA NA
## 8 2. D… NA NA NA NA NA NA NA NA
## 9 3. Đ… NA NA NA NA NA NA NA NA
## 10 III.… 1.99e12 1.99e12 1.99e12 1.99e12 2.55e12 2.55e12 2.55e12 2.55e12
## # ℹ 111 more rows
## # ℹ 56 more variables: `2010` <dbl>, `2010` <dbl>, `2010` <dbl>, `2010` <dbl>,
## # `2011` <dbl>, `2011` <dbl>, `2011` <dbl>, `2011` <dbl>, `2012` <dbl>,
## # `2012` <dbl>, `2012` <dbl>, `2012` <dbl>, `2013` <dbl>, `2013` <dbl>,
## # `2013` <dbl>, `2013` <dbl>, `2014` <dbl>, `2014` <dbl>, `2014` <dbl>,
## # `2014` <dbl>, `2015` <dbl>, `2015` <dbl>, `2015` <dbl>, `2015` <dbl>,
## # `2016` <dbl>, `2016` <dbl>, `2016` <dbl>, `2016` <dbl>, `2017` <dbl>, …
Write to excel
#install.packages("writexl")
#library(writexl)
#library(readxl)
#write_xlsx(company_history_web_format, path="fpt_bsheet_2008_2023_annually_web.xlsx")