Description: This meetup is for anyone interested in learning and sharing knowledge about scraping data from Yahoo Finance using R. Yahoo Finance provides a wealth of financial data that can be used for research, analysis, and investment purposes. In this meetup, we will discuss the basics of web scraping, explore the structure of Yahoo Finance pages, and walk through scraping data from Yahoo Finance and analysing it with R packages such as ggplot2, quantmod, and forecast.
Who should attend: Anyone interested in web scraping and its application to financial data, from beginners to experienced data analysts and investors. This meetup is open to all skill levels.
Requirements: Participants should have a laptop available for the online event. Basic knowledge of R programming is recommended, but not required. Internet access is required to reach Yahoo Finance pages during the live coding session.
Quantmod is an R package that provides a suite of tools for quantitative financial modeling and analysis. It enables users to access and manipulate financial data from various sources, including Yahoo Finance. In this tutorial, we will walk through the steps of using quantmod to retrieve and analyze Yahoo Finance data.
To start using quantmod and the other free libraries, we need to load the packages into R by running the following command:
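The loading command itself is not reproduced in this write-up. Judging from the startup messages that follow (dplyr first, then quantmod with its xts, zoo and TTR dependencies, then lubridate), it was probably close to the sketch below. The exact set and order of packages is an assumption.

```r
# Assumed package set, inferred from the startup messages that follow.
# Install once with install.packages(c("dplyr", "quantmod", "lubridate")).
library(dplyr)      # data-frame manipulation
library(quantmod)   # getSymbols(); also attaches xts, zoo and TTR
library(lubridate)  # date helpers
```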
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## ################################### WARNING ###################################
## # We noticed you have dplyr installed. The dplyr lag() function breaks how #
## # base R's lag() function is supposed to work, which breaks lag(my_xts). #
## # #
## # Calls to lag(my_xts) that you enter or source() into this session won't #
## # work correctly. #
## # #
## # All package code is unaffected because it is protected by the R namespace #
## # mechanism. #
## # #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## # You can use stats::lag() to make sure you're not using dplyr::lag(), or you #
## # can add conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## ################################### WARNING ###################################
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
## Loading required package: TTR
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
The first step in using quantmod to retrieve Yahoo Finance data is to specify the ticker symbol for the stock you want to analyze. For example, if you want to retrieve data for Tyson Foods and Cal-Maine Foods, the ticker symbols are TSN and CALM, respectively.
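Once you have the ticker symbol, you can use the getSymbols() function to retrieve the data. This function downloads data from various sources, including Yahoo Finance. By default it stores the result in your R session as an xts object named after the ticker and returns the ticker name, which is why the calls below print "TSN" and "CALM".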
To retrieve data for TSN and CALM, run the following command:
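# Download daily TSN prices; getSymbols() stores them in an xts object named TSN
getSymbols('TSN', src = 'yahoo',
           from = "2010-01-01", to = Sys.Date())
## [1] "TSN"
# Convert to a data frame and keep the trading dates as an explicit column
Stock1 <- data.frame(
  TSN,
  date = as.Date(index(TSN))
)
# Repeat the same steps for CALM
getSymbols('CALM', src = 'yahoo',
           from = "2010-01-01", to = Sys.Date())
## [1] "CALM"
Stock2 <- data.frame(
  CALM,
  date = as.Date(index(CALM))
)
# Show the first six rows
head(Stock2)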
## CALM.Open CALM.High CALM.Low CALM.Close CALM.Volume CALM.Adjusted
## 2010-01-04 17.140 17.395 17.025 17.140 468400 13.00976
## 2010-01-05 17.090 17.135 16.420 16.690 811400 12.66820
## 2010-01-06 16.595 16.825 16.490 16.825 542600 12.77067
## 2010-01-07 16.825 17.175 16.450 17.165 323600 13.02874
## 2010-01-08 17.145 17.425 17.085 17.290 169000 13.12362
## 2010-01-11 17.320 17.350 17.110 17.170 164400 13.03253
## date
## 2010-01-04 2010-01-04
## 2010-01-05 2010-01-05
## 2010-01-06 2010-01-06
## 2010-01-07 2010-01-07
## 2010-01-08 2010-01-08
## 2010-01-11 2010-01-11
This will download the daily historical data for these two stocks.
Note that we specified the start date using the from argument and the end date using the to argument. We set the end date to Sys.Date(), which retrieves data up to the current date.
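If you would rather receive the data as a return value than have it assigned to the workspace, getSymbols() accepts `auto.assign = FALSE`. The sketch below wraps that in a small helper. `fetch_prices` is a hypothetical name introduced here for illustration and is not part of quantmod.

```r
library(quantmod)

# Hypothetical helper (not part of quantmod): download one ticker for a date
# range and return the xts object instead of assigning it to the workspace.
fetch_prices <- function(ticker, from = "2010-01-01", to = Sys.Date()) {
  from <- as.Date(from)
  to   <- as.Date(to)
  stopifnot(from <= to)  # guard against a reversed date range
  getSymbols(ticker, src = "yahoo", from = from, to = to,
             auto.assign = FALSE)  # FALSE: return the data, do not assign it
}

# Usage (needs internet access, so left commented out here):
# tsn_2020 <- fetch_prices("TSN", from = "2020-01-01", to = "2020-12-31")
# head(Ad(tsn_2020))  # Ad() extracts the adjusted close column
```

Returning the object makes it easier to fetch several tickers in a loop without cluttering the workspace with one variable per symbol.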
Once you have retrieved the data, you can use various functions to explore and manipulate it. Here are a few examples:
To get a summary of the data, run the summary() function.
### Summary Statistics
To get the first six rows of the data, run the head() function.
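As a quick illustration of these two functions, the sketch below retypes a few columns of the six CALM rows printed earlier so that it runs without a download. With the full data you would pass `Stock1` or `Stock2` directly. The name `stock2_sample` is introduced here for illustration only.

```r
# Six CALM rows copied from the head(Stock2) output earlier in this tutorial
# (a subset of columns), so the example runs offline.
stock2_sample <- data.frame(
  CALM.Close    = c(17.140, 16.690, 16.825, 17.165, 17.290, 17.170),
  CALM.Volume   = c(468400, 811400, 542600, 323600, 169000, 164400),
  CALM.Adjusted = c(13.00976, 12.66820, 12.77067, 13.02874, 13.12362, 13.03253),
  date = as.Date(c("2010-01-04", "2010-01-05", "2010-01-06",
                   "2010-01-07", "2010-01-08", "2010-01-11"))
)

# head() returns the first six rows by default; ask for three here
print(head(stock2_sample, 3))

# summary() on a numeric column gives min, quartiles, median, mean and max
adj_summary <- summary(stock2_sample$CALM.Adjusted)
print(adj_summary)
```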
Now let’s dive deeper!
## Stock1 Stock2
## [1,] 9.969864 13.00976
## [2,] 10.157207 12.66820
## [3,] 10.670362 12.77067
## [4,] 10.857701 13.02874
## [5,] 10.833267 13.12362
## [6,] 10.686650 13.03253
## [1] 0.7663372 0.8017877 0.8355367 0.8333652 0.8254787 0.8199978
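The code behind the two printouts above is not shown. They look like the first six adjusted closing prices of the two stocks side by side, followed by their ratio (Stock1 divided by Stock2). The sketch below reproduces that reading, using the six printed values so it runs offline. The variable names are assumptions.

```r
# First six adjusted closes, copied from the printed output above
tsn_adj  <- c(9.969864, 10.157207, 10.670362, 10.857701, 10.833267, 10.686650)
calm_adj <- c(13.00976, 12.66820, 12.77067, 13.02874, 13.12362, 13.03253)

# Place the two series side by side as a two-column matrix
prices <- cbind(Stock1 = tsn_adj, Stock2 = calm_adj)
print(prices)

# Price ratio of TSN to CALM for each trading day
price_ratio <- prices[, "Stock1"] / prices[, "Stock2"]
print(price_ratio)
```

With the full download, `Stock1$TSN.Adjusted` and `Stock2$CALM.Adjusted` would take the place of the two hand-typed vectors.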
## [1] "TSN"
## [1] "SPY"
## SPY.Open SPY.High SPY.Low SPY.Close SPY.Volume SPY.Adjusted
## 2010-01-04 112.37 113.39 111.51 113.33 118944600 88.11789
## 2010-01-05 113.26 113.68 112.85 113.63 111579900 88.35114
## 2010-01-06 113.52 113.99 113.43 113.71 116074400 88.41336
## 2010-01-07 113.50 114.33 113.18 114.19 131091100 88.78656
## 2010-01-08 113.89 114.62 113.66 114.57 126402800 89.08204
## 2010-01-11 115.08 115.13 114.24 114.73 106375700 89.20644
## date
## 2010-01-04 2010-01-04
## 2010-01-05 2010-01-05
## 2010-01-06 2010-01-06
## 2010-01-07 2010-01-07
## 2010-01-08 2010-01-08
## 2010-01-11 2010-01-11
## TSN.Open TSN.High TSN.Low TSN.Close TSN.Volume TSN.Adjusted date
## 1 12.27 12.30 12.15 12.24 3355000 9.96986 2010-01-04
## 2 12.21 12.49 12.21 12.47 3781300 10.15721 2010-01-05
## 3 12.79 13.12 12.58 13.10 6810500 10.67036 2010-01-06
## 4 13.10 13.42 13.01 13.33 5979200 10.85770 2010-01-07
## 5 13.29 13.37 13.12 13.30 3999300 10.83327 2010-01-08
## 6 13.29 13.40 13.04 13.12 2875700 10.68665 2010-01-11
## SPY.Open SPY.High SPY.Low SPY.Close SPY.Volume SPY.Adjusted
## 1 112.37 113.39 111.51 113.33 118944600 88.11789
## 2 113.26 113.68 112.85 113.63 111579900 88.35114
## 3 113.52 113.99 113.43 113.71 116074400 88.41336
## 4 113.50 114.33 113.18 114.19 131091100 88.78656
## 5 113.89 114.62 113.66 114.57 126402800 89.08204
## 6 115.08 115.13 114.24 114.73 106375700 89.20644
##
## Call:
## lm(formula = TSN.Adjusted ~ SPY.Adjusted, data = time_series)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.807 -7.690 -3.288 8.124 29.231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.15424 0.43224 4.984 6.55e-07 ***
## SPY.Adjusted 0.19838 0.00176 112.696 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.76 on 3322 degrees of freedom
## Multiple R-squared: 0.7927, Adjusted R-squared: 0.7926
## F-statistic: 1.27e+04 on 1 and 3322 DF, p-value: < 2.2e-16
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 2.154 | 0.4322 | 4.984 | 6.552e-07 |
SPY.Adjusted | 0.1984 | 0.00176 | 112.7 | < 2e-16 |

Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
---|---|---|---|
3324 | 10.76 | 0.7927 | 0.7926 |
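The regression output above comes from code that is not shown. From the printed `Call:` line, it joins the stock and SPY data on date into a data frame named `time_series` and fits `lm(TSN.Adjusted ~ SPY.Adjusted)`. The sketch below follows those steps on the six rows printed above so that it runs offline. Base R `merge()` is used here as an assumption; the original may have used a dplyr join instead.

```r
# First six rows copied from the printed output above. The full analysis uses
# every trading day since 2010, so the estimates here will differ from it.
dates <- as.Date(c("2010-01-04", "2010-01-05", "2010-01-06",
                   "2010-01-07", "2010-01-08", "2010-01-11"))
stock <- data.frame(
  date = dates,
  TSN.Adjusted = c(9.96986, 10.15721, 10.67036, 10.85770, 10.83327, 10.68665)
)
market <- data.frame(
  date = dates,
  SPY.Adjusted = c(88.11789, 88.35114, 88.41336, 88.78656, 89.08204, 89.20644)
)

# Align the two series on trading date, keeping only dates present in both
time_series <- merge(stock, market, by = "date")

# Regress the stock's adjusted close on the market's adjusted close
fit <- lm(TSN.Adjusted ~ SPY.Adjusted, data = time_series)
print(summary(fit))
```

The slope estimates how many dollars the stock's adjusted price moves, on average, per one-dollar move in SPY's adjusted price over the sample.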
## [1] "CALM"
## [1] "SPY"
## SPY.Open SPY.High SPY.Low SPY.Close SPY.Volume SPY.Adjusted
## 2010-01-04 112.37 113.39 111.51 113.33 118944600 88.11791
## 2010-01-05 113.26 113.68 112.85 113.63 111579900 88.35118
## 2010-01-06 113.52 113.99 113.43 113.71 116074400 88.41335
## 2010-01-07 113.50 114.33 113.18 114.19 131091100 88.78657
## 2010-01-08 113.89 114.62 113.66 114.57 126402800 89.08207
## 2010-01-11 115.08 115.13 114.24 114.73 106375700 89.20642
## date
## 2010-01-04 2010-01-04
## 2010-01-05 2010-01-05
## 2010-01-06 2010-01-06
## 2010-01-07 2010-01-07
## 2010-01-08 2010-01-08
## 2010-01-11 2010-01-11
## CALM.Open CALM.High CALM.Low CALM.Close CALM.Volume CALM.Adjusted date
## 1 17.140 17.395 17.025 17.140 468400 13.00976 2010-01-04
## 2 17.090 17.135 16.420 16.690 811400 12.66820 2010-01-05
## 3 16.595 16.825 16.490 16.825 542600 12.77067 2010-01-06
## 4 16.825 17.175 16.450 17.165 323600 13.02874 2010-01-07
## 5 17.145 17.425 17.085 17.290 169000 13.12362 2010-01-08
## 6 17.320 17.350 17.110 17.170 164400 13.03253 2010-01-11
## SPY.Open SPY.High SPY.Low SPY.Close SPY.Volume SPY.Adjusted
## 1 112.37 113.39 111.51 113.33 118944600 88.11791
## 2 113.26 113.68 112.85 113.63 111579900 88.35118
## 3 113.52 113.99 113.43 113.71 116074400 88.41335
## 4 113.50 114.33 113.18 114.19 131091100 88.78657
## 5 113.89 114.62 113.66 114.57 126402800 89.08207
## 6 115.08 115.13 114.24 114.73 106375700 89.20642
##
## Call:
## lm(formula = CALM.Adjusted ~ SPY.Adjusted, data = time_series)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.4585 -7.7902 0.3837 6.3728 27.3414
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.545121 0.373875 36.23 <2e-16 ***
## SPY.Adjusted 0.085849 0.001523 56.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.305 on 3322 degrees of freedom
## Multiple R-squared: 0.489, Adjusted R-squared: 0.4888
## F-statistic: 3179 on 1 and 3322 DF, p-value: < 2.2e-16
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 13.55 | 0.3739 | 36.23 | 1.669e-242 |
SPY.Adjusted | 0.08585 | 0.001523 | 56.38 | < 2e-16 |

Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
---|---|---|---|
3324 | 9.305 | 0.489 | 0.4888 |
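The two small tables after each regression look like formatted extracts of the fitted model; the code or table package that produced them is not shown. The sketch below pulls the same quantities out of a `summary.lm` object using base R only. It fits the model on the six CALM and SPY rows printed above so it runs offline, so the numbers will not match the full-sample tables.

```r
# Six rows copied from the printed CALM and SPY output above
time_series <- data.frame(
  CALM.Adjusted = c(13.00976, 12.66820, 12.77067, 13.02874, 13.12362, 13.03253),
  SPY.Adjusted  = c(88.11791, 88.35118, 88.41335, 88.78657, 89.08207, 89.20642)
)

fit <- lm(CALM.Adjusted ~ SPY.Adjusted, data = time_series)
fit_summary <- summary(fit)

# Coefficient table: one row per term, with columns
# Estimate, Std. Error, t value and Pr(>|t|)
coef_table <- coef(fit_summary)
print(coef_table)

# Model-level statistics shown in the second table
fit_stats <- c(
  observations  = nobs(fit),
  resid_std_err = fit_summary$sigma,
  r_squared     = fit_summary$r.squared,
  adj_r_squared = fit_summary$adj.r.squared
)
print(fit_stats)
```

Working from `coef(summary(fit))` keeps the p-values at full precision, which avoids a very small p-value being displayed as 0 after rounding.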
References:
Intro to the quantmod package. https://www.quantmod.com/
Using R for Time Series Analysis. https://a-little-book-of-r-for-time-series.readthedocs.io/en/latest/src/timeseries.html