library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.3.0
## ✔ tibble 2.0.1 ✔ dplyr 0.8.0.1
## ✔ tidyr 0.8.2 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(stringr)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(knitr)
This markdown reviews the validity of the FOMC data frame created by Jagdish. We assume the fomc_data.rds file is located in the working directory.
d2<-readRDS(file = "fomc_data.rds")
The FOMC dates are out of order and should be arranged alphabetically; because they are stored as YYYYMMDD strings, alphabetical order is also chronological order. Word counts should be tabulated using str_count.
d2 %>%
  mutate_if(is.factor, as.character) %>%   # dates are stored as factors; convert to strings
  arrange(statement.dates) %>%
  mutate(numwords = str_count(statement.content, "\\S+")) %>%   # count runs of non-whitespace
  select(statement.dates, numwords) %>%
  filter(numwords > 700) %>%
  arrange(numwords)
## statement.dates numwords
## 1 20121212 702
## 2 20130619 713
## 3 20130731 716
## 4 20141029 731
## 5 20141217 748
## 6 20131030 792
## 7 20130918 814
## 8 20140618 822
## 9 20140430 835
## 10 20140129 856
## 11 20140730 865
## 12 20131218 893
## 13 20140319 901
## 14 20140917 919
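To make the counting rule concrete, here is a minimal sketch on made-up strings (the `toy` vector is purely illustrative and is not taken from the FOMC data). With the pattern `\\S+`, `str_count()` counts maximal runs of non-whitespace characters, so punctuation stays attached to its word and any run of spaces or newlines acts as a single separator.

```r
library(stringr)

# Hypothetical example strings, not taken from fomc_data.rds
toy <- c("The Committee decided today",
         "rates   remain\nunchanged.",
         "")

# Each run of non-whitespace characters counts as one word
toy_counts <- str_count(toy, "\\S+")
toy_counts   # 4 3 0
```

The second string shows that repeated spaces and an embedded newline do not inflate the count, and the empty string yields zero.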
d2 %>%
  mutate_if(is.factor, as.character) %>%   # needed to change dates from factors to strings
  arrange(statement.dates) %>%
  mutate(numwords = str_count(statement.content, "\\S+")) -> d3
d3 %>% select( statement.dates, numwords) %>%
kable() %>%
kable_styling() %>%
scroll_box(width="85%", height="200px")
| statement.dates | numwords |
|---|---|
| 20070131 | 189 |
| 20070321 | 175 |
| 20070509 | 176 |
| 20070618 | 187 |
| 20070807 | 215 |
| 20070810 | 92 |
| 20070817 | 138 |
| 20070918 | 267 |
| 20071031 | 313 |
| 20071211 | 289 |
| 20080122 | 263 |
| 20080130 | 261 |
| 20080311 | 415 |
| 20080318 | 299 |
| 20080430 | 317 |
| 20080625 | 259 |
| 20080805 | 254 |
| 20080916 | 232 |
| 20081008 | 439 |
| 20081029 | 297 |
| 20081216 | 411 |
| 20090128 | 485 |
| 20090318 | 432 |
| 20090429 | 437 |
| 20090624 | 349 |
| 20090812 | 434 |
| 20090923 | 437 |
| 20091104 | 475 |
| 20091216 | 584 |
| 20100127 | 563 |
| 20100316 | 458 |
| 20100428 | 422 |
| 20100509 | 290 |
| 20100623 | 372 |
| 20100810 | 471 |
| 20100921 | 440 |
| 20101103 | 506 |
| 20101214 | 505 |
| 20110126 | 450 |
| 20110315 | 476 |
| 20110427 | 477 |
| 20110622 | 469 |
| 20110809 | 537 |
| 20110921 | 606 |
| 20111102 | 501 |
| 20111213 | 445 |
| 20120125 | 434 |
| 20120313 | 445 |
| 20120425 | 454 |
| 20120620 | 501 |
| 20120801 | 473 |
| 20120913 | 571 |
| 20121024 | 564 |
| 20121212 | 702 |
| 20130130 | 661 |
| 20130320 | 659 |
| 20130501 | 685 |
| 20130619 | 713 |
| 20130731 | 716 |
| 20130918 | 814 |
| 20131030 | 792 |
| 20131218 | 893 |
| 20140129 | 856 |
| 20140319 | 901 |
| 20140430 | 835 |
| 20140618 | 822 |
| 20140730 | 865 |
| 20140917 | 919 |
| 20141029 | 731 |
| 20141217 | 748 |
| 20150128 | 583 |
| 20150318 | 598 |
| 20150429 | 574 |
| 20150617 | 559 |
| 20150729 | 553 |
| 20150917 | 602 |
| 20151028 | 593 |
| 20151216 | 606 |
| 20160127 | 573 |
| 20160316 | 586 |
| 20160427 | 591 |
| 20160615 | 555 |
| 20160727 | 580 |
| 20160921 | 618 |
| 20161102 | 612 |
| 20161214 | 550 |
| 20170201 | 515 |
| 20170315 | 536 |
| 20170503 | 532 |
| 20170614 | 566 |
| 20170726 | 521 |
| 20170920 | 549 |
| 20171101 | 527 |
| 20171213 | 501 |
| 20180131 | 436 |
| 20180321 | 462 |
| 20180502 | 436 |
| 20180613 | 336 |
| 20180801 | 324 |
| 20180926 | 306 |
| 20181108 | 319 |
| 20181219 | 346 |
| 20190130 | 344 |
| 20190320 | 360 |
| 20190501 | 336 |
dim(d3)
## [1] 105 4
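As a side check on the sorting step, the sketch below illustrates why arranging the YYYYMMDD strings as plain text gives chronological order. The `labels` vector is a small, deliberately shuffled sample of dates from the table above; `parsed` and `same_order` are names introduced here only for illustration.

```r
# A few date labels from the table above, deliberately shuffled
labels <- c("20100509", "20070131", "20190501", "20081008", "20070810")

# Parse the YYYYMMDD strings into Date objects
parsed <- as.Date(labels, format = "%Y%m%d")

# Sorting the strings and sorting the parsed dates give the same ordering
same_order <- identical(order(labels), order(parsed))
same_order
```

The same comparison could be run on `d3$statement.dates` once the data frame is loaded, assuming every label is a valid eight-digit date.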
We need to make 4 changes to the dataframe rows: three removals and one date correction.
First, remove the 3 dates corresponding to extraordinary FOMC meetings related to swap lines, TALF and other measures. These meetings did not have the normal rate-setting objectives. The dates to remove from the dataframe are:
20070810, 20080311 and 20100509.
Lastly, one date is mislabeled: the HTML file link names it 20070618, but the statement was actually released on 20070628. The statement itself carries the date June 28, 2007, which confirms that the URL contains a typo.
# drop the three extraordinary-meeting dates
d3 %>%
  filter(statement.dates != "20070810") %>%
  filter(statement.dates != "20080311") %>%
  filter(statement.dates != "20100509") -> d4
# correct the mislabeled June 2007 date
d4[d4$statement.dates == "20070618", "statement.dates"] <- "20070628"
d4 %>% select(statement.dates, numwords) %>%
kable() %>%
kable_styling() %>%
scroll_box(width="100%", height="250px")
| statement.dates | numwords |
|---|---|
| 20070131 | 189 |
| 20070321 | 175 |
| 20070509 | 176 |
| 20070628 | 187 |
| 20070807 | 215 |
| 20070817 | 138 |
| 20070918 | 267 |
| 20071031 | 313 |
| 20071211 | 289 |
| 20080122 | 263 |
| 20080130 | 261 |
| 20080318 | 299 |
| 20080430 | 317 |
| 20080625 | 259 |
| 20080805 | 254 |
| 20080916 | 232 |
| 20081008 | 439 |
| 20081029 | 297 |
| 20081216 | 411 |
| 20090128 | 485 |
| 20090318 | 432 |
| 20090429 | 437 |
| 20090624 | 349 |
| 20090812 | 434 |
| 20090923 | 437 |
| 20091104 | 475 |
| 20091216 | 584 |
| 20100127 | 563 |
| 20100316 | 458 |
| 20100428 | 422 |
| 20100623 | 372 |
| 20100810 | 471 |
| 20100921 | 440 |
| 20101103 | 506 |
| 20101214 | 505 |
| 20110126 | 450 |
| 20110315 | 476 |
| 20110427 | 477 |
| 20110622 | 469 |
| 20110809 | 537 |
| 20110921 | 606 |
| 20111102 | 501 |
| 20111213 | 445 |
| 20120125 | 434 |
| 20120313 | 445 |
| 20120425 | 454 |
| 20120620 | 501 |
| 20120801 | 473 |
| 20120913 | 571 |
| 20121024 | 564 |
| 20121212 | 702 |
| 20130130 | 661 |
| 20130320 | 659 |
| 20130501 | 685 |
| 20130619 | 713 |
| 20130731 | 716 |
| 20130918 | 814 |
| 20131030 | 792 |
| 20131218 | 893 |
| 20140129 | 856 |
| 20140319 | 901 |
| 20140430 | 835 |
| 20140618 | 822 |
| 20140730 | 865 |
| 20140917 | 919 |
| 20141029 | 731 |
| 20141217 | 748 |
| 20150128 | 583 |
| 20150318 | 598 |
| 20150429 | 574 |
| 20150617 | 559 |
| 20150729 | 553 |
| 20150917 | 602 |
| 20151028 | 593 |
| 20151216 | 606 |
| 20160127 | 573 |
| 20160316 | 586 |
| 20160427 | 591 |
| 20160615 | 555 |
| 20160727 | 580 |
| 20160921 | 618 |
| 20161102 | 612 |
| 20161214 | 550 |
| 20170201 | 515 |
| 20170315 | 536 |
| 20170503 | 532 |
| 20170614 | 566 |
| 20170726 | 521 |
| 20170920 | 549 |
| 20171101 | 527 |
| 20171213 | 501 |
| 20180131 | 436 |
| 20180321 | 462 |
| 20180502 | 436 |
| 20180613 | 336 |
| 20180801 | 324 |
| 20180926 | 306 |
| 20181108 | 319 |
| 20181219 | 346 |
| 20190130 | 344 |
| 20190320 | 360 |
| 20190501 | 336 |
dim(d4)
## [1] 102 4
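The drop from 105 to 102 rows matches the three removals. A slightly more defensive version of the same corrections can be sketched on a toy stand-in for `d3`, as below. The names `toy_d3`, `toy_d4` and `drop_dates` are hypothetical, the rows are copied from the table above, and `%in%` with `if_else()` is one possible way to express the edits rather than the code used in this review.

```r
library(dplyr)

# Toy stand-in for d3: a handful of rows copied from the table above
toy_d3 <- data.frame(
  statement.dates = c("20070509", "20070618", "20070807", "20070810",
                      "20080311", "20080318", "20100428", "20100509"),
  numwords = c(176, 187, 215, 92, 415, 299, 422, 290),
  stringsAsFactors = FALSE
)

# Extraordinary-meeting dates to remove
drop_dates <- c("20070810", "20080311", "20100509")

toy_d4 <- toy_d3 %>%
  filter(!statement.dates %in% drop_dates) %>%
  mutate(statement.dates = if_else(statement.dates == "20070618",
                                   "20070628", statement.dates))

# Checks: three rows dropped, old labels gone, new label present, no duplicates
stopifnot(
  nrow(toy_d3) - nrow(toy_d4) == length(drop_dates),
  !any(toy_d4$statement.dates %in% c(drop_dates, "20070618")),
  "20070628" %in% toy_d4$statement.dates,
  anyDuplicated(toy_d4$statement.dates) == 0
)
```

If any check fails, `stopifnot()` raises an error, so a silent mismatch (for example, a date label that was never present) would be caught early.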
I checked 4 statements against the actual webpage:
Oct 10, 2008; Dec 16, 2009; Sep 17, 2014 (the longest); and May 1, 2019 (the most recent).
They are all accurate except for one detail: the elimination of newline characters causes consecutive words to be conjoined, producing nonsense words. The line of code below should be removed from DS607_FOMC_Sentiment_Analysis_v3.Rmd:
#reports$statement.content[i]<-gsub("\n","",reports$statement.content[i])
Once this minor change to the gsub command is made, the dataframe d4 above should be fit for use in research.
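The effect of that gsub call can be reproduced on a short made-up string. In the sketch below, `snippet` is a hypothetical two-line fragment and not actual statement text. Replacing the newline with a space is shown only as an illustrative alternative for comparison; the recommendation above is simply to drop the line.

```r
library(stringr)

# Hypothetical two-line fragment standing in for a scraped statement
snippet <- "inflation has moderated\nThe Committee will monitor"

# Deleting the newline fuses "moderated" and "The" into one nonsense token
fused <- gsub("\n", "", snippet)

# Replacing the newline with a space keeps the words separate
spaced <- gsub("\n", " ", snippet)

# Word counts under the same "\\S+" rule used for numwords
n_original <- str_count(snippet, "\\S+")
n_fused    <- str_count(fused, "\\S+")
n_spaced   <- str_count(spaced, "\\S+")
c(original = n_original, fused = n_fused, spaced = n_spaced)
```

Because `\\S+` already treats a newline as whitespace, the original and space-separated versions give the same count, while the fused version loses one word per deleted newline.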
Nonetheless, I export the dataframe as a binary object below.
saveRDS(d4, file = "fomc_corrected_data_v1.rds")