Business Context:

Major online media channels increasingly rely on algorithms to retrieve breaking news and to classify it into categories. Google News and Apple News are leading examples of this type of algorithmic news curation. More broadly, automatic topic assignment is increasingly being used on SEC filings, investor reports, online reviews, legal documents, executive interview transcripts, and many other sources of business text.

In this analysis, I examine news articles from CNBC, a major American business news channel. In 2006, CNBC relaunched CNBC.com as its main digital news outlet. The news content on the website is edited 24 hours a day during the business week, and the site documents almost everything in U.S. financial markets that deserves investor attention. As on most traditional news websites, CNBC visitors navigate pre-defined categories, such as ECONOMY or FINANCE, to find content of interest. Accurate assignment of articles to categories is therefore important. For instance, a reader may skip an article about Tesla or Uber that appears in the AUTO section, even though they would have read it had it appeared in the TECHNOLOGY section.

Data Source:

I analyzed a database of more than 20,000 news articles. The topic model in Part 1 is fit on the first 10,000 of them.

Code Proper:

Part 0: Data Loading

First, I load the libraries and the data needed for this analysis.

library(dplyr)
library(ggplot2)
library(readr)
library(tm)
library(topicmodels) 
library(stringr)
library(wordcloud)
news = read_csv("C:/Users/fzy20/Downloads/NewsArticles.csv")

Part 1: Topic Modelling for existing news data archive

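To perform topic modelling, I first constructed a document-term matrix (DTM) from the news article dataset. With a sparsity threshold of .988, only terms that appear in more than roughly 1.2% of the documents are kept.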

#Build a corpus from the first 10,000 articles
reviews = VCorpus(VectorSource(news[1:10000,]$content))
#Remove irrelevant information
reviews <- tm_map(reviews, removePunctuation)
reviews <- tm_map(reviews, removeNumbers)
reviews <- tm_map(reviews, content_transformer(removeWords), stopwords("SMART"), lazy=TRUE)  
reviews <- tm_map(reviews, content_transformer(tolower), lazy=TRUE) 
reviews <- tm_map(reviews, content_transformer(removeWords), c("til")) 
reviews <- tm_map(reviews, stripWhitespace)
#DTM transformation
dtm = DocumentTermMatrix(reviews)
dtms = removeSparseTerms(dtm, .988)
dtm_matrix <- as.matrix(dtms)
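
To make the sparsity cutoff concrete, the sketch below reproduces the filtering rule on a toy count matrix in base R. It assumes the rule documented for `removeSparseTerms`: a term is kept only if the share of documents that do not contain it is below the threshold. The matrix `toy`, its terms, and the 0.7 threshold are invented for illustration and are not part of the analysis.

```r
# Toy document-term counts: 5 documents (rows) x 3 terms (columns).
# All values here are made up for illustration.
toy <- matrix(c(1, 0, 2,
                0, 0, 1,
                3, 0, 0,
                1, 0, 1,
                2, 1, 0),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("doc", 1:5), c("market", "rare", "tax")))

# Hypothetical threshold: drop terms missing from 70% or more of documents
sparse_threshold <- 0.7

# Number of documents in which each term occurs at least once
doc_frequency <- colSums(toy > 0)

# Keep a term only if it occurs in more than (1 - threshold) of the documents
keep <- doc_frequency > nrow(toy) * (1 - sparse_threshold)
toy_filtered <- toy[, keep, drop = FALSE]

colnames(toy_filtered)
```

Here "rare" occurs in only one of five documents, so it is dropped, while "market" and "tax" survive.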

I then dropped the documents that were left with no terms after filtering, since LDA cannot be fit on empty documents

#Drop empty documents
non_empty = rowSums(dtm_matrix) != 0
dtm_matrix = dtm_matrix[non_empty,]

I modeled the data with LDA, fitting 10 topics by Gibbs sampling

#Modelling
ldaOut <- LDA(dtm_matrix, 10, method="Gibbs",control=list(seed=10))

Here are the top 10 terms for each topic:

#Printing out terms
terms =terms(ldaOut,10)
terms
##       Topic 1   Topic 2      Topic 3    Topic 4      Topic 5     
##  [1,] "percent" "president"  "billion"  "business"   "money"     
##  [2,] "year"    "tax"        "company"  "companies"  "financial" 
##  [3,] "growth"  "house"      "percent"  "oil"        "people"    
##  [4,] "rate"    "obama"      "million"  "read"       "years"     
##  [5,] "home"    "state"      "year"     "industry"   "pay"       
##  [6,] "month"   "read"       "share"    "prices"     "dont"      
##  [7,] "prices"  "bill"       "sales"    "years"      "make"      
##  [8,] "sales"   "government" "revenue"  "energy"     "read"      
##  [9,] "market"  "white"      "shares"   "price"      "work"      
## [10,] "week"    "states"     "earnings" "businesses" "retirement"
##       Topic 6       Topic 7   Topic 8      Topic 9      Topic 10   
##  [1,] "read"        "read"    "market"     "company"    "people"   
##  [2,] "information" "york"    "investors"  "apple"      "health"   
##  [3,] "bank"        "city"    "fed"        "online"     "plans"    
##  [4,] "public"      "million" "fund"       "companies"  "insurance"
##  [5,] "security"    "news"    "markets"    "technology" "read"     
##  [6,] "case"        "car"     "funds"      "time"       "states"   
##  [7,] "banks"       "friday"  "stocks"     "read"       "care"     
##  [8,] "comment"     "day"     "rates"      "service"    "federal"  
##  [9,] "court"       "told"    "investment" "google"     "million"  
## [10,] "credit"      "cars"    "year"       "media"      "year"

Broadly, I mapped each topic's terms to the names in the next code chunk:

Topics = c("Topic 1", "Topic 2", "Topic 3", "Topic 4", "Topic 5",
          "Topic 6","Topic 7","Topic 8","Topic 9","Topic 10")
Names = c("Macro Trends","Politics","Company Revenue","Commodity","Community Finance","Legal","Retail","Investing","Technology","Healthcare")
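
One way to keep the topic labels and the chosen names together is a named vector, so a name can be looked up by its topic label. The sketch below copies `Topics` and `Names` from the chunk above; `topic_lookup` is a hypothetical helper name introduced only for illustration.

```r
# Copies of the two vectors defined above
Topics <- paste("Topic", 1:10)
Names <- c("Macro Trends", "Politics", "Company Revenue", "Commodity",
           "Community Finance", "Legal", "Retail", "Investing",
           "Technology", "Healthcare")

# Named vector: the topic label is the key, the readable name is the value
topic_lookup <- setNames(Names, Topics)

# Look up the name chosen for a given topic
topic_lookup[["Topic 9"]]
```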

Wordcloud representation of topics

Politics

layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(1, 4))
plot.new()
text(x=0.5, y=0.5, Names[2],cex = 6, col = 28)
#Ten descending weights, one per term, so higher-ranked terms are drawn larger
wordcloud(terms[,2], seq(from = 30, to = 12, by = -2), min.freq = 1)

Company Revenue

layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(1, 4))
plot.new()
text(x=0.5, y=0.5, Names[3],cex = 6, col = 28)
wordcloud(terms[,3], seq(from = 30, to = 12, by = -2), min.freq = 1)

Commodity

layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(1, 4))
plot.new()
text(x=0.5, y=0.5, Names[4],cex = 6, col = 28)
wordcloud(terms[,4], seq(from = 30, to = 12, by = -2), min.freq = 1)

Community Finance

layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(1, 4))
plot.new()
text(x=0.5, y=0.5, Names[5],cex = 6, col = 28)
wordcloud(terms[,5], seq(from = 30, to = 12, by = -2), min.freq = 1)

Retail

layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(1, 4))
plot.new()
text(x=0.5, y=0.5, Names[7],cex = 6, col = 28)
wordcloud(terms[,7], seq(from = 30, to = 12, by = -2), min.freq = 1)

Investing

layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(1, 4))
plot.new()
text(x=0.5, y=0.5, Names[8],cex = 6, col = 28)
wordcloud(terms[,8], seq(from = 30, to = 12, by = -2), min.freq = 1)

Technology

layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(1, 4))
plot.new()
text(x=0.5, y=0.5, Names[9],cex = 6, col = 28)
wordcloud(terms[,9], seq(from = 30, to = 12, by = -2), min.freq = 1)

Healthcare

layout(matrix(c(1, 2), nrow=2), heights=c(1, 4))
par(mar=rep(1, 4))
plot.new()
text(x=0.5, y=0.5, Names[10],cex = 6, col = 28)
wordcloud(terms[,10], seq(from = 30, to = 12, by = -2), min.freq = 1)
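
The wordcloud chunks above repeat the same five calls with only the topic index changing, so they could be wrapped in a small function. The sketch below is one possible version, not the code used to produce the figures. `plot_topic_cloud` is a hypothetical name, the title size is arbitrary, and the weights are an assumption: one descending weight per word, so the top-ranked term is drawn largest. The example words are the Topic 2 terms printed earlier.

```r
library(wordcloud)

# Hypothetical helper: draw a title panel above a wordcloud of ranked terms
plot_topic_cloud <- function(words, title) {
  # One weight per word, descending, so the first word gets the largest weight
  weights <- rev(seq_along(words))
  layout(matrix(c(1, 2), nrow = 2), heights = c(1, 4))
  par(mar = rep(1, 4))
  plot.new()
  text(x = 0.5, y = 0.5, title, cex = 3)
  wordcloud(words, weights, min.freq = 1)
  invisible(weights)
}

# Top terms of Topic 2 as printed above
politics_terms <- c("president", "tax", "house", "obama", "state",
                    "read", "bill", "government", "white", "states")
plot_topic_cloud(politics_terms, "Politics")
```

With the fitted model in scope, the same helper could be called as `plot_topic_cloud(terms[,2], Names[2])`.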

Part 2: Webscraping

Using rvest, I extracted about 100 links from the CNBC website to run the clustering on

library(rvest)
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
## 
##     guess_encoding
link_list = c()
for(i in 1:3){
  page1 <- read_html(paste("https://www.cnbc.com/us-news/?page=",i,sep=""))
  link_list = c(link_list,page1 %>% html_nodes("a") %>% html_attr('href'))
}

I cleaned up the links, keeping only unique article URLs so that the article text can be retrieved later

#Filter out irrelevant links
url_list = unique(link_list[sapply(link_list, function(x) str_detect(x, "^https://www.cnbc.com/2019/05"))])
url_list_clean = url_list[!is.na(url_list)]
#Links are already absolute, so no home URL needs to be appended
url_list_complete = url_list_clean
#Test
url_list_complete[2]
## [1] "https://www.cnbc.com/2019/05/16/architect-im-pei-dies-at-age-102.html"
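
The effect of the filter can be checked offline on a few hand-written links. The sketch below uses base R `grepl` instead of `str_detect` and escapes the dots in the pattern; the vector `toy_links` is invented for illustration, and the April URL in it is hypothetical.

```r
# Hand-written sample of scraped hrefs (the April URL is hypothetical)
toy_links <- c(
  "https://www.cnbc.com/2019/05/16/architect-im-pei-dies-at-age-102.html",
  "https://www.cnbc.com/2019/05/16/architect-im-pei-dies-at-age-102.html",
  "/us-news/",
  NA,
  "https://www.cnbc.com/2019/04/30/example-april-story.html"
)

# Keep only non-missing links that point to May 2019 articles
is_article <- !is.na(toy_links) &
  grepl("^https://www\\.cnbc\\.com/2019/05/", toy_links)

# Drop duplicates, as in the chunk above
toy_urls <- unique(toy_links[is_article])
toy_urls
```

Only the single May 2019 article URL remains: the duplicate, the relative link, the missing value, and the April link are all removed.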

I extracted the text from each retrieved link and collapsed it into a single string per article. I did not clean the text thoroughly here, since it is cleaned later when the DTM is constructed.

#Text Extraction and Aggregation (each page is downloaded once)
text = sapply(url_list_complete, function(x) {
  paragraphs = read_html(x) %>% html_nodes("p") %>% html_text()
  paste(paragraphs[14:(length(paragraphs) - 6)], collapse = " ")
})
text[1]
##  https://www.cnbc.com/2019/05/16/haven-the-amazon-led-health-venture-just-lost-coo-jack-stoddard.html
## "Haven, the health joint venture formed last year by Amazon, Berkshire Hathaway and J.P. Morgan, has lost Chief Operating Officer Jack Stoddard just nine months into his new role. Stoddard, who was most recently general manger for digital health at Comcast, confirmed to CNBC on Thursday that he departed Haven for personal reasons, including the length of his commute from his home in Philadelphia to Haven's headquarters in Boston. Stoddard was Haven's second hire in 2018, following  Atul Gawande, the renowned author and surgeon who was named CEO last June.   \"Jack played an important role in the early stages of Haven, but we understand his decision to leave the company for family reasons,\" a spokesperson from Haven said in a statement. \"We want to thank him for all of his contributions.\" Stoddard officially ended his tenure at Haven last week, and no replacement has been named. Losing such a key executive so early in the process could be a big setback for Haven, which has laid out an ambitious effort to bring down health-care costs, starting with the combined 1.2 million employees at the three companies. The entity, which is set up as a nonprofit, was named Haven in March and at the time had about a dozen people.  Other key executives include Chief Technology Officer Serkan Kutan, formerly of ZocDoc, and Dana Gelb Safran, who Haven hired from Blue Cross Blue Shield in Massachusetts to run analytics projects. Prior to joining Haven, Stoddard worked at Accolade, a health-technology start-up, and at Comcast, the parent company of CNBC. He said the the frequent travel to Boston — five hours by car and around 90 minutes by air each way — was taking him away from his family.  \"I wish Atul and the group the best,\" he said. With or without Stoddard, Haven faces some stiff challenges. 
Before coming up with a name or bringing a product to market, the group found itself deep in a hiring dispute with insurance company UnitedHealth Group, which sued  former executive David Smith for stealing trade secrets and taking them to Haven.  Stoddard interviewed Smith for the Haven job. Previously, Stoddard was part of the executive team that created  Optum, which was later acquired by UnitedHealth. According to court testimony, Stoddard indicated that Haven planned to make health care easier to understand, less expensive and ideally produce better outcomes for employees, which could be a competitive threat to incumbents. Haven's website says its other focuses include Improving the process of navigating the complex health-care system and helping with access to affordable treatments and prescription drugs. WATCH: This is the man in charge of changing health care"
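
The extraction above keeps paragraphs 14 through n - 6 of each page. In R, `14:(n - 6)` counts downward when a page has fewer than 20 paragraphs, which would silently select the wrong paragraphs. The sketch below shows one way to guard against that. `trim_paragraphs`, `skip_head`, and `skip_tail` are hypothetical names, and returning an empty string for short pages is an assumption rather than the behavior of the original code.

```r
# Hypothetical helper: drop the first `skip_head` and last `skip_tail`
# paragraphs, then collapse the remainder into one string
trim_paragraphs <- function(paragraphs, skip_head = 13, skip_tail = 6) {
  n <- length(paragraphs)
  # Too few paragraphs to trim: return an empty string instead of
  # letting the index sequence run backwards
  if (n <= skip_head + skip_tail) {
    return("")
  }
  paste(paragraphs[(skip_head + 1):(n - skip_tail)], collapse = " ")
}

# A 25-paragraph page keeps paragraphs 14 to 19
long_page <- trim_paragraphs(as.character(1:25))

# A 10-paragraph page is too short and yields an empty string
short_page <- trim_paragraphs(as.character(1:10))
```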

Part 3: Topic Classification

Now I run the model on all of the scraped text. Each article is first transformed into its own document-term matrix. Irrelevant content, such as punctuation and stray slashes, is removed, and only words that exist in the dictionary are kept. The topic with the highest posterior probability for each article is stored in a vector.

#Construct dictionary
dic = Terms(dtms)

# Passing this dictionary when building the DTM for each new article restricts that DTM to the words that also appeared in the archive. 'ldaOut' is the topic model fitted in Part 1.

#Initialize List
topic_list = c()
#Loop Through Text
for(i in text){
  #Build a one-document corpus
  reviews = VCorpus(VectorSource(i))
  #Remove irrelevant information
  reviews <- tm_map(reviews, removePunctuation)
  reviews <- tm_map(reviews, removeNumbers)
  reviews <- tm_map(reviews, content_transformer(removeWords), stopwords("SMART"), lazy=TRUE)
  reviews <- tm_map(reviews, content_transformer(tolower), lazy=TRUE)
  reviews <- tm_map(reviews, content_transformer(removeWords), c("til"))
  reviews <- tm_map(reviews, stripWhitespace)
  #Restrict the DTM to dictionary terms
  new_dtm = DocumentTermMatrix(reviews, control=list(dictionary = dic))
  new_dtm = new_dtm[rowSums(as.matrix(new_dtm))!=0,]
  #Classification
  topic_probabilities = posterior(ldaOut, new_dtm)
  #Topic Extraction: keep the name of the most probable topic
  topic_list = c(topic_list, Names[which.max(topic_probabilities$topics)])
}
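
The final step of the loop, picking the most probable topic for each article, can be illustrated on a toy posterior matrix in base R. The probabilities and the three-name vector below are invented for illustration; in the analysis the matrix comes from `posterior(ldaOut, new_dtm)$topics` and has one column per topic.

```r
# Toy posterior: 2 articles (rows) x 3 topics (columns), each row sums to 1
toy_posterior <- matrix(c(0.1, 0.7, 0.2,
                          0.5, 0.3, 0.2),
                        nrow = 2, byrow = TRUE)

# Hypothetical topic names, one per column
toy_names <- c("Politics", "Technology", "Healthcare")

# Index of the largest probability in each row
top_topic <- apply(toy_posterior, 1, which.max)

# Map the indices to readable names
toy_assigned <- toy_names[top_topic]
toy_assigned
```

The first article is assigned "Technology" (probability 0.7) and the second "Politics" (probability 0.5).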

Construct a final data frame with the text, title, and classified topic of each article:

#Initialize Data Frame
final_frame = data.frame("text"=text,"topic"=topic_list)
#Extract Article Name (the pattern targets /2019/04/, so the May 2019 URLs keep their full prefix)
final_frame$title = sapply(url_list_clean, function(x) gsub("/2019/04/[0-9][0-9]/","",x))
#Clean Relevant Links
final_frame$title = sapply(final_frame$title, function(x) gsub("video","",x))
final_frame$title = sapply(final_frame$title, function(x) gsub("html","",x))
final_frame$title = sapply(final_frame$title, function(x) gsub("-"," ",x))
row.names(final_frame) = c()
head(final_frame[,1:3],n = 1)
##   text
## 1 Haven, the health joint venture formed last year by Amazon, Berkshire Hathaway and J.P. Morgan, has lost Chief Operating Officer Jack Stoddard just nine months into his new role. Stoddard, who was most recently general manger for digital health at Comcast, confirmed to CNBC on Thursday that he departed Haven for personal reasons, including the length of his commute from his home in Philadelphia to Haven's headquarters in Boston. Stoddard was Haven's second hire in 2018, following  Atul Gawande, the renowned author and surgeon who was named CEO last June.   "Jack played an important role in the early stages of Haven, but we understand his decision to leave the company for family reasons," a spokesperson from Haven said in a statement. "We want to thank him for all of his contributions." Stoddard officially ended his tenure at Haven last week, and no replacement has been named. Losing such a key executive so early in the process could be a big setback for Haven, which has laid out an ambitious effort to bring down health-care costs, starting with the combined 1.2 million employees at the three companies. The entity, which is set up as a nonprofit, was named Haven in March and at the time had about a dozen people.  Other key executives include Chief Technology Officer Serkan Kutan, formerly of ZocDoc, and Dana Gelb Safran, who Haven hired from Blue Cross Blue Shield in Massachusetts to run analytics projects. Prior to joining Haven, Stoddard worked at Accolade, a health-technology start-up, and at Comcast, the parent company of CNBC. He said the the frequent travel to Boston — five hours by car and around 90 minutes by air each way — was taking him away from his family.  "I wish Atul and the group the best," he said. With or without Stoddard, Haven faces some stiff challenges. 
Before coming up with a name or bringing a product to market, the group found itself deep in a hiring dispute with insurance company UnitedHealth Group, which sued  former executive David Smith for stealing trade secrets and taking them to Haven.  Stoddard interviewed Smith for the Haven job. Previously, Stoddard was part of the executive team that created  Optum, which was later acquired by UnitedHealth. According to court testimony, Stoddard indicated that Haven planned to make health care easier to understand, less expensive and ideally produce better outcomes for employees, which could be a competitive threat to incumbents. Haven's website says its other focuses include Improving the process of navigating the complex health-care system and helping with access to affordable treatments and prescription drugs. WATCH: This is the man in charge of changing health care
##        topic
## 1 Healthcare
##                                                                                              title
## 1 https://www.cnbc.com/2019/05/16/haven the amazon led health venture just lost coo jack stoddard.
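
A readable title can also be derived from an article URL without depending on the publication month. The sketch below is an alternative in base R, not the code that produced the output above: it takes the last path segment with `basename`, strips the `.html` extension, and replaces hyphens with spaces. The variable names are hypothetical; the URL is the one printed in Part 2.

```r
# Article URL taken from the scraping output in Part 2
url <- "https://www.cnbc.com/2019/05/16/architect-im-pei-dies-at-age-102.html"

# Last path segment, without the .html extension
slug <- sub("\\.html$", "", basename(url))

# Hyphens become spaces to give a readable title
title <- gsub("-", " ", slug)
title
```

Because `basename`, `sub`, and `gsub` are vectorized, the same three steps work on a whole vector of URLs without `sapply`.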

Cluster Results

Healthcare

final_frame[final_frame['topic'] == "Healthcare",3]
## [1] "https://www.cnbc.com/2019/05/16/haven the amazon led health venture just lost coo jack stoddard."                
## [2] "https://www.cnbc.com/2019/05/15/alphabet funds verve to protect people from heart disease."                      
## [3] "https://www.cnbc.com/2019/05/15/jj analyst day tries to focus on drug business with talc baby powder under fire."
## [4] "https://www.cnbc.com/2019/05/14/lawsuit alleges gilead propped up cost of hiv drug."                             
## [5] "https://www.cnbc.com/2019/05/14/sugary drink sales fall 38percent after philadelphia levied soda tax study."

Company Revenue

final_frame[final_frame['topic'] == "Company Revenue",3]
##  [1] "https://www.cnbc.com/2019/05/16/starbucks china challenger luckin coffee will likely price ipo at high end of range or above."
##  [2] "https://www.cnbc.com/2019/05/16/nvidia earnings q1 2020."                                                                     
##  [3] "https://www.cnbc.com/2019/05/16/pinterest reports q1 2019 earnings."                                                          
##  [4] "https://www.cnbc.com/2019/05/16/read president donald trumps financial disclosure report."                                    
##  [5] "https://www.cnbc.com/2019/05/16/cisco pops on strong revenue guidance."                                                       
##  [6] "https://www.cnbc.com/2019/05/16/former apple retail boss angela ahrendts reportedly joins airbnbs board."                     
##  [7] "https://www.cnbc.com/2019/05/16/stocks making the biggest moves premarket walmart pfizer pge sony tesla more."                
##  [8] "https://www.cnbc.com/2019/05/16/futures lower as us china trade tensions return."                                             
##  [9] "https://www.cnbc.com/2019/05/15/wework has q1 loss and says investors should see losses as investments."                      
## [10] "https://www.cnbc.com/2019/05/15/warren buffetts berkshire hathaway reveals 900 million amazon stake."                         
## [11] "https://www.cnbc.com/2019/05/15/cisco earnings q3 2019."                                                                      
## [12] "https://www.cnbc.com/2019/05/15/jp morgans tusa ge is not telling the whole story about power."                               
## [13] "https://www.cnbc.com/2019/05/14/bank of america cisco big tech stock to buy for low china trade risk."                        
## [14] "https://www.cnbc.com/2019/05/14/stocks making the biggest moves premarket volkswagen amazon facebook cvs more."

Politics

final_frame[final_frame['topic'] == "Politics",3]
##  [1] "https://www.cnbc.com/2019/05/16/watch live trump lays out new immigration plan in rose garden speech."              
##  [2] "https://www.cnbc.com/2019/05/16/trump says hope not when asked if us will go to war with iran."                     
##  [3] "https://www.cnbc.com/2019/05/16/maryland official calls for alabama divestment citing abortion bill."               
##  [4] "https://www.cnbc.com/2019/05/16/jimmy carter released from hospital three days after breaking hip."                 
##  [5] "https://www.cnbc.com/2019/05/16/california governor pushes new taxes fees despite states 21 billion surplus."       
##  [6] "https://www.cnbc.com/2019/05/15/alabama gov kay ivey signs nations most restrictive abortion law."                  
##  [7] "https://www.cnbc.com/2019/05/15/nyc mayor bill de blasio will enter the 2020 presidential race nbc news."           
##  [8] "https://www.cnbc.com/2019/05/15/trump to meet with south koreas moon amid stalled north korea talks."               
##  [9] "https://www.cnbc.com/2019/05/15/trump immigration plan puts emphasis on skills education over family."              
## [10] "https://www.cnbc.com/2019/05/15/joe biden plans first new york fundraising blitz as a 2020 candidate for president."
## [11] "https://www.cnbc.com/2019/05/15/trump lawyer pushes back against house democrats demand for documents."             
## [12] "https://www.cnbc.com/2019/05/15/how donald trump jr made a deal to limit scope of senate intel questions."          
## [13] "https://www.cnbc.com/2019/05/15/trump aide lighthizer to address metals tariffs in canada trade talks."             
## [14] "https://www.cnbc.com/2019/05/15/mnuchin signals he wont comply with subpoena for trump tax returns."                
## [15] "https://www.cnbc.com/2019/05/15/faa expects boeing to submit software fix for 737 max in next week."                
## [16] "https://www.cnbc.com/2019/05/15/trump administration to delay auto tariffs amid trade war."                         
## [17] "https://www.cnbc.com/2019/05/15/alabama lawmakers eyeing roe pass nations strictest abortion ban."                  
## [18] "https://www.cnbc.com/2019/05/15/pelosi to meet with trump trade rep lighthizer about usmca and china."              
## [19] "https://www.cnbc.com/2019/05/14/pete buttigieg shuts down pac as rival 2020 democrats reject pac money."            
## [20] "https://www.cnbc.com/2019/05/14/donald trump jr strikes deal with senate intel after subpoena."                     
## [21] "https://www.cnbc.com/2019/05/14/ca lawmakers considering bill to create state chartered cannabis banks."            
## [22] "https://www.cnbc.com/2019/05/14/mark cuban leaves open possibility of running for president as an independent."     
## [23] "https://www.cnbc.com/2019/05/14/marco rubio says war with iran is up to ayatollah khamenei its all on them."        
## [24] "https://www.cnbc.com/2019/05/14/here is how democratic 2020 contenders will negotiate trade with china."            
## [25] "https://www.cnbc.com/2019/05/14/trump three to four weeks to know if china trade talks successful."

Retail

final_frame[final_frame['topic'] == "Retail",3]
## [1] "https://www.cnbc.com/2019/05/16/architect im pei dies at age 102."                                       
## [2] "https://www.cnbc.com/2019/05/16/trump administration pulls california high speed rail funding."          
## [3] "https://www.cnbc.com/2019/05/16/six months after camp fire survivors struggle to find temporary homes."  
## [4] "https://www.cnbc.com/2019/05/15/ford recalls fusion cars to fix glitch that can cause them to roll away."

Technology

final_frame[final_frame['topic'] == "Technology",3]
##  [1] "https://www.cnbc.com/2019/05/16/china has plenty of ways to get back at us for treatment of huawei."         
##  [2] "https://www.cnbc.com/2019/05/16/nimble pharmacy is working to help local pharmacies deliver drugs."          
##  [3] "https://www.cnbc.com/2019/05/16/samsung galaxy fold new release date set for june according to report."      
##  [4] "https://www.cnbc.com/2019/05/16/samsung galaxy s10 5g launches on verizon."                                  
##  [5] "https://www.cnbc.com/2019/05/16/amazon announces fire 7 tablet price availability and features."             
##  [6] "https://www.cnbc.com/2019/05/16/taco bell wants you to spend the nightat its new palm springs hotel."        
##  [7] "https://www.cnbc.com/2019/05/16/steam game streaming app debuts on iphone and ipad a year after controversy."
##  [8] "https://www.cnbc.com/2019/05/15/musk on starlink internet satellites spacex has sufficient capital."         
##  [9] "https://www.cnbc.com/2019/05/15/north carolina ag sues e cigarette maker juul for downplaying dangers."      
## [10] "https://www.cnbc.com/2019/05/15/google finds security issue with its bluetooth titan security keys."         
## [11] "https://www.cnbc.com/2019/05/15/instacart ceo apoorva mehta were ready for split from whole foods."          
## [12] "https://www.cnbc.com/2019/05/15/apple suppliers foxconn japan display report disappointing results."         
## [13] "https://www.cnbc.com/2019/05/15/indigo ag improving yields with microbes satellite imaging."                 
## [14] "https://www.cnbc.com/2019/05/14/jeff bezos lifts dirt pile to kick of amazon 1point4 billion air hub."       
## [15] "https://www.cnbc.com/2019/05/14/microsoft kirk koenigsbauer windows 10 upgrade cycle drives growth."         
## [16] "https://www.cnbc.com/2019/05/14/alphabet falls after morgan stanley warns of a short term slump."            
## [17] "https://www.cnbc.com/2019/05/14/former google ceo eric schmidt advocated for a search engine in china."      
## [18] "https://www.cnbc.com/2019/05/14/google striking back against amazon with new shopping features."             
## [19] "https://www.cnbc.com/2019/05/14/kai fu lee sinovation ventures retreats from us amid trade dispute."         
## [20] "https://www.cnbc.com/2019/05/14/whole foods has highest prices despite cuts in april bank of america."       
## [21] "https://www.cnbc.com/2019/05/14/how to set up alexa guard on an amazon echo."                                
## [22] "https://www.cnbc.com/2019/05/14/comcast has agreed to sell its stake in hulu in 5 years."                    
## [23] "https://www.cnbc.com/2019/05/14/apple powerbeats pro review."

Commodity

final_frame[final_frame['topic'] == "Commodity",3]
## [1] "https://www.cnbc.com/2019/05/16/cramer trump doesnt care his china policies hurt american businesses."
## [2] "https://www.cnbc.com/2019/05/16/why huaweis problems with the us government have been so bad."        
## [3] "https://www.cnbc.com/2019/05/16/taylor morrison ceo on homebuilding at end of 2018 it was that bad."  
## [4] "https://www.cnbc.com/2019/05/15/as chinas economy slows restaurant brands is bullish long term."      
## [5] "https://www.cnbc.com/2019/05/14/what consumers should buy now as the trade war heats up."             
## [6] "https://www.cnbc.com/2019/05/15/burger kings parent plans to surpass 40000 stores in the next decade."
## [7] "https://www.cnbc.com/2019/05/15/tim hortons is testing beyond meats fake sausage on its menu."        
## [8] "https://www.cnbc.com/2019/05/14/jp morgan us farmers are facing a crisis downgrades deeres stock."

Community Finance

final_frame[final_frame['topic'] == "Community Finance",3]
## [1] "https://www.cnbc.com/2019/05/16/how much money a family of 4 needs to get by in every us state."
## [2] "https://www.cnbc.com/2019/05/15/some college internships pay twice what regular workers earn."  
## [3] "https://www.cnbc.com/2019/05/13/costs have jumped 55 percent in a decade at public colleges."

Investing

final_frame[final_frame['topic'] == "Investing",3]
## [1] "https://www.cnbc.com/2019/05/16/what a 1000 dollar investment in walmart in 2009 would be worth now."                       
## [2] "https://www.cnbc.com/2019/05/16/feds neel kashkari says rate hikes were not called for."                                    
## [3] "https://www.cnbc.com/2019/05/16/china has cut its holdings of us debt to the lowest level in two years."                    
## [4] "https://www.cnbc.com/2019/05/15/china trade dispute drags on trump still playing with the banks money."                     
## [5] "https://www.cnbc.com/2019/05/14/feds esther george says theres no need for an interest rate cut."                           
## [6] "https://www.cnbc.com/2019/05/14/markets send clear signal to us and china on the trade war."                                
## [7] "https://www.cnbc.com/2019/05/14/trump says if fed cuts interest rates us will win trade war it would be game over we win."  
## [8] "https://www.cnbc.com/2019/05/14/investors fear another sharp fall is coming and are scrambling to protect their portfolios."

Categorized = final_frame %>% group_by(topic) %>% summarise(count = n())
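
The per-topic counts can be cross-checked in base R with `table`. The sketch below runs on an invented vector of topic labels; with the real data, the input would be `final_frame$topic`.

```r
# Invented topic assignments for six articles
toy_topics <- c("Politics", "Technology", "Politics",
                "Healthcare", "Politics", "Technology")

# Count articles per topic and sort from most to least frequent
toy_counts <- sort(table(toy_topics), decreasing = TRUE)
toy_counts
```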

Generate Summary Graphics

#Visualize Category Counts
ggplot(aes(x =topic, y = count),data = Categorized) + geom_col() + geom_text(aes(label = count),hjust = 2,size = 5, color = "white") + theme_minimal() + coord_flip()