When the media report stories about the recent outbreak, which term, ‘Wuhan coronavirus’ or ‘COVID-19’, is more likely to be used after the WHO’s announcement of the official name? Which publications prefer each term in their titles?
“Require” the two packages.
require(RJSONIO) # parse JSON responses
require(RCurl)   # make HTTP requests
Following NewsAPI’s instructions (https://newsapi.org/docs/endpoints/everything), we set up the API key and the search query ‘Wuhan coronavirus’.
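The snippets below assume an object ‘api’ that holds your personal NewsAPI key; the value shown here is a hypothetical placeholder.
api <- 'YOUR_API_KEY' # hypothetical placeholder; substitute your own NewsAPI key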
Let’s get the first 100 articles and have a look at the data structure.
date <- '2020-02-17' # earliest publication date to include
search_q <- URLencode("'Wuhan coronavirus'") # percent-encode the query for the URL
url <- paste('https://newsapi.org/v2/everything?q=',search_q,'&from=',date,'&page=1&pageSize=100&apiKey=',api,sep="")
wuhan <- fromJSON(getURL(url)) # fetch and parse the JSON response
str(wuhan$articles[[1]]) # inspect the structure of the first article
## List of 8
## $ source :List of 2
## ..$ id : NULL
## ..$ name: chr "Gizmodo.com"
## $ author : chr "Matt Novak"
## $ title : chr "Chinese Media Declares Hospital Director Dead From COVID-19, Then Alive, Then Dead Again"
## $ description: chr "Liu Zhiming, the 51-year-old director of Wuchang Hospital in Wuhan, China, died at 10:54 AM local time Tuesday "| __truncated__
## $ url : chr "https://gizmodo.com/chinese-media-declares-hospital-director-dead-from-covi-1841757263"
## $ urlToImage : chr "https://i.kinja-img.com/gawker-media/image/upload/c_fill,f_auto,fl_progressive,g_center,h_675,pg_1,q_80,w_1200/"| __truncated__
## $ publishedAt: chr "2020-02-18T12:30:00Z"
## $ content : chr "Liu Zhiming, the 51-year-old director of Wuchang Hospital in Wuhan, China, died at 10:54 AM local time Tuesday "| __truncated__
wuhan$totalResults
## [1] 3178
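At 100 articles per page, fetching every hit would take ceiling(3178/100) = 32 requests, a point we return to when paginating below:
ceiling(wuhan$totalResults / 100) # pages needed at pageSize = 100
## [1] 32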
We are interested in the articles themselves. Let’s extract the relevant fields into a data frame and have a look.
newsdata_df1 <- lapply(wuhan$articles,function(x){ # extract five fields from each article
return(c(x[["source"]][["name"]],x[["author"]],x[["title"]],x[["publishedAt"]],x[["content"]]))
})
newsdata_df1 <- do.call(rbind.data.frame,newsdata_df1)
colnames(newsdata_df1) <- c("source","author","title","publishedAt","content")
newsdata_df1$source[1:10] # First ten names
## [1] Gizmodo.com BBC News BBC News Gizmodo.com Reuters Gizmodo.com
## [7] Cnet.com Youtube.com BBC News Reuters
## 37 Levels: ABC News Al Jazeera English BBC News ... Youtube.com
newsdata_df1$title[1:10] # First ten titles
## [1] Chinese Media Declares Hospital Director Dead From COVID-19, Then Alive, Then Dead Again
## [2] The noted victims of the coronavirus
## [3] Coronavirus: How a misleading map went global
## [4] Global Death Toll from COVID-19 Surpasses 2,000, Though China Maintains Outbreak Slowing Down
## [5] Factbox: Countries evacuating nationals from China coronavirus areas
## [6] Why Are HIV Drugs Being Used to Treat the New Coronavirus?
## [7] Chilling footage emerges of China's deserted streets in coronavirus epicenter - CNET
## [8] 2020-02-19T10:20:50Z
## [9] Coronavirus triggers boom in private jet inquiries
## [10] China's Hubei province reports 132 new coronavirus deaths on Feb. 18
## 93 Levels: 'Everything is a mess': Westerdam passengers in limbo again after passenger testing positive for coronavirus prompts countries to not let them 'fly through' ...
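Note that entry [8] above is actually a ‘publishedAt’ timestamp: when a field such as ‘author’ is NULL, c() silently drops it and the remaining values shift one column to the left. A minimal sketch of a NULL-safe extractor (‘safe_extract’ is a hypothetical helper that replaces NULL with NA) avoids the shift:
safe_extract <- function(x){
  fields <- list(x[["source"]][["name"]],x[["author"]],x[["title"]],x[["publishedAt"]],x[["content"]])
  sapply(fields, function(f) if (is.null(f)) NA else f) # NULL becomes NA, so every article yields five values
}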
It’s time to use the ‘aggregate’ command. For the first 100 ‘Wuhan coronavirus’ articles, let’s count: 1) the number of articles per news source; 2) the number of titles mentioning ‘Wuhan coronavirus’; 3) the number of articles whose content mentions ‘government’.
aggregate(title ~ source,newsdata_df1,length) # Number of articles per source
## source title
## 1 ABC News 2
## 2 Al Jazeera English 1
## 3 BBC News 5
## 4 Boingboing.net 4
## 5 Business Insider 14
## 6 Cnet.com 2
## 7 CNN 1
## 8 Ctvnews.ca 1
## 9 Dailyforex.com 1
## 10 Dezeen.com 1
## 11 Ejinsight.com 1
## 12 Fastcompany.com 3
## 13 Fayerwayer.com 1
## 14 Gizmodo.com 3
## 15 Globalnews.ca 1
## 16 Heise.de 2
## 17 Hipertextual.com 1
## 18 Indiewire.com 1
## 19 Kotaku.com 1
## 20 Ktla.com 1
## 21 Macrumors.com 3
## 22 Marketwatch.com 5
## 23 Mondaq.com 1
## 24 Nature.com 4
## 25 Polygon 1
## 26 Reuters 6
## 27 Silive.com 1
## 28 Spiegel Online 1
## 29 Taiwannews.com.tw 1
## 30 The Times of India 1
## 31 Theatlantic.com 2
## 32 Time 3
## 33 USA Today 2
## 34 Vice News 2
## 35 Xataka.com 2
## 36 Yahoo.com 16
## 37 Youtube.com 2
aggregate(title ~ source,newsdata_df1,function(x){ # Number of titles mentioning 'Wuhan coronavirus'
sum(grepl('Wuhan coronavirus',x, ignore.case = T))
})
## source title
## 1 ABC News 0
## 2 Al Jazeera English 0
## 3 BBC News 0
## 4 Boingboing.net 0
## 5 Business Insider 3
## 6 Cnet.com 0
## 7 CNN 0
## 8 Ctvnews.ca 0
## 9 Dailyforex.com 0
## 10 Dezeen.com 0
## 11 Ejinsight.com 0
## 12 Fastcompany.com 2
## 13 Fayerwayer.com 0
## 14 Gizmodo.com 0
## 15 Globalnews.ca 0
## 16 Heise.de 0
## 17 Hipertextual.com 0
## 18 Indiewire.com 0
## 19 Kotaku.com 0
## 20 Ktla.com 0
## 21 Macrumors.com 0
## 22 Marketwatch.com 0
## 23 Mondaq.com 0
## 24 Nature.com 0
## 25 Polygon 0
## 26 Reuters 0
## 27 Silive.com 0
## 28 Spiegel Online 0
## 29 Taiwannews.com.tw 0
## 30 The Times of India 0
## 31 Theatlantic.com 0
## 32 Time 0
## 33 USA Today 0
## 34 Vice News 0
## 35 Xataka.com 0
## 36 Yahoo.com 2
## 37 Youtube.com 0
aggregate(content ~ source,newsdata_df1,function(x){ # Number of articles whose content mentions 'government'
sum(grepl('government',x, ignore.case = T))
})
## source content
## 1 ABC News 1
## 2 Al Jazeera English 0
## 3 BBC News 0
## 4 Boingboing.net 1
## 5 Business Insider 2
## 6 Cnet.com 0
## 7 CNN 0
## 8 Ctvnews.ca 0
## 9 Dailyforex.com 0
## 10 Dezeen.com 0
## 11 Ejinsight.com 1
## 12 Fastcompany.com 0
## 13 Fayerwayer.com 0
## 14 Gizmodo.com 1
## 15 Globalnews.ca 0
## 16 Heise.de 0
## 17 Hipertextual.com 0
## 18 Indiewire.com 0
## 19 Kotaku.com 0
## 20 Ktla.com 0
## 21 Macrumors.com 0
## 22 Marketwatch.com 0
## 23 Mondaq.com 0
## 24 Nature.com 0
## 25 Polygon 0
## 26 Reuters 0
## 27 Silive.com 0
## 28 Spiegel Online 0
## 29 Taiwannews.com.tw 0
## 30 The Times of India 0
## 31 Theatlantic.com 0
## 32 Time 0
## 33 USA Today 0
## 34 Vice News 1
## 35 Xataka.com 0
## 36 Yahoo.com 2
## 37 Youtube.com 0
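The three counts can also be placed side by side by merging the per-source aggregates on ‘source’ (a sketch; ‘n_art’, ‘n_ttl’, ‘n_gov’ and ‘summary_df’ are hypothetical names, and the duplicate ‘title’ columns are renamed after merging):
n_art <- aggregate(title ~ source, newsdata_df1, length)
n_ttl <- aggregate(title ~ source, newsdata_df1, function(x) sum(grepl('Wuhan coronavirus', x, ignore.case = T)))
n_gov <- aggregate(content ~ source, newsdata_df1, function(x) sum(grepl('government', x, ignore.case = T)))
summary_df <- Reduce(function(a, b) merge(a, b, by = "source"), list(n_art, n_ttl, n_gov))
colnames(summary_df) <- c("source", "n_articles", "title_mentions", "gov_mentions")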
Next, we try searching ‘COVID-19’ and check the total number of hits.
search_q <- URLencode("'COVID-19'")
url <- paste('https://newsapi.org/v2/everything?q=',search_q,'&from=',date,'&page=1&pageSize=100&apiKey=',api,sep="")
covid19 <- fromJSON(getURL(url))
covid19$totalResults
## [1] 7509
newsdata_df2 <- lapply(covid19$articles,function(x){
return(c(x[["source"]][["name"]],x[["author"]],x[["title"]],x[["publishedAt"]],x[["content"]]))
})
newsdata_df2 <- do.call(rbind.data.frame,newsdata_df2)
colnames(newsdata_df2) <- c("source","author","title","publishedAt","content")
Again, for the first 100 ‘COVID-19’ articles, let’s count: 1) the number of articles per news source; 2) the number of titles mentioning ‘COVID-19’; 3) the number of articles whose content mentions ‘government’.
aggregate(title ~ source,newsdata_df2,length) # Number of articles per source
## source title
## 1 ABC News 5
## 2 Al Jazeera English 1
## 3 Androidcentral.com 1
## 4 Ars Technica 1
## 5 BBC News 2
## 6 Bbc.com 1
## 7 Business Insider 8
## 8 CNN 2
## 9 Core77.com 1
## 10 Ctvnews.ca 3
## 11 Dailyforex.com 1
## 12 Ejinsight.com 1
## 13 Engadget 2
## 14 Fastcompany.com 2
## 15 Fayerwayer.com 3
## 16 Futurity.org 1
## 17 Gigazine.net 1
## 18 Gizmodo.com 6
## 19 Gizmodo.jp 2
## 20 Globalnews.ca 3
## 21 Health24.com 1
## 22 Hipertextual.com 2
## 23 Huffpost.com 1
## 24 Investing.com 1
## 25 Kotaku.com 2
## 26 Kottke.org 1
## 27 Ktla.com 1
## 28 Macrumors.com 1
## 29 Marketwatch.com 2
## 30 Muyinteresante.es 1
## 31 Nature.com 2
## 32 New Scientist 1
## 33 Polygon 1
## 34 Reuters 4
## 35 Scroll.in 1
## 36 Silive.com 1
## 37 Spiegel Online 6
## 38 Taiwannews.com.tw 1
## 39 TechCrunch 1
## 40 Theatlantic.com 1
## 41 Time 6
## 42 USA Today 2
## 43 Xataka.com 1
## 44 Xinhua Net 1
## 45 Yahoo.com 5
## 46 Youtube.com 6
aggregate(title ~ source,newsdata_df2,function(x){ # Number of titles mentioning 'COVID-19'
sum(grepl('COVID-19',x, ignore.case = T))
})
## source title
## 1 ABC News 0
## 2 Al Jazeera English 1
## 3 Androidcentral.com 0
## 4 Ars Technica 0
## 5 BBC News 0
## 6 Bbc.com 0
## 7 Business Insider 0
## 8 CNN 0
## 9 Core77.com 1
## 10 Ctvnews.ca 0
## 11 Dailyforex.com 0
## 12 Ejinsight.com 0
## 13 Engadget 0
## 14 Fastcompany.com 0
## 15 Fayerwayer.com 3
## 16 Futurity.org 0
## 17 Gigazine.net 0
## 18 Gizmodo.com 2
## 19 Gizmodo.jp 0
## 20 Globalnews.ca 2
## 21 Health24.com 0
## 22 Hipertextual.com 0
## 23 Huffpost.com 0
## 24 Investing.com 0
## 25 Kotaku.com 0
## 26 Kottke.org 0
## 27 Ktla.com 0
## 28 Macrumors.com 0
## 29 Marketwatch.com 0
## 30 Muyinteresante.es 0
## 31 Nature.com 0
## 32 New Scientist 0
## 33 Polygon 0
## 34 Reuters 0
## 35 Scroll.in 1
## 36 Silive.com 0
## 37 Spiegel Online 2
## 38 Taiwannews.com.tw 0
## 39 TechCrunch 0
## 40 Theatlantic.com 1
## 41 Time 3
## 42 USA Today 0
## 43 Xataka.com 0
## 44 Xinhua Net 0
## 45 Yahoo.com 0
## 46 Youtube.com 0
aggregate(content ~ source,newsdata_df2,function(x){ # Number of articles whose content mentions 'government'
sum(grepl('government',x, ignore.case = T))
})
## source content
## 1 ABC News 2
## 2 Al Jazeera English 0
## 3 Androidcentral.com 0
## 4 Ars Technica 0
## 5 BBC News 0
## 6 Bbc.com 0
## 7 Business Insider 1
## 8 CNN 0
## 9 Core77.com 0
## 10 Ctvnews.ca 0
## 11 Dailyforex.com 0
## 12 Ejinsight.com 0
## 13 Engadget 0
## 14 Fastcompany.com 0
## 15 Fayerwayer.com 0
## 16 Futurity.org 0
## 17 Gigazine.net 0
## 18 Gizmodo.com 2
## 19 Gizmodo.jp 0
## 20 Globalnews.ca 0
## 21 Health24.com 0
## 22 Hipertextual.com 0
## 23 Huffpost.com 0
## 24 Investing.com 0
## 25 Kotaku.com 0
## 26 Kottke.org 0
## 27 Ktla.com 0
## 28 Macrumors.com 0
## 29 Marketwatch.com 0
## 30 Muyinteresante.es 0
## 31 Nature.com 0
## 32 New Scientist 0
## 33 Polygon 0
## 34 Reuters 1
## 35 Scroll.in 0
## 36 Silive.com 0
## 37 Spiegel Online 0
## 38 Taiwannews.com.tw 0
## 39 TechCrunch 0
## 40 Theatlantic.com 0
## 41 Time 0
## 42 USA Today 0
## 43 Xataka.com 0
## 44 Xinhua Net 0
## 45 Yahoo.com 0
## 46 Youtube.com 0
Why don’t we create a function to read multiple pages of data? We define a function called GetAllData.
## Create a function to read all news stories for a search term
GetAllData <- function(ndata,sterm){
  max_page <- 10 # maximum number of pages to fetch
  npage <- ceiling(ndata$totalResults/100) # total number of pages needed at pageSize = 100
  npage <- ifelse(npage >= max_page,max_page,npage) # cap at max_page
  output_data <- c()
  for (p in 1:npage){
    search_q <- URLencode(sterm)
    url <- paste('https://newsapi.org/v2/everything?q=',search_q,'&from=',date,'&page=',p,'&pageSize=100&apiKey=',api,sep="")
    res_df <- fromJSON(getURL(url))
    if (!is.atomic(res_df)){ # skip pages where the API returns an error string instead of a list
      data_df <- lapply(res_df$articles,function(x){
        return(c(x[["source"]][["name"]],x[["author"]],x[["title"]],x[["publishedAt"]],x[["content"]]))
      })
      data_df <- do.call(rbind.data.frame,data_df)
      colnames(data_df) <- c("source","author","title","publishedAt","content")
      output_data <- rbind(output_data,data_df)
      Sys.sleep(15) # pause between requests to stay within the API rate limit
    }
  }
  return(output_data)
}
Ready to run. We fetch all pages for both search terms, then rerun the aggregate command to count title mentions by news source, sorted in descending order of the counts.
Allnewsdata1_df <- GetAllData(wuhan,'"Wuhan coronavirus"')
Allnewsdata2_df <- GetAllData(covid19,'"COVID-19"')
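A quick sanity check on the sizes (each call retrieves at most 10 pages of 100 articles):
nrow(Allnewsdata1_df) # articles retrieved for 'Wuhan coronavirus'
nrow(Allnewsdata2_df) # articles retrieved for 'COVID-19'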
d1 <- aggregate(title ~ source,Allnewsdata1_df,function(x){ # Number of titles mentioning 'Wuhan coronavirus'
sum(grepl('Wuhan coronavirus',x, ignore.case = T))
})
d2 <- aggregate(title ~ source,Allnewsdata2_df,function(x){ # Number of titles mentioning 'COVID-19'
sum(grepl('COVID-19',x, ignore.case = T))
})
d1[order(d1$title,decreasing=T),] # Display in descending order
## source title
## 6 Business Insider 4
## 12 Fastcompany.com 4
## 29 Naturalnews.com 4
## 45 Yahoo.com 3
## 2 Bizjournals.com 1
## 7 Businessinsider.com.au 1
## 21 Johnnyjet.com 1
## 28 Nationalinterest.org 1
## 1 Aei.org 0
## 3 Boingboing.net 0
## 4 Breitbart News 0
## 5 Bruegel.org 0
## 8 Cinemablend.com 0
## 9 Coed.com 0
## 10 Crypto Coins News 0
## 11 Dailymail.co.uk 0
## 13 Forbes.com 0
## 14 Fxstreet.com 0
## 15 Gizmodo.com.au 0
## 16 Governmentslaves.news 0
## 17 Imore.com 0
## 18 Independent 0
## 19 Investmentwatchblog.com 0
## 20 Iphoneincanada.ca 0
## 22 Komonews.com 0
## 23 Legalinsurrection.com 0
## 24 Leiphone.com 0
## 25 Macrumors.com 0
## 26 Mactrast.com 0
## 27 Meneame.net 0
## 30 Nature.com 0
## 31 Observador.pt 0
## 32 Ozbargain.com.au 0
## 33 Phys.org 0
## 34 Realclearpolitics.com 0
## 35 Salon.com 0
## 36 Seekingalpha.com 0
## 37 Shtfplan.com 0
## 38 Siliconangle.com 0
## 39 Taiwannews.com.tw 0
## 40 Thegatewaypundit.com 0
## 41 Thehealthsite.com 0
## 42 Theonlinecitizen.com 0
## 43 Vice News 0
## 44 Wattsupwiththat.com 0
## 46 Youtube.com 0
d2[order(d2$title,decreasing=T),] # Display in descending order
## source title
## 15 Fayerwayer.com 3
## 41 Time 3
## 18 Gizmodo.com 2
## 20 Globalnews.ca 2
## 37 Spiegel Online 2
## 2 Al Jazeera English 1
## 9 Core77.com 1
## 35 Scroll.in 1
## 40 Theatlantic.com 1
## 1 ABC News 0
## 3 Androidcentral.com 0
## 4 Ars Technica 0
## 5 BBC News 0
## 6 Bbc.com 0
## 7 Business Insider 0
## 8 CNN 0
## 10 Ctvnews.ca 0
## 11 Dailyforex.com 0
## 12 Ejinsight.com 0
## 13 Engadget 0
## 14 Fastcompany.com 0
## 16 Futurity.org 0
## 17 Gigazine.net 0
## 19 Gizmodo.jp 0
## 21 Health24.com 0
## 22 Hipertextual.com 0
## 23 Huffpost.com 0
## 24 Investing.com 0
## 25 Kotaku.com 0
## 26 Kottke.org 0
## 27 Ktla.com 0
## 28 Macrumors.com 0
## 29 Marketwatch.com 0
## 30 Muyinteresante.es 0
## 31 Nature.com 0
## 32 New Scientist 0
## 33 Polygon 0
## 34 Reuters 0
## 36 Silive.com 0
## 38 Taiwannews.com.tw 0
## 39 TechCrunch 0
## 42 USA Today 0
## 43 Xataka.com 0
## 44 Xinhua Net 0
## 45 Yahoo.com 0
## 46 Youtube.com 0
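Coming back to the opening question, the total title mentions across the retrieved pages can be compared directly, a quick sketch using the aggregates above:
sum(d1$title) # total titles mentioning 'Wuhan coronavirus'
## [1] 19
sum(d2$title) # total titles mentioning 'COVID-19'
## [1] 16
Together with the totalResults figures (3,178 hits for ‘Wuhan coronavirus’ versus 7,509 for ‘COVID-19’), this suggests that overall coverage already favours the official name, while title usage of the two terms remains close in these samples.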