In light of the ongoing issues involving the French drug maker Sanofi’s Dengvaxia vaccine, I made this sentiment analysis because I wanted to dig deeper into how different media outlets are choosing their words to describe the incident. I believe that the words used in the articles shape the opinions of readers. Second, we would like to know who the different persons involved in the scandal are, and how often they are mentioned in the news articles found on the respective pages of our news sources.
I have scraped articles from the following media news pages: “ABS-CBN.com”, “BBC”, “CNBC”, “CNNPhilippines”, “Channel News Asia”, “FiercePharma”, “Financial Times”, “France_24”, “GMAnetwork.com”, “Independent.co.uk”, “Inquirer.net”, “Interaksyon.com”, “Manila Times”, “MedicalXpress.com”, “Philstar.com”, “Sunstar.com.ph”, “The Guardian UK”, “hurriyetdailynews.com”, “labiotech.eu”, “marketwatch.com”, “nytimes.com”, “rappler.com”, “reuters.com”, “untvweb.com”, “news.mb.com.ph”, “statnews.com”, “dailymail.co.uk”.
Data Collection
The articles span from November 30, 2017 (when the issue erupted) until December 11, 2017 (the first Senate hearing). The articles were scraped in the order they appeared in the first three pages of Google search results for the keywords “dengvaxia Philippines”.
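For reference, below is a minimal sketch of how a single article could be pulled with the rvest package. The URL and the CSS selectors are placeholders only, since every outlet needs its own selectors, so treat this as an illustration rather than the exact collection script.
suppressMessages(library(rvest))
url  <- "https://www.example-news-site.com/dengvaxia-article"   # hypothetical URL
page <- read_html(url)
head_text <- page %>% html_node("h1") %>% html_text(trim = TRUE)            # article heading
body_text <- page %>% html_nodes("article p") %>% html_text(trim = TRUE) %>%
  paste(collapse = " ")                                                      # full article text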
By doing this analysis, we can look at the issue from a different angle and describe how media outlets are framing their news reports. Again, this information is limited to what can be found on the internet through Google searches.
Hopefully, we can surface the most important keywords and know what to expect as the Senate probe into the issue continues.
We have the following variables: Date - the date when the article was written and released; Source - the news media source; Head - the heading of the article; Text - the complete body text of the article.
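As a quick sketch, this is how the dataset could be loaded into R. The file name here is hypothetical, and the date conversion assumes the column was stored as text or datetime in the spreadsheet.
suppressMessages(library(readxl))
suppressMessages(library(dplyr))
sanopi <- read_excel("dengvaxia_articles.xlsx") %>%   # hypothetical file name
  mutate(date = as.Date(date))
glimpse(sanopi)   # columns used below: date, source, outlet, head, text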
Our plot below shows the articles released from November 30 until December 11, 2017.
ggplot(sanopi, aes(date)) +
geom_bar(fill = "blue") +
ggtitle("Articles By-Day")
We can see that many articles were released on December 11th, the day of the Senate hearing. This is natural considering that the “Philippines Dengvaxia scam” was the hottest and most urgent topic at the time. It is also expected that many readers will be hungry for information, especially since not many can watch the live Senate hearing because of work or school. We can expect these articles to receive particular attention for about two days and then drop off once the issue goes cold.
ggplot(sanopi, aes(outlet)) +
geom_bar(fill = "blue") +
ggtitle("Articles by Outlet-count")
Our data has more local outlets than foreign outlets. Again, I am only basing my search on the first three pages of Google search results for the keywords “dengvaxia Philippines”.
In the next step, we will use the magic of tidytext and create word clouds of the most common words in our text and head variables.
#convert source and outlet variable to factor format
sanopi <- sanopi %>%
mutate(source = as.factor(source), outlet = as.factor(outlet))
We start by using tidytext’s unnest_tokens function, which drops punctuation and transforms all words into lower case. In addition, tidytext contains a dictionary of stop words, like “and” or “next”, which we remove with an anti_join.
a1 <- sanopi %>% unnest_tokens(word, text)
a1 <- a1 %>%
anti_join(stop_words, by = "word")
a2 <- sanopi %>% unnest_tokens(word, head)
a2 <- a2 %>%
anti_join(stop_words, by = "word")
Word cloud of the most common words across all source texts.
a1 %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100, color = c("#9E0142", "#D53E4F" ,"black")))
Fig. 1
Our most common words are dengue, followed by dengvaxia, program, health, sanofi, Philippines, and children. By looking at our word cloud, we can expect these keywords in articles relating to the Dengvaxia scam. The words can also be used when carrying out further research on the topic.
Just as I mentioned above, we can see some key persons of interest: former president Aquino, Senator Gordon, DOH Secretary Francisco Duque, and former DOH secretaries Garin and Ubial.
We can also quickly find the French drug maker Sanofi and the years 2015 and 2016. These relate to articles about the period when the negotiations between Sanofi officials and previous Philippine government officials took place, and when the vaccine was purchased and then rolled out in the health program.
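Since we also want to know how often these persons of interest are mentioned, here is a small sketch that counts their surnames in the tokenized data frame a1. The name list is my own choice, and matching on surnames alone is a simplification (unnest_tokens has already lower-cased the words).
poi <- c("aquino", "gordon", "duque", "garin", "ubial", "leachon")
a1 %>%
  filter(word %in% poi) %>%
  count(word, sort = TRUE)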
Word cloud of the most common words across all source headings.
a2 %>%
count(word) %>%
with(wordcloud(word, n, max.words = 20, color = c("green4", "orange4", "black")))
Fig. 1
For our headlines, the top keywords are dengue, dengvaxia, vaccine, sanofi, and Philippines. Persons of interest include former DOH secretaries Garin and Ubial, and present DOH Secretary Duque, along with the word children. This is interesting since many children have received the vaccine. Luckily, my son was not able to get the vaccine. It was a blessing in disguise from above, noting that our vaccination schedule was delayed in some parts of Cebu. Let’s hope that for the other children who were vaccinated, things work out fine in the end.
We can check the most common words in our article texts by making a bar plot.
createBarPlotCommonWords = function(sanopi,title)
{
sanopi %>%
unnest_tokens(word, text) %>%
filter(!word %in% stop_words$word) %>%
count(word,sort = TRUE) %>%
ungroup() %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
head(10) %>%
ggplot(aes(x = word,y = n)) +
geom_bar(stat='identity',colour="white" , fill = "orange2") +
geom_text(aes(x = word, y = 1, label = paste0("(",n,")",sep="")),
hjust=0, vjust=.5, size = 4, colour = 'black',
fontface = 'bold') +
labs(x = 'Word', y = 'Word Count',
title = title) +
coord_flip() +
theme_1()
}
createBarPlotCommonWords(sanopi ,'Top 10 most Common Words')
Our plot above shows the counts of the top keywords in our article texts. We can find the word dengue used 599 times and sanofi 351 times, which means these words were central to the articles released by our news sources. We can also find children, many of whom were vaccinated with Dengvaxia.
Next, I want to check how many articles we have in our news sources.
sanopi %>%
ggplot(aes(source, fill = source)) +
geom_bar() +
theme(legend.position = "none") + theme_1()
Again, this is based on the first three pages of Google search results for the keywords “dengvaxia Philippines”. From the data I collected, there are more articles from Philstar.com than from the other local outlets. This is followed by Inquirer.net, with ABS-CBN, CNNPhilippines, and rappler.com tied at 6 articles each.
suppressMessages(library(ggridges))
sanopi %>%
mutate(sen_len = str_length(text)) %>%
ggplot(aes(sen_len, source, fill = source)) +
geom_density_ridges() +
scale_x_log10() +
theme(legend.position = "none") +
labs(x = "Article length [# characters]")
## Picking joint bandwidth of 0.0807
Our plot above shows the distribution of article length for each news source.
In terms of article length, CNNPhilippines, rappler.com, and news.mb.com.ph are similar, while Philstar.com and Interaksyon.com tend to publish slightly longer articles overall. Inquirer.net is notable for having both very short and very long articles, while ABS-CBN.com publishes the shortest articles.
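If we want exact figures rather than densities, a quick summary of article length per source (a sketch that complements the ridgeline plot above) could be:
sanopi %>%
  mutate(sen_len = str_length(text)) %>%          # article length in characters
  group_by(source) %>%
  summarise(articles = n(), median_chars = median(sen_len)) %>%
  arrange(desc(median_chars))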
Below, we look at the most common words in the articles of our top individual sources. By filtering the most common words for each news source, we can get an idea of how each one is covering the Dengvaxia scam.
Common words in rappler.com articles.
a1 %>%
filter(source == "rappler.com") %>%
count(word) %>%
with(wordcloud(word, n, max.words = 30, color = "blue4"))
Fig. 4
From rappler.com, we can see common keywords such as sanofi, vaccine, dengue, and dengvaxia. We can also see the year 2016, when the vaccination operation started. By knowing these keywords, we can get an idea of how Rappler is covering the issue. We can also see DOH Secretary Duque and former DOH secretaries Garin and Ubial, so we can say that their articles cover these persons of interest.
Common words in Philstar.com articles.
a1 %>%
filter(source == "Philstar.com") %>%
count(word) %>%
with(wordcloud(word, n, max.words = 30, color = "red4"))
Fig. 4
From our top local outlet Philstar.com, we can see keywords similar to rappler.com’s. Among the ones not mentioned above, we can see billion, referring to the cost of the vaccination program made by the previous administration. Their articles also refer to children, virus, the month of April, and fever.
Common words in Inquirer.net articles.
a1 %>%
filter(source == "Inquirer.net") %>%
count(word) %>%
with(wordcloud(word, n, max.words = 30, color = "grey2"))
Fig. 4
From our other top local outlet, Inquirer.net, we can see similar keywords, but notably also the words stop, fda, and people.
Next, we will look at the frequencies of the overall most popular words split by source, to see how each outlet’s usage compares to the rest.
rex <- a1 %>%
group_by(word, source) %>%
count()
bar <- a1 %>%
group_by(word) %>%
count() %>%
rename(all = n)
rex %>%
left_join(bar, by = "word") %>%
arrange(desc(all)) %>%
head(80) %>%
ungroup() %>%
ggplot(aes(reorder(word, all, FUN = min), n, fill = source)) +
#ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
facet_wrap(~ source) +
theme(legend.position = "none")
Fig. 4
We can see that four words are used frequently across almost all sources: “dengue”, “vaccine”, “sanofi”, and “health”.
We can dive deeper by directly comparing the relative frequency of word use between sources. In this example, we will compare Philstar.com and Inquirer.net to CNNPhilippines. We also require at least 10 occurrences per source for a word to be included:
suppressMessages(library(scales))
frequency <-a1 %>%
count(source, word) %>%
filter(n > 1e1) %>%
group_by(source) %>%
mutate(freq = n / sum(n)) %>%
select(-n) %>%
spread(source, freq) %>%
gather(source, freq, Philstar.com, Inquirer.net) %>%
filter(!is.na(freq) & !is.na(source))
ggplot(frequency, aes(freq, CNNPhilippines , color = abs(CNNPhilippines - freq))) +
geom_abline(color = "gray40", lty = 2) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.1, height = 0.1) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
#scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray95") +
facet_wrap(~source, ncol = 2) +
theme(legend.position="none") +
labs(y = "CNN Philippines", x = NULL)
Fig. 5
In these plots, words that are close to the dashed line (of equal frequency) have similar frequencies in the two news sources being compared. Words that are further along a particular news source axis (such as dengue for CNNPhilippines vs Philstar.com) are more frequent for that source. The colour scale indicates how different the CNNPhilippines frequency is from the frequency in the compared source. The (slightly jittered) points in the background represent the complete set of (high-frequency) words, whereas the displayed words have been chosen to avoid overlap.
We want to measure positive and negative sentiment by using polarity scores from the qdap, tm, and tidytext packages. We now want to check how the words separate along these sentiments and whether any of our persons of interest fall on the negative side.
san_pol <- polarity(removePunctuation(removeNumbers(tolower(sanopi$text[1:67]))))
san_pol
## all total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 all 67 39092 -0.459 0.323 -1.422
spol.df <-san_pol$all #make dataframe
which.min(spol.df$polarity)
## [1] 27
which.max(spol.df$polarity)
## [1] 24
suppressMessages(library(magrittr))
#create our pol_subsections function
pol_subsections <- function(df) {
x.pos <- subset(df$text, df$polarity > 0)
x.neg <- subset(df$text, df$polarity < 0)
x.pos <- paste(x.pos, collapse = " ")
x.neg <- paste(x.neg, collapse = " ")
all.terms <- c(x.pos, x.neg)
return(all.terms)
}
#At this point we have omitted the neutral sentences and want to organize the remaining text. We use the %>% operator to forward objects to functions, and after some simple cleaning we use comparison.cloud() to make the visual.
# Add scores to each document line in a data frame
sa_df <- san_pol$all %>%
select(text = text.var, polarity = polarity)
# Custom function
P.all_terms <- pol_subsections(sa_df)
# Make a corpus
P.all_corpus <- P.all_terms %>%
VectorSource() %>%
VCorpus()
# Basic TDM
P.all_tdm <- TermDocumentMatrix(
P.all_corpus,
control = list(
removePunctuation = TRUE,
stopwords = stopwords(kind = "en")
)
) %>%
as.matrix() %>%
set_colnames(c("positive", "negative"))
#then finally make our comparison cloud
comparison.cloud(
P.all_tdm,
max.words = 50,
colors = c("grey2", "darkred")
)
Our plot above shows the positive and negative words based on the polarity scores. First, on the positive side, persons of interest Garin, Francisco Duque, and Ona are scored positively, as are the words congress, doh, and budget. But the most obvious is Dr. Leachon. I believe this comes from the article on untvweb.com where he discusses the Dengvaxia vaccine and how he advised against using it. As a health advocate, he tried to take the matter to Congress, only to be told to shut up by some congressmen. At that time, he already knew that the vaccine was still in its clinical trial stage and could not be applied on a large scale. The Philippines was, at the time, the only country that approved using the vaccine as part of its government health program. Right from the beginning, he already exposed the controversial nature of the vaccine made by the French pharmaceutical company Sanofi.
For the negative scores, we can see sanofi, drug, severe, and dengue. I believe this is really bad PR for the French company Sanofi. We can also see that the words children, previously, told, and probe are included.
Next, in light of the Dengvaxia scam, we want to know whether the keywords are positive or negative. In a way, we want to measure whether the articles we are reading carry bad news or good news.
par <- sanopi %>% select(source, text)
# Create a corpus from the text column
vec_corpus <- Corpus(VectorSource(par$text))
#data cleaning, make our clean_corpus function
clean_corpus <- function(corpus){
corpus <- tm_map(corpus, content_transformer(replace_abbreviation))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
return(corpus)
}
clean_corp <- clean_corpus(vec_corpus)
sn_dtm <- DocumentTermMatrix(clean_corp)
#sa_mat <- as.matrix(sn_dtm)
# Tidy up the DTM
san_tidy <- tidy(sn_dtm)
# Get Bing lexicon
bing <- get_sentiments("bing")
# Join text to lexicon
ag_bing_words <- inner_join(san_tidy, bing, by = c("term" = "word"))
# Examine
#ag_bing_words
# Get counts by sentiment
ag_bing_words %>%
count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 266
## 2 positive 135
##
# Inner join
san_sent <- inner_join(san_tidy, bing, by = c("term" = "word"))
# Tidy sentiment calculation
san_tidy_sentiment <- san_sent %>%
count(term, sentiment, wt = count) %>%
spread(sentiment, n, fill = 0) %>%
mutate(polarity = positive - negative)
# Subset
san_tidy_small <- san_tidy_sentiment %>%
filter(abs(polarity) >= 10)
# Add polarity
san_tidy_pol <- san_tidy_small %>%
mutate(
pol = ifelse(polarity > 0, "positive", "negative")
)
Our summary above tells us that there are more negative words than positive ones (266 vs. 135), implying that these articles are indeed mostly bad news.
# Plot
ggplot(
san_tidy_pol,
aes(reorder(term, polarity), polarity, fill = pol)
) +
geom_bar(stat = "identity") +
ggtitle("All Sources Sentiment Word Frequency") +
theme(axis.text.x = element_text(angle = 90, vjust = -0.1)) + theme_1()+ ylab("") + xlab("Words")
Our plot above shows the positive and negative words across all our news text sources. Our top positive words are recommended, refund, and recommendations. On the other hand, the top negative words are severe, infected, and virus. In addition, since we have more negative words, most of these articles carry bad news; we can also see died, death, and controversial. As much as Sanofi tries to deny it, articles containing “died” or “death” in connection with their vaccine have already been released. Worse, the word “controversial” has also been used, detailing the anomalous transaction made with Sanofi. This is indeed bad PR for them.
We wish to find the words that are most characteristic of each news source.
TF stands for term frequency: essentially, how often a word appears in the text. This is what we measured above. A list of stop words can be used to filter out frequent words that likely have no impact on the question we want to answer (e.g. “and” or “the”). However, using stop words is not always an elegant approach. IDF to the rescue.
IDF means inverse document frequency. Here, we give more emphasis to words that are rare within a collection of documents (which in our case means the entire text data).
Both measures can be combined into TF-IDF, a heuristic index telling us how frequent a word is in a certain context (here: a certain source) within the context of a larger collection of documents (here: all sources). You can understand it as a normalisation of the relative term frequency by the overall document frequency. This will lead to words standing out that are characteristic of a specific source, which is pretty much what we want to achieve here.
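To make these definitions concrete, here is a toy example (the two “documents” A and B and their counts are invented purely for illustration): tf is a word’s count divided by the total words in its document, idf is the natural log of the number of documents divided by the number of documents containing the word, and tf-idf is their product.
toy <- tibble::tibble(
  source = c("A", "A", "A", "B", "B"),
  word   = c("dengue", "vaccine", "senate", "dengue", "refund"),
  n      = c(5, 3, 2, 4, 1)
)
n_docs <- n_distinct(toy$source)        # 2 documents
toy %>%
  group_by(source) %>%
  mutate(tf = n / sum(n)) %>%           # term frequency within each document
  group_by(word) %>%
  mutate(idf = log(n_docs / n())) %>%   # total docs / docs containing the word
  ungroup() %>%
  mutate(tf_idf = tf * idf)
Here “dengue” appears in both documents, so its idf is log(2/2) = 0 and its tf-idf is 0, while words unique to one source, such as “senate” or “refund”, stand out.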
The tidytext package includes the function bind_tf_idf to extract these metrics from a tidy data set that contains words and their counts per source:
frequency <-a1 %>%
count(source, word)
tf_idf <- frequency %>%
bind_tf_idf(word, source, n)
These are the most characteristic words for each news source:
tf_idf %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
top_n(20, tf_idf) %>%
ggplot(aes(word, tf_idf, fill = source)) +
geom_col() +
labs(x = NULL, y = "TF-IDF values") +
theme(legend.position = "top", axis.text.x = element_text(angle=45, hjust=1, vjust=0.9))
We find:
On untvweb.com, we can find Dr. Leachon’s story of how he advised the previous DOH secretaries against using the Dengvaxia vaccine. We can find words such as “sta”, “ana”, “immunised”, and “programme” in the France 24 article about Brazil’s use of the Dengvaxia vaccination, a similar government program run by that country. We can also see Senator Hontiveros appearing in a Sunstar.com.ph article.
Our plot below shows the same information as a bar plot.
trainWords <- sanopi %>%
unnest_tokens(word, text) %>%
count(source, word, sort = TRUE) %>%
ungroup()
total_words <- trainWords %>%
group_by(source) %>%
summarize(total = sum(n))
trainWords <- left_join(trainWords, total_words)
#Now we are ready to use the bind_tf_idf which computes the tf-idf for each term.
trainWords <- trainWords %>%
filter(!is.na(source)) %>%
bind_tf_idf(word, source, n)
plot_trainWords <- trainWords %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word))))
plot_trainWords %>%
top_n(20) %>%
ggplot(aes(word, tf_idf)) +
geom_col(fill = "orange2") +
labs(x = NULL, y = "tf-idf") +
coord_flip() +
theme_bw()
To conclude:
Regarding the volume of articles released, we found that the most articles were released on December 11, 2017, the day of the Senate hearing on the Dengvaxia scam. We can expect at least 2 to 3 more days of high-volume coverage of the topic before it cools down.
The longest news articles are from Interaksyon.com and Philstar.com; we can expect articles of similar length when reading them. ABS-CBN.com has the shortest. If you want quick tidbits of information, then I believe you can choose it.
So far, I have collected more local outlets compared to foreign outlets.
The top most common words are dengue, followed by dengvaxia, program, health, sanofi, Philippines, and children. We can expect these keywords in articles relating to the Dengvaxia scam, and they can also be used when carrying out further research on the topic.
We can see some key persons of interest: former president Aquino, Senator Gordon, DOH Secretary Francisco Duque, and former DOH secretaries Garin and Ubial. As the issue continues, we can expect stories involving these different persons. From time to time, Dr. Leachon’s name may also appear.
We can see coverage of the French drug maker Sanofi in a controversial situation. Ever since the issue erupted, the company’s stock has dropped, putting the company in a worse light. We expect to continue hearing stories from their side.
We can see that many children were really put at risk. We can expect to continue hearing stories of severe infection and, worse, death involving children who were vaccinated.
The Senate probe will continue. We expect Senator Gordon and present DOH Secretary Duque to continue the investigation into the Dengvaxia scam.
The raw sanopi dataset is saved as an xlsx file on my computer. If you want a copy to play around with, you can send me your email and I will send it to you. For the moment, I am continuing to collect articles every day and updating my dataset. My analysis of the Dengvaxia scam will continue until the issue cools down.