Unsupervised machine learning uses algorithms to learn patterns in a dataset and represent the underlying structure of the data without labeled examples. Sentiment analysis applies this idea to text: it lets analysts identify positive and negative phrases within text data.
This document analyzes the sentiment of five articles, one from each of five different Connecticut newspapers. The goal of the analysis is to rank the articles from most positive to least positive.
knitr::opts_chunk$set(echo = TRUE)

# Clean the workspace, set the working directory, and fix the random seed
rm(list = ls())
setwd("~/Desktop/ML unsupervised")
set.seed(12345)
options(digits = 3, scipen = 9999, width = 120, knitr.table.format = "rst")

# Text mining, topic modeling, scraping, and clustering packages
pacman::p_load(tokenizers, stopwords, dplyr, ggplot2, ggthemes, tidytext,
               qdap, tm, lda, topicmodels, ggrepel, tidyverse, wordcloud,
               wordcloud2, skmeans, clue, cluster, knitr, fpc, ldatuning,
               wakefield, gridExtra, quanteda, readtext, RColorBrewer, tfse,
               textclean, markovchain, stm, rvest, gutenbergr)

# Sentiment lexicons, pairwise counts, and plotting helpers
library(SentimentAnalysis)
library(sentimentr)
library(widyr)
library(FactoMineR)
library(factoextra)
library(scales)
library(magrittr)
library(cowplot)
library(plotly)
Pulling the data from the Connecticut Post article and converting it to text:
# Scrape the Connecticut Post article and keep only the paragraph text
newspaper1 <- read_html("https://www.ctpost.com/news/article/Connecticut-issues-guidelines-for-proms-16092624.php?src=rdctpdensecp")
post <- newspaper1 %>% html_nodes("p") %>% html_text()
post1 <- data.frame(text = post)

# Tokenize into one word per row and inspect word frequencies
post2 <- post1 %>% unnest_tokens(word, text)
post2 %>% count(word, sort = TRUE)
post2 %>% anti_join(stop_words, by = "word")  # preview without stop words (result not stored)
head(post2)

# Add a running token number so sentiment can be indexed through the article
post3 <- post2 %>% mutate(linenumber = row_number())
post3 %>% count(word, sort = TRUE)
Using the corpus function, it is clear that there are more positive phrases than negative phrases in this article.
Using the NRC lexicon and a basic plot (the blue histogram), we can see which sentiments are used most and how the different emotions occur throughout the article (a sketch of this computation follows the tables below). This article mostly shows positivity and anticipation.
## index negative positive sentiment
## 1 1 1 0 -1
## 2 2 1 2 1
## 3 3 1 0 -1
## 4 4 0 1 1
## 5 6 0 2 2
## index sentiment n
## 1 1 anticipation 2
## 2 1 fear 1
## 3 1 negative 1
## 4 1 positive 1
## 5 1 sadness 1
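The chunk that produced the two tables above is not echoed. The following is a minimal sketch of how similar tables could be reproduced with tidytext; the bing lexicon for the positive/negative table, the nrc lexicon for the emotion counts, and the 10-token index width are all assumptions, since the original code is hidden.

# Net sentiment per block of roughly 10 tokens (bing lexicon assumed)
post_bing <- post3 %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(index = linenumber %/% 10, sentiment) %>%
  tidyr::pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
head(post_bing)

# Emotion counts per index (nrc lexicon assumed)
post_nrc <- post3 %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  count(index = linenumber %/% 10, sentiment)
head(post_nrc)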
## 6 2 anger 1
Pulling the data from the Hartford Courant article and converting it to text:
# Same pipeline for the Hartford Courant article
newspaper2 <- read_html("https://www.courant.com/politics/hc-pol-fairfield-protest-housing-legislation-20210410-jpxtxb4genbnppu2hn2g65ng74-story.html")
hart <- newspaper2 %>% html_nodes("p") %>% html_text()
hart1 <- data.frame(text = hart)
hart2 <- hart1 %>% unnest_tokens(word, text)
hart2 %>% count(word, sort = TRUE)
hart2 %>% anti_join(stop_words, by = "word")
hart3 <- hart2 %>% mutate(linenumber = row_number())
hart3 %>% count(word, sort = TRUE)
Using the corpus function, this article is more evenly split between positive and negative phrases than the Connecticut Post article, though it still contains more positive phrases than negative ones.
Using the NRC lexicon and a basic plot (the blue histogram, sketched after the tables below), we can see which sentiments are used most and how the emotions occur throughout the article. This article shows nearly equal amounts of positivity and negativity along with some trust, which could mean it delivers bad news of some kind but also poses a solution to keep the tone positive.
## index negative positive sentiment
## 1 0 0 7 7
## 2 1 0 10 10
## 3 2 0 1 1
## 4 3 2 1 -1
## 5 4 7 0 -7
## 6 5 2 4 2
## index sentiment n
## 1 0 positive 2
## 2 2 anger 2
## 3 2 negative 2
## 4 2 positive 4
## 5 2 trust 5
## 6 3 anger 1
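The blue histogram referred to above also comes from a hidden chunk. A hedged sketch of one way to draw it for the Hartford Courant tokens, assuming the nrc lexicon and a ggplot2 bar chart:

# Overall frequency of each nrc sentiment in the article (sketch)
hart_nrc <- hart3 %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  count(sentiment, sort = TRUE)

ggplot(hart_nrc, aes(x = reorder(sentiment, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = "nrc sentiment", y = "word count",
       title = "Sentiment frequencies: Hartford Courant article")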
Pulling the data from the Stamford Advocate article and converting it to text:
# Same pipeline for the Stamford Advocate article
newspaper3 <- read_html("https://www.stamfordadvocate.com/living/slideshow/Here-s-how-much-it-could-cost-to-bring-your-dog-220018.php?src=sthpdesecp")
stam <- newspaper3 %>% html_nodes("p") %>% html_text()
stam1 <- data.frame(text = stam)
stam2 <- stam1 %>% unnest_tokens(word, text)
stam2 %>% count(word, sort = TRUE)
stam2 %>% anti_join(stop_words, by = "word")
stam3 <- stam2 %>% mutate(linenumber = row_number())
stam3 %>% count(word, sort = TRUE)
Using the corpus function (one way this check could be reproduced is sketched after the tables below), positive and negative phrases are used about equally throughout the article.
Using the NRC lexicon and a basic plot (the blue histogram), we can see which sentiments are used most and how the emotions occur throughout the article. This article shows a lot of anticipation, trust, and positivity.
## index negative positive sentiment
## 1 1 1 1 0
## 2 2 1 1 0
## 3 4 0 1 1
## 4 5 1 0 -1
## 5 6 1 1 0
## index sentiment n
## 1 1 anticipation 2
## 2 1 joy 9
## 3 1 positive 1
## 4 1 trust 1
## 5 2 joy 2
## 6 2 positive 3
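The corpus-based positive/negative comparison is likewise computed in a hidden chunk. One possible way to reproduce it uses the SentimentAnalysis package loaded above; the choice of analyzeSentiment() and the GI dictionary are assumptions, not the author's confirmed method.

# Dictionary-based sentiment for each scraped paragraph (sketch)
stam_sent <- analyzeSentiment(stam)
# Count how many paragraphs lean negative, neutral, or positive
table(convertToDirection(stam_sent$SentimentGI))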
Pulling the data from the New Haven Register article and converting it to text:
# Same pipeline for the New Haven Register article
newspaper4 <- read_html("https://www.nhregister.com/news/coronavirus/article/Web-engineers-develop-shortcuts-to-find-COVID-16091041.php?src=nhrhpdesecp")
new <- newspaper4 %>% html_nodes("p") %>% html_text()
new1 <- data.frame(text = new)
new2 <- new1 %>% unnest_tokens(word, text)
new2 %>% count(word, sort = TRUE)
new2 %>% anti_join(stop_words, by = "word")
new3 <- new2 %>% mutate(linenumber = row_number())
new3 %>% count(word, sort = TRUE)
Using the corpus function, positive phrases are used much more often than negative phrases in this article.
Using the NRC lexicon and a basic plot (the blue histogram), we can see which sentiments are used most and how the emotions occur throughout the article. This article shows a lot of positivity and trust.
## index negative positive sentiment
## 1 2 1 9 8
## 2 3 1 0 -1
## 3 4 4 1 -3
## 4 5 0 1 1
## 5 7 2 0 -2
## 6 8 0 4 4
## index sentiment n
## 1 0 anticipation 2
## 2 0 fear 1
## 3 0 positive 3
## 4 0 surprise 1
## 5 2 anticipation 1
## 6 2 joy 1
Pulling the data from The Day article and converting it to text:
# Same pipeline for The Day article
newspaper5 <- read_html("https://www.theday.com/local-news/20210410/pandemics-toll-on-restaurants-entertainment-venues-tough-to-pin-down")
day <- newspaper5 %>% html_nodes("p") %>% html_text()
day1 <- data.frame(text = day)
day2 <- day1 %>% unnest_tokens(word, text)
day2 %>% count(word, sort = TRUE)
day2 %>% anti_join(stop_words, by = "word")
day3 <- day2 %>% mutate(linenumber = row_number())
day3 %>% count(word, sort = TRUE)
Using the corpus function, both positive and negative phrases appear throughout the article, and more negative phrases are used here than in the other four articles.
Using the NRC lexicon and a basic plot (the blue histogram), we can see which sentiments are used most and how the emotions occur throughout the article. This article shows a lot of positivity and trust.
## index negative positive sentiment
## 1 3 0 2 2
## 2 4 0 3 3
## 3 5 1 1 0
## 4 6 0 1 1
## 5 7 7 0 -7
## 6 8 0 4 4
## index sentiment n
## 1 1 positive 3
## 2 1 trust 1
## 3 3 anticipation 1
## 4 3 joy 1
## 5 3 positive 2
## 6 3 sadness 1
Based on the positive-to-negative sentiment ratio from the NRC lexicon analysis, the ranking from most positive to least positive article is: The Day (red), New Haven Register (green), Connecticut Post (blue), Stamford Advocate (purple), and Hartford Courant (orange).
The corpus analysis, however, gives a slightly different outcome. Its ranking from most positive to least positive is: The Day, New Haven Register, Hartford Courant, Connecticut Post, and Stamford Advocate.
The two analyses rank the articles differently because the NRC ranking is based on the ratio of positive to negative sentiment, while the corpus analysis focuses on how much positivity appears throughout each article overall. An article can therefore rank high on the ratio without displaying many positive phrases, simply because the rest of its tone is mostly neutral. A sketch of the ratio calculation is shown below.
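A minimal sketch of the ratio ranking, assuming the five tokenized articles are stacked into one frame and the nrc lexicon supplies the positive and negative labels (the grouping column name is illustrative):

# Combine the five tokenized articles and compute positive-to-negative ratios (sketch)
all_articles <- bind_rows(
  mutate(post3, paper = "Connecticut Post"),
  mutate(hart3, paper = "Hartford Courant"),
  mutate(stam3, paper = "Stamford Advocate"),
  mutate(new3,  paper = "New Haven Register"),
  mutate(day3,  paper = "The Day")
)

all_articles %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(paper, sentiment) %>%
  tidyr::pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(ratio = positive / negative) %>%
  arrange(desc(ratio))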
Note that the echo = FALSE parameter was added to the code chunks that generated the sentiment tables and plots to prevent printing of their R code.