Congratulations, you’ve successfully transitioned from being an NBA ‘quant’ scout to a consultant specializing in US national sentiment! You’ve been hired, in secret, by a non-profit to track the level of national and regional support for the field of Data Science. The goal is to get a general idea of the patterns associated with articles written on the broad topic of Data Science (you can also choose a sub-topic). To do so, your data science team has decided to explore periodicals from around the country in an effort to track relative positive or negative sentiment and word frequencies. Luckily, your team has access to a world-class library search engine called LexisNexis (Nexis Uni) that provides access to newspapers from around the country dating back decades. You’ll first need to decide which words you want to track and what time period might be interesting to begin your search.
You’ll need to select several newspapers from different regions of the country (5 to 6 newspapers should be fine), limiting the search to 100 articles from each paper. Run sentiment analysis with each newspaper serving as a corpus, and compare the level of positive or negative connotation across the results. Also run tf-idf on each corpus (newspaper) and compare the differences between the distributions.
Your main goal (and the goal of all practicing data scientists!) is to translate this information into action. What patterns do you see, and why do you believe this is the case? What additional information might you want? Be as specific as possible, but keep in mind this is an initial exploratory effort; more analysis might be needed, but the results can and should inform the next steps you present to the firm.
Please submit a cleanly knitted HTML file describing in detail the steps you took along the way, the results of your analysis, and, most importantly, the implications/next steps you would recommend. You will report your final results and recommendations next week in class (5 minutes per group).
You will also need to collaborate within your group via a GitHub repo; if you choose, it would be fine to assign 1 or 2 regions/newspapers per group member, which can then be added to the repo. Create a main repo; everyone should work in this repo and submit independently using branching/pull requests. If you want to use pull requests to combine everyone’s work into a final project, please do so, but it’s not a requirement. Select a repo owner who sets up access (push access) for the week; we will rotate owners next week. Also, submit a link to your GitHub repo (every group member can submit the same link).
For our analysis, I have chosen to extract articles from Nexis Uni that include ‘data science’ from five US cities: Philadelphia, New York City, Chicago, Dayton, and Los Angeles.
For the Philadelphia Inquirer, 25 articles were included.
PHLInquirer <- read_lines("Files (25).txt")
PHLInquirer <- tibble(PHLInquirer)
PHLInquirer$PHLInquirer <- as.character(PHLInquirer$PHLInquirer) # coerce into one large character column
PHLInquirer <- PHLInquirer %>%
  unnest_tokens(word, PHLInquirer) %>%  # one word per row
  anti_join(stop_words) %>%             # drop stop words
  count(word, sort = TRUE)              # word frequencies
## Joining, by = "word"
PHLInqAFFIN<- PHLInquirer %>%
inner_join(get_sentiments("afinn"))
## Joining, by = "word"
PHLInqNRC<- PHLInquirer %>%
inner_join(get_sentiments("nrc"))
## Joining, by = "word"
PHLInqBING<- PHLInquirer %>%
inner_join(get_sentiments("bing"))
## Joining, by = "word"
table(PHLInqBING$sentiment)
##
## negative positive
## 196 169
table(PHLInqNRC$sentiment)
##
## anger anticipation disgust fear joy negative
## 74 132 49 111 89 209
## positive sadness surprise trust
## 362 89 54 217
ggplot(data = PHLInqAFFIN, aes(x = value)) +
  geom_histogram() +
  ggtitle("PHL AFINN Sentiment Range") +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
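To complement the histogram with a single number, the AFINN scores can be collapsed into a frequency-weighted mean (a minimal sketch, assuming the PHLInqAFFIN tibble created above, with its n and value columns):
# Frequency-weighted mean AFINN score: above 0 leans positive, below 0 leans negative.
PHLInqAFFIN %>%
  summarize(mean_afinn = sum(value * n) / sum(n))
The same one-liner can be repeated for the other four papers to put them on a common scale.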
40 articles are included from the New York Times.
NYT <- read_lines("NYT 40.txt")
NYT <- tibble(NYT)
NYT$NYT <- as.character(NYT$NYT) # into large character
NYT <- NYT %>%
  unnest_tokens(word, NYT) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
NYTaffin<- NYT %>%
inner_join(get_sentiments("afinn"))
## Joining, by = "word"
NYTnrc<- NYT %>%
inner_join(get_sentiments("nrc"))
## Joining, by = "word"
NYTbing<- NYT %>%
inner_join(get_sentiments("bing"))
## Joining, by = "word"
table(NYTbing$sentiment)
##
## negative positive
## 382 322
table(NYTnrc$sentiment)
##
## anger anticipation disgust fear joy negative
## 156 224 97 210 169 385
## positive sadness surprise trust
## 580 149 107 359
ggplot(data = NYTaffin, aes(x = value)) +
  geom_histogram() +
  ggtitle("NYT AFINN Sentiment Range") +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
40 articles are included from Chicago (the Chicago Daily Herald).
chicago <- read_lines("chicago 40.txt")
chicago <- tibble(chicago)
chicago$chicago <- as.character(chicago$chicago) # into large character
chicago <- chicago %>%
  unnest_tokens(word, chicago) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
chicagoAFINN<- chicago %>%
inner_join(get_sentiments("afinn"))
## Joining, by = "word"
chicagoNRC<- chicago %>%
inner_join(get_sentiments("nrc"))
## Joining, by = "word"
chicagoBING<- chicago %>%
inner_join(get_sentiments("bing"))
## Joining, by = "word"
table(chicagoBING$sentiment)
##
## negative positive
## 160 251
table(chicagoNRC$sentiment)
##
## anger anticipation disgust fear joy negative
## 79 179 44 96 123 174
## positive sadness surprise trust
## 453 81 72 276
ggplot(data = chicagoAFINN, aes(x = value)) +
  geom_histogram() +
  ggtitle("Chicago AFINN Sentiment Range") +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
50 articles are included from Dayton, Ohio.
dayton <- read_lines("dayton 50.txt")
dayton <- tibble(dayton)
dayton$dayton <- as.character(dayton$dayton) # into large character
dayton <- dayton %>%
  unnest_tokens(word, dayton) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
daytonAFINN<- dayton %>%
inner_join(get_sentiments("afinn"))
## Joining, by = "word"
daytonNRC<- dayton %>%
inner_join(get_sentiments("nrc"))
## Joining, by = "word"
daytonBING<- dayton %>%
inner_join(get_sentiments("bing"))
## Joining, by = "word"
table(daytonBING$sentiment)
##
## negative positive
## 208 285
table(daytonNRC$sentiment)
##
## anger anticipation disgust fear joy negative
## 103 179 59 129 137 236
## positive sadness surprise trust
## 526 95 83 303
ggplot(data = daytonAFINN, aes(x = value)) +
  geom_histogram() +
  ggtitle("Dayton AFINN Sentiment Range") +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
7 articles are included from the LA Times.
la_times <- read_lines("la_times7.txt")
la_times <- tibble(la_times)
la_times <- la_times %>%
  unnest_tokens(word, la_times) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
la_sentiment_affin <- la_times %>%
  inner_join(get_sentiments("afinn")) # using an inner join to match words and add the sentiment variable
## Joining, by = "word"
la_sentiment_nrc <- la_times %>%
inner_join(get_sentiments("nrc"))
## Joining, by = "word"
la_sentiment_bing <- la_times %>%
inner_join(get_sentiments("bing"))
## Joining, by = "word"
table(la_sentiment_bing$sentiment)
##
## negative positive
## 126 109
table(la_sentiment_nrc$sentiment)
##
## anger anticipation disgust fear joy negative
## 55 84 38 78 68 142
## positive sadness surprise trust
## 237 64 42 137
ggplot(data = la_sentiment_affin, aes(x = value)) +
  geom_histogram() +
  ggtitle("LA Times AFINN Sentiment Range") +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
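With all five papers processed, the Bing results can be put side by side. This is a minimal sketch, assuming the Bing tibbles created above (each with word, n, and sentiment columns); it converts counts to within-city shares so the different corpus sizes (7 to 50 articles) don’t distort the comparison.
bing_all <- bind_rows(
  PHLInqBING %>% mutate(city = "Philadelphia"),
  NYTbing %>% mutate(city = "NYC"),
  chicagoBING %>% mutate(city = "Chicago"),
  daytonBING %>% mutate(city = "Dayton"),
  la_sentiment_bing %>% mutate(city = "Los Angeles")
)
bing_all %>%
  count(city, sentiment, wt = n) %>%   # total word occurrences per sentiment
  group_by(city) %>%
  mutate(share = n / sum(n)) %>%       # convert counts to within-city shares
  ggplot(aes(x = city, y = share, fill = sentiment)) +
  geom_col() +
  ggtitle("Bing Positive vs. Negative Share by City") +
  theme_minimal()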
# using the ggwordcloud package to plot the 50 most frequent LA Times words
set.seed(42)
ggplot(la_times[1:50, ], aes(label = word, size = n)) +
  geom_text_wordcloud() +
  theme_minimal()
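The NRC emotions can be compared across papers in the same way, again normalized for corpus size (a sketch, assuming the NRC tibbles created above):
nrc_all <- bind_rows(
  PHLInqNRC %>% mutate(city = "Philadelphia"),
  NYTnrc %>% mutate(city = "NYC"),
  chicagoNRC %>% mutate(city = "Chicago"),
  daytonNRC %>% mutate(city = "Dayton"),
  la_sentiment_nrc %>% mutate(city = "Los Angeles")
)
nrc_all %>%
  count(city, sentiment, wt = n) %>%
  group_by(city) %>%
  mutate(share = n / sum(n)) %>%   # share of each emotion within a city's matched words
  ggplot(aes(x = sentiment, y = share, fill = city)) +
  geom_col(position = "dodge") +
  coord_flip() +
  ggtitle("NRC Emotion Shares by City") +
  theme_minimal()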
TF-IDF looks at a word’s frequency within one document relative to how common it is across the whole corpus. For our purposes, we’ll treat each newspaper compilation as a separate document within the corpus.
PHLraw <- as_tibble(read_lines("Files (25).txt"))
NYTraw <- as_tibble(read_lines("NYT 40.txt"))
Chicagoraw <- as_tibble(read_lines("chicago 40.txt"))
Daytonraw <- as_tibble(read_lines("dayton 50.txt"))
la_times_raw <- as_tibble(read_lines("la_times7.txt"))
# Transpose the one-line-per-row tibble and paste columns y through z together,
# so each newspaper ends up as a single row of text.
data_prep <- function(x, y, z) {
  i <- as_tibble(t(x))
  ii <- unite(i, "text", y:z, remove = TRUE, sep = "")
}
PHLprep<- data_prep(PHLraw,'V1','V1738')
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(y)` instead of `y` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(z)` instead of `z` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
NYTprep<- data_prep(NYTraw, 'V1', 'V3085')
Chicagoprep<- data_prep(Chicagoraw, 'V1','V1775')
Daytonprep<- data_prep(Daytonraw, 'V1', 'V2913')
LAprep <- data_prep(la_times_raw, 'V1', 'V707')
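As an aside, the same one-row-per-paper result can be produced without the transpose (and without the name-repair and tidyselect notes above) by collapsing each file’s lines directly. A minimal sketch, assuming the same file paths; note it joins lines with a space rather than with sep = "":
# Collapse a file's lines into a single text string in one step.
collapse_lines <- function(path) {
  tibble(text = paste(read_lines(path), collapse = " "))
}
# e.g. PHLprep <- collapse_lines("Files (25).txt")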
cities <- c("Philadelphia", "NYC", "Chicago", "Dayton", "Los Angeles")
tf_idf_text <- tibble(cities,
                      text = t(tibble(PHLprep,
                                      NYTprep,
                                      Chicagoprep,
                                      Daytonprep,
                                      LAprep, .name_repair = "universal")))
## New names:
## * text -> text...1
## * text -> text...2
## * text -> text...3
## * text -> text...4
## * text -> text...5
class(tf_idf_text)
## [1] "tbl_df" "tbl" "data.frame"
word_count <- tf_idf_text %>%
  unnest_tokens(word, text) %>%
  count(cities, word, sort = TRUE)
total_words <- word_count %>%
  group_by(cities) %>%
  summarize(total = sum(n))
news_words <- left_join(word_count, total_words)
## Joining, by = "cities"
head(news_words,10)
## # A tibble: 10 x 4
## cities word n total
## <chr> <chr> <int> <int>
## 1 NYC the 3153 63977
## 2 Dayton the 1776 42282
## 3 NYC of 1746 63977
## 4 NYC a 1639 63977
## 5 NYC and 1626 63977
## 6 NYC to 1611 63977
## 7 Chicago the 1434 31000
## 8 NYC in 1391 63977
## 9 Dayton to 1269 42282
## 10 Dayton and 1248 42282
#View(news_words)
news_words2 <- news_words %>%
  bind_tf_idf(word, cities, n)
head(news_words2,10)
## # A tibble: 10 x 7
## cities word n total tf idf tf_idf
## <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 NYC the 3153 63977 0.0493 0 0
## 2 Dayton the 1776 42282 0.0420 0 0
## 3 NYC of 1746 63977 0.0273 0 0
## 4 NYC a 1639 63977 0.0256 0 0
## 5 NYC and 1626 63977 0.0254 0 0
## 6 NYC to 1611 63977 0.0252 0 0
## 7 Chicago the 1434 31000 0.0463 0 0
## 8 NYC in 1391 63977 0.0217 0 0
## 9 Dayton to 1269 42282 0.0300 0 0
## 10 Dayton and 1248 42282 0.0295 0 0
#View(news_words2)
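The head() above is dominated by stop words, which bind_tf_idf() correctly zeroes out (their idf is 0 because they appear in every paper). To see which terms actually distinguish each paper, the top tf-idf words can be plotted per city; a minimal sketch, assuming news_words2 from above:
news_words2 %>%
  group_by(cities) %>%
  slice_max(tf_idf, n = 10) %>%   # ten highest tf-idf words per paper
  ungroup() %>%
  ggplot(aes(x = reorder_within(word, tf_idf, cities), y = tf_idf, fill = cities)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~cities, scales = "free") +
  labs(x = NULL, y = "tf-idf") +
  theme_minimal()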
From conducting sentiment analysis on these five US cities using AFINN, Bing, and NRC, we can see the general sentiment of selected publications across the United States and whether they lean positive or negative. All five publications leaned positive according to NRC, but three of the five (Philadelphia, New York, and Los Angeles) leaned slightly negative according to Bing. Looking at the AFINN sentiment ranges, most of the publications we analyzed had a fairly even spread of negative and positive scores and were roughly normally distributed, which leads us to believe the overall sentiment is fairly neutral, without strong negative or positive leanings. According to Bing, AFINN, and NRC, the Chicago Daily Herald is the most positive. The most common NRC emotion (aside from the positive/negative categories) for every publication was “trust,” which makes sense because newspaper editors presumably work to ensure their writing is trustworthy and to provide unbiased news. The least common emotion was “disgust,” which might make sense because data science is largely quantitative and thus unlikely to evoke a strong emotional response such as disgust.