Task 1: Reflection

Free-response questions in surveys can reveal important information in a data set, but it is hard to read every response. Visualizing text is tricky, yet when done properly it can be very revealing. Unlike text, numbers can be analysed easily with regressions and averages. Like every other analysis, qualitative analysis needs to follow the CRAP principles (contrast, repetition, alignment, proximity). It is usually best to avoid word clouds, since they show little information about the data: essentially a one-dimensional display of word frequency. Qualitative data analysis has become a little easier thanks to modern software that helps analyse text-based data.

library(tidyverse)
library(tidytext)
library(gutenbergr)

# Getting data (run once, then read the saved CSV below)

# marktwain_raw <- gutenberg_download(c(8525, 1837, 3178, 119),
#                                     meta_fields = "title")
# write.csv(marktwain_raw, "data.csv")

Loading dataset

marktwain_raw <- read_csv("data.csv")
## New names:
## • `` -> `...1`
## Rows: 60175 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): text, title
## dbl (2): ...1, gutenberg_id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Skip the leading rows, drop empty lines, and number the chapters
mark_twain <- marktwain_raw %>% 
  slice(52:n()) %>% 
  drop_na(text) %>% 
  mutate(chapter_start = str_detect(text, "^CHAPTER"),
         chapter_number = cumsum(chapter_start)) %>% 
  select(-gutenberg_id, -title, -chapter_start)
# One row per word, keeping the book title alongside each word
marktwain_words <- marktwain_raw %>% 
  drop_na(text) %>% 
  unnest_tokens(word, text)

# 15 most frequent non-stop-words in each book
top_words_marktwain <- marktwain_words %>% 
  anti_join(stop_words, by = "word") %>% 
  count(title, word, sort = TRUE) %>% 
  group_by(title) %>% 
  top_n(15, wt = n) %>% 
  ungroup() %>%
  mutate(word = fct_inorder(word))
ggplot(top_words_marktwain, aes(y = fct_rev(word), x = n, fill = title)) + 
  geom_col() + 
  guides(fill = "none") +
  labs(x = "Count", y = NULL, 
       title = "15 most frequent words in four Mark Twain books") +
  facet_wrap(vars(title), scales = "free_y") +
  theme_bw()
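
One caveat with `fct_inorder()` in a faceted plot: a word that appears in more than one book gets a single position in the factor, so it can end up out of order in some panels. A possible alternative is `reorder_within()` together with `scale_y_reordered()` from tidytext. The sketch below uses a small made-up data frame (`toy` and `word_in_title` are hypothetical names, not objects from this analysis) to show the idea.

```r
library(ggplot2)
library(tidytext)

# Made-up counts: both words appear in both "books"
toy <- data.frame(
  title = c("A", "A", "B", "B"),
  word  = c("river", "boy", "boy", "river"),
  n     = c(10, 4, 9, 3)
)

# Build one factor level per (word, title) pair, ordered by n
toy$word_in_title <- reorder_within(toy$word, toy$n, toy$title)

# scale_y_reordered() strips the title suffix from the axis labels
p <- ggplot(toy, aes(x = n, y = word_in_title, fill = title)) +
  geom_col() +
  guides(fill = "none") +
  scale_y_reordered() +
  facet_wrap(vars(title), scales = "free_y")
```

Printing `p` should draw each panel sorted by its own counts.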

# Word counts per book with stop words removed (input for tf-idf)
marktwain_words2 <- marktwain_raw %>% 
  drop_na(text) %>% 
  unnest_tokens(word, text) %>% 
  anti_join(stop_words, by = "word") %>% 
  count(title, word, sort = TRUE)


marktwain_tf_idf <- marktwain_words2 %>% 
  bind_tf_idf(word, title, n)
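
To see what the tf-idf columns mean, here is a minimal base R sketch on made-up counts. It assumes the usual definitions that `bind_tf_idf()` follows: tf is a word's share of all words in its document, and idf is the natural log of the number of documents divided by the number of documents containing the word. The data frame `counts` and its values are invented for illustration.

```r
# Toy counts: one row per (document, word) pair, made up for illustration
counts <- data.frame(
  doc  = c("A", "A", "B", "B"),
  word = c("river", "boy", "river", "king"),
  n    = c(2, 2, 1, 3)
)

n_docs <- length(unique(counts$doc))

# tf: share of a document's words taken up by this word
counts$tf <- counts$n / ave(counts$n, counts$doc, FUN = sum)

# idf: natural log of (number of documents / documents containing the word)
# Each (doc, word) pair appears once, so rows per word = documents per word
docs_with_word <- ave(counts$n, counts$word, FUN = length)
counts$idf <- log(n_docs / docs_with_word)

counts$tf_idf <- counts$tf * counts$idf
```

A word that appears in every document ("river" here) gets an idf of zero, which is why tf-idf highlights words that are distinctive to one book.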

10 most unique words

mtwain_tf_idf_plot <- marktwain_tf_idf %>% 
  arrange(desc(tf_idf)) %>% 
  group_by(title) %>% 
  top_n(10, wt = tf_idf) %>% 
  ungroup() %>% 
  mutate(word = fct_inorder(word))

Plotting the 10 most unique words

ggplot(mtwain_tf_idf_plot, 
       aes(y = fct_rev(word), x = tf_idf, fill = title)) +
  geom_col() +
  guides(fill = "none") +
  labs(x = "tf-idf", y = NULL, title = "Most unique words in four Mark Twain books") +
  facet_wrap(~ title, scales = "free") +
  theme_bw()

100% optional bonus fun tasks

If you want, do some other things with the text you’ve downloaded. Make a “he verbs vs. she verbs” plot. Tag the parts of speech and find the most common verbs or nouns. Try some sentiment analysis. Do something fun.

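# Attach the Bing positive/negative label to each counted word
mtwain_sentiment <- marktwain_words2 %>% 
  inner_join(get_sentiments("bing"), by = "word",
             relationship = "many-to-many")

# Sum word occurrences (n) per book and sentiment,
# rather than counting distinct words
MT_sentiment_plot <- mtwain_sentiment %>% 
  count(title, sentiment, wt = n)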

ggplot(MT_sentiment_plot, aes(x = sentiment, y = n, fill = title, alpha = sentiment)) +
  geom_col(position = position_dodge()) +
  scale_alpha_manual(values = c(0.5, 1)) +
  facet_wrap(vars(title)) +
  theme_bw()
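
When the input to `count()` is already a table of word counts, the result depends on the `wt` argument: without it, each distinct word counts once; with `wt = n`, every occurrence counts. The sketch below shows the difference on a made-up table (`toy` and its values are hypothetical), assuming dplyr 1.0 or later, where `count()` does not weight by an existing `n` column automatically.

```r
library(dplyr)

# Made-up word counts that already carry a sentiment label
toy <- data.frame(
  title     = c("A", "A", "A", "B", "B"),
  word      = c("good", "happy", "bad", "good", "bad"),
  n         = c(5, 1, 2, 1, 4),
  sentiment = c("positive", "positive", "negative", "positive", "negative")
)

# Number of distinct sentiment words per book
distinct_words <- count(toy, title, sentiment)

# Total occurrences of sentiment words per book
occurrences <- count(toy, title, sentiment, wt = n)
```

In this toy data, book A has two distinct positive words but six positive occurrences, so the two versions can tell quite different stories.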