Assignment 10 - Text Mining

Subhalaxmi Rout

04/05/2020

Assignment Overview

In Text Mining with R, Chapter 2 looks at Sentiment Analysis. In this assignment, you should start by getting the primary example code from chapter 2 working in an R Markdown document. You should provide a citation to this base code. You’re then asked to extend the code in two ways:

  • Work with a different corpus of your choosing, and
  • Incorporate at least one additional sentiment lexicon (possibly from another R package that you’ve found through research).

Code from Textbook

The aim of this assignment is to understand the sentiment analysis presented in Chapter 2 of the textbook “Text Mining with R”, and then to add a new corpus and a lexicon that are not used in the textbook.

What is a corpus?

A corpus is a collection of text documents; corpus objects typically contain raw strings annotated with additional metadata and details.

Jane Austen dataset

We use the text of Jane Austen’s six completed, published novels from the janeaustenr package (Silge 2016) and transform it into a tidy format. A sketch of the textbook code follows.
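
The core of the textbook example (Silge & Robinson, Chapter 2) looks roughly like this; it assumes the janeaustenr, dplyr, stringr, and tidytext packages are installed, and it uses the NRC lexicon to pick out joy words in Emma:

```r
library(janeaustenr)
library(dplyr)
library(stringr)
library(tidytext)

# One row per word, keeping track of line number and chapter
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

# Joy words in "Emma" according to the NRC lexicon
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy, by = "word") %>%
  count(word, sort = TRUE)
```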

## # A tibble: 303 x 2
##    word        n
##    <chr>   <int>
##  1 good      359
##  2 young     192
##  3 friend    166
##  4 hope      143
##  5 happy     125
##  6 love      117
##  7 deal       92
##  8 found      92
##  9 present    89
## 10 kind       82
## # … with 293 more rows

## # A tibble: 122,204 x 4
##    book              linenumber chapter word     
##    <fct>                  <int>   <int> <chr>    
##  1 Pride & Prejudice          1       0 pride    
##  2 Pride & Prejudice          1       0 and      
##  3 Pride & Prejudice          1       0 prejudice
##  4 Pride & Prejudice          3       0 by       
##  5 Pride & Prejudice          3       0 jane     
##  6 Pride & Prejudice          3       0 austen   
##  7 Pride & Prejudice          7       1 chapter  
##  8 Pride & Prejudice          7       1 1        
##  9 Pride & Prejudice         10       1 it       
## 10 Pride & Prejudice         10       1 is       
## # … with 122,194 more rows
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3324
## 2 positive   2312
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4781
## 2 positive   2005
## # A tibble: 2,585 x 3
##    word     sentiment     n
##    <chr>    <chr>     <int>
##  1 miss     negative   1855
##  2 well     positive   1523
##  3 good     positive   1380
##  4 great    positive    981
##  5 like     positive    725
##  6 better   positive    639
##  7 enough   positive    613
##  8 happy    positive    534
##  9 love     positive    495
## 10 pleasure positive    462
## # … with 2,575 more rows

## # A tibble: 1,150 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 miss        custom 
##  2 a           SMART  
##  3 a's         SMART  
##  4 able        SMART  
##  5 about       SMART  
##  6 above       SMART  
##  7 according   SMART  
##  8 accordingly SMART  
##  9 across      SMART  
## 10 actually    SMART  
## # … with 1,140 more rows
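
The word “miss” tops the negative word counts above, but in Austen’s novels it is mostly used as a title for young women rather than negatively, so the textbook adds it to a custom stop-word list, roughly as follows:

```r
# "miss" is added as a custom stop word alongside the standard stop_words list
custom_stop_words <- bind_rows(tibble(word = c("miss"),
                                      lexicon = c("custom")),
                               stop_words)
```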

New Corpus

My Bondage and My Freedom is an autobiographical slave narrative written by Frederick Douglass and published in 1855. The text is downloaded using the gutenbergr package.

Reference: https://docsouth.unc.edu/neh/douglass55/douglass55.html
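
A minimal sketch of the download step (the object name bondage_freedom is an assumption; the Project Gutenberg ID 202 comes from the output below):

```r
library(gutenbergr)

# Download the full text of "My Bondage and My Freedom" (Project Gutenberg ID 202)
bondage_freedom <- gutenberg_download(202)
bondage_freedom
```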

## # A tibble: 12,208 x 2
##    gutenberg_id text                                                            
##           <int> <chr>                                                           
##  1          202 "MY BONDAGE and MY FREEDOM"                                     
##  2          202 ""                                                              
##  3          202 "By Frederick Douglass"                                         
##  4          202 ""                                                              
##  5          202 ""                                                              
##  6          202 "By a principle essential to Christianity, a PERSON is eternall…
##  7          202 "differenced from a THING; so that the idea of a HUMAN BEING, n…
##  8          202 "excludes the idea of PROPERTY IN THAT BEING."                  
##  9          202 "--COLERIDGE"                                                   
## 10          202 ""                                                              
## # … with 12,198 more rows

Convert the Data to a Tidy Format

## # A tibble: 10,624 x 4
##    gutenberg_id text                                          linenumber chapter
##           <int> <chr>                                              <int>   <int>
##  1          202 "CHAPTER I. _Childhood_"                               1       1
##  2          202 "PLACE OF BIRTH--CHARACTER OF THE DISTRICT--…          2       1
##  3          202 "NAME--CHOPTANK RIVER--TIME OF BIRTH--GENEAL…          3       1
##  4          202 "COUNTING TIME--NAMES OF GRANDPARENTS--THEIR…          4       1
##  5          202 "ESPECIALLY ESTEEMED--\"BORN TO GOOD LUCK\"-…          5       1
##  6          202 "POTATOES--SUPERSTITION--THE LOG CABIN--ITS …          6       1
##  7          202 "CHILDREN--MY AUNTS--THEIR NAMES--FIRST KNOW…          7       1
##  8          202 "MASTER--GRIEFS AND JOYS OF CHILDHOOD--COMPA…          8       1
##  9          202 "SLAVE-BOY AND THE SON OF A SLAVEHOLDER."              9       1
## 10          202 "In Talbot county, Eastern Shore, Maryland, …         10       1
## # … with 10,614 more rows
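
A sketch of how this tidying might be done, mirroring the textbook approach: line numbers are added and chapters are detected with a regular expression. Exactly how the front matter was trimmed to reach 10,624 rows is an assumption, as is the object name tidy_bondage:

```r
tidy_bondage <- bondage_freedom %>%
  filter(text != "") %>%          # drop blank lines (assumed filtering)
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE))))
```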

Lexicon

We perform sentiment analysis using the Loughran lexicon.

loughran: English sentiment lexicon created for use with financial documents. This lexicon labels words with six possible sentiments important in financial contexts: “negative”, “positive”, “litigious”, “uncertainty”, “constraining”, or “superfluous”.

Reference: https://rdrr.io/cran/textdata/man/lexicon_loughran.html

The two basic arguments to unnest_tokens used here are column names. First we have the output column name that will be created as the text is unnested into it (word, in this case), and then the input column that the text comes from (text, in this case). Remember that text_df above has a column called text that contains the data of interest.
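
A minimal sketch of the tokenization and the Loughran join (get_sentiments("loughran") downloads the lexicon via the textdata package on first use; object names are assumptions):

```r
# One row per word, keeping line number and chapter
bondage_words <- tidy_bondage %>%
  unnest_tokens(word, text)

# Label each word with its Loughran sentiment class and count the pairs
bondage_loughran <- bondage_words %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  count(word, sentiment, sort = TRUE)
```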

Analysis

The resulting dataset consists of three columns: the word, its sentiment, and its frequency.

Chapter-wise positive and negative words

We group by chapter so that we can get the positive and negative sentiment words for each chapter. Let’s get the total positive and negative word counts using the Bing lexicon, as sketched below.
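
A sketch of the chapter-level Bing counts, reusing the tokenized bondage_words from above:

```r
# Positive/negative word counts per chapter using the Bing lexicon
bondage_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(chapter, sentiment)
```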

The book has 25 chapters. Using the AFINN lexicon, we can see which chapters contain more positive words and which contain more negative words. The textbook suggests scoring sections of roughly 80 lines of text, so let’s try that; a sketch follows.
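
A sketch of the AFINN scoring over roughly 80-line sections; the exact plot aesthetics are assumptions:

```r
library(ggplot2)

bondage_words %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  # net sentiment per ~80-line section, kept within each chapter
  group_by(chapter, index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value), .groups = "drop") %>%
  ggplot(aes(index, sentiment, fill = sentiment > 0)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ chapter, scales = "free_x")
```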

From the resulting graph we can see that Chapter 25 has the most negative sentiment of all the chapters.

TF-IDF

The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents.
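
A sketch of the computation, treating each chapter as a document and using bind_tf_idf() from tidytext:

```r
# Word counts per chapter, then tf-idf with each chapter as a "document"
chapter_words <- bondage_words %>%
  count(chapter, word, sort = TRUE)

chapter_words %>%
  bind_tf_idf(word, chapter, n) %>%
  arrange(desc(tf_idf))
```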

## # A tibble: 34,361 x 6
##    chapter word            n      tf   idf  tf_idf
##      <int> <chr>       <int>   <dbl> <dbl>   <dbl>
##  1       8 gore           19 0.00722  2.12 0.0153 
##  2       8 denby          10 0.00380  3.22 0.0122 
##  3      22 bedford        33 0.00546  1.83 0.0100 
##  4      17 covey          46 0.00956  1.02 0.00976
##  5       7 barney         10 0.00300  3.22 0.00967
##  6      16 covey          28 0.00919  1.02 0.00939
##  7      18 holidays       19 0.00336  2.53 0.00850
##  8       1 grandmother    18 0.00664  1.27 0.00845
##  9      23 collins         5 0.00235  3.22 0.00755
## 10       6 nelly          12 0.00234  3.22 0.00755
## # … with 34,351 more rows

Conclusion

Sentiment analysis provides a way to understand the attitudes and opinions expressed in texts. We can use it to understand how a narrative arc changes over its course, or which words with emotional and opinion content are important for a particular text. In this assignment, we added a new corpus from the gutenbergr package and applied sentiment analysis to it. From the analysis, we learned the most frequently used positive and negative words and performed chapter-wise sentiment analysis: Chapter 25 has the most negative sentiment, while Chapters 7 and 22 lean most positive. We also explored tf-idf analysis.