Group Assignment Guidelines

Purpose:

  • Learning Outcomes measured in this assignment: LO1 to LO5

  • Content knowledge you’ll gain from doing this assignment: classification, sentiment analysis, and topic modeling.

Criteria:

  • For this assignment, you can work in groups of up to 3 people.

  • For Assignment 2, the grading criteria are 70% based on correctness of the code and 30% based on your communication of results.

  • Submission: You have two options; choose whichever you prefer.

    1. Upload the knitted document to Canvas.
    2. Publish your final output on RPubs: https://rpubs.com/about/getting-started

Data Set

For this assignment, we will use the Coronavirus tweets data set. The entire data set is available here: https://www.kaggle.com/datatattle/covid-19-nlp-text-classification?select=Corona_NLP_train.csv

Note that these are actual tweets, so they may contain offensive words/language.

Optional: If you prefer, you can scrape your own Twitter data and use that instead.
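
The analysis below assumes the following packages are loaded. This setup chunk is a minimal sketch, inferred from the functions used later in this document:

library(tidyverse)     # read_csv, dplyr verbs, ggplot2
library(tidytext)      # unnest_tokens, cast_dtm, tidy methods for LDA
library(tm)            # removeSparseTerms, weightTfIdf, weightTf
library(topicmodels)   # LDA
library(randomForest)  # randomForest
library(caret)         # confusionMatrix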

The following R chunk reads the data:

# read the tweet data from the shared CSV link
tweet = read_csv("https://unh.box.com/shared/static/b3iz7j2d33ju568vi0yc7842s01pypjg.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   UserName = col_double(),
##   ScreenName = col_double(),
##   TweetAt = col_character(),
##   OriginalTweet = col_character(),
##   Sentiment = col_character()
## )
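
The column-specification message can be silenced by declaring the column types up front. A small sketch using readr’s compact col_types string (d = double, c = character):

tweet = read_csv("https://unh.box.com/shared/static/b3iz7j2d33ju568vi0yc7842s01pypjg.csv",
                 col_types = "ddccc")  # UserName, ScreenName as doubles; the rest as character
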
  1. (5 points) Notice that this data set has 41,157 tweets. Please randomly select at most 10,000 tweets and carry out the analysis with that sample.
set.seed(1234)  # fix the RNG seed so the sample is reproducible

data = sample_n(tweet, 10000)  # draw 10,000 tweets at random
head(data)
## # A tibble: 6 x 5
##   UserName ScreenName TweetAt   OriginalTweet                       Sentiment   
##      <dbl>      <dbl> <chr>     <chr>                               <chr>       
## 1    44582      89534 13-04-20… "\u0093Conferences that are being … Neutral     
## 2    44652      89604 13-04-20… "Brain, Nervous System Affected in… Extremely N…
## 3    19039      63991 22-03-20… "This country will find a vaccine … Extremely P…
## 4    37500      82452 8/4/2020  "Privileged to have been invited t… Positive    
## 5    39514      84466 9/4/2020  "People grocery store workers are … Positive    
## 6    21285      66237 23-03-20… "PM says s amp shall stay open but… Neutral
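
Side note: sample_n() is superseded in current dplyr releases; on a recent version, the equivalent call is slice_sample():

data = slice_sample(tweet, n = 10000)  # draw 10,000 rows at random (modern dplyr verb)
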
  2. (5 points) Tokenize the data set by “word”, remove stop words, and get the word counts.
data_word = data %>%
  unnest_tokens(output = word, token = "words", input = OriginalTweet) %>%  # one row per token
  anti_join(stop_words)  # drop common English stop words
## Joining, by = "word"
word_count = data_word %>%
  count(word, sort=TRUE)

head(word_count)
## # A tibble: 6 x 2
##   word            n
##   <chr>       <int>
## 1 t.co         5858
## 2 https        5855
## 3 coronavirus  4429
## 4 19           3081
## 5 covid        2942
## 6 prices       1844
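
Notice that the top “words” (t.co, https) are URL fragments rather than real vocabulary. If you would rather keep them out of the counts, one option is to filter the tokens after unnesting; a sketch (the exact filters are a judgment call):

data_word_clean = data_word %>%
  filter(!word %in% c("t.co", "https", "http")) %>%   # drop URL fragments
  filter(!str_detect(word, "^[0-9]+$"))               # optionally drop bare numbers
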

Text Classification

  3. (10 points) Turn the data into a TF-IDF matrix using cast_dtm with document = UserName and term = word. Did you need to create an unnested object first? Do you need to remove sparse terms? If so, do that as well.
data_matrix = data_word %>%
  count(UserName, word) %>%
  cast_dtm(document=UserName, term=word, 
           value=n, weighting=tm::weightTfIdf)

data_matrix
## <<DocumentTermMatrix (documents: 10000, terms: 29736)>>
## Non-/sparse entries: 163776/297196224
## Sparsity           : 100%
## Maximal term length: 55
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
# inspect the effect of dropping very sparse terms (the result is displayed,
# not assigned, so the full matrix is what gets used below)
removeSparseTerms(data_matrix, sparse = .999)
## <<DocumentTermMatrix (documents: 10000, terms: 2069)>>
## Non-/sparse entries: 115898/20574102
## Sparsity           : 99%
## Maximal term length: 21
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
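
Note that removeSparseTerms() returns a new matrix rather than modifying data_matrix in place, so if you want the reduced matrix downstream, assign it, e.g.:

data_matrix_sparse = removeSparseTerms(data_matrix, sparse = .999)
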
  4. (5 points) Split your data into a training and a test set; you can decide on the split. We will predict the variable “Sentiment”.
set.seed(1234)

sample_size = floor(0.7 * nrow(data_matrix))  # 70/30 train/test split
train_index = sample(nrow(data_matrix), size = sample_size)

train = data_matrix[train_index, ]
test = data_matrix[-train_index, ]
nrow(train)
## [1] 7000
nrow(test)
## [1] 3000
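
Because the sentiment classes are imbalanced, it can be worth checking that the random split roughly preserves their proportions; a quick sketch:

round(prop.table(table(data$Sentiment[train_index])), 3)   # class shares, training set
round(prop.table(table(data$Sentiment[-train_index])), 3)  # class shares, test set
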
  5. (20 points) Fit your favorite classification model to the training set, then calculate the model accuracy using the test set. Why did you choose this particular model or any of its parameters? Looking at this model, what information do you gain?
df = as.data.frame(as.matrix(train))               # random forest needs a dense data frame
factor_data = factor(data$Sentiment[train_index])  # training labels
rf = randomForest(x = df, y = factor_data, ntree = 10)
rf
## 
## Call:
##  randomForest(x = df, y = factor_data, ntree = 10) 
##                Type of random forest: classification
##                      Number of trees: 10
## No. of variables tried at each split: 172
## 
##         OOB estimate of  error rate: 77.04%
## Confusion matrix:
##                    Extremely Negative Extremely Positive Negative Neutral
## Extremely Negative                 71                102      254     128
## Extremely Positive                 84                100      306     158
## Negative                          123                176      446     220
## Neutral                           104                146      338     176
## Positive                          158                203      512     267
##                    Positive class.error
## Extremely Negative      363   0.9226580
## Extremely Positive      440   0.9080882
## Negative                704   0.7327741
## Neutral                 556   0.8666667
## Positive                799   0.5879319
pred = predict(rf, newdata = as.data.frame(as.matrix(test)))
# note: confusionMatrix(data, reference) expects the predictions first; here the
# true labels are passed as `data` and the predictions as `reference`, so the
# "Prediction"/"Reference" roles in the table below are swapped
confusionMatrix(factor(data[-train_index,]$Sentiment), pred)
## Confusion Matrix and Statistics
## 
##                     Reference
## Prediction           Extremely Negative Extremely Positive Negative Neutral
##   Extremely Negative                  9                 17      105      42
##   Extremely Positive                 15                 32      130      44
##   Negative                           15                 41      194      84
##   Neutral                            12                 27      142      63
##   Positive                           22                 55      219     100
##                     Reference
## Prediction           Positive
##   Extremely Negative      214
##   Extremely Positive      244
##   Negative                423
##   Neutral                 325
##   Positive                426
## 
## Overall Statistics
##                                           
##                Accuracy : 0.2413          
##                  95% CI : (0.2261, 0.2571)
##     No Information Rate : 0.544           
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.0096         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
## 
## Statistics by Class:
## 
##                      Class: Extremely Negative Class: Extremely Positive
## Sensitivity                            0.12329                   0.18605
## Specificity                            0.87086                   0.84689
## Pos Pred Value                         0.02326                   0.06882
## Neg Pred Value                         0.97551                   0.94477
## Prevalence                             0.02433                   0.05733
## Detection Rate                         0.00300                   0.01067
## Detection Prevalence                   0.12900                   0.15500
## Balanced Accuracy                      0.49707                   0.51647
##                      Class: Negative Class: Neutral Class: Positive
## Sensitivity                  0.24557         0.1892          0.2610
## Specificity                  0.74525         0.8103          0.7105
## Pos Pred Value               0.25627         0.1107          0.5182
## Neg Pred Value               0.73428         0.8889          0.4463
## Prevalence                   0.26333         0.1110          0.5440
## Detection Rate               0.06467         0.0210          0.1420
## Detection Prevalence         0.25233         0.1897          0.2740
## Balanced Accuracy            0.49541         0.4997          0.4858
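
The test accuracy (24.1%, with a negative Kappa) indicates the model is doing no better than chance. One natural next step is retraining on the sparsity-reduced matrix from the earlier sketch, with more trees. A hedged sketch: the names below (train_s, rf_s, etc.) are illustrative, and this assumes data_matrix_sparse was assigned as shown above:

train_s = as.data.frame(as.matrix(data_matrix_sparse[train_index, ]))
test_s  = as.data.frame(as.matrix(data_matrix_sparse[-train_index, ]))
rf_s    = randomForest(x = train_s, y = factor_data, ntree = 50)
pred_s  = predict(rf_s, newdata = test_s)
mean(pred_s == data$Sentiment[-train_index])  # plain test-set accuracy
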

Topic Modeling

  6. (5 points) Turn the data into a TF matrix, i.e., cast_dtm with weighting=tm::weightTf.
data_matrix_TF = data_word %>%
  count(Sentiment, word) %>%                 # one "document" per sentiment class
  cast_dtm(document = Sentiment, term = word,
           value = n, weighting = tm::weightTf)

data_matrix_TF
## <<DocumentTermMatrix (documents: 5, terms: 29736)>>
## Non-/sparse entries: 47949/100731
## Sparsity           : 68%
## Maximal term length: 55
## Weighting          : term frequency (tf)
  7. (20 points) Carry out a topic model with \(k\) topics. Why did you choose this particular \(k\)?
data_lda = LDA(data_matrix_TF, k = 2, control = list(seed = 1234))  # fit a 2-topic LDA
data_lda
## A LDA_VEM topic model with 2 topics.
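
One common way to motivate the choice of \(k\) is to compare model perplexity across candidate values (lower is better). A sketch using topicmodels::perplexity(); note that with only five “documents” in this matrix, only small values of \(k\) are sensible:

ks = 2:5
perp = sapply(ks, function(k)
  perplexity(LDA(data_matrix_TF, k = k, control = list(seed = 1234))))
plot(ks, perp, type = "b", xlab = "k", ylab = "perplexity")
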
  8. (10 points) Investigate the beta matrix for the model in part 7. What are the top 10 words for each topic?
lda_topics=tidy(data_lda, matrix="beta")
lda_topics
## # A tibble: 59,472 x 3
##    topic term                beta
##    <int> <chr>              <dbl>
##  1     1 _reaalamerican_ 3.91e-44
##  2     2 _reaalamerican_ 1.05e- 5
##  3     1 0               5.19e- 5
##  4     2 0               1.16e- 4
##  5     1 0.006           6.21e-43
##  6     2 0.006           1.05e- 5
##  7     1 0.1             1.30e- 5
##  8     2 0.1             1.05e- 5
##  9     1 0.25            2.91e-44
## 10     2 0.25            1.05e- 5
## # … with 59,462 more rows
lda_top_terms = lda_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%   # keep the 10 highest-beta terms per topic
  ungroup() %>%
  arrange(topic, beta)
lda_top_terms %>%
  mutate(term=reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill=factor(topic))) + 
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, scales = "free") +
  coord_flip()
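
One caveat on the plot above: reorder() sorts terms globally, so bars within each facet may not be perfectly ordered. tidytext provides reorder_within() and scale_x_reordered() for per-facet ordering; an alternative sketch:

lda_top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, scales = "free") +
  scale_x_reordered() +
  coord_flip()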

  9. (10 points) Investigate the gamma matrix for the model in part 7. What can you say about the topics of the model?
lda_documents=tidy(data_lda, matrix="gamma")
lda_documents
## # A tibble: 10 x 3
##    document           topic       gamma
##    <chr>              <int>       <dbl>
##  1 Extremely Negative     1 0.00000122 
##  2 Extremely Positive     1 1.00       
##  3 Negative               1 0.000000708
##  4 Neutral                1 0.00000108 
##  5 Positive               1 1.00       
##  6 Extremely Negative     2 1.00       
##  7 Extremely Positive     2 0.00000104 
##  8 Negative               2 1.00       
##  9 Neutral                2 1.00       
## 10 Positive               2 0.000000626
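
The long format makes per-document comparisons a little awkward; spreading gamma into one row per document can help (a small sketch using tidyr, which loads with the tidyverse):

lda_documents %>%
  pivot_wider(names_from = topic, values_from = gamma, names_prefix = "topic_")
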
  10. (10 points) What is your learning outcome from this analysis? What would you like me to notice in your analysis?