Learning Outcomes measured in this assignment: LO1 to LO5
Content knowledge you’ll gain from doing this assignment: classification, sentiment analysis, and topic modeling.
For this assignment, you can work in groups of up to 3 people.
For Assignment 2, the grading criteria are 70% correctness of the code and 30% communication of your results.
Submission: You have two options; choose whichever you prefer.
For this assignment, we will be using the Coronavirus tweets dataset. The entire dataset is available here: https://www.kaggle.com/datatattle/covid-19-nlp-text-classification?select=Corona_NLP_train.csv
Note that these are actual tweets, so they may contain offensive words/language.
Optional: if you prefer, you can scrape your own Twitter data and use that instead.
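If you go that route, one possible starting point is sketched below. It assumes the rtweet package and a working Twitter developer token, and the Twitter/X API terms have changed over time, so treat it as illustrative only:
library(rtweet)
# requires prior authentication, e.g. via auth_setup_default()
my_tweets = search_tweets("coronavirus", n = 1000, include_rts = FALSE)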
The following R chunk loads the required packages and reads the data:
library(tidyverse)    # read_csv, dplyr verbs, ggplot2
library(tidytext)     # unnest_tokens, cast_dtm, tidy
library(tm)           # removeSparseTerms, weighting functions
library(randomForest) # randomForest
library(caret)        # confusionMatrix
library(topicmodels)  # LDA

tweet = read_csv("https://unh.box.com/shared/static/b3iz7j2d33ju568vi0yc7842s01pypjg.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## UserName = col_double(),
## ScreenName = col_double(),
## TweetAt = col_character(),
## OriginalTweet = col_character(),
## Sentiment = col_character()
## )
set.seed(1234)
# work with a reproducible random sample of 10,000 tweets
data = sample_n(tweet, 10000)
head(data)
## # A tibble: 6 x 5
## UserName ScreenName TweetAt OriginalTweet Sentiment
## <dbl> <dbl> <chr> <chr> <chr>
## 1 44582 89534 13-04-20… "\u0093Conferences that are being … Neutral
## 2 44652 89604 13-04-20… "Brain, Nervous System Affected in… Extremely N…
## 3 19039 63991 22-03-20… "This country will find a vaccine … Extremely P…
## 4 37500 82452 8/4/2020 "Privileged to have been invited t… Positive
## 5 39514 84466 9/4/2020 "People grocery store workers are … Positive
## 6 21285 66237 23-03-20… "PM says s amp shall stay open but… Neutral
# tokenize tweets into one row per word, then drop common stop words
data_word = data %>%
  unnest_tokens(output=word, token = "words", input = OriginalTweet) %>%
  anti_join(stop_words)
## Joining, by = "word"
# most frequent words across the sample
word_count = data_word %>%
  count(word, sort=TRUE)
head(word_count)
## # A tibble: 6 x 2
## word n
## <chr> <int>
## 1 t.co 5858
## 2 https 5855
## 3 coronavirus 4429
## 4 19 3081
## 5 covid 2942
## 6 prices 1844
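The top “words” above (t.co, https) are link fragments rather than content, and amp is an HTML escape remnant visible in the sample tweets. A minimal cleaning sketch using a custom stop-word list; the token choices are my own guesses, not part of the assignment:
my_stops = tibble(word = c("t.co", "https", "http", "amp"))
data_word_clean = data_word %>%
  anti_join(my_stops, by = "word")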
Use cast_dtm with document = UserName and term = word. Did you need to create an unnested object first? Do you need to remove sparse terms? If so, do that as well.
# count each word per user, then cast to a tf-idf weighted document-term matrix
data_matrix = data_word %>%
  count(UserName, word) %>%
  cast_dtm(document=UserName, term=word,
           value=n, weighting=tm::weightTfIdf)
data_matrix
## <<DocumentTermMatrix (documents: 10000, terms: 29736)>>
## Non-/sparse entries: 163776/297196224
## Sparsity : 100%
## Maximal term length: 55
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
# shown for inspection only: the result is not assigned, so data_matrix below
# still contains all 29,736 terms
removeSparseTerms(data_matrix, sparse=.999)
## <<DocumentTermMatrix (documents: 10000, terms: 2069)>>
## Non-/sparse entries: 115898/20574102
## Sparsity : 99%
## Maximal term length: 21
## Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
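Note that removeSparseTerms() returns a new matrix rather than modifying data_matrix in place; the analysis below still uses all 29,736 terms. To actually carry the reduced 2,069-term vocabulary forward, assign the result:
data_matrix_small = removeSparseTerms(data_matrix, sparse = .999)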
set.seed(1234)
# 70/30 train/test split over the rows (documents) of the DTM
sample_size = floor(0.7*nrow(data_matrix))
train_index = sample(nrow(data_matrix), size = sample_size)
train = data_matrix[train_index,]
test  = data_matrix[-train_index,]
nrow(train)
## [1] 7000
nrow(test)
## [1] 3000
# densify the training DTM for randomForest
df = as.data.frame(as.matrix(train))
# assumes data's row order matches the DTM's document order (see the note after the output below)
factor_data = factor(data$Sentiment[train_index])
# a small forest (10 trees) keeps the fit fast on a 29,736-column matrix
rf = randomForest(x=df, y=factor_data, ntree=10)
rf
##
## Call:
## randomForest(x = df, y = factor_data, ntree = 10)
## Type of random forest: classification
## Number of trees: 10
## No. of variables tried at each split: 172
##
## OOB estimate of error rate: 77.04%
## Confusion matrix:
## Extremely Negative Extremely Positive Negative Neutral
## Extremely Negative 71 102 254 128
## Extremely Positive 84 100 306 158
## Negative 123 176 446 220
## Neutral 104 146 338 176
## Positive 158 203 512 267
## Positive class.error
## Extremely Negative 363 0.9226580
## Extremely Positive 440 0.9080882
## Negative 704 0.7327741
## Neutral 556 0.8666667
## Positive 799 0.5879319
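A caution (my reading of the code, not something stated in the assignment): dplyr's count(UserName, word) returns its groups sorted by UserName, so the DTM's rows are most likely not in the same order as data, and indexing data$Sentiment by train_index would then attach the wrong label to most documents. That would help explain the near-chance OOB error above. A sketch of a safer lookup keyed on the DTM's document names:
# re-align sentiment labels to the DTM's document order before refitting
doc_ids = as.numeric(rownames(data_matrix))               # DTM documents are UserName values
labels  = data$Sentiment[match(doc_ids, data$UserName)]
factor_data = factor(labels[train_index])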
pred = predict(rf, newdata=as.data.frame(as.matrix(test)))
# note: confusionMatrix(data, reference) expects predictions first; here the
# actual labels are passed first, so "Prediction"/"Reference" below are swapped
confusionMatrix(factor(data[-train_index,]$Sentiment), pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Extremely Negative Extremely Positive Negative Neutral
## Extremely Negative 9 17 105 42
## Extremely Positive 15 32 130 44
## Negative 15 41 194 84
## Neutral 12 27 142 63
## Positive 22 55 219 100
## Reference
## Prediction Positive
## Extremely Negative 214
## Extremely Positive 244
## Negative 423
## Neutral 325
## Positive 426
##
## Overall Statistics
##
## Accuracy : 0.2413
## 95% CI : (0.2261, 0.2571)
## No Information Rate : 0.544
## P-Value [Acc > NIR] : 1
##
## Kappa : -0.0096
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: Extremely Negative Class: Extremely Positive
## Sensitivity 0.12329 0.18605
## Specificity 0.87086 0.84689
## Pos Pred Value 0.02326 0.06882
## Neg Pred Value 0.97551 0.94477
## Prevalence 0.02433 0.05733
## Detection Rate 0.00300 0.01067
## Detection Prevalence 0.12900 0.15500
## Balanced Accuracy 0.49707 0.51647
## Class: Negative Class: Neutral Class: Positive
## Sensitivity 0.24557 0.1892 0.2610
## Specificity 0.74525 0.8103 0.7105
## Pos Pred Value 0.25627 0.1107 0.5182
## Neg Pred Value 0.73428 0.8889 0.4463
## Prevalence 0.26333 0.1110 0.5440
## Detection Rate 0.06467 0.0210 0.1420
## Detection Prevalence 0.25233 0.1897 0.2740
## Balanced Accuracy 0.49541 0.4997 0.4858
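Test accuracy (0.2413) falls below the no-information rate (0.544), consistent with the label-alignment caution above. Separately, caret's confusionMatrix(data, reference) takes predictions as its first argument; using the re-aligned labels sketched earlier, the conventional call would be:
confusionMatrix(data = pred, reference = factor(labels[-train_index]))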
Use cast_dtm with weighting = tm::weightTf. LDA requires raw term counts, with one "document" per sentiment class:
data_matrix_TF = data_word %>%
  count(Sentiment, word) %>%
  cast_dtm(document=Sentiment, term=word,
           value=n, weighting=tm::weightTf)
data_matrix_TF
## <<DocumentTermMatrix (documents: 5, terms: 29736)>>
## Non-/sparse entries: 47949/100731
## Sparsity : 68%
## Maximal term length: 55
## Weighting : term frequency (tf)
# fit a two-topic LDA model on the raw counts
data_lda = LDA(data_matrix_TF, k=2, control=list(seed=1234))
data_lda
## A LDA_VEM topic model with 2 topics.
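The choice k = 2 is an assumption worth probing. One rough way to compare candidate topic counts is topicmodels::perplexity() (lower is better, though with only five documents here the comparison is coarse); a sketch:
# fit k = 2..5 and report perplexity on the same matrix
sapply(2:5, function(k) {
  m = LDA(data_matrix_TF, k = k, control = list(seed = 1234))
  perplexity(m, data_matrix_TF)
})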
# per-topic word probabilities (beta)
lda_topics = tidy(data_lda, matrix="beta")
lda_topics
## # A tibble: 59,472 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 _reaalamerican_ 3.91e-44
## 2 2 _reaalamerican_ 1.05e- 5
## 3 1 0 5.19e- 5
## 4 2 0 1.16e- 4
## 5 1 0.006 6.21e-43
## 6 2 0.006 1.05e- 5
## 7 1 0.1 1.30e- 5
## 8 2 0.1 1.05e- 5
## 9 1 0.25 2.91e-44
## 10 2 0.25 1.05e- 5
## # … with 59,462 more rows
# ten highest-probability terms within each topic
lda_top_terms = lda_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, beta)

# note: reorder() sorts terms globally; tidytext::reorder_within() would order
# them within each facet if shared terms look out of order
lda_top_terms %>%
  mutate(term=reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill=factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~topic, scales = "free") +
  coord_flip()
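A common companion to the top-terms plot, following the standard tidytext recipe (the 0.001 rarity filter is my own arbitrary threshold): the terms whose beta differs most between the two topics, on a log2 scale:
beta_wide = lda_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  pivot_wider(names_from = topic, values_from = beta) %>%
  filter(topic1 > .001 | topic2 > .001) %>%
  mutate(log_ratio = log2(topic2 / topic1)) %>%
  arrange(desc(abs(log_ratio)))
head(beta_wide)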
# per-document (here, per-sentiment-class) topic proportions (gamma)
lda_documents = tidy(data_lda, matrix="gamma")
lda_documents
## # A tibble: 10 x 3
## document topic gamma
## <chr> <int> <dbl>
## 1 Extremely Negative 1 0.00000122
## 2 Extremely Positive 1 1.00
## 3 Negative 1 0.000000708
## 4 Neutral 1 0.00000108
## 5 Positive 1 1.00
## 6 Extremely Negative 2 1.00
## 7 Extremely Positive 2 0.00000104
## 8 Negative 2 1.00
## 9 Neutral 2 1.00
## 10 Positive 2 0.000000626
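Reading the gammas: each sentiment class sits almost entirely in one topic, with Positive and Extremely Positive on topic 1 and Negative, Extremely Negative, and Neutral on topic 2. A one-line reshape makes this easier to scan:
lda_documents %>%
  pivot_wider(names_from = topic, values_from = gamma, names_prefix = "topic")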