Date: 02/26/2020

Data Analytics Plan

We used Reddit (https://www.reddit.com/) as the data source for a sentiment analysis of the recent coronavirus outbreak. The goal is to analyse the Reddit community’s reaction to the outbreak by tracking comments in the Corona and World News subreddits.

Attributes used: comment text, comment score, and post score

Peer Comments Summary

  • Include deep learning models and machine learning techniques
  • Implement clustering in the analysis
  • Use PCA to reduce the number of features for faster computation
  • Use predefined sentiment lexicons such as bing and nrc from the tidytext package
  • Handle grammatical nuances, misspellings, and ambiguity during analysis

Word Cloud Visualization

Sentiment Visualization using bing
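
For readers unfamiliar with lexicon-based scoring, the following is a minimal Python sketch of how a bing-style lexicon labels words as positive or negative and how those labels can be tallied across comments. It is not the tidytext pipeline used for this slide: the tiny lexicon, the sample comments, and the function name `polarity_counts` are all hypothetical and exist only for illustration.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon for illustration only; the real bing lexicon
# contains several thousand labelled words.
TOY_LEXICON = {
    "cure": "positive",
    "safe": "positive",
    "recover": "positive",
    "outbreak": "negative",
    "fear": "negative",
    "death": "negative",
}

def polarity_counts(comments, lexicon=TOY_LEXICON):
    """Count positive and negative lexicon words across all comments."""
    counts = Counter(positive=0, negative=0)
    for comment in comments:
        # Lowercase and split into simple word tokens.
        for word in re.findall(r"[a-z']+", comment.lower()):
            label = lexicon.get(word)
            if label is not None:
                counts[label] += 1
    return counts

# Made-up comments, not taken from the Reddit data.
sample_comments = [
    "No cure yet and the outbreak keeps growing",
    "Fear of death is everywhere",
    "Stay safe everyone",
]
counts = polarity_counts(sample_comments)
print(counts["positive"], counts["negative"])
```

Note that a plain word lookup like this ignores negation ("no cure" still counts "cure" as positive), which is one of the ambiguities raised in the peer comments.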

Sentiment Visualization using nrc

Sentiment Distribution

Coronavirus Comment Trend

Sentiment output to ML

ML Model Procedures

We trained two models on an H2O instance to predict a comment’s score from the score of its parent post and the NRC sentiments of the comment:

  • Gradient Boosting Model
  • Linear Regression Model
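
The model inputs described above (parent post score plus NRC sentiments) could be assembled into one row per comment roughly as follows. This is a sketch under stated assumptions, not the authors' feature pipeline: the emotion categories shown are a small subset, the word lists are invented, and `feature_row` is a hypothetical helper.

```python
import re

# Hypothetical subset of NRC-style emotion categories and words,
# for illustration only.
TOY_NRC = {
    "fear": {"fear", "panic", "outbreak"},
    "trust": {"doctor", "safe"},
    "sadness": {"death", "loss"},
}

def feature_row(comment_text, post_score, lexicon=TOY_NRC):
    """Build one model input row: post score plus per-emotion word counts."""
    words = re.findall(r"[a-z']+", comment_text.lower())
    row = {"post_score": post_score}
    for emotion, vocab in lexicon.items():
        # Count how many tokens in the comment belong to this emotion.
        row[emotion] = sum(1 for w in words if w in vocab)
    return row

# Made-up comment and post score.
row = feature_row(
    "Panic over the outbreak, but my doctor says stay safe",
    post_score=120,
)
print(row)
```

The comment's own score would be the prediction target and is therefore kept out of the feature row.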

After processing, the data was divided into three sets:

  • Train (70%)
  • Test (15%)
  • Validate (15%)
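
A 70/15/15 split like the one above can be sketched in plain Python as shown here. The slides do not say how the split was performed, so the shuffling, the fixed seed, and the function name `split_rows` are assumptions made for the example.

```python
import random

def split_rows(rows, train_pct=70, test_pct=15, seed=42):
    """Shuffle rows and split them into train, test, and validation sets.

    Whatever is left after the train and test shares goes to validation.
    """
    shuffled = list(rows)
    # A fixed seed (an assumption here) keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_test = len(shuffled) * test_pct // 100
    train_set = shuffled[:n_train]
    test_set = shuffled[n_train:n_train + n_test]
    valid_set = shuffled[n_train + n_test:]
    return train_set, test_set, valid_set

# Stand-in data: 100 row indices instead of real comments.
train_set, test_set, valid_set = split_rows(range(100))
print(len(train_set), len(test_set), len(valid_set))
```

Shuffling before slicing avoids putting all of the oldest comments in one set, although for time-dependent data a chronological split may be the better choice.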

Gradient Boosting Model Results

Linear Regression Model Results

Key Take-aways

  • ML metrics can change significantly within a day once fresh data arrives
    • RMSE 3 vs 15
  • As the coronavirus spreads further, the word cloud changes
  • Since there is no cure for the coronavirus yet, people express more negative sentiments than positive ones, as is evident from the bing visualization
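
For reference, the RMSE metric quoted in the take-aways is the square root of the mean squared difference between actual and predicted comment scores. A minimal sketch follows; the numbers are made up for the example and are not the model results.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up comment scores and predictions; every prediction is off by 2.
actual_scores = [10, 4, 7, 1]
predicted_scores = [8, 6, 5, 3]
print(rmse(actual_scores, predicted_scores))
```

Because errors are squared before averaging, a handful of very high-scoring comments in a fresh batch of data can raise RMSE sharply.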