Background

  • Reddit is a website that works like a collection of forums
  • Users post about and discuss topics in forum-like communities called subreddits
  • For example, in an Apple subreddit, users discuss the latest Apple products

Recap:

In our project proposal, we stated the following:

  • Our primary source of data will be the Reddit API.

  • We will use R libraries such as tidyr and dplyr for data cleaning and manipulation.

  • We will use natural language processing techniques to sort the Reddit data into relevant categories.

  • For data exploration, we will use ggplot graphs to show our analysis and outcomes.

Peer Comments Summary

  • Most of our peers thought that using NLP was the best way to understand the sentiment of the comments
  • Following their suggestions, we added visualizations to get a better idea of the patterns in the data
  • We also included topic modeling to understand the probability of a comment belonging to a post

Steps performed to achieve our proposed aim.

Fetching data:

  • We used the RedditExtractoR package to fetch data from the Reddit API using the search terms “virus” and “news”
  • With these search terms we fetched comments on posts from 24 subreddits

Data Cleaning

  • Dropped unnecessary columns such as Link, domain, URL and structure, as these do not play any role in analyzing comments or predicting upvotes
  • Using the lubridate library, parsed the post_date and comm_date columns in day-month-year (dmy) format
  • Checked for and handled null values
  • Scaled columns to bring them onto a standard scale
  • Cleaned comments by removing unneeded characters such as punctuation, numbers, special characters and HTML tags
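The cleaning steps above can be sketched in a few lines. The project itself used R (lubridate, tidyr, dplyr); the Python below is only an illustrative stand-in using the standard library. The function names, the "25-03-20" date layout and the choice of z-score scaling are assumptions, not details taken from the project.

```python
import re
from datetime import datetime
from html import unescape
from statistics import mean, stdev

TAG_RE = re.compile(r"<[^>]+>")          # HTML tags such as <p> or <br/>
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # punctuation, digits, special characters
SPACE_RE = re.compile(r"\s+")            # runs of whitespace


def clean_comment(text):
    """Lower-case a comment and strip HTML tags, entities, digits and punctuation."""
    text = TAG_RE.sub(" ", text)        # drop tags first
    text = unescape(text).lower()       # turn entities like &amp; into characters
    text = NON_LETTER_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def parse_dmy(value):
    """Parse a day-month-year string such as '25-03-20' (assumed layout)."""
    return datetime.strptime(value, "%d-%m-%y").date()


def standardize(values):
    """Z-score scaling with the sample standard deviation (assumed scaling method)."""
    centre = mean(values)
    spread = stdev(values)
    return [(v - centre) / spread for v in values]


print(clean_comment("<p>COVID-19 is &amp; was <b>scary</b>!!! 100%</p>"))
print(parse_dmy("25-03-20"))
print(standardize([1, 2, 3]))
```

Note that this simple cleaner also splits contractions ("don't" becomes "don t"), which a fuller pipeline might want to handle separately.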

Natural Language Processing Techniques to categorize the reddit data into their relevant categories.

  • We started by creating a function that takes a dataframe as input and returns cleaned tokens, with stop words and numbers removed
  • Then we created word clouds for 4 subreddits to see which words are characteristic of which subreddit, which helps with categorization
  • Performed sentiment analysis to find the most positive and most negative words in the comment text
  • Prepared a Document-Term Matrix, which records how often each term occurs in each document of a collection
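To make the token-cleaning and Document-Term Matrix steps concrete, here is a minimal sketch. It is written in Python for illustration only (the project used R), and the stop-word list, function names and sample sentences are made up for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Split text into lower-case word tokens, dropping stop words and numbers."""
    words = re.findall(r"[a-z]+", text.lower())  # letters only, so digits are skipped
    return [w for w in words if w not in STOP_WORDS]


def document_term_matrix(documents):
    """Return (vocabulary, rows), where rows[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


docs = ["The virus is in the news", "News of the 2 new games"]
vocabulary, rows = document_term_matrix(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Each row is one document and each column one term, so the matrix can feed directly into word-frequency plots or topic models.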

Data exploration

  • Visualized posts by year, month and day
  • Plotted the top 20 authors by number of posts and by post score
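The aggregations behind these plots are simple counts. The sketch below shows one way to compute them, in Python for illustration only (the project used R and ggplot). The record layout, field names and sample rows are invented for the example.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Made-up sample records; the real data has many more columns and rows.
sample_posts = [
    {"author": "alice", "post_date": date(2020, 3, 23), "post_score": 10},
    {"author": "bob", "post_date": date(2020, 3, 23), "post_score": 5},
    {"author": "alice", "post_date": date(2020, 3, 25), "post_score": 7},
    {"author": "alice", "post_date": date(2020, 3, 27), "post_score": 1},
]


def top_authors(posts, n=20):
    """Return the n authors with the most posts as (author, count) pairs."""
    return Counter(p["author"] for p in posts).most_common(n)


def score_by_author(posts):
    """Return the total post score per author."""
    totals = Counter()
    for p in posts:
        totals[p["author"]] += p["post_score"]
    return dict(totals)


def posts_per_weekday(posts):
    """Return the number of posts made on each weekday name."""
    return dict(Counter(WEEKDAYS[p["post_date"].weekday()] for p in posts))


print(top_authors(sample_posts, n=2))
print(score_by_author(sample_posts))
print(posts_per_weekday(sample_posts))
```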

Data Summary

The data is available at https://www.dropbox.com/s/hx5yiqkjwc9b4cd/worldnews_test3.csv?dl=0 as a CSV file fetched through the Reddit API. We used RedditExtractoR to download the data, the majority of which comes from the following subreddits:

  • ask reddit - Data related to ask reddit
  • world news - Data related to world news
  • NintendoSwitch - Data related to games
  • Bangtan - Data related to the South Korean boy group (Entertainment)

Data Results

Number of posts vs years


Data results - continued

Number of posts vs days of the week - People post far less on Wednesdays

NLP Procedures

  • Reading the CSV files into R variables
  • Creating a corpus
  • Tokenization
  • Removing stop words
  • Removing numbers
  • Removing rare words
  • Finding unigrams and bigrams
  • Finding correlations between words
  • Generating the Document-Term Matrix
  • Topic modeling
  • Term Frequency and Inverse Document Frequency (TF-IDF)
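Two of the steps in this list, bigrams and TF-IDF, are easy to show on toy data. The sketch below is in Python for illustration only (the project used R). It uses one common TF-IDF definition, tf = count / document length and idf = ln(N / number of documents containing the term); whether the project used exactly this weighting is an assumption.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return consecutive token pairs, e.g. [a, b, c] -> [(a, b), (b, c)]."""
    return list(zip(tokens, tokens[1:]))


def tf_idf(token_lists):
    """Return one {term: tf-idf} dict per document."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [["virus", "news", "virus"], ["news", "games"]]
print(bigrams(docs[0]))
for doc_scores in tf_idf(docs):
    print(doc_scores)
```

A term that appears in every document ("news" here) gets a score of zero, which is why TF-IDF highlights the words that distinguish one subreddit from another.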

NLP Results - Word Clouds

Most common tokens in subreddit - world news - Data related to world news


NLP Results - Word Clouds

Most common tokens in subreddit - ask reddit - Data related to ask reddit

NLP Results - Word Clouds

Most common tokens in subreddit Bangtan - Data related to the South Korean boy group (Entertainment)

NLP Results - Word Clouds

Most common tokens in subreddit NintendoSwitch - Data related to games

Topic Modeling


Distribution of sentiment


We can see that more of the words fall on the positive side than on the negative side. However, individual words alone are not enough to judge sentiment, so we next look at the sentiment densities to get a clearer picture.

Distribution of sentiment


From this we can see that the sentiments are fairly evenly distributed, in contrast to the picture given by word-level sentiment alone.
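The difference between counting sentiment words and looking at sentiment per comment can be shown on toy data. The sketch below is in Python for illustration only (the project used R). The lexicon and comments are made up, and reading "sentiment density" as the distribution of per-comment average scores is an assumption.

```python
# Toy sentiment lexicon (hypothetical); real lexicons contain thousands of words.
LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "awful": -1, "hate": -1}


def word_sentiments(tokens):
    """Return the lexicon score of every token that has one."""
    return [LEXICON[t] for t in tokens if t in LEXICON]


def comment_score(tokens):
    """Average sentiment of a comment; 0.0 if it has no lexicon words."""
    scores = word_sentiments(tokens)
    return sum(scores) / len(scores) if scores else 0.0


comments = [["good", "great", "news"], ["bad"], ["hate", "love"]]

all_words = [s for c in comments for s in word_sentiments(c)]
positive_words = sum(1 for s in all_words if s > 0)
negative_words = sum(1 for s in all_words if s < 0)
per_comment = [comment_score(c) for c in comments]

print(positive_words, negative_words)  # word-level view
print(per_comment)                     # comment-level view
```

In this toy set the word counts lean positive (3 against 2), while the per-comment scores are balanced: one positive, one negative and one neutral comment.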

Final Remarks

  • Natural language processing is a well-suited technique for getting insights out of text data
  • Word clouds help us understand the most popular sentiment in the comments of a particular subreddit
  • The distribution of sentiments shows that sentiments are fairly evenly distributed across the various subreddits
  • Topic models can organize the collection of comments according to the discovered themes