Reddit Comments

Background

Reddit is a website that acts as a collection of forums.
Users post about and discuss topics in forum-like communities called subreddits.
For example, in an Apple subreddit, users discuss the latest Apple products.

In our project proposal, we stated the following:

Our primary source of data will be the Reddit API.
We will use R data cleaning libraries such as tidyr and dplyr for data cleaning and manipulation.
We will use natural language processing techniques to sort the Reddit data into relevant categories.
For data exploration, we will use ggplot graphs to show our analysis and its outcomes.

Most of our peers thought that NLP was the best approach for understanding the sentiment of the comments.
Following their suggestions, we added visualizations to reveal more of the patterns in the data.
We also included topic modeling to estimate the probability of a comment belonging to a post.

We used the RedditExtractoR package to fetch data from the Reddit API using the search terms “virus” and “news”.
With these search terms, we fetched comments on posts from 24 subreddits.

Dropped unnecessary columns such as Link, domain, URL and structure, as these play no role in analyzing comments or in predicting upvotes
Using the lubridate library, parsed the post_date and comm_date columns in day-month-year (dmy) format
Checked for and handled null values
Scaled columns to bring them to standardized values
Cleaned comments by removing unneeded characters such as punctuation, numbers, special characters, and HTML tags
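
The cleaning steps above could look roughly like the following. This is a minimal sketch, not the project's actual script: the toy data frame, its values, the lowercase `link` column name, the `comment` and `comment_score` column names, the helper `clean_comment()`, and the choice to drop rows with missing values are all assumptions made for illustration.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for the fetched data (values are made up for illustration).
comments <- data.frame(
  post_date     = c("01-03-2020", "15-03-2020", "02-04-2020"),
  comm_date     = c("02-03-2020", "16-03-2020", NA),
  comment       = c("Stay <b>safe</b>!!! 123", "Great news :)", NA),
  comment_score = c(10, 250, 3),
  link          = c("a", "b", "c"),
  domain        = c("x", "y", "z"),
  URL           = c("u1", "u2", "u3"),
  structure     = c("1", "1_1", "2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip HTML tags, then anything that is not a letter or space.
clean_comment <- function(text) {
  text <- gsub("<[^>]+>", " ", text)     # remove HTML tags
  text <- gsub("[^A-Za-z ]", " ", text)  # remove punctuation, digits, special characters
  text <- gsub("\\s+", " ", text)        # collapse repeated whitespace
  tolower(trimws(text))
}

cleaned <- comments %>%
  select(-link, -domain, -URL, -structure) %>%      # drop columns not used in the analysis
  mutate(post_date = dmy(post_date),
         comm_date = dmy(comm_date)) %>%            # parse day-month-year strings into Dates
  filter(!is.na(comment), !is.na(comm_date)) %>%    # one way to handle nulls: drop incomplete rows
  mutate(score_scaled = as.numeric(scale(comment_score)),  # center and scale the score
         comment = clean_comment(comment))
```

Dropping incomplete rows is only one option for null handling; imputing or flagging them may be preferable depending on how much data is missing.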

We started by creating a function that takes a dataframe as input and returns cleaned tokens, with stop words and numbers removed
We then created word clouds for 4 subreddits to see which words are characteristic of which subreddit, which in turn helps with categorization
Performed sentiment analysis to find the most positive and most negative words in the comment text
Prepared a document-term matrix to describe the frequency of terms that occur in a collection of documents
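
The tokenizing, sentiment, and document-term matrix steps above can be sketched with the tidytext package. This is an illustration under assumptions, not the project's code: the source does not say which packages or sentiment lexicon were used, so tidytext, the Bing lexicon, the toy comments, the `id` column, and the function name `clean_tokens()` are all choices made here. `cast_dtm()` also needs the tm package installed. Word clouds are left out to keep the sketch free of plotting.

```r
library(dplyr)
library(tidytext)

# Toy comments (made up); `id` identifies each comment as one document.
comments <- data.frame(
  id        = 1:3,
  subreddit = c("worldnews", "worldnews", "AskReddit"),
  comment   = c("the virus news is terrible",
                "great work by the 2 doctors",
                "i love this happy thread"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: dataframe in, one row per cleaned token out.
clean_tokens <- function(df) {
  df %>%
    unnest_tokens(word, comment) %>%        # lowercase and split into one word per row
    filter(!grepl("[0-9]", word)) %>%       # drop tokens that contain digits
    anti_join(stop_words, by = "word")      # drop common stop words
}

tokens <- clean_tokens(comments)

# Label each remaining word as positive or negative using the Bing lexicon.
word_sentiment <- tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)

# Document-term matrix: one row per comment, one column per term, counts as values.
dtm <- tokens %>%
  count(id, word) %>%
  cast_dtm(document = id, term = word, value = n)
```

The resulting `tokens` table can also feed a word cloud per subreddit by counting words within each `subreddit` group.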

The data is uploaded to: https://www.dropbox.com/s/hx5yiqkjwc9b4cd/worldnews_test3.csv?dl=0. This link points to a CSV file that was fetched through the Reddit API. We used RedditExtractoR to download the Reddit data; the majority of it comes from the following subreddits.

Number of posts vs years

Number of posts vs days of the week - People post noticeably less on Wednesdays

Most common tokens in subreddit - world news - Data related to world news

Most common tokens in subreddit - ask reddit - Data related to ask reddit

Most common tokens in subreddit - Bangtan - Data related to the South Korean boy group (Entertainment)

Most common tokens in subreddit - NintendoSwitch - Data related to games

Natural language processing is an effective technique for extracting insights from text data.
Word clouds help us understand the most popular sentiment in the comments for a particular subreddit.
The distribution of sentiments shows that sentiments are spread fairly evenly across the various subreddits.
Topic models can organize the collection according to the discovered themes.