Ankita Mitkari, Abhijeet Ray, Aditi Deokar, Aparna Sree
The primary data source will be the Reddit API, which we will access through the RedditExtractoR package. In addition, we will collect supplementary data from Kaggle. RedditExtractoR returns its results as data frames, which we will store as CSV files. For data cleaning and manipulation we will use R libraries such as tidyr and dplyr.
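A minimal sketch of the fetch-and-store step is given here, assuming RedditExtractoR version 3.0 or later and network access. The helper name fetch_subreddit_comments, its arguments, and the output file name are our own illustrative choices and are not part of the package.

```r
# Sketch only: requires the RedditExtractoR package (>= 3.0) and network access.
# `fetch_subreddit_comments` is a hypothetical helper, not a package function.
fetch_subreddit_comments <- function(subreddit, period = "month",
                                     out_file = paste0(subreddit, "_comments.csv")) {
  # Find thread URLs in the subreddit, sorted by top score over the period.
  threads <- RedditExtractoR::find_thread_urls(
    subreddit = subreddit, sort_by = "top", period = period
  )
  # Download each thread; the result is a list holding a `threads`
  # data frame and a `comments` data frame.
  content  <- RedditExtractoR::get_thread_content(threads$url)
  comments <- content$comments
  # Persist the raw comments so later cleaning steps can be rerun offline.
  write.csv(comments, out_file, row.names = FALSE)
  invisible(comments)
}

# Hypothetical usage (not run here):
# apple_comments <- fetch_subreddit_comments("apple")
```

Writing the raw comments to CSV before any cleaning means the later tidyr and dplyr steps can be repeated without calling the API again.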
Reddit is a website that acts as a collection of forums. These forums are called subreddits, and in each one users post about and discuss topics relevant to that forum. For example, in the apple subreddit users discuss the latest products from Apple.
Our goal is to analyse the comments on a post and understand the sentiment of each comment. The eventual goal is to predict the number of upvotes a comment will receive. This matters from a business analytics perspective because upvotes determine how widely a comment is seen.
Companies have recently begun using Reddit comments to advertise their products, which gives them a highly targeted audience. For example, a product manager for a third-party Bluetooth headset would want to reach Apple customers in the “apple” subreddit.
We will fetch Reddit posts and comments and then perform NLP analysis on them. This will give us the features needed to decide whether a particular comment has the potential to gain upvotes and thus be seen by a large user base. Using the Reddit API, we plan to collect thousands of comments from various subreddits and perform sentiment analysis on them. We then plan to predict the number of upvotes that a new comment will receive.
We plan to deconstruct every sentence in the comments into words, count them, and score sentiment at the word level. The idea is to categorize each piece of text into one of three sentiments: positive, negative, or neutral. We will also build a word cloud to see the most frequent words and the prevailing sentiment in the comments. In addition, we plan to visualize trends in the data as a function of the time of the post, the time of the first comment, and so on.
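The word-level scoring described above could be sketched as follows, assuming the dplyr and tidytext packages and the Bing lexicon that ships with tidytext. The toy comments, the column names, and the rule "score above zero is positive, below zero is negative, otherwise neutral" are illustrative assumptions, not a final design.

```r
library(dplyr)
library(tidytext)

# Toy comments standing in for data fetched from Reddit (illustrative only).
comments <- tibble(
  comment_id = c("c1", "c2", "c3"),
  comment = c(
    "I love the new phone, the camera is amazing",
    "Terrible battery and awful screen",
    "It ships on Friday"
  )
)

# Bing lexicon: one row per word, labelled "positive" or "negative".
bing <- get_sentiments("bing")

comment_sentiment <- comments %>%
  unnest_tokens(word, comment) %>%   # one row per lowercased word
  left_join(bing, by = "word") %>%   # words absent from the lexicon get NA
  group_by(comment_id) %>%
  summarise(
    n_words    = n(),
    n_positive = sum(sentiment == "positive", na.rm = TRUE),
    n_negative = sum(sentiment == "negative", na.rm = TRUE),
    .groups    = "drop"
  ) %>%
  mutate(
    score = n_positive - n_negative,
    # Assumed labelling rule: the sign of the net score decides the class.
    label = case_when(
      score > 0 ~ "positive",
      score < 0 ~ "negative",
      TRUE      ~ "neutral"
    )
  )

print(comment_sentiment)
```

The per-comment counts and the net score can serve both as sentiment labels and as candidate features for the upvote model.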
For analysis, we will visualize the data in R using ggplot2 to better understand the distributions of the variables and the relationships among them. We will extract the sentiment of the comments using NLP libraries such as tidytext. We will also examine and plot the correlation coefficients between variables and choose the most informative features for model training. Our main aim is to predict the number of upvotes a comment receives on Reddit, so we will compare several machine learning models, such as linear regression, ridge regression with cross-validation, and random forest regression.
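To evaluate the regression models quantitatively, we will compare them using metrics such as R-squared, Adjusted R-squared, RMSE, and prediction scores. ROC and precision-recall curves apply to classifiers rather than regression models, so we would use them only where a step is framed as classification, such as the positive, negative, or neutral sentiment labels.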
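As an illustration of how the regression metrics could be computed, here is a minimal base R sketch on synthetic data. The feature names sentiment_score and word_count, the simulated coefficients, the train and test split, and the helper regression_metrics are all assumptions made for the example and do not come from our Reddit data.

```r
set.seed(42)

# Synthetic stand-in for comment features; column names are hypothetical.
n <- 200
df <- data.frame(
  sentiment_score = rnorm(n),
  word_count      = rpois(n, 30)
)
df$upvotes <- 5 + 3 * df$sentiment_score + 0.2 * df$word_count + rnorm(n)

# Simple hold-out split: first 150 rows for training, last 50 for testing.
train <- df[1:150, ]
test  <- df[151:200, ]

fit  <- lm(upvotes ~ sentiment_score + word_count, data = train)
pred <- predict(fit, newdata = test)

# RMSE, R-squared, and Adjusted R-squared for a vector of predictions.
regression_metrics <- function(actual, predicted, n_predictors) {
  n      <- length(actual)
  ss_res <- sum((actual - predicted)^2)      # residual sum of squares
  ss_tot <- sum((actual - mean(actual))^2)   # total sum of squares
  r2     <- 1 - ss_res / ss_tot
  c(
    rmse   = sqrt(ss_res / n),
    r2     = r2,
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
  )
}

metrics <- regression_metrics(test$upvotes, pred, n_predictors = 2)
print(round(metrics, 3))
```

The same helper could be applied to the predictions of each candidate model so that all models are compared on the same held-out comments.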