Predicting Reddit Comment Upvotes

1/20/2020

Background

Reddit is a website that acts as a collection of forums
Each forum is called a subreddit, where users post and discuss topics relevant to it
For example, in an Apple subreddit users discuss the latest Apple products

The primary data source will be the Reddit API, and the secondary source will be data collected from Kaggle
We will use the RedditExtractoR package to fetch data from the Reddit API
The fetched data comes back as a data frame, which we will store in CSV files
We will use R libraries such as tidyr and dplyr for data cleaning and manipulation

Our goal is to analyse the comments on a post by understanding the sentiment of each comment
The eventual goal is to predict the number of upvotes a comment will receive
The aim is to serve users better and to help market products

Use data from the Reddit API to perform sentiment analysis and text classification
Fetch Reddit posts and comments and run NLP analysis on them
Decide whether a particular comment has the potential to get upvotes
Break every comment into separate words to get word counts and word-level sentiment
Classify text into one of three sentiments: positive, negative, or neutral
Build a word cloud to see the most common sentiment words in the comments
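
The word-splitting and three-way sentiment steps above can be sketched as follows. This is a minimal, language-agnostic illustration in Python using only the standard library, not the R workflow planned for the project. The tiny POSITIVE and NEGATIVE word sets, the sample comments, and the function names are all hypothetical. A real analysis would use a published sentiment lexicon.

```python
import re
from collections import Counter

# Hypothetical toy lexicon, for illustration only.
POSITIVE = {"great", "love", "good", "amazing"}
NEGATIVE = {"bad", "hate", "terrible", "broken"}


def tokenize(comment):
    """Lowercase a comment and split it into word tokens."""
    return re.findall(r"[a-z']+", comment.lower())


def sentiment_label(comment):
    """Label a comment by the balance of positive and negative words."""
    tokens = tokenize(comment)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def word_counts(comments):
    """Count how often each word appears across all comments."""
    counts = Counter()
    for comment in comments:
        counts.update(tokenize(comment))
    return counts


# Made-up sample comments, not real Reddit data.
comments = [
    "I love the new phone, the camera is great",
    "Battery life is terrible and the screen is broken",
    "It ships next week",
]
labels = [sentiment_label(c) for c in comments]
counts = word_counts(comments)
```

The word counts are the kind of input a word cloud is drawn from, and the labels give the positive, negative, or neutral category for each comment.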

Visualize the data using ggplot2 to better understand distributions and relationships among variables
Understand the sentiment of the comments using NLP libraries like tidytext
Examine and plot correlation coefficients between variables and choose the best factors for model training
Use machine learning models such as linear regression, ridge regression with cross-validation, and random forest regression
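
The correlation step above can be illustrated with a short sketch that ranks candidate features by how strongly they correlate with upvotes. It is written in plain Python for illustration. The feature names (word_count, sentiment_score) and all the numbers are made-up assumptions, not project data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-comment features and upvote counts.
features = {
    "word_count": [5, 12, 8, 20, 3],
    "sentiment_score": [1, -2, 0, 3, -1],
}
upvotes = [10, 4, 7, 25, 2]

# Correlation of each feature with the target.
correlations = {name: pearson(col, upvotes) for name, col in features.items()}

# Features ordered by absolute correlation, strongest first.
ranked = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)
```

Ranking by absolute value keeps strongly negative predictors in consideration alongside strongly positive ones. Correlation captures only linear relationships, so it is a first filter rather than a final feature selection.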

For the classification task, we will use ROC and precision-recall curves to evaluate the model
For the regression models, we will compare them using metrics like R-squared, adjusted R-squared, RMSE and prediction scores
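
The regression metrics named above follow standard definitions, sketched here in plain Python as a minimal illustration. The actual and predicted upvote values are made up, and n_predictors is a hypothetical parameter for the number of features in the model.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r_squared(actual, predicted):
    """Share of variance in the actual values explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(actual, predicted, n_predictors):
    """R-squared penalized for the number of predictors in the model."""
    n = len(actual)
    r2 = r_squared(actual, predicted)
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


# Made-up upvote counts and model predictions.
actual = [3, 5, 7, 9]
predicted = [2, 5, 8, 9]

model_rmse = rmse(actual, predicted)
model_r2 = r_squared(actual, predicted)
model_adj_r2 = adjusted_r_squared(actual, predicted, n_predictors=1)
```

Adjusted R-squared is the more useful figure when comparing models with different numbers of features, since plain R-squared never decreases as predictors are added.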