1/20/2020
Background
- Reddit is a website that works like a collection of forums
- Each forum is called a subreddit, where users post about and discuss topics relevant to that subreddit
- For example, in an Apple subreddit, users discuss the latest Apple products
Data Collection, Preparation, and Cleaning
- The primary data source will be the Reddit API; the secondary source will be data collected from Kaggle
- We will use the RedditExtractoR package to fetch data from the Reddit API
- The fetched data arrives as a data frame, which we will store in CSV files
- We will use R libraries such as tidyr and dplyr for data cleaning and manipulation
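
The cleaning and storage steps above could look roughly like the sketch below. It is a minimal illustration, not the project's actual pipeline: the data frame `raw_comments`, its column names (`comment_id`, `comment`, `upvotes`), and the output file name are all hypothetical stand-ins for whatever the fetch step returns.

```r
library(dplyr)
library(tidyr)

# Hypothetical sample standing in for the data frame returned by the fetch step.
raw_comments <- data.frame(
  comment_id = c(1, 2, 2, 3, 4),
  comment    = c("Love the new iPhone!", "  Battery life is bad ",
                 "  Battery life is bad ", NA, "[deleted]"),
  upvotes    = c(12, 3, 3, 5, 0),
  stringsAsFactors = FALSE
)

clean_comments <- raw_comments %>%
  distinct() %>%                                         # drop exact duplicate rows
  drop_na(comment) %>%                                   # drop rows with missing text
  filter(!comment %in% c("[deleted]", "[removed]")) %>%  # drop placeholder comments
  mutate(comment = tolower(trimws(comment)))             # trim whitespace, lowercase

# Store the cleaned data as CSV (a temporary directory is used here for illustration).
out_file <- file.path(tempdir(), "clean_comments.csv")
write.csv(clean_comments, out_file, row.names = FALSE)
```

Which rows count as noise (deleted comments, duplicates, empty text) is a judgment call that would need to be checked against the real data.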
Problem Description - Step 1
- Our goal is to analyse the comments on a post by understanding the sentiment of each comment
- The eventual goal is to predict the number of upvotes a comment will receive
- The aim is to serve users better and to help market products
Approach - Step 2
- Fetch Reddit posts and comments using the Reddit API
- Perform sentiment analysis and text classification on them using NLP
- Decide whether a particular comment has the potential to get upvotes
- Split every comment into separate words to obtain word counts and word-level sentiment
- Identify and categorize text into three sentiments: positive, negative, or neutral
- Build a word cloud to see the words that dominate the sentiment in the comments
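
As a rough sketch of the word-splitting and three-way labelling steps, the example below tokenizes comments with `unnest_tokens()` from tidytext and scores each comment against a sentiment lexicon. The sample comments and the four-word `lexicon` are hypothetical and exist only to keep the example self-contained; in practice a full lexicon such as the one returned by `get_sentiments("bing")` would presumably be used instead. The scoring rule (positive words minus negative words) is an assumption, not a method stated in the plan.

```r
library(dplyr)
library(tidytext)

# Hypothetical sample comments.
comments <- data.frame(
  comment_id = 1:3,
  comment    = c("I love this great phone",
                 "Terrible battery and a broken screen",
                 "It shipped on Tuesday"),
  stringsAsFactors = FALSE
)

# Hypothetical mini-lexicon; a real lexicon would replace this.
lexicon <- data.frame(
  word      = c("love", "great", "terrible", "broken"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# One row per word; unnest_tokens lowercases and strips punctuation by default.
tokens <- comments %>%
  unnest_tokens(word, comment)

# Word counts across all comments (the input a word cloud would need).
word_counts <- tokens %>%
  count(word, sort = TRUE)

# Score each comment: positive words minus negative words, then label it.
comment_sentiment <- tokens %>%
  left_join(lexicon, by = "word") %>%
  group_by(comment_id) %>%
  summarise(
    score = sum(sentiment == "positive", na.rm = TRUE) -
            sum(sentiment == "negative", na.rm = TRUE)
  ) %>%
  mutate(label = case_when(
    score > 0 ~ "positive",
    score < 0 ~ "negative",
    TRUE      ~ "neutral"
  ))
```

The `word_counts` table has the word and frequency columns that a word cloud function, for example `wordcloud(words, freq)` from the wordcloud package, would take as input.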
Analytics Plan
- Visualize the data using ggplot2 to better understand variable distributions and the relationships among variables
- Determine the sentiment of the comments using NLP libraries such as tidytext
- Compute and plot correlation coefficients between variables and choose the best predictors for model training
- Train machine learning models such as linear regression, cross-validated ridge regression, and random forest regression
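
The modelling steps above might be sketched as follows. Everything about the data is an assumption: the features (`sentiment_score`, `word_count`, `post_score`) and the simulated relationship to `upvotes` are hypothetical placeholders for the real cleaned dataset. The choice of `glmnet` for cross-validated ridge regression and `randomForest` for the random forest is also an assumption, since the plan does not name packages for these models.

```r
library(glmnet)
library(randomForest)

set.seed(42)
n <- 200

# Simulated, hypothetical features standing in for the real dataset.
features <- data.frame(
  sentiment_score = rnorm(n),
  word_count      = rpois(n, 20),
  post_score      = rexp(n, 1 / 50)
)
features$upvotes <- 5 + 3 * features$sentiment_score +
  0.2 * features$word_count + 0.05 * features$post_score + rnorm(n)

# Correlation coefficients between all variables, used to pick predictors.
cor_matrix <- cor(features)

# Simple train/test split.
train_idx <- sample(n, 150)
train <- features[train_idx, ]
test  <- features[-train_idx, ]

# Linear regression.
lm_fit <- lm(upvotes ~ ., data = train)

# Ridge regression (alpha = 0) with lambda chosen by cross-validation.
x_train   <- as.matrix(train[, names(train) != "upvotes"])
x_test    <- as.matrix(test[, names(test) != "upvotes"])
ridge_fit <- cv.glmnet(x_train, train$upvotes, alpha = 0)

# Random forest regression.
rf_fit <- randomForest(upvotes ~ ., data = train, ntree = 200)

# Test-set predictions from all three models, side by side.
preds <- data.frame(
  lm    = predict(lm_fit, newdata = test),
  ridge = as.numeric(predict(ridge_fit, newx = x_test, s = "lambda.min")),
  rf    = predict(rf_fit, newdata = test)
)
```

Keeping the predictions from all models in one data frame makes it straightforward to compute the same metrics for each model on the same held-out rows.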
Evaluation Plan
- For the classification task (whether a comment has the potential to get upvotes), we will use the ROC curve and the precision-recall curve
- For the regression task, we will compare our models using metrics such as R-squared, adjusted R-squared, RMSE, and prediction scores
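
For reference, the metrics named above can be computed in base R as in the sketch below. The helper names (`rmse`, `r_squared`, `adj_r_squared`, `auc`) and the toy vectors are hypothetical. The AUC helper uses the rank-based (Mann-Whitney) formula for the area under the ROC curve, which is one common way to compute it and not necessarily the one the project will use.

```r
# Root mean squared error.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Coefficient of determination.
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# R-squared penalized for the number of predictors.
adj_r_squared <- function(actual, predicted, n_predictors) {
  n <- length(actual)
  1 - (1 - r_squared(actual, predicted)) * (n - 1) / (n - n_predictors - 1)
}

# Area under the ROC curve via the rank-sum formula (labels are 0/1).
auc <- function(labels, scores) {
  r     <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy regression example.
actual    <- c(3, 5, 7, 9)
predicted <- c(2, 5, 8, 9)
rmse(actual, predicted)              # sqrt(0.5), about 0.707
r_squared(actual, predicted)         # 0.9
adj_r_squared(actual, predicted, 1)  # 0.85

# Toy classification example.
auc(c(1, 0, 1, 0), c(0.9, 0.4, 0.6, 0.7))  # 0.75
```

Adjusted R-squared is the fairer number when the compared models use different numbers of predictors, since plain R-squared never decreases as predictors are added.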