1/20/2020

Background

  • Reddit is a website that acts as a collection of forums
  • Users post about and discuss topics within forum-like communities called subreddits
  • For example, in an Apple subreddit, users discuss the latest Apple products

Data Collection, Preparation, and Cleaning

  • The primary data source will be the Reddit API; the secondary source will be datasets collected from Kaggle
  • We will use the RedditExtractoR package to fetch data from the Reddit API
  • RedditExtractoR returns the fetched data as data frames, which we will store in CSV files
  • We will use R libraries such as tidyr and dplyr for data cleaning and manipulation
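
The cleaning and CSV storage step can be illustrated with a minimal sketch. The plan itself uses R (tidyr, dplyr); the Python below only uses the standard library to show the idea. The field names `comment` and `upvotes`, the helper names, and the sample rows are assumptions made for illustration, not the actual RedditExtractoR output.

```python
import csv
import io
import re


def clean_comments(rows):
    """Normalize whitespace and drop empty, placeholder, or duplicate comments."""
    seen = set()
    cleaned = []
    for row in rows:
        # Collapse runs of whitespace; treat a missing comment as empty text.
        text = re.sub(r"\s+", " ", (row.get("comment") or "")).strip()
        # Reddit shows these placeholders for deleted or removed comments.
        if not text or text in ("[deleted]", "[removed]"):
            continue
        if text in seen:
            continue
        seen.add(text)
        cleaned.append({"comment": text, "upvotes": int(row.get("upvotes") or 0)})
    return cleaned


def to_csv(rows):
    """Serialize cleaned rows to CSV text (a file handle would work the same way)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["comment", "upvotes"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample rows standing in for fetched data.
raw_rows = [
    {"comment": "  Love the new   phone! ", "upvotes": "12"},
    {"comment": "[deleted]", "upvotes": "0"},
    {"comment": "Love the new phone!", "upvotes": "3"},
    {"comment": None, "upvotes": "1"},
    {"comment": "Battery life is terrible", "upvotes": "5"},
]
clean_rows = clean_comments(raw_rows)
csv_text = to_csv(clean_rows)
```

Keeping the first occurrence of a duplicate comment is one possible choice; a real pipeline might instead deduplicate on a comment identifier.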

Problem Description - Step 1

  • Our goal is to analyze the comments on a post by understanding the sentiment of each comment
  • The eventual goal is to predict the number of upvotes a comment will receive
  • The aim is to serve users better and to support product marketing

Approach - Step 2

  • Use data from the Reddit API to perform sentiment analysis and text classification
  • Fetch Reddit posts and comments and perform NLP analysis on them
  • Decide whether a particular comment has the potential to get upvotes
  • Break each comment into separate words to obtain word counts and word-level sentiment
  • Identify and categorize text into three sentiments: positive, negative, or neutral
  • Build a word cloud to see the most frequent words and the dominant sentiment in the comments
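
The tokenization and three-way sentiment steps above can be sketched as follows. This is a toy lexicon-based approach in Python, not the project's R implementation; the word lists `POSITIVE` and `NEGATIVE`, the function names, and the sample comments are hypothetical stand-ins for a real sentiment lexicon and real data.

```python
import re
from collections import Counter

# Tiny hand-made lexicon, assumed for illustration only.
POSITIVE = {"love", "great", "good", "amazing"}
NEGATIVE = {"hate", "terrible", "bad", "awful"}


def tokenize(comment):
    """Lowercase a comment and split it into word tokens."""
    return re.findall(r"[a-z']+", comment.lower())


def sentiment_score(tokens):
    """Count +1 for each positive word and -1 for each negative word."""
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)


def classify(comment):
    """Map the score to one of three labels."""
    score = sentiment_score(tokenize(comment))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up comments.
comments = [
    "I love the new phone, the camera is great",
    "Terrible battery life, I hate it",
    "It ships next week",
]
labels = [classify(c) for c in comments]

# Word frequencies across all comments, the usual input to a word cloud.
word_counts = Counter(t for c in comments for t in tokenize(c))
```

A word-level score like this ignores negation ("not good") and sarcasm, so it is best read as a baseline rather than a final classifier.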

Analytics Plan

  • Visualize the data using ggplot2 to better understand distributions and relationships among variables
  • Understand the sentiment of the comments using NLP libraries such as tidytext
  • Examine and plot correlation coefficients between variables and choose the best predictors for model training
  • Use machine learning models such as linear regression, ridge regression with cross-validation (Ridge CV), and random forest regression
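
As a rough illustration of the correlation step, the sketch below ranks candidate predictors by the absolute Pearson correlation with the upvote count. The feature names `sentiment_score` and `word_count` and all the numbers are made up for the example; the actual feature set is not specified in the plan.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-comment features and the target (upvotes).
features = {
    "sentiment_score": [2, -1, 0, 3, -2],
    "word_count": [12, 30, 8, 15, 22],
}
upvotes = [10, 2, 4, 14, 1]

# Rank features by strength of linear association with the target.
ranked = sorted(
    features,
    key=lambda name: abs(pearson(features[name], upvotes)),
    reverse=True,
)
```

Correlation only captures linear association, so a feature with a low coefficient may still be useful to a non-linear model such as a random forest.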

Evaluation Plan

  • We will use ROC and precision-recall curves to evaluate the classification of whether a comment is likely to get upvotes
  • We will use metrics such as R-squared, adjusted R-squared, RMSE, and prediction scores to compare our regression models
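
The regression metrics named above follow standard definitions, sketched here in Python for concreteness. The `actual` and `predicted` values and the choice of two predictors are assumptions for the example, not project results.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalize R-squared for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Made-up upvote counts and model predictions.
actual = [10, 2, 4, 14, 1]
predicted = [9, 3, 5, 12, 2]

model_rmse = rmse(actual, predicted)
model_r2 = r_squared(actual, predicted)
model_adj_r2 = adjusted_r_squared(model_r2, n_samples=len(actual), n_predictors=2)
```

Adjusted R-squared is the more useful of the two when comparing models with different numbers of predictors, since plain R-squared never decreases as predictors are added.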