Project Introduction

Our assigned API was the Reddit API. Reddit is a social media platform revolving around text data, often marketed as “the front page of the internet”. Content is driven by user interaction such as upvotes and comments.
On January 27, 2021, GameStop's stock price jumped to about $350, up from about $20 a few weeks earlier. This jump caused losses for some US firms and led many brokerage firms to halt trading of the stock on their platforms.
Many analysts claimed that the subreddit wallstreetbets, where many retail investors analyze and discuss stocks, often using profane jargon, was responsible for the price spike. Analysts asserted that the price increases these “meme” stocks witnessed were not due to a change in business fundamentals or to an approaching earnings call, but entirely due to sentiment.

Original data analytics plan based on your project proposal

In the project proposal we stated that we would do data cleansing using tidyverse and data visualization using ggplot2, and that we would perform NLP analysis on the text data. For the model, we planned to design an RNN (recurrent neural network), CNN (convolutional neural network), or ANN (artificial neural network).
In the final project we tried five RNN-based models to find the best fit.

Peer Summary comments

Which library will you be using for web scraping? We are using RedditExtractoR for web scraping.
What specific visualization techniques do you plan on using? We will plot the data using the ggplot2 library.
If you’re planning to go for CNN, use GloVe embeddings and experiment with pooling techniques. Yes, we will tokenize and clean the data and then, if needed, perform word embedding.
Your posts are basically posts and texts, how are you planning to clean them? Since the features are text data, we will tokenize the data, clean symbols and punctuation, remove words such as “and” and “the” from the post comments and titles (keeping only the important words), and then perform one-hot encoding or embedding on the resulting dataset.
What is your prominent indicator after text processing? What is its correlation with your popularity metric? The prominent indicator is the title data, and we will use comments, upvotes, and downvotes to determine post popularity. Since the indicator is text, computing a correlation coefficient would be meaningless. But we know that popularity is correlated with upvotes, downvotes, and comments: popular posts have many upvotes and more comments.
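
The cleaning steps described in the answers above (tokenizing, stripping symbols and punctuation, dropping common words) can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the project's actual pipeline. The `STOP_WORDS` set and the `clean_title` helper are hypothetical names, and the stop-word list is deliberately tiny.

```python
import re

# Hypothetical, deliberately small stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in"}

def clean_title(title):
    """Lowercase a title, strip symbols and punctuation, and drop stop words."""
    # Replace everything except letters, digits, and whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", title.lower())
    # Split on whitespace and keep only the informative tokens.
    return [tok for tok in text.split() if tok not in STOP_WORDS]

print(clean_title("GME to the moon!!! $$$ Buy and HOLD"))
# prints ['gme', 'moon', 'buy', 'hold']
```

The resulting token lists are what a later one-hot encoding or embedding step would consume.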

Data Collection

Data was extracted using the RedditExtractoR package, and the CSV files for each subreddit were collated and uploaded.
Data from the wallstreetbets subreddit was pulled from the top, hot, rising, and new listings, with the period set to “All”, month, and year. We mainly scraped data from wallstreetbets because our goal is to predict popular posts in that subreddit.
Due to the nature of the API and its rate limits, scraping large amounts of data takes a long time. In addition, we wanted to scrape posts while they were new and observe their organic growth. To achieve this, we scraped sets of posts, roughly 177,376 in total. We then waited about four days from the first scrape of a post and scraped it again to see how many upvotes, comments, awards, etc. it had gained in that period. We also used Python code to extract data from the wallstreetbets subreddit, as there were some limitations in extracting the data for a single subreddit in R. Finally, we made sure to drop any duplicate threads so that each row is unique.
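
The two-pass scrape and the de-duplication described above can be sketched as follows. This is an illustration under assumptions, not the project's code: the field names `id`, `score`, and `num_comments`, the helpers `drop_duplicates` and `growth`, and the sample rows are all made up for the example.

```python
def drop_duplicates(posts):
    """Keep the first occurrence of each post id so every row is unique."""
    seen = set()
    unique = []
    for post in posts:
        if post["id"] not in seen:
            seen.add(post["id"])
            unique.append(post)
    return unique

def growth(first_scrape, second_scrape):
    """Per-post change in score and comment count between the two scrapes."""
    later = {post["id"]: post for post in second_scrape}
    changes = {}
    for post in first_scrape:
        after = later.get(post["id"])
        if after is None:
            continue  # post was not found on the second pass
        changes[post["id"]] = {
            "score": after["score"] - post["score"],
            "num_comments": after["num_comments"] - post["num_comments"],
        }
    return changes

# Made-up sample rows for illustration only.
first = drop_duplicates([
    {"id": "p1", "score": 3, "num_comments": 1},
    {"id": "p1", "score": 3, "num_comments": 1},  # duplicate thread
    {"id": "p2", "score": 10, "num_comments": 4},
])
second = [
    {"id": "p1", "score": 250, "num_comments": 80},
    {"id": "p2", "score": 12, "num_comments": 5},
]
print(growth(first, second))
```

Keying the second scrape by post id keeps the comparison linear in the number of posts, which matters at the scale of the dataset described here.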

Original Dataset

The dataset has roughly 177,376 rows and about 9 variables.

Dataset

Data Cleaning

We used the textclean package to replace emojis, emoticons, kerned (letter-spaced) text, money amounts, and non-ASCII characters in the dataset obtained from web scraping.
The next step was to add a column marking which posts are popular. We decided to mark any post with upvotes_ratio above 0.1 and more than 50 comments as popular. Popular posts have a value of 1 in the ‘Popular’ column, and posts that are not popular have 0.
When we compare the total number of popular and non-popular posts, we can see that the data is not skewed.
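
The labeling rule and the balance check can be sketched as follows. The thresholds are the ones stated above; the `num_comments` field name, the `label_popular` helper, and the sample rows are assumptions made for illustration.

```python
from collections import Counter

def label_popular(post, min_ratio=0.1, min_comments=50):
    """Return 1 if the post exceeds both thresholds, else 0."""
    is_popular = (
        post["upvotes_ratio"] > min_ratio
        and post["num_comments"] > min_comments
    )
    return 1 if is_popular else 0

# Made-up sample rows for illustration only.
posts = [
    {"upvotes_ratio": 0.95, "num_comments": 120},
    {"upvotes_ratio": 0.80, "num_comments": 12},
    {"upvotes_ratio": 0.05, "num_comments": 300},
    {"upvotes_ratio": 0.60, "num_comments": 51},
]
labels = [label_popular(p) for p in posts]

# Class counts show at a glance whether the two classes are balanced.
print(Counter(labels))
```

Keeping the thresholds as parameters makes it easy to re-run the balance check with a different cut-off.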
The last piece of information we can look at is about the users who submitted the posts. We created a dataframe of the users who created the posts we scraped, with the total number of posts each of them created during our scraping period.
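
A per-user post count of this kind can be sketched as below. The `author` field name and the `posts_per_user` helper are hypothetical, and the sample rows are made up.

```python
from collections import Counter

def posts_per_user(posts):
    """Count how many scraped posts each author submitted."""
    return Counter(post["author"] for post in posts)

# Made-up sample rows for illustration only.
sample = [{"author": "user_a"}, {"author": "user_b"}, {"author": "user_a"}]
counts = posts_per_user(sample)
print(counts.most_common())
# prints [('user_a', 2), ('user_b', 1)]
```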

Data Exploration

Here the total counts of popular and non-popular posts are visualized as a ggplot bar chart.

Data Exploration

link_flair_css_class is visualized against the upvote ratio column. Posts with a ratio above 0.5 are treated as popular and the rest as non-popular. The 0.5 threshold on the upvote ratio was chosen to balance the data.

Link flair class and Upvote ratio

link_flair_css_class is visualized against the number-of-crossposts column, and a bar chart is plotted.

Link flair class and Cross post

Data Exploration

A category-wise bar chart is plotted for popular vs non-popular posts, with link_flair_css_class as the fill color.

Popular vs non-popular posts with link_flair_css_class as the fill

Preparing the data for the model

Next we prepare the data for the model. The data was split into 80% for training and 20% for testing. Since it is text data, we need to tokenize it and create embeddings. We used GloVe, a pretrained embedding technique. Because reading the entire glove_50d file at once produced an error, we read it 40000 rows per iteration and repeated this until the whole file was processed.
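
Reading the embedding file in fixed-size chunks, as described above, can be sketched as follows. This is a Python illustration under assumptions, not the project's code: the helpers `read_glove_in_chunks` and `build_embedding_matrix` are hypothetical, the file is assumed to hold one word followed by its space-separated vector values per line, and a tiny in-memory stand-in replaces the real file.

```python
import io
from itertools import islice

def read_glove_in_chunks(handle, wanted_words, chunk_size=40000):
    """Read a GloVe-style text file chunk_size lines at a time, keeping only wanted words."""
    vectors = {}
    while True:
        chunk = list(islice(handle, chunk_size))
        if not chunk:
            break  # end of file
        for line in chunk:
            parts = line.rstrip().split(" ")
            word = parts[0]
            if word in wanted_words:
                vectors[word] = [float(x) for x in parts[1:]]
    return vectors

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector for the word with index i; unknown words stay all-zero."""
    # Row 0 is left as zeros and reserved for padding.
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix

# Tiny 3-dimensional stand-in for the real embedding file, for illustration only.
fake_glove = io.StringIO("gme 0.1 0.2 0.3\nmoon 0.4 0.5 0.6\nhold 0.7 0.8 0.9\n")
word_index = {"gme": 1, "moon": 2, "yolo": 3}  # hypothetical tokenizer vocabulary
vectors = read_glove_in_chunks(fake_glove, set(word_index), chunk_size=2)
matrix = build_embedding_matrix(word_index, vectors, dim=3)
print(matrix)
```

Keeping only the words that occur in the tokenizer vocabulary limits memory use, since most rows of a pretrained embedding file are never needed for a single dataset.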
We used the keras library for most of the modeling.
We used the following models for the analysis:

Simple RNN

Long Short-Term Memory (LSTM)

RNN Gated Recurrent Units (GRU)

Bidirectional LSTM

Bidirectional LSTM with more dense nodes
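
To make the first model in this list concrete, the sketch below shows the recurrence a simple RNN layer computes, followed by a sigmoid output unit for the popular/non-popular decision. It is written in plain Python for illustration and is not the keras code used in the project. The helper names and all weights are arbitrary illustrative values, not trained parameters. LSTM and GRU layers follow the same pattern but add gates that control how much of the previous state is kept.

```python
import math

def simple_rnn_forward(inputs, w_x, w_h, bias):
    """Run h_t = tanh(W_x x_t + W_h h_(t-1) + b) over a sequence; return the last state."""
    hidden = [0.0] * len(bias)
    for x_t in inputs:
        new_hidden = []
        for j in range(len(bias)):
            total = bias[j]
            total += sum(w_x[j][i] * x_t[i] for i in range(len(x_t)))
            total += sum(w_h[j][k] * hidden[k] for k in range(len(hidden)))
            new_hidden.append(math.tanh(total))
        hidden = new_hidden
    return hidden

def predict_popular(hidden, w_out, b_out):
    """Sigmoid output unit: a value in (0, 1) read as the probability of 'popular'."""
    z = b_out + sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-z))

# Two time steps of 2-dimensional word vectors and 2 hidden units (arbitrary values).
sequence = [[0.1, 0.2], [0.4, 0.5]]
w_x = [[0.5, -0.3], [0.2, 0.8]]
w_h = [[0.1, 0.0], [0.0, 0.1]]
bias = [0.0, 0.0]

last_state = simple_rnn_forward(sequence, w_x, w_h, bias)
prob = predict_popular(last_state, w_out=[1.0, -1.0], b_out=0.0)
print(last_state, prob)
```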

Model Prediction and Result

To find the best model, we applied five models to our dataset and compared their fit on both the training and validation data.
We also fine-tuned the models by changing the model parameters and the number of dense nodes to obtain the best fit.

Accuracy of Training and Validation Sample

Simple RNN Prediction Result

Model 1 Prediction Result

Long Short-Term Memory Prediction Result

Model 2 Prediction Result

RNN Gated Recurrent Units Model Prediction Result

Model 3 Prediction Result

Bidirectional LSTM

Model 4 Prediction Result

Bidirectional LSTM with more dense nodes Model Prediction Result

Model 5 Prediction Result

Key takeaways from findings

Based on the models we ran on the training and testing data sets, we feel we have achieved quite a good model. The Bidirectional LSTM and Simple RNN models fit the training data best, while the RNN with Gated Recurrent Units (GRU) performed best on the validation data. Taking training and validation together, the GRU model is the best overall, with an accuracy of 95% on the training data and 58.30% on the validation data. These results suggest that RNNs, especially with GRU layers, can be quite useful in tackling NLP problems.

Further recommendations

To improve the predictions further, we could extract the body text of each post in addition to its title.
More features could be included when classifying a post as popular, in addition to the upvote ratio and the number of comments.
Using all of the available data, that is, combining text features (post text, title, and comments) with numeric features (number of upvotes and number of comments), should give a better result.

CIS 8398 - Final Project - Reddit Wallstreet Post Popularity Prediction

Team Members
