Group 12
2022-12-07
Punam Shet
Priyadharshini Kasi
Nikkhil Matthew
Our assigned API was the Reddit API. Reddit is a social media platform built largely around text content, often marketed as “the front page of the internet”. Content visibility is driven by user interaction such as upvotes and comments.
On January 27, 2021, GameStop's stock price jumped to roughly $350, up from about $20 a few weeks earlier. This jump caused losses for some US firms and led many brokerage firms to halt trading of the stock on their platforms.
Many analysts claimed that the subreddit wallstreetbets, where many retail investors analyze and discuss stocks, often using profane jargon, was responsible for the price spike. They asserted that the price increases these “meme” stocks witnessed were not due to a change in business fundamentals or an approaching earnings call, but were driven entirely by sentiment.
In our project proposal we stated that we would do data cleansing using tidyverse and data visualization using ggplot2, perform NLP analysis on the text data, and try to design an RNN (recurrent neural network), CNN (convolutional neural network), or ANN (artificial neural network) model.
For the final model we tried five RNN-based architectures and selected the one with the best fit.
Which library will you be using for web scraping? We are using RedditExtractoR for web scraping.
What specific visualization techniques do you plan on using? We will plot the data using the ggplot2 library.
If you’re planning to go for CNN, use GloVe embeddings and experiment with pooling techniques. Yes, we will tokenize and clean the data and then, if needed, perform word embedding.
Your posts are basically posts and texts, how are you planning to clean them? Since the features are text data, we will tokenize the data, remove symbols and punctuation, remove stop words such as “and” and “the” from post comments and titles (keeping only the important words), and then perform one-hot encoding or embedding on the resulting dataset.
What is your prominent indicator after text processing? What is its correlation with your popularity metric? The prominent indicator is the title text, and we will use comments, upvotes, and downvotes to determine post popularity. Because the indicator is text, computing a correlation coefficient against it would be meaningless. We do know, however, that popularity is tied to upvotes, downvotes, and comments: popular posts have many upvotes and more comments.
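The cleaning steps described in the answers above (tokenizing, stripping symbols and punctuation, dropping stop words, then encoding) might look roughly like the following. This is a minimal Python sketch rather than the project's actual pipeline: the stop-word list is a deliberately tiny assumption, and the function names `clean_title` and `build_vocab` are hypothetical.

```python
import re

# Assumed, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "on", "for"}

def clean_title(title):
    """Lowercase a title, strip symbols/punctuation, and drop stop words."""
    # Replace anything that is not a lowercase letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", title.lower())
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def build_vocab(token_lists):
    """Map each distinct token to an integer index, reserving 0 for padding."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab

# Made-up example titles, not taken from the scraped data.
titles = ["GME to the moon!!!", "Is the squeeze over? $GME and AMC"]
tokens = [clean_title(t) for t in titles]
vocab = build_vocab(tokens)

# Integer sequences like these are what an embedding layer would consume.
encoded = [[vocab[tok] for tok in toks] for toks in tokens]
```

Reserving index 0 for padding keeps the sequences compatible with the usual practice of padding titles to a common length before feeding them to a recurrent model.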
Data was extracted using the RedditExtractoR package, and the CSV files for each subreddit were collated and uploaded.
We pulled data from the WallStreetBets subreddit using the top, hot, rising, and new listings with periods of “all”, month, and year. We scraped mainly from WallStreetBets because our goal is to predict popular posts in that subreddit.
Due to the nature of the API and its rate limits, scraping large amounts of data takes a long time. In addition, we wanted to scrape posts while they were new and observe their organic growth. To achieve this we scraped roughly 177,376 posts. We then waited about four days from the first scrape of a post and scraped it again to see how many upvotes, comments, awards, etc. it had gained in that period. We also used Python code to extract data from WallStreetBets, as there were some limitations in extracting data for a single subreddit in R. Finally, we dropped any duplicate threads so that each row is unique.
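The de-duplication and the "scrape, wait, scrape again" comparison described above could be sketched as follows. This is only an illustration under assumptions: the field names `id`, `upvotes`, and `comments` and the helper names are hypothetical, and the rows are made-up values rather than scraped data.

```python
def drop_duplicates(rows, key="id"):
    """Keep the first row seen for each post id."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

def growth(first_scrape, second_scrape, fields=("upvotes", "comments")):
    """Per-post change in each field between the two scrapes."""
    later = {row["id"]: row for row in second_scrape}
    deltas = {}
    for row in first_scrape:
        if row["id"] in later:  # skip posts missing from the second scrape
            deltas[row["id"]] = {
                f: later[row["id"]][f] - row[f] for f in fields
            }
    return deltas

# Made-up rows: the first scrape contains a duplicate thread.
first = [
    {"id": "a1", "upvotes": 3, "comments": 0},
    {"id": "a1", "upvotes": 3, "comments": 0},
    {"id": "b2", "upvotes": 10, "comments": 2},
]
# The same posts, scraped again about four days later.
second = [
    {"id": "a1", "upvotes": 250, "comments": 61},
    {"id": "b2", "upvotes": 12, "comments": 2},
]

first = drop_duplicates(first)
deltas = growth(first, second)
```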
Dataset
We used the textclean package to replace emojis, emoticons, kerning, money amounts, and non-ASCII characters in the dataset obtained from web scraping.
The next step was to add a column indicating which posts are popular. We decided to mark any post with an upvote ratio above 0.1 and more than 50 comments as popular. Popular posts have a value of 1 in the ‘Popular’ column, and posts that are not popular have 0.
When we compare the total number of popular posts with the number of non-popular posts, we can see that the data is not skewed.
The last piece of information we can look at is the users who submitted the posts. We created a dataframe of the users behind the posts we scraped, with the total number of posts each of them created during our scraping period.
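The labelling rule and the balance check described above can be expressed in a few lines. The sketch below is an assumption-laden illustration: the thresholds come from the text, but the field names `upvote_ratio` and `num_comments` and the example rows are hypothetical.

```python
from collections import Counter

# Thresholds as stated in the text.
UPVOTE_RATIO_MIN = 0.1
COMMENTS_MIN = 50

def label_popular(post):
    """Return 1 if the post clears both thresholds, otherwise 0."""
    return int(
        post["upvote_ratio"] > UPVOTE_RATIO_MIN
        and post["num_comments"] > COMMENTS_MIN
    )

# Made-up posts for illustration only.
posts = [
    {"title": "GME DD", "upvote_ratio": 0.95, "num_comments": 412},
    {"title": "first post", "upvote_ratio": 0.60, "num_comments": 3},
    {"title": "downvoted", "upvote_ratio": 0.05, "num_comments": 120},
    {"title": "AMC YOLO", "upvote_ratio": 0.88, "num_comments": 51},
]

for post in posts:
    post["Popular"] = label_popular(post)

# Class counts show whether one label dominates the dataset.
class_counts = Counter(post["Popular"] for post in posts)
```

Counting the two labels this way is a quick check on skew before training, since a heavily imbalanced target would make plain accuracy a misleading metric.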
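The per-user summary described above amounts to a grouped count. A minimal sketch, assuming each scraped row carries an `author` field (a hypothetical name) and using made-up rows:

```python
from collections import Counter

# Made-up rows; only the author field matters for this summary.
posts = [
    {"author": "user_a", "title": "GME DD"},
    {"author": "user_b", "title": "AMC YOLO"},
    {"author": "user_a", "title": "GME update"},
    {"author": "user_a", "title": "loss post"},
]

# Total number of posts each user created during the scraping window.
post_counts = Counter(post["author"] for post in posts)

# One row per user, most active first, mirroring a users dataframe.
user_table = [
    {"author": author, "post_count": count}
    for author, count in post_counts.most_common()
]
```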
Popular vs. Non-Popular Posts
Link flair class and number of comments
Link flair class and upvote ratio
Link flair class and crosspost
Popular vs. non-popular posts, with link_flair_css_class as the fill
Next we prepared the data for our model. The data was split into 80% for training and 20% for testing. Since it is text data, we need to tokenize it and create embeddings. Here we used GloVe, a pretrained embedding technique. Because reading the entire glove_50d file at once produced an error, we read it in chunks of 40,000 rows per iteration and iterated over the whole file.
We used the keras library for most of the modeling.
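Reading the GloVe file in fixed-size chunks, as described above, might be done along these lines. This is a sketch under assumptions, not the code used in the project: the helper names are hypothetical, and a tiny three-dimensional stand-in file replaces the real 50-dimensional GloVe file so that the example is self-contained.

```python
import itertools
import os
import tempfile

def read_glove_in_chunks(path, chunk_size=40000):
    """Yield lists of (word, vector) pairs, at most chunk_size lines each."""
    with open(path, encoding="utf-8") as handle:
        while True:
            lines = list(itertools.islice(handle, chunk_size))
            if not lines:
                break
            chunk = []
            for line in lines:
                parts = line.rstrip().split(" ")
                chunk.append((parts[0], [float(x) for x in parts[1:]]))
            yield chunk

def build_embedding_matrix(path, vocab, dim, chunk_size=40000):
    """Row i holds the vector for the word with index i; unknowns stay zero."""
    matrix = [[0.0] * dim for _ in range(len(vocab) + 1)]
    for chunk in read_glove_in_chunks(path, chunk_size):
        for word, vector in chunk:
            if word in vocab:
                matrix[vocab[word]] = vector
    return matrix

# Tiny stand-in file in GloVe's text format: a word, then its vector values.
sample = "gme 0.1 0.2 0.3\nmoon 0.4 0.5 0.6\nstock 0.7 0.8 0.9\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)

# Hypothetical tokenizer vocabulary; index 0 is reserved for padding.
vocab = {"gme": 1, "moon": 2, "yolo": 3}
matrix = build_embedding_matrix(tmp.name, vocab, dim=3, chunk_size=2)
os.remove(tmp.name)
```

Only vectors for words in the tokenizer's vocabulary are kept, so memory use is bounded by the vocabulary size plus one chunk rather than by the whole embedding file.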
We used the following models for the analysis:
Simple RNN
Long Short-Term Memory (LSTM)
Gated Recurrent Unit (GRU)
Bidirectional LSTM
Bidirectional LSTM with more dense nodes
To find the best model, we fitted all five models to our dataset and compared their fit on both the training and validation data.
We also fine-tuned the models by changing the model parameters and the number of dense nodes to obtain the best fit.
Accuracy of Training and Validation Sample
Model 1 Prediction Result
Model 2 Prediction Result
Model 3 Prediction Result
Model 4 Prediction Result
Model 5 Prediction Result
To improve the predictions further, we could extract the body text of each post in addition to its title.
More features could be included, in addition to upvote ratio and number of comments, when classifying a post as popular.
Using all of the available data together, that is, combining text features (post text, title, and comments) with numeric features (number of upvotes and number of comments), should give a better result.