Analyzing Reddit Sentiment:

A Comparative Study of R vs Python Using Sentiment Analysis and Machine Learning


Project Motivation:

A Reddit post I made in the r/RStudio subreddit, titled “People who use R of Reddit, why not just use Python?”, unexpectedly went viral, amassing nearly 50,000 views and hundreds of comments. It is safe to say this is a topic of great interest in the data community.

Inspired by a professor’s advice to learn Python, and for the sheer humor of it, I decided to analyze why people choose one language over the other using, ironically, R.

This project highlights my ability to manage the entire data pipeline, from scraping and cleaning data to performing sentiment analysis and building machine learning models. Drawing on nearly two years of data science experience at UCLA, I plan to complete it in three weeks. It is meant to serve as a testament to my versatility and readiness for any data-driven role.


General Overview:


Project Objective:

This project analyzes user sentiment in the Reddit comments on an R-related post to determine how positive and negative sentiment correlates with mentions of R and Python. By applying sentiment analysis techniques and two machine learning models, k-Nearest Neighbors (kNN) and Naive Bayes, it seeks to classify the overall sentiment of the community and identify trends related to each programming language. The project covers data collection, preprocessing, exploratory analysis, and model comparison, providing structured, data-driven insight into user preferences.

Week 1: Data Collection and Preparation

  • Goal: Scrape, clean, and prepare Reddit data.
  • Tasks:
    • Scrape Reddit comments using RedditExtractoR.
    • Clean and preprocess data with tidytext (tokenization, stop word removal, stemming/lemmatization).
    • Perform Exploratory Data Analysis (EDA) using dplyr and ggplot2 to visualize word frequency and metadata.

Week 2: Sentiment Analysis

  • Goal: Apply sentiment analysis and explore sentiment polarity.
  • Tasks:
    • Use syuzhet and tidytext for sentiment scoring (Bing, NRC, AFINN).
    • Visualize sentiment distribution and compare R vs Python mentions using ggplot2.
    • Cluster comments by sentiment using k-means or other methods.

Week 3: Machine Learning Models (kNN & Naive Bayes)

  • Goal: Build, analyze, and compare kNN and Naive Bayes classifiers.
  • Tasks:
    • Implement Naive Bayes (naivebayes) and kNN (class) classifiers on labeled sentiment data.
    • Compare model performance (accuracy, precision, recall) and plot ROC curves using pROC.
    • Present findings through visualizations using ggplot2.

Day-by-day Schedule:


Week 1: Data Collection and Preparation

Goal: Collect Reddit data, clean it, and prepare it for analysis.

Day 1: Scrape Reddit Data
- Use the RedditExtractoR library to extract comments from the R-related Reddit post, along with metadata such as upvotes, timestamps, and user information.
- Resources:
  - RedditExtractoR Documentation

Day 2-3: Clean and Preprocess Data
- Tokenize the text, remove stop words, punctuation, and irrelevant text using tidytext.
- Apply stemming or lemmatization to reduce words to their base form.
- Resources:
  - tidytext Documentation
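
The cleaning steps above can be sketched as follows. This is a minimal illustration, not the project's final code: the `comments` tibble is a hypothetical stand-in for the scraped data, and SnowballC is an assumed choice for stemming (the plan does not name a stemming package). Some stop word lists contain single letters, so the sketch explicitly protects the token "r"; otherwise every mention of the language could be dropped.

```r
library(dplyr)
library(tidytext)
library(SnowballC)  # assumption: used here only for wordStem()

# Hypothetical stand-in for the scraped comments
comments <- tibble(
  comment_id = 1:3,
  comment = c(
    "I love R for statistics and plotting!",
    "Python is better for production pipelines.",
    "Why not use both R and Python?"
  )
)

# Keep "r" even if a stop word lexicon lists it as a single letter
stop_words_keep_r <- stop_words %>% filter(word != "r")

tokens <- comments %>%
  unnest_tokens(word, comment) %>%               # lowercases and strips punctuation
  anti_join(stop_words_keep_r, by = "word") %>%  # remove common English stop words
  mutate(stem = wordStem(word))                  # Porter stemming

tokens
```

Lemmatization would need a different package; stemming is shown only because it is the simpler of the two options listed.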

Day 4-5: Exploratory Data Analysis (EDA)
- Use dplyr to summarize your dataset (e.g., comment length, word frequency).
- Visualize word frequency, comment length, and metadata using ggplot2.
- Resources:
  - dplyr Documentation
  - ggplot2 Documentation
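
A word-frequency summary and plot might look like the sketch below. The `tokens` tibble is hypothetical toy data standing in for the cleaned, tokenized comments; the column name `word` is an assumption.

```r
library(dplyr)
library(ggplot2)

# Hypothetical tokenized comments (one row per word)
tokens <- tibble(
  word = c("r", "python", "r", "pandas", "ggplot2", "r", "python")
)

# Count occurrences of each word, most frequent first
word_counts <- tokens %>%
  count(word, sort = TRUE)

# Horizontal bar chart of the ten most frequent words
p <- ggplot(slice_head(word_counts, n = 10),
            aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Occurrences", title = "Most frequent words")

print(p)
```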

DataCamp Course Recommendations:
- Already Taken: Data Manipulation with dplyr
- New: Introduction to Text Analysis in R


Week 2: Sentiment Analysis

Goal: Apply sentiment analysis to determine if Reddit comments lean toward R or Python and how positive/negative sentiments correlate with each.

Day 1-2: Perform Sentiment Analysis
- Use the syuzhet package for sentiment scoring and extract emotional valence (positive/negative).
- Apply tidytext to tokenize the text and join it with sentiment lexicons such as Bing, NRC, and AFINN.
- Resources:
  - syuzhet Documentation
  - tidytext Documentation
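
The two scoring approaches can be sketched side by side. The `comments` tibble below is hypothetical. Only the Bing lexicon is used on the tidytext side because it ships with the package; NRC and AFINN are typically fetched through the textdata package, which prompts for a download, so they are left out of this sketch.

```r
library(dplyr)
library(tidytext)
library(syuzhet)

# Hypothetical comments
comments <- tibble(
  comment_id = 1:3,
  comment = c(
    "R is wonderful and elegant.",
    "Python is terrible and slow for statistics.",
    "I use both."
  )
)

# Comment-level valence from syuzhet (higher means more positive)
comments <- comments %>%
  mutate(syuzhet_score = get_sentiment(comment, method = "syuzhet"))

# Word-level Bing labels, aggregated to a net score per comment
bing_net <- comments %>%
  unnest_tokens(word, comment) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  group_by(comment_id) %>%
  summarise(
    net = sum(sentiment == "positive") - sum(sentiment == "negative"),
    .groups = "drop"
  )

bing_net
```

Comments with no lexicon matches, such as the third one here, drop out of the inner join, so they need to be joined back in with a net score of zero if every comment should be kept.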

Day 3: Visualize Sentiments
- Use ggplot2 to plot the distribution of positive vs. negative sentiments.
- Compare sentiment frequency for “R” vs. “Python” mentions.
- Resources:
  - ggplot2 Documentation
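
One way to compare sentiment across language mentions is to flag each comment with a word-boundary regular expression and then average a per-comment score by language. The data and the `net` scores below are hypothetical, and a comment that mentions both languages is counted in both groups, which is a design choice rather than the only option.

```r
library(dplyr)
library(ggplot2)

# Hypothetical comments with precomputed net sentiment scores
comments <- tibble(
  comment = c(
    "R is great for statistics",
    "Python feels clunky for plotting",
    "I like both R and Python",
    "Neither, I use Julia"
  ),
  net = c(2, -1, 1, 0)
)

# Flag mentions; \\b keeps "r" from matching inside longer words
comments <- comments %>%
  mutate(
    mentions_r      = grepl("\\br\\b", comment, ignore.case = TRUE),
    mentions_python = grepl("\\bpython\\b", comment, ignore.case = TRUE)
  )

# Stack the two groups and summarise sentiment per language
by_language <- bind_rows(
  comments %>% filter(mentions_r) %>% mutate(language = "R"),
  comments %>% filter(mentions_python) %>% mutate(language = "Python")
) %>%
  group_by(language) %>%
  summarise(n_comments = n(), mean_net = mean(net), .groups = "drop")

p <- ggplot(by_language, aes(x = language, y = mean_net, fill = language)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "Mean net sentiment")

print(p)
```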

Day 4-5: Cluster Comments by Sentiment
- Apply k-means clustering or a simple distance-based approach to group comments based on sentiment polarity.
- Investigate whether positive/negative comments cluster around mentions of R or Python.
- Resources:
  - tidytext Documentation
  - ggplot2 Documentation
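
A k-means grouping on sentiment features could be sketched with base R's `kmeans()`. The feature columns (counts of positive and negative words per comment) and their values are hypothetical; the real features would come from the sentiment scoring step.

```r
set.seed(42)  # k-means starts from random centers

# Hypothetical per-comment sentiment features
features <- data.frame(
  positive = c(3, 4, 0, 1, 5, 0),
  negative = c(0, 1, 4, 3, 0, 5)
)

# Scale the features so neither dominates the distance, then cluster
km <- kmeans(scale(features), centers = 2, nstart = 25)

features$cluster <- factor(km$cluster)
table(features$cluster)
```

Cluster labels are arbitrary, so which cluster is "positive" has to be read off the cluster centers rather than assumed from the label number.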

DataCamp Course Recommendations:
- Already Taken: Sentiment Analysis in R
- New: Analyzing Social Media Data in R


Week 3: Machine Learning Models (kNN & Naive Bayes)

Goal: Build, analyze, and compare kNN and Naive Bayes models for sentiment classification.

Day 1: Implement Naive Bayes Classifier
- Build a Naive Bayes classifier using the naivebayes library.
- Train the model on labeled sentiment data to predict sentiment (positive/negative).
- Evaluate the model using accuracy, precision, and recall.
- Resources:
  - naivebayes Documentation
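
A minimal Naive Bayes sketch with the naivebayes package is shown below. The training data is a hypothetical toy set in which each feature records whether a given word appears in a comment; the column names `has_love` and `has_hate` are made up for illustration. Laplace smoothing is set to 1 so that a word never seen with a class does not force a zero probability.

```r
library(naivebayes)

presence <- function(x) factor(x, levels = c("no", "yes"))

# Hypothetical labeled training data (word-presence features)
train <- data.frame(
  has_love  = presence(c("yes", "yes", "yes", "no", "no", "no", "no", "yes")),
  has_hate  = presence(c("no", "no", "no", "yes", "yes", "yes", "no", "no")),
  sentiment = factor(c("positive", "positive", "positive", "negative",
                       "negative", "negative", "negative", "positive"))
)

# Fit the classifier with Laplace smoothing
model <- naive_bayes(sentiment ~ ., data = train, laplace = 1)

# Classify a new, unseen comment
new_comment <- data.frame(
  has_love = presence("yes"),
  has_hate = presence("no")
)
pred <- predict(model, newdata = new_comment, type = "class")
pred
```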

Day 2: Implement k-Nearest Neighbors (kNN)
- Use the class package to implement a kNN model and experiment with different values of k.
- Train the model on the sentiment dataset and evaluate the classification performance.
- Resources:
  - class Documentation
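
Experimenting with k might look like the sketch below, which uses `knn()` from the class package. The numeric features (positive and negative word counts) and the train/test split are hypothetical.

```r
library(class)

set.seed(1)  # knn() breaks distance ties at random

# Hypothetical training data: sentiment word counts per comment
train_x <- data.frame(
  positive_words = c(3, 4, 5, 0, 1, 0),
  negative_words = c(0, 1, 0, 4, 3, 5)
)
train_y <- factor(c("positive", "positive", "positive",
                    "negative", "negative", "negative"))

# Hypothetical held-out comments with known labels
test_x <- data.frame(positive_words = c(4, 0), negative_words = c(0, 4))
test_y <- factor(c("positive", "negative"), levels = levels(train_y))

# Accuracy on the held-out comments for each candidate k
k_values <- c(1, 3)
accuracy_by_k <- sapply(k_values, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)
})
names(accuracy_by_k) <- paste0("k=", k_values)
accuracy_by_k
```

Because kNN is distance-based, features on very different scales would normally be standardized first; the toy counts here are already comparable.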

Day 3: Model Comparison
- Compare the performance of the Naive Bayes and kNN models based on accuracy, precision, recall, and F1-score.
- Plot ROC curves for both models using pROC.
- Resources:
  - pROC Documentation
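
The comparison can be sketched as a small metrics helper plus ROC curves from pROC. The labels and predicted probabilities below are hypothetical, and `classification_metrics()` is an illustrative helper defined here, not a function from any package.

```r
library(pROC)

# Hypothetical held-out labels and predicted P(positive) from each model
actual <- factor(
  c("positive", "positive", "positive", "negative",
    "negative", "negative", "positive", "negative"),
  levels = c("negative", "positive")
)
nb_prob  <- c(0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1)
knn_prob <- c(1.0, 0.67, 0.67, 0.33, 0.0, 0.67, 0.33, 0.33)

# Illustrative helper: accuracy, precision, recall, F1 at a threshold
classification_metrics <- function(actual, prob, threshold = 0.5) {
  predicted <- ifelse(prob >= threshold, "positive", "negative")
  tp <- sum(predicted == "positive" & actual == "positive")
  fp <- sum(predicted == "positive" & actual == "negative")
  fn <- sum(predicted == "negative" & actual == "positive")
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(accuracy  = mean(predicted == actual),
    precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall))
}

metrics <- rbind(
  naive_bayes = classification_metrics(actual, nb_prob),
  knn         = classification_metrics(actual, knn_prob)
)
print(metrics)

# ROC curves: levels are given as (control, case)
roc_nb  <- roc(actual, nb_prob,  levels = c("negative", "positive"),
               direction = "<", quiet = TRUE)
roc_knn <- roc(actual, knn_prob, levels = c("negative", "positive"),
               direction = "<", quiet = TRUE)

plot(roc_nb, col = "steelblue")
plot(roc_knn, add = TRUE, col = "firebrick")
legend("bottomright", legend = c("Naive Bayes", "kNN"),
       col = c("steelblue", "firebrick"), lwd = 2)
```

The threshold-based metrics depend on the 0.5 cutoff, while the ROC curve summarizes performance across all cutoffs, which is why the two views are worth reporting together.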

Day 4: Visualize and Interpret Results
- Create compelling visualizations using ggplot2, such as confusion matrices and ROC curves.
- Use bar charts and line plots to show how each model classifies the sentiment data.
- Resources:
  - ggplot2 Documentation
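
A confusion matrix can be drawn as a ggplot2 heatmap by turning a base R `table()` into a data frame. The actual and predicted labels below are hypothetical.

```r
library(ggplot2)

# Hypothetical true and predicted labels for one model
lv <- c("negative", "positive")
actual <- factor(c("positive", "positive", "negative", "negative",
                   "positive", "negative"), levels = lv)
predicted <- factor(c("positive", "negative", "negative", "negative",
                      "positive", "positive"), levels = lv)

# Cross-tabulate, then reshape to one row per cell (column Freq holds counts)
cm <- as.data.frame(table(actual = actual, predicted = predicted))

p <- ggplot(cm, aes(x = predicted, y = actual, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), colour = "white") +
  labs(x = "Predicted", y = "Actual", fill = "Count",
       title = "Confusion matrix")

print(p)
```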

Day 5: Finalize and Present Findings
- Summarize which model performed better for sentiment classification and how the results relate to the R vs. Python discussion.
- Prepare a visual and written report on your findings.
- Resources:
  - Supervised Learning in R: Classification

DataCamp Course Recommendations:
- Already Taken: Understanding Machine Learning
- New: Supervised Learning in R: Classification

PROJECT:

Install RedditExtractoR:

# Install RedditExtractoR if it is not already available, then load it
if (!require(RedditExtractoR)) {
  install.packages("RedditExtractoR")
  library(RedditExtractoR)
  message("RedditExtractoR has been installed.")
} else {
  message("RedditExtractoR is already installed.")
}
## Loading required package: RedditExtractoR
## RedditExtractoR is already installed.

Load file into Workspace:

#?RedditExtractoR
# Load the package
library(RedditExtractoR)

# Define the Reddit post URL
post_url <- "https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/"

# Extract thread content
thread_data <- get_thread_content(post_url)

# Save the data for further analysis
write.csv(thread_data$comments, "reddit_comments.csv")

Investigate Data: