A Comparative Study of R vs Python Using Sentiment Analysis and Machine Learning
A Reddit post I made in the r/RStudio subreddit, titled “People who use R of Reddit, why not just use Python?”, unexpectedly went viral, amassing nearly 50,000 views and hundreds of comments. It is safe to say this is a topic of great interest in the data community.
Inspired by a professor’s advice to learn Python, and partly for the humor of it, I decided to analyze why people choose one language over the other using, ironically, R.
This project highlights my ability to manage an entire data pipeline, from scraping and cleaning data to performing sentiment analysis and building machine learning models. Drawing on nearly two years of data science experience at UCLA, I plan to complete the project in three weeks. It is meant to serve as a testament to my versatility and readiness for any data-driven role.
This project aims to analyze user sentiments in Reddit comments on an R-related post to determine how positive and negative sentiments correlate with mentions of R and Python. By applying sentiment analysis techniques and leveraging machine learning models—k-Nearest Neighbors (kNN) and Naive Bayes—this project seeks to classify the overall sentiment of the community and identify trends related to each programming language. The project includes data collection, preprocessing, exploratory analysis, and model comparison, providing insights into user preferences in a structured and data-driven manner.
Tools and methods:
- RedditExtractoR for scraping the Reddit thread and its metadata.
- tidytext for text preprocessing (tokenization, stop word removal, stemming/lemmatization).
- dplyr and ggplot2 to visualize word frequency and metadata.
- syuzhet and tidytext for sentiment scoring (Bing, NRC, AFINN), visualized with ggplot2.
- Naive Bayes (naivebayes) and kNN (class) classifiers trained on labeled sentiment data.
- Model evaluation with pROC, visualized with ggplot2.

Goal: Collect Reddit data, clean it, and prepare it for analysis.
Day 1: Scrape Reddit Data
- Use the RedditExtractoR library to extract comments from the R-related Reddit post, along with metadata like upvotes, timestamps, and user information.
- Resources:
  - RedditExtractoR Documentation
Day 2-3: Clean and Preprocess Data
- Tokenize the text and remove stop words, punctuation, and irrelevant text using tidytext (a sketch follows the resource list).
- Apply stemming or lemmatization to reduce words to their base form.
- Resources:
  - tidytext Documentation
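A minimal preprocessing sketch, assuming the scraped comments are saved in reddit_comments.csv (as in the scraping code at the end of this post) with the comment text in a column named comment, and using SnowballC for stemming:
# Minimal preprocessing sketch; assumes "reddit_comments.csv" has a `comment` column
library(dplyr)
library(tidytext)
library(SnowballC)  # wordStem() for stemming

comments <- read.csv("reddit_comments.csv", stringsAsFactors = FALSE) %>%
  mutate(comment_id = row_number())

tidy_comments <- comments %>%
  unnest_tokens(word, comment) %>%        # one row per word; lowercases and strips punctuation
  anti_join(stop_words, by = "word") %>%  # remove common stop words
  filter(!grepl("^[0-9]+$", word)) %>%    # drop tokens that are just numbers
  mutate(stem = wordStem(word))           # reduce each word to its stem

head(tidy_comments)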
Day 4-5: Exploratory Data Analysis (EDA)
- Use dplyr to summarize the dataset (e.g., comment length, word frequency).
- Visualize word frequency, comment length, and metadata using ggplot2 (see the sketch below).
- Resources:
  - dplyr Documentation
  - ggplot2 Documentation
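A rough EDA sketch building on the comments and tidy_comments objects from the preprocessing step above:
# Rough EDA sketch; assumes `comments` and `tidy_comments` from the preprocessing step
library(dplyr)
library(ggplot2)

# Distribution of comment length (in words)
comments %>%
  mutate(n_words = lengths(strsplit(comment, "\\s+"))) %>%
  ggplot(aes(n_words)) +
  geom_histogram(bins = 30) +
  labs(title = "Distribution of comment length", x = "Words per comment", y = "Comments")

# Top 20 most frequent words after stop word removal
tidy_comments %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 20) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Most frequent words", x = NULL, y = "Count")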
DataCamp Course Recommendations:
- Already Taken: Data Manipulation with dplyr
- New: Introduction to Text Analysis in R
Goal: Apply sentiment analysis to determine if Reddit comments lean toward R or Python and how positive/negative sentiments correlate with each.
Day 1-2: Perform Sentiment Analysis
- Use the syuzhet package for sentiment scoring and extract emotional valence (positive/negative).
- Apply tidytext to tokenize the text and join it with sentiment lexicons such as Bing, NRC, and AFINN (a sketch follows the resource list).
- Resources:
  - syuzhet Documentation
  - tidytext Documentation
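A sentiment-scoring sketch, assuming the comments and tidy_comments objects from Week 1; the Bing join is shown here, and the NRC and AFINN lexicons follow the same pattern:
# Sentiment-scoring sketch; assumes `comments` and `tidy_comments` from Week 1
library(dplyr)
library(tidytext)
library(syuzhet)

# Per-comment sentiment score from syuzhet on the raw comment text
comments$syuzhet_score <- get_sentiment(comments$comment, method = "syuzhet")

# Per-comment polarity by joining tokens with the Bing lexicon
bing_scores <- tidy_comments %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(comment_id, sentiment) %>%
  tidyr::pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(bing_score = positive - negative)

# NRC emotion counts (anger, joy, trust, ...) per comment
nrc_emotions <- get_nrc_sentiment(comments$comment)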
Day 3: Visualize Sentiments
- Use ggplot2 to plot the distribution of positive vs. negative sentiments.
- Compare sentiment frequency for “R” vs. “Python” mentions (sketched below).
- Resources:
  - ggplot2 Documentation
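A plotting sketch, assuming the syuzhet_score column computed above; mentions of each language are detected with an admittedly crude regex:
# Plotting sketch; assumes `comments$syuzhet_score` from the previous step
library(dplyr)
library(ggplot2)

mention_sentiment <- comments %>%
  mutate(
    mentions = case_when(
      grepl("\\bpython\\b", comment, ignore.case = TRUE) & grepl("\\bR\\b", comment) ~ "Both",
      grepl("\\bpython\\b", comment, ignore.case = TRUE)                             ~ "Python",
      grepl("\\bR\\b", comment)                                                      ~ "R",
      TRUE                                                                           ~ "Neither"
    ),
    polarity = ifelse(syuzhet_score >= 0, "Positive", "Negative")
  )

mention_sentiment %>%
  filter(mentions %in% c("R", "Python")) %>%
  count(mentions, polarity) %>%
  ggplot(aes(x = mentions, y = n, fill = polarity)) +
  geom_col(position = "dodge") +
  labs(title = "Sentiment of comments mentioning R vs. Python",
       x = NULL, y = "Number of comments")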
Day 4-5: Cluster Comments by Sentiment
- Apply k-means clustering or a simple distance-based approach to group comments based on sentiment polarity (a sketch follows the resource list).
- Investigate whether positive/negative comments cluster around mentions of R or Python.
- Resources:
  - tidytext Documentation
  - ggplot2 Documentation
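A clustering sketch, assuming the per-comment sentiment features built earlier (the syuzhet score and the Bing positive/negative counts); the choice of three clusters is arbitrary and only for illustration:
# Clustering sketch; assumes `comments`, `bing_scores`, and `syuzhet_score` from earlier steps
library(dplyr)

features <- comments %>%
  left_join(bing_scores, by = "comment_id") %>%
  transmute(
    syuzhet_score,
    positive = coalesce(positive, 0),
    negative = coalesce(negative, 0)
  ) %>%
  scale()  # standardize before k-means

set.seed(42)
km <- kmeans(features, centers = 3, nstart = 25)
comments$cluster <- factor(km$cluster)

# Do clusters line up with Python mentions?
table(comments$cluster, grepl("\\bpython\\b", comments$comment, ignore.case = TRUE))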
DataCamp Course Recommendations:
- Already Taken: Sentiment Analysis in R
- New: Analyzing Social Media Data in R
Goal: Build, analyze, and compare kNN and Naive Bayes models for sentiment classification.
Day 1: Implement Naive Bayes Classifier
- Build a Naive Bayes classifier using the naivebayes library (a sketch follows the resource list).
- Train the model on labeled sentiment data to predict sentiment (positive/negative).
- Evaluate the model using accuracy, precision, and recall.
- Resources:
  - naivebayes Documentation
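A rough Naive Bayes sketch. It assumes the syuzhet scores from Week 2 serve as (noisy) positive/negative labels and that a simple presence/absence term matrix serves as the feature set; both are illustrative choices, not the only option:
# Naive Bayes sketch; assumes `comments` and `tidy_comments` from earlier weeks
library(dplyr)
library(naivebayes)

# Label each comment from its syuzhet score (an assumption, not ground truth)
labels <- factor(ifelse(comments$syuzhet_score >= 0, "positive", "negative"))

# Presence/absence features for the 200 most common words
top_words <- tidy_comments %>% count(word, sort = TRUE) %>% slice_head(n = 200) %>% pull(word)

dtm <- tidy_comments %>%
  filter(word %in% top_words) %>%
  count(comment_id, word) %>%
  mutate(n = "yes") %>%
  tidyr::pivot_wider(names_from = word, values_from = n, values_fill = "no")

X <- dtm %>% select(-comment_id) %>% mutate(across(everything(), factor)) %>% as.data.frame()
y <- labels[dtm$comment_id]

# Simple 80/20 train/test split
set.seed(123)
train_idx <- sample(seq_len(nrow(X)), size = floor(0.8 * nrow(X)))

nb_fit  <- naive_bayes(x = X[train_idx, ], y = y[train_idx], laplace = 1)
nb_pred <- predict(nb_fit, X[-train_idx, ])

# Accuracy, precision, and recall from the confusion matrix
cm <- table(predicted = nb_pred, actual = y[-train_idx])
accuracy  <- sum(diag(cm)) / sum(cm)
precision <- cm["positive", "positive"] / sum(cm["positive", ])
recall    <- cm["positive", "positive"] / sum(cm[, "positive"])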
Day 2: Implement k-Nearest Neighbors (kNN)
- Use the class package to implement a kNN model and experiment with different values of k (sketched below).
- Train the model on the sentiment dataset and evaluate the classification performance.
- Resources:
  - class Documentation
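A kNN sketch, assuming the X, y, and train_idx objects from the Naive Bayes step; class::knn() needs numeric predictors, so word presence is recoded as 0/1:
# kNN sketch; assumes `X`, `y`, and `train_idx` from the Naive Bayes step
library(class)

# Recode the yes/no factors as 0/1 so knn() can compute distances
X_num <- as.data.frame(lapply(X, function(col) as.integer(col == "yes")))

# Try several values of k and report test accuracy for each
for (k in c(3, 5, 7, 11)) {
  pred <- knn(train = X_num[train_idx, ],
              test  = X_num[-train_idx, ],
              cl    = y[train_idx],
              k     = k)
  acc <- mean(pred == y[-train_idx])
  message("k = ", k, ": accuracy = ", round(acc, 3))
}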
Day 3: Model Comparison
- Compare the performance of the Naive Bayes and kNN models based on accuracy, precision, recall, and F1-score.
- Plot ROC curves for both models using pROC (a sketch follows the resource list).
- Resources:
  - pROC Documentation
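A model-comparison sketch, assuming the objects fitted in the previous two days; for kNN, the vote proportion returned by knn(..., prob = TRUE) stands in for a class probability:
# ROC comparison sketch; assumes `nb_fit`, `X`, `X_num`, `y`, and `train_idx` from above
library(pROC)

# Posterior probability of the "positive" class from Naive Bayes
nb_prob <- predict(nb_fit, X[-train_idx, ], type = "prob")[, "positive"]

# For kNN, use the proportion of neighbours voting for the winning class
knn_pred <- class::knn(X_num[train_idx, ], X_num[-train_idx, ],
                       cl = y[train_idx], k = 5, prob = TRUE)
vote_prop <- attr(knn_pred, "prob")
knn_prob  <- ifelse(knn_pred == "positive", vote_prop, 1 - vote_prop)

roc_nb  <- roc(response = y[-train_idx], predictor = nb_prob,  levels = c("negative", "positive"))
roc_knn <- roc(response = y[-train_idx], predictor = knn_prob, levels = c("negative", "positive"))

plot(roc_nb, col = "steelblue", legacy.axes = TRUE)
plot(roc_knn, col = "firebrick", add = TRUE)
legend("bottomright",
       legend = c(paste0("Naive Bayes (AUC = ", round(auc(roc_nb), 2), ")"),
                  paste0("kNN (AUC = ", round(auc(roc_knn), 2), ")")),
       col = c("steelblue", "firebrick"), lwd = 2)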
Day 4: Visualize and Interpret Results
- Create compelling visualizations using ggplot2, such as confusion matrices and ROC curves (a confusion-matrix sketch follows the resource list).
- Use bar charts and line plots to show how each model classifies the sentiment data.
- Resources:
  - ggplot2 Documentation
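A small sketch of a confusion-matrix heatmap, assuming the cm table from the Naive Bayes step:
# Confusion-matrix heatmap sketch; assumes the `cm` table from the Naive Bayes step
library(ggplot2)

cm_df <- as.data.frame(cm)  # columns: predicted, actual, Freq

ggplot(cm_df, aes(x = actual, y = predicted, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), color = "white", size = 5) +
  labs(title = "Naive Bayes confusion matrix",
       x = "Actual sentiment", y = "Predicted sentiment") +
  theme_minimal()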
Day 5: Finalize and Present Findings
- Summarize which model performed better for sentiment classification and how the results relate to the R vs. Python discussion.
- Prepare a visual and written report on the findings.
- Resources:
  - Supervised Learning in R: Classification
DataCamp Course Recommendations:
- Already Taken: Understanding Machine Learning
- New: Supervised Learning in R: Classification
Install RedditExtractoR:
# Check if RedditExtractoR is already installed; install and load it if not
if (!require(RedditExtractoR)) {
  install.packages("RedditExtractoR")
  library(RedditExtractoR)
  message("RedditExtractoR has been installed.")
} else {
  message("RedditExtractoR is already installed.")
}
## Loading required package: RedditExtractoR
## RedditExtractoR is already installed.
Load the package and scrape the thread:
#?RedditExtractoR
# Load the package
library(RedditExtractoR)
# Define the Reddit post URL
post_url <- "https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/"
# Extract thread content
thread_data <- get_thread_content(post_url)
# Save the data for further analysis
write.csv(thread_data$comments, "reddit_comments.csv")
Investigate Data: