A Reddit post I made in the r/RStudio subreddit, titled “People who use R of Reddit, why not just use Python?”, unexpectedly went viral, amassing nearly 50,000 views and hundreds of comments. It is safe to say this is a topic of great interest in the data community.
Inspired by a professor’s advice to learn Python (and a healthy dose of pure humor), I decided to analyze why people choose one language over the other, ironically, using R.
This project highlights my ability to manage the entire data pipeline, from scraping and cleaning data to performing sentiment analysis and building machine learning models. Drawing on nearly two years of data science experience at UCLA, I plan to complete it in three weeks. It is meant to serve as a testament to my versatility and readiness for a data-driven role.
This project aims to analyze user sentiments in Reddit comments on an R-related post to determine how positive and negative sentiments correlate with mentions of R and Python. By applying sentiment analysis techniques and leveraging machine learning models—k-Nearest Neighbors (kNN) and Naive Bayes—this project seeks to classify the overall sentiment of the community and identify trends related to each programming language. The project includes data collection, preprocessing, exploratory analysis, and model comparison, providing insights into user preferences in a structured and data-driven manner.
Tools and methods at a glance:
- Data collection: RedditExtractoR.
- Text preprocessing: tidytext (tokenization, stop word removal, stemming/lemmatization).
- Exploratory analysis: dplyr and ggplot2 to visualize word frequency and metadata.
- Sentiment analysis: syuzhet and tidytext for sentiment scoring (Bing, NRC, AFINN), visualized with ggplot2.
- Classification: Naive Bayes (naivebayes) and kNN (class) classifiers on labeled sentiment data.
- Evaluation: ROC curves with pROC and result plots with ggplot2.
Week 1 Goal: Collect Reddit data, clean it, and prepare it for analysis.
Day 1: Scrape Reddit Data
- Use the RedditExtractoR library to extract comments from the R-related Reddit post, along with metadata like upvotes, timestamps, and user information.
- Resources:
- RedditExtractoR Documentation
Day 2-3: Clean and Preprocess Data
- Tokenize the text and remove stop words, punctuation, and irrelevant text using tidytext (see the preprocessing sketch after this plan).
- Apply stemming or lemmatization to reduce words to their base form.
- Resources:
- tidytext Documentation
- New: Introduction to Natural Language Processing in R
Day 4-5: Exploratory Data Analysis (EDA)
- Use dplyr to summarize the dataset (e.g., comment length, word frequency).
- Visualize word frequency, comment length, and metadata using ggplot2.
- Resources:
- dplyr Documentation
- ggplot2 Documentation
DataCamp Course Recommendations:
- Already Taken: Data Manipulation with dplyr
- New: Introduction to Text Analysis in R
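To make the preprocessing step concrete, here is a minimal sketch assuming the scraped comments end up in a data frame called comments_df with a comment column (the object and column names are placeholders, not the final pipeline):
# Preprocessing sketch: tokenize, drop stop words, stem, and count word frequency
library(dplyr)
library(tidytext)
library(SnowballC)  # wordStem() for stemming
library(ggplot2)
word_counts <- comments_df %>%
  unnest_tokens(word, comment) %>%        # one token (word) per row
  anti_join(stop_words, by = "word") %>%  # remove common English stop words
  mutate(stem = wordStem(word)) %>%       # reduce words to their base form
  count(stem, sort = TRUE)
# Plot the 15 most frequent stems
word_counts %>%
  slice_head(n = 15) %>%
  ggplot(aes(x = reorder(stem, n), y = n)) +
  geom_col(fill = "lightblue") +
  coord_flip() +
  labs(title = "Most Frequent Words", x = "Word (stemmed)", y = "Count")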
Week 2 Goal: Apply sentiment analysis to determine if Reddit comments lean toward R or Python and how positive/negative sentiments correlate with each.
Day 1-2: Perform Sentiment Analysis
- Use the syuzhet package for sentiment scoring and extract emotional valence (positive/negative).
- Apply tidytext to tokenize the text and join it with sentiment lexicons such as Bing, NRC, and AFINN (see the scoring sketch after this plan).
- Resources:
- syuzhet Documentation
- tidytext Documentation
Day 3: Visualize Sentiments
- Use ggplot2 to plot the distribution of positive vs. negative sentiments.
- Compare sentiment frequency for “R” vs. “Python” mentions.
- Resources:
- ggplot2 Documentation
Day 4-5: Cluster Comments by Sentiment
- Apply k-means clustering or a simple distance-based approach to group comments based on sentiment polarity.
- Investigate whether positive/negative comments cluster around mentions of R or Python.
- Resources:
- tidytext Documentation
- ggplot2 Documentation
DataCamp Course Recommendations:
- Already Taken: Sentiment Analysis in R
- New: Analyzing Social Media Data in R
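As a rough illustration of the scoring step (again assuming a comments_df with a comment column; all names are placeholders), syuzhet can score whole comments while tidytext joins tokens against the Bing lexicon:
# Sentiment-scoring sketch: document-level (syuzhet) and word-level (Bing via tidytext)
library(dplyr)
library(tidyr)
library(tidytext)
library(syuzhet)
comments_scored <- comments_df %>%
  mutate(syuzhet_score = get_sentiment(comment, method = "syuzhet"))  # > 0 positive, < 0 negative
bing_by_comment <- comments_df %>%
  mutate(comment_row = row_number()) %>%
  unnest_tokens(word, comment) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%   # keep only words in the Bing lexicon
  count(comment_row, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(bing_score = positive - negative)              # net word-level sentiment per comment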
Week 3 Goal: Build, analyze, and compare kNN and Naive Bayes models for sentiment classification.
Day 1: Implement Naive Bayes Classifier
- Build a Naive Bayes classifier using the naivebayes library (a minimal modelling sketch follows this plan).
- Train the model on labeled sentiment data to predict sentiment (positive/negative).
- Evaluate the model using accuracy, precision, and recall.
- Resources:
- naivebayes Documentation
Day 2: Implement k-Nearest Neighbors (kNN)
- Use the class package to implement a kNN model and experiment with different values of k.
- Train the model on the sentiment dataset and evaluate the classification performance.
- Resources:
- class Documentation
Day 3: Model Comparison
- Compare the performance of the Naive Bayes and kNN models based on accuracy, precision, recall, and F1-score.
- Plot ROC curves for both models using pROC.
- Resources:
- pROC Documentation
Day 4: Visualize and Interpret Results
- Create compelling visualizations using ggplot2, such as confusion matrices and ROC curves.
- Use bar charts and line plots to show how each model classifies the sentiment data.
- Resources:
- ggplot2 Documentation
Day 5: Finalize and Present Findings
- Summarize which model performed better for sentiment classification and how the results relate to the R vs. Python discussion.
- Prepare a visual and written report on the findings.
- Resources:
- Supervised Learning in R: Classification
DataCamp Course Recommendations:
- Already Taken: Understanding Machine Learning
- New: Supervised Learning in R: Classification
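To preview the Week 3 comparison, here is a minimal sketch assuming a numeric feature matrix X (e.g., word counts per comment) and a factor label y with levels "positive"/"negative"; the object names and the 80/20 split are assumptions, not the final models:
# Model-comparison sketch: Naive Bayes vs. kNN, evaluated with ROC curves
library(naivebayes)
library(class)
library(pROC)
set.seed(123)
train_idx <- sample(seq_len(nrow(X)), size = 0.8 * nrow(X))
# Naive Bayes: fit on the training rows, get positive-class probabilities on the test rows
nb_fit  <- naive_bayes(x = X[train_idx, ], y = y[train_idx])
nb_prob <- predict(nb_fit, X[-train_idx, ], type = "prob")[, "positive"]
# kNN (k = 5): prob = TRUE attaches the winning-class vote share to the predictions
knn_pred <- knn(train = X[train_idx, ], test = X[-train_idx, ], cl = y[train_idx], k = 5, prob = TRUE)
knn_prob <- ifelse(knn_pred == "positive", attr(knn_pred, "prob"), 1 - attr(knn_pred, "prob"))
# ROC curves for both models
roc_nb  <- roc(response = y[-train_idx], predictor = nb_prob)
roc_knn <- roc(response = y[-train_idx], predictor = knn_prob)
plot(roc_nb, col = "blue", main = "ROC: Naive Bayes vs. kNN")
lines(roc_knn, col = "red")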
## Loading required package: RedditExtractoR
## RedditExtractoR is already downloaded before.
#?RedditExtractoR
# Define the Reddit post URL
post_url <- "https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/"
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## parsing URLs on page 1...
## [1] "list"
## [1] 0
## [1] FALSE
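The chunk that actually pulls the thread content is not echoed above; a minimal sketch of how comments_df could be built with RedditExtractoR (the exact options used are not shown, so treat this as an assumption) looks like:
# Sketch: pull the thread content for the post (the real scraping chunk is not echoed)
library(RedditExtractoR)
thread_content <- get_thread_content(post_url)  # list with $threads and $comments data frames
comments_df    <- thread_content$comments       # one row per comment: author, date, score, comment_id, ...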
# Check if 'digest' package is installed; if not, install it.
if (!require(digest)) {
install.packages("digest")
}
## Loading required package: digest
# Load necessary libraries
library(dplyr)
library(digest)
# Set seed for reproducibility
set.seed(123)
# Step 1: Create a function to hash author names using SHA-256.
# This function will hash each author name using the full SHA-256 hash (64 characters).
create_hashed_code <- function(author_name) {
hash <- digest(author_name, algo = "sha256") # Generate the full SHA-256 hash
return(hash) # Return the full 64-character hash
}
# Step 2: Replace author names directly in 'comments_df' with their corresponding hashed codes.
# No need to create a separate lookup table; everything is done within the 'mutate()' function.
randomized_comments_df <- comments_df %>%
mutate(
random_code = sapply(author, create_hashed_code), # Hash all author names
author = ifelse(
author == "Square-Problem4346", # Preserve the original name for 'Square-Problem4346'
author, # Keep unchanged
random_code # Replace other authors with hashed codes
)
) %>%
select(-random_code) # Remove the intermediate 'random_code' column
# Step 3: Display the first few rows of the updated dataframe to verify the changes.
head(randomized_comments_df$author, 10)
## [1] "50ca7cf9fa866770dae86774b05282ffb8a18c9376b43bcf04f52cf8c9cc1f67"
## [2] "Square-Problem4346"
## [3] "28643325867213509bef0e654af57308700c413deac5980e42e1ce5d6a623dc6"
## [4] "9a86c0781a60871e2b36342d6fab5536e29e5628c5ff621c23de48878a5b0ec5"
## [5] "a72c94a8e5caf7c5b66b4899cf74eb2630c467681957516282480fe045bab6d4"
## [6] "c636c3a6353ca033c8001e038bb1427432b3052c99a3dea1b19fd26b205a4a89"
## [7] "50ca7cf9fa866770dae86774b05282ffb8a18c9376b43bcf04f52cf8c9cc1f67"
## [8] "c636c3a6353ca033c8001e038bb1427432b3052c99a3dea1b19fd26b205a4a89"
## [9] "50ca7cf9fa866770dae86774b05282ffb8a18c9376b43bcf04f52cf8c9cc1f67"
## [10] "9bc3196ce78a6e0f96459f63eff6ef6dbd2ec36c80817e93f7835f889d4e661e"
# Step 4: Verify the number of unique commenters (excluding 'Square-Problem4346').
amount_of_unique_commentors <- nrow(
comments_df %>%
filter(author != "Square-Problem4346") %>%
distinct(author)
)
# Get the count of unique authors in 'randomized_comments_df'
amount_of_unique_in_randomized <- nrow(
randomized_comments_df %>%
distinct(author)
)
# Print whether the counts match (excluding 1 for 'Square-Problem4346').
print(amount_of_unique_commentors == (amount_of_unique_in_randomized - 1))
## [1] TRUE
Brief summary:
SHA-256 Security: SHA-256 is a highly secure, widely used cryptographic algorithm. Unlike SHA-1, which was broken in 2017, SHA-256 remains resistant to collision attacks, making it ideal for anonymizing user data.
Hashing vs Encryption: Hashing is a one-way, irreversible process that does not use a secret key. Even though the algorithm is known, it’s computationally infeasible to reverse the hash to retrieve the original input (e.g., usernames).
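A quick illustration of the deterministic, one-way behaviour described above (the input strings are purely illustrative):
# Hashing is deterministic: identical inputs always give identical digests
digest::digest("example_user", algo = "sha256") == digest::digest("example_user", algo = "sha256")  # TRUE
# ...while even a tiny change in the input yields a completely different digest
digest::digest("example_user2", algo = "sha256")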
No; in fact, only about 23% of users comment more than once, and of those, only about 9% comment more than twice. Essentially, most users comment once and then move on to the next dopamine rush.
# Load dplyr for data manipulation
library(dplyr)
# Add a new column 'Parent' and classify comments
randomized_comments_df <- randomized_comments_df %>%
mutate(
next_comment_id = lead(comment_id), # Check the next comment's id
Parent = ifelse(
grepl("_", comment_id), # If comment is part of a thread (has "_")
0,
ifelse( # Otherwise, determine if it's a parent
!is.na(next_comment_id) &
(!grepl("_", next_comment_id) & as.numeric(sub("_.*", "", next_comment_id)) == as.numeric(comment_id) + 1), # If the next comment is unthreaded and follows numerically
0, # Top-level comment without a thread
1 # Top-level comment with responses (parent of a thread)
)
)
) %>%
select(-next_comment_id) # Remove temporary column
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Parent = ifelse(...)`.
## Caused by warning in `ifelse()`:
## ! NAs introduced by coercion
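# Note on the warning above: `ifelse()` evaluates `as.numeric(comment_id)` for every row,
# including threaded ids like "1_1", which coerce to NA; those rows are already assigned
# Parent = 0 by the first branch, so the warning is harmless here.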
# Split into two dataframes: unthreaded and threaded comments
unthreaded_comments_df <- randomized_comments_df %>%
filter(Parent == 0 & !grepl("_", comment_id)) # Top-level comments without threads
threaded_comments_df <- randomized_comments_df %>%
filter(grepl("_", comment_id) | Parent == 1) # Threaded comments or parents
# View the two dataframes
print("Unthreaded Comments:")
## [1] "Unthreaded Comments:"
head(unthreaded_comments_df)
## url
## 1 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 2 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 3 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 4 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 5 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 6 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## author date
## 1 063ac96d8f190af509f9392b81c9ec222399f91372ad0e9fb07fa423e858b0a5 2024-09-04
## 2 f2d40d40d4257e8e9d6d4d783c509c90b0e893fd210a22f3c86d33a807935361 2024-09-04
## 3 4f64ea74eba9d17f2fc36fbe063e4d6470b555fcb648294b7e2f998761251ebe 2024-09-04
## 4 1ef7e8ecb78e7790720585cd283f2312df7af6e0a628f92b8c398e98a6424819 2024-09-04
## 5 2b77a0a723ab8f15461de67d2023193bfcde880f810b529333ad7c8abdb117e6 2024-09-04
## 6 b9bcfc870dce5ee1135e7e00cb3f7f991e5c3ca13598dfaa0395af3cd4be5c3b 2024-09-04
## timestamp score upvotes downvotes golds
## 1 1725487774 37 37 0 0
## 2 1725489691 15 15 0 0
## 3 1725491871 16 16 0 0
## 4 1725488439 12 12 0 0
## 5 1725489509 9 9 0 0
## 6 1725491497 8 8 0 0
## comment
## 1 This question I have asked myself before too. In the end I think it boils down to R being more out of the box ready to use for data wrangling and the like.
## 2 Adding to what others have said - RStudio beats out Pycharm and other IDEs by light years imho.\n\nAgreeing with what others have said: for anything biostats (or operation on spreadsheets), visualization, etc., R wins out.\n\nArguments for python (in my life) include need for ML/DL and \030true\031 programming (vs analysis)
## 3 R was built as a statistical analysis language. Why would I want to use a general purpose language when there is a language for my exact use case.\n\nTo be clear I have no problem with Python, and have learned the basics.
## 4 I'm a stats dork first and foremost, therefore R is a very good fit for me and in my experience does that better. I flip to python for more general purposes.
## 5 My school focuses on teaching R more than Python as a Data Analytics major and so far I've enjoyed it more. In one class I had to compare between statistical functions in R and Python and I always found R having more detailed functions with an easier code than Python. Idk if in general R is more prepared for detailed statistical work or not in comparison to Python unless there's Python libraries I've missed.
## 6 1. Stats and data managing tools are well developed, documented (have citable papers) and transparent. \n\n2. Bioinformatics packages are well developed, documented (have citable papers) and transparent. \n\n3. Thats what I started with and what is taught in most biostats programs. \n\nAlso, I learned a bit of Python and numpy and scipy seemed utterly ridiculous to me. But who knows if I\031d learned that first maybe it\031d be different.
## comment_id Parent
## 1 5 0
## 2 11 0
## 3 12 0
## 4 13 0
## 5 15 0
## 6 16 0
print("Threaded Comments:")
## [1] "Threaded Comments:"
head(threaded_comments_df)
## url
## 1 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 2 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 3 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 4 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 5 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 6 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## author date
## 1 50ca7cf9fa866770dae86774b05282ffb8a18c9376b43bcf04f52cf8c9cc1f67 2024-09-04
## 2 Square-Problem4346 2024-09-04
## 3 28643325867213509bef0e654af57308700c413deac5980e42e1ce5d6a623dc6 2024-09-05
## 4 9a86c0781a60871e2b36342d6fab5536e29e5628c5ff621c23de48878a5b0ec5 2024-09-05
## 5 a72c94a8e5caf7c5b66b4899cf74eb2630c467681957516282480fe045bab6d4 2024-09-07
## 6 c636c3a6353ca033c8001e038bb1427432b3052c99a3dea1b19fd26b205a4a89 2024-09-05
## timestamp score upvotes downvotes golds
## 1 1725486694 165 165 0 0
## 2 1725487023 9 9 0 0
## 3 1725494583 28 28 0 0
## 4 1725521027 5 5 0 0
## 5 1725676114 1 1 0 0
## 6 1725560242 2 2 0 0
## comment
## 1 Pretty much ggplot and the dbplyr. I use Node for a lot, but I just really like performing an analysis inside R. I keep thinking of giving Python another shot, but I feel like at that point I would rather give Rust another shot.
## 2 That\031s makes sense
## 3 Modeling and the ability to factor categorical variables without making dummies. I use python now but miss the ease of modeling in r
## 4 Do you have to make dummies to deal with categorical variables in Python?
## 5 With scipy yeah but there are other libraries that you don't have to
## 6 lets-plot\n\npolars
## comment_id Parent
## 1 1 1
## 2 1_1 0
## 3 1_1_1 0
## 4 1_1_1_1 0
## 5 1_1_1_1_1 0
## 6 1_2 0
# Load dplyr for data manipulation
library(dplyr)
# Find the authors that appear in both threaded and unthreaded comment dataframes
common_authors <- intersect(threaded_comments_df$author, unthreaded_comments_df$author)
# Display the common authors
print(common_authors)
## [1] "a72c94a8e5caf7c5b66b4899cf74eb2630c467681957516282480fe045bab6d4"
## [2] "18c55ec198f889e965bb965c19343bb7d9525dbe48161f252cdb277cf53da3c0"
## [3] "9b0701406e601311f042b43c63b4dcd0b20117a328b9d61769f507edc50a29c4"
## [4] "de64c06aaf788801551f83ebf7c73c5ab042336b2823591117fbd5e53a48207c"
## [5] "e0f5bf1ad24858f119f894783bcd35839614fb06780d05b4e3ddb0bb27c5ae67"
# Ensure 'date' column is a Date object in threaded_comments_df
threaded_comments_df$date <- as.Date(threaded_comments_df$date)
# Load necessary libraries
library(dplyr)
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(ggplot2)
# Step 1: Create 'parent_id' based on the structure of 'comment_id'
threaded_comments_df <- threaded_comments_df %>%
mutate(parent_id = ifelse(grepl("_", comment_id), # If the comment_id has "_", it's a reply
sub("_[^_]+$", "", comment_id), # Remove the last part to get the parent_id
NA)) # Top-level comments have no parent
# Step 2: Create an edge list for parent-child relationships (source = parent, target = reply)
edges <- threaded_comments_df %>%
filter(!is.na(parent_id)) %>% # Only keep comments with a parent (i.e., replies)
select(source = parent_id, target = comment_id, author) # Create the edge list for network visualization
# Step 3: Create a nodes dataframe to identify unique users (vertices for the graph)
nodes <- threaded_comments_df %>%
filter(comment_id %in% c(edges$source, edges$target)) %>%
select(id = comment_id, author) %>%
distinct()
# Step 4: Assign the distinct colors you provided to common authors
distinct_colors <- c("green", "blue", "purple", "orange", "lightblue")
# Create a named vector to map each common author to a unique color
author_color_map <- setNames(distinct_colors, common_authors)
# Step 5: Modify the edge color based on author:
# - "Square-Problem4346" will remain red
# - Common authors get their unique assigned color
# - Others will remain black
edges <- edges %>%
mutate(edge_color = case_when(
author == "Square-Problem4346" ~ "red", # Red for Square-Problem4346
author %in% common_authors ~ author_color_map[author], # Unique color for each common author
TRUE ~ "black" # Black for all other authors
))
# Step 6: Create the network graph with igraph
comment_graph <- graph_from_data_frame(d = edges %>% select(source, target), vertices = nodes, directed = TRUE)
# Step 7: Customize and visualize the network graph
# Layout the graph slightly shifted to the left
layout <- layout_as_tree(comment_graph)
# Plot the network graph
plot(
comment_graph,
layout = layout - 0.3, # Shift the layout to the left
vertex.size = 5,
vertex.label = NA, # No labels for vertices
vertex.color = "lightgrey", # Uniform color for all nodes
edge.color = edges$edge_color, # Use the edge color we assigned (red for Square-Problem4346, unique for common authors)
edge.arrow.size = 0.3,
main = "Comment Responses Network"
)
# Step 8: Add a legend for the common authors (to the right of the plot)
legend(
x = 1.2, # Position the legend to the right of the graph
y = 0.9,
legend = paste0("user_", 1:length(common_authors)), # Replace hashed names with user_1, user_2, etc.
col = distinct_colors, # Use the distinct colors for the common authors
pch = 16, # Use filled circles for the legend
title = "C-Authors"
)
# Step 1: Get the unique users from the entire dataset
unique_users_total <- randomized_comments_df %>%
distinct(author) # Get distinct users
# Step 2: Count the number of unique users
num_unique_users_total <- nrow(unique_users_total)
# Output the result
num_unique_users_total
## [1] 143
# Step 1: Identify users in threads (either as parent or child)
users_in_threads <- randomized_comments_df %>%
filter(comment_id %in% c(edges$source, edges$target)) %>%
distinct(author)
# Step 2: Identify users not in threads (standalone comments)
users_not_in_threads <- randomized_comments_df %>%
filter(!(comment_id %in% c(edges$source, edges$target))) %>%
distinct(author)
# Step 3: Identify users who are in both categories (started a thread and replied to others)
users_in_both <- users_in_threads %>%
inner_join(users_not_in_threads, by = "author") # Users present in both sets
# Step 4: Summarize the counts in a table
summary_table <- tibble(
Category = c("Not in Thread", "In Thread", "In Both"),
Count = c(nrow(users_not_in_threads),
nrow(users_in_threads),
nrow(users_in_both))
)
# Output the summary table
summary_table
## # A tibble: 3 × 2
## Category Count
## <chr> <int>
## 1 Not in Thread 88
## 2 In Thread 60
## 3 In Both 5
#verify
summary_table$Count[1] + summary_table$Count[2] - summary_table$Count[3] == num_unique_users_total
## [1] TRUE
# Ensure 'date' column is a Date object
randomized_comments_df$date <- as.Date(randomized_comments_df$date)
# Check the class of 'date' column
class(randomized_comments_df$date)
## [1] "Date"
# Check the range of dates
range(randomized_comments_df$date, na.rm = TRUE)
## [1] "2024-09-04" "2024-09-08"
percent_summary_table <- as.data.frame(summary_table) %>% mutate(Count = Count/num_unique_users_total)
percent_summary_table
## Category Count
## 1 Not in Thread 0.61538462
## 2 In Thread 0.41958042
## 3 In Both 0.03496503
# Load necessary libraries and install them if not already installed
if (!require(ggplot2)) {
install.packages("ggplot2")
}
if (!require(dplyr)) {
install.packages("dplyr")
}
if (!require(gridExtra)) {
install.packages("gridExtra")
}
## Loading required package: gridExtra
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
# Load libraries into the R session
library(ggplot2)
library(dplyr)
library(gridExtra)
# Create the histogram
histogram_plot <- randomized_comments_df %>%
ggplot(aes(x = score)) +
geom_histogram(binwidth = 5, fill = "lightgreen") + # Adjust binwidth as needed
theme_minimal() +
labs(
title = "Distribution of Scores",
x = "Score",
y = "Count"
)
# Create the boxplot (horizontally aligned for comparison)
boxplot_plot <- randomized_comments_df %>%
ggplot(aes(x = 1, y = score)) + # Use 1 as a dummy variable to align the boxplot horizontally
geom_boxplot(fill = "lightblue") +
theme_minimal() +
theme(axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank()) + # Hide x-axis details
labs(
title = "Boxplot of Scores",
y = "Score"
)
# Combine the two plots side by side
grid.arrange(histogram_plot, boxplot_plot, ncol = 2)
# Take a look at the top 2 and bottom 2 outliers by score:
top_2 <- randomized_comments_df %>% arrange(desc(score)) %>% head(2)
bottom_2 <- randomized_comments_df %>% arrange(score) %>% head(2)
#df of top_2 & bottom_2:
score_outliers_df <- rbind(top_2 %>% select(author,comment, score), bottom_2 %>% select(author,comment, score))
# Locate this author's comments in the network:
randomized_comments_df %>% filter(author == "422bed076d74660797d37b1b4bb078f5351a3147ad68961cb3289c8e4cf4950a")
## url
## 1 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 2 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## author date
## 1 422bed076d74660797d37b1b4bb078f5351a3147ad68961cb3289c8e4cf4950a 2024-09-04
## 2 422bed076d74660797d37b1b4bb078f5351a3147ad68961cb3289c8e4cf4950a 2024-09-05
## timestamp score upvotes downvotes golds
## 1 1725490338 -32 -32 0 0
## 2 1725495627 -14 -14 0 0
## comment
## 1 >I think this article does a good job showing side by side how has a more intuitive syntax.\n\nFirst off, this is NOT a personal attack. It's not. This is my personal opinion that that article is full of crap. \n\nBasically that article is arguing that not only Python, but C++ and the entire C family of languages is somehow intuitionally inferior to R. That's a huge....HUGE statement and one that is largely untenable. Python is immensely "conversational". R is absolutely not.\n\nR is a language designed specifically to solve one problem: how to efficiently consolidate statistical analysis under one umbrella, command line language. That's it. Period. Everything that came after that (ggplot2 and dplyr for example) are attempts to make R emulate the flexibility of an actual programming language. Yes they work. But they work because they have a very dedicated development community and... because they have RStudio \n\nThe result is that R packages are bulky, inefficient, and frequently mismanaged (how often do your R sessions crash??). And don't get me started on AI. Machine learning with R is mostly an exercise in fitting a square peg into an invisible hole. \n\nIf anything, that article showcases how dead-end R actually is. If you immerse yourself in R like the author, you basically have no intuition for how the vast majority of scripting and programming languages work. You're a one-trick pony with a non-translational feel for syntax.\n\nOkay. Down vote away. I don't care. I use R all the time, but I hate myself for it every time I use it.
## 2 Yes, Tidymodels is doing NOW what scikitlearn was doing in 2015. Now, tell me exactly how Tidymodels is anywhere close to the capabilities of pytorch?
## comment_id Parent
## 1 3_4 0
## 2 3_4_4_1 0
#Check if they are "common_authors", "threaded" or "unthreaded":
unique(score_outliers_df$author) %in% common_authors
## [1] FALSE FALSE FALSE
unique(score_outliers_df$author) %in% threaded_comments_df$author
## [1] TRUE TRUE TRUE
unique(score_outliers_df$author) %in% unthreaded_comments_df$author
## [1] FALSE FALSE FALSE
# The most liked/disliked comments all spawned threads; here are the comments themselves:
top_2$comment[1]
## [1] "python is a general purpose language with a statistics library strapped on the top. R (and tidyverse) was built for numbers first. I think [this article](https://www.sumsar.net/blog/pandas-feels-clunky-when-coming-from-r/) does a good job showing side by side how has a more intuitive syntax."
top_2$comment[2]
## [1] "Pretty much ggplot and the dbplyr. I use Node for a lot, but I just really like performing an analysis inside R. I keep thinking of giving Python another shot, but I feel like at that point I would rather give Rust another shot."
bottom_2$comment[1]
## [1] ">I think this article does a good job showing side by side how has a more intuitive syntax.\n\nFirst off, this is NOT a personal attack. It's not. This is my personal opinion that that article is full of crap. \n\nBasically that article is arguing that not only Python, but C++ and the entire C family of languages is somehow intuitionally inferior to R. That's a huge....HUGE statement and one that is largely untenable. Python is immensely \"conversational\". R is absolutely not.\n\nR is a language designed specifically to solve one problem: how to efficiently consolidate statistical analysis under one umbrella, command line language. That's it. Period. Everything that came after that (ggplot2 and dplyr for example) are attempts to make R emulate the flexibility of an actual programming language. Yes they work. But they work because they have a very dedicated development community and... because they have RStudio \n\nThe result is that R packages are bulky, inefficient, and frequently mismanaged (how often do your R sessions crash??). And don't get me started on AI. Machine learning with R is mostly an exercise in fitting a square peg into an invisible hole. \n\nIf anything, that article showcases how dead-end R actually is. If you immerse yourself in R like the author, you basically have no intuition for how the vast majority of scripting and programming languages work. You're a one-trick pony with a non-translational feel for syntax.\n\nOkay. Down vote away. I don't care. I use R all the time, but I hate myself for it every time I use it."
bottom_2$comment[2]
## [1] "Yes, Tidymodels is doing NOW what scikitlearn was doing in 2015. Now, tell me exactly how Tidymodels is anywhere close to the capabilities of pytorch?"
# Load necessary libraries
library(ggplot2)
library(dplyr)
library(gridExtra)
# Ensure that the 'sentence_count' column exists by calculating it for both data frames
# If the column doesn't exist, create it by counting sentences in each comment
# Function to count sentences in a comment
count_sentences <- function(comment) {
length(unlist(strsplit(comment, "(?<=[.!?])\\s+", perl = TRUE)))
}
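# Example: count_sentences("R is great. Python works too!") returns 2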
# Calculate the number of sentences in each comment for threaded and unthreaded comments
threaded_comments_df <- threaded_comments_df %>%
mutate(sentence_count = sapply(comment, count_sentences))
unthreaded_comments_df <- unthreaded_comments_df %>%
mutate(sentence_count = sapply(comment, count_sentences))
# Remove NA and infinite values from sentence_count columns in both data frames
threaded_comments_df <- threaded_comments_df %>%
filter(!is.na(sentence_count) & is.finite(sentence_count))
unthreaded_comments_df <- unthreaded_comments_df %>%
filter(!is.na(sentence_count) & is.finite(sentence_count))
# Calculate the quartiles, mean, and upper outlier threshold for threaded comments
q1_threaded <- quantile(threaded_comments_df$sentence_count, 0.25)
q2_threaded <- quantile(threaded_comments_df$sentence_count, 0.50) # Median
q3_threaded <- quantile(threaded_comments_df$sentence_count, 0.75)
mean_threaded <- mean(threaded_comments_df$sentence_count)
iqr_threaded <- q3_threaded - q1_threaded
upper_outlier_threshold_threaded <- q3_threaded + (1.5 * iqr_threaded)
# Calculate the quartiles, mean, and upper outlier threshold for unthreaded comments
q1_unthreaded <- quantile(unthreaded_comments_df$sentence_count, 0.25)
q2_unthreaded <- quantile(unthreaded_comments_df$sentence_count, 0.50) # Median
q3_unthreaded <- quantile(unthreaded_comments_df$sentence_count, 0.75)
mean_unthreaded <- mean(unthreaded_comments_df$sentence_count)
iqr_unthreaded <- q3_unthreaded - q1_unthreaded
upper_outlier_threshold_unthreaded <- q3_unthreaded + (1.5 * iqr_unthreaded)
# Threaded comments histogram with quartile, median, mean, and outlier lines
threaded_histogram <- threaded_comments_df %>%
ggplot(aes(x = sentence_count, fill = ifelse(sentence_count > upper_outlier_threshold_threaded, "Above", "Below"))) +
geom_histogram(binwidth = 1, alpha = 0.6) +
scale_fill_manual(values = c("Below" = "lightblue", "Above" = "red"), guide = FALSE) +
geom_vline(xintercept = upper_outlier_threshold_threaded, color = "black", linetype = "dashed", size = 1) +
geom_vline(xintercept = q1_threaded, color = "darkblue", linetype = "dashed", size = 1) +
geom_vline(xintercept = q2_threaded, color = "darkblue", linetype = "dashed", size = 1) +
geom_vline(xintercept = q3_threaded, color = "darkblue", linetype = "dashed", size = 1) +
geom_vline(xintercept = mean_threaded, color = "green", linetype = "solid", size = 1.2) +
scale_color_manual(
name = "Statistics",
values = c("Outlier Threshold" = "black",
"Q1 (25th Percentile)" = "darkblue",
"Median (50th Percentile)" = "darkblue",
"Q3 (75th Percentile)" = "darkblue",
"Mean" = "green")
) +
theme_minimal() +
labs(title = "Threaded Comments", x = "Length (Number of Sentences)", y = "Count") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10),
minor_breaks = seq(0, max(threaded_comments_df$sentence_count, na.rm = TRUE), by = 1)) +
theme(panel.grid.minor = element_line(color = "gray", linetype = "dotted", size = 0.5))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Unthreaded comments histogram with quartile, median, mean, and outlier lines
unthreaded_histogram <- unthreaded_comments_df %>%
ggplot(aes(x = sentence_count, fill = ifelse(sentence_count > upper_outlier_threshold_unthreaded, "Above", "Below"))) +
geom_histogram(binwidth = 1, alpha = 0.6) +
scale_fill_manual(values = c("Below" = "lightblue", "Above" = "red"), guide = FALSE) +
geom_vline(xintercept = upper_outlier_threshold_unthreaded, color = "black", linetype = "dashed", size = 1) +
geom_vline(xintercept = q1_unthreaded, color = "darkblue", linetype = "dashed", size = 1) +
geom_vline(xintercept = q2_unthreaded, color = "darkblue", linetype = "dashed", size = 1) +
geom_vline(xintercept = q3_unthreaded, color = "darkblue", linetype = "dashed", size = 1) +
geom_vline(xintercept = mean_unthreaded, color = "green", linetype = "solid", size = 1.2) +
scale_color_manual(
name = "Statistics",
values = c("Outlier Threshold" = "black",
"Q1 (25th Percentile)" = "darkblue",
"Median (50th Percentile)" = "darkblue",
"Q3 (75th Percentile)" = "darkblue",
"Mean" = "green")
) +
theme_minimal() +
labs(title = "Unthreaded Comments", x = "Length (Number of Sentences)", y = "Count") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10),
minor_breaks = seq(0, max(unthreaded_comments_df$sentence_count, na.rm = TRUE), by = 1)) +
theme(panel.grid.minor = element_line(color = "gray", linetype = "dotted", size = 0.5))
# Use gridExtra to place the histograms side by side
grid.arrange(threaded_histogram, unthreaded_histogram, ncol = 2)
## Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in
## ggplot2 3.3.4.
## ℹ Please use "none" instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.
## No shared levels found between `names(values)` of the manual scale and the
## data's colour values.
# Load necessary libraries
library(dplyr)
library(tidytext)
# Ensure the 'comment' column exists and contains valid text data
if(!"comment" %in% colnames(threaded_comments_df)) {
stop("The 'comment' column does not exist in the data frame.")
}
# Calculate the number of sentences in each comment for the threaded comments
threaded_comments_df <- threaded_comments_df %>%
rowwise() %>%
mutate(sentence_count = length(unlist(strsplit(comment, "(?<=[.!?])\\s+", perl = TRUE)))) %>%
ungroup()
# Check if the 'sentence_count' column has been successfully populated
summary(threaded_comments_df$sentence_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 3.282 3.000 24.000
# Similarly for unthreaded comments
unthreaded_comments_df <- unthreaded_comments_df %>%
rowwise() %>%
mutate(sentence_count = length(unlist(strsplit(comment, "(?<=[.!?])\\s+", perl = TRUE)))) %>%
ungroup()
# Load necessary libraries
library(ggplot2)
library(dplyr)
library(gridExtra)
# Calculate the quartiles, mean, and upper outlier threshold for threaded comments
q1_threaded <- quantile(threaded_comments_df$sentence_count, 0.25)
q2_threaded <- quantile(threaded_comments_df$sentence_count, 0.50) # Median
q3_threaded <- quantile(threaded_comments_df$sentence_count, 0.75)
mean_threaded <- mean(threaded_comments_df$sentence_count)
iqr_threaded <- q3_threaded - q1_threaded
upper_outlier_threshold_threaded <- q3_threaded + (1.5 * iqr_threaded)
# Calculate the quartiles, mean, and upper outlier threshold for unthreaded comments
q1_unthreaded <- quantile(unthreaded_comments_df$sentence_count, 0.25)
q2_unthreaded <- quantile(unthreaded_comments_df$sentence_count, 0.50) # Median
q3_unthreaded <- quantile(unthreaded_comments_df$sentence_count, 0.75)
mean_unthreaded <- mean(unthreaded_comments_df$sentence_count)
iqr_unthreaded <- q3_unthreaded - q1_unthreaded
upper_outlier_threshold_unthreaded <- q3_unthreaded + (1.5 * iqr_unthreaded)
# Threaded comments histogram with quartile, median, mean, and outlier lines
threaded_histogram <- threaded_comments_df %>%
ggplot(aes(x = sentence_count, fill = ifelse(sentence_count > upper_outlier_threshold_threaded, "Above", "Below"))) +
geom_histogram(binwidth = 1, alpha = 0.6) +
scale_fill_manual(values = c("Below" = "lightblue", "Above" = "red"), guide = FALSE) +
geom_vline(aes(xintercept = upper_outlier_threshold_threaded, color = "Outlier Threshold"), linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = q1_threaded, color = "Q1 (25th Percentile)"), linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = q2_threaded, color = "Median (50th Percentile)"), linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = q3_threaded, color = "Q3 (75th Percentile)"), linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = mean_threaded, color = "Mean"), linetype = "solid", size = 1.2) +
scale_color_manual(
name = "Statistics",
values = c("Outlier Threshold" = "black",
"Q1 (25th Percentile)" = "darkblue",
"Median (50th Percentile)" = "darkblue",
"Q3 (75th Percentile)" = "darkblue",
"Mean" = "green")
) +
theme_minimal() +
labs(title = "Threaded Comments", x = "Length (Number of Sentences)", y = "Count") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10), minor_breaks = seq(0, max(threaded_comments_df$sentence_count, na.rm = TRUE), by = 1)) +
theme(panel.grid.minor = element_line(color = "gray", linetype = "dotted", size = 0.5))
# Unthreaded comments histogram with quartile, median, mean, and outlier lines
unthreaded_histogram <- unthreaded_comments_df %>%
ggplot(aes(x = sentence_count, fill = ifelse(sentence_count > upper_outlier_threshold_unthreaded, "Above", "Below"))) +
geom_histogram(binwidth = 1, alpha = 0.6) +
scale_fill_manual(values = c("Below" = "lightblue", "Above" = "red"), guide = FALSE) +
geom_vline(aes(xintercept = upper_outlier_threshold_unthreaded, color = "Outlier Threshold"), linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = q1_unthreaded, color = "Q1 (25th Percentile)"), linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = q2_unthreaded, color = "Median (50th Percentile)"), linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = q3_unthreaded, color = "Q3 (75th Percentile)"), linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = mean_unthreaded, color = "Mean"), linetype = "solid", size = 1.2) +
scale_color_manual(
name = "Statistics",
values = c("Outlier Threshold" = "black",
"Q1 (25th Percentile)" = "darkblue",
"Median (50th Percentile)" = "darkblue",
"Q3 (75th Percentile)" = "darkblue",
"Mean" = "green")
) +
theme_minimal() +
labs(title = "Unthreaded Comments", x = "Length (Number of Sentences)", y = "Count") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10), minor_breaks = seq(0, max(unthreaded_comments_df$sentence_count, na.rm = TRUE), by = 1)) +
theme(panel.grid.minor = element_line(color = "gray", linetype = "dotted", size = 0.5))
# Use gridExtra to place the histograms side by side
grid.arrange(threaded_histogram, unthreaded_histogram, ncol = 2)
Other Reddit projects: