A Comparative Study of R vs Python Using Sentiment Analysis and Machine Learning

Project Motivation:

A Reddit post I made in the r/RStudio subreddit, titled “People who use R of Reddit, why not just use Python?”, unexpectedly went viral, amassing over 60,000 views and hundreds of comments and shares. It is safe to say this is a topic of great interest in the data community.

Inspired by a professor’s advice to learn Python (and partly for the humor of it), I decided to analyze why people choose one language over the other using, ironically, R rather than Python.


General Overview :


Project Objective :

In this project I managed the entire data pipeline: scraping Reddit’s nested comment structure, cleaning and exploring the data, and building sentiment analysis and machine learning models. To protect user privacy, I anonymized usernames with SHA-256 hashing; unlike encryption, hashing is one-way and computationally infeasible to reverse, so privacy is preserved without degrading data quality. Balancing new approaches against tight deadlines sharpened my problem-solving skills and deepened my expertise in NLP, data visualization, and modeling. Working with the nested comment structure also grew my interest in network science, and I am excited to explore related fields such as computer vision while keeping to rigorous, ethical data practices. I invite you to explore the project to see these insights firsthand and how each challenge was met with a data-driven solution.


Day-by-Day Schedule :

Week 1: Data Collection and Preparation

Goal: Complete scraping, preprocessing, tokenization, and check the frequency of special characters and slang.

  • Day 1 (6 hours):
    • Reddit Extraction and Data Scraping:
      • Libraries: RedditExtractoR, dplyr
      • Scrape Reddit comments and extract metadata.
      • Perform initial data investigation (missing values, unique authors).
      • Ethics: Anonymize user data using SHA-256 hashing.
      • Estimated Time: 6-7 hours.
  • Day 2-3 (10 hours total):
    • Clean and Preprocess Data:
      • Libraries: dplyr, tidytext
      • Classify comments as threaded or unthreaded.
      • Visualize comment structure with networks.
      • Investigate user behavior, such as the frequency of comments in threads.
      • Estimated Time: 12-14 hours.
  • Day 4 (7 hours):
    • Complete Tokenization and Special Character/Slang Analysis:
      • Libraries: tidytext, dplyr
      • Check the frequency of special characters and slang to decide if further processing is needed.
      • Tokenize text, remove stop words, punctuation, apply lemmatization or stemming.
      • Estimated Time: 6-7 hours.
  • Day 5 (7 hours):
    • Perform Exploratory Data Analysis (EDA):
      • Libraries: dplyr, ggplot2
      • Visualize word frequencies, comment lengths, basic statistics, and check the relevance of special characters and slang.
      • Estimated Time: 6-7 hours.
  • Optional Day 6 (2 hours):
    • Wrap up tokenization, EDA, or finalize decision on handling special characters and slang.

Week 2: Sentiment Analysis and Labeling

Goal: Perform sentiment analysis, apply both manual and NLP-based labeling, and incorporate special character and slang handling.

  • Day 1-2 (14 hours total):
    • Manual Labeling:
      • Libraries: dplyr
      • Manually label a subset of the data with the reasons given for choosing R or Python, incorporating insights from special text behavior, including slang.
      • Estimated Time: 7 hours each day for two days (14 hours total).
  • Day 3-4 (14 hours total):
    • NLP-Based Labeling:
      • Libraries: tidytext, dplyr
      • Automate the labeling process with dictionary or keyword approaches. Refine labels manually and integrate special character and slang handling as needed.
      • Estimated Time: 7 hours each day for two days (14 hours total).
  • Day 5 (7 hours):
    • Train Initial Model (Manual Labels):
      • Libraries: naivebayes, e1071
      • Train Naive Bayes or SVM models on the manually labeled data, ensuring the model accounts for special characters and slang.
      • Estimated Time: 6-7 hours.
  • Optional Day 6 (2 hours):
    • Fine-tune or adjust the labeled dataset and model.

Week 3: Model Comparison and Evaluation

Goal: Train models on both NLP-labeled and manually labeled data, evaluate their performance, and adjust based on special text elements, including slang.

  • Day 1 (7 hours):
    • Train Models on NLP-Labeled Data:
      • Libraries: naivebayes, e1071, class
      • Train models using the NLP-labeled dataset, incorporating special character and slang handling.
      • Estimated Time: 6-7 hours.
  • Day 2 (7 hours):
    • Generate Predictions:
      • Libraries: naivebayes, e1071, class
      • Generate predictions using both manual and NLP-labeled models, considering special text behavior, including slang.
      • Estimated Time: 6-7 hours.
  • Day 3 (7 hours):
    • Evaluate Performance (Accuracy and Confusion Matrices):
      • Libraries: dplyr, ggplot2
      • Compare models using accuracy, confusion matrices, and how well they handle special text elements, including slang.
      • Estimated Time: 6-7 hours.
  • Day 4 (7 hours):
    • ROC/AUC Curves, Precision, Recall, and F1-Score:
      • Libraries: ggplot2, e1071
      • Compute ROC/AUC curves, precision, recall, and F1-score to evaluate model performance, factoring in special text elements, including slang.
      • Estimated Time: 6-7 hours.
  • Day 5 (7 hours):
    • Finalize Results and Conclusions:
      • Summarize which model performed better, and the impact of handling slang and special text on the final results.
      • Estimated Time: 6-7 hours.
  • Optional Day 6 (2 hours):
    • Refine model outputs or revisit comparisons.

Project Code :

Day 1 : Reddit Extraction : Data Scraping and Anonymization (6 hrs)

Install RedditExtractoR :

## Loading required package: RedditExtractoR
## RedditExtractoR is already downloaded before.
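
The install chunk itself is not echoed above; presumably it follows the same conditional pattern used later for the digest package. A minimal sketch, assuming that pattern:

# Install RedditExtractoR only if it is not already available
if (!require(RedditExtractoR)) {
  install.packages("RedditExtractoR")
  library(RedditExtractoR)
} else {
  message("RedditExtractoR is already downloaded before.")
}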

Load file into Workspace:

#?RedditExtractoR

# Define the Reddit post URL
post_url <- "https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/"

Investigate Data :

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## parsing URLs on page 1...
## [1] "list"
## [1] 0
## [1] FALSE
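
The chunk that produced the output above is not echoed. A minimal sketch of what it presumably does, using RedditExtractoR’s get_thread_content() to pull the comment table for the post; the three printed checks are assumptions inferred from the output (object class, count of missing comments, a logical sanity check on comment IDs):

# Load libraries and scrape the thread's comments
library(dplyr)
library(RedditExtractoR)

thread_content <- get_thread_content(post_url)   # returns a list of thread metadata and comments
comments_df    <- thread_content$comments

# Initial investigation (the exact checks here are assumptions)
print(class(thread_content))                     # "list"
print(sum(is.na(comments_df$comment)))           # number of missing comment texts
print(any(duplicated(comments_df$comment_id)))   # any duplicated comment IDs?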

Ethics : Anonymize user-data :

# Check if 'digest' package is installed; if not, install it.
if (!require(digest)) {
  install.packages("digest")
}
## Loading required package: digest
# Load necessary libraries
library(dplyr)
library(digest)

# Set seed for reproducibility
set.seed(123)

# Step 1: Create a function to hash author names using SHA-256.
# This function will hash each author name using the full SHA-256 hash (64 characters).
create_hashed_code <- function(author_name) {
  hash <- digest(author_name, algo = "sha256")  # Generate the full SHA-256 hash
  return(hash)  # Return the full 64-character hash
}

# Step 2: Replace author names directly in 'comments_df' with their corresponding hashed codes.
# No need to create a separate lookup table; everything is done within the 'mutate()' function.
randomized_comments_df <- comments_df %>%
  mutate(
    random_code = sapply(author, create_hashed_code),  # Hash all author names
    author = ifelse(
      author == "Square-Problem4346",  # Preserve the original name for 'Square-Problem4346'
      author,                          # Keep unchanged
      random_code                      # Replace other authors with hashed codes
    )
  ) %>%
  select(-random_code)  # Remove the intermediate 'random_code' column

# Step 3: Display the first few rows of the updated dataframe to verify the changes.
head(randomized_comments_df$author, 10)
##  [1] "50ca7cf9fa866770dae86774b05282ffb8a18c9376b43bcf04f52cf8c9cc1f67"
##  [2] "Square-Problem4346"                                              
##  [3] "28643325867213509bef0e654af57308700c413deac5980e42e1ce5d6a623dc6"
##  [4] "9a86c0781a60871e2b36342d6fab5536e29e5628c5ff621c23de48878a5b0ec5"
##  [5] "a72c94a8e5caf7c5b66b4899cf74eb2630c467681957516282480fe045bab6d4"
##  [6] "c636c3a6353ca033c8001e038bb1427432b3052c99a3dea1b19fd26b205a4a89"
##  [7] "50ca7cf9fa866770dae86774b05282ffb8a18c9376b43bcf04f52cf8c9cc1f67"
##  [8] "c636c3a6353ca033c8001e038bb1427432b3052c99a3dea1b19fd26b205a4a89"
##  [9] "50ca7cf9fa866770dae86774b05282ffb8a18c9376b43bcf04f52cf8c9cc1f67"
## [10] "9bc3196ce78a6e0f96459f63eff6ef6dbd2ec36c80817e93f7835f889d4e661e"
# Step 4: Verify the number of unique commenters (excluding 'Square-Problem4346').
amount_of_unique_commentors <- nrow(
  comments_df %>%
  filter(author != "Square-Problem4346") %>%
  distinct(author)
)

# Get the count of unique authors in 'randomized_comments_df'
amount_of_unique_in_randomized <- nrow(
  randomized_comments_df %>%
  distinct(author)
)

# Print whether the counts match (excluding 1 for 'Square-Problem4346').
print(amount_of_unique_commentors == (amount_of_unique_in_randomized - 1))
## [1] TRUE

Brief summary :

  • SHA-256 Security: SHA-256 is a highly secure, widely used cryptographic algorithm. Unlike SHA-1, which was broken in 2017, SHA-256 remains resistant to collision attacks, making it ideal for anonymizing user data.

  • Hashing vs Encryption: Hashing is a one-way, irreversible process that does not use a secret key. Even though the algorithm is known, it’s computationally infeasible to reverse the hash to retrieve the original input (e.g., usernames).
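
As a concrete illustration of why hashing preserves the structure the analysis needs: SHA-256 is deterministic, so the same username always maps to the same digest and unique-author counts survive anonymization. A minimal check with a hypothetical username ("example_user" is made up):

library(digest)

h1 <- digest("example_user", algo = "sha256")  # "example_user" is a made-up name
h2 <- digest("example_user", algo = "sha256")
identical(h1, h2)   # TRUE: deterministic, so joins and counts are preserved
nchar(h1)           # 64 hexadecimal characters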

Briefly Investigate User Behavior :

Do users come back and comment repeatedly? No: only about 23% of users comment more than once and, of those, only about 9% comment more than twice. Essentially, users usually comment once and then move on to the next dopamine rush.
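
The chunk behind these percentages is not echoed; a minimal sketch of how they could be computed from the anonymized comment table (the exact grouping choices are assumptions):

library(dplyr)

# Comments per (hashed) author
comments_per_author <- randomized_comments_df %>%
  count(author, name = "n_comments")

# Share of users who comment more than once (about 23% reported above)
mean(comments_per_author$n_comments > 1)

# Of those repeat commenters, the share who comment more than twice (about 9% reported above)
repeat_commenters <- comments_per_author %>% filter(n_comments > 1)
mean(repeat_commenters$n_comments > 2)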

Day 2-3 : Clean and Preprocess Data (10 hrs)

EDA-PRIOR : 1 : Determine Threads :

# Load dplyr for data manipulation
library(dplyr)

# Add a new column 'Parent' and classify comments
randomized_comments_df <- randomized_comments_df %>%
  mutate(
    next_comment_id = lead(comment_id),  # Check the next comment's id
    Parent = ifelse(
      grepl("_", comment_id),  # If comment is part of a thread (has "_")
      0,
      ifelse(   # Otherwise, determine if it's a parent
        !is.na(next_comment_id) & 
        (!grepl("_", next_comment_id) & as.numeric(sub("_.*", "", next_comment_id)) == as.numeric(comment_id) + 1),  # If the next comment is unthreaded and follows numerically
        0,  # Top-level comment without a thread
        1   # Top-level comment with responses (parent of a thread)
      )
    )
  ) %>%
  select(-next_comment_id)  # Remove temporary column
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Parent = ifelse(...)`.
## Caused by warning in `ifelse()`:
## ! NAs introduced by coercion
# Split into two dataframes: unthreaded and threaded comments
unthreaded_comments_df <- randomized_comments_df %>%
  filter(Parent == 0 & !grepl("_", comment_id))  # Top-level comments without threads

threaded_comments_df <- randomized_comments_df %>%
  filter(grepl("_", comment_id) | Parent == 1)  # Threaded comments or parents

# View the two dataframes
print("Unthreaded Comments:")
## [1] "Unthreaded Comments:"
head(unthreaded_comments_df)
##                                                                                                     url
## 1 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 2 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 3 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 4 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 5 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 6 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
##                                                             author       date
## 1 063ac96d8f190af509f9392b81c9ec222399f91372ad0e9fb07fa423e858b0a5 2024-09-04
## 2 f2d40d40d4257e8e9d6d4d783c509c90b0e893fd210a22f3c86d33a807935361 2024-09-04
## 3 4f64ea74eba9d17f2fc36fbe063e4d6470b555fcb648294b7e2f998761251ebe 2024-09-04
## 4 1ef7e8ecb78e7790720585cd283f2312df7af6e0a628f92b8c398e98a6424819 2024-09-04
## 5 2b77a0a723ab8f15461de67d2023193bfcde880f810b529333ad7c8abdb117e6 2024-09-04
## 6 b9bcfc870dce5ee1135e7e00cb3f7f991e5c3ca13598dfaa0395af3cd4be5c3b 2024-09-04
##    timestamp score upvotes downvotes golds
## 1 1725487774    35      35         0     0
## 2 1725489691    15      15         0     0
## 3 1725491871    16      16         0     0
## 4 1725488439    13      13         0     0
## 5 1725489509     9       9         0     0
## 6 1725491497     8       8         0     0
##                                                                                                                                                                                                                                                                                                                                                                                                                                                      comment
## 1                                                                                                                                                                                                                                                                                                This question I have asked myself before too. In the end I think it boils down to R being more out of the box ready to use for data wrangling and the like.
## 2                                                                                                                      Adding to what others have said - RStudio beats out Pycharm and other IDEs by light years imho.\n\nAgreeing with what others have said: for anything biostats (or operation on spreadsheets), visualization, etc., R wins out.\n\nArguments for python (in my life) include need for ML/DL and \030true\031 programming (vs analysis)
## 3                                                                                                                                                                                                                              R was built as a statistical analysis language. Why would I want to use a general purpose language when there is a language for my exact use case.\n\nTo be clear I have no problem with Python, and have learned the basics.
## 4                                                                                                                                                                                                                                                                                              I'm a stats dork first and foremost, therefore R is a very good fit for me and in my experience does that better. I flip to python for more general purposes.
## 5                                My school focuses on teaching R more than Python as a Data Analytics major and so far I've enjoyed it more. In one class I had to compare between statistical functions in R and Python and I always found R having more detailed functions with an easier code than Python. Idk if in general R is more prepared for detailed statistical work or not in comparison to Python unless there's Python libraries I've missed.
## 6 1. Stats and data managing tools are well developed, documented (have citable papers) and transparent. \n\n2. Bioinformatics packages are well developed, documented (have citable papers) and transparent. \n\n3. Thats what I started with and what is taught in most biostats programs. \n\nAlso, I learned a bit of Python and numpy and scipy seemed utterly ridiculous to me. But who knows if I\031d learned that first maybe it\031d be different.
##   comment_id Parent
## 1          5      0
## 2         11      0
## 3         12      0
## 4         13      0
## 5         15      0
## 6         16      0
print("Threaded Comments:")
## [1] "Threaded Comments:"
head(threaded_comments_df)
##                                                                                                     url
## 1 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 2 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 3 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 4 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 5 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 6 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
##                                                             author       date
## 1 50ca7cf9fa866770dae86774b05282ffb8a18c9376b43bcf04f52cf8c9cc1f67 2024-09-04
## 2                                               Square-Problem4346 2024-09-04
## 3 28643325867213509bef0e654af57308700c413deac5980e42e1ce5d6a623dc6 2024-09-05
## 4 9a86c0781a60871e2b36342d6fab5536e29e5628c5ff621c23de48878a5b0ec5 2024-09-05
## 5 a72c94a8e5caf7c5b66b4899cf74eb2630c467681957516282480fe045bab6d4 2024-09-07
## 6 c636c3a6353ca033c8001e038bb1427432b3052c99a3dea1b19fd26b205a4a89 2024-09-05
##    timestamp score upvotes downvotes golds
## 1 1725486694   166     166         0     0
## 2 1725487023    10      10         0     0
## 3 1725494583    29      29         0     0
## 4 1725521027     4       4         0     0
## 5 1725676114     1       1         0     0
## 6 1725560242     2       2         0     0
##                                                                                                                                                                                                                                comment
## 1 Pretty much ggplot and the dbplyr. I use Node for a lot, but I just really like performing an analysis inside R. I keep thinking of giving Python another shot, but I feel like at that point I would rather give Rust another shot.
## 2                                                                                                                                                                                                                That\031s makes sense
## 3                                                                                                 Modeling and the ability to factor categorical variables without making dummies. I use python now but miss the ease of modeling in r
## 4                                                                                                                                                            Do you have to make dummies to deal with categorical variables in Python?
## 5                                                                                                                                                                 With scipy yeah but there are other libraries that you don't have to
## 6                                                                                                                                                                                                                  lets-plot\n\npolars
##   comment_id Parent
## 1          1      1
## 2        1_1      0
## 3      1_1_1      0
## 4    1_1_1_1      0
## 5  1_1_1_1_1      0
## 6        1_2      0
# Load dplyr for data manipulation
library(dplyr)

# Find the authors that appear in both threaded and unthreaded comment dataframes
common_authors <- intersect(threaded_comments_df$author, unthreaded_comments_df$author)

# Display the common authors
print(common_authors)
## [1] "a72c94a8e5caf7c5b66b4899cf74eb2630c467681957516282480fe045bab6d4"
## [2] "18c55ec198f889e965bb965c19343bb7d9525dbe48161f252cdb277cf53da3c0"
## [3] "9b0701406e601311f042b43c63b4dcd0b20117a328b9d61769f507edc50a29c4"
## [4] "de64c06aaf788801551f83ebf7c73c5ab042336b2823591117fbd5e53a48207c"
## [5] "e0f5bf1ad24858f119f894783bcd35839614fb06780d05b4e3ddb0bb27c5ae67"

EDA-PRIOR : 2 : Visualize : Threaded Network

# Ensure 'date' column is a Date object in threaded_comments_df
threaded_comments_df$date <- as.Date(threaded_comments_df$date)

# Load necessary libraries
library(dplyr)
library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
library(ggplot2)

# Step 1: Create 'parent_id' based on the structure of 'comment_id'
threaded_comments_df <- threaded_comments_df %>%
  mutate(parent_id = ifelse(grepl("_", comment_id),  # If the comment_id has "_", it's a reply
                            sub("_[^_]+$", "", comment_id),  # Remove the last part to get the parent_id
                            NA))  # Top-level comments have no parent

# Step 2: Create an edge list for parent-child relationships (source = parent, target = reply)
edges <- threaded_comments_df %>%
  filter(!is.na(parent_id)) %>%  # Only keep comments with a parent (i.e., replies)
  select(source = parent_id, target = comment_id, author)  # Create the edge list for network visualization

# Step 3: Create a nodes dataframe to identify unique users (vertices for the graph)
nodes <- threaded_comments_df %>%
  filter(comment_id %in% c(edges$source, edges$target)) %>%
  select(id = comment_id, author) %>%
  distinct()

# Step 4: Assign the distinct colors you provided to common authors
distinct_colors <- c("green", "blue", "purple", "orange", "lightblue")

# Create a named vector to map each common author to a unique color
author_color_map <- setNames(distinct_colors, common_authors)

# Step 5: Modify the edge color based on author:
# - "Square-Problem4346" will remain red
# - Common authors get their unique assigned color
# - Others will remain black
edges <- edges %>%
  mutate(edge_color = case_when(
    author == "Square-Problem4346" ~ "red",  # Red for Square-Problem4346
    author %in% common_authors ~ author_color_map[author],  # Unique color for each common author
    TRUE ~ "black"  # Black for all other authors
  ))

# Step 6: Create the network graph with igraph
comment_graph <- graph_from_data_frame(d = edges %>% select(source, target), vertices = nodes, directed = TRUE)

# Step 7: Customize and visualize the network graph

# Layout the graph slightly shifted to the left
layout <- layout_as_tree(comment_graph)

# Plot the network graph
plot(
  comment_graph, 
  layout = layout - 0.3,  # Shift the layout to the left
  vertex.size = 5, 
  vertex.label = NA,  # No labels for vertices
  vertex.color = "lightgrey",  # Uniform color for all nodes
  edge.color = edges$edge_color,  # Use the edge color we assigned (red for Square-Problem4346, unique for common authors)
  edge.arrow.size = 0.3, 
  main = "Comment Responses Network"
)

# Step 8: Add a legend for the common authors (to the right of the plot)
legend(
  x = 1.2,  # Position the legend to the right of the graph
  y = 0.9, 
  legend = paste0("user_", 1:length(common_authors)),  # Replace hashed names with user_1, user_2, etc.
  col = distinct_colors,  # Use the distinct colors for the common authors
  pch = 16,  # Use filled circles for the legend
  title = "C-Authors"
)

EDA-PRIOR : 3 : Thread-Summary :

# Step 1: Get the unique users from the entire dataset
unique_users_total <- randomized_comments_df %>%
  distinct(author)  # Get distinct users

# Step 2: Count the number of unique users
num_unique_users_total <- nrow(unique_users_total)

# Output the result
num_unique_users_total
## [1] 143
# Step 1: Identify users in threads (either as parent or child)
users_in_threads <- randomized_comments_df %>%
  filter(comment_id %in% c(edges$source, edges$target)) %>%
  distinct(author)

# Step 2: Identify users not in threads (standalone comments)
users_not_in_threads <- randomized_comments_df %>%
  filter(!(comment_id %in% c(edges$source, edges$target))) %>%
  distinct(author)

# Step 3: Identify users who appear in both categories (posted both threaded and standalone comments)
users_in_both <- users_in_threads %>%
  inner_join(users_not_in_threads, by = "author")  # Users present in both sets

# Step 4: Summarize the counts in a table
summary_table <- tibble(
  Category = c("Not in Thread", "In Thread", "In Both"),
  Count = c(nrow(users_not_in_threads), 
            nrow(users_in_threads), 
            nrow(users_in_both))
)

# Output the summary table
summary_table
## # A tibble: 3 × 2
##   Category      Count
##   <chr>         <int>
## 1 Not in Thread    88
## 2 In Thread        60
## 3 In Both           5
#verify
summary_table$Count[1] + summary_table$Count[2] - summary_table$Count[3] == num_unique_users_total
## [1] TRUE
# Ensure 'date' column is a Date object
randomized_comments_df$date <- as.Date(randomized_comments_df$date)

# Check the class of 'date' column
class(randomized_comments_df$date)
## [1] "Date"
# Check the range of dates
range(randomized_comments_df$date, na.rm = TRUE)
## [1] "2024-09-04" "2024-09-08"
percent_summary_table <- as.data.frame(summary_table) %>% mutate(Count = Count/num_unique_users_total) 
percent_summary_table
##        Category      Count
## 1 Not in Thread 0.61538462
## 2     In Thread 0.41958042
## 3       In Both 0.03496503

EDA-PRIOR : 4 : Visualize : Score Dist.

# Load necessary libraries and install them if not already installed
if (!require(ggplot2)) {
  install.packages("ggplot2")
}
if (!require(dplyr)) {
  install.packages("dplyr")
}
if (!require(gridExtra)) {
  install.packages("gridExtra")
}
## Loading required package: gridExtra
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
# Load libraries into the R session
library(ggplot2)
library(dplyr)
library(gridExtra)

# Create the histogram
histogram_plot <- randomized_comments_df %>%
  ggplot(aes(x = score)) +
  geom_histogram(binwidth = 5, fill = "lightgreen") +  # Adjust binwidth as needed
  theme_minimal() +
  labs(
    title = "Distribution of Scores",
    x = "Score",
    y = "Count"
  )

# Create the boxplot (horizontally aligned for comparison)
boxplot_plot <- randomized_comments_df %>%
  ggplot(aes(x = 1, y = score)) +  # Use 1 as a dummy variable to align the boxplot horizontally
  geom_boxplot(fill = "lightblue") +
  theme_minimal() +
  theme(axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank()) +  # Hide x-axis details
  labs(
    title = "Boxplot of Scores",
    y = "Score"
  )

# Combine the two plots side by side
grid.arrange(histogram_plot, boxplot_plot, ncol = 2)

#Take a look at the top 2 and bottom 2 outliers : 
top_2 <- randomized_comments_df %>% arrange(desc(score)) %>% head(2)
bottom_2 <- randomized_comments_df %>% arrange(score) %>% head(2)

#df of top_2 & bottom_2: 
score_outliers_df <- rbind(top_2 %>% select(author,comment, score), bottom_2 %>% select(author,comment, score))

#Check for where he is in the network to find him : 
randomized_comments_df %>% filter(author == "422bed076d74660797d37b1b4bb078f5351a3147ad68961cb3289c8e4cf4950a")
##                                                                                                     url
## 1 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 2 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
##                                                             author       date
## 1 422bed076d74660797d37b1b4bb078f5351a3147ad68961cb3289c8e4cf4950a 2024-09-04
## 2 422bed076d74660797d37b1b4bb078f5351a3147ad68961cb3289c8e4cf4950a 2024-09-05
##    timestamp score upvotes downvotes golds
## 1 1725490338   -32     -32         0     0
## 2 1725495627   -15     -15         0     0
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          comment
## 1 &gt;I think this article does a good job showing side by side how has a more intuitive syntax.\n\nFirst off, this is NOT a personal attack. It's not. This is my personal opinion that that article is full of crap. \n\nBasically that article is arguing that not only Python, but C++ and the entire C family of languages is somehow intuitionally inferior to R. That's a huge....HUGE statement and one that is largely untenable. Python is immensely "conversational". R is absolutely not.\n\nR is a language designed specifically to solve one problem: how to efficiently consolidate statistical analysis under one umbrella, command line language. That's it. Period. Everything that came after that (ggplot2 and dplyr for example) are attempts to make R emulate the flexibility of an actual programming language. Yes they work. But they work because they have a very dedicated development community and... because they have RStudio \n\nThe result is that R packages are bulky, inefficient, and frequently mismanaged (how often do your R sessions crash??). And don't get me started on AI. Machine learning with R is mostly an exercise in fitting a square peg into an invisible hole. \n\nIf anything, that article showcases how dead-end R actually is. If you immerse yourself in R like  the author, you basically have no intuition for how the vast majority of scripting and programming languages work. You're a one-trick pony with a non-translational feel for syntax.\n\nOkay. Down vote away. I don't care. I use R all the time, but I hate myself for it every time I use it.
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         Yes, Tidymodels is doing NOW what scikitlearn was doing in 2015. Now, tell me exactly how Tidymodels is anywhere close to the capabilities of pytorch?
##   comment_id Parent
## 1        3_4      0
## 2    3_4_4_1      0
#Check if they are "common_authors", "threaded" or "unthreaded": 
unique(score_outliers_df$author) %in% common_authors
## [1] FALSE FALSE FALSE
unique(score_outliers_df$author) %in% threaded_comments_df$author
## [1] TRUE TRUE TRUE
unique(score_outliers_df$author) %in% unthreaded_comments_df$author
## [1] FALSE FALSE FALSE
# The most liked and most disliked comments spawn threads; inspect their text:
top_2$comment[1]
## [1] "python is a general purpose language with a statistics library strapped on the top. R (and tidyverse) was built for numbers first. I think [this article](https://www.sumsar.net/blog/pandas-feels-clunky-when-coming-from-r/) does a good job showing side by side how has a more intuitive syntax."
top_2$comment[2]
## [1] "Pretty much ggplot and the dbplyr. I use Node for a lot, but I just really like performing an analysis inside R. I keep thinking of giving Python another shot, but I feel like at that point I would rather give Rust another shot."
bottom_2$comment[1]
## [1] "&gt;I think this article does a good job showing side by side how has a more intuitive syntax.\n\nFirst off, this is NOT a personal attack. It's not. This is my personal opinion that that article is full of crap. \n\nBasically that article is arguing that not only Python, but C++ and the entire C family of languages is somehow intuitionally inferior to R. That's a huge....HUGE statement and one that is largely untenable. Python is immensely \"conversational\". R is absolutely not.\n\nR is a language designed specifically to solve one problem: how to efficiently consolidate statistical analysis under one umbrella, command line language. That's it. Period. Everything that came after that (ggplot2 and dplyr for example) are attempts to make R emulate the flexibility of an actual programming language. Yes they work. But they work because they have a very dedicated development community and... because they have RStudio \n\nThe result is that R packages are bulky, inefficient, and frequently mismanaged (how often do your R sessions crash??). And don't get me started on AI. Machine learning with R is mostly an exercise in fitting a square peg into an invisible hole. \n\nIf anything, that article showcases how dead-end R actually is. If you immerse yourself in R like  the author, you basically have no intuition for how the vast majority of scripting and programming languages work. You're a one-trick pony with a non-translational feel for syntax.\n\nOkay. Down vote away. I don't care. I use R all the time, but I hate myself for it every time I use it."
bottom_2$comment[2]
## [1] "Yes, Tidymodels is doing NOW what scikitlearn was doing in 2015. Now, tell me exactly how Tidymodels is anywhere close to the capabilities of pytorch?"

EDA-PRIOR : 5 : Visualize : Response Length by Sentences :

# Load necessary libraries
library(ggplot2)
library(dplyr)

# Ensure the 'sentence_count' column exists by calculating it for randomized_comments_df

# Function to count sentences in a comment
count_sentences <- function(comment) {
  length(unlist(strsplit(comment, "(?<=[.!?])\\s+", perl = TRUE)))
}

# Calculate the number of sentences in each comment for the entire dataset
randomized_comments_df <- randomized_comments_df %>%
  mutate(sentence_count = sapply(comment, count_sentences))

# Remove NA and infinite values from sentence_count column
randomized_comments_df <- randomized_comments_df %>%
  filter(!is.na(sentence_count) & is.finite(sentence_count))

# Calculate the quartiles, mean, and upper outlier threshold for all comments
q1_overall <- quantile(randomized_comments_df$sentence_count, 0.25)
q2_overall <- quantile(randomized_comments_df$sentence_count, 0.50)  # Median
q3_overall <- quantile(randomized_comments_df$sentence_count, 0.75)
mean_overall <- mean(randomized_comments_df$sentence_count)
iqr_overall <- q3_overall - q1_overall
upper_outlier_threshold_overall <- q3_overall + (1.5 * iqr_overall)

# Overall comments histogram with quartile, median, mean, and outlier lines
overall_histogram <- randomized_comments_df %>%
  ggplot(aes(x = sentence_count, fill = ifelse(sentence_count > upper_outlier_threshold_overall, "Above", "Below"))) +
  geom_histogram(binwidth = 1, alpha = 0.6) +
  scale_fill_manual(values = c("Below" = "lightblue", "Above" = "red"), guide = FALSE) +
  geom_vline(xintercept = upper_outlier_threshold_overall, color = "black", linetype = "dashed", size = 1) +
  geom_vline(xintercept = q1_overall, color = "darkblue", linetype = "dashed", size = 1) +
  geom_vline(xintercept = q2_overall, color = "darkblue", linetype = "dashed", size = 1) +
  geom_vline(xintercept = q3_overall, color = "darkblue", linetype = "dashed", size = 1) +
  geom_vline(xintercept = mean_overall, color = "green", linetype = "solid", size = 1.2) +
  scale_color_manual(
    name = "Statistics", 
    values = c("Outlier Threshold" = "black", 
               "Q1 (25th Percentile)" = "darkblue", 
               "Median (50th Percentile)" = "darkblue", 
               "Q3 (75th Percentile)" = "darkblue", 
               "Mean" = "green")
  ) +
  theme_minimal() +
  labs(title = "Distribution of Comments (Overall)", x = "Length (Number of Sentences)", y = "Count") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10), 
                     minor_breaks = seq(0, max(randomized_comments_df$sentence_count, na.rm = TRUE), by = 1)) +
  theme(panel.grid.minor = element_line(color = "gray", linetype = "dotted", size = 0.5))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Display the overall histogram
print(overall_histogram)
## Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in
## ggplot2 3.3.4.
## ℹ Please use "none" instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.

EDA-PRIOR : 6 : Response Length by Thread Status (Threaded vs. Unthreaded) :

# Load necessary libraries
library(dplyr)
library(tidytext)

# Ensure the 'comment' column exists and contains valid text data
if(!"comment" %in% colnames(threaded_comments_df)) {
  stop("The 'comment' column does not exist in the data frame.")
}

# Calculate the number of sentences in each comment for the threaded comments
threaded_comments_df <- threaded_comments_df %>%
  rowwise() %>%
  mutate(sentence_count = length(unlist(strsplit(comment, "(?<=[.!?])\\s+", perl = TRUE)))) %>%
  ungroup()

# Check if the 'sentence_count' column has been successfully populated
summary(threaded_comments_df$sentence_count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   3.282   3.000  24.000
# Similarly for unthreaded comments
unthreaded_comments_df <- unthreaded_comments_df %>%
  rowwise() %>%
  mutate(sentence_count = length(unlist(strsplit(comment, "(?<=[.!?])\\s+", perl = TRUE)))) %>%
  ungroup()


# Load necessary libraries
library(ggplot2)
library(dplyr)
library(gridExtra)

# Ensure that the 'sentence_count' column exists by calculating it for both data frames
# If the column doesn't exist, create it by counting sentences in each comment

# Function to count sentences in a comment
count_sentences <- function(comment) {
  length(unlist(strsplit(comment, "(?<=[.!?])\\s+", perl = TRUE)))
}

# Calculate the number of sentences in each comment for threaded and unthreaded comments
threaded_comments_df <- threaded_comments_df %>%
  mutate(sentence_count = sapply(comment, count_sentences))

unthreaded_comments_df <- unthreaded_comments_df %>%
  mutate(sentence_count = sapply(comment, count_sentences))

# Remove NA and infinite values from sentence_count columns in both data frames
threaded_comments_df <- threaded_comments_df %>%
  filter(!is.na(sentence_count) & is.finite(sentence_count))

unthreaded_comments_df <- unthreaded_comments_df %>%
  filter(!is.na(sentence_count) & is.finite(sentence_count))

# Calculate the quartiles, mean, and upper outlier threshold for threaded comments
q1_threaded <- quantile(threaded_comments_df$sentence_count, 0.25)
q2_threaded <- quantile(threaded_comments_df$sentence_count, 0.50)  # Median
q3_threaded <- quantile(threaded_comments_df$sentence_count, 0.75)
mean_threaded <- mean(threaded_comments_df$sentence_count)
iqr_threaded <- q3_threaded - q1_threaded
upper_outlier_threshold_threaded <- q3_threaded + (1.5 * iqr_threaded)

# Calculate the quartiles, mean, and upper outlier threshold for unthreaded comments
q1_unthreaded <- quantile(unthreaded_comments_df$sentence_count, 0.25)
q2_unthreaded <- quantile(unthreaded_comments_df$sentence_count, 0.50)  # Median
q3_unthreaded <- quantile(unthreaded_comments_df$sentence_count, 0.75)
mean_unthreaded <- mean(unthreaded_comments_df$sentence_count)
iqr_unthreaded <- q3_unthreaded - q1_unthreaded
upper_outlier_threshold_unthreaded <- q3_unthreaded + (1.5 * iqr_unthreaded)

# Threaded comments histogram with quartile, median, mean, and outlier lines
threaded_histogram <- threaded_comments_df %>%
  ggplot(aes(x = sentence_count, fill = ifelse(sentence_count > upper_outlier_threshold_threaded, "Above", "Below"))) +
  geom_histogram(binwidth = 1, alpha = 0.6) +
  scale_fill_manual(values = c("Below" = "lightblue", "Above" = "red"), guide = FALSE) +
  geom_vline(xintercept = upper_outlier_threshold_threaded, color = "black", linetype = "dashed", size = 1) +
  geom_vline(xintercept = q1_threaded, color = "darkblue", linetype = "dashed", size = 1) +
  geom_vline(xintercept = q2_threaded, color = "darkblue", linetype = "dashed", size = 1) +
  geom_vline(xintercept = q3_threaded, color = "darkblue", linetype = "dashed", size = 1) +
  geom_vline(xintercept = mean_threaded, color = "green", linetype = "solid", size = 1.2) +
  scale_color_manual(
    name = "Statistics", 
    values = c("Outlier Threshold" = "black", 
               "Q1 (25th Percentile)" = "darkblue", 
               "Median (50th Percentile)" = "darkblue", 
               "Q3 (75th Percentile)" = "darkblue", 
               "Mean" = "green")
  ) +
  theme_minimal() +
  labs(title = "Threaded Comments", x = "Length (Number of Sentences)", y = "Count") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10), 
                     minor_breaks = seq(0, max(threaded_comments_df$sentence_count, na.rm = TRUE), by = 1)) +
  theme(panel.grid.minor = element_line(color = "gray", linetype = "dotted", size = 0.5))

# Unthreaded comments histogram with quartile, median, mean, and outlier lines
unthreaded_histogram <- unthreaded_comments_df %>%
  ggplot(aes(x = sentence_count, fill = ifelse(sentence_count > upper_outlier_threshold_unthreaded, "Above", "Below"))) +
  geom_histogram(binwidth = 1, alpha = 0.6) +
  scale_fill_manual(values = c("Below" = "lightblue", "Above" = "red"), guide = FALSE) +
  geom_vline(xintercept = upper_outlier_threshold_unthreaded, color = "black", linetype = "dashed", size = 1) +
  geom_vline(xintercept = q1_unthreaded, color = "darkblue", linetype = "dashed", size = 1) +
  geom_vline(xintercept = q2_unthreaded, color = "darkblue", linetype = "dashed", size = 1) +
  geom_vline(xintercept = q3_unthreaded, color = "darkblue", linetype = "dashed", size = 1) +
  geom_vline(xintercept = mean_unthreaded, color = "green", linetype = "solid", size = 1.2) +
  scale_color_manual(
    name = "Statistics", 
    values = c("Outlier Threshold" = "black", 
               "Q1 (25th Percentile)" = "darkblue", 
               "Median (50th Percentile)" = "darkblue", 
               "Q3 (75th Percentile)" = "darkblue", 
               "Mean" = "green")
  ) +
  theme_minimal() +
  labs(title = "Unthreaded Comments", x = "Length (Number of Sentences)", y = "Count") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10), 
                     minor_breaks = seq(0, max(unthreaded_comments_df$sentence_count, na.rm = TRUE), by = 1)) +
  theme(panel.grid.minor = element_line(color = "gray", linetype = "dotted", size = 0.5))

# Use gridExtra to place the histograms side by side
grid.arrange(threaded_histogram, unthreaded_histogram, ncol = 2)
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.
## No shared levels found between `names(values)` of the manual scale and the
## data's colour values.

Day 4 : Complete Tokenization and Special Character/Slang Analysis (7 hrs)

Text Cleaning :

# Load necessary libraries
library(dplyr)
library(stringr)

# Function to count sentences in a comment
count_sentences <- function(comment) {
  if (!is.character(comment)) return(0)  # Handle non-character values
  length(unlist(strsplit(comment, "(?<=[.!?])\\s+", perl = TRUE)))
}

# Function to count paragraphs in a comment
count_paragraphs <- function(comment) {
  if (!is.character(comment)) return(0)  # Handle non-character values
  length(unlist(strsplit(comment, "\n+", perl = TRUE)))
}

# Function to extract special characters from a comment
extract_special_characters <- function(comment) {
  if (!is.character(comment)) return(character())  # Handle non-character values
  special_characters <- str_extract_all(comment, "[[:punct:]]")[[1]]
  special_characters
}

# Function to determine emphasis type, frequency, and logic for emphasis
determine_emphasis_info <- function(comment) {
  chars <- extract_special_characters(comment)
  emphasis_freq <- sum(chars %in% c("!", "?"))  # Count the number of "!" and "?"
  
  # Determine type of emphasis
  if (sum(chars == "?") > 1 && sum(chars == "!") == 0) {
    emphasis_type <- "??"
  } else if (sum(chars == "!") > 1 && sum(chars == "?") == 0) {
    emphasis_type <- "!!"
  } else if (sum(chars == "!") > 0 && sum(chars == "?") > 0) {
    emphasis_type <- "?!"
  } else if (sum(chars == "!") == 1 && sum(chars == "?") == 0) {
    emphasis_type <- "!"
  } else {
    emphasis_type <- ""  # No emphasis
  }
  
  # Logical emphasis indicator
  contains_emphasis <- ifelse(emphasis_freq > 0, TRUE, FALSE)
  
  list(
    contains_emphasis = contains_emphasis,
    emphasis_type = emphasis_type,
    emphasis_freq = emphasis_freq
  )
}

# Apply the new functions to the dataframe
randomized_comments_df <- randomized_comments_df %>%
  mutate(
    comment = as.character(comment),  # Ensure the comment column is character type
    sentence_count = sapply(comment, count_sentences),
    paragraph_count = sapply(comment, count_paragraphs),
    
    # Apply emphasis logic
    emphasis_info = lapply(comment, determine_emphasis_info),
    contains_emphasis = sapply(emphasis_info, function(info) info$contains_emphasis),
    emphasis_type = sapply(emphasis_info, function(info) info$emphasis_type),
    emphasis_freq = sapply(emphasis_info, function(info) info$emphasis_freq)
  ) %>%
  filter(!is.na(sentence_count) & is.finite(sentence_count) &
         !is.na(paragraph_count) & is.finite(paragraph_count)) 

# Testing : 
head(randomized_comments_df %>% arrange(desc(emphasis_freq)), 3)
##                                                                                                     url
## 1 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 2 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 3 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
##                                                             author       date
## 1 3110f81465945005e5458f7d3d2e30219d1499a09508d8b89244ab924540f7e4 2024-09-04
## 2 9bc3196ce78a6e0f96459f63eff6ef6dbd2ec36c80817e93f7835f889d4e661e 2024-09-08
## 3 f35cbf3f98e257499d84d70b7d8bd14d4ee0764675f90fd5fb9738766613649e 2024-09-05
##    timestamp score upvotes downvotes golds
## 1 1725492694     5       5         0     0
## 2 1725839519     1       1         0     0
## 3 1725575881     1       1         0     0
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       comment
## 1 I invested a lot of time learning R. I'm not convinced python is better. \n\nAfter getting into shiny and packages like webGL I realized the real language to learn is the web dev stuff (html / css /JavaScript). \n\nAll the cool interactive graphics and geospatial stuff that R and python make are really just JavaScript enabled html files. [HTMLwidgets](https://www.htmlwidgets.org/) makes it all possible for R. \n\nMost of the real magic in R and python are coded in "packages" that are housed on a repo somewhere and sometimes those packages aren't really R scripts or python software, but the packages run other software languages.  \n\n* Leaflet? That's [leaflet.js](https://leafletjs.com/)\n\n* Shiny? The UI is literally just [client-side html](https://shiny.posit.co/py/docs/ui-html.html)\n\n* Plot_ly? Oh you mean [plotly.js](https://plotly.com/javascript/) (which is itself built on [d3.js](https://d3js.org/) )\n\nR offers a good way to assemble html files. \n\n**TLDR: the R vs. Python debate is a false dichotomy; everyone should just learn JavaScript.**
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               I mean, I can write R and dplyr will translate it into sql for me, so I don't have to write two languages. Is there a way to do that in Python?\n\nSo far, the best way I've found to deal with it is write a sql in a separate file and use python to read &amp; execute it, but maybe there's a more python-y way to do it?
## 3         Really, not a general purpose language? Given that it is often referred to as the second best language for everything one could say it is the most general purpose language.  After web tools, Python is the most used language.\n\nAs for scripting languages what even is that?  Usually one thinks of shell scripts or batch files.   If you think that is python\031s primary use case, you would be wrong. Web development, API, machine learning, data science, infrastructure, AI are all substantial engineering tasks that sit on python. Most are not scripts.\n\nIf you choose python (and its ecosystem) you value development speed over run time speed.  It is  easy to read, easy to write. It is often the best choice because runtime speed just doesn\031t matter or you are able to sit on top of pandas, or polars or PyTorch so at least one bottleneck is eliminated.\n\nPython has many issues, it is slow to run (in some use cases), isn\031t great on mobile, has deployment pain points, doesn\031t have a great threading model (until 3.13 is released) to name a few.
##   comment_id Parent sentence_count paragraph_count emphasis_info
## 1         46      1             12               9   TRUE, ??, 3
## 2  1_2_1_1_2      0              3               2   TRUE, ??, 2
## 3      3_2_2      0             12               4   TRUE, ??, 2
##   contains_emphasis emphasis_type emphasis_freq
## 1              TRUE            ??             3
## 2              TRUE            ??             2
## 3              TRUE            ??             2
# Load necessary libraries
library(ggplot2)
library(dplyr)

# Create a summarized data frame with symbolic labels and percentages in the label
emphasis_data <- randomized_comments_df %>%
  count(emphasis_freq) %>%
  mutate(emphasis_label = case_when(
    emphasis_freq == 0 ~ "No Emphasis",
    emphasis_freq == 1 ~ "!",
    emphasis_freq == 2 ~ "!!",
    emphasis_freq == 3 ~ "??",
    TRUE ~ "?!"
  ),
  percent = round(100 * n / sum(n), 1),  # Calculate percentage
  emphasis_label_with_percent = paste0(emphasis_label, " (", percent, "%)"))  # Combine label and percentage

# Create a pie chart using ggplot
ggplot(emphasis_data, aes(x = "", y = n, fill = emphasis_label_with_percent)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  theme_void() +  # Remove background, grid, and axis marks
  labs(fill = "Emphasis Type") +
  ggtitle("Distribution of Emphasis Types in Comments") +
  theme(plot.title = element_text(hjust = 0.5))

Load Urban Dictionary :

# Read the CSV file
slang_data <- read.csv("/Users/isaiahmireles/Downloads/Slang Text.csv", stringsAsFactors = FALSE)
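
The slang lookup is only loaded at this point; the matching step is not shown yet. A minimal sketch of how it might be joined against tokenized comments with tidytext follows. The column name word in slang_data is an assumption about the CSV's header and should be adjusted to the real one:

library(dplyr)
library(tidytext)

# Tokenize comments into lowercase words and drop standard English stop words
comment_tokens <- randomized_comments_df %>%
  select(comment_id, comment) %>%
  unnest_tokens(word, comment) %>%
  anti_join(stop_words, by = "word")

# Flag tokens that appear in the slang lookup
# NOTE: assuming slang_data has a column named 'word'; rename if the CSV header differs
slang_hits <- comment_tokens %>%
  semi_join(slang_data %>% mutate(word = tolower(word)), by = "word") %>%
  count(word, sort = TRUE)

head(slang_hits)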