A Reddit post I made in the r/RStudio subreddit, titled “People who use R of Reddit, why not just use Python?”, unexpectedly went viral, amassing nearly 50,000 views and hundreds of comments. It is safe to say this is a topic of great interest in the data community.
Inspired by a professor’s advice to learn Python (and a healthy dose of pure humor), I decided to analyze why people choose one language over the other, ironically, using R.
This project highlights my ability to manage the entire data pipeline, from scraping and cleaning data to performing sentiment analysis and building machine learning models. Drawing on nearly two years of data science experience at UCLA, I plan to complete it in three weeks. It is meant to serve as a testament to my versatility and readiness for a data-driven role.
This project aims to analyze user sentiments in Reddit comments on an R-related post to determine how positive and negative sentiments correlate with mentions of R and Python. By applying sentiment analysis techniques and leveraging machine learning models—k-Nearest Neighbors (kNN) and Naive Bayes—this project seeks to classify the overall sentiment of the community and identify trends related to each programming language. The project includes data collection, preprocessing, exploratory analysis, and model comparison, providing insights into user preferences in a structured and data-driven manner.
Tools and methods at a glance:
- Data collection: RedditExtractoR.
- Text preprocessing: tidytext (tokenization, stop word removal, stemming/lemmatization).
- Exploratory analysis: dplyr and ggplot2 to visualize word frequency and metadata.
- Sentiment analysis: syuzhet and tidytext for sentiment scoring (Bing, NRC, AFINN), visualized with ggplot2.
- Classification: Naive Bayes (naivebayes) and kNN (class) classifiers on labeled sentiment data.
- Evaluation: ROC curves with pROC and result plots with ggplot2.
Week 1 Goal: Collect Reddit data, clean it, and prepare it for analysis.
Day 1: Scrape Reddit Data
- Use the RedditExtractoR library to extract comments from the R-related Reddit post, along with metadata like upvotes, timestamps, and user information.
- Resources:
- RedditExtractoR Documentation
Day 2-3: Clean and Preprocess Data
- Tokenize the text and remove stop words, punctuation, and irrelevant text using tidytext (see the preprocessing sketch after this plan).
- Apply stemming or lemmatization to reduce words to their base form.
- Resources:
- tidytext Documentation
- New: Introduction to Natural Language Processing in R
Day 4-5: Exploratory Data Analysis (EDA)
- Use dplyr to summarize the dataset (e.g., comment length, word frequency).
- Visualize word frequency, comment length, and metadata using ggplot2.
- Resources:
- dplyr Documentation
- ggplot2 Documentation
DataCamp Course Recommendations:
- Already Taken: Data Manipulation with dplyr
- New: Introduction to Text Analysis in R
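To make the preprocessing step concrete, here is a minimal sketch assuming the scraped comments end up in a data frame called comments_df with a comment column (the object and column names are placeholders, not the final pipeline):
# Preprocessing sketch: tokenize, drop stop words, stem, and count word frequency
library(dplyr)
library(tidytext)
library(SnowballC)  # wordStem() for stemming
library(ggplot2)
word_counts <- comments_df %>%
  unnest_tokens(word, comment) %>%        # one token (word) per row
  anti_join(stop_words, by = "word") %>%  # remove common English stop words
  mutate(stem = wordStem(word)) %>%       # reduce words to their base form
  count(stem, sort = TRUE)
# Plot the 15 most frequent stems
word_counts %>%
  slice_head(n = 15) %>%
  ggplot(aes(x = reorder(stem, n), y = n)) +
  geom_col(fill = "lightblue") +
  coord_flip() +
  labs(title = "Most Frequent Words", x = "Word (stemmed)", y = "Count")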
Week 2 Goal: Apply sentiment analysis to determine if Reddit comments lean toward R or Python and how positive/negative sentiments correlate with each.
Day 1-2: Perform Sentiment Analysis
- Use the syuzhet package for sentiment scoring and extract emotional valence (positive/negative).
- Apply tidytext to tokenize the text and join it with sentiment lexicons such as Bing, NRC, and AFINN (see the scoring sketch after this plan).
- Resources:
- syuzhet Documentation
- tidytext Documentation
Day 3: Visualize Sentiments
- Use ggplot2 to plot the distribution of positive vs. negative sentiments.
- Compare sentiment frequency for “R” vs. “Python” mentions.
- Resources:
- ggplot2 Documentation
Day 4-5: Cluster Comments by Sentiment
- Apply k-means clustering or a simple distance-based approach to group comments based on sentiment polarity.
- Investigate whether positive/negative comments cluster around mentions of R or Python.
- Resources:
- tidytext Documentation
- ggplot2 Documentation
DataCamp Course Recommendations:
- Already Taken: Sentiment Analysis in R
- New: Analyzing Social Media Data in R
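As a rough illustration of the scoring step (again assuming a comments_df with a comment column; all names are placeholders), syuzhet can score whole comments while tidytext joins tokens against the Bing lexicon:
# Sentiment-scoring sketch: document-level (syuzhet) and word-level (Bing via tidytext)
library(dplyr)
library(tidyr)
library(tidytext)
library(syuzhet)
comments_scored <- comments_df %>%
  mutate(syuzhet_score = get_sentiment(comment, method = "syuzhet"))  # > 0 positive, < 0 negative
bing_by_comment <- comments_df %>%
  mutate(comment_row = row_number()) %>%
  unnest_tokens(word, comment) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%   # keep only words in the Bing lexicon
  count(comment_row, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(bing_score = positive - negative)              # net word-level sentiment per comment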
Week 3 Goal: Build, analyze, and compare kNN and Naive Bayes models for sentiment classification.
Day 1: Implement Naive Bayes Classifier
- Build a Naive Bayes classifier using the naivebayes library (a minimal modelling sketch follows this plan).
- Train the model on labeled sentiment data to predict sentiment (positive/negative).
- Evaluate the model using accuracy, precision, and recall.
- Resources:
- naivebayes Documentation
Day 2: Implement k-Nearest Neighbors (kNN)
- Use the class package to implement a kNN model and experiment with different values of k.
- Train the model on the sentiment dataset and evaluate the classification performance.
- Resources:
- class Documentation
Day 3: Model Comparison
- Compare the performance of the Naive Bayes and kNN models based on accuracy, precision, recall, and F1-score.
- Plot ROC curves for both models using pROC.
- Resources:
- pROC Documentation
Day 4: Visualize and Interpret Results
- Create compelling visualizations using ggplot2, such as confusion matrices and ROC curves.
- Use bar charts and line plots to show how each model classifies the sentiment data.
- Resources:
- ggplot2 Documentation
Day 5: Finalize and Present Findings
- Summarize which model performed better for sentiment classification and how the results relate to the R vs. Python discussion.
- Prepare a visual and written report on the findings.
- Resources:
- Supervised Learning in R: Classification
DataCamp Course Recommendations:
- Already Taken: Understanding Machine Learning
- New: Supervised Learning in R: Classification
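To preview the Week 3 comparison, here is a minimal sketch assuming a numeric feature matrix X (e.g., word counts per comment) and a factor label y with levels "positive"/"negative"; the object names and the 80/20 split are assumptions, not the final models:
# Model-comparison sketch: Naive Bayes vs. kNN, evaluated with ROC curves
library(naivebayes)
library(class)
library(pROC)
set.seed(123)
train_idx <- sample(seq_len(nrow(X)), size = 0.8 * nrow(X))
# Naive Bayes: fit on the training rows, get positive-class probabilities on the test rows
nb_fit  <- naive_bayes(x = X[train_idx, ], y = y[train_idx])
nb_prob <- predict(nb_fit, X[-train_idx, ], type = "prob")[, "positive"]
# kNN (k = 5): prob = TRUE attaches the winning-class vote share to the predictions
knn_pred <- knn(train = X[train_idx, ], test = X[-train_idx, ], cl = y[train_idx], k = 5, prob = TRUE)
knn_prob <- ifelse(knn_pred == "positive", attr(knn_pred, "prob"), 1 - attr(knn_pred, "prob"))
# ROC curves for both models
roc_nb  <- roc(response = y[-train_idx], predictor = nb_prob)
roc_knn <- roc(response = y[-train_idx], predictor = knn_prob)
plot(roc_nb, col = "blue", main = "ROC: Naive Bayes vs. kNN")
lines(roc_knn, col = "red")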
## Loading required package: RedditExtractoR
## RedditExtractoR is already downloaded before.
#?RedditExtractoR
# Define the Reddit post URL
post_url <- "https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/"
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## parsing URLs on page 1...
## [1] "list"
## [1] 0
## [1] FALSE
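The chunk that actually pulls the thread content is not echoed above; a minimal sketch of how comments_df could be built with RedditExtractoR (the exact options used are not shown, so treat this as an assumption) looks like:
# Sketch: pull the thread content for the post (the real scraping chunk is not echoed)
library(RedditExtractoR)
thread_content <- get_thread_content(post_url)  # list with $threads and $comments data frames
comments_df    <- thread_content$comments       # one row per comment: author, date, score, comment_id, ...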
# Check if 'digest' package is installed; if not, install it.
if (!require(digest)) {
install.packages("digest")
}
## Loading required package: digest
# Load necessary libraries
library(dplyr)
library(digest)
# Set seed for reproducibility
set.seed(123)
# Step 1: Create a function to hash author names using SHA-256.
# This function will hash each author name using the full SHA-256 hash (64 characters).
create_hashed_code <- function(author_name) {
hash <- digest(author_name, algo = "sha256") # Generate the full SHA-256 hash
return(hash) # Return the full 64-character hash
}
# Step 2: Replace author names directly in 'comments_df' with their corresponding hashed codes.
# No need to create a separate lookup table; everything is done within the 'mutate()' function.
randomized_comments_df <- comments_df %>%
mutate(
random_code = sapply(author, create_hashed_code), # Hash all author names
author = ifelse(
author == "Square-Problem4346", # Preserve the original name for 'Square-Problem4346'
author, # Keep unchanged
random_code # Replace other authors with hashed codes
)
) %>%
select(-random_code) # Remove the intermediate 'random_code' column
# Step 3: Display the first few rows of the updated dataframe to verify the changes.
head(randomized_comments_df$author, 10)
## [1] "50ca7cf9fa866770dae86774b05282ffb8a18c9376b43bcf04f52cf8c9cc1f67"
## [2] "Square-Problem4346"
## [3] "28643325867213509bef0e654af57308700c413deac5980e42e1ce5d6a623dc6"
## [4] "9a86c0781a60871e2b36342d6fab5536e29e5628c5ff621c23de48878a5b0ec5"
## [5] "a72c94a8e5caf7c5b66b4899cf74eb2630c467681957516282480fe045bab6d4"
## [6] "c636c3a6353ca033c8001e038bb1427432b3052c99a3dea1b19fd26b205a4a89"
## [7] "50ca7cf9fa866770dae86774b05282ffb8a18c9376b43bcf04f52cf8c9cc1f67"
## [8] "c636c3a6353ca033c8001e038bb1427432b3052c99a3dea1b19fd26b205a4a89"
## [9] "50ca7cf9fa866770dae86774b05282ffb8a18c9376b43bcf04f52cf8c9cc1f67"
## [10] "9bc3196ce78a6e0f96459f63eff6ef6dbd2ec36c80817e93f7835f889d4e661e"
# Step 4: Verify the number of unique commenters (excluding 'Square-Problem4346').
amount_of_unique_commentors <- nrow(
comments_df %>%
filter(author != "Square-Problem4346") %>%
distinct(author)
)
# Get the count of unique authors in 'randomized_comments_df'
amount_of_unique_in_randomized <- nrow(
randomized_comments_df %>%
distinct(author)
)
# Print whether the counts match (excluding 1 for 'Square-Problem4346').
print(amount_of_unique_commentors == (amount_of_unique_in_randomized - 1))
## [1] TRUE
Brief summary:
SHA-256 Security: SHA-256 is a highly secure, widely used cryptographic algorithm. Unlike SHA-1, which was broken in 2017, SHA-256 remains resistant to collision attacks, making it ideal for anonymizing user data.
Hashing vs Encryption: Hashing is a one-way, irreversible process that does not use a secret key. Even though the algorithm is known, it’s computationally infeasible to reverse the hash to retrieve the original input (e.g., usernames).
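A quick illustration of the deterministic, one-way behaviour described above (the input strings are purely illustrative):
# Hashing is deterministic: identical inputs always give identical digests
digest::digest("example_user", algo = "sha256") == digest::digest("example_user", algo = "sha256")  # TRUE
# ...while even a tiny change in the input yields a completely different digest
digest::digest("example_user2", algo = "sha256")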
No; in fact, only about 23% of users comment more than once, and of those, only about 9% comment more than twice. Essentially, most users comment once and then move on to the next dopamine rush.
# Load dplyr for data manipulation
library(dplyr)
# Add a new column 'Parent' and classify comments
randomized_comments_df <- randomized_comments_df %>%
mutate(
next_comment_id = lead(comment_id), # Check the next comment's id
Parent = ifelse(
grepl("_", comment_id), # If comment is part of a thread (has "_")
0,
ifelse( # Otherwise, determine if it's a parent
!is.na(next_comment_id) &
(!grepl("_", next_comment_id) & as.numeric(sub("_.*", "", next_comment_id)) == as.numeric(comment_id) + 1), # If the next comment is unthreaded and follows numerically
0, # Top-level comment without a thread
1 # Top-level comment with responses (parent of a thread)
)
)
) %>%
select(-next_comment_id) # Remove temporary column
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Parent = ifelse(...)`.
## Caused by warning in `ifelse()`:
## ! NAs introduced by coercion
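# Note on the warning above: `ifelse()` evaluates `as.numeric(comment_id)` for every row,
# including threaded ids like "1_1", which coerce to NA; those rows are already assigned
# Parent = 0 by the first branch, so the warning is harmless here.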
# Split into two dataframes: unthreaded and threaded comments
unthreaded_comments_df <- randomized_comments_df %>%
filter(Parent == 0 & !grepl("_", comment_id)) # Top-level comments without threads
threaded_comments_df <- randomized_comments_df %>%
filter(grepl("_", comment_id) | Parent == 1) # Threaded comments or parents
# View the two dataframes
print("Unthreaded Comments:")
## [1] "Unthreaded Comments:"
head(unthreaded_comments_df)
## url
## 1 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 2 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 3 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 4 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 5 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 6 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## author date
## 1 063ac96d8f190af509f9392b81c9ec222399f91372ad0e9fb07fa423e858b0a5 2024-09-04
## 2 f2d40d40d4257e8e9d6d4d783c509c90b0e893fd210a22f3c86d33a807935361 2024-09-04
## 3 4f64ea74eba9d17f2fc36fbe063e4d6470b555fcb648294b7e2f998761251ebe 2024-09-04
## 4 1ef7e8ecb78e7790720585cd283f2312df7af6e0a628f92b8c398e98a6424819 2024-09-04
## 5 2b77a0a723ab8f15461de67d2023193bfcde880f810b529333ad7c8abdb117e6 2024-09-04
## 6 b9bcfc870dce5ee1135e7e00cb3f7f991e5c3ca13598dfaa0395af3cd4be5c3b 2024-09-04
## timestamp score upvotes downvotes golds
## 1 1725487774 37 37 0 0
## 2 1725489691 15 15 0 0
## 3 1725491871 16 16 0 0
## 4 1725488439 12 12 0 0
## 5 1725489509 9 9 0 0
## 6 1725491497 8 8 0 0
## comment
## 1 This question I have asked myself before too. In the end I think it boils down to R being more out of the box ready to use for data wrangling and the like.
## 2 Adding to what others have said - RStudio beats out Pycharm and other IDEs by light years imho.\n\nAgreeing with what others have said: for anything biostats (or operation on spreadsheets), visualization, etc., R wins out.\n\nArguments for python (in my life) include need for ML/DL and \030true\031 programming (vs analysis)
## 3 R was built as a statistical analysis language. Why would I want to use a general purpose language when there is a language for my exact use case.\n\nTo be clear I have no problem with Python, and have learned the basics.
## 4 I'm a stats dork first and foremost, therefore R is a very good fit for me and in my experience does that better. I flip to python for more general purposes.
## 5 My school focuses on teaching R more than Python as a Data Analytics major and so far I've enjoyed it more. In one class I had to compare between statistical functions in R and Python and I always found R having more detailed functions with an easier code than Python. Idk if in general R is more prepared for detailed statistical work or not in comparison to Python unless there's Python libraries I've missed.
## 6 1. Stats and data managing tools are well developed, documented (have citable papers) and transparent. \n\n2. Bioinformatics packages are well developed, documented (have citable papers) and transparent. \n\n3. Thats what I started with and what is taught in most biostats programs. \n\nAlso, I learned a bit of Python and numpy and scipy seemed utterly ridiculous to me. But who knows if I\031d learned that first maybe it\031d be different.
## comment_id Parent
## 1 5 0
## 2 11 0
## 3 12 0
## 4 13 0
## 5 15 0
## 6 16 0
print("Threaded Comments:")
## [1] "Threaded Comments:"
head(threaded_comments_df)
## url
## 1 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 2 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 3 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 4 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 5 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 6 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## author date
## 1 50ca7cf9fa866770dae86774b05282ffb8a18c9376b43bcf04f52cf8c9cc1f67 2024-09-04
## 2 Square-Problem4346 2024-09-04
## 3 28643325867213509bef0e654af57308700c413deac5980e42e1ce5d6a623dc6 2024-09-05
## 4 9a86c0781a60871e2b36342d6fab5536e29e5628c5ff621c23de48878a5b0ec5 2024-09-05
## 5 a72c94a8e5caf7c5b66b4899cf74eb2630c467681957516282480fe045bab6d4 2024-09-07
## 6 c636c3a6353ca033c8001e038bb1427432b3052c99a3dea1b19fd26b205a4a89 2024-09-05
## timestamp score upvotes downvotes golds
## 1 1725486694 165 165 0 0
## 2 1725487023 9 9 0 0
## 3 1725494583 28 28 0 0
## 4 1725521027 5 5 0 0
## 5 1725676114 1 1 0 0
## 6 1725560242 2 2 0 0
## comment
## 1 Pretty much ggplot and the dbplyr. I use Node for a lot, but I just really like performing an analysis inside R. I keep thinking of giving Python another shot, but I feel like at that point I would rather give Rust another shot.
## 2 That\031s makes sense
## 3 Modeling and the ability to factor categorical variables without making dummies. I use python now but miss the ease of modeling in r
## 4 Do you have to make dummies to deal with categorical variables in Python?
## 5 With scipy yeah but there are other libraries that you don't have to
## 6 lets-plot\n\npolars
## comment_id Parent
## 1 1 1
## 2 1_1 0
## 3 1_1_1 0
## 4 1_1_1_1 0
## 5 1_1_1_1_1 0
## 6 1_2 0
# Load dplyr for data manipulation
library(dplyr)
# Find the authors that appear in both threaded and unthreaded comment dataframes
common_authors <- intersect(threaded_comments_df$author, unthreaded_comments_df$author)
# Display the common authors
print(common_authors)
## [1] "a72c94a8e5caf7c5b66b4899cf74eb2630c467681957516282480fe045bab6d4"
## [2] "18c55ec198f889e965bb965c19343bb7d9525dbe48161f252cdb277cf53da3c0"
## [3] "9b0701406e601311f042b43c63b4dcd0b20117a328b9d61769f507edc50a29c4"
## [4] "de64c06aaf788801551f83ebf7c73c5ab042336b2823591117fbd5e53a48207c"
## [5] "e0f5bf1ad24858f119f894783bcd35839614fb06780d05b4e3ddb0bb27c5ae67"
# Ensure 'date' column is a Date object in threaded_comments_df
threaded_comments_df$date <- as.Date(threaded_comments_df$date)
# Load necessary libraries
library(dplyr)
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(ggplot2)
# Step 1: Create 'parent_id' based on the structure of 'comment_id'
threaded_comments_df <- threaded_comments_df %>%
mutate(parent_id = ifelse(grepl("_", comment_id), # If the comment_id has "_", it's a reply
sub("_[^_]+$", "", comment_id), # Remove the last part to get the parent_id
NA)) # Top-level comments have no parent
# Step 2: Create an edge list for parent-child relationships (source = parent, target = reply)
edges <- threaded_comments_df %>%
filter(!is.na(parent_id)) %>% # Only keep comments with a parent (i.e., replies)
select(source = parent_id, target = comment_id, author) # Create the edge list for network visualization
# Step 3: Create a nodes dataframe to identify unique users (vertices for the graph)
nodes <- threaded_comments_df %>%
filter(comment_id %in% c(edges$source, edges$target)) %>%
select(id = comment_id, author) %>%
distinct()
# Step 4: Assign the distinct colors you provided to common authors
distinct_colors <- c("green", "blue", "purple", "orange", "lightblue")
# Create a named vector to map each common author to a unique color
author_color_map <- setNames(distinct_colors, common_authors)
# Step 5: Modify the edge color based on author:
# - "Square-Problem4346" will remain red
# - Common authors get their unique assigned color
# - Others will remain black
edges <- edges %>%
mutate(edge_color = case_when(
author == "Square-Problem4346" ~ "red", # Red for Square-Problem4346
author %in% common_authors ~ author_color_map[author], # Unique color for each common author
TRUE ~ "black" # Black for all other authors
))
# Step 6: Create the network graph with igraph
comment_graph <- graph_from_data_frame(d = edges %>% select(source, target), vertices = nodes, directed = TRUE)
# Step 7: Customize and visualize the network graph
# Layout the graph slightly shifted to the left
layout <- layout_as_tree(comment_graph)
# Plot the network graph
plot(
comment_graph,
layout = layout - 0.3, # Shift the layout to the left
vertex.size = 5,
vertex.label = NA, # No labels for vertices
vertex.color = "lightgrey", # Uniform color for all nodes
edge.color = edges$edge_color, # Use the edge color we assigned (red for Square-Problem4346, unique for common authors)
edge.arrow.size = 0.3,
main = "Comment Responses Network"
)
# Step 8: Add a legend for the common authors (to the right of the plot)
legend(
x = 1.2, # Position the legend to the right of the graph
y = 0.9,
legend = paste0("user_", 1:length(common_authors)), # Replace hashed names with user_1, user_2, etc.
col = distinct_colors, # Use the distinct colors for the common authors
pch = 16, # Use filled circles for the legend
title = "C-Authors"
)
# Step 1: Get the unique users from the entire dataset
unique_users_total <- randomized_comments_df %>%
distinct(author) # Get distinct users
# Step 2: Count the number of unique users
num_unique_users_total <- nrow(unique_users_total)
# Output the result
num_unique_users_total
## [1] 143
# Step 1: Identify users in threads (either as parent or child)
users_in_threads <- randomized_comments_df %>%
filter(comment_id %in% c(edges$source, edges$target)) %>%
distinct(author)
# Step 2: Identify users not in threads (standalone comments)
users_not_in_threads <- randomized_comments_df %>%
filter(!(comment_id %in% c(edges$source, edges$target))) %>%
distinct(author)
# Step 3: Identify users who are in both categories (started a thread and replied to others)
users_in_both <- users_in_threads %>%
inner_join(users_not_in_threads, by = "author") # Users present in both sets
# Step 4: Summarize the counts in a table
summary_table <- tibble(
Category = c("Not in Thread", "In Thread", "In Both"),
Count = c(nrow(users_not_in_threads),
nrow(users_in_threads),
nrow(users_in_both))
)
# Output the summary table
summary_table
## # A tibble: 3 × 2
## Category Count
## <chr> <int>
## 1 Not in Thread 88
## 2 In Thread 60
## 3 In Both 5
#verify
summary_table$Count[1] + summary_table$Count[2] - summary_table$Count[3] == num_unique_users_total
## [1] TRUE
# Ensure 'date' column is a Date object
randomized_comments_df$date <- as.Date(randomized_comments_df$date)
# Check the class of 'date' column
class(randomized_comments_df$date)
## [1] "Date"
# Check the range of dates
range(randomized_comments_df$date, na.rm = TRUE)
## [1] "2024-09-04" "2024-09-08"
percent_summary_table <- as.data.frame(summary_table) %>% mutate(Count = Count/num_unique_users_total)
percent_summary_table
## Category Count
## 1 Not in Thread 0.61538462
## 2 In Thread 0.41958042
## 3 In Both 0.03496503
# Load necessary libraries and install them if not already installed
if (!require(ggplot2)) {
install.packages("ggplot2")
}
if (!require(dplyr)) {
install.packages("dplyr")
}
if (!require(gridExtra)) {
install.packages("gridExtra")
}
## Loading required package: gridExtra
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
# Load libraries into the R session
library(ggplot2)
library(dplyr)
library(gridExtra)
# Create the histogram
histogram_plot <- randomized_comments_df %>%
ggplot(aes(x = score)) +
geom_histogram(binwidth = 5, fill = "lightgreen") + # Adjust binwidth as needed
theme_minimal() +
labs(
title = "Distribution of Scores",
x = "Score",
y = "Count"
)
# Create the boxplot (horizontally aligned for comparison)
boxplot_plot <- randomized_comments_df %>%
ggplot(aes(x = 1, y = score)) + # Use 1 as a dummy variable to align the boxplot horizontally
geom_boxplot(fill = "lightblue") +
theme_minimal() +
theme(axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank()) + # Hide x-axis details
labs(
title = "Boxplot of Scores",
y = "Score"
)
# Combine the two plots side by side
grid.arrange(histogram_plot, boxplot_plot, ncol = 2)
# Take a look at the top 2 and bottom 2 outliers by score:
top_2 <- randomized_comments_df %>% arrange(desc(score)) %>% head(2)
bottom_2 <- randomized_comments_df %>% arrange(score) %>% head(2)
#df of top_2 & bottom_2:
score_outliers_df <- rbind(top_2 %>% select(author,comment, score), bottom_2 %>% select(author,comment, score))
# Locate this author's comments in the network:
randomized_comments_df %>% filter(author == "422bed076d74660797d37b1b4bb078f5351a3147ad68961cb3289c8e4cf4950a")
## url
## 1 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## 2 https://www.reddit.com/r/RStudio/comments/1f95f5c/people_who_use_r_of_reddit_why_not_just_use_python/
## author date
## 1 422bed076d74660797d37b1b4bb078f5351a3147ad68961cb3289c8e4cf4950a 2024-09-04
## 2 422bed076d74660797d37b1b4bb078f5351a3147ad68961cb3289c8e4cf4950a 2024-09-05
## timestamp score upvotes downvotes golds
## 1 1725490338 -32 -32 0 0
## 2 1725495627 -14 -14 0 0
## comment
## 1 >I think this article does a good job showing side by side how has a more intuitive syntax.\n\nFirst off, this is NOT a personal attack. It's not. This is my personal opinion that that article is full of crap. \n\nBasically that article is arguing that not only Python, but C++ and the entire C family of languages is somehow intuitionally inferior to R. That's a huge....HUGE statement and one that is largely untenable. Python is immensely "conversational". R is absolutely not.\n\nR is a language designed specifically to solve one problem: how to efficiently consolidate statistical analysis under one umbrella, command line language. That's it. Period. Everything that came after that (ggplot2 and dplyr for example) are attempts to make R emulate the flexibility of an actual programming language. Yes they work. But they work because they have a very dedicated development community and... because they have RStudio \n\nThe result is that R packages are bulky, inefficient, and frequently mismanaged (how often do your R sessions crash??). And don't get me started on AI. Machine learning with R is mostly an exercise in fitting a square peg into an invisible hole. \n\nIf anything, that article showcases how dead-end R actually is. If you immerse yourself in R like the author, you basically have no intuition for how the vast majority of scripting and programming languages work. You're a one-trick pony with a non-translational feel for syntax.\n\nOkay. Down vote away. I don't care. I use R all the time, but I hate myself for it every time I use it.
## 2 Yes, Tidymodels is doing NOW what scikitlearn was doing in 2015. Now, tell me exactly how Tidymodels is anywhere close to the capabilities of pytorch?
## comment_id Parent
## 1 3_4 0
## 2 3_4_4_1 0
#Check if they are "common_authors", "threaded" or "unthreaded":
unique(score_outliers_df$author) %in% common_authors
## [1] FALSE FALSE FALSE
unique(score_outliers_df$author) %in% threaded_comments_df$author
## [1] TRUE TRUE TRUE
unique(score_outliers_df$author) %in% unthreaded_comments_df$author
## [1] FALSE FALSE FALSE
# The most liked/disliked comments all spawned threads; here are the comments themselves:
top_2$comment[1]
## [1] "python is a general purpose language with a statistics library strapped on the top. R (and tidyverse) was built for numbers first. I think [this article](https://www.sumsar.net/blog/pandas-feels-clunky-when-coming-from-r/) does a good job showing side by side how has a more intuitive syntax."
top_2$comment[2]
## [1] "Pretty much ggplot and the dbplyr. I use Node for a lot, but I just really like performing an analysis inside R. I keep thinking of giving Python another shot, but I feel like at that point I would rather give Rust another shot."
bottom_2$comment[1]
## [1] ">I think this article does a good job showing side by side how has a more intuitive syntax.\n\nFirst off, this is NOT a personal attack. It's not. This is my personal opinion that that article is full of crap. \n\nBasically that article is arguing that not only Python, but C++ and the entire C family of languages is somehow intuitionally inferior to R. That's a huge....HUGE statement and one that is largely untenable. Python is immensely \"conversational\". R is absolutely not.\n\nR is a language designed specifically to solve one problem: how to efficiently consolidate statistical analysis under one umbrella, command line language. That's it. Period. Everything that came after that (ggplot2 and dplyr for example) are attempts to make R emulate the flexibility of an actual programming language. Yes they work. But they work because they have a very dedicated development community and... because they have RStudio \n\nThe result is that R packages are bulky, inefficient, and frequently mismanaged (how often do your R sessions crash??). And don't get me started on AI. Machine learning with R is mostly an exercise in fitting a square peg into an invisible hole. \n\nIf anything, that article showcases how dead-end R actually is. If you immerse yourself in R like the author, you basically have no intuition for how the vast majority of scripting and programming languages work. You're a one-trick pony with a non-translational feel for syntax.\n\nOkay. Down vote away. I don't care. I use R all the time, but I hate myself for it every time I use it."
bottom_2$comment[2]
## [1] "Yes, Tidymodels is doing NOW what scikitlearn was doing in 2015. Now, tell me exactly how Tidymodels is anywhere close to the capabilities of pytorch?"
# Load necessary libraries
library(ggplot2)
library(dplyr)
library(gridExtra)
# Ensure that the 'sentence_count' column exists by calculating it for both data frames
# If the column doesn't exist, create it by counting sentences in each comment
# Function to count sentences in a comment
count_sentences <- function(comment) {
length(unlist(strsplit(comment, "(?<=[.!?])\\s+", perl = TRUE)))
}
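# Example: count_sentences("R is great. Python works too!") returns 2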
# Calculate the number of sentences in each comment for threaded and unthreaded comments
threaded_comments_df <- threaded_comments_df %>%
mutate(sentence_count = sapply(comment, count_sentences))
unthreaded_comments_df <- unthreaded_comments_df %>%
mutate(sentence_count = sapply(comment, count_sentences))
# Remove NA and infinite values from sentence_count columns in both data frames
threaded_comments_df <- threaded_comments_df %>%
filter(!is.na(sentence_count) & is.finite(sentence_count))
unthreaded_comments_df <- unthreaded_comments_df %>%
filter(!is.na(sentence_count) & is.finite(sentence_count))
# Calculate the quartiles, mean, and upper outlier threshold for threaded comments
q1_threaded <- quantile(threaded_comments_df$sentence_count, 0.25)
q2_threaded <- quantile(threaded_comments_df$sentence_count, 0.50) # Median
q3_threaded <- quantile(threaded_comments_df$sentence_count, 0.75)
mean_threaded <- mean(threaded_comments_df$sentence_count)
iqr_threaded <- q3_threaded - q1_threaded
upper_outlier_threshold_threaded <- q3_threaded + (1.5 * iqr_threaded)
# Calculate the quartiles, mean, and upper outlier threshold for unthreaded comments
q1_unthreaded <- quantile(unthreaded_comments_df$sentence_count, 0.25)
q2_unthreaded <- quantile(unthreaded_comments_df$sentence_count, 0.50) # Median
q3_unthreaded <- quantile(unthreaded_comments_df$sentence_count, 0.75)
mean_unthreaded <- mean(unthreaded_comments_df$sentence_count)
iqr_unthreaded <- q3_unthreaded - q1_unthreaded
upper_outlier_threshold_unthreaded <- q3_unthreaded + (1.5 * iqr_unthreaded)
# Threaded comments histogram with quartile, median, mean, and outlier lines
threaded_histogram <- threaded_comments_df %>%
ggplot(aes(x = sentence_count, fill = ifelse(sentence_count > upper_outlier_threshold_threaded, "Above", "Below"))) +
geom_histogram(binwidth = 1, alpha = 0.6) +
scale_fill_manual(values = c("Below" = "lightblue", "Above" = "red"), guide = FALSE) +
geom_vline(xintercept = upper_outlier_threshold_threaded, color = "black", linetype = "dashed", size = 1) +
geom_vline(xintercept = q1_threaded, color = "darkblue", linetype = "dashed", size = 1) +
geom_vline(xintercept = q2_threaded, color = "darkblue", linetype = "dashed", size = 1) +
geom_vline(xintercept = q3_threaded, color = "darkblue", linetype = "dashed", size = 1) +
geom_vline(xintercept = mean_threaded, color = "green", linetype = "solid", size = 1.2) +
scale_color_manual(
name = "Statistics",
values = c("Outlier Threshold" = "black",
"Q1 (25th Percentile)" = "darkblue",
"Median (50th Percentile)" = "darkblue",
"Q3 (75th Percentile)" = "darkblue",
"Mean" = "green")
) +
theme_minimal() +
labs(title = "Threaded Comments", x = "Length (Number of Sentences)", y = "Count") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10),
minor_breaks = seq(0, max(threaded_comments_df$sentence_count, na.rm = TRUE), by = 1)) +
theme(panel.grid.minor = element_line(color = "gray", linetype = "dotted", size = 0.5))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Unthreaded comments histogram with quartile, median, mean, and outlier lines
unthreaded_histogram <- unthreaded_comments_df %>%
ggplot(aes(x = sentence_count, fill = ifelse(sentence_count > upper_outlier_threshold_unthreaded, "Above", "Below"))) +
geom_histogram(binwidth = 1, alpha = 0.6) +
scale_fill_manual(values = c("Below" = "lightblue", "Above" = "red"), guide = FALSE) +
geom_vline(xintercept = upper_outlier_threshold_unthreaded, color = "black", linetype = "dashed", size = 1) +
geom_vline(xintercept = q1_unthreaded, color = "darkblue", linetype = "dashed", size = 1) +
geom_vline(xintercept = q2_unthreaded, color = "darkblue", linetype = "dashed", size = 1) +
geom_vline(xintercept = q3_unthreaded, color = "darkblue", linetype = "dashed", size = 1) +
geom_vline(xintercept = mean_unthreaded, color = "green", linetype = "solid", size = 1.2) +
scale_color_manual(
name = "Statistics",
values = c("Outlier Threshold" = "black",
"Q1 (25th Percentile)" = "darkblue",
"Median (50th Percentile)" = "darkblue",
"Q3 (75th Percentile)" = "darkblue",
"Mean" = "green")
) +
theme_minimal() +
labs(title = "Unthreaded Comments", x = "Length (Number of Sentences)", y = "Count") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10),
minor_breaks = seq(0, max(unthreaded_comments_df$sentence_count, na.rm = TRUE), by = 1)) +
theme(panel.grid.minor = element_line(color = "gray", linetype = "dotted", size = 0.5))
# Use gridExtra to place the histograms side by side
grid.arrange(threaded_histogram, unthreaded_histogram, ncol = 2)
## Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in
## ggplot2 3.3.4.
## ℹ Please use "none" instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.
## No shared levels found between `names(values)` of the manual scale and the
## data's colour values.
# Load necessary libraries
library(dplyr)
library(tidytext)
# Ensure the 'comment' column exists and contains valid text data
if(!"comment" %in% colnames(threaded_comments_df)) {
stop("The 'comment' column does not exist in the data frame.")
}
# Calculate the number of sentences in each comment for the threaded comments
threaded_comments_df <- threaded_comments_df %>%
rowwise() %>%
mutate(sentence_count = length(unlist(strsplit(comment, "(?<=[.!?])\\s+", perl = TRUE)))) %>%
ungroup()
# Check if the 'sentence_count' column has been successfully populated
summary(threaded_comments_df$sentence_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 3.282 3.000 24.000
# Similarly for unthreaded comments
unthreaded_comments_df <- unthreaded_comments_df %>%
rowwise() %>%
mutate(sentence_count = length(unlist(strsplit(comment, "(?<=[.!?])\\s+", perl = TRUE)))) %>%
ungroup()
# Load necessary libraries
library(ggplot2)
library(dplyr)
library(gridExtra)
# Calculate the quartiles, mean, and upper outlier threshold for threaded comments
q1_threaded <- quantile(threaded_comments_df$sentence_count, 0.25)
q2_threaded <- quantile(threaded_comments_df$sentence_count, 0.50) # Median
q3_threaded <- quantile(threaded_comments_df$sentence_count, 0.75)
mean_threaded <- mean(threaded_comments_df$sentence_count)
iqr_threaded <- q3_threaded - q1_threaded
upper_outlier_threshold_threaded <- q3_threaded + (1.5 * iqr_threaded)
# Calculate the quartiles, mean, and upper outlier threshold for unthreaded comments
q1_unthreaded <- quantile(unthreaded_comments_df$sentence_count, 0.25)
q2_unthreaded <- quantile(unthreaded_comments_df$sentence_count, 0.50) # Median
q3_unthreaded <- quantile(unthreaded_comments_df$sentence_count, 0.75)
mean_unthreaded <- mean(unthreaded_comments_df$sentence_count)
iqr_unthreaded <- q3_unthreaded - q1_unthreaded
upper_outlier_threshold_unthreaded <- q3_unthreaded + (1.5 * iqr_unthreaded)
# Threaded comments histogram with quartile, median, mean, and outlier lines
threaded_histogram <- threaded_comments_df %>%
ggplot(aes(x = sentence_count, fill = ifelse(sentence_count > upper_outlier_threshold_threaded, "Above", "Below"))) +
geom_histogram(binwidth = 1, alpha = 0.6) +
scale_fill_manual(values = c("Below" = "lightblue", "Above" = "red"), guide = FALSE) +
geom_vline(aes(xintercept = upper_outlier_threshold_threaded, color = "Outlier Threshold"), linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = q1_threaded, color = "Q1 (25th Percentile)"), linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = q2_threaded, color = "Median (50th Percentile)"), linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = q3_threaded, color = "Q3 (75th Percentile)"), linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = mean_threaded, color = "Mean"), linetype = "solid", size = 1.2) +
scale_color_manual(
name = "Statistics",
values = c("Outlier Threshold" = "black",
"Q1 (25th Percentile)" = "darkblue",
"Median (50th Percentile)" = "darkblue",
"Q3 (75th Percentile)" = "darkblue",
"Mean" = "green")
) +
theme_minimal() +
labs(title = "Threaded Comments", x = "Length (Number of Sentences)", y = "Count") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10), minor_breaks = seq(0, max(threaded_comments_df$sentence_count, na.rm = TRUE), by = 1)) +
theme(panel.grid.minor = element_line(color = "gray", linetype = "dotted", size = 0.5))
# Unthreaded comments histogram with quartile, median, mean, and outlier lines
unthreaded_histogram <- unthreaded_comments_df %>%
ggplot(aes(x = sentence_count, fill = ifelse(sentence_count > upper_outlier_threshold_unthreaded, "Above", "Below"))) +
geom_histogram(binwidth = 1, alpha = 0.6) +
scale_fill_manual(values = c("Below" = "lightblue", "Above" = "red"), guide = FALSE) +
geom_vline(aes(xintercept = upper_outlier_threshold_unthreaded, color = "Outlier Threshold"), linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = q1_unthreaded, color = "Q1 (25th Percentile)"), linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = q2_unthreaded, color = "Median (50th Percentile)"), linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = q3_unthreaded, color = "Q3 (75th Percentile)"), linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = mean_unthreaded, color = "Mean"), linetype = "solid", size = 1.2) +
scale_color_manual(
name = "Statistics",
values = c("Outlier Threshold" = "black",
"Q1 (25th Percentile)" = "darkblue",
"Median (50th Percentile)" = "darkblue",
"Q3 (75th Percentile)" = "darkblue",
"Mean" = "green")
) +
theme_minimal() +
labs(title = "Unthreaded Comments", x = "Length (Number of Sentences)", y = "Count") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10), minor_breaks = seq(0, max(unthreaded_comments_df$sentence_count, na.rm = TRUE), by = 1)) +
theme(panel.grid.minor = element_line(color = "gray", linetype = "dotted", size = 0.5))
# Use gridExtra to place the histograms side by side
grid.arrange(threaded_histogram, unthreaded_histogram, ncol = 2)
Other Reddit projects: