DS Labs Assignment
# load library with datasets
library(dslabs)
# List all datasets included in the dslabs package
data(package = "dslabs") # asked ChatGPT for a chunk that lists all available datasets in dslabs
data(trump_tweets)
str(trump_tweets)
'data.frame': 20761 obs. of 8 variables:
$ source : chr "Twitter Web Client" "Twitter Web Client" "Twitter Web Client" "Twitter Web Client" ...
$ id_str : chr "6971079756" "6312794445" "6090839867" "5775731054" ...
$ text : chr "From Donald Trump: Wishing everyone a wonderful holiday & a happy, healthy, prosperous New Year. Let’s think li"| __truncated__ "Trump International Tower in Chicago ranked 6th tallest building in world by Council on Tall Buildings & Urban "| __truncated__ "Wishing you and yours a very Happy and Bountiful Thanksgiving!" "Donald Trump Partners with TV1 on New Reality Series Entitled, Omarosa's Ultimate Merger: http://tinyurl.com/yk5m3lc" ...
$ created_at : POSIXct, format: "2009-12-23 12:38:18" "2009-12-03 14:39:09" ...
$ retweet_count : int 28 33 13 5 7 4 2 4 1 22 ...
$ in_reply_to_user_id_str: chr NA NA NA NA ...
$ favorite_count : int 12 6 11 3 6 5 2 10 4 30 ...
$ is_retweet : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(ggplot2)
library(lubridate)
Warning: package 'lubridate' was built under R version 4.4.2
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
Clean and prepare data:
- Extract the week (floor date to week)
- Create a variable for whether the tweet mentions Clinton
- Group by week and Clinton mention
- Calculate average likes, average retweets, and tweet counts
trump_tweets_clean <- trump_tweets |>
  mutate(
    week = floor_date(created_at, "week"), # ChatGPT helped with this
    clinton_mention = ifelse(grepl("Clinton|Hillary", text, ignore.case = TRUE),
                             "Mentions Clinton", "No Mention") # ChatGPT helped with this part of code
  ) |>
  group_by(week, clinton_mention) |>
  summarise(
    avg_likes = mean(favorite_count, na.rm = TRUE),
    avg_retweets = mean(retweet_count, na.rm = TRUE),
    n = n(),
    .groups = 'drop'
  )
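To make the two helper functions in the cleaning step easier to follow, here is a minimal sketch on a toy data frame. The timestamps and texts are hypothetical and are not taken from trump_tweets. It assumes lubridate's default setting, under which weeks start on Sunday.

```r
library(lubridate)

# Hypothetical example rows (not real tweets)
toy <- data.frame(
  created_at = as.POSIXct(
    c("2016-07-05 10:00:00", "2016-07-09 18:30:00", "2016-07-11 08:15:00"),
    tz = "UTC"
  ),
  text = c("Hillary spoke today", "Great rally tonight", "the clinton campaign")
)

# floor_date() rounds each timestamp down to the start of its week,
# so the first two rows share a week and the third falls in the next one
toy$week <- floor_date(toy$created_at, "week")

# grepl() returns TRUE when the pattern matches; ignore.case = TRUE
# lets "clinton" and "Hillary" both count as mentions
toy$clinton_mention <- ifelse(
  grepl("Clinton|Hillary", toy$text, ignore.case = TRUE),
  "Mentions Clinton", "No Mention"
)

toy
```

Rows that share a `week` value and a `clinton_mention` label are the ones that `group_by()` and `summarise()` collapse into a single point in the plot.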
Visualize the data:
- X-axis: Week
- Y-axis: Average likes
- Point size: Average retweets
- Point color: Clinton mention status
- Add labels, legends, and custom style
viz <- ggplot(trump_tweets_clean, aes(week, avg_likes, color = clinton_mention, size = avg_retweets)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Trump's Tweets: Likes Over Time",
    subtitle = "Size shows average retweets; color shows Clinton mention",
    x = "Week",
    y = "Average Likes",
    color = "Clinton Mention",
    size = "Avg Retweets"
  ) +
  scale_color_manual(
    values = c("Mentions Clinton" = "#ba34eb", "No Mention" = "#f035dd")
  ) +
  scale_size_continuous( # asked ChatGPT to find this function and to explain its properties
    range = c(2, 12),
    breaks = c(20000, 40000, 60000),
    labels = c("20K", "40K", "60K")
  ) +
  theme_minimal()
viz
For this assignment, I used the trump_tweets dataset from the dslabs package. The dataset contains Donald Trump’s tweets, including the tweet text, the time each tweet was posted, and its like and retweet counts. I wanted to explore whether mentioning “Clinton” or “Hillary” in a tweet was associated with different engagement. First, I prepared the data by assigning each tweet to a week with the floor_date function from lubridate, so I could summarize activity over time. I also created a new column with grepl that checks whether the tweet text contains “Clinton” or “Hillary” (ignoring case) and labels each tweet as either “Mentions Clinton” or “No Mention”. Then I calculated the average number of likes and retweets, plus the tweet count, for each combination of week and mention status. For my visualization, I created a scatter plot in which the x-axis is the week, the y-axis is average likes, point size is average retweets, and point color shows whether the tweets mention Clinton. I customized the colors and the legend. From my plot, it looks like tweets that mention Clinton tend to receive more likes and retweets than those that do not, especially in later periods. However, I did not run any formal statistical tests, so I cannot be confident in this finding.
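One possible follow-up to the caveat about formal testing is a two-sample comparison of tweet-level like counts. The sketch below is an assumption about how that could be done and was not part of the original analysis. It uses a Wilcoxon rank-sum test (base R's `wilcox.test`) because like counts are likely heavily skewed; the object names `tweets_labeled` and `test_result` are made up for illustration.

```r
library(dslabs)
library(dplyr)

data(trump_tweets)

# Label each tweet with the same pattern used in the cleaning step
tweets_labeled <- trump_tweets |>
  mutate(
    clinton_mention = ifelse(grepl("Clinton|Hillary", text, ignore.case = TRUE),
                             "Mentions Clinton", "No Mention")
  )

# Rank-based test of whether like counts differ between the two groups
# (assumption: tweets are treated as independent observations)
test_result <- wilcox.test(favorite_count ~ clinton_mention, data = tweets_labeled)
test_result
```

Even a small p-value here would not show that mentioning Clinton causes higher engagement. Likes grew over the years for all tweets, and Clinton mentions may be concentrated in particular periods, so a comparison within a narrower time window would be a more careful next step.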