DS Labs Assignment

Author

V. Lyon

# load library with datasets
library(dslabs)

# List all datasets included in the dslabs package
data(package = "dslabs") # asked ChatGPT for a chunk that lists all datasets available in dslabs

data(trump_tweets)

str(trump_tweets)

'data.frame':   20761 obs. of  8 variables:
 $ source                 : chr  "Twitter Web Client" "Twitter Web Client" "Twitter Web Client" "Twitter Web Client" ...
 $ id_str                 : chr  "6971079756" "6312794445" "6090839867" "5775731054" ...
 $ text                   : chr  "From Donald Trump: Wishing everyone a wonderful holiday & a happy, healthy, prosperous New Year. Let’s think li"| __truncated__ "Trump International Tower in Chicago ranked 6th tallest building in world by Council on Tall Buildings & Urban "| __truncated__ "Wishing you and yours a very Happy and Bountiful Thanksgiving!" "Donald Trump Partners with TV1 on New Reality Series Entitled, Omarosa's Ultimate Merger: http://tinyurl.com/yk5m3lc" ...
 $ created_at             : POSIXct, format: "2009-12-23 12:38:18" "2009-12-03 14:39:09" ...
 $ retweet_count          : int  28 33 13 5 7 4 2 4 1 22 ...
 $ in_reply_to_user_id_str: chr  NA NA NA NA ...
 $ favorite_count         : int  12 6 11 3 6 5 2 10 4 30 ...
 $ is_retweet             : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(ggplot2)
library(lubridate)

Warning: package 'lubridate' was built under R version 4.4.2


Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

Clean and prepare data:

- Extract the week (floor date to week)

- Create a variable for whether the tweet mentions Clinton

- Group by week and Clinton mention

- Calculate average likes, average retweets, and tweet counts

trump_tweets_clean <- trump_tweets |>
  mutate(
    # Round each timestamp down to the start of its week (ChatGPT helped with this)
    week = floor_date(created_at, "week"),
    # Flag tweets whose text contains "Clinton" or "Hillary", ignoring case
    # (ChatGPT helped with this part of the code)
    clinton_mention = ifelse(grepl("Clinton|Hillary", text, ignore.case = TRUE),
                             "Mentions Clinton", "No Mention")
  ) |>
  # One summary row per week and mention status
  group_by(week, clinton_mention) |>
  summarise(
    avg_likes = mean(favorite_count, na.rm = TRUE),
    avg_retweets = mean(retweet_count, na.rm = TRUE),
    n = n(), # number of tweets in the group
    .groups = "drop"
  )
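
To make the cleaning steps easier to follow, here is a minimal sketch of the same pipeline on a tiny made-up data frame. The `toy` tweets, their dates, and their like counts are invented for illustration and are not taken from trump_tweets. The sketch also passes `week_start = 7` (Sunday) explicitly, which is assumed to match lubridate's default setting.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data (invented for illustration, not from trump_tweets)
toy <- data.frame(
  text = c("Crooked HILLARY again", "Hillary Clinton is wrong",
           "Great rally tonight", "clinton emails", "Thank you Ohio"),
  created_at = as.POSIXct(
    c("2016-07-04 09:00:00", "2016-07-05 12:00:00", "2016-07-06 18:30:00",
      "2016-07-12 08:15:00", "2016-07-13 21:00:00"),
    tz = "UTC"
  ),
  favorite_count = c(100, 60, 40, 80, 20)
)

toy_weekly <- toy |>
  mutate(
    # Round each timestamp down to the Sunday that starts its week
    week = floor_date(created_at, "week", week_start = 7),
    # Case-insensitive match on either name
    clinton_mention = ifelse(grepl("Clinton|Hillary", text, ignore.case = TRUE),
                             "Mentions Clinton", "No Mention")
  ) |>
  group_by(week, clinton_mention) |>
  summarise(
    avg_likes = mean(favorite_count),
    n = n(),
    .groups = "drop"
  )

print(toy_weekly)
```

The first two toy tweets fall in the same week and both match the pattern, so they collapse into a single row with their likes averaged. This is the same thing that happens to each week and mention group in trump_tweets_clean.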

Visualize the data:

- X-axis: Week

- Y-axis: Average likes

- Point size: Average retweets

- Point color: Clinton mention status

- Add labels, legends, and custom style

viz <- ggplot(trump_tweets_clean, aes(week, avg_likes, color = clinton_mention, size = avg_retweets)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Trump's Tweets: Likes Over Time",
    subtitle = "Size shows average retweets; color shows Clinton mention",
    x = "Week",
    y = "Average Likes",
    color = "Clinton Mention",
    size = "Avg Retweets"
  ) +
  scale_color_manual(
    values = c("Mentions Clinton" = "#ba34eb", "No Mention" = "#f035dd")
  ) +
  scale_size_continuous( # asked ChatGPT to find this function and explain its arguments
    range = c(2, 12),
    breaks = c(20000, 40000, 60000),
    labels = c("20K", "40K", "60K")
  ) +
  theme_minimal()
viz

For this assignment, I used the trump_tweets dataset from the dslabs package. The dataset contains Donald Trump’s tweets over time, including the tweet text, the date, the number of likes (favorite_count), and the number of retweets (retweet_count). I wanted to explore whether tweets that mention Clinton or Hillary received different engagement from tweets that do not.

First, I prepared the data. I assigned each tweet to a week using the floor_date function from lubridate, so I could summarize activity over time. I also created a new column using grepl that checks whether the tweet text mentions “Clinton” or “Hillary” (ignoring case) and labels each tweet as either “Mentions Clinton” or “No Mention”. Then I calculated the average number of likes, the average number of retweets, and the number of tweets for each week and mention group.

For my visualization, I created a scatter plot where the x-axis represents time (week), the y-axis represents average likes, the point size represents average retweets, and the point color shows whether the tweets in that group mentioned Clinton. I customized the colors and the size legend.

From my plot, it looks like the weekly averages for tweets that mention Clinton tend to show more likes and retweets than those for tweets that do not, especially in later periods. However, I did not conduct any formal statistical testing, so this is only a visual impression and not a confirmed finding.
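
As a rough illustration of what a formal check could look like, the sketch below runs a Wilcoxon rank-sum test with base R's `wilcox.test` on a small made-up data frame. The `toy_tests` values are invented and the result says nothing about the real tweets. Choosing this particular test is an assumption, not something done in this assignment. On the real data, a test like this would also need to account for the overall growth in likes over time, since later tweets get more engagement regardless of topic.

```r
# Hypothetical like counts (invented for illustration, not from trump_tweets)
toy_tests <- data.frame(
  clinton_mention = rep(c("Mentions Clinton", "No Mention"), each = 5),
  favorite_count = c(120, 150, 90, 200, 170, 60, 80, 75, 110, 50)
)

# Wilcoxon rank-sum test: do the two groups differ in their typical like count?
result <- wilcox.test(favorite_count ~ clinton_mention, data = toy_tests)

print(result)
```

A small p-value here would only indicate that the two toy groups differ. It would not show that mentioning Clinton causes higher engagement.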