First, we load all the files from the dataset folder and combine them into a single dataset. The data are located in the Canada’s Data folder, in the subfolder Data obtained from DiscoverText.
## Rows: 12307 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (22): Title, Text, [M] alternate date format:, [M] city:, [M] country:,...
## dbl (10): [M] ave:, [M] desktop reach:, [M] engagement:, [M] mobile reach:,...
## lgl (2): ReferenceText, Annotations
## time (1): [M] time:
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
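The loading step is not shown in the rendered output. A minimal sketch of how several CSV files in one folder could be read and stacked into a single data frame is given here, assuming the files share the same columns. The folder `discovertext_demo`, the file names, and the toy contents are hypothetical; only `Title` and `Text` are column names taken from the message above.

```r
library(readr)
library(dplyr)

# Hypothetical folder: a temporary directory with two toy CSV files stands in
# for the real data folder
data_dir <- file.path(tempdir(), "discovertext_demo")
dir.create(data_dir, showWarnings = FALSE)
write_csv(tibble(Title = c("a", "b"), Text = c("x", "y")),
          file.path(data_dir, "part1.csv"))
write_csv(tibble(Title = "c", Text = "z"),
          file.path(data_dir, "part2.csv"))

# List every CSV file in the folder, read each one, and stack the rows
files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
combined <- files %>%
  lapply(read_csv, show_col_types = FALSE) %>%
  bind_rows()
```

Passing `show_col_types = FALSE` suppresses the column specification message for each file.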
We calculate the mean number of retweets (RTs) for each year, based on the column RT Count.
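A per-year mean of this kind could be computed roughly as follows. This is only a sketch on toy values: the column names `year` and `retweet_count` are assumptions, and missing counts are skipped with `na.rm = TRUE`, which is a choice made for this illustration.

```r
library(dplyr)

# Toy data; `year` and `retweet_count` are hypothetical column names
tweets <- tibble(
  year = c(2022, 2022, 2022, 2023, 2023),
  retweet_count = c(0, 4, NA, 10, 20)
)

# Mean number of retweets per year, ignoring missing counts
mean_rt_by_year <- tweets %>%
  group_by(year) %>%
  summarise(mean_rt = mean(retweet_count, na.rm = TRUE), .groups = "drop")
```

If missing counts should be treated as zero retweets instead, they would need to be replaced before averaging, which lowers the mean.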
Since we have two different datasets, we focus first on the earlier tweets (early2022), where we have a column that we can use to identify the most retweeted tweets.
library(dplyr)
# Rename the column `retweet count` to retweet_count
df <- early2022 %>% rename(retweet_count = `retweet count`)
# Replace NA with 0
df$retweet_count[is.na(df$retweet_count)] <- 0
# Count how many tweets share each retweet_count value
new_df <- df %>%
  count(retweet_count)
# SAMPLING USING PERCENTILE
# Cumulative frequency of tweets, in ascending order of retweet_count
cumulative_frequency <- new_df %>%
  mutate(cumulative_n = cumsum(n))
# SAMPLING USING QUARTILE
# 75th percentile of the distinct retweet_count values (not weighted by n)
quantile_threshold <- quantile(new_df$retweet_count, probs = 0.75)
# Keep the retweet_count values at or above the threshold
quantile_data <- new_df %>%
  filter(retweet_count >= quantile_threshold)
# Print the sample data
print(quantile_data)
## # A tibble: 7 × 2
##   retweet_count     n
##           <dbl> <int>
## 1            30     1
## 2            31     2
## 3            37     2
## 4            44     1
## 5            45     1
## 6            79     3
## 7           551     1
However, these data were discarded because the sample does not contain enough RTs. The number of RTs was reported in the column “retweet count”, but when we searched the entire dataset for the RTs themselves, only a few were found.
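The source does not show how the RTs were searched for. One way such a check could look is sketched here, under the assumption that a retweet can be recognised by text starting with "RT @". The columns `Text` and `retweet count` follow the source; the rows and the helper column `is_rt` are invented for illustration.

```r
library(dplyr)

# Toy data; the rows are invented for illustration
tweets <- tibble(
  Text = c(
    "Original tweet about energy",
    "RT @user_a: Original tweet about energy",
    "Another tweet",
    "RT @user_b: Something else"
  ),
  `retweet count` = c(79, NA, 0, NA)
)

# Assumption: a retweet is a row whose text starts with "RT @"
tweets <- tweets %>%
  mutate(is_rt = grepl("^RT @", Text))

# RTs reported in the count column versus RTs actually present as rows
reported_rts <- sum(tweets$`retweet count`, na.rm = TRUE)
found_rts <- sum(tweets$is_rt)
```

A large gap between `reported_rts` and `found_rts` would point to the situation described above, where reported retweets are mostly absent from the collected data.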
Next, we work with the second dataset, covering late 2022 to 2023.
Here we need to count the occurrences of each parent URL, since this dataset has no retweet count column. This could be a problem because we do not know how many RTs Twitter reports for each tweet; however, it gives us the actual number of RTs present in the data.
## # A tibble: 6,084 × 2
##    `[M] parent url:`                                               retweet_count
##    <chr>                                                                   <int>
##  1 0                                                                        2126
##  2 https://twitter.com/MikeSchreiner/statuses/1677373739040907264            191
##  3 https://twitter.com/JohnLeePettim13/statuses/16184776550238167…            90
##  4 https://twitter.com/JunkScience/statuses/1686038870352162816               87
##  5 https://twitter.com/ABDanielleSmith/statuses/16340183101227786…            79
##  6 https://twitter.com/CanadaNuclear/statuses/1602325412461289473             61
##  7 https://twitter.com/GasPriceWizard/statuses/1585226318290989056            59
##  8 https://twitter.com/JonathanWNV/statuses/1584928602516180992               55
##  9 https://twitter.com/ONenergy/statuses/1677322032554430467                  49
## 10 https://twitter.com/_ClimateCraze/statuses/1670573629531144197             47
## # ℹ 6,074 more rows
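The code behind this table is not shown. A count of rows per parent URL could be obtained roughly as in the sketch below. The column name `[M] parent url:` and the output name `retweet_count` come from the output above; the data frame `late_sketch`, the placeholder URLs, and the reading of "0" as "no parent URL" are assumptions.

```r
library(dplyr)

# Toy stand-in for the late 2022-2023 data; the URLs are invented placeholders.
# "0" presumably marks rows without a parent URL (an assumption).
late_sketch <- tibble(
  `[M] parent url:` = c(
    "0", "0",
    "https://example.com/statuses/1", "https://example.com/statuses/1",
    "https://example.com/statuses/1",
    "https://example.com/statuses/2"
  )
)

# One row per parent URL, with the number of rows pointing to it
rt_counts <- late_sketch %>%
  count(`[M] parent url:`, name = "retweet_count", sort = TRUE)

# Drop the "0" placeholder so that only retweeted tweets remain
rt_counts_only <- rt_counts %>%
  filter(`[M] parent url:` != "0")
```

Counting rows in this way measures only the retweets captured in the dataset, which may be fewer than the number Twitter reports for the same tweet.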
## # A tibble: 38 × 2
##    retweet_count     n
##            <dbl> <dbl>
##  1           191     1
##  2            90     1
##  3            87     1
##  4            79     1
##  5            61     1
##  6            59     1
##  7            55     1
##  8            49     1
##  9            47     1
## 10            32     1
## # ℹ 28 more rows
## # A tibble: 42 × 2
##    retweet_count     n
##            <dbl> <dbl>
##  1           191     1
##  2            90     1
##  3            87     1
##  4            79     4
##  5            61     1
##  6            59     1
##  7            55     1
##  8            49     1
##  9            47     1
## 10            45     1
## # ℹ 32 more rows
We save the dataset dc to calculate the quartile over the whole sample.
## # A tibble: 11 × 2
##    retweet_count     n
##            <dbl> <int>
##  1            44     1
##  2            45     1
##  3            47     1
##  4            49     1
##  5            55     1
##  6            59     1
##  7            61     1
##  8            79     1
##  9            87     1
## 10            90     1
## 11           191     1
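The code that builds dc and applies the quartile is not shown. The sketch below illustrates one plausible way to merge two frequency tables and keep the upper quartile, reusing the 75th percentile of the distinct `retweet_count` values from the earlier step. The names `early_counts`, `late_counts`, `dc_sketch`, and all values are hypothetical and do not reproduce the real data.

```r
library(dplyr)

# Hypothetical frequency tables (toy values, not the real data)
early_counts <- tibble(retweet_count = c(1, 2, 30, 79), n = c(10, 4, 1, 3))
late_counts  <- tibble(retweet_count = c(1, 5, 79, 191), n = c(20, 2, 1, 1))

# Stack both tables and add up the frequencies of shared retweet_count values
dc_sketch <- bind_rows(early_counts, late_counts) %>%
  group_by(retweet_count) %>%
  summarise(n = sum(n), .groups = "drop")

# 75th percentile of the distinct retweet_count values (not weighted by n)
threshold <- quantile(dc_sketch$retweet_count, probs = 0.75)

# Keep the retweet_count values at or above the threshold
top_quartile <- dc_sketch %>%
  filter(retweet_count >= threshold)
```

Because the quantile is taken over distinct values rather than over individual tweets, a few very high counts can raise the threshold considerably; weighting by `n` would give a different cut-off.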