First, we load all the files from the dataset folder and combine them into a single dataset. The data is located in the Canada’s Data folder, in the subfolder Data obtained from DiscoverText.
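The loading step could be sketched roughly as follows. This is a minimal illustration, not the code used for this report: the temporary folder, the file names `part1.csv` and `part2.csv`, and the object names `files`, `tables`, and `combined` are all assumptions made for the example.

```r
library(readr)
library(dplyr)

# Hypothetical folder with two small CSV files, standing in for the real data folder
data_dir <- file.path(tempdir(), "discovertext_demo")
dir.create(data_dir, showWarnings = FALSE)
write_csv(tibble::tibble(Title = "a", Text = "first tweet"),  file.path(data_dir, "part1.csv"))
write_csv(tibble::tibble(Title = "b", Text = "second tweet"), file.path(data_dir, "part2.csv"))

# List every CSV file in the folder and read each one into a data frame
files  <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(files, read_csv, show_col_types = FALSE)

# Stack the data frames into a single dataset
combined <- bind_rows(tables)
```

If the files disagree on the type of a shared column, `bind_rows()` raises an error, so setting `col_types` in `read_csv()` may be needed with real exports.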

## Rows: 12307 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (22): Title, Text, [M] alternate date format:, [M] city:, [M] country:,...
## dbl  (10): [M] ave:, [M] desktop reach:, [M] engagement:, [M] mobile reach:,...
## lgl   (2): ReferenceText, Annotations
## time  (1): [M] time:
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Method 1: RT metric

We calculate the mean of RTs for each year, based on the column RT Count.

Early 2022

Since we have two different datasets, we focus first on the earlier tweets (early2022), where we have a column that we can use to identify the most retweeted tweets.

library(dplyr)

# Rename the column `retweet count` to retweet_count
df <- early2022 %>% rename(retweet_count = `retweet count`)

# Replace NAs with 0
df$retweet_count[is.na(df$retweet_count)] <- 0

# Count how many tweets have each retweet_count value
new_df <- df %>%
  count(retweet_count)


#SAMPLING USING PERCENTILE
# 1. Calculate cumulative frequency
cumulative_frequency <- new_df %>%
  mutate(cumulative_n = cumsum(n))


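#SAMPLING USING QUARTILE
# Calculate the 75th percentile (third quartile) to find a more meaningful threshold.
# Note: new_df has one row per distinct retweet_count value, so this quantile is
# taken over the distinct values, not over the individual tweets.
quantile_threshold <- quantile(new_df$retweet_count, probs = 0.75)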

# Filter the data based on the quantile threshold
quantile_data <- new_df %>%
  filter(retweet_count >= quantile_threshold)

# Print the sample data
print(quantile_data)
## # A tibble: 7 × 2
##   retweet_count     n
##           <dbl> <int>
## 1            30     1
## 2            31     2
## 3            37     2
## 4            44     1
## 5            45     1
## 6            79     3
## 7           551     1

However, this data was DISCARDED because the sample does not contain enough RTs. The number of RTs was reported in the column “retweet count”, but when we searched for the RTs themselves in the entire dataset, only a few were found.

Late 2022

Next, we work with the second dataset, which covers late 2022 to 2023.

Here we need to count the occurrences of each parent URL, since this dataset has no retweet count column. That could be a problem because we don't know how many RTs Twitter reports for each tweet; however, counting parent URLs gives us the actual number of RTs present in our data.
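A minimal sketch of this counting step is shown here, assuming each RT row carries the URL of the tweet it retweets in the column `[M] parent url:`. The object names `late2022` and `rt_counts` and the example URLs are hypothetical; only the column name comes from the dataset.

```r
library(dplyr)

# Hypothetical toy data; only the column name `[M] parent url:` comes from the real dataset
late2022 <- tibble::tibble(
  `[M] parent url:` = c(
    "https://example.com/statuses/1",
    "https://example.com/statuses/1",
    "https://example.com/statuses/2",
    "https://example.com/statuses/1"
  )
)

# Count how many rows share each parent URL, most retweeted first
rt_counts <- late2022 %>%
  count(`[M] parent url:`, name = "retweet_count", sort = TRUE)

print(rt_counts)
```

With `sort = TRUE`, the most frequently retweeted parent URL appears in the first row.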

## # A tibble: 6,084 × 2
##    `[M] parent url:`                                               retweet_count
##    <chr>                                                                   <int>
##  1 0                                                                        2126
##  2 https://twitter.com/MikeSchreiner/statuses/1677373739040907264            191
##  3 https://twitter.com/JohnLeePettim13/statuses/16184776550238167…            90
##  4 https://twitter.com/JunkScience/statuses/1686038870352162816               87
##  5 https://twitter.com/ABDanielleSmith/statuses/16340183101227786…            79
##  6 https://twitter.com/CanadaNuclear/statuses/1602325412461289473             61
##  7 https://twitter.com/GasPriceWizard/statuses/1585226318290989056            59
##  8 https://twitter.com/JonathanWNV/statuses/1584928602516180992               55
##  9 https://twitter.com/ONenergy/statuses/1677322032554430467                  49
## 10 https://twitter.com/_ClimateCraze/statuses/1670573629531144197             47
## # ℹ 6,074 more rows
## # A tibble: 38 × 2
##    retweet_count     n
##            <dbl> <dbl>
##  1           191     1
##  2            90     1
##  3            87     1
##  4            79     1
##  5            61     1
##  6            59     1
##  7            55     1
##  8            49     1
##  9            47     1
## 10            32     1
## # ℹ 28 more rows

Summing all tweets to calculate the sample
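One way such a combined table could be produced is sketched here, under the assumption that two frequency tables are stacked and their `n` values summed per `retweet_count`. The names `early_freq`, `late_freq`, and `combined_freq`, and the toy values, are hypothetical and not taken from the report's code.

```r
library(dplyr)

# Hypothetical frequency tables: n tweets observed for each retweet_count value
early_freq <- tibble::tibble(retweet_count = c(79, 45),  n = c(3, 1))
late_freq  <- tibble::tibble(retweet_count = c(191, 79), n = c(1, 1))

# Stack both tables, sum n for matching retweet_count values,
# and sort from the highest retweet_count down
combined_freq <- bind_rows(early_freq, late_freq) %>%
  group_by(retweet_count) %>%
  summarise(n = sum(n), .groups = "drop") %>%
  arrange(desc(retweet_count))
```

In this toy example, the value 79 appears in both tables, so its combined `n` is the sum of the two counts.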

## # A tibble: 42 × 2
##    retweet_count     n
##            <dbl> <dbl>
##  1           191     1
##  2            90     1
##  3            87     1
##  4            79     4
##  5            61     1
##  6            59     1
##  7            55     1
##  8            49     1
##  9            47     1
## 10            45     1
## # ℹ 32 more rows

Saving data from Canada

Saving dataset dc to calculate the quartile of the whole sample

Quartile sampling - deprecated

## # A tibble: 11 × 2
##    retweet_count     n
##            <dbl> <int>
##  1            44     1
##  2            45     1
##  3            47     1
##  4            49     1
##  5            55     1
##  6            59     1
##  7            61     1
##  8            79     1
##  9            87     1
## 10            90     1
## 11           191     1