class: center, middle, inverse, title-slide

.title[
# The social construction of ‘Health’
]
.subtitle[
## A data exploration using Twitter’s API
]
.author[
### Victoria Sass
]
.institute[
### UW Sociology
]
.date[
### 10 January 2023
]

---

# Using APIs

In the previous example we directly scraped text from one webpage of interest. This can be useful if you have a few specific pages of interest, but if you're in a more exploratory, discovery phase of your research process (as I am), you may want to collect a larger trove of data using a more systematic approach. Fortunately, some websites provide a way to do this through an Application Programming Interface (API).

---

# My research

A bit about my research interests: I'm a PhD student in the Sociology department and my dissertation broadly focuses on the social construction of health and health problems. I'm specifically interested in the social construction of "obesity" and the interactive mental and physical health effects of an overwhelming medical/public health/cultural prescription to diet.

--

Part of my work uses existing health datasets to understand the relationship between dieting and mental/physical health outcomes. But I'm also interested in the ways people talk about health. This latter aspect of my work is still being fleshed out, so as a starting point I'd like to explore a corpus of tweets that contain certain words of interest.

---

# Twitter's API

To use Twitter's API you need to sign up for an account (or use an existing account) and apply for "developer" access. You can get instant access by applying for an **Essential** account (limited access), which enables you to retrieve up to 500,000 tweets per month. You can also apply for an **Academic Research** account, which involves a more thorough application process (about your specific research aims and usage of the data) but allows you to retrieve up to 10,000,000 tweets per month with full access to the archive of historical tweets. I'm currently waiting on approval for the latter, so we'll proceed here with a more basic method of authentication.

--

More information about developer accounts can be found [here](https://developer.twitter.com/en/docs/twitter-api/getting-started/about-twitter-api).

---

# Authentication

First we want to load the necessary packages:

```r
library(tidyverse)  # suite of useful data manipulation and tidying packages
library(tidytext)   # package to create tidy text data
library(rtweet)     # package to pull Twitter data using their API
library(wordcloud2) # easy wordcloud visualizations
library(ggthemes)   # nice themes for ggplot
```

--

Once approved for a developer account you will create an app and be given four keys which serve to authenticate your requests. You'll want to store these outside of your R script for privacy reasons, especially if you intend to share your code with others.

For this example we're simply going to use a basic form of authentication, which lets you use the Twitter account logged in on your default browser. The rate limit for this type of authentication is 18,000 tweets every 15 minutes, so it's considerably more limited than authentication using an app or a bot.

More info on authenticating with `rtweet` can be found [here](https://cran.r-project.org/web/packages/rtweet/vignettes/auth.html) or by running `vignette('auth')` in the console.
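--

Once the keys arrive, app- or bot-based authentication might look something like the sketch below. This is a hypothetical illustration: the environment-variable names and the saved-token name are placeholders of my own, assuming the four keys have been added to `.Renviron`.

```r
# Hypothetical sketch: build a bot token from keys stored in .Renviron
# (the variable names below are my placeholders, not rtweet defaults)
auth <- rtweet_bot(
  api_key       = Sys.getenv("TWITTER_API_KEY"),
  api_secret    = Sys.getenv("TWITTER_API_SECRET"),
  access_token  = Sys.getenv("TWITTER_ACCESS_TOKEN"),
  access_secret = Sys.getenv("TWITTER_ACCESS_SECRET")
)
auth_save(auth, "my_app") # save once per machine...
auth_as("my_app")         # ...then load it in any future session
```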
---

# Authentication

The following line of code only needs to be run once per computer:

```r
auth_setup_default()
```

---

# Getting tweets

Now we can pull tweets of interest. For my work I want to start by seeing what kinds of words accompany three general themes: health, dieting, and wellness.

```r
health <- search_tweets(q = "health", n = 10000, include_rts = FALSE, lang = "en")

dieting <- search_tweets(q = "diet OR dieting OR lose weight", n = 10000, include_rts = FALSE, lang = "en")

wellness <- search_tweets(q = "wellness", n = 10000, include_rts = FALSE, lang = "en")
```

---

# Getting tweets

A preview of one of these datasets gives you a sense of what's available:

```r
names(wellness)
```

```
##  [1] "created_at"                    "id"                           
##  [3] "id_str"                        "full_text"                    
##  [5] "truncated"                     "display_text_range"           
##  [7] "entities"                      "metadata"                     
##  [9] "source"                        "in_reply_to_status_id"        
## [11] "in_reply_to_status_id_str"     "in_reply_to_user_id"          
## [13] "in_reply_to_user_id_str"       "in_reply_to_screen_name"      
## [15] "geo"                           "coordinates"                  
## [17] "place"                         "contributors"                 
## [19] "is_quote_status"               "retweet_count"                
## [21] "favorite_count"                "favorited"                    
## [23] "retweeted"                     "lang"                         
## [25] "possibly_sensitive"            "quoted_status_id"             
## [27] "quoted_status_id_str"          "quoted_status"                
## [29] "text"                          "favorited_by"                 
## [31] "scopes"                        "display_text_width"           
## [33] "retweeted_status"              "quoted_status_permalink"      
## [35] "quote_count"                   "timestamp_ms"                 
## [37] "reply_count"                   "filter_level"                 
## [39] "query"                         "withheld_scope"               
## [41] "withheld_copyright"            "withheld_in_countries"        
## [43] "possibly_sensitive_appealable"
```

---

# Getting tweets

A preview of one of these datasets gives you a sense of what's available:

```r
wellness$full_text[1]
```

```
## Prioritizing health and wellness in 2023.
```

```r
wellness$full_text[2]
```

```
## In 2022, US cops killed more people than any year on record.
## 
## • 32% were fleeing
## • 11% committed no offense
## • 9% during mental health / wellness checks
## • 8% involved traffic infractions
## • 18% for non-violent offenses
## 
## Cops killed +3 everyday for mostly arbitrary reasons. https://t.co/J0HwJYvvVO
```

```r
wellness$full_text[3]
```

```
## Healthcare worker and admin at “wellness” seminar https://t.co/mUoca280Ki
```
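---

# Getting tweets

Because search results change constantly and the basic authentication caps out at 18,000 tweets per 15 minutes, it can be worth caching each pull. A sketch of what that might look like (`retryonratelimit` is an argument to `search_tweets()`; the file name is my own placeholder):

```r
# Sketch: for pulls larger than one rate-limit window, retryonratelimit
# waits out the limit and resumes; saving the result lets the analysis
# be re-run later without re-querying the API
health <- search_tweets(q = "health", n = 25000, include_rts = FALSE,
                        lang = "en", retryonratelimit = TRUE)
saveRDS(health, "health_tweets.rds") # placeholder file name
```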
---

# Tokenizing

Now that we have pulled tweets from three topics of interest we can tokenize them and remove stop words.

```r
tidy_health   <- tibble(tweet = seq_along(health$full_text),   text = health$full_text)
tidy_dieting  <- tibble(tweet = seq_along(dieting$full_text),  text = dieting$full_text)
tidy_wellness <- tibble(tweet = seq_along(wellness$full_text), text = wellness$full_text)

data("stop_words") # tidytext's stop word lexicon

tidy_health2 <- tidy_health %>% 
  unnest_tokens(word, text) %>%   # tokenizing by word
  anti_join(stop_words) %>%       # removing stop words
  filter(word != "health") %>%    # removing the key word itself
  filter(!(word %in% c("t.co", "https", "amp", "it’s", "i’m", "don’t"))) %>% # removing stop words specific to tweets
  filter(grepl("[^0-9]", word))   # dropping all-digit tokens via regex
```

---

# Tidy text

Now we have our data in tidy text format:

```r
tidy_health2 %>%
  count(word, sort = TRUE)
```

```
## # A tibble: 33,214 × 2
##    word        n
##    <chr>   <int>
##  1 mental   2177
##  2 care     1174
##  3 people    864
##  4 public    502
##  5 time      445
##  6 covid     442
##  7 issues    402
##  8 life      399
##  9 system    330
## 10 medical   286
## # … with 33,204 more rows
```
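---

# Tokenizing

The slides that follow use `tidy_dieting2` and `tidy_wellness2`, built with the same cleaning steps. One way to avoid repeating the pipeline is a small helper; this is a sketch of my own (the function name and the per-corpus keyword filters are choices I'm making here, not code shown above):

```r
# Hypothetical helper applying the same cleaning to any tweet corpus;
# `keywords` holds the query terms to drop from that corpus
tidy_tweets <- function(df, keywords) {
  tibble(tweet = seq_along(df$full_text), text = df$full_text) %>% 
    unnest_tokens(word, text) %>%            # tokenize by word
    anti_join(stop_words, by = "word") %>%   # drop standard stop words
    filter(!(word %in% keywords)) %>%        # drop the query terms themselves
    filter(!(word %in% c("t.co", "https", "amp", "it’s", "i’m", "don’t"))) %>% # tweet-specific stop words
    filter(grepl("[^0-9]", word))            # drop all-digit tokens
}

tidy_dieting2  <- tidy_tweets(dieting,  c("diet", "dieting", "weight"))
tidy_wellness2 <- tidy_tweets(wellness, "wellness")
```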
---

# Visualize word frequencies

Having tidied the other two datasets the same way, we can start to create some visualizations.

```r
# code for wordcloud
tidy_health2 %>%
  count(word) %>%
  filter(n > 50) %>%
  wordcloud2()

# code for bar chart
tidy_health2 %>%
  count(word, sort = TRUE) %>%   # calculating frequencies and sorting in descending order (default)
  filter(n > 200) %>%            # keeping only words mentioned more than 200 times
  mutate(word = reorder(word, n),          # reordering the words in the tibble by their frequency
         color = case_when(n >= 500 ~ "1", # creating color categories for different frequencies
                           n >= 400 ~ "2",
                           n >= 300 ~ "3",
                           n <  300 ~ "4",
                           TRUE ~ NA_character_)) %>%
  ggplot(aes(n, word)) +         # main plot call
  geom_col(aes(fill = color)) +  # geom for columns
  scale_fill_manual(values = c("1" = "#4e79a7", "2" = "#f28e2c", "3" = "#e15759", "4" = "#76b7b2")) + # manual colors
  geom_text(aes(label = n), nudge_x = 2, color = "black") + # adding frequency labels
  labs(y = NULL) +               # removing y axis name
  theme_tufte() +                # changing theme to tufte (minimal aesthetic)
  theme(legend.position = "none") # removing color legend
```

---

# Health wordcloud

---

# Health bar chart

---

# Dataset comparisons

```r
# code to make frequency comparisons
frequency <- bind_rows(mutate(tidy_dieting2, keyword = "Dieting"), # binding rows together
                       mutate(tidy_wellness2, keyword = "Wellness"),
                       mutate(tidy_health2, keyword = "Health")) %>% 
  mutate(word = str_extract(word, "[a-z']+")) %>% # keeping only letters and apostrophes, dropping punctuation
  count(keyword, word) %>%
  filter(n > 20) %>%
  group_by(keyword) %>%               # grouping dataset by the three keywords
  mutate(proportion = n / sum(n)) %>% # proportion of each word within each keyword-specific corpus
  select(-n) %>%                      # removing n from dataset
  pivot_wider(names_from = keyword, values_from = proportion) %>% # creating a column for each keyword w/ proportion as the value
  pivot_longer(c(`Dieting`, `Wellness`), names_to = "keyword", values_to = "proportion") # keeping the Health column and collapsing Dieting and Wellness back into a keyword column so each row compares a proportion to Health
```

---

# Dataset comparisons

```r
frequency
```

```
## # A tibble: 3,646 × 4
##    word          Health keyword  proportion
##    <chr>          <dbl> <chr>         <dbl>
##  1 a           0.000905 Dieting    0.00213 
##  2 a           0.000905 Wellness   0.00133 
##  3 absolutely  0.000815 Dieting    0.000477
##  4 absolutely  0.000815 Wellness  NA       
##  5 abt        NA        Dieting    0.000562
##  6 abt        NA        Wellness  NA       
##  7 accept     NA        Dieting    0.000375
##  8 accept     NA        Wellness  NA       
##  9 achieve    NA        Dieting    0.00102 
## 10 achieve    NA        Wellness   0.00158 
## # … with 3,636 more rows
```

---

# Dataset comparisons

```r
# code to make comparison plot
library(scales)

ggplot(frequency, aes(x = proportion, y = `Health`,
                      color = abs(`Health` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~keyword, ncol = 2) +
  labs(y = "Health", x = NULL) +
  theme_tufte() +
  theme(legend.position = "none")
```

---

# Dataset comparisons

---

# Correlation of relative frequencies

```r
# Correlation between Health and Dieting tweets
cor.test(data = frequency[frequency$keyword == "Dieting",],
         ~ proportion + `Health`)
```

```
## 
##  Pearson's product-moment correlation
## 
## data:  proportion and Health
## t = 4.588, df = 505, p-value = 5.652e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1149573 0.2822040
## sample estimates:
##       cor 
## 0.2000374
```

---

# Correlation of relative frequencies

```r
# Correlation between Health and Wellness tweets
cor.test(data = frequency[frequency$keyword == "Wellness",],
         ~ proportion + `Health`)
```

```
## 
##  Pearson's product-moment correlation
## 
## data:  proportion and Health
## t = 12.171, df = 609, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3761925 0.5039532
## sample estimates:
##       cor 
## 0.4423141
```

--

The words in tweets referencing health are more strongly correlated with wellness-related tweets (0.442) than with dieting-related tweets (0.200), though neither correlation is particularly strong.

---

# Discussion

Principles of Selection and Representation

- Principle 1: Question-Specific Corpus Construction

  Using tweets naturally comes with limitations, as not everyone writes down their thoughts and feelings about health-related topics, much less uses Twitter to do so. Nonetheless, Twitter data can still be of use for gaining a better and broader understanding of the opinions of people who do tweet about these topics.

--

- Principle 2: No Values-Free Corpus Construction

  In addition to the above limitations, the type of speech that occurs around topics such as "obesity," body size, and the morality associated with health status warrants additional sensitivity. Hate speech is rife on Twitter, particularly when it comes to anti-fat bias, and I'll want to be careful going forward about how I contextualize my analysis to avoid perpetuating harmful speech.

---

# Discussion

- Principle 3: No Right Way to Represent Text

  This exploration of a limited amount of data was a good first step, but I'll definitely want to analyze many additional facets of tweets of interest, e.g. sentiment analysis or topic modeling (a rough sketch of the former follows this slide). With more exploration I hope to have a better sense of the research questions I'm interested in and, therefore, which methods will be most appropriate.

--

- Principle 4: Validation

  Of course, extensive validation will be part of whatever method I choose. Comparing methods across different types of data sources may also be interesting, e.g. how are these topics discussed on Twitter versus Reddit versus Instagram?
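---

# A first pass at sentiment

To make Principle 3 concrete, here is a rough sketch of what a first sentiment pass might look like, using tidytext's built-in lexicons with the tidied data from earlier. This is my own illustration, not part of the analysis above, and the choice of the Bing lexicon is arbitrary.

```r
# Sketch: tag each remaining word as positive/negative with the
# Bing lexicon, then tally sentiment across the health corpus
tidy_health2 %>% 
  inner_join(get_sentiments("bing"), by = "word") %>% # keep only words in the lexicon
  count(sentiment, sort = TRUE)
```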
---

# Discussion

Selecting Documents

When my developer account for research purposes is approved, I expect to recreate and expand this exploration to include more texts across a longer range of time, as well as more keywords related to the topics I'm interested in.

--

- Resource Bias

  Again, Twitter is not representative of the entire population of English speakers, and therefore my analysis will be limited in its generalizability. However, I do think much can be learned from the way a broader swath of the population (i.e. not just medical doctors, public health officials, and people directly involved in the health/wellness industrial complex) discusses these topics.

--

- Incentive Bias

  I'll want to consider that not all tweets have been retained (some have been deleted by the users themselves, others have been removed due to violations of Twitter's use agreement). Also, some people may not want to share their real opinions about the topics I'm interested in for fear of being perceived a certain way.

---

# Discussion

- Medium Bias

  I will definitely want to engage with the full text of the tweets themselves to gain a better understanding of their context. This will help inform the types of analysis I consider doing and even the questions I am interested in answering. Other forms of media (pictures, links, emojis) may provide additional information that informs those decisions.

  Additionally (and as a counter to social desirability bias), there is a tendency for people to express more negative opinions in online spaces where they are, or feel, anonymous. I'll want to think critically about what this means for the social phenomena I'm expecting to witness versus what I actually capture using this platform as a data source.

---

# Discussion

- Retrieval Bias

  Coming up with a list of keywords has an inherent bias, as it is subject to my understanding of the social processes I'm interested in and the relationships I think might hold between them. Validation and iteration throughout the research process, particularly in the discovery phase, will thus be very important. Additionally, I'll want to gain a more thorough understanding of the Twitter API to make sure I understand its limitations and how they may affect the types of tweets I have access to.