Last time, we practiced using R to map the frequency of tweets across the U.S. But mapping frequency alone does not tell us what is being discussed on Twitter in the country. So today we are going to learn how to reveal which topics matter most among tweets posted in the U.S.
We want to examine the most prominent issues in tweets about COVID-19 in the U.S. To address such a question, a primary task is to quantify which word (or phrase) is used the most among the tweets posted in the U.S. One measure of how important a word may be is its term frequency (tf): how frequently a word occurs in a tweet or in a document.
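Before we touch the tweet data, here is a minimal sketch of what term frequency means in practice, using a made-up one-sentence "document" rather than the tweet dataset:

library(tidyverse)
library(tidytext)

# A toy one-sentence "document" (made-up text, for illustration only)
toy <- tibble(text = "masks save lives and masks are cheap and masks are easy to wear")

toy %>%
  unnest_tokens(word, text) %>% # one row per word
  count(word, sort = TRUE) %>%  # how many times each word occurs
  mutate(tf = n / sum(n))       # term frequency = count / total words in the document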
Let’s take a look at which words are the most prominent in the tweets about COVID-19 posted in the U.S. To do so, we need to format our tweet dataset as a tidy dataset. We will treat all the tweets from the U.S. as one document, tokenize them, and count the resulting terms by their frequency in that document.
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 3.0.1 ✓ dplyr 0.8.5
## ✓ tidyr 1.0.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidytext)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(stringr)
library(stopwords)
library(textclean) # A new package introduced to preprocess text easily
load("covid_tweets_423.RData") # We will use the tweets collected on April 23rd.
covid_tweets
## # A tibble: 18,224 x 9
## user_id status_id created_at screen_name text lang country lat
## <chr> <chr> <dttm> <chr> <chr> <chr> <chr> <dbl>
## 1 479491… 12533658… 2020-04-23 16:51:11 Vegastechh… "@ev… en United… 36.0
## 2 169480… 12533658… 2020-04-23 16:51:11 Coachjmorr… "Ple… en United… 36.9
## 3 215583… 12533658… 2020-04-23 16:51:09 KOROGLU_BA… "@Ay… tr Azerba… 40.2
## 4 744597… 12533657… 2020-04-23 16:51:05 FoodFocusSA "Pre… en South … -26.1
## 5 155877… 12533657… 2020-04-23 16:51:01 opcionsecu… "#AT… es Ecuador -1.67
## 6 998960… 12533657… 2020-04-23 16:51:01 amystones4 "Tha… en United… 53.7
## 7 102768… 12533657… 2020-04-23 16:51:00 COTACYT "Men… es Mexico 23.7
## 8 247382… 12533657… 2020-04-23 16:50:54 bkracing123 "The… en United… 53.9
## 9 175662… 12533657… 2020-04-23 16:50:51 AnnStrahm "Thi… en United… 37.5
## 10 226707… 12533656… 2020-04-23 16:50:42 JLeonRojas "INF… es Chile -35.5
## # … with 18,214 more rows, and 1 more variable: lng <dbl>
# The text column is a plain character vector; pull two example tweets to check
class(covid_tweets %>% filter(status_id %in% c("1253360389921873923", "1253359677569728520")) %>% pull(text))
## [1] "character"
# replace_non_ascii() converts non-ASCII characters (curly quotes, long dashes, ellipses) into ASCII equivalents
replace_non_ascii(covid_tweets %>% filter(status_id %in% c("1253360389921873923", "1253359677569728520")) %>% pull(text))
## [1] "I'm glad that programs like this are supporting even the smallest businesses in our state, including focusing on minority-owned businesses. Every person & organization has a role to play in getting us all through this together. https://t.co/WUr2eWcRgs"
## [2] "Whenever we reopen masks will be worn during the ENTIRE time you're in the salon: before, during, & after your appointment. It may be removed when you walk out the door. #covid19 #coronavirus #hairtyllc #safeathome... https://t.co/H5n8NzUjsg"
# sapply() applies replace_non_ascii() to each tweet one at a time (note the original texts become the names)
sapply(covid_tweets %>% filter(status_id %in% c("1253360389921873923", "1253359677569728520")) %>% pull(text), replace_non_ascii)
## I’m glad that programs like this are supporting even the smallest businesses in our state, including focusing on minority-owned businesses. Every person & organization has a role to play in getting us all through this together. https://t.co/WUr2eWcRgs
## "I'm glad that programs like this are supporting even the smallest businesses in our state, including focusing on minority-owned businesses. Every person & organization has a role to play in getting us all through this together. https://t.co/WUr2eWcRgs"
## Whenever we reopen masks will be worn during the ENTIRE time you’re in the salon: before, during, & after your appointment. It may be removed when you walk out the door. #covid19 #coronavirus #hairtyllc #safeathome… https://t.co/H5n8NzUjsg
## "Whenever we reopen masks will be worn during the ENTIRE time you're in the salon: before, during, & after your appointment. It may be removed when you walk out the door. #covid19 #coronavirus #hairtyllc #safeathome... https://t.co/H5n8NzUjsg"
# replace_contraction() expands contractions such as "I'm" into "I am"
replace_contraction(sapply(covid_tweets %>% filter(status_id %in% c("1253360389921873923", "1253359677569728520")) %>% pull(text), replace_non_ascii))
## I’m glad that programs like this are supporting even the smallest businesses in our state, including focusing on minority-owned businesses. Every person & organization has a role to play in getting us all through this together. https://t.co/WUr2eWcRgs
## "I am glad that programs like this are supporting even the smallest businesses in our state, including focusing on minority-owned businesses. Every person & organization has a role to play in getting us all through this together. https://t.co/WUr2eWcRgs"
## Whenever we reopen masks will be worn during the ENTIRE time you’re in the salon: before, during, & after your appointment. It may be removed when you walk out the door. #covid19 #coronavirus #hairtyllc #safeathome… https://t.co/H5n8NzUjsg
## "Whenever we reopen masks will be worn during the ENTIRE time you are in the salon: before, during, & after your appointment. It may be removed when you walk out the door. #covid19 #coronavirus #hairtyllc #safeathome... https://t.co/H5n8NzUjsg"
# replace_html() cleans up HTML markup, including tags and entities such as &amp;
replace_html(replace_contraction(sapply(covid_tweets %>% filter(status_id %in% c("1253360389921873923", "1253359677569728520")) %>% pull(text), replace_non_ascii)))
## I’m glad that programs like this are supporting even the smallest businesses in our state, including focusing on minority-owned businesses. Every person & organization has a role to play in getting us all through this together. https://t.co/WUr2eWcRgs
## "I am glad that programs like this are supporting even the smallest businesses in our state, including focusing on minority-owned businesses. Every person & organization has a role to play in getting us all through this together. https://t.co/WUr2eWcRgs"
## Whenever we reopen masks will be worn during the ENTIRE time you’re in the salon: before, during, & after your appointment. It may be removed when you walk out the door. #covid19 #coronavirus #hairtyllc #safeathome… https://t.co/H5n8NzUjsg
## "Whenever we reopen masks will be worn during the ENTIRE time you are in the salon: before, during, & after your appointment. It may be removed when you walk out the door. #covid19 #coronavirus #hairtyllc #safeathome... https://t.co/H5n8NzUjsg"
# replace_url() strips URLs from the text
replace_url(replace_html(replace_contraction(sapply(covid_tweets %>% filter(status_id %in% c("1253360389921873923", "1253359677569728520")) %>% pull(text), replace_non_ascii))))
## [1] "I am glad that programs like this are supporting even the smallest businesses in our state, including focusing on minority-owned businesses. Every person & organization has a role to play in getting us all through this together. "
## [2] "Whenever we reopen masks will be worn during the ENTIRE time you are in the salon: before, during, & after your appointment. It may be removed when you walk out the door. #covid19 #coronavirus #hairtyllc #safeathome... "
covid_tweets_tidy <- covid_tweets %>%
filter(lang == "en") %>% # Selecting tweets written in English only
filter(country == "United States") %>% # Selecting tweets posted in the U.S. only
mutate(text = str_replace_all(text, "RT", " ")) %>% # Removing retweet marker
mutate(text = sapply(text, replace_non_ascii)) %>%
mutate(text = sapply(text, replace_contraction)) %>%
mutate(text = sapply(text, replace_html)) %>%
mutate(text = sapply(text, replace_url)) %>%
unnest_tweets(word, text) %>% # Splitting text into words by unnest_tweets
filter(!word %in% stopwords()) %>% # Removing words matched by any element in stopwords() vector
filter(str_detect(word, "[a-z]")) # Keeping only words that contain at least one alphabetical letter
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
covid_tweets_tidy # Look at the word column
## # A tibble: 69,448 x 9
## user_id status_id created_at screen_name lang country lat lng
## <chr> <chr> <dttm> <chr> <chr> <chr> <dbl> <dbl>
## 1 479491… 12533658… 2020-04-23 16:51:11 Vegastechh… en United… 36.0 -115.
## 2 479491… 12533658… 2020-04-23 16:51:11 Vegastechh… en United… 36.0 -115.
## 3 479491… 12533658… 2020-04-23 16:51:11 Vegastechh… en United… 36.0 -115.
## 4 479491… 12533658… 2020-04-23 16:51:11 Vegastechh… en United… 36.0 -115.
## 5 479491… 12533658… 2020-04-23 16:51:11 Vegastechh… en United… 36.0 -115.
## 6 479491… 12533658… 2020-04-23 16:51:11 Vegastechh… en United… 36.0 -115.
## 7 479491… 12533658… 2020-04-23 16:51:11 Vegastechh… en United… 36.0 -115.
## 8 479491… 12533658… 2020-04-23 16:51:11 Vegastechh… en United… 36.0 -115.
## 9 479491… 12533658… 2020-04-23 16:51:11 Vegastechh… en United… 36.0 -115.
## 10 479491… 12533658… 2020-04-23 16:51:11 Vegastechh… en United… 36.0 -115.
## # … with 69,438 more rows, and 1 more variable: word <chr>
Now we are ready to count the word column and find the most prominent words by frequency.
covid_tweets_tidy %>%
count(word, sort=T) # Count the elements in the word column and arrange the result in descending order
## # A tibble: 17,622 x 2
## word n
## <chr> <int>
## 1 #covid19 2150
## 2 covid19 1560
## 3 #coronavirus 722
## 4 can 411
## 5 people 406
## 6 us 335
## 7 new 317
## 8 get 305
## 9 just 293
## 10 now 290
## # … with 17,612 more rows
As you can see in the counting result, the top 10 words (or tokens) by frequency are #covid19, covid19, #coronavirus, can, people, us, new, get, just, now…
What do you think of this list of the top 10 words? Does it help you guess the important issues being discussed in tweets about COVID-19 in the U.S.? These may be important terms for COVID-19 in general, but they tell us little about the COVID-19 issue in the U.S. specifically, because they are also among the most frequent terms in the U.K., India, or Canada. Being important in the U.S. should mean being distinctive to that country.
In other words, the most frequent words occur many times but may not be important for understanding what’s going on in the U.S. Why? Because they are just as likely to top the list among tweets posted in the U.K. or Canada, where English is also an official language.
In English, the most common words are typically stop words like “the”, “is”, “a”, “an”, “of”, and so on. Of course, such stop words were already removed using the pre-built list from the stopwords package (filter(!word %in% stopwords())). But we still see words that are used frequently yet are not relevant for finding the important issues about COVID-19 in the U.S. In fact, the most frequent words in tweets from the U.S. are largely the same words that are most frequent in tweets from any other country, as long as the tweets are written in English.
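If you are curious which words are being dropped, you can inspect the default English stop word list (the Snowball lexicon) that stopwords() returns:

library(stopwords)
# Inspect the stop word list used by filter(!word %in% stopwords()) above
length(stopwords())   # size of the default English (Snowball) stop word list
head(stopwords(), 10) # the first few entries (e.g., "i", "me", "my", ...)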
Let’s see which words are being used frequently in the U.K.
covid_tweets %>%
filter(lang == "en") %>% # Selecting tweets written in English only
filter(country == "United Kingdom") %>% # Seleting tweets posted in the UK only
mutate(text = str_replace_all(text, "RT", " ")) %>% # Removing retweet marker
mutate(text = sapply(text, replace_non_ascii)) %>%
mutate(text = sapply(text, replace_contraction)) %>%
mutate(text = sapply(text, replace_html)) %>%
mutate(text = sapply(text, replace_url)) %>%
unnest_tweets(word, text) %>% # Splitting text into words by unnest_tweets
filter(!word %in% stopwords()) %>% # Removing words matched by any element in stopwords() vector
filter(str_detect(word, "[a-z]")) %>%
count(word, sort=T)
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
## # A tibble: 7,240 x 2
## word n
## <chr> <int>
## 1 #covid19 629
## 2 covid19 328
## 3 #coronavirus 220
## 4 can 134
## 5 covid 108
## 6 people 102
## 7 just 79
## 8 uk 73
## 9 us 73
## 10 now 70
## # … with 7,230 more rows
From this result, we can see that the most frequent words (e.g., “can”, “people”, “just”, “us”, “now”) are commonly used across countries. The fact that a word is used frequently does not mean it is important in the tweets about COVID-19 in the U.S. By analogy, when everyone speaks English fluently, my English skills alone do not make me stand out in the job market.
So we should look for the words that are more important in the tweets from the U.S. than in those from other countries. For that, we use a term’s inverse document frequency (idf), which decreases the weight of commonly used words and increases the weight of words that are rarely used in a collection of tweets or documents. This can be combined with term frequency (tf) to calculate a term’s tf-idf (the two quantities multiplied together): the frequency of a term adjusted for how rarely it is used.
The statistic tf-idf is intended to measure how important a word is to a document (or a tweet) within a collection of documents, or a corpus. In our case, it measures how important a word is to the group of tweets from the U.S. within the collection of all tweets from the U.S., the U.K., India, and Canada.
So in terms of idf, the word “covid19” decreases in importance because it is likely to appear in almost every tweet, whether from the U.S., the U.K., India, or Canada. Remember that our tweets were collected precisely because they contain a hashtag related to “covid19”!
The inverse document frequency (idf) for any given term is defined as

idf(term) = ln(number of documents / number of documents containing the term)

where, in our case, a “document” is the set of all tweets from one country.
So this metric decreases the weight of words that are likely to be used in all the countries, such as “covid19”, “coronavirus”, “people”, and “now”, but increases the weight of words that are unique to each country. For example, the word “modi” refers to the last name of the current Prime Minister of India, so it is frequently used in tweets from India but much less likely to be used in tweets from the U.S., the U.K., or Canada. In this case, the idf gives much greater weight to the word “modi” than its term frequency alone would suggest.
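To make the definition concrete, here is the idf computed by hand for our setting with four country “documents” (a quick numeric sketch):

# idf by hand, assuming four country "documents" as in this lesson; log() is the natural log
n_countries <- 4
log(n_countries / 4) # a word used in all four countries: ln(1) = 0
log(n_countries / 2) # a word used in two of the four countries: ln(2), about 0.69
log(n_countries / 1) # a word used in only one country (e.g., "modi"): ln(4), about 1.39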
Let’s start by looking at the COVID-19 tweets posted in the top four countries with English as an official language, and examine term frequency first, then tf-idf. We can start just by using dplyr functions such as group_by() and left_join(); a small toy illustration of left_join() is sketched below. What are the most commonly used words in tweets about COVID-19? (Let’s also calculate the total number of words in each country for later use.)
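Here is that toy illustration of left_join(), using made-up data just to show what the join does:

library(dplyr)
# Two tiny made-up tables: per-country word counts and per-country totals
words  <- tibble(country = c("A", "A", "B"), word = c("covid", "masks", "covid"), n = c(2, 1, 3))
totals <- tibble(country = c("A", "B"), total = c(3, 3))
left_join(words, totals, by = "country") # adds the matching total to every row of words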
library(dplyr)
library(tidytext)
covid_tweets %>%
count(country, sort=T)
## # A tibble: 163 x 2
## country n
## <chr> <int>
## 1 United States 4998
## 2 India 1527
## 3 Indonesia 1324
## 4 United Kingdom 1311
## 5 Brazil 942
## 6 Canada 658
## 7 Spain 640
## 8 Mexico 567
## 9 Nigeria 552
## 10 Colombia 381
## # … with 153 more rows
# Focusing on the top four countries with English as an official language
tweet_words <- covid_tweets %>%
filter(lang == "en") %>% # Selecting tweets written in English only
filter(country %in% c("United States","India","United Kingdom","Canada")) %>% # Selecting tweets from U.S., India, U.K., and Canada
mutate(text = str_replace_all(text, "RT", " ")) %>% # Removing retweet marker
mutate(text = sapply(text, replace_non_ascii)) %>%
mutate(text = sapply(text, replace_contraction)) %>%
mutate(text = sapply(text, replace_html)) %>%
mutate(text = sapply(text, replace_url)) %>%
unnest_tweets(word, text) %>% # Splitting text into words by unnest_tweets
filter(!word %in% stopwords()) %>% # Removing words matched by any element in stopwords() vector
filter(str_detect(word, "[a-z]")) %>%
count(country, word, sort=T)
## Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
tweet_words
## # A tibble: 35,663 x 3
## country word n
## <chr> <chr> <int>
## 1 United States #covid19 2150
## 2 United States covid19 1560
## 3 United States #coronavirus 722
## 4 United Kingdom #covid19 629
## 5 United States can 411
## 6 United States people 406
## 7 India #covid19 379
## 8 India covid19 358
## 9 United States us 335
## 10 United Kingdom covid19 328
## # … with 35,653 more rows
Next, let’s compute the total number of words used in the tweets from each country.
total_words <- tweet_words %>%
group_by(country) %>%
summarise(total = sum(n))
total_words
## # A tibble: 4 x 2
## country total
## <chr> <int>
## 1 Canada 9209
## 2 India 17863
## 3 United Kingdom 19583
## 4 United States 69448
With these totals we can measure each word’s share of all the words used in a country. Once we join the totals onto tweet_words, there is one row in the data frame for each word–country combination: n is the number of times that word is used in that country, and total is the total number of words from that country. The usual suspects have the highest n: “covid19”, “coronavirus”, “can”, “us”, “people”, and so forth. Below we look at the most commonly used words in COVID-19 tweets from each country in terms of term frequency.
tweet_words <- tweet_words %>%
left_join(total_words)
## Joining, by = "country"
tweet_words
## # A tibble: 35,663 x 4
## country word n total
## <chr> <chr> <int> <int>
## 1 United States #covid19 2150 69448
## 2 United States covid19 1560 69448
## 3 United States #coronavirus 722 69448
## 4 United Kingdom #covid19 629 19583
## 5 United States can 411 69448
## 6 United States people 406 69448
## 7 India #covid19 379 17863
## 8 India covid19 358 17863
## 9 United States us 335 69448
## 10 United Kingdom covid19 328 19583
## # … with 35,653 more rows
Let’s sort the words by tf within each country.
tweet_words %>%
group_by(country) %>%
mutate(tf = n/total) %>%
top_n(10, tf) %>%
arrange(desc(tf)) %>%
ungroup
## # A tibble: 41 x 5
## country word n total tf
## <chr> <chr> <int> <int> <dbl>
## 1 United Kingdom #covid19 629 19583 0.0321
## 2 Canada #covid19 289 9209 0.0314
## 3 United States #covid19 2150 69448 0.0310
## 4 Canada covid19 216 9209 0.0235
## 5 United States covid19 1560 69448 0.0225
## 6 India #covid19 379 17863 0.0212
## 7 India covid19 358 17863 0.0200
## 8 United Kingdom covid19 328 19583 0.0167
## 9 United Kingdom #coronavirus 220 19583 0.0112
## 10 United States #coronavirus 722 69448 0.0104
## # … with 31 more rows
Sorting by tf shows that the top 10 words are quite similar across the four countries. But as we discussed above, the most frequently used words are too common to reveal what is distinctively being discussed in each country. Still, let’s visualize the top 10 words by term frequency in each country.
tweet_words %>%
group_by(country) %>%
mutate(tf = n/total) %>%
top_n(10, tf) %>%
arrange(desc(tf)) %>%
ungroup %>%
mutate(country = as.factor(country),
word = reorder_within(word, tf, country)) %>% # To order the words by tf within each country
ggplot(aes(word, tf, fill = country)) +
geom_col(show.legend = FALSE) +
scale_x_reordered() + # This line removes the separator and country name as a suffix
labs(x = NULL, y = "Term Frequency") +
facet_wrap(~country, ncol=2, scales="free") +
coord_flip()
The idea of tf-idf is to find the words that are important to each document by decreasing the weight of commonly used words and increasing the weight of words that are not used very much across the collection or corpus, in this case the collection of COVID-19 tweets from the four countries. Calculating tf-idf attempts to find the words that are important (i.e., common) in a document, but not too common across documents. Let’s do that now.
The bind_tf_idf() function in the tidytext package takes a tidy text dataset as input, with one row per term per document. One column (word here) contains the terms/tokens, one column contains the documents (country in this case), and the last necessary column contains the counts, i.e., how many times each country’s tweets contain each term (n in this example). We calculated a total for each country for our explorations above, but it is not necessary for bind_tf_idf(); the table only needs to contain all the words in each country’s tweets.
tweet_words
## # A tibble: 35,663 x 4
## country word n total
## <chr> <chr> <int> <int>
## 1 United States #covid19 2150 69448
## 2 United States covid19 1560 69448
## 3 United States #coronavirus 722 69448
## 4 United Kingdom #covid19 629 19583
## 5 United States can 411 69448
## 6 United States people 406 69448
## 7 India #covid19 379 17863
## 8 India covid19 358 17863
## 9 United States us 335 69448
## 10 United Kingdom covid19 328 19583
## # … with 35,653 more rows
tweet_words <- tweet_words %>%
bind_tf_idf(word, country, n)
tweet_words
## # A tibble: 35,663 x 7
## country word n total tf idf tf_idf
## <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 United States #covid19 2150 69448 0.0310 0 0
## 2 United States covid19 1560 69448 0.0225 0 0
## 3 United States #coronavirus 722 69448 0.0104 0 0
## 4 United Kingdom #covid19 629 19583 0.0321 0 0
## 5 United States can 411 69448 0.00592 0 0
## 6 United States people 406 69448 0.00585 0 0
## 7 India #covid19 379 17863 0.0212 0 0
## 8 India covid19 358 17863 0.0200 0 0
## 9 United States us 335 69448 0.00482 0 0
## 10 United Kingdom covid19 328 19583 0.0167 0 0
## # … with 35,653 more rows
Notice that idf, and thus tf-idf, is zero for these extremely common words. These are all words that appear in all four countries, so the idf term (which will then be the natural log of 1) is zero. The inverse document frequency (and thus tf-idf) is very low (near zero) for words that occur in many of the documents in a collection; this is how this approach decreases the weight of common words. The inverse document frequency will be higher for words that occur in fewer of the documents (here, countries) in the collection.
Let’s look at terms with high tf-idf in tweets about COVID-19.
tweet_words %>%
select(-total) %>%
arrange(desc(tf_idf))
## # A tibble: 35,663 x 6
## country word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 India #indiafightscorona 78 0.00437 1.39 0.00605
## 2 India modi 53 0.00297 1.39 0.00411
## 3 Canada ontario 27 0.00293 1.39 0.00406
## 4 India namo 45 0.00252 1.39 0.00349
## 5 Canada #cdnpoli 23 0.00250 1.39 0.00346
## 6 Canada canada 21 0.00228 1.39 0.00316
## 7 India @narendramodi 73 0.00409 0.693 0.00283
## 8 India @pmoindia 62 0.00347 0.693 0.00241
## 9 India #india 25 0.00140 1.39 0.00194
## 10 United Kingdom #lockdownuk 27 0.00138 1.39 0.00191
## # … with 35,653 more rows
Here we see hashtags that are specific to each country’s context and some named entities, such as the names of people, places, or organizations, which are in fact important in the tweets of each country. These terms do not occur in all of the countries, and they are important, characteristic words for each country within the corpus of COVID-19 tweets.
Some of the idf values are the same for different terms because there are four groups (countries) of tweets in this corpus, and we are seeing the numerical values ln(4/1) = 1.386294, ln(4/2) = 0.6931472, and so on.
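As a sanity check, we can reproduce the idf column by hand from the definition above (bind_tf_idf() has already done this for us; this is just a sketch to confirm the values):

# Recompute idf from its definition: ln(4 country "documents" / number of countries containing the word)
tweet_words %>%
  group_by(word) %>%
  mutate(idf_by_hand = log(4 / n_distinct(country))) %>%
  ungroup() %>%
  select(country, word, idf, idf_by_hand) %>%
  arrange(desc(idf_by_hand))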
And let’s look at a visualization for these high tf-idf words.
tweet_words %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(country) %>%
top_n(10) %>%
ungroup %>%
ggplot(aes(word, tf_idf, fill = country)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "TF-IDF") +
facet_wrap(~country, ncol=2, scales="free") +
coord_flip()
## Selecting by tf_idf
Using tf-idf, we can see that the words at the top of each list are the ones more important to each country, insofar as they reflect the country-specific situation around COVID-19. What measuring tf-idf has done here is show us that COVID-19 tweets use similar words across the four countries, and that what distinguishes one country from the rest within this collection are the hashtags, Twitter handles, and names of people, places, or organizations. This is the point of tf-idf: it identifies words that are important to one country within a collection of tweets.
In this assignment, you will 1) reduce the tweet dataset in “covid_tweets_423.RData” to tweets written in English and posted in any four countries (each with at least 100 tweets) using the “country” variable, 2) create four bar graphs showing the top 10 words by term frequency in COVID-19 tweets from each country of your interest, 3) create four bar graphs showing the top 10 words by tf-idf in each country of your interest, and 4) describe your insights about the results.
Requirement 1: You should select tweets in English using the lang variable.
Requirement 2: You should select tweets posted in four countries using the country variable.
Requirement 3: Your bar graphs should be labeled with the top 10 words, the country, and the quantity by which the top 10 words are ordered.
Requirement 4: By 11:59 PM on June 17th (Wednesday), you will upload the following things to this Assignments section on our e-class page.
Four graphs visualizing the top 10 words by term frequency among the COVID-19 tweets from the four countries you selected, and the R code used to generate them.
Four graphs visualizing the top 10 words by tf-idf among the COVID-19 tweets from the four countries you selected, and the R code used to generate them.
What are the differences in the top 10 words between tf and tf-idf? What do you think causes such differences?