5.1 Mainstream
Get keyword frequency
keyword <- separate_rows(data = dat[nchar(dat$keywords)>1,c("keywords","decade", "id")], "keywords", sep = ";", convert = T)
keyword$keywords <- trimws(keyword$keywords)
keyword <- keyword[nchar(keyword$keywords)>1,]
head(keyword)
## # A tibble: 6 × 3
## keywords decade id
## <chr> <ord> <chr>
## 1 ENGLISH-LANGUAGE PROFICIENCY t90 91_2
## 2 LABOR-MARKET t90 91_2
## 3 HISPANIC MEN t90 91_2
## 4 UNEMPLOYMENT t90 91_3
## 5 MIGRATION t90 91_3
## 6 POLICY t90 91_3
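The reshaping above turns one row per publication into one row per keyword. As a rough illustration, here is a base-R sketch of the same idea on made-up data. The toy data frame and its values are assumptions, and `strsplit()` stands in for `separate_rows()`:

```r
# Toy data: two publications with semicolon-separated keywords (made up)
toy <- data.frame(
  id = c("a1", "a2"),
  keywords = c("LABOR-MARKET; HISPANIC MEN", "POLICY;  ;MIGRATION"),
  stringsAsFactors = FALSE
)

# Split each keyword string on ";" and trim surrounding whitespace
parts <- lapply(strsplit(toy$keywords, ";", fixed = TRUE), trimws)

# One row per keyword, repeating the publication id as needed
long <- data.frame(
  id = rep(toy$id, lengths(parts)),
  keywords = unlist(parts),
  stringsAsFactors = FALSE
)

# Drop empty or one-character placeholders, as in the filter above
long <- long[nchar(long$keywords) > 1, ]
print(long)
```

The blank entry between the two semicolons is removed by the `nchar(...) > 1` filter, which is the role that filter plays in the real data.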
Merge keywords with cat_id by the id column
key_cat <- merge(keyword, cat_id, by = "id", all.x = T)
key_cat <- key_cat[nchar(key_cat$keywords)>1,] # eliminate ' ' place holder, keep >=2 characters
key_cat$keywords <- toupper(key_cat$keywords)
All-time keyword frequency
Remove stopwords. The stopwords here are the search terms used in this query, so their presence adds no information.
stopwords <- c("migration","immigration","immigrant","immigrants","migrants","migrant","skilled migration", "skilled migrants","emigration","skills","skill")
stopwords <- toupper(stopwords)
## Remove stopwords from the keyword-discipline data
keyword <- keyword[!keyword$keywords %in% stopwords,]
key_cat <- key_cat[!key_cat$keywords %in% stopwords,]
5.1.1 Keyword Frequency
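# Stopwords were already removed from keyword above; indexing keyword with a
# logical vector built from key_cat could misalign if the two differ in length
keyword_freq <- freq_tables(keyword$keywords)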
print_freq_tables(keyword$keywords)
top20_keyword_freq <- paste(keyword_freq$var[1:20]," (", keyword_freq$Freq[1:20], ", ", keyword_freq$pct[1:20],")", sep = "")
# collapse to concatenate a vector of strings
paste(tolower(top20_keyword_freq), collapse = ", ")
## [1] "impact (239, 0.69%), mobility (238, 0.69%), international migration (209, 0.61%), gender (188, 0.54%), employment (175, 0.51%), earnings (174, 0.5%), labor (165, 0.48%), education (159, 0.46%), growth (154, 0.45%), brain-drain (136, 0.39%), labor-market (125, 0.36%), unemployment (117, 0.34%), networks (116, 0.34%), brain drain (105, 0.3%), united-states (99, 0.29%), policy (99, 0.29%), women (97, 0.28%), wages (96, 0.28%), trade (96, 0.28%), workers (95, 0.28%)"
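The helper `freq_tables()` is not defined in this section. A minimal sketch of what such a helper might look like, assuming it returns a data frame with the columns `var`, `Freq`, and `pct` used above, sorted by descending frequency (the name `freq_tables_sketch` and the toy keywords are hypothetical):

```r
# Hypothetical stand-in for freq_tables(); the real helper is not shown here
freq_tables_sketch <- function(x) {
  tab <- sort(table(x), decreasing = TRUE)
  data.frame(
    var = names(tab),
    Freq = as.integer(tab),
    pct = paste0(round(100 * as.integer(tab) / length(x), 2), "%"),
    stringsAsFactors = FALSE
  )
}

# Made-up keywords for illustration
toy_keywords <- c("IMPACT", "GENDER", "IMPACT", "MOBILITY", "IMPACT", "GENDER")
toy_freq <- freq_tables_sketch(toy_keywords)
print(toy_freq)
```

Here the percentage is the share of all keyword occurrences, which appears consistent with the small percentages in the top-20 list above.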
5.1.2 Themes from Keywords (patterns)
The keywords were combined using three strategies. The first combines all phrases that contain a target keyword. For example, impact was the most frequent keyword, so phrases such as “economic impact” and “demographic impact” were included in the impact group. The second combines spelling variations, plural forms, and synonyms. The third combines keywords that differ in meaning but belong to the same category. For example, I grouped keywords related to sex (gender, women, men), family (generation, family, fertility, children), age (aging, youth, young, older), and race (race, ethnic) into a broader demographic characteristics category. Policy-related words, including policies, regulation, law, rule, and politics, were also combined.
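To make the three strategies concrete, here is a small sketch on made-up keywords. The toy vector and the shortened demographic pattern are assumptions for illustration; each strategy reduces to a `grepl()` call with a different kind of pattern:

```r
# Made-up keywords for illustration only
toy <- c("ECONOMIC IMPACT", "WAGES", "EARNINGS", "GENDER", "FERTILITY", "TRADE")

# Strategy one: every phrase containing the target word
s1 <- grepl("impact", toy, ignore.case = TRUE)

# Strategy two: spelling variants and synonyms joined with "|"
s2 <- grepl("wage|earning|income|salary", toy, ignore.case = TRUE)

# Strategy three: different concepts grouped into one broader category
pat_demo_toy <- "gender|women|fertility|children"  # shortened, hypothetical
s3 <- grepl(pat_demo_toy, toy, ignore.case = TRUE)

print(data.frame(keyword = toy, impact = s1, income = s2, demographic = s3))
```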
## Strategy one, sub-string matching
#pat_mobility = "mobility"
#pat_impact = "impact"
#pat_network = "network"
## Strategy two, spelling variations, plural forms, synonyms
pat_income = "wage|earning|income|salary"
pat_employ = "employ|labor|labour|job"
#pat_brain = "brain drain|brain-drain"
#pat_growth = "growth"
## Strategy three, words that belong to the same category/family
pat_education = "education|learn|student|university"
pat_policy = "policy|policies|regulation|rule| law"
pat_politics = "politics|political"
#pat_pol = paste(pat_policy, pat_politics, sep = "|"); pat_pol
#pat_demo = "aging|youth|young|older|gender|woman|women|family|families|fertility|^man$|^men$|children|generation|race|identity|ethnic" # de-activate for now
#pat_attitude = "attitude|opposition|polarization|opinion|perception|segregation|prejudice|equality|integration|discrimination" # de-activate for now
#pat_equal = "segregat|equality|integration|discriminat|prejudice"
# How many publications have keywords
keyword_denomi <- length(unique(keyword$id))
ids_w_key <- unique(keyword$id)
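Because these patterns are matched as substrings, the leading space in `" law"` matters. The sketch below uses made-up keyword strings to show what that pattern does and does not match, next to a word-boundary variant. The `\\b` version is only a possible alternative, not the choice made in this analysis:

```r
# Made-up keyword strings for illustration
toy <- c("law; migration policy", "immigration law", "flawed data", "rules of origin")

# Pattern from the text: the leading space keeps "flawed" out,
# but it also misses a keyword string that starts with "law"
with_space <- grepl(" law", toy, ignore.case = TRUE)

# Possible alternative (an assumption, not the author's choice):
# \\b marks a word boundary, so "law" at the start of the string matches too
with_boundary <- grepl("\\blaw", toy, ignore.case = TRUE, perl = TRUE)

print(data.frame(keywords = toy, with_space, with_boundary))
```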
5.1.3 Theme-Frequency Matrix
Add columns to the main data
## Create variable names. Find all objects in .GlobalEnv whose names start with pat_
pat <- grep("^pat_",names(.GlobalEnv),value=TRUE)
## get the content of each object into a list
patterns_l <- do.call("list", mget(pat))
## convert the list to a data frame
pattern_df <- data.frame(do.call(rbind, patterns_l))
dim(pattern_df); colnames(pattern_df) <- "patterns"
## [1] 5 1
pattern_df$theme <- pat
# sort the themes by alphabetical order
pattern_df <- pattern_df[order(pattern_df$theme),]
rownames(pattern_df) <- NULL
# extract variable names AFTER the themes were sorted
varnames <- gsub(pattern = "pat_", replacement = "", pattern_df$theme); varnames
## [1] "education" "employ" "income" "policy" "politics"
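The collection step above depends on whatever `pat_` objects happen to exist in the global environment. A minimal sketch of the same idea in a private environment, so it can be tried without touching `.GlobalEnv` (the environment `env` and the decoy object `spat_index` are hypothetical):

```r
# Use a private environment so the sketch does not depend on .GlobalEnv
env <- new.env()
assign("pat_income", "wage|earning|income|salary", envir = env)
assign("pat_employ", "employ|labor|labour|job", envir = env)
assign("spat_index", "not a theme", envir = env)  # should be ignored

# ls() returns sorted names; "^pat_" anchors the match at the start
pat_names <- ls(envir = env, pattern = "^pat_")

# mget() returns a named list; unlist() keeps the names
pat_values <- unlist(mget(pat_names, envir = env))

toy_patterns <- data.frame(
  theme = sub("^pat_", "", names(pat_values)),
  patterns = unname(pat_values),
  stringsAsFactors = FALSE
)
print(toy_patterns)
```

Because `ls()` already returns names in sorted order, the themes and their patterns stay aligned without a separate sorting step.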
## experiment with new variable names
newnames <- varnames
for (i in seq_along(varnames)) {
  if (newnames[i] == "demo") {
    newnames[i] <- "demographic_characteristics"
  } else if (newnames[i] == "edu") {
    newnames[i] <- "education"
  } else if (newnames[i] == "employ") {
    newnames[i] <- "employment"
  } else if (newnames[i] == "brain") {
    newnames[i] <- "brain_drain"
  }
}
varnames <- newnames
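The same renaming can also be written with a named lookup vector instead of a chain of `if` / `else if` branches. This is only a sketch of an alternative; the toy input `varnames_toy` is made up, while the mapping itself is copied from the loop above:

```r
# Made-up input that exercises every branch of the mapping
varnames_toy <- c("education", "employ", "income", "brain", "demo")

# Named vector: old short name -> display name (same mapping as the loop above)
lookup <- c(
  demo = "demographic_characteristics",
  edu = "education",
  employ = "employment",
  brain = "brain_drain"
)

# Replace names found in the lookup, keep all others unchanged
hit <- varnames_toy %in% names(lookup)
newnames_toy <- varnames_toy
newnames_toy[hit] <- unname(lookup[varnames_toy[hit]])
print(newnames_toy)
```

Adding a new theme then only requires one more entry in `lookup`.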
## initiate an empty data frame
keyword_bi <- data.frame(matrix(ncol = length(varnames), nrow = nrow(dat)))
colnames(keyword_bi) <- varnames
## Flag each publication with 1 if the pattern matches its keywords, otherwise 0
for (i in 1:ncol(keyword_bi)) {
  keyword_bi[,i] <- ifelse(grepl(pattern = pattern_df$patterns[i], dat$keywords, ignore.case = TRUE), 1, 0)
}
## column sum of keywords
colSums(keyword_bi)
## education employment income policy politics
## 343 954 422 344 137
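The binary theme matrix can be tried on a small scale. The sketch below uses made-up keyword strings and three of the patterns defined earlier. It builds the matrix with `sapply()` rather than a loop, which is an alternative formulation and not the code used above:

```r
# Made-up keyword strings, one per publication (NA = no keywords)
toy_keywords <- c("WAGES; LABOR-MARKET", "EDUCATION", "TRADE", NA)

toy_patterns <- c(
  employment = "employ|labor|labour|job",
  income = "wage|earning|income|salary",
  education = "education|learn|student|university"
)

# One column per theme; grepl() returns FALSE for NA, so NA rows become 0
toy_bi <- sapply(toy_patterns, function(p) {
  as.integer(grepl(p, toy_keywords, ignore.case = TRUE))
})
toy_bi <- as.data.frame(toy_bi)

print(toy_bi)
print(colSums(toy_bi))
```

A publication can belong to several themes at once (the first toy row matches both employment and income), which is why the column sums can add up to more than the number of publications.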
## Bind the colSums freq and % into one dataframe
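keyword_colsum <- data.frame(cbind(
  keyword = varnames,
  Freq = colSums(keyword_bi),
  pct = paste(round(100*colSums(keyword_bi)/keyword_denomi,2), "%", sep = "")
))
## cbind() coerces every column to character; convert Freq back so that order() sorts numerically
keyword_colsum$Freq <- as.numeric(keyword_colsum$Freq)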
## Add the patterns to the data frame of the keyword/theme list
keyword_colsum$pattern <- pattern_df$patterns
keyword_colsum <- keyword_colsum[rev(order(keyword_colsum$Freq)),]
rownames(keyword_colsum) <- 1:nrow(keyword_colsum)
keyword_colsum$keyword <- trimws(keyword_colsum$keyword)
datatable(keyword_colsum, caption = "Keyword, Frequency, and Search Patterns")
## Number of publications with at least one of the designated categories
length(which(rowSums(keyword_bi)>0))
## [1] 1505
## % among all publications with keywords
length(which(rowSums(keyword_bi)>0)) / keyword_denomi
## [1] 0.6190868
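The coverage figure is the number of rows with at least one theme divided by the number of publications that have keywords. A small sketch on a made-up matrix, followed by a check of the reported numbers (the toy matrix and `toy_denominator` are assumptions, while 1505 and 2431 come from the output in this section):

```r
# Toy 0/1 theme matrix: 5 publications, 2 themes (made up);
# the last publication has no keywords, so it matches nothing
toy_bi <- data.frame(
  employment = c(1, 0, 0, 1, 0),
  policy = c(1, 0, 1, 0, 0)
)
toy_denominator <- 4  # assumed number of publications with keywords

covered <- sum(rowSums(toy_bi) > 0)  # publications with at least one theme
coverage <- covered / toy_denominator
print(coverage)

# The same arithmetic with the figures reported in this section
reported_pct <- round(100 * 1505 / 2431, 2)
print(reported_pct)
```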
Print the keywords, frequencies, and percentages in descending order of frequency
After categorizing the keywords, five themes emerged: employment (954, 39.24%), income (422, 17.36%), policy (344, 14.15%), education (343, 14.11%), and politics (137, 5.64%). Together these themes cover 1505 (61.91%) of the 2431 publications with keywords.
paste(paste(keyword_colsum$keyword, " (", keyword_colsum$Freq,", ", keyword_colsum$pct, ")", sep = ""), collapse = ", ")
## [1] "employment (954, 39.24%), income (422, 17.36%), policy (344, 14.15%), education (343, 14.11%), politics (137, 5.64%)"
5.1.4 Theme-Frequency Matrix (Title & Abstract)
# Use the same method to build a binary matrix from the main text (title + abstract)
text_bi <- data.frame(matrix(ncol = length(varnames), nrow = nrow(dat)))
colnames(text_bi) <- varnames
## Flag each publication with 1 if the pattern matches its title or abstract, otherwise 0
for (i in 1:ncol(text_bi)) {
  text_bi[,i] <- ifelse(grepl(pattern = pattern_df$patterns[i], dat$text, ignore.case = TRUE), 1, 0)
}
## column sums of the themes
colSums(text_bi)
## education employment income policy politics
## 698 1659 667 892 225
## Bind the colSums freq and % into one data frame
text_colsum <- data.frame(cbind(
  keyword = varnames,
  Freq = colSums(text_bi), # becomes character because cbind() builds a character matrix
  pct = paste(round(100*colSums(text_bi)/nrow(dat),2), "%", sep = "")
))
text_colsum$Freq <- as.numeric(text_colsum$Freq) # convert back so that order() sorts numerically
text_colsum <- text_colsum[rev(order(text_colsum$Freq)),]
rownames(text_colsum) <- 1:nrow(text_colsum)
datatable(text_colsum, caption = "Patterns matches in titles and abstracts")
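The character `Freq` column is a general consequence of how `cbind()` works rather than something specific to these data. A short sketch using two of the counts reported above (the object names `via_cbind` and `direct` are made up):

```r
counts <- c(education = 698, employment = 1659)

# cbind() builds a matrix, and a matrix holds a single type,
# so mixing names with counts turns every cell into character
via_cbind <- data.frame(
  cbind(keyword = names(counts), Freq = counts),
  stringsAsFactors = FALSE
)
class_cbind <- class(via_cbind$Freq)

# Passing the columns straight to data.frame() keeps Freq numeric
direct <- data.frame(keyword = names(counts), Freq = counts, stringsAsFactors = FALSE)
class_direct <- class(direct$Freq)

# Character sorting compares digit by digit, so "1659" sorts before "698"
order_cbind <- order(via_cbind$Freq)
order_direct <- order(direct$Freq)

print(c(class_cbind, class_direct))
print(rbind(order_cbind, order_direct))
```

This is why the frequency column has to be numeric before the table is sorted: with character values, 1659 would rank below 698.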
5.1.5 Theme Coverage in Title & Abstract
## Number of publications with at least one of the designated categories
length(which(rowSums(text_bi)>0))
## [1] 2263
## % among all publications
length(which(rowSums(text_bi)>0)) / nrow(dat)
## [1] 0.8647306
I then tested the themes extracted from the keywords by applying the same patterns to the article titles and abstracts. The keyword-based themes cover 2263 publications, or 86.47% of all publications. The distribution of the themes when matched in the titles and abstracts is: employment (1659, 63.39%), policy (892, 34.08%), education (698, 26.67%), income (667, 25.49%), politics (225, 8.6%).
For example, the text-processing example shown earlier now has matched themes. Under this method, its title and abstract are represented by the themes employment, policy, and politics.
paste(dat$`Article Title`[100], dat$Abstract[100], sep = ' ')
## [1] "New migrations in the Asia-Pacific region: a force for social and political change A rapid increase in international migration is a central aspect of the social transformations currently taking place in the Asia-Pacific region. Population movements take many forms, including permanent migration, temporary labour migration, mobility of highly skilled personnel, refugee movements and family reunion. Destinations include North America, the Gulf oil states and - increasingly - the fast-growing 'tiger economies' of Asia. Much of the migration is undocumented and a growing proportion of the migrants are women. So far, researchers and policy-makers have concentrated on short-term economic and regulatory aspects. But migration is likely to be a major factor bringing about social and political change in the region. The social networks which develop as part of the migratory process often make official migration control policies difficult to implement. Unplanned settlement is taking place, with important consequences for both sending and receiving societies. Scholars from a number of countries in the region have therefore established an Asia Pacific Migration Research Network to study these issues, to raise public awareness and to provide advice to policymakers. The article describes the aims and development of this Network, which is part of the UNESCO Management of Social Transformations Programme."
text_bi[100,][,colSums(text_bi[100,])>0]
## employment policy politics
## 100 1 1 1
names(text_bi[100,][,colSums(text_bi[100,])>0])
## [1] "employment" "policy" "politics"
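To list the matched themes for every publication at once, the row-wise lookup can be wrapped in `apply()`. This is a sketch on a made-up matrix, not part of the original analysis. It also handles rows with a single matched theme, where the data-frame subsetting above drops to an unnamed vector and `names()` returns `NULL`:

```r
# Toy 0/1 theme matrix for three publications (made up)
toy_bi <- data.frame(
  employment = c(1, 0, 1),
  policy = c(1, 0, 0),
  politics = c(1, 0, 0)
)

# For each row, keep the column names whose value is 1 and join them
toy_themes <- apply(toy_bi, 1, function(row) {
  paste(names(row)[row > 0], collapse = ", ")
})
print(toy_themes)
```

Rows without any matched theme yield an empty string, which makes uncovered publications easy to spot.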