knitr::opts_chunk$set(warning = FALSE)
library(tidyverse)
library(rhymer)
# install.packages('rhymer')
We compiled a list of 44 seed words, comprising words used to express fear of crime, names of different crime types, and British slang terms for police and criminal activity (the list is available online [https://github.com/sefabey/fear_of_crime_paper/blob/master/data/FOC_seed_words_002.csv]). We used the seed list to build the lexicon by querying the Datamuse API, a word-finding engine that returns rhyming words, similarly spelled words, and semantically similar or contextually related words under multiple constraints such as synonyms, perfect and approximate rhymes, homophones, frequent followers and direct holonyms. Using the rhymer package (Landesberg 2017) in R, we queried the Datamuse API for the first 100 words that ‘mean like’ each word in the seed list, i.e. words or sequences of words that are conceptually, semantically and lexically related to it. After removing duplicates and manually removing clearly out-of-context results (for example, the query "coppers" returned "atomic number 29"), we obtained a lexicon of 2538 terms. We also identified 139 slang terms used by London gangs and inspected their usage on Twitter (see [http://rpubs.com/sefaozalp/gang_slangs] for the methodology). We found that only 16 of the 139 gang slang terms were useful for identifying tweets referring to crime and criminal activity. After adding these terms, the final lexicon consists of 2554 terms that could be useful for identifying tweets referring to crime, disorder or criminal activity on Twitter (the file is available online [https://github.com/sefabey/fear_of_crime_paper/blob/master/data/FOC_lexicon_003_final.csv]).
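To make the ‘means like’ query concrete, the sketch below builds the kind of request URL such a lookup corresponds to. It assumes the public Datamuse `/words` endpoint with its `ml` (means like) and `max` (result cap) parameters. The helper name `datamuse_ml_url` is hypothetical and not part of rhymer, and "knife crime" is only an illustrative multi-word query. The block constructs URLs and does not call the API.

```r
# Hypothetical helper (not part of rhymer): build a Datamuse "means like" URL.
# Assumes the public endpoint https://api.datamuse.com/words with the
# `ml` (means like) and `max` (result cap) query parameters.
datamuse_ml_url <- function(word, limit = 100) {
  paste0(
    "https://api.datamuse.com/words",
    "?ml=", utils::URLencode(word, reserved = TRUE),
    "&max=", limit
  )
}

# One URL per query word; spaces in multi-word queries are percent-encoded.
urls <- vapply(c("coppers", "knife crime"), datamuse_ml_url, character(1))
```

Pasting one of these URLs into a browser is a quick way to sanity-check what a given seed word returns before running the full pipeline.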
This lexicon was used to filter the 20m tweets in the dataset… (this bit to be added once Amir or Mo filters the dataset using the lexicon)
The seed list was formed by heuristically selecting words referring to crime and fear. This is not ideal, but because these are only seed terms, the resulting lexicon should still converge in the right direction.
seed_list <- read_csv("../data/FOC_seed_words.csv")
## Parsed with column specification:
## cols(
## id = col_integer(),
## words = col_character(),
## context = col_character(),
## explanation = col_character()
## )
seed_list %>% distinct(words) #51 distinct words
## # A tibble: 51 x 1
## words
## <chr>
## 1 afraid
## 2 alone
## 3 assault
## 4 avoid
## 5 burglary
## 6 CCTV
## 7 coppers
## 8 whore
## 9 crime
## 10 criminal
## # ... with 41 more rows
query_datamuse <- function(x, ...) {
  rhymer::get_means_like(word = x, ...)
}
crime_fear_lexicon <- seed_list %>%
  select(words) %>%
  pull() %>%
  map_df(query_datamuse, limit = 100, .id = "id") %>% # 5030 results
  mutate(id = as.integer(id)) %>%
  distinct(word, .keep_all = TRUE) %>% # drops to 3687
  left_join(select(seed_list, c(id, query_word = words)), by = "id") %>%
  select(id, query_word, everything()) %>%
  mutate(tags = as.character(tags)) %>%
  separate(tags, sep = ",", into = c("tag1", "tag2", "tag3", "tag4")) %>%
  mutate_at(vars(tag1, tag2, tag3, tag4), .funs = function(x) {
    str_extract(string = x, pattern = regex("(?<=\")[[:alnum:]]+(?=\")"))
  })
# write_csv(crime_fear_lexicon,"../data/FOC_lexicon_001.csv")
#
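The tag handling at the end of the pipeline is easy to misread, so here is a minimal base R sketch of the same idea on toy data. It assumes that `tags` is a list column holding one character vector per row; the helper `extract_tag` is hypothetical and only mirrors the `str_extract()` call with the same lookaround pattern.

```r
# Toy stand-in for the `tags` list column (assumed shape: one character
# vector of tags per result row).
tags <- list(c("syn", "adj"), c("syn", "n"))

# as.character() on a list deparses multi-element vectors into one string,
# e.g. the first element becomes: c("syn", "adj")
deparsed <- as.character(tags)

# Split on commas, mirroring separate(tags, sep = ",").
pieces <- strsplit(deparsed, ",", fixed = TRUE)

# Pull the first run of alphanumerics enclosed in double quotes,
# returning NA where a piece contains no quoted token.
extract_tag <- function(piece) {
  hit <- regexpr('(?<=")[[:alnum:]]+(?=")', piece, perl = TRUE)
  out <- rep(NA_character_, length(piece))
  out[as.vector(hit) > 0] <- regmatches(piece, hit)
  out
}

parsed <- lapply(pieces, extract_tag)
```

A piece with no double quotes yields `NA`. As far as I can tell, `as.character()` leaves a length-one list element unquoted, so rows carrying a single tag may be worth checking in the real data.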
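The query above is not memoised. There is no need, because the results were written to a CSV file once and that file is used from here on.
Below, the words that were manually identified for removal are filtered out. Again, the result was written once and the write call was then commented out.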
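The manual review round-trip can be sketched as follows, using base R and a temporary file. The toy rows, the file location and the simulated reviewer edit are all illustrative assumptions. Only the `remove` flag convention (1 = drop, 0 = keep) comes from the workflow described here.

```r
# Toy lexicon standing in for crime_fear_lexicon (values are illustrative).
lexicon <- data.frame(
  query_word = c("coppers", "coppers", "afraid"),
  word = c("police", "atomic number 29", "scared"),
  stringsAsFactors = FALSE
)

# 1. Add a `remove` flag defaulting to 0 and export for manual review.
lexicon$remove <- 0L
path <- tempfile(fileext = ".csv") # hypothetical location
write.csv(lexicon, path, row.names = FALSE)

# 2. Simulate the manual step: a reviewer marks out-of-context rows with 1.
reviewed <- read.csv(path, stringsAsFactors = FALSE)
reviewed$remove[reviewed$word == "atomic number 29"] <- 1L

# 3. Keep only the rows that were not flagged.
kept <- reviewed[reviewed$remove == 0, ]
```

Keeping the flag column in the file, rather than deleting rows by hand, leaves a record of what was removed and why a term is absent from later versions.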
lexicon_filtered <- read_csv("../data/FOC_lexicon_001_manual.csv") %>%
  filter(remove == 0) # rows to be removed were labelled 1, the rest 0
## Parsed with column specification:
## cols(
## id = col_integer(),
## query_word = col_character(),
## word = col_character(),
## score = col_integer(),
## tag1 = col_character(),
## tag2 = col_character(),
## tag3 = col_character(),
## tag4 = col_character(),
## remove = col_integer()
## )
lexicon_filtered
## # A tibble: 3,096 x 9
## id query_word word score tag1 tag2 tag3 tag4 remove
## <int> <chr> <chr> <int> <chr> <chr> <chr> <chr> <int>
## 1 1 afraid scared 50038 syn adj <NA> <NA> 0
## 2 1 afraid fearful 47652 syn adj <NA> <NA> 0
## 3 1 afraid frightened 45532 syn adj <NA> <NA> 0
## 4 1 afraid terrified 44729 syn adj <NA> <NA> 0
## 5 1 afraid petrified 42958 syn adj <NA> <NA> 0
## 6 1 afraid intimidated 42315 syn adj <NA> <NA> 0
## 7 1 afraid apprehensive 41740 syn adj <NA> <NA> 0
## 8 1 afraid concerned 40893 syn adj <NA> <NA> 0
## 9 1 afraid alarmed 39542 syn adj <NA> <NA> 0
## 10 1 afraid cowed 38611 syn adj <NA> <NA> 0
## # ... with 3,086 more rows
# write_csv(lexicon_filtered, "../data/FOC_lexicon_001_manual_edited.csv")
New comments from the PI mean the following needs to be done.
This will be done manually, outside R.
Removing the manually identified words:
lexicon_filtered_02 <- read_csv("../data/FOC_lexicon_002_manual.csv") %>%
  filter(remove == 0) # rows to be removed were labelled 1, the rest 0
## Parsed with column specification:
## cols(
## id = col_integer(),
## query_word = col_character(),
## word = col_character(),
## score = col_integer(),
## tag1 = col_character(),
## tag2 = col_character(),
## tag3 = col_character(),
## tag4 = col_character(),
## remove = col_integer()
## )
lexicon_filtered_02
## # A tibble: 2,538 x 9
## id query_word word score tag1 tag2 tag3 tag4 remove
## <int> <chr> <chr> <int> <chr> <chr> <chr> <chr> <int>
## 1 1 assault attack 93835 syn n <NA> <NA> 0
## 2 1 assault sexual assault 93573 syn n <NA> <NA> 0
## 3 1 assault rape 92281 syn n <NA> <NA> 0
## 4 1 assault battery 85324 syn n <NA> <NA> 0
## 5 1 assault violation 85044 syn n <NA> <NA> 0
## 6 1 assault offensive 84266 syn n <NA> <NA> 0
## 7 1 assault ravishment 80402 syn n <NA> <NA> 0
## 8 1 assault assail 79614 syn v <NA> <NA> 0
## 9 1 assault round 79435 syn n <NA> <NA> 0
## 10 1 assault snipe 79399 syn n <NA> <NA> 0
## # ... with 2,528 more rows
write_csv(lexicon_filtered_02, "../data/FOC_lexicon_002_manual_edited.csv")
Because English words can take many forms through suffixes and prefixes, I’ll add the lemma and the stem of each word in separate columns.
lexicon_filtered_02 <- lexicon_filtered_02 %>%
  mutate(lemmas = textstem::lemmatize_strings(word)) %>%
  mutate(stems = textstem::stem_strings(word)) %>%
  select(everything(), -remove, remove) # move `remove` to the last column
lexicon_filtered_02
## # A tibble: 2,538 x 11
## id query_word word score tag1 tag2 tag3 tag4 lemmas ste… remove
## <int> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <ch> <int>
## 1 1 assault atta… 93835 syn n <NA> <NA> attack att… 0
## 2 1 assault sexu… 93573 syn n <NA> <NA> sexua… sex… 0
## 3 1 assault rape 92281 syn n <NA> <NA> rape rape 0
## 4 1 assault batt… 85324 syn n <NA> <NA> batte… bat… 0
## 5 1 assault viol… 85044 syn n <NA> <NA> viola… vio… 0
## 6 1 assault offe… 84266 syn n <NA> <NA> offen… off… 0
## 7 1 assault ravi… 80402 syn n <NA> <NA> ravis… rav… 0
## 8 1 assault assa… 79614 syn v <NA> <NA> assail ass… 0
## 9 1 assault round 79435 syn n <NA> <NA> round rou… 0
## 10 1 assault snipe 79399 syn n <NA> <NA> snipe sni… 0
## # ... with 2,528 more rows
write_csv(lexicon_filtered_02, "../data/FOC_lexicon_002_manual_edited.csv")
After creating the lexicon, I moved on to explore London-based gang slang and whether it would be useful to include in the lexicon for the purposes of this study. The methods used to explore the use of London gang slang terms on Twitter are detailed here [http://rpubs.com/sefaozalp/gang_slangs]. In summary, I explored the use of 139 London gang slang terms on Twitter and found that only a fraction of them referred, even loosely, to crime or criminal activity. The results of the exploration are available here [https://github.com/sefabey/fear_of_crime_paper/blob/master/data/slangs_from_shinobi.csv] and here [https://github.com/sefabey/fear_of_crime_paper/blob/master/data/ereid_glossary.csv].
Below, I merge into the lexicon the London gang slang terms that were found to refer to crime or criminal activity.
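Before merging, it may be worth normalising the slang terms and checking whether any already appear in the lexicon. The glossary spellings are capitalised while the lexicon words shown earlier are lower case. The base R sketch below illustrates one way to do this; all values are toy data, not the contents of the real files.

```r
# Toy inputs; the values are illustrative, not the real files.
lexicon_words <- c("attack", "murk", "battery")
slang_terms <- c("Merk", "Murk", "Hollow tips ")

# Normalise case and whitespace so comparisons are like-for-like.
slang_clean <- unique(tolower(trimws(slang_terms)))

# Terms already present in the lexicon versus genuinely new ones.
already_present <- intersect(slang_clean, lexicon_words)
new_terms <- setdiff(slang_clean, lexicon_words)
```

In this toy case only `new_terms` would need to be appended. The shinobi file also carries a `slang_term_lower` column, which could serve the same purpose without the `tolower()` step.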
lexicon_filtered_02 <- read_csv("../data/FOC_lexicon_002_manual_edited.csv")
shinobi_glossary <- read_csv("../data/slangs_from_shinobi.csv")
ereid_glossary <- read_csv("../data/ereid_glossary.csv")
Slang terms collected from shinobi that might be useful are:
shinobi_useful <- shinobi_glossary %>%
  filter(useful_for_further_inspection != "no") %>%
  filter(!is.na(slang_term)) # is.null() never flags missing values in a column; is.na() does
shinobi_useful
## # A tibble: 14 x 4
## slang_term slang_term_lower useful_for_further… comment
## <chr> <chr> <chr> <chr>
## 1 Harlem Spar… harlem spartans maybe refences to a uk ban…
## 2 Aggy aggy maybe no mention of crime …
## 3 Aggro aggro maybe no mention of crime …
## 4 Allow it allow it maybe interestingly observ…
## 5 Bredrins bredrins yes some referral to gan…
## 6 Bumbaclart bumbaclart maybe Jamaican origin. sla…
## 7 Bludclart bludclart maybe slang usage observed…
## 8 Bloodclart bloodclart maybe slang usage observed…
## 9 Hollow tips hollow tips maybe refers to violance a…
## 10 Merk merk maybe tentative maybe. Man…
## 11 Murk murk maybe tentative maybe. Man…
## 12 Paigon paigon maybe carries a negative c…
## 13 Sprayed sprayed maybe slang usage (shower …
## 14 Yute(s) yute(s) maybe slang usage observed…
Slang terms collected from Ebony Reid’s PhD thesis that might be useful are:
ereid_useful <- ereid_glossary %>%
  filter(useful_for_further_inspection != "no") %>%
  filter(!is.na(slang))
ereid_useful
## # A tibble: 2 x 4
## slang meaning twitter_usage_comments useful_for_furthe…
## <chr> <chr> <chr> <chr>
## 1 Garms clothing slang usage observed. Some m… maybe
## 2 Grind refers to workin… slang usage oberved but gene… maybe
shinobi_slang_terms <- shinobi_useful %>% select(word=slang_term)
ereid_slang_terms <- ereid_useful %>% select(word=slang)
lexicon_final <- lexicon_filtered_02 %>%
  bind_rows(shinobi_slang_terms) %>%
  bind_rows(ereid_slang_terms)
lexicon_final %>% write_csv("../data/FOC_lexicon_003_final.csv")
lexicon_final %>% rmarkdown:::print.paged_df()
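The lexicon is intended for filtering tweets, and that step is not yet documented here. The following is only a sketch of one way it could look, not the method that will actually be used. It assumes case-insensitive, whole-word matching with a single regular expression. The toy lexicon, the tweets and the helper `escape_regex` are all hypothetical.

```r
# Toy lexicon and tweets; all values are illustrative.
lexicon_terms <- c("attack", "sexual assault", "rape")
tweets <- c(
  "There was an attack near the station",
  "I love grape juice",
  "Reports of a Sexual Assault downtown"
)

# Escape regex metacharacters so terms such as "yute(s)" are matched literally.
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# One alternation wrapped in word boundaries, so "rape" does not hit "grape".
pattern <- paste0(
  "\\b(?:", paste(escape_regex(lexicon_terms), collapse = "|"), ")\\b"
)

is_match <- grepl(pattern, tweets, ignore.case = TRUE, perl = TRUE)
matched_tweets <- tweets[is_match]
```

Whole-word matching avoids substring false positives but also misses inflected forms such as "attacked", which is one reason the `lemmas` and `stems` columns could be useful. Terms that end in a non-word character, such as "yute(s)", would need their trailing boundary handled separately.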