knitr::opts_chunk$set(warning = FALSE)
library(tidyverse)
library(rhymer)
# install.packages('rhymer')

The Lexicon Paragraph

We compiled a list of 44 seed words, which includes words used when expressing fear of crime, different crime types, and British slang terms referring to the police and criminal activity (find this list online [https://github.com/sefabey/fear_of_crime_paper/blob/master/data/FOC_seed_words_002.csv]). The seed list was used to create the lexicon by querying the Datamuse API, a word-finding engine that allows querying for rhyming words, similar spellings, and semantically similar or contextually related words using multiple constraints such as synonyms, perfect and approximate rhymes, homophones, frequent followers, and direct holonyms. Using the rhymer package (Landesberg 2017) in R, we queried the Datamuse API for the first 100 words that ‘mean like’ (i.e. are conceptually, semantically and lexically related to) each word in the seed list. After removing duplicates and manually removing clearly out-of-context results (such as the query ‘coppers’ returning the Datamuse result ‘atomic number 29’), we gathered a lexicon consisting of 2538 terms. We also identified 139 slang terms used by London gangs and inspected the usage of these terms on Twitter (see [http://rpubs.com/sefaozalp/gang_slangs] for the methodology). We found that only 16 of the 139 gang slang terms were useful for identifying tweets referring to crime and criminal activity. After adding these terms, we collated a lexicon consisting of 2554 terms that could be useful for identifying tweets referring to crime, disorder or criminal activity on Twitter (find this file online [https://github.com/sefabey/fear_of_crime_paper/blob/master/data/FOC_lexicon_003_final.csv]).
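
For illustration, a single ‘means like’ query through rhymer looks like the one below (output not shown; the words and scores returned depend on the Datamuse API at query time).

# ask Datamuse for up to 5 words that 'mean like' the seed word "burglary";
# the result is a data frame with word, score and tags columns
rhymer::get_means_like(word = "burglary", limit = 5)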

This lexicon was used to filter the 20m tweets in the dataset… (this part will be added once Amir or Mo filters the dataset using the lexicon.)
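
As a rough sketch of how that filtering could work (the tweets data frame and its text column below are hypothetical placeholders, not the actual pipeline, which will be documented once the filtering is run):

# sketch only: 'tweets' and its 'text' column are hypothetical placeholders
lexicon_terms <- read_csv("../data/FOC_lexicon_003_final.csv") %>% 
    pull(word) %>% 
    str_to_lower() %>% 
    unique()
# build one whole-word alternation pattern; terms containing regex
# metacharacters (e.g. brackets) would need escaping in practice
lexicon_pattern <- regex(str_c("\\b(", str_c(lexicon_terms, collapse = "|"), ")\\b"))
tweets_filtered <- tweets %>% 
    filter(str_detect(str_to_lower(text), lexicon_pattern))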

1. Steps to Create the Lexicon

1.1. Seed list

The seed list was formed by heuristically selecting words referring to crime and fear. This is not ideal, but as these are only seed terms, the lexicon should converge in the right direction.

seed_list <- read_csv("../data/FOC_seed_words.csv")
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   words = col_character(),
##   context = col_character(),
##   explanation = col_character()
## )
seed_list %>% distinct(words) #51 distinct words
## # A tibble: 51 x 1
##    words   
##    <chr>   
##  1 afraid  
##  2 alone   
##  3 assault 
##  4 avoid   
##  5 burglary
##  6 CCTV    
##  7 coppers 
##  8 whore   
##  9 crime   
## 10 criminal
## # ... with 41 more rows
# thin wrapper around rhymer's 'means like' query to the Datamuse API
query_datamuse <- function(x, ...){
    rhymer::get_means_like(word = x, ...)
}

crime_fear_lexicon <- seed_list %>% 
    select(words) %>%
    pull() %>%  
    map_df(query_datamuse, limit=100, .id = "id") %>% # 5030 results
    mutate(id=as.integer(id)) %>% 
    distinct(word, .keep_all = T) %>% # removing duplicates drops this to 3687
    left_join( select(seed_list, c(id, query_word=words)), by="id") %>% # attach the originating seed word
    select(id, query_word, everything()) %>% 
    mutate(tags=as.character(tags)) %>% # tags arrive as a list column
    separate(tags, sep = ",", into = c("tag1", "tag2", "tag3", "tag4")) %>% 
    mutate_at(vars(tag1,tag2,tag3,tag4), .funs = function(x){ # keep only the quoted tag text, e.g. syn, n, adj
              str_extract(string=x, pattern =  regex("(?<=\")[[:alnum:]]+(?=\")"))})


# write_csv(crime_fear_lexicon,"../data/FOC_lexicon_001.csv")

The query above is not memoised because there is no need: the results were written to a csv file once and that file is used from here on.
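
If the queries did need to be re-run repeatedly, the wrapper could be cached, for instance with the memoise package (a sketch only; memoise is not used in this notebook):

# cache Datamuse responses in memory so repeated queries with the same
# arguments do not hit the API again
query_datamuse_memo <- memoise::memoise(query_datamuse)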

Below, words that were manually flagged for removal are filtered out. Again, the file was written once and the write call is commented out.

lexicon_filtered <- read_csv("../data/FOC_lexicon_001_manual.csv") %>% 
    filter(remove==0) # rows to be removed were labelled 1, the rest 0
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   query_word = col_character(),
##   word = col_character(),
##   score = col_integer(),
##   tag1 = col_character(),
##   tag2 = col_character(),
##   tag3 = col_character(),
##   tag4 = col_character(),
##   remove = col_integer()
## )
lexicon_filtered
## # A tibble: 3,096 x 9
##       id query_word word         score tag1  tag2  tag3  tag4  remove
##    <int> <chr>      <chr>        <int> <chr> <chr> <chr> <chr>  <int>
##  1     1 afraid     scared       50038 syn   adj   <NA>  <NA>       0
##  2     1 afraid     fearful      47652 syn   adj   <NA>  <NA>       0
##  3     1 afraid     frightened   45532 syn   adj   <NA>  <NA>       0
##  4     1 afraid     terrified    44729 syn   adj   <NA>  <NA>       0
##  5     1 afraid     petrified    42958 syn   adj   <NA>  <NA>       0
##  6     1 afraid     intimidated  42315 syn   adj   <NA>  <NA>       0
##  7     1 afraid     apprehensive 41740 syn   adj   <NA>  <NA>       0
##  8     1 afraid     concerned    40893 syn   adj   <NA>  <NA>       0
##  9     1 afraid     alarmed      39542 syn   adj   <NA>  <NA>       0
## 10     1 afraid     cowed        38611 syn   adj   <NA>  <NA>       0
## # ... with 3,086 more rows
# write_csv(lexicon_filtered, "../data/FOC_lexicon_001_manual_edited.csv")

Take 2

New comments arrived from the PI; I need to do the following:

  1. Remove fear-related words from the seed list and lexicon (a sketch of this step follows the list)
  2. Revisit the final version of the lexicon and remove irrelevant words (e.g. the query ‘drugs’ returning ‘prescription’ or ‘charlatan’)
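
A sketch of how step 1 could be expressed in R (fear_words below is an illustrative, hand-picked subset of the fear-related seed terms; the actual removal is reflected in the manually edited csv files used below):

# illustrative only: drop results generated by fear-related seed words
fear_words <- c("afraid", "alone", "avoid")
lexicon_no_fear <- lexicon_filtered %>% 
    filter(!query_word %in% fear_words)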

Revisit and remove irrelevant words

This is going to be done manually and outside R.

Removing manually identified words

lexicon_filtered_02 <- read_csv("../data/FOC_lexicon_002_manual.csv") %>% 
    filter(remove==0) # rows to be removed were labelled 1, the rest 0
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   query_word = col_character(),
##   word = col_character(),
##   score = col_integer(),
##   tag1 = col_character(),
##   tag2 = col_character(),
##   tag3 = col_character(),
##   tag4 = col_character(),
##   remove = col_integer()
## )
lexicon_filtered_02
## # A tibble: 2,538 x 9
##       id query_word word           score tag1  tag2  tag3  tag4  remove
##    <int> <chr>      <chr>          <int> <chr> <chr> <chr> <chr>  <int>
##  1     1 assault    attack         93835 syn   n     <NA>  <NA>       0
##  2     1 assault    sexual assault 93573 syn   n     <NA>  <NA>       0
##  3     1 assault    rape           92281 syn   n     <NA>  <NA>       0
##  4     1 assault    battery        85324 syn   n     <NA>  <NA>       0
##  5     1 assault    violation      85044 syn   n     <NA>  <NA>       0
##  6     1 assault    offensive      84266 syn   n     <NA>  <NA>       0
##  7     1 assault    ravishment     80402 syn   n     <NA>  <NA>       0
##  8     1 assault    assail         79614 syn   v     <NA>  <NA>       0
##  9     1 assault    round          79435 syn   n     <NA>  <NA>       0
## 10     1 assault    snipe          79399 syn   n     <NA>  <NA>       0
## # ... with 2,528 more rows
write_csv(lexicon_filtered_02, "../data/FOC_lexicon_002_manual_edited.csv")

As English words can take many forms through suffixes and prefixes, I’ll add lemmas and stems of the words in separate columns.

lexicon_filtered_02 <- lexicon_filtered_02 %>% 
    mutate(lemmas=textstem::lemmatize_strings(word)) %>% 
    mutate(stems= textstem::stem_strings(word)) %>% 
    select(everything(), -remove, remove) # move the remove column to the end

lexicon_filtered_02
## # A tibble: 2,538 x 11
##       id query_word word  score tag1  tag2  tag3  tag4  lemmas ste… remove
##    <int> <chr>      <chr> <int> <chr> <chr> <chr> <chr> <chr>  <ch>  <int>
##  1     1 assault    atta… 93835 syn   n     <NA>  <NA>  attack att…      0
##  2     1 assault    sexu… 93573 syn   n     <NA>  <NA>  sexua… sex…      0
##  3     1 assault    rape  92281 syn   n     <NA>  <NA>  rape   rape      0
##  4     1 assault    batt… 85324 syn   n     <NA>  <NA>  batte… bat…      0
##  5     1 assault    viol… 85044 syn   n     <NA>  <NA>  viola… vio…      0
##  6     1 assault    offe… 84266 syn   n     <NA>  <NA>  offen… off…      0
##  7     1 assault    ravi… 80402 syn   n     <NA>  <NA>  ravis… rav…      0
##  8     1 assault    assa… 79614 syn   v     <NA>  <NA>  assail ass…      0
##  9     1 assault    round 79435 syn   n     <NA>  <NA>  round  rou…      0
## 10     1 assault    snipe 79399 syn   n     <NA>  <NA>  snipe  sni…      0
## # ... with 2,528 more rows
write_csv(lexicon_filtered_02, "../data/FOC_lexicon_002_manual_edited.csv")

Final Version of the Lexicon

Merging with London Gang Slang Terms

After the lexicon was created, I moved on to explore London-based gang slang and whether such terms would be useful to include in the lexicon for the purposes of this study. The methods used to explore the use of London gang slang terms on Twitter are detailed here [http://rpubs.com/sefaozalp/gang_slangs]. In summary, I explored the use of 139 London gang slang terms on Twitter and found that only a fraction of these terms referred, even vaguely, to crime or criminal activity. Find the results of the exploration here [https://github.com/sefabey/fear_of_crime_paper/blob/master/data/slangs_from_shinobi.csv] and here [https://github.com/sefabey/fear_of_crime_paper/blob/master/data/ereid_glossary.csv].

Below, I merge the London gang slang terms that were identified as referring to crime or criminal activity into the lexicon.

lexicon_filtered_02 <- read_csv("../data/FOC_lexicon_002_manual_edited.csv")
shinobi_glossary <- read_csv("../data/slangs_from_shinobi.csv")
ereid_glossary <- read_csv("../data/ereid_glossary.csv")

Slang terms that might be useful, collected from the shinobi glossary, are:

shinobi_useful <- shinobi_glossary %>%
    filter(useful_for_further_inspection!= "no") %>% 
    filter(!is.na(slang_term)) # drop missing terms; the original is.null() check had no effect on a column

shinobi_useful
## # A tibble: 14 x 4
##    slang_term   slang_term_lower useful_for_further… comment              
##    <chr>        <chr>            <chr>               <chr>                
##  1 Harlem Spar… harlem spartans  maybe               refences to a uk ban…
##  2 Aggy         aggy             maybe               no mention of crime …
##  3 Aggro        aggro            maybe               no mention of crime …
##  4 Allow it     allow it         maybe               interestingly observ…
##  5 Bredrins     bredrins         yes                 some referral to gan…
##  6 Bumbaclart   bumbaclart       maybe               Jamaican origin. sla…
##  7 Bludclart    bludclart        maybe               slang usage observed…
##  8 Bloodclart   bloodclart       maybe               slang usage observed…
##  9 Hollow tips  hollow tips      maybe               refers to violance a…
## 10 Merk         merk             maybe               tentative maybe. Man…
## 11 Murk         murk             maybe               tentative maybe. Man…
## 12 Paigon       paigon           maybe               carries a negative c…
## 13 Sprayed      sprayed          maybe               slang usage (shower …
## 14 Yute(s)      yute(s)          maybe               slang usage observed…

Slang terms that might be useful, collected from Ebony Reid’s PhD thesis, are:

ereid_useful <- ereid_glossary %>% 
    filter(useful_for_further_inspection!="no") %>% 
    filter(!is.na(slang))

ereid_useful
## # A tibble: 2 x 4
##   slang meaning           twitter_usage_comments        useful_for_furthe…
##   <chr> <chr>             <chr>                         <chr>             
## 1 Garms clothing          slang usage observed. Some m… maybe             
## 2 Grind refers to workin… slang usage oberved but gene… maybe
shinobi_slang_terms <- shinobi_useful %>% select(word=slang_term)
ereid_slang_terms <- ereid_useful %>% select(word=slang)

lexicon_final <- lexicon_filtered_02 %>% 
    bind_rows(shinobi_slang_terms) %>% # slang terms only populate the word column; other columns are NA
    bind_rows(ereid_slang_terms)
lexicon_final %>% write_csv("../data/FOC_lexicon_003_final.csv")
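
As a quick sanity check, the merged lexicon should contain 16 more rows than the filtered lexicon, matching the 14 shinobi terms and 2 thesis terms shown above:

# expect 16: the number of slang terms appended above
nrow(lexicon_final) - nrow(lexicon_filtered_02)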

lexicon_final %>% rmarkdown:::print.paged_df()