This is a simple test of the accuracy of the cld2 and cld3 packages. It is far from a complete study, just a quick quality check put together for Jeroen Ooms. You might want to download the .rmd, as package loading and unloading are not visible here.
I’m using the most convenient dataset available to me: the one that is already loaded in my Global Environment.
summary(Antisemitism$tweettextstr)  # quick overview of the tweet text variable
The Antisemitism dataset is used for training an ML classifier. It is a subset of a larger dataset created by collecting data from Twitter’s streaming API over a long time period, using context-specific keywords such as ‘Jew’, ‘Zionist’, etc. In pre-processing, I used Twitter’s integrated language identification to keep only tweets in English. The selection was then annotated on CrowdFlower. So I’m 100% sure that all of the 1390 tweets I’m using here are in English.
Starting with cld2:
detach("package:cld3", unload=TRUE)
library(cld2)
table(detect_language(Antisemitism$tweettextstr), useNA = 'ifany')
So, cld2 misses 19 out of 1390. I’ve eyeballed the tweet detected as Arabic; it is in English (a proper sentence with 10+ words) but contains a single hashtag in Arabic script.
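For anyone who wants to repeat the eyeballing, something like this pulls out the tweets cld2 does not label as English (a minimal sketch; langs is my own throwaway variable):

langs <- detect_language(Antisemitism$tweettextstr)  # cld2 is attached at this point
Antisemitism$tweettextstr[is.na(langs) | langs != "en"]  # the 19 misses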
Now off to test cld3:
detach("package:cld2", unload=TRUE)
library(cld3)
table(detect_language(Antisemitism$tweettextstr), useNA = 'ifany')
This is surprising. I was expecting a decrease in mis-identifications, but it’s the opposite: many more NAs and many different languages. I would really like to see the precision of language detection improve, but I guess something is not right in the newest version of the package.
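As an aside, the detach/attach dance is only needed because both packages export detect_language(). Namespace-qualified calls allow a direct side-by-side cross-tabulation of the two detectors (a minimal sketch, assuming both packages are installed):

table(cld2 = cld2::detect_language(Antisemitism$tweettextstr),
      cld3 = cld3::detect_language(Antisemitism$tweettextstr),
      useNA = 'ifany')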
library(magrittr)  # for the %>% pipe
Antisemitism$tweettextstr <- Antisemitism$tweettextstr %>%
  gsub("\\s?(f|ht)(tp)(s?)(://)([^\\.]*)[\\.|/](\\S*)", "", .) %>% # REMOVE URLS
  gsub("\n", " ", .) %>% # REMOVE LINE BREAKS
  gsub("[^[:alnum:][:blank:]+?&/'@#\\-]", "", .) # REMOVE NON-ALPHANUMERICS, KEEPING + ? & / ' @ # -
detach("package:cld3", unload=TRUE)
library(cld2)
table(detect_language(Antisemitism$tweettextstr), useNA = 'ifany')
detach("package:cld2", unload=TRUE)
library(cld3)
table(detect_language(Antisemitism$tweettextstr), useNA = 'ifany')
Cleaning the text a bit definitely made a difference, but unfortunately for the worse.
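Since all 1390 tweets are known to be English, a one-number accuracy summary for each detector is simply the share of "en" labels (a minimal sketch; acc is my own helper, and the namespace-qualified calls assume both packages are installed):

acc <- function(langs) mean(!is.na(langs) & langs == "en")
acc(cld2::detect_language(Antisemitism$tweettextstr))  # share of tweets cld2 labels "en"
acc(cld3::detect_language(Antisemitism$tweettextstr))  # share of tweets cld3 labels "en"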
Hope this helps.