The primary aim of this job was to obtain a fresh annotated sample of tweets that refer to LGBTI identity in the context of online hate speech and/or antagonistic speech. This annotated sample will be used to train machine learning models and build a classifier for use in further analysis.
The annotation sample was created by sampling from a larger dataset that was built by filtering Twitter’s Streaming API with keywords and slang terms referring to LGBTI identity (a more detailed sampling methodology can be provided). After sampling, 2409 tweets were uploaded to Figure8 (previously known as CrowdFlower).
Annotators were asked to read the tweet text and answer a binary yes/no question asking whether the text was ‘offensive or antagonistic in terms of sexual orientation’. To train annotators and to increase consistency between them, extended job guidance was provided, including an overview of the job, the steps to take, rules and tips detailing what constitutes antagonistic speech, and examples of both antagonistic and non-antagonistic tweets with explanations as to why each tweet falls into its particular category. Following the job description, annotators were asked to answer 51 pre-answered test questions. When an annotator missed a test question, a comment detailing the right answer was shown. Only annotators who scored over 70% on the test questions progressed to the next phase and were provided with raw, unannotated tweets.
Because of the nature of the job, the annotation task required a rather nuanced understanding of the English language. Therefore, only annotators from five majority English-speaking countries (i.e. the UK, USA, Canada, Australia and New Zealand) were selected. Further, annotators were limited to a maximum of 200 tweets each to prevent over-fitting to any single annotator’s judgements. Similarly, each tweet was judged by four different annotators. Lastly, annotators who spent less than 10 seconds per row were flagged and removed from the annotation.
Figure taken from Figure8
As these are self-explanatory, a detailed discussion is not necessary.
Throughout the job, 185 annotators passed the quiz questions, while 11 annotators failed and were removed from the job. Of the 185 who passed the test questions, 71 annotators maxed out (i.e. labelled 200 tweets) and the remaining annotators annotated fewer than 200 tweets. Only 3 annotators were flagged and removed from the job. The judgement count distribution per annotator is also presented below. No abnormalities were observed.
Figure taken from Figure8
Figure taken from Figure8
Another measure of job performance is performance on the test questions. According to the data provided by Figure8, 51 test questions were deemed good (more than 50% of annotators answered them correctly on the first try) and only 1 was deemed missed (answered incorrectly by more than 50% of annotators). There were no contested test questions (annotators can contest a test question if they do not agree with the explanation provided for a failed test question). Trusted annotators (i.e. Figure8 contributors who consistently score highly across different jobs) achieved 93% accuracy on the test questions, whereas untrusted annotators managed only 58%. For 36 test questions, accuracy was 100% across all annotators. These results indicate that the test questions were well devised and helpful in preparing annotators to label the raw data.
Figure taken from Figure8
Annotators were also presented with a voluntary exit survey when they finished the job. 33 annotators volunteered to take the contributor satisfaction survey. According to the exit survey, the overall satisfaction rate was high (4.2 out of 5). Almost all annotators thought the instructions were clear. A large majority of annotators found the test questions fair and were happy with the ease of the job and the pay.
Figure taken from Figure8
All considered, the performance of the annotation job was more than satisfactory. This indicates the job design was of high quality; the descriptions and guidance provided in the job design were clear to annotators.
When a job is complete, Figure8 provides the results in multiple formats.
For the purposes of this report, the Aggregated Report (results grouped by individual tweet) is adequate.
Below, we look at the Aggregated Report results and the sentiment of a sample of the annotated tweets.
library(tidyverse)
library(hrbrthemes)

aggr_report_original <- read_csv("../data/annotation/aggregated_report_1277648.csv") %>%
  select(-2:-5) %>% # dropping unused columns ## amazing that dplyr lets me get away with this syntax
  select(id,
         label_annotation = is_this_text_offensive_or_antagonistic_in_terms_of_religion_race_sexual_orientation_and_disability,
         confidence_annotation = `is_this_text_offensive_or_antagonistic_in_terms_of_religion_race_sexual_orientation_and_disability:confidence`,
         everything())

aggr_report <- aggr_report_original %>%
  filter(confidence_annotation > 0.7)
Here, the confidence score can be construed as a measure of inter-rater agreement (akin to an inter-rater reliability score such as Cohen’s kappa). When we keep only the rows with a confidence score above 70%, 203 rows are dropped. This is nearly 10% of the sample, which in my experience is quite low, meaning annotators were not very far apart in their labelling.
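As a quick sanity check, the number and share of dropped rows can be recovered from the two data frames defined above (a minimal sketch, not part of the original pipeline):

n_dropped <- nrow(aggr_report_original) - nrow(aggr_report) # rows lost to the 70% confidence filter
n_dropped
n_dropped / nrow(aggr_report_original) # proportion of the original sample dropped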
Next, let’s take a look at the rows annotated as antagonistic or offensive (labelled as 1 in the dataset) and those annotated as neither antagonistic nor offensive (labelled as 2 in the dataset). While 335 tweets were labelled as antagonistic by annotators, 1871 were labelled as not antagonistic. This means roughly 15% of the remaining tweets (335 out of 2206) were labelled as antagonistic. This is a bit higher than in other datasets I have worked with previously.
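These counts can be reproduced with a simple tally of the label column (a sketch using only the columns created above; shares are computed over the remaining high-confidence tweets):

aggr_report %>%
  count(label_annotation) %>% # 1 = antagonistic, 2 = not antagonistic
  mutate(share = n / sum(n))  # share of each label among the remaining tweets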
Now let’s check out the tweets labelled as antagonistic. I’ll randomly sample and print 20 of them, twice, using different random seeds.
First batch of tweets labelled as antagonistic:
set.seed(3)
aggr_report %>%
  filter(label_annotation == 1) %>%
  select(text) %>%
  sample_n(20)
Second batch of tweets labelled as antagonistic:
set.seed(4)
aggr_report %>%
  filter(label_annotation == 1) %>%
  select(text) %>%
  sample_n(20)
Judging by these samples, we can say that the annotation of antagonistic tweets was successful. There are many terms and slurs referring to LGBTI identity, coupled with common expletives. The word ‘faggot’ and its abbreviated forms are quite common in both samples.
Let’s contrast these tweets with tweets labelled as not antagonistic.
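To make this impression a little more concrete, a rough word-frequency tally over the antagonistic tweets could be produced as follows (a sketch that assumes the tidytext package is available; this was not part of the original annotation job):

library(tidytext)

aggr_report %>%
  filter(label_annotation == 1) %>%
  select(id, text) %>%
  unnest_tokens(word, text) %>%          # one row per token
  anti_join(stop_words, by = "word") %>% # drop common English stop words
  count(word, sort = TRUE) %>%
  head(20)                               # 20 most frequent terms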
set.seed(7)
aggr_report %>%
  filter(label_annotation == 2) %>%
  select(text) %>%
  sample_n(20)
set.seed(1234)
aggr_report %>%
  filter(label_annotation == 2) %>%
  select(text) %>%
  sample_n(20)
Even though these tweets labelled as not antagonistic contain words, slang terms and/or slurs which refer to LGBTI identity, such as ‘dyke’, ‘gay’, and ‘queer’, the sentiment is not offensive or antagonistic. Some counter-speech is also observed. There is a clear contrast with the tweets that were labelled as antagonistic.
We have created a sample of 2409 tweets which refer to LGBTI identity on Twitter. We uploaded these tweets to Figure8 and asked annotators whether each tweet was offensive or antagonistic towards LGBTI identity. Fewer than 10% of tweets had to be removed due to low confidence (inter-rater agreement) scores. Roughly 15% of the remaining tweets in the sample were annotated as antagonistic or offensive. Although this is a bit high, manual inspection of tweets labelled as antagonistic versus non-antagonistic revealed that the two sets are semantically distinct.
Finally, it can be said that the annotation job was a success. These tweets can be used to build machine learning classifiers.
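As a final step, the high-confidence labelled tweets can be written out for model training (a minimal sketch; the output path and label names are hypothetical):

aggr_report %>%
  transmute(id,
            text,
            label = if_else(label_annotation == 1, "antagonistic", "not_antagonistic")) %>%
  write_csv("../data/annotation/lgbti_annotated_sample.csv") # hypothetical output path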