Importing the Complete Processed Dataset

This dataset is the result of all our previous processing steps. It contains tweet metadata, geo-locations, the results of classifying hate/antagonistic speech around Jewish identity, and the identification of agent types (i.e. celebrities, police, MPs, Jewish organisations and the media). Those steps are not shown here for the sake of brevity.

The dataset has 2,670,427 rows and 67 columns. A quick glimpse of the complete processed dataset is provided below.

complete.dataset.processed %>% glimpse()
## Observations: 2,670,427
## Variables: 67
## $ tweet.id.str                         <dbl> 655010066236612608, 65501...
## $ tweet.text.str                       <chr> "@brianaeden_xo @loveaIwa...
## $ tweet.time.str                       <chr> "Fri Oct 16 13:19:03 +000...
## $ timestamp.str                        <dbl> 1445001543946, 1445001546...
## $ user.id.str                          <dbl> 385685952, 137469743, 199...
## $ user.handle.str                      <chr> "sthrmsy", "Mirelle_Byruc...
## $ user.name.str                        <chr> "quesadeity", "Mirelle", ...
## $ user.verified                        <lgl> FALSE, FALSE, FALSE, FALS...
## $ user.followers                       <int> 108, 164, 2413, 4384, 121...
## $ user.following                       <int> 113, 63, 2091, 2007, 1259...
## $ user.status.count                    <int> 6195, 23000, 142393, 4173...
## $ user.description.str                 <chr> "בצלם אלוהים", "Smile at ...
## $ user.location.str                    <chr> NA, "Lovely #London  My H...
## $ user.timezone                        <chr> "London", "London", "Lond...
## $ retweeted.id.str                     <dbl> NA, 655009823772291072, 6...
## $ retweeted.text.str                   <chr> NA, "I should note: there...
## $ retweeted.time.str                   <chr> NA, "Fri Oct 16 13:18:06 ...
## $ retweeted.favorite.count             <int> NA, 1, 2, 0, 14, NA, NA, ...
## $ retweeted.retweet.count              <int> NA, 2, 3, 1, 10, NA, NA, ...
## $ retweeted.user.id.str                <dbl> NA, 16330790, 16330790, 4...
## $ retweeted.user.handle.str            <chr> NA, "AviMayer", "AviMayer...
## $ retweeted.user.name.str              <chr> NA, "Avi Mayer", "Avi May...
## $ retweeted.user.verified              <lgl> NA, FALSE, FALSE, FALSE, ...
## $ retweeted.user.followers             <int> NA, 30394, 30394, 6663, 6...
## $ retweeted.user.following             <int> NA, 6306, 6306, 6192, 642...
## $ retweeted.user.status.count          <int> NA, 52988, 52988, 74694, ...
## $ retweeted.user.description.str       <chr> NA, "Just some guy living...
## $ retweeted.user.location.str          <chr> NA, "Jerusalem, Israel", ...
## $ retweeted.user.timezone              <chr> NA, "Jerusalem", "Jerusal...
## $ quoted.id.str                        <dbl> NA, NA, NA, NA, NA, NA, N...
## $ quoted.text.str                      <chr> NA, NA, NA, NA, NA, NA, N...
## $ quoted.time.str                      <chr> NA, NA, NA, NA, NA, NA, N...
## $ quoted.favorite.count                <int> NA, NA, NA, NA, NA, NA, N...
## $ quoted.retweet.count                 <int> NA, NA, NA, NA, NA, NA, N...
## $ quoted.user.id.str                   <dbl> NA, NA, NA, NA, NA, NA, N...
## $ quoted.user.handle.str               <chr> NA, NA, NA, NA, NA, NA, N...
## $ quoted.user.name.str                 <chr> NA, NA, NA, NA, NA, NA, N...
## $ quoted.user.verified                 <lgl> NA, NA, NA, NA, NA, NA, N...
## $ quoted.user.followers                <int> NA, NA, NA, NA, NA, NA, N...
## $ quoted.user.following                <int> NA, NA, NA, NA, NA, NA, N...
## $ quoted.user.status.count             <int> NA, NA, NA, NA, NA, NA, N...
## $ quoted.user.description.str          <chr> NA, NA, NA, NA, NA, NA, N...
## $ quoted.user.location.str             <chr> NA, NA, NA, NA, NA, NA, N...
## $ quoted.user.timezone                 <chr> NA, NA, NA, NA, NA, NA, N...
## $ in.reply.to.status.id.str            <dbl> 655009821637410816, NA, N...
## $ in.reply.to.user.id.str              <dbl> 2527048611, NA, NA, NA, N...
## $ in.reply.to.screen.name              <chr> "brianaeden_xo", NA, NA, ...
## $ tweet.time.posix                     <date> 2015-10-16, 2015-10-16, ...
## $ lat                                  <dbl> NA, NA, NA, NA, NA, NA, N...
## $ long                                 <dbl> NA, NA, NA, NA, NA, NA, N...
## $ user.is.celebrity                    <lgl> FALSE, FALSE, FALSE, FALS...
## $ user.is.media                        <lgl> FALSE, FALSE, FALSE, FALS...
## $ user.is.MemberOfParliament           <lgl> FALSE, FALSE, FALSE, FALS...
## $ user.is.police                       <lgl> FALSE, FALSE, FALSE, FALS...
## $ user.is.JewishOrganisation           <lgl> FALSE, FALSE, FALSE, FALS...
## $ retweeted.user.is.celebrity          <lgl> FALSE, FALSE, FALSE, FALS...
## $ retweeted.user.is.media              <lgl> FALSE, FALSE, FALSE, FALS...
## $ retweeted.user.is.MemberOfParliament <lgl> FALSE, FALSE, FALSE, FALS...
## $ retweeted.user.is.police             <lgl> FALSE, FALSE, FALSE, FALS...
## $ retweeted.user.is.JewishOrganisation <lgl> FALSE, FALSE, FALSE, FALS...
## $ quoted.user.is.celebrity             <lgl> FALSE, FALSE, FALSE, FALS...
## $ quoted.user.is.media                 <lgl> FALSE, FALSE, FALSE, FALS...
## $ quoted.user.is.MemberOfParliament    <lgl> FALSE, FALSE, FALSE, FALS...
## $ quoted.user.is.police                <lgl> FALSE, FALSE, FALSE, FALS...
## $ quoted.user.is.JewishOrganisation    <lgl> FALSE, FALSE, FALSE, FALS...
## $ classifier.prob.yes                  <dbl> 0.4793730020428988436443,...
## $ classifier.hate.yes                  <lgl> FALSE, FALSE, FALSE, FALS...

Plotting Data

First, convert the complete dataset to a daily time series.

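# One observation per tweet, indexed by posting date
ts <- xts(x = rep(1, times = nrow(complete.dataset.processed)),
          order.by = complete.dataset.processed$tweet.time.posix)
# Sum the observations within each day to get daily tweet counts
ts.sum <- apply.daily(ts, sum)
# Convert to a data frame for ggplot2
ts.sum.df <- data.frame(date = index(ts.sum), coredata(ts.sum))
colnames(ts.sum.df) <- c('date', 'sum')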
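
The same daily totals can also be computed without xts. The sketch below assumes dplyr is available and uses a small made-up data frame (`toy`, a hypothetical name) in place of `complete.dataset.processed`; only the column name `tweet.time.posix` is taken from the dataset above. As with the xts approach, days with no tweets simply do not appear in the result.

```r
library(dplyr)

# Toy stand-in for complete.dataset.processed (dates are invented)
toy <- tibble(
  tweet.time.posix = as.Date(c("2015-10-16", "2015-10-16", "2015-10-17",
                               "2015-10-19", "2015-10-19", "2015-10-19"))
)

# One row per day, with the number of tweets posted on that day
daily.counts <- toy %>%
  count(tweet.time.posix, name = "sum") %>%
  rename(date = tweet.time.posix)

daily.counts
```

The resulting data frame has the same two columns (`date`, `sum`) as `ts.sum.df`, so it could be passed to the same plotting code.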

Time series line graph of the complete dataset (daily)

a <- ggplot(ts.sum.df)+geom_line(aes(x=date,y=sum))+
     labs( x= 'Time (Daily)', y= "Tweet Count",
           title = "Line Graph of Tweet Counts For the Complete Dataset ", 
           subtitle = "Graph 1", 
           caption = "Social Data Lab") +
     theme_ipsum_rc()+
     scale_x_date(date_breaks = "1 month",date_labels = ("%b-%y"))

a

The highest peak appears to be around the Livingstone event (28 April 2016). There is another peak around the first week of July, but I am not sure what caused it.

First, create a dataset called ‘antagonistic’ that contains only the tweets classified as antagonistic speech. You can toggle the tweet texts using the buttons on the page if you’d like to take a peek at the classification results.

antagonistic <- complete.dataset.processed %>% 
     filter(classifier.hate.yes==TRUE) %>% 
     select(tweet.id.str, tweet.text.str, tweet.time.posix,classifier.prob.yes,classifier.hate.yes) %>% 
     glimpse()
## Observations: 15,575
## Variables: 5
## $ tweet.id.str        <dbl> 655011747137458176, 655012089770188800, 65...
## $ tweet.text.str      <chr> "@NahumRoni Right, look at how many Jews c...
## $ tweet.time.posix    <date> 2015-10-16, 2015-10-16, 2015-10-16, 2015-...
## $ classifier.prob.yes <dbl> 0.7136901412143061840254, 0.72849177866786...
## $ classifier.hate.yes <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, ...
antagonistic
## # A tibble: 15,575 x 5
##          tweet.id.str
##                 <dbl>
##  1 655011747137458176
##  2 655012089770188800
##  3 655012816571768832
##  4 655027410430246912
##  5 655027997804765184
##  6 655028426047410176
##  7 655029759924162560
##  8 655036836012826624
##  9 655048042693939200
## 10 655048205994958848
## # ... with 15,565 more rows, and 4 more variables: tweet.text.str <chr>,
## #   tweet.time.posix <date>, classifier.prob.yes <dbl>,
## #   classifier.hate.yes <lgl>

Time series line graph for tweets classified as antagonistic (grouped daily)

ts.antagonistic<- xts(x = rep(1,times=nrow(antagonistic)), order.by = antagonistic$tweet.time.posix)
ts.sum.antagonistic <- apply.daily(ts.antagonistic,sum)
ts.sum.df.antagonistic <- data.frame(date=index(ts.sum.antagonistic), coredata(ts.sum.antagonistic))
colnames(ts.sum.df.antagonistic)=c('date','sum')

b <- ggplot(ts.sum.df.antagonistic)+
     geom_line(aes(x=date,y=sum), colour="red")+
     labs( x= 'Time (Daily)', y= "Tweet Count",
           title = "Line Graph of Tweet Counts For the Tweets \nthat Were Classified as Antagonistic", 
           subtitle = "Graph 2", 
           caption = "Social Data Lab") +
     theme_ipsum_rc()+
     scale_x_date(date_breaks = "1 month",date_labels = ("%b-%y"))

b

Note the peaks around late April 2016, 15-16 June 2016 and 14-15 August 2016.

Plotting the complete dataset and the antagonistic subset on the same chart.
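
Peak days can also be read off the daily counts directly instead of from the plot. This is a minimal sketch, assuming dplyr is loaded; `toy.daily` is a made-up stand-in with the same `date` and `sum` columns as `ts.sum.df.antagonistic`, and its values are invented for illustration.

```r
library(dplyr)

# Toy stand-in for ts.sum.df.antagonistic (dates and counts are invented)
toy.daily <- data.frame(
  date = as.Date("2016-04-25") + 0:6,
  sum  = c(12, 15, 40, 210, 95, 30, 18)
)

# The three days with the highest tweet counts, largest first
top.days <- toy.daily %>%
  arrange(desc(sum)) %>%
  head(3)

top.days
```

Running the same two steps on the real daily counts would list the dates behind the visible peaks.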

c <- ggplot() + 
     geom_line(data = ts.sum.df.antagonistic, aes(x = date, y = sum, colour = "Antagonistic")) +
     geom_line(data = ts.sum.df, aes(x = date, y = sum, colour = "All tweets")) +
     # Map the series labels to colours so that a legend is drawn;
     # a fixed colour outside aes() would override the mapping
     scale_colour_manual(name = NULL, 
                         values = c("Antagonistic" = "red", "All tweets" = "black")) +
     labs( x= 'Time (Daily)', y= "Tweet Count",
           title = "Tweet Counts of Complete Data vs Antagonistic Subset", 
           subtitle = "Graph 3", 
           caption = "Social Data Lab") +
     theme_ipsum_rc()+
     scale_x_date(date_breaks = "1 month",date_labels = ("%b-%y"))

c

I’d say this is expected. Given that only about 0.6% (15,575 out of 2,670,427) of tweets were classified as antagonistic, plotting the antagonistic tweets and the complete dataset in the same plot on the same scale does not produce a visually meaningful plot.
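
One way around the scale problem is to give each series its own panel and y-axis. The sketch below is only an illustration, assuming dplyr and ggplot2 are loaded; `all.df` and `hate.df` are hypothetical stand-ins for `ts.sum.df` and `ts.sum.df.antagonistic`, filled with invented numbers.

```r
library(dplyr)
library(ggplot2)

# Toy stand-ins for ts.sum.df and ts.sum.df.antagonistic (invented values)
dates   <- as.Date("2016-04-25") + 0:6
all.df  <- data.frame(date = dates,
                      sum = c(9000, 9500, 12000, 30000, 18000, 11000, 9800))
hate.df <- data.frame(date = dates,
                      sum = c(12, 15, 40, 210, 95, 30, 18))

# Stack both series in long format with a label column
both <- bind_rows(
  mutate(all.df,  series = "All tweets"),
  mutate(hate.df, series = "Antagonistic")
)

# One panel per series, each with its own y-axis
p <- ggplot(both, aes(x = date, y = sum, colour = series)) +
  geom_line() +
  facet_wrap(~ series, ncol = 1, scales = "free_y") +
  scale_colour_manual(values = c("All tweets" = "black",
                                 "Antagonistic" = "red")) +
  labs(x = "Time (Daily)", y = "Tweet Count")

p
```

With `scales = "free_y"` the two panels share the date axis but not the count axis, so the shape of the smaller series stays visible.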

Selecting Events Based on Peaks in the Plots

The time series plots have revealed several peaks in antagonistic speech in the dataset. Below I select three events around three of those peaks: late April 2016, 15-16 June 2016 and 14-15 August 2016. Each event includes one week before and one week after the peak.
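
The windowing described above could be sketched as follows. This is not necessarily how the events are selected later on: the exact peak dates are assumptions (28 April is taken from the Livingstone event mentioned earlier, and the June and August ranges from the text), and `peaks`, `toy` and `toy.events` are hypothetical names. dplyr is assumed to be loaded.

```r
library(dplyr)

# Assumed peak ranges; the text gives only approximate dates
peaks <- tibble(
  event      = c("late April 2016", "mid June 2016", "mid August 2016"),
  peak.start = as.Date(c("2016-04-28", "2016-06-15", "2016-08-14")),
  peak.end   = as.Date(c("2016-04-28", "2016-06-16", "2016-08-15"))
) %>%
  # One week before the first peak day and one week after the last
  mutate(start = peak.start - 7, end = peak.end + 7)

# Toy stand-in for the antagonistic dataset (dates are invented)
toy <- tibble(
  tweet.time.posix = as.Date(c("2016-04-20", "2016-04-22", "2016-05-05",
                               "2016-05-06", "2016-06-10", "2016-08-21",
                               "2016-09-01"))
)

# Label each tweet with the event whose window contains its date (NA if none)
toy.events <- toy %>%
  mutate(event = vapply(tweet.time.posix, function(d) {
    hit <- peaks$event[d >= peaks$start & d <= peaks$end]
    if (length(hit) == 0) NA_character_ else hit[1]
  }, character(1)))

# Keep only the tweets that fall inside one of the three windows
toy.events %>% filter(!is.na(event))
```

Keeping the windows in a small table makes it easy to adjust a peak date or the window width without touching the filtering code.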