Intro

The idea is to use “#firstsevenlanguages” and “#first7languages” tweets as an example of text analysis: first querying the Twitter API with the rtweet package, then cleaning the tweets a bit.

The initial goal of the “#firstsevenlanguages” and “#first7languages” tweets was for users to list the first seven human or computer languages they learnt. I want to have a look at the most frequent ones.

library("rtweet")
library("dplyr")
library("stringr")

Getting tweets

If you follow the instructions given in the repository of the rtweet package, you do not have to re-enter your tokens each time.

first7languages <- search_tweets(q = "#firstsevenlanguages", count = 100000, type = "recent")
first7languages <- rbind(first7languages,
                    search_tweets(q = "first7languages", count = 100000, type = "recent"))
first7languages <- unique(first7languages)
first7languages <- filter(first7languages, lang == "en")
first7languages
## # A tibble: 9,048 x 26
##    created_at          status_id retweet_count favorite_count
##        <time>              <chr>         <int>          <int>
## 1        <NA> 765144346660405248           310            389
## 2        <NA> 765467037288239104           311              0
## 3        <NA> 765466942396305408             1              0
## 4        <NA> 765466842425073664             0              0
## 5        <NA> 765466615387480064             0              0
## 6        <NA> 765466135751950336             0              0
## 7        <NA> 765466003220332544             0              0
## 8        <NA> 765465958731419648           310              0
## 9        <NA> 765465819866279936             3              0
## 10       <NA> 765465811515478016            14              0
## # ... with 9,038 more rows, and 22 more variables: text <chr>,
## #   in_reply_to_status_id <chr>, in_reply_to_user_id <chr>,
## #   is_quote_status <lgl>, quoted_status_id <chr>, lang <chr>,
## #   user_id <chr>, user_mentions <list>, hashtags <list>, urls <list>,
## #   is_retweet <lgl>, retweet_status_id <chr>, place_name <chr>,
## #   country <chr>, long1 <dbl>, long2 <dbl>, long3 <dbl>, long4 <dbl>,
## #   lat1 <dbl>, lat2 <dbl>, lat3 <dbl>, lat4 <dbl>

Cleaning

The goal is to get each language on its own, which is hard because the tweets separate them in different ways. Moreover, this code does not totally get rid of 1) spam tweets and 2) parts of the “#firstsevenlanguages” tweets that are not language names.

The parsing is tricky because each tweet can separate its entries differently, e.g. with slashes, or with commas plus slashes for a combined entry (like the “HTML/CSS” entries in the results). So I’ll try to guess which character is the separator by counting the slashes, commas, semicolons, etc. in each tweet and selecting the most frequent one.
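The guessing step can be sketched on its own with base R. This is a minimal illustration, not the code used for the analysis: the helper names `count_matches` and `guess_separator` and the sample tweet are made up, and the patterns are written without backslash escapes where base R allows it.

```r
# Candidate separators as regular expressions (same candidates as in the post,
# written with character classes instead of backslash escapes).
separators <- c("[?]", "[.]", "!", ";", ",", "\n", "/", "[1-7]")

# Count the matches of one pattern in a single string (base R only).
count_matches <- function(text, pattern) {
  hits <- gregexpr(pattern, text)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

# Return the candidate that occurs most often; ties go to the first candidate.
guess_separator <- function(text) {
  counts <- vapply(separators, function(p) count_matches(text, p), integer(1))
  separators[which.max(counts)]
}

# Hypothetical tweet, made up for illustration.
tweet <- "BASIC, Pascal, C, C++, Java, Python, R"
sep <- guess_separator(tweet)

# Split on the guessed separator and trim the surrounding spaces.
langs <- trimws(strsplit(tweet, sep)[[1]])
langs
```

Here the comma wins because it appears six times and no other candidate appears at all. A tweet that mixes several separators in similar numbers would be split less cleanly.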

If someone used spaces, then I won’t be able to parse the tweet. Hopefully it was a rare occurrence (it’s quite hard for a human being to read a tweet that has only spaces as separators!).

Moreover, a tweet can be in English with a part in Swedish (“#sjuförstajobben”) or in Chinese, so these parts cannot be analyzed.

separators <- c("\\?", "\\.", "\\!", "\\;",
                "\\,", "\\\n", "\\/", "[1-7]")


first7languages <- first7languages  %>%
    # no link in the tweet
    filter(is.na(urls)) %>%
    select(status_id, text) %>%
    group_by(status_id) %>%
    # count the potential separators, in the same order as `separators`
    mutate(no_interrogation = str_count(text, "\\?")) %>%
    mutate(no_point = str_count(text, "\\.")) %>%
    mutate(no_exclamation = str_count(text, "\\!")) %>%
    mutate(no_semicolon = str_count(text, "\\;")) %>%
    mutate(no_comma = str_count(text, "\\,")) %>%
    mutate(no_newline = str_count(text, "\\\n")) %>%
    mutate(no_slash = str_count(text, "\\/")) %>%
    mutate(no_number = str_count(text, "[1-7]")) %>%
    # highest count among the candidates
    mutate(max_separator = max(c(no_interrogation,
                                 no_point,
                                 no_exclamation,
                                 no_semicolon,
                                 no_comma,
                                 no_newline,
                                 no_slash,
                                 no_number))) %>%
    # keep tweets whose main separator appears 6 to 8 times,
    # roughly what a list of seven items produces
    filter(max_separator >= 6) %>%
    filter(max_separator <= 8) %>%
    # the most frequent candidate becomes the separator for this tweet
    mutate(separator = separators[order(c(no_interrogation,
                                          no_point,
                                          no_exclamation,
                                          no_semicolon,
                                          no_comma,
                                          no_newline,
                                          no_slash,
                                          no_number),
                                        decreasing = TRUE)[1]]) %>%
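    # no numbers; note that "[1-9]." also removes the character after each
    # digit and skips zeros, so "x86 Assembly" becomes "x Assembly"
    mutate(text = gsub("[1-9].", "", text)) %>%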
    # no parenthesis
    mutate(text = gsub("[\\(\\)]", "", text)) %>%
    # no username (the pattern also removes the text that follows the "@")
    mutate(text = gsub("\\@.*", "", text)) %>%
    # no RT part
    filter(!grepl("RT \\@", text)) %>%
    # don't keep the polluting hashtags
    filter(!grepl("\\#NameofFirstPet", text)) %>%
    mutate(text = gsub("\\#firstsevenlanguages", "", text)) %>%
    mutate(text = gsub("\\#FirstSevenlanguages", "", text)) %>%
    mutate(text = gsub("\\#FirstSevenLanguages", "", text))  %>%
    mutate(text = gsub("\\#firstobs", "", text)) %>%
    filter(!grepl("\\#MothersMaidenName", text)) %>%
    filter(!grepl("\\#HighSchoolMascot", text)) %>%
    # no polluting sentence "Can we get these trending too"
    filter(!grepl("Can we get these trending too", text)) %>%
    # no hyphen: replace each one with a space, then collapse double spaces
    mutate(text = gsub("-", " ", text)) %>%
    mutate(text = gsub("  ", " ", text)) %>%
    # no empty
    filter(text != "") %>%
    filter(text != " ") %>%
    filter(text != "RT") %>%
    filter(text != "RT ") %>%
    # split by these in order to separate the languages
    do(str_split(.$text, .$separator) %>%
         unlist %>%
         tibble(wordsgroup = .)) %>%
    ungroup() %>%
    # no emojis
    mutate(wordsgroup = gsub("<.*>", "", wordsgroup)) %>%
    # no empty
    filter(wordsgroup != "") %>%
    filter(wordsgroup != " ") %>%
    filter(wordsgroup != "RT") %>%
    filter(wordsgroup != "RT ") %>%
    filter(!grepl("\\:", wordsgroup)) %>%
    group_by(status_id) %>%
    mutate(rank = 1:n()) %>%
    # remove remaining new lines
    mutate(wordsgroup = gsub("\\\n", "", wordsgroup)) %>%
    ungroup()
first7languages
## # A tibble: 32,014 x 3
##             status_id         wordsgroup  rank
##                 <chr>              <chr> <int>
## 1  764914996761415681                  C     1
## 2  764914996761415681                C++     2
## 3  764914996761415681                 C#     3
## 4  764914996761415681        Objective C     4
## 5  764914996761415681         x Assembly     5
## 6  764914996761415681 ARM Thumb assembly     6
## 7  764914996761415681               Java     7
## 8  764915140508561408              BASIC     1
## 9  764915140508561408                  C     2
## 10 764915140508561408               lisp     3
## # ... with 32,004 more rows
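Row 5 above reads "x Assembly", which looks like a side effect of the digit-removal step rather than what the user wrote. The sketch below applies the same pattern to a few made-up strings; that the original tweets said things like "x86 Assembly" is an assumption, since the raw text is not shown here. The second function, `strip_digits_only`, is a hypothetical alternative and not part of the cleaning code.

```r
# The digit-removal pattern from the cleaning code: a digit from 1 to 9
# followed by any one character.
strip_digits <- function(text) gsub("[1-9].", "", text)

# Hypothetical inputs, chosen to resemble entries in the output above.
examples <- c("x86 Assembly", "6502 assembly", "Z80 Assembly")
strip_digits(examples)

# An alternative that removes only the digits themselves (including zeros).
strip_digits_only <- function(text) gsub("[0-9]+", "", text)
strip_digits_only(examples)
```

With the original pattern, "6502 assembly" loses "65" and then "2 " (the digit plus the following space), which leaves "0assembly". That would explain entries such as "0assembly" and "000assembly" in the summary.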

Now try to summarize a bit

first7languages %>% 
  group_by(wordsgroup = gsub(" ", "", wordsgroup)) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  filter(n > 5) %>%
  knitr::kable()
wordsgroup n
C 1554
Java 1184
English 1145
C++ 1102
Pascal 971
PHP 937
JavaScript 717
Python 716
BASIC 627
Basic 563
French 540
C# 503
Perl 487
VisualBasic 454
Ruby 439
Spanish 417
HTML 399
German 349
Javascript 317
Logo 283
SQL 272
ObjectiveC 252
Fortran 232
Assembly 228
Latin 208
TurboPascal 187
Scheme 159
C/C++ 156
Swift 153
ActionScript 148
QBasic 143
CSS 136
Lisp 133
Prolog 133
Delphi 128
Italian 128
Haskell 123
Sarcasm 119
Bash 115
VB 115
Japanese 112
Russian 107
FORTRAN 101
COBOL 100
Assembler 78
HTML/CSS 76
VBScript 76
LOGO 70
Smalltalk 70
Forth 68
Swedish 65
HyperTalk 63
R 63
QBASIC 61
Go 60
Dutch 59
ColdFusion 56
Actionscript 53
Greek 53
0Assembly 52
ASP 50
Matlab 50
Cobol 49
JS 49
Chinese 47
Clipper 47
0assembly 46
Ada 45
BBCBasic 45
LISP 45
Portuguese 45
Lingo 44
php 44
0Assembler 42
APL 42
Arabic 42
QuickBasic 42
english 41
Lua 39
xassembly 39
xAssembly 38
perl 37
assembly 36
BBCBASIC 36
CommonLisp 36
Finnish 36
GWBasic 36
Php 36
Shell 36
AppleScript 35
Mandarin 34
Hebrew 33
python 33
GWBASIC 31
javascript 31
Music 31
ASM 30
ML 30
c 29
java 29
MATLAB 29
PigLatin 29
ZAssembly 29
PASCAL 28
Scala 28
TIBASIC 27
AncientGreek 26
AppleBasic 26
TIBasic 26
VBA 26
FoxPro 25
CommodoreBASIC 24
french 24
Polish 24
. 23
Klingon 23
ZAssembler 23
ApplesoftBASIC 22
Baby 22
basic 22
SinclairBASIC 22
Zassembler 22
Body 21
Elixir 21
html 21
ObjectPascal 21
pascal 21
PL/SQL 21
Qbasic 21
Gibberish 20
xasm 20
#firstSevenLanguages 19
0assembler 19
Korean 19
ObjC 19
PERL 19
Zassembly 19
0asm 18
Rust 18
AmericanEnglish 17
Batch 17
Mathematica 17
RPG 17
Welsh 17
xASM 17
? 16
000assembly 16
AppleBASIC 16
bash 16
c++ 16
CommodoreBasic 16
Danish 16
german 16
Groovy 16
Hindi 16
OCaml 16
QuickBASIC 16
SAS 16
SinclairBasic 16
TCL 16
Visualbasic 16
000Assembly 15
awk 15
Cat 15
Ebonics 15
FORTH 15
HTML/CSS/JS 15
Norwegian 15
PL/I 15
TSQL 15
Turkish 15
Algol 14
American 14
AMOS 14
Assemblyx 14
dBase 14
Emoji 14
Pascal/Delphi 14
Profanity 14
spanish 14
13
0ASM 13
Awk 13
BASH 13
Cantonese 13
CoffeeScript 13
JCL 13
Love 13
Powershell 13
Swearing 13
TBD 13
xAssembler 13
ASMx 12
c# 12
Fangirl 12
HyperCard 12
Internet 12
PowerShell 12
ShellScript 12
SML 12
Tagalog 12
XSLT 12
ApplesoftBasic 11
assembler 11
CBasic 11
Czech 11
Eiffel 11
GML 11
MandarinChinese 11
memes 11
MIPSAssembly 11
My 11
sarcasm 11
SmallTalk 11
StandardML 11
VisualBASIC 11
AppleSoftBASIC 10
Babytalk 10
Binary 10
Caml 10
Catalan 10
CBASIC 10
Coldfusion 10
DOSBatch 10
Html 10
JAVA 10
mIRCScript 10
Processing 10
Tcl 10
Turing 10
VHDL 10
visualbasic 10
Z 10
9
#English 9
actionscript 9
ActionScript0 9
AmericanSignLanguage 9
AppleSoftBasic 9
asm 9
Elvish 9
empty 9
F# 9
Hypercard 9
Hypertalk 9
ModulaC 9
PostScript 9
REXX 9
Shellscript 9
sql 9
TRSBASIC 9
Twitter 9
Vietnamese 9
VisualBasic.NET 9
Python 8
0Pascal 8
AWK 8
Comal 8
CommonLISP 8
dBaseIII 8
Elm 8
Esperanto 8
FORTRANIV 8
Golang 8
Irish 8
LaTeX 8
Maple 8
Miranda 8
mIRCscript 8
Occam 8
PowerBuilder 8
Rexx 8
sh 8
Spanglish 8
SpectrumBasic 8
SpectrumBASIC 8
TypeScript 8
Zasm 8
??? 7
Java 7
7
0 7
AmigaBasic 7
ARexx 7
ASL 7
Babbling 7
BourneShell 7
British 7
BrokenEnglish 7
C,C++ 7
Crying 7
Dog 7
EmacsLisp 7
Erlang 7
HTML&CSS 7
HTML+CSS 7
Hungarian 7
IDL 7
japanese 7
Java, 7
lisp 7
logo 7
Meme 7
Memes 7
MIPSassembly 7
OldEnglish 7
Parseltongue 7
Racket 7
ruby 7
Sass 7
Scratch 7
Southern 7
UCSDPascal 7
Ukrainian 7
Urdu 7
VBC++ 7
Verilog 7
x 7
xassembler 7
Yiddish 7
ZXBASIC 7
#Python 6
.NET 6
000 6
000assembler 6
000Assembler 6
AssemblyZ 6
AtariBasic 6
BadEnglish 6
BCPL 6
BlitzBasic 6
Bullcrap 6
Bullshit 6
C#.NET 6
C, 6
C+ 6
ClassicASP 6
Clojure 6
Coffeescript 6
Farsi 6
Food 6
Gaelic 6
GoodKorean 6
HTML/CSS/JavaScript 6
Js 6
KAssembly 6
Kotlin 6
LabVIEW 6
latin 6
LocomotiveBasic 6
Marathi 6
MatLab 6
Modula 6
MSXBasic 6
music 6
ObjectPascalDelphi 6
OldNorse 6
OPL 6
PICAssembly 6
PL/C 6
PLSQL 6
Pussycrushing 6
qbasic 6
REALbasic 6
realtimefeedback 6
russian 6
scheme 6
Scottish 6
Screeching 6
Silence 6
Slang 6
TBA 6
VbScript 6
VisualBasic0 6
Xassembly 6
XAssembly 6
Zmachinecode 6
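The table counts "BASIC", "Basic" and "basic" separately because the grouping is case-sensitive. One possible next step, sketched here with a few rows copied from the table, is to group on a lower-cased key. The column name `key` is made up for the example, and this only merges case variants, not abbreviations such as "JS" or "VB".

```r
# A few rows copied from the table above.
counts <- data.frame(
  wordsgroup = c("C", "BASIC", "Basic", "basic", "c",
                 "JavaScript", "Javascript", "javascript"),
  n = c(1554, 627, 563, 22, 29, 717, 317, 31),
  stringsAsFactors = FALSE
)

# Group on a lower-cased key and add up the counts.
counts$key <- tolower(counts$wordsgroup)
merged <- aggregate(n ~ key, data = counts, FUN = sum)

# Most frequent first.
merged <- merged[order(-merged$n), ]
merged
```

In the full pipeline the same idea would amount to lower-casing `wordsgroup` inside the `group_by()` call before summarizing.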