The idea is to use “#firstsevenlanguages” and “#first7languages” tweets as an example of text analysis, starting with querying the Twitter API with the rtweet package and then cleaning the tweets a bit.
The initial goal of the “#firstsevenlanguages” and “#first7languages” tweets was for users to give a short list of the first 7 human or computer languages they learnt. I want to have a look at the most frequent ones.
library("rtweet")
library("dplyr")
library("stringr")
If you follow the instructions given in the repository of the rtweet package, you do not have to re-enter your tokens each time.
first7languages <- search_tweets(q = "#firstsevenlanguages", count = 100000, type = "recent")
first7languages <- rbind(first7languages,
                         search_tweets(q = "first7languages", count = 100000, type = "recent"))
first7languages <- unique(first7languages)
first7languages <- filter(first7languages, lang == "en")
first7languages
## # A tibble: 9,048 x 26
## created_at status_id retweet_count favorite_count
## <time> <chr> <int> <int>
## 1 <NA> 765144346660405248 310 389
## 2 <NA> 765467037288239104 311 0
## 3 <NA> 765466942396305408 1 0
## 4 <NA> 765466842425073664 0 0
## 5 <NA> 765466615387480064 0 0
## 6 <NA> 765466135751950336 0 0
## 7 <NA> 765466003220332544 0 0
## 8 <NA> 765465958731419648 310 0
## 9 <NA> 765465819866279936 3 0
## 10 <NA> 765465811515478016 14 0
## # ... with 9,038 more rows, and 22 more variables: text <chr>,
## # in_reply_to_status_id <chr>, in_reply_to_user_id <chr>,
## # is_quote_status <lgl>, quoted_status_id <chr>, lang <chr>,
## # user_id <chr>, user_mentions <list>, hashtags <list>, urls <list>,
## # is_retweet <lgl>, retweet_status_id <chr>, place_name <chr>,
## # country <chr>, long1 <dbl>, long2 <dbl>, long3 <dbl>, long4 <dbl>,
## # lat1 <dbl>, lat2 <dbl>, lat3 <dbl>, lat4 <dbl>
The goal is to get the languages on their own, which is hard because the separation was done differently from tweet to tweet. Moreover, with this code I don’t totally get rid of 1) spam tweets and 2) parts of the “#firstsevenlanguages” tweets that are not language names.
The parsing is tricky because each tweet can use different separators between entries, e.g. slashes, or commas plus a slash inside a combined entry (HTML/CSS, second language, third language, etc.), so I’ll try to guess which character is the separator by counting the number of slashes, commas, semicolons, etc. and selecting the most frequent one in each tweet.
If someone used spaces, then I won’t be able to parse the tweet. Hopefully that was a rare occurrence (it’s quite hard for a human being to read a tweet when there are only spaces as separators!).
Moreover, a tweet can be in English with a part in Swedish (“#sjuförstajobben”) or in Chinese, so these parts cannot be analyzed.
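To make the separator-guessing idea concrete, here is a minimal sketch on a single made-up tweet. The helper name `guess_separator` and the example tweet are assumptions of mine for illustration only, and `which.max` is used as a shortcut for picking the most frequent candidate, so the details may differ from what is done on the full data.

```r
library("stringr")

# Candidate separators, written as regular expressions
separators <- c("\\?", "\\.", "\\!", "\\;", "\\,", "\n", "\\/", "[1-7]")

# Hypothetical helper: return the candidate that occurs most often in `text`
guess_separator <- function(text) {
  counts <- vapply(separators, function(s) str_count(text, s),
                   integer(1), USE.NAMES = FALSE)
  separators[which.max(counts)]
}

# Made-up tweet, not taken from the data
tweet <- "BASIC, Pascal, C, C++, Java, Perl, Python"

sep <- guess_separator(tweet)   # the comma wins here, with 6 occurrences
languages <- str_trim(str_split(tweet, sep)[[1]])
languages
# "BASIC" "Pascal" "C" "C++" "Java" "Perl" "Python"
```

A tweet that only uses spaces between languages would give a count of zero for every candidate, which is why such tweets cannot be parsed with this approach.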
# candidate separators, as regular expressions
separators <- c("\\?", "\\.", "\\!", "\\;",
                "\\,", "\\\n", "\\/", "[1-7]")
first7languages <- first7languages %>%
# no link in the tweet
filter(is.na(urls)) %>%
select(status_id, text) %>%
group_by(status_id) %>%
# count the potential separators
mutate(no_interrogation = str_count(text, "\\?")) %>%
mutate(no_point = str_count(text, "\\.")) %>%
mutate(no_exclamation = str_count(text, "\\!")) %>%
mutate(no_semicolon = str_count(text, "\\;")) %>%
mutate(no_comma = str_count(text, "\\,")) %>%
mutate(no_newline = str_count(text, "\\\n")) %>%
mutate(no_slash = str_count(text, "\\/")) %>%
mutate(no_number = str_count(text, "[1-7]")) %>%
# keep tweets whose most frequent separator appears 6 to 8 times
mutate(max_separator = max(c(no_interrogation,
                             no_point,
                             no_exclamation,
                             no_semicolon,
                             no_comma,
                             no_newline,
                             no_slash,
                             no_number))) %>%
filter(max_separator >= 6) %>%
filter(max_separator <= 8) %>%
# pick that most frequent separator (counts are in the same order as `separators`)
mutate(separator = separators[order(c(no_interrogation,
                                      no_point,
                                      no_exclamation,
                                      no_semicolon,
                                      no_comma,
                                      no_newline,
                                      no_slash,
                                      no_number),
                                    decreasing = TRUE)[1]]) %>%
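# no numbers: drops each digit 1-9 together with the character after it,
# so names containing digits are altered too (e.g. "x86" becomes "x")
mutate(text = gsub("[1-9].", "", text)) %>%
# no parentheses
mutate(text = gsub("[\\(\\)]", "", text)) %>%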
# no username
mutate(text = gsub("\\@.*", "", text)) %>%
# no RT part
filter(!grepl("RT \\@", text)) %>%
# don't keep the polluting hashtags
filter(!grepl("\\#NameofFirstPet", text)) %>%
mutate(text = gsub("\\#firstsevenlanguages", "", text)) %>%
mutate(text = gsub("\\#FirstSevenlanguages", "", text)) %>%
mutate(text = gsub("\\#FirstSevenLanguages", "", text)) %>%
mutate(text = gsub("\\#firstobs", "", text)) %>%
filter(!grepl("\\#MothersMaidenName", text)) %>%
filter(!grepl("\\#HighSchoolMascot", text)) %>%
# no polluting sentence "Can we get these trending too"
filter(!grepl("Can we get these trending too", text)) %>%
# no hyphen, then replace double spaces with single ones
mutate(text = gsub("-", " ", text)) %>%
mutate(text = gsub("  ", " ", text)) %>%
mutate(text = gsub("- ", "", text)) %>%
# no empty
filter(text != "") %>%
filter(text != " ") %>%
filter(text != "RT") %>%
filter(text != "RT ") %>%
# split on the guessed separator in order to separate the languages
do(str_split(.$text, .$separator) %>%
     unlist %>%
     data_frame(wordsgroup = .)) %>%
ungroup() %>%
# no emojis
mutate(wordsgroup = gsub("<.*>", "", wordsgroup)) %>%
# no empty
filter(wordsgroup != "") %>%
filter(wordsgroup != " ") %>%
filter(wordsgroup != "RT") %>%
filter(wordsgroup != "RT ") %>%
filter(!grepl("\\:", wordsgroup)) %>%
group_by(status_id) %>%
mutate(rank = 1:n()) %>%
# remove remaining new lines
mutate(wordsgroup = gsub("\\\n","",wordsgroup)) %>%
ungroup()
first7languages
## # A tibble: 32,014 x 3
## status_id wordsgroup rank
## <chr> <chr> <int>
## 1 764914996761415681 C 1
## 2 764914996761415681 C++ 2
## 3 764914996761415681 C# 3
## 4 764914996761415681 Objective C 4
## 5 764914996761415681 x Assembly 5
## 6 764914996761415681 ARM Thumb assembly 6
## 7 764914996761415681 Java 7
## 8 764915140508561408 BASIC 1
## 9 764915140508561408 C 2
## 10 764915140508561408 lisp 3
## # ... with 32,004 more rows
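Some entries in this output, such as “x Assembly”, look truncated. A likely explanation is the “no numbers” step, whose pattern removes every digit from 1 to 9 together with the character that follows it. The small sketch below applies the same pattern to a few made-up strings; the helper name `strip_numbers` is hypothetical and only there for illustration.

```r
# Same pattern as the "no numbers" step: a digit 1-9 plus the next character
strip_numbers <- function(x) gsub("[1-9].", "", x)

# A numbered list item is cleaned as intended
strip_numbers("1.BASIC")        # "BASIC"

# Language names that contain digits lose them, and sometimes more
strip_numbers("x86 Assembly")   # "x Assembly"
strip_numbers("6502 assembly")  # "0assembly"
```

This would explain why variants such as “xAssembly” and “0assembly” show up in the counts once spaces are removed.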
Now let’s try to summarize this a bit.
first7languages %>%
group_by(wordsgroup = gsub(" ", "", wordsgroup)) %>%
summarize(n = n()) %>%
arrange(desc(n)) %>%
filter(n > 5) %>%
knitr::kable()
| wordsgroup | n |
|---|---|
| C | 1554 |
| Java | 1184 |
| English | 1145 |
| C++ | 1102 |
| Pascal | 971 |
| PHP | 937 |
| JavaScript | 717 |
| Python | 716 |
| BASIC | 627 |
| Basic | 563 |
| French | 540 |
| C# | 503 |
| Perl | 487 |
| VisualBasic | 454 |
| Ruby | 439 |
| Spanish | 417 |
| HTML | 399 |
| German | 349 |
| Javascript | 317 |
| Logo | 283 |
| SQL | 272 |
| ObjectiveC | 252 |
| Fortran | 232 |
| Assembly | 228 |
| Latin | 208 |
| TurboPascal | 187 |
| Scheme | 159 |
| C/C++ | 156 |
| Swift | 153 |
| ActionScript | 148 |
| QBasic | 143 |
| CSS | 136 |
| Lisp | 133 |
| Prolog | 133 |
| Delphi | 128 |
| Italian | 128 |
| Haskell | 123 |
| Sarcasm | 119 |
| Bash | 115 |
| VB | 115 |
| Japanese | 112 |
| Russian | 107 |
| FORTRAN | 101 |
| COBOL | 100 |
| Assembler | 78 |
| HTML/CSS | 76 |
| VBScript | 76 |
| LOGO | 70 |
| Smalltalk | 70 |
| Forth | 68 |
| Swedish | 65 |
| HyperTalk | 63 |
| R | 63 |
| QBASIC | 61 |
| Go | 60 |
| Dutch | 59 |
| ColdFusion | 56 |
| Actionscript | 53 |
| Greek | 53 |
| 0Assembly | 52 |
| ASP | 50 |
| Matlab | 50 |
| Cobol | 49 |
| JS | 49 |
| Chinese | 47 |
| Clipper | 47 |
| 0assembly | 46 |
| Ada | 45 |
| BBCBasic | 45 |
| LISP | 45 |
| Portuguese | 45 |
| Lingo | 44 |
| php | 44 |
| 0Assembler | 42 |
| APL | 42 |
| Arabic | 42 |
| QuickBasic | 42 |
| english | 41 |
| Lua | 39 |
| xassembly | 39 |
| xAssembly | 38 |
| perl | 37 |
| assembly | 36 |
| BBCBASIC | 36 |
| CommonLisp | 36 |
| Finnish | 36 |
| GWBasic | 36 |
| Php | 36 |
| Shell | 36 |
| AppleScript | 35 |
| Mandarin | 34 |
| Hebrew | 33 |
| python | 33 |
| GWBASIC | 31 |
| javascript | 31 |
| Music | 31 |
| ASM | 30 |
| ML | 30 |
| c | 29 |
| java | 29 |
| MATLAB | 29 |
| PigLatin | 29 |
| ZAssembly | 29 |
| PASCAL | 28 |
| Scala | 28 |
| TIBASIC | 27 |
| AncientGreek | 26 |
| AppleBasic | 26 |
| TIBasic | 26 |
| VBA | 26 |
| FoxPro | 25 |
| CommodoreBASIC | 24 |
| french | 24 |
| Polish | 24 |
| . | 23 |
| Klingon | 23 |
| ZAssembler | 23 |
| ApplesoftBASIC | 22 |
| Baby | 22 |
| basic | 22 |
| SinclairBASIC | 22 |
| Zassembler | 22 |
| Body | 21 |
| Elixir | 21 |
| html | 21 |
| ObjectPascal | 21 |
| pascal | 21 |
| PL/SQL | 21 |
| Qbasic | 21 |
| Gibberish | 20 |
| xasm | 20 |
| #firstSevenLanguages | 19 |
| 0assembler | 19 |
| Korean | 19 |
| ObjC | 19 |
| PERL | 19 |
| Zassembly | 19 |
| 0asm | 18 |
| Rust | 18 |
| AmericanEnglish | 17 |
| Batch | 17 |
| Mathematica | 17 |
| RPG | 17 |
| Welsh | 17 |
| xASM | 17 |
| ? | 16 |
| 000assembly | 16 |
| AppleBASIC | 16 |
| bash | 16 |
| c++ | 16 |
| CommodoreBasic | 16 |
| Danish | 16 |
| german | 16 |
| Groovy | 16 |
| Hindi | 16 |
| OCaml | 16 |
| QuickBASIC | 16 |
| SAS | 16 |
| SinclairBasic | 16 |
| TCL | 16 |
| Visualbasic | 16 |
| 000Assembly | 15 |
| awk | 15 |
| Cat | 15 |
| Ebonics | 15 |
| FORTH | 15 |
| HTML/CSS/JS | 15 |
| Norwegian | 15 |
| PL/I | 15 |
| TSQL | 15 |
| Turkish | 15 |
| Algol | 14 |
| American | 14 |
| AMOS | 14 |
| Assemblyx | 14 |
| dBase | 14 |
| Emoji | 14 |
| Pascal/Delphi | 14 |
| Profanity | 14 |
| spanish | 14 |
| … | 13 |
| 0ASM | 13 |
| Awk | 13 |
| BASH | 13 |
| Cantonese | 13 |
| CoffeeScript | 13 |
| JCL | 13 |
| Love | 13 |
| Powershell | 13 |
| Swearing | 13 |
| TBD | 13 |
| xAssembler | 13 |
| ASMx | 12 |
| c# | 12 |
| Fangirl | 12 |
| HyperCard | 12 |
| Internet | 12 |
| PowerShell | 12 |
| ShellScript | 12 |
| SML | 12 |
| Tagalog | 12 |
| XSLT | 12 |
| ApplesoftBasic | 11 |
| assembler | 11 |
| CBasic | 11 |
| Czech | 11 |
| Eiffel | 11 |
| GML | 11 |
| MandarinChinese | 11 |
| memes | 11 |
| MIPSAssembly | 11 |
| My | 11 |
| sarcasm | 11 |
| SmallTalk | 11 |
| StandardML | 11 |
| VisualBASIC | 11 |
| AppleSoftBASIC | 10 |
| Babytalk | 10 |
| Binary | 10 |
| Caml | 10 |
| Catalan | 10 |
| CBASIC | 10 |
| Coldfusion | 10 |
| DOSBatch | 10 |
| Html | 10 |
| JAVA | 10 |
| mIRCScript | 10 |
| Processing | 10 |
| Tcl | 10 |
| Turing | 10 |
| VHDL | 10 |
| visualbasic | 10 |
| Z | 10 |
| | 9 |
| #English | 9 |
| actionscript | 9 |
| ActionScript0 | 9 |
| AmericanSignLanguage | 9 |
| AppleSoftBasic | 9 |
| asm | 9 |
| Elvish | 9 |
| empty | 9 |
| F# | 9 |
| Hypercard | 9 |
| Hypertalk | 9 |
| ModulaC | 9 |
| PostScript | 9 |
| REXX | 9 |
| Shellscript | 9 |
| sql | 9 |
| TRSBASIC | 9 |
| | 9 |
| Vietnamese | 9 |
| VisualBasic.NET | 9 |
| Python | 8 |
| 0Pascal | 8 |
| AWK | 8 |
| Comal | 8 |
| CommonLISP | 8 |
| dBaseIII | 8 |
| Elm | 8 |
| Esperanto | 8 |
| FORTRANIV | 8 |
| Golang | 8 |
| Irish | 8 |
| LaTeX | 8 |
| Maple | 8 |
| Miranda | 8 |
| mIRCscript | 8 |
| Occam | 8 |
| PowerBuilder | 8 |
| Rexx | 8 |
| sh | 8 |
| Spanglish | 8 |
| SpectrumBasic | 8 |
| SpectrumBASIC | 8 |
| TypeScript | 8 |
| Zasm | 8 |
| ??? | 7 |
| Java | 7 |
| | 7 |
| 0 | 7 |
| AmigaBasic | 7 |
| ARexx | 7 |
| ASL | 7 |
| Babbling | 7 |
| BourneShell | 7 |
| British | 7 |
| BrokenEnglish | 7 |
| C,C++ | 7 |
| Crying | 7 |
| Dog | 7 |
| EmacsLisp | 7 |
| Erlang | 7 |
| HTML&CSS | 7 |
| HTML+CSS | 7 |
| Hungarian | 7 |
| IDL | 7 |
| japanese | 7 |
| Java, | 7 |
| lisp | 7 |
| logo | 7 |
| Meme | 7 |
| Memes | 7 |
| MIPSassembly | 7 |
| OldEnglish | 7 |
| Parseltongue | 7 |
| Racket | 7 |
| ruby | 7 |
| Sass | 7 |
| Scratch | 7 |
| Southern | 7 |
| UCSDPascal | 7 |
| Ukrainian | 7 |
| Urdu | 7 |
| VBC++ | 7 |
| Verilog | 7 |
| x | 7 |
| xassembler | 7 |
| Yiddish | 7 |
| ZXBASIC | 7 |
| #Python | 6 |
| .NET | 6 |
| 000 | 6 |
| 000assembler | 6 |
| 000Assembler | 6 |
| AssemblyZ | 6 |
| AtariBasic | 6 |
| BadEnglish | 6 |
| BCPL | 6 |
| BlitzBasic | 6 |
| Bullcrap | 6 |
| Bullshit | 6 |
| C#.NET | 6 |
| C, | 6 |
| C+ | 6 |
| ClassicASP | 6 |
| Clojure | 6 |
| Coffeescript | 6 |
| Farsi | 6 |
| Food | 6 |
| Gaelic | 6 |
| GoodKorean | 6 |
| HTML/CSS/JavaScript | 6 |
| Js | 6 |
| KAssembly | 6 |
| Kotlin | 6 |
| LabVIEW | 6 |
| latin | 6 |
| LocomotiveBasic | 6 |
| Marathi | 6 |
| MatLab | 6 |
| Modula | 6 |
| MSXBasic | 6 |
| music | 6 |
| ObjectPascalDelphi | 6 |
| OldNorse | 6 |
| OPL | 6 |
| PICAssembly | 6 |
| PL/C | 6 |
| PLSQL | 6 |
| Pussycrushing | 6 |
| qbasic | 6 |
| REALbasic | 6 |
| realtimefeedback | 6 |
| russian | 6 |
| scheme | 6 |
| Scottish | 6 |
| Screeching | 6 |
| Silence | 6 |
| Slang | 6 |
| TBA | 6 |
| VbScript | 6 |
| VisualBasic0 | 6 |
| Xassembly | 6 |
| XAssembly | 6 |
| Zmachinecode | 6 |
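The table counts spellings such as “BASIC”, “Basic” and “basic” separately. One possible refinement, which is not part of the analysis above, would be to merge entries that differ only by case before counting. Here is a minimal sketch using a few rows copied from the table; the object names `counts` and `merged` are my own.

```r
library("dplyr")

# A few rows copied from the table above
counts <- data.frame(
  wordsgroup = c("BASIC", "Basic", "basic", "PHP", "php", "Php"),
  n = c(627L, 563L, 22L, 937L, 44L, 36L),
  stringsAsFactors = FALSE
)

# Merge case variants by grouping on the lower-cased name
merged <- counts %>%
  group_by(language = tolower(wordsgroup)) %>%
  summarize(n = sum(n)) %>%
  arrange(desc(n))

merged
# basic: 1212, php: 1017
```

Lower-casing is a blunt tool: it would not merge “JavaScript” with “JS” or “Assembly” with “Assembler”, which would need a manual lookup table.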