Intro

The idea is to use “#firstsevenlanguages” and “#first7languages” tweets as an example of text analysis, starting with querying the Twitter API with the rtweet package, then cleaning the tweets a bit.

The initial goal of the “#firstsevenlanguages” and “#first7languages” tweets was to give a short description of the first 7 human/computer languages learnt by their authors. I want to have a look at the most frequent ones.

library("rtweet")
library("dplyr")
library("stringr")

Getting tweets

If one follows the instructions given in the repository of the rtweet package, one does not have to re-enter one's tokens each time.

# search each hashtag separately, then stack the results
first7languages <- search_tweets(q = "#firstsevenlanguages", count = 100000, type = "recent")
first7languages <- rbind(first7languages,
                         search_tweets(q = "first7languages", count = 100000, type = "recent"))
# drop duplicated rows and keep only tweets tagged as English
first7languages <- unique(first7languages)
first7languages <- filter(first7languages, lang == "en")
first7languages

## # A tibble: 9,048 x 26
##    created_at          status_id retweet_count favorite_count
##        <time>              <chr>         <int>          <int>
## 1        <NA> 765144346660405248           310            389
## 2        <NA> 765467037288239104           311              0
## 3        <NA> 765466942396305408             1              0
## 4        <NA> 765466842425073664             0              0
## 5        <NA> 765466615387480064             0              0
## 6        <NA> 765466135751950336             0              0
## 7        <NA> 765466003220332544             0              0
## 8        <NA> 765465958731419648           310              0
## 9        <NA> 765465819866279936             3              0
## 10       <NA> 765465811515478016            14              0
## # ... with 9,038 more rows, and 22 more variables: text <chr>,
## #   in_reply_to_status_id <chr>, in_reply_to_user_id <chr>,
## #   is_quote_status <lgl>, quoted_status_id <chr>, lang <chr>,
## #   user_id <chr>, user_mentions <list>, hashtags <list>, urls <list>,
## #   is_retweet <lgl>, retweet_status_id <chr>, place_name <chr>,
## #   country <chr>, long1 <dbl>, long2 <dbl>, long3 <dbl>, long4 <dbl>,
## #   lat1 <dbl>, lat2 <dbl>, lat3 <dbl>, lat4 <dbl>

Cleaning

The goal is to get each language on its own, which is hard because the tweets separate them in different ways. Moreover, this code does not completely get rid of 1) spam tweets and 2) parts of the “#firstsevenlanguages” tweets that are not languages.

The parsing is tricky because each tweet can use different separators between languages, e.g. slashes only, or commas plus a slash inside a combined item (HTML/CSS, second language, third language, etc.). So I’ll try to guess which character is the separator by counting the number of slashes, commas, semicolons, etc. and selecting the most frequent one in each tweet.
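To make the guessing step concrete, here is a minimal sketch on a single tweet. The tweet text and the names `candidates`, `counts` and `best` are made up for illustration, and only a subset of the separators is considered; the full pipeline in this section does the same thing for every tweet.

```r
library("stringr")

# A few candidate separators (assumption: a subset of those used in this post)
candidates <- c(comma = ",", semicolon = ";", slash = "/", newline = "\n")

# Hypothetical tweet, invented for this example
tweet <- "BASIC, Pascal, C, C++, Java, HTML/CSS, Python"

# Count how often each candidate appears in the tweet
counts <- vapply(candidates, function(sep) str_count(tweet, sep), integer(1))

# The most frequent candidate is taken as the separator
best <- candidates[[which.max(counts)]]

# Split on it and trim the surrounding spaces
languages <- str_trim(str_split(tweet, best)[[1]])
languages
```

Here the comma wins with 6 occurrences against 1 for the slash, so "HTML/CSS" stays together as a single item.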

If someone used spaces, then I won’t be able to parse the tweet. Hopefully it was a rare occurrence (it’s quite hard to read the tweet as a human being when there are only spaces as separators!).

Moreover, a tweet can be in English with a part in Swedish (“#sjuförstajobben”) or in Chinese, so these parts cannot be analyzed.

separators <- c("\\?", "\\.", "\\!", "\\;",
                "\\,", "\\\n", "\\/", "[1-7]")


first7languages <- first7languages %>%
    # no link in the tweet
    filter(is.na(urls)) %>%
    select(status_id, text) %>%
    group_by(status_id) %>%
    # count the potential separators, in the same order as in `separators`
    mutate(no_interrogation = str_count(text, "\\?")) %>%
    mutate(no_point = str_count(text, "\\.")) %>%
    mutate(no_exclamation = str_count(text, "\\!")) %>%
    mutate(no_semicolon = str_count(text, "\\;")) %>%
    mutate(no_comma = str_count(text, "\\,")) %>%
    mutate(no_newline = str_count(text, "\\\n")) %>%
    mutate(no_slash = str_count(text, "\\/")) %>%
    mutate(no_number = str_count(text, "[1-7]")) %>%
    # number of occurrences of the most frequent separator
    mutate(max_separator = max(c(no_interrogation,
                                 no_point,
                                 no_exclamation,
                                 no_semicolon,
                                 no_comma,
                                 no_newline,
                                 no_slash,
                                 no_number))) %>%
    # keep tweets whose separator count is plausible for a list of 7 items
    filter(max_separator >= 6) %>%
    filter(max_separator <= 8) %>%
    # the most frequent separator is the one used for splitting
    mutate(separator = separators[order(c(no_interrogation,
                                          no_point,
                                          no_exclamation,
                                          no_semicolon,
                                          no_comma,
                                          no_newline,
                                          no_slash,
                                          no_number),
                                        decreasing = TRUE)[1]]) %>%
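    # no numbers: remove each digit from 1 to 9 plus the character after it
    mutate(text = gsub("[1-9].", "", text)) %>%
    # no parentheses
    mutate(text = gsub("[\\(\\)]", "", text)) %>%
    # no username: remove the text starting at the first @
    mutate(text = gsub("\\@.*", "", text)) %>%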
    # no RT part
    filter(!grepl("RT \\@", text)) %>%
    # don't keep the polluting hashtags
    filter(!grepl("\\#NameofFirstPet", text)) %>%
    mutate(text = gsub("\\#firstsevenlanguages", "", text)) %>%
    mutate(text = gsub("\\#FirstSevenlanguages", "", text)) %>%
    mutate(text = gsub("\\#FirstSevenLanguages", "", text))  %>%
    mutate(text = gsub("\\#firstobs", "", text)) %>%
    filter(!grepl("\\#MothersMaidenName", text)) %>%
    filter(!grepl("\\#HighSchoolMascot", text)) %>%
    # no polluting sentence "Can we get these trending too"
    filter(!grepl("Can we get these trending too", text)) %>%
    # no hyphens, then collapse double spaces
    mutate(text = gsub("-", " ", text)) %>%
    mutate(text = gsub("  ", " ", text)) %>%
    # no empty
    filter(text != "") %>%
    filter(text != " ") %>%
    filter(text != "RT") %>%
    filter(text != "RT ") %>%
    # split on the guessed separator in order to separate the languages
    do(str_split(.$text, .$separator) %>%
         unlist %>%
         tibble(wordsgroup = .)) %>%
    ungroup() %>%
    # no emojis
    mutate(wordsgroup = gsub("<.*>", "", wordsgroup)) %>%
    # no empty
    filter(wordsgroup != "") %>%
    filter(wordsgroup != " ") %>%
    filter(wordsgroup != "RT") %>%
    filter(wordsgroup != "RT ") %>%
    filter(!grepl("\\:", wordsgroup)) %>%
    group_by(status_id) %>%
    # position of each language within its tweet
    mutate(rank = 1:n()) %>%
    # remove remaining new lines
    mutate(wordsgroup = gsub("\\\n", "", wordsgroup)) %>%
    ungroup()
first7languages

## # A tibble: 32,014 x 3
##             status_id         wordsgroup  rank
##                 <chr>              <chr> <int>
## 1  764914996761415681                  C     1
## 2  764914996761415681                C++     2
## 3  764914996761415681                 C#     3
## 4  764914996761415681        Objective C     4
## 5  764914996761415681         x Assembly     5
## 6  764914996761415681 ARM Thumb assembly     6
## 7  764914996761415681               Java     7
## 8  764915140508561408              BASIC     1
## 9  764915140508561408                  C     2
## 10 764915140508561408               lisp     3
## # ... with 32,004 more rows
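
The `x Assembly` entry in the output above looks like a side effect of the digit-removal step, `gsub("[1-9].", "", text)`, which deletes each digit from 1 to 9 together with whatever character follows it. A minimal sketch of that effect, assuming the original tweets contained names such as the ones below (these inputs are my guess, not taken from the data):

```r
# Hypothetical inputs; the real tweets may have been written differently
examples <- c("x86 Assembly", "6502 Assembly", "Z80 Assembly", "68000 assembly")

# Same pattern as in the cleaning pipeline: a digit 1-9 and the next character
cleaned <- gsub("[1-9].", "", examples)
cleaned
# → "x Assembly"   "0Assembly"    "Z Assembly"   "000 assembly"
```

Zeros are not matched by `[1-9]`, so they survive, and the space after a trailing digit is removed along with it. Once spaces are stripped, strings like these become `xAssembly`, `0Assembly`, `ZAssembly` and `000assembly`.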

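Now let’s try to summarize a bit: I count how often each language appears, ignoring spaces, and keep the ones that appear more than 5 times.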

first7languages %>% 
  group_by(wordsgroup = gsub(" ", "", wordsgroup)) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  filter(n > 5) %>%
  knitr::kable()

wordsgroup	n
C	1554
Java	1184
English	1145
C++	1102
Pascal	971
PHP	937
JavaScript	717
Python	716
BASIC	627
Basic	563
French	540
C#	503
Perl	487
VisualBasic	454
Ruby	439
Spanish	417
HTML	399
German	349
Javascript	317
Logo	283
SQL	272
ObjectiveC	252
Fortran	232
Assembly	228
Latin	208
TurboPascal	187
Scheme	159
C/C++	156
Swift	153
ActionScript	148
QBasic	143
CSS	136
Lisp	133
Prolog	133
Delphi	128
Italian	128
Haskell	123
Sarcasm	119
Bash	115
VB	115
Japanese	112
Russian	107
FORTRAN	101
COBOL	100
Assembler	78
HTML/CSS	76
VBScript	76
LOGO	70
Smalltalk	70
Forth	68
Swedish	65
HyperTalk	63
R	63
QBASIC	61
Go	60
Dutch	59
ColdFusion	56
Actionscript	53
Greek	53
0Assembly	52
ASP	50
Matlab	50
Cobol	49
JS	49
Chinese	47
Clipper	47
0assembly	46
Ada	45
BBCBasic	45
LISP	45
Portuguese	45
Lingo	44
php	44
0Assembler	42
APL	42
Arabic	42
QuickBasic	42
english	41
Lua	39
xassembly	39
xAssembly	38
perl	37
assembly	36
BBCBASIC	36
CommonLisp	36
Finnish	36
GWBasic	36
Php	36
Shell	36
AppleScript	35
Mandarin	34
Hebrew	33
python	33
GWBASIC	31
javascript	31
Music	31
ASM	30
ML	30
c	29
java	29
MATLAB	29
PigLatin	29
ZAssembly	29
PASCAL	28
Scala	28
TIBASIC	27
AncientGreek	26
AppleBasic	26
TIBasic	26
VBA	26
FoxPro	25
CommodoreBASIC	24
french	24
Polish	24
.	23
Klingon	23
ZAssembler	23
ApplesoftBASIC	22
Baby	22
basic	22
SinclairBASIC	22
Zassembler	22
Body	21
Elixir	21
html	21
ObjectPascal	21
pascal	21
PL/SQL	21
Qbasic	21
Gibberish	20
xasm	20
#firstSevenLanguages	19
0assembler	19
Korean	19
ObjC	19
PERL	19
Zassembly	19
0asm	18
Rust	18
AmericanEnglish	17
Batch	17
Mathematica	17
RPG	17
Welsh	17
xASM	17
?	16
000assembly	16
AppleBASIC	16
bash	16
c++	16
CommodoreBasic	16
Danish	16
german	16
Groovy	16
Hindi	16
OCaml	16
QuickBASIC	16
SAS	16
SinclairBasic	16
TCL	16
Visualbasic	16
000Assembly	15
awk	15
Cat	15
Ebonics	15
FORTH	15
HTML/CSS/JS	15
Norwegian	15
PL/I	15
TSQL	15
Turkish	15
Algol	14
American	14
AMOS	14
Assemblyx	14
dBase	14
Emoji	14
Pascal/Delphi	14
Profanity	14
spanish	14
…	13
0ASM	13
Awk	13
BASH	13
Cantonese	13
CoffeeScript	13
JCL	13
Love	13
Powershell	13
Swearing	13
TBD	13
xAssembler	13
ASMx	12
c#	12
Fangirl	12
HyperCard	12
Internet	12
PowerShell	12
ShellScript	12
SML	12
Tagalog	12
XSLT	12
ApplesoftBasic	11
assembler	11
CBasic	11
Czech	11
Eiffel	11
GML	11
MandarinChinese	11
memes	11
MIPSAssembly	11
My	11
sarcasm	11
SmallTalk	11
StandardML	11
VisualBASIC	11
AppleSoftBASIC	10
Babytalk	10
Binary	10
Caml	10
Catalan	10
CBASIC	10
Coldfusion	10
DOSBatch	10
Html	10
JAVA	10
mIRCScript	10
Processing	10
Tcl	10
Turing	10
VHDL	10
visualbasic	10
Z	10
	9
#English	9
actionscript	9
ActionScript0	9
AmericanSignLanguage	9
AppleSoftBasic	9
asm	9
Elvish	9
empty	9
F#	9
Hypercard	9
Hypertalk	9
ModulaC	9
PostScript	9
REXX	9
Shellscript	9
sql	9
TRSBASIC	9
Twitter	9
Vietnamese	9
VisualBasic.NET	9
Python	8
0Pascal	8
AWK	8
Comal	8
CommonLISP	8
dBaseIII	8
Elm	8
Esperanto	8
FORTRANIV	8
Golang	8
Irish	8
LaTeX	8
Maple	8
Miranda	8
mIRCscript	8
Occam	8
PowerBuilder	8
Rexx	8
sh	8
Spanglish	8
SpectrumBasic	8
SpectrumBASIC	8
TypeScript	8
Zasm	8
???	7
Java	7
	7
0	7
AmigaBasic	7
ARexx	7
ASL	7
Babbling	7
BourneShell	7
British	7
BrokenEnglish	7
C,C++	7
Crying	7
Dog	7
EmacsLisp	7
Erlang	7
HTML&CSS	7
HTML+CSS	7
Hungarian	7
IDL	7
japanese	7
Java,	7
lisp	7
logo	7
Meme	7
Memes	7
MIPSassembly	7
OldEnglish	7
Parseltongue	7
Racket	7
ruby	7
Sass	7
Scratch	7
Southern	7
UCSDPascal	7
Ukrainian	7
Urdu	7
VBC++	7
Verilog	7
x	7
xassembler	7
Yiddish	7
ZXBASIC	7
#Python	6
.NET	6
000	6
000assembler	6
000Assembler	6
AssemblyZ	6
AtariBasic	6
BadEnglish	6
BCPL	6
BlitzBasic	6
Bullcrap	6
Bullshit	6
C#.NET	6
C,	6
C+	6
ClassicASP	6
Clojure	6
Coffeescript	6
Farsi	6
Food	6
Gaelic	6
GoodKorean	6
HTML/CSS/JavaScript	6
Js	6
KAssembly	6
Kotlin	6
LabVIEW	6
latin	6
LocomotiveBasic	6
Marathi	6
MatLab	6
Modula	6
MSXBasic	6
music	6
ObjectPascalDelphi	6
OldNorse	6
OPL	6
PICAssembly	6
PL/C	6
PLSQL	6
Pussycrushing	6
qbasic	6
REALbasic	6
realtimefeedback	6
russian	6
scheme	6
Scottish	6
Screeching	6
Silence	6
Slang	6
TBA	6
VbScript	6
VisualBasic0	6
Xassembly	6
XAssembly	6
Zmachinecode	6
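
The table counts spelling variants separately, e.g. BASIC and Basic, or JavaScript and Javascript. One possible follow-up, which is not part of the analysis above, would be to lower-case the names before counting. Below is a minimal sketch on a made-up miniature table; `toy`, `toy_counts` and the `language` column are hypothetical names.

```r
library("dplyr")

# Hypothetical miniature version of the cleaned data
toy <- tibble(wordsgroup = c("BASIC", "Basic", "basic",
                             "JavaScript", "Javascript", "C"))

toy_counts <- toy %>%
  # remove spaces as in the summary above, then ignore case
  mutate(language = tolower(gsub(" ", "", wordsgroup))) %>%
  # count occurrences of each normalized name, most frequent first
  count(language, sort = TRUE)

toy_counts
```

This merges the three spellings of BASIC into a single row with a count of 3. It would not merge true synonyms such as JS and JavaScript, which would need a manual lookup table.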

First seven languages

M. Salmon

8 August 2016