When I send a conference announcement to a list of e-mail addresses of friends and colleagues, I usually receive several replies saying that the message could not be delivered. I would like to clean my list and delete all addresses that are no longer reachable.
Select the e-mails with undeliverable responses (or any other group of e-mails) and use the File|Save as... menu option in Outlook to save them as a text file. The headers (and sometimes the bodies) are conveniently saved in one file.
lfn <- "emails2.txt"
txt <- readLines(lfn)
txt[1:12]
## [1] "From:\tMail Delivery System [MAILER-DAEMON@relay.nib.si]"
## [2] "To:\thassank@namru3.med.navy.mil"
## [3] "Sent:\t21. april 2015 12:51"
## [4] "Subject:\tUndeliverable: [AS2015] Applied Statistics 2015 Conference"
## [5] ""
## [6] "From:\tMail Delivery System [MAILER-DAEMON@relay.nib.si]"
## [7] "To:\tr.heuberger@iccr-international.org"
## [8] "Sent:\t21. april 2015 12:51"
## [9] "Subject:\tUndeliverable: [AS2015] Applied Statistics 2015 Conference"
## [10] ""
## [11] "From:\tMail Delivery System [MAILER-DAEMON@relay.nib.si]"
## [12] "To:\tnino.rode@fsd.si; mt@iccr-international.org"
Each message has the recipient's e-mail address in the line starting with the text “To:” (followed by a tab character), which corresponds to the To: field. In the case of multiple recipients, the addresses may be printed on several lines, delimited by ;. The next line after the addresses corresponds to the Sent: field (it starts with this keyword and a tab character).
The addresses are scattered through the file, so first we locate the To: and Sent: lines. Each block of addresses runs from a To: line to the line just before the following Sent: line.
start <- grep("^To:\t",txt)   # anchor at line start so body text cannot match
start
## [1] 2 7 12 17 22 27 32 37 42 47 52 57
end <- grep("^Sent:\t",txt)-1   # last address line is the one before Sent:
end
## [1] 2 7 12 17 22 27 32 37 42 47 52 57
Extract only the relevant lines:
# Map always returns a list with one element per message, whatever the block lengths
mytxt <- Map(function(s,e) txt[s:e],start,end)
txt <- unlist(mytxt)
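The sample above happens to contain only single-line To: fields, so the following minimal sketch runs the same extraction on a toy input where one To: field is wrapped over two lines. The addresses are made up, and the way the wrapped line looks is an assumption based on the description above, not a capture of real Outlook output. The `demo_*` names are hypothetical.

```r
# Toy input (made-up addresses); the second message has a To: field on two lines
demo <- c("From:\tMail Delivery System",
          "To:\ta@example.org",
          "Sent:\t21. april 2015 12:51",
          "Subject:\tUndeliverable",
          "",
          "From:\tMail Delivery System",
          "To:\tb@example.org; c@example.org;",
          "d@example.org",
          "Sent:\t21. april 2015 12:52",
          "Subject:\tUndeliverable")

demo_start <- grep("^To:\t", demo)        # first line of each address block
demo_end   <- grep("^Sent:\t", demo) - 1  # last line of each address block

# Every To: line must pair with a later Sent: line, otherwise the ranges are wrong
stopifnot(length(demo_start) == length(demo_end), all(demo_start <= demo_end))

# One list element per message, then flatten to a character vector
demo_blocks <- Map(function(s, e) demo[s:e], demo_start, demo_end)
demo_to <- unlist(demo_blocks)
print(demo_to)
```

The `stopifnot()` check is a cheap guard: if a saved message lacks a Sent: line, the two index vectors no longer line up and the extraction stops instead of silently pairing the wrong lines.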
The first group of e-mails:
head(txt)
## [1] "To:\thassank@namru3.med.navy.mil"
## [2] "To:\tr.heuberger@iccr-international.org"
## [3] "To:\tnino.rode@fsd.si; mt@iccr-international.org"
## [4] "To:\tGerd.Beidernikl@zbw.at; Gisela.Andersson@integrationsverket.se"
## [5] "To:\tstadlober@stat.tu-graz.ac.at"
## [6] "To:\tBojan.Leskosek@sp.uni-lj.si"
Now we can delete all spaces and the keyword To: (together with its tab character).
txt <- gsub(" ","",txt)
txt <- gsub("To:\\t","",txt)
head(txt)
## [1] "hassank@namru3.med.navy.mil"
## [2] "r.heuberger@iccr-international.org"
## [3] "nino.rode@fsd.si;mt@iccr-international.org"
## [4] "Gerd.Beidernikl@zbw.at;Gisela.Andersson@integrationsverket.se"
## [5] "stadlober@stat.tu-graz.ac.at"
## [6] "Bojan.Leskosek@sp.uni-lj.si"
Finally, we use the semicolon to split multiple addresses and build the final list:
emails <- unlist(strsplit(txt,";"))
head(emails)
## [1] "hassank@namru3.med.navy.mil"
## [2] "r.heuberger@iccr-international.org"
## [3] "nino.rode@fsd.si"
## [4] "mt@iccr-international.org"
## [5] "Gerd.Beidernikl@zbw.at"
## [6] "Gisela.Andersson@integrationsverket.se"
length(emails)
## [1] 14
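Before saving, it may be worth normalising the addresses and dropping duplicates, since the same address can bounce more than once. The sketch below works on a made-up stand-in vector (`emails_demo`, hypothetical) rather than on the real `emails`. The pattern is only a loose plausibility check, not a full address validator, and lowercasing assumes the mail servers involved treat addresses case-insensitively.

```r
# Stand-in for the 'emails' vector (made-up values, including two bad entries)
emails_demo <- c("A@Example.org", "a@example.org", "b@example.org",
                 "not-an-address", "")

# Trim stray whitespace, lowercase, and drop duplicates
cleaned <- unique(tolower(trimws(emails_demo)))

# Loose check: something@something.something with no spaces or extra @
looks_ok <- grepl("^[^@[:space:]]+@[^@[:space:]]+\\.[^@[:space:]]+$", cleaned)

valid      <- cleaned[looks_ok]    # addresses to keep working with
suspicious <- cleaned[!looks_ok]   # entries to inspect by hand
print(valid)
print(suspicious)
```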
Save them to a text file for further use, then delete them from the source list. Because of append=TRUE, repeated runs add to the existing file rather than overwrite it.
cat(emails,file="wrong-e-mails.txt",sep="\n",append=TRUE)
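Deleting the bounced addresses from the source list can also be done in R. The following is a minimal sketch, assuming the source list is a plain text file with one address per line. That file layout, the temporary file, and the names `list_file`, `bounced`, `all_emails` and `keep` are all assumptions made for illustration.

```r
# Hypothetical source list, one address per line (a temp file for this demo)
list_file <- tempfile(fileext = ".txt")
writeLines(c("a@example.org", "B@example.org", "c@example.org"), list_file)

bounced <- c("b@example.org")   # stand-in for the 'emails' vector

all_emails <- readLines(list_file)

# Compare case-insensitively, but keep the original spelling of what remains
keep <- all_emails[!(tolower(all_emails) %in% tolower(bounced))]

# Overwrite the list with the remaining addresses
writeLines(keep, list_file)
print(keep)
```

With a real list it seems safer to write the result to a new file first and compare it with the original before overwriting anything.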
With obvious modifications, this method can be used to extract e-mail addresses from any text file of printed e-mails (or from similarly structured files).
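One possible way to make such modifications easier is to wrap the steps in a function that takes the header keywords as arguments. The helper below, `extract_addresses()`, is a hypothetical sketch and not part of any package. It assumes each field starts a line as `keyword:` followed by a tab, that `field` and `next_field` contain no regular-expression metacharacters, and that the addresses are separated by semicolons. The toy messages are made up.

```r
# Hypothetical helper: collect the addresses found between the 'field' line
# and the line just before the following 'next_field' line
extract_addresses <- function(lines, field = "To", next_field = "Sent") {
  start <- grep(paste0("^", field, ":\t"), lines)
  end   <- grep(paste0("^", next_field, ":\t"), lines) - 1
  stopifnot(length(start) == length(end), all(start <= end))

  block <- unlist(Map(function(s, e) lines[s:e], start, end))
  block <- sub(paste0("^", field, ":\t"), "", block)   # drop the keyword
  block <- gsub("[[:space:]]", "", block)              # drop all whitespace

  out <- as.character(unlist(strsplit(block, ";", fixed = TRUE)))
  out[nzchar(out)]                                     # drop empty pieces
}

# Toy messages (made-up addresses)
msgs <- c("From:\tx@example.org",
          "To:\ta@example.org; b@example.org",
          "Sent:\t21. april 2015 12:51",
          "Subject:\tHello",
          "",
          "From:\ty@example.org",
          "To:\tc@example.org",
          "Sent:\t21. april 2015 12:52",
          "Subject:\tHello again")

to_addresses   <- extract_addresses(msgs)
from_addresses <- extract_addresses(msgs, field = "From", next_field = "To")
print(to_addresses)
print(from_addresses)
```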