Courera Data Science Capstone: https://www.coursera.org/learn/data-science-project/
Data Source: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
This exercise uses the files named LOCALE.blogs.txt where LOCALE is the each of the four locales en_US, de_DE, ru_RU and fi_FI. The data is from a corpus called HC Corpora. See the About the Corpora reading for more details. The files have been language filtered but may still contain some foreign text.
1. The en_US.blogs.txt file is how many megabytes?
filepath <- "D:/R/data/capstone/en_US/en_US.blogs.txt"
file.info(filepath)[,1]/1024/1000
## [1] 205.2344
2. The en_US.twitter.txt has how many lines of text?
filepath <- "D:/R/data/capstone/en_US/en_US.twitter.txt"
con <- file(filepath)
length(readLines(con))
## [1] 2360148
close(con)
3. What is the length of the longest line seen in any of the three en_US data sets?
folder <- "D:/R/data/capstone/en_US/"
filelist <- list.files(path=folder)
l <- lapply(paste0(folder, filelist), function(filepath) {
size <- file.info(filepath)[1]/1024/1000
con <- file(filepath, open="r")
lines <- readLines(con)
nchars <- lapply(lines, nchar)
maxchars <- max(unlist(nchars))
nwords <- sum(sapply(strsplit(lines, "\\s+"), length))
close(con)
return(c(filepath, format(round(size, 2), nsmall=2), length(lines), maxchars, nwords))
})
df <- data.frame(matrix(unlist(l), nrow=3, byrow=TRUE))
names(df) <- c('file name', 'size(MB)', 'entries', 'longest line', 'total words')
df
## file name size(MB) entries longest line
## 1 D:/R/data/capstone/en_US/en_US.blogs.txt 205.23 899288 40835
## 2 D:/R/data/capstone/en_US/en_US.news.txt 200.99 77259 5760
## 3 D:/R/data/capstone/en_US/en_US.twitter.txt 163.19 2360148 213
## total words
## 1 37334441
## 2 2643972
## 3 30373792
4. In the en_US twitter data set, if you divide the number of lines where the word “love” (all lowercase) occurs by the number of lines the word “hate” (all lowercase) occurs, about what do you get?
filepath <- "D:/R/data/capstone/en_US/en_US.twitter.txt"
con <- file(filepath)
lines <- readLines(con)
loves <- lapply(lines, function(line){grepl('love', line, ignore.case=FALSE)})
hates <- lapply(lines, function(line){grepl('hate', line, ignore.case=FALSE)})
close(con)
loves <- sum(unlist(loves))
hates <- sum(unlist(hates))
loves; hates; loves/hates
## [1] 90956
## [1] 22138
## [1] 4.108592
5. The one tweet in the en_US twitter data set that matches the word “biostats” says what?
filepath <- "D:/R/data/capstone/en_US/en_US.twitter.txt"
con <- file(filepath)
lines <- readLines(con)
matches <- lapply(lines, function(line){grepl('biostats', line, ignore.case=FALSE)})
close(con)
result <- lines[unlist(matches)]; result
## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"
6. How many tweets have the exact characters “A computer once beat me at chess, but it was no match for me at kickboxing”. (I.e. the line matches those characters exactly.)
filepath <- "D:/R/data/capstone/en_US/en_US.twitter.txt"
con <- file(filepath)
lines <- readLines(con)
matches <- lapply(lines, function(line){grepl('A computer once beat me at chess, but it was no match for me at kickboxing', line, ignore.case=FALSE)})
close(con)
result <- lines[unlist(matches)]; result; length(result)
## [1] "A computer once beat me at chess, but it was no match for me at kickboxing"
## [2] "A computer once beat me at chess, but it was no match for me at kickboxing"
## [3] "A computer once beat me at chess, but it was no match for me at kickboxing"
## [1] 3
The corpora are collected from publicly available sources by a web crawler. The crawler checks for language, so as to mainly get texts consisting of the desired language*.
Each entry is tagged with it’s date of publication. Where user comments are included they will be tagged with the date of the main entry.
Each entry is tagged with the type of entry, based on the type of website it is collected from (e.g. newspaper or personal blog) If possible, each entry is tagged with one or more subjects based on the title or keywords of the entry (e.g. if the entry comes from the sports section of a newspaper it will be tagged with “sports” subject).In many cases it’s not feasible to tag the entries (for example, it’s not really practical to tag each individual Twitter entry, though I’ve got some ideas which might be implemented in the future) or no subject is found by the automated process, in which case the entry is tagged with a ‘0’.
To save space, the subject and type is given as a numerical code.
Once the raw corpus has been collected, it is parsed further, to remove duplicate entries and split into individual lines. Approximately 50% of each entry is then deleted. Since you cannot fully recreate any entries, the entries are anonymised and this is a non-profit venture I believe that it would fall under Fair Use.
tagesspiegel.de 2010/12/03 1 7 Er ist weder ein Abzocker noch ein Ausbeuter, er ist kein Betrger, er haut niemanden in die Pfanne oder betrgt ihn um seinen gerechten Anteil, er steht zu seinem Wort und erfllt seine Vertrge sinngem und feilscht nicht wegen irgendwelcher Lcken im Maschendraht des Kleingedruckten der Vertrge.spiegel.de 2010/11/30 1 1,6 Diplomaten sehen Clintons Direktive als Besttigung einer alten Regel: Diezeit.de 2009/10/22 1 2,10 Warum schaffen wir nicht eine Whrung, die diese Aufgaben erfllt anstatt den Forderungen der Geldwirtschaft hinterherzuhecheln, die niemals erfllt werden knnen?
You may still find lines of entirely different languages in the corpus. There are 2 main reasons for that:1. Similar languages. Some languages are very similar, and the automatic language checker could therefore erroneously accept the foreign language text.2. “Embedded” foreign languages. While a text may be mainly in the desired language there may be parts of it in another language. Since the text is then split up into individual lines, it is possible to see entire lines written in a foreign language.Whereas number 1 is just an out-and-out error, I think number 2 is actually desirable, as it will give a picture of when foreign language is used within the main language.
Content archived from heliohost.org on September 30, 2016 and retrieved via Wayback Machine on April 24, 207. https://web-beta.archive.org/web/20160930083655/http://www.corpora.heliohost.org/aboutcorpus.html