The data used in this project comes from a corpus called HC Corpora and was collected from publicly available sources by a web crawler. SwiftKey uses this kind of data to build smart keyboards that feature text generated by predictive text models.
The data contains the content of Twitter posts, blogs, and news articles, made available for applying data science in the area of natural language processing. The overall folder is 548.0 MB; here are the steps taken to download and read selected files. The download takes about 8 minutes.
# set a longer timeout
options(timeout = 6000)
# assign the data link to a variable named url
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# create the destination folder if needed and set a destination path
if (!dir.exists("data")) dir.create("data")
dest <- file.path("data", "Coursera-SwiftKey.zip")
# launch the clock
s <- Sys.time()
# download the data
downloader::download(url = url, destfile = dest)
e <- Sys.time()
# check the time taken to download
e - s
In particular, the Coursera-SwiftKey.zip folder contains locale data in English, German, Russian, and Finnish, stored in a sub-folder named final.
This project will use the English database.
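The listing below can be produced in more than one way; here is a minimal sketch, assuming the archive contents are listed with unzip() and filtered to the English folder.
# list the files inside the zip archive without extracting it
contents <- unzip(dest, list = TRUE)
# keep only the English (en_US) entries
contents[grepl("final/en_US/", contents$Name), ]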
Name Length Date
1 final/en_US/ 0 2014-07-22 10:10:00
2 final/en_US/en_US.twitter.txt 167105338 2014-07-22 10:12:00
3 final/en_US/en_US.news.txt 205811889 2014-07-22 10:13:00
4 final/en_US/en_US.blogs.txt 210160014 2014-07-22 10:13:00
To see the first line of the en_US.twitter.txt file, the readLines() function is used.
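A possible way to read that single line straight from the archive is sketched below; the explicit connection handling is an assumption about how it was done.
# open a text connection into the zip archive and read one line
con <- unz(dest, "final/en_US/en_US.twitter.txt")
open(con, "rt")
readLines(con, n = 1)
close(con)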
[1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
Here is a function to read random lines from one of the selected sets.
For practice, the function accesses the .zip folder, extracts the requested information, and returns a data frame with two columns, text and type, so that all sets can later be merged by rows of random lines.
It can take a moment to return the result since it reads from the raw data; this is something that would need to be improved before it can be used in a Shiny app for text prediction.
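The exact implementation of randomRead() is not reproduced here; a minimal sketch, assuming it reads straight from the zip archive and samples whole lines, could look like this.
randomRead <- function(data, file, n = 1) {
  # open a read-only text connection into the zip archive without extracting it
  con <- unz(data, file)
  open(con, "rt")
  lines <- readLines(con, skipNul = TRUE)
  close(con)
  # derive the source type ("blogs", "news", or "twitter") from the file name
  type <- sub("^en_US\\.([a-z]+)\\.txt$", "\\1", basename(file))
  # sample n random lines and label them with their source type
  data.frame(text = sample(lines, n), type = type, stringsAsFactors = FALSE)
}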
randomRead(data = "data/Coursera-SwiftKey.zip",
           file = "final/en_US/en_US.blogs.txt",
           n = 2)
text type
1 We love you Mr. Brown. blogs
The objective of this project is to give the user the option to select one or more types of source (twitter, blogs, or news) and to visualize randomly drawn text. Further work will involve tokenization of the sentences to extract only certain types of words and special signs such as the hashtag (#) sign.
To introduce what will be shown, here are some visualizations of the data; these will be automated by loading the sets into the environment, to be used for the rest of the project.
All three data sets take about 50 seconds to load; the objects are then saved as .RData files, ready to be reused by loading them back into the environment.
The blogs set has about 900,000 rows, the twitter set about 2.4 million rows, and the news set about 1 million lines.
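A sketch of how the sets might be loaded and cached follows; the extraction step, file paths, and object names are assumptions.
# extract the archive once, then read each set as a character vector
unzip(dest, exdir = "data")
blogs   <- readLines("data/final/en_US/en_US.blogs.txt",   skipNul = TRUE, encoding = "UTF-8")
news    <- readLines("data/final/en_US/en_US.news.txt",    skipNul = TRUE, encoding = "UTF-8")
twitter <- readLines("data/final/en_US/en_US.twitter.txt", skipNul = TRUE, encoding = "UTF-8")
# cache the objects as an .RData image so later steps can simply load() them
save(blogs, news, twitter, file = "data/en_US_sets.RData")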
Here is a table summarizing the number of rows and the maximum string length (in characters) of each set.
     data    rows max_string_length
1   blogs  899288             40833
2 twitter 2360148               140
3    news 1010242             11384
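The summary above can be computed along these lines (a sketch, assuming the three character vectors loaded earlier).
# number of rows and longest string per set
data.frame(
  data = c("blogs", "twitter", "news"),
  rows = c(length(blogs), length(twitter), length(news)),
  max_string_length = c(max(nchar(blogs)), max(nchar(twitter)), max(nchar(news)))
)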
An interesting finding comes from counting how many of the available tweets contain the hashtag (#) sign.
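One possible way to obtain the count (the object name twitter is assumed):
# count the tweets that contain at least one # character
sum(grepl("#", twitter, fixed = TRUE))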
[1] 214369
Only 9% of the available tweets contain a hashtag!
\[\frac{\text{number of rows containing the \# sign}}{\text{total number of rows}} \times 100\]
[1] 9.082863
Tokenization is used to identify the appropriate tokens such as words, punctuation, or numbers. Two functions are used; each takes a file as input and returns a tokenized version of the sampled strings:
firstToken()
lastToken()
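Their implementations are not reproduced here; hypothetical sketches, working on a character vector of sampled lines rather than a file, could look like this.
firstToken <- function(lines) {
  # keep the first whitespace-delimited token of each line
  data.frame(firstToken = sub("\\s.*$", "", trimws(lines)), stringsAsFactors = FALSE)
}
lastToken <- function(lines) {
  # keep the last whitespace-delimited token of each line
  data.frame(lastToken = sub("^.*\\s", "", trimws(lines)), stringsAsFactors = FALSE)
}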
firstToken
1 In
2 We
3 Chad
lastToken
1 “gods”.
2 Brown.
3 him.
Here is a random result with the output of the two functions joined into one data frame. It shows the first and last word of three random lines, along with their row identification numbers.
firstToken lastToken
747958 but just
74522 Have relationship.
260593 There on.
One more tool, used to remove profanity and other words that should not be included in the prediction algorithm, is the profanityRead() function. It applies readLines() to the bad-words database and returns a list used to filter out these words.
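Its behavior is only summarized above; a hypothetical sketch, assuming a local copy of a bad-words list at data/bad-words.txt and that the function reports the bad words found in n sampled lines, is shown here.
profanityRead <- function(data, n = 1) {
  # read the bad-words list (local path assumed for illustration)
  bad_words <- readLines("data/bad-words.txt", warn = FALSE)
  # sample n lines, split them into lowercase tokens, and keep the tokens found in the list
  tokens <- unlist(strsplit(tolower(sample(data, n)), "[^a-z']+"))
  data.frame(bad = intersect(tokens, bad_words), stringsAsFactors = FALSE)
}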
set.seed(2222)
profanityRead(twitter, n = 1)
bad
1 fight
Finally, the objective of this capstone is to build a predictive algorithm, embedded in a Shiny app, to be used on strings of text.
It will provide visualizations of the statistics for selected data and a word cloud of the most frequent words in the text.
Other additional features are still to be tested.