The goal of this project is to develop a word prediction algorithm and a Shiny App for natural language processing (text mining and processing). We used the provided SwiftKey dataset and prepared this progress report based on our initial findings. The English text files containing text from blogs, news and Twitter are briefly examined. The work covered by this report is summarized in the sections that follow.
The R Markdown file and code written for the data analysis are available on GitHub. Please refer to the GitHub Link section for the R code used for data analysis and visualization.
The data used in this study come from the Data Science Capstone dataset, provided as part of the Data Science Specialization course on Coursera. The dataset consists of text files from blogs, news and Twitter in several languages: German, English, Russian and Finnish. For this project we consider only the English version.
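The archive can be downloaded and unpacked along the following lines (a minimal sketch; the URL is taken from the Data Source reference below, and the local file name is an arbitrary choice):

```r
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

# Download the zip archive and extract the per-language folders
download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
unzip("Coursera-SwiftKey.zip")
```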
The following text files are used for the study.
| Name | Name in Report | Description |
|---|---|---|
| en_US.blogs.txt | blog | A text file of blog-related text in US English. |
| en_US.news.txt | news | A text file of news-related text in US English. |
| en_US.twitter.txt | twitter | A text file of tweets from Twitter in US English. |
Some sample texts from each text file are shown below:
Sample text from blogs: In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”., We love you Mr. Brown.
Sample text from news: He wasn’t home alone, apparently., The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s.
Sample text from twitter: How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long., When you meet someone special… you’ll know. Your heart will beat more rapidly and you’ll smile for no reason.
The text files contain a large amount of data. The table below shows the file size, total number of lines, total characters and total words in each text file. Each file contains more than thirty million words.
| Data Source | File Size (MB) | Total Lines | Total Characters | Total Words |
|---|---|---|---|---|
| blog | 255.4 | 899288 | 206824505 | 37546250 |
| news | 257.3 | 1010242 | 203223159 | 34762395 |
| twitter | 319.0 | 2360148 | 162096241 | 30093413 |
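The statistics in the table above can be computed along the following lines (a sketch for one file; the same pattern applies to the others):

```r
f <- "en_US.blogs.txt"
lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)

file.size(f) / 1024^2                  # file size in MB
length(lines)                          # total lines
sum(nchar(lines))                      # total characters
sum(lengths(strsplit(lines, "\\s+")))  # total words (whitespace-split)
```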
Preprocessing and cleaning the data is important for developing the word prediction algorithm. The task is challenging, however, because the text files are large and the text contains numbers, punctuation, symbols and words with repeated letters. We therefore preprocessed and cleaned the data to remove these elements.
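A minimal sketch of such a cleaning pipeline using the tm package (Feinerer, Hornik, & Meyer, 2008) is shown below; the toy input vector is illustrative, and the exact transformations used in the analysis may differ:

```r
library(tm)

# Illustrative input; in the analysis this would be the sampled
# text lines described in the next section
text_lines <- c("In the years thereafter, most of the Oil fields...",
                "How are you? Btw thanks for the RT.")

corpus <- VCorpus(VectorSource(text_lines))

# Typical cleaning steps: lowercase, then strip numbers,
# punctuation and redundant whitespace
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
```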
Since the dataset is very large, for the initial data analysis we split it in a 70% : 30% ratio. We examine the data features on the 70% sample, leaving the remaining 30% for testing the predictive model.
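One way to perform such a split is to assign each line at random (a sketch; the seed and file name are illustrative):

```r
set.seed(1234)  # illustrative seed, for reproducibility

lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)

# ~70% of lines go to the analysis sample, ~30% are held out
in_sample    <- rbinom(length(lines), size = 1, prob = 0.7) == 1
sample_lines <- lines[in_sample]
test_lines   <- lines[!in_sample]
```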
The table below summarizes the total number of words and the number of unique words present in the sample data for each text file.
| Data Source | Total Lines | Total Characters | Total Words | Total Unique Words |
|---|---|---|---|---|
| blog | 629502 | 104551756 | 13418344 | 71546 |
| news | 707170 | 107598388 | 13666997 | 25617 |
| twitter | 1652104 | 83525467 | 11993087 | 573 |
The sample data has been used to understand the important features of the text. We built n-gram models from the corpus and identified the most frequent 1-grams, 2-grams and 3-grams. The histogram below shows the most frequent 1-grams for each sample file.
The following histogram shows the most frequent 2-grams for each sample text file.
The following histogram shows the most frequent 3-grams for each sample text file.
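The n-gram frequencies behind these plots can be counted along the following lines; this sketch shows 2-grams and assumes `sample_lines` holds the cleaned sample text (1-grams and 3-grams follow the same pattern):

```r
# Tokenize the cleaned sample into words
words <- unlist(strsplit(tolower(sample_lines), "\\s+"))

# Pair each word with its successor to form 2-grams, then count
bigrams     <- paste(head(words, -1), tail(words, -1))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
head(bigram_freq, 10)  # the ten most frequent 2-grams
```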
We also created word clouds of the top 100 most frequently used words in the blog, news and twitter sample text files, which are shown below, respectively.
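A word cloud like these can be drawn with the wordcloud package, assuming the token vector `words` from the n-gram step:

```r
library(wordcloud)
library(RColorBrewer)

# Count word frequencies and plot the 100 most common words
freq <- sort(table(words), decreasing = TRUE)
wordcloud(names(freq), as.numeric(freq),
          max.words = 100, colors = brewer.pal(8, "Dark2"))
```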
Exploratory analysis of the text dataset provides some interesting findings. The next step is to develop the predictive model and the Shiny App. We will use the sampled 70% of the data to build the word prediction model and the remaining 30% to evaluate it; the model will then be deployed as a Shiny App.
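One possible shape for the prediction step is a simple n-gram frequency backoff, sketched below; this is illustrative only (the final model may use proper smoothing such as Katz backoff or Kneser-Ney) and assumes the token vector `words` from earlier:

```r
# Build bigram and trigram count tables; names are the
# space-separated n-grams ("w1 w2", "w1 w2 w3")
bi_freq  <- table(paste(head(words, -1), tail(words, -1)))
tri_freq <- table(paste(head(words, -2),
                        head(tail(words, -1), -1),
                        tail(words, -2)))

predict_next <- function(phrase) {
  toks <- unlist(strsplit(tolower(phrase), "\\s+"))
  # Try the trigram context (last two words) first
  if (length(toks) >= 2) {
    ctx  <- paste(tail(toks, 2), collapse = " ")
    hits <- tri_freq[startsWith(names(tri_freq), paste0(ctx, " "))]
    if (length(hits) > 0)
      return(sub(".* ", "", names(which.max(hits))))
  }
  # Back off to the bigram context (last word only)
  hits <- bi_freq[startsWith(names(bi_freq), paste0(tail(toks, 1), " "))]
  if (length(hits) > 0)
    return(sub(".* ", "", names(which.max(hits))))
  NA_character_  # no matching context in the counts
}

predict_next("how are")  # e.g. "you" on typical English text
```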
Data Source: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Feinerer, I., Hornik, K., & Meyer, D. (2008). Text Mining Infrastructure in R. Journal of Statistical Software, 25(5), 1–54. https://doi.org/10.18637/jss.v025.i05