Summary

This is a milestone report of the proposed “Text Predictor” project for Coursera’s Data Science specialization capstone course. It contains an overview of the data corpus used and a rudimentary exploration of the data.

Data Corpus

The corpus consists of data from three sources: Twitter, news, and blogs. The design phase of the algorithm uses only a subset of each source, since processing the whole corpus would be computationally expensive, and a sample of the data is usually enough to make reasonable predictions. The number of lines in each of the three files (Twitter, news, and blogs, respectively) is given below.

## [1] 2360148
## [1] 77259
## [1] 899288
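
These counts were presumably obtained by reading each file into R and counting its lines; a minimal sketch, assuming the standard capstone file names:

```r
# File names are assumptions based on the standard capstone data set
files <- c("en_US.twitter.txt", "en_US.news.txt", "en_US.blogs.txt")

for (f in files) {
  lines <- readLines(f, skipNul = TRUE)  # read the file as a vector of lines
  print(length(lines))                   # number of lines in the file
}
```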

Exploring the data

The features of the three files are shown below. This exploration uses only a sample of the data, namely 500 lines from each file. The three tables below list the most frequently occurring words in each sample, and the graphs show the words with the highest frequencies (more than 10 occurrences).

The word cloud at the bottom gives a pictorial representation of the words and their frequencies: words drawn in a larger size occur more frequently than words drawn in a smaller size.
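
A minimal sketch of how word-frequency tables like the ones below could be produced with the tidytext package; the file name, the 500-line sample size, and the stop-word removal are assumptions inferred from the report:

```r
library(dplyr)
library(tidytext)

# Read a 500-line sample from one of the sources (file name is an assumption)
sample_lines <- readLines("en_US.twitter.txt", n = 500, skipNul = TRUE)

word_counts <- tibble(text = sample_lines) %>%
  unnest_tokens(word, text) %>%            # split each line into lowercase words
  anti_join(stop_words, by = "word") %>%   # drop common English stop words
  count(word, sort = TRUE)                 # count occurrences, most frequent first

word_counts   # a tibble with columns word and n, as in the tables below
```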

## # A tibble: 1,714 x 2
##       word     n
##      <chr> <int>
##  1     day    29
##  2    love    27
##  3      rt    22
##  4   night    14
##  5     hey    13
##  6    time    13
##  7 tonight    12
##  8  follow    10
##  9     bad     8
## 10    guys     8
## # ... with 1,704 more rows
## # A tibble: 4,464 x 2
##       word     n
##      <chr> <int>
##  1      ts    40
##  2    time    31
##  3  people    26
##  4  police    24
##  5  school    24
##  6    city    18
##  7     day    18
##  8    home    18
##  9 million    18
## 10  county    17
## # ... with 4,454 more rows
## # A tibble: 4,768 x 2
##      word     n
##     <chr> <int>
##  1     ts   102
##  2   time    72
##  3     tt    72
##  4    day    43
##  5     ia    34
##  6 people    31
##  7    ita    23
##  8  water    23
##  9   dona    19
## 10    lot    19
## # ... with 4,758 more rows
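
The graphs and the word cloud referred to above can be drawn from the same frequency table; a minimal sketch using ggplot2 and the wordcloud package, with word_counts as built in the earlier sketch and with illustrative thresholds:

```r
library(dplyr)
library(ggplot2)
library(wordcloud)

# Bar chart of the words occurring more than 10 times in the sample
word_counts %>%
  filter(n > 10) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "word", y = "frequency")

# Word cloud: more frequent words are drawn in a larger size
wordcloud(words = word_counts$word,
          freq = word_counts$n,
          max.words = 100,       # cap on the number of words shown (assumption)
          random.order = FALSE)  # place the most frequent words near the centre
```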

Goals and scope

The next goal of this project is to explore the less frequent words, as they can carry more information than the very frequent ones. The next logical step after that is to build a model for prediction.
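
One possible first step toward such a model is an n-gram frequency table built with the same tidytext tooling; a minimal bigram sketch under the assumptions above, not the final model:

```r
library(dplyr)
library(tidytext)

# Count bigrams (pairs of consecutive words) in the sampled lines; a table like
# this can back a simple "predict the next word from the previous one" model
bigram_counts <- tibble(text = sample_lines) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

head(bigram_counts)   # most frequent word pairs in the sample
```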