This is a milestone report on the proposed “Text Predictor” project for Coursera’s Data Science Specialization capstone course. It contains an overview of the data corpus and a preliminary exploration of the data.
The corpus consists of data from three sources: Twitter, news articles, and blogs. The design phase of the algorithm uses only a subset of each source, since processing the full data set would be computationally expensive, and a sample of the data is usually enough to make reasonable predictions. The line counts of the three files (Twitter, news, and blogs, respectively) are given below.
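A minimal sketch of how these counts can be obtained is shown below. The file names are an assumption (the standard ones from the HC Corpora download used in the capstone); `skipNul` prevents `readLines` from stopping at embedded NUL characters.

```r
# Load each corpus file and count its lines.
# File names are assumed from the standard capstone download.
twitter <- readLines("en_US.twitter.txt", skipNul = TRUE, encoding = "UTF-8")
news    <- readLines("en_US.news.txt",    skipNul = TRUE, encoding = "UTF-8")
blogs   <- readLines("en_US.blogs.txt",   skipNul = TRUE, encoding = "UTF-8")

length(twitter)
length(news)
length(blogs)
```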
## [1] 2360148
## [1] 77259
## [1] 899288
The features of the three files are shown below. This exploration uses only a sample of the data, namely 500 lines from each file. The three tables below list the most frequently occurring words in each sample, and the graphs show the words with the highest frequencies (greater than 10).
The word cloud at the bottom gives a pictorial representation of the words and their frequencies: words rendered in larger sizes are more frequent than words in smaller sizes.
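The tables were produced with a tidytext-style workflow; the sketch below shows one way to reproduce them. The exact code is an assumption, and because `sample()` draws random lines, the counts will differ from run to run.

```r
library(dplyr)
library(tidytext)
library(wordcloud)

# Assumed workflow: sample 500 lines, split them into words,
# remove common stop words, and count the remaining words.
word_counts <- function(lines, n = 500) {
  tibble(text = sample(lines, n)) %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words, by = "word") %>%
    count(word, sort = TRUE)
}

twitter_counts <- word_counts(twitter)
twitter_counts                                  # first tibble below

# Word cloud: word size is proportional to frequency.
with(twitter_counts, wordcloud(word, n, max.words = 100))
```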
## # A tibble: 1,714 x 2
## word n
## <chr> <int>
## 1 day 29
## 2 love 27
## 3 rt 22
## 4 night 14
## 5 hey 13
## 6 time 13
## 7 tonight 12
## 8 follow 10
## 9 bad 8
## 10 guys 8
## # ... with 1,704 more rows
## # A tibble: 4,464 x 2
## word n
## <chr> <int>
## 1 ts 40
## 2 time 31
## 3 people 26
## 4 police 24
## 5 school 24
## 6 city 18
## 7 day 18
## 8 home 18
## 9 million 18
## 10 county 17
## # ... with 4,454 more rows
## # A tibble: 4,768 x 2
## word n
## <chr> <int>
## 1 ts 102
## 2 time 72
## 3 tt 72
## 4 day 43
## 5 ia 34
## 6 people 31
## 7 ita 23
## 8 water 23
## 9 dona 19
## 10 lot 19
## # ... with 4,758 more rows
Tokens such as “ts”, “ita” and “dona” in the news and blogs tables appear to be remnants of mis-encoded apostrophes (e.g. in “it’s” and “don’t”), suggesting the text will need further cleaning. The next goal of this project is to explore the less frequent words, which carry more information than the most frequent ones. The logical step after that is to build the prediction model itself.
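As a preview of that step, one purely illustrative starting point (not the final algorithm) is a bigram count, from which “next word given previous word” frequencies can be read off:

```r
library(dplyr)
library(tidytext)
library(tidyr)

# Illustrative sketch only: count bigrams in a sample and split
# them into (previous word, next word) pairs.
bigram_counts <- tibble(text = sample(blogs, 500)) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  separate(bigram, into = c("prev", "next_word"), sep = " ") %>%
  count(prev, next_word, sort = TRUE)

# Most likely next word after "the", by raw frequency:
bigram_counts %>% filter(prev == "the") %>% head()
```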