The overall goal of this project is to build a predictive text application that suggests the next word based on what a user types. So far, I have downloaded and loaded the data sets provided in the course, including the English-language set, which consists of blogs, news, and Twitter (now X) files. Because the data sets are large, a subset of 1,000 lines from each source was used for initial exploration. The blogs and news text files are each around 200 MB, while the Twitter text file is around 160 MB.

Blog and news entries are usually longer and more structured, while Twitter posts are shorter and more conversational. Exploring the text shows that common words such as “the,” “and,” and “to” appear very frequently, as they do in everyday language, and that certain word pairs repeat often, which points to predictable language patterns. These findings suggest that it is possible to build a model that predicts the next word from the previous words.

The predictive algorithm I will create uses these patterns with a back-off approach: it first considers the two words immediately preceding the next word, then falls back to the single preceding word if no match is found, and finally falls back to single-word frequencies so that a prediction is always returned. The data provided can be summarized as:
| File name | Total lines | Text nature |
|---|---|---|
| Blogs | 899288 | Structured |
| News | 1010206 | Structured |
| Twitter | 2360148 | Short and conversational |
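
The back-off idea described above (two preceding words, then one preceding word, then overall word frequency) can be sketched roughly as follows. This is only a minimal Python illustration under my own assumptions, not the final implementation: the function names `tokenize`, `build_model`, and `predict_next`, the simple regex tokenizer, and the toy `sample_lines` are all hypothetical and stand in for the real 1,000-line samples.

```python
import re
from collections import Counter, defaultdict


def tokenize(line):
    """Lowercase a line and split it into word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", line.lower())


def build_model(lines):
    """Count single words and the words that follow each one- and two-word context."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)   # previous word -> counts of the next word
    trigrams = defaultdict(Counter)  # previous two words -> counts of the next word
    for line in lines:
        tokens = tokenize(line)
        unigrams.update(tokens)
        for i in range(1, len(tokens)):
            bigrams[tokens[i - 1]][tokens[i]] += 1
            if i >= 2:
                trigrams[(tokens[i - 2], tokens[i - 1])][tokens[i]] += 1
    return unigrams, bigrams, trigrams


def predict_next(text, model):
    """Suggest the next word, backing off from two words of context to one, then none."""
    unigrams, bigrams, trigrams = model
    tokens = tokenize(text)
    # Step 1: try the two words immediately preceding the next word.
    if len(tokens) >= 2 and (tokens[-2], tokens[-1]) in trigrams:
        return trigrams[(tokens[-2], tokens[-1])].most_common(1)[0][0]
    # Step 2: fall back to the single preceding word.
    if tokens and tokens[-1] in bigrams:
        return bigrams[tokens[-1]].most_common(1)[0][0]
    # Step 3: fall back to the most frequent single word seen in the data.
    if unigrams:
        return unigrams.most_common(1)[0][0]
    return None  # only reached when the model was built from no text at all


# Toy data (hypothetical), standing in for the sampled blogs, news, and Twitter lines.
sample_lines = [
    "I want to go to the park",
    "I want to go home",
    "we went to the store",
    "they want to go out",
]
model = build_model(sample_lines)

print(predict_next("I want", model))   # two-word context is found
print(predict_next("back to", model))  # only the last word is known
print(predict_next("hello", model))    # nothing matches, uses overall frequency
```

With this toy data the three calls take the two-word, one-word, and single-word paths respectively. A real version would also need to handle punctuation, profanity filtering, and memory limits on the full files, none of which are covered in this sketch.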