Progress so far

The overall goal of this project is to build a predictive text application that suggests the next word based on what a user types. So far, I have downloaded and loaded the data sets provided in the course, which include an English-language set made up of blogs, news, and Twitter (or X) text files. The blogs and news files are each around 200 MB in size, while the Twitter file is around 160 MB. Because the files are so large, a subset of 1,000 lines from each source was used for initial exploration.

Blog and news entries are usually longer and more structured, while Twitter (or X) posts are shorter and more conversational. Common words such as “the,” “and,” and “to” appear very frequently in the text, as they do in everyday language, and certain word pairs repeat often, which points to predictable language patterns. These findings suggest that it is possible to build a model that predicts the next word from the previous words.

The predictive algorithm I will create uses such patterns. It first considers the two words immediately preceding the next word, then falls back to the single preceding word if no match is found, and finally falls back to single-word frequencies if required, so that a prediction is always provided. The data provided can be summarized as:

Summary of Dataset Lines and Text Type
File name   Total lines   Text nature
Blogs       899288        Structured
News        1010206       Structured
Twitter     2360148       Short and conversational
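
The fallback strategy described above (two preceding words, then one, then single-word frequencies) can be sketched in Python as follows. This is a minimal illustration, not the final implementation. The function names (`read_first_lines`, `tokenize`, `build_model`, `predict_next`), the simple regular-expression tokenizer, and the tiny sample sentences are all assumptions made for the example. Taking the first 1,000 lines of a file is also an assumption, since the report does not say how the subset was chosen.

```python
import re
from collections import Counter, defaultdict
from itertools import islice


def read_first_lines(path, limit=1000):
    """Hypothetical helper: read at most `limit` lines from a large text file."""
    with open(path, encoding="utf-8", errors="ignore") as handle:
        return [line.strip() for line in islice(handle, limit)]


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as words."""
    return re.findall(r"[a-z']+", text.lower())


def build_model(lines):
    """Count single words, word pairs, and word triples across all lines."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)   # previous word -> counts of next word
    trigrams = defaultdict(Counter)  # previous two words -> counts of next word
    for line in lines:
        words = tokenize(line)
        unigrams.update(words)
        for first, second in zip(words, words[1:]):
            bigrams[first][second] += 1
        for first, second, third in zip(words, words[1:], words[2:]):
            trigrams[(first, second)][third] += 1
    return unigrams, bigrams, trigrams


def predict_next(text, model):
    """Suggest the next word, backing off from two words to one to none."""
    unigrams, bigrams, trigrams = model
    words = tokenize(text)
    # Step 1: try the two words immediately before the next word.
    if len(words) >= 2 and tuple(words[-2:]) in trigrams:
        return trigrams[tuple(words[-2:])].most_common(1)[0][0]
    # Step 2: fall back to the single preceding word.
    if words and words[-1] in bigrams:
        return bigrams[words[-1]].most_common(1)[0][0]
    # Step 3: fall back to the most frequent single word overall.
    if unigrams:
        return unigrams.most_common(1)[0][0]
    return None  # only reached when the model was built from no text


# Made-up sample lines, used only to demonstrate the three fallback steps.
sample_lines = [
    "I want to go to the store",
    "I want to go home",
    "We want to see the game",
]
model = build_model(sample_lines)

print(predict_next("they want to", model))  # two-word match -> go
print(predict_next("maybe i", model))       # one-word fallback -> want
print(predict_next("hello", model))         # single-word fallback -> to
```

Choosing the single most frequent continuation at each step keeps the sketch short. A fuller version could return several ranked suggestions or weight the three levels rather than stopping at the first match.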