After reading the Natural Language Processing (NLP) materials from Stanford, I downloaded the SwiftKey dataset, which contains large collections of text data from blogs, news articles, and Twitter posts. I successfully loaded the English-language files into R and performed an initial exploratory analysis. This included cleaning the text by replacing non-letter characters with spaces, splitting the lines into words, and removing empty entries. I summarized the data by calculating the number of lines, total number of words, and maximum line length for each dataset.
## # A tibble: 3 × 4
##   Dataset   Lines    Words Max_Line_Length
##   <chr>     <int>    <int>           <int>
## 1 Blogs    899288 37546250           40833
## 2 News    1010242 34762395           11384
## 3 Twitter 2360148 30093372             140
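The cleaning and summary steps described above can be sketched in base R roughly as follows. This is a minimal illustration on made-up sample lines, not the code that produced the table: `summarize_lines` and `sample_lines` are hypothetical names, and the sketch assumes the maximum line length is measured in characters. In practice the lines would come from something like `readLines()` on each file.

```r
# Hypothetical helper: summarize a character vector of text lines.
summarize_lines <- function(lines, name) {
  # Replace every non-letter character with a space.
  cleaned <- gsub("[^A-Za-z]", " ", lines)
  # Split on runs of whitespace and flatten into one vector of words.
  words <- unlist(strsplit(cleaned, "\\s+"))
  # Remove empty entries (e.g. from lines that start with a space).
  words <- words[words != ""]
  data.frame(
    Dataset = name,
    Lines = length(lines),
    Words = length(words),
    Max_Line_Length = max(nchar(lines)),  # assumption: length in characters
    stringsAsFactors = FALSE
  )
}

# Made-up sample lines standing in for one of the SwiftKey files.
sample_lines <- c("Hello, world!", "It's 5 o'clock - time to go.")
summary_row <- summarize_lines(sample_lines, "Sample")
print(summary_row)
```

Note that replacing apostrophes with spaces splits contractions such as "It's" into two tokens, which slightly inflates the word count.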
The summary table shows that Twitter has the most lines but the fewest total words: roughly 13 words per line on average, compared with about 42 for blogs and 34 for news. This likely reflects the nature of Twitter, where messages are short (the maximum line length of 140 is consistent with the platform's 140-character limit at the time) and plausibly lean on more common words. News and blog writers, by contrast, tend to write longer sentences and may use more specialized or uncommon vocabulary. These differences suggest that, when building the prediction model, the training data should match the kind of text the model will be tested on if it is to perform well across different types of text.
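To make the comparison concrete, the average number of words per line can be computed directly from the figures in the summary table. The sketch below copies those figures by hand; the `summary_tbl` and `Words_Per_Line` names are introduced here for illustration only.

```r
# Figures copied from the summary table above.
summary_tbl <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010242, 2360148),
  Words = c(37546250, 34762395, 30093372)
)

# Average number of words per line, rounded to one decimal place.
summary_tbl$Words_Per_Line <- round(summary_tbl$Words / summary_tbl$Lines, 1)
print(summary_tbl)
```

A typical blog or news line carries about three times as many words as a typical tweet.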
The top 10 most frequent words are very similar across all three datasets, which suggests that a core set of words is used consistently regardless of source. Combined with the earlier finding, this highlights the importance of handling both common and rare words when building the language model.
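One way to obtain such a top-10 list is sketched below in base R. It is an illustration rather than the code used for this report: `top_words` and the sample lines are hypothetical, and lowercasing the text before counting is an assumption that the description above does not confirm.

```r
# Hypothetical helper: return the n most frequent words in a character vector.
top_words <- function(lines, n = 10) {
  # Replace non-letters with spaces; lowercasing is an assumption made here.
  cleaned <- tolower(gsub("[^A-Za-z]", " ", lines))
  words <- unlist(strsplit(cleaned, "\\s+"))
  words <- words[words != ""]
  # Count each distinct word and sort from most to least frequent.
  counts <- sort(table(words), decreasing = TRUE)
  head(counts, n)
}

# Made-up sample lines for illustration.
sample_lines <- c("The cat sat on the mat.", "The dog and the cat.")
top <- top_words(sample_lines, n = 2)
print(top)
```

Running the same function on each dataset and comparing the resulting names shows how much the most frequent vocabulary overlaps.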
Goal:
Build a model that predicts the next word after the user types a word.
Steps: