Module 2 Report

Part I: Data Processing

After reading the Natural Language Processing (NLP) materials from Stanford, I downloaded the SwiftKey dataset, which contains large collections of text data from blogs, news articles, and Twitter posts. I successfully loaded the English-language files into R and performed an initial exploratory analysis. This included cleaning the text by replacing non-letter characters with spaces, splitting the lines into words, and removing empty entries. I summarized the data by calculating the number of lines, total number of words, and maximum line length for each dataset.
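The sketch below illustrates this processing in R. It is a minimal, hedged version: the file names follow the standard SwiftKey convention (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt), and the helper summarize_file is an illustrative name, not the exact code behind the table.

```r
library(tibble)
library(dplyr)

# Summarize one file: line count, word count, and longest line
summarize_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  # Replace non-letter characters with spaces, split into words,
  # and drop the empty entries left over from the splitting
  cleaned <- gsub("[^A-Za-z]", " ", lines)
  words <- unlist(strsplit(cleaned, "\\s+"))
  words <- words[words != ""]
  tibble(Lines = length(lines),
         Words = length(words),
         Max_Line_Length = max(nchar(lines)))
}

files <- c(Blogs = "en_US.blogs.txt",
           News = "en_US.news.txt",
           Twitter = "en_US.twitter.txt")
summary_tbl <- bind_rows(lapply(files, summarize_file), .id = "Dataset")
summary_tbl
```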

## # A tibble: 3 × 4
##   Dataset   Lines    Words Max_Line_Length
##   <chr>     <int>    <int>           <int>
## 1 Blogs    899288 37546250           40833
## 2 News    1010242 34762395           11384
## 3 Twitter 2360148 30093372             140

Data Exploration: Dataset Sizes

From the summary table, Twitter has by far the most lines but the fewest total words, and its maximum line length of 140 characters matches Twitter's historical character limit. This reflects the nature of the platform: users write short messages built largely from common words. In contrast, news and blog writers tend to use longer sentences with more specialized or uncommon vocabulary. These differences suggest that, when building the prediction model, care must be taken to match the training data to the kind of text being predicted, so the model performs well across all three text types.
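A quick back-of-the-envelope check makes the difference concrete, using the summary_tbl from the sketch above:

```r
# Average words per line for each dataset
round(summary_tbl$Words / summary_tbl$Lines, 1)
# Blogs ~41.8, News ~34.4, Twitter ~12.8 words per line
```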

Data Exploration: Word Frequencies

The top 10 most frequent words are nearly identical across the three datasets, indicating a core vocabulary shared by all of them. Combined with the earlier finding about length and vocabulary differences, this highlights the importance of handling both common and rare words when building the language model.
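A sketch of how such a top-10 list can be produced, reusing the same cleaning steps as above (the function name is again illustrative):

```r
# Count word frequencies in a file and return the n most common
top_words <- function(path, n = 10) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- unlist(strsplit(tolower(gsub("[^A-Za-z]", " ", lines)), "\\s+"))
  words <- words[words != ""]
  head(sort(table(words), decreasing = TRUE), n)
}

top_words("en_US.twitter.txt")  # common stop words dominate the list
```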

Part II: Plans for Creating a Language Model

Goal:
Build a model that predicts the next word as the user types.
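As a rough illustration of the core idea only (a simple bigram lookup; the actual model will follow the steps below), a minimal next-word predictor might look like this:

```r
# Build a bigram table: for each word, count which words follow it
build_bigrams <- function(words) {
  table(head(words, -1), tail(words, -1))
}

# Predict the most frequent follower of `word` in the training text
predict_next <- function(bigrams, word) {
  if (!word %in% rownames(bigrams)) return(NA_character_)
  names(which.max(bigrams[word, ]))
}

words <- c("i", "love", "data", "and", "i", "love", "data")
bg <- build_bigrams(words)
predict_next(bg, "i")  # "love"
```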

Steps: