Milestone Report

The goal of this project is to build a tool that suggests the next word based on the last word typed. This saves the user from typing out every word in full, because a suggested word can be selected instead. The suggestions are derived from publicly available text data (US blogs, tweets, and news articles) and are determined mainly by the last word written. Below is the exploratory analysis I did to get a feel for the data.

## [1] "Total words in the US blogs data set: 37546806"

## [1] "Total words in the US twitter data set: 30096649"

## [1] "Total words in the US news data set: 2674561"

## [1] "Total lines in the US blogs data set: 899288"

## [1] "Total lines in the US twitter data set: 2360148"

## [1] "Total lines in the US news data set: 77259"

## [1] "Average words per entry in the US blogs data set: 41.7518428529332"

## [1] "Average words per entry in the US twitter data set: 12.7520176700783"

## [1] "Average words per entry in the US news data set: 34.6181156887871"
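Counts like the ones above could be computed roughly as follows. This is a minimal base R sketch, not the code behind the reported figures: the helper name `summarise_corpus` and the whitespace-based definition of a "word" are assumptions for illustration, so the exact totals may differ from those shown. The file is opened in binary mode as a precaution, since embedded control characters can make a text-mode read stop early on some platforms.

```r
# Summarise a text file: line count, word count, average words per line.
# `summarise_corpus` is a hypothetical helper; splitting on whitespace is
# an assumed tokenisation, not necessarily the one used for this report.
summarise_corpus <- function(path) {
  # Binary mode avoids a premature end-of-file on platforms that treat
  # certain embedded control characters as EOF in text mode.
  con <- file(path, open = "rb")
  on.exit(close(con))
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)

  # Split each line on runs of whitespace and count the non-empty tokens.
  tokens <- strsplit(trimws(lines), "\\s+")
  words_per_line <- vapply(tokens, function(x) sum(nzchar(x)), integer(1))

  list(
    lines = length(lines),
    words = sum(words_per_line),
    avg_words_per_line = sum(words_per_line) / length(lines)
  )
}

# Usage with a tiny stand-in file instead of the real data sets.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("the quick brown fox", "hello world", "one"), demo_path)
stats <- summarise_corpus(demo_path)
print(stats)
```

The average is simply total words divided by total lines, which matches how the reported averages relate to the totals (for example, 37546806 / 899288 ≈ 41.75 for the blogs data set).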

## Most frequently used words in the US blogs data set:

## # A tibble: 15 × 3
##    word        n    freq
##    <chr>   <int>   <dbl>
##  1 the   1860184 0.0495 
##  2 and   1094404 0.0291 
##  3 to    1069442 0.0285 
##  4 a      900374 0.0240 
##  5 of     876799 0.0234 
##  6 i      775057 0.0206 
##  7 in     598541 0.0159 
##  8 that   460783 0.0123 
##  9 is     432715 0.0115 
## 10 it     403905 0.0108 
## 11 for    363840 0.00969
## 12 you    298709 0.00796
## 13 with   286734 0.00764
## 14 was    278347 0.00741
## 15 on     276514 0.00736
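A table with the columns `word`, `n`, and `freq` like the one above can be sketched in base R as shown here. The function name `top_words` and the regular expression used to pick out words are assumptions for illustration; the report's own tokenisation may treat punctuation, digits, or contractions differently.

```r
# Return the `n` most frequent words with their counts and relative frequencies.
# `top_words` is a hypothetical helper; the word pattern is an assumption.
top_words <- function(lines, n = 15) {
  # Lower-case, then keep runs of letters, optionally with one inner apostrophe.
  text <- tolower(lines)
  words <- unlist(regmatches(text, gregexpr("[[:alpha:]]+('[[:alpha:]]+)?", text)))

  # Count each distinct word and sort from most to least frequent.
  counts <- sort(table(words), decreasing = TRUE)
  top <- head(counts, n)

  data.frame(
    word = names(top),
    n = as.integer(top),
    freq = as.numeric(top) / length(words),  # share of all word tokens
    stringsAsFactors = FALSE
  )
}

# Usage on a toy input rather than the real data.
demo <- c("The cat and the dog", "the end")
print(top_words(demo, n = 3))
```

Here `freq` is the count divided by the total number of word tokens, which is consistent with the table above (1860184 / 37546806 ≈ 0.0495 for "the").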

## Most frequently used words in the US twitter data set:

## # A tibble: 15 × 3
##    word       n    freq
##    <chr>  <int>   <dbl>
##  1 the   937467 0.0311 
##  2 to    788663 0.0262 
##  3 i     723548 0.0240 
##  4 a     611407 0.0203 
##  5 you   548164 0.0182 
##  6 and   438541 0.0146 
##  7 for   385357 0.0128 
##  8 in    380383 0.0126 
##  9 of    359636 0.0119 
## 10 is    358787 0.0119 
## 11 it    295125 0.00981
## 12 my    291924 0.00970
## 13 on    278038 0.00924
## 14 that  234679 0.00780
## 15 me    202713 0.00674

## Most frequently used words in the US news data set:

## # A tibble: 15 × 3
##    word       n    freq
##    <chr>  <int>   <dbl>
##  1 the   151717 0.0567 
##  2 to     69757 0.0261 
##  3 and    68604 0.0257 
##  4 a      67346 0.0252 
##  5 of     59315 0.0222 
##  6 in     51894 0.0194 
##  7 for    27166 0.0102 
##  8 that   26384 0.00986
##  9 is     21969 0.00821
## 10 on     20814 0.00778
## 11 with   19758 0.00739
## 12 said   19176 0.00717
## 13 was    17627 0.00659
## 14 he     17587 0.00658
## 15 it     16768 0.00627

Maurice

2025-08-17