Introduction

The goal of this report is to summarize the exploratory data analysis (EDA) conducted on text data from three different sources: blogs, news articles, and Twitter. The purpose of this analysis is to understand the major features of the datasets and to outline the plans for developing a predictive text algorithm and a Shiny app.

Data Overview

The datasets used in this analysis are:

  • Blogs: A collection of blog posts.
  • News: A collection of news articles.
  • Twitter: A collection of tweets.

Each dataset consists of a text column containing the text content from its source. After loading and cleaning the data, the analysis uses a random sample of 1,000 lines from each dataset to save processing time. A seed was set so that the results are reproducible.

Calculate summary statistics

Dataset   Number_of_Lines   Total_Word_Count
Twitter   1000              12276
News      1000              32563
Blogs     1000              41360
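
The report does not show the code behind this step, so the following is only a minimal Python sketch of the idea: draw a reproducible sample of lines with a fixed seed, then count lines and words. The function names, the seed value 42, and the toy corpus are all assumptions made for illustration. Words are counted by splitting on whitespace, which may differ from the counting rule used for the table above.

```python
import random

def sample_lines(lines, k, seed):
    """Return k lines drawn without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    return rng.sample(lines, min(k, len(lines)))

def summarize(name, lines):
    """Count lines and whitespace-separated words for one dataset."""
    return {
        "Dataset": name,
        "Number_of_Lines": len(lines),
        "Total_Word_Count": sum(len(line.split()) for line in lines),
    }

# Toy stand-in for a real corpus file (hypothetical data, not from the report).
toy_corpus = [f"line number {i} of the toy corpus" for i in range(5000)]

subset = sample_lines(toy_corpus, k=1000, seed=42)
summary = summarize("Toy", subset)
print(summary)
```

Calling `sample_lines` again with the same seed returns the same subset, which is what makes the summary statistics repeatable.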

Major Features of the Data

To understand the characteristics of the text data, we performed tokenization to break the text into smaller units, such as unigrams (single words), bigrams (two-word sequences), and trigrams (three-word sequences). We then counted the frequency of these units to identify the most common words and phrases in each dataset.
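
The tokenize-and-count step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used for the report: the helper names are hypothetical, tokens are lowercase runs of letters (so digits and punctuation are dropped), and apostrophes are removed before splitting, which is consistent with forms such as "cant" and "dont" in the report's tables. N-grams are formed within a single line and never span two lines.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase, drop apostrophes, and keep runs of letters as words."""
    text = text.lower().replace("'", "").replace("’", "")
    return re.findall(r"[a-z]+", text)

def ngrams(tokens, n):
    """Return all n-grams (as space-joined strings) within one line."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(lines, n, k=20):
    """Count n-grams line by line and return the k most frequent."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(tokenize(line), n))
    return counts.most_common(k)

# Hypothetical example lines.
lines = ["I can't wait to see you", "Thanks for the RT", "to see you again"]
print(top_ngrams(lines, n=1, k=3))
print(top_ngrams(lines, n=3, k=1))
```

Using the same function with `n` set to 1, 2, or 3 gives unigram, bigram, and trigram counts from one code path.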

Tokenize the text into unigrams and calculate frequencies

The table below lists the 20 most common words in each dataset. The most common word in all three datasets is “the”, followed by “to”. In a more advanced cleaning step it is advisable to remove such words because they add no extra information.

##    Blogs Blogs_count News News_count Twitter Twitter_count
## 1    the        2054  the       1885     the           417
## 2     to        1260   to        907      to           312
## 3    and        1230    a        882       i           311
## 4      a         999  and        795       a           238
## 5      i         934   of        745     you           215
## 6     of         901   in        674     and           180
## 7     in         644  for        386      of           174
## 8   that         543 that        332      in           162
## 9     it         466   is        304      is           157
## 10    is         445   on        281     for           144
## 11   for         397 said        259      it           140
## 12  with         362 with        252      on           124
## 13    my         327   it        245      my           105
## 14   was         317   he        212    that           103
## 15    on         308   at        208      me            87
## 16   you         296  was        204      so            85
## 17  this         267  but        175      be            78
## 18    as         258   as        156      at            77
## 19    we         258   be        152    your            75
## 20    be         246   by        147    this            69
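
Removing the very frequent function words mentioned above could look like the following Python sketch. The stop-word list here is a small hand-picked set taken from the top rows of the table, purely for illustration. The function name is hypothetical, and a real analysis would likely use a longer, curated list.

```python
from collections import Counter

# Small illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "to", "and", "a", "i", "of", "in", "that", "it", "is"}

def content_word_counts(tokens, stop_words=STOP_WORDS):
    """Count only the tokens that are not in the stop-word set."""
    return Counter(t for t in tokens if t not in stop_words)

# Hypothetical example sentence.
tokens = "the cat sat on the mat and the dog sat too".split()
counts = content_word_counts(tokens)
print(counts.most_common(1))
```

This filter suits the descriptive frequency tables. Whether to apply it to the prediction model is a separate decision, since function words are often exactly what a user types next.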

Plotting the top 20 unigrams of the three datasets

Tokenize the text into bigrams and calculate frequencies

Bigrams are pairs of consecutive words. We identified the top 20 bigrams for each dataset. Common bigrams included phrases like “of the”, “in the”, and “to the”.

##      Bigram Blogs_Count  Bigram.1 News_Count   Bigram.2 Twitter_Count
## 1    of the         194    in the        167     in the            35
## 2    in the         178    of the        162     of the            34
## 3    to the         104   for the         74    for the            26
## 4     to be          93    to the         72      to be            24
## 5    on the          77    on the         70     on the            21
## 6   for the          69      in a         61     to the            20
## 7     and i          68   and the         58     i know            17
## 8   and the          59     to be         51 thanks for            17
## 9     i was          58    at the         48     at the            15
## 10   at the          57  from the         41   going to            15
## 11    it is          56   he said         41     i have            15
## 12 with the          56  with the         36     i love            14
## 13     in a          55  that the         35     have a            13
## 14     i am          53      of a         33     i cant            12
## 15   i have          50    by the         30       is a            12
## 16   it was          48      to a         30     it was            12
## 17   that i          43 more than         29    so much            12
## 18    but i          42     and a         27    will be            12
## 19     of a          42     for a         27      and i            11
## 20     is a          40     it is         27       i am            11

Plotting the bigrams

Tokenize the text into trigrams and calculate frequencies

Trigram analysis: Trigrams are sequences of three consecutive words. We identified the top 20 trigrams for each dataset. Common trigrams included phrases like “a lot of” and “thanks for the”. However, the top entry in every dataset is “NA”, indicating that the data needs more cleaning. A likely cause is lines with fewer than three words, which cannot form a trigram.

##             Trigram Blogs_Count        Trigram.1 News_Count       Trigram.2 Twitter_Count
## 1              <NA>          57             <NA>         41            <NA>            25
## 2          a lot of          25         a lot of         13  thanks for the            11
## 3        one of the          20        said in a          9        a lot of             5
## 4       i wanted to          10       one of the          8      for the rt             4
## 5           it is a          10      more than a          7         i had a             4
## 6        as much as           9      of the year          7        i have a             4
## 7           i had a           9       out of the          7     going to be             3
## 8          i had to           9      a couple of          6         i am so             3
## 9         i want to           9       be able to          6       i have no             3
## 10      a couple of           8      going to be          6      i love you             3
## 11       the end of           8       he said he          6       i need to             3
## 12       and i have           7   in a statement          6      id like to             3
## 13        i have to           7      a series of          5      one of the             3
## 14         it was a           7 according to the          5   the fact that             3
## 15         a bit of           6        he said i          5        to be in             3
## 16 according to the           6      i dont know          5      to see you             3
## 17     i decided to           6      in front of          5     want to get             3
## 18      if you dont           6      in new york          5 a beautiful day             2
## 19      one of them           6      is going to          5        a bit of             2
## 20       that it is           6         it was a          5      a good one             2
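
One possible source of the `<NA>` rows is lines that contain fewer than three words, since such lines cannot form a trigram and some tokenizers report a missing value for them. This is an assumption about the cause, not something the report confirms. The Python sketch below, with a hypothetical helper name and made-up example lines, shows one way to check for such lines and set them aside before counting.

```python
def split_by_length(lines, n):
    """Separate lines that can form at least one n-gram from those that cannot."""
    usable, too_short = [], []
    for line in lines:
        # A line needs at least n whitespace-separated words to yield an n-gram.
        if len(line.split()) >= n:
            usable.append(line)
        else:
            too_short.append(line)
    return usable, too_short

# Hypothetical example lines.
lines = ["thanks for the rt", "lol", "good morning", "a lot of fun"]
usable, too_short = split_by_length(lines, n=3)
print(len(usable), "usable lines;", len(too_short), "too short for trigrams")
```

If the number of short lines in a sample matches the `<NA>` count for that dataset, that would support this explanation. If it does not, the missing values come from somewhere else in the cleaning pipeline.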

Plotting the trigrams of the three datasets

Goals for the Prediction Algorithm and Shiny App

Based on the insights gained from the exploratory analysis, the following plans have been made for the development of the prediction algorithm and the Shiny app:

  • Prediction Algorithm:

The algorithm will be designed to predict the next word(s) based on the input text. It will use n-gram models (unigrams, bigrams, trigrams) to generate predictions. The model will be trained on the three datasets to ensure it captures a diverse range of language patterns.

  • Shiny App:

The Shiny app will provide an interactive interface for users to input text and receive word predictions. The app will display the top suggested words based on the input text, leveraging the trained n-gram models. Users will be able to choose from different prediction options and see real-time updates as they type.
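
To make the planned n-gram prediction more concrete, here is a minimal Python sketch of one way it could work. It is not the report's implementation. The back-off strategy (try the last two words, then the last word, then plain word frequency) is an assumption added for illustration, with no smoothing or discounting. The function names and the tiny corpus are hypothetical.

```python
from collections import Counter, defaultdict

def build_model(lines, max_n=3):
    """Map each context (a tuple of 0 to max_n-1 preceding words) to next-word counts."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()
        for i, word in enumerate(tokens):
            for ctx_len in range(max_n):
                if i - ctx_len < 0:
                    break  # not enough preceding words on this line
                context = tuple(tokens[i - ctx_len:i])
                model[context][word] += 1
    return model

def predict_next(model, text, k=3, max_n=3):
    """Suggest up to k next words, backing off from longer to shorter contexts."""
    tokens = text.lower().split()
    longest = min(max_n - 1, len(tokens))
    for ctx_len in range(longest, -1, -1):
        context = tuple(tokens[len(tokens) - ctx_len:])
        if context in model:
            return [word for word, _ in model[context].most_common(k)]
    return []  # only reached when the model is empty

# Hypothetical training lines.
corpus = [
    "thanks for the follow",
    "thanks for the rt",
    "thanks for the rt",
    "one of the best",
]
model = build_model(corpus)
print(predict_next(model, "Thanks for the", k=1))
print(predict_next(model, "under the", k=1))
```

The first call is answered from trigram counts. The second has no matching two-word context, so it falls back to the counts for words that follow "the". An app could call a function like `predict_next` on each keystroke to refresh its suggestions.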

Conclusion

The exploratory data analysis has revealed important patterns and features in the text data from blogs, news articles, and Twitter. These insights will guide the development of a robust and accurate prediction algorithm. The Shiny app will provide a user-friendly interface for text prediction, making it accessible and useful for a wide range of users. By understanding the common words and phrases in the data, we can create a prediction model that accurately reflects natural language usage. The eventual goal is to develop a powerful tool that enhances text input efficiency and provides meaningful suggestions based on context.