The goal of this report is to summarize the exploratory data analysis (EDA) conducted on text data from three different sources: blogs, news articles, and Twitter. The purpose of this analysis is to understand the major features of the datasets and to outline the plans for developing a predictive text algorithm and a Shiny app.
The datasets used in this analysis are:
| Dataset | Number of lines | Total word count |
|---|---|---|
| Twitter | 1000 | 12276 |
| News | 1000 | 32563 |
| Blogs | 1000 | 41360 |
To understand the characteristics of the text data, we performed tokenization to break the text into smaller units, such as unigrams (single words), bigrams (two-word sequences), and trigrams (three-word sequences). We then counted the frequency of these units to identify the most common words and phrases in each dataset.
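The tokenization and counting step described above can be sketched as follows. This is a minimal Python illustration, not the code used to produce this report: the function names `tokenize`, `ngrams`, and `top_ngrams` are hypothetical, and the cleaning rules (lowercasing, dropping apostrophes, keeping only letters) are assumptions inferred from entries such as “i cant” in the results.

```python
import re
from collections import Counter


def tokenize(line):
    """Lowercase a line, drop apostrophes, and keep runs of letters as tokens."""
    cleaned = re.sub(r"['’]", "", line.lower())
    return re.findall(r"[a-z]+", cleaned)


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(lines, n, k=20):
    """Count n-grams line by line (no n-gram spans two lines) and return the k most common."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(tokenize(line), n))
    return counts.most_common(k)


# Made-up sample lines, used only to show the call pattern
sample = [
    "I can't wait for the weekend",
    "Thanks for the follow",
    "One of the best days of the year",
]
print(top_ngrams(sample, 1, k=3))  # unigrams
print(top_ngrams(sample, 2, k=3))  # bigrams
```

Counting per line keeps n-grams from crossing line boundaries, which matters for Twitter, where each line is a separate message.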
The table below lists the 20 most common words in each dataset. The most common word in all three datasets is “the”, followed by “to”. In a more advanced data-cleaning step, it is advisable to remove these words because they add no extra information.
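| Rank | Blogs word | Blogs count | News word | News count | Twitter word | Twitter count |
|---|---|---|---|---|---|---|
| 1 | the | 2054 | the | 1885 | the | 417 |
| 2 | to | 1260 | to | 907 | to | 312 |
| 3 | and | 1230 | a | 882 | i | 311 |
| 4 | a | 999 | and | 795 | a | 238 |
| 5 | i | 934 | of | 745 | you | 215 |
| 6 | of | 901 | in | 674 | and | 180 |
| 7 | in | 644 | for | 386 | of | 174 |
| 8 | that | 543 | that | 332 | in | 162 |
| 9 | it | 466 | is | 304 | is | 157 |
| 10 | is | 445 | on | 281 | for | 144 |
| 11 | for | 397 | said | 259 | it | 140 |
| 12 | with | 362 | with | 252 | on | 124 |
| 13 | my | 327 | it | 245 | my | 105 |
| 14 | was | 317 | he | 212 | that | 103 |
| 15 | on | 308 | at | 208 | me | 87 |
| 16 | you | 296 | was | 204 | so | 85 |
| 17 | this | 267 | but | 175 | be | 78 |
| 18 | as | 258 | as | 156 | at | 77 |
| 19 | we | 258 | be | 152 | your | 75 |
| 20 | be | 246 | by | 147 | this | 69 |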
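Removing the most frequent function words, as suggested above, could look like the following Python sketch. The stop-word list here is illustrative only: it is drawn from the top rows of the table, and a real cleaning step would use a longer, curated list. `STOP_WORDS` and `count_content_words` are hypothetical names.

```python
from collections import Counter

# Illustrative stop-word list taken from the top of the frequency table (an assumption,
# not the list used in this report)
STOP_WORDS = {"the", "to", "and", "a", "i", "of", "in", "that", "it", "is", "for"}


def count_content_words(tokens, stop_words=STOP_WORDS):
    """Count only the tokens that are not in the stop-word set."""
    return Counter(t for t in tokens if t not in stop_words)


tokens = ["the", "mayor", "said", "the", "budget", "is", "for", "the", "city"]
print(count_content_words(tokens).most_common(5))
```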
Bigrams are pairs of consecutive words. We identified the top 20 bigrams for each dataset. Common bigrams included phrases like “of the”, “in the”, and “to the”.
| Rank | Blogs bigram | Blogs count | News bigram | News count | Twitter bigram | Twitter count |
|---|---|---|---|---|---|---|
| 1 | of the | 194 | in the | 167 | in the | 35 |
| 2 | in the | 178 | of the | 162 | of the | 34 |
| 3 | to the | 104 | for the | 74 | for the | 26 |
| 4 | to be | 93 | to the | 72 | to be | 24 |
| 5 | on the | 77 | on the | 70 | on the | 21 |
| 6 | for the | 69 | in a | 61 | to the | 20 |
| 7 | and i | 68 | and the | 58 | i know | 17 |
| 8 | and the | 59 | to be | 51 | thanks for | 17 |
| 9 | i was | 58 | at the | 48 | at the | 15 |
| 10 | at the | 57 | from the | 41 | going to | 15 |
| 11 | it is | 56 | he said | 41 | i have | 15 |
| 12 | with the | 56 | with the | 36 | i love | 14 |
| 13 | in a | 55 | that the | 35 | have a | 13 |
| 14 | i am | 53 | of a | 33 | i cant | 12 |
| 15 | i have | 50 | by the | 30 | is a | 12 |
| 16 | it was | 48 | to a | 30 | it was | 12 |
| 17 | that i | 43 | more than | 29 | so much | 12 |
| 18 | but i | 42 | and a | 27 | will be | 12 |
| 19 | of a | 42 | for a | 27 | and i | 11 |
| 20 | is a | 40 | it is | 27 | i am | 11 |
Trigrams are sequences of three consecutive words. We identified the top 20 trigrams for each dataset. Common trigrams included phrases like “a lot of” and “thanks for the”. However, the top entry in every dataset is `<NA>`, indicating that the data needs more cleaning.
| Rank | Blogs trigram | Blogs count | News trigram | News count | Twitter trigram | Twitter count |
|---|---|---|---|---|---|---|
| 1 | `<NA>` | 57 | `<NA>` | 41 | `<NA>` | 25 |
| 2 | a lot of | 25 | a lot of | 13 | thanks for the | 11 |
| 3 | one of the | 20 | said in a | 9 | a lot of | 5 |
| 4 | i wanted to | 10 | one of the | 8 | for the rt | 4 |
| 5 | it is a | 10 | more than a | 7 | i had a | 4 |
| 6 | as much as | 9 | of the year | 7 | i have a | 4 |
| 7 | i had a | 9 | out of the | 7 | going to be | 3 |
| 8 | i had to | 9 | a couple of | 6 | i am so | 3 |
| 9 | i want to | 9 | be able to | 6 | i have no | 3 |
| 10 | a couple of | 8 | going to be | 6 | i love you | 3 |
| 11 | the end of | 8 | he said he | 6 | i need to | 3 |
| 12 | and i have | 7 | in a statement | 6 | id like to | 3 |
| 13 | i have to | 7 | a series of | 5 | one of the | 3 |
| 14 | it was a | 7 | according to the | 5 | the fact that | 3 |
| 15 | a bit of | 6 | he said i | 5 | to be in | 3 |
| 16 | according to the | 6 | i dont know | 5 | to see you | 3 |
| 17 | i decided to | 6 | in front of | 5 | want to get | 3 |
| 18 | if you dont | 6 | in new york | 5 | a beautiful day | 2 |
| 19 | one of them | 6 | is going to | 5 | a bit of | 2 |
| 20 | that it is | 6 | it was a | 5 | a good one | 2 |
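The report does not state where the `<NA>` entries come from. One possible cause, offered here only as an assumption, is lines with fewer than three words, for which some tokenizers emit a missing value instead of no trigram at all. The Python sketch below mimics that behavior with a hypothetical `trigrams_or_none` helper and shows how filtering the missing values removes the spurious top entry.

```python
from collections import Counter


def trigrams_or_none(tokens):
    """Mimic a tokenizer that yields a missing value (None) for lines shorter than three words."""
    if len(tokens) < 3:
        return [None]
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]


def count_trigrams(token_lines, drop_missing=True):
    """Count trigrams across lines, optionally skipping the missing-value entries."""
    counts = Counter()
    for tokens in token_lines:
        for gram in trigrams_or_none(tokens):
            if drop_missing and gram is None:
                continue
            counts[gram] += 1
    return counts


# Made-up tokenized lines; two of them are too short to form a trigram
lines = [["thanks"], ["a", "lot", "of", "fun"], ["ok", "then"], ["a", "lot", "of"]]
raw = count_trigrams(lines, drop_missing=False)   # still contains a None key
clean = count_trigrams(lines)                     # None entries filtered out
print(raw.most_common(3))
print(clean.most_common(3))
```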
Based on the insights gained from the exploratory analysis, the following plans have been made for the development of the prediction algorithm and the Shiny app:
The algorithm will be designed to predict the next word(s) based on the input text. It will use n-gram models (unigrams, bigrams, trigrams) to generate predictions. The model will be trained on the three datasets to ensure it captures a diverse range of language patterns.
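One way such an n-gram predictor might work is a simple backoff scheme: look up the last two words, and if that context was never seen, fall back to the last word, then to overall word frequency. The Python sketch below illustrates this under stated assumptions. The backoff strategy itself is not specified in this report, and `build_model` and `predict` are hypothetical names.

```python
from collections import Counter, defaultdict


def build_model(token_lines, max_n=3):
    """Map each context (a tuple of 0 to max_n-1 preceding words) to a Counter of next words."""
    model = defaultdict(Counter)
    for tokens in token_lines:
        for i, word in enumerate(tokens):
            for k in range(max_n):
                if i - k < 0:
                    break  # not enough preceding words for a context of length k
                model[tuple(tokens[i - k:i])][word] += 1
    return model


def predict(model, text_tokens, k=3, max_n=3):
    """Back off from the longest usable context to shorter ones until a match is found."""
    for size in range(max_n - 1, -1, -1):
        if size > len(text_tokens):
            continue
        context = tuple(text_tokens[len(text_tokens) - size:])
        if context in model:
            return [w for w, _ in model[context].most_common(k)]
    return []


# Made-up training lines, already tokenized
corpus = [
    ["thanks", "for", "the", "follow"],
    ["thanks", "for", "the", "help"],
    ["one", "of", "the", "best"],
    ["thanks", "for", "the", "follow"],
]
model = build_model(corpus)
print(predict(model, ["thanks", "for"]))   # uses the two-word context
print(predict(model, ["unseen", "the"]))   # backs off to the one-word context
```

A plain backoff like this ignores how reliable each context is; smoothing or weighting schemes would be a natural refinement once the basic pipeline works.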
The Shiny app will provide an interactive interface for users to input text and receive word predictions. The app will display the top suggested words based on the input text, leveraging the trained n-gram models. Users will be able to choose from different prediction options and see real-time updates as they type.
The exploratory data analysis has revealed important patterns and features in the text data from blogs, news articles, and Twitter. These insights will guide the development of a robust and accurate prediction algorithm. The Shiny app will provide a user-friendly interface for text prediction, making it accessible and useful for a wide range of users. By understanding the common words and phrases in the data, we can create a prediction model that accurately reflects natural language usage. The eventual goal is to develop a powerful tool that enhances text input efficiency and provides meaningful suggestions based on context.