This document is the first step in developing a word prediction algorithm as part of the Capstone Project for the Data Science Specialization track. The eventual goal is an app that will predict a word once the user inputs a phrase. To get to that point, we first perform some exploratory data analysis on the datasets that will serve as the source for the app. This data is available in the form of three text files drawn from Twitter feeds, US blogs and US news sources.
We begin by loading the data:
library(dplyr)
library(ggplot2)
setwd("C:/Users/Siddharth/Documents/Careers/Siddharth/DataScienceSpecialization/Data Science Capstone/Final Project/final/en_US")
# Read the file that contains feeds from Twitter
twitter.file <- file("en_US.twitter.txt")
twitter <- readLines(twitter.file)
close(twitter.file)  # release the connection once the lines are in memory
# Read the file that contains text from US-blogs
blogs.file <- file("en_US.blogs.txt")
blogs <- readLines(blogs.file)
close(blogs.file)
# Read the file that contains text from US-news
news.file <- file("en_US.news.txt")
news <- readLines(news.file)
close(news.file)
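Text-mode reads can be sensitive to unusual bytes in large scraped corpora: embedded NUL characters trigger warnings, and on Windows an embedded end-of-file control character (0x1A) can cut a read short. As a minimal sketch, assuming a more defensive read is wanted, the hypothetical helper `read_corpus` below opens the connection in binary mode, skips NULs and closes the connection afterwards. It is not part of the analysis above, and the temporary file only stands in for the real data files.

```r
# Hypothetical helper (an assumption, not part of the original analysis):
# read a text file through a binary-mode connection so that control bytes
# cannot end the read early.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))  # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Demo on a temporary file that contains an embedded 0x1A byte
demo.path <- tempfile(fileext = ".txt")
writeBin(charToRaw("one\ntwo\x1athree\nfour\n"), demo.path)
demo.lines <- read_corpus(demo.path)
length(demo.lines)
unlink(demo.path)
```

If line counts from a read like this differ from a text-mode read of the same file, the text-mode read most likely stopped early.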
Number of words in each file
We can now look at the number of words in each file.
twitter.words <- sum(sapply(strsplit(twitter, " "), length))
blogs.words <- sum(sapply(strsplit(blogs, " "), length))
news.words <- sum(sapply(strsplit(news, " "), length))
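Splitting on a single space treats every space as a boundary, so leading or doubled spaces produce empty strings that are counted as words. The sketch below compares that approach with a hypothetical helper, `count_words`, that trims each line and splits on runs of whitespace. The helper and the sample lines are illustrative assumptions, and the counts reported in this document come from the single-space split.

```r
# Hypothetical helper: count words after trimming and splitting on whitespace runs
count_words <- function(lines) {
  tokens <- strsplit(trimws(lines), "\\s+")
  # count only non-empty tokens on each line
  sum(vapply(tokens, function(x) sum(nzchar(x)), integer(1)))
}

sample.lines <- c("to be  or not", " to be ")
# Same approach as in the analysis: split on a single space
naive.count <- sum(sapply(strsplit(sample.lines, " "), length))
clean.count <- count_words(sample.lines)
c(naive = naive.count, clean = clean.count)
```

On these two sample lines the single-space split counts two extra "words", one for the doubled space and one for the leading space.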
So we see that:
The number of words in the Twitter file is: 30373543
The number of words in the blogs file is: 37334131
The number of words in the news file is: 2643969
Number of lines in each file
We can also take a look at the number of lines of text in each file.
twitter.length <- length(twitter)
news.length <- length(news)
blogs.length <- length(blogs)
So we see that:
The number of lines in the Twitter file is: 2360148
The number of lines in the blogs file is: 899288
The number of lines in the news file is: 77259
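Putting the two sets of counts side by side gives a rough sense of how long a typical line is in each source. The sketch below simply divides the word counts by the line counts reported above. The data frame and its column names are illustrative assumptions.

```r
# Counts exactly as reported in this document
corpus.summary <- data.frame(
  source = c("twitter", "blogs", "news"),
  lines  = c(2360148, 899288, 77259),
  words  = c(30373543, 37334131, 2643969)
)
# Average number of space-separated words per line, rounded to one decimal
corpus.summary$words.per.line <- round(corpus.summary$words / corpus.summary$lines, 1)
corpus.summary
```

By this measure Twitter lines are much shorter (about 13 words) than blog lines (about 42) or news lines (about 34).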
Frequency of words in each file
We can now go ahead and take a look at how frequently certain words appear in each file.
##frequency of words in the blogs file
blogs.allwords <- strsplit(blogs, " ")
blogs.freq <- table(unlist(blogs.allwords))
count.blog <- cbind.data.frame(names(blogs.freq), as.integer(blogs.freq))
count.blog.sorted <- count.blog[order(count.blog$`as.integer(blogs.freq)`, decreasing = TRUE),]
##frequency of words in the news file
news.allwords <- strsplit(news, " ")
news.freq <- table(unlist(news.allwords))
count.news <- cbind.data.frame(names(news.freq), as.integer(news.freq))
count.news.sorted <- count.news[order(count.news$`as.integer(news.freq)`, decreasing = TRUE),]
##frequency of words in the twitter file
twitter.allwords <- strsplit(twitter, " ")
twitter.freq <- table(unlist(twitter.allwords))
count.twitter <- cbind.data.frame(names(twitter.freq), as.integer(twitter.freq))
count.twitter.sorted <- count.twitter[order(count.twitter$`as.integer(twitter.freq)`, decreasing = TRUE),]
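The three blocks above repeat the same steps, and the resulting column names (`names(...)` and `as.integer(...)`) are awkward to refer to. As a sketch only, the same tabulation could be wrapped in a function that returns named columns. `word_freq`, `word` and `count` are hypothetical names, and the helper resets the row names, so its output would not show the original table indices that appear in the output below.

```r
# Hypothetical helper: tabulate space-separated words and sort by frequency
word_freq <- function(lines) {
  freq <- table(unlist(strsplit(lines, " ")))
  counts <- data.frame(
    word = names(freq),
    count = as.integer(freq),
    stringsAsFactors = FALSE
  )
  counts <- counts[order(counts$count, decreasing = TRUE), ]
  rownames(counts) <- NULL  # drop the original table positions
  counts
}

toy.lines <- c("the cat and the dog", "the dog")
toy.freq <- word_freq(toy.lines)
head(toy.freq, 3)
```

With such a helper, each file would need a single call, for example `word_freq(blogs)`.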
Let us now take a look at the 20 most frequently used words from each file.
Top 20 words in the blogs file:
head(count.blog.sorted,20)
##         names(blogs.freq) as.integer(blogs.freq)
## 993934                the                1659151
## 1007372                to                1043878
## 215594                and                1015714
## 750891                 of                 862906
## 139236                  a                 857102
## 572773                  I                 738534
## 580977                 in                 540436
## 993200               that                 421628
## 596649                 is                 412438
## 486334                for                 337156
## 1062974               was                 271439
## 1080908              with                 271302
## 598564                 it                 270280
## 755768                 on                 252275
## 720162                 my                 239952
## 1097473               you                 238652
## 542750               have                 210982
## 255081                 be                 198728
## 231468                 as                 196879
## 999113               this                 188536
Top 20 words in the Twitter file:
head(count.twitter.sorted,20)
##         names(twitter.freq) as.integer(twitter.freq)
## 1144236                 the                   837023
## 1161158                  to                   761901
## 688407                    I                   604530
## 259049                    a                   572690
## 1281498                 you                   416376
## 293954                  and                   397641
## 590228                  for                   368422
## 889478                   of                   349367
## 698325                   in                   348814
## 713674                   is                   329396
## 895917                   on                   253558
## 855362                   my                   248739
## 715717                   it                   192437
## 1143023                that                   190844
## 338541                   be                   172886
## 312930                   at                   171759
## 1252317                with                   163808
## 1284351                your                   157112
## 654922                 have                   149376
## 813499                   me                   143522
Top 20 words in the news file:
head(count.news.sorted,20)
##        names(news.freq) as.integer(news.freq)
## 179034              the                131810
## 181190               to                 68417
## 34226               and                 65167
## 26882                 a                 63401
## 132756               of                 58675
## 101865               in                 47526
## 84925               for                 25498
## 178969             that                 23916
## 104997               is                 21232
## 133558               on                 19198
## 194564             with                 18779
## 191254              was                 17371
## 179035              The                 17144
## 37402                at                 15329
## 36758                as                 13369
## 95632                he                 12799
## 97772               his                 11305
## 41164                be                 11274
## 86707              from                 11256
## 105241               it                 10894
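So, just as one would expect, the most frequently occurring words are words like: and, the, for, a and so on. Note that the counts are case-sensitive, which is why "The" appears in the news table separately from "the".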
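Because the raw split keeps capitalisation and attached punctuation, variants such as "The", "the" and "the," are tallied as different words. A minimal sketch of one possible clean-up step is shown below, assuming English-only text in which anything other than the letters a to z and apostrophes can be treated as a separator. `normalize_words` and the toy sentences are hypothetical and are not part of the analysis above.

```r
# Hypothetical helper: lower-case the text and strip punctuation before counting
normalize_words <- function(lines) {
  cleaned <- tolower(lines)
  # keep only letters, apostrophes and spaces (an English-only assumption)
  cleaned <- gsub("[^a-z' ]+", " ", cleaned)
  tokens <- unlist(strsplit(cleaned, " +"))
  tokens[nzchar(tokens)]  # drop empty strings left by leading separators
}

toy.news <- c("The mayor said the vote was close.",
              "The vote, he said, was final.")
toy.counts <- sort(table(normalize_words(toy.news)), decreasing = TRUE)
toy.counts[["the"]]
```

After normalisation the three occurrences of "The" and "the" in the toy sentences fall under a single entry.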
We can also look at a plot to see the most frequently used words. Since the Twitter and blogs files each contain over a million distinct words, as the row indices above show, we keep only the words that occur more than 50,000 times.
Let’s take a look at this plot for the Twitter file:
twitter.words.50kplus <- filter(count.twitter.sorted, `as.integer(twitter.freq)` > 50000)
# Refer to the columns by name inside aes() rather than through twitter.words.50kplus$...
twitter.plot <- ggplot(data = twitter.words.50kplus, aes(x = `names(twitter.freq)`, y = `as.integer(twitter.freq)`)) + geom_point()
twitter.plot + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + xlab("Word") + ylab("Frequency of occurrence")
Figure 1: Plot showing the words that occur more than 50,000 times in the Twitter file
Now let us take a look at the plot for the blogs file:
blogs.words.50kplus <- filter(count.blog.sorted, `as.integer(blogs.freq)` > 50000)
# Refer to the columns by name inside aes() rather than through blogs.words.50kplus$...
blogs.plot <- ggplot(data = blogs.words.50kplus, aes(x = `names(blogs.freq)`, y = `as.integer(blogs.freq)`)) + geom_point()
blogs.plot + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + xlab("Word") + ylab("Frequency of occurrence")
Figure 2: Plot showing the words that occur more than 50,000 times in the US-blogs file
And finally, let us take a look at the plot for the US-news file:
news.words.50kplus <- filter(count.news.sorted, `as.integer(news.freq)` > 50000)
# Refer to the columns by name inside aes() rather than through news.words.50kplus$...
news.plot <- ggplot(data = news.words.50kplus, aes(x = `names(news.freq)`, y = `as.integer(news.freq)`)) + geom_point()
news.plot + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + xlab("Word") + ylab("Frequency of occurrence")
Figure 3: Plot showing the words that occur more than 50,000 times in the US-news file
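In plots like these, ggplot2 places a text or factor variable along the x axis in alphabetical (level) order, so the points do not run from most to least frequent. As a sketch under that assumption, `reorder()` can be used to order the words by their counts before plotting. The small data frame reuses the five news words above that exceed 50,000 occurrences. The names `top.words`, `word` and `count` are illustrative, and the plot is only drawn if ggplot2 is installed.

```r
# Toy data: the five news words above with more than 50,000 occurrences
top.words <- data.frame(
  word  = c("the", "to", "and", "a", "of"),
  count = c(131810, 68417, 65167, 63401, 58675)
)
# reorder() returns a factor whose levels follow ascending -count,
# that is, descending count
top.words$word <- reorder(top.words$word, -top.words$count)
levels(top.words$word)

# Draw the plot only when ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  word.plot <- ggplot(top.words, aes(x = word, y = count)) +
    geom_point() +
    xlab("Word") + ylab("Frequency of occurrence")
  print(word.plot)
}
```

The same call could be applied to the word column of any of the three filtered data frames before it is passed to `ggplot()`.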