Introduction

This document is the first step in developing a word prediction algorithm as part of the Capstone Project for the Data Science Specialization track. The eventual goal is an app that will predict a word once the user inputs a phrase. To get to that point, we first perform some exploratory data analysis on the datasets that will act as the source for the app. The data is available as three text files drawn from Twitter feeds, US blogs and US news sources.

Load the data

We begin by loading the required packages and reading each of the three files into memory, one element per line of text.

library(dplyr)
library(ggplot2)
setwd("C:/Users/Siddharth/Documents/Careers/Siddharth/DataScienceSpecialization/Data Science Capstone/Final Project/final/en_US")

#Read the file that contains feeds from Twitter
twitter.file <- file("en_US.twitter.txt")
twitter <- readLines(twitter.file)

#Read the file that contains text from US-blogs
blogs.file <- file("en_US.blogs.txt")
blogs <- readLines(blogs.file)

#Read the file that contains text from US-news
news.file <- file("en_US.news.txt")
news <- readLines(news.file)

Exploratory Data Analysis

Number of words in each file

We can now look at the number of words in each file.

twitter.words <- sum(sapply(strsplit(twitter, " "), length))

blogs.words <- sum(sapply(strsplit(blogs, " "), length))

news.words <- sum(sapply(strsplit(news, " "), length))

So we see that:
The number of words in the twitter file is: 30373543
The number of words in the blogs file is: 37334131
The number of words in the news file is: 2643969
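Splitting on a single space produces an empty token wherever two spaces are adjacent, so the totals above are best read as approximate. The following is a minimal sketch, not the method used to produce the numbers above, that splits on runs of whitespace instead. The helper name `count_words` is hypothetical and does not appear elsewhere in this report.

```r
# Hypothetical helper: count whitespace-separated tokens in a character vector.
count_words <- function(lines) {
  # Trim leading and trailing whitespace so it does not create empty tokens,
  # then split on one or more whitespace characters.
  tokens <- strsplit(trimws(lines), "\\s+")
  # lengths() returns the number of tokens per line; empty lines contribute 0.
  sum(lengths(tokens))
}

# Toy input with a double space, padded whitespace and an empty line.
example.lines <- c("a  b", " c ", "")
example.count <- count_words(example.lines)
```

Applied to the full files, this would be called as `count_words(twitter)`, and the result may differ slightly from the totals reported above.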

Number of lines in each file

We can also take a look at the number of lines of text in each file.

twitter.length <- length(twitter)
news.length <- length(news)
blogs.length <- length(blogs)

So we see that:
The number of lines in the twitter file is: 2360148
The number of lines in the blogs file is: 899288
The number of lines in the news file is: 77259
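The line count for the news file is far lower than for the other two files. One possible explanation, which is an assumption and has not been verified in this report, is that a text-mode read on Windows stops early at an embedded control character. A minimal sketch of reading a file through a binary-mode connection follows. The helper name `read_all_lines` is hypothetical, and the choice of UTF-8 encoding is also an assumption about the files.

```r
# Hypothetical helper: read every line of a text file via a binary connection.
read_all_lines <- function(path) {
  # "rb" opens the connection in binary mode, so no byte is treated as end-of-file.
  con <- file(path, open = "rb")
  # We opened the connection ourselves, so we are responsible for closing it.
  on.exit(close(con))
  # skipNul = TRUE drops embedded nul bytes instead of truncating the line.
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Toy check on a temporary file with three lines.
tmp.path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp.path)
toy.lines <- read_all_lines(tmp.path)
```

If the line count of the news file changes when read this way, the word counts and frequency tables for that file would need to be recomputed as well.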

Frequency of words in each file

We can now go ahead and take a look at how frequently certain words appear in each file.

##frequency of words in the blogs file
blogs.allwords <- strsplit(blogs, " ")
blogs.freq <- table(unlist(blogs.allwords))
count.blog <- cbind.data.frame(names(blogs.freq), as.integer(blogs.freq))
count.blog.sorted <- count.blog[order(count.blog$`as.integer(blogs.freq)`, decreasing = TRUE),]

##frequency of words in the news file
news.allwords <- strsplit(news, " ")
news.freq <- table(unlist(news.allwords))
count.news <- cbind.data.frame(names(news.freq), as.integer(news.freq))
count.news.sorted <- count.news[order(count.news$`as.integer(news.freq)`, decreasing = TRUE),]

##frequency of words in the twitter file
twitter.allwords <- strsplit(twitter, " ")
twitter.freq <- table(unlist(twitter.allwords))
count.twitter <- cbind.data.frame(names(twitter.freq), as.integer(twitter.freq))
count.twitter.sorted <- count.twitter[order(count.twitter$`as.integer(twitter.freq)`, decreasing = TRUE),]

Let us now take a look at the 20 most frequently used words in each file.

Top 20 words in the blogs file:

head(count.blog.sorted,20)
##         names(blogs.freq) as.integer(blogs.freq)
## 993934                the                1659151
## 1007372                to                1043878
## 215594                and                1015714
## 750891                 of                 862906
## 139236                  a                 857102
## 572773                  I                 738534
## 580977                 in                 540436
## 993200               that                 421628
## 596649                 is                 412438
## 486334                for                 337156
## 1062974               was                 271439
## 1080908              with                 271302
## 598564                 it                 270280
## 755768                 on                 252275
## 720162                 my                 239952
## 1097473               you                 238652
## 542750               have                 210982
## 255081                 be                 198728
## 231468                 as                 196879
## 999113               this                 188536

Top 20 words in the twitter file:

head(count.twitter.sorted,20)
##         names(twitter.freq) as.integer(twitter.freq)
## 1144236                 the                   837023
## 1161158                  to                   761901
## 688407                    I                   604530
## 259049                    a                   572690
## 1281498                 you                   416376
## 293954                  and                   397641
## 590228                  for                   368422
## 889478                   of                   349367
## 698325                   in                   348814
## 713674                   is                   329396
## 895917                   on                   253558
## 855362                   my                   248739
## 715717                   it                   192437
## 1143023                that                   190844
## 338541                   be                   172886
## 312930                   at                   171759
## 1252317                with                   163808
## 1284351                your                   157112
## 654922                 have                   149376
## 813499                   me                   143522

Top 20 words in the news file:

head(count.news.sorted,20)
##        names(news.freq) as.integer(news.freq)
## 179034              the                131810
## 181190               to                 68417
## 34226               and                 65167
## 26882                 a                 63401
## 132756               of                 58675
## 101865               in                 47526
## 84925               for                 25498
## 178969             that                 23916
## 104997               is                 21232
## 133558               on                 19198
## 194564             with                 18779
## 191254              was                 17371
## 179035              The                 17144
## 37402                at                 15329
## 36758                as                 13369
## 95632                he                 12799
## 97772               his                 11305
## 41164                be                 11274
## 86707              from                 11256
## 105241               it                 10894

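As one would expect, the most frequently occurring words are common function words such as "and", "the", "for" and "a". Note that the counts are case-sensitive: in the news file, "the" and "The" appear as separate entries.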
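Because the text is split on single spaces without any normalization, the tables above count "the" and "The" separately in the news file. The following is a minimal sketch of a case-insensitive frequency table with readable column names. It is an illustration under stated assumptions, not the code that produced the tables above: the helper name `word_frequencies` and the column names `word` and `count` are hypothetical.

```r
# Hypothetical helper: case-insensitive word frequencies, most frequent first.
word_frequencies <- function(lines) {
  # Lower-case the text, trim the ends, and split on runs of whitespace.
  tokens <- unlist(strsplit(tolower(trimws(lines)), "\\s+"))
  # Drop any empty tokens.
  tokens <- tokens[nzchar(tokens)]
  # Tabulate and sort so the most frequent word comes first.
  freq <- sort(table(tokens), decreasing = TRUE)
  data.frame(word = as.character(names(freq)),
             count = as.integer(freq),
             stringsAsFactors = FALSE)
}

# Toy input: "The" and "the" are merged into a single entry.
toy.freq <- word_frequencies(c("The the cat", "the dog"))
```

Punctuation is still attached to words in this sketch, so "dog" and "dog." would remain separate entries.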

We can also plot the most frequently used words. Since the Twitter and blogs files each contain over a million distinct tokens, as the row numbers in the tables above show, we keep only the words that occur more than 50,000 times.

Let’s take a look at this plot for the Twitter file:

# Keep only the words that occur more than 50,000 times
twitter.words.50kplus <- filter(count.twitter.sorted, `as.integer(twitter.freq)` > 50000)
# Refer to columns by name inside aes() rather than through data$column
twitter.plot <- ggplot(data = twitter.words.50kplus, aes(x = `names(twitter.freq)`, y = `as.integer(twitter.freq)`)) + geom_point()
twitter.plot + theme(axis.text.x = element_text(angle = 45)) + xlab("Word") + ylab("Frequency of occurrence")


Figure 1: Plot showing the words that occur more than 50,000 times in the Twitter file

Now let us take a look at the plot for the blogs file:

# Keep only the words that occur more than 50,000 times
blogs.words.50kplus <- filter(count.blog.sorted, `as.integer(blogs.freq)` > 50000)
# Refer to columns by name inside aes() rather than through data$column
blogs.plot <- ggplot(data = blogs.words.50kplus, aes(x = `names(blogs.freq)`, y = `as.integer(blogs.freq)`)) + geom_point()
blogs.plot + theme(axis.text.x = element_text(angle = 45)) + xlab("Word") + ylab("Frequency of occurrence")


Figure 2: Plot showing the words that occur more than 50,000 times in the US-blogs file

Finally, let us take a look at the plot for the US-news file:

# Keep only the words that occur more than 50,000 times
news.words.50kplus <- filter(count.news.sorted, `as.integer(news.freq)` > 50000)
# Refer to columns by name inside aes() rather than through data$column
news.plot <- ggplot(data = news.words.50kplus, aes(x = `names(news.freq)`, y = `as.integer(news.freq)`)) + geom_point()
news.plot + theme(axis.text.x = element_text(angle = 45)) + xlab("Word") + ylab("Frequency of occurrence")


Figure 3: Plot showing the words that occur more than 50,000 times in the US-news file
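The three plots repeat the same steps, and by default ggplot2 orders the words on the x-axis alphabetically instead of by frequency. The following is a minimal sketch of a single reusable function that orders the words from most to least frequent. It assumes a data frame with hypothetical columns `word` and `count`, which differ from the column names used in this report, and the function name `plot_frequent_words` is likewise hypothetical.

```r
library(ggplot2)

# Hypothetical helper: plot the words whose count exceeds min_count.
# Assumes `freq` has a character column `word` and a numeric column `count`.
plot_frequent_words <- function(freq, min_count = 50000) {
  frequent <- freq[freq$count > min_count, ]
  # reorder() sorts the x-axis by descending count instead of alphabetically.
  ggplot(frequent, aes(x = reorder(word, -count), y = count)) +
    geom_point() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    xlab("Word") +
    ylab("Frequency of occurrence")
}

# Toy data: only the two words above the threshold are kept.
toy.counts <- data.frame(word = c("the", "to", "rare"),
                         count = c(90000L, 60000L, 12L))
toy.plot <- plot_frequent_words(toy.counts)
```

Printing `toy.plot` draws the figure. With the data frames built in this report, the columns would first need to be renamed to `word` and `count`.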