Introduction

The goal of this assignment is to understand the dataset and carry out an exploratory data analysis of each of the given files: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt. You should make use of tables and plots to illustrate important summaries of the data set.

Loading packages

Loading the required libraries

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(stringi)
library(tm)

## Loading required package: NLP

## 
## Attaching package: 'NLP'

## The following object is masked from 'package:ggplot2':
## 
##     annotate

library(RWeka)
library(wordcloud)

## Loading required package: RColorBrewer

library(ngram)
library(R.utils)

## Loading required package: R.oo

## Loading required package: R.methodsS3

## R.methodsS3 v1.8.1 (2020-08-26 16:20:06 UTC) successfully loaded. See ?R.methodsS3 for help.

## R.oo v1.24.0 (2020-08-26 16:11:58 UTC) successfully loaded. See ?R.oo for help.

## 
## Attaching package: 'R.oo'

## The following object is masked from 'package:R.methodsS3':
## 
##     throw

## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods

## The following objects are masked from 'package:base':
## 
##     attach, detach, load, save

## R.utils v2.10.1 (2020-08-26 22:50:31 UTC) successfully loaded. See ?R.utils for help.

## 
## Attaching package: 'R.utils'

## The following object is masked from 'package:utils':
## 
##     timestamp

## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, inherits, isOpen, nullfile, parse,
##     warnings

Importing data from text files

The data were downloaded from the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

conn <- file("en_US.blogs.txt")
blogs_data <- readLines(conn, encoding="UTF-8", skipNul=TRUE)
close(conn)

conn <- file("en_US.news.txt")
news_data <- readLines(conn, encoding="UTF-8", skipNul=TRUE)

## Warning in readLines(conn, encoding = "UTF-8", skipNul = TRUE): incomplete final
## line found on 'en_US.news.txt'

close(conn)

conn <- file("en_US.twitter.txt")
twitter_data <- readLines(conn, encoding="UTF-8", skipNul=TRUE)
close(conn)
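
The warning above reports an incomplete final line. As a defensive measure, a file can be read through a connection opened in binary mode, so that a stray control character inside the text cannot end the read early on platforms where text mode treats it as end-of-file. The following is a minimal sketch and not part of the original analysis: read_lines_binary is a hypothetical helper name, and the temporary file exists only to check the helper.

```r
# Hypothetical helper (not in the original report): read a text file through
# a connection opened in binary mode ("rb").
read_lines_binary <- function(path) {
  conn <- file(path, open = "rb")
  on.exit(close(conn))  # close the connection even if readLines() fails
  readLines(conn, encoding = "UTF-8", skipNul = TRUE)
}

# Small self-check on a temporary file with three lines.
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_lines <- read_lines_binary(tmp)
length(demo_lines)
```

If this helper were applied to the three data files, comparing length() of the result with the line counts reported below would show whether any read had been cut short.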

Basic Summary of data

A basic summary is calculated for each of the three text files: size (of the data as loaded in memory, in bytes), word count and number of lines.
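
As a rough illustration of what such a summary involves, the sketch below computes line count, word count and in-memory size with base R only. It is an assumption-laden stand-in, not the method used in this report: summarise_lines and its column names are hypothetical, and words are counted by splitting on whitespace, which may differ slightly from the counts produced by the ngram package.

```r
# Hypothetical helper: summarise a character vector of lines using base R.
summarise_lines <- function(lines, name) {
  # Split each line on runs of whitespace and count the non-empty tokens.
  tokens <- strsplit(trimws(lines), "[[:space:]]+")
  words_per_line <- vapply(tokens, function(tok) sum(nzchar(tok)), integer(1))
  data.frame(
    Source = name,
    Lines = length(lines),
    WordCount = sum(words_per_line),
    # object.size() measures the R object in memory, not the file on disk.
    MemoryMB = round(as.numeric(object.size(lines)) / 1024^2, 3),
    stringsAsFactors = FALSE
  )
}

demo <- c("We love you Mr. Brown.",
          "If you have an alternative argument, let's hear it! :)")
demo_summary <- summarise_lines(demo, "demo")
demo_summary
```

For the size of a file on disk, base R's file.size(path) could be used instead of object.size().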

“en_US.blogs.txt”

head(blogs_data)

## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"                                                                                    
## [6] "If you have an alternative argument, let's hear it! :)"

Lines <- length(blogs_data)
# object.size() gives the in-memory size of the vector in bytes, not the size of the file on disk
Size <- as.numeric(object.size(blogs_data))
wordCount <- wordcount(blogs_data, sep=" ", count.function = sum)
data <- data.frame(FileName = "en_US.blogs.txt" , 
                   FileSize = Size,
                   WordCount = wordCount,
                   Lines = Lines)


data

##          FileName  FileSize WordCount  Lines
## 1 en_US.blogs.txt 267758632  37334131 899288

“en_US.news.txt”

head(news_data)

## [1] "He wasn't home alone, apparently."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                                                                                                                                                                                                                                                                                                                                                         
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."                                                                                                                                                                                                                                                                                                                                 
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"                                                                                                                                                                                                                                                            
## [6] "There was a certain amount of scoffing going around a few years ago when the NFL decided to move the draft from the weekend to prime time -- eventually splitting off the first round to a separate day."

Lines <- length(news_data)
# object.size() gives the in-memory size of the vector in bytes, not the size of the file on disk
Size <- as.numeric(object.size(news_data))
wordCount <- wordcount(news_data, sep=" ", count.function = sum)
data <- data.frame(FileName = "en_US.news.txt" , 
                   FileSize = Size,
                   WordCount = wordCount,
                   Lines = Lines)
data

##         FileName FileSize WordCount Lines
## 1 en_US.news.txt 20729472   2643969 77259

“en_US.twitter.txt”

head(twitter_data)

## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."                                                                       
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                           
## [5] "Words from a complete stranger! Made my birthday even better :)"                                                
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"

Lines <- length(twitter_data)
# object.size() gives the in-memory size of the vector in bytes, not the size of the file on disk
Size <- as.numeric(object.size(twitter_data))
wordCount <- wordcount(twitter_data, sep=" ", count.function = sum)
data <- data.frame(FileName = "en_US.twitter.txt" , 
                   FileSize = Size,
                   WordCount = wordCount,
                   Lines = Lines)
data

##            FileName  FileSize WordCount   Lines
## 1 en_US.twitter.txt 334484992  30373583 2360148

Data Cleaning

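Draw a random 0.5% sample from each of the three sources and combine the samples into a single character vector; non-ASCII characters are then removed.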

test_data <- c(sample(blogs_data ,length(blogs_data) * 0.005),
               sample(news_data, length(news_data) * 0.005),
               sample(twitter_data, length(twitter_data) * 0.005)
)
testdata <- iconv(test_data, "UTF-8", "ASCII", sub="")

sample_corpus <- VCorpus(VectorSource(testdata))
# content_transformer() keeps each document a PlainTextDocument after tolower
sample_corpus <- tm_map(sample_corpus, content_transformer(tolower))
sample_corpus <- tm_map(sample_corpus, removePunctuation)
sample_corpus <- tm_map(sample_corpus, removeNumbers)
# strip whitespace last so that gaps left by the removals above are collapsed
sample_corpus <- tm_map(sample_corpus, stripWhitespace)
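
The sampling and cleaning steps can also be sketched in base R, which makes it easier to see what each step does to a single line of text. This is only an approximation of the tm pipeline under stated assumptions: sample_fraction and clean_text are hypothetical helpers, and the seed value 1234 is an arbitrary choice added so that the sample is repeatable (the report itself sets no seed).

```r
# Assumed seed value; any fixed integer makes the random sample repeatable.
set.seed(1234)

# Hypothetical helper: draw a fixed fraction of the lines in a character vector.
sample_fraction <- function(lines, fraction) {
  n <- floor(length(lines) * fraction)
  sample(lines, n)
}

# Hypothetical helper: base R approximation of the cleaning steps.
clean_text <- function(x) {
  x <- iconv(x, "UTF-8", "ASCII", sub = "")  # drop non-ASCII characters
  x <- tolower(x)                            # lower-case everything
  x <- gsub("[[:punct:]]+", "", x)           # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)           # remove numbers
  x <- gsub("[[:space:]]+", " ", x)          # collapse runs of whitespace
  trimws(x)                                  # drop leading/trailing blanks
}

demo <- c("We love you Mr. Brown.", "In 5 Minutes ;)",
          "First Cubs game ever!", "Go Cubs Go!")
picked <- sample_fraction(demo, 0.5)
cleaned <- clean_text(demo)
cleaned
```

Collapsing whitespace after the removals matters: deleting "5" from "In 5 Minutes" would otherwise leave two consecutive spaces.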

Creating N-grams for our data

unigram <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
bigram <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))

unidtf <- TermDocumentMatrix(sample_corpus, control=list(tokenize=unigram))
bidtf <- TermDocumentMatrix(sample_corpus, control=list(tokenize=bigram))
tridtf <- TermDocumentMatrix(sample_corpus, control=list(tokenize=trigram))

uni_tf <- findFreqTerms(unidtf, lowfreq = 50 )
bi_tf <- findFreqTerms(bidtf, lowfreq = 50 )
tri_tf <- findFreqTerms(tridtf, lowfreq = 10 )

uni_freq <- rowSums(as.matrix(unidtf[uni_tf, ]))
uni_freq <- data.frame(words=names(uni_freq), frequency=uni_freq)
head(uni_freq)

##               words frequency
## able           able        98
## about         about      1018
## above         above        57
## according according        52
## across       across        67
## actually   actually       141

bi_freq <- rowSums(as.matrix(bidtf[bi_tf, ]))
bi_freq <- data.frame(words=names(bi_freq), frequency=bi_freq)
head(bi_freq)

##             words frequency
## a big       a big        64
## a bit       a bit        87
## a couple a couple        64
## a few       a few       148
## a good     a good       161
## a great   a great       175

tri_freq <- rowSums(as.matrix(tridtf[tri_tf, ]))
tri_freq <- data.frame(words=names(tri_freq), frequency=tri_freq)
head(tri_freq)

##                       words frequency
## a bit of           a bit of        19
## a bunch of       a bunch of        20
## a chance to     a chance to        16
## a couple of     a couple of        34
## a few days       a few days        13
## a few minutes a few minutes        13
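
To make the tables above more concrete, the following base R sketch shows how n-grams can be formed by sliding a window of n tokens over each line and then counting the results. It is an illustration only and does not reproduce the RWeka tokenizer: make_ngrams and ngram_frequencies are hypothetical helpers, and tokens are obtained by a simple whitespace split.

```r
# Hypothetical helper: build the n-grams of one string with base R only.
make_ngrams <- function(text, n) {
  tokens <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  # Paste each window of n consecutive tokens into one n-gram.
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: n-gram frequency table, most frequent first.
ngram_frequencies <- function(texts, n) {
  grams <- unlist(lapply(texts, make_ngrams, n = n))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(words = names(counts), frequency = as.integer(counts),
             stringsAsFactors = FALSE)
}

demo <- c("a bit of fun", "a bit of luck", "a couple of days")
demo_bigrams <- make_ngrams(demo[1], 2)
demo_trigram_freq <- ngram_frequencies(demo, 3)
demo_trigram_freq
```

In this toy input the trigram "a bit of" occurs twice and every other trigram once, so it ends up in the first row of the table.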

Plotting N-gram Data

Plotting bigram frequencies as a word cloud

wordcloud(words=bi_freq$words, freq=bi_freq$frequency, max.words=100, colors = brewer.pal(8, "Dark2"))

Plotting unigram frequencies

plot_freq <- ggplot(data = uni_freq[order(-uni_freq$frequency),][1:15, ], aes(x = reorder(words, -frequency), y=frequency)) +
        geom_bar(stat="identity", fill="yellow") + 
        ggtitle("Top Unigram") + xlab("words") +  ylab("frequency")


plot_freq

Plotting trigram frequencies

plot_freq <- ggplot(data = tri_freq[order(-tri_freq$frequency),][1:15, ], aes(x = reorder(words, -frequency), y=frequency)) +
        geom_bar(stat="identity", fill="pink") + theme(axis.text.x = element_text(angle = 45)) + 
        ggtitle("Top Trigram") + xlab("words") +  ylab("frequency")

plot_freq
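
Both bar charts select the 15 most frequent rows by ordering the data frame and indexing rows 1:15. A small helper can do the same selection while behaving sensibly when fewer than 15 rows are available, since indexing past the end of a data frame yields rows of NA. The sketch below is one possible way to write it: top_terms is a hypothetical name, and the demo values are taken from the trigram table shown earlier.

```r
# Hypothetical helper: the k most frequent rows of a words/frequency data frame.
top_terms <- function(freq_df, k = 15) {
  ordered <- freq_df[order(-freq_df$frequency), , drop = FALSE]
  head(ordered, k)  # head() returns at most k rows and never pads with NA
}

demo_freq <- data.frame(words = c("a bit of", "a bunch of", "a couple of"),
                        frequency = c(19, 20, 34),
                        stringsAsFactors = FALSE)
top_two <- top_terms(demo_freq, k = 2)
top_two
```

The result could be passed as the data argument of ggplot() in place of the inline ordering and indexing.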

Conclusion

The three text files were loaded and basic summary statistics were computed. The plots show the most frequent n-grams in the sample: single words (unigrams) as a bar chart, two-word combinations (bigrams) as a word cloud, and three-word combinations (trigrams) as a bar chart.

Data Science Capstone - Week 2 Assignment

Jayasree Kulothungan

05/10/2020