The goal of this report is to summarize the data, perform exploratory data analysis on the en_US text files obtained in Week 1, and lay the groundwork for a prediction algorithm. The three text files we downloaded are:
1. en_US.twitter.txt
2. en_US.news.txt
3. en_US.blogs.txt
Installing the required packages
install.packages("tm")
install.packages("ggplot2")
install.packages("RWeka")
install.packages("wordcloud")
Loading the installed packages
library(tm)
## Warning: package 'tm' was built under R version 3.6.3
## Loading required package: NLP
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.6.3
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 3.6.3
## Loading required package: RColorBrewer
setwd("C:/Users/vamshikrishna/Desktop/R practice/R assignment data/Capstone_project_NLP/data/Coursera-SwiftKey/final/en_US/")
Downloading the data and extracting the zip archive
if (!file.exists("data/Coursera-SwiftKey.zip")){
  download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", destfile = "data/Coursera-SwiftKey.zip")
  unzip("data/Coursera-SwiftKey.zip", exdir = "data/Coursera-SwiftKey")  # extract under data/ so the paths match the readLines() calls below
}
Importing the data and looking at a summary of each text document
getwd()
## [1] "C:/Users/vamshikrishna/Desktop/R practice/R assignment data/Capstone_project_NLP"
twitter = readLines("data/Coursera-SwiftKey/final/en_US/en_US.twitter.txt",encoding='UTF-8')
## Warning in readLines("data/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", :
## line 167155 appears to contain an embedded nul
## Warning in readLines("data/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", :
## line 268547 appears to contain an embedded nul
## Warning in readLines("data/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", :
## line 1274086 appears to contain an embedded nul
## Warning in readLines("data/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", :
## line 1759032 appears to contain an embedded nul
news = readLines("data/Coursera-SwiftKey/final/en_US/en_US.news.txt",encoding='UTF-8')
## Warning in readLines("data/Coursera-SwiftKey/final/en_US/en_US.news.txt", :
## incomplete final line found on 'data/Coursera-SwiftKey/final/en_US/
## en_US.news.txt'
blogs = readLines("data/Coursera-SwiftKey/final/en_US/en_US.blogs.txt",encoding='UTF-8')
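The embedded-nul warnings above are harmless, but they can be avoided by passing skipNul = TRUE to readLines. This is an optional variant of the call, not what produced the output in this report:
twitter = readLines("data/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding='UTF-8', skipNul=TRUE)  # drops embedded nul bytes instead of warning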
#See minimum number of characters
min(nchar(twitter))
## [1] 2
min(nchar(news))
## [1] 2
min(nchar(blogs))
## [1] 1
#See maximum number of characters
max(nchar(twitter))
## [1] 140
max(nchar(news))
## [1] 5760
max(nchar(blogs))
## [1] 40833
#See summary of documents
summary(twitter)
## Length Class Mode
## 2360148 character character
summary(news)
## Length Class Mode
## 77259 character character
summary(blogs)
## Length Class Mode
## 899288 character character
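Beyond the line counts shown by summary(), a rough word count per document can be computed with base R by splitting each line on whitespace. This is an optional extra check, not part of the original analysis, and it takes a while on the full files:
sum(lengths(strsplit(twitter, "\\s+")))  # approximate word count for the twitter file
sum(lengths(strsplit(news, "\\s+")))
sum(lengths(strsplit(blogs, "\\s+")))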
Let's look at some sample lines from each document
head(twitter)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [5] "Words from a complete stranger! Made my birthday even better :)"
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
head(news)
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"
## [6] "There was a certain amount of scoffing going around a few years ago when the NFL decided to move the draft from the weekend to prime time -- eventually splitting off the first round to a separate day."
head(blogs)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"
## [6] "If you have an alternative argument, let's hear it! :)"
As suggested in Week 1, let's take a 5% sample of each document; this keeps CPU usage and run time manageable. Let's also set the seed to keep the sampling reproducible. Finally, we merge the samples and look at the summary of the combined sample data.
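The chunk below does not itself contain the seed call, so a minimal addition before the sampling would be the line sketched here; the value 1234 is an arbitrary choice for illustration, not taken from the original run:
set.seed(1234)  # arbitrary seed (assumption); any fixed value makes the 5% samples reproducible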
twitter1 = sample(twitter,length(twitter)*0.05)
news1 = sample(news,length(news)*0.05)
blogs1 = sample(blogs,length(blogs)*0.05)
sample_data = iconv(c(twitter1, news1, blogs1),'UTF-8','ASCII',sub='')
summary(sample_data)
## Length Class Mode
## 166833 character character
min(nchar(sample_data))
## [1] 0
max(nchar(sample_data))
## [1] 37166
Let's now look at the head of each sample we created from the actual data
head(twitter1)
## [1] "Back from the PDN PhotoPlus Expo in New York. Other than the Canon 1DX there was little new in the DSLR world."
## [2] "It's never too late to find a mentor. Even as a CEO you need feedback just as those rising through the ranks need feedback."
## [3] "Great practice!! Ready for tomorrow"
## [4] "Just bought $DMD"
## [5] "hm shape up"
## [6] "I hear RackSpace & now unfortunately think Rack City :-/"
head(news1)
## [1] "In fact, if you<U+0092>re going to try to rationalize a stupid action by an athlete over the last week, you<U+0092>d have a much easier time doing it with Amare Stoudemire. Following the Knicks' 104-94 Game 2 loss to the Heat, Stoudemire punched the door that held a fire extinguisher and cut his hand. Reports still vary on how serious the cuts are, but he clearly won<U+0092>t play Thursday in Game 3 in New York and he<U+0092>s probably done for the series, which means he<U+0092>s probably done for the season. Stoudemire was immediately branded stupid, selfish and guilty of an unforgivable act."
## [2] "\"He's not an enigma,\" says Gold of Jay's sleek grace and monkish announcements. \"But he is very un-Portland in some ways. For John, everything is an expression of art and design. Even the way he dresses.\""
## [3] "A little boy must have felt bad, so he raised his hand."
## [4] "We get all that. But it’s the performance that looks worse every year, even from those borrowed professionals trying to meet their one-year requirement as college residents. Meanwhile, the cartel known as the NCAA takes $771 million from CBS and Turner for this tournament, which gets more tedious with every passing year."
## [5] "Houston’s passing came on the eve of the 54th annual Grammy Awards. From 1986 through 2000, she owned the Grammy stage, winning six awards and regularly performing on the show."
## [6] "Immediately after the shooting, Trott sped away in a Jeep Cherokee, driving north on Route 1 with Williams in the passenger’s seat. The Jeep struck other cars as Trott fled. The two suspects were stopped in traffic in a jughandle on Route 1 and Old Post Road in Edison, where Trott was captured."
head(blogs1)
## [1] "Being very sure it would be of interest he also sent a magazine cutting as well to give a second view."
## [2] "I chuckled not because I don't admire the effort, because I do. I still believe that the truth about nature versus nurture lies somewhere in between, and that my own efforts at gender-neutral parenting have borne fruit as my child enters her teenage years; definitely female, but not always stereotypically so. I chuckled because it struck me as the kind of thing I would have tried when I was a brand new teacher, discovered it was like trying to push water uphill, then abandoned it in the interest of choosing a battle I could actually win."
## [3] "Below are the steps to complete your purchase."
## [4] "One of the luggage lessons I’ve learned is that cheap luggage is cheap for a reason. It usually can’t stand up to the rigors of even light travel. Any strain on the zippers or fabric, and you may find yourself hastily repacking your underwear in the middle of an airport concourse. Now, I’ve rarely seen luggage actually being loaded on or off a plane – I’d like to imagine they handle everything with kid gloves. I doubt it though – you’ll want your suitcase to be able to stand up to a few tosses, and having other luggage land on top of it."
## [5] "He was too shocked to say anything. Emotions were flowing in abundance and he felt them overflowing through his eyes."
## [6] "But first, this is what arrived today:"
With the help of the text-mining package tm, we clean the data by lowercasing, stripping whitespace, and removing numbers, punctuation, and English stopwords, and then convert the documents back to plain text. Profanity filtering fits the same pattern; a sketch follows the chunk below.
sample_corpus = VCorpus(VectorSource(sample_data))
sample_corpus = tm_map(sample_corpus, tolower)
sample_corpus = tm_map(sample_corpus, stripWhitespace)
sample_corpus = tm_map(sample_corpus, removeNumbers)
sample_corpus = tm_map(sample_corpus, removePunctuation)
sample_corpus = tm_map(sample_corpus, removeWords, stopwords('english'))
sample_corpus = tm_map(sample_corpus, PlainTextDocument)
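The chunk above does not actually remove profanity. A minimal sketch of how it could be added with removeWords, assuming a one-word-per-line list saved locally as profanity.txt (a hypothetical filename, not part of this report's code):
profanity_words = readLines("profanity.txt", encoding='UTF-8')  # hypothetical local word list, one term per line
sample_corpus = tm_map(sample_corpus, removeWords, profanity_words)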
Let's now create uni-gram, bi-gram, and tri-gram term-document matrices using RWeka's NGramTokenizer, and then look at the summary of each N-gram
uni_gram = function(x) NGramTokenizer(x, Weka_control(min=1,max=1))
uni_gram_tdm = TermDocumentMatrix(sample_corpus, control=list(tokenize=uni_gram))
bi_gram = function(y) NGramTokenizer(y, Weka_control(min=2,max=2))
bi_gram_tdm = TermDocumentMatrix(sample_corpus, control=list(tokenize=bi_gram))
tri_gram = function(z) NGramTokenizer(z, Weka_control(min=3,max=3))
tri_gram_tdm = TermDocumentMatrix(sample_corpus, control=list(tokenize=tri_gram))
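The summary of each matrix is not shown in the output above; printing the TermDocumentMatrix objects is one simple way to see the term counts and sparsity (an inspection step assumed here, not part of the original output):
uni_gram_tdm   # the print method reports the number of terms and documents and the sparsity percentage
bi_gram_tdm
tri_gram_tdm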
Printing the matrices shows that each N-gram matrix is highly sparse. Using removeSparseTerms, let's remove sparse terms with thresholds of 0.99 (uni-grams), 0.999 (bi-grams), and 0.9999 (tri-grams).
uni_gram_tdm_sparsed = removeSparseTerms(uni_gram_tdm, 0.99)
bi_gram_tdm_sparsed = removeSparseTerms(bi_gram_tdm, 0.999)
tri_gram_tdm_sparsed = removeSparseTerms(tri_gram_tdm, 0.9999)
The pruned matrices can be checked the same way to confirm the sparsity dropped. Next, let's compute the total frequency of each remaining term, order the terms by decreasing frequency, and build a data frame for each N-gram.
frequency_uni_gram = rowSums(as.matrix(uni_gram_tdm_sparsed))
frequency_uni_gram_orderby = order(frequency_uni_gram, decreasing=TRUE)
frequency_bi_gram = rowSums(as.matrix(bi_gram_tdm_sparsed))
frequency_bi_gram_orderby = order(frequency_bi_gram, decreasing=TRUE)
frequency_tri_gram = rowSums(as.matrix(tri_gram_tdm_sparsed))
frequency_tri_gram_orderby = order(frequency_tri_gram, decreasing=TRUE)
df_uni_gram <- data.frame("ngram"=names(frequency_uni_gram[frequency_uni_gram_orderby]), "freq"=frequency_uni_gram[frequency_uni_gram_orderby])
df_bi_gram <- data.frame("ngram"=names(frequency_bi_gram[frequency_bi_gram_orderby]), "freq"=frequency_bi_gram[frequency_bi_gram_orderby])
df_tri_gram <- data.frame("ngram"=names(frequency_tri_gram[frequency_tri_gram_orderby]), "freq"=frequency_tri_gram[frequency_tri_gram_orderby])
Let's look at the head of each data frame
head(df_uni_gram)
## ngram freq
## just just 12612
## like like 11203
## will will 10952
## one one 10728
## can can 9731
## get get 9615
head(df_bi_gram)
## ngram freq
## right now right now 1103
## cant wait cant wait 951
## dont know dont know 807
## last night last night 709
## im going im going 636
## feel like feel like 547
head(df_tri_gram)
## ngram freq
## cant wait see cant wait see 187
## happy mothers day happy mothers day 164
## let us know let us know 116
## im pretty sure im pretty sure 93
## happy new year happy new year 88
## dont even know dont even know 70
Let's plot the 40 most frequent terms of each N-gram as histograms.
ggplot(df_uni_gram[1:40,], aes(factor(ngram, levels = unique(ngram)), freq)) +
geom_bar(stat = 'identity') +
theme(axis.text.x=element_text(angle=90)) +
xlab('Uni_gram') +
ylab('Frequency')
ggplot(df_bi_gram[1:40,], aes(factor(ngram, levels = unique(ngram)), freq)) +
geom_bar(stat = 'identity') +
theme(axis.text.x=element_text(angle=90)) +
xlab('bi_gram') +
ylab('Frequency')
ggplot(df_tri_gram[1:40,], aes(factor(ngram, levels = unique(ngram)), freq)) +
geom_bar(stat = 'identity') +
theme(axis.text.x=element_text(angle=90)) +
xlab('Tri_gram') +
ylab('Frequency')
### Creating WordClouds
uniCloud <- wordcloud(df_uni_gram$ngram, df_uni_gram$freq, scale = c(2, 0.5), max.words = 100, random.order = FALSE, rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
biCloud <- wordcloud(df_bi_gram$ngram, df_bi_gram$freq, scale = c(2, 0.5), max.words = 100, random.order = FALSE, rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
triCloud <- wordcloud(df_tri_gram$ngram, df_tri_gram$freq, scale = c(2, 0.5), max.words = 100, random.order = FALSE, rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
Looking at the histograms, we can see that overall frequencies are much higher for uni-grams than for bi-grams and tri-grams; frequency drops as N increases.
Because stop words were removed, the bi-grams and tri-grams may not reflect the exact phrases that occur in natural text.
In the future we will build the prediction model, a data product (a Shiny application), a slide deck, and a summary report.
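As a rough illustration of how these frequency tables could feed the planned prediction model, here is a minimal sketch of a frequency lookup with simple back-off. The function name, the back-off scheme, and the string handling are illustrative assumptions, not part of this report's code:
# Minimal sketch of a lookup predictor with back-off (illustrative only, not the final model)
predict_next_word <- function(phrase, tri_df = df_tri_gram, bi_df = df_bi_gram) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n], "")   # last two words plus trailing space
    ngrams <- as.character(tri_df$ngram)          # rows are already sorted by decreasing frequency
    hit <- ngrams[startsWith(ngrams, prefix)][1]
    if (!is.na(hit)) return(substring(hit, nchar(prefix) + 1))
  }
  prefix <- paste(words[n], "")                   # back off to the last word and the bi-gram table
  ngrams <- as.character(bi_df$ngram)
  hit <- ngrams[startsWith(ngrams, prefix)][1]
  if (!is.na(hit)) return(substring(hit, nchar(prefix) + 1))
  NA_character_
}
predict_next_word("cant wait")   # expected to return "see", the top tri-gram above
A real model would need smoothing and more careful handling of unseen prefixes; this sketch simply returns the completion of the most frequent matching N-gram.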