Downloading the file

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(stringr)
library(tidytext)
library(tm)
## Loading required package: NLP
library(tokenizers)
library(stopwords)
## 
## Attaching package: 'stopwords'
## The following object is masked from 'package:tm':
## 
##     stopwords
library(quanteda.corpora)
library(quanteda)
## Package version: 2.1.2
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following objects are masked from 'package:tm':
## 
##     as.DocumentTermMatrix, stopwords
## The following objects are masked from 'package:NLP':
## 
##     meta, meta<-
## The following object is masked from 'package:utils':
## 
##     View
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(RColorBrewer)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Goal:

The goal of this task is to get familiar with the databases and do the necessary cleaning.

Task 1 - Getting and cleaning the data

Tasks to accomplish:

Tokenization - identifying appropriate tokens such as words, punctuation, and numbers, and writing a function that takes a file as input and returns a tokenized version of it. Profanity filtering - removing profanity and other words you do not want to predict.

The listed folders are de_DE, en_US, fi_FI, and ru_RU. I will analyze the en_US folder and list the files inside it.

Tips: Reading in chunks or lines using R's readLines or scan functions can be useful. For example, the following code could be used to read the first few lines of the English Twitter dataset:

con <- file("en_US.twitter.txt", "r")
readLines(con, 1) ## Read the first line of text
readLines(con, 1) ## Read the next line of text
readLines(con, 5) ## Read in the next 5 lines of text
close(con) ## It's important to close the connection when you are done. See the connections help page for more information.
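
Building on that tip, a large file could also be processed in fixed-size chunks of lines so that it never has to sit in memory all at once. The following is only a sketch of that idea (assuming the en_US Twitter file path used later in this report); it is not part of the processing done here:

con <- file("./final/en_US/en_US.twitter.txt", "r")
repeat {
  chunk <- readLines(con, n = 10000, encoding = "UTF-8", skipNul = TRUE) #read up to 10,000 lines
  if (length(chunk) == 0) break #stop when no lines are left
  #... process `chunk` here, e.g. count words or sample lines ...
}
close(con)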

#First I unzip the file:
#archiveFile <- "Coursera-SwiftKey.zip"
#unzip(archiveFile)
# Once unzipped, I commented out the previous two lines of code.

#Then, I am going to list the folders in the extracted data
list.files(path = "./final")
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
#Then, I will list the files in the en_US folder
list.files(path = "./final/en_US")
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

Task 1.1: Loading the data in RStudio

Reading 3 lines in Blogs:

#This code could be commented out, as I have already sampled this data:

conBlogs <-file("./final/en_US/en_US.blogs.txt", "r") 
Blogs<-readLines(conBlogs, 3,encoding = "UTF-8"); #Reading three lines
Blogs
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."

Reading 3 lines in News:

conNews <-file("./final/en_US/en_US.news.txt", "r") 
News<- readLines(conNews,3,encoding = "UTF-8"); #Reading three lines
News
## [1] "He wasn't home alone, apparently."                                                                                                                                                
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                        
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."

Reading 3 lines in Twitter:

#This code could be commented out, as I have already sampled this data:

conTwitter <-file("./final/en_US/en_US.twitter.txt", "r")
Twitter <-readLines(conTwitter,3,encoding = "UTF-8", skipNul = TRUE)  #Reading three lines
Twitter
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."

Task 1.2: Sampling

To reiterate, to build models you don't need to load in and use all of the data. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation to the results that would be obtained using all the data. Remember your inference class and how a representative sample can be used to infer facts about a population. You might want to create a separate sub-sample dataset by reading in a random subset of the original data and writing it out to a separate file; that way, you can store the sample and not have to recreate it every time. You can use the rbinom function to "flip a biased coin" to determine whether you sample a line of text or not (see the sketch after the sampling code below).

Task to accomplish:

Tokenization - identifying appropriate tokens such as words, punctuation, and numbers, and writing a function that takes a file as input and returns a tokenized version of it. Profanity filtering - removing profanity and other words you do not want to predict.

Tokenization:

Creating a function for tokenization. Input: a file. Output: a tokenized version of it.

#Sampling
set.seed(1424) #Setting the random seed so that the sampling can be reproduced.
#The connections are still open from the chunks above, so this reads the remaining lines of each file.
blogs<-readLines(conBlogs,encoding = "UTF-8", skipNul = TRUE); close(conBlogs)
news<-readLines(conNews,encoding = "UTF-8", skipNul = TRUE); close(conNews)
twitter <-readLines(conTwitter,encoding = "UTF-8", skipNul = TRUE); close(conTwitter)

#Making a function for the sampled data
sampledData<- function(rawData,probability) {
  # Keeps length(rawData)*0.1 lines (about 10% of the data); the line indices are
  # random binomial draws with size = number of lines and the given probability.
  sampledData_ <-rawData[rbinom(length(rawData)*0.1,size =length(rawData), probability )]
}

#For 90% of probability
ReproducedSample_B <- sampledData(blogs,0.9) #BLOGS
ReproducedSample_N <- sampledData(news,0.9) #NEWS
ReproducedSample_T <- sampledData(twitter,0.9) #TWITTER

#For 50% of probability
ReproducedSample_50B <- sampledData(blogs,0.5) #BLOGS
ReproducedSample_50N <- sampledData(news,0.5) #NEWS
ReproducedSample_50T <- sampledData(twitter,0.5) #TWITTER
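
For comparison, the "flip a biased coin" idea from the tip above can also be written as one Bernoulli draw per line, keeping each line independently with probability p. The following is only a sketch with a hypothetical helper name (sampleLinesCoin); it is not the function used for the results in this report:

#Hypothetical alternative: keep each line independently with probability p
sampleLinesCoin <- function(rawData, p = 0.1) {
  keep <- rbinom(length(rawData), size = 1, prob = p) == 1 #one coin flip per line
  rawData[keep]
}
#For example: smallBlogs <- sampleLinesCoin(blogs, 0.1)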

Summarizing information

Regarding "Lines", "Characters", "Words", and "Sentences" from the Original Data and the Sampled Data

countedInfo <- function(data_texts){
  # Length 
  len_Sample<-length(data_texts)
  # Characters
  totChar_Sampled <- sum(count_characters(data_texts))
  # Words
  totWords_Sampled <- sum(count_words(data_texts))
  # Sentences
  toSent_Sampled <-  sum(count_sentences(data_texts))
  return(list(len_Sample,totChar_Sampled,totWords_Sampled,toSent_Sampled))
}
# Original Data without sampling
Total_Blogs <- countedInfo(blogs)
Total_News <- countedInfo(news)
Total_Twitter <- countedInfo(twitter)
#Sampled Data 
Sampled_Blogs <- countedInfo(ReproducedSample_B)
Sampled_News <- countedInfo(ReproducedSample_N)
Sampled_Twitter <- countedInfo(ReproducedSample_T)
#Creating the table, accessing the information with [[i]]
table_Content <- data.frame(Blogs = c(Total_Blogs[[1]],Total_Blogs[[2]],Total_Blogs[[3]],Total_Blogs[[4]]), 
                Sampled_Blogs = c(Sampled_Blogs[[1]],Sampled_Blogs[[2]],Sampled_Blogs[[3]],Sampled_Blogs[[4]]),
                News = c(Total_News[[1]],Total_News[[2]],Total_News[[3]],Total_News[[4]]),
                Sampled_News = c(Sampled_News[[1]],Sampled_News[[2]],Sampled_News[[3]],Sampled_News[[4]]),
                Twitter = c(Total_Twitter[[1]],Total_Twitter[[2]],Total_Twitter[[3]],Total_Twitter[[4]]),
                Sampled_Twitter = c(Sampled_Twitter[[1]],Sampled_Twitter[[2]],Sampled_Twitter[[3]],Sampled_Twitter[[4]]))
# Assigning names to the rows
row.names(table_Content) <- c("Lines", "Characters", "Words", "Sentences")
# Displaying the table
table_Content
##                Blogs Sampled_Blogs      News Sampled_News   Twitter
## Lines         899285         89928   1010239       101023   2360145
## Characters 206823451      21219071 203222790     19891746 162095715
## Words       37546079       3836092  34762332      3407552  30093362
## Sentences    2375708        238868   2024581       201832   3770151
##            Sampled_Twitter
## Lines               236014
## Characters        16105568
## Words              2986873
## Sentences           379893

In the table it is possible to see that about 10% of the total lines are considered. The next step is to tokenize the input words.

# PROFANITY WORDS: https://github.com/RobertJGabriel/Google-profanity-words/blob/master/list.txt
PROFANITY <- readLines("list.txt")
#Split each line of the profanity list into words
split_profa <- str_split(PROFANITY, " ")
#Unlist to get vector of words
UNLIST_PROFA<-unlist(split_profa)

#Function to tokenize words and remove punctuation, stopwords, numbers, and special characters.
tokenizedwords <- function(sample){
  split_Samp <- str_split(sample," ")
  #Unlist to get vector of words
  UNLISTEDSAMP<-unlist(split_Samp)
  ##CLEANING DATA
  #lower case
  UNLISTEDSAMP <- str_to_lower(UNLISTEDSAMP)
  # Remove numbers
  UNLISTEDSAMP <- str_replace_all(UNLISTEDSAMP, pattern = "[[:digit:]]","")
  # Remove dollar signs
  UNLISTEDSAMP <- str_replace_all(UNLISTEDSAMP, pattern= "\\$","")
  # Remove stopwords
  UNLISTEDSAMP<-removeWords(UNLISTEDSAMP, stopwords("en"))
  #Remove profanity words 
  UNLISTEDSAMP<-removeWords(UNLISTEDSAMP, UNLIST_PROFA)
  # Remove punctuation
  UNLISTEDSAMP <- str_replace_all(UNLISTEDSAMP, pattern = "[[:punct:]]","")
  # remove special characters and simple emoticons
  UNLISTEDSAMP <- str_replace_all(UNLISTEDSAMP, pattern = "[%$&=<>@_~`-]|:\\)|=\\)|:\\(|=\\(","")
  #Remove ""
  UNLISTEDSAMP <- UNLISTEDSAMP[UNLISTEDSAMP != ""]
}
BlogExample <- tokenizedwords(ReproducedSample_B)
BlogExample<-sum(count_words(BlogExample))
NewsExample <- tokenizedwords(ReproducedSample_N)
NewsExample<-sum(count_words(NewsExample))
TwExample <- tokenizedwords(ReproducedSample_T)
TwExample<-sum(count_words(TwExample))
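
As a quick, illustrative sanity check (not part of the report's results), the tokenizer can be applied to a short made-up string; it should keep only the lower-cased content words, dropping the stopword, the digit, and the punctuation. The variable name checkTokens is just for this example:

#Illustrative only: expected to leave tokens such as "dogs" "ran" "quickly"
checkTokens <- tokenizedwords("The 3 dogs ran quickly!")
checkTokens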

Once this tokenizing is done, I will review how the word counts change from the total words to the sampled words and then to the output of the tokenizing function. Thus, I will display a table:

Creating the comparison between total words, sampled words, and cleaned words

For blogs:

table_content_Blogs <- data.frame(Blogs = c(Total_Blogs[[3]]),Sampled_Blogs = c(Sampled_Blogs[[3]]),CleanWords_Blog = c(BlogExample[[1]]))
row.names(table_content_Blogs) <- c("Words")
table_content_Blogs
##          Blogs Sampled_Blogs CleanWords_Blog
## Words 37546079       3836092         1949662

For News:

table_content_News <- data.frame(Total_WordsNews = c(Total_News[[3]]),Sampled_News = c(Sampled_News[[3]]), CleanWordsNews = c(NewsExample[[1]]))
row.names(table_content_News) <- c("Words")
table_content_News
##       Total_WordsNews Sampled_News CleanWordsNews
## Words        34762332      3407552        1883110

For Twitter:

table_content_Twitter <- data.frame(Total_WordsTW = c(Total_Twitter[[3]]),Sampled_Twitter = c(Sampled_Twitter[[3]]), CleanWords_TW = c(TwExample[[1]]))
row.names(table_content_Twitter) <- c("Words")
table_content_Twitter
##       Total_WordsTW Sampled_Twitter CleanWords_TW
## Words      30093362         2986873       1655055

As can be seen, in each dataset there is a reduction in words from the sampled data to the tokenized output. This is because the tokenized data does not include stopwords, profanity words, numbers, punctuation, blank tokens, or special characters.

Task 2 - Exploratory Data Analysis

The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships you observe in the data and prepare to build your first linguistic models.

Creating functions for graphs and cloud words

First, I will create the functions needed to plot the graphs.

#Building the unigram model function, I am removing the profanity words and stopwords, too.
unigram <- function(Sample){
  Texte1 <- tokens(Sample, what = "word", remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE, remove_url = TRUE, remove_separators = TRUE, split_hyphens = FALSE)
  Texte1 <- tokens_tolower(Texte1)
  Texte1 <- tokens_select(Texte1,stopwords("en"),selection = "remove")
  Texte1 <- tokens_select(Texte1,UNLIST_PROFA,selection="remove")
  
  texteunigram <- tokens_ngrams(Texte1, n=1)
  return(texteunigram)
}

#Building the bigram model function, I am removing the profanity words and stopwords, too.
bigram <- function(Sample){
  Texte1 <- tokens(Sample, what = "word", remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE, remove_url = TRUE, remove_separators = TRUE, split_hyphens = FALSE)
  Texte1 <- tokens_tolower(Texte1)
  Texte1 <- tokens_select(Texte1,stopwords("en"),selection = "remove")
  Texte1 <- tokens_select(Texte1,UNLIST_PROFA,selection="remove")
  
  textebigram <- tokens_ngrams(Texte1, n=2)
  return(textebigram)
}
#Building the trigram model function, I am removing the profanity words and stopwords, too
trigram <- function(Sample){
  Texte1 <- tokens(Sample, what = "word", remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE, remove_url = TRUE, remove_separators = TRUE, split_hyphens = FALSE)
  Texte1 <- tokens_tolower(Texte1)
  Texte1 <- tokens_select(Texte1,stopwords("en"),selection = "remove")
  Texte1 <- tokens_select(Texte1,UNLIST_PROFA,selection="remove")
  textetrigram <- tokens_ngrams(Texte1, n=3)
  return(textetrigram)
}
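# Side note (a sketch only, not used for the results below): the three functions
# above differ only in the value of n, so they could be collapsed into a single
# hypothetical helper such as:
ngramTokens <- function(Sample, n = 1){
  Texte1 <- tokens(Sample, what = "word", remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE, remove_url = TRUE, remove_separators = TRUE, split_hyphens = FALSE)
  Texte1 <- tokens_tolower(Texte1)
  Texte1 <- tokens_select(Texte1,stopwords("en"),selection = "remove")
  Texte1 <- tokens_select(Texte1,UNLIST_PROFA,selection="remove")
  tokens_ngrams(Texte1, n = n) #e.g. ngramTokens(sample, 2) gives the same tokens as bigram(sample)
}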
#Reading a list of frequently repeated words that I want to exclude from the unigram plots
repetWords <- readLines("txt.txt")
## Warning in readLines("txt.txt"): incomplete final line found on 'txt.txt'
repetWords <- str_split(repetWords, " ")
unlistrep <- unlist(repetWords)
repeated_words <- c(unlistrep,"p.m","lol", "rt") #lol and rt are excluded for Twitter as they are in the top five; I want to give other words a chance to appear rather than those

#Reading a list of repeated 2-grams that I want to exclude from the bigram and trigram plots
repetbiWords <- readLines("txt_.txt")
## Warning in readLines("txt_.txt"): incomplete final line found on 'txt_.txt'
repetbiWords <- str_split(repetbiWords, " ")
unlistrepbi <- unlist(repetbiWords)
repeated_biwords <- c(unlistrepbi,"p.m","even_though", "can_get", "can_make", "year_old", "one_day","can_see","feel_like", "can_find", "make_sure","just_get","just_like","can_find", "every_day")  #I chose these words because I did some trials to see the frequencies and to avoid stopwords.


#Creating the frequencies for unigram (function)
unig<-function(sample){
  SampleBB_uni <- unigram(sample)
  # Top five unigram features after removing the words in repeated_words
  topUnigram <- textstat_frequency(dfm(SampleBB_uni, tolower = TRUE, remove = repeated_words), n = 5)
  return(topUnigram)
}
#Creating the frequencies for bigram (function)
big<-function(sample){
  SampleBB_bi <- bigram(sample)
  # Top five bigram features after removing the words in repeated_words
  topBigram <- textstat_frequency(dfm(SampleBB_bi, tolower = TRUE, remove = repeated_words), n = 5)
  return(topBigram)
}
#Creating the frequencies for trigram (function)
trig<-function(sample){
  SampleBB_tri <- trigram(sample)
  # Top five trigram features after removing the words in repeated_words
  topTrigram <- textstat_frequency(dfm(SampleBB_tri, tolower = TRUE, remove = repeated_words), n = 5)
  return(topTrigram)
}

#Creating the graphs of frequencies for unigram model
graphsFreqUni <-function(sampled__){
  UniBlogs <-unig(sampled__)
  UniBlogs$feature <- with(UniBlogs, reorder(feature, -frequency))
  UniBlogs<-ggplot(UniBlogs, aes(x = feature, y = frequency)) + geom_point() + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  return(UniBlogs)
}
#Creating the graphs of frequencies for bigram model
graphsFreqBi <-function(sampled__){
  BiBlogs <-big(sampled__)
  BiBlogs$feature <- with(BiBlogs, reorder(feature, -frequency))
  BiBlogs<-ggplot(BiBlogs, aes(x = feature, y = frequency)) + geom_point() + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  return(BiBlogs)
}
#Creating the graphs of frequencies for trigram model
graphsFreqtri <-function(sampled__){
  TriBlogs <-trig(sampled__)
  TriBlogs$feature <- with(TriBlogs, reorder(feature, -frequency))
  TriBlogs<-ggplot(TriBlogs, aes(x = feature, y = frequency)) + geom_point() + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  return(TriBlogs)
}
#Creating the cloud words function for the unigram model
cloudesplotUni <-function(sample){
  SampleBB_uni <- unigram(sample)
  textplot_wordcloud(dfm(SampleBB_uni, tolower = TRUE, remove = repeated_words), max_words = 20, color = rev(RColorBrewer::brewer.pal(4, "RdBu")),random_order = FALSE, random_color = TRUE,fixed_aspect = TRUE)
}

#Creating the cloud words function for the bigram model
cloudesplotbi <-function(sample){
  SampleBB_bi <- bigram(sample)
  textplot_wordcloud(dfm(SampleBB_bi, tolower = TRUE, remove = repeated_biwords), max_words = 20, color = rev(RColorBrewer::brewer.pal(4, "RdBu")),random_order = FALSE, random_color = TRUE,fixed_aspect = TRUE)
}
#Creating the cloud words function for the trigram model
cloudesplottri <-function(sample){
  SampleBB_tri <- trigram(sample)
  textplot_wordcloud(dfm(SampleBB_tri, tolower = TRUE, remove = repeated_biwords), max_words = 20, color = rev(RColorBrewer::brewer.pal(4, "RdBu")),random_order = FALSE, random_color = TRUE,fixed_aspect = TRUE)
}

In the following, I plot the graphs for the 1-grams, 2-grams, and 3-grams, considering the 90% and 50% probability samples.

Plotting tables for each n-gram: 90% Probability

Table 1-gram for Blogs:

Table1gram<-unig(ReproducedSample_B)
Table1gram
##   feature frequency rank docfreq group
## 1    time      8850    1    7159   all
## 2  people      7234    2    5599   all
## 3    good      4929    3    4501   all
## 4     day      4898    4    4070   all
## 5    love      4662    5    4385   all

Table 2-gram for Blogs:

Table2gram<-big(ReproducedSample_B)
Table2gram
##      feature frequency rank docfreq group
## 1    can_get       913    1     881   all
## 2    one_day       671    2     671   all
## 3 last_night       659    3     659   all
## 4 first_time       643    4     505   all
## 5   year_old       637    5     595   all

Table 3-gram for Blogs:

Table3gram<-trig(ReproducedSample_B)
Table3gram
##                     feature frequency rank docfreq group
## 1            new_york_times       344    1     247   all
## 2          three_four_times       244    2     244   all
## 3           story_good_true       240    3     120   all
## 4 complimentary_toward_next       236    4     118   all
## 5              can_get_free       232    5     232   all

Plotting these frequencies for Blogs:

#For 90% of probability
GraphBlogs<- graphsFreqUni(ReproducedSample_B)
fig1<-ggplotly(GraphBlogs)
#CloudBlogs <- cloudesplotUni(ReproducedSample_B) #Only plot this if you are willing to spend the extra time rendering each cloud

GraphBlogsBi<- graphsFreqBi(ReproducedSample_B)
fig2<-ggplotly(GraphBlogsBi)
#CloudBlogBis <- cloudesplotbi(ReproducedSample_B)

GraphBlogsTri<- graphsFreqtri(ReproducedSample_B)
fig3<-ggplotly(GraphBlogsTri)
#CloudBlogBis <- cloudesplottri(ReproducedSample_B)

GraphsBlogs <-subplot(fig1,fig2,fig3)
GraphsBlogs

Table 1-gram for News:

Table1gramN<-unig(ReproducedSample_N)
Table1gramN
##   feature frequency rank docfreq group
## 1    time      4767    1    4698   all
## 2    year      4533    2    4385   all
## 3 percent      4096    3    2673   all
## 4  people      4045    4    3464   all
## 5    game      3800    5    3277   all

Table 2-gram for News:

Table2gramN<-big(ReproducedSample_N)
Table2gramN
##          feature frequency rank docfreq group
## 1     new_jersey      1010    1     992   all
## 2    health_care       965    2     677   all
## 3      last_year       810    3     793   all
## 4       st_louis       787    4     785   all
## 5 officials_said       767    5     767   all

Table 3-gram for News:

Table3gramN<-trig(ReproducedSample_N)
Table3gramN
##                    feature frequency rank docfreq group
## 1         tunnel_feet_wide       348    1     116   all
## 2          feet_wide_miles       348    1     116   all
## 3          wide_miles_long       348    1     116   all
## 4 national_weather_service       284    4     143   all
## 5        health_care_costs       283    5     283   all

Plotting these frequencies for News:

GraphNews<- graphsFreqUni(ReproducedSample_N)
fig1N <- ggplotly(GraphNews)
#CloudNews <- cloudesplotUni(ReproducedSample_N)

GraphNewsBi<- graphsFreqBi(ReproducedSample_N)
fig2N<-ggplotly(GraphNewsBi)
#CloudNewsBi <- cloudesplotbi(ReproducedSample_N)

GraphNewsTri<- graphsFreqtri(ReproducedSample_N)
fig3N<-ggplotly(GraphNewsTri)
#CloudNewsBis <- cloudesplottri(ReproducedSample_N)
GraphsNews <-subplot(fig1N,fig2N,fig3N)
GraphsNews

Table 1-gram for Twitter:

Table1gramT<-unig(ReproducedSample_T)
Table1gramT
##   feature frequency rank docfreq group
## 1     day     11024    1   10728   all
## 2    love     10623    2   10264   all
## 3    good      8091    3    8030   all
## 4    time      6638    4    6581   all
## 5   today      6586    5    6586   all

Table 2-gram for Twitter:

Table2gramT<-big(ReproducedSample_T)
Table2gramT
##           feature frequency rank docfreq group
## 1       high_high      2869    1     151   all
## 2       right_now      1692    2    1692   all
## 3 looking_forward      1563    3    1563   all
## 4         one_day      1284    4    1284   all
## 5    good_morning      1044    5    1044   all

Table 3-gram for Twitter:

Table3gramT<-trig(ReproducedSample_T)
Table3gramT
##                  feature frequency rank docfreq group
## 1         high_high_high      2718    1     151   all
## 2         happy_new_year       552    2     552   all
## 3         clap_clap_clap       531    3     177   all
## 4      happy_mothers_day       496    4     496   all
## 5 looking_forward_seeing       438    5     438   all

Plotting these frequencies for Twitter:

GraphTwitter<- graphsFreqUni(ReproducedSample_T)
fig1T<-ggplotly(GraphTwitter)
#CloudTwitter <- cloudesplotUni(ReproducedSample_T)

GraphTwBi<- graphsFreqBi(ReproducedSample_T)
fig2T<-ggplotly(GraphTwBi)
#CloudTwBi <- cloudesplotbi(ReproducedSample_T)

GraphTwTri<- graphsFreqtri(ReproducedSample_T)
fig3T<-ggplotly(GraphTwTri)
#CloudTwBis <- cloudesplottri(ReproducedSample_T)
GraphsTW <-subplot(fig1T,fig2T,fig3T)
GraphsTW

As displayed, the main 1-grams are words such as day and time, and phrases such as new york appear among the top 2-grams and 3-grams.

Plotting tables for each n-gram: 50% Probability

First, for the 50% probability sample, I will plot the tables for each n-gram.

Table 1-gram for Blogs:

Table1gram50<-unig(ReproducedSample_50B)
Table1gram50
##   feature frequency rank docfreq group
## 1    time      7588    1    6515   all
## 2  people      6126    2    4831   all
## 3     day      5604    3    4439   all
## 4    love      5022    4    3658   all
## 5    life      4405    5    3776   all

Table 2-gram for Blogs:

Table2gram50<-big(ReproducedSample_50B)
Table2gram50
##         feature frequency rank docfreq group
## 1 mister_rogers       960    1      64   all
## 2    little_boy       772    2     188   all
## 3     years_ago       709    3     705   all
## 4    last_night       594    4     502   all
## 5     big_sword       576    5      64   all

Table 3-gram for Blogs:

Table3gram50<-trig(ReproducedSample_50B)
Table3gram50
##                      feature frequency rank docfreq group
## 1             little_boy_big       384    1      64   all
## 2              boy_big_sword       384    1      64   all
## 3      gaston_south_carolina       250    3      25   all
## 4 south_carolina_attractions       250    3      25   all
## 5    creative_kuts_scrapping       189    5      63   all

Plotting these frequencies for Blogs:

GraphBlogs50<- graphsFreqUni(ReproducedSample_50B)
fig150B<-ggplotly(GraphBlogs50)
#CloudBlogs <- cloudesplotUni(ReproducedSample_50B) #Only plot this if you have the time

GraphBlogs50Bi<- graphsFreqBi(ReproducedSample_50B)
fig250B<-ggplotly(GraphBlogs50Bi)
#CloudBlogBis <- cloudesplotbi(ReproducedSample_50B)

GraphBlogs50Tri<- graphsFreqtri(ReproducedSample_50B)
fig350B<-ggplotly(GraphBlogs50Tri)
#CloudBlogBis <- cloudesplottri(ReproducedSample_50B)

GraphsBlogs50 <-subplot(fig150B,fig250B,fig350B)
GraphsBlogs50

Table 1-gram for News:

Table1gram50N<-unig(ReproducedSample_50N)
Table1gram50N
##   feature frequency rank docfreq group
## 1    year      5237    1    4990   all
## 2    time      5195    2    4825   all
## 3  people      5184    3    4669   all
## 4   years      4246    4    3854   all
## 5    game      3667    5    3033   all

Table 2-gram:

Table2gram50N<-big(ReproducedSample_50N)
Table2gram50N
##       feature frequency rank docfreq group
## 1    new_york      1175    1    1100   all
## 2   last_year      1146    2    1146   all
## 3    st_louis       960    3     865   all
## 4   last_week       670    4     670   all
## 5 high_school       645    5     643   all

Table 3-gram:

Table3gram50N<-trig(ReproducedSample_50N)
Table3gram50N
##                           feature frequency rank docfreq group
## 1                   new_york_city       238    1     238   all
## 2                 st_louis_county       193    2     193   all
## 3      county_prosecutor's_office       183    3     183   all
## 4                  really_good_us       162    4      81   all
## 5 superintendent_special_services       158    5      79   all

Plotting these frequencies for News:

Graph50News<- graphsFreqUni(ReproducedSample_50N)
fig150N<-ggplotly(Graph50News)
#CloudNews <- cloudesplotUni(ReproducedSample_50N)

GraphNews50Bi<- graphsFreqBi(ReproducedSample_50N)
fig250N<-ggplotly(GraphNews50Bi)
#CloudNewsBi <- cloudesplotbi(ReproducedSample_50N)

GraphNews50Tri<- graphsFreqtri(ReproducedSample_50N)
fig350N<-ggplotly(GraphNews50Tri)
#CloudNewsBis <- cloudesplottri(ReproducedSample_50N)

GraphsNews50 <-subplot(fig150N,fig250N,fig350N)
GraphsNews50

Table 1-gram for Twitter:

Table1gram50T<-unig(ReproducedSample_50T)
Table1gram50T
##   feature frequency rank docfreq group
## 1    love     10050    1    9630   all
## 2   today      9360    2    9170   all
## 3    good      9208    3    8919   all
## 4     day      8814    4    8590   all
## 5   great      8558    5    8027   all

Table 2-gram for Twitter:

Table2gram50T<-big(ReproducedSample_50T)
Table2gram50T
##           feature frequency rank docfreq group
## 1       right_now      1472    1    1472   all
## 2 looking_forward      1365    2    1365   all
## 3      last_night      1286    3    1286   all
## 4       next_week      1095    4    1095   all
## 5   thanks_follow       981    5     981   all

Table 3-gram for Twitter:

Table3gram50T<-trig(ReproducedSample_50T)
Table3gram50T
##                  feature frequency rank docfreq group
## 1            let_us_know       305    1     305   all
## 2         go_night_night       228    2     114   all
## 3     happy_mother's_day       216    3     216   all
## 4               hi_hi_hi       212    4     106   all
## 5 store_called_regarding       186    5      93   all

Plotting these frequencies for Twitter:

Graph50Twitter<- graphsFreqUni(ReproducedSample_50T)
fig150T<-ggplotly(Graph50Twitter)
#CloudTwitter <- cloudesplotUni(ReproducedSample_50T) 

GraphTwB50i<- graphsFreqBi(ReproducedSample_50T)
fig250T<-ggplotly(GraphTwB50i)
#CloudTwBi <- cloudesplotbi(ReproducedSample_50T)

GraphTw50Tri<- graphsFreqtri(ReproducedSample_50T)
fig350T<-ggplotly(GraphTw50Tri)
#CloudTwBis <- cloudesplottri(ReproducedSample_50T)
GraphsTW50 <-subplot(fig150T,fig250T,fig350T)
GraphsTW50

Cloud plots

Finally, just for illustration, I will display the word cloud plots for the Twitter sample:

CloudTwitter <- cloudesplotUni(ReproducedSample_50T)

CloudTwBi <- cloudesplotbi(ReproducedSample_50T)

CloudTwBis <- cloudesplottri(ReproducedSample_50T)
## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## follow_let_know could not be fit on page. It will not be plotted.
## Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
## good_morning_world could not be fit on page. It will not be plotted.

Conclusions

The words time, people, and day are the most frequent 1-grams, whereas new york and last year stand out among the 2-grams. For the 3-grams there is more variety in every dataset.

By using the English stopword list (stopwords("en")) there is also a way to spot words that come from a foreign language. Regarding the question of how to increase the coverage – identifying words that may not be in the corpora, or using a smaller number of words in the dictionary to cover the same number of phrases – a few things help: printing the most frequent words and looking for additional stopwords; filtering the words by considering the profanity list; and removing words by checking them against another database.
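
As a rough way to quantify coverage, one could sort the word frequencies and count how many unique words are needed to cover a given share of all word instances. The following is only a sketch with a hypothetical helper name (coverage_words); it can be applied to any character vector of tokens, such as the output of the tokenizedwords function above:

#Hypothetical helper: how many unique words cover `target` share of all tokens?
coverage_words <- function(tokens_vec, target = 0.5) {
  freqs <- sort(table(tokens_vec), decreasing = TRUE) #word frequencies, most frequent first
  cum_share <- cumsum(freqs) / sum(freqs)             #cumulative share of all word instances
  which(cum_share >= target)[1]                       #number of words needed to reach the target
}
#For example: coverage_words(tokenizedwords(ReproducedSample_B), 0.9)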

Main findings and future steps:

The data need to be cleaned with more filters than the libraries alone provide; for instance, there are some words that the stopword lists do not cover.

The top words for the 50% and 90% probability samples are similar.

In case of trouble installing the RWeka package on a Mac, the best option I found was to use the quanteda library.

Another thing to consider is that the stopwords matter when using 2-grams or 3-grams, since I found that many of those n-grams were formed by stopwords.

Finally, the code should be optimized so it can be reused in the Shiny web app.