Assignment

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Loading the data

The required libraries are loaded, as well as the data (if it was not already present in the expected folder).
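The library calls themselves are not shown in the report; a plausible set, inferred from the startup messages below and the functions used later (read_lines, VCorpus, tm_map, NGramTokenizer, ggplot), would be the following sketch. The wordcloud package in particular is an assumption, inferred only from the RColorBrewer message it triggers.

library(tm)          # text mining; loads NLP
library(wordcloud)   # assumed; loads RColorBrewer
library(data.table)  # appears in the conflict messages below
library(RWeka)       # provides NGramTokenizer() and Weka_control()
library(tidyverse)   # readr::read_lines(), dplyr, ggplot2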

## Loading required package: NLP
## Loading required package: RColorBrewer
## -- Attaching packages ----------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0     v dplyr   0.8.5
## v tibble  3.0.0     v stringr 1.4.0
## v tidyr   1.0.2     v forcats 0.5.0
## v purrr   0.3.3
## -- Conflicts -------------------------------------------------------------- tidyverse_conflicts() --
## x ggplot2::annotate() masks NLP::annotate()
## x dplyr::between()    masks data.table::between()
## x dplyr::filter()     masks stats::filter()
## x dplyr::first()      masks data.table::first()
## x dplyr::lag()        masks stats::lag()
## x dplyr::last()       masks data.table::last()
## x purrr::transpose()  masks data.table::transpose()

The working directory is set to the folder containing the three files in US English, and the three files are read and stored.

setwd("../final/en_US/")
# Select only files in English
en_US_blogs   <- "en_US.blogs.txt"
en_US_news    <- "en_US.news.txt"
en_US_twitter <- "en_US.twitter.txt"  

# Read the lines of each file
blogs   <- read_lines(en_US_blogs)
news    <- read_lines(en_US_news)
twitter <- read_lines(en_US_twitter)

Data summary

Basic statistics are computed on the data, such as the number of lines per file and the number of characters per line:

blogs_lines   <- length(blogs)
news_lines    <- length(news) 
twitter_lines <- length(twitter)

total_lines   <- blogs_lines + news_lines + twitter_lines

# Determine the size (number of characters) of each line of each file
blogs_nchar   <- nchar(blogs)
news_nchar    <- nchar(news)
twitter_nchar <- nchar(twitter)
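
These counts can be collected into a single summary table; the following is a sketch (not in the original report) that uses only the objects computed above:

# Hypothetical summary table of the three files
file_summary <- data.frame(
  file       = c("blogs", "news", "twitter"),
  lines      = c(blogs_lines, news_lines, twitter_lines),
  characters = c(sum(blogs_nchar), sum(news_nchar), sum(twitter_nchar))
)
file_summary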

In total, the imported files contain 4,269,678 lines. The following is a summary of the blogs object, to give an idea of the size and type of the data that was loaded.

summary(blogs)
##    Length     Class      Mode 
##    899288 character character

To get a grasp of the content of the files without overloading the machine, we subset each dataset to 1% of its original lines and combine the three samples into a single variable, called repo_sample. This smaller sample is enough to explore the key figures of the dataset.

# Create a subsample of 1% of the lines of each file
sample_pct <- 0.01
set.seed(1001)

# Create samples (take from each file the fraction of lines calculated above)
blogs_sample   <- sample(blogs, blogs_lines * sample_pct)
news_sample    <- sample(news, news_lines * sample_pct)
twitter_sample <- sample(twitter, twitter_lines * sample_pct)
# Combine the three character vectors into a single vector
repo_sample    <- c(blogs_sample, news_sample, twitter_sample)
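
A quick sanity check of the resulting sample (a sketch, not shown in the original report) could be:

length(repo_sample)                             # number of sampled lines
format(object.size(repo_sample), units = "Mb")  # approximate memory footprint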

Cleaning the data

The sample is then converted into a corpus and cleaned in order to:

* remove the stopwords from the "english" and "SMART" lists
* remove web addresses (URLs)
* remove punctuation and numbers
* convert everything to lowercase
* remove profanity (see the references for the word list used)
* strip extra whitespace

# Cleaning the sample data
clean_sample <- VCorpus(VectorSource(repo_sample))

# Remove stopwords  
clean_sample <- tm_map(clean_sample, removeWords, stopwords("english"))
clean_sample <- tm_map(clean_sample, removeWords, stopwords("SMART"))

# Remove URL's  
# Source: [R and Data Mining]("http://www.rdatamining.com/books/rdm/faq/removeurlsfromtext")
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
clean_sample <- tm_map(clean_sample, content_transformer(removeURL))

# Remove anything other than English letters or space
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
clean_sample <- tm_map(clean_sample, content_transformer(removeNumPunct))

#clean_sample <- tm_map(clean_sample, content_transformer(removePunctuation))

# Transform to lowercase
clean_sample <- tm_map(clean_sample, content_transformer(tolower))

# Remove profanities
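# NOTE: profanityFilePath is not defined in the code shown above; it is assumed
# to point to a plain-text profanity word list with one term per line
# (see the References section for the dataset used).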
profanity <- read.table(profanityFilePath, header = FALSE, sep ="\n")
clean_sample <- tm_map(clean_sample, removeWords, profanity[,1])

# Remove Whitespace  
clean_sample <- tm_map(clean_sample, stripWhitespace)


# Remove stopwords again (now that the text is lowercase, this catches
# capitalised stopwords missed in the first pass)
clean_sample <- tm_map(clean_sample, removeWords, stopwords("english"))
clean_sample <- tm_map(clean_sample, removeWords, stopwords("SMART"))

The data is then split into tokens of one (unigrams), two (bigrams), or three (trigrams) consecutive words.

cleanData<-data.frame(text=unlist(sapply(clean_sample, `[`, "content")), stringsAsFactors=F)

unigram_tokenised <- NGramTokenizer(cleanData, Weka_control(min = 1, max = 1))
bigram_tokenised  <- NGramTokenizer(cleanData, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
trigram_tokenised <- NGramTokenizer(cleanData, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))

Data exploration

The unigrams, bigrams and trigrams are then tabulated to find the most frequent terms in each.

unigram <- data.frame(table(unigram_tokenised))
bigram <- data.frame(table(bigram_tokenised))
trigram <- data.frame(table(trigram_tokenised))
unigram_sorted <- unigram[order(unigram$Freq,decreasing = TRUE),]
bigram_sorted <- bigram[order(bigram$Freq,decreasing = TRUE),]
trigram_sorted <- trigram[order(trigram$Freq,decreasing = TRUE),]
unigram_freq <- unigram_sorted[1:15,]
colnames(unigram_freq) <- c("Word","Frequency")
bigram_freq <- bigram_sorted[1:15,]
colnames(bigram_freq) <- c("Word","Frequency")
trigram_freq <- trigram_sorted[1:15,]
colnames(trigram_freq) <- c("Word","Frequency")

The results are plotted as bar charts:

ggplot(unigram_freq, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "green", colour = "pink") + geom_text(aes(label = Frequency), vjust = -0.1) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(bigram_freq, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "white", colour = "green") + geom_text(aes(label = Frequency), vjust = -0.1) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(trigram_freq, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "red", colour = "black") + geom_text(aes(label = Frequency), vjust = -0.1) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Conclusions

After this first, partial analysis, the most frequent unigrams, bigrams and trigrams in the sample are:

##         Word Frequency
## 47175   time      2200
## 11328    day      1795
## 19178   good      1782
## 34446 people      1625
## 27526   love      1592
## 3079    back      1495
## 52404   year      1485
## 28071   make      1345
## 19559  great      1234
## 47349  today      1179
## 51859   work      1026
## 26798   life       972
## 52417  years       951
## 40014     rt       876
## 27897   made       853
##                  Word Frequency
## 159249    high school       139
## 392826      years ago       132
## 153281 happy birthday       121
## 326318       st louis        96
## 143509   good morning        85
## 319927   social media        77
## 143466      good luck        75
## 127560    follow back        74
## 203078    los angeles        73
## 367535  united states        71
## 202287      long time        65
## 300656  san francisco        62
## 300654      san diego        56
## 146244      great day        45
## 250199     past years        45
##                            Word Frequency
## 168784        happy mothers day        23
## 138740     follow follow follow        20
## 296589   president barack obama        17
## 429360             world war ii        17
## 61420             cinco de mayo        15
## 168783         happy mother day        13
## 18831   attorney general office        10
## 46889            cake cake cake        10
## 158524       gov chris christie        10
## 360065          st louis county        10
## 197243  italy vacation packages         9
## 250242  montreal italy vacation         9
## 58665   chief executive officer         8
## 79364  county prosecutor office         8
## 101457 district attorney office         8
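
Finally, as a hint of how these frequency tables will feed the eventual prediction algorithm, the sketch below (a hypothetical helper, not part of this report's pipeline) looks up the most frequent third word following a given two-word prefix in the trigram table:

# Hypothetical helper: given the first two words, return the most frequent
# third word observed in the sorted trigram table
predict_next_word <- function(first_two, trigrams = trigram_sorted) {
  prefix  <- paste0("^", first_two, " ")
  matches <- trigrams[grepl(prefix, trigrams$trigram_tokenised), ]
  if (nrow(matches) == 0) return(NA_character_)
  # trigram_sorted is already ordered by frequency, so take the first match
  # and return its last word
  top <- as.character(matches$trigram_tokenised[1])
  tail(strsplit(top, " ")[[1]], 1)
}

# e.g. predict_next_word("happy mothers") would return "day"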

References