The goal of this project is to show that you have become familiar with the data and that you are on track to create your prediction algorithm. Please submit a report on RPubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise, explaining only the major features of the data you have identified and briefly summarizing your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:
1. Demonstrate that you have downloaded the data and successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
The required libraries are loaded, and the data is downloaded if it is not already in the expected folder.
## Loading required package: NLP
## Loading required package: RColorBrewer
## -- Attaching packages ----------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0 v dplyr 0.8.5
## v tibble 3.0.0 v stringr 1.4.0
## v tidyr 1.0.2 v forcats 0.5.0
## v purrr 0.3.3
## -- Conflicts -------------------------------------------------------------- tidyverse_conflicts() --
## x ggplot2::annotate() masks NLP::annotate()
## x dplyr::between() masks data.table::between()
## x dplyr::filter() masks stats::filter()
## x dplyr::first() masks data.table::first()
## x dplyr::lag() masks stats::lag()
## x dplyr::last() masks data.table::last()
## x purrr::transpose() masks data.table::transpose()
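The setup chunk itself is not shown above; the sketch below illustrates what it might look like, assuming the standard Coursera SwiftKey download URL and the ../final/en_US folder used in the next step (both assumptions, not the original code).
# Minimal sketch of the setup chunk (assumed, not the original code):
# load the libraries used below and download the data set if it is missing.
library(tm)          # text mining framework (loads NLP)
library(RWeka)       # NGramTokenizer()
library(wordcloud)   # loads RColorBrewer
library(data.table)
library(tidyverse)   # read_lines(), dplyr, ggplot2, ...

# Assumed download URL for the Coursera SwiftKey data set
data_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!dir.exists("../final/en_US")) {
  download.file(data_url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip", exdir = "..")   # extracts into ../final/...
}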
The folder containing the three US English files is selected, and the files are read and stored.
setwd("../final/en_US/")
# Select only files in English
en_US_blogs <- "en_US.blogs.txt"
en_US_news <- "en_US.news.txt"
en_US_twitter <- "en_US.twitter.txt"
# Read the three files line by line
blogs <- read_lines(en_US_blogs)
news <- read_lines(en_US_news)
twitter <- read_lines(en_US_twitter)
Basic statistics are computed on the data, such as the number of lines and the number of characters per line:
blogs_lines <- length(blogs)
news_lines <- length(news)
twitter_lines <- length(twitter)
total_lines <- blogs_lines + news_lines + twitter_lines
# Determine the number of characters in each line of each file
blogs_nchar <- nchar(blogs)
news_nchar <- nchar(news)
twitter_nchar <- nchar(twitter)
In total, the imported files have 4,269,678 lines. The following is a summary of the blogs object (en_US.blogs.txt), to give an idea of how the files are structured.
summary(blogs)
## Length Class Mode
## 899288 character character
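As a possible next step for these basic statistics, the line and character counts can be gathered into a single summary table. The sketch below is not part of the original output and uses only the objects computed above.
# Sketch: gather the basic statistics into one summary table
file_summary <- data.frame(
  file = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
  lines = c(blogs_lines, news_lines, twitter_lines),
  characters = c(sum(blogs_nchar), sum(news_nchar), sum(twitter_nchar)),
  longest_line = c(max(blogs_nchar), max(news_nchar), max(twitter_nchar))
)
file_summary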
To get a grasp of the content of the files without overloading the machine, we subset each dataset to 1% of its original lines and combine the results into a single variable, called repo_sample. This sample is used for the rest of the analysis.
# Create a subsample of 1% of the lines of each file
sample_pct = 0.01
set.seed(1001)
# Create samples (take for each file a smaller sample of lines as calculated above)
blogs_sample <- sample(blogs, blogs_lines * sample_pct)
news_sample <- sample(news, news_lines * sample_pct)
twitter_sample <- sample(twitter, twitter_lines * sample_pct)
# Combine the three samples into a single character vector
repo_sample <- c(blogs_sample, news_sample, twitter_sample)
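A quick sanity check (a sketch, not part of the original output) confirms that the combined sample is close to 1% of the full data:
# Sketch: check the size of the combined sample against the full data
length(repo_sample)                # number of sampled lines
length(repo_sample) / total_lines  # should be approximately 0.01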
The sample is then converted into a corpus and cleaned in order to:
* remove the stop words from the "english" and "SMART" lists
* remove web addresses
* remove punctuation, numbers, and anything else that is not a letter or a space
* convert the text to lower case
* remove profanity (see the references for the word list used)
# Cleaning the sample data
clean_sample <- VCorpus(VectorSource(repo_sample))
# Remove stopwords
clean_sample <- tm_map(clean_sample, removeWords, stopwords("english"))
clean_sample <- tm_map(clean_sample, removeWords, stopwords("SMART"))
# Remove URL's
# Source: [R and Data Mining]("http://www.rdatamining.com/books/rdm/faq/removeurlsfromtext")
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
clean_sample <- tm_map(clean_sample, content_transformer(removeURL))
# Remove anything other than English letters or space
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
clean_sample <- tm_map(clean_sample, content_transformer(removeNumPunct))
#clean_sample <- tm_map(clean_sample, content_transformer(removePunctuation))
# Transform to lowercase
clean_sample <- tm_map(clean_sample, content_transformer(tolower))
# Remove profanity (profanityFilePath points to the profanity word list cited in the references)
profanity <- read.table(profanityFilePath, header = FALSE, sep ="\n")
clean_sample <- tm_map(clean_sample, removeWords, profanity[,1])
# Remove Whitespace
clean_sample <- tm_map(clean_sample, stripWhitespace)
# Remove stop words again, now that the text is lower case
clean_sample <- tm_map(clean_sample, removeWords, stopwords("english"))
clean_sample <- tm_map(clean_sample, removeWords, stopwords("SMART"))
The data is then split into tokens of one (unigrams), two (bigrams), and three (trigrams) words.
# Flatten the cleaned corpus back into a plain-text data frame
cleanData <- data.frame(text = unlist(sapply(clean_sample, `[`, "content")), stringsAsFactors = FALSE)
# Tokenise into unigrams, bigrams and trigrams
unigram_tokenised <- NGramTokenizer(cleanData, Weka_control(min = 1, max = 1))
bigram_tokenised <- NGramTokenizer(cleanData, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
trigram_tokenised <- NGramTokenizer(cleanData, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))
The unigram, bigram, and trigram tables are then sorted to find the most frequent terms of each order.
# Count the frequency of each n-gram
unigram <- data.frame(table(unigram_tokenised))
bigram <- data.frame(table(bigram_tokenised))
trigram <- data.frame(table(trigram_tokenised))
# Sort the tables by decreasing frequency
unigram_sorted <- unigram[order(unigram$Freq, decreasing = TRUE), ]
bigram_sorted <- bigram[order(bigram$Freq, decreasing = TRUE), ]
trigram_sorted <- trigram[order(trigram$Freq, decreasing = TRUE), ]
# Keep the 15 most frequent n-grams of each order
unigram_freq <- unigram_sorted[1:15, ]
colnames(unigram_freq) <- c("Word", "Frequency")
bigram_freq <- bigram_sorted[1:15, ]
colnames(bigram_freq) <- c("Word", "Frequency")
trigram_freq <- trigram_sorted[1:15, ]
colnames(trigram_freq) <- c("Word", "Frequency")
The 15 most frequent n-grams of each order are plotted as bar charts:
ggplot(unigram_freq, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "green", colour = "pink") + geom_text(aes(label = Frequency), vjust = -0.1) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(bigram_freq, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "white", colour = "green") + geom_text(aes(label = Frequency), vjust = -0.1) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(trigram_freq, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "red", colour = "black") + geom_text(aes(label = Frequency), vjust = -0.1) + theme(axis.text.x = element_text(angle = 45, hjust = 1))
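Since RColorBrewer is already loaded (see the setup messages), the unigram frequencies could also be displayed as a word cloud; a sketch, assuming the wordcloud package is available:
# Sketch: word cloud of the most frequent unigrams
library(wordcloud)
wordcloud(words = as.character(unigram_sorted$unigram_tokenised),
          freq = unigram_sorted$Freq,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))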
After this first, partial analysis, the most frequent unigrams, bigrams, and trigrams in the sample are:
## Word Frequency
## 47175 time 2200
## 11328 day 1795
## 19178 good 1782
## 34446 people 1625
## 27526 love 1592
## 3079 back 1495
## 52404 year 1485
## 28071 make 1345
## 19559 great 1234
## 47349 today 1179
## 51859 work 1026
## 26798 life 972
## 52417 years 951
## 40014 rt 876
## 27897 made 853
## Word Frequency
## 159249 high school 139
## 392826 years ago 132
## 153281 happy birthday 121
## 326318 st louis 96
## 143509 good morning 85
## 319927 social media 77
## 143466 good luck 75
## 127560 follow back 74
## 203078 los angeles 73
## 367535 united states 71
## 202287 long time 65
## 300656 san francisco 62
## 300654 san diego 56
## 146244 great day 45
## 250199 past years 45
## Word Frequency
## 168784 happy mothers day 23
## 138740 follow follow follow 20
## 296589 president barack obama 17
## 429360 world war ii 17
## 61420 cinco de mayo 15
## 168783 happy mother day 13
## 18831 attorney general office 10
## 46889 cake cake cake 10
## 158524 gov chris christie 10
## 360065 st louis county 10
## 197243 italy vacation packages 9
## 250242 montreal italy vacation 9
## 58665 chief executive officer 8
## 79364 county prosecutor office 8
## 101457 district attorney office 8
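As a preview of how these n-gram tables could feed the eventual prediction algorithm, the sketch below shows a naive lookup based on the trigram table built above. The function predict_next is hypothetical and not part of the analysis; the real algorithm will need backoff to lower-order n-grams and smoothing.
# Sketch: naive next-word lookup from the trigram table built above
predict_next <- function(w1, w2, trigrams = trigram_sorted) {
  prefix <- paste(w1, w2, "")           # e.g. "happy mothers "
  phrases <- as.character(trigrams$trigram_tokenised)
  matches <- trigrams[startsWith(phrases, prefix), ]
  if (nrow(matches) == 0) return(NA_character_)
  # The table is already sorted by frequency, so the first match is the best guess;
  # return its last word.
  tail(strsplit(as.character(matches$trigram_tokenised[1]), " ")[[1]], 1)
}
# predict_next("happy", "mothers")  # likely "day", given the trigram counts above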