1. Text Mining and Text Analytics

Text mining, also known as text data mining and roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived by devising patterns and trends through means such as statistical pattern learning. Text mining usually involves structuring the input text (usually parsing, along with adding some derived linguistic features and removing others, and then inserting the result into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. “High quality” in text mining usually refers to some combination of relevance, novelty, and interest. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity-relation modeling (i.e., learning relations between named entities).

Text analysis involves information retrieval, lexical analysis to study word-frequency distributions, pattern recognition, tagging/annotation, information extraction, and data mining techniques including link and association analysis, visualization, and predictive analytics. The overall goal is, essentially, to turn text into data for analysis through the application of Natural Language Processing (NLP) and analytical methods.

A typical application is to scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the extracted information.

2. Synopsis

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise, explain only the major features of the data you have identified, and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you have amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

3. Description of the data

The data set to be processed consists of random text extracted from the web through news portals, tweets, and blogs in four languages: English, German, Russian, and Finnish.

4. Loading the necessary libraries

library(sf)
## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
library(NLP)
library(tm)
library(openNLP)
library(rJava)
library(abind)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(base)
library(stars)
library(RColorBrewer)
library(wordcloud)
library(tidyr)
library(ggraph)
library(igraph)
## 
## Attaching package: 'igraph'
## The following object is masked from 'package:tidyr':
## 
##     crossing
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union

5. Data extraction site and loading the data into R

In the case of the news data set, an embedded end-of-file character in the middle of the document can truncate the read on some platforms, so a different method of loading the file is needed; a sketch of it follows the loading code below.

The data was downloaded at this link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

library("stringi")
setwd("./projectData/final/en_US")
twitter<-readLines("en_US.twitter.txt",warn=FALSE,encoding="UTF-8")
blogs<-readLines("en_US.blogs.txt",warn=FALSE,encoding="UTF-8")
news<-readLines("en_US.news.txt",warn=FALSE,encoding="UTF-8")
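If en_US.news.txt is cut short by the embedded end-of-file character, opening the file in binary mode avoids the problem; a minimal sketch, using the same file name and encoding as above:

con <- file("en_US.news.txt", open = "rb")  # binary mode ignores the embedded EOF
news <- readLines(con, warn = FALSE, encoding = "UTF-8")
close(con)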

datasummary <- data.frame("FileNames" =c("Twitter","Blogs","News"),
                          "FileSize"=c(format(object.size(twitter), units = "MB", standard ="auto"),
                                       format(object.size(blogs), units = "MB", standard = "auto"),
                                       format(object.size(news), units = "MB", standard = "auto")),
                          "FileLength"=c(length(twitter),length(blogs),length(news)),
                          "Wordcount"=c(sum(stri_stats_latex(twitter)[4]),sum(stri_stats_latex(blogs)[4]),sum(stri_stats_latex(news)[4])),
                          "NoOfChars"=c(sum(nchar(twitter)),sum(nchar(blogs)),sum(nchar(news))))
datasummary
##   FileNames FileSize FileLength Wordcount NoOfChars
## 1   Twitter   319 Mb    2360148  30451128 162096031
## 2     Blogs 255.4 Mb     899288  37570839 206824505
## 3      News  19.8 Mb      77259   2651432  15639408

6. Exploring the data under analysis

In this section we will perform some exploratory data analysis using tidy data principles, a powerful way to make handling data easier and more effective. The analysis is performed on a sample comprising 2% of the original data set. The packages required for this analysis are tidyr, dplyr, tidytext, tm, openNLP, and RWeka.

# Remove all non-English characters, as they cause issues down the road
twitter <- iconv(twitter, "latin1", "ASCII", sub="")
blogs <- iconv(blogs, "latin1", "ASCII", sub="")
news <- iconv(news, "latin1", "ASCII", sub="")

# Sampling 2% of each data set
set.seed(1000)
twitter_sample<- sample(twitter,length(twitter)*0.02)
blogs_sample<- sample(blogs,length(blogs)*0.02)
news_sample<- sample(news,length(news)*0.02)

# Write the sample files to disk

dir.create("sampleDatafiles", showWarnings = FALSE)

write(twitter_sample, "sampleDatafiles/twitter_sample.txt")
write(blogs_sample, "sampleDatafiles/blogs_sample.txt")
write(news_sample, "sampleDatafiles/news_sample.txt")


# Free memory; only the samples are needed from here on
rm(twitter, blogs, news)

7. Preprocessing

Data preprocessing involves removing extra whitespace, punctuation marks, stop words (which depend on the language), numbers, and symbol variants. During tokenization the text is also case-folded, converting uppercase words to lowercase.

library("openNLP")
library("tm")
library("tidyr")
finalSampleData <- c(twitter_sample,blogs_sample,news_sample)
sampleData <- tibble(text = finalSampleData)
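As an illustration of the preprocessing operations described above, here is a minimal sketch using the tm package loaded earlier (the corpus object is illustrative only; the tidytext pipeline below performs the equivalent cleaning):

# Illustrative tm cleaning pipeline (not used by the rest of the report)
corpus <- VCorpus(VectorSource(finalSampleData))
corpus <- tm_map(corpus, content_transformer(tolower))  # case folding
corpus <- tm_map(corpus, removePunctuation)             # punctuation marks
corpus <- tm_map(corpus, removeNumbers)                 # numbers
corpus <- tm_map(corpus, removeWords, stopwords("en"))  # stop words
corpus <- tm_map(corpus, stripWhitespace)               # extra blank spaces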

8. Unigrams – Tokenizing by 1-gram

By unnesting tokens and removing stop words with tidy tools, word counts are stored in a tidy data frame.

8.1. Loading the data to tokenize

library("openNLP")
library("tm")
library("tidytext")
library("tidyr")
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:igraph':
## 
##     as_data_frame, groups, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data(stop_words)
tidySampleData <- sampleData %>% unnest_tokens(word, text) %>% anti_join(stop_words)
## Joining, by = "word"

8.2. Eliminating whitespace and numbers

# Strip any whitespace inside tokens
tidySampleData$word <- gsub("\\s+", "", tidySampleData$word)
# Drop tokens that are plain numbers; grepl() is safe even when nothing matches
tidySampleData <- tidySampleData[!grepl("\\b\\d+\\b", tidySampleData$word), ]
tidySampleData %>% count(word, sort = TRUE)
## # A tibble: 61,694 x 2
##    word       n
##    <chr>  <int>
##  1 time    3458
##  2 love    3094
##  3 day     2910
##  4 people  2361
##  5 rt      1824
##  6 life    1511
##  7 lol     1458
##  8 happy   1240
##  9 im      1190
## 10 night   1185
## # ... with 61,684 more rows
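The wordcloud and RColorBrewer packages loaded earlier can turn these counts into a word cloud; a minimal sketch, assuming the tidySampleData table above:

# Word cloud of the most frequent unigrams
word_counts <- tidySampleData %>% count(word, sort = TRUE)
wordcloud(words = word_counts$word, freq = word_counts$n,
          max.words = 100, colors = brewer.pal(8, "Dark2"))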

8.3. Graph of unigram words

library(ggplot2)
tidySampleData %>%
  count(word, sort = TRUE) %>%
  filter(n > 1000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col(fill = "cyan4") +
  xlab("Unigram") + ylab("Frequency") +
  labs(title = "Most common unigrams") +
  coord_flip()

9. Bigrams – Tokenizing by 2-gram

A bigram or digram is a sequence of two letters, two syllables, or two words. Bigrams are commonly used as the basis for simple statistical analysis of text, and they underlie some of the most successful language models for speech recognition; the bigram is a special case of the n-gram with n = 2. A sketch of a simple bigram model appears after the counts below.

9.1. Eliminating whitespace, stop words, and numbers

library(tidyr)
# Add "amp" (residue of HTML-escaped ampersands, &amp;) to the stop-word list
stop_words <- rbind(stop_words, data.frame(word = "amp", lexicon = ""))
tidyBigramSampleData <- sampleData %>% unnest_tokens(bigram, text,token = "ngrams", n = 2) 
# Separating the bigram into its two words
tidyBigramSampleData_separated <- tidyBigramSampleData %>% separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- tidyBigramSampleData_separated %>% filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word)
# Removing whitespace and stray apostrophes
bigrams_filtered$word1 <- gsub("\\s+","",bigrams_filtered$word1)
bigrams_filtered$word2 <- gsub("\\s+","",bigrams_filtered$word2)

bigrams_filtered$word1 <- gsub("\\'+","",bigrams_filtered$word1)
bigrams_filtered$word2 <- gsub("\\'+","",bigrams_filtered$word2)

# Removing numbers; grepl() is safe even when nothing matches
bigrams_filtered <- bigrams_filtered[!grepl("\\b\\d+\\b", bigrams_filtered$word1), ]
bigrams_filtered <- bigrams_filtered[!grepl("\\b\\d+\\b", bigrams_filtered$word2), ]

bigrams_united <- bigrams_filtered %>% unite(bigram, word1, word2, sep = " ")
bigrams_united %>% count(bigram,sort=TRUE)
## # A tibble: 174,992 x 2
##    bigram             n
##    <chr>          <int>
##  1 happy birthday   175
##  2 mothers day      126
##  3 social media     109
##  4 stay tuned        67
##  5 ice cream         66
##  6 happy mothers     63
##  7 san diego         53
##  8 valentines day    48
##  9 rt rt             47
## 10 happy hour        45
## # ... with 174,982 more rows
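Since the eventual goal is next-word prediction, here is a minimal sketch of a maximum-likelihood bigram model built from these counts, where P(word2 | word1) = count(word1 word2) / count(word1). The table name bigram_probs is ours; note that a real model would be trained on unfiltered text, since stop words were removed above:

# Maximum-likelihood estimate of P(word2 | word1) from the bigram counts
bigram_probs <- bigrams_united %>%
  count(bigram) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  group_by(word1) %>%
  mutate(prob = n / sum(n)) %>%
  ungroup()

# Example: the five words most likely to follow "happy"
bigram_probs %>% filter(word1 == "happy") %>% arrange(desc(prob)) %>% head(5)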

9.2. Graph of bigram words

bigrams_united %>%
  count(bigram, sort = TRUE) %>%
  filter(n > 35) %>%
  mutate(bigram = reorder(bigram, n)) %>%
  ggplot(aes(bigram, n)) +
  geom_col(fill = "cyan4") +
  xlab("Bigram") + ylab("Frequency") +
  labs(title = "Most common bigrams") +
  coord_flip()
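The igraph and ggraph packages attached earlier can also display these bigrams as a word network; a minimal sketch, assuming the bigrams_filtered table from above:

# Network of bigrams occurring more than 40 times
bigram_graph <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE) %>%
  filter(n > 40) %>%
  graph_from_data_frame()

set.seed(2020)
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point(color = "cyan4", size = 3) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)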

10. Trigrams – Tokenizing by 3-gram

Trigrams are a special case of the n-gram, where n is 3. They are often used in natural language processing for statistical analysis of text, and in cryptography for the cryptanalysis of ciphers and codes.

10.1. Eliminating whitespace, stop words, and numbers

library(tidyr)
stop_words <- rbind(stop_words, data.frame(word = "amp", lexicon = ""))
tidyTrigramSampleData <- sampleData %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)
# Separating the trigram into its three words
tidyTrigramSampleData_separated <- tidyTrigramSampleData %>% separate(trigram, c("word1", "word2", "word3"), sep = " ")
trigrams_filtered <- tidyTrigramSampleData_separated %>% filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% filter(!word3 %in% stop_words$word)
# Removing whitespace and stray apostrophes
trigrams_filtered$word1 <- gsub("\\s+", "", trigrams_filtered$word1)
trigrams_filtered$word2 <- gsub("\\s+", "", trigrams_filtered$word2)
trigrams_filtered$word3 <- gsub("\\s+", "", trigrams_filtered$word3)

trigrams_filtered$word1 <- gsub("\\'+", "", trigrams_filtered$word1)
trigrams_filtered$word2 <- gsub("\\'+", "", trigrams_filtered$word2)
trigrams_filtered$word3 <- gsub("\\'+", "", trigrams_filtered$word3)

# Removing numbers; grepl() is safe even when nothing matches
trigrams_filtered <- trigrams_filtered[!grepl("\\b\\d+\\b", trigrams_filtered$word1), ]
trigrams_filtered <- trigrams_filtered[!grepl("\\b\\d+\\b", trigrams_filtered$word2), ]
trigrams_filtered <- trigrams_filtered[!grepl("\\b\\d+\\b", trigrams_filtered$word3), ]

trigrams_united <- trigrams_filtered %>% unite(trigram, word1, word2, word3, sep = " ")
trigrams_united %>% count(trigram, sort = TRUE)
## # A tibble: 77,164 x 2
##    trigram                     n
##    <chr>                   <int>
##  1 happy mothers day          63
##  2 cinco de mayo              20
##  3 jobs jobs jobs             16
##  4 happy valentines day       15
##  5 love love love             13
##  6 cake cake cake             12
##  7 manhattan kansas motels    12
##  8 clap clap clap             10
##  9 fat independent movie      10
## 10 ha ha ha                   10
## # ... with 77,154 more rows
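Looking ahead to the prediction algorithm, here is a minimal sketch of next-word lookup with a crude back-off from trigrams to bigrams. The helper predict_next and the table trigram_counts are ours, built from the counts above and the bigram_probs table from section 9:

# Candidate next words for a two-word context, from trigram counts,
# backing off to bigram probabilities when the context is unseen
trigram_counts <- trigrams_united %>%
  count(trigram, sort = TRUE) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ")

predict_next <- function(w1, w2, k = 3) {
  hits <- trigram_counts %>% filter(word1 == w1, word2 == w2) %>% head(k)
  if (nrow(hits) > 0) return(hits$word3)
  # Back off: condition only on the most recent word
  bigram_probs %>% filter(word1 == w2) %>% arrange(desc(prob)) %>% head(k) %>% pull(word2)
}

predict_next("happy", "mothers")  # e.g. should suggest "day"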

10.2. Graph of trigram words

trigrams_united %>%
  count(trigram, sort = TRUE) %>%
  filter(n > 10) %>%
  mutate(trigram = reorder(trigram, n)) %>%
  ggplot(aes(trigram, n)) +
  geom_col(fill = "cyan4") +
  xlab("Trigram") + ylab("Frequency") +
  labs(title = "Most common trigrams") +
  coord_flip()