The goal of this project is to demonstrate that you've gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs ( http://rpubs.com/ ) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise, explaining only the major features of the data you have identified and briefly summarizing your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:
1. Demonstrate that you've downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
Load libraries and read the input files, sampling subsets of 1,500 lines from the en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt data sets and combining them into a single data set. The combined sample is written to a file called InputData.txt and the per-source subsets are then cleared from memory.
library(caTools)
library(dplyr)
library(ggplot2)
library(NLP)
library(rJava)
library(knitr)
library(stringr)
library(wordcloud)
## Loading required package: RColorBrewer
library(tm)
library(RWeka)
# Read the full English data sets (readLines() warns on the Twitter file, so its warnings are suppressed)
en_blogs <- readLines("en_US/en_US.blogs.txt")
en_news <- readLines("en_US/en_US.news.txt")
en_twitter <- suppressWarnings(readLines("en_US/en_US.twitter.txt"))
data_all <- c(en_blogs, en_news, en_twitter)   # full combined corpus, kept for reference
# Sample 1,500 lines from each source, combine them, and write the sample to disk
set.seed(311344)
en_blogs_subset <- sample(en_blogs, 1500)
en_news_subset <- sample(en_news, 1500)
en_twitter_subset <- sample(en_twitter, 1500)
inputData <- c(en_blogs_subset, en_news_subset, en_twitter_subset)
writeLines(inputData, "InputData.txt")
# Free the per-source subsets once the combined sample is written
rm(en_blogs_subset, en_news_subset, en_twitter_subset)
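As a basic summary of the full data sets (goal 2 above), the sketch below tabulates line counts, rough word counts, and file sizes. This is a minimal sketch added for illustration; the summaryStats object and its column names are assumptions, not part of the original code.
# Illustrative summary table of the full data sets (not in the original code):
# line counts, whitespace-delimited word counts, and file sizes in megabytes.
summaryStats <- data.frame(
  source = c("blogs", "news", "twitter"),
  lines = c(length(en_blogs), length(en_news), length(en_twitter)),
  words = c(sum(str_count(en_blogs, "\\S+")),
            sum(str_count(en_news, "\\S+")),
            sum(str_count(en_twitter, "\\S+"))),
  size_MB = round(c(file.size("en_US/en_US.blogs.txt"),
                    file.size("en_US/en_US.news.txt"),
                    file.size("en_US/en_US.twitter.txt")) / 1024^2, 1)
)
kable(summaryStats)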
Clean the data read from the input file: remove punctuation, numbers, extra whitespace, and stopwords, and convert the text to lowercase. Special characters (", /, @, |) are replaced with spaces using a custom content_transformer, the documents are stemmed, and a word cloud is generated with the wordcloud function.
inputData <- readLines("InputData.txt", encoding="UTF-8")
data_corpus <- VCorpus(VectorSource(inputData))
# Custom transformer that replaces a matched pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
data_corpus <- tm_map(data_corpus, toSpace, "\"|/|@|\\|")    # strip ", /, @ and |
data_corpus <- tm_map(data_corpus, toSpace, "[^[:graph:]]")  # drop non-printable characters
data_corpus <- tm_map(data_corpus, content_transformer(tolower))
data_corpus <- tm_map(data_corpus, removePunctuation)
data_corpus <- tm_map(data_corpus, removeNumbers)
data_corpus <- tm_map(data_corpus, stripWhitespace)
data_corpus <- tm_map(data_corpus, stemDocument)
data_corpus <- tm_map(data_corpus, PlainTextDocument)
data_corpus <- tm_map(data_corpus, removeWords, stopwords("english"))
wordcloud(data_corpus, min.freq=5, max.words=101, random.order=TRUE,
          rot.per=0.5, colors=brewer.pal(8, "Set2"), use.r.layout=FALSE)
## Warning in wordcloud(data_corpus, min.freq = 5, max.words = 101, random.order =
## TRUE, : said could not be fit on page. It will not be plotted.
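The warning only means that one word could not be placed at the size the layout asked for. A possible tweak, not in the original call: wordcloud's scale argument controls the range of word sizes, and a narrower range usually lets every word fit.
# Alternative call with a smaller scale range so all words fit on the page
wordcloud(data_corpus, min.freq=5, max.words=101, random.order=TRUE,
          rot.per=0.5, scale=c(3, 0.5),
          colors=brewer.pal(8, "Set2"), use.r.layout=FALSE)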
Convert the corpus to a character vector, extract unigrams (single words) from the processed corpus, and visualize the top 30 most frequent words in a barplot. The steps are: tokenizing into unigrams, sorting by frequency, and plotting. The barplot gives a visual representation of the most common words in the corpus.
corpus.dataframe <- data.frame(text = sapply(data_corpus, as.character), stringsAsFactors = FALSE)
# 1. Unigram frequencies
uniGramToken <- data.frame(table(NGramTokenizer(corpus.dataframe$text, Weka_control(min = 1, max = 1))))
unigram <- uniGramToken[order(uniGramToken$Freq, decreasing = TRUE),]
par(mfrow = c(1, 1))
par(mar = c(5, 4, 2, 0))
barplot(unigram[1:30, 2],
        names.arg = unigram[1:30, 1],
        col = "light blue",
        main = "Most Commonly Used Words (Top 30)",
        las = 2,
        ylab = "Frequency")
Generate a barplot for bigrams (two-word combinations) from the text corpus, following the same structure as the unigram analysis.
# 2. Bigram frequencies
biGramToken <- data.frame(table(NGramTokenizer(corpus.dataframe$text, Weka_control(min = 2, max = 2))))
bigram <- biGramToken[order(biGramToken$Freq, decreasing = TRUE),]
par(mar = c(8.5, 4, 2, 1))
barplot(bigram[1:30, 2],
        names.arg = bigram[1:30, 1],
        col = "pink",
        main = "Most Commonly Used Two Word Combinations (Top 30)",
        las = 2,
        ylab = "Frequency")
Generate a barplot for trigrams (three-word combinations), following the same approach as the unigram and bigram visualizations.
# 3. Trigram frequencies
triGramToken <- data.frame(table(NGramTokenizer(corpus.dataframe$text, Weka_control(min = 3, max = 3))))
trigram <- triGramToken[order(triGramToken$Freq, decreasing = TRUE),]
par(mar = c(8.5, 4, 2, 1))
barplot(trigram[1:30, 2],
        names.arg = trigram[1:30, 1],
        col = "green",
        main = "Most Commonly Used Three Word Combinations (Top 30)",
        las = 2,
        ylab = "Frequency")
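Since knitr is already loaded, the same frequency tables can be shown as small tables to complement the barplots. This is a minimal sketch; the column labels are illustrative.
# Top ten n-grams of each order as simple tables (complements the barplots above)
kable(head(unigram, 10), row.names = FALSE, col.names = c("unigram", "frequency"))
kable(head(bigram, 10), row.names = FALSE, col.names = c("bigram", "frequency"))
kable(head(trigram, 10), row.names = FALSE, col.names = c("trigram", "frequency"))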
Summary:
The unigram plot highlights the individual words that matter most in the data set, giving a general sense of vocabulary usage. The bigram and trigram plots add layers of context, showing more detailed relationships between words and the flow of language, which helps identify common topics, expressions, and recurring themes.
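Looking ahead to the prediction algorithm and Shiny app, the bigram and trigram tables built above can already drive a very simple next-word lookup. The sketch below is only illustrative: predictNextWord is a hypothetical helper, not part of this report's code, and the eventual model would likely be trained without stemming or stopword removal and use a proper back-off or smoothing scheme.
# Hypothetical back-off lookup over the n-gram tables built above:
# try the last two words against the trigram table, then the last word
# against the bigram table, and finally fall back to the most frequent word.
predictNextWord <- function(phrase) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n], "")   # "w1 w2 " with a trailing space
    hits <- trigram[startsWith(as.character(trigram$Var1), prefix), ]
    if (nrow(hits) > 0) return(word(as.character(hits$Var1[1]), 3))
  }
  hits <- bigram[startsWith(as.character(bigram$Var1), paste(words[n], "")), ]
  if (nrow(hits) > 0) return(word(as.character(hits$Var1[1]), 2))
  as.character(unigram$Var1[1])   # overall most frequent word as a last resort
}
predictNextWord("thanks for the")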