==================================================================================================

SYNOPSIS

This report describes my exploratory analysis of the three English text files (blogs, news and twitter). First, basic summaries of the three files are computed. Next, histograms are plotted to show the frequency distributions of the most common words, 2-grams and 3-grams in each file. Because all three files are very large, this analysis only uses the first 600 lines of each file.

DATA PROCESSING

1. Set up the working directory and load the first 600 lines of each dataset into R

setwd("~/Desktop/Capstone/final/en_US")
data.blogs <- file("en_US.blogs.txt")
blogs <- readLines(data.blogs, 600)
close(data.blogs)

data.news <- file("en_US.news.txt")
news <- readLines(data.news, 600)
close(data.news)

data.twitter <- file("en_US.twitter.txt")
twitter <- readLines(data.twitter, 600)
close(data.twitter)

2. Conduct some basic summaries of the three datasets

basic_sum <- function(file) {
        # Number of lines in the character vector
        line_count <- length(file)
        # Split each line on spaces and add up the resulting tokens
        file_1 <- strsplit(file, " ")
        word_count <- 0
        for (i in 1:length(file_1)) {
                word_count <- word_count + length(file_1[[i]])
        }
        print(paste("Line Count: ", line_count))
        print(paste("Words Count:", word_count))
}
basic_sum(blogs)
## [1] "Line Count:  600"
## [1] "Words Count: 25501"
basic_sum(news)
## [1] "Line Count:  600"
## [1] "Words Count: 20366"
basic_sum(twitter)
## [1] "Line Count:  600"
## [1] "Words Count: 7574"

3. Create a Tokenization and profanity-filtering function

Tokenization <- function(text, lines) {
        # 'lines' is unused here; the samples were already limited to 600 lines when read in
        # Split each line on spaces, then clean every token
        word_1 <- strsplit(text, " ")
        words <- list()
        for (i in 1:length(word_1)) {
                for (j in 1:length(word_1[[i]])) {
                        # Trim trailing blanks, then strip punctuation, symbols and quotes
                        token <- gsub(" *$", "", word_1[[i]][j])
                        token <- gsub("[()<>.,:;?!@#$%&*+/=~\"“”«»♥ -]|^'|'$|👦", "", token)
                        words <- append(words, token)
                }
        }
        # Mark empty tokens as NA, then drop them
        words[words == " "] <- NA
        words[words == ""] <- NA
        words[is.na(words)] <- NULL
        # Drop common stop words; a profanity word list can be filtered out the same way
        stop_words <- c("a", "A", "an", "An", "AN", "the", "The", "THE",
                        "to", "To", "TO", "of", "Of", "OF")
        for (sw in stop_words) {
                words[words == sw] <- NULL
        }
        words
}
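
For reference, the function can be applied to the three samples read in step 1 as shown below; the *_tokens object names are illustrative and not part of the original report.

# Tokenize the 600-line samples (illustrative object names)
blogs_tokens   <- Tokenization(blogs, 600)
news_tokens    <- Tokenization(news, 600)
twitter_tokens <- Tokenization(twitter, 600)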

4. Distributions of word frequencies, and frequencies of 2-grams and 3-grams, in the three datasets
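
One way to compute these frequencies from the token lists is sketched below; the helper ngram_freq is a hypothetical name introduced for illustration and may differ from the code actually used to produce the plots in the next section.

# Sketch: frequency table of the top n-grams in a token list
ngram_freq <- function(tokens, n = 1, top = 10) {
        tokens <- unlist(tokens)
        if (n > 1) {
                # build n-grams by pasting n consecutive tokens together
                idx <- 1:(length(tokens) - n + 1)
                tokens <- sapply(idx, function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
        }
        sort(table(tokens), decreasing = TRUE)[1:top]
}

blogs_top_words <- ngram_freq(blogs_tokens, n = 1)
blogs_top_2gram <- ngram_freq(blogs_tokens, n = 2)
blogs_top_3gram <- ngram_freq(blogs_tokens, n = 3)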

RESULTS

Plot histograms of the Top 10 word, 2-gram and 3-gram frequency distributions for the blogs, news and twitter samples.

[Figure: nine histograms (chunk "Histogram Top10 frequency") — Top 10 words, 2-grams and 3-grams for each of the three files.]
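
A minimal sketch of how one such histogram could be drawn from the frequency tables above, using base R barplot (the original plotting code may differ):

# Sketch: bar chart of a Top 10 frequency table (here the blogs word frequencies)
barplot(blogs_top_words,
        las = 2,                  # rotate the labels so longer n-grams stay readable
        col = "steelblue",
        main = "Top 10 words in the blogs sample",
        ylab = "Frequency")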