In this first peer-graded assignment we are going to do some exploratory analysis and build a sample for the n-gram models. Since the data set contains several languages, for the purpose of this report we are going to explore only the English files, and we will sample only 15,000 lines from each file so that the working set stays manageable.
library(tm)
library(stringi)
library(RWeka)
library(SnowballC)
library(dplyr)
library(ggplot2)
library(gridExtra)
The following code checks whether you already have the dataset in a given directory. Feel free to change the directory to wherever you downloaded the file. If the file is not in the directory, it will be downloaded and unzipped.
dir <- "C:\\Users\\pedro moises\\Documents\\academic\\cursos online\\capstone project\\Capstone Project Data Science Specialization"
setwd(dir)
file_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (file.exists("final")) {
  print("you are ready to roll")             # data already extracted
} else if (file.exists("Coursera-SwiftKey.zip")) {
  unzip("./Coursera-SwiftKey.zip")           # archive present, just extract it
} else {
  download.file(file_url, destfile = "./Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip")
}
The R function readLines() reads text lines in a simpler and quicker way than other functions. In the case of the news dataset we have to open the file as a binary connection first because it contains special characters. See this post for more information.
twitter <- readLines(".\\final\\en_US\\en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines(".\\final\\en_US\\en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
#reading the news file as binary
b.news <- file(".\\final\\en_US\\en_US.news.txt", open="rb")
news <- readLines(b.news, encoding="UTF-8")
close(b.news)
rm(b.news)
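Since the news file needed special handling, a quick sanity check is to count how many entries in each vector fail to decode as valid UTF-8 (validUTF8() requires R >= 3.3.0; zeros mean the encoding argument did its job):
sum(!validUTF8(blogs))    #entries that are not valid UTF-8, expected to be 0
sum(!validUTF8(news))
sum(!validUTF8(twitter))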
#number of observations
blog_l <- length(blogs)
news_l <- length(news)
twitter_l <- length(twitter)
#number character per entry
blogs_c <- nchar(blogs)
news_c <- nchar(news)
twitter_c <- nchar(twitter)
#number of words per entry
blogs_w <- stri_count_words(blogs)
news_w <- stri_count_words(news)
twitter_w <- stri_count_words(twitter)
summary <- data.frame("Observations" = c(blog_l, news_l, twitter_l) %>% prettyNum(big.mark = ","),
                      "Characters" = c(sum(blogs_c), sum(news_c), sum(twitter_c)) %>% prettyNum(big.mark = ","),
                      "Mean.Chars" = c(mean(blogs_c), mean(news_c), mean(twitter_c)) %>% ceiling(),
                      "Median.Chars" = c(median(blogs_c), median(news_c), median(twitter_c)) %>% ceiling(),
                      "Words" = c(sum(blogs_w), sum(news_w), sum(twitter_w)) %>% prettyNum(big.mark = ","),
                      "Mean.Words" = c(mean(blogs_w), mean(news_w), mean(twitter_w)) %>% ceiling(),
                      "Median.Words" = c(median(blogs_w), median(news_w), median(twitter_w)) %>% ceiling(),
                      row.names = c("Blogs", "News", "Twitter"))
knitr::kable(summary)
|         | Observations | Characters  | Mean.Chars | Median.Chars | Words      | Mean.Words | Median.Words |
|---------|--------------|-------------|------------|--------------|------------|------------|--------------|
| Blogs   | 899,288      | 206,824,505 | 230        | 156          | 37,546,246 | 42         | 28           |
| News    | 1,010,242    | 203,223,159 | 202        | 185          | 34,762,395 | 35         | 32           |
| Twitter | 2,360,148    | 162,096,241 | 69         | 64           | 30,093,410 | 13         | 13           |
As we can see in the table, the number of observations varies substantially among the files, with the Twitter file having by far the most entries. The picture reverses, however, when we sum the characters per entry: the blogs and news files have more characters per entry (and the same applies to words). Comparing the mean and median columns, the blogs and news files contain some entries with very large character counts (outliers) that skew the means upward. The Twitter file has much more regular counts, which is to be expected since the platform limits posts to 140 characters.
Having roughly the same total word count in each file means we would have roughly the same amount of material from each source to train our model. This is important because the language used in each source is likely to differ.
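For a quick numeric look at this skewness, we can compare a few quantiles of the character counts per source:
quantile(blogs_c, probs = c(0.5, 0.9, 0.99, 1))    #blogs: long upper tail
quantile(news_c, probs = c(0.5, 0.9, 0.99, 1))     #news: long upper tail
quantile(twitter_c, probs = c(0.5, 0.9, 0.99, 1))  #twitter: capped near 140 characters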
Now we are going to see the distribution of the variables.
df <- data.frame(characters = c(blogs_c, news_c, twitter_c), words = c(blogs_w,
news_w, twitter_w), source = c(rep("Blogs", blog_l), rep("News", news_l),
rep("Twitter", twitter_l)))
char <- ggplot(data = df, aes(x = source, y = characters, color = source)) +
geom_boxplot() + theme(legend.position = "none")
word <- ggplot(data = df, aes(x = source, y = words, color = source)) + geom_boxplot() +
theme(legend.position = "none")
grid.arrange(char, word, nrow = 1, ncol = 2)
As we can see in the plots above, the character and word counts for the blogs and news sources have drastic outliers. This is important to know because the way we write differs depending on the length of the piece; therefore we cannot simply remove the outliers to smooth the sample, because we would leave behind potentially different writing styles.
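Because these outliers compress the boxplots, the same plots on a log scale make the bulk of each distribution easier to compare (entries with zero words are dropped with a warning on the log scale):
char_log <- char + scale_y_log10()  #reuse the boxplots above with a log10 y axis
word_log <- word + scale_y_log10()
grid.arrange(char_log, word_log, nrow = 1, ncol = 2)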
Since the full files are too large to handle comfortably on an everyday computer, we are going to sample 15,000 lines from each file.
set.seed(200)
t <- sample(twitter,15000)
b <- sample(blogs,15000)
n <- sample(news,15000)
vector <- c(t,b,n)
writeLines(vector, "vector.txt")
rm(twitter, blogs,news,t,b,n)
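To put the sample size in perspective, we can express the 15,000 sampled lines as a percentage of each file's total line count:
round(15000 / c(Blogs = blog_l, News = news_l, Twitter = twitter_l) * 100, 2)  #percentage of lines kept per source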
In this step we are going to use the tm package to load and clean the data.
corp <- Corpus(DirSource(pattern = "\\.txt$"))  #build a corpus from the sampled text file
rm(vector)
#data cleaning
corp <- tm_map(corp, content_transformer(tolower))  #transform to lowercase, keeping the documents as PlainTextDocument objects
corp <- tm_map(corp, removePunctuation)             #remove punctuation
corp <- tm_map(corp, removeNumbers)                 #remove numbers
corp <- tm_map(corp, stemDocument)                  #stem common word endings
corp <- tm_map(corp, stripWhitespace)               #strip extra white spaces
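To verify that the transformations behaved as expected, we can peek at the first few hundred characters of the cleaned corpus:
substr(paste(head(content(corp[[1]])), collapse = " "), 1, 300)  #first lines of the cleaned document, truncated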
Now we are going to calculate the word frequencies. This is by no means a final analysis, but it gives us an idea of what to expect from the final data.
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
unigram <- DocumentTermMatrix(corp, control = list(tokenize = UnigramTokenizer))
uni_table <- unigram %>%
  removeSparseTerms(0.2) %>%   #keep only terms that appear in every source
  as.matrix %>%
  colSums %>%
  sort(decreasing = TRUE)
uni_df <- data.frame(word = names(uni_table), freq = uni_table)
knitr::kable(head(uni_df), row.names = FALSE)
| word | freq   |
|------|--------|
| the  | 303010 |
| and  | 154155 |
| that | 70937  |
| for  | 68548  |
| you  | 56127  |
| with | 45174  |
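A quick bar chart of the same frequencies makes the dominance of common stop words easy to see:
ggplot(head(uni_df, 15), aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "word", y = "frequency", title = "Most frequent unigrams in the sample")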
For later analysis we will extend this to higher-order n-grams (bigrams and trigrams) and use their frequencies to build the word prediction model.
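As a sketch of that next step, the same RWeka tokenizer used above can be extended to bigrams and trigrams (shown here only as an outline; the actual prediction model is left for the final report):
BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
bigram  <- DocumentTermMatrix(corp, control = list(tokenize = BigramTokenizer))
trigram <- DocumentTermMatrix(corp, control = list(tokenize = TrigramTokenizer))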
Thanks for reading!