The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.
We analyse three corpora of US English text found online. We find that the blogs and news corpora are similar, while the Twitter corpus is different. We propose that this is the result of the 140-character limit on Twitter messages.
The motivation for this project is to:
- Demonstrate that you’ve downloaded the data and have successfully loaded it in.
- Create a basic report of summary statistics about the data sets.
- Report any interesting findings that you amassed so far.
- Get feedback on your plans for creating a prediction algorithm and Shiny app.
In this report we look at three corpora of US English text: a set of internet blog posts, a set of internet news articles, and a set of Twitter messages.
We collect the following forms of information:
- file size
- number of lines
- number of non-empty lines
- number of words
- distribution of words (quantiles and plot)
- number of characters
- number of non-whitespace characters
The following section describes the data collection process, and the section after that gives the results of the data exploration. Finally, we present conclusions and give references.
For our analysis we use the R computing environment [@R], as well as the libraries stringi [@stringi] and ggplot2 [@ggplot2]. To make the code more readable we use the pipe operator from the magrittr library [@magrittr]. This report is compiled using the rmarkdown [@rmarkdown] and knitr [@knitr] libraries. Finally, during writing we used the RStudio IDE [@RStudio].
Download file if it does not exist, and unzip it.
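# Use one path for the existence check, the download and the unzip step
zip_path <- "./data/Coursera-SwiftKey.zip"
if (!file.exists("./data/final")) {
  # Make sure the target directory exists before downloading into it
  dir.create("./data", showWarnings = FALSE)
  if (!file.exists(zip_path)) {
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                  destfile = zip_path, quiet = TRUE, mode = "wb")
  }
  unzip(zip_path, exdir = "./data")
}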
We only load the English data, as specified by the requirements.
blogs <- readLines("data/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul=TRUE)
twitter <- readLines("data/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul=TRUE)
# readLines() kept warning about an 'incomplete final line' in the news dataset, so read it in binary mode instead
con <- file("data/final/en_US/en_US.news.txt", open="rb")
news <- readLines(con, encoding="UTF-8", skipNul=TRUE)
close(con)
rm(con)
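The binary-mode workaround used for the news file could be applied uniformly to all three files. The following is a minimal sketch, assuming a hypothetical helper named `read_corpus` that is not part of the original analysis; the demonstration runs on a temporary file rather than the real data.

```r
# Hypothetical helper (not part of the original analysis): read a text file
# through a binary-mode connection, as done for the news file above.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))  # close the connection even if readLines() fails
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Small self-contained demonstration on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
demo_lines <- read_corpus(tmp)
```

With such a helper, each corpus could be loaded with a call like `read_corpus("data/final/en_US/en_US.news.txt")`.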
Show some sample data.
head(blogs, 2)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan <U+0093>gods<U+0094>."
## [2] "We love you Mr. Brown."
head(twitter, 2)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
head(news, 2)
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
Before we analyse the files, we look at their size (in megabytes, MB).
# file size (in MegaBytes/MB)
file.info("data/final/en_US/en_US.blogs.txt")$size / 1024^2
## [1] 200.4242
file.info("data/final/en_US/en_US.news.txt")$size / 1024^2
## [1] 196.2775
file.info("data/final/en_US/en_US.twitter.txt")$size / 1024^2
## [1] 159.3641
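The three size lookups above could also be collected in one step. This is a small sketch, assuming a hypothetical helper named `file_size_mb`; it uses the same conversion as above (1 MB taken as 1024^2 bytes) and is demonstrated on a temporary file.

```r
# Hypothetical helper: size of each file in MB (1 MB taken as 1024^2 bytes),
# returned as a numeric vector named by file name.
file_size_mb <- function(paths) {
  sizes <- file.info(paths)$size / 1024^2
  names(sizes) <- basename(paths)
  sizes
}

# Demonstration on a temporary file of exactly 2 * 1024^2 bytes
tmp <- tempfile(fileext = ".bin")
writeBin(raw(2 * 1024^2), tmp)
demo_size <- file_size_mb(tmp)
```

Passing the three `en_US` file paths as one character vector would return all three sizes at once.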
For our analysis we need two libraries.
# library for character string analysis
library(stringi)
# library for plotting
library(ggplot2)
We analyse the lines and characters.
# general statistics: lines, non-empty lines, characters, non-whitespace characters
stri_stats_general( blogs )
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 206824382 170389539
stri_stats_general( news )
## Lines LinesNEmpty Chars CharsNWhite
## 1010242 1010242 203223154 169860866
stri_stats_general( twitter )
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162096241 134082806
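To compare the corpora more easily, the three results could be stacked into a single table. The sketch below assumes a hypothetical helper named `corpus_stats` and uses toy data, not the real corpora.

```r
library(stringi)

# Hypothetical helper: one row of general statistics per corpus.
# `corpora` is a named list of character vectors.
corpus_stats <- function(corpora) {
  t(sapply(corpora, stri_stats_general))
}

# Demonstration on toy data (not the real corpora)
toy <- list(a = c("one two", "three"), b = c("four five six"))
demo_stats <- corpus_stats(toy)
```

Dividing `Chars` by `Lines` in the output above gives roughly 230 characters per line for blogs, 201 for news, and 69 for Twitter, which is consistent with Twitter messages being much shorter.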
Count the words per line and summarize the distribution of these counts per corpus, using summary statistics and a distribution plot.
words_blogs <- stri_count_words(blogs)
summary( words_blogs )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 28.00 41.75 60.00 6726.00
b <- qplot( words_blogs )
b
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Analyze the “news” corpus:
words_news <- stri_count_words(news)
summary( words_news )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.41 46.00 1796.00
qplot( words_news )
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Analyze the “twitter” corpus:
words_twitter <- stri_count_words(twitter)
summary( words_twitter )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 12.75 18.00 47.00
qplot( words_twitter )
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
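The three summaries are easier to compare side by side. The following sketch assumes a hypothetical helper named `compare_word_counts` and uses only base R; it is shown on toy vectors rather than the real word counts.

```r
# Hypothetical helper: quantiles of words-per-line for several corpora,
# side by side. `counts` is a named list of numeric vectors such as
# list(blogs = words_blogs, news = words_news, twitter = words_twitter).
compare_word_counts <- function(counts, probs = c(0, 0.25, 0.5, 0.75, 1)) {
  sapply(counts, quantile, probs = probs)
}

# Demonstration on toy data
toy_counts <- list(short = c(1, 2, 3, 4, 5), long = c(10, 20, 30, 40, 50))
demo_quantiles <- compare_word_counts(toy_counts)
```

The result is a matrix with one column per corpus and one row per quantile, so the medians (28, 32 and 12 words in the summaries above) would sit in a single row.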
While the general use case is to create a prediction algorithm, each of the data sets can be quite different.
For example, the Twitter data is very different from that of the blogs or news. This is probably due to Twitter’s 140-character limit, which encourages heavy use of abbreviations and makes the dataset very hard to clean up. It also raises a question for prediction: should we expand all abbreviations to their full form and add them to the corpus, or leave them as they are? In addition, people use hashtags inconsistently (some as part of the sentence, some before or after the actual tweet), so Twitter seems to be the hardest case to handle.
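One possible policy for hashtags is to drop the `#` mark and keep the tag as an ordinary word. This is only a sketch of that option, not a decision made in this report; the helper name `strip_hashtag_marks` is hypothetical.

```r
# Hypothetical helper: drop the leading '#' so that an in-sentence hashtag
# is kept as an ordinary word. This is one possible policy, not a decision
# made in this report.
strip_hashtag_marks <- function(text) {
  gsub("#(\\w+)", "\\1", text)
}

demo_tweet <- strip_hashtag_marks("Loving the #weather today #sunny")
```

This works well for hashtags used inside a sentence, but trailing tags would still end up in the corpus as extra words, so it does not settle the question raised above.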
One of the biggest challenges I foresee is the process and rules involved in cleaning the data. For example, the Twitter and blogs corpora tend to contain URLs, which should ideally be removed from the corpus.
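URL removal could be sketched with a regular expression in base R. The helper name `remove_urls` is hypothetical, and the pattern is a simple approximation that will not match every URL form.

```r
# Hypothetical helper: remove URLs starting with http://, https:// or www.
# and collapse the whitespace left behind. The pattern is a simple
# approximation and will not match every URL form.
remove_urls <- function(text) {
  no_urls <- gsub("(https?://|www\\.)\\S+", "", text)
  trimws(gsub("\\s+", " ", no_urls))
}

demo_clean <- remove_urls("Read this http://example.com/post now")
```

Because the function is vectorised, it could be applied directly to a whole corpus, for example `remove_urls(twitter)`.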
Finally, I plan to build a word prediction algorithm. I’m still working on the model, but I plan to use some of the techniques presented in the previous courses. I will create a training dataset to build the prediction model and then test this model.
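The train-and-test plan could look roughly like the sketch below. Both helper names (`split_corpus`, `predict_next`) are hypothetical, and the bigram count used for prediction is an assumption made purely for illustration, since the model itself has not been chosen yet.

```r
# Hypothetical helper: split a character vector of lines into training and
# test sets. The seed makes the split reproducible.
split_corpus <- function(lines, train_frac = 0.8, seed = 1) {
  set.seed(seed)
  train_idx <- sample.int(length(lines), size = floor(train_frac * length(lines)))
  list(train = lines[train_idx], test = lines[-train_idx])
}

# Assumed model for illustration only: predict the word that most often
# follows `word` in the training lines (a simple bigram count).
predict_next <- function(train_lines, word) {
  tokens <- strsplit(tolower(train_lines), "[^a-z']+")
  followers <- unlist(lapply(tokens, function(tk) {
    tk <- tk[tk != ""]                   # drop empty tokens from the split
    hits <- which(head(tk, -1) == word)  # positions of `word`, excluding the last token
    tk[hits + 1]                         # the words that follow it
  }))
  if (length(followers) == 0) return(NA_character_)
  names(sort(table(followers), decreasing = TRUE))[1]
}

# Demonstration on toy data (not the real corpora)
toy_lines <- c("i love you", "i love cake", "i love you too",
               "we love you", "you love me")
parts <- split_corpus(toy_lines)
demo_next <- predict_next(toy_lines, "love")
```

In practice the model would be built from `parts$train` only, and `parts$test` would be used to check how often the predicted word matches the word that actually follows.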