This report summarizes an exploratory analysis of three large English text files (blogs, news, and Twitter). The goal is to understand the data and outline a plan for building a next-word prediction algorithm and Shiny web application.
The three data sources differ in size and style. Blogs tend to have longer lines, news text is more formal, and Twitter messages are shorter and more conversational. Table 1 shows the number of lines and average characters per line for each source.
twitter <- readLines("final/en_US/en_US.twitter.txt",
                     encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines("final/en_US/en_US.blogs.txt",
                   encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt",
                  encoding = "UTF-8", skipNul = TRUE)
library(dplyr)
summary_stats <- tibble(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Chars = c(sum(nchar(blogs)), sum(nchar(news)), sum(nchar(twitter)))
) %>%
  mutate(AvgCharsPerLine = round(Chars / Lines, 1))
knitr::kable(summary_stats, caption = "Basic summary of text files")
| Source | Lines | Chars | AvgCharsPerLine |
|---|---|---|---|
| Blogs | 899288 | 206824505 | 230.0 |
| News | 1010242 | 203223159 | 201.2 |
| Twitter | 2360148 | 162096241 | 68.7 |
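As a quick sanity check on the table, the derived column can be recomputed from the reported totals. This is only a sketch: the `line_counts` and `char_counts` vectors are illustrative names that hard-code the figures shown above rather than re-reading the files.

```r
# Totals copied from the summary table (hard-coded here for illustration).
line_counts <- c(Blogs = 899288, News = 1010242, Twitter = 2360148)
char_counts <- c(Blogs = 206824505, News = 203223159, Twitter = 162096241)

# Average characters per line, rounded the same way as in the table.
avg_chars <- round(char_counts / line_counts, 1)
print(avg_chars)

# Share of all lines contributed by each source, in percent.
line_share <- round(100 * line_counts / sum(line_counts), 1)
print(line_share)
```

Twitter contributes the most lines but the fewest characters, which is consistent with its much shorter average line.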
library(stringr)
library(ggplot2)
# Count whitespace-delimited tokens per line and label each line by source.
line_lengths <- tibble(
  Source = rep(c("Blogs", "News", "Twitter"),
               times = c(length(blogs), length(news), length(twitter))),
  Words = c(str_count(blogs, "\\S+"),
            str_count(news, "\\S+"),
            str_count(twitter, "\\S+"))
)
# Overlaid, semi-transparent histograms so the three sources can be compared.
ggplot(line_lengths, aes(x = Words, fill = Source)) +
  geom_histogram(bins = 30, alpha = 0.5, position = "identity") +
  xlab("Words per line") + ylab("Count") +
  ggtitle("Distribution of line lengths")
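A numeric summary of words per line can complement the histogram, since a few very long lines tend to stretch the x-axis. The sketch below uses base R and a tiny made-up `toy` list in place of the real `blogs`, `news`, and `twitter` vectors; `count_words` is a hypothetical helper meant to mirror `str_count(x, "\\S+")`.

```r
# Toy stand-ins for the three corpora (made-up lines, not taken from the data).
toy <- list(
  Blogs   = c("a fairly long blog sentence with many words in it",
              "another blog line here"),
  News    = c("officials said the measure passed on Tuesday",
              "markets closed higher"),
  Twitter = c("lol ok", "see you soon", "so tired")
)

# Count whitespace-delimited tokens per line; empty lines give 0.
count_words <- function(x) lengths(regmatches(x, gregexpr("\\S+", x)))

# One row per source: number of lines, median and maximum words per line.
words_summary <- do.call(rbind, lapply(names(toy), function(src) {
  w <- count_words(toy[[src]])
  data.frame(Source = src, Lines = length(w),
             MedianWords = median(w), MaxWords = max(w))
}))
print(words_summary)
```

With the real data, replacing `toy` by `list(Blogs = blogs, News = news, Twitter = twitter)` should give the same kind of table, though that has not been run here.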