Data Science Capstone Assignment 1
The assignment is to read in a bunch of text data and summarize it in an interesting way. Data files are provided. These appear to be the tasks:
- Demonstrate that you’ve downloaded the data and have successfully loaded it in.
- Create a basic report of summary statistics about the data sets.
- Report any interesting findings that you amassed so far.
- Get feedback on your plans for creating a prediction algorithm and Shiny app.
Setup
Load packages. Read data.
library(here)
library(tidyverse)
#library(tm)
library(tidytext)
# list.files() sorts alphabetically, so di is blogs, news, twitter
di <- list.files(here("Data/en_US"))
blog <- read_lines(here("Data/en_US", di[1]), skip_empty_rows = TRUE)
news <- read_lines(here("Data/en_US", di[2]), skip_empty_rows = TRUE)
twit <- read_lines(here("Data/en_US", di[3]), skip_empty_rows = TRUE)
rm(di)
Some statistics
Let’s calculate some statistics on these files: the number of lines, words, and characters, and the mean line and word lengths.
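The word counts below use stringi::stri_count_words(), which counts words using ICU word-boundary rules. A quick illustration on a made-up string:
stringi::stri_count_words("This sentence has five words.")
# [1] 5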
nlines <- c(length(blog), length(news), length(twit))
nchars <- c(sum(nchar(blog)), sum(nchar(news)), sum(nchar(twit)))  # named nchars to avoid masking base::nchar
mean_line_length <- nchars / nlines
nwords <- c(sum(stringi::stri_count_words(blog)),
            sum(stringi::stri_count_words(news)),
            sum(stringi::stri_count_words(twit)))
mean_word_length <- nchars / nwords
stat_tbl <- data.frame(Dataset = c("Blog", "News", "Twitter"),
                       Lines = nlines,
                       Words = nwords,
                       Characters = nchars,
                       MeanLineLength = round(mean_line_length, 0),
                       MeanWordLength = round(mean_word_length, 2))
Here is a table that shows the results:
knitr::kable(stat_tbl, format.args = list(big.mark = ","))

| Dataset | Lines | Words | Characters | MeanLineLength | MeanWordLength |
|---|---|---|---|---|---|
| Blog | 899,288 | 37,546,806 | 206,824,505 | 230 | 5.51 |
| News | 1,010,242 | 34,762,658 | 203,223,159 | 201 | 5.85 |
| Twitter | 2,360,148 | 30,096,649 | 162,096,031 | 69 | 5.39 |
Looks like Twitter has the most lines but the fewest words and characters. Let’s see this in a graph:
stat_long <- stat_tbl %>%
  pivot_longer(-Dataset, names_to = "stat", values_to = "value")
stat_long %>%
  filter(stat %in% c("Lines", "Characters", "Words")) %>%
  ggplot(aes(Dataset, value, fill = stat)) +
  geom_col(position = "dodge") +
  facet_wrap(~stat, scales = "free_y") +
  labs(title = "Number of Characters, Lines, and Words in Text Files",
       x = NULL,
       y = NULL,
       fill = NULL) +
  theme(legend.position = "none")
TidyText functions
We will combine all these data into one data frame that includes the source (i.e., blog, news, or Twitter). Using tidytext keeps the data in a tidy format and allows the use of other tidyverse packages.
We will take a 5% sample. We will also “unnest tokens” on these data, which separates individual words into their own observations, removes empty lines and punctuation, and converts everything to lowercase. Then we use the tidytext stop word list to remove very common words.
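As a minimal sketch of what unnest_tokens() does (the two-line tibble here is made up purely for illustration):
toy <- tibble(text = c("Hello, World!", "Don't SHOUT."))
toy %>% unnest_tokens(word, text)
# one lowercase word per row: "hello", "world", "don't", "shout"
Now for the real data: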
samp_size <- 0.05
blog <- tibble(text = blog, source = "blog")
news <- tibble(text = news, source = "news")
twit <- tibble(text = twit, source = "twit")
all <- rbind(blog, news, twit) %>%
  slice_sample(prop = samp_size) %>%   # keep a random 5% of lines
  unnest_tokens(word, text) %>%        # one word per row, lowercased, punctuation stripped
  anti_join(stop_words)                # drop common stop words

Joining with `by = join_by(word)`
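Note that slice_sample() draws a random sample, so the exact counts below will differ from run to run. For a reproducible sample, one could set a seed beforehand, e.g.:
set.seed(1234)  # any fixed value; chosen arbitrarily here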
Let’s look at the most common words.
all %>%
  count(word, sort = TRUE)

# A tibble: 131,799 × 2
word n
<chr> <int>
1 time 11188
2 day 8732
3 love 8303
4 people 7997
5 2 5327
6 3 5248
7 1 4694
8 life 4600
9 rt 4547
10 home 4214
# ℹ 131,789 more rows
Some of the most common words are numbers! Let’s remove them.
all <- all %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%  # keep the first run of letters/apostrophes in each token
  na.omit()                                        # drop tokens that had no letters at all
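A quick sanity check of that regex on a few made-up tokens: pure numbers become NA (and are then dropped by na.omit()), while mixed tokens like "2nd" keep only their letters:
str_extract(c("2", "2nd", "don't", "time"), "[a-z']+")
# [1] NA      "nd"    "don't" "time"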
all %>%
  count(word, sort = TRUE)

# A tibble: 116,570 × 2
word n
<chr> <int>
1 time 11196
2 day 8822
3 love 8310
4 people 8059
5 life 4626
6 rt 4552
7 home 4219
8 week 3915
9 night 3843
10 game 3806
# ℹ 116,560 more rows
Results
OK, good enough for now. I have some interesting findings, if you’re interested in such things.
- Twitter has a disproportionately high number of lines as well as the shortest mean line length, most likely due to the required format of a tweet (a quick check of this follows below).
- The mean word length in the News source is greater than in the Blog or Twitter sources. This might be because the News data were written by journalists or professional writers, who might use slightly longer, more complicated words.
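As a rough check on the tweet-format explanation, one could look at the longest line in the Twitter source (a sketch using the twit tibble created above; 140 characters was Twitter’s limit around the time this corpus was collected):
max(nchar(twit$text))  # should be at most 140 if each line really is a single tweet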