Data Science Capstone Assignment 1
The assignment is to read in a bunch of text data and summarize it in an interesting way. Data files are provided. These appear to be the tasks:
- Demonstrate that you’ve downloaded the data and have successfully loaded it in.
- Create a basic report of summary statistics about the data sets.
- Report any interesting findings that you amassed so far.
- Get feedback on your plans for creating a prediction algorithm and Shiny app.
Setup
Load packages. Read data.
library(here)
library(tidyverse)
#library(tm)
library(tidytext)
# list.files() sorts alphabetically, so di is blogs, news, twitter
di <- list.files(here("Data/en_US"))
blog <- read_lines(here("Data/en_US", di[1]), skip_empty_rows = TRUE)
news <- read_lines(here("Data/en_US", di[2]), skip_empty_rows = TRUE)
twit <- read_lines(here("Data/en_US", di[3]), skip_empty_rows = TRUE)
rm(di)
Some statistics
Let’s calculate some statistics on these files: the number of lines, words, and characters, and the mean line and word lengths.
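The word counts below use stringi::stri_count_words(), which counts words using ICU word-boundary rules. A quick illustration on a made-up string:
stringi::stri_count_words("This sentence has five words.")
# [1] 5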
nlines <- c(length(blog), length(news), length(twit))
nchars <- c(sum(nchar(blog)), sum(nchar(news)), sum(nchar(twit)))  # named nchars to avoid masking base::nchar
mean_line_length <- nchars / nlines
nwords <- c(sum(stringi::stri_count_words(blog)),
            sum(stringi::stri_count_words(news)),
            sum(stringi::stri_count_words(twit)))
mean_word_length <- nchars / nwords
stat_tbl <- data.frame(Dataset = c("Blog", "News", "Twitter"),
                       Lines = nlines,
                       Words = nwords,
                       Characters = nchars,
                       MeanLineLength = round(mean_line_length, 0),
                       MeanWordLength = round(mean_word_length, 2))
Here is a table that shows the results:
knitr::kable(stat_tbl, format.args = list(big.mark = ","))

| Dataset | Lines | Words | Characters | MeanLineLength | MeanWordLength |
|---|---|---|---|---|---|
| Blog | 899,288 | 37,546,806 | 206,824,505 | 230 | 5.51 |
| News | 1,010,242 | 34,762,658 | 203,223,159 | 201 | 5.85 |
| Twitter | 2,360,148 | 30,096,649 | 162,096,031 | 69 | 5.39 |
Looks like Twitter has the most lines but the fewest words and characters. Let’s see this in a graph:
stat_long <- stat_tbl %>%
  pivot_longer(-Dataset, names_to = "stat", values_to = "value")
stat_long %>%
  filter(stat %in% c("Lines", "Characters", "Words")) %>%
  ggplot(aes(Dataset, value, fill = stat)) +
  geom_col(position = "dodge") +
  facet_wrap(~stat, scales = "free_y") +
  labs(title = "Number of Characters, Lines, and Words in Text Files",
       x = NULL,
       y = NULL,
       fill = NULL) +
  theme(legend.position = "none")
TidyText functions
We will combine all these data into one data frame that includes the source (i.e., blog, news, or Twitter). Using tidytext keeps the data in a tidy format and allows the use of other tidyverse packages.
We will take a 5% sample. We will also “unnest tokens” on these data, which separates individual words into their own observations, removes empty lines and punctuation, and converts everything to lowercase. Then we use the tidytext stop word list to remove very common words.
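As a minimal sketch of what unnest_tokens() does (the two-line tibble here is made up purely for illustration):
toy <- tibble(text = c("Hello, World!", "Don't SHOUT."))
toy %>% unnest_tokens(word, text)
# one lowercase word per row: "hello", "world", "don't", "shout"
Now for the real data: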
samp_size <- 0.05
blog <- tibble(text = blog, source = "blog")
news <- tibble(text = news, source = "news")
twit <- tibble(text = twit, source = "twit")
all <- rbind(blog, news, twit) %>%
  slice_sample(prop = samp_size) %>%   # keep a random 5% of lines
  unnest_tokens(word, text) %>%        # one word per row, lowercased, punctuation stripped
  anti_join(stop_words)                # drop common stop words

Joining with `by = join_by(word)`
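Note that slice_sample() draws a random sample, so the exact counts below will differ from run to run. For a reproducible sample, one could set a seed beforehand, e.g.:
set.seed(1234)  # any fixed value; chosen arbitrarily here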
Let’s look at the most common words.
all %>%
  count(word, sort = TRUE)

# A tibble: 131,799 × 2
word n
<chr> <int>
1 time 11188
2 day 8732
3 love 8303
4 people 7997
5 2 5327
6 3 5248
7 1 4694
8 life 4600
9 rt 4547
10 home 4214
# ℹ 131,789 more rows
Some of the most common words are numbers! Let’s remove them.
all <- all %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%  # keep the first run of letters/apostrophes in each token
  na.omit()                                        # drop tokens that had no letters at all
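A quick sanity check of that regex on a few made-up tokens: pure numbers become NA (and are then dropped by na.omit()), while mixed tokens like "2nd" keep only their letters:
str_extract(c("2", "2nd", "don't", "time"), "[a-z']+")
# [1] NA      "nd"    "don't" "time"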
all %>%
  count(word, sort = TRUE)

# A tibble: 116,570 × 2
word n
<chr> <int>
1 time 11196
2 day 8822
3 love 8310
4 people 8059
5 life 4626
6 rt 4552
7 home 4219
8 week 3915
9 night 3843
10 game 3806
# ℹ 116,560 more rows
Results
OK, good enough for now. I have some interesting findings, if you’re interested in such things.
- Twitter has a disproportionately high number of lines as well as the shortest mean line length, most likely due to the required format of a tweet (a quick check of this follows below).
- The mean word length in the News source is greater than in the Blog or Twitter sources. This might be because the News data were written by journalists or professional writers, who might use slightly longer, more complicated words.
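As a rough check on the tweet-format explanation, one could look at the longest line in the Twitter source (a sketch using the twit tibble created above; 140 characters was Twitter’s limit around the time this corpus was collected):
max(nchar(twit$text))  # should be at most 140 if each line really is a single tweet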