Introduction
The goal of this milestone project is to demonstrate familiarity with the data and show that the work is on track toward building the prediction algorithm. The contents of this report are:
Downloading the data from the link provided and loading it successfully.
Exploratory analysis: data cleaning and preprocessing, summary statistics, and descriptive plots for each file (blogs, news, twitter) and for the combined sample.
Findings about the data, such as the most frequent n-grams (one, two, and three words) and the number of unique words needed to cover 50% and 90% of the corpus.
Next steps to create a prediction algorithm.
Download files from link
Setup working directory
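The download and setup code is not echoed in the report; below is a minimal sketch of that step, assuming the Coursera SwiftKey archive URL from the course assignment and a ./data folder inside the working directory (adjust both to your own setup).
# packages used throughout this report (assumed setup chunk)
library(stringi); library(quanteda); library(textclean); library(tm)
library(dplyr); library(knitr); library(kableExtra)
library(ggplot2); library(gridExtra); library(RColorBrewer)
# download and unpack the corpus into ./data (URL assumed from the course assignment)
data_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!dir.exists("./data")) dir.create("./data")
zip_file <- "./data/Coursera-SwiftKey.zip"
if (!file.exists(zip_file)) {
  download.file(data_url, destfile = zip_file, mode = "wb")
  unzip(zip_file, exdir = "./data")   # creates ./data/final/en_US/...
}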
Reading Text Files
path = "./data"
blogs <- readLines(paste(path, "/final/en_US/en_US.blogs.txt", sep = ""), encoding="UTF-8", skipNul = TRUE, warn = FALSE)
news <- readLines(paste(path, "/final/en_US/en_US.news.txt", sep = ""), encoding="UTF-8", skipNul = TRUE, warn = FALSE)
twitter <- readLines(paste(path, "/final/en_US/en_US.twitter.txt", sep = ""), encoding="UTF-8", skipNul = TRUE, warn = FALSE)
Exploratory Analysis
Summary of descriptive statistics for each collection (blogs, news and tweets)
data <- list(blogs, news, twitter)
line_count <- sapply(data, stri_stats_general)[c('Lines'), ]
word_count <- sapply(data, stri_stats_latex)[c('Words'), ]
word_min_mean_max <- sapply(data, function(x) summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
data_summ <- data.frame(rbind(line_count, word_count, word_min_mean_max))
colnames(data_summ) <- c("Blogs", "News", "Twitter")
rownames(data_summ) <- c("Line Count", "Word Count", "Min. words per line", "Average words per line", "Max. words per line")
data_summ %>%
kable(digits = 1, format.args = list(big.mark = ",",
scientific = FALSE),
align = "c") %>%
kable_styling(position = "center")
|  | Blogs | News | Twitter |
|---|---|---|---|
| Line Count | 899,288.0 | 1,010,242.0 | 2,360,148.0 |
| Word Count | 37,570,839.0 | 34,494,539.0 | 30,451,170.0 |
| Min. words per line | 0.0 | 1.0 | 1.0 |
| Average words per line | 41.8 | 34.4 | 12.8 |
| Max. words per line | 6,726.0 | 1,796.0 | 47.0 |
Cleaning the data
The quanteda package is used to convert all text to lower case, replace special characters between words (hyphens, apostrophes, slashes, etc.) with spaces, and remove punctuation (periods, question marks, exclamation points). The profanity (“bad words”) list is downloaded from this link. A separate dataset with “stopwords” removed is created to compare its word frequencies against those of the original dataset with stopwords kept.
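A minimal sketch of loading the profanity list, assuming a plain-text file with one term per line (the URL is a placeholder for the link above):
bad_words_url <- "https://example.com/bad-words.txt"   # placeholder for the profanity list linked above
bad_words <- readLines(bad_words_url, warn = FALSE, skipNul = TRUE)
bad_words <- tolower(trimws(bad_words))                # normalize for removeWords()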
Blogs
Sample a fraction (10%) of the complete file to speed up the processing time.
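A minimal sketch of the sampling step (the seed is an arbitrary value used here only for reproducibility):
set.seed(1234)                                              # assumed seed
blog_sample <- sample(blogs, round(length(blogs) * 0.10))   # keep ~10% of the lines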
Cleaning blog_sample with the textclean and tm libraries.
blog_sample <- replace_contraction(blog_sample)            # expand contractions (e.g. "don't" -> "do not")
blog_sample <- replace_hash(blog_sample)                   # strip hashtags
blog_sample <- replace_html(blog_sample)                   # strip HTML markup and entities
blog_sample <- replace_url(blog_sample)                    # remove URLs
blog_sample <- tolower(blog_sample)                        # convert to lower case
blog_sample <- removePunctuation(blog_sample, ucp = TRUE)  # remove punctuation (tm)
blog_sample <- removeWords(blog_sample, bad_words)         # remove profanity (tm)
blog_sample <- replace_white(blog_sample)                  # collapse extra whitespace
Create blog_dfm (document-feature matrix) and list the top 20 words by frequency using the corpus_dfm and corpus_dfm_nsw functions (see Appendix I). The frequency tables illustrate word frequency with and without stopwords. Stopwords are common words, such as ‘a’, ‘the’, ‘and’, etc., that generally are not indexed or searchable by a search engine (source: Collins Dictionary).
blog_dfm <- corpus_dfm(blog_sample, 1)
blog_dfm_20 <- textstat_frequency(blog_dfm, 20)
blog_dfm_nsw <- corpus_dfm_nsw(blog_sample, 1)
blog_dfm_nsw_20 <- textstat_frequency(blog_dfm_nsw, 20)
| Word | Frequency |
|---|---|
| the | 186,317 |
| and | 109,420 |
| to | 107,498 |
| a | 90,249 |
| of | 88,114 |
| i | 84,754 |
| in | 59,741 |
| is | 51,224 |
| that | 47,354 |
| it | 44,206 |
| for | 36,594 |
| you | 30,574 |
| with | 28,771 |
| was | 28,464 |
| my | 27,482 |
| on | 27,445 |
| this | 26,021 |
| not | 26,018 |
| have | 24,765 |
| as | 22,361 |
| Word | Frequency |
|---|---|
| one | 12,508 |
| can | 10,863 |
| just | 10,069 |
| like | 9,639 |
| time | 9,008 |
| get | 7,140 |
| now | 6,122 |
| people | 6,092 |
| know | 5,874 |
| us | 5,518 |
| also | 5,488 |
| new | 5,487 |
| even | 5,237 |
| day | 5,073 |
| see | 5,059 |
| first | 5,048 |
| really | 5,047 |
| back | 5,007 |
| make | 4,959 |
| well | 4,953 |
Plot top 20 words for Blog sample with and without stopwords (see Appendix II for function code)
g1 <- top_20_barchart(blog_dfm_20, "Top 20 (Blog with stopwords)")
g2 <- top_20_barchart(blog_dfm_nsw_20, "Top 20 (Blog without stopwords)")
grid.arrange(g1, g2, nrow = 1)
Cumulative frequency plot to determine the total word count coverage at 50% and 90% of the total dictionary (see Appendix III). In this case, the Blog corpus with and without stopwords is used.
blog_freq_all <- textstat_frequency(blog_dfm)
blog_freq_all <- blog_freq_all %>%
mutate(Freqprc = (frequency / sum(frequency)) * 100,
Freqcum = cumsum(Freqprc))
blog_freq <- textstat_frequency(blog_dfm_nsw)
blog_freq <- blog_freq %>%
mutate(Freqprc = (frequency / sum(frequency)) * 100,
Freqcum = cumsum(Freqprc))
c1 <- cum_freq_plot(blog_freq_all, "Cumulative Frequency Plot (Blog)", "(with stopwords)")
c2 <- cum_freq_plot(blog_freq, "Cumulative Frequency Plot (Blog)", "(without stopwords)")
grid.arrange(c1, c2, nrow = 1)
From the cumulative frequency plot, the number of words that comprise 50% of the total word count is 90. For 90%, the number of words is 5,331. If stopwords are excluded, the number of words needed for 50% and 90% coverage increases to 888 and 11,200, respectively.
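The same counts can be read directly from the cumulative frequency tables; a small sketch equivalent to the approxfun lookup in Appendix III (coverage_count is an illustrative helper, not part of the report code):
coverage_count <- function(freq_table, pct) {
  # first rank at which the cumulative frequency reaches pct percent
  min(which(freq_table$Freqcum >= pct))
}
coverage_count(blog_freq_all, 50)   # with stopwords: ~90 words
coverage_count(blog_freq_all, 90)   # with stopwords: ~5,331 words
coverage_count(blog_freq, 50)       # without stopwords: ~888 words
coverage_count(blog_freq, 90)       # without stopwords: ~11,200 words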
Create blog_dfm_n2 (ngram = 2) and list top 20 words by frequency
# create tokens from blog sample
blog_dfm_n2 <- corpus_dfm(blog_sample, 2)
blog_dfm_n2_20 <- textstat_frequency(blog_dfm_n2, 20)
blog_dfm_n2_20 %>%
select(feature, frequency) %>%
kable(col.names = c("Word", "Frequency"), format.args = list(big.mark = ",",
scientific = FALSE)) %>%
kable_styling(full_width = F, position = "center")
| Word | Frequency |
|---|---|
| of the | 18,656 |
| in the | 15,677 |
| it is | 8,777 |
| to the | 8,550 |
| i am | 8,031 |
| on the | 7,486 |
| to be | 6,819 |
| i have | 6,789 |
| for the | 5,858 |
| and the | 5,808 |
| and i | 5,407 |
| is a | 5,371 |
| i was | 5,188 |
| it was | 4,891 |
| at the | 4,817 |
| in a | 4,558 |
| with the | 4,424 |
| that i | 4,152 |
| from the | 3,742 |
| do not | 3,635 |
Create blog_dfm_n3 (ngram = 3) and list top 20 words by frequency
# create tokens from blog sample
blog_dfm_n3 <- corpus_dfm(blog_sample, 3)
blog_dfm_n3_20 <- textstat_frequency(blog_dfm_n3, 20)
blog_dfm_n3_20 %>%
select(feature, frequency) %>%
kable(col.names = c("Word", "Frequency"), format.args = list(big.mark = ",",
scientific = FALSE)) %>%
kable_styling(full_width = F, position = "center")
| Word | Frequency |
|---|---|
| i do not | 1,397 |
| one of the | 1,393 |
| a lot of | 1,204 |
| i have been | 974 |
| it is a | 969 |
| i am not | 887 |
| i did not | 744 |
| some of the | 718 |
| as well as | 715 |
| there is a | 684 |
| to be a | 681 |
| be able to | 667 |
| it was a | 658 |
| out of the | 643 |
| the end of | 628 |
| a couple of | 620 |
| it is not | 579 |
| i want to | 577 |
| if you are | 570 |
| and i am | 553 |
News
The same process shown for the blogs file is applied to the news file (a sketch of the mirrored pipeline is shown below); only the output will be shown.
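For reference, a minimal sketch of the mirrored pipeline, assuming the same 10% sampling rate, cleaning steps, and helper functions used for the blogs sample (variable names such as news_sample are illustrative):
set.seed(1234)                                            # assumed seed
news_sample <- sample(news, round(length(news) * 0.10))   # 10% sample of the news file
news_sample <- replace_contraction(news_sample)
news_sample <- replace_hash(news_sample)
news_sample <- replace_html(news_sample)
news_sample <- replace_url(news_sample)
news_sample <- tolower(news_sample)
news_sample <- removePunctuation(news_sample, ucp = TRUE)
news_sample <- removeWords(news_sample, bad_words)
news_sample <- replace_white(news_sample)
news_dfm     <- corpus_dfm(news_sample, 1)      # unigrams, with stopwords
news_dfm_nsw <- corpus_dfm_nsw(news_sample, 1)  # unigrams, without stopwords
news_dfm_n2  <- corpus_dfm(news_sample, 2)      # bigrams
news_dfm_n3  <- corpus_dfm(news_sample, 3)      # trigrams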
List top 20 words by frequency (with and without stopwords)
| Word | Frequency |
|---|---|
| the | 196,845 |
| to | 89,485 |
| and | 88,075 |
| a | 87,699 |
| of | 76,923 |
| in | 66,968 |
| is | 40,959 |
| that | 36,700 |
| for | 35,106 |
| it | 27,410 |
| on | 26,666 |
| with | 25,150 |
| said | 25,111 |
| he | 25,015 |
| was | 23,552 |
| not | 21,290 |
| at | 21,237 |
| as | 18,620 |
| i | 18,329 |
| are | 17,097 |
| Word | Frequency |
|---|---|
| said | 25,111 |
| one | 8,315 |
| new | 6,950 |
| can | 6,883 |
| also | 5,891 |
| two | 5,839 |
| year | 5,511 |
| just | 5,316 |
| first | 5,196 |
| years | 5,172 |
| last | 5,114 |
| time | 5,053 |
| like | 4,977 |
| state | 4,826 |
| people | 4,655 |
| get | 4,434 |
| us | 4,323 |
| city | 3,707 |
| now | 3,607 |
| school | 3,534 |
Plot top 20 words for News sample with and without stopwords
Cumulative frequency plot to determine the total word count coverage at 50% and 90% of the total dictionary. In this case, the News corpus with and without stopwords is used.
From the cumulative frequency plot, the number of words that comprise 50% of the total word count is 155. For 90%, the number of words is 6,958. If stopwords are excluded, the number of words needed for 50% and 90% coverage increases to 1,128 and 12,320, respectively.
Create news_dfm_n2 (ngram = 2) and list top 20 words by frequency
| Word | Frequency |
|---|---|
| of the | 18,657 |
| in the | 17,777 |
| to the | 8,313 |
| it is | 7,381 |
| on the | 7,308 |
| for the | 6,957 |
| at the | 5,903 |
| in a | 5,160 |
| and the | 5,125 |
| to be | 4,686 |
| is a | 4,399 |
| with the | 4,303 |
| from the | 3,663 |
| with a | 3,415 |
| he said | 3,407 |
| of a | 3,336 |
| for a | 3,159 |
| will be | 3,073 |
| as a | 3,051 |
| that is | 2,935 |
Create news_dfm_n3 (ngram = 3) and list top 20 words by frequency
| Word | Frequency |
|---|---|
| one of the | 1,503 |
| a lot of | 1,170 |
| it is a | 967 |
| i do not | 761 |
| it is not | 712 |
| as well as | 616 |
| part of the | 576 |
| according to the | 566 |
| the end of | 552 |
| to be a | 548 |
| out of the | 542 |
| some of the | 526 |
| there is a | 522 |
| in the first | 515 |
| going to be | 511 |
| is going to | 500 |
| are going to | 442 |
| be able to | 431 |
| the united states | 428 |
| it was a | 419 |
Twitter
The same process shown for the blogs and news files is applied to the twitter file; only the output will be shown.
List top 20 words by frequency (with and without stopwords)
| Word | Frequency |
|---|---|
| the | 93,825 |
| i | 89,651 |
| to | 79,159 |
| a | 60,896 |
| you | 59,835 |
| is | 54,239 |
| and | 43,369 |
| for | 38,248 |
| it | 37,982 |
| in | 37,584 |
| of | 35,962 |
| not | 32,300 |
| my | 29,048 |
| on | 27,690 |
| that | 26,626 |
| are | 22,664 |
| have | 20,242 |
| me | 20,173 |
| at | 18,758 |
| be | 18,681 |
| Word | Frequency |
|---|---|
| just | 14,860 |
| can | 13,430 |
| like | 12,278 |
| get | 11,278 |
| love | 10,523 |
| good | 9,889 |
| day | 9,064 |
| rt | 8,985 |
| thanks | 8,737 |
| now | 8,191 |
| one | 8,180 |
| know | 7,927 |
| u | 7,732 |
| great | 7,586 |
| time | 7,544 |
| go | 7,213 |
| today | 7,036 |
| new | 6,969 |
| see | 6,665 |
| lol | 6,589 |
Plot top 20 words for Twitter sample with and without stopwords
Cumulative frequency plot to determine the total word count coverage at 50% and 90% of the total dictionary. In this case, the Twitter corpus with and without stopwords is used.
From the cumulative frequency plot, the number of words that comprise 50% of the total word count is 89. For 90%, the number of words is 3,801. If stopwords are excluded, the number of words needed for 50% and 90% coverage increases to 465 and 7,602, respectively.
Create twitter_dfm_n2 (ngram = 2) and list top 20 words by frequency
| Word | Frequency |
|---|---|
| i am | 15,734 |
| it is | 10,109 |
| do not | 7,917 |
| in the | 7,817 |
| for the | 7,252 |
| you are | 6,120 |
| of the | 5,789 |
| i have | 5,274 |
| on the | 4,921 |
| to be | 4,691 |
| can not | 4,610 |
| that is | 4,480 |
| to the | 4,421 |
| is a | 4,417 |
| i will | 4,330 |
| thanks for | 4,147 |
| if you | 3,864 |
| at the | 3,849 |
| will be | 3,731 |
| i love | 3,513 |
Create twitter_dfm_n3 (ngram = 3) and list top 20 words by frequency
| Word | Frequency |
|---|---|
| i do not | 2,468 |
| thanks for the | 2,275 |
| i can not | 1,502 |
| can not wait | 1,397 |
| i am not | 1,245 |
| it is a | 934 |
| i will be | 930 |
| looking forward to | 881 |
| thank you for | 878 |
| i love you | 867 |
| you do not | 845 |
| i am so | 765 |
| going to be | 756 |
| do not know | 746 |
| if you are | 731 |
| i did not | 720 |
| not wait to | 719 |
| for the follow | 701 |
| i am going | 693 |
| is going to | 684 |
Combined datasets
Combining the already cleaned and preprocessed samples from the blogs, news, and twitter datasets.
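A minimal sketch of the combination step, assuming the cleaned sample vectors from the previous sections (names other than comb_dfm_n2 are illustrative):
comb_sample  <- c(blog_sample, news_sample, twitter_sample)   # pool the cleaned samples
comb_dfm     <- corpus_dfm(comb_sample, 1)      # unigrams, with stopwords
comb_dfm_nsw <- corpus_dfm_nsw(comb_sample, 1)  # unigrams, without stopwords
comb_dfm_n2  <- corpus_dfm(comb_sample, 2)      # bigrams
comb_dfm_n3  <- corpus_dfm(comb_sample, 3)      # trigrams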
List top 20 words by frequency (with and without stopwords)
| Word | Frequency |
|---|---|
| the | 476,987 |
| to | 276,142 |
| and | 240,864 |
| a | 238,844 |
| of | 200,999 |
| i | 192,734 |
| in | 164,293 |
| is | 146,422 |
| that | 110,680 |
| for | 109,948 |
| it | 109,598 |
| you | 101,644 |
| on | 81,801 |
| not | 79,608 |
| with | 71,106 |
| was | 64,237 |
| have | 61,154 |
| my | 60,693 |
| are | 60,582 |
| at | 57,337 |
| Word | Frequency |
|---|---|
| can | 31,176 |
| said | 30,630 |
| just | 30,245 |
| one | 29,003 |
| like | 26,894 |
| get | 22,852 |
| time | 21,605 |
| new | 19,406 |
| now | 17,920 |
| good | 17,670 |
| day | 16,920 |
| us | 16,138 |
| know | 16,037 |
| love | 15,979 |
| people | 15,776 |
| back | 13,994 |
| go | 13,931 |
| see | 13,852 |
| first | 13,372 |
| also | 12,938 |
Plot top 20 words for combined samples with and without stopwords
Cumulative frequency plot to determine the total word count coverage at 50% and 90% of the total dictionary. In this case, the combined sample corpus with and without stopwords is used.
From the cumulative frequency plot, the number of words that comprise 50% of the total word count is 114. For 90%, the number of words is 6,487. If stopwords are excluded, the number of words needed for 50% and 90% coverage increases to 960 and 13,299, respectively.
Create comb_dfm_n2 (ngram = 2) and list top 20 words by frequency
| Word | Frequency |
|---|---|
| of the | 43,102 |
| in the | 41,271 |
| it is | 26,267 |
| i am | 25,916 |
| to the | 21,284 |
| for the | 20,067 |
| on the | 19,715 |
| to be | 16,196 |
| at the | 14,569 |
| do not | 14,372 |
| is a | 14,187 |
| i have | 13,437 |
| and the | 12,426 |
| in a | 12,006 |
| with the | 10,578 |
| that is | 10,313 |
| it was | 9,965 |
| will be | 9,901 |
| you are | 9,605 |
| for a | 9,604 |
Create comb_dfm_n3 (ngram = 3) and list top 20 words by frequency
| Word | Frequency |
|---|---|
| i do not | 4,626 |
| one of the | 3,478 |
| a lot of | 2,955 |
| it is a | 2,870 |
| i am not | 2,464 |
| thanks for the | 2,297 |
| i can not | 2,160 |
| it is not | 1,931 |
| i have been | 1,841 |
| to be a | 1,835 |
| going to be | 1,775 |
| i did not | 1,723 |
| there is a | 1,696 |
| is going to | 1,591 |
| if you are | 1,560 |
| can not wait | 1,549 |
| you do not | 1,495 |
| do not know | 1,478 |
| i will be | 1,461 |
| the end of | 1,460 |
Conclusions and next steps
When the text files (blogs, news, and twitter) are analyzed independently, the number of frequent n-grams they share increases with the n-gram size. Also, the number of unique words needed to cover each individual corpus increases significantly when stopwords are excluded. Therefore, the n-grams and word coverage of the combined sample corpus are a good starting dataset for our prediction model.
The next steps involve developing a prediction algorithm and deploying it as a Shiny app. For observed n-grams, a Markov chain approach will be used, exploiting the property that the next state depends only on the current state rather than on earlier history, which maps naturally to next-word prediction. For unobserved n-grams, backoff models will be implemented to estimate their probabilities.
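As an illustration of the intended approach (not the final implementation), a minimal backoff-style sketch over frequency tables like the ones above; predict_next, comb_freq_n2, and comb_freq_n3 are assumed names:
comb_freq_n2 <- textstat_frequency(comb_dfm_n2)   # bigram frequency table (feature, frequency)
comb_freq_n3 <- textstat_frequency(comb_dfm_n3)   # trigram frequency table
# return the most frequent continuation of the last two words,
# backing off to the bigram table when no trigram match is found
predict_next <- function(phrase) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
  hits <- comb_freq_n3$feature[startsWith(comb_freq_n3$feature,
                                          paste0(paste(words, collapse = " "), " "))]
  if (length(hits) == 0 && length(words) > 1) {
    hits <- comb_freq_n2$feature[startsWith(comb_freq_n2$feature, paste0(words[2], " "))]
  }
  if (length(hits) == 0) return(NA_character_)
  tail(strsplit(hits[1], " ")[[1]], 1)   # last word of the best-matching n-gram
}
predict_next("thanks for")   # likely "the", given the tables above
A full backoff model would also discount lower-order matches; this sketch only illustrates the lookup order.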
Appendix Section
Appendix I - Function to create corpus_dfm (document-feature matrix) including stopwords and corpus_dfm_nsw excluding stopwords
corpus_dfm <- function(x, n) {
text <- dfm(x, remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE, remove_twitter = TRUE,
ngrams = n, concatenator = " ")
text <- dfm_compress(text)
text <- dfm_trim(text, min_docfreq = 2)
return(text)
}
corpus_dfm_nsw <- function(x, n) {
text <- dfm(x, remove = stopwords("english"), remove_numbers = TRUE,
remove_punct = TRUE, remove_symbols = TRUE, remove_twitter = TRUE,
remove_separators = TRUE, ngrams = n, concatenator = " ")
text <- dfm_compress(text)
text <- dfm_trim(text, min_docfreq = 2)
return(text)
}
Appendix II - Function to plot bar charts (top 20 frequency words)
# using RColor Brewer expanded for frequency term plot
colorCount = 20
getPalette = colorRampPalette(brewer.pal(12, "Paired"))
top_20_barchart <- function(df, title) {
ggplot(df, aes(x = reorder(feature, frequency), y = frequency, fill = feature)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = getPalette(colorCount)) +
labs(title = title, x = "Word", y = "Total Count") +
coord_flip() +
theme(plot.title = element_text(hjust = 0.5),
legend.position = "none")
}
Appendix III - Function to plot cumulative frequencies and determine the number of words that comprise 50% and 90% of the total corpus
cum_freq_plot <- function(df, title, subtitle) {
# function for extracting the intersection value from y
f1 <- approxfun(df$Freqcum, df$rank)
f50 <- as.integer(f1(50))
f90 <- as.integer(f1(90))
ggplot(df, aes(x = as.numeric(row.names(df)), y = Freqcum)) +
geom_line() +
geom_hline(yintercept = 50, linetype = "dashed", color = "red") +
geom_hline(yintercept = 90, linetype = "dashed", color = "blue") +
geom_vline(xintercept = f50, linetype = "dashed") +
geom_text(aes(x = f50, y = 10, label = f50),
color = "red", angle = 90, size = 3, nudge_x = -0.05) +
geom_vline(xintercept = f90, linetype = "dashed") +
geom_text(aes(x = f90, y = 10, label = f90),
color = "blue", angle = 90, size = 3, nudge_x = -0.05) +
scale_x_log10(limits = c(1, nrow(df)),
breaks = c(50, 100, 5000, 10000)) +
scale_y_log10(breaks = c(25, 50, 80, 90, 95)) +
labs(title = title, subtitle = subtitle,
x = "Number of Words",
y = "Cumulative Frequency (%)")
}