1 R Markdown

This is a weekly milestone report for the capstone project of the JHU Capstone Course by Rongbin Ye, as part of the Data Science Specialization on Coursera. This report provides summary statistics about the data sets and reports some interesting findings via term frequency, n-grams, and word clouds. Furthermore, based on the outcome of the exploratory data analysis, this report proposes a plan for the capstone project, including a prediction algorithm for typing recommendation and a Shiny app on AWS.

2 Milestone Plan

As discussed in the beginning, this report provides a milestone plan for the whole capstone project. The scheduled time is seven weeks. Given the time constraint, the milestones have been set in an accelerated manner, which entails a tight schedule for cleaning the data and developing the model. There are two major deliverables: 1. An algorithm that recommends correlated English words, based on the given dataset. 2. A Shiny app for users to interact with.

The milestones have been set as the following four sections:

> Week 1 & 2: Data Familiarization + Exploratory Data Cleaning
> Week 3 & 4: Text Mining: extract and identify the key patterns related to the usage habits of English writers
> Week 5 & 6: Model Development: an algorithm to be developed and tuned in this process
> Week 7: Develop and deploy a Shiny App

3 Exploratory Analysis

3.1 Preparation

3.1.1 Load Libraries

library(tidyverse)
library(tm)
library(lexicon)
library(stringr)
library(stopwords)
library(tidytext)
library(textstem)
library(tidyr)

3.2 Load Data

In this section, using file connections, the txt files have been read into R line by line. The texts are stored as character vectors. Based on these three character chunks, the author is able to conduct the preliminary data cleaning and text preprocessing for exploratory data analysis.

# read multiple lines into one character vector: blogs
blogs_con <- file("C:/Users/yrbbe/Downloads/JHU-Text Analysis/final/en_US/en_US.blogs.txt")
blogs <- readLines(con = blogs_con)
close(blogs_con)
# all the blogs have been read in properly
# read multiple lines into one character vector: tweets
twitters_con <- file("C:/Users/yrbbe/Downloads/JHU-Text Analysis/final/en_US/en_US.twitter.txt")
twitter <- readLines(con = twitters_con)
## Warning in readLines(con = twitters_con): line 167155 appears to contain an
## embedded nul
## Warning in readLines(con = twitters_con): line 268547 appears to contain an
## embedded nul
## Warning in readLines(con = twitters_con): line 1274086 appears to contain an
## embedded nul
## Warning in readLines(con = twitters_con): line 1759032 appears to contain an
## embedded nul
close(twitters_con)
# all the tweets have been read in properly

# read multiple lines into one character vector: news
news_con <- file("C:/Users/yrbbe/Downloads/JHU-Text Analysis/final/en_US/en_US.news.txt")
news <- readLines(con = news_con)
## Warning in readLines(con = news_con): incomplete final line found on 'C:/Users/
## yrbbe/Downloads/JHU-Text Analysis/final/en_US/en_US.news.txt'
close(news_con)
# all the news has been read in properly
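
The warnings above point to embedded nul characters in the Twitter file and an incomplete final line in the news file. As a possible alternative (a sketch, not the read actually used in this report), readLines can skip the nuls and declare a UTF-8 encoding, which may also reduce the stray "â" tokens that show up in the frequency tables below.

# alternative read (a sketch, not run here): skip embedded nuls and declare UTF-8
twitter <- readLines("C:/Users/yrbbe/Downloads/JHU-Text Analysis/final/en_US/en_US.twitter.txt",
                     skipNul = TRUE, encoding = "UTF-8")
news <- readLines("C:/Users/yrbbe/Downloads/JHU-Text Analysis/final/en_US/en_US.news.txt",
                  encoding = "UTF-8")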

4 Basic Information

After loading in the data, let us look at the basic shape of the three data sets.

len_news <- length(news)
len_blogs <- length(blogs)
len_twitters <- length(twitter)

The length of news is 77259 lines. The length of blogs is 899288 lines. The length of twitters is 2360148 lines. Let us have a closer look at the words.
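
Beyond the number of lines, a rough word count gives a better sense of scale. The snippet below is a minimal sketch (not part of the original run) that counts whitespace-separated tokens with stringr.

# approximate word counts: number of whitespace-separated tokens per corpus
wc_blogs    <- sum(stringr::str_count(blogs, "\\S+"))
wc_news     <- sum(stringr::str_count(news, "\\S+"))
wc_twitters <- sum(stringr::str_count(twitter, "\\S+"))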

5 Text Preprocessing

In order to conduct the analysis, the three text vectors will be cleaned. The major cleaning process includes lowercasing, stripping extra whitespace, removing numbers and punctuation, lemmatization, and tokenization. Eventually, the text chunks are expected to be broken down into tokens for further analysis and comparison at the level of sentences and words.

blogs <- blogs %>%  removeNumbers() %>% removePunctuation() %>% lemmatize_strings()

news <- news %>%  removeNumbers() %>% removePunctuation() %>% lemmatize_strings()

twitter <- twitter %>% removeNumbers() %>% removePunctuation() %>% lemmatize_strings()
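
The lowercasing and whitespace stripping mentioned above are not shown in this chunk; unnest_tokens() used later lowercases by default, but an explicit version could look like the following sketch (str_squish comes from the stringr package loaded earlier).

# optional explicit steps (a sketch): lowercase and collapse extra whitespace;
# unnest_tokens() below already lowercases, so this is not strictly required
blogs   <- blogs   %>% tolower() %>% str_squish()
news    <- news    %>% tolower() %>% str_squish()
twitter <- twitter %>% tolower() %>% str_squish()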

6 Frequency Analysis

6.0.1 blogs

blogs_df <- tibble(line = 1:length(blogs), text = blogs)

tokens_blogs <- blogs_df  %>% 
  unnest_tokens(word, text) %>% 
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
tokens_blogs
## # A tibble: 401,562 x 2
##    word        n
##    <chr>   <int>
##  1 â      209407
##  2 time   104728
##  3 day     69875
##  4 people  59873
##  5 love    57915
##  6 iâ      55037
##  7 feel    47780
##  8 life    47411
##  9 book    40964
## 10 itâ     40847
## # ... with 401,552 more rows

Surprisingly, the five words bloggers wrote about most are: time, day, people, love, and life. (The most frequent token, "â", is an encoding artifact rather than a real word.) Such a philosophical and metaphysical finding reflects the everyday concerns of the writers and hints at the potential topics covered in the blogs.

6.0.2 News

news_df <- tibble(line = 1:length(news), text = news)

tokens_news <- news_df  %>% 
  unnest_tokens(word, text) %>% 
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
tokens_news
## # A tibble: 77,215 x 2
##    word        n
##    <chr>   <int>
##  1 â       10011
##  2 time     5058
##  3 people   3723
##  4 school   3584
##  5 game     3577
##  6 day      3355
##  7 play     3251
##  8 city     3093
##  9 include  2961
## 10 team     2819
## # ... with 77,205 more rows

Meanwhile, it seems the topics in the news are more about time, people, school, game, and day. One preliminary thought is that much of the coverage might be sports news: although no specific sport dominates, words such as game, play, team, and season support this impression.

tokens_news$org <-"news"   
tf_idf_news <- tokens_news %>% tidytext::bind_tf_idf(word,org, n)
tf_idf_news %>% arrange(desc(tf_idf))
## # A tibble: 77,215 x 6
##    word        n org        tf   idf tf_idf
##    <chr>   <int> <chr>   <dbl> <dbl>  <dbl>
##  1 â       10011 news  0.00839     0      0
##  2 time     5058 news  0.00424     0      0
##  3 people   3723 news  0.00312     0      0
##  4 school   3584 news  0.00300     0      0
##  5 game     3577 news  0.00300     0      0
##  6 day      3355 news  0.00281     0      0
##  7 play     3251 news  0.00272     0      0
##  8 city     3093 news  0.00259     0      0
##  9 include  2961 news  0.00248     0      0
## 10 team     2819 news  0.00236     0      0
## # ... with 77,205 more rows

To further consider the frequencies of words, adding the adjusted term frequency does not help us identify keywords here. Because the news tokens form a single document, every term appears in all documents of the collection, so the idf is zero and the tf-idf score is zero for every word. Hence, the raw term frequency alone is enough to understand this corpus.
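
A quick check (a sketch, not part of the original output) makes this explicit: with a single document, idf = log(number of documents / number of documents containing the term) = log(1/1) = 0 for every term.

# the idf column is zero throughout, so tf-idf adds no information for a single document
range(tf_idf_news$idf)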

6.1 Word Cloud - news

# To summarize the existing information, let us develop a word cloud accordingly. 
wordcloud::wordcloud(words = tf_idf_news$word, freq = tf_idf_news$tf, max.words = 10, colors = TRUE)

6.2 Twitter

twitter <- twitter %>%  removeNumbers() %>% removePunctuation() %>% lemmatize_strings()

twitters_df <- tibble(line = 1:length(twitter), text = twitter)

tokens_twitters <- twitters_df  %>% 
  unnest_tokens(word, text) %>% 
  anti_join(stop_words) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
tokens_twitters
## # A tibble: 468,683 x 2
##    word        n
##    <chr>   <int>
##  1 im     157940
##  2 love   119937
##  3 day    108525
##  4 rt      88236
##  5 time    84785
##  6 lol     66407
##  7 follow  66236
##  8 people  52671
##  9 happy   49494
## 10 â       48593
## # ... with 468,673 more rows
tokens_twitters$org <- "twitters"
tokens_blogs$org <-"blogs"
tf_idf_twitter <- tokens_twitters %>% tidytext::bind_tf_idf(word,org, n)
tf_idf_blogs <- tokens_blogs %>% tidytext::bind_tf_idf(word,org, n)

Yet, pure term frequency over-weights words that are common everywhere. In order to adjust for the influence of highly frequent words, I expand the raw counts into a tf-idf model using the inverse document frequency. Using the bind_tf_idf function, the tf-idf metrics are computed for the Twitter and blog tokens as well.

6.3 Summary of All Three Files

tokens_all <- rbind(tokens_news, tokens_twitters)
tokens_all <- rbind(tokens_all, tokens_blogs)
tf_idf_all <- tokens_all %>% tidytext::bind_tf_idf(word,org, n)
news_20 <- top_n(tf_idf_news, 20, wt = tf) %>% select(word)
twitter_20 <- top_n(tf_idf_twitter, 20, wt = tf) %>% select(word)
blogs_20 <- top_n(tf_idf_blogs, 20, wt = tf) %>% select(word)
all_20 <- cbind(news_20, twitter_20)
all_20 <- cbind(all_20, blogs_20)
colnames(all_20) <- c("Top News", "Top Tweets", "Top Blogs")
all_20
##    Top News Top Tweets Top Blogs
## 1         â        im          â
## 2      time      love       time
## 3    people       day        day
## 4    school        rt     people
## 5      game      time       love
## 6       day       lol         iâ
## 7      play    follow       feel
## 8      city    people       life
## 9   include     happy       book
## 10     team         â        itâ
## 11     home   tonight      start
## 12  percent     night         im
## 13      run      feel       week
## 14     call     watch      write
## 15    start      hope      leave
## 16  million     youre       read
## 17   season      game       call
## 18     week      life      world
## 19   county     tweet       home
## 20      win     start     friend

7 Conclusion

Through the exploration, one can discover some key patterns in these corpora. The summary of the major keywords is presented here in the form of word clouds.

7.1 Word Cloud - All

All_join_summary <- inner_join(tf_idf_news, tf_idf_blogs, by = "word")
All_join_summary <- inner_join(All_join_summary, tf_idf_twitter, by = "word")
#All_join_summary %>% arrange(desc())
All_join_summary$tf_idf_all <- All_join_summary$tf_idf.x + All_join_summary$tf_idf.y + All_join_summary$tf_idf
All_join_summary$tf_all <- All_join_summary$tf.x + All_join_summary$tf.y + All_join_summary$tf
All_join_summary <- All_join_summary %>% arrange(desc(tf_all))
All_join_summary
## # A tibble: 43,457 x 18
##    word    n.x org.x    tf.x idf.x tf_idf.x    n.y org.y    tf.y idf.y tf_idf.y
##    <chr> <int> <chr>   <dbl> <dbl>    <dbl>  <int> <chr>   <dbl> <dbl>    <dbl>
##  1 â     10011 news  8.39e-3     0        0 209407 blogs 0.0145      0        0
##  2 time   5058 news  4.24e-3     0        0 104728 blogs 0.00727     0        0
##  3 im     1397 news  1.17e-3     0        0  37835 blogs 0.00263     0        0
##  4 day    3355 news  2.81e-3     0        0  69875 blogs 0.00485     0        0
##  5 love   1107 news  9.27e-4     0        0  57915 blogs 0.00402     0        0
##  6 peop~  3723 news  3.12e-3     0        0  59873 blogs 0.00416     0        0
##  7 feel   1555 news  1.30e-3     0        0  47780 blogs 0.00332     0        0
##  8 start  2576 news  2.16e-3     0        0  38687 blogs 0.00269     0        0
##  9 life   1485 news  1.24e-3     0        0  47411 blogs 0.00329     0        0
## 10 week   2439 news  2.04e-3     0        0  37087 blogs 0.00258     0        0
## # ... with 43,447 more rows, and 7 more variables: n <int>, org <chr>,
## #   tf <dbl>, idf <dbl>, tf_idf <dbl>, tf_idf_all <dbl>, tf_all <dbl>
# To summarize the existing information, let us develop a word cloud accordingly. 
wordcloud::wordcloud(words = All_join_summary$word, freq = All_join_summary$tf_all, max.words = 10, colors = TRUE)

7.2 Word Cloud - Blogs

# To summarize the existing information, let us develop a word cloud accordingly. 
wordcloud2::wordcloud2(data = tf_idf_blogs)

7.3 Word Cloud - Twitter

# To summarize the existing information, let us develop a word cloud accordingly. 
wordcloud::wordcloud(words = tf_idf_twitter$word, freq = tf_idf_twitter$tf, max.words = 10, colors = TRUE)

7.4 N-Gram Analysis

After the analysis of bigrams and trigrams, many meaningful connections among words were discovered. These connections can serve as the foundation for developing the recommendation system for users, should there be demand for one based on the existing corpus. A sketch of the extraction step follows.
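
The n-gram extraction itself is not shown above; the following is a minimal sketch of how the bigrams and trigrams could be produced with tidytext, using the blogs_df tibble built earlier.

# bigrams: split each blog line into consecutive word pairs and count them
bigrams_blogs <- blogs_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

# trigrams: the same idea with three-word windows
trigrams_blogs <- blogs_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  count(trigram, sort = TRUE)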

7.5 Application of Findings

The exploration of the distribution of words and of the connections between words enables us to study how words relate to one another within sentences. The bigrams and trigrams shed light on the recommendation system: given the preceding one or two words, the most frequent continuation can be suggested to the user, as illustrated in the sketch below.
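
As a rough illustration of that idea (a sketch only, not the final algorithm planned for the coming weeks), the bigram counts from the previous sketch can be turned into a simple next-word lookup; bigram_lookup is a hypothetical helper name.

# split each counted bigram into (first word, second word) and keep the three
# most frequent continuations of every first word
bigram_lookup <- bigrams_blogs %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  group_by(word1) %>%
  slice_max(n, n = 3) %>%
  ungroup()

# example: candidate words to suggest after "happy"
bigram_lookup %>% filter(word1 == "happy")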