This is a weekly report for the capstone project of the JHU Capstone Course by Rongbin Ye, as part of the Data Science Specialization on Coursera. This report provides summary statistics about the data sets and reports some interesting findings via term frequency, n-grams, and word clouds. Furthermore, based on the exploratory data analysis, this report proposes a plan for the capstone project, including a prediction algorithm for typing recommendations and a Shiny app hosted on AWS.
As discussed at the beginning, this report provides a milestone plan for the whole capstone project. The scheduled time is seven weeks. Given the time constraint, the milestones have been set in an accelerated manner, which entails a tight schedule for cleaning the data and developing the model. There are two major deliverables: 1. An algorithm that recommends correlated English words, based on the given data set. 2. A Shiny app for users to interact with.
The milestones are organized into the following four sections: > Week 1 & 2: Data Familiarization + Exploratory Data Cleaning > Week 3 & 4: Text Mining: extract and identify the key patterns in the usage habits of English writers > Week 5 & 6: Model Development: an algorithm is developed and tuned in this phase > Week 7: Develop and Deploy a Shiny App
library(tidyverse)   # data manipulation and plotting
library(tm)          # text mining helpers: removeNumbers(), removePunctuation()
library(lexicon)     # lemma dictionaries used by textstem
library(stringr)     # string helpers
library(stopwords)   # stop word lists
library(tidytext)    # unnest_tokens(), bind_tf_idf(), stop_words
library(textstem)    # lemmatize_strings()
library(tidyr)       # data reshaping (also attached via tidyverse)
In this section, using file connections, the txt files are read into R line by line. The texts are stored as character vectors. Based on these three chunks of text, the author conducts preliminary data cleaning and text preprocessing for exploratory data analysis.
# read in multiple lines into one data frame: blogs
blogs_con <- file("C:/Users/yrbbe/Downloads/JHU-Text Analysis/final/en_US/en_US.blogs.txt")
blogs <- readLines(con = blogs_con)
close(blogs_con)
# all the blogs have been loaded properly
# read in multiple lines into one data frame: twitters
twitters_con <- file("C:/Users/yrbbe/Downloads/JHU-Text Analysis/final/en_US/en_US.twitter.txt")
twitter <- readLines(con = twitters_con)
## Warning in readLines(con = twitters_con): line 167155 appears to contain an
## embedded nul
## Warning in readLines(con = twitters_con): line 268547 appears to contain an
## embedded nul
## Warning in readLines(con = twitters_con): line 1274086 appears to contain an
## embedded nul
## Warning in readLines(con = twitters_con): line 1759032 appears to contain an
## embedded nul
close(twitters_con)
# all the tweets have been read in properly
# read in multiple lines into one data frame: news
news_con <- file("C:/Users/yrbbe/Downloads/JHU-Text Analysis/final/en_US/en_US.news.txt")
news <- readLines(con = news_con)
## Warning in readLines(con = news_con): incomplete final line found on 'C:/Users/
## yrbbe/Downloads/JHU-Text Analysis/final/en_US/en_US.news.txt'
close(news_con)
# all the news lines have been read in properly
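The warnings above indicate embedded nul characters in the Twitter file and an incomplete final line in the news file. A minimal sketch of a more defensive read is shown below, reusing the same paths; skipNul = TRUE and an explicit encoding are standard readLines()/file() arguments, and this is only a suggestion rather than part of the processing actually run for this report.
# optional: re-read the Twitter file, dropping embedded nuls and declaring UTF-8 encoding
twitters_con <- file("C:/Users/yrbbe/Downloads/JHU-Text Analysis/final/en_US/en_US.twitter.txt", encoding = "UTF-8")
twitter <- readLines(con = twitters_con, skipNul = TRUE)
close(twitters_con)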
After loading the data, let us look at the basic shape of the three data sets.
len_news <- length(news)
len_blogs <- length(blogs)
len_twitters <- length(twitter)
The news data set contains 77,259 lines, the blogs data set 899,288 lines, and the Twitter data set 2,360,148 lines. Let us have a closer look at the words.
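Line counts alone do not capture corpus size; a quick, approximate word count per corpus gives a complementary summary statistic. The sketch below uses stringr, which is already loaded.
# approximate word counts: number of whitespace-separated tokens per corpus
words_news <- sum(str_count(news, "\\S+"))
words_blogs <- sum(str_count(blogs, "\\S+"))
words_twitters <- sum(str_count(twitter, "\\S+"))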
In order to conduct the analysis, the three texts are cleaned. The major cleaning steps include lowercasing, stripping whitespace on both sides, removing punctuation, lemmatization, and tokenization. Eventually, the text chunks are broken down into tokens for further analysis and comparison at the level of sentences and words.
blogs <- blogs %>% removeNumbers() %>% removePunctuation() %>% lemmatize_strings()
news <- news %>% removeNumbers() %>% removePunctuation() %>% lemmatize_strings()
twitter <- twitter %>% removeNumbers() %>% removePunctuation() %>% lemmatize_strings()
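The chunk above handles numbers, punctuation, and lemmatization; the lowercasing and whitespace stripping mentioned in the cleaning plan are not shown there. A minimal sketch of those two extra steps follows (note that unnest_tokens() below also lowercases by default, so this is optional).
# optional: lowercase the text and trim leading/trailing whitespace
blogs <- blogs %>% tolower() %>% str_trim()
news <- news %>% tolower() %>% str_trim()
twitter <- twitter %>% tolower() %>% str_trim()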
blogs_df <- tibble(line = 1:length(blogs), text = blogs)
tokens_blogs <- blogs_df %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE)
## Joining, by = "word"
tokens_blogs
## # A tibble: 401,562 x 2
## word n
## <chr> <int>
## 1 â 209407
## 2 time 104728
## 3 day 69875
## 4 people 59873
## 5 love 57915
## 6 iâ 55037
## 7 feel 47780
## 8 life 47411
## 9 book 40964
## 10 itâ 40847
## # ... with 401,552 more rows
Surprisingly, in the blogs, the five words bloggers wrote about most are time, day, people, love, and life. Such a philosophical, almost metaphysical finding points to the dominant, everyday topics of this corpus. One preliminary conclusion is that these blogs center on personal experience and reflection.
news_df <- tibble(line = 1:length(news), text = news)
tokens_news <- news_df %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE)
## Joining, by = "word"
tokens_news
## # A tibble: 77,215 x 2
## word n
## <chr> <int>
## 1 â 10011
## 2 time 5058
## 3 people 3723
## 4 school 3584
## 5 game 3577
## 6 day 3355
## 7 play 3251
## 8 city 3093
## 9 include 2961
## 10 team 2819
## # ... with 77,205 more rows
Meanwhile, the topics in the news are more about time, people, school, game, and day. One preliminary thought is that much of the covered material might be sports news: although no specific sport dominates, words such as season, game, and team appear frequently, which supports this reading.
tokens_news$org <-"news"
tf_idf_news <- tokens_news %>% tidytext::bind_tf_idf(word,org, n)
tf_idf_news %>% arrange(desc(tf_idf))
## # A tibble: 77,215 x 6
## word n org tf idf tf_idf
## <chr> <int> <chr> <dbl> <dbl> <dbl>
## 1 â 10011 news 0.00839 0 0
## 2 time 5058 news 0.00424 0 0
## 3 people 3723 news 0.00312 0 0
## 4 school 3584 news 0.00300 0 0
## 5 game 3577 news 0.00300 0 0
## 6 day 3355 news 0.00281 0 0
## 7 play 3251 news 0.00272 0 0
## 8 city 3093 news 0.00259 0 0
## 9 include 2961 news 0.00248 0 0
## 10 team 2819 news 0.00236 0 0
## # ... with 77,205 more rows
Considering the adjusted term frequencies further, one can see that tf-idf does not help identify keywords here. Because each corpus is treated as a single document, every term appears in the only document of its collection, so the inverse document frequency is zero and tf-idf collapses to zero for every word. Hence, within a single corpus, the raw term frequency alone is already enough to understand the corpus.
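For reference, bind_tf_idf() follows the standard definition (a sketch of the formula, where $N$ is the number of documents in the collection and $n_t$ is the number of documents containing term $t$):

$$\operatorname{tf\text{-}idf}(t, d) = \operatorname{tf}(t, d) \times \ln\!\left(\frac{N}{n_t}\right)$$

With a single corpus treated as one document, $N = n_t = 1$, the logarithm is zero, and every tf-idf value collapses to zero, as the table above shows.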
# To summarize the existing information, let us develop a word cloud accordingly.
wordcloud::wordcloud(words = tf_idf_news$word, freq = tf_idf_news$tf, max.words = 10, colors = TRUE)
### Twitter
twitters_df <- tibble(line = 1:length(twitter), text = twitter)
tokens_twitters <- twitters_df %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE)
## Joining, by = "word"
tokens_twitters
## # A tibble: 468,683 x 2
## word n
## <chr> <int>
## 1 im 157940
## 2 love 119937
## 3 day 108525
## 4 rt 88236
## 5 time 84785
## 6 lol 66407
## 7 follow 66236
## 8 people 52671
## 9 happy 49494
## 10 â 48593
## # ... with 468,673 more rows
tokens_twitters$org <- "twitters"
tokens_blogs$org <-"blogs"
tf_idf_twitter <- tokens_twitters %>% tidytext::bind_tf_idf(word,org, n)
tf_idf_blogs <- tokens_blogs %>% tidytext::bind_tf_idf(word,org, n)
Yet pure term frequency over-weights common words. In order to adjust for the influence of highly frequent words, I adopt the inverse document frequency and expand the counts into a tf-idf model. Using the bind_tf_idf() function, the tf-idf metrics are computed for each corpus.
tokens_all <- rbind(tokens_news, tokens_twitters)
tokens_all <- rbind(tokens_all, tokens_blogs)
tf_idf_all <- tokens_all %>% tidytext::bind_tf_idf(word,org, n)
news_20 <- top_n(tf_idf_news, 20, wt = tf) %>% select(word)
twitter_20 <- top_n(tf_idf_twitter, 20, wt = tf) %>% select(word)
blogs_20 <- top_n(tf_idf_blogs, 20, wt = tf) %>% select(word)
all_20 <- cbind(news_20, twitter_20)
all_20 <- cbind(all_20, blogs_20)
colnames(all_20) <- c("Top News", "Top Tweets", "Top Blogs")
all_20
##    Top News Top Tweets Top Blogs
## 1 â im â
## 2 time love time
## 3 people day day
## 4 school rt people
## 5 game time love
## 6 day lol iâ
## 7 play follow feel
## 8 city people life
## 9 include happy book
## 10 team â itâ
## 11 home tonight start
## 12 percent night im
## 13 run feel week
## 14 call watch write
## 15 start hope leave
## 16 million youre read
## 17 season game call
## 18 week life world
## 19 county tweet home
## 20 win start friend
Through this exploration, one can discover some key patterns in these corpora. The summary of the major keywords is presented here in the form of word clouds.
## Word Cloud - All
All_join_summary <- inner_join(tf_idf_news, tf_idf_blogs, by = "word")
All_join_summary <- inner_join(All_join_summary, tf_idf_twitter, by = "word")
#All_join_summary %>% arrange(desc())
All_join_summary$tf_idf_all <- All_join_summary$tf_idf.x + All_join_summary$tf_idf.y + All_join_summary$tf_idf
All_join_summary$tf_all <- All_join_summary$tf.x + All_join_summary$tf.y + All_join_summary$tf
All_join_summary <- All_join_summary %>% arrange(desc(tf_all))
All_join_summary
## # A tibble: 43,457 x 18
## word n.x org.x tf.x idf.x tf_idf.x n.y org.y tf.y idf.y tf_idf.y
## <chr> <int> <chr> <dbl> <dbl> <dbl> <int> <chr> <dbl> <dbl> <dbl>
## 1 â 10011 news 8.39e-3 0 0 209407 blogs 0.0145 0 0
## 2 time 5058 news 4.24e-3 0 0 104728 blogs 0.00727 0 0
## 3 im 1397 news 1.17e-3 0 0 37835 blogs 0.00263 0 0
## 4 day 3355 news 2.81e-3 0 0 69875 blogs 0.00485 0 0
## 5 love 1107 news 9.27e-4 0 0 57915 blogs 0.00402 0 0
## 6 peop~ 3723 news 3.12e-3 0 0 59873 blogs 0.00416 0 0
## 7 feel 1555 news 1.30e-3 0 0 47780 blogs 0.00332 0 0
## 8 start 2576 news 2.16e-3 0 0 38687 blogs 0.00269 0 0
## 9 life 1485 news 1.24e-3 0 0 47411 blogs 0.00329 0 0
## 10 week 2439 news 2.04e-3 0 0 37087 blogs 0.00258 0 0
## # ... with 43,447 more rows, and 7 more variables: n <int>, org <chr>,
## # tf <dbl>, idf <dbl>, tf_idf <dbl>, tf_idf_all <dbl>, tf_all <dbl>
# To summarize the existing information, let us develop a word cloud accordingly.
wordcloud::wordcloud(words = All_join_summary$word, freq = All_join_summary$tf_all, max.words = 10, colors = TRUE)
## Word Cloud - Blogs
# To summarize the existing information, let us develop a word cloud accordingly.
wordcloud2::wordcloud2(data = tf_idf_blogs)
## Word Cloud - Twitter
# To summarize the existing information, let us develop a word cloud accordingly.
wordcloud::wordcloud(words = tf_idf_twitter$word, freq = tf_idf_twitter$tf, max.words = 10, colors = TRUE)
An analysis of bigrams and trigrams reveals many meaningful connections among words, and these connections could serve as the foundation for the next-word recommendation system for users, should there be demand for a recommendation system based on the existing corpora.
The exploration of the distribution of words and of the connections between words, within sentences and across corpora, enables us to examine these relationships further; the bigram and trigram models shed light on the recommendation system, as sketched below.
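As a minimal sketch of that n-gram step, assuming the same tidytext pipeline used above, bigrams and trigrams can be extracted from the blogs data frame with unnest_tokens() and counted just like single words; the other two corpora would be processed the same way.
# sketch: extract and count bigrams and trigrams from the blogs corpus
bigrams_blogs <- blogs_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)
trigrams_blogs <- blogs_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  count(trigram, sort = TRUE)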