Introduction
Data loading
Data cleaning
Data exploration
Summary

Introduction

The goal of the overall project is to develop an algorithm that predicts the next word, and to build a word suggestion application on top of it. The instructors provided source data in four languages (English, German, Russian and Finnish). Since 1) the English data set is expected to be used for the exercises and 2) I have little knowledge of the other three languages, I use the English data set for this project.

The objectives of this report are to 1) load the given data sets, 2) explore them and give a brief summary, and 3) come up with a plan for developing the algorithm and the application.

Data loading

Loading data sets

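# Read each English corpus as UTF-8; skipNul = TRUE skips any embedded NUL bytes instead of warning
US_B <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
US_N <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
US_T <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)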

Check the data sets

Check the size of the data sets

Blogs: 899288 lines and 2.6056432 × 10^8 bytes
News: 1010242 lines and 2.6175905 × 10^8 bytes
Twitter: 2360148 lines and 3.1603734 × 10^8 bytes
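
The report does not show how these figures were computed. The following is a minimal sketch, assuming the line count comes from length() and the byte figure from object.size() (the in-memory size). The helper name describe_corpus is hypothetical, and a second measure (the summed byte length of the text itself) is included for comparison.

```r
# Hypothetical helper: summarise a character vector of text lines.
# Assumption: "bytes" may mean the in-memory size from object.size();
# the report does not state which measure was used.
describe_corpus <- function(lines) {
  c(
    lines = length(lines),                          # number of lines
    bytes = as.numeric(object.size(lines)),         # in-memory size of the R object
    text_bytes = sum(nchar(lines, type = "bytes"))  # bytes of the text alone
  )
}

demo <- c("first line", "second, longer line")
describe_corpus(demo)
```

Applied to the full data, this would be called as describe_corpus(US_B), and likewise for US_N and US_T.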

These data sets are too large to analyze in full within a reasonable time. I therefore draw a random sample of lines from each data set and continue the analysis on the samples. Because working with more than 1000 lines slows the calculations significantly, I select 1000 lines from each data set.
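
The sampling step could look roughly like the sketch below. The helper name sample_lines and the seed value are assumptions made for illustration; the report does not say whether a seed was set.

```r
# Hypothetical helper: draw up to n lines at random, without replacement.
# The fixed seed (an arbitrary value) only makes the sample reproducible.
sample_lines <- function(lines, n = 1000, seed = 1234) {
  set.seed(seed)
  size <- min(n, length(lines))          # never ask for more lines than exist
  lines[sample.int(length(lines), size)]
}

demo <- sprintf("line %d", 1:5000)
length(sample_lines(demo))
```

With the data loaded as above, the three samples would be sample_lines(US_B), sample_lines(US_N) and sample_lines(US_T).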

Check sampled sets

Blogs: 1000 lines and 2.85616 × 10^5 bytes
News: 1000 lines and 2.5596 × 10^5 bytes
Twitter: 1000 lines and 1.36672 × 10^5 bytes

Data cleaning

I use the tm package, an R package specialized for text mining.

Cleaning the data sets

The cleaning process includes the following steps:

Convert uppercase to lowercase
Remove punctuation
Remove numbers
Stemming (remove affixes)
Remove offensive words
Remove extra whitespace

In my opinion, whether a word or sentence is offensive depends heavily on context. Any word could be used in an offensive way, and some potentially offensive words could be used in non-offensive ways. Detecting offensive usage and eliminating it requires another level of language processing. Therefore, I limit myself to removing seven unambiguously offensive words (“shit”, “piss”, “fuck”, “cunt”, “cocksucker”, “motherfucker”, “tits”) from the data sets.
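
The report's own tm calls are not shown, so the following is only a base R sketch of the same steps, not the author's code. The helper name clean_lines and its arguments are hypothetical. Stemming is optional here and is only applied if the SnowballC package happens to be installed. In tm itself, the corresponding transformations would be removePunctuation, removeNumbers, stemDocument, removeWords and stripWhitespace.

```r
# Hypothetical helper: a base R approximation of the cleaning steps.
# Note: banned words are dropped before stemming so that they are matched
# in their original form; the report lists stemming first.
clean_lines <- function(lines, banned = character(0), stem = FALSE) {
  x <- tolower(lines)                  # uppercase to lowercase
  x <- gsub("[[:punct:]]+", "", x)     # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)     # remove numbers
  tokens <- strsplit(x, "[[:space:]]+")
  tokens <- lapply(tokens, function(w) {
    w <- w[nzchar(w) & !(w %in% banned)]   # drop empty tokens and banned words
    if (stem && requireNamespace("SnowballC", quietly = TRUE)) {
      w <- SnowballC::wordStem(w, language = "english")
    }
    w
  })
  # Re-joining with single spaces also collapses extra whitespace
  vapply(tokens, paste, character(1), collapse = " ")
}

clean_lines("Hello, World!  42 times")
```

The banned argument would receive the list of words to remove, for example clean_lines(sampled_lines, banned = offensive_words), where both names stand for objects defined elsewhere.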

After the cleaning process, the text lines look like the examples below.

Blog: the faith it take to give someon a second chanc to believ in someon dream to walk with someon through a struggl is direct tie to your faith in god
News: at one point today assembl speaker sheila oliv dessex said she did not think the chief bill in the packag would pass until may but after lastminut negoti includ an intervent by christi the assembl pass that bill after lawmak remov a provis to let fulltim worker with fewer than year on the job switch to a kstyle plan some lawmak worri that could hurt the stabil of the pension system
Twitter: will the bruin be play black and yellow for the next month until the start of next season

Data exploration

Frequently used words

Calculate a term-document matrix (TDM), which records the number of times each word in the corpus is found in each document, to find the words with the highest frequency of usage.
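
As a rough illustration of what the word counts behind the TDM amount to, the base R sketch below counts tokens across all lines. The helper name word_frequencies is hypothetical. The min_chars = 3 default mirrors tm's TermDocumentMatrix default of wordLengths = c(3, Inf); that this default was left in place is an assumption, although it would be consistent with the absence of one- and two-letter words in the lists that follow.

```r
# Hypothetical helper: count how often each word occurs across all lines.
# Assumes the lines are already cleaned and whitespace-separated.
word_frequencies <- function(lines, min_chars = 3) {
  tokens <- unlist(strsplit(lines, "[[:space:]]+"))
  tokens <- tokens[nchar(tokens) >= min_chars]   # also drops empty tokens
  sort(table(tokens), decreasing = TRUE)
}

freq <- word_frequencies(c("the cat and the dog", "the end of it"))
head(freq, 20)   # the 20 most frequent words (fewer if the vocabulary is small)
```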

Top 20 words with the highest frequency of usage:

Blog: the, and, that, for, with, you, was, this, have, but, not, are, from, all, her, she, they, when, had, his
News: the, and, for, that, with, said, was, have, but, his, are, from, has, year, not, they, this, who, out, will
Twitter: the, you, and, for, that, your, have, just, not, this, but, with, like, what, are, get, love, good, thank, know

Compare the frequency of usage among the data sources (Blog, News and Twitter) by plotting the 200 words with the highest frequency of usage. Note that “the” and “and” had much higher frequencies than the other words (the = Blog: 1973, News: 1911, Twitter: 392; and = Blog: 1124, News: 870, Twitter: 189); therefore these two words were excluded from the graphs. The X and Y axes show the frequency of usage of each word in the respective data sources.

The word frequencies in the Blog and News sources are more similar to each other than either is to those in the Twitter source.
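
The data behind such a pairwise comparison can be assembled as in the sketch below, which lines up two sets of word counts on a shared vocabulary. The helper name compare_frequencies is hypothetical, and this is not necessarily how the original graphs were produced.

```r
# Hypothetical helper: align two named count vectors (or 1-d tables)
# on the union of their words; words missing from one source count as 0.
compare_frequencies <- function(freq_a, freq_b) {
  words <- union(names(freq_a), names(freq_b))
  a <- as.numeric(freq_a)[match(words, names(freq_a))]
  b <- as.numeric(freq_b)[match(words, names(freq_b))]
  a[is.na(a)] <- 0
  b[is.na(b)] <- 0
  data.frame(word = words, a = a, b = b, stringsAsFactors = FALSE)
}

blog_demo <- c(the = 5, cat = 2)
news_demo <- c(the = 4, dog = 1)
cmp <- compare_frequencies(blog_demo, news_demo)
plot(cmp$a, cmp$b, xlab = "Source A frequency", ylab = "Source B frequency")
```

Points close to the diagonal correspond to words used at similar rates in both sources.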

Next, examine bigrams (pairs of consecutive words) instead of single words.
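
A minimal base R sketch of bigram extraction is shown here, assuming bigrams are formed within each line and never across line boundaries. The helper name bigram_frequencies is hypothetical, and the report may have used a tokenizer from tm or another package instead.

```r
# Hypothetical helper: count pairs of consecutive words within each line.
bigram_frequencies <- function(lines) {
  token_list <- strsplit(lines, "[[:space:]]+")
  bigrams <- unlist(lapply(token_list, function(w) {
    w <- w[nzchar(w)]
    if (length(w) < 2) return(character(0))   # a one-word line has no bigram
    paste(head(w, -1), tail(w, -1))           # word i paired with word i + 1
  }))
  sort(table(bigrams), decreasing = TRUE)
}

bf <- bigram_frequencies(c("i love you", "i love it", "hi"))
head(bf, 20)   # most frequent bigrams
length(bf)     # number of distinct bigrams
```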

Top 20 bigrams with the highest frequency of usage:

Blog: of the, in the, to the, to be, on the, and i, and the, for the, i am, i have, in a, with the, that i, is a, it is, i was, it was, go to, at the, i had
News: in the, of the, on the, for the, to the, at the, and the, in a, with the, as a, by the, he was, to be, from the, is a, of a, want to, with a, for a, it was
Twitter: in the, for the, of the, to be, at the, go to, on the, thank for, to the, i have, i love, you know, do you, have a, i cant, i dont, thank you, follow me, for a, in a

How many bigrams were collected?

Blog: 27441
News: 24031
Twitter: 9298

Compare bigram counts among the data sources (Blog, News and Twitter) by plotting the 200 bigrams with the highest frequencies.

Summary

From the data exploration, I found that:

Words frequently used in blog posts and news articles are more similar to each other than to words frequently used in Twitter posts.
Fewer bigrams were found for Twitter than for the other two data sources. Each sample contains the same number of lines (1000), although the Twitter sample is roughly half the size of the other two in bytes; with that caveat, this finding also suggests that word usage on Twitter may differ from the other two sources.

Because of this uniqueness of Twitter, I plan to develop an algorithm that predicts the next word for the Twitter service. Since Twitter limits the number of characters in a post, I expect that more abbreviations are used there, along with frequent emoticons and emojis. I plan to spend the next several weeks accounting for these Twitter-specific features while developing the algorithm and the application.
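
The report does not yet commit to a particular prediction method. One possible starting point, shown purely as an illustration, is to suggest the words that most often follow the previous word according to bigram counts. The helper name suggest_next and the toy counts below are hypothetical.

```r
# Hypothetical helper: suggest the most frequent followers of `word`
# from a named vector of bigram counts (names look like "thank you").
suggest_next <- function(word, bigram_counts, n = 3) {
  parts <- strsplit(names(bigram_counts), " ", fixed = TRUE)
  first <- vapply(parts, `[`, character(1), 1)    # first word of each bigram
  second <- vapply(parts, `[`, character(1), 2)   # second word of each bigram
  hits <- first == word
  if (!any(hits)) return(character(0))            # unseen word: no suggestion
  ranked <- order(as.numeric(bigram_counts)[hits], decreasing = TRUE)
  head(second[hits][ranked], n)
}

# Toy counts for illustration only, not taken from the data
counts <- c("thank you" = 5, "thank for" = 3, "i love" = 4, "thank god" = 1)
suggest_next("thank", counts, n = 2)
```

A sketch like this returns nothing for words it has never seen, so a practical version would need a fallback such as suggesting the most frequent single words.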

Coursera Data Science Specialization Capstone Project: Report1
