As part of the Capstone project offered by JHU through Coursera, this report puts into practice, on a real-world data set, the skills of getting and cleaning data.
library("dplyr")
library("ngram")
library("tidytext")
library("janeaustenr")
library("qdap")
fileurl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileurl, destfile = "./SwiftKey.zip", method = "curl")
unzip("SwiftKey.zip", exdir = "./SwiftKey")
news <- readLines("./SwiftKey/final/en_US/en_US.news.txt")
blogs <- readLines("./SwiftKey/final/en_US/en_US.blogs.txt")
twitter <- readLines("./SwiftKey/final/en_US/en_US.twitter.txt")
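On some systems readLines() warns about embedded nul characters or an incomplete final line in these files. If that happens, a hedged workaround is to read through a binary connection with skipNul = TRUE, for example:

# read the news file through a binary connection, dropping embedded nuls
con <- file("./SwiftKey/final/en_US/en_US.news.txt", open = "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)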
newlen <- length(news)
blolen <- length(blogs)
twilen <- length(twitter)
The number of lines in each file:
There are 77259 lines in en_US.news.txt
There are 899288 lines in en_US.blogs.txt
There are 2360148 lines in en_US.twitter.txt
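These sentences can be generated directly from the counts computed above, for example:

cat(sprintf("There are %d lines in en_US.news.txt\n", newlen))
cat(sprintf("There are %d lines in en_US.blogs.txt\n", blolen))
cat(sprintf("There are %d lines in en_US.twitter.txt\n", twilen))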
To get the number of words in each file, we can use bash.
# wc -w prints "<count> <filename>"; read then splits that line into the count and the name
wc -w SwiftKey/final/en_US/en_US.news.txt > nwords
read nwords filename < nwords
echo "Number of words in News: $nwords"
wc -w SwiftKey/final/en_US/en_US.blogs.txt > bwords
read bwords filename < bwords
echo "Number of words in Blogs: $bwords"
wc -w SwiftKey/final/en_US/en_US.twitter.txt > twords
read twords filename < twords
echo "Number of words in Twitter: $twords"
## Number of words in News: 34365936
## Number of words in Blogs: 37334117
## Number of words in Twitter: 30373559
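As a rough cross-check without leaving R, the words can also be counted by splitting each line on whitespace. The sketch below is an approximation and may differ slightly from wc -w on edge cases:

# approximate word counts: split each line on runs of whitespace and sum the pieces
news_words    <- sum(lengths(strsplit(trimws(news), "\\s+")))
blogs_words   <- sum(lengths(strsplit(trimws(blogs), "\\s+")))
twitter_words <- sum(lengths(strsplit(trimws(twitter), "\\s+")))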
Mining the text, we can plot the five most common words in each file.
ft <- freq_terms(news, 5)
plot(ft, main = "Most common words in news")
ft <- freq_terms(blogs, 5)
plot(ft, main = "Most common words in blogs")
ft <- freq_terms(twitter, 5)
plot(ft, main = "Most common words in twitter")
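Since tidytext and dplyr are already loaded, the same frequencies could also be computed with a tidy workflow. The sketch below is one such alternative, not the approach used for the plots above; note that stop words are not removed, so common function words will dominate the top of the list:

# tokenize the blog lines into words and count them, most frequent first
blogs_freq <- dplyr::tibble(text = blogs) %>%
    tidytext::unnest_tokens(word, text) %>%
    dplyr::count(word, sort = TRUE)
head(blogs_freq, 5)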