Loading the Data

# Directory holding the three en_US corpus files
data_dir <- "E:/Nada/Others/Courses/Data Science Specialization/Ex/Capstone/final/en_US"

# Read each file as UTF-8, skipping embedded NUL bytes and suppressing incomplete-final-line warnings
news <- readLines(file.path(data_dir, "en_US.news.txt"), encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
blogs <- readLines(file.path(data_dir, "en_US.blogs.txt"), encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
twitter <- readLines(file.path(data_dir, "en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE, warn = FALSE)

Summary Statistics

This section reports, for each of the three files, the number of lines, characters, and words, together with the minimum, mean, and maximum number of words per line (WPL).

##   Dataset   Lines     Chars    Words WPL_Min WPL_Mean WPL_Max
## 1    news   77259  15639408  2651432       1 34.61779    1123
## 2   blogs  899288 206824382 37570839       0 41.75107    6726
## 3 twitter 2360148 162096241 30451170       1 12.75065      47
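
The report does not show the code that produced this table, so the following is only a minimal sketch of how such a summary could be computed in base R. The helper name `summarise_lines` and the small `demo_*` vectors are hypothetical, introduced here so the example runs without the corpus files. It counts words by splitting on whitespace, which is an assumption: a different tokenizer would give somewhat different word counts than those shown above.

```r
# Hypothetical helper: summarise one character vector of text lines.
# Word counting by whitespace splitting is an assumption, not the
# report's confirmed method.
summarise_lines <- function(name, lines) {
  # Split each trimmed line on runs of whitespace; an empty line yields zero words.
  wpl <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    Dataset  = name,
    Lines    = length(lines),
    Chars    = sum(nchar(lines, type = "chars", allowNA = TRUE), na.rm = TRUE),
    Words    = sum(wpl),
    WPL_Min  = min(wpl),
    WPL_Mean = mean(wpl),
    WPL_Max  = max(wpl)
  )
}

# Tiny stand-in vectors (hypothetical data) in place of the real corpus.
demo_news  <- c("The quick brown fox", "jumps")
demo_blogs <- c("", "one two three")

# One row per dataset, in the same column layout as the table above.
summary_stats <- rbind(
  summarise_lines("news", demo_news),
  summarise_lines("blogs", demo_blogs)
)
print(summary_stats)
```

With the real data, the same helper could be applied to `news`, `blogs`, and `twitter` and the rows combined with `rbind()`. A line containing no words, as in the stand-in blogs vector, gives a WPL minimum of 0, which is consistent with the 0 reported for blogs.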

Data Preparation


Exploratory Data Analysis


Capstone

Nada Hossam Sharkawi

3/5/2020
