The mission of this project is to clean a raw text database and describe what people read and write in online news, social media and blogs within one sample year, as part of a JHU Coursera capstone course project.
The Internet is an important source of data nowadays, and effective statistical and IT methods for processing its big data are of high interest.
This project uses the raw (uncleaned) .txt data in the original languages, as they are.
The documentation (lecture slides) on the data processing methods may be found here: https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html.
The data itself may be found here: .
The project is written in R, a free statistical programming language; the code samples are shown throughout this text.
The English part of this report is based on the English Twitter, news and US/UK blogs datasets. The code and the output look like this:
Sys.setlocale("LC_ALL","English")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
### Now load the data:
my_path = "C:\\Users\\Alex\\Documents\\COURSERA STUDIES\\DATA_SCIENCE_CAPSTONE\\FINAL_DATA\\final\\"
dir(my_path)
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
my_path_eng = "C:\\Users\\Alex\\Documents\\COURSERA STUDIES\\DATA_SCIENCE_CAPSTONE\\FINAL_DATA\\final\\en_US\\"
# "de_DE" "en_US" "fi_FI" "ru_RU"
filenames = dir(my_path, pattern="*.txt")
filenames = dir(my_path_eng, pattern="*.txt")
str(filenames)
## chr [1:3] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
files <- list.files(path = my_path, full.names = TRUE, recursive = TRUE, pattern="*.txt")
str(files)
## chr [1:12] "C:\\Users\\Alex\\Documents\\COURSERA STUDIES\\DATA_SCIENCE_CAPSTONE\\FINAL_DATA\\final\\/de_DE/de_DE.blogs.txt" ...
summary(files)
## Length Class Mode
## 12 character character
# Open all files together - we need more data to teach our algorithm
ldf <- lapply(paste(my_path_eng, filenames, sep=""), readLines)
## Warning in FUN(X[[i]], ...): incomplete final
## line found on 'C:\Users\Alex\Documents\COURSERA
## STUDIES\DATA_SCIENCE_CAPSTONE\FINAL_DATA\final\en_US\en_US.news.txt'
## Warning in FUN(X[[i]], ...): line 167155 appears to contain an embedded nul
## Warning in FUN(X[[i]], ...): line 268547 appears to contain an embedded nul
## Warning in FUN(X[[i]], ...): line 1274086 appears to contain an embedded nul
## Warning in FUN(X[[i]], ...): line 1759032 appears to contain an embedded nul
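For reference, the same read can be done without the embedded-nul warnings by passing skipNul = TRUE and an explicit encoding to readLines() (exactly what the per-file reads further below do); a minimal, optional sketch:
ldf <- lapply(paste(my_path_eng, filenames, sep=""), readLines, encoding = "UTF-8", skipNul = TRUE) # skip embedded nul bytes; the "incomplete final line" warning for the news file is harmless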
str(ldf)
## List of 3
## $ : chr [1:899288] "In the years thereafter, most of the Oil fields and platforms were named after pagan â\200œgodsâ\200\235." "We love you Mr. Brown." "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been"| __truncated__ "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter."| __truncated__ ...
## $ : chr [1:77259] "He wasn't home alone, apparently." "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset o"| __truncated__ "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new bi"| __truncated__ "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton"| __truncated__ ...
## $ : chr [1:2360148] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long." "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason." "they've decided its more fun if I don't." "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)" ...
res <- lapply(ldf, summary)
res # [[1]] - blogs; [[2]] - news; [[3]] - twitter; see below, where we take the summary of each file separately
## [[1]]
## Length Class Mode
## 899288 character character
##
## [[2]]
## Length Class Mode
## 77259 character character
##
## [[3]]
## Length Class Mode
## 2360148 character character
#for (i in 1:length(res))
# assign(paste(paste("df", i, sep=""), "summary", sep="."), res[[i]]) #if there would be DFs different on number in names
# Also open all files separately - it may be useful:
blogs <- readLines(paste(my_path_eng, "en_US.blogs.txt", sep=""), encoding = "UTF-8", skipNul = TRUE)
news <- readLines(paste(my_path_eng, "en_US.news.txt", sep=""), encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines(paste(my_path_eng, "en_US.news.txt", sep = ""), encoding
## = "UTF-8", : incomplete final line found on 'C:\Users\Alex\Documents\COURSERA
## STUDIES\DATA_SCIENCE_CAPSTONE\FINAL_DATA\final\en_US\en_US.news.txt'
twitter <- readLines(paste(my_path_eng, "en_US.twitter.txt", sep=""), encoding = "UTF-8", skipNul = TRUE)
summary(blogs)
## Length Class Mode
## 899288 character character
summary(news)
## Length Class Mode
## 77259 character character
summary(twitter)
## Length Class Mode
## 2360148 character character
file_sizes <- lapply(paste(my_path_eng, filenames, sep=""), file.info)
file_sizes[[2]]$size
## [1] 205811889
dat <- file_sizes[[1]]$size
for (i in 1:3){
c = paste("file ", i, "size is equal to:", file_sizes[[i]]$size, sep = " ")
print(c)
# print( file_sizes[[i]]$isdir)
# write.dna(dat, file = sprintf("./fastas/%s.fasta", i), format = "fasta")
}
## [1] "file 1 size is equal to: 210160014"
## [1] "file 2 size is equal to: 205811889"
## [1] "file 3 size is equal to: 167105338"
### Loop answering all the questions about the dataset:
for (i in 1:3){
a = paste("The name of the file number", i, "is", filenames[[i]], sep = " ")
print(a)
b = paste("File number ", i, "(", filenames[[i]], ")", "is", ifelse(i == 1, "blogs", ifelse(i == 2, "news", "twitter")), sep = " ")
print(b)
c1 = paste("File ", i, "(", filenames[[i]], ")", "size is equal to:", file_sizes[[i]]$size, sep = " ")
print(c1)
d = paste("Is file ", i, "(", filenames[[i]], ")", "is a directory?", ifelse(file_sizes[[i]]$isdir == TRUE, "Yes. it is.", "No, it is not"), sep = " ")
print(d)
e = paste("The file number", i, "(", filenames[[i]], ")", "permissions code printed in the octal number is:", file_sizes[[i]]$mode, sep = " ")
print(e)
f = paste("The supposed date and time when the original file number", i, "(", filenames[[i]], ")", "was created is:", file_sizes[[i]]$mtime , sep = " ")
print(f)
g = paste("The assumed time when the file number", i, "(", filenames[[i]], ")", "was apploaded to the computer is:", file_sizes[[i]]$ctime, sep = " ")
print(g)
h = paste("The estimated time when this file", i, "(", filenames[[i]], ")", "was the first time loaded into the R global environment is:", file_sizes[[i]]$atime, sep = " ")
print(h)
i2 = paste("Does the file ", i, "(", filenames[[i]], ")", "contain executable code?", ifelse(file_sizes[[i]]$exe == "no", "No, it doesn't.", "Yes, it does."), sep = " ")
print(i2)
}
## [1] "The name of the file number 1 is en_US.blogs.txt"
## [1] "File number 1 ( en_US.blogs.txt ) is blogs"
## [1] "File 1 ( en_US.blogs.txt ) size is equal to: 210160014"
## [1] "Is file 1 ( en_US.blogs.txt ) is a directory? No, it is not"
## [1] "The file number 1 ( en_US.blogs.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 1 ( en_US.blogs.txt ) was created is: 2014-07-22 11:13:05"
## [1] "The assumed time when the file number 1 ( en_US.blogs.txt ) was apploaded to the computer is: 2020-08-15 21:58:42"
## [1] "The estimated time when this file 1 ( en_US.blogs.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:49"
## [1] "Does the file 1 ( en_US.blogs.txt ) contain executable code? No, it doesn't."
## [1] "The name of the file number 2 is en_US.news.txt"
## [1] "File number 2 ( en_US.news.txt ) is news"
## [1] "File 2 ( en_US.news.txt ) size is equal to: 205811889"
## [1] "Is file 2 ( en_US.news.txt ) is a directory? No, it is not"
## [1] "The file number 2 ( en_US.news.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 2 ( en_US.news.txt ) was created is: 2014-07-22 11:13:04"
## [1] "The assumed time when the file number 2 ( en_US.news.txt ) was apploaded to the computer is: 2020-08-15 21:58:35"
## [1] "The estimated time when this file 2 ( en_US.news.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:42"
## [1] "Does the file 2 ( en_US.news.txt ) contain executable code? No, it doesn't."
## [1] "The name of the file number 3 is en_US.twitter.txt"
## [1] "File number 3 ( en_US.twitter.txt ) is twitter"
## [1] "File 3 ( en_US.twitter.txt ) size is equal to: 167105338"
## [1] "Is file 3 ( en_US.twitter.txt ) is a directory? No, it is not"
## [1] "The file number 3 ( en_US.twitter.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 3 ( en_US.twitter.txt ) was created is: 2014-07-22 11:12:58"
## [1] "The assumed time when the file number 3 ( en_US.twitter.txt ) was apploaded to the computer is: 2020-08-15 21:58:30"
## [1] "The estimated time when this file 3 ( en_US.twitter.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:35"
## [1] "Does the file 3 ( en_US.twitter.txt ) contain executable code? No, it doesn't."
# We can obtain the name of each file in the dataset by using this loop:
for (i in 1:3){
dat = list()
for (x in 1:3) {
dat[[x]] <- filenames[[x]]
}
}
### Obtain size of each file in our dataset:
for (i in 1:3){
dat = list()
for (x in 1:3) {
dat[[x]] <- file_sizes[[x]]$size
}
}
dat = data.table(dat)
dat$filenames = filenames
dat$file_size = dat$dat
dat$dat = NULL
### New data - when the file was created:
for (i in 1:3){
pat = list()
for (x in 1:3) {
pat[[x]] <- file_sizes[[x]]$mtime
}
}
pat = data.table(pat)
dat$file_created = pat$pat
rm(pat) # There is no need for this object anymore - free some memory
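As a side note, file.info() called on all three paths at once already returns most of this metadata (size, isdir, mode, mtime, ctime, atime and, on Windows, exe) as a single data frame, so it can serve as a shortcut for the hand-built tables here and below; a minimal sketch (meta is an illustrative name):
meta <- file.info(paste(my_path_eng, filenames, sep="")) # one row per file, one column per attribute
meta$file_size_Mb <- meta$size/(1024^2) # add the sizes in megabytes
meta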
#### My hyperloop - it reports everything about the files
for (i in 1:3){
a = paste("The name of the file number", i, "is", filenames[[i]], sep = " ")
print(a)
for (z in 1:3) {
a[[z]] = filenames[[z]]
}
b = paste("File number ", i, "(", filenames[[i]], ")", "is", ifelse(i == 1, "blogs", ifelse(i == 2, "news", "twitter")), sep = " ")
print(b)
c1 = paste("File ", i, "(", filenames[[i]], ")", "size is equal to:", file_sizes[[i]]$size, sep = " ")
print(c1)
for (y in 1:3) {
c1[[y]] = file_sizes[[y]]$size
}
d = paste("Is file ", i, "(", filenames[[i]], ")", "is a directory?", ifelse(file_sizes[[i]]$isdir == TRUE, "Yes. it is.", "No, it is not"), sep = " ")
print(d)
for (x in 1:3) {
d[[x]] = file_sizes[[x]]$isdir
}
e = paste("The file number", i, "(", filenames[[i]], ")", "permissions code printed in the octal number is:", file_sizes[[i]]$mode, sep = " ")
print(e)
for (u in 1:3) {
e[[u]] = file_sizes[[u]]$mode
}
f = paste("The supposed date and time when the original file number", i, "(", filenames[[i]], ")", "was created is:", file_sizes[[i]]$mtime , sep = " ")
print(f)
for (v in 1:3) {
f[[v]] = file_sizes[[v]]$mtime
}
g = paste("The assumed time when the file number", i, "(", filenames[[i]], ")", "was apploaded to the computer is:", file_sizes[[i]]$ctime, sep = " ")
print(g)
for (w in 1:3) {
g[[w]] = file_sizes[[w]]$ctime
}
h = paste("The estimated time when this file", i, "(", filenames[[i]], ")", "was the first time loaded into the R global environment is:", file_sizes[[i]]$atime, sep = " ")
print(h)
for (r in 1:3) {
# h1[[r]] = as.Date(file_sizes[[r]]$atime)
h[[r]] = file_sizes[[r]]$atime
}
i2 = paste("Does the file ", i, "(", filenames[[i]], ")", "contain executable code?", ifelse(file_sizes[[i]]$exe == "no", "No, it doesn't.", "Yes, it does."), sep = " ")
print(i2)
for (s in 1:3) {
i2[[s]] = file_sizes[[s]]$exe
}
a = data.table(a)
c1 = data.table(c1)
d = data.table(d)
e = data.table(e)
f = data.table(f)
g = data.table(g)
h = data.table(h)
i2 = data.table(i2)
my_data_frame = filenames
my_data_frame = data.table(my_data_frame)
my_data_frame$file_names = a
my_data_frame$file_size = c1
my_data_frame$file_size = as.numeric(my_data_frame$file_size)
my_data_frame$file_size_Mb = my_data_frame$file_size/(1024^2)
my_data_frame$if_directory = d
my_data_frame$octal_perm_code = e
my_data_frame$file_created = f
my_data_frame$file_downloaded = g
my_data_frame$file_recent_vers = h
my_data_frame$file_created = as.numeric(my_data_frame$file_created)
my_data_frame$file_downloaded = as.numeric(my_data_frame$file_downloaded)
my_data_frame$file_recent_vers = as.numeric(my_data_frame$file_recent_vers)
### In case the date and time processing package is not installed yet:
install_and_load = function(name, char = T)
{
if (!require(name, character.only = char))
{
install.packages(name)
}
require(name, character.only = char)
}
sapply(
c("anytime"),
install_and_load
)
rm(install_and_load)
### End of the package call
my_data_frame$file_created = anytime::anytime(my_data_frame$file_created)
my_data_frame$file_downloaded = anytime::anytime(my_data_frame$file_downloaded)
my_data_frame$file_recent_vers = anytime::anytime(my_data_frame$file_recent_vers)
my_data_frame$if_exe = i2
m1 = paste("In addition, a dataframe is created that contains all the data about each file.", "The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to:", i, sep = " ")
print(m1)
rm(a, c1, d, e, f, g, h, i2, m1)
}
## [1] "The name of the file number 1 is en_US.blogs.txt"
## [1] "File number 1 ( en_US.blogs.txt ) is blogs"
## [1] "File 1 ( en_US.blogs.txt ) size is equal to: 210160014"
## [1] "Is file 1 ( en_US.blogs.txt ) is a directory? No, it is not"
## [1] "The file number 1 ( en_US.blogs.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 1 ( en_US.blogs.txt ) was created is: 2014-07-22 11:13:05"
## [1] "The assumed time when the file number 1 ( en_US.blogs.txt ) was apploaded to the computer is: 2020-08-15 21:58:42"
## [1] "The estimated time when this file 1 ( en_US.blogs.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:49"
## [1] "Does the file 1 ( en_US.blogs.txt ) contain executable code? No, it doesn't."
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 1"
## [1] "The name of the file number 2 is en_US.news.txt"
## [1] "File number 2 ( en_US.news.txt ) is news"
## [1] "File 2 ( en_US.news.txt ) size is equal to: 205811889"
## [1] "Is file 2 ( en_US.news.txt ) is a directory? No, it is not"
## [1] "The file number 2 ( en_US.news.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 2 ( en_US.news.txt ) was created is: 2014-07-22 11:13:04"
## [1] "The assumed time when the file number 2 ( en_US.news.txt ) was apploaded to the computer is: 2020-08-15 21:58:35"
## [1] "The estimated time when this file 2 ( en_US.news.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:42"
## [1] "Does the file 2 ( en_US.news.txt ) contain executable code? No, it doesn't."
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 2"
## [1] "The name of the file number 3 is en_US.twitter.txt"
## [1] "File number 3 ( en_US.twitter.txt ) is twitter"
## [1] "File 3 ( en_US.twitter.txt ) size is equal to: 167105338"
## [1] "Is file 3 ( en_US.twitter.txt ) is a directory? No, it is not"
## [1] "The file number 3 ( en_US.twitter.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 3 ( en_US.twitter.txt ) was created is: 2014-07-22 11:12:58"
## [1] "The assumed time when the file number 3 ( en_US.twitter.txt ) was apploaded to the computer is: 2020-08-15 21:58:30"
## [1] "The estimated time when this file 3 ( en_US.twitter.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:35"
## [1] "Does the file 3 ( en_US.twitter.txt ) contain executable code? No, it doesn't."
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 3"
###### Full loop:
n = length(filenames)
for (i in 1:n){
a = paste("The name of the file number", i, "is", filenames[[i]], sep = " ")
print(a)
for (z in 1:n) {
a[[z]] = filenames[[z]]
}
b = paste("File number ", i, "(", filenames[[i]], ")", "is", ifelse(i == 1, "blogs", ifelse(i == 2, "news", "twitter")), sep = " ")
print(b)
c1 = paste("File ", i, "(", filenames[[i]], ")", "size is equal to:", file_sizes[[i]]$size, sep = " ")
print(c1)
for (y in 1:n) {
c1[[y]] = file_sizes[[y]]$size
}
d = paste("Is file ", i, "(", filenames[[i]], ")", "is a directory?", ifelse(file_sizes[[i]]$isdir == TRUE, "Yes. it is.", "No, it is not"), sep = " ")
print(d)
for (x in 1:n) {
d[[x]] = file_sizes[[x]]$isdir
}
e = paste("The file number", i, "(", filenames[[i]], ")", "permissions code printed in the octal number is:", file_sizes[[i]]$mode, sep = " ")
print(e)
for (u in 1:n) {
e[[u]] = file_sizes[[u]]$mode
}
f = paste("The supposed date and time when the original file number", i, "(", filenames[[i]], ")", "was created is:", file_sizes[[i]]$mtime , sep = " ")
print(f)
for (v in 1:n) {
f[[v]] = file_sizes[[v]]$mtime
}
g = paste("The assumed time when the file number", i, "(", filenames[[i]], ")", "was apploaded to the computer is:", file_sizes[[i]]$ctime, sep = " ")
print(g)
for (w in 1:n) {
g[[w]] = file_sizes[[w]]$ctime
}
h = paste("The estimated time when this file", i, "(", filenames[[i]], ")", "was the first time loaded into the R global environment is:", file_sizes[[i]]$atime, sep = " ")
print(h)
for (r in 1:n) {
# h1[[r]] = as.Date(file_sizes[[r]]$atime)
h[[r]] = file_sizes[[r]]$atime
}
i2 = paste("Does the file ", i, "(", filenames[[i]], ")", "contain executable code?", ifelse(file_sizes[[i]]$exe == "no", "No, it doesn't.", "Yes, it does."), sep = " ")
print(i2)
for (s in 1:n) {
i2[[s]] = file_sizes[[s]]$exe
}
# Preparing the data
a = data.table(a)
c1 = data.table(c1)
d = data.table(d)
e = data.table(e)
f = data.table(f)
g = data.table(g)
h = data.table(h)
i2 = data.table(i2)
# Creating the dataframe we could work with:
my_data_frame = filenames
my_data_frame = data.table(my_data_frame)
my_data_frame$file_names = a
my_data_frame$file_size = c1
my_data_frame$file_size = as.numeric(my_data_frame$file_size)
my_data_frame$file_size_Mb = my_data_frame$file_size/(1024^2)
my_data_frame$if_directory = d
my_data_frame$octal_perm_code = e
my_data_frame$file_created = f
my_data_frame$file_downloaded = g
my_data_frame$file_recent_vers = h
my_data_frame$file_created = as.numeric(my_data_frame$file_created)
my_data_frame$file_downloaded = as.numeric(my_data_frame$file_downloaded)
my_data_frame$file_recent_vers = as.numeric(my_data_frame$file_recent_vers)
### In case the date and time processing package is not installed yet:
install_and_load = function(name, char = T)
{
if (!require(name, character.only = char))
{
install.packages(name)
}
require(name, character.only = char)
}
sapply(
c("anytime"),
install_and_load
)
rm(install_and_load)
### End of the package call
my_data_frame$file_created = anytime::anytime(my_data_frame$file_created)
my_data_frame$file_downloaded = anytime::anytime(my_data_frame$file_downloaded)
my_data_frame$file_recent_vers = anytime::anytime(my_data_frame$file_recent_vers)
my_data_frame$if_exe = i2
m1 = paste("In addition, a dataframe is created that contains all the data about each file.", "The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to:", i, sep = " ")
print(m1)
rm(a, c1, d, e, f, g, h, i2, m1)
}
## [1] "The name of the file number 1 is en_US.blogs.txt"
## [1] "File number 1 ( en_US.blogs.txt ) is blogs"
## [1] "File 1 ( en_US.blogs.txt ) size is equal to: 210160014"
## [1] "Is file 1 ( en_US.blogs.txt ) is a directory? No, it is not"
## [1] "The file number 1 ( en_US.blogs.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 1 ( en_US.blogs.txt ) was created is: 2014-07-22 11:13:05"
## [1] "The assumed time when the file number 1 ( en_US.blogs.txt ) was apploaded to the computer is: 2020-08-15 21:58:42"
## [1] "The estimated time when this file 1 ( en_US.blogs.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:49"
## [1] "Does the file 1 ( en_US.blogs.txt ) contain executable code? No, it doesn't."
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 1"
## [1] "The name of the file number 2 is en_US.news.txt"
## [1] "File number 2 ( en_US.news.txt ) is news"
## [1] "File 2 ( en_US.news.txt ) size is equal to: 205811889"
## [1] "Is file 2 ( en_US.news.txt ) is a directory? No, it is not"
## [1] "The file number 2 ( en_US.news.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 2 ( en_US.news.txt ) was created is: 2014-07-22 11:13:04"
## [1] "The assumed time when the file number 2 ( en_US.news.txt ) was apploaded to the computer is: 2020-08-15 21:58:35"
## [1] "The estimated time when this file 2 ( en_US.news.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:42"
## [1] "Does the file 2 ( en_US.news.txt ) contain executable code? No, it doesn't."
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 2"
## [1] "The name of the file number 3 is en_US.twitter.txt"
## [1] "File number 3 ( en_US.twitter.txt ) is twitter"
## [1] "File 3 ( en_US.twitter.txt ) size is equal to: 167105338"
## [1] "Is file 3 ( en_US.twitter.txt ) is a directory? No, it is not"
## [1] "The file number 3 ( en_US.twitter.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 3 ( en_US.twitter.txt ) was created is: 2014-07-22 11:12:58"
## [1] "The assumed time when the file number 3 ( en_US.twitter.txt ) was apploaded to the computer is: 2020-08-15 21:58:30"
## [1] "The estimated time when this file 3 ( en_US.twitter.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:35"
## [1] "Does the file 3 ( en_US.twitter.txt ) contain executable code? No, it doesn't."
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 3"
my_data_frame
## my_data_frame file_names file_size file_size_Mb if_directory
## 1: en_US.blogs.txt en_US.blogs.txt 210160014 200.4242 FALSE
## 2: en_US.news.txt en_US.news.txt 205811889 196.2775 FALSE
## 3: en_US.twitter.txt en_US.twitter.txt 167105338 159.3641 FALSE
## octal_perm_code file_created file_downloaded file_recent_vers
## 1: 438 2014-07-22 11:13:05 2020-08-15 21:58:42 2020-08-15 21:58:49
## 2: 438 2014-07-22 11:13:04 2020-08-15 21:58:35 2020-08-15 21:58:42
## 3: 438 2014-07-22 11:12:58 2020-08-15 21:58:30 2020-08-15 21:58:35
## if_exe
## 1: no
## 2: no
## 3: no
## Making the output more fancy:
knitr::kable(my_data_frame, caption = "The main data about the files")
my_data_frame | file_names | file_size | file_size_Mb | if_directory | octal_perm_code | file_created | file_downloaded | file_recent_vers | if_exe |
---|---|---|---|---|---|---|---|---|---|
en_US.blogs.txt | en_US.blogs.txt | 210160014 | 200.4242 | FALSE | 438 | 2014-07-22 11:13:05 | 2020-08-15 21:58:42 | 2020-08-15 21:58:49 | no |
en_US.news.txt | en_US.news.txt | 205811889 | 196.2775 | FALSE | 438 | 2014-07-22 11:13:04 | 2020-08-15 21:58:35 | 2020-08-15 21:58:42 | no |
en_US.twitter.txt | en_US.twitter.txt | 167105338 | 159.3641 | FALSE | 438 | 2014-07-22 11:12:58 | 2020-08-15 21:58:30 | 2020-08-15 21:58:35 | no |
#### Line counts and word counts: ldf
n = length(ldf) #3
# my_length length(ldf[[1]])
for (i in 1:n){
### Counter for lines
bat = list()
for (x in 1:n) {
bat[[x]] <- length(ldf[[x]])
}
}
# sum(stringi::stri_count_words(ldf[[1]]))
for (i in 1:n){
#### Counter for the words in each dataset (a long-running loop)
pat = list()
for (x in 1:n) {
pat[[x]] <- sum(stringi::stri_count_words(ldf[[x]]))
}
}
# mean(stringi::stri_count_words(ldf[[1]]))
for (i in 1:n){
#### Counter for the mean number of words per line in each document (a long-running loop)
fat = list()
for (x in 1:n) {
fat[[x]] <- mean(stringi::stri_count_words(ldf[[x]]))
}
}
my_descriptive_stat = filenames
my_descriptive_stat = data.table(my_descriptive_stat)
bat = data.table(bat)
pat = data.table(pat)
fat = data.table(fat)
my_descriptive_stat$lines_count = bat$bat
my_descriptive_stat$words_count = pat$pat
my_descriptive_stat$mean_words_per_line = fat$fat
my_descriptive_stat #file name, Lines count, words count, mean words per line
## my_descriptive_stat lines_count words_count mean_words_per_line
## 1: en_US.blogs.txt 899288 38256701 42.5411
## 2: en_US.news.txt 77259 2697319 34.91268
## 3: en_US.twitter.txt 2360148 30249390 12.81673
## Making the output more fancy:
knitr::kable(my_descriptive_stat, caption = "The main summary statistics about the files")
my_descriptive_stat | lines_count | words_count | mean_words_per_line |
---|---|---|---|
en_US.blogs.txt | 899288 | 38256701 | 42.541100292676 |
en_US.news.txt | 77259 | 2697319 | 34.9126833119766 |
en_US.twitter.txt | 2360148 | 30249390 | 12.816734374285 |
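The same three statistics can be cross-checked directly from ldf with vectorised sapply() calls; a short optional sketch (variable names are illustrative):
lines_count <- sapply(ldf, length) # number of lines per dataset
words_count <- sapply(ldf, function(x) sum(stringi::stri_count_words(x))) # number of words per dataset
data.frame(filenames, lines_count, words_count, mean_words_per_line = words_count/lines_count)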
rm(dat, bat, pat, fat)
# Dataset Cleaning: Data Preprocessing
### Sampling the data:
set.seed(20200826)
# tat = sample(ldf[[1]], length(ldf[[1]])*(5/100))
for (i in 1:n){
#### Sample 5% of the lines from each dataset (a long-running loop)
tat = list()
for (x in 1:n) {
tat[[x]] <- sample(ldf[[x]], length(ldf[[x]])*(5/100)) #5% sampling - can be any figure
}
}
summary(tat)
## Length Class Mode
## [1,] 44964 -none- character
## [2,] 3862 -none- character
## [3,] 118007 -none- character
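Because reading the full files is slow, it can be convenient to write the 5% samples to disk once and reuse them in later sessions; a minimal sketch, assuming the samples may be stored next to the originals (the sample_ file prefix is illustrative):
for (x in 1:n) {
writeLines(tat[[x]], paste0(my_path_eng, "sample_", filenames[[x]])) # save each 5% sample as a plain text file
}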
tat = data.table(tat)
### Giving the row names as data file names:
myDF <- cbind(Row_Names = filenames, tat) # 2 columns: Row_Names and tat
str(myDF) #Cool
## Classes 'data.table' and 'data.frame': 3 obs. of 2 variables:
## $ Row_Names: chr "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
## $ tat :List of 3
## ..$ : chr "EUGENIO ANGELES, Fiscal of City of Manila, respondent." "Joseph doesnâ\200\231t reveal himself to his brothers, although he recognizes them immediately. For reasons whi"| __truncated__ "â\200œIcon is a company that loves to see others striving to be different and the TTXGP is just that,â\200\235 "| __truncated__ "How to Indy Crawl(and get a sweet limited run pint glass) :" ...
## ..$ : chr "3. By calling 404-330-6309." "Spring Wildflower Walk: Meet at the Lexington entrance. Look for trillium, blue and yellow violets, spring beau"| __truncated__ "The owner of the Arm & Hammer brand of personal care and household products said it has outgrown its Princeton "| __truncated__ "The 89-year-old national program recognizes outstanding creative teens and offers scholarship opportunities for"| __truncated__ ...
## ..$ : chr "Damn u save those text from 2011? Thristy!!!!" "I ran the whole jello carbomb scenario by Kersi and he is onboard. Let's do this!" "No sleep in my near future." "Yea I think my theory is right, y'all ladies is nuts from the womb to the tomb" ...
## - attr(*, ".internal.selfref")=<externalptr>
# astr <- "Ábcdêãçoàúü"
# iconv(astr, from = 'UTF-8', to = 'ASCII//TRANSLIT')
# bat <- iconv(myDF$tat[[1]], from = 'UTF-8', to = 'ASCII//TRANSLIT')
n = length(ldf) #3
for (i in 1:n){
#### Remove all weird symbols from the text, such as Ábcdêãçoàúü and similar characters
bat = list()
for (x in 1:n) {
bat[[x]] <- iconv(myDF$tat[[x]], from = 'UTF-8', to = 'ASCII//TRANSLIT')
}
bat = data.table(bat)
}
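A related option, in case //TRANSLIT ever returns NA for badly encoded lines: the sub argument of iconv() simply drops any character that has no ASCII representation; a minimal sketch (bat2 is an illustrative name):
bat2 <- lapply(myDF$tat, iconv, from = 'UTF-8', to = 'ASCII', sub = "") # drop, rather than transliterate, the non-ASCII characters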
### Creating a descriptive word cloud:
# corpus <- VCorpus(VectorSource(myDF$tat))
corpus <- VCorpus(VectorSource(bat))
print(corpus)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
rm(bat)
# print(corpus)
str(corpus)
## List of 1
## $ 1:List of 2
## ..$ content: chr [1:3] "c(\"EUGENIO ANGELES, Fiscal of City of Manila, respondent.\", \"Joseph doesn't reveal himself to his brothers, "| __truncated__ "c(\"3. By calling 404-330-6309.\", \"Spring Wildflower Walk: Meet at the Lexington entrance. Look for trillium,"| __truncated__ "c(\"Damn u save those text from 2011? Thristy!!!!\", \"I ran the whole jello carbomb scenario by Kersi and he i"| __truncated__
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2020-08-29 14:36:58"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "1"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## - attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
### List standard English stop words
stopwords("en")
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very"
### We need to update them:
updatedStopwords <- c(stopwords('en'), "could", "can", "will", "would", "want", "wanna", "just", "like", "can't", "lot", "looking",
"as well as", "thank you for", "looking forward", "i want to", "one of the", "a lot", "a lot of", "of", "thanks for the", "it was a",
"the end of", "thanks for the rt", "for the first time", "at the same time", "thanks for the", "is going to be", "if you want to",
"thank you so much", "i am going to", "can't wait to see", "going to be", "is one of the", "one of the", "if you want to", "if you want",
"in the middle of the", "at the end of the", "by the end of the", "thank you so much", "hope you have a great", "thanks so much", "i have no idea", "what",
"it is possible", "get", "good", "thank", "come", "even", "one", "got", "said", "much", "also", "“", "“", "”") #“ is some break technical symbol return to R
### A bit of standard data transformation:
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# The data transformation process - standard steps for word-cloud analysis
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeWords, updatedStopwords)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stemDocument) # Stem the words (strip typical endings) - the only command here whose name is not self-explanatory
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
strange_symbols = c("ú","¨", "ů", "§", "ě", "š", "č", "ř", "ž", "ý", "á", "í", "é", "´", "$", "#", "&", "^", "%", "@", "€",
"‰", "†", "…", "“", "ś", "ť", "ž", "ź", "Ł", "©", "ş", "±", "µ", "ą", "ä", "ĺ", "ć", "ç", "ę", "ë", "ě", "î", "ď", "đ",
"ń", "ň", "ó", "ô", "ö") #For English - it is OK to remove it all
corpus <- tm_map(corpus, toSpace, paste0("[", paste(strange_symbols, collapse = ""), "]")) # collapse the symbols into one character class so gsub() receives a single pattern
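As an optional aside, the cleaning steps above can be wrapped into one reusable helper, so the same preprocessing could be applied to another corpus (for example, a sample in another language) in a single call; a sketch only - the clean_corpus name is illustrative:
clean_corpus <- function(cp, extra_stopwords = updatedStopwords) {
cp <- tm_map(cp, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+") # strip URLs
cp <- tm_map(cp, toSpace, "@[^\\s]+") # strip @mentions
cp <- tm_map(cp, tolower)
cp <- tm_map(cp, removeWords, extra_stopwords)
cp <- tm_map(cp, removePunctuation)
cp <- tm_map(cp, removeNumbers)
cp <- tm_map(cp, stemDocument)
cp <- tm_map(cp, stripWhitespace)
tm_map(cp, PlainTextDocument)
}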
print(corpus)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
# Word cloud plot of the most common words in the corpus
potential_word_matrix <- TermDocumentMatrix(corpus)
potential_word_matrix <- as.matrix(potential_word_matrix)
word_frequences <- sort(rowSums(potential_word_matrix), decreasing=TRUE)
dm <- data.frame(word=names(word_frequences), freq=word_frequences)
wordcloud(dm$word, dm$freq, max.words = 50, random.order=FALSE, rot.per=.25, colors=brewer.pal(8, "Dark2"))
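The word cloud layout involves random choices (for example, which words get rotated, controlled by rot.per), so fixing the seed right before the call makes the figure reproducible across runs; an optional sketch:
set.seed(20200826) # any fixed seed gives a repeatable layout
wordcloud(dm$word, dm$freq, max.words = 50, random.order=FALSE, rot.per=.25, colors=brewer.pal(8, "Dark2"))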
#bat <- list(bat)
#toks_filtered <- tokens(bat)
#dfm_text <- dfm(toks_filtered)
# corpus$content[[1]]
# Transform our corpus to a quanteda dfm with a small trick: pass the cleaned corpus content as character so that dfm() recognises it properly (feeding the tm corpus in directly can be problematic)
dfm_text <- dfm(as.character(corpus$content))
# Main summary statistics of the text that will be split into uni-, bi-, tri-, ... n-grams in the plots below:
text_sum <- textstat_summary(dfm_text)
text_sum
## document chars sents tokens types puncts numbers symbols urls tags emojis
## 1 text1 NA NA 128191 17859 43 15 18 0 0 0
# Standard method for creating the unigram plot:
my_swf_chart <- dfm_text %>% # Plot name and where to take the data from
textstat_frequency(n = 41) %>% # How many of the most popular unigram words to show (I chose 41)
ggplot(aes(x = reorder(feature, frequency), y = frequency)) + # standard line - ranked X vs frequency
geom_point() + # point (scatter) plot
coord_flip() + # flipped coordinates (looks better and is traditional for unigram charts)
labs(x = NULL, y = "Words Frequency") + # Names of axes
ggtitle("Single Word Frequency (SWF)") + # Title
theme_light() # Main colors
my_swf_chart # Plot the chart
# compute my_coverage_stat, including all words
my_coverage_stat <- textstat_frequency(dfm_text, n= length(dfm_text))
my_coverage_stat$index <- 1:dim(my_coverage_stat)[1]
my_coverage_stat$relative_frequency <- my_coverage_stat$frequency / sum(my_coverage_stat$frequency)
my_coverage_stat$my_coverage_stat <- cumsum(my_coverage_stat$relative_frequency)
str(my_coverage_stat)
## Classes 'frequency', 'textstat' and 'data.frame': 17859 obs. of 8 variables:
## $ feature : chr "go" "love" "day" "know" ...
## $ frequency : num 902 889 735 650 627 624 623 557 541 534 ...
## $ rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ docfreq : num 1 1 1 1 1 1 1 1 1 1 ...
## $ group : chr "all" "all" "all" "all" ...
## $ index : int 1 2 3 4 5 6 7 8 9 10 ...
## $ relative_frequency: num 0.00704 0.00693 0.00573 0.00507 0.00489 ...
## $ my_coverage_stat : num 0.00704 0.01397 0.0197 0.02478 0.02967 ...
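With the cumulative my_coverage_stat column in place, the classic coverage question - how many unique (stemmed) words are needed to cover 50% or 90% of all word occurrences in the sample - reduces to finding the first row that reaches the threshold; a short sketch:
min(which(my_coverage_stat$my_coverage_stat >= 0.5)) # number of words needed for 50% coverage
min(which(my_coverage_stat$my_coverage_stat >= 0.9)) # number of words needed for 90% coverage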
### Top 50 most popular words and % frequency of their usage
rm(r)
r = my_coverage_stat$feature[1:50]
r = data.table(r)
r$most_popular_words = my_coverage_stat$feature[1:50];
r$how_often_used_percents = 100*my_coverage_stat$relative_frequency[1:50]
r$r = NULL
r
## most_popular_words how_often_used_percents
## 1: go 0.7036375
## 2: love 0.6934964
## 3: day 0.5733632
## 4: know 0.5070559
## 5: time 0.4891139
## 6: make 0.4867736
## 7: now 0.4859936
## 8: rt 0.4345079
## 9: new 0.4220265
## 10: great 0.4165659
## 11: think 0.4103252
## 12: see 0.4079850
## 13: today 0.4072049
## 14: follow 0.4001841
## 15: u 0.3955036
## 16: work 0.3830222
## 17: need 0.3666404
## 18: lol 0.3635201
## 19: peopl 0.3416777
## 20: year 0.3190552
## 21: us 0.3120344
## 22: say 0.3065738
## 23: back 0.3003331
## 24: thank 0.2917522
## 25: thing 0.2753703
## 26: right 0.2706898
## 27: look 0.2605487
## 28: night 0.2597686
## 29: take 0.2488474
## 30: happi 0.2488474
## 31: realli 0.2465072
## 32: last 0.2433868
## 33: game 0.2394864
## 34: way 0.2348059
## 35: hope 0.2348059
## 36: play 0.2270050
## 37: tonight 0.2270050
## 38: start 0.2246648
## 39: show 0.2231046
## 40: week 0.2231046
## 41: never 0.2184241
## 42: still 0.2137436
## 43: feel 0.2137436
## 44: best 0.2129635
## 45: use 0.2129635
## 46: watch 0.2067228
## 47: get 0.2043825
## 48: first 0.2028224
## 49: well 0.1997020
## 50: friend 0.1981418
## most_popular_words how_often_used_percents
rm(r)
# Plot my_coverage_stat
my_coverage_plot <- my_coverage_stat %>%
ggplot(aes(x = index, y = my_coverage_stat)) +
geom_point() +
labs(x = "Number Of Unique Words", y = "Percent Of Used Language Covered By This Number Of Words") +
ggtitle("Coverage Statistics") +
theme_light() #Main colors
my_coverage_plot
# Plot coverage on log scale
my_coverage_plot <- my_coverage_plot + scale_x_log10() + scale_y_log10() + ggtitle("Coverage Statistics - log10 scale")
my_coverage_plot
#### Bigram creation:
tokenized_text = quanteda::tokens(as.character(corpus$content), what = "word1")
tokenized_bigrams = tokens_ngrams(tokenized_text, n = 2)
aschr = quanteda::dfm(tokenized_bigrams) # the tokens are already bigrams, so the extra ngrams argument is not needed
topFeatures <- topfeatures(aschr, 20)
df1 <- data.frame(word=names(topFeatures), count=topFeatures)
ggplot(df1, aes(x=reorder(word, count), y=count, fill = word)) + geom_bar(stat="identity") + coord_flip() + xlab('Bigram') + ylab('Count') + ggtitle('Most Common Bigrams Excluding Stop Words')
rm(df1, aschr, tokenized_bigrams) #To free the memory
### Trigrams:
# tokenized_text = quanteda::tokens(as.character(corpus$content), what = "word1")
tokenized_bigrams = tokens_ngrams(tokenized_text, n = 3)
aschr = quanteda::dfm(tokenized_bigrams)
topFeatures <- topfeatures(aschr, 20)
df1 <- data.frame(word=names(topFeatures), count=topFeatures)
ggplot(df1, aes(x=reorder(word, count), y=count, fill = word)) + geom_bar(stat="identity") + theme_light() + coord_flip() + xlab('Trigram') + ylab('Count') + ggtitle('Most Common Trigrams Excluding Stop Words') + theme(legend.position = "none")
rm(df1, aschr, tokenized_bigrams) #To free the memory
### Quadgrams (4-grams):
tokenized_bigrams = tokens_ngrams(tokenized_text, n = 4)
aschr = quanteda::dfm(tokenized_bigrams)
topFeatures <- topfeatures(aschr, 20)
df1 <- data.frame(word=names(topFeatures), count=topFeatures)
ggplot(df1, aes(x=reorder(word, count), y=count, fill = word)) + geom_bar(stat="identity") + theme_light() + coord_flip() + xlab('4-gram') + ylab('Count') + ggtitle('Most Common 4-grams Excluding Stop Words') + theme(legend.position = "none")
rm(df1, aschr, tokenized_bigrams) #To free the memory
### Pentagrams:
tokenized_bigrams = tokens_ngrams(tokenized_text, n = 5)
aschr = quanteda::dfm(tokenized_bigrams)
topFeatures <- topfeatures(aschr, 20) # top 20, to match the chart title below
df1 <- data.frame(word=names(topFeatures), count=topFeatures)
ggplot(df1, aes(x=reorder(word, count), y=count, fill = word)) + geom_bar(stat="identity") + theme_light() + coord_flip() + xlab('Pentagram') + ylab('Count') + ggtitle('Most Common Pentagrams (5-grams) Excluding Stop Words \n Top-20 Five-Word Sequences') + theme(legend.position = "none")
# Please note: all the word endings are stemmed (truncated) :)
rm(df1, aschr, tokenized_bigrams) #To free the memory
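Since the bigram, trigram, 4-gram and pentagram blocks above repeat the same steps, they could be folded into one small helper; an optional sketch (the plot_top_ngrams name is illustrative):
plot_top_ngrams <- function(toks, n, top = 20) {
ngram_dfm <- quanteda::dfm(tokens_ngrams(toks, n = n)) # build the n-grams and their document-feature matrix
top_feat <- topfeatures(ngram_dfm, top) # the most frequent n-grams
df <- data.frame(word = names(top_feat), count = top_feat)
ggplot(df, aes(x = reorder(word, count), y = count, fill = word)) + geom_bar(stat = "identity") + theme_light() + coord_flip() + xlab(paste0(n, "-gram")) + ylab("Count") + ggtitle(paste0("Most Common ", n, "-grams Excluding Stop Words")) + theme(legend.position = "none")
}
# Example: plot_top_ngrams(tokenized_text, 3) reproduces the trigram chart above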
The same processing pipeline was applied to the Russian (ru_RU) datasets; the code is identical to the English part above, so only the output is shown below:
## [1] "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
## chr [1:3] "ru_RU.blogs.txt" "ru_RU.news.txt" "ru_RU.twitter.txt"
## chr [1:12] "C:\\Users\\Alex\\Documents\\COURSERA STUDIES\\DATA_SCIENCE_CAPSTONE\\FINAL_DATA\\final\\/de_DE/de_DE.blogs.txt" ...
## Length Class Mode
## 12 character character
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding =
## "UTF-8"): line 191902 appears to contain an embedded nul
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding =
## "UTF-8"): line 309777 appears to contain an embedded nul
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding =
## "UTF-8"): line 191902 appears to contain an embedded nul
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding =
## "UTF-8"): line 309777 appears to contain an embedded nul
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding =
## "UTF-8"): line 191902 appears to contain an embedded nul
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding =
## "UTF-8"): line 309777 appears to contain an embedded nul
## List of 3
## $ : chr [1:337100] "Настало время и мне поделиться чем-нибудь сладким!!! Уже совсем скоро наступит Новый год, и поэтому моя конфека"| __truncated__ "сама элегантность и выдержанность...." "Знаменитые дизайнеры, популярные магазины дамской одежды и известные модницы - все сейчас увлечены женственным "| __truncated__ "Проверяем точно ли продавец высылает в Россию? Нажимаем на вкладку и читаем:" ...
## $ : chr [1:196360] "Словом, поле для деятельности большое. Можно даже не прибегать к помощи интернет-сообщества и сэкономить ценные"| __truncated__ "Для смены режима Эмомали Рахмона в Таджикистане все готово, взрыв может случиться в любую минуту, активисты опп"| __truncated__ "Много вопросов задавали владивостокские музыканты, пришедшие в зал поприветствовать своих кумиров и учителей. И"| __truncated__ "Мы отошли, и она продолжила." ...
## $ : chr [1:881414] "помыла с боем и модным шампунем котэ, а оно пришло и стряхивает всю воду мне в монитор! - мстит, сук..)" "Еще 20 минут" "какой же он всё-таки шут" "Пришел пешком в центр увидел, новогодную суету. Можно сказать почувствовал праздник :-)" ...
## List of 3
## $ : chr [1:337100] "Настало время и мне поделиться чем-нибудь сладким!!! Уже совсем скоро наступит Новый год, и поэтому моя конфека"| __truncated__ "сама элегантность и выдержанность...." "Знаменитые дизайнеры, популярные магазины дамской одежды и известные модницы - все сейчас увлечены женственным "| __truncated__ "Проверяем точно ли продавец высылает в Россию? Нажимаем на вкладку и читаем:" ...
## $ : chr [1:196360] "Словом, поле для деятельности большое. Можно даже не прибегать к помощи интернет-сообщества и сэкономить ценные"| __truncated__ "Для смены режима Эмомали Рахмона в Таджикистане все готово, взрыв может случиться в любую минуту, активисты опп"| __truncated__ "Много вопросов задавали владивостокские музыканты, пришедшие в зал поприветствовать своих кумиров и учителей. И"| __truncated__ "Мы отошли, и она продолжила." ...
## $ : chr [1:881414] "помыла с боем и модным шампунем котэ, а оно пришло и стряхивает всю воду мне в монитор! - мстит, сук..)" "Еще 20 минут" "какой же он всё-таки шут" "Пришел пешком в центр увидел, новогодную суету. Можно сказать почувствовал праздник :-)" ...
## [1] 118996424
## [1] 118996424
## [1] "file 1 size is equal to: 116855835"
## [1] "file 2 size is equal to: 118996424"
## [1] "file 3 size is equal to: 105182346"
## [1] "The name of the file number 1 is ru_RU.blogs.txt"
## [1] "File number 1 ( ru_RU.blogs.txt ) is blogs"
## [1] "File 1 ( ru_RU.blogs.txt ) size is equal to: 116855835"
## [1] "Is file 1 ( ru_RU.blogs.txt ) is a directory? No, it is not"
## [1] "The file number 1 ( ru_RU.blogs.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 1 ( ru_RU.blogs.txt ) was created is: 2014-07-22 11:12:23"
## [1] "The assumed time when the file number 1 ( ru_RU.blogs.txt ) was apploaded to the computer is: 2020-08-15 21:58:21"
## [1] "The estimated time when this file 1 ( ru_RU.blogs.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:24"
## [1] "Does the file 1 ( ru_RU.blogs.txt ) contain executable code? No, it doesn't."
## [1] "The name of the file number 2 is ru_RU.news.txt"
## [1] "File number 2 ( ru_RU.news.txt ) is news"
## [1] "File 2 ( ru_RU.news.txt ) size is equal to: 118996424"
## [1] "Is file 2 ( ru_RU.news.txt ) is a directory? No, it is not"
## [1] "The file number 2 ( ru_RU.news.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 2 ( ru_RU.news.txt ) was created is: 2014-07-22 11:12:29"
## [1] "The assumed time when the file number 2 ( ru_RU.news.txt ) was apploaded to the computer is: 2020-08-15 21:58:24"
## [1] "The estimated time when this file 2 ( ru_RU.news.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:27"
## [1] "Does the file 2 ( ru_RU.news.txt ) contain executable code? No, it doesn't."
## [1] "The name of the file number 3 is ru_RU.twitter.txt"
## [1] "File number 3 ( ru_RU.twitter.txt ) is twitter"
## [1] "File 3 ( ru_RU.twitter.txt ) size is equal to: 105182346"
## [1] "Is file 3 ( ru_RU.twitter.txt ) is a directory? No, it is not"
## [1] "The file number 3 ( ru_RU.twitter.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 3 ( ru_RU.twitter.txt ) was created is: 2014-07-22 11:12:33"
## [1] "The assumed time when the file number 3 ( ru_RU.twitter.txt ) was apploaded to the computer is: 2020-08-15 21:58:27"
## [1] "The estimated time when this file 3 ( ru_RU.twitter.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:30"
## [1] "Does the file 3 ( ru_RU.twitter.txt ) contain executable code? No, it doesn't."
## [1] "The name of the file number 1 is ru_RU.blogs.txt"
## [1] "File number 1 ( ru_RU.blogs.txt ) is blogs"
## [1] "File 1 ( ru_RU.blogs.txt ) size is equal to: 116855835"
## [1] "Is file 1 ( ru_RU.blogs.txt ) is a directory? No, it is not"
## [1] "The file number 1 ( ru_RU.blogs.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 1 ( ru_RU.blogs.txt ) was created is: 2014-07-22 11:12:23"
## [1] "The assumed time when the file number 1 ( ru_RU.blogs.txt ) was apploaded to the computer is: 2020-08-15 21:58:21"
## [1] "The estimated time when this file 1 ( ru_RU.blogs.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:24"
## [1] "Does the file 1 ( ru_RU.blogs.txt ) contain executable code? No, it doesn't."
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 1"
## [1] "The name of the file number 2 is ru_RU.news.txt"
## [1] "File number 2 ( ru_RU.news.txt ) is news"
## [1] "File 2 ( ru_RU.news.txt ) size is equal to: 118996424"
## [1] "Is file 2 ( ru_RU.news.txt ) is a directory? No, it is not"
## [1] "The file number 2 ( ru_RU.news.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 2 ( ru_RU.news.txt ) was created is: 2014-07-22 11:12:29"
## [1] "The assumed time when the file number 2 ( ru_RU.news.txt ) was apploaded to the computer is: 2020-08-15 21:58:24"
## [1] "The estimated time when this file 2 ( ru_RU.news.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:27"
## [1] "Does the file 2 ( ru_RU.news.txt ) contain executable code? No, it doesn't."
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 2"
## [1] "The name of the file number 3 is ru_RU.twitter.txt"
## [1] "File number 3 ( ru_RU.twitter.txt ) is twitter"
## [1] "File 3 ( ru_RU.twitter.txt ) size is equal to: 105182346"
## [1] "Is file 3 ( ru_RU.twitter.txt ) is a directory? No, it is not"
## [1] "The file number 3 ( ru_RU.twitter.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 3 ( ru_RU.twitter.txt ) was created is: 2014-07-22 11:12:33"
## [1] "The assumed time when the file number 3 ( ru_RU.twitter.txt ) was apploaded to the computer is: 2020-08-15 21:58:27"
## [1] "The estimated time when this file 3 ( ru_RU.twitter.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:30"
## [1] "Does the file 3 ( ru_RU.twitter.txt ) contain executable code? No, it doesn't."
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 3"
## my_data_frame file_names file_size file_size_Mb if_directory
## 1: ru_RU.blogs.txt ru_RU.blogs.txt 116855835 111.4424 FALSE
## 2: ru_RU.news.txt ru_RU.news.txt 118996424 113.4838 FALSE
## 3: ru_RU.twitter.txt ru_RU.twitter.txt 105182346 100.3097 FALSE
## octal_perm_code file_created file_downloaded file_recent_vers
## 1: 438 2014-07-22 11:12:23 2020-08-15 21:58:21 2020-08-15 21:58:24
## 2: 438 2014-07-22 11:12:29 2020-08-15 21:58:24 2020-08-15 21:58:27
## 3: 438 2014-07-22 11:12:33 2020-08-15 21:58:27 2020-08-15 21:58:30
## if_exe
## 1: no
## 2: no
## 3: no
my_data_frame | file_names | file_size | file_size_Mb | if_directory | octal_perm_code | file_created | file_downloaded | file_recent_vers | if_exe |
---|---|---|---|---|---|---|---|---|---|
ru_RU.blogs.txt | ru_RU.blogs.txt | 116855835 | 111.4424 | FALSE | 438 | 2014-07-22 11:12:23 | 2020-08-15 21:58:21 | 2020-08-15 21:58:24 | no |
ru_RU.news.txt | ru_RU.news.txt | 118996424 | 113.4838 | FALSE | 438 | 2014-07-22 11:12:29 | 2020-08-15 21:58:24 | 2020-08-15 21:58:27 | no |
ru_RU.twitter.txt | ru_RU.twitter.txt | 105182346 | 100.3097 | FALSE | 438 | 2014-07-22 11:12:33 | 2020-08-15 21:58:27 | 2020-08-15 21:58:30 | no |
## my_descriptive_stat lines_count words_count mean_words_per_line
## 1: ru_RU.blogs.txt 337100 9388482 27.85073
## 2: ru_RU.news.txt 196360 9057248 46.12573
## 3: ru_RU.twitter.txt 881414 9231328 10.47332
my_descriptive_stat | lines_count | words_count | mean_words_per_line |
---|---|---|---|
ru_RU.blogs.txt | 337100 | 9388482 | 27.850732720261 |
ru_RU.news.txt | 196360 | 9057248 | 46.1257282542269 |
ru_RU.twitter.txt | 881414 | 9231328 | 10.4733167387856 |
## Length Class Mode
## [1,] 16855 -none- character
## [2,] 9818 -none- character
## [3,] 44070 -none- character
## Classes 'data.table' and 'data.frame': 3 obs. of 2 variables:
## $ Row_Names: chr "ru_RU.blogs.txt" "ru_RU.news.txt" "ru_RU.twitter.txt"
## $ tat :List of 3
## ..$ : chr "Помните, я говорила, что возвращаюсь в блог? Представляете, я вас тогда обманула) Но это всё моя подготовка к с"| __truncated__ "Немного расслабимся. Странный сайт, который продает квартиры в туле. Он у меня не влезает в экран, половина за "| __truncated__ "Когда Бог создавал время, он создал его достаточно." "Я даже подумать не могла, что их будет столько. Спасибо вам мои хорошие за вас. Со многими я уже успела подружи"| __truncated__ ...
## ..$ : chr "Исходя из этого, мы с братом Алексеем приняли решение поддержать строительство дацана, чтобы помочь верующим. И"| __truncated__ "– Во-первых, живём одним днём, во-вторых, политические процессы определяют экономические процессы. Велика корру"| __truncated__ "Перед судом предстали капитан этой субмарины Дмитрий Лаврентьев и трюмный матрос «Нерпы» Дмитрий Гробов." "Причём, помимо приморцев, в нелегальный оборот свежевыловленных лососёвых включились уже и граждане Узбекистана"| __truncated__ ...
## ..$ : chr "Ночные движения начинаются, еду играть сначала в Роад148, а потом в баркод на вечеринку от Dubstep+" "Жаль, что в экономическом блоке нет тезисов об отмене госкорпораций, антимонопольной политики, выплаты всех нал"| __truncated__ "Т.А очень-очень-очень скучаю :(" "А погодка хорошая для прогулки?=)" ...
## - attr(*, ".internal.selfref")=<externalptr>
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
## List of 3
## $ 1:List of 2
## ..$ content: chr [1:16855] "Помните, я говорила, что возвращаюсь в блог? Представляете, я вас тогда обманула) Но это всё моя подготовка к с"| __truncated__ "Немного расслабимся. Странный сайт, который продает квартиры в туле. Он у меня не влезает в экран, половина за "| __truncated__ "Когда Бог создавал время, он создал его достаточно." "Я даже подумать не могла, что их будет столько. Спасибо вам мои хорошие за вас. Со многими я уже успела подружи"| __truncated__ ...
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2020-08-29 14:49:15"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "1"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 2:List of 2
## ..$ content: chr [1:9818] "Исходя из этого, мы с братом Алексеем приняли решение поддержать строительство дацана, чтобы помочь верующим. И"| __truncated__ "– Во-первых, живём одним днём, во-вторых, политические процессы определяют экономические процессы. Велика корру"| __truncated__ "Перед судом предстали капитан этой субмарины Дмитрий Лаврентьев и трюмный матрос «Нерпы» Дмитрий Гробов." "Причём, помимо приморцев, в нелегальный оборот свежевыловленных лососёвых включились уже и граждане Узбекистана"| __truncated__ ...
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2020-08-29 14:49:15"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "2"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 3:List of 2
## ..$ content: chr [1:44070] "Ночные движения начинаются, еду играть сначала в Роад148, а потом в баркод на вечеринку от Dubstep+" "Жаль, что в экономическом блоке нет тезисов об отмене госкорпораций, антимонопольной политики, выплаты всех нал"| __truncated__ "Т.А очень-очень-очень скучаю :(" "А погодка хорошая для прогулки?=)" ...
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2020-08-29 14:49:15"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "3"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## - attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
## [1] "и" "в" "во" "не" "что" "он" "на"
## [8] "я" "с" "со" "как" "а" "то" "все"
## [15] "она" "так" "его" "но" "да" "ты" "к"
## [22] "у" "же" "вы" "за" "бы" "по" "только"
## [29] "ее" "мне" "было" "вот" "от" "меня" "еще"
## [36] "нет" "о" "из" "ему" "теперь" "когда" "даже"
## [43] "ну" "вдруг" "ли" "если" "уже" "или" "ни"
## [50] "быть" "был" "него" "до" "вас" "нибудь" "опять"
## [57] "уж" "вам" "сказал" "ведь" "там" "потом" "себя"
## [64] "ничего" "ей" "может" "они" "тут" "где" "есть"
## [71] "надо" "ней" "для" "мы" "тебя" "их" "чем"
## [78] "была" "сам" "чтоб" "без" "будто" "человек" "чего"
## [85] "раз" "тоже" "себе" "под" "жизнь" "будет" "ж"
## [92] "тогда" "кто" "этот" "говорил" "того" "потому" "этого"
## [99] "какой" "совсем" "ним" "здесь" "этом" "один" "почти"
## [106] "мой" "тем" "чтобы" "нее" "кажется" "сейчас" "были"
## [113] "куда" "зачем" "сказать" "всех" "никогда" "сегодня" "можно"
## [120] "при" "наконец" "два" "об" "другой" "хоть" "после"
## [127] "над" "больше" "тот" "через" "эти" "нас" "про"
## [134] "всего" "них" "какая" "много" "разве" "сказала" "три"
## [141] "эту" "моя" "впрочем" "хорошо" "свою" "этой" "перед"
## [148] "иногда" "лучше" "чуть" "том" "нельзя" "такой" "им"
## [155] "более" "всегда" "конечно" "всю" "между"
## List of 6
## $ i : int [1:707054] 7 12 15 19 24 26 27 28 40 41 ...
## $ j : int [1:707054] 1 1 1 1 1 1 1 1 1 1 ...
## $ v : num [1:707054] 1 1 1 1 1 1 1 1 1 1 ...
## $ nrow : int 705823
## $ ncol : int 3
## $ dimnames:List of 2
## ..$ Terms: chr [1:705823] "- ежедневная теорема" "- одни изображали" "–– ветер восточным" "–– некоммерческая история" ...
## ..$ Docs : chr [1:3] "character(0)" "character(0)" "character(0)"
## - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
## - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
## List of 6
## $ i : int [1:725617] 7 9 10 11 12 13 14 15 16 17 ...
## $ j : int [1:725617] 1 1 1 1 1 1 1 1 1 1 ...
## $ v : num [1:725617] 10 1 8 1 1 1 1 1 1 1 ...
## $ nrow : int 706114
## $ ncol : int 3
## $ dimnames:List of 2
## ..$ Terms: chr [1:706114] "- ежедневная" "- одни" "–– ветер" "–– некоммерческая" ...
## ..$ Docs : chr [1:3] "character(0)" "character(0)" "character(0)"
## - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
## - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
## List of 6
## $ i : int [1:233578] 3 4 6 11 19 20 21 22 23 24 ...
## $ j : int [1:233578] 1 1 1 1 1 1 1 1 1 1 ...
## $ v : num [1:233578] 1 1 1 1 1 1 1 1 1 1 ...
## $ nrow : int 165543
## $ ncol : int 3
## $ dimnames:List of 2
## ..$ Terms: chr [1:165543] "–краткий" "–линейными" "–люди" "–солнце" ...
## ..$ Docs : chr [1:3] "character(0)" "character(0)" "character(0)"
## - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
## - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
## Terms count
## 34707 года 1866
## 127041 россии 1253
## 34772 году 1228
## 64093 который 1127
## 45855 жизни 888
## 70117 люди 885
## 120501 работы 817
## 70109 людей 806
## 38791 делать 688
## 42059 дома 652
## Terms count
## 407267 пермского края 327
## 540315 самом деле 231
## 558927 сих пор 228
## 153412 доброе утро 220
## 617195 таким образом 159
## 268004 лет назад 127
## 347631 новый год 123
## 299753 млрд руб 120
## 494475 прошлом году 117
## 164671 друг друга 109
## Terms count
## 455999 правительства пермского края 44
## 86596 возбуждено уголовное дело 23
## 282540 малого среднего бизнеса 16
## 404627 передает риа «новости» 16
## 155567 доброго времени суток 15
## 326687 наступающим новым годом 15
## 596119 ст ук рф 15
## 300470 млрд куб м 14
## 626632 территории пермского края 14
## 636285 трлн куб м 14
## 'data.frame': 165543 obs. of 2 variables:
## $ Terms: Factor w/ 165543 levels "–краткий","–линейными",..: 34707 127041 34772 64093 45855 70117 120501 70109 38791 42059 ...
## $ count: num 1866 1253 1228 1127 888 ...
## 'data.frame': 50 obs. of 2 variables:
## $ Terms: Factor w/ 165543 levels "–краткий","–линейными",..: 34707 127041 34772 64093 45855 70117 120501 70109 38791 42059 ...
## $ count: num 1866 1253 1228 1127 888 ...
## Terms count relative_frequency my_coverage_stat
## –краткий : 1 Min. : 1.000 Min. :1.257e-06 Min. :0.002346
## –линейными: 1 1st Qu.: 1.000 1st Qu.:1.257e-06 1st Qu.:0.809945
## –люди : 1 Median : 1.000 Median :1.257e-06 Median :0.895940
## –солнце : 1 Mean : 4.805 Mean :6.041e-06 Mean :0.854150
## –таки : 1 3rd Qu.: 3.000 3rd Qu.:3.772e-06 3rd Qu.:0.947970
## –хау : 1 Max. :1866.000 Max. :2.346e-03 Max. :1.000000
## (Other) :165537
## index
## Min. : 1
## 1st Qu.: 41387
## Median : 82772
## Mean : 82772
## 3rd Qu.:124158
## Max. :165543
##
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## [1] "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
## chr [1:3] "de_DE.blogs.txt" "de_DE.news.txt" "de_DE.twitter.txt"
## chr [1:12] "C:\\Users\\Alex\\Documents\\COURSERA STUDIES\\DATA_SCIENCE_CAPSTONE\\FINAL_DATA\\final\\/de_DE/de_DE.blogs.txt" ...
## Length Class Mode
## 12 character character
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding
## = "UTF-8"): incomplete final line found on 'C:\Users\Alex\Documents\COURSERA
## STUDIES\DATA_SCIENCE_CAPSTONE\FINAL_DATA\final\de_DE\de_DE.blogs.txt'
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding =
## "UTF-8"): line 8653 appears to contain an embedded nul
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding =
## "UTF-8"): line 78077 appears to contain an embedded nul
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding =
## "UTF-8"): line 152105 appears to contain an embedded nul
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding
## = "UTF-8"): incomplete final line found on 'C:\Users\Alex\Documents\COURSERA
## STUDIES\DATA_SCIENCE_CAPSTONE\FINAL_DATA\final\de_DE\de_DE.blogs.txt'
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding =
## "UTF-8"): line 8653 appears to contain an embedded nul
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding =
## "UTF-8"): line 78077 appears to contain an embedded nul
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding =
## "UTF-8"): line 152105 appears to contain an embedded nul
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding
## = "UTF-8"): incomplete final line found on 'C:\Users\Alex\Documents\COURSERA
## STUDIES\DATA_SCIENCE_CAPSTONE\FINAL_DATA\final\de_DE\de_DE.blogs.txt'
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding =
## "UTF-8"): line 8653 appears to contain an embedded nul
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding =
## "UTF-8"): line 78077 appears to contain an embedded nul
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding =
## "UTF-8"): line 152105 appears to contain an embedded nul
## List of 3
## $ : chr [1:181958] "Irgendwann wird es Zeit. Ich schleppe es ja auch jeden Tag mit mir herum. Da leidet es schon ein wenig. Nicht n"| __truncated__ "Kommentar: auch hier wird durch die Voranstellung einer entsprechenden Bemuskelung des Halses vor die Eleganz d"| __truncated__ "In vielen Staaten veranlasst vor allem die Hoffnung auf bessere Erwerbsmoglichkeiten die Menschen dazu, sich in"| __truncated__ "Nneka – Heartbeat (Crada Remix ft. NAS)" ...
## $ : chr [1:244743] "Das Rezept fur ihre Schokobrezln hat die 60-Jahrige schon vor 26 Jahren in einer osterreichischen Sendung entde"| __truncated__ "Fur die Linksparteibewerber ist nun die entscheidende Frage, wie sich SPD- und Grunenwahler in einer solchen Pe"| __truncated__ "Nach Einschatzung des DIW ist das kraftige Plus im dritten Quartal vor allem dem Quartalsauftakt im Juli zu ver"| __truncated__ "„Der Bau eines neuen Lagers ist ein weiterer Baustein unserer Umstrukturierungsma<U+00DF>nahmen, mit denen wir "| __truncated__ ...
## $ : chr [1:947774] "irgendwas stimmt mut meinem internet am pc nich :(" "\"Wir haben hier ein angebrochenes Fass Bier!\" habe ich mir auch anders vorgestellt. Fragt sich nur, wer daruber gekotzt hat." "Meine Kommilitonen beschweren sich, nie Freizeit zu haben... Anscheinend mache ich was falsch. Naja. Lauft..." "Gestern noch in Bangkok, heute wegen des Hochwassers in Vientiane, Laos. Schade, die Damme dort haben gehalten im Zentrum." ...
## List of 3
## $ : chr [1:181958] "Irgendwann wird es Zeit. Ich schleppe es ja auch jeden Tag mit mir herum. Da leidet es schon ein wenig. Nicht n"| __truncated__ "Kommentar: auch hier wird durch die Voranstellung einer entsprechenden Bemuskelung des Halses vor die Eleganz d"| __truncated__ "In vielen Staaten veranlasst vor allem die Hoffnung auf bessere Erwerbsmoglichkeiten die Menschen dazu, sich in"| __truncated__ "Nneka – Heartbeat (Crada Remix ft. NAS)" ...
## $ : chr [1:244743] "Das Rezept fur ihre Schokobrezln hat die 60-Jahrige schon vor 26 Jahren in einer osterreichischen Sendung entde"| __truncated__ "Fur die Linksparteibewerber ist nun die entscheidende Frage, wie sich SPD- und Grunenwahler in einer solchen Pe"| __truncated__ "Nach Einschatzung des DIW ist das kraftige Plus im dritten Quartal vor allem dem Quartalsauftakt im Juli zu ver"| __truncated__ "„Der Bau eines neuen Lagers ist ein weiterer Baustein unserer Umstrukturierungsma<U+00DF>nahmen, mit denen wir "| __truncated__ ...
## $ : chr [1:947774] "irgendwas stimmt mut meinem internet am pc nich :(" "\"Wir haben hier ein angebrochenes Fass Bier!\" habe ich mir auch anders vorgestellt. Fragt sich nur, wer daruber gekotzt hat." "Meine Kommilitonen beschweren sich, nie Freizeit zu haben... Anscheinend mache ich was falsch. Naja. Lauft..." "Gestern noch in Bangkok, heute wegen des Hochwassers in Vientiane, Laos. Schade, die Damme dort haben gehalten im Zentrum." ...
## [1] 95591959
## [1] 95591959
## [1] "file 1 size is equal to: 85459666"
## [1] "file 2 size is equal to: 95591959"
## [1] "file 3 size is equal to: 75578341"
## [1] "The name of the file number 1 is de_DE.blogs.txt"
## [1] "File number 1 ( de_DE.blogs.txt ) is blogs"
## [1] "File 1 ( de_DE.blogs.txt ) size is equal to: 85459666"
## [1] "Is file 1 ( de_DE.blogs.txt ) is a directory? No, it is not"
## [1] "The file number 1 ( de_DE.blogs.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 1 ( de_DE.blogs.txt ) was created is: 2014-07-22 11:11:32"
## [1] "The assumed time when the file number 1 ( de_DE.blogs.txt ) was apploaded to the computer is: 2020-08-15 21:58:17"
## [1] "The estimated time when this file 1 ( de_DE.blogs.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:19"
## [1] "Does the file 1 ( de_DE.blogs.txt ) contain executable code? No, it doesn't."
## [1] "The name of the file number 2 is de_DE.news.txt"
## [1] "File number 2 ( de_DE.news.txt ) is news"
## [1] "File 2 ( de_DE.news.txt ) size is equal to: 95591959"
## [1] "Is file 2 ( de_DE.news.txt ) is a directory? No, it is not"
## [1] "The file number 2 ( de_DE.news.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 2 ( de_DE.news.txt ) was created is: 2014-07-22 11:11:43"
## [1] "The assumed time when the file number 2 ( de_DE.news.txt ) was apploaded to the computer is: 2020-08-15 21:58:19"
## [1] "The estimated time when this file 2 ( de_DE.news.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:21"
## [1] "Does the file 2 ( de_DE.news.txt ) contain executable code? No, it doesn't."
## [1] "The name of the file number 3 is de_DE.twitter.txt"
## [1] "File number 3 ( de_DE.twitter.txt ) is twitter"
## [1] "File 3 ( de_DE.twitter.txt ) size is equal to: 75578341"
## [1] "Is file 3 ( de_DE.twitter.txt ) is a directory? No, it is not"
## [1] "The file number 3 ( de_DE.twitter.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 3 ( de_DE.twitter.txt ) was created is: 2014-07-22 11:11:37"
## [1] "The assumed time when the file number 3 ( de_DE.twitter.txt ) was apploaded to the computer is: 2020-08-15 21:58:14"
## [1] "The estimated time when this file 3 ( de_DE.twitter.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:17"
## [1] "Does the file 3 ( de_DE.twitter.txt ) contain executable code? No, it doesn't."
## [1] "The name of the file number 1 is de_DE.blogs.txt"
## [1] "File number 1 ( de_DE.blogs.txt ) is blogs"
## [1] "File 1 ( de_DE.blogs.txt ) size is equal to: 85459666"
## [1] "Is file 1 ( de_DE.blogs.txt ) is a directory? No, it is not"
## [1] "The file number 1 ( de_DE.blogs.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 1 ( de_DE.blogs.txt ) was created is: 2014-07-22 11:11:32"
## [1] "The assumed time when the file number 1 ( de_DE.blogs.txt ) was apploaded to the computer is: 2020-08-15 21:58:17"
## [1] "The estimated time when this file 1 ( de_DE.blogs.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:19"
## [1] "Does the file 1 ( de_DE.blogs.txt ) contain executable code? No, it doesn't."
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 1"
## [1] "The name of the file number 2 is de_DE.news.txt"
## [1] "File number 2 ( de_DE.news.txt ) is news"
## [1] "File 2 ( de_DE.news.txt ) size is equal to: 95591959"
## [1] "Is file 2 ( de_DE.news.txt ) is a directory? No, it is not"
## [1] "The file number 2 ( de_DE.news.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 2 ( de_DE.news.txt ) was created is: 2014-07-22 11:11:43"
## [1] "The assumed time when the file number 2 ( de_DE.news.txt ) was apploaded to the computer is: 2020-08-15 21:58:19"
## [1] "The estimated time when this file 2 ( de_DE.news.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:21"
## [1] "Does the file 2 ( de_DE.news.txt ) contain executable code? No, it doesn't."
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 2"
## [1] "The name of the file number 3 is de_DE.twitter.txt"
## [1] "File number 3 ( de_DE.twitter.txt ) is twitter"
## [1] "File 3 ( de_DE.twitter.txt ) size is equal to: 75578341"
## [1] "Is file 3 ( de_DE.twitter.txt ) is a directory? No, it is not"
## [1] "The file number 3 ( de_DE.twitter.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 3 ( de_DE.twitter.txt ) was created is: 2014-07-22 11:11:37"
## [1] "The assumed time when the file number 3 ( de_DE.twitter.txt ) was apploaded to the computer is: 2020-08-15 21:58:14"
## [1] "The estimated time when this file 3 ( de_DE.twitter.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:17"
## [1] "Does the file 3 ( de_DE.twitter.txt ) contain executable code? No, it doesn't."
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 3"
## my_data_frame file_names file_size file_size_Mb if_directory
## 1: de_DE.blogs.txt de_DE.blogs.txt 85459666 81.50069 FALSE
## 2: de_DE.news.txt de_DE.news.txt 95591959 91.16360 FALSE
## 3: de_DE.twitter.txt de_DE.twitter.txt 75578341 72.07712 FALSE
## octal_perm_code file_created file_downloaded file_recent_vers
## 1: 438 2014-07-22 11:11:32 2020-08-15 21:58:17 2020-08-15 21:58:19
## 2: 438 2014-07-22 11:11:43 2020-08-15 21:58:19 2020-08-15 21:58:21
## 3: 438 2014-07-22 11:11:37 2020-08-15 21:58:14 2020-08-15 21:58:17
## if_exe
## 1: no
## 2: no
## 3: no
| my_data_frame | file_names | file_size | file_size_Mb | if_directory | octal_perm_code | file_created | file_downloaded | file_recent_vers | if_exe |
|---|---|---|---|---|---|---|---|---|---|
| de_DE.blogs.txt | de_DE.blogs.txt | 85459666 | 81.50069 | FALSE | 438 | 2014-07-22 11:11:32 | 2020-08-15 21:58:17 | 2020-08-15 21:58:19 | no |
| de_DE.news.txt | de_DE.news.txt | 95591959 | 91.16360 | FALSE | 438 | 2014-07-22 11:11:43 | 2020-08-15 21:58:19 | 2020-08-15 21:58:21 | no |
| de_DE.twitter.txt | de_DE.twitter.txt | 75578341 | 72.07712 | FALSE | 438 | 2014-07-22 11:11:37 | 2020-08-15 21:58:14 | 2020-08-15 21:58:17 | no |
## my_descriptive_stat lines_count words_count mean_words_per_line
## 1: de_DE.blogs.txt 181958 6205913 34.10629
## 2: de_DE.news.txt 244743 13375092 54.64954
## 3: de_DE.twitter.txt 947774 11646033 12.28777
| my_descriptive_stat | lines_count | words_count | mean_words_per_line |
|---|---|---|---|
| de_DE.blogs.txt | 181958 | 6205913 | 34.1062937600985 |
| de_DE.news.txt | 244743 | 13375092 | 54.649538495483 |
| de_DE.twitter.txt | 947774 | 11646033 | 12.2877743006244 |
## Length Class Mode
## [1,] 9097 -none- character
## [2,] 12237 -none- character
## [3,] 47388 -none- character
## Classes 'data.table' and 'data.frame': 3 obs. of 2 variables:
## $ Row_Names: chr "de_DE.blogs.txt" "de_DE.news.txt" "de_DE.twitter.txt"
## $ tat :List of 3
## ..$ : chr "Es gibt viele Situationen und Lebensumstande im Buch die mir sehr gut gefallen haben. So sind mitunter kleine W"| __truncated__ "Nur fur heute.*" "Und nun warten diverse Baume und noch ein spontaner Styroporschnitzjob auf uns- auf geht`s!" "Angezogen sieht das Kleid supersu<U+00DF> aus." ...
## ..$ : chr "Mit dem Papst tritt der Reprasentant einer autokratischen Wahlmonarchie auf, die weder die fur Demokratien ubli"| __truncated__ "Kinder sind allgemein nicht starker armutsgefahrdet als der Durchschnitt der Bevolkerung, betonte Egeler. 2008 "| __truncated__ "Schneider wird in Augsburg willkommen sein, hat er in der Region doch einige knifflige Falle gelost. Der Expert"| __truncated__ "Wer den arabischen Fruhling im Fernsehen verfolgt hat, ist mit hoher Wahrscheinlichkeit diesem Mann begegnet: H"| __truncated__ ...
## ..$ : chr "Nicht ein deutscher Nachrichtensender zeigt Nachrichten. Peinlich" "Aber hoffentlich ohne #Lion? Meines ist leider sehr langsam geworden" "Wohl war, Wohl war." "Bin morgen leider nicht dabei. Lass ins doch kommende Woche mal sprechen." ...
## - attr(*, ".internal.selfref")=<externalptr>
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
## List of 3
## $ 1:List of 2
## ..$ content: chr [1:9097] "Es gibt viele Situationen und Lebensumstande im Buch die mir sehr gut gefallen haben. So sind mitunter kleine W"| __truncated__ "Nur fur heute.*" "Und nun warten diverse Baume und noch ein spontaner Styroporschnitzjob auf uns- auf geht`s!" "Angezogen sieht das Kleid supersu<U+00DF> aus." ...
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2020-08-29 15:09:43"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "1"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 2:List of 2
## ..$ content: chr [1:12237] "Mit dem Papst tritt der Reprasentant einer autokratischen Wahlmonarchie auf, die weder die fur Demokratien ubli"| __truncated__ "Kinder sind allgemein nicht starker armutsgefahrdet als der Durchschnitt der Bevolkerung, betonte Egeler. 2008 "| __truncated__ "Schneider wird in Augsburg willkommen sein, hat er in der Region doch einige knifflige Falle gelost. Der Expert"| __truncated__ "Wer den arabischen Fruhling im Fernsehen verfolgt hat, ist mit hoher Wahrscheinlichkeit diesem Mann begegnet: H"| __truncated__ ...
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2020-08-29 15:09:43"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "2"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 3:List of 2
## ..$ content: chr [1:47388] "Nicht ein deutscher Nachrichtensender zeigt Nachrichten. Peinlich" "Aber hoffentlich ohne #Lion? Meines ist leider sehr langsam geworden" "Wohl war, Wohl war." "Bin morgen leider nicht dabei. Lass ins doch kommende Woche mal sprechen." ...
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2020-08-29 15:09:43"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "3"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## - attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
## [1] "aber" "alle" "allem" "allen" "aller" "alles"
## [7] "als" "also" "am" "an" "ander" "andere"
## [13] "anderem" "anderen" "anderer" "anderes" "anderm" "andern"
## [19] "anderr" "anders" "auch" "auf" "aus" "bei"
## [25] "bin" "bis" "bist" "da" "damit" "dann"
## [31] "der" "den" "des" "dem" "die" "das"
## [37] "da<U+00DF>" "derselbe" "derselben" "denselben" "desselben" "demselben"
## [43] "dieselbe" "dieselben" "dasselbe" "dazu" "dein" "deine"
## [49] "deinem" "deinen" "deiner" "deines" "denn" "derer"
## [55] "dessen" "dich" "dir" "du" "dies" "diese"
## [61] "diesem" "diesen" "dieser" "dieses" "doch" "dort"
## [67] "durch" "ein" "eine" "einem" "einen" "einer"
## [73] "eines" "einig" "einige" "einigem" "einigen" "einiger"
## [79] "einiges" "einmal" "er" "ihn" "ihm" "es"
## [85] "etwas" "euer" "eure" "eurem" "euren" "eurer"
## [91] "eures" "fur" "gegen" "gewesen" "hab" "habe"
## [97] "haben" "hat" "hatte" "hatten" "hier" "hin"
## [103] "hinter" "ich" "mich" "mir" "ihr" "ihre"
## [109] "ihrem" "ihren" "ihrer" "ihres" "euch" "im"
## [115] "in" "indem" "ins" "ist" "jede" "jedem"
## [121] "jeden" "jeder" "jedes" "jene" "jenem" "jenen"
## [127] "jener" "jenes" "jetzt" "kann" "kein" "keine"
## [133] "keinem" "keinen" "keiner" "keines" "konnen" "konnte"
## [139] "machen" "man" "manche" "manchem" "manchen" "mancher"
## [145] "manches" "mein" "meine" "meinem" "meinen" "meiner"
## [151] "meines" "mit" "muss" "musste" "nach" "nicht"
## [157] "nichts" "noch" "nun" "nur" "ob" "oder"
## [163] "ohne" "sehr" "sein" "seine" "seinem" "seinen"
## [169] "seiner" "seines" "selbst" "sich" "sie" "ihnen"
## [175] "sind" "so" "solche" "solchem" "solchen" "solcher"
## [181] "solches" "soll" "sollte" "sondern" "sonst" "uber"
## [187] "um" "und" "uns" "unse" "unsem" "unsen"
## [193] "unser" "unses" "unter" "viel" "vom" "von"
## [199] "vor" "wahrend" "war" "waren" "warst" "was"
## [205] "weg" "weil" "weiter" "welche" "welchem" "welchen"
## [211] "welcher" "welches" "wenn" "werde" "werden" "wie"
## [217] "wieder" "will" "wir" "wird" "wirst" "wo"
## [223] "wollen" "wollte" "wurde" "wurden" "zu" "zum"
## [229] "zur" "zwar" "zwischen"
## List of 6
## $ i : int [1:567999] 2 3 5 9 10 11 13 18 19 20 ...
## $ j : int [1:567999] 1 1 1 1 1 1 1 1 1 1 ...
## $ v : num [1:567999] 1 1 1 1 1 1 1 1 1 1 ...
## $ nrow : int 567687
## $ ncol : int 3
## $ dimnames:List of 2
## ..$ Terms: chr [1:567687] "\177 eigenen angaben" "–– wechselbeziehung dimen" "– – ”" "– – amerikanisch" ...
## ..$ Docs : chr [1:3] "character(0)" "character(0)" "character(0)"
## - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
## - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
## List of 6
## $ i : int [1:599972] 2 3 4 5 7 8 9 11 12 13 ...
## $ j : int [1:599972] 1 1 1 1 1 1 1 1 1 1 ...
## $ v : num [1:599972] 1 7 1 1 2 2 7 1 1 1 ...
## $ nrow : int 589173
## $ ncol : int 3
## $ dimnames:List of 2
## ..$ Terms: chr [1:589173] "\177 eigenen" "–– wechselbeziehung" "– –" "– —" ...
## ..$ Docs : chr [1:3] "character(0)" "character(0)" "character(0)"
## - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
## - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
## List of 6
## $ i : int [1:164445] 1 2 3 4 5 6 9 10 11 13 ...
## $ j : int [1:164445] 1 1 1 1 1 1 1 1 1 1 ...
## $ v : num [1:164445] 1 1 1 1 1 1 1 2 1 1 ...
## $ nrow : int 123615
## $ ncol : int 3
## $ dimnames:List of 2
## ..$ Terms: chr [1:123615] "–durfen" "–ketchup" "–stromung" "–zuchter" ...
## ..$ Docs : chr [1:3] "character(0)" "character(0)" "character(0)"
## - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
## - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
## Terms count
## 33698 euro 1425
## 67391 macht 1354
## 16100 berlin 1173
## 69849 menschen 1136
## 64405 leben 1014
## 25369 deutschland 938
## 65620 lieb 830
## 117441 welt 825
## 49011 haus 699
## 9200 arbeit 671
## Terms count
## 343316 milliarden euro 284
## 343556 millionen euro 234
## 369236 new york 128
## 28335 angela merkel 101
## 439968 schones wochenend 94
## 456056 social media 90
## 235691 hartz iv 83
## 243650 herzlichen gluckwunsch 77
## 343312 milliarden dollar 75
## 527368 vereinigten staaten 72
## Terms count
## 104441 dauernd hartz iv 64
## 227720 hartz iv schreiben 64
## 256870 iv schreiben dauernd 63
## 425812 schreiben dauernd hartz 63
## 89860 bundeskanzlerin angela merkel 43
## 28730 angela merkel cdu 25
## 57604 beer beer beer 23
## 135964 ent ent ent 21
## 309136 lieb lieb lieb 21
## 152595 europaischen zentralbank ezb 19
## 'data.frame': 123615 obs. of 2 variables:
## $ Terms: Factor w/ 123615 levels "–durfen","–ketchup",..: 33698 67391 16100 69849 64405 25369 65620 117441 49011 9200 ...
## $ count: num 1425 1354 1173 1136 1014 ...
## 'data.frame': 50 obs. of 2 variables:
## $ Terms: Factor w/ 123615 levels "–durfen","–ketchup",..: 33698 67391 16100 69849 64405 25369 65620 117441 49011 9200 ...
## $ count: num 1425 1354 1173 1136 1014 ...
## Terms count relative_frequency
## –durfen : 1 Min. : 1.000 Min. :1.555e-06
## –ketchup : 1 1st Qu.: 1.000 1st Qu.:1.555e-06
## –stromung : 1 Median : 1.000 Median :1.555e-06
## –zuchter : 1 Mean : 5.201 Mean :8.090e-06
## —–ursprunglich: 1 3rd Qu.: 2.000 3rd Qu.:3.111e-06
## —fruh– : 1 Max. :1425.000 Max. :2.216e-03
## (Other) :123609
## my_coverage_stat index
## Min. :0.002216 Min. : 1
## 1st Qu.:0.836928 1st Qu.: 30905
## Median :0.903872 Median : 61808
## Mean :0.872124 Mean : 61808
## 3rd Qu.:0.951936 3rd Qu.: 92712
## Max. :1.000000 Max. :123615
##
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
This data may be used for comparison in the future, or as datasets for building ML (machine learning) algorithms that suggest word completions or next words to users based on the beginning of their input. Such an algorithm should not be computationally heavy, so that it does not overload the server equipment and the user's waiting time stays minimal; for this project that trade-off is acceptable.
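As a minimal sketch only (not part of the processing above), the snippet below shows how such a lightweight lookup could work on top of the bigram frequency tables summarized earlier. The function name predict_next_word and the input data frame bigram_freq are hypothetical names assumed here for illustration; the columns Terms and count mirror the frequency data frames shown above.
# Illustrative sketch only: a simple next-word lookup based on a bigram
# frequency table (here called bigram_freq, with columns Terms and count;
# these object names are assumptions, not objects created above).
predict_next_word <- function(prev_word, bigram_freq, n = 3) {
  prev_word <- tolower(trimws(prev_word))
  # keep only bigrams whose first word matches the last word typed by the user
  matches <- bigram_freq[grepl(paste0("^", prev_word, " "), bigram_freq$Terms), ]
  if (nrow(matches) == 0) return(character(0))
  # order by frequency and return the second word of the top n bigrams
  matches <- matches[order(-matches$count), ]
  vapply(strsplit(head(as.character(matches$Terms), n), " "),
         function(x) x[2], character(1))
}
# Hypothetical usage with two bigrams taken from the Russian table above:
# bigram_freq <- data.frame(Terms = c("доброе утро", "самом деле"),
#                           count = c(220, 231), stringsAsFactors = FALSE)
# predict_next_word("доброе", bigram_freq)   # returns "утро"
Because the lookup is a single filter and sort over a precomputed table, the per-request cost stays small, which matches the low-latency requirement described above.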