About the project

The goal of the project is to clean a raw dataset and describe what people read and write in online news, social media and blogs within one sample year, as part of a JHU course project.

The Internet is an important source of data nowadays, and effective statistical and computational methods for processing its big data are of great interest.

This project uses the raw (uncleaned) data in .txt format, in the original languages, as they are.

Presentations documenting the data-processing methods may be found here: https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html.

The data itself may be found here: .

The project is written in R, a free statistical programming language. Code samples are included in this text.

Preparing the R environment

The English part of the analysis

The English part of this report is based on the English Twitter, news and US-UK blogs datasets. The code and its output look like this:

Sys.setlocale("LC_ALL","English")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
### Now load the data: 
my_path = "C:\\Users\\Alex\\Documents\\COURSERA STUDIES\\DATA_SCIENCE_CAPSTONE\\FINAL_DATA\\final\\"
dir(my_path)
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
my_path_eng = "C:\\Users\\Alex\\Documents\\COURSERA STUDIES\\DATA_SCIENCE_CAPSTONE\\FINAL_DATA\\final\\en_US\\"
# Only the en_US subfolder is used in this part of the analysis
# The pattern is a regular expression, so match a literal ".txt" at the end of the name
filenames = dir(my_path_eng, pattern="\\.txt$")
str(filenames)
##  chr [1:3] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
files    <- list.files(path = my_path, full.names = TRUE, recursive = TRUE, pattern="\\.txt$")
str(files)
##  chr [1:12] "C:\\Users\\Alex\\Documents\\COURSERA STUDIES\\DATA_SCIENCE_CAPSTONE\\FINAL_DATA\\final\\/de_DE/de_DE.blogs.txt" ...
summary(files)
##    Length     Class      Mode 
##        12 character character
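The `pattern` argument of `dir()` and `list.files()` is a regular expression rather than a shell glob, so it helps to anchor the match to the end of the file name. The following is a minimal sketch of that idea; the temporary folder and the file names in it are made up for illustration and are not part of the dataset.

```r
# Hypothetical files in a throwaway folder, used only to illustrate the matching
demo_dir <- tempfile("pattern_demo_")
dir.create(demo_dir)
invisible(file.create(file.path(demo_dir, c("a.txt", "b.txt.bak", "notes_txt"))))

# Anchored regular expression: the name must end in ".txt"
strict <- list.files(demo_dir, pattern = "\\.txt$")

# glob2rx() from the utils package translates a shell glob into such a regex
glob_regex <- glob2rx("*.txt")

print(strict)
print(glob_regex)
```

Either form can be passed as `pattern`; the anchored version avoids picking up names that merely contain "txt" somewhere.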
# Open all files together - we need more data to teach our algorithm 
ldf <- lapply(paste(my_path_eng, filenames, sep=""), readLines)
## Warning in FUN(X[[i]], ...): incomplete final
## line found on 'C:\Users\Alex\Documents\COURSERA
## STUDIES\DATA_SCIENCE_CAPSTONE\FINAL_DATA\final\en_US\en_US.news.txt'
## Warning in FUN(X[[i]], ...): line 167155 appears to contain an embedded nul
## Warning in FUN(X[[i]], ...): line 268547 appears to contain an embedded nul
## Warning in FUN(X[[i]], ...): line 1274086 appears to contain an embedded nul
## Warning in FUN(X[[i]], ...): line 1759032 appears to contain an embedded nul
str(ldf)
## List of 3
##  $ : chr [1:899288] "In the years thereafter, most of the Oil fields and platforms were named after pagan â\200œgodsâ\200\235." "We love you Mr. Brown." "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been"| __truncated__ "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter."| __truncated__ ...
##  $ : chr [1:77259] "He wasn't home alone, apparently." "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset o"| __truncated__ "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new bi"| __truncated__ "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton"| __truncated__ ...
##  $ : chr [1:2360148] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long." "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason." "they've decided its more fun if I don't." "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)" ...
res <- lapply(ldf, summary)
res # [[1]] is blogs, [[2]] is news, [[3]] is twitter; the per-file summaries taken further below confirm this 
## [[1]]
##    Length     Class      Mode 
##    899288 character character 
## 
## [[2]]
##    Length     Class      Mode 
##     77259 character character 
## 
## [[3]]
##    Length     Class      Mode 
##   2360148 character character
#for (i in 1:length(res))
#  assign(paste(paste("df", i, sep=""), "summary", sep="."), res[[i]]) #if there would be DFs different on number in names 

# Also open all files separately - it may be useful: 
blogs   <- readLines(paste(my_path_eng, "en_US.blogs.txt", sep=""),   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines(paste(my_path_eng, "en_US.news.txt", sep=""),    encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines(paste(my_path_eng, "en_US.news.txt", sep = ""), encoding
## = "UTF-8", : incomplete final line found on 'C:\Users\Alex\Documents\COURSERA
## STUDIES\DATA_SCIENCE_CAPSTONE\FINAL_DATA\final\en_US\en_US.news.txt'
twitter <- readLines(paste(my_path_eng, "en_US.twitter.txt", sep=""), encoding = "UTF-8", skipNul = TRUE)
summary(blogs)
##    Length     Class      Mode 
##    899288 character character
summary(news)
##    Length     Class      Mode 
##     77259 character character
summary(twitter)       
##    Length     Class      Mode 
##   2360148 character character
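The warnings about embedded nuls and an incomplete final line seen earlier come from how `readLines()` handles such bytes by default. As a rough sketch, the effect of `skipNul` and `warn` can be checked on a small throwaway file; the file and its contents below are invented for the demonstration.

```r
# Build a small throwaway file with an embedded nul byte and no final newline
demo_file <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("ab"), as.raw(0), charToRaw("c\nsecond")), demo_file)

# skipNul = TRUE drops the nul bytes, warn = FALSE silences the
# "incomplete final line" warning, encoding marks the strings as UTF-8
demo_lines <- readLines(demo_file, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)

print(demo_lines)
unlink(demo_file)
```

Silencing the warning is a matter of taste; keeping `warn = TRUE` is reasonable when an unexpected truncation of the file should be noticed.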
file_sizes <- lapply(paste(my_path_eng, filenames, sep=""), file.info)
file_sizes[[2]]$size
## [1] 205811889
#result
fileList <- list.files(my_path_eng, pattern="*.txt")



dat <- file_sizes[[1]]$size
for (i in 1:3){
    # The message is not stored in a variable called `c`, which would mask base::c()
    size_msg = paste("file ", i, "size is equal to:", file_sizes[[i]]$size, sep = " ")
    print(size_msg)
}
## [1] "file  1 size is equal to: 210160014"
## [1] "file  2 size is equal to: 205811889"
## [1] "file  3 size is equal to: 167105338"
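### Loop answering all the questions about the dataset. 
### Note: in file.info(), mtime is the last modification time, ctime is the last status change 
### (the creation time on Windows) and atime is the last access time, 
### so the wording of the messages below is only an approximation. 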
for (i in 1:3){
    
    a = paste("The name of the file number", i, "is", filenames[[i]], sep = " ")
    print(a)
    b = paste("File number ", i, "(", filenames[[i]], ")", "is", ifelse(i == 1, "blogs", ifelse(i == 2, "news", "twitter")), sep = " ")
    print(b)
    c1 = paste("File ", i, "(", filenames[[i]], ")", "size is equal to:", file_sizes[[i]]$size, sep = " ")
    print(c1)
    d = paste("Is file ", i, "(", filenames[[i]], ")", "is a directory?", ifelse(file_sizes[[i]]$isdir == TRUE, "Yes. it is.", "No, it is not"), sep = " ")
    print(d)
    e = paste("The file number", i, "(", filenames[[i]], ")", "permissions code printed in the octal number is:", file_sizes[[i]]$mode, sep = " ")
    print(e)
    f = paste("The supposed date and time when the original file number", i, "(", filenames[[i]], ")", "was created is:",  file_sizes[[i]]$mtime , sep = " ")
    print(f)
    g = paste("The assumed time when the file number", i, "(", filenames[[i]], ")", "was apploaded to the computer is:", file_sizes[[i]]$ctime, sep = " ")
    print(g)
    h = paste("The estimated time when this file", i, "(", filenames[[i]], ")", "was the first time loaded into the R global environment is:", file_sizes[[i]]$atime, sep = " ")
    print(h)
    i2 = paste("Does the file ", i, "(", filenames[[i]], ")", "contain executable code?", ifelse(file_sizes[[i]]$exe == "no", "No, it doesn't.", "Yes, it does."), sep = " ")
    print(i2)

}
## [1] "The name of the file number 1 is en_US.blogs.txt"
## [1] "File number  1 ( en_US.blogs.txt ) is blogs"
## [1] "File  1 ( en_US.blogs.txt ) size is equal to: 210160014"
## [1] "Is file  1 ( en_US.blogs.txt ) is a directory? No, it is not"
## [1] "The file number 1 ( en_US.blogs.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 1 ( en_US.blogs.txt ) was created is: 2014-07-22 11:13:05"
## [1] "The assumed time when the file number 1 ( en_US.blogs.txt ) was apploaded to the computer is: 2020-08-15 21:58:42"
## [1] "The estimated time when this file 1 ( en_US.blogs.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:49"
## [1] "Does the file  1 ( en_US.blogs.txt ) contain executable code? No, it doesn't."
## [1] "The name of the file number 2 is en_US.news.txt"
## [1] "File number  2 ( en_US.news.txt ) is news"
## [1] "File  2 ( en_US.news.txt ) size is equal to: 205811889"
## [1] "Is file  2 ( en_US.news.txt ) is a directory? No, it is not"
## [1] "The file number 2 ( en_US.news.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 2 ( en_US.news.txt ) was created is: 2014-07-22 11:13:04"
## [1] "The assumed time when the file number 2 ( en_US.news.txt ) was apploaded to the computer is: 2020-08-15 21:58:35"
## [1] "The estimated time when this file 2 ( en_US.news.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:42"
## [1] "Does the file  2 ( en_US.news.txt ) contain executable code? No, it doesn't."
## [1] "The name of the file number 3 is en_US.twitter.txt"
## [1] "File number  3 ( en_US.twitter.txt ) is twitter"
## [1] "File  3 ( en_US.twitter.txt ) size is equal to: 167105338"
## [1] "Is file  3 ( en_US.twitter.txt ) is a directory? No, it is not"
## [1] "The file number 3 ( en_US.twitter.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 3 ( en_US.twitter.txt ) was created is: 2014-07-22 11:12:58"
## [1] "The assumed time when the file number 3 ( en_US.twitter.txt ) was apploaded to the computer is: 2020-08-15 21:58:30"
## [1] "The estimated time when this file 3 ( en_US.twitter.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:35"
## [1] "Does the file  3 ( en_US.twitter.txt ) contain executable code? No, it doesn't."
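The permission code is printed as 666 here because `file.info()` returns the mode as an integer of class `octmode`; once the mode is coerced to a plain number or string it appears as decimal 438. The sketch below illustrates this on a throwaway file (an assumption for the demonstration, not one of the corpus files) and also pulls out the three timestamps.

```r
# A throwaway file stands in for the corpus files
demo_file <- tempfile(fileext = ".txt")
writeLines("hello", demo_file)
info <- file.info(demo_file)

# mode is stored as an integer of class "octmode", which prints in octal
mode_class <- class(info$mode)

# Octal 666 and decimal 438 describe the same permission bits
as_decimal <- as.integer(as.octmode("666"))
as_octal <- format(as.octmode(438L))

# mtime: last modification, ctime: last status change (creation time on Windows),
# atime: last access
times <- info[, c("mtime", "ctime", "atime")]

print(mode_class)
print(as_decimal)
print(as_octal)
print(times)
unlink(demo_file)
```

Using `format()` on the mode before storing it in a table keeps the octal representation.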
# We could obtain the name of each file in the dataset with this loop: 
dat = list()
for (x in 1:3) {
    dat[[x]] <- filenames[[x]]
}

### Obtain the size of each file in our dataset: 
dat = list()
for (x in 1:3) {
    dat[[x]] <- file_sizes[[x]]$size
}

library(data.table) # provides data.table(); assumed to be installed
dat = data.table(dat)
dat$filenames = filenames
dat$file_size = dat$dat
dat$dat = NULL
### New data - when the file was last modified (mtime), used here as a proxy for its creation time: 
pat = list()
for (x in 1:3) {
    pat[[x]] <- file_sizes[[x]]$mtime
}
pat = data.table(pat)
dat$file_created = pat$pat
rm(pat) # This object is no longer needed - free some memory 

#### My hyperloop - it tells everything about the files 
for (i in 1:3){
    
    a = paste("The name of the file number", i, "is", filenames[[i]], sep = " ")
    print(a)
    
    for (z in 1:3) {
    a[[z]] = filenames[[z]]
    }
    
    b = paste("File number ", i, "(", filenames[[i]], ")", "is", ifelse(i == 1, "blogs", ifelse(i == 2, "news", "twitter")), sep = " ")
    print(b)
    
    c1 = paste("File ", i, "(", filenames[[i]], ")", "size is equal to:", file_sizes[[i]]$size, sep = " ")
    print(c1)
    
    for (y in 1:3) {
    c1[[y]] = file_sizes[[y]]$size
    }
    
    d = paste("Is file ", i, "(", filenames[[i]], ")", "is a directory?", ifelse(file_sizes[[i]]$isdir == TRUE, "Yes. it is.", "No, it is not"), sep = " ")
    print(d)
    for (x in 1:3) {
    d[[x]] = file_sizes[[x]]$isdir
    }
    
    e = paste("The file number", i, "(", filenames[[i]], ")", "permissions code printed in the octal number is:", file_sizes[[i]]$mode, sep = " ")
    print(e)
    
    for (u in 1:3) {
    e[[u]] = file_sizes[[u]]$mode
    }
    
    f = paste("The supposed date and time when the original file number", i, "(", filenames[[i]], ")", "was created is:",  file_sizes[[i]]$mtime , sep = " ")
    print(f)
    
    for (v in 1:3) {
    f[[v]] = file_sizes[[v]]$mtime
    }
    
    g = paste("The assumed time when the file number", i, "(", filenames[[i]], ")", "was apploaded to the computer is:", file_sizes[[i]]$ctime, sep = " ")
    print(g)
    
    for (w in 1:3) {
    g[[w]] = file_sizes[[w]]$ctime
    }
    
    h = paste("The estimated time when this file", i, "(", filenames[[i]], ")", "was the first time loaded into the R global environment is:", file_sizes[[i]]$atime, sep = " ")
    print(h)
    
    for (r in 1:3) {
    # h1[[r]] = as.Date(file_sizes[[r]]$atime)
    h[[r]] = file_sizes[[r]]$atime
    
    }
    
    i2 = paste("Does the file ", i, "(", filenames[[i]], ")", "contain executable code?", ifelse(file_sizes[[i]]$exe == "no", "No, it doesn't.", "Yes, it does."), sep = " ")
    print(i2)

for (s in 1:3) {
    i2[[s]] = file_sizes[[s]]$exe
    }

a = data.table(a)
c1 = data.table(c1)
d = data.table(d)
e = data.table(e)
f = data.table(f)
g = data.table(g)
h = data.table(h)
i2 = data.table(i2)
my_data_frame = filenames
my_data_frame = data.table(my_data_frame)
my_data_frame$file_names = a
my_data_frame$file_size = c1
my_data_frame$file_size = as.numeric(my_data_frame$file_size)
my_data_frame$file_size_Mb = my_data_frame$file_size/(1024^2)

my_data_frame$if_directory = d
my_data_frame$octal_perm_code = e
my_data_frame$file_created = f 
my_data_frame$file_downloaded = g 
my_data_frame$file_recent_vers = h 

my_data_frame$file_created = as.numeric(my_data_frame$file_created)
my_data_frame$file_downloaded = as.numeric(my_data_frame$file_downloaded)
my_data_frame$file_recent_vers = as.numeric(my_data_frame$file_recent_vers)

### In case if proper date and time processing package is not installed: 
install_and_load = function(name, char = T)
{
  if (!require(name, character.only = char)) 
  {
    install.packages(name)
  }
  require(name, character.only = char)
}
sapply(
  c("anytime"),
  install_and_load
)
rm(install_and_load)
### End of the package call

my_data_frame$file_created = anytime::anytime(my_data_frame$file_created)
my_data_frame$file_downloaded = anytime::anytime(my_data_frame$file_downloaded)
my_data_frame$file_recent_vers = anytime::anytime(my_data_frame$file_recent_vers)

my_data_frame$if_exe = i2 

m1 = paste("In addition, a dataframe is created that contains all the data about each file.", "The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to:", i, sep = " ")
print(m1)
rm(a, c1, d, e, f, g, h, i2, m1)


}
## [1] "The name of the file number 1 is en_US.blogs.txt"
## [1] "File number  1 ( en_US.blogs.txt ) is blogs"
## [1] "File  1 ( en_US.blogs.txt ) size is equal to: 210160014"
## [1] "Is file  1 ( en_US.blogs.txt ) is a directory? No, it is not"
## [1] "The file number 1 ( en_US.blogs.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 1 ( en_US.blogs.txt ) was created is: 2014-07-22 11:13:05"
## [1] "The assumed time when the file number 1 ( en_US.blogs.txt ) was apploaded to the computer is: 2020-08-15 21:58:42"
## [1] "The estimated time when this file 1 ( en_US.blogs.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:49"
## [1] "Does the file  1 ( en_US.blogs.txt ) contain executable code? No, it doesn't."
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 1"
## [1] "The name of the file number 2 is en_US.news.txt"
## [1] "File number  2 ( en_US.news.txt ) is news"
## [1] "File  2 ( en_US.news.txt ) size is equal to: 205811889"
## [1] "Is file  2 ( en_US.news.txt ) is a directory? No, it is not"
## [1] "The file number 2 ( en_US.news.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 2 ( en_US.news.txt ) was created is: 2014-07-22 11:13:04"
## [1] "The assumed time when the file number 2 ( en_US.news.txt ) was apploaded to the computer is: 2020-08-15 21:58:35"
## [1] "The estimated time when this file 2 ( en_US.news.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:42"
## [1] "Does the file  2 ( en_US.news.txt ) contain executable code? No, it doesn't."
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 2"
## [1] "The name of the file number 3 is en_US.twitter.txt"
## [1] "File number  3 ( en_US.twitter.txt ) is twitter"
## [1] "File  3 ( en_US.twitter.txt ) size is equal to: 167105338"
## [1] "Is file  3 ( en_US.twitter.txt ) is a directory? No, it is not"
## [1] "The file number 3 ( en_US.twitter.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 3 ( en_US.twitter.txt ) was created is: 2014-07-22 11:12:58"
## [1] "The assumed time when the file number 3 ( en_US.twitter.txt ) was apploaded to the computer is: 2020-08-15 21:58:30"
## [1] "The estimated time when this file 3 ( en_US.twitter.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:35"
## [1] "Does the file  3 ( en_US.twitter.txt ) contain executable code? No, it doesn't."
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 3"
###### Full loop: the same steps as above, generalised from 3 files to n = length(filenames) files 
n = length(filenames)
for (i in 1:n){
    
    a = paste("The name of the file number", i, "is", filenames[[i]], sep = " ")
    print(a)
    
    for (z in 1:n) {
    a[[z]] = filenames[[z]]
    }
    
    b = paste("File number ", i, "(", filenames[[i]], ")", "is", ifelse(i == 1, "blogs", ifelse(i == 2, "news", "twitter")), sep = " ")
    print(b)
    
    c1 = paste("File ", i, "(", filenames[[i]], ")", "size is equal to:", file_sizes[[i]]$size, sep = " ")
    print(c1)
    
    for (y in 1:n) {
    c1[[y]] = file_sizes[[y]]$size
    }
    
    d = paste("Is file ", i, "(", filenames[[i]], ")", "is a directory?", ifelse(file_sizes[[i]]$isdir == TRUE, "Yes. it is.", "No, it is not"), sep = " ")
    print(d)
    for (x in 1:n) {
    d[[x]] = file_sizes[[x]]$isdir
    }
    
    e = paste("The file number", i, "(", filenames[[i]], ")", "permissions code printed in the octal number is:", file_sizes[[i]]$mode, sep = " ")
    print(e)
    
    for (u in 1:n) {
    e[[u]] = file_sizes[[u]]$mode
    }
    
    f = paste("The supposed date and time when the original file number", i, "(", filenames[[i]], ")", "was created is:",  file_sizes[[i]]$mtime , sep = " ")
    print(f)
    
    for (v in 1:n) {
    f[[v]] = file_sizes[[v]]$mtime
    }
    
    g = paste("The assumed time when the file number", i, "(", filenames[[i]], ")", "was apploaded to the computer is:", file_sizes[[i]]$ctime, sep = " ")
    print(g)
    
    for (w in 1:n) {
    g[[w]] = file_sizes[[w]]$ctime
    }
    
    h = paste("The estimated time when this file", i, "(", filenames[[i]], ")", "was the first time loaded into the R global environment is:", file_sizes[[i]]$atime, sep = " ")
    print(h)
    
    for (r in 1:n) {
    # h1[[r]] = as.Date(file_sizes[[r]]$atime)
    h[[r]] = file_sizes[[r]]$atime
    
    }
    
    i2 = paste("Does the file ", i, "(", filenames[[i]], ")", "contain executable code?", ifelse(file_sizes[[i]]$exe == "no", "No, it doesn't.", "Yes, it does."), sep = " ")
    print(i2)

for (s in 1:n) {
    i2[[s]] = file_sizes[[s]]$exe
    }
# Preparing the data 
a = data.table(a)
c1 = data.table(c1)
d = data.table(d)
e = data.table(e)
f = data.table(f)
g = data.table(g)
h = data.table(h)
i2 = data.table(i2)
# Creating the dataframe we could work with:
my_data_frame = filenames
my_data_frame = data.table(my_data_frame)
my_data_frame$file_names = a
my_data_frame$file_size = c1
my_data_frame$file_size = as.numeric(my_data_frame$file_size)
my_data_frame$file_size_Mb = my_data_frame$file_size/(1024^2)

my_data_frame$if_directory = d
my_data_frame$octal_perm_code = e
my_data_frame$file_created = f 
my_data_frame$file_downloaded = g 
my_data_frame$file_recent_vers = h 

my_data_frame$file_created = as.numeric(my_data_frame$file_created)
my_data_frame$file_downloaded = as.numeric(my_data_frame$file_downloaded)
my_data_frame$file_recent_vers = as.numeric(my_data_frame$file_recent_vers)

### In case if proper date and time processing package is not installed: 
install_and_load = function(name, char = T)
{
  if (!require(name, character.only = char)) 
  {
    install.packages(name)
  }
  require(name, character.only = char)
}
sapply(
  c("anytime"),
  install_and_load
)
rm(install_and_load)
### End of the package call

my_data_frame$file_created = anytime::anytime(my_data_frame$file_created)
my_data_frame$file_downloaded = anytime::anytime(my_data_frame$file_downloaded)
my_data_frame$file_recent_vers = anytime::anytime(my_data_frame$file_recent_vers)

my_data_frame$if_exe = i2 

m1 = paste("In addition, a dataframe is created that contains all the data about each file.", "The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to:", i, sep = " ")
print(m1)
rm(a, c1, d, e, f, g, h, i2, m1)


}
## [1] "The name of the file number 1 is en_US.blogs.txt"
## [1] "File number  1 ( en_US.blogs.txt ) is blogs"
## [1] "File  1 ( en_US.blogs.txt ) size is equal to: 210160014"
## [1] "Is file  1 ( en_US.blogs.txt ) is a directory? No, it is not"
## [1] "The file number 1 ( en_US.blogs.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 1 ( en_US.blogs.txt ) was created is: 2014-07-22 11:13:05"
## [1] "The assumed time when the file number 1 ( en_US.blogs.txt ) was apploaded to the computer is: 2020-08-15 21:58:42"
## [1] "The estimated time when this file 1 ( en_US.blogs.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:49"
## [1] "Does the file  1 ( en_US.blogs.txt ) contain executable code? No, it doesn't."
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 1"
## [1] "The name of the file number 2 is en_US.news.txt"
## [1] "File number  2 ( en_US.news.txt ) is news"
## [1] "File  2 ( en_US.news.txt ) size is equal to: 205811889"
## [1] "Is file  2 ( en_US.news.txt ) is a directory? No, it is not"
## [1] "The file number 2 ( en_US.news.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 2 ( en_US.news.txt ) was created is: 2014-07-22 11:13:04"
## [1] "The assumed time when the file number 2 ( en_US.news.txt ) was apploaded to the computer is: 2020-08-15 21:58:35"
## [1] "The estimated time when this file 2 ( en_US.news.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:42"
## [1] "Does the file  2 ( en_US.news.txt ) contain executable code? No, it doesn't."
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 2"
## [1] "The name of the file number 3 is en_US.twitter.txt"
## [1] "File number  3 ( en_US.twitter.txt ) is twitter"
## [1] "File  3 ( en_US.twitter.txt ) size is equal to: 167105338"
## [1] "Is file  3 ( en_US.twitter.txt ) is a directory? No, it is not"
## [1] "The file number 3 ( en_US.twitter.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 3 ( en_US.twitter.txt ) was created is: 2014-07-22 11:12:58"
## [1] "The assumed time when the file number 3 ( en_US.twitter.txt ) was apploaded to the computer is: 2020-08-15 21:58:30"
## [1] "The estimated time when this file 3 ( en_US.twitter.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:35"
## [1] "Does the file  3 ( en_US.twitter.txt ) contain executable code? No, it doesn't."
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 3"
my_data_frame
##        my_data_frame        file_names file_size file_size_Mb if_directory
## 1:   en_US.blogs.txt   en_US.blogs.txt 210160014     200.4242        FALSE
## 2:    en_US.news.txt    en_US.news.txt 205811889     196.2775        FALSE
## 3: en_US.twitter.txt en_US.twitter.txt 167105338     159.3641        FALSE
##    octal_perm_code        file_created     file_downloaded    file_recent_vers
## 1:             438 2014-07-22 11:13:05 2020-08-15 21:58:42 2020-08-15 21:58:49
## 2:             438 2014-07-22 11:13:04 2020-08-15 21:58:35 2020-08-15 21:58:42
## 3:             438 2014-07-22 11:12:58 2020-08-15 21:58:30 2020-08-15 21:58:35
##    if_exe
## 1:     no
## 2:     no
## 3:     no
## Making the output more fancy: 
knitr::kable(my_data_frame, caption = "The main data about the files")
The main data about the files

| my_data_frame | file_names | file_size | file_size_Mb | if_directory | octal_perm_code | file_created | file_downloaded | file_recent_vers | if_exe |
|---|---|---|---|---|---|---|---|---|---|
| en_US.blogs.txt | en_US.blogs.txt | 210160014 | 200.4242 | FALSE | 438 | 2014-07-22 11:13:05 | 2020-08-15 21:58:42 | 2020-08-15 21:58:49 | no |
| en_US.news.txt | en_US.news.txt | 205811889 | 196.2775 | FALSE | 438 | 2014-07-22 11:13:04 | 2020-08-15 21:58:35 | 2020-08-15 21:58:42 | no |
| en_US.twitter.txt | en_US.twitter.txt | 167105338 | 159.3641 | FALSE | 438 | 2014-07-22 11:12:58 | 2020-08-15 21:58:30 | 2020-08-15 21:58:35 | no |
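A table like the one above can also be assembled without loops, because `file.info()` accepts a whole vector of paths. The following is only a sketch under stated assumptions: the folder and files are temporary stand-ins, the column names are illustrative, and a base `data.frame` is used instead of `data.table` to keep the example self-contained.

```r
# Temporary stand-ins for the corpus files
demo_dir <- tempfile("files_demo_")
dir.create(demo_dir)
writeLines(c("one two", "three"), file.path(demo_dir, "a.txt"))
writeLines("four", file.path(demo_dir, "b.txt"))

paths <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# One call returns one row of metadata per file
info <- file.info(paths)

file_table <- data.frame(
  file_name       = basename(paths),
  file_size       = info$size,
  file_size_Mb    = info$size / 1024^2,
  is_directory    = info$isdir,
  octal_perm_code = format(info$mode),  # keeps the octal form, e.g. "666"
  modified        = info$mtime,
  status_changed  = info$ctime,
  accessed        = info$atime,
  stringsAsFactors = FALSE
)

print(file_table)
```

The timestamps stay in `POSIXct` form throughout, so no conversion to numbers and back is needed.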
#### Line counts and word counts: ldf
n = length(ldf) #3

### Counter for lines 
bat = list()
for (x in 1:n) {
    bat[[x]] <- length(ldf[[x]])
}

#### Counter for words in each line of each file (slow on the full data), 
#### computed once per file and reused for the totals and the means 
word_counts = list()
for (x in 1:n) {
    word_counts[[x]] <- stringi::stri_count_words(ldf[[x]])
}

#### Total number of words in each file 
pat = list()
for (x in 1:n) {
    pat[[x]] <- sum(word_counts[[x]])
}

#### Mean number of words per line of text in each file 
fat = list()
for (x in 1:n) {
    fat[[x]] <- mean(word_counts[[x]])
}
rm(word_counts)

my_descriptive_stat = filenames
my_descriptive_stat = data.table(my_descriptive_stat)
bat = data.table(bat)
pat = data.table(pat)
fat = data.table(fat)
my_descriptive_stat$lines_count = bat$bat
my_descriptive_stat$words_count = pat$pat
my_descriptive_stat$mean_words_per_line = fat$fat
my_descriptive_stat #file name, Lines count, words count, mean words per line 
##    my_descriptive_stat lines_count words_count mean_words_per_line
## 1:     en_US.blogs.txt      899288    38256701             42.5411
## 2:      en_US.news.txt       77259     2697319            34.91268
## 3:   en_US.twitter.txt     2360148    30249390            12.81673
## Making the output more fancy: 
knitr::kable(my_descriptive_stat, caption = "The main summary statistics about the files")
The main summary statistics about the files

| my_descriptive_stat | lines_count | words_count | mean_words_per_line |
|---|---|---|---|
| en_US.blogs.txt | 899288 | 38256701 | 42.541100292676 |
| en_US.news.txt | 77259 | 2697319 | 34.9126833119766 |
| en_US.twitter.txt | 2360148 | 30249390 | 12.816734374285 |
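The same three statistics can be collected into ordinary numeric columns with `vapply()`. This is a rough sketch on a tiny made-up list of texts; note that it swaps `stringi::stri_count_words()` for a simple whitespace split in base R, so its counts may differ slightly from the figures above on real data.

```r
# Tiny stand-in for the list of files read earlier
texts <- list(
  blogs   = c("We love you Mr. Brown.", "No sleep in my near future."),
  twitter = c("So tired")
)

# Approximate word counter: splits each line on runs of whitespace
count_words <- function(x) lengths(strsplit(trimws(x), "\\s+"))

stats <- data.frame(
  source              = names(texts),
  lines_count         = unname(vapply(texts, length, integer(1))),
  words_count         = unname(vapply(texts, function(x) as.numeric(sum(count_words(x))), numeric(1))),
  mean_words_per_line = unname(vapply(texts, function(x) mean(count_words(x)), numeric(1))),
  stringsAsFactors = FALSE
)

print(stats)
```

With atomic columns the table prints and sorts like any other data frame, and `knitr::kable()` rounds the means consistently.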
rm(dat, bat, pat, fat) 

# Dataset Cleaning: Data Preprocessing

### Sampling the data: 
set.seed(20200826)

# tat = sample(ldf[[1]], length(ldf[[1]])*(5/100))

for (i in 1:n){

#### Draw a 5% random sample of lines from each file. Each pass of the outer loop 
#### redraws the samples, so the draw from the last pass is the one that is kept. 
    tat = list()
    
    for (x in 1:n) {
    
    tat[[x]] <-  sample(ldf[[x]], length(ldf[[x]])*(5/100)) #5% sampling - can be any figure 
        
    }
    
}
summary(tat)
##      Length Class  Mode     
## [1,]  44964 -none- character
## [2,]   3862 -none- character
## [3,] 118007 -none- character
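The sample sizes above are the 5% shares with the fractional part dropped (for example, 5% of 77259 lines is 3862.95, and 3862 lines are kept). The sketch below makes that truncation explicit with `floor()` and shows that resetting the seed reproduces the same draw; the vector of lines is a made-up stand-in with the same length as the news file.

```r
# Hypothetical stand-in with as many lines as the news file
lines_demo <- paste("line", 1:77259)
share <- 5 / 100

# Explicit sample size: the fractional part is dropped
n_keep <- floor(length(lines_demo) * share)

set.seed(20200826)
first_draw <- sample(lines_demo, n_keep)

# Resetting the seed before drawing again reproduces the same sample
set.seed(20200826)
second_draw <- sample(lines_demo, n_keep)

print(n_keep)
print(identical(first_draw, second_draw))
```

Setting the seed immediately before each draw is what makes a single draw reproducible on its own.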
tat = data.table(tat)
### Giving the row names as data file names: 
myDF <- cbind(Row_Names = filenames, tat) #2 columns: tat ++ Row_Names
str(myDF) #Cool 
## Classes 'data.table' and 'data.frame':   3 obs. of  2 variables:
##  $ Row_Names: chr  "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
##  $ tat      :List of 3
##   ..$ : chr  "EUGENIO ANGELES, Fiscal of City of Manila, respondent." "Joseph doesnâ\200\231t reveal himself to his brothers, although he recognizes them immediately. For reasons whi"| __truncated__ "â\200œIcon is a company that loves to see others striving to be different and the TTXGP is just that,â\200\235 "| __truncated__ "How to Indy Crawl(and get a sweet limited run pint glass) :" ...
##   ..$ : chr  "3. By calling 404-330-6309." "Spring Wildflower Walk: Meet at the Lexington entrance. Look for trillium, blue and yellow violets, spring beau"| __truncated__ "The owner of the Arm & Hammer brand of personal care and household products said it has outgrown its Princeton "| __truncated__ "The 89-year-old national program recognizes outstanding creative teens and offers scholarship opportunities for"| __truncated__ ...
##   ..$ : chr  "Damn u save those text from 2011? Thristy!!!!" "I ran the whole jello carbomb scenario by Kersi and he is onboard. Let's do this!" "No sleep in my near future." "Yea I think my theory is right, y'all ladies is nuts from the womb to the tomb" ...
##  - attr(*, ".internal.selfref")=<externalptr>
# astr <- "Ábcdêãçoàúü"
# iconv(astr, from = 'UTF-8', to = 'ASCII//TRANSLIT')
# bat <- iconv(myDF$tat[[1]], from = 'UTF-8', to = 'ASCII//TRANSLIT')
n = length(ldf) #3
#### Transliterate non-ASCII characters (such as Ábcdêãçoàúü) to their closest ASCII equivalents 
bat = list()
for (x in 1:n) {
    bat[[x]] <-  iconv(myDF$tat[[x]], from = 'UTF-8', to = 'ASCII//TRANSLIT')
}
bat = data.table(bat)
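The result of `ASCII//TRANSLIT` depends on the iconv implementation of the platform, so the replacement characters may differ between machines. As a small sketch on made-up strings, the transliteration can be compared with the alternative of simply dropping every byte that has no ASCII equivalent.

```r
# Made-up examples containing an accented letter and curly quotes
raw_text <- c("caf\u00e9", "\u201cgods\u201d")

# Transliteration: the exact replacement characters are platform-dependent
translit <- iconv(raw_text, from = "UTF-8", to = "ASCII//TRANSLIT")

# Alternative: drop every byte that cannot be represented in ASCII
dropped <- iconv(raw_text, from = "UTF-8", to = "ASCII", sub = "")

print(translit)
print(dropped)
```

Dropping is more predictable but loses letters such as the final "e" of "café", so the choice depends on how much the later word counts should tolerate such losses.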




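### Creating the corpus for the descriptive statistics and the word cloud: 
library(tm) # provides VCorpus(), VectorSource() and stopwords(); assumed to be installed
# corpus <- VCorpus(VectorSource(myDF$tat))
corpus <- VCorpus(VectorSource(bat))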
print(corpus)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1
rm(bat)
# print(corpus)
str(corpus)
## List of 1
##  $ 1:List of 2
##   ..$ content: chr [1:3] "c(\"EUGENIO ANGELES, Fiscal of City of Manila, respondent.\", \"Joseph doesn't reveal himself to his brothers, "| __truncated__ "c(\"3. By calling 404-330-6309.\", \"Spring Wildflower Walk: Meet at the Lexington entrance. Look for trillium,"| __truncated__ "c(\"Damn u save those text from 2011? Thristy!!!!\", \"I ran the whole jello carbomb scenario by Kersi and he i"| __truncated__
##   ..$ meta   :List of 7
##   .. ..$ author       : chr(0) 
##   .. ..$ datetimestamp: POSIXlt[1:1], format: "2020-08-29 14:36:58"
##   .. ..$ description  : chr(0) 
##   .. ..$ heading      : chr(0) 
##   .. ..$ id           : chr "1"
##   .. ..$ language     : chr "en"
##   .. ..$ origin       : chr(0) 
##   .. ..- attr(*, "class")= chr "TextDocumentMeta"
##   ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
##  - attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
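The corpus above reports a single document whose content strings start with `c("...`. That pattern resembles what base R produces when a list of character vectors is coerced to strings, as the sketch below shows on made-up data. Collapsing each sample into one string first is one possible way to keep the plain text; passing such a vector to `VectorSource()` would presumably give one document per file, but that step is an assumption and is not run here.

```r
# Made-up stand-in for the list of sampled lines per file
samples <- list(
  blogs = c("We love you Mr. Brown.", "No sleep."),
  news  = "He wasn't home alone."
)

# Coercing a list of character vectors to strings deparses each element
deparsed <- as.character(samples)

# Collapsing each element explicitly keeps the text itself, one string per file
collapsed <- vapply(samples, paste, character(1), collapse = " ")

print(deparsed)
print(collapsed)
```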
### List standard English stop words
stopwords("en")
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"
### Extend the standard list with words and phrases that are frequent in this corpus but carry little meaning: 
updatedStopwords <- c(stopwords('en'), "could", "can", "will", "would", "want", "wanna", "just", "like", "can't", "lot", "looking", 
"as well as", "thank you for", "looking forward", "i want to", "one of the", "a lot", "a lot of", "of", "thanks for the", "it was a",
"the end of", "thanks for the rt", "for the first time", "at the same time", "is going to be", "if you want to", 
"thank you so much", "i am going to", "can't wait to see", "going to be", "is one of the", "if you want", 
"in the middle of the", "at the end of the", "by the end of the", "hope you have a great", "thanks so much", "i have no idea", "what",
"it is possible", "get", "good", "thank", "come", "even", "one", "got", "said", "much", "also", "“", "”") # the curly quotes “ and ” show up in the text as stray symbols, so they are listed too
### Standard text cleaning: 
# Helper that replaces every match of a regular expression with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# The cleaning steps below are the usual preparation for a word cloud
corpus <- tm_map(corpus, toSpace, "(f|ht)tps?://[^[:space:]]+") # URLs: match only up to the next whitespace
corpus <- tm_map(corpus, toSpace, "@[^[:space:]]+") # Twitter handles
corpus <- tm_map(corpus, content_transformer(tolower)) # wrap tolower so the documents keep their class
corpus <- tm_map(corpus, removeWords, updatedStopwords)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stemDocument) # Stemming: cut common word endings (e.g. "people" becomes "peopl")
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
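The `removeWords` step above receives multi-word phrases as well as single words. To show why both can sit in one stop list, here is a minimal base R sketch of phrase removal with word boundaries. The helper `remove_phrases` is a hypothetical name used for illustration; it approximates the idea and is not tm's implementation. It assumes lowercase input and phrases without regular-expression metacharacters.

```r
# Hypothetical helper: delete whole words or phrases from a lowercase string.
# Assumes the phrases contain no regex metacharacters.
remove_phrases <- function(text, phrases) {
  # Try the longest phrases first, so "one of the" wins over "one"
  phrases <- phrases[order(nchar(phrases), decreasing = TRUE)]
  # \\b stops "one" from matching inside "money"
  pattern <- sprintf("\\b(%s)\\b", paste(phrases, collapse = "|"))
  cleaned <- gsub(pattern, " ", text, perl = TRUE)
  # Collapse the gaps left behind by the removed phrases
  trimws(gsub("\\s+", " ", cleaned))
}

result <- remove_phrases("thank you so much for one of the best money tips",
                         c("thank you so much", "one of the", "one"))
result  # "for best money tips"
```

The word "money" survives because the boundary check only matches "one" as a whole word.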
strange_symbols = c("ú","¨", "ů", "§", "ě", "š", "č", "ř", "ž", "ý", "á", "í", "é", "´", "$", "#", "&", "^", "%", "@", "€", 
"‰", "†", "…", "“", "ś", "ť", "ž", "ź", "Ł", "©", "ş", "±", "µ", "ą", "ä", "ĺ", "ć", "ç", "ę", "ë", "ě", "î", "ď", "đ", 
"ń", "ň", "ó", "ô", "ö") # For English text it is fine to remove all of these
corpus <- tm_map(corpus, toSpace, strange_symbols)
## Warning in gsub(pattern, " ", x): argument 'pattern' has length > 1 and only the
## first element will be used
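The warning above means that `gsub` received the whole `strange_symbols` vector as its pattern and used only the first element, so only "ú" was replaced. Below is a minimal sketch of one way to apply every symbol, assuming literal (non-regex) matching is wanted. `strip_symbols` is a hypothetical helper, not part of tm.

```r
# Hypothetical helper: replace each symbol in `symbols` with a space.
strip_symbols <- function(text, symbols) {
  for (s in symbols) {
    # fixed = TRUE treats "$", "^" and similar characters literally
    text <- gsub(s, " ", text, fixed = TRUE)
  }
  text
}

strip_symbols("a$b^c", c("$", "^"))  # "a b c"
```

Wrapped as `content_transformer(strip_symbols)`, the helper could be passed to `tm_map` with `strange_symbols` as the extra argument, in the same way as `toSpace` above.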
print(corpus)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1
# Word cloud plot of the most common words in the corpus
potential_word_matrix <- TermDocumentMatrix(corpus)
potential_word_matrix <- as.matrix(potential_word_matrix)
word_frequences <- sort(rowSums(potential_word_matrix), decreasing=TRUE) 
dm <- data.frame(word=names(word_frequences), freq=word_frequences)
wordcloud(dm$word, dm$freq, max.words = 50, random.order=FALSE, rot.per=.25, colors=brewer.pal(8, "Dark2"))

#bat <- list(bat)
#toks_filtered <- tokens(bat)
#dfm_text <- dfm(toks_filtered)
# corpus$content[[1]]
# Convert the cleaned corpus to a quanteda DFM with a workaround: pass the corpus content as a character vector, because dfm() may not recognise the tm corpus object directly 
dfm_text <- dfm(as.character(corpus$content))


# Main summary statistics of the text that is split into unigrams, bigrams, trigrams and longer n-grams in the plots below:
text_sum <- textstat_summary(dfm_text)
text_sum
##   document chars sents tokens types puncts numbers symbols urls tags emojis
## 1    text1    NA    NA 128191 17859     43      15      18    0    0      0
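In the summary above, `tokens` counts every word occurrence and `types` counts distinct words. A small base R sketch of the two numbers on a toy sentence (the sentence is made up for illustration):

```r
# Toy text; split on whitespace to get the word tokens
words <- strsplit("go day go love day go", "\\s+")[[1]]

n_tokens <- length(words)          # every occurrence counts: 6
n_types  <- length(unique(words))  # distinct words only: 3

# Frequency table sorted from most to least frequent
word_counts <- sort(table(words), decreasing = TRUE)
word_counts
```

For the corpus above, 17859 types for 128191 tokens means that each stemmed word is used about 7 times on average.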
# Standard way to build the unigram plot: 
my_swf_chart <- dfm_text %>% # Take the data from the DFM 
  textstat_frequency(n = 41) %>% # Keep the 41 most frequent unigrams 
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) + # Words ordered by frequency 
    geom_point() + # Scatter plot of points 
    coord_flip() + # Flip the axes so the words are listed vertically and stay readable 
    labs(x = NULL, y = "Words Frequency") + # Axis labels 
    ggtitle("Single Word Frequency (SWF)") + # Title 
    theme_light() # Light plot theme 
my_swf_chart # Draw the chart 

# Compute the coverage statistics over all words
my_coverage_stat <- textstat_frequency(dfm_text, n = nfeat(dfm_text))
my_coverage_stat$index <- seq_len(nrow(my_coverage_stat))
my_coverage_stat$relative_frequency <- my_coverage_stat$frequency / sum(my_coverage_stat$frequency)
my_coverage_stat$my_coverage_stat <- cumsum(my_coverage_stat$relative_frequency)
str(my_coverage_stat)
## Classes 'frequency', 'textstat' and 'data.frame':    17859 obs. of  8 variables:
##  $ feature           : chr  "go" "love" "day" "know" ...
##  $ frequency         : num  902 889 735 650 627 624 623 557 541 534 ...
##  $ rank              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ docfreq           : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ group             : chr  "all" "all" "all" "all" ...
##  $ index             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ relative_frequency: num  0.00704 0.00693 0.00573 0.00507 0.00489 ...
##  $ my_coverage_stat  : num  0.00704 0.01397 0.0197 0.02478 0.02967 ...
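The cumulative column answers questions such as "how many distinct words are needed to cover half of all word occurrences?". A base R sketch on made-up counts follows; `words_needed` is a hypothetical helper, not part of quanteda.

```r
# Made-up frequencies, already sorted from most to least frequent
frequency <- c(40, 25, 15, 12, 5, 3)

relative_frequency <- frequency / sum(frequency)
coverage <- cumsum(relative_frequency)  # roughly 0.40 0.65 0.80 0.92 0.97 1.00

# Hypothetical helper: smallest number of top words that reaches the target share
words_needed <- function(coverage, target) {
  which(coverage >= target)[1]
}

words_needed(coverage, 0.5)  # 2 words cover at least 50%
words_needed(coverage, 0.9)  # 4 words cover at least 90%
```

Applied to `my_coverage_stat$my_coverage_stat`, the same helper would return the vocabulary size needed for a chosen share of the corpus.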
### Top 50 most popular words and the percentage of all word occurrences each accounts for
r <- data.table(most_popular_words = my_coverage_stat$feature[1:50],
                how_often_used_percents = 100 * my_coverage_stat$relative_frequency[1:50])
r
##     most_popular_words how_often_used_percents
##  1:                 go               0.7036375
##  2:               love               0.6934964
##  3:                day               0.5733632
##  4:               know               0.5070559
##  5:               time               0.4891139
##  6:               make               0.4867736
##  7:                now               0.4859936
##  8:                 rt               0.4345079
##  9:                new               0.4220265
## 10:              great               0.4165659
## 11:              think               0.4103252
## 12:                see               0.4079850
## 13:              today               0.4072049
## 14:             follow               0.4001841
## 15:                  u               0.3955036
## 16:               work               0.3830222
## 17:               need               0.3666404
## 18:                lol               0.3635201
## 19:              peopl               0.3416777
## 20:               year               0.3190552
## 21:                 us               0.3120344
## 22:                say               0.3065738
## 23:               back               0.3003331
## 24:              thank               0.2917522
## 25:              thing               0.2753703
## 26:              right               0.2706898
## 27:               look               0.2605487
## 28:              night               0.2597686
## 29:               take               0.2488474
## 30:              happi               0.2488474
## 31:             realli               0.2465072
## 32:               last               0.2433868
## 33:               game               0.2394864
## 34:                way               0.2348059
## 35:               hope               0.2348059
## 36:               play               0.2270050
## 37:            tonight               0.2270050
## 38:              start               0.2246648
## 39:               show               0.2231046
## 40:               week               0.2231046
## 41:              never               0.2184241
## 42:              still               0.2137436
## 43:               feel               0.2137436
## 44:               best               0.2129635
## 45:                use               0.2129635
## 46:              watch               0.2067228
## 47:                get               0.2043825
## 48:              first               0.2028224
## 49:               well               0.1997020
## 50:             friend               0.1981418
##     most_popular_words how_often_used_percents
rm(r)
# Plot my_coverage_stat
my_coverage_plot <- my_coverage_stat %>% 
  ggplot(aes(x = index, y = my_coverage_stat)) +
  geom_point() +
  labs(x = "Number Of Unique Words", y = "Share Of Used Language Covered By This Number Of Words") +
  ggtitle("Coverage Statistics") +
  theme_light() # Light plot theme
my_coverage_plot

# Plot coverage on log scale
my_coverage_plot <- my_coverage_plot + scale_x_log10() + scale_y_log10() + ggtitle("Coverage Statistics - log10 scale")
my_coverage_plot

#### Bigram creation: 
tokenized_text = quanteda::tokens(as.character(corpus$content), what = "word1")
tokenized_bigrams = tokens_ngrams(tokenized_text, n = 2)

aschr = quanteda::dfm(tokenized_bigrams) # the tokens are already bigrams, so dfm() needs no ngrams argument
topFeatures <- topfeatures(aschr, 20)
df1 <- data.frame(word=names(topFeatures), count=topFeatures)
ggplot(df1, aes(x=reorder(word, count), y=count, fill = word)) + geom_bar(stat="identity") + coord_flip() + xlab('Bigram') + ylab('Count') + ggtitle('Most Common Bigrams Excluding Stop Words') + theme(legend.position = "none")

rm(df1, aschr, tokenized_bigrams) #To free the memory 
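For intuition, `tokens_ngrams` slides a window of `n` words over each token sequence and joins the words in each window. A base R sketch of that idea follows; `make_ngrams` is a hypothetical helper, and the "_" separator is chosen here to resemble quanteda's usual output.

```r
# Hypothetical helper: build n-grams from a vector of words.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  # Join the n words that start at each position
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = "_"),
         character(1))
}

words <- c("thank", "you", "so", "much", "thank", "you")
bigrams <- make_ngrams(words, 2)
bigrams  # "thank_you" "you_so" "so_much" "much_thank" "thank_you"

# Counting the n-grams gives the frequencies that the bar charts display
sort(table(bigrams), decreasing = TRUE)
```

The same helper with `n = 3`, `4` or `5` gives the longer sequences plotted in the sections below.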


### Trigrams: 
# tokenized_text from the bigram step is reused here
tokenized_trigrams = tokens_ngrams(tokenized_text, n = 3)

aschr = quanteda::dfm(tokenized_trigrams)

topFeatures <- topfeatures(aschr, 20)
df1 <- data.frame(word=names(topFeatures), count=topFeatures)
ggplot(df1, aes(x=reorder(word, count), y=count, fill = word)) + geom_bar(stat="identity") + theme_light() + coord_flip() + xlab('Trigram') + ylab('Count') + ggtitle('Most Common Trigrams Excluding Stop Words') + theme(legend.position = "none") 

rm(df1, aschr, tokenized_trigrams) #To free the memory 

### Four-grams: 
tokenized_fourgrams = tokens_ngrams(tokenized_text, n = 4)
aschr = quanteda::dfm(tokenized_fourgrams)
topFeatures <- topfeatures(aschr, 20)
df1 <- data.frame(word=names(topFeatures), count=topFeatures)
ggplot(df1, aes(x=reorder(word, count), y=count, fill = word)) + geom_bar(stat="identity") + theme_light() + coord_flip() + xlab('Four-gram') + ylab('Count') + ggtitle('Most Common Four-Grams Excluding Stop Words') + theme(legend.position = "none") 

rm(df1, aschr, tokenized_fourgrams) #To free the memory 

### Five-grams: 
tokenized_fivegrams = tokens_ngrams(tokenized_text, n = 5)
aschr = quanteda::dfm(tokenized_fivegrams)
topFeatures <- topfeatures(aschr, 20)
df1 <- data.frame(word=names(topFeatures), count=topFeatures)
ggplot(df1, aes(x=reorder(word, count), y=count, fill = word)) + geom_bar(stat="identity") + theme_light() + coord_flip() + xlab('Five-gram') + ylab('Count') + ggtitle('Most Common Five-Grams Excluding Stop Words \n Top-20 Word Sequences Used Together In One Message') + theme(legend.position = "none") 

# Note: the word endings are cut off because the corpus was stemmed 
rm(df1, aschr, tokenized_fivegrams) #To free the memory 

Findings

The word clouds and n-gram frequency plots helped us visualize the statistics, so we can now move on to the final project on word pattern recognition.

Further steps

The plan is to take the most common word patterns found here and use them with machine-learning methods that recognise those patterns while a text string is being entered, before the user has finished typing it.

This could be of great benefit to the final product, which should then be suitable for the final assignment.
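One possible shape of that prediction step is sketched below in base R, under strong assumptions: the bigram counts are made up, the helper `predict_next` is hypothetical, and a real model would need smoothing or back-off between longer and shorter n-grams. This is an illustration of the idea, not the method chosen for the final product.

```r
# Made-up bigram counts, named in the "first_second" form
bigram_counts <- c(thank_you = 40, thank_god = 5, you_so = 12,
                   so_much = 30, so_good = 8)

# Hypothetical helper: the k most frequent words seen after `word`
predict_next <- function(word, counts, k = 2) {
  parts  <- strsplit(names(counts), "_", fixed = TRUE)
  first  <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)
  hits <- counts[first == word]
  names(hits) <- second[first == word]
  names(head(sort(hits, decreasing = TRUE), k))
}

predict_next("so", bigram_counts)     # "much" "good"
predict_next("thank", bigram_counts)  # "you" "god"
```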

The next parts serve as appendices to the analysis

The Russian part of the analysis


## [1] "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
##  chr [1:3] "ru_RU.blogs.txt" "ru_RU.news.txt" "ru_RU.twitter.txt"
##  chr [1:12] "C:\\Users\\Alex\\Documents\\COURSERA STUDIES\\DATA_SCIENCE_CAPSTONE\\FINAL_DATA\\final\\/de_DE/de_DE.blogs.txt" ...
##    Length     Class      Mode 
##        12 character character
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding =
## "UTF-8"): line 191902 appears to contain an embedded nul
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding =
## "UTF-8"): line 309777 appears to contain an embedded nul
## List of 3
##  $ : chr [1:337100] "Настало время и мне поделиться чем-нибудь сладким!!! Уже совсем скоро наступит Новый год, и поэтому моя конфека"| __truncated__ "сама элегантность и выдержанность...." "Знаменитые дизайнеры, популярные магазины дамской одежды и известные модницы - все сейчас увлечены женственным "| __truncated__ "Проверяем точно ли продавец высылает в Россию? Нажимаем на вкладку и читаем:" ...
##  $ : chr [1:196360] "Словом, поле для деятельности большое. Можно даже не прибегать к помощи интернет-сообщества и сэкономить ценные"| __truncated__ "Для смены режима Эмомали Рахмона в Таджикистане все готово, взрыв может случиться в любую минуту, активисты опп"| __truncated__ "Много вопросов задавали владивостокские музыканты, пришедшие в зал поприветствовать своих кумиров и учителей. И"| __truncated__ "Мы отошли, и она продолжила." ...
##  $ : chr [1:881414] "помыла с боем и модным шампунем котэ, а оно пришло и стряхивает всю воду мне в монитор! - мстит, сук..)" "Еще 20 минут" "какой же он всё-таки шут" "Пришел пешком в центр увидел, новогодную суету. Можно сказать почувствовал праздник :-)" ...
## [1] 118996424
## [1] "file  1 size is equal to: 116855835"
## [1] "file  2 size is equal to: 118996424"
## [1] "file  3 size is equal to: 105182346"
## [1] "The name of the file number 1 is ru_RU.blogs.txt"
## [1] "File number  1 ( ru_RU.blogs.txt ) is blogs"
## [1] "File  1 ( ru_RU.blogs.txt ) size is equal to: 116855835"
## [1] "Is file  1 ( ru_RU.blogs.txt ) is a directory? No, it is not"
## [1] "The file number 1 ( ru_RU.blogs.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 1 ( ru_RU.blogs.txt ) was created is: 2014-07-22 11:12:23"
## [1] "The assumed time when the file number 1 ( ru_RU.blogs.txt ) was apploaded to the computer is: 2020-08-15 21:58:21"
## [1] "The estimated time when this file 1 ( ru_RU.blogs.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:24"
## [1] "Does the file  1 ( ru_RU.blogs.txt ) contain executable code? No, it doesn't."
## [1] "The name of the file number 2 is ru_RU.news.txt"
## [1] "File number  2 ( ru_RU.news.txt ) is news"
## [1] "File  2 ( ru_RU.news.txt ) size is equal to: 118996424"
## [1] "Is file  2 ( ru_RU.news.txt ) is a directory? No, it is not"
## [1] "The file number 2 ( ru_RU.news.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 2 ( ru_RU.news.txt ) was created is: 2014-07-22 11:12:29"
## [1] "The assumed time when the file number 2 ( ru_RU.news.txt ) was apploaded to the computer is: 2020-08-15 21:58:24"
## [1] "The estimated time when this file 2 ( ru_RU.news.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:27"
## [1] "Does the file  2 ( ru_RU.news.txt ) contain executable code? No, it doesn't."
## [1] "The name of the file number 3 is ru_RU.twitter.txt"
## [1] "File number  3 ( ru_RU.twitter.txt ) is twitter"
## [1] "File  3 ( ru_RU.twitter.txt ) size is equal to: 105182346"
## [1] "Is file  3 ( ru_RU.twitter.txt ) is a directory? No, it is not"
## [1] "The file number 3 ( ru_RU.twitter.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 3 ( ru_RU.twitter.txt ) was created is: 2014-07-22 11:12:33"
## [1] "The assumed time when the file number 3 ( ru_RU.twitter.txt ) was apploaded to the computer is: 2020-08-15 21:58:27"
## [1] "The estimated time when this file 3 ( ru_RU.twitter.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:30"
## [1] "Does the file  3 ( ru_RU.twitter.txt ) contain executable code? No, it doesn't."
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 1"
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 2"
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 3"
##        my_data_frame        file_names file_size file_size_Mb if_directory
## 1:   ru_RU.blogs.txt   ru_RU.blogs.txt 116855835     111.4424        FALSE
## 2:    ru_RU.news.txt    ru_RU.news.txt 118996424     113.4838        FALSE
## 3: ru_RU.twitter.txt ru_RU.twitter.txt 105182346     100.3097        FALSE
##    octal_perm_code        file_created     file_downloaded    file_recent_vers
## 1:             438 2014-07-22 11:12:23 2020-08-15 21:58:21 2020-08-15 21:58:24
## 2:             438 2014-07-22 11:12:29 2020-08-15 21:58:24 2020-08-15 21:58:27
## 3:             438 2014-07-22 11:12:33 2020-08-15 21:58:27 2020-08-15 21:58:30
##    if_exe
## 1:     no
## 2:     no
## 3:     no
The main data about the files
my_data_frame      file_names         file_size  file_size_Mb  if_directory  octal_perm_code  file_created         file_downloaded      file_recent_vers     if_exe
ru_RU.blogs.txt    ru_RU.blogs.txt    116855835  111.4424      FALSE         438              2014-07-22 11:12:23  2020-08-15 21:58:21  2020-08-15 21:58:24  no
ru_RU.news.txt     ru_RU.news.txt     118996424  113.4838      FALSE         438              2014-07-22 11:12:29  2020-08-15 21:58:24  2020-08-15 21:58:27  no
ru_RU.twitter.txt  ru_RU.twitter.txt  105182346  100.3097      FALSE         438              2014-07-22 11:12:33  2020-08-15 21:58:27  2020-08-15 21:58:30  no
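The columns of this table resemble what base R's `file.info` returns, although the code that built the table is not shown here, so that mapping is an assumption. The sketch below also shows that 438 is the decimal form of the octal permission code 666 printed earlier, and how bytes convert to the `file_size_Mb` values.

```r
# Write a small temporary file so the example is self-contained
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)

info <- file.info(tmp)
info$size    # size in bytes
info$isdir   # FALSE for a regular file
info$mtime   # last modification time

# Permissions are stored as an integer; print it in octal to read it as a mode
perm_octal <- format(as.octmode(438L))       # "666"

# Bytes to megabytes, taking 1024^2 bytes per megabyte
size_mb <- round(116855835 / 1024^2, 4)      # 111.4424

unlink(tmp)  # remove the temporary file
```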
##    my_descriptive_stat lines_count words_count mean_words_per_line
## 1:     ru_RU.blogs.txt      337100     9388482            27.85073
## 2:      ru_RU.news.txt      196360     9057248            46.12573
## 3:   ru_RU.twitter.txt      881414     9231328            10.47332
The main summary statistics about the files
my_descriptive_stat  lines_count  words_count  mean_words_per_line
ru_RU.blogs.txt      337100       9388482      27.850732720261
ru_RU.news.txt       196360       9057248      46.1257282542269
ru_RU.twitter.txt    881414       9231328      10.4733167387856
##      Length Class  Mode     
## [1,] 16855  -none- character
## [2,]  9818  -none- character
## [3,] 44070  -none- character
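The lengths above (16855, 9818 and 44070) equal one twentieth of the line counts in the previous table, rounded down. This suggests that a 5% sample of each file was taken, but the sampling code is not shown here, so this is an inference. A sketch of such a sample, with `sample_lines` as a hypothetical helper and an arbitrary seed:

```r
# Hypothetical helper: draw a random share of the lines of one file.
sample_lines <- function(lines, share = 0.05, seed = 1) {
  set.seed(seed)  # arbitrary seed, only to make the sample reproducible
  lines[sample.int(length(lines), floor(share * length(lines)))]
}

# One twentieth (5%) of each line count, rounded down
expected_sizes <- c(337100, 196360, 881414) %/% 20
expected_sizes  # 16855 9818 44070
```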
## Classes 'data.table' and 'data.frame':   3 obs. of  2 variables:
##  $ Row_Names: chr  "ru_RU.blogs.txt" "ru_RU.news.txt" "ru_RU.twitter.txt"
##  $ tat      :List of 3
##   ..$ : chr  "Помните, я говорила, что возвращаюсь в блог? Представляете, я вас тогда обманула) Но это всё моя подготовка к с"| __truncated__ "Немного расслабимся. Странный сайт, который продает квартиры в туле. Он у меня не влезает в экран, половина за "| __truncated__ "Когда Бог создавал время, он создал его достаточно." "Я даже подумать не могла, что их будет столько. Спасибо вам мои хорошие за вас. Со многими я уже успела подружи"| __truncated__ ...
##   ..$ : chr  "Исходя из этого, мы с братом Алексеем приняли решение поддержать строительство дацана, чтобы помочь верующим. И"| __truncated__ "– Во-первых, живём одним днём, во-вторых, политические процессы определяют экономические процессы. Велика корру"| __truncated__ "Перед судом предстали капитан этой субмарины Дмитрий Лаврентьев и трюмный матрос «Нерпы» Дмитрий Гробов." "Причём, помимо приморцев, в нелегальный оборот свежевыловленных лососёвых включились уже и граждане Узбекистана"| __truncated__ ...
##   ..$ : chr  "Ночные движения начинаются, еду играть сначала в Роад148, а потом в баркод на вечеринку от Dubstep+" "Жаль, что в экономическом блоке нет тезисов об отмене госкорпораций, антимонопольной политики, выплаты всех нал"| __truncated__ "Т.А очень-очень-очень скучаю :(" "А погодка хорошая для прогулки?=)" ...
##  - attr(*, ".internal.selfref")=<externalptr>
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## List of 3
##  $ 1:List of 2
##   ..$ content: chr [1:16855] "Помните, я говорила, что возвращаюсь в блог? Представляете, я вас тогда обманула) Но это всё моя подготовка к с"| __truncated__ "Немного расслабимся. Странный сайт, который продает квартиры в туле. Он у меня не влезает в экран, половина за "| __truncated__ "Когда Бог создавал время, он создал его достаточно." "Я даже подумать не могла, что их будет столько. Спасибо вам мои хорошие за вас. Со многими я уже успела подружи"| __truncated__ ...
##   ..$ meta   :List of 7
##   .. ..$ author       : chr(0) 
##   .. ..$ datetimestamp: POSIXlt[1:1], format: "2020-08-29 14:49:15"
##   .. ..$ description  : chr(0) 
##   .. ..$ heading      : chr(0) 
##   .. ..$ id           : chr "1"
##   .. ..$ language     : chr "en"
##   .. ..$ origin       : chr(0) 
##   .. ..- attr(*, "class")= chr "TextDocumentMeta"
##   ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
##  $ 2:List of 2
##   ..$ content: chr [1:9818] "Исходя из этого, мы с братом Алексеем приняли решение поддержать строительство дацана, чтобы помочь верующим. И"| __truncated__ "– Во-первых, живём одним днём, во-вторых, политические процессы определяют экономические процессы. Велика корру"| __truncated__ "Перед судом предстали капитан этой субмарины Дмитрий Лаврентьев и трюмный матрос «Нерпы» Дмитрий Гробов." "Причём, помимо приморцев, в нелегальный оборот свежевыловленных лососёвых включились уже и граждане Узбекистана"| __truncated__ ...
##   ..$ meta   :List of 7
##   .. ..$ author       : chr(0) 
##   .. ..$ datetimestamp: POSIXlt[1:1], format: "2020-08-29 14:49:15"
##   .. ..$ description  : chr(0) 
##   .. ..$ heading      : chr(0) 
##   .. ..$ id           : chr "2"
##   .. ..$ language     : chr "en"
##   .. ..$ origin       : chr(0) 
##   .. ..- attr(*, "class")= chr "TextDocumentMeta"
##   ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
##  $ 3:List of 2
##   ..$ content: chr [1:44070] "Ночные движения начинаются, еду играть сначала в Роад148, а потом в баркод на вечеринку от Dubstep+" "Жаль, что в экономическом блоке нет тезисов об отмене госкорпораций, антимонопольной политики, выплаты всех нал"| __truncated__ "Т.А очень-очень-очень скучаю :(" "А погодка хорошая для прогулки?=)" ...
##   ..$ meta   :List of 7
##   .. ..$ author       : chr(0) 
##   .. ..$ datetimestamp: POSIXlt[1:1], format: "2020-08-29 14:49:15"
##   .. ..$ description  : chr(0) 
##   .. ..$ heading      : chr(0) 
##   .. ..$ id           : chr "3"
##   .. ..$ language     : chr "en"
##   .. ..$ origin       : chr(0) 
##   .. ..- attr(*, "class")= chr "TextDocumentMeta"
##   ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
##  - attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
##   [1] "и"       "в"       "во"      "не"      "что"     "он"      "на"     
##   [8] "я"       "с"       "со"      "как"     "а"       "то"      "все"    
##  [15] "она"     "так"     "его"     "но"      "да"      "ты"      "к"      
##  [22] "у"       "же"      "вы"      "за"      "бы"      "по"      "только" 
##  [29] "ее"      "мне"     "было"    "вот"     "от"      "меня"    "еще"    
##  [36] "нет"     "о"       "из"      "ему"     "теперь"  "когда"   "даже"   
##  [43] "ну"      "вдруг"   "ли"      "если"    "уже"     "или"     "ни"     
##  [50] "быть"    "был"     "него"    "до"      "вас"     "нибудь"  "опять"  
##  [57] "уж"      "вам"     "сказал"  "ведь"    "там"     "потом"   "себя"   
##  [64] "ничего"  "ей"      "может"   "они"     "тут"     "где"     "есть"   
##  [71] "надо"    "ней"     "для"     "мы"      "тебя"    "их"      "чем"    
##  [78] "была"    "сам"     "чтоб"    "без"     "будто"   "человек" "чего"   
##  [85] "раз"     "тоже"    "себе"    "под"     "жизнь"   "будет"   "ж"      
##  [92] "тогда"   "кто"     "этот"    "говорил" "того"    "потому"  "этого"  
##  [99] "какой"   "совсем"  "ним"     "здесь"   "этом"    "один"    "почти"  
## [106] "мой"     "тем"     "чтобы"   "нее"     "кажется" "сейчас"  "были"   
## [113] "куда"    "зачем"   "сказать" "всех"    "никогда" "сегодня" "можно"  
## [120] "при"     "наконец" "два"     "об"      "другой"  "хоть"    "после"  
## [127] "над"     "больше"  "тот"     "через"   "эти"     "нас"     "про"    
## [134] "всего"   "них"     "какая"   "много"   "разве"   "сказала" "три"    
## [141] "эту"     "моя"     "впрочем" "хорошо"  "свою"    "этой"    "перед"  
## [148] "иногда"  "лучше"   "чуть"    "том"     "нельзя"  "такой"   "им"     
## [155] "более"   "всегда"  "конечно" "всю"     "между"

## List of 6
##  $ i       : int [1:707054] 7 12 15 19 24 26 27 28 40 41 ...
##  $ j       : int [1:707054] 1 1 1 1 1 1 1 1 1 1 ...
##  $ v       : num [1:707054] 1 1 1 1 1 1 1 1 1 1 ...
##  $ nrow    : int 705823
##  $ ncol    : int 3
##  $ dimnames:List of 2
##   ..$ Terms: chr [1:705823] "- ежедневная теорема" "- одни изображали" "–– ветер восточным" "–– некоммерческая история" ...
##   ..$ Docs : chr [1:3] "character(0)" "character(0)" "character(0)"
##  - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
## List of 6
##  $ i       : int [1:725617] 7 9 10 11 12 13 14 15 16 17 ...
##  $ j       : int [1:725617] 1 1 1 1 1 1 1 1 1 1 ...
##  $ v       : num [1:725617] 10 1 8 1 1 1 1 1 1 1 ...
##  $ nrow    : int 706114
##  $ ncol    : int 3
##  $ dimnames:List of 2
##   ..$ Terms: chr [1:706114] "- ежедневная" "- одни" "–– ветер" "–– некоммерческая" ...
##   ..$ Docs : chr [1:3] "character(0)" "character(0)" "character(0)"
##  - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
## List of 6
##  $ i       : int [1:233578] 3 4 6 11 19 20 21 22 23 24 ...
##  $ j       : int [1:233578] 1 1 1 1 1 1 1 1 1 1 ...
##  $ v       : num [1:233578] 1 1 1 1 1 1 1 1 1 1 ...
##  $ nrow    : int 165543
##  $ ncol    : int 3
##  $ dimnames:List of 2
##   ..$ Terms: chr [1:165543] "–краткий" "–линейными" "–люди" "–солнце" ...
##   ..$ Docs : chr [1:3] "character(0)" "character(0)" "character(0)"
##  - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
##          Terms count
## 34707     года  1866
## 127041  россии  1253
## 34772     году  1228
## 64093  который  1127
## 45855    жизни   888
## 70117     люди   885
## 120501  работы   817
## 70109    людей   806
## 38791   делать   688
## 42059     дома   652
##                 Terms count
## 407267 пермского края   327
## 540315     самом деле   231
## 558927        сих пор   228
## 153412    доброе утро   220
## 617195  таким образом   159
## 268004      лет назад   127
## 347631      новый год   123
## 299753       млрд руб   120
## 494475   прошлом году   117
## 164671     друг друга   109
##                               Terms count
## 455999 правительства пермского края    44
## 86596     возбуждено уголовное дело    23
## 282540      малого среднего бизнеса    16
## 404627       передает риа «новости»    16
## 155567        доброго времени суток    15
## 326687      наступающим новым годом    15
## 596119                     ст ук рф    15
## 300470                   млрд куб м    14
## 626632    территории пермского края    14
## 636285                   трлн куб м    14
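The frequency tables above list the most common single words, word pairs and word triples. The str() output shows they were built from TermDocumentMatrix objects. As a rough illustration only, a table of the same shape can be sketched in base R without the tm package. The function name count_ngrams and the sample sentence are hypothetical and are not part of the project code.

```r
# Minimal base-R sketch of n-gram counting (not the tm-based project code).
# count_ngrams() is a hypothetical helper; it expects one character string.
count_ngrams <- function(text, n) {
  # Lower-case and split on anything that is not a letter
  tokens <- unlist(strsplit(tolower(text), "[^[:alpha:]]+"))
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) < n) {
    return(data.frame(Terms = character(0), count = integer(0)))
  }
  # Build every run of n consecutive tokens
  starts <- seq_len(length(tokens) - n + 1)
  grams <- vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
  # Count the n-grams and sort by descending frequency
  freq <- sort(table(grams), decreasing = TRUE)
  data.frame(Terms = names(freq), count = as.integer(freq),
             stringsAsFactors = FALSE)
}

toy_text <- "good morning world good morning"  # hypothetical sample
bigrams <- count_ngrams(toy_text, 2)
head(bigrams, 3)
```

On real data each line would be processed separately, so that n-grams do not run across line boundaries.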
## 'data.frame':    165543 obs. of  2 variables:
##  $ Terms: Factor w/ 165543 levels "–краткий","–линейными",..: 34707 127041 34772 64093 45855 70117 120501 70109 38791 42059 ...
##  $ count: num  1866 1253 1228 1127 888 ...
## 'data.frame':    50 obs. of  2 variables:
##  $ Terms: Factor w/ 165543 levels "–краткий","–линейными",..: 34707 127041 34772 64093 45855 70117 120501 70109 38791 42059 ...
##  $ count: num  1866 1253 1228 1127 888 ...

##         Terms            count          relative_frequency  my_coverage_stat  
##  –краткий  :     1   Min.   :   1.000   Min.   :1.257e-06   Min.   :0.002346  
##  –линейными:     1   1st Qu.:   1.000   1st Qu.:1.257e-06   1st Qu.:0.809945  
##  –люди     :     1   Median :   1.000   Median :1.257e-06   Median :0.895940  
##  –солнце   :     1   Mean   :   4.805   Mean   :6.041e-06   Mean   :0.854150  
##  –таки     :     1   3rd Qu.:   3.000   3rd Qu.:3.772e-06   3rd Qu.:0.947970  
##  –хау      :     1   Max.   :1866.000   Max.   :2.346e-03   Max.   :1.000000  
##  (Other)   :165537                                                            
##      index       
##  Min.   :     1  
##  1st Qu.: 41387  
##  Median : 82772  
##  Mean   : 82772  
##  3rd Qu.:124158  
##  Max.   :165543  
## 
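The numbers in this summary are consistent with relative_frequency being count divided by the total number of counted words, and my_coverage_stat being the running sum of relative_frequency over the terms sorted by descending count. For example, the largest relative frequency (2.346e-03) equals the smallest coverage value (0.002346), and 1866 / 2.346e-03 gives roughly the same total as 1 / 1.257e-06. The sketch below shows that calculation on toy data under this assumption. The toy counts and the helper terms_needed are hypothetical.

```r
# Toy term counts (hypothetical, not taken from the corpus)
term_counts <- data.frame(
  Terms = c("year", "people", "work", "home"),
  count = c(4, 3, 2, 1),
  stringsAsFactors = FALSE
)

# Sort by descending count, then add the derived columns
term_counts <- term_counts[order(-term_counts$count), ]
term_counts$relative_frequency <- term_counts$count / sum(term_counts$count)
term_counts$my_coverage_stat <- cumsum(term_counts$relative_frequency)
term_counts$index <- seq_len(nrow(term_counts))

# Hypothetical helper: how many top-ranked terms are needed
# to cover a given share of all word occurrences
terms_needed <- function(df, target) {
  which(df$my_coverage_stat >= target)[1]
}

terms_needed(term_counts, 0.5)
```

Read this way, the median coverage of about 0.896 in the summary above means that the more frequent half of the distinct terms accounts for roughly 90% of all word occurrences.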

## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.

The German part of the analysis

## [1] "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
## [1] "de_DE" "en_US" "fi_FI" "ru_RU"
##  chr [1:3] "de_DE.blogs.txt" "de_DE.news.txt" "de_DE.twitter.txt"
##  chr [1:12] "C:\\Users\\Alex\\Documents\\COURSERA STUDIES\\DATA_SCIENCE_CAPSTONE\\FINAL_DATA\\final\\/de_DE/de_DE.blogs.txt" ...
##    Length     Class      Mode 
##        12 character character
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding
## = "UTF-8"): incomplete final line found on 'C:\Users\Alex\Documents\COURSERA
## STUDIES\DATA_SCIENCE_CAPSTONE\FINAL_DATA\final\de_DE\de_DE.blogs.txt'
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding =
## "UTF-8"): line 8653 appears to contain an embedded nul
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding =
## "UTF-8"): line 78077 appears to contain an embedded nul
## Warning in readLines(paste(my_path_eng, filenames[[x]], sep = ""), encoding =
## "UTF-8"): line 152105 appears to contain an embedded nul
## List of 3
##  $ : chr [1:181958] "Irgendwann wird es Zeit. Ich schleppe es ja auch jeden Tag mit mir herum. Da leidet es schon ein wenig. Nicht n"| __truncated__ "Kommentar: auch hier wird durch die Voranstellung einer entsprechenden Bemuskelung des Halses vor die Eleganz d"| __truncated__ "In vielen Staaten veranlasst vor allem die Hoffnung auf bessere Erwerbsmoglichkeiten die Menschen dazu, sich in"| __truncated__ "Nneka – Heartbeat (Crada Remix ft. NAS)" ...
##  $ : chr [1:244743] "Das Rezept fur ihre Schokobrezln hat die 60-Jahrige schon vor 26 Jahren in einer osterreichischen Sendung entde"| __truncated__ "Fur die Linksparteibewerber ist nun die entscheidende Frage, wie sich SPD- und Grunenwahler in einer solchen Pe"| __truncated__ "Nach Einschatzung des DIW ist das kraftige Plus im dritten Quartal vor allem dem Quartalsauftakt im Juli zu ver"| __truncated__ "„Der Bau eines neuen Lagers ist ein weiterer Baustein unserer Umstrukturierungsma<U+00DF>nahmen, mit denen wir "| __truncated__ ...
##  $ : chr [1:947774] "irgendwas stimmt mut meinem internet am pc nich :(" "\"Wir haben hier ein angebrochenes Fass Bier!\" habe ich mir auch anders vorgestellt. Fragt sich nur, wer daruber gekotzt hat." "Meine Kommilitonen beschweren sich, nie Freizeit zu haben... Anscheinend mache ich was falsch. Naja. Lauft..." "Gestern noch in Bangkok, heute wegen des Hochwassers in Vientiane, Laos. Schade, die Damme dort haben gehalten im Zentrum." ...
## [1] 95591959
## [1] "file  1 size is equal to: 85459666"
## [1] "file  2 size is equal to: 95591959"
## [1] "file  3 size is equal to: 75578341"
## [1] "The name of the file number 1 is de_DE.blogs.txt"
## [1] "File number  1 ( de_DE.blogs.txt ) is blogs"
## [1] "File  1 ( de_DE.blogs.txt ) size is equal to: 85459666"
## [1] "Is file  1 ( de_DE.blogs.txt ) is a directory? No, it is not"
## [1] "The file number 1 ( de_DE.blogs.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 1 ( de_DE.blogs.txt ) was created is: 2014-07-22 11:11:32"
## [1] "The assumed time when the file number 1 ( de_DE.blogs.txt ) was apploaded to the computer is: 2020-08-15 21:58:17"
## [1] "The estimated time when this file 1 ( de_DE.blogs.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:19"
## [1] "Does the file  1 ( de_DE.blogs.txt ) contain executable code? No, it doesn't."
## [1] "The name of the file number 2 is de_DE.news.txt"
## [1] "File number  2 ( de_DE.news.txt ) is news"
## [1] "File  2 ( de_DE.news.txt ) size is equal to: 95591959"
## [1] "Is file  2 ( de_DE.news.txt ) is a directory? No, it is not"
## [1] "The file number 2 ( de_DE.news.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 2 ( de_DE.news.txt ) was created is: 2014-07-22 11:11:43"
## [1] "The assumed time when the file number 2 ( de_DE.news.txt ) was apploaded to the computer is: 2020-08-15 21:58:19"
## [1] "The estimated time when this file 2 ( de_DE.news.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:21"
## [1] "Does the file  2 ( de_DE.news.txt ) contain executable code? No, it doesn't."
## [1] "The name of the file number 3 is de_DE.twitter.txt"
## [1] "File number  3 ( de_DE.twitter.txt ) is twitter"
## [1] "File  3 ( de_DE.twitter.txt ) size is equal to: 75578341"
## [1] "Is file  3 ( de_DE.twitter.txt ) is a directory? No, it is not"
## [1] "The file number 3 ( de_DE.twitter.txt ) permissions code printed in the octal number is: 666"
## [1] "The supposed date and time when the original file number 3 ( de_DE.twitter.txt ) was created is: 2014-07-22 11:11:37"
## [1] "The assumed time when the file number 3 ( de_DE.twitter.txt ) was apploaded to the computer is: 2020-08-15 21:58:14"
## [1] "The estimated time when this file 3 ( de_DE.twitter.txt ) was the first time loaded into the R global environment is: 2020-08-15 21:58:17"
## [1] "Does the file  3 ( de_DE.twitter.txt ) contain executable code? No, it doesn't."
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 1"
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 2"
## [1] "In addition, a dataframe is created that contains all the data about each file. The name of the data frame is: my_data_frame. The number of files represented there is now enlarged to: 3"
##        my_data_frame        file_names file_size file_size_Mb if_directory
## 1:   de_DE.blogs.txt   de_DE.blogs.txt  85459666     81.50069        FALSE
## 2:    de_DE.news.txt    de_DE.news.txt  95591959     91.16360        FALSE
## 3: de_DE.twitter.txt de_DE.twitter.txt  75578341     72.07712        FALSE
##    octal_perm_code        file_created     file_downloaded    file_recent_vers
## 1:             438 2014-07-22 11:11:32 2020-08-15 21:58:17 2020-08-15 21:58:19
## 2:             438 2014-07-22 11:11:43 2020-08-15 21:58:19 2020-08-15 21:58:21
## 3:             438 2014-07-22 11:11:37 2020-08-15 21:58:14 2020-08-15 21:58:17
##    if_exe
## 1:     no
## 2:     no
## 3:     no
The main data about the files
my_data_frame      file_names         file_size  file_size_Mb  if_directory  octal_perm_code  file_created         file_downloaded      file_recent_vers     if_exe
de_DE.blogs.txt    de_DE.blogs.txt    85459666   81.50069      FALSE         438              2014-07-22 11:11:32  2020-08-15 21:58:17  2020-08-15 21:58:19  no
de_DE.news.txt     de_DE.news.txt     95591959   91.16360      FALSE         438              2014-07-22 11:11:43  2020-08-15 21:58:19  2020-08-15 21:58:21  no
de_DE.twitter.txt  de_DE.twitter.txt  75578341   72.07712      FALSE         438              2014-07-22 11:11:37  2020-08-15 21:58:14  2020-08-15 21:58:17  no
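Note that the printed messages report the permissions code as 666, while the octal_perm_code column shows 438. These are the same value: 438 is the decimal form of octal 666. The column presumably holds the mode as a plain integer, although this is an assumption about the code that built the table. The short sketch below shows the conversion in both directions with base R.

```r
# Octal 666 and decimal 438 are the same permission bits
perm_octal <- as.octmode("666")       # parse a string of octal digits
perm_decimal <- as.integer(perm_octal)
perm_decimal

# Converting the decimal value back gives the octal notation
perm_back <- format(as.octmode(438L))
perm_back

# Manual check: 6 * 8^2 + 6 * 8^1 + 6 * 8^0
manual <- 6 * 64 + 6 * 8 + 6
```

Formatting the column with format(as.octmode(...)) before printing would make the table agree with the messages.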
##    my_descriptive_stat lines_count words_count mean_words_per_line
## 1:     de_DE.blogs.txt      181958     6205913            34.10629
## 2:      de_DE.news.txt      244743    13375092            54.64954
## 3:   de_DE.twitter.txt      947774    11646033            12.28777
The main summary statistics about the files
my_descriptive_stat  lines_count  words_count  mean_words_per_line
de_DE.blogs.txt      181958       6205913      34.1062937600985
de_DE.news.txt       244743       13375092     54.649538495483
de_DE.twitter.txt    947774       11646033     12.2877743006244
##      Length Class  Mode     
## [1,]  9097  -none- character
## [2,] 12237  -none- character
## [3,] 47388  -none- character
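The three lengths above (9097, 12237 and 47388) equal 5% of the line counts of the three files, rounded down. This suggests, but does not confirm, that a 5% random sample of lines was taken from each file before building the corpus. The sketch below shows such sampling under that assumption. The helper sample_lines, the seed and the toy lines are hypothetical.

```r
# Hypothetical helper: keep a random share of the lines of one file
sample_lines <- function(lines, share = 0.05) {
  n <- floor(length(lines) * share)
  lines[sample.int(length(lines), n)]
}

# Line counts reported for the German blogs, news and twitter files
line_counts <- c(181958, 244743, 947774)
expected_sizes <- floor(line_counts * 0.05)
expected_sizes

# Toy demonstration on 200 made-up lines
set.seed(2020)  # hypothetical seed, only for reproducibility
toy_lines <- paste("line", 1:200)
toy_sample <- sample_lines(toy_lines)
length(toy_sample)
```

A smaller sample keeps the term-document matrices within memory limits, at the cost of less reliable counts for rare n-grams.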
## Classes 'data.table' and 'data.frame':   3 obs. of  2 variables:
##  $ Row_Names: chr  "de_DE.blogs.txt" "de_DE.news.txt" "de_DE.twitter.txt"
##  $ tat      :List of 3
##   ..$ : chr  "Es gibt viele Situationen und Lebensumstande im Buch die mir sehr gut gefallen haben. So sind mitunter kleine W"| __truncated__ "Nur fur heute.*" "Und nun warten diverse Baume und noch ein spontaner Styroporschnitzjob auf uns- auf geht`s!" "Angezogen sieht das Kleid supersu<U+00DF> aus." ...
##   ..$ : chr  "Mit dem Papst tritt der Reprasentant einer autokratischen Wahlmonarchie auf, die weder die fur Demokratien ubli"| __truncated__ "Kinder sind allgemein nicht starker armutsgefahrdet als der Durchschnitt der Bevolkerung, betonte Egeler. 2008 "| __truncated__ "Schneider wird in Augsburg willkommen sein, hat er in der Region doch einige knifflige Falle gelost. Der Expert"| __truncated__ "Wer den arabischen Fruhling im Fernsehen verfolgt hat, ist mit hoher Wahrscheinlichkeit diesem Mann begegnet: H"| __truncated__ ...
##   ..$ : chr  "Nicht ein deutscher Nachrichtensender zeigt Nachrichten. Peinlich" "Aber hoffentlich ohne #Lion? Meines ist leider sehr langsam geworden" "Wohl war, Wohl war." "Bin morgen leider nicht dabei. Lass ins doch kommende Woche mal sprechen." ...
##  - attr(*, ".internal.selfref")=<externalptr>
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## List of 3
##  $ 1:List of 2
##   ..$ content: chr [1:9097] "Es gibt viele Situationen und Lebensumstande im Buch die mir sehr gut gefallen haben. So sind mitunter kleine W"| __truncated__ "Nur fur heute.*" "Und nun warten diverse Baume und noch ein spontaner Styroporschnitzjob auf uns- auf geht`s!" "Angezogen sieht das Kleid supersu<U+00DF> aus." ...
##   ..$ meta   :List of 7
##   .. ..$ author       : chr(0) 
##   .. ..$ datetimestamp: POSIXlt[1:1], format: "2020-08-29 15:09:43"
##   .. ..$ description  : chr(0) 
##   .. ..$ heading      : chr(0) 
##   .. ..$ id           : chr "1"
##   .. ..$ language     : chr "en"
##   .. ..$ origin       : chr(0) 
##   .. ..- attr(*, "class")= chr "TextDocumentMeta"
##   ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
##  $ 2:List of 2
##   ..$ content: chr [1:12237] "Mit dem Papst tritt der Reprasentant einer autokratischen Wahlmonarchie auf, die weder die fur Demokratien ubli"| __truncated__ "Kinder sind allgemein nicht starker armutsgefahrdet als der Durchschnitt der Bevolkerung, betonte Egeler. 2008 "| __truncated__ "Schneider wird in Augsburg willkommen sein, hat er in der Region doch einige knifflige Falle gelost. Der Expert"| __truncated__ "Wer den arabischen Fruhling im Fernsehen verfolgt hat, ist mit hoher Wahrscheinlichkeit diesem Mann begegnet: H"| __truncated__ ...
##   ..$ meta   :List of 7
##   .. ..$ author       : chr(0) 
##   .. ..$ datetimestamp: POSIXlt[1:1], format: "2020-08-29 15:09:43"
##   .. ..$ description  : chr(0) 
##   .. ..$ heading      : chr(0) 
##   .. ..$ id           : chr "2"
##   .. ..$ language     : chr "en"
##   .. ..$ origin       : chr(0) 
##   .. ..- attr(*, "class")= chr "TextDocumentMeta"
##   ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
##  $ 3:List of 2
##   ..$ content: chr [1:47388] "Nicht ein deutscher Nachrichtensender zeigt Nachrichten. Peinlich" "Aber hoffentlich ohne #Lion? Meines ist leider sehr langsam geworden" "Wohl war, Wohl war." "Bin morgen leider nicht dabei. Lass ins doch kommende Woche mal sprechen." ...
##   ..$ meta   :List of 7
##   .. ..$ author       : chr(0) 
##   .. ..$ datetimestamp: POSIXlt[1:1], format: "2020-08-29 15:09:43"
##   .. ..$ description  : chr(0) 
##   .. ..$ heading      : chr(0) 
##   .. ..$ id           : chr "3"
##   .. ..$ language     : chr "en"
##   .. ..$ origin       : chr(0) 
##   .. ..- attr(*, "class")= chr "TextDocumentMeta"
##   ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
##  - attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
##   [1] "aber"      "alle"      "allem"     "allen"     "aller"     "alles"    
##   [7] "als"       "also"      "am"        "an"        "ander"     "andere"   
##  [13] "anderem"   "anderen"   "anderer"   "anderes"   "anderm"    "andern"   
##  [19] "anderr"    "anders"    "auch"      "auf"       "aus"       "bei"      
##  [25] "bin"       "bis"       "bist"      "da"        "damit"     "dann"     
##  [31] "der"       "den"       "des"       "dem"       "die"       "das"      
##  [37] "da<U+00DF>" "derselbe"  "derselben" "denselben" "desselben" "demselben"
##  [43] "dieselbe"  "dieselben" "dasselbe"  "dazu"      "dein"      "deine"    
##  [49] "deinem"    "deinen"    "deiner"    "deines"    "denn"      "derer"    
##  [55] "dessen"    "dich"      "dir"       "du"        "dies"      "diese"    
##  [61] "diesem"    "diesen"    "dieser"    "dieses"    "doch"      "dort"     
##  [67] "durch"     "ein"       "eine"      "einem"     "einen"     "einer"    
##  [73] "eines"     "einig"     "einige"    "einigem"   "einigen"   "einiger"  
##  [79] "einiges"   "einmal"    "er"        "ihn"       "ihm"       "es"       
##  [85] "etwas"     "euer"      "eure"      "eurem"     "euren"     "eurer"    
##  [91] "eures"     "fur"       "gegen"     "gewesen"   "hab"       "habe"     
##  [97] "haben"     "hat"       "hatte"     "hatten"    "hier"      "hin"      
## [103] "hinter"    "ich"       "mich"      "mir"       "ihr"       "ihre"     
## [109] "ihrem"     "ihren"     "ihrer"     "ihres"     "euch"      "im"       
## [115] "in"        "indem"     "ins"       "ist"       "jede"      "jedem"    
## [121] "jeden"     "jeder"     "jedes"     "jene"      "jenem"     "jenen"    
## [127] "jener"     "jenes"     "jetzt"     "kann"      "kein"      "keine"    
## [133] "keinem"    "keinen"    "keiner"    "keines"    "konnen"    "konnte"   
## [139] "machen"    "man"       "manche"    "manchem"   "manchen"   "mancher"  
## [145] "manches"   "mein"      "meine"     "meinem"    "meinen"    "meiner"   
## [151] "meines"    "mit"       "muss"      "musste"    "nach"      "nicht"    
## [157] "nichts"    "noch"      "nun"       "nur"       "ob"        "oder"     
## [163] "ohne"      "sehr"      "sein"      "seine"     "seinem"    "seinen"   
## [169] "seiner"    "seines"    "selbst"    "sich"      "sie"       "ihnen"    
## [175] "sind"      "so"        "solche"    "solchem"   "solchen"   "solcher"  
## [181] "solches"   "soll"      "sollte"    "sondern"   "sonst"     "uber"     
## [187] "um"        "und"       "uns"       "unse"      "unsem"     "unsen"    
## [193] "unser"     "unses"     "unter"     "viel"      "vom"       "von"      
## [199] "vor"       "wahrend"   "war"       "waren"     "warst"     "was"      
## [205] "weg"       "weil"      "weiter"    "welche"    "welchem"   "welchen"  
## [211] "welcher"   "welches"   "wenn"      "werde"     "werden"    "wie"      
## [217] "wieder"    "will"      "wir"       "wird"      "wirst"     "wo"       
## [223] "wollen"    "wollte"    "wurde"     "wurden"    "zu"        "zum"      
## [229] "zur"       "zwar"      "zwischen"

## List of 6
##  $ i       : int [1:567999] 2 3 5 9 10 11 13 18 19 20 ...
##  $ j       : int [1:567999] 1 1 1 1 1 1 1 1 1 1 ...
##  $ v       : num [1:567999] 1 1 1 1 1 1 1 1 1 1 ...
##  $ nrow    : int 567687
##  $ ncol    : int 3
##  $ dimnames:List of 2
##   ..$ Terms: chr [1:567687] "\177 eigenen angaben" "–– wechselbeziehung dimen" "– – ”" "– – amerikanisch" ...
##   ..$ Docs : chr [1:3] "character(0)" "character(0)" "character(0)"
##  - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
## List of 6
##  $ i       : int [1:599972] 2 3 4 5 7 8 9 11 12 13 ...
##  $ j       : int [1:599972] 1 1 1 1 1 1 1 1 1 1 ...
##  $ v       : num [1:599972] 1 7 1 1 2 2 7 1 1 1 ...
##  $ nrow    : int 589173
##  $ ncol    : int 3
##  $ dimnames:List of 2
##   ..$ Terms: chr [1:589173] "\177 eigenen" "–– wechselbeziehung" "– –" "– —" ...
##   ..$ Docs : chr [1:3] "character(0)" "character(0)" "character(0)"
##  - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
## List of 6
##  $ i       : int [1:164445] 1 2 3 4 5 6 9 10 11 13 ...
##  $ j       : int [1:164445] 1 1 1 1 1 1 1 1 1 1 ...
##  $ v       : num [1:164445] 1 1 1 1 1 1 1 2 1 1 ...
##  $ nrow    : int 123615
##  $ ncol    : int 3
##  $ dimnames:List of 2
##   ..$ Terms: chr [1:123615] "–durfen" "–ketchup" "–stromung" "–zuchter" ...
##   ..$ Docs : chr [1:3] "character(0)" "character(0)" "character(0)"
##  - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
##              Terms count
## 33698         euro  1425
## 67391        macht  1354
## 16100       berlin  1173
## 69849     menschen  1136
## 64405        leben  1014
## 25369  deutschland   938
## 65620         lieb   830
## 117441        welt   825
## 49011         haus   699
## 9200        arbeit   671
##                         Terms count
## 343316        milliarden euro   284
## 343556         millionen euro   234
## 369236               new york   128
## 28335           angela merkel   101
## 439968      schones wochenend    94
## 456056           social media    90
## 235691               hartz iv    83
## 243650 herzlichen gluckwunsch    77
## 343312      milliarden dollar    75
## 527368    vereinigten staaten    72
##                                Terms count
## 104441              dauernd hartz iv    64
## 227720            hartz iv schreiben    64
## 256870          iv schreiben dauernd    63
## 425812       schreiben dauernd hartz    63
## 89860  bundeskanzlerin angela merkel    43
## 28730              angela merkel cdu    25
## 57604                 beer beer beer    23
## 135964                   ent ent ent    21
## 309136                lieb lieb lieb    21
## 152595  europaischen zentralbank ezb    19
## 'data.frame':    123615 obs. of  2 variables:
##  $ Terms: Factor w/ 123615 levels "–durfen","–ketchup",..: 33698 67391 16100 69849 64405 25369 65620 117441 49011 9200 ...
##  $ count: num  1425 1354 1173 1136 1014 ...
## 'data.frame':    50 obs. of  2 variables:
##  $ Terms: Factor w/ 123615 levels "–durfen","–ketchup",..: 33698 67391 16100 69849 64405 25369 65620 117441 49011 9200 ...
##  $ count: num  1425 1354 1173 1136 1014 ...

##             Terms            count          relative_frequency 
##  –durfen       :     1   Min.   :   1.000   Min.   :1.555e-06  
##  –ketchup      :     1   1st Qu.:   1.000   1st Qu.:1.555e-06  
##  –stromung     :     1   Median :   1.000   Median :1.555e-06  
##  –zuchter      :     1   Mean   :   5.201   Mean   :8.090e-06  
##  —–ursprunglich:     1   3rd Qu.:   2.000   3rd Qu.:3.111e-06  
##  —fruh–        :     1   Max.   :1425.000   Max.   :2.216e-03  
##  (Other)       :123609                                         
##  my_coverage_stat       index       
##  Min.   :0.002216   Min.   :     1  
##  1st Qu.:0.836928   1st Qu.: 30905  
##  Median :0.903872   Median : 61808  
##  Mean   :0.872124   Mean   : 61808  
##  3rd Qu.:0.951936   3rd Qu.: 92712  
##  Max.   :1.000000   Max.   :123615  
## 

## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.

Results and further steps on the Russian and German data (Appendices)

This data may be used in the future as comparison data, or as datasets for building MLT algorithms that complete word endings for users based on the beginning of their input. Such an algorithm should not overload the server equipment: the MLT should not be too computationally demanding, so that the waiting time for the user stays minimal (for this project this is OK).
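As a rough idea of how cheap such word completion can be, the sketch below looks up the most frequent terms that start with the typed prefix in a frequency table. It is only an illustration, not the planned algorithm. The function complete_word is hypothetical, and the table mixes four counts from the German unigram output above with one invented row ("europa") added for the example.

```r
# Small frequency table: euro, macht, berlin and menschen come from the
# German unigram output; "europa" and its count are invented for the example
freq_table <- data.frame(
  Terms = c("euro", "europa", "macht", "menschen", "berlin"),
  count = c(1425, 300, 1354, 1136, 1173),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return up to k most frequent terms
# that begin with the given prefix
complete_word <- function(prefix, freq_table, k = 3) {
  hits <- freq_table[startsWith(freq_table$Terms, tolower(prefix)), ]
  hits <- hits[order(-hits$count), ]
  head(hits$Terms, k)
}

complete_word("eu", freq_table)
complete_word("M", freq_table)
```

Each lookup is a single pass over the table, so the waiting time grows only linearly with the vocabulary size. A table sorted by term would allow an even faster binary search.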