#author: Ana Lucic
#title: specifies steps for idf calculation using R

#This script outlines the steps you can follow to complete Assignment 6 using R.
#There are multiple ways to complete Assignment 6 in R, and if you found a different solution, that is perfectly fine.
#The assignment can also be completed in other programming languages, but
#these instructions are geared towards students who want to complete it in R.
#The script describes the processes that are needed and gives some suggestions about which packages and functions to use.
#You will need to figure out which functions to use and how to use them.

#Step 1
#The following libraries are useful to have loaded when starting this assignment.

library(tm)    # text mining framework; loading it also attaches NLP
library(slam)  # utilities for sparse matrices
library(NLP)   # already attached by tm; listed here for completeness

#Step 2 -- specifying the file path

#Start by specifying the file path to your folder with the Assignment 6 files; replace the path below with the path to your folder.
pathToData <- "C:\\Users\\Ana\\Documents\\LIS\\Text Mining\\Part1"

# Step 3 -- creating a corpus

#The tm package has a useful function, Corpus, which reads in a corpus from a source such as DirSource.
#DirSource takes the argument recursive = TRUE, which walks through the directories and folders of a corpus
#without you having to write a for loop.
#The readerControl option specifies the reader function used to turn each file into a document.
#Example:

abstracts <- Corpus(DirSource(pathToData, recursive = TRUE), 
                    readerControl = list(reader = readPlain))

#Step 4 -- eliminating stopwords, punctuation, converting to lower case

#Once the documents have been read, the next step is to create a TermDocumentMatrix, a matrix with one row per term
#and one column per document whose cells record the occurrence of each term in each document (raw counts by default).
#Before that you want to convert to lower case and remove stopwords and punctuation.
#The following link provides information on how to do this:
#http://www.inside-r.org/packages/cran/tm/docs/as.TermDocumentMatrix
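
#A minimal sketch of the cleaning described above, run on a made-up three-document corpus rather than the assignment files.
#The names toyDocs and toyClean are hypothetical, and the order of the transformations is one reasonable choice, not the required one:
#lower-casing comes first because the English stopword list in tm is lower case.
library(tm)
toyDocs <- c("The cat sat.", "The dog sat!", "A cat and a dog.")
toyClean <- VCorpus(VectorSource(toyDocs))
toyClean <- tm_map(toyClean, content_transformer(tolower))        # convert to lower case
toyClean <- tm_map(toyClean, removePunctuation)                   # strip punctuation
toyClean <- tm_map(toyClean, removeWords, stopwords("english"))   # drop English stopwords
toyClean <- tm_map(toyClean, stripWhitespace)                     # collapse leftover runs of spaces
#The same tm_map calls should carry over to the abstracts corpus created in Step 3.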

#Step 5 -- how to represent terms in the documents? 

#For idf, a binary format (1 and 0) is particularly suitable. We don't want the actual term frequencies in the document, only whether or not the term appears in the document.
#The tm package contains the weightBin function (more information here: http://www.inside-r.org/packages/cran/tm/docs/weightBin), which lets you record whether a term is present in a document or not.
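
#A small sketch of binary weighting, assuming a made-up two-document corpus (toyBinCorpus and toyTdm are hypothetical names).
#"apple" occurs twice in the first document, but its cell is still 1 once weightBin is applied.
library(tm)
toyBinCorpus <- VCorpus(VectorSource(c("apple apple banana", "banana cherry")))
toyTdm <- TermDocumentMatrix(toyBinCorpus, control = list(weighting = weightBin))
toyTdmDense <- as.matrix(toyTdm)   # fine for a toy example; avoid as.matrix on the full assignment matrix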

#Step 6 -- you will now likely have a very large array

#If you run into any issues creating such a large array or reading information from all the files, it may help to increase the memory available to R.
#Thanks to Craig Evans, here is a link on how to do that: http://answers.stat.ucla.edu/groups/answers/wiki/0e7c9/Increasing_Memory_in_R.html
#Running R from the command line interface rather than in RStudio may also get around the (default) limited memory size of RStudio.
#Another option is to do the processing in chunks: first process the files from one folder and save the information, then move on to the next folder, and so on.

#After creating a TermDocumentMatrix, you will likely want to collapse some of the information in it because it is a very, very sparse array.

#Step 7 -- luckily, there is the slam package in R http://www.inside-r.org/packages/cran/slam/docs/col_sums

#The slam package has some nice options for reducing and aggregating information from large sparse arrays.
#I suggest you check out the functions of the slam package and see if there is any function you can use to collapse the information along a particular dimension (hint: columns).
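
#One possible reading of the hint, sketched on a hand-built sparse matrix (toySparse is a hypothetical stand-in for a binary TermDocumentMatrix).
#A TermDocumentMatrix has terms as rows and documents as columns, so collapsing the columns means summing across each row.
#With 0/1 entries, that row sum is the number of documents containing the term.
library(slam)
toySparse <- simple_triplet_matrix(i = c(1L, 1L, 2L, 3L),   # row (term) index of each non-zero cell
                                   j = c(1L, 2L, 2L, 3L),   # column (document) index of each non-zero cell
                                   v = c(1, 1, 1, 1),
                                   dimnames = list(Terms = c("apple", "banana", "cherry"),
                                                   Docs = c("d1", "d2", "d3")))
toyDocFreq <- row_sums(toySparse)   # apple = 2, banana = 1, cherry = 1
#If you build a DocumentTermMatrix instead, the roles of rows and columns swap and col_sums is the counterpart.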

#Step 8 -- after collapsing and reducing the size of the big matrix, you will want to save the result as a data frame (it is easier to run some functions on a data frame).
#First save it as a matrix and then as a data frame.
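
#A sketch of the two conversions, assuming the document frequencies are held in a named vector
#(toyFreq is a hypothetical example, not assignment data).
toyFreq <- c(apple = 2, banana = 1, cherry = 1)
toyFreqMatrix <- as.matrix(toyFreq)                      # 3 x 1 matrix; the term names become row names
toyFreqDf <- data.frame(term = rownames(toyFreqMatrix),
                        docFreq = toyFreqMatrix[, 1],
                        row.names = NULL,
                        stringsAsFactors = FALSE)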

#Step 9 -- getting close to getting the idf

#Once you have the document frequency for each term saved in a data frame, you will want to apply a function to each row, which can be done with the apply() function.
#The number of documents can easily be obtained from the earlier steps, in particular from the dimensions (dim()) of the large array discussed in Step 6.
#Now you have all the information you need to calculate the idf. You just need to apply the idf function to each of the rows/terms (http://petewerner.blogspot.com/2012/12/using-apply-sapply-lapply-in-r.html)
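
#A sketch of the row-wise calculation. The script does not spell out the idf formula, so this assumes the common
#form idf = log(N / df) with the natural logarithm; use whichever definition was given in class.
#toyCounts, toyN and toyIdfFun are hypothetical names, and toyN stands in for the document count you would read off dim().
toyCounts <- data.frame(docFreq = c(2, 1, 1), row.names = c("apple", "banana", "cherry"))
toyN <- 3
toyIdfFun <- function(docFreq) log(toyN / docFreq)
toyCounts$idf <- apply(toyCounts["docFreq"], 1, toyIdfFun)   # MARGIN = 1 applies the function to each row
#Because log() is vectorised, log(toyN / toyCounts$docFreq) gives the same values without apply().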

#Step 10 -- sorting the idf score and writing it out to a file
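
#A sketch of this last step, assuming the idf scores sit in a data frame with one row per term.
#toyScores and toyOutFile are hypothetical; the values are made up, the descending sort order is an assumption,
#and the output goes to a temporary folder rather than your assignment folder.
toyScores <- data.frame(term = c("apple", "banana", "cherry"),
                        idf = c(0.4, 1.1, 1.1),
                        stringsAsFactors = FALSE)
toySorted <- toyScores[order(-toyScores$idf, toyScores$term), ]   # highest idf first, ties broken alphabetically
toyOutFile <- file.path(tempdir(), "idf_scores.csv")
write.csv(toySorted, toyOutFile, row.names = FALSE)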

#This seems like lots of steps but they can all be completed with the right functions and packages. 

#Good luck and feel free to ask questions!