## Text Mining the 2016-2017 ARS Funded Projects to Identify Food Safety Research

Each numbered step below corresponds to a step in the flow chart we created for this analysis.
Step 1: Load the 2016-2017 ARS funded projects into R (change the file path below to wherever the CSV file is saved on your machine). To ensure the strings in the data frame aren't treated as factors, we set 'stringsAsFactors' to FALSE.
ars=read.csv("C:/Users/josh.katz/Desktop/machine_learning/ars_funded_2016_2017.csv",stringsAsFactors = FALSE)
Step 2: Verify that the CSV loaded correctly, then paste the Title and Objective columns together into a new combination column. Use the str, head, and tail commands to check the dimensions and display the first and last rows of the combination column.
str(ars);head(ars);tail(ars) ##GET DIMENSIONS TO VERIFY PROPER LOADING OF csv AND LOOK AT DATA
ars[91,1];ars[91,2] ##COMPARE LAST ROW TO EXCEL SPREADSHEET
ars$combination=paste(ars$Title,ars$Objective) ##PASTE TITLE AND OBJECTIVE COLUMNS TOGETHER
str(ars);head(ars$combination);tail(ars$combination) ##ADD THE STR AND TAIL FUNCTIONS HERE TO VERIFY NEW DIMENSIONS AND THAT ROW 91 WAS PASTED TOGETHER
Step 3: We load library(tm) for the text mining commands and library(wordcloud) to generate the word cloud. We do this only after installing the 'tm' and 'wordcloud' packages, as shown (commented out) below.
#install.packages('tm')
#install.packages('wordcloud')
library(tm)
library(wordcloud)
Step 4: Remove the punctuation marks and numbers from the combination column using regular expressions and the gsub command. The gsub command looks for the built-in punctuation character class and for digits in the combination column and replaces each match with a space.
head(ars$combination) ##BASELINE TEXT WITH PUNCTUATION AND NUMBERS
ars$combination=gsub("[[:punct:]]", " ", ars$combination) ##SUBSTITUTE PUNCTUATION WITH BLANK
ars$combination=gsub('[0-9]+',' ',ars$combination) ##SUBSTITUTE NUMBERS WITH BLANK
head(ars$combination) ##COMPARE TO BASELINE TEXT TO VERIFY THAT NUMBERS AND PUNCTUATION HAVE BEEN REMOVED
Step 5.a: To create a term document matrix, we first need to create the corpus. A corpus is the main structure for managing documents in the text mining (tm) library; it stores the text in a form the text mining functions can work with efficiently. We wrap the combination column in VectorSource so that each row of the data frame becomes one document in the corpus. We then use the tm_map function to apply transformations to the corpus; in this instance it converts all words to lower case and strips extra whitespace.
ars_corpus=Corpus(VectorSource(ars$combination))
print(ars_corpus) ##CONVERTED DATAFRAME TO CORPUS WITH 91 DOCUMENTS
head(ars_corpus[[91]]) ##BASELINE 91ST DOCUMENT
corpus_clean=tm_map(ars_corpus,tolower)
corpus_clean=tm_map(corpus_clean,stripWhitespace)
tail(corpus_clean$content) ##ADD TAIL FUNCTION TO COMPARE DOCUMENT 91 TO BASELINE
##TEXT CONVERTED TO LOWER CASE AND EXTRA WHITE SPACE REMOVED
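As an aside, tm also provides built-in transformations that do roughly what the gsub calls in Step 4 did; a minimal sketch is shown commented out below (note that removePunctuation deletes punctuation outright rather than replacing it with a space, so results can differ slightly from the gsub approach used here).
#corpus_clean=tm_map(corpus_clean,removePunctuation) ##EQUIVALENT CLEANUP USING tm'S BUILT-IN TRANSFORMATION
#corpus_clean=tm_map(corpus_clean,removeNumbers) ##COMMENTED OUT BECAUSE STEP 4 ALREADY REMOVED PUNCTUATION AND NUMBERS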
Step 5.b: We create the Term Document Matrix (TDM) using the code shown below. The TDM is a matrix with one row per unique term and one column per document, where each cell holds how many times that term occurs in that document. The output shown below is just a sample of the TDM. From the output we know: 1. There are a total of 2567 unique terms across the 91 documents. 2. There are 7753 non-zero entries in the matrix and 225844 zero entries; a zero entry simply means a term never appears in a given document, and since most terms show up in only a handful of the 91 projects, most of the 2567 x 91 = 233597 cells are zero. 3. The percentage of zero entries (the sparsity reported by tm) is therefore about 97%. The short check after the inspect call below reproduces these counts.
ars_tdm=TermDocumentMatrix(ars_corpus)
ars_tdm
inspect(ars_tdm[1:10,1:10]) ##DOCUMENT 1 HAS "ANIMAL" 4 TIMES WHICH IS VERIFIED WITH ORIGINAL SOURCE SPREADSHEET
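As a quick verification of the counts quoted above (an extra check, not part of the flow chart), we can expand the sparse TDM into an ordinary dense matrix, which is fine at this size, and count the zero and non-zero cells directly.
dense=as.matrix(ars_tdm) ##EXPAND THE SPARSE TDM INTO AN ORDINARY 2567 x 91 MATRIX
sum(dense!=0);sum(dense==0) ##SHOULD MATCH THE 7753 NON-ZERO AND 225844 ZERO ENTRIES REPORTED ABOVE
round(100*sum(dense==0)/length(dense)) ##PERCENTAGE OF ZERO CELLS, I.E. THE ~97% SPARSITY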
Step 5.c: Get the most frequent terms using the 'findFreqTerms' command. This command takes the TDM and a lower frequency bound (e.g. 25) and returns the terms that occur at least that many times throughout the corpus. From this output we pick out the words that are unrelated to food safety or too generic, store them as stop words, and remove them from the corpus in the next step.
findFreqTerms(ars_tdm,25) ##"AND" IS THE MOST FREQUENT TERM IN ALL 91 DOCUMENTS FOLLOWED BY "ANIMAL"
Step 5.d: Remove the remaining stop words from the corpus using the tm_map function; below you can see the list of words we want removed. We then apply 'stripWhitespace' again to collapse the extra blank space created by removing those words (and earlier by removing the punctuation marks and numbers). A quick check after this block confirms that the stop words are gone.
tail(corpus_clean$content) ##BASELINE LAST 6 DOCUMENTS
corpus_clean=tm_map(corpus_clean, removeWords, c('including','based','improved','used','assess','specific','ability','studies','characterize','are','their','develop','that','use','the','will','using','can','objective','subobjective', 'this', 'and', 'for', 'with', stopwords('english')))
corpus_clean=tm_map(corpus_clean,tolower)
corpus_clean=tm_map(corpus_clean,stripWhitespace)
tail(corpus_clean$content) ##STOPWORDS LOOK REMOVED FROM 6 DOCUMENTS
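For that quick check (an added step, assuming TermDocumentMatrix accepts the cleaned corpus directly, which it does for the corpus created with Corpus(VectorSource(...)) above), rebuild the TDM from the cleaned corpus and confirm that 'and', 'the', and the other removed words no longer appear among the frequent terms.
clean_tdm=TermDocumentMatrix(corpus_clean) ##REBUILD THE TDM FROM THE CLEANED CORPUS
findFreqTerms(clean_tdm,25) ##THE REMOVED STOP WORDS SHOULD NO LONGER APPEAR IN THIS LIST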
Step 6: Use the command below to generate the word cloud. The word cloud can also be generated by knitting this file into an HTML document.
wordcloud(corpus_clean, scale=c(2.5,0.5), max.words=100, random.order=FALSE, rot.per=0.35) ##MOST FREQUENT SINGLE WORDS HAVE LARGEST FONT
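Word placement and rotation in the cloud involve random draws, so the layout can shift slightly each time the file is knitted; setting a seed first keeps it reproducible (the seed value 1234 below is arbitrary).
set.seed(1234) ##ARBITRARY SEED SO THE CLOUD LAYOUT IS THE SAME ON EVERY KNIT
wordcloud(corpus_clean, scale=c(2.5,0.5), max.words=100, random.order=FALSE, rot.per=0.35)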
Step 7: Install and load the library "tau" in order to get the most commonly occurring bigrams and trigrams. Run the code below to generate lists of the most frequent single words, bigrams, and trigrams (here "most frequent" counts a word every time it appears, including repeated occurrences within the same project). Looking at bigrams and trigrams improves our chances of correctly identifying projects related to food safety. The most frequent bigram is "foodborne pathogens".
#install.packages('tau')
library(tau)
grams = textcnt(corpus_clean, n = 1, method = "string") ##ADD SINGLE WORDS
grams =grams[order(grams, decreasing = TRUE)]
grams[1:50]
bigrams = textcnt(corpus_clean, n = 2, method = "string")
bigrams = bigrams[order(bigrams, decreasing = TRUE)]
bigrams[1:50]
trigrams = textcnt(corpus_clean, n = 3, method = "string")
trigrams = trigrams[order(trigrams, decreasing = TRUE)]
trigrams[1:25]
Step 8: We find the projects in which the phrase "foodborne pathogen" occurs. The grep command returns the indices of the documents in which a given pattern is present. We then add an additional column to the dataset which acts as a classifier, using the code shown below: a project gets a 1 if "foodborne pathogen" appears in its combined text and a 0 otherwise. This classifier flags the projects most likely to be food safety projects.
single=grep("pathogens",corpus_clean$content) ##ADD SINGLE WORD "PATHOGENS"
double=grep("foodborne pathogen",corpus_clean$content)
length(single);length(double) ##39 documents with pathogens and 27 with foodborne pathogen
double %in% single ##ALL BIGRAMS ARE ACCOUNTED FOR IN THE SINGLE VECTOR EXCEPT 2 DOCUMENTS 9&13
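##THE NEXT LINE IS AN ADDED CONVENIENCE CHECK (NOT PART OF THE ORIGINAL FLOW CHART): IT PRINTS THE DOCUMENTS THAT CONTAIN THE BIGRAM BUT NOT THE PLURAL "PATHOGENS"
double[!(double %in% single)] ##SHOULD RETURN THE TWO DOCUMENTS NOTED IN THE COMMENT ABOVE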
ars$classifier=ifelse(grepl("foodborne pathogen",corpus_clean$content),1,0) ##DEVELOP CLASSIFIER COLUMN AND ASSIGN EACH DOCUMENT A 1 OR 0
str(ars);double ## 0's and 1's match
findAssocs(ars_tdm,"pathogens",0.3) ##LIST TERMS WHOSE OCCURRENCE ACROSS DOCUMENTS CORRELATES WITH "PATHOGENS" AT 0.3 OR HIGHER, SUGGESTING ADDITIONAL SEARCH TERMS FOR THE CLASSIFIER
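If findAssocs surfaces additional food safety terms worth matching, they can be folded into the classifier with a single regular expression. A minimal sketch is shown below; the extra_terms vector is a placeholder to be filled in with terms chosen from the findAssocs output.
extra_terms=c("foodborne pathogen","food safety") ##PLACEHOLDER LIST; REPLACE WITH TERMS CHOSEN FROM THE findAssocs OUTPUT
pattern=paste(extra_terms,collapse="|") ##COMBINE THE TERMS INTO ONE "term1|term2" REGULAR EXPRESSION
ars$classifier=ifelse(grepl(pattern,corpus_clean$content),1,0) ##A PROJECT GETS A 1 IF ANY OF THE TERMS APPEAR
table(ars$classifier) ##COUNT HOW MANY PROJECTS ARE FLAGGED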