Purpose

Feature sets F3 and F4 were built with the Stanford Grammatical Parser. To keep the parser working properly, only minimal cleaning was applied beforehand, because stopwords, punctuation and full words (rather than stems) are all needed to determine grammatical dependencies. To reduce the dimensionality of the feature set, words were cleaned and stemmed after the dependency parsing step. The following subsections describe this process.
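The dimensionality argument can be illustrated with a small sketch. The two fragments below are made-up examples rather than thesis data, and `vocabulary` and `stem_tokens` are hypothetical helper names. Stemming is only applied when SnowballC is installed; otherwise the tokens are passed through unchanged.

```r
# Toy fragments (hypothetical, not taken from the thesis data)
fragments <- c("I made my booking via Booking com .",
               "We booked the rooms and the room was booked twice .")

# Split on whitespace and return the set of distinct tokens
vocabulary <- function(x) unique(unlist(strsplit(x, "[[:space:]]+")))

# Stem with SnowballC when it is installed; otherwise leave tokens unchanged
stem_tokens <- function(tokens) {
  if (requireNamespace("SnowballC", quietly = TRUE)) {
    SnowballC::wordStem(tokens, language = "english")
  } else {
    tokens
  }
}

vocab_raw     <- vocabulary(fragments)
vocab_lower   <- vocabulary(tolower(fragments))
vocab_stemmed <- unique(stem_tokens(vocab_lower))

# Each normalisation step can only merge tokens, never create new ones
print(c(raw        = length(vocab_raw),
        lowercased = length(vocab_lower),
        stemmed    = length(vocab_stemmed)))
```

Every distinct token becomes a feature column later on, so a smaller vocabulary translates directly into a narrower feature matrix.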

1. Preparation

#install.packages("tokenizers")
#install.packages("FSelectorRcpp")
library(NLP)
library(FSelectorRcpp)
library(tokenizers)
## Warning: package 'tokenizers' was built under R version 3.4.4
library(tm)
## Warning: package 'tm' was built under R version 3.4.3
library(SnowballC)
library(stats)
library(readxl)
setwd("~/Google Drive/UM/Smart Services/Thesis/Thesis/Code/Data Cleaning/Feature Set 3 B/Code")
Data <- read_excel("~/Google Drive/UM/Smart Services/Thesis/Thesis/Code/Data Cleaning/Feature Set 3 B/Input/6.POS Set.xlsx")
Text <- Data$POS_Text

2. Cleaning the data

#Prior Text State
print(head(Text,1))
## [1] "I am so angry that i made this post available via all possible sites i use when planing my trips so will make the mistake of booking this place"
#Remove numbers
Clean.Text <- removeNumbers(Text)
#Collapse runs of whitespace into a single space
Clean.Text <- stripWhitespace(Clean.Text)
print(head(Clean.Text,2))
## [1] "I am so angry that i made this post available via all possible sites i use when planing my trips so will make the mistake of booking this place"
## [2] "I made my booking via booking com ."
#Convert to lower case
Clean.Text <- tolower(Clean.Text)
print(head(Clean.Text,2))
## [1] "i am so angry that i made this post available via all possible sites i use when planing my trips so will make the mistake of booking this place"
## [2] "i made my booking via booking com ."
#Stem each word (tm::stemDocument relies on SnowballC)
Clean.Text <- stemDocument(Clean.Text)
print(head(Clean.Text,2))
## [1] "i am so angri that i made this post avail via all possibl site i use when plane my trip so will make the mistak of book this place"
## [2] "i made my book via book com ."
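#Build the output table: a running ID plus the cleaned fragments
#(row count is taken from the data instead of a hard-coded 4735)
df <- data.frame(ID = seq_along(Clean.Text))
df$Review.Fragments <- Clean.Text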
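The cleaning steps above could also be bundled into one function, which makes it easier to re-run them on another feature set. The following is a minimal sketch under stated assumptions, not the code used for the thesis: `clean_fragments` and `df_sketch` are hypothetical names, the `gsub()` calls are base-R stand-ins that approximate `removeNumbers()` and `stripWhitespace()`, and stemming is skipped when SnowballC is not installed.

```r
# Hypothetical helper: base-R stand-ins for the tm calls used above
clean_fragments <- function(x) {
  x <- gsub("[[:digit:]]+", "", x)   # drop digits
  x <- gsub("[[:space:]]+", " ", x)  # collapse whitespace runs
  x <- tolower(x)                    # lower-case everything
  # Stem word by word, only if SnowballC is available
  if (requireNamespace("SnowballC", quietly = TRUE)) {
    x <- vapply(strsplit(x, " ", fixed = TRUE), function(tokens) {
      paste(SnowballC::wordStem(tokens, language = "english"), collapse = " ")
    }, character(1))
  }
  x
}

# Made-up input fragment for illustration
cleaned <- clean_fragments(c("I made 2 bookings  via Booking com ."))

# Size the ID column from the vector itself rather than a fixed row count
df_sketch <- data.frame(ID = seq_along(cleaned),
                        Review.Fragments = cleaned,
                        stringsAsFactors = FALSE)
print(df_sketch)
```

The order of the steps matters: removing digits can leave two adjacent spaces behind, so whitespace is collapsed afterwards, as in the original sequence.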