Subsetting

Purpose

To perform part-of-speech (POS) tagging on the remaining text elements, a subset excluding the negation-tagged words was created. The following section presents the code.

library(tokenizers)

## Warning: package 'tokenizers' was built under R version 3.4.4

library(tm)

## Warning: package 'tm' was built under R version 3.4.3

## Loading required package: NLP

library(stringr)

setwd("~/Google Drive/UM/Smart Services/Thesis/Thesis/Code/Feature Set3/Code/5. Subsetting")

IMPORT DATA

NegationTaggedReviews <- readxl::read_excel("~/Google Drive/UM/Smart Services/Thesis/Thesis/Code/Feature Set3/Input/5.Reviews with Negation Tagging.xlsx")

Destination.Text <- as.list(NegationTaggedReviews$Tagged)

Create POS Subset Without Negations

Since the negated words are already tagged, a subset without them is created for further tagging.

POS_Set <- list()

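# Loop over every review instead of relying on a hard-coded row count
for (o in seq_along(Destination.Text)){
  Extract <- unlist(Destination.Text[[o]])
  # Remove every token that carries the "_NOT" negation tag
  Extract <- gsub(pattern = "\\S+_NOT\\b",replacement = "",x = Extract)
  # Collapse the extra whitespace left behind by the removed tokens
  POS_Set[[o]] <- stripWhitespace(Extract)
}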

POS_Set[[1]]

## [1] "I am so angry that i made this post available via all possible sites i use when planing my trips so will make the mistake of booking this place"
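The effect of the pattern can be checked on a small example. The following is a minimal sketch in base R, assuming two made-up tagged reviews (the vector `tagged` and its contents are hypothetical and not taken from the data set). Because `gsub()` is vectorised, it can be applied to a whole character vector in one call, and a second `gsub()` stands in for `stripWhitespace()` from tm.

```r
# Hypothetical tagged reviews; the "_NOT" suffix mimics the negation tags above
tagged <- c("the room was not clean_NOT at all",
            "staff were friendly")

# gsub() is vectorised, so every element is processed in a single call
stripped <- gsub(pattern = "\\S+_NOT\\b", replacement = "", x = tagged)

# Collapse runs of whitespace left behind (base-R stand-in for tm::stripWhitespace)
pos_text <- gsub("[[:space:]]+", " ", stripped)

print(pos_text)
```

The first element loses its tagged token and the doubled space, while the second element, which contains no tag, is returned unchanged.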

df <- data.frame(matrix(NA, nrow = 4735, ncol = 1))
df$POS_Text <- POS_Set

WriteXLS::WriteXLS(df,ExcelFileName = "6.POS Set.xlsx")

Extract Negations

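# str_extract() returns the first "_NOT"-tagged token of each review, or NA if there is none
Neg.Words <- str_extract(string = Destination.Text,pattern = "\\S+_NOT\\b")
# Drop the NA entries; logical indexing also works when no NA is present
Neg.Words.Clean <- Neg.Words[!is.na(Neg.Words)]
# Row numbers of the reviews that contain a negated word
Index <- grep("\\S+_NOT\\b",Neg.Words)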

df.Neg <- data.frame(matrix(seq(1,371),ncol=1,nrow=371))
df.Neg$Index <- Index
df.Neg$Neg.Words <- Neg.Words.Clean

WriteXLS::WriteXLS(df.Neg,ExcelFileName = "7. Negated Words and Index.xlsx")
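Note that `str_extract()` keeps only the first match per string, so a review with several negated words contributes a single row. If every negated word were needed, one possible approach is sketched below in base R. The vector `reviews` and the data frame `neg_all` are hypothetical names used for illustration only, and the row count is derived from the matches rather than hard-coded.

```r
# Hypothetical tagged reviews, not taken from the data set
reviews <- c("not clean_NOT and never helpful_NOT",
             "great stay",
             "no wifi_NOT")

# gregexpr() locates all matches; regmatches() extracts them as a list,
# with character(0) for reviews that contain no tagged token
matches <- regmatches(reviews, gregexpr("\\S+_NOT\\b", reviews))

# Repeat each review index once per match so index and word stay aligned
neg_all <- data.frame(
  Index = rep(seq_along(reviews), lengths(matches)),
  Neg.Words = unlist(matches),
  stringsAsFactors = FALSE
)

print(neg_all)
```

Here the first review yields two rows, the second none, and the third one row.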

Lisa

7/14/2018
