#Load Packages
library(tidyverse)
library(tm) # Will use to create corpus and modify text therein.
library(SnowballC) # Will use for "stemming." 
library(rpart) # Will use to construct a CART model.
library(rpart.plot) # Will use to plot CART tree.
library(gridExtra)

1a) Let us first explore the dataset. What is the number of reviews associated with each rating? What is the average length for each of the five ratings? Please comment briefly. [10 pts] [Hint: The number of characters can be obtained with the nchar function.]

reviews = read.csv("/Users/kayhanbabakan/OneDrive/MIT/Analytics Edge/HWK5/airbnb-small.csv", stringsAsFactors = FALSE)
aggregate(id~review_scores_rating,reviews,length)
review_text = Corpus(VectorSource(reviews$comments)) 
review_charcount = data.frame(sapply(review_text, nchar))
mean(review_charcount[,"sapply.review_text..nchar."])
[1] 297.6779

Comments
The number of reviews for each rating is shown in the table above. There are a lot of reviews in the 5-star range and far fewer at lower ratings. The average character count across all reviews is 297.6779295; the code above computes this overall mean rather than a separate mean for each rating.


1b) Create a corpus based on the comments column of the dataset. Clean the corpus by converting the text to lowercase, removing all stop words, removing the word “airbnb”, and stemming the document. What is the text of the first three documents in the corpus? [5 pts]

corpus = Corpus(VectorSource(reviews$comments))
corpus = tm_map(corpus,tolower)
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, removeWords, c("airbnb"))
corpus = tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, removePunctuation)
suppressWarnings(strwrap(corpus[[1]]))
[1] "good stay issu direct instruct differ actual properti"
suppressWarnings(strwrap(corpus[[2]]))
[1] "home quiet neighborhood flushing room stay newli renovated enjoy stay access manhattan via one bus 7 train recommend host"
[2] "home"                                                                                                                     
suppressWarnings(strwrap(corpus[[3]]))
[1] "ernest welcom us generous beauti home plenti inform apart area enjoy wonder stay comfort avail surroundings can recommend"
[2] "ernest host apart great manhattan get away"                                                                               

1c) Calculate the word frequencies in the corpus. What are the words that appear in 800 reviews or more? Then, remove the words that appear in less than 1% of the reviews. How many words do you still have after sparsifying the corpus? [5 pts]

frequencies = DocumentTermMatrix(corpus)
findFreqTerms(frequencies, lowfreq=800)
 [1] "stay"      "host"      "recommend" "room"      "apart"     "great"     "clean"     "locat"     "place"     "subway"    "nice"     
[12] "realli"   
sparse = removeSparseTerms(frequencies, 0.99)
document_terms = as.data.frame(as.matrix(sparse))
ncol(document_terms)
[1] 413

1d) Convert your corpus to a dataframe containing a row for each review, a column for each word in the sparsified corpus, a column for the review length, and a column for the dependent variable (i.e., whether the review is positive or negative). Next, split the dataframe into a training set comprising all reviews up to December 31, 2017, and a test set comprising all from January 1, 2018 onward. What is the proportion of positive reviews in the training set and in the test set? Comment briefly.

document_terms$char_count = sapply(review_text, nchar)
document_terms$review_scores_rating = reviews$review_scores_rating >= 4
document_terms$date = reviews$date
split1 = (document_terms$date <= "2017-12-31")
split2 = (document_terms$date >= "2018-01-01")
train = document_terms[split1,]
test = document_terms[split2,]
train$date = as.Date(train$date)
test$date = as.Date(test$date)

train_tot=nrow(train)
train_tru = nrow(subset(train,review_scores_rating==TRUE))
test_tot=nrow(test)
test_tru = nrow(subset(test,review_scores_rating==TRUE))

train_tru/train_tot
[1] 0.9269134
test_tru/test_tot
[1] 0.9583756

Comments
The proportion of positive reviews is roughly the same in the two sets: about 92.7% in the training set and 95.8% in the test set. The proportions are not identical because the data was split by date rather than stratified. They are still close because negative reviews are rare overall.


1e) Construct a CART model manually (without performing cross-validation). Attach the images of three trees obtained with three values of cp. Discuss and interpret the variables selected by the models. [15 pts] Note: If you wish, you can look for comments containing a particular word with the following command: reviews[grepl("wordtosearch", reviews$comments), "comments"]

tree.cp01 = rpart(review_scores_rating ~. -char_count -date, data=train,method = "class", control = rpart.control(cp=.01))
prp(tree.cp01)


tree.cp001 = rpart(review_scores_rating ~. -char_count -date, data=train,method = "class", control = rpart.control(cp=.001))
prp(tree.cp001)


tree.cp005 = rpart(review_scores_rating ~. -char_count -date, data=train, method = "class", control = rpart.control(cp=.005))
prp(tree.cp005)

Comments
Following the path of the simple model: if the word "bathroom" is not used at all, the rating is predicted as positive. If "bathroom" is mentioned once or more and the word "expect" is used, the rating is predicted as poor, since this could also indicate something unexpected. If "expect" is not used and the word "door" is used, we look at the word "recommend": if "recommend" is not used, the rating is poor; if it is used, the rating is positive.


1f) Propose a simple baseline model. Report the accuracy, the true positive rate and the false positive rate of each of your three CART models and of your baseline model on the test set. Comment briefly on your results, including the magnitude of your true positive and false positive rates. [10 pts]

#baselineprediction
baseline = matrix(0,2,2)
baseline[1,2]=223
baseline[2,2]=2955
colnames(baseline)=c("False","True")
rownames(baseline)=c("False","True")
baseline
      False True
False     0  223
True      0 2955
#r-part prediction
pred = predict(tree.cp01, newdata=test, type="class")
confusionmatrix.cp01 = table(test$review_scores_rating,pred)
confusionmatrix.cp01
       pred
        FALSE TRUE
  FALSE     3   38
  TRUE      8  936
pred = predict(tree.cp001, newdata=test, type="class")
confusionmatrix.cp001 = table(test$review_scores_rating,pred)
confusionmatrix.cp001
       pred
        FALSE TRUE
  FALSE     8   33
  TRUE     14  930
pred = predict(tree.cp005, newdata=test, type="class")
confusionmatrix.cp005 = table(test$review_scores_rating,pred)
confusionmatrix.cp005
       pred
        FALSE TRUE
  FALSE     5   36
  TRUE     12  932
---
title: "Homework 5 Text Analytics"
author: "Kayhan Babakan </br> Analytics Edge 15.071"
output:
  html_notebook: default
  html_document:
    df_print: paged
  pdf_document: default
  word_document: default
---
```{r}
#Load Packages
library(tidyverse)
library(tm) # Will use to create corpus and modify text therein.
library(SnowballC) # Will use for "stemming." 
library(rpart) # Will use to construct a CART model.
library(rpart.plot) # Will use to plot CART tree.
library(gridExtra)
```
<b>
1a) Let us first explore the dataset. What is the number of reviews associated with each rating? What is the average length for each of the five ratings? Please comment briefly. [10 pts]
[Hint: The number of characters can be obtained with the nchar function.]</b>

```{r}
reviews = read.csv("/Users/kayhanbabakan/OneDrive/MIT/Analytics Edge/HWK5/airbnb-small.csv", stringsAsFactors = FALSE)
aggregate(id~review_scores_rating,reviews,length)
review_text = Corpus(VectorSource(reviews$comments)) 
review_charcount = data.frame(sapply(review_text, nchar))
mean(review_charcount[,"sapply.review_text..nchar."])
```
<small><b> Comments </b></br>
The number of reviews for each rating is shown in the table above. There are a lot of reviews in the 5-star range and far fewer at lower ratings. The average character count across all reviews is `r mean(review_charcount[,"sapply.review_text..nchar."])`; the code above computes this overall mean rather than a separate mean for each rating.
</small></br></br>
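
The question asks for the average length for each rating, so a per-rating breakdown may be useful alongside the overall mean. The following is a minimal sketch in base R; `toy_reviews` and its contents are made up for illustration and only mirror the `review_scores_rating` and `comments` columns used above.

```r
# Hypothetical stand-in for the reviews data (not the real file).
toy_reviews <- data.frame(
  review_scores_rating = c(5, 5, 4, 3, 1),
  comments = c("Great stay!",
               "Lovely host, clean room.",
               "Nice place.",
               "It was okay, a bit noisy at night.",
               "Dirty bathroom and a rude host."),
  stringsAsFactors = FALSE
)

# Length of each review in characters.
toy_reviews$char_count <- nchar(toy_reviews$comments)

# Mean review length per rating: one row per distinct rating.
avg_length_by_rating <- aggregate(char_count ~ review_scores_rating,
                                  data = toy_reviews, FUN = mean)
print(avg_length_by_rating)
```

The same `aggregate` call should work on the real data once a character-count column has been added to `reviews`.
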
<b>1b) Create a corpus based on the comments column of the dataset. Clean the corpus by converting the text to lowercase, removing all stop words, removing the word “airbnb”, and stemming the document. What is the text of the first three documents in the corpus? [5 pts]</b>

```{r warning=FALSE}
corpus = Corpus(VectorSource(reviews$comments))
corpus = tm_map(corpus,tolower)
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, removeWords, c("airbnb"))
corpus = tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, removePunctuation)
suppressWarnings(strwrap(corpus[[1]]))
suppressWarnings(strwrap(corpus[[2]]))
suppressWarnings(strwrap(corpus[[3]]))
```
<b>
1c) Calculate the word frequencies in the corpus. What are the words that appear in 800 reviews or more? Then, remove the words that appear in less than 1% of the reviews. How many words do you still have after sparsifying the corpus? [5 pts]</b>
```{r}
frequencies = DocumentTermMatrix(corpus)
findFreqTerms(frequencies, lowfreq=800)
sparse = removeSparseTerms(frequencies, 0.99)
document_terms = as.data.frame(as.matrix(sparse))
ncol(document_terms)
```
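
To make the 0.99 threshold concrete: `removeSparseTerms(frequencies, 0.99)` keeps the terms whose share of empty documents is below 99%, that is, terms that occur in more than 1% of the reviews. The sketch below imitates that rule in base R on a made-up matrix; it is an illustration of the idea, not the `tm` implementation, and the term names and counts are hypothetical.

```r
# Hypothetical document-term matrix: 200 documents, 3 terms.
n_docs <- 200
toy_dtm <- matrix(0, nrow = n_docs, ncol = 3,
                  dimnames = list(NULL, c("stay", "subway", "rare")))
toy_dtm[1:150, "stay"] <- 1   # occurs in 75% of documents
toy_dtm[1:3, "subway"] <- 2   # occurs in 1.5% of documents
toy_dtm[1, "rare"] <- 1       # occurs in 0.5% of documents

# Share of documents in which each term occurs at least once.
doc_share <- colMeans(toy_dtm > 0)

# Keep terms that occur in more than (1 - threshold) of the documents.
sparse_threshold <- 0.99
kept_terms <- names(doc_share)[doc_share > 1 - sparse_threshold]
print(kept_terms)
```
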

<b>1d) Convert your corpus to a dataframe containing a row for each review, a column for each word in the sparsified corpus, a column for the review length, and a column for the dependent variable (i.e., whether the review is positive or negative). Next, split the dataframe into a training set comprising all reviews up to December 31, 2017, and a test set comprising all from January 1, 2018 onward. What is the proportion of positive reviews in the training set and in the test set? Comment briefly. </br> </b>

```{r}
document_terms$char_count = sapply(review_text, nchar)
document_terms$review_scores_rating = reviews$review_scores_rating >= 4
document_terms$date = reviews$date
split1 = (document_terms$date <= "2017-12-31")
split2 = (document_terms$date >= "2018-01-01")
train = document_terms[split1,]
test = document_terms[split2,]
train$date = as.Date(train$date)
test$date = as.Date(test$date)

train_tot=nrow(train)
train_tru = nrow(subset(train,review_scores_rating==TRUE))
test_tot=nrow(test)
test_tru = nrow(subset(test,review_scores_rating==TRUE))

train_tru/train_tot
test_tru/test_tot

```
<small>
<b> Comments </b></br>
The proportion of positive reviews is roughly the same in the training and test sets. The proportions are not identical because the data was split by date rather than stratified. They are still close because negative reviews are rare overall.</small></br></br>
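
The split above compares the date column as text, which gives the right order only when every date is written as `YYYY-MM-DD`. A slightly more defensive sketch converts to `Date` first and then splits on a single cutoff. The data frame `toy` and its values are hypothetical, and the ISO date format is an assumption.

```r
# Hypothetical reviews with ISO-formatted dates and a positive/negative label.
toy <- data.frame(
  date = c("2017-06-15", "2017-12-31", "2018-01-01", "2018-03-20", "2018-05-02"),
  positive = c(TRUE, FALSE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Convert before comparing so the comparison is on calendar dates, not strings.
toy$date <- as.Date(toy$date)
cutoff <- as.Date("2017-12-31")

toy_train <- toy[toy$date <= cutoff, ]  # up to and including the cutoff
toy_test  <- toy[toy$date > cutoff, ]   # strictly after the cutoff

# The mean of a logical vector is the proportion of TRUE values.
train_share <- mean(toy_train$positive)
test_share  <- mean(toy_test$positive)
print(c(train = train_share, test = test_share))
```

Using one cutoff with `<=` and `>` also guarantees that every row lands in exactly one of the two sets.
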

<b>1e) Construct a CART model manually (without performing cross-validation). Attach the images of three trees obtained with three values of cp. Discuss and interpret the variables selected by the models. [15 pts]
Note: If you wish, you can look for comments containing a particular word with the following command: reviews[grepl("wordtosearch", reviews$comments), "comments"]</b></br>

```{r}
tree.cp01 = rpart(review_scores_rating ~. -char_count -date, data=train,method = "class", control = rpart.control(cp=.01))
prp(tree.cp01)

tree.cp001 = rpart(review_scores_rating ~. -char_count -date, data=train,method = "class", control = rpart.control(cp=.001))
prp(tree.cp001)

tree.cp005 = rpart(review_scores_rating ~. -char_count -date, data=train, method = "class", control = rpart.control(cp=.005))
prp(tree.cp005)
```
<small>
<b> Comments </b></br>
Following the path of the simple model: if the word "bathroom" is not used at all, the rating is predicted as positive.
If "bathroom" is mentioned once or more and the word "expect" is used, the rating is predicted as poor, since this could also indicate something unexpected. If "expect" is not used and the word "door" is used, we look at the word "recommend": if "recommend" is not used, the rating is poor; if it is used, the rating is positive.</small></br></br>
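
To see how `cp` controls tree size, independently of the review data, one can fit the same formula at several `cp` values and count the leaves. The sketch below uses `kyphosis`, a small example dataset that ships with `rpart`; the helper `count_leaves` is a hypothetical name introduced here, and the dataset is only a stand-in for the training set above.

```r
library(rpart)  # kyphosis is an example dataset shipped with rpart

# Hypothetical helper: number of terminal nodes in a fitted rpart tree.
count_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

# Fit the same model at decreasing cp values.
cp_values <- c(0.1, 0.01, 0.001)
leaf_counts <- sapply(cp_values, function(cp) {
  fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
               method = "class", control = rpart.control(cp = cp))
  count_leaves(fit)
})
names(leaf_counts) <- paste0("cp=", cp_values)
print(leaf_counts)
```

A smaller `cp` accepts splits with smaller improvements, so the leaf count can only stay the same or grow as `cp` decreases.
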

<b>1f) Propose a simple baseline model. Report the accuracy, the true positive rate and the false positive rate of each of your three CART models and of your baseline model on the test set. Comment briefly on your results, including the magnitude of your true positive and false positive rates. [10 pts]</b></br>

```{r}
#baseline prediction: label every test review as positive (the majority class)
baseline_pred = factor(rep(TRUE, nrow(test)), levels = c(FALSE, TRUE))
baseline = table(test$review_scores_rating, baseline_pred)
baseline

#r-part prediction
pred = predict(tree.cp01, newdata=test, type="class")
confusionmatrix.cp01 = table(test$review_scores_rating,pred)
confusionmatrix.cp01

pred = predict(tree.cp001, newdata=test, type="class")
confusionmatrix.cp001 = table(test$review_scores_rating,pred)
confusionmatrix.cp001

pred = predict(tree.cp005, newdata=test, type="class")
confusionmatrix.cp005 = table(test$review_scores_rating,pred)
confusionmatrix.cp005
```
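
The accuracy, true positive rate and false positive rate asked for in 1f can be read off the confusion matrices. The sketch below enters the counts reported for the three trees by hand. The baseline row is an assumption: a majority-class model that labels every test review positive, with its counts taken from the row totals of those matrices (41 negative, 944 positive). `make_cm` and `summarise_cm` are hypothetical helper names.

```r
# Rows are actual classes, columns are predicted classes (FALSE, TRUE).
make_cm <- function(tn, fp, fn, tp) {
  matrix(c(tn, fp, fn, tp), nrow = 2, byrow = TRUE,
         dimnames = list(actual = c("FALSE", "TRUE"),
                         predicted = c("FALSE", "TRUE")))
}

confusion_matrices <- list(
  baseline = make_cm(0, 41, 0, 944),   # assumption: always predict TRUE
  cp_01    = make_cm(3, 38, 8, 936),   # cp = 0.01
  cp_005   = make_cm(5, 36, 12, 932),  # cp = 0.005
  cp_001   = make_cm(8, 33, 14, 930)   # cp = 0.001
)

summarise_cm <- function(cm) {
  c(accuracy = sum(diag(cm)) / sum(cm),               # (TN + TP) / total
    tpr = cm["TRUE", "TRUE"] / sum(cm["TRUE", ]),     # TP / (TP + FN)
    fpr = cm["FALSE", "TRUE"] / sum(cm["FALSE", ]))   # FP / (FP + TN)
}

# One row per model, one column per metric.
metrics <- t(sapply(confusion_matrices, summarise_cm))
print(round(metrics, 4))
```

On these counts the baseline's accuracy is 944/985, which matches the test-set proportion of positive reviews from 1d, and none of the three trees exceeds it. Every model also has a high false positive rate, because most of the 41 negative test reviews are still predicted positive.
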
