Step 1. Preparing the data.

We used a dataset of movie reviews, with 5331 positive reviews and 2332 negative reviews. The dataset comes from a GitHub project used to practice data science with Python.
We use this dataset to predict whether a new review is negative or positive with Naive Bayes.
We read the data with headers, treating every column of the data frame as plain strings rather than factors, and assign the result to the variable “dsMovies”.
knitr::opts_chunk$set(echo = TRUE)
#dsMovies <- read.table(file = 'movie-reviews-dataset.tsv', sep = '\t', header = TRUE)
dsMovies <- read.csv("dsMovies.csv", stringsAsFactors = FALSE, header = TRUE)

Step 2. Checking the dataset.

First of all we make a summary of the dataset: both the “type” column and the “message” column contain 7663 values.
The structure shows 7663 observations of 2 plain-string columns.
We show the first 6 values of the “type” column, and then the first 6 values of the “message” column.
We show the last 6 values of the “type” column (rows 7658 - 7663), and then the last 6 values of the “message” column.
Next we convert the “type” column, whose two values are negative and positive, into a factor; this prepares it for the next step, Preprocessing.
Finally, we count the values of each factor level; we expect 2332 negative messages and 5331 positive ones.
summary(dsMovies)
##      type             message         
##  Length:7663        Length:7663       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character
str(dsMovies)
## 'data.frame':    7663 obs. of  2 variables:
##  $ type   : chr  "negative" "negative" "negative" "negative" ...
##  $ message: chr  "it's a mindless action flick with a twist -- far better suited to video-viewing than the multiplex . " "after a while , the only way for a reasonably intelligent person to get through the country bears is to ponder how a whole segm"| __truncated__ "we get light showers of emotion a couple of times , but then -- strangely -- these wane to an inconsistent and ultimately unsat"| __truncated__ "summer's far too fleeting to squander on offal like this . " ...
head(dsMovies)
##       type
## 1 negative
## 2 negative
## 3 negative
## 4 negative
## 5 negative
## 6 negative
##                                                                                                                                                                                                     message
## 1                                                                                                     it's a mindless action flick with a twist -- far better suited to video-viewing than the multiplex . 
## 2 after a while , the only way for a reasonably intelligent person to get through the country bears is to ponder how a whole segment of pop-music history has been allowed to get wet , fuzzy and sticky . 
## 3                                                         we get light showers of emotion a couple of times , but then -- strangely -- these wane to an inconsistent and ultimately unsatisfying drizzle . 
## 4                                                                                                                                               summer's far too fleeting to squander on offal like this . 
## 5                                                                                                             the film is grossly contradictory in conveying its social message , if indeed there is one . 
## 6                                                            often lingers just as long on the irrelevant as on the engaging , which gradually turns what time is it there ? into how long is this movie ?
tail(dsMovies)
##          type
## 7658 positive
## 7659 positive
## 7660 positive
## 7661 positive
## 7662 positive
## 7663 positive
##                                                                                                                                               message
## 7658                                               [has] an immediacy and an intimacy that sucks you in and dares you not to believe it's all true . 
## 7659                                          it treats ana's journey with honesty that is tragically rare in the depiction of young women in film . 
## 7660 captivates as it shows excess in business and pleasure , allowing us to find the small , human moments , and leaving off with a grand whimper . 
## 7661                                                                                a refreshingly realistic , affectation-free coming-of-age tale . 
## 7662                              how good this film might be , depends if you believe that the shocking conclusion is too much of a plunge or not . 
## 7663          great fun both for sports aficionados and for ordinary louts whose idea of exercise is climbing the steps of a stadium-seat megaplex .
dsMovies$type <- factor(dsMovies$type)
table(dsMovies$type)
## 
## negative positive 
##     2332     5331
In the following graph we can observe the number of positive and negative messages in our dataset. We have 7663 messages in total: 5331 positive and 2332 negative.
barplot(table(dsMovies$type), xlab = "Quantity", ylab = "Type", horiz = TRUE, col='#990066')

Step 3. Preprocessing

We load the library “tm”, which provides the VCorpus format for text documents and creates a volatile (in-memory) corpus.
Then we inspect the first two documents of “dsMoviesCorpus”: the first plain text document has 101 characters and the second has 201 characters.
library("tm")
## Loading required package: NLP
dsMoviesCorpus <- VCorpus(VectorSource(dsMovies$message))
print(dsMoviesCorpus)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 7663
inspect(dsMoviesCorpus[1:2])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 2
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 101
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 201

Step 4. Checking the first message

We print the information of “dsMoviesCorpus” (first message), then we convert it to character to see the message content.
library("tm")
print(dsMoviesCorpus[[1]])
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 101
as.character(dsMoviesCorpus[[1]])
## [1] "it's a mindless action flick with a twist -- far better suited to video-viewing than the multiplex . "

Step 5. Checking multiple messages

We print several messages at once, using the function “lapply” to apply “as.character” over a range of documents.
library("tm")
lapply(dsMoviesCorpus[1:5], as.character)
## $`1`
## [1] "it's a mindless action flick with a twist -- far better suited to video-viewing than the multiplex . "
## 
## $`2`
## [1] "after a while , the only way for a reasonably intelligent person to get through the country bears is to ponder how a whole segment of pop-music history has been allowed to get wet , fuzzy and sticky . "
## 
## $`3`
## [1] "we get light showers of emotion a couple of times , but then -- strangely -- these wane to an inconsistent and ultimately unsatisfying drizzle . "
## 
## $`4`
## [1] "summer's far too fleeting to squander on offal like this . "
## 
## $`5`
## [1] "the film is grossly contradictory in conveying its social message , if indeed there is one . "

Step 6. Transform to Lower Case.

This is the preparation of the dataset, formally known as cleaning the data. First we transform all messages to lowercase with the function tolower(); for checking we print with the “as.character” function. Let’s check the transformation by comparing a message in the original corpus with the same message in the transformed corpus.
dsMovies_clean <- tm_map(dsMoviesCorpus, content_transformer(tolower))
as.character(dsMoviesCorpus[[1]])
## [1] "it's a mindless action flick with a twist -- far better suited to video-viewing than the multiplex . "
as.character(dsMovies_clean[[1]])
## [1] "it's a mindless action flick with a twist -- far better suited to video-viewing than the multiplex . "

Step 7. Remove numbers

After that we remove all numbers in the messages. Here we do not need the function “content_transformer()”: removeNumbers() is built into tm, along with several other mapping functions that do not need to be wrapped. For checking we print with the “as.character” function, comparing a message in the original corpus with the transformed corpus.
dsMovies_clean <- tm_map(dsMovies_clean, removeNumbers)
as.character(dsMoviesCorpus[[32]])
## [1] "one of the worst films of 2002 . "
as.character(dsMovies_clean[[32]])
## [1] "one of the worst films of  . "

Step 8. Remove Stop Words

Removing stop words means removing filler words before text analysis, since those words don’t give valuable information about the meaning of a message. We use the function stopwords(), which gives access to various sets of stop words, in this case the “Spanish” set and the default one (English). We also use the tm_map() function to apply this mapping to the data, passing stopwords() as a parameter to removeWords to indicate which words we would like to remove. For checking we print with the “as.character” function, comparing a message in the original corpus with the transformed corpus.
#stopwords()
#stopwords('spanish')
dsMovies_clean <- tm_map(dsMovies_clean, removeWords, stopwords('spanish'))
dsMovies_clean <- tm_map(dsMovies_clean, removeWords, stopwords())
as.character(dsMoviesCorpus[[6]])
## [1] "often lingers just as long on the irrelevant as on the engaging , which gradually turns what time is it there ? into how long is this movie ? "
as.character(dsMovies_clean[[6]])
## [1] "often lingers just  long   irrelevant    engaging ,  gradually turns  time    ?   long   movie ? "
as.character(dsMoviesCorpus[[1200]])
## [1] "fairly run-of-the-mill . "
as.character(dsMovies_clean[[1200]])
## [1] "fairly run---mill . "

Step 9. Remove Punctuation

Now we remove any punctuation marks, such as parentheses, commas, periods, etc. For checking we print with the “as.character” function, comparing a message in the original corpus with the transformed corpus.
dsMovies_clean <- tm_map(dsMovies_clean, removePunctuation)
as.character(dsMoviesCorpus[[6]])
## [1] "often lingers just as long on the irrelevant as on the engaging , which gradually turns what time is it there ? into how long is this movie ? "
as.character(dsMovies_clean[[6]])
## [1] "often lingers just  long   irrelevant    engaging   gradually turns  time       long   movie  "

Step 10. Remove Spaces

Here we collapse the remaining extra white space, using the function stripWhitespace(). For checking we print with the “as.character” function, comparing a message in the original corpus with the transformed corpus.
dsMovies_clean <- tm_map(dsMovies_clean, stripWhitespace)
as.character(dsMoviesCorpus[[452]])
## [1] "jarecki and gibney do find enough material to bring kissinger's record into question and explain how the diplomat's tweaked version of statecraft may have cost thousands and possibly millions of lives . "
as.character(dsMovies_clean[[452]])
## [1] "jarecki gibney find enough material bring kissingers record question explain diplomats tweaked version statecraft may cost thousands possibly millions lives "

Step 11. Data Preparation for Analysis

We create a document-term matrix using the function DocumentTermMatrix(). More specifically, this is a sparse matrix, in which most cells have a value of zero: each row is a document, each column is a term, and each cell counts how many times that term appears in that document.
dsMovies_dtm <- DocumentTermMatrix(dsMovies_clean)

str(dsMovies_dtm)
## List of 6
##  $ i       : int [1:78753] 1 1 1 1 1 1 1 1 1 2 ...
##  $ j       : int [1:78753] 136 1327 5259 5569 9238 9551 14253 15273 15831 402 ...
##  $ v       : num [1:78753] 1 1 1 1 1 1 1 1 1 1 ...
##  $ nrow    : int 7663
##  $ ncol    : int 16567
##  $ dimnames:List of 2
##   ..$ Docs : chr [1:7663] "1" "2" "3" "4" ...
##   ..$ Terms: chr [1:16567] "<c2><bd>" "<c3><a9>lan" "<e2><80><93>" "<e2><80><94>" ...
##  - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
##  - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
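To get a feel for the sparsity, we could peek at a small slice of the matrix with tm’s inspect(); a sketch (commented out, and which terms are visible depends on the alphabetical order of the vocabulary):
#inspect(dsMovies_dtm[1:3, 100:105])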

Step 12. Creating training and test Datasets

Here we simply divide the data into 2 sections: 75 percent for training and 25 percent for testing.
We then check that both groups are reasonably balanced, in the sense that the training and test sets contain roughly similar proportions of positive and negative messages.
dsMovies_dtm_train <- dsMovies_dtm[1:5364, ]
dsMovies_dtm_test <- dsMovies_dtm[5365:7663, ]
dsMovies_train_labels <- dsMovies[1:5364,]$type
dsMovies_test_labels <- dsMovies[5365:7663,]$type
prop.table(table(dsMovies_train_labels))
## dsMovies_train_labels
##  negative  positive 
## 0.3296048 0.6703952
prop.table(table(dsMovies_test_labels))
## dsMovies_test_labels
##  negative  positive 
## 0.2453241 0.7546759
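The proportions are not identical (about 33% negative in training vs. 25% in testing) because the rows are not fully shuffled. A random split would balance the two sets better; a minimal sketch, not used for the results below:
#set.seed(123)   # hypothetical seed, for reproducibility
#idx <- sample(nrow(dsMovies_dtm), floor(0.75 * nrow(dsMovies_dtm)))
#dsMovies_dtm_train <- dsMovies_dtm[idx, ]
#dsMovies_dtm_test <- dsMovies_dtm[-idx, ]
#dsMovies_train_labels <- dsMovies$type[idx]
#dsMovies_test_labels <- dsMovies$type[-idx]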

Step 13. Visualizing text data

In the next section we can observe a word cloud showing the frequency with which words appear in the messages.
library("wordcloud")
## Loading required package: RColorBrewer
wordcloud(dsMovies_clean, min.freq = 40, random.order =TRUE)
## Warning in wordcloud(dsMovies_clean, min.freq = 40, random.order = TRUE):
## film could not be fit on page. It will not be plotted.

Now we can observe a word cloud of how often words appear in negative messages only.
bad <- subset(dsMovies, type=="negative")
good <- subset(dsMovies, type =="positive")

wordcloud(bad$message, max.words = 80, scale = c(5, 0.5))

Now we can observe a word cloud of how often words appear in positive messages only.

wordcloud(good$message, max.words = 80, scale = c(5, 0.5))

To work with this as a tidy dataset, we need to restructure it in one-token-per-row format. The unnest_tokens function converts a data frame with a text column into one token per row.
This function uses the tokenizers package to separate each line into words. The default tokenization is by word, but other options include characters, ngrams, sentences, lines, paragraphs, or separation around a regex pattern (see the bigram sketch after the word counts below).
Now that the data is in one-word-per-row format, we can manipulate it with tidy tools like dplyr. We can remove stop words (kept in the tidytext dataset stop_words) with an anti_join.
We can also use count to find the most common words in the dataset as a whole.
These are the ten most common words in our dataset, illustrated with a barplot.
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidytext)
library(tidyr)


bing <- get_sentiments("bing")

tidy_dsMovies <- dsMovies %>%
  unnest_tokens(word, message)

data("stop_words")
dsMoviesCl <- tidy_dsMovies %>%
  anti_join(stop_words)
## Joining, by = "word"
res = dsMoviesCl %>%
  count(word, sort = TRUE)

barplot(res[1:10,]$n, xlab = "Word", ylab = "Quantity", horiz = FALSE, col='#FF6100',names.arg=res[1:10,]$word, cex.names=0.7)

res[1:10,]
## # A tibble: 10 x 2
##          word     n
##         <chr> <int>
## 1        film  1087
## 2       movie   874
## 3       story   357
## 4      comedy   289
## 5        time   261
## 6  characters   232
## 7       funny   230
## 8    director   208
## 9        life   208
## 10       love   182
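Since unnest_tokens supports other token types, switching to bigrams only requires changing the token argument. A sketch, assuming the same dsMovies data frame (not run here):
#dsMovies %>%
#  unnest_tokens(bigram, message, token = "ngrams", n = 2) %>%
#  count(bigram, sort = TRUE)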
Let’s find a sentiment score for each word using the Bing lexicon, then count the number of positive and negative words in the messages.
One advantage of having a data frame with both sentiment and word is that we can analyze the word counts that contribute to each sentiment.
This can be shown visually, and we can pipe straight into ggplot2 because we are consistently using tools built for handling tidy data frames.
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
bing_word_counts <- tidy_dsMovies %>%
  inner_join(bing) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
bing_word_counts %>%
  filter(n > 50) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab("Contribution to sentiment")

Step 14. Creating indicator for Frequent Words

In this step we use the function findFreqTerms() to find the words that appear at least 20 times in the training data, and we keep only those terms in the training and test matrices.
dsMovies_freq_words <- findFreqTerms(dsMovies_dtm_train,20)
str(dsMovies_freq_words)
##  chr [1:429] "acted" "acting" "action" "actor" "actors" ...
dsMovies_dtm_freq_train <- dsMovies_dtm_train[,dsMovies_freq_words]
dsMovies_dtm_freq_test <- dsMovies_dtm_test[,dsMovies_freq_words]
dsMovies_dtm_freq_train
## <<DocumentTermMatrix (documents: 5364, terms: 429)>>
## Non-/sparse entries: 21067/2280089
## Sparsity           : 99%
## Maximal term length: 14
## Weighting          : term frequency (tf)
dsMovies_dtm_freq_test
## <<DocumentTermMatrix (documents: 2299, terms: 429)>>
## Non-/sparse entries: 8863/977408
## Sparsity           : 99%
## Maximal term length: 14
## Weighting          : term frequency (tf)

Step 15. The Naive Bayes classifier needs categorical data.

First, we categorize the data, because this Naive Bayes implementation works with categorical features; we write a function, converts_counts, and apply it to every column of the train and test matrices.
converts_counts <- function(x) { ifelse(x > 0, "Yes", "No") }
dsMovies_train <- apply(dsMovies_dtm_freq_train, MARGIN = 2, converts_counts)
dsMovies_test <- apply(dsMovies_dtm_freq_test, MARGIN = 2, converts_counts)
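As a quick illustration of what converts_counts does, applied to a sample vector of counts it maps zeros to “No” and anything positive to “Yes”:
#converts_counts(c(0, 1, 3))   # returns "No" "Yes" "Yes"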


Step 16. Naive Bayes classification outputs.

We build a model on the training data, named dsMovies_classifier, and then generate predictions with predict(), which returns a vector of predicted class values or raw predicted probabilities depending on the value of the type parameter.
library(e1071)
dsMovies_classifier <- naiveBayes(dsMovies_train,dsMovies_train_labels)
dsMovies_text_pred <- predict(dsMovies_classifier,dsMovies_test)
table(dsMovies_text_pred)
## dsMovies_text_pred
## negative positive 
##      473     1826
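To inspect the raw posterior probabilities instead of class labels, the type parameter of predict() can be set to “raw”; a sketch for the first few test messages (not run here):
#head(predict(dsMovies_classifier, dsMovies_test, type = "raw"), 3)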

Step 17. Frequency Table.

For the final step we evaluate our model on unseen data, comparing the predictions with the true values.
Compare predictions vs. real classes.
This is the result with a minimum of 20 occurrences of each word in the reviews, without Laplace smoothing.
library(gmodels)
#CrossTable(dsMovies_text_pred,dsMovies_test_labels,prop.chisq = FALSE,prop.t = FALSE, dnn = c('predicted','actual'))
table(dsMovies_text_pred,dsMovies_test_labels)
##                   dsMovies_test_labels
## dsMovies_text_pred negative positive
##           negative      230      243
##           positive      334     1492
This is the result with a minimum of 20 occurrences of each word in the reviews, with Laplace smoothing.
This smoothing is used to handle cases where

\(P(X_i = x_i | Y = y) = 0\)

Without smoothing, a single word that never co-occurs with a class in the training data zeroes out the whole product of conditional probabilities, so any message containing that word could never be assigned to that class; the model would effectively ignore all the other evidence in that message, and that is bad.
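Concretely, with Laplace smoothing a pseudo-count \(\alpha\) (here \(\alpha = 1\)) is added to every feature/class combination, so the estimate becomes

\(P(X_i = x_i \mid Y = y) = \dfrac{\mathrm{count}(X_i = x_i, Y = y) + \alpha}{\mathrm{count}(Y = y) + \alpha k}\)

where \(k\) is the number of values \(X_i\) can take (here \(k = 2\): “Yes”/“No”). With \(\alpha > 0\), no conditional probability is ever exactly zero.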
library(e1071)
dsMovies_classifier <- naiveBayes(dsMovies_train,dsMovies_train_labels,laplace = 1)
dsMovies_text_pred <- predict(dsMovies_classifier,dsMovies_test)
table(dsMovies_text_pred,dsMovies_test_labels)
##                   dsMovies_test_labels
## dsMovies_text_pred negative positive
##           negative      222      237
##           positive      342     1498
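To compare the two runs numerically, accuracy can be read off each confusion table as the proportion of correct predictions; a quick sketch:
#conf <- table(dsMovies_text_pred, dsMovies_test_labels)
#sum(diag(conf)) / sum(conf)
Both runs land near 75% accuracy on this split ((230 + 1492) / 2299 without Laplace, (222 + 1498) / 2299 with it), so smoothing barely moves the aggregate numbers here.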

Comparative table without Laplace

Comparative table with Laplace

Conclusions and Limitations

Does the study generalize to other domains?

Yes. In fact, Naive Bayes has been used in many other domains, for example medical data classification, pattern recognition, and image processing. Given a dataset of symptoms and their diagnosed diseases, it can improve diagnostic speed and the quality of medical treatment by predicting the illness behind a “good” or “bad” diagnosis.

Limitations

There are some limitations in the dataset. For example, it covers only a small portion of reviews, written by a handful of cinema critics, so it does not reflect a definitive verdict on any movie.

Advantages

This study reflects a good use of prediction with this dataset: given reviews labeled positive or negative, the same approach could be used to analyze personal opinions about other things, such as books. In general, we can add more movie reviews to the dataset and the prediction accuracy should stay at a similar level.