Text Analysis Codethrough Using Amazon Fine Food Reviews Dataset

Introduction

Welcome, Brave Adventurer!

You’ve been summoned to embark on an epic quest — the Quest for the Ultimate Knowledge of Text Analysis using the mighty powers of stringr in R! Prepare your gear, young scholar, as you will be wielding these tools to battle messy text data and emerge victorious. This quest will take you through different lands: extracting patterns, cleaning strings, performing sentiment analysis, and more. Do not fear, for I shall be your guide, and together we will unlock the secrets of text manipulation. Shall we begin?

Step 1: Preparing for Battle - Setting up the Environment

Before you start your journey, it is essential to load the necessary tools into your R environment. Here, the stringr package will be your weapon of choice for text manipulation.

Load the Libraries

# install packages (run once if they are not already installed)
#install.packages("stringr")
#install.packages("dplyr")

# load your libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringr)

stringr offers a consistent interface for working with strings, and it is part of the tidyverse.
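To see what "consistent interface" means in practice, here is a minimal sketch. The sample strings are made up for illustration and are not taken from the dataset. Every function starts with str_, takes the string vector as its first argument, and works on the whole vector at once.

```r
library(stringr)

# Hypothetical sample strings (not from the dataset)
x <- c("Great taffy", "Not as Advertised", NA)

# Each function takes the string vector first and is vectorized over it
lengths_chr <- str_length(x)         # 11 17 NA
upper       <- str_to_upper(x)       # "GREAT TAFFY" "NOT AS ADVERTISED" NA
has_taffy   <- str_detect(x, "taffy") # TRUE FALSE NA
```

Missing values stay missing rather than causing an error, which helps when a review column has gaps.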

Step 2: Equipping the Dataset (Loading the Data)

Adventurer, we will be using the Amazon Fine Food Reviews dataset, kindly made available to us by McAuley & Leskovec (2013). This dataset is large, making it the perfect training ground to hone your text manipulation skills.

You can download the dataset from Kaggle here.

Here is how you equip the dataset:

# load file path and save to object
reviews <- read.csv("C:\\Users\\swest\\Downloads\\archive\\Reviews.csv") # your file path will be different based on where the downloaded file is stored

# take a look at the dataset
glimpse(reviews)

## Rows: 35,173
## Columns: 10
## $ Id                     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
## $ ProductId              <chr> "B001E4KFG0", "B00813GRG4", "B000LQOCH0", "B000…
## $ UserId                 <chr> "A3SGXH7AUHU8GW", "A1D87F6ZCVE5NK", "ABXLMWJIXX…
## $ ProfileName            <chr> "delmartian", "dll pa", "Natalia Corres \"Natal…
## $ HelpfulnessNumerator   <int> 1, 0, 1, 3, 0, 0, 0, 0, 1, 0, 1, 4, 1, 2, 4, 4,…
## $ HelpfulnessDenominator <int> 1, 0, 1, 3, 0, 0, 0, 0, 1, 0, 1, 4, 1, 2, 5, 5,…
## $ Score                  <int> 5, 1, 4, 2, 5, 4, 5, 5, 5, 5, 5, 5, 1, 4, 5, 5,…
## $ Time                   <int> 1303862400, 1346976000, 1219017600, 1307923200,…
## $ Summary                <chr> "Good Quality Dog Food", "Not as Advertised", "…
## $ Text                   <chr> "I have bought several of the Vitality canned d…

Step 3: Understanding the Power of String Manipulation

Ah, string manipulation! The art of bending text to your will. In this section, you’ll learn how to clean and manipulate text using stringr. Remember, mastering this will allow you to make sense of even the most chaotic text data.

Extracting Portions of Text with str_sub()

Just as a skilled warrior uses their blade to cut precisely, you will use str_sub() to extract portions of text. Let’s start by creating a column of “short reviews” from the first 50 characters of the review text.

# use mutate() to create a new column holding characters 1 through 50 of each review
reviews <- reviews %>% 
  mutate(Short_Reviews = str_sub(Text, 1,50)) 

# review the data
head(reviews$Short_Reviews)

## [1] "I have bought several of the Vitality canned dog f"
## [2] "Product arrived labeled as Jumbo Salted Peanuts..."
## [3] "This is a confection that has been around a few ce"
## [4] "If you are looking for the secret ingredient in Ro"
## [5] "Great taffy at a great price.  There was a wide as"
## [6] "I got a wild hair for taffy and ordered this five "

In this step, we use str_sub() to extract the first 50 characters of each review. The two numbers are the start and end positions, so you can change them based on your needs. This is useful when you want a summary or snippet of long reviews.
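If you want to experiment with those positions, the sketch below may help. It uses one sample sentence modeled on the reviews shown above rather than the real column. Negative positions count from the end of the string, and str_trunc() is an alternative that marks the cut with an ellipsis.

```r
library(stringr)

# Sample sentence for illustration only
review <- "Great taffy at a great price.  There was a wide assortment of yummy taffy."

first_11 <- str_sub(review, 1, 11)   # "Great taffy"
last_6   <- str_sub(review, -6, -1)  # "taffy." (negative positions count from the end)

# str_trunc() caps the total width, ellipsis included
snippet <- str_trunc(review, 20)     # "Great taffy at a ..."
```

str_sub() cuts silently, while str_trunc() makes it visible that text was dropped, so the second may be the friendlier choice for display.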

Step 4: Cleaning the Battlefield (Text Cleaning)

Just like tidying up a battlefield after a long fight, we need to clean the text before diving deeper into analysis. The stringr package offers convenient functions like str_replace_all() to remove unwanted characters, numbers, or symbols.

Cleaning Text

Here’s how you can clean your text by removing punctuation, digits, and other special characters, then converting it to lowercase for consistency:

# create a function that cleans the text (removes digits, punctuation, and other special characters)
clean_text <- function(text) {
  text <- str_replace_all(text, "[^a-zA-Z\\s]", "") 
  
  # setting your string to all lower case makes working with data easier going forward
  text <- str_to_lower(text)  
  return(text)
}

# run all of the text through the function
reviews <- reviews %>%
  mutate(Clean_Text = clean_text(Text))

# review the cleaned text
head(reviews$Clean_Text)

## [1] "i have bought several of the vitality canned dog food products and have found them all to be of good quality the product looks more like a stew than a processed meat and it smells better my labrador is finicky and she appreciates this product better than  most"                                                                                                                                                                                                                                       
## [2] "product arrived labeled as jumbo salted peanutsthe peanuts were actually small sized unsalted not sure if this was an error or if the vendor intended to represent the product as jumbo"                                                                                                                                                                                                                                                                                                                    
## [3] "this is a confection that has been around a few centuries  it is a light pillowy citrus gelatin with nuts  in this case filberts and it is cut into tiny squares and then liberally coated with powdered sugar  and it is a tiny mouthful of heaven  not too chewy and very flavorful  i highly recommend this yummy treat  if you are familiar with the story of cs lewis the lion the witch and the wardrobe  this is the treat that seduces edmund into selling out his brother and sisters to the witch"
## [4] "if you are looking for the secret ingredient in robitussin i believe i have found it  i got this in addition to the root beer extract i ordered which was good and made some cherry soda  the flavor is very medicinal"                                                                                                                                                                                                                                                                                     
## [5] "great taffy at a great price  there was a wide assortment of yummy taffy  delivery was very quick  if your a taffy lover this is a deal"                                                                                                                                                                                                                                                                                                                                                                    
## [6] "i got a wild hair for taffy and ordered this five pound bag the taffy was all very enjoyable with many flavors watermelon root beer melon peppermint grape etc my only complaint is there was a bit too much redblack licoriceflavored pieces just not my particular favorites between me my kids and my husband this lasted only two weeks i would recommend this brand of taffy  it was a delightful treat"

In this step:

str_replace_all(): Removes anything that is not a letter or space.
str_to_lower(): Converts all letters to lowercase to avoid issues like “Good” being treated differently from “good”.
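One side effect is worth knowing about: deleting punctuation outright can glue neighboring words together, as in "peanutsthe" in the second review above. The sketch below shows one possible variation, not the approach used in this codethrough. The helper name clean_text_spaced() is hypothetical. It swaps unwanted characters for a space and then uses str_squish() to collapse the extra whitespace.

```r
library(stringr)

# Hypothetical variant of clean_text(): replace non-letters with a space
# instead of deleting them, then tidy up the whitespace
clean_text_spaced <- function(text) {
  text <- str_replace_all(text, "[^a-zA-Z\\s]", " ")
  text <- str_squish(text)  # trim the ends and collapse repeated spaces
  str_to_lower(text)
}

# Sample sentence for illustration only
sample_review <- "Jumbo Salted Peanuts...the peanuts were small."

cleaned <- clean_text_spaced(sample_review)
# "jumbo salted peanuts the peanuts were small"
```

The trade-off is that contractions split apart ("don't" becomes "don t"), so which version suits you depends on what you plan to count later.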

Step 5: Finding Hidden Patterns - Searching for Words

What if you want to find specific words in your reviews? This is where pattern matching comes into play. You can search for specific words using str_detect() or count how often they appear using str_count().

Example: Detecting Positive Words

# define a vector that consists of positive words
positive_words <- c("good", "great", "excellent", "love", "like", "amazing", "delicious", "tasty", "yummy")

# check if any positive words are present in the 'Clean_Text' column
reviews <- reviews %>%
  mutate(Has_Positive = str_detect(Clean_Text, str_c(positive_words, collapse = "|")))

# check results
head(reviews$Has_Positive)

## [1]  TRUE FALSE  TRUE  TRUE  TRUE FALSE

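Here, str_detect() is like a radar, scanning each review for the presence of any entry from the positive words list. Because the pattern is a regular expression, it matches substrings as well as whole words, so "like" is also detected inside "dislike".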
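If whole-word matching is what you need, one option is to wrap the pattern in word boundaries (\\b). This is a sketch on made-up sentences, and the names loose_pattern and strict_pattern are only for illustration.

```r
library(stringr)

positive_words <- c("good", "great", "excellent", "love", "like",
                    "amazing", "delicious", "tasty", "yummy")

# Pattern as built in the step above: matches anywhere, even inside other words
loose_pattern <- str_c(positive_words, collapse = "|")

# Same alternatives, but anchored to word boundaries on both sides
strict_pattern <- str_c("\\b(", loose_pattern, ")\\b")

# Made-up sentences for illustration
samples <- c("i dislike this taffy", "we like this taffy", "gloves not included")

loose_hits  <- str_detect(samples, loose_pattern)   # TRUE  TRUE TRUE
strict_hits <- str_detect(samples, strict_pattern)  # FALSE TRUE FALSE
```

The loose pattern finds "like" inside "dislike" and "love" inside "gloves". The strict pattern only fires on the standalone word.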

Step 6: Performing Sentiment Analysis (Loops for Counting)

Now, let’s try our hand at creating our own sentiment analysis function to classify reviews as Positive, Negative, or Neutral. You’ll learn how to loop through text, count occurrences of words, and classify text based on sentiment.

Sentiment Analysis Function

analyze_sentiment <- function(text) {
  # use your positive_words vector and create a new negative_words vector
  positive_words <- c("good", "great", "excellent", "love", "like", "amazing", "delicious", "tasty", "yummy")
  negative_words <- c("bad", "terrible", "poor", "hate", "dislike", "awful", "yuck", "gross", "horrible")
  
  # now count each instance of positive and negative words
  positive_count <- str_count(text, str_c(positive_words, collapse = "|"))
  negative_count <- str_count(text, str_c(negative_words, collapse = "|"))
  
  # use if / else if to classify the review by which kind of word dominates:
  # more positive words than negative returns "Positive", more negative than
  # positive returns "Negative", and a tie (including zero of each) returns "Neutral"
  if (positive_count > negative_count) {
    return("Positive")
  } else if (negative_count > positive_count) {
    return("Negative")
  } else {
    return("Neutral")
  }
}

# run all of the text data through the sentiment analysis function you created
# (USE.NAMES = FALSE keeps sapply() from labeling each result with the full review text)
reviews <- reviews %>%
  mutate(Sentiment = sapply(Clean_Text, analyze_sentiment, USE.NAMES = FALSE))

# check the analysis results
head(reviews$Sentiment)

## [1] "Positive" "Neutral"  "Positive" "Positive" "Positive" "Neutral"

In this function:

str_count(): Counts how many positive and negative words appear in the review.
sapply(): Applies the sentiment analysis function to each review.
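Since str_count() is already vectorized, the same classification can be written without calling the function once per review. The sketch below is an alternative, not the method used above. The helper name classify_sentiment() and the three demo sentences are hypothetical, and dplyr's case_when() stands in for the if / else if chain.

```r
library(dplyr)
library(stringr)

positive_words <- c("good", "great", "excellent", "love", "like",
                    "amazing", "delicious", "tasty", "yummy")
negative_words <- c("bad", "terrible", "poor", "hate", "dislike",
                    "awful", "yuck", "gross", "horrible")

# Hypothetical vectorized helper: takes a whole character vector at once
classify_sentiment <- function(text) {
  pos <- str_count(text, str_c(positive_words, collapse = "|"))
  neg <- str_count(text, str_c(negative_words, collapse = "|"))
  case_when(
    pos > neg ~ "Positive",
    neg > pos ~ "Negative",
    TRUE      ~ "Neutral"   # ties, including zero matches of either kind
  )
}

# Made-up cleaned reviews for illustration
demo <- tibble(Clean_Text = c("great taffy at a great price",
                              "terrible awful flavor",
                              "arrived on time"))

demo <- demo %>%
  mutate(Sentiment = classify_sentiment(Clean_Text))

demo$Sentiment  # "Positive" "Negative" "Neutral"
```

On a large dataset this form tends to run faster than sapply(), because the regular expressions are applied to the whole column in one pass.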

Step 7: Summarizing Results

After performing sentiment analysis on the entire dataset, let’s summarize the results to see how many reviews fall into each sentiment category.

Grouping by Sentiment

# sum up the results
sentiment_summary <- reviews %>%
  group_by(Sentiment) %>%
  summarize(count = n())

# check the results
print(sentiment_summary)

## # A tibble: 3 × 2
##   Sentiment count
##   <chr>     <int>
## 1 Negative    838
## 2 Neutral    7848
## 3 Positive  26487

Using group_by() and summarize(), you can easily count how many reviews are classified as Positive, Negative, or Neutral.
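As a shorthand, dplyr's count() combines the group_by() and summarize() steps. The sketch below runs on a tiny made-up tibble, not the real results, and the share column is an addition for illustration that shows each category as a proportion of the total.

```r
library(dplyr)

# Made-up sentiment labels for illustration
demo <- tibble(Sentiment = c("Positive", "Positive", "Neutral", "Negative"))

sentiment_share <- demo %>%
  count(Sentiment, name = "count") %>%   # same as group_by() + summarize(count = n())
  mutate(share = count / sum(count))     # proportion of all reviews

print(sentiment_share)
```

Proportions make it easier to compare results across datasets of different sizes.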

Step 8: Hidden Gems - String Lengths & Patterns

Apart from sentiment analysis, there are many other operations you can do with strings. For example, you can measure the length of each review, or extract all occurrences of a specific pattern.

Example: Finding the Length of Each Review

# Find the length of each review
reviews <- reviews %>%
  mutate(Review_Length = str_length(Clean_Text))

# View the review lengths
head(reviews$Review_Length)

## [1] 260 183 491 214 135 396

Here, str_length() calculates the number of characters in each cleaned review, so the counts exclude the punctuation and digits removed in Step 4.
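The other operation mentioned above, extracting every occurrence of a pattern, could look like the sketch below. The sentences are made up for illustration. Note that digits would need to come from the raw Text column, since Clean_Text has already had them stripped.

```r
library(stringr)

# Made-up raw review text for illustration
raw_text <- c("ordered this five pound bag in 2011 for $12", "no numbers here")

# str_extract_all() returns a list with one character vector per input string
numbers <- str_extract_all(raw_text, "\\d+")
numbers[[1]]  # "2011" "12"
numbers[[2]]  # character(0): no matches in the second string

# Counting runs of non-space characters gives a rough word count
word_count <- str_count(raw_text, "\\S+")  # 9 3
```

Because the result is a list, each review can hold a different number of matches, including none at all.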

Step 9: Victory! Your Quest is Complete

Congratulations, brave adventurer! You have completed the quest and are now armed with powerful knowledge of text analysis using stringr. You’ve learned how to:

Load and clean text data.
Extract and manipulate strings.
Perform sentiment analysis using loops and counting functions.
Summarize results using dplyr.

Go forth and apply your newfound skills to the challenges that await in the world of data!

May your data be tidy, and your strings never break.