Text Analysis Codethrough Using Amazon Fine Food Reviews Dataset
Introduction
Welcome, Brave Adventurer!
You’ve been summoned to embark on an epic quest — the Quest for the Ultimate Knowledge of Text Analysis using the mighty powers of stringr in R! Prepare your gear, young scholar, as you will be wielding these tools to battle messy text data and emerge victorious. This quest will take you through different lands: extracting patterns, cleaning strings, performing sentiment analysis, and more. Do not fear, for I shall be your guide, and together we will unlock the secrets of text manipulation. Shall we begin?
Step 1: Preparing for Battle - Setting up the Environment
Before you start your journey, it is essential to load the necessary tools into your R environment. Here, the stringr package will be your weapon of choice for text manipulation.
Load the Libraries
# install packages (run once)
#install.packages("stringr")
#install.packages("dplyr")
# load your libraries
library(stringr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
stringr offers a consistent interface for working with strings, and it is part of the tidyverse.
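As a quick illustration of that consistent interface, here is a minimal sketch: every stringr function starts with `str_` and takes the string as its first argument. The two sample strings are taken from the Summary column of the dataset; the variable names are only for illustration.

```r
library(stringr)

# two summaries from the dataset, used here as sample input
summaries <- c("Good Quality Dog Food", "Not as Advertised")

# each function takes the string vector first and works on every element
summary_lengths <- str_length(summaries)       # number of characters in each string
mentions_dog    <- str_detect(summaries, "Dog") # TRUE where the pattern is found
shouting        <- str_to_upper(summaries)      # upper-case copy of each string

summary_lengths
mentions_dog
shouting
```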
Step 2: Equipping the Dataset (Loading the Data)
Adventurer, we will be using the Amazon Fine Food Reviews dataset, kindly made available to us by McAuley & Leskovec (2013). This dataset is large, making it the perfect training ground to hone your text manipulation skills.
You can download the dataset (Reviews.csv) from Kaggle.
Here is how you equip the dataset:
# load file path and save to object
reviews <- read.csv("C:\\Users\\swest\\Downloads\\archive\\Reviews.csv") # your file path will be different based on where the downloaded file is stored
# take a look at the dataset
glimpse(reviews)
## Rows: 35,173
## Columns: 10
## $ Id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
## $ ProductId <chr> "B001E4KFG0", "B00813GRG4", "B000LQOCH0", "B000…
## $ UserId <chr> "A3SGXH7AUHU8GW", "A1D87F6ZCVE5NK", "ABXLMWJIXX…
## $ ProfileName <chr> "delmartian", "dll pa", "Natalia Corres \"Natal…
## $ HelpfulnessNumerator <int> 1, 0, 1, 3, 0, 0, 0, 0, 1, 0, 1, 4, 1, 2, 4, 4,…
## $ HelpfulnessDenominator <int> 1, 0, 1, 3, 0, 0, 0, 0, 1, 0, 1, 4, 1, 2, 5, 5,…
## $ Score <int> 5, 1, 4, 2, 5, 4, 5, 5, 5, 5, 5, 5, 1, 4, 5, 5,…
## $ Time <int> 1303862400, 1346976000, 1219017600, 1307923200,…
## $ Summary <chr> "Good Quality Dog Food", "Not as Advertised", "…
## $ Text <chr> "I have bought several of the Vitality canned d…
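If you want to rehearse the steps before downloading the full file, one option is a tiny stand-in. The sketch below is only an assumption about how you might do that: the three rows are made up (they are not real reviews), and `toy`, `toy_path`, and `toy_reviews` are hypothetical names. It writes the rows to a temporary CSV and reads them back with read.csv(), the same function used above.

```r
# made-up rows that mimic three of the dataset's columns (not real reviews)
toy <- data.frame(
  Id    = 1:3,
  Score = c(5L, 1L, 4L),
  Text  = c("Great taffy!", "Not as advertised.", "Tasty, but pricey."),
  stringsAsFactors = FALSE
)

# write the rows to a temporary file, then load them the same way as the real data
toy_path <- tempfile(fileext = ".csv")
write.csv(toy, toy_path, row.names = FALSE)
toy_reviews <- read.csv(toy_path, stringsAsFactors = FALSE)

# inspect the structure of the stand-in data
str(toy_reviews)
```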
Step 3: Understanding the Power of String Manipulation
Ah, string manipulation! The art of bending text to your will. In this section, you’ll learn how to clean and manipulate text using stringr. Remember, mastering this will allow you to make sense of even the most chaotic text data.
Extracting Portions of Text with str_sub()
Just as a skilled warrior uses their blade to cut precisely, you will use str_sub() to extract portions of text. Let’s start by creating a column of “short reviews” from the first 50 characters of the review text.
# use mutate() to add a column holding characters 1 through 50 of each review
reviews <- reviews %>%
  mutate(Short_Reviews = str_sub(Text, 1, 50))
# review the data
head(reviews$Short_Reviews)
## [1] "I have bought several of the Vitality canned dog f"
## [2] "Product arrived labeled as Jumbo Salted Peanuts..."
## [3] "This is a confection that has been around a few ce"
## [4] "If you are looking for the secret ingredient in Ro"
## [5] "Great taffy at a great price. There was a wide as"
## [6] "I got a wild hair for taffy and ordered this five "
In this step, we use str_sub() to extract the first 50 characters of each review. You can change these numbers based on your needs. This is useful when you want a summary or snippet of long reviews.
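To see how changing the numbers affects the result, here is a small sketch on a single string (the opening of the first review). Negative positions count from the end of the string, and str_trunc() is a related stringr function that shortens text to a total width and marks the cut with an ellipsis. The variable names are only for illustration.

```r
library(stringr)

# opening of the first review, used as sample input
review <- "I have bought several of the Vitality canned dog food products"

first_50  <- str_sub(review, 1, 50)   # characters 1 through 50
last_8    <- str_sub(review, -8, -1)  # negative positions count from the end
truncated <- str_trunc(review, 50)    # at most 50 characters, ending in "..."

first_50
last_8
truncated
```

str_sub() cuts mid-word without any marker, so str_trunc() may be the better fit when the snippet will be shown to readers.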
Step 4: Cleaning the Battlefield (Text Cleaning)
Just like tidying up a battlefield after a long fight, we need to clean the text before diving deeper into analysis. The stringr package offers convenient functions like str_replace_all() to remove unwanted characters, numbers, or symbols.
Cleaning Text
Here’s how you can clean your text by removing everything except letters and whitespace (punctuation, digits, and symbols) and converting it to lowercase for consistency:
# create a function that cleans the text (removes special characters, digits, and punctuation)
clean_text <- function(text) {
  # drop every character that is not a letter or whitespace
  text <- str_replace_all(text, "[^a-zA-Z\\s]", "")
  # setting your string to all lower case makes working with the data easier going forward
  text <- str_to_lower(text)
  return(text)
}
# run all of the text through the function
reviews <- reviews %>%
  mutate(Clean_Text = clean_text(Text))
# review the cleaned text
head(reviews$Clean_Text)
## [1] "i have bought several of the vitality canned dog food products and have found them all to be of good quality the product looks more like a stew than a processed meat and it smells better my labrador is finicky and she appreciates this product better than most"
## [2] "product arrived labeled as jumbo salted peanutsthe peanuts were actually small sized unsalted not sure if this was an error or if the vendor intended to represent the product as jumbo"
## [3] "this is a confection that has been around a few centuries it is a light pillowy citrus gelatin with nuts in this case filberts and it is cut into tiny squares and then liberally coated with powdered sugar and it is a tiny mouthful of heaven not too chewy and very flavorful i highly recommend this yummy treat if you are familiar with the story of cs lewis the lion the witch and the wardrobe this is the treat that seduces edmund into selling out his brother and sisters to the witch"
## [4] "if you are looking for the secret ingredient in robitussin i believe i have found it i got this in addition to the root beer extract i ordered which was good and made some cherry soda the flavor is very medicinal"
## [5] "great taffy at a great price there was a wide assortment of yummy taffy delivery was very quick if your a taffy lover this is a deal"
## [6] "i got a wild hair for taffy and ordered this five pound bag the taffy was all very enjoyable with many flavors watermelon root beer melon peppermint grape etc my only complaint is there was a bit too much redblack licoriceflavored pieces just not my particular favorites between me my kids and my husband this lasted only two weeks i would recommend this brand of taffy it was a delightful treat"
In this step:
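- str_replace_all(): Removes every character that is not a letter or whitespace, including digits and punctuation. Because the characters are deleted rather than replaced with a space, words separated only by punctuation are joined, as in “peanutsthe” in review 2 and “redblack” in review 6.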
- str_to_lower(): Converts all letters to lowercase to avoid issues like “Good” being treated differently from “good”.
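The cleaned output above contains joined words such as “peanutsthe”, because the punctuation between them was deleted outright. If that matters for your analysis, one possible variant (a sketch, not the function used in this codethrough) replaces unwanted characters with a space and then collapses repeated whitespace with str_squish(). The name `clean_text_spaced` is hypothetical.

```r
library(stringr)

# hypothetical variant of clean_text(): swaps non-letters for a space
# instead of deleting them, so neighbouring words stay separated
clean_text_spaced <- function(text) {
  text <- str_replace_all(text, "[^a-zA-Z\\s]", " ")
  # collapse runs of whitespace and trim both ends
  text <- str_squish(text)
  str_to_lower(text)
}

# start of the second review, where "Peanuts...the" is joined by clean_text()
sample_text <- "Product arrived labeled as Jumbo Salted Peanuts...the peanuts"
cleaned_sample <- clean_text_spaced(sample_text)
cleaned_sample
```

The trade-off is that contractions are split as well: “don't” becomes “don t” rather than “dont”.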
Step 6: Performing Sentiment Analysis (Loops for Counting)
Now, let’s try our hand at creating our own sentiment analysis function to classify reviews as Positive, Negative, or Neutral. You’ll learn how to loop through text, count occurrences of words, and classify text based on sentiment.
Sentiment Analysis Function
analyze_sentiment <- function(text) {
  # define the vectors of positive and negative words to look for
  positive_words <- c("good", "great", "excellent", "love", "like", "amazing", "delicious", "tasty", "yummy")
  negative_words <- c("bad", "terrible", "poor", "hate", "dislike", "awful", "yuck", "gross", "horrible")
  # count the matches for the positive and negative words in the text
  positive_count <- str_count(text, str_c(positive_words, collapse = "|"))
  negative_count <- str_count(text, str_c(negative_words, collapse = "|"))
  # classify the review: more positive than negative matches returns "Positive", more negative than positive returns "Negative", and a tie (including no matches at all) returns "Neutral"
  if (positive_count > negative_count) {
    return("Positive")
  } else if (negative_count > positive_count) {
    return("Negative")
  } else {
    return("Neutral")
  }
}
# run all of the text data through the sentiment analysis function you created
reviews <- reviews %>%
  mutate(Sentiment = sapply(Clean_Text, analyze_sentiment))
# check the analysis results
head(reviews$Sentiment)
## i have bought several of the vitality canned dog food products and have found them all to be of good quality the product looks more like a stew than a processed meat and it smells better my labrador is finicky and she appreciates this product better than most
## "Positive"
## product arrived labeled as jumbo salted peanutsthe peanuts were actually small sized unsalted not sure if this was an error or if the vendor intended to represent the product as jumbo
## "Neutral"
## this is a confection that has been around a few centuries it is a light pillowy citrus gelatin with nuts in this case filberts and it is cut into tiny squares and then liberally coated with powdered sugar and it is a tiny mouthful of heaven not too chewy and very flavorful i highly recommend this yummy treat if you are familiar with the story of cs lewis the lion the witch and the wardrobe this is the treat that seduces edmund into selling out his brother and sisters to the witch
## "Positive"
## if you are looking for the secret ingredient in robitussin i believe i have found it i got this in addition to the root beer extract i ordered which was good and made some cherry soda the flavor is very medicinal
## "Positive"
## great taffy at a great price there was a wide assortment of yummy taffy delivery was very quick if your a taffy lover this is a deal
## "Positive"
## i got a wild hair for taffy and ordered this five pound bag the taffy was all very enjoyable with many flavors watermelon root beer melon peppermint grape etc my only complaint is there was a bit too much redblack licoriceflavored pieces just not my particular favorites between me my kids and my husband this lasted only two weeks i would recommend this brand of taffy it was a delightful treat
## "Neutral"
In this function:
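- str_count(): Counts how many times the positive and negative patterns match in the review. The patterns match anywhere in the text, so “like” is also counted inside “dislike” and “love” inside “lover”.
- sapply(): Applies the sentiment analysis function to each review. Because the input is a character vector, sapply() names each result after its review text, which is why the text is printed above each label in the output.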
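The word lists are joined with "|" and matched anywhere in the text, so a word like “dislike” counts once as negative and once as positive (through “like”). A possible refinement, sketched below under the assumption that whole-word matching is what you want, wraps the alternation in `\\b` word boundaries. Since str_count() is already vectorized, the sketch also classifies with dplyr's case_when() instead of sapply(). The names `whole_word_pattern` and `classify_sentiment` are hypothetical, and the three example sentences are made up; results from this variant will differ from the counts reported in this codethrough.

```r
library(stringr)
library(dplyr)

# same word lists as in analyze_sentiment()
positive_words <- c("good", "great", "excellent", "love", "like", "amazing", "delicious", "tasty", "yummy")
negative_words <- c("bad", "terrible", "poor", "hate", "dislike", "awful", "yuck", "gross", "horrible")

# hypothetical helper: build a pattern that only matches whole words
whole_word_pattern <- function(words) {
  str_c("\\b(?:", str_c(words, collapse = "|"), ")\\b")
}

# hypothetical vectorized classifier: takes a character vector of cleaned text
classify_sentiment <- function(text) {
  positive_count <- str_count(text, whole_word_pattern(positive_words))
  negative_count <- str_count(text, whole_word_pattern(negative_words))
  case_when(
    positive_count > negative_count ~ "Positive",
    negative_count > positive_count ~ "Negative",
    TRUE ~ "Neutral"
  )
}

# made-up examples, already lower-case and free of punctuation
examples <- c("i dislike this taffy", "great taffy and i love it", "it arrived on time")
example_labels <- classify_sentiment(examples)
example_labels
```

Applied to the data frame, this would be used as `mutate(Sentiment = classify_sentiment(Clean_Text))`, which returns a plain character column without the review text attached as names.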
Step 7: Summarizing Results
After performing sentiment analysis on the entire dataset, let’s summarize the results to see how many reviews fall into each sentiment category.
Grouping by Sentiment
# sum up the results
sentiment_summary <- reviews %>%
group_by(Sentiment) %>%
summarize(count = n())
# check the results
print(sentiment_summary)
## # A tibble: 3 × 2
## Sentiment count
## <chr> <int>
## 1 Negative 838
## 2 Neutral 7848
## 3 Positive 26487
Using group_by() and summarize(), you can easily count how many reviews are classified as Positive, Negative, or Neutral.
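Raw counts are easier to compare as proportions. The sketch below rebuilds the summary table from the counts printed above (so it runs without the full dataset) and adds a `share` column; `sentiment_share` and `share` are names chosen for illustration.

```r
library(dplyr)

# counts copied from the summary printed above
sentiment_summary <- tibble(
  Sentiment = c("Negative", "Neutral", "Positive"),
  count     = c(838L, 7848L, 26487L)
)

# add each category's share of all reviews and sort from largest to smallest
sentiment_share <- sentiment_summary %>%
  mutate(share = count / sum(count)) %>%
  arrange(desc(count))

print(sentiment_share)
```

On the full data frame, `reviews %>% count(Sentiment)` is a shorthand for the group_by() and summarize() pair; it stores the counts in a column named `n`.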
Step 9: Victory! Your Quest is Complete
Congratulations, brave adventurer! You have completed the quest and are now armed with powerful knowledge of text analysis using stringr. You’ve learned how to:
- Load and clean text data.
- Extract and manipulate strings.
- Perform sentiment analysis using loops and counting functions.
- Summarize results using dplyr.
Go forth and apply your newfound skills to the challenges that await in the world of data!
May your data be tidy, and your strings never break.