Welcome to my code-through

This code-through is an introduction to text mining in R using the text mining framework provided by the tm package. We will also use the wordcloud package to visualize our findings. The tm package was created by Ingo Feinerer and has long been the standard text mining infrastructure in R; it enables people who are new to programming to easily analyze texts. You can read more about the package here.

The purpose of this code-through is to understand how to analyze text, find the most frequently used words, and visualize the findings in a presentable manner.

The required packages for this code-through are tm, dplyr, kableExtra, and wordcloud.
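If you do not have them installed yet, install.packages() will fetch them from CRAN. A minimal setup chunk (assuming you load everything up front; knitr is loaded here because kable() comes from it, and wordcloud pulls in RColorBrewer, whose brewer.pal() we use for the word cloud colors) might look like this:

# Install once, if needed:
# install.packages(c("tm", "dplyr", "knitr", "kableExtra", "wordcloud"))

library(tm)         # text mining framework
library(dplyr)      # pipes (%>%) and data manipulation
library(knitr)      # kable() for tables
library(kableExtra) # kable_styling() for nicer tables
library(wordcloud)  # word clouds (loads RColorBrewer for brewer.pal())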



About the dataset

We will start by reading in a dataset from Yelp, which publishes crowd-sourced reviews about businesses. The data is a detailed dump of Yelp reviews, businesses, users, and check-ins for the Phoenix, AZ metropolitan area. There are 229,907 Yelp reviews in the full dataset. Since that is a lot of rows, we will read only the first 7,000 for ease of reference and faster loading into R.

The dataset is retrieved from **data.world**. The steps below demonstrate how we can find the most frequently used words in reviews for a particular business category.

Follow the steps below to load the dataset:

# Copy and paste into a code block as follows

d <- read.csv("https://query.data.world/s/wmaxggxnhn4wi4jah3q73vjpvldaj6", header = TRUE,
    stringsAsFactors = FALSE, nrows = 7000)
#Preview column names
colnames(d)
##  [1] "X"                      "business_blank"         "business_categories"   
##  [4] "business_city"          "business_full_address"  "business_id"           
##  [7] "business_latitude"      "business_longitude"     "business_name"         
## [10] "business_neighborhoods" "business_open"          "business_review_count" 
## [13] "business_stars"         "business_state"         "business_type"         
## [16] "cool"                   "date"                   "funny"                 
## [19] "review_id"              "reviewer_average_stars" "reviewer_blank"        
## [22] "reviewer_cool"          "reviewer_funny"         "reviewer_name"         
## [25] "reviewer_review_count"  "reviewer_type"          "reviewer_useful"       
## [28] "stars"                  "text"                   "type"                  
## [31] "useful"                 "user_id"
# Convert all the column names to lower case.

colnames(d)<-tolower(colnames(d))
colnames(d)
summary(d)


Here we will choose the columns required for our text analysis, making our dataset smaller and easier to understand. This way we can view only these columns and disregard any extra columns that we do not need.

The text column contains what people have said about each place; in other words, it holds the reviews.

Since we are interested in finding the most used words in relation to the business category, we will pick the following columns:

dat<-d[c( "business_name", "business_categories", "text" )]
dat
# View our new dataset
View(dat)

A preview of this dataset (each entry shows the business_name, business_categories, and the review text):

head(dat) %>% kable() %>% kable_styling()
Morning Glory Cafe (Breakfast & Brunch; Restaurants):

My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.

Do yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I’ve ever had. I’m pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing.

While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best “toast” I’ve ever had.

Anyway, I can’t wait to go back!

Spinato’s Pizzeria (Italian; Pizza; Restaurants):

I have no idea why some people give bad reviews about this place. It goes to show you, you can please everyone. They are probably griping about something that their own fault…there are many people like that.

In any case, my friend and I arrived at about 5:50 PM this past Sunday. It was pretty crowded, more than I thought for a Sunday evening and thought we would have to wait forever to get a seat but they said we’ll be seated when the girl comes back from seating someone else. We were seated at 5:52 and the waiter came and got our drink orders. Everyone was very pleasant from the host that seated us to the waiter to the server. The prices were very good as well. We placed our orders once we decided what we wanted at 6:02. We shared the baked spaghetti calzone and the small “Here’s The Beef” pizza so we can both try them. The calzone was huge and we got the smallest one (personal) and got the small 11" pizza. Both were awesome! My friend liked the pizza better and I liked the calzone better. The calzone does have a sweetish sauce but that’s how I like my sauce!

We had to box part of the pizza to take it home and we were out the door by 6:42. So, everything was great and not like these bad reviewers. That goes to show you that you have to try these things yourself because all these bad reviewers have some serious issues.

Haji-Baba (Middle Eastern; Restaurants):

love the gyro plate. Rice is so good and I also dig their candy selection :)

Chaparral Dog Park (Active Life; Dog Parks; Parks):

Rosie, Dakota, and I LOVE Chaparral Dog Park!!! It’s very convenient and surrounded by a lot of paths, a desert xeriscape, baseball fields, ballparks, and a lake with ducks.

The Scottsdale Park and Rec Dept. does a wonderful job of keeping the park clean and shaded. You can find trash cans and poopy-pick up mitts located all over the park and paths.

The fenced in area is huge to let the dogs run, play, and sniff!

Discount Tire (Tires; Automotive):

General Manager Scott Petello is a good egg!!! Not to go into detail, but let me assure you if you have any issues (albeit rare) speak with Scott and treat the guy with some respect as you state your case and I’d be surprised if you don’t walk out totally satisfied as I just did. Like I always say….. “Mistakes are inevitable, it’s how we recover from them that is important”!!!

Thanks to Scott and his awesome staff. You’ve got a customer for life!! ………. :^)

Quiessence Restaurant (Wine Bars; Bars; American (New); Nightlife; Restaurants):

Quiessence is, simply put, beautiful. Full windows and earthy wooden walls give a feeling of warmth inside this restaurant perched in the middle of a farm. The restaurant seemed fairly full even on a Tuesday evening; we had secured reservations just a couple days before.

My friend and I had sampled sandwiches at the Farm Kitchen earlier that week, and were impressed enough to want to eat at the restaurant. The crisp, fresh veggies didn’t disappoint: we ordered the salad with orange and grapefruit slices and the crudites to start. Both were very good; I didn’t even know how much I liked raw radishes and turnips until I tried them with their pesto and aioli sauces.

For entrees, I ordered the lamb and my friend ordered the pork shoulder. Service started out very good, but trailed off quickly. Waiting for our food took a very long time (a couple seated after us received and finished their entrees before we received our’s), and no one bothered to explain the situation until the maitre’d apologized almost 45 minutes later. Apparently the chef was unhappy with the sauce on my entree, so he started anew. This isn’t really a problem, but they should have communicated this to us earlier. For our troubles, they comped me the glass of wine I ordered, but they forgot to bring out with my entree as I had requested. Also, they didn’t offer us bread, but I will echo the lady who whispered this to us on her way out: ask for the bread. We received warm foccacia, apple walnut, and pomegranate slices of wonder with honey and butter. YUM.

The entrees were both solid, but didn’t quite live up to the innovation and freshness of the vegetables. My lamb’s sauce was delicious, but the meat was tough. Maybe the vegetarian entrees are the way to go? But our dessert, the gingerbread pear cake, was yet another winner.

If the entrees were tad more inspired, or the service weren’t so spotty, this place definitely would have warranted five stars. If I return, I’d like to try the 75$ tasting menu. Our bill came out to about 100$ for two people, including tip, no drinks.
summary(dat)
##  business_name      business_categories     text          
##  Length:7000        Length:7000         Length:7000       
##  Class :character   Class :character    Class :character  
##  Mode  :character   Mode  :character    Mode  :character


Match our criteria

Here we will use the grep() and grepl() functions to find business categories that match the criteria we specify. grep() returns the full strings that match. grepl() is the logical version of grep(): it returns a vector of TRUE and FALSE values, with TRUE marking the cases that match the specified criteria.
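Here is a quick toy illustration of the difference (the fruits vector below is made up for this example and is not part of the Yelp data):

# grep() with value = TRUE returns the matching strings themselves...
fruits <- c("Apple pie", "banana", "apple juice")
grep("apple", fruits, value = TRUE, ignore.case = TRUE)
## [1] "Apple pie"   "apple juice"

# ...while grepl() returns a logical vector, one entry per element
grepl("apple", fruits, ignore.case = TRUE)
## [1]  TRUE FALSE  TRUE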

We will start by using grep() to match all the business categories that are considered to be in the coffee business.

grep(pattern = "coffee", x = dat$business_categories, value = TRUE, ignore.case = TRUE) %>% head() %>%kable()
x
Food; Coffee & Tea
Food; Sandwiches; Coffee & Tea; Breakfast & Brunch; Restaurants
Food; Donuts; Coffee & Tea
Food; Coffee & Tea; Breakfast & Brunch; Restaurants
Food; Coffee & Tea; Vegan; Restaurants
Food; Coffee & Tea


Using the grepl() function, we can count the business categories that fall in the coffee business within our 7,000 rows.

We find that 262 places belong to the coffee category.

coffee <- grepl("coffee", dat$business_categories, ignore.case = TRUE)
sum(coffee)
## [1] 262


Next we create a new dataset containing the business names, categories, and review text for the coffee businesses only.

dat.cof<- dat[coffee, c( "business_name", "business_categories", "text")]
dat.cof %>% head(3) %>% kable() %>% kable_styling()
Row 47 — Lux (Food; Coffee & Tea):

(Un)fortunately for me, lux is close to my house. I walk there nearly every day and am much poorer because of it. The coffee and pastries are amazing. They always play really great music too!

Row 48 — The Coffee Shop (Food; Sandwiches; Coffee & Tea; Breakfast & Brunch; Restaurants):

After watching her win on Cupcake Wars I was determined to find this place (It’s very well hidden)and sample the cupcakes. I have to say I was not impressed; i make better cupcakes on my own. More creative and not so sweet you’ll go into a diabetic coma. i did have a California BLT as well and it was a good sandwich although the bread was a little too toasted for my liking. The staff was friendly, the ambiance was very pleasant and it was clean. But the cupcakes…really not all that. Sorry….

Row 58 — Cherubini Coffee Co (Food; Donuts; Coffee & Tea):

This would most certainly be “my” coffee shop if I hadn’t already established myself as a regular at another local shop. This brand-new, independent coffee shop has spent the money to look really nice (it even has a fountain inside!).

But what money can’t buy is community and unfortunately this shop doesn’t seem to have much of that. Granted, building community takes time and/or energy. For their sake, I really really hope it works out in the long-run.
# New vector containing the review texts
cof.text <- dat.cof$text
cof.text


Load data as a corpus

Here we load our data as a corpus, which is the main structure that the tm package works with. A corpus is simply a collection of documents, but like most things in R, it has specific attributes that enable certain types of analysis.

docs <- VCorpus(VectorSource(dat.cof$text))
docs
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 262

The output shows that this corpus has 262 documents, one per coffee review.

To further illustrate this, we can build and inspect a small example corpus, mycorp. Notice how each string is treated as a document.

mycorp <- c("My name is Marah", "Her name is Sarah", "His name is John")
mycorp <- VCorpus(VectorSource(mycorp))
mycorp
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
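To peek inside an individual document, you can extract it with [[ ]] and coerce it to character; this is a handy way to confirm what each document holds:

# Show the content of the first document in the toy corpus
writeLines(as.character(mycorp[[1]]))
## My name is Marah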


It is necessary to identify a source for the corpus. To see which sources are available in the tm package, try the function getSources().

VectorSource() interprets a vector of characters, treating each component as a document.

getSources()
## [1] "DataframeSource" "DirSource"       "URISource"       "VectorSource"   
## [5] "XMLSource"       "ZipSource"


Corpus Transformation

One particularly useful feature of the tm package is the ability to transform text into workable data without a great deal of code. To do this, we can use the transformations provided by the package. To see the available transformations, enter getTransformations().

getTransformations()
## [1] "removeNumbers"     "removePunctuation" "removeWords"      
## [4] "stemDocument"      "stripWhitespace"

We can use the base R function writeLines() to print the first two wrapped lines of document number 2. We will use this again later to double-check that the transformations have been applied to our data. So far, no transformations have been applied!

writeLines(head(strwrap(docs[[2]]), 2)) 
## After watching her win on Cupcake Wars I was determined to find this
## place (It's very well hidden)and sample the cupcakes.  I have to say I


Let's apply some transformations! The tm package uses a specific interface, tm_map(), to apply functions to corpora. Let's try it out: we will convert all words to lower case and remove punctuation, numbers, stop words, and extra white space.

# Transform to lower case (tolower() must be wrapped in content_transformer())
docs <- tm_map(docs, content_transformer(tolower))
# Numbers do not contribute to the meaning of the reviews, so strip the digits
docs <- tm_map(docs, removeNumbers)

The next stage is to eliminate common words from the text. These include articles (a, an, the), conjunctions (and, or, but, etc.), common verbs (is), and qualifiers (yet, however, etc.). The tm package includes a list of such stop words, and we remove them using the removeWords transformation.
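You can inspect tm's built-in stop word list yourself before removing it:

# Peek at the English stop word list shipped with tm
head(stopwords("english"))   # "i", "me", "my", ...
length(stopwords("english")) # how many words will be removed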

docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Remove extra white space
docs <- tm_map(docs, stripWhitespace)

A large corpus will usually contain many words that come from the same root, for instance: play, played, playing. Stemming is the process of reducing related words to their common root, which in this case would be the word play.

stemmed.docs<- tm_map(docs, stemDocument)

# In this code-through we will not be stemming our docs

writeLines(head(strwrap(stemmed.docs[[2]]), 2)) 
## watch win cupcak war determin find place well hidden sampl cupcak say
## impress make better cupcak creativ sweet go diabet coma california blt

Note: if you scroll up to the earlier writeLines() output, you can see that the word watch was watching before stemming.


Term-document matrix

The next step is to create a term-document matrix (TDM), a matrix that lists all occurrences of words in the corpus. In a TDM, the terms (or words) are represented by rows and the documents by columns (its transpose, the document-term matrix, swaps the two). If a word occurs in a particular document n times, then the matrix entry corresponding to that row and column is n; if it doesn't occur at all, the entry is 0.
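To make this concrete, here is a sketch of the term-document matrix for the three-sentence toy corpus we built earlier (by default, TermDocumentMatrix() lowercases terms and keeps only words of at least three characters, so short words like my and is are dropped):

# Toy term-document matrix: terms are rows, the 3 documents are columns
toy.tdm <- TermDocumentMatrix(mycorp)
as.matrix(toy.tdm)
# "name" occurs in all three documents, so its row is 1 1 1;
# "marah" occurs only in document 1, so its row is 1 0 0.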

# Create the term-document matrix.
dtm <- TermDocumentMatrix(docs)

# Unpack it into an ordinary matrix.
m <- as.matrix(dtm)

# Count each word's total occurrences across all documents.
v <- sort(rowSums(m), decreasing = TRUE)

# Convert the counts into a data frame.
# (Note: this reuses the name d, overwriting the raw data frame we loaded
# earlier; that is fine, since everything below works from dat.)
d <- data.frame(word = names(v), freq = v)

# View the top 15 words by frequency.
head(d, 15)
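The tm package also provides helpers for exploring the matrix directly. For instance, findFreqTerms() lists terms above a frequency threshold and findAssocs() finds terms whose occurrence correlates with a given term (the threshold values below are illustrative, not tuned):

# Terms appearing at least 50 times across all coffee reviews
findFreqTerms(dtm, lowfreq = 50)

# Terms correlated with "coffee" at r >= 0.3
findAssocs(dtm, "coffee", 0.3)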


Visualize most frequent words

We will visualize the most used words with word clouds, which communicate qualitative, word-based findings simply and effectively.

The arguments of the wordcloud() function are:

  • words : the words to be plotted
  • freq : their frequencies
  • min.freq : words with frequency below min.freq will not be plotted
  • max.words : maximum number of words to be plotted
  • random.order : plot words in random order; if FALSE, they are plotted in decreasing frequency
  • rot.per : proportion of words with 90-degree rotation (vertical text)
  • colors : colors for words from least to most frequent; use, for example, colors = "black" for a single color
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=15, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))
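Note that the layout of a word cloud is partly random (especially with random.order and rot.per), so the same call can produce slightly different pictures. For a reproducible figure, set the random seed first:

# Fix the random layout so the cloud is reproducible
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 15, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))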


Most frequent words in the travel business category

Here I will repeat the same steps for the sake of comparison. We will look at the most frequent words used for business categories that fall under travel and compare them with the most frequent words used for coffee shops.

travel <- grepl("travel", dat$business_categories, ignore.case = TRUE)
sum(travel)
## [1] 219
dat.travel<- dat[travel, c( "business_name", "business_categories", "text")]
dat.travel %>% head(2) %>% kable() %>% kable_styling()
travel.text<-dat.travel$text
docs.travel <- VCorpus(VectorSource(dat.travel$text))
docs.travel <- tm_map(docs.travel, content_transformer(tolower))
docs.travel <- tm_map(docs.travel, removeNumbers)
docs.travel <- tm_map(docs.travel, removeWords, stopwords("english"))
docs.travel <- tm_map(docs.travel, removePunctuation)
docs.travel <- tm_map(docs.travel, stripWhitespace)
dtm <- TermDocumentMatrix(docs.travel)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 15)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=15, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))
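Since this pipeline is identical to the coffee one, a tidier approach is to wrap it in a small helper function and call it once per category. The sketch below (category_wordcloud() is a name made up for this example) reproduces the steps above for any category pattern:

# Hypothetical helper: word cloud for any business-category pattern
category_wordcloud <- function(pattern, data, max.words = 15) {
  rows <- grepl(pattern, data$business_categories, ignore.case = TRUE)
  corp <- VCorpus(VectorSource(data$text[rows]))
  corp <- tm_map(corp, content_transformer(tolower))
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, removeWords, stopwords("english"))
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, stripWhitespace)
  freqs <- sort(rowSums(as.matrix(TermDocumentMatrix(corp))), decreasing = TRUE)
  wordcloud(words = names(freqs), freq = freqs, min.freq = 1,
            max.words = max.words, random.order = FALSE, rot.per = 0.35,
            colors = brewer.pal(8, "Dark2"))
}

# Usage:
# category_wordcloud("coffee", dat)
# category_wordcloud("travel", dat)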

Summary

As you can see, the most frequently used words to describe coffee shops are coffee, great, place, etc. All of these words come from reviews of the coffee shops. The most frequent words describing business categories that fall under travel are room, hotel, service, resort, etc.

I hope you enjoyed reading this and had the chance to learn more about the tm package. Both the tm and wordcloud packages are great tools for quick and easy text analysis!


References & Further Resources

tm package

Text mining in data science

There are also many other ways to perform text analysis in R.

If you are interested, click here.