Nicolette Lorraway

28 March 2018

This vignette explains how to search for a character string within a data frame, then categorise and analyse the data using this new value.

First I loaded two packages, ggplot2 and stringr.

library(ggplot2)
library(stringr)

Data Sets

I used the inbuilt data set named ‘fruit’, from the ‘stringr’ package. This data set is a character vector of 80 elements, each the name of a different fruit.

##  chr [1:80] "apple" "apricot" "avocado" "banana" "bell pepper" ...

Preparing the data frame

I then converted this into a data frame, named ‘AllFruit’.
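The conversion call itself is not shown here. A minimal sketch of what it might look like, assuming a plain `data.frame()` call on the `fruit` vector; `stringsAsFactors = TRUE` is also an assumption, chosen because it was the default before R 4.0.0 and matches the factor columns in the structure output further down.

```r
library(stringr)

# Assumption: the vignette does not show this step, so data.frame() is a guess.
# stringsAsFactors = TRUE stores the names as a factor, as older R versions did.
AllFruit <- data.frame(fruit, stringsAsFactors = TRUE)

# One column, 80 rows
str(AllFruit)
```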

The data frame currently contains just one variable, which is the list of fruit. I then added a second variable to the data frame, called ‘Type’, filled with the placeholder value "NONE". Later we can fill this variable with the category of fruit.

AllFruit <- data.frame(Type = rep("NONE", nrow(AllFruit)), AllFruit[,])

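The data frame structure has now changed to this. Because AllFruit[,] was passed without a name, the second column gets the automatic name ‘AllFruit....’; later code refers to it as AllFruit$AllFruit, which works because $ partially matches data frame column names: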

## 'data.frame':    80 obs. of  2 variables:
##  $ Type        : Factor w/ 1 level "NONE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ AllFruit....: Factor w/ 80 levels "apple","apricot",..: 1 2 3 4 5 6 7 8 9 10 ...

Finding character strings

The data set currently is not categorised. Now I would like to see what proportion of this list can fit into one of two specific categories.

First I would like to check if there are any berries. To do this, I will use str_detect() to see if there are any instances of the phrase ‘berry’ within the AllFruit variable of the data set.

str_detect(AllFruit$AllFruit, "berry")

##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
## [34] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
## [78] FALSE FALSE FALSE

The above tells us where each instance of ‘berry’ is located: each TRUE marks the position of a fruit whose name contains ‘berry’.
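A long logical vector is hard to read by eye. As a small sketch, base R's which() and sum() can summarise it; this version works directly on the `fruit` vector so it runs on its own, and `is_berry` is just an illustrative name.

```r
library(stringr)

# Logical vector: TRUE where the name contains "berry"
is_berry <- str_detect(fruit, "berry")

which(is_berry)   # positions of the matches
fruit[is_berry]   # the matching fruit names
sum(is_berry)     # number of matches: 14, one per TRUE printed above
```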

Adding in a Category Variable Using Character Strings

Now that I know where the berries are (using str_detect()), I will use grepl() to perform pattern matching on the phrase “berry”.

Using ifelse() in combination with grepl(), if this character string appears within an observation, the word ‘Berry’ will appear in the ‘Type’ variable. If it does not, the word ‘Other’ will appear.

AllFruit$Type <- ifelse(grepl("berry",AllFruit$AllFruit),'Berry','Other')
str(AllFruit$Type)

##  chr [1:80] "Other" "Other" "Other" "Other" "Other" "Berry" "Berry" ...
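For a simple pattern like this one, str_detect() and base R's grepl() perform the same test, with the arguments in the opposite order. A quick sketch to confirm they agree on the `fruit` vector (`same_result` is an illustrative name):

```r
library(stringr)

# str_detect(string, pattern) versus grepl(pattern, string)
from_stringr <- str_detect(fruit, "berry")
from_base    <- grepl("berry", fruit)

# TRUE if both functions flag exactly the same elements
same_result <- all(from_stringr == from_base)
same_result
```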

But wait! I want more! I also want to know how many of the fruits have the character string ‘melon’ in their name.

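This time I’ll nest one ifelse() call inside another. The outer call checks whether ‘berry’ appears; if it does not, the inner call checks whether ‘melon’ appears, and anything else is labelled ‘Other’. Wrapping the names in tolower() makes the match case-insensitive.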

AllFruit$Type <- ifelse(grepl('berry', tolower(AllFruit$AllFruit)), 'Berry',
                 ifelse(grepl('melon', tolower(AllFruit$AllFruit)), 'Melon', 'Other'))

I will now convert the ‘Type’ column from character to a factor.

AllFruit$Type <- as.factor(AllFruit$Type)
str(AllFruit$Type)

##  Factor w/ 3 levels "Berry","Melon",..: 3 3 3 3 3 1 1 3 3 1 ...
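Once ‘Type’ is a factor, the counts and proportions per category can be read off with table() and prop.table(). This is a sketch rather than part of the original analysis; it rebuilds the categories from the `fruit` vector so it runs on its own, and `counts` and `props` are illustrative names.

```r
library(stringr)

# Rebuild the three categories directly from the fruit vector
Type <- ifelse(grepl("berry", fruit), "Berry",
        ifelse(grepl("melon", fruit), "Melon", "Other"))
Type <- factor(Type)

counts <- table(Type)         # number of fruits in each category
props  <- prop.table(counts)  # share of each category; the shares sum to 1

counts
round(props, 3)
```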

Showing the frequency

To show graphically how many fruits fall into ‘Berry’, ‘Melon’ and ‘Other’, I will use ggplot() and geom_bar() to draw a bar chart of the counts.

ggplot(AllFruit, aes(x = Type)) +
  geom_bar(width = 0.8) +
  xlab("Category") +
  ylab("Total Fruit Count")
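The chart above plots counts. If proportions are wanted on the y axis instead, one possible variation is to compute them inside geom_bar() with after_stat(). This sketch assumes ggplot2 3.3.0 or later (where after_stat() is available) and uses a stand-alone data frame, `plot_data`, which is an illustrative name.

```r
library(ggplot2)
library(stringr)

# Stand-alone data frame with the same three categories
plot_data <- data.frame(
  Type = factor(ifelse(grepl("berry", fruit), "Berry",
                ifelse(grepl("melon", fruit), "Melon", "Other")))
)

# after_stat(count / sum(count)) rescales each bar to its share of all fruits
p <- ggplot(plot_data, aes(x = Type)) +
  geom_bar(aes(y = after_stat(count / sum(count))), width = 0.8) +
  xlab("Category") +
  ylab("Proportion of Fruit")
p
```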