This vignette explains how to search for a character string within a data frame, then categorise and analyse the data using this new value.
First I loaded two packages: ggplot2 and stringr.
library(ggplot2)
library(stringr)
I used the built-in data set named ‘fruit’ from the ‘stringr’ package. It is a character vector of 80 elements, each the name of a different fruit.
## chr [1:80] "apple" "apricot" "avocado" "banana" "bell pepper" ...
I then converted this into a data frame, named ‘AllFruit’.
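The conversion call itself is not shown in the text. A minimal sketch of one way to do it, assuming the source vector is stringr's `fruit` and that a plain one-column data frame was intended:

```r
library(stringr)

# Wrap the character vector in a one-column data frame.
# Assumption: this call does not appear in the vignette; the column name
# `fruit` is simply what data.frame() derives from the vector's name.
AllFruit <- data.frame(fruit)

# Inspect the result: 80 observations of a single variable.
str(AllFruit)
```

Whether the column is stored as character or as a factor depends on the R version, because the default for `stringsAsFactors` changed in R 4.0.0.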
The data set currently contains just one variable, which is a list of fruit. I then added another variable to the data frame, called ‘Type’, filled with the placeholder value "NONE". Later we can fill this variable with the category of fruit.
AllFruit <- data.frame(Type = rep("NONE", nrow(AllFruit)), AllFruit[,])
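The data frame structure has now changed to the following. The second column was passed without a name, so data.frame() derives the name ‘AllFruit....’ from the expression; later references to AllFruit$AllFruit still resolve to it because $ partially matches data frame column names.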
## 'data.frame': 80 obs. of 2 variables:
## $ Type : Factor w/ 1 level "NONE": 1 1 1 1 1 1 1 1 1 1 ...
## $ AllFruit....: Factor w/ 80 levels "apple","apricot",..: 1 2 3 4 5 6 7 8 9 10 ...
The data set currently is not categorised. Now I would like to see what proportion of this list can fit into one of two specific categories.
First I would like to check if there are any berries. To do this, I will use str_detect() to see if there are any instances of the phrase ‘berry’ within the AllFruit variable of the data set.
str_detect(AllFruit$AllFruit, "berry")
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
## [34] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
## [78] FALSE FALSE FALSE
The output above tells us where each instance of ‘berry’ is located: every TRUE marks a fruit whose name contains the string.
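The logical vector is easier to work with once it is summarised. The sketch below runs against the `fruit` vector directly (rather than the data frame column) so that it is self-contained; the variable name `is_berry` is only illustrative.

```r
library(stringr)

# Logical vector: TRUE where the fruit name contains "berry".
is_berry <- str_detect(fruit, "berry")

# Number of matches (each TRUE counts as 1 when summed).
sum(is_berry)

# Positions of the matches within the vector.
which(is_berry)

# The matching names themselves.
str_subset(fruit, "berry")
```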
Now that I know where the berries are (using str_detect()), I use grepl() to perform the same pattern match on the phrase “berry”.
Using ifelse() in combination with grepl(), if this character string appears within an observation, the word ‘Berry’ is written to the ‘Type’ variable. If it does not, the word ‘Other’ is written.
AllFruit$Type <- ifelse(grepl("berry",AllFruit$AllFruit),'Berry','Other')
str(AllFruit$Type)
## chr [1:80] "Other" "Other" "Other" "Other" "Other" "Berry" "Berry" ...
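For a fixed substring like this, base R's grepl() and stringr's str_detect() should give the same answer; the main practical difference is the argument order. A small check, again using the `fruit` vector directly and illustrative variable names:

```r
library(stringr)

# Both calls test each element for the substring "berry".
# Note the argument order: grepl(pattern, x) vs str_detect(string, pattern).
base_result    <- grepl("berry", fruit)
stringr_result <- str_detect(fruit, "berry")

# TRUE if the two functions agree on every element.
all(base_result == stringr_result)
```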
But wait! I want more! I also want to know how many of the fruits have the character string ‘melon’ in their name.
This time I’ll use nested ifelse() calls. These check whether the word ‘berry’ appears, then whether the word ‘melon’ appears, and otherwise assign ‘Other’. Wrapping the column in tolower() makes the match case-insensitive.
AllFruit$Type <- ifelse(grepl('berry',tolower(AllFruit$AllFruit)),'Berry', ifelse(grepl('melon',tolower(AllFruit$AllFruit)),'Melon','Other'))
I will now convert the ‘Type’ column to a factor.
AllFruit$Type <- as.factor(AllFruit$Type)
str(AllFruit$Type)
## Factor w/ 3 levels "Berry","Melon",..: 3 3 3 3 3 1 1 3 3 1 ...
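With the categories in a factor, the counts and proportions can be tabulated directly. This is a sketch using base R's table() and prop.table(); it rebuilds the classification from the `fruit` vector so it runs on its own, and the names `Type` and `counts` are illustrative.

```r
library(stringr)

# Rebuild the three-way classification from the raw vector.
Type <- ifelse(grepl("berry", fruit), "Berry",
        ifelse(grepl("melon", fruit), "Melon", "Other"))
Type <- as.factor(Type)

# Number of fruits in each category.
counts <- table(Type)
counts

# Share of the list in each category (the shares sum to 1).
prop.table(counts)
```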
Now, to show graphically how many fruits fall into ‘Berry’, ‘Melon’ and ‘Other’, I will use ggplot() and geom_bar() to draw a bar chart of the counts.
ggplot(AllFruit, aes(x = Type)) + geom_bar(width = 0.8) + xlab("Category") + ylab("Total Fruit Count")