Week Three - R Character Manipulation and Date Processing

1. Overview
2. Filtering data using Regular Expressions (regex)
3. Data Transformation
4. Additional exercises

1. Overview

Please deliver links to an R Markdown file (in GitHub and rpubs.com) with solutions to the problems below. You may work in a small group, but please submit separately with names of all group participants in your submission.

2. Filtering data using Regular Expressions (regex)

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset (The Economic Guide To Picking A College Major), provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

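# load the tidyverse (readr, dplyr, tibble, stringr), which the code in this document relies on
library(tidyverse)

# define the URL for the raw csv file containing major list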
majors_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"

# read csv file and store the records into a data frame
majors <- read_csv(majors_url)

# make sure the data is a tibble (read_csv already returns one, so this step changes nothing)
mt <- as_tibble(majors)

# from the list of majors filter only those which contain the words DATA OR STATISTICS
tbl2 <- mt %>% filter(grepl('DATA|STATISTICS', Major, ignore.case = TRUE))

# count the total number of majors in the majors list
total_majors <- nrow(majors) # Actual total of majors in data set

# A total of `total_majors` majors were found in the current data set.


# display the list of filtered majors
knitr::kable(tbl2, caption = 'Table: List of Majors that contain the words "DATA" or "STATISTICS"')

Table: List of Majors that contain the words “DATA” or “STATISTICS”
FOD1P	Major	Major_Category
6212	MANAGEMENT INFORMATION SYSTEMS AND STATISTICS	Business
2101	COMPUTER PROGRAMMING AND DATA PROCESSING	Computers & Mathematics
3702	STATISTICS AND DECISION SCIENCE	Computers & Mathematics
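The same pattern can be tried without a network connection. The following is a minimal sketch, assuming a small hand-typed sample (`sample_majors`, a hypothetical name) made of the three majors from the table above and two other names added purely for illustration; it is not the full data set.

```r
# Illustrative sample only: three majors from the table above plus two
# other names chosen for illustration; NOT the full fivethirtyeight data set.
sample_majors <- c(
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "STATISTICS AND DECISION SCIENCE",
  "GENERAL AGRICULTURE",
  "ZOOLOGY"
)

# "|" is regex alternation: a major matches if it contains DATA or STATISTICS
is_match <- grepl("DATA|STATISTICS", sample_majors, ignore.case = TRUE)

# keep only the matching majors
matched_majors <- sample_majors[is_match]
print(matched_majors)
```

Because `ignore.case = TRUE` is set, a lower-case spelling such as "data" would be kept as well.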

3. Data Transformation

Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”

[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”

[9] “elderberry” “lime” “lychee” “mulberry”

[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

# store the character data as is, including spaces, line breaks, and double quotes
raw_data <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"

[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  

[9] "elderberry"   "lime"         "lychee"       "mulberry"    

[13] "olive"        "salal berry"'


# define a regex to match words surrounded by double quotes, leaving the quotes out of the match: (?<=")\w+\s?\w*(?=")
regex_double_quote_surrounded_words <- '(?<=")\\w+\\s?\\w*(?=")'

# extract the quoted words from the string; str_extract_all() returns a list, so keep its first element
extracted_quoted_words <- str_extract_all(raw_data, regex_double_quote_surrounded_words)[[1]]

# wrap each word in double quotes, join the words with commas, and enclose the result in c( ) as per the ask
new_character_format <- paste0('c(', paste0('"', extracted_quoted_words, '"', collapse = ', '), ')')


extracted_quoted_words

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
##  [5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
##  [9] "elderberry"   "lime"         "lychee"       "mulberry"    
## [13] "olive"        "salal berry"

new_character_format

## [1] "c(\"bell pepper\", \"bilberry\", \"blackberry\", \"blood orange\", \"blueberry\", \"cantaloupe\", \"chili pepper\", \"cloudberry\", \"elderberry\", \"lime\", \"lychee\", \"mulberry\", \"olive\", \"salal berry\")"
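One way to check a solution to this task is a round trip: parse the generated text as R code and confirm that it evaluates to the 14 fruit names. The sketch below is an alternative that uses only base R (`regmatches`, `gregexpr`); the names `raw_text`, `quoted`, `fruits`, and `as_code` are hypothetical and chosen for this illustration.

```r
raw_text <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# match each double-quoted item, quotes included: "[^"]+"
quoted <- regmatches(raw_text, gregexpr('"[^"]+"', raw_text))[[1]]

# drop the surrounding quotes to get the bare fruit names
fruits <- gsub('"', '', quoted, fixed = TRUE)

# build the text of an R expression: c("bell pepper", "bilberry", ...)
as_code <- paste0("c(", paste(quoted, collapse = ", "), ")")
writeLines(as_code)

# round trip: evaluating the generated text should give back the same vector
round_trip <- eval(parse(text = as_code))
stopifnot(identical(round_trip, fruits))
```

`writeLines()` shows the text without the escaping backslashes that `print()` adds around inner double quotes, which makes it easier to compare with the target format.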

4. Additional exercises

# Build a tibble with a list of sample words to be tested against the regular expressions
dt_test <- tibble( words = c("Ohio","Pennelope","Kabbak","tomato","eleven","church","Pop","papa","Robocop","look","Pajama","noN","XXXrated"))

Table: List of Sample words to be used to test the Regular Expressions
words
Ohio
Pennelope
Kabbak
tomato
eleven
church
Pop
papa
Robocop
look
Pajama
noN
XXXrated

The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:

4.1 Interpreting regular expressions

Describe, in words, what these expressions will match:

(.)\1\1

It matches a character that repeats 3 times consecutively.

dt_matches <- dt_test %>% dplyr::filter(str_detect(words, regex('(.)\\1\\1', ignore_case = T)))

knitr::kable(dt_matches)

words
XXXrated
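The pattern appears as `(.)\1\1` in the exercise text but as `'(.)\\1\\1'` in the code, because R string literals treat the backslash as an escape character. The short sketch below illustrates the difference; the raw-string form assumes R 4.0.0 or later, and the variable names are hypothetical.

```r
# in a regular R string literal every backslash is doubled...
pattern <- "(.)\\1\\1"

# ...but the regex engine receives single backslashes
writeLines(pattern)   # prints (.)\1\1

# raw strings (R >= 4.0.0) avoid the doubling altogether
raw_pattern <- r"((.)\1\1)"

# both spellings describe exactly the same seven-character pattern
same_pattern <- identical(pattern, raw_pattern)
print(same_pattern)
```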

“(.)(.)\2\1”

It matches two characters followed by the same two characters in reverse order, such as “abba”.

dt_matches <- dt_test %>% dplyr::filter(str_detect(words, regex("(.)(.)\\2\\1", ignore_case = T)))

knitr::kable(dt_matches)

words
Pennelope
Kabbak

(..)\1

It matches two characters followed by the same two characters.

dt_matches <- dt_test %>% dplyr::filter(str_detect(words, regex("(..)\\1", ignore_case = T)))

knitr::kable(dt_matches)

words
papa

“(.).\1.\1”

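It matches a five-character sequence whose first, third, and fifth characters are the same, such as “eleve” in “eleven”; the second and fourth characters can be anything.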

dt_matches <- dt_test %>% dplyr::filter(str_detect(words, regex("(.).\\1.\\1", ignore_case = T)))

knitr::kable(dt_matches)

words
eleven
Robocop
Pajama

"(.)(.)(.).*\3\2\1"

It matches three characters, then zero or more arbitrary characters, then the same three characters in reverse order, such as “abccba” or “abc1cba”.

dt_matches <- dt_test %>% dplyr::filter(str_detect(words, regex("(.)(.)(.).*\\3\\2\\1", ignore_case = T)))

knitr::kable(dt_matches)

words
Kabbak
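The tables above show which words match, but not which part of each word triggers the match. As a rough sketch, the matched substring can be extracted with base R's `regexpr()` and `regmatches()`; the helper `matched_part()` is a hypothetical name introduced here, and `perl = TRUE` is an assumption made so that backreferences and case-insensitivity behave as in the filters above.

```r
words <- c("Ohio", "Pennelope", "Kabbak", "tomato", "eleven", "church", "Pop",
           "papa", "Robocop", "look", "Pajama", "noN", "XXXrated")

# the five patterns from this section, written as R strings
patterns <- c("(.)\\1\\1", "(.)(.)\\2\\1", "(..)\\1", "(.).\\1.\\1",
              "(.)(.)(.).*\\3\\2\\1")

# for one pattern, return the first matched substring of every matching word
matched_part <- function(pattern, x) {
  hit <- regexpr(pattern, x, ignore.case = TRUE, perl = TRUE)
  regmatches(x, hit)  # words without a match are dropped
}

matches <- lapply(patterns, matched_part, x = words)
names(matches) <- patterns
print(matches)
```

For example, the second pattern picks out "enne" from "Pennelope" and "abba" from "Kabbak", while the last pattern consumes the whole word "Kabbak" because matching ignores case.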

4.2 Constructing regular expressions

4.2.1 Regular expression to match words that start and end with the same character.

# define a regular expression that will match words starting and ending with the same character: \b(\w)\w*\1\b
regex_same_1st_n_last_char <- "\\b(\\w)\\w*\\1\\b"

# filter the words that match the regular expression
dt_flc_filtered <- dt_test %>% dplyr::filter(str_detect(words, regex(regex_same_1st_n_last_char, ignore_case = T)))

knitr::kable(dt_flc_filtered, caption = 'Table: List of words starting and ending with the same character')

Table: List of words starting and ending with the same character
words
Ohio
Kabbak
Pop
noN
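One edge case worth noting: the pattern above needs at least two characters, so a one-letter word such as "a", which trivially starts and ends with the same character, is not matched. The sketch below compares it with a hypothetical variant (`one_plus`) that makes the repeated part optional; it assumes each element holds exactly one word, since the variant anchors on the whole string with `^` and `$`.

```r
# pattern from the section above: needs at least two characters to match
two_plus <- "\\b(\\w)\\w*\\1\\b"

# hypothetical variant: the group "(.*\\1)" is optional, so one character suffices
one_plus <- "^(.)(.*\\1)?$"

samples <- c("a", "Pop", "noN", "look")

with_two_plus <- grepl(two_plus, samples, ignore.case = TRUE, perl = TRUE)
with_one_plus <- grepl(one_plus, samples, ignore.case = TRUE, perl = TRUE)

# side-by-side comparison of the two patterns
print(data.frame(samples, with_two_plus, with_one_plus))
```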

4.2.2 Regular expression to match words that contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

# define a regular expression that will match words that contain a repeated pair of letters anywhere in the word:  ([a-z]{2})\w*\1
# (no \b anchors, so the pair does not have to sit at the start and end of the word)
regex_2_rep_letters <- "([a-z]{2})\\w*\\1"

# filter the words that match the regular expression
dt_flc_filtered <- dt_test %>% dplyr::filter(str_detect(words, regex(regex_2_rep_letters, ignore_case = T)))

knitr::kable(dt_flc_filtered, caption = 'Table: List of words that contain a repeated pair of letters')

Table: List of words that contain a repeated pair of letters
words
Pennelope
tomato
church
papa

4.2.3 Regular expression to match words that contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

# define a regular expression that will match words that contain one letter repeated in at least three places:  ([a-z])\w*\1\w*\1
regex_3_rep_letters <- "([a-z])\\w*\\1\\w*\\1"

# filter the words that match the regular expression
dt_flc_filtered <- dt_test %>% dplyr::filter(str_detect(words, regex(regex_3_rep_letters, ignore_case = T)))

knitr::kable(dt_flc_filtered, caption = 'Table: List of words that contain one letter repeated in at least three places')

Table: List of words that contain one letter repeated in at least three places
words
Pennelope
eleven
Robocop
Pajama
XXXrated
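A regex result like this one can be cross-checked with a plain counting approach that uses no regular expression at all. The sketch below, written in base R with hypothetical names `by_regex` and `by_count`, counts how often each letter occurs in a word (ignoring case) and compares the outcome with the pattern used above.

```r
words <- c("Ohio", "Pennelope", "Kabbak", "tomato", "eleven", "church", "Pop",
           "papa", "Robocop", "look", "Pajama", "noN", "XXXrated")

# regex approach, same pattern as above
by_regex <- grepl("([a-z])\\w*\\1\\w*\\1", words, ignore.case = TRUE, perl = TRUE)

# counting approach: split each lower-cased word into letters and check
# whether the most frequent letter occurs at least three times
by_count <- vapply(strsplit(tolower(words), ""),
                   function(letters_in_word) max(table(letters_in_word)) >= 3,
                   logical(1))

print(words[by_regex])

# the two approaches should agree on every word in the sample
stopifnot(identical(by_regex, by_count))
```

Agreement on this sample does not prove the pattern correct in general, but a disagreement would point to a word worth inspecting.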