Introduction

In this assignment, I work through four sets of exercises on string manipulation in R.

Part 1

I use the 173 majors listed in fivethirtyeight.com’s College Majors dataset to provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

# Load the tidyverse, which provides read_csv() (readr) and str_detect() (stringr).
library(tidyverse)
# Read the data from its source URL, clean the column names, and keep only the 3 observations where "major" contains "DATA" or "STATISTICS".
majors_list <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv") |>
  janitor::clean_names() |>
  dplyr::filter(str_detect(major, "DATA|STATISTICS"))
## Rows: 173 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Major, Major_category
## dbl (9): Major_code, Total, Employed, Employed_full_time_year_round, Unemplo...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

After running the above code, there are only 3 majors remaining in the filtered dataset: computer programming and data processing, statistics and decision science, and management information systems and statistics.

majors_list$major
## [1] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [2] "STATISTICS AND DECISION SCIENCE"              
## [3] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
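
The filter above relies on regex alternation: DATA|STATISTICS matches any string containing either substring. As a minimal sketch on a small made-up vector (the toy_majors values are hypothetical and are not rows from the dataset), the following illustrates that the match is a case-sensitive substring match, and shows two optional variations that might be useful on messier data:

```r
library(stringr)

# Hypothetical values for illustration only; not taken from the dataset.
toy_majors <- c("DATA SCIENCE", "APPLIED STATISTICS", "Data Entry",
                "METADATA STUDIES", "HISTORY")

# Plain alternation: a case-sensitive substring match,
# so "METADATA STUDIES" matches and "Data Entry" does not.
plain_match <- str_detect(toy_majors, "DATA|STATISTICS")

# Word boundaries (\\b) require DATA or STATISTICS to appear as a whole word.
whole_word <- str_detect(toy_majors, "\\b(DATA|STATISTICS)\\b")

# regex(..., ignore_case = TRUE) makes the match case-insensitive.
any_case <- str_detect(toy_majors, regex("DATA|STATISTICS", ignore_case = TRUE))

data.frame(toy_majors, plain_match, whole_word, any_case)
```

For the College Majors data the plain pattern is sufficient, since the major names are stored in upper case.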

Part 2

The code below transforms the following data:

[1] “bell pepper”  “bilberry”     “blackberry”   “blood orange”

[5] “blueberry”    “cantaloupe”   “chili pepper” “cloudberry”  

[9] “elderberry”   “lime”         “lychee”       “mulberry”    

[13] “olive”        “salal berry”

into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

First, I store the character vector, pre-transformation, in R:

# Creating the data in R as specified in the assignment.
fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange",
            "blueberry", "cantaloupe", "chili pepper", "cloudberry",
            "elderberry", "lime", "lychee", "mulberry",
            "olive", "salal berry")

# This is the data pre-transformation, as specified in assignment instructions:
fruits
##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"

Next, I transform the character vector into a string that looks like the explicit representation of the character vector using the c() function. I show the transformed format in string view below:

# Wrap each fruit in double quotes, join them with ", " (comma-delimited), then add "c(" at the start and ")" at the end, as specified in the assignment question:
fruits_transformed <- str_c("c(", str_c('"', fruits, '"', collapse = ", "), ")")
str_view(fruits_transformed)
## [1] │ c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
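
One way to check the transformation, sketched here as an optional step that is not part of the assignment, is to parse the resulting string as R code and confirm that it evaluates back to the original vector. The name round_trip is introduced only for this illustration, and the check assumes that no element contains a double quote or a backslash, which would need escaping first:

```r
library(stringr)

fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange",
            "blueberry", "cantaloupe", "chili pepper", "cloudberry",
            "elderberry", "lime", "lychee", "mulberry",
            "olive", "salal berry")

# Same transformation as above: quote each element, join with ", ", wrap in c( ).
fruits_transformed <- str_c("c(", str_c('"', fruits, '"', collapse = ", "), ")")

# Parse the string as R code and evaluate it.
round_trip <- eval(parse(text = fruits_transformed))

# TRUE when the string is a faithful c() representation of the vector.
identical(round_trip, fruits)
```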

Part 3

Here I describe, in words, what the following expressions will match:

  • (.)\1\1 is a regular expression. It matches any single character that appears three times in a row. E.g., (.)\1\1: c("ccc", "444", "$$$").
expression <- "(.)\\1\\1" # naming expression
str_view(expression) # str_view() prints the pattern as the regex engine sees it, which should match the regular expression given above
## [1] │ (.)\1\1
str_view(c(words, "ccc", "444", "$$$"), expression) # testing matches
## [981] │ <ccc>
## [982] │ <444>
## [983] │ <$$$>
  • "(.)(.)\\2\\1" is a character string representing a regular expression; the actual regular expression is (.)(.)\2\1. It matches a string with any two characters followed by those same two characters in reverse order. E.g., "(.)(.)\\2\\1": c("afternoon", "apparent", "arrange").
expression <- "(.)(.)\\2\\1" # naming expression
str_view(expression) # string view of expression
## [1] │ (.)(.)\2\1
str_view(c(words), expression) # testing matches
##  [19] │ after<noon>
##  [43] │ <appa>rent
##  [53] │ <arra>nge
## [107] │ b<otto>m
## [112] │ br<illi>ant
## [174] │ c<ommo>n
## [230] │ d<iffi>cult
## [259] │ <effe>ct
## [329] │ f<ollo>w
## [422] │ in<deed>
## [470] │ l<ette>r
## [521] │ m<illi>on
## [581] │ <oppo>rtunity
## [582] │ <oppo>se
## [877] │ tom<orro>w
  • (..)\1 is a regular expression that matches any string in which a pair of characters is immediately followed by the same pair. E.g., (..)\1: c("remember", "banana", "coconut").
expression <- "(..)\\1" # naming expression
str_view(expression) # string view of expression
## [1] │ (..)\1
str_view(c(words, fruit), expression) # testing matches
##  [696] │ r<emem>ber
##  [984] │ b<anan>a
## [1000] │ <coco>nut
## [1002] │ <cucu>mber
## [1021] │ <juju>be
## [1036] │ <papa>ya
## [1053] │ s<alal> berry
  • "(.).\\1.\\1" is a character string representing the regular expression (.).\1.\1. This matches a string with at least 5 consecutive characters, where the 1st, 3rd, and 5th of those characters are the same. E.g., "(.).\\1.\\1" : c("eleven", "banana", "papaya").
expression <- "(.).\\1.\\1" # naming expression
str_view(expression) # string view of expression
## [1] │ (.).\1.\1
str_view(c(words, fruit, "elleven"), expression) # testing matches
##  [265] │ <eleve>n
##  [984] │ b<anana>
## [1036] │ p<apaya>
  • "(.)(.)(.).*\\3\\2\\1" is a character string representing the regular expression (.)(.)(.).*\3\2\1, which matches any string with at least 6 characters, where there are 3 characters in a row, followed later by those same 3 characters in reverse order. There can be zero or more characters between the first chunk of 3 and the reversed chunk of 3. E.g., "(.)(.)(.).*\\3\\2\\1": c("paragraph", "parrap", "abc123abc321abc").
expression <- "(.)(.)(.).*\\3\\2\\1"  # naming expression
str_view(expression) # string view of expression
## [1] │ (.)(.)(.).*\3\2\1
str_view(c(words, fruit, "parrap", "parap", "abc123abc321abc"), expression) # testing matches
##  [598] │ <paragrap>h
## [1061] │ <parrap>
## [1063] │ abc<123abc321>abc
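
The distinction between the R string and the regular expression it represents can be made concrete with a short sketch. The test_words vector below is a small hand-picked sample rather than the full words dataset, and str_extract() and str_match() are used here only as an alternative to str_view() for inspecting what was matched:

```r
library(stringr)

# R string literal; after escaping, the regex engine sees (.)(.)\2\1
pattern <- "(.)(.)\\2\\1"

# nchar() counts the characters of the actual regex, so each \\ counts once.
pattern_length <- nchar(pattern)

# Hand-picked sample for illustration.
test_words <- c("afternoon", "apparent", "bottom", "banana")

# str_extract() returns the matched substring, or NA when nothing matches.
extracted <- str_extract(test_words, pattern)

# str_match() also returns each capture group in its own column:
# column 1 is the full match, columns 2 and 3 are groups \1 and \2.
groups <- str_match(test_words, pattern)

pattern_length
extracted
groups
```

Looking at the capture groups directly makes it easier to see why a word such as "banana" fails this pattern: no two adjacent characters are followed by their own mirror image.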

Part 4

Below are regular expressions to do the following:

  • match words that start and end with the same character: the regular expression ^(.).*\1$. Because this pattern needs at least two characters, a one-letter word such as “a” is not matched.
expression <- "^(.).*\\1$" # naming expression using character string representing regular expression
str_view(expression) # string view of expression
## [1] │ ^(.).*\1$
str_view(c(words), expression) # testing matches
##  [36] │ <america>
##  [49] │ <area>
## [209] │ <dad>
## [213] │ <dead>
## [223] │ <depend>
## [258] │ <educate>
## [266] │ <else>
## [268] │ <encourage>
## [270] │ <engine>
## [278] │ <europe>
## [283] │ <evidence>
## [285] │ <example>
## [287] │ <excuse>
## [288] │ <exercise>
## [291] │ <expense>
## [292] │ <experience>
## [296] │ <eye>
## [386] │ <health>
## [394] │ <high>
## [450] │ <knock>
## ... and 16 more
  • contain a repeated pair of letters (e.g., “church” contains “ch” twice): (..).*\1.
expression <- "(..).*\\1" # naming expression
str_view(expression) # string view of expression
## [1] │ (..).*\1
str_view(c(words), expression) # testing matches
##  [48] │ ap<propr>iate
## [152] │ <church>
## [181] │ c<ondition>
## [217] │ <decide>
## [275] │ <environmen>t
## [487] │ l<ondon>
## [598] │ pa<ragra>ph
## [603] │ p<articular>
## [617] │ <photograph>
## [638] │ p<repare>
## [641] │ p<ressure>
## [696] │ r<emem>ber
## [698] │ <repre>sent
## [699] │ <require>
## [739] │ <sense>
## [858] │ the<refore>
## [903] │ u<nderstand>
## [946] │ w<hethe>r
  • contain one letter repeated in at least three places (e.g., “eleven” contains three “e”s): (.).*\1.*\1.
expression <- "(.).*\\1.*\\1" # naming expression
str_view(expression) # string view of expression
## [1] │ (.).*\1.*\1
str_view(c(words), expression) # testing matches
##  [48] │ a<pprop>riate
##  [62] │ <availa>ble
##  [86] │ b<elieve>
##  [90] │ b<etwee>n
## [119] │ bu<siness>
## [221] │ d<egree>
## [229] │ diff<erence>
## [233] │ di<scuss>
## [265] │ <eleve>n
## [275] │ e<nvironmen>t
## [283] │ <evidence>
## [288] │ <exercise>
## [291] │ <expense>
## [292] │ <experience>
## [423] │ <indivi>dual
## [598] │ p<aragra>ph
## [684] │ r<eceive>
## [696] │ r<emembe>r
## [698] │ r<eprese>nt
## [845] │ t<elephone>
## ... and 2 more
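
The first pattern in this part, ^(.).*\1$, only matches strings of two or more characters. If one-character words should also count as starting and ending with the same character, one possible variant (an assumption on my part, not something the assignment requires) makes the repeated ending optional. The sketch below compares the two on a small hand-picked vector; the names strict and relaxed are introduced only for this illustration:

```r
library(stringr)

# Hand-picked sample for illustration.
test_words <- c("a", "dad", "area", "ab", "health", "hello")

# Pattern from the text: first character, anything, then the first character again.
strict <- "^(.).*\\1$"

# Assumed variant: the "anything, then the first character again" part is optional,
# so a single character on its own also matches.
relaxed <- "^(.)(.*\\1)?$"

strict_match  <- str_detect(test_words, strict)
relaxed_match <- str_detect(test_words, relaxed)

data.frame(test_words, strict_match, relaxed_match)
```

The two patterns agree on every word of two or more characters and differ only on the one-letter word.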

Conclusion

In this assignment, I explored several aspects of string manipulation in R, including filtering string data on specific criteria and using regular expressions to find specific patterns in strings. I plan to build on this practice when cleaning and manipulating larger string datasets. Testing these patterns on larger or more diverse sets of strings would increase my confidence in their applicability, and I expect to revise some of them for use with the txt file for Project 1 next week.