Introduction

In this assignment, I work through string manipulation exercises in four parts.

Part 1

I use the 173 majors listed in fivethirtyeight.com’s College Majors dataset to provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

# Load the tidyverse, which provides read_csv() and str_detect().
library(tidyverse)

# Read the data from its original source, clean the column names, and keep only the 3 observations where the variable "major" includes "DATA" or "STATISTICS".
majors_list <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv") |> 
  janitor::clean_names() |> 
  dplyr::filter(str_detect(major, "DATA|STATISTICS"))

## Rows: 173 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Major, Major_category
## dbl (9): Major_code, Total, Employed, Employed_full_time_year_round, Unemplo...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

After running the above code, there are only 3 majors remaining in the filtered dataset: computer programming and data processing, statistics and decision science, and management information systems and statistics.

majors_list$major

## [1] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [2] "STATISTICS AND DECISION SCIENCE"              
## [3] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
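
The filter above relies on the pattern "DATA|STATISTICS" being case-sensitive, which works here because the dataset stores majors in upper case. As a minimal sketch of how the match behaves on mixed-case input, the example below uses a small hypothetical vector (`example_majors` is an assumption, not part of the dataset) and stringr's `regex()` helper with `ignore_case = TRUE`:

```r
library(stringr)

# Hypothetical majors for illustration only; not taken from the dataset.
example_majors <- c("STATISTICS AND DECISION SCIENCE", "Data Science", "BIOLOGY")

# Case-sensitive alternation, as used in the filter above.
case_sensitive <- str_detect(example_majors, "DATA|STATISTICS")

# Same alternation, ignoring case.
case_insensitive <- str_detect(example_majors, regex("DATA|STATISTICS", ignore_case = TRUE))

case_sensitive    # TRUE FALSE FALSE
case_insensitive  # TRUE  TRUE FALSE
```

If the source data were ever mixed case, the second form would be the safer choice.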

Part 2

The code below transforms the following data:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”

[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”

[9] “elderberry” “lime” “lychee” “mulberry”

[13] “olive” “salal berry”

into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

First, I store the character vector, pre-transformation, in R:

# Creating the data in R as specified in the assignment.
fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange",
            "blueberry", "cantaloupe", "chili pepper", "cloudberry",
            "elderberry", "lime", "lychee", "mulberry",
            "olive", "salal berry")

# This is the data pre-transformation, as specified in assignment instructions:
fruits

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"

Next, I transform the character vector into a single string that looks like the c() call that would create the vector. I show the transformed string with str_view(), which prints its contents without escaping the inner quotes:

# The following command transforms the original data into a single comma-delimited string that starts with "c(", encloses each fruit in quotes, and ends with ")", as specified in the assignment question:
fruits_transformed <- str_c("c(", str_c('"', fruits, '"', collapse = ", "), ")")
str_view(fruits_transformed)

## [1] │ c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
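
One way to check that the transformed string really is a valid c() call is to parse and evaluate it, then compare the result with the original vector. The sketch below is only a sanity check, not part of the assignment; it rebuilds the same two objects so it can run on its own, and `round_trip` is a name introduced here for illustration.

```r
library(stringr)

fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange",
            "blueberry", "cantaloupe", "chili pepper", "cloudberry",
            "elderberry", "lime", "lychee", "mulberry",
            "olive", "salal berry")

# Same transformation as above: quote each element, join with ", ", wrap in c( ).
fruits_transformed <- str_c("c(", str_c('"', fruits, '"', collapse = ", "), ")")

# Parse the string as R code and evaluate it to get a character vector back.
round_trip <- eval(parse(text = fruits_transformed))

identical(round_trip, fruits)  # TRUE when the string reproduces the vector
```

This check assumes no element contains a double quote or backslash, which holds for the fruit names used here.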

Part 3

Here I describe, in words, what the following expressions will match:

(.)\1\1 is a regular expression. It matches strings with the same character three consecutive times. E.g. (.)\1\1: c("ccc", "444", "$$$").

expression <- "(.)\\1\\1" # naming expression
str_view(expression) # String view should print like the regular expression as specified in assignment text above

## [1] │ (.)\1\1

str_view(c(words,"ccc", "444", "$$$"), expression) # testing matches

## [981] │ <ccc>
## [982] │ <444>
## [983] │ <$$$>
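
The distinction between a regular expression and the R string that holds it comes down to backslash escaping. As a small sketch (assuming R 4.0.0 or later, where raw strings are available), the same pattern can be written either with doubled backslashes or as a raw string; `escaped` and `raw` are names introduced here for illustration.

```r
library(stringr)

escaped <- "(.)\\1\\1"   # ordinary string: each backslash is doubled
raw     <- r"((.)\1\1)"  # raw string (R >= 4.0.0): backslashes written once

identical(escaped, raw)  # TRUE: both hold the regular expression (.)\1\1
nchar(escaped)           # 7 characters: ( . ) \ 1 \ 1

# Hypothetical test strings, not taken from the assignment.
str_detect(c("ccc", "cc", "a$$$"), raw)  # TRUE FALSE TRUE
```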

"(.)(.)\\2\\1" is a character string representing a regular expression; the actual regular expression is (.)(.)\2\1. It matches a string with any two characters followed by those same two characters in reverse order. E.g., "(.)(.)\\2\\1": c("afternoon", "apparent", "arrange").

expression <- "(.)(.)\\2\\1" # naming expression
str_view(expression) # string view of expression

## [1] │ (.)(.)\2\1

str_view(c(words), expression) # testing matches

##  [19] │ after<noon>
##  [43] │ <appa>rent
##  [53] │ <arra>nge
## [107] │ b<otto>m
## [112] │ br<illi>ant
## [174] │ c<ommo>n
## [230] │ d<iffi>cult
## [259] │ <effe>ct
## [329] │ f<ollo>w
## [422] │ in<deed>
## [470] │ l<ette>r
## [521] │ m<illi>on
## [581] │ <oppo>rtunity
## [582] │ <oppo>se
## [877] │ tom<orro>w

(..)\1 is a regular expression that matches any two characters immediately followed by the same two characters. E.g., (..)\1: c("remember", "banana", "coconut").

expression <- "(..)\\1" # naming expression
str_view(expression) # string view of expression

## [1] │ (..)\1

str_view(c(words, fruit), expression) # testing matches

##  [696] │ r<emem>ber
##  [984] │ b<anan>a
## [1000] │ <coco>nut
## [1002] │ <cucu>mber
## [1021] │ <juju>be
## [1036] │ <papa>ya
## [1053] │ s<alal> berry

"(.).\\1.\\1" is a character string representing the regular expression (.).\1.\1. This matches any run of 5 consecutive characters in which the 1st, 3rd, and 5th characters are the same. E.g., "(.).\\1.\\1" : c("eleven", "banana", "papaya").

expression <- "(.).\\1.\\1" # naming expression
str_view(expression) # string view of expression

## [1] │ (.).\1.\1

str_view(c(words, fruit, "elleven"), expression) # testing matches

##  [265] │ <eleve>n
##  [984] │ b<anana>
## [1036] │ p<apaya>

"(.)(.)(.).*\\3\\2\\1" is a character string representing the regular expression (.)(.)(.).*\3\2\1, which matches any string of at least 6 characters that contains any 3 characters in a row followed later by those same 3 characters in reverse order. Any number of characters, including none, can sit between the first chunk of 3 and the reversed chunk of 3. E.g., "(.)(.)(.).*\\3\\2\\1": c("paragraph", "parrap", "abc123abc321abc").

expression <- "(.)(.)(.).*\\3\\2\\1"  # naming expression
str_view(expression) # string view of expression

## [1] │ (.)(.)(.).*\3\2\1

str_view(c(words, fruit, "parrap", "parap", "abc123abc321abc"), expression) # testing matches

##  [598] │ <paragrap>h
## [1061] │ <parrap>
## [1063] │ abc<123abc321>abc
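
To see which characters each capture group picked up, str_match() returns the full match alongside one column per group. The following is a minimal sketch using two of the test strings from above; `pattern` and `groups` are names introduced here for illustration.

```r
library(stringr)

pattern <- "(.)(.)(.).*\\3\\2\\1"

# Column 1 is the full match; columns 2 to 4 are groups 1 to 3.
groups <- str_match(c("paragraph", "parap"), pattern)

groups[1, ]  # "paragrap" "p" "a" "r"
groups[2, ]  # all NA: "parap" has no three characters followed by their reverse
```

For "paragraph", the groups capture "p", "a", and "r", the .* covers "ag", and the backreferences then consume "rap".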

Part 4

Below are regular expressions to do the following:

match words that start and end with the same character: the regular expression ^(.).*\1$.

expression <- "^(.).*\\1$" # naming expression using character string representing regular expression
str_view(expression) # string view of expression

## [1] │ ^(.).*\1$

str_view(c(words), expression) # testing matches

##  [36] │ <america>
##  [49] │ <area>
## [209] │ <dad>
## [213] │ <dead>
## [223] │ <depend>
## [258] │ <educate>
## [266] │ <else>
## [268] │ <encourage>
## [270] │ <engine>
## [278] │ <europe>
## [283] │ <evidence>
## [285] │ <example>
## [287] │ <excuse>
## [288] │ <exercise>
## [291] │ <expense>
## [292] │ <experience>
## [296] │ <eye>
## [386] │ <health>
## [394] │ <high>
## [450] │ <knock>
## ... and 16 more
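
The pattern ^(.).*\1$ needs at least two characters, so a one-letter word such as "a" is not matched even though it trivially starts and ends with the same character. Whether such words should count depends on how the exercise is read. As a hedged sketch, a variant that makes the tail optional would include them; the `variant` pattern and the test words below are assumptions introduced here, not part of the assignment.

```r
library(stringr)

original <- "^(.).*\\1$"     # pattern used above
variant  <- "^(.)(.*\\1)?$"  # hypothetical variant: the tail is optional

# Hypothetical test words for illustration.
test_words <- c("a", "dad", "ab", "area")

original_hits <- str_detect(test_words, original)
variant_hits  <- str_detect(test_words, variant)

original_hits  # FALSE  TRUE FALSE  TRUE
variant_hits   #  TRUE  TRUE FALSE  TRUE
```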

contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.): (..).*\1.

expression <- "(..).*\\1" # naming expression
str_view(expression) # string view of expression

## [1] │ (..).*\1

str_view(c(words), expression) # testing matches

##  [48] │ ap<propr>iate
## [152] │ <church>
## [181] │ c<ondition>
## [217] │ <decide>
## [275] │ <environmen>t
## [487] │ l<ondon>
## [598] │ pa<ragra>ph
## [603] │ p<articular>
## [617] │ <photograph>
## [638] │ p<repare>
## [641] │ p<ressure>
## [696] │ r<emem>ber
## [698] │ <repre>sent
## [699] │ <require>
## [739] │ <sense>
## [858] │ the<refore>
## [903] │ u<nderstand>
## [946] │ w<hethe>r

contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.): (.).*\1.*\1.

expression <- "(.).*\\1.*\\1" # naming expression
str_view(expression) # string view of expression

## [1] │ (.).*\1.*\1

str_view(c(words), expression) # testing matches

##  [48] │ a<pprop>riate
##  [62] │ <availa>ble
##  [86] │ b<elieve>
##  [90] │ b<etwee>n
## [119] │ bu<siness>
## [221] │ d<egree>
## [229] │ diff<erence>
## [233] │ di<scuss>
## [265] │ <eleve>n
## [275] │ e<nvironmen>t
## [283] │ <evidence>
## [288] │ <exercise>
## [291] │ <expense>
## [292] │ <experience>
## [423] │ <indivi>dual
## [598] │ p<aragra>ph
## [684] │ r<eceive>
## [696] │ r<emembe>r
## [698] │ r<eprese>nt
## [845] │ t<elephone>
## ... and 2 more
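
As a cross-check on (.).*\1.*\1, the same question can be answered without a regular expression by counting how often each character occurs in a word. The sketch below is one possible check under that assumption; `has_triple` and `test_words` are hypothetical names introduced here, and the helper assumes non-empty words.

```r
library(stringr)

# TRUE when some character occurs at least three times in the word.
has_triple <- function(word) {
  counts <- table(strsplit(word, "")[[1]])
  max(counts) >= 3
}

# Hypothetical test words for illustration.
test_words <- c("eleven", "banana", "church")

by_count <- vapply(test_words, has_triple, logical(1), USE.NAMES = FALSE)
by_regex <- str_detect(test_words, "(.).*\\1.*\\1")

by_count                      # TRUE TRUE FALSE
identical(by_count, by_regex) # TRUE: both approaches agree on these words
```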

Week 3 assignment

Naomi Buell

2024-02-14
