DATA 607: Assignment 3

Question 1

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

Load Libraries

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Load data from Github

College_maj<-read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv')

Code that identifies the majors that contain either “DATA” or “STATISTICS”

find='DATA|STATISTICS'
College_maj_sub <- College_maj$Major[grep(find, College_maj$Major)]
print(College_maj_sub)

## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"

The grep() function searches each element of a character vector for the pattern and returns the indices of the elements that match; those indices are then used to subset College_maj$Major.
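To make the behaviour of grep() concrete, here is a minimal base-R sketch on a small, hypothetical vector of major names (three values chosen for illustration, not the full 173-row dataset). It compares the index form used above with grepl() and with grep(value = TRUE).

```r
# Hypothetical sample of major names; the real dataset has 173 rows.
majors <- c("ZOOLOGY",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE")
pattern <- "DATA|STATISTICS"

hit_index  <- grep(pattern, majors)                 # positions of matching elements
hit_flags  <- grepl(pattern, majors)                # one TRUE/FALSE per element
hit_values <- grep(pattern, majors, value = TRUE)   # the matching strings themselves

print(hit_index)
print(hit_flags)
print(hit_values)
```

All three forms select the same elements; which one is most convenient depends on whether indices, a logical mask, or the strings themselves are needed downstream.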

Question 2

Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

Fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
# Quote each element, join the pieces with ", ", and wrap the result in c( ... )
cat("c(", paste0('"', Fruits, '"', collapse = ", "), ")", sep = "")

## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
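As an alternative sketch, base R's deparse() (the function behind dput()) can generate the c(...) form directly from a character vector. The variable names below are illustrative only. The last step checks that evaluating the generated text rebuilds the original vector.

```r
fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange",
            "blueberry", "cantaloupe", "chili pepper", "cloudberry",
            "elderberry", "lime", "lychee", "mulberry", "olive",
            "salal berry")

# deparse() returns the R source for the vector, possibly split over
# several strings; pasting the pieces together gives one line of code.
code_text <- paste(deparse(fruits), collapse = "")
cat(code_text, "\n")

# Round-trip check: evaluating the generated text rebuilds the vector.
rebuilt <- eval(parse(text = code_text))
```

This avoids assembling the quotes and commas by hand, at the cost of leaving the exact spacing to deparse().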

Question 3

Describe, in words, what these expressions will match:

(.)\1\1
“(.)(.)\2\1”
(..)\1
“(.).\1.\1”
“(.)(.)(.).*\3\2\1”

Answers:

(.)\1\1: Matches any string that contains the same character three times in a row (e.g. the “ooo” in “sooo cute”).

exp <- c("toooo little", "sooo cute", "blackberry", "blackberrrry", "limeee", "lime", "12345", "347565", "07770")
str_subset(exp, "(.)\\1\\1")

## [1] "toooo little" "sooo cute"    "blackberrrry" "limeee"       "07770"

“(.)(.)\2\1”: Matches any string that contains two characters followed immediately by the same two characters in reverse order, i.e. a four-character palindrome such as the “eppe” in “bell pepper” and “chili pepper”.

str_view(fruit, "(.)(.)\\2\\1")

##  [5] │ bell p<eppe>r
## [17] │ chili p<eppe>r

(..)\1: Matches any string that contains a pair of characters immediately followed by the same pair (e.g. the “anan” in “banana” or the “coco” in “coconut”).

str_view(fruit, "(..)\\1")

##  [4] │ b<anan>a
## [20] │ <coco>nut
## [22] │ <cucu>mber
## [41] │ <juju>be
## [56] │ <papa>ya
## [73] │ s<alal> berry

“(.).\1.\1”: Matches any string in which the same character appears three times with exactly one arbitrary character between each occurrence (e.g. the “anana” in “banana” and the “apaya” in “papaya”).

str_view(fruit, "(.).\\1.\\1")

##  [4] │ b<anana>
## [56] │ p<apaya>

“(.)(.)(.).*\3\2\1”: Matches any string that contains three characters followed, after zero or more arbitrary characters, by the same three characters in reverse order (e.g. “347743” or “abcdeffedcba”).

exp <- c("toooo little", "abcdeffedcba", "blackberry", "blackberrrry", "limeee", "lime", "12345", "07770", "077770", "347743", "34788743")
str_subset(exp, "(.)(.)(.).*\\3\\2\\1")

## [1] "abcdeffedcba" "077770"       "347743"       "34788743"
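To see exactly which substring each of the five patterns captures, the sketch below extracts the first match with base R's regexpr() and regmatches(), used here in place of stringr so the example has no package dependency. The helper name first_match is hypothetical, and the test strings are taken from the examples above.

```r
# Hypothetical helper: return the first substring of x matched by pattern.
first_match <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)
  regmatches(x, m)
}

triple     <- first_match("(.)\\1\\1", "sooo cute")
palindrome <- first_match("(.)(.)\\2\\1", "bell pepper")
pair       <- first_match("(..)\\1", "banana")
alternate  <- first_match("(.).\\1.\\1", "papaya")
mirrored   <- first_match("(.)(.)(.).*\\3\\2\\1", "34788743")

print(c(triple, palindrome, pair, alternate, mirrored))
# → "ooo" "eppe" "anan" "apaya" "34788743"
```

In the last case the match spans the whole string: the `.*` in the middle absorbs the “88” between “347” and its reversal “743”.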

Question 4

Construct regular expressions to match words that:

Start and end with the same character.
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

Answers:

Start and end with the same character:

df.names <-c("alisha", "farhana", "anna", "sahana", "church", "bob", "harry", "paul", "eleven", "bubble", "cell", "apple", "dog", "ada", "sense", "banana", "pepperoni", "india", "ten", "twelve", "soso", "oso", "bandana", "Louisiana", "Missouri", "Mississippi", "Connecticut", "google", "conscience", "dalda", "short", "Evon", "ele", "Tort")

regex_expr1 <-"^(.)((.*\\1$)|\\1?$)"
str_subset(df.names,regex_expr1)

## [1] "alisha" "anna"   "bob"    "ada"    "oso"    "ele"
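The pattern above has two branches: `.*\\1$` handles words of two or more characters, and `\\1?$` handles single-character words. As a sketch, the shorter pattern below appears to be equivalent. It is an assumption offered for comparison, not part of the original answer, and the sample words are hypothetical. Base R's grepl() with perl = TRUE is used instead of stringr.

```r
words <- c("alisha", "anna", "bob", "a", "aa", "dog", "Tort")

original   <- "^(.)((.*\\1$)|\\1?$)"  # pattern used above
simplified <- "^(.)(.*\\1)?$"         # assumed equivalent, for illustration

from_original   <- grepl(original, words, perl = TRUE)
from_simplified <- grepl(simplified, words, perl = TRUE)

print(words[from_original])
# → "alisha" "anna" "bob" "a" "aa"
```

Both patterns are case-sensitive, which is why “Tort” is not matched; applying tolower() to the words first would be one way to treat “T” and “t” as the same character.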

Contain a repeated pair of letters:

regex_expr2 <-"([A-Za-z][A-Za-z]).*\\1"
str_subset(df.names,regex_expr2)

## [1] "church"      "sense"       "banana"      "pepperoni"   "soso"       
## [6] "bandana"     "Mississippi" "dalda"

Contain one letter repeated in at least three places:

regex_expr3 <-"([A-Za-z]).*\\1.*\\1"
str_subset(df.names,regex_expr3)

## [1] "farhana"     "sahana"      "eleven"      "bubble"      "banana"     
## [6] "pepperoni"   "bandana"     "Mississippi" "conscience"

Farhana Akther

2023-02-09
