Week 3 Lab 607

Question 1

Problem: Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

Solution to Question 1

To solve Question 1, I followed the link in the fivethirtyeight article to the raw college majors data on GitHub.
I then loaded the tidyverse, dplyr, and readr packages and imported the data. Afterwards, I searched the majors data frame with filter(). The filter uses regex to keep the majors that contain either ‘DATA’ or ‘STATISTICS’ and stores them in a new data frame called mathematical_majors.

#import data and packages
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(readr)
majors <- read.csv(url("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"))

#filter majors that contain "DATA" or "STATISTICS"
(mathematical_majors <- majors %>%
  filter(str_detect(Major, "DATA") | str_detect(Major, "STATISTICS")))

##   FOD1P                                         Major          Major_Category
## 1  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 2  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics

Question 2

Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”

[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”

[9] “elderberry” “lime” “lychee” “mulberry”

[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

Solution to Question 2

fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

fruit_vec <- paste0("\"", fruits, "\"")

fru_vec2 <- paste0(fruit_vec, collapse = ", ") 

fru_vec3 <- paste0("c(", fru_vec2, ")")

cat(fru_vec3)

## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

Steps

I had a hard time getting the output to look exactly like the target format. My first attempt produced the right structure but without the quotation marks around each fruit, so I eventually arrived at this solution.
I first created the character vector fruits. Then I added an escaped quotation mark (a backslash followed by a quotation mark) to each side of every element and stored the result in a new vector called “fruit_vec”.
Next, I collapsed the quoted words into a single string, separating each word with a comma and a space. This string is “fru_vec2”.
Then, I added the leading “c(” and the closing “)”, giving “fru_vec3”.
Finally, I printed the string with the cat function, which displays the escaped quotation marks as plain quotation marks without the backslashes. The output is now in the desired format.
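
The steps above can also be condensed. This is a minimal sketch of the same idea using sprintf() for the quoting step; the names quoted, out, and round_trip are illustrative and not part of the original solution. As a check, it parses the generated text back into R and compares it with the original vector.

```r
fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange",
            "blueberry", "cantaloupe", "chili pepper", "cloudberry",
            "elderberry", "lime", "lychee", "mulberry", "olive",
            "salal berry")

# Wrap every element in double quotes (sprintf is vectorized).
quoted <- sprintf('"%s"', fruits)

# Join with ", " and wrap the whole thing in c( ... ).
out <- paste0("c(", paste(quoted, collapse = ", "), ")")
cat(out, "\n")

# Round trip: evaluating the text should rebuild the original vector.
round_trip <- eval(parse(text = out))
stopifnot(identical(round_trip, fruits))
```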

Question 3

Describe, in words, what these expressions will match:

(.)\1\1
“(.)(.)\\2\\1”
(..)\1
“(.).\\1.\\1”
“(.)(.)(.).*\\3\\2\\1”

Solution to Question 3

Note: in expressions 2, 4, and 5 the backslashes are doubled because each expression is written as a quoted string, where the first backslash escapes the second. The expressions are shown exactly as they appear in the question above.

(.)\1\1 - matches any single character repeated three times in a row, such as ‘aaa’ or ‘bbb’.
“(.)(.)\\2\\1” - matches any two characters followed by the same two characters in reverse order, for example ‘abba’ or ‘1441’.
(..)\1 - matches any pair of characters immediately followed by the same pair, for example ‘3131’ or ‘abab’.
“(.).\\1.\\1” - matches a five-character sequence in which the first, third, and fifth characters are the same and the second and fourth can be anything, for example ‘dedfd’ or ‘41424’.
“(.)(.)(.).*\\3\\2\\1” - matches any three characters, followed by zero or more characters of any kind, followed by the same three characters in reverse order, for example ‘racecar’ or ‘123321’. The match needs at least six characters, so a five-character palindrome such as ‘12321’ does not match.
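
One way to check descriptions like these is to try each pattern on a string that should match and one that should not. The sketch below does this with base R's grepl(); the test strings and the helper name check are my own choices, and perl = TRUE is used only to select the PCRE engine for the back-references.

```r
# Hypothetical helper: TRUE if the pattern matches anywhere in x.
check <- function(pattern, x) grepl(pattern, x, perl = TRUE)

results <- c(
  # (.)\1\1 : one character three times in a row
  triple_yes   = check("(.)\\1\\1", "aaa"),
  triple_no    = check("(.)\\1\\1", "aab"),
  # (.)(.)\2\1 : two characters, then the same two reversed
  mirror_yes   = check("(.)(.)\\2\\1", "abba"),
  mirror_no    = check("(.)(.)\\2\\1", "abab"),
  # (..)\1 : a pair immediately repeated
  pair_yes     = check("(..)\\1", "abab"),
  pair_no      = check("(..)\\1", "abba"),
  # (.).\1.\1 : same character in positions 1, 3, and 5
  skip_yes     = check("(.).\\1.\\1", "dedfd"),
  skip_no      = check("(.).\\1.\\1", "abcde"),
  # (.)(.)(.).*\3\2\1 : three characters ... then the same three reversed
  reverse3_yes = check("(.)(.)(.).*\\3\\2\\1", "racecar"),
  # "level" is a palindrome but has only five characters; six are needed
  reverse3_no  = check("(.)(.)(.).*\\3\\2\\1", "level")
)

print(results)
```

Every entry ending in _yes should print TRUE and every entry ending in _no should print FALSE.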

Question 4

Construct regular expressions to match words that:

Start and end with the same character.
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

Solution to Question 4

Note: each back-reference in answers a to c is written with two backslashes, as it would be typed in a quoted string. The source script uses three backslashes so that two appear in the HTML output.

a). Start and end with the same character --> ^(.).*\\1$

b). Contain a repeated pair of letters --> (..).*\\1

c). Contain one letter repeated in at least three places --> (.).*\\1.*\\1
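
A quick way to sanity-check patterns for these three tasks is to run them over a few words. The sketch below is one possible set of patterns, tested with base R's grepl(); the word list and the variable names are assumptions for illustration. Note that the pair pattern captures two characters in its group, and the three-places pattern allows any number of characters between the repeats.

```r
# Illustrative words; "church" and "eleven" come from the question text.
words <- c("area", "church", "eleven", "banana")

same_ends   <- "^(.).*\\1$"     # first character equals last character
pair_twice  <- "(..).*\\1"      # some two-character pair occurs twice
three_times <- "(.).*\\1.*\\1"  # some character occurs at least three times

hits <- data.frame(
  word        = words,
  same_ends   = grepl(same_ends, words, perl = TRUE),
  pair_twice  = grepl(pair_twice, words, perl = TRUE),
  three_times = grepl(three_times, words, perl = TRUE),
  stringsAsFactors = FALSE
)

print(hits)
```

In this sample, "area" is the only word that starts and ends with the same character, "church" and "banana" contain a repeated pair, and "eleven" and "banana" contain a letter in three places.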

Blessing Anoroh

02-08-2024
