Assignment 3: Data manipulation:

Setting up the environment:

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Question 1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
majors_list <- read.csv(url)
selected_majors <- majors_list %>%
  filter(str_detect(Major, "DATA") | str_detect(Major, "STATISTICS"))
selected_majors

##   FOD1P                                         Major          Major_Category
## 1  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 2  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics

Question 2. Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”

[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”

[9] “elderberry” “lime” “lychee” “mulberry”

[13] “olive” “salal berry”

Into a format like this: c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

x <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" 
 [5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
 [9] "elderberry"   "lime"         "lychee"       "mulberry"    
 [13] "olive"        "salal berry"'

x <- str_remove_all(x, "\\d+")

x <- str_replace_all(x, "\\[]", "")

x <- str_replace_all(x, "\\s+", " ")

x <- str_replace_all(x, "\"\\s+", "\",")

x <- str_c("c(",x,")")

str_view(x)

## [1] │ c( "bell pepper","bilberry","blackberry","blood orange","blueberry","cantaloupe","chili pepper","cloudberry","elderberry","lime","lychee","mulberry","olive","salal berry")

Question 3. Describe, in words, what these expressions will match:

i) (.)\1\1

The above regular expression will match any character that will repeat three time in a row. The “(.)” represents a group followed by “\1” which is a back reference to first group. Our first group in this case has a “.” which means a character. This character is back referred twice with “\1” so the will match any character that will repeat three time in a row.

For example :

Test_string <- c("abc312131cba", "abcabc", "abab", "111222","aabb","acca", "church","xyzzyx","reverence", "repeled", "pulp")

str_subset(Test_string, "(.)\\1\\1")

## [1] "111222"

ii) “(.)(.)\2\1”

This REGEX has two groups each has one character which is back referred in reverse order. So in other words it will match out two characters followed by the same two characters in reverse order

For Example:

str_subset(Test_string, "(.)(.)\\2\\1")

## [1] "acca"   "xyzzyx"

iii) “(..)\1”

This particular REGEX has one group that contains two characters and that group is back referred so it will match out two repeated characters

for example:

str_subset(Test_string, "(..)\\1")

## [1] "abab"

iv) “(.).\1.\1”

This REGEX will match out a character repeated three times with characters in between each repetition. The “.” between the group and back references stand for character between the repeated ones.

For Example:

str_subset(Test_string, "(.).\\1.\\1")

## [1] "abc312131cba" "reverence"    "repeled"

v) “(.)(.)(.).*\3\2\1”

The characters followed by any character repeat 0 or more times and then the same three characters in reverse order.

For Example:

str_subset(Test_string, "(.)(.)(.).*\\3\\2\\1")

## [1] "abc312131cba" "xyzzyx"

Question 4. Construct regular expressions to match words that:

Start and end with the same character.

str_subset(Test_string, "^(.).*\\1$")

## [1] "abc312131cba" "acca"         "xyzzyx"       "pulp"

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

str_subset(Test_string, "(..).*\\1")

## [1] "abc312131cba" "abcabc"       "abab"         "church"       "reverence"

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

str_subset(Test_string, "([a-z]).*\\1.*\\1")

## [1] "reverence" "repeled"

Assignment 3: Data manipulation:

Umer Farooq

2023-02-07

Setting up the environment:

Question 1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

Question 2. Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”

[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”

[9] “elderberry” “lime” “lychee” “mulberry”

[13] “olive” “salal berry”

Into a format like this: c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

Question 3. Describe, in words, what these expressions will match:

Question 4. Construct regular expressions to match words that: