Title: CUNY SPS MDS Data607_Assignment3

Author: Charles Ugiagbe

Date: 9/12/2021

R Character Manipulation

Question 1

#1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

Load the required Packages

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.1
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## Warning: package 'tibble' was built under R version 4.1.1
## Warning: package 'readr' was built under R version 4.1.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Read file

This reads the file from GitHub into a data frame (read.csv returns a base data frame, not a tibble) and shows the first six rows of the data.

url <- "https://raw.githubusercontent.com/omocharly/Data607_Assignment3/main/all-ages.csv"
college_majors <- read.csv(url)
head(college_majors)
##   Major_code                                 Major
## 1       1100                   GENERAL AGRICULTURE
## 2       1101 AGRICULTURE PRODUCTION AND MANAGEMENT
## 3       1102                AGRICULTURAL ECONOMICS
## 4       1103                       ANIMAL SCIENCES
## 5       1104                          FOOD SCIENCE
## 6       1105            PLANT SCIENCE AND AGRONOMY
##                    Major_category  Total Employed Employed_full_time_year_round
## 1 Agriculture & Natural Resources 128148    90245                         74078
## 2 Agriculture & Natural Resources  95326    76865                         64240
## 3 Agriculture & Natural Resources  33955    26321                         22810
## 4 Agriculture & Natural Resources 103549    81177                         64937
## 5 Agriculture & Natural Resources  24280    17281                         12722
## 6 Agriculture & Natural Resources  79409    63043                         51077
##   Unemployed Unemployment_rate Median P25th P75th
## 1       2423        0.02614711  50000 34000 80000
## 2       2266        0.02863606  54000 36000 80000
## 3        821        0.03024832  63000 40000 98000
## 4       3619        0.04267890  46000 30000 72000
## 5        894        0.04918845  62000 38500 90000
## 6       2070        0.03179089  50000 35000 75000
sub_filter <- dplyr::filter(college_majors, grepl('DATA|STATISTIC', Major))
sub_filter
##   Major_code                                         Major
## 1       2101      COMPUTER PROGRAMMING AND DATA PROCESSING
## 2       3702               STATISTICS AND DECISION SCIENCE
## 3       6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS
##            Major_category  Total Employed Employed_full_time_year_round
## 1 Computers & Mathematics  29317    22828                         18747
## 2 Computers & Mathematics  24806    18808                         14468
## 3                Business 156673   134478                        118249
##   Unemployed Unemployment_rate Median P25th  P75th
## 1       2265        0.09026422  60000 40000  85000
## 2       1138        0.05705405  70000 43000 102000
## 3       6186        0.04397714  72000 50000 100000
sub_col_majors <- sub_filter %>% select(Major_code:Employed)
sub_col_majors
##   Major_code                                         Major
## 1       2101      COMPUTER PROGRAMMING AND DATA PROCESSING
## 2       3702               STATISTICS AND DECISION SCIENCE
## 3       6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS
##            Major_category  Total Employed
## 1 Computers & Mathematics  29317    22828
## 2 Computers & Mathematics  24806    18808
## 3                Business 156673   134478
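The same filter can be sketched in base R without dplyr. This is a minimal sketch on a hypothetical six-row sample (the `sample_majors` data frame is made up from major names shown in the output above, not the full dataset). It also assumes that whole-word matching is wanted, so the pattern uses word boundaries:

```r
# Hypothetical sample: a few major names taken from the printed output above
sample_majors <- data.frame(
  Major = c("GENERAL AGRICULTURE",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE",
            "FOOD SCIENCE",
            "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
            "ANIMAL SCIENCES"),
  stringsAsFactors = FALSE
)

# \\b marks a word boundary, so DATA or STATISTICS must appear as a whole word
pattern <- "\\b(DATA|STATISTICS)\\b"
matched <- sample_majors[grepl(pattern, sample_majors$Major, perl = TRUE), , drop = FALSE]
print(matched$Major)
```

On this sample the three majors returned are the same three found by the dplyr filter.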

Question 2

#2 Write code that transforms the data below:

[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"

Into a format like this:

c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

Solution 2

library(stringr)
main_data <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  
[9] "elderberry"   "lime"         "lychee"       "mulberry"    
[13] "olive"        "salal berry"'

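str_extract_all() returns a list with one character vector per input string, so its first element is taken before the names are quoted and joined. output_string is then a single character string.

# Extract each fruit name (one or two lowercase words) as a character vector
output <- str_extract_all(main_data, pattern = '[a-z]+\\s?[a-z]+')[[1]]
# Wrap each name in double quotes, join with ", ", and wrap the result in c( ... )
output_string <- str_c('c(', str_c('"', output, '"', collapse = ", "), ')')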
writeLines(output_string)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
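An alternative sketch, assuming only base R: because every fruit name in the input is already wrapped in double quotes, the quoted pieces can be pulled out directly and joined. The variable names used here (`raw_text`, `quoted`, `result`, `fruits`) are illustrative.

```r
raw_text <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# Extract every double-quoted token, quotes included
quoted <- regmatches(raw_text, gregexpr('"[^"]+"', raw_text))[[1]]

# Join the tokens with ", " and wrap them in c( ... )
result <- paste0("c(", paste(quoted, collapse = ", "), ")")
writeLines(result)

# Parsing the text back checks that it is valid R code for a character vector
fruits <- eval(parse(text = result))
length(fruits)
```

Parsing the generated text back into a vector is a quick way to confirm that the output really is in the requested format.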

Question 3

#3 Describe, in words, what these expressions will match:

(1) (.)\1\1

(2) "(.)(.)\\2\\1"

(3) (..)\1

(4) "(.).\\1.\\1"

(5) "(.)(.)(.).*\\3\\2\\1"

Solution 3

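(1) (.)\1\1 Read as a regular expression, this matches any single character (except a newline) that appears three times in a row, such as "aaa". If it is typed unchanged as the R string "(.)\1\1", R interprets each \1 as the control character with code 1 before the regex engine sees it, so the pattern instead matches any single character followed by two of those control characters.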

(2) "(.)(.)\\2\\1" Matches strings containing any two characters immediately followed by the same two characters in reverse order, such as "abba". In essence, it matches a four-character palindrome.

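(3) (..)\1 Read as a regular expression, this matches any two characters (except newlines) immediately followed by the same two characters, such as "abab". Typed unchanged as the R string "(..)\1", it instead matches any two characters followed by the control character with code 1.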

(4) "(.).\\1.\\1" Matches strings in which one character (except a newline) appears three times with exactly one arbitrary character between consecutive occurrences, such as "abaca".

(5) "(.)(.)(.).*\\3\\2\\1" Matches strings containing any three characters in a row, followed by zero or more characters, and then the same three characters in reverse order, such as "abcxcba".
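Whether a backslash is doubled matters in R, because the R parser processes the string before the regex engine sees it. The following sketch illustrates the difference. It uses base R's `grepl()` with `perl = TRUE` rather than stringr, which is an assumption made here to keep the example free of package dependencies.

```r
# "\1" is one character (the control character with code 1);
# "\\1" is two characters, a backslash and the digit 1
nchar("\1")    # 1
nchar("\\1")   # 2

test_strings <- c("aaa", "a\1\1")

# With doubled backslashes the regex engine receives (.)\1\1,
# i.e. one character repeated three times in a row
as_regex <- grepl("(.)\\1\\1", test_strings, perl = TRUE)

# With single backslashes the pattern is a character followed by
# two control characters, so "aaa" no longer matches
as_string <- grepl("(.)\1\1", test_strings, perl = TRUE)

print(as_regex)   # TRUE FALSE
print(as_string)  # FALSE TRUE
```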

The examples below show some of these patterns in use:

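# Typed with single backslashes, "\1" is the control character with code 1,
# so this matches a character followed by two of those control characters
example_1 <- c('a\1\1', 'b\1\1', 'ccc\1\1c', 'd\1d')
str_view(example_1, "(.)\1\1")
# Two characters followed by the same two in reverse order;
# only 'maddam' contains such a run ('adda')
example_2 <- c('abcccba', 'badmomdad', 'maddam')
str_view(example_2, "(.)(.)\\2\\1")
# Again "\1" is the control character here, so only 'abc\1d' matches
example_3 <- c('abc\1d', 'abc\2dd', '\1d')
str_view(example_3, "(..)\1")
# One character at three positions, each separated by one character ('lzlyl')
example_4 <- c('philzlyl')
str_view(example_4, "(.).\\1.\\1")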
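Because `str_view()` renders its result in a viewer, the matches are not visible in plain-text output. As a rough cross-check, the sketch below reads all five expressions as regular expressions (backslashes doubled in the R strings) and prints the first matched substring using base R. The helper `first_match()` and the test words are illustrative choices, not part of the assignment.

```r
# First match of `pattern` in `x`, or NA if there is none
first_match <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)
  if (m[1] == -1) NA_character_ else regmatches(x, m)
}

results <- c(
  first_match("(.)\\1\\1", "aaab"),                 # same character three times
  first_match("(.)(.)\\2\\1", "maddam"),            # four-character palindrome
  first_match("(..)\\1", "xabab"),                  # two characters repeated
  first_match("(.).\\1.\\1", "philzlyl"),           # character at every other position
  first_match("(.)(.)(.).*\\3\\2\\1", "abcxxcba")   # three characters, later reversed
)
print(results)
```

Printing the matched substring, rather than only TRUE or FALSE, makes it easier to see which part of each word satisfies the pattern.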

Question 4

Construct regular expressions to match words that:

(1) Start and end with the same character.

(2) Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice).

(3) Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s).

Solution 4

answer.str <- c("CHURCH","ELEVEN","PROP","SEVENTEEN","BANANA" ,"EAGLE")

Words that start and end with the same character

code1 <- "^(.).*\\1$"
answer.str %>% str_subset(code1)
## [1] "PROP"  "EAGLE"
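One edge case worth noting: the pattern above requires at least two characters, so a one-letter word such as "A" is not matched even though it trivially starts and ends with the same character. A possible variant, sketched here with base R and made-up test words, makes the tail of the pattern optional:

```r
words <- c("PROP", "EAGLE", "CHURCH", "A")

# Original pattern: first character, anything, then the same character at the end
strict <- grepl("^(.).*\\1$", words, perl = TRUE)

# Variant: the ".*\\1" tail is optional, so a single character also matches
relaxed <- grepl("^(.)(.*\\1)?$", words, perl = TRUE)

print(words[strict])
print(words[relaxed])
```

Whether one-letter words should count depends on how the exercise is read, so this variant is a design choice rather than a correction.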

Words that contain a repeated pair of letters

code2 <- "(..).*\\1"
answer.str %>% str_subset(code2)
## [1] "CHURCH"    "SEVENTEEN" "BANANA"
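Since `.` matches any character, `(..)` also accepts digits, spaces, and punctuation. If "pair of letters" is meant literally, a character class can be swapped in. A minimal sketch in base R, with made-up test strings:

```r
inputs <- c("CHURCH", "BANANA", "12-12", "ELEVEN")

# Any two characters repeated later in the string
any_pair <- grepl("(..).*\\1", inputs, perl = TRUE)

# Only two letters repeated later in the string
letter_pair <- grepl("([A-Za-z]{2}).*\\1", inputs, perl = TRUE)

print(inputs[any_pair])
print(inputs[letter_pair])
```

For the all-letter words in the answer vector both patterns give the same result; they differ only on input such as "12-12".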

Words that contain one letter repeated in at least three places

code3 <- "(.).*\\1.*\\1"
answer.str %>% str_subset(code3)
## [1] "ELEVEN"    "SEVENTEEN" "BANANA"
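The regex result can be cross-checked by counting characters directly. This sketch assumes base R only and uses a hypothetical helper, `has_triple()`, which splits a word into characters and tests whether any character occurs three or more times:

```r
words <- c("CHURCH", "ELEVEN", "PROP", "SEVENTEEN", "BANANA", "EAGLE")

# TRUE if some character appears at least three times in `word`
has_triple <- function(word) {
  counts <- table(strsplit(word, "")[[1]])
  any(counts >= 3)
}

by_count <- vapply(words, has_triple, logical(1), USE.NAMES = FALSE)
by_regex <- grepl("(.).*\\1.*\\1", words, perl = TRUE)

print(words[by_count])
identical(by_count, by_regex)
```

Agreement between the two approaches on these six words is a sanity check, not a proof that the pattern is correct for every input.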