This assignment covers R character manipulation and date processing. The problems are designed to give practice manipulating strings in R.
Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.
The dataset was uploaded to a GitHub repository; it is loaded into R from there before the analysis.
# Load data from GitHub
college_majors<-read.csv("https://raw.githubusercontent.com/jnataky/RCharacter_manipulation/master/College_majors.csv")
# Check the columns names in dataframe
names(college_majors)
## [1] "FOD1P" "Major" "Major_Category"
# Count unique majors, to ensure no major is repeated
length(unique(college_majors[["Major"]]))
## [1] 173
# Identify the majors that contain either "DATA" or "STATISTICS"
Major<-college_majors$Major
Major[grepl("DATA|STATISTICS", Major)]
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"
## [3] "STATISTICS AND DECISION SCIENCE"
I used the grepl() function with the pattern “DATA|STATISTICS” to identify the matching majors. The | operator is alternation, so the pattern matches any major whose name contains “DATA”, “STATISTICS”, or both; the two words do not need to appear together or in any particular order. Matching is case-sensitive, which works here because the major names are stored in upper case.
Of the 173 majors, 3 contain either “DATA” or “STATISTICS”.
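As a small illustration of how the | alternation behaves, here is a sketch on a made-up vector. The names in sample_majors are hypothetical and are not taken from the dataset.

```r
# Hypothetical major names, used only to illustrate alternation
sample_majors <- c("DATA SCIENCE",
                   "APPLIED STATISTICS",
                   "ART HISTORY",
                   "STATISTICS AND DATA MINING")

# TRUE wherever the name contains "DATA" or "STATISTICS" (or both)
hits <- grepl("DATA|STATISTICS", sample_majors)

# Keep only the matching names
sample_majors[hits]
```

A name containing both words is matched once, and a name containing neither ("ART HISTORY") is dropped.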
Write code that transforms the data below:
[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”
Into a format like this:
c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)
Represent the printed fruits as a single string, then use a regular expression over the letters a-z to extract the fruit names and print them in the c(...) format.
library(stringr)
# Write a string for fruits
fruits <-'[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry" '
# Extract each fruit name (one or two lowercase words) into a character vector;
# str_extract_all() returns a list, so take its first element
p <- "[a-z]+( [a-z]+)?"
fruit_names <- str_extract_all(fruits, p)[[1]]
# Quote each name, join the names with commas, and wrap the result in c(...)
list_fruits <- str_c('c(', str_c('"', fruit_names, '"', collapse = ", "), ')')
writeLines(list_fruits)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
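The same transformation can be sketched in base R without stringr. This is one possible alternative, not the method the assignment requires. The shortened input string fruit_text is a hypothetical example, and the approach assumes every fruit name is wrapped in straight double quotes.

```r
# Hypothetical, shortened version of the printed vector
fruit_text <- '[1] "bell pepper" "bilberry"\n[3] "lime"'

# Find every double-quoted run of characters
quoted <- regmatches(fruit_text, gregexpr('"[^"]+"', fruit_text))[[1]]

# Strip the surrounding quotes to get the bare names
fruit_vec <- gsub('"', "", quoted)

# Rebuild the names as the text of a c(...) call
out <- paste0("c(", paste0('"', fruit_vec, '"', collapse = ", "), ")")
cat(out, "\n")
```

Matching on the quotes rather than on letters means names with digits or hyphens would also be captured, at the cost of depending on the quoting in the printed output.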
Write in words then illustrate.
Describe, in words, what these expressions will match:
# Define group of words, gr:
gr <-c("eleven", "bob", "jujube", "church", "pepper", "coool", "monisnotnom")
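# The same character three times in a row (e.g. "ooo" in "coool")
str_view(gr,"(.)\\1\\1", match = TRUE)
# Two characters followed by the same two in reverse order (e.g. "eppe" in "pepper")
str_view(gr,"(.)(.)\\2\\1", match = TRUE)
# A pair of characters immediately repeated (e.g. "juju" in "jujube")
str_view(gr,"(..)\\1", match = TRUE)
# A character that recurs at every other position, three times (e.g. "eleve" in "eleven")
str_view(gr,"(.).\\1.\\1", match = TRUE)
# Three characters, then anything, then the same three in reverse order (e.g. "monisnotnom")
str_view(gr, "(.)(.)(.).*\\3\\2\\1", match = TRUE)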
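Because str_view() only renders highlighted output, it can help to list the matching words directly. The sketch below uses base R grepl() with perl = TRUE as a stand-in for str_view(). The labels in patterns are hypothetical names chosen for readability.

```r
gr <- c("eleven", "bob", "jujube", "church", "pepper", "coool", "monisnotnom")

# The five backreference patterns, with illustrative labels
patterns <- c(
  triple      = "(.)\\1\\1",
  mirror      = "(.)(.)\\2\\1",
  pair_twice  = "(..)\\1",
  alternating = "(.).\\1.\\1",
  reversed    = "(.)(.)(.).*\\3\\2\\1"
)

# For each pattern, keep the words in gr that contain a match
matches <- lapply(patterns, function(p) gr[grepl(p, gr, perl = TRUE)])
matches
```

With this word list, each pattern picks out exactly one word: "coool", "pepper", "jujube", "eleven", and "monisnotnom" respectively.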
Use the same list as in problem 3 for illustration.
Construct regular expressions to match words that:
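# Start and end with the same character
str_view(gr, "^(.).*\\1$", match = TRUE)
# Contain a repeated pair of letters (e.g. "ch" appears twice in "church")
str_view(gr, "(..).*\\1", match = TRUE)
# Contain one letter repeated in at least three places (e.g. three "e"s in "eleven")
str_view(gr, "(.).*\\1.*\\1", match = TRUE)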
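One way to gain confidence in a constructed regex is to compare it against a computation that does not use regular expressions. The sketch below does this for the "one letter in at least three places" pattern by counting letters with table(). The variable names are hypothetical, and base R grepl() with perl = TRUE is used as a stand-in for the stringr call.

```r
gr <- c("eleven", "bob", "jujube", "church", "pepper", "coool", "monisnotnom")

# Regex version: some character occurs at least three times
has_triple_regex <- grepl("(.).*\\1.*\\1", gr, perl = TRUE)

# Counting version: split each word into letters and look at the largest count
has_triple_count <- vapply(
  strsplit(gr, ""),
  function(chars) max(table(chars)) >= 3,
  logical(1)
)

# Both approaches should agree word by word
stopifnot(identical(has_triple_regex, has_triple_count))
gr[has_triple_regex]
```

The same idea applies to the other two patterns, for example by comparing the first and last characters with substr().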