In previous assignments we analyzed data that came in tables. The objective of this assignment is to use regular expressions and essential string functions to analyze data that are not available as a neatly organized dataset but only as plain text.
Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.
First, we load the fivethirtyeight.com College Majors dataset from GitHub.
CollegeMajors <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv")
grep(pattern = 'DATA|STATISTICS', CollegeMajors$Major, value = TRUE) # value = TRUE returns the matching strings instead of their indices
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"
## [3] "STATISTICS AND DECISION SCIENCE"
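For comparison, here is a minimal sketch of how the same pattern behaves with grep() and grepl() on a small in-memory vector, so it runs without a network connection. The vector `majors` is a hypothetical sample (the three matches shown above plus one non-matching entry added for illustration), not the full dataset.

```r
# Hypothetical sample: the three matches shown above plus one non-matching entry
majors <- c("MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE",
            "GENERAL AGRICULTURE")

pattern <- "DATA|STATISTICS"   # "|" means either alternative may match

idx   <- grep(pattern, majors)                # positions of the matching elements
hits  <- grep(pattern, majors, value = TRUE)  # the matching strings themselves
flags <- grepl(pattern, majors)               # one TRUE/FALSE per element

print(idx)
print(flags)
```

grepl() is convenient when the result is used to filter a data frame, since it returns a logical vector of the same length as the input.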
Write code that transforms the data below:
[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”
Into a format like this:
c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)
Solution
First I created a text file, rawdata.txt, containing the data above and uploaded it to GitHub. Here we read a local copy of it with the function readLines().
rawdata = readLines("./rawdata.txt")
rawdata
## [1] "[1] \"bell pepper\" \"bilberry\" \"blackberry\" \"blood orange\""
## [2] "[5] \"blueberry\" \"cantaloupe\" \"chili pepper\" \"cloudberry\" "
## [3] "[9] \"elderberry\" \"lime\" \"lychee\" \"mulberry\" "
## [4] "[13] \"olive\" \"salal berry\""
readLines() creates a character vector in which each element is one line of the file we are reading. To find out how many elements (i.e., how many lines) are in rawdata, we can use the function length().
length(rawdata)
## [1] 4
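The line-per-element behaviour can be reproduced without rawdata.txt. The sketch below writes two lines to a temporary file and reads them back; the file contents and the names `tmp`, `lines_read`, and `n_lines` are made up for illustration.

```r
# Write a small two-line file to a temporary location (hypothetical content)
tmp <- tempfile(fileext = ".txt")
writeLines(c('[1] "bell pepper" "bilberry"', '[3] "lime"'), tmp)

lines_read <- readLines(tmp)    # one character element per line of the file
n_lines <- length(lines_read)   # number of lines that were read
print(n_lines)

unlink(tmp)                     # remove the temporary file
```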
Now let's transform the data.
library(stringr)
plants <- str_extract_all(rawdata, '[:alpha:]+\\s[:alpha:]+|[:alpha:]+')
unlist(plants)
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
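The result above is a character vector, while the exercise asks for text in the form c("...", "..."). One possible way to build that text is sketched below in base R, under the assumption that every item in the raw lines is wrapped in straight double quotes. The input `raw` is a shortened, hypothetical stand-in for rawdata.

```r
# Shortened, hypothetical stand-in for the lines read from rawdata.txt
raw <- c('[1] "bell pepper" "bilberry"', '[3] "lime"')

# Extract every double-quoted item, keeping the surrounding quotes
quoted <- unlist(regmatches(raw, gregexpr('"[^"]*"', raw)))

# Join the quoted items with commas and wrap them in c( ... )
formatted <- paste0("c(", paste(quoted, collapse = ", "), ")")
cat(formatted, "\n")

# The same items without quotes, as an ordinary character vector
items <- gsub('"', "", quoted, fixed = TRUE)
```

Because `formatted` is valid R code, evaluating it with eval(parse(text = formatted)) gives back the same vector as `items`.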
Describe, in words, what these expressions will match:
Solution
#1. `(.)\1\1`: The same character appearing three times in a row.
Explanation
#2. `"(.)(.)\\2\\1"`: A pair of characters followed by the same pair of characters in reversed order.
Explanation
#3. `(..)\1`: Any two characters repeated.
Explanation
#4. `"(.).\\1.\\1"`: A character followed by any character, the original character, any character (not necessarily a different one), and the original character again.
Explanation
#5. `"(.)(.)(.).*\\3\\2\\1"`: Three characters followed by zero or more characters of any kind followed by the same three characters but in reverse order.
Explanation
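The five descriptions above can be checked against small examples. The sketch below uses base R's grepl() with perl = TRUE rather than stringr, and the helper `check()` and all test strings are invented for illustration.

```r
# Hypothetical helper: TRUE where the pattern matches anywhere in the string
check <- function(pattern, x) grepl(pattern, x, perl = TRUE)

# 1. Same character three times in a row
r1 <- check("(.)\\1\\1", c("aaa", "aab"))

# 2. A pair of characters followed by the same pair reversed
r2 <- check("(.)(.)\\2\\1", c("abba", "abab"))

# 3. Any two characters repeated
r3 <- check("(..)\\1", c("abab", "abba"))

# 4. A character, any character, the first again, any character, the first again
r4 <- check("(.).\\1.\\1", c("abaca", "aaaaa", "abcde"))

# 5. Three characters, anything in between, then the same three reversed
r5 <- check("(.)(.)(.).*\\3\\2\\1", c("abcxyzcba", "abccba", "abcabc"))

print(list(r1, r2, r3, r4, r5))
```

Note that "aaaaa" matches the fourth pattern: the dots accept any character, including the captured one.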
Construct regular expressions to match words that:
1. Start and end with the same character.
str_subset(c("aga", "bob", "car", "nose", "eye"), "^(.)((.*\\1$)|\\1?$)")
## [1] "aga" "bob" "eye"
2. Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)
str_subset(c("church", "data", "london", "tomato"), "([A-Za-z][A-Za-z]).*\\1")
## [1] "church" "london" "tomato"
3. Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)
str_subset("eleven", "([a-z]).*\\1.*\\1")
## [1] "eleven"
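As a cross-check of the first pattern, the sketch below compares it with a shorter candidate, `^(.)(.*\\1)?$`, on a few words including the one- and two-letter edge cases. The shorter pattern and the extra test words are assumptions added here, not part of the original solution, and the comparison only covers this sample.

```r
# Test words from the solution plus two hypothetical edge cases ("a", "aa")
words <- c("aga", "bob", "car", "nose", "eye", "a", "aa")

original <- "^(.)((.*\\1$)|\\1?$)"  # pattern used in the solution above
simpler  <- "^(.)(.*\\1)?$"         # assumed shorter alternative

m_original <- grepl(original, words, perl = TRUE)
m_simpler  <- grepl(simpler, words, perl = TRUE)

print(words[m_original])
print(identical(m_original, m_simpler))
```

Both patterns treat a single character as a word that starts and ends with the same character, which is why "a" is accepted.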
Regular expressions are a powerful, useful tool for parsing text.