DATA607 | Week 3 Assignment

Objective of the Assignment:

In previous assignments we analyze data that comes in tables. the objective of this assignment is to use regular expressions and essential string functions to analyze data that are not available as a neatly organized dataset but in plain text?

Exersice 1:

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

First we will load the fivethirtyeight.com’s College Majors dataset from Github

CollegeMajors <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv")

grep(pattern = 'DATA|STATISTICS', CollegeMajors$Major, value = TRUE) # with 'value' (showing matched text)

## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"

Exersice 2:

Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange” [5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

Solution

First I careated a text file rawdata.txt, and uploaded it to github from where we’ll read it using the function readLines().

rawdata = readLines("./rawdata.txt")
rawdata

## [1] "[1] \"bell pepper\"  \"bilberry\"     \"blackberry\"   \"blood orange\""
## [2] "[5] \"blueberry\"    \"cantaloupe\"   \"chili pepper\" \"cloudberry\"  "
## [3] "[9] \"elderberry\"   \"lime\"         \"lychee\"       \"mulberry\"    "
## [4] "[13] \"olive\"        \"salal berry\""

readLines() creates a character vector in which each element represents the lines of the URL we are trying to read. To know how many elements (i.e how many lines) are in rawdata we can use the function length()

length(rawdata)

## [1] 4

Now lets Transform the Data

length(rawdata)

## [1] 4

library(stringr)
plants <- str_extract_all(rawdata, '[:alpha:]+\\s[:alpha:]+|[:alpha:]+')
unlist(plants)

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"

Exercise 3:

Describe, in words, what these expressions will match:

(.)\1\1
“(.)(.)\2\1”
(..)\1
“(.).\1.\1”
"(.)(.)(.).*\3\2\1"

Solution

# 1.  `(.)\1\1`: The same character appearing three times in a row.

Explanation

(.) Capturing Group
. matches any character (except for line terminators)
\1 matches the same text as most recently matched by the 1st capturing group
\1 matches the same text as most recently matched by the 1st capturing group

#2.  `"(.)(.)\\2\\1"`: A pair of characters followed by the same pair of characters in reversed order.

Explanation

(.) 1st Capturing Group
. matches any character (except for line terminators)
(.) 2nd Capturing Group
. matches any character (except for line terminators)
\ matches the character literally
2 matches the character 2 literally
\ matches the character literally (case sensitive)

#3.  `(..)\1`: Any two characters repeated.

Explanation

(..) Capturing Group
. matches any character (except for line terminators)
. matches any character (except for line terminators)
\1 matches the same text as most recently matched by the 1st

#4.  `"(.).\\1.\\1"`: A character followed by any character, the original character, any other character, the original character again.

Explanation

(.) Capturing Group
. matches any character (except for line terminators)
. matches any character (except for line terminators)
\ matches the character literally (case sensitive)
1 matches the character 1 literally (case sensitive)
. matches any character (except for line terminators)
\ matches the character literally (case sensitive)

#5.  `"(.)(.)(.).*\\3\\2\\1"` Three characters followed by zero or more characters of any kind followed by the same three characters but in reverse order.

Explanation

1st Capturing Group (.)
. matches any character (except for line terminators)
2nd Capturing Group (.)
. matches any character (except for line terminators)
3rd Capturing Group (.)
. matches any character (except for line terminators)
.* matches any character (except for line terminators)
- Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\ matches the character literally
3 matches the character 3 literally
\ matches the character literally
2 matches the character 2 literally
\ matches the character literally

Exercise 3:

Construct regular expressions to match words that:

1. Start and end with the same character.

str_subset(c("aga", "bob", "car", "nose", "eye"), "^(.)((.*\\1$)|\\1?$)")

## [1] "aga" "bob" "eye"

2. Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)

str_subset(c("church", "data", "london", "tomato"), "([A-Za-z][A-Za-z]).*\\1")

## [1] "church" "london" "tomato"

3. Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)

str_subset("eleven", "([a-z]).*\\1.*\\1")

## [1] "eleven"

Conclusion

Regular expressions is a powerful, useful tool for parsing text.

DATA607 | Week 3 Assignment

Ait Elmouden Abdellah

02/12/2020

Objective of the Assignment:

Exersice 1:

Exersice 2:

Exercise 3:

Exercise 3:

Conclusion