• The Overview:
    • Loading the required library
    • Getting the data from fivethirtyeight github
    • Getting the majors containing “DATA”
    • Getting the majors containing “STATISTICS”
  • 2 Write code that transforms the data below:
  • 3 Describe, in words, what these expressions will match:
  • 4 Construct regular expressions to match words that:

The Overview:

You can find the file on GitHub here

You can find the file on RPubs here

1 Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

Loading the required library

library(stringr)

Getting the data from fivethirtyeight github

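# Raw CSV of the majors list from the fivethirtyeight GitHub repository
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"

# read.csv() already uses "," as its default separator
data <- read.csv(url)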

head(data)
##   FOD1P                                 Major                  Major_Category
## 1  1100                   GENERAL AGRICULTURE Agriculture & Natural Resources
## 2  1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3  1102                AGRICULTURAL ECONOMICS Agriculture & Natural Resources
## 4  1103                       ANIMAL SCIENCES Agriculture & Natural Resources
## 5  1104                          FOOD SCIENCE Agriculture & Natural Resources
## 6  1105            PLANT SCIENCE AND AGRONOMY Agriculture & Natural Resources

Getting the majors containing “DATA”

data$Major[grepl("DATA", data$Major)]
## [1] "COMPUTER PROGRAMMING AND DATA PROCESSING"

Getting the majors containing “STATISTICS”

data$Major[grepl("STATISTICS", data$Major)]
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "STATISTICS AND DECISION SCIENCE"
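The exercise asks for majors containing either word, so the two lookups above can also be combined into one pattern with regex alternation. The sketch below is only an illustration: `majors` is a small hypothetical stand-in for `data$Major`, so that the example runs without downloading the CSV.

```r
library(stringr)

# Hypothetical sample of major names, standing in for data$Major
majors <- c(
  "GENERAL AGRICULTURE",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "STATISTICS AND DECISION SCIENCE"
)

# "|" is regex alternation: keep a major if it contains either word
data_or_stats <- str_subset(majors, "DATA|STATISTICS")
print(data_or_stats)

# Base R equivalent using grepl(), as in the steps above
same_as_base <- identical(data_or_stats, majors[grepl("DATA|STATISTICS", majors)])
print(same_as_base)
```

On the real data frame, the same pattern should work as `data$Major[grepl("DATA|STATISTICS", data$Major)]`.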

2 Write code that transforms the data below:

[1] "bell pepper" "bilberry" "blackberry" "blood orange"

[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"

[9] "elderberry" "lime" "lychee" "mulberry"

[13] "olive" "salal berry"

Into a format like this:

c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

fruits <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"

[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  

[9] "elderberry"   "lime"         "lychee"       "mulberry"    

[13] "olive"        "salal berry"'

fruits
## [1] "[1] \"bell pepper\"  \"bilberry\"     \"blackberry\"   \"blood orange\"\n\n[5] \"blueberry\"    \"cantaloupe\"   \"chili pepper\" \"cloudberry\"  \n\n[9] \"elderberry\"   \"lime\"         \"lychee\"       \"mulberry\"    \n\n[13] \"olive\"        \"salal berry\""

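Here we use str_extract_all() to pull out every fruit name, then quote the names and join them with a comma separator.

# str_extract_all() returns a list; [[1]] takes the character vector of fruit names
fruit_names <- str_extract_all(fruits, pattern = '[A-Za-z]+.?[A-Za-z]+')[[1]]

# Quote each name, join with ", ", and wrap the result in c( ) to build the target text
writeLines(str_c('c(', str_c('"', fruit_names, '"', collapse = ", "), ')'))
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")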
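As an alternative sketch of the same transformation, the names can be picked out by their surrounding double quotes instead of by letter runs, and `dput()` can then print the `c(...)` form. The names `fruits_raw`, `quoted`, and `fruit_vec` are introduced here only for illustration; this is one possible approach, not the one used above.

```r
library(stringr)

# The printed vector from the exercise, stored as one string
fruits_raw <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"
[9] "elderberry"   "lime"         "lychee"       "mulberry"
[13] "olive"        "salal berry"'

# Match each double-quoted item, then strip the surrounding quotes
quoted <- str_extract_all(fruits_raw, '"[^"]+"')[[1]]
fruit_vec <- str_remove_all(quoted, '"')

# dput() prints an R expression that would recreate the vector
dput(fruit_vec)
```

Matching on the quotes avoids relying on `.` to bridge the space in two-word names such as "bell pepper", and it yields a real character vector rather than only printed text.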

The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:

3 Describe, in words, what these expressions will match:

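(.)\1\1 matches the same character three times in a row, like “AAA”.

"(.)(.)\\2\\1" matches two characters followed by the same two characters in reverse order, like “ABBA”.

(..)\1 matches any two characters repeated immediately, like “ABAB”.

"(.).\\1.\\1" matches five characters in which the first, third, and fifth are the same, like “ABACA”.

"(.)(.)(.).*\\3\\2\\1" matches three characters, then any number of characters (possibly none), then the same three characters in reverse order, like “ABC42342CBA”.

The quoted forms are written as R strings, so each backslash is doubled; they are read here as the regular expressions those strings represent.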
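The descriptions above can be checked quickly by pairing each pattern with its example string. This is a minimal sketch: the `examples` vector is a hypothetical helper, and every pattern is written as an R string with doubled backslashes.

```r
library(stringr)

# Named vector: pattern (as an R string) -> example string expected to match
examples <- c(
  "(.)\\1\\1"            = "AAA",
  "(.)(.)\\2\\1"         = "ABBA",
  "(..)\\1"              = "ABAB",
  "(.).\\1.\\1"          = "ABACA",
  "(.)(.)(.).*\\3\\2\\1" = "ABC42342CBA"
)

# str_detect() is vectorised over matching string/pattern pairs
matches <- str_detect(examples, names(examples))
print(matches)

# A near miss: "ABAC" has no two-character block repeated back to back
near_miss <- str_detect("ABAC", "(..)\\1")
print(near_miss)
```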

4 Construct regular expressions to match words that:

  • Start and end with the same character. The answer: "^(.).*\\1$"
data <- c("church", "individual", "phillip")
str_view(data, "^(.).*\\1$", match = TRUE)
  • phillip
  • Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice). The answer: "(..).*\\1"
data <- c("church", "individual", "phillip")
# No anchors here: the repeated pair may appear anywhere in the word
str_view(data, "(..).*\\1", match = TRUE)
  • church
  • Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s). The answer: "(.).*\\1.*\\1"
data <- c("church", "individual", "phillip")
str_view(data, "(.).*\\1.*\\1", match = TRUE)

  • individual
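To see how the three patterns behave beyond the three sample words, here is a small sketch that runs them over a slightly longer list with `str_subset()`. The extra words ("eleven", "banana", "area") are assumptions added for illustration.

```r
library(stringr)

# Hypothetical word list, extending the three words used above
test_words <- c("church", "individual", "phillip", "eleven", "banana", "area")

# Anchored: the first character must also be the last one
same_ends <- str_subset(test_words, "^(.).*\\1$")

# Unanchored: any two-character pair that appears again later in the word
repeated_pair <- str_subset(test_words, "(..).*\\1")

# Unanchored: the same character in at least three positions
three_times <- str_subset(test_words, "(.).*\\1.*\\1")

print(same_ends)
print(repeated_pair)
print(three_times)
```

Note that the first pattern needs at least two characters, so a one-letter word such as "a" would not match it.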
