Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”
#load the FiveThirtyEight data from 'majors-list.csv' to a data.frame
library( tidyverse )
dataURL <- 'https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv'
dataDF <- read_csv( dataURL )
head( dataDF )
## # A tibble: 6 x 3
## FOD1P Major Major_Category
## <chr> <chr> <chr>
## 1 1100 GENERAL AGRICULTURE Agriculture & Natural Resources
## 2 1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3 1102 AGRICULTURAL ECONOMICS Agriculture & Natural Resources
## 4 1103 ANIMAL SCIENCES Agriculture & Natural Resources
## 5 1104 FOOD SCIENCE Agriculture & Natural Resources
## 6 1105 PLANT SCIENCE AND AGRONOMY Agriculture & Natural Resources
#identify majors that contain the word 'DATA'
DATA_idx <- grepl( 'DATA', dataDF$Major )
dataDF$Major[ DATA_idx ]
## [1] "COMPUTER PROGRAMMING AND DATA PROCESSING"
#identify majors that contain the word 'STATISTICS'
DATA_idx <- grepl( 'STATISTICS', dataDF$Major )
dataDF$Major[ DATA_idx ]
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "STATISTICS AND DECISION SCIENCE"
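Since the exercise asks for majors that contain either word, the two searches above can also be folded into one call with the regex alternation operator `|`. The sketch below is only an illustration: `sampleMajors` is a hypothetical stand-in for `dataDF$Major` (values copied from the output above), so it runs without downloading the CSV.

```r
#hypothetical stand-in for dataDF$Major (values copied from the output above)
sampleMajors <- c( 'GENERAL AGRICULTURE',
                   'COMPUTER PROGRAMMING AND DATA PROCESSING',
                   'MANAGEMENT INFORMATION SYSTEMS AND STATISTICS',
                   'STATISTICS AND DECISION SCIENCE' )

#'|' means OR: TRUE where the major contains 'DATA' or 'STATISTICS'
either_idx <- grepl( 'DATA|STATISTICS', sampleMajors )
eitherMajors <- sampleMajors[ either_idx ]
print( eitherMajors )
```

With the real data, the same pattern could be passed to `grepl()` together with `dataDF$Major`.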
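#Out of curiosity: a lot of majors contain 'SCIENCE' or 'SCIENCES'.
#Identify & count them as well.
#Note: the spaces around '|' are part of the pattern, so it matches 'SCIENCE '
#or ' SCIENCES'; a major that ends in 'SCIENCE' (e.g. 'FOOD SCIENCE') is not matched.
DATA_idx <- grepl( 'SCIENCE | SCIENCES', dataDF$Major )
dataDF$Major[ DATA_idx ]
## [1] "ANIMAL SCIENCES"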
## [2] "PLANT SCIENCE AND AGRONOMY"
## [3] "BIOCHEMICAL SCIENCES"
## [4] "COGNITIVE SCIENCE AND BIOPSYCHOLOGY"
## [5] "INFORMATION SCIENCES"
## [6] "SCIENCE AND COMPUTER TEACHER EDUCATION"
## [7] "SOCIAL SCIENCE OR HISTORY TEACHER EDUCATION"
## [8] "NUTRITION SCIENCES"
## [9] "COMMUNICATION DISORDERS SCIENCES AND SERVICES"
## [10] "PHARMACY PHARMACEUTICAL SCIENCES AND ADMINISTRATION"
## [11] "FAMILY AND CONSUMER SCIENCES"
## [12] "TRANSPORTATION SCIENCES AND TECHNOLOGIES"
## [13] "PHYSICAL SCIENCES"
## [14] "ATMOSPHERIC SCIENCES AND METEOROLOGY"
## [15] "INTERDISCIPLINARY SOCIAL SCIENCES"
## [16] "GENERAL SOCIAL SCIENCES"
## [17] "POLITICAL SCIENCE AND GOVERNMENT"
## [18] "MISCELLANEOUS SOCIAL SCIENCES"
#count the matches
sum( DATA_idx )
## [1] 18
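The pattern 'SCIENCE | SCIENCES' treats the spaces around `|` as literal characters, which is why 'FOOD SCIENCE' (visible in the `head()` output earlier) does not appear in the list. One possible alternative, offered as a sketch and not as the pattern that produced the output above, is a word-boundary pattern. `sampleMajors` is a hypothetical stand-in for `dataDF$Major`.

```r
#hypothetical stand-in for dataDF$Major
sampleMajors <- c( 'GENERAL AGRICULTURE', 'FOOD SCIENCE',
                   'ANIMAL SCIENCES', 'PLANT SCIENCE AND AGRONOMY' )

#original pattern: needs a space after SCIENCE or before SCIENCES
spaced_idx <- grepl( 'SCIENCE | SCIENCES', sampleMajors )

#'\\b' is a word boundary, 'S?' makes the plural optional
boundary_idx <- grepl( '\\bSCIENCES?\\b', sampleMajors )

print( sampleMajors[ spaced_idx ] )
print( sampleMajors[ boundary_idx ] )
```

Applied to the full data, the word-boundary version would also pick up majors such as FOOD SCIENCE, so its count would differ from the 18 reported above.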
Write code that transforms the data below:
[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”
Into a format like this:
c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)
#make a character vector that holds the desired output to use as a test case:
testStrings <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
#I was really confused by the wording of this question
#& am still not sure what we are supposed to work with as input.
#This was my best guess:
data <-
'[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"'
data
## [1] "[1] \"bell pepper\" \"bilberry\" \"blackberry\" \"blood orange\"\n\n[5] \"blueberry\" \"cantaloupe\" \"chili pepper\" \"cloudberry\" \n\n[9] \"elderberry\" \"lime\" \"lychee\" \"mulberry\" \n\n[13] \"olive\" \"salal berry\""
We want to parse the words from the string above and pass them to a character vector using regular expressions.
#use strsplit() to split by '\"'
result <- strsplit(data, '\\"')
# '[a-z].*[a-z]' -> match a letter,
#followed by any number of characters,
#ending with another letter (ignore.case = TRUE also allows upper case).
#This keeps the words and drops the leftover pieces such as '[1] ' and whitespace.
dataStrings = grep(pattern = '[a-z].*[a-z]', result[[1]], value = TRUE, ignore.case = TRUE)
dataStrings
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
#compare the parsed result with the test case, element by element
dataStrings == testStrings
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
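An alternative to splitting on the quote character is to extract the quoted pieces directly and then paste them back together as the `c(...)` text the exercise asks for. The sketch below works under that assumption on a shortened, hypothetical input (`rawText`); the names `quoted`, `fruits` and `asCode` are illustrative only.

```r
#shortened, hypothetical version of the printed vector
rawText <- '[1] "bell pepper" "bilberry"\n[3] "lime" "salal berry"'

#'"[^"]+"' -> a quote, one or more non-quote characters, a closing quote
quoted <- regmatches( rawText, gregexpr( '"[^"]+"', rawText ) )[[1]]

#strip the quotes to get a plain character vector
fruits <- gsub( '"', '', quoted )

#rebuild the text of a c(...) call from the quoted pieces
asCode <- paste0( 'c(', paste( quoted, collapse = ', ' ), ')' )
cat( asCode, '\n' )
```

Evaluating the generated text with `eval( parse( text = asCode ) )` gives back the same character vector, which is one way to check the round trip.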
‘(.)\1\1’: ‘(.)’ captures any character, ‘\1’ matches that same character again, and the second ‘\1’ matches it once more -> finds three identical characters in a row (e.g. ‘aaa’).
‘(.)(.)\2\1’: the first ‘(.)’ captures a character, the second ‘(.)’ captures the next character, ‘\2’ matches the second character again, and ‘\1’ matches the first character again -> finds a four-character palindrome (e.g. ‘eppe’ in ‘bell pepper’).
## [1] "bell pepper" "chili pepper"
## [1] "salal berry"
dataStrings <- c( dataStrings, 'banana')
grep(pattern = '(.).\\1.\\1', dataStrings, value = TRUE, ignore.case = TRUE)
## [1] "banana"
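The backreference patterns discussed in this part can be tried side by side on a small vector. This is a sketch only: `sampleFruit` is a hypothetical subset of the fruit names, and `'(..)\\1'` (a repeated pair of characters) is included as an assumption about which pattern singles out 'salal berry'.

```r
#hypothetical subset of the fruit names, plus 'banana'
sampleFruit <- c( 'bell pepper', 'salal berry', 'banana', 'lime' )

#three identical characters in a row (none of these names has that)
tripled <- grep( '(.)\\1\\1', sampleFruit, value = TRUE )

#four-character palindrome, e.g. 'eppe'
palindrome <- grep( '(.)(.)\\2\\1', sampleFruit, value = TRUE )

#a pair of characters immediately repeated, e.g. 'alal' or 'anan'
repeatedPair <- grep( '(..)\\1', sampleFruit, value = TRUE )

#same character at positions 1, 3 and 5 of a five-character stretch, e.g. 'anana'
alternating <- grep( '(.).\\1.\\1', sampleFruit, value = TRUE )

print( list( tripled, palindrome, repeatedPair, alternating ) )
```

In each pattern `\\1` and `\\2` refer back to whatever the first and second parenthesised groups captured, which is why the backslashes are doubled inside an R string.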
For this exercise I will use the feature ‘country.name.de’ from the dataframe ‘codelist’ in the ‘countrycode’ library to test my expressions.
library( countrycode )
countryNames = tolower( codelist$country.name.de )
#I changed the names to lower case to make the first expression a simpler case
head( countryNames )
## [1] "afghanistan" "aland islands" "albanien"
## [4] "algerien" "amerikanisch-samoa" "andorra"
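The expressions applied to `countryNames` are not shown in this section, so the three patterns below are assumptions. They are one plausible set of backreference expressions for this kind of exercise: same first and last character, a repeated pair of characters, and one character occurring in at least three places. The sketch runs on the six names printed by `head()` above (`sampleNames` is a hypothetical stand-in), so it does not need the ‘countrycode’ library.

```r
#the six lower-cased names printed by head( countryNames ) above
sampleNames <- c( 'afghanistan', 'aland islands', 'albanien',
                  'algerien', 'amerikanisch-samoa', 'andorra' )

#starts and ends with the same character
sameEnds <- grep( '^(.).*\\1$', sampleNames, value = TRUE )

#contains a pair of characters that occurs twice
repeatedPair <- grep( '(..).*\\1', sampleNames, value = TRUE )

#contains one character in at least three places
threeTimes <- grep( '(.).*\\1.*\\1', sampleNames, value = TRUE )

print( sameEnds )
print( repeatedPair )
print( threeTimes )
```

Lower-casing matters for the first pattern: without it, a capitalised first letter would not equal a lower-case last letter unless `ignore.case = TRUE` were set.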
## [1] "amerikanisch-samoa" "andorra"
## [3] "angola" "anguilla"
## [5] "antigua und barbuda" "aruba"
## [7] "elfenbeinküste" "korea, demokratische volksrepublik"
## [9] "deutschland" "niederländische antillen"
## [11] "neu-kaledonien" "nördliche marianen"
## [13] "norwegen" "st. kitts und nevis"
## [1] "afghanistan"
## [2] "aland islands"
## [3] "amerikanisch-samoa"
## [4] "argentinien"
## [5] "armenien"
## [6] "barbados"
## [7] "bonaire, sint eustatius und saba"
## [8] "britisches territorium im indischen ozean"
## [9] "brunei darussalam"
## [10] "zentralafrikanische republik"
## [11] "kokosinseln (keelinginseln)"
## [12] "tschechische republik"
## [13] "tschechoslowakei"
## [14] "korea, demokratische volksrepublik"
## [15] "dominikanische republik"
## [16] "falklandinseln (malvinas)"
## [17] "französisch-guayana"
## [18] "französisch polynesien"
## [19] "deutsche demokratische republik"
## [20] "kurfürstentum hessen"
## [21] "großherzogtum hessen"
## [22] "vatikanstaat"
## [23] "hongkong"
## [24] "kirgisistan"
## [25] "liechtenstein"
## [26] "niederlande"
## [27] "papua-neuguinea"
## [28] "korea, republik von"
## [29] "st. helena, ascension und tristan da cunha"
## [30] "saint martin (französischer teil)"
## [31] "st. vincent und die grenadinen"
## [32] "sint maarten (niederländischer teil)"
## [33] "slowenien"
## [34] "südgeorgien und die südlichen sandwichinseln"
## [35] "syrische arabische republik"
## [36] "tadschikistan"
## [37] "königreich beider sizilien"
## [38] "vereinigte arabische emirate"
## [39] "tansania"
## [40] "united states minor outlying islands"
## [41] "vereinigte staaten"
## [42] "wallis und futuna"
## [43] "jemenitische arabische republik"
## [44] "demokratische volksrepublik jemen"
## [1] "afghanistan"
## [2] "aland islands"
## [3] "amerikanisch-samoa"
## [4] "antigua und barbuda"
## [5] "argentinien"
## [6] "österreich-ungarn"
## [7] "aserbaidschan"
## [8] "bahamas"
## [9] "bonaire, sint eustatius und saba"
## [10] "bosnien und herzegowina"
## [11] "britisches territorium im indischen ozean"
## [12] "brunei darussalam"
## [13] "kanada"
## [14] "cayman inseln"
## [15] "zentralafrikanische republik"
## [16] "kokosinseln (keelinginseln)"
## [17] "elfenbeinküste"
## [18] "tschechische republik"
## [19] "korea, demokratische volksrepublik"
## [20] "demokratische republik kongo"
## [21] "dominikanische republik"
## [22] "äquatorialguinea"
## [23] "falklandinseln (malvinas)"
## [24] "finnland"
## [25] "französisch-guayana"
## [26] "französisch polynesien"
## [27] "französische südgebiete"
## [28] "deutsche demokratische republik"
## [29] "guatemala"
## [30] "heard und mcdonaldinseln"
## [31] "kurfürstentum hessen"
## [32] "großherzogtum hessen"
## [33] "vatikanstaat"
## [34] "indonesien"
## [35] "jamaika"
## [36] "kasachstan"
## [37] "kiribati"
## [38] "kosovo"
## [39] "kirgisistan"
## [40] "demokratische volksrepublik laos"
## [41] "liechtenstein"
## [42] "madagaskar"
## [43] "malaysia"
## [44] "marshallinseln"
## [45] "mecklenburg-schwerin"
## [46] "niederlande"
## [47] "niederländische antillen"
## [48] "neu-kaledonien"
## [49] "neuseeland"
## [50] "nicaragua"
## [51] "nördliche marianen"
## [52] "panama"
## [53] "papua-neuguinea"
## [54] "paraguay"
## [55] "philippinen"
## [56] "russische föderation"
## [57] "st. helena, ascension und tristan da cunha"
## [58] "st. kitts und nevis"
## [59] "saint martin (französischer teil)"
## [60] "saint-pierre und miquelon"
## [61] "st. vincent und die grenadinen"
## [62] "saudi arabien"
## [63] "seychellen"
## [64] "sierra leone"
## [65] "sint maarten (niederländischer teil)"
## [66] "salomon-inseln"
## [67] "südgeorgien und die südlichen sandwichinseln"
## [68] "svalbard und jan mayen"
## [69] "syrische arabische republik"
## [70] "trinidad und tobago"
## [71] "turks- und caicosinseln"
## [72] "königreich beider sizilien"
## [73] "vereinigte arabische emirate"
## [74] "großbritannien"
## [75] "tansania"
## [76] "united states minor outlying islands"
## [77] "vereinigte staaten"
## [78] "uruguay"
## [79] "venezuela"
## [80] "britische jungferninseln"
## [81] "virgin islands, u.s."
## [82] "westsahara"
## [83] "jemenitische arabische republik"
## [84] "demokratische volksrepublik jemen"