1. College Majors Dataset

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors whose names contain either “DATA” or “STATISTICS”.

## # A tibble: 6 x 3
##   FOD1P Major                                 Major_Category                 
##   <chr> <chr>                                 <chr>                          
## 1 1100  GENERAL AGRICULTURE                   Agriculture & Natural Resources
## 2 1101  AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3 1102  AGRICULTURAL ECONOMICS                Agriculture & Natural Resources
## 4 1103  ANIMAL SCIENCES                       Agriculture & Natural Resources
## 5 1104  FOOD SCIENCE                          Agriculture & Natural Resources
## 6 1105  PLANT SCIENCE AND AGRONOMY            Agriculture & Natural Resources
## [1] "COMPUTER PROGRAMMING AND DATA PROCESSING"
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "STATISTICS AND DECISION SCIENCE"
##  [1] "ANIMAL SCIENCES"                                    
##  [2] "PLANT SCIENCE AND AGRONOMY"                         
##  [3] "BIOCHEMICAL SCIENCES"                               
##  [4] "COGNITIVE SCIENCE AND BIOPSYCHOLOGY"                
##  [5] "INFORMATION SCIENCES"                               
##  [6] "SCIENCE AND COMPUTER TEACHER EDUCATION"             
##  [7] "SOCIAL SCIENCE OR HISTORY TEACHER EDUCATION"        
##  [8] "NUTRITION SCIENCES"                                 
##  [9] "COMMUNICATION DISORDERS SCIENCES AND SERVICES"      
## [10] "PHARMACY PHARMACEUTICAL SCIENCES AND ADMINISTRATION"
## [11] "FAMILY AND CONSUMER SCIENCES"                       
## [12] "TRANSPORTATION SCIENCES AND TECHNOLOGIES"           
## [13] "PHYSICAL SCIENCES"                                  
## [14] "ATMOSPHERIC SCIENCES AND METEOROLOGY"               
## [15] "INTERDISCIPLINARY SOCIAL SCIENCES"                  
## [16] "GENERAL SOCIAL SCIENCES"                            
## [17] "POLITICAL SCIENCE AND GOVERNMENT"                   
## [18] "MISCELLANEOUS SOCIAL SCIENCES"
## [1] 18
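
The code that produced the output above is not shown. A minimal sketch in base R follows, assuming the major names sit in a character vector. Here `majors` is a hypothetical five-element sample copied from the output above, not the full 173-row dataset.

```r
# Hypothetical sample of major names copied from the output above;
# the real dataset has 173 rows.
majors <- c(
  "GENERAL AGRICULTURE",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "STATISTICS AND DECISION SCIENCE",
  "ANIMAL SCIENCES"
)

# "DATA|STATISTICS" is an alternation: it matches names containing either word.
# value = TRUE returns the matching strings instead of their positions.
data_or_stats <- grep("DATA|STATISTICS", majors, value = TRUE)
print(data_or_stats)
```

With the full dataset loaded as a data frame, the same pattern could be applied to its Major column.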




2. Parse words to list

Write code that transforms the data below:

[1] "bell pepper" "bilberry" "blackberry" "blood orange"

[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"

[9] "elderberry" "lime" "lychee" "mulberry"

[13] "olive" "salal berry"

Into a format like this:

c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

## [1] "[1] \"bell pepper\"  \"bilberry\"     \"blackberry\"   \"blood orange\"\n\n[5] \"blueberry\"    \"cantaloupe\"   \"chili pepper\" \"cloudberry\"  \n\n[9] \"elderberry\"   \"lime\"         \"lychee\"       \"mulberry\"    \n\n[13] \"olive\"        \"salal berry\""



We want to extract the words from the string above into a character vector using regular expressions.

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
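
The extraction step is not shown above. One possible approach in base R is sketched here, assuming the raw text is stored in a single string (the name `raw` is hypothetical): pull out every double-quoted run, then strip the quotes.

```r
# The raw printed text as one string (variable name is an assumption).
raw <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"\n\n[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  \n\n[9] "elderberry"   "lime"         "lychee"       "mulberry"    \n\n[13] "olive"        "salal berry"'

# Match a quote, one or more non-quote characters, and a closing quote.
# This skips the [1], [5], ... index markers because they are not quoted.
quoted <- regmatches(raw, gregexpr('"[^"]+"', raw))[[1]]

# Remove the surrounding quote characters to leave the bare words.
fruits <- gsub('"', "", quoted, fixed = TRUE)

# dput() prints the vector in the c("...", "...") form asked for.
dput(fruits)
```

Matching on the quotes rather than on letters keeps two-word items such as "bell pepper" together.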




3. Describe, in words, what these expressions will match:

  • (.)\1\1 ->

    ‘(.)’ = capture any single character; ‘\1’ = the same character again; ‘\1’ = the same character a third time ->
    matches the same character three times in a row (e.g. ‘aaa’)

  • (.)(.)\2\1 ->

    ‘(.)’ = capture a character; ‘(.)’ = capture the next character; ‘\2’ = the second captured character again; ‘\1’ = the first captured character again -> matches a four-character palindrome (e.g. ‘eppe’ in ‘bell pepper’)

## [1] "bell pepper"  "chili pepper"
  • (..)\1 ->

    ‘(..)’ = capture two consecutive characters as a group; ‘\1’ = the same two characters immediately again -> matches a pair of characters repeated back to back (e.g. ‘alal’ in ‘salal berry’)
## [1] "salal berry"
  • (.).\1.\1 ->

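    ‘(.)’ = capture a character; ‘.’ = any one character (it may also be the same character); ‘\1’ = the captured character again; ‘.’ = any one character; ‘\1’ = the captured character once more -> matches a character that occurs three times with exactly one character between occurrences (e.g. ‘anana’ in ‘banana’)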
## [1] "banana"
  • (.)(.)(.).*\3\2\1 ->

    This takes the second expression, (.)(.)\2\1, one step further: ‘(.)(.)(.)’ = capture three characters; ‘.*’ = any number of arbitrary characters in between (including none); ‘\3\2\1’ = the three captured characters in reverse order (e.g. ‘123badpassword321’)
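
The descriptions above can be checked against a few test strings. This is a small sketch in base R; the vector `words` and the pattern names are illustrative assumptions. The last pattern is written with `.*`, following the "any number of characters" description. Inside an R string each backslash must be doubled.

```r
# Illustrative test strings (hypothetical sample).
words <- c("aaa", "bell pepper", "salal berry", "banana",
           "123badpassword321", "lime")

# The five expressions, with backslashes doubled for R string literals.
patterns <- c(
  triple      = "(.)\\1\\1",
  palindrome4 = "(.)(.)\\2\\1",
  pair_twice  = "(..)\\1",
  alternating = "(.).\\1.\\1",
  mirrored    = "(.)(.)(.).*\\3\\2\\1"
)

# For each pattern, keep the words that contain a match.
hits <- lapply(patterns, function(p) words[grepl(p, words, perl = TRUE)])
print(hits)
```

Note that "banana" also matches the repeated-pair pattern (‘anan’), which is why it appears under two of the five patterns.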






4. Construct regular expressions to match words that:

  • Start and end with the same character.
  • Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
  • Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

For this exercise I will test my expressions on the column ‘country.name.de’ of the data frame ‘codelist’ from the ‘countrycode’ package.

## [1] "afghanistan"        "aland islands"      "albanien"          
## [4] "algerien"           "amerikanisch-samoa" "andorra"
  • Start and end with the same character.

##  [1] "amerikanisch-samoa"                 "andorra"                           
##  [3] "angola"                             "anguilla"                          
##  [5] "antigua und barbuda"                "aruba"                             
##  [7] "elfenbeinküste"                     "korea, demokratische volksrepublik"
##  [9] "deutschland"                        "niederländische antillen"          
## [11] "neu-kaledonien"                     "nördliche marianen"                
## [13] "norwegen"                           "st. kitts und nevis"
  • Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

##  [1] "afghanistan"                                 
##  [2] "aland islands"                               
##  [3] "amerikanisch-samoa"                          
##  [4] "argentinien"                                 
##  [5] "armenien"                                    
##  [6] "barbados"                                    
##  [7] "bonaire, sint eustatius und saba"            
##  [8] "britisches territorium im indischen ozean"   
##  [9] "brunei darussalam"                           
## [10] "zentralafrikanische republik"                
## [11] "kokosinseln (keelinginseln)"                 
## [12] "tschechische republik"                       
## [13] "tschechoslowakei"                            
## [14] "korea, demokratische volksrepublik"          
## [15] "dominikanische republik"                     
## [16] "falklandinseln (malvinas)"                   
## [17] "französisch-guayana"                         
## [18] "französisch polynesien"                      
## [19] "deutsche demokratische republik"             
## [20] "kurfürstentum hessen"                        
## [21] "großherzogtum hessen"                        
## [22] "vatikanstaat"                                
## [23] "hongkong"                                    
## [24] "kirgisistan"                                 
## [25] "liechtenstein"                               
## [26] "niederlande"                                 
## [27] "papua-neuguinea"                             
## [28] "korea, republik von"                         
## [29] "st. helena, ascension und tristan da cunha"  
## [30] "saint martin (französischer teil)"           
## [31] "st. vincent und die grenadinen"              
## [32] "sint maarten (niederländischer teil)"        
## [33] "slowenien"                                   
## [34] "südgeorgien und die südlichen sandwichinseln"
## [35] "syrische arabische republik"                 
## [36] "tadschikistan"                               
## [37] "königreich beider sizilien"                  
## [38] "vereinigte arabische emirate"                
## [39] "tansania"                                    
## [40] "united states minor outlying islands"        
## [41] "vereinigte staaten"                          
## [42] "wallis und futuna"                           
## [43] "jemenitische arabische republik"             
## [44] "demokratische volksrepublik jemen"
  • Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

##  [1] "afghanistan"                                 
##  [2] "aland islands"                               
##  [3] "amerikanisch-samoa"                          
##  [4] "antigua und barbuda"                         
##  [5] "argentinien"                                 
##  [6] "österreich-ungarn"                           
##  [7] "aserbaidschan"                               
##  [8] "bahamas"                                     
##  [9] "bonaire, sint eustatius und saba"            
## [10] "bosnien und herzegowina"                     
## [11] "britisches territorium im indischen ozean"   
## [12] "brunei darussalam"                           
## [13] "kanada"                                      
## [14] "cayman inseln"                               
## [15] "zentralafrikanische republik"                
## [16] "kokosinseln (keelinginseln)"                 
## [17] "elfenbeinküste"                              
## [18] "tschechische republik"                       
## [19] "korea, demokratische volksrepublik"          
## [20] "demokratische republik kongo"                
## [21] "dominikanische republik"                     
## [22] "äquatorialguinea"                            
## [23] "falklandinseln (malvinas)"                   
## [24] "finnland"                                    
## [25] "französisch-guayana"                         
## [26] "französisch polynesien"                      
## [27] "französische südgebiete"                     
## [28] "deutsche demokratische republik"             
## [29] "guatemala"                                   
## [30] "heard und mcdonaldinseln"                    
## [31] "kurfürstentum hessen"                        
## [32] "großherzogtum hessen"                        
## [33] "vatikanstaat"                                
## [34] "indonesien"                                  
## [35] "jamaika"                                     
## [36] "kasachstan"                                  
## [37] "kiribati"                                    
## [38] "kosovo"                                      
## [39] "kirgisistan"                                 
## [40] "demokratische volksrepublik laos"            
## [41] "liechtenstein"                               
## [42] "madagaskar"                                  
## [43] "malaysia"                                    
## [44] "marshallinseln"                              
## [45] "mecklenburg-schwerin"                        
## [46] "niederlande"                                 
## [47] "niederländische antillen"                    
## [48] "neu-kaledonien"                              
## [49] "neuseeland"                                  
## [50] "nicaragua"                                   
## [51] "nördliche marianen"                          
## [52] "panama"                                      
## [53] "papua-neuguinea"                             
## [54] "paraguay"                                    
## [55] "philippinen"                                 
## [56] "russische föderation"                        
## [57] "st. helena, ascension und tristan da cunha"  
## [58] "st. kitts und nevis"                         
## [59] "saint martin (französischer teil)"           
## [60] "saint-pierre und miquelon"                   
## [61] "st. vincent und die grenadinen"              
## [62] "saudi arabien"                               
## [63] "seychellen"                                  
## [64] "sierra leone"                                
## [65] "sint maarten (niederländischer teil)"        
## [66] "salomon-inseln"                              
## [67] "südgeorgien und die südlichen sandwichinseln"
## [68] "svalbard und jan mayen"                      
## [69] "syrische arabische republik"                 
## [70] "trinidad und tobago"                         
## [71] "turks- und caicosinseln"                     
## [72] "königreich beider sizilien"                  
## [73] "vereinigte arabische emirate"                
## [74] "großbritannien"                              
## [75] "tansania"                                    
## [76] "united states minor outlying islands"        
## [77] "vereinigte staaten"                          
## [78] "uruguay"                                     
## [79] "venezuela"                                   
## [80] "britische jungferninseln"                    
## [81] "virgin islands, u.s."                        
## [82] "westsahara"                                  
## [83] "jemenitische arabische republik"             
## [84] "demokratische volksrepublik jemen"
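
The expressions behind the three outputs above are not shown. The sketch below gives one plausible set in base R, run on a hypothetical six-name sample (`countries`) instead of the full ‘country.name.de’ column. The patterns use ‘.’, so spaces and punctuation count as characters too.

```r
# Hypothetical sample of lower-cased country names.
countries <- c("andorra", "afghanistan", "kosovo",
               "norwegen", "hongkong", "peru")

# Start and end with the same character: capture the first, require it last.
same_ends <- "^(.).*\\1$"
# A pair of characters that occurs again later in the string.
repeated_pair <- "(..).*\\1"
# One character that occurs in at least three places.
three_times <- "(.).*\\1.*\\1"

match_names <- function(p) countries[grepl(p, countries, perl = TRUE)]

ends  <- match_names(same_ends)
pairs <- match_names(repeated_pair)
three <- match_names(three_times)
print(list(same_ends = ends, repeated_pair = pairs, three_times = three))
```

As written, the first pattern needs at least two characters, so a one-character string would not match; that case does not arise with country names.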