DATA607 Assignment #3: R Character Manipulation

1. College Majors Dataset

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

#load the FiveThirtyEight data from 'majors-list.csv' to a data.frame
library( tidyverse )
dataURL <- 'https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv'
dataDF <- read_csv( dataURL )
head( dataDF )

## # A tibble: 6 x 3
##   FOD1P Major                                 Major_Category                 
##   <chr> <chr>                                 <chr>                          
## 1 1100  GENERAL AGRICULTURE                   Agriculture & Natural Resources
## 2 1101  AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3 1102  AGRICULTURAL ECONOMICS                Agriculture & Natural Resources
## 4 1103  ANIMAL SCIENCES                       Agriculture & Natural Resources
## 5 1104  FOOD SCIENCE                          Agriculture & Natural Resources
## 6 1105  PLANT SCIENCE AND AGRONOMY            Agriculture & Natural Resources

#identify majors that contain the word 'DATA'
DATA_idx <- grepl( 'DATA', dataDF$Major )
dataDF$Major[ DATA_idx ]

## [1] "COMPUTER PROGRAMMING AND DATA PROCESSING"

#identify majors that contain the word 'STATISTICS'
DATA_idx <- grepl( 'STATISTICS', dataDF$Major )
dataDF$Major[ DATA_idx ]

## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "STATISTICS AND DECISION SCIENCE"
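The task asks for majors that contain either word, so the two searches can also be combined in one pattern with regex alternation. A minimal sketch, assuming a small hand-picked sample of majors (the hypothetical vector sampleMajors stands in for dataDF$Major so the example runs without downloading the data):

```r
#hypothetical sample of majors copied from the output above
sampleMajors <- c( 'GENERAL AGRICULTURE',
                   'COMPUTER PROGRAMMING AND DATA PROCESSING',
                   'MANAGEMENT INFORMATION SYSTEMS AND STATISTICS',
                   'STATISTICS AND DECISION SCIENCE' )
#'|' is regex alternation: match 'DATA' or 'STATISTICS'
#no spaces around '|', because a space would become part of the pattern
both_idx <- grepl( 'DATA|STATISTICS', sampleMajors )
sampleMajors[ both_idx ]
```

On the full dataset the same call would be grepl( 'DATA|STATISTICS', dataDF$Major ).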

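#I'm just curious: a lot of majors contain 'SCIENCE' or 'SCIENCES'.
#I would like to identify & count them as well.
#note: the spaces around '|' are part of the pattern, so it matches 'SCIENCE ' (trailing space)
#or ' SCIENCES' (leading space); a major that ends in 'SCIENCE' (e.g. 'FOOD SCIENCE') is not matched
DATA_idx <- grepl( 'SCIENCE | SCIENCES', dataDF$Major )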
dataDF$Major[ DATA_idx ]

##  [1] "ANIMAL SCIENCES"                                    
##  [2] "PLANT SCIENCE AND AGRONOMY"                         
##  [3] "BIOCHEMICAL SCIENCES"                               
##  [4] "COGNITIVE SCIENCE AND BIOPSYCHOLOGY"                
##  [5] "INFORMATION SCIENCES"                               
##  [6] "SCIENCE AND COMPUTER TEACHER EDUCATION"             
##  [7] "SOCIAL SCIENCE OR HISTORY TEACHER EDUCATION"        
##  [8] "NUTRITION SCIENCES"                                 
##  [9] "COMMUNICATION DISORDERS SCIENCES AND SERVICES"      
## [10] "PHARMACY PHARMACEUTICAL SCIENCES AND ADMINISTRATION"
## [11] "FAMILY AND CONSUMER SCIENCES"                       
## [12] "TRANSPORTATION SCIENCES AND TECHNOLOGIES"           
## [13] "PHYSICAL SCIENCES"                                  
## [14] "ATMOSPHERIC SCIENCES AND METEOROLOGY"               
## [15] "INTERDISCIPLINARY SOCIAL SCIENCES"                  
## [16] "GENERAL SOCIAL SCIENCES"                            
## [17] "POLITICAL SCIENCE AND GOVERNMENT"                   
## [18] "MISCELLANEOUS SOCIAL SCIENCES"

length( dataDF$Major[ DATA_idx ] )

## [1] 18
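The pattern 'SCIENCE | SCIENCES' contains literal spaces, which is why 'FOOD SCIENCE' from the head of the table is absent from the list above. A minimal sketch of one possible alternative, assuming that whole-word matches of 'SCIENCE' or 'SCIENCES' are what is wanted (sampleMajors is a hypothetical stand-in for dataDF$Major):

```r
#hypothetical sample of majors copied from the output above
sampleMajors <- c( 'FOOD SCIENCE', 'ANIMAL SCIENCES',
                   'PLANT SCIENCE AND AGRONOMY', 'GENERAL AGRICULTURE' )
#original pattern: needs a space after 'SCIENCE' or before 'SCIENCES'
spaced_idx <- grepl( 'SCIENCE | SCIENCES', sampleMajors )
#alternative: '\\b' is a word boundary and 'S?' makes the final 'S' optional
word_idx <- grepl( '\\bSCIENCES?\\b', sampleMajors, perl = TRUE )
sampleMajors[ word_idx ]
#sum() of a logical vector counts the TRUE values
sum( word_idx )
```

The count on the full dataset would differ from the 18 reported above, so it is left for the reader to run.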

2. Parse words to list

Write code that transforms the data below:

[1] “bell pepper” “bilberry” “blackberry” “blood orange”

[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”

[9] “elderberry” “lime” “lychee” “mulberry”

[13] “olive” “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

#make a character vector that holds the desired output to use as a test case:
testStrings <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

#I was really confused by the wording of this question 
#& am still not sure what we are supposed to work with as input. 
#This was my best guess:
data <- 
'[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"

[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  

[9] "elderberry"   "lime"         "lychee"       "mulberry"    

[13] "olive"        "salal berry"'
data

## [1] "[1] \"bell pepper\"  \"bilberry\"     \"blackberry\"   \"blood orange\"\n\n[5] \"blueberry\"    \"cantaloupe\"   \"chili pepper\" \"cloudberry\"  \n\n[9] \"elderberry\"   \"lime\"         \"lychee\"       \"mulberry\"    \n\n[13] \"olive\"        \"salal berry\""

We want to parse the words from the string above and collect them in a character vector using regular expressions.

#use strsplit() to split the string at each double quote
result <- strsplit(data, '\\"')
# '[a-z].*[a-z]' -> match a letter, followed by any number of characters,
#followed by another letter (ignore.case = TRUE makes the match case-insensitive).
#This keeps the fruit names and drops the fragments that hold only indices, spaces and newlines.
dataStrings <- grep(pattern = '[a-z].*[a-z]', result[[1]], value = TRUE, ignore.case = TRUE)
dataStrings

##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"

#test to see if our result matches our known test character vector
dataStrings == testStrings

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
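Another reading of the question is that the result should be the text of a c(...) call. A minimal sketch of that interpretation, assuming a shortened version of the printed input (the names rawText, fruits and asCode are hypothetical): it pulls out the quoted tokens directly with regmatches() and gregexpr(), then pastes them into a single string.

```r
#shortened, hypothetical version of the printed input
rawText <- '[1] "bell pepper"  "bilberry"\n\n[3] "lime"'
#'"[^"]*"' -> a double quote, any run of non-quote characters, a closing quote
quoted <- regmatches( rawText, gregexpr( '"[^"]*"', rawText ) )[[1]]
#strip the quotes to get a plain character vector
fruits <- gsub( '"', '', quoted )
#join the quoted tokens with ', ' and wrap them in 'c(' ... ')'
asCode <- paste0( 'c(', paste( quoted, collapse = ', ' ), ')' )
cat( asCode, '\n' )
#evaluating the string gives back the character vector
identical( eval( parse( text = asCode ) ), fruits )
```

Base R's dput() prints a vector in the same c(...) form and may be the simpler choice when only the display matters.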

3. Describe, in words, what these expressions will match:

(.)\1\1 ->

‘(.)’ = capture any one character, ‘\1’ = the same character again, ‘\1’ = and the same character once more ->
find the same character three times in a row (e.g. ‘aaa’)
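A quick check of this reading, as a minimal sketch with a hypothetical test vector. Inside an R string the backslash itself has to be escaped, so the backreference \1 is written '\\1':

```r
#hypothetical test words
tripleTest <- c( 'aaa', 'aab', 'bookkeeper', 'zzz top' )
#TRUE only where one character occurs three times in a row
tripleHit <- grepl( '(.)\\1\\1', tripleTest )
tripleTest[ tripleHit ]
```

'bookkeeper' has three doubled letters but no tripled one, so it is not matched.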
(.)(.)\2\1 ->

‘(.)’ = capture any one character, ‘(.)’ = capture the next character, ‘\2’ = then the second character again, ‘\1’ = then the first character again -> find a four-character palindrome (e.g. ‘eppe’ in ‘bell pepper’)

grep(pattern = '(.)(.)\\2\\1', dataStrings, value = TRUE, ignore.case = TRUE)

## [1] "bell pepper"  "chili pepper"

(..)\1 ->
‘(..)’ = capture two consecutive characters as a group, ‘\1’ = the same group again -> find a pair of characters that is repeated back to back (e.g. ‘alal’ in ‘salal berry’)

grep(pattern = '(..)\\1', dataStrings, value = TRUE, ignore.case = TRUE)

## [1] "salal berry"

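(.).\1.\1 ->
‘(.)’ = capture any one character, ‘.’ = any one character, ‘\1’ = the captured character again, ‘.’ = any one character, ‘\1’ = the captured character once more -> find a character that appears three times with exactly one character between occurrences (e.g. ‘anana’ in ‘banana’)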

dataStrings <- c( dataStrings, 'banana')
grep(pattern = '(.).\\1.\\1', dataStrings, value = TRUE, ignore.case = TRUE)

## [1] "banana"
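Note that ‘.’ matches any character, including the captured one, so the characters in between do not have to differ from it. A minimal sketch with a hypothetical test vector:

```r
#hypothetical test words
gapTest <- c( 'banana', 'aaaaa', 'abaca', 'abcde' )
#the captured character must reappear at offsets 2 and 4 from its first position
gapHit <- grepl( '(.).\\1.\\1', gapTest )
gapTest[ gapHit ]
```

Only 'abcde' fails, because no character in it recurs.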

(.)(.)(.).*\3\2\1 ->
This takes the second expression ((.)(.)\2\1) one step further: ‘(.)(.)(.)’ = capture three characters, ‘.*’ = any number of characters (including none) in between, ‘\3\2\1’ = then the three captured characters in reverse order (e.g. ‘123badpassword321’)
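The description of any number of characters in between corresponds to the pattern written with ‘.*’ in the middle. A minimal sketch under that assumption, with a hypothetical test vector, contrasted with a single ‘.’ (exactly one character in between):

```r
#hypothetical test strings
mirrorTest <- c( '123badpassword321', 'abcXcba', 'abccba', 'abcabc' )
#'.*' -> zero or more characters between the three captures and their reversal
anyGap <- grepl( '(.)(.)(.).*\\3\\2\\1', mirrorTest )
#'.' -> exactly one character between the three captures and their reversal
oneGap <- grepl( '(.)(.)(.).\\3\\2\\1', mirrorTest )
mirrorTest[ anyGap ]
mirrorTest[ oneGap ]
```

With ‘.*’ the first three strings match; with a single ‘.’ only 'abcXcba' does.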

4. Construct regular expressions to match words that:

Start and end with the same character.
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

For this exercise I will use the column ‘country.name.de’ of the data frame ‘codelist’ from the ‘countrycode’ package to test my expressions.

library( countrycode )

countryNames <- tolower( codelist$country.name.de )
#convert to lower case so that the first expression has only one case to handle
head( countryNames )

## [1] "afghanistan"        "aland islands"      "albanien"          
## [4] "algerien"           "amerikanisch-samoa" "andorra"

Start and end with the same character.

grep(pattern = '^([A-Za-z]).*\\1$', countryNames, value = TRUE, ignore.case = TRUE)

##  [1] "amerikanisch-samoa"                 "andorra"                           
##  [3] "angola"                             "anguilla"                          
##  [5] "antigua und barbuda"                "aruba"                             
##  [7] "elfenbeinküste"                     "korea, demokratische volksrepublik"
##  [9] "deutschland"                        "niederländische antillen"          
## [11] "neu-kaledonien"                     "nördliche marianen"                
## [13] "norwegen"                           "st. kitts und nevis"

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

grep(pattern = '([A-Za-z][A-Za-z]).*\\1', countryNames, value = TRUE, ignore.case = TRUE)

##  [1] "afghanistan"                                 
##  [2] "aland islands"                               
##  [3] "amerikanisch-samoa"                          
##  [4] "argentinien"                                 
##  [5] "armenien"                                    
##  [6] "barbados"                                    
##  [7] "bonaire, sint eustatius und saba"            
##  [8] "britisches territorium im indischen ozean"   
##  [9] "brunei darussalam"                           
## [10] "zentralafrikanische republik"                
## [11] "kokosinseln (keelinginseln)"                 
## [12] "tschechische republik"                       
## [13] "tschechoslowakei"                            
## [14] "korea, demokratische volksrepublik"          
## [15] "dominikanische republik"                     
## [16] "falklandinseln (malvinas)"                   
## [17] "französisch-guayana"                         
## [18] "französisch polynesien"                      
## [19] "deutsche demokratische republik"             
## [20] "kurfürstentum hessen"                        
## [21] "großherzogtum hessen"                        
## [22] "vatikanstaat"                                
## [23] "hongkong"                                    
## [24] "kirgisistan"                                 
## [25] "liechtenstein"                               
## [26] "niederlande"                                 
## [27] "papua-neuguinea"                             
## [28] "korea, republik von"                         
## [29] "st. helena, ascension und tristan da cunha"  
## [30] "saint martin (französischer teil)"           
## [31] "st. vincent und die grenadinen"              
## [32] "sint maarten (niederländischer teil)"        
## [33] "slowenien"                                   
## [34] "südgeorgien und die südlichen sandwichinseln"
## [35] "syrische arabische republik"                 
## [36] "tadschikistan"                               
## [37] "königreich beider sizilien"                  
## [38] "vereinigte arabische emirate"                
## [39] "tansania"                                    
## [40] "united states minor outlying islands"        
## [41] "vereinigte staaten"                          
## [42] "wallis und futuna"                           
## [43] "jemenitische arabische republik"             
## [44] "demokratische volksrepublik jemen"

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

grep(pattern = '([A-Za-z]).*\\1.*\\1', countryNames, value = TRUE, ignore.case = TRUE)

##  [1] "afghanistan"                                 
##  [2] "aland islands"                               
##  [3] "amerikanisch-samoa"                          
##  [4] "antigua und barbuda"                         
##  [5] "argentinien"                                 
##  [6] "österreich-ungarn"                           
##  [7] "aserbaidschan"                               
##  [8] "bahamas"                                     
##  [9] "bonaire, sint eustatius und saba"            
## [10] "bosnien und herzegowina"                     
## [11] "britisches territorium im indischen ozean"   
## [12] "brunei darussalam"                           
## [13] "kanada"                                      
## [14] "cayman inseln"                               
## [15] "zentralafrikanische republik"                
## [16] "kokosinseln (keelinginseln)"                 
## [17] "elfenbeinküste"                              
## [18] "tschechische republik"                       
## [19] "korea, demokratische volksrepublik"          
## [20] "demokratische republik kongo"                
## [21] "dominikanische republik"                     
## [22] "äquatorialguinea"                            
## [23] "falklandinseln (malvinas)"                   
## [24] "finnland"                                    
## [25] "französisch-guayana"                         
## [26] "französisch polynesien"                      
## [27] "französische südgebiete"                     
## [28] "deutsche demokratische republik"             
## [29] "guatemala"                                   
## [30] "heard und mcdonaldinseln"                    
## [31] "kurfürstentum hessen"                        
## [32] "großherzogtum hessen"                        
## [33] "vatikanstaat"                                
## [34] "indonesien"                                  
## [35] "jamaika"                                     
## [36] "kasachstan"                                  
## [37] "kiribati"                                    
## [38] "kosovo"                                      
## [39] "kirgisistan"                                 
## [40] "demokratische volksrepublik laos"            
## [41] "liechtenstein"                               
## [42] "madagaskar"                                  
## [43] "malaysia"                                    
## [44] "marshallinseln"                              
## [45] "mecklenburg-schwerin"                        
## [46] "niederlande"                                 
## [47] "niederländische antillen"                    
## [48] "neu-kaledonien"                              
## [49] "neuseeland"                                  
## [50] "nicaragua"                                   
## [51] "nördliche marianen"                          
## [52] "panama"                                      
## [53] "papua-neuguinea"                             
## [54] "paraguay"                                    
## [55] "philippinen"                                 
## [56] "russische föderation"                        
## [57] "st. helena, ascension und tristan da cunha"  
## [58] "st. kitts und nevis"                         
## [59] "saint martin (französischer teil)"           
## [60] "saint-pierre und miquelon"                   
## [61] "st. vincent und die grenadinen"              
## [62] "saudi arabien"                               
## [63] "seychellen"                                  
## [64] "sierra leone"                                
## [65] "sint maarten (niederländischer teil)"        
## [66] "salomon-inseln"                              
## [67] "südgeorgien und die südlichen sandwichinseln"
## [68] "svalbard und jan mayen"                      
## [69] "syrische arabische republik"                 
## [70] "trinidad und tobago"                         
## [71] "turks- und caicosinseln"                     
## [72] "königreich beider sizilien"                  
## [73] "vereinigte arabische emirate"                
## [74] "großbritannien"                              
## [75] "tansania"                                    
## [76] "united states minor outlying islands"        
## [77] "vereinigte staaten"                          
## [78] "uruguay"                                     
## [79] "venezuela"                                   
## [80] "britische jungferninseln"                    
## [81] "virgin islands, u.s."                        
## [82] "westsahara"                                  
## [83] "jemenitische arabische republik"             
## [84] "demokratische volksrepublik jemen"
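The three expressions can also be checked against the examples given in the prompt. A minimal sketch, assuming a small hypothetical vector of lower-case English words ('church' and 'eleven' come from the prompt, the rest are added for contrast):

```r
#hypothetical test words
checkWords <- c( 'church', 'eleven', 'area', 'banana', 'lime' )
#start and end with the same character
sameEnds <- grepl( '^([a-z]).*\\1$', checkWords )
#contain a repeated pair of letters
repeatedPair <- grepl( '([a-z][a-z]).*\\1', checkWords )
#contain one letter in at least three places
threeTimes <- grepl( '([a-z]).*\\1.*\\1', checkWords )
#one row per word, one column per test
data.frame( word = checkWords, sameEnds, repeatedPair, threeTimes )
```

The class [A-Za-z] covers only ASCII letters, so in the country list a letter such as ‘ö’ or ‘ü’ can never be the captured character; a name like "österreich-ungarn" matches only through its unaccented letters.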

Bonnie Cooper

2/16/2020
