Lab-3

Author

Gabriel Castellanos

Published

February 12, 2023

Introduction

The purpose of this lab is to practice with R packages that allow for string manipulation. The main dataset comes from the American Community Survey 2010-2012 Public Use Microdata Series. More info about the universities dataset and other datasets can be found here: About Dataset

Part 1: Counting the Appearance of Select Terms

The first part of the lab searches this dataset for majors whose names contain “DATA” or “STATISTICS”. This can be accomplished in many ways. The first approach applies grepl to each term with sapply and then fetches the rows that indicate a match. An alternative approach uses stringr to run two separate queries and combine them into one result.

install.packages("stringr")
Installing stringr [1.5.0] ...
    OK [linked cache]
library(stringr)
college.majors <- read.csv("https://raw.githubusercontent.com/gc521/DATA-607-Data-Acquisition-Mangement/Lab-3/majors-list.csv")

#Method 1
major <- college.majors$Major
s <- c("DATA", "STATISTICS")

sapply(X = s, FUN = grepl, major)
        DATA STATISTICS
  [1,] FALSE      FALSE
  [2,] FALSE      FALSE
  [3,] FALSE      FALSE
  [4,] FALSE      FALSE
  [5,] FALSE      FALSE
  [6,] FALSE      FALSE
  [7,] FALSE      FALSE
  [8,] FALSE      FALSE
  [9,] FALSE      FALSE
 [10,] FALSE      FALSE
 [11,] FALSE      FALSE
 [12,] FALSE      FALSE
 [13,] FALSE      FALSE
 [14,] FALSE      FALSE
 [15,] FALSE      FALSE
 [16,] FALSE      FALSE
 [17,] FALSE      FALSE
 [18,] FALSE      FALSE
 [19,] FALSE      FALSE
 [20,] FALSE      FALSE
 [21,] FALSE      FALSE
 [22,] FALSE      FALSE
 [23,] FALSE      FALSE
 [24,] FALSE      FALSE
 [25,] FALSE      FALSE
 [26,] FALSE      FALSE
 [27,] FALSE      FALSE
 [28,] FALSE      FALSE
 [29,] FALSE      FALSE
 [30,] FALSE      FALSE
 [31,] FALSE      FALSE
 [32,] FALSE      FALSE
 [33,] FALSE      FALSE
 [34,] FALSE      FALSE
 [35,] FALSE      FALSE
 [36,] FALSE      FALSE
 [37,] FALSE      FALSE
 [38,] FALSE      FALSE
 [39,] FALSE      FALSE
 [40,] FALSE      FALSE
 [41,] FALSE      FALSE
 [42,] FALSE      FALSE
 [43,] FALSE      FALSE
 [44,] FALSE       TRUE
 [45,] FALSE      FALSE
 [46,] FALSE      FALSE
 [47,] FALSE      FALSE
 [48,] FALSE      FALSE
 [49,] FALSE      FALSE
 [50,] FALSE      FALSE
 [51,] FALSE      FALSE
 [52,]  TRUE      FALSE
 [53,] FALSE      FALSE
 [54,] FALSE      FALSE
 [55,] FALSE      FALSE
 [56,] FALSE      FALSE
 [57,] FALSE      FALSE
 [58,] FALSE      FALSE
 [59,] FALSE       TRUE
 [60,] FALSE      FALSE
 [61,] FALSE      FALSE
 [62,] FALSE      FALSE
 [63,] FALSE      FALSE
 [64,] FALSE      FALSE
 [65,] FALSE      FALSE
 [66,] FALSE      FALSE
 [67,] FALSE      FALSE
 [68,] FALSE      FALSE
 [69,] FALSE      FALSE
 [70,] FALSE      FALSE
 [71,] FALSE      FALSE
 [72,] FALSE      FALSE
 [73,] FALSE      FALSE
 [74,] FALSE      FALSE
 [75,] FALSE      FALSE
 [76,] FALSE      FALSE
 [77,] FALSE      FALSE
 [78,] FALSE      FALSE
 [79,] FALSE      FALSE
 [80,] FALSE      FALSE
 [81,] FALSE      FALSE
 [82,] FALSE      FALSE
 [83,] FALSE      FALSE
 [84,] FALSE      FALSE
 [85,] FALSE      FALSE
 [86,] FALSE      FALSE
 [87,] FALSE      FALSE
 [88,] FALSE      FALSE
 [89,] FALSE      FALSE
 [90,] FALSE      FALSE
 [91,] FALSE      FALSE
 [92,] FALSE      FALSE
 [93,] FALSE      FALSE
 [94,] FALSE      FALSE
 [95,] FALSE      FALSE
 [96,] FALSE      FALSE
 [97,] FALSE      FALSE
 [98,] FALSE      FALSE
 [99,] FALSE      FALSE
[100,] FALSE      FALSE
[101,] FALSE      FALSE
[102,] FALSE      FALSE
[103,] FALSE      FALSE
[104,] FALSE      FALSE
[105,] FALSE      FALSE
[106,] FALSE      FALSE
[107,] FALSE      FALSE
[108,] FALSE      FALSE
[109,] FALSE      FALSE
[110,] FALSE      FALSE
[111,] FALSE      FALSE
[112,] FALSE      FALSE
[113,] FALSE      FALSE
[114,] FALSE      FALSE
[115,] FALSE      FALSE
[116,] FALSE      FALSE
[117,] FALSE      FALSE
[118,] FALSE      FALSE
[119,] FALSE      FALSE
[120,] FALSE      FALSE
[121,] FALSE      FALSE
[122,] FALSE      FALSE
[123,] FALSE      FALSE
[124,] FALSE      FALSE
[125,] FALSE      FALSE
[126,] FALSE      FALSE
[127,] FALSE      FALSE
[128,] FALSE      FALSE
[129,] FALSE      FALSE
[130,] FALSE      FALSE
[131,] FALSE      FALSE
[132,] FALSE      FALSE
[133,] FALSE      FALSE
[134,] FALSE      FALSE
[135,] FALSE      FALSE
[136,] FALSE      FALSE
[137,] FALSE      FALSE
[138,] FALSE      FALSE
[139,] FALSE      FALSE
[140,] FALSE      FALSE
[141,] FALSE      FALSE
[142,] FALSE      FALSE
[143,] FALSE      FALSE
[144,] FALSE      FALSE
[145,] FALSE      FALSE
[146,] FALSE      FALSE
[147,] FALSE      FALSE
[148,] FALSE      FALSE
[149,] FALSE      FALSE
[150,] FALSE      FALSE
[151,] FALSE      FALSE
[152,] FALSE      FALSE
[153,] FALSE      FALSE
[154,] FALSE      FALSE
[155,] FALSE      FALSE
[156,] FALSE      FALSE
[157,] FALSE      FALSE
[158,] FALSE      FALSE
[159,] FALSE      FALSE
[160,] FALSE      FALSE
[161,] FALSE      FALSE
[162,] FALSE      FALSE
[163,] FALSE      FALSE
[164,] FALSE      FALSE
[165,] FALSE      FALSE
[166,] FALSE      FALSE
[167,] FALSE      FALSE
[168,] FALSE      FALSE
[169,] FALSE      FALSE
[170,] FALSE      FALSE
[171,] FALSE      FALSE
[172,] FALSE      FALSE
[173,] FALSE      FALSE
[174,] FALSE      FALSE
major[c(44, 52, 59)]
[1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
[2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
[3] "STATISTICS AND DECISION SCIENCE"              
#Method 2
my.major.data <- str_subset(major, pattern = "DATA") 

my.major.stats <- str_subset(major, pattern = "STATISTICS")

# c() appends the two result vectors; paste() would recycle the shorter one
# and glue unrelated majors together into single strings
c(my.major.data, my.major.stats)
[1] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
[2] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
[3] "STATISTICS AND DECISION SCIENCE"              
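
Reading row numbers off the full logical matrix by eye is error-prone. As a minimal sketch, assuming a small hypothetical stand-in for `college.majors$Major` (the four names below are illustrative, not the full dataset), the matching rows can be located programmatically with `which()`, or both terms can be searched at once with a single alternation pattern. Base R is used here so the snippet runs without any packages.

```r
# Hypothetical stand-in for college.majors$Major (illustrative values only)
major <- c("ACCOUNTING",
           "COMPUTER PROGRAMMING AND DATA PROCESSING",
           "STATISTICS AND DECISION SCIENCE",
           "BIOLOGY")
terms <- c("DATA", "STATISTICS")

# One logical column per search term, one row per major
hits <- sapply(terms, grepl, x = major, fixed = TRUE)

# Rows where at least one term matched
rows <- which(rowSums(hits) > 0)
from_matrix <- major[rows]

# Same result from a single regular expression using alternation
from_regex <- major[grepl("DATA|STATISTICS", major)]

print(from_matrix)
```

Both routes keep the majors in their original row order, which is not the case when the two queries are run separately and appended.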

Part 2: Gluing Strings

The next part calls for converting raw printed output into a character vector in R. Only one of many potential solutions is shown.

fruits_raw <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"

[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"  

[9] "elderberry"   "lime"         "lychee"       "mulberry"    

[13] "olive"        "salal berry"'
fruits_raw
[1] "[1] \"bell pepper\"  \"bilberry\"     \"blackberry\"   \"blood orange\"\n\n[5] \"blueberry\"    \"cantaloupe\"   \"chili pepper\" \"cloudberry\"  \n\n[9] \"elderberry\"   \"lime\"         \"lychee\"       \"mulberry\"    \n\n[13] \"olive\"        \"salal berry\""
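# Extract each quoted name: lowercase letters, any one character, more lowercase letters
fruits_raw <- unlist(str_extract_all(fruits_raw, pattern = "\"([a-z]+.[a-z]+)\""))
# Strip the surrounding double quotes (str_remove_all(fruits_raw, '"') would also work)
new_str <- gsub('"','',fruits_raw)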
new_str
 [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
 [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
[11] "lychee"       "mulberry"     "olive"        "salal berry" 
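
As one of the other potential solutions, here is a minimal base R sketch. It assumes a shortened, hypothetical version of the raw text (three fruits only) and a looser pattern, `"[^"]+"`, which takes everything between a pair of double quotes rather than assuming lowercase letters. The last step glues the cleaned names back into a single string shaped like a `c(...)` call; the variable names are illustrative.

```r
# Shortened, hypothetical version of the raw printed output
raw <- '[1] "bell pepper"  "bilberry"\n\n[3] "lime"'

# Take every run of characters enclosed in double quotes
quoted <- regmatches(raw, gregexpr('"[^"]+"', raw))[[1]]

# Drop the quote characters to get a plain character vector
fruits <- gsub('"', '', quoted, fixed = TRUE)

# Glue the elements into one string that looks like an R vector literal
glued <- paste0('c(', paste0('"', fruits, '"', collapse = ", "), ')')

print(fruits)
cat(glued, "\n")
```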

Part 3: Explanation of Common String Expressions.

  1. (.)\1\1 - The (.) captures any single character. The two \1 terms that follow are back references: each refers to whatever character the first group captured. Taken as a whole, the expression matches the same character three times in a row, e.g. “aaa”.

  2. “(.)(.)\2\1” - Very similar to the previous example, except that a second capturing group (.) captures a second character. The \2 references the second group while the \1 references the first. Altogether, the expression matches any two characters followed by the same two characters in reverse order, e.g. “abba”.

  3. (..)\1 - Again similar to the previous expression, but here \1 refers to a captured pair of characters. Altogether, the expression matches any two characters followed immediately by the same two characters, e.g. “abab”.

  4. “(.).\1.\1” - The two bare periods match any character but are not captured. Taken together, the expression matches a captured character, then any character, then the captured character, then any character, and finally the captured character again. An example is “abaca”; the in-between characters need not differ from the captured one, so “aaaaa” also matches.

  5. “(.)(.)(.).*\3\2\1” - The new term here is the period followed by the star, which matches zero or more of any character. The expression therefore matches three captured characters, then any run of characters, then the third captured character, followed by the second, followed by the first, e.g. “abc1cba” or “afddfa”.
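
The five expressions above can be checked against their example strings with a short base R sketch. Note that inside an R string each backslash must be doubled, so the regex \1 is written "\\1". The pattern names below are illustrative labels, and `perl = TRUE` is simply one regex engine choice that supports back references.

```r
# The five patterns, written as R strings (backslashes doubled)
patterns <- c(
  same_char_x3   = "(.)\\1\\1",
  mirrored_pair  = "(.)(.)\\2\\1",
  repeated_pair  = "(..)\\1",
  every_other    = "(.).\\1.\\1",
  reversed_three = "(.)(.)(.).*\\3\\2\\1"
)

# One example string per pattern, in the same order
examples <- c("aaa", "abba", "abab", "abaca", "abc1cba")

# Test each pattern against its own example
matched <- mapply(function(p, s) grepl(p, s, perl = TRUE), patterns, examples)

# A near miss: only two identical characters in a row
rejected <- grepl(patterns[["same_char_x3"]], "aab", perl = TRUE)

print(matched)
print(rejected)
```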

Part 4: Miscellaneous String Exercises

The following exercises ask for regular expressions that match the given patterns.

  1. Start and end with the same character. Note that the character vector ‘words’ used here is loaded with stringr and originates from the rcorpora package. More info can be found in this link: About_corpora
str_subset(words, "^(.)((.*\\1$)|\\1?$)")
 [1] "a"          "america"    "area"       "dad"        "dead"      
 [6] "depend"     "educate"    "else"       "encourage"  "engine"    
[11] "europe"     "evidence"   "example"    "excuse"     "exercise"  
[16] "expense"    "experience" "eye"        "health"     "high"      
[21] "knock"      "level"      "local"      "nation"     "non"       
[26] "rather"     "refer"      "remember"   "serious"    "stairs"    
[31] "test"       "tonight"    "transport"  "treat"      "trust"     
[36] "window"     "yesterday" 
  2. Contain a repeated pair of letters within the same word. Letters are restricted to the ASCII letters, i.e. A-Z and a-z.
str_subset(words, "([A-Za-z][A-Za-z]).*\\1")
 [1] "appropriate" "church"      "condition"   "decide"      "environment"
 [6] "london"      "paragraph"   "particular"  "photograph"  "prepare"    
[11] "pressure"    "remember"    "represent"   "require"     "sense"      
[16] "therefore"   "understand"  "whether"    
  3. Contain any one letter at least three times within the word. The pattern does not cap the count, so words with four occurrences of a letter (such as “experience”) also match.
str_subset(words, "([a-z]).*\\1.*\\1")
 [1] "appropriate" "available"   "believe"     "between"     "business"   
 [6] "degree"      "difference"  "discuss"     "eleven"      "environment"
[11] "evidence"    "exercise"    "expense"     "experience"  "individual" 
[16] "paragraph"   "receive"     "remember"    "represent"   "telephone"  
[21] "therefore"   "tomorrow"
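
The three patterns can be sanity-checked on a handful of words whose expected behaviour is easy to verify by hand. This is a minimal base R sketch: `sample_words` and the helper `pick()` are hypothetical names introduced for illustration, and `grepl()` stands in for `str_subset()` so that no package is required.

```r
# Small hand-picked sample (illustrative, not the full `words` vector)
sample_words <- c("a", "area", "eleven", "church", "banana", "cat", "experience")

# Helper: return the sample words that match a pattern
pick <- function(pattern) {
  sample_words[grepl(pattern, sample_words, perl = TRUE)]
}

same_ends     <- pick("^(.)((.*\\1$)|\\1?$)")       # starts and ends alike
repeated_pair <- pick("([A-Za-z][A-Za-z]).*\\1")    # a two-letter pair occurs twice
three_or_more <- pick("([a-z]).*\\1.*\\1")          # one letter at least three times

# Count how often "e" occurs in "experience"
e_count <- lengths(regmatches("experience", gregexpr("e", "experience")))

print(same_ends)
print(repeated_pair)
print(three_or_more)
print(e_count)
```

The last check shows why the third pattern means "at least three": "experience" contains four occurrences of "e" and still matches.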