As we work with larger and more complicated data sets, it gets difficult to decipher information when the data isn’t represented in a readable way. We will use several tools to help manipulate text, filter data frames, and recognize patterns within data for the purpose of “tidying” up our data sets.
# read in the data
majors <- read.csv(url("https://raw.githubusercontent.com/JAdames27/DATA-607---Data-Acquisition-and-Management/main/DATA%20607%20-%20Assignment%203/majors-list.csv"))
head(majors)
##   FOD1P                                 Major                  Major_Category
## 1  1100                   GENERAL AGRICULTURE Agriculture & Natural Resources
## 2  1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3  1102                AGRICULTURAL ECONOMICS Agriculture & Natural Resources
## 4  1103                       ANIMAL SCIENCES Agriculture & Natural Resources
## 5  1104                          FOOD SCIENCE Agriculture & Natural Resources
## 6  1105            PLANT SCIENCE AND AGRONOMY Agriculture & Natural Resources
Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.
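library(dplyr)  # provides %>%, filter_all() and any_vars()
# keep the rows in which any column contains "DATA" or "STATISTICS"
dataStats <- majors %>%
  filter_all(any_vars(grepl("DATA", .) | grepl("STATISTICS", .)))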
dataStats
##   FOD1P                                         Major          Major_Category
## 1  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 2  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics
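The same filter can be sketched in base R with a single alternation pattern. This version assumes that only the Major column needs to be searched, and it uses a small hypothetical data frame (majors_demo) in place of the downloaded file so that it runs offline.

```r
# Hypothetical stand-in for a few rows of majors-list.csv (assumption: the
# real file has the same three columns).
majors_demo <- data.frame(
  FOD1P = c("1100", "6212", "2101", "3702"),
  Major = c("GENERAL AGRICULTURE",
            "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE"),
  Major_Category = c("Agriculture & Natural Resources", "Business",
                     "Computers & Mathematics", "Computers & Mathematics"),
  stringsAsFactors = FALSE
)

# "DATA|STATISTICS" matches either word, so one grepl() call is enough.
data_stats_demo <- majors_demo[grepl("DATA|STATISTICS", majors_demo$Major), ]
data_stats_demo$Major
```

Restricting the search to one column avoids accidental hits in the code or category columns, at the cost of missing a match if the term only appeared there.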
Write code that transforms the data below:
[1] “bell pepper” “bilberry” “blackberry” “blood orange”
[5] “blueberry” “cantaloupe” “chili pepper” “cloudberry”
[9] “elderberry” “lime” “lychee” “mulberry”
[13] “olive” “salal berry”
Into a format like this:
c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)
First, I declared “myList”, a character vector that prints as the data shown at the start of the problem.
myList <- c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
myList
## [1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
## [6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
## [11] "lychee" "mulberry" "olive" "salal berry"
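I then used the paste0() function to add a leading quotation mark to each element and to collapse “myList” into a single string, with the elements separated by the three characters ", (a closing quotation mark, a comma, and a space). Since “myList” is already a character vector, the unlist() call has no effect here. The last element is still missing its closing quotation mark at this point.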
myString1 <- paste0('"',unlist(myList), collapse='", ')
myString1
## [1] "\"bell pepper\", \"bilberry\", \"blackberry\", \"blood orange\", \"blueberry\", \"cantaloupe\", \"chili pepper\", \"cloudberry\", \"elderberry\", \"lime\", \"lychee\", \"mulberry\", \"olive\", \"salal berry"
I still needed to wrap the string in the “c()” text and close the quotation mark on the last element.
I prepended c( to the front of the string and appended ") to the end.
myString2 <- paste0('c(', myString1, '")')
myString2
## [1] "c(\"bell pepper\", \"bilberry\", \"blackberry\", \"blood orange\", \"blueberry\", \"cantaloupe\", \"chili pepper\", \"cloudberry\", \"elderberry\", \"lime\", \"lychee\", \"mulberry\", \"olive\", \"salal berry\")"
The backslashes in the printed value of “myString2” are not part of the string: print() shows each embedded quotation mark in its escaped form, \". I used the cat() function to display the string's actual characters.
cat(myString2)
## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
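A small check can make the difference between print() and cat() concrete. The sketch below uses a short hypothetical string, s, rather than the full fruit string.

```r
s <- "\"lime\""      # \" is how a quotation mark is typed inside a string

nchar(s)             # 6: four letters plus two quotation marks
has_backslash <- grepl("\\", s, fixed = TRUE)
has_backslash        # FALSE: the string itself contains no backslash

print(s)             # escaped display form, with \" for each quotation mark
cat(s, "\n")         # the raw characters of the string
```

The backslash is only part of how R displays and parses string constants, so there is nothing to strip out of the value.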
I used the identical() function to confirm that the value of “myString2” equals the expected text (i.e. c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)).
identical((myString2),'c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")')
## [1] TRUE
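The two paste0() steps can also be folded into one helper. The following is a minimal sketch; the function name as_c_call and the shortened vector fruits are illustrative and not part of the original solution.

```r
fruits <- c("bell pepper", "bilberry", "lime")   # shortened example vector

# Quote every element on both sides, join with ", ", then wrap in c( ).
as_c_call <- function(x) {
  paste0("c(", paste0('"', x, '"', collapse = ", "), ")")
}

result <- as_c_call(fruits)
cat(result, "\n")

# Parsing and evaluating the text gives back the original vector.
round_trip <- eval(parse(text = result))
identical(round_trip, fruits)
```

Quoting both sides of each element before collapsing avoids the separate fix-up for the last element.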
Describe, in words, what these expressions will match:
(.)\1\1 “(.)(.)\2\1” (..)\1 “(.).\1.\1” “(.)(.)(.).*\3\2\1”
In R, the expression
(.)\1\1
needs to be expressed as
"(.)\\1\\1"
since R requires an escape for the backslash. In the latter expression, (.) matches any character in a given string and each instance of ‘\1’ matches that same character again. So this will match any character that appears three times in a row within the string, such as “aaa”.
However, if taken explicitly as
"(.)\1\1"
there is no matching in ordinary text, because R reads \1 inside a string as an octal escape for a control character rather than as a backreference.
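The behaviour of (.)\1\1 can be checked with grepl(). In this sketch the test words are made up, and perl = TRUE is an added assumption that selects the PCRE engine; the rest of this document uses R's default engine.

```r
words <- c("aaa", "xbbbx", "aab", "abab")

# "(.)\\1\\1" is the regex (.)\1\1: one character followed by two more copies.
triple <- grepl("(.)\\1\\1", words, perl = TRUE)
triple

# Without the doubled backslash, R reads \1 as an octal escape (code point 1).
code_point <- utf8ToInt("\1")
code_point
```

Only “aaa” and “xbbbx” contain a character three times in a row, so only those two elements are TRUE.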
The expression
"(.)(.)\2\1"
taken literally as an R string, matches nothing in ordinary text, because R reads \2 and \1 inside a string as octal escapes for control characters. Read as the regular expression
(.)(.)\2\1
it matches two consecutive characters, i.e. (.)(.), immediately followed by the same two characters in reverse order, i.e. \2\1. So this will match any run of four characters that is “palindromic” within the given text, such as “abba”.
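A quick check of (.)(.)\2\1 on a few made-up words, again with perl = TRUE as an added assumption:

```r
words <- c("abba", "noon", "abab", "aaaa")

# Two characters followed by the same two characters in reverse order.
mirrored <- grepl("(.)(.)\\2\\1", words, perl = TRUE)
words[mirrored]
```

“aaaa” qualifies as well, since nothing requires the two captured characters to differ.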
The expression
(..)\1
typed directly into an R string matches nothing in ordinary text. However, with the backslash escape, the expression
"(..)\\1"
matches two consecutive characters that are immediately followed by the same two characters within the given text, such as “abab”.
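The pattern (..)\1 can be tried the same way. The words are illustrative and perl = TRUE is an added assumption.

```r
words <- c("abab", "banana", "abba", "abcab")

# A pair of characters immediately followed by the same pair.
doubled_pair <- grepl("(..)\\1", words, perl = TRUE)
words[doubled_pair]
```

“banana” matches through “anan”, while “abcab” does not because the two copies of “ab” are not adjacent.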
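The expression
"(.).\1.\1"
taken literally as an R string, matches nothing in ordinary text. Read as the regular expression
(.).\1.\1
it matches five characters in which the first, third, and fifth are the same (e.g. “abaca”), as follows: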
(.) is grouped in parentheses as group 1 and matches to any character.
. is un-grouped and matches to any character.
\\1 refers to the character in group 1.
. is un-grouped and matches to any character.
\\1 refers to the character in group 1 again.
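The breakdown above for (.).\1.\1 can be checked on a few made-up strings (perl = TRUE is an added assumption):

```r
words <- c("abaca", "banana", "abcde", "aaa")

# Same character in positions 1, 3 and 5 of a five-character window.
alternating <- grepl("(.).\\1.\\1", words, perl = TRUE)
words[alternating]
```

“aaa” fails because the pattern needs five characters, and “banana” matches through “anana”.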
“(.)(.)(.).*\3\2\1”
(.) is grouped in parentheses as group 1 and matches to any character.
The following (.) is grouped in parentheses as group 2 and matches to any character.
The last (.) is grouped in parentheses as group 3 and matches to any character.
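The .* matches zero or more characters after the three captured by (.)(.)(.).
The \3\2\1 refers back to the three groups and matches their characters in reverse order. So the expression matches three characters, then anything, then the same three characters reversed, as in “abcxyzcba”.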
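As a sketch, the pattern (.)(.)(.).*\3\2\1 behaves as follows on a few made-up words (perl = TRUE is an added assumption):

```r
words <- c("abccba", "abcxyzcba", "racecar", "abcabc")

# Three characters, any filler, then the same three characters reversed.
reversed_triple <- grepl("(.)(.)(.).*\\3\\2\\1", words, perl = TRUE)
words[reversed_triple]
```

In “abccba” the .* matches nothing at all, and in “racecar” it matches the single middle letter.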
Construct regular expressions to match words that:
Start and end with the same character.
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
I create a character vector of test strings, text1, and a pattern to match against, pattern1.
The output shows that the pattern returns only the elements that start and end with the same character.
# Example string
text1 <- c('abra', '24y', 'ucv', '818259', '8sdgsdfh8', 'kodak')
# create the pattern
pattern1 <- "^(\\w)\\w*\\1$"
# Find all matches in the text
matches <- gregexpr(pattern1, text1)
# Extract the matching substrings
results <- regmatches(text1, matches)
print("Matches:")
## [1] "Matches:"
print(unlist(results))
## [1] "abra" "8sdgsdfh8" "kodak"
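Because pattern1 is anchored to the whole string, grep() with value = TRUE gives the same result in one call. The sketch below also adds a variant that accepts one-character words, which pattern1 rejects because it needs at least two characters. Treating a single character as a match is an assumption about the exercise, and perl = TRUE is an added choice of engine.

```r
text1 <- c("abra", "24y", "ucv", "818259", "8sdgsdfh8", "kodak", "a")

# Original pattern: needs at least two characters.
strict <- grep("^(\\w)\\w*\\1$", text1, value = TRUE, perl = TRUE)
strict

# Variant (assumption: a one-character word also counts as a match).
relaxed <- grep("^(\\w)(\\w*\\1)?$", text1, value = TRUE, perl = TRUE)
relaxed
```

The optional group (\\w*\\1)? lets the string end right after the first character.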
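I create a character vector of test strings, text2, and a pattern to match against, pattern2.
The output shows every word that contains a repeated pair of characters; “banana pepper” yields two matches because \w does not match the space between the words.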
text2 <- c('church', 'mississipi', 'banana pepper')
pattern2 <- "\\w*(\\w\\w)\\w*\\1\\w*"
# Find all matches in the text
matches <- gregexpr(pattern2, text2)
# Extract the matching substrings
results <- regmatches(text2, matches)
print("Matches:")
## [1] "Matches:"
print(unlist(results))
## [1] "church" "mississipi" "banana" "pepper"
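Since \w also matches digits and underscores, a stricter reading of "pair of letters" can be sketched with an explicit letter class. The extra test strings “x1y_x1” and “level”, the word splitting, and perl = TRUE are all assumptions added for illustration.

```r
text2 <- c("church", "mississipi", "banana pepper", "x1y_x1", "level")

# Split phrases into words so that each word is tested on its own.
words <- unlist(strsplit(text2, " ", fixed = TRUE))

# Letters only: "x1" repeats in "x1y_x1" but is not a pair of letters.
letters_only <- grepl("([A-Za-z]{2}).*\\1", words, perl = TRUE)
words[letters_only]

# With \w the digit-containing pair "x1" counts as well.
word_chars <- grepl("(\\w\\w)\\w*\\1", words, perl = TRUE)
words[word_chars]
```

grepl() returns one logical value per word, so no regmatches() step is needed when whole words are wanted.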
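I create a character vector of test strings, text3, and a pattern to match against, pattern3.
The output lists only the words in which some character occurs at least three times; “boars” and “whatever” are correctly excluded.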
text3 <- c('eleven', 'cincinnati', 'boars', 'salamander', 'abrakababra', 'whatever')
pattern3 <- "\\w*(\\w)\\w*\\1\\w*\\1\\w*"
# Find all matches in the text
matches <- gregexpr(pattern3, text3)
# Extract the matching substrings
results <- regmatches(text3, matches)
print("Matches:")
## [1] "Matches:"
print(unlist(results))
## [1] "eleven" "cincinnati" "salamander" "abrakababra"
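The result can be cross-checked without regular expressions by counting characters directly. The helper has_triple below is a hypothetical name introduced for this sketch, and perl = TRUE is an added assumption for the regex comparison.

```r
text3 <- c("eleven", "cincinnati", "boars", "salamander", "abrakababra", "whatever")

# TRUE when any single character occurs three or more times in the word.
has_triple <- function(word) {
  counts <- table(strsplit(word, "")[[1]])
  any(counts >= 3)
}

by_count <- vapply(text3, has_triple, logical(1), USE.NAMES = FALSE)
by_regex <- grepl("(\\w).*\\1.*\\1", text3, perl = TRUE)

identical(by_count, by_regex)
text3[by_count]
```

Agreement between the two approaches on these words is a useful sanity check, although it does not prove the pattern correct for every input.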
When working with data, it is often very useful to manipulate text in addition to the data itself in order to organize it in a readable or “tidy” way. This can come in the form of filtering data stored in a data frame or re-labeling columns. Another useful tool is using code to recognize patterns within a set of data. This can be especially helpful for extremely large data sets. Organized data is much easier to work with and obtain insights from.