Overview
Nate Silver (Nathaniel Read Silver) is an American statistician and writer who analyzes baseball and elections. His website, fivethirtyeight.com, provides a plethora of datasets to munge. Even if you are not a data person, there is a good chance that you will fall in love with the content on this site.
This week’s assignment is based on one of the articles on the site, “The Economic Guide To Picking A College Major” (fivethirtyeight.com).
Data Munging and steps involved
Step 1: Load the required libraries and raw data
Step 1 loads the required R libraries and reads the CSV from the FiveThirtyEight data repository on GitHub.
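# Evaluate every chunk; results = FALSE hides printed output unless a chunk overrides it
knitr::opts_chunk$set(eval = TRUE, results = FALSE)
library(tidyverse)  # dplyr verbs such as filter() and the %>% pipe
library(RCurl)      # getURL()
library(stringr)    # str_detect(), str_extract_all(), str_view()
library(knitr)      # kable()
# Download the raw CSV as a single text string, then parse it into a data frame
major_list_url <- getURL("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv")
major_list <- read.csv(text = major_list_url)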
Step 2: Exercises
#1
Identify the majors that contain either “DATA” or “STATISTICS”
Data_stats_list <- major_list %>%
filter(str_detect(Major, 'DATA|STATISTICS'))
kable(Data_stats_list)
| FOD1P | Major | Major_Category |
|---|---|---|
| 6212 | MANAGEMENT INFORMATION SYSTEMS AND STATISTICS | Business |
| 2101 | COMPUTER PROGRAMMING AND DATA PROCESSING | Computers & Mathematics |
| 3702 | STATISTICS AND DECISION SCIENCE | Computers & Mathematics |
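The same filter can be tried without a network connection. The sketch below is a minimal base R version, assuming a small inline sample: the three rows from the table above plus one made-up row that should not match. The column names other than `Major`, the code 9999, and the extra row are assumptions for illustration only.

```r
# Inline sample: three rows from the table above plus one hypothetical row.
# Column names other than `Major` are assumed for this sketch.
csv_text <- 'FOD1P,Major,Major_Category
6212,MANAGEMENT INFORMATION SYSTEMS AND STATISTICS,Business
2101,COMPUTER PROGRAMMING AND DATA PROCESSING,Computers & Mathematics
3702,STATISTICS AND DECISION SCIENCE,Computers & Mathematics
9999,EXAMPLE MAJOR WITH NO KEYWORD,Example'

sample_majors <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# grepl() is the base R counterpart of str_detect(); "|" means "either side".
matches <- sample_majors[grepl("DATA|STATISTICS", sample_majors$Major), ]
print(matches$Major)
```

Note that the match is case sensitive, so a major spelled "Data Science" would be missed unless the pattern or the column is normalized first.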
#2
Write code that transforms the data below:
[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"
Into a format like this:
c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
Solution: Let’s create a string variable that holds the input as is.
string <- '[1] "bell pepper" "bilberry" "blackberry" "blood orange"
[5] "blueberry" "cantaloupe" "chili pepper" "cloudberry"
[9] "elderberry" "lime" "lychee" "mulberry"
[13] "olive" "salal berry"'
string
[1] "[1] \"bell pepper\" \"bilberry\" \"blackberry\" \"blood orange\"\n[5] \"blueberry\" \"cantaloupe\" \"chili pepper\" \"cloudberry\"\n[9] \"elderberry\" \"lime\" \"lychee\" \"mulberry\"\n[13] \"olive\" \"salal berry\""
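The string contains extra characters that we need to clean out (bracketed index numbers, white space, and quotes). The pattern below keeps only runs of letters, optionally joined by a single whitespace character, so that two-word names stay together.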
string_mod <- unlist(str_extract_all(string, '[[:alpha:]]+\\s[[:alpha:]]+|[[:alpha:]]+'))
string_mod
[1] "bell pepper" "bilberry" "blackberry" "blood orange" "blueberry"
[6] "cantaloupe" "chili pepper" "cloudberry" "elderberry" "lime"
[11] "lychee" "mulberry" "olive" "salal berry"
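The extracted values are a character vector, while the exercise asks for the text of a `c(...)` call. One possible way to get that last step is sketched below in base R; the names `fruits` and `literal` are hypothetical, and `fruits` simply stands in for the extracted vector.

```r
# Stand-in for the extracted vector (same 14 values as in the output above).
fruits <- c("bell pepper", "bilberry", "blackberry", "blood orange",
            "blueberry", "cantaloupe", "chili pepper", "cloudberry",
            "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")

# Wrap each element in double quotes, join with ", ", and add the c( ) shell.
literal <- paste0("c(", paste0('"', fruits, '"', collapse = ", "), ")")
cat(literal, "\n")

# Round trip: parsing and evaluating the text gives back the same vector.
identical(eval(parse(text = literal)), fruits)
```

`dput(fruits)` prints an equivalent literal with less typing, although it wraps the output across several lines.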
#3
Describe, in words, what these expressions will match:
(.)\1\1
test <- list("777", "anna", "2002", "aaa")
str_view(test , '(.)\1\1', match = TRUE)
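Written this way, the expression matches nothing: inside an R string literal, \1 is read as an escape for the control character U+0001, not as a regex backreference. If it is replaced with (.)\\1\\1, it matches any character repeated three times in a row. In this example, that is 777 and aaa.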
test <- list("777", "anna", "2002", "aaa")
str_view(test , '(.)\\1\\1', match = TRUE)
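To see why the single-backslash version fails, it helps to look at what R actually stores for each string. The sketch below uses base R (`grepl()` with `perl = TRUE` in place of `str_view()`), and the names `single` and `double` are hypothetical.

```r
single <- '(.)\1\1'    # R converts each \1 to the control character U+0001
double <- '(.)\\1\\1'  # each \\ becomes one literal backslash, followed by 1

nchar(single)           # 5 characters: ( . ) U+0001 U+0001
nchar(double)           # 7 characters: ( . ) \ 1 \ 1
utf8ToInt(single)[4:5]  # code points of the last two characters: 1 1

test <- c("777", "anna", "2002", "aaa")
grepl(single, test, perl = TRUE)  # looks for a character followed by two U+0001
grepl(double, test, perl = TRUE)  # backreference: same character three times
```

Only the second pattern reaches the regex engine as a backreference, which is why only it finds 777 and aaa.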
"(.)(.)\\2\\1"
test <- list("777", "anna", "2002", '"elle"', "12121")
str_view(test , '"(.)(.)\\2\\1"', match = TRUE)
With the quotes treated as literal characters, the expression matches a four-character palindrome enclosed in double quotes. In this case, it returns only “elle”. If we looked for (.)(.)\\2\\1 without the quotes, we would have got anna and 2002 as well.
(..)\1
test <- list("777", "anna", "2002", '"elle"', "2020","aabb")
str_view(test , '(..)\1', match = TRUE)
It does not return anything, for the same reason as in the first case. Let’s try again with the backslash doubled (\\1).
test <- list("777", "anna", "2002", '"elle"', "2020","aabb")
str_view(test , '(..)\\1', match = TRUE)
From the test samples it returns 2020, because the expression (..)\\1 matches any two characters immediately followed by the same two characters (like 2020, and unlike aabb).
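A quick check of the pair-repeat pattern, sketched in base R with a slightly different test vector (the extra value 7777 is an addition for illustration). It shows that the two characters inside the pair are allowed to be identical.

```r
test <- c("777", "anna", "2002", "2020", "aabb", "7777")

# (..) captures any two characters; \\1 requires the same two right after.
hits <- test[grepl("(..)\\1", test, perl = TRUE)]
print(hits)
```

Here 7777 matches as the pair 77 followed by 77, while 777 is one character short.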
"(.).\\1.\\1"
test <- list("777", "anna", "2002", '"elle"', "2020",'"12121"','"ababa"')
str_view(test , '"(.).\\1.\\1"', match = TRUE)
The expression matches five-character sequences such as 12121 and ababa (here enclosed in quotes) in which the same character appears at the 1st, 3rd and 5th positions. The 2nd and 4th positions can hold any characters, so they do not have to match each other.
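The point about the 2nd and 4th positions is easy to verify. The sketch below drops the literal quotes from the pattern and uses base R; the test values abaca, abcab and aaaaa are additions for illustration.

```r
test <- c("12121", "ababa", "abaca", "abcab", "aaaaa")

# Group 1 must reappear at positions 3 and 5; each "." accepts any character.
result <- grepl("(.).\\1.\\1", test, perl = TRUE)
setNames(result, test)
```

abaca matches even though its 2nd and 4th characters differ, while abcab fails because no character sits at positions 1, 3 and 5.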
"(.)(.)(.).*\3\2\1"
test <- list("777", "anna", "2002", '"elle"', "2020",'"123321"','"abcdcba"')
str_view(test , '"(.)(.)(.).*\\3\\2\\1"', match = TRUE)
The expression matches strings (here enclosed in quotes) that start with three characters (same or different), continue with any number of characters, and end with the same three characters in reverse order. The length of the middle part doesn’t matter.
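As a rough check on the three-group pattern, here is a base R sketch without the literal quotes. The test values beyond 123321 and abcdcba are made up for illustration.

```r
test <- c("123321", "abcdcba", "abccba", "abcxyzcba", "abcabc", "abba")

# Three captured characters, any filler, then the captures in reverse order.
result <- grepl("(.)(.)(.).*\\3\\2\\1", test, perl = TRUE)
setNames(result, test)
```

abcabc fails because its ending repeats the opening in the same order rather than reversed, and abba is too short to hold three characters plus their mirror image.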
#4
Construct regular expressions to match words that:
Start and end with the same character.
^(.).*\\1$
test <- list("777", "anna", "2002", '"elle"', "20201",'"123321"','"abcdcba"')
str_view(test , '^(.).*\\1$', match = TRUE)
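One edge case worth a look: the pattern needs at least two characters, so a one-letter word never matches. The base R sketch below compares it with a relaxed variant; the `relaxed` pattern and the test values a and aa are assumptions added for illustration, not part of the exercise answer.

```r
test <- c("anna", "2002", "20201", "a", "aa")

strict  <- "^(.).*\\1$"     # first character must reappear as the last one
relaxed <- "^(.)(.*\\1)?$"  # same, but also accepts a single character

strict_hits  <- grepl(strict, test, perl = TRUE)
relaxed_hits <- grepl(relaxed, test, perl = TRUE)
data.frame(test, strict_hits, relaxed_hits)
```

In the test list above, the entries that carry literal double quotes match as well, because they start and end with the quote character itself.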
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice).
([A-Za-z][A-Za-z]).*\\1
test <- list("777", "anna", "2002", '"elle"', "20201",'khokho','"church"',"winwin")
str_view(test , '([A-Za-z][A-Za-z]).*\\1', match = TRUE)
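A small base R check of the repeated-pair pattern follows. The value banana is added for illustration; the rest come from the examples in this document.

```r
test <- c("church", "khokho", "winwin", "2020", "anna", "banana")

# Two letters captured together, any filler, then the same two letters again.
result <- grepl("([A-Za-z][A-Za-z]).*\\1", test, perl = TRUE)
setNames(result, test)
```

2020 is rejected because the character class only allows letters, and anna is rejected because none of its pairs (an, nn, na) occurs twice.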
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s).
([A-Za-z]).*\\1.*\\1.*
test <- list("777", "miamiwinter", "tweleve", '"ellee"', "20201",'khokho','"church"',"wisconsinite")
str_view(test , '([A-Za-z]).*\\1.*\\1.*', match = TRUE)
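The last pattern can be checked the same way. This base R sketch also compares it with a version that omits the trailing `.*`, on the assumption that only detection matters; the names `with_tail` and `without_tail` are hypothetical.

```r
test <- c("eleven", "wisconsinite", "miamiwinter", "khokho", "777", "church")

with_tail    <- "([A-Za-z]).*\\1.*\\1.*"  # pattern used above
without_tail <- "([A-Za-z]).*\\1.*\\1"    # same test, shorter match

hits_with    <- grepl(with_tail, test, perl = TRUE)
hits_without <- grepl(without_tail, test, perl = TRUE)
setNames(hits_with, test)
identical(hits_with, hits_without)
```

The trailing `.*` does not change which strings are detected. It only extends the matched span to the end of the string, which affects what `str_view()` highlights.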