R Markdown

This is a week 3 assignment, working with strings. The assignment has four different sections each will be presented below separately.

  1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

  2. Write code that transforms the data from an unorganized format to an organized structure extracting meaningful information

  3. Describe, in words, what these expressions will match.

  4. Construct regular expressions to match words.

Code and assignment

In the first section of the code like, I ensure all the relevant packages are installed and libraries are loaded.

required_packages <- c("RSQLite","devtools","tidyverse","DBI","dplyr","odbc","openintro","ggplot2","psych","reshape2","knitr","markdown","shiny","R.rsp","fivethirtyeight","RCurl", "stringr","readr","glue") # Specify packages

not_installed <- required_packages[!(required_packages %in% installed.packages()[ , "Package"])]# Extract not installed packages
if(length(not_installed)==0){
  print("All required packages are installed")
} else {
  print(paste(length(not_installed), "package(s) had to be installed.")) # print the list of packages that need to be isstall
  install.packages(not_installed)
}
## [1] "All required packages are installed"

#1 section of the 3rd week homework

The goal is to download the 173 majors and list those that contain “DATA” or “STATISTICS”. Like other code, the first goal is to load the data to R, since it is URL code, we will be using the relevant line of code has been used before.

#LaodData
library('RCurl')
#for downlaoding the CSV we need to use raw code, otherwise, we will not be able to doan the file correctly
URL <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
URL_handle <- RCurl::getURL(URL)
Major_data<-data.frame(read.csv(text=URL_handle, header=TRUE,sep=","))

print("This is the size of the dataframen and let's take a look at its contents")
## [1] "This is the size of the dataframen and let's take a look at its contents"
#dim(Major_data)
pillar::glimpse(Major_data)
## Rows: 174
## Columns: 3
## $ FOD1P          <chr> "1100", "1101", "1102", "1103", "1104", "1105", "1106",…
## $ Major          <chr> "GENERAL AGRICULTURE", "AGRICULTURE PRODUCTION AND MANA…
## $ Major_Category <chr> "Agriculture & Natural Resources", "Agriculture & Natur…

After loading the data, we use code of strings like those that have been shared to filter for “DATA” and “STATISTICS” as well as the ones that contain both.

#using the basic function of R
#search for data
Contains_DATA_B <- grep(pattern = 'DATA', Major_data$Major, value = TRUE, ignore.case = TRUE)
#search for Science
Contains_STATISTICS_B <-grep(pattern = 'STATISTICS', Major_data$Major, value = TRUE, ignore.case = TRUE)

#search for those with either DATA or STATISTICS
print("Here is the list of the majors that have either DATA or STATISTICS")
## [1] "Here is the list of the majors that have either DATA or STATISTICS"
Contains_DATA_STATISTICS_B<- grep(pattern='DATA|STATISTICS', Major_data$Major,value = TRUE, ignore.case = TRUE)
print("Result of the first way")
## [1] "Result of the first way"
print(Contains_DATA_STATISTICS_B)
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"
#In addition to grep I also will use dplyr to do the same 

#using more adacned function of dplyr
Contains_DATA<- dplyr::filter(Major_data, grepl('DATA',Major))
Contains_STATISTICS<- dplyr::filter(Major_data, grepl('STATISTICS',Major))
Contains_DATA_STATISTICS<- dplyr::filter(Major_data, grepl('STATISTICS|DATA',Major))
#Contains_DATA_STATISTICS<- dplyr::filter(Major_data, grepl('STATISTICS&DATA',Major))

print("Result of the 2nd way using dplyr")
## [1] "Result of the 2nd way using dplyr"
print(Contains_DATA_STATISTICS)
##   FOD1P                                         Major          Major_Category
## 1  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 2  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics

#2 Text readout and organization

the goal of this section is to write a code that transforms the data from an unorganized format to an organized structure extracting meaningful information. The example to be worked on it the following:

Data to be transformed:

[1] “bell pepper”  “bilberry”     “blackberry”   “blood orange”

[5] “blueberry”    “cantaloupe”   “chili pepper” “cloudberry”  

[9] “elderberry”   “lime”         “lychee”       “mulberry”    

[13] “olive”        “salal berry”

Into a format like this:

c(“bell pepper”, “bilberry”, “blackberry”, “blood orange”, “blueberry”, “cantaloupe”, “chili pepper”, “cloudberry”, “elderberry”, “lime”, “lychee”, “mulberry”, “olive”, “salal berry”)

#To make this excersise realsitic I entered all those into a text file and them will
#read them into RStudio usin readLines 

con <- file("Data/raw_data.txt","r", blocking = FALSE) #Define local connection to file in Data folder

lines <- readLines(con, n= -1L, encoding="unknown", warn = TRUE, skipNul = FALSE) #read all lines 
singleString <- paste(lines, collapse=" ")

close(con) #close con
lines
## [1] "[1] \"bell pepper\"  \"bilberry\"     \"blackberry\"   \"blood orange\""
## [2] "[5] \"blueberry\"    \"cantaloupe\"   \"chili pepper\" \"cloudberry\"  "
## [3] "[9] \"elderberry\"   \"lime\"         \"lychee\"       \"mulberry\"    "
## [4] "[13] \"olive\"        \"salal berry\""                                  
## [5] ""                                                                       
## [6] ""
#second way use readr cuntion from readr library 
library(readr)
mystring <- read_file("Data/raw_data.txt")
mystring
## [1] "[1] \"bell pepper\"  \"bilberry\"     \"blackberry\"   \"blood orange\"\r\n[5] \"blueberry\"    \"cantaloupe\"   \"chili pepper\" \"cloudberry\"  \r\n[9] \"elderberry\"   \"lime\"         \"lychee\"       \"mulberry\"    \r\n[13] \"olive\"        \"salal berry\"\r\n\r\n\r\n"

Manipulate the uploaded text

In section of the code, I will be using regular expression to manipulate the data that are uplaoded.

First solution

# Extract words between quotations using regular expressions
words <- unlist(regmatches(lines, gregexpr('"([^"]*)"', lines)))
#print the results all contents were seperated by only words between qoutations 
print("Initial extracted words")
## [1] "Initial extracted words"
print(words)
##  [1] "\"bell pepper\""  "\"bilberry\""     "\"blackberry\""   "\"blood orange\""
##  [5] "\"blueberry\""    "\"cantaloupe\""   "\"chili pepper\"" "\"cloudberry\""  
##  [9] "\"elderberry\""   "\"lime\""         "\"lychee\""       "\"mulberry\""    
## [13] "\"olive\""        "\"salal berry\""
#substitute the two qoutation 
#'"([^"]+)"' matches anything between double quotes and excludes the quotes themselves. The \\1 in the gsub function replaces the matched pattern with just the content within the quotes.

words <- gsub('"([^"]+)"', '\\1', words)

print("Words after spaces and extra qoutatuons are removed")
## [1] "Words after spaces and extra qoutatuons are removed"
#Seperate all words 
print(words)
##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
# Remove empty elements
words <- words[words != ""]
#Print the modified words
print("Words after all empty cells are removed")
## [1] "Words after all empty cells are removed"
print(words)
##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"
#Prepared the text as ask to be printed out 
#I will used glue from glue library to do the final print

print("Here is the final results")
## [1] "Here is the final results"
glue::glue(
  'c("{combined}")',
  combined=glue::glue_collapse(
    words,
    sep='" ,"',
    last=""
  )
)|>print()
## c("bell pepper" ,"bilberry" ,"blackberry" ,"blood orange" ,"blueberry" ,"cantaloupe" ,"chili pepper" ,"cloudberry" ,"elderberry" ,"lime" ,"lychee" ,"mulberry" ,"olive" ,"salal berry")

#3 Describe, in words, what these expressions will match

Described the expression in words
Expression Explanation in words
(.)\1\1 matches any sequence of three consecutive characters where the first and second characters are the same
“(.)(.)\\2\\1” matches any sequence of four characters where the 1st and 4th are the same, and the 2nd and 3rd are the same but can be different or the same as the 1st and 4th.
(..)\1 matches any sequence of four characters where the first two characters are identical to the last two characters.
“(.).\\1.\\1” matches any sequence of five characters where, and the 1st and 3rd are the same, the 2nd is any character, the 5th is the same as the first.
“(.)(.)(.).*\\3\\2\\1” matches a sequence of characters that starts with any three characters, followed by any or not character, and finally ends with the same characters but reversely meaning from the third first characters toward the first first character.

#4:Construct regular expressions to match words.

Construct regular expressions
Explanation Regular expressions
Start and end with the same character. ^(.).*\1$
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.) .*(.{2}).*\1.*
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.) .*(.).*\1.*\1.*