Introduction:

This article delves into the economic realities that college students face when choosing a major, highlighting how a degree alone no longer guarantees financial success. It examines detailed data on earnings across different fields of study, revealing significant disparities in income potential. By analyzing trends and offering insights, the article underscores the importance of making informed choices when selecting a major, as it can dramatically influence graduates’ long-term financial outcomes and career trajectories.

The link to the article: https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

file_path<-"https://raw.githubusercontent.com/Natacode819/Character-Manipulation-and-Date-Processing/main/all-ages.csv"
majors<-read.csv(file_path)
head(majors, 10)

##    Major_code                                 Major
## 1        1100                   GENERAL AGRICULTURE
## 2        1101 AGRICULTURE PRODUCTION AND MANAGEMENT
## 3        1102                AGRICULTURAL ECONOMICS
## 4        1103                       ANIMAL SCIENCES
## 5        1104                          FOOD SCIENCE
## 6        1105            PLANT SCIENCE AND AGRONOMY
## 7        1106                          SOIL SCIENCE
## 8        1199             MISCELLANEOUS AGRICULTURE
## 9        1301                 ENVIRONMENTAL SCIENCE
## 10       1302                              FORESTRY
##                     Major_category  Total Employed
## 1  Agriculture & Natural Resources 128148    90245
## 2  Agriculture & Natural Resources  95326    76865
## 3  Agriculture & Natural Resources  33955    26321
## 4  Agriculture & Natural Resources 103549    81177
## 5  Agriculture & Natural Resources  24280    17281
## 6  Agriculture & Natural Resources  79409    63043
## 7  Agriculture & Natural Resources   6586     4926
## 8  Agriculture & Natural Resources   8549     6392
## 9           Biology & Life Science 106106    87602
## 10 Agriculture & Natural Resources  69447    48228
##    Employed_full_time_year_round Unemployed Unemployment_rate Median P25th
## 1                          74078       2423        0.02614711  50000 34000
## 2                          64240       2266        0.02863606  54000 36000
## 3                          22810        821        0.03024832  63000 40000
## 4                          64937       3619        0.04267890  46000 30000
## 5                          12722        894        0.04918845  62000 38500
## 6                          51077       2070        0.03179089  50000 35000
## 7                           4042        264        0.05086705  63000 39400
## 8                           5074        261        0.03923042  52000 35000
## 9                          65238       4736        0.05128983  52000 38000
## 10                         39613       2144        0.04256333  58000 40500
##    P75th
## 1  80000
## 2  80000
## 3  98000
## 4  72000
## 5  90000
## 6  75000
## 7  88000
## 8  75000
## 9  75000
## 10 80000

colnames(majors)

##  [1] "Major_code"                    "Major"                        
##  [3] "Major_category"                "Total"                        
##  [5] "Employed"                      "Employed_full_time_year_round"
##  [7] "Unemployed"                    "Unemployment_rate"            
##  [9] "Median"                        "P25th"                        
## [11] "P75th"

#1 Providing code that identifies the majors that contain either “DATA” or “STATISTICS”

subset_majors <- majors %>% filter(grepl("DATA|STATISTICS", Major, ignore.case = TRUE))
subset_majors

##   Major_code                                         Major
## 1       2101      COMPUTER PROGRAMMING AND DATA PROCESSING
## 2       3702               STATISTICS AND DECISION SCIENCE
## 3       6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS
##            Major_category  Total Employed Employed_full_time_year_round
## 1 Computers & Mathematics  29317    22828                         18747
## 2 Computers & Mathematics  24806    18808                         14468
## 3                Business 156673   134478                        118249
##   Unemployed Unemployment_rate Median P25th  P75th
## 1       2265        0.09026422  60000 40000  85000
## 2       1138        0.05705405  70000 43000 102000
## 3       6186        0.04397714  72000 50000 100000
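The same filter can be sketched without downloading the file. The snippet below is a minimal illustration on a small hand-typed vector of major names copied from the printed output above; the names `sample_majors`, `is_match`, and `matched_majors` are assumptions made for this example, and base R is used so it runs without dplyr.

```r
# Hypothetical sample: four major names copied from the printed output above
sample_majors <- c("GENERAL AGRICULTURE",
                   "COMPUTER PROGRAMMING AND DATA PROCESSING",
                   "STATISTICS AND DECISION SCIENCE",
                   "FORESTRY")

# "DATA|STATISTICS" is an alternation: a name matches if it contains either word
is_match <- grepl("DATA|STATISTICS", sample_majors, ignore.case = TRUE)

# Keep only the names flagged TRUE
matched_majors <- sample_majors[is_match]
print(matched_majors)
```

Because the pattern is not anchored, it matches the words anywhere in the name, which is what lets it pick up majors where "DATA" or "STATISTICS" is only part of a longer title.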

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.1     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(stringr)

#2 Write code that transforms the data below:

input_data <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"   
[9] "elderberry"   "lime"         "lychee"       "mulberry"     
[13] "olive"        "salal berry"'

input_data

## [1] "[1] \"bell pepper\"  \"bilberry\"     \"blackberry\"   \"blood orange\"\n[5] \"blueberry\"    \"cantaloupe\"   \"chili pepper\" \"cloudberry\"   \n[9] \"elderberry\"   \"lime\"         \"lychee\"       \"mulberry\"     \n[13] \"olive\"        \"salal berry\""

First, I remove the bracketed indices (`[1]`, `[5]`, and so on)

cleaned_data <- gsub("\\[\\d+\\]", "", input_data) # Remove indices
cleaned_data

## [1] " \"bell pepper\"  \"bilberry\"     \"blackberry\"   \"blood orange\"\n \"blueberry\"    \"cantaloupe\"   \"chili pepper\" \"cloudberry\"   \n \"elderberry\"   \"lime\"         \"lychee\"       \"mulberry\"     \n \"olive\"        \"salal berry\""

Second, I remove newlines

cleaned_data <- gsub("\n", "", cleaned_data)       
cleaned_data

## [1] " \"bell pepper\"  \"bilberry\"     \"blackberry\"   \"blood orange\" \"blueberry\"    \"cantaloupe\"   \"chili pepper\" \"cloudberry\"    \"elderberry\"   \"lime\"         \"lychee\"       \"mulberry\"      \"olive\"        \"salal berry\""

Third, I replace multiple spaces with a single space

cleaned_data <- gsub("  +", " ", cleaned_data)      
cleaned_data

## [1] " \"bell pepper\" \"bilberry\" \"blackberry\" \"blood orange\" \"blueberry\" \"cantaloupe\" \"chili pepper\" \"cloudberry\" \"elderberry\" \"lime\" \"lychee\" \"mulberry\" \"olive\" \"salal berry\""

Fourth, I add commas between items

cleaned_data <- gsub('" "', '", "', cleaned_data)
cleaned_data

## [1] " \"bell pepper\", \"bilberry\", \"blackberry\", \"blood orange\", \"blueberry\", \"cantaloupe\", \"chili pepper\", \"cloudberry\", \"elderberry\", \"lime\", \"lychee\", \"mulberry\", \"olive\", \"salal berry\""

Next, I format the final output

formatted_output <- paste0("c(",cleaned_data,")")
formatted_output

## [1] "c( \"bell pepper\", \"bilberry\", \"blackberry\", \"blood orange\", \"blueberry\", \"cantaloupe\", \"chili pepper\", \"cloudberry\", \"elderberry\", \"lime\", \"lychee\", \"mulberry\", \"olive\", \"salal berry\")"

Last, I remove the stray space left after the opening parenthesis

formatted_output <- sub("^c\\(\\s+", "c(", formatted_output)
formatted_output

## [1] "c(\"bell pepper\", \"bilberry\", \"blackberry\", \"blood orange\", \"blueberry\", \"cantaloupe\", \"chili pepper\", \"cloudberry\", \"elderberry\", \"lime\", \"lychee\", \"mulberry\", \"olive\", \"salal berry\")"

To get the desired format, I print the formatted output with the cat() function

cat(formatted_output)

## c("bell pepper", "bilberry", "blackberry", "blood orange", "blueberry", "cantaloupe", "chili pepper", "cloudberry", "elderberry", "lime", "lychee", "mulberry", "olive", "salal berry")
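As a cross-check on the step-by-step cleanup, the quoted items can also be pulled out of the text directly. This is only a sketch of an alternative route, using base R's `gregexpr()` and `regmatches()`; the variable names `quoted`, `items`, and `result` are chosen here for illustration.

```r
input_data <- '[1] "bell pepper"  "bilberry"     "blackberry"   "blood orange"
[5] "blueberry"    "cantaloupe"   "chili pepper" "cloudberry"   
[9] "elderberry"   "lime"         "lychee"       "mulberry"     
[13] "olive"        "salal berry"'

# Find every run of characters enclosed in double quotes
quoted <- regmatches(input_data, gregexpr('"[^"]+"', input_data))[[1]]

# Drop the surrounding quotes to get a plain character vector
items <- gsub('"', "", quoted, fixed = TRUE)

# Rebuild the c("...", "...") text from the vector
result <- paste0("c(", paste0('"', items, '"', collapse = ", "), ")")
cat(result, "\n")
```

Working from the extracted vector avoids having to reason about indices, newlines, and runs of spaces separately, at the cost of assuming that no item contains a double quote.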

#3 Describe, in words, what these expressions will match:

(.)\\1\\1
(.)(.)\\2\\1
(..)\1
(.)\\.\\1.\\1
(.)(.)(.).*\\3\\2\\1

I load the stringr package, which provides a set of functions designed to simplify string manipulation in R. I use the str_subset() function to return the strings that match each expression.

library(stringr)

Data to match expressions:

words<-c("aaaaa", "dddy" , "reremind", "xyxy", "agga", "129921", "a.a.a", "1.1.1", "a.b.a", "7755745", "abcxyzcbazyx", "123456321", "fggggddsss")

I use str_subset() to find the words that the expression `(.)\\1\\1` matches.

(.) captures any single character (denoted by `.`) and stores it in a capturing group.

The first `\\1` refers to the first capturing group, so it matches the same character captured in the first group.

The second `\\1` again refers to the first capturing group, matching the same character once more.

In short, `(.)\\1\\1` will match any string where the same character appears three times in a row.

match1<-words%>%str_subset("(.)\\1\\1")
match1

## [1] "aaaaa"      "dddy"       "fggggddsss"

The next example is the expression `(.)(.)\\2\\1`.

This expression is used to match a pattern where two characters are followed by the same characters in reverse order. Here’s a breakdown:

`(.)(.)` captures two consecutive characters. Each character is captured by a separate group:

The first `.` captures any single character and stores it in the first capturing group (`\\1`).

The second `.` captures any single character and stores it in the second capturing group (`\\2`).

`\\2` matches the same character as captured by the second capturing group. So, this part matches the second character again.

`\\1` matches the same character as captured by the first capturing group. So, this part matches the first character again.

In short, `(.)(.)\\2\\1` matches a sequence where the first two characters are followed by those same characters in reverse order.

match2<-words%>%str_subset("(.)(.)\\2\\1")
match2

## [1] "aaaaa"      "agga"       "129921"     "7755745"    "fggggddsss"
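It can be easier to follow the backreferences by looking at the exact text the pattern matched inside each word. The sketch below uses base R's `regexpr()` and `regmatches()` with `perl = TRUE` instead of stringr, on a shortened word list; the names `pattern`, `hits`, and `matched_text` are illustrative.

```r
words <- c("aaaaa", "agga", "129921", "7755745", "xyxy")

pattern <- "(.)(.)\\2\\1"

# regexpr() locates the first match in each string (-1 when there is none)
hits <- regexpr(pattern, words, perl = TRUE)

# regmatches() returns the matched text for the strings that did match
matched_text <- regmatches(words, hits)
names(matched_text) <- words[hits != -1]
print(matched_text)
```

In "7755745", for example, the matched part is "7557" rather than the leading "7755", because the third character has to repeat the second and the fourth has to repeat the first.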

The third example is the expression `(..)\\1`. It matches a sequence in which a two-character substring is immediately followed by the same two-character substring. The details are as follows:

`(..)` captures any two characters into a capturing group. This means the first part of the pattern is any two-character sequence.

`\\1` refers to the content of the first capturing group, meaning it matches the same two-character sequence captured by the first group.

In short, `(..)\\1` will match any string where a two-character sequence is immediately followed by the same two-character sequence. The code below also wraps the pattern in `\\b` word boundaries, so the repeated pair must begin and end at a word boundary; that is why “reremind” and “aaaaa” do not appear in the output.

match3<-words%>%str_subset("\\b(..)\\1\\b")
match3

## [1] "xyxy"  "a.a.a" "1.1.1"
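To see what the `\\b` word boundaries in the call above contribute, the pattern can be run with and without them. This is a base-R sketch (using `grepl()` with `perl = TRUE` instead of stringr) on a shortened word list; `unanchored` and `bounded` are names made up for the comparison.

```r
words <- c("aaaaa", "dddy", "reremind", "xyxy", "agga", "a.a.a", "1.1.1", "a.b.a")

# Without word boundaries: the repeated pair may sit anywhere in the string
unanchored <- words[grepl("(..)\\1", words, perl = TRUE)]

# With \\b on both sides: the repeated pair must start and end at a word boundary
bounded <- words[grepl("\\b(..)\\1\\b", words, perl = TRUE)]

print(unanchored)
print(bounded)
```

"a.a.a" survives the bounded version because the boundary falls between the second dot and the final "a", while "reremind" is dropped because "rere" is followed by another letter.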

The next example is the expression `(.)\\.\\1.\\1`. It matches a character that recurs at fixed positions: the character, a literal dot, the same character, any single character, and the same character again. Here’s a breakdown:

`(.)` captures any single character into the first capturing group.

`\\.` matches a literal dot.

`\\1` refers to the same character captured by the first capturing group.

`.` is not escaped here, so it matches any single character, not only a dot.

`\\1` refers to the same character again, matching the same character as the one captured in the first group.

In short, `(.)\\.\\1.\\1` will match a string where a single character is followed by a literal dot, then the same character, then any one character, and then the same character again.

match4<-words%>%str_subset("(.)\\.\\1.\\1")
match4

## [1] "a.a.a" "1.1.1"
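The second dot in `(.)\\.\\1.\\1` is unescaped, so it behaves differently from the first. The sketch below compares it with a hypothetical stricter variant in which both dots are escaped; the test string "a.aXa" and the variable names are assumptions added for the illustration, and base R's `grepl()` with `perl = TRUE` stands in for stringr.

```r
test_words <- c("a.a.a", "a.aXa", "a.b.a")

# Pattern from the text: escaped dot, then an unescaped dot (any character)
as_written <- grepl("(.)\\.\\1.\\1", test_words, perl = TRUE)

# Hypothetical stricter variant: both dots escaped, so both must be literal dots
both_literal <- grepl("(.)\\.\\1\\.\\1", test_words, perl = TRUE)

print(data.frame(word = test_words, as_written, both_literal))
```

Only the pattern as written accepts "a.aXa", since its fourth position can hold any character.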

The final example is the expression `(.)(.)(.).*\\3\\2\\1`. It matches a string in which three consecutive characters reappear in reverse order after any number of other characters. Here’s a breakdown:

`(.)(.)(.)` captures three consecutive characters into three separate capturing groups:

The first `(.)` captures the first character.

The second `(.)` captures the second character.

The third `(.)` captures the third character.

`.*` matches any number of any characters. This allows for any content to appear between the initial three characters and their reversed repetition.

`\\3` refers to the third capturing group.

`\\2` refers to the second capturing group.

`\\1` refers to the first capturing group.

In short, `(.)(.)(.).*\\3\\2\\1` will match any string in which three consecutive characters appear, followed by any sequence of characters (possibly none), followed by the same three characters in reverse order.

match5<-words%>%str_subset("(.)(.)(.).*\\3\\2\\1")
match5

## [1] "129921"       "abcxyzcbazyx" "123456321"
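The three capturing groups can be inspected directly. The following is a small base-R sketch using `regexec()` with `perl = TRUE`, applied to one of the words above; the labels given to the result (`full_match`, `group1`, and so on) are added here for readability.

```r
word <- "abcxyzcbazyx"
pattern <- "(.)(.)(.).*\\3\\2\\1"

# regexec() records the full match plus the text of each capturing group
groups <- regmatches(word, regexec(pattern, word, perl = TRUE))[[1]]
names(groups) <- c("full_match", "group1", "group2", "group3")
print(groups)
```

The full match stops at "cba", and the trailing "zyx" is simply left outside the match.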

#4 Construct regular expressions to match words that:

Start and end with the same character.
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

Explanation for “Start and end with the same character”:

`^` asserts the start of the string.

`(.)` captures the first character.

`[\\s\\S]*` matches any characters (zero or more) in between. Unlike `.*`, it also matches newlines.

`\\1` ensures that the last character matches the first captured character.

`$` asserts the end of the string.

words_match1<-c("throughout", "level", "window", "see", "need", "refer")
words_match1

## [1] "throughout" "level"      "window"     "see"        "need"      
## [6] "refer"

match6<-words_match1%>%str_subset("^(.)[\\s\\S]*\\1$")
match6

## [1] "throughout" "level"      "window"     "refer"
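One edge case is worth noting: the pattern needs at least two characters, so a one-letter word is not counted as starting and ending with the same character. The sketch below contrasts it with a hypothetical variant that makes the tail optional; the variant, the test words, and the variable names are assumptions, and base R's `grepl()` with `perl = TRUE` is used instead of stringr.

```r
test_words <- c("a", "level", "see", "refer")

# Pattern from the text: needs at least two characters, so "a" is not matched
two_or_more <- grepl("^(.)[\\s\\S]*\\1$", test_words, perl = TRUE)

# Hypothetical variant: the optional group lets a one-character word match too
one_or_more <- grepl("^(.)([\\s\\S]*\\1)?$", test_words, perl = TRUE)

print(data.frame(word = test_words, two_or_more, one_or_more))
```

Whether a single character should count is a matter of how the exercise is read; the variant only shows how the pattern would change if it should.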

Explanation for “Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)”:

`(..)`: Captures any two consecutive characters (a pair).

`.*?`: Lazily matches any number of characters in between (if any), taking as few as possible.

`\\1`: Ensures the pair captured is repeated somewhere later in the word.

words_match2 <- c("church", "datada", "banana",  "abcd", "chch", "different")
words_match2

## [1] "church"    "datada"    "banana"    "abcd"      "chch"      "different"

match7<-words_match2%>%str_subset("(..).*?\\1")
match7

## [1] "church" "datada" "banana" "chch"
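To check which pair the pattern actually found in each word, the matched text can be extracted. This is a base-R sketch with `regexpr()` and `regmatches()` (using `perl = TRUE` so the lazy `.*?` is supported); `hits` and `matched_text` are illustrative names.

```r
words_match2 <- c("church", "datada", "banana", "abcd", "chch", "different")
pattern <- "(..).*?\\1"

# Locate the first match in each word; -1 marks words with no repeated pair
hits <- regexpr(pattern, words_match2, perl = TRUE)

# Extract the matched text and label it with the word it came from
matched_text <- regmatches(words_match2, hits)
names(matched_text) <- words_match2[hits != -1]
print(matched_text)
```

For "banana" the match is "anan": no second "ba" exists, so the engine moves one character to the right and finds "an" repeated.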

Explanation for “Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)”:

`(.)`: Captures any single character into group 1.

`\\w*`: Matches any number of word characters (letters, digits, and underscores) between occurrences.

`\\1`: Matches the captured character again. The pattern uses it twice, so together with the capture the character appears at least three times.

words_match3<- c("eleven", "dataset", "success", "mathematical", "complete")
words_match3

## [1] "eleven"       "dataset"      "success"      "mathematical" "complete"

match8<-words_match3%>%str_subset("(.)\\w*\\1\\w*\\1")
match8

## [1] "eleven"       "success"      "mathematical"
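The regex result can be cross-checked by counting characters directly. The sketch below is an independent check and does not use a regular expression; the helper `max_letter_count()` is a hypothetical function written for this illustration, in base R only.

```r
words_match3 <- c("eleven", "dataset", "success", "mathematical", "complete")

# Highest number of times any single character occurs in a word
max_letter_count <- function(word) {
  letters_in_word <- strsplit(word, "")[[1]]
  max(table(letters_in_word))
}

# Apply the helper to every word and keep those with a character used 3+ times
counts <- vapply(words_match3, max_letter_count, numeric(1))
has_triple <- words_match3[counts >= 3]
print(has_triple)
```

Agreement between the two approaches on this list is reassuring, although it is not a proof that they agree on every possible input.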

Character Manipulation

Nataliya Ferdinand

2024-09-15
