In this lab you will work through a series of exercises that use text and string manipulation to analyze data containing text, manipulate strings, apply regular expressions, and handle data files with unusual formats or text strings.
Problems
Problem 1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset, provide code that identifies the majors that contain either “DATA” or “STATISTICS”, case insensitive. You can find this dataset in R by installing the package fivethirtyeight and using the major column in either college_recent_grads, college_grad_students, or college_all_ages.
library(tidyverse)
library(fivethirtyeight)
# A tibble: 3 × 1
major
<chr>
1 Computer Programming And Data Processing
2 Statistics And Decision Science
3 Management Information Systems And Statistics
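The output above is shown without the code that produced it. One possible way to get it is sketched below, assuming stringr's regex() helper with ignore_case = TRUE. The majors vector is a small stand-in sample built for illustration (it is not the full 173-row column); in the lab it would be replaced by the major column of one of the college datasets named in the problem.

```r
library(stringr)

# Stand-in sample for illustration only; in the lab, use the `major` column
# of one of the fivethirtyeight college datasets instead.
majors <- c(
  "Computer Programming And Data Processing",
  "Statistics And Decision Science",
  "Management Information Systems And Statistics",
  "Petroleum Engineering",
  "General Education"
)

# "data|statistics" matches either word; ignore_case makes it case insensitive.
pattern <- regex("data|statistics", ignore_case = TRUE)

# Logical subsetting keeps only the majors where the pattern is detected.
data_majors <- majors[str_detect(majors, pattern)]
data_majors
```

With a data frame, the same pattern could be passed to filter(str_detect(major, pattern)) to return a tibble like the one shown.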
Problem 2. Write code that transforms the data below:
Hint: There are many different ways to solve this problem, but if you use str_extract_all, a helpful argument is simplify = TRUE, which returns a character matrix instead of a list. Then you can apply other tools from stringr if needed.
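To illustrate the hint only (the data for this problem is not reproduced here), the sketch below compares the two return shapes of str_extract_all. The input vector x is a hypothetical example, not the lab's data.

```r
library(stringr)

# Hypothetical input, for illustration only.
x <- c("a1b22", "c333")

# Default: a list with one character vector of matches per input string.
as_list <- str_extract_all(x, "\\d+")

# simplify = TRUE: a character matrix with one row per input string.
as_matrix <- str_extract_all(x, "\\d+", simplify = TRUE)

is.list(as_list)
dim(as_matrix)
as_matrix[1, ]
```

Because the result is a matrix, it can be indexed by row and column or passed on to other stringr functions, which work on its elements.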
Problem 3. Describe, in words, what these regular expressions will match. Read carefully to see if each entry is a regular expression or a string that defines a regular expression.
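^.*$
A regular expression that matches any entire string, including the empty string. ^ anchors the start of the string, . is a wildcard that matches any character except a newline, * repeats the preceding item zero or more times (which is why an empty string also matches), and $ anchors the end.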
"\\{.+\\}"
A string that defines the regular expression \{.+\}. It matches an opening curly brace, followed by at least one character (any character except a newline), followed by a closing curly brace.
\d{4}-\d{2}-\d{2}
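A regular expression for a date-like format: four digits, a hyphen, two digits, a hyphen, and two digits, so yyyy-mm-dd.
The value inside each { } sets how many digits are required.
The pattern checks only the shape of the text, so a non-date such as 9999-99-99 also matches.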
"\\\\{4}"
A string that defines the regular expression \\{4}, which matches a literal backslash repeated four times, that is, four backslashes in a row: \\\\
"(..)\\1"
A string that defines the regular expression (..)\1, which matches any two characters that are immediately repeated. For example:
coconut: it would find coco
banana: it would find anan
papal: it would find papa
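The descriptions in Problem 3 can be checked empirically. The sketch below runs each pattern through str_extract or str_detect; the test strings are made up for illustration. Note that the bare regular expressions (such as \d{4}-\d{2}-\d{2}) have to be written with doubled backslashes inside an R string.

```r
library(stringr)

# ^.*$ matches a whole string of any content.
whole <- str_extract("anything at all", "^.*$")

# \{.+\} needs at least one character between the braces.
braces <- str_extract(c("due {soon} ok", "empty {} here"), "\\{.+\\}")

# \d{4}-\d{2}-\d{2} picks out a yyyy-mm-dd shaped substring.
date_like <- str_extract("on 2024-03-15", "\\d{4}-\\d{2}-\\d{2}")

# \\{4} needs four backslashes in a row; "\\\\\\\\" in R source is four of them.
four_slashes <- str_detect(c("a\\\\\\\\b", "a\\\\b"), "\\\\{4}")

# (..)\1 finds a pair of characters that is immediately repeated.
repeats <- str_extract(c("coconut", "banana", "papal"), "(..)\\1")
repeats
```

For banana the match starts at the second letter, so the result is anan rather than the whole tail of the word.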
Problem 4. Construct regular expressions to match words that:
head(words)
[1] "a" "able" "about" "absolute" "accept" "account"
Start with “y”.
lettersY <- words |> str_detect("^y")
words[lettersY]
[1] "year" "yes" "yesterday" "yet" "you" "young"
Have seven letters or more.
bigwords <- words |> str_detect("[A-Za-z]{7,}")
words[bigwords][1:10]
For each example, verify that they work by running them on the stringr::words dataset and show the first 10 results (hint: combine str_detect and logical subsetting).
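As a more compact alternative to str_detect plus logical subsetting, str_subset returns the matching elements directly. The sketch below is one possible variant, not the required solution; it anchors the length pattern as ^.{7,}$ (an assumption that differs slightly from the character-class pattern used above) and cross-checks the result against str_length.

```r
library(stringr)

# Words that start with "y".
y_words <- str_subset(words, "^y")

# Words with seven or more characters: anchoring both ends makes the
# quantifier count the whole word.
seven_plus <- str_subset(words, "^.{7,}$")

# Cross-check against a direct length test.
same_as_length_test <- identical(seven_plus, words[str_length(words) >= 7])

y_words
head(seven_plus, 10)
```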
Problem 5. Consider the gss_cat data-frame discussed in Chapter 16 of R4DS (provided as part of the forcats package):
Create a new variable that describes whether the party-id of a survey respondent is “strong” if they are a strong republican or strong democrat, “weak” if they are a not strong democrat, not strong republican, or independent of any type, and “other” for the rest.
# count() works better here: on the console the row numbers also show the numeric code of each level
gss_cat |> count(partyid)
# A tibble: 10 × 2
partyid n
<fct> <int>
1 No answer 154
2 Don't know 1
3 Other party 393
4 Strong republican 2314
5 Not str republican 3032
6 Ind,near rep 1791
7 Independent 4119
8 Ind,near dem 2499
9 Not str democrat 3690
10 Strong democrat 3490
politic_levels <- c("strong", "weak", "other")

# use the numeric codes of the partyid levels to fill party_commitment
gss_cat <- gss_cat |>
  mutate(party_commitment = case_when(
    as.numeric(partyid) %in% c(4, 10) ~ "strong",
    as.numeric(partyid) %in% c(5, 6, 7, 8, 9) ~ "weak",
    as.numeric(partyid) %in% c(1, 2, 3) ~ "other",
    TRUE ~ "other"
  )) |>
  mutate(party_commitment = factor(party_commitment, levels = politic_levels))

gss_cat |> count(party_commitment)
# A tibble: 3 × 2
party_commitment n
<fct> <int>
1 strong 5804
2 weak 15131
3 other 548
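The numeric codes used above depend on the order of the partyid levels. A possible alternative, sketched here under the assumption that the level labels are exactly those printed by count(partyid), is to collapse the levels by name with forcats::fct_collapse. The names gss_named and counts are introduced only for this illustration.

```r
library(dplyr)
library(forcats)

gss_named <- gss_cat |>
  mutate(
    # Group the original partyid labels by name instead of by position.
    party_commitment = fct_collapse(
      partyid,
      strong = c("Strong republican", "Strong democrat"),
      weak = c("Not str republican", "Ind,near rep", "Independent",
               "Ind,near dem", "Not str democrat"),
      other = c("No answer", "Don't know", "Other party")
    ),
    # Put the new levels in the same order as politic_levels above.
    party_commitment = fct_relevel(party_commitment, "strong", "weak", "other")
  )

counts <- gss_named |> count(party_commitment)
counts
```

Matching on labels keeps the grouping correct even if the factor levels are reordered earlier in the script, at the cost of having to spell each label exactly.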
Calculate the mean hours of TV watched by each of the groups “strong”, “weak”, and “other” and display it with a dot-plot (geom_point). Sort the levels in the dot-plot so that the group appears in order of most mean TV hours watched.
gss_cat |>
  group_by(party_commitment) |>
  summarise(mean_tv = mean(tvhours, na.rm = TRUE)) |>
  ggplot(aes(x = mean_tv, y = fct_reorder(party_commitment, mean_tv))) +
  geom_point() +
  labs(
    x = "Average Hours of TV watched",
    y = "Commitment to party",
    title = "Average TV \nWatched by levels of \nPolitical Commitment"
  )