Overview

This assignment works through examples of normalization and character manipulation.

Loading the Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(babynames)

Normalization

  1. Provide an example of at least three dataframes in R that demonstrate normalization. The dataframes can contain any data, either real or synthetic. Although normalization is typically done in SQL and relational databases, you are expected to show this example in R, as it is our main work environment in this course.

Let’s create normalized dataframes for gym data (this data is synthetic):

# Link table: which trainer teaches which class (many-to-many)
trainer_classes <- data.frame(
  TrainerID = c("T01", "T02", "T03", "T03", "T03", "T04", "T04", "T05", "T06", "T07", "T08", "T08"),
  ClassID = c("C03", "C01", "C01", "C02", "C03", "C03", "C04", "C05", "C04", "C02", "C03", "C05")
)
trainer_classes
##    TrainerID ClassID
## 1        T01     C03
## 2        T02     C01
## 3        T03     C01
## 4        T03     C02
## 5        T03     C03
## 6        T04     C03
## 7        T04     C04
## 8        T05     C05
## 9        T06     C04
## 10       T07     C02
## 11       T08     C03
## 12       T08     C05
# Trainer table: each trainer's name and the location they work at
trainers_locations <- data.frame(
  TrainerID = c("T01", "T02", "T03", "T04", "T05", "T06", "T07", "T08"),
  Name = c("Steve", "Sara", "Bill", "Bill", "Bill", "Rob", "Rob", "Tina"),
  LocationID = c("10", "10", "10", "11", "12", "13", "11", "12")
)
trainers_locations
##   TrainerID  Name LocationID
## 1       T01 Steve         10
## 2       T02  Sara         10
## 3       T03  Bill         10
## 4       T04  Bill         11
## 5       T05  Bill         12
## 6       T06   Rob         13
## 7       T07   Rob         11
## 8       T08  Tina         12
# Lookup table: class names keyed by ClassID
classes <- data.frame(
  ClassID = c("C01", "C02", "C03", "C04", "C05"),
  Class = c("Pilates", "Yoga", "Weights", "Cycling", "Treadmill")
)
classes
##   ClassID     Class
## 1     C01   Pilates
## 2     C02      Yoga
## 3     C03   Weights
## 4     C04   Cycling
## 5     C05 Treadmill
# Lookup table: location names keyed by LocationID
locations <- data.frame(
  LocationID = c("10", "11", "12", "13"),
  Location = c("Tribeca", "Williamsburg", "UES", "FiDi")
)
locations
##   LocationID     Location
## 1         10      Tribeca
## 2         11 Williamsburg
## 3         12          UES
## 4         13         FiDi

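These dataframes are trainer_classes (which trainer teaches which class), trainers_locations (each trainer’s name and location), classes (the name of each class) and locations (the name of each location).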

Having separate dataframes is important here because a trainer can teach several classes and a class can be taught by several trainers, at different locations.

Keeping each table to an ID plus the few columns that describe that ID is beneficial because each attribute then depends only on the primary key, which is the ID in this case. Real-world datasets can be very complicated, with many tables, so it helps to be able to refer back to a single table when a certain part of the data needs to be looked into. Overall, it makes the data easier to understand.
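The dependence on a primary key can be checked directly. The following is a minimal sketch on cut-down copies of two of the tables; the names classes_small and trainer_classes_small are hypothetical and chosen only to avoid overwriting the objects above.

```r
# Cut-down copies of the lookup and link tables (hypothetical names).
classes_small <- data.frame(
  ClassID = c("C01", "C02", "C03"),
  Class = c("Pilates", "Yoga", "Weights")
)
trainer_classes_small <- data.frame(
  TrainerID = c("T01", "T02", "T03", "T03"),
  ClassID = c("C03", "C01", "C01", "C02")
)

# A primary key must identify each row uniquely.
key_is_unique <- anyDuplicated(classes_small$ClassID) == 0

# Every ClassID used in the link table should exist in the lookup table.
refs_are_valid <- all(trainer_classes_small$ClassID %in% classes_small$ClassID)

key_is_unique
refs_are_valid
```

Both checks should come back TRUE for this sample; a FALSE would point to a duplicated key or a class reference with no matching row.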

Additionally, normalized tables help protect the data from insertion, update, and deletion anomalies. In a single combined table, changing a value that is repeated across many rows risks missing some of those rows, leaving the data inaccurate or self-contradictory (update anomaly). In normalized tables each fact is stored once, so repeated or contradictory data is far less likely. Deleting a row from a combined table can also remove critical data permanently (deletion anomaly). Finally, a combined table can prevent you from inserting critical data because it doesn’t yet fill every column (insertion anomaly). For example, if a new trainer is hired at a gym location but hasn’t been assigned a class yet, it’s still important to add the new trainer’s info to the dataset. This is straightforward with a separate table dedicated to trainers. In a combined table, the new trainer’s row would have no class to record.
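The new-trainer scenario can be sketched as follows. This is an illustration under assumptions, not part of the original data: the trainer "T09" named "Ana" and the *_small table names are hypothetical.

```r
library(dplyr)

# Cut-down copies of the trainer and link tables (hypothetical names).
trainers_small <- data.frame(
  TrainerID = c("T01", "T02"),
  Name = c("Steve", "Sara"),
  LocationID = c("10", "10")
)
trainer_classes_small <- data.frame(
  TrainerID = c("T01", "T02"),
  ClassID = c("C03", "C01")
)

# Hypothetical new hire with no class assigned yet.
new_trainer <- data.frame(TrainerID = "T09", Name = "Ana", LocationID = "11")
trainers_small <- bind_rows(trainers_small, new_trainer)

# anti_join() keeps trainers that have no row in the link table.
unassigned <- anti_join(trainers_small, trainer_classes_small, by = "TrainerID")
unassigned
```

The new trainer is stored without touching the class assignments, and anti_join() lists who still needs a class.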

With normalized tables, you can easily navigate the data and build upon it.
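When a combined view is needed, the normalized tables can be joined back together on their keys. A minimal sketch with dplyr, again on cut-down copies with hypothetical *_small names:

```r
library(dplyr)

# Cut-down copies of the four tables (hypothetical names).
trainer_classes_small <- data.frame(
  TrainerID = c("T01", "T03", "T03"),
  ClassID = c("C03", "C01", "C02")
)
trainers_small <- data.frame(
  TrainerID = c("T01", "T03"),
  Name = c("Steve", "Bill"),
  LocationID = c("10", "10")
)
classes_small <- data.frame(
  ClassID = c("C01", "C02", "C03"),
  Class = c("Pilates", "Yoga", "Weights")
)
locations_small <- data.frame(LocationID = "10", Location = "Tribeca")

# Start from the link table and attach the descriptive columns by key.
schedule <- trainer_classes_small %>%
  left_join(trainers_small, by = "TrainerID") %>%
  left_join(classes_small, by = "ClassID") %>%
  left_join(locations_small, by = "LocationID") %>%
  select(Name, Class, Location)
schedule
```

Starting from the link table gives one row per trainer-class pairing, with the names filled in from the lookup tables.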

Character Manipulation

  2. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

Load the data:

majors_list_df <- read.csv(url("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv"))
majors_list_df
##     FOD1P                                                             Major
## 1    1100                                               GENERAL AGRICULTURE
## 2    1101                             AGRICULTURE PRODUCTION AND MANAGEMENT
## 3    1102                                            AGRICULTURAL ECONOMICS
## 4    1103                                                   ANIMAL SCIENCES
## 5    1104                                                      FOOD SCIENCE
## 6    1105                                        PLANT SCIENCE AND AGRONOMY
## 7    1106                                                      SOIL SCIENCE
## 8    1199                                         MISCELLANEOUS AGRICULTURE
## 9    1302                                                          FORESTRY
## 10   1303                                      NATURAL RESOURCES MANAGEMENT
## 11   6000                                                         FINE ARTS
## 12   6001                                            DRAMA AND THEATER ARTS
## 13   6002                                                             MUSIC
## 14   6003                                        VISUAL AND PERFORMING ARTS
## 15   6004                                 COMMERCIAL ART AND GRAPHIC DESIGN
## 16   6005                                  FILM VIDEO AND PHOTOGRAPHIC ARTS
## 17   6007                                                       STUDIO ARTS
## 18   6099                                           MISCELLANEOUS FINE ARTS
## 19   1301                                             ENVIRONMENTAL SCIENCE
## 20   3600                                                           BIOLOGY
## 21   3601                                              BIOCHEMICAL SCIENCES
## 22   3602                                                            BOTANY
## 23   3603                                                 MOLECULAR BIOLOGY
## 24   3604                                                           ECOLOGY
## 25   3605                                                          GENETICS
## 26   3606                                                      MICROBIOLOGY
## 27   3607                                                      PHARMACOLOGY
## 28   3608                                                        PHYSIOLOGY
## 29   3609                                                           ZOOLOGY
## 30   3611                                                      NEUROSCIENCE
## 31   3699                                             MISCELLANEOUS BIOLOGY
## 32   4006                               COGNITIVE SCIENCE AND BIOPSYCHOLOGY
## 33   6200                                                  GENERAL BUSINESS
## 34   6201                                                        ACCOUNTING
## 35   6202                                                 ACTUARIAL SCIENCE
## 36   6203                            BUSINESS MANAGEMENT AND ADMINISTRATION
## 37   6204                               OPERATIONS LOGISTICS AND E-COMMERCE
## 38   6205                                                BUSINESS ECONOMICS
## 39   6206                                  MARKETING AND MARKETING RESEARCH
## 40   6207                                                           FINANCE
## 41   6209                          HUMAN RESOURCES AND PERSONNEL MANAGEMENT
## 42   6210                                            INTERNATIONAL BUSINESS
## 43   6211                                            HOSPITALITY MANAGEMENT
## 44   6212                     MANAGEMENT INFORMATION SYSTEMS AND STATISTICS
## 45   6299                   MISCELLANEOUS BUSINESS & MEDICAL ADMINISTRATION
## 46   1901                                                    COMMUNICATIONS
## 47   1902                                                        JOURNALISM
## 48   1903                                                        MASS MEDIA
## 49   1904                                  ADVERTISING AND PUBLIC RELATIONS
## 50   2001                                        COMMUNICATION TECHNOLOGIES
## 51   2100                                  COMPUTER AND INFORMATION SYSTEMS
## 52   2101                          COMPUTER PROGRAMMING AND DATA PROCESSING
## 53   2102                                                  COMPUTER SCIENCE
## 54   2105                                              INFORMATION SCIENCES
## 55   2106                   COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY
## 56   2107                        COMPUTER NETWORKING AND TELECOMMUNICATIONS
## 57   3700                                                       MATHEMATICS
## 58   3701                                               APPLIED MATHEMATICS
## 59   3702                                   STATISTICS AND DECISION SCIENCE
## 60   4005                                  MATHEMATICS AND COMPUTER SCIENCE
## 61   2300                                                 GENERAL EDUCATION
## 62   2301                        EDUCATIONAL ADMINISTRATION AND SUPERVISION
## 63   2303                                         SCHOOL STUDENT COUNSELING
## 64   2304                                              ELEMENTARY EDUCATION
## 65   2305                                     MATHEMATICS TEACHER EDUCATION
## 66   2306                            PHYSICAL AND HEALTH EDUCATION TEACHING
## 67   2307                                         EARLY CHILDHOOD EDUCATION
## 68   2308                            SCIENCE AND COMPUTER TEACHER EDUCATION
## 69   2309                                       SECONDARY TEACHER EDUCATION
## 70   2310                                           SPECIAL NEEDS EDUCATION
## 71   2311                       SOCIAL SCIENCE OR HISTORY TEACHER EDUCATION
## 72   2312                                TEACHER EDUCATION: MULTIPLE LEVELS
## 73   2313                                      LANGUAGE AND DRAMA EDUCATION
## 74   2314                                           ART AND MUSIC EDUCATION
## 75   2399                                           MISCELLANEOUS EDUCATION
## 76   3501                                                   LIBRARY SCIENCE
## 77   1401                                                      ARCHITECTURE
## 78   2400                                               GENERAL ENGINEERING
## 79   2401                                             AEROSPACE ENGINEERING
## 80   2402                                            BIOLOGICAL ENGINEERING
## 81   2403                                         ARCHITECTURAL ENGINEERING
## 82   2404                                            BIOMEDICAL ENGINEERING
## 83   2405                                              CHEMICAL ENGINEERING
## 84   2406                                                 CIVIL ENGINEERING
## 85   2407                                              COMPUTER ENGINEERING
## 86   2408                                            ELECTRICAL ENGINEERING
## 87   2409                         ENGINEERING MECHANICS PHYSICS AND SCIENCE
## 88   2410                                         ENVIRONMENTAL ENGINEERING
## 89   2411                            GEOLOGICAL AND GEOPHYSICAL ENGINEERING
## 90   2412                          INDUSTRIAL AND MANUFACTURING ENGINEERING
## 91   2413                       MATERIALS ENGINEERING AND MATERIALS SCIENCE
## 92   2414                                            MECHANICAL ENGINEERING
## 93   2415                                         METALLURGICAL ENGINEERING
## 94   2416                                    MINING AND MINERAL ENGINEERING
## 95   2417                         NAVAL ARCHITECTURE AND MARINE ENGINEERING
## 96   2418                                               NUCLEAR ENGINEERING
## 97   2419                                             PETROLEUM ENGINEERING
## 98   2499                                         MISCELLANEOUS ENGINEERING
## 99   2500                                          ENGINEERING TECHNOLOGIES
## 100  2501                             ENGINEERING AND INDUSTRIAL MANAGEMENT
## 101  2502                                 ELECTRICAL ENGINEERING TECHNOLOGY
## 102  2503                                INDUSTRIAL PRODUCTION TECHNOLOGIES
## 103  2504                       MECHANICAL ENGINEERING RELATED TECHNOLOGIES
## 104  2599                            MISCELLANEOUS ENGINEERING TECHNOLOGIES
## 105  5008                                                 MATERIALS SCIENCE
## 106  4002                                                NUTRITION SCIENCES
## 107  6100                               GENERAL MEDICAL AND HEALTH SERVICES
## 108  6102                     COMMUNICATION DISORDERS SCIENCES AND SERVICES
## 109  6103                        HEALTH AND MEDICAL ADMINISTRATIVE SERVICES
## 110  6104                                        MEDICAL ASSISTING SERVICES
## 111  6105                                  MEDICAL TECHNOLOGIES TECHNICIANS
## 112  6106                           HEALTH AND MEDICAL PREPARATORY PROGRAMS
## 113  6107                                                           NURSING
## 114  6108               PHARMACY PHARMACEUTICAL SCIENCES AND ADMINISTRATION
## 115  6109                                     TREATMENT THERAPY PROFESSIONS
## 116  6110                                       COMMUNITY AND PUBLIC HEALTH
## 117  6199                          MISCELLANEOUS HEALTH MEDICAL PROFESSIONS
## 118  1501                              AREA ETHNIC AND CIVILIZATION STUDIES
## 119  2601               LINGUISTICS AND COMPARATIVE LANGUAGE AND LITERATURE
## 120  2602     FRENCH GERMAN LATIN AND OTHER COMMON FOREIGN LANGUAGE STUDIES
## 121  2603                                           OTHER FOREIGN LANGUAGES
## 122  3301                                   ENGLISH LANGUAGE AND LITERATURE
## 123  3302                                          COMPOSITION AND RHETORIC
## 124  3401                                                      LIBERAL ARTS
## 125  3402                                                        HUMANITIES
## 126  4001                           INTERCULTURAL AND INTERNATIONAL STUDIES
## 127  4801                                  PHILOSOPHY AND RELIGIOUS STUDIES
## 128  4901                                  THEOLOGY AND RELIGIOUS VOCATIONS
## 129  5502                                       ANTHROPOLOGY AND ARCHEOLOGY
## 130  6006                                         ART HISTORY AND CRITICISM
## 131  6402                                                           HISTORY
## 132  6403                                             UNITED STATES HISTORY
## 133  2201                            COSMETOLOGY SERVICES AND CULINARY ARTS
## 134  2901                                      FAMILY AND CONSUMER SCIENCES
## 135  3801                                             MILITARY TECHNOLOGIES
## 136  4101                     PHYSICAL FITNESS PARKS RECREATION AND LEISURE
## 137  5601                                             CONSTRUCTION SERVICES
## 138  5701 ELECTRICAL, MECHANICAL, AND PRECISION TECHNOLOGIES AND PRODUCTION
## 139  5901                          TRANSPORTATION SCIENCES AND TECHNOLOGIES
## 140  4000                                   MULTI/INTERDISCIPLINARY STUDIES
## 141  3201                                                   COURT REPORTING
## 142  3202                                         PRE-LAW AND LEGAL STUDIES
## 143  5301                              CRIMINAL JUSTICE AND FIRE PROTECTION
## 144  5401                                             PUBLIC ADMINISTRATION
## 145  5402                                                     PUBLIC POLICY
## 146 bbbb                                  N/A (less than bachelor's degree)
## 147  5000                                                 PHYSICAL SCIENCES
## 148  5001                                        ASTRONOMY AND ASTROPHYSICS
## 149  5002                              ATMOSPHERIC SCIENCES AND METEOROLOGY
## 150  5003                                                         CHEMISTRY
## 151  5004                                         GEOLOGY AND EARTH SCIENCE
## 152  5005                                                       GEOSCIENCES
## 153  5006                                                      OCEANOGRAPHY
## 154  5007                                                           PHYSICS
## 155  5098                             MULTI-DISCIPLINARY OR GENERAL SCIENCE
## 156  5102        NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL TECHNOLOGIES
## 157  5200                                                        PSYCHOLOGY
## 158  5201                                            EDUCATIONAL PSYCHOLOGY
## 159  5202                                               CLINICAL PSYCHOLOGY
## 160  5203                                             COUNSELING PSYCHOLOGY
## 161  5205                          INDUSTRIAL AND ORGANIZATIONAL PSYCHOLOGY
## 162  5206                                                 SOCIAL PSYCHOLOGY
## 163  5299                                          MISCELLANEOUS PSYCHOLOGY
## 164  5403                         HUMAN SERVICES AND COMMUNITY ORGANIZATION
## 165  5404                                                       SOCIAL WORK
## 166  4007                                 INTERDISCIPLINARY SOCIAL SCIENCES
## 167  5500                                           GENERAL SOCIAL SCIENCES
## 168  5501                                                         ECONOMICS
## 169  5503                                                       CRIMINOLOGY
## 170  5504                                                         GEOGRAPHY
## 171  5505                                           INTERNATIONAL RELATIONS
## 172  5506                                  POLITICAL SCIENCE AND GOVERNMENT
## 173  5507                                                         SOCIOLOGY
## 174  5599                                     MISCELLANEOUS SOCIAL SCIENCES
##                          Major_Category
## 1       Agriculture & Natural Resources
## 2       Agriculture & Natural Resources
## 3       Agriculture & Natural Resources
## 4       Agriculture & Natural Resources
## 5       Agriculture & Natural Resources
## 6       Agriculture & Natural Resources
## 7       Agriculture & Natural Resources
## 8       Agriculture & Natural Resources
## 9       Agriculture & Natural Resources
## 10      Agriculture & Natural Resources
## 11                                 Arts
## 12                                 Arts
## 13                                 Arts
## 14                                 Arts
## 15                                 Arts
## 16                                 Arts
## 17                                 Arts
## 18                                 Arts
## 19               Biology & Life Science
## 20               Biology & Life Science
## 21               Biology & Life Science
## 22               Biology & Life Science
## 23               Biology & Life Science
## 24               Biology & Life Science
## 25               Biology & Life Science
## 26               Biology & Life Science
## 27               Biology & Life Science
## 28               Biology & Life Science
## 29               Biology & Life Science
## 30               Biology & Life Science
## 31               Biology & Life Science
## 32               Biology & Life Science
## 33                             Business
## 34                             Business
## 35                             Business
## 36                             Business
## 37                             Business
## 38                             Business
## 39                             Business
## 40                             Business
## 41                             Business
## 42                             Business
## 43                             Business
## 44                             Business
## 45                             Business
## 46          Communications & Journalism
## 47          Communications & Journalism
## 48          Communications & Journalism
## 49          Communications & Journalism
## 50              Computers & Mathematics
## 51              Computers & Mathematics
## 52              Computers & Mathematics
## 53              Computers & Mathematics
## 54              Computers & Mathematics
## 55              Computers & Mathematics
## 56              Computers & Mathematics
## 57              Computers & Mathematics
## 58              Computers & Mathematics
## 59              Computers & Mathematics
## 60              Computers & Mathematics
## 61                            Education
## 62                            Education
## 63                            Education
## 64                            Education
## 65                            Education
## 66                            Education
## 67                            Education
## 68                            Education
## 69                            Education
## 70                            Education
## 71                            Education
## 72                            Education
## 73                            Education
## 74                            Education
## 75                            Education
## 76                            Education
## 77                          Engineering
## 78                          Engineering
## 79                          Engineering
## 80                          Engineering
## 81                          Engineering
## 82                          Engineering
## 83                          Engineering
## 84                          Engineering
## 85                          Engineering
## 86                          Engineering
## 87                          Engineering
## 88                          Engineering
## 89                          Engineering
## 90                          Engineering
## 91                          Engineering
## 92                          Engineering
## 93                          Engineering
## 94                          Engineering
## 95                          Engineering
## 96                          Engineering
## 97                          Engineering
## 98                          Engineering
## 99                          Engineering
## 100                         Engineering
## 101                         Engineering
## 102                         Engineering
## 103                         Engineering
## 104                         Engineering
## 105                         Engineering
## 106                              Health
## 107                              Health
## 108                              Health
## 109                              Health
## 110                              Health
## 111                              Health
## 112                              Health
## 113                              Health
## 114                              Health
## 115                              Health
## 116                              Health
## 117                              Health
## 118           Humanities & Liberal Arts
## 119           Humanities & Liberal Arts
## 120           Humanities & Liberal Arts
## 121           Humanities & Liberal Arts
## 122           Humanities & Liberal Arts
## 123           Humanities & Liberal Arts
## 124           Humanities & Liberal Arts
## 125           Humanities & Liberal Arts
## 126           Humanities & Liberal Arts
## 127           Humanities & Liberal Arts
## 128           Humanities & Liberal Arts
## 129           Humanities & Liberal Arts
## 130           Humanities & Liberal Arts
## 131           Humanities & Liberal Arts
## 132           Humanities & Liberal Arts
## 133 Industrial Arts & Consumer Services
## 134 Industrial Arts & Consumer Services
## 135 Industrial Arts & Consumer Services
## 136 Industrial Arts & Consumer Services
## 137 Industrial Arts & Consumer Services
## 138 Industrial Arts & Consumer Services
## 139 Industrial Arts & Consumer Services
## 140                   Interdisciplinary
## 141                 Law & Public Policy
## 142                 Law & Public Policy
## 143                 Law & Public Policy
## 144                 Law & Public Policy
## 145                 Law & Public Policy
## 146                                <NA>
## 147                   Physical Sciences
## 148                   Physical Sciences
## 149                   Physical Sciences
## 150                   Physical Sciences
## 151                   Physical Sciences
## 152                   Physical Sciences
## 153                   Physical Sciences
## 154                   Physical Sciences
## 155                   Physical Sciences
## 156                   Physical Sciences
## 157            Psychology & Social Work
## 158            Psychology & Social Work
## 159            Psychology & Social Work
## 160            Psychology & Social Work
## 161            Psychology & Social Work
## 162            Psychology & Social Work
## 163            Psychology & Social Work
## 164            Psychology & Social Work
## 165            Psychology & Social Work
## 166                      Social Science
## 167                      Social Science
## 168                      Social Science
## 169                      Social Science
## 170                      Social Science
## 171                      Social Science
## 172                      Social Science
## 173                      Social Science
## 174                      Social Science

List out the majors containing “DATA” or “STATISTICS”:

str_view(majors_list_df$Major, "DATA|STATISTICS")
## [44] │ MANAGEMENT INFORMATION SYSTEMS AND <STATISTICS>
## [52] │ COMPUTER PROGRAMMING AND <DATA> PROCESSING
## [59] │ <STATISTICS> AND DECISION SCIENCE

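As seen above, the results are MANAGEMENT INFORMATION SYSTEMS AND STATISTICS, COMPUTER PROGRAMMING AND DATA PROCESSING, and STATISTICS AND DECISION SCIENCE.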
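str_view() only displays the matches. If the matching rows are needed as a dataframe, one option is filter() with str_detect(). The sketch below runs on a small sample of major names copied from the output above; majors_sample is a hypothetical stand-in for majors_list_df.

```r
library(dplyr)
library(stringr)

# Small sample of major names copied from the printed data.
majors_sample <- data.frame(
  Major = c(
    "ACTUARIAL SCIENCE",
    "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
    "COMPUTER PROGRAMMING AND DATA PROCESSING",
    "STATISTICS AND DECISION SCIENCE",
    "MATHEMATICS"
  )
)

# Keep only the rows whose Major contains DATA or STATISTICS.
data_stats_majors <- majors_sample %>%
  filter(str_detect(Major, "DATA|STATISTICS"))
data_stats_majors
```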

The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:

  3. Describe, in words, what these expressions will match:

(.)\1\1 is a regular expression, so to turn it into a string defining the regex, we must add a \ before each \:

str_view("aaaabbc", "(.)\\1\\1")
## [1] │ <aaa>abbc
str_view("1111", "(.)\\1\\1")
## [1] │ <111>1

This will match any character repeated 3 times in a row such as “aaa”, “111”, etc.
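The difference between the R string and the regex it defines can be seen with writeLines(). A minimal sketch; the test strings are made up for illustration:

```r
library(stringr)

pattern <- "(.)\\1\\1"

# writeLines() prints the regex the engine actually receives: (.)\1\1
writeLines(pattern)

# "aab" has no character three times in a row, so it does not match.
triple <- str_detect(c("aaa", "aab", "1111"), pattern)
triple
```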

"(.)(.)\\2\\1" is already a string defining a regular expression, so we can pass it to str_view() as is:

str_view(fruit, "(.)(.)\\2\\1")
##  [5] │ bell p<eppe>r
## [17] │ chili p<eppe>r
str_view("aaaabbc", "(.)(.)\\2\\1")
## [1] │ <aaaa>bbc
str_view("11111", "(.)(.)\\2\\1")
## [1] │ <1111>1

This will match a pair of characters immediately followed by the same pair in reverse order, such as “ep” followed by “pe”, or “11” followed by “11”.
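A short check with str_extract() shows which part of each string is matched. The inputs are made-up examples:

```r
library(stringr)

# "abab" repeats the pair but does not reverse it, so there is no match.
mirrored <- str_extract(c("pepper", "abba", "abab"), "(.)(.)\\2\\1")
mirrored
```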

(..)\1 is a regular expression, so to turn it into a string defining the regex, we must add a \ before each \:

str_view(fruit, "(..)\\1")
##  [4] │ b<anan>a
## [20] │ <coco>nut
## [22] │ <cucu>mber
## [41] │ <juju>be
## [56] │ <papa>ya
## [73] │ s<alal> berry
str_view("aaaabbc", "(..)\\1")
## [1] │ <aaaa>bbc
str_view("1111111", "(..)\\1")
## [1] │ <1111>111

This will match a repeated pair of characters such as “anan” or “1111”.
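For contrast with the previous pattern, here the pair must repeat in the same order. A small sketch on made-up inputs:

```r
library(stringr)

# "abba" reverses the pair instead of repeating it, so there is no match.
doubled <- str_extract(c("banana", "coconut", "abba"), "(..)\\1")
doubled
```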

"(.).\\1.\\1" is already a string defining a regular expression, so we can pass it to str_view() as is:

str_view(fruit, "(.).\\1.\\1")
##  [4] │ b<anana>
## [56] │ p<apaya>
str_view("11111", "(.).\\1.\\1")
## [1] │ <11111>
str_view("121314", "(.).\\1.\\1")
## [1] │ <12131>4

This will match a character that appears three times with exactly one arbitrary character between consecutive occurrences, such as “12131” in “121314” or “anana” in “banana”.
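The fixed spacing is what separates this pattern from a general "three occurrences" test. A minimal sketch on made-up inputs:

```r
library(stringr)

# Each occurrence must be exactly two positions after the previous one.
spaced <- str_extract(c("banana", "abaca", "abcde"), "(.).\\1.\\1")
spaced
```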

"(.)(.)(.).*\\3\\2\\1" is already a string defining a regular expression, so we can pass it to str_view() as is:

str_view(sentences, "(.)(.)(.).*\\3\\2\\1")
##   [4] │ These days< a chicken leg is a >rare dish.
##  [10] │ A large< size in stockings is >hard to sell.
##  [14] │ Kick< the ball straight >and follow through.
##  [16] │ A p<ot of tea helps to> pass the evening.
##  [22] │ The fis<h twisted and turned on the bent h>ook.
##  [28] │ The colt rea<red and threw the tall rider>.
##  [57] │ Marc<h the soldiers past the next h>ill.
##  [67] │ The set of chin<a hit the floor with a> crash.
##  [68] │ This is a grand< season for hikes >on the road.
##  [71] │ A yac<ht slid around the point into th>e bay.
##  [83] │ Th<ere are more than two factors here>.
##  [97] │ The term< ended in late june >that year.
## [101] │ Oak i<s strong and also gives s>hade.
## [105] │ Add the sum t<o the product o>f these three.
## [117] │ Weave< the carpet on the right >hand side.
## [118] │ Hemp is a weed< found in parts of >the tropics.
## [122] │ The harder he trie<d the less he got d>one.
## [131] │ A cramp is< no small danger on >a swim.
## [133] │ Pluck< the bright >rose without leaves.
## [135] │ The glow< deepened >in the eyes of the sweet girl.
## ... and 112 more
str_view("12345678.321", "(.)(.)(.).*\\3\\2\\1")
## [1] │ <12345678.321>

This will match any three characters, followed by any number of characters, followed by the same three characters in reverse order. The match can sit anywhere in the string: in “ the carpet on the right ”, it starts with “ th” and ends with “ht ”.
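A smaller example makes the reversal easier to see than the sentences output. The inputs below are made up for illustration:

```r
library(stringr)

# "abcabc" repeats the triple without reversing it, so there is no match.
reversed <- str_extract(
  c("abc---cba", "abccba", "abcabc"),
  "(.)(.)(.).*\\3\\2\\1"
)
reversed
```

Because .* can match zero characters, "abccba" qualifies even though nothing sits between the two triples.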

  4. Construct regular expressions to match words that:

4.1 Start and end with the same character.

^(.).*\1$ or "^(.).*\\1$"

str_view(words, "^(.).*\\1$")
##  [36] │ <america>
##  [49] │ <area>
## [209] │ <dad>
## [213] │ <dead>
## [223] │ <depend>
## [258] │ <educate>
## [266] │ <else>
## [268] │ <encourage>
## [270] │ <engine>
## [278] │ <europe>
## [283] │ <evidence>
## [285] │ <example>
## [287] │ <excuse>
## [288] │ <exercise>
## [291] │ <expense>
## [292] │ <experience>
## [296] │ <eye>
## [386] │ <health>
## [394] │ <high>
## [450] │ <knock>
## ... and 16 more

Since we want to match the start and end of the string, we use ^ and $. In between, (.) captures the first character, .* matches any number of characters, and \1 requires the last character to be the same as the captured one.
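One caveat: this pattern needs at least two characters, so a one-letter word such as “a” is not matched even though it trivially starts and ends with the same character. A possible variant, which is not part of the original answer, makes everything after the first character optional:

```r
library(stringr)

original <- "^(.).*\\1$"
# Variant (an assumption, not the original answer): the tail is optional.
with_single <- "^(.)(.*\\1)?$"

x <- c("a", "dad", "area", "dog")
original_hits <- str_detect(x, original)
variant_hits <- str_detect(x, with_single)

original_hits
variant_hits
```

Whether one-letter words should count depends on how the exercise is read; the two patterns differ only on that case.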

4.2 Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

(..).*\1 or "(..).*\\1"

str_view(words, "(..).*\\1")
##  [48] │ ap<propr>iate
## [152] │ <church>
## [181] │ c<ondition>
## [217] │ <decide>
## [275] │ <environmen>t
## [487] │ l<ondon>
## [598] │ pa<ragra>ph
## [603] │ p<articular>
## [617] │ <photograph>
## [638] │ p<repare>
## [641] │ p<ressure>
## [696] │ r<emem>ber
## [698] │ <repre>sent
## [699] │ <require>
## [739] │ <sense>
## [858] │ the<refore>
## [903] │ u<nderstand>
## [946] │ w<hethe>r
str_view(fruit, "(..).*\\1")
##  [4] │ b<anan>a
##  [5] │ bell <peppe>r
## [17] │ chili <peppe>r
## [20] │ <coco>nut
## [22] │ <cucu>mber
## [29] │ eld<erber>ry
## [41] │ <juju>be
## [51] │ <nectarine>
## [56] │ <papa>ya
## [73] │ s<alal> berry

This will capture a pair of characters and match when the same pair appears again later, with any number of characters between the two occurrences.
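Since . matches any character, the pattern also accepts repeated pairs of digits or spaces, as “s<alal> berry” sits next to a space in the fruit output. If only letters should count, a character class can replace the dots. This is a sketch of one stricter variant, not the original answer:

```r
library(stringr)

any_pair <- "(..).*\\1"
# Stricter variant (an assumption): both characters must be ASCII letters.
letter_pair <- "([A-Za-z]{2}).*\\1"

x <- c("church", "12 12", "apple")
any_hits <- str_detect(x, any_pair)
letter_hits <- str_detect(x, letter_pair)

any_hits
letter_hits
```

On the words vector the two patterns should agree, since it contains only lowercase letters; the difference shows up on mixed input.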

4.3 Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

(.).*\1.*\1 or "(.).*\\1.*\\1"

str_view(words, "(.).*\\1.*\\1")
##  [48] │ a<pprop>riate
##  [62] │ <availa>ble
##  [86] │ b<elieve>
##  [90] │ b<etwee>n
## [119] │ bu<siness>
## [221] │ d<egree>
## [229] │ diff<erence>
## [233] │ di<scuss>
## [265] │ <eleve>n
## [275] │ e<nvironmen>t
## [283] │ <evidence>
## [288] │ <exercise>
## [291] │ <expense>
## [292] │ <experience>
## [423] │ <indivi>dual
## [598] │ p<aragra>ph
## [684] │ r<eceive>
## [696] │ r<emembe>r
## [698] │ r<eprese>nt
## [845] │ t<elephone>
## ... and 2 more
str_view("aaba", "(.).*\\1.*\\1")
## [1] │ <aaba>

This will capture a character and match when it appears at least two more times later in the string, with any number of characters between the occurrences.
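A quick check on a few made-up words confirms that two occurrences are not enough:

```r
library(stringr)

# "apple" has "p" only twice, so it does not match.
three_times <- str_detect(c("eleven", "banana", "apple"), "(.).*\\1.*\\1")
three_times
```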