library('tidyverse')
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Normalization

1. Provide an example of at least three dataframes in R that demonstrate normalization. The dataframes can contain any data, either real or synthetic. Although normalization is typically done in SQL and relational databases, you are expected to show this example in R, as it is our main work environment in this course.

# Customers: one row per customer, identified by CustomerID.
customers <- data.frame(
  CustomerID = c(101, 102),
  CustomerName = c("John Smith", "Jane Doe"),
  CustomerAddress = c("123 Elm St", "456 Oak St")
)

# Products: one row per product, identified by ProductName.
products <- data.frame(
  ProductName = c("Apples", "Bananas", "Orange"),
  ProductPrice = c(15.00, 40.00, 10.00)
)

# Order details: CustomerID and ProductName are foreign keys, so customer and
# product attributes are stored once in their own tables, not repeated per order.
order_details <- data.frame(
  OrderID = c("001", "002", "003"),
  CustomerID = c(101, 102, 101),
  ProductName = c("Apples", "Bananas", "Orange"),
  Quantity = c(2, 1, 1)
)

print(customers)
##   CustomerID CustomerName CustomerAddress
## 1        101   John Smith      123 Elm St
## 2        102     Jane Doe      456 Oak St
print(products)
##   ProductName ProductPrice
## 1      Apples           15
## 2     Bananas           40
## 3      Orange           10
print(order_details)
##   OrderID CustomerID ProductName Quantity
## 1     001        101      Apples        2
## 2     002        102     Bananas        1
## 3     003        101      Orange        1
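
One way to check that a normalized design loses no information is to join the tables back together on their shared keys. The sketch below is only an illustration: it re-creates small stand-in copies of the three tables (the `_demo` names and `LineTotal` are hypothetical) and assumes the order table carries a `CustomerID` foreign key.

```r
library(dplyr)

# Small stand-in copies of the three tables so the sketch runs on its own.
customers_demo <- data.frame(
  CustomerID = c(101, 102),
  CustomerName = c("John Smith", "Jane Doe")
)
products_demo <- data.frame(
  ProductName = c("Apples", "Bananas", "Orange"),
  ProductPrice = c(15, 40, 10)
)
# Assumption: each order row stores a CustomerID foreign key.
orders_demo <- data.frame(
  OrderID = c("001", "002", "003"),
  CustomerID = c(101, 102, 101),
  ProductName = c("Apples", "Bananas", "Orange"),
  Quantity = c(2, 1, 1)
)

# Join on the shared keys to rebuild the wide view, then derive a line total.
order_report <- orders_demo %>%
  left_join(customers_demo, by = "CustomerID") %>%
  left_join(products_demo, by = "ProductName") %>%
  mutate(LineTotal = Quantity * ProductPrice)

print(order_report)
```

Because names and prices live in one place each, a price change touches a single row in the product table and every joined report picks it up.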

Character Manipulation

2. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

majors <- read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv')
## Rows: 174 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): FOD1P, Major, Major_Category
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(majors)
## Rows: 174
## Columns: 3
## $ FOD1P          <chr> "1100", "1101", "1102", "1103", "1104", "1105", "1106",…
## $ Major          <chr> "GENERAL AGRICULTURE", "AGRICULTURE PRODUCTION AND MANA…
## $ Major_Category <chr> "Agriculture & Natural Resources", "Agriculture & Natur…
(majors_filtered <- majors %>% 
   filter(str_detect(Major, "DATA") | str_detect(Major, "STATISTICS")))
## # A tibble: 3 × 3
##   FOD1P Major                                         Major_Category         
##   <chr> <chr>                                         <chr>                  
## 1 6212  MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business               
## 2 2101  COMPUTER PROGRAMMING AND DATA PROCESSING      Computers & Mathematics
## 3 3702  STATISTICS AND DECISION SCIENCE               Computers & Mathematics
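
The same filter can probably be written with a single pattern, since `|` means "or" inside a regular expression. The sketch below uses a short hand-typed vector (`sample_majors`, a hypothetical stand-in for `majors$Major`) so that it runs without downloading the file.

```r
library(stringr)

# Hypothetical sample of major names; the full list lives in majors$Major.
sample_majors <- c(
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "STATISTICS AND DECISION SCIENCE",
  "GENERAL AGRICULTURE",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
)

# "DATA|STATISTICS" matches a string that contains either keyword.
matches <- str_subset(sample_majors, "DATA|STATISTICS")
print(matches)
```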

The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:

3. Describe, in words, what these expressions will match:

“(.)\1\1” - Matches any character repeated three times in a row (e.g. aaa)

“(.)(.)\2\1” - Matches a pair of characters followed by the same pair in reverse order (e.g. abba)

“(..)\1” - Matches any two characters immediately followed by the same two characters (e.g. acac)

“(.).\1.\1” - Matches a character, then any character, then the first character again, then any character, then the first character a third time (e.g. abaca)

“(.)(.)(.).*\3\2\1” - Matches three characters, then zero or more characters of any kind, then the same three characters in reverse order (e.g. abckdjhfcba)
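
These descriptions can be checked against the example words with `str_detect()`. This is a minimal sketch, not part of the exercise: inside an R string the backslash itself has to be escaped, so `\1` is typed as `"\\1"`.

```r
library(stringr)

# One pattern per description above, in the same order.
patterns <- c(
  "(.)\\1\\1",            # same character three times in a row
  "(.)(.)\\2\\1",         # a pair, then the pair reversed
  "(..)\\1",              # two characters, repeated immediately
  "(.).\\1.\\1",          # same character in positions 1, 3 and 5
  "(.)(.)(.).*\\3\\2\\1"  # three characters ... then the same three reversed
)
examples <- c("aaa", "abba", "acac", "abaca", "abckdjhfcba")

# str_detect() is vectorised, so each word is tested against its own pattern.
hits <- str_detect(examples, patterns)
print(hits)

# A near miss: "abab" repeats a pair but does not reverse it.
near_miss <- str_detect("abab", "(.)(.)\\2\\1")
print(near_miss)
```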

4. Construct regular expressions to match words that:

Start and end with the same character.

“^(.).*\1$”

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

“(..).*\1”

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

“([a-z]).*\1.*\1”
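
A quick way to sanity-check patterns like these is to run them on a few hand-picked words. The sketch below tests `^(.).*\1$`, `(..).*\1` and `([a-z]).*\1.*\1`; the variable names and test words are chosen only for illustration.

```r
library(stringr)

same_ends     <- "^(.).*\\1$"         # ^ pins the captured character to the start
repeated_pair <- "(..).*\\1"          # the same two characters appear twice
three_times   <- "([a-z]).*\\1.*\\1"  # .* allows any gap between repeats

ends_hits  <- str_detect(c("area", "church"), same_ends)
pair_hits  <- str_detect(c("church", "apple"), repeated_pair)
three_hits <- str_detect(c("eleven", "appropriate", "church"), three_times)

print(ends_hits)
print(pair_hits)
print(three_hits)
```

Two details are worth noting. Without the leading `^`, the first pattern would accept any word whose last character also appears somewhere earlier, such as "abcb". As written it also needs at least two characters, so a one-letter word like "a" is not matched.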