## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

NORMALIZATION QUESTION 1

Raw Synthetic Code

First I will create a synthetic data frame, that is, made-up data. In this example, the data frame resembles customer purchase data from Micro Center, a tech store.

orders <- data.frame(
  order_id     = c(1, 2, 3),
  customer_id  = c(101, 101, 102),
  customer_name = c("Matt", "Matt", "Jesus"),
  product_id   = c("P001", "P002", "P001"),
  product_name = c("Laptop", "Mouse", "Laptop"),
  quantity     = c(1, 2, 1),
  price        = c(1000, 30, 1000),
  stringsAsFactors = FALSE
)
orders
##   order_id customer_id customer_name product_id product_name quantity price
## 1        1         101          Matt       P001       Laptop        1  1000
## 2        2         101          Matt       P002        Mouse        2    30
## 3        3         102         Jesus       P001       Laptop        1  1000

Three Normalized Dataframes

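Here is the same data split across three data frames (tables): customers, products, and order_records. Splitting the data this way is normalization: each customer and each product is stored only once, and order_records refers to them by customer_id and product_id.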

customers <- data.frame(
  customer_id   = c(101, 102),
  customer_name = c("Matt", "Jesus"),
  stringsAsFactors = FALSE
)
products <- data.frame(
  product_id   = c("P001", "P002"),
  product_name = c("Laptop", "Mouse"),
  price        = c(1000, 30),
  stringsAsFactors = FALSE
)
order_records <- data.frame(
  order_id    = c(1, 2, 3),
  customer_id = c(101, 101, 102),
  product_id  = c("P001", "P002", "P001"),
  quantity    = c(1, 2, 1),
  stringsAsFactors = FALSE
)
customers
##   customer_id customer_name
## 1         101          Matt
## 2         102         Jesus
products
##   product_id product_name price
## 1       P001       Laptop  1000
## 2       P002        Mouse    30
order_records
##   order_id customer_id product_id quantity
## 1        1         101       P001        1
## 2        2         101       P002        2
## 3        3         102       P001        1
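As a check that nothing was lost in the split, the three tables can be joined back into the original wide layout. This is a minimal sketch that assumes dplyr is loaded; the name rebuilt is introduced here for illustration only.

```r
library(dplyr)

# The three normalized tables, as defined above
customers <- data.frame(
  customer_id   = c(101, 102),
  customer_name = c("Matt", "Jesus")
)
products <- data.frame(
  product_id   = c("P001", "P002"),
  product_name = c("Laptop", "Mouse"),
  price        = c(1000, 30)
)
order_records <- data.frame(
  order_id    = c(1, 2, 3),
  customer_id = c(101, 101, 102),
  product_id  = c("P001", "P002", "P001"),
  quantity    = c(1, 2, 1)
)

# Look up customer and product details by their keys,
# then restore the column order of the original orders table
rebuilt <- order_records %>%
  left_join(customers, by = "customer_id") %>%
  left_join(products, by = "product_id") %>%
  select(order_id, customer_id, customer_name,
         product_id, product_name, quantity, price)

rebuilt
```

If the joins recover every row of the original orders table, the normalized design holds the same information with less repetition.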

CHARACTER MANIPULATION QUESTIONS 2,3, AND 4

Majors list

According to the second task, the goal here is to display the majors whose names contain either “DATA” or “STATISTICS”, using str_detect() from chapter 15 on regular expressions.

majors_csv <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
majors <- read.csv(majors_csv, stringsAsFactors = FALSE)
majors_data_stats <- majors %>%
  filter(str_detect(Major, regex("DATA|STATISTICS")))
majors_data_stats
##   FOD1P                                         Major          Major_Category
## 1  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 2  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics
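The pattern above is case-sensitive, which works because every major in this file is written in upper case. If the casing were mixed, one option would be to pass ignore_case = TRUE to regex(). The sketch below uses a small made-up vector (sample_majors is hypothetical, not read from the CSV) so that it runs without a network connection.

```r
library(stringr)

# Hypothetical sample with mixed casing, for illustration only
sample_majors <- c(
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "Statistics and Decision Science",
  "GENERAL AGRICULTURE"
)

# Case-sensitive match: misses the mixed-case entry
strict <- sample_majors[str_detect(sample_majors, "DATA|STATISTICS")]

# Case-insensitive match: catches all three relevant entries
relaxed <- sample_majors[
  str_detect(sample_majors, regex("data|statistics", ignore_case = TRUE))
]

strict
relaxed
```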

Expression meaning

I will number the expressions listed below from top to bottom as 1 through 5 and explain what each one matches.

1. 3-character string: the “.” inside the parentheses captures any single character, and each \1 repeats that captured character, so the same character appears three times in a row (e.g. “aaa”).
2. 4-character string: two characters are captured separately; \2 repeats the second and \1 then repeats the first, so the pair is followed by its reverse (e.g. “abba”). The doubled backslashes are only R string escaping and do not change the meaning of the regular expression.
3. 4-character string: the two characters are captured together as one group, so \1 repeats the whole pair (e.g. “abab”).
4. 5-character string: the first character is captured by (.), the lone “.” in positions 2 and 4 can be any character, and the \1 in positions 3 and 5 refers back to the first character (e.g. “abaca”).
5. String of at least 6 characters: the first three characters are captured separately, .* matches any run of characters (including none), and \3 \2 \1 repeat the captured characters in reverse order (e.g. “abccba”).

1. (.)\1\1
2. "(.)(.)\\2\\1"
3. (..)\1
4. "(.).\\1.\\1"
5. "(.)(.)(.).*\\3\\2\\1"
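The five expressions can be tried out with str_detect(). This sketch assumes stringr is loaded and writes every pattern as an R string, so each backslash is doubled; the test words are made up for illustration.

```r
library(stringr)

# Each pattern written as an R string (backslashes doubled)
patterns <- c(
  "(.)\\1\\1",
  "(.)(.)\\2\\1",
  "(..)\\1",
  "(.).\\1.\\1",
  "(.)(.)(.).*\\3\\2\\1"
)

# One made-up word that each pattern should match, in the same order
examples <- c("aaa", "abba", "abab", "abaca", "abccba")

# str_detect() is vectorised over both arguments,
# so this checks pattern i against example i
hits <- str_detect(examples, patterns)
hits

# A word with no repeated characters should match none of the patterns
misses <- str_detect("abcdef", patterns)
misses
```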

Construct regular expressions

  1. Start and end with the same character.
  2. Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
  3. Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
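1. ^(.).*\1$ matches any string of two or more characters that starts and ends with the same character, e.g. "oreo".
2. (..).*\1 matches any string in which some pair of characters appears twice, e.g. "mammal" (the pair "ma").
3. (.).*\1.*\1 matches any string in which one character appears in at least three places, e.g. "banana" (the letter "a").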
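One way to check answers to the three tasks is to run candidate patterns against a few words with str_subset(). The sketch below assumes stringr is loaded and uses general patterns that do not depend on word length; the pattern names and the word list are chosen here for illustration.

```r
library(stringr)

# Made-up test words, including the examples from the task text
test_words <- c("oreo", "church", "eleven", "mammal", "banana", "apple")

# 1. Starts and ends with the same character
same_ends <- "^(.).*\\1$"

# 2. Contains a repeated pair of characters
repeated_pair <- "(..).*\\1"

# 3. Contains one character in at least three places
three_times <- "(.).*\\1.*\\1"

# str_subset() keeps only the words that match the pattern
ends_hits  <- str_subset(test_words, same_ends)
pair_hits  <- str_subset(test_words, repeated_pair)
three_hits <- str_subset(test_words, three_times)

ends_hits   # "oreo"
pair_hits   # "church" "mammal" "banana"
three_hits  # "eleven" "mammal" "banana"
```

The patterns use "." and so accept any character; to restrict them to letters, "." could be swapped for a class such as [A-Za-z].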