## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
NORMALIZATION QUESTION 1
Raw Synthetic Code
First, I create a synthetic data frame, also known as fake data. In this example, the data frame resembles customer and purchase data from Microcenter, a tech store.
orders <- data.frame(
order_id = c(1, 2, 3),
customer_id = c(101, 101, 102),
customer_name = c("Matt", "Matt", "Jesus"),
product_id = c("P001", "P002", "P001"),
product_name = c("Laptop", "Mouse", "Laptop"),
quantity = c(1, 2, 1),
price = c(1000, 30, 1000),
stringsAsFactors = FALSE
)
orders
## order_id customer_id customer_name product_id product_name quantity price
## 1 1 101 Matt P001 Laptop 1 1000
## 2 2 101 Matt P002 Mouse 2 30
## 3 3 102 Jesus P001 Laptop 1 1000
Three Normalized Dataframes
Here is the same data split across three data frames (tables): customers, products, and order records. This split is called normalization: each customer and each product is stored once, and the order records refer to them by ID.
customers <- data.frame(
customer_id = c(101, 102),
customer_name = c("Matt", "Jesus"),
stringsAsFactors = FALSE
)
products <- data.frame(
product_id = c("P001", "P002"),
product_name = c("Laptop", "Mouse"),
price = c(1000, 30),
stringsAsFactors = FALSE
)
order_records <- data.frame(
order_id = c(1, 2, 3),
customer_id = c(101, 101, 102),
product_id = c("P001", "P002", "P001"),
quantity = c(1, 2, 1),
stringsAsFactors = FALSE
)
customers
## customer_id customer_name
## 1 101 Matt
## 2 102 Jesus
products
## product_id product_name price
## 1 P001 Laptop 1000
## 2 P002 Mouse 30
order_records
## order_id customer_id product_id quantity
## 1 1 101 P001 1
## 2 2 101 P002 2
## 3 3 102 P001 1
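As a rough check on the normalization above, the sketch below derives the three tables from the flat orders data frame instead of retyping them, then joins them back together. The names ending in _derived and orders_rebuilt are hypothetical helpers introduced only for this illustration, and the approach assumes dplyr is installed.

```r
library(dplyr)

# Flat (denormalized) table, same values as in the write-up
orders <- data.frame(
  order_id      = c(1, 2, 3),
  customer_id   = c(101, 101, 102),
  customer_name = c("Matt", "Matt", "Jesus"),
  product_id    = c("P001", "P002", "P001"),
  product_name  = c("Laptop", "Mouse", "Laptop"),
  quantity      = c(1, 2, 1),
  price         = c(1000, 30, 1000),
  stringsAsFactors = FALSE
)

# One row per customer and one row per product (hypothetical helper names)
customers_derived <- orders %>% distinct(customer_id, customer_name)
products_derived  <- orders %>% distinct(product_id, product_name, price)

# Order records keep only the keys and the order-specific quantity
order_records_derived <- orders %>%
  select(order_id, customer_id, product_id, quantity)

# Joining on the keys should recover the original flat table
orders_rebuilt <- order_records_derived %>%
  inner_join(customers_derived, by = "customer_id") %>%
  inner_join(products_derived, by = "product_id") %>%
  select(all_of(names(orders)))

orders_rebuilt
```

If the rebuilt table has the same rows and columns as orders, the split lost no information, which is the point of storing each customer and product only once.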
CHARACTER MANIPULATION QUESTIONS 2, 3, AND 4
Major’s list
According to the second task, the goal here is to display the majors that contain either “DATA” or “STATISTICS”, using str_detect() with a regular expression from chapter 15.
majors_csv <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"
majors <- read.csv(majors_csv, stringsAsFactors = FALSE)
majors_data_stats <- majors %>%
filter(str_detect(Major, regex("DATA|STATISTICS")))
majors_data_stats
## FOD1P Major Major_Category
## 1 6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business
## 2 2101 COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3 3702 STATISTICS AND DECISION SCIENCE Computers & Mathematics
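To make the filtering step easier to follow, here is a small sketch of what str_detect() returns for an alternation pattern. The vector sample_majors is a hand-typed sample for illustration only (the three matches shown above plus one assumed non-matching entry), not the full CSV, and the sketch assumes stringr is installed.

```r
library(stringr)

# Hand-typed sample for illustration; the last entry is an assumed non-match
sample_majors <- c(
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "STATISTICS AND DECISION SCIENCE",
  "GENERAL AGRICULTURE"
)

# "DATA|STATISTICS" means: contains DATA or contains STATISTICS.
# str_detect() returns one TRUE/FALSE per element, which filter() uses to keep rows.
is_match <- str_detect(sample_majors, "DATA|STATISTICS")

# The match is case-sensitive by default; regex(ignore_case = TRUE) relaxes that
is_match_any_case <- str_detect(
  sample_majors,
  regex("data|statistics", ignore_case = TRUE)
)

sample_majors[is_match]
```

Because the Major column is stored in upper case, the case-sensitive pattern is enough here; the ignore_case variant would only matter for mixed-case data.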
Expression meaning
I number the expressions listed below from top to bottom as 1 through 5 and describe what each one matches.
1. Three characters: (.) captures any one character, and each \1 repeats that same character, so the pattern matches one character three times in a row, such as “aaa”.
2. Four characters: two characters are captured separately, then \2 repeats the second and \1 repeats the first, so the pair is followed by its reverse, such as “abba”. The doubled backslashes are only R string escaping; they do not change the meaning of the regular expression.
3. Four characters: (..) captures two characters together as one group, and \1 repeats that pair, such as “abab”.
4. Five characters: the first character is captured by (.), and the lone “.” in positions 2 and 4 can be any character. Each \1 refers back to the first character, so positions 1, 3, and 5 are the same, such as “abaca”.
5. At least six characters: the first three characters are captured separately, and .* matches zero or more of any character. The \3, \2, and \1 then repeat the three captured characters in reverse order, such as “abccba” or “abc123cba”.
1. (.)\1\1
2. "(.)(.)\\2\\1"
3. (..)\1
4. "(.).\\1.\\1"
5. "(.)(.)(.).*\\3\\2\\1"
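The five expressions can be tried out directly with str_detect(). The sketch below is only an illustration: the example strings are made up, the pattern names are hypothetical labels, and every backslash is doubled because the patterns are written as R strings. It assumes stringr is installed.

```r
library(stringr)

# The five expressions, written as R strings (backslashes doubled)
patterns <- c(
  three_same      = "(.)\\1\\1",
  pair_reversed   = "(.)(.)\\2\\1",
  pair_repeated   = "(..)\\1",
  same_at_1_3_5   = "(.).\\1.\\1",
  triple_reversed = "(.)(.)(.).*\\3\\2\\1"
)

# Made-up strings: one that should match and one that should not, per pattern
should_match     <- c("aaa", "abba", "abab", "abaca", "abccba")
should_not_match <- c("aab", "abab", "abba", "abcde", "abcabc")

# str_detect() pairs each string with the pattern in the same position
hits   <- str_detect(should_match, patterns)
misses <- str_detect(should_not_match, patterns)

# The last pattern also allows extra characters in the middle because of .*
with_filler <- str_detect("abc123cba", patterns[["triple_reversed"]])

data.frame(pattern = unname(patterns), should_match, hits, should_not_match, misses)
```

Note that str_detect() looks for a match anywhere in the string, so longer strings containing one of these shapes also count as matches.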
Construct regular expressions
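1. 4-character string: (.)(.)(.)\1 matches "oreo"
2. 6-character string: (.)(.)\1\1\2. matches "mammal" (lowercase, because the match is case-sensitive and "M" is not the same character as "m")
3. 6-character string: .(.)(.)\1\2\1 matches "Banana" (only two groups are captured, so the backreferences are \1 and \2)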
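A quick way to check constructed expressions like these is to run them against their target words. This is a minimal sketch assuming stringr is installed; the variable names are hypothetical. A backreference can only point at a group that exists, so the third pattern here uses \\1\\2\\1 with its two capture groups, and the second word is also tested in lowercase because matching is case-sensitive.

```r
library(stringr)

# Pattern 1: first and fourth characters are the same
p_oreo <- "(.)(.)(.)\\1"

# Pattern 2: c1 c2 c1 c1 c2 followed by any character
p_mammal <- "(.)(.)\\1\\1\\2."

# Pattern 3: any character, then c1 c2 c1 c2 c1
p_banana <- ".(.)(.)\\1\\2\\1"

oreo_ok   <- str_detect("oreo", p_oreo)
banana_ok <- str_detect("Banana", p_banana)

# Capital "M" differs from "m", so the capitalized word does not match as written
mammal_capitalized <- str_detect("Mammal", p_mammal)
mammal_lowercase   <- str_detect(str_to_lower("Mammal"), p_mammal)

c(oreo = oreo_ok, Mammal = mammal_capitalized,
  mammal = mammal_lowercase, Banana = banana_ok)
```

Lowercasing the input with str_to_lower() before matching is one simple way to handle capitalized words without changing the pattern.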