library('tidyverse')
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
1. Provide an example of at least three dataframes in R that demonstrate normalization. The dataframes can contain any data, either real or synthetic. Although normalization is typically done in SQL and relational databases, you are expected to show this example in R, as it is our main work environment in this course.
customers <- data.frame(
CustomerID = c(101, 102),
CustomerName = c("John Smith", "Jane Doe"),
CustomerAddress = c("123 Elm St", "456 Oak St")
)
products <- data.frame(
ProductName = c("Apples", "Bananas", "Orange"),
ProductPrice = c(15.00, 40.00, 10.00)
)
order_details <- data.frame(
OrderID = c("001", "002", "003"),
ProductName = c("Apples", "Bananas", "Orange"),
Quantity = c(2, 1, 1)
)
print(customers)
## CustomerID CustomerName CustomerAddress
## 1 101 John Smith 123 Elm St
## 2 102 Jane Doe 456 Oak St
print(products)
## ProductName ProductPrice
## 1 Apples 15
## 2 Bananas 40
## 3 Orange 10
print(order_details)
## OrderID ProductName Quantity
## 1 001 Apples 2
## 2 002 Bananas 1
## 3 003 Orange 1
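The tables above share ProductName, but nothing yet ties an order to a customer. A minimal sketch of how the normalized tables could be recombined, assuming a CustomerID foreign key is added to the orders table. The `orders_keyed` and `order_report` names are hypothetical, and the customer assigned to each order is invented for illustration only:

```r
library(dplyr)

# Same customers and products tables as above
customers <- data.frame(
  CustomerID = c(101, 102),
  CustomerName = c("John Smith", "Jane Doe"),
  CustomerAddress = c("123 Elm St", "456 Oak St")
)

products <- data.frame(
  ProductName = c("Apples", "Bananas", "Orange"),
  ProductPrice = c(15.00, 40.00, 10.00)
)

# Assumption: CustomerID is added as a foreign key; these customer/order
# pairings are made up and do not come from the original example.
orders_keyed <- data.frame(
  OrderID = c("001", "002", "003"),
  CustomerID = c(101, 102, 101),
  ProductName = c("Apples", "Bananas", "Orange"),
  Quantity = c(2, 1, 1)
)

# Rebuild a denormalized view by joining on the shared keys
order_report <- orders_keyed %>%
  left_join(customers, by = "CustomerID") %>%
  left_join(products, by = "ProductName") %>%
  mutate(LineTotal = Quantity * ProductPrice)

print(order_report)
```

Because each customer's address and each product's price is stored once, an update touches a single row, and the joins recover the combined view on demand.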
2. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”
majors <- read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv')
## Rows: 174 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): FOD1P, Major, Major_Category
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(majors)
## Rows: 174
## Columns: 3
## $ FOD1P <chr> "1100", "1101", "1102", "1103", "1104", "1105", "1106",…
## $ Major <chr> "GENERAL AGRICULTURE", "AGRICULTURE PRODUCTION AND MANA…
## $ Major_Category <chr> "Agriculture & Natural Resources", "Agriculture & Natur…
(majors_filtered <- majors %>%
  filter(str_detect(Major, "DATA") | str_detect(Major, "STATISTICS")))
## # A tibble: 3 × 3
## FOD1P Major Major_Category
## <chr> <chr> <chr>
## 1 6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business
## 2 2101 COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3 3702 STATISTICS AND DECISION SCIENCE Computers & Mathematics
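The same filter can probably be written with a single regular expression that uses alternation (`|`). The sketch below runs on a small stand-in data frame so that it does not depend on the download. `majors_sample` and `data_stats_majors` are hypothetical names, and the Major values are copied from the output above:

```r
library(dplyr)
library(stringr)

# Stand-in sample: four majors taken from the output shown above
majors_sample <- data.frame(
  Major = c(
    "GENERAL AGRICULTURE",
    "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
    "COMPUTER PROGRAMMING AND DATA PROCESSING",
    "STATISTICS AND DECISION SCIENCE"
  )
)

# "DATA|STATISTICS" matches if either word appears anywhere in Major
data_stats_majors <- majors_sample %>%
  filter(str_detect(Major, "DATA|STATISTICS"))

print(data_stats_majors)
```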
The two exercises below are taken from R for Data Science, section 14.3.5.1 in the online version:
3. Describe, in words, what these expressions will match:
“(.)\1\1” - Matches any character repeated three times in a row (e.g. aaa).
“(.)(.)\2\1” - Matches a pair of characters followed by the same pair in reverse order (e.g. abba).
“(..)\1” - Matches any two characters immediately followed by the same two characters (e.g. acac).
“(.).\1.\1” - Matches a character, then any character, then the first character again, then any character, then the first character once more (e.g. abaca).
“(.)(.)(.).*\3\2\1” - Matches three characters, then zero or more characters of any kind, then the same three characters in reverse order (e.g. abckdjhfcba).
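These descriptions can be checked empirically. The sketch below runs each expression against the example strings with `str_detect()`. Note that inside an R string literal each backslash has to be doubled (`"\\1"`). The pattern labels and the `match_table` name are hypothetical, chosen only for readability:

```r
library(stringr)

# Example strings from the descriptions above, plus one that should match nothing
test_strings <- c("aaa", "abba", "acac", "abaca", "abckdjhfcba", "abcd")

# The five expressions, with backslashes doubled for R string literals
patterns <- c(
  three_in_a_row   = "(.)\\1\\1",
  mirrored_pair    = "(.)(.)\\2\\1",
  repeated_pair    = "(..)\\1",
  alternating_char = "(.).\\1.\\1",
  mirrored_triple  = "(.)(.)(.).*\\3\\2\\1"
)

# One column per pattern, one row per test string
match_table <- sapply(patterns, function(p) str_detect(test_strings, p))
rownames(match_table) <- test_strings

print(match_table)
```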
4. Construct regular expressions to match words that:
Start and end with the same character.
“^(.).*\1$”
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
“(..).*\1”
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
“([a-z]).*\1.*\1”
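One way to sanity-check patterns like these is to run them against a handful of words. In the sketch below, the word list and variable names are hypothetical, and the anchored and `.*`-based formulations are one possible reading of the three requirements rather than the only valid answers:

```r
library(stringr)

# Hypothetical word list chosen to exercise each requirement
sample_words <- c("area", "church", "eleven", "banana", "apple", "dog")

# ^ and $ anchor the match to the whole word, so the captured first
# character must also be the last one
same_start_end <- "^(.).*\\1$"

# Any two-character sequence that shows up again later in the word
repeated_pair <- "(..).*\\1"

# One character appearing at least three times, with anything in between
three_times <- "(.).*\\1.*\\1"

start_end_hits <- str_subset(sample_words, same_start_end)
pair_hits <- str_subset(sample_words, repeated_pair)
triple_hits <- str_subset(sample_words, three_times)

print(start_end_hits)
print(pair_hits)
print(triple_hits)
```

Without the leading `^`, the first pattern would only require that some character in the word reappears at the end, and without `.*` between the backreferences the third pattern would only accept repeats separated by exactly one character.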