DATA607_Week3_Assignment

Normalization

Provide an example of at least three dataframes in R that demonstrate normalization. The dataframes can contain any data, either real or synthetic. Although normalization is typically done in SQL and relational databases, you are expected to show this example in R, as it is our main work environment in this course.

Dataframe 1 - supermarket customers and the item each customer bought

supermarket_customer <- data.frame(
  customer_id = c(1, 2, 3, 4, 5),
  name = c("Addie", "Eddie", "Elma", "Saif", "Dawa"),
  item = c("Apple", "Banana", "Pie", "Apple", "Donut"),
  stringsAsFactors = FALSE
)
print(supermarket_customer)

##   customer_id  name   item
## 1           1 Addie  Apple
## 2           2 Eddie Banana
## 3           3  Elma    Pie
## 4           4  Saif  Apple
## 5           5  Dawa  Donut

Dataframe 2 - the date the customers ordered the items

customer_orders <- data.frame(
  order_id = c(200, 201, 202, 203, 204),
  customer_id = c(1, 2, 3, 4, 5),
  order_date = as.Date(c("2025-02-01", "2025-02-02", "2025-02-03", "2025-02-04", "2025-02-05")),
  stringsAsFactors = FALSE
)
print(customer_orders)

##   order_id customer_id order_date
## 1      200           1 2025-02-01
## 2      201           2 2025-02-02
## 3      202           3 2025-02-03
## 4      203           4 2025-02-04
## 5      204           5 2025-02-05

Dataframe 3 - the total amount of money the customers spent for each transaction

order_total <- data.frame(
  order_total_id = c(300, 301, 302, 303, 304),
  order_id = c(200, 201, 202, 203, 204),
  order_total = c(20.25, 34.52, 11.23, 23.87, 5.36),
  stringsAsFactors = FALSE
)
print(order_total)

##   order_total_id order_id order_total
## 1            300      200       20.25
## 2            301      201       34.52
## 3            302      202       11.23
## 4            303      203       23.87
## 5            304      204        5.36
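
The three dataframes are linked by shared keys: customer_id connects customers to orders, and order_id connects orders to totals. As a minimal sketch of how the separate tables can be recombined when a full view is needed, the block below rebuilds the same data and joins it with base R's merge(). The names customer_order_join and full_view are illustrative and not part of the original assignment.

```r
# Rebuild the three dataframes from this section so the example is self-contained.
supermarket_customer <- data.frame(
  customer_id = c(1, 2, 3, 4, 5),
  name = c("Addie", "Eddie", "Elma", "Saif", "Dawa"),
  item = c("Apple", "Banana", "Pie", "Apple", "Donut"),
  stringsAsFactors = FALSE
)
customer_orders <- data.frame(
  order_id = c(200, 201, 202, 203, 204),
  customer_id = c(1, 2, 3, 4, 5),
  order_date = as.Date(c("2025-02-01", "2025-02-02", "2025-02-03",
                         "2025-02-04", "2025-02-05")),
  stringsAsFactors = FALSE
)
order_total <- data.frame(
  order_total_id = c(300, 301, 302, 303, 304),
  order_id = c(200, 201, 202, 203, 204),
  order_total = c(20.25, 34.52, 11.23, 23.87, 5.36),
  stringsAsFactors = FALSE
)

# Join customers to their orders on customer_id (hypothetical helper name).
customer_order_join <- merge(supermarket_customer, customer_orders,
                             by = "customer_id")

# Join the result to the order totals on order_id (hypothetical helper name).
full_view <- merge(customer_order_join, order_total, by = "order_id")
print(full_view)
```

Each fact is stored once in its own table, and the join reconstructs the wide view only when it is needed.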

Character Manipulation

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringr)

college_major_data <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv")

## Rows: 174 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): FOD1P, Major, Major_Category
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(college_major_data)

## Rows: 174
## Columns: 3
## $ FOD1P          <chr> "1100", "1101", "1102", "1103", "1104", "1105", "1106",…
## $ Major          <chr> "GENERAL AGRICULTURE", "AGRICULTURE PRODUCTION AND MANA…
## $ Major_Category <chr> "Agriculture & Natural Resources", "Agriculture & Natur…

data_stats_majors <- college_major_data %>%
  filter(str_detect(Major, regex("DATA|STATISTICS", ignore_case = TRUE)))

print(data_stats_majors)

## # A tibble: 3 × 3
##   FOD1P Major                                         Major_Category         
##   <chr> <chr>                                         <chr>                  
## 1 6212  MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business               
## 2 2101  COMPUTER PROGRAMMING AND DATA PROCESSING      Computers & Mathematics
## 3 3702  STATISTICS AND DECISION SCIENCE               Computers & Mathematics
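
The same filter can be sketched in base R with grepl(), which may be useful when dplyr and stringr are not loaded. To stay runnable offline, this example uses a small hand-picked vector (sample_majors, an illustrative name) made up of major names that appear in the output above, rather than the full CSV.

```r
# A small illustrative sample of major names taken from the output above.
sample_majors <- c(
  "GENERAL AGRICULTURE",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "STATISTICS AND DECISION SCIENCE"
)

# grepl() returns TRUE for each element that contains DATA or STATISTICS;
# ignore.case = TRUE mirrors the ignore_case option used with stringr.
is_match <- grepl("DATA|STATISTICS", sample_majors, ignore.case = TRUE)
matches <- sample_majors[is_match]
print(matches)
```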

The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:

Describe, in words, what these expressions will match:

(.)\1\1
The (.) portion captures any single character, and each \1 matches that same character again, so this expression matches any character repeated three times in a row.
Example: “???”

“(.)(.)\2\1”
This expression captures two characters separately, then matches the second character followed by the first, giving a mirrored four-character sequence.
Example: “abba”

(..)\1
This expression matches any two-character sequence that is immediately repeated.
Example: “hihi”

“(.).\1.\1”
This expression matches a five-character sequence in which the first, third, and fifth characters are the same.
Example: “hahyh”

“(.)(.)(.).*\3\2\1”
This expression matches three characters, then any number of other characters, then the same three characters in reverse order. Because the pattern is not anchored, the match can occur anywhere in a string.
Example: “xyzhihowareyouzyx”
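
The descriptions above can be checked with a short sketch that runs each pattern against its example string. Two things here are choices made for this illustration: the backslashes are doubled because the patterns are written as R strings, and perl = TRUE is passed to grepl() to use the PCRE engine. The pattern labels are made-up names.

```r
# Each pattern from the exercise, written as an R string (backslashes doubled).
patterns <- c(
  triple_char   = "(.)\\1\\1",
  mirrored_pair = "(.)(.)\\2\\1",
  doubled_pair  = "(..)\\1",
  alternating   = "(.).\\1.\\1",
  reversed_trio = "(.)(.)(.).*\\3\\2\\1"
)

# The example string given for each pattern, in the same order.
examples <- c("???", "abba", "hihi", "hahyh", "xyzhihowareyouzyx")

# Test each pattern against its own example.
results <- mapply(function(p, s) grepl(p, s, perl = TRUE), patterns, examples)
print(results)

# A counterexample: "aab" has no character repeated three times in a row.
print(grepl("(.)\\1\\1", "aab", perl = TRUE))
```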

Construct regular expressions to match words that:

Start and end with the same character.

start_end_regex <- "^(.).*\\1$"

# Test the pattern
test_words <- c("banana", "apple", "alpha", "bulb")
grep(start_end_regex, test_words, value = TRUE)

## [1] "alpha" "bulb"
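
One edge case worth noting: the pattern above needs at least two characters, so a one-letter word such as "a" is not matched. If single-character words should count as starting and ending with the same character, a possible variant makes the tail optional. This is a sketch under that assumption; start_end_any_length is a hypothetical name, and perl = TRUE is used here as a choice of regex engine.

```r
# Variant: the "(.*\\1)?" group is optional, so a single character also matches.
start_end_any_length <- "^(.)(.*\\1)?$"

# "a" is the one-letter case; "ab" and "banana" should not match.
test_words <- c("a", "banana", "alpha", "bulb", "ab")
matched <- grep(start_end_any_length, test_words, value = TRUE, perl = TRUE)
print(matched)
```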

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

repeated_pair_regex <- "([A-Za-z]{2}).*\\1"

# Test the pattern
test_words <- c("church", "happy", "jolly", "cat")
grep(repeated_pair_regex, test_words, value = TRUE)

## [1] "church"

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

repeat_3_regex <- ".*([A-Za-z]).*\\1.*\\1.*"

# Test the pattern
test_words <- c("eleven", "banana", "ice cream", "apple")
grep(repeat_3_regex, test_words, value = TRUE)

## [1] "eleven" "banana"
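
As a cross-check on the regular expression, the same question can be answered by counting letters directly. The sketch below is not part of the original exercise: has_triple_letter is a hypothetical helper that strips non-letters, splits the word into characters, and checks whether any letter occurs at least three times. It also uses the pattern without the leading and trailing .*, which grepl() does not need because it searches anywhere in the string.

```r
# Hypothetical helper: TRUE if any letter occurs at least three times in the word.
has_triple_letter <- function(word) {
  # Drop anything that is not a letter, then split into single characters.
  letters_only <- strsplit(gsub("[^A-Za-z]", "", word), "")[[1]]
  # Count each letter and check whether any count reaches three.
  any(table(letters_only) >= 3)
}

test_words <- c("eleven", "banana", "ice cream", "apple")

# Result from counting letters.
by_count <- vapply(test_words, has_triple_letter, logical(1), USE.NAMES = FALSE)

# Result from the regular expression (unanchored, so no outer .* is needed).
by_regex <- grepl("([A-Za-z]).*\\1.*\\1", test_words)

print(test_words[by_count])
print(identical(by_count, by_regex))
```

Both approaches are case-sensitive, so "E" and "e" are counted as different letters.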

Silma

2025-02-13