1NF (Normal Form) - Using row order to convey information is not permitted. - Mixing data types within the same column is not permitted. - Having a table without a primary key is not permitted. - Repeating groups are not permitted
2NF (Normal Form) Each non-key attribute in the table must be dependent on the entire primary key.
3NF (Normal Form) Every non-key attribute in a table should depends on the key, the whole key, and nothing but the key.
4NF (Normal Form) Multivalued dependencies in a table must be multivalued on the key.
5NF (Normal Form) The table (which must be in 4NF) cannot be described as the logical result of joining some other tables togather.
Ask #1: Provide an example of at least three dataframes in R that demonstrate normalization. The dataframes can contain any data, either real or synthetic. Although normalization is typically done in SQL and relational databases, you are expected to show this example in R, as it is our main work environment in this course.
theater <- data.frame(
TheaterID = c(1, 2, 3, 4, 5),
TheaterName = c("Apollo Theater", "Ambassador Theatre", "August Wilson Theatre", "Apollo Theater", "Ambassador Theatre"),
Location = c("New York", "New York", "New York", "Chicago", "Los Angeles"),
Seats = c(1500, 1200, 1300, 1600, 1100),
Actors = c("John New York, Jane New York", "John New York", "John New York", "John Chicago", "John Los Angeles, Jane Los Angeles")
)
print(theater)
## TheaterID TheaterName Location Seats
## 1 1 Apollo Theater New York 1500
## 2 2 Ambassador Theatre New York 1200
## 3 3 August Wilson Theatre New York 1300
## 4 4 Apollo Theater Chicago 1600
## 5 5 Ambassador Theatre Los Angeles 1100
## Actors
## 1 John New York, Jane New York
## 2 John New York
## 3 John New York
## 4 John Chicago
## 5 John Los Angeles, Jane Los Angeles
# 1NF table example
# The table theater_1NF is in 1NF because it has no repeating groups or arrays.
theater_1NF <- data.frame(
TheaterID = c(1, 2, 3, 4, 5),
TheaterName = c("Apollo Theater", "Ambassador Theatre", "August Wilson Theatre", "Apollo Theater", "Ambassador Theatre"),
Location = c("New York", "New York", "New York", "Chicago", "Los Angeles"),
Seats = c(1500, 1200, 1300, 1600, 1100),
Actor = c("John New York", "John New York", "John New York", "John Chicago", "John Los Angeles")
)
print(theater_1NF)
## TheaterID TheaterName Location Seats Actor
## 1 1 Apollo Theater New York 1500 John New York
## 2 2 Ambassador Theatre New York 1200 John New York
## 3 3 August Wilson Theatre New York 1300 John New York
## 4 4 Apollo Theater Chicago 1600 John Chicago
## 5 5 Ambassador Theatre Los Angeles 1100 John Los Angeles
#In theater_1NF, Location and Actor attributes depend on TheaterName, not TheaterID. This means that Location and Actor are partially dependent on the primary key (TheaterID), which violates 2NF.
# 2NF table example
# The table theater_2NF is in 2NF because it has no partial dependencies. The Location attribute has been moved to a separate location_2NF table.
# Theater table
theater_2NF <- data.frame(
TheaterID = c(1, 2, 3, 4, 5),
TheaterName = c("Apollo Theater", "Ambassador Theatre", "August Wilson Theatre", "Apollo Theater", "Ambassador Theatre"),
LocationID = c(1, 1, 1, 2, 3),
Seats = c(1500, 1200, 1300, 1600, 1100),
Actor = c("John New York", "John New York", "John New York", "John Chicago", "John Los Angeles")
)
# Location table
location_2NF <- data.frame(
LocationID = c(1, 2, 3),
Location = c("New York", "Chicago", "Los Angeles")
)
print(theater_2NF)
## TheaterID TheaterName LocationID Seats Actor
## 1 1 Apollo Theater 1 1500 John New York
## 2 2 Ambassador Theatre 1 1200 John New York
## 3 3 August Wilson Theatre 1 1300 John New York
## 4 4 Apollo Theater 2 1600 John Chicago
## 5 5 Ambassador Theatre 3 1100 John Los Angeles
print(location_2NF)
## LocationID Location
## 1 1 New York
## 2 2 Chicago
## 3 3 Los Angeles
# In theater_2NF, the Actor attribute depends on the LocationID, which in turn depends on the TheaterID. This means Actor attribute does not dependent on the primary key (TheaterID), which violates 3NF.
# 3NF table example
# The table theater_3NF is in 3NF because it has no transitive dependencies. The Actor attribute has been moved to a separate actor_3NF table, and the Location table now includes an ActorID to maintain the relationship.
# Theater table
theater_3NF <- data.frame(
TheaterID = c(1, 2, 3, 4, 5),
TheaterName = c("Apollo Theater", "Ambassador Theatre", "August Wilson Theatre", "Apollo Theater", "Ambassador Theatre"),
LocationID = c(1, 1, 1, 2, 3),
Seats = c(1500, 1200, 1300, 1600, 1100)
)
# Location table
location_3NF <- data.frame(
LocationID = c(1, 2, 3),
Location = c("New York", "Chicago", "Los Angeles"),
ActorID = c(1, 2, 3)
)
# Manager table
actor_3NF <- data.frame(
ActorID = c(1, 2, 3),
Actor = c("John New York", "John Chicago", "John Los Angeles")
)
print(theater_3NF)
## TheaterID TheaterName LocationID Seats
## 1 1 Apollo Theater 1 1500
## 2 2 Ambassador Theatre 1 1200
## 3 3 August Wilson Theatre 1 1300
## 4 4 Apollo Theater 2 1600
## 5 5 Ambassador Theatre 3 1100
print(location_3NF)
## LocationID Location ActorID
## 1 1 New York 1
## 2 2 Chicago 2
## 3 3 Los Angeles 3
print(actor_3NF)
## ActorID Actor
## 1 1 John New York
## 2 2 John Chicago
## 3 3 John Los Angeles
library(knitr)
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
urlfile <- "https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv"
majors <- read.csv(urlfile)
#filter data
#pattern
pattern <- c("DATA", "STATISTICS")
majors_filtered <- majors %>% filter(str_detect(Major, regex(paste(pattern, collapse = "|"))))
(.)\1\1 - it matches any character followed by itself two more times. example: aaa (.)(.)\2\1 - it matches two charactersfollowed by the second character and then the first character. example: anna (..)\1 - it matches any two characters (…) followed by the same two characters again. example: ANAN (.).\1.\1 - it matches one character that is following another character, and again the first character with the second character following the first one. Like this: anana - The character “a” is followed by any character “n”, then “a” again, another character “n”, and “a” once more (.)(.)(.).*\3\2\1 - It looks for a pattern where the first three characters appear in reverse order somewhere else. example: abc…cba.