Understanding Database Normalization

Database normalization is a method to organize data in a consistent way. Normalizing data can help ensure there are not duplicate entries and maintain data integrity. Using this method, the data is organized in many tables, linked together by relationships (such as a unique id or key). There are a few levels to normalized datasets: 1NF, 2NF, 3NF.

If a database were to be considered 1NF, or in the first normal form, it would be meet the following criteria:

Each cell does not hold more than one value
Every row has a unique identifier
Each column only has one value for each row in the table

An example dataset that is in the 1NF form is New York State’s Energy Efficiency Completed Projects Dataset:

# completed energy efficiency projects in NY 
# documentation here: https://data.ny.gov/Energy-Environment/Energy-Efficiency-Completed-Projects-Beginning-198/erjw-j2zx/about_data

# loading data from open ny data
ny_ee_proj <- read.socrata("https://data.ny.gov/resource/erjw-j2zx.json") 

# calculating the number of unique project codes
n_distinct(ny_ee_proj$project_code)

## [1] 2747

# calculating the number of rows
nrow(ny_ee_proj)

## [1] 2747

# creating a data frame with an agency ID, and rearranging the columns
df_1nf <- ny_ee_proj %>% 
  group_by(agency) %>% 
  mutate(agency_id =cur_group_id()) %>% 
  ungroup() |> 
  relocate(`project_code`,`project_name`,`year_completed`,`date_completed`,`agency`,`agency_id`)

df_1nf_clean <- clean_names(df_1nf, "title") |> 
  rename("County" = "Ny State County")

# rendering the 1NF data frame, showing the first 6 rows and columns
knitr::kable(head(df_1nf_clean, format ="markdown", n = c(6, 6)))

Project Code	Project Name	Year Completed	Date Completed	Agency	Agency Id
CH79001	Watt Busters - SKANEATELES	1987	1987-12-31	NA	50
CH79002	Watt Busters - SHERRILL	1987	1987-12-31	NA	50
CH10725	Watt Busters - GROTON	1987	1987-12-31	NA	50
CH79003	Watt Busters - BERGEN	1988	1988-12-31	NA	50
CH10723	Watt Busters - SPENCERPORT	1988	1988-12-31	NA	50
CH10727	Watt Busters - ENDICOTT	1988	1988-12-31	NA	50

The data seen above is managed by The New York Power Authority, and is collected from project contracts, including project name and codes, location, cost estimates, and energy efficiency savings estimates. This collection effort is related to the Energy Efficiency Market Acceleration Program (EE MAP), which was an initiative lead by former Governor Cuomo to advance market development and commercialization of new energy efficient technologies.

To transform the data into the 2NF form, the code below creates two different tables, the Agency Table, and the Energy Project Table:

# agency table
# removing values where Agency title is empty (NA)
df_2nf_ex1 <- df_1nf %>% 
  select(agency_id,agency,address,city,zip,address_ext) %>% 
  distinct() %>% 
  drop_na(agency) %>% 
  arrange(agency_id) |> 
  clean_names("title")

# Energy Project Table
df_2nf_ex2 <- df_1nf %>% 
  select(-agency,-address,-city,-zip,-address_ext) %>% 
  relocate(agency_id) |> 
  clean_names("title")

# rendering the Agency Table, showing the first 6 rows and columns
knitr::kable(head(df_2nf_ex1, format ="markdown", n = c(6, 6)))

Agency Id	Agency	Address	City	Zip	Address Ext
1	Brooklyn Public Library	10 Grand Army Plaza	Brooklyn	11238	NA
1	Brooklyn Public Library	286 Fulton St	Brooklyn	11201	NA
1	Brooklyn Public Library	81 Devoe Street; 360 Irving Ave	Brooklyn	11211	NA
1	Brooklyn Public Library	340 Bushwick Avenue	Brooklyn	11206	NA
2	City University of New York	56th Ave & Springfield Boulvd	Queens	11364	NA
2	City University of New York	535 East 80th Street	New York	10021	NA

# rendering the Energy Project Table, showing the first 6 rows and columns
knitr::kable(head(df_2nf_ex2, format ="markdown", n = c(6, 6)))

Agency Id	Project Code	Project Name	Year Completed	Date Completed	Regional Economic Development Council
50	CH79001	Watt Busters - SKANEATELES	1987	1987-12-31	Central New York
50	CH79002	Watt Busters - SHERRILL	1987	1987-12-31	Mohawk Valley
50	CH10725	Watt Busters - GROTON	1987	1987-12-31	Southern Tier
50	CH79003	Watt Busters - BERGEN	1988	1988-12-31	Finger Lakes
50	CH10723	Watt Busters - SPENCERPORT	1988	1988-12-31	Finger Lakes
50	CH10727	Watt Busters - ENDICOTT	1988	1988-12-31	Southern Tier

By adding an Agency ID, it ensures that if there were two organizations with the same name, or similar names, there would still be a distinct key to identify them. Additionally, with a unique ID it can become easier to look up information about an agency:

df_2nf_ex2 |> 
  filter(`Agency Id` == 1) |> 
  head(n = c(6, 6))

## # A tibble: 5 × 6
##   `Agency Id` `Project Code` `Project Name` `Year Completed` `Date Completed`   
##         <int> <chr>          <chr>          <chr>            <dttm>             
## 1           1 ES-GSN-0146    Brooklyn Publ… 2006             2006-03-16 00:00:00
## 2           1 ES-GSN-0014    Brooklyn Publ… 2010             2010-08-12 00:00:00
## 3           1 ES-GSN-0525    Brooklyn Publ… 2012             2012-11-30 00:00:00
## 4           1 ES-GSN-0492    Brooklyn Publ… 2012             2012-11-30 00:00:00
## 5           1 ES-GSN-0966    Brooklyn Publ… 2023             2023-05-25 00:00:00
## # ℹ 1 more variable: `Regional Economic Development Council` <chr>

Below is an example of the data as unnormalized:

# creating a function to combine the year_completed and date_completed columns 
# as a list in a new column titled 'Completion'
df_1nf$Completion <- mapply(function(year, date) list(Year = as.integer(year), 
                                                      Date = as.Date(date, 
                                                                     format = "%Y-%m-%d")),
                            df_1nf$year_completed, 
                            as.Date(df_1nf$date_completed, format = "%Y-%m-%d"),
                            SIMPLIFY = FALSE)

df_un <- df_1nf[, c("project_code", "Completion")]

# render the unnormalized dataframe
kable(head(df_un, format = "html", escape = FALSE)) |> 
 kable_styling("striped", full_width = FALSE)

project_code	Completion
CH79001	1987, 6573
CH79002	1987, 6573
CH10725	1987, 6573
CH79003	1988, 6939
CH10723	1988, 6939
CH10727	1988, 6939

In this example, the completion dates and year is combined into one column. As a note, when the table is rendered, it calculates the number of days since January 1, 1970 (representing when R was created). This type of data structure could complicate data entry or analysis, as the dates may need to be extracted from the Completion column to be analyzed, and if a date needed to be edited, it would not be simple to do so.

To achieve an 3NF form, the database should be in the second form (2NF), and have no transitive partial dependency. This means there are no duplicates or transitory keys. For example, the code below transforms the data to the third form:

# creating project code table
df_3nf_ex1_projectcode <- df_2nf_ex2 |>
  select(`Project Code`, `Project Name`) |>
  distinct() |> 
  clean_names("title")

# creating the project table
df_3nf_ex2_projects <- df_2nf_ex2 |>
  select(-`Project Name`) |> 
  clean_names("title")

# render the project code table
knitr::kable(head(df_3nf_ex1_projectcode, format ="markdown"))

Project Code	Project Name
CH79001	Watt Busters - SKANEATELES
CH79002	Watt Busters - SHERRILL
CH10725	Watt Busters - GROTON
CH79003	Watt Busters - BERGEN
CH10723	Watt Busters - SPENCERPORT
CH10727	Watt Busters - ENDICOTT

# render the project table with descriptive columns
knitr::kable(head(df_3nf_ex2_projects, format ="markdown", n = c(6, 6)))

Agency Id	Project Code	Year Completed	Date Completed	Regional Economic Development Council	Nyiso Zone
50	CH79001	1987	1987-12-31	Central New York	C - Central
50	CH79002	1987	1987-12-31	Mohawk Valley	E - Mohawk Valley
50	CH10725	1987	1987-12-31	Southern Tier	C - Central
50	CH79003	1988	1988-12-31	Finger Lakes	A - West
50	CH10723	1988	1988-12-31	Finger Lakes	B - Genesee
50	CH10727	1988	1988-12-31	Southern Tier	C - Central

Character Manipulation

In this section of the assignment, using 538’s data from the article, The Economic Guide To Picking A College Major, we will identify which majors contain the terms “data” or “statistics”.

To do this, first, you should read in the data, which is hosted on 538’s Github Page.

majors_data <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv")

Then, using the stringr package, we can filter the data to only include majors that incorporate the terms “data” or “statistics”:

majors_data_filtered <- majors_data %>% 
  filter(str_detect(Major,"DATA|STATISTICS"))

This creates a dataframe with three majors:

majors_data_filtered$Major

## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"

Regex Part 1:

In the assignment instructions, there is an exercise to translate the following regular expressions in layman’s terms:

(.)\1\1
“(.)(.)\2\1”
(..)\1
“(.).\1.\1”
“(.)(.)(.).*\3\2\1”

Answers:

The “(.)” signifies a group, “\1” signifies a character, and the second “\1” signifies a matching character. This means that a pattern following this regular expression could be “aaa”, “111” or “…”.
This regular expression means there are two groups, depicted by the “(.)” symbols. The “\2” means the exact same character that appeared in the second group appears again. So, an example expression that matches this pattern could be “ceec”, “1221”, or “woow”.
The (..) means there is a group with two characters. The “\1” means that it matches two characters. Therefore, this pattern matches two consecutive identical pairs of characters. This could be “abab”, “3434”, or “aaaa”.
“(.).\1.\1” is a pattern where there is a repeating character that appears three times with a character in between each repetition. A string following this pattern is “1_1_1”, or “abaca”.
This pattern can be explained as one where there are three groups, that occur twice, however the second time the group appears, it is mirrored. An example of a string that matches this pattern is “123xxxx321”.

Regex Part 2

In this part of the assignment, there are instructions to construct regular expressions to match words that:

Start and end with the same character.
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

Answers

^(.).*\1$
..).*?\1
([a-zA-Z]).\1.\1.

Assignment 3

Alyssa Gurkas

2025-03-01