Understanding Database Normalization

Database normalization is a method to organize data in a consistent way. Normalizing data can help ensure there are not duplicate entries and maintain data integrity. Using this method, the data is organized in many tables, linked together by relationships (such as a unique id or key). There are a few levels to normalized datasets: 1NF, 2NF, 3NF.


If a database were to be considered 1NF, or in the first normal form, it would be meet the following criteria:

  1. Each cell does not hold more than one value
  2. Every row has a unique identifier
  3. Each column only has one value for each row in the table


An example dataset that is in the 1NF form is New York State’s Energy Efficiency Completed Projects Dataset:

# completed energy efficiency projects in NY 
# documentation here: https://data.ny.gov/Energy-Environment/Energy-Efficiency-Completed-Projects-Beginning-198/erjw-j2zx/about_data

# loading data from open ny data
ny_ee_proj <- read.socrata("https://data.ny.gov/resource/erjw-j2zx.json") 

# calculating the number of unique project codes
n_distinct(ny_ee_proj$project_code)
## [1] 2747
# calculating the number of rows
nrow(ny_ee_proj)
## [1] 2747
# creating a data frame with an agency ID, and rearranging the columns
df_1nf <- ny_ee_proj %>% 
  group_by(agency) %>% 
  mutate(agency_id =cur_group_id()) %>% 
  ungroup() |> 
  relocate(`project_code`,`project_name`,`year_completed`,`date_completed`,`agency`,`agency_id`)

df_1nf_clean <- clean_names(df_1nf, "title") |> 
  rename("County" = "Ny State County")

# rendering the 1NF data frame, showing the first 6 rows and columns
knitr::kable(head(df_1nf_clean, format ="markdown", n = c(6, 6))) 
Project Code Project Name Year Completed Date Completed Agency Agency Id
CH79001 Watt Busters - SKANEATELES 1987 1987-12-31 NA 50
CH79002 Watt Busters - SHERRILL 1987 1987-12-31 NA 50
CH10725 Watt Busters - GROTON 1987 1987-12-31 NA 50
CH79003 Watt Busters - BERGEN 1988 1988-12-31 NA 50
CH10723 Watt Busters - SPENCERPORT 1988 1988-12-31 NA 50
CH10727 Watt Busters - ENDICOTT 1988 1988-12-31 NA 50

The data seen above is managed by The New York Power Authority, and is collected from project contracts, including project name and codes, location, cost estimates, and energy efficiency savings estimates. This collection effort is related to the Energy Efficiency Market Acceleration Program (EE MAP), which was an initiative lead by former Governor Cuomo to advance market development and commercialization of new energy efficient technologies.


To transform the data into the 2NF form, the code below creates two different tables, the Agency Table, and the Energy Project Table:

# agency table
# removing values where Agency title is empty (NA)
df_2nf_ex1 <- df_1nf %>% 
  select(agency_id,agency,address,city,zip,address_ext) %>% 
  distinct() %>% 
  drop_na(agency) %>% 
  arrange(agency_id) |> 
  clean_names("title")

# Energy Project Table
df_2nf_ex2 <- df_1nf %>% 
  select(-agency,-address,-city,-zip,-address_ext) %>% 
  relocate(agency_id) |> 
  clean_names("title")

# rendering the Agency Table, showing the first 6 rows and columns
knitr::kable(head(df_2nf_ex1, format ="markdown", n = c(6, 6)))
Agency Id Agency Address City Zip Address Ext
1 Brooklyn Public Library 10 Grand Army Plaza Brooklyn 11238 NA
1 Brooklyn Public Library 286 Fulton St Brooklyn 11201 NA
1 Brooklyn Public Library 81 Devoe Street; 360 Irving Ave Brooklyn 11211 NA
1 Brooklyn Public Library 340 Bushwick Avenue Brooklyn 11206 NA
2 City University of New York 56th Ave & Springfield Boulvd Queens 11364 NA
2 City University of New York 535 East 80th Street New York 10021 NA
# rendering the Energy Project Table, showing the first 6 rows and columns
knitr::kable(head(df_2nf_ex2, format ="markdown", n = c(6, 6)))
Agency Id Project Code Project Name Year Completed Date Completed Regional Economic Development Council
50 CH79001 Watt Busters - SKANEATELES 1987 1987-12-31 Central New York
50 CH79002 Watt Busters - SHERRILL 1987 1987-12-31 Mohawk Valley
50 CH10725 Watt Busters - GROTON 1987 1987-12-31 Southern Tier
50 CH79003 Watt Busters - BERGEN 1988 1988-12-31 Finger Lakes
50 CH10723 Watt Busters - SPENCERPORT 1988 1988-12-31 Finger Lakes
50 CH10727 Watt Busters - ENDICOTT 1988 1988-12-31 Southern Tier


By adding an Agency ID, it ensures that if there were two organizations with the same name, or similar names, there would still be a distinct key to identify them. Additionally, with a unique ID it can become easier to look up information about an agency:

df_2nf_ex2 |> 
  filter(`Agency Id` == 1) |> 
  head(n = c(6, 6))
## # A tibble: 5 × 6
##   `Agency Id` `Project Code` `Project Name` `Year Completed` `Date Completed`   
##         <int> <chr>          <chr>          <chr>            <dttm>             
## 1           1 ES-GSN-0146    Brooklyn Publ… 2006             2006-03-16 00:00:00
## 2           1 ES-GSN-0014    Brooklyn Publ… 2010             2010-08-12 00:00:00
## 3           1 ES-GSN-0525    Brooklyn Publ… 2012             2012-11-30 00:00:00
## 4           1 ES-GSN-0492    Brooklyn Publ… 2012             2012-11-30 00:00:00
## 5           1 ES-GSN-0966    Brooklyn Publ… 2023             2023-05-25 00:00:00
## # ℹ 1 more variable: `Regional Economic Development Council` <chr>

Below is an example of the data as unnormalized:

# creating a function to combine the year_completed and date_completed columns 
# as a list in a new column titled 'Completion'
df_1nf$Completion <- mapply(function(year, date) list(Year = as.integer(year), 
                                                      Date = as.Date(date, 
                                                                     format = "%Y-%m-%d")),
                            df_1nf$year_completed, 
                            as.Date(df_1nf$date_completed, format = "%Y-%m-%d"),
                            SIMPLIFY = FALSE)

df_un <- df_1nf[, c("project_code", "Completion")]

# render the unnormalized dataframe
kable(head(df_un, format = "html", escape = FALSE)) |> 
 kable_styling("striped", full_width = FALSE)
project_code Completion
CH79001 1987, 6573
CH79002 1987, 6573
CH10725 1987, 6573
CH79003 1988, 6939
CH10723 1988, 6939
CH10727 1988, 6939

In this example, the completion dates and year is combined into one column. As a note, when the table is rendered, it calculates the number of days since January 1, 1970 (representing when R was created). This type of data structure could complicate data entry or analysis, as the dates may need to be extracted from the Completion column to be analyzed, and if a date needed to be edited, it would not be simple to do so.


To achieve an 3NF form, the database should be in the second form (2NF), and have no transitive partial dependency. This means there are no duplicates or transitory keys. For example, the code below transforms the data to the third form:

# creating project code table
df_3nf_ex1_projectcode <- df_2nf_ex2 |>
  select(`Project Code`, `Project Name`) |>
  distinct() |> 
  clean_names("title")

# creating the project table
df_3nf_ex2_projects <- df_2nf_ex2 |>
  select(-`Project Name`) |> 
  clean_names("title")

# render the project code table
knitr::kable(head(df_3nf_ex1_projectcode, format ="markdown"))
Project Code Project Name
CH79001 Watt Busters - SKANEATELES
CH79002 Watt Busters - SHERRILL
CH10725 Watt Busters - GROTON
CH79003 Watt Busters - BERGEN
CH10723 Watt Busters - SPENCERPORT
CH10727 Watt Busters - ENDICOTT
# render the project table with descriptive columns
knitr::kable(head(df_3nf_ex2_projects, format ="markdown", n = c(6, 6)))
Agency Id Project Code Year Completed Date Completed Regional Economic Development Council Nyiso Zone
50 CH79001 1987 1987-12-31 Central New York C - Central
50 CH79002 1987 1987-12-31 Mohawk Valley E - Mohawk Valley
50 CH10725 1987 1987-12-31 Southern Tier C - Central
50 CH79003 1988 1988-12-31 Finger Lakes A - West
50 CH10723 1988 1988-12-31 Finger Lakes B - Genesee
50 CH10727 1988 1988-12-31 Southern Tier C - Central

Character Manipulation

In this section of the assignment, using 538’s data from the article, The Economic Guide To Picking A College Major, we will identify which majors contain the terms “data” or “statistics”.


To do this, first, you should read in the data, which is hosted on 538’s Github Page.

majors_data <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv")

Then, using the stringr package, we can filter the data to only include majors that incorporate the terms “data” or “statistics”:

majors_data_filtered <- majors_data %>% 
  filter(str_detect(Major,"DATA|STATISTICS"))

This creates a dataframe with three majors:

majors_data_filtered$Major
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"

Regex Part 1:

In the assignment instructions, there is an exercise to translate the following regular expressions in layman’s terms:

  1. (.)\1\1

  2. “(.)(.)\2\1”

  3. (..)\1

  4. “(.).\1.\1”

  5. “(.)(.)(.).*\3\2\1”


Answers:

  1. The “(.)” signifies a group, “\1” signifies a character, and the second “\1” signifies a matching character. This means that a pattern following this regular expression could be “aaa”, “111” or “…”.
  2. This regular expression means there are two groups, depicted by the “(.)” symbols. The “\2” means the exact same character that appeared in the second group appears again. So, an example expression that matches this pattern could be “ceec”, “1221”, or “woow”.
  3. The (..) means there is a group with two characters. The “\1” means that it matches two characters. Therefore, this pattern matches two consecutive identical pairs of characters. This could be “abab”, “3434”, or “aaaa”.
  4. “(.).\1.\1” is a pattern where there is a repeating character that appears three times with a character in between each repetition. A string following this pattern is “1_1_1”, or “abaca”.
  5. This pattern can be explained as one where there are three groups, that occur twice, however the second time the group appears, it is mirrored. An example of a string that matches this pattern is “123xxxx321”.

Regex Part 2

In this part of the assignment, there are instructions to construct regular expressions to match words that:

  1. Start and end with the same character.
  2. Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
  3. Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

Answers

  1. ^(.).*\1$
  2. ..).*?\1
  3. ([a-zA-Z]).\1.\1.