Database normalization is a method to organize data in a consistent way. Normalizing data can help ensure there are not duplicate entries and maintain data integrity. Using this method, the data is organized in many tables, linked together by relationships (such as a unique id or key). There are a few levels to normalized datasets: 1NF, 2NF, 3NF.
If a database were to be considered 1NF, or in the first normal
form, it would be meet the following criteria:
An example dataset that is in the 1NF form is New York State’s Energy Efficiency Completed Projects Dataset:
# completed energy efficiency projects in NY
# documentation here: https://data.ny.gov/Energy-Environment/Energy-Efficiency-Completed-Projects-Beginning-198/erjw-j2zx/about_data
# loading data from open ny data
ny_ee_proj <- read.socrata("https://data.ny.gov/resource/erjw-j2zx.json")
# calculating the number of unique project codes
n_distinct(ny_ee_proj$project_code)
## [1] 2747
# calculating the number of rows
nrow(ny_ee_proj)
## [1] 2747
# creating a data frame with an agency ID, and rearranging the columns
df_1nf <- ny_ee_proj %>%
group_by(agency) %>%
mutate(agency_id =cur_group_id()) %>%
ungroup() |>
relocate(`project_code`,`project_name`,`year_completed`,`date_completed`,`agency`,`agency_id`)
df_1nf_clean <- clean_names(df_1nf, "title") |>
rename("County" = "Ny State County")
# rendering the 1NF data frame, showing the first 6 rows and columns
knitr::kable(head(df_1nf_clean, format ="markdown", n = c(6, 6)))
| Project Code | Project Name | Year Completed | Date Completed | Agency | Agency Id |
|---|---|---|---|---|---|
| CH79001 | Watt Busters - SKANEATELES | 1987 | 1987-12-31 | NA | 50 |
| CH79002 | Watt Busters - SHERRILL | 1987 | 1987-12-31 | NA | 50 |
| CH10725 | Watt Busters - GROTON | 1987 | 1987-12-31 | NA | 50 |
| CH79003 | Watt Busters - BERGEN | 1988 | 1988-12-31 | NA | 50 |
| CH10723 | Watt Busters - SPENCERPORT | 1988 | 1988-12-31 | NA | 50 |
| CH10727 | Watt Busters - ENDICOTT | 1988 | 1988-12-31 | NA | 50 |
The data seen above is managed by The New York Power Authority, and is collected from project contracts, including project name and codes, location, cost estimates, and energy efficiency savings estimates. This collection effort is related to the Energy Efficiency Market Acceleration Program (EE MAP), which was an initiative lead by former Governor Cuomo to advance market development and commercialization of new energy efficient technologies.
To transform the data into the 2NF form, the code below creates
two different tables, the Agency Table, and the Energy Project
Table:
# agency table
# removing values where Agency title is empty (NA)
df_2nf_ex1 <- df_1nf %>%
select(agency_id,agency,address,city,zip,address_ext) %>%
distinct() %>%
drop_na(agency) %>%
arrange(agency_id) |>
clean_names("title")
# Energy Project Table
df_2nf_ex2 <- df_1nf %>%
select(-agency,-address,-city,-zip,-address_ext) %>%
relocate(agency_id) |>
clean_names("title")
# rendering the Agency Table, showing the first 6 rows and columns
knitr::kable(head(df_2nf_ex1, format ="markdown", n = c(6, 6)))
| Agency Id | Agency | Address | City | Zip | Address Ext |
|---|---|---|---|---|---|
| 1 | Brooklyn Public Library | 10 Grand Army Plaza | Brooklyn | 11238 | NA |
| 1 | Brooklyn Public Library | 286 Fulton St | Brooklyn | 11201 | NA |
| 1 | Brooklyn Public Library | 81 Devoe Street; 360 Irving Ave | Brooklyn | 11211 | NA |
| 1 | Brooklyn Public Library | 340 Bushwick Avenue | Brooklyn | 11206 | NA |
| 2 | City University of New York | 56th Ave & Springfield Boulvd | Queens | 11364 | NA |
| 2 | City University of New York | 535 East 80th Street | New York | 10021 | NA |
# rendering the Energy Project Table, showing the first 6 rows and columns
knitr::kable(head(df_2nf_ex2, format ="markdown", n = c(6, 6)))
| Agency Id | Project Code | Project Name | Year Completed | Date Completed | Regional Economic Development Council |
|---|---|---|---|---|---|
| 50 | CH79001 | Watt Busters - SKANEATELES | 1987 | 1987-12-31 | Central New York |
| 50 | CH79002 | Watt Busters - SHERRILL | 1987 | 1987-12-31 | Mohawk Valley |
| 50 | CH10725 | Watt Busters - GROTON | 1987 | 1987-12-31 | Southern Tier |
| 50 | CH79003 | Watt Busters - BERGEN | 1988 | 1988-12-31 | Finger Lakes |
| 50 | CH10723 | Watt Busters - SPENCERPORT | 1988 | 1988-12-31 | Finger Lakes |
| 50 | CH10727 | Watt Busters - ENDICOTT | 1988 | 1988-12-31 | Southern Tier |
By adding an Agency ID, it ensures that if there were two organizations with the same name, or similar names, there would still be a distinct key to identify them. Additionally, with a unique ID it can become easier to look up information about an agency:
df_2nf_ex2 |>
filter(`Agency Id` == 1) |>
head(n = c(6, 6))
## # A tibble: 5 × 6
## `Agency Id` `Project Code` `Project Name` `Year Completed` `Date Completed`
## <int> <chr> <chr> <chr> <dttm>
## 1 1 ES-GSN-0146 Brooklyn Publ… 2006 2006-03-16 00:00:00
## 2 1 ES-GSN-0014 Brooklyn Publ… 2010 2010-08-12 00:00:00
## 3 1 ES-GSN-0525 Brooklyn Publ… 2012 2012-11-30 00:00:00
## 4 1 ES-GSN-0492 Brooklyn Publ… 2012 2012-11-30 00:00:00
## 5 1 ES-GSN-0966 Brooklyn Publ… 2023 2023-05-25 00:00:00
## # ℹ 1 more variable: `Regional Economic Development Council` <chr>
Below is an example of the data as unnormalized:
# creating a function to combine the year_completed and date_completed columns
# as a list in a new column titled 'Completion'
df_1nf$Completion <- mapply(function(year, date) list(Year = as.integer(year),
Date = as.Date(date,
format = "%Y-%m-%d")),
df_1nf$year_completed,
as.Date(df_1nf$date_completed, format = "%Y-%m-%d"),
SIMPLIFY = FALSE)
df_un <- df_1nf[, c("project_code", "Completion")]
# render the unnormalized dataframe
kable(head(df_un, format = "html", escape = FALSE)) |>
kable_styling("striped", full_width = FALSE)
| project_code | Completion |
|---|---|
| CH79001 | 1987, 6573 |
| CH79002 | 1987, 6573 |
| CH10725 | 1987, 6573 |
| CH79003 | 1988, 6939 |
| CH10723 | 1988, 6939 |
| CH10727 | 1988, 6939 |
In this example, the completion dates and year is combined into one
column. As a note, when the table is rendered, it calculates the number
of days since January 1, 1970 (representing when R was created). This
type of data structure could complicate data entry or analysis, as the
dates may need to be extracted from the Completion column
to be analyzed, and if a date needed to be edited, it would not be
simple to do so.
To achieve an 3NF form, the database should be in the second form (2NF), and have no transitive partial dependency. This means there are no duplicates or transitory keys. For example, the code below transforms the data to the third form:
# creating project code table
df_3nf_ex1_projectcode <- df_2nf_ex2 |>
select(`Project Code`, `Project Name`) |>
distinct() |>
clean_names("title")
# creating the project table
df_3nf_ex2_projects <- df_2nf_ex2 |>
select(-`Project Name`) |>
clean_names("title")
# render the project code table
knitr::kable(head(df_3nf_ex1_projectcode, format ="markdown"))
| Project Code | Project Name |
|---|---|
| CH79001 | Watt Busters - SKANEATELES |
| CH79002 | Watt Busters - SHERRILL |
| CH10725 | Watt Busters - GROTON |
| CH79003 | Watt Busters - BERGEN |
| CH10723 | Watt Busters - SPENCERPORT |
| CH10727 | Watt Busters - ENDICOTT |
# render the project table with descriptive columns
knitr::kable(head(df_3nf_ex2_projects, format ="markdown", n = c(6, 6)))
| Agency Id | Project Code | Year Completed | Date Completed | Regional Economic Development Council | Nyiso Zone |
|---|---|---|---|---|---|
| 50 | CH79001 | 1987 | 1987-12-31 | Central New York | C - Central |
| 50 | CH79002 | 1987 | 1987-12-31 | Mohawk Valley | E - Mohawk Valley |
| 50 | CH10725 | 1987 | 1987-12-31 | Southern Tier | C - Central |
| 50 | CH79003 | 1988 | 1988-12-31 | Finger Lakes | A - West |
| 50 | CH10723 | 1988 | 1988-12-31 | Finger Lakes | B - Genesee |
| 50 | CH10727 | 1988 | 1988-12-31 | Southern Tier | C - Central |
In this section of the assignment, using 538’s data from the article, The Economic Guide To Picking A College Major, we will identify which majors contain the terms “data” or “statistics”.
To do this, first, you should read in the data, which is hosted on 538’s Github Page.
majors_data <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv")
Then, using the stringr package, we can filter the data to only include majors that incorporate the terms “data” or “statistics”:
majors_data_filtered <- majors_data %>%
filter(str_detect(Major,"DATA|STATISTICS"))
This creates a dataframe with three majors:
majors_data_filtered$Major
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"
## [3] "STATISTICS AND DECISION SCIENCE"
In the assignment instructions, there is an exercise to translate the following regular expressions in layman’s terms:
(.)\1\1
“(.)(.)\2\1”
(..)\1
“(.).\1.\1”
“(.)(.)(.).*\3\2\1”
In this part of the assignment, there are instructions to construct regular expressions to match words that: