Here is a very large, publicly-available sample dataset taken from fivethirtyeight, expressing age information for Congress:
congress <- read.csv(
"https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/congress-demographics/data_aging_congress.csv",
header = TRUE)
glimpse(congress)
## Rows: 29,120
## Columns: 13
## $ congress <int> 82, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, …
## $ start_date <chr> "1951-01-03", "1947-01-03", "1949-01-03", "1951-01-03", …
## $ chamber <chr> "House", "House", "House", "House", "House", "House", "H…
## $ state_abbrev <chr> "ND", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "V…
## $ party_code <int> 200, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 1…
## $ bioname <chr> "AANDAHL, Fred George", "ABBITT, Watkins Moorman", "ABBI…
## $ bioguide_id <chr> "A000001", "A000002", "A000002", "A000002", "A000002", "…
## $ birthday <chr> "1897-04-09", "1908-05-21", "1908-05-21", "1908-05-21", …
## $ cmltv_cong <int> 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4…
## $ cmltv_chamber <int> 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4…
## $ age_days <int> 19626, 14106, 14837, 15567, 16298, 17028, 17759, 18489, …
## $ age_years <dbl> 53.73306, 38.62012, 40.62149, 42.62012, 44.62149, 46.620…
## $ generation <chr> "Lost", "Greatest", "Greatest", "Greatest", "Greatest", …
The values in the generation
column are defined in the
README of the github:
Generation the member belonged to, based on the year of birth. Generations in the data are defined as follows: Gilded (1822-1842), Progressive (1843-1859), Missionary (1860-1882), Lost (1883-1900), Greatest (1901-1927), Silent (1928-1945), baby boomer (1946-1964), Generation X (1965-1980), millennial (1981-1996), Generation Z (1997-2012).
To normalize this, the columns for age/generation could be split off into a separate data frame. Each member of congress could then be assigned to a generation based on the birth year values, instead of repeating the rows.
gen_name <- c("Gilded", "Progressive", "Missionary", "Lost", "Greatest", "Silent", "baby boomer", "Generation X", "millennial", "Generation Z")
start_year <- c(1822, 1843, 1860, 1883, 1901, 1928, 1946, 1965, 1981, 1997)
end_year <- c(1842, 1859, 1882, 1900, 1927, 1945, 1964, 1980, 1996, 2012)
generations <- data.frame(gen_name, start_year, end_year)
generations
## gen_name start_year end_year
## 1 Gilded 1822 1842
## 2 Progressive 1843 1859
## 3 Missionary 1860 1882
## 4 Lost 1883 1900
## 5 Greatest 1901 1927
## 6 Silent 1928 1945
## 7 baby boomer 1946 1964
## 8 Generation X 1965 1980
## 9 millennial 1981 1996
## 10 Generation Z 1997 2012
Building an example dataset from scratch based on my music tastes, I may initially want to include columns like the following:
sc_music <- data.frame(song_title, album_title, artist, genre, plays)
But this could easily result in rows with many repeated groups. To avoid repetition from the same album, artist or genre, the data frames could be split into multiple tables with only 2 columns each, a unique ID and a value:
artist_name <- c("Dom Dolla", "Taylor Swift", "BLACKPINK", "Young the Giant", "ODESZA")
artists <- data.frame(artist_name)
artists
## artist_name
## 1 Dom Dolla
## 2 Taylor Swift
## 3 BLACKPINK
## 4 Young the Giant
## 5 ODESZA
album_title <- c("ten days", "Lover", "THE ALBUM", "The Last Goodbye", "In Return")
albums <- data.frame(album_title)
albums
## album_title
## 1 ten days
## 2 Lover
## 3 THE ALBUM
## 4 The Last Goodbye
## 5 In Return
genre_name <- c("pop", "house", "UK garage", "indie", "dance")
genres <- data.frame(genre_name)
genres
## genre_name
## 1 pop
## 2 house
## 3 UK garage
## 4 indie
## 5 dance
**** EDIT **** From there, a final data frame listing songs could include the following columns: song title, album foreign key(s), artist foreign primary key(s), genre foreign key(s), and at least one column dependent solely on the song title, like the number of times I’ve played the track.
I created the following csv file detailing a simplified travel history of my friends:
travel <- read.csv(file="travel.csv", header = TRUE)
travel
## name country traveled
## 1 Steph Japan TRUE
## 2 Steph Italy FALSE
## 3 Steph China TRUE
## 4 Steph Sri Lanka FALSE
## 5 Steph Belgium TRUE
## 6 Thach Japan TRUE
## 7 Thach Italy TRUE
## 8 Thach China FALSE
## 9 Thach Sri Lanka FALSE
## 10 Thach Belgium TRUE
## 11 Kristina Japan TRUE
## 12 Kristina Italy TRUE
## 13 Kristina China FALSE
## 14 Kristina Sri Lanka TRUE
## 15 Kristina Belgium TRUE
## 16 Kevin Japan TRUE
## 17 Kevin Italy FALSE
## 18 Kevin China FALSE
## 19 Kevin Sri Lanka FALSE
## 20 Kevin Belgium FALSE
## 21 Jay Japan FALSE
## 22 Jay Italy FALSE
## 23 Jay China TRUE
## 24 Jay Sri Lanka FALSE
## 25 Jay Belgium TRUE
This data is normalized / tidy because even though names and
countries appear in multiple rows, each pair of name-country could be
used as a composite primary key. The traveled
boolean
column value is fully dependent on that key, the whole key and nothing
but that key.
**** EDIT **** Adding another column like year
would
break this normalization. In that case, separate tables for name and
country could be split off and the TRUE/FALSE traveled
column could also be removed entirely. The existence of a
year
value could then represent whether or not each person
traveled there. The columns of the resulting join table could be: the
name foreign key(s), the country foreign key(s), and the year(s).
name <- c(1, 1, 2, 2, 2, 4)
country <- c(1, 2, 3, 3, 4, 4)
year <- c(2019, 2021, 2018, 2023, 2022, 2023)
travel_2 <- data.frame(name, country, year)
travel_2
## name country year
## 1 1 1 2019
## 2 1 2 2021
## 3 2 3 2018
## 4 2 3 2023
## 5 2 4 2022
## 6 4 4 2023
majors <- read.csv(
"https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv",
header = TRUE
)
data_stats_majors <- str_subset(majors$Major, pattern = "DATA|STATISTICS")
print(data_stats_majors)
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"
## [3] "STATISTICS AND DECISION SCIENCE"
(.)\1\1
This parentheses is a group of only one character (any character,
denoted by the period). The \1
is referring to this group,
and requiring it to be repeated. So this character must appear 3 times
in a row to match. For example, aaa
.
"(.)(.)\\2\\1"
The quotes just mean quotes, so this is a string version OF a regex.
This is 2 groups, each of any one character. Because this is a string,
the double slashes are escaped to one slash and denote order so next,
the 2nd group has to precede the 1st. For example
"azza"
.
(..)\1
This is a group of any 2 characters, repeated. For example,
xyxy
.
"(.).\\1.\\1"
This is a string again, and means a group of any single character,
followed by any character, then the group, any character and finally the
group again. For example, "bybxb"
.
"(.)(.)(.).*\\3\\2\\1"
This is also a string of a regex. Groups 1, 2 and 3, each of any
single character, followed by any number of any characters, then groups
3, 2, and 1 in that order. For example,
"xyzHelloWorldzyx"
.
Start and end with the same character: ^(.).*\1$
Contain a repeated pair of letters (e.g. “church” contains “ch”
repeated twice.): (..).*\1
Contain one letter repeated in at least three places (e.g. “eleven”
contains three “e”s.): (.).*\1.*\1