Week 3 Assignment

1. Normalization

Example A:

Here is a very large, publicly-available sample dataset taken from fivethirtyeight, expressing age information for Congress:

congress <- read.csv(
  "https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/congress-demographics/data_aging_congress.csv",
  header = TRUE)

glimpse(congress)
## Rows: 29,120
## Columns: 13
## $ congress      <int> 82, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, …
## $ start_date    <chr> "1951-01-03", "1947-01-03", "1949-01-03", "1951-01-03", …
## $ chamber       <chr> "House", "House", "House", "House", "House", "House", "H…
## $ state_abbrev  <chr> "ND", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "V…
## $ party_code    <int> 200, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 1…
## $ bioname       <chr> "AANDAHL, Fred George", "ABBITT, Watkins Moorman", "ABBI…
## $ bioguide_id   <chr> "A000001", "A000002", "A000002", "A000002", "A000002", "…
## $ birthday      <chr> "1897-04-09", "1908-05-21", "1908-05-21", "1908-05-21", …
## $ cmltv_cong    <int> 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4…
## $ cmltv_chamber <int> 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4…
## $ age_days      <int> 19626, 14106, 14837, 15567, 16298, 17028, 17759, 18489, …
## $ age_years     <dbl> 53.73306, 38.62012, 40.62149, 42.62012, 44.62149, 46.620…
## $ generation    <chr> "Lost", "Greatest", "Greatest", "Greatest", "Greatest", …

The values in the generation column are defined in the README of the github:

Generation the member belonged to, based on the year of birth. Generations in the data are defined as follows: Gilded (1822-1842), Progressive (1843-1859), Missionary (1860-1882), Lost (1883-1900), Greatest (1901-1927), Silent (1928-1945), baby boomer (1946-1964), Generation X (1965-1980), millennial (1981-1996), Generation Z (1997-2012).

To normalize this, the columns for age/generation could be split off into a separate data frame. Each member of congress could then be assigned to a generation based on the birth year values, instead of repeating the rows.

gen_name <- c("Gilded", "Progressive", "Missionary", "Lost", "Greatest", "Silent", "baby boomer", "Generation X", "millennial", "Generation Z")
start_year <- c(1822, 1843, 1860, 1883, 1901, 1928, 1946, 1965, 1981, 1997)
end_year <- c(1842, 1859, 1882, 1900, 1927, 1945, 1964, 1980, 1996, 2012)
generations <- data.frame(gen_name, start_year, end_year)

generations
##        gen_name start_year end_year
## 1        Gilded       1822     1842
## 2   Progressive       1843     1859
## 3    Missionary       1860     1882
## 4          Lost       1883     1900
## 5      Greatest       1901     1927
## 6        Silent       1928     1945
## 7   baby boomer       1946     1964
## 8  Generation X       1965     1980
## 9    millennial       1981     1996
## 10 Generation Z       1997     2012

Example B:

Building an example dataset from scratch based on my music tastes, I may initially want to include columns like the following:

sc_music <- data.frame(song_title, album_title, artist, genre, plays)

But this could easily result in rows with many repeated groups. To avoid repetition from the same album, artist or genre, the data frames could be split into multiple tables with only 2 columns each, a unique ID and a value:

artist_name <- c("Dom Dolla", "Taylor Swift", "BLACKPINK", "Young the Giant", "ODESZA")
artists <- data.frame(artist_name)
artists
##       artist_name
## 1       Dom Dolla
## 2    Taylor Swift
## 3       BLACKPINK
## 4 Young the Giant
## 5          ODESZA
album_title <- c("ten days", "Lover", "THE ALBUM", "The Last Goodbye", "In Return")
albums <- data.frame(album_title)
albums
##        album_title
## 1         ten days
## 2            Lover
## 3        THE ALBUM
## 4 The Last Goodbye
## 5        In Return
genre_name <- c("pop", "house", "UK garage", "indie", "dance")
genres <- data.frame(genre_name)
genres
##   genre_name
## 1        pop
## 2      house
## 3  UK garage
## 4      indie
## 5      dance

**** EDIT **** From there, a final data frame listing songs could include the following columns: song title, album foreign key(s), artist foreign primary key(s), genre foreign key(s), and at least one column dependent solely on the song title, like the number of times I’ve played the track.

Example C:

I created the following csv file detailing a simplified travel history of my friends:

travel <- read.csv(file="travel.csv", header = TRUE)
travel
##        name   country traveled
## 1     Steph     Japan     TRUE
## 2     Steph     Italy    FALSE
## 3     Steph     China     TRUE
## 4     Steph Sri Lanka    FALSE
## 5     Steph   Belgium     TRUE
## 6     Thach     Japan     TRUE
## 7     Thach     Italy     TRUE
## 8     Thach     China    FALSE
## 9     Thach Sri Lanka    FALSE
## 10    Thach   Belgium     TRUE
## 11 Kristina     Japan     TRUE
## 12 Kristina     Italy     TRUE
## 13 Kristina     China    FALSE
## 14 Kristina Sri Lanka     TRUE
## 15 Kristina   Belgium     TRUE
## 16    Kevin     Japan     TRUE
## 17    Kevin     Italy    FALSE
## 18    Kevin     China    FALSE
## 19    Kevin Sri Lanka    FALSE
## 20    Kevin   Belgium    FALSE
## 21      Jay     Japan    FALSE
## 22      Jay     Italy    FALSE
## 23      Jay     China     TRUE
## 24      Jay Sri Lanka    FALSE
## 25      Jay   Belgium     TRUE

This data is normalized / tidy because even though names and countries appear in multiple rows, each pair of name-country could be used as a composite primary key. The traveled boolean column value is fully dependent on that key, the whole key and nothing but that key.

**** EDIT **** Adding another column like year would break this normalization. In that case, separate tables for name and country could be split off and the TRUE/FALSE traveled column could also be removed entirely. The existence of a year value could then represent whether or not each person traveled there. The columns of the resulting join table could be: the name foreign key(s), the country foreign key(s), and the year(s).

name <- c(1, 1, 2, 2, 2, 4)
country <- c(1, 2, 3, 3, 4, 4)
year <- c(2019, 2021, 2018, 2023, 2022, 2023)
travel_2 <- data.frame(name, country, year)

travel_2
##   name country year
## 1    1       1 2019
## 2    1       2 2021
## 3    2       3 2018
## 4    2       3 2023
## 5    2       4 2022
## 6    4       4 2023

2. Character Manipulation

majors <- read.csv(
  "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv",
  header = TRUE
)

data_stats_majors <- str_subset(majors$Major, pattern = "DATA|STATISTICS")
print(data_stats_majors)
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"

3. RegEx match

(.)\1\1

This parentheses is a group of only one character (any character, denoted by the period). The \1 is referring to this group, and requiring it to be repeated. So this character must appear 3 times in a row to match. For example, aaa.

"(.)(.)\\2\\1"

The quotes just mean quotes, so this is a string version OF a regex. This is 2 groups, each of any one character. Because this is a string, the double slashes are escaped to one slash and denote order so next, the 2nd group has to precede the 1st. For example "azza".

(..)\1

This is a group of any 2 characters, repeated. For example, xyxy.

"(.).\\1.\\1"

This is a string again, and means a group of any single character, followed by any character, then the group, any character and finally the group again. For example, "bybxb".

"(.)(.)(.).*\\3\\2\\1"

This is also a string of a regex. Groups 1, 2 and 3, each of any single character, followed by any number of any characters, then groups 3, 2, and 1 in that order. For example, "xyzHelloWorldzyx".

4. RegEx construction

Start and end with the same character: ^(.).*\1$

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.): (..).*\1

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.): (.).*\1.*\1