The following assignment demonstrates data normalization and character manipulation. Data normalization organizes data in a consistent way and to reduce redundancy and ensure data integrity. Character manipulation is an essential skill needed to clean and transform character data.
College course data was synthetically generated to create four dataframes that demonstrate normalization. The dataframes include student, course, department, and teacher related info in third normal form. A seed was set to ensure reproducible generated data results .
set.seed(789)
student_df = data.frame(
STUDENT_ID = sprintf("S%02d", 1:30),
COURSE_ID = sample(c("D101", "D102", "D103", "N101", "N102", "N103"), 30, replace = TRUE),
STUDENT_FIRST_NAME = randomNames(30, which.names="first"),
STUDENT_LAST_NAME = randomNames(30, which.names="last")
)
course_df = data.frame(
COURSE_ID = c("D101", "D102", "D103", "N101", "N102", "N103"),
DEPARTMENT_ID = c("DEPT1", "DEPT1", "DEPT1", "DEPT2", "DEPT2", "DEPT2"),
TEACHER_ID = c("T1", "T1", "T2", "T3", "T4", "T5"),
COURSE_NAME = c("Data Analytics", "Stats", "Machine Learning","Chemistry", "Biology", "Pharmacology")
)
teacher_df = data.frame(
TEACHER_ID =c ("T1", "T2", "T3", "T4", "T5"),
TEACHER_FIRST_NAME = randomNames(5, which.names="first"),
TEACHER_LAST_NAME = randomNames(5, which.names="last"),
TEACHER_CONTACT = sample(1234567:87654321, 5)
)
department_df = data.frame(
DEPARTMENT_ID = c("DEPT1", "DEPT2"),
DEPARTMENT_NAME = c("Data Analytics", "Nursing")
)
join_df = left_join(student_df, course_df, by = join_by(COURSE_ID))
join_df = left_join(join_df, teacher_df, by = join_by(TEACHER_ID))
join_df = left_join(join_df, department_df, by = join_by(DEPARTMENT_ID))
join_df = join_df |>
select(STUDENT_ID, STUDENT_FIRST_NAME, STUDENT_LAST_NAME,
COURSE_ID, COURSE_NAME, TEACHER_ID, TEACHER_FIRST_NAME,
TEACHER_LAST_NAME, TEACHER_CONTACT, DEPARTMENT_ID, DEPARTMENT_NAME)
| STUDENT_ID | COURSE_ID | STUDENT_FIRST_NAME | STUDENT_LAST_NAME |
|---|---|---|---|
| S01 | N102 | Dallas | Harrell |
| S02 | N101 | Shannon | Lagunas-Tapia |
| S03 | N101 | Faaid | Stalberte |
| S04 | N103 | Miqdaad | Borski |
| S05 | D102 | Zachary | Lao |
| S06 | D102 | Rawda | el-Javid |
| S07 | D103 | Zachary | Davis |
| S08 | N102 | Khalissa | Townsend |
| S09 | N101 | Emily | Jiang |
| S10 | D103 | Jacqueline | el-Amiri |
| S11 | D103 | Travis | al-Husain |
| S12 | D103 | Matthew | Ankele |
| S13 | N103 | Holly | Kim |
| S14 | N101 | Jennifer | al-Abbasi |
| S15 | D101 | Jami | Gallegos |
| S16 | D101 | Brett | al-Dia |
| S17 | N103 | Tanisha | Robbins |
| S18 | D102 | Mardiyya | Tolbert |
| S19 | N102 | Diana | Story |
| S20 | D102 | Lizsandra | Rodriguez |
| S21 | D102 | Aphisitt | Lor |
| S22 | N102 | Fakeeha | Vidana |
| S23 | N101 | Rebecca | Nash |
| S24 | D102 | Mei | Brown |
| S25 | N103 | Teresa | el-Doud |
| S26 | N103 | Isaiah | Huizar |
| S27 | N102 | Sharmayne | Castello |
| S28 | D103 | Jacob | al-Guler |
| S29 | D102 | Hannah | al-Jabara |
| S30 | N102 | Shenoa | Chimedtseren |
| COURSE_ID | DEPARTMENT_ID | TEACHER_ID | COURSE_NAME |
|---|---|---|---|
| D101 | DEPT1 | T1 | Data Analytics |
| D102 | DEPT1 | T1 | Stats |
| D103 | DEPT1 | T2 | Machine Learning |
| N101 | DEPT2 | T3 | Chemistry |
| N102 | DEPT2 | T4 | Biology |
| N103 | DEPT2 | T5 | Pharmacology |
| TEACHER_ID | TEACHER_FIRST_NAME | TEACHER_LAST_NAME | TEACHER_CONTACT |
|---|---|---|---|
| T1 | Mario | Avelar | 46110472 |
| T2 | Abdur Rasheed | Mestas | 59420892 |
| T3 | Matthew | Gilbert | 23187419 |
| T4 | Nhi | Heng | 16379084 |
| T5 | Tahiyya | Keuper | 50391373 |
| DEPARTMENT_ID | DEPARTMENT_NAME |
|---|---|
| DEPT1 | Data Analytics |
| DEPT2 | Nursing |
| STUDENT_ID | STUDENT_FIRST_NAME | STUDENT_LAST_NAME | COURSE_ID | COURSE_NAME | TEACHER_ID | TEACHER_FIRST_NAME | TEACHER_LAST_NAME | TEACHER_CONTACT | DEPARTMENT_ID | DEPARTMENT_NAME |
|---|---|---|---|---|---|---|---|---|---|---|
| S01 | Dallas | Harrell | N102 | Biology | T4 | Nhi | Heng | 16379084 | DEPT2 | Nursing |
| S02 | Shannon | Lagunas-Tapia | N101 | Chemistry | T3 | Matthew | Gilbert | 23187419 | DEPT2 | Nursing |
| S03 | Faaid | Stalberte | N101 | Chemistry | T3 | Matthew | Gilbert | 23187419 | DEPT2 | Nursing |
| S04 | Miqdaad | Borski | N103 | Pharmacology | T5 | Tahiyya | Keuper | 50391373 | DEPT2 | Nursing |
| S05 | Zachary | Lao | D102 | Stats | T1 | Mario | Avelar | 46110472 | DEPT1 | Data Analytics |
| S06 | Rawda | el-Javid | D102 | Stats | T1 | Mario | Avelar | 46110472 | DEPT1 | Data Analytics |
| S07 | Zachary | Davis | D103 | Machine Learning | T2 | Abdur Rasheed | Mestas | 59420892 | DEPT1 | Data Analytics |
| S08 | Khalissa | Townsend | N102 | Biology | T4 | Nhi | Heng | 16379084 | DEPT2 | Nursing |
| S09 | Emily | Jiang | N101 | Chemistry | T3 | Matthew | Gilbert | 23187419 | DEPT2 | Nursing |
| S10 | Jacqueline | el-Amiri | D103 | Machine Learning | T2 | Abdur Rasheed | Mestas | 59420892 | DEPT1 | Data Analytics |
| S11 | Travis | al-Husain | D103 | Machine Learning | T2 | Abdur Rasheed | Mestas | 59420892 | DEPT1 | Data Analytics |
| S12 | Matthew | Ankele | D103 | Machine Learning | T2 | Abdur Rasheed | Mestas | 59420892 | DEPT1 | Data Analytics |
| S13 | Holly | Kim | N103 | Pharmacology | T5 | Tahiyya | Keuper | 50391373 | DEPT2 | Nursing |
| S14 | Jennifer | al-Abbasi | N101 | Chemistry | T3 | Matthew | Gilbert | 23187419 | DEPT2 | Nursing |
| S15 | Jami | Gallegos | D101 | Data Analytics | T1 | Mario | Avelar | 46110472 | DEPT1 | Data Analytics |
| S16 | Brett | al-Dia | D101 | Data Analytics | T1 | Mario | Avelar | 46110472 | DEPT1 | Data Analytics |
| S17 | Tanisha | Robbins | N103 | Pharmacology | T5 | Tahiyya | Keuper | 50391373 | DEPT2 | Nursing |
| S18 | Mardiyya | Tolbert | D102 | Stats | T1 | Mario | Avelar | 46110472 | DEPT1 | Data Analytics |
| S19 | Diana | Story | N102 | Biology | T4 | Nhi | Heng | 16379084 | DEPT2 | Nursing |
| S20 | Lizsandra | Rodriguez | D102 | Stats | T1 | Mario | Avelar | 46110472 | DEPT1 | Data Analytics |
| S21 | Aphisitt | Lor | D102 | Stats | T1 | Mario | Avelar | 46110472 | DEPT1 | Data Analytics |
| S22 | Fakeeha | Vidana | N102 | Biology | T4 | Nhi | Heng | 16379084 | DEPT2 | Nursing |
| S23 | Rebecca | Nash | N101 | Chemistry | T3 | Matthew | Gilbert | 23187419 | DEPT2 | Nursing |
| S24 | Mei | Brown | D102 | Stats | T1 | Mario | Avelar | 46110472 | DEPT1 | Data Analytics |
| S25 | Teresa | el-Doud | N103 | Pharmacology | T5 | Tahiyya | Keuper | 50391373 | DEPT2 | Nursing |
| S26 | Isaiah | Huizar | N103 | Pharmacology | T5 | Tahiyya | Keuper | 50391373 | DEPT2 | Nursing |
| S27 | Sharmayne | Castello | N102 | Biology | T4 | Nhi | Heng | 16379084 | DEPT2 | Nursing |
| S28 | Jacob | al-Guler | D103 | Machine Learning | T2 | Abdur Rasheed | Mestas | 59420892 | DEPT1 | Data Analytics |
| S29 | Hannah | al-Jabara | D102 | Stats | T1 | Mario | Avelar | 46110472 | DEPT1 | Data Analytics |
| S30 | Shenoa | Chimedtseren | N102 | Biology | T4 | Nhi | Heng | 16379084 | DEPT2 | Nursing |
Read in college majors csv file as data frame
cm_df = read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv")
#### College Major String Pattern Match Identifies the majors that
contain either “DATA” or “STATISTICS”
pat_df = cm_df |>
filter(str_detect(Major, "DATA|STATISTICS")) |>
select(Major)
colnames(pat_df) = c('Majors that contain either \"DATA\" or \"STATISTICS\"')
| Majors that contain either “DATA” or “STATISTICS” |
|---|
| MANAGEMENT INFORMATION SYSTEMS AND STATISTICS |
| COMPUTER PROGRAMMING AND DATA PROCESSING |
| STATISTICS AND DECISION SCIENCE |
Descriptions for five regular expressions have been provided below. Each expression description includes a write up on what the expression will match with supporting examples.
(.)\1\1This regex will match strings that contain 3 identical consecutive characters.
(.) matches any single character\1 is a back reference to the first captured group"(.)(.)\\2\\1"This is a string that represents a regex which matches strings with any 2 characters followed by the second character and first character repeated in that order.
(.) matches any single character and
represents the first captured group(.) again matches any single character and
represents the second captured group\\2 referring to the second captured group and
\\1 referring to the first captured group.(..)\1This regex will match strings with a repeated pair of characters.
(..) matches any 2 characters and represents the first
captured group\1 is a back reference to the first captured
group."(.).\\1.\\1"This is a string that represents a regex which matches strings with any character repeated in 3 places if a single character is in between the repeated characters.
(.) matches any single character and represents the
first captured group. matches any single character\\1 is a back reference referring to the first captured
group"(.)(.)(.).*\\3\\2\\1"This is a string that represents a regex which matches strings with any 3 characters followed by an optional character that can be repeated 0 or more times. The pattern ends with same first 3 characters being repeated in reverse order.
(.) matches any single character and creates a captured
group. This is repeated 3 times; thus, three captured groups are
defined.* optionally matches any character with a repeat
(i.e. it matches any number of times, including 0).\\3\\2\\1 are back references referring to the third,
second, and first captured group characters respectivelyRegex to match words that start and end with the same character
Regex:
^(.).*\\1$
Test on words character vector
str_subset(words, "^(.).*\\1$")
## [1] "america" "area" "dad" "dead" "depend"
## [6] "educate" "else" "encourage" "engine" "europe"
## [11] "evidence" "example" "excuse" "exercise" "expense"
## [16] "experience" "eye" "health" "high" "knock"
## [21] "level" "local" "nation" "non" "rather"
## [26] "refer" "remember" "serious" "stairs" "test"
## [31] "tonight" "transport" "treat" "trust" "window"
## [36] "yesterday"
Regex to match words that contain a repeated pair of letters
(e.g. “church” contains “ch” repeated twice.)
Regex:
([a-zA-Z]{2}).*\\1
Test on words character vector
str_subset(words, "([a-zA-Z]{2}).*\\1")
## [1] "appropriate" "church" "condition" "decide" "environment"
## [6] "london" "paragraph" "particular" "photograph" "prepare"
## [11] "pressure" "remember" "represent" "require" "sense"
## [16] "therefore" "understand" "whether"
Regex to match words that contain one letter repeated in at least
three places (e.g. “eleven” contains three “e”s.)
Regex:
([a-zA-Z])(.*\\1){2,}
Test on words character vector
str_subset(words, "([a-zA-Z])(.*\\1){2,}")
## [1] "appropriate" "available" "believe" "between" "business"
## [6] "degree" "difference" "discuss" "eleven" "environment"
## [11] "evidence" "exercise" "expense" "experience" "individual"
## [16] "paragraph" "receive" "remember" "represent" "telephone"
## [21] "therefore" "tomorrow"