Overview

The following assignment demonstrates data normalization and character manipulation. Data normalization organizes data in a consistent way and to reduce redundancy and ensure data integrity. Character manipulation is an essential skill needed to clean and transform character data.


Normalization

Question 1

College course data was synthetically generated to create four dataframes that demonstrate normalization. The dataframes include student, course, department, and teacher related info in third normal form. A seed was set to ensure reproducible generated data results .

set.seed(789)
student_df = data.frame(
  STUDENT_ID = sprintf("S%02d", 1:30),
  COURSE_ID = sample(c("D101", "D102", "D103", "N101", "N102", "N103"), 30, replace = TRUE),
  STUDENT_FIRST_NAME = randomNames(30, which.names="first"),
  STUDENT_LAST_NAME = randomNames(30, which.names="last")
)

course_df = data.frame(
  COURSE_ID = c("D101", "D102", "D103", "N101", "N102", "N103"),
  DEPARTMENT_ID = c("DEPT1", "DEPT1", "DEPT1", "DEPT2", "DEPT2", "DEPT2"),
  TEACHER_ID = c("T1", "T1", "T2", "T3", "T4", "T5"),
  COURSE_NAME = c("Data Analytics", "Stats", "Machine Learning","Chemistry", "Biology", "Pharmacology")
)

teacher_df = data.frame(
  TEACHER_ID =c ("T1", "T2", "T3", "T4", "T5"),
  TEACHER_FIRST_NAME = randomNames(5, which.names="first"),
  TEACHER_LAST_NAME = randomNames(5, which.names="last"),
  TEACHER_CONTACT = sample(1234567:87654321, 5)
)

department_df = data.frame(
  DEPARTMENT_ID = c("DEPT1", "DEPT2"),
  DEPARTMENT_NAME = c("Data Analytics", "Nursing")
)

join_df = left_join(student_df, course_df, by = join_by(COURSE_ID))
join_df = left_join(join_df, teacher_df, by = join_by(TEACHER_ID))
join_df = left_join(join_df, department_df, by = join_by(DEPARTMENT_ID))
join_df = join_df |>
  select(STUDENT_ID, STUDENT_FIRST_NAME, STUDENT_LAST_NAME,
         COURSE_ID, COURSE_NAME, TEACHER_ID, TEACHER_FIRST_NAME,
         TEACHER_LAST_NAME, TEACHER_CONTACT, DEPARTMENT_ID, DEPARTMENT_NAME)

Normalized Dataframes

Student Dataframe
STUDENT_ID COURSE_ID STUDENT_FIRST_NAME STUDENT_LAST_NAME
S01 N102 Dallas Harrell
S02 N101 Shannon Lagunas-Tapia
S03 N101 Faaid Stalberte
S04 N103 Miqdaad Borski
S05 D102 Zachary Lao
S06 D102 Rawda el-Javid
S07 D103 Zachary Davis
S08 N102 Khalissa Townsend
S09 N101 Emily Jiang
S10 D103 Jacqueline el-Amiri
S11 D103 Travis al-Husain
S12 D103 Matthew Ankele
S13 N103 Holly Kim
S14 N101 Jennifer al-Abbasi
S15 D101 Jami Gallegos
S16 D101 Brett al-Dia
S17 N103 Tanisha Robbins
S18 D102 Mardiyya Tolbert
S19 N102 Diana Story
S20 D102 Lizsandra Rodriguez
S21 D102 Aphisitt Lor
S22 N102 Fakeeha Vidana
S23 N101 Rebecca Nash
S24 D102 Mei Brown
S25 N103 Teresa el-Doud
S26 N103 Isaiah Huizar
S27 N102 Sharmayne Castello
S28 D103 Jacob al-Guler
S29 D102 Hannah al-Jabara
S30 N102 Shenoa Chimedtseren
Course Dataframe
COURSE_ID DEPARTMENT_ID TEACHER_ID COURSE_NAME
D101 DEPT1 T1 Data Analytics
D102 DEPT1 T1 Stats
D103 DEPT1 T2 Machine Learning
N101 DEPT2 T3 Chemistry
N102 DEPT2 T4 Biology
N103 DEPT2 T5 Pharmacology
Teacher Dataframe
TEACHER_ID TEACHER_FIRST_NAME TEACHER_LAST_NAME TEACHER_CONTACT
T1 Mario Avelar 46110472
T2 Abdur Rasheed Mestas 59420892
T3 Matthew Gilbert 23187419
T4 Nhi Heng 16379084
T5 Tahiyya Keuper 50391373
Department Dataframe
DEPARTMENT_ID DEPARTMENT_NAME
DEPT1 Data Analytics
DEPT2 Nursing

Joined Normalized Dataframes

All Joined Dataframe (scroll to view all columns and rows)
STUDENT_ID STUDENT_FIRST_NAME STUDENT_LAST_NAME COURSE_ID COURSE_NAME TEACHER_ID TEACHER_FIRST_NAME TEACHER_LAST_NAME TEACHER_CONTACT DEPARTMENT_ID DEPARTMENT_NAME
S01 Dallas Harrell N102 Biology T4 Nhi Heng 16379084 DEPT2 Nursing
S02 Shannon Lagunas-Tapia N101 Chemistry T3 Matthew Gilbert 23187419 DEPT2 Nursing
S03 Faaid Stalberte N101 Chemistry T3 Matthew Gilbert 23187419 DEPT2 Nursing
S04 Miqdaad Borski N103 Pharmacology T5 Tahiyya Keuper 50391373 DEPT2 Nursing
S05 Zachary Lao D102 Stats T1 Mario Avelar 46110472 DEPT1 Data Analytics
S06 Rawda el-Javid D102 Stats T1 Mario Avelar 46110472 DEPT1 Data Analytics
S07 Zachary Davis D103 Machine Learning T2 Abdur Rasheed Mestas 59420892 DEPT1 Data Analytics
S08 Khalissa Townsend N102 Biology T4 Nhi Heng 16379084 DEPT2 Nursing
S09 Emily Jiang N101 Chemistry T3 Matthew Gilbert 23187419 DEPT2 Nursing
S10 Jacqueline el-Amiri D103 Machine Learning T2 Abdur Rasheed Mestas 59420892 DEPT1 Data Analytics
S11 Travis al-Husain D103 Machine Learning T2 Abdur Rasheed Mestas 59420892 DEPT1 Data Analytics
S12 Matthew Ankele D103 Machine Learning T2 Abdur Rasheed Mestas 59420892 DEPT1 Data Analytics
S13 Holly Kim N103 Pharmacology T5 Tahiyya Keuper 50391373 DEPT2 Nursing
S14 Jennifer al-Abbasi N101 Chemistry T3 Matthew Gilbert 23187419 DEPT2 Nursing
S15 Jami Gallegos D101 Data Analytics T1 Mario Avelar 46110472 DEPT1 Data Analytics
S16 Brett al-Dia D101 Data Analytics T1 Mario Avelar 46110472 DEPT1 Data Analytics
S17 Tanisha Robbins N103 Pharmacology T5 Tahiyya Keuper 50391373 DEPT2 Nursing
S18 Mardiyya Tolbert D102 Stats T1 Mario Avelar 46110472 DEPT1 Data Analytics
S19 Diana Story N102 Biology T4 Nhi Heng 16379084 DEPT2 Nursing
S20 Lizsandra Rodriguez D102 Stats T1 Mario Avelar 46110472 DEPT1 Data Analytics
S21 Aphisitt Lor D102 Stats T1 Mario Avelar 46110472 DEPT1 Data Analytics
S22 Fakeeha Vidana N102 Biology T4 Nhi Heng 16379084 DEPT2 Nursing
S23 Rebecca Nash N101 Chemistry T3 Matthew Gilbert 23187419 DEPT2 Nursing
S24 Mei Brown D102 Stats T1 Mario Avelar 46110472 DEPT1 Data Analytics
S25 Teresa el-Doud N103 Pharmacology T5 Tahiyya Keuper 50391373 DEPT2 Nursing
S26 Isaiah Huizar N103 Pharmacology T5 Tahiyya Keuper 50391373 DEPT2 Nursing
S27 Sharmayne Castello N102 Biology T4 Nhi Heng 16379084 DEPT2 Nursing
S28 Jacob al-Guler D103 Machine Learning T2 Abdur Rasheed Mestas 59420892 DEPT1 Data Analytics
S29 Hannah al-Jabara D102 Stats T1 Mario Avelar 46110472 DEPT1 Data Analytics
S30 Shenoa Chimedtseren N102 Biology T4 Nhi Heng 16379084 DEPT2 Nursing

Character Manipulation

Question 2

Import Data

Read in college majors csv file as data frame

cm_df = read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv")


#### College Major String Pattern Match Identifies the majors that contain either “DATA” or “STATISTICS”

pat_df = cm_df |> 
  filter(str_detect(Major, "DATA|STATISTICS")) |>
  select(Major)
colnames(pat_df) = c('Majors that contain either \"DATA\" or \"STATISTICS\"')
Majors that contain either “DATA” or “STATISTICS”
MANAGEMENT INFORMATION SYSTEMS AND STATISTICS
COMPUTER PROGRAMMING AND DATA PROCESSING
STATISTICS AND DECISION SCIENCE

Question 3

Descriptions for five regular expressions have been provided below. Each expression description includes a write up on what the expression will match with supporting examples.

(.)\1\1

This regex will match strings that contain 3 identical consecutive characters.

  • Breakdown:
    • (.) matches any single character
    • \1 is a back reference to the first captured group
  • Examples: ‘111’, ‘aaa’

"(.)(.)\\2\\1"

This is a string that represents a regex which matches strings with any 2 characters followed by the second character and first character repeated in that order.

  • Breakdown:
    • The first (.) matches any single character and represents the first captured group
    • The second (.) again matches any single character and represents the second captured group
    • These captured groups are then back referenced with the \\2 referring to the second captured group and \\1 referring to the first captured group.
  • Examples: ‘1221’, ‘abba’, ‘3333’

(..)\1

This regex will match strings with a repeated pair of characters.

  • Breakdown:
    • (..) matches any 2 characters and represents the first captured group
    • \1 is a back reference to the first captured group.
  • Examples: ‘1212’, ‘abab’, ‘cccc’

"(.).\\1.\\1"

This is a string that represents a regex which matches strings with any character repeated in 3 places if a single character is in between the repeated characters.

  • Breakdown:
    • (.) matches any single character and represents the first captured group
    • . matches any single character
    • \\1 is a back reference referring to the first captured group
  • Examples: ‘12131’, ‘abaca’, ‘dddd’

"(.)(.)(.).*\\3\\2\\1"

This is a string that represents a regex which matches strings with any 3 characters followed by an optional character that can be repeated 0 or more times. The pattern ends with same first 3 characters being repeated in reverse order.

  • Breakdown:
    • (.) matches any single character and creates a captured group. This is repeated 3 times; thus, three captured groups are defined
    • .* optionally matches any character with a repeat (i.e. it matches any number of times, including 0).
    • \\3\\2\\1 are back references referring to the third, second, and first captured group characters respectively
  • Examples: ‘123321’, ‘abcdddddddcba’

Question 4

Regex to match words that start and end with the same character
Regex: ^(.).*\\1$

Test on words character vector

str_subset(words, "^(.).*\\1$")
##  [1] "america"    "area"       "dad"        "dead"       "depend"    
##  [6] "educate"    "else"       "encourage"  "engine"     "europe"    
## [11] "evidence"   "example"    "excuse"     "exercise"   "expense"   
## [16] "experience" "eye"        "health"     "high"       "knock"     
## [21] "level"      "local"      "nation"     "non"        "rather"    
## [26] "refer"      "remember"   "serious"    "stairs"     "test"      
## [31] "tonight"    "transport"  "treat"      "trust"      "window"    
## [36] "yesterday"

Regex to match words that contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
Regex: ([a-zA-Z]{2}).*\\1

Test on words character vector

str_subset(words, "([a-zA-Z]{2}).*\\1")
##  [1] "appropriate" "church"      "condition"   "decide"      "environment"
##  [6] "london"      "paragraph"   "particular"  "photograph"  "prepare"    
## [11] "pressure"    "remember"    "represent"   "require"     "sense"      
## [16] "therefore"   "understand"  "whether"

Regex to match words that contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
Regex: ([a-zA-Z])(.*\\1){2,}

Test on words character vector

str_subset(words, "([a-zA-Z])(.*\\1){2,}")
##  [1] "appropriate" "available"   "believe"     "between"     "business"   
##  [6] "degree"      "difference"  "discuss"     "eleven"      "environment"
## [11] "evidence"    "exercise"    "expense"     "experience"  "individual" 
## [16] "paragraph"   "receive"     "remember"    "represent"   "telephone"  
## [21] "therefore"   "tomorrow"