DATA607_Week 3 Assignment

Overview

The following assignment demonstrates data normalization and character manipulation. Data normalization organizes data in a consistent way and to reduce redundancy and ensure data integrity. Character manipulation is an essential skill needed to clean and transform character data.

Normalization

Question 1

College course data was synthetically generated to create four dataframes that demonstrate normalization. The dataframes include student, course, department, and teacher related info in third normal form. A seed was set to ensure reproducible generated data results .

set.seed(789)
student_df = data.frame(
  STUDENT_ID = sprintf("S%02d", 1:30),
  COURSE_ID = sample(c("D101", "D102", "D103", "N101", "N102", "N103"), 30, replace = TRUE),
  STUDENT_FIRST_NAME = randomNames(30, which.names="first"),
  STUDENT_LAST_NAME = randomNames(30, which.names="last")
)

course_df = data.frame(
  COURSE_ID = c("D101", "D102", "D103", "N101", "N102", "N103"),
  DEPARTMENT_ID = c("DEPT1", "DEPT1", "DEPT1", "DEPT2", "DEPT2", "DEPT2"),
  TEACHER_ID = c("T1", "T1", "T2", "T3", "T4", "T5"),
  COURSE_NAME = c("Data Analytics", "Stats", "Machine Learning","Chemistry", "Biology", "Pharmacology")
)

teacher_df = data.frame(
  TEACHER_ID =c ("T1", "T2", "T3", "T4", "T5"),
  TEACHER_FIRST_NAME = randomNames(5, which.names="first"),
  TEACHER_LAST_NAME = randomNames(5, which.names="last"),
  TEACHER_CONTACT = sample(1234567:87654321, 5)
)

department_df = data.frame(
  DEPARTMENT_ID = c("DEPT1", "DEPT2"),
  DEPARTMENT_NAME = c("Data Analytics", "Nursing")
)

join_df = left_join(student_df, course_df, by = join_by(COURSE_ID))
join_df = left_join(join_df, teacher_df, by = join_by(TEACHER_ID))
join_df = left_join(join_df, department_df, by = join_by(DEPARTMENT_ID))
join_df = join_df |>
  select(STUDENT_ID, STUDENT_FIRST_NAME, STUDENT_LAST_NAME,
         COURSE_ID, COURSE_NAME, TEACHER_ID, TEACHER_FIRST_NAME,
         TEACHER_LAST_NAME, TEACHER_CONTACT, DEPARTMENT_ID, DEPARTMENT_NAME)

Normalized Dataframes

Student Dataframe
STUDENT_ID	COURSE_ID	STUDENT_FIRST_NAME	STUDENT_LAST_NAME
S01	N102	Dallas	Harrell
S02	N101	Shannon	Lagunas-Tapia
S03	N101	Faaid	Stalberte
S04	N103	Miqdaad	Borski
S05	D102	Zachary	Lao
S06	D102	Rawda	el-Javid
S07	D103	Zachary	Davis
S08	N102	Khalissa	Townsend
S09	N101	Emily	Jiang
S10	D103	Jacqueline	el-Amiri
S11	D103	Travis	al-Husain
S12	D103	Matthew	Ankele
S13	N103	Holly	Kim
S14	N101	Jennifer	al-Abbasi
S15	D101	Jami	Gallegos
S16	D101	Brett	al-Dia
S17	N103	Tanisha	Robbins
S18	D102	Mardiyya	Tolbert
S19	N102	Diana	Story
S20	D102	Lizsandra	Rodriguez
S21	D102	Aphisitt	Lor
S22	N102	Fakeeha	Vidana
S23	N101	Rebecca	Nash
S24	D102	Mei	Brown
S25	N103	Teresa	el-Doud
S26	N103	Isaiah	Huizar
S27	N102	Sharmayne	Castello
S28	D103	Jacob	al-Guler
S29	D102	Hannah	al-Jabara
S30	N102	Shenoa	Chimedtseren

Course Dataframe
COURSE_ID	DEPARTMENT_ID	TEACHER_ID	COURSE_NAME
D101	DEPT1	T1	Data Analytics
D102	DEPT1	T1	Stats
D103	DEPT1	T2	Machine Learning
N101	DEPT2	T3	Chemistry
N102	DEPT2	T4	Biology
N103	DEPT2	T5	Pharmacology

Teacher Dataframe
TEACHER_ID	TEACHER_FIRST_NAME	TEACHER_LAST_NAME	TEACHER_CONTACT
T1	Mario	Avelar	46110472
T2	Abdur Rasheed	Mestas	59420892
T3	Matthew	Gilbert	23187419
T4	Nhi	Heng	16379084
T5	Tahiyya	Keuper	50391373

Department Dataframe
DEPARTMENT_ID	DEPARTMENT_NAME
DEPT1	Data Analytics
DEPT2	Nursing

Joined Normalized Dataframes

All Joined Dataframe (scroll to view all columns and rows)
STUDENT_ID	STUDENT_FIRST_NAME	STUDENT_LAST_NAME	COURSE_ID	COURSE_NAME	TEACHER_ID	TEACHER_FIRST_NAME	TEACHER_LAST_NAME	TEACHER_CONTACT	DEPARTMENT_ID	DEPARTMENT_NAME
S01	Dallas	Harrell	N102	Biology	T4	Nhi	Heng	16379084	DEPT2	Nursing
S02	Shannon	Lagunas-Tapia	N101	Chemistry	T3	Matthew	Gilbert	23187419	DEPT2	Nursing
S03	Faaid	Stalberte	N101	Chemistry	T3	Matthew	Gilbert	23187419	DEPT2	Nursing
S04	Miqdaad	Borski	N103	Pharmacology	T5	Tahiyya	Keuper	50391373	DEPT2	Nursing
S05	Zachary	Lao	D102	Stats	T1	Mario	Avelar	46110472	DEPT1	Data Analytics
S06	Rawda	el-Javid	D102	Stats	T1	Mario	Avelar	46110472	DEPT1	Data Analytics
S07	Zachary	Davis	D103	Machine Learning	T2	Abdur Rasheed	Mestas	59420892	DEPT1	Data Analytics
S08	Khalissa	Townsend	N102	Biology	T4	Nhi	Heng	16379084	DEPT2	Nursing
S09	Emily	Jiang	N101	Chemistry	T3	Matthew	Gilbert	23187419	DEPT2	Nursing
S10	Jacqueline	el-Amiri	D103	Machine Learning	T2	Abdur Rasheed	Mestas	59420892	DEPT1	Data Analytics
S11	Travis	al-Husain	D103	Machine Learning	T2	Abdur Rasheed	Mestas	59420892	DEPT1	Data Analytics
S12	Matthew	Ankele	D103	Machine Learning	T2	Abdur Rasheed	Mestas	59420892	DEPT1	Data Analytics
S13	Holly	Kim	N103	Pharmacology	T5	Tahiyya	Keuper	50391373	DEPT2	Nursing
S14	Jennifer	al-Abbasi	N101	Chemistry	T3	Matthew	Gilbert	23187419	DEPT2	Nursing
S15	Jami	Gallegos	D101	Data Analytics	T1	Mario	Avelar	46110472	DEPT1	Data Analytics
S16	Brett	al-Dia	D101	Data Analytics	T1	Mario	Avelar	46110472	DEPT1	Data Analytics
S17	Tanisha	Robbins	N103	Pharmacology	T5	Tahiyya	Keuper	50391373	DEPT2	Nursing
S18	Mardiyya	Tolbert	D102	Stats	T1	Mario	Avelar	46110472	DEPT1	Data Analytics
S19	Diana	Story	N102	Biology	T4	Nhi	Heng	16379084	DEPT2	Nursing
S20	Lizsandra	Rodriguez	D102	Stats	T1	Mario	Avelar	46110472	DEPT1	Data Analytics
S21	Aphisitt	Lor	D102	Stats	T1	Mario	Avelar	46110472	DEPT1	Data Analytics
S22	Fakeeha	Vidana	N102	Biology	T4	Nhi	Heng	16379084	DEPT2	Nursing
S23	Rebecca	Nash	N101	Chemistry	T3	Matthew	Gilbert	23187419	DEPT2	Nursing
S24	Mei	Brown	D102	Stats	T1	Mario	Avelar	46110472	DEPT1	Data Analytics
S25	Teresa	el-Doud	N103	Pharmacology	T5	Tahiyya	Keuper	50391373	DEPT2	Nursing
S26	Isaiah	Huizar	N103	Pharmacology	T5	Tahiyya	Keuper	50391373	DEPT2	Nursing
S27	Sharmayne	Castello	N102	Biology	T4	Nhi	Heng	16379084	DEPT2	Nursing
S28	Jacob	al-Guler	D103	Machine Learning	T2	Abdur Rasheed	Mestas	59420892	DEPT1	Data Analytics
S29	Hannah	al-Jabara	D102	Stats	T1	Mario	Avelar	46110472	DEPT1	Data Analytics
S30	Shenoa	Chimedtseren	N102	Biology	T4	Nhi	Heng	16379084	DEPT2	Nursing

Character Manipulation

Question 2

Import Data

Read in college majors csv file as data frame

cm_df = read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv")

#### College Major String Pattern Match Identifies the majors that contain either “DATA” or “STATISTICS”

pat_df = cm_df |> 
  filter(str_detect(Major, "DATA|STATISTICS")) |>
  select(Major)
colnames(pat_df) = c('Majors that contain either \"DATA\" or \"STATISTICS\"')


Majors that contain either “DATA” or “STATISTICS”
MANAGEMENT INFORMATION SYSTEMS AND STATISTICS
COMPUTER PROGRAMMING AND DATA PROCESSING
STATISTICS AND DECISION SCIENCE

Question 3

Descriptions for five regular expressions have been provided below. Each expression description includes a write up on what the expression will match with supporting examples.

`(.)\1\1`

This regex will match strings that contain 3 identical consecutive characters.

Breakdown:
- (.) matches any single character
- \1 is a back reference to the first captured group
Examples: ‘111’, ‘aaa’

`"(.)(.)\\2\\1"`

This is a string that represents a regex which matches strings with any 2 characters followed by the second character and first character repeated in that order.

Breakdown:
- The first (.) matches any single character and represents the first captured group
- The second (.) again matches any single character and represents the second captured group
- These captured groups are then back referenced with the \\2 referring to the second captured group and \\1 referring to the first captured group.
Examples: ‘1221’, ‘abba’, ‘3333’

`(..)\1`

This regex will match strings with a repeated pair of characters.

Breakdown:
- (..) matches any 2 characters and represents the first captured group
- \1 is a back reference to the first captured group.
Examples: ‘1212’, ‘abab’, ‘cccc’

`"(.).\\1.\\1"`

This is a string that represents a regex which matches strings with any character repeated in 3 places if a single character is in between the repeated characters.

Breakdown:
- (.) matches any single character and represents the first captured group
- . matches any single character
- \\1 is a back reference referring to the first captured group
Examples: ‘12131’, ‘abaca’, ‘dddd’

`"(.)(.)(.).*\\3\\2\\1"`

This is a string that represents a regex which matches strings with any 3 characters followed by an optional character that can be repeated 0 or more times. The pattern ends with same first 3 characters being repeated in reverse order.

Breakdown:
- (.) matches any single character and creates a captured group. This is repeated 3 times; thus, three captured groups are defined
- .* optionally matches any character with a repeat (i.e. it matches any number of times, including 0).
- \\3\\2\\1 are back references referring to the third, second, and first captured group characters respectively
Examples: ‘123321’, ‘abcdddddddcba’

Question 4

Regex to match words that start and end with the same character
Regex: ^(.).*\\1$

Test on words character vector

str_subset(words, "^(.).*\\1$")

##  [1] "america"    "area"       "dad"        "dead"       "depend"    
##  [6] "educate"    "else"       "encourage"  "engine"     "europe"    
## [11] "evidence"   "example"    "excuse"     "exercise"   "expense"   
## [16] "experience" "eye"        "health"     "high"       "knock"     
## [21] "level"      "local"      "nation"     "non"        "rather"    
## [26] "refer"      "remember"   "serious"    "stairs"     "test"      
## [31] "tonight"    "transport"  "treat"      "trust"      "window"    
## [36] "yesterday"

Regex to match words that contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
Regex: ([a-zA-Z]{2}).*\\1

Test on words character vector

str_subset(words, "([a-zA-Z]{2}).*\\1")

##  [1] "appropriate" "church"      "condition"   "decide"      "environment"
##  [6] "london"      "paragraph"   "particular"  "photograph"  "prepare"    
## [11] "pressure"    "remember"    "represent"   "require"     "sense"      
## [16] "therefore"   "understand"  "whether"

Regex to match words that contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
Regex: ([a-zA-Z])(.*\\1){2,}

Test on words character vector

str_subset(words, "([a-zA-Z])(.*\\1){2,}")

##  [1] "appropriate" "available"   "believe"     "between"     "business"   
##  [6] "degree"      "difference"  "discuss"     "eleven"      "environment"
## [11] "evidence"    "exercise"    "expense"     "experience"  "individual" 
## [16] "paragraph"   "receive"     "remember"    "represent"   "telephone"  
## [21] "therefore"   "tomorrow"

DATA607_Week 3 Assignment

Normalization & Character Manipulation

Olivia Azevedo

2025-02-15

Overview

Normalization

Question 1

Normalized Dataframes

Joined Normalized Dataframes

Character Manipulation

Question 2

Import Data

Question 3

`(.)\1\1`

`"(.)(.)\\2\\1"`

`(..)\1`

`"(.).\\1.\\1"`

`"(.)(.)(.).*\\3\\2\\1"`

Question 4