DATA607_Assignment

1- Normalization

Created a csv file with sample table to be normalized Sample Table.csv and uploaded to GitHub. This table has 7 columns that represents a students’ grade for courses taught by different professors.

Check if it is in first normal form

Each row is representing a different observation
No mix data types in each column
Does not store a repeating group of data items in one row
This table have a composite primary keys of Student_ID, Course_ID, and Teacher_Email

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Read the file from GitHub and save to a data frame
denormalized <- read.csv("https://raw.githubusercontent.com/amily52131/DATA607/refs/heads/main/Assignment_3/Normalize%20Table.csv")
print(denormalized)

##   Student_ID Course_ID Student_Name Course_Name Grade Teacher  Teacher.Email
## 1        100        10       Amelia        Math     B     Ken  Ken@email.com
## 2        200        10        Henry        Math     A     Ken  Ken@email.com
## 3        100        20       Amelia     Science     C    Liam liam@email.com
## 4        200        20        Henry     Science     A    Liam liam@email.com
## 5        300        10        James        Math     C     Ben  Ben@email.com
## 6        400        10        Mandy        Math     B     Ben  Ben@email.com
## 7        400        20        Mandy     Science     B    Liam liam@email.com

Check if it is in second normal form

Second normal form, each non-key attribute must depend on the entire primary key Student_Name depends on the Student_ID Course_Name depends on the Course_ID Teacher name each have unique email and they each teach a course The bridge table has grade dependent on student, course, and teacher

# Student_ID is the primary key for Student_Name
# Get the distinct Students from the table
Students <- denormalized %>% 
  select(Student_ID, Student_Name) %>% 
  distinct(Student_ID, .keep_all = TRUE)%>% 
  print()

##   Student_ID Student_Name
## 1        100       Amelia
## 2        200        Henry
## 3        300        James
## 4        400        Mandy

# Course_ID is the primary key for Course_Name
# Get the distinct courses from the table
Courses <- denormalized %>% 
  select(Course_ID, Course_Name) %>% 
  distinct(Course_ID, .keep_all = TRUE)%>% 
  print()

##   Course_ID Course_Name
## 1        10        Math
## 2        20     Science

# Teacher.email is the primary key for teacher and each teacher teaches a course
# Get the distinct teachers from the table
Teachers <- denormalized %>% 
  select(Teacher, Teacher.Email, Course_ID) %>% 
  distinct(Teacher.Email, .keep_all = TRUE) %>% 
  print()

##   Teacher  Teacher.Email Course_ID
## 1     Ken  Ken@email.com        10
## 2    Liam liam@email.com        20
## 3     Ben  Ben@email.com        10

# The de-normalized table after removing Student, Courses, and teacher with only the primary key of each table.
Bridge_Table <- denormalized %>% 
  select(Student_ID, Course_ID, Teacher.Email, Grade) %>% 
  print()

##   Student_ID Course_ID  Teacher.Email Grade
## 1        100        10  Ken@email.com     B
## 2        200        10  Ken@email.com     A
## 3        100        20 liam@email.com     C
## 4        200        20 liam@email.com     A
## 5        300        10  Ben@email.com     C
## 6        400        10  Ben@email.com     B
## 7        400        20 liam@email.com     B

Check if it is in third normal form

Each non-key attribute must be dependent on the key, the whole key, and nothing but the key.

# Looking at the Teachers table there is a relationship between Course_ID and Teachers.To represent that I created another bridge table called Teachers_Course and created an ID call Section_ID and created a primary key call Teacher_ID

# Modify Teachers table to add Teacher_ID
Teachers <- Teachers %>% 
  mutate(Teacher_ID = 1:n(), .before = Teacher) %>% 
  print()

##   Teacher_ID Teacher  Teacher.Email Course_ID
## 1          1     Ken  Ken@email.com        10
## 2          2    Liam liam@email.com        20
## 3          3     Ben  Ben@email.com        10

# Create dataframe for Course_Section
Teacher_Course <- Teachers %>% 
  select(Course_ID, Teacher_ID) %>% 
  arrange(Course_ID) %>% 
  mutate(Section_ID = c(1,2,1)) %>% # manually creating the section_id for now
  print()

##   Course_ID Teacher_ID Section_ID
## 1        10          1          1
## 2        10          3          2
## 3        20          2          1

# converting the bridge table to Grades based on student, course, and section
Grades <- Bridge_Table %>% 
  left_join(Teachers) %>% 
  left_join(Teacher_Course) %>% 
  select(Student_ID, Course_ID, Section_ID, Grade) %>% 
  print()

## Joining with `by = join_by(Course_ID, Teacher.Email)`
## Joining with `by = join_by(Course_ID, Teacher_ID)`

##   Student_ID Course_ID Section_ID Grade
## 1        100        10          1     B
## 2        200        10          1     A
## 3        100        20          1     C
## 4        200        20          1     A
## 5        300        10          2     C
## 6        400        10          2     B
## 7        400        20          1     B

Normalized tables

print(Students)

##   Student_ID Student_Name
## 1        100       Amelia
## 2        200        Henry
## 3        300        James
## 4        400        Mandy

print(Courses)

##   Course_ID Course_Name
## 1        10        Math
## 2        20     Science

print(Teachers)

##   Teacher_ID Teacher  Teacher.Email Course_ID
## 1          1     Ken  Ken@email.com        10
## 2          2    Liam liam@email.com        20
## 3          3     Ben  Ben@email.com        10

print(Teacher_Course)

##   Course_ID Teacher_ID Section_ID
## 1        10          1          1
## 2        10          3          2
## 3        20          2          1

print(Grades)

##   Student_ID Course_ID Section_ID Grade
## 1        100        10          1     B
## 2        200        10          1     A
## 3        100        20          1     C
## 4        200        20          1     A
## 5        300        10          2     C
## 6        400        10          2     B
## 7        400        20          1     B

2 - Character Manipulation

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset, provide code that identifies the majors that contain either “DATA” or “STATISTICS”.

There are three majors containing “DATA” or “STATISTICS” in the name.

majors <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv")

str_view(majors$Major,"DATA|STATISTICS")

## [44] │ MANAGEMENT INFORMATION SYSTEMS AND <STATISTICS>
## [52] │ COMPUTER PROGRAMMING AND <DATA> PROCESSING
## [59] │ <STATISTICS> AND DECISION SCIENCE

3. Describe, in words, what these expressions will match:

(.)\1\1

This expression will match a string with three repeating characters.
(.) Capturing group is of any character
\1 same character as the capturing group repeat once
\1 same character as the capturing group repeat the second time

x <- str_c("aaa bbb abc")
str_view(x, "(.)\\1\\1")

## [1] │ <aaa> <bbb> abc

“(.)(.)\2\1”

This expression will match two characters followed by two characters in reversed order.
(.) 1^st capturing group of any character
(.) 2^nd capturing group of any character
\2 character from the second capturing group
\1 character from the first capturing group

x <- str_c("abba abbac aaaa")
str_view(x, "(.)(.)\\2\\1")

## [1] │ <abba> <abba>c <aaaa>

(..)\1

This expression will match string with any two characters repeated once. (..) capturing group of any two characters
\1 repeat the characters from the capturing group once

x <- str_c("aaaa abababab cababc")
str_view(x, "(..)\\1")

## [1] │ <aaaa> <abab><abab> c<abab>c

“(.).\1.\1”

This expression match 5 characters where every other odd character is the same character

(.) Capturing group of any character
. any character
\1 repeat of capturing group character
. any character
\1 repeat of capturing group character

x <- str_c("ababa abaca aaaaaa")
str_view(x, "(.).\\1.\\1")

## [1] │ <ababa> <abaca> <aaaaa>a

“(.)(.)(.).*\3\2\1”

This expression will match string with the three characters follow by any character and then the reverse of the three characters in the beginning

(.) first capturing group of any character
(.) second capturing group of any character
(.) third capturing group of any character
.* any number of characters
\3 third capturing group of any character
\2 second capturing group of any character
\1 first capturing group of any character

x <- str_c("abcdefedcba efg gfe")
str_view(x,"(.)(.)(.).*\\3\\2\\1")

## [1] │ <abcdefedcba> <efg gfe>

4. Construct regular expressions to match words that:

Start and end with the same character

# "(.).*\\1"
x <- str_c("aba xyxdwn blob")
str_view(x, "(.).*\\1" )

## [1] │ <aba>< xyxdwn ><blob>

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

# "(..).*\\1"
x <- ("church photography")
str_view(x, "(..).*\\1")

## [1] │ <church> <photograph>y

# If we want to tweak it to have letters starts and ends with the same letters
# "^(..).*\\1$"
x <- str_c("church")
str_view(x, "^(..).*\\1$")

## [1] │ <church>

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

# "(.).*\\1.*\\1"
x <- str_c("ababa cdceleven fififofum")
str_view(x, "(.).*\\1.*\\1")

## [1] │ <ababa> cdc<eleve>n <fififof>um

DATA607_Assignment_3

Ying Fang Lee

9/22/2024