Created a csv file with sample table to be normalized Sample Table.csv and uploaded to GitHub. This table has 7 columns that represents a students’ grade for courses taught by different professors.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Read the file from GitHub and save to a data frame
denormalized <- read.csv("https://raw.githubusercontent.com/amily52131/DATA607/refs/heads/main/Assignment_3/Normalize%20Table.csv")
print(denormalized)
## Student_ID Course_ID Student_Name Course_Name Grade Teacher Teacher.Email
## 1 100 10 Amelia Math B Ken Ken@email.com
## 2 200 10 Henry Math A Ken Ken@email.com
## 3 100 20 Amelia Science C Liam liam@email.com
## 4 200 20 Henry Science A Liam liam@email.com
## 5 300 10 James Math C Ben Ben@email.com
## 6 400 10 Mandy Math B Ben Ben@email.com
## 7 400 20 Mandy Science B Liam liam@email.com
Second normal form, each non-key attribute must depend on the entire primary key Student_Name depends on the Student_ID Course_Name depends on the Course_ID Teacher name each have unique email and they each teach a course The bridge table has grade dependent on student, course, and teacher
# Student_ID is the primary key for Student_Name
# Get the distinct Students from the table
Students <- denormalized %>%
select(Student_ID, Student_Name) %>%
distinct(Student_ID, .keep_all = TRUE)%>%
print()
## Student_ID Student_Name
## 1 100 Amelia
## 2 200 Henry
## 3 300 James
## 4 400 Mandy
# Course_ID is the primary key for Course_Name
# Get the distinct courses from the table
Courses <- denormalized %>%
select(Course_ID, Course_Name) %>%
distinct(Course_ID, .keep_all = TRUE)%>%
print()
## Course_ID Course_Name
## 1 10 Math
## 2 20 Science
# Teacher.email is the primary key for teacher and each teacher teaches a course
# Get the distinct teachers from the table
Teachers <- denormalized %>%
select(Teacher, Teacher.Email, Course_ID) %>%
distinct(Teacher.Email, .keep_all = TRUE) %>%
print()
## Teacher Teacher.Email Course_ID
## 1 Ken Ken@email.com 10
## 2 Liam liam@email.com 20
## 3 Ben Ben@email.com 10
# The de-normalized table after removing Student, Courses, and teacher with only the primary key of each table.
Bridge_Table <- denormalized %>%
select(Student_ID, Course_ID, Teacher.Email, Grade) %>%
print()
## Student_ID Course_ID Teacher.Email Grade
## 1 100 10 Ken@email.com B
## 2 200 10 Ken@email.com A
## 3 100 20 liam@email.com C
## 4 200 20 liam@email.com A
## 5 300 10 Ben@email.com C
## 6 400 10 Ben@email.com B
## 7 400 20 liam@email.com B
Each non-key attribute must be dependent on the key, the whole key, and nothing but the key.
# Looking at the Teachers table there is a relationship between Course_ID and Teachers.To represent that I created another bridge table called Teachers_Course and created an ID call Section_ID and created a primary key call Teacher_ID
# Modify Teachers table to add Teacher_ID
Teachers <- Teachers %>%
mutate(Teacher_ID = 1:n(), .before = Teacher) %>%
print()
## Teacher_ID Teacher Teacher.Email Course_ID
## 1 1 Ken Ken@email.com 10
## 2 2 Liam liam@email.com 20
## 3 3 Ben Ben@email.com 10
# Create dataframe for Course_Section
Teacher_Course <- Teachers %>%
select(Course_ID, Teacher_ID) %>%
arrange(Course_ID) %>%
mutate(Section_ID = c(1,2,1)) %>% # manually creating the section_id for now
print()
## Course_ID Teacher_ID Section_ID
## 1 10 1 1
## 2 10 3 2
## 3 20 2 1
# converting the bridge table to Grades based on student, course, and section
Grades <- Bridge_Table %>%
left_join(Teachers) %>%
left_join(Teacher_Course) %>%
select(Student_ID, Course_ID, Section_ID, Grade) %>%
print()
## Joining with `by = join_by(Course_ID, Teacher.Email)`
## Joining with `by = join_by(Course_ID, Teacher_ID)`
## Student_ID Course_ID Section_ID Grade
## 1 100 10 1 B
## 2 200 10 1 A
## 3 100 20 1 C
## 4 200 20 1 A
## 5 300 10 2 C
## 6 400 10 2 B
## 7 400 20 1 B
print(Students)
## Student_ID Student_Name
## 1 100 Amelia
## 2 200 Henry
## 3 300 James
## 4 400 Mandy
print(Courses)
## Course_ID Course_Name
## 1 10 Math
## 2 20 Science
print(Teachers)
## Teacher_ID Teacher Teacher.Email Course_ID
## 1 1 Ken Ken@email.com 10
## 2 2 Liam liam@email.com 20
## 3 3 Ben Ben@email.com 10
print(Teacher_Course)
## Course_ID Teacher_ID Section_ID
## 1 10 1 1
## 2 10 3 2
## 3 20 2 1
print(Grades)
## Student_ID Course_ID Section_ID Grade
## 1 100 10 1 B
## 2 200 10 1 A
## 3 100 20 1 C
## 4 200 20 1 A
## 5 300 10 2 C
## 6 400 10 2 B
## 7 400 20 1 B
Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset, provide code that identifies the majors that contain either “DATA” or “STATISTICS”.
There are three majors containing “DATA” or “STATISTICS” in the name.
majors <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv")
str_view(majors$Major,"DATA|STATISTICS")
## [44] │ MANAGEMENT INFORMATION SYSTEMS AND <STATISTICS>
## [52] │ COMPUTER PROGRAMMING AND <DATA> PROCESSING
## [59] │ <STATISTICS> AND DECISION SCIENCE
This expression will match a string with three repeating
characters.
(.) Capturing group is of any character
\1 same character as the capturing group repeat once
\1 same character as the capturing group repeat the second time
x <- str_c("aaa bbb abc")
str_view(x, "(.)\\1\\1")
## [1] │ <aaa> <bbb> abc
This expression will match two characters followed by two characters
in reversed order.
(.) 1^st capturing group of any character
(.) 2^nd capturing group of any character
\2 character from the second capturing group
\1 character from the first capturing group
x <- str_c("abba abbac aaaa")
str_view(x, "(.)(.)\\2\\1")
## [1] │ <abba> <abba>c <aaaa>
This expression will match string with any two characters repeated
once. (..) capturing group of any two characters
\1 repeat the characters from the capturing group once
x <- str_c("aaaa abababab cababc")
str_view(x, "(..)\\1")
## [1] │ <aaaa> <abab><abab> c<abab>c
This expression match 5 characters where every other odd character is
the same character
(.) Capturing group of any character
. any character
\1 repeat of capturing group character
. any character
\1 repeat of capturing group character
x <- str_c("ababa abaca aaaaaa")
str_view(x, "(.).\\1.\\1")
## [1] │ <ababa> <abaca> <aaaaa>a
This expression will match string with the three characters follow by
any character and then the reverse of the three characters in the
beginning
(.) first capturing group of any character
(.) second capturing group of any character
(.) third capturing group of any character
.* any number of characters
\3 third capturing group of any character
\2 second capturing group of any character
\1 first capturing group of any character
x <- str_c("abcdefedcba efg gfe")
str_view(x,"(.)(.)(.).*\\3\\2\\1")
## [1] │ <abcdefedcba> <efg gfe>
# "(.).*\\1"
x <- str_c("aba xyxdwn blob")
str_view(x, "(.).*\\1" )
## [1] │ <aba>< xyxdwn ><blob>
# "(..).*\\1"
x <- ("church photography")
str_view(x, "(..).*\\1")
## [1] │ <church> <photograph>y
# If we want to tweak it to have letters starts and ends with the same letters
# "^(..).*\\1$"
x <- str_c("church")
str_view(x, "^(..).*\\1$")
## [1] │ <church>
# "(.).*\\1.*\\1"
x <- str_c("ababa cdceleven fififofum")
str_view(x, "(.).*\\1.*\\1")
## [1] │ <ababa> cdc<eleve>n <fififof>um