Assignment 3 DATA 607

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)

Instructions

Please deliver links to an R Markdown file (in GitHub and rpubs.com) with solutions to the problems below. You may work in a small group, but please submit separately with names of all group participants in your submission.

Normalization Question 1

Provide an example of at least three dataframes in R that demonstrate normalization. The dataframes can contain any data, either real or synthetic. Although normalization is typically done in SQL and relational databases, you are expected to show this example in R, as it is our main work environment in this course.

Question 1 Answer:

Schema Normalization is the process of organizing the structure of a relational database to minimize redundancy, avoid anomalies, and improve data integrity. The goal is to break down large, redundant tables into smaller, logically related ones, adhering to certain rules called normal forms (First Normal Form, Second Normal Form, Third Normal Form).

DataFrame 1: Unnormalized Data (Denormalized Table)

# Original Denormalized student data (includes student, course, and instructor information)
StudentData_original <- data.frame(
  StudentID = 1:5,
  StudentName = c("Alice", "Bob", "Charlie", "David", "Eva"),
  CourseName = c("Math", "Math", "Science", "Math", "Science"),
  Instructor = c("Dr. Matthew", "Dr. Matthew", "Dr. Stein", "Dr. Matthew", "Dr. Stein"),
  Grade = c("B", "C", "A", "A", "B")
)

print(StudentData_original)

##   StudentID StudentName CourseName  Instructor Grade
## 1         1       Alice       Math Dr. Matthew     B
## 2         2         Bob       Math Dr. Matthew     C
## 3         3     Charlie    Science   Dr. Stein     A
## 4         4       David       Math Dr. Matthew     A
## 5         5         Eva    Science   Dr. Stein     B

DataFrame 2: First Normal Form - ensures that the table had atomic values (record is unique) and no repeating groups.

# Ensuring uniqueness by removing duplicate rows (if any)
StudentData_1nf <- StudentData_original %>%
  distinct()

print(StudentData_1nf)

##   StudentID StudentName CourseName  Instructor Grade
## 1         1       Alice       Math Dr. Matthew     B
## 2         2         Bob       Math Dr. Matthew     C
## 3         3     Charlie    Science   Dr. Stein     A
## 4         4       David       Math Dr. Matthew     A
## 5         5         Eva    Science   Dr. Stein     B

DataFrame 3: Second Normal Form - separate data into tables that avoid partial dependencie (eliminates partial dependencies by splitting student and enrollment data).

# Students Table (StudentID, StudentName)
StudentData_students <- StudentData_1nf %>%
  select(StudentID, StudentName) %>%
  distinct()

print(StudentData_students)

##   StudentID StudentName
## 1         1       Alice
## 2         2         Bob
## 3         3     Charlie
## 4         4       David
## 5         5         Eva

# Enrollments Table (StudentID, CourseName, Instructor, Grade)
StudentData_enrollments <- StudentData_1nf %>%
  select(StudentID, CourseName, Instructor, Grade) %>%
  distinct()

print(StudentData_enrollments)

##   StudentID CourseName  Instructor Grade
## 1         1       Math Dr. Matthew     B
## 2         2       Math Dr. Matthew     C
## 3         3    Science   Dr. Stein     A
## 4         4       Math Dr. Matthew     A
## 5         5    Science   Dr. Stein     B

DataFrame 4: Third Normal Form - further separate data to remove transitive dependencies (removes transitive dependencies by separating course and instructor information into a new table).

# Courses Table (CourseName, Instructor)
StudentData_courses <- StudentData_enrollments %>%
  select(CourseName, Instructor) %>%
  distinct()

print(StudentData_courses)

##   CourseName  Instructor
## 1       Math Dr. Matthew
## 2    Science   Dr. Stein

# Enrollments Table (StudentID, CourseName, Grade)
StudentData_enrollments_3nf <- StudentData_enrollments %>%
  select(StudentID, CourseName, Grade)

print(StudentData_enrollments_3nf)

##   StudentID CourseName Grade
## 1         1       Math     B
## 2         2       Math     C
## 3         3    Science     A
## 4         4       Math     A
## 5         5    Science     B

Character Manipulation Question 2

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

# Load the dataset 
url <- "https://raw.githubusercontent.com/Badigun/Data-607-Assignments/refs/heads/main/majors-list.csv"
majors_list <- read.csv(url)

# Check the dataset
head(majors_list)

##   FOD1P                                 Major                  Major_Category
## 1  1100                   GENERAL AGRICULTURE Agriculture & Natural Resources
## 2  1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3  1102                AGRICULTURAL ECONOMICS Agriculture & Natural Resources
## 4  1103                       ANIMAL SCIENCES Agriculture & Natural Resources
## 5  1104                          FOOD SCIENCE Agriculture & Natural Resources
## 6  1105            PLANT SCIENCE AND AGRONOMY Agriculture & Natural Resources

# Filter majors that contain "DATA" or "STATISTICS"
filtered_Majors_Data <- majors_list %>%
  filter(grepl("DATA|STATISTICS", Major, ignore.case = TRUE))

# Display the filtered majors
print(filtered_Majors_Data$Major)

## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"

The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:

Question 3: Describe, in words, what these expressions will match:

(.)\1\1 Answer: This expression matches any three consecutive characters where the first character is repeated two more times. For example, it will match strings like “aaa”, “bbb” where the first character repeats twice more.

text <- c("aaa", "abc", "333", "bbb1")

pattern <- "(.)\\1\\1"

matches <- grep(pattern, text, value = TRUE)

print(matches)

## [1] "aaa"  "333"  "bbb1"

“(.)(.)\2\1” Answer: This expression matches a string with two characters, followed by those same two characters in reverse order. For example, it will match strings like “abba”. The first character is matched again as the last character, and the second character is matched again as the second-to-last character.

text <- c("abba", "zyyz", "213312", "abc", "cccf")

pattern <- "(.)(.)\\2\\1"

matches <- grep(pattern, text, value = TRUE)

print(matches)

## [1] "abba"   "zyyz"   "213312"

(..)\1 Answer: This expression matches a string where the first two characters are exactly repeated after them. For example, it will match strings like “abab”, or “1212”. It requires that the first two characters are the same as the next two characters.

text <- c("cdcdcd", "abab", "1212", "ijkl", "aaab", "abcabc")

pattern <- "(..)\\1"

matches <- grep(pattern, text, value = TRUE)

print(matches)

## [1] "cdcdcd" "abab"   "1212"

“(.).\1.\1” Answer: This expression matches strings where the first character appears three times, with two other characters in between. For example, it will match strings like “acaca”.

text <- c("acaca", "axa", "121244", "abc", "xyzzyx")

pattern <- "(.).\\1.\\1"

matches <- grep(pattern, text, value = TRUE)

print(matches)

## [1] "acaca"

“(.)(.)(.).*\3\2\1” Answer: This expression matches strings where the first three characters are followed by any characters, and then those three characters are repeated in reverse order. For example, it will match strings like “abcxyzcba”, or “123xyzyx321”. The first three characters appear again after some characters, but in reverse order.

text <- c("xyzabzzyx", "123xyz321", "xyzabczyx", "efgj", "abcd", "abcdefgj")

pattern <- "(.)(.)(.).*\\3\\2\\1"

matches <- grep(pattern, text, value = TRUE)

print(matches)

## [1] "xyzabzzyx" "123xyz321" "xyzabczyx"

Question 4: Construct regular expressions to match words that:

Start and end with the same character - \b([a-zA-Z]).*\1\b

#examples
text <- c("plump", "pen", "radar", "noon", "civic", "world")

# Regex pattern to match words that start and end with the same character
pattern <- "\\b([a-zA-Z]).*\\1\\b"

# use grep to find words that match the pattern
matches <- grep(pattern, text, value = TRUE)
print(matches)

## [1] "plump" "radar" "noon"  "civic"

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.) - \b([a-zA-Z]{2}).*\1\b

# Sample text
text <- c("church", "money", "success", "monkey", "love")

pattern <- "\\b([a-zA-Z]{2}).*\\1\\b"

# Find words that match the pattern
matches <- grep(pattern, text, value = TRUE)
print(matches)

## [1] "church"

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.) - \b([a-zA-Z]).\1.\1\b

text <- c("love", "balloon", "success", "hello", "apple", "test")

pattern <- "\\b([a-zA-Z]).*\\1.*\\1\\b"

matches <- grep(pattern, text, value = TRUE)
print(matches)

## [1] "success"