Lab 3

library(tidyverse)

## Warning: package 'dplyr' was built under R version 4.4.1

library(openintro)

## Warning: package 'openintro' was built under R version 4.4.1

library(tibble)

create an N1 table

N1<- tibble(
  EmployeeID = c(1, 1, 2, 3, 3),
  EmployeeName = c("John", "John", "Jane", "Alice", "Alice"),
  Department = c("Sales", "Sales", "Marketing", "HR", "HR"),
  Project = c("Project A", "Project B", "Project C", "Project D", "Project E"),
  HoursWorked = c(10, 15, 20, 30, 25)
)

N1

This table is in N1 form because all the data are atomic and the rows and columns are not repeating.

transform N1 table to N2

Since the table is in N1 form, we can transform it into N2, eliminating partial dependencies on non-key columns.

N2EmpID<- N1 %>%
  subset(select=c(EmployeeID, EmployeeName, Department)) %>%
  distinct()
N2EmpID

To further eliminate partial dependency, we will introduce a new column called ProjectID. This makes the Project column fully dependent on the primary key, which is ProjectID rather than the EmployeeID in the original table.

N2D_Pro<- N1 %>%
  mutate(ProjectID= c('AA01', 'AA02', 'AA03', 'AA04', 'AA05')) %>%
    subset(select=c(ProjectID, Project)) %>%
      distinct()
print(N2D_Pro)

## # A tibble: 5 × 2
##   ProjectID Project  
##   <chr>     <chr>    
## 1 AA01      Project A
## 2 AA02      Project B
## 3 AA03      Project C
## 4 AA04      Project D
## 5 AA05      Project E

N2Hour<- N1 %>%
  mutate(ProjectID= c('AA01', 'AA02', 'AA03', 'AA04', 'AA05')) %>%
    subset(select=c(EmployeeID, ProjectID, HoursWorked)) %>%
      distinct()
print(N2Hour)

## # A tibble: 5 × 3
##   EmployeeID ProjectID HoursWorked
##        <dbl> <chr>           <dbl>
## 1          1 AA01               10
## 2          1 AA02               15
## 3          2 AA03               20
## 4          3 AA04               30
## 5          3 AA05               25

These three tables happen to also satisfy N3F since each non-key attributes are fully dependent on the primary keys in each table and no transitive dependency is observed. The non-key attribute “HoursWorked” is fully dependent on the composite keys “EmployeeID” and “ProjectID”.

Data filtering. 173 majors

major<- read.csv('https://raw.githubusercontent.com/stormwhale/607-assignment/refs/heads/main/majors-list.csv')
major<-data.frame(major)


name<- major %>%
  filter(str_detect(Major, regex('DATA|STATISTIC', ignore_case = TRUE)))
name

3. Describe, in words, what these expressions will match:

(.)\1\1 The correct expression should be (.)\\1\\1 and it will match three of the same characters captured in the first group. (e.g. aaa, ccc, bbb)

“(.)(.)\2\1” <- This will capture 4 characters that are reverse of one another. (e.g saas)

(..)\1 <- The correct expression should be “(..)\\1” and it will match anything that is repeating the first two characters. e.g. abab

“(.).\1.\1” <- this captures any first letter with any letter that follows and returns the first captured letter with another letter finally followed by the first letter. E.g “cbcnc”

“(.)(.)(.).*\3\2\1” <- captures three characters followed by a string that could be anything and any length as long as the reverse of the first three characters are followed. E.g. xyz123abc123zyx

4. Construct regular expressions to match words that:

Start and end with the same character.

df<- 'aaafffddf'
print(str_view(df, "(.)\\1"))

## [1] │ <aa>a<ff>f<dd>f

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

df2<- c("church", "success", "letter", "banana", "bookkeeper")
str_view(df2, "(.)(.).*\\1\\2")

## [1] │ <church>
## [4] │ b<anan>a

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

df3<- 'Tennessee'
str_view(df3, "(.).*\\1.*\\1")

## [1] │ T<ennessee>