DATA607: Assignment 3

Introduction

For this week’s assignment, I provided examples of Normalization with First, Second, and Third Normal Form using fake data. I then used character manipulation to interpret expressions and used regular expressions to match different examples.

Normalization

This is an example of First Normal Form (1NF) since we seperated blood pressure readings for each participant into systolic and diastolic. This way we do not have more than one value in each cell.

base_bp_notnorm <- data.frame(id=1001:1005,
                      baseline_bp=c('143/93', '124 over 77', '98 over 60', '129/91', '110/70'))

print(base_bp_notnorm)

##     id baseline_bp
## 1 1001      143/93
## 2 1002 124 over 77
## 3 1003  98 over 60
## 4 1004      129/91
## 5 1005      110/70

base_bp_norm <- data.frame(id=1001:1005,
                      baseline_sys_bp=c(143, 124, 98, 129, 110),
                      baseline_dia_bp=c(93, 77, 60, 91, 70))
print(base_bp_norm)

##     id baseline_sys_bp baseline_dia_bp
## 1 1001             143              93
## 2 1002             124              77
## 3 1003              98              60
## 4 1004             129              91
## 5 1005             110              70

This is an example of Second Normal Form (2NF) since the non-key attribute (item_price) is not dependent on the entire primary key which is customer_id and items they purchase. In order to put this is second normal form, the table must be split up into two tables.

store <- data.frame(customer_id = c(1, 1, 2,3,3,3),
                      items =c("Shampoo", 'Airpods', 'Toothpase', 'Laptop', 'Bedding Set', "Puzzle"),
                      item_price=c('$8.38', '$199.99', '$5.29', '$589.99', '$79.00', "9.99"))

print(store)

##   customer_id       items item_price
## 1           1     Shampoo      $8.38
## 2           1     Airpods    $199.99
## 3           2   Toothpase      $5.29
## 4           3      Laptop    $589.99
## 5           3 Bedding Set     $79.00
## 6           3      Puzzle       9.99

#We will create two tables, one with customer ids and item purchased and the other table with items and their prices. 


customer_purchase <- data.frame(customer_id = c(1,1,2,3,3,3),
                      items =c("Shampoo", 'Airpods', 'Toothpase', 'Laptop', 'Bedding Set', "Puzzle"))

item_prices <- data.frame(
                      items =c("Shampoo", 'Airpods', 'Toothpase', 'Laptop', 'Bedding Set', "Puzzle"),
                      item_price=c('$8.38', '$199.99', '$5.29', '$589.99', '$79.00', "$9.99"))

#Customer Purchases Table
print(customer_purchase)

##   customer_id       items
## 1           1     Shampoo
## 2           1     Airpods
## 3           2   Toothpase
## 4           3      Laptop
## 5           3 Bedding Set
## 6           3      Puzzle

#Item Prices Table
print(item_prices)

##         items item_price
## 1     Shampoo      $8.38
## 2     Airpods    $199.99
## 3   Toothpase      $5.29
## 4      Laptop    $589.99
## 5 Bedding Set     $79.00
## 6      Puzzle      $9.99

This is an example of Third Normal Form (3NF) since each non-key attribute must depend on the key, the whole key, and nothing but the key. In order to put this is third normal form, the table must be split up into two tables since the primary keys are patient_id and patient_name but test_cost does not depend on the whole primary key.

hospital <- data.frame(patient_id = c(1302,9293,1882,8900,3654),
                    patient_name =c('Sam', 'Joe', 'Lisa', 'Nathan', 'Nora'),
                    test=c('X-ray', 'Biopsy', 'MRI', 'Complete Blood Count (CBC)', 'Employer Drug Test'),
                    test_cost=c('$300.00','$540.00', "350.00", "30.00", "$130.00" ))

print(hospital)

##   patient_id patient_name                       test test_cost
## 1       1302          Sam                      X-ray   $300.00
## 2       9293          Joe                     Biopsy   $540.00
## 3       1882         Lisa                        MRI    350.00
## 4       8900       Nathan Complete Blood Count (CBC)     30.00
## 5       3654         Nora         Employer Drug Test   $130.00

patient_test <- data.frame(patient_id = c(1302,9293,1882,8900,3654),
                    patient_name =c('Sam', 'Joe', 'Lisa', 'Nathan', 'Nora'),
                    test=c('X-ray', 'Biopsy', 'MRI', 'Complete Blood Count (CBC)', 'Employer Drug Test'))
                    
print(patient_test)

##   patient_id patient_name                       test
## 1       1302          Sam                      X-ray
## 2       9293          Joe                     Biopsy
## 3       1882         Lisa                        MRI
## 4       8900       Nathan Complete Blood Count (CBC)
## 5       3654         Nora         Employer Drug Test

test_cost <- data.frame(test=c('X-ray', 'Biopsy', 'MRI', 'Complete Blood Count (CBC)', 'Employer Drug Test'),
                    test_cost=c('$300.00','$540.00', "350.00", "30.00", "$130.00" ))

print(test_cost)

##                         test test_cost
## 1                      X-ray   $300.00
## 2                     Biopsy   $540.00
## 3                        MRI    350.00
## 4 Complete Blood Count (CBC)     30.00
## 5         Employer Drug Test   $130.00

Character Manipulation

The CSV. with data on college majors was imported into R and then I used str_subset to find matches in the dataset “major” within the variable Major that had “DATA” or “STATISTICS” in it. The two pieces of the code returned a total of 3 Majors (1 for DATA and 2 for STATISTICS).

majors<- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv")

str_subset(majors$Major, pattern = "DATA")

## [1] "COMPUTER PROGRAMMING AND DATA PROCESSING"

str_subset(majors$Major, pattern = "STATISTICS")

## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "STATISTICS AND DECISION SCIENCE"

Describe, in words, what these expressions will match:

#(.)\1\1 : The same character appearing three times. Some examples of this could be: "bbb" or "ccc"
    
#"(.)(.)\\2\\1" : The same two characters twice but the second time, it is backwards.
str_subset(words, "(.)(.)\\2\\1")

##  [1] "afternoon"   "apparent"    "arrange"     "bottom"      "brilliant"  
##  [6] "common"      "difficult"   "effect"      "follow"      "indeed"     
## [11] "letter"      "million"     "opportunity" "oppose"      "tomorrow"

#(..)\1 : The same two characters appearing twice in a row. Some examples could be "mgmg" or "cdcd"
    
#"(.).\\1.\\1" : A character, then another character, the first character, then another character, and then the first character again.
str_subset(words, "(.).\\1.\\1")

## [1] "eleven"

#(.)(.)(.).*\\3\\2\\1 : Three characters, then any character(s) or none, then the same three characters from the beginning but backwards.
str_subset(words, "(.)(.)(.).*\\3\\2\\1")

## [1] "paragraph"

Construct regular expressions to match words that:

#Start and end with the same character. 
str_subset(words, "^(.).*\\1+$")

##  [1] "america"    "area"       "dad"        "dead"       "depend"    
##  [6] "educate"    "else"       "encourage"  "engine"     "europe"    
## [11] "evidence"   "example"    "excuse"     "exercise"   "expense"   
## [16] "experience" "eye"        "health"     "high"       "knock"     
## [21] "level"      "local"      "nation"     "non"        "rather"    
## [26] "refer"      "remember"   "serious"    "stairs"     "test"      
## [31] "tonight"    "transport"  "treat"      "trust"      "window"    
## [36] "yesterday"

#Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)
str_subset(words, "(..).+\\1")

##  [1] "appropriate" "church"      "condition"   "decide"      "environment"
##  [6] "london"      "paragraph"   "particular"  "photograph"  "prepare"    
## [11] "pressure"    "represent"   "require"     "sense"       "therefore"  
## [16] "understand"  "whether"

#Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s._
str_subset(words, "(.)+.+\\1.+\\1")

##  [1] "available"   "believe"     "difference"  "eleven"      "environment"
##  [6] "evidence"    "exercise"    "expense"     "experience"  "individual" 
## [11] "paragraph"   "receive"     "remember"    "represent"   "telephone"  
## [16] "therefore"   "tomorrow"

Conclusions

In this week’s assignment we learned about normalization (1NF, 2NF, 3NF, 4NF, and 5NF) and how to create examples of them which is important for relational databases (e.g. FileMaker). In the future, I would try using the rules on normalization with real data. The second part of the assignment is character manipulation using regular expressions to find or match strings in a dataset. Character manipulation is useful for data cleaning and analysis, an example with real data would be taking the ID from the name of a csv. as a string and adding it into the dataset as a variable.

DATA607: Assignment 3

Nakesha Fray

2024-09-22

Introduction

Normalization

Character Manipulation

Conclusions