Normalization
Provide an example of at least three dataframes in R that demonstrate normalization. The dataframes can contain any data, either real or synthetic. Although normalization is typically done in SQL and relational databases, you are expected to show this example in R, as it is our main work environment in this course.
# Connecting to the three dataframes that I created in SQL Workbench to demonstrate normalization
library(RMySQL)
## Loading required package: DBI
Localdbconnect <- dbConnect(MySQL(),user= 'bchung',password= 'august29!',dbname= '607',host= 'bchungcuny607.mysql.database.azure.com')
dbListTables(Localdbconnect)
## [1] "customers" "invoices" "movie ratings" "products"
# Reading the three dataframes from SQL in R
Customers <- dbReadTable(Localdbconnect,'customers')
Invoices <- dbReadTable(Localdbconnect,'invoices')
Products <- dbReadTable(Localdbconnect,'products')
head(Customers)
## Customer.ID Customer.Name Customer.Address Customer.Class.of.Trade
## 1 1 John's Pharmacy 123 Race Street 6000
## 2 2 Davita Dialysis 34 Lincoln Drive 3104
## 3 3 Raritan Emergency Room 1 Raritan Road 1003
head(Invoices)
## Invoice. Invoice.Date CustomerID Product.Name Quantity
## 1 1 2024-09-01 1 Tylenol 9
## 2 2 2024-09-05 3 Tylenol 4
## 3 3 2024-09-12 1 Aspirin 5
## 4 4 2024-09-17 2 BandAids 3
head(Products)
## ProductName ProductPrice ProductUnit
## 1 Aspirin $6 100mg Vial
## 2 BandAids $4 25 Count Pack
## 3 Tylenol $5 100mg Vial
The databases I created are synthetic data of a pharmacutical supplier’s sales and customer data. Following the Boyce-Codd Normal Form, I attempted to ensure that every attribute in each table depends solely on the primary key’s (the first column in each table).
Character Manipulation
# Loading the dataframe from fivethirtyeight's The Economic Guide to Picking a COllege Major article.
library(readr)
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/all-ages.csv"
college_majors <- read.csv(url)
# Identifying majors with DATA or STATISTICS in its name.
MajorV <- as.vector(college_majors[,2])
grep ("DATA|STATISTICS", MajorV, value = TRUE, ignore.case = TRUE)
## [1] "COMPUTER PROGRAMMING AND DATA PROCESSING"
## [2] "STATISTICS AND DECISION SCIENCE"
## [3] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
To identify majors that contain either DATA or STATISTICS in the name I had to create a vector from the college_major dataset’s “Major” variable, then run the above grep code.
The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:
(.)\1\1 This will match any characters in a string that repeats three times.
“(.)(.)\2\1” This will match any characters in a string with four characters that the first and fourth characters repeat and the second and third characters repeat.
(..)\1 This will match any four characters in a string where the first and third characters repeat and the second and fourth characters repeat.
“(.).\1.\1” This will match any five characters in a string where the first, third, and fifth characters repeat.
“(.)(.)(.).*\3\2\1” This will match any seven characters in a string where the first and seventh, second and sixth, and third and fifth characters repeat.
Start and end with the same character. “^([a-z]).*\1$”
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.) “(..).*\1”
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.) “([a-z].*){3,}”