Normalization

Provide an example of at least three dataframes in R that demonstrate normalization. The dataframes can contain any data, either real or synthetic. Although normalization is typically done in SQL and relational databases, you are expected to show this example in R, as it is our main work environment in this course.

# Connecting to the three dataframes that I created in SQL Workbench to demonstrate normalization

library(RMySQL)
## Loading required package: DBI
Localdbconnect <- dbConnect(MySQL(),user= 'bchung',password= 'august29!',dbname= '607',host= 'bchungcuny607.mysql.database.azure.com')

dbListTables(Localdbconnect)
## [1] "customers"     "invoices"      "movie ratings" "products"
# Reading the three dataframes from SQL in R
Customers <- dbReadTable(Localdbconnect,'customers')
Invoices <- dbReadTable(Localdbconnect,'invoices')
Products <- dbReadTable(Localdbconnect,'products')

head(Customers)
##   Customer.ID          Customer.Name Customer.Address Customer.Class.of.Trade
## 1           1        John's Pharmacy  123 Race Street                    6000
## 2           2        Davita Dialysis 34 Lincoln Drive                    3104
## 3           3 Raritan Emergency Room   1 Raritan Road                    1003
head(Invoices)
##   Invoice. Invoice.Date CustomerID Product.Name Quantity
## 1        1   2024-09-01          1      Tylenol        9
## 2        2   2024-09-05          3      Tylenol        4
## 3        3   2024-09-12          1      Aspirin        5
## 4        4   2024-09-17          2     BandAids        3
head(Products)
##   ProductName ProductPrice   ProductUnit
## 1     Aspirin           $6    100mg Vial
## 2    BandAids           $4 25 Count Pack
## 3     Tylenol           $5    100mg Vial

The databases I created are synthetic data of a pharmacutical supplier’s sales and customer data. Following the Boyce-Codd Normal Form, I attempted to ensure that every attribute in each table depends solely on the primary key’s (the first column in each table).

Character Manipulation

  1. Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”
# Loading the dataframe from fivethirtyeight's The Economic Guide to Picking a COllege Major article.

library(readr)
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/all-ages.csv"
college_majors <- read.csv(url)
# Identifying majors with DATA or STATISTICS in its name.
MajorV <- as.vector(college_majors[,2])
grep ("DATA|STATISTICS", MajorV, value = TRUE, ignore.case = TRUE)
## [1] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [2] "STATISTICS AND DECISION SCIENCE"              
## [3] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"

To identify majors that contain either DATA or STATISTICS in the name I had to create a vector from the college_major dataset’s “Major” variable, then run the above grep code.

The two exercises below are taken from R for Data Science, 14.3.5.1 in the on-line version:

  1. Describe, in words, what these expressions will match:

(.)\1\1 This will match any characters in a string that repeats three times.

“(.)(.)\2\1” This will match any characters in a string with four characters that the first and fourth characters repeat and the second and third characters repeat.

(..)\1 This will match any four characters in a string where the first and third characters repeat and the second and fourth characters repeat.

“(.).\1.\1” This will match any five characters in a string where the first, third, and fifth characters repeat.

“(.)(.)(.).*\3\2\1” This will match any seven characters in a string where the first and seventh, second and sixth, and third and fifth characters repeat.

  1. Construct regular expressions to match words that:

Start and end with the same character. “^([a-z]).*\1$”

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.) “(..).*\1”

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.) “([a-z].*){3,}”