Assignment 3 Data 607

Normalization

1. Provide an example of at least three dataframes in R that demonstrate normalization. The dataframes can contain any data, either real or synthetic. Although normalization is typically done in SQL and relational databases, you are expected to show this example in R, as it is our main work environment in this course.

Below are three examples. Each one starts from a data frame that is not normalized and then splits it into normalized data frames.

Data frame 1: In this data frame, called my_cousins_heights, you will find my cousins' names along with their ages and heights. Because height and age can change over time, and the same value can occur for more than one person, separate keys were created for these values.

Data not normalized

my_cousins_heights<-data.frame(person_ID_PK=c(1,2,3,4), name=c('Elvin', 'Nina', 'Nathan', 'William'), height_feet=c(5.8, 5.4, 4.8, 5.7), age= c(38,30,7,40))
my_cousins_heights

##   person_ID_PK    name height_feet age
## 1            1   Elvin         5.8  38
## 2            2    Nina         5.4  30
## 3            3  Nathan         4.8   7
## 4            4 William         5.7  40

Normalized Data

Person_name<-subset(my_cousins_heights, select = c(person_ID_PK, name))
Person_name

##   person_ID_PK    name
## 1            1   Elvin
## 2            2    Nina
## 3            3  Nathan
## 4            4 William

Cousin_age<-data.frame(Age_ID=1:4,age= c(38,30,7,40))
Cousin_age

##   Age_ID age
## 1      1  38
## 2      2  30
## 3      3   7
## 4      4  40

Height_info <-data.frame(heightID=c('h1','h2','h3','h4'), height_feet=c(5.8, 5.4, 4.8, 5.7))
Height_info

##   heightID height_feet
## 1       h1         5.8
## 2       h2         5.4
## 3       h3         4.8
## 4       h4         5.7

Height_data<-data.frame(Age_ID=1:4,person_ID_PK=c(1,2,3,4), heightID=c('h1','h2','h3','h4'))
Height_data

##   Age_ID person_ID_PK heightID
## 1      1            1       h1
## 2      2            2       h2
## 3      3            3       h3
## 4      4            4       h4
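One way to check that nothing was lost in the split is to join the normalized tables back together and compare the result with the original table. The sketch below is only an illustration in base R: it re-creates the four tables (with the ages from the original table) so that it runs on its own, and the lower-case object names and `rebuilt` are assumptions made for this example.

```r
# Normalized tables, re-created here so the sketch is self-contained
person_name <- data.frame(person_ID_PK = 1:4,
                          name = c("Elvin", "Nina", "Nathan", "William"),
                          stringsAsFactors = FALSE)
cousin_age  <- data.frame(Age_ID = 1:4, age = c(38, 30, 7, 40))
height_info <- data.frame(heightID = c("h1", "h2", "h3", "h4"),
                          height_feet = c(5.8, 5.4, 4.8, 5.7),
                          stringsAsFactors = FALSE)
height_data <- data.frame(Age_ID = 1:4, person_ID_PK = 1:4,
                          heightID = c("h1", "h2", "h3", "h4"),
                          stringsAsFactors = FALSE)

# Join the linking table to each lookup table on its key
rebuilt <- merge(height_data, person_name, by = "person_ID_PK")
rebuilt <- merge(rebuilt, cousin_age, by = "Age_ID")
rebuilt <- merge(rebuilt, height_info, by = "heightID")

# Keep only the original columns, in the original row order
rebuilt <- rebuilt[order(rebuilt$person_ID_PK),
                   c("person_ID_PK", "name", "height_feet", "age")]
rownames(rebuilt) <- NULL
rebuilt
```

If the keys are consistent, `rebuilt` should hold the same four rows as my_cousins_heights.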

Data frame 2: My emergency contacts. Some of my contacts have more than one phone number, so the original data frame needs a second phone number column, and contacts without a second number get “NA” in it. To avoid that second column, the normalized version stores one phone number per row and uses a separate key to record whether it is the contact's first or second number.

Data not normalized

my_emergency_contacts<-data.frame(contact_ID_PK=1:3, contacts=c('Jenny', 'Nanie', 'Elvin'), phone_number1=c('555-555-5555', '789-788-9999', '347-000-4545'), phone_number2=c('646-555-5555', NA, NA))
my_emergency_contacts

##   contact_ID_PK contacts phone_number1 phone_number2
## 1             1    Jenny  555-555-5555  646-555-5555
## 2             2    Nanie  789-788-9999          <NA>
## 3             3    Elvin  347-000-4545          <NA>

Normalized data

Contact_name<-subset(my_emergency_contacts, select = c(contact_ID_PK, contacts))
Contact_name

##   contact_ID_PK contacts
## 1             1    Jenny
## 2             2    Nanie
## 3             3    Elvin

phone_numbers<-c('555-555-5555','646-555-5555', '789-788-9999','347-000-4545')

Contact_info<-data.frame(info_ID=1:4,contact_ID_PK=c(1,1,2,3),phone_numbers, Phone_ID=c('1','2','1','1'))
Contact_info

##   info_ID contact_ID_PK phone_numbers Phone_ID
## 1       1             1  555-555-5555        1
## 2       2             1  646-555-5555        2
## 3       3             2  789-788-9999        1
## 4       4             3  347-000-4545        1

Phone_ID<-data.frame(phone_ID=1:2, phone_type=c('phone_number1','phone_number2'))
Phone_ID

##   phone_ID    phone_type
## 1        1 phone_number1
## 2        2 phone_number2
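The Contact_info table above was typed in by hand. As a rough sketch, the same one-number-per-row layout could also be derived from the wide table in base R by stacking the two phone columns and dropping the missing values. The names `wide`, `long`, and `phone_cols` are assumptions made for this example.

```r
# Wide (not normalized) table, re-created so the sketch runs on its own
wide <- data.frame(contact_ID_PK = 1:3,
                   contacts = c("Jenny", "Nanie", "Elvin"),
                   phone_number1 = c("555-555-5555", "789-788-9999", "347-000-4545"),
                   phone_number2 = c("646-555-5555", NA, NA),
                   stringsAsFactors = FALSE)

phone_cols <- c("phone_number1", "phone_number2")

# Stack the phone columns: one row per (contact, phone slot)
long <- data.frame(
  contact_ID_PK = rep(wide$contact_ID_PK, times = length(phone_cols)),
  Phone_ID      = rep(seq_along(phone_cols), each = nrow(wide)),
  phone_numbers = unlist(wide[phone_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Drop the empty slots, sort by contact, and number the rows
long <- long[!is.na(long$phone_numbers), ]
long <- long[order(long$contact_ID_PK, long$Phone_ID), ]
long$info_ID <- seq_len(nrow(long))
rownames(long) <- NULL
long
```

Because the NA rows are dropped, a contact with a single number simply has one row, and no placeholder value is needed.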

Data frame 3: A CVS receipt on which I bought 5 items. One item appears twice and two pairs of items share the same price, so keys were created for the items, the item prices, and the item types. The keys were created for each of these columns because their values can be repeated.

Data not normalized

my_CVS_receipt<-data.frame(Receipt_K=1:5, items=c('M&Ms', 'Motts apple juice', 'M&Ms', 'Dove shampoo', 'Dove soap'), item_price=c('2.50', '3.50', '2.50','5.00','5.00'), item_type=c('food', 'drink','food','hair care', 'body care'))
my_CVS_receipt

##   Receipt_K             items item_price item_type
## 1         1              M&Ms       2.50      food
## 2         2 Motts apple juice       3.50     drink
## 3         3              M&Ms       2.50      food
## 4         4      Dove shampoo       5.00 hair care
## 5         5         Dove soap       5.00 body care

Data normalized

items_info<-data.frame(items.keys=c('i1', 'i2','i3','i4'), items=c('M&Ms', 'Motts apple juice', 'Dove shampoo', 'Dove soap'))
items_info

##   items.keys             items
## 1         i1              M&Ms
## 2         i2 Motts apple juice
## 3         i3      Dove shampoo
## 4         i4         Dove soap

items_type<-data.frame(items.keys=c('t1', 't2','t3','t4'), item_type=c('food', 'drink','hair care', 'body care'))
items_type

##   items.keys item_type
## 1         t1      food
## 2         t2     drink
## 3         t3 hair care
## 4         t4 body care

items_price<-data.frame(items.keys=c('p1', 'p2','p3'), item_price=c('2.50', '3.50','5.00'))
items_price

##   items.keys item_price
## 1         p1       2.50
## 2         p2       3.50
## 3         p3       5.00

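library(tidyverse) # loads dplyr (select, mutate, %>%), readr (read_csv) and stringr (str_subset, str_view)

receipt_data<-select(my_CVS_receipt, Receipt_K, items, item_price, item_type)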

receipt_data2<-receipt_data %>%
  dplyr::mutate(items=c('i1','i2','i1','i3','i4'),item_price=c('p1', 'p2','p1','p3','p3'), item_type=c('t1', 't2','t1','t3','t4'))
receipt_data2

##   Receipt_K items item_price item_type
## 1         1    i1         p1        t1
## 2         2    i2         p2        t2
## 3         3    i1         p1        t1
## 4         4    i3         p3        t3
## 5         5    i4         p3        t4
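In the code above the keys are typed in by hand, which works for five rows but is easy to get wrong on a longer receipt. A possible alternative, sketched here in base R, is to build each lookup table from the unique values of a column and then use match() to swap each value for its key. The helper `make_lookup()` and the object names are hypothetical and exist only for this illustration.

```r
# Receipt re-created so the sketch is self-contained
receipt <- data.frame(
  Receipt_K  = 1:5,
  items      = c("M&Ms", "Motts apple juice", "M&Ms", "Dove shampoo", "Dove soap"),
  item_price = c("2.50", "3.50", "2.50", "5.00", "5.00"),
  item_type  = c("food", "drink", "food", "hair care", "body care"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one key per distinct value, in order of first appearance
make_lookup <- function(values, prefix) {
  unique_values <- unique(values)
  data.frame(key = paste0(prefix, seq_along(unique_values)),
             value = unique_values,
             stringsAsFactors = FALSE)
}

items_lookup <- make_lookup(receipt$items, "i")
price_lookup <- make_lookup(receipt$item_price, "p")
type_lookup  <- make_lookup(receipt$item_type, "t")

# Replace every value with the key of its row in the lookup table
receipt_keys <- data.frame(
  Receipt_K  = receipt$Receipt_K,
  items      = items_lookup$key[match(receipt$items, items_lookup$value)],
  item_price = price_lookup$key[match(receipt$item_price, price_lookup$value)],
  item_type  = type_lookup$key[match(receipt$item_type, type_lookup$value)],
  stringsAsFactors = FALSE
)
receipt_keys
```

With this receipt the generated keys should agree with the hand-written ones in receipt_data2.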

Character Manipulation

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

Article: https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/

Data frame: https://github.com/fivethirtyeight/data/blob/master/college-majors/majors-list.csv

major_list<-read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv')

## Rows: 174 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): FOD1P, Major, Major_Category
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

major_list #view data frame

## # A tibble: 174 × 3
##    FOD1P Major                                 Major_Category                 
##    <chr> <chr>                                 <chr>                          
##  1 1100  GENERAL AGRICULTURE                   Agriculture & Natural Resources
##  2 1101  AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
##  3 1102  AGRICULTURAL ECONOMICS                Agriculture & Natural Resources
##  4 1103  ANIMAL SCIENCES                       Agriculture & Natural Resources
##  5 1104  FOOD SCIENCE                          Agriculture & Natural Resources
##  6 1105  PLANT SCIENCE AND AGRONOMY            Agriculture & Natural Resources
##  7 1106  SOIL SCIENCE                          Agriculture & Natural Resources
##  8 1199  MISCELLANEOUS AGRICULTURE             Agriculture & Natural Resources
##  9 1302  FORESTRY                              Agriculture & Natural Resources
## 10 1303  NATURAL RESOURCES MANAGEMENT          Agriculture & Natural Resources
## # ℹ 164 more rows

data_or_statistics_majors<-str_subset(major_list$Major, "DATA|STATISTICS")
data_or_statistics_majors #View data for major in Data or Statistics

## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"
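The pattern "DATA|STATISTICS" is a substring match, so it would also pick up a major whose name merely contains those letters inside a longer word. The sketch below, using base R's grepl(), shows how word boundaries could make the match stricter. It runs on a small sample of majors rather than the full file, and the last entry, "DATABASE ADMINISTRATION", is hypothetical: it is not in the dataset and is included only to show the difference.

```r
# Small sample of majors; the last entry is hypothetical (NOT in the dataset)
majors <- c("GENERAL AGRICULTURE",
            "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE",
            "DATABASE ADMINISTRATION")

# Substring match: also hits the "DATA" inside "DATABASE"
substring_hits <- majors[grepl("DATA|STATISTICS", majors)]

# Whole-word match: \\b is a word boundary (perl = TRUE selects PCRE syntax)
word_hits <- majors[grepl("\\b(DATA|STATISTICS)\\b", majors, perl = TRUE)]

substring_hits
word_hits
```

For the real list both versions give the same three majors, so the simpler pattern is enough here.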

In the data frame of majors from the article on picking a major for economic success, only one major contains ‘DATA’, which is “COMPUTER PROGRAMMING AND DATA PROCESSING”, and two contain ‘STATISTICS’: “MANAGEMENT INFORMATION SYSTEMS AND STATISTICS” and “STATISTICS AND DECISION SCIENCE”. So there are only 3 majors pertaining to statistics or data. Yet there are so many engineering majors, but I mean, we can’t all be engineers.

The two exercises below are taken from R for Data Science, 14.3.5.1 in the online version:

3. Describe, in words, what these expressions will match:

(.)\1\1: Matches any single character repeated three times in a row, e.g. “eee” and “111”. To use it in R, the expression has to be written as a string, “(.)\\1\\1”.

“(.)(.)\\2\\1”: Matches a pair of characters followed by the same pair in reverse order, e.g. “cbbc”.

(..)\1: Written as the string “(..)\\1”, the expression matches a pair of characters that is repeated immediately, e.g. “abab”.

“(.).\\1.\\1”: Matches a character that appears three times with exactly one arbitrary character between each occurrence, e.g. “efere” from “reference”.

“(.)(.)(.).*\\3\\2\\1”: Matches three characters, followed by zero or more arbitrary characters, followed by the same three characters in reverse order, e.g. “abc27hnchcba”.
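The descriptions above can be checked by running each expression against the example strings. The sketch below uses base R's grepl() with perl = TRUE, and the labels given to the patterns are made up for this illustration. Each backslash is doubled because the expressions are written as R strings.

```r
# The five expressions as R strings (labels are illustrative only)
patterns <- c(triple          = "(.)\\1\\1",
              mirrored_pair   = "(.)(.)\\2\\1",
              repeated_pair   = "(..)\\1",
              alternating     = "(.).\\1.\\1",
              mirrored_triple = "(.)(.)(.).*\\3\\2\\1")

# One example string per expression, in the same order
examples <- c("eee", "cbbc", "abab", "reference", "abc27hnchcba")

# Rows are example strings, columns are patterns; TRUE means "matches"
hits <- sapply(patterns, function(p) grepl(p, examples, perl = TRUE))
rownames(hits) <- examples
hits
```

Each example string was chosen for one expression, so the TRUE values should fall on the diagonal of the table.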

4. Construct regular expressions to match words that:

Start and end with the same character.

test<-c("111", "aaa", "abc", "abbc", "abba", "Eleven", "Church", "abab")
str_view(test, regex("^(.)((.*\\1$)|\\1$)", ignore_case=TRUE))

## [1] │ <111>
## [2] │ <aaa>
## [5] │ <abba>
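The expression above needs at least two characters, so a one-letter word such as "a" would not match even though it trivially starts and ends with the same character. A possible shorter variant, sketched here with base R's grepl(), makes the "anything, then the first character again" part optional. The vector `words` and the choice to lower-case the input instead of using ignore_case are assumptions made for this example.

```r
# Test words; "a" is added to cover the one-letter case
words <- c("a", "111", "aaa", "abc", "abbc", "abba", "Eleven", "Church", "abab")

# ^(.)       first character, captured as group 1
# (.*\\1)?   optionally: anything, then group 1 again
# $          end of the string
same_ends <- grepl("^(.)(.*\\1)?$", tolower(words), perl = TRUE)

words[same_ends]
```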

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

str_view(test, regex("([A-Za-z][A-Za-z]).*\\1", ignore_case=TRUE))

## [7] │ <Church>
## [8] │ <abab>

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

str_view(test, regex("(.).\\1.\\1", ignore_case=TRUE))

## [6] │ <Eleve>n
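The expression "(.).\\1.\\1" only matches when exactly one character sits between the repeats, which is stricter than "repeated in at least three places". As a sketch of a looser reading of the exercise, the pattern below allows any number of characters between the three occurrences and restricts the repeated character to a letter. It uses base R's grepl() on lower-cased input, and the word "mississippi" is added to the test vector here only to show the difference.

```r
# Lower-cased test words; "mississippi" is added for this comparison
words <- c("111", "aaa", "abc", "abbc", "abba", "eleven", "church", "abab",
           "mississippi")

# Strict version from above: repeats must be exactly two positions apart
strict <- grepl("(.).\\1.\\1", words, perl = TRUE)

# Looser version: the same letter anywhere in the word, three times
general <- grepl("([a-z]).*\\1.*\\1", words, perl = TRUE)

words[strict]
words[general]
```

The looser pattern also accepts "aaa" and "mississippi", where the repeated letters are not spaced exactly one character apart.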

Andreina A

2024-09-21
