Assignment 3 Data 607

Normalization

1. Provide an example of at least three dataframes in R that demonstrate normalization. The dataframes can contain any data, either real or synthetic. Although normalization is typically done in SQL and relational databases, you are expected to show this example in R, as it is our main work environment in this course.

Below are three data frames that are normalized.

Data frame 1: this data frame, called my_cousins_heights, lists my cousins’ names and ages along with their heights. The data is in first normal form: every row has a primary key (person_ID_PK), no key value is repeated, and each cell holds a single value.

my_cousins_heights<-data.frame(person_ID_PK=c(1,2,3,4), name=c('Elvin', 'Nina', 'Nathan', 'William'), height_feet=c(5.8, 5.4, 4.8, 5.7), age= c(38,30,7,40))
my_cousins_heights

##   person_ID_PK    name height_feet age
## 1            1   Elvin         5.8  38
## 2            2    Nina         5.4  30
## 3            3  Nathan         4.8   7
## 4            4 William         5.7  40
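
One way to confirm that person_ID_PK can serve as a primary key is to check that it has no missing and no repeated values. The sketch below uses base R and rebuilds the data frame so that it runs on its own; the names key and key_is_valid are introduced here for illustration only.

```r
# Rebuild the data frame from above so the snippet is self-contained
my_cousins_heights <- data.frame(
  person_ID_PK = c(1, 2, 3, 4),
  name = c("Elvin", "Nina", "Nathan", "William"),
  height_feet = c(5.8, 5.4, 4.8, 5.7),
  age = c(38, 30, 7, 40)
)

# A primary key must have no missing values and no duplicates.
# anyDuplicated() returns 0 when every value is unique.
key <- my_cousins_heights$person_ID_PK
key_is_valid <- !anyNA(key) && anyDuplicated(key) == 0
print(key_is_valid)
```

The same two checks can be applied to any candidate key column before treating it as a primary key.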

Data frame 2: a normalized data frame of my emergency contacts. Some of my contacts have more than one phone number, so to keep a single value in each cell I added a second phone number column. Contacts without a second phone number have “NA” as the value in that column.

my_emergency_contacts<-data.frame(contact_ID_PK=1:3, contacts=c('Jenny', 'Nanie', 'Elvin'), phone_number1=c('555-555-5555', '789-788-9999', '347-000-4545'), phone_number2=c('646-555-5555', NA, NA))
my_emergency_contacts

##   contact_ID_PK contacts phone_number1 phone_number2
## 1             1    Jenny  555-555-5555  646-555-5555
## 2             2    Nanie  789-788-9999          <NA>
## 3             3    Elvin  347-000-4545          <NA>
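
A stricter reading of first normal form treats numbered columns such as phone_number1 and phone_number2 as a repeating group. One possible alternative, sketched here under that assumption, stores one phone number per row in a second table linked to the contacts by a foreign key. The names contacts, phone_numbers, phone_ID_PK, contact_ID_FK and contact_phones are hypothetical and not part of the data frame above.

```r
# Parent table: one row per contact
contacts <- data.frame(
  contact_ID_PK = 1:3,
  contact = c("Jenny", "Nanie", "Elvin")
)

# Child table: one row per phone number, linked by a foreign key.
# No NA placeholders are needed for contacts with a single number.
phone_numbers <- data.frame(
  phone_ID_PK = 1:4,
  contact_ID_FK = c(1, 1, 2, 3),
  phone_number = c("555-555-5555", "646-555-5555",
                   "789-788-9999", "347-000-4545")
)

# Join the two tables back together when a combined view is needed
contact_phones <- merge(contacts, phone_numbers,
                        by.x = "contact_ID_PK", by.y = "contact_ID_FK")
print(contact_phones)
```

With this layout a contact can have any number of phone numbers without adding new columns.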

Data frame 3: a data frame built into ‘OpenIntro’ with nutritional information on Starbucks food items. The data frame had no primary key, which it needs in order to be in first normal form, so I piped it into a mutate() call to create food item numbers. Since no item is repeated, all I had to do was create a new column called food_number that labels each food item with a number.

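library(openintro)  # provides the starbucks data set
library(dplyr)      # provides mutate() and row_number()
data(starbucks)
# Add a surrogate primary key as the first column
starbucks_<- starbucks |>
        mutate(food_number=row_number(),
              .before=item)
starbucks  # prints the original table; the keyed copy is stored in starbucks_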

## # A tibble: 77 × 7
##    item                          calories   fat  carb fiber protein type  
##    <chr>                            <int> <dbl> <int> <int>   <int> <fct> 
##  1 "8-Grain Roll"                     350     8    67     5      10 bakery
##  2 "Apple Bran Muffin"                350     9    64     7       6 bakery
##  3 "Apple Fritter"                    420    20    59     0       5 bakery
##  4 "Banana Nut Loaf"                  490    19    75     4       7 bakery
##  5 "Birthday Cake Mini Doughnut"      130     6    17     0       0 bakery
##  6 "Blueberry Oat Bar"                370    14    47     5       6 bakery
##  7 "Blueberry Scone"                  460    22    61     2       7 bakery
##  8 "Bountiful Blueberry Muffin"       370    14    55     0       6 bakery
##  9 "Butter Croissant "                310    18    32     0       5 bakery
## 10 "Cheese Danish"                    420    25    39     0       7 bakery
## # ℹ 67 more rows

Character Manipulation

Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”

Article: https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/

Data frame: https://github.com/fivethirtyeight/data/blob/master/college-majors/majors-list.csv

library(readr)    # provides read_csv()
library(stringr)  # provides str_subset() and str_view()
major_list<-read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv')

## Rows: 174 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): FOD1P, Major, Major_Category
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

major_list #view data frame

## # A tibble: 174 × 3
##    FOD1P Major                                 Major_Category                 
##    <chr> <chr>                                 <chr>                          
##  1 1100  GENERAL AGRICULTURE                   Agriculture & Natural Resources
##  2 1101  AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
##  3 1102  AGRICULTURAL ECONOMICS                Agriculture & Natural Resources
##  4 1103  ANIMAL SCIENCES                       Agriculture & Natural Resources
##  5 1104  FOOD SCIENCE                          Agriculture & Natural Resources
##  6 1105  PLANT SCIENCE AND AGRONOMY            Agriculture & Natural Resources
##  7 1106  SOIL SCIENCE                          Agriculture & Natural Resources
##  8 1199  MISCELLANEOUS AGRICULTURE             Agriculture & Natural Resources
##  9 1302  FORESTRY                              Agriculture & Natural Resources
## 10 1303  NATURAL RESOURCES MANAGEMENT          Agriculture & Natural Resources
## # ℹ 164 more rows

data_or_statistics_majors<-str_subset(major_list$Major, "DATA|STATISTICS")
data_or_statistics_majors #View data for major in Data or Statistics

## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"     
## [3] "STATISTICS AND DECISION SCIENCE"

Based on the data frame of majors from the economic guide to picking a college major, only one major contains ‘DATA’, which is “COMPUTER PROGRAMMING AND DATA PROCESSING”, and two contain ‘STATISTICS’: “MANAGEMENT INFORMATION SYSTEMS AND STATISTICS” and “STATISTICS AND DECISION SCIENCE”. So only 3 majors on the list pertain to statistics or data. Yet there are so many engineering majors, but I mean, we can’t all be engineers.
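
str_subset() returns only the matching strings. If the whole rows are wanted instead, a logical filter is one option. The sketch below is a base R analogue using grepl() on a small hand-made sample of majors copied from the output above; major_sample, is_match and data_or_statistics_rows are names introduced here for illustration.

```r
# Small sample of majors taken from the output above (not the full list)
major_sample <- data.frame(
  Major = c("GENERAL AGRICULTURE",
            "FORESTRY",
            "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE")
)

# TRUE for every major whose name contains DATA or STATISTICS
is_match <- grepl("DATA|STATISTICS", major_sample$Major)

# Keep the matching rows; drop = FALSE keeps the result a data frame
data_or_statistics_rows <- major_sample[is_match, , drop = FALSE]
print(data_or_statistics_rows)
```

Applied to the full major_list, the same filter would keep the other columns of each matching row as well.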

The two exercises below are taken from R for Data Science, section 14.3.5.1 in the online version.

3. Describe, in words, what these expressions will match:

(.)\1\1: Matches any character that repeats three times in a row, e.g. “eee” and “111”. To use it in R, the expression has to be written as a string with the backslashes escaped: "(.)\\1\\1".

"(.)(.)\\2\\1": Matches a pair of characters followed by the same pair in reverse order, e.g. “cbbc”.

(..)\1: Written as the R string "(..)\\1", this expression matches a pair of characters that is repeated immediately, e.g. “abab”.

"(.).\\1.\\1": Matches a character that appears three times with any single character between each occurrence, e.g. “efere” from “reference”.

"(.)(.)(.).*\\3\\2\\1": Matches three characters, followed by zero or more of any characters, followed by the same three characters in reverse order, e.g. “abc27hnchcba”.
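
The descriptions above can be checked by running each expression against its example string. This is a minimal sketch using base R with perl = TRUE rather than stringr; patterns, examples and first_match are names introduced here, and “reference” is used as the input for the fourth pattern.

```r
# Each pattern is paired with the example string from the descriptions above.
# Backslashes are doubled because the patterns are written as R strings.
patterns <- c("(.)\\1\\1",
              "(.)(.)\\2\\1",
              "(..)\\1",
              "(.).\\1.\\1",
              "(.)(.)(.).*\\3\\2\\1")
examples <- c("eee", "cbbc", "abab", "reference", "abc27hnchcba")

# Extract the first match of each pattern in its example string
first_match <- mapply(
  function(pattern, text) regmatches(text, regexpr(pattern, text, perl = TRUE)),
  patterns, examples,
  USE.NAMES = FALSE
)
print(first_match)
```

Each pattern matches its whole example string, except the fourth, which matches only the “efere” part of “reference”.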

4. Construct regular expressions to match words that:

Start and end with the same character.

test<-c("111", "aaa", "abc", "abbc", "abba", "Eleven", "Church", "abab")
str_view(test, regex("^(.).*\\1$", ignore_case=TRUE))  # .* may be empty, so no separate two-character case is needed

## [1] │ <111>
## [2] │ <aaa>
## [5] │ <abba>
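
The pattern above needs at least two characters, so a one-letter word such as “a” is not matched even though it starts and ends with the same character. A possible variant, sketched here in base R with grepl() rather than str_view(), makes everything after the first character optional. Lowercasing the input first is a choice made for this sketch in place of ignore_case, and words and same_ends are illustrative names.

```r
words <- c("a", "abba", "abc", "Anna", "Church")

# ^(.)        capture the first character
# (.*\\1)?    optionally: anything, then the first character again
# $           end of the word
same_ends <- grepl("^(.)(.*\\1)?$", tolower(words), perl = TRUE)
print(words[same_ends])
```

Whether a single character should count as a word that starts and ends with the same character depends on how the exercise is read.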

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)

str_view(test, regex("([A-Za-z][A-Za-z]).*\\1", ignore_case=TRUE))

## [7] │ <Church>
## [8] │ <abab>

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

str_view(test, regex("(.).\\1.\\1", ignore_case=TRUE))

## [6] │ <Eleve>n
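
The expression above only finds a letter that repeats with exactly one character between each occurrence, which is why “aaa” is not in the output. A broader variant, offered here as a sketch rather than as the textbook answer, allows any number of characters between the three occurrences. It uses base R grepl() on lowercased input, and words and has_three are illustrative names.

```r
words <- c("aaa", "Eleven", "banana", "abba", "Church")

# ([a-z])   capture a letter
# .*\\1     any characters, then the same letter again
# .*\\1     any characters, then the same letter a third time
has_three <- grepl("([a-z]).*\\1.*\\1", tolower(words), perl = TRUE)
print(words[has_three])
```

Restricting the group to [a-z] keeps digits such as “111” from counting as repeated letters.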

Andreina A

2024-09-21
