Loading packages needed for this assignment.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(dplyr)
library(knitr)
1. Provide an example of at least three dataframes in R that demonstrate normalization. The dataframes can contain any data, either real or synthetic. Although normalization is typically done in SQL and relational databases, you are expected to show this example in R, as it is our main work environment in this course.
Below I show three data frames that are normalized.
Data frame 1: this data frame, called my_cousins_heights, holds my cousins’ names, ages, and heights. It is in first normal form: every cell holds a single value, and each row has a primary key (person_ID_PK) that is never repeated.
my_cousins_heights<-data.frame(person_ID_PK=c(1,2,3,4), name=c('Elvin', 'Nina', 'Nathan', 'William'), height_feet=c(5.8, 5.4, 4.8, 5.7), age= c(38,30,7,40))
my_cousins_heights
## person_ID_PK name height_feet age
## 1 1 Elvin 5.8 38
## 2 2 Nina 5.4 30
## 3 3 Nathan 4.8 7
## 4 4 William 5.7 40
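One way to check the primary key claim is to confirm that the key column has no missing values and no duplicates. The sketch below rebuilds the same data frame so it runs on its own; the helper name is_valid_key is hypothetical and not part of the assignment.

```r
# Rebuild the data frame from above so this block is self-contained
my_cousins_heights <- data.frame(
  person_ID_PK = c(1, 2, 3, 4),
  name = c("Elvin", "Nina", "Nathan", "William"),
  height_feet = c(5.8, 5.4, 4.8, 5.7),
  age = c(38, 30, 7, 40)
)

# Hypothetical helper: a key is valid if it has no NA and no repeated values
is_valid_key <- function(key) {
  !anyNA(key) && anyDuplicated(key) == 0
}

key_ok <- is_valid_key(my_cousins_heights$person_ID_PK)
key_ok
```

The same helper could be pointed at any candidate key column before treating it as a primary key.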
Data frame 2: a normalized data frame of my emergency contacts. Some of my contacts have more than one phone number, so to keep a single value in each cell a second phone number column was added. Contacts without a second phone number have NA in that column.
my_emergency_contacts<-data.frame(contact_ID_PK=1:3, contacts=c('Jenny', 'Nanie', 'Elvin'), phone_number1=c('555-555-5555', '789-788-9999', '347-000-4545'), phone_number2=c('646-555-5555', NA, NA))
my_emergency_contacts
## contact_ID_PK contacts phone_number1 phone_number2
## 1 1 Jenny 555-555-5555 646-555-5555
## 2 2 Nanie 789-788-9999 <NA>
## 3 3 Elvin 347-000-4545 <NA>
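The phone_number1/phone_number2 layout works for this small example, but numbered columns are often treated as a repeating group. A stricter alternative, sketched here under that assumption, moves the phone numbers into their own table with one row per number. The table names contacts and contact_phones and the columns phone_ID_PK and contact_ID_FK are made up for illustration.

```r
# One row per contact
contacts <- data.frame(
  contact_ID_PK = 1:3,
  contact = c("Jenny", "Nanie", "Elvin")
)

# One row per phone number; contact_ID_FK points back to contacts
contact_phones <- data.frame(
  phone_ID_PK = 1:4,
  contact_ID_FK = c(1, 1, 2, 3),
  phone_number = c("555-555-5555", "646-555-5555",
                   "789-788-9999", "347-000-4545")
)

# Join the two tables to recover each contact's numbers
contact_list <- merge(contacts, contact_phones,
                      by.x = "contact_ID_PK", by.y = "contact_ID_FK")
contact_list
```

With this layout no NA placeholders are needed, and a contact with a third number only adds a row rather than a new column.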
Data frame 3: a data frame built into ‘OpenIntro’ with nutritional information on Starbucks food items. It had no primary key, which it needs to be in first normal form, so I piped the data frame into mutate() to add a food_number column that labels each food item with a number. Since no item is repeated, one number per row is enough. The result is saved as starbucks_; the output below prints the original starbucks, so the food_number column does not appear there.
data(starbucks)
# Add a sequential key column; row_number() avoids hard-coding the 77 rows
starbucks_ <- starbucks |>
  mutate(food_number = row_number(),
         .before = item)
starbucks
## # A tibble: 77 × 7
## item calories fat carb fiber protein type
## <chr> <int> <dbl> <int> <int> <int> <fct>
## 1 "8-Grain Roll" 350 8 67 5 10 bakery
## 2 "Apple Bran Muffin" 350 9 64 7 6 bakery
## 3 "Apple Fritter" 420 20 59 0 5 bakery
## 4 "Banana Nut Loaf" 490 19 75 4 7 bakery
## 5 "Birthday Cake Mini Doughnut" 130 6 17 0 0 bakery
## 6 "Blueberry Oat Bar" 370 14 47 5 6 bakery
## 7 "Blueberry Scone" 460 22 61 2 7 bakery
## 8 "Bountiful Blueberry Muffin" 370 14 55 0 6 bakery
## 9 "Butter Croissant " 310 18 32 0 5 bakery
## 10 "Cheese Danish" 420 25 39 0 7 bakery
## # ℹ 67 more rows
Character Manipulation
Article: https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/
Data frame: https://github.com/fivethirtyeight/data/blob/master/college-majors/majors-list.csv
major_list<-read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv')
## Rows: 174 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): FOD1P, Major, Major_Category
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
major_list #view data frame
## # A tibble: 174 × 3
## FOD1P Major Major_Category
## <chr> <chr> <chr>
## 1 1100 GENERAL AGRICULTURE Agriculture & Natural Resources
## 2 1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3 1102 AGRICULTURAL ECONOMICS Agriculture & Natural Resources
## 4 1103 ANIMAL SCIENCES Agriculture & Natural Resources
## 5 1104 FOOD SCIENCE Agriculture & Natural Resources
## 6 1105 PLANT SCIENCE AND AGRONOMY Agriculture & Natural Resources
## 7 1106 SOIL SCIENCE Agriculture & Natural Resources
## 8 1199 MISCELLANEOUS AGRICULTURE Agriculture & Natural Resources
## 9 1302 FORESTRY Agriculture & Natural Resources
## 10 1303 NATURAL RESOURCES MANAGEMENT Agriculture & Natural Resources
## # ℹ 164 more rows
data_or_statistics_majors<-str_subset(major_list$Major, "DATA|STATISTICS")
data_or_statistics_majors #View data for major in Data or Statistics
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"
## [3] "STATISTICS AND DECISION SCIENCE"
In the list of majors from the article on picking a college major for economic success, only one major contains ‘DATA’, which is “COMPUTER PROGRAMMING AND DATA PROCESSING”, and two contain ‘STATISTICS’: “MANAGEMENT INFORMATION SYSTEMS AND STATISTICS” and “STATISTICS AND DECISION SCIENCE”. So only 3 of the 174 majors pertain to data or statistics. There are many engineering majors by comparison, but we can’t all be engineers.
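str_subset() returns only the matching strings. To keep whole rows for the matching majors, str_detect() can be used as a row filter instead. This is a minimal sketch on a hand-copied sample; the data frame majors is an assumption standing in for major_list so the block runs without downloading the file.

```r
library(stringr)

# Small hand-copied sample of the Major column (stands in for major_list)
majors <- data.frame(
  Major = c("GENERAL AGRICULTURE",
            "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
            "COMPUTER PROGRAMMING AND DATA PROCESSING",
            "STATISTICS AND DECISION SCIENCE",
            "SOIL SCIENCE")
)

# str_detect() returns TRUE/FALSE per row, so it can subset whole rows
is_match <- str_detect(majors$Major, "DATA|STATISTICS")
matched <- majors[is_match, , drop = FALSE]
matched

# Count of matching majors in the sample
n_matched <- sum(is_match)
n_matched
```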
The two exercises below are taken from R for Data Science, 14.3.5.1 in the online version.
3. Describe, in words, what these expressions will match:
(.)\1\1: Matches any character repeated three times in a row, e.g. “eee” or “111”. To use it in R, the expression has to be written as a string with the backslashes escaped, “(.)\\1\\1”.
“(.)(.)\\2\\1”: Matches a pair of characters followed by the same pair in reverse order, e.g. “cbbc”.
(..)\1: Matches any two characters repeated immediately, e.g. “abab”. As a string in R this is written “(..)\\1”.
“(.).\\1.\\1”: Matches a character that appears three times with exactly one character between each occurrence, e.g. “efere” in “reference”.
“(.)(.)(.).*\\3\\2\\1”: Matches three characters, followed by zero or more of any characters, followed by the same three characters in reverse order, e.g. “abc27hnchcba”.
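The descriptions above can be checked by running each pattern against one string that should match and one that should not. This is a small sketch; the test strings and variable names are chosen for illustration only.

```r
library(stringr)

# Each pattern is written as an R string, so every backslash is doubled

# Same character three times in a row
triple <- str_detect(c("eee", "111", "abc"), "(.)\\1\\1")

# Two characters followed by the same two in reverse order
mirrored <- str_detect(c("cbbc", "abcd"), "(.)(.)\\2\\1")

# A pair of characters repeated immediately
pair <- str_detect(c("abab", "abba"), "(..)\\1")

# One character three times, with exactly one character between occurrences
spaced <- str_detect(c("reference", "regex"), "(.).\\1.\\1")

# Three characters, anything in between, then the three characters reversed
reversed <- str_detect(c("abc27hnchcba", "abcabc"), "(.)(.)(.).*\\3\\2\\1")

list(triple = triple, mirrored = mirrored, pair = pair,
     spaced = spaced, reversed = reversed)
```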
4. Construct regular expressions to match words that:
Start and end with the same character.
test<-c("111", "aaa", "abc", "abbc", "abba", "Eleven", "Church", "abab")
str_view(test, regex("^(.)((.*\\1$)|\\1$)", ignore_case=TRUE))
## [1] │ <111>
## [2] │ <aaa>
## [5] │ <abba>
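The second alternative in the pattern above is already covered by the first, because .* can match nothing. A shorter sketch makes the whole tail optional instead, under the assumption that a single-character word such as “a” should also count as starting and ending with the same character. The extra test word “a” is added here for illustration.

```r
library(stringr)

test <- c("111", "aaa", "abc", "abbc", "abba", "Eleven", "Church", "abab", "a")

# ^(.) captures the first character; (.*\\1)?$ optionally requires the
# word to end with that same character
same_ends <- str_detect(test, regex("^(.)(.*\\1)?$", ignore_case = TRUE))
test[same_ends]
```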
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
str_view(test, regex("([A-Za-z][A-Za-z]).*\\1", ignore_case=TRUE))
## [7] │ <Church>
## [8] │ <abab>
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
str_view(test, regex("(.).\\1.\\1", ignore_case=TRUE))
## [6] │ <Eleve>n
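The pattern above only matches when the repeats are separated by exactly one character, as in “Eleve”. A more general sketch, assuming the letter may repeat anywhere in the word, replaces each single . with .* and restricts the captured character to letters. The extra test words “banana” and “Mississippi” are added for illustration.

```r
library(stringr)

test <- c("111", "aaa", "abc", "abbc", "abba", "Eleven", "Church", "abab",
          "banana", "Mississippi")

# ([a-z]) captures one letter; each .*\\1 finds that letter again later on,
# with any number of characters in between
three_times <- str_detect(test, regex("([a-z]).*\\1.*\\1", ignore_case = TRUE))
test[three_times]
```

Because the capture group is limited to letters, “111” is no longer matched even though it repeats a character three times.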