Provide an example of at least three dataframes in R that demonstrate normalization. The dataframes can contain any data, either real or synthetic. Although normalization is typically done in SQL and relational databases, you are expected to show this example in R, as it is our main work environment in this course.
# Install and load the necessary packages in one step
# Check if the 'dplyr' and 'tidyverse' packages are installed; if not, install them
if (!requireNamespace("dplyr", quietly = TRUE)) {
install.packages("dplyr")
}
if (!requireNamespace("tidyverse", quietly = TRUE)) {
install.packages("tidyverse")
}
# Load the necessary libraries
library(dplyr) # Load dplyr for data manipulation and %>%
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse) # Load tidyverse, which includes useful packages like dplyr
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.4.4 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Create the 'employees' dataframe, which contains employee information
employees <- data.frame(
employee_id = c(1, 2, 3, 4), # Employee IDs
employee_name = c("John Doe", "Jane Smith", # Employee names
"Emily Davis", "Michael Brown"),
hire_date = as.Date(c("2020-01-15", "2019-07-30", # Hire dates for employees
"2021-02-22", "2022-06-10"))
)
# Create the 'departments' dataframe, which contains department information
departments <- data.frame(
department_id = c(1, 2, 3), # Department IDs
department_name = c("HR", "IT", "Marketing") # Department names
)
# Create the 'employee_departments' dataframe, which links employees to departments
employee_departments <- data.frame(
employee_id = c(1, 2, 3, 4), # Employee IDs linking to employees table
department_id = c(1, 2, 2, 3) # Department IDs linking to departments table
)
# Print each dataframe separately
print("Employees Dataframe:")
## [1] "Employees Dataframe:"
print(employees)
## employee_id employee_name hire_date
## 1 1 John Doe 2020-01-15
## 2 2 Jane Smith 2019-07-30
## 3 3 Emily Davis 2021-02-22
## 4 4 Michael Brown 2022-06-10
print("Departments Dataframe:")
## [1] "Departments Dataframe:"
print(departments)
## department_id department_name
## 1 1 HR
## 2 2 IT
## 3 3 Marketing
print("Employee_Departments Dataframe:")
## [1] "Employee_Departments Dataframe:"
print(employee_departments)
## employee_id department_id
## 1 1 1
## 2 2 2
## 3 3 2
## 4 4 3
# Join the dataframes using left_join and the pipe (%>%) operator
# First, join 'employee_departments' with 'employees' by 'employee_id'
# Then, join the result with 'departments' by 'department_id'
merged_data <- employee_departments %>%
left_join(employees, by = "employee_id") %>%
left_join(departments, by = "department_id")
# Print the final merged dataframe
print("Merged Dataframe:")
## [1] "Merged Dataframe:"
print(merged_data)
## employee_id department_id employee_name hire_date department_name
## 1 1 1 John Doe 2020-01-15 HR
## 2 2 2 Jane Smith 2019-07-30 IT
## 3 3 2 Emily Davis 2021-02-22 IT
## 4 4 3 Michael Brown 2022-06-10 Marketing
Contains unique information about each employee (Employee ID, Name, Hire Date). Does not store department information directly to avoid redundancy.
Contains unique information about each department (Department ID, Department Name). This table is independent of employees.
This table creates the relationship between employees and departments, demonstrating a many-to-one relationship where each employee is assigned to a department.
Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”
First, I imported the CSV file directly from the raw GitHub URL and
stored the data in a dataframe called majors. Then, I
loaded the necessary libraries, dplyr and
tidyverse, to facilitate data manipulation. After that, I
used the filter function to apply a regular expression that
filtered out the majors containing either ‘DATA’ or ‘STATISTICS’. The
result was saved in a new dataframe named infosci_maj.
#Importing the majors data
library(readr)
majors=read.csv(url("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv"))
names(majors)
## [1] "FOD1P" "Major" "Major_Category"
#loading, tidyverse, dplyr
library(dplyr)
library(tidyverse)
#filtering out only DATA or STATISTICS using regex
infosci_maj = majors %>%
filter(grepl("(?=.*DATA)|(?=.*STATISTICS)", Major, perl = TRUE))
head(infosci_maj)
Describe, in words, what these expressions will match:
(.)\1\1
“(.)(.)\\2\\1”
(..)\1
“(.).\\1.\\1”
“(.)(.)(.).*\\3\\2\\1”
This regular expression matches any character that is repeated three times in a row, such as ‘zzz’, ‘555’, or ‘AAAinsurance’.
This regular expression matches two characters followed by the same two characters in reverse order. For example, it would match ‘toot’, but not ‘ttoo’.
This regular expression matches any two characters that are repeated immediately, like ‘2121’ or ‘momo’.
This one is a bit more complex. For instance, if you have three characters like ‘e’, ‘m’, and ‘w’, it will match sequences such as ‘emewe’. The same pattern can apply to other letters, such as ‘tmtwt’ or ‘42484’. The key is that the first character repeats in positions 1, 3, and 5.
This regular expression matches any sequence of three characters followed by any characters, and then the same three characters in reverse order. For example, it would match ‘123…anything…321’ or ‘xyz…zyx’. It can be particularly useful for matching anagrams, like ‘racecar…racecar’.
Construct regular expressions to match words that:
Start and end with the same character.
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
To tackle this problem, I first created a sample text containing a variety of U.S. city names, ensuring there were examples that represented each condition specified, along with some additional values for extra coverage.
Next, I processed the city names by splitting the text into a vector,
city_vector, using str_split to divide the
cities by spaces. After the split, I removed any empty strings that may
have appeared during the process.
To identify cities where the first and last characters are the same,
I first converted all city names in city_vector to
lowercase. Then, I applied str_detect with the regular
expression “^(.).*\\1$”. This regex filters cities where the first and
last characters are identical. Finally, I printed the results.
Similarly, to find cities that contain a repeated pair of characters,
I used str_detect with the regular expression “(.).*\\1”.
This pattern identifies cities where a pair of characters appears more
than once in the string. The filtered results were printed
afterward.
For the last example, I applied the regex “(.).*\\1.*\\1” using
str_detect, which matches cities that contain a single
character repeated at least three times within the string. These results
were also printed for verification.
This method effectively identifies cities that meet the specified conditions through regular expressions and string manipulation functions in R.
# Define a string of city names for pattern matching
cities = "NewYork Chicago Alameda LosAngeles Mississippi Houston Miami Philadelphia Elma Atlanta Dallas Tallahassee Seattle Eureka Cincinnati Sacramento Denver NewPaltz Buffalo Aurora "
# Split the city names into individual elements stored in a vector
city_vector = str_split(cities, " ")[[1]]
# Remove any empty elements from the vector
city_vector = city_vector[city_vector != ""]
# Identify cities that start and end with the same character
abba = city_vector[str_detect(tolower(city_vector), "^(.).*\\1$")]
print(paste("Cities that start and end with the same character:", paste(abba, collapse = ", ")))
## [1] "Cities that start and end with the same character: Alameda, Atlanta, Aurora"
# Identify cities that contain a repeated pair of characters
aa = city_vector[str_detect(city_vector, "(.).*\\1")]
print(paste("Cities that contain a repeated pair of letters:", paste(aa, collapse = ", ")))
## [1] "Cities that contain a repeated pair of letters: Alameda, LosAngeles, Mississippi, Houston, Miami, Philadelphia, Atlanta, Dallas, Tallahassee, Seattle, Cincinnati, Sacramento, Denver, Buffalo, Aurora"
# Identify cities that contain a single character repeated three times
aaa = city_vector[str_detect(city_vector, "(.).*\\1.*\\1")]
print(paste("Cities that contain one letter repeated in at least three places:", paste(aaa, collapse = ", ")))
## [1] "Cities that contain one letter repeated in at least three places: Mississippi, Tallahassee, Cincinnati"