library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Please deliver links to an R Markdown file (in GitHub and rpubs.com) with solutions to the problems below. You may work in a small group, but please submit separately with names of all group participants in your submission.
Provide an example of at least three dataframes in R that demonstrate normalization. The dataframes can contain any data, either real or synthetic. Although normalization is typically done in SQL and relational databases, you are expected to show this example in R, as it is our main work environment in this course.
DataFrame 1: Product ratings
## Product ratings
Product_ratings <- data.frame(
  ProductID = 1:6,
  Rating = c(4.1, 3.7, 5.0, 4.4, 3.2, 4.5),
  Price = c(50, 60, 80, 70, 40, 45)
)
print(Product_ratings)
## ProductID Rating Price
## 1 1 4.1 50
## 2 2 3.7 60
## 3 3 5.0 80
## 4 4 4.4 70
## 5 5 3.2 40
## 6 6 4.5 45
Example 1 using DataFrame 1: Min-Max Normalization
Min-max normalization rescales numeric data to a fixed range, typically 0 to 1, using (x - min) / (max - min). It has two main benefits. First, it gives a consistent scale across variables: features such as rating and price end up on the same scale, so a variable with a larger range does not dominate. Second, it preserves relationships: because the transformation is linear, the ordering and relative spacing of the data points are unchanged.
# Min-max normalize the numeric columns of a data frame.
# Columns named in `exclude` (for example, identifiers) are left untouched.
normalize_df <- function(df, exclude = character()) {
  df_normalized <- df
  for (col in setdiff(names(df), exclude)) {
    if (is.numeric(df[[col]])) {
      min_val <- min(df[[col]], na.rm = TRUE)
      max_val <- max(df[[col]], na.rm = TRUE)
      # (x - min) / (max - min) maps the column onto [0, 1]
      df_normalized[[col]] <- (df[[col]] - min_val) / (max_val - min_val)
    }
  }
  df_normalized
}
# ProductID is an identifier, not a measurement, so it is not rescaled
Product_ratings_normalized <- normalize_df(Product_ratings, exclude = "ProductID")
print(Product_ratings_normalized)
## ProductID Rating Price
## 1 1 0.5000000 0.250
## 2 2 0.2777778 0.500
## 3 3 1.0000000 1.000
## 4 4 0.6666667 0.750
## 5 5 0.0000000 0.000
## 6 6 0.7222222 0.125
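The claim that min-max scaling preserves relationships can be checked directly. The following is a minimal sketch in base R that rebuilds the Price column from the example, scales it, and then inverts the transformation; the variable names `price`, `price_scaled`, and `price_restored` are introduced here for illustration only.

```r
# Price values copied from the Product_ratings example
price <- c(50, 60, 80, 70, 40, 45)

# Min-max scale to [0, 1]
price_scaled <- (price - min(price)) / (max(price) - min(price))

# Invert the scaling using the original minimum and maximum
price_restored <- price_scaled * (max(price) - min(price)) + min(price)

# The scaled values span exactly 0 to 1
print(range(price_scaled))

# The ordering of the observations is unchanged by the scaling
print(identical(order(price), order(price_scaled)))

# The original values are recovered from the scaled ones
print(isTRUE(all.equal(price, price_restored)))
```

Because the transformation is linear, it can always be undone as long as the original minimum and maximum are kept.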
DataFrame 2: Sales data with different units
## Sales data with different units
Sales_Data <- data.frame(
  SalesID = 1:5,
  UnitsSold = c(120, 150, 80, 200, 50),
  Revenue = c(1000, 2000, 1500, 2500, 1200)
)
print(Sales_Data)
## SalesID UnitsSold Revenue
## 1 1 120 1000
## 2 2 150 2000
## 3 3 80 1500
## 4 4 200 2500
## 5 5 50 1200
Example 2 using DataFrame 2: Normalization Using a Custom Range
Custom range normalization scales data to any chosen interval [a, b], such as -1 to 1, by first min-max scaling to [0, 1] and then stretching and shifting the result: a + (b - a) * (x - min) / (max - min). This is useful when the problem calls for a specific scale, for example a range that is symmetric around 0.
# Custom normalization (scale to range between -1 and 1)
Sales_Data_custom_normalized <- Sales_Data %>%
  mutate(
    Revenue_custom_normalized = 2 * ((Revenue - min(Revenue)) / (max(Revenue) - min(Revenue))) - 1
  )
# View the original and custom normalized data
print(Sales_Data_custom_normalized)
## SalesID UnitsSold Revenue Revenue_custom_normalized
## 1 1 120 1000 -1.0000000
## 2 2 150 2000 0.3333333
## 3 3 80 1500 -0.3333333
## 4 4 200 2500 1.0000000
## 5 5 50 1200 -0.7333333
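The same idea can be wrapped in a reusable helper. The sketch below assumes a hypothetical function name, `rescale_to_range()`, with `new_min` and `new_max` arguments; it is not part of any package used above and is only one way to generalize the formula.

```r
# Hypothetical helper: linearly rescale x so that min(x) maps to new_min
# and max(x) maps to new_max
rescale_to_range <- function(x, new_min = 0, new_max = 1) {
  old_min <- min(x, na.rm = TRUE)
  old_max <- max(x, na.rm = TRUE)
  # A constant column has no range to stretch, so fail early
  if (old_max == old_min) {
    stop("x must contain at least two distinct values")
  }
  new_min + (x - old_min) * (new_max - new_min) / (old_max - old_min)
}

# Values copied from the Sales_Data example
revenue <- c(1000, 2000, 1500, 2500, 1200)
units_sold <- c(120, 150, 80, 200, 50)

# Scale Revenue to [-1, 1], as in the example above
revenue_scaled <- rescale_to_range(revenue, -1, 1)
print(round(revenue_scaled, 4))

# Scale UnitsSold to [0, 100]
units_scaled <- rescale_to_range(units_sold, 0, 100)
print(range(units_scaled))
```

The explicit check for a zero range avoids the division by zero that the plain formula would produce on a constant column.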
DataFrame 3: Student data, including test scores and study hours
# DataFrame: Student Test Scores and Study Hours
Student_Data <- data.frame(
  StudentID = 1:6,
  TestScore = c(78, 85, 92, 65, 88, 73),
  StudyHours = c(10, 15, 12, 8, 20, 11)
)
print(Student_Data)
## StudentID TestScore StudyHours
## 1 1 78 10
## 2 2 85 15
## 3 3 92 12
## 4 4 65 8
## 5 5 88 20
## 6 6 73 11
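Example 3 using DataFrame 3: Z-Score Normalization (Standardization)
Z-score normalization transforms each value to (x - mean) / sd, so that the column has a mean of 0 and a standard deviation of 1. It is particularly useful when features are measured on different scales, as test scores and study hours are here. Putting features on a common scale keeps any one of them from dominating distance-based or regularized models, and it can help gradient-based optimizers converge faster. Note that sd() in R computes the sample standard deviation (dividing by n - 1).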
# Z-score normalization for TestScore and StudyHours
Student_Data_standardized <- Student_Data %>%
  mutate(
    TestScore_standardized = (TestScore - mean(TestScore)) / sd(TestScore),
    StudyHours_standardized = (StudyHours - mean(StudyHours)) / sd(StudyHours)
  )
# View the original and standardized data
Student_Data_standardized
## StudentID TestScore StudyHours TestScore_standardized StudyHours_standardized
## 1 1 78 10 -0.2143569 -0.6239346
## 2 2 85 15 0.4781808 0.5459428
## 3 3 92 12 1.1707185 -0.1559837
## 4 4 65 8 -1.5004984 -1.0918856
## 5 5 88 20 0.7749827 1.7158202
## 6 6 73 11 -0.7090267 -0.3899591
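As a sanity check, the manual z-score can be compared against base R's `scale()`, which centers by the mean and divides by the sample standard deviation by default. The sketch below uses the TestScore values from the example; the variable names are illustrative.

```r
# TestScore values copied from the Student_Data example
test_score <- c(78, 85, 92, 65, 88, 73)

# Manual z-score, as in the example above
z_manual <- (test_score - mean(test_score)) / sd(test_score)

# scale() returns a one-column matrix with attributes; as.numeric() drops them
z_scale <- as.numeric(scale(test_score))

# Both approaches should agree up to floating-point tolerance
print(isTRUE(all.equal(z_manual, z_scale)))

# The standardized column has mean (approximately) 0 and standard deviation 1
print(round(mean(z_manual), 10))
print(sd(z_manual))
```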
Using the 173 majors listed in fivethirtyeight.com’s College Majors dataset [https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/], provide code that identifies the majors that contain either “DATA” or “STATISTICS”
# Load library
library(dplyr)
# Load the dataset
url <- "https://raw.githubusercontent.com/Badigun/Data-607-Assignments/refs/heads/main/majors-list.csv"
majors_list <- read.csv(url)
# Check the dataset
head(majors_list)
## FOD1P Major Major_Category
## 1 1100 GENERAL AGRICULTURE Agriculture & Natural Resources
## 2 1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3 1102 AGRICULTURAL ECONOMICS Agriculture & Natural Resources
## 4 1103 ANIMAL SCIENCES Agriculture & Natural Resources
## 5 1104 FOOD SCIENCE Agriculture & Natural Resources
## 6 1105 PLANT SCIENCE AND AGRONOMY Agriculture & Natural Resources
# Filter majors that contain "DATA" or "STATISTICS"
filtered_Majors_Data <- majors_list %>%
  filter(grepl("DATA|STATISTICS", Major, ignore.case = TRUE))
# Display the filtered majors
print(filtered_Majors_Data$Major)
## [1] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"
## [2] "COMPUTER PROGRAMMING AND DATA PROCESSING"
## [3] "STATISTICS AND DECISION SCIENCE"
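The effect of the `|` alternation and of `ignore.case = TRUE` can be seen without downloading the dataset. The sketch below uses a small hand-built vector: the first four names come from the output above, and the mixed-case "Data Science" entry is a hypothetical addition that is not in the dataset.

```r
# Small stand-in for the Major column; "Data Science" is a made-up entry
majors <- c(
  "GENERAL AGRICULTURE",
  "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS",
  "COMPUTER PROGRAMMING AND DATA PROCESSING",
  "STATISTICS AND DECISION SCIENCE",
  "Data Science"
)

# "DATA|STATISTICS" matches a string containing either word
exact_case <- grep("DATA|STATISTICS", majors, value = TRUE)
print(exact_case)

# ignore.case = TRUE also picks up the mixed-case entry
any_case <- grep("DATA|STATISTICS", majors, value = TRUE, ignore.case = TRUE)
print(any_case)
```

Since the majors in the dataset are stored in upper case, `ignore.case = TRUE` does not change the result there, but it makes the filter robust to differently cased input.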
(.)\1\1 Answer: This expression matches any character that appears three times in a row: the group captures one character, and the two backreferences require it twice more. For example, it matches “aaa” and “bbb”.
“(.)(.)\2\1” Answer: This expression matches two characters followed by the same two characters in reverse order, that is, a four-character palindrome such as “abba”. The second character is repeated immediately, and the first character is repeated as the fourth.
(..)\1 Answer: This expression matches any two characters that are immediately repeated. For example, it matches “abab” and “1212”.
“(.).\1.\1” Answer: This expression matches five characters in which the first, third, and fifth are the same character, while the second and fourth can be any characters. For example, it matches “abaca” and “aXaYa”.
“(.)(.)(.).*\3\2\1” Answer: This expression matches three characters, followed by zero or more characters of any kind, followed by the same three characters in reverse order. For example, it matches “abccba”, “abcxyzcba”, and “123xyzyx321”.
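The descriptions above can be checked against a few sample strings. The sketch below is one way to do so in base R; the pattern labels (`triple`, `mirror2`, and so on) and the sample strings are chosen here for illustration, `perl = TRUE` is a choice made for this sketch, and each backslash is doubled because the patterns are written as R strings.

```r
# The five patterns, written as R strings (backslashes doubled)
patterns <- c(
  triple  = "(.)\\1\\1",
  mirror2 = "(.)(.)\\2\\1",
  pair2   = "(..)\\1",
  spaced  = "(.).\\1.\\1",
  mirror3 = "(.)(.)(.).*\\3\\2\\1"
)

# One sample string per pattern, plus a control that matches none of them
samples <- c("aaa", "abba", "abab", "abaca", "abcxyzcba", "hello")

# Rows are sample strings, columns are patterns
results <- sapply(patterns, function(p) grepl(p, samples, perl = TRUE))
rownames(results) <- samples
print(results)
```

Each sample string matches only the pattern it was written for, and "hello" matches none.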
# Word examples
word <- c("plump", "pen", "radar", "noon", "civic", "world")
# Regex pattern to match words that start and end with the same character
pattern <- "\\b([a-zA-Z]).*\\1\\b"
# use grep to find words that match the pattern
matches <- grep(pattern, word, value = TRUE)
print(matches)
## [1] "plump" "radar" "noon" "civic"
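The `\\b` word boundaries behave like string anchors here only because every element is a single word. The sketch below contrasts the boundary pattern with an anchored variant using `^` and `$`; the anchored pattern and the multi-word test string "a pop b" are assumptions added for illustration, and `perl = TRUE` is a choice made for this sketch.

```r
# Three single words plus one multi-word string (made up for this comparison)
words <- c("plump", "pen", "radar", "a pop b")

# Pattern from the example: a word that starts and ends with the same letter
boundary_pattern <- "\\b([a-zA-Z]).*\\1\\b"

# Anchored variant: the whole string must start and end with the same letter
anchored_pattern <- "^([a-zA-Z]).*\\1$"

boundary_hits <- grepl(boundary_pattern, words, perl = TRUE)
anchored_hits <- grepl(anchored_pattern, words, perl = TRUE)

# The two patterns differ only on the multi-word string, where the
# boundary pattern finds the inner word "pop"
print(data.frame(words, boundary_hits, anchored_hits))
```

Which form is preferable depends on whether the input is one word per element or free text.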
# Sample text
text <- c("church", "money", "success", "monkey", "love")
# Regex pattern to match words that begin and end with the same two letters
pattern <- "\\b([a-zA-Z]{2}).*\\1\\b"
# Find words that match the pattern
matches <- grep(pattern, text, value = TRUE)
print(matches)
## [1] "church"
text <- c("love", "balloon", "success", "hello", "apple", "test")
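# Regex pattern to match words whose first letter appears again later in the word and once more as the last letter
pattern <- "\\b([a-zA-Z]).*\\1.*\\1\\b"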
matches <- grep(pattern, text, value = TRUE)
print(matches)
## [1] "success"
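The last two patterns tie the repeated letters to the start and end of the word. If the goal is instead to find words that merely contain a repeated pair of letters, or one letter in at least three places, the boundaries can be dropped. The sketch below shows this under that assumption; the pattern names and the word list are illustrative, and `perl = TRUE` is a choice made for this sketch.

```r
words <- c("church", "banana", "eleven", "success", "love")

# Any pair of letters that occurs again later in the word
contains_pair <- "([a-zA-Z]{2}).*\\1"

# Any single letter that occurs in at least three places
contains_triple <- "([a-zA-Z]).*\\1.*\\1"

# Pattern from the example above, tied to the first and last letter
start_end_triple <- "\\b([a-zA-Z]).*\\1.*\\1\\b"

pair_hits <- grep(contains_pair, words, value = TRUE, perl = TRUE)
triple_hits <- grep(contains_triple, words, value = TRUE, perl = TRUE)
start_end_hits <- grep(start_end_triple, words, value = TRUE, perl = TRUE)

print(pair_hits)
print(triple_hits)
print(start_end_hits)
```

Here "banana" and "eleven" contain a letter three times but are found only by the unanchored pattern, because neither begins and ends with that letter.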