Normalization and character manipulation are essential for structuring and analyzing data efficiently. Using MySQL in R, I normalized movie rating data into structured tables, ensuring consistency and reducing redundancy. For text analysis, I extracted majors containing “DATA” or “STATISTICS” and used regular expressions to detect word patterns. These techniques are valuable for data cleaning, querying, and preparing unstructured data for deeper analysis.
(Using MySQL Database)
# Secure connection to MySQL database
password <- Sys.getenv("MYSQL_PWD")
conn <- tryCatch(
dbConnect(MySQL(),
user = "root",
password = password,
host = "127.0.0.1",
dbname = "movies"),
error = function(e) {
message("Error: ", e$message)
return(NULL)
}
)
if (!is.null(conn)) {
print("Database connection successful.")
ratings_df <- dbGetQuery(conn, "SELECT * FROM ratings;")
} else {
stop("Database connection failed. Check credentials and try again.")
}
## [1] "Database connection successful."
Normalization ensures data is stored in a structured and efficient way. Below are three normalized dataframes:
Users Table
users <- dbGetQuery(conn, "SELECT user_id, user_name, user_preference FROM users;")
users
## user_id user_name user_preference
## 1 1 Saqib Action
## 2 2 Irina Romantic Comedies
## 3 3 Helen Supernatural
## 4 4 Viktoria Drama
## 5 5 Vlad History
## 6 6 Lana K-Drama
## 7 7 Jason Conspiracy
## 8 8 Nataly Nostalgic
## 9 9 Alex Suspenseful
## 10 10 Michaella Arthouse
Movies Table
movies <- dbGetQuery(conn, "SELECT movie_id, movie_title, movie_genre FROM movies;")
movies
## movie_id movie_title movie_genre
## 1 1 Deadpool & Wolverine Action
## 2 2 Parasite Thriller
## 3 3 Dune: Part Two Sci-Fi
## 4 4 Inside Out 2 Animation
## 5 5 The Substance Horror
## 6 6 Oppenheimer Drama
Ratings Table
ratings <- dbGetQuery(conn, "SELECT rating_id, user_id, movie_id, rating FROM ratings_records;")
ratings
## rating_id user_id movie_id rating
## 1 1 1 1 5
## 2 2 2 2 5
## 3 3 3 4 4
## 4 4 5 6 5
## 5 5 1 1 5
## 6 6 1 2 3
## 7 7 1 3 4
## 8 8 1 4 2
## 9 9 1 5 3
## 10 10 1 6 5
## 11 11 2 1 3
## 12 12 2 2 5
## 13 13 2 3 2
## 14 14 2 4 5
## 15 15 2 5 4
## 16 16 2 6 3
## 17 17 3 1 2
## 18 18 3 2 5
## 19 19 3 3 3
## 20 20 3 4 4
## 21 21 3 5 5
## 22 22 3 6 2
## 23 23 5 1 3
## 24 24 5 2 4
## 25 25 5 3 2
## 26 26 5 4 3
## 27 27 5 5 3
## 28 28 5 6 5
## 29 29 6 1 2
## 30 30 6 2 5
## 31 31 6 3 3
## 32 32 6 4 5
## 33 33 6 5 4
## 34 34 6 6 3
## 35 35 7 1 4
## 36 36 7 2 3
## 37 37 7 3 5
## 38 38 7 4 2
## 39 39 7 5 4
## 40 40 7 6 5
## 41 41 1 1 5
## 42 42 1 2 3
## 43 43 1 3 4
## 44 44 1 4 2
## 45 45 1 5 3
## 46 46 1 6 5
## 47 47 2 1 3
## 48 48 2 2 5
## 49 49 2 3 2
## 50 50 2 4 5
## 51 51 2 5 4
## 52 52 2 6 3
## 53 53 3 1 2
## 54 54 3 2 5
## 55 55 3 3 3
## 56 56 3 4 4
## 57 57 3 5 5
## 58 58 3 6 2
## 59 59 5 1 3
## 60 60 5 2 4
## 61 61 5 3 2
## 62 62 5 4 3
## 63 63 5 5 3
## 64 64 5 6 5
## 65 65 6 1 2
## 66 66 6 2 5
## 67 67 6 3 3
## 68 68 6 4 5
## 69 69 6 5 4
## 70 70 6 6 3
## 71 71 7 1 4
## 72 72 7 2 3
## 73 73 7 3 5
## 74 74 7 4 2
## 75 75 7 5 4
## 76 76 7 6 5
## 77 77 8 1 3
## 78 78 8 2 4
## 79 79 8 3 2
## 80 80 8 4 5
## 81 81 8 5 3
## 82 82 8 6 3
## 83 83 9 1 5
## 84 84 9 2 3
## 85 85 9 3 5
## 86 86 9 4 2
## 87 87 9 5 4
## 88 88 9 6 4
## 89 89 10 1 3
## 90 90 10 2 4
## 91 91 10 3 2
## 92 92 10 4 5
## 93 93 10 5 5
## 94 94 10 6 3
This normalized structure prevents data redundancy and update anomalies. Each table represents a single entity and references other tables using foreign keys.
College Majors Dataset
We extract majors containing DATA or STATISTICS from the dataset available at FiveThirtyEight.
# Read in the dataset
majors <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/majors-list.csv")
# Find majors containing "DATA" or "STATISTICS"
data_statistics_majors <- majors %>%
filter(str_detect(Major, "DATA|STATISTICS")) %>%
select(Major)
data_statistics_majors
## Major
## 1 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS
## 2 COMPUTER PROGRAMMING AND DATA PROCESSING
## 3 STATISTICS AND DECISION SCIENCE
(.)\1\1
This pattern matches any single character repeated three times in a row.
Breakdown:
(.)
→ Captures any one character.\1
→ Refers back to the same captured character.\1
→ Again, repeats the captured character.Examples that match: "aaa"
,
"111"
, "ccc"
"(.)(.)\\2\\1"
This pattern matches four-character palindromes (mirror patterns).
Breakdown:
(.)
→ Captures the first character.(.)
→ Captures the second character.\2
→ Must match the second captured character
again.\1
→ Must match the first captured character
again.Examples that match: "abba"
,
"1221"
, "boob"
(..)\1
This pattern matches any two-character sequence repeated twice in a row.
Breakdown:
(..)
→ Captures two characters.\1
→ Must repeat the same two characters.Examples that match: "abab"
,
"1212"
, "cfcf"
"(.).\\1.\\1"
This pattern matches a character appearing at positions 1, 3, and 5.
Breakdown:
(.)
→ Captures one character..
→ Matches any character (acts as a placeholder).\1
→ Must match the first captured character..
→ Matches any character.\1
→ Again, must match the first captured
character.Examples that match: "abaca"
,
"1x1x1"
, "momom"
"(.)(.)(.).*\\3\\2\\1"
This pattern matches a six-character palindrome (with any number of characters in between).
Breakdown:
(.)
→ Captures the first character.(.)
→ Captures the second character.(.)
→ Captures the third character..*
→ Matches any number of characters in between.\3
→ Must match the third captured character.\2
→ Must match the second captured character.\1
→ Must match the first captured character.Examples that match: "xyzzyx"
,
"123xx321"
, "abcddcba"
Start and end with the same character
start_end_same <- "^(.).*\\1$"
Contain a repeated pair of letters
repeated_pair <- "(..).*\\1"
Contain one letter repeated in at least three places
letter_repeated_three <- "(.).*\\1.*\\1"
Test:
# Sample test words
words <- c("level", "noon", "church", "mississippi", "banana", "racecar")
# Test each pattern
matches_1 <- words[str_detect(words, start_end_same)]
matches_2 <- words[str_detect(words, repeated_pair)]
matches_3 <- words[str_detect(words, letter_repeated_three)]
# Print results
list(
"Start and end with the same character" = matches_1,
"Contains a repeated pair of letters" = matches_2,
"Contains a letter repeated 3 times" = matches_3
)
## $`Start and end with the same character`
## [1] "level" "noon" "racecar"
##
## $`Contains a repeated pair of letters`
## [1] "church" "mississippi" "banana"
##
## $`Contains a letter repeated 3 times`
## [1] "mississippi" "banana"
This assignment focused on normalization, text filtering, and regular expressions to clean and structure data effectively. Using SQL in R, we normalized datasets to improve consistency and avoid redundancy. Character manipulation techniques helped extract relevant text data, while regex patterns allowed complex string analysis. These skills are crucial for efficient data processing and analysis.