In this assignment we explore the uses of normalization and character manipulation in data frames. We start off by providing at least three data frames in R that demonstrate normalization. We then use the 173 majors listed in fivethirtyeight.com’s College Majors data set to provide code that identifies the majors that contain either “DATA” or “STATISTICS.” To finish off, we handle a few exercises revolving around regular expressions.
As always, let’s start off by loading the necessary packages.
library(tidyverse)
library(dplyr)
For the data frames, I decided to recreate a data set used on the “Decomplexify” YouTube channel in a video entitled “Learn Database Normalization - 1NF, 2NF, 3NF, 4NF, 5NF.” After recreating the data set and making it available on GitHub, we then have to load it into our environment.
url <- url('https://raw.githubusercontent.com/Stevee-G/Data607/refs/heads/main/Video_Game_Data.csv')
video_game <- read.csv(url)
video_game
## Player_ID Item_Type Item_Quantity Player_Rating Player_Skill_Level
## 1 jdog21 amulets 2 Intermediate 4
## 2 jdog21 rings 4 Intermediate 4
## 3 gila19 copper coins 18 Beginner 3
## 4 trev73 shields 3 Advanced 8
## 5 trev73 arrows 5 Advanced 8
## 6 trev73 copper coins 30 Advanced 8
## 7 trev73 rings 7 Advanced 8
Now, we can go ahead and recreate the normalized data frames that
were exemplified in the video. The players and
players_inventory data frames are easily constructed from
the original data set. However, the layer_skill_levels data
frame must be built from scratch using the relationship between
Player_Skill_Level and Player_Rating that was
shown in the video. Once all is said and done, we can see that all three
data frames have just one instance per row and each cell is in atomic
level. They also abide by the rules surrounding non-key attributes as
everyone one in a table key, the whole key, and nothing but the key.
players <- subset(video_game,select = c(Player_ID,Player_Rating)) %>%
distinct()
players
## Player_ID Player_Rating
## 1 jdog21 Intermediate
## 2 gila19 Beginner
## 3 trev73 Advanced
player_inventory <- subset(video_game,select = c(Player_ID,Item_Type,Item_Quantity)) %>%
distinct()
player_inventory
## Player_ID Item_Type Item_Quantity
## 1 jdog21 amulets 2
## 2 jdog21 rings 4
## 3 gila19 copper coins 18
## 4 trev73 shields 3
## 5 trev73 arrows 5
## 6 trev73 copper coins 30
## 7 trev73 rings 7
player_skill_levels <- data.frame(Player_Skill_Level = 1:9) %>%
mutate(Player_Rating = case_when(
Player_Skill_Level == 1:3 ~ "Beginner",
Player_Skill_Level == 4:6 ~ "Intermediate",
Player_Skill_Level == 7:9 ~ "Advanced"))
player_skill_levels
## Player_Skill_Level Player_Rating
## 1 1 Beginner
## 2 2 Beginner
## 3 3 Beginner
## 4 4 Intermediate
## 5 5 Intermediate
## 6 6 Intermediate
## 7 7 Advanced
## 8 8 Advanced
## 9 9 Advanced
Next, we will load the data used in fivethirtyeight.com’s “College Majors” article.
url2 <-
url('https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv')
college_majors <- read.csv(url2)
head(college_majors)
## FOD1P Major Major_Category
## 1 1100 GENERAL AGRICULTURE Agriculture & Natural Resources
## 2 1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3 1102 AGRICULTURAL ECONOMICS Agriculture & Natural Resources
## 4 1103 ANIMAL SCIENCES Agriculture & Natural Resources
## 5 1104 FOOD SCIENCE Agriculture & Natural Resources
## 6 1105 PLANT SCIENCE AND AGRONOMY Agriculture & Natural Resources
We can now search for the majors that contain either “DATA” or
“STATISTICS” within the college_majors data set using the
filter() and grepl() functions.
college_majors %>% filter(grepl('DATA|STATISTICS', Major))
## FOD1P Major Major_Category
## 1 6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS Business
## 2 2101 COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3 3702 STATISTICS AND DECISION SCIENCE Computers & Mathematics
Next we’ve been tasked with describing what the expressions below will match:
(.)\1\1 - This expression will not match anything since it is not contained within quotation marks and the back references are missing a backslash.
“(.)(.)\2\1” - This expression will match any string where a pair of characters is followed by the same pair in reverse order.
(..)\1 - This expression will not match anything since it is not contained within quotation marks and the back reference is missing a backslash.
“(.).\1.\1” - This expression will match any string where the first letter is repeated every other time.
“(.)(.)(.).*\3\2\1” - This expression matches any string where the first three characters and the last three are the same but in reverse order.
Finally, we will construct regular expressions to match words that meet the descriptions below:
Start and end with the same character. - “^(.).*\1$”
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.) - “(..)\1”
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.) - “(.)\1{2,}”
Normalizing data frames is definitely a must do when preparing to analyze data. It might feel odd in the beginning taking one table and creating various with relationships, but, although it is not aesthetically pleasing, it makes a huge difference in the feasibility of analyzing certain data sets. Along with normalization, the ability to fetch data through regular expressions is also a must have for data analysis. I look forward to using these methods in future data analysis.
Casselman, B. (2017, September 28). The economic guide to picking a college major. FiveThirtyEight. https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/
Decomplexify. (2021, November 21). Learn Database Normalization - 1NF, 2NF, 3NF, 4NF, 5NF [Video]. YouTube. https://www.youtube.com/watch?v=GFQaEYEc8_8