Overview / Introduction

In this assignment we explore the uses of normalization and character manipulation in data frames. We start off by providing at least three data frames in R that demonstrate normalization. We then use the 173 majors listed in fivethirtyeight.com’s College Majors data set to provide code that identifies the majors that contain either “DATA” or “STATISTICS.” To finish off, we handle a few exercises revolving around regular expressions.

Load Packages

As always, let’s start off by loading the necessary packages.

library(tidyverse)
library(dplyr)

Load Data

For the data frames, I decided to recreate a data set used on the “Decomplexify” YouTube channel in a video entitled “Learn Database Normalization - 1NF, 2NF, 3NF, 4NF, 5NF.” After recreating the data set and making it available on GitHub, we then have to load it into our environment.

url <- url('https://raw.githubusercontent.com/Stevee-G/Data607/refs/heads/main/Video_Game_Data.csv')
video_game <- read.csv(url)
video_game
##   Player_ID    Item_Type Item_Quantity Player_Rating Player_Skill_Level
## 1    jdog21      amulets             2  Intermediate                  4
## 2    jdog21        rings             4  Intermediate                  4
## 3    gila19 copper coins            18      Beginner                  3
## 4    trev73      shields             3      Advanced                  8
## 5    trev73       arrows             5      Advanced                  8
## 6    trev73 copper coins            30      Advanced                  8
## 7    trev73        rings             7      Advanced                  8

Normalize Data

Now, we can go ahead and recreate the normalized data frames that were exemplified in the video. The players and players_inventory data frames are easily constructed from the original data set. However, the layer_skill_levels data frame must be built from scratch using the relationship between Player_Skill_Level and Player_Rating that was shown in the video. Once all is said and done, we can see that all three data frames have just one instance per row and each cell is in atomic level. They also abide by the rules surrounding non-key attributes as everyone one in a table key, the whole key, and nothing but the key.

players <- subset(video_game,select = c(Player_ID,Player_Rating)) %>% 
  distinct()
players
##   Player_ID Player_Rating
## 1    jdog21  Intermediate
## 2    gila19      Beginner
## 3    trev73      Advanced
player_inventory <- subset(video_game,select = c(Player_ID,Item_Type,Item_Quantity)) %>% 
  distinct()
player_inventory
##   Player_ID    Item_Type Item_Quantity
## 1    jdog21      amulets             2
## 2    jdog21        rings             4
## 3    gila19 copper coins            18
## 4    trev73      shields             3
## 5    trev73       arrows             5
## 6    trev73 copper coins            30
## 7    trev73        rings             7
player_skill_levels <- data.frame(Player_Skill_Level = 1:9) %>% 
  mutate(Player_Rating = case_when(
    Player_Skill_Level == 1:3 ~ "Beginner",
    Player_Skill_Level == 4:6 ~ "Intermediate",
    Player_Skill_Level == 7:9 ~ "Advanced"))
player_skill_levels
##   Player_Skill_Level Player_Rating
## 1                  1      Beginner
## 2                  2      Beginner
## 3                  3      Beginner
## 4                  4  Intermediate
## 5                  5  Intermediate
## 6                  6  Intermediate
## 7                  7      Advanced
## 8                  8      Advanced
## 9                  9      Advanced

Load FiveThirtyEight’s College Majors Data

Next, we will load the data used in fivethirtyeight.com’s “College Majors” article.

url2 <-
  url('https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv')
college_majors <- read.csv(url2)
head(college_majors)
##   FOD1P                                 Major                  Major_Category
## 1  1100                   GENERAL AGRICULTURE Agriculture & Natural Resources
## 2  1101 AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3  1102                AGRICULTURAL ECONOMICS Agriculture & Natural Resources
## 4  1103                       ANIMAL SCIENCES Agriculture & Natural Resources
## 5  1104                          FOOD SCIENCE Agriculture & Natural Resources
## 6  1105            PLANT SCIENCE AND AGRONOMY Agriculture & Natural Resources

Identify Requested Majors

We can now search for the majors that contain either “DATA” or “STATISTICS” within the college_majors data set using the filter() and grepl() functions.

college_majors %>% filter(grepl('DATA|STATISTICS', Major))
##   FOD1P                                         Major          Major_Category
## 1  6212 MANAGEMENT INFORMATION SYSTEMS AND STATISTICS                Business
## 2  2101      COMPUTER PROGRAMMING AND DATA PROCESSING Computers & Mathematics
## 3  3702               STATISTICS AND DECISION SCIENCE Computers & Mathematics

Expressions Practice

Next we’ve been tasked with describing what the expressions below will match:

(.)\1\1 - This expression will not match anything since it is not contained within quotation marks and the back references are missing a backslash.

“(.)(.)\2\1” - This expression will match any string where a pair of characters is followed by the same pair in reverse order.

(..)\1 - This expression will not match anything since it is not contained within quotation marks and the back reference is missing a backslash.

“(.).\1.\1” - This expression will match any string where the first letter is repeated every other time.

“(.)(.)(.).*\3\2\1” - This expression matches any string where the first three characters and the last three are the same but in reverse order.

Finally, we will construct regular expressions to match words that meet the descriptions below:

Start and end with the same character. - “^(.).*\1$”

Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.) - “(..)\1”

Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.) - “(.)\1{2,}”

Conclusions / Findings and Recommendations

Normalizing data frames is definitely a must do when preparing to analyze data. It might feel odd in the beginning taking one table and creating various with relationships, but, although it is not aesthetically pleasing, it makes a huge difference in the feasibility of analyzing certain data sets. Along with normalization, the ability to fetch data through regular expressions is also a must have for data analysis. I look forward to using these methods in future data analysis.

Work Cited

Casselman, B. (2017, September 28). The economic guide to picking a college major. FiveThirtyEight. https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/

Decomplexify. (2021, November 21). Learn Database Normalization - 1NF, 2NF, 3NF, 4NF, 5NF [Video]. YouTube. https://www.youtube.com/watch?v=GFQaEYEc8_8