Overview Database management will become increasingly important in our course, as it is in the field, and as such making sure our data is easy to process is key. Cleaning up data also means that we must reorganize our sets of data into more manageable tables. From here we can begin to process it and understand trends. Additionally, combing through data as it exists as a text is important, especially when you only need to find fragments of words to get the correct data. Either creating your own regular expressions or simply understanding them will be vital.

1.Loading Packages & Table Creation

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
music_survey <- tribble(
  ~Friend, ~Type, ~Format, ~Genre,
  "Abigail",   "Physical, Digital", "Spotify, CD", "Pop, Hip-Hop",
  "David",   "Physical", "CD, Vinyl, Tape", "Ambient, Pop, Rock, Jazz",
  "Naomi",   "Digital", "Spotify, YouTube", "Pop, Jazz, Latin"
)
music_survey
## # A tibble: 3 × 4
##   Friend  Type              Format           Genre                   
##   <chr>   <chr>             <chr>            <chr>                   
## 1 Abigail Physical, Digital Spotify, CD      Pop, Hip-Hop            
## 2 David   Physical          CD, Vinyl, Tape  Ambient, Pop, Rock, Jazz
## 3 Naomi   Digital           Spotify, YouTube Pop, Jazz, Latin

After a brief survey from friends about how they listen to music and what sorts of music they like, I entered that data into a table. However, the data needs to be normalized.

Normalized Data

music_preferences <- tribble(
  ~Friend, ~Type, 
  "Abigail",   "Physical",
  "Abigail", "Digital",
  "David",   "Physical",
  "David", "Physical",
  "Naomi",   "Digital"
)
music_preferences
## # A tibble: 5 × 2
##   Friend  Type    
##   <chr>   <chr>   
## 1 Abigail Physical
## 2 Abigail Digital 
## 3 David   Physical
## 4 David   Physical
## 5 Naomi   Digital

Here we have our data for their listening preference normalized, however we can extrapolate what we’ve done to their prefered format to listen to music and genre of music.

music_format <- tribble(
  ~Friend, ~Format,
  "Abigail", "Spotify",
  "Abigail", "CD",
  "David", "CD",
  "David", "Vinyl",
  "David", "Tape",
  "Naomi", "Spotify",
  "Naomi", "YouTube"
)
music_format
## # A tibble: 7 × 2
##   Friend  Format 
##   <chr>   <chr>  
## 1 Abigail Spotify
## 2 Abigail CD     
## 3 David   CD     
## 4 David   Vinyl  
## 5 David   Tape   
## 6 Naomi   Spotify
## 7 Naomi   YouTube

Here we have a normalized table of what format everyone enjoys listening music on.

format_type <- tribble(
  ~Type, ~Format,
  "Digital", "Spotify",
  "Digital", "YouTube",
  "Physical", "CD",
  "Physical", "Vinyl",
  "Physical", "Tape"
)
format_type
## # A tibble: 5 × 2
##   Type     Format 
##   <chr>    <chr>  
## 1 Digital  Spotify
## 2 Digital  YouTube
## 3 Physical CD     
## 4 Physical Vinyl  
## 5 Physical Tape

We now have a normalized table of the format each preference is in.

music_genre <- tribble(
  ~Friend, ~Genre,
  "Abigail",   "Pop",
  "Abigail", "Hip-Hop",
  "David",   "Ambient",
  "David", "Pop",
  "David", "Rock",
  "David", "Jazz",
  "Naomi",   "Pop",
  "Naomi", "Jazz",
  "Naomi", "Latin"
)
music_genre
## # A tibble: 9 × 2
##   Friend  Genre  
##   <chr>   <chr>  
## 1 Abigail Pop    
## 2 Abigail Hip-Hop
## 3 David   Ambient
## 4 David   Pop    
## 5 David   Rock   
## 6 David   Jazz   
## 7 Naomi   Pop    
## 8 Naomi   Jazz   
## 9 Naomi   Latin

Lastly we have a normalized table of the genre of music everyone likes to listen to.

2.Loading the data

major_list <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/college-majors/majors-list.csv')

pull(major_list, Major)
##   [1] "GENERAL AGRICULTURE"                                              
##   [2] "AGRICULTURE PRODUCTION AND MANAGEMENT"                            
##   [3] "AGRICULTURAL ECONOMICS"                                           
##   [4] "ANIMAL SCIENCES"                                                  
##   [5] "FOOD SCIENCE"                                                     
##   [6] "PLANT SCIENCE AND AGRONOMY"                                       
##   [7] "SOIL SCIENCE"                                                     
##   [8] "MISCELLANEOUS AGRICULTURE"                                        
##   [9] "FORESTRY"                                                         
##  [10] "NATURAL RESOURCES MANAGEMENT"                                     
##  [11] "FINE ARTS"                                                        
##  [12] "DRAMA AND THEATER ARTS"                                           
##  [13] "MUSIC"                                                            
##  [14] "VISUAL AND PERFORMING ARTS"                                       
##  [15] "COMMERCIAL ART AND GRAPHIC DESIGN"                                
##  [16] "FILM VIDEO AND PHOTOGRAPHIC ARTS"                                 
##  [17] "STUDIO ARTS"                                                      
##  [18] "MISCELLANEOUS FINE ARTS"                                          
##  [19] "ENVIRONMENTAL SCIENCE"                                            
##  [20] "BIOLOGY"                                                          
##  [21] "BIOCHEMICAL SCIENCES"                                             
##  [22] "BOTANY"                                                           
##  [23] "MOLECULAR BIOLOGY"                                                
##  [24] "ECOLOGY"                                                          
##  [25] "GENETICS"                                                         
##  [26] "MICROBIOLOGY"                                                     
##  [27] "PHARMACOLOGY"                                                     
##  [28] "PHYSIOLOGY"                                                       
##  [29] "ZOOLOGY"                                                          
##  [30] "NEUROSCIENCE"                                                     
##  [31] "MISCELLANEOUS BIOLOGY"                                            
##  [32] "COGNITIVE SCIENCE AND BIOPSYCHOLOGY"                              
##  [33] "GENERAL BUSINESS"                                                 
##  [34] "ACCOUNTING"                                                       
##  [35] "ACTUARIAL SCIENCE"                                                
##  [36] "BUSINESS MANAGEMENT AND ADMINISTRATION"                           
##  [37] "OPERATIONS LOGISTICS AND E-COMMERCE"                              
##  [38] "BUSINESS ECONOMICS"                                               
##  [39] "MARKETING AND MARKETING RESEARCH"                                 
##  [40] "FINANCE"                                                          
##  [41] "HUMAN RESOURCES AND PERSONNEL MANAGEMENT"                         
##  [42] "INTERNATIONAL BUSINESS"                                           
##  [43] "HOSPITALITY MANAGEMENT"                                           
##  [44] "MANAGEMENT INFORMATION SYSTEMS AND STATISTICS"                    
##  [45] "MISCELLANEOUS BUSINESS & MEDICAL ADMINISTRATION"                  
##  [46] "COMMUNICATIONS"                                                   
##  [47] "JOURNALISM"                                                       
##  [48] "MASS MEDIA"                                                       
##  [49] "ADVERTISING AND PUBLIC RELATIONS"                                 
##  [50] "COMMUNICATION TECHNOLOGIES"                                       
##  [51] "COMPUTER AND INFORMATION SYSTEMS"                                 
##  [52] "COMPUTER PROGRAMMING AND DATA PROCESSING"                         
##  [53] "COMPUTER SCIENCE"                                                 
##  [54] "INFORMATION SCIENCES"                                             
##  [55] "COMPUTER ADMINISTRATION MANAGEMENT AND SECURITY"                  
##  [56] "COMPUTER NETWORKING AND TELECOMMUNICATIONS"                       
##  [57] "MATHEMATICS"                                                      
##  [58] "APPLIED MATHEMATICS"                                              
##  [59] "STATISTICS AND DECISION SCIENCE"                                  
##  [60] "MATHEMATICS AND COMPUTER SCIENCE"                                 
##  [61] "GENERAL EDUCATION"                                                
##  [62] "EDUCATIONAL ADMINISTRATION AND SUPERVISION"                       
##  [63] "SCHOOL STUDENT COUNSELING"                                        
##  [64] "ELEMENTARY EDUCATION"                                             
##  [65] "MATHEMATICS TEACHER EDUCATION"                                    
##  [66] "PHYSICAL AND HEALTH EDUCATION TEACHING"                           
##  [67] "EARLY CHILDHOOD EDUCATION"                                        
##  [68] "SCIENCE AND COMPUTER TEACHER EDUCATION"                           
##  [69] "SECONDARY TEACHER EDUCATION"                                      
##  [70] "SPECIAL NEEDS EDUCATION"                                          
##  [71] "SOCIAL SCIENCE OR HISTORY TEACHER EDUCATION"                      
##  [72] "TEACHER EDUCATION: MULTIPLE LEVELS"                               
##  [73] "LANGUAGE AND DRAMA EDUCATION"                                     
##  [74] "ART AND MUSIC EDUCATION"                                          
##  [75] "MISCELLANEOUS EDUCATION"                                          
##  [76] "LIBRARY SCIENCE"                                                  
##  [77] "ARCHITECTURE"                                                     
##  [78] "GENERAL ENGINEERING"                                              
##  [79] "AEROSPACE ENGINEERING"                                            
##  [80] "BIOLOGICAL ENGINEERING"                                           
##  [81] "ARCHITECTURAL ENGINEERING"                                        
##  [82] "BIOMEDICAL ENGINEERING"                                           
##  [83] "CHEMICAL ENGINEERING"                                             
##  [84] "CIVIL ENGINEERING"                                                
##  [85] "COMPUTER ENGINEERING"                                             
##  [86] "ELECTRICAL ENGINEERING"                                           
##  [87] "ENGINEERING MECHANICS PHYSICS AND SCIENCE"                        
##  [88] "ENVIRONMENTAL ENGINEERING"                                        
##  [89] "GEOLOGICAL AND GEOPHYSICAL ENGINEERING"                           
##  [90] "INDUSTRIAL AND MANUFACTURING ENGINEERING"                         
##  [91] "MATERIALS ENGINEERING AND MATERIALS SCIENCE"                      
##  [92] "MECHANICAL ENGINEERING"                                           
##  [93] "METALLURGICAL ENGINEERING"                                        
##  [94] "MINING AND MINERAL ENGINEERING"                                   
##  [95] "NAVAL ARCHITECTURE AND MARINE ENGINEERING"                        
##  [96] "NUCLEAR ENGINEERING"                                              
##  [97] "PETROLEUM ENGINEERING"                                            
##  [98] "MISCELLANEOUS ENGINEERING"                                        
##  [99] "ENGINEERING TECHNOLOGIES"                                         
## [100] "ENGINEERING AND INDUSTRIAL MANAGEMENT"                            
## [101] "ELECTRICAL ENGINEERING TECHNOLOGY"                                
## [102] "INDUSTRIAL PRODUCTION TECHNOLOGIES"                               
## [103] "MECHANICAL ENGINEERING RELATED TECHNOLOGIES"                      
## [104] "MISCELLANEOUS ENGINEERING TECHNOLOGIES"                           
## [105] "MATERIALS SCIENCE"                                                
## [106] "NUTRITION SCIENCES"                                               
## [107] "GENERAL MEDICAL AND HEALTH SERVICES"                              
## [108] "COMMUNICATION DISORDERS SCIENCES AND SERVICES"                    
## [109] "HEALTH AND MEDICAL ADMINISTRATIVE SERVICES"                       
## [110] "MEDICAL ASSISTING SERVICES"                                       
## [111] "MEDICAL TECHNOLOGIES TECHNICIANS"                                 
## [112] "HEALTH AND MEDICAL PREPARATORY PROGRAMS"                          
## [113] "NURSING"                                                          
## [114] "PHARMACY PHARMACEUTICAL SCIENCES AND ADMINISTRATION"              
## [115] "TREATMENT THERAPY PROFESSIONS"                                    
## [116] "COMMUNITY AND PUBLIC HEALTH"                                      
## [117] "MISCELLANEOUS HEALTH MEDICAL PROFESSIONS"                         
## [118] "AREA ETHNIC AND CIVILIZATION STUDIES"                             
## [119] "LINGUISTICS AND COMPARATIVE LANGUAGE AND LITERATURE"              
## [120] "FRENCH GERMAN LATIN AND OTHER COMMON FOREIGN LANGUAGE STUDIES"    
## [121] "OTHER FOREIGN LANGUAGES"                                          
## [122] "ENGLISH LANGUAGE AND LITERATURE"                                  
## [123] "COMPOSITION AND RHETORIC"                                         
## [124] "LIBERAL ARTS"                                                     
## [125] "HUMANITIES"                                                       
## [126] "INTERCULTURAL AND INTERNATIONAL STUDIES"                          
## [127] "PHILOSOPHY AND RELIGIOUS STUDIES"                                 
## [128] "THEOLOGY AND RELIGIOUS VOCATIONS"                                 
## [129] "ANTHROPOLOGY AND ARCHEOLOGY"                                      
## [130] "ART HISTORY AND CRITICISM"                                        
## [131] "HISTORY"                                                          
## [132] "UNITED STATES HISTORY"                                            
## [133] "COSMETOLOGY SERVICES AND CULINARY ARTS"                           
## [134] "FAMILY AND CONSUMER SCIENCES"                                     
## [135] "MILITARY TECHNOLOGIES"                                            
## [136] "PHYSICAL FITNESS PARKS RECREATION AND LEISURE"                    
## [137] "CONSTRUCTION SERVICES"                                            
## [138] "ELECTRICAL, MECHANICAL, AND PRECISION TECHNOLOGIES AND PRODUCTION"
## [139] "TRANSPORTATION SCIENCES AND TECHNOLOGIES"                         
## [140] "MULTI/INTERDISCIPLINARY STUDIES"                                  
## [141] "COURT REPORTING"                                                  
## [142] "PRE-LAW AND LEGAL STUDIES"                                        
## [143] "CRIMINAL JUSTICE AND FIRE PROTECTION"                             
## [144] "PUBLIC ADMINISTRATION"                                            
## [145] "PUBLIC POLICY"                                                    
## [146] "N/A (less than bachelor's degree)"                                
## [147] "PHYSICAL SCIENCES"                                                
## [148] "ASTRONOMY AND ASTROPHYSICS"                                       
## [149] "ATMOSPHERIC SCIENCES AND METEOROLOGY"                             
## [150] "CHEMISTRY"                                                        
## [151] "GEOLOGY AND EARTH SCIENCE"                                        
## [152] "GEOSCIENCES"                                                      
## [153] "OCEANOGRAPHY"                                                     
## [154] "PHYSICS"                                                          
## [155] "MULTI-DISCIPLINARY OR GENERAL SCIENCE"                            
## [156] "NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL TECHNOLOGIES"       
## [157] "PSYCHOLOGY"                                                       
## [158] "EDUCATIONAL PSYCHOLOGY"                                           
## [159] "CLINICAL PSYCHOLOGY"                                              
## [160] "COUNSELING PSYCHOLOGY"                                            
## [161] "INDUSTRIAL AND ORGANIZATIONAL PSYCHOLOGY"                         
## [162] "SOCIAL PSYCHOLOGY"                                                
## [163] "MISCELLANEOUS PSYCHOLOGY"                                         
## [164] "HUMAN SERVICES AND COMMUNITY ORGANIZATION"                        
## [165] "SOCIAL WORK"                                                      
## [166] "INTERDISCIPLINARY SOCIAL SCIENCES"                                
## [167] "GENERAL SOCIAL SCIENCES"                                          
## [168] "ECONOMICS"                                                        
## [169] "CRIMINOLOGY"                                                      
## [170] "GEOGRAPHY"                                                        
## [171] "INTERNATIONAL RELATIONS"                                          
## [172] "POLITICAL SCIENCE AND GOVERNMENT"                                 
## [173] "SOCIOLOGY"                                                        
## [174] "MISCELLANEOUS SOCIAL SCIENCES"

Here we have loaded our data provided by 538 from Github and saved it as a table within R. We also ran a test to make sure we can pull everything as vectors from this table.

Finding Majors

major_list |> 
  pull(Major) |> 
  str_view("STATISTICS|DATA")
## [44] │ MANAGEMENT INFORMATION SYSTEMS AND <STATISTICS>
## [52] │ COMPUTER PROGRAMMING AND <DATA> PROCESSING
## [59] │ <STATISTICS> AND DECISION SCIENCE

Here we have pulled that previous list of majors, but refined our search to only include majors with the words “STATISTICS” or “DATA” in them.

  1. While (.)\1\1 will not match anything, as I believe it is not written correctly, changing it to more proper syntax will. “(.)\1\1” matches when a word has one letter that is then followed by that same letter twice. In effect, when an object has three of the same letters back to back, like “eee”.

“(.)(.)\2\1” matches when a word has two letters back to back with those same letters reversed and following right after the first two.

Similarly to our first example, (..)\1 will match anything, but changing the syntax to “(..)\1” will match when we have two letters together that are then repeated. An example includes “na” in “banana”. This code would return “banana” as it identifies “na” as two letters coming together, and then immediately followed by another “na”.

“(.).\1.\1” would match an object that contains a letter, followed by any other letter, with the first letter repeated, followed by any other letter again, and finally is followed by that first letter repeated again. It could look like “abaca” or “xyxzx”.

“(.)(.)(.).*\3\2\1” would match an object that has one letter, followed by a second letter, followed by a third letter, separated by at least one letter, with the third letter repeating, followed by the second letter, followed by the first letter. An example would be “abcdcba” or even “xyzaaaaaaaaazyx”.

  1. A regular expression to match a word that starts and ends with the same character we could write “^(.).*\1$”

To match a word that contains a repeated pair of letters we could write “(..).*\1”.

To match a word that contains one letter repeated in at least three places we could write “(.).\1.\1”.

Conclusions While I was able to normalize this data by hand, I am looking forward to understanding ways in which I can create a code to make these normalized tables for me. In the future, I imagine that creating them by hand will be an irresponsible use of time and that there is surely a more efficient way to do so.