For my first example, I used samples of Census data from the IPUMS USA website [https://usa.ipums.org/usa/index.shtml]. I selected multiple related variables in order to create a wide dataset. I then downloaded the data into a .csv format.
ipums_data <- read.csv("usa_00020.csv")
str(ipums_data)
## 'data.frame': 966058 obs. of 18 variables:
## $ YEAR : int 1850 1850 1850 1850 1850 1850 1850 1850 1850 1850 ...
## $ DATANUM : int 1 1 1 1 1 1 1 1 1 1 ...
## $ SERIAL : int 301 301 601 1001 1001 1001 1001 1001 1001 1001 ...
## $ CBSERIAL : num NA NA NA NA NA NA NA NA NA NA ...
## $ HHWT : num 359 359 359 335 335 ...
## $ GQ : int 1 1 1 1 1 1 1 1 1 1 ...
## $ PERNUM : int 1 2 1 1 2 3 4 5 6 7 ...
## $ PERWT : num 359 359 359 335 335 ...
## $ FAMSIZE : int 1 1 1 6 6 6 6 6 6 1 ...
## $ RACE : int 1 1 1 1 1 1 1 1 1 1 ...
## $ RACED : int 120 120 120 120 120 120 120 120 120 120 ...
## $ HIGRADE : int NA NA NA NA NA NA NA NA NA NA ...
## $ HIGRADED : int NA NA NA NA NA NA NA NA NA NA ...
## $ EDUC : int NA NA NA NA NA NA NA NA NA NA ...
## $ EDUCD : int NA NA NA NA NA NA NA NA NA NA ...
## $ DEGFIELD : int NA NA NA NA NA NA NA NA NA NA ...
## $ DEGFIELDD: int NA NA NA NA NA NA NA NA NA NA ...
## $ FTOTINC : int NA NA NA NA NA NA NA NA NA NA ...
I see that this is a data.frame consisting of 966,058 rows of 18 variables.
names(ipums_data)
## [1] "YEAR" "DATANUM" "SERIAL" "CBSERIAL" "HHWT"
## [6] "GQ" "PERNUM" "PERWT" "FAMSIZE" "RACE"
## [11] "RACED" "HIGRADE" "HIGRADED" "EDUC" "EDUCD"
## [16] "DEGFIELD" "DEGFIELDD" "FTOTINC"
I see the columns appear to have somewhat descriptive names but will be changed in the final output
ipums_tbl <- as.tbl(ipums_data)
ipums_tbl
## # A tibble: 966,058 x 18
## YEAR DATANUM SERIAL CBSERIAL HHWT GQ PERNUM PERWT FAMSIZE RACE
## <int> <int> <int> <dbl> <dbl> <int> <int> <dbl> <int> <int>
## 1 1850 1 301 NA 359. 1 1 359. 1 1
## 2 1850 1 301 NA 359. 1 2 359. 1 1
## 3 1850 1 601 NA 359. 1 1 359. 1 1
## 4 1850 1 1001 NA 335. 1 1 335. 6 1
## 5 1850 1 1001 NA 335. 1 2 335. 6 1
## 6 1850 1 1001 NA 335. 1 3 335. 6 1
## 7 1850 1 1001 NA 335. 1 4 335. 6 1
## 8 1850 1 1001 NA 335. 1 5 335. 6 1
## 9 1850 1 1001 NA 335. 1 6 335. 6 1
## 10 1850 1 1001 NA 335. 1 7 335. 1 1
## # ... with 966,048 more rows, and 8 more variables: RACED <int>,
## # HIGRADE <int>, HIGRADED <int>, EDUC <int>, EDUCD <int>,
## # DEGFIELD <int>, DEGFIELDD <int>, FTOTINC <int>
We see that the column race has numeric variables. I will create n new column showing the character representation of these values 1.Create a vector of race identification codes(obtained from IPUMS)
glimpse(ipums_tbl)
## Observations: 966,058
## Variables: 18
## $ YEAR <int> 1850, 1850, 1850, 1850, 1850, 1850, 1850, 1850, 1850...
## $ DATANUM <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ SERIAL <int> 301, 301, 601, 1001, 1001, 1001, 1001, 1001, 1001, 1...
## $ CBSERIAL <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ HHWT <dbl> 358.90, 358.90, 358.90, 334.89, 334.89, 334.89, 334....
## $ GQ <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ PERNUM <int> 1, 2, 1, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8...
## $ PERWT <dbl> 358.90, 358.90, 358.90, 334.89, 334.89, 334.89, 334....
## $ FAMSIZE <int> 1, 1, 1, 6, 6, 6, 6, 6, 6, 1, 8, 8, 8, 8, 8, 8, 8, 8...
## $ RACE <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ RACED <int> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12...
## $ HIGRADE <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ HIGRADED <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ EDUC <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ EDUCD <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ DEGFIELD <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ DEGFIELDD <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ FTOTINC <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
race_codes <- read_csv("race_codes.csv")
## Parsed with column specification:
## cols(
## RACE = col_double(),
## `Race [general version]` = col_character()
## )
str(race_codes)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 9 obs. of 2 variables:
## $ RACE : num 1 2 3 4 5 6 7 8 9
## $ Race [general version]: chr "White" "Black/African American/Negro" "American Indian or Alaska Native" "Chinese" ...
## - attr(*, "spec")=
## .. cols(
## .. RACE = col_double(),
## .. `Race [general version]` = col_character()
## .. )
racevec <- race_codes$`Race [general version]`
names(racevec) <- race_codes$RACE
We now have a named vector containing the character description of the 9 numeric codes in our original data. I will create a new column which translates these numeric values to thir description for each row. Then chaange the column to a factor.
ipums_tbl$RaceDesc <- racevec[ipums_tbl$RACE]
ipums_tbl$RaceDesc <- as.factor(ipums_tbl$RaceDesc)
glimpse(ipums_tbl)
## Observations: 966,058
## Variables: 19
## $ YEAR <int> 1850, 1850, 1850, 1850, 1850, 1850, 1850, 1850, 1850...
## $ DATANUM <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ SERIAL <int> 301, 301, 601, 1001, 1001, 1001, 1001, 1001, 1001, 1...
## $ CBSERIAL <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ HHWT <dbl> 358.90, 358.90, 358.90, 334.89, 334.89, 334.89, 334....
## $ GQ <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ PERNUM <int> 1, 2, 1, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8...
## $ PERWT <dbl> 358.90, 358.90, 358.90, 334.89, 334.89, 334.89, 334....
## $ FAMSIZE <int> 1, 1, 1, 6, 6, 6, 6, 6, 6, 1, 8, 8, 8, 8, 8, 8, 8, 8...
## $ RACE <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ RACED <int> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12...
## $ HIGRADE <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ HIGRADED <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ EDUC <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ EDUCD <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ DEGFIELD <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ DEGFIELDD <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ FTOTINC <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ RaceDesc <fct> White, White, White, White, White, White, White, Whi...
We also see that the column HIGRADE has numeric variables. I will create n new column showing the character representation of these values 1.Create a vector of highest grades completed codes(obtained from IPUMS)
grade_codes <- read_csv("grade_codes.csv")
## Parsed with column specification:
## cols(
## HIGRADE = col_double(),
## `Highest grade of schooling [general version]` = col_character()
## )
str(grade_codes)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 24 obs. of 2 variables:
## $ HIGRADE : num 0 1 2 3 4 5 6 7 8 9 ...
## $ Highest grade of schooling [general version]: chr "N/A" "None" "Nursery school" "Kindergarten" ...
## - attr(*, "spec")=
## .. cols(
## .. HIGRADE = col_double(),
## .. `Highest grade of schooling [general version]` = col_character()
## .. )
gradevec <- grade_codes$`Highest grade of schooling [general version]`
names(gradevec) <- grade_codes$HIGRADE
We see that there are 24 values of grades describing the highest grade completed.We now have a named vector containing the character description of the 24 numeric codes in our original data. I will create a new column which translates these numeric values to thir description for each row.
ipums_tbl$Highest_Grade_Completed <- gradevec[ipums_tbl$HIGRADE + 1]
ipums_tbl$Highest_Grade_Completed <- as.factor(ipums_tbl$Highest_Grade_Completed)
glimpse(ipums_tbl)
## Observations: 966,058
## Variables: 20
## $ YEAR <int> 1850, 1850, 1850, 1850, 1850, 1850, 18...
## $ DATANUM <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ SERIAL <int> 301, 301, 601, 1001, 1001, 1001, 1001,...
## $ CBSERIAL <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ HHWT <dbl> 358.90, 358.90, 358.90, 334.89, 334.89...
## $ GQ <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ PERNUM <int> 1, 2, 1, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3,...
## $ PERWT <dbl> 358.90, 358.90, 358.90, 334.89, 334.89...
## $ FAMSIZE <int> 1, 1, 1, 6, 6, 6, 6, 6, 6, 1, 8, 8, 8,...
## $ RACE <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ RACED <int> 120, 120, 120, 120, 120, 120, 120, 120...
## $ HIGRADE <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ HIGRADED <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ EDUC <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ EDUCD <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ DEGFIELD <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ DEGFIELDD <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ FTOTINC <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ RaceDesc <fct> White, White, White, White, White, Whi...
## $ Highest_Grade_Completed <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
table(ipums_tbl$Highest_Grade_Completed)
##
## 10th grade 11th grade 12th grade
## 9108 7117 28972
## 1st grade 1st year 2nd grade
## 3687 4636 3995
## 2nd year 3rd grade 3rd year
## 4534 4673 2181
## 4th grade 4th year 5th grade
## 5123 5434 5569
## 5th year or more (40-50) 6th grade 6th year or more (60,70)
## 1576 6944 1325
## 7th grade 7th year 8th grade
## 7675 169 18045
## 8th year or more 9th grade Kindergarten
## 220 8362 2509
## N/A None Nursery school
## 38970 11023 1789
Select a subset of columns for analysis