First Wide Dataset Example

For my first example, I used samples of Census data from the IPUMS USA website [https://usa.ipums.org/usa/index.shtml]. I selected multiple related variables in order to create a wide dataset. I then downloaded the data into a .csv format.

Read Wide Data into R Data Frame

ipums_data <- read.csv("usa_00020.csv")

Explore the data

str(ipums_data)
## 'data.frame':    966058 obs. of  18 variables:
##  $ YEAR     : int  1850 1850 1850 1850 1850 1850 1850 1850 1850 1850 ...
##  $ DATANUM  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ SERIAL   : int  301 301 601 1001 1001 1001 1001 1001 1001 1001 ...
##  $ CBSERIAL : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ HHWT     : num  359 359 359 335 335 ...
##  $ GQ       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ PERNUM   : int  1 2 1 1 2 3 4 5 6 7 ...
##  $ PERWT    : num  359 359 359 335 335 ...
##  $ FAMSIZE  : int  1 1 1 6 6 6 6 6 6 1 ...
##  $ RACE     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ RACED    : int  120 120 120 120 120 120 120 120 120 120 ...
##  $ HIGRADE  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ HIGRADED : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ EDUC     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ EDUCD    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ DEGFIELD : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ DEGFIELDD: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ FTOTINC  : int  NA NA NA NA NA NA NA NA NA NA ...

I see that this is a data.frame consisting of 966,058 rows of 18 variables.

names(ipums_data)
##  [1] "YEAR"      "DATANUM"   "SERIAL"    "CBSERIAL"  "HHWT"     
##  [6] "GQ"        "PERNUM"    "PERWT"     "FAMSIZE"   "RACE"     
## [11] "RACED"     "HIGRADE"   "HIGRADED"  "EDUC"      "EDUCD"    
## [16] "DEGFIELD"  "DEGFIELDD" "FTOTINC"

I see the columns appear to have somewhat descriptive names but will be changed in the final output

Convert data.frame to a tibble and display in tibble format

ipums_tbl <- as.tbl(ipums_data)
ipums_tbl
## # A tibble: 966,058 x 18
##     YEAR DATANUM SERIAL CBSERIAL  HHWT    GQ PERNUM PERWT FAMSIZE  RACE
##    <int>   <int>  <int>    <dbl> <dbl> <int>  <int> <dbl>   <int> <int>
##  1  1850       1    301       NA  359.     1      1  359.       1     1
##  2  1850       1    301       NA  359.     1      2  359.       1     1
##  3  1850       1    601       NA  359.     1      1  359.       1     1
##  4  1850       1   1001       NA  335.     1      1  335.       6     1
##  5  1850       1   1001       NA  335.     1      2  335.       6     1
##  6  1850       1   1001       NA  335.     1      3  335.       6     1
##  7  1850       1   1001       NA  335.     1      4  335.       6     1
##  8  1850       1   1001       NA  335.     1      5  335.       6     1
##  9  1850       1   1001       NA  335.     1      6  335.       6     1
## 10  1850       1   1001       NA  335.     1      7  335.       1     1
## # ... with 966,048 more rows, and 8 more variables: RACED <int>,
## #   HIGRADE <int>, HIGRADED <int>, EDUC <int>, EDUCD <int>,
## #   DEGFIELD <int>, DEGFIELDD <int>, FTOTINC <int>

Cleaning - recoding Race

We see that the column race has numeric variables. I will create n new column showing the character representation of these values 1.Create a vector of race identification codes(obtained from IPUMS)

glimpse(ipums_tbl)
## Observations: 966,058
## Variables: 18
## $ YEAR      <int> 1850, 1850, 1850, 1850, 1850, 1850, 1850, 1850, 1850...
## $ DATANUM   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ SERIAL    <int> 301, 301, 601, 1001, 1001, 1001, 1001, 1001, 1001, 1...
## $ CBSERIAL  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ HHWT      <dbl> 358.90, 358.90, 358.90, 334.89, 334.89, 334.89, 334....
## $ GQ        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ PERNUM    <int> 1, 2, 1, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8...
## $ PERWT     <dbl> 358.90, 358.90, 358.90, 334.89, 334.89, 334.89, 334....
## $ FAMSIZE   <int> 1, 1, 1, 6, 6, 6, 6, 6, 6, 1, 8, 8, 8, 8, 8, 8, 8, 8...
## $ RACE      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ RACED     <int> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12...
## $ HIGRADE   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ HIGRADED  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ EDUC      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ EDUCD     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ DEGFIELD  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ DEGFIELDD <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ FTOTINC   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
race_codes <- read_csv("race_codes.csv")
## Parsed with column specification:
## cols(
##   RACE = col_double(),
##   `Race [general version]` = col_character()
## )
str(race_codes)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 9 obs. of  2 variables:
##  $ RACE                  : num  1 2 3 4 5 6 7 8 9
##  $ Race [general version]: chr  "White" "Black/African American/Negro" "American Indian or Alaska Native" "Chinese" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   RACE = col_double(),
##   ..   `Race [general version]` = col_character()
##   .. )
racevec <- race_codes$`Race [general version]`
names(racevec) <- race_codes$RACE

We now have a named vector containing the character description of the 9 numeric codes in our original data. I will create a new column which translates these numeric values to thir description for each row. Then chaange the column to a factor.

ipums_tbl$RaceDesc <- racevec[ipums_tbl$RACE]
ipums_tbl$RaceDesc <- as.factor(ipums_tbl$RaceDesc)
glimpse(ipums_tbl)
## Observations: 966,058
## Variables: 19
## $ YEAR      <int> 1850, 1850, 1850, 1850, 1850, 1850, 1850, 1850, 1850...
## $ DATANUM   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ SERIAL    <int> 301, 301, 601, 1001, 1001, 1001, 1001, 1001, 1001, 1...
## $ CBSERIAL  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ HHWT      <dbl> 358.90, 358.90, 358.90, 334.89, 334.89, 334.89, 334....
## $ GQ        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ PERNUM    <int> 1, 2, 1, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8...
## $ PERWT     <dbl> 358.90, 358.90, 358.90, 334.89, 334.89, 334.89, 334....
## $ FAMSIZE   <int> 1, 1, 1, 6, 6, 6, 6, 6, 6, 1, 8, 8, 8, 8, 8, 8, 8, 8...
## $ RACE      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ RACED     <int> 120, 120, 120, 120, 120, 120, 120, 120, 120, 120, 12...
## $ HIGRADE   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ HIGRADED  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ EDUC      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ EDUCD     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ DEGFIELD  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ DEGFIELDD <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ FTOTINC   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ RaceDesc  <fct> White, White, White, White, White, White, White, Whi...

Cleaning - Recoding Grade

We also see that the column HIGRADE has numeric variables. I will create n new column showing the character representation of these values 1.Create a vector of highest grades completed codes(obtained from IPUMS)

grade_codes <- read_csv("grade_codes.csv")
## Parsed with column specification:
## cols(
##   HIGRADE = col_double(),
##   `Highest grade of schooling [general version]` = col_character()
## )
str(grade_codes)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 24 obs. of  2 variables:
##  $ HIGRADE                                     : num  0 1 2 3 4 5 6 7 8 9 ...
##  $ Highest grade of schooling [general version]: chr  "N/A" "None" "Nursery school" "Kindergarten" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   HIGRADE = col_double(),
##   ..   `Highest grade of schooling [general version]` = col_character()
##   .. )
gradevec <- grade_codes$`Highest grade of schooling [general version]`
names(gradevec) <- grade_codes$HIGRADE

We see that there are 24 values of grades describing the highest grade completed.We now have a named vector containing the character description of the 24 numeric codes in our original data. I will create a new column which translates these numeric values to thir description for each row.

ipums_tbl$Highest_Grade_Completed <- gradevec[ipums_tbl$HIGRADE + 1]
ipums_tbl$Highest_Grade_Completed <- as.factor(ipums_tbl$Highest_Grade_Completed)
glimpse(ipums_tbl)
## Observations: 966,058
## Variables: 20
## $ YEAR                    <int> 1850, 1850, 1850, 1850, 1850, 1850, 18...
## $ DATANUM                 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ SERIAL                  <int> 301, 301, 601, 1001, 1001, 1001, 1001,...
## $ CBSERIAL                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ HHWT                    <dbl> 358.90, 358.90, 358.90, 334.89, 334.89...
## $ GQ                      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ PERNUM                  <int> 1, 2, 1, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3,...
## $ PERWT                   <dbl> 358.90, 358.90, 358.90, 334.89, 334.89...
## $ FAMSIZE                 <int> 1, 1, 1, 6, 6, 6, 6, 6, 6, 1, 8, 8, 8,...
## $ RACE                    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ RACED                   <int> 120, 120, 120, 120, 120, 120, 120, 120...
## $ HIGRADE                 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ HIGRADED                <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ EDUC                    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ EDUCD                   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ DEGFIELD                <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ DEGFIELDD               <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ FTOTINC                 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ RaceDesc                <fct> White, White, White, White, White, Whi...
## $ Highest_Grade_Completed <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
table(ipums_tbl$Highest_Grade_Completed)
## 
##               10th grade               11th grade               12th grade 
##                     9108                     7117                    28972 
##                1st grade                 1st year                2nd grade 
##                     3687                     4636                     3995 
##                 2nd year                3rd grade                 3rd year 
##                     4534                     4673                     2181 
##                4th grade                 4th year                5th grade 
##                     5123                     5434                     5569 
## 5th year or more (40-50)                6th grade 6th year or more (60,70) 
##                     1576                     6944                     1325 
##                7th grade                 7th year                8th grade 
##                     7675                      169                    18045 
##         8th year or more                9th grade             Kindergarten 
##                      220                     8362                     2509 
##                      N/A                     None           Nursery school 
##                    38970                    11023                     1789

Prepare for analysis

Select a subset of columns for analysis