Source the controller, which reads in the necessary packages and custom functions. The controller also reads in the data and runs a cleaning function to get it into a shape we can work with. I then separate out my cleaned data and prepare to work with it.

## Warning: package 'data.table' was built under R version 3.3.2
## Warning: package 'tidyr' was built under R version 3.3.2
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'ggplot2' was built under R version 3.3.2
## Warning: package 'tibble' was built under R version 3.3.2
## Warning: package 'purrr' was built under R version 3.3.2
## -------------------------------------------------------------------------
## data.table + dplyr code now lives in dtplyr.
## Please library(dtplyr)!
## -------------------------------------------------------------------------
## Conflicts with tidy packages ----------------------------------------------
## between():   dplyr, data.table
## filter():    dplyr, stats
## first():     dplyr, data.table
## lag():       dplyr, stats
## last():      dplyr, data.table
## transpose(): purrr, data.table
## Warning: package 'stringr' was built under R version 3.3.2
## Warning: package 'plotly' was built under R version 3.3.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:plotly':
## 
##     subplot
## The following objects are masked from 'package:dplyr':
## 
##     combine, src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
## Loading required package: rpart
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
## The following objects are masked from 'package:data.table':
## 
##     dcast, melt
## Warning: package 'shiny' was built under R version 3.3.2
## Warning: package 'corrplot' was built under R version 3.3.2
## corrplot 0.84 loaded
## Warning: package 'forecast' was built under R version 3.3.2
## Loading required package: Rcpp
## Warning: package 'Rcpp' was built under R version 3.3.2
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2018 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
## 
##     impute
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
## 
##     select
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loaded glmnet 2.0-5
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
## Warning: package 'caret' was built under R version 3.3.2
## 
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
## 
##     cluster
## The following object is masked from 'package:purrr':
## 
##     lift
## Warning: package 'pscl' was built under R version 3.3.2
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
## Warning: package 'pROC' was built under R version 3.3.2
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following object is masked from 'package:glmnet':
## 
##     auc
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
## Warning: package 'zoo' was built under R version 3.3.2
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday,
##     week, yday, year
## The following object is masked from 'package:base':
## 
##     date
## c("X73", "Not.in.universe", "X0", "X0.1", "High.school.graduate", 
## "X0.2", "Not.in.universe.1", "Widowed", "Not.in.universe.or.children", 
## "Not.in.universe.2", "White", "All.other", "Female", "Not.in.universe.3", 
## "Not.in.universe.4", "Not.in.labor.force", "X0.3", "X0.4", "X0.5", 
## "Nonfiler", "Not.in.universe.5", "Not.in.universe.6", "Other.Rel.18..ever.marr.not.in.subfamily", 
## "Other.relative.of.householder", "X1700.09", "X.", "X..1", "X..2", 
## "Not.in.universe.under.1.year.old", "X..3", "X0.6", "Not.in.universe.7", 
## "United.States", "United.States.1", "United.States.2", "Native..Born.in.the.United.States", 
## "X0.7", "Not.in.universe.8", "X2", "X0.8", "X95", "X..50000."
## )
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
## c("X38", "Private", "X6", "X36", "X1st.2nd.3rd.or.4th.grade", 
## "X0", "Not.in.universe", "Married.civilian.spouse.present", "Manufacturing.durable.goods", 
## "Machine.operators.assmblrs...inspctrs", "White", "Mexican..Mexicano.", 
## "Female", "Not.in.universe.1", "Not.in.universe.2", "Full.time.schedules", 
## "X0.1", "X0.2", "X0.3", "Joint.one.under.65...one.65.", "Not.in.universe.3", 
## "Not.in.universe.4", "Spouse.of.householder", "Spouse.of.householder.1", 
## "X1032.38", "X.", "X..1", "X..2", "Not.in.universe.under.1.year.old", 
## "X..3", "X4", "Not.in.universe.5", "Mexico", "Mexico.1", "Mexico.2", 
## "Foreign.born..Not.a.citizen.of.U.S", "X0.4", "Not.in.universe.6", 
## "X2", "X12", "X95", "X..50000.")
## Warning in FUN(X[[i]], ...): NAs introduced by coercion
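
The coercion warning above is what R emits when a character vector containing non-numeric entries is converted with `as.numeric()` (or a similar conversion); those entries become `NA`. As a small illustration, using a made-up vector rather than the actual census columns:

```r
# Hypothetical raw column: numeric strings mixed with non-numeric labels.
rawWage <- c("0", "1200", "Not in universe", "?")

# as.numeric() turns the unparseable entries into NA and warns about it;
# suppressWarnings() silences the message once the NAs are expected.
cleanWage <- suppressWarnings(as.numeric(rawWage))

cleanWage
sum(is.na(cleanWage))  # number of values lost to coercion
```

Counting the resulting `NA`s per column is a quick way to confirm that only placeholder labels, and not real values, were lost.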

Before diving in, I’m curious to see the distribution of my data as well as the distribution of the older, messier data. Everyone likes an ego boost. Since most of my variables are dummies, nothing in the summary of the cleaned data is too surprising. I then compare it to the data as it was. As I suspected, the qualitative data is not suited to summary statistics or analysis. However, despite my cleaning, the summary of the cleaned data is not incredibly helpful either. I then make a quick set of bar charts to view the distribution of the data. One could also use ggplot with a facet wrap; however, since the graphs are only for personal eyeballing, base R graphics do the job.

summary(census_train)
##       age            ageSq           male            female      
##  Min.   : 0.00   Min.   :   0   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:15.00   1st Qu.: 225   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :33.00   Median :1089   Median :0.0000   Median :1.0000  
##  Mean   :34.49   Mean   :1688   Mean   :0.4788   Mean   :0.5212  
##  3rd Qu.:50.00   3rd Qu.:2500   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :90.00   Max.   :8100   Max.   :1.0000   Max.   :1.0000  
##                                                                  
##  normalizedWageHr     foreignDad       foreignMom        foreign      
##  Min.   :    0.00   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:    0.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :    0.00   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :   56.43   Mean   :0.0947   Mean   :0.0947   Mean   :0.0947  
##  3rd Qu.:    0.00   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10298.00   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  NA's   :99695      NA's   :2813     NA's   :2813     NA's   :2813    
##  wksWorkedPastYr     black            white           hispanic     
##  Min.   : 0.00   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 0.00   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:1.0000  
##  Median : 8.00   Median :0.0000   Median :1.0000   Median :1.0000  
##  Mean   :23.18   Mean   :0.1023   Mean   :0.8388   Mean   :0.9941  
##  3rd Qu.:52.00   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :52.00   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                    
##    unemployed         blueCollar      whiteCollar      belowCollege   
##  Min.   :0.000000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:1.0000  
##  Median :0.000000   Median :0.0000   Median :1.0000   Median :1.0000  
##  Mean   :0.000827   Mean   :0.1608   Mean   :0.8392   Mean   :0.8255  
##  3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.000000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                       
##     college        aboveCollege     aboveMasters        divorced     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.1264   Mean   :0.0481   Mean   :0.00633   Mean   :0.0637  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  
##                                                                      
##     married           single        householder      bothParents    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :1.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.4331   Mean   :0.5669   Mean   :0.2669   Mean   :0.1954  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                     
##     children      whiteDivorcedF    blackDivorcedF     hispanicDivorcedF
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.000000   Min.   :0.000    
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.000    
##  Median :0.0000   Median :0.00000   Median :0.000000   Median :1.000    
##  Mean   :0.2771   Mean   :0.03202   Mean   :0.004776   Mean   :0.518    
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:1.000    
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.000000   Max.   :1.000    
##                                                                         
##  whiteDivorcedM    blackDivorcedM     hispanicDivorcedM    over50k       
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.000000   Median :0.00000   Median :0.00000  
##  Mean   :0.02207   Mean   :0.002356   Mean   :0.02516   Mean   :0.06206  
##  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.000000   Max.   :1.00000   Max.   :1.00000  
## 
summary(full_train$dtMessy)
##       age        classOfWorker      industryRecode      occRecode        
##  Min.   : 0.00   Length:199522      Length:199522      Length:199522     
##  1st Qu.:15.00   Class :character   Class :character   Class :character  
##  Median :33.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :34.49                                                           
##  3rd Qu.:50.00                                                           
##  Max.   :90.00                                                           
##                                                                          
##      edu                wageHr        eduInLastWk       
##  Length:199522      Min.   :   0.00   Length:199522     
##  Class :character   1st Qu.:   0.00   Class :character  
##  Mode  :character   Median :   0.00   Mode  :character  
##                     Mean   :  55.43                     
##                     3rd Qu.:   0.00                     
##                     Max.   :9999.00                     
##                                                         
##  maritalStat        majorIndustry      majorOccCode          hispanic     
##  Length:199522      Length:199522      Length:199522      Min.   :0.0000  
##  Class :character   Class :character   Class :character   1st Qu.:1.0000  
##  Mode  :character   Mode  :character   Mode  :character   Median :1.0000  
##                                                           Mean   :0.9941  
##                                                           3rd Qu.:1.0000  
##                                                           Max.   :1.0000  
##                                                                           
##      sex             laborUnion        unemploymentReason
##  Length:199522      Length:199522      Length:199522     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  employmentStatus     capGains           capLoss         
##  Length:199522      Length:199522      Length:199522     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##     stocks           taxStatus            region         
##  Length:199522      Length:199522      Length:199522     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##     state              hhStat             hhSum          
##  Length:199522      Length:199522      Length:199522     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   instanceWt        migrationMSA       migrationReg      
##  Length:199522      Length:199522      Length:199522     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  migrationWithInReg house1PlusYr       prevResInSunbelt  
##  Length:199522      Length:199522      Length:199522     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  pplWorkForEmp       fam18under          foreignDad       foreignMom    
##  Length:199522      Length:199522      Min.   :0.0000   Min.   :0.0000  
##  Class :character   Class :character   1st Qu.:0.0000   1st Qu.:0.0000  
##  Mode  :character   Mode  :character   Median :0.0000   Median :0.0000  
##                                        Mean   :0.0947   Mean   :0.0947  
##                                        3rd Qu.:0.0000   3rd Qu.:0.0000  
##                                        Max.   :1.0000   Max.   :1.0000  
##                                        NA's   :2813     NA's   :2813    
##     foreign       citizenship        bizOrSelfEmp         vetAdmin        
##  Min.   :0.0000   Length:199522      Length:199522      Length:199522     
##  1st Qu.:0.0000   Class :character   Class :character   Class :character  
##  Median :0.0000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.0947                                                           
##  3rd Qu.:0.0000                                                           
##  Max.   :1.0000                                                           
##  NA's   :2813                                                             
##    vetBens          wksWorkedPastYr     year              over50k       
##  Length:199522      Min.   : 0.00   Length:199522      Min.   :0.00000  
##  Class :character   1st Qu.: 0.00   Class :character   1st Qu.:0.00000  
##  Mode  :character   Median : 8.00   Mode  :character   Median :0.00000  
##                     Mean   :23.18                      Mean   :0.06206  
##                     3rd Qu.:52.00                      3rd Qu.:0.00000  
##                     Max.   :52.00                      Max.   :1.00000  
##                                                                         
##      female            male            black            white       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.0000  
##  Median :1.0000   Median :0.0000   Median :0.0000   Median :1.0000  
##  Mean   :0.5212   Mean   :0.4788   Mean   :0.1023   Mean   :0.8388  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                     
##    unemployed         blueCollar      whiteCollar      belowCollege   
##  Min.   :0.000000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:1.0000  
##  Median :0.000000   Median :0.0000   Median :1.0000   Median :1.0000  
##  Mean   :0.000827   Mean   :0.1608   Mean   :0.8392   Mean   :0.8255  
##  3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.000000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                       
##     college        aboveCollege     aboveMasters        divorced     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.1264   Mean   :0.0481   Mean   :0.00633   Mean   :0.0637  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  
##                                                                      
##     married           single        householder      bothParents    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :1.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.4331   Mean   :0.5669   Mean   :0.2669   Mean   :0.1954  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                     
##     children      normalizedWageHr       ageSq      whiteDivorcedF   
##  Min.   :0.0000   Min.   :    0.00   Min.   :   0   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:    0.00   1st Qu.: 225   1st Qu.:0.00000  
##  Median :0.0000   Median :    0.00   Median :1089   Median :0.00000  
##  Mean   :0.2771   Mean   :   56.44   Mean   :1688   Mean   :0.03202  
##  3rd Qu.:1.0000   3rd Qu.:    0.00   3rd Qu.:2500   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :10298.97   Max.   :8100   Max.   :1.00000  
##                   NA's   :99695                                      
##  blackDivorcedF     hispanicDivorcedF whiteDivorcedM    blackDivorcedM    
##  Min.   :0.000000   Min.   :0.000     Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.000     1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.000000   Median :1.000     Median :0.00000   Median :0.000000  
##  Mean   :0.004776   Mean   :0.518     Mean   :0.02207   Mean   :0.002356  
##  3rd Qu.:0.000000   3rd Qu.:1.000     3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.000     Max.   :1.00000   Max.   :1.000000  
##                                                                           
##  hispanicDivorcedM
##  Min.   :0.00000  
##  1st Qu.:0.00000  
##  Median :0.00000  
##  Mean   :0.02516  
##  3rd Qu.:0.00000  
##  Max.   :1.00000  
## 
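# Pull the missingPct element out of full_train for plotting.
meltMissing <- full_train$missingPct
# Store the chart under its own name so it does not shadow barchart().
missingBarchart <- barchart(meltMissing)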
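
How `missingPct` is built is not shown here. As a minimal sketch, assuming it holds the percentage of missing values per column, something like the following would compute it and plot it with base R's `barplot()`. The toy data frame and its values are made up for illustration.

```r
# Toy data frame standing in for the cleaned census data (hypothetical values).
toy <- data.frame(
  age              = c(25L, 40L, NA, 61L),
  normalizedWageHr = c(NA, 1200, NA, 0),
  foreign          = c(0L, 1L, 0L, 0L)
)

# Share of missing values per column, expressed as a percentage.
missingPct <- colMeans(is.na(toy)) * 100

# Base R bar chart of the missing share per variable.
barplot(missingPct, ylab = "% missing", las = 2)
```

On the real data this would make columns such as `normalizedWageHr`, where roughly half the rows are `NA` in the summary above, stand out immediately.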

I then do the same at-a-glance review, but a bit differently for the continuous data. Here, rather than raw frequencies, I get a better picture of how the data is distributed.

# First I want to get a basic summary of my data. However, since my custom data is mostly binary, this is
# neither helpful nor intuitive. So I do that only for my integer data types.
intData <- select_if(census_train, is.integer)
summary(intData)
##       age            ageSq           male            female      
##  Min.   : 0.00   Min.   :   0   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:15.00   1st Qu.: 225   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :33.00   Median :1089   Median :0.0000   Median :1.0000  
##  Mean   :34.49   Mean   :1688   Mean   :0.4788   Mean   :0.5212  
##  3rd Qu.:50.00   3rd Qu.:2500   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :90.00   Max.   :8100   Max.   :1.0000   Max.   :1.0000  
##                                                                  
##  normalizedWageHr     foreignDad       foreignMom        foreign      
##  Min.   :    0.00   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:    0.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :    0.00   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :   56.43   Mean   :0.0947   Mean   :0.0947   Mean   :0.0947  
##  3rd Qu.:    0.00   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10298.00   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  NA's   :99695      NA's   :2813     NA's   :2813     NA's   :2813    
##  wksWorkedPastYr     black            white           hispanic     
##  Min.   : 0.00   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 0.00   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:1.0000  
##  Median : 8.00   Median :0.0000   Median :1.0000   Median :1.0000  
##  Mean   :23.18   Mean   :0.1023   Mean   :0.8388   Mean   :0.9941  
##  3rd Qu.:52.00   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :52.00   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                    
##    unemployed         blueCollar      whiteCollar      belowCollege   
##  Min.   :0.000000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:1.0000  
##  Median :0.000000   Median :0.0000   Median :1.0000   Median :1.0000  
##  Mean   :0.000827   Mean   :0.1608   Mean   :0.8392   Mean   :0.8255  
##  3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.000000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                       
##     college        aboveCollege     aboveMasters        divorced     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.1264   Mean   :0.0481   Mean   :0.00633   Mean   :0.0637  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  
##                                                                      
##     married           single        householder      bothParents    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :1.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.4331   Mean   :0.5669   Mean   :0.2669   Mean   :0.1954  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                     
##     children      whiteDivorcedF    blackDivorcedF     hispanicDivorcedF
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.000000   Min.   :0.000    
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.000    
##  Median :0.0000   Median :0.00000   Median :0.000000   Median :1.000    
##  Mean   :0.2771   Mean   :0.03202   Mean   :0.004776   Mean   :0.518    
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:1.000    
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.000000   Max.   :1.000    
##                                                                         
##  whiteDivorcedM    blackDivorcedM     hispanicDivorcedM    over50k       
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.000000   Median :0.00000   Median :0.00000  
##  Mean   :0.02207   Mean   :0.002356   Mean   :0.02516   Mean   :0.06206  
##  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.000000   Max.   :1.00000   Max.   :1.00000  
## 
boxplot(intData)
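
In the summary above, the integer filter seems to keep the 0/1 dummies as well, presumably because they are stored as integers too. One possible refinement, sketched here on a made-up data frame, is to keep only the columns with more than two distinct values before calling `summary()` and `boxplot()`. The helper `isNonBinary` is hypothetical and not part of the original code.

```r
# Toy data frame with two count-like columns and two dummies (invented values).
toy <- data.frame(
  age             = c(25L, 40L, 33L, 61L),
  wksWorkedPastYr = c(0L, 52L, 8L, 52L),
  male            = c(1L, 0L, 0L, 1L),
  over50k         = c(0L, 0L, NA, 1L)
)

# TRUE when a column has more than two distinct non-missing values.
isNonBinary <- function(col) length(unique(na.omit(col))) > 2

# Filter() applies the predicate column by column and keeps the matches.
contData <- Filter(isNonBinary, toy)

summary(contData)
boxplot(contData)
```

With the dummies dropped, the boxplot axes are no longer squeezed by columns that only take the values 0 and 1.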

Lastly, I put it all together in one place. This serves as a gut check on the graphs and summaries I saw above.

# For the rest of the data, I want to get an idea of frequencies.
meltedCensus <- melt(census_train)
## Warning in melt.data.table(census_train): To be consistent with reshape2's
## melt, id.vars and measure.vars are internally guessed when both are 'NULL'.
## All non-numeric/integer/logical type columns are conisdered id.vars, which
## in this case are columns []. Consider providing at least one of 'id' or
## 'measure' vars in future.
ggplot(meltedCensus,aes(x = value)) + 
    facet_wrap(~variable,scales = "free_x") + 
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 108134 rows containing non-finite values (stat_bin).
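
The messages above hint at two small adjustments: naming the measured columns in `melt()` and choosing the bin count explicitly. A sketch on a toy `data.table` (column values invented for illustration) might look like the following; `na.rm = TRUE` drops the missing values without the warning, and the bin count of 10 is an arbitrary choice.

```r
library(data.table)
library(ggplot2)

# Toy stand-in for census_train; all columns share the integer type.
toy <- data.table(
  age     = c(25L, 40L, NA, 61L),
  male    = c(1L, 0L, 0L, 1L),
  over50k = c(0L, 0L, 0L, 1L)
)

# Name the measured columns explicitly so melt() has nothing to guess.
melted <- melt(toy, measure.vars = names(toy))

# One histogram panel per variable, with an explicit bin count.
p <- ggplot(melted, aes(x = value)) +
  facet_wrap(~variable, scales = "free_x") +
  geom_histogram(bins = 10, na.rm = TRUE)
print(p)
```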