This was my input for the following Kaggle Competition: https://www.kaggle.com/census/2013-american-community-survey

The data consisted of two different data sets, one on the grain of individual people, the other on indvidiual households.
Also, included with the data were “County Facts”, variables that described the population in that County.

Libraries Used

Library Description
Data.Table Efficient table functions
GGPlot2 Grammar of Graphics Plotting library
SQLDf Query dataframes like SQL tables
Maps US Maps
ggplot2 Charts
dplyr the data wrangling package
reshpae2 reshape your data
RSQLite Connect to SQLite database
Choroplethr Map plots

Variables

Variable Description
ST State
PWGTP A library for plotting US County and State data
AGEP AGE
CIT CITIZENSHIP
COW Class of Worker
DEAR Hearing Disability
DEYE Seeing Disability
ENG English Skills
MAR Marital Status
SCH School Enrollment
SCHG Grade Level Attending
SCHL Educational Attainment
SEX Gender
WAGP Wage
ANC1P Recoded Detailed Ancestry
POBP A library for plotting US County and State data
VPS Veteran Status
ESR Employment Status Recode
PINCP Total Person’s Income
RAC1P Recoded Detailed Race Code

Data.Table package was originally used to read in the data, to only select in the columns we wanted.
To make running it faster I saved the dataframe to a .rds

All of the dimensions are referenced by IDs, so we must give them the description. This doesn’t make the storage of the dataframe any less efficient, because of R factors. It will still store the id, or level. We just need to name the levels.

I find sql case statements to be the easiest way to make derived columns. You can also use Mutate from dyplr, I find this easier to read however.

popDF$SCHLRecode <- sqldf(
  "
  select case 
  when SCHL <= 15 then 'NonGrad' 
  when SCHL = 16 then 'HS' 
  when SCHL = 17 then 'GED' 
  when SCHL <= 19 and SCHL >= 18 then 'College Drop Out' 
  when SCHL = 20 then 'Associate'
  when SCHL = 21 then 'Bachelor'
  when SCHL = 22 then 'Master'
  when SCHL = 23 then 'Professional'
  when SCHL = 24 then 'Doctorate'
  
  else NULL end 

  as SCHLRecode

  from popDF"
  
)



popDF$SCHLRecode <- factor(popDF$SCHLRecode)

Before we get into plotting I create a pallete with specific color values that I find works well with a lot of different factors

cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7" , "magenta1")
##Map the Income Data

incomemapDF <- subset(popDF, !is.na(PINCP))
incomemapDF <- as.data.frame(tapply(incomemapDF$PINCP, incomemapDF$ST, mean))
incomemapDF$region <- row.names(incomemapDF)
colnames(incomemapDF)[1] <- "value"
state_choropleth(incomemapDF)

##Boxplot of Income vs Graduate Level Bucket
ggplot(popDF, aes(x = SCHLRecode, y = PINCP, fill = SCHLRecode)) +  ylim(0,100000) + 
  geom_boxplot() +
  xlab("Degree Level") + ylab("Income USD") +
  ## ylab("Count") + 
  ggtitle("Income") +ylim(0,100000) + scale_color_manual(values = cbPalette)

ggplot(popDF, aes(x = SEX, y = PINCP, fill = SEX)) +  ylim(0,100000) + 
  geom_boxplot() +
  xlab("Gender") + ylab("Income USD") +
  ## ylab("Count") + 
  ggtitle("Income") + scale_color_manual(values = cbPalette)

ggplot(popDF, aes(x = MAR, y = PINCP, fill = MAR)) +  ylim(0,100000) + 
  geom_boxplot() +
  xlab("Marital Status") + 
  ## ylab("Count")  +
  ggtitle("Income") + scale_color_manual(values = cbPalette)

ggplot(popDF, aes(PINCP)) + geom_area(stat = "bin", bins = 300, aes(fill = SCHLRecode ))  +xlim(0,100000)

ggplot(popDF, aes(x = PINCP, fill = SCHLRecode)) + geom_density(alpha = .3)  +xlim(0,100000) + ylim(0,.00025) + scale_color_manual(values = cbPalette)

ggplot(popDF, aes(x = COW, y = PINCP, fill = COW)) +  ylim(0,100000) + 
  geom_boxplot() +
  xlab("Class of Worker") + ylab("Income USD") +
  ## ylab("Count") + 
  ggtitle("Income") + scale_color_manual(values = cbPalette)

ggplot(popDF, aes(x = CIT, y = PINCP, fill = CIT)) + ylim(0,100000) + 
  geom_boxplot() +
  xlab("Citizen Status") + ylab("Income USD") +
  ## ylab("Count") + 
  ggtitle("Income")  + scale_color_manual(values = cbPalette)

##levels(popDF.tidy$SCHLRecode)  
  
popDF.tidy <- popDF %>% select(PINCP,SCHLRecode,AGEP) %>% filter(!is.na(PINCP), !is.na(SCHLRecode), !is.na(AGEP))


ggplot(aes(x=AGEP,y=PINCP,color=SCHLRecode),data=popDF.tidy)+geom_jitter() + ggtitle("Age vs Income")+xlab("Age") + ylab("Income") + scale_color_manual(values = cbPalette)