Kaggle Script: American Community Survey

The data consisted of two data sets: one at the grain of individual people, the other at the grain of individual households.
Also included with the data were “County Facts”, variables that describe the population of each county.

Libraries Used

Library	Description
data.table	Efficient table functions
ggplot2	Grammar of Graphics plotting library (charts)
sqldf	Query data frames like SQL tables
maps	US maps
dplyr	Data wrangling package
reshape2	Reshape your data
RSQLite	Connect to SQLite databases
choroplethr	Map plots

Variables

Variable	Description
ST	State
PWGTP	Person's Weight
AGEP	Age
CIT	Citizenship
COW	Class of Worker
DEAR	Hearing Disability
DEYE	Seeing Disability
ENG	English Skills
MAR	Marital Status
SCH	School Enrollment
SCHG	Grade Level Attending
SCHL	Educational Attainment
SEX	Gender
WAGP	Wage
ANC1P	Recoded Detailed Ancestry
POBP	Place of Birth (Recode)
VPS	Veteran Period of Service
ESR	Employment Status Recode
PINCP	Total Person’s Income
RAC1P	Recoded Detailed Race Code

The data.table package was originally used to read in the data and to select only the columns we wanted.
To make later runs faster, I saved the resulting data frame to an .rds file.
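The read-then-cache step can be sketched as follows. This is a minimal illustration, not the original script: the tiny CSV, its temp-file paths, and the column list are made up for the example; only fread(), saveRDS(), and readRDS() are the real functions involved.

```r
library(data.table)

# Hypothetical stand-in for the real survey file: a tiny CSV in a temp location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("ST,AGEP,PINCP,UNUSED", "1,34,52000,x", "2,51,38000,y"), csv_path)

# Read only the columns of interest; fread() skips the others
wanted <- c("ST", "AGEP", "PINCP")
popDF <- as.data.frame(fread(csv_path, select = wanted))

# Cache the trimmed data frame so later runs can skip the CSV parse
rds_path <- tempfile(fileext = ".rds")
saveRDS(popDF, rds_path)
popDF <- readRDS(rds_path)
```

Reading the .rds back restores the data frame with its column types intact, which is why it works well as a cache between runs.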

All of the dimensions are referenced by IDs, so we must give them descriptions. Thanks to R factors, this doesn't make storage of the data frame any less efficient: a factor still stores an integer code for each value, and we only need to name the levels.
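As a small sketch of the idea, here is how a coded column can be turned into a labeled factor. The four-element vector is made up for illustration; the labels assume the usual coding of SEX in this survey (1 = Male, 2 = Female).

```r
# Toy vector of coded values (hypothetical sample, not the real data)
sex_codes <- c(1L, 2L, 2L, 1L)

# Attach descriptive labels; the factor keeps integer codes underneath
sex <- factor(sex_codes, levels = c(1L, 2L), labels = c("Male", "Female"))

levels(sex)      # the descriptive names
as.integer(sex)  # the integer codes the factor actually stores
```

The same pattern applies to the other coded variables, with the level labels taken from the data dictionary.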

I find SQL CASE statements to be the easiest way to make derived columns. You can also use mutate from dplyr, but I find the SQL easier to read.

# sqldf() returns a one-column data frame, so pull out the column before assigning it
popDF$SCHLRecode <- sqldf(
  "
  select case
    when SCHL <= 15 then 'NonGrad'
    when SCHL = 16 then 'HS'
    when SCHL = 17 then 'GED'
    when SCHL between 18 and 19 then 'College Drop Out'
    when SCHL = 20 then 'Associate'
    when SCHL = 21 then 'Bachelor'
    when SCHL = 22 then 'Master'
    when SCHL = 23 then 'Professional'
    when SCHL = 24 then 'Doctorate'
    else NULL end as SCHLRecode
  from popDF"
)$SCHLRecode

# Store the recoded labels as a factor
popDF$SCHLRecode <- factor(popDF$SCHLRecode)
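For comparison, the dplyr route mentioned above might look like the sketch below. It is not part of the original script: the toy data frame is made up, and case_when() is used here as one possible way to express the same buckets inside mutate().

```r
library(dplyr)

# Toy data with a few SCHL codes, including a missing value (hypothetical sample)
toy <- data.frame(SCHL = c(12, 16, 17, 19, 21, 24, NA))

toy <- toy %>%
  mutate(SCHLRecode = case_when(
    SCHL <= 15 ~ "NonGrad",
    SCHL == 16 ~ "HS",
    SCHL == 17 ~ "GED",
    SCHL >= 18 & SCHL <= 19 ~ "College Drop Out",
    SCHL == 20 ~ "Associate",
    SCHL == 21 ~ "Bachelor",
    SCHL == 22 ~ "Master",
    SCHL == 23 ~ "Professional",
    SCHL == 24 ~ "Doctorate"
  ))  # rows matching no condition (including NA) are left as NA

toy$SCHLRecode <- factor(toy$SCHLRecode)
```

Both versions produce the same buckets; which one reads better is largely a matter of taste.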

Before we get into plotting, I create a palette of specific color values that I find works well across many different factors.

cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7" , "magenta1")

##Map the Income Data

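incomemapDF <- subset(popDF, !is.na(PINCP))
# Mean income per state; tapply() keeps the state labels as names
incomemapDF <- as.data.frame(tapply(incomemapDF$PINCP, incomemapDF$ST, mean))
# state_choropleth() matches on lowercase full state names in "region"
incomemapDF$region <- tolower(row.names(incomemapDF))
colnames(incomemapDF)[1] <- "value"
state_choropleth(incomemapDF)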
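The map function expects a data frame with a region column and a value column. A toy version of the aggregation step, with made-up incomes and assuming ST already holds full state names, shows the shape that results:

```r
# Hypothetical sample: two states, one missing income
toy <- data.frame(
  ST    = c("Ohio", "Ohio", "Texas", "Texas"),
  PINCP = c(40000, NA, 30000, 50000)
)

# Drop missing incomes, then average per state
toy <- subset(toy, !is.na(PINCP))
mapDF <- as.data.frame(tapply(toy$PINCP, toy$ST, mean))

# Row names carry the state labels; lowercase them for the region column
mapDF$region <- tolower(row.names(mapDF))
colnames(mapDF)[1] <- "value"
mapDF
```

Note that this is an unweighted mean over survey respondents, as in the script itself.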

##Boxplot of Income vs Graduate Level Bucket
ggplot(popDF, aes(x = SCHLRecode, y = PINCP, fill = SCHLRecode)) +
  geom_boxplot() +
  ylim(0, 100000) +
  xlab("Degree Level") + ylab("Income USD") +
  ggtitle("Income") +
  scale_fill_manual(values = cbPalette)  # the boxes map fill, not color

ggplot(popDF, aes(x = SEX, y = PINCP, fill = SEX)) +
  geom_boxplot() +
  ylim(0, 100000) +
  xlab("Gender") + ylab("Income USD") +
  ggtitle("Income") +
  scale_fill_manual(values = cbPalette)

ggplot(popDF, aes(x = MAR, y = PINCP, fill = MAR)) +
  geom_boxplot() +
  ylim(0, 100000) +
  xlab("Marital Status") +
  ggtitle("Income") +
  scale_fill_manual(values = cbPalette)

ggplot(popDF, aes(PINCP)) +
  geom_area(stat = "bin", bins = 300, aes(fill = SCHLRecode)) +
  xlim(0, 100000)

ggplot(popDF, aes(x = PINCP, fill = SCHLRecode)) +
  geom_density(alpha = .3) +
  xlim(0, 100000) + ylim(0, .00025) +
  scale_fill_manual(values = cbPalette)

ggplot(popDF, aes(x = COW, y = PINCP, fill = COW)) +
  geom_boxplot() +
  ylim(0, 100000) +
  xlab("Class of Worker") + ylab("Income USD") +
  ggtitle("Income") +
  scale_fill_manual(values = cbPalette)

ggplot(popDF, aes(x = CIT, y = PINCP, fill = CIT)) +
  geom_boxplot() +
  ylim(0, 100000) +
  xlab("Citizen Status") + ylab("Income USD") +
  ggtitle("Income") +
  scale_fill_manual(values = cbPalette)

##levels(popDF.tidy$SCHLRecode)  
  
popDF.tidy <- popDF %>% select(PINCP,SCHLRecode,AGEP) %>% filter(!is.na(PINCP), !is.na(SCHLRecode), !is.na(AGEP))


ggplot(popDF.tidy, aes(x = AGEP, y = PINCP, color = SCHLRecode)) +
  geom_jitter() +
  ggtitle("Age vs Income") +
  xlab("Age") + ylab("Income") +
  scale_color_manual(values = cbPalette)

Salil Gupta

April 7, 2016