This was my input for the following Kaggle Competition: https://www.kaggle.com/census/2013-american-community-survey
The data consists of two data sets: one at the grain of individual people, the other at the grain of individual households.
Also included with the data are “County Facts”, variables that describe the population of each county.
| Library | Description |
|---|---|
| data.table | Efficient table functions |
| ggplot2 | Grammar of Graphics plotting library (charts) |
| sqldf | Query data frames like SQL tables |
| maps | US maps |
| dplyr | Data wrangling |
| reshape2 | Reshape your data |
| RSQLite | Connect to SQLite databases |
| choroplethr | A library for plotting US county and state data on maps |
| Variable | Description |
|---|---|
| ST | State |
| PWGTP | Person Weight |
| AGEP | Age |
| CIT | Citizenship Status |
| COW | Class of Worker |
| DEAR | Hearing Disability |
| DEYE | Seeing Disability |
| ENG | English Skills |
| MAR | Marital Status |
| SCH | School Enrollment |
| SCHG | Grade Level Attending |
| SCHL | Educational Attainment |
| SEX | Gender |
| WAGP | Wage |
| ANC1P | Recoded Detailed Ancestry |
| POBP | Place of Birth |
| VPS | Veteran Period of Service |
| ESR | Employment Status Recode |
| PINCP | Total Person’s Income |
| RAC1P | Recoded Detailed Race Code |
The data.table package was originally used to read in the data, because it can select only the columns we want while reading.
To make later runs faster, I saved the resulting data frame to an .rds file.
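As a rough sketch of that read-then-cache step, assuming data.table is installed: the `select` argument of `fread` keeps only the named columns, and `saveRDS`/`readRDS` store and reload the result. The tiny CSV, its `EXTRA` column, and the temporary paths are made up for illustration; the real input would be the much larger person-level CSV files.

```r
library(data.table)

# Hypothetical stand-in for the person-level CSV (the real file is far larger)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("ST,AGEP,PINCP,SCHL,EXTRA",
             "1,34,42000,21,x",
             "2,51,58000,22,y"), csv_path)

# Read only the columns we need; everything else is skipped while parsing
keep_cols <- c("ST", "AGEP", "PINCP", "SCHL")
popDF <- as.data.frame(fread(csv_path, select = keep_cols))

# Cache the trimmed data frame so later runs can skip the CSV parse
rds_path <- tempfile(fileext = ".rds")
saveRDS(popDF, rds_path)
popDF <- readRDS(rds_path)
```

Reloading from the .rds file also preserves column types and factor levels, which a round trip through CSV would not.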
All of the dimensions are referenced by IDs, so we must give them descriptions. This doesn’t make storage of the data frame any less efficient, because an R factor still stores the integer ID for each row and keeps the descriptions once, as its levels. We just need to name the levels.
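A minimal sketch of naming the levels, using MAR as the example. The codes are toy values, and the labels are abbreviated from the ACS data dictionary, so check them against the dictionary for your survey year.

```r
# Toy codes as they appear in the raw data (MAR is coded 1-5)
mar_codes <- c(1, 5, 3, 1, 2)

# Attach descriptions to the codes (labels abbreviated; verify against the dictionary)
mar <- factor(mar_codes,
              levels = 1:5,
              labels = c("Married", "Widowed", "Divorced",
                         "Separated", "Never married"))

# The factor is still stored as integers; the labels live once in levels()
codes  <- as.integer(mar)
labels <- as.character(mar)
```

Note that `as.integer()` returns the position of each level, which matches the original code here only because the codes run from 1 upward without gaps.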
I find SQL CASE statements to be the easiest way to make derived columns. You can also use mutate from dplyr, but I find the SQL easier to read.
popDF$SCHLRecode <- sqldf(
"
select case
when SCHL <= 15 then 'NonGrad'
when SCHL = 16 then 'HS'
when SCHL = 17 then 'GED'
when SCHL <= 19 and SCHL >= 18 then 'College Drop Out'
when SCHL = 20 then 'Associate'
when SCHL = 21 then 'Bachelor'
when SCHL = 22 then 'Master'
when SCHL = 23 then 'Professional'
when SCHL = 24 then 'Doctorate'
else NULL end
as SCHLRecode
from popDF"
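)$SCHLRecode  # sqldf returns a data frame, so keep only the derived column
# Order the levels by attainment rather than alphabetically
popDF$SCHLRecode <- factor(popDF$SCHLRecode, levels = c("NonGrad", "HS", "GED", "College Drop Out", "Associate", "Bachelor", "Master", "Professional", "Doctorate"))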
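For comparison, the dplyr route mentioned earlier could look roughly like this, assuming dplyr is installed. `toyDF` is a made-up stand-in for `popDF`; the buckets mirror the SQL CASE statement.

```r
library(dplyr)

# Toy stand-in for popDF with a few SCHL codes, including a missing one
toyDF <- data.frame(SCHL = c(12, 16, 18, 21, 24, NA))

toyDF <- toyDF %>%
  mutate(SCHLRecode = case_when(
    SCHL <= 15            ~ "NonGrad",
    SCHL == 16            ~ "HS",
    SCHL == 17            ~ "GED",
    between(SCHL, 18, 19) ~ "College Drop Out",
    SCHL == 20            ~ "Associate",
    SCHL == 21            ~ "Bachelor",
    SCHL == 22            ~ "Master",
    SCHL == 23            ~ "Professional",
    SCHL == 24            ~ "Doctorate",
    TRUE                  ~ NA_character_   # missing or unexpected codes
  ))
```

Conditions are evaluated top to bottom and the first match wins, the same way the SQL CASE branches behave.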
Before we get into plotting, I create a palette with specific color values that I find works well across a lot of different factors.
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7" , "magenta1")
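One detail worth illustrating: a manual scale only affects the aesthetic it is named for, so a plot that maps `fill` needs `scale_fill_manual`, while one that maps `color` needs `scale_color_manual`. A minimal sketch with made-up data and a shortened palette:

```r
library(ggplot2)

# First three colours of the palette, enough for three groups
cbPalette <- c("#999999", "#E69F00", "#56B4E9")

# Made-up data purely for illustration
toyDF <- data.frame(group = c("a", "b", "c"), value = c(3, 5, 2))

# fill is the mapped aesthetic, so the palette goes through scale_fill_manual
p <- ggplot(toyDF, aes(x = group, y = value, fill = group)) +
  geom_col() +
  scale_fill_manual(values = cbPalette)

# Inspect the colours ggplot2 actually assigned to the bars
used_fills <- ggplot_build(p)$data[[1]]$fill
```

If the scale and the mapped aesthetic do not match, ggplot2 silently falls back to its default colours.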
## Map the Income Data
incomemapDF <- subset(popDF, !is.na(PINCP))
incomemapDF <- as.data.frame(tapply(incomemapDF$PINCP, incomemapDF$ST, mean))
incomemapDF$region <- row.names(incomemapDF)
colnames(incomemapDF)[1] <- "value"
state_choropleth(incomemapDF)
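The ACS is a sample, so each person record carries a weight (PWGTP) indicating how many people it represents. The map code above takes an unweighted mean; a weighted version could look roughly like this. The rows are invented, and the sketch assumes ST has already been relabeled with lowercase full state names, which is the form `state_choropleth` expects in its `region` column.

```r
# Invented rows standing in for popDF
toyDF <- data.frame(
  ST    = c("ohio", "ohio", "texas", "texas"),
  PINCP = c(20000, 60000, 30000, 50000),
  PWGTP = c(3, 1, 1, 1)
)

# Weighted mean income per state
by_state <- split(toyDF, toyDF$ST)
value <- sapply(by_state, function(d) weighted.mean(d$PINCP, d$PWGTP))

# Same two-column shape (region, value) that the mapping code builds
incomemapDF <- data.frame(region = names(value), value = unname(value))
```

In the toy data the first Ohio record counts three times, so the weighted Ohio mean is 30000 rather than the unweighted 40000.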
## Boxplot of Income vs Graduate Level Bucket
ggplot(popDF, aes(x = SCHLRecode, y = PINCP, fill = SCHLRecode)) + ylim(0, 100000) +
geom_boxplot() +
xlab("Degree Level") + ylab("Income USD") +
ggtitle("Income") + scale_fill_manual(values = cbPalette)  # fill is the mapped aesthetic, so use the fill scale
ggplot(popDF, aes(x = SEX, y = PINCP, fill = SEX)) + ylim(0, 100000) +
geom_boxplot() +
xlab("Gender") + ylab("Income USD") +
ggtitle("Income") + scale_fill_manual(values = cbPalette)
ggplot(popDF, aes(x = MAR, y = PINCP, fill = MAR)) + ylim(0, 100000) +
geom_boxplot() +
xlab("Marital Status") + ylab("Income USD") +
ggtitle("Income") + scale_fill_manual(values = cbPalette)
ggplot(popDF, aes(PINCP)) + geom_area(stat = "bin", bins = 300, aes(fill = SCHLRecode )) +xlim(0,100000)
ggplot(popDF, aes(x = PINCP, fill = SCHLRecode)) + geom_density(alpha = .3) +
xlim(0, 100000) + ylim(0, .00025) + scale_fill_manual(values = cbPalette)
ggplot(popDF, aes(x = COW, y = PINCP, fill = COW)) + ylim(0, 100000) +
geom_boxplot() +
xlab("Class of Worker") + ylab("Income USD") +
ggtitle("Income") + scale_fill_manual(values = cbPalette)
ggplot(popDF, aes(x = CIT, y = PINCP, fill = CIT)) + ylim(0, 100000) +
geom_boxplot() +
xlab("Citizen Status") + ylab("Income USD") +
ggtitle("Income") + scale_fill_manual(values = cbPalette)
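A caveat about the y-axis limits in these boxplots: `ylim()` discards observations outside the range before the box statistics are computed, so medians and quartiles are calculated on the truncated data. `coord_cartesian()` zooms the view without dropping anything. A small sketch with invented incomes shows the difference:

```r
library(ggplot2)

# Invented incomes: half of the values lie above the 100,000 cutoff
toyDF <- data.frame(group = "a",
                    income = c(10000, 20000, 30000, 150000, 200000, 250000))

base <- ggplot(toyDF, aes(x = group, y = income)) + geom_boxplot()

# ylim() removes the three large values, so the median is taken over the rest
clipped <- suppressWarnings(ggplot_build(base + ylim(0, 100000))$data[[1]])

# coord_cartesian() only zooms; the median still uses all six values
zoomed <- ggplot_build(base + coord_cartesian(ylim = c(0, 100000)))$data[[1]]

median_clipped <- clipped$middle
median_zoomed  <- zoomed$middle
```

Which behaviour is preferable depends on the question: clipping describes incomes under 100,000 only, while zooming describes everyone but hides the upper whiskers.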
##levels(popDF.tidy$SCHLRecode)
popDF.tidy <- popDF %>%
select(PINCP, SCHLRecode, AGEP) %>%
filter(!is.na(PINCP), !is.na(SCHLRecode), !is.na(AGEP))
ggplot(popDF.tidy, aes(x = AGEP, y = PINCP, color = SCHLRecode)) +
geom_jitter() +
ggtitle("Age vs Income") + xlab("Age") + ylab("Income") +
scale_color_manual(values = cbPalette)