First, we have to load them. For more efficiency, I am going to load and transform both learn and test files at the same time.
usc <- fread("~/Desktop/VINCENT/Dataiku/Data/census_income_learn.csv", integer64 = "character", showProgress = TRUE)
usc$part <- "learn"
usc2 <- fread("~/Desktop/VINCENT/Dataiku/Data/census_income_test.csv", integer64 = "character", showProgress = TRUE)
usc2$part <- "test"
usc <- rbind.data.frame(usc,usc2)
rm(usc2)
By exploring the metadata file and the U.S. census file, I have figured out that some columns were missing, and others were not in the metadata file. I had to create my own file with the names of variables :
Var <- read.csv("~/Desktop/VINCENT/Dataiku/Data/Var.txt", sep="")
Var<- Var$Lib
names(usc) <- as.vector(Var)
Now that we have our complete file, we can begin to explore it.
Let’s plot the first two variables age , class of worker and sex which I assume, have a great incidence on the income.
a <- ggplot(usc, aes(Age,fill = class_of_worker)) +
geom_histogram(binwidth=1, alpha=.5,position="identity") +
facet_grid(~income) +
ggtitle("Class of worker by age and income")
ggplotly(a)
First thing we see, the Not in universe category in the class of worker variable.There is almost no one with an income of +50000$ in it. Most of them are under 18 or over 60. We can assume that it reprensents the inactive population. It is not really worth it to predict an income on a population which is not eligible for work. Let’s exclude them. Plus, I transform the income variable to make it clearer.
usc <- usc[which(class_of_worker != "Not in universe")]
usc$income[usc$income == "- 50000."] <- "-50000$"
usc$income[usc$income == "50000+."] <- "+50000$"
Let us give a “demographic” dimension to our age / sex variable by income I prefer to hide the code because of its length.
It is obvious that both age and sex have an influence on the level of income.
c <- ggplot(data=usc,aes(x = education,fill = income)) +
geom_bar() +
coord_flip() +
scale_fill_manual(values = c("#CCCCCC","#FFCC33")) +
theme_classic()
ggplotly(c)
We see that the education variable has too many categories. Let’s regroup it by schoolgrades, and let’s transform it as a factor so that we have an ordinal variable for education (from undereducated to the highest degrees)
usc$education_REC <- revalue(usc$education,c("10th grade" = "01-Less than Hight School",
"1st 2nd 3rd or 4th grade"= "01-Less than Hight School",
"9th grade" = "01-Less than Hight School",
"Less than 1st grade" = "01-Less than Hight School",
"12th grade no diploma" = "01-Less than Hight School",
"5th or 6th grade" = "01-Less than Hight School",
"7th and 8th grade" = "01-Less than Hight School",
"11th grade" = "01-Less than Hight School",
"High school graduate" = "02-High school",
"Some college but no degree" = "02-High school",
"Associates degree-occup /vocational" = "03-Associates degree",
"Associates degree-academic program" = "03-Associates degree",
"Bachelors degree(BA AB BS)" = "04-Degree",
"Masters degree(MA MS MEng MEd MSW MBA)" = "04-Degree",
"Doctorate degree(PhD EdD)" = "05-Phd",
"Prof school degree (MD DDS DVM LLB JD)" = "05-Phd"))
usc$education_REC <- as.factor(usc$education_REC)
Then we can plot it, using proportional histogramm. It will underline in which categories the +50000$ are overrepresented.
d <- ggplot(usc, aes(x = usc$education_REC, fill = income)) +
geom_bar(position = "fill") +
theme(axis.text.x = element_text(angle=45,face="bold")
,axis.title.x = element_blank()
,axis.title.y = element_blank()) +
scale_fill_manual(values = c("#CCCCCC","#FFCC33")) +
coord_flip()
plotly_build(d)
Education seems to have a strong influence on the odds to get an income superior to 50000$.
We can also explore the birth_self variable, which indicates were the individual was born. In order to do that, let’s represent the percentage by country of the indivuduals with a 5000k income.
usc$Less50 = if_else(usc$income == "-50000$",1,0)
usc$More50 = if_else(usc$income == "+50000$",1,0)
usc$TOTAL = 1
MAP <- as.data.frame(usc[,list(More50 = sum(More50),
Total = sum(TOTAL)),
by = "birth_self"])
MAP$percent = MAP$More50 / MAP$Total
head(MAP,5)
## birth_self More50 Total percent
## 1 United-States 15476 129756 0.11927001
## 2 Columbia 21 407 0.05159705
## 3 Mexico 120 5035 0.02383317
## 4 Peru 12 246 0.04878049
## 5 Philippines 109 897 0.12151616
We can try to project it on a map. First, let’s import a shapefile so that we have a border layer.
#shape <- readShapePoly("C:/Users/GAUME/Desktop/Personnel/Da/TM_WORLD_BORDERS")
shape <- readShapePoly("~/Desktop/VINCENT/Dataiku/SHP/TM_WORLD_BORDER")
In a second time, we can merge our shapefile with our data. To make the job easier, I have renamed the country names directly in the shapefile.
MAP_merged<- geo_join(shape,MAP, "NAME", "birth_self")
Finally, we can project the results :
####### POP-UP ########
popup <- paste0(shape$NAME, "<br>",
"% Income > 50K$. : ",round((MAP_merged$percent),2),"<br>",
"Nb individuals > 50K$ : " , MAP_merged$More50)
####### Palette #######
pal <- colorNumeric(palette = "PuBu",
domain = MAP_merged$percent)
####### MAP #######
map <-leaflet() %>%
addProviderTiles("CartoDB.Positron") %>%
addPolygons(data = MAP_merged,
fillColor = ~pal(percent),
color = "#b2aeae", # border color
fillOpacity = 0.7,
weight = 1,
smoothFactor = 0.2,
popup = popup)
map