Homelessness Graph Analytics

ggplot(ml_data, aes(EntryDateMonth)) + geom_histogram(binwidth = 1)

As we can see, there is a seasonal pattern here.
More people come to the center during summer and winter.
Fewer people come during spring and fall.
We need more resources during the peak seasons to ensure that we can provide quality services.
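The seasonal pattern can also be checked numerically rather than read off the histogram. The following is a minimal sketch on made-up entry months (the vector `entry_month` is hypothetical and stands in for `ml_data$EntryDateMonth`); it counts entries per month and flags the months above the monthly average.

```r
# Hypothetical entry months (1 = January ... 12 = December), NOT the project's data.
entry_month <- c(1, 1, 1, 2, 4, 6, 6, 7, 7, 7, 8, 10, 12, 12)

# Count entries per month; factor() keeps months that have zero entries.
month_counts <- table(factor(entry_month, levels = 1:12))

# Months whose count is above the overall monthly average.
peak_months <- as.integer(names(month_counts)[month_counts > mean(month_counts)])
peak_months
```

Comparing against the average is only one possible definition of a "peak" month; a fixed threshold or a ranking would work just as well.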

ggplot(ml_data, aes(EntryDateMonth, fill = as.factor(Destination))) + geom_histogram(binwidth = 1)

A higher factor number means a better outcome after clients completed the program.
We have great results during the summer but not-so-good results during the winter.
Let’s see why that is the case.

ggplot(ml_data, aes(EntryDateMonth, fill = as.factor(TypeProvided))) + geom_histogram(binwidth = 1)

The factor levels represent the specific service clients received during enrollment.
We know that the program does a great job during the summer, and ‘A2’ services are offered the most during the summer. (A2 = Community service/service learning (CSL))
During the winter, we offer fewer ‘A2’ services and more ‘B1’ services. We might need to change the service strategies. (B1 = Rental assistance)
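Because the stacked histogram shows raw counts, the mix of services within a month can be hard to compare across months with different totals. A minimal sketch of a row-wise proportion table follows; the data frame `enroll` is made up for illustration and only borrows the column names used in the report.

```r
# Hypothetical enrollments, NOT the project's data.
enroll <- data.frame(
  EntryDateMonth = c(1, 1, 1, 1, 7, 7, 7, 7),
  TypeProvided   = c("B1", "B1", "B1", "A2", "A2", "A2", "A2", "B1")
)

# Cross-tabulate entry month by service type.
counts <- table(enroll$EntryDateMonth, enroll$TypeProvided)

# Row-wise proportions: the share of each service type within a month.
shares <- prop.table(counts, margin = 1)
shares
```

With `margin = 1` each row sums to 1, so months with very different client counts become directly comparable.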

ggplot(ml_data, aes(EntryDateMonth, fill = as.factor(VADisabilityService))) + geom_histogram(binwidth = 1)

Most of the clients during the summer receive VA Disability Service, which could be a potential reason why we have such great outcomes during the summer. Note that this is a correlation, not evidence of causation.
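One way to check whether such an association is stronger than chance is a chi-squared test of independence. The sketch below uses invented counts (the matrix `tab` is hypothetical, not taken from `ml_data`) and is not part of the original analysis.

```r
# Hypothetical counts, NOT the project's data:
# rows = received VA disability service (0/1), columns = outcome.
tab <- matrix(c(60, 40,
                30, 70),
              nrow = 2, byrow = TRUE,
              dimnames = list(VADisabilityService = c("0", "1"),
                              Outcome = c("poor", "good")))

# Pearson chi-squared test of independence between service and outcome.
res <- chisq.test(tab)
res$p.value
```

A small p-value only indicates that the two variables are associated in the sample. It does not show that the service causes the better outcome, since season, service mix, and client characteristics may all be confounded.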

Heat Map Graph Analytics

g2

## Warning: Removed 69 rows containing non-finite values (stat_density2d).

g1

## Warning: Removed 1110 rows containing missing values (geom_point).

This map represents the intensity of homelessness.
Areas with more frequent homelessness are drawn with larger circles.
Let’s see which areas need more resources.
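The code that built `g1` and `g2` is not shown in this report. As a rough sketch of the circle layer only, assuming ggplot2 and using invented client records and invented approximate ZIP centroids (both data frames are hypothetical), the circle size can be tied to the number of clients per ZIP code like this:

```r
library(ggplot2)

# Hypothetical records: one row per client, with last permanent ZIP code.
clients <- data.frame(
  LastPermanentZIP = c("63101", "63101", "63101", "63136", "63136", "63118")
)

# Hypothetical ZIP centroids (rough values, for illustration only).
centroids <- data.frame(
  LastPermanentZIP = c("63101", "63136", "63118"),
  lon = c(-90.19, -90.26, -90.23),
  lat = c(38.63, 38.74, 38.59)
)

# Count clients per ZIP code, then attach the coordinates.
zip_counts <- as.data.frame(table(LastPermanentZIP = clients$LastPermanentZIP),
                            responseName = "n", stringsAsFactors = FALSE)
zip_map <- merge(zip_counts, centroids, by = "LastPermanentZIP")

# Circle area scales with the number of clients from each ZIP code.
p <- ggplot(zip_map, aes(lon, lat, size = n)) +
  geom_point(alpha = 0.5, colour = "red") +
  scale_size_area(max_size = 15)
```

The sketch omits the map background; `scale_size_area()` is used so that a ZIP code with twice as many clients gets a circle with twice the area.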

print("Top 10 Areas with the Most Homelessness")

## [1] "Top 10 Areas with the Most Homelessness"

top10_zip$LastPermanentZIP

##  [1] 63101 63136 63118 63103 63111 63125 63114 63104 63107 63143

The ZIP codes above are the 10 where homelessness occurs most frequently.
With this information, we can distribute the available resources more intelligently for higher-quality services.
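How `top10_zip` was built is not shown here. One plausible way to get such a ranking, sketched on a made-up vector `zips` (hypothetical, standing in for the `LastPermanentZIP` column), is a sorted frequency table:

```r
# Hypothetical ZIP codes, NOT the project's data.
zips <- c("63101", "63136", "63101", "63118", "63101", "63136", NA)

# Count clients per ZIP code (table() drops NA by default), most frequent first.
zip_freq <- sort(table(zips), decreasing = TRUE)

# Keep the n most frequent ZIP codes; n = 10 in the report, 2 in this toy example.
top_zip <- names(head(zip_freq, 2))
top_zip
```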

Predictive modeling

If we can predict a client's success prior to service enrollment, the center benefits greatly: when the model predicts a negative outcome, staff have a chance to improve the one-on-one service.
However, the current data sets have too many features, and the model does not need them all.
Let’s plot the most important features to select the best predictor variables.

plotFilterValues(var_imp, feat.type.cols=TRUE)

imp_feat <- (var_imp$data %>% arrange(-information.gain) %>% top_n(7))$name

## Selecting by information.gain

imp_feat

## [1] "EntryDateMonth"      "Discharge_Status.1"  "Discharge_Status.2" 
## [4] "Age"                 "Employed"            "TypeProvided"       
## [7] "VADisabilityService"
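For readers unfamiliar with the `information.gain` score used for the ranking, the sketch below computes it by hand for discrete variables as H(y) - H(y | x). The helper names `entropy` and `info_gain` and the toy vectors are hypothetical; the filter used in the report may use a different logarithm base or discretization, so its values need not match this scale.

```r
# Shannon entropy (in bits) of a discrete vector.
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of feature x about target y: H(y) - H(y | x).
info_gain <- function(x, y) {
  cond <- sum(sapply(split(y, x), function(ys) {
    length(ys) / length(y) * entropy(ys)
  }))
  entropy(y) - cond
}

# Hypothetical toy data, NOT the project's data.
outcome <- c(1, 1, 0, 0)
perfect <- c("a", "a", "b", "b")  # fully determines the outcome
useless <- c("a", "b", "a", "b")  # carries no information about the outcome

info_gain(perfect, outcome)
info_gain(useless, outcome)
```

A feature that fully determines the target gets the full entropy of the target as its gain, while a feature independent of the target gets zero.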

We already analyzed some of these variables above and found some useful insights.
With just these 7 variables, we can build a strong predictive model.
Let’s do a demonstration.

head(test, 3)

##   TypeProvided VADisabilityService EntryDateMonth Discharge_Status.1
## 2           B1                   0              1                  0
## 5           B1                   0              1                  0
## 7           B1                   0              1                  0
##   Discharge_Status.2 Age Employed Destination
## 2                  1   4        1           3
## 5                  1   4        1           3
## 7                  1   4        1           3

pred.rf.test <- predict(mdl.rf, test)
conf.mtx <- confusionMatrix(pred.rf.test, test$Destination)
conf.mtx$overall

##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.8656172      0.7352898      0.8639407      0.8672805      0.6662625 
## AccuracyPValue  McnemarPValue 
##      0.0000000            NaN

As you can see, the model reaches about 86.6% accuracy in predicting a client's outcome prior to enrollment, compared with a no-information rate of about 66.6% (AccuracyNull), and it does so without collecting too much data.
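To make the reported figures easier to interpret, the sketch below recomputes accuracy, the no-information rate (reported above as AccuracyNull), and Cohen's kappa by hand from a confusion matrix. The vectors `truth` and `pred` are made up and are not the model's actual predictions.

```r
# Hypothetical predictions and true labels, NOT the project's data.
truth <- factor(c(3, 3, 3, 3, 1, 1, 2, 2, 3, 1), levels = 1:3)
pred  <- factor(c(3, 3, 3, 1, 1, 1, 2, 3, 3, 1), levels = 1:3)

# Confusion matrix: rows are predictions, columns are true classes.
cm <- table(Predicted = pred, Actual = truth)
n  <- sum(cm)

# Share of correct predictions.
accuracy <- sum(diag(cm)) / n

# No-information rate: accuracy of always predicting the most common true class.
nir <- max(colSums(cm)) / n

# Cohen's kappa: agreement beyond what the class frequencies alone would give.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)
```

Accuracy is most informative when read against the no-information rate: a model is only useful to the extent that it beats always guessing the majority class.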

Improvement TODO:

Explore more features and perform a proper feature engineering process.
Construct a deep neural network with TensorFlow to build a more complex predictive model.
Use the Spark distributed computing system to find the best hyper-parameters for the deep neural network.

Challenges of this project

Cleaning multiple small, extremely dirty datasets and merging them into one of decent size was one of the challenges.

Global_Hack6

Kyu Cho

October 23, 2016
