Homelessness Graph Analytics
ggplot(ml_data, aes(EntryDateMonth)) + geom_histogram(binwidth = 1)

- As we can see, there is a seasonal pattern here.
- More people come to the center during summer and winter.
- Fewer people come during spring and fall.
- We need more resources during the peak seasons to ensure that we can provide quality services.
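The seasonal pattern can also be checked numerically rather than only by eye. The following is a minimal sketch on made-up data: the `toy` data frame, the month-to-season mapping, and the "above the monthly average" rule for peak months are all assumptions for illustration, not part of the original analysis.

```r
# Hypothetical stand-in for ml_data: one row per enrollment, entry month coded 1-12.
set.seed(1)
toy <- data.frame(
  EntryDateMonth = sample(1:12, 500, replace = TRUE,
                          prob = c(3, 3, 1, 1, 1, 3, 3, 3, 1, 1, 1, 3))
)

# Count enrollments per month; factor() keeps months that have zero entries.
monthly <- table(factor(toy$EntryDateMonth, levels = 1:12))

# Assumed season mapping: Dec-Feb = winter, Mar-May = spring,
# Jun-Aug = summer, Sep-Nov = fall.
season_of <- c("winter", "winter", "spring", "spring", "spring", "summer",
               "summer", "summer", "fall", "fall", "fall", "winter")
by_season <- tapply(as.vector(monthly), season_of, sum)

print(monthly)
print(by_season)

# Months whose count exceeds the monthly average are candidate peak months.
peak_months <- as.integer(names(monthly)[monthly > mean(monthly)])
print(peak_months)
```

On the real data, `toy` would be replaced by `ml_data`, and the peak-month rule should be adjusted to whatever threshold the center finds meaningful.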
ggplot(ml_data, aes(EntryDateMonth, fill = as.factor(Destination))) + geom_histogram(binwidth = 1)

- A higher factor number means a better outcome after the client completed the program.
- We have great results during the summer but not so good results during the winter.
- Let’s see why that is the case.
ggplot(ml_data, aes(EntryDateMonth, fill = as.factor(TypeProvided))) + geom_histogram(binwidth = 1)

- The factor levels represent the specific service a client received during the enrollment.
- The program performs well during the summer, which is also when ‘A2’ services are offered the most. (A2 = Community service/service learning (CSL))
- During the winter, we offer fewer ‘A2’ services and more ‘B1’ services. We might need to change the service strategy. (B1 = Rental assistance)
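Because the stacked histogram shows raw counts, the service mix is easier to compare across months as within-month shares. A minimal sketch, assuming a small made-up data frame; only the column names and the codes A2 and B1 come from the report.

```r
# Hypothetical enrollments; the values are invented for illustration.
toy <- data.frame(
  EntryDateMonth = c(1, 1, 1, 1, 7, 7, 7, 7, 7, 12, 12, 12),
  TypeProvided   = c("B1", "B1", "B1", "A2", "A2", "A2", "A2", "A2", "B1",
                     "B1", "B1", "A2")
)

# Rows are months, columns are service types.
counts <- table(toy$EntryDateMonth, toy$TypeProvided)

# margin = 1 normalizes each row, giving the share of each service within a month.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Each row of `shares` sums to 1, so a month with few enrollments can be compared directly with a busy one.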
ggplot(ml_data, aes(EntryDateMonth, fill = as.factor(VADisabilityService))) + geom_histogram(binwidth = 1)

- Most of the clients during the summer receive VA Disability Service, which could be a potential reason why we have such great outcomes during the summer. Note that this is correlation, not causation.
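One way to probe the correlation-versus-causation caveat is to compare outcomes with and without the service inside each season, instead of over the whole year. The sketch below uses invented numbers, constructed so that the season drives both the service and the outcome; the `Season`, `VA`, and `Success` columns are hypothetical simplifications of the report's variables.

```r
# Hypothetical data: Success is 1 for a good outcome, 0 otherwise.
toy <- data.frame(
  Season  = rep(c("summer", "winter"), each = 100),
  VA      = c(rep(1, 80), rep(0, 20), rep(1, 20), rep(0, 80)),
  Success = c(rep(1, 64), rep(0, 16), rep(1, 16), rep(0, 4),   # summer rows
              rep(1, 8),  rep(0, 12), rep(1, 32), rep(0, 48))  # winter rows
)

# Pooled over the year, clients with the service appear to do much better.
overall <- tapply(toy$Success, toy$VA, mean)
print(overall)

# Within each season, the success rate is the same with or without the service.
by_season <- tapply(toy$Success, list(toy$Season, toy$VA), mean)
print(by_season)

# A chi-squared test on the pooled table detects the association,
# but it cannot say which variable causes the outcome.
p_value <- chisq.test(table(toy$VA, toy$Success))$p.value
print(p_value)
```

In this toy example the pooled comparison is misleading because season is a confounder. Whether the same holds for the real data can only be decided by running the stratified comparison on `ml_data`.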
Heat Map Graph Analytics
g2
## Warning: Removed 69 rows containing non-finite values (stat_density2d).

g1
## Warning: Removed 1110 rows containing missing values (geom_point).

- This map shows the intensity of homelessness.
- Areas with more frequent homelessness are drawn with larger circles.
- Let’s see which areas need more resources.
print("Top 10 the Most Homelessness Area")
## [1] "Top 10 the Most Homelessness Area"
top10_zip$LastPermanentZIP
## [1] 63101 63136 63118 63103 63111 63125 63114 63104 63107 63143
- The ZIP codes above are the 10 last permanent ZIP codes where homelessness occurs most frequently.
- With this information, we can distribute available resources more intelligently to provide higher quality services.
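The code that built `top10_zip` is not shown in this report. A ranking like it could be derived along the following lines; this is a sketch on invented records, and only the column name `LastPermanentZIP` is taken from the report.

```r
# Hypothetical client records; NA marks a client without a recorded ZIP code.
clients <- data.frame(
  LastPermanentZIP = c(63101, 63101, 63101, 63136, 63136, 63118, NA,
                       63103, 63101, 63136)
)

# table() drops NA by default, so clients without a ZIP code are not counted.
zip_counts <- sort(table(clients$LastPermanentZIP), decreasing = TRUE)

# Keep the n most frequent ZIP codes (n = 10 in the report, 2 for this toy data).
top_zips <- head(zip_counts, 2)
print(top_zips)
```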
Predictive modeling
- If we can predict a client's outcome prior to service enrollment, the center gains a chance to improve its one-on-one service whenever the model predicts a negative outcome.
- However, the current data sets have too many features, and the model does not need all of them.
- Let’s plot the feature importance to select the best predictor variables.
plotFilterValues(var_imp, feat.type.cols=TRUE)

imp_feat <- (var_imp$data %>% arrange(-information.gain) %>% top_n(7))$name
## Selecting by information.gain
imp_feat
## [1] "EntryDateMonth"      "Discharge_Status.1"  "Discharge_Status.2"
## [4] "Age"                 "Employed"            "TypeProvided"
## [7] "VADisabilityService"
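The ranking above is based on information gain, which measures how much knowing a feature reduces the uncertainty (entropy) of the outcome. The sketch below computes it by hand for categorical features on made-up data. The helper names `entropy` and `info_gain` are hypothetical, and the filter used to build `var_imp` may differ in details such as the logarithm base or how numeric features are discretized.

```r
# Shannon entropy (in bits) of a categorical vector.
entropy <- function(x) {
  p <- prop.table(table(x))
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain = H(target) - H(target | feature).
info_gain <- function(feature, target) {
  weights <- prop.table(table(feature))
  cond <- sapply(split(target, feature), entropy)
  entropy(target) - sum(as.vector(weights) * cond[names(weights)])
}

# Hypothetical data: one feature determines the target, the other is unrelated.
toy <- data.frame(
  target  = c(1, 1, 1, 1, 0, 0, 0, 0),
  perfect = c("a", "a", "a", "a", "b", "b", "b", "b"),
  useless = c("a", "b", "a", "b", "a", "b", "a", "b")
)

gains <- sapply(toy[c("perfect", "useless")], info_gain, target = toy$target)
ranked <- sort(gains, decreasing = TRUE)
print(ranked)
```

A feature that fully determines the outcome scores the full entropy of the target (1 bit here), and an unrelated feature scores 0.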
- We already analyzed some of these variables earlier and found some significant insights.
- With these 7 variables, we can build a strong predictive model.
- Let’s do a demonstration.
head(test, 3)
##   TypeProvided VADisabilityService EntryDateMonth Discharge_Status.1
## 2           B1                   0              1                  0
## 5           B1                   0              1                  0
## 7           B1                   0              1                  0
##   Discharge_Status.2 Age Employed Destination
## 2                  1   4        1           3
## 5                  1   4        1           3
## 7                  1   4        1           3
pred.rf.test <- predict(mdl.rf, test)
conf.mtx <- confusionMatrix(pred.rf.test, test$Destination)
conf.mtx$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull
##      0.8656172      0.7352898      0.8639407      0.8672805      0.6662625
## AccuracyPValue  McnemarPValue
##      0.0000000            NaN
- As you can see, the model is over 86% accurate in predicting an individual's outcome prior to enrollment, without collecting too much data. For comparison, always predicting the most common outcome would be about 67% accurate (AccuracyNull).
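To make the numbers in `conf.mtx$overall` easier to interpret, accuracy, the no-information rate (reported as AccuracyNull), and Cohen's kappa can be computed by hand from a confusion matrix. This is a minimal sketch on ten invented predictions, not the model's actual output.

```r
# Hypothetical predictions and true labels for a 3-class outcome.
truth <- factor(c(3, 3, 3, 3, 3, 3, 2, 2, 1, 1), levels = 1:3)
pred  <- factor(c(3, 3, 3, 3, 3, 2, 2, 3, 1, 3), levels = 1:3)

# Rows are predictions, columns are actual classes.
cm <- table(Predicted = pred, Actual = truth)
n <- sum(cm)

# Accuracy: share of cases on the diagonal.
accuracy <- sum(diag(cm)) / n

# No-information rate: accuracy of always predicting the most common actual class.
nir <- max(colSums(cm)) / n

# Cohen's kappa: agreement beyond what the row and column totals give by chance.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(cm)
print(c(accuracy = accuracy, nir = nir, kappa = cohen_kappa))
```

A model is only useful if its accuracy clearly exceeds the no-information rate, which is why the gap between Accuracy and AccuracyNull matters more than the accuracy figure alone.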
Improvement TODO:
- Explore more features and perform a proper feature engineering process.
- Construct a deep neural network with TensorFlow to build a more complex predictive model.
- Use Spark distributed computing to find the best hyperparameters for the deep neural network.
Challenges of this project
- Cleaning multiple small, extremely dirty datasets and merging them into one dataset of decent size was one of the challenges.