Visualization

URL of the visualization page: http://mkds.github.io/visual/places-to-live/

Background

About 40,000,000 Americans move every year, according to the Census. About 60% of those Americans will stay within the same county, another 20% will stay within the same state, and the remainder will move to another state. We have an opportunity to provide these Americans with information that can help them make a more sustainable choice about where they want to live

The smart location data base provides over 90 variables characterizing the built environment, transit service, destination accessibility, employment, and demographics at the census block group scale. The county health database provide healthcare details and some environment attributes.

Using smart location database and County health ranking data provide visualization that helps the user to analyze places in terms of employment, transit, demography, environment, health care etc. Generate visualization for specific themes (for example where jobs are available? Which places are good for retirement?) This project allows user to interact with the visualization and look for places that suits the user’s interests based on the criteria they provided and the importance of those criteria.

Purpose

While there are many rankings and top lists, very few provide user to customize the factors used in the ranking and find the ranking the user is interested in. This visualization allows user to select the factors they interested in, set the importance level for the factor and visualize the map to see how the counties score (on scale of 100) for the factors user selected. The visualization also provides two pre-defined interests (employment and retirement) for quick visualization.

National Map, Factors and Importance

The visualization provides 30 factors and allow user to pick the factors that are important to them. The user also provided with flexibility of setting the importance of the factors they selected. This sort of visualization allows to visualize the change in scores as the factor importance changed giving an indication of which counties score better for given factor and how their search would move toward particular region of a country as they change the factors and importance levels for the factors. User can hover over a county to see the score out hundred for each of the factors they selected and the overall score.

The top counties based on the factors and importance selected by user is displayed as well.

Maps and Data handling using D3

The national choropleth map is drawn using D3. The scaled total score of the counties for the importance level selected by user is used for choropleth. D3 is used to bind the importance value for the sliders defined in right column and to handle factor importance.

State Level Map

The state level map provides comparison of all the counties within a state for selected factors. User can again pick the factors of their interest and importance level. Google charts is used to draw the stacked bar graphs which provide visualization on how counties fare in each factors selected.

Data table

The data table provides the data for all counties for selected factors in a sortable table for exploration. Google data table is used to display the data in a sortable table.

Style and Behaviour

The visualization uses CSS and bootstrap framework for styling the document. JQuery, D3 and Javascript is used for interactivity and visualization.

Data - Obtain, clean and Prepare Data

Getting Data - Smart Location Data

The Smart location database has over 90 variable at block level (each county is divided into one or more blocks). We need extract variables of interest (variable related to employment, transport, housing, environment etc) from this dataset. As the block level is lower level than county the data needs to be summarized at county level.

library(foreign)
smart_living_data_all = read.dbf("SmartLocationDb.dbf", as.is = T)
#List of varaibles of interes
req_variables=c("SFIPS","CFIPS","AC_WATER","AC_UNPR","TOTPOP10","P_WRKAGE","EMPTOT",
"E8_RET10","E8_OFF10","E8_IND10","E8_ED10","E8_SVC10","E8_HLTH10","E_HIWAGEWK",
"E_LOWWAGEW","E_MEDWAGEW","COUNTHU10","D4d","D3a")
#Extract the varaible of interst
smart_living_data=smart_living_data_all[req_variables]

Data Clean up - Smart Location Data

#Varaiable names
var_names = c("St_FIPS", "County_FIPS", "Water_Area", "UnProtected_Area", "Population", "WorkAge_pct", "Total_Employment",
"Reatil_jobs", "Office_jobs", "Industrial_jobs", "Education_jobs", "Service_jobs", 
"HealthCare_jobs", "Wage_more_than3333", "Wage_med", "Wage_low", "Households","Transit_freq",
"Road_density")
#Rename the columns to easy to read names
colnames(smart_living_data) = var_names
#Value of -99999 denotes missing data in data set
smart_living_data[smart_living_data==-99999] = NA

smart_living_data$UnProtected_Area[smart_living_data$UnProtected_Area==0] = 
  smart_living_data_all$AC_TOT[smart_living_data$UnProtected_Area==0]

#The P_WRKAGE, D4d and D3a values are percent or densitiy values so covert them to normal value
smart_living_data$Workage_pop = smart_living_data$WorkAge_pct * smart_living_data$Population
smart_living_data$Transit = smart_living_data$Transit_freq * smart_living_data$UnProtected_Area
smart_living_data$Road = smart_living_data$Road_density * smart_living_data$UnProtected_Area
#Add low and medium wage to compute wage less than 3333
smart_living_data$Wage_less_than3333 = smart_living_data$Wage_med + smart_living_data$Wage_low

#Remove the columnsthat are no longer required
smart_living_data$Wage_med = NULL
smart_living_data$Wage_low = NULL

Summarize Data at county Level

#The smart living database has the values in block levels so aggregate value to county level
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

county_sld = smart_living_data %>% 
                group_by(St_FIPS,County_FIPS) %>%
                summarise_each(funs(sum(.,na.rm=T)))

county_sld$WorkAge_pct = county_sld$Workage_pop / county_sld$Population
county_sld$Transit_freq = county_sld$Transit / county_sld$UnProtected_Area
county_sld$Road_density = county_sld$Road / county_sld$UnProtected_Area
county_sld$FIPS = paste0(county_sld$St_FIPS,county_sld$County_FIPS)
#Remove below columns as we captured normalized values for these
county_sld$St_FIPS = NULL
county_sld$County_FIPS = NULL
county_sld$Transit = NULL
county_sld$Road = NULL
county_sld$Workage_pop = NULL

Getting Data - Health Record

health_record_all = read.csv("2016CHR_CSV_Analytic_Data_v2.csv",stringsAsFactors = F)
req_health_vars = c("STATECODE", "COUNTYCODE", "County", "State", "Poor.or.fair.health.Value", 
"Access.to.exercise.opportunities.Value", "Health.care.costs.Value", 
"Primary.care.physicians.Value", "Social.associations.Value", "Violent.crime.Value", 
"Air.pollution...particulate.matter.Value", "Severe.housing.problems.Value", "Limited.access.to.healthy.foods.Value", 
"Percent.of.population.aged.65.years.and.older",  "Median.household.income.Value", 
"Population.living.in.a.rural.area.Value", "Unemployment.Value")
health_record = health_record_all[req_health_vars]
#Rename the varaibles to easy to read values
health_var_names = c("St_FIPS", "County_FIPS", "County", "State", "Poor_Health", "Exercise_Facility", "Health_Cost",
"Primary_care", "Associations", "Violent_crime", "Air_Quality", "Housing_problem", "limited_healthy_food", "Age65_plus", "Household_income", "Rural_pct", "Unemployment")


#Convert numeric stored as text to text


colnames(health_record) = health_var_names
health_record$Health_Cost = as.numeric(gsub(",","",health_record$Health_Cost))
health_record$Household_income = as.numeric(gsub(",","",health_record$Household_income))
health_record$Violent_crime = as.numeric(health_record$Violent_crime)

## Warning: NAs introduced by coercion

#Compute Average by state
State_health = health_record[,c(1,5:17)] %>% 
                    group_by(St_FIPS) %>%
                    summarise_each(funs(mean(.,na.rm=T)))

#Air quality parameter are missing for all counties of AK and HI
#So , let's manually set the Air quality values
State_health$Air_Quality[c(2,12)] = c(7.5,9.7)

#Function to get state mean value
get_state_avg = function(state_fips,col) return(State_health[State_health$St_FIPS==state_fips,col])
get_state_avg = Vectorize(get_state_avg)

#Replace NA with state average
for (i in c(5,6,7,8,10,11,14,15,17)) {
  avg = get_state_avg(health_record[is.na(health_record[,i]),1],i-3) 
  health_record[is.na(health_record[,i]),i] = as.data.frame(avg)[,1]

  
}

#Format FIPS code as two digit state code + 3 digit county code

health_record$FIPS = paste0(formatC(health_record$St_FIPS,width = 2,flag="0"), 
                            formatC(health_record$County_FIPS,width = 3,flag="0"))
health_record$St_FIPS = NULL
health_record$County_FIPS = NULL

Preparing Data for visualization

In order to make the data suitable for visualization the data needs to be scaled so that the range of a particular factor does not skew the visualization. As we are interested in relative performance of counties, let’s define a function which assigns value 100 for top county and value 50 for a county that would be ranked middle and so on.

calc_score=function(values){
  val_median = median(values)
  val_med_to_max = max(max(values) - val_median,1)
  val_med_to_min = max(val_median - min(values),1)
  
  score = ifelse(values>val_median, 50 + (values - val_median)/val_med_to_max*50,
                 50 - (val_median - values)/val_med_to_min*50)
  return(round(score))
}

Consolidate data from two sources

As the project uses data from two sources they need to be merged. The FIPS code from SLD as well as health care data would be used as key to merge the variables from two dataset.

Some of the data need to pre-scaled before we could apply the calc_score function. As county with large population expected to have large value for some of the factors, in order to get true score for a county the variables that has count value needs to be converted to rate per 1000.

places_data = merge(health_record,county_sld)
#Covert value to ratio of population or area in order to account for varying population or area
places_data$Primary_care_per1000 = places_data$Primary_care/places_data$Population * 1000 
places_data$Associations_per1000 = places_data$Associations/places_data$Population * 1000 
places_data$Violent_crime_per1000 = places_data$Violent_crime/places_data$Population * 1000 
places_data$Water_Area_per100Acre = places_data$Water_Area / places_data$UnProtected_Area * 100
places_data$Employment_per1000 = places_data$Total_Employment / places_data$Population * 1000
places_data$Reatil_jobs_per1000 = places_data$Reatil_jobs / places_data$Population * 1000
places_data$Office_jobs_per1000 = places_data$Office_jobs / places_data$Population * 1000
places_data$Industrial_jobs_per1000 = places_data$Industrial_jobs / places_data$Population * 1000
places_data$Education_jobs_per1000 = places_data$Education_jobs / places_data$Population * 1000
places_data$Service_jobs_per1000 = places_data$Service_jobs / places_data$Population * 1000
places_data$HealthCare_jobs_per1000 = places_data$HealthCare_jobs / places_data$Population * 1000
places_data$Wage_more_than3333_per1000 = places_data$Wage_more_than3333 / places_data$Population * 1000
places_data$Households_per100Acre = places_data$Households / places_data$UnProtected_Area * 100
places_data$Wage_less_than3333_per1000 = places_data$Wage_less_than3333 / places_data$Population * 1000


#Drop the variables as we have the ratio value for these 

places_data$Primary_care = NULL
places_data$Associations = NULL
places_data$Violent_crime = NULL
places_data$Water_Area = NULL
places_data$Total_Employment = NULL
places_data$Reatil_jobs = NULL
places_data$Office_jobs = NULL
places_data$Industrial_jobs = NULL
places_data$Education_jobs = NULL
places_data$Service_jobs = NULL
places_data$HealthCare_jobs = NULL
places_data$Wage_more_than3333 = NULL
places_data$Households = NULL
places_data$Wage_less_than3333 = NULL

places_data$Good_Health = 100 - places_data$Poor_Health
places_data$No_Housing_problem = 100 - places_data$Housing_problem
places_data$Availability_healthy_food = 100 - places_data$limited_healthy_food
places_data$Employment_pct = 100 - places_data$Unemployment

places_data$Poor_Health = NULL
places_data$Housing_problem = NULL
places_data$limited_healthy_food = NULL
places_data$Unemployment = NULL

places_scores = places_data
for (i in 4:ncol(places_scores)){
  places_scores[,i] = calc_score(places_scores[,i])
}
places_data$HealthCare_Affordable = 100 - places_scores$Health_Cost
places_data$Air_Quality = 100 - places_scores$Air_Quality
places_data$Saftey = 100 - places_scores$Violent_crime_per1000
places_scores$Saftey = places_data$Saftey
places_scores$Violent_crime_per1000 = NULL
places_data$Violent_crime_per1000 = NULL

places_scores$Healthcare_Affordable = places_data$HealthCare_Affordable
places_scores$Air_Quality = places_data$Air_Quality
places_scores$Health_Cost = NULL
places_data$Health_Cost = NULL

Output files

Two output files are produced for visualization. One file has all data scaled and relatively scored to be used for visualization and another file which provides the actual data that would be used to disaply actual data.

write.csv(places_data,"places_data_std.csv",row.names = F)
write.csv(places_scores,"places_scores.csv",row.names = F)

References

EPA Website: https://www.epa.gov/smartgrowth
Data: https://edg.epa.gov/data/PUBLIC/OP/SLD (Fie Sld.dbf.zip)
User guide for data: https://www.epa.gov/sites/production/files/2014-03/documents/sld_userguide.pdf
Health Care data: http://www.countyhealthrankings.org/rankings/data
D3 : https://bl.ocks.org/mbostock
Google Chart: https://developers.google.com/chart/

Acknowledgement

I like to thank contributors of following open source software.
D3, JQuery, Javascript, Atom and Bootstrap
I like to thank my professor Josh Laurito for the guidance he provided.

Where to Live?