About this project

Making decision on location for buying home is one of the most important decision for lot of people in their lifetime. Lot of effort and time is spent to do research on various locations on different location attributes. This project aims at building a reusable framework that can be used for home buying decision making process. Following are the key objectives of this project- 1. Build a reusable framework to do comparative data analysis among various locations around a selected location 2. Build a scoring model to factors various city livability parameters and come up with livability score for locations 3. Build decision metrics considering price and livability score to come up with comparative decision index for cities

Evaluation framework used in this project

$1. For the purpose of the project Cleveland was selected as a location around which best other cities to be compared. Below site was used to collect data on top cities around Cleveland by doing web scrapping. %a. http://www.bestplaces.net/people/city/ohio/strongsville $2. Web scrapping was done to extract city livability parameters. Following parameters were used %a. City population growth rate, population density, diversity, young population %, mass commute provision, avg commute time, crime rate, investment in education. $3. Build overall scoring model with the above mention factors and come up with a combined scoring model. $4. Source home price information by city $a. Zillow data will be used here $5. Analyze home price by city and combined score in point 4 $6. Analyze the top factors that drives home prices of a city

  1. Load the required packages to be used for analysis
suppressWarnings(library(dplyr))
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
suppressWarnings(library(stringr))
suppressWarnings(library(XML))
suppressWarnings(library(RCurl))
## Loading required package: bitops
suppressWarnings(library(ggplot2))
suppressWarnings(library(reshape2))
suppressWarnings(library(varhandle))
suppressWarnings(library(tidyr))
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
## 
##     complete
  1. Selecting a City in this case, Cleveland and identifying other top cities near that place using bestplacens.net.Extract best cities near cleveland and bestplaces.net livability score of the city
# Following url was used to identify the top cities near cleveland

best_place <- 'http://www.areavibes.com/cleveland-oh/best-places-to-live/'


# Extract list of cities and score

webpage<-getURL(best_place)
  
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
  
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
  
text1<-unlist(xpathSApply(pagetree,"/html/body/div/div/div/a",xmlValue))

best_cities<-data.frame()

name<-c(unlist(gsub(",","",(str_sub(str_extract(text1,"[[:alnum:] ]{1,},"),start=3)))))

score<-c(unlist(str_sub(str_extract(text1,"[[:alnum:] ]{1,},"),start=1 , end=2)))

best_cities<-data.frame(name=c(name),score=c(score))

best_cities<-best_cities[1:10,]

best_cities$name<-tolower(best_cities$name)

best_cities$name<-gsub(" ","%20",best_cities$name)

# name of best cities and related bestplaces.net score

best_cities
##               name score
## 1        cleveland    72
## 2            solon    83
## 3           orange    83
## 4    rocky%20river    82
## 5    bay%20village    82
## 6      brecksville    82
## 7         mayfield    82
## 8            berea    81
## 9  olmsted%20falls    81
## 10    strongsville    81
  1. Change in city population is one good indicator of growth of a city. This parameter was selected to analyze how city population has changed between 2010-2015.Extract city population data and calculate % change in city population from 2010-2015
df_population<-data.frame(PEOPLE=c(1,2,3,4,5,6,7,8))

for (i in 1:10)
{

# Extract information from below website 

webpage_popu<-getURL(gsub(" ","",paste('http://www.bestplaces.net/people/city/ohio/',best_cities$name[i])))
  
webpage_det <- readLines(tc <- textConnection(webpage_popu)); close(tc)
  
pagetree_popu <- htmlTreeParse(webpage_det, error=function(...){}, useInternalNodes = TRUE)
  
city_popu<-unlist(xpathSApply(pagetree_popu,"//td",xmlValue))


# Creating data frame on city population

df_population1<-matrix(city_popu[5:28],ncol=8, nrow=3)

df_population1<-as.data.frame(df_population1)

df_population1<-as.data.frame(t(df_population1))

df_population1<-df_population1[,-3]

df_population1<-unfactor(df_population1)

colnames(df_population1)<-df_population1[1,]

df_population<-cbind(df_population,df_population1)

}

df_population<-unfactor(df_population)

df_population<-df_population[,-c(1,4,6,8,10,12,14,16,18,20,22)]

df_population_format<-df_population[c(2,3),]

df_population_format<-t(df_population_format)

df_population_format<-data.frame(df_population_format)

colnames(df_population_format)<-c("current_population","population_2010")

df_population_format<-df_population_format[-1,]

df_population_format$current_population<-as.numeric(gsub(",","",(df_population_format$current_population)))

df_population_format$population_2010<-as.numeric(gsub(",","",(df_population_format$population_2010)))

df_population_format$city<-rownames(df_population_format)


# Calculate % chnage in population 
df_population_format$Popu_change<-as.numeric(format(round(((df_population_format[,1]-df_population_format[,2])*100/df_population_format[,2]),2),nsmall = 2))

# Plot data in a chart--> 1

ggplot(data=df_population_format, aes(x=city, y=Popu_change,fill=city)) +geom_bar(colour="black", stat="identity")+ ggtitle("City population %change 2000-2010")+  geom_text(aes(label = Popu_change)) + theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Stacking not well defined when ymin != 0

# Rank cities by change in population

df_population_format<-df_population_format %>% mutate(cityrank_population_change = rank(Popu_change, ties.method ="first"))

df_rank_consolidate<-df_population_format[,4:5]

df_rank_consolidate<-cbind(best_cities$name,df_rank_consolidate)

colnames(df_rank_consolidate)<-c("name","popu change","cityrank_population_change")
  1. Population density of a city is a good indicator of livability factor. High density implies good opportunities but expensive cost of living. Also low density areas are often considers good for quality of life. Extract data from web for city population density
df_population_density<-df_population[8,]

df_population_density<-t(df_population_density)

df_population_density<-as.data.frame(df_population_density)

df_population_density$city<-rownames(df_population_density)

colnames(df_population_density)<-c("density", "city")

df_population_density<-df_population_density[-1,]

df_population_density$density<-as.numeric(gsub(",","",df_population_density$density))

# plot city density data chart-->2

ggplot(data=df_population_density, aes(x=city, y=density,fill=city)) +geom_bar(colour="black", stat="identity")+ ggtitle("City population by density")+  geom_text(aes(label = density)) + theme(axis.text.x = element_text(angle = 90, hjust = 1))

# rank city with population density
df_population_density<-df_population_density %>% mutate(cityrank_density = rank(-density, ties.method ="first"))

# consolidate ranking with previous city parameters

df_rank_consolidate<-cbind(df_rank_consolidate,df_population_density[,3])
  1. Population diversity is considered to be one of the key criteria for livability index. Analysis of population diversity by extracting data
df_diversity_combine<-data.frame(popultion=c(),city=c(),USA=c(),city_name=c())

for (i in 1:10)
{

# Extract data from webpage

webpage_div<-getURL(gsub(" ","",paste('http://www.bestplaces.net/people/city/ohio/',best_cities$name[i])))
  
webpage_div <- readLines(tc <- textConnection(webpage_div)); close(tc)
  
pagetree_div <- htmlTreeParse(webpage_div, error=function(...){}, useInternalNodes = TRUE)
  
city_div<-unlist(xpathSApply(pagetree_div,"//td",xmlValue))


# creating table on population

df_diversity<-matrix(city_div[5:109],ncol=35, nrow=3)

df_diversity<-as.data.frame(df_diversity)

df_diversity_h<-df_diversity[1,]

df_diversity<-df_diversity[-c(1,3),]

df_diversity_combine<-rbind(df_diversity_combine,df_diversity)

}

# creating data frame for race diversity

df_diversity_race<-df_diversity_combine[,30:35]

df_diversity_race<-cbind(df_diversity_combine[,1],df_diversity_race)

df_diversity_race<-unfactor(df_diversity_race)
colnames(df_diversity_race)<-c("city","white","black","asian","native_american","hawaiian_pacific_Islander","other")


df_diversity_race$white<-gsub("%","",df_diversity_race$white)

df_diversity_race$black <-gsub("%","",df_diversity_race$black)

df_diversity_race$asian<-gsub("%","",df_diversity_race$asian)

df_diversity_race$native_american<-gsub("%","",df_diversity_race$native_american)

df_diversity_race$hawaiian_pacific_Islander<-gsub("%","",df_diversity_race$hawaiian_pacific_Islander)

df_diversity_race$other<-gsub("%","",df_diversity_race$other)


df_diversity_format<-df_diversity_race %>% gather(race,percentage,white:other)

df_diversity_format$percentage<-as.numeric(df_diversity_format$percentage)

# Plot city diversity data chart -->3

ggplot(data=df_diversity_format, aes(x=city, y=percentage, fill=race)) +
    geom_bar(stat="identity") +ggtitle("City diversity by race") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + coord_flip()

# Rank the city with diversity by using % of data for white population as reference

df_diversity_rank<-df_diversity_format[df_diversity_format$race=="white",]

df_diversity_rank<-df_diversity_rank %>% mutate(cityrank_divercity = rank(-percentage, ties.method ="first"))

# consolidate ranking with previous city parameters

df_rank_consolidate<-cbind(df_rank_consolidate,df_diversity_rank[,4])
  1. Young population % is an important criteria for future growth. Extract city data by age group and rank the cities based on youth population between group 25-34.
df_diversity_age<-df_diversity_combine[,15:28]

df_diversity_age<-cbind(df_diversity_combine[,1],df_diversity_age)

colnames(df_diversity_age)<-c("city","0_4","5_9","10_14","15_17","18_20","21_24","25_34","35_44","45_54","55_59","60_64","65_74","75_84","85_over")


df_diversity_age$`0_4`<-gsub("%","",df_diversity_age$`0_4`)

df_diversity_age$`5_9`<-gsub("%","",df_diversity_age$`5_9`)

df_diversity_age$`10_14`<-gsub("%","",df_diversity_age$`10_14`)

df_diversity_age$`15_17`<-gsub("%","",df_diversity_age$`15_17`)

df_diversity_age$`18_20`<-gsub("%","",df_diversity_age$`18_20`)

df_diversity_age$`21_24`<-gsub("%","",df_diversity_age$`21_24`)

df_diversity_age$`25_34`<-gsub("%","",df_diversity_age$`25_34`)

df_diversity_age$`35_44`<-gsub("%","",df_diversity_age$`35_44`)

df_diversity_age$`45_54`<-gsub("%","",df_diversity_age$`45_54`)

df_diversity_age$`55_59`<-gsub("%","",df_diversity_age$`55_59`)

df_diversity_age$`60_64`<-gsub("%","",df_diversity_age$`60_64`)

df_diversity_age$`65_74`<-gsub("%","",df_diversity_age$`65_74`)

df_diversity_age$`75_84`<-gsub("%","",df_diversity_age$`75_84`)

df_diversity_age$`85_over`<-gsub("%","",df_diversity_age$`85_over`)


# Plot city data by age group chart -->4

df_diversity_age_format<-df_diversity_age %>% gather(age_grp,percentage, `0_4`:`85_over`)

df_diversity_age_format$percentage<-as.numeric(df_diversity_age_format$percentage)

ggplot(data=df_diversity_age_format, aes(x=city, y=percentage, fill=age_grp)) +
    geom_bar(stat="identity") +ggtitle("City population by age") + theme(axis.text.x = element_text(angle = 90, hjust = 1))+ coord_flip()

# Ranking the city on youth population %

df_diversity_rank_age<-df_diversity_age_format[df_diversity_age_format$age_grp=="25_34",]

df_diversity_rank_age<-df_diversity_rank_age %>% mutate(cityrank_youthfulness = rank(percentage, ties.method ="first"))


# Consolidate ranking with previous city parameters

df_rank_consolidate<-cbind(df_rank_consolidate,df_diversity_rank_age[,4])
  1. Transportation is an important aspects of city living. Average commute time determines how much time daily officer goers will spend on road.
df_mass_com<-data.frame(TRANSPORTATION=c(),Commute_Time=c())

for (i in 1:10)
{

# Extract data from the webpage

webpage_trans<-getURL(gsub(" ","",paste('http://www.bestplaces.net/transportation/city/ohio/',best_cities$name[i])))
  
webpage_trans <- readLines(tc <- textConnection(webpage_trans)); close(tc)
  
pagetree_trans <- htmlTreeParse(webpage_trans, error=function(...){}, useInternalNodes = TRUE)
  
city_trans<-unlist(xpathSApply(pagetree_trans,"//td",xmlValue))


df_trans<-matrix(city_trans[5:10],ncol=2, nrow=3)

df_trans<-as.data.frame(df_trans)


colnames(df_trans)<-c("TRANSPORTATION","Commute_Time")

df_trans<-df_trans[-c(1,3),]

df_mass_com<-rbind(df_mass_com,df_trans)
}


df_mass_com$TRANSPORTATION<-unfactor(df_mass_com$TRANSPORTATION)

df_mass_com$Commute_Time<-unfactor(df_mass_com$Commute_Time)

# plot commute time for different cities (same bar diagram as below) chart-->5

ggplot(data=df_mass_com, aes(x=TRANSPORTATION, y=Commute_Time,fill=TRANSPORTATION)) +geom_bar(colour="black", stat="identity")+ ggtitle("City Avg. Commute Time")+  geom_text(aes(label = Commute_Time)) + theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Rank cities by avg commute time

df_mass_com<-df_mass_com %>% mutate(cityrank_commute = rank(-Commute_Time, ties.method ="first"))



# Consolidate ranking with previous city parameters

df_rank_consolidate<-cbind(df_rank_consolidate,df_mass_com[,3])
  1. Availability of mass transit add convenience to city life and less dependency on person car usage.
df_trans_combine<-data.frame(TRANSPORTATION=c(),Commute_Time=c(),COMMUTE_MODE=c(),Auto_alone=c(),Carpool=c(),Mass_Transit=c(),Bicycle=c(),Walk=c(),Work_at_Home=c())


for (i in 1:10)
{
# Extract data fro the webpage

webpage_trans_mass<-getURL(gsub(" ","",paste('http://www.bestplaces.net/transportation/city/ohio/',best_cities$name[i])))
  
webpage_trans_mass <- readLines(tc <- textConnection(webpage_trans_mass)); close(tc)
  
pagetree_trans_mass <- htmlTreeParse(webpage_trans_mass, error=function(...){}, useInternalNodes = TRUE)
  
city_trans_mass<-unlist(xpathSApply(pagetree_trans_mass,"//td",xmlValue))


df_trans_mass<-matrix(city_trans_mass[5:31],ncol=9, nrow=3)

df_trans_mass<-as.data.frame(df_trans_mass)

colnames(df_trans_mass)<-c("TRANSPORTATION","Commute_Time","COMMUTE_MODE","Auto_alone", "Carpool","Mass_Transit","Bicycle","Walk","Work_at_Home")

df_trans_mass<-df_trans_mass[-c(1,3),]

df_trans_combine<-rbind(df_trans_combine,df_trans_mass)
}

df_trans_combine<-df_trans_combine[,-3]

df_trans_combine<-unfactor(df_trans_combine)

df_trans_combine$Auto_alone<-gsub("%","",df_trans_combine$Auto_alone)

df_trans_combine$Carpool<-gsub("%","",df_trans_combine$Carpool)

df_trans_combine$Mass_Transit<-gsub("%","",df_trans_combine$Mass_Transit)

df_trans_combine$Bicycle<-gsub("%","",df_trans_combine$Bicycle)

df_trans_combine$Walk<-gsub("%","",df_trans_combine$Walk)

df_trans_combine$Work_at_Home<-gsub("%","",df_trans_combine$Work_at_Home)

df_trans_combine$Mass_Transit<-as.numeric(df_trans_combine$Mass_Transit)


# plot city mass transport data points chart -->6


ggplot(data=df_trans_combine, aes(x=TRANSPORTATION, y=Mass_Transit,fill=TRANSPORTATION)) +geom_bar(colour="black", stat="identity")+ ggtitle("City Avg. mass transportaion %")+  geom_text(aes(label = Mass_Transit)) + theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Rank the city based on mass transit

df_trans_combine<-df_trans_combine %>% mutate(cityrank_mass = rank(Mass_Transit, ties.method ="first"))
  
 
# consolidate ranking with previous city parameters

df_rank_consolidate<-cbind(df_rank_consolidate,df_trans_combine[,9])
  1. Crime rate is one important aspects of law and order situation in a city. Lower crime rate makes a city more secured compared to other places.
df_crime_combine<-data.frame(city=c(),violent_crime=c(),property_crime=c())  
  
  for (i in 1:10)
{

# Extracting data from the webpage

webpage_crime<-getURL(gsub(" ","",paste('http://www.bestplaces.net/crime/city/ohio/',best_cities$name[i])))
  
webpage_crime <- readLines(tc <- textConnection(webpage_crime)); close(tc)
  
pagetree_crime <- htmlTreeParse(webpage_crime, error=function(...){}, useInternalNodes = TRUE)
  
city_crime<-unlist(xpathSApply(pagetree_crime,"//td",xmlValue))

df_crime<-matrix(city_crime[5:13],ncol=3, nrow=3)

df_crime<-as.data.frame(df_crime)

df_crime<-df_crime[-c(1,3),]

colnames(df_crime)<-c("city","violent_crime","property_crime")

df_trans_mass<-df_trans_mass[-c(1,3),]

df_crime_combine<-rbind(df_crime_combine,df_crime)
}
  
df_crime_combine<-unfactor(df_crime_combine)

df_crime_format<-df_crime_combine %>% gather(crime_type,occurance,violent_crime:property_crime)

df_crime_format$occurance<-as.numeric(df_crime_format$occurance)

# Plot city crime rate data chart --> 7

ggplot(data=df_crime_format, aes(x=city, y=occurance, fill=crime_type)) +
    geom_bar(stat="identity") +ggtitle("City crime by types") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Rank the data by crime rate

df_crime_city_rank<-df_crime_format[df_crime_format$crime_type=="violent_crime",]
df_crime_city_rank<-df_crime_city_rank%>% mutate(cityrank_crime = rank(-occurance, ties.method ="first"))


# Consolidate ranking with previous city parameters

df_rank_consolidate<-cbind(df_rank_consolidate,df_crime_city_rank[,4])
  1. Analysis of education data and investment city makes on education. For this study student per teacher ratio has been consider as the evaluation parameter
df_edu_combine<-data.frame() 
  
  for (i in 1:10)
{

# Extract data from the webpage

webpage_edu<-getURL(gsub(" ","",paste('http://www.bestplaces.net/education/city/ohio/',best_cities$name[i])))
  
webpage_edu <- readLines(tc <- textConnection(webpage_edu)); close(tc)
  
pagetree_edu <- htmlTreeParse(webpage_edu, error=function(...){}, useInternalNodes = TRUE)
  
city_edu<-unlist(xpathSApply(pagetree_edu,"//td",xmlValue))


df_edu<-matrix(city_edu[5:52],ncol=16, nrow=3)

df_edu<-as.data.frame(df_edu)

colnames(df_edu)<-unfactor(df_edu[1,])

df_edu<-df_edu[-c(1,3),]

df_edu_combine<-rbind(df_edu_combine,df_edu)
}

df_edu_stu<-df_edu_combine[,c(1,2,5)]

colnames(df_edu_stu)<-c("city","expenses","stu_teacher_ratio")

df_edu_stu$stu_teacher_ratio<-as.numeric(unfactor(df_edu_stu$stu_teacher_ratio))

# plot students per teacher ratio chart -->8

ggplot(data=df_edu_stu, aes(x=city, y=stu_teacher_ratio,fill=city)) +
    geom_bar(stat="identity") +ggtitle("City student to teacher ratio") +geom_text(aes(label = stu_teacher_ratio)) + theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Rank the city on this parameter

df_edu_stu<-df_edu_stu %>% mutate(cityrank_edu = rank(-stu_teacher_ratio, ties.method ="f"))

# Consolidate ranking with previous city parameters

df_rank_consolidate<-cbind(df_rank_consolidate,df_edu_stu[,4])

Overall City Score

Overall City score= sum(total rank score of following parameters(population change,population density,race diversity,% of youth population,avg commute time,mass transit %,crime rate, investment in education))*1.25

city score has been multiplied by 1.25 to make it adjusted in scale of 100

  1. City overall score has been calculated based on simple addition of all the different ranking scores of a city based on above 8 parameters which is also called Overall Score. This score has also been weighted by 1.25(as total score is 8 parameter *10=80) to elevate that to a 100 scoring scale at per with the bestplcaes.net score.
df_rank_consolidate<-df_rank_consolidate[,-2]

colnames(df_rank_consolidate)<-c("City","Population_change","Population_density","Race_diversity","Young_Population","Mass_Commute","Avg_Commute","Crime_rate","Student_teach_ratio")

df_rank_consolidate$Overall_score<-(df_rank_consolidate[,2]+df_rank_consolidate[,3]+df_rank_consolidate[,4]+df_rank_consolidate[,5]+df_rank_consolidate[,6]+df_rank_consolidate[,7]+df_rank_consolidate[,8]+df_rank_consolidate[,9])*1.25

df_rank_consolidate_format<-df_rank_consolidate %>% gather(Parameter,score, Population_change:Student_teach_ratio)

# Plot overall city score  chart -->9

ggplot(data=df_rank_consolidate_format, aes(x=City, y=score,fill=Parameter)) +
    geom_bar(stat="identity") +ggtitle(" Overall City Score on Liveablity Index") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +coord_flip() 

City price data from zillow.com

  1. Price of houses are most of the time related to the livability factor of a city. In this study house price data has been extracted from Zillow and then the data has been analyzed along with data on other parameters. For this student median house per sq ft price has been considers.
# load data from zillow
  
city_homeprice<-read.csv("C:/Users/Arindam/Documents/Data Science/Cuny/Data 607/Assignments/Project/City_MedianValuePerSqft_AllHomes.csv")  

# Filter data by state OH

city_homeprice_OH<-city_homeprice[city_homeprice$State=="OH",]

best_cities$name<-gsub("%20"," ",best_cities$name)

# Get home price date for the 10 respective cities

city_id<-city_homeprice_OH$RegionID[tolower(city_homeprice_OH$RegionName)%in% best_cities$name]

home_price<-city_homeprice_OH[city_homeprice_OH$RegionID %in% city_id,]

home_price<-home_price[,c(2,245)]

# Analysis of score and home price rate


df_rank_consolidate$City<-gsub(", Ohio","",df_rank_consolidate$City)

df_rank_consolidate$City<-gsub("%20"," ",df_rank_consolidate$City)

df_rank_consolidate$price_per_SF<-home_price$X2016.02[df_rank_consolidate$City %in% tolower(home_price$RegionName)]

City score with respect to price data

city home price per sq ft/overall city score & city home price per sq ft/bestplaces.net score

  1. Compare the city home price by Overall score and also livability score of bestplaces.net. Then compare the outcomes by cities. These two values will not match as scoring process is completely different. But it would be interesting to see if the outcome is same from these two scores. Final metric has been defined with respect to, price per Overall score or Price per Bestplaces.net score. Lower this ratios are better a city is in terms of buying home or investment purpose as that provide higher living with less price.
# Comparing data against the Liveability score from bestplaces.net


df_rank_consolidate$City<-gsub(", Ohio","",df_rank_consolidate$City)

df_rank_consolidate$Best_net_score<-best_cities$score[tolower(df_rank_consolidate$City) %in% best_cities$name]

df_rank_consolidate$Price_by_Best_net_score<-as.numeric((df_rank_consolidate$price_per_SF)/as.numeric(unfactor(df_rank_consolidate$Best_net_score)))
  
# comparison of home price with respect to Bestplaces.net score  chart -->10

ggplot(data=df_rank_consolidate, aes(x=City, y=Price_by_Best_net_score,fill=City)) +
    geom_bar(stat="identity")+ggtitle("City per SQ FT Home Price by Bestplaces.net score") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

df_rank_consolidate$Price_by_score_index<-(df_rank_consolidate$price_per_SF/df_rank_consolidate$Overall_score)

df_rank_consolidate$Price_by_score_index<-format(round(df_rank_consolidate$Price_by_score_index,2),nsmall=2)

#comparison of home price with respect to Overall city score  chart -->11

ggplot(data=df_rank_consolidate, aes(x=City, y=Price_by_score_index,fill=City)) +
    geom_bar(stat="identity") +ggtitle("City per SQ FT Home Price by overall score") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

df_rank_consolidate$Price_by_Best_net_score<-format(round(df_rank_consolidate$Price_by_Best_net_score,2),nsmall=2)

df_rank_consolidate_com<-df_rank_consolidate[,c(1,13,14)]


df_rank_consolidate_com<-df_rank_consolidate_com %>% gather(score_type,score,Price_by_score_index:Price_by_Best_net_score)

df_rank_consolidate_com$score<-as.numeric(df_rank_consolidate_com$score)

# comparative chart to compare price per City score vs bestplaces.net score chart -->12

ggplot(data=df_rank_consolidate_com, aes(x=City, y=score,fill=score_type)) + geom_bar(stat="identity",position=position_dodge())+ ggtitle("Comparison of City per SQ FT Home Price by derived score index vs Bestplaces.net score") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

City home price per sq ft comparison by each livability parameters

  1. Compare price of the home per sq ft by each parameter ranking score
df_rank_consolidate$price_popu_change<-as.numeric(format(round((df_rank_consolidate$price_per_SF/df_rank_consolidate$Population_change),2),nsmall=2))

df_rank_consolidate$price_popu_den<-as.numeric(format(round((df_rank_consolidate$price_per_SF/df_rank_consolidate$Population_density),2),nsmall=2))


df_rank_consolidate$price_race_div<-as.numeric(format(round((df_rank_consolidate$price_per_SF/df_rank_consolidate$Race_diversity),2),nsmall=2))

df_rank_consolidate$price_youth_per<-as.numeric(format(round((df_rank_consolidate$price_per_SF/df_rank_consolidate$Young_Population),2),nsmall=2))

df_rank_consolidate$price_youth_per<-as.numeric(format(round((df_rank_consolidate$price_per_SF/df_rank_consolidate$Young_),2),nsmall=2))

df_rank_consolidate$price_mass_commute<-as.numeric(format(round((df_rank_consolidate$price_per_SF/df_rank_consolidate$Mass_Commute),2),nsmall=2))


df_rank_consolidate$price_avg_commute<-as.numeric(format(round((df_rank_consolidate$price_per_SF/df_rank_consolidate$Avg_Commute),2),nsmall=2))

df_rank_consolidate$price_crimerate_per<-as.numeric(format(round((df_rank_consolidate$price_per_SF/df_rank_consolidate$Crime_rate),2),nsmall=2))

df_rank_consolidate$price_stu_ratio<-as.numeric(format(round((df_rank_consolidate$price_per_SF/df_rank_consolidate$Student_teach_ratio),2),nsmall=2))


# create data frame

city_price_paramter<-df_rank_consolidate[,c(1,15,16,17,18)]

df_city_price_com<-city_price_paramter %>% gather(paramter,Price_per_sqt_per_score,price_popu_change:price_youth_per)


city_price_paramter_add<-df_rank_consolidate[,c(1,19,20,21,22)]

df_city_price_com_add<-city_price_paramter_add %>% gather(paramter,Price_per_sqt_per_score,price_mass_commute:price_stu_ratio)


# Plot city home price per sq ft by first four parameters data in chart -->13

ggplot(df_city_price_com, aes(x=City,y=Price_per_sqt_per_score)) +
  geom_line(aes(group = paramter, colour = paramter)) + theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Plot city home price per sq ft by rest four parameters in chart -->14

ggplot(df_city_price_com_add, aes(x=City,y=Price_per_sqt_per_score,fill=paramter)) +
  geom_bar(stat="identity",position=position_dodge()) + theme(axis.text.x = element_text(angle = 90, hjust = 1))     

Conclusion

(1)Evaluation using overall score-

chart 9 , has the overall sore per city broken down by individual parameter score. Looking at the chart it appears Cleveland has the best overall score but has low score on crime rate. Storytelling has good overall score but has less score on educational investment and mass commute. This overall score will help to understand the livability factors of a city.

(2)Evaluation using city home price by overall score and bestcity.net index

chart 10,11- These two charts helps to understand how much is the median city home price per sq ft with respect to livability score. Lower the number on this better off home buyers to make good investment on homes.These two charts will help to understand if the price is high or low with respect to city livability conditions. Chart 12 has comparative bars of the price per score metrics from two different scores, one from bestplaces.net score and one derived in this study. That comparison will indicate how these two metrics are different. From this chart it appears that derive scores and the given scores are direction ally the same.

(3)Evaluation of city home price by each livability index

Chart 13,14- provides analysis of price data based on individual liability parameter. This chart will also help to understand the key factors that drive the home price.If the value is low that means that factor is not a dominant factor is deriving the price. If that factor is high that means it has major influence in determining the home price in that city.

References

** Following website has been used extensively to extract city data for this study http://www.bestplaces.net/

** Following website has been used to get average median home price per sq ft by cities www.zillow.com/research/data