After 12 great years living in New York City, it is time to move to the suburbs. This fall, my wife and I will begin our search for our first home. While we know that we generally want to live in the Hudson Valley region of NY, somewhere along the Hudson Line of the Metro North, we are having trouble contextualizing the differences in the neighborhoods and cities along this stretch.
I will be using data from three sources to generate an interactive visualization that should help us better understand these neighborhoods. First, I will use data from the US Census to learn about the demographics of the region. Second, I will pull metrics describing the housing market itself from Realtor.com. Finally, I will pull listings from the online real estate marketplace Zillow.com. We will use ZIP Code as our regional unit.
Regarding the logistics of my approach: I will outline my data handling, cleaning, and transformation in this markdown file, providing some static exhibits as I explain my findings. I will then present the final visualization in a hosted R Shiny dashboard.
# ZIP Codes of interest:
zip_List <- c("12601","12604","12603","12590","12512","12527","12508","12524","10516","10524","10537","10566","10511","10596","10548","10567","10520","10545","10510","10570","10594","10532","10591","10523","10607","10533","10503","10502","10522","10706","10703","10701","10710","10704","10705","10471","10470","10463","10562")
# Load Packages
library(dplyr)
library(tidyr)
library(ggmap)
library(gmapsdistance)
library(rvest)
library(RSelenium)
library(stringr)
library(leaflet)
library(tigris)
library(ggthemes)
library(cowplot)
library(grid)
library(gridExtra)
First, we will scrape listings from Zillow for each of the ZIP Codes of interest.
# Started webscrape with RVEST
for (i in 1:length(zip_List)){
# Will be pulling in 3 pages worth of data, max ~150 entries per ZIP Code. It is unlikely that more than 150 active listings are on Zillow for these ZIP Codes; will review.
z_url_1 <- paste("https://www.zillow.com/homes/",zip_List[i],"_rb/",sep="")
z_url_2 <- paste("https://www.zillow.com/ny-",zip_List[i],"/houses/2_p/",sep="")
z_url_3 <- paste("https://www.zillow.com/ny-",zip_List[i],"/houses/3_p/",sep="")
z_link_1 <- read_html(z_url_1)
z_link_2 <- read_html(z_url_2)
z_link_3 <- read_html(z_url_3)
df1 <- z_link_1 %>% html_nodes('.list-card-addr , .list-card-price , .list-card-details li') %>% html_text()
df2 <- z_link_2 %>% html_nodes('.list-card-addr , .list-card-price , .list-card-details li') %>% html_text()
df3 <- z_link_3 %>% html_nodes('.list-card-addr , .list-card-price , .list-card-details li') %>% html_text()
df_all <- c(df1, df2, df3) # concatenate; rbind would build a matrix and interleave entries when unlisted later
assign(zip_List[i],df_all)
}
# Initially pulled in data for 17 ZIP Codes before the scrape began returning null values. Possibly running into a CAPTCHA.
# RSelenium
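The loop below assumes an RSelenium session already exists as remDr; a minimal sketch of that setup (the browser choice and port are arbitrary assumptions):
# Start a Selenium-driven browser; remDr is the client used in the loop below
rD <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
remDr <- rD$client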
for (i in 17:length(zip_List)){
# After review, found fewer than 50 entries per ZIP Code. Will be limiting to 1 page.
z_url_1 <- paste("https://www.zillow.com/homes/",zip_List[i],"_rb/",sep="")
remDr$navigate(z_url_1)
Sys.sleep(2)
page <- read_html(remDr$getPageSource()[[1]])
results <- page %>%
html_nodes('.list-card-addr , .list-card-price , .list-card-details li') %>%
html_text()
assign(zip_List[i],results)
}
# Successfully pulled in the remaining ZIP Codes.
# Build/Clean DF
# Reshape each ZIP's scraped text into rows of 6 fields (truncating any trailing partial entry), then stack all ZIPs into one data frame.
House_DF <- data.frame(matrix(unlist(eval(as.name(zip_List[1]))[1:(6*floor(length(eval(as.name(zip_List[1])))/6))]), ncol = 6, byrow=TRUE))
for (i in 2:length(zip_List)){
df1 <- data.frame(matrix(unlist(eval(as.name(zip_List[i]))[1:(6*floor(length(eval(as.name(zip_List[i])))/6))]), ncol = 6, byrow=TRUE))
House_DF <- rbind(House_DF, df1) }
Clean_House_DF <- House_DF %>% drop_na() %>% distinct()
nrow(House_DF)
nrow(Clean_House_DF)
names(Clean_House_DF) <- c("Address","Price","Beds","Baths","sqft","Type")
Clean_House_DF$Address <- as.character(Clean_House_DF$Address)
We obtained six fields from Zillow: price, bedroom count, bathroom count, square footage, listing type (house, condo, multifamily home, etc.), and address. We need two additional data elements: geographical coordinates and distance to the city. We can obtain each of these through packages which leverage a Google Maps API.
Regarding distance from the city: as all of the neighborhoods in which we are interested lie along one train line, these ZIP codes are arguably ordinal. To make analysis easier and to aid in plotting, we will instead treat distance from the city as a continuous variable. Rather than physical distance, I elected to use driving time, which better captures the practical impact of each listing's location. As New York City is large, I picked an accessible and well-known landmark on the northern edge of the city: Yankee Stadium. Driving time is the amount of time it would take to drive from each home so as to arrive at Yankee Stadium at 9 a.m. on Monday, May 24th, 2021.
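Before batching every listing, a single origin can be queried as a sanity check of the API output (a sketch; the coordinate string is a made-up example near Beacon, NY, and the call needs the API key registered in the next chunk):
ex <- gmapsdistance("41.5040+-73.9654", "40.829780585986235+-73.92618523107261", mode = "driving", key = CHRIS_BLOOME_API_KEY)
ex$Time  # travel time in seconds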
#Saving API key as variable "CHRIS_BLOOME_API_KEY" in hidden Chunk.
# Set API keys for two packages
set.api.key(CHRIS_BLOOME_API_KEY)
register_google(key = CHRIS_BLOOME_API_KEY)
# Pull GPS coordinates of each house address via the ggmap package, connected to the Google Maps API
geocode_data <- geocode(unlist(Clean_House_DF$Address), output = "more", source = "google", urlonly = FALSE, nameType = "long")
geocode_data$orgin <- paste(geocode_data$lat,"+",geocode_data$lon, sep="")
# "concat" duplicates the lat/lon key (an assumed reconstruction; it is carried through as an ID for later joins and plots)
geocode_data$concat <- geocode_data$orgin
YankeeStadium <- c("40.829780585986235+-73.92618523107261")
geocode_data_noNA <- geocode_data %>% drop_na()
# Will use Yankee Stadium as the destination for calculating driving distance to the city.
geocode_data_noNA$destination <- YankeeStadium
# API can only handle 100 searches per request, so batch the origins in groups of 100.
# (arr_time 13:00:00 on 2021-05-24 encodes the 9 a.m. EDT arrival described above.)
n <- nrow(geocode_data_noNA)
results_driving_time <- do.call(rbind, lapply(seq(1, n, by = 100), function(s) {
e <- min(s + 99, n)
res <- gmapsdistance(geocode_data_noNA$orgin[s:e], geocode_data_noNA$destination[1], mode="driving", key=get.api.key(), arr_date="2021-05-24", arr_time="13:00:00")
as.data.frame(res[1])
}))
names(results_driving_time) <- c("orgin","driving_time_sec")
House_Full_Geo <- cbind(Clean_House_DF, geocode_data)
# Attach the computed driving times to the geocoded rows (joined on the lat/lon origin string), then merge back onto the listings by address. (This intermediate join is an assumed reconstruction of a step missing from the original chunk.)
geocode_data_full <- left_join(geocode_data_noNA, results_driving_time, by = "orgin")
House_Full_Geo <- left_join(House_Full_Geo, geocode_data_full, by = "address")
House_Final <- House_Full_Geo %>%
select(Price, Beds, Baths, sqft, Type, lon.x, lat.x, address,concat.x,driving_time_sec) %>%
drop_na() %>%
distinct()
names(House_Final) <- c("Price", "Beds", "Baths", "Sqft", "Type", "Lon", "Lat", "Address", "Concat", "Driving_Time_Sec")
# Saving CSV for use in RShiny.
write.csv(House_Final, "House_Final0.csv")
write.csv(House_Full_Geo, "House_Full_Geo0.csv")
We can now clean our data…
# Clean Table
# Clean Bedroom
House_Final$Beds <- str_replace(House_Final$Beds, " bds","") %>%
str_replace(" bd","") %>%
str_replace("Studio","1") %>%
as.numeric()
# Clean Bathroom
House_Final$Baths <- str_replace(House_Final$Baths, " ba","") %>%
as.numeric()
# Clean Sqft
House_Final$Sqft <- str_replace(House_Final$Sqft, " sqft","") %>%
str_replace(",","") %>%
as.numeric()
# Clean Type
House_Final$Type <- str_replace(House_Final$Type, "- ","")
# Clean Driving Distance
House_Final$Driving_Time_Min <- round(House_Final$Driving_Time_Sec/60, 2)
# Pull ZIP Code
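# gsub strips every non-digit, and the last five remaining digits are taken as the ZIP
# (this assumes each address string ends with its ZIP Code), e.g.
# "350 Main St, Beacon, NY 12508" -> "35012508" -> "12508"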
x <- gsub("[^0-9]", "", House_Final$Address)
House_Final$Zip <- substr(x,nchar(x)-4,nchar(x))
# Clean Price
clean_price_list <- as.character(House_Final$Price)
for(i in 1:length(clean_price_list)){
if(str_detect(clean_price_list[i],"M")) {
x <- clean_price_list[i]
x <- str_replace_all(x,"M","") %>%
str_replace_all("\\$","") %>%
as.numeric()
x <- x * 1000000
clean_price_list[i] <- x
}
}
clean_price_list <-
str_replace_all(clean_price_list,",","")%>%
str_replace_all("\\$","") %>%
str_replace_all("\\+","") %>%
as.numeric()
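# Worked examples of the two paths above (hypothetical listing prices):
# "$1.5M" is caught by the loop and becomes 1500000;
# "$499,000+" skips the loop, and the pipe strips ",", "$" and "+" to give 499000.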
House_Final$Price_Clean <- clean_price_list
House_Final <- drop_na(House_Final)
# Filter out-of-range listings: keep only the ZIP Codes of interest (price and bedroom caps follow below).
ZipDF <- as.data.frame(zip_List)
names(ZipDF) <- "Zip"
House_Final <- inner_join(House_Final, ZipDF, by = "Zip")
House_Final <- House_Final %>%
filter(Price_Clean < 1000000 & Beds < 7)
… and run some preliminary studies to get a sense of our data.
House_Final %>%
select(Price_Clean, Beds, Baths, Sqft, Driving_Time_Min) %>%
summary()
## Price_Clean Beds Baths Sqft
## Min. : 52900 Min. :1.000 Min. :1.000 Min. : 550
## 1st Qu.:354925 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1556
## Median :500000 Median :3.000 Median :2.000 Median :2000
## Mean :517293 Mean :3.242 Mean :2.441 Mean :2182
## 3rd Qu.:669000 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:2619
## Max. :999300 Max. :6.000 Max. :6.000 Max. :8200
## Driving_Time_Min
## Min. :11.03
## 1st Qu.:28.78
## Median :54.12
## Mean :54.28
## 3rd Qu.:77.77
## Max. :91.75
House_Final %>%
ggplot(aes(x=Driving_Time_Min, y=Price_Clean, color = as.factor(Zip))) + geom_point() + geom_smooth() +
theme_few() +
theme(legend.position = "none") +
labs(title="Price by Distance to NYC",
y ="Price ($)", x = "Driving Time (Min) to Yankee Stadium")
House_Final %>%
ggplot(aes(x=as.factor(Beds), y=Price_Clean)) + geom_boxplot() +
theme_few() +
labs(title="Price by Bedroom Count",
y ="Price ($)", x = "Bedrooms")
Finally, before moving on, let's see where our points lie on a preliminary map.
leaflet() %>%
addTiles() %>% # Add default OpenStreetMap map tiles
addMarkers(lng=House_Final$Lon, lat=House_Final$Lat, popup=House_Final$Address)
Realtor.com maintains two databases with several fields of interest. Here is what we will be studying:
From the Inventory file, we will pull:
Median Listing Price: The median listing price within the specified geography during the specified month.
Active Listing Count: The count of active listings within the specified geography during the specified month. The active listing count tracks the number of for sale properties on the market, excluding pending listings where a pending status is available. This is a snapshot measure of how many active listings can be expected on any given day of the specified month.
Days on Market: The median number of days property listings spend on the market within the specified geography during the specified month. Time spent on the market is defined as the time between the initial listing of a property and either its closing date or the date it is taken off the market.
New Listing Count: The count of new listings added to the market within the specified geography. The new listing count represents a typical week’s worth of new listings in a given month. The new listing count can be multiplied by the number of weeks in a month to produce a monthly new listing count.
Price Increase Count: The count of listings which have had their price increased within the specified geography. The price increase count represents a typical week’s worth of listings which have had their price increased in a given month. The price increase count can be multiplied by the number of weeks in a month to produce a monthly price increase count.
Price Decrease Count: The count of listings which have had their price reduced within the specified geography. The price decrease count represents a typical week’s worth of listings which have had their price reduced in a given month. The price decrease count can be multiplied by the number of weeks in a month to produce a monthly price decrease count.
Median List Price Per Sqft: The median listing price per square foot within the specified geography during the specified month.
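For example, the weekly-style counts above convert to monthly figures with a single multiplication (a sketch; the weekly count shown is hypothetical):
weeks_in_month <- 365.25 / 12 / 7   # ~4.35 weeks in an average month
weekly_new_listings <- 16           # hypothetical "typical week" count
monthly_new_listings <- weekly_new_listings * weeks_in_month  # ~69.6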
From the “Hotness” file, we will pull:
Hotness Score: The Hotness score is an equally-weighted composite metric of a geography’s supply score and demand score.
Supply Score: The supply score is an index representing a zip code, county or metro’s median days on market ranking compared to other zip codes, counties, or metros.
Demand Score: The demand score is an index representing a zip code, county or metro’s listing page views per property ranking compared to other zip codes, counties, or metros.
Median DOM: The median number of days property listings spend on the market within the specified geography during the specified month. Time spent on the market is defined as the time between the initial listing of a property and either its closing date or the date it is taken off the market.
LDP Unique Viewers Per Property (vs US): The count of viewers a typical property receives in the specified geography divided by the count of views a typical property receives in the US overall during the same month.
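Under that definition, the hotness score reduces to a simple average of its two components (a sketch, not Realtor.com's published code):
hotness <- function(supply_score, demand_score) (supply_score + demand_score) / 2
hotness(7.9, 12.1)  # 10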
# Files pulled from Realtor.com onto local storage.
Metrics_Hotness <- read.csv("RDC_Inventory_Hotness_Metrics_Zip_History.csv")
Metrics_Core <- read.csv("RDC_Inventory_Core_Metrics_Zip_History.csv")
# Convert ZIP Code to factor
Metrics_Hotness$zip_fact <- as.factor(Metrics_Hotness$postal_code)
Metrics_Core$zip_fact <- as.factor(Metrics_Core$postal_code)
# Limit study to relevant Zips
df1 <- inner_join(ZipDF, Metrics_Hotness, by = c("Zip" = "zip_fact")) %>%
filter(month_date_yyyymm == "202104") %>%
select(Zip, hotness_score, supply_score, demand_score, median_days_on_market, ldp_unique_viewers_per_property_vs_us)
names(df1) <- c("Zip","Hotness_Score","Supply_Score","Demand_Score","Median_Days_on_Market","Unique_Views")
df2 <- inner_join(ZipDF, Metrics_Core, by = c("Zip" = "zip_fact")) %>%
filter(month_date_yyyymm == "202104") %>%
select(Zip, median_listing_price, price_increased_count, price_reduced_count, active_listing_count, median_listing_price_per_square_foot, zip_name, new_listing_count)
names(df2) <- c("Zip", "Median_Listing_Price","Listing_W.Price_Increase_Count", "Listing_W.Price_Decrease_Count","Active_Listing_Count_Permonth","Median_Price_Per_Sq_Ft","Zip_Name","New_Listing_Count")
Realtorcom_DF <- inner_join(df1, df2, by = "Zip")
We can view the summary of our relevant metrics:
Realtorcom_DF %>%
select(-c(Zip, Zip_Name)) %>%
summary()
## Hotness_Score Supply_Score Demand_Score Median_Days_on_Market
## Min. : 5.565 Min. : 3.880 Min. : 0.6738 Min. : 83.0
## 1st Qu.: 8.448 1st Qu.: 4.539 1st Qu.: 9.3603 1st Qu.:139.8
## Median :10.234 Median : 7.919 Median :12.0833 Median :240.8
## Mean :11.841 Mean :11.102 Mean :12.5802 Mean :245.7
## 3rd Qu.:12.566 3rd Qu.:17.565 3rd Qu.:15.4753 3rd Qu.:365.8
## Max. :28.327 Max. :28.428 Max. :33.6042 Max. :378.5
## Unique_Views Median_Listing_Price Listing_W.Price_Increase_Count
## Min. :0.0778 Min. : 185000 Min. :0.0000
## 1st Qu.:0.3250 1st Qu.: 336925 1st Qu.:0.0000
## Median :0.3778 Median : 500509 Median :0.0000
## Mean :0.3707 Mean : 553734 Mean :0.5333
## 3rd Qu.:0.4305 3rd Qu.: 762833 3rd Qu.:0.0000
## Max. :0.7111 Max. :1150000 Max. :8.0000
## Listing_W.Price_Decrease_Count Active_Listing_Count_Permonth
## Min. : 0.000 Min. : 20.00
## 1st Qu.: 4.000 1st Qu.: 41.75
## Median : 4.000 Median : 66.50
## Mean : 7.867 Mean : 90.53
## 3rd Qu.: 8.000 3rd Qu.:108.75
## Max. :44.000 Max. :438.00
## Median_Price_Per_Sq_Ft New_Listing_Count
## Min. :151.4 Min. : 4.0
## 1st Qu.:223.5 1st Qu.: 8.0
## Median :296.7 Median :16.0
## Mean :287.6 Mean :25.6
## 3rd Qu.:342.4 3rd Qu.:35.0
## Max. :409.0 Max. :96.0
We will use two different tables provided by the US Census:
Age and Sex Table Id: S0101
INCOME IN THE PAST 12 MONTHS (IN 2019 INFLATION-ADJUSTED DOLLARS) Table Id: S1901.
We are primarily interested in a handful of metrics from the Census:
Median Income
Population
Percent of the Population in our Age Group: broadly, we consider 25 to 40 to be our age group.
Median Age
Child Dependency Ratio: \(\frac{(\text{Population under } 18) \times 100}{\text{Population aged 18 to 64}}\)
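A minimal sketch of the calculated ratio, using the age-band totals that appear as columns in the table below (the counts here are hypothetical; population aged 18 to 64 is the 18-and-over total minus the 65-and-over total):
child_dep_ratio <- function(under_18, over_18, over_65) {
under_18 * 100 / (over_18 - over_65)
}
child_dep_ratio(under_18 = 2500, over_18 = 8000, over_65 = 1500)  # ~38.5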
# Import Data
Census_Income <- read.csv("Census_Income.csv")
# Clean, Filter
names(Census_Income) <- c("Median_Income", "Zip")
Census_Income <- inner_join(Census_Income, ZipDF)
Census_Income$Median_Income <- Census_Income$Median_Income %>% str_replace_all(",","")%>%
str_replace_all("\\+","") %>%
as.numeric()
# Import Data
Census_Pop <- read.csv("Census_Pop.csv")
# Clean and Filter
names(Census_Pop) <- c("Zip","Total_Pop","Under_5","5_to_9","10_to_14","15_to_19","20_to_24","25_to_29","30_to_34","35_to_39","40_to_44","45_to_49","50_to_54","55_to_59","60_to_64","65_to_69","70_to_74","75_to_79","80_to_84","85_Plus","5_To_14_Years","15_To_17_Years","Under_18_Years","18_To_24_Years","15_To_44_Years","16_Years_And_Over","18_Years_And_Over","21_Years_And_Over","60_Years_And_Over","62_Years_And_Over","65_Years_And_Over","75_Years_And_Over","Median_Age","Sex_Ratio__Males_Per_100_Females_","Age_Dependency_Ratio","Old_Age_Dependency_Ratio","Child_Dependency_Ratio")
Census_Pop$Zip <- as.factor(Census_Pop$Zip)
# Convert the imported summary columns from factor/character to numeric
num_cols <- c("Median_Age", "Sex_Ratio__Males_Per_100_Females_", "Age_Dependency_Ratio", "Old_Age_Dependency_Ratio", "Child_Dependency_Ratio")
Census_Pop <- Census_Pop %>%
mutate(across(all_of(num_cols), ~ as.numeric(as.character(.))))
# The import produced duplicate rows; distinct() below removes them.
Census_Pop <- inner_join(Census_Pop,ZipDF) %>% distinct()
# Calculated Variables
Census_Pop$Our_Age_Group_Percent <- (Census_Pop$`25_to_29` + Census_Pop$`30_to_34` + Census_Pop$`35_to_39`)/Census_Pop$Total_Pop
We can now merge these two tables and provide a summary of our relevant metrics.
# New Table
Census_1 <- Census_Pop %>%
select(Zip, Our_Age_Group_Percent, Child_Dependency_Ratio, Median_Age, Total_Pop)
Census_DF <- inner_join(Census_Income, Census_1)
summary(Census_DF)
## Median_Income Zip Our_Age_Group_Percent Child_Dependency_Ratio
## Min. : 47525 10463 : 1 Min. :0.0000 Min. : 1.20
## 1st Qu.: 76714 10470 : 1 1st Qu.:0.1334 1st Qu.:29.18
## Median : 92025 10471 : 1 Median :0.1727 Median :31.95
## Mean :100806 10502 : 1 Mean :0.1662 Mean :32.91
## 3rd Qu.:126684 10503 : 1 3rd Qu.:0.2084 3rd Qu.:39.17
## Max. :174881 10510 : 1 Max. :0.2578 Max. :50.20
## NA's :3 (Other):33 NA's :1 NA's :3
## Median_Age Total_Pop
## Min. :20.40 Min. : 0
## 1st Qu.:39.35 1st Qu.: 4666
## Median :41.90 Median :11070
## Mean :42.53 Mean :16925
## 3rd Qu.:44.65 3rd Qu.:23527
## Max. :66.70 Max. :71132
## NA's :1
As a final step, let's merge our Realtor.com and Census data together.
Zip_Data_DF <- full_join(Realtorcom_DF, Census_DF, by = "Zip")
I will next supply a few parameters and demonstrate how one might interact with the visualizations while learning more about the region.
We will start by filtering the listings to those that match the following parameters. In the Shiny app, these will be user-supplied inputs.
# Listing Type Vars.
Type_List <- House_Final %>% select(Type) %>% distinct() %>% as.list() %>% unlist()
# Parameters:
# Max Home Price
Input_Price <- 400000
# Listing types
Input_Types <- Type_List[c(1,3,4,5,6,7)]
# Min Baths
Input_Baths1 <- 2.5
# Max
Input_Baths2 <- 6
# Min Bedrooms
Input_Beds1 <- 3
# Max Bedrooms
Input_Beds2 <- 6
# Max Distance from City
Input_Dist2 <- 100
# Min Distance from City
Input_Dist1 <- 5
# Min Sq footage
Input_Sqft <- 100
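In the dashboard, each parameter above maps to a selector; a minimal sketch of two such widgets (the input IDs and ranges are assumptions, not the final app):
library(shiny)
sliderInput("Input_Price", "Max Home Price", min = 50000, max = 1000000, value = 400000)
sliderInput("Input_Beds", "Bedrooms", min = 1, max = 6, value = c(3, 6))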
# Filter for Parameters:
House_Final_Filtered <-
House_Final %>%
filter(
Price_Clean <= Input_Price &
Type %in% Input_Types &
Baths <= Input_Baths2 &
Baths >= Input_Baths1 &
Beds <= Input_Beds2 &
Beds >= Input_Beds1 &
Driving_Time_Min <= Input_Dist2 &
Driving_Time_Min >= Input_Dist1 &
Sqft >= as.numeric(Input_Sqft)
)
####################################################################################################
# This code belongs in the mapping chunk below; however, the maplayer variable is needed here for graphing.
options(tigris_use_cache = TRUE)
char_zips1 <- zctas(cb = TRUE, starts_with = "10")
char_zips2 <- zctas(cb = TRUE, starts_with = "12")
char_zips <- rbind(char_zips1, char_zips2)
char_zips <- geo_join(char_zips,
ZipDF,
by_sp = "ZCTA5CE10",
by_df = "Zip",
how = "inner")
maplayer <- geo_join(char_zips,
Zip_Data_DF,
by_sp = "ZCTA5CE10",
by_df = "Zip",
how = "inner")
# End of mapping code chunk
########################################################################################################
# Normalize variables for graphing
# (Assumed step: join the per-ZIP Realtor.com/Census metrics onto the listings so the columns selected below exist.)
House_Final <- left_join(House_Final, Zip_Data_DF, by = "Zip")
House_Final_Long_Filtered <- House_Final %>%
select(Price,Concat,Driving_Time_Min,Hotness_Score,Median_Days_on_Market,Unique_Views,Median_Listing_Price,Listing_W.Price_Increase_Count,Listing_W.Price_Decrease_Count,Active_Listing_Count_Permonth,Median_Price_Per_Sq_Ft,Median_Income,Our_Age_Group_Percent, Child_Dependency_Ratio, Total_Pop, Price_Clean) %>%
pivot_longer(-c(Price,Concat,Driving_Time_Min,Price_Clean),names_to = "Var", values_to = "Value")
House_Final_Long_Filtered_Stats <-
House_Final_Long_Filtered %>%
group_by(Var) %>%
summarize(Mean = mean(Value, na.rm = TRUE), SD = sd(Value, na.rm = TRUE))
House_Final_Long_Filtered <- full_join(House_Final_Long_Filtered,House_Final_Long_Filtered_Stats)
# Initially built for heatmap variable selection; reused as the list of graphable variables:
mapvarlist <- names(maplayer)[c(6,9,11,13,14,15,17,18,19,20,21,22)]  # column positions of the numeric per-ZIP metrics
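The plot below draws each variable in standard-deviation units and back-transforms the secondary price axis with the inverse map; a quick round-trip check of that normalization (hypothetical prices):
p <- c(300000, 500000, 900000)
z <- (p - mean(p)) / sd(p)
all.equal(z * sd(p) + mean(p), p)  # TRUE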
# Build Plot
House_Final_Long_FilteredX <- House_Final_Long_Filtered %>%
filter(Var %in% mapvarlist[c(5,6,9)])
House_Final_Long_FilteredY <- inner_join(House_Final_Long_Filtered, House_Final_Filtered)
# Base Plot
plot0 <- House_Final_Long_FilteredX %>%
ggplot(aes(x = Driving_Time_Min)) +
geom_smooth(aes(y = (Value - Mean) / SD, col = Var), se = FALSE) +
theme_few() +
theme(legend.position = "bottom", legend.title = element_blank()) +
# All listings, drawn faintly
geom_point(aes(y = (Price_Clean - mean(Price_Clean)) / sd(Price_Clean)),
alpha = .02) +
# Listings matching the user's parameters, drawn at full opacity
geom_point(data = House_Final_Long_FilteredY,
aes(y = (Price_Clean - mean(House_Final_Long_FilteredX$Price_Clean)) /
sd(House_Final_Long_FilteredX$Price_Clean))) +
labs(title = "Price vs Driving Distance to NYC",
x = "Driving Time (Min) to Yankee Stadium") +
# A single y scale: left axis in SD units, right axis back-transformed to dollars
scale_y_continuous(name = "SD from Mean",
sec.axis = sec_axis(trans = ~ . * sd(House_Final_Long_Filtered$Price_Clean) +
mean(House_Final_Long_Filtered$Price_Clean), name = "Price ($)", labels = scales::dollar_format()))
plot0
After the user has selected homes via the above visual or the selectors, we want to plot these homes on a map.
Var_Input <- "Our_Age_Group_Percent"
# This is the mapping chunk referenced above. char_zips and maplayer were already built in the graphing section, so the ZCTA import and geo_join are not repeated here.
# Pull the data for the selected variable into a vector
Map_Value <- maplayer[[Var_Input]]
# Generate palette
pal <- colorNumeric(
palette = "Spectral",
domain = Map_Value,
na.color=rgb(0,0,0,0))
# Map Labels
labels <-
paste0(
"Zip Code: ",
maplayer$ZCTA5CE10, "<br/>",
paste0(Var_Input, ": "),
round(Map_Value,2)) %>%
lapply(htmltools::HTML)
# Generate Map
m <-
maplayer %>%
leaflet() %>%
addProviderTiles("CartoDB") %>%
# Markers for the filtered listings; popups render HTML, so use <br/> as the line break
addMarkers(lng=House_Final_Filtered$Lon, lat=House_Final_Filtered$Lat, popup=paste(House_Final_Filtered$Address, House_Final_Filtered$Price, sep="<br/>"),
clusterOptions = markerClusterOptions()) %>%
addPolygons(fillColor = ~pal(Map_Value),
weight = 2,
opacity = 1,
color = "white",
dashArray = "3",
fillOpacity = 0.7,
highlight = highlightOptions(weight = 2,
color = "#666",
dashArray = "",
fillOpacity = 0.7,
bringToFront = TRUE),
label = labels) %>%
# add legend
addLegend(pal = pal,
values = ~Map_Value,
opacity = 0.7,
title = htmltools::HTML(paste0(Var_Input, "<br>by Zip Code")),
position = "bottomright",
na.label = "")
# Print Map
m
We found that there are homes fitting these parameters available on the northern end of the area we were considering. Viewing the plot above, we also see that this area is relatively desirable due to the higher portion of people in our age range and the lower price per square foot. We also see a higher-than-average new listing count in this region, which is favorable.
Regarding next steps, we might expand our search and consider the area slightly southeast of this region: namely, the eastern section of Putnam County and the southern edge of neighboring Dutchess County, in East Fishkill.
Explore Census Data, Table S0101 (Age and Sex). https://data.census.gov/cedsci/table?q=S0101
Explore Census Data, Table S1901 (Income in the Past 12 Months). https://data.census.gov/cedsci/table?q=S1901
Realtor.com Economic Research (2021, April 29). Real Estate Data and Market Trends for Download. https://www.realtor.com/research/data/