Intro

After 12 great years living in New York City, it is time to move to the suburbs. This fall, my wife and I will begin our search for our first home. While we know that we generally want to live in the Hudson Valley region of NY, somewhere along Metro-North's Hudson Line, we are having trouble contextualizing the differences among the neighborhoods and cities along this stretch.

I will be using data from three sources to generate an interactive visualization that should help us better understand these neighborhoods. First, I will use data from the US Census to learn about the demographics of the region. Additionally, I will pull metrics relating to the housing market itself from Realtor.com. Finally, I will pull listings from the online real estate marketplace Zillow.com. We will be using ZIP Code as our regional unit.

Regarding the logistics of my approach: I will outline my data handling, cleaning, and transformation in this markdown file, providing some static exhibits as I explain my findings. I will then provide the final visualization in a hosted RShiny dashboard.

Data Acquisition

# ZIP Codes of interest:
zip_List <- c("12601","12604","12603","12590","12512","12527","12508","12524","10516","10524","10537","10566","10511","10596","10548","10567","10520","10545","10510","10570","10594","10532","10591","10523","10607","10533","10503","10502","10522","10706","10703","10701","10710","10704","10705","10471","10470","10463","10562")

# Load Packages 

library(dplyr)         # data manipulation
library(tidyr)         # reshaping / pivoting
library(ggmap)         # geocoding via the Google Maps API
library(gmapsdistance) # driving-time queries via the Google Maps API
library(rvest)         # static web scraping
library(RSelenium)     # browser automation for scraping
library(stringr)       # string cleaning
library(leaflet)       # interactive maps
library(tigris)        # ZCTA (ZIP Code) shapefiles
library(ggthemes)      # extra ggplot themes
library(cowplot)       # plot composition
library(grid)          # low-level graphics utilities
library(gridExtra)     # arranging multiple plots

Zillow

First, we will scrape listings from Zillow for each of the ZIP Codes of interest.

# Initial webscrape with rvest
      
      for (i in 1:length(zip_List)){
      
      # Pull up to 3 pages of results, max ~150 entries per ZIP Code. It is
      # unlikely that more than 150 active listings exist on Zillow for any of
      # these ZIP Codes; will review.
        
      z_url_1 <-  paste("https://www.zillow.com/homes/",zip_List[i],"_rb/",sep="")
      z_url_2 <-  paste("https://www.zillow.com/ny-",zip_List[i],"/houses/2_p/",sep="") 
      z_url_3 <-  paste("https://www.zillow.com/ny-",zip_List[i],"/houses/3_p/",sep="")
      
      z_link_1 <- read_html(z_url_1)
      z_link_2 <- read_html(z_url_2)
      z_link_3 <- read_html(z_url_3)
      
      # Each listing card yields its address, price, and detail fields as text.
      df1 <- z_link_1 %>% html_nodes('.list-card-addr , .list-card-price , .list-card-details li') %>% html_text()
      
      df2 <- z_link_2 %>% html_nodes('.list-card-addr , .list-card-price , .list-card-details li') %>% html_text()
      
      df3 <- z_link_3 %>% html_nodes('.list-card-addr , .list-card-price , .list-card-details li') %>% html_text()
      
      # Concatenate the three pages into one flat character vector
      # (c(), not rbind(), since these are vectors of unequal length).
      df_all <- c(df1,df2,df3)
      
      # Store the vector under the ZIP Code's name.
      assign(zip_List[i],df_all)
      
      }
      
      # Initially pulled in data for 17 ZIP Codes before returning null values; possibly running into a CAPTCHA.
      

# RSelenium
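
      # The loop below assumes a live Selenium session. A minimal setup sketch
      # (assumes a local Firefox + geckodriver install; the original setup
      # chunk is not shown, so the browser choice here is illustrative):
      rD <- rsDriver(browser = "firefox", verbose = FALSE)
      remDr <- rD$client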

      for (i in 17:length(zip_List)){
      
      # After review, found fewer than 50 entries per ZIP Code, so one page suffices.
        
      z_url_1 <-  paste("https://www.zillow.com/homes/",zip_List[i],"_rb/",sep="")
      
      remDr$navigate(z_url_1)
      Sys.sleep(2)  # give the page time to render
      page <- read_html(remDr$getPageSource()[[1]])
      results <- page %>%
          html_nodes('.list-card-addr , .list-card-price , .list-card-details li') %>%
          html_text() 
      
      assign(zip_List[i],results)
      
      }
      
      # Successfully pulled in the remaining ZIP Codes.


# Build/Clean DF

      # Each listing contributes six strings; truncate each ZIP's vector to a
      # multiple of six, then reshape it into a six-column data frame
      # (one row per listing).
      House_DF <- data.frame(matrix(unlist(eval(as.name(zip_List[1]))[1:(6*floor(length(eval(as.name(zip_List[1])))/6))]), ncol = 6, byrow=TRUE)) 
      
      for (i in 2:length(zip_List)){
      # Reshape the i-th ZIP's vector and append.
      df1 <- data.frame(matrix(unlist(eval(as.name(zip_List[i]))[1:(6*floor(length(eval(as.name(zip_List[i])))/6))]), ncol = 6, byrow=TRUE)) 
      
      House_DF <- rbind(House_DF, df1) } 
      
      
      Clean_House_DF <- House_DF %>% drop_na() %>% distinct()
      nrow(House_DF)
      nrow(Clean_House_DF)
      
      
      names(Clean_House_DF) <- c("Address","Price","Beds","Baths","sqft","Type")
      
      Clean_House_DF$Address <- as.character(Clean_House_DF$Address)
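
To make the six-wide reshape in the chunk above concrete, here is a toy example (the field strings are hypothetical, ordered per the column names just assigned):

      # Toy illustration: six consecutive scraped strings become one row.
      v <- c("123 Main St, Anytown, NY 10101", "$500,000",
             "3 bds", "2 ba", "1,800 sqft", "- House for sale")
      data.frame(matrix(v, ncol = 6, byrow = TRUE))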

Mapping Data

We obtained six fields from Zillow: address, price, bedroom count, bathroom count, square footage, and listing type (house, condo, multifamily home, etc.). We need two additional data elements: geographical coordinates and distance to the city. We can obtain both through packages that leverage the Google Maps API.

Regarding distance from the city: since all of the neighborhoods we are interested in lie along one train line, these ZIP Codes are arguably ordinal. To make analysis easier and to aid in plotting, we will instead treat “distance from the city” as a continuous variable. Rather than physical distance, I elected to use driving time, which better captures the practical impact of each listing's location. As New York City is large, I picked an accessible and well-known landmark on the northern edge of the city: Yankee Stadium. Driving time is the amount of time it would take to drive from each home, arriving at Yankee Stadium at 9 a.m. on the morning of Monday, May 24th, 2021 (the arr_time of 13:00:00 in the API calls below, assuming the package interprets times as UTC: 9 a.m. EDT is 13:00 UTC).

# Saving the API key as the variable "CHRIS_BLOOME_API_KEY" in a hidden chunk.

# Set API keys for two packages
      set.api.key(CHRIS_BLOOME_API_KEY)
      register_google(key = CHRIS_BLOOME_API_KEY)

# Pull GPS coordinates of each house address via the ggmap package, which connects to the Google Maps API
      geocode_data <- geocode(unlist(Clean_House_DF$Address), output = "more", source = "google", urlonly = FALSE, nameType = "long")

      # Concatenate lat/lon into the "lat+lon" string format gmapsdistance expects.
      # ("orgin" is a misspelling of "origin", kept since it is used throughout.)
      geocode_data$orgin <- paste(geocode_data$lat,"+",geocode_data$lon, sep="")
      
      YankeeStadium <- "40.829780585986235+-73.92618523107261"
      
      geocode_data_noNA <- geocode_data %>% drop_na()
      
# Will use Yankee Stadium as the destination for calculating driving time to the city.
      geocode_data_noNA$destination <- YankeeStadium

# The API can only handle 100 searches per request, so query in batches of up to 100:
      batch_starts <- seq(1, nrow(geocode_data_noNA), by = 100)
      
      results_driving_time <- do.call(rbind, lapply(batch_starts, function(s) {
        e <- min(s + 99, nrow(geocode_data_noNA))
        res <- gmapsdistance(geocode_data_noNA$orgin[s:e],
                             geocode_data_noNA$destination[1],
                             mode = "driving", key = get.api.key(),
                             arr_date = "2021-05-24", arr_time = "13:00:00")
        as.data.frame(res[1])  # keep the Time element; rows bind across batches
      }))

      names(results_driving_time) <- c("orgin","driving_time_sec")

      House_Full_Geo <- cbind(Clean_House_DF, geocode_data)
      
      # Join the driving times back on via the concatenated "lat+lon" key.
      House_Full_Geo <- left_join(House_Full_Geo, results_driving_time, by = "orgin")

      House_Final <- House_Full_Geo %>%
                      select(Price, Beds, Baths, sqft, Type, lon, lat, address, orgin, driving_time_sec) %>%
                      drop_na() %>%
                      distinct()
      
      names(House_Final) <- c("Price", "Beds", "Baths", "Sqft", "Type", "Lon", "Lat", "Address", "Concat", "Driving_Time_Sec")

# Saving CSVs for use in RShiny.
      write.csv(House_Final, "House_Final0.csv")
      write.csv(House_Full_Geo, "House_Full_Geo0.csv")

We can now clean our data …

# Clean Table

  # Clean Bedroom
      House_Final$Beds  <- str_replace(House_Final$Beds, " bds","") %>%
        str_replace(" bd","") %>%
        str_replace("Studio","1") %>%
        as.numeric()
  
  # Clean Bathroom
      House_Final$Baths <- str_replace(House_Final$Baths, " ba","") %>%
        as.numeric()
  
  # Clean Sqft
      House_Final$Sqft <- str_replace(House_Final$Sqft, " sqft","") %>%
        str_replace(",","") %>%
        as.numeric()
  
  # Clean Type    
      House_Final$Type <-  str_replace(House_Final$Type, "- ","")
  
  # Clean Driving Distance    
      House_Final$Driving_Time_Min <- round(House_Final$Driving_Time_Sec/60, 2)
    
  # Pull ZIP Code: strip non-digits from the address, then keep the last five digits.
      x <- gsub("[^0-9]", "", House_Final$Address) 
      House_Final$Zip <- substr(x,nchar(x)-4,nchar(x))
      
  # Clean Price: expand "M"-suffixed prices (e.g. "$1.2M") to full dollar amounts.
      clean_price_list <- as.character(House_Final$Price)

      for(i in 1:length(clean_price_list)){
        if(str_detect(clean_price_list[i],"M")) { 
          
          x <- clean_price_list[i]
          x <- str_replace_all(x,"M","") %>%
            str_replace_all("\\$","") %>%
            as.numeric()
          x <- x * 1000000 
          
          clean_price_list[i] <- x
        }
      }
          
      
      clean_price_list <- 
      str_replace_all(clean_price_list,",","")%>%
            str_replace_all("\\$","") %>%
            str_replace_all("\\+","") %>%
            as.numeric()
          
      
      House_Final$Price_Clean <- clean_price_list
      House_Final <- drop_na(House_Final)
  
  # Filter to in-scope ZIP Codes and drop out-of-range outliers.
      ZipDF <- as.data.frame(zip_List)
      names(ZipDF) <- "Zip"
      
      House_Final <- inner_join(House_Final, ZipDF, by = "Zip")
      
      House_Final <- House_Final %>%
        filter(Price_Clean < 1000000 & Beds < 7)

… and run some preliminary studies to get a sense of our data.

House_Final %>%
  select(Price_Clean, Beds, Baths, Sqft, Driving_Time_Min) %>%
  summary()
##   Price_Clean          Beds           Baths            Sqft     
##  Min.   : 52900   Min.   :1.000   Min.   :1.000   Min.   : 550  
##  1st Qu.:354925   1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1556  
##  Median :500000   Median :3.000   Median :2.000   Median :2000  
##  Mean   :517293   Mean   :3.242   Mean   :2.441   Mean   :2182  
##  3rd Qu.:669000   3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:2619  
##  Max.   :999300   Max.   :6.000   Max.   :6.000   Max.   :8200  
##  Driving_Time_Min
##  Min.   :11.03   
##  1st Qu.:28.78   
##  Median :54.12   
##  Mean   :54.28   
##  3rd Qu.:77.77   
##  Max.   :91.75
House_Final %>%
  ggplot(aes(x=Driving_Time_Min, y=Price_Clean, color = as.factor(Zip))) + geom_point() + geom_smooth() +
        theme_few() + 
        theme(legend.position = "none") + 
        labs(title="Price by Distance to NYC",
        y ="Price ($)", x = "Driving Time (Min) to Yankee Stadium")

House_Final %>%
  ggplot(aes(x=as.factor(Beds), y=Price_Clean)) + geom_boxplot() +   
        theme_few() + 
        labs(title="Price by Bedroom Count",
        y ="Price ($)", x = "Bedrooms")

Finally, before moving on, let's see where our points lie on a preliminary map.

leaflet() %>%
  addTiles() %>%  # Add default OpenStreetMap map tiles
  addMarkers(lng=House_Final$Lon, lat=House_Final$Lat, popup=House_Final$Address)

Realtor.com Data

There are two databases maintained by Realtor.com, with several fields of interest. Here is what we will be studying:

From the Inventory file, we will pull:

  • Median Listing Price: The median listing price within the specified geography during the specified month.

  • Active Listing Count: The count of active listings within the specified geography during the specified month. The active listing count tracks the number of for sale properties on the market, excluding pending listings where a pending status is available. This is a snapshot measure of how many active listings can be expected on any given day of the specified month.

  • Days on Market: The median number of days property listings spend on the market within the specified geography during the specified month. Time spent on the market is defined as the time between the initial listing of a property and either its closing date or the date it is taken off the market.

  • New Listing Count: The count of new listings added to the market within the specified geography. The new listing count represents a typical week’s worth of new listings in a given month, and can be multiplied by the number of weeks in a month to produce a monthly new listing count (a quick numeric sketch of this conversion follows this list).

  • Price Increase Count: The count of listings which have had their price increased within the specified geography. The price increase count represents a typical week’s worth of listings which have had their price increased in a given month. The price increase count can be multiplied by the number of weeks in a month to produce a monthly price increase count.

  • Price Decrease Count: The count of listings which have had their price reduced within the specified geography. The price decrease count represents a typical week’s worth of listings which have had their price reduced in a given month. The price decrease count can be multiplied by the number of weeks in a month to produce a monthly price decrease count.

  • Median List Price Per Sqft: The median listing price per square foot within the specified geography during the specified month.
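
The weekly-to-monthly conversion mentioned in several of the definitions above is simple arithmetic; a quick sketch with hypothetical values:

      # Hypothetical example of the weekly-to-monthly conversion.
      new_listings_weekly <- 16      # a typical week's worth of new listings (made up)
      weeks_in_month <- 30 / 7       # ~4.29 weeks in a 30-day month
      new_listings_weekly * weeks_in_month  # ~68.6 new listings that month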

From the “Hotness” file, we will pull:

  • Hotness Score: The Hotness score is an equally-weighted composite metric of a geography’s supply score and demand score.

  • Supply Score: The supply score is an index representing a zip code, county or metro’s median days on market ranking compared to other zip codes, counties, or metros.

  • Demand Score: The demand score is an index representing a zip code, county or metro’s listing page views per property ranking compared to other zip codes, counties, or metros.

  • Median DOM: The median number of days property listings spend on the market within the specified geography during the specified month. Time spent on the market is defined as the time between the initial listing of a property and either its closing date or the date it is taken off the market.

  • LDP Unique Viewers Per Property (vs US): The count of viewers a typical property receives in the specified geography divided by the count of views a typical property receives in the US overall during the same month.

# Files pulled from Realtor.com to local storage.

      Metrics_Hotness <- read.csv("RDC_Inventory_Hotness_Metrics_Zip_History.csv")
      
      Metrics_Core <- read.csv("RDC_Inventory_Core_Metrics_Zip_History.csv")

# Convert ZIP Code to factor
      Metrics_Hotness$zip_fact <- as.factor(Metrics_Hotness$postal_code)
      Metrics_Core$zip_fact <- as.factor(Metrics_Core$postal_code)

# Limit study to relevant Zips
      df1 <- inner_join(ZipDF, Metrics_Hotness, by = c("Zip" = "zip_fact")) %>%
        filter(month_date_yyyymm == "202104") %>%
        select(Zip, hotness_score, supply_score, demand_score, median_days_on_market, ldp_unique_viewers_per_property_vs_us)
      
      names(df1) <- c("Zip","Hotness_Score","Supply_Score","Demand_Score","Median_Days_on_Market","Unique_Views")
      
      df2 <- inner_join(ZipDF, Metrics_Core, by = c("Zip" = "zip_fact")) %>%
        filter(month_date_yyyymm == "202104") %>%
        select(Zip, median_listing_price, price_increased_count, price_reduced_count, active_listing_count, median_listing_price_per_square_foot, zip_name, new_listing_count)
      
      names(df2) <- c("Zip", "Median_Listing_Price","Listing_W.Price_Increase_Count", "Listing_W.Price_Decrease_Count","Active_Listing_Count_Permonth","Median_Price_Per_Sq_Ft","Zip_Name","New_Listing_Count")
      
      Realtorcom_DF<-inner_join(df1, df2)
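
Since the Hotness score is described as an equally-weighted composite of the supply and demand scores, we can sanity-check that reading directly (equal weighting is our interpretation of the description, not documented arithmetic):

      # Check the assumed relationship: Hotness ~ (Supply + Demand) / 2.
      with(Realtorcom_DF, summary(Hotness_Score - (Supply_Score + Demand_Score) / 2))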

We can view the summary of our relevant metrics:

Realtorcom_DF %>% 
  select(-c(Zip, Zip_Name)) %>% 
  summary()
##  Hotness_Score     Supply_Score     Demand_Score     Median_Days_on_Market
##  Min.   : 5.565   Min.   : 3.880   Min.   : 0.6738   Min.   : 83.0        
##  1st Qu.: 8.448   1st Qu.: 4.539   1st Qu.: 9.3603   1st Qu.:139.8        
##  Median :10.234   Median : 7.919   Median :12.0833   Median :240.8        
##  Mean   :11.841   Mean   :11.102   Mean   :12.5802   Mean   :245.7        
##  3rd Qu.:12.566   3rd Qu.:17.565   3rd Qu.:15.4753   3rd Qu.:365.8        
##  Max.   :28.327   Max.   :28.428   Max.   :33.6042   Max.   :378.5        
##   Unique_Views    Median_Listing_Price Listing_W.Price_Increase_Count
##  Min.   :0.0778   Min.   : 185000      Min.   :0.0000                
##  1st Qu.:0.3250   1st Qu.: 336925      1st Qu.:0.0000                
##  Median :0.3778   Median : 500509      Median :0.0000                
##  Mean   :0.3707   Mean   : 553734      Mean   :0.5333                
##  3rd Qu.:0.4305   3rd Qu.: 762833      3rd Qu.:0.0000                
##  Max.   :0.7111   Max.   :1150000      Max.   :8.0000                
##  Listing_W.Price_Decrease_Count Active_Listing_Count_Permonth
##  Min.   : 0.000                 Min.   : 20.00               
##  1st Qu.: 4.000                 1st Qu.: 41.75               
##  Median : 4.000                 Median : 66.50               
##  Mean   : 7.867                 Mean   : 90.53               
##  3rd Qu.: 8.000                 3rd Qu.:108.75               
##  Max.   :44.000                 Max.   :438.00               
##  Median_Price_Per_Sq_Ft New_Listing_Count
##  Min.   :151.4          Min.   : 4.0     
##  1st Qu.:223.5          1st Qu.: 8.0     
##  Median :296.7          Median :16.0     
##  Mean   :287.6          Mean   :25.6     
##  3rd Qu.:342.4          3rd Qu.:35.0     
##  Max.   :409.0          Max.   :96.0

Census Data

We will use two different tables provided by the US Census:

  • Age and Sex (Table ID: S0101)

  • Income in the Past 12 Months (in 2019 inflation-adjusted dollars) (Table ID: S1901)

We are primarily interested in a handful of metrics from the Census:

  • Median Income

  • Population

  • Percent of the Population in Our Age Group: broadly considering ages 25 to 40 as our age group.

  • Median Age

  • Child Dependency Ratio: \(\frac{100 \times \text{Population}_{<18}}{\text{Population}_{18 \text{ to } 64}}\)

# Import Data 
      Census_Income <- read.csv("Census_Income.csv")
# Clean, Filter 
      names(Census_Income) <- c("Median_Income", "Zip")
      
      Census_Income <- inner_join(Census_Income, ZipDF)
      
      Census_Income$Median_Income <- Census_Income$Median_Income %>% str_replace_all(",","")%>%
                  str_replace_all("\\+","") %>%
                  as.numeric()


# Import Data 
      Census_Pop <- read.csv("Census_Pop.csv")
# Clean and Filter 
      names(Census_Pop) <- c("Zip","Total_Pop","Under_5","5_to_9","10_to_14","15_to_19","20_to_24","25_to_29","30_to_34","35_to_39","40_to_44","45_to_49","50_to_54","55_to_59","60_to_64","65_to_69","70_to_74","75_to_79","80_to_84","85_Plus","5_To_14_Years","15_To_17_Years","Under_18_Years","18_To_24_Years","15_To_44_Years","16_Years_And_Over","18_Years_And_Over","21_Years_And_Over","60_Years_And_Over","62_Years_And_Over","65_Years_And_Over","75_Years_And_Over","Median_Age","Sex_Ratio__Males_Per_100_Females_","Age_Dependency_Ratio","Old_Age_Dependency_Ratio","Child_Dependency_Ratio") 
      Census_Pop$Zip <- as.factor(Census_Pop$Zip)
      
      Census_Pop$Median_Age <- Census_Pop$Median_Age %>%
                                as.character() %>%
                                as.numeric()
      Census_Pop$Sex_Ratio__Males_Per_100_Females_ <- Census_Pop$Sex_Ratio__Males_Per_100_Females_ %>%
                                as.character() %>%
                                as.numeric()
      Census_Pop$Age_Dependency_Ratio <- Census_Pop$Age_Dependency_Ratio %>%
                                as.character() %>%
                                as.numeric()
      Census_Pop$Old_Age_Dependency_Ratio <- Census_Pop$Old_Age_Dependency_Ratio %>%
                                as.character() %>%
                                as.numeric()
      Census_Pop$Child_Dependency_Ratio <- Census_Pop$Child_Dependency_Ratio %>%
                                as.character() %>%
                                as.numeric()
      
      # The source file somehow contained duplicate rows; drop them.
      Census_Pop <- inner_join(Census_Pop,ZipDF) %>% distinct()
# Calculated Variables 
      Census_Pop$Our_Age_Group_Percent <- (Census_Pop$`25_to_29` + Census_Pop$`30_to_34` + Census_Pop$`35_to_39`)/Census_Pop$Total_Pop 
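
As a cross-check on the Census-provided ratio, the child dependency ratio can be recomputed from the age buckets (a sketch; it assumes the denominator is ages 18 to 64, per the formula given earlier, and that the count columns imported as numeric):

      # Recompute: 100 * (population under 18) / (population 18 to 64),
      # where 18-64 = (18 and over) minus (65 and over).
      chk <- with(Census_Pop,
                  100 * Under_18_Years / (`18_Years_And_Over` - `65_Years_And_Over`))
      summary(chk - Census_Pop$Child_Dependency_Ratio)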

We can now merge these two tables and provide a summary of our relevant metrics.

# New Table 
      Census_1 <- Census_Pop %>%
                    select(Zip, Our_Age_Group_Percent, Child_Dependency_Ratio, Median_Age, Total_Pop)
      
      Census_DF <- inner_join(Census_Income, Census_1)
      
      summary(Census_DF)
##  Median_Income         Zip     Our_Age_Group_Percent Child_Dependency_Ratio
##  Min.   : 47525   10463  : 1   Min.   :0.0000        Min.   : 1.20         
##  1st Qu.: 76714   10470  : 1   1st Qu.:0.1334        1st Qu.:29.18         
##  Median : 92025   10471  : 1   Median :0.1727        Median :31.95         
##  Mean   :100806   10502  : 1   Mean   :0.1662        Mean   :32.91         
##  3rd Qu.:126684   10503  : 1   3rd Qu.:0.2084        3rd Qu.:39.17         
##  Max.   :174881   10510  : 1   Max.   :0.2578        Max.   :50.20         
##  NA's   :3        (Other):33   NA's   :1             NA's   :3             
##    Median_Age      Total_Pop    
##  Min.   :20.40   Min.   :    0  
##  1st Qu.:39.35   1st Qu.: 4666  
##  Median :41.90   Median :11070  
##  Mean   :42.53   Mean   :16925  
##  3rd Qu.:44.65   3rd Qu.:23527  
##  Max.   :66.70   Max.   :71132  
##  NA's   :1

As a final step, let's merge our Realtor.com and Census data together.

Zip_Data_DF <- full_join(Realtorcom_DF,Census_DF)

Visualizations

I will next supply a few parameters and demonstrate how one might interact with the visualizations while learning more about the region.

Filtered Plot w/ Metrics

We will start by filtering the listings to those that match the following parameters. In the Shiny app, these will be user-supplied inputs.

# Listing Type Vars. 
  Type_List <- House_Final %>% select(Type) %>% distinct() %>% as.list() %>% unlist()
  # Ones used in graph: 
  #Type_List[c(1,3,4,5,6,7)]



# Parameters: 
    
    # Max Home Price
      Input_Price <- 400000
    # Listing types 
      Input_Types <- Type_List[c(1,3,4,5,6,7)]
    # Min Baths
      Input_Baths1 <- 2.5 
    # Max
      Input_Baths2 <- 6
    # Min Bedrooms  
      Input_Beds1 <- 3
    # Max Bedrooms
      Input_Beds2 <- 6
    # Max Distance from City
      Input_Dist2 <- 100
    # Min Distance from City
      Input_Dist1 <- 5
    # Min Sq footage
      Input_Sqft <- 100

# Filter for Parameters: 
      
      House_Final_Filtered <- 
        House_Final  %>%
        filter(
          Price_Clean <= Input_Price &
            Type %in% Input_Types &
            Baths <= Input_Baths2 &
            Baths >= Input_Baths1 &
            Beds <= Input_Beds2 &
            Beds >= Input_Beds1 &
            Driving_Time_Min <= Input_Dist2 &
            Driving_Time_Min >= Input_Dist1 &
            Sqft >= as.numeric(Input_Sqft)
        )
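
For context, in the Shiny app each of the constants above becomes a user control. A minimal sketch of how they might surface as inputs (widget ids, labels, and ranges are illustrative assumptions, not the app's actual UI):

      library(shiny)
      
      # Sketch only: the hard-coded parameters expressed as Shiny inputs.
      ui <- fluidPage(
        sliderInput("price", "Max Home Price ($)",
                    min = 50000, max = 1000000, value = 400000, step = 25000),
        sliderInput("beds",  "Bedrooms",  min = 1, max = 6, value = c(3, 6)),
        sliderInput("baths", "Bathrooms", min = 1, max = 6, value = c(2.5, 6), step = 0.5),
        sliderInput("dist",  "Driving Time to Yankee Stadium (min)",
                    min = 5, max = 100, value = c(5, 100))
      )
      server <- function(input, output) {}
      # shinyApp(ui, server)  # launch the empty skeleton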
        

####################################################################################################
      # This code belongs in the mapping chunk below, but a variable from it is needed for graphing.

options(tigris_use_cache = TRUE)
char_zips1 <- zctas(cb = TRUE, starts_with = "10")
char_zips2 <- zctas(cb = TRUE, starts_with = "12")

char_zips <- rbind(char_zips1, char_zips2) 

char_zips <- geo_join(char_zips, 
                      ZipDF, 
                      by_sp = "ZCTA5CE10", 
                      by_df = "Zip",
                      how = "inner")



maplayer <- geo_join(char_zips, 
                      Zip_Data_DF, 
                      by_sp = "ZCTA5CE10", 
                      by_df = "Zip",
                      how = "inner")

    # End of mapping code chunk
########################################################################################################



# Normalize variables for graphing
        
        # Attach the ZIP-level metrics to each listing; the select below needs them.
        House_Final <- inner_join(House_Final, Zip_Data_DF, by = "Zip")
        
        House_Final_Long_Filtered <- House_Final %>%
            select(Price,Concat,Driving_Time_Min,Hotness_Score,Median_Days_on_Market,Unique_Views,Median_Listing_Price,Listing_W.Price_Increase_Count,Listing_W.Price_Decrease_Count,Active_Listing_Count_Permonth,Median_Price_Per_Sq_Ft,Median_Income,Our_Age_Group_Percent, Child_Dependency_Ratio, Total_Pop, Price_Clean) %>%
            pivot_longer(-c(Price,Concat,Driving_Time_Min,Price_Clean),names_to = "Var", values_to = "Value")
        
                
        # Per-variable mean and SD, used to z-score each metric for plotting.
        House_Final_Long_Filtered_Stats <- 
          House_Final_Long_Filtered %>%
          group_by(Var) %>%
          summarize(Mean = mean(Value, na.rm = TRUE), SD = sd(Value, na.rm = TRUE))
        
        House_Final_Long_Filtered <- full_join(House_Final_Long_Filtered,House_Final_Long_Filtered_Stats)
        
        

        # Initially built for heat-map variable selection; reused as the list of graphable variables:
        mapvarlist <- names(maplayer)[c(6,9,11,13,14,15,17,18,19,20,21,22)]    

        
# Build Plot 
  
        House_Final_Long_FilteredX <- House_Final_Long_Filtered %>%
                                            filter(Var %in% mapvarlist[c(5,6,9)])
      
          House_Final_Long_FilteredY <- inner_join(House_Final_Long_Filtered, House_Final_Filtered )
          # Base Plot
          plot0 <- House_Final_Long_FilteredX %>%
            ggplot(aes(x=Driving_Time_Min)) + 
            geom_smooth(aes(y=((Value - Mean) / SD), col = Var), se=FALSE) +
            theme_few() + labs(color='Variables') +
            theme(legend.position = "bottom", legend.title = element_blank()) +
            # All listings, plotted faintly for context
            geom_point(aes(y=((Price_Clean - mean(Price_Clean)) / sd(Price_Clean))), 
                       alpha = .02)+ 
            # Listings matching the filter parameters, highlighted
            geom_point(data = House_Final_Long_FilteredY, aes(y=((Price_Clean - mean(House_Final_Long_FilteredX$Price_Clean)) / sd(House_Final_Long_FilteredX$Price_Clean))), 
                       fill = "lightgray")+ 
            labs(title="Price vs Driving Distance to NYC",
                 x = "Driving Time (Min) to Yankee Stadium")+  
            # Left axis in standardized units; right axis back-transforms to
            # dollars via price = z * sd + mean.
            scale_y_continuous(name = "SD from Mean",
                               sec.axis = sec_axis( trans=~.*sd(House_Final_Long_Filtered$Price_Clean) +
                                                      mean(House_Final_Long_Filtered$Price_Clean), name="Price ($)", labels=scales::dollar_format())) 
          
          plot0

Map

After the user has selected homes via the above visual or the selectors, we want to plot these homes on a map.

Var_Input <- "Our_Age_Group_Percent"

# This is the mapping chunk referenced above.


# Import ZIP Code data 
        
        options(tigris_use_cache = TRUE)
        char_zips1 <- zctas(cb = TRUE, starts_with = "10")
        char_zips2 <- zctas(cb = TRUE, starts_with = "12")
        
        char_zips <- rbind(char_zips1, char_zips2) 
        
        char_zips <- geo_join(char_zips, 
                              ZipDF, 
                              by_sp = "ZCTA5CE10", 
                              by_df = "Zip",
                              how = "inner")

# Join Zip Data to mapping data 
        maplayer <- geo_join(char_zips, 
                              Zip_Data_DF, 
                              by_sp = "ZCTA5CE10", 
                              by_df = "Zip",
                              how = "inner")

# Pull the data for the selected variable into a vector
        Map_Value <- maplayer[[Var_Input]]

# Generate palette
                
        pal <- colorNumeric(
          palette = "Spectral",
          domain = Map_Value,
           na.color=rgb(0,0,0,0))
  
# Map Labels 
                
        labels <- 
          paste0(
            "Zip Code: ",
            maplayer$ZCTA5CE10, "<br/>",
            paste0(Var_Input, ": "),
            round(Map_Value,2)) %>%
          lapply(htmltools::HTML)

# Generate Map
                
        m <- 
          maplayer %>%
          leaflet() %>%
        #  addTiles() %>%  # Add default OpenStreetMap map tiles
          addProviderTiles("CartoDB") %>% 
         # addMarkers(lng=House_Final$Lon, lat=House_Final$Lat, popup=House_Final$Address) %>%
          addMarkers(lng=House_Final_Filtered$Lon, lat=House_Final_Filtered$Lat, popup=paste(House_Final_Filtered$Address,"\n",House_Final_Filtered$Price),
          clusterOptions = markerClusterOptions()) %>%
         addPolygons(fillColor = ~pal(Map_Value),
                      weight = 2,
                      opacity = 1,
                      color = "white",
                      dashArray = "3",
                      fillOpacity = 0.7,
                      highlight = highlightOptions(weight = 2,
                                                   color = "#666",
                                                   dashArray = "",
                                                   fillOpacity = 0.7,
                                                   bringToFront = TRUE),
                      label = labels) %>%
          # add legend
          addLegend(pal = pal, 
                    values = ~Map_Value, 
                    opacity = 0.7, 
                    title = htmltools::HTML(paste0(Var_Input, "<br>by Zip Code")),
                    position = "bottomright",
                    na.label = "")
# Print Map
m

Findings

We found that homes fitting these parameters are available on the northern end of the area we were considering. The plot above also suggests this area is relatively desirable, given the higher proportion of people in our age range and the lower price per square foot. We also see a higher-than-average new listing count in this region, which is favorable.

Regarding next steps, we might expand our search and consider the area slightly southeast of this region: namely, the eastern section of Putnam County and the southern edge of neighboring Dutchess County in East Fishkill.

References

Explore Census Data, Table S0101: Age and Sex. (n.d.). https://data.census.gov/cedsci/table?q=S0101

Explore Census Data, Table S1901: Income in the Past 12 Months. (n.d.). https://data.census.gov/cedsci/table?q=S1901

Realtor.com Economic Research. (2021, April 29). Real Estate Data and Market Trends for Download. https://www.realtor.com/research/data/