Below are a few maps, data visualizations, and models I generated from data scraped from an array of internet sources. These techniques are likely rudimentary for Hometap, and the results would only be fixed, microeconomically focused data points within your presumably more macroeconomic models that analyze changes over long periods. I find data analysis extremely interesting, and I earnestly enjoyed putting this applied data science sample together for you. Let me know if you have any questions.

1) Foreclosed Property Location Clusters

Foreclosures can represent both incoming low-price investment opportunities and areas with concentrations of low-credit homeowners. Let’s take a look at foreclosures in Los Angeles County to see where foreclosure clusters appear.

INSIGHT: The scatterplot gives us a general idea of where these foreclosures occur; when mapped, the clusters show strong concentrations near Long Beach, Beverly Hills, and the city of Los Angeles. Drilling into individual clusters, San Pedro has an unusually high number of foreclosures, while Inglewood, Manhattan Beach, View Park, and Hawthorne have very few.

Conclusion: Before using these results to target advertising and investment, further research should compare population and foreclosure clusters using data from a longer time span. Still, this cluster map provides enough insight to consider avoiding or focusing on certain areas of Los Angeles County, and the same method can be reapplied to other US cities.

#Data Set Load
library(readxl)
library(mosaic)    #attaches ggformula (gf_point, gf_lm) and dplyr
library(leaflet)   #interactive maps used below
foreclosuredata <- read_excel("~/Downloads/foreclosuredata.xlsx")

#Quick look: where do most foreclosures happen?
#(latitude on the y-axis, longitude on the x-axis, so the scatter reads like a map)
gf_point(lat ~ lng, data = foreclosuredata) %>%
  gf_lm()

#Map: where exactly are these foreclosures happening in Los Angeles County?
foreclosuremap <- leaflet(foreclosuredata) %>%
  addTiles() %>%
  addMarkers(clusterOptions = markerClusterOptions(),
             lng = ~lng, lat = ~lat, label = ~PropertyZip)
foreclosuremap
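To put numbers behind the cluster map, the same data can be tabulated by ZIP code. This is a minimal sketch, assuming PropertyZip holds the ZIP code for each foreclosure and that dplyr is attached (mosaic loads it):

#Count foreclosures per ZIP code to quantify what the cluster map shows
foreclosurecounts <- foreclosuredata %>%
  count(PropertyZip, sort = TRUE)
head(foreclosurecounts, 10)   #ZIP codes with the most foreclosures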

2) Predicting House Prices

While house prices are affected by factors ranging from crime rates to local housing supply, a house’s own characteristics can suggest what its true value is. Let’s look at house sales data from King County, WA to see which factors have the strongest association with sale price.

Modeling: Using stepwise selection, judged on Mallows’ Cp (lower = less biased) and adjusted R-squared (higher = more predictive), the best model for predicting sale price uses seven of the ten predictors tested: number of bathrooms, square footage of living space, number of floors, whether the property is on the water, the year the house was built, its condition score, and its overall grade. All seven predictors are highly significant (very small p-values), and the model explains about 63.9% of the variation in log sale price, so these house characteristics together account for a substantial share of the differences in price.

Interpretations (remembering that the response is log price): 1) Each additional bathroom is associated with roughly an 8% higher sale price (exp(0.080) ≈ 1.08). 2) Being on the water is associated with roughly a 67% higher price (exp(0.513) ≈ 1.67), holding the other predictors fixed. 3) A house’s overall grade matters more than its condition score (about 26% per grade point versus about 4% per condition point). 4) The negative year-built coefficient (newer houses predicted to sell for slightly less, all else equal) means house age deserves a closer look. 5) The per-square-foot effect is small (about 1.6% per additional 100 square feet), so living area alone is not the dominant driver of price.

Interaction: To test whether waterfront houses get better grades and therefore higher sale prices, an interaction plot was used. The result: waterfront houses received a wide range of grades; while both waterfront status and grade predict higher sale prices, the two predictors do not appear to reinforce each other through an interaction.

#Dataset Load
pricedata <- read_excel("~/Downloads/kc_house_data.xlsx")

mean(~price, data=pricedata)
## [1] 540088.1
#Mean sale price is about $540,000

pricedata <- pricedata %>%
  mutate(logpricethousands = log(price/1000))
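The log transform is used because sale prices are strongly right-skewed; a quick histogram of the transformed variable (a small sketch using the same gf_ plotting functions) can confirm that its distribution is roughly symmetric:

#Sanity check: log(price/1000) should be much less skewed than raw price
gf_histogram(~logpricethousands, data = pricedata)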



#Stepwise model selection with regsubsets() from the leaps package
library(leaps)
stepwise <- regsubsets(logpricethousands ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + waterfront + yr_built + yr_renovated + condition + grade,
                       data = pricedata,
                       method = "seqrep",
                       nbest = 1)

with(summary(stepwise), data.frame(cp, adjr2, outmat))
##                  cp     adjr2 bedrooms bathrooms sqft_living sqft_lot floors
## 1  ( 1 ) 8754.67388 0.4950776                                               
## 2  ( 1 ) 4647.15704 0.5633952                                               
## 3  ( 1 ) 1322.30545 0.6187029                              *                
## 4  ( 1 )  864.91278 0.6263256                  *           *                
## 5  ( 1 )  441.01395 0.6333920                  *           *                
## 6  ( 1 )  218.57049 0.6371082                  *           *               *
## 7  ( 1 )   88.49688 0.6392883                  *           *               *
## 8  ( 1 ) 5549.09205 0.5484703        *         *           *        *      *
##          waterfront yr_built yr_renovated condition grade
## 1  ( 1 )                                                *
## 2  ( 1 )                   *                            *
## 3  ( 1 )                   *                            *
## 4  ( 1 )                   *                            *
## 5  ( 1 )          *        *                            *
## 6  ( 1 )          *        *                            *
## 7  ( 1 )          *        *                      *     *
## 8  ( 1 )          *        *            *
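Rather than reading the winning model off the table by eye, the same choice can be made programmatically from the regsubsets summary. A short sketch (the seven-predictor model has the lowest Cp and highest adjusted R-squared among the candidates above):

stepsum <- summary(stepwise)
which.min(stepsum$cp)       #model size with the lowest Mallows' Cp
which.max(stepsum$adjr2)    #model size with the highest adjusted R-squared
coef(stepwise, 7)           #coefficients of the chosen seven-predictor model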
pricemodel1 <- lm(logpricethousands ~ bathrooms + sqft_living + floors + waterfront + yr_built + condition + grade, data=pricedata)
summary(pricemodel1)
## 
## Call:
## lm(formula = logpricethousands ~ bathrooms + sqft_living + floors + 
##     waterfront + yr_built + condition + grade, data = pricedata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.65347 -0.21219  0.01649  0.21334  1.36252 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.487e+01  1.869e-01   79.56   <2e-16 ***
## bathrooms    7.995e-02  4.871e-03   16.41   <2e-16 ***
## sqft_living  1.591e-04  4.384e-06   36.28   <2e-16 ***
## floors       8.151e-02  4.983e-03   16.36   <2e-16 ***
## waterfront   5.126e-01  2.507e-02   20.45   <2e-16 ***
## yr_built    -5.720e-03  9.630e-05  -59.40   <2e-16 ***
## condition    4.120e-02  3.592e-03   11.47   <2e-16 ***
## grade        2.326e-01  3.066e-03   75.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3163 on 21605 degrees of freedom
## Multiple R-squared:  0.6394, Adjusted R-squared:  0.6393 
## F-statistic:  5473 on 7 and 21605 DF,  p-value: < 2.2e-16
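Because the response is log(price/1000), the raw coefficients are not dollar amounts: exponentiating a coefficient gives an approximate multiplicative effect on price. A short back-transformation sketch:

#Convert log-scale coefficients into approximate percentage effects on sale price
round((exp(coef(pricemodel1)[-1]) - 1) * 100, 1)
#e.g. bathrooms: exp(0.080) - 1 is about 8.3%; waterfront: exp(0.513) - 1 is about 67%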
#take a look at the residuals
gf_point(resid(pricemodel1)~fitted(pricemodel1))

gf_qq(~resid(pricemodel1))

#waterfront vs. non-waterfront sale prices by house grade
gf_point(logpricethousands ~ grade, color= ~ifelse(waterfront==1, "waterfront property", "non-waterfront property"), data=pricedata) %>% 
  gf_lm()
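The plot above only overlays separate fitted lines for waterfront and non-waterfront homes; a more formal check is to add a grade-by-waterfront interaction term and see whether it is significant. A sketch, not part of the original output:

#Formal interaction check: does the effect of grade differ for waterfront houses?
interactionmodel <- lm(logpricethousands ~ grade * waterfront, data = pricedata)
summary(interactionmodel)   #inspect the grade:waterfront row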

3) Locations of Highest and Lowest Property Prices

Million-dollar properties sometimes aren’t as liquid as cheaper ones, which can lead to disproportionately low sale prices. Conversely, cheaper properties in neighborhoods with no room left to build may be valued much higher ten years from now. Mapping price quintiles of sold properties in an area identifies which neighborhoods feature low-, medium-, and high-priced houses. That data could provide a useful input for targeted advertising initiatives and for estimating investment terms for prospective customers.

Insight: The quintile heat map shows a strong concentration of expensive properties near Richmond Hill. While Mary Lake is surrounded mostly by lower-priced houses, the map shows that the actual number of homes near the lake is very small. In addition, Niagara Falls features a large number of properties at low to medium prices.

#Dataset Load
propertiesdata <- read_excel("~/Downloads/properties.xlsx")
## New names:
## * `` -> ...1
view(propertiesdata)
#Remove points with coordinates far outside the study area
propertiesdata <- propertiesdata %>%
  filter(lat > 35, lng > -125)

#Color palette: split Price into five quantile bins (quintiles)
qpal <- colorQuantile("Set1", propertiesdata$Price, n = 5)

map <- leaflet(propertiesdata) %>% addTiles() %>%
  addCircleMarkers(
    radius = 3,
    color = ~qpal(Price),
    fillOpacity = 1
  ) %>%
  addLegend(pal = qpal, values = ~Price, opacity = 1)
## Assuming "lng" and "lat" are longitude and latitude, respectively
map
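To make the quintile map easier to act on, the quintiles can also be cut explicitly and counted by neighborhood. A sketch, assuming the AreaName column identifies neighborhoods (as in the outlier table further below):

#Tabulate how many sold properties in each area fall into each price quintile
propertiesdata %>%
  mutate(PriceQuintile = ntile(Price, 5)) %>%
  count(AreaName, PriceQuintile) %>%
  arrange(desc(n)) %>%
  head(10)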

Are there any outliers in our data set? Let’s find out by plotting Cook’s distances.

Answer: the three biggest outliers occurred in Beverley, London, and Greenbelt, ON, and can easily be removed from our data set.

proppricemodel1 <- lm(Price ~ lng + lat, data = propertiesdata)
influencestats <- ls.diag(proppricemodel1)
cooks <- data.frame(d = influencestats$cooks)

# identify seriously influential points using the 4/(n - k - 1) rule of thumb
MajorInfluence <- filter(cooks, d > 4/(nrow(propertiesdata) - 2 - 1))

# find highest 3 outliers
highest.Cook <- propertiesdata %>% 
  mutate(CookD = cooks$d) %>%
  arrange(CookD) 
tail(highest.Cook,n=3)
## # A tibble: 3 x 7
##    ...1 Address                          AreaName     Price   lat     lng  CookD
##   <dbl> <chr>                            <chr>        <dbl> <dbl>   <dbl>  <dbl>
## 1 39754 LOT 35 Clover St Kingston, ON    Beverley    429900  53.9  -0.423 0.0237
## 2 39161 34 MAJESTIC Court St. Thomas, ON London      424900  51.4  -0.163 0.0351
## 3 36924 4740 HIGH RD Ottawa, ON          Greenbelt 32500000  45.3 -75.6   0.0720
#chart of Cook's distances of biggest outliers
twentymodel <- lm(Price~lng+lat, data=highest.Cook)
mplot(twentymodel , which=4) 
## [[1]]
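Since the largest outliers can simply be dropped, here is a quick sketch of removing every seriously influential row (using the same 4/(n - k - 1) rule of thumb) and refitting the location-only model:

#Drop the most influential points and refit to see how much the coefficients change
cleaned <- highest.Cook %>%
  filter(CookD <= 4/(nrow(highest.Cook) - 2 - 1))
proppricemodel2 <- lm(Price ~ lng + lat, data = cleaned)
summary(proppricemodel2)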

Summary

This data science sample demonstrates a few of my skills, including data visualization, mapping, automated (stepwise) predictive model selection, outlier analysis, and multiple linear regression.

While these techniques are likely rudimentary and would add only small data points to your much larger models, they represent a few subjects and data science strategies I find very interesting.

Thank you for your time; I hope you and your family stay safe during these times. Here are two Amherst treats to finish up:

Division III Women's Basketball Titles since 2010

library(plotrix)
## 
## Attaching package: 'plotrix'
## The following object is masked from 'package:mosaic':
## 
##     rescale
labels <- c("Amherst (3)", "All Other 400+ DIII Schools (7)")
pie3D(c(3, 7), col = c('purple', 'gray'), labels = labels, explode=0.1)

Time I've Invested at Amherst So Far

library(qcc)
## Package 'qcc' version 2.7
## Type 'citation("qcc")' for citing this R package in publications.
Percent <- c(40, 30, 20, 10)   #percentage of time spent on each category

optionsPareto <- Percent
names(optionsPareto) <- c("Homework", "DIII Sports", "Val", "Reading Pres. Martin's Emails")
pareto.chart(optionsPareto, ylab = "Percentage of Time Investment", xlab = "Time Investment Type", main = "Pareto Chart of Time Invested at Amherst", cex.names = 0.6, las = 1)

##                                
## Pareto chart analysis for optionsPareto
##                                 Frequency Cum.Freq. Percentage Cum.Percent.
##   Homework                             40        40         40           40
##   DIII Sports                          30        70         30           70
##   Val                                  20        90         20           90
##   Reading Pres. Martin's Emails        10       100         10          100