Pset 5 - Plotly, Shiny, Interactions, Splines - Part 1 and 2

library(leaflet)
library(sf)
library(sp)
library(splines)
library(tidyverse)
library(pubtheme)

Objectives

Continued data exploration of US Census data and CT property values data using interactive visualization libraries like leaflet, plotly and shiny
Include Interaction terms in a linear model
Include spline terms in a linear model

Choosing a home location

A prospective homeowner is interested in buying a house in Connecticut. Since she works in New York City near Grand Central Terminal and her partner works in New Haven, she has the following preferences for the location of her new home:

between NYC and New Haven,
relatively close to a Metro-North train station,
special preference towards stations with many options for convenient trains traveling to Grand Central Station and New Haven.

She also prefers to live in a town where

the house prices aren’t that high, and
the population is relatively diverse, and where at least 5% of the population is Asian.

1. Visualization

Create an interactive visualization or visualizations that she can use to help her make her decision. Use the data below, which contains

US Census data (same as what we previously used in class)
Latitude/longitude for Metro-North stations.
For every pair of stations, information about trips between those two stations: number of trips, the mean duration, and the minimum duration. This has been filtered so that only trips to/from Grand Central, to/from New Haven, and along the train lines that are in Connecticut (New Haven Line, Danbury Line, Waterbury Line, New Canaan Line) remain.

Here is the data with some light cleaning to help you get started.

Census data

dc = readRDS('data/tracts.and.census.with.EV.stations.rds')
dc = dc[dc$state == 'CT',]
head(dc@data,2)

      STATEFP COUNTYFP TRACTCE             AFFGEOID       GEOID NAME LSAD
12844      09      001  090300 1400000US09001090300 09001090300  903   CT
12845      09      001  090400 1400000US09001090400 09001090400  904   CT
        ALAND AWATER  meters    miles state tract           county  state.full
12844 4764507      0 4764507 1.839586    CT   903 Fairfield County Connecticut
12845 7347827      0 7347827 2.837012    CT   904 Fairfield County Connecticut
       pop male female  age male.age female.age white black indian.alaskan
12844 4611 2333   2278 42.9     42.7       43.1  4230    24              0
12845 6518 3355   3163 40.6     36.5       45.7  4742   654             78
      asian pacific other two.or.more white.not.hisp hisp white.hisp black.hisp
12844   180       0    66         111           3888  435        342          0
12845   524      24    88         408           4324  613        418          0
      households i10orless i10to14 i15to19 i20to24 i25to29 i30to34 i35to39
12844       1550        56       8      22      34      16      12      31
12845       2140        24      18      85       0      89      36      49
      i40to44 i45to49 i50to59 i60to74 i75to99 i100to124 i125to149 i150to199
12844      19      43      59      67     231       204       197       261
12845       0      94      24     128     219       343       233       421
      i200ormore hh.income house.value   male.p female.p  white.p    black.p
12844        290    118819      372100 50.59640 49.40360 91.73715  0.5204945
12845        377    121802      344600 51.47284 48.52716 72.75238 10.0337527
       asian.p   hisp.p white.not.hisp.p white.hisp.p black.hisp.p  other.p
12844 3.903709 9.433962         84.32010     7.417046            0 3.838647
12845 8.039276 9.404725         66.33937     6.413010            0 9.174593
      rescaled.house.value hh.income.and.house tot.hh.income tot.house.value
12844             80487.25            99653.12     184169450       576755000
12845             76759.97            99280.99     260656280       737444000
      tot.hh.income.and.house pop.density hh.density income.density
12844               154462344    2506.542   842.5807      100114594
12845               212461309    2297.488   754.3148       91877050
      house.value.density house.and.income.density lev2 lev3
12844           313524273                 83965798    2    4
12845           259936875                 74889116   NA   NA

Train stations

st = read.csv('data/stops.txt')
st = st %>%
  select(stop_id, stop_code, stop_name, stop_lat, stop_lon)
st

    stop_id stop_code              stop_name stop_lat  stop_lon
1         1       0NY          Grand Central 40.75300 -73.97706
2         4       0HL          Harlem-125 St 40.80516 -73.93915
3       622       0YS       Yankees-E 153 St 40.82530 -73.92990
4       184       0HI        Highbridge Yard 40.83590 -73.93150
5         9       0MH         Morris Heights 40.85425 -73.91958
6        10       0UH     University Heights 40.86225 -73.91312
7        11       0MB            Marble Hill 40.87433 -73.91094
8        14       0DV         Spuyten Duyvil 40.87824 -73.92145
9        16       0RV              Riverdale 40.90398 -73.91413
10       17       0LU                 Ludlow 40.92497 -73.90461
11       18       0YK                Yonkers 40.93579 -73.90267
12       19       0GD               Glenwood 40.95050 -73.89906
13       20       0GY              Greystone 40.97270 -73.88907
14       22       0HS     Hastings-on-Hudson 40.99411 -73.88451
15       23       0DF            Dobbs Ferry 41.01246 -73.87949
16       24       0AR      Ardsley-on-Hudson 41.02620 -73.87654
17       25       0IV              Irvington 41.03999 -73.87308
18       27       0TT              Tarrytown 41.07647 -73.86456
19       29       0PM         Philipse Manor 41.09492 -73.86975
20       30       0SB            Scarborough 41.13576 -73.86616
21       31       0OS               Ossining 41.15766 -73.86928
22       33       0HM          Croton-Harmon 41.18990 -73.88239
23       37       0CT              Cortlandt 41.24626 -73.92188
24       39       0PE              Peekskill 41.28596 -73.93042
25       40       0MN                Manitou 41.33260 -73.97043
26       42       0GA               Garrison 41.38178 -73.94720
27       43       0CS            Cold Spring 41.41528 -73.95809
28       44       0BK        Breakneck Ridge 41.45018 -73.98245
29       46       0BC                 Beacon 41.50401 -73.98453
30       49       0NM            New Hamburg 41.58745 -73.94723
31       51       0PO           Poughkeepsie 41.70584 -73.93795
32       54       1ML                Melrose 40.82576 -73.91523
33       55       1TR                Tremont 40.84730 -73.89955
34       56       1FO                Fordham 40.86150 -73.89058
35       57       1BG       Botanical Garden 40.86655 -73.88311
36       58       1WG        Williams Bridge 40.87857 -73.87106
37       59       1WN               Woodlawn 40.89536 -73.86292
38       61       1WF              Wakefield 40.90594 -73.85568
39       62       1MW         Mt Vernon West 40.91214 -73.85113
40       64       1FW              Fleetwood 40.92699 -73.83948
41       65       1BX             Bronxville 40.93978 -73.83521
42       66       1TK               Tuckahoe 40.94939 -73.83017
43       68       1CW              Crestwood 40.95900 -73.82056
44       71       1SC              Scarsdale 40.98917 -73.80863
45       72       1HA              Hartsdale 41.01033 -73.79641
46       74       1WP           White Plains 41.03259 -73.77521
47       76       1NW     North White Plains 41.04981 -73.77314
48       78       1VA               Valhalla 41.07282 -73.77260
49       79       1MP            Mt Pleasant 41.09588 -73.79382
50       80       1HN              Hawthorne 41.10858 -73.79625
51       81       1PV          Pleasantville 41.13522 -73.79266
52       83       1CQ              Chappaqua 41.15801 -73.77488
53       84       1MK               Mt Kisco 41.20824 -73.72978
54       85       1BH          Bedford Hills 41.23732 -73.69994
55       86       1KA                Katonah 41.25955 -73.68416
56       88       1GO         Goldens Bridge 41.29434 -73.67766
57       89       1PY                Purdy's 41.32578 -73.65906
58       90       1CF           Croton Falls 41.34772 -73.66227
59       91       1BW               Brewster 41.39447 -73.61980
60       94       1BR              Southeast 41.41320 -73.62379
61       97       1PA              Patterson 41.51183 -73.60458
62       98       1PW                Pawling 41.56421 -73.60052
63       99       1AT      Appalachian Trail 41.59287 -73.58803
64      100       1WI Harlem Valley-Wingdale 41.63752 -73.57145
65      101       1DO           Dover Plains 41.74040 -73.57650
66      176       1TM          Tenmile River 41.77994 -73.55820
67      177       1WA                Wassaic 41.81472 -73.56220
68      105       2ME        Mt Vernon East  40.91216 -73.83218
69      106       2PH                 Pelham 40.91032 -73.81024
70      108       2NR           New Rochelle 40.91161 -73.78381
71      110       2LA              Larchmont 40.93339 -73.75979
72      111       2MA             Mamaroneck 40.95406 -73.73613
73      112       2HS               Harrison 40.96943 -73.71296
74      114       2RY                    Rye 40.98592 -73.68255
75      115       2PC           Port Chester 41.00073 -73.66470
76      116       2GN              Greenwich 41.02128 -73.62462
77      118       2CC                Cos Cob 41.03017 -73.59831
78      120       2RS              Riverside 41.03168 -73.58817
79      121       2OG          Old Greenwich 41.03382 -73.56586
80      124       2SM               Stamford 41.04661 -73.54285
81      153       3GB              Glenbrook 41.07055 -73.52002
82      651       3SP Springdale MW Facility 41.08215 -73.51968
83      154       3SD             Springdale 41.08876 -73.51783
84      155       3TH          Talmadge Hill 41.11601 -73.49815
85      157       3NC             New Canaan 41.14630 -73.49563
86      127       2NO        Noroton Heights 41.06904 -73.49788
87      128       2DA                 Darien 41.07691 -73.47297
88      129       2RO               Rowayton 41.07746 -73.44553
89      131       2SN          South Norwalk 41.09673 -73.42113
90      158       4M7              Merritt 7 41.14662 -73.42786
91      160       4WI                 Wilton 41.19620 -73.43243
92      161       4CA             Cannondale 41.21662 -73.42670
93      162       4BV            Branchville 41.26763 -73.44142
94      163       4RD                Redding 41.32568 -73.43380
95      164       4BE                 Bethel 41.37622 -73.41817
96      165       4DN                Danbury 41.39615 -73.44879
97      133       2EN           East Norwalk 41.10400 -73.40459
98      134       2WP               Westport 41.11893 -73.37141
99      136       2GF          Green's Farms 41.12226 -73.31541
100     137       2SP              Southport 41.13484 -73.28897
101     138       2FF              Fairfield 41.14308 -73.25774
102     188       2FM        Fairfield Metro 41.16100 -73.23434
103     140       2BP             Bridgeport 41.17868 -73.18708
104     143       2SR              Stratford 41.19425 -73.13153
105     167       5DB          Derby-Shelton 41.31972 -73.08355
106     168       5AN                Ansonia 41.34416 -73.07989
107     169       5SY                Seymour 41.39514 -73.07250
108     170       5BF           Beacon Falls 41.44175 -73.06359
109     171       5NG              Naugatuck 41.49420 -73.05266
110     172       5WB              Waterbury 41.55273 -73.04613
111     145       2MI                Milford 41.22323 -73.05765
112     190       2WH             West Haven 41.27142 -72.96349
113     149       2NH              New Haven 41.29650 -72.92829
114     151       2SS     New Haven-State St 41.30498 -72.92175

Trips

tr = readRDS('data/trips.rds')
tr = tr %>% 
  arrange(stop1, stop2)
head(tr, 10)

         stop1   tf         stop2    n      mean min
1       Bethel from Grand Central  246 110.19919 104
2       Bethel   to Grand Central  188 117.61702 112
3  Branchville from Grand Central  246  94.19919  88
4  Branchville   to Grand Central  188 100.61702  95
5   Bridgeport from Grand Central 3024  90.20602  73
6   Bridgeport   to Grand Central 2920  98.40651  77
7   Bridgeport   to     New Haven 3066  34.35486  25
8   Bridgeport from     New Haven 2804  25.17511  20
9   Cannondale from Grand Central  246  87.19919  81
10  Cannondale   to Grand Central  188  93.61702  88

Visualizations

library(sf)
library(dplyr)


dc_sf <- st_as_sf(dc)

dc_centroids <- st_centroid(dc_sf)
point.x <- st_coordinates(dc_centroids)[, "X"]
point.y <- st_coordinates(dc_centroids)[, "Y"]

dc_sf$point.x <- point.x
dc_sf$point.y <- point.y

dc_metro = dc_sf %>%
  filter(point.x > -74., point.y > 40.6,
         point.x < -73, point.y < 41.5) 

dc_metro$asian_percent <- dc_metro$asian/dc_metro$pop * 100

dc_metro_opt <- dc_metro %>%
  filter(asian_percent >= 5, house.value < 400000)

st_metro = st %>%
  filter(stop_lon > -73.6, stop_lat > 41.,
         stop_lon < -73, stop_lat < 41.4) 


p <- ggplot() +
  geom_sf(data = dc_metro, fill = "gray80", color = "white") +
  
  geom_sf(data = dc_metro_opt, fill = "green", color = "black", alpha = 0.6) +
  
  geom_point(data = st_metro, aes(x = stop_lon, y = stop_lat),
             color = "blue", size = 2) +
  
  labs(title = "Potential Home Locations in Connecticut",
       subtitle = "counties with median house value ≤ $400k, 
       ≥ 5% Asian population") +
  theme_minimal()

p

2. Explanation

Briefly explain the choices you made for your visualization(s). What do you think the prospective homeowner will find most useful? Are there any locations in particular that stand out?

Obviously, I wanted to limit the plot to Connecticut, so I felt that a map would be the best visualization to use. Therefore, I used geom_sf to map the census data.

I wanted to color in the counties that met the buyer’s criteria in green as a quick visual indicator that they were good. Additionally, given the buyer’s stress on this aspect, I wanted to ensure that the train stations were plotted on the map.

I think it will be useful to have the stations overlaid in blue, as this gives a visual indication of how far away the home might be. For example, on the right of the graph at ~73.1W, the green counties that meet the price and demographic criteria generally do not have a train station very close to them. On the other hand, the smaller green counties closer to/inside of the panhandle have many stations very close to them.

Vaguely, there are three “clumps” of counties that stand out to me—the clump on the east, the clump closer to the panhandle, and the clump in the north. The panhandle counties in particular look like they will be of great interest given that they meet the criteria and they have many train stations that connect directly to Grand Central and will not require a transfer of trains.

Shiny

3. Basic Shiny App

Do the following to create a basic shiny app and publish it to shinyapps.io

Configure R Studio to communicate with shinyapps.io by following the directions from this page https://shiny.rstudio.com/articles/shinyapps.html up to and including the “Method 1” section. You will show you how to install rsconnect, create a shinyapps.io account, and set up communication between R Studio and shinyapps.io. Note that you will only have to do this once.
Create a new shiny app like we did in class by clicking File, New File, Shiny Web App. Name the app pset5app1, choose Single File, choose an appropriate directory, and click Create. An app.R file will open in R Studio.
Click Run App to see the app. The app should appear in the Viewer tab in R Studio, or in its own window.
In the upper right of the app, click Publish. Make sure the correct files and account are selected. Click Publish in the lower right.

library(rsconnect)

Paste the url for your app here. The url will be of the form https://yourusername.shinyapps.io/pset5app1/

https://mcconnell361.shinyapps.io/pset5app1/

4. Shiny App for choosing home location

Create a Shiny app that contains visualization you made in #1, and add at least one widget that allow the user to investigate the data by customizing the visualization in some useful way. For example, you could add a sliderInput or selectInput that helps you filter the data, or a selectInput that allows the user to choose how to color the points, etc.

Different types of user inputs can be found here https://shiny.rstudio.com/gallery/widget-gallery.html

Publish your shiny app to shinyapps.io, and paste the url for your app here:

https://mcconnell361.shinyapps.io/pset5_analysis/

Summarize in a sentence why you choose those user inputs and what you learned about the data from the app.

I chose the price slider to allow the user to quickly visualize what areas of the state might be more or less expensive to live in. While not a widget, I also wanted to be able to hover over and see the data for each of the tracts that fit the criteria as well.

CT Property data

First let’s load the CT property and keep only the New Haven properties.

d = readRDS('data/coast.properties.rds')
d = st_drop_geometry(d)

## I'm gonna change some column names 
colnames(d) = gsub('_', '.', colnames(d))

colnames(d)[colnames(d) == 'Assessed.Total'] = 'value'
colnames(d)[colnames(d) == 'Number.of.Bedroom'] = 'beds'
colnames(d)[colnames(d) == 'Number.of.Baths'] = 'baths'
colnames(d)[colnames(d) == 'Number.of.Half.Baths'] = 'half.baths'
colnames(d)[colnames(d) == 'Living.Area'] = 'living'
colnames(d)[colnames(d) == 'SHAPE.Area'] = 'land'
colnames(d)[colnames(d) == 'Condition.Description'] = 'cond'
colnames(d)[colnames(d) == 'State.Use.Description'] = 'use'
colnames(d)[colnames(d) == 'GlobalID'] = 'id'
colnames(d)[colnames(d) == 'Mailing.Address'] = 'address'
colnames(d)[colnames(d) == 'Mailing.City'] = 'city'
colnames(d)[colnames(d) == 'Town.Name'] = 'town'
colnames(d)[colnames(d) == 'Valuation.Year'] = 'val.year'
colnames(d)[colnames(d) == 'Sale.Price'] = 'sale.price'
colnames(d)[colnames(d) == 'Sale.Date'] = 'sale.date'
colnames(d)[colnames(d) == 'AYB'] = 'ayb'
colnames(d)[colnames(d) == 'EYB'] = 'eyb'

d = d %>%
  
  filter(value != 0, 
         grepl('Single Fam|SINGLE FAM|One Fam|ONE FAM', 
               use), 
         Qualified == 'Q', 
         sale.price != 0, 
         !is.na(living), 
         living != 0, 
         Condition %in% c( 'E', 'EX',             ## excellent
                           'VG', 'G', 'GD',       ## very good, good
                           'A+', 'A', 'AV', 'A-', ## avg+, avg, avg-
                           'F', 'FR',             ## fair
                           'P', 'VP' ),           ## poor, v poor
         !duplicated(id)) %>%       
  
  mutate(cond = case_when(cond == 'EX' ~ 'Excellent', 
                          cond == 'Avarage' ~ 'Average', 
                          cond == 'AV' ~ 'Average',
                          cond == 'F' ~ 'Fair', 
                          cond == 'FR' ~ 'Fair', 
                          cond == 'GD' ~ 'Good', 
                          cond == 'G+' ~ 'Very Good', 
                          TRUE ~ cond), 
         
         dist = as.numeric(dist), 
         land = ifelse(land < 0, -land, land)) %>%
  
  select(value, living, land, 
         beds, baths, half.baths, 
         cond, use, dist, ayb, eyb, 
         sale.price, sale.date,
         address, city, town, id, 
         centroid.x, centroid.y, 
         point.x, point.y)

d = d %>% 
  filter(town == 'NEW HAVEN')

5. Data exploration with `ggplotly` and `plotly`

Recall we noticed what looks like two distinct clouds of points in this visualization.

g = ggplot(d, 
       aes(x = log(living), 
           y = log(value))) + 
  geom_point()

g

Use ggplotly or plotly to make a interactive scatter plot of log(value) vs living. Add color, tooltip, or other features that help you investigate why there appears to be two distinct clouds of points. What characteristics do the properties in the upper right cluster of points have in common?

library(ggplot2)
library(plotly)

yale_x <- -72.9277
yale_y <- 41.3144

p <- ggplot(d, aes(x = living, y = log(value),
                   color = use,
                   text = paste("Town:", town,
                                "<br>Address:", address,
                                "<br>Beds:", beds,
                                "<br>Baths:", baths,
                                "<br>Condition:", cond,
                                "<br>Value:", ayb))) +
  geom_point() +
  
  labs(x = "Living Area",
       y = "Log(Value)",
       title = "log(Value) vs Living Area") +
  theme_minimal()

ggplotly(p, tooltip = "text")

Based on this visualization, it appears that the higher “cloud” of points seem to generally be of better condition than the lower one. From using tooltip and hovering over the points, it seems that the points in the higher cloud generally have conditions of Good, Very Good, and Excellent, while the points in the lower cloud generally have conditions of Average, Fair, and Poor.

6. Data exploration with `leaflet`

Use leaflet to make an interactive map of the properties. Add color, tooltips, or other features that helps you further investigate why there are two clouds of points. Does this map confirm your conclusions from the previous question?

library(leaflet)
library(viridis)

pal <- colorFactor(palette = viridis(length(unique(d$cond))), domain = d$cond)

leaflet(data = d) %>%
  
    addTiles() %>%
  
    addCircleMarkers(
    lng = ~centroid.x, 
    lat = ~centroid.y, 
    radius = 5,
    popup = ~paste0(address, "<br>",
                    "Value: $", value),
    color = ~pal(cond),
    stroke = FALSE,
    fillOpacity = 0.7,
   
  ) %>%
  addLegend("bottomright", 
            pal = pal, 
            values = ~cond,
            title = "Property Condition",
            opacity = 1)

Yes, this map helps to confirm my answer to the last question. After plotting by condition and allowing for the user to read address and value information for each point, it appears that there are certain streets/areas that the higher condition houses are concentrated in; for example, Lexington Avenue has many dark purple dots.