library(leaflet)
library(sf)
library(sp)
library(splines)
library(tidyverse)
library(pubtheme)Pset 5 - Plotly, Shiny, Interactions, Splines - Part 1 and 2
Objectives
- Continued data exploration of US Census data and CT property values data using interactive visualization libraries like
leaflet,plotlyandshiny - Include Interaction terms in a linear model
- Include spline terms in a linear model
Choosing a home location
A prospective homeowner is interested in buying a house in Connecticut. Since she works in New York City near Grand Central Terminal and her partner works in New Haven, she has the following preferences for the location of her new home:
- between NYC and New Haven,
- relatively close to a Metro-North train station,
- special preference towards stations with many options for convenient trains traveling to Grand Central Station and New Haven.
She also prefers to live in a town where
- the house prices aren’t that high, and
- the population is relatively diverse, and where at least 5% of the population is Asian.
1. Visualization
Create an interactive visualization or visualizations that she can use to help her make her decision. Use the data below, which contains
- US Census data (same as what we previously used in class)
- Latitude/longitude for Metro-North stations.
- For every pair of stations, information about trips between those two stations: number of trips, the mean duration, and the minimum duration. This has been filtered so that only trips to/from Grand Central, to/from New Haven, and along the train lines that are in Connecticut (New Haven Line, Danbury Line, Waterbury Line, New Canaan Line) remain.
Here is the data with some light cleaning to help you get started.
Census data
dc = readRDS('data/tracts.and.census.with.EV.stations.rds')
dc = dc[dc$state == 'CT',]
head(dc@data,2) STATEFP COUNTYFP TRACTCE AFFGEOID GEOID NAME LSAD
12844 09 001 090300 1400000US09001090300 09001090300 903 CT
12845 09 001 090400 1400000US09001090400 09001090400 904 CT
ALAND AWATER meters miles state tract county state.full
12844 4764507 0 4764507 1.839586 CT 903 Fairfield County Connecticut
12845 7347827 0 7347827 2.837012 CT 904 Fairfield County Connecticut
pop male female age male.age female.age white black indian.alaskan
12844 4611 2333 2278 42.9 42.7 43.1 4230 24 0
12845 6518 3355 3163 40.6 36.5 45.7 4742 654 78
asian pacific other two.or.more white.not.hisp hisp white.hisp black.hisp
12844 180 0 66 111 3888 435 342 0
12845 524 24 88 408 4324 613 418 0
households i10orless i10to14 i15to19 i20to24 i25to29 i30to34 i35to39
12844 1550 56 8 22 34 16 12 31
12845 2140 24 18 85 0 89 36 49
i40to44 i45to49 i50to59 i60to74 i75to99 i100to124 i125to149 i150to199
12844 19 43 59 67 231 204 197 261
12845 0 94 24 128 219 343 233 421
i200ormore hh.income house.value male.p female.p white.p black.p
12844 290 118819 372100 50.59640 49.40360 91.73715 0.5204945
12845 377 121802 344600 51.47284 48.52716 72.75238 10.0337527
asian.p hisp.p white.not.hisp.p white.hisp.p black.hisp.p other.p
12844 3.903709 9.433962 84.32010 7.417046 0 3.838647
12845 8.039276 9.404725 66.33937 6.413010 0 9.174593
rescaled.house.value hh.income.and.house tot.hh.income tot.house.value
12844 80487.25 99653.12 184169450 576755000
12845 76759.97 99280.99 260656280 737444000
tot.hh.income.and.house pop.density hh.density income.density
12844 154462344 2506.542 842.5807 100114594
12845 212461309 2297.488 754.3148 91877050
house.value.density house.and.income.density lev2 lev3
12844 313524273 83965798 2 4
12845 259936875 74889116 NA NA
Train stations
st = read.csv('data/stops.txt')
st = st %>%
select(stop_id, stop_code, stop_name, stop_lat, stop_lon)
st stop_id stop_code stop_name stop_lat stop_lon
1 1 0NY Grand Central 40.75300 -73.97706
2 4 0HL Harlem-125 St 40.80516 -73.93915
3 622 0YS Yankees-E 153 St 40.82530 -73.92990
4 184 0HI Highbridge Yard 40.83590 -73.93150
5 9 0MH Morris Heights 40.85425 -73.91958
6 10 0UH University Heights 40.86225 -73.91312
7 11 0MB Marble Hill 40.87433 -73.91094
8 14 0DV Spuyten Duyvil 40.87824 -73.92145
9 16 0RV Riverdale 40.90398 -73.91413
10 17 0LU Ludlow 40.92497 -73.90461
11 18 0YK Yonkers 40.93579 -73.90267
12 19 0GD Glenwood 40.95050 -73.89906
13 20 0GY Greystone 40.97270 -73.88907
14 22 0HS Hastings-on-Hudson 40.99411 -73.88451
15 23 0DF Dobbs Ferry 41.01246 -73.87949
16 24 0AR Ardsley-on-Hudson 41.02620 -73.87654
17 25 0IV Irvington 41.03999 -73.87308
18 27 0TT Tarrytown 41.07647 -73.86456
19 29 0PM Philipse Manor 41.09492 -73.86975
20 30 0SB Scarborough 41.13576 -73.86616
21 31 0OS Ossining 41.15766 -73.86928
22 33 0HM Croton-Harmon 41.18990 -73.88239
23 37 0CT Cortlandt 41.24626 -73.92188
24 39 0PE Peekskill 41.28596 -73.93042
25 40 0MN Manitou 41.33260 -73.97043
26 42 0GA Garrison 41.38178 -73.94720
27 43 0CS Cold Spring 41.41528 -73.95809
28 44 0BK Breakneck Ridge 41.45018 -73.98245
29 46 0BC Beacon 41.50401 -73.98453
30 49 0NM New Hamburg 41.58745 -73.94723
31 51 0PO Poughkeepsie 41.70584 -73.93795
32 54 1ML Melrose 40.82576 -73.91523
33 55 1TR Tremont 40.84730 -73.89955
34 56 1FO Fordham 40.86150 -73.89058
35 57 1BG Botanical Garden 40.86655 -73.88311
36 58 1WG Williams Bridge 40.87857 -73.87106
37 59 1WN Woodlawn 40.89536 -73.86292
38 61 1WF Wakefield 40.90594 -73.85568
39 62 1MW Mt Vernon West 40.91214 -73.85113
40 64 1FW Fleetwood 40.92699 -73.83948
41 65 1BX Bronxville 40.93978 -73.83521
42 66 1TK Tuckahoe 40.94939 -73.83017
43 68 1CW Crestwood 40.95900 -73.82056
44 71 1SC Scarsdale 40.98917 -73.80863
45 72 1HA Hartsdale 41.01033 -73.79641
46 74 1WP White Plains 41.03259 -73.77521
47 76 1NW North White Plains 41.04981 -73.77314
48 78 1VA Valhalla 41.07282 -73.77260
49 79 1MP Mt Pleasant 41.09588 -73.79382
50 80 1HN Hawthorne 41.10858 -73.79625
51 81 1PV Pleasantville 41.13522 -73.79266
52 83 1CQ Chappaqua 41.15801 -73.77488
53 84 1MK Mt Kisco 41.20824 -73.72978
54 85 1BH Bedford Hills 41.23732 -73.69994
55 86 1KA Katonah 41.25955 -73.68416
56 88 1GO Goldens Bridge 41.29434 -73.67766
57 89 1PY Purdy's 41.32578 -73.65906
58 90 1CF Croton Falls 41.34772 -73.66227
59 91 1BW Brewster 41.39447 -73.61980
60 94 1BR Southeast 41.41320 -73.62379
61 97 1PA Patterson 41.51183 -73.60458
62 98 1PW Pawling 41.56421 -73.60052
63 99 1AT Appalachian Trail 41.59287 -73.58803
64 100 1WI Harlem Valley-Wingdale 41.63752 -73.57145
65 101 1DO Dover Plains 41.74040 -73.57650
66 176 1TM Tenmile River 41.77994 -73.55820
67 177 1WA Wassaic 41.81472 -73.56220
68 105 2ME Mt Vernon East 40.91216 -73.83218
69 106 2PH Pelham 40.91032 -73.81024
70 108 2NR New Rochelle 40.91161 -73.78381
71 110 2LA Larchmont 40.93339 -73.75979
72 111 2MA Mamaroneck 40.95406 -73.73613
73 112 2HS Harrison 40.96943 -73.71296
74 114 2RY Rye 40.98592 -73.68255
75 115 2PC Port Chester 41.00073 -73.66470
76 116 2GN Greenwich 41.02128 -73.62462
77 118 2CC Cos Cob 41.03017 -73.59831
78 120 2RS Riverside 41.03168 -73.58817
79 121 2OG Old Greenwich 41.03382 -73.56586
80 124 2SM Stamford 41.04661 -73.54285
81 153 3GB Glenbrook 41.07055 -73.52002
82 651 3SP Springdale MW Facility 41.08215 -73.51968
83 154 3SD Springdale 41.08876 -73.51783
84 155 3TH Talmadge Hill 41.11601 -73.49815
85 157 3NC New Canaan 41.14630 -73.49563
86 127 2NO Noroton Heights 41.06904 -73.49788
87 128 2DA Darien 41.07691 -73.47297
88 129 2RO Rowayton 41.07746 -73.44553
89 131 2SN South Norwalk 41.09673 -73.42113
90 158 4M7 Merritt 7 41.14662 -73.42786
91 160 4WI Wilton 41.19620 -73.43243
92 161 4CA Cannondale 41.21662 -73.42670
93 162 4BV Branchville 41.26763 -73.44142
94 163 4RD Redding 41.32568 -73.43380
95 164 4BE Bethel 41.37622 -73.41817
96 165 4DN Danbury 41.39615 -73.44879
97 133 2EN East Norwalk 41.10400 -73.40459
98 134 2WP Westport 41.11893 -73.37141
99 136 2GF Green's Farms 41.12226 -73.31541
100 137 2SP Southport 41.13484 -73.28897
101 138 2FF Fairfield 41.14308 -73.25774
102 188 2FM Fairfield Metro 41.16100 -73.23434
103 140 2BP Bridgeport 41.17868 -73.18708
104 143 2SR Stratford 41.19425 -73.13153
105 167 5DB Derby-Shelton 41.31972 -73.08355
106 168 5AN Ansonia 41.34416 -73.07989
107 169 5SY Seymour 41.39514 -73.07250
108 170 5BF Beacon Falls 41.44175 -73.06359
109 171 5NG Naugatuck 41.49420 -73.05266
110 172 5WB Waterbury 41.55273 -73.04613
111 145 2MI Milford 41.22323 -73.05765
112 190 2WH West Haven 41.27142 -72.96349
113 149 2NH New Haven 41.29650 -72.92829
114 151 2SS New Haven-State St 41.30498 -72.92175
Trips
tr = readRDS('data/trips.rds')
tr = tr %>%
arrange(stop1, stop2)
head(tr, 10) stop1 tf stop2 n mean min
1 Bethel from Grand Central 246 110.19919 104
2 Bethel to Grand Central 188 117.61702 112
3 Branchville from Grand Central 246 94.19919 88
4 Branchville to Grand Central 188 100.61702 95
5 Bridgeport from Grand Central 3024 90.20602 73
6 Bridgeport to Grand Central 2920 98.40651 77
7 Bridgeport to New Haven 3066 34.35486 25
8 Bridgeport from New Haven 2804 25.17511 20
9 Cannondale from Grand Central 246 87.19919 81
10 Cannondale to Grand Central 188 93.61702 88
Visualizations
library(sf)
library(dplyr)
dc_sf <- st_as_sf(dc)
dc_centroids <- st_centroid(dc_sf)
point.x <- st_coordinates(dc_centroids)[, "X"]
point.y <- st_coordinates(dc_centroids)[, "Y"]
dc_sf$point.x <- point.x
dc_sf$point.y <- point.y
dc_metro = dc_sf %>%
filter(point.x > -74., point.y > 40.6,
point.x < -73, point.y < 41.5)
dc_metro$asian_percent <- dc_metro$asian/dc_metro$pop * 100
dc_metro_opt <- dc_metro %>%
filter(asian_percent >= 5, house.value < 400000)
st_metro = st %>%
filter(stop_lon > -73.6, stop_lat > 41.,
stop_lon < -73, stop_lat < 41.4)
p <- ggplot() +
geom_sf(data = dc_metro, fill = "gray80", color = "white") +
geom_sf(data = dc_metro_opt, fill = "green", color = "black", alpha = 0.6) +
geom_point(data = st_metro, aes(x = stop_lon, y = stop_lat),
color = "blue", size = 2) +
labs(title = "Potential Home Locations in Connecticut",
subtitle = "counties with median house value ≤ $400k,
≥ 5% Asian population") +
theme_minimal()
p2. Explanation
Briefly explain the choices you made for your visualization(s). What do you think the prospective homeowner will find most useful? Are there any locations in particular that stand out?
Obviously, I wanted to limit the plot to Connecticut, so I felt that a map would be the best visualization to use. Therefore, I used geom_sf to map the census data.
I wanted to color in the counties that met the buyer’s criteria in green as a quick visual indicator that they were good. Additionally, given the buyer’s stress on this aspect, I wanted to ensure that the train stations were plotted on the map.
I think it will be useful to have the stations overlaid in blue, as this gives a visual indication of how far away the home might be. For example, on the right of the graph at ~73.1W, the green counties that meet the price and demographic criteria generally do not have a train station very close to them. On the other hand, the smaller green counties closer to/inside of the panhandle have many stations very close to them.
Vaguely, there are three “clumps” of counties that stand out to me—the clump on the east, the clump closer to the panhandle, and the clump in the north. The panhandle counties in particular look like they will be of great interest given that they meet the criteria and they have many train stations that connect directly to Grand Central and will not require a transfer of trains.
Shiny
3. Basic Shiny App
Do the following to create a basic shiny app and publish it to shinyapps.io
- Configure R Studio to communicate with shinyapps.io by following the directions from this page https://shiny.rstudio.com/articles/shinyapps.html up to and including the “Method 1” section. You will show you how to install
rsconnect, create a shinyapps.io account, and set up communication between R Studio and shinyapps.io. Note that you will only have to do this once. - Create a new shiny app like we did in class by clicking File, New File, Shiny Web App. Name the app
pset5app1, choose Single File, choose an appropriate directory, and click Create. Anapp.Rfile will open in R Studio. - Click Run App to see the app. The app should appear in the Viewer tab in R Studio, or in its own window.
- In the upper right of the app, click Publish. Make sure the correct files and account are selected. Click Publish in the lower right.
library(rsconnect)Paste the url for your app here. The url will be of the form https://yourusername.shinyapps.io/pset5app1/
4. Shiny App for choosing home location
Create a Shiny app that contains visualization you made in #1, and add at least one widget that allow the user to investigate the data by customizing the visualization in some useful way. For example, you could add a sliderInput or selectInput that helps you filter the data, or a selectInput that allows the user to choose how to color the points, etc.
Different types of user inputs can be found here https://shiny.rstudio.com/gallery/widget-gallery.html
Publish your shiny app to shinyapps.io, and paste the url for your app here:
https://mcconnell361.shinyapps.io/pset5_analysis/
Summarize in a sentence why you choose those user inputs and what you learned about the data from the app.
I chose the price slider to allow the user to quickly visualize what areas of the state might be more or less expensive to live in. While not a widget, I also wanted to be able to hover over and see the data for each of the tracts that fit the criteria as well.
CT Property data
First let’s load the CT property and keep only the New Haven properties.
d = readRDS('data/coast.properties.rds')
d = st_drop_geometry(d)
## I'm gonna change some column names
colnames(d) = gsub('_', '.', colnames(d))
colnames(d)[colnames(d) == 'Assessed.Total'] = 'value'
colnames(d)[colnames(d) == 'Number.of.Bedroom'] = 'beds'
colnames(d)[colnames(d) == 'Number.of.Baths'] = 'baths'
colnames(d)[colnames(d) == 'Number.of.Half.Baths'] = 'half.baths'
colnames(d)[colnames(d) == 'Living.Area'] = 'living'
colnames(d)[colnames(d) == 'SHAPE.Area'] = 'land'
colnames(d)[colnames(d) == 'Condition.Description'] = 'cond'
colnames(d)[colnames(d) == 'State.Use.Description'] = 'use'
colnames(d)[colnames(d) == 'GlobalID'] = 'id'
colnames(d)[colnames(d) == 'Mailing.Address'] = 'address'
colnames(d)[colnames(d) == 'Mailing.City'] = 'city'
colnames(d)[colnames(d) == 'Town.Name'] = 'town'
colnames(d)[colnames(d) == 'Valuation.Year'] = 'val.year'
colnames(d)[colnames(d) == 'Sale.Price'] = 'sale.price'
colnames(d)[colnames(d) == 'Sale.Date'] = 'sale.date'
colnames(d)[colnames(d) == 'AYB'] = 'ayb'
colnames(d)[colnames(d) == 'EYB'] = 'eyb'
d = d %>%
filter(value != 0,
grepl('Single Fam|SINGLE FAM|One Fam|ONE FAM',
use),
Qualified == 'Q',
sale.price != 0,
!is.na(living),
living != 0,
Condition %in% c( 'E', 'EX', ## excellent
'VG', 'G', 'GD', ## very good, good
'A+', 'A', 'AV', 'A-', ## avg+, avg, avg-
'F', 'FR', ## fair
'P', 'VP' ), ## poor, v poor
!duplicated(id)) %>%
mutate(cond = case_when(cond == 'EX' ~ 'Excellent',
cond == 'Avarage' ~ 'Average',
cond == 'AV' ~ 'Average',
cond == 'F' ~ 'Fair',
cond == 'FR' ~ 'Fair',
cond == 'GD' ~ 'Good',
cond == 'G+' ~ 'Very Good',
TRUE ~ cond),
dist = as.numeric(dist),
land = ifelse(land < 0, -land, land)) %>%
select(value, living, land,
beds, baths, half.baths,
cond, use, dist, ayb, eyb,
sale.price, sale.date,
address, city, town, id,
centroid.x, centroid.y,
point.x, point.y)
d = d %>%
filter(town == 'NEW HAVEN')5. Data exploration with ggplotly and plotly
Recall we noticed what looks like two distinct clouds of points in this visualization.
g = ggplot(d,
aes(x = log(living),
y = log(value))) +
geom_point()
gUse ggplotly or plotly to make a interactive scatter plot of log(value) vs living. Add color, tooltip, or other features that help you investigate why there appears to be two distinct clouds of points. What characteristics do the properties in the upper right cluster of points have in common?
library(ggplot2)
library(plotly)
yale_x <- -72.9277
yale_y <- 41.3144
p <- ggplot(d, aes(x = living, y = log(value),
color = use,
text = paste("Town:", town,
"<br>Address:", address,
"<br>Beds:", beds,
"<br>Baths:", baths,
"<br>Condition:", cond,
"<br>Value:", ayb))) +
geom_point() +
labs(x = "Living Area",
y = "Log(Value)",
title = "log(Value) vs Living Area") +
theme_minimal()
ggplotly(p, tooltip = "text")Based on this visualization, it appears that the higher “cloud” of points seem to generally be of better condition than the lower one. From using tooltip and hovering over the points, it seems that the points in the higher cloud generally have conditions of Good, Very Good, and Excellent, while the points in the lower cloud generally have conditions of Average, Fair, and Poor.
6. Data exploration with leaflet
Use leaflet to make an interactive map of the properties. Add color, tooltips, or other features that helps you further investigate why there are two clouds of points. Does this map confirm your conclusions from the previous question?
library(leaflet)
library(viridis)
pal <- colorFactor(palette = viridis(length(unique(d$cond))), domain = d$cond)
leaflet(data = d) %>%
addTiles() %>%
addCircleMarkers(
lng = ~centroid.x,
lat = ~centroid.y,
radius = 5,
popup = ~paste0(address, "<br>",
"Value: $", value),
color = ~pal(cond),
stroke = FALSE,
fillOpacity = 0.7,
) %>%
addLegend("bottomright",
pal = pal,
values = ~cond,
title = "Property Condition",
opacity = 1)Yes, this map helps to confirm my answer to the last question. After plotting by condition and allowing for the user to read address and value information for each point, it appears that there are certain streets/areas that the higher condition houses are concentrated in; for example, Lexington Avenue has many dark purple dots.