Ciencia de Datos de Ciudades II

TAREA 1: GEOPROCESAMIENTO

Paso I y II: seleccionamos una ciudad Boston con un dataset con registros geo-referenciados con sus coordenadas: Bibliotecas Públicas.

Cargamos las librerias que necesitamos para trabajar.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(sf)

## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0

Ahora vamos a leer los dataset seleccionados:

Bibliotecas Públicas agregando el comando stringsAsFactors para las VARIABLES CATEGORCIAS y encoding que se utiliza para corregir ñ y tildes, pero que sirve para corregir otros detalles que aparecen en nuestro dataset*

Bibliotecas_Publicas <- read.csv("Public_Libraries.csv",
                            stringsAsFactors = FALSE,
                            encoding = "UTF-8")

head(Bibliotecas_Publicas)

##          X       Y OBJECTID_1 OBJECTID LIBRARIES_      DISTRICT
## 1 773730.9 2957037          1        5          5              
## 2 772255.2 2930088          2       11         11              
## 3 758459.3 2918939          3       17         17     HYDE PARK
## 4 761341.6 2942108          4       18         18 JAMAICA PLAIN
## 5 768696.7 2945002          5       22         22       ROXBURY
## 6 770005.9 2952641          6        1          1 COPLEY SQUARE
##          ST_ADDRESS        BRANCH ZIPCODE
## 1  151 Cambridge St      West End    2114
## 2 690 Washington St Codman Square    2124
## 3     35 Harvard Av     Hyde Park    2136
## 4     433 Centre St      Connolly    2130
## 5     149 Dudley St        Dudley    2119
## 6   700 Boylston St       Central    2116

Ahora vamos a modificar los nombres de las columnas X y Y:

colnames(Bibliotecas_Publicas)[colnames(Bibliotecas_Publicas) %in% c('Y')] <-
  paste0(c('latitude'))

head(Bibliotecas_Publicas)

##          X latitude OBJECTID_1 OBJECTID LIBRARIES_      DISTRICT
## 1 773730.9  2957037          1        5          5              
## 2 772255.2  2930088          2       11         11              
## 3 758459.3  2918939          3       17         17     HYDE PARK
## 4 761341.6  2942108          4       18         18 JAMAICA PLAIN
## 5 768696.7  2945002          5       22         22       ROXBURY
## 6 770005.9  2952641          6        1          1 COPLEY SQUARE
##          ST_ADDRESS        BRANCH ZIPCODE
## 1  151 Cambridge St      West End    2114
## 2 690 Washington St Codman Square    2124
## 3     35 Harvard Av     Hyde Park    2136
## 4     433 Centre St      Connolly    2130
## 5     149 Dudley St        Dudley    2119
## 6   700 Boylston St       Central    2116

colnames(Bibliotecas_Publicas)[colnames(Bibliotecas_Publicas) %in% c('X')] <-
  paste0(c('longitude'))

head(Bibliotecas_Publicas)

##   longitude latitude OBJECTID_1 OBJECTID LIBRARIES_      DISTRICT
## 1  773730.9  2957037          1        5          5              
## 2  772255.2  2930088          2       11         11              
## 3  758459.3  2918939          3       17         17     HYDE PARK
## 4  761341.6  2942108          4       18         18 JAMAICA PLAIN
## 5  768696.7  2945002          5       22         22       ROXBURY
## 6  770005.9  2952641          6        1          1 COPLEY SQUARE
##          ST_ADDRESS        BRANCH ZIPCODE
## 1  151 Cambridge St      West End    2114
## 2 690 Washington St Codman Square    2124
## 3     35 Harvard Av     Hyde Park    2136
## 4     433 Centre St      Connolly    2130
## 5     149 Dudley St        Dudley    2119
## 6   700 Boylston St       Central    2116

Filtramos los datos que nos interesan de nuestro dataset y que quede más ordenado:

Bibliotecas_Publicas_Ordenadas <- Bibliotecas_Publicas %>% 
  select(longitude, latitude, LIBRARIES_, DISTRICT, ST_ADDRESS)

head(Bibliotecas_Publicas_Ordenadas)

##   longitude latitude LIBRARIES_      DISTRICT        ST_ADDRESS
## 1  773730.9  2957037          5                151 Cambridge St
## 2  772255.2  2930088         11               690 Washington St
## 3  758459.3  2918939         17     HYDE PARK     35 Harvard Av
## 4  761341.6  2942108         18 JAMAICA PLAIN     433 Centre St
## 5  768696.7  2945002         22       ROXBURY     149 Dudley St
## 6  770005.9  2952641          1 COPLEY SQUARE   700 Boylston St

Ahora leemos el dataset de Barrios de Boston, utilizando el comando st_read ya que es un elemento geográfico.

Barrios_Boston <- st_read("Boston_Neighborhoods.geojson")

## Reading layer `Boston_Neighborhoods' from data source `/Users/ani/Desktop/UTDT - materias/04 - 0118 - MU119 Ciencia De Datos Para Ciudades II/Boston/Ciencia de Datos de Ciudades II/Boston_Neighborhoods.geojson' using driver `GeoJSON'
## Simple feature collection with 26 features and 7 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 739715.3 ymin: 2908294 xmax: 812255.1 ymax: 2970086
## CRS:            4326

head(Barrios_Boston)

## Simple feature collection with 6 features and 7 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 751472.1 ymin: 2922750 xmax: 776215.4 ymax: 2953742
## CRS:            4326
##   OBJECTID             Name      Acres Neighborhood_ID SqMiles ShapeSTArea
## 1       27       Roslindale 1605.56824              15    2.51  69938272.9
## 2       28    Jamaica Plain 2519.24539              11    3.94 109737890.8
## 3       29     Mission Hill  350.85356              13    0.55  15283120.0
## 4       30         Longwood  188.61195              28    0.29   8215903.5
## 5       31      Bay Village   26.53984              33    0.04   1156070.8
## 6       32 Leather District   15.63991              27    0.02    681271.7
##   ShapeSTLength                       geometry
## 1     53563.913 MULTIPOLYGON (((757409.1 29...
## 2     56349.937 MULTIPOLYGON (((762983.8 29...
## 3     17918.724 MULTIPOLYGON (((766903.6 29...
## 4     11908.757 MULTIPOLYGON (((764826.9 29...
## 5      4650.635 MULTIPOLYGON (((773315.7 29...
## 6      3237.141 MULTIPOLYGON (((775544.1 29...

Vamos a explorar el dataset Barrios_Boston, utilizando el comando plot, para tener una rápida observación de las 7 variables del dataset:

plot(Barrios_Boston)

## Warning in st_is_longlat(x): bounding box has potentially an invalid value range
## for longlat data

## Warning in st_is_longlat(x): bounding box has potentially an invalid value range
## for longlat data

## Warning in st_is_longlat(x): bounding box has potentially an invalid value range
## for longlat data

## Warning in st_is_longlat(x): bounding box has potentially an invalid value range
## for longlat data

## Warning in st_is_longlat(x): bounding box has potentially an invalid value range
## for longlat data

## Warning in st_is_longlat(x): bounding box has potentially an invalid value range
## for longlat data

## Warning in st_is_longlat(x): bounding box has potentially an invalid value range
## for longlat data

Ahora cruzamos nuestros datos de Barrios y Bibliotecas Públicas en un mapa a ver que observamos:

ggplot() +
  geom_sf(data = Barrios_Boston) +
  geom_point(data = Bibliotecas_Publicas_Ordenadas, 
             aes(x=longitude, y=latitude),
             color = "red")

## Warning in st_is_longlat(x): bounding box has potentially an invalid value range
## for longlat data

Observamos las bibliotecas asignadas a los distintos barrios.

Para continuar debemos convertir la tabla de Bibliotecas Públicas en un dataset espacial, utilizando el comando st_as_sf() que transforma un dataframe común en uno espacial. Utilizar crs es obligatoria y refiere al sistema de coordenadas, como bajamos los datos de internet debemos igualarlo a 4326. Como todas nuestras bibliotecas estan referenciados con latitud y longitud no debemos filtrar nungún caso.

Bibliotecas_Publicas_Ordenadas_Datos_Espaciales <- Bibliotecas_Publicas_Ordenadas %>% 
  st_as_sf(coords = c("longitude","latitude"), crs = 4326)

head(Bibliotecas_Publicas_Ordenadas_Datos_Espaciales)

## Simple feature collection with 6 features and 3 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 758459.3 ymin: 2918939 xmax: 773730.9 ymax: 2957037
## CRS:            EPSG:4326
##   LIBRARIES_      DISTRICT        ST_ADDRESS                 geometry
## 1          5                151 Cambridge St POINT (773730.9 2957037)
## 2         11               690 Washington St POINT (772255.2 2930088)
## 3         17     HYDE PARK     35 Harvard Av POINT (758459.3 2918939)
## 4         18 JAMAICA PLAIN     433 Centre St POINT (761341.6 2942108)
## 5         22       ROXBURY     149 Dudley St POINT (768696.7 2945002)
## 6          1 COPLEY SQUARE   700 Boylston St POINT (770005.9 2952641)

Paso III: Ahora mediante un join espacial, utilizando el comando st_join(). A cada biblioteca le asignaremos el barrio que le corresponda.

Bibliotecas_Publicas_Ordenadas_Datos_Espaciales_con_Barrios <- st_join(Bibliotecas_Publicas_Ordenadas_Datos_Espaciales, Barrios_Boston)

## Warning in st_is_longlat(x): bounding box has potentially an invalid value range
## for longlat data

## although coordinates are longitude/latitude, st_intersects assumes that they are planar

## Warning in st_is_longlat(x): bounding box has potentially an invalid value range
## for longlat data

## although coordinates are longitude/latitude, st_intersects assumes that they are planar

head(Bibliotecas_Publicas_Ordenadas_Datos_Espaciales_con_Barrios)

## Simple feature collection with 6 features and 10 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 758459.3 ymin: 2918939 xmax: 773730.9 ymax: 2957037
## CRS:            EPSG:4326
##   LIBRARIES_      DISTRICT        ST_ADDRESS OBJECTID          Name     Acres
## 1          5                151 Cambridge St       40      West End  190.4907
## 2         11               690 Washington St       48    Dorchester 4662.8795
## 3         17     HYDE PARK     35 Harvard Av       46     Hyde Park 2927.2212
## 4         18 JAMAICA PLAIN     433 Centre St       28 Jamaica Plain 2519.2454
## 5         22       ROXBURY     149 Dudley St       35       Roxbury 2108.4691
## 6          1 COPLEY SQUARE   700 Boylston St       37      Back Bay  399.3144
##   Neighborhood_ID SqMiles ShapeSTArea ShapeSTLength                 geometry
## 1              31    0.30     8297743      17728.59 POINT (773730.9 2957037)
## 2               6    7.29   203114217     104344.03 POINT (772255.2 2930088)
## 3              10    4.57   127509243      66861.24 POINT (758459.3 2918939)
## 4              11    3.94   109737891      56349.94 POINT (761341.6 2942108)
## 5              16    3.29    91844546      49488.80 POINT (768696.7 2945002)
## 6               2    0.62    17394066      19455.67 POINT (770005.9 2952641)

Con este nuevo dataset, estamos en condiciones de saber, cuantas Bibliotecas Públias hay por Barrio = Neighborhhod. Asimismo, podremos graficar mapas de valores máximos, mínimos, promedios, entre otros, de la cantidad de Bibliotecas que haya por Barrio.

Paso IV: vamos a graficar

Un gráfico de puntos para visualizar la cantidad de bilioteas por barrio. Primero debemos agrupar los bibliotecas por barrio = Neighborhood_ID.

Total_Bibliotecas_Publicas_Ordenadas_Datos_Espaciales_con_Barrios <- Bibliotecas_Publicas_Ordenadas_Datos_Espaciales_con_Barrios %>% 
  group_by(Neighborhood_ID) %>% 
  summarise(total=n())

Total_Bibliotecas_Publicas_Ordenadas_Datos_Espaciales_con_Barrios

## Simple feature collection with 18 features and 2 fields
## geometry type:  GEOMETRY
## dimension:      XY
## bbox:           xmin: 745942.7 ymin: 2918939 xmax: 783543.5 ymax: 2963037
## CRS:            EPSG:4326

## Warning in st_is_longlat(x): bounding box has potentially an invalid value range
## for longlat data

## # A tibble: 18 x 3
##    Neighborhood_ID total                                                geometry
##  * <fct>           <int>                                          <GEOMETRY [°]>
##  1 10                  1                                POINT (758459.3 2918939)
##  2 11                  2     MULTIPOINT ((760339.2 2937726), (761341.6 2942108))
##  3 12                  2     MULTIPOINT ((766273.3 2926417), (772946.3 2925002))
##  4 13                  1                                POINT (764571.8 2946429)
##  5 14                  1                                POINT (776423.5 2958010)
##  6 15                  1                                POINT (756649.3 2929333)
##  7 16                  2     MULTIPOINT ((765535.7 2939727), (768696.7 2945002))
##  8 17                  1                                  POINT (780886 2947741)
##  9 19                  1                                POINT (748860.9 2928480)
## 10 2                   1                                POINT (770005.9 2952641)
## 11 24                  1                                POINT (756679.2 2956470)
## 12 25                  2       MULTIPOINT ((745942.7 2953223), (749995 2951913))
## 13 31                  1                                POINT (773730.9 2957037)
## 14 32                  1                                POINT (770517.2 2949691)
## 15 4                   1                                POINT (773857.7 2962274)
## 16 6                   5 MULTIPOINT ((769451.7 2937534), (772255.2 2930088), (7…
## 17 7                   1                                POINT (774275.2 2953649)
## 18 8                   1                                POINT (783543.5 2963037)

Mapeamos las biliotecas por barrio distinguiendo por color:

ggplot() +
    geom_sf(data = Barrios_Boston) +
    geom_sf(data = Total_Bibliotecas_Publicas_Ordenadas_Datos_Espaciales_con_Barrios, aes(color = Neighborhood_ID), size=1) +
  theme(legend.position = "bottom")

## Warning in st_is_longlat(x): bounding box has potentially an invalid value range
## for longlat data

Ahora realizamos el grafico de puntos:

ggplot(Total_Bibliotecas_Publicas_Ordenadas_Datos_Espaciales_con_Barrios) +
  geom_point(aes(x=Neighborhood_ID, y=total), color="red") + 
  labs(title="Bibliotecas por Barrio",
       subtitle="Boston",
       caption="Fuente:https://data.boston.gov/dataset/public-libraries/resource/ed710c4a-689a-47af-9dd0-6e4215003c24",
       x="Barrio ID",
       y="Total")

Se observa que en los barrios prevalecen 1 Biblioteca pública por Barrio aunque en 4 Barrios se destacan 2 Bibliotecas. Es llamativo en el Barrio ID: 6 la concentración de 5 Bibliotecas públicas, habrá que estudiar a que se debe esto.

Ahora vamos a pintar cada barrio = Neighborhood_ID con la cantidad máxima de Bibliotecas Públicas, utilizando un gráfico de coropletas.

Para ello utilizamos al dataset Total_Bibliotecas_Publicas_Ordenadas_con_Barrios y le quitamos la geometría. Luego lo unimos al dataset Barrios_Boston:

Total_Bibliotecas_Publicas_Ordenadas_Datos_Espaciales_con_Barrios <- Total_Bibliotecas_Publicas_Ordenadas_Datos_Espaciales_con_Barrios %>% 
  st_set_geometry(NULL)

head(Total_Bibliotecas_Publicas_Ordenadas_Datos_Espaciales_con_Barrios)

## # A tibble: 6 x 2
##   Neighborhood_ID total
##   <fct>           <int>
## 1 10                  1
## 2 11                  2
## 3 12                  2
## 4 13                  1
## 5 14                  1
## 6 15                  1

Modificamos el nombre de la columa Total = Total Bibliotecas Publicas

colnames(Total_Bibliotecas_Publicas_Ordenadas_Datos_Espaciales_con_Barrios)[colnames(Total_Bibliotecas_Publicas_Ordenadas_Datos_Espaciales_con_Barrios) %in% c('total')] <- paste0(c('Total_Bibliotecas_Publicas'))

head(Total_Bibliotecas_Publicas_Ordenadas_Datos_Espaciales_con_Barrios)

## # A tibble: 6 x 2
##   Neighborhood_ID Total_Bibliotecas_Publicas
##   <fct>                                <int>
## 1 10                                       1
## 2 11                                       2
## 3 12                                       2
## 4 13                                       1
## 5 14                                       1
## 6 15                                       1

Mediante el comando left_join() unificamos el datasat Total_Bibliotecas_Publicas_Ordenadas_con_Barrios con Barrios_Boston:

Bibliotecas_Barrios_Total <- Barrios_Boston %>% 
  left_join(Total_Bibliotecas_Publicas_Ordenadas_Datos_Espaciales_con_Barrios, by="Neighborhood_ID")

head(Bibliotecas_Barrios_Total)

## Simple feature collection with 6 features and 8 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 751472.1 ymin: 2922750 xmax: 776215.4 ymax: 2953742
## CRS:            4326
##   OBJECTID             Name      Acres Neighborhood_ID SqMiles ShapeSTArea
## 1       27       Roslindale 1605.56824              15    2.51  69938272.9
## 2       28    Jamaica Plain 2519.24539              11    3.94 109737890.8
## 3       29     Mission Hill  350.85356              13    0.55  15283120.0
## 4       30         Longwood  188.61195              28    0.29   8215903.5
## 5       31      Bay Village   26.53984              33    0.04   1156070.8
## 6       32 Leather District   15.63991              27    0.02    681271.7
##   ShapeSTLength Total_Bibliotecas_Publicas                       geometry
## 1     53563.913                          1 MULTIPOLYGON (((757409.1 29...
## 2     56349.937                          2 MULTIPOLYGON (((762983.8 29...
## 3     17918.724                          1 MULTIPOLYGON (((766903.6 29...
## 4     11908.757                         NA MULTIPOLYGON (((764826.9 29...
## 5      4650.635                         NA MULTIPOLYGON (((773315.7 29...
## 6      3237.141                         NA MULTIPOLYGON (((775544.1 29...

names(Bibliotecas_Barrios_Total)

## [1] "OBJECTID"                   "Name"                      
## [3] "Acres"                      "Neighborhood_ID"           
## [5] "SqMiles"                    "ShapeSTArea"               
## [7] "ShapeSTLength"              "Total_Bibliotecas_Publicas"
## [9] "geometry"

Ahora mapeamos:

ggplot() +
  geom_sf(data = Bibliotecas_Barrios_Total, aes(fill = Total_Bibliotecas_Publicas)) + 
  geom_sf_text(data = Bibliotecas_Barrios_Total, aes(label = Name), size=2, colour="black") +
  labs(title = "Total Bibliotecas por Barrio",
         subtitle = "Boston",
         fill = "Total Bibliotecas",
         caption= "Fuente: https://data.boston.gov/dataset/boston-neighborhoods y https://data.boston.gov/dataset/public-libraries/resource/ed710c4a-689a-47af-9dd0-6e4215003c24",
         y="",
         x="") +
  scale_fill_gradient(low="yellow", high="red")

## Warning in st_is_longlat(x): bounding box has potentially an invalid value range
## for longlat data

## Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not
## give correct results for longitude/latitude data

## Warning: Computation failed in `stat_sf_coordinates()`:
## Evaluation error: TopologyException: Input geom 1 is invalid: Ring Self-intersection at or near point 757457.09557977319 2924489.7073805183 at 757457.09557977319 2924489.7073805183.

## Warning in st_is_longlat(x): bounding box has potentially an invalid value range
## for longlat data

Se observa como cambia el gradiente por barrio según la cantidad de bibliotecas que hay en cada uno de ellos. Los barrios en color gris refiere que no hay presencia de biblioteas públicas.

Ciencia de Datos de Ciudades II

Ana Fehrmann

25/4/2020

TAREA 1: GEOPROCESAMIENTO